Can I get a little help with my program? (string searching and regex)

unknown · January 8, 2009, 7:15pm

So here’s my issue: I’m trying to figure out a way that’s not insanely
roundabout to accomplish the following.

I am ripping book information off of a website. I was able to do this
quite easily, but I’m having problems when the site returns more than
one book. I need a way to say:

for each regex on the page
store info into next unused excel line

(I will be doing separate searches for each piece of info (author,
ISBN, etc.) because of the way the HTML is set up.)
Note: I am using Watir, but I believe the issues I’m having are core
Ruby issues.

Book title 1
Another book title

Notice the slight difference in the second ctl0#: depending on the
number of books on the page, the second number just iterates. I have
yet to see a search return 10+ books, but I would imagine the leading
0 would increment in that case. I’m not positive, though.

Then the corresponding author is:

author 1</span>
author 2</span>

with the ctl0# matching the titles.
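
The id pattern described above can be tried out in plain Ruby, without
Watir. What follows is only a sketch: the sample ids are made up from
the pattern used in the script further down, `ROW_ID` is a name chosen
for illustration, and `\d+` is used instead of `\d\d` on the
assumption that the counter might not always be exactly two digits.

```ruby
# Hypothetical pattern: captures the per-book counter and the field name.
ROW_ID = /rptCourses_ctl00_rptItems_ctl(\d+)_lblItemTxt(Title|Author)\z/

# Made-up sample ids that follow the pattern described in the post.
sample_ids = [
  "rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle",
  "rptCourses_ctl00_rptItems_ctl00_lblItemTxtAuthor",
  "rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle",
  "rptCourses_ctl00_rptItems_ctl10_lblItemTxtTitle",
  "somethingElse_ctl05_lblOther"
]

# Turn each id into [book_number, field]; ids that are not book fields are dropped.
parsed = sample_ids.map do |id|
  m = ROW_ID.match(id)
  m ? [m[1].to_i, m[2]] : nil
end.compact

parsed.each { |row, field| puts "book #{row}: #{field}" }
```

Capturing the counter means a title and an author can be matched up by
number rather than by the order in which they happen to be found.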

HOWEVER, when I am done pulling info from the page and go to the next
page the first book is reset back to ctl00.

This is what I have been using, but it never tests the regex a second
time around, so I never get more than one book’s data per search:

# Do some search stuff based on an Excel list of 4-digit numbers.
# The website will return 0 to many books. (Currently the script
# crashes if 0 books are returned.)
while contLoop do
  colVal = worksheet.Cells(row, 'a').Value
  if (colVal) then
    browser.goto("http://www.website.com/searchterm=" + colVal)
    for i in 1...browser.spans.length
      if (browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text) then
        var = browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text
        worksheet.Cells(row, 'b').value = var
      end
      if (browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtAuthor/).text) then
        var = browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtAuthor/).text
        worksheet.Cells(row, 'c').value = var
      end
    end
  else
    contLoop = false
  end
  row += 1
  sleep 1
end

I’m worried that if it doesn’t find an author or something for a book,
the list will get out of sync. The other way I think it could be done
is to make the number that iterates in the regex a variable and go
through that, but this might cause issues on subsequent pages.
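
One way to picture the "out of sync" concern is to group the fields by
the counter in their id, so a book with no author simply has no author
entry. This is a rough sketch in plain Ruby, not Watir code: the
`spans` array is made-up data standing in for whatever ids and text
get read off the page, and `FIELD_ID` and `books_from` are
hypothetical names.

```ruby
# Hypothetical pattern: captures the per-book counter and the field name.
FIELD_ID = /rptCourses_ctl00_rptItems_ctl(\d+)_lblItemTxt(Title|Author)\z/

# Made-up [id, text] pairs; the second book deliberately has no author span.
spans = [
  ["rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle",  "Book title 1"],
  ["rptCourses_ctl00_rptItems_ctl00_lblItemTxtAuthor", "author 1"],
  ["rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle",  "Another book title"]
]

# Groups fields by book counter and returns one hash per book, in counter order.
def books_from(spans)
  books = Hash.new { |hash, key| hash[key] = {} }
  spans.each do |id, text|
    m = FIELD_ID.match(id)
    next unless m                       # skip spans that are not book fields
    books[m[1].to_i][m[2].downcase.to_sym] = text
  end
  books.sort.map { |_, fields| fields }
end

books = books_from(spans)
books.each_with_index do |book, offset|
  # A missing field comes back as nil instead of shifting later books up.
  puts "row +#{offset}: #{book[:title].inspect} / #{book[:author].inspect}"
end

# A page with no matching spans gives an empty list, so the loop body never runs.
puts books_from([]).length
```

Because the grouping is rebuilt for every page, the counter starting
over on the next page should not matter under this approach.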

That is the main problem. The other problem I have to tackle is making
a cross-reference list for each book found (this is done on a separate
sheet), i.e.

Searchterm | BookID
0001       | 1
0001       | 2
0001       | 3
0002       | 4
0003       | 5
0004       | 1

(BookID is just a simple 1 through however many books, created when
the book is stored into the spreadsheet.)
This would denote that 3 books were found when searching for 0001 and
those are referenced by bookID (1,2,3) and one book each for 0002 and
0003. BookID 1 comes up when searching for both 0001 and 0004 so I
also need to find a way to make sure that another BookID is not made
for the same book when 0004 comes around.

I believe this is most easily done when storing the book, but I
haven’t tried to tackle that yet.
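
The cross-reference idea can be sketched with a hash that hands out
BookIDs. This is only an illustration under assumptions: `BookIndex`
and its methods are hypothetical names, and it assumes each book has
some unique key to look it up by (the ISBN is used here, but that is a
guess). The sample ISBN strings are made up.

```ruby
# Hypothetical helper: assigns BookIDs and collects cross-reference rows.
class BookIndex
  attr_reader :cross_refs

  def initialize
    @ids = {}          # book key (assumed unique, e.g. ISBN) => BookID
    @cross_refs = []   # [search_term, book_id] rows for the second sheet
  end

  # Returns the BookID for book_key, creating a new one only for unseen books,
  # and records the [search_term, book_id] pair once.
  def record(search_term, book_key)
    id = (@ids[book_key] ||= @ids.size + 1)
    pair = [search_term, id]
    @cross_refs << pair unless @cross_refs.include?(pair)
    id
  end
end

index = BookIndex.new
index.record("0001", "isbn-a")
index.record("0001", "isbn-b")
index.record("0001", "isbn-c")
index.record("0002", "isbn-d")
index.record("0003", "isbn-e")
index.record("0004", "isbn-a")   # same book as the first one, so BookID 1 is reused

index.cross_refs.each { |term, id| puts "#{term} | #{id}" }
```

With this sample data the printed rows match the table above, and the
same list could then be written out to the second sheet row by row.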

To sum up my problems:

getting info from more than one book when searching
crashing when no books are found
creating the reference list
not double storing in reference list

Any insight or sample code you can provide would be awesome. I don’t
particularly want to code this, find out it doesn’t work, and have to
recode it 15 times.

Mike