the original website can be found at
I used Firebug to retrieve the XPath address of the desired paragraph
(excerpted above). When I put it in doc.search it doesn't retrieve
anything, nothing at all.
Does anyone know why I can't? I'm banging my head against the wall.
On Fri, Mar 28, 2008 at 2:11 AM, Adam A. [email protected]
Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
but I can't get the descriptions. In fact the descriptions aren't
Wow! It looks nice, but the html is really ugly. This would be
pretty hard to scrape on a regular basis. For artists, there are a
mix of tags, tags,
and I noticed one artist with no surrounding tags at all (Ex-press
It can be really hard to work with inconsistent html, but I suppose it
could be done to some degree of accuracy. Any hpricot masters out
there? I’m sure you’d have to attack with regexps as well. Maybe
turning into text and then parsing is a better idea after all.
Thanks Tod for the reply. Yes, even I thought that it was badly designed,
and I don't have any web design experience at all. In fact I learned the
basics of HTML, XML and XPath just for this.
Although those inconsistencies will prove to be a problem in the future,
the one I'm having right now is getting any information at all. Surely
when I pass the XPath address for the paragraph element which contains
all the artists' names and event descriptions it should return something
rather than nothing. Is that right? Every time I try to print the
result of the search to the screen it just comes up blank. Does anyone know
Firebug inserts tbody steps into XPaths that reach into tables even if the
tbody tag is not in the HTML source. Try removing the tbody steps from the
path, and debug using shorter XPaths to initially address content further up in
the hierarchy.
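As a rough illustration of the tbody advice, the helper below strips tbody steps from an XPath copied out of Firebug before it is handed to doc.search. The method name strip_tbody and the sample XPath are made up for this sketch; they are not taken from the page being scraped.

```ruby
# Remove every "tbody" step (with or without a positional index) from an
# XPath string. Firebug reports these steps because Firefox adds tbody
# elements to its DOM even when the HTML source has none.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Hypothetical XPath of the kind Firebug produces.
firebug_xpath = '/html/body/table/tbody/tr[2]/td/table/tbody[1]/tr/td/p'

puts strip_tbody(firebug_xpath)
# prints /html/body/table/tr[2]/td/table/tr/td/p
```

If the cleaned path still returns nothing, shortening it one step at a time from the right shows which step stops matching.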
You might have some success addressing text nodes combined with some
subsequent regexp processing:
b = doc.search("//text()")
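A sketch of the regexp step, assuming the text nodes have already been converted to plain strings. The sample strings are modelled on the listing excerpt quoted earlier in the thread; the method name parse_events, the pattern, and the field names are illustrative guesses about the page's layout, not a tested recipe for the real site.

```ruby
# Pick out strings that look like "<dates>, <times>, ¥<price>. <venue>."
# and turn each one into a hash. Strings that do not match are skipped.
def parse_events(texts)
  event_line = /\A(?<dates>[A-Z][a-z]{2} [\d &]+), (?<times>[\d:& ]+[ap]m), ¥(?<price>[\d,]+)\. (?<venue>.+)\.\z/

  texts.map(&:strip).map { |text| event_line.match(text) }.compact.map do |m|
    {
      dates: m[:dates],
      times: m[:times],
      price: m[:price].delete(',').to_i, # "3,150" becomes 3150
      venue: m[:venue]
    }
  end
end

# Hypothetical stand-ins for the strings pulled out of the text nodes.
texts = [
  "\n  ",
  "Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.",
  "Some description without a price."
]

parse_events(texts).each { |event| p event }
```

Whitespace-only nodes and free-form descriptions fall through the pattern, so they would need their own handling, for instance by attaching each unmatched string to the event that precedes it.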
I think you might be more successful using a css selector instead of an
xpath selector. To overcome hpricot not supporting all xpath axes you
can sometimes find a way to address the elements with a clever css
It can be a challenge to use hpricot with malformed html or if there are
no containers wrapping items that otherwise appear visually as a list or
table. I haven’t tried it yet, but running the html through something
like tidy before parsing might create some of the missing structure.
OK, I have tried taking out the tbody tags completely and got some of the
text back. I'll experiment to see if I can get all of it.
I installed the gem and I got the example code:
require 'tidy'
Tidy.path = '/usr/lib/libtidy.so'
html = '<html><title>title</title>Body</html>'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  xml = tidy.clean(html)
end
Now I have to change the path to wherever the lib is. Well, I found
tidy's folder in my lib directory and changed the above to this:
Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib'
and it's complaining, saying no such file… I tried
as that's the proper extension of the tidylib file, but again it won't
I can't find any tidylib file with an extension .so
Banging my head even more now.
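One way to narrow down a "no such file" error like this is to check which candidate path actually exists before assigning it to Tidy.path. The sketch below uses only the Ruby standard library; the method name first_existing_path and both candidate paths are hypothetical placeholders, not known locations of the tidy library.

```ruby
# Return the first path in +candidates+ that is an existing file, or nil.
def first_existing_path(candidates)
  candidates.find { |path| File.file?(path) }
end

# Hypothetical locations; substitute wherever the tidy library really lives.
candidates = [
  '/usr/lib/libtidy.so',
  'C:/ruby/lib/tidy.dll'
]

lib_path = first_existing_path(candidates)
if lib_path
  puts "Found a tidy library at #{lib_path}"
  # Tidy.path = lib_path   # assign once the tidy gem has been required
else
  puts "None of these exist: #{candidates.join(', ')}"
end
```

Forward slashes work in Ruby file paths on Windows too, which avoids any doubt about backslash escaping in the string.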
Just downloaded a DLL which I needed. Why doesn't that come with the
On Fri, Mar 28, 2008 at 11:42 AM, Dan D. [email protected]
Firebug puts in tbody’s into xpath’s that reach into tables even if the tag is not in the html source. Try removing the tbody path and debug using shorter xpaths to initially address content further up in the hierarchy.
Yes, Firefox does it to make it more (X)HTML-compliant. It took me a
while to get the hang of it. You might download the page using
open-uri and open it with your favourite editor, search the text and
work your way up through the tags.
Most sites don’t use tbody, so just try it without it.
On Mar 28, 6:11 pm, Adam A. [email protected] wrote:
individually wrapped up in any tags but rather just clumped together
Once you have the ‘name’ node you can use next_node to get the next
elements in the document
This method should work for your example:
names = hpricot_doc.search("//span[@class='textbold']")
names.each do |name|
node = name.next_node
node = node.next_node until node.text? and node.inner_text =~ /\w