Extract DIV from HTML


#1

Hi,
I need to find a specific DIV (by its ID or CLASS) in a large HTML file,
as a routine task.
Perhaps an XML approach using REXML would help, but which function
searches through the whole document and finds a specific tag by its
attribute?
Someone suggested using myxml for this, but its installation
hasn’t gone well so far:

c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize': No such file or directory - test.xml (Errno::ENOENT)
        from c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize'
        from example10.rb:6

Help will be greatly appreciated.


#2

Couldn’t you just use a regular expression?

http://www.regular-expressions.info/tutorial.html


#3

You want to use XPath. Check out the XPath section of the following
document:

http://www.germane-software.com/software/rexml/docs/tutorial.html

For your specific case, the XPath query would be something like this:

//div[@class='whatever']
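
As a minimal sketch of how that query might be run with REXML: the sample markup and the class name 'whatever' below are made up for illustration, and REXML only accepts well-formed XML, so real-world HTML may need tidying before it parses.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup (not from the original post).
html = <<~HTML
  <html>
    <body>
      <div id="header">Header</div>
      <div class="whatever">Target text</div>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# XPath.first returns the first matching element, or nil if none matches.
div = REXML::XPath.first(doc, "//div[@class='whatever']")
puts div.text if div
```

To match on the ID instead, the same call should work with a query such as //div[@id='header'].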

Ryan


#4

Ryan L. wrote:

You want to use XPath. Check out the XPath section of the following
document:

http://www.germane-software.com/software/rexml/docs/tutorial.html

For your specific case, the XPath query would be something like this:

//div[@class='whatever']

Depends. If the large HTML file is indeed a LARGE file, then REXML
XPath may be CPU/memory intensive.

If so, the REXML pull or stream parsers work quite nicely for this sort
of thing.
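
A rough sketch of the stream-parser approach, assuming well-formed input: the DivCollector class, its captured accessor, and the sample markup are hypothetical names invented for this example, not part of REXML. The listener only keeps the text inside the matching DIV, so the whole tree never has to be held in memory.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text inside the first-level
# <div> whose class attribute equals the requested value.
class DivCollector
  include REXML::StreamListener

  attr_reader :captured

  def initialize(css_class)
    @css_class = css_class
    @depth = 0            # 0 means "not inside the target div"
    @captured = +""
  end

  def tag_start(name, attrs)
    if @depth > 0
      @depth += 1         # nested element inside the target div
    elsif name == 'div' && attrs['class'] == @css_class
      @depth = 1          # entering the target div
    end
  end

  def tag_end(_name)
    @depth -= 1 if @depth > 0
  end

  def text(data)
    @captured << data if @depth > 0
  end
end

# Hypothetical sample markup for illustration.
html = <<~HTML
  <html>
    <body>
      <div id="header">Header</div>
      <div class="whatever">Target <b>text</b></div>
    </body>
  </html>
HTML

collector = DivCollector.new('whatever')
REXML::Document.parse_stream(html, collector)
puts collector.captured
```

For a genuinely large file, an open File object could be passed to parse_stream in place of the string.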

I’d also suggest WWW::Mechanize.

(I also saw a suggestion for RubyfulSoup, but my experience is that it
bogs down on large files.)


James B.

http://www.ruby-doc.org - Ruby Help & Documentation
http://www.artima.com/rubycs/ - The Journal By & For Rubyists
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools


#5

Actually, I would use rubyful_soup for this. It is awesome!


#6

Many many thanks to you all,
I got very emotional after seeing all of your kind and helpful
responses.
I am a beginner in this Forum and it’s a pleasure to see such a beautiful
and helpful community. I’ll do my best to pay back my share!


#7

Dor K. wrote:

Many many thanks to you all,
I got very emotional after seeing all of your kind and helpful
responses.
I am a beginner in this Forum and it’s a pleasure to see such a beautiful
and helpful community. I’ll do my best to pay back my share!

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).


James B.

http://yourelevatorpitch.com - Finding Business Focus


#8

James B. wrote:

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).

Hmm! I’ll check that out.

Anyway, I am doing very well with Mechanize now, but I have a nagging
problem:
I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

Thx,
Dor.


#9

I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

iconv!

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/
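
One caveat for anyone reading this later: iconv was dropped from the standard library in Ruby 2.0, and on current Rubies the built-in String#encode covers the same ground. A minimal sketch, assuming the page body arrives as raw WINDOWS-1255 bytes (the four sample bytes below are made up for illustration and spell the Hebrew word "shalom"):

```ruby
# Hypothetical sample: four raw bytes as they would appear in a
# WINDOWS-1255 page (shin, lamed, vav, final mem).
raw = [0xF9, 0xEC, 0xE5, 0xED].pack("C*")

# Tell Ruby what the bytes actually are, then transcode to UTF-8.
raw.force_encoding("Windows-1255")
utf8 = raw.encode("UTF-8")

puts utf8
puts utf8.encoding
```

force_encoding only relabels the bytes without changing them; encode is the step that actually converts them.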


#10

Christoffer S. wrote:

I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

iconv!

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/

WOW! It works amazingly well! I still have a problem: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?


#11

On Mon, May 15, 2006 at 08:51:01PM +0900, Dor K. wrote:
[snip]

PLUS - It seems mechanize can not get pages from
http://localhost:3000/… (WEBRICK), any suggestions?

What error do you get? All mechanize unit tests are done against
WEBrick without any problems.

–Aaron


#12

Dor K. wrote:

WOW! It works amazingly well! I still have a problem: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?

Are you by any chance trying to get Mechanize to read a Rails app’s own
pages from within a request?


#13

Alex Y. wrote:

Dor K. wrote:

WOW! It works amazingly well! I still have a problem: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?

Are you by any chance trying to get Mechanize to read a Rails app’s own
pages from within a request?

Exactamondo!


#14

Dor K. wrote:

James B. wrote:

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).

Hmm! I’ll check that out.

Anyway, I am doing very well with Mechanize now, but I have a nagging
problem:
I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

Thx,
Dor.

PLUS - It seems mechanize can not get pages from
http://localhost:3000/… (WEBRICK), any suggestions?


#15

Dor K. wrote:

pages from within a request?

Exactamondo!

Thought so… This came up recently on the rails list. Rails is
single-threaded. WEBrick isn’t, but Rails limits itself by default
because (I think) it can’t guarantee that the app’s methods are
threadsafe. So a request that asks the same server for another page
ends up waiting on itself.

What are you trying to do, exactly? If the Rails app generated the page
in the first place, surely it has the raw information at hand not to
need to scrape the HTML to find it again?


#16

Alex Y. wrote:

Dor K. wrote:

pages from within a request?

Exactamondo!

Thought so… This came up recently on the rails list. Rails is
single-threaded - WEBrick isn’t, but Rails limits itself by default
because (I think) it can’t guarantee that the app’s methods are
threadsafe.

What are you trying to do, exactly? If the Rails app generated the page
in the first place, surely it has the raw information at hand not to
need to scrape the HTML to find it again?

That’s right… but I still want it :wink:!