Extract DIV from HTML

Dor_K · May 11, 2006, 5:22pm

Hi,
I need to find a specific DIV in a large HTML file (by its
ID/CLASS) as a routine task.
Perhaps an XML approach using REXML could help, but which function
searches through the whole XML document and finds a specific tag by its
attribute?
Someone suggested using myxml for this, but its installation
hasn’t gone well yet.

c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize': No such file or directory - test.xml (Errno::ENOENT)
        from c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize'
        from example10.rb:6

Help will be greatly appreciated.

Dor_K · May 11, 2006, 5:39pm

Couldn’t you just use a regular expression?
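
For what it's worth, a regular expression can handle the simplest case. The snippet below is only a rough sketch: `div_by_id` is a made-up helper name, the sample markup is invented, and the pattern assumes the wanted DIV has a quoted id and contains no nested DIVs (the non-greedy match stops at the first closing tag).

```ruby
# Hypothetical helper: returns the inner HTML of the first <div> with the
# given id, or nil if none is found. Breaks if that div contains nested divs.
def div_by_id(html, id)
  pattern = /<div\b[^>]*\bid\s*=\s*["']#{Regexp.escape(id)}["'][^>]*>(.*?)<\/div>/mi
  html[pattern, 1]
end

# Invented sample page, just for illustration.
html = <<~HTML
  <html><body>
    <div id="menu">Home | About</div>
    <div id="content" class="main"><p>Hello</p></div>
  </body></html>
HTML

puts div_by_id(html, "content")
```

For anything with nested markup, a real parser is the safer choice.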

Dor_K · May 11, 2006, 6:13pm

You want to use XPath. Check out the XPath section of the following
document:

http://www.germane-software.com/software/rexml/docs/tutorial.html

For your specific case, the XPath query would be something like this:

//div[@class='whatever']

Ryan
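
A minimal sketch of that query with REXML might look like the following. The sample document and the `target` id are invented for illustration, and this assumes the input is well-formed XML (XHTML); REXML will raise a parse error on typical sloppy HTML.

```ruby
require "rexml/document"

# Invented sample document; real pages must be well-formed for REXML.
xhtml = <<~XML
  <html>
    <body>
      <div class="header">Site title</div>
      <div class="whatever" id="target"><p>Wanted text</p></div>
    </body>
  </html>
XML

doc = REXML::Document.new(xhtml)

# Find the first DIV anywhere in the document by its class attribute.
div = REXML::XPath.first(doc, "//div[@class='whatever']")
puts div.attributes["id"]
puts div.elements["p"].text

# The same lookup by id instead of class.
by_id = REXML::XPath.first(doc, "//div[@id='target']")
puts by_id.attributes["class"]
```

`REXML::XPath.each` or `REXML::XPath.match` can be used the same way when more than one DIV may match.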

Dor_K · May 11, 2006, 10:32pm

Ryan L. wrote:

You want to use XPath. Check out the XPath section of the following
document:

http://www.germane-software.com/software/rexml/docs/tutorial.html

For your specific case, the XPath query would be something like this:

//div[@class='whatever']

Depends. If the large HTML file is indeed a LARGE file, then REXML
XPath may be CPU/memory intensive.

If so, the REXML pull or stream parsers work quite nicely for this sort
of thing.
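
As a rough sketch of the stream approach: the listener below collects the text inside the first DIV with a given id without building the whole tree in memory. `DivTextListener` and the sample markup are made up for illustration, and the input is again assumed to be well-formed.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: gathers the text of the first <div> with the wanted id.
class DivTextListener
  include REXML::StreamListener

  attr_reader :text_found

  def initialize(wanted_id)
    @wanted_id = wanted_id
    @depth = 0            # div nesting depth inside the target; 0 means outside
    @done = false         # set once the target div has been closed
    @text_found = +""
  end

  def tag_start(name, attrs)
    return if @done
    if @depth > 0
      @depth += 1 if name == "div"   # track nested divs inside the target
    elsif name == "div" && attrs.to_h["id"] == @wanted_id
      @depth = 1                     # entered the target div
    end
  end

  def tag_end(name)
    return unless @depth > 0 && name == "div"
    @depth -= 1
    @done = true if @depth.zero?     # left the target div
  end

  def text(chunk)
    @text_found << chunk if @depth > 0
  end
end

xhtml = '<html><body><div id="a">skip</div>' \
        '<div id="b">keep <div>nested</div> this</div></body></html>'

listener = DivTextListener.new("b")
REXML::Document.parse_stream(xhtml, listener)
puts listener.text_found
```

The depth counter is there so that a nested DIV's closing tag is not mistaken for the end of the target.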

I’d also suggest WWW::Mechanize.

(I also saw a suggestion for RubyfulSoup, but my experience is that it
bogs down on large files. )

–
James B.

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style - The Journal By & For Rubyists
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

Dor_K · May 11, 2006, 10:00pm

Actually, I would use rubyful_soup for this. It is awesome!

Dor_K · May 12, 2006, 8:35am

Many many thanks to you all,
I got very emotional after seeing all of your kind and helpful
responses.
I am a beginner in this Forum and it’s a pleasure to see such a beautiful
and helpful community. I’ll do my best to pay back my share!

Dor_K · May 12, 2006, 4:07pm

Dor K. wrote:

Many many thanks to you all,
I got very emotional after seeing all of your kind and helpful
responses.
I am a beginner in this Forum and it’s a pleasure to see such a beautiful
and helpful community. I’ll do my best to pay back my share!

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).

–
James B.

Dor_K · May 15, 2006, 12:48pm

James B. wrote:

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).

Hmm! I’ll check that.

Anyway, I am doing very well with Mechanize now, but have one nagging
problem:
I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

Thx,
Dor.

Dor_K · May 15, 2006, 2:33pm

I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

iconv!

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/
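
With the iconv library of that era, the call would have been along the lines of `Iconv.conv("UTF-8", "WINDOWS-1255", str)`. Iconv was removed from the standard library in Ruby 2.0, so the sketch below swaps in the built-in `String#encode` instead; it is not the iconv API itself. `windows1255_to_utf8` is a made-up helper name, and the sample bytes are an invented example.

```ruby
# Hypothetical helper: treat the raw bytes as Windows-1255 and transcode
# them to UTF-8, replacing anything that cannot be converted.
def windows1255_to_utf8(bytes)
  bytes.dup
       .force_encoding("Windows-1255")
       .encode("UTF-8", invalid: :replace, undef: :replace)
end

# Invented sample: the Hebrew word "shalom" as raw Windows-1255 bytes.
raw = "\xF9\xEC\xE5\xED".b

utf8 = windows1255_to_utf8(raw)
puts utf8.encoding
puts utf8
```

The `force_encoding` step only relabels the bytes; the actual conversion happens in `encode`.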

Dor_K · May 15, 2006, 2:39pm

Christoffer S. wrote:

I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

iconv!

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/

WOW! It works amazingly well! I still have a problem, though: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?

Dor_K · May 15, 2006, 8:28pm

On Mon, May 15, 2006 at 08:51:01PM +0900, Dor K. wrote:
[snip]

PLUS - It seems Mechanize cannot get pages from
http://localhost:3000/… (WEBrick). Any suggestions?

What error do you get? All mechanize unit tests are done against
WEBrick without any problems.

–Aaron

Dor_K · May 16, 2006, 12:15pm

Dor K. wrote:

WOW! It works amazingly well! I still have a problem, though: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?

Are you by any chance trying to get Mechanize to read a Rails app’s own
pages from within a request?

Dor_K · May 16, 2006, 5:42pm

Alex Y. wrote:

Dor K. wrote:

WOW! It works amazingly well! I still have a problem, though: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?

Are you by any chance trying to get Mechanize to read a Rails app’s own
pages from within a request?

Exactamondo!

Dor_K · May 15, 2006, 1:50pm

Dor K. wrote:

James B. wrote:

Please note, though, that “this Forum” is really a Web front-end to an
active mailing list (ruby-talk).

Hmm! I’ll check that.

Anyway, I am doing very well with Mechanize now, but have one nagging
problem:
I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

Thx,
Dor.

PLUS - It seems Mechanize cannot get pages from
http://localhost:3000/… (WEBrick). Any suggestions?

Dor_K · May 16, 2006, 6:08pm

Dor K. wrote:

pages from within a request?

Exactamondo!

Thought so… This came up recently on the rails list. Rails is
single-threaded - WEBrick isn’t, but Rails limits itself by default
because (I think) it can’t guarantee that the app’s methods are
threadsafe.

What are you trying to do, exactly? If the Rails app generated the page
in the first place, surely it has the raw information at hand not to
need to scrape the HTML to find it again?

Dor_K · May 18, 2006, 5:25pm

Alex Y. wrote:

Dor K. wrote:

pages from within a request?

Exactamondo!

Thought so… This came up recently on the rails list. Rails is
single-threaded - WEBrick isn’t, but Rails limits itself by default
because (I think) it can’t guarantee that the app’s methods are
threadsafe.

What are you trying to do, exactly? If the Rails app generated the page
in the first place, surely it has the raw information at hand not to
need to scrape the HTML to find it again?

That’s right… but I still want it!