Forum: Ruby extract DIV from HTML

Dor Kalev (Guest)
on 2006-05-11 17:22
Hi,
I need to find a specific DIV in a large HTML file (by its
ID/CLASS) as a routine task.
Perhaps an XML approach using REXML could help, but which function
searches through the whole XML document and finds a specific tag by its
attribute?
Someone suggested using myxml for this, but its installation
has not gone well so far:

>> c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize': No
>> such file or directory - test.xml (Errno::ENOENT)
>> from c:/instantrails/ruby/lib/ruby/site_ruby/myxml/myxml.rb:243:in `initialize'
>> from example10.rb:6

Help will be greatly appreciated.
Oky (Guest)
on 2006-05-11 17:39
Couldn't you just use a regular expression?

http://www.regular-expressions.info/tutorial.html
Ryan Leavengood (Guest)
on 2006-05-11 18:13
(Received via mailing list)
You want to use XPath. Check out the XPath section of the following
document:

http://www.germane-software.com/software/rexml/doc...

For your specific case, the XPath query would be something like this:

//div[@class='whatever']

Ryan
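
A minimal sketch of the XPath approach suggested above, assuming the page is well-formed XHTML (REXML is an XML parser and raises on tag soup). The sample markup and the class name 'whatever' are placeholders for illustration, not taken from the original poster's file.

```ruby
require 'rexml/document'

# Placeholder markup; REXML needs well-formed XML, so every tag is closed.
xhtml = <<~XHTML
  <html>
    <body>
      <div id="header">Header</div>
      <div class="whatever">Wanted text</div>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# First DIV whose class attribute is exactly 'whatever'.
by_class = REXML::XPath.first(doc, "//div[@class='whatever']")

# The same idea works for an id attribute.
by_id = REXML::XPath.first(doc, "//div[@id='header']")

puts by_class.text
puts by_id.text
```

REXML::XPath.each or REXML::XPath.match can be used the same way when more than one DIV may match.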
bpettichord@gmail.com (Guest)
on 2006-05-11 22:00
(Received via mailing list)
Actually, I would use rubyful_soup for this. It is awesome!
James Britt (Guest)
on 2006-05-11 22:32
(Received via mailing list)
Ryan Leavengood wrote:
> You want to use XPath. Check out the XPath section of the following
> document:
>
> http://www.germane-software.com/software/rexml/doc...
>
> For your specific case, the XPath query would be something like this:
>
> //div[@class='whatever']

Depends.  If the large HTML file is indeed a LARGE file, then REXML
XPath may be CPU/memory intensive.

If so, the REXML pull or stream parsers work quite nicely for this sort
of thing.
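
As a rough sketch of the stream-parser idea, assuming well-formed XHTML input: the listener below collects the text inside the first DIV with a given class without building a document tree. The DivFinder class, its fields, and the sample markup are hypothetical names made up for this illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: gathers the text inside the first matching DIV.
class DivFinder
  include REXML::StreamListener

  attr_reader :found

  def initialize(wanted_class)
    @wanted_class = wanted_class
    @depth = 0      # 0 means "outside the wanted DIV"
    @done = false   # true once the wanted DIV has been closed
    @found = +""
  end

  def tag_start(name, attrs)
    if @depth > 0
      @depth += 1   # nested element inside the wanted DIV
    elsif !@done && name == "div" && attrs["class"] == @wanted_class
      @depth = 1    # entering the wanted DIV
    end
  end

  def text(data)
    @found << data if @depth > 0
  end

  def tag_end(_name)
    return if @depth.zero?
    @depth -= 1
    @done = true if @depth.zero?
  end
end

xhtml = <<~XHTML
  <html>
    <body>
      <div class="other">Ignore me</div>
      <div class="whatever">Hello <b>world</b></div>
    </body>
  </html>
XHTML

listener = DivFinder.new("whatever")
REXML::Document.parse_stream(xhtml, listener)
puts listener.found
```

Because only the listener's own state is kept, memory use stays roughly constant however large the input is.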

I'd also suggest WWW::Mechanize.

(I also saw a suggestion for RubyfulSoup, but my experience is that it
bogs down on large files.)


--
James Britt

http://www.ruby-doc.org       - Ruby Help & Documentation
http://www.artima.com/rubycs/ - The Journal By & For Rubyists
http://www.rubystuff.com      - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com     - Playing with Better Toys
http://www.30secondrule.com   - Building Better Tools
Dor Kalev (Guest)
on 2006-05-12 08:35
Many, many thanks to you all.
I got very emotional after seeing all of your kind and helpful
responses.
I am a beginner in this forum and it's a pleasure to see such a beautiful
and helpful community. I'll do my best to pay back my share!
James Britt (Guest)
on 2006-05-12 16:07
(Received via mailing list)
Dor Kalev wrote:
> Many many thanks to you all,
> I got very emotional after seeing all of your kind and helpful
> responses.
> I am a beginner in this forum and it's a pleasure to see such a beautiful
> and helpful community, I'll do my best to pay back my share!

Please note, though, that "this Forum" is really a Web front-end to an
active mailing list (ruby-talk).


Dor Kalev (Guest)
on 2006-05-15 12:48
James Britt wrote:
> Please note, though, that "this Forum" is really a Web front-end to an
> active mailing list (ruby-talk).
>

Hmm, I will check that.

Anyway, I am doing very well with Mechanize now, but I have a nagging
problem:
I want to take the DIV from a page with a certain encoding (WINDOWS-1255)
and put it in a UTF-8 page.
Is there anything that does the needed conversion?

Thx,
Dor.
Dor Kalev (Guest)
on 2006-05-15 13:50

PLUS - It seems Mechanize cannot get pages from
http://localhost:3000/... (WEBrick). Any suggestions?
Christoffer Sawicki (Guest)
on 2006-05-15 14:33
(Received via mailing list)
> I want to take the DIV from a page with certain encoding (WINDOWS-1255)
> and put it in a UTF8 page.
> Is there any thing that does the conversion needed?

iconv!

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/
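
For readers on a current Ruby: Iconv was removed from the standard library in Ruby 2.0, and String#encode does the same job. The sketch below assumes the scraped DIV arrives as raw Windows-1255 bytes; the four sample bytes and the variable names are made up for illustration.

```ruby
# Four sample bytes standing in for text scraped from a Windows-1255 page.
raw = "\xF9\xEC\xE5\xED".b

# Tag the bytes with the encoding they are actually in (no bytes change).
legacy = raw.force_encoding("Windows-1255")

# Transcode to UTF-8 for use in a UTF-8 page.
utf8 = legacy.encode("UTF-8")

puts utf8
puts utf8.encoding
```

If the source page may contain bytes that are not valid Windows-1255, passing invalid: :replace and undef: :replace to encode avoids an exception at the cost of substituting a replacement character.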
Dor Kalev (Guest)
on 2006-05-15 14:39
Christoffer Sawicki wrote:
>> I want to take the DIV from a page with certain encoding (WINDOWS-1255)
>> and put it in a UTF8 page.
>> Is there any thing that does the conversion needed?
>
> iconv!
>
> http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/

WOW! It works amazingly well! I still have a problem: I seem
to be unable to use Mechanize to get information from pages on the same
server (I am using WEBrick). Perhaps WEBrick cannot run two
instances together?
Aaron Patterson (Guest)
on 2006-05-15 20:28
(Received via mailing list)
On Mon, May 15, 2006 at 08:51:01PM +0900, Dor Kalev wrote:
[snip]
>
> PLUS - It seems mechanize can not get pages from
> http://localhost:3000/... (WEBRICK), any suggestions?
>

What error do you get?  All mechanize unit tests are done against
WEBrick without any problems.

--Aaron
Alex Young (Guest)
on 2006-05-16 12:15
(Received via mailing list)
Dor Kalev wrote:
> WOW! It works amazingly well! I still have a problem that i seem
> to be unable to use Mechanize to get information from pages on the same
> server (i am using Webrick), perhaps it is that Webrick can not run two
> instances together?
Are you by any chance trying to get Mechanize to read a Rails app's own
pages from within a request?
Dor Kalev (Guest)
on 2006-05-16 17:42
Alex Young wrote:
> Dor Kalev wrote:
>> WOW! It works amazingly well! I still have a problem that i seem
>> to be unable to use Mechanize to get information from pages on the same
>> server (i am using Webrick), perhaps it is that Webrick can not run two
>> instances together?
> Are you by any chance trying to get Mechanize to read a Rails app's own
> pages from within a request?

Exactamondo!
Alex Young (Guest)
on 2006-05-16 18:08
(Received via mailing list)
Dor Kalev wrote:
>>pages from within a request?
>
>
> Exactamondo!
Thought so...  This came up recently on the rails list.  Rails is
single-threaded - WEBrick isn't, but Rails limits itself by default
because (I think) it can't guarantee that the app's methods are
threadsafe.
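
A toy model of why the nested request cannot be served, assuming the dispatcher guards each request with a single lock as described above. DISPATCH_LOCK and handle_request are hypothetical names, and try_lock is used only so the sketch reports the conflict instead of hanging the way a real blocking lock would.

```ruby
# Hypothetical stand-in for a dispatcher that serves one request at a time.
DISPATCH_LOCK = Mutex.new

def handle_request(name)
  # try_lock returns false at once if another request holds the lock;
  # a real server would block here instead.
  return "#{name}: would block" unless DISPATCH_LOCK.try_lock

  begin
    yield if block_given?
    "#{name}: done"
  ensure
    DISPATCH_LOCK.unlock
  end
end

nested_result = nil

# The outer request fetches another page from the same server while it is
# still being handled, like calling Mechanize from inside a controller action.
outer_result = handle_request("outer") do
  nested_result = Thread.new { handle_request("nested") }.value
end

puts outer_result
puts nested_result
```

With a blocking lock the outer request would wait for the nested one, which in turn waits for the lock the outer request holds, so neither can finish.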

What are you trying to do, exactly?  If the Rails app generated the page
in the first place, surely it has the raw information at hand not to
need to scrape the HTML to find it again?
Dor Kalev (Guest)
on 2006-05-18 17:25
Alex Young wrote:
> Dor Kalev wrote:
>>>pages from within a request?
>>
>>
>> Exactamondo!
> Thought so...  This came up recently on the rails list.  Rails is
> single-threaded - WEBrick isn't, but Rails limits itself by default
> because (I think) it can't guarantee that the app's methods are
> threadsafe.
>
> What are you trying to do, exactly?  If the Rails app generated the page
> in the first place, surely it has the raw information at hand not to
> need to scrape the HTML to find it again?

That's right... but I still want it ;-)