Forum: Ruby How to use ReXML "in the wild"?

Kenneth McDonald (Guest)
on 2008-12-16 02:31
(Received via mailing list)
I'd very much like to use REXML's XPath features to extract info from
Google's financial info pages, but I find that REXML chokes on the
JavaScript. Here's the result of trying to read in a page with this
bit of code:

require "rexml/document"
require "net/http"

Net::HTTP.start("finance.google.com") do |http|
  response = http.get("/finance?fstype=ii&q=NYSE:WAT")
  # Parse the response body as XML (this is the call that fails).
  rdoc = REXML::Document.new(response.body)
end

==========
Output:

/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
#<RuntimeError: Illegal character '&' in raw string
" (REXML::ParseException)
(function(){
var d=navigator.userAgent.toLowerCase().indexOf("msie")!=-1;function
e(){var b=document.styleSheets;for(var a=b.length-1;a>=0;--a){var
c=b[a].href;if(c)if(c.indexOf("styles/finance_")!=-1||
c.indexOf("styles_")!=-1)return b[a]}return null}function f(){var
b=e();if(b){var a=b.rules;return
a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}
function g(){if(document.scripts)for(var b=0;b">
/usr/local/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
.
.
.

Is there a good way to get around this problem? If not, I guess it's
back to regular expressions...

Thanks,
Ken
unknown (Guest)
on 2008-12-16 02:49
(Received via mailing list)
Hi Kenneth,
> I'd very much like to use REXML's XPath features to extract info from
> Google's financial info pages, but I find that REXML chokes on the
> JavaScript. Here's the result of trying to read in a page with this
> bit of code:

Don't try that ;) REXML in the wild == epic FAIL. At this level, you
might want to try Hpricot or Nokogiri. At a slightly higher level, there
is scRUBYt!. You can read about web scraping in Ruby here (my most
successful article ever; it was even mentioned in Learning Ruby from
O'Reilly):

http://www.rubyrailways.com/data-extraction-for-we...

> Is there a good way to get around this problem? If not, I guess it's
> back to regular expressions...

Web scraping with regular expressions is almost never a good idea.

Try scRUBYt!:

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://finance.google.com/finance?fstype=ii&q=NYSE...

  body '/html/body' do
    revenue '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
      ending_6_28 '/td[3]'
    end

    gross_profit '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
    end
  end
end

puts data.to_xml

output:

<root>
  <body>
    <revenue>
      <ending_9_27>386.31</ending_9_27>
      <ending_6_28>398.77</ending_6_28>
    </revenue>
    <gross_profit>
      <ending_9_27>386.31</ending_9_27>
    </gross_profit>
  </body>
</root>
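
Since the XML that scRUBYt! emits is well-formed, REXML's XPath support (what the original question wanted to use) should work on it without trouble. A minimal sketch, assuming the output above has been captured in a string; the variable names `xml`, `latest` and `figures` are illustrative and not part of scRUBYt!:

```ruby
require "rexml/document"

# Assumption: a copy of the <revenue> part of the output shown above.
xml = <<~XML
  <root>
    <body>
      <revenue>
        <ending_9_27>386.31</ending_9_27>
        <ending_6_28>398.77</ending_6_28>
      </revenue>
    </body>
  </root>
XML

doc = REXML::Document.new(xml)

# Fetch a single node by absolute path and convert its text to a number.
latest = REXML::XPath.first(doc, "/root/body/revenue/ending_9_27").text.to_f

# Collect every child element of <revenue> into a name => value hash.
figures = REXML::XPath.match(doc, "//revenue/*")
                      .map { |el| [el.name, el.text.to_f] }
                      .to_h

puts latest          # prints 386.31
puts figures.inspect
```

The division of labour here is that a tolerant HTML tool does the scraping, and REXML only ever sees XML that is already well-formed.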


HTH,
Peter
___
http://scrubyt.org
http://www.rubyrailways.com
Phlip (Guest)
on 2008-12-16 04:51
(Received via mailing list)
Kenneth McDonald wrote:

> I'd very much like to use REXML's XPath features to extract info from
> Google's financial info pages, but I find that REXML chokes on the
> JavaScript. Here's the result of trying to read in a page with this
> bit of code:

I have studied REXML for many years, and I still can't figure out how to
get it to recognize an &mdash; or similar advanced entity.

Like the other responder said, give up while you still can. libxml-ruby
is also stable enough to give a shot - oh yeah, except it crashes on
non-tiny inputs.

Aaaand...

> /usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
> #<RuntimeError: Illegal character '&' in raw string

That's because REXML and your web browser disagree on the definition of
well-formed. Your browser accepts a naked & inside a <script> tag, but
REXML does not. REXML is technically correct, and your browser would have
accepted &amp;&amp; here, but...

> a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}

...browsers cannot correctly interpolate & appearing inside JavaScript
literal strings, because some lowlife coder using Notepad might have
actually wanted "&amp;" when they wrote "&amp;" - such as with
document.write().

So, because REXML cannot accept normal HTML, due to hits and misses of
standards compliance on all sides - you are better off with a dedicated
parser!
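
The well-formedness point can be reproduced without fetching anything. A small sketch, assuming two made-up script fragments (they are not taken from the Google Finance page) and a hypothetical helper named `well_formed?`:

```ruby
require "rexml/document"

# Hypothetical fragments: the same script body with a naked && and
# with the ampersands escaped as XML requires.
BARE    = "<script>if (a > 0 && b) { run(); }</script>"
ESCAPED = "<script>if (a > 0 &amp;&amp; b) { run(); }</script>"

# Returns true when REXML can parse the markup, false when it raises
# the same ParseException seen in the original post.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(BARE)     # => false
puts well_formed?(ESCAPED)  # => true

# Once parsed, REXML hands back the unescaped script text.
puts REXML::Document.new(ESCAPED).root.text
```

Real pages almost never escape their inline scripts this way, which is why an HTML-aware parser is the more practical choice for scraping.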