Forum: Ruby Ruby script to process html

Posted by Mario Trento (dainova)
on 2012-11-09 03:13
Hi,
What would be the most cool way to process/parse HTML and write it out
in another format, a kind of transpose mode?
E.g. INPUT:===============================
<tr>
  <td>type</td>
  <td>id=M_btn_Mainl00</td>
  <td>home</td>
</tr>
<tr>
  <td>clickAndWait</td>
  <td>id=M_ct200_MainContent</td>
  <td></td>
</tr>

Wanted output:============================
command:
command: ({ 'type', target: 'id=M_btn_Mainl00', value: 'home' });
command: ({ 'clickAndWait', target: 'id=M_ct200_MainContent' });  // no value


This is to compose some fancy automation libs. I see a way to basically
read and capture lines by the tags <tr>, <td>, etc., but I feel there
should be a better way.

Tx
Dai
Posted by Jonan S. (jonan_s)
on 2012-11-09 03:18
(Received via mailing list)
Anytime you're trying to parse HTML you want a gem called Nokogiri.
There's a lot of good documentation out there for it online.
Posted by Mario Trento (dainova)
on 2012-11-09 03:28
Tx, I've checked Nokogiri, but my HTML is all the same, with no unique
tags/ids, just <td> and <tr>. Maybe you have a hint how to extract the
value between <td>...</td> in a simple way? I think I still need to
process this .html line by line to transpose it: 3 input lines = 1 output line.

Tx
Dai
Posted by unknown (Guest)
on 2012-11-09 07:10
(Received via mailing list)
On 09.11.2012 03:28, Mario Trento wrote:
> Or maybe you have a hint how to extract value between <td>...</td>,
>
> Tx
> Dai
>

Depending on the formatting etc. of the real data, this could
get pretty difficult, but for your simple example data
iterating over the lines and using a regular expression
(with a named capture group) would work:

1.9.3-p194 :001 > /<tr>(?<value>.*?)<\/tr>/ =~ '  <tr>Test</tr>'
  => 2
1.9.3-p194 :002 > value
  => "Test"

But it certainly is not the "most cool way" and will break when
the HTML is formatted differently, e.g.

<tr>
   <td>...</td><td>...</td><td>...</td>
</tr>

or

<tr>
   <td id='whatever'>...</td>
   ...
</tr>
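
For what it's worth, the line-by-line idea could be sketched roughly as below. This is only an illustration and assumes exactly one <td>...</td> per line and three cells per row, as in the original example. The names `CELL`, `format_command` and `transpose_lines` are made up for this sketch, and the output format is copied from the "wanted output" in the first post.

```ruby
# Sketch only: line-oriented transpose of <tr>/<td> rows into "command:" lines.
# Assumption: one <td>...</td> per line, cells in the order command, target, value.
CELL = /<td>(?<value>.*?)<\/td>/

# Build one output line from the cells of one row (hypothetical helper).
def format_command(cells)
  command, target, value = cells
  parts = ["'#{command}'", "target: '#{target}'"]
  # Leave out the value when the third cell is missing or empty.
  parts << "value: '#{value}'" unless value.nil? || value.empty?
  "command: ({ #{parts.join(', ')} });"
end

# Walk the input line by line, collecting cells until a row closes.
def transpose_lines(html)
  commands = []
  cells = []
  html.each_line do |line|
    if (match = CELL.match(line))
      cells << match[:value]
    elsif line.include?('</tr>')
      commands << format_command(cells)
      cells = []
    end
  end
  commands
end

html = <<~HTML
  <tr>
    <td>type</td>
    <td>id=M_btn_Mainl00</td>
    <td>home</td>
  </tr>
  <tr>
    <td>clickAndWait</td>
    <td>id=M_ct200_MainContent</td>
    <td></td>
  </tr>
HTML

puts transpose_lines(html)
```

As soon as two cells share a line or a <td> carries an attribute, this sketch silently drops data, which is exactly the fragility described above.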
Posted by Robert Klemme (robert_k78)
on 2012-11-09 09:08
(Received via mailing list)
On Fri, Nov 9, 2012 at 7:09 AM, <sto.mar@web.de> wrote:

> On 09.11.2012 03:28, Mario Trento wrote:
>
> Depending on the formatting etc. of the real data, this could
> get pretty difficult, but for your simple example data
> iterating over the lines and using a regular expression
> (with a named capture group) would work:
>

I find processing tag structures with line-oriented tools pretty uncool.
:-)  In fact it's also error-prone, as you state yourself:

> But it certainly is not the "most cool way" and will break when
>   ...
> </tr>
>

Nokogiri rules!

Kind regards

robert
Posted by unknown (Guest)
on 2012-11-09 14:15
(Received via mailing list)
On 09.11.2012 09:08, Robert Klemme wrote:
> I find processing tag structures with line oriented tools pretty uncool.
> :-)  In fact it's also error prone like you state yourself:

Robert, that's exactly what I wanted to convey to the OP,
thanks for repeating :-)

But he also asked how to extract values from a string,
and - for other problems, rather not for this one -
that knowledge might actually be useful.
Posted by Robert Klemme (robert_k78)
on 2012-11-09 14:26
(Received via mailing list)
On Fri, Nov 9, 2012 at 2:14 PM, <sto.mar@web.de> wrote:

> On 09.11.2012 09:08, Robert Klemme wrote:
>
>> I find processing tag structures with line oriented tools pretty uncool.
>> :-)  In fact it's also error prone like you state yourself:
>>
>
> Robert, that's exactly what I wanted to convey to the OP,
> thanks for repeating :-)
>

:-)


> But he also asked how to extract values from a string,
> and - for other problems, rather not for this one -
> that knowledge might actually be useful.
>

Right, although in that case I would use slightly less arcane features:

irb(main):001:0>  '  <tr>Test</tr>'[%r{<tr>(.*?)</tr>}, 1]
=> "Test"
irb(main):002:0>  '  <tr>Test</tr>'[%r{(?<=<tr>)(.*?)(?=</tr>)}]
=> "Test"

Kind regards

robert
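
A related sketch, in case it helps: `String#[]` returns only the first match, while `String#scan` collects all of them, so it does not care how the cells are spread over lines. The helper name `cell_values` is hypothetical, and the pattern assumes cells are never nested and every row has three of them.

```ruby
# Sketch only: collect every <td> body with String#scan, then group by three.
# Assumption: no nested cells, and each row has exactly three cells.
def cell_values(html)
  # [^>]* tolerates attributes such as <td id='whatever'>;
  # the m flag lets . match newlines inside a cell.
  html.scan(%r{<td[^>]*>(.*?)</td>}m).flatten
end

html = "<tr><td>type</td><td>id=M_btn_Mainl00</td><td>home</td></tr>\n" \
       "<tr>\n  <td id='x'>clickAndWait</td>\n  <td>id=M_ct200_MainContent</td>\n  <td></td>\n</tr>\n"

# Prints one array of three strings per row.
cell_values(html).each_slice(3) { |row| p row }
```

This still treats the markup as flat text, so it remains a regex workaround rather than real HTML parsing.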
Posted by Mario Trento (dainova)
on 2012-11-09 19:40
Thanks, all guys.
I'll start with line processing I think, I'm still not quite into OOP.
Thanks all for your help. This structure is very solid and won't change;
it's actually Selenium scripts.html


Best
Dai
Posted by unknown (Guest)
on 2012-11-09 21:27
(Received via mailing list)
On 09.11.2012 19:40, Mario Trento wrote:
> Thanks, all guys.
> I'll start with line proc I think, still not quite in oop. Thanks all
> for your help, this structure is very solid, and won't change it's
> actually Selenium scripts.html

Sometimes a little more effort at the beginning pays off
in the long run...

See a hint for a solution using Nokogiri below.

Note that it is even simpler (in my opinion) than iterating
through lines and using regular expressions and
that it handles HTML that is more complicated than your example.

Disclaimer: I have never used Nokogiri before, and spent about
ten minutes on this, based on the simplest examples on
nokogiri.org, so it's probably wrong and/or clumsy and others
could provide much better solutions to your problem.

$ cat extract_data.rb
require 'nokogiri'

# Sample input: two rows, deliberately laid out differently
html = <<EOF
<tr>
  <td>type</td><td>id=M_btn_Mainl00</td><td>home</td>
</tr>
<tr>
  <td id='something'>clickAndWait</td>
  <td>id=M_ct200_MainContent</td>
  <td></td>
</tr>
EOF

# Parse the markup into a document tree
doc = Nokogiri::HTML.parse(html)

# Print the text content of every <td> cell, in document order
doc.xpath('//td').each do |cell|
  puts cell.content.inspect
end

$ ruby extract_data.rb
"type"
"id=M_btn_Mainl00"
"home"
"clickAndWait"
"id=M_ct200_MainContent"
""
Posted by Joel Pearson (virtuoso)
on 2012-11-10 00:04
I also highly recommend Nokogiri. I tried a few different ways to parse 
HTML in Ruby; and Nokogiri is lightning fast and very flexible.