Ruby script to process html

aris · November 9, 2012, 3:13am

Hi,
What would be the most cool way to process/parse HTML and write it out
in another format, as a kind of transpose?
E.g. INPUT:===============================

<tr>
<td>type</td>
<td>id=M_btn_Mainl00</td>
<td>home</td>
</tr>
<tr>
<td>clickAndWait</td>
<td>id=M_ct200_MainContent</td>
<td></td>
</tr>

Wanted output:============================
command:
command: ({ 'type', target: 'id=M_btn_Mainl00', value: 'home' });
command: ({ 'clickAndWait', target: 'id=M_ct200_MainContent' }); // no value

This is to compose some fancy automation libs. I see a way to basically
read the file and capture lines by their tags (rows, cells, etc.), but I feel
there should be a better way.

Tx
Dai

dainova · November 9, 2012, 3:18am

Anytime you’re trying to parse HTML you want a gem called Nokogiri.
There’s a lot of good documentation out there for it online.

dainova · November 9, 2012, 3:28am

Tx, I’ve checked Nokogiri, but my HTML is all the same, with no unique tags/ids,
just plain table rows and cells. Maybe you have a hint how to extract the value between an opening and a closing tag in a simple way? I still need to process this .html line by line, I think, to transpose it: 3 input lines = 1 output line.

Tx
Dai

dainova · November 9, 2012, 7:10am

On 09.11.2012 03:28, Mario T. wrote:

Or maybe you have a hint how to extract value between
…,
Tx
Dai

Depending on the formatting etc. of the real data, this could
get pretty difficult, but for your simple example data
iterating over the lines and using a regular expression
(with a named capture group) would work:

1.9.3-p194 :001 > /<tr>(?<value>.*?)<\/tr>/ =~ '  <tr>Test</tr>'
=> 2
1.9.3-p194 :002 > value
=> "Test"

But it certainly is not the “most cool way” and will break when
the html is formatted differently, like e.g.

... ... ...

or

... ...
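
For the simple case, the line-by-line idea might look roughly like the sketch below. It assumes one <td> cell per line; the sample markup and the names SAMPLE_HTML and extract_cells are made up for illustration. It only picks up the first cell on a line and misses cells that span several lines, which is exactly the fragility mentioned above.

```ruby
# Hypothetical sample input: one <td> cell per line, three cells per command.
SAMPLE_HTML = <<~HTML
  <tr>
    <td>type</td>
    <td>id=M_btn_Mainl00</td>
    <td>home</td>
  </tr>
  <tr>
    <td>clickAndWait</td>
    <td>id=M_ct200_MainContent</td>
    <td></td>
  </tr>
HTML

# Collect the text of the first <td>...</td> cell found on each line.
def extract_cells(html)
  cells = []
  html.each_line do |line|
    # A regex literal on the left of =~ assigns its named groups
    # to local variables, here `value`.
    if /<td>(?<value>.*?)<\/td>/ =~ line
      cells << value
    end
  end
  cells
end

puts extract_cells(SAMPLE_HTML).inspect
```

Lines without a cell (the row tags) are simply skipped, and the empty third cell of the second row comes back as an empty string.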

dainova · November 9, 2012, 2:15pm

On 09.11.2012 09:08, Robert K. wrote:

I find processing tag structures with line oriented tools pretty uncool.
In fact it’s also error prone like you state yourself:

Robert, that’s exactly what I wanted to convey to the OP,
thanks for repeating

But he also asked how to extract values from a string,
and - for other problems, rather not for this one -
that knowledge might actually be useful.

dainova · November 9, 2012, 9:08am

On Fri, Nov 9, 2012 at 7:09 AM, [email protected] wrote:

On 09.11.2012 03:28, Mario T. wrote:

Depending on the formatting etc. of the real data, this could
get pretty difficult, but for your simple example data
iterating over the lines and using a regular expression
(with a named capture group) would work:

I find processing tag structures with line oriented tools pretty uncool.
In fact it’s also error prone like you state yourself:

But it certainly is not the “most cool way” and will break when

…

Nokogiri rules!

Kind regards

robert

dainova · November 9, 2012, 7:40pm

Thanks, guys.
I’ll start with line-by-line processing, I think; I’m still not quite into OOP. Thanks all
for your help. This structure is very solid and won’t change; it’s
actually Selenium scripts.html

Best
Dai

dainova · November 9, 2012, 2:26pm

On Fri, Nov 9, 2012 at 2:14 PM, [email protected] wrote:

On 09.11.2012 09:08, Robert K. wrote:

I find processing tag structures with line oriented tools pretty uncool.

In fact it’s also error prone like you state yourself:

Robert, that’s exactly what I wanted to convey to the OP,
thanks for repeating

But he also asked how to extract values from a string,
and - for other problems, rather not for this one -
that knowledge might actually be useful.

Right, although in that case I would use slightly less arcane features:

irb(main):001:0> '  <tr>Test</tr>'[%r{<tr>(.*?)</tr>}, 1]
=> "Test"
irb(main):002:0> '  <tr>Test</tr>'[%r{(?<=<tr>)(.*?)(?=</tr>)}]
=> "Test"
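
A small sketch of both forms, plus String#scan for lines that hold several cells. The <td> tag, the sample strings and the helper name cell_texts are assumptions for illustration, not taken from the real file.

```ruby
# Hypothetical sample line; the real markup may differ.
LINE = '  <td>Test</td>'

# String#[] with a regex and a group number returns that capture (or nil).
BY_GROUP = LINE[%r{<td>(.*?)</td>}, 1]

# With lookbehind/lookahead the tags are not part of the match itself,
# so the whole match is already the cell text.
BY_LOOKAROUND = LINE[%r{(?<=<td>).*?(?=</td>)}]

# String#scan returns every match, which helps when cells share a line.
def cell_texts(line)
  line.scan(%r{<td>(.*?)</td>}).flatten
end

puts BY_GROUP
puts BY_LOOKAROUND
puts cell_texts('<td>type</td><td>id=M_btn_Mainl00</td><td>home</td>').inspect
```

Using %r{...} instead of /.../ avoids having to escape the slash in the closing tag.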

Kind regards

robert

dainova · November 9, 2012, 9:27pm

On 09.11.2012 19:40, Mario T. wrote:

Thanks, guys.
I’ll start with line-by-line processing, I think; I’m still not quite into OOP. Thanks all
for your help. This structure is very solid and won’t change; it’s
actually Selenium scripts.html

Sometimes a little more effort at the beginning pays off
in the long run…

See a hint for a solution using Nokogiri below.

Note that it is even simpler (in my opinion) than iterating
through lines and using regular expressions and
that it handles HTML that is more complicated than your example.

Disclaimer: I have never used Nokogiri before, and spent about
ten minutes on this, based on the simplest examples on
nokogiri.org, so it’s probably wrong and/or clumsy and others
could provide much better solutions to your problem.

$ cat extract_data.rb
require 'nokogiri'

html = <<EOF
<tr>
<td>type</td><td>id=M_btn_Mainl00</td><td>home</td>
</tr>
<tr>
<td>clickAndWait</td>
<td>id=M_ct200_MainContent</td>
<td></td>
</tr>
EOF

doc = Nokogiri::HTML.parse(html)

# Print the text content of every table cell, in document order.
doc.xpath('//td').each do |cell|
  puts cell.content.inspect
end

$ ruby extract_data.rb
"type"
"id=M_btn_Mainl00"
"home"
"clickAndWait"
"id=M_ct200_MainContent"
""
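
To get from those cell contents to the output format asked for at the top of the thread, one possible next step is to group the cells three at a time. This is only a sketch: the cell list is copied from the output above so it runs without Nokogiri, and the helper names format_command and commands_from are made up for illustration.

```ruby
# Cell contents as printed above (hard-coded so the sketch has no
# dependency on Nokogiri).
CELL_TEXTS = ["type", "id=M_btn_Mainl00", "home",
              "clickAndWait", "id=M_ct200_MainContent", ""].freeze

# Hypothetical helper: format one (command, target, value) triple in the
# requested style, leaving out the value when the cell is empty.
def format_command(name, target, value)
  parts = ["'#{name}'", "target: '#{target}'"]
  parts << "value: '#{value}'" unless value.nil? || value.empty?
  "command: ({ #{parts.join(', ')} });"
end

# Every three consecutive cells form one command.
def commands_from(cells)
  cells.each_slice(3).map do |name, target, value|
    format_command(name, target, value)
  end
end

puts commands_from(CELL_TEXTS)
```

With Nokogiri available, the hard-coded array could presumably be replaced by doc.xpath('//td').map(&:content), so that the three-cells-per-row assumption is the only thing tying the code to this particular file layout.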

dainova · November 10, 2012, 12:04am

I also highly recommend Nokogiri. I tried a few different ways to parse
HTML in Ruby, and Nokogiri is lightning fast and very flexible.