Ruby script to process html

Hi,
What would be the coolest way to process/parse HTML and write it out
in another format, kind of a transpose mode?
e.g. INPUT:===============================

<tr>
<td>type</td>
<td>id=M_btn_Mainl00</td>
<td>home</td>
</tr>
<tr>
<td>clickAndWait</td>
<td>id=M_ct200_MainContent</td>
<td></td>
</tr>

Wanted output:============================
command:
command: ({ 'type', target: 'id=M_btn_Mainl00', value: 'home' });
command: ({ 'clickAndWait', target: 'id=M_ct200_MainContent' }); // no value

I want this to compose some fancy automation libs. I see that I could
basically read and capture lines by tags (<tr>, <td>, etc.), but I feel
there should be a better way.

Tx
Dai

Anytime you’re trying to parse HTML, you want a gem called Nokogiri.
There’s a lot of good documentation for it online.

Tx, I’ve checked Nokogiri, but my HTML is all the same, with no unique
tags/ids, just <tr>, <td>. Maybe you have a hint how to extract the value
between ... in a simple way? I think I still need to process this .html
line by line to transpose it: 3 input lines = 1 output line.

Tx
Dai

On 09.11.2012 03:28, Mario T. wrote:

Or maybe you have a hint how to extract value between

…,

Tx
Dai

Depending on the formatting etc. of the real data, this could
get pretty difficult, but for your simple example data
iterating over the lines and using a regular expression
(with a named capture group) would work:

1.9.3-p194 :001 > /<tr>(?<value>.*?)<\/tr>/ =~ '  <tr>Test</tr>'
=> 2
1.9.3-p194 :002 > value
=> "Test"
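In a script rather than irb, the same named capture can be read from a MatchData object. This is only a minimal sketch: the `<td>` tag, the sample line, and the helper name `cell_value` are assumptions for illustration, not taken from the real data. One detail worth knowing is that `=~` only creates the local variable when the regexp literal is on the left-hand side, so going through `Regexp#match` is often less surprising.

```ruby
# Hypothetical pattern: assumes each value sits in a <td>...</td> on one line.
CELL = %r{<td>(?<value>.*?)</td>}

# Returns the captured text, or nil when the line contains no cell.
def cell_value(line)
  match = CELL.match(line)
  match && match[:value]
end

puts cell_value('  <td>home</td>').inspect  # prints "home"
puts cell_value('no cell here').inspect     # prints nil
```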

But it certainly is not the “most cool way” and will break when
the html is formatted differently, like e.g.

... ... ...

or

... ...
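To make the failure mode concrete, here is a small sketch with made-up markup (an assumption, not the poster's data) in which one cell is split across several lines. The line-by-line regexp silently drops it, while a scan over the whole string with the `m` flag still finds it.

```ruby
# Hypothetical markup: the second cell is spread over three lines.
html = <<~HTML
  <td>type</td>
  <td>
    id=M_btn_Mainl00
  </td>
HTML

# Line oriented: only cells that open and close on the same line match.
per_line = html.each_line.map { |line| line[%r{<td>(.*?)</td>}, 1] }.compact

# Whole string: the m flag lets "." match newlines as well.
whole = html.scan(%r{<td>(.*?)</td>}m).flatten.map(&:strip)

p per_line  # ["type"]
p whole     # ["type", "id=M_btn_Mainl00"]
```

Even the whole-string variant would still trip over attributes on the tags or nested markup, which is the kind of thing a real HTML parser takes care of.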

On 09.11.2012 09:08, Robert K. wrote:

I find processing tag structures with line oriented tools pretty uncool.
:-) In fact it’s also error prone like you state yourself:

Robert, that’s exactly what I wanted to convey to the OP,
thanks for repeating :-)

But he also asked how to extract values from a string,
and - for other problems, rather not for this one -
that knowledge might actually be useful.

On Fri, Nov 9, 2012 at 7:09 AM, [email protected] wrote:

On 09.11.2012 03:28, Mario T. wrote:

Depending on the formatting etc. of the real data, this could
get pretty difficult, but for your simple example data
iterating over the lines and using a regular expression
(with a named capture group) would work:

I find processing tag structures with line oriented tools pretty uncool.
:-) In fact it’s also error prone like you state yourself:

But it certainly is not the “most cool way” and will break when

Nokogiri rules!

Kind regards

robert

Thanks, guys.
I’ll start with line processing, I think; I’m still not quite into OOP.
Thanks all for your help. This structure is very solid and won’t change;
it’s actually Selenium scripts (.html).

Best
Dai

On Fri, Nov 9, 2012 at 2:14 PM, [email protected] wrote:

On 09.11.2012 09:08, Robert K. wrote:

I find processing tag structures with line oriented tools pretty uncool.

:-) In fact it’s also error prone like you state yourself:

Robert, that’s exactly what I wanted to convey to the OP,
thanks for repeating :-)

:-)

But he also asked how to extract values from a string,
and - for other problems, rather not for this one -
that knowledge might actually be useful.

Right, although in that case I would use slightly less arcane features:

irb(main):001:0> '  <tr>Test</tr>'[%r{<tr>(.*?)</tr>}, 1]
=> "Test"
irb(main):002:0> '  <tr>Test</tr>'[%r{(?<=<tr>)(.*?)(?=</tr>)}]
=> "Test"
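One practical point about `String#[]` with a regexp is that it distinguishes an empty cell from a line with no cell at all, which matters for commands that have no value. A small sketch, assuming `<td>` cells (the tag name and the sample strings are made up for illustration):

```ruby
# Hypothetical pattern for a single table cell on one line.
CELL_RE = %r{<td>(.*?)</td>}

filled  = '<td>home</td>'[CELL_RE, 1]         # captured text
empty   = '<td></td>'[CELL_RE, 1]             # empty string: cell present, no value
missing = 'no cell on this line'[CELL_RE, 1]  # nil: no cell at all

# The lookbehind/lookahead form returns the same text without a capture index.
lookaround = '<td>home</td>'[%r{(?<=<td>).*?(?=</td>)}]

p filled, empty, missing, lookaround
```

So a `nil` result can be treated as "skip this line", while `""` still counts as a (blank) third column.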

Kind regards

robert

On 09.11.2012 19:40, Mario T. wrote:

Thanks, guys.
I’ll start with line processing, I think; I’m still not quite into OOP.
Thanks all for your help. This structure is very solid and won’t change;
it’s actually Selenium scripts (.html).

Sometimes a little more effort at the beginning pays off
in the long run…

See a hint for a solution using Nokogiri below.

Note that it is even simpler (in my opinion) than iterating
through lines and using regular expressions and
that it handles HTML that is more complicated than your example.

Disclaimer: I have never used Nokogiri before, and spent about
ten minutes on this, based on the simplest examples on
nokogiri.org, so it’s probably wrong and/or clumsy and others
could provide much better solutions to your problem.

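$ cat extract_data.rb
require 'nokogiri'

# Sample rows in Selenium-style table markup (assumed layout)
html = <<EOF
<table>
<tr><td>type</td><td>id=M_btn_Mainl00</td><td>home</td></tr>
<tr>
<td>clickAndWait</td>
<td>id=M_ct200_MainContent</td>
<td></td>
</tr>
</table>
EOF

doc = Nokogiri::HTML.parse(html)

# Print the text content of every table cell
doc.xpath('//td').each do |cell|
  puts cell.content.inspect
end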

$ ruby extract_data.rb
"type"
"id=M_btn_Mainl00"
"home"
"clickAndWait"
"id=M_ct200_MainContent"
""
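From that list of cells it is a short step to the wanted output. The sketch below starts from the six strings printed above (in a real script they would come from something like `doc.xpath('//td').map(&:content)`) and groups them three at a time. The helper name `format_command` is made up for illustration, and the output shape simply copies the format asked for in the first post.

```ruby
# Cell contents as printed by the script above.
cells = ['type', 'id=M_btn_Mainl00', 'home',
         'clickAndWait', 'id=M_ct200_MainContent', '']

# Hypothetical helper: formats one command/target/value triple and
# leaves the value out when the third cell is empty.
def format_command(command, target, value)
  parts = ["'#{command}'", "target: '#{target}'"]
  parts << "value: '#{value}'" unless value.to_s.empty?
  "command: ({ #{parts.join(', ')} });"
end

# Assumes every row has exactly three cells: command, target, value.
lines = cells.each_slice(3).map do |command, target, value|
  format_command(command, target, value)
end

puts lines
# command: ({ 'type', target: 'id=M_btn_Mainl00', value: 'home' });
# command: ({ 'clickAndWait', target: 'id=M_ct200_MainContent' });
```

Grouping by threes relies on every row having exactly three cells. Iterating over `doc.xpath('//tr')` and reading `row.xpath('td').map(&:content)` for each row would probably be more robust if that ever stops being true.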

I also highly recommend Nokogiri. I tried a few different ways to parse
HTML in Ruby, and Nokogiri is lightning fast and very flexible.