Hi,
What would be the coolest way to process/parse HTML and write it out
in another format, a kind of transpose mode?
E.g. INPUT:===============================
<tr>
<td>type</td>
<td>id=M_btn_Mainl00</td>
<td>home</td>
</tr>
<tr>
<td>clickAndWait</td>
<td>id=M_ct200_MainContent</td>
<td></td>
</tr>
Wanted output:============================
command:
command: ({ 'type', target: 'id=M_btn_Mainl00', value: 'home' });
command: ({ 'clickAndWait', target: 'id=M_ct200_MainContent' }); // no value
This is to compose some fancy automation libs. I can see how to basically
read and capture lines by the tags <tr>, <td>, etc., but I feel there should be a
better way.
Tx
Dai
on 2012-11-09 03:13
on 2012-11-09 03:18
Anytime you're trying to parse HTML you want a gem called nokogiri. There's a lot of good documentation out there for it online.
on 2012-11-09 03:28
Thanks, I've checked Nokogiri, but my HTML is all the same, with no unique tags/ids, just <td> and <tr>. Maybe you have a hint on how to extract the value between <td>...</td> in a simple way? I think I still need to process this .html line by line to transpose it: 3 input lines = 1 output line.

Tx
Dai
on 2012-11-09 07:10
On 09.11.2012 03:28, Mario Trento wrote:
> Or maybe you have a hint how to extract value between <td>...</td>,
>
> Tx
> Dai

Depending on the formatting etc. of the real data, this could get pretty difficult, but for your simple example data, iterating over the lines and using a regular expression (with a named capture group) would work:

1.9.3-p194 :001 > /<tr>(?<value>.*?)<\/tr>/ =~ '  <tr>Test</tr>'
 => 2
1.9.3-p194 :002 > value
 => "Test"

But it certainly is not the "most cool way", and it will break when the HTML is formatted differently, e.g.

<tr>
  <td>...</td><td>...</td><td>...</td>
</tr>

or

<tr>
  <td id='whatever'>...</td>
  ...
</tr>
on 2012-11-09 09:08
On Fri, Nov 9, 2012 at 7:09 AM, <sto.mar@web.de> wrote:
> On 09.11.2012 03:28, Mario Trento wrote:
>
> Depending on the formatting etc. of the real data, this could
> get pretty difficult, but for your simple example data
> iterating over the lines and using a regular expression
> (with a named capture group) would work:

I find processing tag structures with line-oriented tools pretty uncool. :-) In fact, it's also error-prone, as you state yourself:

> But it certainly is not the "most cool way" and will break when
> ...
> </tr>

Nokogiri rules!

Kind regards

robert
on 2012-11-09 14:15
On 09.11.2012 09:08, Robert Klemme wrote:
> I find processing tag structures with line oriented tools pretty uncool.
> :-) In fact it's also error prone like you state yourself:

Robert, that's exactly what I wanted to convey to the OP, thanks for repeating :-)

But he also asked how to extract values from a string, and (for other problems, rather than for this one) that knowledge might actually be useful.
on 2012-11-09 14:26
On Fri, Nov 9, 2012 at 2:14 PM, <sto.mar@web.de> wrote:
> On 09.11.2012 09:08, Robert Klemme wrote:
>> I find processing tag structures with line oriented tools pretty uncool.
>> :-) In fact it's also error prone like you state yourself:
>
> Robert, that's exactly what I wanted to convey to the OP,
> thanks for repeating :-)

:-)

> But he also asked how to extract values from a string,
> and - for other problems, rather not for this one -
> that knowledge might actually be useful.

Right, although in that case I would use slightly less arcane features:

irb(main):001:0> ' <tr>Test</tr>'[%r{<tr>(.*?)</tr>}, 1]
=> "Test"
irb(main):002:0> ' <tr>Test</tr>'[%r{(?<=<tr>)(.*?)(?=</tr>)}]
=> "Test"

Kind regards

robert
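
For the <td> cells in the original question, a pattern like the ones above could be combined with String#scan and each_slice(3). This is only a minimal sketch: the helper name commands_from_cells is made up for illustration, and it assumes every row has exactly three plain <td> cells with no attributes, which is the same fragility mentioned earlier in the thread.

```ruby
# Sample input copied from the original post.
html = <<~EOF
  <tr>
  <td>type</td>
  <td>id=M_btn_Mainl00</td>
  <td>home</td>
  </tr>
  <tr>
  <td>clickAndWait</td>
  <td>id=M_ct200_MainContent</td>
  <td></td>
  </tr>
EOF

# Hypothetical helper: collect the body of every <td>, then treat each
# group of three cells as command, target, value.
def commands_from_cells(html)
  cells = html.scan(%r{<td>(.*?)</td>}m).flatten
  cells.each_slice(3).map do |command, target, value|
    line = "command: ({ '#{command}', target: '#{target}'"
    # Leave the value out entirely when the third cell is empty.
    line += ", value: '#{value}'" unless value.to_s.empty?
    line + " });"
  end
end

puts commands_from_cells(html)
```

Any attribute on a <td>, or a row with a different number of cells, would silently shift the grouping, so this only suits input whose layout is known to be fixed.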
on 2012-11-09 19:40
Thanks, guys. I think I'll start with line processing, as I'm still not quite comfortable with OOP. Thanks all for your help. This structure is very solid and won't change; it's actually Selenium scripts.html.

Best
Dai
on 2012-11-09 21:27
On 09.11.2012 19:40, Mario Trento wrote:
> Thanks, all guys.
> I'll start with line proc I think, still not quite in oop. Thanks all
> for your help, this structure is very solid, and won't change it's
> actually Selenium scripts.html

Sometimes a little more effort at the beginning pays off in the long run...

See a hint for a solution using Nokogiri below. Note that it is even simpler (in my opinion) than iterating through lines and using regular expressions, and that it handles HTML that is more complicated than your example.

Disclaimer: I have never used Nokogiri before, and spent about ten minutes on this, based on the simplest examples on nokogiri.org, so it's probably wrong and/or clumsy, and others could provide much better solutions to your problem.

$ cat extract_data.rb
require 'nokogiri'

html = <<EOF
<tr>
  <td>type</td><td>id=M_btn_Mainl00</td><td>home</td>
</tr>
<tr>
  <td id='something'>clickAndWait</td>
  <td>id=M_ct200_MainContent</td>
  <td></td>
</tr>
EOF

doc = Nokogiri::HTML.parse(html)

doc.xpath('//td').each do |cell|
  puts cell.content.inspect
end

$ ruby extract_data.rb
"type"
"id=M_btn_Mainl00"
"home"
"clickAndWait"
"id=M_ct200_MainContent"
""
on 2012-11-10 00:04
I also highly recommend Nokogiri. I tried a few different ways to parse HTML in Ruby, and Nokogiri is lightning fast and very flexible.