Regexp Ruby selection

casper_the_ghost · July 25, 2008, 9:45am

Hi folks,
I’m trying to write a Ruby script that selects the content of an HTML
table in an HTML page.
I used Rubular to test my regexp, which is
/<td class="TabIntCenContenuto"[^>]*>(.*)<\/td>/
With Rubular, the result of my expression is:
Result 1

12345678
Result 2
SAN FRANCESCO DA PAOLA
Result 3
Via San Francesco Da Paola, 10
Result 4
10123
Result 5
TORINO
etc…
But with my script:

File.open('D:/testt/1.txt', 'r') do |filein|
  while line = filein.gets
    p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ ... line =~ //A /
  end
  fileout.puts p
end
end

I got this result:
"<td class="TabIntCenContenuto">12345678 \n"
"<td class="TabIntCenContenuto">SAN FRANCESCO DA PAOLA </td>\n"
"<td class="TabIntCenContenuto">Via San Francesco Da Paola, 10 \n"
"<td class="TabIntCenContenuto">10123 \n"
"<td class="TabIntCenContenuto" align="left">TORINO \n"

I thought the ... between the two “line =~” tests worked like the (...) in
Rubular, which captures the content?
Moreover, I would like to transform this HTML into XML, but I can’t
think of a way to turn these HTML lines into XML.

12345678
But there is no ‘name’ attribute or anything similar in the table cells,
so doing a match/replace would be difficult?
…

So, if someone can help me I would be very grateful.
Nice day

casper_the_ghost · July 25, 2008, 9:50am

Any specific reason you can’t use hpricot or other HTML parsers?

Jayanth

casper_the_ghost · July 25, 2008, 9:56am

I’ll second that. Hpricot is really quite remarkable. It’ll almost
certainly save you days and days of pain. Unless you are doing this
for fun / learning, of course.

On Fri, Jul 25, 2008 at 8:44 AM, Srijayanth S. wrote:


–
Me, I imagine places that I have never seen / The colored lights in
fountains, blue and green / And I imagine places that I will never go
/ Behind these clouds that hang here dark and low
But it’s there when I’m holding you / There when I’m sleeping too /
There when there’s nothing left of me / Hanging out behind the
burned-out factories / Out of reach but leading me / Into the
beautiful sea

casper_the_ghost · July 25, 2008, 11:18am

On Fri, Jul 25, 2008 at 8:44 AM, Srijayanth S. wrote:

Any specific reason you can’t use hpricot or other HTML parsers?

I didn’t know about this tool for Ruby. I once used a parser named tidy, but
that’s all. I’ll try it now and let you know.

On Jul 25, 9:58 am, Sebastian H. wrote:

So as a summary: It doesn’t do what you thought it did. As a matter of fact it
doesn’t do anything sane. So just keep as far away from it as possible.

So I was wrong... Thank you for your explanation of my incorrect use
of the loop.

Thank you.

casper_the_ghost · July 25, 2008, 10:05am

[email protected] wrote:

I thought the ... between the two “line =~” tests worked like the (...) in
Rubular, which captures the content?

Generally in Ruby, ... denotes a range, as in starting_value...end_value.
In this case, though, it denotes a flip-flop, which is evil and should never
ever be used because it makes my head hurt. Here’s what it does, though:
some_loop {
  do_something if foo...bar
}
This will do nothing until foo is true. When foo is true, it will
do_something. It will then keep doing so in every iteration of the loop
until bar becomes true, including the iteration in which bar becomes true.
After that it will stop until foo is true again.
So, as a summary: it doesn’t do what you thought it did. As a matter of
fact, it doesn’t do anything sane. So just keep as far away from it as
possible.
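To make the on/off behaviour concrete, here is a small sketch. The method name flip_flop_demo and the sample values are made up for illustration; nothing here comes from the original script. It also hints at why the script printed whole lines: the flip-flop only decides whether the statement runs, and it never captures anything.

```ruby
# Collect the values for which the flip-flop condition is "on".
# flip_flop_demo is a hypothetical helper, used only for this illustration.
def flip_flop_demo(values, start_at, stop_at)
  selected = []
  values.each do |v|
    # Switches on when v == start_at and stays on through the
    # iteration in which v == stop_at, then switches off again.
    selected << v if (v == start_at)...(v == stop_at)
  end
  selected
end

p flip_flop_demo(1..8, 3, 5)  # => [3, 4, 5]
```

The condition is just a true/false gate around the statement, so there is no capture group involved anywhere.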

HTH,
Sebastian

casper_the_ghost · July 25, 2008, 11:52am

From: [email protected] [mailto:[email protected]]

I’m trying to write a Ruby script that selects the content of an HTML
table in an HTML page.
I used Rubular to test my regexp, which is
/<td class="TabIntCenContenuto"[^>]*>(.*)<\/td>/

the re is fine, you can use that

With Rubular, the result of my expression is:
Result 1
1. 12345678
Result 2
1. SAN FRANCESCO DA PAOLA
Result 3
1. Via San Francesco Da Paola, 10
Result 4
1. 10123
Result 5
1. TORINO
etc…

But with my script:
File.open('D:/testt/1.txt', 'r') do |filein|
  while line = filein.gets
    p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ ... line =~ //A /
  end
  fileout.puts p
end

I got this result:
"<td class="TabIntCenContenuto">12345678 \n"
"<td class="TabIntCenContenuto">SAN FRANCESCO DA PAOLA </td>\n"
"<td class="TabIntCenContenuto">Via San Francesco Da Paola, 10 \n"
"<td class="TabIntCenContenuto">10123 \n"
"<td class="TabIntCenContenuto" align="left">TORINO \n"

you already matched the lines, but you did not capture the content.

sample code & run,

botp@botp-desktop:~$ cat test.rb
File.open('test.txt') do |f|
  while line = f.gets
    # $1 holds whatever the parenthesized group captured on this line
    if line =~ /<td class="TabIntCenContenuto"[^>]*>(.*)<\/td>/
      p $1
    end
  end
end

botp@botp-desktop:~$ ruby test.rb
"12345678"
"SAN FRANCESCO DA PAOLA"
"Via San Francesco Da Paola,10"
"10123"
"TORINO"
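
The same capture idea can also be applied to a whole string at once with String#scan instead of going line by line. This is only a sketch: the constant CELL, the method cell_contents, and the sample markup are assumptions made for illustration, and the non-greedy group plus closing </td> are a guess at what the real page looks like.

```ruby
# Assumed pattern: capture everything between the opening tag and </td>.
CELL = /<td class="TabIntCenContenuto"[^>]*>(.*?)<\/td>/

# Return the text of every matching cell, with surrounding spaces removed.
# cell_contents is a hypothetical helper name.
def cell_contents(html)
  # scan returns one array per match, holding the capture groups
  html.scan(CELL).map { |(text)| text.strip }
end

# Illustrative sample, not the real page
sample = <<~HTML
  <td class="TabIntCenContenuto">12345678</td>
  <td class="TabIntCenContenuto">SAN FRANCESCO DA PAOLA</td>
  <td class="TabIntCenContenuto" align="left">TORINO</td>
HTML

p cell_contents(sample)
```

Because `.` does not match a newline by default, a cell whose content is split across lines would be missed; that is one of the cases where an HTML parser is the safer choice.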

I thought the ... between the two “line =~” tests worked like the (...) in
Rubular, which captures the content?

you are making it harder. keep it simple.

Moreover, I would like to transform this HTML into XML, but I can’t
think of a way to turn these HTML lines into XML.
12345678
But there is no ‘name’ attribute or anything similar in the table cells,
so doing a match/replace would be difficult?

if the html is nicely formatted, you can loop through the table.
if you want to be sure, try outputting all the data you can capture
first. Then output that again with xml tags inserted.

do not worry. XML, like HTML, is just text with tags. Manipulating text is
a good learning exercise in Ruby.
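
As a rough sketch of "output that again with xml tags inserted": the element names below (record, code, name, address, zip, city) are invented for illustration, since the thread never says what each cell means, and the helpers xml_escape and to_xml_record are hypothetical names.

```ruby
# Hypothetical element names; the real meaning of each cell is not given.
FIELD_NAMES = %w[code name address zip city].freeze

# Escape the characters that are special inside XML text.
def xml_escape(text)
  text.to_s.gsub(/[&<>]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;')
end

# Wrap each captured value in the element named at the same position.
def to_xml_record(values, names = FIELD_NAMES)
  fields = names.zip(values).map do |name, value|
    "  <#{name}>#{xml_escape(value)}</#{name}>"
  end
  "<record>\n#{fields.join("\n")}\n</record>"
end

puts to_xml_record(['12345678', 'SAN FRANCESCO DA PAOLA',
                    'Via San Francesco Da Paola, 10', '10123', 'TORINO'])
```

Relying on position works only if every row has its cells in the same order, which is what "nicely formatted" amounts to here.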

kind regards -botp