Regexp problem

yankees26 · February 9, 2009, 12:40pm

How can I extract:

Traffic left: MB

I need this number: 123313. I tried to match this in many ways but I still
have problems with escape characters.

yankees26 · February 9, 2009, 12:53pm

Of course that depends upon how general this needs to be. If it will
always be the first part of the first parameter to a call to
Math.ceil and negated, then:

======================================================================
text = <<EOS

Traffic left:<td
align

right>document.write(setzeTT(""+Math.ceil(-123313/1000)));</
script>
MB
EOS

m = text.match(/Math\.ceil\(-(\d+)/)
puts m[1] if m

Of course, it seems suspicious that you don’t want to pick up the
minus, and this seems to take a lot of consistency for granted. For a
good answer, you’ll need to specify what conditions will always be the
same.
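Since the original trouble was with escape characters, here is a minimal sketch of the escaping involved. The sample string is an assumption standing in for the page's script text (only the Math.ceil call comes from the post), and the method names are made up for illustration. A pattern with an unescaped, unclosed "(" is rejected by Ruby as having an unmatched parenthesis, so the literal dot and parenthesis need backslashes.

```ruby
# Assumed sample: only the Math.ceil(...) call comes from the thread.
SCRIPT_TEXT = 'document.write(setzeTT(""+Math.ceil(-123313/1000)));'

# In a regexp, "." and "(" are metacharacters, so a literal dot and a
# literal opening parenthesis are written as "\." and "\(".
def traffic_number(text)
  m = text.match(/Math\.ceil\(-(\d+)/)
  m && m[1]
end

# Regexp.escape produces the escaped form of a literal prefix.
def traffic_number_escaped(text)
  prefix = Regexp.escape('Math.ceil(-')
  text[/#{prefix}(\d+)/, 1]
end

puts traffic_number(SCRIPT_TEXT)
puts traffic_number_escaped(SCRIPT_TEXT)
```

Both calls pull out the same digits; the second form is handy when the literal part comes from a variable.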

yankees26 · February 9, 2009, 1:50pm

m = text.match(/Math\.ceil\(-(\d+)/)

I cannot use a regexp on just this. I need a regexp on this whole phrase
(Traffic left:…), because the document is full of strings like
this.

yankees26 · February 10, 2009, 1:13am

If you’re only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

One of the things you want to do with a REGEX is to avoid any more
detail than is necessary to find what you’re looking for. The REGEX
does not need to “match” the whole string.

yankees26 · February 10, 2009, 6:13am

Mike C. wrote:

If you’re only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

The problem is that your regex will also retrieve 9999999 in this html:

NOT TRAFFIC LEFT: MB

and the OP is trying to tell you that he doesn’t want that number.

Parsing html with regexes is a bad strategy.
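A small sketch of the false positive described above. The two table rows are assumed markup (the original HTML did not survive in the thread), and the method names are hypothetical; the point is only that a pattern keyed on Math.ceil alone cannot tell the two cells apart.

```ruby
# Assumed markup: two rows with the same script call but different labels.
SAMPLE_ROWS = <<EOS
<tr><td>NOT TRAFFIC LEFT:</td><td align="right"><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script> MB</td></tr>
<tr><td>Traffic left:</td><td align="right"><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script> MB</td></tr>
EOS

# Keyed on Math.ceil only: every matching script call is returned.
def all_ceil_numbers(html)
  html.scan(/Math\.ceil\(-(\d+)/).flatten
end

# The first match wins, so the unwanted row is the one picked up.
def first_ceil_number(html)
  html[/Math\.ceil\(-(\d+)/, 1]
end

p all_ceil_numbers(SAMPLE_ROWS)   # => ["9999999", "123313"]
p first_ceil_number(SAMPLE_ROWS)  # => "9999999"
```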

yankees26 · February 10, 2009, 9:20am

Joao S. wrote:

How can I extract:
Traffic left: MB
I need this number: 123313. I tried to match this in many ways but I
still have problems with escape characters.

list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten

list.each_cons(2){|a,b|
  if "Traffic left:" == a and b =~ /Math\.ceil\((-?\d+)/
    p $1
  end
}

__END__

NOT TRAFFIC LEFT: MB Traffic left: MB
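Because the data section above lost its markup, here is a self-contained sketch of the same pairing idea. The table HTML is an assumption, and traffic_left_values is a hypothetical name; the regular expressions follow the ones in the post.

```ruby
# Assumed markup; the thread's original HTML was lost.
CELL_HTML = <<EOS
<table>
<tr><td>NOT TRAFFIC LEFT:</td><td align="right"><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script> MB</td></tr>
<tr><td>Traffic left:</td><td align="right"><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script> MB</td></tr>
</table>
EOS

def traffic_left_values(html)
  # Collect the trimmed contents of every <td> cell, in document order.
  cells = html.scan(%r{<td.*?>\s*(.*?)\s*</td>}im).flatten
  # Look at each adjacent pair (label cell, value cell) and keep the
  # number only when the label cell is exactly "Traffic left:".
  cells.each_cons(2).map { |label, value|
    value[/Math\.ceil\((-?\d+)/, 1] if label == "Traffic left:"
  }.compact
end

p traffic_left_values(CELL_HTML)  # => ["-123313"]
```

Note that this pattern keeps the minus sign, since the optional "-" sits inside the capture group.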

yankees26 · February 10, 2009, 3:21pm

On Tue, Feb 10, 2009 at 3:19 AM, William J. [email protected]
wrote:

__END__
document.write(setzeTT(""+Math.ceil(-123313/1000)));

MB

As 7Stud pointed out, a toolbox with only regular expressions inside is
often a poor choice for dealing with xml/html

Here’s a rather verbose and commented program using a combination of
hpricot and a regular expression to do something like what I think you
are looking for:

require 'rubygems'
require 'hpricot'

def get_traffic_left_numbers(string)
  doc = Hpricot(string)
  results = []

  # iterate over all of the td elements in the document
  doc.search("td").each do |td1|
    # check to see if the td contents is "Traffic left:"
    if td1.inner_text == "Traffic left:"
      # if yes, get the next sibling
      td2 = td1.next_sibling
      # and then for each script tag inside
      td2.search("script") do |script|
        # get the script_tag text
        script_text = script.inner_text
        # Use a regexp to capture the number
        number = /Math\.ceil\(-?(\d+)/.match(script_text)
        # add the number we found, if any, to the results array
        results << number[1] if number
      end
    end
  end
  results
end

p get_traffic_left_numbers(“Traffic left:
MB

NOT TRAFFIC LEFT: MB")

When run this outputs:

["123313"]

In other words it produces an array of strings representing the target
numbers in a script tag within a td tag which follows another td tag
whose inner text is “Traffic left:”

HTH

–
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale

yankees26 · February 10, 2009, 10:48pm

On Tue, Feb 10, 2009 at 1:15 PM, Igor P. [email protected]
wrote:

require ‘rubygems’
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer’s perspective William’s
solution is far more appealing,

subjective.

much shorter,

certainly, particularly with my pedagogical comments,

easier to understand and

I’d be quite willing to argue that.

requires virtually no additional learning effort.

Yes, we wouldn’t want to expend any unnecessary effort on learning would
we.

And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
have to learn that you’d need to include ‘enumerable’ to get the cons
method.

It nullifies or
“flattens” the comment started out by 7Stud that you also elevated to an
undeserving height.

You can treat regular expressions as a Maslovian hammer, but I’ve had
enough experiences with xml to realize that that hammer is often a very
poor tool for parsing html. I’d rather expend my learning budget in
learning how to apply a tool like Hpricot than to debug my own low-level
attempts.

But, as they say, to each his own.

–
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale

yankees26 · February 10, 2009, 7:16pm

Rick Denatale wrote:

On Tue, Feb 10, 2009 at 3:19 AM, William J. wrote:

As 7Stud pointed out, a toolbox with only regular expressions
inside is often a poor choice for dealing with xml/html

Here’s a rather verbose and commented program using a
combination of hpricot and a regular expression to do
something like what I think you are looking for:

require 'rubygems'
require 'hpricot'
. . .

When run this outputs: ["123313"]

In other words it produces an array of strings representing
the target numbers in a script tag within a td tag which
follows another td tag whose inner text is “Traffic left:”

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer’s perspective William’s
solution is far more appealing, much shorter, easier to understand and
requires virtually no additional learning effort. It nullifies or
“flattens” the comment started out by 7Stud that you also elevated to an
undeserving height.

yankees26 · February 11, 2009, 5:35am

Rick Denatale wrote:

On Tue, Feb 10, 2009 at 1:15 PM, Igor P. [email protected]
wrote:

require ‘rubygems’
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer’s perspective William’s
solution is far more appealing,

subjective.

much shorter,
certainly, particularly with my pedagogical comments,

and much nicer as well as more elegant, I should add. But more
importantly William’s solution is inherently packed with its own
semantics that needs no pedagogue to explain its purpose or meaning!
True, beauty is in the eyes of the beholder, but if you think of all
those engineering accomplishments that defy ageing you will certainly
notice none of them need any pedagogic, aesthetic or any other comments.

Yes, we wouldn’t want to expend any unnecessary effort on learning
would we.

No, we most certainly would not, especially when there’s absolutely no
need for it! This is why Java is such a drag. The number of classes
that appear to be relevant to the Java environment itself has been
growing prolifically, to the point that programmers are suffocated
in “alpha.beta.gamma…” notations, never mind the unnecessary clutter
they have to memorize in order to be able to assign semantic value to
each token. You may as well write tons of pedagogic comments for every
line. In the end you do not see the trees because of the forest.
Besides, since when is a long learning curve a desirable attribute?

… work (in Ruby 1.8) a nuby rubyist would have to learn that
you’d need to include ‘enumerable’ to get the cons method.

What can I say, any language is a constantly evolving thing, but at least
in the case of Ruby’s “enumerable” the change represents a shift towards
better quality, which for the user means less unnecessary overhead and a
smaller learning curve. I seriously doubt that nowadays any astute Ruby
newbie seeks to learn Ruby 1.8 while ignoring Ruby 1.9. I’d much rather
say it’s just the opposite, precisely because one would try to avoid
learning too much clutter.

I’ve had enough experiences with xml to realize that that
hammer is often a very poor tool for parsing html. I’d rather
expend my learning budget in learning how to apply a tool like
Hpricot than to debug my own low-level attempts.

Precisely, if your life revolves around xml and html, Hpricot may be the
better way. However, for an occasional brush with a Markup Language my
old Perl book and core Ruby should do just fine.

Cheers,
igor

yankees26 · February 11, 2009, 9:00am

Rick DeNatale wrote:

And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
have to learn that you’d need to include ‘enumerable’ to get the cons
method.

I didn’t need to, and I’m using

ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]
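For readers who have not met each_cons, a minimal sketch of what it yields. As far as I know, the 1.8.6 standard-library file that supplied it was named enumerator (loaded with require 'enumerator'), and from 1.8.7 onward the method is built in, so the sketch needs no require on current Rubies. The cell labels and the method name are made up for illustration.

```ruby
# Hypothetical cell contents, for illustration only.
CELLS = ["Label A:", "1", "Label B:", "2"]

# each_cons(2) slides a window of two over the list, one step at a time,
# so every element (except the ends) appears once as a left and once as
# a right neighbour.
def adjacent_pairs(list)
  list.each_cons(2).to_a
end

p adjacent_pairs(CELLS)
# => [["Label A:", "1"], ["1", "Label B:"], ["Label B:", "2"]]
```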

yankees26 · February 11, 2009, 5:14pm

2009/2/11 Rick DeNatale [email protected]:

But personally, I don’t use or recommend 1.8.7, since it’s really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.

Which problems? As I’ve written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I’m interested why you seem to
have had a different experience.

Regards,
Pit

yankees26 · February 11, 2009, 4:47pm

On Wed, Feb 11, 2009 at 2:58 AM, William J. [email protected]
wrote:

Yes, I guess I should have said Ruby < 1.8.7

But personally, I don’t use or recommend 1.8.7, since it’s really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.

–
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale

yankees26 · February 11, 2009, 9:42pm

On Feb 11, 11:01 am, Mark T. [email protected] wrote:

sibling::td//script’).to_s.scan(/Math.ceil.-(\d*)/)
What if the cell contains “No Traffic left”?

yankees26 · February 11, 2009, 6:06pm

As long as we’re being pedagogical, I prefer XPath to all the previous
posted solutions.

More accommodating to minor changes in the HTML
Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)
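A note on the regular-expression half of the one-liner, sketched in plain Ruby without Nokogiri. The script string is an assumed stand-in for what .to_s would return, and scan_ceil is a hypothetical name. The unescaped "." after "ceil" matches any single character, which is how the pattern gets past the "(" without a backslash.

```ruby
# Assumed stand-in for the serialized <script> element.
SCRIPT_NODE = '<script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>'

# scan with one capture group returns an array of one-element arrays.
def scan_ceil(text)
  text.scan(/Math.ceil.-(\d*)/)
end

p scan_ceil(SCRIPT_NODE)     # => [["123313"]]
puts scan_ceil(SCRIPT_NODE)  # puts flattens nested arrays, printing 123313
```

The loose dots also accept other characters in those positions, so the pattern is more permissive than one written with "\." and "\(".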

yankees26 · February 11, 2009, 10:20pm

On Wed, Feb 11, 2009 at 11:12 AM, Pit C. [email protected] wrote:

2009/2/11 Rick DeNatale [email protected]:

But personally, I don’t use or recommend 1.8.7, since it’s really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.

Which problems? As I’ve written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I’m interested why you seem to
have had a different experience.

I’m not alone. I’ll refer you to the thread which Gregory B. just opened
to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.

Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale

yankees26 · February 11, 2009, 10:40pm

2009/2/11 Rick DeNatale [email protected]:

On Wed, Feb 11, 2009 at 11:12 AM, Pit C. [email protected] wrote:

Which problems? As I’ve written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I’m interested why you seem to
have had a different experience.

I’m not alone. I’ll refer you to the thread which Gregory B. just opened
to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.

So can anyone show me some 1.8.6 code that doesn’t work in 1.8.7? In
the thread you mention there have been no examples yet.

Regards,
Pit

yankees26 · February 11, 2009, 10:07pm

On Feb 11, 3:36 pm, [email protected] wrote:

puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)

What if the cell contains “No Traffic left”?

Then you can use the XPath function starts-with() instead of contains().
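To make the difference concrete, here is a plain-Ruby analogy of the two predicates (string methods, not XPath itself). The method names are hypothetical, and the two sample cell texts are the ones mentioned in this exchange.

```ruby
# Ruby string analogues of the two XPath predicates (not XPath itself).
def contains_label?(cell_text)
  cell_text.include?("Traffic left")     # like contains(., "Traffic left")
end

def starts_with_label?(cell_text)
  cell_text.start_with?("Traffic left")  # like starts-with(., "Traffic left")
end

["Traffic left:", "No Traffic left"].each do |cell|
  puts "#{cell.inspect}: contains=#{contains_label?(cell)} " \
       "starts_with=#{starts_with_label?(cell)}"
end
```

Only the prefix test rejects "No Traffic left"; the substring test accepts both cells.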

yankees26 · February 11, 2009, 11:30pm

On Feb 11, 11:01 am, Mark T. [email protected] wrote:

As long as we’re being pedagogical, I prefer XPath to all the previous
posted solutions.

More accommodating to minor changes in the HTML

Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//

// is quite cryptic.

td[contains(.,

.?

"Traffic left")]/following-sibling::td//script'

script?

).to_s.scan(/Math.ceil.-(\d*)/)

I’d rather use Ruby.

yankees26 · February 11, 2009, 11:53pm

Hi –

On Thu, 12 Feb 2009, [email protected] wrote:

puts doc.xpath('//

// is quite cryptic.

It’s standard XPath notation.

David

–
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Coming in 2009: The Well-Grounded Rubyist

http://www.wishsight.com => Independent, social wishlist management!
