Forum: Ruby regexp problem

James R. (Guest)
on 2009-02-09 13:40
How can I extract a value from this:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match it in many ways, but I
still have problems with escape characters.
Mike C. (Guest)
on 2009-02-09 13:53
(Received via mailing list)
Of course that depends upon how general this needs to be.  If it will
always be  the first part of the first parameter to a call to
Math.ceil and negated, then:

======================================================================
text = <<EOS
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

m = text.match(/Math\.ceil\(\-(\d+)/)
puts m[1] if m
======================================================================


Of course, it seems suspicious that you don't want to pick up the
minus, and this takes a lot of consistency for granted.  For a good
answer, you'll need to specify which conditions will always be the
same.
James R. (Guest)
on 2009-02-09 14:50
> m = text.match(/Math\.ceil\(\-(\d+)/)

I cannot use that regexp as it is. I need a regexp for the whole phrase
(<td>Traffic left:</td>.....), because the document is full of strings
like this.
Mike C. (Guest)
on 2009-02-10 02:13
(Received via mailing list)
If you're only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

One of the things you want to do with a REGEX is to avoid any more
detail than is necessary to find what you're looking for.  The REGEX
does not need to "match" the whole string.
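
To illustrate the point about not matching the whole string, here is a
minimal sketch. The helper name first_ceil_argument is made up for this
example; the pattern is the one posted earlier in this thread.

```ruby
# Hypothetical helper; the pattern is the one posted earlier in the thread.
def first_ceil_argument(text)
  # String#match searches anywhere in the string, so the pattern only has
  # to describe the part we care about, not the markup around it.
  m = text.match(/Math\.ceil\(-(\d+)/)
  m && m[1]
end

phrase = <<EOS
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

puts first_ceil_argument(phrase)
```

The helper returns nil when nothing matches, so the caller can tell
"no number found" apart from an empty result.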
7stud -. (Guest)
on 2009-02-10 07:13
Mike C. wrote:
> If you're only trying to pull out the single number, this REGEX will
> work for the whole phrase you provided.
>

The problem is that your regex will also retrieve 9999999 in this html:

<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>

and the op is trying to tell you that he doesn't want that number.

Parsing HTML with regexes is a bad strategy.
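
To make the false positive concrete, here is a small sketch. The helper
name all_ceil_arguments is invented for this example; it applies the
pattern from above to a document that holds both rows.

```ruby
# Hypothetical helper: collect every number that the posted pattern matches.
def all_ceil_arguments(html)
  html.scan(/Math\.ceil\(-(\d+)/).flatten
end

html = <<EOS
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

# The pattern knows nothing about the label cell, so both numbers come back.
p all_ceil_arguments(html)
```

Nothing in the pattern ties the number to the "Traffic left:" cell, so
the unwanted 9999999 is picked up along with 123313.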
William J. (Guest)
on 2009-02-10 10:20
(Received via mailing list)
Joao S. wrote:

> How can I extract a value from this:
>
> <td>Traffic left:</td><td
> align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
> MB</b></td>
>
> I need this number: 123313. I tried to match it in many ways, but I
> still have problems with escape characters.


list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten

list.each_cons(2){|a,b|
  if "Traffic left:" == a  and  b =~ /Math\.ceil\((-?\d+)/
    p $1
  end
}


__END__

<td>NOT TRAFFIC LEFT:</td><td
align=right><b>
<script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
</script>
MB</b></td>

<td> Traffic left:
</td><td
align=right><b><script>
document.write(setzeTT(""+Math.ceil(-123313/1000)));
</script>
MB</b></td>
Rick D. (Guest)
on 2009-02-10 16:21
(Received via mailing list)
On Tue, Feb 10, 2009 at 3:19 AM, William J. <removed_email_address@domain.invalid> wrote:

>
> __END__
> document.write(setzeTT(""+Math.ceil(-123313/1000)));
> </script>
> MB</b></td>
>
>
As 7stud pointed out, a toolbox with only regular expressions inside is
often a poor choice for dealing with XML/HTML.

Here's a rather verbose and commented program using a combination of
Hpricot and a regular expression to do something like what I think you
are looking for:

require 'rubygems'
require 'hpricot'

def get_traffic_left_numbers(string)
  doc = Hpricot(string)
  results = []
  # iterate over all of the td elements in the document
  doc.search("td").each do |td1|
    # check to see if the td contents is "Traffic left:"
    if td1.inner_text == "Traffic left:"
      # if yes, get the next sibling
      td2 = td1.next_sibling
      # and then for each script tag inside
      td2.search("script") do | script |
        # get the script_tag text
        script_text = script.inner_text
        # Use a regexp to capture the number
        number = /Math\.ceil\(-?(\d+)/.match(script_text)
        # add the number we found, if any, to the results array
        results << number[1] if number
      end
    end
  end
  results
end

p get_traffic_left_numbers('<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>')

When run this outputs:

["123313"]

In other words, it produces an array of strings representing the target
numbers in a script tag within a td tag which follows another td tag
whose inner text is "Traffic left:".

HTH


--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
Igor P. (Guest)
on 2009-02-10 20:16
Rick Denatale wrote:
> On Tue, Feb 10, 2009 at 3:19 AM, William J. wrote:
>
> As 7Stud pointed out, a toolbox with only regular expressions
> inside is often a poor choice for dealing with xml/html
>
> Here's a rather verbose and commented program using a
> combination of hpricot and a regular expression to do
> something like what I think you are looking for:
>
> require 'rubygems'
> require 'hpricot'
> . . .
>
> When run this outputs: ["123313"]
>
> In other words it produces an array of strings representing
> the target numbers in a script tag within a td tag which
> follows another td tag whose inner text is "Traffic left:"

Rick, your solution is swell, and it is probably worth considering for
someone whose day job is parsing HTML/XML documents. However, purely
from a language and/or programmer's perspective, William's solution is
far more appealing, much shorter, easier to understand, and requires
virtually no additional learning effort. It nullifies or "flattens" the
comment started by 7stud that you also elevated to an undeserved
height.
Rick D. (Guest)
on 2009-02-10 23:48
(Received via mailing list)
On Tue, Feb 10, 2009 at 1:15 PM, Igor P. <removed_email_address@domain.invalid>
wrote:

> > require 'rubygems'
> by someone whose day job is parsing html/xml documents. However, purely
> from a language and/or from a programmer's perspective William's
> solution is far more appealing,


subjective.


> much shorter,


certainly, particularly with my pedagogical comments,

> easier to understand and


I'd be quite willing to argue that.

>
> requires virtually no additional learning effort.


Yes, we wouldn't want to expend any unnecessary effort on learning,
would we?

And by the way, to get that to work (in Ruby 1.8) a nuby rubyist would
have to learn that you'd need to include 'enumerable' to get the cons
method.


> It nullifies or
> "flattens" the comment started out by 7Stud that you also elevated to an
> undeserving height.
>

You can treat regular expressions as a Maslovian hammer, but I've had
enough experience with XML to realize that that hammer is often a very
poor tool for parsing HTML.  I'd rather expend my learning budget in
learning how to apply a tool like Hpricot than to debug my own
low-level attempts.

But, as they say, to each his own.

--
Rick DeNatale

Igor P. (Guest)
on 2009-02-11 06:35
Rick Denatale wrote:
> On Tue, Feb 10, 2009 at 1:15 PM, Igor P. <removed_email_address@domain.invalid>
> wrote:
>
>> > require 'rubygems'
>> by someone whose day job is parsing html/xml documents. However, purely
>> from a language and/or from a programmer's perspective William's
>> solution is far more appealing,
>
> subjective.
>> much shorter,
> certainly, particularly with my pedagogical comments,

and much nicer as well as more elegant, I should add. But more
importantly, William's solution is inherently packed with its own
semantics and needs no pedagogue to explain its purpose or meaning!
True, beauty is in the eye of the beholder, but if you think of all
those engineering accomplishments that defy ageing, you will certainly
notice that none of them needs any pedagogic, aesthetic or other
comments.

> Yes, we wouldn't want to expend any unnecessary effort on learning
> would we.

No, we most certainly would not, especially when there's absolutely no
need for it! This is why Java is such a drag. The number of classes
that appear to be relevant to the Java environment itself has been
growing prolifically, to the point that programmers are suffocated in
"alpha.beta.gamma..." notations, never mind the unnecessary clutter
they have to memorize in order to be able to assign semantic value to
each token. You may as well write tons of pedagogic comments for every
line. In the end you cannot see the forest for the trees. Besides,
since when is a long learning curve a desirable attribute?

> ... work (in Ruby 1.8) a nuby rubyist would have  to learn that
> you'd need to include 'enumerable' to get the cons method.

What can I say, any language is a constantly evolving thing, but at
least in Ruby's case the change to "enumerable" represents a shift
towards better quality, which for the user means less unnecessary
overhead and a smaller learning curve. I seriously doubt that nowadays
any astute Ruby newbie seeks to learn Ruby 1.8 while ignoring Ruby 1.9.
I'd much rather say it's just the opposite, precisely because one would
try to avoid learning too much clutter.

> I've had enough experiences with  xml to realize that that
> hammer is often a very poor tool for parsing html. I'd rather
> expend my learning budget in learning how to apply a tool like
> Hpricot than to debug my own low-level attempts.

Precisely, if your life revolves around xml and html, Hpricot may be the
better way. However, for an occasional brush with a Markup Language my
old Perl book and core Ruby should do just fine.

Cheers,
igor :)
William J. (Guest)
on 2009-02-11 10:00
(Received via mailing list)
Rick DeNatale wrote:

>
> And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
> have to learn that you'd need to include 'enumerable' to get the cons
> method.

I didn't need to, and I'm using

ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]
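
For anyone who hasn't met it, each_cons(2) walks an array in
overlapping pairs, which is how a label cell gets paired with the cell
after it. A rough sketch follows; the helper name cell_after and the
sample cell contents are made up. As far as I know, on 1.8.6 the method
only became available after require 'enumerator', but treat that as an
assumption rather than something checked here.

```ruby
# Made-up helper: return the cell that directly follows the given label.
def cell_after(cells, label)
  # each_cons(2) yields [c0, c1], [c1, c2], [c2, c3], ...
  cells.each_cons(2) { |a, b| return b if a == label }
  nil
end

# Made-up cell contents standing in for the flattened scan results.
cells = ["NOT TRAFFIC LEFT:", "Math.ceil(-9999999/1000)",
         "Traffic left:", "Math.ceil(-123313/1000)"]

p cells.each_cons(2).to_a.length     # 3 overlapping pairs for 4 cells
p cell_after(cells, "Traffic left:") # the cell right after the label
```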
Rick D. (Guest)
on 2009-02-11 17:47
(Received via mailing list)
On Wed, Feb 11, 2009 at 2:58 AM, William J. <removed_email_address@domain.invalid> wrote:

>
>
Yes, I guess I should have said Ruby < 1.8.7

But personally, I don't use or recommend 1.8.7, since it's really
neither fish nor fowl. The backporting of some things from 1.9 feels
like it has caused more problems than it is worth.

--
Rick DeNatale

Pit C. (Guest)
on 2009-02-11 18:14
(Received via mailing list)
2009/2/11 Rick DeNatale <removed_email_address@domain.invalid>:
> But personally, I don't use or recommend 1.8.7, since it's really neither
> fish nor fowl. The backporting of some things from 1.9 feels like it has
> caused more problems than it is worth.

Which problems? As I've written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I'm interested in why you seem
to have had a different experience.

Regards,
Pit
Mark T. (Guest)
on 2009-02-11 19:06
(Received via mailing list)
As long as we're being pedagogical, I prefer XPath to all the previous
posted solutions.

* More accommodating to minor changes in the HTML
* Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)
unknown (Guest)
on 2009-02-11 22:42
(Received via mailing list)
On Feb 11, 11:01 am, Mark T. <removed_email_address@domain.invalid> wrote:
> sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)
What if the cell contains "No Traffic left"?
Mark T. (Guest)
on 2009-02-11 23:07
(Received via mailing list)
On Feb 11, 3:36 pm, removed_email_address@domain.invalid wrote:
>
> > puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)
>
> What if the cell contains "No Traffic left"?

Then you can use the XPath function starts-with() instead of
contains().
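
The difference is easy to see with plain Ruby strings, used here only
as an analogy for the two XPath functions. The method names below are
made up for illustration and are not part of any library.

```ruby
# Analogy for XPath contains(): true if the label appears anywhere.
def contains_label?(text)
  text.include?("Traffic left")
end

# Analogy for XPath starts-with(): true only if the text begins with it.
def starts_with_label?(text)
  text.start_with?("Traffic left")
end

["Traffic left:", "No Traffic left:"].each do |cell|
  puts "#{cell.inspect}: contains=#{contains_label?(cell)} " \
       "starts-with=#{starts_with_label?(cell)}"
end
```

Both cells pass the "contains" test, but only the first one passes the
"starts with" test.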
Rick D. (Guest)
on 2009-02-11 23:20
(Received via mailing list)
On Wed, Feb 11, 2009 at 11:12 AM, Pit C. <removed_email_address@domain.invalid> wrote:

> 2009/2/11 Rick DeNatale <removed_email_address@domain.invalid>:
> > But personally, I don't use or recommend 1.8.7, since it's really neither
> > fish nor fowl. The backporting of some things from 1.9 feels like it has
> > caused more problems than it is worth.
>
> Which problems? As I've written in ruby-core, all (but one) of my
> 1.8.6 code works flawlessly with 1.8.7. I'm interested why you seem to
> have made a different experience.
>
>
I'm not alone.  I'll refer you to the thread which Gregory B. just
opened to discuss the problems caused by having 1.8.7 be incompatible
with 1.8.6.
--
Rick DeNatale

Pit C. (Guest)
on 2009-02-11 23:40
(Received via mailing list)
2009/2/11 Rick DeNatale <removed_email_address@domain.invalid>:
> On Wed, Feb 11, 2009 at 11:12 AM, Pit C. <removed_email_address@domain.invalid>wrote:
>> Which problems? As I've written in ruby-core, all (but one) of my
>> 1.8.6 code works flawlessly with 1.8.7. I'm interested why you seem to
>> have made a different experience.
>>
> I'm not alone.  I'll refer you to the thread which Gregory B. just opened
> to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.

So can anyone show me some 1.8.6 code that doesn't work in 1.8.7? In
the thread you mention there have been no examples yet.

Regards,
Pit
unknown (Guest)
on 2009-02-12 00:30
(Received via mailing list)
On Feb 11, 11:01 am, Mark T. <removed_email_address@domain.invalid> wrote:
> As long as we're being pedagogical, I prefer XPath to all the previous
> posted solutions.
>
> * More accommodating to minor changes in the HTML
> * Very short (one-liner) and easy to read (IMHO)
>
> require 'nokogiri'
> doc = Nokogiri::HTML(html)
>
> puts doc.xpath('//

// is quite cryptic.

td[contains(.,

.?

"Traffic left")]/following-
> sibling::td//script'

script?

).to_s.scan(/Math.ceil.-(\d*)/)

I'd rather use Ruby.
David A. Black (Guest)
on 2009-02-12 00:53
(Received via mailing list)
Hi --

On Thu, 12 Feb 2009, removed_email_address@domain.invalid wrote:

>> puts doc.xpath('//
>
> // is quite cryptic.

It's standard XPath notation.


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Coming in 2009: The Well-Grounded Rubyist (http://manning.com/black2)

http://www.wishsight.com => Independent, social wishlist management!
Mark T. (Guest)
on 2009-02-12 05:25
(Received via mailing list)
On Feb 11, 5:29 pm, removed_email_address@domain.invalid wrote:

> I'd rather use Ruby.

Would you use Ruby string functions instead of the regular expression?
You could, but you probably wouldn't want to. XPath is like regular
expressions for XML and HTML. It has a particular syntax but once you
learn it, it's very powerful.

> // is quite cryptic.

It's the wildcard in XPath. So '//td' just means the td can be
anywhere in the tree, as opposed to '/td' which would be at the root.
It's no more cryptic than the .* wildcard in regexps.

td[contains(.,"Traffic left")]

The square brackets constrain the td with an expression that tests
whether the text of the current td node (that's what the . means)
contains the string "Traffic left". The comparison is case-sensitive.
So this phrase says select the <td> tag(s) which contain the string.

following-sibling::td//script

This says find the <script> tag(s) under the sibling <td> tags that
follow (in document order).

XPath isn't hard to learn. And it's well worth the investment.