What's the best way of parsing this string?

daveh · February 12, 2007, 10:36pm

Hi there,

I have a question about string search and replacement. Let’s say I have this string that contains 2 links for embedding YouTube videos amongst some other random text.

string = “We’ve got some junk text here” +
“<object width="425" height="350"><param name="movie"
value="- YouTube +
“<param name="wmode" value="transparent"><embed
src="
- YouTube”
type="application/x-shockwave-flash"” +
“wmode="transparent" width="425"
height="350">” +
“And we’ve got some more junk text right here” +
“<object width="425" height="350"><param name="movie"
value="- YouTube +
“<param name="wmode" value="transparent"><embed
src="
- YouTube”
type="application/x-shockwave-flash"” +
“wmode="transparent" width="425"
height="350">” +
“more garbage text here”

I’d like to get the value of the ‘value’ attributes ("- YouTube" and "- YouTube" respectively) and convert the string into the following:

We’ve got some junk text here

And we've got some more junk text right here

more garbage text here

What would be the best library to use for parsing and replacing certain values in a string? I’ve done simple .gsub’ing before, but this seems to be a little more complicated.

Thanks,
Dave H.

daveh · February 12, 2007, 10:50pm

Well, maybe you can assume valid XML and parse the page with REXML. If it’s not valid XML, well, you can regex through for (off-the-cuff, probably wrong) /<param.*value="(.*)"/ and use the result. Someone more knowledgeable in Regexps can help you out if it comes to this.
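
For the quick regex route, a rough sketch in Ruby might look like the following. The input string, the example.com URLs, and the names movie_value and object_block are made up for illustration. It also assumes every embed sits inside a closed <object>…</object> block, which real pages may not guarantee.

```ruby
# Hypothetical input: two embed blocks mixed into plain text.
text = 'junk text here' \
       '<object width="425" height="350">' \
       '<param name="movie" value="http://example.com/v/one"></param>' \
       '<embed src="http://example.com/v/one" width="425" height="350"></embed>' \
       '</object>' \
       'more junk text' \
       '<object width="425" height="350">' \
       '<param name="movie" value="http://example.com/v/two"></param>' \
       '<embed src="http://example.com/v/two" width="425" height="350"></embed>' \
       '</object>' \
       'last text'

# Capture the value attribute of each <param name="movie" ...> tag.
movie_value = /<param\s+name="movie"\s+value="([^"]*)"/

# Match one whole <object>...</object> block, as lazily as possible.
object_block = %r{<object\b.*?</object>}m

urls  = text.scan(movie_value).flatten   # every captured value, in order
plain = text.gsub(object_block, "\n")    # embed blocks replaced by a newline

puts urls
puts plain
```

String#scan collects every capture, and String#gsub strips the markup in one pass. What to put in place of each block (a newline here) is a guess at the desired output.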

Jason

daveh · February 12, 2007, 11:09pm

Hi Jason,

Thanks for the suggestion. Why would REXML be a good fit, though?

-Dave

daveh · February 13, 2007, 12:58am

Thanks Mark… I’ll look into that.

-Dave

daveh · February 12, 2007, 11:46pm

Hi Dave,
REXML gets you into the structure of XHTML-compliant docs.
There is a learning curve.
It is a different way of doing things.

The Tutorial page is a lesson in itself.
An astonishing compression of answers in one page.
http://www.germane-software.com/software/XML/rexml/docs/tutorial.html
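
As a small taste of what "getting into the structure" can mean, here is a sketch that assumes the markup is well-formed XML. The sample document and its example.com URL are invented for illustration. REXML ships with Ruby (as a bundled gem in recent versions).

```ruby
require 'rexml/document'

# Hypothetical, well-formed fragment; real pages are often not this tidy.
xml = <<~XML
  <div>
    Some junk text
    <object width="425" height="350">
      <param name="movie" value="http://example.com/v/abc"/>
      <param name="wmode" value="transparent"/>
    </object>
    more text
  </div>
XML

doc = REXML::Document.new(xml)

# Select only the <param> elements whose name attribute is "movie",
# then read their value attribute.
movie_values = REXML::XPath.match(doc, '//param[@name="movie"]')
                           .map { |el| el.attributes['value'] }

puts movie_values
```

The XPath expression picks elements by attribute, so there is no need to worry about attribute order or quoting, as long as the parser accepts the input at all.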

Markt

daveh · February 13, 2007, 2:17am

On 2/12/07, Dave H. wrote:

I have a question about string search and replacement. Let’s say I have this
string that contains 2 links for embedding youtube videos amongst
some other random text.

I haven’t used it yet, but I hear really good things about Hpricot.

-austin

daveh · February 13, 2007, 4:01am

Hi,

On Tuesday 13 February 2007 02:16, Austin Z. wrote:

On 2/12/07, Dave H. wrote:

I have a question about string search and replacement. Let’s say I have
this string that contains 2 links for embedding youtube videos
amongst some other random text.

I haven’t used it yet, but I hear really good things about Hpricot.

I’ve used it[1]. I needed to pull some statistical data off a bunch of HTML pages that differed slightly from one another. Combined with Firebug[2]'s ability to generate an XPath expression by simply pointing at an element[3], and recent Hpricot’s support for XPath indices, it should be a matter of minutes to automatically extract anything you want from any HTML page.

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.getfirebug.com/
[3] version 1.0 is simply awesome

daveh · February 13, 2007, 2:48pm

On 2/12/07, Jason R. wrote:

Well, maybe you can assume valid XML and parse the page with REXML. If it’s not valid XML, well, you can regex through for (off-the-cuff, probably wrong) /<param.*value="(.*)"/ and use the result. Someone more knowledgeable in Regexps can help you out if it comes to this.

Well, it might be necessary to use a non-greedy match

/<param.*value="(.*?)"/

in order not to consume a potentially following key="…" pair.

A more explicit and thus more readable way might be to write it like this, which also avoids any potential backtracking issues if the regexp evolves later.

/<param.*value="([^"]*)"/

This is all just for the quick hack though; definitely go with REXML or Hpricot if they can do the job for you.
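
To make the difference concrete, here is a small comparison. It assumes the patterns are meant with the * quantifiers that "greedy" and "non-greedy" imply. The tag and its example.com URL are invented for illustration.

```ruby
# Hypothetical tag: the value attribute is followed by another quoted attribute.
tag = '<param name="movie" value="http://example.com/v/abc" id="player">'

greedy  = /<param.*value="(.*)"/      # (.*) runs on to the last quote in the string
lazy    = /<param.*value="(.*?)"/     # (.*?) stops at the first closing quote
negated = /<param.*value="([^"]*)"/   # [^"]* cannot cross a quote at all

# String#[] with a regexp and a group index returns that capture group.
greedy_capture  = tag[greedy, 1]
lazy_capture    = tag[lazy, 1]
negated_capture = tag[negated, 1]

p greedy_capture
p lazy_capture
p negated_capture
```

On this input the greedy capture swallows the id attribute as well, while the lazy and negated versions both stop at the end of the URL.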

HTH
Robert

Jason

daveh · February 13, 2007, 4:42pm

Robert D. wrote:
[…]

Well it might be necessary to use a non greedy match

/<param.*value="(.*?)"/

in order not to consume a potentially following key="…" pair.

A more explicit and thus more readable way might be to write it like this -
avoiding any potential backtracking issues if the regexp evolves later too.

/<param.*value="([^"]*)"/

Your advice to use a non-greedy match is good, but the example that still uses a greedy match is not:

your_re =~ ' … bla bla bla … value="ha!"'
puts $1
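
One reading of this objection is that the .* between <param and value is still greedy, so it skips ahead to the last value=" on the line. The sketch below illustrates that reading. The input line and the tight_prefix pattern are assumptions for illustration, not something posted in the thread.

```ruby
# Hypothetical line containing two param tags.
line = '<param name="movie" value="first"> bla bla bla <param name="wmode" value="ha!">'

# The greedy .* after <param runs to the end of the line and backtracks,
# so the match lands on the LAST value="..." pair.
greedy_prefix = /<param.*value="([^"]*)"/

# Keeping the prefix inside a single tag ([^>]*?) avoids jumping across tags.
tight_prefix = /<param[^>]*?value="([^"]*)"/

first_hit = line[greedy_prefix, 1]
all_hits  = line.scan(tight_prefix).flatten

p first_hit
p all_hits
```

With the greedy prefix the only capture is the value from the second tag. The tighter prefix, combined with String#scan, returns both values in order.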

Greetings.