What's the best way of parsing this string?

Hi there,

I have a question about string search and replacement. Let’s say I have this string that contains 2 links for embedding YouTube videos amongst some other random text.

string = “We’ve got some junk text here” +
“<object width="425" height="350"><param name="movie"
value="- YouTube +
“<param name="wmode" value="transparent"><embed
src="
- YouTube
type="application/x-shockwave-flash"” +
“wmode="transparent" width="425"
height="350">” +
“And we’ve got some more junk text right here” +
“<object width="425" height="350"><param name="movie"
value="- YouTube +
“<param name="wmode" value="transparent"><embed
src="
- YouTube
type="application/x-shockwave-flash"” +
“wmode="transparent" width="425"
height="350">” +
“more garbage text here”

I’d like to get the value of the ‘value’ attributes ("- YouTube" and "- YouTube" respectively) and convert the string into the following:

We’ve got some junk text here

And we've got some more junk text right here
more garbage text here

What would be the best library to use for parsing and replacing certain values in a string? I’ve done simple .gsub’ing before, but this seems to be a little more complicated ;)

Thanks,
Dave H.

Well, maybe you can assume valid XML and parse the page with REXML. If it’s not valid XML, you can regex through for (off-the-cuff, probably wrong) /<param.*value="(.*)"/ and use the result. Someone more knowledgeable in Regexps can help you out if it comes to this.
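
A minimal sketch of that two-step advice, under stated assumptions: `param_values` is a hypothetical helper, the URLs are placeholders (the real ones are not preserved in this thread), and the fallback pattern is a stricter variant rather than the off-the-cuff one quoted above.

```ruby
require 'rexml/document'

# Hypothetical helper (not from the thread): collect the value attribute of
# every <param> element, preferring REXML and falling back to a regexp.
def param_values(markup)
  # Wrap in a dummy root so surrounding junk text is allowed as text nodes.
  doc = REXML::Document.new("<root>#{markup}</root>")
  values = []
  doc.elements.each('//param') { |param| values << param.attributes['value'] }
  values
rescue REXML::ParseException
  # Not well-formed XML: quick-hack fallback. [^"]* cannot run past the
  # closing quote of the attribute.
  markup.scan(/<param\b[^>]*\bvalue="([^"]*)"/).flatten
end

# Placeholder inputs: the first is well-formed, the second leaves tags open.
well_formed = 'junk <object><param name="movie" value="http://example.com/v/ONE"></param></object> junk'
not_xml     = 'junk <object><param name="movie" value="http://example.com/v/TWO"><embed src="x"> junk'

p param_values(well_formed)
p param_values(not_xml)
```

The second input never closes its tags, so REXML should reject it and the regexp branch takes over.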

Jason

Hi Jason,

Thanks for the suggestion. Why would REXML be a good fit, though?

-Dave

Thanks Mark… I’ll look into that.

-Dave

Hi Dave,
REXML gets you into the structure of XHTML-compliant documents.
There is a learning curve, and it is a different way of doing things.

The tutorial page is a lesson in itself: an astonishing compression of answers in one page.
http://www.germane-software.com/software/XML/rexml/docs/tutorial.html
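
As a rough illustration of what "getting into the structure" can look like, here is a small sketch. The markup and URL are made-up placeholders, and it assumes the input is well-formed XML, which real embed snippets often are not.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample; the URL is a placeholder.
xhtml = <<~XML
  <div>
    <p>Some text</p>
    <object width="425" height="350">
      <param name="movie" value="http://example.com/v/VIDEO_ID"/>
      <param name="wmode" value="transparent"/>
    </object>
  </div>
XML

doc = REXML::Document.new(xhtml)

# Select only the <param> elements whose name attribute is "movie".
movie_urls = REXML::XPath.match(doc, '//param[@name="movie"]').map do |param|
  param.attributes['value']
end

p movie_urls

# The tree can also be walked element by element.
doc.root.elements.each('object') do |object|
  puts "#{object.attributes['width']}x#{object.attributes['height']}"
end
```

The XPath predicate is what separates the movie parameter from the wmode one without any string matching.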

Markt

On 2/12/07, Dave H. wrote:

I have a question about string search and replacement. Let’s say I have this string that contains 2 links for embedding YouTube videos amongst some other random text.

I haven’t used it yet, but I hear really good things about Hpricot.

-austin

Hi,

On Tuesday 13 February 2007 02:16, Austin Z. wrote:


I haven’t used it yet, but I hear really good things about Hpricot.

I’ve used it[1]. I needed to pull some statistical data off a bunch of HTML pages that differed slightly from one another. Combine it with Firebug[2]'s ability to generate an XPath expression by simply pointing at an element[3] and recent Hpricot’s support for XPath indices, and it should be a matter of minutes to automatically extract anything you want from any HTML page. :)

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.getfirebug.com/
[3] version 1.0 is simply awesome

On 2/12/07, Jason R. wrote:

Well, maybe you can assume valid XML and parse the page with REXML. If it’s not valid XML, you can regex through for (off-the-cuff, probably wrong) /<param.*value="(.*)"/ and use the result. Someone more knowledgeable in Regexps can help you out if it comes to this.

Well, it might be necessary to use a non-greedy match

/<param.*value="(.*?)"/

in order not to consume a potentially following key="…" pair.

A more explicit, and thus more readable, way might be to write it like this, which also avoids any potential backtracking issues if the regexp evolves later:

/<param.*value="([^"]*)"/
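
To make the difference concrete, here is a small comparison. It assumes the patterns under discussion are the greedy `(.*)`, the non-greedy `(.*?)` and the negated-class `([^"]*)` forms. The sample tags, the placeholder URL and the stricter `per_tag` pattern are illustrations, not something posted in the thread.

```ruby
# Hypothetical one-line tag with another attribute after value; placeholder URL.
tag = '<param name="movie" value="http://example.com/v/VIDEO_ID" type="video">'

greedy     = /<param.*value="(.*)"/      # (.*) runs to the LAST double quote
non_greedy = /<param.*value="(.*?)"/     # (.*?) stops at the first closing quote
negated    = /<param.*value="([^"]*)"/   # [^"]* can never cross a double quote

puts tag[greedy, 1]       # also swallows the type attribute
puts tag[non_greedy, 1]
puts tag[negated, 1]

# All three still start with a greedy ".*", so with two <param> tags on one
# line only the last value is found.
line = '<param name="movie" value="http://example.com/v/VIDEO_ID">' \
       '<param name="wmode" value="transparent">'
puts line[negated, 1]

# A stricter variant (an assumption, not from the thread) keeps the match
# inside a single tag, so scan can collect every value.
per_tag = /<param\b[^>]*\bvalue="([^"]*)"/
p line.scan(per_tag).flatten
```

Since the string in the original question is built without newlines, the leading `.*` matters there as much as the capture group does.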

This is all just for the quick hack though; definitely go with REXML or Hpricot if they can do the job for you.
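
For what the quick hack might look like end to end on the original question, here is a sketch using gsub with a block. It rests on several assumptions: the string below is a simplified stand-in with placeholder URLs, every embed is wrapped in a closed `<object>…</object>` pair, and `name="movie"` comes before `value` inside the tag.

```ruby
# Hypothetical stand-in for the string in the question; URLs are placeholders.
text = "We've got some junk text here" \
       '<object width="425" height="350">' \
       '<param name="movie" value="http://example.com/v/FIRST"></param>' \
       '<param name="wmode" value="transparent"></param>' \
       '<embed src="http://example.com/v/FIRST" wmode="transparent"></embed>' \
       '</object>' \
       "And we've got some more junk text right here" \
       '<object width="425" height="350">' \
       '<param name="movie" value="http://example.com/v/SECOND"></param>' \
       '<embed src="http://example.com/v/SECOND" wmode="transparent"></embed>' \
       '</object>' \
       'more garbage text here'

# Non-greedy so each match ends at the nearest </object>.
object_block = %r{<object\b.*?</object>}m
# Stays inside one <param> tag and captures its value attribute.
movie_value = /<param\b[^>]*\bname="movie"[^>]*\bvalue="([^"]*)"/

result = text.gsub(object_block) do |block|
  url = block[movie_value, 1]
  # Leave a block untouched when no movie parameter is found in it.
  url ? "\n#{url}\n" : block
end

puts result
```

If the real markup is less regular than this, the parser route is the safer bet.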

HTH
Roberts

Jason

Robert D. wrote:
[…]

Well it might be necessary to use a non greedy match

/<param.*value="(.*?)"/

in order not to consume a potentially following key="…" pair.

A more explicit and thus more readable way might be to write it like this - avoiding any potential backtracking issues if the regexp evolves later too.

/<param.*value="([^"]*)"/

Your advice of using a non greedy match is good, but the example using a greedy match is not ;)

your_re =~ ' … bla bla bla … value="ha!"'
puts $1

Greetings.