" <td valign="top">message <td valign="top">the message
to echo. <td valign="top" align="center">Yes, unless data is
included in a character section within this element. "
how can I get this result
["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]
?
I’ve tried scan + regexp, but the best I’ve got so far is
in a character section within this element."]
There have been several simple approaches proposed in this thread
that may work for what you want. If you need something more robust,
you could look at existing Perl modules that solve this problem,
such as Lingua::EN::Sentence.
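For reference, one of the simpler approaches could look like the sketch below. It assumes every cell starts at a td tag and that losing inline markup (links, bold and so on) inside a cell is acceptable; the helper name cell_texts is made up for illustration and is not from any library.

```ruby
# Hypothetical helper: pull the text of each table cell out of an HTML fragment.
# Assumes cells are delimited by <td ...> tags, whether or not they are closed.
def cell_texts(html)
  # Cut the fragment at every opening or closing td tag.
  pieces = html.split(/<\/?td\b[^>]*>/i)

  # Drop any other tags left inside a cell, then collapse newlines and
  # runs of spaces into single spaces.
  cleaned = pieces.map { |cell| cell.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip }

  # Splitting leaves empty strings before the first tag and between
  # adjacent tags; discard them.
  cleaned.reject(&:empty?)
end

html = <<~HTML
  <td valign="top">message <td valign="top">the message
  to echo. <td valign="top" align="center">Yes, unless data is
  included in a character section within this element.
HTML

p cell_texts(html)
```

Splitting only on td tags, rather than on every tag, keeps a cell in one piece when it contains a link or other inline element. It is still a regexp over markup, so oddly formed attributes (for example a literal > inside a quoted value) would break it.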
That is indeed what the problem domain is (did the <td> give it away?).
Basically I have a whole lot of html files and I need to re-write them
as xml (sort of docbook-ish, but not quite). I’m using builder
(fantastic bit of kit by the way), but my original files sometimes
contain things like
"<td valign=\"top\">append</td>
<td valign=\"top\">Append to an existing file (or
<a
href="JDK 20 Documentation - Home(java.lang.String,
boolean)" target="_blank">
open a new file / overwrite an existing file)?
<td valign="top" align="center">No - default is false."
And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My thoughts were to try a
simple conversion and then fix things manually afterwards (i.e. get 95% of
the conversion done through a script and then apply some elbow grease to
finish off the parts that take too much time to work out).
I’m now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I’d love to have an automated solution (there are 1000+
html docs I need to convert).
I must admit to beginning to loathe HTML's lack of structural information.
If this were a docbook file I’d have very few problems converting it (I
could choose from many options), but HTML is so limited in its ability to
express what a section means [sigh]
Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Q.
Kev
A quick scan says that you’ve got legit XML there, so why not use REXML?
It’s included in the Ruby standard libs. Or use any of the HTML/XML
parsing libraries mentioned above, with XPath to pluck your values out.
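A rough sketch of the REXML route follows. It assumes the markup really is well-formed XML: every td is closed and the cells sit inside a single root element (a tr here, added for illustration). REXML is a strict XML parser, so fragments with unclosed td tags would likely need tidying before it accepts them. The helper name td_texts is hypothetical, and in recent Ruby versions REXML ships as a bundled gem rather than as part of the core standard library.

```ruby
require "rexml/document"

# Hypothetical helper: return the whitespace-normalised text of every td element.
# Assumes the input is well-formed XML with a single root element.
def td_texts(xml)
  doc = REXML::Document.new(xml)

  REXML::XPath.match(doc, "//td").map do |td|
    # Element#text returns only the first text node of the element (or nil),
    # so text after a nested element such as <a> is not included here.
    td.text.to_s.gsub(/\s+/, " ").strip
  end
end

xml = <<~XML
  <tr>
    <td valign="top">message</td>
    <td valign="top">the message
    to echo.</td>
    <td valign="top" align="center">Yes, unless data is
    included in a character section within this element.</td>
  </tr>
XML

p td_texts(xml)
```

For cells that contain nested elements, collecting all descendant text nodes (rather than just the first) would be needed; that is left out to keep the sketch short.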