Gene T. wrote:
there’s also the htmltokenizer.getText() method (which I just now
discovered by googling), which allows you to extract the text from
before a tag, 1 tag at a time
That is indeed what the problem domain is (did the <td> give it away?).
Basically I have a whole lot of html files and I need to re-write them
as xml (sort of docbook-ish, but not quite). I’m using builder
(fantastic bit of kit by the way), but my original files sometimes
contain things like
<td valign="top">Append to an existing file (or
open a new file / overwrite an existing file)?
<td valign="top" align="center">No - default is false.
And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My thoughts were to try a
simple conversion and then fix things manually afterwards (i.e. get 95% of
the conversion done through a script and then apply some elbow grease to
finish off the parts that take too much time to work out).
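
For what it’s worth, a rough sketch of that kind of first pass might look like the following. It uses plain regexps rather than htmltokenizer, and it assumes each cell runs until the next opening <td>, a closing </td>, </tr> or </table>, or the end of the input. The name td_texts is made up for illustration, and entities and nested tables are not handled, so treat it as a starting point for the "95%" rather than a finished converter.

```ruby
# Hypothetical helper (not part of any library): returns the text of each
# <td> cell, even when the closing </td> is missing.
def td_texts(html)
  # Split on opening <td ...> tags. The first piece is whatever precedes
  # the first cell, so it is dropped.
  html.split(/<td\b[^>]*>/i).drop(1).map do |chunk|
    chunk
      .sub(/<\/(?:td|tr|table)\s*>.*\z/mi, "") # cut at the first closing tag, if any
      .gsub(/<[^>]+>/, "")                      # drop any remaining inline tags
      .gsub(/\s+/, " ")                         # collapse line breaks and runs of spaces
      .strip
  end
end

sample = <<~HTML
  <td valign="top">Append to an existing file (or
  open a new file / overwrite an existing file)?
  <td valign="top" align="center">No - default is false.
HTML

cells = td_texts(sample)
cells.each { |text| puts text }
```

Each string in cells could then be handed to builder to produce the XML elements, leaving only the odd cases to fix by hand.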
I’m now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I’d love to have an automated solution (there are 1000+
html docs I need to convert).
I must admit to beginning to loathe HTML’s lack of structural information
- if this was a docbook file I’d have very few problems converting it (I
could choose many options), but html is so limited in its ability to
express what meaning some section has [sigh]
Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Q.