Finding a sentence (more than one word & punctuation: , . ;)

Kev_J · January 11, 2006, 8:09am

given this string

" <td valign="top">message</td> <td valign="top">the message
to echo.</td> <td valign="top" align="center">Yes, unless data is
included in a character section within this element.</td> "

how can I get this result?

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

I’ve tried scan + regexp, but the best I’ve got so far is

[["message"]]

with this

r.scan(/">(\w+\s*)<\/td>/)

Thanks
Kev

Kev_J · January 11, 2006, 11:01am

If you really want sentences, this will work:

s.scan(/\w+(?:[\s,]+\w+)*[.;?!]/)
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]

s.scan(/\w+(?:,?\s+\w+)*[.;?!]/)
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]
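
To get rid of the embedded newlines, the matches can be whitespace-normalized afterwards. A minimal sketch, assuming the input has closing </td> tags as below (the sample string and the variable names are illustrative, not the original data). Note that the lone word "message" is not returned, because it has no terminating punctuation:

```ruby
# Sample input; the closing </td> tags are an assumption.
s = <<~HTML
  <td valign="top">message</td> <td valign="top">the message
  to echo.</td> <td valign="top" align="center">Yes, unless data is
  included in a character section within this element.</td>
HTML

# Words separated by whitespace and/or commas, ending in . ; ? or !
sentence = /\w+(?:[\s,]+\w+)*[.;?!]/

# split.join(" ") collapses every run of whitespace (including
# newlines) into a single space.
sentences = s.scan(sentence).map { |m| m.split.join(" ") }

p sentences
```

If the single-word cells are needed too, it is probably easier to extract by cell rather than by sentence.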

Kind regards

robert

Kev_J · January 11, 2006, 12:05pm

Several simple approaches have been proposed in this thread that may
work for what you want. If you need something more robust, you could
take a look at existing Perl modules that solve this problem, such as
Lingua::EN::Sentence.

– fxn

Kev_J · January 11, 2006, 5:45pm

If this is an HTML table extraction thing, Rubyful Soup is the easiest
way to do it:
http://www.crummy.com/software/RubyfulSoup/documentation.html

There’s also the htmltokenizer getText() method (which I just now
discovered by googling), which lets you extract the text before one tag
at a time:
http://htmltokenizer.rubyforge.org/doc/

Kev_J · January 12, 2006, 2:24am

Gene T. wrote:

there’s also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

That is indeed what the problem domain is (did the <td> give it away?).

Basically I have a whole lot of html files and I need to re-write them
as xml (sort of docbook-ish, but not quite). I’m using builder
(fantastic bit of kit by the way), but my original files sometimes
contain things like

"<td valign=\"top\">append</td>
<td valign=\"top\">Append to an existing file (or
  <a

href="JDK 20 Documentation - Home(java.lang.String,
boolean)" target="_blank">
open a new file / overwrite an existing file)?

<td valign="top" align="center">No - default is false."

And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My plan was to try a simple
conversion and then fix things manually afterwards (i.e. get 95% of
the conversion done through a script and then apply some elbow grease to
finish off the parts that take too much time to work out).
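
The "nothing or the whole table" symptom is what a greedy `.*` tends to produce: with the multiline flag it runs from the first cell to the last closing tag. A minimal sketch of the non-greedy alternative, assuming every cell really does have a closing </td> (the sample markup is a cleaned-up guess, and href="#" is a placeholder, not the real link target):

```ruby
# Illustrative input; the tag layout and href="#" are assumptions.
html = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
    <a href="#" target="_blank">open a new file / overwrite an existing file</a>)?</td>
  <td valign="top" align="center">No - default is false.</td>
HTML

# Greedy: .* with /m swallows everything up to the LAST </td>.
greedy = html[/<td[^>]*>(.*)<\/td>/m, 1]

# Non-greedy: .*? stops at the first </td> after each opening tag.
cells = html.scan(/<td[^>]*>(.*?)<\/td>/m).flatten.map do |cell|
  # Drop nested tags such as <a ...>, then collapse whitespace.
  cell.gsub(/<[^>]+>/, "").split.join(" ")
end

p cells
```

This will still miss cells whose closing tag is absent, so it only covers the "95%" part of the job.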

I’m now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I’d love to have an automated solution (there are 1000+
html docs I need to convert).

I must admit to beginning to loathe HTML’s lack of structural
information. If this were a DocBook file I’d have very few problems
converting it (I could choose from many options), but HTML is so limited
in its ability to express what meaning a section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Q.
Kev

Kev_J · January 12, 2006, 6:50pm

A quick scan suggests that you’ve got legit XML there, so why not use
REXML? It’s included in the Ruby standard libs. Or use any of the above
HTML/XML parsing libraries with XPath to pluck your values out.
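
Something along these lines might do it. This is only a sketch: it assumes the input is well-formed XML (every tag closed, attributes quoted), and the <tr> wrapper and sample text are made up for illustration:

```ruby
require "rexml/document"

# Illustrative, well-formed input; the <tr> wrapper is an assumption.
xml = <<~XML
  <tr>
    <td valign="top">message</td>
    <td valign="top">the message
    to echo.</td>
    <td valign="top" align="center">Yes, unless data is
    included in a character section within this element.</td>
  </tr>
XML

doc = REXML::Document.new(xml)

# Select every <td> with XPath. Element#text returns only the first
# text node of the element, which is enough for cells without nested
# tags; split.join(" ") collapses the line breaks.
cells = REXML::XPath.match(doc, "//td").map do |td|
  td.text.to_s.split.join(" ")
end

p cells
```

If the files are not well-formed, REXML will raise a parse error rather than guess, so this only helps for the clean ones.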

REXML Docs:
http://ruby-doc.org/stdlib/

REXML Homepage:
http://www.germane-software.com/software/rexml

.adam