Finding a sentence (more than one word & punctuation (, . ;))


#1

given this string

" <td valign="top">message <td valign="top">the message
to echo. <td valign="top" align="center">Yes, unless data is
included in a character section within this element. "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I’ve tried scan + regexp, but the best I’ve got so far is

[["message"]]

with this

r.scan(/">(\w+\s*)<\/td>/)

Thanks
Kev


#2


If you really want sentences, this will work:

s.scan /\w+(?:[\s,]+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]

s.scan /\w+(?:,?\s+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]
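
If the single-word cell ("message") is wanted as well, one possible sketch is to split on the tags instead of matching sentences. This is not one of the regexps proposed in the thread. The variable name s follows the snippets above, and the tag pattern /<[^>]*>/ is a simplifying assumption that no attribute value contains a ">".

```ruby
# The sample string from the question (line breaks kept as in the post).
s = " <td valign=\"top\">message <td valign=\"top\">the message\n" \
    "to echo. <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element. "

# Split on anything that looks like a tag, then tidy each piece:
# collapse runs of whitespace (including newlines) and drop empty pieces.
cells = s.split(/<[^>]*>/)
         .map { |text| text.gsub(/\s+/, " ").strip }
         .reject(&:empty?)

p cells
```

Unlike the sentence patterns, this keeps cells that have no closing punctuation, but it would also keep any stray text between tags.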

Kind regards

robert

#3

On Jan 11, 2006, at 8:08, Kev J. wrote:

in a character section within this element."]

There have been several simple approaches proposed in this thread
that may work for what you want. If you need something more robust,
you could have a look at existing Perl modules that solve this
problem, such as Lingua::EN::Sentence.
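
To illustrate why a dedicated module can be worth it, here is a small Ruby sketch of one of the simple patterns from this thread tripping over an abbreviation. The sample text is made up for illustration; it is not from the original HTML.

```ruby
# One of the sentence patterns suggested earlier in the thread.
SENTENCE = /\w+(?:[\s,]+\w+)*[.;?!]/

# Hypothetical input containing an abbreviation that ends in a period.
text = "Dr. Smith arrived."

# The pattern treats every period as a sentence end, so "Dr." is
# reported as a sentence of its own.
sentences = text.scan(SENTENCE)

p sentences
```

Handling cases like this is where lists of known abbreviations and similar heuristics come in.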

– fxn


#4


If this is an HTML table extraction thing, Rubyful Soup is the easiest
way to do it:
http://www.crummy.com/software/RubyfulSoup/documentation.html

There’s also the htmltokenizer getText() method (which I just now
discovered by googling), which lets you extract the text before one tag
at a time:
http://htmltokenizer.rubyforge.org/doc/


#5

Gene T. wrote:

there’s also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

That is indeed what the problem domain is (did the <td> give it away?).

Basically I have a whole lot of HTML files and I need to rewrite them
as XML (sort of DocBook-ish, but not quite). I’m using Builder
(fantastic bit of kit, by the way), but my original files sometimes
contain things like

"<td valign=\"top\">append</td>
<td valign=\"top\">Append to an existing file (or
  <a href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String, boolean)\" target=\"_blank\">
open a new file / overwrite an existing file)?

<td valign=\"top\" align=\"center\">No - default is false."

And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My thoughts were to try a
simple conversion and then fix things manually afterwards (i.e. get 95%
of the conversion done through a script and then apply some elbow grease
to finish off the parts that take too much time to work out).
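
For the "95% by script" route, one rough sketch that copes with a link nested inside a cell is to cut the text at each opening td tag and then strip whatever tags remain in each chunk. It uses only core Ruby. The closing tags in the sample are an assumption (the split works with or without them), and the pattern again assumes no attribute value contains a ">".

```ruby
# Sample based on the snippet above; the closing </a> and </td> tags
# are assumed here.
html = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
    <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String, boolean)" target="_blank">
  open a new file / overwrite an existing file)?</a></td>
  <td valign="top" align="center">No - default is false.</td>
HTML

# 1. Split at every opening <td ...> tag; drop whatever precedes the first.
# 2. Replace any remaining tag (</td>, <a ...>, </a>) with a space.
# 3. Collapse whitespace so wrapped lines become a single line.
cells = html.split(/<td\b[^>]*>/i).drop(1).map do |chunk|
  chunk.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

p cells
```

This throws the link target away, so anything that needs the href would still have to be handled by hand or by a real parser.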

I’m now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I’d love to have an automated solution (there are 1000+
HTML docs I need to convert).

I must admit to beginning to loathe HTML’s lack of structural
information - if this were a DocBook file I’d have very few problems
converting it (I could choose many options), but HTML is so limited in
its ability to express what meaning some section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Quiz. :)
Kev


#6

A quick scan says that you’ve got legit XML there, so why not use REXML?
It’s included in the Ruby standard library. Or use any of the HTML/XML
parsing libraries above, with XPath to pluck your values out.
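
A minimal sketch of the REXML route, assuming the input really is well-formed (every td closed, and a single root element, here a hypothetical tr wrapper). On Ruby 3.0 and later REXML ships as a bundled gem rather than as part of the default library.

```ruby
require "rexml/document"

# Well-formed sample; the <tr> wrapper and closing tags are assumptions.
xml = <<~XML
  <tr>
    <td valign="top">message</td>
    <td valign="top">the message
  to echo.</td>
    <td valign="top" align="center">Yes, unless data is
  included in a character section within this element.</td>
  </tr>
XML

doc = REXML::Document.new(xml)

# Select every td element with XPath and normalise its text content.
cells = REXML::XPath.match(doc, "//td").map do |td|
  td.text.gsub(/\s+/, " ").strip
end

p cells
```

Note that Element#text returns only the first text node, so a cell containing a nested link would need its child nodes walked as well.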

REXML Docs:
http://ruby-doc.org/stdlib/

REXML Homepage:
http://www.germane-software.com/software/rexml

.adam