Forum: Ruby Finding a sentence (more than one word & punctuation (, . ;)

Kev Jackson (Guest)
on 2006-01-11 08:09
(Received via mailing list)
given this string

"    <td valign=\"top\">message</td>    <td valign=\"top\">the message
to echo.</td>    <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td>  </tr>  "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev
Robert Klemme (Guest)
on 2006-01-11 11:01
(Received via mailing list)
Kev Jackson wrote:
>
> Thanks
> Kev

If you really want sentences, this will work:

>> s.scan /\w+(?:[\s,]+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]
>> s.scan /\w+(?:,?\s+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character section within this element."]
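
Note that both patterns above require closing punctuation, so the
one-word cell "message" is skipped, and the embedded newlines are kept.
If the goal is the contents of each table cell rather than sentences as
such, a minimal sketch (assuming the cells contain no nested tags and
that the string has line breaks where the output above shows \n) could
capture whatever sits between <td ...> and </td> and then normalize the
whitespace:

```ruby
# The string from the original question; the "\n" positions are an
# assumption based on the irb output shown above.
s = "    <td valign=\"top\">message</td>    <td valign=\"top\">the message\n" \
    "to echo.</td>    <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td>  </tr>  "

# <td[^>]*>  matches the opening tag with any attributes
# (.*?)      captures the cell contents non-greedily
# /m         lets . match newlines, so multi-line cells are captured too
cells = s.scan(/<td[^>]*>(.*?)<\/td>/m).flatten.map do |text|
  text.gsub(/\s+/, ' ').strip # collapse line breaks and runs of spaces
end

p cells
# => ["message", "the message to echo.", "Yes, unless data is included in a character section within this element."]
```

The original attempt returned only [["message"]] because \w+\s* matches a
single word, so any cell with a second word or punctuation fails to
match.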

Kind regards

    robert
Xavier Noria (Guest)
on 2006-01-11 12:05
(Received via mailing list)
On Jan 11, 2006, at 8:08, Kev Jackson wrote:

> in a character section within this element."]
Several simple approaches have been proposed in this thread that may
work for what you want. If you need something more robust, you could
have a look at existing Perl modules that solve this problem, such as
Lingua::EN::Sentence.

-- fxn
Gene Tani (Guest)
on 2006-01-11 17:45
(Received via mailing list)
Kev Jackson wrote:
>
> Thanks
> Kev

If this is an HTML table extraction thing, Rubyful Soup is the easiest
way to do it:
http://www.crummy.com/software/RubyfulSoup/documen...

There's also the htmltokenizer getText() method (which I just now
discovered by googling), which lets you extract the text before one tag
at a time:
http://htmltokenizer.rubyforge.org/doc/
Kev Jackson (Guest)
on 2006-01-12 02:24
(Received via mailing list)
Gene Tani wrote:

>there's also the htmltokenizer.getText() method, (which i just now
>discovered by googling) which allows you to extract from before 1 tag
>at a time
>http://htmltokenizer.rubyforge.org/doc/
>
That is indeed what the problem domain is (did the <td> give it away?).

Basically I have a whole lot of HTML files and I need to rewrite them
as XML (sort of DocBook-ish, but not quite).  I'm using Builder
(fantastic bit of kit, by the way), but my original files sometimes
contain things like

    "<td valign=\"top\">append</td>
    <td valign=\"top\">Append to an existing file (or
      <a
href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/Fi...,
boolean)\" target=\"_blank\">
      open a new file / overwrite an existing file</a>)?
    </td>
    <td valign=\"top\" align=\"center\">No - default is false.</td>"

And anything I try basically means that I end up with either nothing
extracted or the whole table extracted!  My thoughts were to try a
simple conversion and then fix things manually afterwards (i.e. get 95%
of the conversion done through a script and then apply some elbow grease
to finish off the parts that take too much time to work out).
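
As a rough illustration of that "script first, fix by hand later"
idea, here is a sketch that pulls the text out of each cell, strips any
nested tags, and flags the cells that contained markup so they can be
checked manually. It is only a sketch under assumptions: the sample is
the fragment above with the elided href replaced by a placeholder URL,
the hash keys :text and :needs_review are made up for this example, and
the regexps assume no ">" appears inside attribute values and that <td>
elements are not nested.

```ruby
# Fragment from above; the href is a placeholder (hypothetical), not the
# original link, which is truncated in the post.
html = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
    <a href="http://example.invalid/placeholder" target="_blank">
    open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
HTML

TAG = /<[^>]+>/ # any tag; assumes no ">" inside attribute values

rows = html.scan(/<td[^>]*>(.*?)<\/td>/m).flatten.map do |inner|
  {
    # drop nested tags, then collapse whitespace
    text: inner.gsub(TAG, '').gsub(/\s+/, ' ').strip,
    # true when the cell had nested markup (e.g. a link) worth a manual look
    needs_review: inner.match?(TAG)
  }
end

rows.each { |row| puts "#{row[:needs_review] ? '[check]' : '       '} #{row[:text]}" }
```

Cells flagged with [check] are the ones where information such as the
link target was thrown away, so those are the candidates for the elbow
grease.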

I'm now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I'd love to have an automated solution (there are 1000+
HTML docs I need to convert).

I must admit to beginning to loathe HTML's lack of structural information
- if this was a DocBook file I'd have very few problems converting it (I
could choose many options), but HTML is so limited in its ability to
express what meaning some section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Quiz :)
Kev
Adam Sanderson (Guest)
on 2006-01-12 18:50
(Received via mailing list)
A quick scan says that you've got legit XML there, so why not use REXML?
It's included in the Ruby standard library. Or use any of the HTML/XML
parsing libraries mentioned above with XPath to pluck your values out.
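
A minimal sketch of the REXML route, under a few assumptions: the
fragment is wrapped in a <tr> so it has a single root element, the
elided href is replaced by a placeholder URL, and the input really is
well-formed (REXML raises a parse error otherwise, which real-world HTML
often triggers). The helper name all_text is made up for this example;
in recent Ruby versions REXML ships as the bundled rexml gem.

```ruby
require 'rexml/document'

# Well-formed sample based on the fragment posted earlier; the href is a
# placeholder (hypothetical), not the original link.
xml = <<~XML
  <tr>
    <td valign="top">append</td>
    <td valign="top">Append to an existing file (or
      <a href="http://example.invalid/placeholder" target="_blank">
      open a new file / overwrite an existing file</a>)?
    </td>
    <td valign="top" align="center">No - default is false.</td>
  </tr>
XML

# Recursively collect the text of a node, including text inside nested
# elements such as <a>. (Element#text alone returns only the first text node.)
def all_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then all_text(child)
    else ''
    end
  end.join
end

doc = REXML::Document.new(xml)

# XPath '//td' selects every <td> element in document order.
cells = REXML::XPath.match(doc, '//td').map do |td|
  all_text(td).gsub(/\s+/, ' ').strip
end

p cells
```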

REXML Docs:
http://ruby-doc.org/stdlib/

REXML Homepage:
http://www.germane-software.com/software/rexml

  .adam