Extract keywords from string

hi -

i have strings that i need to extract keywords from. the string might
have html tags, urls, etc. i need to extract the keywords from the
string. i imagine i’m not the first guy to have to tackle this
problem. is there a gem i can use or anyone have any ideas how to
approach this?

thanks,
dino

If you are doing html parsing, you’ll want to look into hpricot.

There are a few parsers out there, I’ve written a couple myself.

More detail is needed about the keywords. The simple case is keywords
regardless of context, separated by whitespace.

KEYWORDS = %w{if else then end case when do def}

str = "if true then false else true end"
str.split.find_all{|s| KEYWORDS.include?(s)}

irb(main):006:0> KEYWORDS = %w{if else then end case when do def}
=> ["if", "else", "then", "end", "case", "when", "do", "def"]
irb(main):007:0> str = "if true then false else true end"
=> "if true then false else true end"
irb(main):008:0> str.split.find_all{|s| KEYWORDS.include?(s)}
=> ["if", "then", "else", "end"]
irb(main):009:0>

If you need to exclude keywords inside strings, URLs, etc., the solution
is more complex.

HTH,
Jeffrey
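
For the more complex case mentioned above, one rough approach is to blank out the parts you want to ignore before splitting. This is only a sketch under a few assumptions: the method name keywords_outside_literals is made up for illustration, "strings" means double-quoted runs only, and a URL is anything starting with http:// or https:// up to the next whitespace. Real input will need stricter rules.

```ruby
require 'set'

# Same keyword list as in the simple case, stored in a Set for exact lookups.
KEYWORDS = Set.new(%w[if else then end case when do def])

# Hypothetical helper (not from any gem): removes double-quoted strings
# and URLs, then keeps the whitespace-separated tokens that are keywords.
def keywords_outside_literals(str)
  # Quoted strings go first so a URL inside quotes cannot swallow the closing quote.
  without_strings = str.gsub(/"[^"]*"/, ' ')
  without_urls = without_strings.gsub(%r{https?://\S+}, ' ')
  without_urls.split.select { |token| KEYWORDS.include?(token) }
end

p keywords_outside_literals('if x then puts "do not end" else see http://example.com/def end')
# => ["if", "then", "else", "end"]
```

The "do" and "end" inside the quoted string and the "def" at the end of the URL are skipped because they are removed before the split.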

On Aug 23, 2009, at 7:44 PM, Alpha B. wrote:

If you are doing html parsing, you’ll want to look into hpricot.

http://juixe.com/techknow/index.php/2008/05/19/using-hpricot/

There are a few parsers out there, I’ve written a couple myself.

Many people are leaning toward Nokogiri (read:
http://nokogiri.rubyforge.org/nokogiri/Nokogiri.html).

Jeff-

thanks for the reply. i can deal with context in a different method.
with your solution, though, i still grab “” and “test.” and “&wow*&&” as
keywords. i want to send this method a string, and get an array of
letter-only words returned. if you have context ideas, i’d love to
hear those too, but the first step is just harvesting letter-only
words from strings.

thanks,
dino
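
A minimal sketch of the "string in, array of letter-only words out" step described above, using only core Ruby. The method name letter_words is hypothetical, the tag and URL patterns are deliberately rough (a real HTML parser such as the ones mentioned in this thread will be more reliable for messy markup), and "letter" is assumed to mean ASCII a-z and A-Z.

```ruby
# Hypothetical helper: returns the tokens of a string that consist of
# ASCII letters only, after roughly stripping HTML tags and URLs.
def letter_words(str)
  # Tags are removed first so a URL inside an attribute goes away with its tag.
  without_tags = str.gsub(/<[^>]*>/, ' ')
  without_urls = without_tags.gsub(%r{https?://\S+}, ' ')
  # Keep only whitespace-separated tokens made entirely of letters.
  without_urls.split.grep(/\A[a-zA-Z]+\z/)
end

p letter_words('<b>this</b> is a test. see http://example.com &wow*&& ok')
# => ["this", "is", "a", "see", "ok"]
```

This version drops "test." and "&wow*&&" entirely. If you would rather keep the letters inside such tokens ("test", "wow"), replacing the last line with without_urls.scan(/[a-zA-Z]+/) pulls out every run of letters instead.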

On Aug 24, 12:48 am, “s.ross” wrote:

Many people are leaning toward Nokogiri (read:
http://nokogiri.rubyforge.org/nokogiri/Nokogiri.html).

Agreed. With the disappearance of _why, the future of hpricot is
uncertain.