Newbie: how to find & extract a string from a file

Esmail_B · September 30, 2006, 12:46am

Hi,

Just starting out to explore Ruby (I like it) and I have
a question.

I have an HTML file that contains several references to jpg files.

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

It appears each .jpg reference is on its own line.

Thanks!

Esmail

Esmail_B · September 30, 2006, 2:12am

I’m sure someone will have a better way of doing this… but…
Assuming it has

imgs = []
# Collect the src value from every line that contains an <img src= tag.
IO.readlines("c:/somefile.html").each do |line|
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.match('<img src=')
end
puts imgs.join("\n")

Esmail_B · September 30, 2006, 2:50am

Esmail B. wrote:

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

Hi there,

You could use raw regexps and do it yourself, but you should probably
use an HTML parser to extract HTML data.
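
For comparison, the raw-regexp route mentioned above might look roughly like the sketch below. The sample HTML, the method name extract_jpgs, and the pattern itself are illustrative assumptions, not something from the original posts, and a regexp like this will miss unquoted attributes or unusual markup, which is one reason a parser is usually the safer choice.

```ruby
# Hypothetical sample input; in practice this would come from File.read(path).
html = <<~HTML
  <p>Some text</p>
  <img src="images/map1.jpg">
  <a href="photos/ireland.JPEG">Ireland</a>
  <a href="notes.txt">Notes</a>
HTML

# Hypothetical helper: returns the quoted src/href values ending in .jpg or .jpeg.
def extract_jpgs(html)
  # Capture the attribute value; the /i flag makes the match case-insensitive.
  pattern = /(?:src|href)\s*=\s*["']([^"']+\.jpe?g)["']/i
  # scan returns one array of captures per match; flatten to plain strings.
  html.scan(pattern).flatten
end

paths = extract_jpgs(html)
puts paths
# File.basename strips any directory part, leaving only the file name.
puts paths.map { |path| File.basename(path) }
```

Because scan works on the whole string, this does not depend on each reference being on its own line.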

A nice HTML parser is Hpricot [1], but it requires an extension (you
can get it very easily via gems, see the link below). It is very easy
to use, and it’s fast.

Using Hpricot, you can do something like this:

require 'hpricot'
require 'open-uri'
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Hpricot(soc)
soc.close
doc.search('//a').each { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}

Note that you can also use the built-in REXML parser [2], and do
something like:

require 'rexml/document'
require 'open-uri'
include REXML
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Document.new(soc)
soc.close
doc.elements.each('//a') { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}
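
One caveat on the extension test used in both snippets: File.extname compares case-sensitively, so a link ending in .JPG, or one carrying a query string, would be skipped. A slightly more forgiving check could be sketched as follows; the helper name jpg_link? is hypothetical and not part of Hpricot or REXML.

```ruby
# Hypothetical helper: true if the link's path ends in .jpg or .jpeg, in any case.
def jpg_link?(href)
  return false if href.nil?
  # Drop any query string or fragment before looking at the extension.
  path = href.split(/[?#]/, 2).first.to_s
  ['.jpg', '.jpeg'].include?(File.extname(path).downcase)
end

puts jpg_link?('maps/ireland.JPG?size=large')
puts jpg_link?('notes.txt')
```

It could replace the two-line condition inside either loop, for example as: puts href if jpg_link?(href)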

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan

Esmail_B · September 30, 2006, 4:40am

x1 wrote:

I’m sure someone will have a better way of doing this… but…
Assuming it has

imgs = []
# Collect the src value from every line that contains an <img src= tag.
IO.readlines("c:/somefile.html").each do |line|
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.match('<img src=')
end
puts imgs.join("\n")

Hi,

thanks … this will get me started. I feel like I could do this
using various unix tools (grep/awk), but I’m trying to learn
Ruby …

Esmail

Esmail_B · September 30, 2006, 4:46am

Hi,

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

I appreciate you taking the time to post this and the references.
If you have any other ideas/approaches, I’m game.

Thanks again,

Esmail

Esmail_B · September 30, 2006, 7:21am

Esmail B. wrote:

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

Glad to help. And yes, REXML is a pure-Ruby parser (it uses regexps under
the hood) and has been included in the Ruby stdlib since 1.8.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

REXML is more portable, albeit not as fast as Hpricot, which is
implemented as a compiled C extension for Ruby.

Have fun learning Ruby! It’s a nice language.

Regards,
Jordan