Newbie: how to find & extract a string from a file

Hi,

Just starting out to explore Ruby (I like it) and I have
a question.

I have an HTML file that contains several references to jpg files.

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

It appears each .jpg reference is on its own line.

Thanks!

Esmail

I’m sure someone will have a better way of doing this… but…
Assuming each image reference looks like <img src="...">:

imgs = []
IO.readlines("c:/somefile.html").each do |line|
  # Take the text between <img src=" and the next double quote.
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.include?('<img src="')
end
puts imgs.join("\n")

Esmail B. wrote:

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

Hi there,

You could use raw regexps and do it yourself, but you should probably
use an HTML parser to extract HTML data. ;)
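
For comparison, the "raw regexps" route mentioned above might look like the sketch below. The helper name extract_jpgs, the pattern, and the sample markup are made up for illustration; it assumes every reference appears as a quoted attribute value ending in .jpg or .jpeg, and it will miss unquoted attributes.

```ruby
# Hypothetical pattern: a quoted value that ends in .jpg or .jpeg (any case).
JPG_PATTERN = /["']([^"'<>]+\.jpe?g)["']/i

# Hypothetical helper: returns every matching path found in the given HTML text.
def extract_jpgs(html)
  # scan returns one single-element array per match, so flatten to plain strings.
  html.scan(JPG_PATTERN).flatten
end

# Made-up sample markup standing in for the contents of the HTML file.
sample = <<~HTML
  <p><img src="images/cork.jpg" alt="Cork"></p>
  <p><a href="maps/dublin_1902.JPG">Dublin</a></p>
  <p><img src="logo.png"></p>
HTML

paths = extract_jpgs(sample)
puts paths                                  # full paths as written in the file
puts paths.map { |p| File.basename(p) }     # just the filenames
```

To run it against a real file, pass File.read("somefile.html") instead of the sample string.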

A nice HTML parser is Hpricot [1], but it requires an extension (you
can get it very easily via gems, see the link below). It is very easy
to use, and it’s fast.

Using Hpricot, you can do something like this:

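require 'hpricot'
require 'open-uri'
# On Ruby 2.5+ use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Hpricot(soc)
soc.close
# Print the target of every link that points at a JPEG file.
doc.search('//a').each { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}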

Note that you can also use the built-in REXML parser [2], and do
something like:

require 'rexml/document'
require 'open-uri'
include REXML
# On Ruby 2.5+ use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Document.new(soc)
soc.close
# Print the target of every link that points at a JPEG file.
doc.elements.each('//a') { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}
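
Since the original question is about image references rather than links, the same REXML idea can be pointed at img elements instead. This is only a sketch: the helper name jpg_sources and the sample markup are invented for illustration, and it assumes the page is well-formed XHTML. REXML is an XML parser, so typical real-world HTML with unclosed tags will usually make it raise a parse error.

```ruby
require 'rexml/document'

# Hypothetical helper: collects the src values of <img> elements
# whose file extension is .jpg or .jpeg (case-insensitive).
def jpg_sources(xhtml)
  doc = REXML::Document.new(xhtml)
  sources = []
  doc.elements.each('//img') do |img|
    src = img.attributes['src']
    next if src.nil?
    sources << src if ['.jpg', '.jpeg'].include?(File.extname(src).downcase)
  end
  sources
end

# Made-up, well-formed sample document (note the self-closing img tags).
sample = <<~XHTML
  <html>
    <body>
      <p><img src="images/cork.jpg" alt="Cork"/></p>
      <p><img src="logo.png" alt="Logo"/></p>
      <p><img src="maps/kerry.JPEG" alt="Kerry"/></p>
    </body>
  </html>
XHTML

puts jpg_sources(sample)
```

Downcasing the extension is a small design choice here so that files named .JPG or .JPEG are not skipped.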

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan

x1 wrote:

I’m sure someone will have a better way of doing this… but…
Assuming each image reference looks like <img src="...">:

imgs = []
IO.readlines("c:/somefile.html").each do |line|
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.include?('<img src="')
end
puts imgs.join("\n")

Hi,

thanks … this will get me started. I feel like I could do this
using various unix tools (grep/awk), but I’m trying to learn
Ruby …

Esmail

Hi,

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

I appreciate you taking the time to post this and the references.
If you have any other ideas/approaches, I’m game.

Thanks again,

Esmail

Esmail B. wrote:

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

Glad to help. And yes, REXML is a pure-Ruby parser (it uses regexps under
the hood) and has been included in the Ruby stdlib since 1.8.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

REXML is more portable, albeit not as fast as Hpricot, which is
implemented as a compiled C extension for Ruby.

Have fun learning Ruby! It’s a nice language. :)

Regards,
Jordan
