Find/Replace In Files Using Lookup Table

I have a directory full of HTML files. Some have anchor tags (<a>), some do not. I also have a
tab-delimited text file with, among other things, an ID, title, and filename.

What I need to do is create a script that will:

  1. Search all of the HTML files in a directory for anchor tags
  2. Strip out the file name from the href attribute
  3. Use the file name to look up the correlating ID in the lookup file
  4. Replace the contents of the href attribute with the ID

Being new to Ruby and command-line scripting, I’m not sure where to
begin looking for examples of how to do this. Any help is appreciated.

On Wednesday 28 May 2008 18:05:15 Eric I. wrote:

On May 28, 6:18 pm, Andrew P. wrote:

  1. To parse an HTML file you can use the hpricot gem. Alternatively,
    you could open the file and use regular expressions.

I’d suggest hpricot or REXML if the files are reasonably well-formed
and/or XML-ish, and regex if they’re not.
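
For the well-formed case, a minimal sketch of the REXML route might look
like the following. It assumes each file parses as XML with a single root
element (REXML raises a parse error otherwise) and that you already have
a hash mapping file names to IDs. The method name rewrite_hrefs_xml and
the lookup hash are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: replace the href of every <a> element whose target
# file name appears in the lookup hash ({ file name => ID }).
# Anchors without an href, or with an unknown file name, are left alone.
def rewrite_hrefs_xml(xml, lookup)
  doc = REXML::Document.new(xml)
  REXML::XPath.each(doc, '//a[@href]') do |anchor|
    # Strip any directory part so only the file name is looked up.
    file_name = File.basename(anchor.attributes['href'])
    id = lookup[file_name]
    anchor.attributes['href'] = id if id
  end
  doc.to_s
end
```

REXML re-serializes the whole document, so details such as attribute
quoting may come out differently from the input. It is worth diffing a
few files before overwriting anything.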

On May 28, 6:18 pm, Andrew P. wrote:

Being new to Ruby and command-line scripting, I’m not sure where to
begin looking for examples of how to do this. Any help is appreciated.

Obviously your goal is to get this processing done. But are you hoping to use
this to learn Ruby? If so, this is a nice-sized project that will
help you to learn the language. Here are some pointers to help you
figure out where to look or start with certain aspects of the project
(the numbers match up with your numbers above):

  1. To get a list of all of the HTML files in a given directory, you
    can use Dir.glob.

  2. To parse an HTML file you can use the hpricot gem. Alternatively,
    you could open the file and use regular expressions.

  3. To read your tab-delimited file at the start of the program,
    you can use the CSV class in the standard library or the fastercsv
    gem. You can put the data into a hash where the file name is the key
    and the ID is the value. Lookup becomes trivial then.

  4. How you do this depends on whether you’re using hpricot or regular
    expressions. If you’re using regular expressions,
    you might want to do a gsub! call with a block that would allow you to
    do your lookup and replacement.
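
To make those pointers concrete, here is a minimal sketch of the
regular-expression route: Dir.glob for the file list, CSV for the lookup
table, and gsub with a block for the replacement. It rests on assumptions
the original post doesn't confirm: that the lookup file has a header row
with columns named "id" and "filename", that the HTML files end in .html,
and that href values are quoted. The method names and the HREF pattern
are hypothetical, and a regex like this is fragile against unusual
markup.

```ruby
require 'csv'

# Build a { file name => ID } hash from the tab-delimited lookup file.
# Assumes a header row with "id" and "filename" columns (assumed names).
def load_lookup(path)
  lookup = {}
  CSV.foreach(path, col_sep: "\t", headers: true) do |row|
    lookup[row['filename']] = row['id']
  end
  lookup
end

# Captures: 1 = everything up to the opening quote, 2 = the quote
# character, 3 = the href value. Assumes quoted attribute values.
HREF = /(<a\b[^>]*?\bhref\s*=\s*)(["'])(.*?)\2/i

# Replace each href whose file name is in the lookup; leave others as-is.
def rewrite_hrefs(html, lookup)
  html.gsub(HREF) do
    m = Regexp.last_match
    id = lookup[File.basename(m[3])]
    id ? "#{m[1]}#{m[2]}#{id}#{m[2]}" : m[0]
  end
end

# Rewrite every .html file in dir in place, touching only changed files.
def process_directory(dir, lookup)
  Dir.glob(File.join(dir, '*.html')).each do |path|
    original = File.read(path)
    updated = rewrite_hrefs(original, lookup)
    File.write(path, updated) unless updated == original
  end
end
```

You would call it along the lines of
process_directory('site', load_lookup('lookup.txt')), where both paths are
placeholders. Since it overwrites files in place, try it on a copy of the
directory first.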

Some relevant information sources:

You should have one of the Ruby books to help you with basic syntax
and all that. They will also help you with regular expressions,
hashes, and file I/O.

For documentation on File (and IO), Dir, CSV, Regexp, and Hash, you can
use:

http://ruby-doc.org/core/

For hpricot:

http://code.whytheluckystiff.net/hpricot/

For fastercsv:

http://fastercsv.rubyforge.org/

I hope that’s helpful,

Eric

Thanks, Eric. These are excellent tips.