Parse Word/HTML Docs for database inserts

I am new to Ruby and have perused the forum but I will ask this question
as I couldn’t seem to answer my questions with other posts.

The documents have no structure except for a unique number that appears
first in the document. The rest of the data I am looking for is
preceded by keywords that can help me identify a country code, the
hour something was started or finished, and maybe a subject here and
there. The HTML docs are just snippets from Internet news pages,
pictures and all; from those I need the title and dates extracted.

What I need to do is also extract the mimetype, file name and
last_update_date of the document. Can I do this with Ruby? I know Ruby
has several gems that can help but which one would be the best for
something like this?

Most of the postings I have read deal with semi-structured data, such as
data that is preceded by a column name, but these files are
completely unstructured.

Also I don’t want to be entering filenames one by one. I have about 6000
documents to parse. Is there a way to handle something like that with a
script?

Any direction would be greatly appreciated. I have never written Ruby
code, so I am looking for a good tutorial on parsing or an example app
that may handle something like this.

I’m not able to help with the parsing, but if you want to check all
files in a folder you can use this:

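# Collect the names of all entries in dir (not recursive).
$all_files = []
Dir.chdir dir do
  # Dir["*"] globs relative to the current directory, so the names have no path prefix.
  $all_files += Dir["*"]
end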

where dir is the path of the directory the files are in. That will get
you an array with all the filenames, which you can then iterate through.
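
For the iteration step, here is a minimal sketch that also collects the three pieces of metadata asked about (file name, mimetype, last update date). It assumes the mimetype can be guessed from the file extension; the MIME_BY_EXT table and the file_metadata method name are made up for illustration, not part of any library.

```ruby
# Hypothetical extension-to-MIME table; extend it for the file types you actually have.
MIME_BY_EXT = {
  ".doc"  => "application/msword",
  ".htm"  => "text/html",
  ".html" => "text/html"
}.freeze

# Return one hash per regular file directly inside dir (subdirectories are skipped).
def file_metadata(dir)
  Dir.chdir(dir) do
    Dir["*"].select { |name| File.file?(name) }.sort.map do |name|
      {
        file_name: name,
        # Unknown extensions fall back to a generic binary type.
        mimetype: MIME_BY_EXT.fetch(File.extname(name).downcase, "application/octet-stream"),
        # File.mtime returns the last modification time as a Time object.
        last_update_date: File.mtime(name)
      }
    end
  end
end
```

Something like file_metadata("docs").each { |row| puts row.inspect } would then print one row per document, ready to be turned into a database insert.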

Dylan wrote:

I’m not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

Might not Find be more useful overall?

http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
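
A rough sketch of what that could look like with Find from the standard library, in case the 6000 documents are spread over nested folders. The method name all_files_under and the choice to skip hidden directories are assumptions for illustration only.

```ruby
require "find"

# Collect every non-directory entry under root, descending into subdirectories.
def all_files_under(root)
  files = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Skip hidden directories such as .git (an assumption about what you want).
      Find.prune if path != root && File.basename(path).start_with?(".")
    else
      files << path
    end
  end
  files.sort
end
```

Unlike Dir["*"], this returns paths that include root, so each one can be passed straight to File.read or File.mtime.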


James B.


Margaret Smith wrote:

I am new to Ruby and have perused the forum but I will ask this question
as I couldn’t seem to answer my questions with other posts.

Hi Smith,

It's very tough to answer your question. I like the Hpricot gem very
much, but I am not saying it is the best; it depends on what you need.
Please also try rfeedparser:

http://rfeedparser.rubyforge.org/
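
To give a feel for the extraction itself, here is a small sketch that uses only regular expressions from core Ruby rather than Hpricot or rfeedparser (a real HTML parser will cope better with messy markup). The keywords "Country:", "Started:" and "Subject:" are hypothetical, since the actual keywords in the documents are not given, and the method names are made up for illustration.

```ruby
# Pull the <title> text out of an HTML snippet; returns nil when there is none.
def html_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Hypothetical keywords; replace them with the ones that appear in your documents.
FIELD_PATTERNS = {
  country: /Country:\s*([A-Z]{2})\b/,
  started: /Started:\s*(\d{1,2}:\d{2})/,
  subject: /Subject:\s*(.+)/
}.freeze

# Extract the leading unique number plus any keyword-prefixed fields.
# Fields whose keyword is not found come back as nil.
def extract_fields(text)
  fields = { id: text[/\A\s*(\d+)/, 1] }
  FIELD_PATTERNS.each do |name, pattern|
    value = text[pattern, 1]
    fields[name] = value && value.strip
  end
  fields
end
```

Each document's text would be passed through extract_fields (or html_title for the HTML snippets), and the resulting hash maps naturally onto the columns of a database row.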

Thanks,
P.Raveendran
