Parse Word/HTML Docs for database inserts

msmith362 · July 16, 2009, 1:23am

I am new to Ruby and have perused the forum, but I will ask this question
as I couldn’t find an answer in other posts.

The documents have no structure except for a unique number that appears
first in the document. The rest of the data I am looking for is
preceded by keywords that can help me identify a country code, the
hour something was started or finished, and maybe a subject here and
there. The HTML docs are just snippets from news pages on the
Internet, pictures and all; from those I need the title and dates
extracted.

I also need to extract the mimetype, file name and
last_update_date of each document. Can I do this with Ruby? I know Ruby
has several gems that can help, but which one would be the best for
something like this?

Most of the postings I have read deal with semi-structured data, for
example data that is preceded by a column name, but these files are
completely unstructured.

Also, I don’t want to be entering filenames one by one. I have about 6000
documents to parse. Is there a way to handle something like that with a
script?

Any direction would be greatly appreciated. I have never written Ruby
code, so I am looking for a good tutorial on parsing or an example app
that handles something like this.

msmith362 · July 16, 2009, 1:36am

I’m not able to help with the parsing, but if you want to check all
files in a folder you can use this:

# dir is the path of the folder that holds the documents
$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

where dir is the directory the files are in. That will get you an
array with all the filenames. Then you can just iterate through them:
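The iteration step could look something like the sketch below. It is only an illustration: the helper name file_info and the MIME_BY_EXT table are made up for this example, and guessing the mimetype from the file extension is an assumption, not something the thread settles on. File.mtime gives the last modification time, which is probably the closest thing to last_update_date that the file system offers.

```ruby
# Sketch: walk every file in one folder and collect basic metadata.
# MIME_BY_EXT and file_info are hypothetical names, not from the thread.
# Assumption: the mimetype can be guessed from the file extension.
MIME_BY_EXT = {
  ".doc"  => "application/msword",
  ".html" => "text/html",
  ".htm"  => "text/html"
}

def file_info(dir)
  Dir.chdir(dir) do
    # Keep plain files only; Dir["*"] also lists subdirectories.
    Dir["*"].select { |name| File.file?(name) }.sort.map do |name|
      ext = File.extname(name).downcase
      {
        :file_name        => name,
        :mimetype         => MIME_BY_EXT.fetch(ext, "application/octet-stream"),
        :last_update_date => File.mtime(name)
      }
    end
  end
end

# Print one line per file found in the current directory.
file_info(".").each do |info|
  puts "#{info[:file_name]}  #{info[:mimetype]}  #{info[:last_update_date]}"
end
```

Each hash has one entry per value the original question asks for, so it could be handed to whatever does the database insert.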

msmith362 · July 16, 2009, 5:37am

Dylan wrote:

I’m not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

Might not Find be more useful overall?

http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
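For comparison, a minimal sketch using Find from the standard library. Unlike Dir["*"], Find.find also descends into subfolders. The helper name files_under is made up for this example, and skipping hidden folders is just one assumed use of Find.prune, not a requirement from the thread.

```ruby
require "find"

# Sketch: collect every regular file under root, including subfolders.
# files_under is a hypothetical helper name.
def files_under(root)
  paths = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Skip hidden folders such as .svn, but never prune the root itself.
      Find.prune if path != root && File.basename(path).start_with?(".")
    elsif File.file?(path)
      paths << path
    end
  end
  paths.sort
end

# List every file below the current directory.
files_under(".").each { |path| puts path }
```

If all 6000 documents sit in a single flat folder, the Dir approach is probably enough; Find only pays off once there are nested directories.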

–
James B.

msmith362 · July 16, 2009, 8:26am

Margaret Smith wrote:

I am new to Ruby and have perused the forum, but I will ask this question
as I couldn’t find an answer in other posts.

Hi Smith,

It’s very tough to answer your question. I like the Hpricot gem very
much, but I am not saying it is the best; it depends on what you need.
Please also try:

http://rfeedparser.rubyforge.org/

Thanks,
P.Raveendran
