Question About OCR in Ruby vs. Rails

addis_a · June 3, 2013, 10:31pm

Team,

I’m working on a project that will involve processing 15,000+ complex
financial documents. They are in PDF form.

Unfortunately, the documents are not available in a non-PDF form – so I
have to electronically scan the documents and “break them down” into a
database.

I’m familiar enough with Rails, that I feel comfortable doing it with the
Rails framework – but I’m not sure this is a good use of Rails.

Ruby and JavaScript are the only programming languages I know, so I’d
either need to somehow do this as a Rails project (with Ruby and
JavaScript), or as a Ruby project.

If I do it as a Ruby project (not rails), can you make recommendations
about the best way to go about it?

Kirk K.

Kirk_K · June 3, 2013, 10:50pm

Hi Kirk,

I’ve had pretty good luck shelling out to pdftotext to get the text from
PDFs into a searchable format. Is this what you mean by “break them down
into a database”?
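
For reference, shelling out from Ruby could look roughly like the sketch below. It assumes the `pdftotext` binary (shipped with Xpdf and Poppler) is on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for illustration.

```ruby
require "open3"

# Build the argument list for pdftotext.
# "-layout" keeps the physical column layout, which tends to help with tables;
# the trailing "-" tells pdftotext to write the text to stdout.
def pdftotext_command(pdf_path, layout: true)
  command = ["pdftotext"]
  command << "-layout" if layout
  command << pdf_path << "-"
  command
end

# Run pdftotext and return the extracted text, raising if the tool fails.
# Passing the arguments as an array avoids shell quoting problems.
def extract_text(pdf_path, layout: true)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, layout: layout))
  raise "pdftotext failed for #{pdf_path}: #{stderr.strip}" unless status.success?

  stdout
end

# Usage (hypothetical): ruby extract.rb statement.pdf
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  ARGV.each { |path| puts extract_text(path) }
end
```

Note that this only returns text that is already stored in the PDF; pages that are pure scanned images will come back empty and would still need OCR.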

Best Regards,
Jason

Kirk_K · June 3, 2013, 10:51pm

On Mon, Jun 3, 2013 at 2:28 PM, Kirk K. wrote:

Team,

I’m working on a project that will involve processing 15,000+ complex
financial documents. They are in PDF form.

Unfortunately, the documents are not available in a non-PDF form – so I
have to electronically scan the documents and “break them down” into a
database.

Unless the data you’re trying to extract is embedded in the PDF as a
raster image, you don’t need OCR, just something that can parse the PDF
file format.

I’m familiar enough with Rails, that I feel comfortable doing it with the
Rails framework – but I’m not sure this is a good use of Rails.

So, Rails only needs to be involved insofar as you need a web application
to wrap or expose this functionality. Otherwise Rails is irrelevant to
“processing 15,000+ complex financial documents”.

Ruby and JavaScript are the only programming languages I know, so I’d
either need to somehow do this as a Rails project (with Ruby and
JavaScript), or as a Ruby project.

If I do it as a Ruby project (not rails), can you make recommendations
about the best way to go about it?

Searching came up with the pdf-reader gem (yob/pdf-reader on GitHub),
which looks like it’d give you plenty of power to parse and extract the
data from your PDF files. Searching also came up with an old
(unmaintained?) gem called pdf-toolkit (on RubyGems.org) that’s a wrapper
around pdftk (PDFtk, the PDF Toolkit). I’d just play around in an IRB
session with these tools, trying to parse out the data from a few
representative copies of the documents in question to see what works. Then
you could do some trial passes and benchmark them, etc.
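
A trial pass with pdf-reader might look something like the sketch below. It assumes the gem is installed (`gem install pdf-reader`); `page_texts` and `timed` are hypothetical helper names, not part of the gem. The helper accepts any object that responds to `pages` (each page responding to `text`), which is how a `PDF::Reader` instance behaves.

```ruby
# Collect the text of every page from a reader-like object.
# Works with PDF::Reader instances, whose #pages return objects with #text.
def page_texts(reader)
  reader.pages.map(&:text)
end

# Time a block with a monotonic clock; returns [result, elapsed_seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Usage (hypothetical): ruby trial.rb sample1.pdf sample2.pdf
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  require "pdf/reader" # provided by the pdf-reader gem

  ARGV.each do |path|
    texts, seconds = timed { page_texts(PDF::Reader.new(path)) }
    puts format("%s: %d pages, %d characters, %.2fs",
                path, texts.size, texts.sum(&:length), seconds)
  end
end
```

Running this over a handful of representative documents gives a rough feel for both extraction quality and how long 15,000 files might take.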

Kirk_K · June 5, 2013, 2:41am

Kirk K. wrote:

If I do it as a Ruby project (not rails), can you make recommendations about the
best way to go about it?

I don’t know if the OCR part is a good use of Rails, etc, but a system
to manage all those documents once you’ve got them OCRed is a spiffy use
of Rails.

I used to work at a VLC (Very Large Company) in the contracts
department, helping put in a new contract management system, and one of
the things we had to do was exactly this: scanning tens of thousands
of legal documents, some of them exceeding 50-100 pages each.

The document management system was completely separate from this, which
was a good thing, as we weren’t tied to whatever technology for scanning
was required.

For the scanning bit, we had a handful of really heavy duty Xerox
document scanners being driven by dedicated workstations running the
full Adobe Acrobat suite. The final scanned, OCRed documents were
stored in a shared drive alongside a file containing their metadata.

These were scooped up periodically by a script I wrote that stuffed them
into the database of the document management system.
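
A periodic pickup like that could be sketched roughly as below. The original script and its metadata format aren't described here, so the JSON sidecar naming (`contract.pdf` next to `contract.json`) and the `pending_documents` helper are assumptions for illustration only.

```ruby
require "json"

# Find PDFs in a drop directory that have a matching metadata file.
# Assumption: metadata lives in a JSON file with the same base name.
# PDFs without a sidecar are skipped so they can be picked up on a later run.
def pending_documents(dir)
  Dir.glob(File.join(dir, "*.pdf")).sort.filter_map do |pdf_path|
    metadata_path = pdf_path.sub(/\.pdf\z/, ".json")
    next unless File.exist?(metadata_path)

    { pdf: pdf_path, metadata: JSON.parse(File.read(metadata_path)) }
  end
end

# Usage (hypothetical): ruby pickup.rb /path/to/shared/drive
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  pending_documents(ARGV.first).each do |doc|
    # A real importer would insert into the document database here.
    puts "#{doc[:pdf]}: #{doc[:metadata].keys.join(', ')}"
  end
end
```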

There were some serious issues all around, but the biggest headache at
the end of the day was getting the people scanning to properly fill out
the metadata.

Hoo’mahns