Forum: Ruby Question About OCR in Ruby vs. Rails

Kirk Keeter (Guest)
on 2013-06-03 22:31
(Received via mailing list)
Team,

I'm working on a project that will involve processing 15,000+ complex
financial documents.  They are in PDF form.

Unfortunately, the documents are not available in a non-PDF form -- so I
have to electronically scan the documents and "break them down" into a
database.

I'm familiar enough with Rails that I feel comfortable doing it with the
Rails framework -- but I'm not sure this is a good use of Rails.

Ruby and JavaScript are the only programming languages I know, so I'd
either need to somehow do this as a Rails project (with Ruby and
JavaScript), or as a Ruby project.

If I do it as a Ruby project (not rails), can you make recommendations
about the best way to go about it?

Kirk Keeter
Jason Stewart (Guest)
on 2013-06-03 22:50
(Received via mailing list)
Hi Kirk,

I've had pretty good luck shelling out to pdftotext to get the text from
PDFs into a searchable format. Is this what you mean by "break them down
into a database"?
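
For reference, a minimal sketch of that shelling-out approach. It assumes the
pdftotext command-line tool (from Poppler or Xpdf) is installed and on the
PATH; the method names pdftotext_command and extract_text are made up for
illustration.

```ruby
require "open3"

# Build the argument list for pdftotext. "-layout" asks it to preserve the
# physical layout of the page, which tends to help with tabular data, and a
# trailing "-" sends the extracted text to stdout instead of a file.
def pdftotext_command(pdf_path)
  ["pdftotext", "-layout", pdf_path.to_s, "-"]
end

# Run pdftotext and return the extracted text. Raises if the tool exits with
# a non-zero status (Open3 itself raises Errno::ENOENT if it isn't installed).
def extract_text(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}: #{stderr.strip}" unless status.success?
  stdout
end

# Usage (hypothetical file name):
#   text = extract_text("statement.pdf")
```

Passing the command as an array rather than one string avoids going through
the shell, so odd characters in file names can't break the call.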

Best Regards,
Jason
Kendall Gifford (zettabyte)
on 2013-06-03 22:51
(Received via mailing list)
On Mon, Jun 3, 2013 at 2:28 PM, Kirk Keeter <kirkkeeter@gmail.com> wrote:

> Team,
>
> I'm working on a project that will involve processing 15,000+ complex
> financial documents.  They are in PDF form.
>
> Unfortunately, the documents are not available in a non-PDF form -- so I
> have to electronically scan the documents and "break them down" into a
> database.
>
Unless the data you're trying to extract is locked inside an embedded raster
image, you don't need OCR; you need something that can parse the PDF file
format.
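
One rough way to tell the two cases apart is to run a text extractor over a
sample document and see whether anything comes back. The sketch below is only
a heuristic: the method name needs_ocr? and the threshold of 20 characters per
page are assumptions, not anything standard.

```ruby
# Heuristic check, given one string of extracted text per page: if the pages
# average fewer than min_chars_per_page non-whitespace characters, the PDF
# probably holds scanned images rather than real text, and would need OCR.
def needs_ocr?(page_texts, min_chars_per_page: 20)
  return true if page_texts.empty?

  total = page_texts.sum { |text| text.to_s.gsub(/\s+/, "").length }
  total.fdiv(page_texts.length) < min_chars_per_page
end
```

Trying this on a handful of the documents should show fairly quickly which
kind of PDF you are dealing with.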


> I'm familiar enough with Rails, that I feel comfortable doing it with the
> Rails framework -- but I'm not sure this is a good use of Rails.
>

So, Rails only needs to be involved insofar as you need a web application
to wrap or expose this functionality. Otherwise Rails is irrelevant to
"processing 15,000+ complex financial documents".


> Ruby and Javascript are the only programming languages I know, so I'd
> either need to somehow do this as a Rails project (with Ruby and
> javascript), or as a Ruby project.
>
> If I do it as a Ruby project (not rails), can you make recommendations
> about the best way to go about it?
>

A quick search turned up the pdf-reader gem (https://github.com/yob/pdf-reader),
which looks like it would give you plenty of power to parse and extract the
data from your PDF files. The search also turned up an old (unmaintained?) gem
called pdf-toolkit (https://rubygems.org/gems/pdf-toolkit), which is a wrapper
around pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/). I'd just
play around in an IRB session with these tools, trying to parse the data out
of a few representative copies of the documents in question to see what works.
Then you could do some trial passes and benchmark them, etc.
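
As a starting point for that kind of experiment, here is a small sketch built
on pdf-reader (gem install pdf-reader). PDF::Reader.new, #pages and Page#text
are part of the gem; the helper names and the "Label: value" pattern are
assumptions for illustration, and real financial documents will need patterns
tailored to their layout.

```ruby
# Open a PDF with the pdf-reader gem. The gem is required lazily so that the
# helpers below can be tried out even where it isn't installed.
def open_pdf(path)
  require "pdf-reader"
  PDF::Reader.new(path)
end

# Return one string of text per page. Accepts any object that exposes #pages
# whose elements respond to #text, which a PDF::Reader instance does.
def page_texts(reader)
  reader.pages.map(&:text)
end

# Hypothetical helper: collect "Label: value" pairs found on their own lines
# into a Hash. Lines without a colon are ignored.
def labeled_fields(text)
  text.scan(/^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*(\S.*?)[ \t]*$/).to_h
end

# Usage in IRB (hypothetical file name):
#   reader = open_pdf("statement.pdf")
#   page_texts(reader).each { |text| p labeled_fields(text) }
```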
Tamara Temple (Guest)
on 2013-06-05 02:41
(Received via mailing list)
Kirk Keeter <kirkkeeter@gmail.com> wrote:
>
> If I do it as a Ruby project (not rails), can you make recommendations about the
> best way to go about it?

I don't know if the OCR part is a good use of Rails, etc, but a system
to manage all those documents once you've got them OCRed is a spiffy use
of Rails.

I used to work at a VLC (Very Large Company) in the contracts
department, helping put in a new contract management system, and one of
the things we had to do was *exactly* this: scanning tens of thousands
of legal documents, some of them exceeding 50-100 pages each.

The document management system was completely separate from this, which
was a good thing, as we weren't tied to whatever technology for scanning
was required.

For the scanning bit, we had a handful of really heavy duty Xerox
document scanners being driven by dedicated workstations running the
full Adobe Acrobat suite. The final scanned, OCRed documents were
stored in a shared drive alongside a file containing their metadata.

These were scooped up periodically by a script I wrote that stuffed them
into the database of the document management system.
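
In Ruby, the scooping-up part of such a script might look roughly like the
sketch below. This is not the original script: the layout (each PDF sitting
next to a .json metadata file with the same base name) and the method name
each_scanned_document are assumptions, and the database insert is left to
the caller's block.

```ruby
require "json"

# Yield [pdf_path, metadata_hash] for every PDF in dir that has a sibling
# metadata file (same base name, .json extension; this format is assumed).
# Returns the PDFs that had no metadata file so someone can follow up.
def each_scanned_document(dir)
  missing = []
  Dir.glob(File.join(dir, "*.pdf")).sort.each do |pdf_path|
    meta_path = pdf_path.sub(/\.pdf\z/, ".json")
    if File.exist?(meta_path)
      yield pdf_path, JSON.parse(File.read(meta_path))
    else
      missing << pdf_path
    end
  end
  missing
end

# Usage (hypothetical path; store_document stands in for the database call):
#   missing = each_scanned_document("/mnt/scans") do |pdf, meta|
#     store_document(pdf, meta)
#   end
```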

There were some serious issues all around, but the biggest headache at
the end of the day was getting the people scanning to properly fill out
the metadata.

Hoo'mahns
Jony Green (jonygreen)
on 2015-09-08 04:58
I'm not a developer; I always use this free online OCR
service (http://www.online-code.net/ocr.html).