PDF Parsing Project Example

luislavena · May 9, 2011, 7:32pm

Hi,

I’m looking for an example of parsing PDF files. I tried to implement this
with Ruby and the docsplit gem, but it uses an external tool to extract the
text, there are problems with number references, and you still have to parse
the resulting text file with regular expressions.

I want to parse some papers in PDF format to extract their titles,
keywords, authors, authors’ email addresses, institutions, etc.

I’m looking for an experienced Ruby developer who knows a better way to do
this without parsing a text file through regular expressions.

Greetings

wladtepes · May 9, 2011, 8:02pm

Regular expressions are pretty much the standard way of parsing text files,
aren’t they? Certainly they’re what I’ve been using for years now.

What’s the problem you’re having with them?

wladtepes · May 9, 2011, 8:20pm

On Mon, May 9, 2011 at 8:01 PM, James [email protected] wrote:

Regular Expressions are pretty much the standard way of parsing text files,
aren’t they? Certainly they’re what I’ve been using for years now.

PDFs aren’t “just” text files.

A randomly-chosen excerpt from a random PDF I have lying about:

11 0 obj
<< /Title(1. The Quest for Quantum Gravity)
/Dest/section.1
/Parent 10 0 R
/Next 12 0 R
>>
endobj

Source: http://arxiv.org/abs/1010.3420v1
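
For illustration, here is a minimal Ruby sketch that reads the fields out of the object above. The method names `sample_outline_object`, `outline_title` and `indirect_references` are made up for this example, and the closing `>>` of the dictionary is assumed. It only works because this particular object happens to be stored as plain text. Page content in a PDF is typically held in compressed streams, where patterns like these would find nothing.

```ruby
# The outline object quoted above. The closing ">>" is an assumption:
# a PDF dictionary opened with "<<" has to be closed before "endobj".
def sample_outline_object
  <<~PDF
    11 0 obj
    << /Title(1. The Quest for Quantum Gravity)
    /Dest/section.1
    /Parent 10 0 R
    /Next 12 0 R
    >>
    endobj
  PDF
end

# Hypothetical helper: return the literal string that follows /Title.
# Handles only simple strings without escapes or nested parentheses.
def outline_title(object_source)
  match = object_source.match(%r{/Title\s*\(([^()\\]*)\)})
  match && match[1]
end

# Hypothetical helper: collect indirect references such as "/Parent 10 0 R"
# as { "Parent" => [object_number, generation_number] }.
def indirect_references(object_source)
  object_source.scan(%r{/(\w+)\s+(\d+)\s+(\d+)\s+R\b}).map do |key, number, generation|
    [key, [number.to_i, generation.to_i]]
  end.to_h
end

puts outline_title(sample_outline_object)
```

The title comes out cleanly here, but the same code says nothing about the authors or keywords in the body of the paper, which is the part the original question is after.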

I could have excerpted parts of the binary blob this PDF includes at
the start, but I’d rather not break anyone’s email client without
intending to.

–
Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

wladtepes · May 10, 2011, 3:58pm

Whenever I’ve done this in the past, I’ve used pdftohtml to produce an
HTML file which Nokogiri can then handle. Yes, it’s an external tool,
but it’s been reliable for me in the past.

–
Alex

wladtepes · May 10, 2011, 1:14am

I recently spotted

but haven’t had the time to play with it yet.

Regards,
Martin

2011/5/9 Felipe E. [email protected]: