PDF Parsing Project Example

Hi,

I’m looking for an example of parsing PDFs. I tried to implement this
with Ruby and the docsplit gem, but it uses an external tool to extract
the text, there are problems with number references, and you still have
to parse the resulting text file with regular expressions.

I want to parse some papers in PDF format to extract each paper’s title,
keywords, authors, authors’ email addresses, institutions, etc.

I’m looking for an experienced Ruby developer who knows a better way to
do this than parsing a text file with regular expressions.

Greetings

Regular expressions are pretty much the standard way of parsing text
files, aren’t they? Certainly they’re what I’ve been using for years now.

What’s the problem you’re having with them?
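
For anyone following along, here is a rough sketch of the regex approach being discussed. It assumes the text has already been extracted from the PDF into a string, that the first non-empty line is the title, and that keywords sit on a line starting with "Keywords:". Those layout assumptions, the `extract_metadata` name, and the sample text are all made up for illustration; real papers will vary.

```ruby
# Heuristic patterns; both are simplifications, not complete grammars.
EMAIL_RE    = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/
KEYWORDS_RE = /^Keywords?:\s*(.+)$/i

# Pulls a few fields out of already-extracted plain text.
# Assumes (hypothetically) that the first non-empty line is the title.
def extract_metadata(text)
  lines    = text.lines.map(&:strip).reject(&:empty?)
  keywords = text[KEYWORDS_RE, 1].to_s.split(/\s*[,;]\s*/)
  {
    title:    lines.first,
    emails:   text.scan(EMAIL_RE),
    keywords: keywords
  }
end

# Invented sample input, shaped like the output of a PDF-to-text tool.
sample = <<~TEXT
  A Sample Paper Title

  A. Author
  a.author@example.org

  Keywords: parsing, pdf; ruby
TEXT

p extract_metadata(sample)
```

The weak point is exactly the set of assumptions above: as soon as the title wraps over two lines or the extractor reorders columns, the patterns need adjusting per document.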

On Mon, May 9, 2011 at 8:01 PM, James [email protected] wrote:

Regular Expressions are pretty much the standard way of parsing text files,
aren’t they? Certainly they’re what I’ve been using for years now.

PDFs aren’t “just” text files.

A randomly-chosen excerpt from a random PDF I have lying about:

11 0 obj
<< /Title(1. The Quest for Quantum Gravity)
/Dest/section.1
/Parent 10 0 R
/Next 12 0 R
>>
endobj

Source: http://arxiv.org/abs/1010.3420v1

I could have excerpted parts of the binary blob this PDF includes at
the start, but I’d rather not break anyone’s email client without
intending to. ;)
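
To make the point concrete, here is a small sketch of what regex-on-raw-PDF looks like for an object like the one above. The helper names `pdf?` and `outline_titles` are invented for this example. It only works when the object is stored uncompressed and the title is a simple literal string; many PDFs keep such objects inside compressed streams, where a plain scan finds nothing.

```ruby
# Matches "/Title" followed by a literal string in parentheses.
# Handles backslash escapes, but NOT unescaped nested parentheses,
# hex strings (<...>), or objects stored in compressed streams.
TITLE_RE = %r{/Title\s*\(((?:\\.|[^\\)])*)\)}m

# A PDF file starts with a "%PDF-" header; the bytes after it are
# frequently binary, so treat the data as bytes, not as text.
def pdf?(raw)
  raw.b.start_with?("%PDF-")
end

# Returns the raw (still escaped) title strings found in the data.
def outline_titles(raw)
  raw.b.scan(TITLE_RE).flatten
end

excerpt = <<~PDF
  11 0 obj
  << /Title(1. The Quest for Quantum Gravity)
  /Dest/section.1
  /Parent 10 0 R
  /Next 12 0 R
  >>
  endobj
PDF

p outline_titles(excerpt)
```

Even in the cases where this finds something, it only gives you outline entries, not authors or keywords, so it is an illustration of the problem more than a solution.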


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

Whenever I’ve done this, I’ve used pdftohtml to produce an HTML file
which Nokogiri can then handle. Yes, it’s an external tool, but it’s
been reliable for me.


Alex

I recently spotted

but haven’t had the time to play with it yet.

Regards,
Martin

2011/5/9 Felipe E. [email protected]: