Parsing PDF files

arunvoip · August 22, 2009, 6:36pm

Hello all,
Does anyone know a good PDF parser that retains formatting after it has
extracted the text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on
Linux, the formatting is retained. I’m looking for something like that.

Thanks for any help.

regards,
Arun K. M S

arunvoip · August 22, 2009, 7:09pm

On Sat, Aug 22, 2009 at 12:33 PM, Arun K. wrote:

Hello all,
Does anyone know a good PDF parser that retains formatting after it has
extracted the text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on
Linux, the formatting is retained. I’m looking for something like that.

This doesn’t exist in Ruby, unfortunately.

-greg

arunvoip · August 22, 2009, 10:11pm

That’s really very sad


arunvoip · August 22, 2009, 11:12pm

On Sat, Aug 22, 2009 at 4:10 PM, Arun K. wrote:

That’s really very sad

Looks like you’d better roll up your sleeves.

arunvoip · August 23, 2009, 10:10am

Yeah, I’m seeing what can be done.

arunvoip · August 24, 2009, 11:52am

Arun,

there is another Ruby PDF extractor:
http://scm.ywesee.com/?p=rpdf2txt;a=summary

However, it’s largely undocumented, slow, and fragile, and its column
detection algorithm is basic at best. If that does not faze you, give
it a try and contact me if you have questions. Look at the included
command-line tool in ./bin/rpdf2txt for an example.

cheers,
Hannes

arunvoip · August 24, 2009, 12:25pm

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:


arunvoip · August 24, 2009, 2:23pm

On Mon, Aug 24, 2009 at 6:25 AM, Erik T. wrote:

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:

GitHub - eterps/pdf-struct: PDF::Extractor is a library that provides high level access to the text objects of a PDF document

Very interesting, thanks for posting this.

-greg
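
For anyone wanting to try the pdftohtml route suggested above, here is a rough sketch of how its XML output could be turned back into laid-out text from Ruby. It assumes a pdftohtml build that supports the -xml, -i and -stdout options and emits <text> elements with top/left attributes. The helper names (pdf_to_xml, layout_text, deep_text) and the line_tolerance and char_width defaults are made up for illustration and are not taken from pdf-struct or rpdf2txt. Treat it as a starting point, not a finished extractor.

```ruby
require 'open3'
require 'rexml/document'

# Run pdftohtml and return its XML output as a string.
# Assumes pdftohtml is on the PATH and accepts -xml, -i (skip images)
# and -stdout; adjust if your build differs.
def pdf_to_xml(path)
  xml, status = Open3.capture2('pdftohtml', '-xml', '-i', '-stdout', path)
  raise "pdftohtml failed for #{path}" unless status.success?
  xml
end

# Collect the text of a node, including text nested in <b>, <i>, <a>.
def deep_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then deep_text(child)
    else ''
    end
  end.join
end

# Rebuild a rough plain-text layout, one string per page.
# Fragments whose "top" values differ by at most line_tolerance pixels are
# put on the same line; "left" is mapped to a column by dividing by
# char_width (a guessed average glyph width in pixels).
def layout_text(xml, line_tolerance: 2, char_width: 6.0)
  doc = REXML::Document.new(xml)
  pages = []
  doc.elements.each('//page') do |page|
    fragments = []
    page.elements.each('text') do |node|
      fragments << { top: node.attributes['top'].to_i,
                     left: node.attributes['left'].to_i,
                     text: deep_text(node) }
    end

    # Group fragments into lines by vertical position.
    lines = []
    fragments.sort_by { |f| [f[:top], f[:left]] }.each do |frag|
      if lines.empty? || (frag[:top] - lines.last[:top]).abs > line_tolerance
        lines << { top: frag[:top], fragments: [] }
      end
      lines.last[:fragments] << frag
    end

    # Render each line, padding with spaces to approximate columns.
    rendered = lines.map do |line|
      out = +''
      line[:fragments].sort_by { |f| f[:left] }.each do |frag|
        column = (frag[:left] / char_width).round
        if out.length < column
          out << ' ' * (column - out.length)
        elsif !out.empty? && !out.end_with?(' ')
          out << ' '
        end
        out << frag[:text]
      end
      out
    end
    pages << rendered.join("\n")
  end
  pages
end

# Usage (needs pdftohtml installed and a real file):
#   puts layout_text(pdf_to_xml('report.pdf')).join("\n\f\n")
```

The fixed char_width is the weak spot: with proportional fonts the columns will only line up approximately, so it may need tuning per document.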