Parsing PDF files

Hello all,
Does anyone know a good PDF parser that retains formatting after
extracting text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on
Linux, the text retains its format. I’m looking for something like that.

Thanks for any help.

regards,
Arun K. M S

On Sat, Aug 22, 2009 at 12:33 PM, Arun K. wrote:

Hello all,
Does anyone know a good PDF parser that retains formatting after
extracting text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on
Linux, the text retains its format. I’m looking for something like that.

This doesn’t exist in Ruby, unfortunately.

-greg

That’s really very sad :(

On Sat, Aug 22, 2009 at 10:33 PM, Gregory B. wrote:

Looks like you'd better roll up your sleeves :)

Yeah, seeing what can be done :)

Arun,

There is another Ruby PDF extractor:
http://scm.ywesee.com/?p=rpdf2txt;a=summary

However, it’s largely undocumented, slow, and fragile, and its column
detection algorithm is basic at best. If that does not faze you, give
it a try and contact me if you have questions. Look at the included
command-line tool in ./bin/rpdf2txt for an example.

cheers,
Hannes

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:

On Mon, Aug 24, 2009 at 6:25 AM, Erik T. wrote:

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:

GitHub - eterps/pdf-struct: PDF::Extractor is a library that provides high level access to the text objects of a PDF document

Very interesting, thanks for posting this.

-greg
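
For anyone landing on this thread later, here is a rough sketch of the
pdftohtml route suggested above. It is not the API of the wrapper Erik
linked. It assumes a pdftohtml binary on the PATH that accepts the -xml,
-i and -stdout options and emits <page> and <text top=".." left="..">
elements. The names Fragment, pdf_to_xml, parse_fragments and
layout_lines are made up for this illustration, and the px_per_char and
line_tolerance values are guesses that would need tuning per document.

```ruby
require "open3"
require "rexml/document"

# One positioned piece of text taken from pdftohtml's XML output.
Fragment = Struct.new(:page, :top, :left, :text)

# Run pdftohtml and return its XML output as a string.
# Assumes the binary is installed and supports -xml, -i and -stdout.
def pdf_to_xml(pdf_path)
  xml, status = Open3.capture2("pdftohtml", "-xml", "-i", "-stdout", pdf_path)
  raise "pdftohtml failed for #{pdf_path}" unless status.success?
  xml
end

# Collect the text of an element, including nested tags such as <b> or <i>.
def inner_text(element)
  element.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then inner_text(child)
    else ""
    end
  end.join
end

# Turn the XML into a flat list of Fragment records.
def parse_fragments(xml)
  doc = REXML::Document.new(xml)
  fragments = []
  doc.elements.each("//page") do |page|
    number = page.attributes["number"].to_i
    page.elements.each("text") do |node|
      fragments << Fragment.new(number,
                                node.attributes["top"].to_i,
                                node.attributes["left"].to_i,
                                inner_text(node))
    end
  end
  fragments
end

# Rebuild plain-text lines: fragments whose "top" values are within
# line_tolerance of the first fragment in a row share that row, and each
# fragment is padded out to a column derived from its "left" value.
def layout_lines(fragments, line_tolerance: 3, px_per_char: 6.0)
  lines = []
  fragments.group_by(&:page).sort_by { |page, _| page }.each do |_, on_page|
    rows = []
    on_page.sort_by { |f| [f.top, f.left] }.each do |frag|
      if rows.any? && (frag.top - rows.last.first.top).abs <= line_tolerance
        rows.last << frag
      else
        rows << [frag]
      end
    end
    rows.each do |row|
      line = +""
      row.sort_by(&:left).each do |frag|
        pad = (frag.left / px_per_char).round - line.length
        pad = 1 if pad < 1 && !line.empty? # keep fragments separated
        line << " " * [pad, 0].max << frag.text
      end
      lines << line.rstrip
    end
  end
  lines
end
```

With those pieces, something like
puts layout_lines(parse_fragments(pdf_to_xml("some.pdf"))) would print
an approximation of the page layout. Padding by horizontal position is
crude; it keeps simple columns aligned but will not cope with rotated
text or proportional fonts of very different sizes.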