Hello all,
Does anyone know a good PDF parser that retains formatting after it has
extracted the text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF’s contents from a PDF reader into the gedit text editor on
Linux, the text retains its format. I’m looking for something like that.
Thanks for any help.
regards,
Arun K. M S
On Sat, Aug 22, 2009 at 12:33 PM, Arun K. wrote:
Hello all,
Does anyone know a good PDF parser that retains formatting after it has
extracted the text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF’s contents from a PDF reader into the gedit text editor on
Linux, the text retains its format. I’m looking for something like that.
This doesn’t exist in Ruby, unfortunately.
-greg
That’s really very sad 
On Sat, Aug 22, 2009 at 10:33 PM, Gregory B.
On Sat, Aug 22, 2009 at 4:10 PM, Arun K. wrote:
That’s really very sad 
Looks like you better roll up your sleeves 
Yeah, seeing what can be done.
Arun,
there is another ruby pdf-extractor:
http://scm.ywesee.com/?p=rpdf2txt;a=summary
However, it’s largely undocumented, slow, fragile, and its column
detection algorithm is basic at best. If that does not faze you, give
it a try and contact me if you have questions. Look at the included
command-line tool in ./bin/rpdf2txt for an example.
cheers,
Hannes
You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:
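Independent of the wrapper mentioned above, the pdftohtml route can be
sketched from plain Ruby: ask the tool for its XML output, which carries a
position for every text fragment, and rebuild the lines from those
positions. This is only a rough sketch under several assumptions, none of
which come from this thread: the method names layout_text and
pdf_to_layout_text and the px_per_char knob are made up for illustration,
each fragment is assumed to look like <text top="…" left="…" …>…</text>
with top listed before left, and the installed pdftohtml is assumed to
accept -xml, -i and -stdout together.

```ruby
require "open3"

# The few XML entities this sketch decodes.
ENTITIES = { "&amp;" => "&", "&lt;" => "<", "&gt;" => ">", "&quot;" => '"' }.freeze

# Assumed shape of one positioned fragment in `pdftohtml -xml` output:
#   <text top="108" left="54" width="80" height="15" font="0">Hello</text>
TEXT_TAG = /<text\b[^>]*\btop="(\d+)"[^>]*\bleft="(\d+)"[^>]*>(.*?)<\/text>/m

# Rebuild plain-text lines from positioned fragments.
# px_per_char (hypothetical knob): horizontal pixels per output column.
def layout_text(xml, px_per_char: 6)
  rows = Hash.new { |hash, top| hash[top] = [] }

  xml.scan(TEXT_TAG) do |top, left, body|
    text = body.gsub(/<[^>]+>/, "")                      # drop inline markup such as <b>
               .gsub(/&(?:amp|lt|gt|quot);/, ENTITIES)   # decode basic entities
    rows[top.to_i] << [left.to_i, text]
  end

  # Fragments sharing the same `top` value form one output line.
  rows.sort.map do |_top, fragments|
    line = +""
    fragments.sort_by(&:first).each do |left, text|
      padding = left / px_per_char - line.length
      # Keep at least one space between fragments that would collide.
      padding = 1 if padding < 1 && !line.empty?
      line << " " * padding << text
    end
    line.rstrip
  end.join("\n")
end

# Run pdftohtml on a file and return layout-preserving text.
# Assumes pdftohtml is on the PATH and writes XML to stdout with these flags.
def pdf_to_layout_text(pdf_path)
  xml, status = Open3.capture2("pdftohtml", "-xml", "-i", "-stdout", pdf_path)
  raise "pdftohtml failed for #{pdf_path}" unless status.success?

  layout_text(xml)
end
```

Something like puts pdf_to_layout_text("report.pdf") (the file name is a
placeholder) would then print the text with columns roughly where they sat
on the page. Grouping by exact top value is crude: fragments on the same
visual line with slightly different baselines end up on separate output
lines, so a tolerance would be needed for real documents.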