Extract Text from PDF

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark

On 13.04.2007 14:06, Mark D. wrote:

Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don’t know the current status of that. HTH

robert

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend) and read the resulting
ASCII file with Ruby in the usual way.
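
A rough sketch of that two-step route, assuming both pdf2ps and ps2ascii are installed and on the PATH. The helper names run! and pdf_to_ascii are made up for illustration, and ps2ascii is called with a single argument so that it writes the text to standard output:

```ruby
require "open3"
require "tmpdir"

# Run an external command and return its standard output.
# Raises if the command exits with a non-zero status.
def run!(*cmd)
  out, err, status = Open3.capture3(*cmd)
  raise "#{cmd.first} failed: #{err.strip}" unless status.success?
  out
end

# Convert the PDF to PostScript in a temporary directory, then
# return the plain text that ps2ascii prints for that file.
def pdf_to_ascii(pdf_path)
  Dir.mktmpdir do |dir|
    ps_path = File.join(dir, "out.ps")
    run!("pdf2ps", pdf_path, ps_path)
    run!("ps2ascii", ps_path)
  end
end
```

Calling pdf_to_ascii("report.pdf") (a hypothetical file name) should return the text as a String, which can then be split into lines or searched like any other Ruby string.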

Regards,

Chris

Hi,

2007/4/13, Mark D.:

Does anyone know a way to extract plain text from a PDF using Ruby?

You can use Ruby/Poppler:
http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler

An example that does this is available in the ruby-gnome2 CVS repository.
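
A minimal sketch along those lines, assuming the poppler gem from ruby-gnome2 is installed. Poppler::Document.new and the page method get_text are the calls as I remember them from that binding, so check them against the installed version (some versions may expect a file: URI rather than a plain path). The helper names collect_page_text and extract_text are hypothetical:

```ruby
# Collect the text of every page from a document-like object.
# Assumes the object responds to #each and that each page responds
# to #get_text, as Ruby/Poppler pages are assumed to do here.
def collect_page_text(doc)
  texts = []
  doc.each { |page| texts << page.get_text }
  texts.join("\n")
end

# Open a PDF with Ruby/Poppler and return its text.
def extract_text(pdf_path)
  require "poppler" # loaded lazily so the helper above works without the gem
  collect_page_text(Poppler::Document.new(pdf_path))
end

# Print the text when run as a script with a PDF path as the first argument.
puts extract_text(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```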

Thanks,

At least on Linux, there is “pdftotext”, which is part of the “poppler”
package. So you can simply shell out to it if it’s installed. If you’re
more ambitious, you could write an extension to use the underlying
libraries in poppler.
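
Shelling out could look roughly like this. It is only a sketch: it assumes pdftotext is on the PATH, and it relies on pdftotext writing to standard output when the output file is given as "-". The method name and the binary: keyword are made up for illustration:

```ruby
require "open3"

# Run pdftotext on the given file and return the extracted text.
# Passing "-" as the output file makes pdftotext write to stdout.
def pdftotext(pdf_path, binary: "pdftotext")
  out, err, status = Open3.capture3(binary, pdf_path, "-")
  raise "#{binary} failed: #{err.strip}" unless status.success?
  out
rescue Errno::ENOENT
  # Raised when the executable itself cannot be found.
  raise "#{binary} not found; is the poppler package installed?"
end
```

Using Open3.capture3 with separate arguments rather than backticks avoids going through a shell, so file names with spaces or quotes need no escaping.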


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.net/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

The trouble is that PDFs are not all alike. Sometimes there is no text
at all in a PDF: it can be all vector-art outlines or even all raster
images. There is never a guarantee that you will get any or all of the
text that is otherwise human-readable in a PDF. PDF has really become a
kitchen-sink format, so it is good to anticipate trouble when parsing
PDF files.
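
One cheap way to anticipate that is to check whether an extractor returned anything besides whitespace before trusting the result. A small sketch; the method name usable_text? and the min_chars threshold are arbitrary assumptions, not part of any library:

```ruby
# Returns true when the extracted text contains at least min_chars
# non-whitespace characters. Page breaks (form feeds) count as
# whitespace, so a PDF made only of images comes back as unusable.
def usable_text?(text, min_chars: 1)
  text.to_s.gsub(/\s+/, "").length >= min_chars
end
```

A caller could fall back to another tool, or report the file as image-only, whenever usable_text? returns false.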