Hi,
Does anyone know a way to extract plain text from a PDF using Ruby?
Many Thanks,
~ Mark
On 13.04.2007 14:06, Mark D. wrote:
Does anyone know a way to extract plain text from a PDF using Ruby?
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don’t know the current status of that. HTH
robert
Robert K. wrote:
On 13.04.2007 14:06, Mark D. wrote:
Does anyone know a way to extract plain text from a PDF using Ruby?
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don’t know the current status of that. HTH
In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
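A minimal sketch of that two-step approach, assuming pdf2ps and ps2ascii are both installed and on the PATH (newer Ghostscript packages may not ship ps2ascii). The method names here are made up for illustration; only the two tool names come from the suggestion above.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: builds the two command lines without running them.
# pdf2ps takes an input PDF and an output PostScript path;
# ps2ascii with a single argument writes the text to stdout.
def ghostscript_commands(pdf_path, ps_path)
  [
    ["pdf2ps", pdf_path, ps_path],
    ["ps2ascii", ps_path]
  ]
end

# Hypothetical helper: converts the PDF to PostScript in a temporary
# directory, then returns whatever text ps2ascii prints.
def pdf_to_ascii(pdf_path)
  Dir.mktmpdir do |dir|
    ps_path = File.join(dir, "out.ps")
    to_ps, to_ascii = ghostscript_commands(pdf_path, ps_path)

    _out, err, status = Open3.capture3(*to_ps)
    raise "pdf2ps failed: #{err}" unless status.success?

    text, err, status = Open3.capture3(*to_ascii)
    raise "ps2ascii failed: #{err}" unless status.success?

    text
  end
end

# Usage (the file name is an example only):
#   puts pdf_to_ascii("report.pdf")
```

Passing the commands as arrays avoids going through a shell, so file names with spaces or quotes need no escaping.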
Regards,
Chris
Hi,
2007/4/13, Mark D.:
Does anyone know a way to extract plain text from a PDF using Ruby?
You can use Ruby/Poppler:
http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler
An example that does this is in the ruby-gnome2 CVS repository.
Thanks,
Robert K. wrote:
On 13.04.2007 14:06, Mark D. wrote:
Does anyone know a way to extract plain text from a PDF using Ruby?
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don’t know the current status of that. HTH
robert
At least on Linux, there is “pdftotext”, which is part of the “poppler”
package. So you can simply shell out to it if it’s installed. If you’re
more ambitious, you could write an extension to use the underlying
libraries in poppler.
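Shelling out could look roughly like this. It is a sketch, assuming pdftotext is on the PATH; the method names are hypothetical. Giving "-" as the output file makes pdftotext write the text to stdout.

```ruby
require "open3"

# Hypothetical helper: the command line for pdftotext, with "-" as the
# output file so the text goes to stdout instead of a .txt file.
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Hypothetical helper: returns the extracted text, or nil when the
# pdftotext executable is not installed.
def extract_text(pdf_path)
  text, err, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{err}" unless status.success?
  text
rescue Errno::ENOENT
  nil
end

# Usage (the file name is an example only):
#   text = extract_text("report.pdf")
#   puts text || "pdftotext is not installed"
```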
–
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.net/
If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.
The trouble is that PDFs are not all alike. Sometimes there is no text
in a PDF at all: it can be all vector art outlines, or even all raster
images. There is never a guarantee that you will get any or all of the
text that is otherwise human readable in a PDF. PDF has really become a
kitchen-sink format, so it is good to anticipate trouble when parsing
PDF files.
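One way to anticipate that trouble is to check whether an extraction produced anything usable before relying on it. This is a rough heuristic sketch only: the method name and the 20-character threshold are assumptions, not something any of the tools above provide.

```ruby
# Hypothetical check: treats the result as "no usable text" when it is nil
# or contains fewer than min_chars letters or digits. Whitespace and the
# form feeds some extractors emit between pages are not counted.
def extractable_text?(text, min_chars: 20)
  return false if text.nil?
  text.scan(/[[:alnum:]]/).length >= min_chars
end

# Usage: warn instead of silently processing an empty result.
#   warn "No text layer found; the PDF may be scanned images" unless extractable_text?(text)
```

A PDF that fails this check may still be readable by a person, for example a scanned document, in which case OCR rather than text extraction would be needed.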