Text extraction from PDF files (non-European languages)?

unknown · November 21, 2006, 6:06pm

Dear all,

is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby in conjunction
with pdftotext for English and other Latin1-encoded texts,
with occasional problems for special characters,
but it doesn’t seem to work for Unicode.

Is there a Ruby way to do this?

Thank you!

Best regards,

Axel

unknown · November 22, 2006, 1:53am

Hi,

2006/11/22, [email protected] [email protected]:

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?
> Under Linux, I have been able to use Ruby in conjunction
> with pdftotext for English and other Latin1 encoded texts -
> with some problems sometimes for special characters,
> but it doesn’t seem to work for Unicode …

Which version of pdftotext did you use? Xpdf or poppler?
You need to install character map files for texts in encodings
other than Latin1.
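
For what it's worth, here is a minimal sketch of driving pdftotext from Ruby with UTF-8 output. It assumes a pdftotext binary (Xpdf or poppler) is on the PATH and that any character map files your language needs are already installed. The helper names `pdftotext_command` and `extract_text` are made up for this example; `-enc UTF-8` selects the output encoding, and `-` as the output file sends the text to stdout.

```ruby
require "open3"

# Build the pdftotext argument list (hypothetical helper name).
# "-enc UTF-8" asks for UTF-8 output; the trailing "-" means "write to stdout".
def pdftotext_command(pdf_path)
  ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a UTF-8 string.
# Raises if pdftotext exits with a non-zero status.
def extract_text(pdf_path)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out.force_encoding(Encoding::UTF_8)
end

# Usage (the file name is only an example):
#   puts extract_text("arabic.pdf")
```

Passing the arguments as an array avoids shell quoting problems with unusual file names. Whether the output is usable still depends on the PDF itself and on the installed character maps.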

> Is there a Ruby way to do this?

You can use Ruby/Poppler if poppler itself has no problem with your files:
CVS Info for project ruby-gnome2
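
A rough sketch of what that could look like. The helper `pages_to_text` is a made-up name; it only assumes an object that can be enumerated page by page, where each page responds to `get_text`. As far as I recall, that matches `Poppler::Document` and `Poppler::Page` in the ruby-gnome2 binding, but please treat those calls as an assumption and check them against the version you install.

```ruby
# Collect the text of every page into one string (hypothetical helper).
# `document` is assumed to behave like Poppler::Document: enumerable over
# pages, with each page responding to get_text.
def pages_to_text(document)
  document.map { |page| page.get_text.to_s }.join("\n")
end

# Usage with the real binding (assumed API, requires Ruby/Poppler):
#   require "poppler"
#   document = Poppler::Document.new("input.pdf")
#   puts pages_to_text(document)
```

Since the helper only relies on `map` and `get_text`, it can be tried out with plain Ruby objects before Ruby/Poppler is installed.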

Thanks,

unknown · November 21, 2006, 6:16pm

Axel

On 11/21/06, [email protected] [email protected] wrote:

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?

rpdf2txt (1) should work with Unicode PDF documents. If you run into
any problems, let me know; I’m happy to tinker with the beast.

http://download.ywesee.com/rpdf2txt/rpdf2txt-1.0.6.tar.bz2
http://raa.ruby-lang.org/project/rpdf2txt/

hth

Hannes