Is there a way to extract text from a PDF when the document
is in a non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby together
with pdftotext for English and other Latin1-encoded texts,
with occasional problems for special characters,
but it doesn’t seem to work for Unicode …
Which version of pdftotext did you use? Xpdf or poppler?
You need to install character map files for texts that are not
Latin1 encoded.
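Once the character map files are in place, something along these lines might work from Ruby. This is only a rough sketch: it assumes pdftotext is on your PATH and relies on its -enc option to request UTF-8 output (Xpdf's pdftotext defaults to Latin1 otherwise). The helper names pdftotext_command and extract_text, and the file name document.pdf, are made up for illustration.

```ruby
require "open3"

# Build the pdftotext argument list (hypothetical helper).
# "-enc UTF-8" asks for UTF-8 output; the trailing "-" sends the
# extracted text to stdout instead of a .txt file.
def pdftotext_command(pdf_path)
  ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a UTF-8 string
# (hypothetical helper). Raises if pdftotext exits with an error.
def extract_text(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?

  # The bytes are already UTF-8; just tell Ruby to treat them as such.
  stdout.force_encoding(Encoding::UTF_8)
end

# Example call, assuming a file named document.pdf exists:
# puts extract_text("document.pdf")
```

Passing the arguments as an array avoids going through a shell, so file names containing spaces or non-ASCII characters should not need quoting. If the output still comes out empty or garbled for Arabic or Chinese, the missing character maps are the more likely cause than the Ruby side.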