Re: Parsing pdf files

Dear Arun,

There is a command-line tool, pdftotext, which you can use with an
encoding specification and also with a “-layout” option, which will
preserve line breaks.
The list of possible encodings

pdftotext -listenc

does not include iscii-1988, so you’ll probably be out of luck
if the original document is not in Unicode (maybe you can use iconv
on the result of pdftotext).
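
For calling pdftotext from a Ruby script, a minimal sketch could look like the following. The helper names pdftotext_command and pdf_to_text are made up for illustration; the flags -layout and -enc and the "-" output target (standard output) belong to pdftotext itself. It assumes pdftotext is installed and on the PATH.

```ruby
require "open3"

# Build the argument list for pdftotext (hypothetical helper).
# "-" as the output file makes pdftotext write to standard output.
def pdftotext_command(pdf_path, encoding = "UTF-8", layout = true)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout
  cmd + ["-enc", encoding, pdf_path, "-"]
end

# Run pdftotext and return the extracted text.
# Assumes the pdftotext binary is available; raises if it exits with an error.
def pdf_to_text(pdf_path, encoding = "UTF-8", layout = true)
  out, status = Open3.capture2(*pdftotext_command(pdf_path, encoding, layout))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  out
end
```

Passing the arguments as separate array elements avoids shell quoting problems with file names that contain spaces.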

I found a UTF-8 encoded web page in Hindi, printed it to a PDF file,
ran pdftotext on it, and opened the result in the SciTE editor,
specifying the encoding as UTF-8. Most of the symbols are recognized
correctly, but some are not (vowels? combinations of letters?).

I’m sending the screenshot as an attachment to your email address.

Best regards,

Axel

Hello Axel,
Thank you. But I would like to point out that it’s not very accurate in
maintaining the layout; I already tried it out. If you copy the text of
a PDF file from evince to gedit, you get a more accurate layout. What
escapes me is how to do that programmatically :slight_smile:

cheers & regards,
Arun

-------- Original Message --------

Date: Sun, 23 Aug 2009 19:46:23 +0900
From: Arun K. [email protected]
To: [email protected]
Subject: Re: Parsing pdf files

Dear Arun,

Could you say something more about which layout features you need?

Best regards,

Axel

Hello Axel,

Suppose the data in the PDF is in two columns (as is the case in
research papers) or contains tables. The copied version should have the
same amount of space between words and columns. I’ll attach an example
of two-column text for your reference. For the program I’m writing, the
layout is essential.

If you are not able to see two-column output in your text editor (since
it is probably more than 80 characters per line), please reduce the font
size of your text editor (or use a large monitor :wink: ).

Observe the space between the two columns in the output. This was done
by copying from evince to gedit or emacs.

On Sun, Aug 23, 2009 at 4:50 PM, Axel E. [email protected] wrote:


-------- Original Message --------

Date: Sun, 23 Aug 2009 22:05:23 +0900
From: Arun K. [email protected]
To: [email protected]
Subject: Re: Parsing pdf files

Dear Arun,

I suppose this is because pdftotext (and also gedit) converts tabs to
spaces.
Also, the good impression you get in gedit depends a lot on using a
monospaced font :slight_smile:
Admittedly, a very quick hack

text = IO.read("ie.txt")
text.gsub!(/ {8}/, "\t")   # each run of eight spaces becomes one tab
text.gsub!(/ {2,}/, "")    # remaining runs of two or more spaces are dropped
File.open("temp_out.txt", "w") { |f| f.puts text }

doesn’t give very nice results, so some additional fiddling is
necessary.
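
One possible direction for that fiddling, offered as a sketch rather than a tested fix: treat every run of two or more spaces as a single column gap and replace it with one tab, leaving the single spaces between words alone. The helper name gaps_to_tabs is hypothetical. Note that this discards the exact width of each gap, so alignment then depends on the tab stops of the viewer.

```ruby
# Replace each run of two or more spaces with a single tab (hypothetical helper).
# Single spaces between words are left untouched.
def gaps_to_tabs(text)
  text.gsub(/ {2,}/, "\t")
end

sample = "left column text      right column text\n"
puts gaps_to_tabs(sample)
```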

I once wrote some code to separate two-columned text. You can combine
the two columns with tabs.

Best regards,

Axel


def column_arrange(txt_file)
  text = IO.readlines(txt_file)
  # A column gap: two or more spaces followed by a non-blank character.
  # (Assumed; a single-space pattern would match every gap between words.)
  reg = / {2,}\S/

  # Index of the first character after each gap in a line.
  gap_ends = lambda { |line|
    ends = []
    line.scan(reg) { ends << Regexp.last_match.end(0) - 1 }
    ends
  }

  # The median gap position over the whole file is the default cut.
  ref = text.map { |line| gap_ends.call(line) }.flatten.sort
  cut_most_columns_here = ref[ref.length / 2]

  col1 = []
  col2 = []
  text.each { |line|
    # there might be several longer sequences of whitespace in a line
    whites_ind = gap_ends.call(line)
    cut_here = if whites_ind.empty?
                 cut_most_columns_here
               else
                 # take the gap closest to the usual cut position
                 whites_ind.min_by { |x| (x - cut_most_columns_here).abs }
               end
    col1 << line[0...cut_here].chomp
    col2 << line[cut_here..-1].to_s.chomp
  }
  return col1, col2
end
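
Combining the two columns with tabs, as mentioned in the message above, might look like this minimal sketch. The helper name join_columns and the sample arrays are made up for illustration; the two arrays stand in for the col1 and col2 values that column_arrange returns.

```ruby
# Join two column arrays line by line with a tab (hypothetical helper).
# Padding around each cell is stripped first, so only the tab separates them.
def join_columns(col1, col2)
  col1.zip(col2).map { |left, right| "#{left.to_s.rstrip}\t#{right.to_s.strip}" }
end

left  = ["first line   ", "second line  "]
right = ["  right one", "  right two"]
puts join_columns(left, right)
```

With the function above, the input would come from something like col1, col2 = column_arrange("ie.txt"), where "ie.txt" is the pdftotext output used earlier in the thread.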