-------- Original Message --------
Date: Sun, 23 Aug 2009 22:05:23 +0900
From: Arun K. [email protected]
To: [email protected]
Subject: Re: Parsing pdf files
If you are not able to see a two-column output in your text editor (since
it's probably more than 80 characters per line), please reduce the font size
of your text editor (or use a large monitor).
Observe how there's space between the two columns of output. This was done by
copying from evince to gedit or emacs.
Dear Arun,
I suppose this is due to the fact that pdftotext (and gedit as well)
converts tabs to spaces.
Also, the good impression you get in gedit depends a lot on using a
monospaced font.
Admittedly, a very quick hack such as
    text = IO.read("ie.txt")
    text.gsub!(/ {8}/, "\t")   # every run of eight spaces becomes one tab
    text.gsub!(/ {2,}/, "")    # leftover runs of two or more spaces are dropped
    File.open("temp_out.txt", "w") { |f| f.puts text }
doesn’t give very nice results, so some additional fiddling is
necessary.
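To illustrate why the hack falls short, here is a small sketch that wraps its two substitutions in a helper and runs it on gaps of different widths. The name quick_hack and the sample strings are made up for this illustration. A gap narrower than eight spaces is removed outright, so the two columns run together.

```ruby
# quick_hack is a hypothetical helper: it applies the same two
# substitutions as the hack, but to a single string.
def quick_hack(line)
  line.gsub(/ {8}/, "\t").gsub(/ {2,}/, "")
end

p quick_hack("left" + " " * 10 + "right")  # => "left\tright"
p quick_hack("left" + " " * 16 + "right")  # => "left\t\tright"
p quick_hack("left" + " " * 5 + "right")   # => "leftright"
```

The number of tabs also depends on the gap width, so the second column will not line up from one row to the next.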
I once wrote some code (below) to separate two-column text. You can then
recombine the two columns with tabs.
Best regards,
Axel
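def column_arrange(txt_file)
  text = IO.readlines(txt_file).map(&:chomp)
  # A column gap is assumed to be a run of two or more spaces followed by
  # text; the pattern as posted, / +[^ ]/, also matches every single space
  # between words and may have lost a space in transit.
  reg = / {2,}[^ ]/
  # For one line, collect the index at which the text after each gap starts.
  gap_ends = lambda { |line|
    ends = []
    pos = 0
    while (m = reg.match(line, pos))
      ends << m.end(0) - 1
      pos = m.end(0)
    end
    ends
  }
  ref = text.flat_map { |line| gap_ends.call(line) }
  # no gaps anywhere: there is only one column
  return text, text.map { "" } if ref.empty?
  # The median of these positions is where most lines start their second column.
  cut_most_columns_here = ref.sort[ref.length / 2]
  col1 = []
  col2 = []
  text.each { |line|
    # there might be several longer sequences of whitespace in a line
    whites_ind = gap_ends.call(line)
    cut_here = if whites_ind.empty?
                 cut_most_columns_here
               else
                 # use the gap closest to the usual cut position
                 whites_ind.min_by { |x| (x - cut_most_columns_here).abs }
               end
    col1 << line[0...cut_here]
    # a line shorter than the cut position has no second column
    col2 << line[cut_here..-1].to_s
  }
  return col1, col2
end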
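Combining the two columns with tabs, as suggested above, could look roughly like the following sketch. The helper join_columns and the sample arrays are hypothetical and not part of the posted code; the sketch only assumes two arrays of strings such as the ones column_arrange returns.

```ruby
# join_columns is a hypothetical helper, not part of the posted code.
# It pairs up the lines of the two columns and joins each pair with a tab,
# dropping the padding spaces left at the end of each piece.
def join_columns(col1, col2)
  rows = [col1.length, col2.length].max
  (0...rows).map { |i|
    "#{col1[i].to_s.rstrip}\t#{col2[i].to_s.rstrip}"
  }
end

# Sample data standing in for the return value of column_arrange.
col1 = ["alpha   ", "beta    "]
col2 = ["one", "two"]
puts join_columns(col1, col2)
```

A missing entry in either array is treated as an empty string, so columns of unequal length do not raise an error.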