Hi,
I want to parse a pdf document using Ruby. I found one gem -
https://github.com/yob/pdf-reader .
Are there any other good gems with a strong API that make parsing PDF
data easier?
Please share your opinions.
Arup:
I use pdf-reader for several of my scripts. If you’re after just reading
the PDF, then that’s the one I’d stick with. Is there something in
particular you don’t understand?
Wayne
Wayne B. wrote in post #1138922:
Arup:
I use pdf-reader for several of my scripts. If you’re after just reading
the PDF, then that’s the one I’d stick with. Is there something in
particular you don’t understand?
Wayne
Thanks Wayne. I have a requirement that I need to script. I am new to
this gem: I had never used it before and only found it online today, so
I am still looking through its examples. I want to collect all the data
from a PDF file into a CSV file.
I’m not sure I fully understand… You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?
Here’s a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.
    require 'pdf/reader'

    File.open(@thepdffile, "rb") do |io|   # open the file in binary mode
      reader = PDF::Reader.new(io)         # reader now gives access to the full contents of the PDF
      @counter = 0
      reader.pages.each do |page|          # a PDF is defined in pages, so go through each page to get the content
        @counter += 1
        page_text = page.text              # all the text on a single page (only text!)
        @wordlist.each do |singleword|     # (bunch of stuff specific to my script)
          singleword.strip!
          if page_text.include? singleword
            @indv_word << singleword       # remember the word that was found...
            @indv_page << @counter         # ...and the page it was found on
          end
        end
      end
    end
But hopefully this example helps.
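For readers who want to try the idea without the instance variables from the script above, here is a minimal sketch of the same word-to-page search. It assumes the page texts have already been collected into an array of strings (with pdf-reader, something like reader.pages.map(&:text)); the method name find_words and the sample data are made up for illustration and are not part of the gem.

```ruby
# Map each search term to the 1-based page numbers whose text contains it.
# page_texts: Array of Strings, one per page (hypothetical stand-in for page.text).
# words:      Array of Strings to look for; the match is case-sensitive.
def find_words(page_texts, words)
  hits = Hash.new { |hash, key| hash[key] = [] }
  page_texts.each_with_index do |text, index|
    words.each do |word|
      term = word.strip              # non-destructive, unlike strip! in the original
      next if term.empty?
      hits[term] << index + 1 if text.include?(term)
    end
  end
  hits
end

# Made-up sample pages standing in for a real PDF.
sample_pages = ["Invoice total due", "Shipping address", "Total weight"]
p find_words(sample_pages, ["total ", "address"])
```

Returning a hash keeps each word next to its pages, instead of two parallel arrays that have to stay in step.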
From: Arup R. [email protected]
To: [email protected]
Sent: Wednesday, March 5, 2014 11:59 AM
Subject: Re: PDF reader gems
Wayne B. wrote in post #1138927:
I’m not sure I fully understand… You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?
Here’s a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.
That would be really helpful. I will start my script tonight. If I have
any trouble understanding it, I will ask here on this list.
From the docs, I found the 3 classes below to use:
PDF::Reader
PDF::Reader::Page
PDF::Reader::ObjectHash
I hope you can help me.
Thank you, Wayne B.
I wrote the code below:
    require 'pdf/reader'

    File.open("#{dir}/a.pdf", 'rb') do |io|
      reader = PDF::Reader.new(io)
      reader.pages.each do |page|
        puts page.text
      end
    end
It is working, but text gives the whole page content at once. Can I read
the page line by line?
I don’t see anything that sticks out. You might want to post on the PDF
reader group and see what the folks there think.
https://groups.google.com/forum/#!forum/pdf-reader
Wayne
On Wed, Mar 5, 2014 at 12:01 PM, Arup R. [email protected]
wrote:
    File.open("#{dir}/a.pdf", 'rb') do |io|
      reader = PDF::Reader.new(io)
      reader.pages.each do |page|
        puts page.text
      end
    end
It is working. But text gives whole page content at a time. Can I read
the page line by line ?
‘page.text’ is just a string; you can manipulate it any way you want.
If you want “lines”, split the text on line-ending characters.
Good luck,
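As a rough sketch of the "split the text on line-ending characters" suggestion, combined with the CSV goal mentioned earlier in the thread: the code below works on plain strings, so it runs without a PDF at hand. The method names (page_lines, line_to_row, pages_to_csv) and the sample data are invented for illustration. The rule that columns are separated by two or more spaces is only an assumption; real PDFs may need a different rule.

```ruby
require 'csv'

# Split one page's text into non-empty lines, dropping trailing
# whitespace (including "\r\n" or "\n" line endings).
def page_lines(text)
  text.each_line.map(&:rstrip).reject(&:empty?)
end

# Assumption: columns in a line are separated by two or more whitespace characters.
def line_to_row(line)
  line.strip.split(/\s{2,}/)
end

# Build one CSV document from an Array of page texts.
def pages_to_csv(page_texts)
  CSV.generate do |csv|
    page_texts.each do |text|
      page_lines(text).each { |line| csv << line_to_row(line) }
    end
  end
end

# Made-up sample standing in for the strings page.text would return.
sample_pages = ["Name    Qty\n\nWidget    3\r\n", "Gadget    12\n"]
puts pages_to_csv(sample_pages)
```

With pdf-reader, the array of page texts could come from reader.pages.map(&:text) inside the File.open block shown earlier in the thread.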
@Wayne - Does a PDF file internally hold any XML objects for the data it
displays? If so, I could use Nokogiri to parse them.