PDF reader gems

Detlef_R · March 5, 2014, 6:33pm

Hi,

I want to parse a PDF document using Ruby. I found one gem, pdf-reader
(yob/pdf-reader on GitHub): the PDF::Reader library implements a PDF parser
conforming as much as possible to the PDF specification from Adobe.

Are there any other good gems with a strong API that make it easier to
parse PDF data?

Please share your opinions.

my-ruby · March 5, 2014, 6:55pm

Arup:

I use pdf-reader for several of my scripts. If you’re after just reading
the PDF, then that’s the one I’d stick with. Is there something in
particular you don’t understand?

Wayne

my-ruby · March 5, 2014, 6:59pm

Wayne B. wrote in post #1138922:

Arup:

I use pdf-reader for several of my scripts. If you’re after just reading
the PDF, then that’s the one I’d stick with. Is there something in
particular you don’t understand?

Wayne

Thanks, Wayne. I have a requirement that I need to script. I am new to
this gem and am looking through its examples; I had never used it before
and only found it online today. I want to collect all the data from a PDF
file into a CSV file.
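
A rough sketch of the PDF-text-to-CSV step, using Ruby's standard csv library. It assumes the page text has already been extracted as a string and that columns are separated by runs of two or more spaces; that is only a guess about the layout, and a real PDF may need a different rule. The helper name lines_to_csv and the sample text are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: turn lines of page text into CSV rows.
# Assumption: columns are separated by two or more whitespace characters.
def lines_to_csv(lines)
  CSV.generate do |csv|
    lines.each do |line|
      fields = line.strip.split(/\s{2,}/)
      csv << fields unless fields.empty?   # skip blank lines
    end
  end
end

# Made-up sample standing in for the text of one page.
page_text = "Name   Qty\nWidget   3\n\nGadget, large   5\n"
puts lines_to_csv(page_text.lines)
```

The csv library quotes fields that contain commas, so a value such as "Gadget, large" stays in a single column.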

my-ruby · March 5, 2014, 7:26pm

I’m not sure I fully understand… You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?

Here’s a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.

require 'pdf/reader'

File.open(@thepdffile, "rb") do |io|     # open the file in binary mode
  reader = PDF::Reader.new(io)           # reader gives access to the PDF's contents
  @counter = 0
  reader.pages.each do |page|            # a PDF is organized in pages, so walk each page
    @counter += 1
    page_text = page.text                # all the text on a single page (text only!)

    # The rest is specific to my script, but hopefully this example helps.
    @wordlist.each do |singleword|
      singleword.strip!
      if page_text.include?(singleword)
        @indv_word << singleword         # the word that was found
        @indv_page << @counter           # the page it was found on
      end
    end
  end
end

my-ruby · March 5, 2014, 7:45pm

Wayne B. wrote in post #1138927:

I’m not sure I fully understand… You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?

Here’s a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.

That is really helpful. I will start my script tonight. If I have any
trouble understanding it, I will ask here on this list.

From the docs, I found these three classes to use:

PDF::Reader
PDF::Reader::Page
PDF::Reader::ObjectHash

I hope you can help me.

Thank you, Wayne B.

my-ruby · March 5, 2014, 9:01pm

Wayne B. wrote in post #1138927:

I’m not sure I fully understand… You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?

Here’s a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.

I wrote the code below :

require 'pdf/reader'

File.open("#{dir}/a.pdf", 'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It works, but page.text returns the whole page content at once. Can I
read the page line by line?

my-ruby · March 5, 2014, 10:19pm

I don’t see anything that sticks out. You might want to post on the PDF
reader group and see what the folks there think.

https://groups.google.com/forum/#!forum/pdf-reader

Wayne

my-ruby · March 6, 2014, 4:37pm

On Wed, Mar 5, 2014 at 12:01 PM, Arup R. wrote:

File.open("#{dir}/a.pdf", 'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It is working. But text gives whole page content at a time. Can I read
the page line by line ?

‘page.text’ is just a string; you can manipulate it any way you want.
If you want “lines”, split the text on line-ending characters.
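
A minimal sketch of that splitting step. It works on a plain string so it runs without a PDF at hand; the helper name page_lines and the sample text are made up for illustration, and dropping blank lines is just one possible choice.

```ruby
# Hypothetical helper: split one page's text into trimmed, non-empty lines.
def page_lines(text)
  text.split(/\r?\n/).map(&:strip).reject(&:empty?)
end

# Made-up sample standing in for the string returned by page.text.
sample = "Name   Qty\r\n\r\n  Widget   3  \nGadget   5\n"
page_lines(sample).each { |line| puts line }

# With pdf-reader, the same helper could be applied per page:
#   reader.pages.each do |page|
#     page_lines(page.text).each { |line| puts line }
#   end
```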

Good luck,

my-ruby · March 6, 2014, 12:32pm

Wayne B. wrote in post #1138922:

Arup:

I use pdf-reader for several of my scripts. If you’re after just reading
the PDF, then that’s the one I’d stick with. Is there something in
particular you don’t understand?

@Wayne - Does a PDF file hold any XML objects internally for the data it
displays? If so, I could use Nokogiri to parse it.