Is there any RubyGem available for converting PDF files to XML files?

Hi,

Is there any RubyGem available for converting PDF files to XML
files?

Look at rubygems.org. There's at least one gem that converts PDF to HTML,
but I've not used it.

Wayne

Arup:

I did install the PDF to HTML gem and have to say it's pretty impressive!
It's all based on the pdf2htmlEX project:

https://github.com/coolwanglu/pdf2htmlEX/tree/master/src

(It's basically just a nice Ruby wrapper, so you have to have pdf2htmlEX
installed.) But this gem actually opens up a whole new world of
possibilities.

In combination with something like Nokogiri, you should be able to extract
almost all the data you want from the generated HTML. However, this means
you'll need to brush up on your CSS selectors and/or XPath to query it
with Nokogiri.

On Mac OS X, it was pretty easy to install the pdf2htmlEX toolset. For
Windows, somebody has already done the compiling for you here:
http://soft.rubypdf.com/software/pdf2htmlex-windows-version

Good luck!

FYI, there is a Google Group for the pdf2htmlEX toolset. If you choose to
use it, you're going to be better off asking any further questions about
the toolset there rather than on this list, since this list is strictly
for Ruby-related things.

Wayne

Wayne B. wrote in post #1139024:

Arup:

I did install the PDF to HTML gem and have to say it's pretty impressive!
It's all based on the pdf2htmlEX project:

https://github.com/coolwanglu/pdf2htmlEX/tree/master/src

Wayne

Thanks for your reply. I was also looking at
https://github.com/kitplummer/pdftohtmlr/blob/master/lib/pdftohtmlr.rb

But the issue is that if the PDF has any blank column values, it does not
generate any corresponding tag for those entries. So I couldn't track
which data actually belongs under which column.
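
One possible workaround, sketched below, is to place each text fragment in a
column by its horizontal position instead of by tag order, so a missing tag
shows up as an empty cell. This assumes the converter's XML carries `top` and
`left` attributes on each `text` element; the sample data, the `TOLERANCE`
value, and the exact-match row grouping are all assumptions for illustration,
not something taken from the gem.

```ruby
require "rexml/document"

# Made-up sample: the second data row has no tag for the "Phone" column.
xml = <<~XML
  <page number="1">
    <text top="100" left="50">Name</text>
    <text top="100" left="200">Phone</text>
    <text top="100" left="350">City</text>
    <text top="120" left="50">Alice</text>
    <text top="120" left="350">Paris</text>
    <text top="140" left="50">Bob</text>
    <text top="140" left="200">555-0100</text>
    <text top="140" left="350">Oslo</text>
  </page>
XML

TOLERANCE = 5 # assumed allowed horizontal drift, in the file's own units

doc = REXML::Document.new(xml)
cells = REXML::XPath.match(doc, "//text").map do |node|
  { top:  node.attributes["top"].to_i,
    left: node.attributes["left"].to_i,
    text: node.text.to_s.strip }
end

# Fragments sharing the same `top` value are treated as one row.
rows = cells.group_by { |c| c[:top] }.sort_by { |top, _| top }.map { |_, row| row }

# The header row defines where each column starts.
column_lefts = rows.first.map { |c| c[:left] }.sort

# For every row, look for a fragment near each column start; nil marks a blank.
table = rows.map do |row|
  column_lefts.map do |left|
    cell = row.find { |c| (c[:left] - left).abs <= TOLERANCE }
    cell && cell[:text]
  end
end

table.each { |row| p row }
```

Real PDFs will need fuzzier row grouping and probably column boundaries
chosen by hand, but the idea of keying on position rather than tag order
stays the same.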

I will surely give the gem you linked above a try.
