Cut pages for OCR with RMagick?

Dear all,

I have many scanned pages which I’d like to cut to prepare them
for OCR.
There are two things I’d like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:

Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
                                       <---- cut here, at this blank
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
                 ^
                 |
                 +-- (Then cut vertically)

I have tried to convert my pages, which are A4 and 600 dpi, to pixel
arrays, but this is quite slow. Is there a better method, e.g. using
to_blob?

Thank you very much,

Axel

On 9/29/07, Axel E. wrote:

Dear all,

I have many scanned pages which I’d like to cut to prepare them
for OCR.
There are two things I’d like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:

If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:

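scanline_bytes = image_width * bytes_per_pixel
# /m lets "." match newline bytes too; /n treats the data as raw bytes
scanlines = pixels.scan(/.{#{scanline_bytes}}/mn)
chapters = [[]]
scanlines.each{|sl|
  if is_white(sl)
    # a white scanline closes the current chapter, if it has any content
    chapters << [] unless chapters.last.empty?
  else
    chapters.last << sl
  end
}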

For finding larger spans of white, keep track of the white
scanlines seen previously. #is_white can well be something
that returns true if fewer than 50 pixels on a scanline are
non-white, or somesuch.
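
The bookkeeping of white scanlines described above could be sketched roughly like this. It is only an illustration, assuming one byte per pixel with 255 meaning white; the names white_scanline? and largest_white_run and the thresholds are made up for the example.

```ruby
# Assumption: one byte per pixel, 255 = white (e.g. a grayscale export).
WHITE = 255

# A scanline counts as white if it has fewer than max_dark non-white
# pixels, which tolerates a few scanner specks.
def white_scanline?(scanline, max_dark = 50)
  scanline.each_byte.count { |b| b != WHITE } < max_dark
end

# Returns [start_row, length] of the longest run of consecutive white
# scanlines, or nil if no scanline is white.
def largest_white_run(scanlines, max_dark = 50)
  best = nil
  run_start = nil
  scanlines.each_with_index do |sl, row|
    if white_scanline?(sl, max_dark)
      run_start ||= row
      len = row - run_start + 1
      best = [run_start, len] if best.nil? || len > best[1]
    else
      run_start = nil
    end
  end
  best
end

# Tiny synthetic page: 4 pixels wide, rows 2..4 are blank.
white = WHITE.chr * 4
dark  = 0.chr * 4
page  = [dark, dark, white, white, white, dark, white, dark]
p largest_white_run(page, 1)   # prints [2, 3]
```

On a real page the top and bottom margins are white runs too, so they would probably have to be skipped before picking the chapter break.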

To crop the margins off the chapter scanlines:

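# smallest offset of a non-white byte from the left edge, over all scanlines
left_border = chapter_scanlines.map{|sl| sl =~ /#{non_white_pixel}/ }.compact.min
left_border -= left_border % bytes_per_pixel
# same, but measured from the right edge (the scanline is reversed)
right_border = chapter_scanlines.map{|sl| sl.reverse =~ /#{non_white_pixel}/ }.compact.min
right_border -= right_border % bytes_per_pixel

chapter_scanlines.map!{|sl| sl[left_border...(sl.size - right_border)] }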

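The middle whitespace can be had by looking for a long run of white
pixels (tune the magic number to signify enough pixels to not be a
character space; white_pixel is the pattern for a single white pixel):

left_border = chapter_scanlines.map{|sl| sl =~ /(?:#{white_pixel}){20}/ }.compact.max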

and with reversed scanline for right border.
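
If the per-scanline regexp turns out to be too fragile, the middle whitespace could also be located by counting dark pixels per column and taking the widest empty stretch between the outer margins. Note that this swaps in a column-projection approach rather than the regexp method above. It is a sketch only, assuming one byte per pixel with 255 as white; widest_blank_columns is a hypothetical helper name.

```ruby
WHITE_BYTE = 255  # assumption: one byte per pixel, 255 = white

# Returns the range of columns forming the widest interior stretch in
# which no scanline has a dark pixel, or nil. Outer margins are ignored.
def widest_blank_columns(scanlines)
  width = scanlines.first.bytesize
  dark = Array.new(width, 0)
  scanlines.each do |sl|
    sl.each_byte.with_index { |b, x| dark[x] += 1 if b != WHITE_BYTE }
  end
  first = dark.index { |n| n > 0 }
  last  = dark.rindex { |n| n > 0 }
  return nil if first.nil?   # completely blank block
  best = nil
  start = nil
  (first..last).each do |x|
    if dark[x].zero?
      start ||= x
      best = (start..x) if best.nil? || x - start + 1 > best.size
    else
      start = nil
    end
  end
  best
end

# Two synthetic 9-pixel scanlines with a two-pixel gutter at columns 4..5.
w = WHITE_BYTE.chr
d = 0.chr
rows = [
  w + d + d + w + w + w + d + d + w,
  w + d + w + d + w + w + w + d + w,
]
p widest_blank_columns(rows)   # prints 4..5
```

With noisy scans the test dark[x].zero? would likely need to become a small threshold, in the same spirit as the 50-pixel tolerance for scanlines.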

</imaging regexps for fun and profit>

HTH,

Tim H. writes:

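Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they’d be much faster.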

When it comes to document imaging, the problem with most image
processing kits, including/especially ImageMagick, is that they insist
on using a 32-bit per pixel memory representation for all images.
Fine for your 200x200 web GIF, but No Fun when your ~4MB B&W scanned
page suddenly expands to 140MB in memory.

You can try NArray[1] (8x expansion is better than 32x), or use a kit
that can handle 1-bit images in memory, e.g. Leptonica[2].
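
For the A4, 600 dpi pages from the original question, the expansion can be checked with a few lines of arithmetic. This is just a back-of-the-envelope sketch; it assumes the standard A4 size of 8.27 x 11.69 inches, and page_bytes is a name made up for the illustration.

```ruby
# Rough in-memory size of an A4 page scanned at 600 dpi, by bit depth.
A4_INCHES = [8.27, 11.69]   # standard A4 size in inches
DPI = 600

def page_bytes(bits_per_pixel, dpi = DPI)
  width, height = A4_INCHES.map { |inches| (inches * dpi).round }
  width * height * bits_per_pixel / 8
end

{ 1 => "1-bit B&W", 8 => "8-bit gray", 32 => "32-bit" }.each do |bits, label|
  printf("%-12s %6.1f MB\n", label, page_bytes(bits) / 1_000_000.0)
end
```

That comes to a bit over 4 MB at 1 bit per pixel, about 35 MB at 8 bits, and about 139 MB at 32 bits, in line with the figures above.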

Steve

[1] http://narray.rubyforge.org/
[2] http://www.leptonica.com/

Axel E. wrote:

I have tried to convert my pages, which are A4 and 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?

to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there’s no help there.

Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they’d be much faster. Offhand I can’t think of any such
methods, but then I’m not very clever at that sort of thing.

You might try asking the ImageMagick gurus (see the Legacy ImageMagick
Discussions Archive) if there’s a way to do it with the command-line
utilities. If so, you can usually translate the commands and options
into RMagick methods. See “Magick Command Options and Their Equivalent
Methods” in the RMagick 1.15.0 documentation for help with that.