I have many scanned pages which I’d like to cut to prepare them
for OCR.
There are two things I’d like to do:
1.) Cut off a header of each page containing the page number,
2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:
Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
Chapter1’s text     Chapter1’s text
   <---- cut here, at this blank
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
Chapter2’s text     Chapter2’s text
                 ^
                 |
                 +--- (Then cut vertically)
I have tried converting my pages, which are A4 at 600 dpi, to pixel
arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?
If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:
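The snippet that belonged here is missing from this copy of the thread, so what follows is only a minimal sketch of the idea, not the poster's code. It assumes the pixel data is an 8-bit grayscale string with one byte per pixel, stored row by row. The names text_spans and white_line?, and the threshold and max_dark parameters, are made up for illustration.

```ruby
# Hypothetical stand-in for #is_white: a scanline counts as white when
# fewer than max_dark of its bytes are darker than threshold.
def white_line?(line, threshold: 128, max_dark: 50)
  line.each_byte.count { |b| b < threshold } < max_dark
end

# Returns [first_row, last_row] pairs, one per run of non-white scanlines.
# Assumes pixels holds width * height bytes, one byte per pixel.
def text_spans(pixels, width, height, **opts)
  spans = []
  start = nil
  height.times do |y|
    line = pixels.byteslice(y * width, width)
    if white_line?(line, **opts)
      # A white line closes the span that was open, if any.
      if start
        spans << [start, y - 1]
        start = nil
      end
    else
      # A non-white line opens a span unless one is already open.
      start ||= y
    end
  end
  spans << [start, height - 1] if start
  spans
end
```

Each returned pair is a block of text rows. If the page number sits in the first span, dropping that span cuts off the header.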
For finding larger spans of white, keep track of the white
scanlines seen previously. #is_white could well be something
that returns true if fewer than 50 pixels on a scanline are non-white,
or some such.
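Keeping track of the white scanlines could look roughly like this. It is a sketch under the assumption that each scanline has already been classified as white or not (by something like #is_white), so the input is just an array of booleans. The name largest_blank is hypothetical.

```ruby
# Returns [first_row, last_row] of the longest run of white scanlines,
# or nil when no scanline is white. On a tie the earliest run wins.
def largest_blank(white_flags)
  best = nil
  start = nil
  white_flags.each_with_index do |white, y|
    if white
      start ||= y # remember where the current white run began
      best = [start, y] if best.nil? || (y - start) > (best[1] - best[0])
    else
      start = nil # a non-white line ends the current run
    end
  end
  best
end
```

The middle of the returned range is a reasonable place to cut. The top and bottom page margins are usually the longest white runs, so those rows probably need to be excluded before calling it.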
When it comes to document imaging, the problem with most image
processing kits, including/especially ImageMagick, is that they insist
on using a 32-bit per pixel memory representation for all images.
Fine for your 200x200 web GIF, but No Fun when your ~4MB B&W scanned
page suddenly expands to 140MB in memory.
You can try NArray[1] (8x expansion is better than 32x), or use a kit
that can handle 1-bit images in memory, e.g. Leptonica[2].
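The numbers above can be checked with a bit of arithmetic. This sketch assumes an A4 page of 8.27 x 11.69 inches scanned at 600 dpi, which is what the original question describes. The helper name a4_bytes is made up for illustration.

```ruby
# Bytes needed to hold one A4 page in memory at a given bit depth.
# Assumes 8.27 x 11.69 inches; dpi defaults to the 600 from the question.
def a4_bytes(bits_per_pixel, dpi: 600)
  width  = (8.27 * dpi).round
  height = (11.69 * dpi).round
  width * height * bits_per_pixel / 8.0
end

# 1 bit is the packed B&W scan, 8 bits is one byte per pixel,
# 32 bits is the representation complained about above.
[1, 8, 32].each do |bits|
  puts format("%2d bits per pixel: %6.1f MB", bits, a4_bytes(bits) / 1_000_000)
end
```

That works out to a bit over 4 MB at 1 bit per pixel and close to 140 MB at 32 bits per pixel, which matches the figures quoted.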
I have tried converting my pages, which are A4 at 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?
to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there’s no help there.
Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they’d be much faster. Offhand I can’t think of any such
methods, but then I’m not very clever at that sort of thing.