Since I started using a document scanner about seven years ago, I’ve scanned
many thousands of pages and used OCR (optical character recognition) software to
convert those scans into searchable PDFs. I’ve also written extensively about
the paperless office. But when you try to reduce the amount of paper you use,
you inevitably increase the amount of hard-drive space you use. I began to
wonder which combinations of scanner settings and software would produce the
best-quality scans while using the least hard-disk space.
What sparked my investigation was a claim that some OCR apps increase the
file sizes of scanned images dramatically, whereas others (Acrobat Pro in
particular) shrink them. When you plan to store and read scanned documents on an
iOS device, compactness is especially important. Unfortunately, Adobe’s $499
Acrobat Pro XI can no longer be driven externally by AppleScript, which means it
requires tedious manual clicking to perform OCR. Were other OCR apps really
inflating file sizes, and was there any way around this problem without
resorting to Acrobat?
Hundreds of experiments later, I came up with some surprising results. Read
on for all the details or skip to the “So, where’s the sweet spot?” section for
the bottom line.
When you initially save a scanned document as a PDF file, you get nothing
more than a bitmapped image in a PDF wrapper. Your scanner’s software most
likely has settings to determine the resolution of the scans in dpi (dots per
inch), the color mode (black and white, grayscale, or color), and the amount of
compression applied to the scanned image. All those settings affect not only the
appearance of the scan but also the quality of information the OCR engine has to
work with. Once OCR software recognizes the text in a PDF, it saves that text in
an invisible layer along with the image so you can see what the document
originally looked like, but can also search, select, and copy its text.
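To get a feel for how resolution and color mode drive file size before any compression is applied, here is a rough back-of-the-envelope sketch in Python. The function name, the US Letter page size, and the bit depths (1 bit for black and white, 8 for grayscale, 24 for color) are assumptions chosen for illustration, not settings taken from any particular scanner, and the calculation ignores row padding and PDF overhead.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size, in bytes, of one scanned page image."""
    width_px = round(width_in * dpi)      # pixels across the page
    height_px = round(height_in * dpi)    # pixels down the page
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits // 8                # whole bytes, ignoring row padding


# Assumed bit depths for the three common color modes.
BITS_PER_PIXEL = {"black and white": 1, "grayscale": 8, "color": 24}

if __name__ == "__main__":
    # A US Letter page (8.5 x 11 inches) at two common resolutions.
    for dpi in (300, 600):
        for mode, bits in BITS_PER_PIXEL.items():
            size = raw_scan_bytes(8.5, 11, dpi, bits)
            print(f"{dpi} dpi, {mode}: {size / 1_000_000:.1f} MB uncompressed")
```

Real files are much smaller than these figures because the scanner software compresses the image, but the relative proportions show why color mode and resolution matter so much.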
Besides recognizing the text, OCR software may downsample the image
(decrease its resolution, so that it takes up less space) or change the
compression used. Sometimes these features are user-configurable; in other
cases, they’re hardwired. Acrobat Pro has yet another option—a feature called
ClearScan that replaces all the bitmapped text with a custom font (which takes
up much less space), and then swaps out the original image for one with a much
lower resolution. ClearScan nearly always results in the smallest possible PDF,
but it may not be the best choice if you want to be sure your scanned image
looks exactly like the original, even when printed. In addition, using ClearScan
means settling for Acrobat’s OCR engine, about which I’ll say more in a moment.
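Downsampling saves space because the pixel count falls with the square of the resolution. The short Python sketch below illustrates that arithmetic only; the function name is hypothetical, and it says nothing about how any particular OCR app actually resamples or recompresses an image.

```python
def downsampled_pixel_fraction(original_dpi, target_dpi):
    """Fraction of the original pixels left after downsampling a scan."""
    if original_dpi <= 0 or target_dpi <= 0:
        raise ValueError("resolutions must be positive")
    ratio = target_dpi / original_dpi
    # Both width and height shrink by the same ratio, so the pixel
    # count shrinks by the ratio squared.
    return ratio * ratio


if __name__ == "__main__":
    for original, target in ((600, 300), (300, 150), (600, 150)):
        fraction = downsampled_pixel_fraction(original, target)
        print(f"{original} dpi -> {target} dpi keeps {fraction:.1%} of the pixels")
```

Halving the resolution leaves only a quarter of the pixels, which is why even modest downsampling can shrink a scanned PDF considerably, at the cost of detail in the visible image.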