worldwidenews: OCR makes short work of digitizing your docs

The file cabinet looms large in the office, yet it guards its secrets jealously...even from you. It's time to convert those papers to space-saving, easy-to-find digital documents. For that, you need a scanner to turn them into digital images and an Optical Character Recognition program to convert those images into editable and searchable documents. I took four of the latest OCR programs and a free online OCR service for a test spin. All of them work to varying degrees.

To test the programs, I ran 22 varied and not particularly clean scans of documents (including one hand-written note) through four OCR programs and one free service. I looked for accuracy in text recognition and image extraction, and for the ability to recreate the originals in a Word document. In addition, I processed 264 separate scans from a yearbook for output as a searchable PDF.

You don't actually need to install OCR software if you need to convert only a couple of small documents. You can use a free service such as Free-OCR (also known as Free-OCR.com) and upload a scan of your document. File size is limited to 2MB and 5000 pixels in any direction, which is about 150 dpi for a standard page. The OCR engine handles 29 languages, including English.
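
To see how scan resolution relates to the pixel limit mentioned above, here is a rough arithmetic sketch. The 5000-pixel figure comes from the text; the page dimensions are the standard US Letter size in inches, and the function names are made up for illustration. The 2MB limit is not modeled, since file size depends on color depth and compression.

```python
# Illustrative arithmetic only. The 5000-pixel limit is the one quoted in the
# article; the helper names below are hypothetical.
MAX_PIXELS_PER_SIDE = 5000

# US Letter page size in inches (width, height).
LETTER_INCHES = (8.5, 11.0)


def scan_dimensions(dpi, page_inches=LETTER_INCHES):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


def fits_pixel_limit(dpi, page_inches=LETTER_INCHES, limit=MAX_PIXELS_PER_SIDE):
    """True if neither side of the scan exceeds the pixel limit."""
    return max(scan_dimensions(dpi, page_inches)) <= limit


for dpi in (150, 300, 600):
    width, height = scan_dimensions(dpi)
    verdict = "fits" if fits_pixel_limit(dpi) else "too large"
    print(f"{dpi} dpi: {width} x {height} pixels ({verdict})")
```

By this estimate a letter-sized page at 150 dpi is 1275 x 1650 pixels, well inside the pixel limit, so in practice the 2MB file-size cap is likely to be the constraint you hit first.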

Free-OCR makes you jump through a CAPTCHA hoop, but does it apologetically.

Although you don't have to register or even fork over your email address, the Free-OCR site does make you fill in one of those annoying CAPTCHAs. (Thanks, Web bad guys, for making everyone's life more difficult.) However, those CAPTCHAs serve to remind one just how difficult OCR can be. If humans, with our incredible heuristic abilities, occasionally have problems with these, just think how poor straight-line software perusing a stream of bits must feel.

Free-OCR did a decent job of extracting the text from the test documents. With standard typed pages, you should have no problems as long as you don't have fancy plans for the scanned text: The site does not output files or recreate documents. It simply places the extracted text in a box for cutting and pasting. As a matter of fact, embedded graphics tend to confuse the output.

Free-OCR's text capture works well, but even simple graphics confuse the website. Free-OCR is not suitable for large jobs or overly complex documents, but when all you need is to quickly get the text out of a basic document scan, it will do nicely in most cases.

Tip: Use Microsoft Paint to shrink any image you plan to upload to roughly 150 dpi.
FreeOCR is a nice, simple front-end for the open-source Tesseract OCR engine (originally developed at HP and now maintained by Google) and is roughly the installable equivalent of the unrelated Free-OCR website. It interfaces directly with scanners in addition to importing image files, and it extracts text into a box from which you can cut and paste. The program is extremely easy to use and works well if all you want is text. It even extracts text from PDFs, though it exports only to text.
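
If you are comfortable with a little scripting, the same engine can also be driven without a front-end. The sketch below is not how FreeOCR works internally; it simply assumes the `tesseract` command-line program is installed and on your PATH, and the helper names and the file name `scan.png` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the command line for a plain-text OCR run.

    Passing "stdout" as the output base asks tesseract to print the
    recognized text instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    # Assumption: the tesseract executable is installed separately.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `extract_text("scan.png")` would then return the page's text as a string, much like the text box in FreeOCR, with no layout or images preserved.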

FreeOCR, though unrelated to Free-OCR, is just as easy to use. FreeOCR processes only one image at a time, but will OCR multi-page PDFs. And, unlike the Free-OCR website, there's no limit on file size. Also, FreeOCR can create Word and RTF documents from the text it extracts, but it's just pasted text: There's no attempt to reconstruct the document or place images.

Read this dialog box and select your options carefully when clicking through FreeOCR's installation. As far as it goes, FreeOCR is a neat little program, though it tries to install toolbars and reset your browser home page. You can install the program while cancelling and declining all offers (though it's unintuitive and the negative response buttons are grayed out). If you couldn't do that, you wouldn't be reading about FreeOCR 4.2 in this roundup.

X is a busy character these days. Not only is it still used traditionally in words, it's featured at the end of movie credits as the Roman "10," and has achieved rock star status as shorthand for eXtreme. It's even used in the name of Acrobat XI. Best of all, it can be used to illustrate one of the major tools available to OCR technologists for recognizing symbols: context, i.e., looking at what surrounds a character to help identify it.
Acrobat XI Standard is very good at leveraging context and does a bang-up job of recreating entire documents, including text, images, and layout, then outputting them as increasingly popular editable PDFs, RTF files, and Word docs. If outputting documents that look like the original is your focus, it's great. It is, however, a little less aggressive and slightly less successful at extracting text from some images than Nuance OmniPage and Abbyy FineReader, which are reviewed below.
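
To make the idea of context concrete, here is a toy sketch of one way surrounding characters can settle an ambiguous glyph. It is an illustration only, not how Acrobat or any other product reviewed here works: the look-alike tables and the majority rule are assumptions chosen for simplicity.

```python
# Toy example: resolve look-alike characters from their neighbors.
# The confusion tables below are illustrative assumptions, not a real
# OCR engine's data.
DIGIT_TO_LETTER = {"0": "O", "1": "l", "5": "S"}
LETTER_TO_DIGIT = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"}


def fix_token(token):
    """Rewrite ambiguous characters to match the rest of the token."""
    # Count only characters that cannot be confused with the other class.
    letters = sum(1 for ch in token if ch.isalpha() and ch not in LETTER_TO_DIGIT)
    digits = sum(1 for ch in token if ch.isdigit() and ch not in DIGIT_TO_LETTER)
    if letters > digits:
        # Mostly letters: treat look-alike digits as letters.
        return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)
    if digits > letters:
        # Mostly digits: treat look-alike letters as digits.
        return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in token)
    return token  # No clear context, so leave the token alone.


def fix_line(line):
    """Apply fix_token to each whitespace-separated token."""
    return " ".join(fix_token(token) for token in line.split())


print(fix_line("Ju1y 2O13"))
```

Real engines rely on much richer context, such as dictionaries and language models, but the principle is the same: a character that is ambiguous in isolation often becomes obvious once its neighbors are taken into account.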

The Readiris OCR engine gives Adobe Acrobat XI an edge in accuracy. Acrobat XI lacks the side-by-side comparison of original documents with their recreated doppelgangers that most programs offer: Word, RTF, and the like are simply saved. However, the output files are very accurate (thanks to the Readiris OCR engine), and you can fine-tune the results of PDFs in-line with the "Find OCR suspects" function.

Although Acrobat XI is primarily for PDFs, its OCR is so good that you can easily forgo an auxiliary OCR program if you buy it. But at $499 for the Pro version and $299 for Standard, it is expensive. OmniPage and FineReader perform OCR and handle the PDF basics for considerably less cash.

OmniPage 18 Standard ($150) is Acrobat XI's equal at outputting Word files and editable PDFs, and it also does a very good job of extracting pure text. By default, it's a tad aggressive at rotating images trying to find text to extract. However, you can disable this behavior in the settings dialog.

At its defaults, OmniPage 18 Standard is a tad aggressive about rotating images. But you can turn this behavior off, and it did a great job making an editable version of my high-school yearbook.

OmniPage features the side-by-side comparison editing of all types of files that Acrobat lacks, and its interface is a bit more flexible than Abbyy FineReader's, allowing you to arrange the various panes in more ways. Like Acrobat, OmniPage also provides a batch manager for automating multiple jobs.

OmniPage Ultimate 19 is the real news for this industry stalwart. The new $500 Ultimate marries OCR with the company's speech-to-text and text-to-speech technologies from Dragon NaturallySpeaking, and starts a transition to a Windows 8-style interface. Though those aren't strictly OCR features, they do mean the program is morphing into a jack-of-all-trades translation tool. Maybe Nuance will even consider handwriting OCR in the future.

Anyone who's purchased a multifunction printer or scanner recently will probably recognize the name FineReader, as the Sprint version ships with many such products. Obviously, there are deals being made, but there's no questioning that the program also does a very nice job of OCR. Text extraction is great, though it's not quite as good at recreating complex documents in Word and RTF files as Acrobat or OmniPage.

Abbyy's dual-paned interface makes it easy to compare originals to OCR. FineReader is straightforward and easy to use. The main window shows a list of images in a column to the far left, the image being processed in a pane next to it, and the OCR'd text and elements in a pane on the right side. This side-by-side arrangement, shared with OmniPage, makes it super-easy to spot mistakes and compare page elements.

Abbyy FineReader 11 Professional is fast, recognizes text in 189 languages, and outputs in a number of different formats including ePub, editable PDFs, Microsoft Word, and even open-source PDF competitor DjVu.

FineReader created a searchable PDF of my yearbook scans just fine, but like OmniPage, it was over-zealous at rotating images trying to find text until I turned off this feature. With most OCR programs, you're better off using Windows' own Photo Viewer to rotate scans to their correct orientation before OCR'ing.
While the $170 Professional Edition that I tested is slightly more expensive than OmniPage Standard, it is also available in a capable $50 Express 9 version.

OCR programs are useless unless they have digitized images to work with. For that, you need a scanner.
A fast sheet-fed scanner that scans both sides of the page simultaneously is worth its weight in gold for anyone with lots of two-sided business documents to transfer into the digital domain. HP's Scanjet Professionals or Fujitsu's ScanSnaps are available for under $500.

A garden-variety A4/letter-sized flatbed scanner such as Canon's $200 CanoScan 9000F or similar is great for small to standard-sized photos, magazine articles, etc.

For oversized books, such as atlases or the high-school yearbook used in my testing, you'll need something that handles larger documents. I used Plustek's pricey but fast $600 Optic Pro A320. Mustek makes a variety of cheaper, albeit slower large-format scanners.

Tip: Most scanners bundle competent, albeit less comprehensive, versions of the software we've covered. Grab your scanner first, then see if you need to upgrade the software.

If PDFs are your focus, then Acrobat XI is a very good choice, though you'll pay a lot for either Standard or Pro. Most users will be just fine with the less expensive OmniPage 18 Standard or Abbyy FineReader 11 Professional. OmniPage gets a slight nod if you're outputting RTF or Word documents, but it's otherwise a tie.

Tuesday, 23 July 2013
