OCR is the most common type of recognition. It is the process by which a computer program reads the image of text on a page and makes that text editable. OCR can be performed in two ways: zonal and full-text. Zonal OCR is typically used on forms, where only specific fields are of interest. Full-text OCR is used on free-form documents, such as legal briefs, to read the entire document and then prepare a searchable, full-text index of it.
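The difference between zonal and full-text processing can be sketched in a few lines of Python. This is a minimal illustration, not a real OCR engine: it assumes a recognizer has already produced words with pixel bounding boxes, and the `Word` type, the `zonal_extract` and `full_text_index` helpers, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    """One recognized word and its bounding box (hypothetical recognizer output)."""
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width, in pixels
    h: int  # height, in pixels


def zonal_extract(words, zone):
    """Zonal: return the text of words lying entirely inside zone (x, y, w, h)."""
    zx, zy, zw, zh = zone
    inside = [
        wd for wd in words
        if wd.x >= zx and wd.y >= zy
        and wd.x + wd.w <= zx + zw and wd.y + wd.h <= zy + zh
    ]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in inside)


def full_text_index(words):
    """Full-text: map each lowercased word to the positions where it occurs."""
    index = defaultdict(list)
    for position, wd in enumerate(words):
        index[wd.text.lower()].append(position)
    return dict(index)


# Hypothetical recognizer output for a small form.
words = [
    Word("Name:", 10, 10, 50, 12),
    Word("Ada", 70, 10, 30, 12),
    Word("Lovelace", 105, 10, 70, 12),
    Word("Date:", 10, 40, 45, 12),
    Word("1843", 70, 40, 40, 12),
]

name_zone = (60, 5, 130, 22)  # hypothetical zone covering the name field
print(zonal_extract(words, name_zone))  # Ada Lovelace
print(full_text_index(words)["date:"])  # [3]
```

The zonal helper discards everything outside the field of interest, while the index keeps every word so the whole document can be searched later.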
OCR output can also be presented in two forms: text-over and text-under. In text-over, the OCR data is placed over the image, so the text appears very clean and no longer looks like the original print. In text-under, the OCR data is placed underneath the image at the same x-y coordinates as the original text. Text-under is used when the original look of the document must be preserved.
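Keeping text-under data on the same x-y coordinates usually means converting each word's position from scanned-image pixels into the coordinate system of the output page. The sketch below assumes the output is a PDF-style page, where the default unit is 1/72 inch and the origin is at the bottom-left corner. The function name `pixel_box_to_page_points` and the sample numbers are illustrative assumptions, not part of any particular product.

```python
def pixel_box_to_page_points(box, image_height_px, dpi):
    """Convert a pixel box (x, y, w, h; origin top-left) to page points.

    Assumes a PDF-style page: 72 points per inch, origin at bottom-left.
    Returns (left, bottom, width, height) in points.
    """
    x, y, w, h = box
    left = x * 72.0 / dpi
    # Flip the vertical axis: image rows count down, page units count up.
    bottom = (image_height_px - (y + h)) * 72.0 / dpi
    return (left, bottom, w * 72.0 / dpi, h * 72.0 / dpi)


# A word scanned at 300 dpi on an 11-inch-tall page (3300 pixels high).
word_box = (300, 600, 150, 30)
print(pixel_box_to_page_points(word_box, image_height_px=3300, dpi=300))
```

Text drawn invisibly at the returned position sits directly behind the printed word, so selecting or searching the text highlights the matching area of the image.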
Image cleanup is also performed in the recognition step. Techniques include:
- Deskewing, despeckling, deshading, streak removal, and other basic cleanup functions
- Line removal and character reconstruction for use on forms
- Edge enhancement, which sharpens character edges to increase OCR accuracy
The purpose of image cleanup is to remove unwanted noise that can decrease the accuracy of automated recognition.
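As one illustration of the cleanup techniques listed above, a very simple form of despeckling can be written in plain Python. This is a sketch under stated assumptions: the image is already binarized (1 for ink, 0 for background), and a speck is any ink pixel with too few ink neighbors. The `despeckle` function and its `min_neighbors` parameter are hypothetical; production tools typically use more robust filters.

```python
def despeckle(image, min_neighbors=1):
    """Return a copy of a binary image with isolated ink pixels cleared.

    image is a list of rows; each pixel is 1 (ink) or 0 (background).
    An ink pixel is cleared when fewer than min_neighbors of its
    eight surrounding pixels are ink.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1:
                continue
            neighbors = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        neighbors += image[rr][cc]
            if neighbors < min_neighbors:
                cleaned[r][c] = 0
    return cleaned


noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],  # the lone pixel at the right edge is a speck
    [0, 0, 0, 0, 0, 0],
]
for row in despeckle(noisy):
    print("".join("#" if px else "." for px in row))
```

Neighbor counts are taken from the original image rather than the partially cleaned copy, so the result does not depend on the order in which pixels are visited.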