Detection of OCR Quality on Patents

Overview

Optical Character Recognition (OCR) works by scanning source documents and performing character analysis on the resulting images to convert them into ASCII text, which can then be stored and manipulated electronically. The character recognition process is not perfect, and errors often occur. These errors reduce the effectiveness of information retrieval algorithms that rely on exact matches between query terms and document terms.

The goal of this project was to identify a strategy for assessing the quality of a document obtained via an OCR process, and to assign a score (a quality coefficient) to each patent document. The coefficient indicates the quality of an OCR result. It is obtained by computing statistical models for a gold standard of manually pre-processed documents and for the document in question, and then comparing the two.
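
As an illustration only, the comparison of a document against a gold standard could be sketched as follows. The text above does not say which statistical models the project used, so everything here is an assumption: the sketch models each text as a distribution over character bigrams and turns the Jensen-Shannon divergence between the two distributions into a score between 0 and 1. The function names and the sample strings are hypothetical.

```python
import math
from collections import Counter


def char_ngram_distribution(text, n=2):
    """Return the relative frequency of each character n-gram in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    divergence = 0.0
    for gram in set(p) | set(q):
        pi, qi = p.get(gram, 0.0), q.get(gram, 0.0)
        mi = 0.5 * (pi + qi)  # mixture probability, always > 0 here
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence


def quality_coefficient(document, gold_standard, n=2):
    """Hypothetical score in [0, 1]; higher means closer to the gold standard.

    This is an assumed scoring rule, not the one used in the project.
    """
    p = char_ngram_distribution(document, n)
    q = char_ngram_distribution(gold_standard, n)
    if not p or not q:
        return 0.0  # nothing to compare
    # Clamp to absorb floating-point rounding at the ends of the range.
    return max(0.0, min(1.0, 1.0 - js_divergence(p, q)))


if __name__ == "__main__":
    # Made-up sample strings; a real gold standard would be a large corpus.
    gold = "a method for processing a semiconductor wafer comprising the steps of"
    clean = "a method for processing a semiconductor wafer"
    noisy = "a rnethod f0r pr0cess1ng a serniconduct0r wafcr"
    print("clean:", quality_coefficient(clean, gold))
    print("noisy:", quality_coefficient(noisy, gold))
```

In this sketch a text whose character statistics match the gold standard scores close to 1, while typical OCR confusions (such as "rn" for "m" or "0" for "o") introduce bigrams that are rare or absent in clean text and lower the score. In practice the gold-standard model would be estimated from many manually pre-processed documents rather than a single string.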

Project Partners

University of Massachusetts Amherst, US

IRF
