About OCR

Optical Character Recognition (OCR) is the process of scanning native files and converting them into readable text files. During the Concordance Desktop database creation and import process, this readable text (often called OCR or OCR'ed text) is written to a designated field in the database records. Including this text in the database records makes it possible to index and search document-level data in Concordance Desktop. Documents that have not been processed through OCR often contain characters that Concordance Desktop does not support, and may therefore display unreadable results when viewed in Concordance Desktop.

For those instances where the native document files have not been OCR'ed (scanned into readable text), you can have Concordance Desktop OCR them during the import process. Because this increases import processing time, the option may suit smaller databases better than very large ones. Files that have not been OCR'ed can still be imported into or added to a database, but the native data is imported "as is," meaning that some of the data may be unreadable, and therefore not indexed or searchable in Concordance Desktop and not viewable in the viewer. If this is the case, Concordance Desktop provides an option to 'optimize' the native files so that the text is readable, can be indexed, and is searchable in Concordance Desktop, and the documents are viewable in the viewer.

For those instances where the native document files have been OCR'ed, Concordance Desktop provides a few options for adding the OCR'ed text to the records in the database:

OCR'ed text files with an OPT - If you have OCR'ed text files in one or more folders and an OPT file referencing each of those files, you can have Concordance Desktop read the OPT file to find the text files and write the text to the corresponding records as referenced in the OPT file. Concordance Desktop will not attempt to OCR the files; it will simply read the text and write it to the associated records. This option helps to minimize import processing time, as the OCR process is not run.

OCR'ed text files without an OPT - If you have OCR'ed text files but no OPT file referencing the location of the text files, you can have Concordance Desktop find the files based on the folder where they are located. This option is usually selected in conjunction with the option to have the import process create an OPT file in the Load File window. Concordance Desktop will not attempt to OCR the files; it will simply find the files and write the OCR'ed text to the corresponding records in the database. This option helps to minimize import processing time, as the OCR process is not run.

OCR'ed text included in the load file - If the OCR'ed text is included in the load file (DAT file), Concordance Desktop provides a check box that enables the import process to read the text from the load file and write it to the designated OCR field in the database records. Concordance Desktop will not attempt to OCR the files; it will simply read the text and write it to the associated records. This option helps to minimize import processing time, as the OCR process is not run.

OCR path included in the load file - If the DAT file references an OCR path, the new import option allows you to import document-level text. The OCR path must be edited to reflect either the directory where the text is located or a relative path. When using a relative path in the DAT file, the edited copy of the DAT file must be in the same directory as your text folders.

Example of a full text path:  

C:\Cowco_Data\OCR\Vol001\ABC0001.TXT

Example of a relative text path:  

.\Vol001\ABC0001.txt
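
The difference between the two path styles can be illustrated with a short Python sketch. This is not part of Concordance Desktop; the function name resolve_ocr_path and the DAT file name Cowco.dat are hypothetical, and the sketch only assumes, as stated above, that a relative OCR path is read relative to the folder containing the edited DAT file.

```python
from pathlib import PureWindowsPath


def resolve_ocr_path(dat_file: str, ocr_path: str) -> PureWindowsPath:
    """Return the full location that an OCR path entry in a DAT file points to.

    dat_file: full path of the edited DAT file (hypothetical example name below).
    ocr_path: the OCR path value as written in the DAT file.
    """
    entry = PureWindowsPath(ocr_path)
    if entry.is_absolute():
        # A full text path is used exactly as written.
        return entry
    # A relative text path is resolved against the folder holding the DAT file,
    # which is why the DAT file must sit in the same directory as the text folders.
    return PureWindowsPath(dat_file).parent / entry


if __name__ == "__main__":
    dat = r"C:\Cowco_Data\OCR\Cowco.dat"  # hypothetical DAT file location
    print(resolve_ocr_path(dat, r".\Vol001\ABC0001.txt"))
    print(resolve_ocr_path(dat, r"C:\Cowco_Data\OCR\Vol001\ABC0001.TXT"))
```

Under these assumptions, both calls point at the same text file in the Vol001 folder, which is a quick way to sanity-check an edited DAT file before importing.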

About the OCR field

Before creating a database with a load file, we recommend that you ensure there is at least one designated OCR field in the database. Note that an OCR field can contain up to 12 million characters. When the import process reaches the 12 million character limit, the remaining text overflows into the next OCR field. If the database does not include another OCR field, the import process creates one, giving it the same name with a number appended (example: OCR2). For each additional OCR field needed, the number is incremented by one (OCR2, OCR3). Only two additional fields can be automatically created by the import process (OCR2 and OCR3). If more fields are required, you will need to add them using the Customize feature in the Load File window.
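
The overflow behavior can be pictured with the following Python sketch. It is a simplified model for illustration only: the function name split_into_ocr_fields is hypothetical, and the sketch assumes a plain cut at the character limit, whereas the exact point at which Concordance Desktop breaks the text is not documented here.

```python
OCR_FIELD_LIMIT = 12_000_000  # maximum characters per OCR field, as stated above


def split_into_ocr_fields(text, base_name="OCR", limit=OCR_FIELD_LIMIT):
    """Model how long text spills from one OCR field into the next.

    Returns a dict mapping field names (OCR, OCR2, OCR3, ...) to text chunks.
    Assumption: the text is cut exactly at the limit.
    """
    # Slice the text into limit-sized chunks; empty text still occupies one field.
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)] or [""]
    fields = {}
    for number, chunk in enumerate(chunks, start=1):
        # The first field keeps the base name; later fields get a numeric suffix.
        name = base_name if number == 1 else f"{base_name}{number}"
        fields[name] = chunk
    return fields


if __name__ == "__main__":
    # A tiny limit keeps the demonstration readable.
    for name, chunk in split_into_ocr_fields("abcdefg", limit=3).items():
        print(name, repr(chunk))
```

If this model yields more than three fields for a document, the fields beyond OCR3 would have to be added manually through the Customize feature, since the import process creates only OCR2 and OCR3 automatically.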

Preparation for database creation

Before creating a load file database:

Make sure that an OCR field (e.g., OCR, TEXT, or TXT) exists in the load file - if not, use the Customize feature in the Load File window to add it.

OCR fields can be named anything, as long as you reference the correct name when selecting the OCR field in the Load File window, so the OCR'ed text appears in the correct field in the database records.

Make sure that all OCR fields are set up as 'Paragraph' type fields, to ensure that the data can be indexed and searched.

Make sure that the file paths pointing to the text files are accurate.

Before adding documents to an existing load file database:

Make sure that the OCR fields in the corresponding Concordance Desktop database are set up properly (e.g., OCR1, OCR2, OCR3)

Make sure that there are enough OCR fields to support all the text - each OCR field can hold a maximum of 12 million characters

Make sure that the OCR fields are set up as 'Paragraph' type fields to ensure that the data is indexed and searchable

Make sure that you have read/write access to the OCR fields
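
To judge whether a database has enough OCR fields, the required number can be estimated from the character count of each document's text. The Python sketch below is a rough planning aid, not a Concordance Desktop feature; the function names and the sample document keys are hypothetical, and the only figure taken from this topic is the 12 million character limit per field.

```python
import math

OCR_FIELD_LIMIT = 12_000_000  # maximum characters per OCR field


def ocr_fields_needed(char_count, limit=OCR_FIELD_LIMIT):
    """Return how many OCR fields a document with char_count characters needs."""
    # Round up, and count at least one field even for empty text.
    return max(1, math.ceil(char_count / limit))


def documents_exceeding(doc_lengths, available_fields, limit=OCR_FIELD_LIMIT):
    """Return {document key: fields needed} for documents that do not fit."""
    result = {}
    for key, length in doc_lengths.items():
        needed = ocr_fields_needed(length, limit)
        if needed > available_fields:
            result[key] = needed
    return result


if __name__ == "__main__":
    # Hypothetical character counts per document key.
    lengths = {"ABC0001": 5_000_000, "ABC0002": 40_000_000}
    # With three OCR fields available, ABC0002 needs a fourth.
    print(documents_exceeding(lengths, available_fields=3))
```

A document reported by documents_exceeding signals that more OCR fields should be added to the database before the documents are imported.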

Tips and Tricks for Importing OCR Files

If your OCR files aren’t importing properly, check the following:

Verify file names and extensions

Consider using a file-renaming utility or a DOS batch file (.BAT extension) if the file names do not match the key

Check your .LOG file to verify whether each OCR file successfully loaded

Your OCR text files can reside in the same directory as the image files. They do not need to be separated. You will need to import your OCR for each volume where it resides.
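
The file name check described above can also be scripted outside Concordance Desktop. The following Python sketch is only an illustration: the function name compare_keys_to_text_files is hypothetical, the list of record keys is assumed to have been obtained separately (for example, from the load file), and the comparison assumes each text file is named after its record key and that case does not matter.

```python
from pathlib import Path


def compare_keys_to_text_files(keys, text_dir, extension=".txt"):
    """Compare record keys with the OCR text file names in one folder.

    Returns (missing, unmatched):
      missing   - keys that have no text file with a matching name
      unmatched - text file names that correspond to no key
    Names are compared without the extension and without regard to case.
    """
    stems = {
        path.stem.upper()
        for path in Path(text_dir).iterdir()
        if path.is_file() and path.suffix.lower() == extension
    }
    wanted = {key.upper() for key in keys}
    return sorted(wanted - stems), sorted(stems - wanted)


if __name__ == "__main__":
    # Hypothetical keys; checks the current folder for matching .txt files.
    missing, unmatched = compare_keys_to_text_files(["ABC0001", "ABC0002"], ".")
    print("Keys without a text file:", missing)
    print("Text files without a key:", unmatched)
```

Files reported as unmatched are candidates for renaming, and the check can be repeated for each volume folder that holds OCR text.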