OCR Options

You can specify various optical character recognition (OCR) options by doing one of the following:

•From the main window, on the Tools menu, click Options and then click the OCR tab.

•From the Batch Processing utility, on the Options menu, click OCR Settings.

OCR Engine

CloudNine™ LAW supports two OCR engines:

•ExperVision OpenRTK

•Is included with the LAW installer.

•Is limited to one instance per machine.

•ABBYY FineReader

•Is installed separately from LAW and can be obtained from CloudNine when purchasing the ABBYY OCR license.

•Supports Chinese, Japanese, and Korean (CJK) languages.

•Can create PDF/A files.

•Able to run multiple instances on a single PC, one instance per CPU core.
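The instance limits above can be expressed as a small planning helper. This is a minimal sketch, not part of LAW: the function name `max_ocr_instances` and the engine name strings are assumptions chosen for illustration, and the core count is passed in (or read from the operating system) rather than queried from LAW.

```python
import os
from typing import Optional

EXPERVISION = "ExperVision OpenRTK"
ABBYY = "ABBYY FineReader"


def max_ocr_instances(engine: str, cpu_cores: Optional[int] = None) -> int:
    """Return how many OCR instances one machine can run for an engine.

    Hypothetical helper: it only encodes the limits stated in this topic.
    """
    # Fall back to the OS-reported core count when none is supplied.
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    if engine == EXPERVISION:
        # ExperVision is limited to one instance per machine.
        return 1
    if engine == ABBYY:
        # ABBYY can run one instance per CPU core.
        return max(1, cores)
    raise ValueError(f"Unknown OCR engine: {engine}")


if __name__ == "__main__":
    print(max_ocr_instances(EXPERVISION, cpu_cores=8))
    print(max_ocr_instances(ABBYY, cpu_cores=8))
```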

Version 6.18+

Version 6.18 introduced the ability to endorse in multiple colors. When text is endorsed in a Silver, White, Lime, Aqua, Gray, or Yellow font on a white background and then OCRed with ExperVision, the endorsed text in those colors is not extracted.
ABBYY is able to OCR all of these colors except for White and Yellow.

Both engines support the creation of searchable PDF files and are able to produce <image>.ocr files, which are used in the Storm and IPRO applications for highlighting search hits on images.

Page Layout

These options improve OCR accuracy by specifying the layout of the pages.

•Auto Detect - Automatically determines the layout of the page. This is the default option.

•Single Column - Specifies that one column of text exists on a page.

If email thread analysis will be performed on the case, the Page Layout setting must be Single Column. For existing cases with Auto Detect already selected, change the Page Layout setting to Single Column, re-run the OCR process, and then re-run the Near-Duplicate & Email Thread Analysis to accurately capture the case's email thread information.

For the unsupported Xerox TextBridge engine, two special page layouts are available that can be used to OCR pages that are broken into even quadrants: "Quadrants, left to right" and "Quadrants, top to bottom." Both settings will OCR the page as if it were 4 separate pages condensed onto a single page. The left to right setting will OCR the 4 quadrants in the following order: upper left, lower left, upper right, lower right. The top to bottom setting will process in the following order: upper left, upper right, lower left, lower right.
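The two quadrant orderings can be illustrated with a short sketch. It assumes a page split into four equal quadrants measured in pixels; the function name `quadrant_boxes` and the `(left, top, right, bottom)` box convention are illustrative assumptions and do not reflect how the TextBridge engine is implemented.

```python
def quadrant_boxes(width: int, height: int, layout: str):
    """Return (name, box) pairs in the order each quadrant would be OCRed.

    Hypothetical helper; boxes are (left, top, right, bottom) in pixels.
    """
    half_w, half_h = width // 2, height // 2
    quadrants = {
        "upper left": (0, 0, half_w, half_h),
        "upper right": (half_w, 0, width, half_h),
        "lower left": (0, half_h, half_w, height),
        "lower right": (half_w, half_h, width, height),
    }
    # Processing orders exactly as described for the two layout settings.
    orders = {
        "Quadrants, left to right": [
            "upper left", "lower left", "upper right", "lower right",
        ],
        "Quadrants, top to bottom": [
            "upper left", "upper right", "lower left", "lower right",
        ],
    }
    if layout not in orders:
        raise ValueError(f"Unknown quadrant layout: {layout}")
    return [(name, quadrants[name]) for name in orders[layout]]


if __name__ == "__main__":
    for name, box in quadrant_boxes(1700, 2200, "Quadrants, left to right"):
        print(name, box)
```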

Quality

Specifies the type of printing technology used to create the original documents and the print quality of the scanned pages.

•Normal - Use this for pages printed with inkjet printers, laser printers, or offset lithography. This is the default.

•Normal (Degraded) - The same as Normal, except that the print quality is known to contain some distortion or blemishes due to poorly printed originals, photocopying, heavy use, or aging.

•Dot Matrix - Use this for pages printed using dot matrix printers, which include many early printer models as well as many types of printed receipts, such as those from cash registers and ATMs.

•Dot Matrix (Degraded) - The same as Dot Matrix, except that the print quality is known to contain some distortion or blemishes due to poorly printed originals, photocopying, heavy use, or aging.

Note also the following when selecting a quality option:

•When a setting other than Normal is selected, OCR engine performance may be reduced.

•The ABBYY FineReader engine supports a quality setting called Magnetic Ink Character Recognition (MICR). This is the technology used for the routing numbers on personal checks and for other documents designed to be machine readable.

•The Auto Detect setting with Xerox TextBridge OCR accommodates varying quality levels among originals.

Language

The Language setting is used to specify the language dictionary the engine should use during the OCR process. If the correct language is not selected prior to the OCR process, the characters may not be recognized properly.

If ABBYY FineReader is the selected engine, English is automatically used as a second language when a non-English language is selected. For example, if Greek is selected and the source image contains both Greek and English, ABBYY FineReader differentiates the languages and performs recognition for both. However, if a document contains Greek and English is selected as the language, the Greek characters will not be interpreted or rendered correctly in the text. This pertains only to documents containing Unicode characters, such as Chinese, Japanese, Korean, Greek, or Russian. Languages whose alphabets share many characters (for example, English, Spanish, French, German, Dutch, and Portuguese) are interpreted correctly when they appear in the same document, provided that any one of these languages is selected.

The current ABBYY installer is available for download at http://www.imagecap.com/installs/ABBYYEngine11.rev1.exe.

Languages added to LAW will be available for selection from the Language list under the OCR settings dialogue. LAW will be able to recognize the OCR language during the OCR process and provide the ability to accurately search for the language characters in the OCR text.

For the unsupported Xerox TextBridge engine, the System Default setting uses whichever language is specified by Windows as the default.

Output Format

This feature is used to select the output format produced by the selected OCR engine. The available output formats and licensing requirements are in the following table:

Output format	License required
Smart Text Document	OCR (ExperVision OpenRTK)
Standard Text Document	OCR (ExperVision OpenRTK) -OR- OCR (ABBYY FineReader)
HTML	OCR (ExperVision OpenRTK) -OR- OCR (ABBYY FineReader)
Word for Windows	OCR (ExperVision OpenRTK) -OR- OCR (ABBYY FineReader)
Word for Windows (2007)	OCR (ABBYY FineReader)
WordPerfect	OCR (ExperVision OpenRTK)
Adobe PDF (Normal)	OCR (ExperVision OpenRTK) + OCR (ExperVision PDF add-on) -OR- OCR (ABBYY FineReader)
Adobe PDF (w/ Hidden Text)	OCR (ExperVision OpenRTK) + OCR (ExperVision PDF add-on) -OR- OCR (ABBYY FineReader)
Adobe PDF/A (Normal)	OCR (ABBYY FineReader)
Adobe PDF/A (w/ Hidden Text)	OCR (ABBYY FineReader)

If planning to export OCR results for searching functionality, using one of the text settings is recommended as most export formats do not support non-text OCR.

The Smart Text Document and Standard Text Document formats are essentially the same; both produce standard ANSI text output. See Creating Searchable PDFs for more information on using the Adobe output options to create searchable PDF files.
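The licensing table above can be captured as a lookup, which makes it easy to check whether a chosen engine can produce a given output format. This is a minimal sketch that only transcribes the table; the names `FORMAT_ENGINES`, `NEEDS_PDF_ADDON`, and `required_licenses` are assumptions for illustration and are not part of LAW.

```python
EXPERVISION = "ExperVision OpenRTK"
ABBYY = "ABBYY FineReader"

# Which engines can produce each output format, per the table above.
FORMAT_ENGINES = {
    "Smart Text Document": {EXPERVISION},
    "Standard Text Document": {EXPERVISION, ABBYY},
    "HTML": {EXPERVISION, ABBYY},
    "Word for Windows": {EXPERVISION, ABBYY},
    "Word for Windows (2007)": {ABBYY},
    "WordPerfect": {EXPERVISION},
    "Adobe PDF (Normal)": {EXPERVISION, ABBYY},
    "Adobe PDF (w/ Hidden Text)": {EXPERVISION, ABBYY},
    "Adobe PDF/A (Normal)": {ABBYY},
    "Adobe PDF/A (w/ Hidden Text)": {ABBYY},
}

# Formats for which ExperVision also needs the PDF add-on license.
NEEDS_PDF_ADDON = {"Adobe PDF (Normal)", "Adobe PDF (w/ Hidden Text)"}


def required_licenses(output_format: str, engine: str) -> list:
    """Return the licenses needed to produce output_format with engine."""
    engines = FORMAT_ENGINES.get(output_format)
    if engines is None:
        raise ValueError(f"Unknown output format: {output_format}")
    if engine not in engines:
        raise ValueError(f"{engine} cannot produce {output_format}")
    licenses = [f"OCR ({engine})"]
    if engine == EXPERVISION and output_format in NEEDS_PDF_ADDON:
        licenses.append("OCR (ExperVision PDF add-on)")
    return licenses


if __name__ == "__main__":
    print(required_licenses("Adobe PDF (Normal)", EXPERVISION))
    print(required_licenses("Adobe PDF/A (Normal)", ABBYY))
```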

Page Markers

This option allows LAW to "stamp" the resulting OCR with a Bates number or page value using information retrieved directly from the LAW database. This feature is useful for providing 100% accurate Bates values in the OCR text to aid searching in certain applications.

Page Markers can be customized via the law50.ini file located in the C:\Program Files (x86)\Law50 directory. Adding a PageStampText= key under the [OCR] section customizes the text stamped by the Page Marker feature. Currently supported fields are:

&[Page] - Current page

&[Pages] - Page count

&[Page ID] - Bates number

&[BegDoc#] - Beginning document number

<CR> - Carriage return (new line)

Example

The following page marker: PageStampText=###&[Page]|||Page &[Page ID]^^^

Results in a stamp of: ###1|||Page ABC0001^^^

This value increments for each OCR page stamped.
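The field substitution in the example above can be sketched in a few lines. This is an illustration of the documented behavior, not LAW's own code: the function names `load_template` and `render_stamp` are hypothetical, the INI text is parsed with Python's standard `configparser` rather than LAW's parser, and mapping <CR> to "\n" is an assumption about the exact line-ending used.

```python
import configparser

# Sample law50.ini content, using the page marker from the example above.
INI_TEXT = """\
[OCR]
PageStampText=###&[Page]|||Page &[Page ID]^^^
"""


def load_template(ini_text: str) -> str:
    """Read the PageStampText value from the [OCR] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the key's original casing
    parser.read_string(ini_text)
    return parser["OCR"]["PageStampText"]


def render_stamp(template: str, page: int, pages: int,
                 page_id: str, beg_doc: str) -> str:
    """Substitute the supported fields into a page marker template."""
    fields = {
        "&[Page]": str(page),    # current page
        "&[Pages]": str(pages),  # page count
        "&[Page ID]": page_id,   # Bates number
        "&[BegDoc#]": beg_doc,   # beginning document number
        "<CR>": "\n",            # assumed line ending for a carriage return
    }
    for token, value in fields.items():
        template = template.replace(token, value)
    return template


if __name__ == "__main__":
    template = load_template(INI_TEXT)
    print(render_stamp(template, page=1, pages=10,
                       page_id="ABC0001", beg_doc="ABC0001"))
```

Each token includes its closing bracket, so &[Page] is never confused with &[Pages] or &[Page ID] during replacement.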

Auto-Rotate

This option specifies if the OCR engine should automatically rotate images for the OCR output. The three options are:

•Always ON

•Always OFF

•Binary Images Only - Auto-rotates monochrome (black and white) images. This option can help to prevent color and grayscale images that have little or no text from being improperly rotated. This setting is available with the ExperVision engine.

Overwrite existing files

Use this setting to prevent or allow the replacement of existing OCR text. Leaving it cleared is useful when some documents already have usable OCR text files and only the documents without an existing text file should be processed. If an existing text file is detected for the current document, the OCR engine skips the document and moves on to the next, saving processing time. At times it may be necessary to replace all existing text files; selecting this option replaces the OCR for each document.
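The skip-or-replace behavior described above amounts to a simple filter. The following sketch assumes each document is represented by an ID mapped to the path of its existing OCR text file, or None when no text file exists; the function name `select_for_ocr` and this data shape are hypothetical and are not how LAW stores its records.

```python
from typing import Dict, List, Optional


def select_for_ocr(existing_text: Dict[str, Optional[str]],
                   overwrite: bool) -> List[str]:
    """Return the document IDs that would be sent to the OCR engine.

    existing_text maps a document ID to its current OCR text file path,
    or None when the document has no text file yet.
    """
    return [
        doc_id
        for doc_id, text_file in existing_text.items()
        # With overwrite on, every document is processed again; otherwise
        # only documents without an existing text file are processed.
        if overwrite or text_file is None
    ]


if __name__ == "__main__":
    docs = {"ABC0001": "ABC0001.txt", "ABC0002": None, "ABC0003": None}
    print(select_for_ocr(docs, overwrite=False))
    print(select_for_ocr(docs, overwrite=True))
```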

Retain page layout

This setting determines whether the layout of the page (columns, etc.) will be preserved in the OCR results (non-text output formats only).

Create PDF thumbnails

The OCR engines automatically create thumbnails during the Searchable PDF creation process. Use this setting to set the "visible" property of the thumbnails when opening the PDF file in Adobe Acrobat. If this setting is checked, the thumbnails will be viewable automatically in Adobe Acrobat; otherwise, the thumbnails will be hidden under the Pages tab in Adobe Acrobat.

Reset text index status

Clearing the Reset text index status check box will prevent LAW from re-flagging the document for indexing after the OCR process is performed. This means the OCR text for affected records will not be searchable in CloudNine™ LAW. See the Full Text Indexing topic for more information.

Auto deskew

Enable this option to force the OCR engine to deskew the image before OCRing the document. This can often lead to more accurate OCR (depending on the type of document). However, if the document contains graphics or angled vertical lines, the deskew feature may align to these graphics and cause unexpected results. Disabling this option will OCR the document with its current orientation. This feature is only available if the ExperVision OCR engine is selected.

Retain pictures

This setting determines whether pictures in the original will be preserved in the OCR results. This setting does not affect the results if the output format is set to text. Pictures are not retained in text files.

Create web optimized PDF

This setting applies only to the ExperVision OpenRTK OCR engine and to the Adobe PDF (Normal) and Adobe PDF (w/ Hidden Text) output formats. When the Create web optimized PDF check box is selected, the output PDF files have the Fast Web View setting enabled. Fast Web View allows a web server to deliver a PDF one page at a time instead of requiring the entire file to be downloaded first.

Need additional help? E-mail the CloudNine™ LAW Technical Support team at: lawsupport@cloudnine.com, or contact a support representative at 713-462-6464 for CloudNine™ LAW Ext. 12 or CloudNine™ Explore Support Ext. 13. The Technical Support team is available between the hours of 9:00 A.M. and 7:00 P.M. Eastern Time, Monday - Friday.

Copyright © 2024 CloudNine™. All rights reserved.