Text Extraction


To Access and Configure the Text Extraction Settings

1. On the File menu, select Import and then click Electronic Discovery.

2. Click the Settings tab and then click Text Extraction. The Text Extraction options display.

Text Extraction options on the Settings tab

3. Configure text extraction options as needed:

Option

Description

Enable Text Extraction

Enables text extraction from applicable files during an ED Loader import session.

You must select this option to access any other options for text extraction.

Include metadata in extracted text

Includes any available document properties, such as Author and Title, in the extracted text file.

Enable binary scanning in text extraction

Forces all file types to be scanned for text.

This option works by overriding the Ext. Text flag in the File Type Manager (see File Types).

If you enable this option, it is recommended that you also enable the Validate extracted text option.

Validate extracted text

Scans each text file for readable text.

Use this option to filter out text files that contain only form feeds and other control characters. Text files that do not contain readable text are marked as invalid and discarded.
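The kind of check described above can be sketched in Python. This is only an illustration, not the product's implementation: the function name has_readable_text is hypothetical, and the rule that "readable" means at least one letter or digit is an assumption, since this topic does not state the exact criteria CloudNine LAW applies.

```python
import unicodedata


def has_readable_text(text: str) -> bool:
    """Return True if the text contains at least one letter or digit.

    Assumption: "readable" means any Unicode letter (category L*) or
    number (category N*). Form feeds and other control characters
    (category Cc) never count as readable.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in text)


if __name__ == "__main__":
    # A file holding only form feeds, NULs, and line breaks would be invalid.
    print(has_readable_text("\f\f\x00\n"))
    # A file with real content between form feeds would be kept.
    print(has_readable_text("\fInvoice 42\f"))
```

A check like this can be useful when you review extracted text outside of CloudNine LAW, for example to confirm why a particular text file was discarded.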

Identify hidden text content

Detects specific kinds of hidden text in Word, Excel, and PowerPoint documents.

Hidden text can be found in these forms:

Text that is hidden inside shape controls, such as text boxes.

Text formatted as Hidden.

Hidden spreadsheets, spreadsheet columns, and spreadsheet cells.

Hidden slides.

If hidden text is detected, the following actions occur:

The hidden text is added to the start of the extracted text for the document under the Text tab.

The hidden text appears at the top of the document in a section that begins with the marker <<< START HIDDEN CONTENT >>> and ends with the marker << END HIDDEN CONTENT >>>. Page numbers within the hidden content area can help you determine the context for the hidden text.

The HiddenText field for the document is set to Y.

Note

Enabling this option may decrease import speeds.
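If you process extracted text outside of CloudNine LAW, you may want to separate the hidden content section from the rest of the text. The following Python sketch shows one way to do that. The function name split_hidden_content is hypothetical, and the pattern deliberately accepts either two or three angle brackets on each marker, because the exact marker spelling is taken from this topic as printed and is not otherwise verified.

```python
import re
from typing import Tuple

# Assumption: each marker is wrapped in two or three angle brackets,
# matching the marker text as it is printed in this topic.
_HIDDEN_BLOCK = re.compile(
    r"<{2,3}\s*START HIDDEN CONTENT\s*>{2,3}"
    r"(.*?)"
    r"<{2,3}\s*END HIDDEN CONTENT\s*>{2,3}",
    re.DOTALL,
)


def split_hidden_content(extracted_text: str) -> Tuple[str, str]:
    """Return (hidden_text, visible_text).

    hidden_text is an empty string when no hidden content section is found.
    """
    match = _HIDDEN_BLOCK.search(extracted_text)
    if match is None:
        return "", extracted_text
    hidden = match.group(1).strip()
    # Remove the whole marked section and keep everything around it.
    visible = (extracted_text[: match.start()] + extracted_text[match.end():]).strip()
    return hidden, visible


if __name__ == "__main__":
    sample = (
        "<<< START HIDDEN CONTENT >>>\n"
        "secret note\n"
        "<< END HIDDEN CONTENT >>>\n"
        "Visible body"
    )
    print(split_hidden_content(sample))
```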

Identify language content

Identifies the first 5 languages used in each document.

When you create a new case or when you open a case that was created using a previous version of CloudNine™ LAW, a field named Language is created that stores language identifiers.

Note the following:

Single word sentences are not evaluated.

A language with only one occurrence in the document is ignored.

After the first language is identified, each of the second through fifth languages must account for at least 15% of the document text to be added to the Language field.

Enabling this option may decrease import speeds. To improve speeds, try using the Limit content analysis to first _ KB of file option.
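The five-language limit and the 15% rule can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the product's detection logic: the function name select_languages is hypothetical, the input is assumed to be a list of (language, share of document text) pairs in the order the detector ranks them, and the first entry is assumed to be accepted without a threshold.

```python
from typing import List, Tuple

MAX_LANGUAGES = 5   # "first 5 languages" from this topic
MIN_SHARE = 0.15    # 15% rule for the second through fifth languages


def select_languages(ranked: List[Tuple[str, float]]) -> List[str]:
    """Pick the languages that would be stored, given ranked (name, share) pairs.

    Assumption: the first ranked language is always kept; every later
    language needs a share of at least MIN_SHARE of the document text.
    """
    selected: List[str] = []
    for name, share in ranked:
        if len(selected) == MAX_LANGUAGES:
            break
        if not selected or share >= MIN_SHARE:
            selected.append(name)
    return selected


if __name__ == "__main__":
    detected = [("English", 0.55), ("French", 0.20), ("German", 0.10), ("Spanish", 0.15)]
    # German falls below the 15% threshold, so it is left out.
    print(select_languages(detected))
```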

Limit content analysis to first _ KB of file

Limits language analysis to the specified number of kilobytes at the start of each file. Limiting analysis can improve import performance.
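Conceptually, this option means that only the leading portion of each file is examined. The following Python sketch shows the idea. The function name read_leading_kilobytes is hypothetical, and treating 1 KB as 1,024 bytes is an assumption; this topic does not state how CloudNine LAW measures the limit.

```python
import io
from typing import BinaryIO

BYTES_PER_KB = 1024  # assumption: 1 KB is counted as 1,024 bytes


def read_leading_kilobytes(stream: BinaryIO, limit_kb: int) -> bytes:
    """Return at most limit_kb kilobytes from the start of a binary stream."""
    if limit_kb <= 0:
        raise ValueError("limit_kb must be a positive number")
    return stream.read(limit_kb * BYTES_PER_KB)


if __name__ == "__main__":
    # A 5,000-byte stand-in for a file; only the first 2 KB are analyzed.
    document = io.BytesIO(b"x" * 5000)
    print(len(read_leading_kilobytes(document, 2)))
```

A smaller limit reduces the amount of text analyzed per file, which is why it can speed up imports. Languages that appear only later in a long file might not be detected.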

Restrict language identification to common languages

Limiting identification to common languages can help to enhance accuracy.

Note the following:

This setting is enabled by default and persists between sessions.

When this setting is enabled, only the following languages are detected:

 Arabic

 Bengali

 Chinese: Simplified Chinese and Traditional Chinese

 Czech

 Danish

 Dutch

 English

 Finnish

 French

 German

 Greek

 Hebrew

 Hindi

 Hungarian

 Italian

 Japanese

 Korean

 Latin

 Norwegian Bokmål

 Norwegian Nynorsk

 Polish

 Portuguese

 Russian

 Spanish

 Swedish

If a supported language is not encountered, a value of Unknown is returned. This value might also be returned if only a small amount of content is available.

When multiple languages are detected, they are delimited with semicolons.
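If you export the Language field and work with it in a script, the semicolon-delimited value can be split as shown in the following Python sketch. The function name parse_language_field is hypothetical, and the sample values are illustrative. The sketch trims whitespace around each entry because this topic does not state whether a space follows each semicolon.

```python
from typing import List


def parse_language_field(value: str) -> List[str]:
    """Split a semicolon-delimited Language field value into language names."""
    return [part.strip() for part in value.split(";") if part.strip()]


if __name__ == "__main__":
    print(parse_language_field("English;French"))
    # A document with no recognized language carries the single value Unknown.
    print(parse_language_field("Unknown"))
```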