To Access and Configure the Text Extraction Settings
1. On the File menu, select Import and then click Electronic Discovery.
2. Click the Settings tab and then click Text Extraction. The Text Extraction options display.
3. Configure text extraction options as needed:
Option |
Description |
---|---|
Enable Text Extraction |
Enables text extraction from applicable files during an ED Loader import session. You must select this option to access the other text extraction options. |
Include metadata in extracted text |
Includes any available document properties, such as Author and Title, in the extracted text file. |
Enable binary scanning in text extraction |
Forces all file types to be scanned for text by overriding the Ext. Text flag in the File Type Manager (see File Types). If you enable this option, you should also enable the Validate extracted text option. |
Validate extracted text |
Scans each text file for readable text. Use this option to filter text files containing only form feed and other control characters. Text files that do not contain readable text are marked as invalid and discarded. |
Identify hidden text content |
Detects specific kinds of hidden text in Word, Excel, and PowerPoint documents. Hidden text can take these forms:
• Text hidden inside shape controls, such as text boxes.
• Text formatted as Hidden.
• Hidden spreadsheets, spreadsheet columns, and spreadsheet cells.
• Hidden slides.
If hidden text is detected, the following actions occur:
• The hidden text is added to the start of the extracted text for the document under the Text tab.
• The hidden text appears at the top of the document in a section marked with <<< START HIDDEN CONTENT >>> and << END HIDDEN CONTENT >>>. Page numbers within the hidden content area can help you determine the context for the hidden text.
• The HiddenText field for the document is assigned a Y.
Note: Enabling this option may decrease import speeds. |
Identify language content |
Identifies the first 5 languages used in each document. When you create a new case, or open a case that was created in a previous version of CloudNine™ LAW, a field named Language is created to store language identifiers. Note the following:
• Single-word sentences are not evaluated.
• A language with only one occurrence in the document is ignored.
• After the first language is identified, the second through fifth languages must each have a hit percentage of at least 15% of the document text to be added to the Language field.
• Enabling this option may decrease import speeds. To improve speeds, try the Limit content analysis to first _ KB of file option. |
Limit content analysis to first _ KB of file |
Restricts content analysis to the specified number of kilobytes at the start of each file. Limiting analysis can improve import performance. |
Restrict language identification to common languages |
Limiting identification to common languages can help to improve accuracy. Note the following:
• This setting is enabled by default and persists between sessions.
• When this setting is enabled, only the following languages are detected: Arabic, Bengali, Chinese (Simplified and Traditional), Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Japanese, Korean, Latin, Norwegian Bokmål, Norwegian Nynorsk, Polish, Portuguese, Russian, Spanish, Swedish.
• If a supported language is not encountered, a value of Unknown is returned. This value might also be returned if only a small amount of content is available.
• When multiple languages are detected, they are delimited with a semicolon. |
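To make the Validate extracted text behavior more concrete, the following Python sketch shows one way such a check could work outside the product. It is not LAW's implementation. The function name `has_readable_text` and the definition of "readable" (at least one character that is neither whitespace nor a Unicode control or format character) are assumptions made for illustration.

```python
import unicodedata


def has_readable_text(text: str) -> bool:
    """Return True if the text contains at least one readable character.

    Assumption for this sketch: a character is "readable" when it is not
    whitespace and not in a Unicode "C" category (control, format, etc.).
    """
    return any(
        not ch.isspace() and not unicodedata.category(ch).startswith("C")
        for ch in text
    )


# A file holding only form feeds and other control characters is invalid.
print(has_readable_text("\f\f\x00\n"))      # False
# A file with real content mixed with form feeds is valid.
print(has_readable_text("\fInvoice 42\n"))  # True
```

A check like this can be useful for spot-checking text files produced with binary scanning enabled, where files that yield only control characters are expected.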
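The rules described for the Language field can be sketched in a few lines of Python. This is an illustration of the documented rules only, not the product's detection logic. The input (one detected language per evaluated sentence), the function name `language_field`, measuring the hit percentage as a share of evaluated sentences, and using a bare `;` with no space as the delimiter are all assumptions.

```python
from collections import Counter
from typing import Sequence

MAX_LANGUAGES = 5           # at most five languages are stored
SECONDARY_THRESHOLD = 0.15  # 15% minimum for the second through fifth language


def language_field(sentence_languages: Sequence[str]) -> str:
    """Build a Language field value from per-sentence language guesses.

    Hypothetical input: one language name per evaluated sentence
    (single-word sentences are assumed to be excluded already).
    """
    total = len(sentence_languages)
    # Ignore any language that occurs only once in the document.
    counts = {
        lang: n for lang, n in Counter(sentence_languages).items() if n > 1
    }
    if not counts:
        return "Unknown"

    # Rank by hit count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    selected = [ranked[0][0]]  # the first language has no threshold
    for lang, n in ranked[1:]:
        if len(selected) == MAX_LANGUAGES:
            break
        if n / total >= SECONDARY_THRESHOLD:
            selected.append(lang)
    return ";".join(selected)


print(language_field(["English"] * 8 + ["French"] * 2))   # English;French
print(language_field(["English"] * 18 + ["German"] * 2))  # English
print(language_field([]))                                 # Unknown
```

In the second call, German accounts for only 10% of the evaluated sentences, so it falls below the 15% threshold and is left out of the field value.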