Deduplication

This page is currently under construction.

Deduplication is the process of identifying duplicate files during the discovery process and removing them from further processing and analysis. Deduplication is a necessary step in managing the volume of data that must be analyzed.

A duplicate file is an exact copy of another file. Deduplication is necessary in many situations involving electronic documents because multiple identical documents are a typical feature of large record sets. For example, in electronic discovery sets containing e-mail archives for an organization, it is not uncommon for multiple e-mail accounts to contain the exact same widely distributed e-mail or file attachment.  

CloudNine™ LAW identifies duplicate files by comparing hashes of files. A hash is a numerical representation of a file whose value is based on the file contents or other attributes. In essence, a hashing algorithm is applied to the file to produce a fixed-length value that is, for practical purposes, unique to that content. An exact copy of a file yields the same hash value. For electronic documents, the entire file is hashed. For e-mail, metadata fields are hashed. You can select the hashing method (the working digest) in the deduplication settings.
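The following is a minimal sketch of hash comparison, not LAW's actual implementation. It uses Python's standard hashlib module, and the helper name file_digest and the sample byte strings are hypothetical. It shows that an exact copy produces the same hash value, while a one-character change produces a different one.

```python
import hashlib


def file_digest(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of a byte string using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Hypothetical file contents used only for illustration.
original = b"Quarterly report, final version."
exact_copy = bytes(original)
edited = b"Quarterly report, final version!"

# An exact copy hashes to the same value; an edited file does not.
same = file_digest(original) == file_digest(exact_copy)
different = file_digest(original) != file_digest(edited)
print(same, different)
```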

The scope of the project will determine whether or not deduplication will be performed and which methods will be used.

 

Note

In addition to deduplicating during the import process, LAW also allows you to deduplicate at these other times in a pre-discovery workflow:

After the import, against other records in the case, by using the Deduplication Utility.

After the import, against other records in the case and in other LAW cases, by using Inter-Case Deduplication.

 

To Configure Deduplication

1. On the File menu, click Import, and then click Electronic Discovery.

2. Click the Settings tab, and then click Deduplication. The Deduplication options display.

Deduplication options on the Settings tab

3. Choose from among the following options:

Enable Duplicate Detection.  Enables duplicate checking for the current session.
 

Working digest. The working digest is the hashing method used to identify duplicates. A hash value can be thought of as the DNA of a file. Hash values are calculated from metadata fields (e-mail) or from the entire file (e-docs). LAW supports two hashing methods:

MD5: 128-bit output

SHA-1: 160-bit output
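As a quick check of the output sizes listed above, the sketch below reads the digest size of each algorithm from Python's standard hashlib module. It is an illustration only and does not reflect how LAW computes or stores hashes.

```python
import hashlib

# digest_size is reported in bytes; multiply by 8 to get bits.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha1")
}
print(digest_bits)  # {'md5': 128, 'sha1': 160}
```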
 

Test for duplicate against (Scope). This option identifies the scope for deduplication. During the import process, deduplication can be performed at one of two levels:

Case Level (Globally). Deduplicates documents against the entire incoming collection and against existing records in the LAW case.

Custodian Level. Deduplicates documents against records with identical custodian values.
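The difference between the two scopes can be sketched as follows. This is a simplified illustration, not LAW's implementation: the Record type, the find_duplicates function, and the sample values are hypothetical, and the sketch assumes that the first record seen with a given hash is treated as the original.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    custodian: str
    digest: str  # hash value of the file or e-mail metadata


def find_duplicates(records: List[Record], scope: str = "case") -> List[bool]:
    """Flag each record as a duplicate (True) or not (False).

    scope="case" compares hashes across all records;
    scope="custodian" compares only within the same custodian.
    """
    seen = set()
    flags = []
    for rec in records:
        key = rec.digest if scope == "case" else (rec.custodian, rec.digest)
        flags.append(key in seen)
        seen.add(key)
    return flags


records = [
    Record("alice", "aaa"),
    Record("bob", "aaa"),
    Record("alice", "aaa"),
    Record("bob", "bbb"),
]

case_flags = find_duplicates(records, "case")
custodian_flags = find_duplicates(records, "custodian")
print(case_flags)       # [False, True, True, False]
print(custodian_flags)  # [False, False, True, False]
```

In this sketch, Bob's copy of the file is a duplicate at the case level but not at the custodian level, because no earlier record for Bob has the same hash.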
 

If record is considered a duplicate then (Action). This setting determines the action to take once a duplicate is located. Three options are available:

Include. Creates a record for the duplicate in the database and copies the native file into the case folder.

Partially exclude. Creates a record in the database but does not copy the native file.

Exclude. Does not create a record, extract text, or copy the native file to the case folder.
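The three actions can be summarized as a lookup of what happens to a duplicate. The sketch below is illustrative only; the Action enum and the outcome function are hypothetical names and are not part of LAW.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    INCLUDE = "include"
    PARTIALLY_EXCLUDE = "partially exclude"
    EXCLUDE = "exclude"


def outcome(action: Action) -> Tuple[bool, bool]:
    """Return (creates_record, copies_native_file) for a duplicate."""
    return {
        Action.INCLUDE: (True, True),             # record and native file
        Action.PARTIALLY_EXCLUDE: (True, False),  # record only
        Action.EXCLUDE: (False, False),           # neither
    }[action]


for action in Action:
    print(action.value, outcome(action))
```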
 

Include attachment hashes in e-mail metadata hash. When enabled, the ED Loader includes the hashes of attached files in the parent e-mail's metadata hash. When disabled, the metadata hash incorporates the Attach field, which contains only the file names of attached files.
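The effect of this setting can be sketched as follows. This is a simplified illustration under stated assumptions: the email_hash function, the metadata fields, and the way values are combined are all hypothetical, and the fields and format that LAW actually hashes are not described here. The sketch shows two e-mails with identical metadata and identically named attachments whose contents differ.

```python
import hashlib
from typing import Dict, List, Tuple


def email_hash(
    metadata: Dict[str, str],
    attachments: List[Tuple[str, bytes]],
    include_attachment_hashes: bool,
) -> str:
    """Hash e-mail metadata plus either attachment hashes or attachment names."""
    h = hashlib.md5()
    for field in sorted(metadata):
        h.update(f"{field}={metadata[field]}\n".encode("utf-8"))
    for name, content in attachments:
        if include_attachment_hashes:
            # Setting enabled: attachment content affects the parent hash.
            h.update(hashlib.md5(content).hexdigest().encode("ascii"))
        else:
            # Setting disabled: only the attachment file name is used.
            h.update(name.encode("utf-8"))
    return h.hexdigest()


meta = {"from": "a@example.com", "subject": "Budget"}
draft = [("budget.xlsx", b"draft numbers")]
final = [("budget.xlsx", b"final numbers")]

# Names only: the two e-mails look identical.
names_only_equal = email_hash(meta, draft, False) == email_hash(meta, final, False)
# Attachment hashes included: the e-mails are distinguished.
with_hashes_equal = email_hash(meta, draft, True) == email_hash(meta, final, True)
print(names_only_equal, with_hashes_equal)  # True False
```

Because the two modes produce different hash values for the same e-mail, hashes generated under one mode cannot be compared with hashes generated under the other.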

 

Warning

Note the following warnings prior to running a deduplication session:

Enabling the Include attachment hashes in e-mail metadata hash setting is recommended. However, changing this setting during the course of a case is not advisable because doing so alters the e-mail hashing schema, as noted in the interface. Determine the desired state of this setting before the first import into a new case, and do not change it afterward. This setting was not available in versions prior to 5.5.07.

If the current case has already been deduplicated via the Inter-Case Deduplication utility, a warning will appear (see below) when starting the ED Loader import if deduplication is enabled.

Use of the ED Loader deduplication on imported records after the case has already been deduplicated against other cases using the Inter-Case Deduplication utility is not recommended. Doing so will present a mixture of internal and external duplicates and could cause problems when purging, filtering, or reviewing duplicate records.

Proceeding with the ED Loader deduplication after the case has been deduplicated with the Inter-Case Deduplication utility places the external deduplication database in Rebuild/Flush mode. At this point, the current case should be removed from the external database. Also, before running the internal deduplication, it is recommended that you run the Deduplication Status Reset command to clear the values assigned by the Inter-Case Deduplication utility. This prevents a mixture of internal and external duplicates.

 

Enable hashing of non-email Outlook items. Determines whether hash values are generated for non-email items in an Outlook PST file during the ED Loader import.
 

Non-email PST items include Microsoft Outlook calendar items, contacts, journal entries, notes, and tasks. When the check box is selected, deduplication is performed on the non-email items in a PST file.
 

By default, the check box is selected for cases created in LAW version 6.9.x or later. For cases created in LAW version 6.8.x or earlier, the check box is not selected by default.