Deduplication

What is Key Generation?

Deduplication technologies do not directly compare file contents. Instead, they compare digital fingerprints known as hash values, one per file. If the hash values for two items are identical, the content of the two items is assumed to be identical. Key generation is the process of creating that hash value for a file so that files can be compared easily. File hashing and metadata hashing are the two primary methods used to generate keys during import.
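As a rough illustration of file hashing, the sketch below computes an MD5 and a SHA-1 digest for a file in a single pass and compares two files by digest rather than by content. It uses Python's standard hashlib module. The function names and the chunk size are illustrative assumptions and do not describe how CloudNine™ Explore is implemented.

```python
import hashlib
import os
import tempfile


def file_keys(path, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for the file at path, reading it once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        # Feed each chunk to both hash objects so the file is read only once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def same_content(path_a, path_b):
    """Treat two files as identical when both of their digests match."""
    return file_keys(path_a) == file_keys(path_b)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        first = os.path.join(folder, "a.txt")
        second = os.path.join(folder, "b.txt")
        for path in (first, second):
            with open(path, "wb") as handle:
                handle.write(b"quarterly report")
        print(same_content(first, second))
```

Two files with different names but identical bytes produce the same pair of digests, which is why a key comparison can stand in for a byte-by-byte comparison.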

The Explore import process currently generates two hashes in parallel for each item: MD5 and SHA-1. How the key is generated depends on the item type:

E-docs - The key value is generated using the entire file as the input.

E-mail - The key value is generated from the values of certain metadata fields. Post-processed metadata is used so that the input matches the metadata stored within CloudNine™ Explore; if the key is regenerated in the future, the value will still match the original. E-mail includes both e-mail messages contained in mail stores and loose e-mail messages. The term "loose e-mail" refers to a file that is identified as a mail item and successfully converted to a mail item by Outlook. These include .msg files, .eml files, and other RFC822-format e-mails. For Microsoft Outlook PST files, e-mail also includes non-e-mail items: calendar items, contacts, journal entries, notes, and tasks.

See Metadata Hashing Fields for a specific list of fields used by CloudNine™ Explore for generating hash values for e-mail items.
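The idea behind metadata hashing can be sketched as follows. The field list, the normalization, and the separator are hypothetical and chosen only for illustration; the fields CloudNine™ Explore actually uses are those listed in the Metadata Hashing Fields topic.

```python
import hashlib

# Hypothetical field list for illustration only; it is NOT the list
# that CloudNine Explore uses.
EXAMPLE_FIELDS = ("from", "to", "subject", "sent")


def metadata_key(item, fields=EXAMPLE_FIELDS):
    """Build an MD5 key from selected metadata values of a mail item."""
    # Trim each value so stray whitespace does not change the key.
    parts = [str(item.get(name, "")).strip() for name in fields]
    # Join with a separator that is unlikely to occur inside the values.
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


stored = {
    "from": "ann@example.com",
    "to": "bob@example.com",
    "subject": "Budget",
    "sent": "2020-01-15T09:30:00Z",
    "source": "archive.pst",
}
# The same message found as a loose file: only the container differs.
loose = dict(stored, source="budget.msg")

print(metadata_key(stored) == metadata_key(loose))  # prints True
```

Because the key depends only on the selected metadata values, the same message yields the same key whether it comes from a mail store or from a loose file, and regenerating the key later from stored metadata gives the same result.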

 

Deduplication Fields

CloudNine™ Explore stores information regarding deduplication test results and other deduplication-related data in the following exportable fields:

DupID - Groups duplicate records with their "parent" duplicate. A record determined to be a duplicate via filtering contains the same DupID value as every other record in its duplicate group. The parent duplicate stores its own ID in this field.

DupCustNames - Indicates the names of all custodians that hold duplicate versions of the original record. Populated for parent/original records only (HasDuplicate=1 and IsDuplicate=0).

DupCustPaths - Indicates the source path to each duplicate version of the original record. Populated for parent/original records only (HasDuplicate=1 and IsDuplicate=0).

DupParentName - Indicates the custodian name of the original record. Populated for duplicate records only (IsDuplicate=1).

DupParentPath - Indicates the source path of the original record. Populated for duplicate records only (IsDuplicate=1).

MD5Hash - Stores the MD5 hash value of the record. If a file is considered to be a duplicate, this value will be equal to the deduplication key.

Sha1Hash - Stores the SHA-1 hash value of the record. If a file is considered to be a duplicate, this value will be equal to the deduplication key.

Note

The deduplication fields are not searchable within CloudNine™ Explore.
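The relationship between these fields can be sketched with a small example. The input keys ID, Key, Custodian, and Path are hypothetical names used only for illustration. Storing the custodian names and paths as lists, and leaving DupID empty on records that have no duplicates, are also assumptions rather than documented behavior.

```python
def assign_dup_fields(records):
    """Fill in duplicate fields on a list of record dicts, in import order.

    Each record needs the hypothetical keys 'ID', 'Key', 'Custodian', 'Path'.
    """
    parents = {}  # deduplication key -> first record seen with that key
    for rec in records:
        parent = parents.get(rec["Key"])
        if parent is None:
            # First record with this key: a potential parent/original.
            parents[rec["Key"]] = rec
            rec.update(IsDuplicate=0, HasDuplicate=0, DupID=None,
                       DupCustNames=[], DupCustPaths=[])
        else:
            # Later record with the same key: mark it and update its parent.
            parent["HasDuplicate"] = 1
            parent["DupID"] = parent["ID"]  # the parent stores its own ID
            parent["DupCustNames"].append(rec["Custodian"])
            parent["DupCustPaths"].append(rec["Path"])
            rec.update(IsDuplicate=1, HasDuplicate=0, DupID=parent["ID"],
                       DupParentName=parent["Custodian"],
                       DupParentPath=parent["Path"])
    return records


sample = assign_dup_fields([
    {"ID": 1, "Key": "k1", "Custodian": "Ann", "Path": "ann/report.docx"},
    {"ID": 2, "Key": "k1", "Custodian": "Bob", "Path": "bob/report.docx"},
    {"ID": 3, "Key": "k2", "Custodian": "Bob", "Path": "bob/notes.txt"},
])
for rec in sample:
    print(rec["ID"], rec["IsDuplicate"], rec["HasDuplicate"], rec["DupID"])
```

In this sketch, record 1 becomes the parent (HasDuplicate=1, IsDuplicate=0), record 2 becomes its duplicate and carries the parent's custodian and path, and record 3 is unaffected.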

 

Deduplication Mode

Deduplication Mode determines the range of deduplication keys that are tested to decide whether a record is a duplicate. Two scopes are available:

Global - The Global scope will result in the deduplication keys of incoming records being tested against ALL other keys, regardless of how other records were logged.

Custodian - The Custodian scope will result in the deduplication keys of incoming records being tested against all other keys that have the same Custodian value.

In the New Case Settings options, you can choose which of the following deduplication modes is used. The two modes with Custodian in the name use the Custodian scope, while the MD5 and SHA1 modes use the Global scope.

MD5 - Uses a 128-bit hash value to deduplicate files across all sources.

SHA1 - Uses a 160-bit hash value to deduplicate files across all sources.

MD5Custodian - Uses a 128-bit hash value to deduplicate files within a single custodian.

SHA1Custodian - Uses a 160-bit hash value to deduplicate files within a single custodian.

Warning

Once the first files have been imported into the database, the Deduplication Mode cannot be changed.
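The difference between the two scopes can be pictured as a difference in what is looked up. In the minimal sketch below, a Global mode tests the hash value alone, while a Custodian mode tests the custodian together with the hash value. The record layout and function names are assumptions made for illustration, not the product's internal design.

```python
GLOBAL_MODES = {"MD5", "SHA1"}
CUSTODIAN_MODES = {"MD5Custodian", "SHA1Custodian"}


def lookup_key(mode, record):
    """Return the value a record is tested against under the given mode."""
    if mode not in GLOBAL_MODES | CUSTODIAN_MODES:
        raise ValueError(f"unknown deduplication mode: {mode}")
    # Pick the hash that the mode is based on.
    if mode.startswith("MD5"):
        hash_value = record["MD5Hash"]
    else:
        hash_value = record["Sha1Hash"]
    # Custodian modes only match records that share the same custodian.
    scope = record["Custodian"] if mode in CUSTODIAN_MODES else None
    return (scope, hash_value)


def is_duplicate(mode, record, seen):
    """Report whether the record's key was seen before, then remember it."""
    key = lookup_key(mode, record)
    if key in seen:
        return True
    seen.add(key)
    return False


ann = {"Custodian": "Ann", "MD5Hash": "aa11", "Sha1Hash": "bb22"}
bob = {"Custodian": "Bob", "MD5Hash": "aa11", "Sha1Hash": "bb22"}

for mode in ("MD5", "MD5Custodian"):
    seen = set()
    print(mode, [is_duplicate(mode, rec, seen) for rec in (ann, bob)])
```

With the same file held by two custodians, the MD5 mode flags the second copy as a duplicate, while the MD5Custodian mode does not, because the two copies belong to different custodians.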

 

Duplicate Actions

The Action options are used to limit or exclude the data stored for any records considered to be duplicates. These are adjusted on the Filters tab. Options include:

(Include) - Duplicate records are added to the case normally, including the native file, and all associated duplicate fields are set.

(Exclude) - Duplicate records are completely excluded from the case. The record is not added to CloudNine™ Explore and the native file is not copied.

 

Deduplication for Attachments

Attachments inherit the DupStatus of their parent item. This applies to all types of attachments, such as e-mail attachments, files contained in an archive file (e.g., .zip), and loose e-mail message attachments.
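Inheritance of this kind can be sketched as a walk down the attachment tree. The nested-dictionary layout, the "attachments" key, and the status value "duplicate" are hypothetical and used only for illustration; the actual DupStatus values are not described here.

```python
def inherit_dup_status(item):
    """Copy an item's DupStatus to all of its attachments, at every depth."""
    for child in item.get("attachments", []):
        child["DupStatus"] = item["DupStatus"]
        # Attachments can carry their own attachments (e.g., a zip in a mail).
        inherit_dup_status(child)
    return item


message = {
    "name": "message.msg",
    "DupStatus": "duplicate",  # hypothetical status value
    "attachments": [
        {"name": "files.zip", "attachments": [{"name": "report.docx"}]},
    ],
}
inherit_dup_status(message)

archive = message["attachments"][0]
print(archive["DupStatus"], archive["attachments"][0]["DupStatus"])
```

After the call, both the archive and the document inside it carry the status of the top-level message.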

 

Filtering by File Type

When File Type Filtering is enabled with the (Partially Exclude) File Type Filter Action, and deduplication is also enabled, duplicate checking is executed normally. Records without native files can still be flagged as duplicate parents or children.