This topic explains the deduplication process implemented in CloudNine™ LAW and the related Inter-Case Deduplication utility. Items discussed include:
•Key generation and how it relates to the deduplication of e-mail compared to electronic documents.
•A list of deduplication-related fields and possible values.
•Brief explanations of deduplication options and their resulting output.
Deduplication technologies do not directly compare file contents. Instead, the hash values of the files, which act as digital fingerprints, are compared. If the hash values of two different items are identical, the content of the two items is assumed to be identical as well. Key generation, then, refers to the process of creating a hash value for a file so that files can be easily compared.
File hashing and metadata hashing are the two primary methods used by the ED Loader and Turbo Import for generating keys. Currently, the ED Loader and Turbo Import generate two output hashes in parallel: MD5 and SHA-1. In the ED Loader and Turbo Import options, you can choose which of these is used in the deduplication process.
•E-docs. The key value is generated using the entire file as the input.
•E-mail. The key value is generated from certain metadata fields after the metadata has been processed. Post-processed metadata is used so that the input matches the metadata stored in CloudNine™ LAW; if the key is regenerated in the future, the value matches the original.
E-mail includes both e-mail messages contained in mail stores and loose e-mail messages. The term "loose e-mail" refers to a file that is identified as a mail item and successfully converted to a mail item by Outlook. These include .msg files, .eml files, and other RFC822-format e-mails. For Microsoft Outlook PST files, e-mail also includes non-email items: calendar items, contacts, journal entries, notes, and tasks.
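The two key generation methods can be sketched in Python as follows. This is a minimal illustration, not LAW's implementation: the function names, the hypothetical field names (From, Subject, DateSent), and the way metadata values are joined before hashing are all assumptions made for the example.

```python
import hashlib


def file_keys(data: bytes) -> dict[str, str]:
    """E-doc style key: hash the entire file content with MD5 and SHA-1."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def email_keys(metadata: dict[str, str], fields: list[str]) -> dict[str, str]:
    """E-mail style key: hash selected, already-processed metadata values.

    Assumption: values are joined with a unit separator and UTF-8 encoded.
    The actual field list and serialization used by LAW are not shown here.
    """
    joined = "\x1f".join(metadata.get(name, "") for name in fields)
    return file_keys(joined.encode("utf-8"))


# Hypothetical field names, for illustration only.
fields = ["From", "Subject", "DateSent"]
first = {"From": "a@example.com", "Subject": "Hi", "DateSent": "2020-01-01", "Folder": "Inbox"}
second = {"From": "a@example.com", "Subject": "Hi", "DateSent": "2020-01-01", "Folder": "Sent"}

# Same hashed metadata gives the same keys, even though "Folder" differs.
same_email_key = email_keys(first, fields) == email_keys(second, fields)
```

Because only the selected metadata values feed the hash, two copies of a message stored in different places can still produce matching keys, while e-docs match only when their bytes are identical.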
Electronic Discovery in LAW uses the fields shown in the following table to generate the hash values for e-mail, loose e-mail items, and Outlook PST non-email items in CloudNine™ LAW:
CloudNine™ Explore uses the fields shown in the following table to generate the hash values for e-mail items in CloudNine™ Explore:
LAW stores information regarding deduplication test results and other deduplication-related data in the following nine fields:
•DupStatus: Indicates the duplicate state of the document. This is the primary field used for differentiating duplicate items from non-duplicate items. DupStatus is a single-character field and contains one of the following values:
  •U - Indicates the record was not tested (not deduplicated).
  •N - Indicates the record was tested and was not determined to be a duplicate at the selected scope (Global or Custodian level).
  •G - Indicates the record is a global-level duplicate.
  •C - Indicates the record is a custodian-level duplicate.
  •P - Indicates the "parent" duplicate. This value is set when a duplicate of the record has been identified. The record that was assigned the "G" or "C" status has the same _DupID value as this parent record (see the _DupID field, explained below).
•_DupID: Provides a mechanism for grouping duplicate records with their "parent" duplicates. A document determined to be a duplicate in an ED Loader or Turbo Import enabled case contains the same LAW ID field value as any other records determined to be duplicates of that document. The parent duplicate stores its own ID in this field. Drag and drop this field into the grouping area in one of the grid displays to view the parent and "child" duplicates together.
Records deduplicated via inter-case deduplication are slightly different. Parents and their duplicate records still have matching _DupID field values; however, the value is not pulled from the ID field in CloudNine™ LAW. It comes instead from an ID assigned to records in the external deduplication database. The ID for each parent and duplicate is the ID of the parent, as assigned in the tblDupLog table's DupID field.
•_DupMethod: Indicates which hash type was used in testing the duplicate state of the record. Possible values are:
  •1 - MD5 hash
  •2 - SHA-1 hash
  •129 - MD5 hash, and the record was included in an inter-case deduplication process
  •130 - SHA-1 hash, and the record was included in an inter-case deduplication process
•DupCustNames: Indicates the names of the custodians containing duplicate versions of the original record. Populated for parent/original records only (DupStatus=P) after running Tools > Apply Duplicate Relationships.
•DupCustPaths: Indicates the source path to each duplicate version of the original record. Populated for parent/original records only (DupStatus=P) after running Tools > Apply Duplicate Relationships.
•DupParentName: Indicates the custodian name of the original record. Populated for duplicate records only (DupStatus=G or C) after running Tools > Apply Duplicate Relationships.
•DupParentPath: Indicates the source path of the original record. Populated for duplicate records only (DupStatus=G or C) after running Tools > Apply Duplicate Relationships.
•MD5Hash: Stores the MD5 hash value of the record. If a file is considered to be a duplicate, this value is equal to the deduplication key.
•Sha1Hash: Stores the SHA-1 hash value of the record. If a file is considered to be a duplicate, this value is equal to the deduplication key.
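As a rough illustration of how these fields might be read outside of LAW, the following Python sketch decodes _DupMethod values and groups records by _DupID. It assumes records have been exported as dictionaries with ID, DupStatus, and _DupID entries; that representation and the function names are hypothetical, not part of LAW.

```python
from collections import defaultdict

# _DupMethod codes as listed above: (hash type, inter-case deduplication?)
DUP_METHODS = {
    1: ("MD5", False),
    2: ("SHA-1", False),
    129: ("MD5", True),
    130: ("SHA-1", True),
}


def describe_dup_method(value):
    """Return a readable label for a _DupMethod code."""
    hash_name, inter_case = DUP_METHODS[value]
    return hash_name + (" (inter-case)" if inter_case else "")


def group_duplicates(records):
    """Group record IDs by _DupID, separating the parent from its duplicates.

    Records with DupStatus U or N are skipped because they belong to no group.
    """
    groups = defaultdict(lambda: {"parent": None, "duplicates": []})
    for record in records:
        status = record["DupStatus"]
        if status == "P":
            groups[record["_DupID"]]["parent"] = record["ID"]
        elif status in ("G", "C"):
            groups[record["_DupID"]]["duplicates"].append(record["ID"])
    return dict(groups)


# Hypothetical exported records.
sample = [
    {"ID": 10, "DupStatus": "P", "_DupID": 10},
    {"ID": 11, "DupStatus": "G", "_DupID": 10},
    {"ID": 12, "DupStatus": "N", "_DupID": None},
]
grouped = group_duplicates(sample)
```

The grouping relies only on matching _DupID values, so it also works for inter-case results, where the _DupID comes from the external deduplication database rather than from the LAW ID field.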
Scope refers to the range of deduplication keys that are tested to determine a record's duplicate state. The scope may be specified by the user in the ED Loader or Turbo Import Deduplication settings, the LAW Deduplication Utility, and the Inter-Case Deduplication utility. Two kinds of scope are available:
•Global: The deduplication keys of incoming records are tested against ALL other keys, regardless of the scope under which the other records were logged.
•Custodian: The deduplication keys of incoming records are tested against all other keys that have the same CustodianID value.
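The difference between the two scopes can be sketched as follows. This is a simplified model under stated assumptions, not LAW's algorithm: records are hypothetical (record_id, key, custodian_id) tuples processed in order, and only the first occurrence of a key is kept in the in-memory log.

```python
def deduplicate(records, scope):
    """Assign a DupStatus to each record for the given scope.

    records: iterable of (record_id, key, custodian_id) tuples.
    scope: "global" or "custodian".
    Returns a dict mapping record_id to "N", "P", "G", or "C".
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")

    logged = []    # first occurrences: (record_id, key, custodian_id)
    statuses = {}
    for record_id, key, custodian_id in records:
        parent_id = None
        for logged_id, logged_key, logged_custodian in logged:
            # Custodian scope only compares keys that share a CustodianID.
            in_scope = scope == "global" or logged_custodian == custodian_id
            if logged_key == key and in_scope:
                parent_id = logged_id
                break
        if parent_id is None:
            statuses[record_id] = "N"
            logged.append((record_id, key, custodian_id))
        else:
            statuses[parent_id] = "P"
            statuses[record_id] = "G" if scope == "global" else "C"
    return statuses


# Hypothetical records: the same key appears under custodians A and B.
records = [(1, "k1", "A"), (2, "k1", "B"), (3, "k1", "B"), (4, "k2", "A")]
global_result = deduplicate(records, "global")
custodian_result = deduplicate(records, "custodian")
```

With Global scope, record 1 becomes the parent of records 2 and 3. With Custodian scope, record 1 is not a duplicate of anything in custodian A, and record 2 becomes the parent of record 3 within custodian B.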
The Action options are used to limit or exclude the data stored for a record that is considered to be a duplicate. Options include:
•(Include) Log record: Duplicate records are added to the LAW case normally, including the native file, and all associated duplicate fields are set.
•(Partially Exclude) Log record but do not copy file: Duplicate records are added to the LAW case normally and all associated duplicate fields are set, but the native file is not copied to the case folder.
•(Exclude) Do not log record or copy file: Duplicate records are completely excluded from the case. The record is not added to LAW and the native file is not copied.
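The effect of each action on a duplicate record can be summarized in a small sketch. The enum and function names below are hypothetical and exist only to restate the three options in code form.

```python
from enum import Enum


class DuplicateAction(Enum):
    INCLUDE = "include"                      # (Include) Log record
    PARTIALLY_EXCLUDE = "partially_exclude"  # Log record but do not copy file
    EXCLUDE = "exclude"                      # Do not log record or copy file


def plan_for_duplicate(action):
    """Describe what is stored for a record that was found to be a duplicate."""
    logged = action is not DuplicateAction.EXCLUDE
    return {
        "log_record": logged,
        "set_duplicate_fields": logged,
        "copy_native_file": action is DuplicateAction.INCLUDE,
    }


partial_plan = plan_for_duplicate(DuplicateAction.PARTIALLY_EXCLUDE)
```

Only the Include action copies the native file, and only the Exclude action leaves the record out of the case entirely.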
Deduplication is performed only at the parent level, so attachments always inherit the DupStatus of their parent item. This applies to all types of attachments, such as e-mail attachments, items contained in an archive file (for example, a zip file), and loose e-mail message attachments.
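A minimal sketch of this inheritance rule, assuming a hypothetical mapping from each parent ID to the IDs of its attachments:

```python
def apply_family_status(families, parent_status):
    """Copy each parent's DupStatus to all of its attachments.

    families: dict mapping parent_id to a list of attachment ids.
    parent_status: dict mapping parent_id to its DupStatus value.
    """
    statuses = {}
    for parent_id, attachment_ids in families.items():
        status = parent_status[parent_id]
        statuses[parent_id] = status
        for attachment_id in attachment_ids:
            # Attachments are never tested on their own.
            statuses[attachment_id] = status
    return statuses


# Hypothetical families: e-mail 1 has two attachments, e-mail 4 has one.
family_statuses = apply_family_status({1: [2, 3], 4: [5]}, {1: "G", 4: "N"})
```

In this model an attachment that happens to be identical to another file in the case is still marked according to its parent, because only the parent's key is tested.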
File Type Filter Action and Deduplicating
When File Type Filtering is enabled and the (Partially Exclude) action is selected, duplicate checking is executed normally. Records without native files can still be flagged as duplicate parents or children.
After running the deduplication process, if you delete the parent document of a duplicate or a duplicate document itself, you can have LAW automatically update the DupStatus field and other deduplication fields for the remaining case documents associated with the deleted document. The Refresh duplicate status after deleting records check box on the Preferences tab in the Options dialog box controls whether these fields are updated when duplicate documents are deleted from CloudNine™ LAW. By default, the check box is not selected. When the check box is selected, LAW automatically updates the DupStatus field and other fields when duplicate documents are deleted from the case.
The following fields are updated when the Refresh duplicate status after deleting records check box is selected and duplicate documents are deleted:
•_DupID
•DupStatus
The Refresh duplicate status after deleting records check box is only available for SQL LAW cases. The check box is disabled for Access cases.
For more information about deleting documents, see Deleting Documents, Pages and Folders.