Deduplication is the process of identifying duplicate files during the discovery process and removing them from further processing and analysis. Deduplication is a necessary step in managing the volume of data that must be analyzed.
A duplicate file is an exact copy of another file. Deduplication is necessary in many situations involving electronic documents because multiple identical documents are a typical feature of large record sets. For example, in electronic discovery sets containing e-mail archives for an organization, it is not uncommon for multiple e-mail accounts to contain the exact same widely distributed e-mail or file attachment.
CloudNine™ LAW identifies duplicate files by comparing hashes of files. A hash is a numerical representation of a file whose value is based on the file contents or other attributes. In essence, the file is run through a hashing algorithm that yields a value that is, for practical purposes, unique to that content. Hashing is not encryption: no key is involved, and the original content cannot be recovered from the hash. An exact copy of a file will yield the same hash value. In the case of electronic documents, the file is hashed. For e-mail, metadata fields are hashed. You can select the hashing method (the working digest) in the deduplication settings.
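The idea can be illustrated with a minimal Python sketch built on the standard hashlib module. It is not LAW's implementation: the function names, the example metadata fields, and the way the fields are joined before hashing are assumptions made for illustration only.

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file's contents (e-doc style hashing)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_email_metadata(fields, algorithm="md5"):
    """Return the hex digest of selected e-mail metadata fields.

    `fields` is a hypothetical dict such as {"From": ..., "Subject": ...}.
    Keys are sorted so the result does not depend on insertion order.
    """
    digest = hashlib.new(algorithm)
    for name in sorted(fields):
        digest.update(f"{name}={fields[name]}\n".encode("utf-8"))
    return digest.hexdigest()
```

Two files with identical bytes produce the same digest from `hash_file`, regardless of file name or location. Passing `"sha1"` as the algorithm yields a 160-bit digest instead of the 128-bit MD5 digest.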
The scope of the project will determine whether or not deduplication will be performed and which methods will be used.
In addition to deduplicating during the import process, LAW also allows you to deduplicate at these other times in a pre-discovery workflow:
• After the import, against other records in the case, by using the Deduplication Utility.
• After the import, against other records in the case and in other LAW cases, by using Inter-Case Deduplication.
1. On the File menu, click Import, and then click Electronic Discovery.
2. Click the Settings tab, and then click Deduplication. The Deduplication options display.
3. Choose from among the following options:
• Enable Duplicate Detection. Enables duplicate checking for the current session.
• Working digest. The working digest is the hashing method used to determine duplicates. A hash value can be thought of as the DNA of a file. Hash values are computed from metadata fields (e-mail) or from the entire file (e-docs). LAW supports two hashing methods:
  • MD5: 128-bit output
  • SHA-1: 160-bit output
• Test for duplicate against (Scope). This option identifies the scope for deduplication. During the import process, deduplication can be performed at one of two levels:
  • Case Level (Globally). Deduplicates documents against the entire incoming collection and against existing records in the LAW case.
  • Custodian Level. Deduplicates documents against records with identical custodian values.
• If record is considered a duplicate then (Action). This setting determines the action to take once a duplicate is located. Three options are available:
  • Include. Creates a record for the duplicate in the database and copies the native file into the case folder.
  • Partially exclude. Creates a record in the database but does not copy the native file.
  • Exclude. Does not create a record, does not extract text, and does not copy the native file to the case folder.
• Include attachment hashes in e-mail metadata hash. When enabled, the ED Loader includes the hashes of attached files in the parent e-mail's metadata hash. When disabled, the metadata hash incorporates the Attach field, which contains only the file names of the attached files.
• Enable hashing of non-email Outlook items. Determines whether hash values are generated for non-email items in an Outlook PST file during the ED Loader import. Non-email PST items include Microsoft Outlook calendar items, contacts, journal entries, notes, and tasks. When the check box is selected, deduplication is performed on the non-email items in a PST file. By default, the check box is selected for cases created in LAW version 6.9.x or later. For cases created in LAW version 6.8.x or earlier, the check box is not selected by default.
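The interaction between the Scope and Action settings can be pictured with a short Python sketch. It is an illustration under stated assumptions, not LAW's implementation: the `Record` type, the `deduplicate` function, and the string values used for scope and action are hypothetical names chosen for this example, and hash values are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    custodian: str
    hash_value: str
    is_duplicate: bool = False
    native_copied: bool = True


def deduplicate(records, scope="case", action="include"):
    """Flag duplicates and apply the chosen action; return the records kept.

    scope:  "case" compares hashes across all records;
            "custodian" compares only within the same custodian.
    action: "include", "partially_exclude", or "exclude".
    """
    if scope not in ("case", "custodian"):
        raise ValueError(f"unknown scope: {scope}")
    if action not in ("include", "partially_exclude", "exclude"):
        raise ValueError(f"unknown action: {action}")

    seen = set()
    kept = []
    for record in records:
        # The comparison key decides what counts as "the same" record.
        if scope == "case":
            key = record.hash_value
        else:
            key = (record.custodian, record.hash_value)

        if key in seen:
            record.is_duplicate = True
            if action == "exclude":
                continue  # no record is created at all
            if action == "partially_exclude":
                record.native_copied = False  # record kept, native file not copied
        else:
            seen.add(key)
        kept.append(record)
    return kept
```

In this sketch, the same file held by two custodians is flagged as a duplicate under the `"case"` scope, while under the `"custodian"` scope each custodian retains one copy.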