Duplicate Information with ED Loader
This topic explains the deduplication process implemented in CloudNine™ LAW and the related Inter-Case Deduplication utility. Items discussed include:
•Key generation and how it relates to the deduplication of e-mail compared to electronic documents.
•A list of deduplication-related fields and possible values.
•Brief explanations of deduplication options and their resulting output.
What is Key Generation?
Deduplication technologies do not compare file contents directly. Instead, the hash values of the files, also called signatures, are compared. If the hash values for two items are identical, the content of the two items is assumed to be identical as well. Key generation, then, refers to the process of creating a hash value for a file so that files can be easily compared. File hashing and metadata hashing are the two primary methods the ED Loader and Turbo Import use to generate keys. Currently, the ED Loader and Turbo Import generate two output hashes in parallel: MD5 and SHA-1. In the ED Loader and Turbo Import options, you can choose which of these is used in the deduplication process.
•E-docs. The key value is generated using the entire file as the input.
•E-mail. The key value is generated from certain metadata fields after those fields have been processed. Using the post-processed metadata ensures that the input matches the metadata stored in CloudNine™ LAW; therefore, if the key is regenerated in the future, the value matches the original.
E-mail includes both e-mail messages contained in mail stores and loose e-mail messages. The term "loose e-mail" refers to a file that is identified as a mail item and successfully converted to a mail item by Outlook. These include .msg files, .eml files, and other RFC822-format e-mails. For Microsoft Outlook PST files, e-mail also includes non-email items: calendar items, contacts, journal entries, notes, and tasks.
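To illustrate file hashing for e-docs, the following minimal sketch reads a file once, feeds each chunk to both an MD5 and a SHA-1 digest, and then groups files by whichever digest is selected for deduplication. It is not how CloudNine™ LAW is implemented; the function names `hash_file` and `find_duplicates` and the `algorithm` parameter are hypothetical and exist only for this example.

```python
import hashlib
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for a file, computed in a single read pass."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def find_duplicates(paths, algorithm="md5"):
    """Group paths by the selected hash; any group larger than one is a duplicate set."""
    index = {"md5": 0, "sha1": 1}[algorithm]  # hypothetical option name
    groups = defaultdict(list)
    for path in paths:
        groups[hash_file(path)[index]].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Both digests are produced for every file, and the choice of algorithm only decides which one is compared, which mirrors the option described above.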
Electronic Discovery in CloudNine™ LAW uses the fields shown in the following table to generate the hash values for e-mail, loose e-mail items, and Outlook PST non-email items:
E-mail Metadata Fields Used for Deduplication Keys in CloudNine™ LAW ED Loader

Item type | Fields |
---|---|
Base Email | BCC, Body, CC, From, IntMsgID, Email_Subject, To, and either Attach or AttachmentContentHash (see the note below) |
Calendars | IsPrivate, IsAllDayEvent, IsRecurring, RecurrencePattern, Organizer, Required, Optional, Location, MeetingStatus, Label |
Contacts | DisplayName, BusinessAddress, BusinessPhone, BusinessPhone2, HomeAddress, CellularPhone, GovernmentID, CustomerID, DepartmentName, ManagerName, ContactEmail, ContactEmail2, Company |
Journal | IsPrivate, IsAllDayEvent, IsRecurring, RecurrencePattern, JournalType, JournalDesc |
Notes | NotesBody |
Tasks | IsPrivate, IsAllDayEvent, IsRecurring, RecurrencePattern, TeamTask, TaskCompleted, TaskDelegator, TaskOwner, TaskStatus |

Note on attachments:
•Attach. These are the first-level attachments in the e-mail. The strings are delimited by semicolons (;).
•AttachmentContentHash. If Include attachment hashes in e-mail metadata hash is enabled in the ED Loader Deduplication settings, the hashes of the attached files are included in the parent e-mail's metadata hash in place of the Attach field.
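As a rough illustration of metadata hashing for e-mail, the sketch below joins the base e-mail fields into one string and hashes it, switching between the Attach field and attachment content hashes. This topic does not specify how CloudNine™ LAW orders, normalizes, or joins the fields, so the field order, the separator character, the UTF-8 encoding, and the `email_key` function itself are all assumptions made for the example.

```python
import hashlib

# Base e-mail fields listed in the table; the order used here is an assumption.
BASE_FIELDS = ["BCC", "Body", "CC", "From", "IntMsgID", "Email_Subject", "To"]


def email_key(metadata, attachment_hashes=None, algorithm="md5"):
    """Build a deduplication key from post-processed e-mail metadata.

    metadata: dict mapping field name to string; missing fields count as empty.
    attachment_hashes: when given, these replace the Attach field, imitating
        the "Include attachment hashes in e-mail metadata hash" option.
    """
    parts = [metadata.get(name, "") for name in BASE_FIELDS]
    if attachment_hashes is not None:
        parts.append(";".join(attachment_hashes))  # AttachmentContentHash
    else:
        parts.append(metadata.get("Attach", ""))   # semicolon-delimited string
    # Assumed separator: a control character keeps adjacent fields distinct.
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.new(algorithm, payload).hexdigest()
```

In this sketch, two messages whose fields and Attach strings match receive the same key even if the attached files differ in content. Supplying the attachment hashes makes the key sensitive to the attachment content instead.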