Deduplication and Near-Duplicate/Email Thread Analysis

Deduplication

Deduplication is the process of scanning all parent documents within the Case Database and flagging any duplicates (identical copies). This is done by subjecting documents to a hashing process, which yields a numerical (hash) value for each document. Documents yielding identical hashes are flagged as duplicates within case records. This process typically takes place during import via Turbo Import or ED Loader, but can also be performed later with one of the following utilities:

For internally deduplicating documents within a single Case File, use the Deduplication Utility.

For externally deduplicating documents across multiple Case Files, use the Inter-Case Deduplication Utility.

Deduplication is only performed on parent-level documents, so attachments will always inherit the DupStatus of their parent item.
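The hash-and-compare idea described above can be illustrated with a minimal Python sketch. This is not LAW's implementation: the function names, the in-memory dictionary of documents, and the choice to hash raw file bytes are all assumptions made for illustration, and the status letters simply mirror the DupStatus values listed later on this page.

```python
import hashlib
from collections import defaultdict


def file_hash(data: bytes, method: str = "md5") -> str:
    """Return the hex digest of the given bytes using MD5 or SHA-1."""
    if method not in ("md5", "sha1"):
        raise ValueError("method must be 'md5' or 'sha1'")
    return hashlib.new(method, data).hexdigest()


def flag_duplicates(documents: dict[str, bytes], method: str = "md5") -> dict[str, str]:
    """Map each (hypothetical) document ID to a status letter.

    'N' = no duplicates found, 'P' = first document of a duplicate group,
    'G' = a later copy of that document. Treating every later copy as
    'G' is an assumption; a real run distinguishes Global and Custodian scope.
    """
    groups = defaultdict(list)
    for doc_id, data in documents.items():
        groups[file_hash(data, method)].append(doc_id)

    status = {}
    for ids in groups.values():
        if len(ids) == 1:
            status[ids[0]] = "N"
        else:
            status[ids[0]] = "P"
            for dup_id in ids[1:]:
                status[dup_id] = "G"
    return status
```

For example, calling flag_duplicates on three documents where two contain the same bytes marks the first as P, its copy as G, and the remaining document as N.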

 

Near-Duplicate/Email Thread Analysis

Near-Duplicate/Email Thread Analysis is the process of scanning the extracted or OCR text of individual records within a Case Database and flagging any Near-Duplicates and/or Email Threads. This is done by subjecting the content (text) to a hashing process, which yields numerical (hash) values that are compared to measure how similar two records are. Records whose similarity is at or above the specified Threshold are flagged as either Near-Duplicates or Email Threads within case records.

To analyze the content of all records within a single Case File, use the Near-Duplicate & Email Thread Analysis Utility.
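To make the role of the Threshold concrete, the following Python sketch compares two pieces of text and flags them when their similarity meets a chosen percentage. It is only an illustration: it uses the standard library's difflib word-sequence matching rather than the content hashing used by the utility, and the function names and the default threshold of 90 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Return the word-level similarity of two texts as a percentage (0-100)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio() * 100


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 90.0) -> bool:
    """Flag the pair when its similarity is at or above the threshold."""
    return similarity_percent(text_a, text_b) >= threshold
```

Lowering the threshold flags more loosely related text as near-duplicates, while raising it restricts matches to records that are almost identical.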

 

Viewing Duplicates, Near-Duplicates, and Email Threads

When you highlight a document within the Case Directory pane of the Main User Interface, you can view its Duplicate/Near-Duplicate status (and any associated Duplicate/Near-Duplicate documents) in the Duplicate Viewer.

Additionally, you can locate and view Duplicates, Near-Duplicates, or Email Threads by searching for specific Metadata Fields within case records using the Database Query Builder. The sections below list the Metadata Fields and values to search for.

 

Metadata Fields:  Duplicates

Once Deduplication has been performed, LAW updates the following Metadata Fields for all records in the Case Directory:

DupStatus - Indicates the duplicate status, with one of the following character values:

    U - "Untested"; This document has not yet been scanned for duplicates.

    N - "None"; No duplicates have been identified for this document.

    P - "Parent"; One or more other documents have been identified as duplicates of this document.

    G - "Global"; This document is a global-level duplicate.

    C - "Custodian"; This document is a Custodian-level duplicate.

_DupID - Used to identify and group duplicate records. "Global" and "Custodian" level duplicates will have their "Parent" ID displayed here for reference. "Parent" level duplicates are assigned an ID based on how deduplication was performed. Documents without duplicates (or yet to be tested) will display a 0 for their ID.

DupMethod - Indicates the hashing algorithm used for scanning documents, with one of the following values:

    1 - MD5 hashing was used.

    2 - SHA-1 hashing was used.

    129 - MD5 hashing was used via the Inter-Case Deduplication Utility.

    130 - SHA-1 hashing was used via the Inter-Case Deduplication Utility.

MD5Hash - MD5 hash values are stored here, where applicable.

Sha1Hash - SHA-1 hash values are stored here, where applicable.
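If these fields are exported for review outside of LAW, the coded values can be translated and grouped with a short script. The Python sketch below is one possible approach under stated assumptions: each record is a plain dictionary, the DocID key is a hypothetical identifier, and a "Parent" record is assumed to share its _DupID value with its duplicates.

```python
from collections import defaultdict

# Code tables copied from the field descriptions above.
DUP_STATUS = {
    "U": "Untested",
    "N": "None",
    "P": "Parent",
    "G": "Global",
    "C": "Custodian",
}
DUP_METHOD = {
    1: "MD5",
    2: "SHA-1",
    129: "MD5 (Inter-Case Deduplication Utility)",
    130: "SHA-1 (Inter-Case Deduplication Utility)",
}


def describe_record(record: dict) -> str:
    """Return a readable summary of a record's DupStatus and DupMethod."""
    status = DUP_STATUS.get(record["DupStatus"], "Unknown")
    method = DUP_METHOD.get(record["DupMethod"], "Unknown")
    return f"{status} / {method}"


def group_by_dup_id(records: list[dict]) -> dict[int, list[int]]:
    """Group DocID values (hypothetical key) by non-zero _DupID.

    A _DupID of 0 means the document has no duplicates or is untested,
    so those records are left out of the result.
    """
    groups = defaultdict(list)
    for record in records:
        if record["_DupID"] != 0:
            groups[record["_DupID"]].append(record["DocID"])
    return dict(groups)
```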

Additionally, the following Metadata Fields can be populated from the Menu of the Main User Interface by selecting Tools > Apply Duplicate Relationships:

NOTICE:  These fields are automatically populated when deduplicating via Turbo Import.

For P ("Parent" level) duplicates:

DupCustNames - Displays all Custodians associated with duplicates of this record.

DupCustPaths - Displays the file path for each duplicate of this record.

For G ("Global" level) and C ("Custodian" level) duplicates:

DupParentName - Displays the Custodian associated with the "Parent" duplicate record.

DupParentPath - Displays the file path for the "Parent" duplicate record.

 

Metadata Fields:  Near-Duplicates

Once Near-Duplicate analysis has been performed, LAW updates the following Metadata Fields for all records in the Case Directory:

ND_ClusterID - Used to identify near-duplicate clusters. Each document belonging to the same cluster will display a matching ID.

ND_FamilyID - Used to identify near-duplicate families. Documents belonging to the same family will display a matching ID, which is based on the master document ID (padded to fill 8 digits).

ND_IsMaster - Flags master documents of near-duplicate families with a Y. Other documents belonging to the same family will display an N. All other documents not belonging to any near-duplicate family will also display a Y.

ND_Similarity - Displays the percentage of similarity between this document and its family master document.

ND_ResultSet - Used for internal tracking purposes by the Near-Duplicate & Email Thread Analysis Utility. Indicates the near-duplicate index revision for the current record.

ND_ContentHash - Content hash values are stored here. Documents with identical values contain identical text, but may still have different metadata or file formats.

ND_Sort - Displays a sorting ID for each document. Documents are assigned this ID based on their similarity to each other.
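As an illustration of how the near-duplicate fields relate to one another, the Python sketch below groups exported records into families and orders each family's members by similarity to the master. The record layout, the DocID key, the use of an empty ND_FamilyID for documents outside any family, and zero as the padding character are all assumptions rather than documented behavior.

```python
from collections import defaultdict


def family_id(master_doc_id: int) -> str:
    """Pad a master document ID to 8 digits (zero padding is assumed)."""
    return f"{master_doc_id:08d}"


def summarize_families(records: list[dict]) -> dict[str, dict]:
    """Return the master and the other members of each near-duplicate family.

    Records with an empty ND_FamilyID are skipped, because documents
    outside any family also carry ND_IsMaster = 'Y'.
    """
    families = defaultdict(list)
    for record in records:
        if record["ND_FamilyID"]:
            families[record["ND_FamilyID"]].append(record)

    summary = {}
    for fam_id, members in families.items():
        master = next((m["DocID"] for m in members if m["ND_IsMaster"] == "Y"), None)
        others = sorted(
            (m for m in members if m["ND_IsMaster"] != "Y"),
            key=lambda m: m["ND_Similarity"],
            reverse=True,  # most similar to the master first
        )
        summary[fam_id] = {"master": master, "members": [m["DocID"] for m in others]}
    return summary
```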

 

Metadata Fields:  Email Threads

Once Email Thread analysis has been performed, LAW updates the following Metadata Fields for all records in the Case Directory:

ET_IsMessage - Flags email messages with a Y. All other documents display an N.

ET_Conversants - Displays the names of all senders and recipients found within email messages. Names can be located within email headers (From, To, CC, BCC), previous quoted messages, or the main body of the message.

ET_MessageID - Displays a unique ID assigned to each message. Messages with matching IDs are recognized as separate copies of the same message.

ET_ParentID - Displays the ID of the root message being responded to or forwarded by this message.

ET_Inclusive - Flags messages containing the entire conversation of an email thread with a Y. This is typically the last message in a thread. Attachments are not flagged. All other messages in an email thread display an N.

ET_InclusiveReason - Indicates the reason for an email being flagged as Y within the ET_Inclusive field:

    Message - This email contains body text not found in other emails of the thread.

    Attachment - This email contains attachments not found in other emails of the thread.

    Message, Attachment - This email contains both body text and attachments not found in other emails of the thread.

ET_MetaUpdate - Displays a Y for messages whose metadata was populated from analyzed text by the Near-Duplicate & Email Thread Analysis Utility. All other messages display an N.

ET_ThreadModified - Displays the date/time of the most recent email thread analysis performed on this document.

ET_ThreadID - Used to identify email threads. Each message belonging to the same email thread will display a matching ID.

ET_ThreadSize - Indicates the number of unique messages within an email thread.

ET_ThreadIndex - Identifies individual messages and their attachments within an email thread using the following format: "[ET_ThreadID].[message #].A.[attachment #]". The ".A.[attachment #]" portion only appears for messages with attachments, and the root message within an email thread displays only the ET_ThreadID portion.

ET_ThreadSort - Displays a sorting ID for each message in an email thread, indicating a position in the overall chain of conversation (including any branches).

ET_Indent - Displays an incremental number for each message of an email thread, starting with 0 for the root message, and increasing by 1 for each reply in the chain.
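The ET_ThreadIndex format described above lends itself to simple parsing when thread data is reviewed outside of LAW. The Python sketch below splits a value into its thread, message, and attachment parts. It is an illustration only: the ThreadIndex type and function name are hypothetical, and it assumes that the ET_ThreadID portion contains no periods and that message and attachment numbers are plain digits.

```python
import re
from typing import NamedTuple, Optional


class ThreadIndex(NamedTuple):
    thread_id: str
    message_no: Optional[int]     # None for the root message
    attachment_no: Optional[int]  # None when the value is not an attachment


# [ET_ThreadID], then an optional ".[message #]", then an optional ".A.[attachment #]".
_PATTERN = re.compile(r"^(?P<thread>[^.]+)(?:\.(?P<msg>\d+))?(?:\.A\.(?P<att>\d+))?$")


def parse_thread_index(value: str) -> ThreadIndex:
    """Split an ET_ThreadIndex value into its component parts."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognized ET_ThreadIndex value: {value!r}")
    msg = match.group("msg")
    att = match.group("att")
    return ThreadIndex(
        thread_id=match.group("thread"),
        message_no=int(msg) if msg else None,
        attachment_no=int(att) if att else None,
    )
```

A value consisting of the thread ID alone is treated as the root message, so both numeric parts come back as None.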