Deduplication Utility

The Deduplication Utility internally scans the electronic documents in a single Case File and flags duplicate (identical) records. Running this utility is not recommended on cases in which deduplication was already performed, whether while importing documents via Turbo Import or ED Loader or through the external Inter-Case Deduplication Utility.

 

(Re)Running the Deduplication Utility

1. Open the Deduplication Utility from the Menu of the Main User Interface by selecting Tools > Deduplication Utility....

2. Under the Info tab, click on the Load button to retrieve the Deduplication Log for the active Case Database.

3. Compare the values listed to the right of the Records in case and Deduplicated documents in case lines at the top of the Log:

    i. If these values are identical, then deduplication isn't necessary. Click on Cancel at the bottom-right of the utility, and ignore the rest of the steps listed here.

    ii. If these values are different, then some or all records still need to be scanned for duplicates. Continue on to step 4.

4. Navigate to the Settings tab, and ensure that the Deduplication Utility is configured correctly for the active case (explained below).

5. Click on Start at the bottom-right of the utility. The Confirm Deduplication Settings prompt opens.

6. Click on OK at the bottom-right of this prompt to accept your current settings and begin scanning the active Case Database for duplicates.

7. The utility switches to a new Progress tab, where you can monitor the deduplication process.

8. Once deduplication is finished, click on Exit at the bottom-right of the utility to close it.
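
The check in step 3 comes down to a single comparison. The following is only an illustrative sketch: the function names are hypothetical, and the two numbers are assumed to be read off the Deduplication Log by hand rather than obtained through any CloudNine LAW API.

```python
def needs_deduplication(records_in_case: int, deduplicated_in_case: int) -> bool:
    """Mirror step 3: a run is only needed when the two Log values differ."""
    return records_in_case != deduplicated_in_case


def untested_count(records_in_case: int, deduplicated_in_case: int) -> int:
    """Rough number of records that have not been tested yet (never negative)."""
    return max(records_in_case - deduplicated_in_case, 0)


if __name__ == "__main__":
    # Example values only; substitute the numbers shown in your own Log.
    if needs_deduplication(1200, 950):
        print("Continue to step 4;", untested_count(1200, 950), "records untested")
    else:
        print("Case is fully deduplicated; click Cancel")
```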

 

Info

This tab shows deduplication statistics for the Case Database. Click on the Load button to retrieve the Deduplication Log for the active case. Once loaded, the following information is displayed:

---Case---

o Records in case - Number of records currently in the Case Database.

o Deduplicated documents in case - Number of records that have been deduplicated/tested for duplicate status. If this value is less than the number for Records in case, then the Case Database has not been fully deduplicated.

o Duplicate root/parent records in case - Number of documents that have duplicate occurrences (DupStatus = P).

o Duplicates in case - Number of duplicate records (Global or Custodian level) currently in the Case Database.

    Global Duplicates - Number of duplicate records at the global level (DupStatus = G).

    Custodian Duplicates - Number of duplicate records at the custodian level (DupStatus = C).

    <Custodian Name> - Number of duplicate records associated with a specific Custodian.

---Deduplication Log---

o Items in duplication log (All) - The number of entries that have been logged for deduplication testing.

o Records in deduplication log (global level) - The number of entries that have been logged for deduplication testing at the Global scope. This includes records logged at the Custodian scope.

o Records in deduplication log (custodian level) - The number of entries that have been logged for deduplication testing at the Custodian scope.
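
To make the relationship between the Case counters concrete, the sketch below recomputes them from a list of exported records. It is an illustration under assumptions, not how the utility itself works: only the DupStatus values P, G, and C come from the descriptions above, while the record layout, the `Custodian` key, and the `tested` flag are hypothetical stand-ins.

```python
from collections import Counter


def summarize_case(records):
    """Recompute the Info tab's Case counters from a list of record dicts.

    Assumed (hypothetical) record layout:
      {"Custodian": str, "DupStatus": "P" | "G" | "C" | None, "tested": bool}
    """
    status_counts = Counter(r.get("DupStatus") for r in records)
    duplicates_by_custodian = Counter(
        r["Custodian"] for r in records if r.get("DupStatus") in ("G", "C")
    )
    return {
        "records_in_case": len(records),
        "deduplicated_in_case": sum(1 for r in records if r.get("tested")),
        "duplicate_parents": status_counts["P"],      # DupStatus = P
        "global_duplicates": status_counts["G"],      # DupStatus = G
        "custodian_duplicates": status_counts["C"],   # DupStatus = C
        "duplicates_in_case": status_counts["G"] + status_counts["C"],
        "duplicates_by_custodian": dict(duplicates_by_custodian),
    }


if __name__ == "__main__":
    sample = [
        {"Custodian": "Alice", "DupStatus": "P", "tested": True},
        {"Custodian": "Alice", "DupStatus": "C", "tested": True},
        {"Custodian": "Bob", "DupStatus": "G", "tested": True},
        {"Custodian": "Bob", "DupStatus": None, "tested": False},
    ]
    for name, value in summarize_case(sample).items():
        print(name, "=", value)
```

In this sample, the deduplicated count (3) is lower than the record count (4), which is the condition the Info tab describes as a case that has not been fully deduplicated.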

 

Tools

This tab provides three commands that can be run from within the Deduplication Utility:

Deduplication Status Reset - This command flushes all items from the Deduplication Log and resets the deduplication-related Metadata Fields for all records in the case. This effectively restores the Case Database to its original state before any manual deduplication was performed. It does not affect any deduplication performed via Turbo Import or ED Loader. Clicking on Run opens the DedupReset menu with the following internal commands:

o Scan - Scans the Case Database and reports the number of duplicate items found.

o Reset - Clears the Deduplication Log for the Case Database and resets the deduplication Metadata Fields for all records.

Verify Deduplication Log - This command verifies that all entries in the Deduplication Log exist in the Case Database. It also resets any deduplication-related Metadata Fields that were updated as a result of using the Inter-Case Deduplication Utility, but does not affect any fields created through Turbo Import or ED Loader. Clicking on Run opens the Deduplication Log Verification menu with the following internal commands:

o Start - Scans the Deduplication Log for stale entries, and also scans the Case Database for records marked as duplicates against any invalid entries that are detected.

o Save - Saves any invalid IDs found in the Log to a separate file (optional, but not recommended).

o Synchronize - Removes invalid entries from the Deduplication Log. After this step, the Log should be fully synchronized.

Apply Duplicate Relationships - This command populates deduplication-related Metadata Fields with Custodian and file path information for each duplicate record and its original document within the Case Database. This command is also available from the Main User Interface by using the Menu to select Tools > Apply Duplicate Relationships. Clicking on Run initiates the command and presents the results in the Duplicate Relationship Update Status dialog.
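
The Start and Synchronize steps of the Verify Deduplication Log command can be pictured as simple set operations. The sketch below is a conceptual illustration only: it assumes the Log and the Case Database can each be reduced to a list of record IDs, and the `parent_id` field and all function names are hypothetical rather than part of CloudNine LAW.

```python
def find_stale_entries(log_ids, case_ids):
    """Start, part 1: Log entries whose record no longer exists in the case."""
    case = set(case_ids)
    return [entry for entry in log_ids if entry not in case]


def find_orphaned_duplicates(records, stale_ids):
    """Start, part 2: records marked as duplicates of a stale (invalid) entry.

    Each record is assumed to carry a hypothetical "parent_id" key that
    points at the record it was flagged as a duplicate of.
    """
    stale = set(stale_ids)
    return [r["id"] for r in records if r.get("parent_id") in stale]


def synchronize(log_ids, stale_ids):
    """Synchronize: drop the invalid entries from the Log, keeping the order."""
    stale = set(stale_ids)
    return [entry for entry in log_ids if entry not in stale]


if __name__ == "__main__":
    log = [1, 2, 3, 4]
    case = [1, 3, 5]
    records = [{"id": 5, "parent_id": 2}, {"id": 3, "parent_id": 1}]

    stale = find_stale_entries(log, case)
    print("stale entries:", stale)
    print("orphaned duplicates:", find_orphaned_duplicates(records, stale))
    print("synchronized log:", synchronize(log, stale))
```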

 

Settings

Use this tab to determine how documents are scanned for duplicates, and which documents should be scanned.

Processing Options - These settings determine how documents are scanned for duplicates within the Case Database:

o Working Digest - Selects the hash algorithm used to detect duplicate documents: MD5 (128-bit output digest) or SHA1 (160-bit output digest).

o Test for duplicate against (Scope) - Selects the level (hierarchy) at which documents are compared for duplicates:

    Case Level (Global) - All documents within the database are compared.

    Custodian Level - Documents sharing the same Custodian within the database are compared.
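
The two Processing Options interact in a way that is easy to show with a small example. The sketch below is a simplified model, not the utility's actual algorithm: it assumes each document is hashed over its raw bytes (the real hashing inputs are not described here), and the record layout and function names are hypothetical. Only the MD5/SHA1 choice, the two scopes, and the DupStatus values P, G, and C are taken from this page.

```python
import hashlib


def working_digest(data: bytes, algorithm: str = "md5") -> str:
    """Hash a document with the chosen Working Digest (MD5 or SHA1)."""
    if algorithm not in ("md5", "sha1"):
        raise ValueError("algorithm must be 'md5' or 'sha1'")
    return hashlib.new(algorithm, data).hexdigest()


def flag_duplicates(records, algorithm="md5", scope="global"):
    """Return {record id: DupStatus} for the given scope.

    scope="global"    compares every document in the case (duplicates get G).
    scope="custodian" compares only documents sharing a Custodian (C).
    The first document seen with a given digest becomes the parent (P).
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    first_seen = {}   # comparison key -> id of the first record with that key
    statuses = {}
    for rec in records:
        digest = working_digest(rec["content"], algorithm)
        key = digest if scope == "global" else (rec["custodian"], digest)
        if key in first_seen:
            statuses[rec["id"]] = "G" if scope == "global" else "C"
            statuses[first_seen[key]] = "P"
        else:
            first_seen[key] = rec["id"]
            statuses[rec["id"]] = None  # unique so far
    return statuses


if __name__ == "__main__":
    docs = [
        {"id": 1, "custodian": "Alice", "content": b"quarterly report"},
        {"id": 2, "custodian": "Bob", "content": b"quarterly report"},
        {"id": 3, "custodian": "Alice", "content": b"quarterly report"},
    ]
    print(flag_duplicates(docs, scope="global"))
    print(flag_duplicates(docs, scope="custodian"))
```

With identical content held by two Custodians, the global scope flags both later copies, while the custodian scope leaves Bob's only copy untouched and flags just Alice's second copy.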

Processing Range Options - These settings determine which documents are scanned for duplicates within the Case Database:

o Only test untested records - The utility scans only documents that have not yet been tested for duplicates. This is useful when new documents are added to the Case Database after an earlier deduplication run.

o Only test records within selected custodians - The utility scans only documents belonging to the selected Custodians. Click on the Select button to open the Custom Value Selection window, and then check the box to the left of each Custodian whose documents you wish to scan. Click on Accept when finished to apply your selection.
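
The two Processing Range Options act as filters on which records are scanned. As a rough sketch, assuming a hypothetical record layout with a `tested` flag and a `custodian` key (neither is a documented CloudNine LAW field name), the selection could be modeled as:

```python
def select_records(records, only_untested=False, custodians=None):
    """Pick the records a run would scan under the Processing Range Options.

    only_untested: skip records already tested for duplicates.
    custodians:    if given, keep only records belonging to these Custodians;
                   None means no Custodian restriction.
    """
    selected = []
    for rec in records:
        if only_untested and rec.get("tested"):
            continue
        if custodians is not None and rec["custodian"] not in custodians:
            continue
        selected.append(rec)
    return selected


if __name__ == "__main__":
    case = [
        {"id": 1, "custodian": "Alice", "tested": True},
        {"id": 2, "custodian": "Alice", "tested": False},
        {"id": 3, "custodian": "Bob", "tested": False},
    ]
    picked = select_records(case, only_untested=True, custodians={"Alice"})
    print([rec["id"] for rec in picked])
```

When both options are enabled, a record has to pass both filters to be scanned, which is how the sketch combines them; this is an assumption about how the options combine.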