Deduplication & Filtering

Deduplication

Data sets commonly contain duplicative data: multiple copies of the same file, e-mail message, or document. Removing or flagging these copies through a process known as deduplication allows eLit to reduce the amount of data Processed and Reviewed, saving our clients money.

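eLit first computes the MD5 hash value of a file or document by running its contents through a cryptographic hash algorithm. Two files with identical content produce the same hash value, so matching hash values identify duplicates. Comparing documents via MD5 hash is trusted and admissible in standard legal practice.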
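
As an illustration only, a binary MD5 hash can be computed with Python's standard hashlib module. This is a minimal sketch, not eLit's actual tooling; the function name md5_of_file and the chunk size are assumptions made for the example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's binary content.

    Hypothetical helper for illustration; not eLit's implementation.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files are treated as exact duplicates when this function returns the same value for both.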

In many cases, eLit needs to adjust its algorithm when computing MD5 values. For example, if the same e-mail is received twice on a laptop running both Outlook and Thunderbird (two different e-mail applications), the MD5 value generated for each stored copy will be different even though the message is the same, because each application stores the message in its own file format. When we adjust our algorithm to compute MD5 hash values from strings of metadata (To, From, Body, Attachments, etc.) instead of the file's binary content, these two messages are identified as duplicative and one copy is filtered out or flagged. Flagged duplicatives can easily be hidden or referenced during the Review process.
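
The metadata-based approach can be sketched roughly as follows. The exact fields, their order, and the normalization rules (lowercasing addresses, collapsing whitespace, sorting recipients) are assumptions made for this example, and the function name email_dedup_hash is hypothetical.

```python
import hashlib


def email_dedup_hash(sender, recipients, subject, body, attachment_hashes):
    """Hash an e-mail from normalized metadata instead of its stored bytes.

    Field choice and normalization are illustrative assumptions.
    """
    parts = [
        sender.strip().lower(),
        # Sort recipients so their order in the header does not matter.
        ";".join(sorted(r.strip().lower() for r in recipients)),
        subject.strip(),
        # Collapse whitespace so line-wrapping differences are ignored.
        " ".join(body.split()),
        # Attachments are represented by their own content hashes.
        ";".join(sorted(h.lower() for h in attachment_hashes)),
    ]
    # Join with a separator that is unlikely to occur inside a field.
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

With this kind of hash, the Outlook copy and the Thunderbird copy of one message produce the same value as long as the extracted fields agree after normalization.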

Duplicative documents can be identified across the entire data set (globally) or within each individual custodian's data.
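
The difference between the two scopes can be shown with a short sketch. It assumes each document is a dictionary with hypothetical id, custodian, and md5 keys, and that the first copy encountered is the one kept.

```python
def flag_duplicates(documents, scope="global"):
    """Return the ids of documents flagged as duplicates.

    scope="global" compares hashes across all custodians;
    scope="custodian" compares only within each custodian.
    The record layout is an assumption made for this example.
    """
    seen = set()
    flagged = set()
    for doc in documents:
        if scope == "global":
            key = doc["md5"]
        else:
            key = (doc["custodian"], doc["md5"])
        if key in seen:
            # A copy with this key was already kept, so flag this one.
            flagged.add(doc["id"])
        else:
            seen.add(key)
    return flagged
```

Global deduplication flags a copy held by a second custodian, while custodian-level deduplication keeps one copy per custodian.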

Near Dupe Identification

Similar to deduplication, this technology allows eLit to filter or flag documents that are similar in content but may not be 100% duplicative. Imagine a lengthy e-mail thread with dozens of messages. One message, the final message in the thread, contains the body of all previous correspondence. Treating that final message as the pivot document allows all other messages to be filtered out, avoiding redundancy during Review.

Another example may be two Microsoft Word documents. One document contains 80% of the report, while the other, containing 100%, was saved as a new file under a different name. Near dupe identification allows eLit to flag these two files as near duplicates.
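
One common way to measure this kind of overlap is to compare word shingles (short runs of consecutive words) and compute how much of one document is contained in another. The sketch below uses that technique purely as an illustration; it is not necessarily the method eLit's software uses, and the function names and the shingle size of 3 are assumptions.

```python
def shingles(text, size=3):
    """Return the set of consecutive word runs of the given size."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(candidate, pivot, size=3):
    """Fraction of the candidate's shingles that also appear in the pivot."""
    cand = shingles(candidate, size)
    if not cand:
        return 0.0
    return len(cand & shingles(pivot, size)) / len(cand)
```

A candidate whose containment in the pivot document exceeds a chosen threshold could then be flagged as a near duplicate of that pivot.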

DeNIST

DeNIST is a simple process that removes non-user-created files, such as system files and program files, that match a Reference Data Set (RDS) provided by the National Institute of Standards and Technology (NIST). The RDS is a collection of hash values (think digital fingerprints) that helps alleviate much of the effort involved in determining which files are important as evidence on computers or file systems. The NIST list contains over 28 million digital signatures.
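
Conceptually, DeNISTing is a set lookup: each file's hash is checked against the known hash values. The sketch below assumes the reference hashes have already been loaded into memory as hex strings; how they are read from the RDS itself is not shown, and the function name denist is hypothetical.

```python
def denist(files, known_hashes):
    """Split files into those kept and those removed as known system files.

    files: iterable of (path, md5_hex) pairs.
    known_hashes: iterable of hex hash strings, assumed already loaded
    from the reference data set.
    """
    # Normalize case so "ABC" and "abc" are treated as the same hash.
    known = {h.lower() for h in known_hashes}
    kept, removed = [], []
    for path, md5_hex in files:
        if md5_hex.lower() in known:
            removed.append(path)
        else:
            kept.append(path)
    return kept, removed
```

Using a set keeps each lookup fast even when the reference list holds millions of entries.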

Keyword Search Filtering

Electronically stored information (ESI) can be filtered or flagged by searching on keywords, dates, custodians, file types, and more.
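
A filter of this kind can be sketched as a function that tests one document against whichever criteria are supplied. The record layout (text, custodian, file_type, and date keys) and the function name matches are assumptions for this example, and the keyword test is a plain case-insensitive substring match rather than a full search syntax.

```python
from datetime import date


def matches(doc, keywords=(), custodians=(), file_types=(), start=None, end=None):
    """Return True if the document satisfies every criterion that was given."""
    text = doc["text"].lower()
    # Keyword criterion: at least one keyword must appear in the text.
    if keywords and not any(k.lower() in text for k in keywords):
        return False
    if custodians and doc["custodian"] not in custodians:
        return False
    if file_types and doc["file_type"].lower() not in {t.lower() for t in file_types}:
        return False
    # Date range is inclusive on both ends.
    if start is not None and doc["date"] < start:
        return False
    if end is not None and doc["date"] > end:
        return False
    return True


# Example record using the assumed layout.
sample = {"text": "Merger terms attached", "custodian": "alice",
          "file_type": "msg", "date": date(2012, 3, 14)}
```

Calling matches(sample, keywords=["merger"], start=date(2012, 1, 1)) would test the sample record against a keyword and a start date at once.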