Migrated from eDJGroupInc.com. Author: Chuck Rothman. Published: 2012-04-04 05:00:46
Although Processing sits smack dab in the middle of the EDRM, little real consideration is paid to it. Processing does play a role in the various EDRM steps, but to most practitioners it is little more than a cost item on the e-discovery budget sheet. If you know the volume of data that needs processing and the applicable cost per gigabyte, you can check off processing and move on to the more exciting steps of analysis and review. Unlike the review tool, the software used to process the data is often not even specified.
This reality is unfortunate, because the cost and efficiency of an e-discovery project are significantly affected by the way processing is carried out. This series outlines some of the lesser-known aspects of e-discovery processing that can help make the subsequent analysis and review much more cost-effective.
Processing Defined
In its simplest terms, the processing stage of an e-discovery project involves the following steps:
1. Cataloguing or inventorying the records;
2. Extracting metadata;
3. Extracting searchable text;
4. Extracting any attachments that are incorporated into the records; and,
5. Calculating a signature value that is later used to identify exact duplicates.
This appears to be a fairly straightforward, standard method. However, in the world of computer file types, nothing is straightforward or standard. A good understanding of the different ways files and emails are constructed, and the options for processing them, will go a long way toward making the review stage (which everyone knows is the most costly aspect of e-discovery) much more efficient and cost-effective.
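To make the first two steps concrete, here is a minimal sketch of cataloguing and basic metadata capture using only Python's standard library. It records only file-system metadata (path, size, modified date); a real processing tool would also pull embedded metadata, such as the author and internal dates inside Office files, which requires format-specific parsers, and it would extract text and attachments as well.

```python
from datetime import datetime, timezone
from pathlib import Path

def catalogue(root: str) -> list[dict]:
    """Walk a collection and record basic file-system metadata for each file."""
    inventory = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            inventory.append({
                "path": str(path),
                "size_bytes": stat.st_size,
                "modified": datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc
                ).isoformat(),
            })
    return inventory
```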
Duplicates
Duplicates are a fact of life in e-discovery. Emails are routinely sent to multiple recipients, replete with attachments that get replicated umpteen times. The same files are routinely stored on local hard drives, server shares, and SharePoint sites. When conducting a review, looking at the same document more than once is a waste of time – the content is not going to change if you look at two copies of the same record.
The typical way that duplicates are identified is to create a unique record signature based on the contents of the record. If two records have the same signature, they are considered to contain the same content, and thus are duplicates.
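As an illustration of that idea, the sketch below hashes the raw bytes of each file and groups files whose signatures match. MD5 and SHA-1 are common choices in e-discovery tools; SHA-256 is used here purely for illustration.

```python
import hashlib
from collections import defaultdict

def file_signature(path: str) -> str:
    """Hash the raw bytes of a file to produce a record signature."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(paths: list[str]) -> dict[str, list[str]]:
    """Return only those signature groups containing more than one file."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_signature(p)].append(p)
    return {sig: ps for sig, ps in groups.items() if len(ps) > 1}
```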
The complication arises in defining what the "contents of the record" actually are. Do the contents mean the bytes that make up the individual file or email, or just the words that make up the text portion of the record? (Remember, electronic records contain both visual content and hidden, behind-the-scenes content, such as metadata and formatting.)
If the signature is based on all of the bytes, for instance, two Microsoft Word files that contain the exact same words but slightly different metadata (because one copy was printed and the other wasn't) won't show up as duplicates. Similarly, a Word file and a PDF generated from that Word file won't show up as duplicates, because the bytes within each file are very different, even though the words displayed are the same.
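A hypothetical sketch of the alternative, a text-level signature, is shown below. It assumes the searchable text has already been extracted during processing (pulling text out of Word or PDF files requires format-specific extractors) and simply normalises whitespace and case before hashing, so that a Word file and the PDF rendered from it produce the same signature even though their bytes differ.

```python
import hashlib

def text_signature(extracted_text: str) -> str:
    """Hash normalised extracted text rather than raw file bytes."""
    normalised = " ".join(extracted_text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# Different bytes, same words: the text-level signatures match.
word_text = "Quarterly report  for  Q3"
pdf_text = "Quarterly report for Q3"
assert text_signature(word_text) == text_signature(pdf_text)
```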
Another issue arises when it comes to emails. The typical way to create the record signature is to base it on both the email contents (without any attachments) and some of the email header fields. That way, if you send an email and then resend it later, the two will not appear as duplicates, because the sent-date header fields are different. However, if an email includes a blind carbon copy and you don't include the BCC field in the signature, the original sent email and all of the recipients' copies will be grouped together as duplicates. Unfortunately, only the original sent email shows the BCC field, and if that copy is not the one reviewed, the reviewer will never know that a recipient was BCC'd.
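The sketch below illustrates that trade-off. The field names and the sample message are hypothetical, not taken from any particular processing tool; the point is simply that whether the BCC field is included in the signature determines whether the sender's copy (the only one that reveals the BCC) survives de-duplication as a separate record.

```python
import hashlib

def email_signature(msg: dict, include_bcc: bool = False) -> str:
    """Build a signature from selected header fields plus the body."""
    fields = ["from", "to", "cc", "subject", "sent_date", "body"]
    if include_bcc:
        fields.append("bcc")
    material = "\x1f".join(msg.get(f, "") for f in fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

sender_copy = {"from": "a@x.com", "to": "b@x.com", "subject": "Q3",
               "sent_date": "2012-04-04T09:00:00", "body": "See attached.",
               "bcc": "c@x.com"}           # only the sender's copy shows the BCC
recipient_copy = dict(sender_copy, bcc="")  # recipient copies carry no BCC header

# Without the BCC field the two copies collapse into one "duplicate"...
assert email_signature(sender_copy) == email_signature(recipient_copy)
# ...with it, the sender's copy is kept as a distinct record.
assert email_signature(sender_copy, True) != email_signature(recipient_copy, True)
```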
There are other issues associated with duplicates, such as whether to de-duplicate within a custodian or across the entire database. However, if the record signatures upon which duplicates are based are not determined the way you think they are, the de-duplication effort will not give you the results you expect, possibly resulting in many more records to review.
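A minimal sketch of that scope choice, under the assumption that each record already carries a custodian name and a signature (both field names here are hypothetical): with a global scope, one copy survives per signature across the whole collection; with custodian-level scope, one copy survives per custodian.

```python
def deduplicate(records: list[dict], per_custodian: bool = False) -> list[dict]:
    """Keep the first record seen for each de-duplication key."""
    seen, kept = set(), []
    for rec in records:
        key = (rec["custodian"], rec["signature"]) if per_custodian else rec["signature"]
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```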
Part 2 of this series will continue to discuss nuances of processing that can have an effect on the number of records that get reviewed. Comments are welcome below.
eDiscoveryJournal Contributor – Chuck Rothman