Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-10-19 08:45:07

A recent post to the Yahoo! LitSupport group asked whether there were any published standards covering email deduplication hashing. The problem is more complicated than it appears on the surface. As several other search experts commented, the definition of a duplicate email and the actions that you take will vary based on the jurisdiction, matter issues and party demands. Under FRCP Rule 34(b), the requesting party may define the format of production within the scope of Rule 26(b) and the terms of your Rule 26(f) negotiations. Both rules provide some exemption for duplicative productions, but there are arguments that can be made about deduplication within a custodian’s email or across all custodians.

Our early criminal forensic standards for authentication of evidence dealt primarily with individual files residing on a single piece of physical media. If you ask a forensics examiner about deduplicating email, he or she would generally shudder in horror at the thought. After all, forensics is all about reconstruction of actions and context beyond a reasonable doubt. Civil litigation rarely strays into slack space or file fragments, and it has a much lower threshold of authentication to meet while striving to protect privileged and confidential communications. It is common practice to identify and suppress or remove essentially identical email prior to legal review.

The definition of a ‘duplicate’ email is something that should be agreed upon during the meet and confer. The problem is that most parties do not actually understand what criteria their chosen technology or service provider bases deduplication on. Most systems create a hash value from a combination of different MAPI field values. An email is basically a set of fields, either held within a database container (PST, NSF, EDB or other format) or as an individual delimited file (EML, MSG, etc.).
The fly in the ointment here is that most emails are not true duplicates, but rather similar versions with slight differences in certain fields.

To borrow an example, John sending an email to Ringo, Paul and George actually creates four emails that most systems would identify as identical copies. Yet the email in John’s Sent folder does not have any internet header fields because it never actually left his mailbox. Each of the other emails has a different header and probably a different DateReceived time stamp.
Below are the fields typically used to determine duplicate email:
- DateSent
- Subject
- Sender
- To
- CC
- Email Body
You might notice that the list does not contain BCC or any Attachments. Some systems add these fields, but many do not. Some systems use only a few hundred characters from the Body, and I have even found a few that drop the Body check completely based on the theory that a person can only send one email at a time. Unfortunately, many automated expense and other systems can send a burst of hundreds of emails with the same DateSent.

So what fields can be different on versions of the same email?
- DateReceived
- Read/Unread
- BCC
- DeliveryPath
- Receipt Notices
- Folder Location (Inbox, Deleted Items, Project Folders)
- Forward/Reply
- MessageID
- ConversationID
- DisplayName
- SMTP Address (Internal/External email can resolve these differently)
- User Categories
- User Action Flags
- And many more…
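
To make the two lists above concrete, here is a minimal Python sketch of a field-based deduplication hash. It is an illustration only, not how any particular product works: the `Email` class, the `dedup_hash` function, the normalization rules, the choice of SHA-256 and the example addresses are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    # Hypothetical field names; real tools read MAPI properties or headers.
    date_sent: str
    subject: str
    sender: str
    to: tuple
    cc: tuple
    body: str
    # Fields that commonly differ between copies; deliberately NOT hashed.
    date_received: str = ""
    is_read: bool = False
    folder: str = ""


def _ws(text: str) -> str:
    # Assumption: collapse runs of whitespace so line wrapping does not matter.
    return " ".join(text.split())


def _addrs(addresses: tuple) -> str:
    # Assumption: address case and recipient order are not significant.
    return ";".join(sorted(a.strip().lower() for a in addresses))


def dedup_hash(msg: Email, include_body: bool = True) -> str:
    parts = [
        msg.date_sent.strip(),
        _ws(msg.subject),
        msg.sender.strip().lower(),
        _addrs(msg.to),
        _addrs(msg.cc),
    ]
    if include_body:
        parts.append(_ws(msg.body))
    # Join with a control character that should not appear in the fields.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# John's Sent Items copy and Ringo's Inbox copy of the same message.
sent = Email("2010-10-19T08:45:07Z", "Set list", "john@example.com",
             ("ringo@example.com", "paul@example.com", "george@example.com"),
             (), "See attached.", folder="Sent Items")
received = Email("2010-10-19T08:45:07Z", "Set list", "john@example.com",
                 ("paul@example.com", "george@example.com", "ringo@example.com"),
                 (), "See attached.", date_received="2010-10-19T08:45:09Z",
                 is_read=True, folder="Inbox")

# Two different automated emails sent in the same second.
burst_a = Email("2010-10-19T09:00:00Z", "Expense report",
                "expenses@example.com", ("john@example.com",), (),
                "Report 1001 approved.")
burst_b = Email("2010-10-19T09:00:00Z", "Expense report",
                "expenses@example.com", ("john@example.com",), (),
                "Report 1002 rejected.")

same_copies = dedup_hash(sent) == dedup_hash(received)
false_match = (dedup_hash(burst_a, include_body=False)
               == dedup_hash(burst_b, include_body=False))
kept_apart = dedup_hash(burst_a) != dedup_hash(burst_b)
```

Under these assumptions, `same_copies` is true because DateReceived, Read/Unread and folder location never enter the hash. `false_match` is also true: with the Body check dropped, two distinct emails from an automated burst collapse into one ‘duplicate’, and only hashing the Body keeps them apart.
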
In the context of typical corporate litigation, most of these fields may not be important enough to be considered or produced as an individual copy. But ONLY counsel can or should make that decision. I have been involved with many matters where key issues depended on whether specific custodians actually read certain emails. The Read/Unread property can easily be reset by a custodian, but action flags, categories and manually moving an email to a folder are good indicators that the email was read.

As an example, I just finished discovery scenario testing of Exchange 2010. Although they killed off single-instance storage (SIS) in the mailstore, they added an ‘exclude duplicate items’ function into the search GUI when you restore results to a mailbox. I was testing with the EDRM PST Data Set (Ver. 1) and found that Exchange was throwing out over two thirds of the results as ‘duplicates’, while several other systems only killed 5-20%. There are a lot of ‘artifacts’ in the Enron set, but it really makes you think. I ended up creating a deduplication testing PST with multiple sets of email with slightly differing properties/actions to really understand what Exchange was doing for my research paper.

Once you define what makes a unique email in your collection, you need to understand how duplicates are handled. Are they deleted to save space? Are the copies kept in the collection, but suppressed during review? Can you reconstruct all the copies if needed? Does the system track all the copies and the differences? Can it report all this information to you during review or add it to the production load files?

So how you handle potential duplicates is just as important as the criteria by which you define duplicates. I agree with the need for standards, but I think that they have to be put in the legal context. Retaining user actions (folder, forward, reply, tag, etc.) can be critical in some cases and absolutely irrelevant in others.
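
The handling questions above can be illustrated with a short Python sketch that suppresses duplicates for review while keeping a record of every copy. This is a sketch under stated assumptions rather than a description of any real system: the record layout, the precomputed hash labels (`h1`, `h2`) and the function names `group_copies` and `review_set` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (dedup hash, custodian, folder, read flag).
# The hash is assumed to have been computed already by the chosen tool.
copies = [
    ("h1", "John", "Sent Items", True),
    ("h1", "Ringo", "Inbox", False),
    ("h1", "Paul", "Project Folders", True),
    ("h2", "George", "Inbox", True),
]


def group_copies(records, per_custodian=False):
    """Group copies across all custodians, or within each custodian."""
    groups = defaultdict(list)
    for digest, custodian, folder, is_read in records:
        key = (custodian, digest) if per_custodian else digest
        groups[key].append(
            {"custodian": custodian, "folder": folder, "is_read": is_read}
        )
    return dict(groups)


def review_set(groups):
    """Show one copy per group; keep the rest suppressed, not deleted."""
    return {
        key: {"primary": members[0], "suppressed": members[1:]}
        for key, members in groups.items()
    }


global_groups = group_copies(copies)                       # across custodians
custodian_groups = group_copies(copies, per_custodian=True)  # within custodian
report = review_set(global_groups)
```

With this sample data, deduplicating across all custodians leaves two review items, while deduplicating within each custodian leaves four. Because the suppressed copies stay in `report` with their custodian, folder and read flag, they can still be reconstructed or written to a production load file if counsel decides those differences matter.
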
We have no governing bodies or associations that have published any standards on these issues. That means that there is not one right answer, but an obligation to understand and test your chosen technology so that counsel can make the decision that is right for each case.