A recent post to the Yahoo! LitSupport group asked whether there were any published standards covering email deduplication hashing. The problem is more complicated than it appears on the surface. As several other search experts commented, the definition of a duplicate email and the actions that you take will vary based on the jurisdiction, matter issues and party demands. Under FRCP Rule 34(b), the requesting party may define the format of production within the scope of Rule 26(b) and the terms of your Rule 26(f) negotiations. Both rules provide some exemption for duplicative productions, but there are arguments that can be made about deduplication within a custodian’s email or across all custodians.
Our early criminal forensic standards for authentication of evidence dealt primarily with individual files residing on a single piece of physical media. If you ask a forensics examiner about deduplicating email, he/she would generally shudder in horror at the thought. After all, forensics is all about reconstruction of actions and context beyond a reasonable doubt. Civil litigation rarely strays into slack space or file fragments, and it has a much lower threshold of authentication to meet while striving to protect privileged and confidential communications. It is common practice to identify and suppress or remove essentially identical email prior to legal review.
The definition of a ‘duplicate’ email is something that should be agreed upon during the meet and confer. The problem is that most parties do not actually understand what criteria their chosen technology or service provider bases deduplication on. Most systems create a unique hash value based on a combination of different MAPI field values. An email is basically a set of fields, either held within a database container (PST, NSF, EDB or other format) or as an individual delimited file (EML, MSG, etc.). The fly in the ointment here is that most emails are not true duplicates, but rather similar versions with slight differences in certain fields.
To borrow an example, John sending an email to Ringo, Paul and George actually creates four emails that most systems would identify as identical copies. Yet, the email in John’s Sent folder does not have any internet header fields because it never actually left his mailbox. Each of the other emails has a different header and probably a different DateReceived time stamp.
Below are the fields typically used to determine duplicate email:
- DateSent
- Subject
- Sender
- To
- CC
- Email Body
You might notice that list did not contain BCC or any Attachments. Some systems add these fields, but many do not. Some systems only use a few hundred characters from the Body and I have even found a few that drop the Body check completely based on the theory that a person can only send one email at a time. Unfortunately, many automated expense and other systems can send a burst of hundreds of emails with the same DateSent. So what fields can be different on versions of the same email?
- DateReceived
- Read/Unread
- BCC
- DeliveryPath
- Receipt Notices
- Folder Location (Inbox, Deleted Items, Project Folders)
- Forward/Reply
- MessageID
- ConversationID
- DisplayName
- SMTP Address (Internal/External email can resolve these differently)
- User Categories
- User Action Flags
- And many more…
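The two lists above can be sketched in a few lines of Python. This is only an illustration of the general approach, not any vendor's actual method: the field names, the normalization rules and the choice of SHA-256 are all assumptions made for the example. It hashes only the "common" fields, so copies that differ in folder, read status or DateReceived collapse to the same value.

```python
import hashlib

# Hypothetical field names for illustration; real tools map these to
# MAPI properties in their own (often undocumented) ways.
HASH_FIELDS = ("date_sent", "subject", "sender", "to", "cc", "body")


def normalize(value):
    """Assumed normalization: sort and lowercase address lists,
    collapse runs of whitespace in text fields."""
    if isinstance(value, (list, tuple)):
        return ";".join(sorted(" ".join(v.split()).lower() for v in value))
    return " ".join(str(value).split())


def dedup_hash(email, fields=HASH_FIELDS):
    """Hash only the selected fields; every other property is ignored."""
    h = hashlib.sha256()
    for name in fields:
        h.update(name.encode("utf-8"))
        h.update(b"\x1f")  # separator between field name and value
        h.update(normalize(email.get(name, "")).encode("utf-8"))
        h.update(b"\x1e")  # separator between fields
    return h.hexdigest()


# John's Sent Items copy and Ringo's Inbox copy of the same message.
sent_copy = {
    "date_sent": "2010-10-20T10:00:00Z",
    "subject": "Set list",
    "sender": "john@example.com",
    "to": ["ringo@example.com", "paul@example.com", "george@example.com"],
    "cc": [],
    "body": "See attached set list.",
    "folder": "Sent Items",
    "read": True,
}
inbox_copy = dict(sent_copy, folder="Inbox", read=False,
                  date_received="2010-10-20T10:00:07Z")

# Same hash: the differing fields are not part of the hash input.
print(dedup_hash(sent_copy) == dedup_hash(inbox_copy))

# Add the folder to the criteria and the two copies become "unique".
with_folder = HASH_FIELDS + ("folder",)
print(dedup_hash(sent_copy, with_folder) == dedup_hash(inbox_copy, with_folder))
```

Changing the tuple of fields changes what counts as a duplicate, which is exactly the decision that should be understood and agreed upon rather than left to a tool's defaults.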
In the context of typical corporate litigation, most of these fields may not be important enough to be considered or produced as an individual copy. But ONLY counsel can or should make that decision. I have been involved with many matters where key issues depended on whether specific custodians actually read certain emails. The Read/Unread property can easily be reset by a custodian, but action flags, categories and manually moving an email to a folder are good indicators that the email was read.
As an example, I just finished discovery scenario testing of Exchange 2010. Although Microsoft killed off single-instance storage (SIS) in the mailstore, they added an ‘exclude duplicate items’ function into the search GUI when you restore results to a mailbox. I was testing with the EDRM PST Data Set (Ver. 1) and found that Exchange was throwing out over two thirds of the results as ‘duplicates’, while several other systems only killed 5-20%. There are a lot of ‘artifacts’ in the Enron set, but it really makes you think. I ended up creating a deduplication testing PST with multiple sets of email with slightly differing properties/actions to really understand what Exchange was doing for my research paper.
Once you define what makes a unique email in your collection, then you need to understand how duplicates are handled. Are they deleted to save space? Are the copies kept in the collection, but suppressed during review? Can you reconstruct all the copies if needed? Does the system track all the copies and the differences? Can it report all this information to you during review or add it to the production load files?
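One way to picture the "suppress but track" option raised by these questions is the rough sketch below. It is not how any particular product works: the record layout, the `simple_key` stand-in for a real dedup hash, and the list of tracked fields are all hypothetical. The first copy seen becomes the review copy, and every suppressed copy is logged with the fields where it differs.

```python
from collections import defaultdict


def simple_key(email):
    """Stand-in dedup key for the example; a real system would use its
    own hash over whatever fields were agreed upon."""
    return (email["date_sent"], email["sender"], email["subject"], email["body"])


def group_duplicates(emails, key_func=simple_key):
    """Group emails by dedup key without deleting anything."""
    groups = defaultdict(list)
    for email in emails:
        groups[key_func(email)].append(email)
    return groups


def duplicate_report(groups, tracked=("custodian", "folder", "read")):
    """One row per suppressed copy, recording how it differs from the
    copy kept for review (the first one encountered)."""
    rows = []
    for copies in groups.values():
        primary, *suppressed = copies
        for copy in suppressed:
            diffs = {
                field: (primary.get(field), copy.get(field))
                for field in tracked
                if primary.get(field) != copy.get(field)
            }
            rows.append({
                "primary_id": primary["id"],
                "suppressed_id": copy["id"],
                "differences": diffs,
            })
    return rows


base = {"date_sent": "2010-10-20T10:00:00Z", "sender": "john@example.com",
        "subject": "Set list", "body": "See attached set list."}
emails = [
    dict(base, id=1, custodian="John", folder="Sent Items", read=True),
    dict(base, id=2, custodian="Ringo", folder="Inbox", read=False),
    dict(base, id=3, custodian="Paul", folder="Project Folders", read=True),
    dict(base, id=4, custodian="John", folder="Sent Items", read=True,
         subject="Rehearsal", body="Thursday at noon."),
]

for row in duplicate_report(group_duplicates(emails)):
    print(row)
```

A log like this is what lets you reconstruct the suppressed copies later, or carry custodian and folder information into a production load file, if those details turn out to matter.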
So how you handle potential duplicates is just as important as the criteria by which you define duplicates. I agree with the need for standards, but I think that they have to be put in the legal context. Retaining user actions (folder, forward, reply, tag, etc) can be critical in some cases and absolutely irrelevant in others. We have no governing bodies or associations that have published any standards on these issues. That means that there is not one right answer, but an obligation to understand and test your chosen technology so that counsel can make the decision that is right for each case.



Greg, this reminds me of the many conversations we had when we were working on developing an email de-duplicator that would fit your clients’ needs. It was years ago, but the arguments remain true to this day as many people are still uneducated about the various de-duplication processes and techniques. I keep coming across scenarios where clients ask for de-duplication, then I explain some of the concepts you’ve mentioned in your article and suddenly they realize that de-duplication is more than just removing “identical” or “similar” items from a given set. I’ve often sat down and tried to explain the concept of “convenience” data (don’t have another word for it, so I’m borrowing the term you coined for the PST DeDuper) and how important it is to at least keep a chain-of-custody report or some kind of record on the items removed from any data set in case “convenience” information becomes relevant. Additionally, most clients tell me many of their de-duplication experiences have left them confused as they cannot get validation or reports from either the software or vendor regarding what items existed where – and the common theme is there are no standards. Every application/vendor does it a little differently, and the process and methodology are not always clear, whether intentional or not.
October 20, 2010 at 10:42 am
eruano
Member Type: Provider | Role: Other | Size: Solo | Years of Experience: 17
Thank you, Greg, for writing about this; now I don’t have to. I’ve tried to explain this to clients before, mostly to blank stares. The simple fact is that since most emails are within PST files or in an Exchange database, they can’t be individually hashed. I have asked several vendors how they identify duplicates, and it varies widely. Most will try to use a unique ID number [e.g., from the mail gateway], but these things differ and depend on whether the email traversed the internet or not [e.g., internal emails].
I’m reasonably comfortable with the methods that most vendors use, but there certainly isn’t a standard, and it’s probably not even possible to develop one anywhere near to a hash value equivalence. Maybe there should be a peer review of vendor methods for this to establish some kind of credential.
October 20, 2010 at 11:24 am
JamesWright
Member Type: Corporate | Role: Litsupport | Size: Solo | Years of Experience: 30+ | Certifications/Licenses: PE
Don’t overthink the problem. Archives are doing single-instance storage well. That is the paradigm to follow and leverage.
October 20, 2010 at 3:13 pm
wtkjd
Member Type: Firm | Role: Attorney | Size: Solo | Years of Experience: 26 | Certifications/Licenses: JD; CA Bar
Thanks Eric/James! I enjoyed compiling that post from various client presentations and product testing reports over the years. It is not as simple as it seems and it will come up in every matter.
James,
I almost forgot about all the MessageID issues. I tend to assume that everyone knows that this key property is pretty much useless for deduplication. In fact, I have found collisions with that property in collections as small as 10k emails. I cannot remember the last time I tested a system that actually used the MessageID for dedup. But you are entirely correct that many clients still think that way, especially the Exchange Admins.
Thanks all!
Greg
October 20, 2010 at 3:18 pm
Greg Buckles
Member Type: Other | Role: Consultant | Size: Solo | Years of Experience: 22 | Certifications/Licenses: court certified expert witness
Bill,
I would agree if I had not spent so much time testing most of the market leading archives for corporate clients. Some use a ‘First In – Only In’ process whereby the Journal copy is saved and every subsequent user version is identified as a ‘dup’ and discarded with no record of the source folder location or any user action fields. They control access to the SIS copy based on the To/From, rather than a specific owner property. Most archives do a good job of keeping the context, but you cannot assume that the dedup process will do for discovery preservation without understanding the process in the context of specific cases. You are on the money about not overthinking it, but don’t assume that every system is the same.
Cheers! Greg
October 20, 2010 at 3:26 pm
Greg Buckles
Member Type: Other | Role: Consultant | Size: Solo | Years of Experience: 22 | Certifications/Licenses: court certified expert witness