I always enjoy meeting other super geeks who revel in playing on the cutting edge of discovery technology. While I will reserve the ‘geek’ label for myself, my conversation on TAR with David D. Lewis was definitely a highlight of the Carmel Valley eDiscovery Retreat. I owe David for bringing the recent Actos Products case management order to my attention (MDL No. 6:11-md-2299). The order lays out the agreed-upon protocol for a “Search Methodology Proof of Concept” to test Equivio’s Relevance predictive coding on the ESI of 4 of 29 custodians as a possible substitute for traditional manual review of the entire collection. Once you get into the specific email protocol (Section E), the order starts to read as if it could have been copied directly from a savvy provider’s procedural manual. Don’t get me wrong: the parties are using this process as a ‘proof of concept’ limited to four custodians’ email, to see whether they can apply these analytics to the broader potential ESI collection. There are many mandatory meet-and-confer checkpoints in the process where either side could raise concerns or essentially bring the process back to the bench, but I am guessing that this train has left the station.
Why do I think the parties have essentially committed to using TAR on the entire collection? The order does not give any insight into the volume, composition or diversity of the email or other ESI sources. The parties have already agreed to start with a random sample of 500 emails, culled of duplicates, non-text email, spam and commercial email. This is the control set used to estimate ‘richness’ (the percentage of relevant documents within the collection). Based on the language in the order, the email of the four ‘assessment’ custodians has not yet been collected or profiled. Statistical calculators such as this site validate the margin of error (±4.38%) at the chosen 95% confidence level on a population of up to 10,000,000 items. However, these survey and statistical models often assume a reasonably random distribution of positive items, which rarely occurs in custodial mailboxes or PST files (journaled email could qualify). Without some contact with and assessment of the email, I would be hesitant to recommend any specific review or sampling protocol.
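For readers who want to check the arithmetic, the ±4.38% figure is what the standard margin-of-error formula for a proportion yields for a sample of 500 at 95% confidence, assuming simple random sampling and the worst-case 50% proportion that online calculators typically use. A quick back-of-the-envelope sketch (my reconstruction, not a formula from the order):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case margin of error for estimating a proportion from a
    simple random sample of n items (z = 1.96 for 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

# The order's 500-document control sample:
print(f"±{margin_of_error(500):.2%}")  # ±4.38%
```

Note that this simple formula ignores the finite-population correction, which is why the same margin holds “on a population of up to 10,000,000 items” — the weak point is the random-distribution assumption, not the population size.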
Maybe the defense has already performed some preliminary interviews, sampling, collections or other profiling activities that are not reflected in the order. That could explain why the parties have agreed not to use seed sets of known relevant email. Seed sets can skew your training results and may channel you down the path of only finding what you already know about, but I see them as a useful tool in a quality process. The providers in this case have declined to comment on an active matter, an admirable example that we should all respect. I know that Epiq has some really sharp folks with a LOT of hands-on TAR expertise. I can ‘feel’ their input in the practical workflow for protecting privileged documents during the training process. If nothing else, you should read the protocol for the hand-off process between defense and plaintiff expert reviewers.
Here is a quick outline of the “Assessment Phase” training process:
By my calculations (385 ÷ 15% ≈ 2,567), the joint team may review a minimum of 2,567 emails to get a defined recall estimate with a margin of error of ±5%. All of the capitalized ‘terms of art’ and numbers thrown around give the impression of a well-defined process, but I could not find any supporting expert affidavits or other materials that defined the statistical formulas or assumptions behind them. The order incorporates Equivio Relevance vocabulary and concepts such as “Stability”, “Nearly Stable”, “Statistical” and “Baseline” with minimal definitions. The Actos protocol goes on to define the ‘Iterative Training’ process for reaching a mutually agreed ‘Stable’ point through sample batches of 40 documents. The protocol also contemplates a final sampling of the documents excluded as ‘not relevant’ as a “Test the Rest” validation exercise.
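To unpack my back-of-the-envelope math: 385 is the standard sample size for a ±5% margin of error at 95% confidence, and at an assumed richness of 15% you would need to review roughly 385 ÷ 0.15 ≈ 2,567 documents to surface that many relevant ones. A sketch of my reading of the numbers (the 15% richness is my assumption, not a figure stated in the order):

```python
import math

Z = 1.96  # z-score for 95% confidence

def sample_size(moe, p=0.5):
    """Relevant documents needed to estimate a proportion within ±moe,
    using the worst-case p = 0.5."""
    return math.ceil(Z ** 2 * p * (1 - p) / moe ** 2)

needed = sample_size(0.05)        # 385 relevant documents for ±5%
richness = 0.15                   # assumed collection richness
to_review = math.ceil(needed / richness)
print(needed, to_review)          # 385 2567
```

If the actual richness comes in lower than 15%, the review burden scales up proportionally, which is exactly why the 500-document control set matters so much to the rest of the workflow.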
Having spent a lot of analyst time with the Equivio team, I have a pretty good feel for their product and believe that I ‘get’ their terminology by this point. I want to make it clear that I am not criticizing the process or technology outlined. Instead, I wish that the order or supplemental materials had gotten much more specific and contemplated some of the potential challenges that all of these systems face with real-world ESI. If the parties already know the potential scale, composition and distribution characteristics of the custodial collection, why is that information not in the order? What are they going to do with all the email containing non-text files, or file types that do not lend themselves to conceptual relationships? Many of the ESI sources outlined in Section C of the order are relational databases that will not lend themselves readily to analytic relevance determination. This order gives us valuable insight into TAR in action and is worth the read, but it is not a universal prescription for TAR review. These learning-propagated review methods are relatively new to the legal market, and each should be evaluated carefully with expert advice. The fact that the plaintiffs requested the old-school Concordance-style DAT/TIFF/TEXT production format makes me wonder whether they are sophisticated enough to really challenge the proposed process or define the exception categories that will arise in any real-world ESI collection. It is good to see TAR being leveraged for actual relevance decisions, and this is yet another step in TAR’s adoption by the eDiscovery market.