My recent piece, “Predictive Coding Metrics are for Weenies – Part I,” looked at how those who want metrics that will suddenly “validate” predictive coding are going to be left behind waiting for that validation. To examine the fence sitters’ concerns more closely, I agree it would be nice to know in advance whether the number of random sample documents your TAR system uses is enough to train it adequately. If the system is looking at 5,000 documents as a training set, is that enough? Should it be something smaller, such as 2,000 documents? It would also be nice to know whether the final recall rate should be an estimated 70, 80, or 90 percent of the total responsive documents in the collection (recall measures what percentage of responsive documents were found out of the total estimated number of responsive documents in the population). Some TAR systems rank documents by their likelihood of being responsive, so another helpful metric would be whether documents that score above X with your predictive coding system are presumptively responsive and, conversely, whether documents that score below Y are presumptively not responsive. These types of metrics ARE NOT LIKELY to emerge, for a number of reasons.
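For readers who want to see the recall calculation concretely, here is a minimal sketch. The counts are hypothetical and purely illustrative, not drawn from any actual matter:

```python
# Illustrative sketch of the recall metric described above.
# All document counts here are hypothetical.

def recall(found_responsive: int, total_responsive: int) -> float:
    """Fraction of all responsive documents that the review actually found."""
    if total_responsive == 0:
        raise ValueError("total_responsive must be positive")
    return found_responsive / total_responsive

# Hypothetical example: sampling estimates 10,000 responsive documents
# in the collection, and the TAR process surfaced 8,000 of them.
print(f"Estimated recall: {recall(8_000, 10_000):.0%}")  # prints "Estimated recall: 80%"
```

Note that the denominator is itself an estimate, typically derived from a random sample of the collection, which is one reason a single universal recall target is hard to pin down.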
First, lawyers rely on published opinions for precedential guidance, but most cases eventually reach some form of agreement on discovery issues, and those agreements provide little guidance to the legal community as a whole. When lawyers can’t reach an agreement and a judge decides the issue, very few appellate opinions emerge to challenge the judge’s or special master’s ruling, because discovery issues are seldom appealed relative to the volume of litigation. Even if such opinions did emerge, a more important factor is that the quality of collections and the richness of the underlying data vary depending on factors that differ across organizations and people.
I can’t see how uniform metric standards can easily emerge here to turn TAR into the equivalent of an “easy button.” What we are stuck with is a lawyer’s least favorite standard, “reasonableness,” and its close eDiscovery cousin, “proportionality,” applied to the particular case, the types of data you are evaluating, and the math-oriented results that are emerging. You then need to argue to the other side, and to the court if necessary, that your chosen strategy is “reasonable.” So the metrics will likely remain nebulous and will depend on the case.
I will continue to explore the issue of metrics in my next post.