PREDICTIVE CODING METRICS ARE FOR WEENIES – Part II


Karl Schieneman


Member Type: Other
Role: Other
Size: Small (less than 50)
Years of Experience: 20
Certifications/Licenses: JD, MBA, W.D. of PA Special Master




My recent piece, “Predictive Coding Metrics are for Weenies – Part I,” argued that those waiting for metrics that will suddenly “validate” predictive coding are going to get left behind. Looking more closely at the fence sitters’ concerns, I agree it would be nice to know in advance whether the number of randomly sampled documents your TAR system uses is enough to train it adequately.  If the system is looking at 5,000 documents as a training set, is that enough?  Or would something smaller, such as 2,000 documents, do?  It would also be nice to know whether the final recall rate should be an estimated 70, 80, or 90 percent (recall is the percentage of responsive documents found out of the estimated total number of responsive documents in the collection).  Some TAR systems rank documents by their likelihood of being responsive, so another helpful metric would be whether documents scoring above X in your predictive coding system are presumptively responsive and, conversely, whether documents scoring below Y are presumptively not responsive.  These types of metrics ARE NOT LIKELY to emerge, for a number of reasons.
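To make recall and score cutoffs concrete, here is a minimal Python sketch. It assumes a small, human-reviewed random sample in which each document carries a TAR score between 0 and 1 and a reviewer's responsive/not-responsive call. The sample data, the function names, and the cutoff values are all hypothetical and are not taken from any particular TAR product or case.

```python
# Hypothetical reviewed sample: (tar_score, reviewer_says_responsive).
# These ten documents are invented purely for illustration.
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, True),
    (0.40, False), (0.35, True), (0.20, False), (0.10, False), (0.05, False),
]


def recall_at_cutoff(sample, cutoff):
    """Share of the sample's responsive documents that score at or above the cutoff."""
    total_responsive = sum(1 for _, responsive in sample if responsive)
    if total_responsive == 0:
        raise ValueError("sample contains no responsive documents")
    found = sum(1 for score, responsive in sample if responsive and score >= cutoff)
    return found / total_responsive


def precision_at_cutoff(sample, cutoff):
    """Share of documents scoring at or above the cutoff that are responsive."""
    flagged = [responsive for score, responsive in sample if score >= cutoff]
    if not flagged:
        raise ValueError("no documents score at or above the cutoff")
    return sum(flagged) / len(flagged)


if __name__ == "__main__":
    # Compare two candidate cutoffs on the same sample.
    for cutoff in (0.5, 0.3):
        r = recall_at_cutoff(SAMPLE, cutoff)
        p = precision_at_cutoff(SAMPLE, cutoff)
        print(f"cutoff {cutoff}: recall {r:.2f}, precision {p:.2f}")
```

Lowering the cutoff in this toy sample raises recall and lowers precision. Where that trade-off lands in a real matter depends on the collection, which is the point of this post: a cutoff that looks fine on one data set says little about the next one.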

First, lawyers rely on published opinions for precedential guidance, but most cases eventually reach some form of agreement on discovery issues, and those agreements do not provide much guidance to the legal community as a whole.   When lawyers can’t reach an agreement and a judge decides the issue, very few appellate opinions, relative to the amount of litigation, will review that judge’s or special master’s decision, because discovery issues are seldom appealed.   Even if such opinions did emerge, a more important factor is that the quality of collections and the richness of the underlying data vary across organizations and people.

I can’t see how uniform metric standards can easily emerge here to turn TAR into the equivalent of an “easy button.”  What we are stuck with is a lawyer’s least favorite standard, “reasonableness,” and its close eDiscovery cousin, “proportionality,” applied to the particular case, the types of data you are evaluating, and the math-oriented results that are emerging. You then need to argue to the other side, and to the court if necessary, that your chosen strategy is “reasonable.”   So the metrics will likely remain nebulous and will depend on the case.

I will continue to explore the issue of metrics in my next post.

eDiscoveryJournal Contributor Karl Schieneman




3 Comments Posted For This Story

  • Just this morning we ran a 20,092 document data set through BR for a client that resulted in 171 (99+% Visually Similar) Document Type clusters. This works for ANY sized data set. By the numbers -

    1> 100% of the documents sampled (0% error rate)
    2> 99.15% reduction (171 DTC vs 20,092 documents for initial relevancy and culling)

    I’ll let you know what the post relevancy numbers are going forward.

    John Martin

    Member Type: Other  |  Role: Other  |  Size: Small (less than 50)  |  Years of Experience: 25  |  Certifications/Licenses: Court Certified Expert



  • Update – one reviewer/one day determined 44 document type clusters to be relevant. This is a low tech, paper based review with the attorney only blowing back to paper the documents that are relevant post clustering and folder review (a folder for each document type cluster) to look through.

    Printing and numbering costs by volume of pages eliminated is 92.2% over the total volume collected.

    John Martin

    Member Type: Other  |  Role: Other  |  Size: Small (less than 50)  |  Years of Experience: 25  |  Certifications/Licenses: Court Certified Expert



  • Hi John,

    Would you identify “BR”? Thank you.

    ESC

    Member Type: Firm  |  Role: IT  |  Size: Large (more than 1000)  |  Years of Experience: 7  |  Certifications/Licenses: ACEDS


