Migrated from eDJGroupInc.com. Author: Chuck Rothman. Published: 2012-03-14 04:00:27. Format, images and links may no longer function correctly.

Anyone who has read about e-Discovery in the past year has almost certainly come across the term “Predictive Coding” or one of its aliases. Judging by the number of seminars and vendor banners at LegalTech in New York this past month, it is definitely this year’s e-Discovery craze.

However, like most technology trends, the term “Predictive Coding” has come to mean a lot of different things. While fighting my way through the throngs of legal techno-geeks on the LegalTech vendor floor, I came across no fewer than ten completely different definitions of the term. The definitions ranged from “a system that assists in separating the legal wheat from the chaff” and “a system that proposes to automatically cull data before it’s collected” to “a system that provides ‘predictive’ objective coding of scanned images”. Almost every vendor seemed to want to jump onto the “Predictive” bandwagon.

This article is going to define Predictive Coding as “a process whereby a definition, made up of various rules, is created. Records in a collection are then evaluated to determine how well they match the definition”.

I have purposely left out any reference to e-discovery, document review, coding, and such, since, at its base level, the process can be applied to just about any situation where you want to find some specific information. In fact, the basic technology that is incorporated into the various e-discovery Predictive Coding applications has been around for at least twenty years.
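
To make the definition concrete, here is a minimal Python sketch of a “definition” made up of rules, and of scoring records against it. The `Rule` class, the terms, and the weights are all hypothetical and chosen only for illustration. Real products build far richer models, but the shape of the idea is the same.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    term: str      # hypothetical: a word whose presence counts as evidence
    weight: float  # positive supports a match, negative counts against it


def match_score(record: str, definition: list) -> float:
    """Sum the weights of every rule whose term appears in the record."""
    words = set(record.lower().split())
    return sum(rule.weight for rule in definition if rule.term in words)


# An illustrative definition and two illustrative records.
definition = [Rule("merger", 2.0), Rule("confidential", 1.0), Rule("lunch", -1.5)]
records = [
    "Confidential notes on the merger timeline",
    "Lunch order for Friday",
]
scores = [match_score(record, definition) for record in records]
```

The first record matches two positive rules and the second matches only the negative one, so the first scores higher. Nothing here is specific to e-discovery, which is the point of the definition.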


In the legal world, the “definition” is created in one of two ways, depending on which software product is used:

1. Sampling and Convergence

This approach is akin to when a document review begins. In the old days, the review team would sit in a room with the lead lawyer at the front. The lawyer would explain the case using a couple of example records to show the difference between what is relevant to the matter and what is not. The review team would then go off and review records, coding them according to what they learned in the planning meeting.

Fast-forward to 2012. Instead of explaining the nuances of the case to a team of review lawyers, the lead lawyer “explains” them to the computer. The computer presents a small random set of records to the lawyer, who codes each one as relevant or not, and this process is repeated several times. Behind the scenes, the computer compares the relevant and non-relevant records and builds a model of what characterizes each set. On subsequent iterations, the computer “predicts” how the lawyer will code each record. Once the computer’s predictions consistently match the lawyer’s coding, the computer has learned what makes a record relevant and can go on to code the remaining records.
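
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-count Naive Bayes model stands in for whatever proprietary model a given product uses, `lawyer_codes` is a hypothetical function simulating the lead lawyer's decisions, and the corpus, batch size, and 95% agreement target are invented for the example.

```python
import math
import random
from collections import Counter


class TinyNaiveBayes:
    """Word-count Naive Bayes with add-one smoothing (a stand-in model)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def learn(self, text, relevant):
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(text.lower().split())

    def predict(self, text):
        vocab = len(set(self.word_counts[True]) | set(self.word_counts[False])) + 1
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + vocab))
            scores[label] = score
        return scores[True] > scores[False]


def train_until_converged(corpus, lawyer_codes, batch_size=10, target=0.95, seed=0):
    """Show random batches to the lawyer until predictions match the coding."""
    rng = random.Random(seed)
    unreviewed = list(corpus)
    rng.shuffle(unreviewed)
    model = TinyNaiveBayes()
    rounds = 0
    while len(unreviewed) >= batch_size:
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        rounds += 1
        # Only trust agreement once the model has seen both kinds of record.
        seen_both = min(model.doc_counts.values()) > 0
        predictions = [model.predict(record) for record in batch]
        labels = [lawyer_codes(record) for record in batch]
        agreement = sum(p == l for p, l in zip(predictions, labels)) / batch_size
        for record, label in zip(batch, labels):
            model.learn(record, label)
        if seen_both and agreement >= target:
            break
    return model, unreviewed, rounds


# Hypothetical corpus: half the records concern a contract breach.
corpus = [
    f"supplier contract breach notice {i}" if i % 2 == 0
    else f"cafeteria lunch menu update {i}"
    for i in range(200)
]


def lawyer_codes(record):
    """Simulated lead lawyer: relevant if the record mentions a breach."""
    return "breach" in record.split()


model, remaining, rounds = train_until_converged(corpus, lawyer_codes)
auto_coded = {record: model.predict(record) for record in remaining}
```

The toy corpus is far cleaner than any real collection, so the model converges almost immediately. On real records, expect many more rounds and a convergence test that is checked against a held-out sample rather than a single batch.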

2. Knowledge Gathering

The second approach can be compared to the first few days of a document review. During this time, reviewers start examining records and become more familiar with their specific content and flavour. After a week or so, they are able to identify relevant documents much more quickly and accurately.

The equivalent predictive coding process involves the computer, working in the background, watching the reviewers as they go through their paces. As each record is coded, the computer examines it and compares it to all other coded records. When the next record is given to a reviewer to code, the computer predicts how the reviewer will code it. Once the computer’s predictions start to match the reviewers’ coding consistently, the computer can start coding records itself.
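
A rough Python sketch of this “watching in the background” approach follows. It assumes a simple perceptron-style word model as a stand-in for a real product's learner, and the review stream, the 20-record window, and the 90% agreement target are hypothetical values picked for the example.

```python
from collections import deque


class OnlineWordModel:
    """Perceptron-style learner: one weight per word, nudged after each miss."""

    def __init__(self):
        self.weights = {}

    def predict(self, text):
        words = set(text.lower().split())
        return sum(self.weights.get(word, 0.0) for word in words) > 0

    def observe(self, text, relevant):
        # Adjust the weights only when the prediction disagreed with the reviewer.
        if self.predict(text) != relevant:
            step = 1.0 if relevant else -1.0
            for word in set(text.lower().split()):
                self.weights[word] = self.weights.get(word, 0.0) + step


def watch_review(stream, window=20, target=0.9):
    """Return how many records were reviewed before the model could take over.

    `stream` yields (record_text, reviewer_code) pairs in review order.
    Returns None if the rolling agreement never reaches the target.
    """
    model = OnlineWordModel()
    recent = deque(maxlen=window)  # rolling record of prediction hits and misses
    for reviewed, (text, code) in enumerate(stream, start=1):
        recent.append(model.predict(text) == code)  # predict before learning
        model.observe(text, code)
        if len(recent) == window and sum(recent) / window >= target:
            return reviewed
    return None


# Hypothetical review stream: every other record concerns an invoice dispute.
stream = [
    (f"invoice dispute escalation {i}", True) if i % 2 == 0
    else (f"team outing schedule {i}", False)
    for i in range(100)
]
takeover_point = watch_review(stream)
```

The rolling window matters because early agreement can be a fluke. The model only takes over once it has matched the reviewers across a sustained run of records.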

These two methods may seem the same, but there are subtle differences. The first method requires a single lawyer who is a subject-matter expert to review records, possibly for a couple of days. The second method relies on less knowledgeable (and thus less expensive) lawyers reviewing records over a longer period of time. The end result should, theoretically, be the same, but the human cost and time involved may be dramatically different.

How It’s Used

When it comes to electronic discovery, Predictive Coding can be used, sometimes in conjunction with other techniques, to accomplish several different aspects of the EDRM, including:

  • Culling/Building a Review Set: In this mode, the system applies its definition to the entire corpus of records and pulls out the records most likely to be relevant. These records can then be subjected to the normal, manual review process. This should be combined with sampling of the records the computer determined to be not relevant, in order to validate the results.

  • Subjective Coding: The predictive coding system examines the subjective coding decisions made by lawyers as they manually review records. When a sufficient number of records have been reviewed, the system will start to make coding suggestions for subsequent records to assist the lawyers in their review.

  • Quality Control: Along the same lines as predictive subjective coding, the system uses the subjective coding decisions made by lawyers to predict how documents should be coded. However, instead of suggesting codes for un-reviewed records, the system applies the predictions to all manually coded records and identifies those records where its predictions and the actual coding diverge. This enables quality assurance inspectors to zero in on records that may not be coded correctly.

  • Prioritization of Records for Review: Predictive coding can also be used to prioritize records in a review. Once a model is defined, the system can apply this to all records, ranking them in order from most to least relevant. The project manager can then sort all records and assign those that are likely to be most relevant to be reviewed first.
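
Several of these uses come down to different ways of reading the same relevance scores. The sketch below assumes a model has already assigned each record a score between 0 and 1. The record IDs, scores, manual codes, and the 0.5 cutoff are all hypothetical.

```python
import random

# Hypothetical model output: record ID -> predicted relevance score.
scores = {"DOC-001": 0.92, "DOC-002": 0.15, "DOC-003": 0.67,
          "DOC-004": 0.08, "DOC-005": 0.81}

# Hypothetical manual coding decisions for the records reviewed so far.
manual_codes = {"DOC-001": True, "DOC-002": True,
                "DOC-004": False, "DOC-005": False}


def prioritize(scores):
    """Prioritization: rank records from most to least likely relevant."""
    return sorted(scores, key=scores.get, reverse=True)


def build_review_set(scores, cutoff=0.5):
    """Culling: keep only the records scoring at or above the cutoff."""
    return [doc for doc in prioritize(scores) if scores[doc] >= cutoff]


def validation_sample(scores, cutoff=0.5, size=1, seed=0):
    """Draw a random sample of culled records for a human to spot-check."""
    culled = sorted(doc for doc in scores if scores[doc] < cutoff)
    return random.Random(seed).sample(culled, min(size, len(culled)))


def qc_candidates(scores, manual_codes, cutoff=0.5):
    """Quality control: coded records where model and reviewer disagree."""
    return sorted(doc for doc, coded in manual_codes.items()
                  if (scores[doc] >= cutoff) != coded)


review_order = prioritize(scores)
review_set = build_review_set(scores)
spot_check = validation_sample(scores)
to_recheck = qc_candidates(scores, manual_codes)
```

Here `to_recheck` contains DOC-002 and DOC-005, the two records where the reviewer's code and the model's prediction point in opposite directions. Either the reviewer or the model could be wrong, so these are candidates for a second look rather than confirmed errors.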

What To Watch Out For

While it appears to hold the promise of alleviating the need to review whole masses of records in order to find the relevant few, Predictive Coding may not work in all situations. Alternatively, it may need to be combined with other techniques in order to deliver valid results.

All predictive coding methods use a combination of properties to construct the model that defines a relevant record. These properties are usually drawn from the textual content of a record. As with all linguistic analysis systems, if the records contain unusual linguistic constructs, the computer may get confused.

  • An example is when records contain multiple languages within the same document, or when similar topics are discussed in different languages within the same set of records. In both cases, an example record may be coded as relevant due to the content in one language. A bilingual human reviewer would be able to code another record with the same content in a different language as relevant, but a computer would not be able to determine this unless enough multi-lingual records are used to build the model.

  • Similarly, a collection of mostly non-document records, such as spreadsheets or databases, may not yield enough cohesive linguistic content to allow the system to build its model.

  • Finally, if a corpus contains very few relevant records, there just may not be enough information available to construct the definition.

Regardless of the method of Predictive Coding used, some initial analysis of the record set should be carried out in order to determine if additional techniques are required to offset the limitations in the collection that would otherwise defeat the process. These could involve pre-review culling based on metadata or keywords, dividing the records into subsets, or using keywords to pull enough example records to train the system.
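
As a rough illustration of that initial analysis, the Python sketch below checks what share of a collection has too little text to learn from, and uses keywords to pull candidate training examples. The record layout, the five-word threshold, and the keywords are assumptions made for the example, not features of any particular product.

```python
def sparse_share(records, min_words=5):
    """Fraction of records with too little text to support a linguistic model."""
    sparse = [r for r in records if len(r["text"].split()) < min_words]
    return len(sparse) / len(records)


def seed_examples(records, keywords, limit=50):
    """Use keywords to pull candidate training examples from the collection."""
    wanted = {keyword.lower() for keyword in keywords}
    hits = [r for r in records if wanted & set(r["text"].lower().split())]
    return hits[:limit]


# Hypothetical collection: two prose records, two with almost no text.
records = [
    {"id": 1, "text": "Please review the attached supply agreement before Friday"},
    {"id": 2, "text": "Q3 1200 1350 1410"},
    {"id": 3, "text": "Notice of breach under the supply agreement"},
    {"id": 4, "text": "ok"},
]

share = sparse_share(records)
seeds = seed_examples(records, ["breach", "agreement"])
seed_ids = [r["id"] for r in seeds]
```

A high sparse share suggests the collection may need to be divided into subsets or handled with other techniques. The keyword hits still need a human to code them before they can serve as training examples.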


Document review is commonly recognized as the most time-consuming and expensive stage of the e-discovery process. Lawyers are often tasked with reviewing large volumes of electronic data under strict time constraints while ensuring that relevant documents are identified in the most efficient manner. The use of technology such as Predictive Coding can assist in bringing down the high cost involved in dealing with today’s information overload. However, its successful use depends on appreciating both its strengths and weaknesses, and applying the technology with both in mind.

eDiscoveryJournal Contributor – Chuck Rothman
