Migrated from eDJGroupInc.com. Author: Barry Murphy. Published: 2012-01-06 07:17:43. Format, images and links may no longer function correctly.

One of eDJ’s predictions for 2012 is that PC-TAR (predictive coding-technology assisted review) goes mainstream. Instead of just sitting and waiting to see what happens with PC-TAR, we are actively researching it. Watch for the launch of an eDJ survey on the topic next week (and the chance to win yet another prize from eDJ for participating in our research). Jason Velasco did a call-out for anyone using PC-TAR to speak with us and we’ve been able to talk to actual practitioners. I want to quickly share some of what we are learning and call for anyone else trying PC-TAR to email us and share your story.

In addition to the survey eDJ will be launching next week, we have had discussions with folks who have put PC-TAR projects in place. One example is a firm that tested PC-TAR to see whether the upfront cost of the software would pay off in downstream cost savings. After all, saving money is the goal of using these solutions. This firm found that PC-TAR has some benefits, but some big limitations as well.

For PC-TAR to work, an organization needs to “seed” the solution, that is, give the solution a set of documents that are relevant, privileged, etc.  Then, the solution can learn from the seed set and apply what it learns to a larger collection of documents.  The firm in question had 500K records.  Using keyword filters, they got that document set down to 100K records and ended up with about 3K records that were relevant and produced.

Starting over with PC-TAR, the firm sought to see if they could do better. They took the 500K records, put them through the PC-TAR solution, had the lead lawyer spend a couple of days training the system, and then looked at the ranking for each of the 500K records. Right off the bat, one of the problems was finding enough relevant records, because the relevance rate was so low. Next, the firm used the 100K set of records from keyword filtering and was able to find a large enough percentage of relevant records. What they found was that about 50% of the 3K produced records were ranked over 80 (on a scale of 1 to 100), but the other 50% of the produced set were ranked low. The reason was that the relevance rate was so low that the PC-TAR solution did not have enough information to rank those records higher. It was iffy whether those other 50% were relevant enough, but the lawyers wanted them included. Thus, it was clear PC-TAR could be a helpful solution, but only if the seed document set has a high enough percentage of responsive documents.
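
To put rough numbers on the low-relevance problem, here is a minimal sketch in Python. It rests on assumptions the firm did not state: that the roughly 3K produced records are the only relevant ones, and that all of them survive keyword filtering. The function name `relevance_rate` and the 500-document training sample are hypothetical choices for illustration only.

```python
def relevance_rate(relevant: int, total: int) -> float:
    """Fraction of a document set that is relevant."""
    if total <= 0:
        raise ValueError("total must be positive")
    return relevant / total


# Approximate figures from the first test described above.
full_corpus = 500_000
keyword_filtered = 100_000
produced = 3_000  # assumed here to be all of the relevant records

rate_full = relevance_rate(produced, full_corpus)
rate_filtered = relevance_rate(produced, keyword_filtered)

# Hypothetical: how many relevant records a reviewer would expect to see
# in a random training sample drawn from each set.
sample_size = 500
expected_full = rate_full * sample_size
expected_filtered = rate_filtered * sample_size

print(f"Full corpus:      {rate_full:.1%} relevant, ~{expected_full:.0f} per {sample_size} sampled")
print(f"Keyword-filtered: {rate_filtered:.1%} relevant, ~{expected_filtered:.0f} per {sample_size} sampled")
# Full corpus:      0.6% relevant, ~3 per 500 sampled
# Keyword-filtered: 3.0% relevant, ~15 per 500 sampled
```

Under those assumptions, a trainer sampling from the full 500K set would see only a handful of relevant examples, which is consistent with the firm's trouble finding enough relevant records to train on.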

The firm tested PC-TAR again. In the next case, there were 80K records in the total set from a user’s workstation; keyword search and file type filters got it down to 21K records, and after review, 800 records were found to be relevant. To test whether the PC-TAR solution could return a good set of data, they used the 21K records to seed it. They told the solution that the 800 records were relevant and the rest of the 21K were not, and then tested it against the remaining roughly 60K records in the corpus. What they found was that some records found to be relevant were ranked low by the PC-TAR solution, and some marked non-relevant were ranked high. Technically, the latter were relevant, but they were almost exact duplicates with different hashes, like a Word file with a tiny difference in metadata. When the manual review had been done, one of the instructions to the reviewers was to mark only one copy as relevant (because they were using near-duplicate identification). The mismatch likely happened because the near-duplicate documents were marked non-relevant when seeding the PC-TAR solution, and that marking confused the solution.
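
One way to avoid feeding a system contradictory labels like these is to copy a "relevant" mark to every near-duplicate before seeding. The sketch below is only an illustration of that idea, not what the firm or any PC-TAR product actually does. The document IDs, the 3-word shingles, and the 0.8 similarity threshold are all hypothetical, and comparing every pair of documents is only practical for small sets.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word runs) for a document."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def propagate_relevance(docs: dict, labels: dict, threshold: float = 0.8) -> dict:
    """Return new labels in which every near-duplicate of a relevant
    document is also marked relevant."""
    sigs = {doc_id: shingles(text) for doc_id, text in docs.items()}

    # Union-find: group documents whose similarity meets the threshold.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)

    relevant_groups = {find(d) for d in docs if labels.get(d, False)}
    return {d: find(d) in relevant_groups for d in docs}


# Hypothetical seed set: reviewers marked only one copy as relevant.
docs = {
    "A": "quarterly forecast shows revenue shortfall in the northeast region for the third quarter",
    "A_copy": "quarterly forecast shows revenue shortfall in the northeast region for the third quarter draft",
    "B": "lunch menu for the cafeteria this week includes soup and sandwiches",
}
labels = {"A": True, "A_copy": False, "B": False}

fixed = propagate_relevance(docs, labels)
print(fixed)  # {'A': True, 'A_copy': True, 'B': False}
```

With labels cleaned up this way, the seed set no longer tells the system that two almost identical documents belong in opposite categories.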

The lessons learned are plentiful, though not necessarily absolute, since PC-TAR is new. Clearly, PC-TAR holds promise. Something that could make PC-TAR work better as a culling tool would be more targeted collections (making the overall corpus smaller, and thereby making the relevance rates higher). Too often, lawyers are conservative and cast a wide collection net. Also, if a vendor does a collection, there is no incentive to minimize and target it. In addition, the makeup of the document collection has a big bearing on whether results will be good or not. If data sets are bilingual, for example, text analysis may not work very well. Also, there is no way of doing text analysis on things like Tweets because of the abbreviations, short words, and so on. There will have to be different ways of doing data analysis on this new type of data. From a user perspective, that means using different techniques and toolsets for PC-TAR, which simply is not possible right now given the mix of products and competition in the market.

PC-TAR is evolving rapidly, though. eDJ will be conducting our survey, interviewing users in depth, and interviewing providers about techniques and technologies. To get involved in the research, just shoot us an email and we will be in touch shortly. We will also be using the #PCTAR Twitter hashtag to track the conversation. We are interested in comments, so please feel free to post below or shoot us an email if you have any PC-TAR stories you want to share with us.


