Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2011-09-08 08:27:15. Format, images and links may no longer function correctly.

A recent client discussion reminded me of my earliest attempts at rule-based categorization and the hard lessons of that experiment. Back in 2000, my general counsel (GC) asked if it was possible to find and segregate all potentially privileged emails out of the hundreds of millions that we had to produce to many different parties. I took a couple hundred thousand emails and spent a week crafting search criteria/rules and doing iterative sampling checks. I worked with our top paralegals and our long-standing firms to incorporate everyone’s input. I segregated approximately 18% of that collection as potentially privileged and put the remainder in the review queue without telling my contract attorneys that it had been cleansed. I felt pretty good about the exercise and knew that my rules were overly inclusive, but the point was to determine the risk of privilege waiver if we gave all the regulators remote access to the ‘cleansed’ master collection while my review teams worked on the 15-18% at issue. In the middle of the review, my GC ‘volunteered’ to man a review station for a couple of hours to see for himself how it worked. After all, it was his question that kicked off this experiment. What do you think that he found?

The first thing that popped up was a sensitive trade secret that needed protective status. That distracted him with questions about how we could craft rules/searches to find all the confidential/trade secret content. Then he stumbled on an email transmitting a pricing spreadsheet to a director from a trading desk. On the surface it appeared to be exactly what the government had asked for. In reality, it was a spreadsheet that my GC had personally requested as part of our own internal investigations. That made it attorney work product, but the email had no mention that the request originated from the legal department. The risk of inadvertent waiver was just too great in that circumstance to rely entirely on rules and searches. However, we were able to raise the review rate and accuracy substantially by grouping emails prior to review. All kinds of best practices help deal with the privileged, confidential and trade secret ESI commingled in your corporate data landfills. But those only count after you put them into effect, and they generally rely on users actually following protocol. They do not cover the historical collections and hoarder repositories. That brings us back around to enterprise categorization systems and what value they might offer today.

There have been many systems that have promised automated, rule-based retention and topic categorization throughout the last decade. Few have managed to deliver acceptable quality and consistency without a major investment in skilled, dedicated personnel to define and constantly tune the rule filters. That does not mean that automated categorization cannot offer value and play a vital role in retention management. A surprisingly large portion of communication traffic can be categorized by rules or by one of the new ‘smarter’ systems that dynamically learn from user designations. It is the critical minority of oblique language or indirect communications that these systems tend to struggle with. When used carefully, auto-categorization rules can lift much of the user burden by pre-categorizing the majority of communications, including the often neglected Sent/Deleted items.
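To make the rule-filter idea concrete, here is a minimal sketch of the kind of pattern-based pre-categorization described above. The rule names and patterns are hypothetical illustrations, not any vendor's actual criteria; real deployments run hundreds of iteratively tuned and sampled rules.

```python
import re

# Hypothetical rule set: each rule name maps to a compiled pattern.
# Real filters are far larger and are tuned through iterative sampling.
RULES = {
    "potentially_privileged": re.compile(
        r"\b(attorney.client|privileged|work product|legal hold)\b", re.I),
    "corporate_spam": re.compile(
        r"\b(cafeteria menu|newsletter|unsubscribe)\b", re.I),
}

def categorize(message_text):
    """Return every rule name whose pattern matches; an empty list
    means the message falls into the exception pool."""
    return [name for name, pattern in RULES.items()
            if pattern.search(message_text)]
```

Messages that match no rule are exactly the ‘critical minority’ the paragraph describes: oblique or indirect language that pattern matching cannot see.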

The first reality check is that the majority of ESI does not fall into just one bucket. Well-developed systems can apply multiple classifications and have a reasonably sophisticated set of ‘meta rules’ that will resolve the final disposition action. You should expect a decent percentage to match no rules or to carry conflicting/ambiguous categories. In a perfect world, humans would resolve these ‘exceptions’ so that 100% of your ESI is properly categorized. After profiling and sampling the exception pool, most of my clients decide to just go with the lowest risk default retention/handling action instead.
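One common shape for such ‘meta rules’ is a simple precedence list. The sketch below is an illustrative assumption (the category names, actions, and ordering are hypothetical): an item carrying multiple category tags resolves to the action of the highest-precedence tag, and anything with no recognized tag falls to the lowest-risk default, as the paragraph describes.

```python
# Hypothetical precedence order: earlier entries win when an item
# carries multiple (possibly conflicting) category tags.
PRECEDENCE = ["legal_hold", "trade_secret", "regulatory_filing", "corporate_spam"]
ACTIONS = {
    "legal_hold": "preserve",
    "trade_secret": "restrict_access",
    "regulatory_filing": "retain_permanently",
    "corporate_spam": "destroy",
}
# Lowest-risk default for the exception pool (no tags, or none recognized).
DEFAULT_ACTION = "retain"

def resolve(tags):
    """Map zero or more category tags to one final disposition action."""
    for category in PRECEDENCE:
        if category in tags:
            return ACTIONS[category]
    return DEFAULT_ACTION
```

Note that the destructive action sits last in the precedence list, so a conflict between ‘destroy’ and anything else always resolves toward retention.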

Despite any marketing claims, all of these systems seem to require a heavy investment to define categories and fine-tune the criteria. I am skeptically interested in how predictive coding systems can be applied to support enterprise data management. The idea is that the system watches user decisions and then dynamically updates the automated criteria definitions. The current systems generally rely on manual criteria definitions and must be periodically monitored and updated to keep the rules relevant and defensible.
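The ‘watch user decisions’ idea can be illustrated with a toy learner. This is not any vendor's algorithm, just a hypothetical sketch: it tallies the vocabulary reviewers associate with each category and suggests the best-overlapping category for new messages.

```python
from collections import Counter, defaultdict

class DesignationLearner:
    """Toy sketch of learning category criteria from reviewer designations."""

    def __init__(self):
        self.vocab = defaultdict(Counter)

    def observe(self, text, category):
        """Record one reviewer's designation of a message."""
        self.vocab[category].update(text.lower().split())

    def suggest(self, text):
        """Suggest the category with the best vocabulary overlap,
        or None when nothing matches (an exception for human review)."""
        words = text.lower().split()
        scores = {cat: sum(counts[w] for w in words)
                  for cat, counts in self.vocab.items()}
        best = max(scores, key=scores.get, default=None)
        return best if best is not None and scores[best] > 0 else None
```

The monitoring burden the paragraph mentions shows up here too: low-confidence suggestions still have to be routed back to reviewers, and the learned vocabulary drifts as business language changes.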

So where do you start? I generally like starting with the simplest domain/direction-based rules for automatic destruction (think corporate spam) and perpetual retention (such as regulatory filings). Remember that category tags have multiple uses, and many of them can be overly broad without creating any risks or costs. You should separate the list of actions to be taken from the actual categories and resolution rules. The sheer volume and diversity of enterprise content creation make perfection an impossible goal, but you can see immediate cost/storage savings through managed destruction/expiry and segregation of high-priority ESI.
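A domain/direction rule of the kind suggested above can be as simple as a sender/recipient domain check that runs before any content analysis. The domains and action names below are hypothetical placeholders:

```python
# Hypothetical domain/direction tier: the simplest, lowest-risk rules,
# applied before any content-based categorization runs.
DESTROY_SENDER_DOMAINS = {"marketing.example.com"}   # corporate spam
RETAIN_RECIPIENT_DOMAINS = {"sec.gov", "ferc.gov"}   # regulatory filings

def domain_rule(sender, recipients):
    """Return 'destroy', 'retain_permanently', or None to fall through
    to the content-based categories and resolution rules."""
    if sender.split("@")[-1] in DESTROY_SENDER_DOMAINS:
        return "destroy"
    if any(r.split("@")[-1] in RETAIN_RECIPIENT_DOMAINS for r in recipients):
        return "retain_permanently"
    return None
```

Keeping the actions (`destroy`, `retain_permanently`) in a separate table from the rules themselves, as the paragraph advises, lets you change disposition policy without retuning the categories.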

There is no magic black box that will automatically classify and manage your ESI. Every system will require some level of dedicated management depending on how far you want to push the classification envelope. My recommendation is to identify the largest, lowest-risk classifications and to grow your information governance strategy/resources based on real cost and effort savings.

[Image: Example Classifications]

Do you have auto classification running on an enterprise system? I always love success and failure stories, so drop me a note at Greg@eDiscoveryJournal.com.
