Migrated from eDJGroupInc.com. Author: Barry Murphy. Published: 2011-03-03 04:36:05. Format, images and links may no longer function correctly.

After watching the IBM Watson debut on Jeopardy, I got to thinking more and more about computer intelligence. I’ve been watching the content analytics space for a long time and have seen up close both the potential for analytics to change the world and the skepticism with which many people view analytics. In the early 2000s, as an analyst with Forrester Research, I met with a number of vendors that offered expertise location. These tools would identify communication patterns (e.g., who emails whom, who reads the documents that Joe Smith authors) so that employees could find subject matter experts quickly. For example, if Joe Smith was on vacation, the tool could point an employee to the person who reads Joe’s documents most frequently, on the assumption that this reader has a similar skill set. A decade ago, that technology was interesting, but it wasn’t getting companies to spend a lot of money on software solutions.
When I began covering eDiscovery, it seemed to me that there was finally a market in which content analytics could make a real impact and find some traction. This is especially true in Early Case Assessment (ECA). Everyone defines ECA a bit differently, and we’ve even done a report on it (for sale at $199) to identify how organizations can define it for themselves and implement initiatives. For the most part, the goal is to have applications analyze, prioritize, and even tag documents before a human looks at them, and to provide content and process analytics that help teams make faster case decisions. Let’s take a look at some of the different types of content analytics that can make ECA a powerful tool.
- Keyword culling – simple and straightforward, keyword culling allows users to take negotiated keywords and phrases and eliminate any information not within the search results from the potentially responsive data set.
- Metadata culling – allows users to eliminate information from a potentially responsive data set based on information about the information (metadata). For example, any information outside of a certain date range can be eliminated, or data that does not match a certain content type (e.g. Word, Excel, .msg) can be excluded.
- Faceted search – the ability to use metadata filters to pivot further on metadata fields, for example to filter on sender domain and get rid of Amazon.com emails.
- Near-deduplication – unlike deduplication, which eliminates documents that are exact duplicates of others, near-deduplication allows users to eliminate or otherwise group documents that are materially similar, but not bit-level exact matches. Users who eliminate near-dupes must be careful, however, to repopulate them when producing the data set. A more common scenario is to group near-dupes together so that the same reviewer looks at all the similar documents, which can make the review process more efficient.
- Discussion threading – a feature used to keep conversations together. In the past, this has applied mostly to email, the most common collaboration mechanism. But it is increasingly important for other collaborative tools such as bulletin boards, newsgroups, or social media. Threading aids reviewers by visually grouping messages, typically in a hierarchy by topic. A set of messages grouped in this way is called a topic thread.
- Concept clustering – by grouping together potentially related documents, users have the ability to make the review process faster and/or more efficient. Concept clustering can take on various forms – grouping related documents into concept folders, creating heat maps of various concepts that users can click into – but is really about using machine algorithms to suggest groupings of content. It’s also possible to use concepts to include or exclude whole groups of content from downstream activities. Beyond the ability to make review faster, it can allow users to optimize review resources. For example, high-cost associates could review the “hot” content while low-cost review resources look at the rest of the data set.
- Predictive tagging / coding – Predictive tagging, or coding, combines analytics with human review: users review and code a set of collected documents (gathered by concept searching, phrase identification, keyword searching, metadata filters, etc.) for factors such as responsiveness, issue, or privilege. Applications can then learn to tag similar documents based on that first set of human tagging. This process can reduce the total number of documents reviewed and therefore has the potential for cost savings. In time, predictive tagging could become more common than traditional linear review, but the legal industry is typically conservative and slow to adopt new technology and processes. If saving review costs is a priority, though, predictive coding is an option to investigate now. Note that it’s important to sample the non-reviewed documents within data sets so that the accuracy of the process can be estimated with a stated level of confidence. Also, predictive tagging is not an all-or-nothing approach. Many solutions provide varying confidence levels and combine machine and human intelligence so that organizations can determine their comfort with some level of computer prediction.
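To make the first few techniques in the list above more concrete, here is a minimal Python sketch of keyword culling, metadata culling, and near-dupe grouping chained together. Everything in it is an assumption for illustration: the `Doc` fields, the sample documents, the three-word shingles, the Jaccard similarity measure, and the 0.4 threshold. Commercial ECA tools use their own (usually far more sophisticated) methods, so treat this as a way to see the idea rather than as how any product works.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    sent: date
    file_type: str  # e.g. "msg", "docx", "xlsx"
    text: str


def keyword_cull(docs, terms):
    """Keep only documents containing at least one negotiated term."""
    lowered = [t.lower() for t in terms]
    return [d for d in docs if any(t in d.text.lower() for t in lowered)]


def metadata_cull(docs, start, end, allowed_types):
    """Keep only documents inside the date range and of an allowed type."""
    return [d for d in docs
            if start <= d.sent <= end and d.file_type in allowed_types]


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_near_dupes(docs, threshold=0.4):
    """Greedy grouping: a doc joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for d in docs:
        s = shingles(d.text)
        for g in groups:
            if jaccard(s, g["shingles"]) >= threshold:
                g["members"].append(d.doc_id)
                break
        else:
            groups.append({"shingles": s, "members": [d.doc_id]})
    return [g["members"] for g in groups]


# Hypothetical sample collection.
docs = [
    Doc("A1", date(2010, 5, 3), "msg",
        "Please review the draft merger agreement before the Friday board meeting"),
    Doc("A2", date(2010, 5, 4), "msg",
        "Please review the draft merger agreement before the Monday board meeting"),
    Doc("A3", date(2010, 6, 1), "docx",
        "Quarterly merger integration budget and staffing plan"),
    Doc("A4", date(2008, 1, 15), "msg", "Old merger rumor from a newsletter"),
    Doc("A5", date(2010, 5, 10), "msg", "Lunch order for the team offsite"),
]

kept = keyword_cull(docs, ["merger"])
kept = metadata_cull(kept, date(2010, 1, 1), date(2010, 12, 31), {"msg", "docx"})
groups = group_near_dupes(kept)

print([d.doc_id for d in kept])  # ['A1', 'A2', 'A3']
print(groups)                    # [['A1', 'A2'], ['A3']]
```

In this toy run, the keyword filter drops the lunch email, the date filter drops the 2008 newsletter, and the two nearly identical emails end up in one group that could be routed to the same reviewer. Note that the sketch groups near-dupes rather than deleting them, which avoids the repopulation problem mentioned above.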
In addition to content analytics, process analytics drive effective ECA. Process analytics support decision-making by informing organizations about the cost to execute the eDiscovery process from collection through production. For example, companies can make better early decisions when they know metrics such as the average time it takes a specific firm to review a document, what a certain type of document typically costs to review, or how many documents are involved in an average FINRA investigation.
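As a rough illustration of how such metrics feed an early decision, the Python sketch below turns a review speed and a billing rate into an hours-and-cost estimate, before and after culling. The figures (50 documents per hour, $200 per hour, 100,000 documents culled to 30,000) and the names `ReviewMetrics` and `estimate_review` are hypothetical, chosen only to show the arithmetic; real numbers would come from an organization’s own matter history.

```python
from dataclasses import dataclass


@dataclass
class ReviewMetrics:
    docs_per_hour: float  # historical average review speed for a firm
    hourly_rate: float    # reviewer billing rate, in dollars


def estimate_review(doc_count, metrics):
    """Return (hours, cost) to review doc_count documents."""
    hours = doc_count / metrics.docs_per_hour
    return hours, hours * metrics.hourly_rate


# Hypothetical figures for illustration only.
firm = ReviewMetrics(docs_per_hour=50, hourly_rate=200.0)

full_hours, full_cost = estimate_review(100_000, firm)     # before culling
culled_hours, culled_cost = estimate_review(30_000, firm)  # after culling

print(f"Full set:   {full_hours:,.0f} hours, ${full_cost:,.0f}")
print(f"Culled set: {culled_hours:,.0f} hours, ${culled_cost:,.0f}")
print(f"Estimated savings: ${full_cost - culled_cost:,.0f}")
```

With these assumed inputs, the full set works out to 2,000 review hours and $400,000, and the culled set to 600 hours and $120,000. Even a simple estimate like this gives a legal team a number to weigh against the value of the case.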
I still believe that content analytics will have their biggest short-term impact in the eDiscovery market, and eventually gain more mainstream acceptance in other applications. Perhaps the business intelligence (BI) market will come to include both structured data analysis and unstructured content analysis.