Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-04-29 05:37:12. Format, images and links may no longer function correctly.
The management and review of native files (ESI) generally requires the extraction of internal/external metadata and the readable text to be indexed for search. Most types of container or multipart files, such as ZIP or PST containers, must be broken out into individual files for this step and subsequent productions. This is the foundation of what our industry calls Processing. Most counsel, corporate IT and judiciary seem to operate under a presumption of magical perfection in the software and services of specialized eDiscovery providers. Most of these ‘built for purpose’ applications manage to avoid the basic MS Windows issues that drop or alter date fields, but the infinite variables associated with ESI formats and contents make it nearly impossible for any system to automatically get everything right, even if we could agree on what ‘right’ is. Although I had heard about Planet Data’s acquisition of the Cerulean Engine™, my time at the AIIM 360 Expo gave me an opportunity to understand the deep processing experience that accompanied the software.
eDiscovery providers are quick to talk about the quality of their processes and assembled technologies. They are slow to tell clients what they miss or alter during processing, when they even know it. Most major eDiscovery providers rely on outside software utilities and tools that they have wrapped in their proprietary workflow and code. The smaller players usually license appliances or software priced on a per GB basis. Very few processing providers employ real development staff, as they are in the business of selling services rather than software. Planet Data seems determined to break that mold with its continued investment in expanding the Cerulean Engine into a processing platform with web-based reporting and management. I got the chance to dive deep into the typical processing pitfalls seen by Michael Wade, Planet Data’s CTO.
According to Mr. Wade, they see more issues with client email and files that have been ‘reconstituted’ from content management systems, archives and application migrations. These malformation errors frequently pass unnoticed because the item can still be processed and viewed, even if not all of the content or fields are there. Some email extraction tools can drop recipient addresses, truncate fields and alter dates when translating them to or from GMT or Universal Time formats. Finding these kinds of problems requires good reporting to spot systematic data losses, targeted sampling and even manually viewing original files in raw hex/text mode to see what should have been there. These outside comparisons and parallel sample checks give you a chance to spot issues when byte counts on fields and content are out of sync.
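A parallel sample check of this kind can be sketched in a few lines of Python. This is only an illustration, not Planet Data's method: it assumes the raw message is available as RFC 822 text and that the processing tool's output has been loaded into a dictionary with the hypothetical keys `recipients`, `sent_utc` and `subject`. It uses the standard library `email` package to re-read the raw headers and reports dropped recipients, shifted dates and shortened fields.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Illustrative raw message; in practice this would be read from the source file.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com, carol@example.com
Cc: dave@example.com
Date: Thu, 29 Apr 2010 05:37:12 -0500
Subject: Q1 numbers

Body text.
"""


def check_extraction(raw_message, extracted):
    """Return a list of discrepancies between raw headers and extracted fields.

    `extracted` is a hypothetical record from the processing tool with the
    keys "recipients" (list of addresses), "sent_utc" (timezone-aware
    datetime) and "subject" (string).
    """
    msg = message_from_string(raw_message)
    issues = []

    # Recipients: every address in the raw To/Cc headers should survive.
    raw_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    raw_rcpts = {addr.lower() for _, addr in getaddresses(raw_headers)}
    got_rcpts = {addr.lower() for addr in extracted["recipients"]}
    dropped = sorted(raw_rcpts - got_rcpts)
    if dropped:
        issues.append("dropped recipients: " + ", ".join(dropped))

    # Sent date: normalize the raw header to UTC and compare the instants.
    raw_sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    if extracted["sent_utc"] != raw_sent:
        issues.append(
            "date mismatch: raw %s, extracted %s"
            % (raw_sent.isoformat(), extracted["sent_utc"].isoformat())
        )

    # Field length: a shorter extracted subject suggests truncation.
    raw_subject = msg["Subject"] or ""
    if len(extracted["subject"]) < len(raw_subject):
        issues.append(
            "subject shortened: %d of %d characters"
            % (len(extracted["subject"]), len(raw_subject))
        )

    return issues


if __name__ == "__main__":
    # A made-up extraction result that lost two recipients, treated the
    # local send time as if it were already UTC, and truncated the subject.
    suspect = {
        "recipients": ["bob@example.com"],
        "sent_utc": datetime(2010, 4, 29, 5, 37, 12, tzinfo=timezone.utc),
        "subject": "Q1 num",
    }
    for issue in check_extraction(RAW_MESSAGE, suspect):
        print(issue)
```

Run across a targeted sample rather than a single message, a check like this turns a one-off discrepancy into the kind of systematic pattern that reporting can surface. It only covers the fields it was told to compare, so it supplements manual inspection of the raw file rather than replacing it.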
In my experience with validation testing systems, I have found that most applications only give you errors when a file completely fails to process. Some applications will throw warnings for partial extractions or known bugs, but they can only warn you when they know what to look for. You just cannot rely on any software to tell you when it encounters an issue that it was not designed to spot.
Search indexes rely on file viewers like Oracle Stellent Outside In, Autonomy KeyView and iFilters to extract text from files. The high throughput of processing engines can cause memory issues and corruptions that can drop text without warning. Some viewers cut off text in documents with custom margins or completely skip headers and footers. Opening a file in its original application and saving it as text can give you a quick character count comparison, although Mr. Wade has a whole list of things that will throw this test off, such as text boxes, revisions and the text format that you are exporting to. It can serve as a quick periodic check to spot corruption, but it has limitations.
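The character count comparison can be sketched roughly as follows, in Python. This is a minimal illustration under stated assumptions: the reference text has already been saved from the original application, the engine's extracted text is available as a string, and the function names and the 0.95 threshold are hypothetical choices rather than anything from a particular product. Whitespace is ignored because line breaks and spacing vary with the export format.

```python
def visible_chars(text):
    """Count non-whitespace characters, since spacing varies by export format."""
    return sum(1 for ch in text if not ch.isspace())


def coverage(reference_text, extracted_text):
    """Ratio of extracted to reference visible characters (1.0 means equal)."""
    ref_count = visible_chars(reference_text)
    if ref_count == 0:
        return 1.0  # nothing to compare against
    return visible_chars(extracted_text) / ref_count


def flag_short_extractions(samples, min_ratio=0.95):
    """Return (doc_id, ratio) for samples whose extracted text looks short.

    `samples` is an iterable of (doc_id, reference_text, extracted_text).
    The 0.95 threshold is an arbitrary starting point, not a standard.
    """
    flagged = []
    for doc_id, reference_text, extracted_text in samples:
        ratio = coverage(reference_text, extracted_text)
        if ratio < min_ratio:
            flagged.append((doc_id, round(ratio, 3)))
    return flagged


if __name__ == "__main__":
    # Made-up sample: the second extraction lost its header and footer.
    reference = "Header line\nBody of the memo.\nFooter line"
    samples = [
        ("DOC-001", reference, "Header line Body of the memo. Footer line"),
        ("DOC-002", reference, "Body of the memo."),
    ]
    print(flag_short_extractions(samples))
```

A low ratio is a prompt to open the file and look, not proof of a defect. Text boxes, tracked revisions and the export format can all move the counts in either direction, so the threshold needs tuning per file type.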
Problems like these will exist in any sufficiently diverse collection. You do not have to be a technical guru to process ESI, but software cannot substitute for experienced, reasonable diligence. The cost of in-sourcing corporate eDiscovery is much more than a simple Return on Investment (ROI) calculation. Corporations should take a close look at the liability they are purchasing along with their new discovery technology.