Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-03-08 09:47:27. Format, images and links may no longer function correctly.
When we talk about metadata for native ESI, we are usually concerned with the file system fields that the Operating System (OS) keeps outside the file itself, in structures such as the File Allocation Table (FAT) directory entries or the NTFS Master File Table. Different file systems support a wide variety of fields, such as different dates, attributes, permissions and file name formats (long vs. short). Because these fields are not usually stored within the actual file, they are very vulnerable to alteration or complete loss when items are read or copied. Forensic collection is focused on preserving this ‘envelope’ information so that evidence can be authenticated and its context reconstructed in court. That is only half of the metadata story. Microsoft Office and other programs retain non-displayed information within the header and body of many common file types, especially since the adoption of the XML-based Office 2007 file formats.
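To make the fragility of the ‘envelope’ fields concrete, here is a minimal sketch in Python (standard library only). It backdates a throwaway file, copies it two ways, and reports whether the last-modified date survived. The function names `envelope_metadata` and `demo_copy_effects` are my own illustrations, not part of any forensic tool, and `shutil.copy2` is shown only as a contrast to a plain copy. It is not a substitute for a forensically sound collection.

```python
import os
import shutil
import tempfile
import time


def envelope_metadata(path):
    """Return a few file system ('envelope') fields exposed by os.stat."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        "modified": st.st_mtime,
        "accessed": st.st_atime,
        "mode": st.st_mode,
    }


def demo_copy_effects():
    """Copy a backdated file two ways; report whether the modified time survived."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        with open(original, "w") as fh:
            fh.write("sample evidence")

        # Backdate the original by about a year so a fresh copy stands out.
        a_year_ago = time.time() - 365 * 86400
        os.utime(original, (a_year_ago, a_year_ago))

        # shutil.copy copies content and permission bits only.
        plain = shutil.copy(original, os.path.join(tmp, "plain_copy.txt"))
        # shutil.copy2 additionally tries to carry the timestamps across.
        preserved = shutil.copy2(original, os.path.join(tmp, "preserved_copy.txt"))

        source_mtime = envelope_metadata(original)["modified"]

        def kept(path):
            # Allow 2 seconds of slack for coarse file system timestamps.
            return abs(envelope_metadata(path)["modified"] - source_mtime) < 2

        return {"plain_copy_kept_mtime": kept(plain), "copy2_kept_mtime": kept(preserved)}


if __name__ == "__main__":
    print(demo_copy_effects())
```

The plain copy takes on the time of the copy operation, which is exactly the kind of silent alteration that a defensible collection process is meant to prevent.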
Because many counsel have produced native files as ‘paper equivalent’ TIFF images with associated text, they have managed to avoid dealing with this unseen text for years. But plaintiffs are getting smarter about their requests, and software is slowly becoming able to extract this information. That puts reviewers in a bad place, where they cannot see text that ends up being produced. No one expects a reviewer to open each file in the original application and then have the expertise to dig through all of the potential hiding places looking for text that is not automatically displayed.
So what kind of text are we talking about and why is it not visible? If you have Word 2007, try this trick. Click the Office Button, then Prepare, then Inspect Document. You should see something like this:
[Screenshot of the Word 2007 Document Inspector dialog; image not available]
Microsoft has seen fit to enable users to inspect and purge all these kinds of hidden text. Can you spell S-P-O-L-I-A-T-I-O-N? I thought you could. But is this stuff really important? Every counsel is familiar with the potential for inadvertent production of privileged Comments and Track Changes, especially on contract revision language. What they might not realize is that most search engines may not fully extract and index this hidden text from a variety of file formats. Until I started performing detailed validation testing on different discovery applications using a specially created set of test files, it was not apparent how much was being missed. If you want the deep dive, get the academic paper that I wrote on this for the DESI III Global E-Discovery/E-Disclosure Workshop.
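For readers who want to see where Comments and Track Changes actually live, the following is a minimal sketch, assuming Python 3 and a Word 2007 or later .docx file, which is a ZIP package of XML parts. It reads `word/comments.xml` and the insertion and deletion markup in `word/document.xml`. The function name `hidden_word_text` is hypothetical, and the sketch deliberately ignores other hiding places such as headers, footnotes and text formatted as hidden.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx parts.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _joined_text(element, tag):
    """Concatenate the text of every descendant element with the given tag."""
    return "".join(node.text or "" for node in element.iter(W + tag))


def hidden_word_text(docx):
    """Collect (author, text) pairs for comments, insertions and deletions.

    `docx` may be a path or a binary file-like object.
    """
    found = {"comments": [], "insertions": [], "deletions": []}
    with zipfile.ZipFile(docx) as package:
        # Comments are stored in their own part, which may be absent.
        if "word/comments.xml" in package.namelist():
            root = ET.fromstring(package.read("word/comments.xml"))
            for comment in root.iter(W + "comment"):
                found["comments"].append(
                    (comment.get(W + "author"), _joined_text(comment, "t"))
                )

        body = ET.fromstring(package.read("word/document.xml"))
        # Tracked insertions wrap ordinary runs of text.
        for inserted in body.iter(W + "ins"):
            text = _joined_text(inserted, "t")
            if text:  # skip markers that carry no text of their own
                found["insertions"].append((inserted.get(W + "author"), text))
        # Tracked deletions keep the removed words in delText elements.
        for deleted in body.iter(W + "del"):
            text = _joined_text(deleted, "delText")
            if text:
                found["deletions"].append((deleted.get(W + "author"), text))
    return found
```

Running something like this over a known test file and comparing the output to what your review platform has indexed is a quick way to see whether the deleted language is searchable at all.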
The main point here is that most search engines use Stellent’s Outside In, Autonomy’s KeyView or Microsoft IFilters to extract the text that is incorporated into the index. The problem is that the first two applications were designed as file viewers and not as full text extraction tools. This means that they extract text as it would appear if the file were opened. That leaves out Track Changes, Excel formulas, comments, PPT speaker notes and more. Many developers have tweaked the configuration settings to get as much of this as the application allows, but I have found far too many that just skip all of this when indexing native items. If it is not in the index, then it is not going to show up in filters, searches or analytic clustering.
Joe Attorney says, “Good! Then I don’t have to worry about it.” But if the other side demands the original, native ESI, then the hidden text will suddenly reappear in the produced files. If you scrub the files without the informed consent of the requesting party, you can bet that they will manage to come up with email attachments or other versions with intact hidden text and run straight to the judge. Either way, the ‘head-in-the-sand’ approach increases risk for minimal cost savings.
A better approach is to define the fields and text that your system can properly extract, tag and produce before you respond to Interrogatories and go into your Meet-and-Confer. That will allow your counsel to evaluate the potential relevance, risk and possible exposure of sensitive information prior to negotiations. Just remember that many pricing models, risk calculations and the like are contained in Excel formulas. Internal metadata is just another wrinkle in your eDiscovery strategy that needs to be ironed out.
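As one way to find out what your own system is missing, here is a minimal sketch, again assuming Python 3 and the Office 2007 XML formats, that pulls Excel formulas out of an .xlsx package and speaker notes out of a .pptx package. The function names `excel_formulas` and `speaker_notes` are hypothetical. The sketch reports formulas by worksheet part name rather than by the tab name a user sees, it skips cells that only reference a shared formula, and the notes text may include placeholder content such as slide numbers.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML and DrawingML main namespaces.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def excel_formulas(xlsx):
    """Map 'worksheet part!cell reference' to the formula text stored in that cell."""
    formulas = {}
    with zipfile.ZipFile(xlsx) as package:
        for name in package.namelist():
            if not re.fullmatch(r"xl/worksheets/[^/]+\.xml", name):
                continue
            root = ET.fromstring(package.read(name))
            for cell in root.iter(S + "c"):
                formula = cell.find(S + "f")
                # The cached result sits in <v>; the formula itself sits in <f>.
                if formula is not None and formula.text:
                    formulas[name + "!" + cell.get("r")] = formula.text
    return formulas


def speaker_notes(pptx):
    """Map each notes slide part to the text runs it contains."""
    notes = {}
    with zipfile.ZipFile(pptx) as package:
        for name in package.namelist():
            if re.fullmatch(r"ppt/notesSlides/[^/]+\.xml", name):
                root = ET.fromstring(package.read(name))
                runs = [t.text for t in root.iter(A + "t") if t.text]
                notes[name] = " ".join(runs)
    return notes
```

If a search for a term that appears only in a formula or a speaker note returns nothing in your review platform, you have identified a gap to raise before the Meet-and-Confer rather than after production.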