Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-11-05 10:40:18. Format, images and links may no longer function correctly.
We all know that Office 2007 and later files use a different file format from your traditional DOC/XLS/PPT files, but I thought that it was worth exploring them with an eye on their potential impact on eDiscovery activities. First we need a simple explanation of what changed from the Office 2003 to the Office 2007 formats. Prior to 2007, Word, Excel and PowerPoint files were each proprietary binary file formats that required the application or a viewer to open. Office 2007 adopted an XML-based file format called Office Open XML that uses a common set of XML files within a compressed Zip container. These Extensible Markup Language (XML) files are simple text files that resemble HTML. The files now have an X or M added to their traditional file extensions: X indicates a standard XML-based file without macros, and M indicates a macro-enabled file that can carry embedded macro content. So DOC, XLS and PPT have become DOCX/DOCM, XLSX/XLSM and PPTX/PPTM. There are many advantages to the open formats, but we will focus on the potential discovery impact.
The Office Open XML formats forced most of the search and processing applications to update their code or even incorporate new viewers. Before, they had a single file to open and extract text from. Now they have multiple files within a Zip container to handle. If you want to see this for yourself, create a DOCX file and then change the extension to ZIP. You can now open that container to see the individual files that determine formatting, content, properties, embedded objects and more. Below is a DOCX file with the Gettysburg Address that I use for testing deduplication functionality. I have extracted all of the XML files from the Zip. The document.xml file contains the actual content text. You can open an XML file with an internet browser, Notepad or any other text editor.
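For readers who would rather script this than rename files by hand, the following is a minimal Python sketch using only the standard library. It lists the files inside a DOCX container and pulls the text runs out of word/document.xml. The function names and the file name in the usage comment are my own illustrative choices, and the sketch assumes a conventional Word-generated package; it is not how any particular processing tool does it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements found in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(docx):
    """Return the name of every file stored in the DOCX Zip container."""
    with zipfile.ZipFile(docx) as zf:
        return zf.namelist()


def extract_body_text(docx):
    """Return the body text as a list with one string per paragraph (w:p).

    Word splits a paragraph into runs, each holding its text in a w:t
    element, so the runs of a paragraph are joined back together here.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


# Usage (hypothetical file name):
#   for name in list_parts("gettysburg.docx"):
#       print(name)
#   print(extract_body_text("gettysburg.docx"))
```

Note that this only reads the main body. Comments, footnotes, headers and the like live in other files within the container and are not returned by this sketch.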
You can crawl around in your own files to see all the different component parts. Below is a diagram that shows the high-level relationships of the XML file components for an Excel workbook. An important point here is that the open XML format makes it possible for applications to read and edit the content without having to use the actual Office application or its Dynamic-Link Library (DLL) files. Because Windows treats Zip files as folders, you can even search/replace content without invoking Office at all. Try changing some text in document.xml with Notepad, save the file back into the container and then change the extension back to DOCX.
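The same edit can be scripted, which is handy for planting test terms in a batch of files. Below is a hedged Python sketch: because the standard zipfile module cannot modify an entry in place, it copies every file into a new container and swaps the text in one of them along the way. The function name and parameters are hypothetical, and the sketch assumes the part is UTF-8 encoded (which is what Word normally writes). It is a blunt string replace, so it will miss text that Word has split across several runs and it could also match inside XML markup.

```python
import zipfile


def replace_in_part(src, dst, old, new, part="word/document.xml"):
    """Copy the package at src to dst, replacing old with new in one part.

    Returns the number of occurrences that were replaced.
    """
    count = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == part:
                text = data.decode("utf-8")  # assumption: UTF-8 part
                count = text.count(old)
                data = text.replace(old, new).encode("utf-8")
            # Reusing the original ZipInfo keeps each entry's name,
            # timestamp and compression method in the copy.
            zout.writestr(info, data)
    return count


# Usage (hypothetical file names):
#   n = replace_in_part("original.docx", "altered.docx", "liberty", "ZZTESTTERM")
#   print(n, "occurrence(s) replaced")
```

Writing to a new file rather than overwriting the source keeps the original intact for comparison, which matters when the point of the exercise is to see what your tools do with each version.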
In moving to an open file format, Microsoft also made it much easier for a bad actor to alter potential evidence and hide their tracks. Moreover, they seriously complicated the definition of a ‘document’ by breaking out all the component pieces into individual files. Look at my test file below after it has been commented and otherwise altered. Then look at the Word folder and notice that there are new files that contain the comments, endnotes, etc.
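To get a quick inventory of that extra content, here is a small Python sketch that walks the XML files under the word/ folder, skips document.xml, and reports any that contain text runs. It assumes the conventional layout Word produces (for example word/comments.xml and word/endnotes.xml); the function name is hypothetical and this is an illustration, not a complete hidden-content detector. It does not look at tracked changes, document properties or embedded objects.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements (w:t holds run text)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_outside_body(docx):
    """Map each non-body XML part under word/ to the text in its w:t elements.

    Parts with no text (styles, settings, themes and so on) are left out.
    """
    found = {}
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            if name == "word/document.xml":
                continue  # the main body is handled elsewhere
            root = ET.fromstring(zf.read(name))
            text = " ".join(
                t.text for t in root.iter(f"{{{W_NS}}}t") if t.text
            )
            if text:
                found[name] = text
    return found


# Usage (hypothetical file name):
#   for part, text in text_outside_body("commented.docx").items():
#       print(part, "->", text)
```

Comparing this output against what your review platform displays for the same file is one way to check whether comments and notes are making it through processing.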
Traditional file viewers such as Stellent’s Outside In have always had problems with hidden content. Although the new format actually makes this content much more accessible, that only helps if software providers are willing to invest the development effort in properly extracting, parsing and handling all of it. In my recent testing on Exchange 2010, I was surprised when I could not get hits for search terms in the body of DOCX and other Office 2007 file formats. I know that the common iFilters can read these formats, so either the defaults are not set to index the content or there is some other reason. Any time you make a major change in the dominant market formats, it requires some time for our technology and processes to adapt. That means that an old-school forensic tech cannot assume that grep-based searches will get complete results when content is now kept in compressed containers. Counsel depend on their litigation support personnel and service providers to monitor the changing profile of custodian ESI and to know the capabilities and limitations of their tools. So now that you know how to open and modify this new generation of Office files, get cracking on them. Insert some test terms into different locations and run them through your system. Every forensic or litsup tech has a little bit of hacker in them, and now you have an excuse. Let me know what you find.
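As a starting point for that testing, the grep limitation is easy to demonstrate. The Python sketch below contrasts a scan of a file's raw bytes, which is roughly what a grep-style tool sees, with a search that opens the Zip container and checks each decompressed file. Both function names are hypothetical, and the second search is still naive: it will miss a term that Word has split across runs or that is stored in a different encoding.

```python
import zipfile


def raw_hit(data: bytes, term: str) -> bool:
    """Mimic a grep-style scan over the raw bytes of the file."""
    return term.encode("utf-8") in data


def parts_with_term(docx, term):
    """Return the names of the parts whose decompressed bytes contain term."""
    needle = term.encode("utf-8")
    with zipfile.ZipFile(docx) as zf:
        return [name for name in zf.namelist() if needle in zf.read(name)]


# Usage (hypothetical file name):
#   with open("gettysburg.docx", "rb") as fh:
#       print(raw_hit(fh.read(), "conceived"))
#   print(parts_with_term("gettysburg.docx", "conceived"))
```

When the content is stored with the usual Deflate compression, the raw scan will generally find the file names inside the container (those are stored uncompressed) but not the words in the document, while the container-aware search reports which file holds them.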