Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-11-05 14:40:18

We all know that Office 2007 and later files use a different file format from the traditional DOC/XLS/PPT files, but I thought it was worth exploring them with an eye on their potential impact on eDiscovery activities.

First we need a simple explanation of what changed from the Office 2003 formats to the Office 2007 formats. Prior to 2007, Word, Excel and PowerPoint files were each proprietary binary file formats that required the application or a viewer to open. Office 2007 adopted an XML-based file format called Office Open XML that uses a common set of XML files within a compressed Zip container. These Extensible Markup Language (XML) files are simple text files that resemble HTML. The files now have an X or M added to their traditional file extensions to indicate whether they are macro-free XML or contain embedded macro content. So DOC, XLS and PPT have become DOCX/DOCM, XLSX/XLSM and PPTX/PPTM. There are many advantages to the open formats, but we will focus on the potential discovery impact.
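To make the format description concrete, here is a minimal Python sketch of how a processing tool might tell the new files apart from the old binary ones. It uses only the standard library. The function names `classify_extension` and `looks_like_ooxml` are hypothetical, and the container check relies on the assumption that an Office Open XML package is a Zip archive carrying a `[Content_Types].xml` part, so treat it as a heuristic rather than a validator.

```python
import io
import zipfile

# Extension -> (application, macro-enabled?), using the extensions named above.
OOXML_EXTENSIONS = {
    ".docx": ("Word", False),
    ".docm": ("Word", True),
    ".xlsx": ("Excel", False),
    ".xlsm": ("Excel", True),
    ".pptx": ("PowerPoint", False),
    ".pptm": ("PowerPoint", True),
}


def classify_extension(filename):
    """Return (application, macro_enabled) for a known extension, else None."""
    dot = filename.rfind(".")
    if dot == -1:
        return None
    return OOXML_EXTENSIONS.get(filename[dot:].lower())


def looks_like_ooxml(data):
    """Heuristic: the bytes form a Zip archive that holds [Content_Types].xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False  # e.g. a legacy binary DOC/XLS/PPT file
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as package:
        return "[Content_Types].xml" in package.namelist()
```

Checking the container as well as the extension matters in a discovery setting, because an extension can be renamed while the underlying bytes stay the same.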
The Office Open XML formats forced most search and processing applications to update their code or even incorporate new viewers. Before, they had a single file to open and extract text from. Now they have multiple files within a zip container to handle. If you want to see this for yourself, create a DOCX file and then change the extension to ZIP. You can now open that container to see the individual files that determine formatting, content, properties, embedded objects and more.

Below is a DOCX file with the Gettysburg Address that I use for testing deduplication functionality. I have extracted all of the XML files from the zip. The document.xml file contains the actual content text. You can open an XML file with an internet browser, Notepad or another text editor. You can crawl around in your own files to see all the different component parts. Below is a diagram that shows the high-level relationships of the XML file components for an Excel workbook.

An important point here is that the open XML format makes it possible for applications to read and edit the content without having to use the actual Office application or its Dynamic-Link Library files. Because Windows treats zip files as folders, you can even search and replace content without invoking Office at all. Try changing some text with Notepad, save document.xml and then change the extension back.

In moving to an open file format, Microsoft also made it much easier for a bad actor to alter potential evidence and hide their tracks. Moreover, they seriously complicated the definition of a ‘document’ by breaking out all the component pieces into individual files. Look at my test file below after it has been commented and otherwise altered. Then look at the Word folder and notice that there are new files that contain the comments, endnotes, etc.

[Image: NewXMLFiles]
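The same exploration can be done programmatically, without Office and without renaming anything. The following Python sketch lists the parts inside a DOCX, pulls the body text out of `word/document.xml`, and flags the extra parts that appear once a document has been commented or annotated. The function names are hypothetical, and the part names `word/comments.xml`, `word/footnotes.xml` and `word/endnotes.xml` are the conventional ones I am assuming Word writes; a given file could name them differently, and the mere presence of a part is a hint to look closer, not proof of content.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed conventional part names for annotations in a Word package.
ANNOTATION_PARTS = {
    "word/comments.xml",
    "word/footnotes.xml",
    "word/endnotes.xml",
}


def list_parts(docx_bytes):
    """Return the sorted names of every file inside the zip container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def extract_body_text(docx_bytes):
    """Return the body text, one line per paragraph, from word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        pieces = [t.text or "" for t in paragraph.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


def annotation_parts(docx_bytes):
    """Return the annotation parts present in the package, if any."""
    return [name for name in list_parts(docx_bytes) if name in ANNOTATION_PARTS]
```

Note that `extract_body_text` reads only the main document part. Text living in comments, footnotes or endnotes sits in separate files, which is exactly why a tool that stops at document.xml can miss responsive content.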