E-Discovery processing involves much more than just multiplying the number of gigabytes by the per-gig rate. In the first part of this series, we looked at what processing does under the hood, and how to ensure that duplicate records are properly identified. This part continues the discussion.
Emails and Attachments
When an email is extracted from its container (a PST file, for example), its attachments are embedded in the resulting file. In order to review attachments separately from their parent email, each attachment needs to be extracted from the email file. This is a standard part of e-discovery processing.
However, when conducting a native review, what format should the native email, without its attachments, take? In many cases, it ends up being the email extracted from its container, with all the attachments still embedded! This leads to some issues:
1. From a cost perspective, the volume is nearly doubled – you pay for the email file including the size of all attachments, plus you pay for each extracted attachment.
2. If the review environment is web-based (as many of the more modern ones are), the amount of time a reviewer waits for a record to appear on screen is directly related to the size of the file being downloaded. If a very short email contains many attachments, a reviewer could wait 30 seconds or more before being able to review one or two lines of text. If a review database contains 100,000 such emails, that’s an extra 50,000 minutes, or about 833 hours, of review time added solely because of the way the email was processed.
3. When producing native files, an attachment cannot be redacted or withheld if its parent email is produced, because the produced email file still contains every attachment.
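The arithmetic behind points 1 and 2 can be checked with a short sketch. The 100,000 emails and the 30-second wait are the figures used above; the gigabyte sizes and the function names are hypothetical, chosen only to illustrate the near-doubling of billed volume.

```python
def added_review_hours(email_count, wait_seconds):
    """Total reviewer waiting time, in hours, across all emails."""
    return email_count * wait_seconds / 3600


def billed_gb(body_gb, attachments_gb, keep_embedded):
    """Volume billed for the email files plus the extracted attachments.

    keep_embedded=True models email files that still carry their attachments.
    """
    email_files = body_gb + (attachments_gb if keep_embedded else 0)
    extracted_attachments = attachments_gb
    return email_files + extracted_attachments


print(f"{added_review_hours(100_000, 30):.0f} hours")  # 833 hours

# Hypothetical collection: 0.5 GB of email bodies, 10 GB of attachments.
print(billed_gb(0.5, 10, keep_embedded=True))   # 20.5
print(billed_gb(0.5, 10, keep_embedded=False))  # 10.5
```

The smaller the email bodies are relative to their attachments, the closer the billed volume gets to exactly double.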
A more sensible way to process emails is to treat each one like a zip file: the email container file holds an email body and zero or more attachments. Each attachment, as well as the email body, is extracted from the email container file, and the whole group is linked together. The container file is then removed from the collection, since its component parts have all been separated. The email body would typically be a text file, an RTF file (a rich-text format that Microsoft Word can open), or an HTML file.
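As a rough illustration of the zip-file approach, the sketch below splits one message into its body and attachments using Python's standard `email` package. It assumes the message has already been pulled out of its PST and saved in standard internet message (.eml) form; the `split_email` function, the dictionary layout, and the use of a SHA-256 hash as the family link are all assumptions made for the example, not how any particular processing tool works.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_email(raw: bytes) -> dict:
    """Separate one email into a body and a list of attachments."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the HTML body, fall back to plain text.
    body_part = msg.get_body(preferencelist=("html", "plain"))
    body = body_part.get_content() if body_part is not None else ""

    attachments = [
        {"filename": part.get_filename(), "data": part.get_payload(decode=True)}
        for part in msg.iter_attachments()
    ]

    # Hypothetical family link: every extracted piece shares this value.
    family_id = hashlib.sha256(raw).hexdigest()
    return {"family_id": family_id, "body": body, "attachments": attachments}


# Build a small test message rather than reading a real file.
demo = EmailMessage()
demo["Subject"] = "Q3 figures"
demo.set_content("See attached.")
demo.add_attachment(b"a,b\n1,2\n", maintype="application",
                    subtype="octet-stream", filename="figures.csv")

parts = split_email(demo.as_bytes())
print(len(parts["attachments"]), parts["attachments"][0]["filename"])  # 1 figures.csv
```

Once the body and attachments are stored as separate records sharing the family identifier, the original message file no longer needs to be loaded into the review database.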
Microsoft Office Documents
Microsoft designed Office files (primarily Word, Excel, PowerPoint) so that all three types can be embedded into each other. The idea was that you could create a Word document, and instead of transcribing an Excel spreadsheet into a table, you just put the whole spreadsheet into the file. The idea was a good one from a productivity perspective, but it complicates e-discovery processing.
Until recently, e-discovery processing did not extract embedded files from Office documents. That meant a reviewer could open a Word file and discover several hidden Excel spreadsheets that also needed review. Since those embedded spreadsheets started life as separate Excel files, another reviewer would review another copy of the same spreadsheets as separate files. Effort was clearly duplicated, and the potential for inconsistent coding greatly increased.
More modern e-discovery processing software now extracts records embedded into Office documents. It’s not uncommon to see a Word file with attached spreadsheets and graphics. While this appears to solve the problems listed above, it creates some others, particularly with respect to PowerPoint presentations.
PowerPoint presentations, more than any other type of Office document, almost always have embedded records. Many of these records are graphic files, but some are spreadsheets or Word files. Unlike a Word file that contains a spreadsheet or an Excel file that contains a Word document, the embedded files within a PowerPoint presentation are there to form the visual presentation. Viewing the slides will show the embedded content, and it will show it within the proper context. Extracting the embedded records therefore leads reviewers to look at the same records twice: once in the presentation and once on their own. Again, duplication of effort.
Another issue surrounding embedded objects is that, while a spreadsheet embedded into a Word document may need to be examined on its own, a picture embedded into the same Word document is clearly visible within the document and likely does not need to be examined separately. The smart way to deal with embedded objects is to extract only those objects that need to be reviewed on their own. If an object can be reviewed in situ, don’t extract it.
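One way to picture that rule is the sketch below, which applies only to the newer zip-based Office formats (.docx, .xlsx, .pptx), where embedded objects are normally stored in an `embeddings` folder and pictures in a `media` folder inside the package. The `plan_extraction` function and its `skip_presentations` switch are hypothetical; they encode the policy described above (pictures stay in place, PowerPoint content stays in place) and are not taken from any processing product.

```python
import io
import zipfile
from pathlib import PurePosixPath


def plan_extraction(office_file, skip_presentations=True):
    """Return (extract, leave_in_place) lists of part names in the package."""
    extract, leave_in_place = [], []
    with zipfile.ZipFile(office_file) as package:
        names = package.namelist()
        # Presentation packages keep their parts under ppt/.
        is_presentation = any(name.startswith("ppt/") for name in names)
        for name in names:
            folder = PurePosixPath(name).parent.name
            if folder == "embeddings":
                # Embedded documents: extract unless they sit in a presentation.
                if is_presentation and skip_presentations:
                    leave_in_place.append(name)
                else:
                    extract.append(name)
            elif folder == "media":
                # Pictures are visible in the document itself.
                leave_in_place.append(name)
    return extract, leave_in_place


# Build a stand-in package in memory instead of reading a real Word file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("word/document.xml", b"<doc/>")
    demo.writestr("word/embeddings/Microsoft_Excel_Worksheet1.xlsx", b"sheet")
    demo.writestr("word/media/image1.png", b"picture")
buffer.seek(0)

to_extract, to_leave = plan_extraction(buffer)
print(to_extract)  # ['word/embeddings/Microsoft_Excel_Worksheet1.xlsx']
print(to_leave)    # ['word/media/image1.png']
```

Older binary formats (.doc, .xls, .ppt) are not zip packages and would need a different approach.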
Part 3 concludes this discussion and provides a checklist of things to ask IT or the forensic vendor when you have electronic records to process.