Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-05-11 03:29:40. Format, images and links may no longer function correctly.
My first article on corporate data collections focused on preserving the content, container and context of native files as found on network shares and desktop folders. Discovery requests are increasingly targeting email archives, content management systems and other semi-structured data sources. Most of these sources include search and retrieval features, so one could assume that this makes them safer candidates for in-house collections. This is not automatically true, and it’s definitely worth talking through some of the common problems that can lead to incomplete or altered retrievals.
The first thing to realize is that these systems were not designed to comply with legal discovery requests as found in the United States. The search and retrieval functionality was added to support a business user seeking to find a few specific emails or an IT administrator restoring a larger set of items that were either lost or need to be transferred to a new user. Both of these scenarios stress quick, simple search without needing to verify the accuracy or integrity of the search or restoration.
Does this mean that these systems cannot be used to comply with discovery requests? Not necessarily, but they should not be used for legal requests without a thorough understanding of their architecture, component technologies, and the overall data lifecycle. That understanding gives you the foundation to answer hard questions about how your communications and files have flowed into the system, what happens to them, who had access, how they could have been changed and how they are reassembled when they are retrieved. The next step is to perform some reasonable diligence testing on known custodians, dates, and search criteria. Federal magistrate judges have clearly expressed their expectation that counsel and clients will be able to answer the basic question, “How do you know that your search was successful?” To me, that means knowing the capabilities and limitations of your system, the characteristics of your ESI, and the efficacy of your relevance criteria. For corporate IT or LitSupport personnel, the first goal should be validating the system and being able to confidently assert what can be searched and declare what cannot.
When analyzing the discovery capabilities of any kind of semi-structured content/communication management system there are some basic questions to ask.
- What types of files are stored, and in what format? Many communication archives only ingest specific file types or message classes (IPM.Note email is a good example) and filter out Contacts, Tasks, etc. Does the system convert HTML format email to raw text or even XML? Is the full content of native files stored in the original format or is it converted?
- What container metadata is captured, can it be searched and is that fielded metadata reconstructed on retrieval? Forensic level preservation is not required of normal business systems unless you are using them to comply with legal holds in some types of cases. If ESI is already in that system when the hold is executed, then you are only responsible for what is kept in the normal course of business. Does your system capture user actions like Read/Unread, Flags, Forward or Reply information? Most enterprise systems seek to save space with Single Instance Storage (i.e. enterprise deduplication) that may drop some or all context fields. For native files in content management systems, can you search and restore workflow actions, versions, comments, approvals and other container metadata?
- How is the text extracted from items for search? I have written several white papers and academic papers on common problems with indexing native files. The key takeaway is to define and test your dominant file types so that you know which types you are getting consistent keyword hits from. Most platforms use either Microsoft iFilters, Stellent OutsideIn or Autonomy Keyview to extract text or HTML from various files that is indexed for search.
- What search syntax and fields does my index support? Most counsel think in typical Westlaw/Lexis Boolean syntax. Unfortunately, index engines like dtSearch, Lucene, FAST and others each have their own syntax and limitations for Boolean clauses. You want to clearly understand how AND, OR, NOT, NEAR and other connectors function when you convert requests. More importantly, you want counsel to understand and approve any limitations such as the maximum word length (dtSearch default is 32 characters), single search criteria length, default proximity and more.
- How are custodial searches run? I regularly deal with custodial searches that assume that Microsoft will consistently render email Display Names. Clients are shocked when they figure out that they have been missing 10-20% of custodial email and files because all internal email was resolved in the Canonical Name (domain\Username) or contact alias name. Many content management systems do not retain the association between native files and a specific user, especially for legacy ingestions of existing file shares. The takeaway here is to run basic validation tests. If each user has an email archive or a folder in your ECM system, then run a search in that target with the typical “First Last” Display Name and compare the results to the total number of items expected.
- How are items retrieved? If your system deduplicates or converts ESI for more efficient storage, you need to understand how it reconstructs those items when they are retrieved. Does it produce a report that will enable you to show where, when, how and from whom each item was originally received, along with the corresponding retrieval information? Legal should ask for some kind of ‘Chain of Custody’ document or report to accompany a discovery request. This will enable the items to be authenticated as evidence. Many content management systems just spit out raw files without any reports. At the least, you will want to run a hash utility on your retrieval sets so that you have some way to verify the content of the files at a later date. Courts do not seem to expect perfection, but a little bit of documentation goes a long way to back up your recollection of running that search.
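
The custodial validation test described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Custodian` fields, the `coverage_report` function and the idea that each item's sender and recipient text has already been exported as a plain string are all hypothetical, and the naive substring matching stands in for whatever your archive's own search actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Custodian:
    # All three values are hypothetical examples of how one person may appear.
    display_name: str   # e.g. "Jane Doe"
    domain_login: str   # e.g. r"CORP\jdoe"
    alias: str          # e.g. "jdoe"


def coverage_report(custodian: Custodian, address_fields: list[str]) -> dict:
    """Compare a Display Name search against the known item total.

    address_fields holds one string per item in the custodian's own archive
    or folder, containing that item's sender/recipient text (an assumption
    about how the export looks).
    """
    display = custodian.display_name.lower()
    variants = (custodian.domain_login.lower(), custodian.alias.lower())

    display_hits = variant_only_hits = unmatched = 0
    for raw in address_fields:
        text = raw.lower()
        if display in text:
            display_hits += 1          # a "First Last" search would find this
        elif any(v in text for v in variants):
            variant_only_hits += 1     # only found via login or alias
        else:
            unmatched += 1             # needs manual review

    return {
        "total": len(address_fields),
        "display_name_hits": display_hits,
        "variant_only_hits": variant_only_hits,
        "unmatched": unmatched,
    }


if __name__ == "__main__":
    jane = Custodian("Jane Doe", r"CORP\jdoe", "jdoe")
    sample = [
        "From: Jane Doe <jdoe@example.com>",
        r"From: CORP\jdoe",
        "From: Helpdesk",
    ]
    print(coverage_report(jane, sample))
```

A non-zero `variant_only_hits` or `unmatched` count is the signal to raise with counsel: it means a Display Name search alone would not return everything known to be in that custodian's archive.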
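
The hashing step suggested for retrieval sets can be sketched as follows. This is a minimal example, not a substitute for a proper chain of custody report: the function names, the CSV column names and the choice of SHA-256 are assumptions made for illustration, and the manifest should be written somewhere outside the retrieval folder so that it does not become part of the hashed set.

```python
import csv
import hashlib
from pathlib import Path

FIELDS = ["relative_path", "size_bytes", "sha256"]


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large exports do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(retrieval_dir: Path) -> list[dict]:
    """Return one row per file: relative path, size in bytes and SHA-256."""
    rows = []
    for path in sorted(p for p in retrieval_dir.rglob("*") if p.is_file()):
        rows.append({
            "relative_path": path.relative_to(retrieval_dir).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of_file(path),
        })
    return rows


def write_manifest(rows: list[dict], manifest_path: Path) -> None:
    """Save the manifest as CSV so it can travel with the retrieval set."""
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def changed_files(retrieval_dir: Path, rows: list[dict]) -> list[str]:
    """Re-hash the set and list paths that are missing or no longer match."""
    changed = []
    for row in rows:
        path = retrieval_dir / row["relative_path"]
        if not path.is_file() or sha256_of_file(path) != row["sha256"]:
            changed.append(row["relative_path"])
    return changed
```

Running `build_manifest` at retrieval time and `changed_files` at a later date gives a simple way to show that the files you hand over are the files you retrieved. Note that a hash only covers file content, so filesystem dates and other container metadata would still need to be documented separately.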
Communication and file management systems pose different challenges when you are tasked with preserving or retrieving potential evidence. Email archives and content management systems are not magic bullets, no matter what your sales rep says. Most of them can deliver on the promised risk and cost reductions, but only if you are willing to invest the time and effort to understand, validate and document their capabilities and limitations. Your hardest task will be translating all of your technical findings (geek speak) into relevant discovery impact points for your attorneys (why do I care?).