Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-05-27.

Traditional discovery search applications like dtSearch and processing packages like Clearwell are usually offline while new collections are being indexed. Litsupport and legal personnel are accustomed to knowing exactly what collection they are searching at any given moment. If you add new ESI to your matter, you update the index before you search, right? But now that there are tools for searching ESI where it lives on live corporate servers and desktops, we have a relatively new wrinkle in search – index lag. Enterprise and desktop search engines run in the background and watch for new, deleted or changed files within the folders they monitor. The problem is that index updates are never instantaneous, which means that enterprise-wide searches are never 100% complete.
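To make the gap concrete, here is a minimal sketch in Python that lists the files a live search could get wrong: anything modified after the last completed index pass is invisible to, or misrepresented in, the index. The one-hour-old index timestamp is an assumption for illustration.

```python
import os
import time

def files_changed_since(root, last_index_time):
    """List files modified after the last index pass: the 'index lag' blind spot.
    A live search can miss or misrepresent anything returned here."""
    stale = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > last_index_time:
                    stale.append(path)
            except OSError:
                continue  # file vanished mid-walk: index lag in action
    return stale

# Assume (hypothetically) that the index last completed a pass one hour ago.
last_index_time = time.time() - 3600
print(f"{len(files_changed_since('.', last_index_time))} files changed since the last index pass")
```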
For the purposes of this conversation, we can exclude all the usual concerns about unindexable or partially indexable ESI formats and content. Instead, let’s focus on the potential gaps resulting from the lag between changes on the live system and the index. Anyone who has installed X1 or Windows Desktop Search (WDS) has seen their desktop bog down while the system crunches through all the default file locations (and no, the defaults rarely cover everything). IT admins rolling out an enterprise search solution like Autonomy IDOL, StoredIQ or Recommind will almost always throttle the system back to minimize the impact on users. Many desktop search engines are set to ‘Zero Impact’ by default, meaning the engine pauses indexing while you are active on the machine so that it does not slow you down. This rests on the business usage assumption that most of what you search for is at least several days old, and that over a typical day you will walk away from your machine several times. Those inactive periods should keep your index reasonably current.
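As a rough illustration of that ‘Zero Impact’ behavior, here is a minimal sketch; seconds_since_last_input() is a hypothetical stand-in for the OS idle-time query a real engine would make. The point is simply that while the user is active, the backlog, and therefore the index lag, grows.

```python
import time

IDLE_THRESHOLD = 300  # seconds without input before indexing resumes (assumed policy)

def seconds_since_last_input():
    """Placeholder: a real engine queries the OS for the last keyboard/mouse
    event; hard-coded here so the sketch runs anywhere."""
    return 10  # pretend the user is at the keyboard

def zero_impact_pass(work_queue):
    """Drain the indexing backlog only while the user is away. While the user
    is active, the backlog (and the index lag) simply grows."""
    while work_queue:
        if seconds_since_last_input() < IDLE_THRESHOLD:
            print(f"user active: pausing with {len(work_queue)} items unindexed")
            return
        path = work_queue.pop(0)
        print(f"indexing {path}")
        time.sleep(0.01)  # throttle I/O even when the machine is idle

zero_impact_pass(["/docs/a.doc", "/docs/b.xls", "/docs/c.ppt"])
```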
This assumption can lead to problems in specific discovery scenarios. For instance, you run investigative searches on an executive. He catches wind of the investigation and deletes the critical files from his system after your searches have already hit on them. They are still in the index, but no longer on the desktop. The major problems arise when large volumes are moved, added or deleted on an active system, which temporarily widens the gap between the index and reality. Index lag on server shares can be significant when they are set for daily updates or when the index belongs to an enterprise content management, archiving or backup system.
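One practical mitigation is to confirm that every hit still exists on the live system before relying on the results. A minimal sketch, with a hypothetical verify_hits() helper and made-up paths:

```python
import os

def verify_hits(hit_paths):
    """Split search hits into files still present on disk and phantom hits
    that survive only in the stale index (hypothetical helper, not a product API)."""
    live, phantom = [], []
    for path in hit_paths:
        (live if os.path.exists(path) else phantom).append(path)
    return live, phantom

# Made-up paths standing in for hits returned by an enterprise search.
hits = ["/shares/exec/q3_forecast.xls", "/shares/exec/board_memo.doc"]
live, phantom = verify_hits(hits)
for path in phantom:
    print(f"WARNING: hit exists in the index but not on disk: {path}")
```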
A reasonable portion of legal matters are concerned only with a specific historical time period. That raises an interesting scenario based on the automatic expiry and deletion of files that have passed their retention period. Most archives and content management systems run daily checks and delete these items immediately, but it can take time for the index to catch up. If you think this is not an issue, consider a typical discovery search scenario. An attorney emails a list of custodians and search terms to their litsupport tech. The tech creates a matter and runs the test search. The hit count is reported and the attorney goes off to confer with outside counsel, potentially even with opposing counsel. It could be days or weeks before they come back with the green light to retrieve the files. Unless the system has automatically preserved those hits, it is probable that some items will be deleted every day if the company runs any kind of systematic retention enforcement across all data sources. Now consider the effort required to manually verify that this is what happened to thousands of files that show up as hits but can no longer be retrieved or viewed. This is a different issue from index lag, since rerunning the search will eliminate the false hits, but you know they were there when you first checked, and your audit log will show them. Worse, you may have reported those numbers to the other side during negotiations, and they may well check them against your produced, non-relevant and privileged counts. This is more of a preservation issue than an index problem, but it is worth considering.
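The safeguard this scenario implies is preserving hits at search time, before anyone spends weeks waiting for a green light. Here is a minimal illustrative sketch that copies hits into a matter hold folder and logs each outcome; real systems typically apply in-place legal holds rather than copies, and preserve_hits() plus the paths below are made up for illustration.

```python
import csv
import os
import shutil
from datetime import datetime, timezone

def preserve_hits(hit_paths, hold_dir):
    """Copy each hit into a matter-specific hold folder and log the outcome,
    so later retention expiry cannot silently shrink the reported hit count."""
    os.makedirs(hold_dir, exist_ok=True)
    with open(os.path.join(hold_dir, "preservation_log.csv"), "a", newline="") as log:
        writer = csv.writer(log)
        for path in hit_paths:
            stamp = datetime.now(timezone.utc).isoformat()
            try:
                shutil.copy2(path, hold_dir)  # copy2 keeps file timestamps
                writer.writerow([stamp, path, "preserved"])
            except OSError as err:
                writer.writerow([stamp, path, f"FAILED: {err}"])

# A missing source file is logged as FAILED rather than silently skipped.
preserve_hits(["/shares/hr/complaint_2004.msg"], "matter_1234_hold")
```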
The last index lag point is a bit technical. When you consider the huge scale of enterprise search, it makes sense that developers minimize the footprint of the most expensive system components, namely the servers and database storage. Many systems therefore keep only pointers in their database and almost everything else in the full text index. That makes for lower overhead, but it means the index has to be updated every time a retention category, metadata property, categorization tag, virtual folder or other indexed attribute is altered. A user could clean out a collection of family photos, or at least mark them ‘Personal’, yet they would still show up in search results as business items for a period of time. This really heats up in a review scenario where multiple reviewers are tagging, flagging and commenting on a set of items. Nothing drives a reviewer crazy like marking an item privileged and then not finding it when they pull up all their privileged items for the second pass.
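A minimal sketch of that split, with made-up document IDs and tags: the database of record takes the reviewer’s change immediately, but searches run against the index, which only catches up on the next refresh.

```python
# The database of record holds only a pointer, while searchable attributes
# live in the full-text index. A tag change hits the database immediately
# but reaches the index only on the next scheduled refresh.
database = {"doc42": {"location": "/shares/review/doc42.msg", "tag": "privileged"}}
index = {"doc42": {"tag": "business"}}  # indexed before the reviewer re-tagged it

def search_by_tag(tag):
    """Searches run against the index, not the database of record."""
    return [doc_id for doc_id, attrs in index.items() if attrs["tag"] == tag]

print(search_by_tag("privileged"))  # [] -- the new privilege tag is invisible

def refresh_index():
    """The nightly (or throttled) re-index that closes the gap."""
    for doc_id, record in database.items():
        index[doc_id] = {"tag": record["tag"]}

refresh_index()
print(search_by_tag("privileged"))  # ['doc42'] -- visible only after the refresh
```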
The primary takeaway from all of this is that enterprise systems are a different animal from traditional static eDiscovery platforms. You are handling massive quantities of live ESI, and you should understand how the architecture and process flow can affect search, review and productions from those systems. You cannot assume that these systems are completely up to date, or even connected to the network, at the moment you run your search. You should be very careful when making affidavits about discovery productions, ensuring that you fully understand what might not be there and how to express potential exceptions. So ‘mind the gap’ in your discovery process!