Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2011-03-02 08:01:36. Format, images and links may no longer function correctly.
I recently finished a research paper that provides an overview of Enterprise Search for Discovery. My intent was to aggregate, organize and condense corporate client discussions around this area over the last year. This can get a bit technical, but a research paper gives me the room to explore and define things to make them more approachable to the non-geek. Enterprise search and preservation collection platforms are the second most frequent technology RFP engagement for my corporate clients after archiving systems. The technology providers have many different approaches, architectures and features that can confuse the prospective buyer. After having the same discussion so many times, I decided to put together a low-cost ($29) overview report to at least define the options, potential benefits, costs and things to consider before investing in enterprise search.
Enterprise search tends to fall into two main indexing camps: selective vs. enterprise-wide. One element from the report is the potential index size, which matters because indexes like to live on Tier 1 class storage (SAN, Direct Attached or other top class storage).
First the definitions:
- Selective or On-demand Indexing focuses on custodian and corporate sources relevant to the matter. It does not mean having to index the world every time you have a new matter. Rather, the sources indexed are determined by their importance and potential responsiveness. It is expected that priority sources like executive and corporate secretary shares will be proactively indexed and incrementally updated as needed. If you know that there is a central contract folder that may contain files responsive to most of your contract dispute litigation, then you would proactively index that folder and schedule monthly updates. The primary difference between selective indexing and enterprise-wide index architectures is the design for reactive burst capacity (for selective indexing) versus continuous updates (for enterprise-wide indexing). The selective indexing solution’s graphical user interface will support rapid designation of sources and load balancing of high volume indexing. One is built for sprints while the other is designed to run continuously. The actual technical indexing and search mechanisms are quite similar, but the workflow, cost and scale of implementation are significantly different.
- Enterprise Wide Search Index can be defined as any system that proactively and continuously indexes all or most of the active data sources within the corporate enterprise and makes them searchable from a single search interface. There are very few players in this market, but they have a very compelling message. Some work by federating the search criteria across multiple native indexes (SharePoint, desktop, etc.), while others create a homogeneous set of indexes from multiple sources. Remember that the goal is to provide a single live search across the enterprise.
Now on to relative index sizes:
Selective: There is always a trade-off between index size and advanced search functionality. While extended search functions provide value and allow users to conduct more granular and complex searches, the cost is a larger, more difficult to manage index. A simple index can be compressed down to 5-10% of the original ESI data set size, especially if there are a lot of ‘Noise Words’ that it gets to skip. A basic index like this can tell you if a document contains a term, but not where that term is located in relationship to other terms. Thus, while the ability to search for phrases is lost, the index size remains small and search speed may be much faster. The average index size of most traditional eDiscovery applications that support the expected advanced features ranges from 20% to 50% of the ESI data set. Adding in concepts, taxonomies, facets and all the other bells and whistles that allow for faster review of the data, the index can actually be larger than the original ESI. That presents a challenge for storing and managing the index. Selective or case-based indexing can enable the legal team to change the search capabilities to meet the requirements of different matters. This does raise potential problems if you have already proactively indexed your email system or other priority sources, but selective indexing does give you more flexibility. Indexing only a targeted portion of your enterprise will result in less index storage than the index-everything approach, and in some cases enables additional analysis techniques, such as topic classification and concept search, which are impractical to apply to the entirety of data across the enterprise.
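For the technically curious, here is a minimal Python sketch of why a basic index cannot answer phrase searches while a larger positional index can. It is only an illustration, not how any particular product works: the noise word list, the sample documents and every function name are made up for this example.

```python
from collections import defaultdict

# Illustrative noise word list (an assumption, not taken from any product).
NOISE_WORDS = {"the", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Yield (position, term) pairs, skipping noise words but keeping original positions."""
    for pos, word in enumerate(text.lower().split()):
        if word not in NOISE_WORDS:
            yield pos, word


def build_simple_index(docs):
    """Basic index: term -> set of document ids. Small, but has no location data."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, term in tokenize(text):
            index[term].add(doc_id)
    return index


def build_positional_index(docs):
    """Positional index: term -> document id -> list of word positions. Larger."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in tokenize(text):
            index[term][doc_id].append(pos)
    return index


def phrase_search(pos_index, phrase):
    """Return ids of documents containing the terms of `phrase` in consecutive positions.

    Assumes the phrase itself contains no noise words.
    """
    terms = phrase.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id, starts in pos_index.get(terms[0], {}).items():
        for start in starts:
            if all(start + i in pos_index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {
    "d1": "the contract dispute was settled",
    "d2": "dispute over the contract terms",
}
simple = build_simple_index(docs)
positional = build_positional_index(docs)

print(sorted(simple["contract"] & simple["dispute"]))        # both documents contain both terms
print(sorted(phrase_search(positional, "contract dispute")))  # only d1 contains the exact phrase
```

The basic index can only say that both documents contain "contract" and "dispute". The positional index stores an extra entry for every occurrence of every term, which is where the additional storage goes, and that is what lets it narrow the result to the one document with the actual phrase.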
Enterprise Wide: Consider that you may need to allocate 30-50% of your aggregate network and local storage for your central index. Some systems will try to reduce this index storage by federating search to specific data sources or using local indexes on mobile sources. This can present issues when the federated indexes have different functionality, as previously mentioned. The problem arises because typical business user requirements for desktop search are much simpler than potential discovery searches, and most users do not want to allocate 20-50% of their desktop storage for an index. So whether the indexes are federated or centralized, the enterprise-wide solution will have a greater overhead on relatively expensive storage. Remember that it can be difficult or impossible to change your indexing/search options once your indexes are created.
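To put rough numbers on the two approaches, here is a small back-of-the-envelope Python sketch. The percentage ranges are the ones quoted above; the data volumes (100 TB across the enterprise, 5 TB of targeted sources) and all names in the code are hypothetical and only there to make the arithmetic concrete.

```python
# Index size as a percentage of source data, using the ranges quoted in this post.
INDEX_PCT = {
    "simple": (5, 10),            # basic index, no term positions
    "advanced": (20, 50),         # typical eDiscovery index with advanced features
    "enterprise_wide": (30, 50),  # central index over aggregate storage
}


def index_size_range_gb(source_gb, kind):
    """Return the (low, high) estimated index size in GB for a given index type."""
    low_pct, high_pct = INDEX_PCT[kind]
    return source_gb * low_pct / 100, source_gb * high_pct / 100


# Hypothetical volumes, chosen only for illustration.
ENTERPRISE_GB = 100_000  # 100 TB of aggregate network and local storage
SELECTIVE_GB = 5_000     # 5 TB of priority and matter-specific sources

selective = index_size_range_gb(SELECTIVE_GB, "advanced")
enterprise = index_size_range_gb(ENTERPRISE_GB, "enterprise_wide")

print(f"Selective index:       {selective[0]:,.0f} to {selective[1]:,.0f} GB")
print(f"Enterprise-wide index: {enterprise[0]:,.0f} to {enterprise[1]:,.0f} GB")
```

Under these assumed volumes the selective index needs 1 to 2.5 TB of Tier 1 storage, while the enterprise-wide index needs 30 to 50 TB. Your own ratio will depend on how much data you actually target and which search features you turn on.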
Enterprise search comes with a price tag and the different approaches have definite pros and cons. I hope that this peek at my thoughts around index sizes whets your appetite and gets you thinking about which approach is right for your enterprise. A little research and education may help you avoid buying the wrong solution for your specific search requirements. You can view the report abstract here.
