Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2011-01-10 06:51:33. Format, images and links may no longer function correctly.

Searching for a single term within a document is pretty black and white: the term is either present or not. When you step up to searching based on phrases, proximity terms, concepts and compound term clusters, things start to get a bit less absolute. Yet simple lists of terms are generally either overly broad or miss relevant ESI. The simplest search index does not store information about the position(s) of terms within a document. Modern search indexes such as Lucene, FAST, IDOL and others rely on term position and other information to derive clusters of two or more related terms (concepts) and relevance weighting factors.

During a recent briefing call with Mike Wade, CTO of Planet Data, we delved into some of the challenges that Planet Data faced in expanding their Exego Early Cost Assessment platform to support concept search and ECA workflow. What really caught my attention was the ability to extract two separate versions of the text from documents: both the raw unformatted text AND the rendered view. Alternatively, they have developed a merged rendering that embeds the extracted object text in-line with the viewed text.
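To make the difference between a simple index and a positional one concrete, here is a minimal sketch in Python. It is not how Lucene, FAST, IDOL or Exego are implemented; the function names and the two sample documents are invented for illustration. It only shows why a phrase query needs term positions while a plain term list does not.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]} (illustrative only)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc ids where the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every later term must appear exactly i tokens after the first.
            if all(start + i in index[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample documents.
docs = {
    "a": "the quarterly revenue forecast was revised",
    "b": "revenue was revised after the quarterly forecast",
}
index = build_positional_index(docs)

# Term-only lookup: both documents contain both words.
term_only = set(index["quarterly"]) & set(index["forecast"])
# Phrase lookup: only the document where the words are adjacent.
phrase_only = phrase_match(index, "quarterly forecast")
```

Both documents satisfy the simple term list, but only document "b" satisfies the phrase, because the answer depends on where the terms sit relative to each other.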

MS Word View - Rendered Document

Raw XML of same MS Word document

Above are two views of the same document from my testing corpus. One is the rendered view in Word 2007 and the other is the actual XML text view in a text editor. Embedded objects, comments, font changes, style elements and more can change the relative position of individual terms depending on which version the system processes. On some collections, the Planet Data development team saw radical differences in concepts and search results depending upon which version they used, so they wanted the option to take the processing and storage hit to generate and index both versions. It is also important to note that for many documents there will be no difference at all.
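A small sketch of how the choice of extracted text can change a proximity result. The two strings below are hypothetical extractions of one document, one without the embedded object text and one with that text merged in-line; the helper names and the window size of 5 are assumptions, not anything taken from Exego.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(text, term_a, term_b):
    """Smallest token gap between the two terms, or None if either is absent."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == term_a]
    pos_b = [i for i, t in enumerate(toks) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def within(text, term_a, term_b, window):
    """True if the two terms occur within `window` tokens of each other."""
    distance = min_distance(text, term_a, term_b)
    return distance is not None and distance <= window

# Hypothetical extractions of the same document.
viewed_text = "The merger closes in March. Pricing terms are attached."
merged_text = ("The merger [chart: Q1 volume by region, units shipped, forecast] "
               "closes in March. Pricing terms are attached.")

hit_viewed = within(viewed_text, "merger", "march", window=5)
hit_merged = within(merged_text, "merger", "march", window=5)
```

With these inputs the query "merger within 5 of March" matches the first extraction and misses the second, even though a reviewer would call them the same document.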

Mike gave me several good examples of issues that can cause problems, mostly centered on MS Word, PowerPoint and email with embedded charts, WordArt and other text objects. The easiest to replicate on your own system is to take an MS Word document with some embedded charts or other complex formatting, save it to Adobe PDF, then see how these ‘duplicates’ match up for concepts or near-duplicate detection.
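One rough way to run that comparison yourself is to score the two extracted texts with word shingles and Jaccard similarity. This is a generic near-duplicate measure chosen for illustration, not the method any particular product uses, and the two sample strings are hypothetical stand-ins for the text a tool might extract from the Word file and from its PDF copy.

```python
import re

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical extractions: one includes the embedded chart text, one does not.
word_text = ("Sales rose in the third quarter. "
             "Chart: North 40 South 35 East 25. "
             "Outlook remains stable.")
pdf_text = "Sales rose in the third quarter. Outlook remains stable."

similarity = jaccard(shingles(word_text), shingles(pdf_text))
```

A score well below 1.0 for two files that look identical on screen is the signal to check how each format was extracted before trusting concept or near-duplicate groupings.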

We have always known that searching for text within spreadsheets is problematic, but this new generation of compound Office 2007 XML documents and email formats has dramatically expanded the potential for false negative search results. During a validation testing project, I ran into an interesting example of hidden embedded objects ‘breaking’ search. Some search engines treat font or style changes as word breaks during tokenization and so will not find the whole term. For example, if you change the formatting of half of the word Liberty, some engines will see this as ‘Lib’ and ‘erty’. Office 2007 has now fixed this behavior, but I still see it when testing applications.
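The ‘Lib’ and ‘erty’ split can be reproduced with a few lines of Python. The markup below is a simplified, hypothetical stand-in for formatted runs (it is not real Office or email markup), and the two join strategies are assumptions about how an extractor might behave rather than a description of any specific engine.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup: the word "Liberty" is split across two
# runs because only its first half carries bold formatting (<b/>).
markup = (
    "<p>"
    "<r><t>Give me </t></r>"
    "<r><b/><t>Lib</t></r>"
    "<r><t>erty</t></r>"
    "<r><t> or else</t></r>"
    "</p>"
)

paragraph = ET.fromstring(markup)
runs = [t.text or "" for t in paragraph.iter("t")]

# Strategy 1: treat every run boundary as a word break.
naive_tokens = " ".join(runs).lower().split()
# Strategy 2: concatenate runs and let only real whitespace split words.
joined_tokens = "".join(runs).lower().split()

found_naive = "liberty" in naive_tokens
found_joined = "liberty" in joined_tokens
```

Under the first strategy a search for "liberty" is a false negative; under the second the term is found. Running a known split-format word through an application is a quick validation test for this exception.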

Example of email formatting breaking word

The more advanced search and analytics become, the more challenging it is to discover and disclose the known exceptions. As our ESI collection sizes continue to grow, simple lists of custodians, dates and single terms become increasingly imprecise. New search training systems like Equivio’s Relevance provide a more transparent method for generating complex, weighted criteria with a high relevance confidence level, but the criteria are only as good as the extracted text that you are searching against.
