Migrated from eDJGroupInc.com. Author: Greg Buckles. Published: 2010-06-21 06:00:20

Everyone talks about the ‘explosive growth’ of discovery collections. Every once in a while we get a glimpse behind the curtain at the sheer size and complexity of large matters. Browning Marean posed a question to the EDRM Search project that ate an entire afternoon dissecting the 511-page examiner’s report from In re Lehman Brothers Equity/Debt Securities Litigation, 08-cv-05523, U.S. District Court, Southern District of New York (Manhattan). Now I do love to geek out on metrics of all kinds, but what drives me is trying to understand the impact of numbers in context. In this case, we get to see the actual search criteria created by 20 Jenner & Block attorneys to find everything related to the downfall of the investment firm.
The firm of Alvarez & Marsal manages the legacy Lehman Brothers ESI collection, all 3 petabytes of data, which they say equates to approximately 350 billion pages. Yes, that is BILLION with a ‘B’. The examiner and supporting firms created 37 formal search requests that covered 281 custodians. Alvarez & Marsal executed the searches using an unspecified system (although I did find a reference to Iron Mountain on page 211), deduplicated the results and then applied privilege criteria to segregate potentially privileged documents for manual review. Thirty-seven searches do not sound like many until you wade through the 127 pages of actual search criteria, comprising literally thousands of search terms in complex Boolean clauses. Alvarez & Marsal produced roughly 4.4 million documents equating to over 26 million pages. Those documents, only about 0.007% of the total collection, were loaded onto Iron Mountain’s Stratify platform for review. An additional 700,000 documents were produced via 3rd party requests, and roughly half of those were reviewed on Jenner & Block’s in-house installation of Anacomp’s Caselogistix. An interesting point here is that the 3rd party documents averaged 23 pages per document compared to the Lehman search results, which averaged 6 pages per document. Makes you wonder if the 3rd party review process effectively filtered for larger summary documents and attachments while the search results caught smaller extemporaneous communications.
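Those ratios are easy to sanity-check. A quick back-of-the-envelope script, using only the counts reported above, reproduces the pages-per-document averages and the tiny share of the collection that was actually reviewed:

```python
# Back-of-the-envelope check on the figures quoted above; the counts come from
# the examiner's report, nothing else is assumed.
total_pages = 350_000_000_000      # ~350 billion pages in the legacy Lehman collection
produced_docs = 4_400_000          # documents produced from the 37 formal search requests
produced_pages = 26_000_000        # pages those documents represent
third_party_pages_per_doc = 23     # reported average for the 3rd party production

print(f"Search results: {produced_pages / produced_docs:.1f} pages/document")    # ~5.9
print(f"3rd party set:  {third_party_pages_per_doc} pages/document")
print(f"Share of total: {produced_pages / total_pages:.4%} of the collection")   # ~0.0074%
```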
This sets the stage for a monstrous relevance review by the Jenner & Block team. They went to the court four times for approval to add a total of 75 contract attorneys in New York and Chicago. The review ran from April 15th, 2009 until February of this year, roughly 209 working days. If we guess that Jenner & Block had 25 associates on the review for a total of 100 reviewers, we can calculate that the team averaged 28 documents or 203 pages per hour. Having no insight into the actual review process, I cannot know how the documents were organized or how complex the coding tags were. The Stratify platform gave them automatic concept foldering and advanced multi-level review workflow support that should have enabled some clustering of similar documents. Since this was not a privilege review, I would normally have expected to see a much higher average review rate, but my experience managing complex regulatory investigation reviews gives me an appreciation for the challenges involved.
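For anyone who wants to check that math, here is the back-of-the-envelope version. The 8-hour review day and the inclusion of the roughly 350,000 third-party documents reviewed in-house are my assumptions, not figures from the report, but they land right on the rates quoted above:

```python
# Rough review-rate estimate; the reviewer count, hours per day and in-house
# 3rd party share are assumptions, the document and page counts are reported.
reviewers = 100                    # 75 contract attorneys plus ~25 associates (a guess)
working_days = 209
hours_per_day = 8                  # assumed length of a review day

stratify_docs, stratify_pages = 4_400_000, 26_000_000
caselogistix_docs = 350_000        # roughly half of the 700,000 3rd party documents
caselogistix_pages = caselogistix_docs * 23

reviewer_hours = reviewers * working_days * hours_per_day
docs_per_hour = (stratify_docs + caselogistix_docs) / reviewer_hours
pages_per_hour = (stratify_pages + caselogistix_pages) / reviewer_hours
print(f"~{docs_per_hour:.0f} documents/hour, ~{pages_per_hour:.0f} pages/hour")  # ~28, ~204
```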
Bloomberg and NPR articles have focused on a set of generic investigatory search terms in Search 33. They make fun of the ‘Stupid’ things that people might say. Too bad the reporters did not bother to note that these terms were applied to only one custodian and we do not actually know whether there were any hits. Here are the actual search terms with a bit of formatting to make the clauses easier to read:
Shocked or speechless or stupid* or “huge mistake” or “big mistake” or dumb or “can’t believe” or “cannot believe” or “serious trouble” or “big trouble” or unsalvageable or “too late” or ((breach or violat*) w/5 (duty or duties or obligation*)) or “nothing we can do” or uncomfortable or “not comfortable” or “I don’t think we should” or “very sensitive” or “highly sensitive” or “very confidential” or “highly confidential” or “strongly disagree” or “do not share this” or “don’t share this” or “between you and me” or “just between us” or ((can’t or cannot or shouldn’t or “should not” or won’t or “will not”) w/5 (discuss or “talk about”) w/5 (email or e-mail or computer)) or (should w/5 (discuss or talk) w/5 (phone or “in person”))
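For readers who have not lived with proximity connectors, a ‘w/5’ clause matches terms that fall within five words of each other. The toy function below only illustrates that idea; it is not the syntax or engine Alvarez & Marsal actually used, and it treats every term as a prefix so that ‘violat’ covers violate, violated and violation:

```python
import re

def within(text, left_terms, right_terms, distance=5):
    """Toy 'w/N' proximity check: True if any left term falls within
    `distance` words of any right term. Terms are matched as prefixes."""
    words = re.findall(r"[\w'-]+", text.lower())
    left_hits = [i for i, w in enumerate(words) if any(w.startswith(t) for t in left_terms)]
    right_hits = [i for i, w in enumerate(words) if any(w.startswith(t) for t in right_terms)]
    return any(abs(i - j) <= distance for i in left_hits for j in right_hits)

# Mirrors the clause ((breach or violat*) w/5 (duty or duties or obligation*))
print(within("a clear breach of his fiduciary duty",
             ["breach", "violat"], ["duty", "duties", "obligation"]))   # True
```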
The articles are going for shock value without any way to know whether Mr. Mark Weber actually used any of that language between December 2009 and July 24, 2009. Generic words or phrases can be used to spot potential communication categories like anger, fear, deceit, solicitation and more. But this was ONE search out of literally thousands of individual searches run. The vast majority of the searches were very specific, and according to a partner who worked on the case, any that returned too many hits were refined. Here is an example of the typical criteria (see the sketch after the table for how one row might be applied):
| Search Terms | Time Period | Date of Request |
|---|---|---|
| (fund* or cash) w/10 (transfer* or mov* or sweep*) | 8/1/08-9/22/08 | 3/19/2009 |
| (large or big* or signific*) w/10 (collateral w/10 pledg* or mov*) | 2/01/08-9/22/08 | 3/19/2009 |
| (securit* or asset*) w/10 (transfer* or mov* or pledg*) | 8/1/08-9/22/08 | 3/19/2009 |
| (repo* or repurchase*) w/10 (transfer* or mov* or pledg*) | 8/1/08-9/22/08 | 3/19/2009 |
| *solven* w/20 (transfer* or mov* or pledg*) | 3/31/08-9/22/08 | 3/19/2009 |
| (*adequate* or *suffici* or concern* or enough or short) w/10 liquid* | 3/31/08-9/22/08 | 3/19/2009 |
| *valu* w/10 (*model* or mark* or book) w/20 (wrong or update or *correct* or hit or P&L or haircut) | 3/31/08-9/22/08 | 3/19/2009 |
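Every row pairs a term clause with a narrow date window, and that pairing is what keeps the hit counts manageable. Purely as an illustration (not the examiner team’s actual platform or data model), the first row of the table might be applied to a custodian’s mail like this, with the w/10 connector reduced to a crude co-occurrence test:

```python
from datetime import date

def matches_first_row(doc):
    """Illustrative check for '(fund* or cash) w/10 (transfer* or mov* or sweep*)'
    restricted to 8/1/08-9/22/08; the document shape here is an assumption."""
    if not date(2008, 8, 1) <= doc["sent"] <= date(2008, 9, 22):
        return False                                   # date restriction applied first
    text = doc["body"].lower()
    funds = any(t in text for t in ("fund", "cash"))
    moves = any(t in text for t in ("transfer", "mov", "sweep"))
    return funds and moves                             # crude stand-in for the w/10 clause

doc = {"sent": date(2008, 9, 10),
       "body": "We need to transfer cash to the London entity before Friday."}
print(matches_first_row(doc))   # True
```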
They requested all emails from certain custodians for short date ranges, but generally gave alternative search criteria in case the search returned an “unavoidably high volume”. Some of the later searches refer to previous searches, indicating an iterative process of analyzing and refining relevance criteria. Too many counsel approach Meet and Confer negotiations as if they could possibly come up with relevance criteria without actually testing them on the opposing side’s data. Optimizing the precision and recall of search criteria is akin to creating a Meritage blend in winemaking. Both involve test samples, careful evaluation and constant adjustment until you are satisfied with the results. The discovery process in In re Lehman Brothers Equity gives us rare insight into a complex matter of enormous scale. The focused, iterative search refinement process is an example of how collaboration and imagination can conquer a veritable mountain of data.
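To put rough numbers on that analogy: both precision (how much of what a search returns is worth reviewing) and recall (how much of the relevant material the search actually finds) can be estimated from modest samples long before anyone commits to a full review, which is exactly what this kind of iterative refinement depends on. A minimal sketch of the arithmetic, with invented sample counts:

```python
# Estimate precision and recall for a candidate search from two small review samples.
# All of the counts below are invented for illustration.
def precision_recall(relevant_in_hit_sample, hit_sample_size,
                     hits_in_relevant_sample, relevant_sample_size):
    precision = relevant_in_hit_sample / hit_sample_size        # share of hits that are relevant
    recall = hits_in_relevant_sample / relevant_sample_size     # share of known-relevant docs the search catches
    return precision, recall

p, r = precision_recall(relevant_in_hit_sample=140, hit_sample_size=400,
                        hits_in_relevant_sample=85, relevant_sample_size=100)
print(f"precision ~{p:.0%}, recall ~{r:.0%}")   # precision ~35%, recall ~85%
```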