Anne Kershaw and Joe Howie write on the Law Technology News website about the results of their survey among eDiscovery providers. The gist of their article – and it’s a good one – is that failure to deduplicate e-mails across custodians may be at best sloppy, and at worst unethical:
We asked several judges to review this article and all quickly grasped the benefits of deduping across custodians. When asked if deduping practices should be considered when deciding attorneys fees, most indicated it would be appropriate.
Said U.S. Magistrate Judge John Facciola [author of U.S. v. O’Keefe, asserting that most lawyers are not qualified to write effective keyword searches], “Certainly. I already look for … over-lawyering, having too many people doing the same thing, or having overqualified people do what the more junior people should do. … Failing to dedupe is the electronic version of the same problem.”
I’ve stayed out of the dedupe-yea-or-nay argument until now, since LLM’s mission as a service provider is to do what our clients ask of us, not to promote ESI processing options that (let’s be honest) make us money. However, while I agree that more lawyers should dedupe across custodians wherever possible (if for no other reason than to cut down their own review costs), I have to disagree with the necessary implication that there’s something inherently unethical about the failure to do so.
What Anne and Joe fail to note is that deduplication is itself a primitive process, especially where e-mail is concerned. Deduplication is generally performed by comparing “hash values”: numerical “fingerprints” calculated from the file’s contents. The computer calculates the hash value by factoring in every byte of the file (visible and invisible), including any metadata embedded in it; depending on the tool and its settings, file-system details such as the file’s name or location may or may not be factored in as well. As with human fingerprints, it’s rare (though not impossible) to find different files with the same hash; for nearly all purposes, therefore, a matching hash value means the files are exact copies. The processing software then compares hash values and, where it finds a match, flags or discards the duplicate.
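To make the mechanics concrete, here is a minimal Python sketch of hash-based deduplication. It is an illustration only, not how any particular processing tool works: it assumes the hash covers nothing but the file's bytes, it uses SHA-256 as the fingerprint, and the file names and contents are made up.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return a hex fingerprint computed from every byte of the content."""
    return hashlib.sha256(data).hexdigest()


def dedupe(files: dict) -> tuple:
    """Keep the first file seen for each hash; flag later matches as duplicates."""
    kept = {}        # hash value -> name of the first file with that hash
    duplicates = []  # (duplicate name, name of the file it matches)
    for name, data in files.items():
        h = file_hash(data)
        if h in kept:
            duplicates.append((name, kept[h]))
        else:
            kept[h] = name
    return kept, duplicates


# Hypothetical files from two custodians (names and contents are invented).
files = {
    "smith/report.doc": b"Quarterly results: revenue up 4%.",
    "jones/report_copy.doc": b"Quarterly results: revenue up 4%.",
    "jones/report_v2.doc": b"Quarterly results: revenue up 4%. ",  # one trailing space
}

kept, duplicates = dedupe(files)
print(duplicates)  # [('jones/report_copy.doc', 'smith/report.doc')]
```

Note that the copy with a single extra space is not flagged: to the hash, it is simply a different file.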
This process works fine for files. It doesn’t work so well for e-mails. Because there are so many different formats in which e-mails can be stored, exported, attached, nested, etc., e-mail hash values are calculated based on content properties such as the subject, body text, attachment count and attachment names, and the e-mail date. The e-mail hash can also be calculated by including addressee information: sender, recipients, CC and BCC. The problem is that, depending upon which properties are used to calculate the hash, the same e-mail on two different platforms (say, Microsoft Outlook and Lotus Notes) may end up with very different hash values.
Outlook and Lotus Notes, the two most popular desktop e-mail systems, have different ways of storing addressee information. As a result, it has been our experience that, if addressee properties are included in calculating the hash values, the exact same e-mail will usually have a different hash value for each platform. Different hash values mean duplicates go unrecognized.
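The effect is easy to reproduce in a small Python sketch. Everything here is an assumption made for illustration: the field names, the choice of SHA-256, and the two address styles (a display name with an SMTP address on one side, a Notes-style hierarchical name on the other) are stand-ins, not the actual export formats or the hashing recipe of any specific tool.

```python
import hashlib


def email_hash(msg: dict, include_addressees: bool) -> str:
    """Hash an e-mail from selected content properties (hypothetical recipe)."""
    parts = [
        msg["subject"].strip(),
        msg["body"].strip(),
        msg["date"],
        str(len(msg["attachments"])),
        *sorted(msg["attachments"]),
    ]
    if include_addressees:
        # Addressee strings are used exactly as each platform stored them.
        parts += [msg["sender"], *msg["to"], *msg["cc"], *msg["bcc"]]
    joined = "\x1f".join(parts)  # unit separator keeps fields from running together
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# The same message as collected from two custodians (illustrative values only).
content = {
    "subject": "Q3 forecast",
    "body": "Numbers attached.",
    "date": "2009-06-01T14:05:00Z",
    "attachments": ["forecast.xls"],
    "cc": [],
    "bcc": [],
}
outlook_copy = dict(content,
                    sender="Jane Smith <jsmith@example.com>",
                    to=["Bob Jones <bjones@example.com>"])
notes_copy = dict(content,
                  sender="CN=Jane Smith/OU=Legal/O=Example",
                  to=["CN=Bob Jones/OU=Sales/O=Example"])

# Content-only hashes match; hashes that include addressees do not.
print(email_hash(outlook_copy, False) == email_hash(notes_copy, False))  # True
print(email_hash(outlook_copy, True) == email_hash(notes_copy, True))    # False
```

With addressee fields left out, the two copies are recognized as duplicates; with them included, the same e-mail looks like two different messages.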
Many, many companies use some flavor of both Outlook and Lotus Notes for some, if not all, of their custodians. The upshot is that, under such circumstances, it’s quite easy to attempt to deduplicate e-mails across custodians and fail. I’m therefore leery of buying into the blanket statement that producing too many copies of the same e-mail is necessarily an ethical violation.