We recently had a briefing that referenced the problem of dark data in the enterprise. The vendor offered the estimate that 85% of enterprise data was ‘dark’. This particular company’s definition of dark was that the data did not show up in a database, but generally speaking, ‘dark data’ is data that exists in a format that is not easily usable or scalable for the enterprise.
The point of sharing that statistic is not to quibble with its accuracy, but rather to probe the general mindset around dark data. We live in a world where data is increasingly being aggregated in new and insightful ways; a world where data can serve as a competitive moat; a world where data drives real business value. I believe all those things to be true, and yet I’m unconvinced that dark data is the trove of unlimited potential it’s sometimes made out to be.
The examples in the presentation were about understanding the content of documents, with a particular focus on loan documents and legal contracts as use cases. The vendor’s goal was to apply automation to create systematic visibility into the contents of those documents.
As it turns out, in a past life I worked fairly extensively with loan documents and contracts. While my experience is admittedly anecdotal, here is what I saw of dark data in real estate contracts.
In my first office job, I spent a summer in college as a loan processor. My job was to funnel paperwork back and forth between the borrowers who wanted to secure mortgages on their homes and the underwriters at the bank who made the decision on whether to lend. The borrowers provided a litany of documentation to verify their creditworthiness, and (assuming all went well) the underwriters provided the final loan documents.
As I pushed all this paper around (mostly with fax machines!), one of the things that stood out to me was how standardized the final document was across borrowers. The borrowers and collateral changed, the loan amounts varied, and the interest rates fluctuated based on all the above. The pieces of information that were unique to any specific loan were a small portion of the overall loan document; the majority of the document was composed of static, standardized terms.
To the enterprise, the data that matters for day-to-day decision making is not the boilerplate terms of the document. It’s the unique parts of the loan that are interesting for analysis (e.g. “let’s slice and dice how much we’ve lent based on various borrower characteristics”), and the pieces of information necessary for that analysis already exist in a database as part of the underwriting process. This isn’t to say that all that data is being used as efficiently and effectively as it could be, but it’s definitely not ‘dark.’
In this instance, the data that matters is already part of an existing data system, and it’s unclear whether there’s any value to be gained in trying to extract additional visibility from the boilerplate text.
I don’t doubt that there are use cases in which the enterprise would benefit from having more insight into data that it currently can’t access easily. That said, I’m not convinced that there is value in all dark data. Some data is dark for a reason.