What Is Dark Data & Why Should You Care?

February 16, 2017, by Sid Probstein, CTO & VP Professional Services, AI Foundry  

It’s everywhere, even if we aren’t immediately aware of it, and it isn’t hard to find. Open Windows Explorer and connect to your organization’s big file share, or maybe a NAS box or your home directory cross-mounted into your local file system. Dark Data is the files you’ll find there, in all the folders and sub-folders. There may be terabytes or petabytes of them and, depending on your organization’s size and industry, millions of files.

Is your organization tired of paying for your own storage infrastructure? Are there other priorities and activities IT should be focused on? Does the organization want to take advantage of cheaper Cloud storage?

It’s not that easy because of this Dark Data. You probably can’t move fully to the Cloud, because migrating all those files would create a variety of issues, including:

  • Risk/liability: some of the files almost certainly contain sensitive information, such as employee, customer or partner PII, PCI or PHI data, trade secrets, or critical IP.
  • Findability: moving the glut of files to a Cloud store won’t make it any easier to find good, useful, and/or current material.
  • Cost: paying a monthly storage fee versus an amortized amount for a server you purchased 3-4 years ago will be an eye-opener. Unless you are sure you need all those files, moving all of them will increase your storage cost, even if it’s ultimately a manageable amount.

Steps & Solutions
There are a handful of steps to take to assess your Dark Data situation and figure out what to do about it. For starters, you need to get some metrics: the number of servers, the number of users, the total number of files, the total volume of those files, etc. If you can write a script to summarize the files by date, this may be a good way to get a sense of how much manual review you might have ahead of you. Your organization may consider putting a few simple rules in place to simplify the process. For example, a rule could say “migrate all files less than 7 years old.” From there, you may have to manually sample the files and construct a taxonomy of document types, such as “Budget Proposal” or “Engineering Checklist” as opposed to “Word Document”, then share the taxonomy and these rules with others who will be assisting in the classification effort.
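As a rough illustration of the “summarize the files by date” script mentioned above, here is a minimal Python sketch. It makes several assumptions that are not part of the original discussion: it uses each file’s last-modified time as its “date” (which a copy or migration may have reset), it treats a year as 365.25 days, and the function name summarize_by_year and its parameters are hypothetical. The 7-year threshold mirrors the example rule.

```python
import os
import sys
import time
from collections import Counter

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # approximation, good enough for a rough cutoff


def summarize_by_year(root, max_age_years=7, now=None):
    """Walk `root` and tally files by last-modified year.

    Returns (files_per_year, bytes_per_year, recent_count), where
    recent_count is the number of files modified within the last
    `max_age_years` years.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_years * SECONDS_PER_YEAR
    files_per_year = Counter()
    bytes_per_year = Counter()
    recent_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            year = time.localtime(info.st_mtime).tm_year
            files_per_year[year] += 1
            bytes_per_year[year] += info.st_size
            if info.st_mtime >= cutoff:
                recent_count += 1
    return files_per_year, bytes_per_year, recent_count


if __name__ == "__main__":
    # Usage: python summarize.py /path/to/share
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    files, sizes, recent = summarize_by_year(root)
    for year in sorted(files):
        print(f"{year}: {files[year]} files, {sizes[year] / 1e9:.2f} GB")
    print(f"Files under 7 years old: {recent} of {sum(files.values())}")
```

Run against a mounted share, the per-year counts give a first estimate of how much material falls outside a simple age rule and would therefore need manual review.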

There are three main approaches to making sense of Dark Data using text analytics: 

  • Keyword/phrase indexing
  • Natural language processing (NLP) 
  • Visual/content intelligence (VCI)

Of most interest is visual/content intelligence, which produces high-quality results by exploiting the frequent presence of templates in enterprise documents. Such templates cannot be detected by text-only analysis, which often starts by discarding formatting and images. VCI samples or detects the non-text clues in the documents, such as lines, shapes, boundary boxes, bar codes and photographs. When the visual clues are combined with text clues, they provide a decisive advantage for clustering, classification and detecting fielded information, for example in documents that contain forms or charts.

Download White Paper

To explore this further, we have just published a white paper entitled “Making Dark Data Smart with the Power of Visual Classification”, which explains how combining visual and text clues creates a superior understanding of data.

Learn More! 

AI Foundry’s Actionable Intelligence Management Solutions help find and organize dark data to maximize its value to an organization. Review the Five steps to turn your dark data operational.
