November 9, 2016 by Pete Johnson, Director of Product Management
We’ve all seen statistics and charts showing how fast corporate data is growing. There seems to be no limit to the size of big data, or to the rate at which it grows. Most charts show accelerating growth, typically in an exponential fashion. And these massive collections contain large amounts of dark, structured and unstructured data. What does all of this mean?
Big data can have several meanings, but typically it refers to a specific set of data that is becoming too big to easily analyze. For instance, a large retailer may collect large amounts of data about what products are selling at what prices and in what areas. For a global retailer such as Walmart, this can be an astounding amount of data, making it difficult to analyze. Or imagine all the video surveillance data that a large company with multiple facilities may have. Where do you start in trying to look for a specific event?
In a recent report, Gartner estimates that 20% of enterprise data is mission critical, 30% is redundant, and 50% is of “indeterminate” value.1
But the “big data” term often gets applied to the entirety of a corporation’s data. The implication here is not that a single set of data is difficult to analyze, but rather that the total collection of various data is becoming difficult to manage. The collective stores of ERP, CRM, email, attachments, file shares, SharePoint sites and cloud data make it impossible to know where everything is and whether it is useful.
A big trend in corporate data analysis is to “operationalize” it, meaning that data should only be kept if it's going to be used. Saving years of sales data may satisfy local regulations, but if you're not analyzing that data constantly to better refine your sales strategies, then you are missing a big opportunity.
Which leads us to dark data. How can you analyze something if you don’t know where it is? Dark data is any data that is not explicitly identifiable. With dark data, you don’t know what it is or where it is, and you are not sure how to find it.
The key to making sure that data doesn’t become dark is to ensure that it is properly categorized the moment it is generated or arrives in your corporation. For years, companies have deployed ECM or CMS systems in an attempt to get control over this situation. But typically the systems are avoided by employees because they are too hard to use, or else they only get applied to certain data and document types. As a result, most of the data they are supposed to manage quickly becomes dark and forgotten.
So corporations have accumulated, and are still accumulating, massive amounts of dark data. It is estimated that as much as 80% of corporate data is considered dark data. When a document is not structured in such a way that the data can be organized in a pre-defined manner, it's considered unstructured. On the other hand, structured data contained in databases and spreadsheets is usually much easier to understand. Because such documents have components like field names and descriptions, structured data tends to be less dark. Of course a miscellaneous spreadsheet labeled “sales data” could be unknown, but it’s not too hard to quickly figure out from the data fields what it is referring to.
By contrast, it is much more difficult to figure out what an unstructured PDF document labeled “Contract1” actually is. Documents such as these are considered unstructured because there is no connection between the data they contain and anything else. In a document that is mostly text with no fields or boxes, the important information it contains could be pretty much anywhere, making it very difficult to find.
And without metadata tags such as “Company Name”, “Contract Type” and “Date Signed”, it is pretty hard to figure out what to do with these types of documents. It's even more difficult to determine whether a given document is the original, a duplicate, or a near duplicate. While the “date modified” information can provide some insight, it is not always accurate.
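To make the duplicate question a little more concrete, here is a minimal Python sketch of one way it could be approached once the text has been extracted from the documents. It is only an illustration, not a description of any particular product: the function names, the whitespace normalization and the 0.9 similarity threshold are all assumptions chosen for the example.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(text: str) -> str:
    """Return a SHA-256 digest of the text, ignoring case and extra whitespace."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_pair(a: str, b: str, near_threshold: float = 0.9) -> str:
    """Label two document texts as 'duplicate', 'near duplicate' or 'distinct'.

    near_threshold is an illustrative value, not a recommended setting.
    """
    # Identical normalized content means the two texts are exact duplicates.
    if content_hash(a) == content_hash(b):
        return "duplicate"
    # Otherwise fall back to a character-level similarity ratio (0.0 to 1.0).
    ratio = SequenceMatcher(None, a, b).ratio()
    return "near duplicate" if ratio >= near_threshold else "distinct"
```

Comparing content rather than the “date modified” value means a copied file with a fresh timestamp is still recognized as a duplicate. Comparing every pair of documents this way would not scale to a large archive, so treat it as a way to think about the problem rather than a ready-made solution.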
The goal of course is to get control of the big data in your company, light it up so that it is no longer dark, and be able to efficiently handle both structured and unstructured data. For more ideas, please visit our website page about how you can light up your dark data.
1 Gartner (June 2015) “Organizations Will Need to Tackle Three Challenges to Curb Unstructured Data Glut and Neglect.” Deb Logan and Alan Daley.