October 3, 2016 by Pete Johnson, Director of Product Management
Enterprises of all kinds are rapidly moving data, applications, and processes to the cloud. The benefits (and costs) of doing so have been well discussed, but moving applications and processes can take a long time. One strategy to ease this transition is to move your data to the cloud first and follow with applications later. This lets you transition your legacy applications and IT infrastructure gradually.
But migrating even just the data can be daunting. How do you know where everything is? Which data is important and which is not? How much data do you have to move? All these questions require an understanding of your data, yet industry analysts tell us that as much as 90% of enterprise data is “dark” or unstructured [1]. This makes determining what it is, and whether we need to migrate it, very difficult.
One strategy for understanding more about your data is sampling: taking a small slice of data and analyzing it in detail. But this requires that you sample a representative piece of your data. And if most of your data is dark, how do you ensure an objective sample?
What is really required is an exhaustive analysis of all the data targeted for migration.
Doing so will answer the questions above and give insight into exactly how much data, and which data, is to be migrated. It will also help you understand how much storage you will need in the new environment, so you can properly set expectations as to cost. But if your data includes millions of documents taking up terabytes of space, that can’t be done manually. You need to automate the process.
What meets this requirement is a solution you can point at a set of data stores that will automatically crawl and classify (or categorize) the documents under consideration.
As the system crawls the various repositories, it analyzes every file it finds and captures basic information about each one.
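The post doesn’t name a specific product, but the crawl-and-capture step can be sketched with Python’s standard library. This is a minimal illustration assuming a local filesystem as the data store; the `crawl` function and the metadata fields it records are hypothetical, not part of any particular tool.

```python
import os
import time

def crawl(root):
    """Walk a repository and capture basic metadata for every file found."""
    inventory = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            inventory.append({
                "path": path,                                    # where it is
                "extension": os.path.splitext(name)[1].lower(),  # what type it is
                "size_bytes": info.st_size,                      # how big it is
                "modified": time.ctime(info.st_mtime),           # how recent it is
            })
    return inventory
```

Summing `size_bytes` over the inventory is what lets you estimate storage needs in the new environment; in a real system this walk would also fan out across network shares and content repositories rather than a single directory tree.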
Once this basic data capture is done, the system can perform advanced analysis where the captured information supports it. You could have it disregard all files of certain specified types, for instance. For the files that do meet your criteria, the advanced analysis attempts to assign a classification to each document.
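A simple way to picture this filter-then-classify step is a rule-based pass: skip unwanted file types, then match document text against category rules. The skip list, the categories, and their keywords below are all hypothetical examples, assumed for illustration only; a production system would use far richer classification logic.

```python
import os

# Hypothetical rules: file types to disregard, and keyword-based categories.
SKIP_EXTENSIONS = {".tmp", ".log", ".bak"}
CATEGORY_KEYWORDS = {
    "invoice": ["invoice", "amount due", "remit to"],
    "contract": ["agreement", "hereinafter", "terms and conditions"],
}

def classify(filename, text):
    """Return a category for a document, or None if it should be disregarded."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in SKIP_EXTENSIONS:
        return None  # fails the file-type criteria; skip advanced analysis
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unclassified"  # met the criteria but matched no known category
```

The `None` / `"unclassified"` distinction matters downstream: disregarded files can be dropped outright, while unclassified files that passed the filters may still deserve review.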
As the system recognizes files, they can be dispatched as you see fit.
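The dispatch step can be sketched as a mapping from classification to disposition, following the destinations the post itself names: the ECM repository, archival, deletion, or storage for further review. The `DISPOSITIONS` table and category names here are assumptions for illustration.

```python
# Hypothetical disposition rules mapping each category to a destination.
DISPOSITIONS = {
    "contract": "ecm",         # relevant records go to the ECM repository
    "invoice": "ecm",
    "unclassified": "review",  # hold for human review
    None: "delete",            # disregarded files can be deleted (or archived)
}

def dispatch(documents):
    """Group classified documents into per-destination work queues."""
    queues = {"ecm": [], "archive": [], "review": [], "delete": []}
    for doc in documents:
        destination = DISPOSITIONS.get(doc["category"], "review")
        queues[destination].append(doc["path"])
    return queues
```

Defaulting unknown categories to the review queue is a deliberately conservative choice: nothing is deleted unless a rule explicitly says so.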
With this type of analysis, the computer does most of the data-analysis work for you. A system that can analyze thousands of files per hour, and scale to much higher throughput, can work through millions of documents in a very reasonable period.
With the process completed, you now have a complete sorting of all your data: the most relevant data goes into your current ECM repository, on premises or in the cloud, while the rest is handled as you decide: archived, deleted, or stored for further review. A system like this will ultimately get you prepared for migrating your data to the cloud.
[1] EMC Digital Universe with Research & Analysis by IDC, Vernon Turner, April 2014