Tapping into unstructured data yields insights that may not be available otherwise
The pollsters are finding this one difficult to live down, including the legendary Nate Silver, who called both of Barack Obama’s elections with near-perfect state-by-state accuracy. That is the nature of this game, one might say. But listening to all the post-poll commentary has thrown up one very intriguing aspect of data analysis.
A common lament currently doing the rounds among the prediction pundits and media gurus is that they should have “talked to more people”, that they should have “been there, on the ground” rather than just analysing poll projections while sitting in Washington, DC.
And that’s not because polls are flawed. Prediction based on polls is a very precise data science.
In fact, Nate Silver’s website fivethirtyeight.com shot to fame in 2008 for its very accurate election predictions, because his was the first attempt to dig and mine county-level voting data in the US in the past 40 years. He dug deep into underlying demographic and voting patterns to build his predictive data model.
What the media pundits are alluding to is something else. It has to do with another form of data or information, one that is neither well understood nor widely recognized. This is the world of unstructured data.
It’s the information available to us in our verbal interactions, in emails, in social media posts, website pages, videos. This data is not organized in neat tables, Excel sheets or reports. And that’s why it is not very easily accessible to analysis.
The world of unstructured data is the missing piece in the poll projections that the analysts are alluding to, and they are making a very important point. This voting-related data is highly contextual; it is not structured. So it was tough to analyse. But, as the results showed, it had valuable information content.
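To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the poll numbers and voter comments are invented, and the topic list is a hand-picked stand-in for what a real text-analysis system would have to learn. It only shows why a table can be summarized in one line while free text needs an extra step of deciding what to look for.

```python
from collections import Counter
import re

# Structured data: every row has named fields, so aggregation is one line.
# (Hypothetical numbers, for illustration only.)
poll_rows = [
    {"county": "A", "candidate_x": 52, "candidate_y": 48},
    {"county": "B", "candidate_x": 47, "candidate_y": 53},
]
avg_x = sum(row["candidate_x"] for row in poll_rows) / len(poll_rows)

# Unstructured data: free text has no fields, so we must first decide
# what to look for. (Invented comments, for illustration only.)
comments = [
    "Nobody asked us about jobs. The factory closed last year.",
    "I don't trust the polls, and I am worried about jobs.",
    "Healthcare costs keep going up.",
]

# Assumed topics of interest, chosen by hand.
TOPICS = {"jobs", "polls", "healthcare", "factory"}


def topic_counts(texts, topics):
    """Count how many texts mention each topic at least once."""
    counts = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words & topics)
    return counts


counts = topic_counts(comments, TOPICS)
print(avg_x)
print(counts.most_common())
```

Even this crude keyword tally surfaces a theme, jobs, that the poll table has no column for. That is the kind of context the pundits felt they had missed.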
The unstructured data context
There is a whole new data science related to semantic algorithms and machine learning that is trying to solve this hard problem. And there is rapid progress being made. We already see its value in our lives in instances such as the recommendation engines on e-commerce sites that tell us what books, clothes, gadgets, etc. to buy.
But this is a work in progress. More importantly, till we get to a stage where the algorithms have nailed this problem, we cannot deny ourselves the value of unstructured data in our lives.
Seth Grimes, a leading industry analyst who operates in the overlapping domains of unstructured and structured data, estimates that nearly 80% of business-relevant information is to be found in unstructured formats, primarily text.
That’s an enormous amount of untapped value.
The good news is that there is a wide variety of tools available that help us access some bits of this unstructured information. But it will be a while before this is considered a solved problem.
In the meantime, I think we can do more to help ourselves. For that, we need to solve one problem.
How to connect with unstructured data
Let me illustrate this through the problem of daily work management.
I use a bunch of tools for a variety of reasons: to manage and organize my work; to collaborate and communicate with others; to assess outcomes and take decisions, and so on. To do all this, I use Outlook, Excel, Drive, Evernote, Trello, Podio, Any.do and Keep—and then a bunch of analytics tools and content tools that are specific to our business.
So, the information I work with, on a daily basis, is dispersed across these tools. And there is no denying that I need each one of those applications because they are all good in their specific context. Drive is an excellent way to access documents on the go, while Podio is excellent for project workflow management. Keep is good for documenting ideas, while Any.do is fabulous for organizing daily tasks. And so on.
Yet, most of the data within these applications is unstructured. Not only that, most of these applications don’t talk to each other. Some do, but what I need to enhance my productivity is for every application to talk to every other tool. The context for my daily tasks draws on information stored in quite a few of these applications.
How do I tie all of this information together?
Build ‘meta data’ connectors
I use Evernote as a bridge across a bunch of the applications—Drive, Outlook, Keep and Any.do, specifically.
Evernote is primarily a document storage application, where you can tag files in ways useful to your context. But it is also an excellent tool to create meta documents that “store”, or link, documents within and across applications.
By virtue of being able to link all the documents on a topic together, I am also able, to some extent, to tie together the information within those documents. Yes, I created that meta information manually, and it is customized to my specific context.
The point is that it is possible to link unstructured data together, and to link it with structured data. How we do it, and which tools help us do it, is very context-specific.
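The idea of a meta document can be sketched as a small topic index in Python. This is only an illustration under stated assumptions: the `Item` and `MetaIndex` names, the topic label and the placeholder links are all hypothetical, and the sketch does not use or represent the Evernote API or any other tool's API. It simply shows a hand-maintained list of pointers, grouped by topic, to things that live in different applications.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A pointer to something stored in another tool (all fields hypothetical)."""
    tool: str   # e.g. "Drive", "Outlook", "Keep"
    title: str
    link: str   # placeholder location, not a real URL scheme


class MetaIndex:
    """A hand-maintained index that groups items from different tools by topic."""

    def __init__(self):
        self._by_topic = defaultdict(list)

    def link(self, topic, item):
        """Attach an item to a topic, skipping exact duplicates."""
        if item not in self._by_topic[topic]:
            self._by_topic[topic].append(item)

    def items(self, topic):
        """Return everything linked to a topic, across all tools."""
        return list(self._by_topic.get(topic, []))

    def tools(self, topic):
        """Return the set of tools that hold something on this topic."""
        return {item.tool for item in self._by_topic.get(topic, [])}


# Made-up entries for one topic, spread across three tools.
index = MetaIndex()
index.link("Q3 campaign", Item("Drive", "Campaign brief", "drive://brief"))
index.link("Q3 campaign", Item("Outlook", "Client feedback email", "outlook://msg-1"))
index.link("Q3 campaign", Item("Keep", "Headline ideas", "keep://note-7"))

for item in index.items("Q3 campaign"):
    print(f"{item.tool}: {item.title}")
```

The index holds no content of its own. Its value lies entirely in the manually curated links, which is why it stays specific to one person's context.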
We can, of course, wait for better tools that will do this job for us automatically—and that will happen eventually. But why deny ourselves the value from digging deeper into unstructured data? It just makes for better insights and better decisions in our daily lives—as the US election pundits have found out to their dismay now.
An alumnus of the Indian Institute of Technology, Kanpur, Nitin Srivastava is co-founder and CEO of MindWorks Global. He is currently leading the company’s transformation into a data-driven, content marketing company, which specializes in solutions for technology/IT, auto, finance and lifestyle verticals.