How much data do you really need?

Data stewards continue to argue in favour of providing unrestrained access to users to all the data—as the best means for gaining value from analysis


Photo: AFP

With the advent of Big Data and the advancement of technology to store it, are we justified in holding large quantities of it? What are the drawbacks or benefits of doing so and how can organizations find the right balance?

More than a decade back, a large telecom organization was discussing a strategy with its outsourcing partner. The topic was: How much data should its data warehouse hold—three, four or six months’ data? Holding more data meant more expenditure. And even though it meant more business value, the strategic outsourcing partner prevailed and the “usage and retention” team at the telecom operator had to settle for three months.

How would this question be settled today?

We all have read multiple forecasts telling us how the data volume is growing at an unprecedented pace. We have also heard about the three Vs of data (if not, here they are—variety, velocity, volume) fuelling this growth. Digitization continues to be one of the biggest drivers. The Internet of Things is another. The list goes on. A variety of application areas continue to fuel consumption of this data—customer management, product development, marketing spends, risk management and the like. Data stewards continue to argue in favour of providing unrestrained access to users to all the data—as the best means for gaining value from analysis.

How should organizations face this data deluge? Doesn’t storing more data mean more cost? While hardware costs have been falling, the same cannot be said for data costs. Unless you are using open-source software, your software costs are not going down either. Moreover, there are people costs and other overheads. Then again, what if you store all the data but don’t make use of it? Conversely, if you don’t store the data, you miss out on those million-dollar insights. So how does one solve this quandary?

Let’s start with the easiest part—what data needs to be stored. The data you keep is directly driven by the use case—if you want to understand customer behaviour, you need to store transactions, profiles, previous purchases and the like. But if you want to do profitability analytics, you will need finance, revenue and cost data.

How much you store is also driven to a certain extent by the industry you are in. Statutory reporting (such as Basel II) and fraud analysis will typically require five to seven years of data, whereas customer cross-sell will require one to three years. For most customer analytics, companies in the financial services sector will typically store three years’ data, retail companies two to three years’ data, and telecom service providers less than a year’s data.

The next task is to decide whether you need to store unstructured data, for example, Web logs, chats, text, social media and voice. If your industry is going digital or you are focusing more on online channels, expect the answer to be “yes”.

Finally, bear in mind that not all data is created equal. Some data is accessed more routinely than others. What portion of your data is expected to be accessed very frequently, what irregularly and what only once in a while? In the telecom industry, CDRs, or call data records, are the lifeline of the business. However, their usage can be starkly different across teams. While the usage and retention teams will ask for 90 days of CDR data at a high velocity for multiple types of customer analysis, the statutory requirements team will need to store more than five years’ data but access it infrequently, whereas the revenue management team may want it for one to three years and query it moderately often. Each industry will have its own versions of this scenario.
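
One way to picture the telecom example above is as a simple tiering rule keyed on the age of a record. The sketch below is only illustrative: the tier names ("hot", "warm", "cold"), the function name and the exact cut-offs are assumptions made for this example, loosely based on the 90-day and one-to-three-year figures mentioned above, not a description of any operator's actual policy.

```python
from datetime import date

# Hypothetical tier boundaries, loosely based on the telecom example:
# about 90 days of frequently accessed CDR data, up to three years of
# moderately accessed data, and anything older kept for statutory needs.
HOT_DAYS = 90
WARM_DAYS = 3 * 365


def storage_tier(record_date: date, today: date) -> str:
    """Return an illustrative storage tier for a record based on its age."""
    age_days = (today - record_date).days
    if age_days <= HOT_DAYS:
        return "hot"    # frequent access, e.g. usage and retention analysis
    if age_days <= WARM_DAYS:
        return "warm"   # moderate access, e.g. revenue management
    return "cold"       # infrequent access, e.g. statutory retention
```

In practice the boundaries would be set by each team's access pattern and by the cost of each storage layer rather than by fixed constants.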

Based on the answers to the above three questions, you can choose among the following options to host your data environment: general-purpose transactional database management systems, or DBMSs (Oracle, MS SQL), along with specialized analytics DBMSs (Teradata, IBM Netezza) and/or open-source platforms (NoSQL DBMSs, the Hadoop file system).

If your organization is moving towards digital transformation, you will more likely than not end up with a loosely integrated mix of all of these as your data environment, called the data lake. Welcome to the new data foundation to support the digital organization.

The author is co-founder and chief operating officer of analytics solutions provider Kloutix Solutions.
