The terrible price we pay for hiding public data

A seemingly never-ending pandemic has produced many haunting images that will stay with us for a long time. One such image is that of the migrant worker trekking back to her native place. Three years before the ‘long walk home’, a study in India’s pre-budget Economic Survey claimed that the actual number of migrant workers may be roughly double that of census estimates. The study argued that social security benefits should be portable across states to provide protection to such workers.

The study was based on a novel dataset of unreserved passenger traffic between every pair of railway stations in India, and it immediately caught attention. But soon enough, prominent academics raised questions about the study, arguing that its estimates could not be taken seriously. Had the ministry released the raw dataset behind the study along with the replication codes (which allow researchers to verify calculations), this debate could have been settled conclusively. In all likelihood, it would have catalysed greater government attention towards inter-state migrants well before the pandemic hit us.

Opening up this administrative dataset would also have thrown greater light on the nature of short-term migration in India. Given that our census is conducted only once in ten years, it is easy for this decennial exercise to miss short-term or circular migration flows. Railway data offered a chance to learn more. Sadly, it was kept under wraps.

The opaqueness of government departments on data is not really surprising, but it amounts to a colossal public loss. The cost of putting up such datasets on a public platform (and in a machine-readable format) is practically zero. The benefits are staggering. Yet, years after India initiated an ‘open data’ policy, openness around public datasets is more the exception than the norm. Unlike in the West, we have failed to develop a statistical ecosystem where administrative datasets can be scrutinized by independent researchers, and then deployed for policymaking.

The rail data was just one of many such missed opportunities. Consider the digitized records maintained by the Employees' Provident Fund Organisation (EPFO) of employees receiving provident fund benefits in this country. The EPFO could one day become a valuable resource to track the movement of people in and out of formal jobs across different sectors. The government too has been keen on using EPFO data to track such flows. Yet, a lack of respect for basic data norms during the EPFO data-mining exercise has made it a cautionary tale in the annals of India's statistical system.

To make sense of the EPFO data that it had obtained from the labour ministry, the Niti Aayog invited two economists to examine it. This was the original sin. In most mature democracies, a public agency would either have published the entire dataset for everyone to use or have invited researchers through a transparent process to send in their proposals to mine the dataset for research. The cloak-and-dagger approach of the Aayog raised suspicions that were only strengthened when the ‘selected’ researchers suppressed uncomfortable truths about the EPFO data in their study. Somesh Jha, a financial journalist, found out that critical findings of the study on the incompleteness of EPFO records were part of a presentation made at the Prime Minister’s Office, but were omitted from the published version of the study. The resulting uproar meant that EPFO numbers became suspect in the eyes of serious researchers.

Perhaps the most egregious example of the use of unverified administrative data relates to India’s gross domestic product (GDP) calculations. An untested database of the ministry of corporate affairs, MCA-21, was plugged into the national accounts system in 2014-15 despite the misgivings of an independent expert. This created a big controversy. If only the government had opened up the MCA-21 dataset, suspicions could have been nipped in the bud. Both the statistics ministry (which publishes the GDP data) and ministry of corporate affairs said that the other should take a decision on this, and the dataset remains hidden today.

In a country with an empowered statistical regulator, one would have expected it to intervene in such matters and lay down clear norms for data sharing and accessibility. It would have conducted periodic audits and forced recalcitrant departments to open up their datasets for public use. But India’s beleaguered National Statistical Commission (NSC) is severely under-equipped to perform such a role. The lack of statutory backing and independent funding for the NSC has meant that it is controlled by the statistics ministry rather than the other way round (see ‘How India’s statistical system was crippled’, Mint, 8 May, 2019).

Many of India’s big administrative datasets are flawed and biased, but they can be made usable over time. Simply opening up these databases will ensure that errors and inconsistencies are quickly identified. Transparency will lead to accuracy, and raise public confidence.

Not every citizen can scrutinize all the datasets out there. But so long as an open and transparent process exists to share our public data trove, we can all rest assured that there will be people at work to scrutinize them. Till such a process is established, we need to be cautious about the use of administrative data in policymaking, even if it is sold to us as ‘big data’ or ‘real time’ analysis. Any dataset, big or small, should face public scrutiny before it is accepted as a credible input for public policies.

Pramit Bhattacharya is a Chennai-based journalist. His twitter handle is pramit_b
