India can achieve tech goals without raising bogey of data, AI colonisation

In March, the Union Cabinet approved the national-level IndiaAI Mission with a budget outlay of ₹10,371.92 crore. (Image: Pixabay)


  • Advocates of data sovereignty are critical of tech companies like OpenAI using Indian data, only to sell insights back at high prices
  • However, global tech companies have also contributed to India's growth by setting up their centres of excellence

Ola founder Bhavish Aggarwal recently expressed displeasure at big tech companies like OpenAI using data from India to pre-train their large language models (LLMs) and profit from it, describing the trend as a form of data and AI “colonisation.”

According to Aggarwal, while India generates 20% of the world's data, only a fraction of this is stored within the country. He elaborated, "...20%, because we are 20% of the world's population..." 

He went on to criticise the practice of exporting Indian data to global data centres only to have companies like OpenAI process it and sell the insights back to India at high prices.

He also drew a historical parallel, asking rhetorically, "Does this sound similar to something that happened 200 years ago? Yeah, East India Company."

As an entrepreneur with a degree in computer science and engineering from IIT Bombay, Aggarwal has proved himself in the field. He not only founded Ola Cabs and Ola Electric but also launched an AI venture called Krutrim, which is now a unicorn, or a startup valued at $1 billion or more.

Aggarwal's concerns about “data sovereignty” and the need for India to become self-reliant are valid. And it’s hard to fault him, or anyone for that matter, for arguing that India must produce more domestically to compete with global powers like the US and China.

This includes everything from batteries and vehicles to semiconductor manufacturing plants and data centres, as well as e-planes, rockets, bullet trains, hyperloops, foundation models, and LLMs to rival tech giants Google, Microsoft, Nvidia, and OpenAI.

This is not the first time that someone has voiced concerns over data and AI “colonisation,” which is broadly defined as big tech companies (mostly from the US) using data from emerging economies to enrich themselves.

Reliance Industries chairman Mukesh Ambani voiced this concern back in December 2018 when he said, “In this new world, data is the new oil. And data is the new wealth. India’s data must be controlled and owned by Indian people and not by corporates, especially global corporations.”

On 20 March, Nvidia CEO Jensen Huang raised this question with a section of the media. He said: “Prime Minister Narendra Modi told me that India should not export flour to import bread; this makes perfect sense. Why export raw material, only to import the value-added product? Why export India’s data, only to import AI?”

Data-aware, self-reliant

Urging Indians to feel proud about making more products in India and exhorting them to be confident when dreaming big about building LLMs, electric vehicles, or even semiconductor fabs (as Ambani and Aggarwal did) is an effective prescription.

The country's entrepreneurs began building India-specific LLMs only after C.P. Gurnani, then CEO and MD of Tech Mahindra, and Rajan Anandan, MD of venture firm Peak XV Partners, took offence when OpenAI CEO Sam Altman doubted whether Indian entrepreneurs could develop a generative pre-trained transformer (GPT)-type LLM with just a $10 million investment, a fraction of what OpenAI was spending.

The result: Indian entrepreneurs have already released local language models including Aggarwal's Krutrim, Tech Mahindra's Project Indus, Sarvam AI's OpenHathi series, AI4Bharat, SML's Hanooman series, Sutra series from Two AI, and CoRover's BharatGPT.

In March, the Union Cabinet approved the national-level IndiaAI Mission with a budget outlay of ₹10,371.92 crore, aimed at developing a manufacturing base for graphics processing units (GPUs) in a public-private partnership, and multi-modal domain-specific LLMs. That very month, India took a big step towards achieving self-reliance in electronics when PM Modi laid the foundation stones for three semiconductor facilities in India, worth almost ₹1.25 trillion.

These include a fab by Tata Electronics with Taiwan’s PSMC in Gujarat; an outsourced semiconductor assembly and test (OSAT) facility in Assam, also by Tata Electronics; and an OSAT facility by CG Power in partnership with Renesas.

India is advancing rapidly in space and technology. Beyond ISRO's space missions, Skyroot launched India's first private rocket in 2022, and Agnikul Cosmos launched its sub-orbital test vehicle featuring the world's first single-piece 3D-printed rocket engine.

India is making significant progress in electric vehicles, with Tata Motors, Mahindra Electric, Ashok Leyland Electric, Hyundai, Hero Electric, Kia Motors, and startups Ather and Ola Electric Mobility contributing to this growth.

Cut slack on data, AI colonisation

Here are factors to consider in the data and AI colonisation-versus-self-reliance debate:

First, let’s ask ourselves: Must we paint big tech companies as antagonists in the 'Make-in-India' film to drive home our concerns over data and AI colonisation? Especially when many big tech global companies continue to contribute towards generating employment in the country and put India on the global research and development (R&D) map with their global capability centres (GCCs)?

Also, experience reveals that pure-bred Indian companies too can manipulate and profit from user data stored within Indian shores.

Second, India's stand on data localisation itself has grey areas. The Digital Personal Data Protection (DPDP) Act, 2023, which was notified last year, eased the stance on cross-border data transfer restrictions by adopting a “blacklist” approach.

This, some experts argue, could lead to a shortage of high-quality datasets that are crucial for AI R&D. Any sector-specific law will supersede the DPDP Act when it comes to "banning" geographies. Reserve Bank of India regulations, for example, require that the personal data of payment systems be stored in India.

Third, there is a rapid rise in the number of local data centres in India. Yet, as of March 2024, the US hosted more than 50% (5,381) of the world's little over 10,000 data centres, according to Statista, while India had 163, including those set up by Microsoft, Google, NTT Data, and Amazon Web Services.

Fourth, global tech companies have contributed to India's tech growth by setting up their centres of excellence (COEs) in India and investing millions of dollars in setting up incubators and accelerators. They are partnering with educational institutions and Indian startups too.

By 2025, India is expected to have over 1,900 GCCs employing 2 million people. By 2026-27, Nasscom forecasts the number of GCCs in India to exceed 2,000.

Fifth, while Aggarwal may disagree, it would take billions of dollars and years to build the single entity that he envisions: one that can train foundational models and LLMs, design custom AI chips, run an AI cloud infrastructure, and build a developer platform too.

OpenAI’s GPT was in the works for more than six years, cost upwards of $100 million and used an estimated 30,000 GPUs. Krutrim is a separate business, according to Aggarwal, for which he raised $50 million from Matrix Partners India.

Sixth, many Indian startups that are developing local language models are building them atop foundation models such as OpenAI's GPT, Meta's LLaMA, Google's Gemini, and Anthropic's Claude, to name a few (hence the term "wrappers"). Building a foundation model, as indicated above, is expensive. Thus, they too are capitalising on the very data that big tech companies would have used from India.

Besides, there's a shortage of GPUs (a single Nvidia H100 GPU can cost about $25,000) to run these AI models, and Indian companies have to rely on global tech companies to provide them, even if they run much smaller language models. Moreover, funding for Indic language models is not easy to come by.

Seventh, as suggested by Mihir Kaulgud in an October 2022 paper published by the Social and Political Research Foundation, if India wants to combat data colonialism, it should focus on both the state’s sovereignty and the people who generate data, thus “being attentive to data’s social characteristics and moving the foundations of policy away from the 'data as resource' metaphor.”

Quoting ‘Svensson, J., & Poveda Guillen, O. (2020). What is Data and What Can it be Used For?’, he explained that the data generated by “highly networked societies” has to be “captured, quantified, and processed.”

“All of these practices give the data a particular shape, suggesting that data is deeply cultural and infused with societal norms and values,” Kaulgud argued.

Wrapping up

India’s tryst with semiconductors started in the 1960s. It failed many times but did not give up. The current semiconductor projects should provide the building blocks for local microchip-making and the critical semiconductor value chain of design, fabrication, assembly, testing, marking and packaging.

The new investments will manage to make chips of only 28-40 nanometres, while sophisticated plants globally have moved on to 2-3nm. Yet, here too, India is drawing on the expertise of global tech companies, which is a sensible strategy.

On the data front, too, while India may generate 20% of global data, how much of it is good quality data? Besides, foundation models and LLMs of OpenAI, Meta and Google have been pre-trained on data that is 60-70% in English.

Further, many of the 22 official Indian languages do not have digital data, which makes it challenging to build and train AI models with local datasets. Bhashini, a unit of the National Language Translation Mission, has so far spent $6-7 million to collect data from different sources.

Ironically, Google India has been working with the Indian Institute of Science (IISc) on Project Vaani, which will gather speech data across India. So has Microsoft, which has been helping with Indic languages, its partnership with Sarvam AI being a case in point.

The hope now is that the government's proposed unified data platform, the IndiaAI Datasets Platform, will provide Indian startups and researchers access to quality non-personal datasets.

Simply put, public-private partnerships, investments, and global cooperation, even with big tech companies, should be able to advance India's AI resolve without the need to raise the bogey of data and AI colonisation, which is, to say the least, not geopolitically right.
