Date: 2025-12-17
NEW DELHI: As Big Tech companies look to expand beyond English-speaking users, developing artificial intelligence (AI) expertise across a wider range of languages is emerging as the next major hurdle.
Google’s latest effort has seen the company embed 29 Indic languages and dialects into Gemini, its foundational AI model, as it looks to scale AI adoption across India’s hinterlands, three years after it began work in the area. A foundational model is one that can accept user queries and respond in human-understandable languages.
According to Manish Gupta, senior director at the company’s research arm, Google DeepMind, who leads its AI language research in India, proficiency in Indic languages is critical for last-mile applications such as agriculture and healthcare, where AI must interpret region-specific context to function properly.
“For instance, Google’s Gemini powers an agricultural analysis model that is currently being deployed at the last mile, where it is imperative for our AI models to have contextual understanding of local tongues in order to perform with high efficiency," he added.
Google’s agriculture app processes information in local languages to offer soil analysis, crop prediction and more to farmers at subsidized costs through government and private sector companies.
To be sure, by “embedding 29 languages", Google means it is training, fine-tuning and benchmarking Gemini’s performance across these languages. They are currently available for text generation and translation in the public domain, with commercial and pilot deployments depending on specific projects.
According to Gupta, Google’s linguistic depth is also helping it export AI applications developed in India to other regions. “From DeepMind, we exported our locally developed agriculture analysis AI model to other countries in Asia by replicating its performance in India. Now, we’re expanding it to Africa, and will soon bring it to the rest of the world," he said.
The case for regional-language AI
Data from the International Centre for Language Studies showed that as of August last year, the three top Indic languages (Hindi, Bangla and Urdu) were spoken by over 13% of the world's population. Seven of India's top languages are spoken by at least 50 million people each.
Despite this large share of the world's speakers, AI stakeholders point out that Indian languages make up less than 1% of the open internet, leaving AI applications and services open to biases common in English. In India, only about 10% of the population speaks English. Developing regional-language proficiency in AI models is thus critical for Big Tech to make inroads into India’s hinterland.
Further, most companies are riding India's highest-ever smartphone penetration: the country now has nearly 850 million active smartphones, according to data from IDC India.
“Training AI models in Indic languages will directly cater to a large part of the world’s population, as well as India’s," said Jibu Elias, responsible computing lead for India at Mozilla Foundation. “Further, Indic languages provide an ideal template to add nuances of various languages, and given their scale, make for ideal languages to train AI models on—which can then be exported to and implemented in various other countries."
Globally, the competitive landscape is taking shape. As of 30 November, Anthropic's latest foundational model, Claude Sonnet 4.5, was the best-performing AI model in Indic languages, with an average score of 60.7 out of 100. Google's latest Gemini 3 Pro scored 59.9, ranking second. Other popular models include OpenAI's GPT-5.2 at 52.2, and Elon Musk-backed xAI's Grok 4 at 52.
Industry views
Industry experts stressed that linguistic scale must be paired with ethical data practices. “Languages in India extend well beyond just the 22 official ones—there are many, for instance, that have millions of speakers but no script," said Elias, adding that there are also several regional languages and dialects that are nearly extinct.
As per the 2011 Census, India has 22 constitutionally recognised official languages and 99 other listed languages—amounting to 121 languages—alongside more than 19,500 raw dialects.
“While it is imperative for Big Tech to develop AI abilities since only 10% of India speaks English, taking an equitable approach is the key," Elias added. “For this, tech firms must work with stakeholders of communities, instead of data annotation startups that may raise questions of ethics, in order to truly scale AI abilities."
Gupta said Google works with grassroots-level data providers to collect Indic datasets, though he did not comment on commercial arrangements.
Beyond language, Google is expanding multimodal AI applications in healthcare. Next month, a peer-reviewed paper on using Gemini 1.5 to detect diabetes through retina scans will be published in Ophthalmology Science.
“The development of such multimodal AI models is thus imperative, and India is playing a big role in it," Gupta said. “Native understanding of local languages is a key part of this process, as it is only through local language understanding that applications can be brought to users at scale."
The background
Google’s Indic-language efforts began with Project Vaani in December 2022, an open-source initiative to collate publicly accessible datasets.
It is distinct from Gemini, which is a foundational AI model trained to generate natural-language responses from vast datasets.
In July last year, Google introduced IndicGenBench, a proprietary benchmark to evaluate model performance across 29 Indian languages. Google has partnered with Bhashini, an initiative of India’s ministry of electronics and IT, and IISc, Bengaluru, to fine-tune these models.
Bhashini, launched in August 2022, collates free-to-use Indian language data for research, academia and AI development.