Google taps AI to grasp India’s language diversity
Project Vaani would gather speech data across India and use it to create an artificial intelligence (AI)-based language model that can understand diverse Indian languages and dialects.
NEW DELHI: Google India has teamed up with the Indian Institute of Science (IISc) for the Project Vaani initiative that would gather speech data across India and use it to create an artificial intelligence (AI)-based language model that can understand diverse Indian languages and dialects.
The project is part of the Bhasha AI project run by Bengaluru-based IISc and the AI and Robotics Technology Park (Artpark), which also includes SYSPIN (Synthesizing Speech in Indian languages) and RESPIN (Recognizing Speech in Indian languages).
“India’s spoken languages change every few kilometres...machines have no hope. So, research and innovation for inclusive language AI require capturing this diversity in our datasets,” Prasanta Kumar Ghosh, a professor at IISc, who leads these initiatives, said, explaining the reasons for launching Project Vaani.
Google and IISc plan to collect speech samples from 773 districts. The initiative, currently focused on 80 districts across 10 states, is expected to expand to every district over the next couple of years and boost the size and diversity of India’s open-source language data, with over 150,000 hours of curated speech and 100 million sentences of text in Indian scripts. Artpark and IISc simultaneously plan to launch challenges for researchers and startups to build applications in areas such as health, agriculture, and financial inclusion using these datasets.
Manish Gupta, director of Google Research India, said Vaani would be trained on speech and text data from over 100 Indian languages. He said the new model is a leap over Multilingual Representations for Indian Languages (MuRIL), which was a text-only model. The new model supports both speech and text.
“We want to make sure that any language which is spoken by 100,000 people is covered,” he added. MuRIL is a Bert-based language model trained on 17 Indian languages. Bert, or Bidirectional Encoder Representations from Transformers, is a Google-developed machine learning (ML) technique that learns contextual relations between words to build a language model.
Gupta also announced another AI model that will use satellite imagery to offer agriculture-related insights to agritech startups and policymakers and an AI-based optical character recognition (OCR) tool that has been trained to read handwritten medical prescriptions.
Google Research India also announced a $1 million grant for IIT Madras to open a Center for Responsible AI in India. Another grant of a similar amount will be offered to Wadhwani Foundation to support the deployment of AI models.
The new language model is part of a wider Google initiative to build a model for 1,000 global languages, said Gupta. “We want to ensure that Indian languages are front and centre in terms of representation in this model,” he said.
Gupta explained that many Indian languages are relatively low-resource. Models like Bert are built on available web resources, and since Indian languages tend to be less represented there, these models often perform worse on Indian languages than researchers expect. That said, Gupta also cautioned that language models must be handled carefully for the good of society.
He pointed out that models like Language Model for Dialogue Applications (LaMDA) and ChatGPT are prone to hallucination. “They may come up with an explanation that sounds convincing but is actually bogus. Working on the development of these AI models in a responsible manner becomes very important,” he added.
The $1 million grant to IIT Madras to establish a Center for Responsible AI in India is an attempt to bring together researchers from other institutes and other fields like social science and law. “A lot of research on responsible AI has been done in a western context. In India, there are additional dimensions of bias based on region and caste. It is important that we study all these biases in the Indian context and keep them in mind while developing these AI models,” he added.
Similarly, the AI model for agriculture is an attempt to solve many of the problems in the sector by applying AI models to satellite imagery. “Our work focuses on a combination of remote sensing and AI. We will apply the model to identify farm boundaries and landscape understanding. Then we can go deeper into what crop is being grown on each farm and what is the likely yield,” said Gupta.
Gupta said Google would work with partners in the ecosystem and make this data available to the government, policymakers, and startups that are building agri solutions and contribute to the agri stack that the Indian government is defining. Google has been working on a pilot for this with the Telangana government.