How to make the internet speak in multiple languages

China has the largest number of internet users, but just 1.4% of the top 10 million websites use Chinese. Hindi, the third most spoken language in the world, accounts for just 0.1%. Photo: iStock

Technology is being deployed but much more can be done to grant everyone access to all content online

One of the best known origin myths about the source of languages is the story of the Tower of Babel. It goes something like this:

There was a time when everyone on Earth spoke the same language and, united in this manner, decided to build a tower so tall it could reach the heavens. They made such rapid progress on the tower that Yahweh, the God of Israel, was said to have remarked, “Indeed the people are one and they all have one language, and this is what they begin to do; now nothing that they propose to do will be withheld from them.”

Yahweh had to stop the construction of the tower or else mortals would think themselves the equal of God. So Yahweh created multiple languages, dividing the builders into different linguistic groups who couldn’t understand a word the others spoke. Work on the tower came to a grinding halt and everyone drifted away into their own small groups, never to unite again.

Today there are over 7,000 languages in the world, and content created in one language is incomprehensible to those who speak another. With the internet becoming the largest store of information on the planet, the fact that the world speaks so many languages is becoming a real challenge. Even though English speakers make up just 16% of the world's population, English accounts for over 60% of the top 10 million websites on the internet. On the other hand, despite the fact that China has the largest number of internet users, just 1.4% of the top 10 million websites use Chinese. Hindi, the third most spoken language in the world, accounts for just 0.1%.

It is possible that all of this is just temporary. The internet was invented in the English-speaking world, so it stands to reason that the majority of its content would be in English. As the internet's users become more linguistically diverse, this is already showing signs of change. In 1996, 80% of internet users spoke English. By 2010, that share had dropped to 27.3%. Today, 12 times more people in China and 25 times more people in the Arabic-speaking world use the internet than they did in 1996. It seems inevitable that the language of online content will follow suit.

But building a more linguistically representative internet is not the solution we are in search of. That would require us to create new content in a wider range of languages, whereas what we need is to make the information already in existence understood by a wider range of people. We need translation technology that can consistently (and with a high level of accuracy) ensure that content in one language is understandable to those who speak another, so that it no longer matters what language the content was created in.

Nowhere is the need to solve this more urgent than in India. With over 3,000 languages, the only way to ensure that our development goals reach all corners of the country is to ensure that no content is out of someone’s reach solely on account of the language they speak.

Earlier this year, the government of India launched Bhashini, a digital public platform for languages designed to ensure that digital content can be delivered in all Indian languages using artificial intelligence (AI) and allied technologies for speech-to-speech translation. To achieve this at scale, we will need to generate vast training datasets of text and speech in multiple Indian languages. The Bhashini project has launched Bhasha Daan, a crowdsourcing platform through which volunteers can support the project by translating texts into languages they understand, by contributing spoken words in languages they are familiar with, or by labelling images in other languages.

As innovative as this is, it is unlikely to be sufficient. To solve a problem of this magnitude, we need to work with large datasets of annotated information that accurately cross-reference speech or text in one Indian language with that in several others. This will allow us to build and train AI models to translate quickly.

One obvious source for this is the archives of All India Radio and Doordarshan, organizations which, for decades, have created content in multiple regional languages at the same time. Past recordings of the daily news alone will give us comparative samples of roughly the same content spoken in multiple languages. I have no doubt that many other sources of similar annotated language data exist in the private domain.
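
As a rough illustration of how such archives might be turned into training data, the sketch below groups bulletin transcripts by broadcast date and pairs up every two languages that aired on the same day. The tuple layout, the function name build_parallel_pairs and the sample records are assumptions made for this example; nothing here reflects how Bhashini or the broadcasters actually store their archives. Same-day bulletins are only roughly parallel, so a real pipeline would still need finer alignment at the sentence level.

```python
from collections import defaultdict
from itertools import combinations


def build_parallel_pairs(bulletins):
    """Pair same-day bulletins across languages.

    `bulletins` is an iterable of (date, language, text) tuples.
    This layout is hypothetical, chosen only for illustration.
    Returns a list of (date, lang_a, lang_b, text_a, text_b) tuples.
    """
    # Collect one transcript per language for each broadcast date.
    by_date = defaultdict(dict)
    for date, language, text in bulletins:
        by_date[date][language] = text

    pairs = []
    for date in sorted(by_date):
        versions = by_date[date]
        # Every unordered pair of languages that aired on this date.
        for lang_a, lang_b in combinations(sorted(versions), 2):
            pairs.append((date, lang_a, lang_b, versions[lang_a], versions[lang_b]))
    return pairs


# Placeholder records; the texts stand in for real transcripts.
sample = [
    ("2022-11-01", "hi", "<Hindi bulletin text>"),
    ("2022-11-01", "ta", "<Tamil bulletin text>"),
    ("2022-11-01", "bn", "<Bengali bulletin text>"),
    ("2022-11-02", "hi", "<Hindi bulletin text>"),
]
pairs = build_parallel_pairs(sample)
```

A date with bulletins in three languages yields three language pairs, while a date with only one language yields none, which is why archives that broadcast the same news in many languages at once are so valuable.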

One possible impediment to this is copyright law that prevents content from being used without the permission of the owner of the work. Several countries have amended the fair-use provisions of their copyright statutes to create an exception for data analysis—to allow non-commercial use of this data for the purpose of creating training data sets. India should consider amendments along similar lines in order to incentivize more innovation in the field of language translation.

Douglas Adams, one of my favourite science fiction authors, introduced a typically irreverent literary device in his Hitchhiker’s Guide to the Galaxy series of books to solve the translation problem at a galactic scale. In his imaginary world, everyone has a BabelFish, a symbiotic creature that lives in your ear and translates all communication signals it hears into a language you can understand. As a result, every species, including the most exotic extraterrestrials, can understand every other.

I’ve always thought it would be cool to have a BabelFish in my ear wherever I travel. Maybe Bhashini will make that possible.

Rahul Matthan is a partner at Trilegal and also has a podcast by the name Ex Machina. His Twitter handle is @matthan 
