The vernacular search conundrum
4 min read. Updated: 30 Dec 2010, 09:04 PM IST
As their diet has grown to include Twitter feeds and Facebook trends and YouTube videos, search engines have become smarter and more engorged with information—but only as long as the searches are in English. The challenge of developing effective vernacular search engines is, entrepreneurs say, directly linked to the low Internet penetration in India.
“Sixty million," announces Peeyush Bajpai, co-founder of Raftaar, a Hindi search engine. Bajpai is a civil management graduate who has spent the better part of the last decade researching urban consumer trends, and the number he intones does not excite him. “That’s the most optimistic estimate of the number of Indians who access the Internet. Compare that to the over 300 million users in China, and we are nothing."
Four years ago, working at Indicus Analytics, an economic research firm and also Raftaar’s parent company, Bajpai had the sort of epiphany of numbers that drives most Internet entrepreneurs in India.
That thread of thought ran as follows: There are almost 300 million Hindi speakers in India, the top-selling newspapers are in Hindi, and every major media house starts off in English but finds the need to branch out into Hindi. Thus, Bajpai reasoned, a huge regional language market must be waiting to be tapped via the Internet.
In 2006, therefore, Bajpai and a small team of programmers and linguists began to develop basic algorithms that would allow users to scour the Internet for content in Hindi. “In hindsight, though, we realized that India had actually completely missed the Internet bus," Bajpai says.
Analysts typically count poor infrastructure and high access costs as the fall guys for the Internet’s poor penetration in India, but Bajpai blames, just as much, the lack of regional language keyboards and even the explosion of mobile telephony. The orthodox chain of Internet acceptance, says Mark Flory, an IT consultant, was broken.
“Usually people created content and Web pages, then the Internet allowed content to be linked, and this was possible thanks to affordable, ubiquitous personal computers. That’s how the Internet grew," Flory says. “But in India, cheap PCs (personal computers) were too slow, the Hindi language keyboards were never popular, and the mobiles took over."
In this melee, Indian search engines—the Internet’s synapses, bridging lifeless Web pages and usable information—get caught in a conundrum. “All search today is biased towards English, and that poses some problems," says Bajpai.
The chances of a Web page showing up high in search results depend on how well it has been tagged with keywords. But most tags are in English; even Hindi Web pages wanting to show up on Google employ English tags. So Raftaar’s crawlers, hunting for Hindi tags, miss out on a great many Hindi-content Web pages. Fewer pages crawled means more Hindi content buried in obscurity, which in turn discourages people from developing Hindi content.
For instance, searching for “Sachin Tendulkar" on Raftaar yields barely 10 results—akin to anathema in India. These are only the pages, however, that have a Hindi tag for “Sachin Tendulkar".
The answer, on the surface, may seem to be an algorithm to translate English tags into Hindi, but these translations are never perfect. The algorithm might, for instance, literally translate “Sachin" into “essence", leading into a linguistic quagmire. “For now, we’re working on physically translating English words and meanings, as well as trying to teach algorithms the context underlying ‘Sachin’ or ‘cricketer’," says Bajpai.
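One way to picture the problem is a tag converter that checks a list of known names before it attempts any word-for-word translation. The sketch below is purely illustrative and is not Raftaar's method: the lookup tables `KNOWN_NAMES` and `WORD_TRANSLATIONS` and the function `hindi_tag` are hypothetical, hand-built for this example.

```python
# Hypothetical, hand-built lookup tables for illustration only.
# A real system would need far larger resources and contextual models.
KNOWN_NAMES = {
    "sachin": "सचिन",
    "tendulkar": "तेंदुलकर",
}
WORD_TRANSLATIONS = {
    "cricketer": "क्रिकेटर",
}


def hindi_tag(english_tag):
    """Convert an English tag to Hindi, treating known names as names."""
    converted = []
    for word in english_tag.lower().split():
        if word in KNOWN_NAMES:
            # Proper noun: write it in Devanagari, do not translate its meaning.
            converted.append(KNOWN_NAMES[word])
        elif word in WORD_TRANSLATIONS:
            # Ordinary word: use the dictionary translation.
            converted.append(WORD_TRANSLATIONS[word])
        else:
            # Unknown word: leave it untouched rather than guess.
            converted.append(word)
    return " ".join(converted)


tag = hindi_tag("Sachin Tendulkar cricketer")
```

Checking names first keeps “Sachin" from being read as an ordinary word, but it only works for names someone has already listed, which is why teaching algorithms the surrounding context remains the harder task.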
At Google, the leviathan of search, translation capacities are limited by the language content available online. “Our algorithms can learn from what exists," says Jagjit Chawla, products manager at Google India. “While literal translations of words are easily caught, certain phrases, dual meanings and contextual meanings are harder to catch."
Like Bajpai, Chawla suggests that the future of vernacular search lies in the mobile phone. Rather than tweaking its algorithms to fish out regional content from the Internet, Google now has an armoury of free tools to translate, transliterate, and even obviate typing itself.
“Today searching and accessing content in India still requires Internet users to identify English alphabets," says Chawla. “We’ve just launched a tool that will allow users to vocally query Google in any language and throw back relevant results in that language. So you’ve bypassed English that way."
Vernacular search also runs up against problems of font. “You can easily shift from Times New Roman to Arial in English, and no crawler will have a problem understanding the meaning of the words," says Bajpai. “But in Hindi, moving from Susha—a very common font—to Kruti makes it gibberish," because the English letter “A" corresponds to different vowels in these fonts.
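The font problem can be sketched in a few lines of code. The example below assumes two made-up legacy fonts, `font_a` and `font_b`, whose tables are invented for illustration and are not the real Susha or Kruti mappings. It shows how the same stored characters turn into different Devanagari text depending on which font table is applied, and why a crawler would need to convert everything to a common encoding such as Unicode before indexing.

```python
# Toy tables invented for this example. They are NOT the actual
# Susha or Kruti glyph mappings.
HYPOTHETICAL_FONT_TABLES = {
    "font_a": {"A": "\u0906", "k": "\u0915"},  # "A" -> AA vowel, "k" -> KA
    "font_b": {"A": "\u0905", "k": "\u0916"},  # "A" -> A vowel,  "k" -> KHA
}


def to_unicode(raw_text, font_name):
    """Map font-specific Latin characters to Unicode Devanagari."""
    table = HYPOTHETICAL_FONT_TABLES[font_name]
    # Characters without an entry pass through unchanged.
    return "".join(table.get(ch, ch) for ch in raw_text)


stored_text = "Ak"  # what the crawler actually finds in the page
as_font_a = to_unicode(stored_text, "font_a")
as_font_b = to_unicode(stored_text, "font_b")

# The same stored characters yield two different Hindi strings.
fonts_disagree = as_font_a != as_font_b
```

Once every page is converted this way, a search index can compare Hindi words directly, regardless of the font the page author used.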
Mahesh Jayachandra, a neuroscientist and the inventor of an Indic keyboard, says most issues regarding scripts and compatibility are now close to being solved. “Now issues are more centred on keyboard design, how intuitively the letters are placed, etc.," he says. “Most Indian languages have a common linguistic structure, so it’s quite possible to have search engines specific to languages."
Before this ideal search universe takes form, entrepreneurs such as Bajpai and Anurag Dod of Guruji.com are being more pragmatic. They reckon that click-based searching, with menus and options, is easier and more lucrative. Thus Guruji, although it allows searches, positions itself as a repository of Hindi music, while Raftaar hosts pre-packaged translated content with information on, for example, the best colleges to pursue engineering.
“Most Indians, being bilingual, have a working knowledge of English," says Bajpai. “So they can just click and find all this pre-arranged information. Over time this content should increase, and people will start querying for them."