Thiruvananthapuram: More than 20 years ago, K.H. Hussain, a librarian at the Kerala Forest Research Institute, decided to catalogue the institute’s library using the computers that had just started making their way into the state. Immediately, he ran into the problem that would occupy him for the next two decades: How would he write the titles of Malayalam books in the English alphabet, which has no letters for several Malayalam sounds?
Digitizing languages: C.R. Indu works on her tablet PC (top); Kosy George (in white shirt) and others at work at CDAC in Thiruvananthapuram. Photographs: Vivek Nair/Mint
With a lexicographer friend, Hussain started his digitization campaign by first developing Rachana, a Malayalam text editor that he released in 1999. It supported, he says, “more than 900 alphabets and symbols”. He then designed an archive system, Nitya, and digitized the collections of public and personal libraries.
But his work—and the work of many others, in various fields—will now be simplified, Hussain believes, by new software from the Centre for Development of Advanced Computing (CDAC) in Thiruvananthapuram: an online handwriting recognition platform, where anything written in Malayalam on a PDA (personal digital assistant, also known as a palmtop computer) or tablet can be recognized and converted into editable text in a word processor.
With the latest Windows and Linux operating systems now being touch-controlled, such online handwriting recognition (OHR) software can revolutionize regional-language Internet access in India. Integrating the software into operating systems will do away with the need for vernacular keyboards and perhaps even the electronic stylus, says Sunil P., a Kochi-based information technology expert who serves on several state government panels.
This promise to throw open the Internet’s gates has made handwriting recognition a focus of keen research. Biennial conferences, held under the umbrella of the International Conference on Frontiers in Handwriting Recognition (ICFHR), showcase the latest advances in machine-readable handwriting. A team comprising employees of Hewlett-Packard and the Indian Institute of Technology, Madras, has already published a paper on OHR for the Tamil script, with accuracy rates ranging from 70% to 90%.
Sitting in a small cubicle at CDAC, in the heart of Thiruvananthapuram, C.R. Indu explains how she and her small team began work on Malayalam OHR two years ago. “Malayalam has a cursive style of writing. It has a unique feature of compound words formed by the joining of two alphabets,” she says. “There are several alphabets that are near-similar, with a shade of difference in curves. To meet the needs of typewriting in Malayalam, the script was changed many years ago so that these compound words were separated. But several still follow the old style. All these pose problems.”
CDAC thus prepared a database of 3,389 handwriting samples, and then worked on what K.G. Sulochana, CDAC’s joint director, calls an “elastic matching technique”. The software was patiently taught to recognize characters as soon as they were written on a tablet or touch screen, and then to convert them into editable text in a font of the writer’s choice.
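The article does not describe how CDAC's elastic matching works. One common form of elastic matching for pen input is dynamic time warping, which stretches or compresses one stroke along its length until it lines up best with a stored sample. The sketch below is only an illustration under that assumption: the function names `dtw_distance` and `classify`, the labels, and the tiny template set are all hypothetical and are not taken from CDAC's software.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two pen strokes.

    Each stroke is a list of (x, y) points in the order they were written.
    This is an assumed stand-in for "elastic matching", not CDAC's method.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # A point may match one-to-one (diagonal step) or be
            # stretched over several points of the other stroke.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def classify(stroke, templates):
    """Return the label of the stored sample closest to the stroke.

    templates is a list of (label, stroke) pairs (hypothetical data).
    """
    return min(templates, key=lambda t: dtw_distance(stroke, t[1]))[0]


# Two made-up templates and a stroke sampled at a different rate.
templates = [
    ("horizontal", [(0, 0), (1, 0), (2, 0)]),
    ("vertical", [(0, 0), (0, 1), (0, 2)]),
]
stroke = [(0, 0), (0.5, 0), (1, 0), (1.5, 0), (2, 0.1)]
label = classify(stroke, templates)
```

Because the comparison tolerates differences in writing speed and in the number of sampled points, a stroke can be matched against a database of stored samples while it is still fresh from the pen. A real recognizer would presumably also normalize size and position and handle multi-stroke characters.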
At present, the software functions at an average recognition accuracy of 94%. Additional fine-tuning is required, Indu points out, because many handwriting styles produce characters that look very much like each other.
Taking this concept another step forward, CDAC is now developing offline handwriting recognition, which can scan or photograph handwritten text, and convert the image into editable text. This is, Sulochana says, comparatively difficult, because different people have different handwriting styles, in which characters are too often linked, broken, or irregularly aligned. There is no dynamic information available; only the complete sample of writing is available as an image. Sulochana expects this system to operate at an accuracy of 80-85%.
But even accuracy levels that are merely respectable can prove useful and effective. The real-time OHR can be installed, without a keyboard, in ticket-billing machines, online forms, or teaching aids, so that content can be created rapidly. Its offline avatar can read registration documents, postal addresses, bank cheques, and handwritten forms.
In a way, these systems are natural progressions of Nayana, an optical character recognition (OCR) system developed by CDAC five years ago, to convert printed Malayalam documents into text. Going by Sulochana’s figures, Nayana works spottily at best. Its average accuracy is 90% for computer printouts, but only 67% for books, 60% for newspapers and a low 47% for periodicals.
Coupled with CDAC’s text-to-speech synthesizer, OCR has proven useful to the visually challenged: Computers can first convert print into recognizable text, and then convert that text into speech, thus turning into reading machines.
Sunil sees, as the next logical step in this process, the integration of OHR into cellphones. That will, he says, erase the differences between those who are literate in English and those literate in the vernacular, democratizing the Internet more than ever before.