Gen AI voice interfaces promise new forms of tech despite adoption challenges

A multimodal native AI model runs locally on a device, without needing internet connectivity to access its database on cloud platforms. (istockphoto)

  • Voice interfaces have their best shot at popularity with generative AI; but even as brands bank on creating product ecosystems with voice-based generative AI built in, the technology's success hinges on a key challenge: monetization.

New Delhi: In November last year, little-known Silicon Valley startup Humane unveiled a wearable device that had no display but could use voice to do any task a conventional smartphone is used for. Two months later, fellow upstart Rabbit, in partnership with Swedish tech firm Teenage Engineering, unveiled the r1, a device similar in concept and operation to Humane’s wearable, the ‘AI Pin’. What the two had in common: both used large language models (LLMs) as their fundamental technology platform, in partnership with ChatGPT maker OpenAI, to offer users a completely new interface.

This new interface largely relies on voice, promising users a seamless world where they can simply speak to their devices instead of tapping through multiple displays and a complex array of applications for basic tasks such as ordering lunch, hailing a cab or sending an email.

Industry experts believe that while voice as an interface has largely failed to become mainstream, the advent of generative AI and natively-running multimodal models could change this. To this end, the likes of Humane and Rabbit are early examples of likely next-generation consumer hardware.


To be sure, a multimodal native AI model runs locally on a device, without needing internet connectivity to reach cloud servers. This makes the model easier to access and compute with on any device, and multimodality lets it interpret and generate text, images, video and voice. With the latest announcements from Apple and Google, such models are now coming to smartphones. Soon, laptops certified as ‘AI PCs’ under Microsoft’s Copilot Plus range will bring generative voice interfaces to personal computers, too.

Apple, for instance, has overhauled its digital assistant, Siri, to better understand personal context and remember conversations, making the overall voice experience better than before. Google Assistant, powered by its Gemini LLM, can pull off similar features on Android smartphones that natively support its AI models.

New gadget ecosystems

Industry stakeholders believe that the move can lead to new gadget ecosystems and product styles. Kashyap Kompella, AI industry analyst and founder of consultancy firm RPA2AI Research, said that the rise of generative AI voice interfaces could play a role in commercially available robots. “The rise of commercial robots with which you can speak in natural speech is an area that is likely to develop within the next decade. Enterprise robots are likely to develop first, followed by home accessibility robots that generative AI models could enable with speech," Kompella said.

Others believe that while voice interfaces could grow thanks to multimodal AI running locally on devices, this will form a part of a broader, more complex user interface. Tuong Nguyen, director analyst at Gartner, said that while voice interfaces “will increase in usefulness and popularity, the bigger story is multimodality and contextual interfaces—which means voice alongside natural language understanding combined with image analysis."


For many companies, voice is a way to tie interfaces together into a seamless ecosystem. At Apple’s Worldwide Developer Conference on 10 June, the company’s AI showcase included features that interoperate across various applications. A senior executive familiar with the iPhone maker’s latest suite of AI features told Mint on condition of anonymity that voice interactions through Siri will work seamlessly across Apple’s three primary product categories—iPhones, iPads and the Mac range of desktop and laptop PCs.

“In fact, Apple’s AI features are designed to establish a seamless user experience, especially with voice, across the main products that users purchase from the brand. Having underlying AI models with on-device processing can establish this as the new norm across more brands," the executive said.

Tarun Pathak, director at market researcher Counterpoint India, added that the development of product ecosystems could be a key aspect of voice-based generative AI interfaces. “With voice interfaces working seamlessly across devices, more brands could look at developing their own ecosystems of products. This could lead to innovation of form factors too, the early examples of which include Samsung’s push to make wearables control every user-end feature," he said.


An email sent to a Samsung spokesperson on its ecosystem and voice AI plans remained unanswered until press time. In January, the company unveiled its Galaxy S24 range of flagship smartphones with natively operating AI features—including its voice assistant Bixby. Samsung is expected to unveil more new hardware with natively-running AI applications next month.

On 10 June, Muralikrishnan B, president of Xiaomi India, told Mint in an interview that the company’s primary product strategy for the next year in India is to establish a wider ecosystem of products beyond smartphones—including smart home appliances, audio products, wearables and more. One of the key aspects of Xiaomi’s ecosystem push is interoperability—a factor that can be improved upon by the integration of AI across product categories.

More innovations

Going forward, Counterpoint India's Pathak said more device form factors, such as wearable headsets and smarter wrist gear, could be on the way over the next four years. “Voice with generative AI will stand a chance to actually replace the need to tap multiple times on a display, which is its biggest strength and reason for adoption," he added.

Gartner's Nguyen said, “Voice is not a cure-all solution for a device. Future devices, such as head-mounted displays, will expand multimodal interfaces to include other aspects such as gesture detection, motion tracking, eye tracking, sentiment analysis and more."


However, many others have urged caution, too. Kompella said a key concern is that voice interfaces have so far failed to take off. “Voice as a technology has held great promise in the past two decades. However, the adoption has remained limited, even though companies such as Amazon at one point sold over 200 million smart speakers powered by the Alexa digital assistant across the world. The challenge lies in understanding if voice is a product or a feature—and how brands can earn money from it," he said.

“If voice with generative AI is still not monetizable, product innovation will not progress at the same pace. There are specific use cases, such as medical transcription, that could see dedicated applications of voice-based generative AI. However, whether consumer hardware will finally go through an upheaval is a question left as yet unanswered," Kompella added.
