AI is yet to learn the contextual nuances of what we say

Early last year, I wrote in IT Matters that OpenAI, an Artificial Intelligence (AI) lab in San Francisco, had revealed a new technology called GPT-3. Over several months, it had learnt the nuances of natural language as spoken and written by humans, analysing thousands of digital books and nearly a trillion words posted to blogs, social media and the rest of the internet.

GPT-3 is the product of several years of work at the world’s leading AI labs, including OpenAI, an independent organization backed by $1 billion in funding from Microsoft, as well as labs at Google and Meta. It is what AI scientists call a neural network: a mathematical system loosely modelled on the web of neurons in the brain. As one might expect, there is more than one mathematical model on which such networks are built.

At Google, a system called BERT (short for Bidirectional Encoder Representations from Transformers) was also trained on a large selection of online text. It can guess missing words anywhere in millions of sentences. OpenAI has also been refining another neural network, DALL-E, which automatically creates images from a wide range of text captions expressed in natural language.
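To make “guessing missing words” concrete, here is a toy, count-based sketch of the fill-in-the-blank task, assuming a tiny hypothetical corpus. It is not how BERT works internally (BERT uses a Transformer neural network trained on vast amounts of text); it only shows why looking at the words on both sides of a blank, the “bidirectional” idea in BERT’s name, narrows down the answer. The function name `fill_mask` and the `[MASK]` token are illustrative choices.

```python
from collections import Counter
from typing import List, Optional


def fill_mask(sentence: str, corpus: List[str], mask: str = "[MASK]") -> Optional[str]:
    """Guess a masked word from the words on BOTH sides of it.

    A toy, count-based stand-in for the fill-in-the-blank task; BERT itself
    learns this with a Transformer neural network, not by counting.
    """
    tokens = sentence.lower().split()
    i = tokens.index(mask.lower())
    left = tokens[i - 1] if i > 0 else None                 # word before the blank
    right = tokens[i + 1] if i + 1 < len(tokens) else None  # word after the blank

    candidates: Counter = Counter()
    for line in corpus:
        words = line.lower().split()
        for j, word in enumerate(words):
            # Left context must match (or the blank must start the sentence) ...
            ok_left = (j == 0) if left is None else (j > 0 and words[j - 1] == left)
            # ... and so must the right context: this is the "bidirectional" part.
            ok_right = (j == len(words) - 1) if right is None else (
                j + 1 < len(words) and words[j + 1] == right)
            if ok_left and ok_right:
                candidates[word] += 1
    # Most frequent word seen in the same two-sided context, if any.
    return candidates.most_common(1)[0][0] if candidates else None


corpus = [
    "the cat sat on the mat",
    "the cat sat by the door",
    "the dog sat on the rug",
]
print(fill_mask("the [MASK] sat on the mat", corpus))  # cat
print(fill_mask("the cat [MASK] on the mat", corpus))  # sat
```

A real model generalises to sentences it has never seen; this sketch returns nothing when the exact two-sided context is absent from its corpus.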

Such large-scale neural networks are now called foundation models, because they form the base on which all sorts of AI applications can be built. They differ from other cognitive models that train AI systems on smaller data sets, because their creation involved scouring almost every shred of information available on the web, a data store that doubles in size roughly every two years. As an aside, the website Live-counter.com, which attempts to track the internet’s size, put it at about 40 zettabytes in 2020. Going by the two-year doubling rule, that figure should now be close to 80 zettabytes. For context, a zettabyte is equivalent to a trillion gigabytes.
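The doubling arithmetic above can be checked in a few lines. This is a minimal sketch that assumes steady doubling every two years from the 2020 estimate of about 40 zettabytes, and takes “now” to be 2022 (an assumption, since the column gives no explicit date); `projected_size_zb` is a hypothetical helper name.

```python
# 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes, so 1 ZB = 10**12 GB (a trillion).
GB_PER_ZETTABYTE = 10**21 // 10**9


def projected_size_zb(base_zb: float, base_year: int, year: int,
                      doubling_years: float = 2.0) -> float:
    """Size after steady doubling: base * 2 ** ((year - base_year) / doubling_years)."""
    return base_zb * 2 ** ((year - base_year) / doubling_years)


# Assumption: "now" is 2022, two years after the 2020 estimate of about 40 ZB.
size_2022 = projected_size_zb(40, 2020, 2022)
print(size_2022)                     # 80.0
print(size_2022 * GB_PER_ZETTABYTE)  # about 80 trillion gigabytes
```

The same rule would put the figure near 160 zettabytes by 2024, though real growth rarely follows a clean doubling curve.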

Foundation models have become popular because they blow past traditional methods of training AI programs on smaller data sets, and they were expected to be game changers. However, researchers are now discovering that they have many limitations. A paper with more than 100 contributors, available on Cornell University’s arXiv at https://arxiv.org/abs/2108.07258, says: “AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations).”

“Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.”

Simply put, foundation models are far from ready for prime time. Their Achilles’ heel is an insufficient understanding of context. For example, a simple question from one person to another, such as “How is your wife?”, can elicit a vast number of possible responses depending on time, circumstances, context, and the strength of one’s relationships (both with one’s wife and with one’s interlocutor).

In my own case, I can think of many different answers, such as “she’s well”, “she’s gardening” or “she’s exasperated”, depending on who is asking, when, where and in what circumstances. A human mind would add all sorts of information to a response to that question, including tongue-in-cheek humour, innuendo or intentional obfuscation.

While GPT-3 and its peers have made strides with respect to context, and can independently churn out prose that incorporates several contextual nuances, they cannot handle contexts such as urgency of tone or the sensitivities of an interaction, like those that often attend medical discussions, or even something as mundane as a query about one’s wife from one’s mother-in-law.

To be fair, these models already have something called ‘late-binding context’: their responses can be tied to the latest version of some database, say, Mint’s coverage of India’s performance at the Commonwealth Games. But this too is simply reliance on an anchor database holding the latest information, not a full appreciation of all contextual information.
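A minimal sketch of the late-binding idea described above, assuming a hypothetical `AnchorStore` of dated text snapshots (the class, function names and placeholder report texts are illustrative, not any real product’s API). The context is looked up at the moment a question is asked, so the same question picks up newer material once it is added. Note that the code merely selects the newest record; it understands nothing about the context, which is precisely the limitation the column points out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Snapshot:
    published: date
    text: str


@dataclass
class AnchorStore:
    """Hypothetical anchor database that is updated after the model is trained."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def add(self, published: date, text: str) -> None:
        self.snapshots.append(Snapshot(published, text))

    def latest(self) -> Snapshot:
        # Whichever snapshot is newest at the moment of the call.
        return max(self.snapshots, key=lambda s: s.published)


def build_prompt(question: str, store: AnchorStore) -> str:
    """Late binding: context is looked up when the question is asked, not at training time."""
    ctx = store.latest()
    return f"Context (as of {ctx.published.isoformat()}): {ctx.text}\nQuestion: {question}"


store = AnchorStore()
store.add(date(2022, 7, 30), "placeholder report A")
q = "How is India doing at the Commonwealth Games?"
first = build_prompt(q, store)
store.add(date(2022, 8, 6), "placeholder report B")
second = build_prompt(q, store)  # same question, newer context
print(second)
```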

To be ready for prime time, foundation models will need significant advances in their ability to dynamically comprehend and apply multiple facets of late-binding context. For truly interactive AI, a full command of context is paramount.

Siddharth Pai is co-founder of Siana Capital, and the author of ‘Techproof Me’.
