What does AI know about having a ball?



General AI looks reachable but we must watch out for human tics and foibles inherited by machines

I typed “gorilla in a grass skirt having a ball” into the search box on Craiyon.com, and the site threw up images of a good-looking gorilla wearing a very Hawaiian grass skirt. But its version of having a ball was not to have a party, but to hold a large colourful ball in its arms. Therein lies the rub. While DALL-E Mini, the original name of Craiyon, is fantastic, it still has some way to go. It is the open-source, free and slightly attenuated version of its mother neural network programme, DALL-E 2, created by OpenAI. DALL-E, along with Imagen, released by Google Brain to go one up on OpenAI, is among the latest AI LLMs (large language models), which are stretching the boundaries of what AI can do.

In August 2020, I wrote about the stunning storytelling prowess of another LLM, GPT-3 (bit.ly/3RbHfbB). The Generative Pre-trained Transformer Version 3, I wrote, was being heralded as the first step towards the holy grail of AGI (Artificial General Intelligence), where a machine has the capacity to understand or learn any intellectual task that a human being can. GPT-3 has been trained on a massive body of text, mined for statistical regularities, or parameters, which are connections between different nodes in its neural network. The scale is gargantuan, with 175 billion parameters; all of Wikipedia comprises just 0.6% of its training data! GPT-3 was developed by OpenAI too, and with DALL-E the company took this to another level. OpenAI took a 12-billion-parameter version of GPT-3 and trained it to interpret natural language inputs and generate images corresponding to them, thus literally ‘swapping texts for pixels’. So, if the text prompt was “an astronaut riding a yellow horse near Saturn”, the program would break this sentence into segments of information, find the images closest to each, and then synthesise all of it to show an astronaut sitting on a horse against a starry sky with Saturn hovering in the background. A sister model called CLIP (Contrastive Language Image Pretraining) would then rank the outputs and curate the best ones to show you. The model was trained on a large number of photos either scraped off the internet or acquired from licensed sources.
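The generate-then-rank pipeline described above can be sketched in a few lines of Python. This is a minimal illustration only: the `generate_candidates` and `clip_score` functions below are hypothetical stand-ins for the real generator and CLIP models, which are vastly more complex.

```python
# Hypothetical sketch of DALL-E's generate-then-rank pipeline.
# The generator and the CLIP-style scorer are stand-ins, not real models.

def generate_candidates(prompt, n=4):
    # A real generator would synthesise n candidate images from the prompt;
    # here each "image" is just a labelled placeholder string.
    return [f"image_{i}_for_{prompt!r}" for i in range(n)]

def clip_score(prompt, image):
    # CLIP embeds the prompt and the image and scores their similarity;
    # this stand-in returns a pseudo-score that is stable within one run.
    return (hash((prompt, image)) % 100) / 100

def best_images(prompt, n=4, top_k=2):
    # Generate many candidates, rank them by prompt-image similarity,
    # and curate only the top few to show the user.
    candidates = generate_candidates(prompt, n)
    ranked = sorted(candidates, key=lambda img: clip_score(prompt, img),
                    reverse=True)
    return ranked[:top_k]

print(best_images("an astronaut riding a yellow horse near Saturn"))
```

The key design idea this illustrates is separation of concerns: one model proposes many candidates, and a second, independently trained model judges which best match the text.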

Neural network-based image generation has been prevalent since the start of this century. What is new with DALL-E is how it works from natural language prompts, the way you and I would ask a question, and produces meaningful outputs. To truly understand its magic, I would advise you to go to Craiyon.com and play with it. You might feel a bit like Marcelo Rinesi, chief technology officer of the Institute for Ethics and Emerging Technologies, did when he watched Jurassic Park. In WIRED magazine (bit.ly/3NOJm2s), he remembers that the movie’s dinosaurs looked so realistic that they “permanently shifted people’s perception of what’s possible”. After playing around with DALL-E 2, it would seem AI might be on the verge of its own ‘Jurassic Park moment’. You will find it to be a powerful tool, but as Rinesi says, “It does nothing a skilled illustrator couldn’t with Photoshop and some time.” The major difference, he says, is that “DALL-E 2 changes the economics and speed of creating such imagery.”

These ground-breaking models come with their own problems, though. An obvious one is bias. LLMs like these make it possible to “industrialize disinformation or customize bias”. The creators of these models know this. Imagen, for example, was quite blasé in its disclaimer: “While a subset of our training data was filtered to remove noise and undesirable content... we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.” Some of you might remember the bold experiment by Microsoft with an AI chatbot called Tay, which within hours started spouting antisemitic, racist and pornographic talking points. When OpenAI launched GPT-3, it famously said that “internet-trained models have internet-scale biases”. That is the nub of the issue: content on the internet is created by human beings, and a portion of it reflects the racial, gender and other biases of its creators.

I wrote of GPT-3 as evidence that AI can be as creative as humans, and DALL-E 2 looks like another large step forward. OpenAI was perhaps thinking similarly: the name DALL-E is a combination of the robot WALL-E and the creative master Salvador Dalí. As my experiment with the gorilla having a ball shows, while it can already don a grass skirt, it will take some time to come to the party.

Jaspreet Bindra is the founder of Tech Whisperer Ltd, a digital transformation and technology advisory practice
