Friday, March 17, 2023
By Leslie D'Monte

Rise of multimodal LLMs: Will GPT-4 bring the tech world a step closer to achieving AGI?

In a conversation with German news website Heise on 9 March, Microsoft Germany’s chief technology officer Andreas Braun said, “We will introduce GPT-4 next week … we will have multimodal models that will offer completely different possibilities — for example, videos.”

OpenAI released GPT-4 on 14 March, so the news about the imminent release of the fourth version of the Generative Pre-trained Transformer (GPT-4) turned out to be right. The hype over its size, however, has been subdued because OpenAI is yet to disclose the figure. The rumoured 100 trillion parameters had been dismissed even by Sam Altman, who cautioned that people were “begging to be disappointed”.


Nevertheless, will this large language model (LLM) prove to be a game changer given its multimodal features (accepting image and text inputs, emitting text outputs)? It’s natural to expect GPT-4 to be a more powerful version of not only OpenAI’s own GPT but also of its Dall-E (which is multimodal with text and images). It’s unclear, however, whether it will eventually have video output too, to compete with the likes of Meta’s Make-A-Video (multimodal with text and videos) and, perhaps, Google’s PaLM-E (a robotics model that transfers knowledge from visual and language domains to a robotics system).

The power of LLMs, as I have pointed out often in this newsletter, stems from the use of transformer neural networks that are able to read many words (sentences and paragraphs, too) at a time, figure out how they are related, and predict the following word. LLMs such as GPT and chatbots like ChatGPT are trained on billions of words from sources like the internet and books, including Common Crawl and Wikipedia. This makes them more “knowledgeable but not necessarily more intelligent” than most humans, since they may be able to connect the dots but not necessarily understand what they spew out. This implies that while LLMs such as GPT-3 and models like ChatGPT may outperform humans at some tasks, they may not comprehend what they read or write as we humans do. Moreover, these models rely on human supervisors to make them more sensible and less toxic. You can read more about this here.

Size does matter, it seems

Introduced in June 2018, GPT-1 was trained on the BooksCorpus dataset and had 117 million parameters. GPT-2 is much more advanced and was trained on more than 10 times as much data as its predecessor. Released in 2019 with 1.5 billion parameters, GPT-2 does not require any task-specific training data (e.g. Wikipedia, news, books) to learn language tasks such as question answering, reading comprehension, summarization, and translation from raw text. The reason: data scientists can use pre-trained models and a machine learning technique called ‘Transfer Learning’ to solve problems similar to the one that was solved by the pre-trained model. For instance, the social media platform ShareChat pre-trained a GPT-2 model on a corpus constructed from Hindi Wikipedia and Hindi Common Crawl data to generate shayaris (poetry).

Picture credit: Comparing GPT-4 with GPT-3 and the human brain (source: Lex Fridman @youtube). Please note that Sam Altman has dismissed the 100 trillion parameter figure for GPT-4.

GPT-3 vastly enhances GPT-2’s capabilities. In a 22 July 2020 paper titled ‘Language Models are Few-Shot Learners’, the authors describe GPT-3 as an autoregressive language model with 175 billion parameters. Autoregressive models use past values to predict future ones. GPT-3 can be used to write poems, articles, books, tweets, and resumes, sift through legal documents, and even translate or write code as well as, or even better than, humans. OpenAI, an AI research company, released GPT-3 on 11 June 2020 as an application programming interface (API) for developers to test and build a host of smart software products. GPT-3 is undoubtedly an extremely well-read AI language model.
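
For readers who like to see ideas in code, here is a toy Python sketch of what “use past values to predict future ones” means for text. It is emphatically not how GPT-3 works inside: the word-pair counting "model", the two-sentence corpus, and the function names are all assumptions made up for illustration. It shows only the autoregressive loop, in which each predicted word is appended to the text and becomes part of the input for the next prediction.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real LLM trains on hundreds of billions of tokens.
CORPUS = "the cat sat on the mat . the cat sat on the sofa ."


def train_bigram(text):
    """Count which word follows which (a crude stand-in for a trained model)."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def generate(follows, prompt, max_new_words=5):
    """Autoregressive loop: predict a word, append it, then predict again."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never seen in the corpus, so stop
        # Greedy choice: take the most frequent follower of the last word.
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


model = train_bigram(CORPUS)
print(generate(model, "the cat", max_new_words=3))  # the cat sat on the
```

The toy model looks only at the single previous word, whereas a transformer such as GPT-3 conditions each prediction on a long stretch of preceding text. The feed-the-output-back-in loop is the part the two have in common.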

A human, on average, could read about 560-700 books (assuming 8-10 books a year for 70 years) and nearly 128,000 articles (assuming five every day for 70 years) in his or her lifetime. That said, it’s humanly impossible for most of us to memorize this vast reading material and reproduce it on demand. In contrast, GPT-3 has digested about 500 billion tokens (499 billion, to be precise) from sources including Common Crawl, books, and Wikipedia. Since 100 tokens is about 75 words, that works out to roughly 375 billion words. Common Crawl is an open repository that anyone can access and analyze. It contains petabytes of data collected over eight years of web crawling. Further, GPT-3 can recall and instantly draw inferences from this data repository.
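
The back-of-the-envelope arithmetic behind this comparison is easy to reproduce. The short Python sketch below uses only the assumptions stated in the paragraph (70 reading years, 8-10 books a year, five articles a day, 499 billion tokens, and 100 tokens to about 75 words). The function and constant names are mine, chosen for illustration.

```python
READING_YEARS = 70                # assumed reading lifetime, as in the text
GPT3_TOKENS = 499_000_000_000     # training tokens quoted for GPT-3


def lifetime_books(books_per_year, years=READING_YEARS):
    """Books read over a lifetime at a steady yearly pace."""
    return books_per_year * years


def lifetime_articles(articles_per_day, years=READING_YEARS):
    """Articles read over a lifetime, ignoring leap days."""
    return articles_per_day * 365 * years


def tokens_to_words(tokens):
    """Rule of thumb from the text: 100 tokens is about 75 words."""
    return tokens * 75 // 100


print(lifetime_books(8), lifetime_books(10))   # 560 700
print(lifetime_articles(5))                    # 127750
print(f"{tokens_to_words(GPT3_TOKENS):,}")     # 374,250,000,000
```

The results land close to the rounded figures quoted above. They also show that 499 billion tokens corresponds to roughly 375 billion words rather than 500 billion, because a token is, on average, shorter than a word.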

ChatGPT, which has taken the internet world by storm with a little over 100 million users, has been built on the GPT-3.5 series. So, what will happen when ChatGPT starts leveraging GPT-4?

In an August 2021 interview with Wired, Andrew Feldman, the founder and CEO of Cerebras (a company that partners with OpenAI to train the GPT model), said GPT-4 will have about 100 trillion parameters, which, if true, would make GPT-4 a hundred times more powerful than GPT-3 and comparable to the human brain. On 20 January, I explained ‘Why ChatGPT will only become stronger with GPT-4’. But even if GPT-4 is eventually revealed to have 100 trillion parameters (seems unlikely, though), we should not forget that such a model would involve a lot more training, data, and computing power (all of which may not be necessary). I personally believe that the true power will stem from GPT-4 being multimodal rather than just its size.

Are these the first signs of AGI?

Picture courtesy of Livemint

But are these the first steps towards the so-called Singularity or artificial general intelligence (AGI), when machines become sentient or even surpass human intelligence, as some are predicting?

I have argued earlier, too, that while it’s tempting to label these developments as the first steps towards AGI, what machines are actually becoming extremely adept at is narrow AI (handling specialized tasks). AI, for instance, controls your spam; improves the images and photos you shoot from your cameras; can translate languages and convert text into speech and vice versa on the fly; can help doctors diagnose diseases and be used in drug discovery; and can help astronomers look for exoplanets while simultaneously assisting farmers in predicting floods. Such tasks may tempt us to ascribe human-like intelligence to machines, but we must remember that even driverless cars and trucks, however impressive they sound, are still higher manifestations of “weak or narrow AI”. You may read more of this here.

But wait. Here’s what Jim Fan, a research scientist at Nvidia, has to say. He tweeted, “*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:

- Visual IQ test: yes, the ones that humans take!

- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or any pixels that contain text. Reason about the contents directly without explicit OCR. This is extremely useful for unlocking AI-powered apps on multimedia web pages or “text in the wild” from real-world cams.

- Multimodal chat: have a conversation about a picture. You can even provide “follow-up” images in the middle.

- Broad visual understanding abilities, like captioning, visual question answering, object detection, scene layout, common sense reasoning, etc.

- Audio & speech recognition (??): wasn’t mentioned in the Kosmos-1 paper, but Whisper is already an OpenAI API and should be fairly easy to integrate.

Note: the predictions are based on what Andreas Braun, Microsoft Germany CTO, allegedly said. They may or may not be accurate (that’s why I call it “prediction”). But Kosmos-1 is very real and rock solid. It offers a glimpse of either GPT-4 or whatever AI service that Microsoft will provide next. I find it difficult to believe Kosmos-1 will stay in the lab and not become a product. In any case, prepare yourself for multimodal APIs - they’ll happen sooner or later!”

What’s Kosmos-1? According to the authors of the paper on Kosmos-1, titled ‘Language Is Not All You Need: Aligning Perception with Language Models’, “a big convergence of language, multimodal perception, action, and world modelling is a key step toward artificial general intelligence”. However, in an interview earlier this year, Sam Altman too expressed surprise over the excitement around ChatGPT.

One of the important things, Altman insisted, is to put out AI tools like ChatGPT in a more responsible way and give people, institutions, and policymakers more time to understand the implications, and give society time to “update to massive changes”. GPT-4, Altman had then said, “will come out at some point...when we are like confident that we can do it safely, responsibly... In general, we will release technology much more slowly than people will like”. But when asked by the interviewer whether GPT-4 will have 100 trillion parameters as rumoured, Altman called it “complete bullshit”. He added, “I don’t know where it all comes from... people are begging to be disappointed...We don’t have an AGI (artificial general intelligence), and people seem to be expecting it.”

In this context, GPT-4’s release with no mention of its size does make sense, since a headline parameter count may distract from the multimodal capabilities of this LLM. That said, the developments in the field of AI are too rapid for anyone to stick to any one point of view, so all I can say for now is let’s hold our horses on AGI and see how the power of GPT-4 can be leveraged by individuals and companies, with the right policy frameworks to prevent its abuse.

GPT-4: In a nutshell

ChatGPT Plus subscribers will get GPT-4 access with a usage cap.

GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5

GPT-4 surpasses ChatGPT in its advanced reasoning capabilities

GPT-4 passed a simulated bar exam with a score around the top 10% of test takers as opposed to GPT-3.5, whose score was around the bottom 10%

In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili

OpenAI has also been using GPT-4 internally “with great impact on functions like support, sales, content moderation, and programming”

OpenAI is also using GPT-4 to assist humans in evaluating AI outputs, “starting the second phase in our alignment strategy”

OpenAI claims to have spent six months making GPT-4 safer and more aligned. It says: “GPT-4 is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations”

OpenAI acknowledges, though, that despite GPT-4’s capabilities, it has “similar limitations as earlier GPT models”. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors).

GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021) and “does not learn from its experience”

According to OpenAI, GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake

GPT-4 was trained on Microsoft Azure AI supercomputers


Who coined the term AGI?

  • Claude Shannon
  • Alan Turing
  • Marvin Minsky
  • Mark Gubrud
  • John McCarthy

(The correct answer is given below)


$52.1 billion

The amount of funding that 3,198 AI startups received across 3,396 VC funding deals during 2022, according to data and analytics company GlobalData. The analysis, however, also reveals that 2022 saw subdued VC funding activity across sectors, and the AI space also bore the brunt of weaker investor sentiment. Although both the volume and the value of VC funding deals in the AI space declined in 2022 compared to the previous year, the impact was more prominent in terms of value.


Democratising Generative AI

Have you heard about EleutherAI? It’s an association of computer scientists who have built a giant AI system to rival some of the most powerful machine-learning models on the planet. The group takes its name from the ancient Greek word for liberty, eleutheria. EleutherAI was started in July 2020 as an open-source AI research group on a Discord server by a group of hackers, namely Connor Leahy, Sid Black, and Leo Gao. (Spaces on Discord servers are typically used in gaming for specific communities but can be used elsewhere too, just as we use WhatsApp groups.) EleutherAI has now become a non-profit with a mission, as stated on its website, to study the “increasingly powerful machine learning systems (that) are being developed and deployed” and promote “AI governance”. The researchers say they have already made an impact by developing or helping develop “many of the most powerful publicly available models in the world, including GPT‑J, GPT‑NeoX, BLOOM, VQGAN‑CLIP, Stable Diffusion, and OpenFold”. EleutherAI claims its models have been downloaded over 25 million times. The organisation has support from some prominent backers, including Stability AI, Hugging Face, CoreWeave, Canva, Google TRC, Nat Friedman (former CEO of GitHub) and Lambda Labs.

200-300% rise in AI-generated YouTube videos to steal info

AI-generated videos featuring synthetic personas are on the rise and are being used in various languages and platforms for recruitment, education, and promotional purposes. Cybercriminals have jumped onto the bandwagon, with CloudSEK researchers detecting a 200-300% month-on-month rise in YouTube videos containing links to stealer malware such as Vidar, RedLine, and Raccoon in their descriptions since November 2022.

Picture credit: AI-generated video from CloudSEK

These videos pretend to be tutorials on downloading cracked versions of licensed software that is available only to paid users, such as Adobe Photoshop, Premiere Pro, Autodesk 3ds Max, AutoCAD, and others. Infostealers are malicious software designed to steal sensitive information from computers, such as passwords, credit card information, bank account numbers, and other confidential data. The malware is spread via malicious downloads, fake websites, and YouTube tutorials. It infiltrates systems and steals information, which is uploaded to the attacker’s Command and Control server.

“The threat of infostealers is rapidly evolving and becoming more sophisticated, leaving users vulnerable to devastating consequences. In a concerning trend, these threat actors are now utilizing AI-generated videos to amplify their reach, and YouTube has become a convenient platform for their distribution. As a result, it is absolutely critical that users exercise extreme caution when downloading software and avoid any suspicious links or videos at all costs,” said Pavan Karthick, a CloudSEK researcher.

Cybercriminals use search engine optimization (SEO) with region-specific tags and obfuscated links to make these malicious videos appear more credible. Because the videos are tagged with random keywords in different languages, the YouTube algorithm recommends them, making them more accessible to users. Additionally, URL shorteners and links to file hosting platforms make it difficult for users to detect malicious links. According to CloudSEK, it is also important to conduct awareness campaigns and equip users to detect and prevent potential threats. Users should also enable multi-factor authentication, refrain from clicking on unknown links and emails, and avoid downloading or using pirated software.

The answer to the Quiz:

d) Physicist Mark Gubrud was the first to use the term AGI in 1997. However, it was Webmind founder Ben Goertzel and DeepMind co-founder Shane Legg who were instrumental in popularising the term around 2002.

I hope you folks have a great weekend. And do remember, we welcome your feedback.
