
Mint Explainer: Why did Elon Musk cap the number of Twitter posts you can read?
Summary
The move is meant to tackle ‘data scraping’, a practice that has sparked lawsuits against OpenAI and Microsoft. What is it and why does it matter so much?

Elon Musk has stumped Twitter users once again with his move to limit the number of posts users can read on the platform. On 1 July he tweeted that users with unverified accounts and new unverified accounts would be limited to reading 600 and 300 posts a day, respectively, while those with verified accounts could read 6,000 posts a day. These “temporary” limits were later raised to 10,000 posts a day for verified users, 1,000 a day for unverified users and 500 a day for new unverified users. Musk believes the move will address the “extreme levels of data scraping and system manipulation”.
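In software terms, the caps amount to a simple per-account daily read quota. Here is a minimal Python sketch of how such tier-based limits might be enforced; the tier names and counter logic are illustrative assumptions, not Twitter's actual implementation:

```python
# Illustrative sketch of tier-based daily read caps; not Twitter's actual code.
# The figures are the raised "temporary" limits Musk announced.
DAILY_READ_LIMITS = {
    "verified": 10_000,
    "unverified": 1_000,
    "new_unverified": 500,
}

def may_read_post(tier: str, posts_read_today: int) -> bool:
    """Return True if an account in this tier may read one more post today."""
    return posts_read_today < DAILY_READ_LIMITS[tier]

print(may_read_post("unverified", 999))    # True: one read left today
print(may_read_post("unverified", 1_000))  # False: daily cap reached
```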
Twitter has a little over 360 million monthly active users, with the largest number in the US (about 80 million), followed by Japan (58 million) and India (about 24 million). But as we have pointed out earlier, our online lives today are intricately intertwined with our offline lives. What we do online affects our real-world lives and vice versa.
Online Twitter feeds spread offline through thousands of newspapers, each copy of which is typically read by three to five people. Add to this the thousands of media websites reporting on stories that originate on Twitter. Twitter feeds also go viral on other social media platforms and instant messaging apps, including Instagram, Facebook, WhatsApp, Signal and Telegram, which collectively have nearly three billion users.
The end result is a manifestation of Metcalfe’s Law, which states that a network’s value is proportional to the square of the number of nodes in it. In simple terms, what we tweet quickly spreads well beyond Twitter. This raises the question: why would Musk want to limit the reading of posts, especially on a platform that makes money primarily from advertising?
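Metcalfe’s Law can be made concrete with a quick back-of-the-envelope calculation. The Python sketch below uses the audience figures above purely for scale, counting the possible pairwise connections in each network:

```python
# Metcalfe's Law: a network's value is proportional to the square of its size.
# Pairwise connections n*(n-1)/2 grow roughly as n^2 for large n.
def metcalfe_value(nodes: int) -> int:
    return nodes * (nodes - 1) // 2

twitter_users = 360_000_000   # Twitter's monthly active users (see above)
wider_reach = 3_000_000_000   # combined users of the platforms tweets spread to

amplification = metcalfe_value(wider_reach) / metcalfe_value(twitter_users)
print(f"Potential network-value amplification: ~{amplification:.0f}x")  # ~69x
```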
Here are some possible reasons. Twitter provides companies, developers and users with access to Twitter data through application programming interfaces (APIs), programmatic gateways that broadly cover five categories: accounts and users, public tweets and replies, direct messages, advertisements, and publisher tools and software development kits (SDKs).
For instance, Twitter’s direct-message APIs provide “limited access” to developers to create personalised experiences on the platform. This helps businesses to, among other things, create human- or chatbot-powered customer service, marketing, and brand-engagement experiences. Developers can also use public tweets to identify topics and interests, and provide businesses with tools to run advertising campaigns on Twitter. The platform also gives software developers and publishers tools to embed tweets, share buttons and other Twitter content on webpages.
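As a concrete example of the ‘public tweets and replies’ category, fetching recent tweets on a topic through Twitter’s v2 recent-search endpoint looks roughly like this (a minimal sketch: the bearer token is a placeholder, and access depends on which paid API tier you hold):

```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder; issued via the Twitter developer portal

def search_recent_tweets(query: str, max_results: int = 10) -> dict:
    """Fetch recent public tweets matching a query via Twitter's v2 search API."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"query": query, "max_results": max_results},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. surface brand mentions for a customer-service or ad-campaign tool
tweets = search_recent_tweets("#CustomerService lang:en")
```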
Researchers had long used Twitter APIs to mine the platform’s data for academic papers. But this ended in February, when Musk put the APIs behind a paywall. Twitter now has a “free” version for “write-only use cases and testing the Twitter API”. Its basic version, for “hobbyists or prototypes”, costs $100 a month; the pro version costs $5,000 a month; and the three tiers of its enterprise version are much more expensive still.
As a workaround, some sites advocate scraping Twitter data free of charge with programming languages like Python and a headless browser (a browser without a graphical user interface, typically used for automated testing, scraping, analysis, crawling and even hacking). These web-scraping tools can extract Twitter profiles; metadata associated with a tweet, including likes, retweets and replies; hashtags; and Twitter lists. Because tweets are publicly accessible, scraping may seem legal, but many tweets contain copyrighted material like images or videos, and scraping them could be illegal depending on one’s jurisdiction.
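A bare-bones illustration of that workaround, using Python’s Selenium library to drive headless Chrome (a hypothetical sketch: the profile URL is a placeholder, the tweetText selector reflects markup Twitter has used and changes often, and scraping may breach the platform’s terms of service):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless Chrome renders JavaScript-heavy pages like Twitter with no visible window.
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://twitter.com/example_user")  # placeholder profile URL
    driver.implicitly_wait(10)  # give the timeline time to render
    # Tweet text has been exposed via data-testid attributes; this changes frequently.
    for tweet in driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweetText"]')[:5]:
        print(tweet.text)
finally:
    driver.quit()
```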
OpenAI, Microsoft sued over data scraping
It’s hardly a surprise, then, that Musk dislikes data scraping and system manipulation. Interestingly, data scraping is the reason why OpenAI and Microsoft have recently been taken to court.
In a 157-page class action lawsuit filed on 28 June in a US District Court in California, the plaintiffs alleged that the defendants engaged in “unlawful and harmful conduct in developing, marketing, and operating their AI products including ChatGPT-3.5, ChatGPT-4.0, Dall-E, and Vall-E”, which use “stolen private information” from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge, and continue to do so to develop and train the products.
ChatGPT, for instance, is based on a large language model (LLM) in the Generative Pre-trained Transformer series (GPT-3.5), which enables it to write poems, articles, books, tweets and even code, much as humans do. LLMs such as the current GPT versions or Google’s LaMDA represent a major advance in AI. They are trained on humongous volumes of data and have ballooned in size: GPT-3, for instance, had 175 billion parameters, and GPT-4 is rumoured to have been launched with about 100 trillion.
GPT-3.5 was trained on 570GB of text data from the internet, containing hundreds of billions of words harvested from books, articles and websites, including social media. DALL-E and DALL-E 2 are deep-learning models developed by OpenAI to generate realistic digital images from natural-language descriptions known as ‘prompts’. VALL-E is an AI model that needs only a three-second recording of a speaker to clone their voice.
The lawsuit alleges that instead of legally acquiring all that data to train its models, which requires “consent and remuneration”, OpenAI “took a different approach: theft”. “They systematically scraped 300 billion words from the internet… including personal information obtained without consent.” It alleges OpenAI did so “in secret, and without registering as a data broker as it was required to do under applicable law”.
The plaintiffs acknowledge that OpenAI CEO Sam Altman has curtailed the training of its systems on user inputs, but argue that OpenAI’s “updated Terms of Use” do not make clear whether it will refrain only from training on data from API users or will “continue to feed the inputted, collected, and stored data of the millions of everyday ChatGPT users to train the AI products”. The plaintiffs want OpenAI to introduce an option for users to opt out of all data collection and to stop its “illegal” scraping of internet data.
This is not the only case OpenAI faces. On 9 June, Georgia radio host Mark Walters sued the company after discovering that ChatGPT was spreading false information about him, as first reported by Bloomberg Law.
In November 2022, Microsoft, its subsidiary GitHub and OpenAI were hit with a class action lawsuit accusing the AI-powered coding assistant GitHub Copilot of relying on “software piracy on an unprecedented scale”, underscoring the pitfalls of training software on copyright-protected data. Copilot, too, was trained on public repositories of code scraped from the web, many of which are published under licences that require anyone reusing the code to credit its creators.
Among other things, the suit noted: “AI systems are not exempt from the law. Those who create and operate these systems must remain accountable. If companies like Microsoft, GitHub, and OpenAI choose to disregard the law, they should not expect that we the public will sit still. AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many.”
Given the tone of these class action suits, companies looking to develop and monetise generative AI models have their work cut out for them.