Should LLMs that allegedly infringe IP, copyrights, and plagiarise get away scot-free?
Yann LeCun, chief AI scientist at Meta, stirred a hornet’s nest when he posted on X (formerly Twitter) that “…books should be freely available for download”, reasoning that most book authors do not make “significant money” from book sales. He believes that many authors “who are more motivated by intellectual impact than by a thousand bucks of income should probably just make their book available for free download”.
His 1 January post has since drawn an understandable backlash from authors across the globe.
To be sure, Meta has skin in the publishing game with its open source Llama 2 LLM. But here's some context to his post.
Just before the year ended, the New York Times (NYT) initiated legal proceedings against Microsoft and OpenAI, alleging unauthorized “copying and using millions of its articles”. When OpenAI released ChatGPT on 30 November 2022, the company may not have expected the product to garner more than 100 million users in just its first two months. ChatGPT’s soaring popularity and scale, however, have proved to be a thorn in the side of media houses across the world, since the chatbot built on OpenAI’s large language models (currently the fourth version of the generative pre-trained transformer, or GPT-4) can not only generate new content (text, images, videos, code, etc.) from prompts written in natural languages such as English but can also reproduce content verbatim, inviting copyright infringement suits.
AI chatbots such as OpenAI’s ChatGPT rely on crawlers that, like search engine bots, traverse the web and gather data. OpenAI halted this browsing function only in July 2023. Publishers have the option to block bots from crawling their content, but distinguishing AI bots from those of search engines such as Google or Microsoft’s Bing, which index pages and make them visible in search results, can prove challenging.
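As an illustration of how such blocking works, a publisher can add per-bot rules to its robots.txt file. The sketch below is a hypothetical configuration, not any specific publisher's file; GPTBot is the user agent OpenAI has documented for its crawler, while Googlebot and Bingbot serve search indexing:

```text
# Hypothetical robots.txt for a publisher's site:
# block OpenAI's training crawler, but keep admitting
# the search-engine bots that drive referral traffic.

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```

Compliance is voluntary on the crawler's side, which is part of why publishers find the distinction between AI bots and search bots so hard to enforce.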
As the NYT explains in its 69-page suit, LLMs operate by predicting the words most likely to follow a given text sequence, drawing on the vast pool of examples used in their training. By appending the model’s output to its input and feeding the result back in, the LLM generates sentences and paragraphs step by step, formulating responses to user queries or “prompts”. These models encode information from their training corpus as numerical “parameters” (GPT-4 reportedly has approximately 1.76 trillion parameters).
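That append-and-loop mechanism can be sketched in a few lines. The snippet below is a toy illustration, not OpenAI's model: a hand-written lookup table stands in for the trillions of learned parameters, but the generation loop itself (predict the next word, append it, feed the longer text back in) mirrors how an LLM produces a response:

```python
# Toy autoregressive generation loop. The "model" is a hypothetical
# lookup table mapping a word to its most probable successor; a real
# LLM computes this distribution from billions of parameters.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}

def generate(prompt: str, max_new_words: int = 5) -> str:
    words = prompt.lower().split()
    for _ in range(max_new_words):
        last = words[-1]
        if last not in NEXT_WORD:   # model has no prediction: stop
            break
        words.append(NEXT_WORD[last])  # append output, loop it back as input
    return " ".join(words)

print(generate("the"))  # → "the cat sat on the cat"
```

Real systems sample from a probability distribution rather than following a fixed table, which is why the same prompt can yield different completions.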
The process of determining these parameter values is termed “training”. It involves storing encoded versions of the training texts in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked words and the model’s predictions for them. As a consequence, trained models can exhibit “memorization”: given the appropriate prompt, they reproduce substantial portions of the material they were trained on.
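The memorization effect can be demonstrated with an even simpler stand-in for an LLM. In this hypothetical sketch, "training" just counts which word follows which in a training text; with a small, low-diversity corpus, greedy decoding then regurgitates the training passage verbatim from a two-word prompt, which is the behaviour at the heart of the NYT's complaint:

```python
from collections import defaultdict, Counter

# Toy "training" corpus (illustrative text, not an actual NYT article).
TRAINING_TEXT = "copyright law protects original works of authorship"

# Training: the model's "parameters" are next-word counts from the corpus.
counts = defaultdict(Counter)
words = TRAINING_TEXT.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def complete(prompt: str, max_new_words: int = 10) -> str:
    out = prompt.split()
    for _ in range(max_new_words):
        options = counts.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])  # greedy: most likely word
    return " ".join(out)

print(complete("copyright law"))  # reproduces the training text verbatim
```

Large models trained on vastly more data usually generalize rather than copy, but when a passage appears often enough in the corpus, the same regurgitation can surface.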
In a May 2023 paper, ‘Do Language Models Plagiarize?’, researchers likewise argued that "...GPT-2 can exploit and reuse words, sentences, and even core ideas (that are originally included in OpenWebText, a pre-training corpus) in the generated texts". They added that the tendency to plagiarise rises as model size increases or when certain decoding algorithms are employed. They examined three types of plagiarism: verbatim plagiarism (exact copies of words or phrases without transformation); paraphrase plagiarism (synonymous substitution, word reordering and/or back-translation); and idea plagiarism (representation of core content in an elongated form).
The Times, for its part, argues in the suit that not only do its journalists spend considerable time and effort reporting pieces, but that it "employs hundreds of editors to painstakingly review its journalism for accuracy, independence, and fairness, with at least two editors reviewing each piece prior to publication and many more reviewing the most important and sensitive pieces". The publication explains that it depends on its exclusive rights of reproduction, adaptation, publication, performance, and display under copyright law to protect that work.
"The Times has registered the copyright in its print edition every day for over 100 years, maintains a paywall, and has implemented terms of service that set limits on the copying and use of its content. To use Times content for commercial purposes, a party should first approach The Times about a licensing agreement". Simply put, The Times wants to be paid for the copyrighted content that AI-powered chatbots like ChatGPT and Bing Chat have allegedly been reproducing, in many cases verbatim.
It appears to be a fair demand, given the loss of advertising and affiliate referral revenue in addition to the "free-riding on The Times’s significant efforts and investment of human capital to gather this information".
These class-action suits, as I have pointed out earlier, are a wake-up call for companies that develop generative AI models and monetize them without satisfactorily addressing the concerns of those affected.
In September 2023, for instance, the Authors Guild and 17 authors, including John Grisham, Jodi Picoult and George R.R. Martin, filed a class-action suit against OpenAI for “unpermitted use” of their copyrighted work. The authors have alleged “flagrant and harmful infringements of plaintiffs’ registered copyrights” in their suit and called ChatGPT a “massive commercial enterprise” that is reliant upon “systematic theft on a mass scale.”
According to Scott Sholder, a partner with Cowan, DeBaets, Abrahams & Sheppard and co-counsel for the plaintiffs, they are not objecting to the development of generative AI but insist that the “defendants had no right to develop their AI technologies with unpermitted use of the authors’ copyrighted works”.
According to the suit, the “defendants could have ‘trained’ their large language models (LLMs) on works in the public domain or paid a reasonable licensing fee to use copyrighted works”. A statement by the Authors Guild notes that GPT is already being used to generate books, such as the recent attempt to generate volumes 6 and 7 of George R.R. Martin’s A Song of Ice and Fire series (the basis of Game of Thrones), as well as the “numerous AI-generated books that have been posted on Amazon that attempt to pass themselves off as human-generated and seek to profit off a human author’s hard-earned reputation”.
The outcome of such suits will define how foundation models and LLMs are built and deployed going forward; until these concerns are addressed, their makers will continue to be hauled over the coals and viewed with suspicion and mistrust.