Like almost every other tech company out there, Adobe has leaned heavily into artificial intelligence in recent years. The software company has launched a number of different AI services since 2023, including Firefly – its AI-powered media production suite. Now, however, the company’s full embrace of the technology may have led to trouble, as a new lawsuit claims it used pirated books to train one of its AI models.
A class-action lawsuit filed on behalf of Elizabeth Lyon, an author from Oregon, alleges that Adobe used pirated versions of several books, including her own, to train the company’s SlimLM program.
Adobe describes SlimLM as a series of small language models that can be “optimized for document support tasks on mobile devices.” The company states that SlimLM was pre-trained on SlimPajama-627B, a “deduplicated, multi-corpora, open-source dataset” released by Cerebras in June 2023. Lyon, who has written a number of guidebooks on nonfiction writing, says some of her works were included in a pre-training dataset that Adobe used.
Lyon’s lawsuit, which was originally reported by Reuters, says her writing was included in a processed subset of a manipulated dataset that was the basis of Adobe’s program: “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including the copying of Books3),” the suit says. “Therefore, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and Class members.”
Books3, a huge collection of 191,000 books that has been used to train generative AI systems, has been a constant source of legal trouble for the tech community. RedPajama has also been cited in several lawsuits. In September, a lawsuit against Apple claimed that the company had used copyrighted material to train its Apple Intelligence model. The lawsuit cited the dataset and accused the tech company of copying protected works “without consent and without credit or compensation.” In October, a similar lawsuit against Salesforce also claimed that the company had used RedPajama for training purposes.
Unfortunately for the tech industry, such lawsuits have by now become somewhat commonplace. AI algorithms are trained on massive data sets, and in some cases, those data sets reportedly include pirated material. In September, Anthropic agreed to pay $1.5 billion to a number of authors who had sued it, accusing it of using pirated versions of their work to train its chatbot, Claude. The case was seen as a potential turning point in ongoing legal battles over copyrighted material in AI training data, of which there are many.
