Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to crack down on pirated content to defend his own company’s use of a dataset containing copyrighted e-books, according to recently released excerpts of a deposition he gave late last year.
The filing, submitted to the court by attorneys for the plaintiffs as part of a complaint, relates to the AI copyright case Kadrey v. Meta Platforms. It is one of several such cases unfolding in the U.S. court system, pitting AI companies against creators and other IP owners. For the most part, the defendants in these cases (AI companies) claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, YouTube, I think, may end up hosting some things that people pirate for a period of time, but YouTube is trying to take them down,” Zuckerberg said in his testimony, according to portions of a transcript made available Wednesday night. “And the vast majority of things on YouTube, I would guess, are kind of good and they’re allowed to do.”
Excerpts from Zuckerberg’s deposition provide some insight into his thinking about copyrighted content and fair use. However, a full transcript of the deposition has not been released. TechCrunch has reached out to Meta for additional context and will update the article if the company responds.
Based on the snippets of testimony, Zuckerberg appears to be defending Meta’s use of a dataset of e-books called LibGen to train its family of AI models known as Llama. Meta’s Llama competes with leading models from AI companies like OpenAI.
Self-described as a “link aggregator,” LibGen provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued numerous times, ordered shut down, and fined tens of millions of dollars for copyright infringement.
According to court filings unsealed this week, Zuckerberg reportedly approved using LibGen to train at least one of Meta’s Llama models despite concerns from the company’s AI executives and research teams about the legal implications.
A lawyer for the plaintiffs, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, said Meta employees referred to LibGen as a “dataset we know to be pirated” and warned that its use “may undermine [Meta’s] negotiating position with regulators,” according to a legal filing.
During his deposition, Zuckerberg claimed he “hadn’t really heard of” LibGen.
“I understand you’re trying to get me to give an opinion on LibGen, which I haven’t really heard of,” Zuckerberg said during the deposition. “I just have no knowledge of this particular thing.”
Under questioning from one of the plaintiffs’ lawyers, David Boies, Zuckerberg explained why it would be unreasonable to ban the use of a data set like LibGen.
“So, would I want to have a policy against people using YouTube because some of the content might be copyrighted? No,” he said. “[T]here are cases where having such a blanket ban may not be the right thing to do.”
Zuckerberg said Meta should be “very careful” about training on copyrighted material.
“You know, [if there’s] someone providing a website and deliberately trying to infringe on people’s rights… obviously that’s something we’d want to be careful about or careful about how we dealt with it or maybe even prevent our teams from dealing with him,” Zuckerberg said during his testimony, according to the transcript.
New complaints
The attorneys for the plaintiffs in Kadrey v. Meta Platforms have amended the complaint several times since it was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest amended complaint, filed by the plaintiffs’ attorneys late Wednesday, contains new allegations against Meta, including that the company cross-referenced some pirated books on LibGen with copyrighted books available for license. The lawyers claim that Meta used this tactic to determine whether it made sense to enter into a licensing agreement with a publisher.
Meta reportedly used LibGen to train its latest family of Llama models, Llama 3, according to the amended filing. The plaintiffs also allege that Meta is using the dataset to train its next-generation Llama 4 models.
According to the amended filing, Meta researchers allegedly tried to hide the fact that the Llama models were trained on copyrighted material by inserting “supervised samples” into Llama’s fine-tuning. Meta also downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, according to the amended complaint.
Z-Library, or Z-Lib, has been the subject of a number of legal actions by publishers, including domain seizures and takedowns. In 2022, the Russian nationals who allegedly maintained it were charged with copyright infringement, wire fraud and money laundering.