A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI is embroiled in lawsuits brought by authors, programmers, and other rights holders who accuse the company of using their works, such as books and code, to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there is no carve-out in U.S. copyright law for training data.
The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data that has been "memorized" by models behind an API, such as OpenAI's.
Models are prediction engines. Trained on large amounts of data, they learn patterns, which is how they are able to generate essays, photos, and more. Most of their outputs are not verbatim copies of the training data, but because of the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.
The study's method relies on words the co-authors call "high-surprisal," that is, words that stand out as unusual in the context of a larger body of text. For example, the word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it is statistically less likely than words such as "engine" or "radio" to appear before "humming."
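The idea of surprisal can be made concrete with a toy sketch. In practice surprisal is computed from a language model's probability estimates; the probabilities below are invented purely for illustration and do not come from the study.

```python
import math

# Invented conditional probabilities for words appearing before
# "humming" -- a stand-in for estimates from a real language model.
P_BEFORE_HUMMING = {"engine": 0.40, "radio": 0.35, "radar": 0.02}

def surprisal(word: str, dist: dict) -> float:
    """Surprisal in bits: -log2 of the word's probability in context."""
    return -math.log2(dist[word])

# Rank words from most to least surprising; the least likely word
# ("radar") has the highest surprisal, making it "high-surprisal."
ranked = sorted(P_BEFORE_HUMMING,
                key=lambda w: surprisal(w, P_BEFORE_HUMMING),
                reverse=True)
```

Under these made-up numbers, "radar" tops the ranking, which is the sense in which it stands out against the surrounding text.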
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization, removing high-surprisal words from excerpts of fiction books and New York Times pieces and having the models try to "guess" the masked words. If a model guessed correctly, the co-authors concluded, it likely memorized the excerpt during training.
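The probe's logic can be sketched in a few lines. This is a minimal illustration, not the study's code: `guess_masked_word` is a hypothetical stand-in for a call to the model under test, and the hard-coded guess simply simulates a model that recovers the word.

```python
def mask_word(snippet: str, target: str, mask: str = "[MASK]") -> str:
    """Replace the high-surprisal target word with a mask token."""
    return snippet.replace(target, mask, 1)

def guess_masked_word(masked_snippet: str) -> str:
    # Hypothetical stand-in: a real probe would send the masked
    # snippet to the model's API and return its prediction.
    return "radar"

def memorization_signal(snippet: str, target: str) -> bool:
    """True if the model recovers the masked word exactly -- taken as
    evidence the snippet may have been seen during training."""
    return guess_masked_word(mask_word(snippet, target)) == target

sentence = "Jack and I sat perfectly still with the radar humming"
```

A correct guess on a word that is statistically unlikely in context is the signal: an unfamiliar model should rarely recover it, so consistent success suggests the excerpt was in the training data.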
According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset called BookMIA that contains samples of copyrighted ebooks. The results also indicated that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models might have been trained on.
"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that let copyright owners flag content they would prefer the company not use for training purposes, it has lobbied several governments to codify "fair use" rules around AI training.
