EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.
The dataset, called Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.
AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web (including copyrighted material such as books and research journals) to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.
EleutherAI argues that these lawsuits have "drastically decreased" transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.
"[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in," wrote Stella Biderman, EleutherAI's executive director, in a blog post on Hugging Face early Friday. "Researchers at some companies have also specifically cited lawsuits as the reason why they've been unable to release research they're doing in highly data-centric areas."
Common Pile v0.1, which can be downloaded from Hugging Face's AI dev platform and from GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open source speech-to-text model, to transcribe audio content.
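For readers who want to explore the corpus, a minimal Python sketch using Hugging Face's datasets library might look like the following; the repository ID shown is illustrative, as the article does not specify it.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it outright; the full
# collection weighs roughly 8 TB.
dataset = load_dataset(
    "common-pile/comma_v0.1_training_dataset",  # hypothetical repo ID
    split="train",
    streaming=True,
)

# Peek at a few records to inspect the text and its source metadata.
for record in dataset.take(3):
    print(record)
```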
EleutherAI claims that Comma v0.1-1T and Comma v0.1-2T are proof that Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of Common Pile v0.1, rival models such as Meta's first Llama AI model on benchmarks for coding, image understanding, and math.
Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
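As a toy illustration (not drawn from EleutherAI's release), a single PyTorch linear layer shows how quickly these weights add up:

```python
import torch.nn as nn

# A single fully connected layer mapping 1,024 inputs to 1,024 outputs.
layer = nn.Linear(1024, 1024)

# Every connection weight and every bias term counts as one parameter.
num_params = sum(p.numel() for p in layer.parameters())
print(f"{num_params:,} parameters")  # 1,049,600 = 1024*1024 weights + 1024 biases
```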
"In general, we think that the common idea that unlicensed text drives performance is unjustified," Biderman wrote in her post. "As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve."
Common Pile v0.1 appears to be in part an effort to right EleutherAI's historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.
Going forward, EleutherAI says it is committed to releasing open datasets more frequently, in collaboration with its research and infrastructure partners.
Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the dataset and model releases, but that their development involved many partners, including the University of Toronto, which helped lead the research.
