Though freely available, these data sets are rife with improperly licensed data, according to one of the most expansive research projects examining the widely used collections.
Organized by a group of machine-learning engineers and legal experts, the Data Provenance Initiative looked at the specialized data used to teach AI models to excel at a particular task, a process called “fine-tuning.” They audited more than 1,800 fine-tuning data sets on sites such as Hugging Face, GitHub and Papers With Code, which joined Facebook AI in 2019, and found that about 70 percent didn’t specify what license should be used or had been mislabeled with more-permissive guidelines than their creators intended.
The advent of chatbots that can answer questions and mimic human speech has kicked off a race to build bigger and better generative AI models. It has also triggered questions around copyright and fair use of text taken off the internet, a key component of the massive corpus of data required to train large AI systems.
But without proper licensing, developers are in the dark about potential copyright restrictions, limitations on commercial use or requirements to credit a data set’s creators.
“People couldn’t do the right thing, even if they wanted to,” said Sara Hooker, co-author of the initiative’s report on their findings and head of Cohere for AI, a research lab.
Shayne Longpre, a PhD candidate at the MIT Media Lab who researches large language models and led the audit, said that hosting sites allow users to identify licenses when they upload a data set and should not be blamed for mistakes or omissions.
The lack of proper documentation is a community-wide problem that stems from modern machine-learning practices, Longpre said. Data archives are often combined, repackaged and re-licensed numerous times. Researchers trying to keep up with the pace of new releases may wind up skipping steps, such as documenting data sources, or may be intentionally obscuring information as a form of “data laundering,” he said.
An interactive website lets users explore the contents of the data sets analyzed in the audit, some of which have been downloaded hundreds of thousands of times.
Hugging Face has found that data sets have better documentation when they are open, consistently used and shared, said Yacine Jernite, leader of its machine-learning and society team. The open-source company has prioritized efforts to improve documentation, such as automatically suggesting metadata. Even with imperfect annotation, openly accessible data sets are the first meaningful step toward more transparency in the field, Jernite said.
Some of the most used fine-tuning collections began as data sets created by companies such as OpenAI and Google. A growing number are machine-made data sets created using OpenAI’s models. Leading AI labs, including OpenAI, prohibit using the output from their tools to develop competing AI models but allow some noncommercial uses.
GitHub and Google declined to comment for this article. OpenAI and Meta did not immediately respond to requests for comment.
AI companies have grown increasingly secretive about the data they use to train and refine popular AI models. The goal of the new research is to offer engineers, policymakers and lawyers visibility into the murky ecosystem of data fueling the generative AI gold rush.
The initiative arrives just as tensions between Silicon Valley and data owners hurtle toward a tipping point. Major AI companies are facing a flurry of copyright lawsuits from book authors, artists and coders. Meanwhile, publishers and social media forums are threatening to withhold data amid closed-door negotiations.
The initiative’s explorer tool notes that the audit does not constitute legal advice. Longpre said the tools were designed to help people stay informed, not to dictate which license is appropriate or advocate a particular policy or position.
As part of the analysis, researchers also tracked patterns across data sets, including the years that the data was collected and the geographic location of data set creators.
Roughly 70 percent of data set creators came from academia, while about 10 percent of data sets were built by industry labs at companies such as Meta. One of the most common sources of data was Wikipedia, followed by Reddit and Twitter (now known as X).
A Washington Post analysis of Google’s C4 data set found that Wikipedia was the second-most-prevalent website among 15 million domains. Reddit recently threatened to block search crawlers from Google and Bing, risking a loss of search traffic, if leading AI companies won’t pay for its data to train their models, The Post reported last week.
The Data Provenance group’s analysis also offered new insight into the limitations of commonly used data sets, which contained far less representation of languages spoken in the Global South than of those spoken in English-speaking and Western European countries.
But the group also found that even when the Global South did have language representation, the data set “almost always originates from North American or European creators and web sources,” its paper said.
Hooker said she hoped the project’s tools will expose prime areas for future research. “Data set creation is typically the least glorified part of the research cycle and deserves to have attribution because it takes so much work,” she said. “I love this paper because it’s grumpy but it also proposes a solution. We have to start somewhere.”