We could run out of data to train AI language programs
Language models can be trained using text from news articles, scientific papers, books, and Wikipedia. These models are being trained with more data in an attempt to make them more accurate and flexible.
The trouble is, the types of data typically used for training language models may be used up in the near future, as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization, that has not yet been peer reviewed. Researchers are constantly looking for new texts to train on as they build stronger models with greater capabilities. Teven Le Scao, a researcher at the AI company Hugging Face, says that researchers working on large language models are increasingly worried about running out of this kind of data.
The problem stems partly from the fact that language AI researchers divide the data they use to train models into two categories: high quality and low quality. The line between the two can be blurry, says Pablo Villalobos, a staff researcher at Epoch and lead author of the paper, but text in the high-quality category is generally considered better written and is often produced by professional writers.
Data from low-quality sources includes texts such as social media posts or comments on websites like 4chan, and it vastly outnumbers the high-quality kind. Researchers typically train models only on data that falls within the high-quality category, because that is the type of language they want the models to reproduce. This approach has produced impressive results for large language models like GPT-3.
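To make the high-quality/low-quality split concrete, here is a minimal, hypothetical sketch of the kind of filtering step a data pipeline might apply before training. The heuristics below are invented purely for illustration; real pipelines (including the one behind GPT-3) rely on learned quality classifiers, not simple rules like these.

```python
# Toy quality filter: keep only text that "looks" professionally written.
# The thresholds and rules here are illustrative assumptions, not the
# criteria any real lab uses.

def looks_high_quality(text: str) -> bool:
    """Crude heuristic: long enough, mostly letters, ends in punctuation."""
    words = text.split()
    if len(words) < 8:
        return False
    clean = sum(c.isalpha() or c.isspace() for c in text)
    return clean / len(text) > 0.8 and text.strip()[-1] in ".!?"

corpus = [
    "The committee published its findings after a two-year review of the evidence.",
    "lol u wot m8",
    "Researchers trained the model on a curated collection of books and articles.",
]

# Only the high-quality texts would be fed to the language model.
high_quality = [t for t in corpus if looks_high_quality(t)]
```

In practice the filter is where the shortage bites: the more aggressively low-quality text is discarded, the smaller the usable pool becomes.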
One way to overcome these data constraints would be to reassess what counts as "low" and "high" quality, according to Swabha Swayamdipta, a University of Southern California professor who specializes in dataset quality and machine learning. If data shortages push AI researchers to incorporate more diverse datasets into the training process, Swayamdipta says, it would be a net positive for language models. Researchers may also find ways to extend the useful life of the data used to train language models. Currently, because of performance and cost constraints, large language models are trained on the same data only once. Swayamdipta says it may be possible to train a model several times over on the same data.
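The idea of reusing data is the familiar notion of training for multiple epochs. As a rough illustration (with an invented toy model, not a language model), the sketch below makes repeated passes over the same small dataset and records how the error keeps falling with each pass:

```python
# Toy example of multi-epoch training: a one-parameter-pair linear model
# fit by stochastic gradient descent, making 50 passes over the SAME data.
# Everything here is a hypothetical illustration of data reuse.

data = [(x, 2.0 * x + 1.0) for x in range(10)]  # targets follow y = 2x + 1
w, b = 0.0, 0.0   # model parameters, starting from scratch
lr = 0.01         # learning rate

def mean_squared_error(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

losses = []
for epoch in range(50):        # each epoch revisits the same examples
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x      # gradient step for the weight
        b -= lr * err          # gradient step for the bias
    losses.append(mean_squared_error(w, b))
```

Smaller models routinely benefit from many such passes; the open question Swayamdipta raises is whether the largest language models, which today see most of their data only once, could do the same without simply memorizing it.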
Some researchers believe that, when it comes to language models, bigger may not be better. Percy Liang, a computer science professor at Stanford University, believes that making models more efficient, rather than simply larger, may be the more promising path. He points out that smaller models trained on higher-quality data have been shown to outperform larger models trained on lower-quality data.