The race to build artificial general intelligence (AGI) is being waged by companies such as Google, OpenAI and Meta. According to OpenAI and DeepMind, we are currently at the second of five stages on the path to AGI. Building AGI requires vast amounts of training data, but researchers are already struggling to obtain more of it because of intellectual property disputes and a dwindling supply of fresh material. One proposed solution is synthetic data: data generated by AI itself, modelled on real-world data.
But according to a study by University of Oxford researchers, AI models will be "destroyed" if trained on synthetic data generated by previous AI models. This happens because a large language model (LLM) only ever learns part of the distribution of the data it is given, and rare information in the tails of that distribution is the first to be lost. After several generations of training on synthetic data, the model collapses and the answers it produces become nonsense.
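To give an intuition for the mechanism, here is a minimal toy sketch (our own illustration, not code from the study): each "generation" simply resamples from the previous generation's output, the crudest possible stand-in for a model trained on its predecessor's synthetic data. The sample size and generation count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500
data = rng.normal(0.0, 1.0, size=N)  # generation 0: "human-written" data

for gen in range(1, 101):
    # Each new "model" sees only the previous generation's output and,
    # as a crude stand-in for training plus generation, resamples from it.
    data = rng.choice(data, size=N, replace=True)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: unique values={len(np.unique(data)):3d}  "
              f"std={data.std():.3f}  max|x|={np.abs(data).max():.2f}")
```

Run this and the number of distinct values and the spread of the data shrink generation after generation: rare, extreme values disappear first, which mirrors the early stage of the collapse the Oxford team describes, long before the output turns into outright nonsense.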
Among the examples given is a question about the architecture of a 14th-century church. A first-generation LLM could still provide a relevant answer. But after repeated training on synthetic data, the ninth-generation LLM responded with text about rabbits, even inventing a species that does not exist. A model in this state cannot be trusted or used at all.
The study was motivated by the fact that AI companies now train their models on data scraped from websites without permission. The problem is that a growing share of websites use AI to write their articles. Scraping this data without permission and without checking its authenticity means that new models trained on it will face the same collapse.
The researchers suggest that future LLMs be trained only on data that has been carefully screened and curated, before model collapse sets in. What happens to these AI models closely resembles what happens to humans who marry only close relatives. In humans, the practice of marrying close relatives increases genetic defects; in the most prominent case it produced the "Habsburg jaw" and contributed to the downfall of the Habsburg royal dynasty in Europe.
This study was published in the journal Nature.