OpenAI Allegedly Trains GPT-4 Using Unauthorized YouTube Video Transcripts




Training a new large language model (LLM) requires a great deal of data, and many companies developing their latest AI are starting to struggle to find quality data. According to a Wall Street Journal report, OpenAI used data from 1 million YouTube videos without permission to train GPT-4.


OpenAI is said to have used Whisper, its speech recognition model, to produce transcripts of the videos. The transcripts Whisper generated were then used to train GPT-4. Doing so violates YouTube's terms and conditions and uses the intellectual property of the site's creators without permission.



In a statement given to The Verge, OpenAI said it uses data from various public sources and obtains data that is not publicly available through partnerships.


According to the WSJ, Google does the same thing, but uses only certain YouTube videos, in accordance with the terms and conditions agreed to by their owners.


Last week, YouTube's CEO said that if OpenAI used YouTube videos to train Sora, it would violate the site's terms and conditions. OpenAI has never admitted to using YouTube videos to train Sora. The use of copyrighted material to train AI became a heated issue last year, when several lawsuits were filed against OpenAI by prominent authors and media companies for allegedly training AI on their work without permission.
