Apple, NVIDIA, Anthropic and several other tech giants have been found to use a dataset built from YouTube videos to train their artificial intelligence systems and language models.
The dataset contains transcripts and subtitles gathered from more than 170,000 YouTube videos, which these companies then use to train their respective models. It is provided by EleutherAI and is aimed at researchers and academics who train artificial intelligence models.
Offering the dataset already violates YouTube's terms of service, and many content publishers object to the practice, which they liken to taking their content without permission and using it for other purposes. What is more disappointing is that giant technology companies such as Apple and NVIDIA are also using this dataset. Apple has published details of its language model OpenELM, which also appears to have been trained on this dataset.
Today, many technology companies, including OpenAI, do not disclose the sources of the datasets used to train their artificial intelligence systems. This lack of transparency around the unauthorized use of content has also led to lawsuits between content publishers and technology companies. Small publishers, however, are left at a disadvantage when these companies simply take and use their content.
YouTube has previously stated that using its videos, including transcripts and subtitles, to train artificial intelligence models is a violation of the platform's terms of use.