News

OpenAI sued for training ChatGPT with stolen personal data

A California law firm, Clarkson Law Firm, has filed a class action lawsuit against OpenAI for what it considers theft of personal data to train ChatGPT. In the lawsuit, filed in the U.S. District Court for the Northern District of California, the firm states that ChatGPT "uses stolen private information, including personally identifiable information, from hundreds of millions of Internet users, including children of all ages, without their informed consent or knowledge."

To train its large language model, OpenAI collected 300 billion words from the Internet, including personal information and posts from social networks such as Twitter and Reddit. According to Clarkson, OpenAI "did so in secret, and without registering as a data broker as required by current legislation."

The lawsuit also refers to the opaque privacy policies that ChatGPT users face today, but it focuses mostly on data pulled from the web that was never published with the intent of being shared with ChatGPT or other models. In addition, OpenAI, which has received billions in investment from Microsoft as well as revenue from ChatGPT Plus subscribers, has profited from this data without compensating its authors or the sources where it was published.

The lawsuit alleges fifteen violations, including invasion of privacy and negligence in the protection of personal data, as well as large-scale theft for obtaining huge amounts of personal data for model training. Some data sets, such as those from Common Crawl, Wikipedia, and Reddit, include personal information and are publicly available, provided that companies comply with the protocols established for the purchase and use of this data.

But OpenAI has apparently used this data in ChatGPT without users' permission or consent. Even though people's personal information is publicly visible on social networks, blogs, and articles, using that data outside the platform on which it was published can be considered a violation of their privacy.

In Europe, the GDPR establishes a legal distinction between data that is in the public domain and data that is free to use, but in the United States this is still being debated. Hence this lawsuit. It is not clear, however, whether the American legal system will accept it. For Ryan Clarkson, partner at the firm, it is very important to act now under existing laws rather than wait for federal legislation on the matter, because society cannot afford to "pay the consequences of negative outcomes with AI as we already have with social networks, or as we did with nuclear energy. As a society, the price we would have to pay is too high."
