The Internet Archive goes down as a result of an AI being trained

Many companies around the world are dedicated to offering AI services; in Spain alone, 185 companies provided solutions of this type at the end of 2022. Training an AI requires a lot of time, qualified personnel and an enormous amount of data for it to become a useful, reliable and profitable tool.

The main source of information that AI platforms tend to draw on is the Internet Archive, the large, free, non-profit digital library founded in 1996 to preserve the historical material and cultural heritage of the Internet. Because it is open source, any member of the community can take part by contributing valuable content.

More and more companies are training their AI tools with the service provided by the Internet Archive, which saturates it and stops it from working optimally, resulting in a poor experience for users. On May 29, this platform, which holds more than 800 billion pages, suffered an outage that was reported through the site's Twitter profile.

Apparently, the investigations pointed to an AI company that was using the archive's files to train its tool. The spotlight fell on an excessive surge of traffic from AWS, Amazon's cloud computing service. This is the second major incident of this type suffered by the Internet Archive, and it is causing massive and constant problems.

The impact of the new outage

Brewster Kahle, founder of the Internet Archive, released a statement reporting that tens of thousands of requests per second for its public domain OCR files had been launched from 64 virtual hosts on Amazon's AWS services. Many users were affected by not being able to use this non-profit platform.

The impact was such that the site went down completely for more than an hour, and the team responded by blocking all the IP addresses from which the requests came. Once the problem was solved, they detected another 64 IP addresses that had started the same activity, and although they worked out how to block them, they could not prevent those requests from causing a new crash of the site.

Who was responsible

The investigations have not yet determined who was behind this event, although the most consistent version is that it was an artificial intelligence company or, failing that, an AWS user that requires large amounts of information from the Internet Archive's library.

According to a recent analysis by The Washington Post, large amounts of web data are being used to train artificial intelligence, as in the case of Google's C4 (Colossal Clean Crawled Corpus), which draws on more than 15 million websites. AIs such as Meta's LLaMA have been trained this way, although problematic content that violated copyright was detected.

The situation the Internet Archive is going through calls into question how well web pages and servers are protected against AI companies that need data. These companies can make a website unavailable simply by accessing its data intensively. In the end it is a chain reaction: users cannot enjoy the information that they themselves have contributed because of the saturation those companies cause.

The key to the Internet Archive's success

This digital archive is made up of web pages, games, digital platforms and documents that have formed or continue to form part of the Internet. Through its Wayback Machine you can consult the more than 70 petabytes of data that it preserves, including historical game titles such as Pac-Man, The Secret of Monkey Island, Duke Nukem 3D or Astro Invader, without the need to download files.

It has recently announced an alliance with the company Cloudflare to launch the 'Always Online' service, so that if any website that uses Cloudflare's servers goes down, visitors can continue browsing its content or read the backup copy of the page archived in the Internet Archive until the service is restored.

Likewise, during the pandemic it promoted a digital lending library known as the National Emergency Library. It came to hold 1.5 million books, although some publishers sued the Internet Archive for violating copyright, which led it to close this section.