OpenAI already has its own web crawler

OpenAI's technology is behind some of the most relevant artificial intelligence services of the moment. Whether on its own with ChatGPT and GPT-4, or in collaboration with Microsoft (also a major investor in the company) through Bing, it rose to prominence with the launch of its chatbot last year and has not stopped gaining notoriety since, both for what it has already done and for its plans for the future.

The most notable is, of course, GPT-5, a trademark it registered just a few days ago and which will identify the next generation of its generative artificial intelligence model, the foundation of part of its services. A few months ago the rumor began to circulate that OpenAI would launch it before the end of this year, but it now seems the company will reserve the launch for later, although it is not clear whether that is to take more time to polish it or a response to the growing demand to slow down the evolution of AI until appropriate regulatory frameworks have been established.

If there is a key phase in the process of creating an AI model, it is undoubtedly its training, since the model's ability to respond afterwards depends directly on the quantity and quality of the data used. Thus, OpenAI and other companies specialized in AI work constantly on searching for and preparing the data that their models later ingest. That work, however, has also put these companies in the spotlight for the unauthorized use of copyrighted content.

OpenAI already has its own web crawler

As we can read on its website, the company seems to have found a way to kill two birds with one stone: OpenAI has launched its own web crawler, that is, a tool that automatically crawls and indexes the content of web pages. As you probably know, this is the same technology search engines use; in this case, however, its function will be to feed the company's artificial intelligence models with data.
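
To make the idea concrete, here is a minimal, purely illustrative sketch of what a crawler does in general terms (fetch a page, keep its text, queue the links it finds). It is not OpenAI's implementation, and the start URL is just a placeholder.

    # Minimal illustration of what a web crawler does: fetch a page,
    # record its text, and queue the links it finds. A sketch only;
    # it has nothing to do with OpenAI's actual crawler.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkAndTextParser(HTMLParser):
        """Collects outgoing links and visible text from one HTML page."""

        def __init__(self):
            super().__init__()
            self.links = []
            self.text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

        def handle_data(self, data):
            if data.strip():
                self.text.append(data.strip())


    def crawl(start_url, max_pages=5):
        """Breadth-first crawl that 'indexes' page text into a dict."""
        queue, seen, index = [start_url], set(), {}
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # skip pages that fail to load
            parser = LinkAndTextParser()
            parser.feed(html)
            index[url] = " ".join(parser.text)
            queue.extend(urljoin(url, link) for link in parser.links)
        return index


    if __name__ == "__main__":
        # Placeholder start URL for the example.
        pages = crawl("https://example.com/")
        print(list(pages))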

As with search engine robots, website administrators can block the OpenAI crawler, or specify that only certain pages of their site should be analyzed. In addition, OpenAI indicates that content behind paywalls, content containing personal information, and content that goes against its policies will not be indexed.
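
This opt-out works through the standard robots.txt file, the same mechanism search engine bots honor. Below is a small, hypothetical example, sketched in Python with the standard library's robotparser, of rules that keep OpenAI's crawler (which identifies itself with the GPTBot user agent, per OpenAI's documentation) out of everything except a public blog section; the domain and paths are invented for illustration.

    # Hypothetical robots.txt rules for OpenAI's crawler (user agent
    # "GPTBot", per OpenAI's documentation). Domain and paths are made up.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: GPTBot",
        "Allow: /blog/",   # only the public blog may be crawled
        "Disallow: /",     # everything else is off limits
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    print(parser.can_fetch("GPTBot", "https://example.com/blog/post-1"))   # True
    print(parser.can_fetch("GPTBot", "https://example.com/members/area"))  # False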

I say this is a very smart move that kills two birds with one stone because, on the one hand, it improves the discovery and indexing of information for training its models and, on the other, by allowing those crawls to be blocked, it gives OpenAI a tool to point to if it is ever accused of using content without authorization. A very, very smart move.