Le Monde
9 Jul 2024

It's the best-kept secret of the so-called "generative" artificial intelligence (AI) industry, the one behind the likes of ChatGPT, Gemini and Copilot. And it has nothing to do with computing power, the colossal size of these programs (hundreds of billions of parameters) or clever computer code. Of course, these aspects play a part in their success, but they have now become more or less public knowledge.

No, what the industry leaders OpenAI, Anthropic, Mistral and Microsoft have never revealed is their formula for building the collection of texts used to train their models. That collection is used to adjust the models' parameters so that they predict the best possible word to complete a sentence. Ingesting billions of texts is what lets a model identify statistical correlations, enabling it to generate new text in answer to users' queries.
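For readers who want a concrete picture of what "predicting the best possible word" from statistical correlations means, here is a deliberately tiny sketch in Python. It counts which word follows which in a toy corpus and then completes a prompt with the most frequent continuation; the models named above do this with billions of learned parameters rather than simple counts, so this is an illustration of the principle, not their method.

```python
# Toy illustration of next-word prediction: count which word follows which
# in a small corpus, then extend a prompt with the most likely next word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Bigram statistics: for each word, how often each candidate next word follows it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def complete(prompt, length=4):
    """Extend a prompt by repeatedly picking the most frequent next word."""
    words = prompt.split()
    for _ in range(length):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Completes the prompt with the statistically most likely continuations
# observed in the toy corpus, e.g. "the dog sat on the ..."
print(complete("the dog"))
```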

The origin of these texts is well known: public-domain books, research articles, Wikipedia and, above all, masses of web pages, this last source accounting for the majority of the material. What makes the difference is the way it is processed. "It's the sinews of war," explained Julien Launay, founder of the company Adaptive ML and co-author of RefinedWeb, a training corpus of web data built while he worked at LightOn. He recalled the surprise his December 2022 presentation caused at NeurIPS, the field's flagship conference, in New Orleans: the care taken in preparing the data enabled an AI to match competitors fed with data from more varied sources.

80,000 hours of calculations

Thomas Wolf, the co-founder of Hugging Face, a Franco-American platform making open-source models and corpora available to all, was present at the conference and invited Launay's team to join his company. One of its members, Guilherme Penedo, accepted the offer, excited by the idea of making a corpus even larger than RefinedWeb available.

"We thought we could do it in 10 days," recalled Wolf, but it ended up taking 15 times longer. Then, April 21 saw the release of FineWeb, a monster weighing in at 40 terabytes (TB), rich in 15,000 billion tokens of 3-4 letters. Downloadable for free, it allows users to create better models than with any other public corpus.

Building such an object is hard work: it required 80,000 hours of computation on Nvidia's H100 graphics cards, roughly what it takes to train a good AI model.

First, the data has to be collected. Since 2007, a foundation has made Common Crawl, a regularly updated collection of billions of web pages, available to everyone. But to be useful for language models, the text alone has to be extracted from this mass of information. "This was one of the longest steps in the process, maybe 80% of the computing time, which we started in November 2023," said Penedo. Ninety-six batches of pages, collected by Common Crawl over some 15 years and weighing around 5,354 TB, were used.
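To make the "extract only the text" step concrete, here is a minimal Python sketch that reads one Common Crawl WARC archive and pulls the main text out of each HTML response. The libraries shown (warcio and trafilatura) are plausible open-source choices for this kind of extraction, assumed for illustration rather than confirmed as the exact tools behind FineWeb.

```python
# Read a Common Crawl WARC file and extract the main text of each HTML page.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_texts(warc_path):
    """Yield the main textual content of every HTML response in a WARC archive."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # strips menus, ads and other boilerplate
            if text:
                yield text

# Usage (hypothetical file name): pass the path of a WARC segment downloaded from Common Crawl.
# for page_text in extract_texts("CC-MAIN-example-00000.warc.gz"):
#     print(page_text[:200])
```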
