
24 Trillion Tokens, 0 Crashes
How Essential AI Built Essential-Web v1.0 with Daft
by Jay Chia

Essential AI leveraged Daft's data engine to process a massive web-scale dataset for large language model (LLM) training.
- 24 trillion tokens processed
- 23.6 billion LLM queries in 7 days
- 0 crashes
- 90K AMD MI300X GPU hours
"Daft powers our large-scale daily jobs and production pipelines at web scale. For Essential-Web v1.0, we scaled our vLLM-inference pipeline to 32,000 sustained requests per second per VM. Daft's massively parallel compute, cloud-native I/O, and painless transition from local testing to seamless distributed scaling made this possible." (Ritvik Kapila, Essential AI)

What is Essential-Web?
Essential-Web v1.0, the new dataset from Essential AI (https://www.essential.ai/), is a massive, fully labeled collection designed to give researchers instant access to clean, richly annotated training data spanning domains like science, medicine, code, and more. Sourced from 24 trillion tokens and organized with a detailed taxonomy, it makes domain-specific data extraction fast and simplifies AI training and curation.
With Essential-Web, researchers can now use simple filters to extract domain-specific corpora in minutes. For example:
- A math dataset focused on reasoning and correctness
- A medical QA corpus with high-quality source material
- A web code dataset with spam and boilerplate removed
Daft's speed, scalability, and efficiency powered the end-to-end data processing needed to bring Essential-Web v1.0 to life.
Why Daft
The Essential AI data team runs massive jobs every day over hundreds of terabytes of data. They process this web-scale content to extract reliable, high-quality information for training large language models. For Essential-Web v1.0, this meant running inference on 24 trillion tokens across 23.6 billion documents from CommonCrawl — a scale that demands both efficiency and robustness. To make this possible, the team chose Daft based on four key advantages:
- Seamless scaling from single VM to distributed: start development locally and scale to production without code changes
- Fast iteration and debugging: significantly improved development velocity with intuitive error handling
- Cloud-native architecture: native support for cloud storage and async operations
- Python-first design: familiar API with powerful async user-defined function (UDF) support to build custom inference logic
Daft's execution model let the team iterate quickly on Essential-Web's taxonomy labeling while handling unprecedented scale, making it the clear choice for their vLLM-inference pipeline.