From raw multimodal data to training-ready datasets. In one pipeline, at any scale.

The data engine built for AI.

Trusted by Amazon, Atoms, Essential AI, Together AI, and ByteDance.

Multimodal-native

Process video, images, audio, and sensor data alongside structured metadata in a single dataframe.

CPU and GPU in one pipeline

Run GPU inference and embeddings alongside CPU decode and filtering in one pipeline. Daft handles the scheduling and batching; no glue code required.

Python dataframe API

Same operations you use in Pandas or Spark: filter, transform, aggregate, write. No new framework to learn.
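
For example, a minimal sketch of a typical pipeline (the bucket path and column names are illustrative):

```python
import daft

# Read a dataset of captions from object storage (hypothetical path and schema)
df = daft.read_parquet("s3://my-bucket/captions/*.parquet")

# The usual dataframe verbs: filter, transform, write
df = df.where(daft.col("caption").str.length() > 10)
df = df.with_column("caption_lower", daft.col("caption").str.lower())
df.write_parquet("s3://my-bucket/cleaned/")
```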

Why Daft

  • Open-source data engine (Apache 2.0) with more than 5k GitHub stars, in production at companies like Amazon and Essential AI.
  • Bring your own models, storage, and infrastructure. Run pipelines exactly where and how you want.
  • If you know Pandas or Spark, you know Daft. Load a dataset, filter, transform, and export to training format in a few lines.

Native model operators

Embeddings, LLM extraction, and structured outputs as first-class operations. Plug in models from OpenAI, Hugging Face, or your own.
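
As one sketch of plugging in your own model, here is a stateful embedding UDF. It assumes sentence-transformers is installed; the model choice and column names are illustrative, and Daft's built-in operators may expose this differently:

```python
import daft

@daft.udf(return_dtype=daft.DataType.fixed_size_list(daft.DataType.float32(), 384))
class EmbedText:
    def __init__(self):
        # Loaded once per worker rather than once per row
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

    def __call__(self, texts: daft.Series):
        # Encode a whole batch of strings at once and return one vector per row
        return self.model.encode(texts.to_pylist()).tolist()

df = df.with_column("embedding", EmbedText(daft.col("caption")))
```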

Multimodal column types

Images, video, audio, text, and embeddings as native column types. Decode, transform, and filter them like any other column.
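
For instance, a short sketch of building and transforming an image column (the dataset path and URL column name are illustrative):

```python
import daft

df = daft.read_parquet("s3://my-bucket/images.parquet")  # hypothetical dataset

# Download bytes and decode them into a native image column
df = df.with_column("image", daft.col("image_url").url.download().image.decode())

# Then transform it like any other column
df = df.with_column("thumbnail", daft.col("image").image.resize(256, 256))
```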

Local to production consistency

Define pipelines once. Run them on your laptop or scale across a cluster. Same code, no rewrites.
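
A sketch of what that looks like in practice. The Ray cluster address is illustrative, and the runner-selection call reflects recent releases:

```python
import daft

# Comment this line out and the same script runs on the local runner instead
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://my-bucket/data/*.parquet")
df.where(daft.col("score") > 0.5).write_parquet("s3://my-bucket/filtered/")
```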

Managed UDF runtime

Automatic batching, retries, and error handling for model UDFs. Zero-copy execution powered by Apache Arrow.
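
To illustrate the Arrow-backed batching, a hedged sketch of a UDF that operates on whole columnar batches (the column name and kernel are illustrative):

```python
import daft
import pyarrow.compute as pc

@daft.udf(return_dtype=daft.DataType.float64())
def normalize(scores: daft.Series):
    # UDFs receive whole Arrow-backed batches, so columnar kernels run
    # over them without per-row Python loops or extra copies
    arr = scores.to_arrow()
    return pc.divide(arr, pc.max(arr))

df = df.with_column("norm_score", normalize(daft.col("score")))
```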

Lower memory footprint

Run the same queries with 5x less memory than alternatives. Jobs that would OOM on Spark or Pandas just work.

Built in Rust for speed

Daft's core is written in Rust. Decode video, run transforms, and join multimodal data at TB scale without paying Python's overhead.

Trusted by
“Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads.”

Tony Wang

Data @ Anthropic, PhD @ Stanford

“Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually.”

Patrick Ames

Principal Engineer @ Amazon

“Daft powers our large-scale daily jobs and production pipelines at web scale. For Essential-Web v1.0, we scaled our vLLM-inference pipeline to 32,000 sustained requests per second per VM. Daft's massively parallel compute, cloud-native I/O, and painless transition from local testing to seamless distributed scaling made this possible.”

Ritvik Kapila

ML Research @ Essential AI

“Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars have been replaced by simple Daft queries, running on internet-scale unstructured datasets.”

Maurice Weber

PhD AI Researcher @ Together AI

“Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integration with Ray lets us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more.”

Alexander Filipchik

Head Of Infrastructure @ Atoms

Get updates, contribute code, or say hi.
Daft Engineering Blog
Join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.
GitHub Discussions Forum
The Distributed Data Community Slack