From raw multimodal data to training-ready datasets. In one pipeline, at any scale.

The data engine built for AI.

Trusted by Amazon, Atoms, Essential AI, Together AI, and ByteDance.

Multimodal-native

Process video, images, audio, and sensor data alongside structured metadata in a single dataframe.

CPU and GPU in one pipeline

Run GPU inference and embeddings alongside CPU decode and filtering in one pipeline. Daft handles the scheduling and batching; no glue code required.

Python dataframe API

Same operations you use in Pandas or Spark: filter, transform, aggregate, write. No new framework to learn.
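
For example, a minimal sketch of a typical pipeline (the bucket path and column names are illustrative):

```python
import daft

# Read a dataset of captions from object storage (hypothetical path and schema)
df = daft.read_parquet("s3://my-bucket/captions/*.parquet")

# The usual dataframe verbs: filter, transform, write
df = df.where(daft.col("caption").str.length() > 10)
df = df.with_column("caption_lower", daft.col("caption").str.lower())
df.write_parquet("s3://my-bucket/cleaned/")
```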

Why Daft

  • Open-source data engine (Apache 2.0) with more than 5k GitHub stars, in production at companies like Amazon and Essential AI.
  • Bring your own models, storage, and infrastructure. Run pipelines exactly where and how you want.
  • If you know Pandas or Spark, you know Daft. Load a dataset, filter, transform, and export to training format in a few lines.

Native model operators

Embeddings, LLM extraction, and structured outputs as first-class operations. Plug in models from OpenAI, Hugging Face, or your own.
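
As one sketch of plugging in your own model, here is a stateful embedding UDF. It assumes sentence-transformers is installed; the model choice and column names are illustrative, and Daft's built-in operators may expose this differently:

```python
import daft

@daft.udf(return_dtype=daft.DataType.fixed_size_list(daft.DataType.float32(), 384))
class EmbedText:
    def __init__(self):
        # Loaded once per worker rather than once per row
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

    def __call__(self, texts: daft.Series):
        # Encode a whole batch of strings at once and return one vector per row
        return self.model.encode(texts.to_pylist()).tolist()

df = df.with_column("embedding", EmbedText(daft.col("caption")))
```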

Multimodal column types

Images, video, audio, text, and embeddings as native column types. Decode, transform, and filter them like any other column.
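
For instance, a short sketch of building and transforming an image column (the dataset path and URL column name are illustrative):

```python
import daft

df = daft.read_parquet("s3://my-bucket/images.parquet")  # hypothetical dataset

# Download bytes and decode them into a native image column
df = df.with_column("image", daft.col("image_url").url.download().image.decode())

# Then transform it like any other column
df = df.with_column("thumbnail", daft.col("image").image.resize(256, 256))
```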

Local to production consistency

Define pipelines once. Run them on your laptop or scale across a cluster. Same code, no rewrites.
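
A sketch of what that looks like in practice. The Ray cluster address is illustrative, and the runner-selection call reflects recent releases:

```python
import daft

# Comment this line out and the same script runs on the local runner instead
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://my-bucket/data/*.parquet")
df.where(daft.col("score") > 0.5).write_parquet("s3://my-bucket/filtered/")
```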

Managed UDF runtime

Automatic batching, retries, and error handling for model UDFs. Zero-copy execution powered by Apache Arrow.
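
To illustrate the Arrow-backed batching, a hedged sketch of a UDF that operates on whole columnar batches (the column name and kernel are illustrative):

```python
import daft
import pyarrow.compute as pc

@daft.udf(return_dtype=daft.DataType.float64())
def normalize(scores: daft.Series):
    # UDFs receive whole Arrow-backed batches, so columnar kernels run
    # over them without per-row Python loops or extra copies
    arr = scores.to_arrow()
    return pc.divide(arr, pc.max(arr))

df = df.with_column("norm_score", normalize(daft.col("score")))
```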

Lower memory footprint

Run the same queries with 5x less memory than alternatives. Jobs that would OOM on Spark or Pandas just work.

Built in Rust for speed

Daft's core is written in Rust. Decode video, run transforms, and join multimodal data at TB scale without paying Python's overhead.

Trusted by
“Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads.”

Tony Wang

Data @ Anthropic, PhD @ Stanford

“Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually.”

Patrick Ames

Principal Engineer @ Amazon

“Daft powers our large-scale daily jobs and production pipelines at web scale. For Essential-Web v1.0, we scaled our vLLM-inference pipeline to 32,000 sustained requests per second per VM. Daft's massively parallel compute, cloud-native I/O, and painless transition from local testing to seamless distributed scaling made this possible.”

Ritvik Kapila

ML Research @ Essential AI

“Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars have been replaced by simple Daft queries, running on internet-scale unstructured datasets.”

Maurice Weber

PhD AI Researcher @ Together AI

“Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integration with Ray lets us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more.”

Alexander Filipchik

Head Of Infrastructure @ Atoms

Get updates, contribute code, or say hi.
Daft Engineering Blog
Join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.
GitHub Discussions Forum
The Distributed Data Community Slack