Daft turns raw multimodal data into vectors, labels, and structured outputs. No infrastructure management required.

AI Pipelines Made Simple.

Amazon · CloudKitchens · Essential AI · Together AI · ByteDance

One Engine for any modality

Power AI data pipelines in a single framework that combines ingestion, chunking, embeddings, LLM extraction, and multimodal transforms, with consistent behavior from local development to production.
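As a concrete illustration of the chunking step in such a pipeline, here is a minimal, framework-agnostic Python sketch (not Daft's API) that splits documents into overlapping character windows sized for an embedding model:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Step through the text, letting each chunk share `overlap` chars
    # with the previous one so no sentence is cut without context.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Daft turns raw multimodal data into vectors and structured outputs. " * 6
chunks = chunk_text(doc, chunk_size=120, overlap=30)
```

In a real pipeline this chunking step would run per row over ingested documents, with the resulting chunks fed to the embedding stage.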

Model-first by design

Get first-class operators for embeddings and structured outputs, enabling reliable model-on-data pipelines across millions of rows without stitching together separate ETL and LLM tools.

Minimal operations

Reduce operational overhead with built-in scaling, orchestration, logging, model execution control, and more, without managing infrastructure or glue code.

Unlock the full power of Daft with Daft Cloud

Why Daft OSS

  • Open source data engine with more than 5k GitHub stars for defining and running LLM-powered data pipelines.
  • Bring your own models, storage, and infrastructure to run pipelines exactly where and how you want.
  • Runs on your laptop for fast, iterative development, and in your own cluster when you need more scale.

Why Daft Cloud

  • Serverless platform for experimenting, deploying, and operating AI pipelines with zero infrastructure headaches.
  • Managed LLMs/VLMs and compute workers that scale automatically as your data grows.
  • Production-grade auth, observability, retries, and versioning for minimal overhead.

Use Cases

Large Scale Document Processing

  • Daft integrates seamlessly with AI models to create text embeddings at scale.
  • Using a pre-trained transformer model, Daft processes large document collections stored in cloud storage or on Hugging Face, converting text into embeddings with easy-to-use user-defined functions for downstream AI applications like semantic search or clustering.
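As a toy illustration of the user-defined-function pattern described above, the sketch below maps documents to deterministic unit vectors. The `embed` function is a hypothetical hash-based stand-in for a real pre-trained transformer model; it is not Daft's API:

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding': hash bytes mapped to a unit vector.
    Stands in for a transformer model in this sketch."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [digest[i] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

docs = ["semantic search", "clustering", "semantic search"]
embeddings = [embed(d) for d in docs]
# Identical documents map to identical vectors; all are unit length.
```

In a real pipeline, a function with this shape would be registered as a UDF and applied column-wise over millions of document rows.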

Native model operators

Run embeddings, LLM extraction, multimodal transforms, and structured outputs with first-class operators designed for model-driven pipelines.

Continuous freshness


Process only new or changed data and keep vectors, labels, and structured fields continuously fresh without unnecessary recompute.
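One common way to implement this kind of incremental freshness is content hashing: recompute a derived value only when a record's hash changes between runs. A minimal stdlib sketch (all names hypothetical; not Daft Cloud's implementation):

```python
import hashlib

def refresh(records: dict[str, str], cache: dict[str, str], compute) -> dict[str, str]:
    """Recompute derived values only for new or changed records.

    `records` maps id -> raw content; `cache` maps id -> content hash
    from the previous run. Returns this run's id -> hash map and calls
    `compute` only where the content actually changed.
    """
    new_hashes = {}
    for rid, content in records.items():
        h = hashlib.sha256(content.encode()).hexdigest()
        new_hashes[rid] = h
        if cache.get(rid) != h:
            compute(rid, content)  # e.g. re-embed just this record
    return new_hashes

recomputed = []
cache = refresh({"a": "v1", "b": "v1"}, {}, lambda rid, _: recomputed.append(rid))
cache = refresh({"a": "v2", "b": "v1"}, cache, lambda rid, _: recomputed.append(rid))
# Only "a" is recomputed on the second run; "b" is untouched.
```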

Local to production consistency


Define pipelines once and run them the same way everywhere, from OSS development to production scale in the Cloud.

Managed model runtime


Get automatic batching, validation, retries, and scaling so pipelines run reliably without maintaining infrastructure or orchestration.
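The batching-and-retry behavior described here can be sketched in plain Python. The helpers below (`batched`, `call_with_retries`) are hypothetical illustrations of the pattern, not Daft's runtime:

```python
import time

def call_with_retries(fn, *args, retries: int = 3, base_delay: float = 0.01):
    """Retry a flaky model call with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))

def batched(items, size):
    """Yield fixed-size batches so the model sees grouped inputs."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage: batch rows, then push each batch through a (possibly flaky) model call.
rows = list(range(10))
results = [call_with_retries(lambda b: [x * 2 for x in b], batch)
           for batch in batched(rows, 4)]
```

A managed runtime layers validation and autoscaling on top of these same primitives so pipeline authors never write this glue themselves.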

Integrated orchestration


Run pipelines continuously, on a schedule, or on events with built-in scheduling and lifecycle management.

Built-in observability


View logs, runs, and performance insights in one place so you can understand and debug pipelines with confidence.

Trusted by
“Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads.”

Tony Wang

Data @ Anthropic, PhD @ Stanford

“Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually.”

Patrick Ames

Principal Engineer @ Amazon

“Daft powers our large-scale daily jobs and production pipelines at web scale. For Essential-Web v1.0, we scaled our vLLM-inference pipeline to 32,000 sustained requests per second per VM. Daft's massively parallel compute, cloud-native I/O, and painless transition from local testing to seamless distributed scaling made this possible.”

Ritvik Kapila

ML Research @ Essential AI

“Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars have been replaced by simple Daft queries, running on internet-scale unstructured datasets.”

Maurice Weber

PhD AI Researcher @ Together AI

“Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integrations with Ray let us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more.”

Alexander Filipchik

Head Of Infrastructure at City Storage Systems (CloudKitchens)


Ecosystem

Interested in using Daft Cloud to manage your AI pipelines? Request a demo for early access.

Get updates, contribute code, or say hi.
Daft Engineering Blog
Join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.
GitHub Discussions forum
The Distributed Data Community Slack