Serverless AI Ingestion.

Ingest and index unstructured/multimodal data for AI search and retrieval. From any source to any destination. Backed by a managed inference platform built to scale with your data volume.

                                                                                     
                        ╔════════ daft cloud ═════════╗                           
                        ║                              ║                           
                        ║     ┌───────────────────┐    ║                           
                        ║     │ Managed Inference │    ║                           
                        ║     └───────▲─┬─────────┘    ║  
                        ║             │ │              ║   ┏━ ≡ Data Sink  ━━━━━━━━┓
  ┏━ ≡ Data Source ━┓   ║  ┌──────────┴─▼────────────┐ ║   ┃  VectorDB/Postgres    ┃
  ┃  S3/HTTP/Queue  ┃─────►│ Managed Compute Workers ├────►┃  /S3/Parquet/Iceberg  ┃
  ┗━━━━━━━━━━━━━━━━━┛   ║  └─────────────────────────┘ ║   ┗━━━━━━━━━━━━━━━━━━━━━━━┛
                        ║                              ║                           
                        ╚══════════════════════════════╝                           
  • Build AI products, not data plumbing and ETL work

    Connect to your data storage for managed ingestion, scaling, and reliability. Stop wiring queues and GPUs together and focus instead on your models, prompts, and AI products.

  • Managed Inference, not carefully tuning tokens/s

    Optimized for high-throughput batch inference, not low-latency chat. Run embedding, vision, and LLM models on a managed inference service that autoscales with your data volume.

  • Serverless, not right-sizing GPU clusters

    Scale effortlessly with zero engineering ops. Workers and models autoscale with data volume, retries, and I/O backpressure. Deploy and experiment faster without worrying about infra.

  • Versioned with Git, not ad-hoc notebooks and scripts

    Ship fresh, clean data that your AI agents can rely on. Test new pipeline versions on real data, then deploy and backfill knowing exactly which version produced which outputs.


How it works

1. Define your pipeline in Python

Start by defining your data pipeline as a simple Python function using the Daft API.

Functions can take data sources and sinks as inputs! This makes them easily unit-testable in your repo, just like any normal Python function.

You can read files, preprocess data, perform inference using built-in models, and write the results to a sink — all in just a few lines.

import daft
from daft import functions as F  # built-in inference functions used below (assumed import path)


def my_pipeline(src: daft.DataFrame, sink: daft.Catalog):
    # Apply preprocessing: file I/O, image decoding, chunking, PDF parsing,
    # or arbitrary Python functionality (parse_pdf here is a user-defined helper)...
    df = src.with_column("text", parse_pdf(src["file"]))
    
    # Call models via the Daft Inference Platform
    df = df.with_column("embeddings", F.embed_text(df["text"]))
    df = df.with_column("label", F.classify_text(df["text"], ["math", "code", "prose"]))

    # Yield outputs and perform streaming writes to your sink
    table = sink.get_table("my_table")
    table.write(df)
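
Because the pipeline is a plain Python function, it can be exercised locally before it ever runs in Daft Cloud. The sketch below is illustrative only: it swaps in an in-memory DataFrame for the source and a hand-rolled stub for the sink. StubTable, StubSink, and test_my_pipeline are made-up names, and the inference calls inside my_pipeline may need to be stubbed or monkeypatched to run fully offline.

import daft

# Stub sink exposing the same get_table()/write() surface the pipeline expects.
# (Illustrative only; Daft Cloud injects real source and sink objects at run time.)
class StubTable:
    def __init__(self):
        self.rows = []

    def write(self, df: daft.DataFrame):
        self.rows.extend(df.to_pylist())

class StubSink:
    def __init__(self):
        self.table = StubTable()

    def get_table(self, name: str):
        return self.table

def test_my_pipeline():
    src = daft.from_pydict({"file": ["a.pdf", "b.pdf"]})  # tiny in-memory source
    sink = StubSink()
    my_pipeline(src, sink)
    assert len(sink.table.rows) == 2
    assert "embeddings" in sink.table.rows[0]
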
2. Run in Daft Cloud

Now, just as you would with any other code, push it to GitHub. Daft Cloud has permission to read your Git repo and uses Git to access your code and version your pipelines.

1. Select the appropriate data sources and data sinks. Daft Cloud takes care of authentication and ensures they are valid before passing them into your Python function.

2. Choose how you want to run your pipeline. Several execution modes are available:

  • Adhoc: Run manually, useful for one-time or experimental workloads
  • Scheduled: Cron-like, runs every hour/day
  • Incremental: Process only new or changed data (e.g. a new object created in AWS S3; see the conceptual sketch after this list)
  • Triggered: Runs the pipeline on batches of HTTP calls, based on a predetermined batch size and runtime SLO
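
To make the Incremental mode concrete, here is what incremental processing over S3 means conceptually: each run picks up only objects created after the previous run's checkpoint, then advances the checkpoint. This is a hand-rolled illustration using boto3 with made-up bucket and prefix names, not Daft Cloud code; Daft Cloud tracks these offsets for you.

import boto3
from datetime import datetime, timezone

def list_new_objects(bucket: str, prefix: str, last_checkpoint: datetime) -> list[str]:
    """Return keys of objects created after the previous run's checkpoint."""
    s3 = boto3.client("s3")
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_checkpoint:
                new_keys.append(obj["Key"])
    return new_keys

# Each incremental run processes only what changed since the last checkpoint,
# then advances the checkpoint; Daft Cloud manages this state for you.
checkpoint = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys = list_new_objects("my-bucket", "documents/", checkpoint)
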
3. Observe, debug, iterate

  • Per-run and per-record metrics (throughput, error rates, cost indicators). Daft exposes useful metrics to help you understand your pipelines: how fast throughput is, how much it is costing you, and what the potential bottlenecks are to scaling up even further.
  • Error captures with payload snippets and stack traces. Daft helps you debug your code without failing the entire pipeline. Bugs in your code are surfaced as stack traces with lineage to help you track down the bad code or data (sketched after this list).
  • Safe replays, backfills, and migrations to new model versions. Daft versions your pipelines and makes it easy to run replays, backfills, and migrations. This helps you experiment with new models and techniques without disturbing your old data.
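
The error-capture behavior can be pictured as a per-record try/except that routes failures to a side output instead of raising, so one bad record never fails the whole run. The sketch below is a generic Python illustration of that idea (safe_apply is a made-up helper), not Daft Cloud's internal implementation.

import traceback

def safe_apply(records, fn):
    """Apply fn to each record; collect failures with a payload snippet and stack trace."""
    results, errors = [], []
    for record in records:
        try:
            results.append(fn(record))
        except Exception:
            errors.append({
                "payload_snippet": str(record)[:200],   # truncated for readability
                "stack_trace": traceback.format_exc(),
            })
    return results, errors

# Good records flow on to the sink; errors are surfaced for debugging,
# with enough lineage (the payload snippet) to track down bad code or data.
ok, failed = safe_apply(["valid text", None], lambda t: t.upper())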

How Daft Cloud is built

  1.

    Ingestion & Orchestration Layer

    • Connects to data sources: S3, GCS, Supabase Storage, SQL databases, HTTP endpoints
    • Executes Python-defined pipelines for chunking, preprocessing, and transformation
    • Handles scheduling, autoscaling, backpressure, retries, and observability
    • Control plane: Tracks pipeline definitions, offsets, and run status, planning work units and assigning them to workers
    • Autoscaling workers: Pull work from a distributed queue and call into the Daft Inference Platform
    • Automatic backpressure: Applied based on GPU/CPU capacity, external LLM rate limits, and downstream sink throughput. You never touch a GPU dashboard or rate-limit config
  2.

    Daft Inference Platform (Managed)

    • Batch-first, not chatbot-first: Optimized for high-throughput jobs over large datasets with emphasis on cost and throughput, not single-digit millisecond latency
    • Managed hardware: Daft provisions and manages GPU pools with autoscaling based on queue depth and job requirements
    • Provider-aware: Integrations with LLM providers (e.g. OpenAI) with centralized global rate-limiting, request batching, and strategies to minimize 429s and maximize throughput
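
As a rough mental model of the provider-aware layer, the sketch below wraps batched requests in a token-bucket limiter, one standard way to keep the aggregate request rate under a provider's limit and avoid 429s. It is a generic illustration with assumed numbers (a 5 requests/sec budget), not the actual Daft Inference Platform implementation.

import time

class TokenBucket:
    """Simple global rate limiter: refills `rate` tokens per second up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self, n: float = 1.0):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

def run_batches(items, batch_size, send_batch, limiter):
    # One token per request keeps throughput under the provider's limit,
    # while batching amortizes per-request overhead.
    for i in range(0, len(items), batch_size):
        limiter.acquire()
        send_batch(items[i:i + batch_size])

limiter = TokenBucket(rate=5.0, capacity=5.0)   # assumed: ~5 requests/sec budget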

Models

We are constantly adding models to Daft Cloud as we onboard partners and explore new use cases. The Daft Cloud team recommends models based on industry usage as well as throughput-to-performance ratios. We aim to share more specific numbers on price, throughput, and performance as we do more benchmarking and testing.

Note that for many data tasks (e.g. labelling, enrichment, grammar correction, summarization) it is often not necessary to use the latest and most expensive reasoning models. It can be far more economical to choose a smaller, more efficient model, which will also provide higher throughput for your data pipelines.

Language models:

  Provider    Model                    Description
  OpenAI      GPT-5                    OpenAI's best model for coding and agentic tasks across domains.
  OpenAI      GPT-5 mini               OpenAI's faster, cost-efficient version of GPT-5 for well-defined tasks.
  OpenAI      GPT-5 nano               OpenAI's fastest, most cost-efficient version of GPT-5.
  OpenAI      GPT-4.1                  OpenAI's enhanced GPT-4 with improved instruction following and reasoning capabilities.
  OpenAI      GPT-4.1 mini             OpenAI's compact GPT-4.1 optimized for speed and cost-efficiency.
  OpenAI      GPT-4.1 nano             OpenAI's lightweight GPT-4.1 for high-volume, cost-sensitive tasks.
  Anthropic   Claude Sonnet 4.5        Anthropic's smartest model for complex agents and coding.
  Anthropic   Claude Haiku 4.5         Anthropic's fastest model with near-frontier intelligence.
  Anthropic   Claude Opus 4.1          Exceptional model for specialized reasoning tasks.

Embedding models:

  Provider    Model                    Description
  Daft        Qwen3 Embedding 8B       High-performance 8B parameter embedding model optimized for large-scale semantic search and retrieval tasks.
  OpenAI      text-embedding-3-large   Most capable OpenAI embedding model for semantic search, clustering, and recommendations.

FAQ

You'll likely benefit from Daft Cloud if you're:

  • Shipping AI search / RAG / agentic retrieval on top of existing data systems
  • Building labelling / enrichment / moderation pipelines over large, evolving datasets
  • Already using S3/GCS, SQL, TurboPuffer, pgvector, BigTable, etc.
  • Looking to avoid building your own ingestion + inference platform team

Why it exists

We started with a Rust engine (Daft OSS) to make "run a model over 1M documents" not require Spark or a data infra team.

Teams told us they don't want to:

  • Stand up and maintain GPU clusters
  • Stitch together vLLM, queues, and schedulers
  • Manually juggle LLM provider rate limits
  • Handcraft ingestion glue for every new source and sink

Daft Cloud is that missing layer:

  • Data in: from S3/GCS/SQL/HTTP
  • Inference + transform: managed, batch-optimized
  • Data out: into TurboPuffer, pgvector, S3, BigTable, and whatever else you already use

Pricing

Pricing is based on model usage and data processing volume. Contact us for early access pricing details.

Sources and sinks

Sources (read from):
  • Object storage: S3, GCS, Supabase Storage, Cloudflare R2
  • Databases: Postgres, MySQL, other SQL via SELECT-based readers
  • HTTP / feeds: HTTP endpoints, JSON APIs, simple event feeds
Sinks (write to):
  • Vector stores: TurboPuffer, pgvector (Postgres), other VectorDBs
  • Object storage: S3 / GCS / Supabase Storage / generic S3-compatible
  • Databases: Postgres, MySQL, BigTable, other analytic/feature stores
  • Webhooks / HTTP: any JSON-shaped endpoint

Daft Cloud sits in the middle and moves AI-derived vectors and structured outputs between these systems.
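
To make that concrete, here is roughly what an embedding write into a pgvector sink looks like at the destination, using psycopg directly. The table name, column names, and connection string are made up, and this is not how Daft Cloud performs its writes internally; it only illustrates the shape of the output.

import psycopg  # psycopg 3; assumes Postgres with the pgvector extension available

# Hypothetical output rows from an embedding pipeline: (id, text, 3-dim embedding)
rows = [("doc-1", "chunk of parsed text", [0.1, 0.2, 0.3])]

with psycopg.connect("postgresql://localhost/mydb") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # may require elevated privileges
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id TEXT PRIMARY KEY,
            text TEXT,
            embedding vector(3)   -- dimension must match the embedding model
        )
    """)
    cur.executemany(
        "INSERT INTO documents (id, text, embedding) VALUES (%s, %s, %s::vector)",
        [(doc_id, text, str(vec)) for doc_id, text, vec in rows],
    )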

Providing fine-tuned model weights for open-source models is an upcoming feature on our roadmap. Running completely custom model architectures is also supported, but the Daft Cloud team cannot optimize the performance of these models for you as effectively.