
Fall 2025 Review: Daft Open Source Updates
Highlights from new AI Functions, updated UDFs, and daft.File upgrades.
by Daft Team

It's been an incredible month of progress for Daft. Let's take a moment to highlight all of the new features, updated capabilities, and documentation improvements that have made their way into the main branch.
AI Functions and Providers
First up is the new prompt function. With daft.functions.ai.prompt, the foundation has been laid for massively parallel prompt engineering, synthetic data generation, and batch tool calling. As of Daft release 0.6.10, the new prompt function supports full multimodal inputs and structured outputs with the OpenAI provider via the Responses API.
```python
import os

import daft
from daft.functions.ai import prompt

# Set the provider to OpenAI and override base_url and api_key
daft.set_provider(
    "openai",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

# Create a dataframe with the quotes
df = daft.from_pydict(
    {
        "quote": [
            "I am going to be the king of the pirates!",
            "I'm going to be the next Hokage!",
        ],
    }
)

# Use the prompt function to classify the quotes
df = df.with_column(
    "response",
    prompt(
        messages=daft.col("quote"),
        system_message="You are an anime expert. Classify the anime based on the text and return the name, character, and quote.",
        model="nvidia/nemotron-nano-9b-v2:free",
    ),
)

df.show(format="fancy", max_width=120)
```

messages now supports any combination of string, image, or file expression inputs. This opens up a variety of use cases, from document intelligence to vision, with built-in integration for daft.File() and daft.DataType.image().
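For example, here's a rough sketch of a multimodal call. This is illustrative only: the column names, URL, and model are placeholders, and it assumes a provider has already been configured as above.

```python
import daft
from daft.functions.ai import prompt

# Illustrative dataframe of image URLs; the URL is a placeholder.
df = daft.from_pydict({"url": ["https://example.com/cat.png"]})

# Download and decode each URL into an image column using existing
# Daft expressions (url.download() and image.decode()).
df = df.with_column("img", daft.col("url").url.download().image.decode())

# Pass the image expression as the message; per this release,
# `messages` accepts string, image, or file expressions.
df = df.with_column(
    "caption",
    prompt(
        messages=daft.col("img"),
        system_message="Describe the image in one sentence.",
        model="gpt-4o-mini",  # placeholder model name
    ),
)

df.show()
```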
New experimental vLLM Provider with Dynamic Prefix Caching
Daft's initial LLM inference functions offered limited functionality and little performance improvement over a naive UDF. The experimental VLLMPrefixCachedProvider adds async batching and prefix routing to dramatically improve large-scale LLM inference throughput. Check out the blog post for all of the technical details on how it cuts batch inference time in half with:
1. Dynamic Prefix Bucketing - improving LLM cache usage by bucketing and routing by prompt prefix.
2. Streaming-Based Continuous Batching - pipelining data processing with LLM inference to fully utilize GPUs.
Combined, these two strategies yield significant performance improvements and cost savings that scale to massive workloads: on a cluster of 128 Nvidia L4 GPUs, we completed an inference workload of 200k prompts totaling 128 million tokens up to 50.7% faster.
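To make the first idea concrete, here is a toy sketch (not the provider's actual implementation) of grouping prompts by a shared prefix, so that each bucket can be routed to one replica and reuse its KV cache:

```python
from collections import defaultdict

def bucket_by_prefix(prompts: list[str], prefix_len: int = 64) -> dict[str, list[str]]:
    """Group prompts by their first `prefix_len` characters.

    Routing a whole bucket to one vLLM replica means the shared prefix
    is computed and cached once, then reused by every prompt in the
    bucket (the essence of prefix caching).
    """
    buckets: dict[str, list[str]] = defaultdict(list)
    for p in prompts:
        buckets[p[:prefix_len]].append(p)
    return buckets

prompts = [
    "You are a helpful assistant. Summarize: document A ...",
    "You are a helpful assistant. Summarize: document B ...",
    "Translate to French: hello",
]
for prefix, group in bucket_by_prefix(prompts, prefix_len=40).items():
    print(f"{len(group)} prompt(s) share prefix {prefix!r}")
```

The real provider goes further, batching streams of rows continuously so that data processing overlaps with GPU inference.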
New Stateless and Stateful UDFs have landed 🙌
When Daft's built-in functions aren't sufficient for your needs, the @daft.func and @daft.cls decorators let you run your own Python code over each row of data. Simply decorate a Python function or class, and it becomes usable in Daft DataFrame operations. The new UDFs support eager execution with scalars, type-hint-driven return type inference, generators, and batch execution.
- @daft.func: stateless row-wise UDFs (with async + generator variants and a batch variant).
- @daft.cls (+ optional @daft.method): stateful UDFs where you initialize once (e.g., load a model) and reuse across rows.
Should I switch now?
For most production workloads, yes. You get cleaner ergonomics, async and generator support, and better typing. For more detail, refer to the comparison guide.
Stateless UDFs - @daft.func
```python
import daft

@daft.func
def add_and_format(a: int, b: int) -> str:
    return f"Sum: {a + b}"

df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
df = df.select(add_and_format(df["x"], df["y"]))
```
Daft supports multiple variants to optimize for different use cases:
- Row-wise (default): Regular Python functions process one row at a time
- Async row-wise: Async Python functions process rows concurrently
- Generator: Generator functions produce multiple output rows per input row
- Batch (@daft.func.batch): Process entire batches of data with daft.Series for high performance
Daft automatically detects which variant to use for regular functions based on your function signature. For batch functions, you must use the @daft.func.batch decorator. Another highly requested feature was retry semantics for transient failures, configurable with max_retries and on_error="raise".
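Here is a rough sketch of the batch variant. The return_dtype, max_retries, and on_error decorator arguments below are our reading of the release notes, not a verbatim documented example:

```python
import daft
from daft import DataType, Series

# Batch variant: the function receives whole columns as daft.Series.
# The decorator arguments shown are assumptions based on the notes above.
@daft.func.batch(return_dtype=DataType.int64(), max_retries=3, on_error="raise")
def add_columns(a: Series, b: Series) -> Series:
    # A real batch UDF would operate on the underlying Arrow data for
    # performance; a plain Python loop keeps the sketch simple.
    return Series.from_pylist([x + y for x, y in zip(a.to_pylist(), b.to_pylist())])

df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
df = df.select(add_columns(df["x"], df["y"]))
```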
Stateful UDFs - @daft.cls + @daft.method
```python
import daft

@daft.cls
class TextClassifier:
    def __init__(self, model_path: str):
        # This expensive initialization happens once per worker
        self.model = load_model(model_path)

    def __call__(self, text: str) -> str:
        return self.model.predict(text)

# Create an instance with initialization arguments
classifier = TextClassifier("path/to/model.pkl")

df = daft.from_pydict({"text": ["hello world", "goodbye world"]})

# Use the instance directly as a Daft function
df = df.select(classifier(df["text"]))
```
How Stateful UDFs Work
1. Lazy Initialization: When you create an instance like classifier = TextClassifier("path/to/model.pkl"), the __init__ method is not called immediately. Instead, Daft saves the initialization arguments.
2. Worker Initialization: During query execution, Daft calls __init__ on each instance with the saved arguments. Instances are reused for multiple rows.
3. Method Calls: Methods can be called with either:
   - Expressions (like df["text"]): returns an Expression for DataFrame operations
   - Scalars (like "hello"): executes immediately, initializing a local instance if needed (see the sketch below)
As with daft.func, Daft supports the same variants (row-wise, async, batch, etc.) for daft.method to optimize for different use cases.
daft.File Enhancements - Better File Abstraction and Media Support
One of the most powerful features for working with files in UDFs is the daft.File interface. The File datatype is preferable when dealing with large files that don't fit in memory, or when you only need to access specific portions of a file. It provides a file-like interface with random access capabilities.
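As a sketch of what that looks like in practice (the column setup, the daft.File() constructor arguments, and the assumption that a File can be read and seeked directly are ours, based on the description above):

```python
import daft

# Illustrative: construct File values from local paths; the exact
# daft.File() constructor arguments are an assumption.
df = daft.from_pydict({"file": [daft.File("data/a.bin"), daft.File("data/b.bin")]})

@daft.func
def magic_bytes(f: daft.File) -> str:
    # Random access: read only the first 4 bytes instead of
    # materializing the whole file in memory.
    f.seek(0)
    return f.read(4).hex()

df = df.select(magic_bytes(df["file"]))
```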
- New VideoFile Type – Introduced a new subtype for video operations with metadata and keyframe extraction.
- MIME Type Detection – Added automatic MIME detection based on extension or byte signature.
- Hugging Face Files – hf:// URI resolution for file reads from Hugging Face datasets.
- Doc Fixes – Updated examples for the new immutable daft.File API.
Other Enhancements
- 💫 New daft-examples repo – With all of these new features, we've created a new dedicated repository highlighting usage patterns and pipelines from across a number of multimodal AI use cases.
- 🌐 New Common Crawl Dataset loader – Common Crawl is one of the most important open web datasets, containing more than 250 billion web pages that span 18 years of crawls. It's now easier than ever to access Common Crawl data with the daft.datasets.common_crawl() reader (see the sketch after this list).
- 🚰 Bigtable Sink – New capability to write DataFrames directly to Google Cloud Bigtable, with schema-aware family mappings.
- ✌️ Pydantic Model Support – Automatic conversion between Pydantic models and Daft structs, improving Python-native interoperability.
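As a quick illustration of the Common Crawl reader (the crawl identifier below is a hypothetical argument; check the docs for the actual signature):

```python
import daft

# Hypothetical usage: lazily scan one crawl snapshot.
# "CC-MAIN-2025-38" is an illustrative crawl ID, not a documented default.
df = daft.datasets.common_crawl("CC-MAIN-2025-38")
df.show(3)
```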
Thanks for keeping up with us! As always, you can get in touch on GitHub or in our Slack Community!