
Daft v0.7.6: Every Major Lake Format, O(1) Scalars, and Swordfish Plan Caching
Daft natively reads and writes every major open lake format — Iceberg, Delta Lake, Hudi, and now Apache Paimon. Plus O(1) scalar columns, fingerprint-based plan caching in Swordfish, and production observability.
by Daft Team

Daft v0.7.6 natively reads and writes every major open lake format — Iceberg, Delta Lake, Hudi, and now Apache Paimon. One DataFrame API, four formats, zero connector management.
That's the headline. But this release also makes scalar operations O(1), introduces fingerprint-based plan caching in Swordfish, ships a Gravitino catalog for multi-format table access, and adds production observability with jemalloc monitoring and JSONL event logs.
O(1) Scalar Columns
When Daft evaluates an expression like col("x") + lit(42), the literal 42 used to get broadcast into a full-length array — a million rows meant a million-element Int64Array for a single constant value. Every literal, every default, every fill value: same story.
@universalmind303 introduced a Column enum that wraps either a materialized Series or a ScalarColumn storing a single Literal value plus a logical length (#6444):
```rust
pub enum Column {
    Series(Series),
    Scalar(ScalarColumn),
}
```

ScalarColumn uses OnceLock<Series> for lazy materialization — the full array is only constructed when an operation actually requires element-wise access, and the result is cached.
Row-preserving operations keep the compact scalar representation and run in O(1):
- slice, head, filter, take, broadcast
- cast, as_physical (operate on the scalar value directly)
- concat (stays scalar if all values match)
- size_bytes, Hash, PartialEq (scalar-aware)
The public API is unchanged — get_column(idx) still returns &Series via lazy materialization, so existing code works without modification. This is the foundation for further optimizations: expression evaluation can now produce Column::Scalar instead of broadcasting, and operations that don't need element-wise access can skip materialization entirely.
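The lazy-materialization pattern is easy to see in miniature. Below is a hedged Python sketch of the same idea, not Daft's actual implementation: the class name, methods, and list-based "array" are all illustrative stand-ins, with functools.cached_property playing the role of OnceLock<Series>.

```python
from functools import cached_property

class ScalarColumn:
    """Stores one literal plus a logical length; materializes lazily.

    Illustrative sketch of the pattern described above, not Daft's API.
    """

    def __init__(self, value, length):
        self.value = value
        self.length = length  # logical row count; no array is allocated

    @cached_property
    def materialized(self):
        # Built only when element-wise access is needed, then cached,
        # analogous to OnceLock<Series> in the Rust implementation.
        return [self.value] * self.length

    # Row-preserving operations stay O(1): they touch only the metadata.
    def slice(self, start, end):
        return ScalarColumn(self.value, end - start)

    def concat(self, other):
        # Stays scalar only when both sides hold the same value.
        if isinstance(other, ScalarColumn) and other.value == self.value:
            return ScalarColumn(self.value, self.length + other.length)
        return self.materialized + other.materialized

col = ScalarColumn(42, 1_000_000)
small = col.slice(0, 10)  # O(1): no million-element array is built
assert small.materialized == [42] * 10
```

The key design point is that slicing, concatenating matching scalars, and similar row-preserving operations never trigger materialization at all; only element-wise access pays the allocation cost, and only once.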
Every Major Lake Format
With v0.7.6, Daft natively reads and writes all four major open lake formats: Iceberg, Delta Lake, Hudi, and now Apache Paimon.
Apache Paimon is a lake format designed for unified streaming and batch analytics. If you're running streaming pipelines and want to use Daft for batch or ML workloads on top of existing Paimon tables, you previously had to copy data out of the lake.
@chenghuichen added native read and write support (#6450):
```python
import daft

# Read from a Paimon table
df = daft.read_paimon(table)

# Write back
df.write_paimon(table, mode="append")
```

This uses pypaimon as the catalog and metadata layer, closing #4976.
One DataFrame API. Four lake formats. No format-specific code paths.
Gravitino Catalog and Iceberg Tables
@rchowell refactored the Apache Gravitino catalog to support Iceberg tables (#6509). The implementation now inspects table metadata to determine the concrete table format — so a single Gravitino catalog can serve Iceberg, Lance, and other table types through one interface.
Breaking change: The from_gravitino method no longer accepts a raw Gravitino client. You now pass client options only; the vended client is kept internal.
This builds on the gvfs:// virtual filesystem support contributed by @qingfeng-occ in this release (#6430), which connects Daft to Gravitino-managed data catalogs.
Swordfish Plan Caching
@colin-ho shipped fingerprint-based plan caching for Swordfish, Daft's distributed runtime (#6278). Multiple executions of the same logical plan now share a single pipeline instead of building a new one each time.
The implementation touches ~49 files across the execution engine:
- A PipelineMessage enum replaces raw MicroPartition with Morsel { input_id, partition } and Flush(input_id), letting a single pipeline multiplex data from multiple logical inputs
- NativeExecutor becomes stateful — it maintains a HashMap<u64, PlanState> keyed by plan fingerprint; repeated calls with the same fingerprint reuse the existing pipeline
- A MessageRouter routes pipeline output to per-input channels so each caller gets only its own results
- Per-input runtime stats track (NodeId, InputId) pairs with merge support for aggregated views
This builds on the plan fingerprinting infrastructure also added in this release (#6276), where each physical plan gets a deterministic fingerprint that identifies functionally identical plans.
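The caching mechanism itself is a simple pattern: hash the plan deterministically, then key pipelines by that hash. Here is a hedged Python sketch of that pattern; PlanCache and the string stand-in for a pipeline are hypothetical names, not Swordfish's actual NativeExecutor internals.

```python
import hashlib
import json

class PlanCache:
    """Fingerprint-keyed pipeline reuse, sketched in Python.

    Illustrative only: this models the idea of HashMap<u64, PlanState>
    keyed by a deterministic plan fingerprint, not Daft's real code.
    """

    def __init__(self):
        self._pipelines = {}  # fingerprint -> built pipeline
        self.builds = 0

    def fingerprint(self, plan):
        # Deterministic hash over plan structure: functionally identical
        # plans (regardless of object identity) hash to the same key.
        blob = json.dumps(plan, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_pipeline(self, plan):
        key = self.fingerprint(plan)
        if key not in self._pipelines:
            self.builds += 1  # only build when the fingerprint is new
            self._pipelines[key] = f"pipeline-{key[:8]}"
        return self._pipelines[key]

cache = PlanCache()
plan = {"op": "filter", "child": {"op": "scan", "path": "data.parquet"}}
assert cache.get_pipeline(plan) == cache.get_pipeline(dict(plan))
assert cache.builds == 1  # second execution reused the cached pipeline
```

Sorting keys before hashing is what makes the fingerprint insensitive to incidental dictionary ordering, so two executions of the same logical plan land on the same pipeline.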
Production Observability
Two new subsystems give you full visibility into query execution without external tooling.
Process-level memory and CPU monitoring. @desmondcheongzx added a ProcessStatsCollector that samples jemalloc allocator stats and OS-level process metrics every 200ms (#6428):
- daft.process.memory.rss — OS-level resident set size
- daft.process.memory.jemalloc.allocated — bytes allocated by Rust code
- daft.process.memory.jemalloc.resident — jemalloc's RSS (includes fragmentation)
- daft.process.cpu.percent — process CPU utilization across all threads
Stats are emitted as OTEL gauges and flow through the subscriber pipeline. Enable with DAFT_PROCESS_MONITOR_ENABLED=true.
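To make the sampling cadence concrete, here is a minimal stdlib-only sketch of a periodic process-stats sampler. It is not Daft's collector: getrusage is a stand-in for the RSS gauge, the jemalloc counters are omitted entirely (they exist only inside Daft's Rust allocator), and the function names are invented for illustration.

```python
import resource
import time

def sample_process_stats():
    # One sample of process-level metrics, using Unix stdlib stand-ins.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "daft.process.memory.rss": usage.ru_maxrss,  # KiB on Linux, bytes on macOS
        "daft.process.cpu.user_s": usage.ru_utime,   # cumulative user CPU seconds
    }

def run_collector(emit, interval_s=0.2, samples=3):
    # Periodic sampling loop, mirroring the 200ms cadence described above.
    for _ in range(samples):
        emit(sample_process_stats())
        time.sleep(interval_s)

collected = []
run_collector(collected.append, interval_s=0.01)
assert len(collected) == 3
assert all(s["daft.process.memory.rss"] > 0 for s in collected)
```

In the real subsystem each sample becomes an OTEL gauge rather than a dict, but the shape is the same: a fixed-interval loop that reads allocator and OS counters and hands them to an emitter.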
JSONL event log. @cckellogg shipped an EventLogSubscriber that writes structured query lifecycle events to JSONL files under ~/.daft/events/ (#6420):
```json
{"event": "query_started", "ts": "2026-03-17T23:33:53.867Z", "query_id": "nimble-falcon-57ee1d"}
{"event": "optimization_ended", "ts": "2026-03-17T23:33:54.164Z", "query_id": "nimble-falcon-57ee1d", "duration_ms": 298}
{"event": "operator_finished", "ts": "2026-03-17T23:33:55.012Z", "query_id": "nimble-falcon-57ee1d", "node_id": 3, "rows_out": 100000}
{"event": "process_stats", "ts": "2026-03-17T23:33:55.200Z", "query_id": "nimble-falcon-57ee1d", "metrics": {"process.memory.rss": 123437056}}
```

Each query gets its own directory (<log_dir>/<query_id>/events.jsonl), capturing the full lifecycle: session start, query start, unoptimized plan, optimization timing, operator-level stats, process metrics, and query completion. Enable with enable_event_log() / disable_event_log(). Feed it into your monitoring stack, grep for slow queries, or build custom alerts.
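Because each line is standalone JSON, quick analyses need only a few lines of stdlib Python. A sketch that flags slow phases; the field names follow the sample events above, and the 100ms threshold is an arbitrary choice for illustration:

```python
import json
from io import StringIO

# Sample lines in the events.jsonl shape shown above.
log = StringIO(
    '{"event": "query_started", "ts": "2026-03-17T23:33:53.867Z", "query_id": "nimble-falcon-57ee1d"}\n'
    '{"event": "optimization_ended", "ts": "2026-03-17T23:33:54.164Z", "query_id": "nimble-falcon-57ee1d", "duration_ms": 298}\n'
    '{"event": "operator_finished", "ts": "2026-03-17T23:33:55.012Z", "query_id": "nimble-falcon-57ee1d", "node_id": 3, "rows_out": 100000}\n'
)

def slow_phases(lines, threshold_ms=100):
    # Flag any event that carries a duration above the threshold.
    return [
        (event["event"], event["duration_ms"])
        for event in map(json.loads, lines)
        if event.get("duration_ms", 0) > threshold_ms
    ]

assert slow_phases(log) == [("optimization_ended", 298)]
```

Point the same loop at ~/.daft/events/ and you have a slow-query report without any external tooling.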
Bounded Kafka Reads
Daft can now read from Kafka as a bounded batch source via daft.read_kafka (#5970), contributed by @everySympathy.
```python
import daft

df = daft.read_kafka(
    bootstrap_servers="localhost:9092",
    topics=["events"],
    start="earliest",
    end="2026-03-25T00:00:00Z",
)
```

Start and end bounds accept "earliest" / "latest", timestamp milliseconds, datetime objects, ISO-8601 strings, or per-partition offset maps for fine-grained control:
```python
# Multi-topic with per-partition offsets
df = daft.read_kafka(
    bootstrap_servers="localhost:9092",
    topics=["events", "metrics"],
    start={"events": {0: 1000, 1: 2000}},
    end="latest",
)
```

This closes a long-standing request (#4603) — Kafka data is now a first-class Daft data source, no external batch job required.
Native Ray Dataset Interop
DataFrame.to_ray_dataset() previously required the Ray runner — calling it from the native runner raised a ValueError, forcing users who run Daft compute locally but need Ray Datasets for downstream ML pipelines to do manual Arrow round-trips.
@desmondcheongzx removed this artificial restriction (#6486). When using the native runner, partitions are converted to Arrow tables and passed directly to Ray via ray.data.from_arrow():
```python
import daft

df = daft.read_parquet("s3://my-bucket/data/")
df = df.where(df["label"] == "positive")

# Works from the native runner now — no Ray runner required
ray_ds = df.to_ray_dataset()
```

Run your heavy data processing on Daft's native runner for performance, then hand off to Ray for model training or serving — without changing runners or serializing through intermediate formats.
Everything Else
- Join operator metrics — @srilman added rows in/out and duration for every join in the dashboard (#6391).
- Bitmask join reordering — @desmondcheongzx replaced the vector-based RelationSet with RelationSet(u32) for O(1) subset operations, groundwork for DP-ccp join ordering (#6421).
- DataSource trait — @rchowell defined a single trait matching the Python interface, the Daft equivalent of DataFusion's TableProvider (#6427).
- Ray options for UDF v2 — @Jay-ju added ray_options and resource override support for fine-grained GPU/CPU allocation per UDF (#5982).
- read_text whole files — @plotor extended daft.read_text() to read entire files into single rows (#6354).
- count(mode='all') — @Abyss-lord added SQL COUNT(*) semantics without requiring an expression argument (#6358).
- Global aggregations in select — @caican00 enabled mixing column references and aggregations in a single select() call (#6067).
- SQL date functions — @BABTUNA added current_date, current_timestamp, current_timezone, and aliases (#6495).
- hex/unhex — @Lucas61000 added hex encoding/decoding (#6373).
- Streaming flight shuffle — @colin-ho refactored the Swordfish shuffle into a streaming read path, reducing memory pressure during distributed exchange (#6269).
Community Contributions
- @everySympathy — Bounded Kafka reads, JSONL byte-range reading fix
- @chenghuichen — Apache Paimon lake format support
- @qingfeng-occ — Apache Gravitino gvfs:// support, OpenDAL refactoring, dashboard docs fix, SQL partition bounds fix
- @Abyss-lord — count(mode='all'), field order preservation in from_pylist, bytes.read reporting for CSV, string-to-fixed_size_binary cast
- @Jay-ju — Ray options for UDF v2
- @caican00 — Global aggregation in select, JSONL schema inference fix
- @plotor — read_text whole files, json_target_filesize config, materialized stats preservation
- @Lucas61000 — hex/unhex, strip, TPC-DS q23 SQL fix
- @BABTUNA — SQL date/time functions, progress bar improvements
- @ykdojo — Daft Skills docs and Claude Code plugin config
- @gavin9402 — Iceberg identity transform fix
Upgrade
```shell
uv add "daft>=0.7.6"
```

Or try the latest nightly:

```shell
uv pip install daft --pre --extra-index-url https://nightly.daft.ai
```

Check the full changelog for the complete list of 82 PRs.