
Daft v0.7.7: Image Dedup Hashing, Parquet Cache Fix, and df.shuffle()
Daft v0.7.7 ships five perceptual image hashing algorithms for deduplication, fixes a parquet streaming regression, and adds df.shuffle() for ML data prep.
by Daft Team

You have a billion images and need to find the duplicates. You could hash every file, but byte-identical matching misses resized copies, crops, and recompressed variants. Daft v0.7.7 ships five perceptual image hashing algorithms that catch all of these — aHash, dHash, pHash, wHash, and a crop-resistant hash — so deduplication works on what images look like, not what their bytes contain.
This release also fixes a parquet streaming regression introduced in v0.7.5 (aggregations that had slowed 2-4x are back near baseline), adds df.shuffle() for ML data prep, and makes coalesce short-circuit per the SQL spec.
Perceptual Image Hashing
@linguoxuan implemented five image hashing algorithms for large-scale deduplication (#6338):
```python
import daft
from daft.functions import image_hash

df = daft.read_parquet("s3://my-bucket/images/")

# Average hash — fast, good for exact and near-exact duplicates
df = df.with_column("ahash", image_hash(df["image"], "average"))

# Difference hash — better at detecting gradients and edges
df = df.with_column("dhash", image_hash(df["image"], "difference"))

# Perceptual hash — DCT-based, robust to scaling and compression
df = df.with_column("phash", image_hash(df["image"], "perceptual"))

# Wavelet hash — DWT-based, handles color and texture changes
df = df.with_column("whash", image_hash(df["image"], "wavelet"))
```

The four standard algorithms return UInt64 — a fixed 8-byte hash you can compare with Hamming distance. Two images whose hashes differ by fewer than about 10 bits are likely perceptual duplicates.
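Comparing two UInt64 hashes by Hamming distance amounts to counting differing bits. A minimal sketch in plain Python (the hash values below are made up for illustration; this is not a Daft API):

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Count the bits that differ between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")

a = 0xF0F0F0F0F0F0F0F0
b = 0xF0F0F0F0F0F0F0F1  # differs from a in exactly one bit

assert hamming_distance(a, a) == 0
assert hamming_distance(a, b) == 1  # well under the ~10-bit duplicate threshold
```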
For cases where images are cropped differently, there's also a crop-resistant hash:
```python
from daft.functions import image_crop_resistant_hash

df = df.with_column("crhash", image_crop_resistant_hash(df["image"]))
```

This returns variable-length Binary because it segments the image and hashes each segment independently, making it robust to arbitrary cropping.
All five algorithms are implemented in Rust and verified against the imagehash Python library. This closes #4889.
Parquet Streaming: Cache Regression Fixed
A performance regression introduced in v0.7.5 made aggregation workloads 2-4x slower on the streaming parquet path. @desmondcheongzx traced it to batch concatenation defeating CPU cache locality (#6558).
The streaming reader was concatenating all ~128K-row batches from each row group into a single ~500K-row batch before sending it downstream. On AMD EPYC (512KB L2 per core), a 128K-row source buffer fits comfortably in L2. A 500K-row concatenated batch blows past it — the downstream cast alone produces a 4MB output buffer that thrashes L2.
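The arithmetic behind the cache argument can be checked directly. The element widths below are illustrative assumptions (the PR reports buffer sizes, not column types):

```python
KB = 1024
L2_CACHE = 512 * KB  # per-core L2 on the benchmark's AMD EPYC

# Assumed widths: 4-byte source values, 8-byte cast output
source_batch_bytes = 128 * 1024 * 4  # 512KB: fits exactly in L2
cast_output_bytes = 500_000 * 8      # ~4MB: roughly 8x the L2 capacity

assert source_batch_bytes == L2_CACHE
assert cast_output_bytes > 7 * L2_CACHE
```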
The fix extracts `build_rg_reader` and iterates over individual batches, sending each ~128K-row batch through the channel without concatenation.
Benchmarks on c6a.4xlarge (16 vCPUs, AMD EPYC), 100 ClickBench parquet files, ~100M rows:
| Workload | v0.7.4 | v0.7.5 (regressed) | v0.7.7 (fixed) |
|---|---|---|---|
| 90 SUMs over Int16 | 0.576s | 1.226s | 0.652s |
| 90 SUM(col + i) | 1.084s | 4.643s | 1.714s |
The simple SUM workload is back to within 13% of the v0.7.4 baseline. The expression-heavy workload recovered from a 4.3x regression to a 1.6x gap — the remaining difference comes from the streaming path itself, not the batch sizing.
df.shuffle()
@srilman added a first-class shuffle operation for randomly rearranging rows (#6481):
```python
import daft

df = daft.read_parquet("s3://my-bucket/training-data/")
df = df.shuffle(seed=42)
```

Under the hood, shuffle generates a random integer per row and sorts by it — the same approach HuggingFace Datasets uses. It's implemented as an explicit logical plan node (not sugar over sort) so the optimizer can push limits through it without hitting sort's ordering constraints.
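The random-key trick is easy to see in plain Python. This is a toy illustration of the approach, not Daft's implementation:

```python
import random

def shuffle_rows(rows, seed=None):
    """Shuffle by tagging each row with a random key, then sorting by the key."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda kr: kr[0])
    return [row for _, row in keyed]

rows = list(range(10))
out = shuffle_rows(rows, seed=42)
assert sorted(out) == rows                  # same rows, new order
assert shuffle_rows(rows, seed=42) == out   # deterministic for a fixed seed
```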
The PR also adds a standalone random_int() expression:
```python
from daft.functions import random_int

df = df.with_column("sample_group", random_int(low=0, high=10, seed=7))
```

This is the first half of #2612. A batch_size variant for chunked shuffling is planned.
Coalesce Short-Circuit
coalesce(a, b, c) should stop evaluating the moment it finds a non-null value. Daft's implementation didn't — it evaluated every argument unconditionally, which meant expensive expressions in later positions ran even when earlier values were non-null.
@Lucas61000 fixed this by promoting coalesce from a ScalarUDF to a first-class Expr::Coalesce variant with early-exit semantics in the record batch evaluator (#6525). The optimizer, partitioning logic, and all expression visitors were updated to handle the new variant.
This closes #4069.
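The semantics are easy to see in a toy lazy evaluator (plain Python, not Daft's record-batch implementation): each argument is a thunk, and evaluation stops at the first non-null result.

```python
def coalesce(*thunks):
    """Return the first non-None value, evaluating arguments lazily."""
    for thunk in thunks:
        value = thunk()
        if value is not None:
            return value
    return None

calls = []
def expensive():
    calls.append("expensive")
    return 99

result = coalesce(lambda: None, lambda: 7, expensive)
assert result == 7
assert calls == []  # the expensive expression was never evaluated
```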
concat_ws
@euanlimzx added concat_ws — concatenate with separator, skipping nulls (#6543):
```python
from daft import col
from daft.functions import concat_ws

# Join columns with a separator
df.select(concat_ws("/", col("bucket"), col("prefix"), col("filename")))
# "my-bucket/data/file.parquet"

# Nulls are skipped, not propagated (unlike concat)
df.select(concat_ws(" ", col("first"), col("middle"), col("last")))
# "Alice Smith" when middle is null — not null
```

This is the key difference from concat: concat("a", null, "b") returns null, but concat_ws("-", "a", null, "b") returns "a-b". Common use cases: building file paths, composite keys, and display strings. Tracks #3792.
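The null-skipping rule is the whole story. In plain Python terms (a semantic sketch with None standing in for null, not Daft's implementation):

```python
def concat_ws_py(sep, *values):
    """Join non-null values with sep, skipping None rather than propagating it."""
    return sep.join(v for v in values if v is not None)

def concat_py(*values):
    """Plain concat semantics: any null poisons the whole result."""
    if any(v is None for v in values):
        return None
    return "".join(values)

assert concat_ws_py("-", "a", None, "b") == "a-b"
assert concat_py("a", None, "b") is None
```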
Timezone Convert and Replace
@aaron-ang shipped two timezone operations with correct edge-case handling (#6106):
```python
# convert_time_zone: preserves the instant, changes the display
col("ts").convert_time_zone("UTC", from_timezone="+02:00")
# 2024-01-01 00:00:00 +02:00 → 2023-12-31 22:00:00 UTC

# replace_time_zone: preserves local time, swaps the timezone label
col("ts").replace_time_zone("+02:00")
# 2024-01-01 00:00:00 UTC → 2024-01-01 00:00:00 +02:00

# Remove timezone entirely
col("ts").replace_time_zone()
# 2024-01-01 00:00:00 UTC → 2024-01-01 00:00:00 (naive)
```

The implementation now errors when from_timezone is missing on naive timestamps (instead of silently assuming UTC) and ignores from_timezone when the timestamp already carries timezone info. A new ParsedTimezone enum unifies fixed-offset and named-timezone handling internally. Closes #4096.
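The convert/replace distinction mirrors Python's own datetime API, which makes for a handy standard-library analogy (this is not Daft code): astimezone converts the instant, replace swaps the label.

```python
from datetime import datetime, timedelta, timezone

utc = timezone.utc
plus2 = timezone(timedelta(hours=2))

ts = datetime(2024, 1, 1, 0, 0, tzinfo=plus2)

# convert_time_zone analogue: same instant, new display timezone
converted = ts.astimezone(utc)
assert converted == ts                                # still the same instant
assert converted.day == 31 and converted.hour == 22   # 2023-12-31 22:00 UTC

# replace_time_zone analogue: same wall-clock time, new timezone label
replaced = ts.replace(tzinfo=utc)
assert replaced.hour == 0    # local time preserved
assert replaced != ts        # but it now names a different instant
```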
Everything Else
- `_SUCCESS` file for parquet writes — @Lucas61000 added `write_success_file=True` on `df.write_parquet()`, writing an empty `_SUCCESS` marker on completion for Spark-compatible workflows (#6090). Closes #4085.
- Subscriber `on_event` dispatch — @cckellogg unified subscriber callbacks into a single `on_event` method with an `Event` enum (`OperatorStarted`, `OperatorFinished`, `Stats`), replacing scattered per-method calls (#6508).
- `DAFT_TRACE` console tracing — @cckellogg added `DAFT_TRACE` and `DAFT_TRACE_FORMAT` env vars for console trace output in `compact`, `pretty`, or `json` formats (#6458).
- `decode("utf-8")` rewritten as cast — @srilman rewrites `decode(col, "utf-8")` as a cast at the Python layer, unblocking cast-elimination and filter pushdown optimizations (#6537).
- Binary size reduction — @NikkeTryHard de-monomorphized sort and dispatch paths in `daft-core`, shrinking the shared library by 840KB (#6541).
- Python source shim → Rust — @rchowell replaced the Python `_DataSourceShim` with a Rust `ScanOperator` backed by either Python or Rust data sources, an incremental step toward full streaming sources (#6556).
- Supply chain hardening — @everettVT added `exclude-newer` to uv and pinned CI actions to SHA tags (#6565).
- Dual telemetry — @ykdojo added side-by-side telemetry to osstelemetry.io alongside the existing Scarf endpoint (#6540).
- `explain()` after writes — @Abyss-lord fixed `explain()` returning nonsensical plans after `write_parquet`/`write_csv`/`write_json` (#6564).
- `df.metrics` preserved after shutdown — @cckellogg fixed `df.metrics` returning empty by preserving finalized stats before `RuntimeStatsManager` exits (#6555).
- Dashboard Flotilla fix — @samstokes hid the Results tab for Flotilla queries where it would never populate (#6557).
Community Contributions
- @linguoxuan — Five perceptual image hashing algorithms for deduplication
- @Lucas61000 — `_SUCCESS` file support, coalesce short-circuit fix
- @euanlimzx — `concat_ws` string function
- @srilman — `df.shuffle()` and `random_int()`, decode-as-cast optimization
- @aaron-ang — Timezone convert and replace
- @NikkeTryHard — Binary size reduction via de-monomorphization
- @Abyss-lord — `explain()` fix after write operations
- @ykdojo — Dual telemetry endpoint
Upgrade
```shell
uv add "daft>=0.7.7"
```

Or try the latest nightly:

```shell
uv pip install daft --pre --extra-index-url https://nightly.daft.ai
```

Check the full changelog for the complete list of 17 merged PRs.