
Daft v0.7.9: Temporal Arithmetic, Video Frame Decoding, and Native UUID
by Daft Team

Migrating ETL workloads from Spark means hitting gaps in date arithmetic — functions like `date_add`, `date_diff`, and epoch conversions that Spark users take for granted. Daft v0.7.9 closes that gap with eight new temporal functions, adds `video_frames()` for column-level video decoding with full frame metadata, and ships a native UUID type across the entire type system. The query dashboard also gets byte-level observability — per-operator `bytes_in`/`bytes_out` with volume-scaled arrows and expansion/reduction tags.
Temporal Functions — Batch 2
@BABTUNA implemented eight new temporal functions as part of the ongoing Spark compatibility effort (#6563, tracking #3798):
| Function | Description |
|---|---|
| `date_add(date, days)` | Add days to a date |
| `date_sub(date, days)` | Subtract days from a date |
| `date_diff(date1, date2)` | Difference in days between two dates |
| `date_from_unix_date(days)` | Integer (days since epoch) → date |
| `timestamp_seconds(seconds)` | Epoch seconds → timestamp |
| `timestamp_millis(millis)` | Epoch milliseconds → timestamp |
| `timestamp_micros(micros)` | Epoch microseconds → timestamp |
| `from_unixtime(timestamp, format)` | Unix timestamp → formatted string |
All eight are implemented as Rust UDFs with SQL wrappers and Python bindings. Function names follow Spark conventions for compatibility. Together with the batch 1 functions from v0.7.7, Daft now covers the core temporal arithmetic needed for most ETL workloads migrating from Spark.
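The Spark-style semantics these functions mirror can be checked in plain Python. This is an illustrative sketch of the underlying date arithmetic, not Daft's implementation:

```python
from datetime import date, datetime, timedelta, timezone

# Plain-Python equivalents of the Spark-compatible semantics.
def date_add(d: date, days: int) -> date:
    return d + timedelta(days=days)

def date_diff(end: date, start: date) -> int:
    return (end - start).days

def date_from_unix_date(days: int) -> date:
    # Days counted from the Unix epoch, 1970-01-01.
    return date(1970, 1, 1) + timedelta(days=days)

def timestamp_seconds(seconds: float) -> datetime:
    # Epoch seconds -> timezone-aware UTC timestamp.
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

date_add(date(2024, 1, 31), 3)                   # date(2024, 2, 3)
date_diff(date(2024, 2, 3), date(2024, 1, 31))   # 3
date_from_unix_date(0)                           # date(1970, 1, 1)
```

In Daft these are called as expressions from `daft.functions`, so they run vectorized in Rust rather than row by row.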
video_frames() — Column-Level Video Decoding
@everettVT added `video_frames()`, an expression-level function for decoding video frames from `File` columns with full frame metadata (#6536).
Daft already had `read_video_frames()` for reading video files from a path. But if you had video files already loaded into a DataFrame — say from `daft.from_files()` or after filtering — there was no way to decode them without re-reading from storage.
```python
import daft
from daft.functions import video_frames

df = daft.from_files("s3://bucket/videos/*.mp4")
df = df.with_column(
    "frames",
    video_frames(df["file"], start_time=0.0, end_time=5.0),
)
df = df.explode("frames")
```

Each video produces a single row containing all decoded frames as a nested list of structs. Each struct carries: `frame_index`, `frame_time`, `frame_time_base`, `frame_pts`, `frame_dts`, `frame_duration`, `is_key_frame`, and `data` (the image bytes). Call `.explode("frames")` to get one row per frame.
Supports `start_time`/`end_time` filtering and optional `width`/`height` resizing. Decodes all frames, not just keyframes.
UUID Type
@srilman added UUID as a built-in type in Daft's type system (#6611). This closes #3978.
Before this, reading UUID columns from databases like PostgreSQL or Trino threw errors because Daft couldn't infer an Arrow type from Python `uuid.UUID` objects. The workaround was `CAST(id AS VARCHAR)` in your SQL queries — functional but lossy.
UUID is now a first-class `DataType.Uuid` backed by `FixedSizeBinary(16)` in Arrow. The implementation spans 44 files across the stack: `UuidArray` in `daft-core`, cast operations, sort, take, growable, repr, literal support, `Series` downcast, schema inference, and Python bindings. The WARC reader was updated to use native UUID for `WARC-Record-ID` instead of strings.
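The `FixedSizeBinary(16)` backing works because every UUID is exactly 16 bytes, and the round trip through raw bytes is lossless. A plain-Python illustration of that invariant (not Daft's code):

```python
import uuid

u = uuid.uuid4()
raw = u.bytes                       # the 16-byte form Arrow stores
assert len(raw) == 16               # always fixed-size
assert uuid.UUID(bytes=raw) == u    # lossless round trip
```

That fixed width is what lets UUID columns support cast, sort, and take without the overhead of variable-length strings.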
file_path() — Extract Paths from File Columns
@everettVT added `file_path()`, a scalar UDF that extracts the path from a `File` column as a string (#6621).
```python
import daft
from daft import col
from daft.functions import file_path

df = daft.from_files("s3://bucket/data/*")
pdf_only = df.where(file_path(col("file")).endswith(".pdf"))
```

`daft.from_files()` returns a `file` column of type `File`, but there was no expression-level way to extract the path as a string without writing a UDF. This one-liner unlocks all downstream string operations on file paths — filtering by extension, extracting directory structure, regex matching on filenames.
Dashboard: Byte-Level Observability
Three PRs combine to give the query dashboard full byte-level visibility into pipeline execution:
- Per-operator `bytes_in` / `bytes_out` — @universalmind303 added cumulative byte counters per operator, mirroring the existing row counters (#6612). Without bytes, you can't tell if a 1,000-row morsel is 1 KB or 1 GB. Byte stats are surfaced to subscribers (the dashboard) but kept out of the CLI progress bar to avoid noise.
- Bytes in/out visualization — @universalmind303 then wired these counters into the dashboard with a Daft-specific heatmap palette, expansion/reduction factor tags on operator edges, and arrow widths that scale with actual data volume (#6640). A download node that expands compressed data from 10 MB to 200 MB is immediately visible from the arrow width alone.
- Version metadata — @samstokes added Python, Daft, and Ray version display to the query detail page (#6671). Essential for debugging distributed queries where version skew across workers is the failure mode.
- Phase-aware SortStats — @samstokes fixed distributed Sort row counts that were being inflated 3x because all three internal phases (sample, repartition, final-sort) were summed together (#6632). Now only the final sort phase reports rows.
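The expansion/reduction tags boil down to a `bytes_out / bytes_in` ratio per operator edge. A hypothetical sketch of how such a tag could be computed (illustrative only, not the dashboard's actual code):

```python
def volume_tag(bytes_in: int, bytes_out: int) -> str:
    """Label an operator edge by how it changes data volume."""
    if bytes_in == 0:
        return "source"  # no input bytes: a scan/source node
    ratio = bytes_out / bytes_in
    if ratio > 1.0:
        return f"{ratio:.1f}x expansion"
    if ratio < 1.0:
        return f"{1 / ratio:.1f}x reduction"
    return "pass-through"

# The decompressing download node from the example above:
volume_tag(10_000_000, 200_000_000)  # "20.0x expansion"
```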
Everything Else
- ASOF joins (point-in-time joins) — @euanlimzx implemented `DataFrame.join_asof()` with a new `AsofJoin` logical plan node and a sort-then-two-pointer-scan algorithm. Supports `on`, `left_on`/`right_on`, `by` for entity grouping, and expression-based keys. Initial implementation is on the native runner; distributed support is follow-on work (#6615, closes #1400).
- Parallelize post-GroupBy operations — @srilman removed a redundant `concat` call from GroupBy finalization that was forcing all downstream blocking sinks to run single-threaded. It now returns N per-partition results for N-way parallel execution (#6663).
- Single-stage ListAgg — @srilman switched distributed `ListAgg` from two-stage aggregation (pre-agg → shuffle → final-agg) to single-stage (shuffle → agg). ListAgg doesn't reduce data volume, so the pre-aggregation step was pure overhead (#6665).
- Re-enable Flotilla subscriber support — @srilman re-enabled subscriber support on the Flotilla distributed runner (#6650).
- StructArray broadcast fix — @universalmind303 fixed broadcasting of zero-field StructArrays to the requested length (#6660).
- CheckpointId path validation — @rohitkulshreshtha added validation ensuring CheckpointIds are path-safe for object stores (#6641).
- Streaming sink BatchManager migration — @universalmind303 migrated the streaming sink to use the consolidated BatchManager (#6568).
- ClickBench on CodSpeed — @srilman added ClickBench to CodSpeed CI for continuous performance tracking (#6669).
- Crate diff in PRs — @srilman added automatic crate diff annotations to PRs (#6645).
- Documentation fixes — @colin-ho fixed syntax errors in the connectors docs (#6683), corrected `from_glob_path` in the audio transcription docs (#6675), and fixed mkdocstrings YAML indentation (#6668). @colin-ho also added missing write methods to the I/O API reference (#6647).
- Missing connector docs — @everettVT added documentation for previously undocumented connectors (#6664) and the DESCRIBE SQL statement (#6649), and fixed incorrect accessor syntax in the audio docs (#6648).
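The sort-then-two-pointer-scan idea behind the ASOF join can be sketched in plain Python. This shows backward matching (each left key pairs with the greatest right key at or before it) — an illustrative sketch, not Daft's implementation:

```python
def asof_join_backward(left, right):
    """For each left key, find the greatest right key <= it, or None."""
    left_sorted = sorted(left)
    right_sorted = sorted(right)
    out = {}
    j = 0
    for k in left_sorted:
        # Advance the right pointer while the next right key still
        # falls at or before the current left key.
        while j < len(right_sorted) and right_sorted[j] <= k:
            j += 1
        out[k] = right_sorted[j - 1] if j > 0 else None
    return out

# Trades at t=1,5,9 matched against quotes at t=0,4,8:
asof_join_backward([1, 5, 9], [0, 4, 8])  # {1: 0, 5: 4, 9: 8}
```

Because both sides are scanned once after sorting, the match itself is linear — which is why the sorted two-pointer formulation beats a naive per-row search.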
Community Contributions
- @euanlimzx — Point-in-time / ASOF joins, closing one of the oldest open feature requests (#1400)
- @BABTUNA — 8 temporal functions for Spark-compatible date arithmetic
Upgrade
```shell
uv add "daft>=0.7.9"
```

Or try the latest nightly:

```shell
uv pip install daft --pre --extra-index-url https://nightly.daft.ai
```

Check the full changelog for the complete list of merged PRs.

