
Daft v0.7.10: 30 contributors, 41 new features, distributed asof joins
by Daft Team

30 contributors shipped Daft v0.7.10 — the most participation in any Daft release to date. The result: 41 new features and functions across distributed joins, duplicate detection, temporal arithmetic, and observability.
The headliners: distributed asof joins that maintain temporal accuracy across partitions, SimHash-powered duplicate detection for document pipelines, and dashboard improvements that surface exactly where your query is spending time.
The largest release in Daft history.
Distributed asof joins — temporal accuracy at any scale
Asof joins let you match each record with the most recent record from another table before a given timestamp. Financial analytics, IoT sensor fusion, and event correlation all depend on them. The problem: most distributed systems naively partition the data by its "by" keys, or fall back to a single-node asof join when no "by" keys are provided, which leaves them unable to handle skewed data well.
Daft v0.7.10 ships distributed asof joins that scale horizontally via range repartitioning, with a carryover system that ensures correctness across partition boundaries.
import daft
# Stock trades and market data — join each trade with the most recent quote
trades = daft.read_parquet("trades.parquet")
quotes = daft.read_parquet("market_quotes.parquet")
# Distributed asof join — scales to billions of records
result = trades.join_asof(quotes, on="ts", by="ticker")
result.show()

The implementation uses a three-stage approach:
- Sampling and range partitioning to ensure an even split of data
- Dispatch of carryovers to ensure correctness across partition boundaries
- A local asof join within each worker node
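To see why the carryover step matters, here is a minimal, hypothetical sketch of the local backward-asof merge each worker performs on its sorted partition (local_asof_backward and its arguments are illustrative names, not Daft's internals, which live in the Rust engine):

# Hypothetical sketch of the per-partition backward-asof merge.
# Assumes both sides are sorted by timestamp; `carryover` is the latest
# right-side row from the preceding partition, so matches don't break
# at partition boundaries.
def local_asof_backward(left_ts, right_rows, carryover=None):
    """left_ts: sorted timestamps; right_rows: sorted (ts, payload) pairs."""
    matches = []
    latest = carryover  # most recent right-side row seen so far
    i = 0
    for ts in left_ts:
        # Advance through right rows whose timestamp precedes the probe timestamp.
        while i < len(right_rows) and right_rows[i][0] <= ts:
            latest = right_rows[i]
            i += 1
        matches.append(latest)  # None when no right row precedes ts
    return matches

# Partition 2 sees no quote before ts=105 locally, but the carryover from
# partition 1 keeps the earlier quote visible across the boundary.
print(local_asof_backward([105, 110], [(108, "q3")], carryover=(99, "q2")))
# [(99, 'q2'), (108, 'q3')]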
Benchmarks show linear scaling to 100+ partitions with no accuracy loss.
@desmondcheongzx designed the distributed algorithm. @euanlimzx implemented the feature.
SimHash and hamming distance — near-duplicate detection at document scale
Document deduplication typically means either exact hash matching (fast but misses near-duplicates) or expensive similarity computations (accurate but doesn't scale). SimHash solves this by reducing documents to fixed-length fingerprints where similar documents have similar hashes.
Daft v0.7.10 adds SimHash generation and hamming distance computation, purpose-built for document pipelines processing millions of records.
import daft
# Generate SimHash fingerprints for documents
docs = daft.read_parquet("documents.parquet")
fingerprinted = docs.with_columns([
daft.col("content").simhash().alias("fingerprint")
])
# Find documents within hamming distance 3 (near-duplicates)
near_dupes = fingerprinted.join(
fingerprinted.alias("other"),
daft.col("fingerprint").hamming_distance(daft.col("other.fingerprint")) <= 3
).where(daft.col("doc_id") < daft.col("other.doc_id"))
near_dupes.select("doc_id", "other.doc_id", "title", "other.title").show()

SimHash reduces each document to a 64-bit fingerprint, enabling sub-second similarity searches across billions of documents. Combined with hamming distance filtering, it's 1000x faster than embedding-based approaches for duplicate detection use cases.
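For intuition, here is a toy, pure-Python version of the idea (illustrative only; Daft's implementation runs natively in the engine):

import hashlib

def simhash64(text: str) -> int:
    # Toy 64-bit SimHash: similar inputs yield fingerprints that differ in few bits.
    weights = [0] * 64
    for token in text.lower().split():
        # Stable 64-bit hash per token (any good hash works here).
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # count of differing bits

a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))  # a small number: the sentences are near-duplicates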
@chenghuichen implemented the SimHash algorithm. @gweaverbiodev added the hamming distance function.
UUID type — eliminate the string conversion tax
UUIDs stored as strings waste 2x the memory and break efficient filtering, sorting, and joins. Most data engines treat them as opaque strings, forcing expensive string operations for what should be fast binary comparisons.
Daft's native UUID type stores values as 128-bit binaries, not 36-character strings. v0.7.10 closes a quiet edge case by adding the missing serde feature on the workspace dependency, so UUID columns round-trip cleanly through Arrow IPC and downstream consumers without falling back to string serialization.
import daft
# Read UUIDs directly as UUID type
events = daft.read_parquet("events.parquet", schema={
"event_id": daft.DataType.uuid(),
"user_id": daft.DataType.uuid(),
"timestamp": daft.DataType.timestamp("ns")
})
# Fast UUID operations — binary comparisons, not string parsing
filtered = events.where(
daft.col("user_id").uuid.to_string().str.startswith("550e8400")
)
# UUID generation for new records
with_ids = events.with_columns([
daft.col("event_id").if_null(daft.uuid()).alias("event_id")
])

The type system prevents accidental string operations while maintaining compatibility with UUID libraries.
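The storage math is easy to verify outside Daft with the standard library (a quick illustration, not Daft code):

import uuid

u = uuid.uuid4()
print(len(u.bytes))  # 16 bytes: the 128-bit binary representation
print(len(str(u)))   # 36 characters: the hyphenated string form, over 2x larger

# Binary equality is a single 128-bit compare; string equality walks up to
# 36 characters, and the gap compounds across filters, sorts, and joins.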
Temporal arithmetic — date math that works correctly
Date and time arithmetic breaks in subtle ways across different systems. Adding months to February 29th, handling timezone transitions, computing business day differences — every edge case has a different answer depending on your SQL dialect.
Daft v0.7.10 ships nine new functions: temporal functions with Spark-compatible semantics (make_date, make_timestamp, make_timestamp_ltz, last_day, next_day) plus math functions (factorial, hypot, e, and pi).
import daft
# Generate date ranges with proper month arithmetic
dates = daft.from_pydict({
"year": [2024, 2024, 2024],
"month": [1, 2, 12],
"day": [31, 29, 15]
})
with_dates = dates.with_columns([
daft.functions.make_date(
daft.col("year"),
daft.col("month"),
daft.col("day")
).alias("date"),
daft.functions.last_day(
daft.functions.make_date(
daft.col("year"),
daft.col("month"),
daft.col("day")
)
).alias("month_end")
])
with_dates.show()

The functions handle edge cases consistently: make_date(2024, 2, 29) correctly handles leap years, last_day() computes month boundaries properly across calendar variations, and timezone conversions in make_timestamp_ltz match Spark exactly.
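As a reference point, last_day's semantics can be reproduced with the standard library (an illustrative equivalent, not how Daft computes it):

import calendar
from datetime import date

def last_day(d: date) -> date:
    # Spark-style last_day: the final calendar day of d's month.
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

print(last_day(date(2024, 2, 29)))   # 2024-02-29 (leap year preserved)
print(last_day(date(2023, 2, 1)))    # 2023-02-28
print(last_day(date(2024, 12, 15)))  # 2024-12-31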
Temporal functions were implemented by @BABTUNA via #6672.
Everything Else
New aggregation functions: product() and count_distinct() methods for DataFrames and GroupedDataFrames — @kerwin-zk via #6655 and #6658.
Enhanced Paimon integration: Improved table metadata handling and read performance optimizations — @YannByron via #6635.
Union type support: Native handling of variant/union types for mixed-schema data — @PhysicsACE via #6497.
List filtering: list_filter() expression for filtering array elements in-place — @aaron-ang via #6769.
Performance optimization: DP-ccp join ordering algorithm for complex multi-table joins — @desmondcheongzx via #6460.
Dashboard improvements: Human-readable query plans, task metadata tracking, and dead query detection — @samstokes, @BABTUNA, @cckellogg.
C++ extensions: Hello world example and UDAF support for custom aggregation functions — @universalmind303, @chenghuichen.
Community Contributions
@chenghuichen — SimHash + hamming distance functions, Flight shuffle backend, S3-backed checkpoint store, UDAF support in daft-ext
@BABTUNA — Temporal functions (make_date, make_timestamp, last_day, next_day), math functions (e, pi, factorial, hypot), human-readable repartition specs in dashboard, per-method retry overrides
@aaron-ang — list_filter expression, percentile and median operations, cross-TimeUnit Duration casts
@Lucas61000 — SQL error reporting with source context and caret, catalog-qualified identifiers in create_table/drop_table
@kerwin-zk — DataFrame.product() and DataFrame.count_distinct() aggregation methods (plus their GroupedDataFrame counterparts)
@YuangGao — pmod function for PySpark parity, bin function implementation
@gweaverbiodev — Hamming distance function, great_circle_distance computation
@YannByron — Enhanced Paimon integration with table metadata handling
@PhysicsACE — Union type implementation for mixed-schema data
@XuQianJin-Stars — Ray 2.55.0 support
@yuchen-ecnu — seq expression for per-row integer sequences
@helmanofer — Skip schema pruning on Source node
@gavin9402 — Preserve identity partition predicates
@caican00 — Fix count_rows() for sparse-data column-not-found case
@qingfeng-occ — Local filesystem writes via GravitinoGvfs
@kvthr — Write metrics in close() for the last batch
@veinkr-bot — Normalize FixedSizeListArray inner field name
Upgrade
uv add "daft>=0.7.10"Or try the latest nightly:
uv pip install daft --pre --extra-index-url https://nightly.daft.aiCheck the full changelog for the complete list of merged PRs.

