
Daft v0.7.3: OTEL for Flotilla, Nightly Builds, and Lance NN Search
Daft v0.7.3 adds distributed observability through df.metrics and OTEL export, nightly builds, and native Lance vector search.
by Daft Team
Daft version 0.7.3 has been released. Here are the highlights, honorable mentions, and contributor spotlights from the latest update.
Observability for distributed runs
This release brings big updates to Daft's observability roadmap. Last year, we shipped the first milestone: a metrics framework for the native runner (Swordfish) and a basic dashboard to track query lifecycles. v0.7.3 lands the second part, extending that same infrastructure to distributed execution.
Concretely, that means three things.
First, with @Jay-ju's help [#6062, #6063], the Daft dashboard can now monitor query state across your Ray cluster.
Second, df.metrics now works for Flotilla runs [#6122]. After your distributed query materializes, you get a RecordBatch of overall execution stats — operator timings, row counts, bytes processed — attached directly to the result DataFrame. Same API you'd use on a local run, same data shape, just from a cluster.
Third, Daft now officially supports the OpenTelemetry metrics protocol. Using standard OTEL_EXPORTER_OTLP_* environment variables [#6148], you can route metrics to your existing OTEL backend (Prometheus, an OTEL collector, ClickStack) without touching any code.
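If you'd rather set those variables from a launcher script than from your shell, here is a minimal sketch. The variable names come from the OpenTelemetry exporter spec; exactly which of them Daft reads, and when it reads them, is an assumption on our part here, so set them before the query runs. The helper name and the collector address are hypothetical.

```python
import os


def otlp_metrics_env(endpoint, protocol="grpc", headers=None):
    """Build standard OTLP exporter environment variables (hypothetical helper)."""
    env = {
        # Base endpoint for all OTLP signals, e.g. a local collector.
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # The OTEL spec allows "grpc", "http/protobuf", or "http/json".
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }
    if headers:
        # Headers are encoded as comma-separated key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{key}={value}" for key, value in headers.items()
        )
    return env


# Assumed collector address; 4317 is the conventional OTLP gRPC port.
os.environ.update(otlp_metrics_env("http://localhost:4317"))
```

The same variables work just as well exported from a shell profile or a Ray runtime environment; the point is only that nothing in your query code has to change.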
The design philosophy here is worth noting: observability hooks are available across the stack, depending on how much detail you need. By default, query metrics are immediately available on your resulting DataFrame. Not a separate service, not a sidecar you have to deploy — it's just another property on the object you're already working with. If you're looking to track historical metrics, the OTEL integration uses the standard OTEL tooling that your infrastructure team already knows. And if you're looking for something more, the dashboard provides Daft-specific context to your runs.
Nightly builds, because we want you to break things faster
We now publish nightly builds at nightly.daft.ai [#6175]. pip install and you're running whatever landed on main yesterday.
Why does this matter? Because the feedback loop between "contributor lands a PR" and "real user tries it" was too long. If you found a bug, you had to wait for a release or build from source. If you wanted to validate a fix, same deal. Nightlies compress that loop to about 24 hours. Land a PR Monday afternoon, someone in Singapore is running it Tuesday morning.
It also keeps us honest. When main has to be installable every single day, you can't let broken stuff linger. The nightly pipeline is a forcing function for code quality — and an open invitation for the community to hold us to it.
pip install daft --pre --extra-index-url https://nightly.daft.ai
That's it. Live dangerously.
Lance vector search, natively in your DataFrame
We added Lance namespace read/write [#5980], nearest-neighbor vector search [#6025], and distance/similarity functions [#6098] — cosine similarity, Euclidean distance, the works — as native DataFrame expressions.
Think about what this means for a RAG pipeline: read your Lance dataset into Daft, run vector search, compute similarity, filter, join with metadata, write the results. One engine, distributed, optimized. The AI data stack doesn't need more tools. It needs fewer tools that do more.
Credit here to @shaofengshi and @huleilei who drove the Lance integration as community contributors.
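For anyone who wants the math behind the distance and similarity functions mentioned above, here is a plain-Python sketch of cosine similarity, Euclidean distance, and a brute-force nearest-neighbor lookup. This is not Daft's or Lance's API or implementation (those run natively and can use indexes); the function names and toy data are ours, purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, rows, k=2):
    """Brute-force top-k by cosine similarity; rows is a list of (id, vector)."""
    scored = [(row_id, cosine_similarity(query, vec)) for row_id, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" keyed by a document id.
docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = nearest_neighbors([1.0, 0.1], docs, k=2)
```

A brute-force scan like this compares the query against every row, which is exactly the work you want pushed down into the engine once the dataset stops fitting in a list.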
Zero-copy arrays and smarter UDF serialization
from_vec is now zero-copy [#6172]. When Daft constructs arrays from Rust vectors — which happens constantly — it no longer copies the data. Just points to it.
Meanwhile, Actor UDFs now serialize only the columns they actually need [#5884]. If your UDF takes 2 columns from a 50-column DataFrame, we used to ship all 50. Now we ship 2. We also bumped Actor UDF timeouts from 10s to 60s [#6163], because it turns out calling an LLM endpoint takes longer than a hash join. Who knew.
These are the kinds of changes that don't make for good demos but show up immediately in your pipeline's memory profile and wall-clock time.
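To see why pruning columns before serialization matters, here is a toy sketch of the idea using pickle and a dict of lists. It is not how Daft implements Actor UDF serialization; the batch layout and the project_columns helper are hypothetical stand-ins for "ship only what the UDF reads".

```python
import pickle


def project_columns(batch, needed):
    """Keep only the columns a UDF declares it needs (hypothetical helper)."""
    return {name: batch[name] for name in needed}


# A toy 50-column batch: column name -> list of values.
batch = {f"col_{i}": list(range(1000)) for i in range(50)}
needed = ["col_0", "col_1"]

# Serializing everything versus serializing just the two columns the UDF reads.
full_payload = pickle.dumps(batch)
pruned_payload = pickle.dumps(project_columns(batch, needed))

print(f"full: {len(full_payload)} bytes, pruned: {len(pruned_payload)} bytes")
```

The pruned payload is a small fraction of the full one, and that saving repeats for every batch sent to every actor.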
Honorable mentions
v0.7.3 shipped a lot. Here's the rest of what you should know about:
- The arrow2 → arrow-rs migration keeps going: 20+ PRs in this release alone migrated core kernels to arrow-rs: hashing, aggregations, UTF8 operations, temporal methods, null checks, HLL sketches, comparison kernels, and more. We're consolidating onto the Apache-maintained Arrow implementation, and the scope of this effort deserves its own story. Stay tuned.
- Iceberg snapshot properties: You can now set custom snapshot properties on Iceberg writes [#6139]. If you're tracking lineage, audit trails, or pipeline metadata in production Iceberg tables, this one's for you.
- Expression API expansion: uuid() [#5983], agg_concat with custom delimiters [#6099], list_contains [#6095], text embedding dimension specification [#6097], variance with degrees of freedom [#6105], and comparison operators for list and struct types [#6104]. Each one is a case where you used to need a UDF and now you don't.
- Bug fixes worth knowing about: Map columns now render as Python dicts instead of lists [#6198]. is_in accepts sets, tuples, and iterables, not just lists [#6115]. Filter pushdown works through anti-joins [#6150]. into_partitions() handles the case where input already matches the target partition count [#6061].
Contributor Spotlight
Since January 1st, 8 community contributors shipped 34 PRs.
Here's the rundown:
- Aaron Ang (@aaron-ang): 10 PRs. An absolute machine. Distance functions, similarity functions, .as_T cast methods, list_contains, string casing, agg_concat delimiter, embedding dimension spec. He basically built a new feature wing of the API by himself.
- huleilei (@huleilei, ByteDance): 9 PRs. Lance vector search, arrow2 migration work on is_in/get_lit, UDF v2 kwargs fix, image pipeline docs. Shipping across multiple areas of the codebase.
- jay (@Jay-ju, ByteDance): 4 PRs. Extended the Daft dashboard for distributed execution. Also added map_groups v2 UDF support and mcap reader improvements.
- Zhenchao Wang (@plotor, ByteDance): 3 PRs. Perf optimization for actor UDF serialization, build tooling improvements.
- everySympathy (@everySympathy, ByteDance): 3 PRs. Added the uuid() function, fixed into_partitions(), docs cleanup.
- Shaofeng Shi (@shaofengshi, Datastrato): 2 PRs. Landed the Gravitino connector optional dependency and Lance namespace read/write support.
- gpathak128 (@gpathak128): 2 PRs. First time contributing to Daft! JSON write: ignore null fields + timestamp support.
- Killua7163 (@Killua7163): 1 PR. Also a first-timer! Fixed mypy-boto3-glue in the AWS optional dependencies.
Daft continues to release powerful features thanks to our amazing contributors. It's this level of collective investment that makes Daft what it is.
Try 0.7.3 today
pip install daft==0.7.3
Full changelog is on GitHub. And if any of the work above made you think "I could contribute to that", you absolutely can. Grab a good first issue and come build with us.