First-class observability in Daft

TLDR

A new Daft dashboard makes it easy to introspect operator memory, row throughput, and morsel-driven tasking, both locally and on a distributed cluster.
OTel endpoints for production telemetry. Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to point Daft's OTel metrics at your existing collector.
Stuck-query detection and a Unix-style "top" view of tasks. Both help isolate the root cause of slow, stuck, or failed queries.
DAFT_TRACE for console debugging. Set DAFT_TRACE=pretty for a structured trace of execution events on stdout — no dashboard required.

What does it mean for Daft to be observable?

It means you can understand what Daft is doing with your query well enough to track down failures or tune for performance.

Daft powers exabyte-scale pipelines at the world's top technology companies. Until now, teams running Daft in production had to craft observability dashboards by hand. These teams shouldn't have to guess about their data infrastructure, especially given the cost and complexity of operating distributed systems at scale.

Today we're adding new observability tooling that makes Operators, Tasks, Rows, and Memory first-class citizens. The Daft Dashboard ships with Daft itself and is powered by OTel-compatible instrumentation endpoints. Both the Python and Rust layers are now wired with better logging, metrics, and traces. The result is a frontend that is not only a better tool for debugging but also a more accurate reflection of Daft's morsel-driven execution model.

For users, this means that when a query is slow, fails, or gets stuck, they can now see what Daft was doing and which operators are consuming the most time or memory. For distributed processing, the Tasks view shows which tasks are taking the longest for each operator, giving direct insight into the atomic unit of work in Flotilla. Overall, these improvements help users introspect Daft's runtime behavior, which makes it much easier to optimize queries and pipelines.

Start with the dashboard

The Daft Dashboard is a web UI for inspecting query execution. It ships with the standard daft package and is the central observability interface for Daft. Just start the dashboard server, point your script at it, and Daft will push query events into the UI while the job runs.

This matters most when the query is slow, stuck, or failing. The dashboard now shows the physical plan, live operator stats, distributed task progress, failure context, and memory-related metrics in one place. You can use it locally while developing a query, or point a Ray cluster at a dashboard URL that every actor can reach.

With uv:
uv add daft
uv run daft dashboard start
export DAFT_DASHBOARD_URL=http://localhost:3238
uv run my_script.py
# then open http://localhost:3238

With pip:
python -m venv .venv
source .venv/bin/activate
pip install daft
daft dashboard start
export DAFT_DASHBOARD_URL=http://localhost:3238
python my_script.py
# then open http://localhost:3238

Daft Dashboard — query execution view with the live plan tree

Find the slow operator

Open the dashboard and the first useful view is the plan tree. It shows the physical operators Daft is running, how data flows between them, and which operators are taking the most time.

Operator cards now include wall-clock time and CPU duration. Slow operators are colored differently from fast ones. Pipeline arrows make the direction of execution explicit. This gives you a fast answer to the first debugging question: where is the query spending time?
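The same triage can be sketched offline in a few lines of Python. This is only an illustration of ranking operators by wall-clock time, not how the dashboard computes its view: the OperatorTiming type, the operator names, and the timing values are all hypothetical, and the numbers stand in for figures you would read off the operator cards.

```python
from typing import List, NamedTuple


class OperatorTiming(NamedTuple):
    """Hypothetical record of the two durations shown on an operator card."""
    name: str
    wall_s: float  # wall-clock seconds
    cpu_s: float   # CPU seconds


def rank_by_wall_clock(ops: List[OperatorTiming]) -> List[OperatorTiming]:
    """Return operators ordered from slowest to fastest by wall-clock time."""
    return sorted(ops, key=lambda op: op.wall_s, reverse=True)


# Made-up timings for three operators in one plan.
timings = [
    OperatorTiming("scan", wall_s=42.0, cpu_s=6.0),
    OperatorTiming("project", wall_s=8.0, cpu_s=7.5),
    OperatorTiming("aggregate", wall_s=15.0, cpu_s=14.0),
]

for op in rank_by_wall_clock(timings):
    print(f"{op.name:10s} wall={op.wall_s:6.1f}s cpu={op.cpu_s:6.1f}s")
```

In this made-up data the scan leads on wall-clock time while using little CPU. One plausible reading is that it is waiting on I/O rather than computing, but that is an inference to confirm in the tasks view, not something the numbers prove.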

Interactive tree visualizer for query plans (#6295)
Physical plan tree with live operator stats (#6299)
Wall-clock and CPU duration on operator cards (#6300)
Pipeline direction arrows in the plan tree (#6625)
Heatmap coloring for operator nodes (#6628)

Inspect distributed tasks

Distributed queries fail and slow down at the task level. A plan node can look suspicious, but you still need to know which tasks are running, which are pending, and whether one task is dragging the operator behind the rest.

The new tasks view makes that visible. It shows task progress from Flotilla workers, separates pending work from running work, and ties task rows and bytes back to the plan topology. If one scan task is reading much more data than the others, or one task group is taking longer because it has a different local-plan shape, you can see that directly instead of guessing from aggregate operator time.
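To make the skew case concrete, here is a small Python sketch of the comparison you would otherwise do by eye in the tasks view. It is not Daft code: the skewed_tasks helper, the task ids, the byte counts, and the threshold of three times the median are all assumptions chosen for illustration.

```python
from statistics import median
from typing import Dict, List


def skewed_tasks(bytes_by_task: Dict[str, int], factor: float = 3.0) -> List[str]:
    """Return ids of tasks that read more than `factor` times the median bytes."""
    if not bytes_by_task:
        return []
    typical = median(bytes_by_task.values())
    return sorted(t for t, b in bytes_by_task.items() if b > factor * typical)


# Hypothetical per-task byte counts for a single scan operator.
scan_tasks = {
    "task-0": 120_000_000,
    "task-1": 110_000_000,
    "task-2": 130_000_000,
    "task-3": 900_000_000,
}

print(skewed_tasks(scan_tasks))  # prints ['task-3']
```

The median is used rather than the mean so that a single oversized task does not drag the baseline up and hide itself.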

Running this on a cluster takes one piece of setup. Every Ray actor needs DAFT_DASHBOARD_URL set so it can push events to the dashboard process. Set the env var on the driver before importing Daft — the value propagates to actors automatically — and point it at a hostname routable from every node, not localhost.
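A minimal sketch of that driver-side setup, in Python. Only the DAFT_DASHBOARD_URL variable and port 3238 come from the text above; the configure_dashboard_url helper and the dashboard.internal hostname are hypothetical, and the loopback check is a convenience added for this example rather than something Daft performs for you.

```python
import os
from urllib.parse import urlparse

# Names that only resolve on the local machine. Ray actors on other nodes
# cannot reach a dashboard advertised under any of these.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def configure_dashboard_url(url: str) -> str:
    """Validate a dashboard URL and export it as DAFT_DASHBOARD_URL."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"not a valid URL: {url!r}")
    if host in LOOPBACK_HOSTS:
        raise ValueError(f"{host} is not routable from other nodes")
    os.environ["DAFT_DASHBOARD_URL"] = url
    return url


# Hypothetical hostname; replace it with one every node can resolve.
configure_dashboard_url("http://dashboard.internal:3238")

# Import Daft only after the variable is set, as described above.
# import daft
```

Failing fast on a loopback address is a deliberate choice here: a localhost URL works on a laptop and then silently drops events once the same script runs on a cluster.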

Draft tasks view for Flotilla (#6783)
Tasks tab as a collapsible sidebar in the Execution tab (#6752)
Per-task progress updates from Flotilla workers (#6838)
TaskScheduled events distinguishing pending vs running (#6866)
Task sources surfaced in the tasks sidebar (#6879)
Per-task rows and bytes stats with topology markers (#6861)
Task groups split by local-plan shape for accurate durations (#6899)
Smart per-node stats aggregation for distributed execution (#6574)
Partition sets passed through repr_json so the plan matches execution topology (#6576)
num_tasks metric tracking across runtime stats (#6716)

Locate failures and stalls

When a query fails, the useful question is not only "what exception was raised?" It is "where in the plan did that exception come from?"

The dashboard now surfaces query failure details in the query view and highlights failed operators on the plan tree. It also uses subscriber heartbeats to detect queries that stop reporting progress. That gives you a clear place to start: the failed operator, the stalled subscriber, or the task group that stopped moving.
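The idea behind heartbeat-based stall detection can be sketched in Python as follows. This is a generic illustration of the technique, not Daft's implementation: the HeartbeatMonitor class, the query ids, and the 30-second timeout are all hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per query and reports those that went quiet."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, query_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        self._last_seen[query_id] = time.monotonic() if now is None else now

    def stalled(self, now: Optional[float] = None) -> List[str]:
        """Return ids of queries whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            q for q, seen in self._last_seen.items() if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.beat("query-a", now=0.0)
monitor.beat("query-b", now=25.0)

print(monitor.stalled(now=40.0))  # prints ['query-a']
```

A monotonic clock is used so that wall-clock adjustments on the host cannot make a healthy query look stalled.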

Surface query failure details in the query view (#6897)
Highlight failed operators directly on the plan tree (#6930)
Subscriber heartbeat and dead-query detection (#6676)

Track bytes through the plan

Out-of-memory failures are hard to debug when all you know is that a worker died. You need to know what the query was doing when memory climbed, and which operators were expanding or retaining data.

Daft now reports per-operator bytes in and bytes out. The dashboard shows inflation and deflation metrics so you can see where data expands as it moves through the plan. The tasks view also carries rows and bytes at task granularity, which helps separate a genuinely expensive operator from a skewed input partition.
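As a rough illustration of how bytes in and bytes out translate into an inflation figure, the Python sketch below computes a simple ratio per operator. The definition used here (bytes out divided by bytes in) is an assumption made for this example, and the dashboard may define its metric differently. The operator names and byte counts are also made up.

```python
from typing import Optional


def inflation_ratio(bytes_in: int, bytes_out: int) -> Optional[float]:
    """Return bytes_out / bytes_in; values above 1.0 mean the data expanded."""
    if bytes_in == 0:
        return None  # a source has no input, so the ratio is undefined
    return bytes_out / bytes_in


# Hypothetical operator stats in plan order: (name, bytes_in, bytes_out).
operators = [
    ("scan", 0, 4_000_000),
    ("filter", 4_000_000, 1_000_000),
    ("explode", 1_000_000, 8_000_000),
]

for name, b_in, b_out in operators:
    ratio = inflation_ratio(b_in, b_out)
    label = "n/a" if ratio is None else f"{ratio:.2f}x"
    print(f"{name:8s} {label}")
```

In this made-up plan the filter shrinks its input to a quarter while the explode multiplies it by eight, which is the kind of operator to look at first when memory climbs.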

For production telemetry, Daft exports OTel metrics that can be sent to your existing collector. The goal is straightforward: when a job runs out of memory, Daft should give you enough evidence to decide whether to repartition, exclude bad input, scale out, or change the query.
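One way to wire this up without touching the job itself is to build the environment in a small launcher, sketched here in Python. The OTEL_EXPORTER_OTLP_ENDPOINT and DAFT_TRACE variables are the ones named in this post; the telemetry_env helper and the collector address are hypothetical, and the port and protocol your collector expects are assumptions to check against your own setup.

```python
import os
from typing import Dict, Mapping, Optional


def telemetry_env(
    otlp_endpoint: str,
    trace: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with Daft telemetry variables set."""
    env = dict(os.environ if base is None else base)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = otlp_endpoint
    if trace is not None:
        env["DAFT_TRACE"] = trace
    return env


# Hypothetical collector address; 4317 is the conventional OTLP/gRPC port.
env = telemetry_env("http://otel-collector.internal:4317", trace="pretty")

# Launch the job with that environment, for example:
# subprocess.run([sys.executable, "my_script.py"], env=env, check=True)
```

Returning a copy instead of mutating os.environ keeps the launcher's own process clean and makes it easy to run the same script with and without telemetry.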

Per-operator bytes_in / bytes_out (#6612)
Bytes inflation / deflation metrics in the dashboard (#6640)
Per-task rows and bytes stats (#6861)
Fix for Flotilla over-reporting of bytes.read (#6774)

Run it with production guardrails

Two production details are worth calling out.

First, dashboard query state is now bounded. A long-running dashboard process should not grow forever just because many queries reported into it.

Second, the old result preview path is gone. It relied on deserializing pickle bytes from query execution, which was the wrong security tradeoff for a dashboard that might receive events from distributed workers. Removing it makes the dashboard safer to run outside a throwaway local session.

The dashboard still needs normal operational care. The current docs call out that it has no built-in authentication, stores state in memory, and should not be exposed directly to the public internet. Put it behind the same network controls you would use for any internal debugging surface.

Bounded query state retention (#6896)
Removed the pickle-based result preview tab (#6878)
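For readers curious what bounded retention looks like in practice, here is a generic Python sketch that keeps only the most recently updated queries. It illustrates the general technique under stated assumptions and is not the dashboard's actual code: the BoundedQueryStore class, the eviction policy, and the limit of two queries are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, List


class BoundedQueryStore:
    """Keeps state for at most `max_queries` queries, evicting the oldest."""

    def __init__(self, max_queries: int) -> None:
        if max_queries < 1:
            raise ValueError("max_queries must be at least 1")
        self.max_queries = max_queries
        self._queries: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, query_id: str, state: Any) -> None:
        """Insert or update a query, then evict until the store fits the bound."""
        self._queries[query_id] = state
        self._queries.move_to_end(query_id)  # most recently updated goes last
        while len(self._queries) > self.max_queries:
            self._queries.popitem(last=False)  # drop the least recently updated

    def ids(self) -> List[str]:
        """Return retained query ids, oldest first."""
        return list(self._queries)


store = BoundedQueryStore(max_queries=2)
for query_id in ("q1", "q2", "q3"):
    store.put(query_id, {"status": "finished"})

print(store.ids())  # prints ['q2', 'q3']
```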

Try the dashboard

uv add daft
uv run daft dashboard start
export DAFT_DASHBOARD_URL=http://localhost:3238
uv run my_script.py
# then open http://localhost:3238

For cluster setups, point DAFT_DASHBOARD_URL at a hostname routable from every Ray actor. Full docs at docs.daft.ai/observability/dashboard.

Run the query. Open the dashboard. Start with the slowest operator, then drill into tasks, bytes, and failure details from there.