
How We're Making Observability Better in Daft
Daft Observability Roadmap: metrics, OTEL integration, real-time dashboards, and DataFrame APIs for debugging and monitoring distributed pipelines.
by Srinivas Lade

As part of our goal to make Daft the best engine for your data processing and task scheduling needs, we’ve been looking into ways to improve the experience of using Daft, particularly when:
1. You’re developing a job locally and need help debugging or tuning
2. You’re ready to deploy & scale your job and need monitoring on your cluster
3. A deployed job failed or was slow, and you need to figure out what happened
In this post, I’ll talk about how we are approaching observability in Daft for situations like these three. We’ll go through what is currently possible, what we plan to add in the upcoming releases, and how y’all can help shape our roadmap.
The Four Horsemen of Observability
Fun fact, I originally called them “Pillars”, only to learn that there’s a very common phrase “3 pillars of observability” that stands for traces, metrics, and logs. Oops …
Development of observability integrations in Daft has generally fallen under the following four categories:
- Internal: Tooling to efficiently collect query metadata and attributes and share them with the other three endpoints
- DataFrame APIs: A Python-based interface for high-level introspection of queries and runs
- Dashboard and Progress Bars: Built-in tooling to visualize real-time progress at both local and distributed scales
- OpenTelemetry Integrations: Hooks to connect Daft-specific telemetry with the rest of your observability stack
1) Internal Monitoring
While often not apparent from a user's perspective, a lot of work has gone into building a framework that efficiently collects telemetry data in a distributed fashion while reducing overhead, handling node- and task-level failures, and ensuring accuracy. Some topics in particular include:
- Tracking task-level operator metrics on workers and aggregating them on the scheduler
- Capturing UDF stdout/stderr and piping it through as tagged logs that are easy to identify
This is an area of active development as we expand upon the user-facing interfaces, so we may go into more detail in a future blog post.
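Daft's actual implementation of log capture lives inside the engine, but the core idea can be sketched in a few lines of standard-library Python. Note that the helper name and tag format below are illustrative, not Daft's API:

```python
import contextlib
import io
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("daft.udf")

def run_udf_with_log_capture(udf, *args, task_id="task-0"):
    """Run a UDF, capturing its stdout and re-emitting each line
    as a log record tagged with the task that produced it."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = udf(*args)
    for line in buf.getvalue().splitlines():
        # Tags let you grep one task's output out of a cluster's logs.
        logger.info("[udf][%s] %s", task_id, line)
    return result

def my_udf(x):
    print(f"processing {x}")  # a user's print(), captured above
    return x * 2

value = run_udf_with_log_capture(my_udf, 21, task_id="task-7")
```

In a distributed run, the same pattern applies per worker: each task's captured output is shipped to the scheduler with its tags so it can be correlated with the operator that emitted it.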
2) DataFrame APIs
DataFrame APIs are the fundamental starting point for observability, offering programmer-centric introspection. While highly expressive for defining computations, they present a challenge when debugging large-scale, distributed computations where the failure is not immediately apparent in the code.
Strength:
Immediate, granular access to execution details (e.g., plan optimization, operation timing) directly within the user's familiar Python environment. Ideal for unit-testing and local debugging of data transformations.
Weakness:
Poor visualization for job-wide performance bottlenecks, cross-process data lineage, or real-time progress monitoring in a cluster environment.
Improvement Focus:
We will continue refining API hooks and adding more explicit methods for surfacing low-level statistics and plan details, making it easier for users to pinpoint exactly where in the plan performance is dropping before deploying to a cluster.
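Today the main entry point for this is `DataFrame.explain`. A minimal local session might look like the following (assuming a recent Daft release where `explain(show_all=True)` is available):

```python
import daft

# Build a small local DataFrame and add a derived column.
df = daft.from_pydict({"a": [1, 2, 3]})
df = df.with_column("b", daft.col("a") * 2)

# Print the logical plan; with show_all=True, the optimized and
# physical plans are shown as well, without executing anything.
df.explain(show_all=True)

# Execute and pull results back for local inspection.
print(df.to_pydict())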
3) OpenTelemetry Integrations
OpenTelemetry (OTel) serves as the standardized bridge connecting the API details to the Dashboard's visualization. OTel's power lies in its vendor-agnostic ability to correlate logs, metrics, and traces across the entire distributed system.
Strength:
Provides the essential technical glue for tracing requests and data flow across complex, multi-service architectures. It allows for the aggregation of granular API-level events (traces/logs) into a structured format that the Dashboard can interpret.
Weakness:
The data can be noisy and complex for end-users to parse directly. Its efficacy is dependent on robust instrumentation within the core execution engine.
Improvement Focus:
Systematically expanding instrumentation coverage to capture key performance metrics for data shuffling, I/O wait times, and memory utilization, ensuring all critical path operations generate meaningful traces that are immediately actionable in the Dashboard.
Overall
These three feature sets are not isolated workstreams; each track complements the others' strengths and covers their weaknesses, providing a unified observability stack for Daft. Given the following scenarios, here are some potential approaches:
| Scenario | Tools | Methodology |
|---|---|---|
| Local Development & Debugging | DataFrame APIs & Dashboard | Use APIs to inspect high-level plans and statistics. Use the Dashboard and progress bars for quick monitoring and any deeper dives. |
| Deployed Job Monitoring | APIs, Dashboard & OTel | APIs provide programmatic hooks for downstream applications or orchestration tools. The Dashboard gives basic indications of query progress. OTel can be used to locate potential hangs or errors. |
| Profiling and Performance Tuning | Dashboard & OTel | Use the Dashboard to find bottlenecks, then jump into OTel to search for specifics. |
Our long-term goal is not just to enhance each component individually, but to develop a clear, integrated way to understand query execution within Daft itself. Some improvements we anticipate in upcoming releases include:
- A post-execution query analysis API, similar to SQL's EXPLAIN ANALYZE
- OpenTelemetry integrations for tracing query progress and UDF logging
- Plan DAG and task visualizations within the dashboard
- Real progress bars by estimating query progress (a personal favorite of mine 😉)
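As a purely speculative sketch of the first item, a post-execution analysis API could surface per-operator statistics much like EXPLAIN ANALYZE does. None of the names below are real Daft APIs; this is just one shape such a result could take:

```python
from dataclasses import dataclass

# Hypothetical per-operator stats record; field names are made up
# for illustration and do not reflect Daft's design.
@dataclass
class OperatorStats:
    name: str
    rows_out: int
    wall_time_ms: float

def summarize(stats: list[OperatorStats]) -> str:
    """Render per-operator stats, EXPLAIN ANALYZE style."""
    return "\n".join(
        f"{s.name}: {s.rows_out} rows in {s.wall_time_ms:.1f} ms" for s in stats
    )

report = summarize([
    OperatorStats("scan", 1_000_000, 420.0),
    OperatorStats("filter", 250_000, 35.5),
])
print(report)
```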
And many more! Please check our GitHub discussion for an updated list of milestones and progress. And feel free to comment or propose ideas in the discussion or our issues: it always helps to learn of real-world use cases and problems that arise.