Welcome to the Daft blog

Join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.

Multimodal Embeddings: Tutorial & Examples
Engineering
April 15, 2026

Learn multimodal embedding techniques for cross-modal search, recommendation systems, and content moderation applications.

Daft v0.7.9: Temporal Arithmetic, Video Frame Decoding, and Native UUID
Engineering
April 13, 2026

Migrating ETL workloads from Spark means hitting gaps in date arithmetic — functions like `date_add`, `date_diff`, and epoch conversions that Spark users take for granted. Daft v0.7.9 closes that gap

Daft v0.7.8: Dashboard Heatmaps, Vectorized GroupBy, and Native Image Hashing
Product
April 10, 2026

Daft's query dashboard now shows you exactly where time is going. Slow operators light up red, completed nodes turn green, and arrows trace the data flow through your pipeline. No more guessing which

Open-sourcing 43 Billion Tokens of SEC EDGAR
Case Studies
April 9, 2026

Datamule, Teraflop AI, and Eventual collaborated to release the SEC-EDGAR dataset containing 590 GB of data, spanning 8 million samples and 43 billion tokens from all major filings in the SEC EDGAR database.

Daft v0.7.7: Parquet Cache Regression Fixed, df.shuffle(), and Coalesce Short-Circuit
Announcements
April 3, 2026

Daft v0.7.7 fixes a parquet streaming regression that made aggregations 2-4x slower, adds df.shuffle() for ML data prep, and makes coalesce short-circuit per the SQL spec.

Daft v0.7.6: Every Major Lake Format, O(1) Scalars, and Swordfish Plan Caching
Announcements
March 31, 2026

Daft natively reads and writes every major open lake format — Iceberg, Delta Lake, Hudi, and now Apache Paimon. Plus O(1) scalar columns, fingerprint-based plan caching in Swordfish, and production observability.

Daft UDF Patterns: Four Patterns, One Notebook
Product
March 30, 2026

Row-wise, generator, async, and stateful UDFs — one notebook, one dataset, runnable side by side.

GPU Inference with @daft.cls
Product
March 23, 2026

Run GPU models on millions of rows without OOM. Real patterns from ByteDance, Essential AI, and more.

Stateful UDFs with daft.cls: Python Classes that Scale
Product
March 17, 2026

Turn any Python class into a distributed operator. Hold models, connections, and clients across rows with one decorator.

Get updates, contribute code, or say hi.
GitHub Discussions Forums
The Distributed Data Community Slack