February 17, 2026

daft.File: Work with Any File, Anywhere

Distributed Random Access for Audio, Video, Documents, and Code

by Everett Kleven

Modern AI pipelines increasingly rely on unstructured files: audio, video, PDFs, source code, and more. These assets are often large, remote, and expensive to load eagerly.

daft.File brings file-native handling into Daft’s lazy, distributed execution model. Instead of pulling full objects into memory, you can pass file references through your DataFrame, open them only when needed, and process them in parallel across your compute environment.

Since its introduction in Daft 0.6.0, daft.File has steadily matured into a full-featured capability. With the recent additions of daft.VideoFile and daft.AudioFile, we're eager to see the community apply daft.File to their own use cases. If you run into any trouble, please file an issue on GitHub or reach out directly in our Slack community.

Why daft.File

Daft already provides strong I/O performance for structured formats like Parquet, JSONL, and CSV. daft.File extends that same philosophy to unstructured data by giving you:

  • Lazy file references in DataFrames

  • Random-access reads through open()

  • Disk materialization with to_tempfile() for libraries that need a local path

  • Full reads with read() when appropriate (all three access methods are sketched just after this list)

  • Local and remote storage support with one interface
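Concretely, here is a minimal sketch of those three access methods inside a single UDF. The summarize function is a hypothetical example, not part of Daft's API:

import daft

@daft.func
def summarize(f: daft.File) -> str:
    # Random access: read only the first 4 bytes (e.g., a magic number)
    with f.open() as fh:
        magic = fh.read(4)

    # Full read: pull the whole object into memory when it is small enough
    num_bytes = len(f.read())

    # Disk materialization: for libraries that require a local path
    with f.to_tempfile() as tmp:
        local_path = str(tmp.name)

    return f"magic={magic!r} size={num_bytes} staged_at={local_path}"

# Usage: df.with_column("summary", summarize(daft.col("file")))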

Basic Usage

Here's what that looks like in practice:

import daft

@daft.func
def read_header(f: daft.File) -> bytes:
    with f.open() as fh:
        return fh.read(16)

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/")
    .with_column("file", daft.functions.file(daft.col("path")))
    .with_column("header", read_header(daft.col("file")))
    .select("path", "size", "file", "header")
)

df.show(5)

This pattern scales well because each file handle is opened only at execution time, so reads run in parallel across partitions.

Audio: Index, Read, and Resample with daft.AudioFile

With daft.AudioFile, you can index audio, inspect metadata, and prepare tensors for downstream ML tasks.

import daft
from daft.functions import audio_file, audio_metadata, resample

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/audio/*.mp3")
    .with_column("file", audio_file(daft.col("path")))
    .with_column("metadata", audio_metadata(daft.col("file")))
    .with_column("resampled", resample(daft.col("file"), sample_rate=16000))
    .select("path", "file", "size", "metadata", "resampled")
)

df.show(3)
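You're not limited to the built-in functions: the same daft.File handle can feed any audio library through to_tempfile(). Here is a minimal sketch, assuming the third-party soundfile package is installed and its libsndfile build can decode your format; load_waveform is a hypothetical helper, not a Daft API:

import daft
import numpy as np
import soundfile as sf  # third-party decoder (assumption); any path-based library works

@daft.func(return_dtype=daft.DataType.list(daft.DataType.float32()))
def load_waveform(f: daft.File):
    # Stage the (possibly remote) file on local disk so soundfile can open it by path
    with f.to_tempfile() as tmp:
        data, _ = sf.read(str(tmp.name))
    # Downmix multi-channel audio to mono for a uniform output shape
    if data.ndim > 1:
        data = data.mean(axis=1)
    return data.astype(np.float32).tolist()

# Usage: df.with_column("waveform", load_waveform(daft.col("file")))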

Video: Read Keyframes with daft.VideoFile

daft.VideoFile enables lightweight indexing and keyframe-aware video processing without writing custom ingestion glue.

import daft
from daft.functions import video_file, video_metadata, video_keyframes

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/videos/*.mp4")
    .with_column("file", video_file(daft.col("path")))
    .with_column("metadata", video_metadata(daft.col("file")))
    .with_column("keyframes", video_keyframes(daft.col("file")))
    .select("path", "file", "size", "metadata", "keyframes")
)

df.show(3)
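If you'd rather decode keyframes yourself than use the built-in function, the generic to_tempfile() pattern still applies. A minimal sketch, assuming the third-party PyAV package is installed; count_keyframes is a hypothetical helper, not a Daft API:

import av  # third-party container/codec library (assumption)
import daft

@daft.func
def count_keyframes(f: daft.File) -> int:
    # Stage locally so the decoder can seek freely
    with f.to_tempfile() as tmp:
        with av.open(str(tmp.name)) as container:
            stream = container.streams.video[0]
            # Ask the decoder to skip everything except keyframes
            stream.codec_context.skip_frame = "NONKEY"
            return sum(1 for _ in container.decode(stream))

# Usage: df.with_column("n_keyframes", count_keyframes(daft.col("file")))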

PDFs: Structured Extraction from Document Files

daft.File also works cleanly with document tooling. For example, you can convert each PDF to a tempfile and extract page-level text and rendered page images with PyMuPDF in a UDF.

import daft

import pymupdf

@daft.func(
    return_dtype=daft.DataType.list(
        daft.DataType.struct(
            {
                "page_number": daft.DataType.uint8(),
                "page_text": daft.DataType.string(),
                "page_image_bytes": daft.DataType.binary(),
            }
        )
    )
)
def extract_pdf(file: daft.File):
    """Extracts the content of a PDF file."""
    pymupdf.TOOLS.mupdf_display_errors(False)  # Suppress non-fatal MuPDF warnings
    content = []
    with file.to_tempfile() as tmp:
        doc = pymupdf.Document(filename=str(tmp.name), filetype="pdf")
        for pno, page in enumerate(doc):
            row = {
                "page_number": pno,
                "page_text": page.get_text("text"),
                "page_image_bytes": page.get_pixmap().tobytes(),
            }
            content.append(row)
    return content

if __name__ == "__main__":
    from daft import col

    df = (
        daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf")
        .with_column("pdf_file", daft.functions.file(col("path")))
        .with_column("pages", extract_pdf(col("pdf_file")))
        .explode("pages")
        .select("path", "size", daft.functions.unnest(col("pages")))
    )

    df.show(3)

This gives you a straightforward way to build document ETL pipelines for retrieval, multimodal indexing, or downstream model input construction.
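From there, the exploded page rows behave like any other Daft DataFrame, so persisting them for a downstream indexer is one call. A minimal sketch, where the output path is hypothetical:

# Write page-level rows out for a retrieval or indexing job
df.write_parquet("pdf_pages/")  # local path here; object-store URIs also work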

Code: Build Code Intelligence Pipelines with AST Parsing

Because daft.File is generic, the same interface works for source code analysis: open Python files, parse with ast, and extract function signatures/docstrings into structured rows. This is a strong pattern for code search, indexing, and repository analytics.

import daft
from daft import DataType, col
from daft.functions import unnest, file as daft_file

@daft.func(
    return_dtype=DataType.list(
        DataType.struct(
            {
                "name": DataType.string(),
                "signature": DataType.string(),
                "docstring": DataType.string(),
                "start_line": DataType.int64(),
                "end_line": DataType.int64(),
            }
        )
    )
)
def extract_functions(file: daft.File):
    """Extract all function definitions from a Python file."""
    import ast

    with file.open() as f:
        file_content = f.read().decode("utf-8")

    tree = ast.parse(file_content)
    results = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signature = f"{prefix} {node.name}({ast.unparse(node.args)})"
            if node.returns:
                signature += f" -> {ast.unparse(node.returns)}"

            results.append({
                "name": node.name,
                "signature": signature,
                "docstring": ast.get_docstring(node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })

    return results


if __name__ == "__main__":
    # Discover Python files from my local Daft clone
    df = (
        daft.from_glob_path("~/git/Daft/daft/functions/**/*.py")
        .with_column("file", daft_file(col("path")))
        .with_column("functions", extract_functions(col("file")))
        .explode("functions")
        .select("path", "size", unnest(col("functions")))
    )

    df.show(3)
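Once the functions are exploded into rows, ordinary DataFrame operations take over. As one small sketch against the schema above, you can surface functions that are missing docstrings:

# Find undocumented functions across the repository
undocumented = df.where(col("docstring").is_null()).select("path", "name", "signature")
undocumented.show(10)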

Unstructured Data, First-Class Treatment

daft.File bridges a major gap between tabular pipelines and real-world unstructured data. You can now treat files as first-class values in Daft: discover them, transform them lazily, and process them with Python/Rust ecosystem libraries at distributed scale.

If you’re already using Daft for structured data, this gives you a direct path to unify audio, video, documents, and code in the same execution model.

Get updates, contribute code, or say hi: join us in GitHub Discussions or the Distributed Data Community Slack.