
daft.File: Work with Any File, Anywhere
Distributed Random Access for Audio, Video, Documents, and Code
by Everett Kleven

Modern AI pipelines increasingly rely on unstructured files: audio, video, PDFs, source code, and more. These assets are often large, remote, and expensive to load eagerly.
daft.File brings file-native handling into Daft’s lazy, distributed execution model. Instead of pulling full objects into memory, you can pass file references through your DataFrame, open them only when needed, and process them in parallel across your compute environment.
Since its introduction in Daft 0.6.0, daft.File has steadily matured into a full-featured capability. With the recent additions of daft.VideoFile and daft.AudioFile, we're eager to see the community apply daft.File to their own use cases. If you run into any trouble, please file an issue on GitHub or reach out directly in our Slack community.
Why daft.File
Daft already provides strong I/O performance for structured formats like Parquet, JSONL, and CSV. daft.File extends that same philosophy to unstructured data by giving you:
- Lazy file references in DataFrames
- Random-access reads through `open()`
- Disk materialization with `to_tempfile()` for libraries that need a local path
- Full reads with `read()` when appropriate
- Local and remote storage support with one interface
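To make these access patterns concrete, here is a minimal stdlib-only stand-in sketching the semantics (this is an illustrative model, not Daft's implementation): a file value stores only a reference, and I/O happens on demand when one of the accessors is called.

```python
import os
import tempfile


class LazyFileSketch:
    """Illustrative stand-in for daft.File's access patterns (not Daft's code)."""

    def __init__(self, path: str):
        self.path = path  # only a reference is stored; no I/O happens here

    def open(self):
        # Random-access reads: a seekable handle, opened on demand
        return open(self.path, "rb")

    def read(self) -> bytes:
        # Full read, for when the whole object is small enough
        with self.open() as fh:
            return fh.read()

    def to_tempfile(self) -> str:
        # Disk materialization: copy bytes to a local path for path-only libraries
        fd, tmp_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as out:
            out.write(self.read())
        return tmp_path
```

The key design point mirrored here is that constructing the value is free; cost is only paid by the accessor you actually invoke, which is what lets file references flow through a lazy DataFrame plan.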
Basic Usage
Here's what that looks like in practice:
```python
import daft

@daft.func
def read_header(f: daft.File) -> bytes:
    with f.open() as fh:
        return fh.read(16)

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/")
    .with_column("file", daft.functions.file(daft.col("path")))
    .with_column("header", read_header(daft.col("file")))
    .select("path", "size", "file", "header")
)

df.show(5)
```

This pattern scales well because the file handle is only opened during execution, where reads can run in parallel across partitions.
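A 16-byte header like the one read above is already enough to sniff many container formats by their magic bytes. As a hedged sketch of what a downstream classifier over that header column might do (pure stdlib, covering only a small illustrative subset of formats):

```python
def sniff_format(header: bytes) -> str:
    """Guess a container format from leading magic bytes (illustrative subset)."""
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # MP3: either an ID3 tag or an MPEG frame-sync prefix
    if header.startswith(b"ID3") or header[:2] in (b"\xff\xfb", b"\xff\xf3"):
        return "mp3"
    # MP4/ISO-BMFF: "ftyp" box at byte offset 4
    if len(header) >= 12 and header[4:8] == b"ftyp":
        return "mp4"
    return "unknown"
```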
Audio: Index, Read, and Resample with daft.AudioFile
With daft.AudioFile, you can index audio, inspect metadata, and prepare tensors for downstream ML tasks.
```python
import daft
from daft.functions import audio_file, audio_metadata, resample

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/audio/*.mp3")
    .with_column("file", audio_file(daft.col("path")))
    .with_column("metadata", audio_metadata(daft.col("file")))
    .with_column("resampled", resample(daft.col("file"), sample_rate=16000))
    .select("path", "file", "size", "metadata", "resampled")
)

df.show(3)
```
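To make the rate arithmetic behind `resample` concrete: mapping audio from one sample rate to another scales the sample count by `dst_rate / src_rate` and interpolates between source samples. A naive linear-interpolation sketch follows (illustrative only; this is not the resampler Daft uses, and production resamplers apply an anti-aliasing filter first):

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation resampler (illustrative, no anti-aliasing)."""
    if not samples:
        return []
    # Output length scales by the ratio of the two sample rates
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in the source
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For example, one second of 44.1 kHz audio resampled to 16 kHz yields 16,000 output samples from 44,100 input samples.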

Video: Read Keyframes with daft.VideoFile
daft.VideoFile enables lightweight indexing and keyframe-aware video processing without writing custom ingestion glue.
```python
import daft
from daft.functions import video_file, video_metadata, video_keyframes

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/videos/*.mp4")
    .with_column("file", video_file(daft.col("path")))
    .with_column("metadata", video_metadata(daft.col("file")))
    .with_column("keyframes", video_keyframes(daft.col("file")))
    .select("path", "file", "size", "metadata", "keyframes")
)

df.show(3)
```
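Keyframes matter because decoders can only start cleanly at one: to reach an arbitrary timestamp, a player seeks to the nearest keyframe at or before the target and decodes forward from there. A hedged sketch of that seek logic (an illustrative helper over keyframe timestamps, not a Daft function):

```python
import bisect


def seek_keyframe(keyframe_times: list[float], target_s: float) -> float:
    """Return the latest keyframe timestamp at or before target_s.

    Decoders seek here and decode forward to the target frame.
    """
    i = bisect.bisect_right(keyframe_times, target_s)
    if i == 0:
        # Target precedes the first keyframe; start at the beginning
        return keyframe_times[0]
    return keyframe_times[i - 1]
```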

PDFs: Structured Extraction from Document Files
daft.File also works cleanly with document tooling. For example, you can convert each PDF to a tempfile and extract page-level text and rendered page images with PyMuPDF in a UDF.
```python
import daft

import pymupdf

@daft.func(
    return_dtype=daft.DataType.list(
        daft.DataType.struct(
            {
                "page_number": daft.DataType.uint8(),
                "page_text": daft.DataType.string(),
                "page_image_bytes": daft.DataType.binary(),
            }
        )
    )
)
def extract_pdf(file: daft.File):
    """Extracts the content of a PDF file."""
    pymupdf.TOOLS.mupdf_display_errors(False)  # Suppress non-fatal MuPDF warnings
    content = []
    with file.to_tempfile() as tmp:
        doc = pymupdf.Document(filename=str(tmp.name), filetype="pdf")
        for pno, page in enumerate(doc):
            row = {
                "page_number": pno,
                "page_text": page.get_text("text"),
                "page_image_bytes": page.get_pixmap().tobytes(),
            }
            content.append(row)
    return content

if __name__ == "__main__":
    from daft import col

    df = (
        daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf")
        .with_column("pdf_file", daft.functions.file(col("path")))
        .with_column("pages", extract_pdf(col("pdf_file")))
        .explode("pages")
        .select("path", "size", daft.functions.unnest(col("pages")))
    )

    df.show(3)
```

This gives you a straightforward way to build document ETL pipelines for retrieval, multimodal indexing, or downstream model input construction.
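For retrieval pipelines specifically, the exploded page text is typically split into overlapping chunks before embedding, so that each indexed unit stays within a model's context budget while preserving continuity across boundaries. A minimal sketch of such a chunker (an assumed downstream step, not part of Daft):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split page text into overlapping character chunks for retrieval indexing."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

In a Daft pipeline, a function like this would run as another UDF over the `page_text` column, with an `explode` to get one row per chunk.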
Code: Build Code Intelligence Pipelines with AST Parsing
Because daft.File is generic, the same interface works for source code analysis: open Python files, parse with ast, and extract function signatures/docstrings into structured rows. This is a strong pattern for code search, indexing, and repository analytics.
```python
import daft
from daft import DataType, col
from daft.functions import unnest, file as daft_file

@daft.func(
    return_dtype=DataType.list(
        DataType.struct(
            {
                "name": DataType.string(),
                "signature": DataType.string(),
                "docstring": DataType.string(),
                "start_line": DataType.int64(),
                "end_line": DataType.int64(),
            }
        )
    )
)
def extract_functions(file: daft.File):
    """Extract all function definitions from a Python file."""
    import ast

    with file.open() as f:
        file_content = f.read().decode("utf-8")

    tree = ast.parse(file_content)
    results = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"def {node.name}({ast.unparse(node.args)})"
            if node.returns:
                signature += f" -> {ast.unparse(node.returns)}"

            results.append({
                "name": node.name,
                "signature": signature,
                "docstring": ast.get_docstring(node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })

    return results


if __name__ == "__main__":
    # Discover Python files from a local Daft clone
    df = (
        daft.from_glob_path("~/git/Daft/daft/functions/**/*.py")
        .with_column("file", daft_file(col("path")))
        .with_column("functions", extract_functions(col("file")))
        .explode("functions")
        .select("path", "size", unnest(col("functions")))
    )

    df.show(3)
```

Unstructured Data, First-Class Treatment
daft.File bridges a major gap between tabular pipelines and real-world unstructured data. You can now treat files as first-class values in Daft: discover them, transform them lazily, and process them with Python/Rust ecosystem libraries at distributed scale.
If you’re already using Daft for structured data, this gives you a direct path to unify audio, video, documents, and code in the same execution model.