
daft.File: Lazy Metadata Filters
Filter millions of files by path, size, and content type before opening any of them. Cheap operations first, expensive operations on the survivors.
by Everett Kleven
Previously we introduced daft.File, a lazily evaluated file reference that treats unstructured data as a first-class type. This week: opening files and using metadata to control what gets opened.
The pattern
daft.File is lazy. When you call daft.from_files(), nothing downloads. You get lightweight references — millions of them if needed.
from_files() accepts standard glob patterns (docs):
| Pattern | Matches |
|---|---|
| * | Any number of characters |
| ? | Any single character |
| [...] | Any single character in the brackets |
| ** | Directories, recursively |
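The wildcards in the table can be tried on a local directory before pointing them at a bucket. The sketch below uses Python's standard glob module on a throwaway tree. The file names are made up, and it assumes Daft's matching agrees with Python's for these simple cases, which may not hold in every edge case (hidden files, for instance).

```python
import glob
import os
import tempfile

# Hypothetical sample layout, created as empty files.
SAMPLE = [
    "docs/a.md",
    "docs/guide/b.md",
    "docs/notes.txt",
    "logs/2026-03-01.jsonl",
    "logs/2026-03-15.jsonl",
    "logs/2026-04-01.jsonl",
    "a/x.pdf",
    "b/y.pdf",
]


def make_tree(root, rel_paths):
    """Create each relative path under root as an empty file."""
    for rel in rel_paths:
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w"):
            pass


def match(root, pattern):
    """Return sorted matches for pattern, relative to root, with / separators."""
    hits = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    make_tree(root, SAMPLE)
    recursive_md = match(root, "docs/**/*.md")          # ** descends into subdirectories
    march_logs = match(root, "logs/2026-03-??.jsonl")   # ? matches exactly one character
    either_dir = match(root, "[ab]/*.pdf")              # [...] matches one listed character

print(recursive_md)
print(march_logs)
print(either_dir)
```

Passed to from_files(), the same wildcards look like this: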
daft.from_files("s3://bucket/docs/**/*.md") # recursive
daft.from_files("s3://bucket/logs/2026-03-??.jsonl") # single char wildcard
daft.from_files(["s3://bucket/a/*.pdf", "s3://bucket/b/*.pdf"]) # multiple patterns
The real work starts when a UDF calls .open() or .to_tempfile() inside distributed execution. But you don't want to open 2 million files if you only need 50,000 of them. That's where metadata filtering comes in: file_path(), file_size(), and guess_mime_type() let you narrow the set before any file gets opened. Cheap operations first, expensive operations on the survivors.
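The ordering matters more than the specific functions. As a plain-Python illustration of the same funnel (this is not Daft's API; survivors and expensive_parse are hypothetical names), the sketch below checks the path string and the size metadata first, and only reads the files that pass both.

```python
import os
import tempfile

opened = []  # base name of every file that actually gets read


def expensive_parse(path):
    """Stand-in for real work: read the whole file and count its lines."""
    opened.append(os.path.basename(path))
    with open(path, "rb") as f:
        return len(f.read().splitlines())


def survivors(paths, suffix, max_bytes):
    """Yield paths that pass the cheap checks, cheapest check first."""
    for path in paths:
        if not path.endswith(suffix):          # string comparison only
            continue
        if os.stat(path).st_size > max_bytes:  # metadata lookup, no read
            continue
        yield path


# Hypothetical sample files: a small note, an oversized note, and an image.
samples = {
    "small.md": b"# Title\n",
    "huge.md": b"x" * 10_000,
    "image.png": b"\x89PNG\r\n\x1a\n",
}

with tempfile.TemporaryDirectory() as root:
    paths = []
    for name, data in samples.items():
        path = os.path.join(root, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    kept = list(survivors(paths, ".md", max_bytes=1_000))
    line_counts = {os.path.basename(p): expensive_parse(p) for p in kept}

print(line_counts)  # results only for files that passed both checks
print(opened)       # the files that were actually read
```

Of the three sample files, only one is ever read; the other two are rejected on metadata alone.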
Opening files: markdown example
Parse every markdown file in a repository — extract headings into a structured DataFrame:
from collections.abc import Iterator
from typing import TypedDict

import daft
from daft import col
from daft.functions import unnest


class Heading(TypedDict):
    level: int
    text: str


@daft.func
def extract_headings(file: daft.File) -> Iterator[Heading]:
    with file.open() as f:
        content = f.read().decode("utf-8")
    for line in content.splitlines():
        if line.startswith("#"):
            yield Heading(
                level=len(line) - len(line.lstrip("#")),
                text=line.lstrip("# ").strip(),
            )


df = (
    daft.from_files("**/*.md")
    .with_column("heading", extract_headings(col("file")))
    .select(col("file"), unnest(col("heading")))
)
df.show(10)
Three things worth noting:
- from_files("**/*.md") already returns a file column of type daft.File, so no cast or separate setup step is needed. The glob pattern handles filtering to .md files directly.
- @daft.func with Iterator[Heading] is row-generating: each yield becomes a separate row as a struct. Use unnest to expand the struct fields (level, text) into columns. This is different from a UDF that returns a list; those use explode.
- The engine handles distribution across partitions, so the same code works on 10 files or 10,000.
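To see what "each yield becomes a separate row" means without running the engine, the parsing logic can be pulled out into a plain generator. The sketch below is a slightly stricter variant, not the function above: parse_headings is a hypothetical helper that also skips fenced code blocks and requires a space after the hashes, so a Python comment or a #hashtag is not reported as a heading. The fence tracking is deliberately naive.

```python
from collections.abc import Iterator
from typing import TypedDict

FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe


class Heading(TypedDict):
    level: int
    text: str


def parse_headings(content: str) -> Iterator[Heading]:
    """Yield one Heading per ATX-style heading line outside fenced code."""
    in_fence = False
    for line in content.splitlines():
        if line.lstrip().startswith((FENCE, "~~~")):
            in_fence = not in_fence  # naive toggle on every fence line
            continue
        if in_fence or not line.startswith("#"):
            continue
        hashes = len(line) - len(line.lstrip("#"))
        rest = line[hashes:]
        # Require 1 to 6 hashes followed by a space (or nothing at all).
        if hashes > 6 or (rest and not rest.startswith(" ")):
            continue
        yield Heading(level=hashes, text=rest.strip())


sample = "\n".join(
    [
        "# Title",
        "intro",
        FENCE + "python",
        "# a comment, not a heading",
        FENCE,
        "## Section",
        "#hashtag",
    ]
)

rows = list(parse_headings(sample))
for row in rows:
    print(row)  # one dict per yield, i.e. one row per heading
```

Inside the UDF, the loop could then reduce to yield from parse_headings(content), which keeps the parsing testable on plain strings.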
Filtering before you open
file_path() requires daft>=0.7.9.
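Opening files is the expensive part. file_path() lets you filter at the reference level, with no I/O and no egress. guess_mime_type() has to look at a file's magic bytes, which is still far cheaper than opening and parsing the whole file.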
By path
import daft
from daft import col
from daft.functions import file_path
path = file_path(col("file"))
# Only markdown files from the docs directory
df = (
    daft.from_files("s3://repo/**/*")
    .where(path.endswith(".md"))
    .where(path.contains("/docs/"))
)
By content type
Extension matching is fast but unreliable — renamed files, missing extensions. guess_mime_type() inspects magic bytes:
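Magic bytes are the fixed signatures that many formats place at the very start of a file. To make the idea concrete, here is a minimal standard-library sketch. It is not Daft's implementation; sniff_mime and the short signature table are hypothetical, and real detectors recognise far more formats.

```python
from typing import Optional

# A few well-known leading signatures; illustrative only.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_mime(head: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of a file, or None if unknown."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


# A PDF saved as "report.txt" still begins with the PDF signature,
# so the extension does not matter to this check.
renamed_pdf_head = b"%PDF-1.7\n"
png_head = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(sniff_mime(renamed_pdf_head))
print(sniff_mime(png_head))
print(sniff_mime(b"just some text"))
```

In Daft, the same kind of check is a single expression: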
import daft
from daft import col
from daft.functions import guess_mime_type
df = daft.from_files("s3://inbox/**/*")
df = df.with_column("mime", guess_mime_type(col("file")))
pdfs = df.where(col("mime") == "application/pdf")
images = df.where(col("mime").startswith("image/"))
Extract metadata from paths
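The two patterns used in this section are ordinary regular expressions, so they can be checked on plain strings first. A minimal sketch with Python's re module follows. The sample paths and the extract_date helper are made up for illustration, and it assumes the engine's regex dialect treats these simple patterns the same way.

```python
import re

FILENAME = re.compile(r"[^/]+$")            # everything after the last slash
DATE = re.compile(r"(\d{4}-\d{2}-\d{2})")   # an ISO-style date such as 2026-03-01


def extract_date(path):
    """Return the date embedded in the file name, or None if there is none."""
    name = FILENAME.search(path)
    if name is None:
        return None
    found = DATE.search(name.group(0))
    return found.group(0) if found else None


# Hypothetical object paths.
paths = [
    "s3://logs/events-2026-02-27.jsonl",
    "s3://logs/events-2026-03-01.jsonl",
    "s3://logs/events-2026-03-14.jsonl",
    "s3://logs/events-latest.jsonl",
]

dated = {p: extract_date(p) for p in paths}

# Keep files dated on or after the cutoff, using plain string comparison.
recent = [p for p, d in dated.items() if d is not None and d >= "2026-03-01"]

print(dated)
print(recent)
```

The string comparison works because ISO dates are fixed-width with the most significant field first, so lexicographic order matches chronological order. The Daft version applies the same two patterns as column expressions: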
import daft
from daft import col
from daft.functions import file_path
# Partition by date from filenames
df = (
    daft.from_files("s3://logs/events-*.jsonl")
    .with_column("filename", file_path(col("file")).regexp_extract(r"[^/]+$", 0))
    .with_column("date", col("filename").regexp_extract(r"(\d{4}-\d{2}-\d{2})", 0))
    .where(col("date") >= "2026-03-01")
)
What else can you open
The same .open() and .to_tempfile() interface works for any file type. Run these yourself:
- PDFs: extract page text and rendered images with PyMuPDF (daft_file_pdf.py)
- Python source: parse ASTs, extract function signatures and docstrings (daft_file_code.py)
- Audio: resample, transcribe, extract metadata (daft_audiofile.py)
- Video: frame extraction, keyframes, streaming (daft_videofile.py)
All examples: Eventual-Inc/daft-examples/examples/files

