May 8, 2026
daft.VideoFile: Seek Lazily, Get Frames


daft.VideoFile decodes only the frames you need: keyframes, time-sampled frames, or a windowed seek. Built for robotics datasets, dashcams, and moderation queues.

by Daft Team

A vision model rarely needs every frame. Keyframes, one-frame-per-second, or a specific 10-second window cover most use cases — and decoding the whole file to get there is wasted work.

daft.VideoFile and daft.read_video_frames decode only what you ask for. The slice you want is the slice that gets read.

What this is built for

The clearest case is robotics. Open X-Embodiment aggregates over a million episodes. DROID alone runs 350+ hours of multi-camera 60fps footage. That's hundreds of millions of frames across a single dataset, and most action-model training doesn't need them all: keyframes for retrieval, one frame per second for VLM annotation, a five-second window around each labeled grasp.

The shape repeats. Fleet dashcams record at 30fps for hours, but the events worth scoring (a near-miss, a lane departure, a pedestrian crossing) are seconds long. Security feeds run 24/7; the clip you need is six seconds. Content moderation queues run on user uploads where most of the timeline is empty space between the moments that matter.

Daft's video stack is built for the throw-away ratio. Four column expressions cover the slice patterns:

  1. Keyframes only: is_key_frame=True on read_video_frames, or video_keyframes on a VideoFile column.
  2. Time-sampled: sample_interval_seconds=1.0 picks one frame at or after each second.
  3. Header-only filtering: video_metadata reads resolution, fps, duration, and frame count without decoding a frame.
  4. Time-windowed decode: video_frames(start_time, end_time) decodes just the seek range from a VideoFile.

Same DataFrame, same query plan, same scaling story as the rest of your pipeline.

The shortcut

The fastest way to get usable frames out of a folder of videos is one call:

import daft
 
df = daft.read_video_frames(
    path="s3://bucket/videos/*.mp4",
    image_height=480,
    image_width=640,
    is_key_frame=True,
)
df.show()

Each row is one frame. Columns include frame_index, frame_time, is_key_frame, and data as Image[RGB; 480 x 640] ready to feed to a model. Globs work; lists of paths work; YouTube URLs work:

df = daft.read_video_frames(
    path=[
        "https://www.youtube.com/watch?v=jNQXAC9IVRw",
        "https://www.youtube.com/watch?v=N2rZxCrb7iU",
    ],
    image_height=480,
    image_width=640,
    is_key_frame=True,
)

is_key_frame=True filters at the decoder. For a 1-hour H.264 video that's typically 200–500 frames instead of 108,000 — the compression structure already decided which frames carry the most novel information, and we lean on that. If you want temporal coverage instead of compression-driven sparsity, pass sample_interval_seconds=1.0 and Daft picks the first frame at or after each second:

df = daft.read_video_frames(
    path="s3://bucket/videos/*.mp4",
    image_height=480,
    image_width=640,
    sample_interval_seconds=1.0,
)

Both filters can stack — keyframes sampled at one-second intervals — so you keep the compression-aware sparsity and get a predictable temporal grid.
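The sampling rule is easy to picture in plain Python. The sketch below only illustrates the semantics described above (pick the first frame at or after each interval boundary); it is a hypothetical helper, not Daft's actual decoder code:

```python
def sample_by_interval(frame_times, interval):
    """Pick indices of the first frame at or after each interval boundary.

    Illustrative sketch of sample_interval_seconds semantics;
    not Daft's implementation.
    """
    selected = []
    next_boundary = 0.0
    for i, t in enumerate(frame_times):
        if t >= next_boundary:
            selected.append(i)
            # Advance to the next boundary strictly after this frame.
            while next_boundary <= t:
                next_boundary += interval
    return selected

# A 30fps video: frame timestamps at 0.0, 1/30, 2/30, ...
times = [i / 30 for i in range(91)]
print(sample_by_interval(times, 1.0))  # → [0, 30, 60, 90]
```

With irregular timestamps (variable frame rate, or keyframes only), the same rule still yields at most one frame per interval, which is why stacking the two filters gives a predictable temporal grid.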

Filter before you decode

read_video_frames is the path of least resistance. When you need to make decisions per-video — skip anything over an hour, only process 1080p or higher, dispatch by codec — wrap the path in video_file and let video_metadata inspect headers without decoding a frame:

import daft
from daft.functions import unnest, video_file, video_metadata
 
df = (
    daft.from_files("s3://bucket/videos/**/*.mp4")
    .with_column("video_file", video_file(daft.col("file")))
    .with_column("video_meta", video_metadata(daft.col("video_file")))
    .select("video_file", unnest(daft.col("video_meta")))
    .where(daft.col("duration_seconds") < 3600)
    .where(daft.col("height") >= 1080)
)

Same pattern as Week 2's file_path() filter: cheap operations narrow the set, expensive operations run on the survivors. Header reads are HTTP range requests — no full download.

Targeted decode with video_frames

Once you've narrowed the set, video_frames decodes a specific time range from a VideoFile column. One row per video, with the decoded frames as a list of structs you can .explode() into per-frame rows:

import daft
from daft.functions import video_file, video_frames
 
df = (
    daft.from_files("s3://bucket/videos/*.mp4")
    .with_column("videofile", video_file(daft.col("file"), verify=True))
    .with_column(
        "frames",
        video_frames(
            daft.col("videofile"),
            start_time=0.0,
            end_time=10.0,
        ),
    )
    .explode("frames")
)

start_time and end_time are seconds. The decoder seeks to the nearest preceding keyframe and walks forward — so a 10-second window from a 90-minute video reads roughly 10 seconds of bytes, not the whole file. That's the "stream-based" promise: every worker pulls only what it needs from object storage.
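The seek step amounts to a binary search over keyframe timestamps. A minimal sketch of the idea, with a hypothetical helper name (this is not Daft's decoder, just the behavior it describes):

```python
import bisect

def seek_start(keyframe_times, start_time):
    """Return the timestamp of the nearest keyframe at or before start_time.

    Illustrative sketch of keyframe seeking: decoding must begin at a
    keyframe, then the decoder walks forward to the requested start_time.
    """
    # bisect_right gives the insertion point; the entry just before it
    # is the latest keyframe not after start_time.
    i = bisect.bisect_right(keyframe_times, start_time)
    return keyframe_times[max(i - 1, 0)]

# Keyframes every 2 seconds; a window starting at t=5.3s decodes from t=4.0s.
print(seek_start([0.0, 2.0, 4.0, 6.0, 8.0], 5.3))  # → 4.0
```

The gap between the chosen keyframe and start_time is the only over-read: a few frames of walk-forward, not the rest of the file.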

video_keyframes is the convenience version when you only want keyframes per file:

from daft.functions import video_keyframes
 
df = df.with_column("keyframes", video_keyframes(daft.col("videofile")))

From frames to inference

Frames as DataFrame rows compose with the rest of Daft like any other image column. Send them to a vision model with prompt, classify with classify_image, or write the embeddings to Iceberg:

from daft.functions import prompt, format as daft_format
 
df_captioned = df.with_column(
    "caption",
    prompt(
        daft_format("Describe this scene in one sentence: {}", daft.col("data")),
        model="openai/gpt-5.5",
    ),
)

Same DataFrame, no new infrastructure — the file decode step disappears into the column.

Reference example in daft-examples/examples/files:

  • daft_videofile.py — full pattern with video_file, video_metadata, video_keyframes
