
Audio transcription at scale with daft.AudioFile
by Daft Team

Processing audio for AI is harder than it looks. Files arrive at different sample rates — 8 kHz phone calls, 44.1 kHz recordings, 48 kHz podcasts. Whisper expects 16 kHz mono. An hour of 48 kHz stereo is 518 MB uncompressed. And you need to split on silence before transcription, because Whisper only handles 30-second windows.
The usual approach: write a loop, manage workers, resample in memory, hope nothing OOMs.
daft.AudioFile handles the boilerplate.
The pattern
Two things make transcription work at scale:
Model loading is the bottleneck, not inference. A @daft.cls() loads Whisper once per worker. Without it, every file triggers a fresh model download. With it, workers keep the model resident across the batch.
Audio libraries need a file path, not a buffer. audio.to_tempfile() materializes the reference to disk. faster-whisper reads from the path. You don't manage memory.
```python
import daft
from daft import col
from faster_whisper import WhisperModel, BatchedInferencePipeline


@daft.cls()
class WhisperTranscriber:
    def __init__(self):
        model = WhisperModel("turbo", device="auto", compute_type="float16")
        self.pipe = BatchedInferencePipeline(model=model)

    @daft.method(return_dtype=daft.DataType.string())
    def transcribe(self, audio: daft.AudioFile):
        with audio.to_tempfile() as tmp:
            segments, _ = self.pipe.transcribe(
                str(tmp.name),
                vad_filter=True,
                word_timestamps=True,
                batch_size=16,
            )
            return " ".join(s.text for s in segments)


transcriber = WhisperTranscriber()

df = (
    daft.from_files("s3://calls/recordings/**/*.mp3")
    .with_column("transcript", transcriber.transcribe(col("file")))
)
df.show()
```

vad_filter=True handles the 30-second window limit — faster-whisper's built-in Silero VAD splits on silence before transcription. No preprocessing step.
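Since word_timestamps=True is already set, the same class can expose a second method that keeps segment timing instead of flattening everything to one string — handy for subtitles or aligning transcripts with events. A minimal sketch, assuming the WhisperTranscriber above; the method name and struct fields are illustrative, not part of daft's API:

```python
# Inside WhisperTranscriber: a variant that returns timed segments.
# faster-whisper segments carry .start/.end in seconds and .text.
@daft.method(
    return_dtype=daft.DataType.list(
        daft.DataType.struct({
            "start": daft.DataType.float64(),
            "end": daft.DataType.float64(),
            "text": daft.DataType.string(),
        })
    )
)
def transcribe_segments(self, audio: daft.AudioFile):
    with audio.to_tempfile() as tmp:
        segments, _ = self.pipe.transcribe(str(tmp.name), vad_filter=True)
        return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```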
Filter before you transcribe
Long recordings are expensive. audio_metadata() reads headers — sample rate, channels, duration — without decoding the audio itself. Use it to filter before the model touches anything:
```python
from daft.functions import audio_metadata

df = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("meta", audio_metadata(col("file")))
    .where(col("meta")["duration_seconds"] < 3600)  # skip recordings over 1 hour
    .where(col("meta")["sample_rate_hz"] >= 8000)   # skip degraded captures
    .with_column("transcript", transcriber.transcribe(col("file")))
)
```

Same principle as file_path() in Week 2: cheap operations narrow the set, expensive operations run on the survivors.
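The same metadata column also answers corpus-level questions before you commit to a transcription run — how many hours of audio, at what quality. A sketch, assuming daft's standard groupby/agg aggregations:

```python
# Survey the corpus first: total audio per sample rate, file counts.
stats = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("meta", audio_metadata(col("file")))
    .with_column("sample_rate_hz", col("meta")["sample_rate_hz"])
    .with_column("duration_seconds", col("meta")["duration_seconds"])
    .groupby("sample_rate_hz")
    .agg(
        col("duration_seconds").sum().alias("total_seconds"),
        col("duration_seconds").count().alias("num_files"),
    )
)
stats.show()
```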
If you need a specific sample rate, resample() converts before transcription:
```python
df = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("resampled", col("file").cast(daft.AudioFile).resample(16_000))
    .with_column("transcript", transcriber.transcribe(col("resampled")))
)
```
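To sanity-check the conversion, you can read metadata back off the resampled column. A sketch, assuming audio_metadata() applies to the resampled AudioFile the same way it does to the source file:

```python
# Verify the resample: every row should report 16000 Hz.
check = (
    df
    .with_column("resampled_meta", audio_metadata(col("resampled")))
    .select(col("resampled_meta")["sample_rate_hz"])
)
check.show()
```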
From transcripts to insights

Once you have text, the pipeline extends naturally. Summaries, translations, and embeddings are all added as DataFrame columns — same syntax, no new infrastructure:
```python
from daft.functions import prompt, format as daft_format

df_enriched = df.with_column(
    "summary",
    prompt(
        daft_format("Summarize in 2 sentences: {}", col("transcript")),
        model="openai/gpt-4o-mini",
    ),
)
```

The full pattern — transcribe, summarize, embed, query — is in the Voice AI Analytics blog post.
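Embeddings follow the same shape. A sketch using daft's embed_text function — the exact signature, provider, and model name depend on your daft version and configuration, so treat these specifics as assumptions:

```python
from daft.functions import embed_text

df_vectors = df_enriched.with_column(
    "embedding",
    embed_text(
        col("transcript"),
        provider="openai",                # assumed provider
        model="text-embedding-3-small",   # assumed model name
    ),
)
```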
Reference examples in daft-examples/examples/files:
- daft_audiofile.py — audio_metadata(), resample()
- daft_audiofile_udf.py — byte-range seeks for streaming audio

