
Audio transcription at scale with daft.AudioFile
by Daft Team

Processing audio for AI is harder than it looks. Files arrive at different sample rates — 8 kHz phone calls, 44.1 kHz recordings, 48 kHz podcasts. Whisper expects 16 kHz mono. An hour of 48 kHz stereo is 518 MB uncompressed. And you need to split on silence before transcription, because Whisper only handles 30-second windows.
The usual approach: write a loop, manage workers, resample in memory, hope nothing OOMs.
daft.AudioFile handles the boilerplate.
The pattern
Two things make transcription work at scale:
Model loading is the bottleneck, not inference. A @daft.cls() loads Whisper once per worker. Without it, every file triggers a fresh model download. With it, workers keep the model resident across the batch.
Audio libraries need a file path, not a buffer. audio.to_tempfile() materializes the reference to disk. faster-whisper reads from the path. You don't manage memory.
```python
import daft
from daft import col
from faster_whisper import WhisperModel, BatchedInferencePipeline


@daft.cls()
class WhisperTranscriber:
    def __init__(self):
        model = WhisperModel("turbo", device="auto", compute_type="float16")
        self.pipe = BatchedInferencePipeline(model=model)

    @daft.method(return_dtype=daft.DataType.string())
    def transcribe(self, audio: daft.AudioFile):
        with audio.to_tempfile() as tmp:
            segments, _ = self.pipe.transcribe(
                str(tmp.name),
                vad_filter=True,
                word_timestamps=True,
                batch_size=16,
            )
            return " ".join(s.text for s in segments)


transcriber = WhisperTranscriber()

df = (
    daft.from_files("s3://calls/recordings/**/*.mp3")
    .with_column("transcript", transcriber.transcribe(col("file")))
)
df.show()
```

vad_filter=True handles the 30-second window limit — faster-whisper's built-in Silero VAD splits on silence before transcription. No preprocessing step.
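Since word_timestamps=True is already set, the same class can expose a second method that keeps segment timing instead of flattening everything to one string — handy for subtitles or aligning transcripts with events. A minimal sketch, assuming the WhisperTranscriber above; the method name and struct fields are illustrative, not part of daft's API:

```python
# Inside WhisperTranscriber: a variant that returns timed segments.
# faster-whisper segments carry .start/.end in seconds and .text.
@daft.method(
    return_dtype=daft.DataType.list(
        daft.DataType.struct({
            "start": daft.DataType.float64(),
            "end": daft.DataType.float64(),
            "text": daft.DataType.string(),
        })
    )
)
def transcribe_segments(self, audio: daft.AudioFile):
    with audio.to_tempfile() as tmp:
        segments, _ = self.pipe.transcribe(str(tmp.name), vad_filter=True)
        return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```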
Filter before you transcribe
Long recordings are expensive. audio_metadata() reads headers — sample rate, channels, duration — without decoding the audio itself. Use it to filter before the model touches anything:
```python
from daft.functions import audio_metadata

df = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("meta", audio_metadata(col("file")))
    .where(col("meta")["duration_seconds"] < 3600)  # skip recordings over 1 hour
    .where(col("meta")["sample_rate_hz"] >= 8000)   # skip degraded captures
    .with_column("transcript", transcriber.transcribe(col("file")))
)
```

Same principle as file_path() in Week 2: cheap operations narrow the set, expensive operations run on the survivors.
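The same metadata column also answers corpus-level questions before you commit to a transcription run — how many hours of audio, at what quality. A sketch, assuming daft's standard groupby/agg aggregations:

```python
# Survey the corpus first: total audio per sample rate, file counts.
stats = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("meta", audio_metadata(col("file")))
    .with_column("sample_rate_hz", col("meta")["sample_rate_hz"])
    .with_column("duration_seconds", col("meta")["duration_seconds"])
    .groupby("sample_rate_hz")
    .agg(
        col("duration_seconds").sum().alias("total_seconds"),
        col("duration_seconds").count().alias("num_files"),
    )
)
stats.show()
```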
If you need a specific sample rate, resample() converts before transcription:
```python
df = (
    daft.from_files("s3://calls/**/*.mp3")
    .with_column("resampled", col("file").cast(daft.AudioFile).resample(16_000))
    .with_column("transcript", transcriber.transcribe(col("resampled")))
)
```
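To sanity-check the conversion, you can read metadata back off the resampled column. A sketch, assuming audio_metadata() applies to the resampled AudioFile the same way it does to the source file:

```python
# Verify the resample: every row should report 16000 Hz.
check = (
    df
    .with_column("resampled_meta", audio_metadata(col("resampled")))
    .select(col("resampled_meta")["sample_rate_hz"])
)
check.show()
```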
From transcripts to insights

Once you have text, the pipeline extends naturally. Summaries, translations, and embeddings are all added as DataFrame columns — same syntax, no new infrastructure:
```python
from daft.functions import prompt, format as daft_format

df_enriched = df.with_column(
    "summary",
    prompt(
        daft_format("Summarize in 2 sentences: {}", col("transcript")),
        model="openai/gpt-4o-mini",
    ),
)
```

The full pattern — transcribe, summarize, embed, query — is in the Voice AI Analytics blog post.
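Embeddings follow the same shape. A sketch using daft's embed_text function — the exact signature, provider, and model name depend on your daft version and configuration, so treat these specifics as assumptions:

```python
from daft.functions import embed_text

df_vectors = df_enriched.with_column(
    "embedding",
    embed_text(
        col("transcript"),
        provider="openai",                # assumed provider
        model="text-embedding-3-small",   # assumed model name
    ),
)
```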
Reference examples in daft-examples/examples/files:
- daft_audiofile.py — audio_metadata(), resample()
- daft_audiofile_udf.py — byte-range seeks for streaming audio

