August 6, 2025

Processing 300K Images Without OOM

A Streaming Solution

by Colin Ho

Images in modern applications come in all shapes and sizes. They may be uploaded from an iPhone, scraped from the web as PNGs or JPEGs, or arrive in 4K resolution from professional cameras.

However, most downstream systems require consistently sized images. AI model training and inference demand standardized dimensions like 3 x 256 x 256, and web interfaces need consistent thumbnail sizing.

The solution appears deceptively simple: download images from their URLs, resize them to target dimensions, and save the results. But try to do this on hundreds of thousands of images and the naive approach quickly becomes a nightmare of memory and compute management.

The Naive Python Approach

Most engineers start with the obvious Python solution—a function to download and resize a single image, executed in a loop over image URLs. Here's what that typically looks like:

import requests
from io import BytesIO
from PIL import Image

def process_image(url):
    # Download image
    response = requests.get(url)
    image = Image.open(BytesIO(response.content))

    # Resize
    resized = image.resize((256, 256))

    # Save result
    resized.save(f"processed_{hash(url)}.jpg")

# Process all images
for url in image_urls:
    process_image(url)

This approach works for small datasets, but reveals critical flaws at larger scale:

Performance Problems:

  • Poor parallelization - only one CPU core is used, and even that core is barely utilized

  • Network I/O bottlenecks - single-threaded execution wastes time waiting

  • Memory usage - naively keeping everything in memory will eventually lead to OOM

The Hidden Memory Explosion

The real problem lies in how image processing fundamentally works. Understanding this memory inflation is crucial for building scalable solutions.

Here's what happens when you process a single image:

  1. Start with a URL (<1KB)

  2. Download into a 500KB JPEG file (~500x inflation)

  3. Decode for processing: becomes 3-12MB of uncompressed pixel data (6-24x inflation)

  4. Resize operation: image libraries create temporary buffers, doubling memory usage to 24MB

  5. Final encoding: back down to ~200KB

The peak memory usage is the killer. Each image temporarily needs up to 48 times its original file size during processing. If those peaks all land at the same time, you're looking at 24GB of memory for just 1,000 images. Most machines can't handle this, and cloud instances that can become prohibitively expensive.
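To make those numbers concrete, here's a rough back-of-the-envelope estimate in Python. The photo dimensions and the 2x resize buffer are illustrative assumptions, not measurements:

# Back-of-the-envelope memory estimate for one image (illustrative assumptions).
jpeg_bytes = 500_000                           # ~500KB compressed JPEG on disk
width, height, channels = 2000, 2000, 3        # assumed photo dimensions
decoded_bytes = width * height * channels      # ~12MB of raw pixel data once decoded
resize_peak_bytes = decoded_bytes * 2          # assume resize holds source + temporary buffers: ~24MB

inflation = resize_peak_bytes / jpeg_bytes     # roughly 48x the original file size
print(f"Peak per image: {resize_peak_bytes / 1e6:.0f} MB ({inflation:.0f}x the JPEG)")
print(f"1,000 images peaking together: {1000 * resize_peak_bytes / 1e9:.0f} GB")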

Manual Batching

A simple solution is to batch up the work: break it into smaller chunks and distribute them across multiple threads or machines. Engineers spin up clusters, distribute batches across nodes, and hope they've guessed the right batch size.
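A hand-rolled version typically looks something like the sketch below, reusing process_image from earlier. The batch size and worker count are guesses you would have to re-tune for every dataset and machine:

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 64    # guessed: too large risks OOM, too small leaves cores and network idle
MAX_WORKERS = 8    # guessed: per-machine parallelism

def process_batch(urls):
    # Download and resize one batch at a time to cap peak memory
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(process_image, urls))

for start in range(0, len(image_urls), BATCH_SIZE):
    process_batch(image_urls[start:start + BATCH_SIZE])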

This creates several challenges:

  • Manual tuning required for different datasets and image sizes

  • Resource waste when batches are too small

  • Memory crashes when batches are too large

  • Operational complexity managing clusters and work distribution

  • Error-prone scaling requiring constant adjustment

The approach works, but it's fragile and requires expertise that could be better spent on core business logic.

Daft's Streaming Solution

The same image download-and-resize script we wrote in plain Python can be expressed in a few lines of Daft code.

import daft

# Download images
df = daft.read_parquet(
    "s3://daft-public-data/open-images/validation-images-parquet-8x-64parts"
)
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize
df = df.with_column("resized", df["image"].image.resize(256, 256))

# Save results
df = df.with_column("encoded", df["resized"].image.encode("JPEG"))
df = df.with_column("path", df["path"].url.upload("processed_images"))

# Preview results
df.show()

Daft natively supports working with URLs and images, so you don't have to think about which other libraries to import for these operations.

With Daft, you also get the batching and parallelism you previously had to tune by hand, for free. Daft's streaming execution engine automatically determines batch sizes based on the operation and distributes work across native Rust threads, maximizing resource utilization while keeping memory usage steady.

Behind the scenes, Daft pipelines your image processing operations to give you efficient resource utilization across memory, CPU and I/O:

  • Download a batch of images concurrently.

  • Process and resize a batch of images.

  • Save results to disk.

  • Repeat until complete.

All of these operations are running concurrently in a pipelined fashion.
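As a toy, single-process illustration of the streaming idea (not Daft's actual engine, which runs these stages concurrently on native Rust threads), you can picture a chain of generators where only a handful of batches are ever materialized at once:

import requests
from io import BytesIO
from PIL import Image

def batched(urls, size=64):
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def download(batches):
    for batch in batches:
        yield [requests.get(url).content for url in batch]

def resize(batches):
    for batch in batches:
        # convert("RGB") normalizes the pixel mode so JPEG encoding works
        yield [Image.open(BytesIO(data)).convert("RGB").resize((256, 256)) for data in batch]

# Each batch flows through download -> resize -> save, then its memory is freed.
count = 0
for batch in resize(download(batched(image_urls))):
    for image in batch:
        image.save(f"processed_{count}.jpg")
        count += 1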

Results on a single 16-core machine:

  • 100 images processed in 4 seconds (vs. 17 seconds with the naive approach)

  • Stable memory usage throughout

  • High CPU and I/O efficiency across all cores

  • No manual tuning required

Scaling to Distributed: Same Code, Massive Scale

The real power of Daft's approach becomes apparent when scaling to distributed processing. The same code that provides streaming execution locally scales across entire clusters with minimal changes.

To process our full dataset of 300,000+ images across a cluster:

# Same processing code as above, just point to a cluster
import daft
daft.context.set_runner_ray("ray://...")

df = daft.read_parquet(
    "s3://daft-public-data/open-images/validation-images-parquet-8x-64parts"
)

df = df.with_column("image", df["path"].url.download().image.decode())
df = df.with_column("resized", df["image"].image.resize(256, 256))
df = df.with_column("encoded", df["resized"].image.encode("JPEG"))
df = df.with_column("path", df["path"].url.upload("processed_images"))

# Collect to run on the full dataset
df.collect()

Results across the distributed cluster:

  • 300,000+ images processed in under 30 seconds

  • Automatic work distribution across all nodes

  • No manual batch size tuning

  • Consistent memory usage per node

  • Built-in fault tolerance

Need to process a million images? Scale to 16 nodes. Ten million images? Scale to 100 nodes. The beauty of Daft's approach is that it scales linearly with your throughput requirements, with no code rewrites and no wrangling with partition sizes.

Beyond Images

While we've focused on images in this post, Daft's streaming approach works across all multimodal data types and use cases.

  • Documents: PDF processing, text extraction, format conversion 

  • Videos: Frame extraction, transcoding, thumbnail generation

  • Audio: Format conversion, feature extraction, transcription

Check out our tutorial on Document Processing: Notebook and Tutorial Video

The same code patterns that solved our image processing problem scale naturally to these other domains.
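As a rough sketch of that same pattern applied to documents: the snippet below downloads PDFs and extracts their text with a Daft UDF. It assumes pypdf is installed, and the bucket path and column names are hypothetical:

import io
import daft
from pypdf import PdfReader

@daft.udf(return_dtype=daft.DataType.string())
def extract_text(pdf_bytes):
    # pdf_bytes is a Daft Series of downloaded PDF contents
    return [
        "\n".join(page.extract_text() or "" for page in PdfReader(io.BytesIO(data)).pages)
        for data in pdf_bytes.to_pylist()
    ]

df = daft.from_pydict({"path": ["s3://my-bucket/docs/report.pdf"]})  # hypothetical paths
df = df.with_column("content", df["path"].url.download())
df = df.with_column("text", extract_text(df["content"]))
df.show()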

Why This Matters

Multimodal data shouldn’t be hard. Modern data processing frameworks should handle these complexities automatically, letting engineers focus on business logic rather than infrastructure concerns.

Our motto at Daft is less OOM and more zoom! Try us out, and let us know how it goes :)

pip install daft


Get updates, contribute code, or say hi on the Daft GitHub Discussions forum or the Distributed Data Community Slack.