August 13, 2025

Embedding Millions of Text Documents With Qwen3

Near-100% GPU Utilization

by Desmond Cheong

We recently used Qwen3-Embedding-0.6B to embed millions of text documents while sustaining near-100% GPU utilization the whole way.

That’s usually the gold standard that machine learning engineers aim for… but here’s the twist: in the time it took to write this blog post, we found a way to make the same workload 3× faster, and it didn’t involve maxing out GPU utilization at all. That story’s for another post, but first, here’s the recipe that got us to near-100%.

The workload

Here at the Daft kitchen, the same order keeps coming in: “One fast, painless pipeline to get my documents into a vector database for retrieval!”

Heard.

We whipped up a sample workload that:

  1. Reads millions of text documents from S3

  2. Chunks them into sentences using spaCy

  3. Computes embeddings with the state-of-the-art model Qwen3-Embedding-0.6B

  4. Writes the results to turbopuffer

Mise en place

Before starting, let’s install the required dependencies:

pip install "daft[ray]" turbopuffer torch sentence-transformers spacy accelerate transformers
python -m spacy download en_core_web_sm

You’ll also need to configure access to the object store you’ll be reading data from. We’ve prepared a sample dataset on AWS S3.

Import Dependencies and Configure Constants

We’ll then set the workload parameters:

import torch
import daft
from daft import col

NUM_GPU_NODES = 8  # GPU nodes in your cluster
NLP_MODEL_NAME = "en_core_web_sm"  # spaCy model for sentence detection
CHUNKING_PARALLELISM = 8  # Parallel chunking processes per node
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"  # Text embedding model
ENCODING_DIM = 1024  # Embedding dimensions
BATCH_SIZE = 512  # Records per embedding batch
SENTENCE_TRANSFORMER_BATCH_SIZE = 16  # GPU batch size for embeddings

These parameters control resource allocation and processing efficiency. Adjust NUM_GPU_NODES based on your cluster size, and modify batch sizes based on your data and available GPU memory.

Step 1: Chunk Text

When creating embeddings, it's useful to split your text into meaningful chunks. Text is hierarchical and can be broken down at different levels: Document → Sections → Paragraphs → Sentences → Words → Characters. The chunking strategy to use depends on your use case.

Chunking Strategies

  • Sentence-level chunking works well for most use cases, especially when the document structure is unclear or inconsistent.

  • Paragraph-level chunking is good for RAG (Retrieval-Augmented Generation) applications where maintaining context across sentences is important.

  • Section-level chunking is useful for long documents that have clear structural divisions.

  • Fixed-size chunks are simple to implement but may break semantic meaning at arbitrary boundaries.

When to Use Each Approach

  • Sentence splitting is the default choice when you're unsure about the document structure or when working with diverse content types.

  • Paragraph splitting is preferred for RAG systems where maintaining context across multiple sentences matters for retrieval quality (a minimal sketch follows this list).

  • Custom splitting is necessary for specialized content like tweets, text messages, or code that doesn't follow standard paragraph structures.
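For comparison, paragraph-level chunking doesn't even need a library; here's a minimal, illustrative sketch that splits on blank lines (not used in the rest of this post):

def paragraph_chunks(text: str) -> list[str]:
    # Split on blank lines and drop empty fragments.
    return [p.strip() for p in text.split("\n\n") if p.strip()]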

Implementation

We'll use sentence-level chunking in this example.

We'll also use spaCy, a natural language processing library whose sentence boundary detection handles edge cases better than simple punctuation-based splitting.

# Define the return type for chunked text.
# We keep both the chunked text and the chunk ID, which we'll later use for creating IDs for the sentences.
chunked_type = daft.DataType.list(
    daft.DataType.struct({
        "text": daft.DataType.string(),
        "chunk_id": daft.DataType.int32(),
    })
)


@daft.udf(
    return_dtype=chunked_type,
    concurrency=NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1),
    batch_size=BATCH_SIZE // CHUNKING_PARALLELISM // 2,
)
class ChunkingUDF:
    def __init__(self):
        import spacy

        self.nlp = spacy.load(NLP_MODEL_NAME)

    def __call__(self, text_col):
        results = []
        for text in text_col:
            doc = self.nlp(text)
            sentence_texts = [
                {"text": sentence.text, "chunk_id": i}
                for i, sentence in enumerate(doc.sents)
            ]
            results.append(sentence_texts)
        return results

This User-Defined Function (UDF):

  • Loads the spaCy model once per UDF during initialization for efficiency

  • Processes batches of text (text_col) to minimize overhead

  • Returns a list of sentence chunks with unique chunk IDs

  • Runs multiple instances in parallel (NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1) = 72 total instances) for distributed processing
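Before scaling out, it can help to smoke-test the UDF on a couple of in-memory documents. A minimal sketch, assuming the definitions above and a single machine (note the concurrency setting will still spin up multiple chunking workers locally):

# Quick local check of the chunking output (sample text is illustrative).
sample = daft.from_pydict({"text": ["Daft is fast. It also scales.", "One more document."]})
sample.with_column("sentences", ChunkingUDF(col("text"))).show()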

Step 2: GPU-Accelerated Embedding Generation

Choosing a Text Embedding Model

The quality of your embeddings depends heavily on the model you choose. Here are some key considerations:

Model Performance

  • MTEB Leaderboard: Check the Massive Text Embedding Benchmark (MTEB) leaderboard for the latest performance rankings across various tasks

  • Task-specific performance: Different models excel at different tasks (semantic search, clustering, classification, etc.)

  • Multilingual support: Consider if you need to process text in multiple languages

  • Language-specific tasks: If you only need to support a single language, it could be helpful to look at model performance for that specific language instead of multilingual benchmarks

Some Popular Models

  • Qwen3-Embedding-0.6B: Good performance-to-size ratio, state-of-the-art, used in this example

  • all-MiniLM-L6-v2: The default used in the Sentence Transformers documentation, often used in tutorials

  • gemini-embedding-001: The current top multilingual model on MTEB. Requires Gemini API access

  • Seed1.6-Embedding: The current top model on the Chinese MTEB leaderboard. Requires Volcengine API access

With open models available on HuggingFace, you can easily swap models by changing the EMBEDDING_MODEL_NAME constant in the code below.
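For example, switching to all-MiniLM-L6-v2 only touches the constants; note that its embeddings are 384-dimensional, so ENCODING_DIM has to change with it:

# Example swap: all-MiniLM-L6-v2 produces 384-dimensional embeddings.
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
ENCODING_DIM = 384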

We'll create a UDF to generate embeddings from the chunked text:

# Define the return type for embeddings
embedding_type = daft.DataType.embedding(daft.DataType.float32(), ENCODING_DIM)


@daft.udf(
    return_dtype=embedding_type,
    concurrency=NUM_GPU_NODES,
    num_gpus=1,
    batch_size=BATCH_SIZE,
)
class EncodingUDF:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        self.model.compile()

    def __call__(self, text_col):
        embeddings = self.model.encode(
            text_col.to_pylist(),
            batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE,
            convert_to_tensor=True,
            torch_dtype=torch.bfloat16,
        )
        return embeddings.cpu().numpy()

This UDF:

  • Loads the SentenceTransformer model on GPU if available

  • Uses bfloat16 precision to reduce memory usage

  • Processes text in batches (SENTENCE_TRANSFORMER_BATCH_SIZE = 16) for optimal GPU utilization

  • Returns numpy arrays which are compatible with Daft
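Before launching the full job, a quick one-off check (outside the UDF) can confirm that the model's output dimension matches ENCODING_DIM; a minimal sketch:

from sentence_transformers import SentenceTransformer

# One-off sanity check: Qwen3-Embedding-0.6B produces 1024-dimensional embeddings.
model = SentenceTransformer(EMBEDDING_MODEL_NAME)
assert model.get_sentence_embedding_dimension() == ENCODING_DIM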

Step 3: Configure Distributed Processing

You can run this script locally, but if you're interested in running this pipeline on a cluster, check out our guide on scaling up. In this example, we ran on a Ray cluster with 8 g5.2xlarge workers (each comes with an A10G GPU). To configure our Daft script to use the Ray cluster, we added:

# Configure Daft to use Ray to schedule work on different worker nodes
daft.context.set_runner_ray()

# Configure S3 access for reading data
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config.from_env()
    )
)
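If you've already brought up a Ray cluster (for example with ray up), you can point Daft at it explicitly instead of letting it start a local one. A sketch, with a placeholder address:

# Connect to an existing Ray cluster; the address below is a placeholder.
daft.context.set_runner_ray(address="ray://<head-node-ip>:10001")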

Step 4: Execute the Pipeline

Now we'll execute the complete data processing pipeline:

(
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("chunk_id", col("sentences")["chunk_id"])
    .exclude("sentences")
    .with_column("embedding", EncodingUDF(col("text")))
    .with_column(
        "id",
        col("url").str.right(50) + "-" + col("chunk_id").cast(daft.DataType.string())
    )
    .select("id", "url", "language", "source", "text", "embedding")
    .write_turbopuffer(
        namespace="desmond-scale-experiment6",
        region="aws-us-west-2",
        id_column="id",
        vector_column="embedding",
        distance_metric="cosine_distance"
    )
)

Pipeline steps explained:

  1. Read data: Load Parquet files from S3 with large chunk size for efficiency

  2. Chunk text: Apply sentence splitting UDF

  3. Explode: Flatten the list of sentences into separate rows

  4. Extract fields: Get text and chunk_id from the sentence structs

  5. Generate embeddings: Apply embedding UDF to text

  6. Create IDs: Generate unique IDs combining URL and chunk_id

  7. Select columns: Keep only the necessary columns

  8. Write to Turbopuffer: Store data and vectors in Turbopuffer

If all goes well, when you run this script on your cluster you should see network I/O, CPU work, and GPU work pipelined to run in parallel, along with high GPU utilization :)
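If you'd rather not commit the whole cluster on the first try, you can validate the pipeline on a small slice of the data first. A sketch, assuming the UDFs defined above:

# Dry run: process only a small sample and inspect the results before the full job.
sample_df = (
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .limit(1000)
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("embedding", EncodingUDF(col("text")))
)
sample_df.show(5)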

Customization Tips

  • Adjust batch sizes: Increase SENTENCE_TRANSFORMER_BATCH_SIZE for better throughput, decrease for lower GPU memory usage

  • Scale workers: Modify NUM_GPU_NODES and CHUNKING_PARALLELISM based on your cluster size and cores available per node

  • Change models: Replace EMBEDDING_MODEL_NAME with other SentenceTransformer models

  • Different chunking: Modify ChunkingUDF to use different text chunking strategies

  • Alternative vector databases: Replace with other vector databases like Lance, Pinecone, or Chroma

Performance Considerations

  • GPU memory: Monitor GPU memory usage and adjust batch sizes accordingly. If your GPUs fail to allocate sufficient memory, or your chunks run up against the embedding model's max sequence length, SENTENCE_TRANSFORMER_BATCH_SIZE may be too large (see the sketch after this list)

  • Model loading: UDFs load models once per worker, so initialization time is amortized

  • Lower precision: Use bfloat16 or float16 for lower GPU memory utilization and higher throughput.
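As a concrete example of the sequence-length knob mentioned above, Sentence Transformers lets you cap the number of tokens per input, which bounds per-batch GPU memory; a sketch (the cap of 512 is illustrative):

from sentence_transformers import SentenceTransformer

# Cap tokens per chunk; longer inputs are truncated, which bounds GPU memory per batch.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
model.max_seq_length = 512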

This pipeline can efficiently process millions of text documents while automatically scaling across your available compute resources.

What’s next on the menu?

With this recipe, we hit near-100% GPU utilization—a benchmark that’s the holy grail for many.

But the Daft kitchen never stops cooking. Since then, we’ve been experimenting with new ingredients and techniques—custom GPU pipelining, swapping Sentence Transformers for vLLM—that have made the whole meal cook 3× faster.

We’re still plating that next dish, and trust us, it’s worth the wait. Keep an eye out for the upcoming blog where we’ll share how we turned up the heat and pushed throughput beyond the peak-utilization grind.

Until then, happy embedding! And remember: we don’t sell the GPUs, we sell the sizzle.

Get started today: pip install daft
