
Embedding Millions of Text Documents With Qwen3 at Near-100% GPU Utilization
by Desmond Cheong

We recently used Qwen3-Embedding-0.6B to embed millions of text documents while sustaining near-100% GPU utilization the whole way.
That’s usually the gold standard that machine learning engineers aim for… but here’s the twist: in the time it took to write this blog post, we found a way to make the same workload 3× faster, and it didn’t involve maxing out GPU utilization at all. That story’s for another post, but first, here’s the recipe that got us to near-100%.

The workload
Here at the Daft kitchen, the same order keeps coming in: “One fast, painless pipeline to get my documents into a vector database for retrieval!”
Heard.
We whipped up a sample workload that:
1. Reads millions of text documents from S3
2. Chunks them into sentences using spaCy
3. Computes embeddings with the state-of-the-art model Qwen3-Embedding-0.6B
4. Writes the results to turbopuffer
Mise en place
Before starting, let’s install the required dependencies:
```bash
pip install "daft[ray]" turbopuffer torch sentence-transformers spacy accelerate transformers
python -m spacy download en_core_web_sm
```
You’ll also need to configure access to the object store you’ll be reading data from. We prepared a sample dataset on AWS S3.
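The pipeline below reads AWS credentials from the environment via `daft.io.S3Config.from_env()`, so exporting the standard `AWS_*` environment variables is usually enough. If you'd rather pass credentials explicitly when reading from your own bucket, here's a hedged sketch (the bucket name and credential values are placeholders):

```python
import daft

# Explicit S3 credentials as an alternative to S3Config.from_env().
# The key/secret values below are placeholders.
io_config = daft.io.IOConfig(
    s3=daft.io.S3Config(
        region_name="us-west-2",
        key_id="<your-access-key-id>",
        access_key="<your-secret-access-key>",
    )
)

df = daft.read_parquet("s3://<your-bucket>/your-dataset.parquet", io_config=io_config)
```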
Import Dependencies and Configure Constants
We’ll then set the workload parameters:
```python
import torch
import daft
from daft import col

NUM_GPU_NODES = 8                       # GPU nodes in your cluster
NLP_MODEL_NAME = "en_core_web_sm"       # spaCy model for sentence detection
CHUNKING_PARALLELISM = 8                # Parallel chunking processes per node
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"  # Text embedding model
ENCODING_DIM = 1024                     # Embedding dimensions
BATCH_SIZE = 512                        # Records per embedding batch
SENTENCE_TRANSFORMER_BATCH_SIZE = 16    # GPU batch size for embeddings
```
These parameters control resource allocation and processing efficiency. Adjust `NUM_GPU_NODES` based on your cluster size, and modify the batch sizes based on your data and available GPU memory.
Step 1: Chunk Text
When creating embeddings, it's useful to split your text into meaningful chunks. Text is hierarchical and can be broken down at different levels: Document → Sections → Paragraphs → Sentences → Words → Characters. The chunking strategy to use depends on your use case.
Chunking Strategies
- Sentence-level chunking works well for most use cases, especially when the document structure is unclear or inconsistent.
- Paragraph-level chunking is good for RAG (Retrieval-Augmented Generation) applications where maintaining context across sentences is important.
- Section-level chunking is useful for long documents that have clear structural divisions.
- Fixed-size chunks are simple to implement but may break semantic meaning at arbitrary boundaries.
When to Use Each Approach
- Sentence splitting is the default choice when you're unsure about the document structure or when working with diverse content types.
- Paragraph splitting is preferred for RAG systems where maintaining context across multiple sentences matters for retrieval quality (a simple paragraph splitter is sketched below).
- Custom splitting is necessary for specialized content like tweets, text messages, or code that doesn't follow standard paragraph structures.
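For contrast with the sentence-level approach used in the rest of this post, here's what a minimal paragraph-level splitter might look like. This is a hedged sketch, not part of the pipeline below; it simply splits on blank lines:

```python
def split_paragraphs(text: str) -> list[dict]:
    """Naive paragraph-level chunking: split on blank lines."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [{"text": p, "chunk_id": i} for i, p in enumerate(paragraphs)]

# split_paragraphs("First paragraph.\n\nSecond paragraph.")
# -> [{'text': 'First paragraph.', 'chunk_id': 0},
#     {'text': 'Second paragraph.', 'chunk_id': 1}]
```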
Implementation
We'll use sentence-level chunking in this example. To do the splitting, we'll use spaCy, a natural language processing library whose sentence boundary detection handles edge cases better than simple punctuation-based splitting.
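To see the difference in action, here's a quick standalone check (a minimal sketch using the same `en_core_web_sm` model; the sample text is made up): a naive split on periods would break abbreviations and decimals apart, while spaCy's segmenter generally keeps them intact.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith paid $9.99 for the report. He read it at 3 p.m. and filed it away.")

# spaCy's sentence segmenter treats "Dr." and "$9.99" as part of their
# sentences rather than splitting on every period.
print([sentence.text for sentence in doc.sents])
```

With that working, here's the chunking UDF we'll use in the pipeline: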
```python
# Define the return type for chunked text
# Here we'll keep both the chunked text and the chunk ID which we'll later use for creating IDs for the sentences
chunked_type = daft.DataType.list(
    daft.DataType.struct({
        "text": daft.DataType.string(),
        "chunk_id": daft.DataType.int32()
    })
)

@daft.udf(
    return_dtype=chunked_type,
    concurrency=NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1),
    batch_size=BATCH_SIZE // CHUNKING_PARALLELISM // 2
)
class ChunkingUDF:
    def __init__(self):
        import spacy
        self.nlp = spacy.load(NLP_MODEL_NAME)

    def __call__(self, text_col):
        results = []
        for text in text_col:
            doc = self.nlp(text)
            sentence_texts = [
                {"text": sentence.text, "chunk_id": i}
                for i, sentence in enumerate(doc.sents)
            ]
            results.append(sentence_texts)
        return results
```
This User-Defined Function (UDF):

- Loads the spaCy model once per UDF instance during initialization for efficiency
- Processes batches of text (`text_col`) to minimize overhead
- Returns a list of sentence chunks with unique chunk IDs
- Runs multiple instances in parallel (concurrency = NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1) = 72 instances with the defaults above) for distributed processing
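Before scaling out, it can help to smoke-test the chunking UDF on a tiny in-memory DataFrame. Here's a minimal sketch (the sample rows are placeholders, and note that the UDF's concurrency setting above is sized for a cluster rather than a laptop):

```python
# Quick local check of ChunkingUDF on a two-row DataFrame.
sample = daft.from_pydict({
    "text": [
        "Daft is a distributed dataframe library. It runs on Ray.",
        "Embeddings power retrieval. Chunking comes first.",
    ]
})

(
    sample
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")                                  # one row per sentence
    .with_column("chunk_text", col("sentences")["text"])   # pull fields out of the struct
    .with_column("chunk_id", col("sentences")["chunk_id"])
    .exclude("sentences")
    .show()
)
```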
Step 2: GPU-Accelerated Embedding Generation
Choosing a Text Embedding Model
The quality of your embeddings depends heavily on the model you choose. Here are some key considerations:
Model Performance
- MTEB Leaderboard: Check the Massive Text Embedding Benchmark (MTEB) leaderboard for the latest performance rankings across various tasks
- Task-specific performance: Different models excel at different tasks (semantic search, clustering, classification, etc.)
- Multilingual support: Consider if you need to process text in multiple languages
- Language-specific tasks: If you only need to support a single language, it could be helpful to look at model performance for that specific language instead of multilingual benchmarks
Some Popular Models
- Qwen3-Embedding-0.6B: Good performance-to-size ratio, state-of-the-art, used in this example
- all-MiniLM-L6-v2: The default used in the Sentence Transformers documentation, often used in tutorials
- gemini-embedding-001: The current top multilingual model on MTEB. Requires Gemini API access
- Seed1.6-Embedding: The current top model on the Chinese MTEB leaderboard. Requires Volcengine API access
With open models available on HuggingFace, you can easily swap models by changing the `EMBEDDING_MODEL_NAME` constant in the code below.
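For example, to try a smaller model you could change the two constants defined earlier. A hedged sketch (note that all-MiniLM-L6-v2 produces 384-dimensional vectors, so `ENCODING_DIM` has to change along with the model name):

```python
# Swap in a smaller Sentence Transformers model.
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
ENCODING_DIM = 384  # all-MiniLM-L6-v2 outputs 384-dimensional embeddings
```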
We'll create a UDF to generate embeddings from the chunked text:
```python
# Define the return type for embeddings
embedding_type = daft.DataType.embedding(daft.DataType.float32(), ENCODING_DIM)

@daft.udf(
    return_dtype=embedding_type,
    concurrency=NUM_GPU_NODES,
    num_gpus=1,
    batch_size=BATCH_SIZE
)
class EncodingUDF:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        self.model.compile()

    def __call__(self, text_col):
        embeddings = self.model.encode(
            text_col.to_pylist(),
            batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE,
            convert_to_tensor=True,
            torch_dtype=torch.bfloat16,
        )
        return embeddings.cpu().numpy()
```
This UDF:

- Loads the SentenceTransformer model on GPU if available
- Uses `bfloat16` precision to reduce memory usage
- Processes text in batches (`SENTENCE_TRANSFORMER_BATCH_SIZE = 16`) for optimal GPU utilization
- Returns numpy arrays, which are compatible with Daft
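If you want to confirm the model and embedding dimension before wiring up the full pipeline, a quick standalone check could look like this (a minimal sketch; it downloads the model on first use and also runs on CPU, just slowly):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
vectors = model.encode(["A quick sanity-check sentence."])
print(vectors.shape)  # expect (1, 1024), matching ENCODING_DIM
```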
Step 3: Configure Distributed Processing
You can run this script locally, but if you're interested in running this pipeline on a cluster, check out our guide on scaling up. In this example, we ran on a Ray cluster with 8 g5.2xlarge workers (each with an A10G GPU). To configure our Daft script to use the Ray cluster, we added:
```python
# Configure Daft to use Ray to schedule work on different worker nodes
daft.context.set_runner_ray()

# Configure S3 access for reading data
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config.from_env()
    )
)
```
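`set_runner_ray()` with no arguments uses the Ray cluster available in the current environment (for example, when the script is submitted as a Ray job). If you instead want to attach to a remote cluster by address, you can pass one in; a hedged sketch with a placeholder address:

```python
# Attach to an existing Ray cluster by address (placeholder host shown).
daft.context.set_runner_ray(address="ray://<head-node-host>:10001")
```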
Step 4: Execute the Pipeline
Now we'll execute the complete data processing pipeline:
```python
(
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("chunk_id", col("sentences")["chunk_id"])
    .exclude("sentences")
    .with_column("embedding", EncodingUDF(col("text")))
    .with_column(
        "id",
        col("url").str.right(50) + "-" + col("chunk_id").cast(daft.DataType.string())
    )
    .select("id", "url", "language", "source", "text", "embedding")
    .write_turbopuffer(
        namespace="desmond-scale-experiment6",
        region="aws-us-west-2",
        id_column="id",
        vector_column="embedding",
        distance_metric="cosine_distance"
    )
)
```
Pipeline steps explained:
1. Read data: Load Parquet files from S3 with a large chunk size for efficiency
2. Chunk text: Apply the sentence splitting UDF
3. Explode: Flatten the list of sentences into separate rows
4. Extract fields: Get text and chunk_id from the sentence structs
5. Generate embeddings: Apply the embedding UDF to the text
6. Create IDs: Generate unique IDs combining the URL and chunk_id
7. Select columns: Keep only the necessary columns
8. Write to Turbopuffer: Store data and vectors in Turbopuffer
If all goes well when you run this script on your cluster, you should see network I/O, CPU work, and GPU work pipelined to run in parallel, along with high GPU utilization :)
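Before kicking off the full multi-million-document run, it can be worth doing a dry run on a small slice and writing Parquet instead of turbopuffer, so no vector database credentials are needed. A minimal sketch (the output path is a placeholder):

```python
# Dry run: embed a 1,000-document sample and write Parquet instead of turbopuffer.
(
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .limit(1000)
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("chunk_id", col("sentences")["chunk_id"])
    .exclude("sentences")
    .with_column("embedding", EncodingUDF(col("text")))
    .write_parquet("s3://<your-bucket>/embedding-dry-run/")
)
```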
Customization Tips
- Adjust batch sizes: Increase `SENTENCE_TRANSFORMER_BATCH_SIZE` for better throughput, decrease it for lower GPU memory usage
- Scale workers: Modify `NUM_GPU_NODES` and `CHUNKING_PARALLELISM` based on your cluster size and the cores available per node
- Change models: Replace `EMBEDDING_MODEL_NAME` with other SentenceTransformer models
- Different chunking: Modify `ChunkingUDF` to use different text chunking strategies
- Alternative vector databases: Replace the turbopuffer sink with other vector databases like Lance, Pinecone, or Chroma (see the sketch after this list)
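As one concrete example of an alternative sink, Daft can also write Lance datasets. This is a hedged sketch rather than part of the original recipe; it assumes the Lance Python package is installed and uses a placeholder URI:

```python
# Assuming `df` is the pipeline above, minus the final .write_turbopuffer(...) call:
df.write_lance("s3://<your-bucket>/embeddings.lance")  # placeholder output URI
```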
Performance Considerations
- GPU memory: Monitor GPU memory usage and adjust batch sizes accordingly. If your GPUs fail to allocate sufficient memory or you exceed the max sequence length of your embedding model, `SENTENCE_TRANSFORMER_BATCH_SIZE` may be too large
- Model loading: UDFs load models once per worker, so initialization time is amortized
- Reduced precision: Use `bfloat16` or `float16` for lower GPU memory utilization and higher throughput (see the sketch after this list)
This pipeline can efficiently process millions of text documents while automatically scaling across your available compute resources.
What’s next on the menu?
With this recipe, we hit near-100% GPU utilization—a benchmark that’s the holy grail for many.
But the Daft kitchen never stops cooking. Since then, we’ve been experimenting with new ingredients and techniques—custom GPU pipelining, swapping Sentence Transformers for vLLM—that have made the whole meal cook 3× faster.
We’re still plating that next dish, and trust us, it’s worth the wait. Keep an eye out for the upcoming blog where we’ll share how we turned up the heat and pushed throughput beyond the peak-utilization grind.
Until then, happy embedding! And remember: we don’t sell the GPUs, we sell the sizzle.
Get started today: pip install daft