
Processing 99% of U.S. Caselaw for Under $1 in the Common Pile
How Teraflop AI processed 7 million court documents and 40 million pages spanning 365 years of U.S. caselaw for under a dollar using Daft.
by Enrico Shippole, YK Sugi
Introduction
One of the most impactful areas of open-source software and artificial intelligence is the total democratization of legal tech and data. Teraflop AI and Eventual collaborated with EleutherAI, Vector Institute, Allen AI, Hugging Face, and the Data Provenance Initiative to support the release of the Common Pile, an 8 TB, one-trillion-token dataset of public domain and openly licensed text.
In collaboration with the Daft team at Eventual, Teraflop AI processed 99% of US precedential caselaw data from the Caselaw Access Project (CAP) and CourtListener (CL) using the highly efficient Daft dataframe library.
Dataset
The dataset consists largely of two sources: the Caselaw Access Project, from the Library Innovation Lab at Harvard Law School, and CourtListener, from the Free Law Project.
The CourtListener data can be obtained at: https://www.courtlistener.com/help/api/bulk-data/
The Caselaw Access Project data dumps can be found at: https://case.law/
CourtListener by the Free Law Project
The monthly CourtListener data dumps overlap, so you only need the latest one. Starting from a 54 GB bz2-compressed CSV file, we used the lbzip2 utility to decompress it in parallel across all cores on our system. This substantially improves decompression time and allowed us to decompress the entire file locally on a 6-CPU system in under an hour.
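In sketch form, that decompression step looks something like the snippet below. The file name is a placeholder for the CourtListener dump, and invoking lbzip2 through subprocess is just one way to script it.

```python
# Minimal sketch: parallel bz2 decompression with lbzip2.
# "opinions.csv.bz2" is a placeholder name for the CourtListener dump.
import os
import subprocess

subprocess.run(
    # -d decompress, -k keep the original archive, -n sets the worker thread count
    ["lbzip2", "-d", "-k", "-n", str(os.cpu_count()), "opinions.csv.bz2"],
    check=True,  # raise if lbzip2 exits non-zero
)
```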
The decompressed CSV file weighs in at 350 GB, which makes loading and processing it in memory inefficient and would require a large compute node to handle it directly. Additionally, many of the data dumps contain improperly formatted and delimited records.
To address these issues, we first stream through the CSV file, handling any improper delimiter formatting, and partition the data into smaller, snappy-compressed Parquet files of roughly 1 GB each. These partitioned files contain the core HTML documents and metadata for around 99% of US caselaw.
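The shape of that step is roughly the sketch below, which streams the CSV with PyArrow, skips rows with broken delimiters, and rotates snappy-compressed Parquet files at about 1 GB. The file names, the skip-only handling of malformed rows, and the rotation threshold are assumptions for illustration, not the exact production logic.

```python
# Hypothetical sketch: stream the 350 GB CSV and repartition it into ~1 GB
# snappy-compressed Parquet files, skipping rows with broken delimiters.
import os

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

SRC = "opinions.csv"      # placeholder name for the decompressed dump
TARGET_BYTES = 1 << 30    # rotate output files at roughly 1 GB (in-memory estimate)

os.makedirs("cl_parquet", exist_ok=True)

def handle_invalid_row(row):
    # Called by Arrow for rows with the wrong number of delimiters.
    return "skip"

reader = pacsv.open_csv(
    SRC,
    parse_options=pacsv.ParseOptions(invalid_row_handler=handle_invalid_row),
)

part, written, writer = 0, 0, None
for batch in reader:  # record batches are read incrementally, never the whole file
    if writer is None:
        writer = pq.ParquetWriter(
            f"cl_parquet/opinions-{part:05d}.parquet",
            batch.schema,
            compression="snappy",
        )
    writer.write_batch(batch)
    written += batch.nbytes
    if written >= TARGET_BYTES:  # start a new partition file
        writer.close()
        part, written, writer = part + 1, 0, None
if writer is not None:
    writer.close()
```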
How CourtListener is related to the Caselaw Access Project
The CourtListener data dumps contain the entirety of the Caselaw Access Project. The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions spanning the last 365 years; the CAP subset holds approximately 7 million US opinions, totaling around 80 GB of text data alone. On top of this, CourtListener adds over 2 million cases scraped from more than 2,000 courts.
Why we needed to access the CAP data directly
The CAP text documents inside CL are littered with page numbers and star-pagination markers, and the earlier extraction of the CAP data did not handle these star paginations well. To resolve this, we dropped the CAP subset from the CL Parquet files we had converted and obtained a corrected revision of the CAP data, with an adjusted extraction, directly from the source API. We downloaded and extracted all of the ZIP files from the CAP API; each ZIP file contains a folder of HTML documents holding the US caselaw. The raw HTML data is converted to snappy Parquet and staged temporarily for the next step.
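A minimal sketch of that staging step is shown below, assuming the ZIP files have already been downloaded into a local cap_zips/ directory. The paths and column names are illustrative placeholders.

```python
# Hypothetical sketch: unpack CAP ZIP archives of HTML opinions and stage the
# raw HTML as snappy Parquet for the parsing step.
import glob
import zipfile
from pathlib import Path

import daft

for zip_path in glob.glob("cap_zips/*.zip"):
    rows = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".html"):
                rows.append(
                    {
                        "source_zip": Path(zip_path).name,
                        "file_name": name,
                        "html": zf.read(name).decode("utf-8", errors="replace"),
                    }
                )
    # One Parquet directory per source archive keeps partitions small.
    daft.from_pylist(rows).write_parquet(
        f"cap_raw_html/{Path(zip_path).stem}", compression="snappy"
    )
```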
Parsing and extraction of CAP data
We utilize Daft UDFs and the Selectolax processing library to handle efficient HTML parsing and extraction. Selectolax reliably parses and extracts over 1,000 HTML documents per second on a few CPU cores, producing clean text. We parse and extract each of the HTML documents following the data processing outlined in COLD Cases by the Harvard Library Innovation Lab, which lets us target the corrected classes and tags in the HTML documents and avoid the star-pagination issue described above. The cleanly extracted texts are then saved to be rejoined to the full corpus later. You can find the processed CAP data here.
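A simplified version of that UDF is sketched below. The page-label selector and output columns are illustrative stand-ins for the COLD Cases tag and class handling, not the exact production rules.

```python
# Hypothetical sketch: a Daft UDF wrapping Selectolax for HTML-to-text extraction.
import daft
from daft import col
from selectolax.parser import HTMLParser


@daft.udf(return_dtype=daft.DataType.string())
def extract_opinion_text(html_series):
    texts = []
    for html in html_series.to_pylist():
        if not html:
            texts.append(None)
            continue
        tree = HTMLParser(html)
        # Drop star-pagination anchors so page markers don't leak into the text
        # ("a.page-label" is an illustrative selector).
        for node in tree.css("a.page-label"):
            node.decompose()
        texts.append(tree.text(separator="\n", strip=True))
    return texts


df = daft.read_parquet("cap_raw_html/*/*.parquet")
df = df.with_column("text", extract_opinion_text(col("html")))
df.select("file_name", "text").write_parquet("cap_text/", compression="snappy")
```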
Data cleaning and deduplication
Another issue with the CL dumps is that many of the columns contain the same document text extracted in multiple different ways. It is important to merge these columns in the correct order so that each document keeps a single copy of the text extracted with the best-in-class method. The CL documentation provides more information on the process. Daft's built-in coalesce functionality allows the columns to be merged in the appropriate order.
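Conceptually the merge looks like the sketch below. Daft's coalesce can be used directly; here the same precedence is spelled out with is_null and if_else so the merge order is explicit. The column names and their ordering are illustrative and should be checked against the CL documentation.

```python
# Hypothetical sketch: fold CourtListener's overlapping text columns into one,
# preferring the earliest non-null, non-empty column in the precedence list.
import daft
from daft import col


def first_non_empty(*names):
    # Build the fallback chain from right to left.
    expr = col(names[-1])
    for name in reversed(names[:-1]):
        c = col(name)
        expr = (c.is_null() | (c == "")).if_else(expr, c)
    return expr


df = daft.read_parquet("cl_parquet/*.parquet")
df = df.with_column(
    "best_text",
    first_non_empty(
        "html_with_citations",  # ordering is illustrative; see the CL docs
        "html_columbia",
        "html_lawbox",
        "html",
        "plain_text",
    ),
)
```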
A team of annotators worked to identify and label the different extraction issues present in the PDFs and HTML from CAP and CL. Once these issues were identified, custom regex rules were written to replace, remove, and clean the affected text. Daft has fast regex replacement built directly into the framework through regexp_replace. Once these edge cases were handled, Selectolax was also used to parse and extract the remaining CL documents. Text normalization was applied to fix broken Unicode characters, standardize whitespace and newlines, and remove other special characters. The newly cleaned CAP and CL subsets can then be easily and efficiently rejoined for further processing using Daft's SQL-like API.
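The sketch below shows the shape of that cleanup pass and the rejoin, using Daft's expression-level str.replace with regex=True as the counterpart of regexp_replace. The patterns are illustrative examples of the annotator-driven rules (for instance, stripping star-pagination markers like "*123"), the paths are placeholders, and a real pipeline would carry ID and metadata columns along with the text.

```python
# Hypothetical sketch of the cleanup pass and the CAP/CL rejoin.
# Regex patterns and paths are illustrative, not the production rule set.
import daft
from daft import col

cl = daft.read_parquet("cl_text/*.parquet")
cl = cl.with_column(
    "text",
    col("text")
    .str.replace(r"\*\d+", "", regex=True)        # drop star-pagination markers
    .str.replace(r"[ \t]+\n", "\n", regex=True)   # trailing whitespace before newlines
    .str.replace(r"\n{3,}", "\n\n", regex=True),  # collapse runs of blank lines
)

# Rejoin the cleaned CAP subset with the cleaned CL subset into one corpus.
cap = daft.read_parquet("cap_text/*.parquet").select("text")
corpus = cl.select("text").concat(cap)
corpus.write_parquet("corpus/", compression="snappy")
```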
Each of the clean text documents was hashed using the xxhash algorithm built into Daft. This allowed us to perform both exact deduplication and MinHash deduplication of the text documents.
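A sketch of both steps is shown below, assuming a recent Daft version that exposes hash and minhash expressions. The grouping keeps one arbitrary representative per exact hash, the MinHash parameters are illustrative, and the subsequent clustering of signatures (for example via LSH) is not shown.

```python
# Hypothetical sketch: exact dedup via Daft's built-in hash expression, plus
# MinHash signatures as the input to near-duplicate deduplication.
import daft
from daft import col

df = daft.read_parquet("corpus/*.parquet")

# Exact dedup: keep one representative row per hash of the cleaned text.
df = df.with_column("text_hash", col("text").hash())
exact = df.groupby("text_hash").agg(col("text").any_value())

# MinHash signatures for near-duplicate detection (parameters are illustrative).
fuzzy = exact.with_column(
    "minhash", col("text").minhash(num_hashes=128, ngram_size=5, seed=42)
)
fuzzy.write_parquet("deduped/", compression="snappy")
```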
Infrastructure and cost
To orchestrate the data processing pipelines above, we bootstrapped a Kubernetes cluster on low-cost bare-metal servers and installed KubeRay. We used Ray to distribute the Daft data processing across the compute nodes. Another option we tested for distributed computing was the Anyscale platform, which greatly simplifies the orchestration process.
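With the cluster up, pointing Daft at Ray is essentially a one-line change, as in the sketch below. The Ray address and storage path are placeholders for the services deployed via KubeRay.

```python
# Hypothetical sketch: run the same Daft pipeline on the Ray cluster instead of locally.
import daft

# Placeholder address for the Ray head service created by KubeRay.
daft.context.set_runner_ray(address="ray://<ray-head-service>:10001")

df = daft.read_parquet("corpus/*.parquet")  # storage location is illustrative
# ...the transformations above run unchanged, distributed across the Ray workers.
```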
The cost of processing all 7 million text documents and 80 GB of data from the CAP dataset was $0.06. You can find all of the processed caselaw data here.
Code, model, and data releases
The processing code is released as part of our petabyte-scale open-source library, Jotunn.
Once the files are processed and deduplicated, the data is ready for training. We also trained and released Comma v0.1, a 7-billion-parameter language model, on the Common Pile.
Additionally, we released the raw datasets from a wide variety of sources. You can find all of the raw datasets on Hugging Face under the Common Pile organization. You can also find a version of all of the preprocessed, cleaned, filtered, and deduplicated data on Hugging Face. The full training dataset used for the Comma model can be found on Hugging Face as well. All of the data in the Common Pile is located here.
Acknowledgments and recognition
This work could not have been accomplished without the significant efforts of Jack Cushman and Greg Leppert at the Caselaw Access Project and Harvard Library Innovation Lab, the team at Eventual, and Michael Lissner at the Free Law Project. A big thank you to them for providing the essential guidance and support for this release. You can support the Free Law Project here.
The Common Pile was accepted to NeurIPS 2025 and featured in an article by The Washington Post.