Harvey partners with Voyage to build custom legal embeddings

Editor’s Note: This post originally appeared on Harvey’s blog. We repost it here in the third person with Harvey’s kind permission.

Intro

Retrieval-augmented generation (RAG) is a fundamental component of real-world LLM systems, and a tool Harvey often uses to augment its custom models with specialized context. Embeddings are the backbone of RAG, enabling retrieval of items by their semantic meaning and complementing classical search strategies like keyword search. The challenge with standard embeddings, like standard language models, is that they are trained on general corpora and therefore struggle in specialized fields. For example, when considered against the entire universe of text, legal jargon all looks relatively similar, which prevents embedding-based retrieval from disambiguating the relevant passages from the rest of the data.
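Conceptually, embedding-based retrieval reduces to nearest-neighbor search over dense vectors. A minimal sketch of that core operation, using toy NumPy vectors standing in for the output of an embedding model (all names here are illustrative, not part of any Harvey or Voyage API):

```python
import numpy as np

def cosine_retrieve(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k]

# Toy vectors standing in for embedding-model output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 8))
query = docs[4] + 0.05 * rng.normal(size=8)  # lightly perturbed copy of doc 4
print(cosine_retrieve(query, docs))          # index 4 ranks first
```

In production, the brute-force similarity scan above is replaced by an approximate nearest-neighbor index, but the semantics are the same: the query and the documents live in one vector space, and proximity stands in for relevance.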

Voyage AI

Voyage AI, led by Stanford professor Tengyu Ma, is a leading developer of customized embedding models and LLM retrieval infrastructure. Voyage has assembled a world-class AI research team that has developed novel techniques enabling embeddings to capture the nuances of specialized text much as domain experts do.

Given their track record of building domain-specific embedding models, Harvey was excited to partner with the Voyage AI team to fine-tune embeddings specifically for Harvey use cases.

Custom Embeddings

Harvey and Voyage collaborated to fine-tune an embedding model on US case law: more than 20 billion tokens of legal text, a domain where even the best standard embedding models struggle to distinguish the cases relevant to common questions. Starting from voyage-law-2 as a base, the custom Harvey model was trained first on the raw case law text itself, using Voyage AI’s proprietary self-supervised techniques, and subsequently on a dataset of exemplar questions and expert annotations of relevant cases collected by the Harvey legal research team.
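Voyage’s actual training techniques are proprietary, but supervised fine-tuning on question–annotation pairs of this kind is commonly driven by an in-batch contrastive (InfoNCE-style) objective, where each query’s annotated relevant passage is the positive and the other passages in the batch serve as negatives. A hypothetical NumPy sketch of that objective (illustrative only, not Voyage’s method):

```python
import numpy as np

def info_nce_loss(query_embs, passage_embs, temperature=0.05):
    """In-batch contrastive loss: row i of `passage_embs` is the positive
    for row i of `query_embs`; every other row acts as a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature              # all query-passage similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on the diagonal
```

Training would then adjust the embedding model’s weights to drive this loss down, pulling matched query and passage vectors together and pushing mismatched ones apart; the sketch only computes the objective itself.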

Voyage’s custom training work has been immediately impactful. Harvey evaluated the custom Harvey model and other leading embedding models on the Harvey legal retrieval task, a large dataset of query-content pairs generated from a variety of legal documents, using Normalized Discounted Cumulative Gain (NDCG@10) and Recall at 100 items (Recall@100) as performance metrics (both are standard metrics for retrieval quality). The custom Harvey embedding model, named voyage-law-2-harvey, reduces the amount of irrelevant material returned in top results by nearly 25% compared to the next best off-the-shelf embedding models (e.g. Google’s text-embedding-004 or OpenAI’s text-embedding-3-large). It accomplishes this with one-third the embedding dimensionality, leading to significant benefits in storage and latency. Harvey has also combined voyage-law-2-harvey’s more robust understanding of legal text with other proprietary search methods, further improving the ability of Harvey’s retrieval systems to identify relevant cases, and passages from cases, in response to complex legal questions.
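Both metrics have precise definitions. A small sketch of the two, under the common convention that the ideal ranking for NDCG is computed from the relevance labels of the listed items (function names are illustrative):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: `relevances` are the graded relevance labels
    of the returned items, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of all relevant items that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

NDCG@10 rewards placing the most relevant items at the very top (the log-discount makes early positions count more), while Recall@100 measures whether the relevant items are surfaced at all within a larger candidate window, which matters when retrieval feeds a downstream reranker or LLM.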

Next Steps

Harvey is excited to continue working with Tengyu and Voyage to develop a suite of custom embedding models for legal and beyond, as well as to work with clients to create firm- and company-specific embeddings for enterprise search, RAG systems, and other GenAI applications.

Credits: Aravind Srinivasan[1], Calvin Qi[1], Wen Phan[2], Daniel Hunter[1], Julio Pereyra[1], Niko Grupen[1], Tengyu Ma[2], Gabriel Pereyra[1]

[1] Harvey

[2] Voyage AI
