TL;DR – Domain-specific and custom embedding models significantly enhance retrieval quality. Hot on the heels of our state-of-the-art code embedding model, voyage-code-2, we are thrilled to release voyage-law-2, which tops the MTEB leaderboard for legal retrieval by a significant margin. For example, it outperforms OpenAI v3 large by 6% on average over eight legal retrieval datasets, and by more than 10% on three of them (LeCaRDv2, LegalQuAD, and GerDaLIR). Equipped with a 16K-token context length and trained on massive long-context legal documents, voyage-law-2 excels in long-context retrieval across domains. Notably, it also performs as well as or better on general-purpose corpora.
In retrieval-augmented generation (RAG), the quality of embeddings determines the relevancy of the retrieved documents, which in turn significantly affects the hallucination rate of the LLM and the response quality. However, the general-purpose embedding models currently available on the market fall short in expertise-intensive domains such as legal, finance, and coding. This underscores the importance of developing high-quality, domain-specific embedding models, such as voyage-law-2, which is optimized for applications in the legal domain.
Legal retrieval. Retrieving relevant legal documents, such as precedents or legislation, is a labor-intensive and tedious yet crucial task in the daily operations of any legal firm. Consequently, it is a foundational and impactful step in automating legal work, particularly within the RAG framework. However, high-quality legal retrieval poses significant challenges due to the extensive length and dense jargon of legal documents, coupled with the requirement for analytical and deductive reasoning.
Domain customization is the key to solving challenging retrieval problems. Embedding models typically have no more than 10 billion parameters due to latency constraints, which limits their general-purpose capability. Therefore, rationing the parameter capacity and allocating it to specific domains is both necessary and sufficient to achieve excellent performance in those areas. For example, semantic code retrieval is highly challenging, yet our voyage-code-2 achieves a 15% improvement, or at least a 60% error reduction, compared to the next best model, OpenAI v3 large.
Training legal embeddings. voyage-law-2 is trained on an additional 1T high-quality legal tokens with specifically designed positive pairs and a novel contrastive learning algorithm. Abundant long legal documents are included to ensure quality on long-context retrieval. Moreover, law is a cross-cutting field that intersects with other domains such as finance and technology (e.g., finance law, intellectual property, etc.). Hence, data from other domains was mixed into the training set to ensure broader coverage. The following pie chart shows the breakdown of training tokens by domain.
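Voyage's exact loss and data recipe are not public, but contrastive training on positive pairs commonly follows an InfoNCE-style objective, where each query is pulled toward its paired document and pushed away from the other documents in the batch. The sketch below is a generic illustration of that idea, not the actual voyage-law-2 training objective; the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Generic InfoNCE contrastive loss over a batch of positive pairs.

    query_emb, doc_emb: (batch, dim) arrays; row i of each forms a positive
    pair, and all other in-batch documents serve as negatives for query i.
    """
    # Unit-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss drives each query embedding toward its positive document and away from the in-batch negatives, which is why the choice of positive pairs matters so much for domain quality.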
Evaluation
Legal retrieval
Datasets. We evaluate voyage-law-2 on eight legal retrieval datasets spanning contracts, congressional bills, and court cases, all published on the Massive Text Embedding Benchmark (MTEB), where you can find additional details, inspect the data, and reproduce the results. The following table summarizes the datasets.
Dataset | Description |
---|---|
LegalSummarization | Contracts and their summarizations |
ConsumerContractsQA | Questions and answers on contracts |
CorporateLobbying | Corporate lobbying bill titles and summaries |
AILACasedocs | Indian Supreme Court cases |
AILAStatutes | Indian Supreme Court cases and relevant statutes |
LeCaRDv2 | Chinese legal cases |
LegalQuAD | German legal questions and relevant cases |
GerDaLIR | German legal cases |
Models and Metrics. We evaluate voyage-law-2 against four baselines: E5 Mistral (e5-mistral-7b-instruct), OpenAI v3 large (text-embedding-3-large), Cohere v3 (embed-english-v3), and BGE v1.5 large (bge-large-en-v1.5). Given a query, we retrieve the top-10 documents based on cosine similarity and report the normalized discounted cumulative gain (NDCG@10), a standard rank-aware metric for retrieval quality.
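For reference, NDCG@k rewards relevant documents more the higher they rank, discounting each hit by the log of its position. A minimal single-query implementation, assuming binary relevance judgments for illustration, looks like this:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance.

    ranked_ids: document ids ordered by descending similarity to the query.
    relevant_ids: set of ids judged relevant for the query.
    """
    # DCG: each relevant doc contributes 1 / log2(rank + 2), rank 0-based.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    # Ideal DCG: all relevant docs placed at the top of the ranking.
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k(["d2", "d1"], {"d1"})` returns 1/log2(3) ≈ 0.63, since the single relevant document sits at rank 2 instead of rank 1. The benchmark numbers below average this score over all queries in each dataset.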
Results. The following table lists the NDCG@10 for each dataset. You can also view these results on the MTEB leaderboard for law retrieval.
Dataset | voyage-law-2 | Mistral | OpenAI | Cohere | BGE |
---|---|---|---|---|---|
LegalSummarization | 68.96 | 66.51 | 71.55 | 61.70 | 59.99 |
ConsumerContractsQA | 83.27 | 75.46 | 79.39 | 77.12 | 73.52 |
CorporateLobbying | 95.66 | 94.01 | 95.09 | 93.68 | 91.51 |
AILACasedocs | 44.56 | 38.76 | 39.00 | 31.54 | 25.15 |
AILAStatutes | 45.51 | 38.07 | 41.31 | 27.15 | 20.74 |
LeCaRDv2 | 72.75 | 68.56 | 57.20 | 21.02 | 22.68 |
LegalQuAD | 67.47 | 59.64 | 57.47 | 26.08 | 16.22 |
GerDaLIR | 44.91 | 37.18 | 32.77 | 6.05 | 3.96 |
Average | 65.39 | 59.77 | 59.22 | 43.04 | 39.22 |
voyage-law-2 delivers the best quality on seven of the eight datasets, and the best average by a large margin. In particular, voyage-law-2 outperforms OpenAI v3 large by more than 10% on three datasets (LeCaRDv2, LegalQuAD, and GerDaLIR).
Long-context retrieval
In addition, because legal documents are usually long, we also benchmark voyage-law-2 on five long-context retrieval datasets with documents between 6K and 13K tokens, summarized in the table below.
Dataset | Description | Average Tokens Per Document | Average Tokens Per Query |
---|---|---|---|
QMSUM | Multiple domain meeting transcripts | 12823 | 90 |
GovReport | US national policy issue reports | 12117 | 715 |
SummScreen | TV series transcripts | 10139 | 143 |
Qasper | NLP papers filtered from S2ORC | 5880 | 18 |
Qasper abstract_doc | NLP papers filtered from S2ORC, queries are paper abstracts | 5873 | 205 |
We report the standard NDCG@10 for each dataset. voyage-law-2 is the best model, achieving the highest NDCG@10 across all datasets and exceeding OpenAI v3 large on average by over 15%, while boasting twice the context length at 16K tokens.
Dataset | voyage-law-2 | OpenAI | Mistral |
---|---|---|---|
QMSUM | 42.34 | 14.48 | 38.89 |
GovReport | 95.75 | 80.66 | 93.62 |
SummScreen | 87.32 | 60.58 | 83.98 |
Qasper | 97.37 | 90.40 | 97.21 |
Qasper abstract_doc | 99.45 | 95.87 | 98.26 |
Average | 84.44 | 68.40 | 82.39 |
Retrieval across domains
As mentioned earlier, voyage-law-2 was trained on various domains to provide cross-domain legal coverage. We evaluate on 34 retrieval datasets across eight categories spanning various topics and corpora, including technical documentation, code, law, finance, web reviews, long documents, medicine, and conversations. The law and long-context datasets are the same datasets discussed in the legal retrieval and long-context retrieval sections above, respectively.
Category | Description | Datasets |
---|---|---|
TECH | Technical documentation | OneSignal, PyTorch, Verizon 5G, Cohere |
CODE | Code snippets, docstrings | LeetCode-python, DS1000, codechef-cpp_5doc |
LAW | Cases, court opinions, statutes, patents | LegalSummarization, ConsumerContractsQA, CorporateLobbying, AILACasedocs, AILAStatutes, LeCaRDv2, LegalQuAD, GerDaLIR |
FINANCE | SEC filings, finance QA | FinanceBench, ConvFinQA, Fiqa Personal Finance |
WEB | Reviews, forum posts, policy pages | Doordash, Health4CA, Movie Summary, Kijiji.ca |
LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | QMSUM, GovReport, SummScreen, Qasper, Qasper abstract_doc |
MEDICAL | Medical documents and QA | Mental Health Consulting, Covid QA, ChatDoctor, Medical Instruction |
CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, MeetingBank-transcript |
The following radar chart plots NDCG@10 for each dataset. We can see that voyage-law-2 performs well across the broad range of domains evaluated while also being optimized for legal domain applications.
Try voyage-law-2
Build your legal Gen AI applications with voyage-law-2 today! If you have used other Voyage embeddings, you just need to specify voyage-law-2 as the model parameter (for both the corpus and queries). Head over to our docs to learn more. Stay tuned for more domain-specific embedding models from us. And if you are interested in early access to them, or in fine-tuning embeddings, we would love to hear from you: please email [email protected]. Follow us on X (Twitter) and/or LinkedIn for more updates!
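Once both corpus and queries are embedded with the same model, retrieval is just a cosine-similarity top-k ranking. Here is a minimal, dependency-free sketch of that step; the commented `voyageai` client usage reflects our understanding of the Python SDK and should be checked against the official docs before use.

```python
import math

def top_k(query_emb, doc_embs, k=10):
    """Return indices of the k documents most cosine-similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scores = [(cos(query_emb, d), i) for i, d in enumerate(doc_embs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# In production, the embeddings would come from the API (assumed client usage):
#   import voyageai
#   vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
#   doc_embs  = vo.embed(docs,    model="voyage-law-2", input_type="document").embeddings
#   query_emb = vo.embed([query], model="voyage-law-2", input_type="query").embeddings[0]
```

Note that the same model name is passed for both documents and queries; mixing embedding models between the corpus and the query side will silently degrade retrieval quality.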