TL;DR – Domain-specific and custom embedding models significantly enhance retrieval quality. Hot on the heels of our state-of-the-art code embedding model, voyage-code-2, we are thrilled to release voyage-law-2, which tops the MTEB leaderboard for legal retrieval by a significant margin. For example, it outperforms OpenAI v3 large by 6% on average over eight legal retrieval datasets, and by more than 10% on three of them (LeCaRDv2, LegalQuAD, and GerDaLIR). Equipped with a 16K-token context length and trained on massive long-context legal documents, voyage-law-2 excels in long-context retrieval across domains. Notably, it also performs as well as or better on general-purpose corpora.
In retrieval-augmented generation (RAG), the quality of embeddings determines the relevancy of the retrieved documents, which in turn significantly affects the hallucination rate of the LLM and the response quality. However, the general-purpose embedding models currently available on the market fall short in expertise-intensive domains such as legal, finance, and coding. This underscores the importance of developing high-quality, domain-specific embedding models, such as voyage-law-2, which is optimized for applications in the legal domain.
Legal retrieval. Retrieving relevant legal documents, such as precedents or legislation, is a labor-intensive and tedious yet crucial task in the daily operations of any legal firm. Consequently, it is a foundational and impactful step in automating legal work, particularly within the RAG framework. However, high-quality legal retrieval poses significant challenges due to the extensive length and dense jargon of legal documents, coupled with the requirement for analytical and deductive reasoning.
Domain customization is the key to solving challenging retrieval problems. Embedding models typically have no more than 10 billion parameters due to latency constraints, which limits their general-purpose capability. Therefore, rationing the parameter capacity and allocating it to specific domains is both necessary and sufficient to achieve excellent performance in those areas. For example, semantic code retrieval is highly challenging, yet our voyage-code-2 achieves a 15% improvement, or at least a 60% error reduction, compared to the next best model, OpenAI v3 large.
Training legal embeddings. voyage-law-2 is trained on an additional 1T high-quality legal tokens with specifically designed positive pairs and a novel contrastive learning algorithm. Abundant long legal documents are included to ensure quality on long-context retrieval. Moreover, law is a cross-cutting field that intersects with other domains such as finance and technology (e.g., finance law, intellectual property, etc.). Hence, data from other domains was mixed into the training set to ensure broader coverage. The following pie chart shows the breakdown of training tokens by domain.
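Voyage's exact loss and data recipe are not public, but contrastive training on positive pairs commonly follows an InfoNCE-style objective, where each query is pulled toward its paired document and pushed away from the other documents in the batch. The sketch below is a generic illustration of that idea, not the actual voyage-law-2 training objective; the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Generic InfoNCE contrastive loss over a batch of positive pairs.

    query_emb, doc_emb: (batch, dim) arrays; row i of each forms a positive
    pair, and all other in-batch documents serve as negatives for query i.
    """
    # Unit-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss drives each query embedding toward its positive document and away from the in-batch negatives, which is why the choice of positive pairs matters so much for domain quality.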
Evaluation
Legal retrieval
Datasets. We evaluate voyage-law-2 on eight legal retrieval datasets spanning contracts, congressional bills, and court cases, all published on the Massive Text Embedding Benchmark (MTEB), where you can find additional details, inspect the data, and reproduce the results. The following table summarizes the datasets.
Dataset | Description |
---|---|
LegalSummarization | Contracts and their summarizations |
ConsumerContractsQA | Questions and answers on contracts |
CorporateLobbying | Corporate lobbying bill titles and summaries |
AILACasedocs | Indian Supreme Court cases |
AILAStatutes | Indian Supreme Court cases and relevant statutes |
LeCaRDv2 | Chinese legal cases |
LegalQuAD | German legal questions and relevant cases |
GerDaLIR | German legal cases |
Models and Metrics. We evaluate voyage-law-2 against four baselines: E5 Mistral (e5-mistral-7b-instruct), OpenAI v3 large (text-embedding-3-large), Cohere v3 (embed-english-v3), and BGE v1.5 large (bge-large-en-v1.5). Given a query, we retrieve the top-10 documents based on cosine similarity and report the normalized discounted cumulative gain (NDCG@10), a standard rank-aware metric for retrieval quality.
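For reference, NDCG@k rewards relevant documents more the higher they rank, discounting each hit by the log of its position. A minimal single-query implementation, assuming binary relevance judgments for illustration, looks like this:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance.

    ranked_ids: document ids ordered by descending similarity to the query.
    relevant_ids: set of ids judged relevant for the query.
    """
    # DCG: each relevant doc contributes 1 / log2(rank + 2), rank 0-based.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    # Ideal DCG: all relevant docs placed at the top of the ranking.
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k(["d2", "d1"], {"d1"})` returns 1/log2(3) ≈ 0.63, since the single relevant document sits at rank 2 instead of rank 1. The benchmark numbers below average this score over all queries in each dataset.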
Results. The following table lists the NDCG@10 for each dataset. You can also view these results on the MTEB leaderboard for law retrieval.
Dataset | voyage-law-2 | Mistral | OpenAI | Cohere | BGE |
---|---|---|---|---|---|
LegalSummarization | 68.96 | 66.51 | 71.55 | 61.70 | 59.99 |
ConsumerContractsQA | 83.27 | 75.46 | 79.39 | 77.12 | 73.52 |
CorporateLobbying | 95.66 | 94.01 | 95.09 | 93.68 | 91.51 |
AILACasedocs | 44.56 | 38.76 | 39.00 | 31.54 | 25.15 |
AILAStatutes | 45.51 | 38.07 | 41.31 | 27.15 | 20.74 |
LeCaRDv2 | 72.75 | 68.56 | 57.20 | 21.02 | 22.68 |
LegalQuAD | 67.47 | 59.64 | 57.47 | 26.08 | 16.22 |
GerDaLIR | 44.91 | 37.18 | 32.77 | 6.05 | 3.96 |
Average | 65.39 | 59.77 | 59.22 | 43.04 | 39.22 |
voyage-law-2 delivers the best quality on seven of the eight datasets, and the best average by a large margin. In particular, voyage-law-2 outperforms OpenAI v3 large by more than 10% on three datasets (LeCaRDv2, LegalQuAD, and GerDaLIR).
Long-context retrieval
In addition, because legal documents are usually long, we also benchmark voyage-law-2 on five long-context retrieval datasets with documents between 6K and 13K tokens, summarized in the table below.
Dataset | Description | Average Tokens Per Document | Average Tokens Per Query |
---|---|---|---|
QMSUM | Multiple domain meeting transcripts | 12823 | 90 |
GovReport | US national policy issue reports | 12117 | 715 |
SummScreen | TV series transcripts | 10139 | 143 |
Qasper | NLP papers filtered from S2ORC | 5880 | 18 |
Qasper abstract_doc | NLP papers filtered from S2ORC, queries are paper abstracts | 5873 | 205 |
We report the standard NDCG@10 for each dataset. voyage-law-2 is the best model, achieving the highest NDCG@10 across all datasets and exceeding OpenAI v3 large on average by over 15%, while boasting twice the context length at 16K tokens.
Dataset | voyage-law-2 | OpenAI | Mistral |
---|---|---|---|
QMSUM | 42.34 | 14.48 | 38.89 |
GovReport | 95.75 | 80.66 | 93.62 |
SummScreen | 87.32 | 60.58 | 83.98 |
Qasper | 97.37 | 90.40 | 97.21 |
Qasper abstract_doc | 99.45 | 95.87 | 98.26 |
Average | 84.44 | 68.40 | 82.39 |
Retrieval across domains
As mentioned earlier, voyage-law-2 was trained on various domains to provide cross-domain legal coverage. We evaluate on 34 retrieval datasets across eight categories spanning various topics and corpora, including technical documentation, code, law, finance, web reviews, long documents, medicine, and conversations. The law and long-context datasets are the same datasets discussed in the legal retrieval and long-context retrieval sections above, respectively.
Category | Description | Datasets |
---|---|---|
TECH | Technical documentation | OneSignal, PyTorch, Verizon 5G, Cohere |
CODE | Code snippets, docstrings | LeetCode-python, DS1000, codechef-cpp_5doc |
LAW | Cases, court opinions, statutes, patents | LegalSummarization, ConsumerContractsQA, CorporateLobbying, AILACasedocs, AILAStatutes, LeCaRDv2, LegalQuAD, GerDaLIR |
FINANCE | SEC filings, finance QA | FinanceBench, ConvFinQA, Fiqa Personal Finance |
WEB | Reviews, forum posts, policy pages | Doordash, Health4CA, Movie Summary, Kijiji.ca |
LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | QMSUM, GovReport, SummScreen, Qasper, Qasper abstract_doc |
MEDICAL | Medical documents and QA | Mental Health Consulting, Covid QA, ChatDoctor, Medical Instruction |
CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, MeetingBank-transcript |
The following radar chart plots NDCG@10 for each dataset. We can see that voyage-law-2 performs well across the broad range of domains evaluated while also being optimized for legal domain applications.
Try voyage-law-2
Build your legal Gen AI applications with voyage-law-2 today! If you have used other Voyage embeddings, you just need to specify voyage-law-2 as the model parameter (for both the corpus and queries). Head over to our docs to learn more. Stay tuned for more domain-specific embedding models from us. And if you are interested in early access to them, or in fine-tuning embeddings, we would love to hear from you: please email [email protected]. Follow us on X (Twitter) and/or LinkedIn for more updates!
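Once both corpus and queries are embedded with the same model, retrieval is just a cosine-similarity top-k ranking. Here is a minimal, dependency-free sketch of that step; the commented `voyageai` client usage reflects our understanding of the Python SDK and should be checked against the official docs before use.

```python
import math

def top_k(query_emb, doc_embs, k=10):
    """Return indices of the k documents most cosine-similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scores = [(cos(query_emb, d), i) for i, d in enumerate(doc_embs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# In production, the embeddings would come from the API (assumed client usage):
#   import voyageai
#   vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
#   doc_embs  = vo.embed(docs,    model="voyage-law-2", input_type="document").embeddings
#   query_emb = vo.embed([query], model="voyage-law-2", input_type="query").embeddings[0]
```

Note that the same model name is passed for both documents and queries; mixing embedding models between the corpus and the query side will silently degrade retrieval quality.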