Domain-Specific Embeddings and Retrieval: Legal Edition (voyage-law-2)

TL;DR – Domain-specific and custom embedding models significantly enhance retrieval quality. Hot on the heels of our state-of-the-art code embedding model (voyage-code-2), we are thrilled to release voyage-law-2, which tops the MTEB leaderboard for legal retrieval by a significant margin. For example, it outperforms OpenAI v3 large by 6% on average over eight legal retrieval datasets, and by more than 10% on three of them (LeCaRDv2, LegalQuAD, and GerDaLIR). Equipped with a 16K context length and trained on massive long-context legal documents, voyage-law-2 excels at long-context retrieval across domains. Notably, it also performs as well as or better than general-purpose models on corpora from other domains.

In retrieval-augmented generation (RAG), the quality of the embeddings determines the relevance of the retrieved documents, which in turn significantly affects the hallucination rate of the LLM and the quality of its responses. However, the general-purpose embedding models currently available on the market fall short in expertise-intensive domains such as law, finance, and coding. This underscores the importance of developing high-quality, domain-specific embedding models, such as voyage-law-2, which is optimized for applications in the legal domain.
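For intuition, here is a minimal sketch of the retrieval step in a RAG pipeline. The `embed` function is a placeholder for any embedding model (such as an embeddings API call), not a specific Voyage interface:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for any embedding model (e.g., an embeddings API call).
    Assumed to return one unit-normalized vector per input text."""
    raise NotImplementedError

def retrieve(query: str, corpus: list[str], corpus_vecs: np.ndarray, k: int = 10) -> list[str]:
    """Return the k corpus documents most similar to the query.
    With unit-normalized vectors, the dot product equals cosine similarity."""
    query_vec = embed([query])[0]
    scores = corpus_vecs @ query_vec        # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return [corpus[i] for i in top]

# The retrieved documents are then placed in the LLM prompt: better
# embeddings -> more relevant context -> fewer hallucinations.
```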

Legal retrieval. Retrieving relevant legal documents, such as precedents or legislation, is a labor-intensive and tedious yet crucial task in the daily operations of any law firm. Consequently, it is a foundational and impactful step in automating legal work, particularly within the RAG framework. However, high-quality legal retrieval poses significant challenges due to the extensive length and dense jargon of legal documents, coupled with the requirement for analytical and deductive reasoning.

Domain customization is the key to solving challenging retrieval problems. Due to latency constraints, embedding models typically have no more than 10 billion parameters, which limits their general-purpose capability. Rationing that parameter capacity and allocating it to a specific domain is therefore both necessary and sufficient for excellent performance in that area. For example, semantic code retrieval is highly challenging, yet our voyage-code-2 achieves a 15% improvement over the next best model, OpenAI v3 large, which amounts to at least a 60% error reduction (illustratively, improving NDCG from 75 to 90 shrinks the gap to a perfect 100 from 25 points to 10, a 60% reduction).

Training legal embeddings. voyage-law-2 is trained on an additional 1T high-quality legal tokens with specifically designed positive pairs and a novel contrastive learning algorithm; a generic sketch of this style of training follows below. Abundant long legal documents are included to ensure quality on long-context retrieval. Moreover, law is a cross-cutting field, intersecting with other domains such as finance and technology (e.g., finance law, intellectual property). Hence, data from other domains was mixed into the training set to ensure broader coverage. The following pie chart shows the breakdown of training tokens by domain.
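Voyage has not published the details of its training algorithm, so the following is only a minimal, generic sketch of positive-pair contrastive training with in-batch negatives (a standard InfoNCE-style objective), not Voyage's actual method:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs: torch.Tensor, doc_vecs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_vecs and doc_vecs are (batch, dim) embeddings of positive
    pairs; row i of each tensor is one pair, and every other row in the
    batch serves as a negative for it.
    """
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```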

Evaluation

Legal retrieval

Datasets.  We evaluate voyage-law-2 on eight legal retrieval datasets spanning contracts, congressional bills, and court cases, all published on the Massive Text Embedding Benchmark (MTEB), where you can find additional details, inspect the data, and reproduce the results.  The following table summarizes the datasets.

| Dataset | Description |
| --- | --- |
| LegalSummarization | Contracts and their summarizations |
| ConsumerContractsQA | Questions and answers on contracts |
| CorporateLobbying | Corporate lobbying bill titles and summaries |
| AILACasedocs | Indian Supreme Court cases |
| AILAStatutes | Indian Supreme Court cases and relevant statutes |
| LeCaRDv2 | Chinese legal cases |
| LegalQuAD | German legal questions and relevant cases |
| GerDaLIR | German legal cases |

Models and Metrics.  We evaluate voyage-law-2 against four baselines: E5 Mistral (e5-mistral-7b-instruct), OpenAI v3 large (text-embedding-3-large), Cohere v3 (embed-english-v3), and BGE v1.5 large (bge-large-en-v1.5).  Given a query, we retrieve the top-10 documents based on cosine similarity and report the normalized discounted cumulative gain (NDCG@10), a standard metric for retrieval quality that rewards ranking relevant documents near the top.
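For concreteness, here is a minimal sketch of NDCG@k; the `relevance` mapping from document ids to graded relevance labels is a hypothetical stand-in for each dataset's ground-truth annotations:

```python
import numpy as np

def ndcg_at_k(retrieved_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the top-k retrieved
    documents, normalized by the gain of an ideal ordering."""
    gains = np.array([relevance.get(doc_id, 0.0) for doc_id in retrieved_ids[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))  # position i gets 1/log2(i+1)
    dcg = float(np.sum(gains * discounts))
    ideal = np.array(sorted(relevance.values(), reverse=True)[:k])
    idcg = float(np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```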

Results.  The following table lists the NDCG@10 for each dataset.  You can also view these results on the MTEB leaderboard for law retrieval.

| Dataset | voyage-law-2 | Mistral | OpenAI | Cohere | BGE |
| --- | --- | --- | --- | --- | --- |
| LegalSummarization | 68.96 | 66.51 | 71.55 | 61.70 | 59.99 |
| ConsumerContractsQA | 83.27 | 75.46 | 79.39 | 77.12 | 73.52 |
| CorporateLobbying | 95.66 | 94.01 | 95.09 | 93.68 | 91.51 |
| AILACasedocs | 44.56 | 38.76 | 39.00 | 31.54 | 25.15 |
| AILAStatutes | 45.51 | 38.07 | 41.31 | 27.15 | 20.74 |
| LeCaRDv2 | 72.75 | 68.56 | 57.20 | 21.02 | 22.68 |
| LegalQuAD | 67.47 | 59.64 | 57.47 | 26.08 | 16.22 |
| GerDaLIR | 44.91 | 37.18 | 32.77 | 6.05 | 3.96 |
| Average | 65.39 | 59.77 | 59.22 | 43.04 | 39.22 |

voyage-law-2 demonstrates the best quality on seven of the eight datasets, and the best average by a large margin.  In particular, voyage-law-2 outperforms OpenAI v3 large by more than 10% on three datasets (LeCaRDv2, LegalQuAD, and GerDaLIR).

Long-context retrieval

In addition, because legal documents are usually long, we also benchmark voyage-law-2 on five long-context retrieval datasets whose documents average roughly 6K to 13K tokens, summarized in the table below.

| Dataset | Description | Avg. Tokens per Document | Avg. Tokens per Query |
| --- | --- | --- | --- |
| QMSUM | Meeting transcripts from multiple domains | 12,823 | 90 |
| GovReport | US national policy issue reports | 12,117 | 715 |
| SummScreen | TV series transcripts | 10,139 | 143 |
| Qasper | NLP papers filtered from S2ORC | 5,880 | 18 |
| Qasper abstract_doc | NLP papers filtered from S2ORC; queries are paper abstracts | 5,873 | 205 |

We report the standard NDCG@10 for each dataset.  voyage-law-2 is the best model, demonstrating the highest NDCG@10 across all datasets and exceeding OpenAI v3 large on average by over 15%, while boasting twice the context length at 16K.

| Dataset | voyage-law-2 | OpenAI | Mistral |
| --- | --- | --- | --- |
| QMSUM | 42.34 | 14.48 | 38.89 |
| GovReport | 95.75 | 80.66 | 93.62 |
| SummScreen | 87.32 | 60.58 | 83.98 |
| Qasper | 97.37 | 90.40 | 97.21 |
| Qasper abstract_doc | 99.45 | 95.87 | 98.26 |
| Average | 84.44 | 68.40 | 82.39 |

Retrieval across domains

As mentioned earlier, voyage-law-2 was trained on various domains to provide cross-domain legal coverage. We evaluate on 34 retrieval datasets across eight categories spanning various topics and corpora, including technical documentation, code, law, finance, web reviews, long documents, medicine, and conversations.  The law and long-context datasets are the same datasets discussed in the legal retrieval and long-context retrieval sections above, respectively.

| Category | Description | Datasets |
| --- | --- | --- |
| TECH | Technical documentation | OneSignal, PyTorch, Verizon 5G, Cohere |
| CODE | Code snippets, docstrings | LeetCode-python, DS1000, codechef-cpp_5doc |
| LAW | Cases, court opinions, statutes, patents | LegalSummarization, ConsumerContractsQA, CorporateLobbying, AILACasedocs, AILAStatutes, LeCaRDv2, LegalQuAD, GerDaLIR |
| FINANCE | SEC filings, finance QA | FinanceBench, ConvFinQA, Fiqa Personal Finance |
| WEB | Reviews, forum posts, policy pages | Doordash, Health4CA, Movie Summary, Kijiji.ca |
| LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | QMSUM, GovReport, SummScreen, Qasper, Qasper abstract_doc |
| MEDICAL | Medical documents and QA | Mental Health Consulting, Covid QA, ChatDoctor, Medical Instruction |
| CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, MeetingBank-transcript |

The following radar chart plots NDCG@10 for each dataset.  We can see that voyage-law-2 performs well across the broad range of domains evaluated while also being optimized for legal domain applications.

Try voyage-law-2

Build your legal generative AI applications with voyage-law-2 today!  If you have used other Voyage embedding models, you just need to specify voyage-law-2 as the model parameter (for both the corpus and the queries), as in the sketch below.  Head over to our docs to learn more.  Stay tuned for more domain-specific embedding models from us, and if you're interested in early access or in fine-tuning embeddings, we'd love to hear from you at [email protected].  Follow us on X (Twitter) and/or LinkedIn for more updates!
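As a minimal sketch using the voyageai Python client (the documents and query below are illustrative; the client reads your API key from the VOYAGE_API_KEY environment variable):

```python
import voyageai

vo = voyageai.Client()  # uses the VOYAGE_API_KEY environment variable

documents = [
    "This Agreement shall be governed by the laws of the State of Delaware.",
    "The Licensee shall indemnify the Licensor against all third-party claims.",
]

# Embed the corpus and the queries with the same model; input_type lets
# the model treat documents and queries appropriately.
doc_result = vo.embed(documents, model="voyage-law-2", input_type="document")
query_result = vo.embed(["Which state's law governs the agreement?"],
                        model="voyage-law-2", input_type="query")

doc_vectors = doc_result.embeddings        # one float vector per document
query_vector = query_result.embeddings[0]  # vector for the single query
```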
