rerank-2 and rerank-2-lite: the next generation of Voyage multilingual rerankers

TL;DR — We’re excited to announce the Voyage 2 series of rerankers, rerank-2 and rerank-2-lite. When evaluated across 93 retrieval datasets spanning multiple domains, adding rerank-2 and rerank-2-lite on top of OpenAI’s latest embedding model (v3 large) improves retrieval accuracy by an average of 13.89% and 11.86%, respectively, which is 2.3x and 1.7x the improvement attained by the latest Cohere reranker (English v3). Furthermore, rerank-2 and rerank-2-lite support context lengths of 16K and 8K tokens, 4x and 2x the context length of Cohere’s reranker.

Rerankers boost the quality of retrieval systems by refining the order of the initial search results. Earlier this year, we released our first-generation rerankers, rerank-lite-1 and rerank-1, both outperforming competing rerankers while offering at least 2x more context length and flexible token-based pricing.

Today, we are thrilled to introduce our Voyage 2 series of rerankers, rerank-2 and rerank-2-lite. rerank-2 is optimized for quality, improving accuracy atop OpenAI v3 large (text-embedding-3-large) by an average of 13.89% — 2.80%, 7.14%, and 15.61% more than rerank-1, Cohere v3 (rerank-english-v3.0), and BGE v2-m3 (bge-reranker-v2-m3) respectively. It supports a 16K-token combined context length for a query-document pair, with up to 4K tokens for the query.

rerank-2-lite is optimized for latency while still preserving strong quality, comparable to rerank-1 (at 2.5x lower cost and much lower latency) and better than all competing models. rerank-2-lite improves accuracy atop OpenAI v3 large by an average of 11.86% — 5.12% and 13.59% more than Cohere v3 and BGE v2-m3, respectively. It supports an 8K-token combined context length for a query-document pair, with up to 2K tokens for the query.

Both rerank-2 and rerank-2-lite are also natively multilingual, beating Cohere multilingual v3 (rerank-multilingual-v3.0) by 8.83% and 6.24% on 51 datasets across 31 languages, respectively (see Evaluation Details section below).

Recommendations. Existing rerank-lite-1 and rerank-1 users can upgrade to rerank-2-lite and rerank-2, respectively, for better quality and twice the context length at the same cost. If you are currently using Voyage rerankers, you can simply specify "rerank-2" or "rerank-2-lite" as the model parameter in Voyage API calls.
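For example, here is a minimal sketch of such a call using the voyageai Python client, assuming the package is installed and the VOYAGE_API_KEY environment variable is set:

```python
import voyageai

# A minimal sketch: the client reads VOYAGE_API_KEY from the environment.
vo = voyageai.Client()

query = "When should I use a reranker?"
documents = [
    "Rerankers refine the order of first-stage search results.",
    "Embedding models encode queries and documents separately.",
    "BM25 is a classic lexical search algorithm.",
]

# Upgrading from a first-generation model is just a change of the
# model parameter, e.g. "rerank-2" or "rerank-2-lite".
reranking = vo.rerank(query, documents, model="rerank-2", top_k=2)
for result in reranking.results:
    print(result.relevance_score, result.document)
```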

Reranking Overview

Before diving into evaluation details and results, let’s briefly review some key concepts. In a two-stage retrieval system, a reranker serves as a refinement tool. The process begins with the first-stage search, which generates an initial set of results using vector search, lexical search, or both. A reranker (the second stage) then takes the initial results and assigns a relevance score to each document. The top documents by relevance score are then returned.

The two-stage retrieval system is designed to leverage the tradeoff between fast, inexpensive vector search and more expensive rerankers. Specifically, rerankers process each query-document pair jointly, while embedding-based approaches encode queries and documents separately. Joint processing lets rerankers capture nuanced and complex interactions between query and document, at the cost of more compute.
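To make the flow concrete, here is an illustrative sketch of a two-stage pipeline; the first_stage_search and rerank_score helpers are hypothetical stand-ins for whatever search backend and reranker you use:

```python
from typing import Callable, List, Tuple

def two_stage_retrieve(
    query: str,
    first_stage_search: Callable[[str, int], List[str]],  # e.g. vector and/or BM25 search
    rerank_score: Callable[[str, str], float],            # scores a (query, document) pair jointly
    n_candidates: int = 100,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: a cheap search over the whole corpus returns candidate documents.
    candidates = first_stage_search(query, n_candidates)
    # Stage 2: the reranker sees each query-document pair together, which is
    # more expensive per document but captures finer-grained interactions.
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```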

For a deeper dive into rerankers vs. embedding models, check out our previous post.

Evaluation Details

Datasets. We evaluate on 93 domain-specific retrieval datasets spanning nine domains: technical documentation, code, law, finance, web reviews, medical, long documents, conversations, and multilingual. Every dataset consists of a corpus used for retrieval and a corresponding set of queries. The corpus typically contains documents from a specific domain, such as StackExchange answers, court rulings, or technical manuals. The queries may be questions, summaries of lengthy documents, or documents themselves. The table below lists the datasets in the eight categories other than multilingual.

Category | Description | Datasets
TECH | Technical documentation | Cohere, 5G, OneSignal, PyTorch
CODE | Code snippets, docstrings | LeetCodePython, DS1000, codechef-cpp_5doc
LAW | Cases, court opinions, statutes, patents | LeCaRDv2, LegalQuAD, GerDaLIR, LegalSummarization, AILA casedocs, LegalBench Consumer Contracts QA, LegalBench Corporate Lobbying
FINANCE | SEC filings, finance QA | Trade-the-event, RAG benchmark (Apple-10K-2022), FinanceBench, TAT-QA, Indian Financial News, Finance Alpaca, FiQA Personal Finance, Stocks Event, Stock News Sentiment, ConvFinQA, FinQA, All news finance, News stocks, HC3 Finance
WEB | Reviews, forum posts, policy pages | Doordash, Health4CA, Movies Summary, Kijiji.ca
LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | NarrativeQA, QMSum, SummScreenFD
MEDICAL | Medical documents and QA | Mental Health Consulting, Covid QA, ChatDoctor, Medical Instruction
CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, MeetingBank-transcript

The multilingual domain encompasses 51 datasets across 31 languages, including French, German, Japanese, Spanish, Korean, Bengali, Portuguese, and Russian. The first five of these languages have multiple datasets each. The remaining languages, each represented by a single dataset, are grouped into an “OTHER” category in the multilingual radar charts shown in the results section.

Method and Metrics. We evaluate the retrieval quality of various rerankers on top of three first-stage search methods: (1) a hybrid search combining BM25 and GTE v1.5 large (gte-large-en-v1.5), (2) OpenAI v3 large, and (3) voyage-multilingual-2. Given a query, the first-stage method retrieves up to 100 candidate documents, which the reranker then rescores to return the top 10. We report normalized discounted cumulative gain (NDCG@10), a standard retrieval-quality metric that rewards placing relevant documents near the top of the ranking.
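For concreteness, here is a small sketch of how NDCG@10 can be computed from relevance labels; the example uses binary labels, though graded labels plug into the same formula:

```python
import math
from typing import List

def dcg(relevances: List[float], k: int = 10) -> float:
    # Discounted cumulative gain: each relevance label is discounted
    # by log2 of its (1-indexed) rank position plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: List[float], k: int = 10) -> float:
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: binary relevance labels of the top 10 reranked documents, in rank order.
print(ndcg_at_k([1, 0, 1, 1, 0, 0, 0, 1, 0, 0]))  # ≈ 0.88
```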

Results

Domain-specific results. The first radar chart in this post, along with the two below, illustrates NDCG@10 across different domains of data. rerank-2 and rerank-2-lite consistently come out as the top rerankers across all domains and first-stage search methods. In particular:

  • Averaged across the three first-stage search methods, rerank-2 outperforms rerank-1, Cohere v3, and BGE v2-m3 by an average of 2.84%, 6.33%, and 14.75%, respectively.
  • Likewise, rerank-2-lite outperforms Cohere v3 and BGE v2-m3 by an average of 4.49% and 12.91%, respectively.
  • Both rerank-2 and rerank-2-lite improve atop all first-stage search results. By contrast, using BGE v2-m3 as a reranker performs 1.72% worse than OpenAI v3 large first-stage search alone and 8.50% worse than voyage-multilingual-2 first-stage search alone.

Multilingual results. The radar charts below illustrate NDCG@10 across different languages. Both rerank-2 and rerank-2-lite increase performance across the board for all languages and first-stage retrieval methods. Specifically:

  • Averaged across the three first-stage search methods, rerank-2 outperforms rerank-1, Cohere multilingual v3, and BGE v2-m3 by 1.62%, 8.83%, and 4.86%, respectively.
  • Likewise, rerank-2-lite outperforms Cohere multilingual v3 and BGE v2-m3 by 6.24% and 2.26%, respectively.
  • rerank-2 is the only reranker that always improves atop first-stage search. By contrast, using BGE v2-m3 as a reranker performs 0.48% worse than OpenAI v3 large first-stage search alone and 2.83% worse than voyage-multilingual-2 first-stage search alone, while Cohere multilingual v3 performs 4.86% and 6.92% worse, respectively.
  • Notably, rerank-2 is the only reranker that improves atop voyage-multilingual-2 on multilingual search.

Numeric results for all evaluations are available in this spreadsheet.

Try the Voyage 2 series of rerankers!

Both rerank-2 and rerank-2-lite are available today with flexible, token-based pricing — head over to our docs to learn more. As shown in the results section, voyage-multilingual-2 (and other Voyage models) works particularly well with rerank-2 and rerank-2-lite, achieving top results across nearly all datasets.

If you’re also interested in fine-tuned rerankers or embedding models, we’d love to hear from you — shoot us an email at [email protected]. Also, feel free to follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.
