TL;DR – We are excited to introduce rerank-1, which pushes the performance frontier beyond our previously released rerank-lite-1. rerank-1 consistently outperforms alternatives such as bge-reranker-v2-m3 and Cohere’s rerank-english-v3 and rerank-multilingual-v3 in an expanded evaluation across 37 domain-specific datasets and 50 multilingual datasets covering languages including French, German, Japanese, Korean, and Spanish. Furthermore, rerank-1 boasts an 8k context length, double that of rerank-lite-1 and Cohere’s rerank-english-v3.
Rerankers can further boost the relevance of existing search systems, whether vector-based or lexical (see our previous blog for more details). A little less than two months ago, we introduced our first reranker, rerank-lite-1, which demonstrated cutting-edge retrieval accuracy, consistently outperforming alternatives such as bge-reranker-large and Cohere’s rerank-english-v2.0. As its name suggests, rerank-lite-1 was designed as a “lite” version, optimized for both quality and latency.
Now, we are thrilled to introduce rerank-1, which delivers excellent English and multilingual performance, outperforming both Cohere’s rerank-english-v3 and rerank-multilingual-v3 on English and multiple other languages, including French, German, Japanese, Spanish, and Korean. It also boasts an 8k context length, twice that of rerank-lite-1 and Cohere’s flagship rerank-english-v3 reranker.
Quantitative Evaluation
The quantitative evaluation expands on that of rerank-lite-1. We augment the set of domain-specific datasets used in rerank-lite-1’s evaluation to cover broader topics, and we add datasets in 27 languages beyond English to test rerank-1’s multilingual capabilities.
Datasets. We first evaluate on 37 domain-specific retrieval datasets, spanning a variety of topics and corpora, including technical documentation, code, law, finance, web reviews, long documents, medicine, and conversations. Each dataset consists of a corpus to be retrieved from and a set of queries. The corpus typically comprises documents in a particular domain, such as answers from StackExchange, court opinions, or technical documentation, and the queries can be questions, summaries of long documents, or even documents themselves.
The following table organizes the evaluation datasets into eight categories, facilitating an easier interpretation of the results.
| Category | Description | Datasets |
|---|---|---|
| TECH | Technical documentation | OneSignal, PyTorch, Verizon 5G, Cohere |
| CODE | Code snippets, docstrings | LeetCode-python, DS1000, codechef-cpp_5doc |
| LAW | Cases, court opinions, statutes, patents | LegalBenchConsumerContractsQA, Law_Stackexchange, LegalQuAD, GerDaLIR |
| FINANCE | SEC filings, finance QA | FinanceBench, ConvFinQA, FiQA Personal Finance, Trade-the-event, RAG benchmark (Apple-10K-2022), TAT-QA, Indian Financial News, Stocks Event, Stock News Sentiments, FinQA, All news finance, News stocks |
| WEB | Reviews, forum posts, policy pages | Doordash, Health4CA, Movie Summary, Kijiji.ca |
| LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | QMSUM, GovReport, Qasper abstract_doc |
| MEDICAL | Medical documents and QA | Mental Health Consulting, Covid QA, ChatDoctor, Medical Instruction |
| CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, MeetingBank-transcript |
Additionally, we evaluate rerank-1’s multilingual capability on 50 datasets covering 27 languages, including French, German, Japanese, Spanish, Korean, Bengali, Portuguese, and Russian. Each of the first five languages has multiple datasets; the remaining languages have one dataset each and are grouped into an OTHER category.
Method and Metrics. We evaluate the retrieval quality of various rerankers on top of several first-stage search methods, including lexical search (BM25) and embedding models (e.g., OpenAI v3, voyage-large-2). Given a query, the first-stage search retrieves 100 candidate documents. The reranker then scores each candidate against the query, and we keep the 10 documents with the highest relevance scores. We report the normalized discounted cumulative gain (NDCG@10), a standard metric for retrieval quality that rewards rankings placing relevant documents near the top.
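To make the protocol concrete, here is a minimal sketch of the evaluation loop in Python. The helper names (`first_stage_search`, `rerank_fn`) and the qrels format (a dict of graded relevance labels per query) are illustrative assumptions, not our actual evaluation code; only the 100-candidate, top-10, NDCG@10 setup mirrors the description above.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k: DCG of the top-k ranking, normalized by the ideal DCG."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def evaluate(queries, qrels, first_stage_search, rerank_fn, k=10):
    """Two-stage retrieval: 100 first-stage candidates, reranked, scored at k."""
    scores = []
    for query in queries:
        candidates = first_stage_search(query, limit=100)  # BM25 or embeddings
        reranked = rerank_fn(query, candidates)            # ordered by relevance score
        scores.append(ndcg_at_k(reranked, qrels[query], k=k))
    return sum(scores) / len(scores)                       # mean NDCG@k over queries
```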
Results. The radar charts illustrate NDCG@10 for different combinations of first-stage search methods and rerankers. Voyage rerank-1 emerges as the consistently superior reranker across all domains and first-stage search methods, outperforming bge-reranker-v2-m3 as well as Cohere’s rerank-english-v3 and rerank-multilingual-v3. Moreover, rerank-1 improves performance over the first-stage search alone in every case; the same cannot be said for the other rerankers. rerank-1 even improves retrieval quality in the CODE category on top of voyage-large-2 embeddings, which are known to excel on code data. On the multilingual side, rerank-1 outperforms existing rerankers on most of the languages we tested and consistently improves upon the first-stage search methods.
Detailed numeric results for all evaluations are available in this spreadsheet.
Try Voyage rerank-1!
Whether you are already using rerankers or not, boost your overall retrieval quality today with rerank-1. It’s easy to upgrade or add rerankers to an existing semantic search or RAG stack—regardless of the first-stage search method. Finally, as you would expect, Voyage embedding models and rerankers work especially well together.
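As an illustration, here is a minimal sketch of dropping rerank-1 into an existing pipeline with the voyageai Python client; the hardcoded candidates stand in for your first-stage search results, and the snippet assumes a VOYAGE_API_KEY in the environment (check the current docs for the exact interface).

```python
import voyageai

vo = voyageai.Client()  # picks up VOYAGE_API_KEY from the environment

query = "How do I rotate an API key?"

# In practice, these would be the top-100 candidates from your
# first-stage search (BM25 or an embedding model).
candidates = [
    "To rotate an API key, generate a new key in the dashboard, "
    "update your services, then revoke the old key.",
    "Our SDK supports Python 3.8 and above.",
    "Billing is calculated per 1M tokens processed.",
]

# Rerank the candidates with rerank-1 and keep the best matches.
reranking = vo.rerank(query=query, documents=candidates, model="rerank-1", top_k=2)
for result in reranking.results:
    print(f"{result.relevance_score:.3f}  {result.document}")
```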
If you’re interested in early access to our upcoming domain-specific or fine-tuned embedding models, we’d love to hear from you; please email [email protected]. Follow us on X (Twitter) and/or LinkedIn for more updates!