Excited to Announce Voyage Embeddings!

TL;DR – Voyage is a team of leading AI researchers dedicated to enabling teams to build better RAG applications. Today, we’re releasing a new state-of-the-art embedding model and API that already beats publicly available models, such as OpenAI’s text embeddings, with more to come soon. If you’re excited about custom or fine-tuned embeddings with further enhanced retrieval accuracy, please reach out to [email protected] for early access.

This post assumes familiarity with basic concepts behind the popular retrieval-augmented generation (RAG) stack, which is designed for domain-specific chatbots. RAG retrieves validated sources using embeddings and then feeds them into a large language model (LLM) to generate accurate responses. Please check out this article for further background.

Embeddings drive RAG quality

Have you heard other developers discuss the importance of embedding models for RAG? And, often, that OpenAI’s embeddings endpoint just isn’t good enough? Let’s break down why!

Suppose you’re building a chatbot using RAG for a specific use case, say, answering questions about LangChain’s documentation. Given a technical query, like, “Does Html2TextTransformer omit URLs?”, the chatbot’s effectiveness will rest on the accuracy and relevance of the docs it pulls up. If it retrieves the exact page that contains the details of Html2TextTransformer, then feeding this page to a downstream LLM will yield an accurate response. If it fetches pages that merely reference Html2TextTransformer but also contain unrelated information, then the LLM might hallucinate.

So, what determines retrieval accuracy? Embeddings. Serving as the representations or “indices” of both docs and queries, they are responsible for ensuring the retrieved docs contain the information pertinent to the query. 
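To make the retrieval step concrete, here is a minimal sketch of embedding-based nearest-neighbor search. The vectors below are toy placeholders, not outputs of Voyage’s (or any) embedding model; in a real RAG stack, both docs and queries would be embedded by the model before ranking by cosine similarity:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k docs whose embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of each doc to the query
    return np.argsort(-sims)[:k]      # highest-similarity docs first

# Toy embeddings standing in for real model outputs.
docs = np.array([
    [0.9, 0.1, 0.0],   # page detailing Html2TextTransformer
    [0.2, 0.9, 0.1],   # page that merely mentions it
    [0.0, 0.1, 0.9],   # unrelated page
])
query = np.array([1.0, 0.2, 0.0])    # "Does Html2TextTransformer omit URLs?"

print(cosine_top_k(query, docs, k=1))  # index of the most relevant page
```

The better the embedding model, the more reliably the truly relevant page lands at the top of this ranking, which is exactly what determines whether the downstream LLM sees the right context.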

And this example is real! Voyage’s embeddings are deployed in the official LangChain chatbot to improve retrieval quality. Please check out this post for an in-depth analysis of how embeddings impact RAG’s quality.

Voyage trains best-in-class embeddings models

Embedding models, much like generative models, rely on powerful neural network (and often transformer-based) architectures to capture and compress semantic context. And, much like generative models, they’re incredibly hard to train. Getting great quality requires experimentation on many fronts, from model architecture and data collection to selecting suitable loss functions and optimizers.

Despite remarkable recent advancements in generative AI, we think embedding models are comparatively underloved and underexplored. We’re building Voyage to fix that! Our team has conducted 5+ years of cutting-edge research at the Stanford AI Lab and the MIT NLP group on training embedding models, including the collection of a novel, massive dataset, experimentation with pre- and post-processing, and development of proprietary methods for applying contrastive learning to texts.
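Voyage’s proprietary training methods aren’t public, but the general shape of contrastive learning for text embeddings can be illustrated with the standard InfoNCE objective, a common choice in the literature (this is a generic sketch, not Voyage’s actual loss):

```python
import numpy as np

def info_nce_loss(query_embs, pos_embs, temperature=0.05):
    """Generic InfoNCE contrastive loss: each query should score its own
    positive document higher than every other document in the batch."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature          # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (the matching pairs) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Training pushes each query’s embedding toward its paired document and away from the other documents in the batch, which is what makes the learned space useful for retrieval.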

The result is a SOTA model – higher retrieval accuracy than any other publicly available model, long context windows, and efficient inference for lower latency and affordable pricing. The general embedding model voyage-01 outperforms OpenAI’s latest text embedding model on the commonly used Massive Text Embedding Benchmark (MTEB) by more than 5 points! (See the figure on the left below.) Our models will also rapidly improve in the coming months.

Unfortunately, MTEB is a bit overused these days, because base embeddings are sometimes trained on its datasets. For more comprehensive evaluations, we also built nine additional datasets, called real-world industry domains (RWID), ranging from technical documentation to restaurant reviews and news. None of these datasets is seen during our training. We found that our base model outperformed both OpenAI’s embeddings and all other popular open-source models. Refer to the figure above (on the right) for average retrieval quality, and to the figure below for full results.
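Retrieval quality on benchmarks like these is typically reported with metrics such as recall@k: the fraction of queries for which the relevant document appears among the top-k retrieved results. A minimal sketch (the doc IDs and ranked lists below are illustrative, not Voyage’s actual evaluation data):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose gold doc appears in the top-k results.

    retrieved: list of ranked doc-id lists, one per query
    relevant:  list of the gold doc id for each query
    """
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold in ranked[:k])
    return hits / len(relevant)

# Illustrative toy run: 3 queries, whose gold docs are 7, 2, and 5.
ranked_lists = [[7, 3, 1], [4, 9, 8], [5, 0, 2]]
print(recall_at_k(ranked_lists, [7, 2, 5], k=3))
```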

Interestingly, BAAI/bge is much weaker than Voyage and OpenAI on the RWID datasets but is a close second on MTEB, which suggests it may be overfitting to MTEB.

We know that seeing is believing. To help teams try our model for themselves, we’ll embed the first 5000 documents/queries for each organization for free. We’re confident our models can massively improve performance for builders everywhere.

Domain-specific or company-specific embeddings

Real-world scenarios are always more challenging than academic benchmarks because each industry has its unique terminology and knowledge base, just as every enterprise does. voyage-01 works well right out of the box, but it can get even better – higher quality and reduced costs – with data and fine-tuning.

Voyage currently offers embedding models tailored for coding and finance, with more domains on the horizon. We can also fine-tune embeddings on small, unlabeled, company-specific datasets, achieving a consistent 10–20% accuracy boost for pilot customers such as LangChain, OneSignal, Druva, and Galpha.

Bon Voyage!

Our mission at Voyage is to help every developer team build and improve the retrieval systems that power their intelligent applications. We’re still in the early stages of what RAG and foundation models can do, and we can’t wait to see what people build with our help. Access to voyage-01 is available today!

Interested in early access to fine-tuned embeddings? Email [email protected]. Follow us on Twitter and/or LinkedIn for more updates!
