How do we evaluate vector-based code retrieval?

Read time: 10 minutes

Nearly all modern coding assistants and agents leverage some form of code retrieval — the task of retrieving relevant code snippets, docstrings, documentation, and other artifacts from large, complex repositories. Today's code retrieval systems are powered by embedding models, which vectorize both queries and code into dense representations that are indexed to enable vector-based search.

Despite the widespread use of vector-based code retrieval, evaluating the retrieval quality of embedding models for code retrieval is a common pain point. The community lacks high-quality benchmarking datasets that feature diverse reasoning-intensive queries and repositories, as well as methodologies for creating such datasets.

Voyage AI has collected feedback and insights on this issue from the community (roughly 10 code generation companies that are Voyage's partners or customers) and conducted research on this topic. In this post, we discuss the most typical subtasks for code retrieval, survey the existing public datasets, and explore strategies for creating new evaluation benchmarks. We also discuss Voyage AI's internal benchmarking suite.

One problem, many subtasks

Code retrieval is not a single task — there are many ways to apply it in real code assistants or agents. This post focuses on three common subtasks, each defined by the type of query it handles. Real-world applications often mix these query types, but for evaluation it is more sensible to consider modularized subtasks.

  • Text-to-code retrieves code snippets using natural language queries. For instance, a query like “write a function which pools token embeddings” might retrieve PyTorch implementations of nn.AvgPool1d or nn.AdaptiveAvgPool1d. This is crucial for code generation tasks, ensuring outputs are grounded in the existing codebase or up-to-date libraries.
  • Code-to-code identifies semantically similar code snippets, even across different programming languages or libraries. For example, auto-generating or auto-completing a TypeScript function that performs average pooling would benefit greatly from retrieving a reference implementation in Python, C++, or other languages.
  • Docstring-to-code retrieves code snippets using function input and output specifications. For example, the function signature of an average pooling operator could retrieve nn.AvgPool1d and other similar implementations. This subtask strengthens code completion systems, as it allows relevant code snippets to be retrieved as the user is typing out function arguments.
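To make the vector-based setup concrete, below is a minimal sketch of text-to-code retrieval over a toy corpus. The `embed` function is a stand-in (a character-trigram hashing trick) for whichever embedding model you want to evaluate; in practice you would swap in calls to the real model.

```python
import numpy as np

def embed(texts):
    """Stand-in for a real embedding model: hashes character trigrams into a
    fixed-size dense vector. Replace with calls to the embedding model under test."""
    dim = 512
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for j in range(len(text) - 2):
            vecs[i, hash(text[j:j + 3]) % dim] += 1.0
    return vecs

# Toy corpus of code snippets (the "documents").
corpus = [
    "def avg_pool(x, k): return x.unfold(-1, k, k).mean(-1)",
    "def max_pool(x, k): return x.unfold(-1, k, k).max(-1).values",
    "def softmax(x): return x.exp() / x.exp().sum(-1, keepdim=True)",
]
query = "write a function which pools token embeddings"

# Embed the query and the documents, then rank documents by cosine similarity.
doc_vecs = embed(corpus)
query_vec = embed([query])[0]
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(rank, round(float(scores[idx]), 3), corpus[idx])
```

The same scoring applies to code-to-code and docstring-to-code retrieval; only the query text changes.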

Existing code retrieval benchmarks are limited

To the best of our knowledge, the existing public code retrieval benchmarks are CodeSearchNet, CoSQA, CodeXGLUE, and CoIR, which also overlap with one another to a certain degree. Below, we analyze their limitations in terms of quantity, quality, depth, and data contamination.

Background / Nomenclature. A typical evaluation dataset for retrieval quality consists of a collection of “queries” and a collection of “documents” from which relevant documents are supposed to be retrieved. Moreover, for every query, the dataset provides a small number of gold documents, or sometimes even a single one, as the label of the query; these are supposed to be the ground-truth documents most relevant to the query. In some cases, the dataset also provides a ranking of the gold documents or their relevance scores, while in others, all gold documents are treated as equally relevant. In the context of code retrieval, the “queries” can be natural language questions or requests, function signatures, or code snippets, as discussed in the previous section. The “documents” primarily consist of code, including code snippets, code files, function definitions, or docstrings.

The basic evaluation metric, recall@k, assumes k documents are retrieved for each query and measures the fraction of gold documents among those retrieved. The most widely adopted metric, normalized discounted cumulative gain (NDCG), is a variant of recall that also accounts for how well the ranking of the retrieved documents matches the relevance ordering of the gold documents.
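As a concrete reference, here is a small sketch of both metrics for a single query under binary relevance (i.e., all gold documents treated as equally relevant); real harnesses average these values over all queries in the dataset.

```python
import math

def recall_at_k(retrieved, gold, k):
    """Fraction of gold documents that appear among the top-k retrieved documents."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def ndcg_at_k(retrieved, gold, k):
    """Binary-relevance NDCG@k: gold documents retrieved at lower ranks contribute
    less, and the sum is normalized by the ideal (best possible) ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # ranks are 0-based, hence +2
        for rank, doc in enumerate(retrieved[:k])
        if doc in gold
    )
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(gold), k)))
    return dcg / ideal_dcg

retrieved = ["doc_7", "doc_3", "doc_9"]  # ranked list returned by the retriever
gold = ["doc_3"]                         # single gold document for this query
print(recall_at_k(retrieved, gold, 10))  # 1.0  (the gold doc is in the top 10)
print(ndcg_at_k(retrieved, gold, 10))    # ~0.63 (penalized for ranking it 2nd)
```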

Noisy Labels. CoSQA is heavily used to evaluate code retrieval quality, e.g., by CodeSage and Jina-v2-code. However, as noted by Gong et al., an estimated 51% of labels in CoSQA are incorrect, i.e., the supposedly ground-truth code snippets are mismatched with their queries for 51% of the dataset; this percentage is likely higher if subtle inconsistencies are included. This issue stems from a mismatch between the sources of the queries and the code: the queries are derived from Bing search, while the code snippets come from CodeSearchNet. As a result, the corpus is too limited in scope to reliably cover the queries. Below are two examples where the gold code snippets are either irrelevant or incorrect; we include 10 other examples from the same page in our supplementary material.

Query: exit python running in cmd
Code:

    def call_and_exit(self, cmd, shell=True):
        """Run the cmd and exit with the proper exit code."""
        sys.exit(subprocess.call(cmd, shell=shell))

Query: python how long does it take to check if two sets are equal
Code:

    def sets_are_rooted_compat(one_set, other):
        """treats the 2 sets are sets of taxon IDs on the same (unstated) universe of taxon ids. Returns True clades implied by each are compatible and False otherwise """
        if one_set.issubset(other) or other.issubset(one_set):
            return True
        return not intersection_not_empty(one_set, other)

The noisy labels limit the usefulness of the dataset because even excellent models cannot achieve strong performance on it. Indeed, as shown in the figure below and in the CoSQA entry of the CoIR benchmark, nearly all models perform similarly. Models that score higher than the fraction of correct labels have likely overfit to the test set due to data leakage/contamination (see more discussion in the next subsection).

Lack of deep algorithmic reasoning. Understanding data structures and algorithms is crucial for code retrieval because it allows the system to identify, match, and rank relevant implementations (e.g., binary search, graph traversal) based on their functionality and structure. Many of the existing academic datasets do not properly test this capability.

For example, CodeSearchNet is another popular suite of code retrieval datasets. As shown in the following three rows from a randomly selected “page” of the Python corpus of CodeSearchNet, the queries are taken verbatim from the code’s docstrings, so the task becomes overly simplistic. This artificially inflates retrieval accuracy on code benchmarks (e.g., the CoIR benchmark) that include CodeSearchNet datasets.

Query: Moves the given forum toward the requested direction.
Code:

    def moveforum_view(self, request, forum_id, direction):
        """
        Moves the given forum toward the requested direction.
        """
        forum = get_object_or_404(Forum, pk=forum_id)

Query: Allows to select how to edit forum permissions. The view displays a form to select a user or a group in order to edit its permissions for the considered forum.
Code:

    def editpermissions_index_view(self, request, forum_id=None):
        """
        Allows to select how to edit forum permissions. The view displays a form to select a user or a group in order to edit its permissions for the considered forum.
        """
        forum = get_object_or_404(Forum, pk=forum_id) if forum_id \
            else None

Query: Allows to edit user permissions for the considered forum. The view displays a form to define which permissions are granted for the given user for the considered forum.
Code:

    def editpermissions_user_view(self, request, user_id, forum_id=None):
        """Allows to edit user permissions for the considered forum. The view displays a form to define which permissions are granted for the given user for the considered forum."""
        user_model = get_user_model()

Data contamination and overfitting. Most existing evaluation datasets are publicly available. Intentionally training on the test set is an obvious case of cheating, and it occurs much more frequently than most people realize. Moreover, for historical reasons, many datasets include corresponding training and validation sets drawn from the same distribution as the test set. Training embedding models on these training or validation sets risks overfitting to their specific distribution. Such overfitting can undermine the model’s ability to generalize to other datasets and fails to reflect real-world retrieval quality.
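One practical mitigation, sketched below under the assumption that both corpora fit in memory, is to check a candidate training corpus for near-verbatim overlap with an evaluation set before training; fuzzier checks (n-gram or MinHash overlap) are natural extensions of the same idea.

```python
import re

def normalize(code):
    """Crude normalization: lowercase and collapse whitespace so trivial
    formatting differences do not hide duplicates."""
    return re.sub(r"\s+", " ", code.strip().lower())

def contamination_report(train_docs, eval_docs):
    """Count how many evaluation documents appear (after normalization)
    verbatim in the training corpus."""
    train_set = {normalize(d) for d in train_docs}
    hits = [d for d in eval_docs if normalize(d) in train_set]
    return len(hits), len(eval_docs)

train_docs = ["def add(a, b):\n    return a + b"]
eval_docs = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]
n_hits, n_eval = contamination_report(train_docs, eval_docs)
print(f"{n_hits}/{n_eval} evaluation documents overlap with the training corpus")
```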

Overfitting of this kind is often observed when a model’s quality is particularly good on a few datasets but not on others that are new, held out, or private. For example, Jina-v2-code scores a whopping 0.41 NDCG@10 on the low-quality CoSQA dataset — significantly higher than other models such as OpenAI-v3-large, voyage-code-3, BGE, and E5 models (all of which score between 0.28 and 0.33). As shown in the figure below, this strong showing on CoSQA does not extend to any of the other tested datasets, including a novel retrieval dataset mined from SWE-bench (details in the following section).

Training on evaluation sets is a serious problem in general, and the same misalignment between benchmark scores and real-world quality applies to other popular general-purpose retrieval benchmarks such as the MTEB leaderboard.

Summary of existing code retrieval datasets. In the table below, we summarize existing academic retrieval datasets, along with a short description, their use in existing benchmarks, and any risks associated with their use. Based on all of this, we also give a recommendation on whether we believe each should be used in a benchmark targeting real-world code retrieval.

Creating new code retrieval datasets

How do we create more high-quality, reasoning-intensive code retrieval datasets? Below we discuss two strategies for creating them from existing data. Another evaluation approach is to use LLMs as judges, which is more subtle and will be discussed in future blog posts.

Repurposing question-answer datasets for retrieval tasks. Any QA dataset, which consists of many question-answer pairs, can be repurposed into a retrieval dataset: the collection of documents (code snippets) is the collection of all answers in the original QA dataset, and the collection of queries is the collection of questions. The goal of the retrieval task is then to retrieve the correct answer to each question from the pool of answers.

Sometimes, each question-answer pair is also associated with a context. In those cases, one can choose to use the collection of contexts as the collection of documents to retrieve from.
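As a minimal sketch (assuming a generic list of question-answer pairs rather than any particular dataset’s schema), the repurposing amounts to little more than relabeling:

```python
def qa_to_retrieval(qa_pairs, contexts=None):
    """Repurpose a QA dataset as a retrieval dataset: answers (or their contexts,
    if provided) become the document corpus, questions become the queries, and
    each query's gold label is the id of its own answer/context."""
    documents, queries, gold = {}, {}, {}
    for i, (question, answer) in enumerate(qa_pairs):
        doc_id, query_id = f"doc_{i}", f"q_{i}"
        documents[doc_id] = contexts[i] if contexts is not None else answer
        queries[query_id] = question
        gold[query_id] = [doc_id]
    return documents, queries, gold

qa_pairs = [
    ("How do I reverse a list in Python?", "Use my_list[::-1] to get a reversed copy."),
    ("How do I read a file line by line?", "Open the file and iterate over its lines."),
]
documents, queries, gold = qa_to_retrieval(qa_pairs)
```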

By design, QA datasets minimize false-positive labels since there is little room for subjectivity; answers to questions must be correct by definition. It’s also highly unlikely that two questions match the same answer. This ensures that the resulting retrieval datasets have minimal false labels, resolving noisy label issues.

One potential concern is that the answers are too different from each other, making the retrieval of the correct answer too trivial. However, empirical performance on these datasets varies significantly for different models, suggesting that these datasets are still very challenging and informative.

We present a more formal description of repurposing QA datasets for retrieval in the supplementary material.

Leveraging code repositories and issues/tickets. Code repositories from coding communities such as GitHub are a great corpus for retrieval datasets, as they are highly applicable to real-world use cases. The trick is then finding suitable queries. Fortunately, because GitHub hosts discussions, issues, and pull requests, we can use issues as the queries and the files edited to successfully resolve those issues as the ground-truth relevant documents. We use this approach on data from SWE-bench, a dataset that tests an LLM’s ability to automatically solve GitHub issues.
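A rough sketch of this construction is below. It assumes SWE-bench-style records with an issue description in a `problem_statement` field and a resolving unified diff in a `patch` field (field names following the public SWE-bench release); the patch parsing here is deliberately simplistic, and the document corpus itself would be the repository’s files at the relevant commit.

```python
import re

def swebench_to_retrieval(instances):
    """Build query/gold-label pairs from SWE-bench-style records: the issue text
    is the query, and the files edited by the resolving patch are the gold
    documents (identified by repo-relative path)."""
    queries, gold = {}, {}
    for inst in instances:
        qid = inst["instance_id"]
        queries[qid] = inst["problem_statement"]
        # Edited files appear in 'diff --git a/<path> b/<path>' headers.
        edited = re.findall(r"^diff --git a/(\S+) b/\S+", inst["patch"], re.MULTILINE)
        gold[qid] = sorted(set(edited))
    return queries, gold

instance = {
    "instance_id": "example__repo-123",
    "problem_statement": "Average pooling crashes when the input tensor is empty.",
    "patch": "diff --git a/src/pool.py b/src/pool.py\n--- a/src/pool.py\n+++ b/src/pool.py\n",
}
queries, gold = swebench_to_retrieval([instance])
print(gold)  # {'example__repo-123': ['src/pool.py']}
```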

We present a more formal description of building retrieval datasets from GitHub repositories in the supplementary material.

Voyage’s code evaluation datasets

Here at Voyage, we leverage a combination of public and proprietary datasets to evaluate code retrieval performance. Many of these are created using the methods above; the table below discusses the dataset details at length.

In-house datasets extend an already strong evaluation suite with nine programming languages, eight repurposed QA datasets, and domain-specific benchmarks for SQL mapping, function retrieval, and code execution. These datasets are carefully curated to avoid contamination, ensuring that no models, including Voyage’s, are exposed to them during training.

We evaluate each of the datasets over a suite of eight code embedding models. Three of them — voyage-3, voyage-3-lite, and OpenAI-v3-large — are generalist models, while the other five — voyage-code-3, voyage-code-2, CodeSage large, CodeRankEmbed, and Jina-v2-code — are code retrieval models. The results are shown below; the result for CodeRankEmbed on SWE-Bench Lite-rtl (8.99) falls well below the chart’s lower bound, so we do not show a bar for it.
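For a sense of how such per-dataset numbers can be produced, a minimal harness might look like the following; `embed_fn` stands for whichever model is being evaluated (returning unit-normalized vectors), and the dataset dictionaries follow the query/document/gold structure sketched earlier. This is an illustrative sketch, not the exact pipeline behind the figures.

```python
import numpy as np

def mean_ndcg_at_10(embed_fn, queries, documents, gold):
    """Mean NDCG@10 over a retrieval dataset.

    queries:   {query_id: query text}
    documents: {doc_id: document text}
    gold:      {query_id: [gold doc_ids]}
    embed_fn:  maps list[str] -> np.ndarray of unit-normalized vectors
    """
    doc_ids = list(documents)
    doc_vecs = embed_fn([documents[d] for d in doc_ids])
    per_query = []
    for qid, qtext in queries.items():
        qvec = embed_fn([qtext])[0]
        top10 = np.argsort(-(doc_vecs @ qvec))[:10]
        retrieved = [doc_ids[i] for i in top10]
        dcg = sum(1.0 / np.log2(r + 2) for r, d in enumerate(retrieved) if d in gold[qid])
        idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(gold[qid]), 10)))
        per_query.append(dcg / idcg)
    return float(np.mean(per_query))
```

This can be driven by the embedding stand-in and dataset-construction sketches shown earlier in the post.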

voyage-code-3, the best-performing model, is available today; read more here.

Future directions in code retrieval evaluation

Many in-house datasets can be shared under an MNDA to support collaborative research and evaluation. To further support the community, portions of these datasets will be publicly released to establish new benchmarks that address common challenges, such as noisy labels and the lack of reasoning-intensive tasks. We have also been using large language models (LLMs) as judges of retrieval performance; blog posts on that will come soon — follow us on Twitter and LinkedIn for updates.
