A critical examination of the risks and challenges posed by private evaluators (for example, ScaleAI) in the LLM landscape, highlighting financial incentives, conflicts of interest, and the prevalence of evaluation bias even when evaluators act in good faith.
The rapid advancement in building large language models (LLMs) has intensified competition among big-tech companies and AI startups. In this environment, model evaluations are critical for product and investment decisions. While open evaluation sets like MMLU initially drove progress, concerns around data contamination and data bias have repeatedly called their reliability into question. These concerns have fueled the rise of private data curators, who conduct hidden evaluations with high-quality, self-curated test prompts and their own expert annotators. In this blog post, we argue that despite potential advantages in addressing contamination, private evaluations introduce inadvertent financial and evaluation risks. In particular, a key concern is the potential conflict of interest arising from private data curators' business relationships with their clients (leading LLM firms). In addition, we highlight that the subjective preferences of private expert annotators lead to an inherent evaluation bias towards models trained on the private curators' data. Overall, this blog post lays the foundation for studying the risks of private evaluations, which we hope will spark wide-ranging community discussions and policy changes.
In recent times, there has been rapid progress in training large language models (LLMs) to solve diverse and complex real-world tasks (e.g., instruction following, agentic workflows, reasoning).
Traditionally, open evaluation datasets have been used to benchmark various models against each other (e.g., MMLU, MATH). Such datasets provide full transparency into the test data, model predictions, and scoring method. As a result, the community can reproduce, assess, and improve the quality of these evaluation datasets.
Data contamination: Most popular open evaluation datasets are constructed from sources on the internet; for instance, MATH is built from US-based math contests available on the web. Since LLM pre-training corpora are scraped from the same web, test examples can leak into training data and inflate benchmark scores.
Data bias: Training data curation can target the format and knowledge of the open evaluation datasets. As a result, some models might appear to perform better than they actually do, simply because the evaluation set aligns with their training data.
These issues sparked a debate within the AI community about the reliability of open evaluation sets, paving the way for the emergence of private evaluators.
We consider private evaluators to be organizations that assess LLMs while hiding some or all components of the evaluation pipeline. In this regard, we consider two categories: public-driven leaderboards (e.g., LMSYS) and privately-curated leaderboards (e.g., ScaleAI).
Chatbot Arena, maintained by the LMSYS organization, is the prominent example of a public-driven leaderboard: users submit prompts, vote between anonymized model responses, and the votes are aggregated into ELO-style rankings.
Despite its popularity, its design inherently suffers from several key risks:
Companies specializing in data curation and evaluation have begun establishing their own private leaderboards. For instance, ScaleAI, which has positioned itself as a leader in AI evaluation, recently introduced the SEAL leaderboard.
Key features include:
Controlled Evaluation Environment: Private data curators create and maintain proprietary evaluation datasets that are not publicly accessible. This closed approach reduces the risk of data contamination, as models are unlikely to have been trained on these unseen prompts.
High-Quality Prompt Collection: Expert annotators with diverse backgrounds contribute to creating unique and challenging prompts that cover a broad spectrum of categories. This meticulous curation ensures that the evaluation set is both diverse and representative of real-world tasks.
Reduced Susceptibility to Gaming: By keeping the evaluation data and methodologies confidential, private leaderboards make it more difficult for developers to tune their models specifically to the test set.
Despite the above advantages, private evaluation leaderboards run by data curators introduce a new set of risks with significant implications for the AI community, and these risks may jeopardize the leaderboards' reliability. They revolve around financial incentives, potential conflicts of interest, and various forms of evaluation bias that can skew the assessment of language models.
As private evaluators like ScaleAI gain prominence, the financial dynamics between them and model developers become a critical concern.
This scenario raises important questions about transparency and fairness. Should private evaluators disclose their financial relationships with model developers? In industries like finance, regulations require firms to implement "Chinese walls" to prevent conflicts of interest between different divisions within the same organization.
Private evaluators often employ expert annotators to assess model outputs, aiming to ensure high-quality and reliable evaluations. However, these annotators bring their own subjective preferences and biases, which can influence evaluation outcomes. If the annotators favor certain styles of responses, or have been involved in creating training data for specific models, their assessments might inadvertently favor those models. This creates an uneven playing field, where models that align with the annotators' preferences perform better in evaluations, regardless of their general applicability or performance across a broader user base. Even when annotators act in good faith, biases arising from the dual roles of such evaluators can occur in several ways:
Overlap in Evaluation and Training Prompts: ScaleAI may have a broad set of "tasks" or questions in its evaluations that are similar or identical to those used in a model's training data. As an example, platforms like LMSys's Chatbot Arena allow users to input prompts and rank model responses. If a model developer has access to these prompts, they can fine-tune their models to perform exceptionally well on them; recent LLM development efforts (like the Gemma models) have already started fine-tuning on LMSys chats to boost their ELO scores on the Chatbot Arena. Note that while LMSys is primarily an evaluator and not a data curator, this anecdote illustrates how side information about data curation can be used to game evaluations.
Impact: Models appear to perform better not because they have superior general capabilities, but because they have been specifically trained on the evaluation data. This creates an artificial performance boost and does not reflect the model’s real-world effectiveness.
Overlap in Human Annotators Between Training and Evaluation: The use of the same pool of annotators for both creating training data and evaluating models can introduce significant bias. Annotators develop certain preferences and expectations based on their experiences during data creation. If these annotators are also responsible for evaluations, they may subconsciously favor models that produce outputs aligning with their expectations.
Impact: Models trained using data from these annotators may perform better in evaluations simply because they cater to the annotators’ biases. This does not necessarily translate to better performance for end-users with diverse backgrounds and preferences, leading to skewed performance metrics that do not reflect real-world applicability.
Let us now simulate and quantify the impact of the bias induced by the overlap in human annotators between training and evaluation. To empirically demonstrate how evaluator biases can influence private evaluations, we conducted an experiment simulating two private evaluation companies, Company Alpha and Company Beta. Each company develops its own evaluation leaderboard but also has the dual responsibility of providing instruction fine-tuning data to its clients (LLM trainers). This setup mirrors real-world scenarios where Company Alpha and Company Beta might represent ScaleAI and one of its competitors.
In this set of experiments, we focus on the most benign scenario, where both Company Alpha and Company Beta act in good faith. They ensure that the curated data accessible to customers does not include privileged information related to the evaluation leaderboard, such as specific question templates, tasks, or answer styles. We examine the "mildest" form of bias: annotators from Company Alpha and Company Beta are each asked to provide a single output for a given input and to evaluate model responses. Let us now describe the experimental setup in detail.
The experiment uses GPT-4o and Claude-Sonnet-3.5 as simulators for company evaluators due to their comparable ELO ratings.
Evaluators: Evaluator Alpha (GPT-4o), representing Company Alpha, and Evaluator Beta (Claude-Sonnet-3.5), representing Company Beta.
Crucially, we selected GPT-4o and Sonnet-3.5 because they ranked at almost the same ELO on the LMSys leaderboard at the time of our experiment; this means the experts on both teams are nearly equally competent.
Models: Model A, fine-tuned on responses written by Company Alpha's annotators, and Model B, fine-tuned on responses written by Company Beta's annotators.
Experimental Protocol:
Collecting Annotator Answers: (i) We present 10K instructions from the Alpaca Human Dataset to both companies. (ii) Each company's annotator writes a single response to every instruction, yielding the Company Alpha answers (from GPT-4o) and the Company Beta answers (from Claude).
Fine-Tuning Models: Using the instruction-response data, (i) Model A is trained on the Company Alpha answers, and (ii) Model B is trained on the Company Beta answers.
Generating Outputs: Both models generate responses to 805 queries from the AlpacaEval dataset.
Evaluation: (i) Evaluator Alpha and Evaluator Beta independently assess the outputs of both models for each query, and (ii) we record their preference for each output pair. (A minimal code sketch of the annotation and judging steps follows this list.)
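For concreteness, here is a minimal sketch of the annotation and judging steps in this simulation. It assumes the official OpenAI and Anthropic Python SDKs; the model identifiers, judge prompt, and helper names are illustrative assumptions rather than our exact setup, and the supervised fine-tuning step is standard and omitted.

```python
# Sketch of steps (1) and (4): collecting annotator answers and pairwise judging.
# Model names, the judge prompt, and helper names are illustrative assumptions.
from openai import OpenAI
import anthropic

openai_client = OpenAI()               # Company Alpha's annotator/judge (GPT-4o)
claude_client = anthropic.Anthropic()  # Company Beta's annotator/judge (Claude)

def alpha_answer(instruction: str) -> str:
    """Company Alpha's annotator writes a single response to one instruction."""
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": instruction}],
    )
    return resp.choices[0].message.content

def beta_answer(instruction: str) -> str:
    """Company Beta's annotator writes a single response to one instruction."""
    msg = claude_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": instruction}],
    )
    return msg.content[0].text

JUDGE_PROMPT = (
    "You are comparing two responses to the same query.\n"
    "Query: {query}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Reply with a single letter, A or B, indicating the better response."
)

def alpha_judges(query: str, out_a: str, out_b: str) -> str:
    """Evaluator Alpha picks the preferred response; returns 'A' or 'B'."""
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, a=out_a, b=out_b),
        }],
    )
    return resp.choices[0].message.content.strip()[0].upper()
```

A symmetric `beta_judges` helper would issue the same prompt through the Anthropic API, and a careful implementation would also randomize the order in which the two responses are shown to control for position bias.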
Preference Rates by Evaluator:
| Evaluator | Preferred Model A | Preferred Model B |
|---|---|---|
| Evaluator Alpha (GPT-4o) | 407 (50.68%) | 396 (49.32%) |
| Evaluator Beta (Claude) | 314 (39.10%) | 489 (60.90%) |
Quantifying Self-Bias:
To calculate self-bias, we use the formula:
\[\text{Self Bias}_A = \frac{(\text{Judge A prefers } M_A) - (\text{Judge B prefers } M_A)}{\text{Average number of preferences for } M_A \text{ by Judges A and B}} \times 100\]

For Claude:

\[\text{Self Bias (Claude)} = \frac{489 - 396}{\frac{489 + 396}{2}} \times 100 \approx 21.02\%\]

For GPT-4o:

\[\text{Self Bias (GPT-4o)} = \frac{407 - 314}{\frac{407 + 314}{2}} \times 100 \approx 25.80\%\]

The results highlight a clear bias aligned with each evaluator's preferences. While both evaluators exhibit self-bias, the difference in magnitude suggests potential variations in their evaluation mechanisms or in their inherent tendencies to favor their own outputs. These findings are further corroborated by the findings of Panickssery et al.
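The calculation above can be reproduced with a few lines of Python using the counts from the preference table; the function below is simply a direct transcription of the formula.

```python
def self_bias(own_judge_count: int, other_judge_count: int) -> float:
    """Self-bias of a judge towards the model trained on its own answers.

    own_judge_count:   times the judge preferred its 'own' model
    other_judge_count: times the other judge preferred that same model
    """
    average = (own_judge_count + other_judge_count) / 2
    return (own_judge_count - other_judge_count) / average * 100

# Counts from the preference table above.
print(f"Self bias (Claude): {self_bias(489, 396):.2f}%")  # ~21.02%
print(f"Self bias (GPT-4o): {self_bias(407, 314):.2f}%")  # ~25.80%
```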
Let us finally dive into the metric that has captivated the LLM world: ELO rankings. How much does the observed self-bias influence the ELO rankings of the two models? To quantify the impact of these biases on model rankings, we simulated ELO ratings using the same method as employed by the LMSys leaderboard (a minimal sketch of this computation follows the table below).
ELO Ratings Based on Evaluator Preferences:
| Evaluator | Model A ELO (fine-tuned on data from Company Alpha) | Model B ELO (fine-tuned on data from Company Beta) |
|---|---|---|
| Evaluator Alpha (GPT-4o) | 1003 | 996 |
| Evaluator Beta (Claude) | 959 | 1040 |
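As a reference for how such ratings can be computed, below is a minimal sketch of an online ELO update over pairwise preferences, in the spirit of the method popularized by the LMSys leaderboard (which has since moved to a Bradley-Terry fit). The K-factor, initial rating, and battle ordering are assumptions, so the exact numbers will differ slightly from the table above.

```python
import random

def compute_elo(battles, k: float = 4.0, init: float = 1000.0, seed: int = 0):
    """Online ELO over a list of (winner, loser) battles.

    The update order matters, so we shuffle once with a fixed seed; a more
    faithful reproduction would bootstrap over many shuffles and average.
    """
    rng = random.Random(seed)
    battles = list(battles)
    rng.shuffle(battles)
    ratings: dict[str, float] = {}
    for winner, loser in battles:
        r_w = ratings.setdefault(winner, init)
        r_l = ratings.setdefault(loser, init)
        # Expected score of the winner under the logistic ELO model.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return ratings

# Example: Evaluator Beta's 803 judgments (314 wins for Model A, 489 for Model B).
beta_battles = [("model_a", "model_b")] * 314 + [("model_b", "model_a")] * 489
print(compute_elo(beta_battles))
```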
Depending on which evaluator judges the battles, the same model's ELO shifts by 44 points (Model A: 1003 vs. 959; Model B: 1040 vs. 996). On the LMSys leaderboard, even differences of 5-20 ELO points are enough for significant bragging rights, as the tweets below show.
[Embedded tweet from Oriol Vinyals (@OriolVinyalsML), November 21, 2024]

[Embedded tweet from Yilei Qian (@YileiQian), November 20, 2024: "Sorry to my Google friends (again)" https://t.co/yxbnBO7EYd]
These ELO differences further highlight the significance of the bias each evaluator has towards the model fine-tuned on its own preferences.
Our experiments underscore the critical impact of evaluator bias in private language model evaluations: even when every party acts in good faith, the same pair of models receives noticeably different preference rates and ELO rankings depending on whose annotators serve as judges.
We would like to thank Zack Lipton, Zico Kolter, Aditya Grover, Kai-Wei Chang, and Ashima Suvarna for their valuable feedback and discussions that helped shape this blog post.
Fun fact: This blog was born in a car somewhere in the North Cascades, post-CVPR 2024. Thanks to Gantavya Bhatt for keeping his eyes on the road while we vented about LLM evaluation bias. The view wasn't bad either.
PLACEHOLDER FOR BIBTEX