AI agents are emerging as a promising research area with real-world applications. These agents use advanced models like large language models (LLMs) and vision language models (VLMs) to understand natural language instructions and achieve complex goals on their own.
They use tools such as browsers, search engines, and code compilers to verify their actions and reason about their tasks. However, researchers at Princeton University have identified several shortcomings in current benchmarks and evaluation practices that limit the real-world usefulness of AI agents.
One major issue highlighted is the lack of cost control in agent evaluations. AI agents often require more computational resources than single-model calls, making them expensive to run.
To enhance accuracy, some systems generate multiple responses and use voting or verification mechanisms to select the best one, which significantly increases computational costs.
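As a rough illustration of why this inflates cost, the sketch below samples several answers and keeps the most common one; it is a minimal, hypothetical example in which `ask_model` stands in for whatever model client an agent actually uses, not a description of any particular system.

```python
from collections import Counter
from typing import Callable

def answer_with_voting(ask_model: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Sample the model several times and return the most common answer.

    Every extra sample is another model call, so inference cost grows
    roughly linearly with n_samples.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```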
While maximizing accuracy regardless of cost may be acceptable in research settings, practical applications have a budget for each query. The researchers suggest visualizing evaluation results as a Pareto curve of accuracy and inference cost, and they advocate jointly optimizing the two metrics to develop agents that are both accurate and cost-effective.
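One way to picture this is to compute the Pareto frontier over the evaluated agents, keeping only the designs that are not dominated on accuracy and cost. The sketch below is a minimal illustration; the agent names, accuracies, and costs are made up and are not results from the study.

```python
def pareto_frontier(results: list[dict]) -> list[dict]:
    """Keep agents that are not dominated on (accuracy, cost).

    An agent is dominated if another agent is at least as accurate
    and no more expensive (and strictly better on one of the two).
    """
    # Sort by cost ascending, breaking ties by accuracy descending.
    ordered = sorted(results, key=lambda r: (r["cost"], -r["accuracy"]))
    frontier, best_acc = [], float("-inf")
    for r in ordered:
        if r["accuracy"] > best_acc:
            frontier.append(r)
            best_acc = r["accuracy"]
    return frontier

# Hypothetical evaluation results: accuracy vs. average inference cost per query (USD).
agents = [
    {"name": "single-call",       "accuracy": 0.61, "cost": 0.002},
    {"name": "5-way voting",      "accuracy": 0.66, "cost": 0.010},
    {"name": "tuned prompt",      "accuracy": 0.65, "cost": 0.003},
    {"name": "verifier pipeline", "accuracy": 0.64, "cost": 0.012},
]
print([a["name"] for a in pareto_frontier(agents)])
# -> ['single-call', 'tuned prompt', '5-way voting']
```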
The researchers evaluated different prompting techniques and agentic patterns, discovering substantial cost differences for similar accuracy levels. They argue that joint optimization can lead to less expensive yet accurate agents, enabling researchers to balance fixed and variable costs effectively.
For instance, a one-time investment in optimizing the agent’s design (a fixed cost) can reduce its variable per-query cost, for example by using fewer in-context learning examples. Testing this approach on the HotpotQA benchmark showed that joint optimization can strike an optimal balance between accuracy and inference cost.
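A back-of-the-envelope comparison makes the fixed-versus-variable trade-off concrete. The figures below are hypothetical and only illustrate how a one-time optimization cost is amortized as query volume grows; they are not taken from the study.

```python
def total_cost(fixed_cost: float, cost_per_query: float, num_queries: int) -> float:
    """Total cost of an agent design: one-time fixed cost plus per-query variable cost."""
    return fixed_cost + cost_per_query * num_queries

# Hypothetical designs: the optimized agent pays a one-time search over prompts and
# examples (fixed cost) but uses fewer in-context examples, so each query is cheaper.
baseline  = {"fixed": 0.0,  "per_query": 0.020}   # many in-context examples, no optimization
optimized = {"fixed": 50.0, "per_query": 0.004}   # one-time design optimization, shorter prompts

for n in (1_000, 10_000, 100_000):
    b = total_cost(baseline["fixed"], baseline["per_query"], n)
    o = total_cost(optimized["fixed"], optimized["per_query"], n)
    print(f"{n:>7} queries  baseline ${b:,.0f}  optimized ${o:,.0f}")
# The fixed cost is amortized: beyond roughly 3,125 queries the optimized design is cheaper.
```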
Another critical issue is the difference between evaluating models for research purposes and developing real-world applications. In research, accuracy is often prioritized, with little regard for inference costs.
Inference costs, however, are crucial for practical applications and influence the choice of models and techniques. Evaluating them is difficult because model providers price their services differently and change those prices over time, so the cost of API calls fluctuates.
To address this, the researchers created a website that adjusts model comparisons based on token pricing and conducted a case study on NovelQA, revealing that benchmarks for model evaluation can be misleading for downstream applications.
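The adjustment itself is simple arithmetic over token counts and prices. The sketch below uses placeholder per-million-token prices (not actual provider rates, which change over time) to show why the same agent trace can differ in cost by an order of magnitude across models and why rankings need to be recomputed as prices shift.

```python
# Illustrative per-million-token prices (placeholders, not real provider rates).
PRICES = {
    "model-a": {"input": 5.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call, given token counts and per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

print(query_cost("model-a", input_tokens=12_000, output_tokens=800))  # 0.072
print(query_cost("model-b", input_tokens=12_000, output_tokens=800))  # 0.0072
```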
Overfitting is a significant problem in AI agent benchmarks: models often find shortcuts that score well on a benchmark without translating to real-world capability. The issue is exacerbated by the small size of agent benchmarks, typically only a few hundred samples.
The researchers recommend that benchmark developers create and maintain holdout test sets with examples that cannot be memorized during training and can only be solved by understanding the task. Their analysis of 17 benchmarks revealed that many lacked proper holdout datasets, allowing agents to take unintended shortcuts.
The researchers tested WebArena, a benchmark that evaluates AI agents on tasks across different websites, and found several shortcuts that allowed agents to overfit to the tasks, leading to inflated accuracy estimates and over-optimism about agent capabilities.
They emphasize that AI agent benchmarking is a new field without established best practices, making it challenging to differentiate genuine advancements from hype.
The researchers argue that benchmarking practices need to be rethought to accurately assess the capabilities of AI agents, which could soon play a significant role in everyday applications.