Are AI agents ready for the workplace? A new benchmark raises doubts

**The Uncertain Future of AI in the Workplace: A New Benchmark Raises Doubts**

I still remember Satya Nadella’s bold prediction in 2020 that AI would replace white-collar jobs, such as those held by lawyers, bankers, and IT professionals. AI models have made significant progress since then, but the shift in knowledge work has been slow and gradual. Despite the rise of AI models in finance and healthcare, most white-collar jobs have remained relatively unaffected.

A new benchmark, APEX-Agents, created by training-data giant Mercor, sheds light on this mystery. The benchmark tests the ability of AI models to perform real white-collar work tasks, and the results are… underwhelming.

One of the most striking aspects of the APEX-Agents benchmark is the “Law” section, where AI models struggle to provide accurate answers. For instance, one query asks whether Northstar’s engineering team can reasonably interpret exporting EU manufacturing event logs as complying with Article 49. The correct answer is yes, but it requires a deep analysis of the company’s policies and EU privacy laws. Yet, even the best AI models can only get about a quarter of the questions right. Most of the time, they come back with an incorrect answer or no answer at all.

According to Mercor CEO Brendan Foody, the biggest stumbling block is tracking down information across multiple domains, which is crucial for many knowledge work tasks. “In real life, you’re working across Slack and Google Drive and all these other tools,” Foody said. “The way we do our jobs isn’t with one person giving us all the context in one place.” For many AI models, multi-domain reasoning is still hit or miss.

Despite the disappointing results, the AI community has a history of pushing through difficult benchmarks. Now that the APEX-Agents benchmark is public, it’s an open challenge for AI labs to improve and prove their capabilities. Foody is optimistic about the future of AI development, citing the rapid progress made in the past year.

“It’s improving really rapidly,” he said. “Right now, it’s like an intern that gets it right a quarter of the time, but last year, it was the intern that got it right 5 or 10% of the time. That kind of improvement year after year can have an impact so rapidly.”

So, are AI agents ready for the workplace? Not yet, it seems. But with the rapid advancements in AI capabilities, it’s only a matter of time before they’re ready to take on more complex tasks. The future of AI in the workplace is uncertain, but one thing is clear: the pace of progress will continue to shape our understanding of its role in the job market.
