AI Agents vs Freelancers: Only 3% of Tasks Completed

Article Highlights:
  • CAIS and Scale AI study: no AI agent exceeds 3% freelance task completion rate
  • Manus AI achieves top score with only 2.5% effective automation
  • OpenAI's GPT-5 reaches 1.7%, far from PhD-level intelligence promises
  • AI agents lack long-term memory and continuous learning capabilities
  • 95% of companies with AI initiatives see no meaningful revenue growth
  • Remote Labor Index tests AI on real projects: from gaming to data analysis
  • Google Gemini 2.5 Pro ranks last with 0.8% task completion
  • Many companies rehire employees after AI automation failures

Introduction

Workplace automation through artificial intelligence promises to revolutionize the freelance labor market, but new research conducted by the Center for AI Safety (CAIS) and Scale AI reveals a reality far different from expectations. AI agents, systems designed to automate tasks and potentially entire job functions, have demonstrated dramatically inferior performance compared to the humans they're meant to replace. The study tested six leading AI agent models on simulated freelance projects, with results that raise critical questions about AI's actual capacity to substitute qualified human labor.

The Remote Labor Index Test

Researchers developed an innovative benchmark called the Remote Labor Index (RLI), using a wide range of real-world remote work projects to evaluate AI agents' operational capabilities. The test involved diverse industries, from video game development to data analysis, encompassing multiple disciplines requiring specific skills and problem-solving abilities. The objective was to measure the automation rate, meaning the percentage of projects AI agents could complete at an acceptable level for real-world freelancing commissions.

Disappointing Results: Under 3% Success Rate

The study's findings were unequivocally negative. No AI agent managed to complete more than 3% of assigned work, collectively generating only $1,810 out of a potential $143,991. The top performer was the AI agent from Chinese startup Manus, with an automation rate of 2.5%. In second place, tied at 2.1%, were Elon Musk's Grok 4 and Anthropic's Claude Sonnet 4.5, the latter marketed as "the best coding model in the world" and "the strongest model for building complex agents."

OpenAI's GPT-5, with its purported "PhD level" intelligence, stopped at 1.7%. CEO Sam Altman had described GPT-5 as "a significant step along the path to AGI" (Artificial General Intelligence), but the RLI benchmark results show how far that claim is from operational reality. Even more striking, OpenAI's ChatGPT Agent reached barely 1.3%, while Google's Gemini 2.5 Pro ranked last with a dismal 0.8%.

Why AI Agents Fail

Dan Hendrycks, CAIS director, highlighted fundamental limitations that still plague AI agents despite rapid industry advances. AI agents lack long-term memory and cannot learn continually from experience. Unlike humans, they cannot pick up skills on the job as the work unfolds. These structural deficiencies prevent AI agents from adapting to unforeseen challenges and progressively improving their performance the way human workers would.

"I should hope this gives much more accurate impressions as to what's going on with AI capabilities."

Dan Hendrycks, Director of the Center for AI Safety

Impact on the Labor Market

Despite these evident results, the wave of AI-related layoffs shows no signs of slowing. Many CEOs continue reducing their workforce while embracing automation, but AI's actual capacity to increase productivity or compensate for human talent loss remains highly questionable. Anecdotes of employers who had to rehire employees after discovering AI tools' inadequacy are increasingly common.

An MIT study found that 95% of companies piloting AI initiatives saw no meaningful revenue growth. Other research found that introducing AI tools into employee workflows generated a deluge of low-quality "workslop": output requiring heavy revision to correct errors, slowing down processes and creating tension among coworkers forced to fix the sloppy results.

The Gap Between Marketing and Reality

Selling AI agents to employers has become the AI industry's obsession, as leaders like OpenAI struggle to capitalize on the popularity of their AI chatbots, many of which are free to use. OpenAI's own definition of AGI, "highly autonomous systems that outperform humans at most economically valuable work," seems light years away from measurable real-world results.

"We have debated AI and jobs for years, but most of it has been hypothetical or theoretical."

Bing Lie, Director of Research at Scale AI

Conclusion

The Remote Labor Index results offer a realistic and measured perspective on current AI agent capabilities, starkly contrasting with the tech industry's grandiose promises. While artificial intelligence continues evolving, this study highlights how the gap between expectations and operational reality remains significant. For businesses and professionals, the lesson is clear: automation through AI agents is not yet a ready solution to replace qualified human talent in complex freelance work.

FAQ

What is the success rate of AI agents in freelance work?

According to the CAIS and Scale AI study, no AI agent exceeded 3% task completion, with the top performer reaching only 2.5%.

Which AI agent achieved the best results in the test?

The AI agent from Chinese startup Manus scored highest with a 2.5% automation rate, followed by Grok 4 and Claude Sonnet 4.5 at 2.1%.

Why do AI agents fail to complete freelance projects?

AI agents lack long-term memory, cannot learn continuously from experiences, and don't pick up skills on the job like humans do.

What is the Remote Labor Index used in the study?

The Remote Labor Index (RLI) is a benchmark evaluating AI agents through real-world remote work projects in sectors like game development and data analysis.

How did OpenAI's GPT-5 perform in the AI agent test?

GPT-5 reached only 1.7% task completion, despite OpenAI promoting it as a significant step toward artificial general intelligence.

Are companies rehiring workers after trying AI agents?

Yes, many employers have had to rehire employees after discovering AI tools were inadequate to replace qualified human labor.

What was the economic output of tested AI agents?

Collectively, the six AI agents generated only $1,810 out of a potential $143,991, demonstrating extremely limited economic productivity.

Can AI agents replace human freelancers in 2025?

Current data indicates AI agents are not yet capable of effectively replacing human freelancers in complex tasks requiring adaptability and continuous learning.
