
This consistent introduction of security flaws, which human reviewers often struggle to detect in complex codebases, highlights the ongoing need for robust validation. According to Forbes, Google's new AI coding tool was hacked just a day after its launch, underscoring the real-world stakes of these vulnerabilities.
Dongfu Jiang, a PhD student and co-first author of the Waterloo study, said: "With this kind of study, we want to measure not only the syntax of the code — that is, whether it’s following the set rules — but also whether the outputs produced for various tasks were accurate."
The distinction matters: code that merely follows syntax rules is of little value if its output is not functionally correct or secure. The industry's push for structured outputs, intended to enhance reliability, has not yet delivered the dependable results developers require for complex scenarios.
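To make that distinction concrete, here is a minimal Python sketch; the JSON payload and both checks are invented for illustration. The response parses cleanly, so it passes any syntax check, yet its `total` field contradicts its own line items, which only a task-level check catches.

```python
import json

# Hypothetical assistant output: syntactically valid JSON whose "total"
# does not match the sum of its line items (9.99 + 4.50 = 14.49).
model_output = '{"items": [{"price": 9.99}, {"price": 4.50}], "total": 15.49}'

def is_valid_json(text: str) -> bool:
    """Syntax check only: does the output parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_semantically_correct(text: str) -> bool:
    """Task-level check: do the parsed values actually satisfy the task?"""
    data = json.loads(text)
    expected_total = round(sum(item["price"] for item in data["items"]), 2)
    return data["total"] == expected_total

print(is_valid_json(model_output))            # True: the syntax is fine
print(is_semantically_correct(model_output))  # False: the arithmetic is not
```

Syntactic compliance is cheap to verify automatically; functional accuracy, the quality the Waterloo team set out to measure, is the harder and more important target.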
| Model Type | Average Accuracy on Structured-Output Tasks |
|---|---|
| Advanced proprietary models | ~75% |
| Open-source models | ~65% |
- **Implement strict code reviews.** Given that AI models introduce vulnerabilities and fail on roughly 25% of structured tasks, mandate rigorous human review for all AI-generated code.
- **Prioritize security scanning.** Integrate automated security tools early in the development lifecycle to catch the "potentially severe vulnerabilities" that AI systems regularly introduce; see the scanning sketch after this list.
- **Define clear guardrails for AI usage.** Limit AI assistants to less critical, text-based tasks where accuracy rates are higher, and avoid complex multimedia or highly sensitive code generation without robust oversight.
- **Invest in developer upskilling.** Train your team not just on using AI tools, but on effectively validating and debugging AI outputs to bridge the 25-35% reliability gap; see the test-harness sketch below.
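On the security-scanning point above, here is a hedged sketch of a CI gate in Python. Bandit is a widely used open-source static-analysis scanner for Python code; the `generated/` directory and the merge-blocking policy are assumptions for illustration, not a prescribed workflow.

```python
import subprocess
import sys

# Scan the directory that holds AI-generated code with Bandit.
# "generated/" is a placeholder for wherever assistant-written
# modules live in your repository.
result = subprocess.run(
    ["bandit", "-r", "generated/", "-f", "txt"],
    capture_output=True,
    text=True,
)
print(result.stdout)

# Bandit exits non-zero when it reports findings, so failing here
# blocks flawed AI-generated code from reaching the main branch.
if result.returncode != 0:
    sys.exit("Security scan flagged issues; review before merging.")
```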
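For the upskilling point, one concrete validation habit is to treat assistant output as an untrusted candidate and accept it only once it passes tests you wrote yourself. Everything in this sketch (the `slugify` task and the candidate source) is a hypothetical example of that workflow.

```python
import unittest

# Imagine this string came back from a coding assistant.
candidate_source = """
def slugify(title):
    return title.lower().strip().replace(" ", "-")
"""

# Load the candidate into an isolated namespace. NOTE: exec-ing
# untrusted code is itself risky; in real use, run it sandboxed.
namespace = {}
exec(candidate_source, namespace)
slugify = namespace["slugify"]

class TestCandidate(unittest.TestCase):
    """Human-written tests the AI code must pass before acceptance."""

    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_surrounding_whitespace(self):
        self.assertEqual(slugify("  Hello World  "), "hello-world")

if __name__ == "__main__":
    unittest.main()
```

If the candidate fails, the failing test names tell the developer exactly what to fix or regenerate, which is precisely the debugging skill the recommendation targets.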
**How often do AI coding assistants fail?**

AI coding assistants fail about 25% of the time when performing structured-output tasks. A University of Waterloo study found that even advanced AI models struggle to consistently generate accurate code, requiring significant human oversight to catch errors and ensure code quality.

**Are proprietary models more accurate than open-source models?**

Proprietary AI coding models tend to be more accurate than open-source models, but both still have significant error rates. The University of Waterloo study found that advanced proprietary models achieve approximately 75% accuracy on structured-output tasks, while open-source models achieve around 65%.

**Do AI coding assistants introduce security vulnerabilities?**

Yes, AI coding assistants often introduce security vulnerabilities into the code they generate. A separate study by Veracode found that major AI systems regularly inject vulnerabilities into basic coding tasks, which can be difficult for human reviewers to detect and fix.

**Which tasks do AI models struggle with most?**

AI models struggle most with coding tasks that require multimedia or complex structural generation. While they can achieve moderate success on text-related tasks, their accuracy decreases significantly when outputs must adhere to predefined rules for formats like JSON, XML, or Markdown.

**Is AI-generated code reliable enough to use without human review?**

No, AI-generated code is not yet reliable enough to be used without significant human review. Studies show that AI coding assistants make mistakes and introduce security vulnerabilities, meaning developers still need to carefully validate and correct AI-generated code before deployment.