In the prehistoric times of 2021, a go-to aspirational use case for AI code assistants was code review. Check out this tweet from Marius Eriksen, software engineer at Meta. Today, AI code review tools are here and plentiful. Examples include CodeRabbit, Qodo’s PR-agent, Greptile, GitHub Copilot code review, Ellipsis, Korbit, Kodus, CodePeer, Codelantis, Bito, Graphite, LinearB, Swimm, Gemini Code Review, and CodeAnt AI. While their promise is significant, whether these tools truly lighten the code review burden or just confidently wing it remains hotly debated among developers. So let’s take a moment to discuss this facet of a much larger subject sitting at the intersection of “AI Agents and the CEOs” (or all business leaders, really) and what “Developers Want from their AI Code Assistants.”
I'd be much more interested in an AI code review system. Do we have enough (public) training data for that? https://t.co/e4cEmV9V9n
— marius eriksen (@marius) June 29, 2021
Code review is a necessary pain point and historical bottleneck in the SDLC. It’s time-consuming, sometimes disruptive, and skipping it is risky. AI vendors promise tools that never get tired, never go on vacation, and can review code in minutes. The appeal is obvious. Who wouldn’t want to catch more bugs faster and free human engineers for the creative parts of writing business logic? Of course, more skeptical developers note that such tools existed long before they were graced with AI’s gloss (and VC funds); back then they were just called linters. In fact, many so-called AI code review tools rely heavily on the same rule-based checks that static linters have performed since the 1970s. Moreover, there is significant overlap between AI code review tools and code analysis tools like Sourcegraph’s semantic indexing and analysis tools, Sonar’s SonarQube, and JetBrains Qodana, particularly in how and why developers use them. Still, the notion that AI code review could ensure quality without overburdening senior devs remains seductive.
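To make that overlap concrete, here is a contrived Python sketch, not taken from any vendor’s tooling: the first two findings are exactly the sort of rule-based checks traditional linters have always caught, while the comment inside the function points to the kind of contextual question AI reviewers claim to add on top.

```python
# Toy example of the linter/AI-reviewer overlap described above. The first two
# findings are classic rule-based lint checks; the third is the kind of
# context-dependent judgment AI review tools claim to add.

import os  # rule-based finding: unused import


def apply_discount(price, discounts=[]):  # rule-based finding: mutable default argument
    for d in discounts:
        price -= d
    # A contextual question a linter cannot ask, but a reviewer (human or AI)
    # might: should discounts ever be allowed to push the price below zero?
    return price


if __name__ == "__main__":
    print(apply_discount(100.0, [10, 95]))  # prints -5.0, arguably a bug
```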
Who’s Who
Of the AI code review tools on the market, several stand out for piquing developer interest. CodeRabbit has earned the most buzz from the developers I’ve spoken to personally, but also seems to be gaining community traction more broadly. In a very interesting podcast conversation with Harjot Gill, CEO of CodeRabbit, I was struck by the intensive agentic demands of this type of workload. According to Gill, CodeRabbit’s AI doesn’t just make a one-pass judgment. It plans a task graph of subtasks (security checks, style conformance, bug risk analysis, etc.), spawning sub-agents as needed. Some tasks are predetermined pipeline stages; others the AI plans on the fly. Crucially, the agent is encouraged to “follow multiple chains of thought”:
You want to let the AI follow multiple chains of thoughts, and then, some of them could lead to a dead end, but that’s fine. Maybe four out of five, doors were closed, but one of the doors leads to some interesting insight.
This methodical approach trades a bit of extra compute time for thoroughness. Since pull request reviews run in CI/CD and are latency-insensitive, CodeRabbit is willing to dedicate the time to be thorough rather than fast.
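To make the shape of that workload concrete, here is a minimal Python sketch of the pattern as I understand it from the conversation. The function names, the thread pool, and the toy “TODO” heuristic are my own illustration, not CodeRabbit’s implementation.

```python
# A minimal sketch (my own illustration, not CodeRabbit's code) of the review
# pattern Gill describes: plan a graph of review subtasks, fan out sub-agents,
# and keep only the chains of thought that reach a useful finding.

from concurrent.futures import ThreadPoolExecutor

# Some subtasks are fixed pipeline stages; others would be planned on the fly
# by the model for the specific diff under review.
PREDETERMINED_TASKS = ["security_check", "style_conformance", "bug_risk_analysis"]


def run_subagent(task: str, diff: str) -> list[str]:
    """Stand-in for an LLM-backed sub-agent. Most chains dead-end (return [])."""
    if task == "bug_risk_analysis" and "TODO" in diff:
        return [f"{task}: unfinished TODO shipped in this change"]
    return []  # a closed door, which is fine


def review(diff: str, planned_tasks: list[str]) -> list[str]:
    tasks = PREDETERMINED_TASKS + planned_tasks
    # Latency-insensitive: run every chain of thought, then aggregate survivors.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: run_subagent(t, diff), tasks))
    return [finding for chain in results for finding in chain]


if __name__ == "__main__":
    print(review("def pay(): pass  # TODO handle refunds", ["concurrency_review"]))
```

The point of the structure is the one Gill makes: most sub-agents return nothing, and the system is designed to tolerate that waste in exchange for the occasional door that opens onto a real finding.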
Another notable AI code review tool is GitHub Copilot code review, which GAed in April. It automatically reviews PR diffs to suggest changes or flag issues. There are currently two types of Copilot code review: Review selection (users can highlight code and ask for an initial review in VS Code) and Review changes (users can request a deeper review of all of their changes in VS Code or on the GitHub website). While reviews of this newly GAed version are sparse, and some early users of the beta were unimpressed, Copilot’s strength is its tight integration with developers’ existing CI/CD workflows and its ease of use. In the most recent episode of Frontend Fire, hosts Jack Herrington, Paige Niedringhaus, and TJ VanToll expressed skepticism about AI code review tools writ large (and Devin specifically), but acknowledged their willingness to give it a try since it’s already built into their GitHub dashboard.
Some AI code review tools have found a niche in security. In 2021, Amazon announced CodeGuru, a tool that “helps you improve code quality and automate code reviews by scanning and profiling your Java and Python applications” with “new detectors [that] use machine learning (ML) to identify hardcoded secrets as part of your code review process”; it has since been rebranded as Amazon CodeGuru Security. Snyk is well-known in the AppSec world for scanning dependencies and container images, but with the 2020 acquisition of DeepCode, it jumped into the AI code analysis space. Snyk Code leverages DeepCode’s capabilities under the hood to provide AI-driven static analysis with a security bent. Other vendors marketing AI-assisted code review specifically for security include HackerOne Code, Turingmind, and Codacy. What interests me about the positioning of these security-focused AI code reviewers is the suggestion that AI can improve not only vibe-generated code, but also code written by fallible humans. For this use case, machines surpass humans. Security-focused AI code review tools shine at enforcing consistency, ensuring that teams follow agreed-upon standards.
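As a concrete, contrived example of the class of finding these security-focused reviewers target, consider the Python snippet below. The key is fake and the PAYMENTS_API_KEY environment variable is hypothetical; the point is the pattern, not any particular vendor’s detector.

```python
# Contrived example of a hardcoded-secrets finding (the key below is fake).
# A security-focused reviewer flags the string literal and typically suggests
# resolving the credential at runtime from the environment or a secrets manager.

import os

# What gets flagged: a credential committed to the repo as a string literal.
API_KEY = "sk_live_FAKE1234567890"  # finding: hardcoded secret


# The usual suggested fix: look the secret up at runtime instead.
def get_api_key() -> str:
    key = os.environ.get("PAYMENTS_API_KEY")  # hypothetical variable name
    if key is None:
        raise RuntimeError("PAYMENTS_API_KEY is not set")
    return key
```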
In summary, the competitive landscape breaks down like this: Some tools provide general code review assistance and will live or die based on how well they perform. Others specialize (security scanners, style enforcers) and hope to “cross the chasm” through targeted adoption. And Big Tech (GitHub, Amazon) is embedding AI reviewers directly into its platforms, offering convenience at the cost of some flexibility.
Developers’ Verdict: From Skeptical to Real Skeptical
If you think engineers have strong opinions on tabs vs spaces, wait until you ask them about AI code review tools. The sentiment ranges from cautious optimism to “burn it with fire.” Let’s start with the good. Many engineering leaders report having overall positive experiences with these tools. I spoke, for instance, with Jon Freedman, CTO at Echios, who explains:
I’m using Greptile with both the start-ups I work with, $30 per dev per month is worth it at a small scale. You can also go the open source route and just pay for your AI token usage (https://github.com/qodo-ai/pr-agent).
If you’re not doing pair programming and merging direct to your main branch hard-core trunk-based style it’s a no-brainer to turn these reviews on, you can still resolve anything flagged that’s noise.
Others remain skeptical. Online you can find many folks like Jesse Squires, an iOS and macOS developer, with objections:
I work on a team that has enabled an AI code review tool. And so far, I am unimpressed. Every single time, the code review comments the AI bot leaves on my pull requests are not just wrong, but laughably wrong. When its suggestions are not completely fucking incorrect, they make no sense at all.
A common refrain I encountered in this research is that AI can’t truly grok a team’s specific project context. As one Redditor put it:
My experience with coding AI is that they tend to have tolerable general programming knowledge, but tend to be utterly incapable of understanding the context of your project. This means they are far below the capabilities of a solid programmer working on the project. Interacting with them is therefore a waste of time when you’re good at what you’re trying to do.
The issue of context has long been core to the success of AI code assistant tools. Context is King, so, unsurprisingly, many players in the AI code review market claim that it is context that sets them apart from the competition. Greptile, for instance, markets itself as an “AI code reviewer with complete context of your codebase.” Edvaldo Freitas, Head of Growth at Kodus, also points to context as necessary for the success of these review tools:
So, the problem with a lot of tools is that they don’t really get the full context of the code. They either suggest things that aren’t a priority or don’t understand the team’s patterns.
Another issue that haunts AI code review tools, and a constant complaint among the devs who use them, is wading through false positives and hallucinations (Squires’s “completely fucking incorrect”). Many devs share anecdotes of AIs hallucinating problems that don’t exist or suggesting bizarre changes. Chris Zuber, a fullstack web developer, complains on Reddit:
I’ve had nothing but horrible experiences with AI code review. It suffers from hallucinations, outdated info, insufficient memory/context, etc. It just makes everything up, ignores explicit instructions, gives some utterly bloated and useless response, and tends to dwell on some BS it invented and end up conflating the actual code with whatever garbage it comes up with.
Maybe it’s fine for reviewing boilerplate, but… If you’re the author of some library or if you’re doing anything remotely complex, it’s just infuriating and a waste of time.
Vendors recognize this issue and are eager to overcome it. Many offer prompt guidelines and documentation, or else suggest that users spend time at the outset customizing the AI’s rules to match their team’s priorities. All of these strategies are intended to improve the tool’s success by avoiding irrelevant or incorrect suggestions. In fact, developers complain that these hallucinations, combined with laziness and inexpertise within teams, make for catastrophe. As one Redditor, speaking of Codacy specifically, notes, sprinkling in AI fairy dust with features like AI-generated fixes for static analysis findings can be a double-edged sword because:
Fixing static analysis is like 90+% boring mechanical code changes to adhere to a stricter style, and 10% noticing that you’re doing something dumb and the tool just shoved it in your face.
Junior developers submit what I call “make the tool shut up” pull requests. The kind of stuff where they take legitimately smelly code and make it actively worse, but satisfy the analysis rule in the process. Or just mark the dumb thing as intentional.
I’d need to see good samples from this tool’s testing corpus of interesting static analysis findings and the AI generated fixes, otherwise I’m instantly writing it off as garbage.
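For readers who haven’t lived this, here is a toy Python illustration of that dynamic. Both functions are hypothetical, and the # noqa comment stands in for whatever suppression mechanism the tool at hand provides.

```python
# A toy illustration (my own, not from Codacy or any tool's corpus) of the
# "make the tool shut up" pattern: suppose a static analysis rule, or an AI
# reviewer, flags a bare "except Exception" for swallowing errors.


# The appeasement "fix": silence the finding without changing the smelly behavior.
def load_config_appeased(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except Exception:  # noqa
        return ""  # errors are still swallowed; the reviewer is just quieter


# The actual fix: handle the one failure you expect and let the rest surface.
def load_config_fixed(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""
```

Both versions make the warning go away; only the second addresses what the finding was actually about.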
The stereotypes of junior devs and vibe coders who settle for appeasing the tool, rather than actually addressing the underlying issue, abound in conversations among practitioners debating the merits of AI code reviewers. Users more interested in how to “make the linter shut up” than in security or performance don’t benefit from these tools today, and can even create more work for the engineering teams tasked with managing them by making code worse or introducing new problems. Fair enough, but what interested me most in my research into these very desirable, but still imperfect, tools was just this sort of outward-looking reflection. Let me explain what I mean.
These automated reviewers force teams—and the engineering leaders most often tasked with reviewing PRs—to take a hard look in the mirror, and the result has been deep, earnest practitioner self-reflection about code review’s role and purpose in successful teams. Some engineers note that loss of human knowledge-sharing is a subtle cost of relying on AI for reviews. Code review isn’t just about finding bugs; it’s also senior devs teaching juniors, team members learning parts of the codebase they don’t normally touch, ensuring a shared understanding of design decisions, and sometimes challenging architectural choices. As one Redditor explains:
AI code reviews, even if they worked perfectly, would mean missing out on one of the big benefits of code reviews – which is that it helps spread knowledge to other team members.
When a human reviews code, they’re not only checking for correctness; they’re also sharing context and mentoring. This collaborative process builds shared ownership and raises the collective expertise of the team. Code review is where the hard work of programming actually occurs. If AI were to replace that entirely, even if it did so flawlessly, it would strip away some of the most valuable, albeit high-level, outcomes of the review process.
At the end of the day, code review is as much about humans collaborating as it is about benefitting organizations and engineering teams. As Teaganne Finn and Amanda Downie at IBM sum up, these tools can boost “efficiency, consistency, error detection, [and] enhanced learning.” There can be no doubt that AI is rapidly improving and will play an increasingly important role in the SDLC, but for the foreseeable future it will continue to work best with both human and organizational intelligence at the helm.
Disclaimer: AWS, IBM, Google, and Microsoft (GitHub) are RedMonk clients.
Update 6 June 2025: I received some excellent feedback from readers and decided to add a positive quote to the “Developers’ Verdict” section.
Header image created by ChatGPT 4o