Coding Agent: Eight Months Later — The One-Shot Leap and the Systemic Gap That Remains

Feb 28, 2026

I am a seasoned technology leader who has spent years building and defending large systems. Eight months ago I began vibe coding with Cursor and Claude Code under one strict rule: I would write zero lines of code myself. The coding agent would do all the work; my role would be to specify intent, review output, and steer direction.

In my very first session I built a fully playable Tetris game in under two hours — including the time it took to learn the IDE. Eight months later, the leap in what these coding agents can deliver in one shot is unmistakable.

Eight months ago, working with Cursor and Claude Code felt like collaborating with a fast but unsteady junior engineer. The tools could generate clean functions and fix obvious issues, but any request involving real system complexity quickly dissolved into a long chain of follow-up prompts, corrections, and manual intervention.

Today, the same tools routinely deliver complete, production-ready modules in a single shot. The change is not incremental. It is structural.

The Leap in Autonomy

The most meaningful metric is how much truly runnable, production-grade code the coding agent can produce without intervention.

Eight months ago, a typical feature request yielded partial implementations that compiled but failed under realistic conditions. Today, models like Opus 4.6 and Codex 5.3 inside Cursor or Claude Code routinely output complete, testable modules on the first attempt — including error handling, logging, database updates, and unit tests — as long as the scope remains within a single coherent concern.

The dominant failure mode has shifted.

It is no longer “the code does not work.”

It is now “the code works for the obvious paths, but misses interactions I implicitly assumed were handled.”

That distinction matters.

What the Coding Agent Now Does Exceptionally Well

Using a subscription system as a concrete example, it becomes clear where today’s coding agents are genuinely strong. When a task is well-scoped and bounded — even inside a real production domain — their reliability is no longer surprising. It is repeatable.

Isolated business logic: Plan upgrades and downgrades execute cleanly, with correct proration, user-state updates, Stripe events, and cache refreshes.
Infrastructure scaffolding: New API endpoints, migrations, authentication flows, and observability hooks slot into the existing codebase with little or no rework.
Consistent coding discipline: The coding agent produces readable, idiomatic code — consistent naming, clean abstractions, and predictable error handling — more reliably than many humans under deadline pressure.
Bounded refactoring: Field renames, type updates, and module extractions complete without dangling references or inconsistencies.
Meaningful test generation: Tests encode real behavior, cover meaningful edges, and catch regressions that manual reviews often miss.
Quick domain adoption: Given existing code and a short intent description, the coding agent quickly adopts local vocabulary and conventions.

Collectively, these strengths have eliminated large amounts of mechanical effort. For well-defined problems, the coding agent behaves like a careful, methodical engineer who never tires and never loses short-term context.

Where the Coding Agent Still Struggles: Combinatorial State and Implicit Assumptions

These strengths break down the moment multiple “obvious” operations occur simultaneously — the exact scenarios that dominate real production systems.

Consider the same subscription service, but now viewed as a system rather than a collection of isolated actions.

There are four tiers: Free, Individual, Team (2–5 seats), and Business (6+ seats). Every paid tier supports monthly or annual billing. The meaningful user journeys include:

New signup
Upgrade from any lower tier to any higher tier
Downgrade from any higher tier to any lower tier
Switching between monthly and annual billing (standalone or bundled with an upgrade or downgrade)
Enabling or disabling auto-renewal during an active subscription
Adding or removing seats mid-cycle
Canceling with or without immediate effect
Reactivating after cancellation

The coding agent has no trouble implementing any one of these paths in isolation. But when asked to handle combined transitions — for example, an upgrade plus a billing-cycle change plus an auto-renewal toggle — critical details are frequently missed:

Correct proration when a cycle switch triggers a new invoice
Proper grace-period handling if auto-renewal is disabled and later re-enabled
Consistent updates to “current plan,” expiration date, and next-billing preview across backend state, frontend cache, email notifications, and audit logs
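The first item can be made concrete with a small worked example. This is a minimal sketch, assuming simple day-based proration in integer cents; the function names, the prices, and the rule that a cycle switch credits the unused part of the old period against a full charge for the new one are illustrative assumptions, not how Stripe or any particular billing system computes it.

```python
from datetime import date


def unused_credit_cents(price_cents: int, period_start: date,
                        period_end: date, change_on: date) -> int:
    """Credit for the unused part of the current period (assumed day-based)."""
    total_days = (period_end - period_start).days
    remaining_days = (period_end - change_on).days
    if total_days <= 0 or not 0 <= remaining_days <= total_days:
        raise ValueError("change date must fall inside the billing period")
    # Integer division rounds the credit down to a whole cent.
    return price_cents * remaining_days // total_days


def combined_change_invoice_cents(old_price_cents: int, period_start: date,
                                  period_end: date, change_on: date,
                                  new_price_cents: int) -> int:
    """Amount due when an upgrade and a cycle switch land in one invoice."""
    credit = unused_credit_cents(old_price_cents, period_start,
                                 period_end, change_on)
    return new_price_cents - credit


# Hypothetical prices: a 12.00 monthly plan replaced halfway through
# a 30-day period by a 240.00 annual plan.
due = combined_change_invoice_cents(
    old_price_cents=1200,
    period_start=date(2026, 3, 1),
    period_end=date(2026, 3, 31),
    change_on=date(2026, 3, 16),
    new_price_cents=24000,
)
print(due)  # 23400
```

The point of the sketch is that the credit and the new charge are computed in one place. When the upgrade and the cycle switch are implemented as two separate code paths, this is the step that tends to be applied twice or not at all.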

When the combination count exploded, I gave the coding agent exhaustive guidance: a complete list of every upgrade and downgrade path, illustrated with one detailed plan-switch example. I expected it to generalize and correctly link billing cycle changes and auto-renewal toggles to those paths. The reality disappointed me. The connections remained fragile.
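To show how quickly the combinations grow, here is a rough enumeration. It is a deliberately simplified sketch: it models only tier, billing cycle, and auto-renewal, and it ignores seat counts, cancellation, and reactivation, which would multiply the numbers further. The tuple representation and function names are assumptions made for illustration.

```python
from itertools import permutations, product

TIERS = ["Free", "Individual", "Team", "Business"]
CYCLES = ["monthly", "annual"]
AUTO_RENEW = [True, False]


def subscription_states():
    """Every (tier, cycle, auto_renew) combination; Free has no billing."""
    states = [("Free", None, None)]
    states.extend(product(TIERS[1:], CYCLES, AUTO_RENEW))
    return states


def combined_transitions():
    """Every ordered move from one state to a different state."""
    return list(permutations(subscription_states(), 2))


def changed_dimensions(src, dst):
    """Names of the dimensions that differ between two states."""
    names = ("tier", "cycle", "auto_renew")
    return [name for name, a, b in zip(names, src, dst) if a != b]


def paid_transitions_changing_everything():
    """Paid-to-paid moves that change tier, cycle, and auto-renewal at once."""
    return [
        (src, dst)
        for src, dst in combined_transitions()
        if src[0] != "Free" and dst[0] != "Free"
        and len(changed_dimensions(src, dst)) == 3
    ]


print(len(subscription_states()))                   # 13
print(len(combined_transitions()))                  # 156
print(len(paid_transitions_changing_everything()))  # 24
```

Even in this reduced model, one detailed plan-switch example covers one transition out of 156. Generating the full list and turning each entry into a test case is one way to stop relying on the agent to generalize.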

A second example makes the limitation even clearer. I asked the coding agent to generate a single admin table listing every billing-related event: new subscriptions, seat additions, auto-renewal toggles, cancellations, and reactivations. The schema it produced looked clean — but it quietly omitted essential side effects:

How reactivating a canceled annual plan restores the original expiration date
How mid-cycle seat additions affect the next invoice
How to display the effective plan when multiple changes overlap within the same billing window
How canceling a plan and later reactivating it, or disabling auto-renewal and later re-enabling it, can inadvertently discard historical subscription data if the original expiration, billing anchor, or proration state is not preserved

None of these behaviors are obscure. Each is obvious to an experienced engineer when considered alone. The problem is synthesis: the coding agent struggles to reliably combine them into a single, consistent system model.
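One way to keep these side effects from being silently dropped is to have every billing event record the subscription state before and after the change. The sketch below is hypothetical throughout: the `Snapshot` and `BillingEvent` shapes, the field names, and the rule that reactivation inside the paid period restores the original expiration and billing anchor are assumptions for illustration, not the schema the agent produced.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    plan: str
    cycle: str
    seats: int
    expires_on: date
    billing_anchor: date
    auto_renew: bool
    canceled: bool = False


@dataclass(frozen=True)
class BillingEvent:
    kind: str
    occurred_on: date
    before: Snapshot  # state prior to the event; never overwritten
    after: Snapshot   # state the event produced


def cancel(state: Snapshot, today: date) -> BillingEvent:
    """Mark the plan canceled but keep expiration and anchor untouched."""
    after = replace(state, canceled=True, auto_renew=False)
    return BillingEvent("cancel", today, state, after)


def reactivate(log: list, today: date) -> BillingEvent:
    """Restore the pre-cancellation terms from the recorded snapshot."""
    last_cancel = next((e for e in reversed(log) if e.kind == "cancel"), None)
    if last_cancel is None:
        raise ValueError("nothing to reactivate")
    original = last_cancel.before
    if today > original.expires_on:
        raise ValueError("paid period has ended; a new subscription is needed")
    current = log[-1].after
    after = replace(
        current,
        canceled=False,
        auto_renew=original.auto_renew,
        expires_on=original.expires_on,
        billing_anchor=original.billing_anchor,
    )
    return BillingEvent("reactivate", today, current, after)


start = Snapshot("Team", "annual", 3, date(2027, 1, 15), date(2026, 1, 15), True)
log = [cancel(start, date(2026, 6, 1))]
log.append(reactivate(log, date(2026, 7, 1)))
print(log[-1].after == start)  # True
```

Because each row carries both snapshots, the admin table can show the effective plan at any point, and a cancel followed by a reactivate cannot lose the original expiration date.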

This Pattern Is Widespread

Many developers using Cursor and Claude Code report the same limitation in early 2026. The tools excel at in-flow coding and single-responsibility modules, but any feature that touches payment flows, lifecycle state machines, or overlapping transitions still requires repeated context resets, explicit state diagrams, or manual orchestration.

This is not a question of intelligence. It is a question of implicit world models.

Experienced engineers carry years of compressed pattern recognition. When we say “handle an upgrade with a billing-cycle change,” we immediately visualize invoices, proration logic, cache invalidation, UI refreshes, audit trails, and customer emails. The model must be explicitly shown every ripple.

Until coding agents maintain persistent, verifiable world models that survive long sessions and encode system-level invariants, these gaps will remain.

The Path Forward

Progress is already visible.

Multi-agent workflows inside Cursor and Claude Code allow one coding agent to own the state machine, another to own billing events, and a third to enforce consistency checks. When I provide a clear state diagram or an exhaustive transition table upfront, success rates improve dramatically.
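An exhaustive transition table can be as simple as a dictionary plus a check that no state and action pair was left undecided. This is a minimal sketch; the three lifecycle states, the action names, and the choice of which pairs to reject are assumptions made for illustration rather than the actual table I used.

```python
from itertools import product

STATES = ("active", "canceled", "expired")
ACTIONS = ("upgrade", "downgrade", "switch_cycle", "toggle_auto_renew",
           "change_seats", "cancel", "reactivate")
REJECT = "reject"  # the pair was considered and is explicitly not allowed

TABLE = {
    ("active", "upgrade"): "active",
    ("active", "downgrade"): "active",
    ("active", "switch_cycle"): "active",
    ("active", "toggle_auto_renew"): "active",
    ("active", "change_seats"): "active",
    ("active", "cancel"): "canceled",
    ("active", "reactivate"): REJECT,
    ("canceled", "upgrade"): REJECT,
    ("canceled", "downgrade"): REJECT,
    ("canceled", "switch_cycle"): REJECT,
    ("canceled", "toggle_auto_renew"): REJECT,
    ("canceled", "change_seats"): REJECT,
    ("canceled", "cancel"): REJECT,
    ("canceled", "reactivate"): "active",
    ("expired", "upgrade"): REJECT,
    ("expired", "downgrade"): REJECT,
    ("expired", "switch_cycle"): REJECT,
    ("expired", "toggle_auto_renew"): REJECT,
    ("expired", "change_seats"): REJECT,
    ("expired", "cancel"): REJECT,
    ("expired", "reactivate"): "active",
}


def missing_entries(table):
    """State and action pairs that nobody has made a decision about."""
    return [pair for pair in product(STATES, ACTIONS) if pair not in table]


def next_state(state, action, table=TABLE):
    """Look up the target state, refusing explicitly rejected pairs."""
    target = table[(state, action)]
    if target == REJECT:
        raise ValueError(f"{action!r} is not allowed while {state!r}")
    return target


print(missing_entries(TABLE))          # []
print(next_state("active", "cancel"))  # canceled
```

Running `missing_entries` as a test gives the agent, and the reviewer, a mechanical answer to the question "did we think about this combination?" instead of leaving it to implicit assumptions.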

Eight months ago, I had to babysit the coding agent — acting as a cane it leaned on to move forward. Today, I function more like its eyes: guiding direction, spotting blind spots, and letting it move faster than I ever could alone.

The tools are not failing us. They are forcing us to specify system behavior with the rigor that large-scale software has always required.

The future is not that agents replace engineers. It is that they elevate us into better system designers — clearer thinkers about state, interaction, and consequence.

The tools have already crossed the threshold from helpful to indispensable.

Now the real work begins: teaching them to see every interaction we consider obvious.

Chong Xu
