
AI Pair Programming: Lessons From a Year of Codex in the Wild

After a year of AI coding tools in production teams, here’s what the data actually shows about where they help, where they hurt, and what the best teams did differently.

A year ago, your engineering team started using AI coding tools, maybe Copilot in the IDE or Codex wired into the workflow. The promise was simple: an AI pair that handles routine tasks, accelerates boilerplate, and helps with complex problems.

Twelve months later, the question everyone’s asking is: did it truly work?

The honest answer: it depends on how you used them. But “it depends” isn’t the whole story. After a year of widespread adoption, the teams that used these tools consistently have accumulated enough pattern data to say something useful about what actually changes when AI enters the development workflow.

Here’s the thesis that holds up: the teams that got the most from AI coding tools treated AI pair programming as a new skill requiring deliberate practice, not a free productivity hack. Teams that simply switched the tools on and treated them as fancy autocomplete got less lasting value, spent more time debugging AI-generated errors, and ended up more frustrated than the teams that invested in learning the workflow.

The full answer is more complicated than the hype predicted and more useful than the skeptics expected.

What the First Year Actually Looked Like

The teams that adopted AI coding tools experienced a recognizable progression.

At first, adoption felt like magic. Boilerplate that used to eat an afternoon shipped in minutes. Test coverage that used to take days appeared overnight. The early velocity gains were real and significant.

Then came normalization. The easy wins, the code patterns AI could handle without much context, were captured quickly. What remained was the work that actually mattered: novel architecture decisions, complex debugging, and features that required deep understanding of how the system fit together. AI helped least with exactly the tasks that took longest.

By the end of the year, some teams reported that velocity gains had stabilized around 15–20%. The easy wins had been banked, and the work that remained was the complex kind where AI helps least.

Where AI Tools Provide Real Assistance

The productivity gains from AI pair programming are real, but they are not evenly distributed. Understanding the task-type breakdown is essential for program planning.

Consistent Gains: AI tools excel at boilerplate, test generation, documentation, straightforward refactoring, and work against unfamiliar API surfaces. When an engineer needs to implement a standard pattern for the fifteenth time, AI handles it reliably, as the sketch below illustrates.

Consistent Friction: AI tools struggle with novel architecture decisions, debugging complex edge cases, and anything that requires deep team-specific context. The suggestions sound plausible, which makes them dangerous precisely where human judgment matters most.
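To make the gains bucket concrete, here is the kind of routine scaffolding these tools produce reliably on the first pass. The slugify helper and its table-driven tests are hypothetical, invented for illustration rather than taken from any team in the data:

```python
import re

import pytest


def slugify(title: str) -> str:
    """Lowercase a title, strip punctuation, and join words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# The kind of table-driven test AI generates reliably: a well-worn
# pattern that needs no team-specific context.
@pytest.mark.parametrize(
    ("title", "expected"),
    [
        ("Hello, World!", "hello-world"),
        ("  Spaces   everywhere  ", "spaces-everywhere"),
        ("Already-slugged", "already-slugged"),
        ("", ""),
    ],
)
def test_slugify(title: str, expected: str) -> None:
    assert slugify(title) == expected
```

Nothing here is hard, but multiplied across a codebase, this is exactly where the consistent gains come from.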

That split between gains and friction changes sprint planning. A story that used to take two weeks because of boilerplate might now take four days. But a story that took two weeks because of a genuinely hard architectural decision still takes two weeks, and it can take longer if the team burns time second-guessing plausible AI suggestions mid-thought.

The Unrecognized Skill

Effective use of AI coding tools is a learned skill, not an automatic result of access.

This may seem obvious, but it has real implications for planning. The engineers who got the most from AI tools invested time in learning the model’s strengths and limitations. They learned to write better prompts: not just “write a function that does X,” but “write a function that does X following our existing patterns in the Y directory.” And they learned to read AI output critically, asking the follow-up questions that force the model to consider edge cases.
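The difference is easier to see side by side. Here is a hypothetical before-and-after; the file paths, class names, and requirements are invented for illustration:

```python
# A vague prompt leaves every decision to the model.
vague_prompt = "Write a retry wrapper for failed HTTP requests."

# A contextual prompt anchors the model to existing patterns and ends
# with the kind of edge-case question skilled users learn to ask.
# (All paths and names below are hypothetical.)
contextual_prompt = """\
Write a retry wrapper for failed HTTP requests.
Follow the existing client pattern in services/http/client.py:
- reuse the RequestError hierarchy from services/http/errors.py
- exponential backoff with jitter, capped at 30 seconds
- log each retry through our structured logger, never print()
What edge cases am I missing for non-idempotent requests?
"""
```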

This learning investment takes real time. Teams that approached AI adoption with a “we’ll just turn it on and see what happens” mindset consistently got less value than teams that built deliberate practice into their adoption strategy.

The TPM’s role is to treat this learning investment as real work and to push back on velocity estimates that assume the skill ramp is free.

The Review Time Dilemma

One of the most consistent findings of the year is that AI-generated code demands more rigorous review, not less.

The expectation was that AI would lighten the review burden by producing correct code faster. The reality was that AI consistently generated code that looked correct but carried subtle errors: off-by-one bugs, unhandled edge cases, wrong assumptions about input formats. These errors differ from human errors in ways that make them harder to catch. The code looks polished, follows the house patterns, and feels right, until it doesn’t.
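A hypothetical illustration of the failure mode, not code from any of the teams described here. This pagination helper reads cleanly and passes a skim, yet silently drops the final partial page:

```python
def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to display total_items."""
    # Looks reasonable and is wrong: integer division silently drops
    # the final partial page. page_count(95, 10) returns 9, but items
    # 91 through 95 need a tenth page.
    return total_items // page_size


def page_count_fixed(total_items: int, page_size: int) -> int:
    """Ceiling division: every leftover item still gets a page."""
    return -(-total_items // page_size)


assert page_count(95, 10) == 9         # the plausible, buggy answer
assert page_count_fixed(95, 10) == 10  # what a skeptical review catches
```

A reviewer skimming for style finds nothing to flag here; only someone asking which inputs would break it catches the bug.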

Engineers who excelled at catching AI errors developed specific habits. They read AI-generated code with more skepticism than human-written code, and instead of accepting its confident presentation, they asked questions like, “What would have to be true for this to be correct?”

For program planning, this means the review phase of AI-assisted work runs longer than it does for equivalent human-written work. Velocity estimates that ignore this systematic cost tend to overpromise.

Context as a Competitive Advantage

AI tools amplify the quality of the codebase they work in. Well-documented, modular codebases with clear specifications get significantly more value from AI tools than messy ones.

A clean codebase gives the AI better context to pattern-match against, and the output tends to align with existing conventions. A messy codebase gets AI output that confidently perpetuates the mess.

For TPMs, this means investment in technical documentation and codebase hygiene pays off directly in AI-assisted velocity. Teams that cleaned up their codebases before adopting AI tools saw faster and larger gains. Teams that expected AI to help clean up the mess found it was more likely to add to it.

This is the “context moat” AI companies talk about, and it is real: teams with the best technical practices get compounding advantages from AI tools, while teams already struggling technically find that AI compounds the struggle.

The Discipline of Turning It Off

Surprisingly, the best engineers know when NOT to use AI tools.

When engineers worked on novel architectural decisions, problems whose solutions don’t already exist as a pattern the model has seen, leaving AI tools running actively hurt outcomes. AI tools optimize for plausible-sounding solutions, and for genuinely novel problems the plausible solution is often wrong in ways that human reasoning wouldn’t have been.

Engineers who achieved the best results developed the discipline to switch off AI tools during deep architectural thinking, work through the problem themselves, and then use AI for implementation afterward. They used AI to execute their thinking, not to do their thinking for them.

This skill requires deliberate cultivation. Most teams haven’t developed it yet. The default behavior is to leave AI running constantly, which creates subtle pressure toward plausible-sounding answers rather than genuinely novel ones.

What This Means for Program Planning

The teams with the most accurate understanding of AI-assisted velocity have stopped trying to capture it as a single percentage gain. What they’ve learned is:

Velocity gains are concentrated in specific task types. Sprint planning needs to account for the type of work each story represents. Boilerplate and test stories get significantly faster, while novel architecture and context-heavy decisions don’t.

Review time increases for AI-assisted work. The standard development-plus-review formula needs its review component adjusted upward when AI is involved; the sketch below shows one way to model the combined effect.

Learning investment is real. Teams that get the most from AI tools have deliberately invested in learning them. This time is real and needs to be accounted for in adoption planning.
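Here is a toy estimator that puts the first two points together. The multipliers are invented assumptions for illustration, not measured values from the teams above:

```python
# Dev-time multipliers by task type: invented, illustrative numbers.
SPEEDUP = {
    "boilerplate": 0.3,          # big, reliable gains
    "tests": 0.4,
    "novel_architecture": 1.0,   # no reliable gain
}

# Assumed factor: AI-assisted code takes longer to review.
REVIEW_MULTIPLIER = 1.5


def estimate_days(task_type: str, dev_days: float, review_days: float) -> float:
    """Adjusted story estimate: faster development, slower review."""
    speedup = SPEEDUP.get(task_type, 1.0)
    return dev_days * speedup + review_days * REVIEW_MULTIPLIER


# A boilerplate-heavy story: 8 dev-days + 2 review-days becomes
# roughly 8 * 0.3 + 2 * 1.5 = 5.4 days, well short of what a flat
# "AI makes everything faster" multiplier would promise.
print(estimate_days("boilerplate", 8, 2))         # 5.4
# A novel-architecture story gets no dev speedup, only a review tax:
print(estimate_days("novel_architecture", 8, 2))  # 11.0, slower than before
```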

Five Questions Before You Adopt


If you’re evaluating AI coding tools for your team, or trying to determine whether your reported gains are genuine, here are the five questions to work through:

  1. Which task types are you using AI for? Measure outcomes by task type rather than overall velocity; the aggregate number hides the interesting distribution.
  2. How much time are engineers spending learning the tools versus using them? Treating AI adoption as zero-cost consistently underestimates the skill investment required.
  3. Is review time factored into your velocity estimates? If your review process hasn’t changed at all, you’re probably underestimating the effort reviews now take.
  4. How clean is your codebase? If you expect AI tools to help pay down technical debt, the evidence says they’re more likely to perpetuate it.
  5. When will you turn the tools off? The teams that get the most from AI have built deliberate practice around knowing when assistance helps and when it hurts.

A TPM I know was approached by an engineering lead eager to adopt AI coding tools. Before signing off, she asked three questions: “What’s your team’s prompting skill level? Do you have documentation standards for your codebase? And do you have a protocol for when to turn the tools off?” The lead paused, realizing this wasn’t a simple tool rollout; it was a workflow change that demanded genuine investment. The team delayed adoption by six weeks to do it properly, and their first-year results were significantly better than those of comparable teams that adopted without the preparation.

The Teams That Achieved the Most

After a year of watching teams work with AI coding tools, a clear pattern emerged: the teams that got the most value weren’t the most enthusiastic adopters. They were the ones that treated AI pair programming as a skill requiring deliberate practice, measured outcomes honestly rather than chasing impressive velocity numbers, and built discipline around when to use AI and when to override it.

The technology is real and the gains are genuine. But so is the complexity of capturing those gains sustainably, and that complexity is what the hype cycle largely missed.

The teams that understood this early are now operating with the realistic expectations everyone else will eventually have to adopt.