The Metrics Trap: How an Obsession with Engineering Metrics Can Erode Your Engineering Culture

After years of leading engineering teams, I’ve noticed that we all aim to build better, faster, and smarter systems. We rely on metrics to guide our decisions. But I’ve also seen these numbers backfire, hurting morale and blocking the innovation we want. There’s an intricate relationship between code and the people who write it, and we often miss that.

I believe the real value of a metric isn’t in the numbers themselves, but in how well it gets the team to reflect. Raw performance data only shows what happened. We learn the most when we take time to reflect on those results. That’s why 1-on-1 feedback, team retrospectives, and 360-degree feedback are so important when done right.

A good metric encourages us to examine how our processes truly function and should prompt honest conversations that drive improvements. For example, analyzing our cycle time metric revealed peer review bottlenecks, sparking meaningful discussion that led to a streamlined process and shorter cycles. If a metric fails to drive such growth, it’s just noise, not a tool for thoughtful change.
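A lightweight way to surface such a bottleneck is to break each pull request's cycle time into stages and compare the medians. Here is a minimal sketch in Python, assuming per-PR timestamps exported from your Git hosting tool (the field names are illustrative, not any specific tool's schema):

```python
from datetime import datetime
from statistics import median

def stage_durations(pr):
    """Split one PR's cycle time into wait-for-review and review-to-merge hours."""
    opened = datetime.fromisoformat(pr["opened"])
    first_review = datetime.fromisoformat(pr["first_review"])
    merged = datetime.fromisoformat(pr["merged"])
    return {
        "wait_for_review_h": (first_review - opened).total_seconds() / 3600,
        "review_to_merge_h": (merged - first_review).total_seconds() / 3600,
    }

# Toy data; in practice, export these timestamps from your hosting platform's API.
prs = [
    {"opened": "2024-05-01T09:00", "first_review": "2024-05-03T09:00", "merged": "2024-05-03T15:00"},
    {"opened": "2024-05-02T10:00", "first_review": "2024-05-05T10:00", "merged": "2024-05-05T12:00"},
]

durations = [stage_durations(pr) for pr in prs]
median_wait = median(d["wait_for_review_h"] for d in durations)
median_review = median(d["review_to_merge_h"] for d in durations)
print(f"median wait for first review: {median_wait:.0f}h")
print(f"median review-to-merge: {median_review:.0f}h")
```

If the wait-for-review median dwarfs the in-review time, the conversation is about reviewer load and scheduling, not about how fast people write code.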

💡
Leaders should build trust, use metrics to start constructive conversations, and emphasize how this boosts morale, innovation, and long-term organizational goals.

When Measures Become Targets

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
― Charles Goodhart

One of the biggest challenges I’ve seen is Goodhart’s Law: when a measure becomes a target, it stops being useful. If we turn metrics into strict goals, people will often find ways to work around them.

To illustrate, consider the contrast between a team that reports high story points and another that focuses on customer satisfaction. While the former may boast increased velocity, the latter delivers tangible value by meeting customer needs and receiving positive feedback. Highlighting data such as customer-reported impact may help prevent the illusion of progress that can arise from inflated estimates.

  • Development Velocity: If I push a team to maximize "story points," they might start inflating estimates or rushing code out without proper testing. That looks like an improvement on paper but actually leads to burnout.
  • Deployment Frequency: If we focus too much on how often we deploy, teams might skip important quality checks. This risk is lower when both the inner loop (local build and test) and the outer loop (CI/CD and release gates) are automated.
  • Code Coverage: Mandating 100% coverage often results in superficial tests that execute every line but don’t actually verify that the software works.
  • Git Activity Counts: Tracking each commit or pull request (PR) often leads to busywork and messy code histories instead of thoughtful work.
  • Bug Counts: If we reward low bug counts, people might just reclassify bugs as "improvements" to make the numbers look better.
“Any system for metrics is going to be flawed for a few reasons: we can’t see into the future, we can only evaluate the past, and the industry moves rapidly. So, depending on the space we’re in, we may have to adjust at times.”
― Sarah Drasner, Engineering Management for the Rest of Us

Charting a Better Course

To avoid this trap, the key takeaway is to treat metrics as a way to understand and improve systems, not as tools for punishing people.

As a first step, I recommend selecting one specific metric and discussing it with your team. This can be done in a team meeting where all participants share their views on the metric's impact, relevance, and how it can be adjusted to better support team goals. This simple action will start the process of applying metrics for reflection and development, rather than as strict targets.

The SPACE Framework

I’ve found that the SPACE framework is much more human-centered. You can use it at the individual, team, or organization level:

  • Satisfaction & Well-being: Are developers happy and avoiding burnout?
  • Performance: How is the overall system health and delivery?
  • Activity: What actions are we seeing (viewed as flow, not individual output)?
  • Communication & Collaboration: How well are teams working together?
  • Efficiency & Flow: Can people get work done without constant interruptions?
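To make these dimensions concrete without turning them into targets, some teams run a short recurring pulse survey and watch per-dimension trends at the team level. A toy sketch of the aggregation; the questions, 1-5 scale, and dimension keys are my own illustration, not part of the SPACE framework itself:

```python
from collections import defaultdict
from statistics import mean

# Each response: (SPACE dimension, 1-5 agreement score). Data is invented.
responses = [
    ("satisfaction", 4), ("satisfaction", 2),
    ("performance", 3),
    ("activity", 4),
    ("collaboration", 5), ("collaboration", 3),
    ("efficiency", 2), ("efficiency", 2),
]

by_dim = defaultdict(list)
for dim, score in responses:
    by_dim[dim].append(score)

# Average per dimension; track the trend over time rather than any single snapshot.
team_view = {dim: round(mean(scores), 1) for dim, scores in by_dim.items()}
print(team_view)
```

A low score here is a prompt for a retrospective conversation, never a target to be pushed up by decree.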

DORA Metrics Done Right

We can also use DORA metrics, like Deployment Frequency and Mean Time to Restore, but only to improve team processes, not for individual performance reviews.
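Both metrics fall out of data most teams already collect: deployment timestamps and incident open/close times. A minimal sketch with invented event data:

```python
from datetime import datetime
from statistics import mean

# Invented events; in practice, pull these from your deploy pipeline and incident tracker.
deploys = [datetime(2024, 5, d) for d in (1, 2, 3, 6, 7, 8, 9, 10)]
incidents = [  # (detected, restored)
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 11, 30)),
    (datetime(2024, 5, 7, 9, 0), datetime(2024, 5, 7, 9, 45)),
]

# Deployment frequency: deployments per day over the observed window.
window_days = (max(deploys) - min(deploys)).days + 1
deploy_freq = len(deploys) / window_days

# Mean Time to Restore, in minutes.
mttr = mean(
    (restored - detected).total_seconds() / 60 for detected, restored in incidents
)

print(f"deployment frequency: {deploy_freq:.2f}/day, MTTR: {mttr:.1f} min")
```

The point of computing these at the team level is that they describe the delivery system; attributing them to individuals is exactly the misuse the paragraph above warns against.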

“Any metric that seizes your attention but doesn’t contribute to your health, well-being, or career is ultimately a distraction.”
― Ron Friedman, Decoding Greatness

Book: Accelerate

I suggest every developer and engineering leader read the book "Accelerate" by Nicole Forsgren, Jez Humble, and Gene Kim. The authors describe three types of organizational culture, known as the Westrum typology: pathological, bureaucratic, and generative. We should aim for a generative culture, where people cooperate and don’t blame each other for mistakes. Teams like this usually perform better.

Separating Developer Metrics from AI Output

As we enter the era of AI-powered development, we face a new challenge: how do we measure code written by machines? I’ve seen that treating AI-generated code the same as developer-written code leads to confusion. In our own experiments, individual coding speed rose 300% with AI assistance, yet team delivery rates stayed flat. This is the "AI Productivity Paradox": if we only track volume, an AI agent can boost our metrics 10x overnight without delivering 10x more value. To avoid it, we need to separate AI code metrics from those of human developers.

We should see AI agents as high-volume helpers for our teams, not as replacements for people. This means we need to clearly separate their data from developer work:

  • The Acceptance Rate vs. The Rework Rate: It’s tempting to celebrate how much code an AI produces. But I’ve found it’s more useful to track the AI Acceptance Rate—how much of that code actually passes review. If an agent writes 1,000 lines but our engineers have to rewrite 40% because of "context rot" or hidden bugs, we haven’t gained efficiency. We’ve just added to the "Review Burden" for our senior team members.
  • Monitoring the Review Bottleneck: AI can write code in seconds, but people still review it in hours or days. We should track Review Latency for pull requests that involve AI. If our output goes up but our "Time to Deploy" doesn’t change, it means the bottleneck has just moved from writing to validation.
  • Measuring the Orchestrator: As AI takes on more routine code, the human developer’s role is becoming more like a "Lead Architect" or "Orchestrator." We should stop measuring them by how much they write and instead look at the Outcome Quality of the agents they oversee.
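All three signals can be derived from the same per-PR records once AI involvement is tracked. A sketch with assumed field names (not any real tool's schema):

```python
from statistics import median

# Invented per-PR records: lines the AI produced, lines humans rewrote in
# review, and hours the PR waited in review.
ai_prs = [
    {"ai_lines": 1000, "rewritten_lines": 400, "review_hours": 30},
    {"ai_lines": 200, "rewritten_lines": 20, "review_hours": 6},
    {"ai_lines": 500, "rewritten_lines": 150, "review_hours": 18},
]

total_ai = sum(pr["ai_lines"] for pr in ai_prs)
total_rewritten = sum(pr["rewritten_lines"] for pr in ai_prs)

rework_rate = total_rewritten / total_ai  # share of AI code humans redid
acceptance_rate = 1 - rework_rate         # share that survived review intact
review_latency = median(pr["review_hours"] for pr in ai_prs)

print(f"acceptance: {acceptance_rate:.0%}, rework: {rework_rate:.0%}, "
      f"median review latency: {review_latency}h")
```

A rising rework rate or review latency, against flat delivery numbers, tells you the bottleneck has moved to validation, which is exactly what raw volume metrics hide.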

Tag all commits and pull requests with metadata showing if they were "AI-generated," "AI-assisted," or "Human-only." If AI code leads to more production incidents, it’s a sign to strengthen your human review process.
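One low-friction way to attach that metadata is a git commit trailer; the `AI-Origin:` trailer below is my own naming convention, not a git standard. This sketch tallies commits by origin from `git log --format=%B%x00`-style output:

```python
import re
from collections import Counter

# Sample `git log --format=%B%x00` output: full commit messages, NUL-separated.
log = (
    "Add retry to payment client\n\nAI-Origin: ai-assisted\n\x00"
    "Refactor invoice model\n\nAI-Origin: human-only\n\x00"
    "Generate CRUD endpoints\n\nAI-Origin: ai-generated\n\x00"
    "Fix typo in README\n\x00"  # untagged commit
)

counts = Counter()
for message in filter(None, log.split("\x00")):
    # Look for the trailer at the start of a line within the message body.
    match = re.search(r"^AI-Origin:\s*(\S+)", message, re.MULTILINE)
    counts[match.group(1) if match else "untagged"] += 1

print(dict(counts))
```

With this split in place, you can correlate incident rates per origin category and see whether AI-heavy changes genuinely need a stronger human review gate.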

The AI-Human Workflow Matrix: A Guide for Modern Development Teams

As we build modern, efficient teams, our true goal is to work smarter while retaining the personal touch that makes our products special and delights our customers. To address this, we created the AI-Human Workflow Matrix: a structured approach that helps us decide when to assign tasks to AI rather than to human developers.

By keeping human and AI metrics separate, we ensure AI stays a helpful tool rather than "metric noise" that masks the true state of our engineering culture.

Research by Lekshmi Murali Rani and colleagues reaches a similar conclusion: distinguishing between human and AI evaluation measures helps ensure that AI remains a useful aid rather than a source of data that distorts our picture of engineering culture.

Measuring What Really Matters

A thriving culture requires looking beyond technical data. I focus on:

  • Psychological Safety: Do team members feel safe taking risks and admitting mistakes?
  • Retention: Low turnover is a primary signal of a healthy environment.
  • Individual Growth: Are people acquiring new skills and participating in mentorship?

As we move into the future with more AI, the key takeaway is to use data for insight and improvement, not as a tool for monitoring or fear. Build trust first, let metrics open constructive conversations, and make clear how that approach serves morale, innovation, and long-term organizational goals.

Finding the Balance: A Final Thought

I strongly believe that a good metric helps us look beyond the number and into how our system works, starting tough conversations that lead to real improvement. If a metric doesn’t encourage this kind of review or help us grow, it’s just noise. The best metrics connect our past performance to our future strategies, turning every data point into a chance for successful transformation.

“We do not learn from experience . . . we learn from reflecting on experience.”
― John Doerr, Measure What Matters

In the end, our job as leaders isn’t to manage numbers—it’s to lead people. Metrics are like cockpit instruments: they show us our altitude and speed, but not where we want to go or how the team feels. When we focus on "how well" instead of "how much," we help engineers do their best work. If we put trust and psychological safety first, performance usually improves on its own.

We build better software when we remember that the team is the most important system we manage.

References

Song, F., Agarwal, A. & Wen, W. (2024). The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot. arXiv preprint arXiv:2410.02091. https://doi.org/10.48550/arXiv.2410.02091

(2025). Context Rot: The Silent Performance Killer in Your RAG System. CodeBrains. https://www.codebrains.co.in/blog/2025/ai/context-rot-silent-performance-killer-in-your-rag-system

(January 14, 2026). AI PRs Wait 4.6x Longer: LinearB 2026 Benchmarks. https://byteiota.com/ai-prs-wait-4-6x-longer-linearb-2026-benchmarks/

Barb, A. S., Neill, C. J., Sangwan, R. S. & Piovoso, M. J. (2014). A statistical study of the relevance of lines of code measures in software projects. Innovations in Systems and Software Engineering 10(3), pp. 243-260. https://doi.org/10.1007/s11334-014-0231-5

Pogorelec, A. (October 23, 2025). When AI writes code, humans clean up the mess. Help Net Security. https://www.helpnetsecurity.com/2025/10/24/ai-written-software-security-report/

Fragiadakis, G., Diou, C., Kousiouris, G. & Nikolaidou, M. (2024). Evaluating Human-AI Collaboration: A Review and Methodological Framework. arXiv preprint. https://doi.org/10.48550/arXiv.2407.19098

Costa, L. A., Dias, E., Ribeiro, D. M., Fontão, A., Pinto, G., Santos, R. P. & Serebrenik, A. (2024). An Actionable Framework for Understanding and Improving Talent Retention as a Competitive Advantage in IT Organizations. arXiv preprint arXiv:2402.01573. https://doi.org/10.48550/arXiv.2402.01573
