An Evidence-First Scorecard for AI Engineering

Stop asking teams how mature their AI practice is. Let the AI inspect the working system first.

Most maturity models are survey-first. They ask how mature the team thinks it is, then maybe turn that into a score.

I wanted something more practical for small teams: give the agent access to the repo, tickets, PRs, CI, instructions, skills, automations, and workflow artifacts. Have it inspect the evidence first. Then let it ask a few targeted questions where the evidence is missing.

I ran a first version of that scorecard on two repos I know well.

It was useful in the annoying way good tools are useful. It pointed at things I already knew but had not made operational.

The scorecard called out that I was not using recurring automation enough. In wandir.com, I have a scraper that refreshes event data for the site. At the beginning, it made sense to run it manually. There were bugs and it needed handholding. I wanted to watch it.

But after a while, that stopped being a debugging strategy and became an unnecessary habit.

Supersubset was useful in a different way. It also showed an automation gap, but it raised a harder issue: I could feel that AI was making me faster, but I was not really measuring the effectiveness of the harnesses, skills, and agents around the work.

Sometimes my lived experience says “this is roughly what would have taken a scrum before.” The scorecard cannot score that feeling very well unless there is evidence it can inspect.

Why I Built It

This started as one of those “I wonder if I could use AI for that” ideas.

When I talk to a client or a prospective client, there is usually an early discovery phase. What is going on? Where is the team strong? Where is the process getting in the way? What should they do next?

That conversation still matters. I do not want AI to replace it. But why not automate some of the first pass by having the agent inspect the repo and workflow evidence first?

The output is not just a number. It also includes suggested next steps.

AI Adoption Is Not Just Coding Faster

AI touches the whole development process.

If product management still writes high-fidelity screens and tosses them over the wall to development, the team is not going to get the full benefit. The same is true for design, QA, release testing, bug triage, and incident response.

The scorecard is intentionally small-team focused. It is not trying to be an enterprise governance model.

I care about the practical operating system around the work.

Can a fresh AI agent understand the repo?

Can an agent implement from a spec and verify its work?

Does review catch AI-specific failure modes?

Are human judgment gates explicit?

Can long-running work survive multiple sessions?

Does the team maintain its prompts, rules, skills, and instructions?

Is AI usage a shared team capability, or is every developer quietly inventing their own private setup?

Those are the questions that matter to me for small teams.

Where This Fits

There are plenty of useful maturity models and scorecards already.

Coder has an AI maturity self-assessment for engineering organizations adopting coding agents, plus a follow-up from 100 teams showing that agent adoption is running ahead of environment readiness, governance, and measurement.

Defra has a broader AI SDLC maturity framework across technical and cultural dimensions. AI-MM SET is an open software-engineering-team maturity model. Microsoft’s Copilot Studio maturity model is more enterprise-agent and governance oriented. Thoughtworks has been writing about AI-first software delivery across the lifecycle.

I collected the maturity-model links I found useful here: AI maturity models.

The Bottleneck Moves

A year ago, I was still very much in the driver’s seat. AI made me faster, but I was still doing the work in a fairly traditional way.

By late 2025 and early 2026, the models were good enough that more spec-driven development started to feel real. I could give an agent a well-scoped task and let it grind for a while.

To get a higher force multiplier, a developer needs to be able to keep multiple streams moving at once. That is why development-environment concurrency became its own score: can the project support parallel work by isolating ports, databases, branches, and environments locally, or by using cloud agents where that cost makes sense?
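One way to make local isolation concrete is to derive each workspace's port and database name from its branch name, so parallel agents do not collide. This is a minimal sketch, not something the scorecard prescribes. The function name `workspace_settings`, the base port 3000, the range of 1000 ports, and the `app_` database prefix are all assumptions for illustration.

```python
import hashlib
import re


def workspace_settings(branch: str, base_port: int = 3000, port_range: int = 1000) -> dict:
    """Derive a stable port and database name from a branch name (hypothetical helper)."""
    digest = hashlib.sha256(branch.encode("utf-8")).hexdigest()
    # The same branch always maps to the same offset, so restarts keep their port.
    offset = int(digest[:8], 16) % port_range
    # Keep only characters that are safe in most database names.
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    return {
        "branch": branch,
        "port": base_port + offset,
        # A short hash suffix keeps names distinct when two branches share a slug.
        "database": f"app_{slug}_{digest[:6]}",
    }
```

Two branches can still hash to the same port within a small range, so a real setup would probably also check whether the port is free before starting a server.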

Prompt And Context Debt

Prompt debt is becoming a real thing.

Because we use AI to generate instructions for AI, the instructions can get bloated fast. Agents love adding lists and repeating themselves.

That wastes context.

The code analogy is pretty straightforward: keep it DRY. Do not repeat yourself. If the same instruction appears in three places, ask whether it belongs in one canonical place. If a skill file has a long section that never affects the agent’s behavior, cut it.

Recurring automation fits here too. Ask the agent to review skill files and instructions for redundancy, stale references, and unnecessary text. Have it open a PR. Then a human reviews it.
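A cheap first pass at the redundancy check does not even need a model. The sketch below flags lines that show up in more than one instruction file. It assumes the files are already loaded into a dictionary of name to text, and `normalize`, `find_repeated_instructions`, and the five-word minimum are hypothetical choices for illustration.

```python
from collections import defaultdict


def normalize(line: str) -> str:
    """Lowercase, drop leading list or heading markers, and collapse whitespace."""
    return " ".join(line.strip().lstrip("-*#> ").lower().split())


def find_repeated_instructions(files: dict[str, str], min_words: int = 5) -> dict[str, list[str]]:
    """Map each normalized line found in more than one file to the files that contain it."""
    seen = defaultdict(set)
    for name, text in files.items():
        for line in text.splitlines():
            key = normalize(line)
            # Skip very short lines, which tend to be headings or generic labels.
            if len(key.split()) >= min_words:
                seen[key].add(name)
    return {line: sorted(names) for line, names in seen.items() if len(names) > 1}
```

Exact-match detection only catches copy-paste duplication. Paraphrased repeats still need an agent or a human reviewer, which is where the PR review step comes in.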

What The Scorecard Cannot See

The scorecard is evidence-based by design. If the evidence is in the repo, git history, tickets, PRs, CI, automations, or connected systems, the agent has something to inspect. If the evidence lives only in people’s heads, private chats, Slack threads, or a developer’s personal AI setup, the scorecard may miss it.

Read the score as evidence from one vantage point, not as truth.

If it points you in a useful direction, it is useful. If it misses something important, improve the evidence or improve the scorecard.

For example: am I spending tokens efficiently? How much of my context window is boilerplate? Which instructions are actually earning their keep? Where is context being wasted?

I couldn’t find a good way for the scorecard to assess that. Maybe you could infer some of it from instruction file size, duplication, and stale references. But I would love tooling that maps where context is going and whether it is helping. Ideas (and improved tooling) are most welcome! The best idea I’ve seen so far is to keep a set of repeatable dummy tasks that you run periodically, just to measure token use. Most of what I found was targeted at agentic systems, not dev harnesses.
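The size and stale-reference signals could be approximated with something like the sketch below. It is a rough heuristic, not part of the scorecard. The four-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and the regular expression treats any backtick-quoted string containing a slash or a file extension as a path, so it can produce false positives. The names `rough_token_estimate` and `stale_references` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: backtick-quoted strings with a slash or a file extension are path references.
PATH_PATTERN = re.compile(r"`([\w./-]+(?:/[\w.-]+|\.\w+))`")


def rough_token_estimate(text: str) -> int:
    """Estimate tokens at about four characters each (a rule of thumb, not a tokenizer)."""
    return len(text) // 4


def stale_references(instruction_text: str, repo_root: Path) -> list[str]:
    """Return referenced paths missing under repo_root, in order of first appearance."""
    missing = []
    for ref in PATH_PATTERN.findall(instruction_text):
        if ref not in missing and not (repo_root / ref).exists():
            missing.append(ref)
    return missing
```

Tracking these two numbers per instruction file over time would at least show whether context is growing and whether references are rotting, even if it says nothing about whether the instructions help.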

Audio is another example. I have started talking to agents more. I know other advanced users do this too. Is that a maturity signal? Maybe. But there may be no durable file-system evidence that it happened.

This is a v0.1 scorecard. It should improve with feedback, and it should change as the industry changes. If better ways appear to measure token efficiency, context bloat, or instruction usefulness, those should become part of the assessment.

The Scorecard

The scorecard is here: github.com/wandir-tech/agentic-engineering-scorecard.

Run it on a repo. Tell me where it is wrong. I would like to learn that.