# Spec-First on Azure: How One Mid-Size Team Broke Through to Level 4 AI Coding
Most teams plateau at Level 2 AI coding not because their models are weak, but because their inputs are. A spec-first workflow on Azure changes that leverage ratio.
Early-stage thinking. Expect rough edges and frequent updates.
## What I'm Exploring
Most engineering teams that adopt AI coding tools (GitHub Copilot, Azure OpenAI Service, or similar) report the same arc: early wins, then a plateau. Velocity flatlines. Senior engineers end up reviewing more AI-generated code than they expected. Junior engineers produce inconsistent output quality. The tooling gets blamed, but the tooling isn't the problem.
The question I'm working through: what structural change actually unlocks compounding productivity from AI coding tools? The answer I keep arriving at is spec-first development, and Azure's enterprise toolchain turns out to be a surprisingly good platform for making it stick.
## The Plateau Problem
If you've shipped AI coding tooling to your team and seen early enthusiasm cool into "it's fine, I guess," you're probably living at Level 2 of the AI coding capability ladder.
The five-level framework is a useful diagnostic:
| Level | What It Looks Like |
|---|---|
| 1 | Autocomplete, boilerplate generation, tab-to-accept |
| 2 | Prompt → generate → manually review and patch |
| 3 | Iterative dialogue: developer steers the model through multiple turns |
| 4 | Developer writes a structured spec; AI executes multi-file, multi-step implementation with high fidelity |
| 5 | Autonomous agent loops with minimal human checkpoints |
Level 2 is where most teams live. It's not useless; it's just not compounding. Each generation event is a one-off. The model has no persistent context, no understanding of your constraints, and no way to make consistent decisions across files or across developers.
The jump from Level 2 to Level 4 isn't about finding a better prompt. It's about changing what you feed the model before you ask it to generate anything.
## The Spec-First Approach Defined
A spec-first workflow means writing a structured, machine-readable implementation specification before any code generation begins. This is not a PRD. It's not an inline comment. It's a formal artifact that lives in the repository alongside the feature branch.
A minimal spec covers five things:
```markdown
## Feature Intent
[1–3 sentences: what this does and why it exists]

## Interface Contracts
- Inputs: [types, shapes, sources]
- Outputs: [types, shapes, destinations]
- Side effects: [state changes, external calls]

## Constraints
- Azure services in scope: [e.g., Azure Service Bus, Cosmos DB, APIM]
- Auth pattern: [e.g., Managed Identity, OAuth 2.0 with MSAL]
- Error handling expectations: [retry policy, dead-letter behavior]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Out of Scope
[Explicit exclusions prevent scope creep in generation]
```
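To make the template concrete, here is a hypothetical filled-in spec. The feature, service choices, and values are invented for illustration:

```markdown
## Feature Intent
Publish order-status changes to downstream services so fulfillment and
notification systems stay current without polling the orders database.

## Interface Contracts
- Inputs: OrderStatusChanged events (JSON: orderId string, status enum, timestamp)
- Outputs: messages on the order-events Service Bus topic
- Side effects: status history appended to the Orders container in Cosmos DB

## Constraints
- Azure services in scope: Azure Service Bus (Standard), Cosmos DB (session consistency)
- Auth pattern: Managed Identity for all service-to-service calls
- Error handling expectations: three retries with exponential backoff, then dead-letter

## Acceptance Criteria
- [ ] Every status change produces exactly one message on the topic
- [ ] Dead-lettered messages retain the original payload and a failure reason

## Out of Scope
Consumer-side processing; schema versioning beyond v1.
```

Note that every line is checkable: a reviewer can validate the spec in minutes, and the model has no room to guess at auth or retry behavior.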
Why does this work? Large language models are context-completion engines. A dense, structured spec collapses ambiguity before generation begins. The model isn't guessing at your intent, your constraints, or your Azure resource topology; it's reading them. The output fidelity difference is not marginal; it's categorical.
The workflow shift, in outline (in place of the original diagram):
- Level 2: prompt → generate → review in PR → patch → repeat, per developer and per feature
- Level 4: spec authored and reviewed once → AI generates the multi-file implementation → output validated against the spec's acceptance criteria
The critical difference: in the Level 4 workflow, the quality gate moves upstream into planning, not downstream into PR review. Seniors review the spec once; the AI and junior engineers execute against it. This is the leverage inversion.
## Azure as the Enabling Platform
The spec-first approach works on any platform, but Azure's enterprise toolchain makes specs durable and discoverable rather than ephemeral.
GitHub Copilot + Azure DevOps integration means spec files checked into the repository become persistent context artifacts. Copilot reads them as part of the workspace context. A spec written for a feature in Sprint 4 is still readable context when a similar feature appears in Sprint 9.
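One workable convention (the directory name and layout here are an assumption, not a Copilot requirement) is a top-level specs/ directory, so specs ship with the code and stay inside the workspace context:

```
repo/
├── specs/
│   ├── order-events.md      # Sprint 4 feature spec
│   └── invoice-export.md    # later features reuse it as context
├── src/
└── infra/
```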
Azure OpenAI Service provides model hosting with enterprise data boundaries: your spec files and generated code don't become training data. For teams in regulated industries, this matters before you even get to the productivity conversation.
Azure Bicep and IaC generation extend the spec-first pattern to infrastructure. A spec that names the Azure services in scope (Service Bus SKU, Cosmos DB consistency level, APIM policy requirements) can drive Bicep generation with the same fidelity improvement as application code generation.
Azure AI Foundry (formerly Azure AI Studio) is the forward-looking piece. Teams moving toward agent-based Level 4 workflows, where the AI executes a sequence of spec-driven tasks autonomously, will find Foundry's orchestration primitives relevant. This is still early for most mid-size teams, but the spec-first habit is the prerequisite capability.
## What This Looked Like for a Real Team
The team profile: approximately 25 engineers, mixed seniority (roughly 8 seniors, 17 mid-to-junior), Azure-native SaaS product, shipping on two-week sprints.
Before the intervention: Developers prompted ad hoc. Output quality varied by individual prompting skill. Senior engineers were spending disproportionate time in PR review, catching AI-generated code that missed service-specific constraints or auth patterns. The AI tooling was net-positive but not compounding.
The intervention: Introduced spec-first templates for all feature work. Specs were authored by seniors during sprint planning and reviewed as a planning artifact, not as code. Junior engineers used the specs as primary context for AI-assisted implementation.
After two sprints: PR revision cycles on AI-generated code dropped measurably. Junior engineers were producing multi-file implementations that matched senior expectations on the first submission, not because their prompting improved but because the spec removed the ambiguity that caused misses. Seniors shifted time from PR correction to spec authorship and architecture.
The key insight: one well-written spec by a senior engineer unlocked multiple junior implementations. The spec became the leverage point. The AI became the execution layer. The junior engineer became the validation layer. This is a fundamentally different division of labor than the Level 2 workflow.
## Open Questions
- How do specs age? A spec written in Sprint 4 may be misleading by Sprint 12 if the underlying architecture has shifted. What's the governance model for spec freshness?
- What's the right granularity? Feature-level specs work for discrete capabilities. Task-level specs may be needed for complex, multi-team work. File-level specs risk becoming a maintenance burden. The right level likely varies by team maturity.
- Does this hold for legacy codebases? Spec-first is straightforward for greenfield features. For legacy systems with undocumented constraints, the spec-authoring burden increases significantly, and the risk of a misleading spec increases with it.
- How do you govern spec quality without creating a new bottleneck? If spec authorship becomes a senior-only activity with no review process, you've just moved the bottleneck upstream. What does a lightweight spec review look like in practice? A mechanical first pass is sketched below.
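On that last question, one partial answer is a CI check that enforces structural completeness before a human reviews substance. A minimal sketch, assuming specs live under specs/ and follow the five-section template above (both assumptions, not a standard):

```python
import re
import sys
from pathlib import Path

# Section names taken from the minimal template in this note.
# Adjust if your template diverges.
REQUIRED_SECTIONS = [
    "Feature Intent",
    "Interface Contracts",
    "Constraints",
    "Acceptance Criteria",
    "Out of Scope",
]

def lint_spec(path: Path) -> list[str]:
    """Return a list of structural problems found in one spec file."""
    text = path.read_text(encoding="utf-8")
    problems = []
    for section in REQUIRED_SECTIONS:
        # Each required section must appear as its own '## ...' heading.
        if not re.search(rf"^##\s+{re.escape(section)}\s*$", text, re.MULTILINE):
            problems.append(f"missing section: {section}")
    # At least one checkbox-style acceptance criterion must exist.
    if not re.search(r"^- \[[ x]\]", text, re.MULTILINE):
        problems.append("no acceptance-criteria checkboxes found")
    return problems

if __name__ == "__main__":
    failed = False
    for spec in sorted(Path("specs").glob("**/*.md")):
        for problem in lint_spec(spec):
            print(f"{spec}: {problem}")
            failed = True
    sys.exit(1 if failed else 0)
```

This catches only structure, not substance; a human still has to judge whether the constraints and acceptance criteria are right.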
## Practical Starting Points
- Audit your current AI coding level. Map one recent feature to the five-level framework. If your developers are mostly prompting and patching, you're at Level 2. That's the baseline.
- Start with a minimal spec template. The five-section template above is a starting point. Check it into the repo alongside the feature branch, not in a wiki, not in a ticket.
- Assign spec authorship to seniors, implementation to AI + juniors. Run this as an explicit role assignment for two sprints, not an implicit expectation.
- Link specs to Azure DevOps work items. This creates traceability and makes specs discoverable for future similar work, compounding the value of the authoring investment. (One way to automate the linking is sketched after this list.)
- Run a two-sprint pilot before scaling. Pick two features: one spec-first, one standard. Compare PR revision cycles and time-to-merge. The delta is your business case for broader adoption.
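For the work-item linking step, a minimal sketch using the Azure DevOps REST API: work items accept JSON Patch updates, and a Hyperlink relation can point at the spec file in the repo. The organization, project, IDs, and URLs below are placeholders, and the personal access token needs work-item write scope:

```python
import base64

import requests

# Placeholders: substitute your own organization, project, and IDs.
ORG = "your-org"
PROJECT = "your-project"
WORK_ITEM_ID = 1234
PAT = "your-personal-access-token"  # needs work-item write scope
SPEC_URL = (
    "https://dev.azure.com/your-org/your-project/_git/your-repo"
    "?path=/specs/order-events.md"
)

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/wit/workitems/"
    f"{WORK_ITEM_ID}?api-version=7.1"
)

# Work-item updates use JSON Patch; appending to /relations/- adds a
# new relation, and "Hyperlink" is a built-in relation type.
patch = [
    {
        "op": "add",
        "path": "/relations/-",
        "value": {
            "rel": "Hyperlink",
            "url": SPEC_URL,
            "attributes": {"comment": "Implementation spec for this work item"},
        },
    }
]

# Azure DevOps PATs use Basic auth with an empty username.
token = base64.b64encode(f":{PAT}".encode()).decode()
response = requests.patch(
    url,
    json=patch,
    headers={
        "Content-Type": "application/json-patch+json",
        "Authorization": f"Basic {token}",
    },
)
response.raise_for_status()
print(f"Linked spec to work item {WORK_ITEM_ID}")
```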
## References
- Five-level AI coding capability framework: adapted from practitioner frameworks circulating in the AI engineering community (source: video essay on the five levels of AI coding capability)
- Azure OpenAI Service documentation: enterprise data boundaries and model hosting
- GitHub Copilot workspace context behavior: repository-level context reading
- Azure AI Foundry: agent orchestration primitives for Level 4+ workflows
