Inside six months of AI agents in a production engineering workflow

One engineering team spent six months running AI agents in a production workflow on a crypto exchange. Over that period, throughput rose and rework fell, and most of the improvement came from tighter delegation, scope control, and review.

The team set one rule at the start: agent output had to go through the same review gates and CI/CD pipeline as human-written code. High-risk surfaces such as auth, permissions, and key management were excluded by default.

The team was trying to save engineering time on routine implementation work, not change how decisions were made. Branch setup, boilerplate, endpoint wiring, and baseline tests were consuming engineering cycles that could have gone to architecture work, threat modeling, and failure analysis. The goal was to push repetitive implementation work to agents and keep review, architecture, and security decisions with engineers.

Over six months, features shipped per engineer per week rose from 1.2 to 3.6, while rework on agent-drafted PRs fell from 32% to 9%. The change failure rate stayed roughly flat, moving from 0.8% to 0.9%.

How the workflow is structured

The production sequence is fixed: product idea → triage → engineer → agent drafts PR → engineer review → team review → agent fix loop → merge.

Triage turned out to be the most important step. Before delegation, the engineer prepares a bundle with acceptance criteria, explicit constraints, links to prior art, and a risk classification. Early failures showed that underspecified inputs quickly produced bad output, so the bundle became mandatory.
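One way to make the bundle mandatory is to represent it as a small data structure that refuses delegation when a field is empty. The sketch below is illustrative only: the class name, field names, and the rule that high-risk work is never auto-delegated are assumptions, not the team's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class DelegationBundle:
    """Hypothetical container for the four items the triage step requires."""

    acceptance_criteria: list[str]
    constraints: list[str]
    prior_art_links: list[str]
    risk: Risk

    def missing_fields(self) -> list[str]:
        """Return the names of required list fields that are empty."""
        required = {
            "acceptance_criteria": self.acceptance_criteria,
            "constraints": self.constraints,
            "prior_art_links": self.prior_art_links,
        }
        return [name for name, value in required.items() if not value]

    def ready_for_delegation(self) -> bool:
        # Assumption: a bundle classified as high risk is never delegated
        # automatically, mirroring the "excluded by default" rule.
        return not self.missing_fields() and self.risk is not Risk.HIGH


bundle = DelegationBundle(
    acceptance_criteria=["User can generate a new set of backup codes"],
    constraints=["No schema changes"],
    prior_art_links=[],  # nothing linked yet, so delegation is blocked
    risk=Risk.LOW,
)
print(bundle.missing_fields())
print(bundle.ready_for_delegation())
```

A check like this only confirms that the bundle is complete. Whether the acceptance criteria are good enough is still a judgment the engineer makes.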

From that point, the agent stays within a narrow scope, drafting a PR from existing repository patterns and conventions. If an interface or convention is missing, the agent is expected to surface that and ask rather than improvise. Output goes through the standard review and CI/CD pipeline with no modifications.

One internal example

For one 2FA backup codes feature, the sequence looked like this:

  • 10:15 – Engineer delegates with acceptance criteria and links to prior work. 
  • 10:35 – Agent opens a PR: UI and API changes, feature flag wiring, baseline tests included. 
  • 11:00 – Engineer reviews for intent, correctness, and security properties. 
  • 11:30 – Two-engineer team review covers service boundaries and failure modes. 
  • 12:00 – Merge.

Total active engineering time: around 45 minutes of the 1 hour 45 minutes between delegation and merge. Most of it went to review and edge case analysis.

Failure mode breakdown

Month one: 32% of agent-drafted PRs required meaningful rework. The team classified failures by root cause.

Hallucinated integrations: 18% of failures. The agent assumed SDK methods existed or fabricated API contracts. Fix: required citations to internal interfaces. If the agent cannot reference a real source, it stops and flags the gap.
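The citation rule can be pictured as a lookup against a registry of known interfaces. This is a minimal sketch under stated assumptions: the registry contents, the interface names, and the function names are all hypothetical, and a real check would more likely resolve symbols against the repository itself.

```python
# Hypothetical registry of internal interfaces the agent is allowed to cite.
KNOWN_INTERFACES = frozenset({
    "auth_sdk.generate_backup_codes",
    "auth_sdk.verify_totp",
})


def unresolved_citations(cited: list[str], known: frozenset = KNOWN_INTERFACES) -> list[str]:
    """Return cited interface names that are not in the registry, in order."""
    return [name for name in cited if name not in known]


def check_draft(cited: list[str]) -> str:
    """Stop and flag the gap if any cited interface cannot be resolved."""
    missing = unresolved_citations(cited)
    if missing:
        return "STOP: unknown interfaces: " + ", ".join(missing)
    return "OK"


print(check_draft(["auth_sdk.verify_totp"]))
print(check_draft(["auth_sdk.verify_totp", "auth_sdk.rotate_codes"]))
```

The second call stops because `auth_sdk.rotate_codes` is not in the registry. That is the behavior the rule asks for: surface the gap instead of inventing the method.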

Vague specs producing wrong UX: 25% of failures. Prompts like “make this mobile friendly” returned functional but incomplete output. Fix: acceptance criteria with explicit pass/fail examples.
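As a rough illustration of what "explicit pass/fail examples" could mean in practice, the sketch below rejects any criterion that lacks at least one passing and one failing example. The `Criterion` type and the sample wording are assumptions made for this example, not the team's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical acceptance criterion with concrete examples attached."""

    statement: str
    pass_examples: tuple[str, ...] = ()
    fail_examples: tuple[str, ...] = ()

    def is_testable(self) -> bool:
        # A criterion counts as explicit only if it shows both outcomes.
        return bool(self.pass_examples) and bool(self.fail_examples)


def vague_criteria(criteria: list[Criterion]) -> list[str]:
    """Return the statements that still lack a pass or a fail example."""
    return [c.statement for c in criteria if not c.is_testable()]


criteria = [
    Criterion("Make this mobile friendly"),
    Criterion(
        "Settings form is usable on a narrow screen",
        pass_examples=("No horizontal scrolling at the narrowest supported width",),
        fail_examples=("Submit button is pushed off-screen",),
    ),
]
print(vague_criteria(criteria))
```

Only the first criterion is reported, since it states a goal without saying what passing or failing looks like.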

Scope creep as optimization: 22% of failures. Open-ended requests triggered refactors that exceeded the intended change surface. Fix: hard caps on file and change scope, plus a plan-first step requiring engineer sign-off on approach before any code is generated.
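A hard cap on change scope can be checked mechanically before review starts. The sketch below assumes the caps are expressed as a maximum file count and a maximum number of changed lines. The article does not state the team's actual limits, so the numbers here are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeCap:
    """Hypothetical limits agreed during the plan-first step."""

    max_files: int
    max_changed_lines: int


def scope_violations(changed: dict[str, int], cap: ScopeCap) -> list[str]:
    """Compare a draft against the cap.

    `changed` maps each file path to its count of added plus removed lines.
    """
    problems = []
    if len(changed) > cap.max_files:
        problems.append(f"{len(changed)} files changed, cap is {cap.max_files}")
    total = sum(changed.values())
    if total > cap.max_changed_lines:
        problems.append(f"{total} lines changed, cap is {cap.max_changed_lines}")
    return problems


cap = ScopeCap(max_files=2, max_changed_lines=100)  # placeholder values
print(scope_violations({"api/codes.py": 40, "ui/codes.tsx": 30}, cap))
print(scope_violations({"api/codes.py": 90, "ui/codes.tsx": 20, "lib/util.py": 1}, cap))
```

The first draft stays inside both limits and returns an empty list. The second exceeds both and would be sent back before any reviewer reads it.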

Wrong internal patterns: 12% of failures. Code that passed functional review but diverged from internal conventions, introducing maintainability and security risk downstream.

By month six, rework had fallen to 9%, largely after the team tightened delegation rules, scope limits, and review.

Safeguards

Authentication, permissions, withdrawals, and key management default to manual unless a senior engineer explicitly decides otherwise. Every agent-drafted change goes through individual review by the delegating engineer, followed by team review with at least two engineers. CI/CD runs the same checks on agent output as on human-written code: tests, static analysis, dependency hygiene, security scanning.
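The default-to-manual rule could be enforced with a path check along these lines. The directory layout and pattern list are invented for illustration, and the override flag stands in for an explicit decision by a senior engineer.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for the surfaces that default to manual work.
HIGH_RISK_PATTERNS = (
    "services/auth/*",
    "services/permissions/*",
    "services/withdrawals/*",
    "services/keys/*",
)


def requires_manual(paths: list[str], senior_override: bool = False) -> bool:
    """Return True if any touched path is high risk and nobody has overridden."""
    if senior_override:
        return False
    return any(
        fnmatchcase(path, pattern)
        for path in paths
        for pattern in HIGH_RISK_PATTERNS
    )


print(requires_manual(["services/auth/totp/codes.py"]))
print(requires_manual(["ui/settings/page.tsx"]))
print(requires_manual(["services/keys/rotate.py"], senior_override=True))
```

Note that `*` in `fnmatchcase` also matches path separators, so nested files under a high-risk directory are caught as well.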

The agent fix loop works as follows: reviewers leave standard PR comments, then invoke the agent to address specific line items with an explicit instruction not to modify anything else. That keeps reviewers focused on correctness and risk, while the agent handles smaller follow-up edits.
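The "do not modify anything else" instruction can also be verified after the fact by comparing the lines the agent edited with the lines reviewers commented on. The following is a sketch, not the team's tooling: the function name, the data shapes, and the small `window` of tolerated neighboring lines are assumptions.

```python
def out_of_scope_edits(
    commented: dict[str, set[int]],
    edited: dict[str, set[int]],
    window: int = 3,
) -> dict[str, set[int]]:
    """Return edited lines that are not within `window` lines of a review comment.

    Both arguments map a file path to a set of line numbers.
    """
    violations: dict[str, set[int]] = {}
    for path, lines in edited.items():
        # Every line within `window` of a commented line is considered in scope.
        allowed = {
            n
            for c in commented.get(path, set())
            for n in range(c - window, c + window + 1)
        }
        extra = lines - allowed
        if extra:
            violations[path] = extra
    return violations


commented = {"api/codes.py": {40}}
print(out_of_scope_edits(commented, {"api/codes.py": {39, 41}}))
print(out_of_scope_edits(commented, {"api/codes.py": {41, 90}, "ui/form.tsx": {5}}))
```

The first follow-up edit stays near the commented line and produces no violations. The second touches line 90 and a file nobody commented on, so both are reported.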

Before and after

Metric                         | Before | After
Features per engineer per week | 1.2    | 3.6
Rework rate on agent PRs       | 32%    | 9%
Change failure rate            | 0.8%   | 0.9%
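To put the figures on a common scale, the relative change for each metric can be computed directly from the before and after values. The helper below is only a convenience for that arithmetic, and its name is made up for this example.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, rounded to one decimal."""
    return round((after - before) / before * 100, 1)


metrics = {
    "Features per engineer per week": (1.2, 3.6),
    "Rework rate on agent PRs (%)": (32, 9),
    "Change failure rate (%)": (0.8, 0.9),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change_pct(before, after):+.1f}%")
```

Throughput tripled, which is a 200% increase, and the rework rate fell by about 72% in relative terms. The change failure rate rose by 0.1 percentage points, which is about 12.5% relative to its starting value but small in absolute terms.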

What kept the workflow stable

The workflow stayed stable because the team kept delegation standards, scope limits, traceability, and review consistent as usage expanded. Agents took on more of the repetitive implementation work, but engineers still owned architecture decisions, threat modeling, and failure analysis.

