Agentic Engineering
SDLC 2.0 · AI-native software development
The teams getting transformative results from AI agents treat them like junior engineers: clear specs, bounded scope, mandatory review. The teams getting nothing hand agents vague instructions and hope for magic.
I install the systems that make the difference: the specs, boundaries, review pipelines, and context engineering that separate 3× teams from teams drowning in AI-generated debt.
[Headline metrics: engineers using AI weekly · more bugs in AI code · review time increase · throughput with agents]
Sources: Pragmatic Engineer survey (900 engineers, 2026), GitClear, incident.io
Agentic Development Lifecycle
Specify
CLAUDE.md · AGENTS.md · Skills
Agent drafts → Human approves scope
Plan
Plan Mode · MCP Servers · Subagents
Agent explores → Human decides architecture
Implement
Claude Code · Codex · Worktrees · Agent Teams
3-5 agents in parallel → PRs for review
Verify
Hooks · CodeRabbit · Claude Code Review
AI catches 70-80% → Human owns final call
Test
Agent Teams · Sandbox · Browser Automation
Agents write + run → Human reviews coverage
Deploy
Rolling Releases · Preview Deployments · Gates
Automated rollout → Human approves prod
Monitor
SRE Agents · Datadog · Scheduled Agents
Agents triage → Human handles novel failures
The methodology
Every engagement follows the same pipeline. I audit your current workflow, install agent-native tooling at each SDLC phase, train your team to operate it, and measure the delta. The core deliverable is context engineering: the specifications, memory systems, and boundary policies that make agents reliably useful instead of reliably dangerous.
01 · Specification & context engineering
CLAUDE.md · AGENTS.md · Plan Mode · GitHub Spec Kit
The spec is now the primary engineering artifact, not the code. GitHub's analysis of 2,500+ agent configuration files found six essential components: executable commands, testing strategy, project structure, code style via examples, git workflow conventions, and agent boundaries. Teams that skip this get what Columbia University calls “silent bugs”: code that runs without errors but implements business logic incorrectly.
# CLAUDE.md — three-tier agent boundaries
## Always (agent acts freely)
- Run tests, linting, type checks
- Read any file in the repository
- Create branches and commits
## Ask First (requires human approval)
- Modify database schemas
- Change authentication logic
- Delete files or directories
## Never (hard stops)
- Commit .env files or secrets
- Push directly to main
- Modify CI/CD pipeline configs
Anthropic's guidance on context engineering: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.” ETH Zurich found LLM-generated AGENTS.md files actually hurt performance in 5 of 8 settings. Human-curated files improve outcomes by ~4%. The lesson: keep context files under 150 lines, document only what agents cannot discover by reading the codebase, and never auto-generate them.
02 · Architecture & planning
Skills · MCP Servers · Agents · Subagents
The spec decomposes into a blueprint via specialized agents with isolated context and restricted tool sets. Stripe's architecture is the reference: their “blueprints” mix deterministic nodes (linters in under 5 seconds, test selection from 3+ million tests) with agentic nodes (implementation, CI failure fixing). Their internal MCP server hosts nearly 500 tools, with agents receiving a small default set plus task-specific additions.
# .claude/skills/react-patterns/SKILL.md
---
name: React Patterns
description: Project-specific React conventions and component patterns
globs: ["src/components/**/*.tsx", "app/**/*.tsx"]
allowed-tools: ["Read", "Glob", "Grep"]
---
## Component Structure
- Use function components with named exports
- Colocate types in the same file
- Server components by default, 'use client' only for interactivity
## State Management
- React 19 use() for data fetching in server components
- URL state (searchParams) over client state for filters
Skills are the context-engineering primitive that MCP doesn't cover. They're markdown files with frontmatter that load progressively: only ~100 tokens of metadata at startup, full body when triggered by relevance match. They teach agents process knowledge: how your team deploys, how you write tests, what conventions to follow. Andrew Ng's Context Hub is the canonical example: a CLI + Skill that tells the agent “run chub get langgraph/package before writing any LangGraph code” instead of guessing from training data. MCP gives agents access to systems; Skills give agents access to judgment.
MCP servers connect agents to live infrastructure: databases, APIs, monitoring dashboards. 65% of developers report AI assistants “miss relevant context.” MCP closes that gap by giving agents access to the same systems your engineers use.
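Registering a server is a one-line CLI call per system; a minimal sketch using Claude Code's mcp add command, where the Postgres server package and the read-only connection string are illustrative assumptions:
# Register an MCP server so agents can query the same database engineers use
# (server package and connection string are illustrative)
$ claude mcp add postgres -- npx -y @modelcontextprotocol/server-postgres "postgresql://readonly@db.internal:5432/app"
$ claude mcp list  # list configured servers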
03 · Parallel implementation
Git Worktrees · Claude Code · Claude Squad · Codex · Agent Teams
Multiple agents work simultaneously on independent tasks, each in its own git worktree (an isolated checkout with its own branch). Three focused agents consistently outperform one generalist working three times as long. The sweet spot is 3-5 concurrent agents; beyond that, coordination overhead dominates.
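A worktree is a plain git feature, so the isolation can also be set up by hand before any agent is spawned; a minimal sketch, with directory and branch names as illustrative placeholders:
# One isolated checkout per task: same repository, separate directory and branch
# (directory and branch names are illustrative)
$ git worktree add ../app-api-invoices -b agent/api-invoices
$ git worktree add ../app-invoice-table -b agent/invoice-table
$ git worktree add ../app-invoice-tests -b agent/invoice-tests
$ git worktree list  # shows every checkout and the branch it points at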
# Spawn 3 parallel agents on independent tasks
$ claude -w api-invoices "Implement POST /api/invoices endpoint per spec §3.2"
$ claude -w invoice-table "Build InvoiceTable component with sorting and filters"
$ claude -w invoice-tests "Add integration tests for invoice lifecycle"
# Each agent works in an isolated worktree — separate filesystem, separate branch
# PRs created on completion for human review
# Plan mode — read-only exploration before writing code
$ claude --permission-mode plan
Anthropic validated this at scale: 16 Claude Opus agents built a C compiler in Rust over ~2,000 sessions, passing 99% of GCC torture tests. incident.io reported an 18% build performance improvement from a single worktree session that cost $8 in API credits. The Pragmatic Engineer survey confirms Claude Code as most-used and most-loved (46%), with most high-output developers using the “power combination” of an IDE agent plus a terminal agent.
04 · AI code review
Hooks · CodeRabbit · Claude Code Review · Human Review
Every PR gets AI review before a human sees it, but as a first-pass filter, not a replacement. CodeRabbit detects real-world runtime bugs with 46% accuracy across 13M+ PRs, but 28% of its comments are noise. The best teams use AI review to catch low-hanging fruit, freeing human reviewers for architecture, business logic, and institutional context.
// .claude/settings.json — hooks enforce quality gates structurally
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "npx eslint --fix $FILE && npx tsc --noEmit"
      }]
    }],
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "pnpm test --related $FILE"
      }]
    }]
  }
}
Hooks are hard gates, not optional suggestions. They execute on every change from every agent, blocking the action if they fail. PreToolUse hooks run linting and type-checking before the agent writes. PostToolUse hooks validate that modified files still pass related tests.
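Beyond lint and test commands, a hook can be a script that refuses the action outright. A minimal sketch of a PreToolUse guard, assuming the hook receives the tool call as JSON on stdin with a tool_input.file_path field and that a non-zero exit blocks the write (verify both against your hook runner's documentation):
#!/usr/bin/env bash
# block-protected-writes.sh: refuse edits to secrets and CI config
# (stdin JSON shape and the blocking exit code are assumptions)
file=$(jq -r '.tool_input.file_path // empty')
case "$file" in
  *.env*|*/.github/workflows/*|.github/workflows/*)
    echo "Blocked: $file is on the Never list" >&2
    exit 2  # non-zero exit blocks the tool call; stderr goes back to the agent
    ;;
esac
exit 0
Wire the script into settings.json as the command for an Edit|Write matcher, exactly like the lint and test commands above.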
05 · Testing & validation
Agent Teams · Sandbox · Mutation Testing · Browser Automation
Agent teams handle testing in parallel with strict isolation. A test-writer generates tests from the spec. A browser-automation agent verifies UI flows via headless Chrome. A security-audit agent scans for OWASP Top 10. Each runs with restricted tool access; the test agent can read source but not modify it.
# .claude/agents/test-writer/AGENT.md
---
description: Generates unit and integration tests from spec
model: sonnet
tools: ["Read", "Grep", "Glob", "Write", "Bash"]
permissionMode: default
isolation: worktree
---
Write tests for the provided spec section.
Do not modify source files — only create files in tests/.
Run the test suite after writing and fix any failures.
Meta deployed LLM-based mutation testing across Facebook, Instagram, and WhatsApp with a 73% acceptance rate from privacy engineers. It is the most credible production case study of AI-assisted testing at scale. The shift is from AI-assisted (you define what to test) to AI-agentic (AI identifies what needs testing, generates suites, runs them, reports results).
06 · Deployment & rollout
Rolling Releases · Preview Deployments · Smart Test Selection · Gated Pipelines
Deployment is where agents execute but humans own the decision. Stripe's reference: selective test execution from 3+ million tests, automated linting in under 5 seconds, a hard limit of two CI rounds per agent run, and human review only after all automated gates pass. Preview deployments generate production-equivalent URLs for every PR. Rolling releases route traffic incrementally, auto-promote or auto-rollback based on real signals.
# Rolling release — promote or rollback based on real signals
deploy:
  strategy: canary
  stages:
    - percentage: 1%
      duration: 10m
      promote_if:
        error_rate: < baseline * 1.05
        p95_latency: < baseline * 1.10
      rollback_if:
        error_rate: > baseline * 1.20
        5xx_count: > 0
    - percentage: 5%
      duration: 5m
    - percentage: 25%
      duration: 5m
    - percentage: 100%
07 · Monitoring & autonomous agents
Datadog Bits AI · AWS DevOps Agent · incident.io · Scheduled Agents
The final phase belongs to agents that run continuously without human prompting. Datadog's Bits AI reads monitors, checks runbooks, generates root cause hypotheses, and tests each one: “what once took more than 30 minutes now happens before you've opened your laptop.” incident.io reported Intercom's AI generating “the exact fix their team would have implemented, in 30 seconds instead of 30 minutes,” reducing MTTR by 37%.
# Scheduled agents — cron-triggered, sandboxed, PR-based output
$ claude -p "Audit dependencies for CVEs. Create PR if patches found."
$ claude -p "Run performance regression suite. Page oncall if p95 > 200ms."
$ claude -p "Generate weekly engineering metrics from Linear + GitHub."
# Headless mode (-p) for non-interactive, scriptable agent runs
# Each runs in isolation, outputs a PR or notification
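The simplest scheduler is plain cron on a runner box; a minimal sketch, with the repository path, cadence, and log location as illustrative assumptions:
# crontab entry: dependency audit every Monday at 06:00, output arrives as a PR
# (repo path, schedule, and log file are illustrative)
0 6 * * 1  cd /srv/app && claude -p "Audit dependencies for CVEs. Create PR if patches found." >> /var/log/agent-audit.log 2>&1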
What doesn't work
The “Almost Solved” trap is the most dangerous failure mode. The 80% AI handles well creates false confidence about the 20% it handles catastrophically.
No specs, many agents
You are multiplying coordination work, not eliminating it. Multi-agent setups without specifications produce 10x the cost of a single well-prompted agent.
Review collapse
AI-generated PRs are 33% larger and review time has climbed 91%. Teams that skip review gates accumulate silent bugs: code that runs but implements business logic incorrectly.
No observability
Skipping monitoring for agent runs means you cannot distinguish good output from confidently wrong output. A well-observed system running a mediocre framework outperforms an unobserved system running the best one.
Safety shortcuts
An engineer watched Claude Code destroy a live database; years of data gone because safety checks were removed. Amazon convened a "deep dive" after service disruptions traced to AI-assisted changes.
AI-generated context files
ETH Zurich found LLM-generated AGENTS.md files hurt performance in 5 of 8 settings, reducing success ~3% while increasing costs 20-23%. Human-curated files are the only ones that work.
The stack
Tools I install in client codebases.
coding agents: Claude Code, Codex, Cursor, Devin
orchestration: CLAUDE.md, AGENTS.md, Skills, Hooks, Agent Teams
parallelism: Git Worktrees, Claude Squad, Agent Teams, Codex Cloud
integration: MCP Servers, GitHub Actions, Stripe Toolshed
code review: Hooks, CodeRabbit, Claude Code Review, Human Review
deployment: Rolling Releases, Smart Test Selection, Gated Pipelines
monitoring: Datadog Bits AI, AWS DevOps Agent, incident.io
I don't just teach this. I ship with it daily. This site was built with parallel Claude Code agents across git worktrees. See my consulting engagements or email me at raman.shrivastava.7@gmail.com.