Agentic Engineering
SDLC 2.0 · AI-native software development
The teams getting transformative results from AI agents treat them like junior engineers: clear specs, bounded scope, mandatory review. The teams getting nothing hand agents vague instructions and hope for magic.
I install the systems that make the difference: the specs, boundaries, review pipelines, and context engineering that separate 3× teams from teams drowning in AI-generated debt.
[Headline metrics: engineers using AI weekly · more bugs in AI code · review time increase · throughput with agents]
Sources: Pragmatic Engineer survey (900 engineers, 2026), GitClear, incident.io
Agentic Development Lifecycle
Specify
CLAUDE.md · AGENTS.md · Skills
Agent drafts → Human approves scope
Plan
Plan Mode · MCP Servers · Subagents
Agent explores → Human decides architecture
Implement
Claude Code · Codex · Worktrees · Agent Teams
3-5 agents in parallel → PRs for review
Verify
Hooks · CodeRabbit · Claude Code Review
AI catches 70-80% → Human owns final call
Test
Agent Teams · Sandbox · Browser Automation
Agents write + run → Human reviews coverage
Deploy
Rolling Releases · Preview Deployments · Gates
Automated rollout → Human approves prod
Monitor
SRE Agents · Datadog · Scheduled Agents
Agents triage → Human handles novel failures
The methodology
Every engagement follows the same pipeline. I audit your current workflow, install agent-native tooling at each SDLC phase, train your team to operate it, and measure the delta. The core deliverable is context engineering: the specifications, memory systems, and boundary policies that make agents reliably useful instead of reliably dangerous.
01 · Specification & context engineering
CLAUDE.md · AGENTS.md · Plan Mode · GitHub Spec Kit
The spec is now the primary engineering artifact, not the code. GitHub's analysis of 2,500+ agent configuration files found six essential components: executable commands, testing strategy, project structure, code style via examples, git workflow conventions, and agent boundaries. Teams that skip this get what Columbia University calls “silent bugs”: code that runs without errors but implements business logic incorrectly.
# CLAUDE.md — three-tier agent boundaries
## Always (agent acts freely)
- Run tests, linting, type checks
- Read any file in the repository
- Create branches and commits
## Ask First (requires human approval)
- Modify database schemas
- Change authentication logic
- Delete files or directories
## Never (hard stops)
- Commit .env files or secrets
- Push directly to main
- Modify CI/CD pipeline configs
Anthropic's guidance on context engineering: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.” ETH Zurich found LLM-generated AGENTS.md files actually hurt performance in 5 of 8 settings. Human-curated files improve outcomes by ~4%. The lesson: keep context files under 150 lines, document only what agents cannot discover by reading the codebase, and never auto-generate them.
02 · Architecture & planning
Skills · MCP Servers · Agents · Subagents
The spec decomposes into a blueprint via specialized agents with isolated context and restricted tool sets. Stripe's architecture is the reference: their “blueprints” mix deterministic nodes (linters in under 5 seconds, test selection from 3+ million tests) with agentic nodes (implementation, CI failure fixing). Their internal MCP server hosts nearly 500 tools, with agents receiving a small default set plus task-specific additions.
# .claude/skills/react-patterns/SKILL.md
---
name: React Patterns
description: Project-specific React conventions and component patterns
globs: ["src/components/**/*.tsx", "app/**/*.tsx"]
allowed-tools: ["Read", "Glob", "Grep"]
---
## Component Structure
- Use function components with named exports
- Colocate types in the same file
- Server components by default, 'use client' only for interactivity
## State Management
- React 19 use() for data fetching in server components
- URL state (searchParams) over client state for filters
Skills are the context-engineering primitive that MCP doesn't cover. They're markdown files with frontmatter that load progressively: only ~100 tokens of metadata at startup, full body when triggered by relevance match. They teach agents process knowledge: how your team deploys, how you write tests, what conventions to follow. Andrew Ng's Context Hub is the canonical example: a CLI + Skill that tells the agent “run chub get langgraph/package before writing any LangGraph code” instead of guessing from training data. MCP gives agents access to systems; Skills give agents access to judgment.
MCP servers connect agents to live infrastructure: databases, APIs, monitoring dashboards. 65% of developers report AI assistants “miss relevant context.” MCP closes that gap by giving agents access to the same systems your engineers use.
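Registering a server is a one-line CLI call per system; a minimal sketch using Claude Code's mcp add command, where the Postgres server package and the read-only connection string are illustrative assumptions:
# Register an MCP server so agents can query the same database engineers use
# (server package and connection string are illustrative)
$ claude mcp add postgres -- npx -y @modelcontextprotocol/server-postgres "postgresql://readonly@db.internal:5432/app"
$ claude mcp list  # list configured servers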
03 · Parallel implementation
Git Worktrees · Claude Code · Claude Squad · Codex · Agent Teams
Multiple agents work simultaneously on independent tasks, each in its own git worktree (an isolated checkout with its own branch). Three focused agents consistently outperform one generalist working three times as long. The sweet spot is 3-5 concurrent agents; beyond that, coordination overhead dominates.
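A worktree is a plain git feature, so the isolation can also be set up by hand before any agent is spawned; a minimal sketch, with directory and branch names as illustrative placeholders:
# One isolated checkout per task: same repository, separate directory and branch
# (directory and branch names are illustrative)
$ git worktree add ../app-api-invoices -b agent/api-invoices
$ git worktree add ../app-invoice-table -b agent/invoice-table
$ git worktree add ../app-invoice-tests -b agent/invoice-tests
$ git worktree list  # shows every checkout and the branch it points at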
# Spawn 3 parallel agents on independent tasks
$ claude -w api-invoices "Implement POST /api/invoices endpoint per spec §3.2"
$ claude -w invoice-table "Build InvoiceTable component with sorting and filters"
$ claude -w invoice-tests "Add integration tests for invoice lifecycle"
# Each agent works in an isolated worktree — separate filesystem, separate branch
# PRs created on completion for human review
# Plan mode — read-only exploration before writing code
$ claude --permission-mode plan
Anthropic validated this at scale: 16 Claude Opus agents built a C compiler in Rust over ~2,000 sessions, passing 99% of GCC torture tests. incident.io reported an 18% build performance improvement from a single worktree session that cost $8 in API credits. The Pragmatic Engineer survey confirms Claude Code as most-used and most-loved (46%), with most high-output developers using the “power combination” of an IDE agent plus a terminal agent.
04 · AI code review
Hooks · CodeRabbit · Claude Code Review · Human Review
Every PR gets AI review before a human sees it, but as a first-pass filter, not a replacement. CodeRabbit detects real-world runtime bugs with 46% accuracy across 13M+ PRs, but 28% of its comments are noise. The best teams use AI review to catch low-hanging fruit, freeing human reviewers for architecture, business logic, and institutional context.
// .claude/settings.json — hooks enforce quality gates structurally
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "npx eslint --fix $FILE && npx tsc --noEmit"
      }]
    }],
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "pnpm test --related $FILE"
      }]
    }]
  }
}
Hooks are hard gates, not optional suggestions. They execute on every change from every agent, blocking the action if they fail. PreToolUse hooks run linting and type-checking before the agent writes. PostToolUse hooks validate that modified files still pass related tests.
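Beyond lint and test commands, a hook can be a script that refuses the action outright. A minimal sketch of a PreToolUse guard, assuming the hook receives the tool call as JSON on stdin with a tool_input.file_path field and that a non-zero exit blocks the write (verify both against your hook runner's documentation):
#!/usr/bin/env bash
# block-protected-writes.sh: refuse edits to secrets and CI config
# (stdin JSON shape and the blocking exit code are assumptions)
file=$(jq -r '.tool_input.file_path // empty')
case "$file" in
  *.env*|*/.github/workflows/*|.github/workflows/*)
    echo "Blocked: $file is on the Never list" >&2
    exit 2  # non-zero exit blocks the tool call; stderr goes back to the agent
    ;;
esac
exit 0
Wire the script into settings.json as the command for an Edit|Write matcher, exactly like the lint and test commands above.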
05 · Testing & validation
Agent Teams · Sandbox · Mutation Testing · Browser Automation
Agent teams handle testing in parallel with strict isolation. A test-writer generates tests from the spec. A browser-automation agent verifies UI flows via headless Chrome. A security-audit agent scans for OWASP Top 10. Each runs with restricted tool access; the test agent can read source but not modify it.
# .claude/agents/test-writer/AGENT.md
---
description: Generates unit and integration tests from spec
model: sonnet
tools: ["Read", "Grep", "Glob", "Write", "Bash"]
permissionMode: default
isolation: worktree
---
Write tests for the provided spec section.
Do not modify source files — only create files in tests/.
Run the test suite after writing and fix any failures.
Meta deployed LLM-based mutation testing across Facebook, Instagram, and WhatsApp with a 73% acceptance rate from privacy engineers. It is the most credible production case study of AI-assisted testing at scale. The shift is from AI-assisted (you define what to test) to AI-agentic (AI identifies what needs testing, generates suites, runs them, reports results).
06 · Deployment & rollout
Rolling Releases · Preview Deployments · Smart Test Selection · Gated Pipelines
Deployment is where agents execute but humans own the decision. Stripe's reference: selective test execution from 3+ million tests, automated linting in under 5 seconds, a hard limit of two CI rounds per agent run, and human review only after all automated gates pass. Preview deployments generate production-equivalent URLs for every PR. Rolling releases route traffic incrementally, auto-promote or auto-rollback based on real signals.
# Rolling release — promote or rollback based on real signals
deploy:
  strategy: canary
  stages:
    - percentage: 1%
      duration: 10m
      promote_if:
        error_rate: < baseline * 1.05
        p95_latency: < baseline * 1.10
      rollback_if:
        error_rate: > baseline * 1.20
        5xx_count: > 0
    - percentage: 5%
      duration: 5m
    - percentage: 25%
      duration: 5m
    - percentage: 100%
07 · Monitoring & autonomous agents
Datadog Bits AI · AWS DevOps Agent · incident.io · Scheduled Agents
The final phase belongs to agents that run continuously without human prompting. Datadog's Bits AI reads monitors, checks runbooks, generates root cause hypotheses, and tests each one: “what once took more than 30 minutes now happens before you've opened your laptop.” incident.io reported Intercom's AI generating “the exact fix their team would have implemented, in 30 seconds instead of 30 minutes,” reducing MTTR by 37%.
# Scheduled agents — cron-triggered, sandboxed, PR-based output
$ claude -p "Audit dependencies for CVEs. Create PR if patches found."
$ claude -p "Run performance regression suite. Page oncall if p95 > 200ms."
$ claude -p "Generate weekly engineering metrics from Linear + GitHub."
# Headless mode (-p) for non-interactive, scriptable agent runs
# Each runs in isolation, outputs a PR or notification
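The simplest scheduler is plain cron on a runner box; a minimal sketch, with the repository path, cadence, and log location as illustrative assumptions:
# crontab entry: dependency audit every Monday at 06:00, output arrives as a PR
# (repo path, schedule, and log file are illustrative)
0 6 * * 1  cd /srv/app && claude -p "Audit dependencies for CVEs. Create PR if patches found." >> /var/log/agent-audit.log 2>&1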
What doesn't work
The “Almost Solved” trap is the most dangerous failure mode. The 80% AI handles well creates false confidence about the 20% it handles catastrophically.
No specs, many agents
You are multiplying coordination work, not eliminating it. Multi-agent setups without specifications produce 10x the cost of a single well-prompted agent.
Review collapse
AI-generated PRs are 33% larger and review time has climbed 91%. Teams that skip review gates accumulate silent bugs: code that runs but implements business logic incorrectly.
No observability
Skipping monitoring for agent runs means you cannot distinguish good output from confidently wrong output. A well-observed system running a mediocre framework outperforms an unobserved system running the best one.
Safety shortcuts
An engineer watched Claude Code destroy a live database; years of data gone because safety checks were removed. Amazon convened a "deep dive" after service disruptions traced to AI-assisted changes.
AI-generated context files
ETH Zurich found LLM-generated AGENTS.md files hurt performance in 5 of 8 settings, reducing success ~3% while increasing costs 20-23%. Human-curated files are the only ones that work.
The stack
Tools I install in client codebases.
coding agents: Claude Code, Codex, Cursor, Devin
orchestration: CLAUDE.md, AGENTS.md, Skills, Hooks, Agent Teams
parallelism: Git Worktrees, Claude Squad, Agent Teams, Codex Cloud
integration: MCP Servers, GitHub Actions, Stripe Toolshed
code review: Hooks, CodeRabbit, Claude Code Review, Human Review
deployment: Rolling Releases, Smart Test Selection, Gated Pipelines
monitoring: Datadog Bits AI, AWS DevOps Agent, incident.io
I don't just teach this. I ship with it daily. This site was built with parallel Claude Code agents across git worktrees. See my consulting engagements or email me at raman.shrivastava.7@gmail.com.