[ METHOD 04 / 08 ] Twenty-one days
[ 04 — Method ] v 2026.06 · pre-committed cadence

From idea to production, in three weeks.

21 days
Pre-committed on every engagement

Every project we take on runs on the same rails. Map the workflow in week one. Build the agent in week two. Ship it in week three. Then we stay on to operate it — forever.

→ Pre-committed dates. Ruthless descoping. A working agent on Day 21, or your setup fee is refunded.

The whole engagement, day by day
Map · Days 01–05
Build · Days 06–12
Ship · Days 13–21
Operate · Day 22+
[ PHASE · 01 ] / DAYS 01 — 05 I.

Map.

Two working sessions to instrument one workflow end-to-end. We watch the team do the work. We measure how long each step takes, where it fails, and what "good" means.

DAY 01 · Kickoff. Meet the team that owns the workflow. Read the docs. Lurk in Slack.
DAY 02 · Workflow audit. We sit beside each operator and instrument every step.
DAY 03 · Baseline metrics. Time, cost, error rate, throughput, all captured in writing.
DAY 04 · Architecture draft. We pick the simplest viable graph and the eval shape.
DAY 05 · Mapping Doc delivered. You sign it. We build to it. No surprises.
The Mapping Doc · 04 PAGES
Outbound SDR Agent scope.
Workflow: Account research → outreach
Baseline time: 42 min / lead
Target: < 4 min / lead
Integrations: HubSpot · Apollo · Cal.com
Eval set size: 120 graded examples
Ship date: Day 21 · pre-committed
Deliverable: Mapping Doc
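
As a quick check on what the sample Mapping Doc above commits to, the baseline and target times can be turned into a speedup and a time reduction. This is a minimal sketch: the function name `improvement` is hypothetical, and 4 min is used as the upper bound of the "< 4 min / lead" target, so a real result would be at least this good.

```python
def improvement(baseline_min: float, target_min: float) -> tuple[float, float]:
    """Return (speedup factor, fractional time reduction) per lead."""
    if baseline_min <= 0 or target_min <= 0:
        raise ValueError("times must be positive")
    speedup = baseline_min / target_min
    reduction = 1 - target_min / baseline_min
    return speedup, reduction


# Figures from the sample Mapping Doc: 42 min / lead baseline, < 4 min / lead target.
speedup, reduction = improvement(baseline_min=42, target_min=4)
print(f"{speedup:.1f}x faster, {reduction:.1%} less time per lead")
```

In other words, hitting the target means at least a 10.5x speedup on that workflow.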
[ PHASE · 02 ] / DAYS 06 — 12 II.

Build.

We architect the agent — simplest viable graph, deterministic where we can, LLM where we must. Wire it to your tools. Run it in shadow mode on real, live data without touching anything customer-facing.

DAY 06 · Eval harness, written before any agent code. If we can't grade it, we don't build it.
DAY 07 · Tools wired. Auth set up. Sandbox accounts running. CI green.
DAY 08 · First end-to-end run. Real data, fake actions. We grade every output.
DAY 09 · Shadow mode begins. Agent observes every live event. Outputs go to a log, not a customer.
DAY 10 · Eval pass #1. We run the agent against the graded set and read every miss.
DAY 11 · Prompt + tool iteration. We close the obvious gaps.
DAY 12 · Shadow-mode review with the team. Side-by-side: agent vs. human. Decisions made.
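
Shadow mode, as described in the schedule above, comes down to one rule: the agent's proposed actions are recorded, never executed. Here is a minimal sketch of that rule, assuming a hypothetical `ShadowExecutor` wrapper and `ProposedAction` record; neither name comes from a real library, and a production setup would write to durable storage rather than an in-memory list.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    event_id: str
    kind: str       # hypothetical action type, e.g. "send_email"
    payload: dict


class ShadowExecutor:
    """Logs every proposed action; only executes when explicitly live."""

    def __init__(
        self,
        live: bool = False,
        executor: Optional[Callable[[ProposedAction], None]] = None,
    ) -> None:
        self.live = live
        self.executor = executor
        self.log: list[str] = []

    def submit(self, action: ProposedAction) -> str:
        # Always record what the agent wanted to do, for side-by-side review.
        self.log.append(json.dumps(asdict(action)))
        if self.live and self.executor is not None:
            self.executor(action)
            return "executed"
        return "logged_only"


sent: list[ProposedAction] = []
shadow = ShadowExecutor(live=False, executor=sent.append)
status = shadow.submit(
    ProposedAction("evt-1", "send_email", {"to": "lead@example.com"})
)
print(status, len(shadow.log), len(sent))
```

Because `live` defaults to `False`, forgetting to configure the wrapper fails safe: the action lands in the log and nothing reaches a customer.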
Shadow agent · DAY 09
Eval pass · 72%
RESEARCH ACCURACY: 94%
TONE MATCH: 88%
CTA QUALITY: 58%
OVERALL: 72%
Deliverable: Shadow agent
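
The "grade it before you build it" idea from Day 06 can be illustrated with a tiny eval gate: run the agent over a graded set, compute a pass rate, and block the change if the rate falls below a threshold. This is a sketch under stated assumptions, not the Braintrust API: `Example`, `pass_rate`, and `merge_allowed` are hypothetical names, the 0.9 threshold is invented for illustration, and the toy agent and grader stand in for real ones.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    input: str
    expected: str


def pass_rate(
    examples: Sequence[Example],
    agent: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> float:
    """Fraction of graded examples the agent gets right."""
    if not examples:
        raise ValueError("eval set is empty: nothing to grade")
    passed = sum(1 for ex in examples if grader(agent(ex.input), ex.expected))
    return passed / len(examples)


def merge_allowed(rate: float, threshold: float) -> bool:
    """CI gate: block the change when the score falls below the threshold."""
    return rate >= threshold


# Toy stand-ins so the sketch runs end to end.
eval_set = [Example("a", "A"), Example("b", "B"), Example("c", "x"), Example("d", "D")]
rate = pass_rate(eval_set, agent=str.upper, grader=lambda out, exp: out == exp)
print(f"pass rate {rate:.0%}, merge allowed: {merge_allowed(rate, threshold=0.9)}")
```

In CI, a `False` from the gate would translate into a non-zero exit code so the prompt change cannot merge unscored.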
[ PHASE · 03 ] / DAYS 13 — 21 III.

Ship.

Soft launch to one team with limited blast radius. Daily eval reports at 9am. The agent earns its authority a tier at a time — drafts → confidence-gated auto-action → full autonomy.

DAY 13 · Soft launch. One team. Drafts-only mode. Humans approve every action.
DAY 14 · 9am daily report ships. Quality, throughput, exceptions. One page, every morning.
DAY 15–16 · Confidence threshold tuning. The agent learns when to ask and when to act.
DAY 17 · First auto-action. Low-stakes branch enabled. Humans on call.
DAY 18–19 · Full autonomy across the agreed scope. Final eval pass. Misses logged.
DAY 20 · Handoff. Your team sees the dashboards, the runbook, the on-call rota.
DAY 21 · Live. Production. Owned by you. Operated by us.
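
The three authority tiers in the schedule above (drafts, confidence-gated auto-action, full autonomy) can be sketched as a single routing function. Everything here is an illustrative assumption: the `Tier` enum, the `route` function, the `low_stakes` flag, and the 0.85 threshold are hypothetical, and the real threshold would come out of the Day 15–16 tuning.

```python
from enum import Enum


class Tier(Enum):
    DRAFTS_ONLY = 1        # Day 13: humans approve every action
    CONFIDENCE_GATED = 2   # Day 17: low-stakes branches may auto-act
    FULL_AUTONOMY = 3      # Day 18-19: whole agreed scope may auto-act


def route(tier: Tier, confidence: float, low_stakes: bool,
          threshold: float = 0.85) -> str:
    """Return "auto" if the agent may act alone, else "human_review"."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if tier is Tier.DRAFTS_ONLY:
        return "human_review"
    if confidence < threshold:
        # Low confidence always escalates, whatever the tier.
        return "human_review"
    if tier is Tier.CONFIDENCE_GATED and not low_stakes:
        return "human_review"
    return "auto"


print(route(Tier.DRAFTS_ONLY, 0.99, low_stakes=True))
print(route(Tier.CONFIDENCE_GATED, 0.90, low_stakes=True))
print(route(Tier.FULL_AUTONOMY, 0.50, low_stakes=True))
```

Keeping the tier as an explicit parameter means promotion from one tier to the next is a configuration change, not a code change.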
9am report · DAY 21
Yesterday · 98%
RUNS: 218
AUTO-ACTIONED: 203 / 93%
ESCALATED: 15 / 7%
ERROR RATE: 0.4%
HOURS SAVED: 38
Deliverable: Live agent
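
The percentages in the sample 9am report above follow directly from the run counts. A minimal sketch of that arithmetic, assuming every run is either auto-actioned or escalated (true of the sample figures, but an assumption in general); the `summarize` function and its field names are hypothetical.

```python
def summarize(runs: int, auto_actioned: int, escalated: int) -> dict:
    """Turn raw daily counts into the rounded percentages shown in the report."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if auto_actioned + escalated != runs:
        # Assumption: each run ends in exactly one of the two outcomes.
        raise ValueError("auto-actioned + escalated must equal total runs")
    return {
        "runs": runs,
        "auto_pct": round(100 * auto_actioned / runs),
        "escalated_pct": round(100 * escalated / runs),
    }


summary = summarize(runs=218, auto_actioned=203, escalated=15)
print(summary)
```

With the sample counts this reproduces the 93% auto-actioned and 7% escalated figures.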
[ PHASE · 04 ] / DAY 22 — ∞ IV.

Operate.

An agent in production is a product, not a project. We monitor drift, swap in newer/cheaper models when they're better, and ship one capability upgrade per month — forever, or until you take it in-house.

WEEKLY · Eval re-run against the held-out set. Drift alerts to Slack.
MONTHLY · One capability release: a new tool, a new branch, a new data source.
QUARTERLY · Model + cost review. We swap in whichever frontier model offers the best price-performance.
CONTINUOUSLY · Observability, on-call, runbook upkeep. The agent never goes silent.
WHENEVER · You can take the agent in-house any time. Code is in your repo from Day 1.
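
The weekly drift check described above can be sketched as a comparison between the latest held-out score and a stored baseline. This is an illustration only: `drift_detected`, `drift_message`, the 0.02 tolerance, and the scores are all hypothetical, and the sketch builds the alert text without calling Slack.

```python
from typing import Optional


def drift_detected(latest: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the latest score is more than `tolerance` below baseline."""
    return latest < baseline - tolerance


def drift_message(agent: str, latest: float, baseline: float) -> str:
    """Format the alert text that would be posted to the team channel."""
    return f"[drift] {agent}: held-out score {latest:.0%} vs baseline {baseline:.0%}"


# Hypothetical weekly scores on the held-out set, oldest first.
weekly_scores = [0.91, 0.90, 0.86]
baseline = 0.90
latest = weekly_scores[-1]

alert: Optional[str] = (
    drift_message("sdr-agent", latest, baseline)
    if drift_detected(latest, baseline)
    else None
)
print(alert)
```

The tolerance band keeps normal week-to-week noise from paging anyone, while a sustained drop still triggers the alert.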
Month 04 · CUMULATIVE
1,140 h · HOURS SAVED, SINCE DAY 21
MODELS SWAPPED: 1
CAPABILITIES SHIPPED: 4
INCIDENTS: 2 · resolved < 10 min
Deliverable: Monthly releases
Deliverables

What lands in your repo.

DLV · 01

Mapping Doc

Four-page brief with the workflow diagram, baseline metrics, agent scope, integrations, and the eval plan.

Day 05 · PDF + Notion
DLV · 02

Eval harness

Versioned test set, graders, and CI integration. Every prompt change is scored before it merges.

Day 06 · Braintrust + repo
DLV · 03

Shadow agent

The full agent observing every live event, logging what it would do. Side-by-side with your team's actions.

Day 09 · Your cloud
DLV · 04

Live agent

Production-ready, observable, auditable. Acting inside the tools your team already uses.

Day 21 · Your cloud
DLV · 05

9am report

One-page daily digest. Yesterday's runs, the wins, the misses, the things we changed.

Daily · Email + Slack
DLV · 06

Monthly releases

One meaningful capability upgrade every month. New tool, new branch, new data source. Always.

Monthly · Changelog
Stack

What we'll actually use.

Boring tech. Boring on purpose. The smallest, oldest, most stable thing that works. No vendor lock-in — everything runs in your cloud accounts, on your contracts.

Anthropic Claude

Claude

LLM · REASONING

Our default for reasoning, planning, and tool-use. Strong instruction-following, predictable in production.

OpenAI GPT

GPT

LLM · GENERATION

Used for writing, summarisation, and embeddings. Often paired with Claude in multi-model pipelines.

LangGraph

ORCHESTRATION

Multi-step agent graphs. Used when the workflow has branching, retries, or human-in-the-loop checkpoints.

Inngest

WORKFLOWS

Durable workflows + step functions. Replaces ad-hoc queues; native retries and observability built in.

Temporal

WORKFLOWS

For long-running, high-stakes agents that need bulletproof state + retries. Enterprise default.

PostgreSQL

Postgres

DATABASE + VECTORS

Plain Postgres with pgvector. We avoid bespoke vector DBs unless scale forces our hand.

Vercel

FRONTEND HOSTING

For the rare moments an agent does need a UI. Otherwise just the eval dashboard and admin tools.

Google Cloud

Cloud Run

COMPUTE

Containerised agent runtime. Auto-scaling, pay-per-request, deploys in seconds.

AWS

ENTERPRISE CLOUD

For clients on AWS — we deploy to ECS / Lambda + Bedrock for HIPAA-eligible model endpoints.

Braintrust

EVALS

Our eval platform. Every prompt change is graded against a versioned dataset before it merges.

Datadog

OBSERVABILITY

Traces, logs, metrics. Every agent run is replayable down to the individual LLM call.

Sentry

ERROR TRACKING

Failures, regressions, slow paths. Pages the on-call engineer the moment the agent misbehaves.

Modal

GPU COMPUTE

For agents that need self-hosted models or heavy data pre-processing. Spin up GPUs on demand.

Linear

PROJECT TOOLING

Internal + shared client board. Every capability release ships from a Linear issue.

Cal.com

SCHEDULING

For SDR agents that need to book meetings on real calendars without becoming a scheduling product.

Resend

EMAIL DELIVERY

For the outbound side of every SDR + support agent. Boring, deliverable email infrastructure.

Guarantees

What we'll promise.

I.

Live on Day 21, or your setup is refunded.

If the agent isn't running in production at the end of week three, we refund the setup fee in full. No clawback negotiations.

II.

Cancel anytime, seven days' notice.

If a deployed agent drifts below its baseline for two months running, you can cancel the retainer with 7 days' notice — no penalty.

III.

You own everything from Day 1.

Code is in your repo. Models run on your contracts. Infrastructure in your cloud. If we part ways, the agent keeps running.

FAQ

Things everyone asks.

How is this different from hiring an internal AI team?

An internal team takes 6 months to recruit and 12 months to ship its first thing. We ship in three weeks, leave the code in your repo, and train your team to operate it. If you later want to take it in-house, we'll help you hire.

What if our workflow is too custom for an agent?

Every agent we've built has been custom. The playbooks on our services page are starting points — the actual implementation is shaped to your data, your tools, and your edge cases. That's the whole job.

Whose cloud does the agent run in?

Yours. Your AWS/GCP/Azure, your model contracts (Anthropic, OpenAI, Bedrock), your database. We don't host anything. If we part ways, the agent keeps running.

What about hallucinations or mistakes?

Every agent ships with three layers of guardrails: deterministic input validation, a confidence-scored eval at each step, and a human-in-the-loop checkpoint for low-confidence cases. We measure agent error rate continuously. When it climbs, we get paged.
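
The three guardrail layers described above can be sketched as a short pipeline: validate the input deterministically, score the step's confidence, and hand low-confidence results to a human. This is a minimal illustration under stated assumptions: `validate_lead`, `run_with_guardrails`, `StepResult`, the lead fields, and the 0.8 cutoff are all hypothetical, and the fake step stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    output: str
    confidence: float  # 0.0 to 1.0, as scored by the step's evaluator


def validate_lead(lead: dict) -> list[str]:
    """Layer 1: deterministic input checks, no model involved."""
    errors = []
    if not lead.get("company"):
        errors.append("missing company")
    email = lead.get("email") or ""
    if "@" not in email:
        errors.append("invalid email")
    return errors


def run_with_guardrails(
    lead: dict,
    step: Callable[[dict], StepResult],
    min_confidence: float = 0.8,
) -> dict:
    errors = validate_lead(lead)
    if errors:
        return {"status": "rejected", "reasons": errors}
    result = step(lead)  # Layer 2: the step returns a confidence score
    if result.confidence < min_confidence:
        # Layer 3: low confidence goes to a human checkpoint as a draft.
        return {"status": "needs_human", "draft": result.output}
    return {"status": "approved", "output": result.output}


def fake_step(lead: dict) -> StepResult:
    # Stand-in for a model call; the confidence value is invented.
    return StepResult(output=f"Hi {lead['company']} team", confidence=0.65)


outcome = run_with_guardrails(
    {"company": "Acme", "email": "ops@acme.example"}, fake_step
)
print(outcome["status"])
```

Ordering matters here: the cheap deterministic check runs first, so malformed input never reaches the model at all.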

What does "operating" mean after Day 21?

We monitor the agent every day, run weekly evals against a held-out set, swap in newer/cheaper models when they're better, and ship one capability upgrade per month. Think of it like a fractional engineering team for one piece of software.

Can you do compliance — SOC 2, HIPAA, GDPR?

Yes. Three of our active retainers are HIPAA-eligible and one is FedRAMP-aware. We sign BAAs, scope PHI carefully, and route to compliant model endpoints (Bedrock, Azure OpenAI). For SOC 2 / GDPR, the architecture lives in your cloud — your existing controls cover the agent.

Day 1 starts the day you sign.

The first call is a working session, not a sales pitch. We'll leave with a draft of your Mapping Doc.