Method — MES

[ PHASE · 01 ] / DAYS 01 — 05 I.

Map.

Two working sessions to instrument one workflow end-to-end. We watch the team do the work. We measure how long each step takes, where it fails, and what "good" means.

DAY 01: Kickoff. Meet the team that owns the workflow. Read the docs. Lurk in Slack.

DAY 02: Workflow audit. We sit beside each operator and instrument every step.

DAY 03: Baseline metrics. Time, cost, error rate, throughput, all captured in writing.

DAY 04: Architecture draft. We pick the simplest viable graph and the eval shape.

DAY 05: Mapping Doc delivered. You sign it. We build to it. No surprises.

The Mapping Doc · 04 PAGES

Outbound SDR Agent scope.

Workflow: Account research → outreach
Baseline time: 42 min / lead
Target: < 4 min / lead
Integrations: HubSpot · Apollo · Cal.com
Eval set size: 120 graded examples
Ship date: Day 21 · pre-committed

PHASE PROGRESS · 24%

Deliverable

Mapping Doc

[ PHASE · 02 ] / DAYS 06 — 12 II.

Build.

We architect the agent — simplest viable graph, deterministic where we can, LLM where we must. Wire it to your tools. Run it in shadow mode on real, live data without touching anything customer-facing.

DAY 06: Eval harness, written before any agent code. If we can't grade it, we don't build it.

DAY 07: Tools wired. Auth set up. Sandbox accounts running. CI green.

DAY 08: First end-to-end run. Real data, fake actions. We grade every output.

DAY 09: Shadow mode begins. Agent observes every live event. Outputs go to a log, not a customer.

DAY 10: Eval pass #1. We run the agent against the graded set and read every miss.

DAY 11: Prompt + tool iteration. We close the obvious gaps.

DAY 12: Shadow-mode review with the team. Side-by-side: agent vs. human. Decisions made.
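To make shadow mode concrete, here is a minimal sketch of the idea: the agent sees each live event, its proposed action is written to a log next to what the human actually did, and nothing is executed. The names `ShadowLog`, `run_shadow`, and `toy_agent`, and the event fields `id`, `score`, and `human_action`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ShadowLog:
    """Collects what the agent would have done, next to what the human did."""
    rows: list = field(default_factory=list)

    def record(self, event_id: str, agent_action: str, human_action: str) -> None:
        self.rows.append({
            "event_id": event_id,
            "agent": agent_action,
            "human": human_action,
            "match": agent_action == human_action,
        })

    def agreement_rate(self) -> float:
        """Share of events where agent and human chose the same action."""
        if not self.rows:
            return 0.0
        return sum(r["match"] for r in self.rows) / len(self.rows)


def run_shadow(events: list, agent: Callable[[dict], str], log: ShadowLog) -> None:
    """Run the agent on live events, but only log; never execute the action."""
    for event in events:
        proposed = agent(event)  # what the agent would do
        log.record(event["id"], proposed, event["human_action"])
        # Deliberately no side effect here: nothing reaches a customer.


def toy_agent(event: dict) -> str:
    """Stand-in for the real agent (hypothetical rule, illustration only)."""
    return "send_intro" if event["score"] >= 50 else "skip"
```

The logged rows are what a side-by-side review would read: every disagreement between `agent` and `human` is a case to discuss.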

Shadow agent · DAY 09

Eval pass · 72%

RESEARCH ACCURACY: 94%

TONE MATCH: 88%

CTA QUALITY: 58%

OVERALL: 72%

PHASE PROGRESS · 58%

Deliverable

Shadow agent

[ PHASE · 03 ] / DAYS 13 — 21 III.

Ship.

Soft launch to one team with limited blast radius. Daily eval reports at 9am. The agent earns its authority a tier at a time — drafts → confidence-gated auto-action → full autonomy.

DAY 13: Soft launch. One team. Drafts-only mode. Humans approve every action.

DAY 14: 9am daily report ships. Quality, throughput, exceptions. One page, every morning.

DAY 15–16: Confidence threshold tuning. The agent learns when to ask and when to act.

DAY 17: First auto-action. Low-stakes branch enabled. Humans on call.

DAY 18–19: Full autonomy across the agreed scope. Final eval pass. Misses logged.

DAY 20: Handoff. Your team sees the dashboards, the runbook, the on-call rota.

DAY 21: Live. Production. Owned by you. Operated by us.
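The three authority tiers can be sketched as a small routing function. This is an illustration under assumptions, not the production logic: the `Tier` names, the `route` function, the `in_scope` flag, and the default threshold of 0.9 are all hypothetical.

```python
from enum import Enum


class Tier(Enum):
    DRAFTS_ONLY = 1       # humans approve every action
    CONFIDENCE_GATED = 2  # auto-act only above a tuned threshold
    FULL_AUTONOMY = 3     # act on everything inside the agreed scope


def route(tier: Tier, confidence: float, in_scope: bool = True,
          threshold: float = 0.9) -> str:
    """Return "auto" to act directly, or "human" to queue for approval."""
    if not in_scope:
        return "human"  # anything outside the agreed scope is escalated
    if tier is Tier.DRAFTS_ONLY:
        return "human"
    if tier is Tier.CONFIDENCE_GATED:
        return "auto" if confidence >= threshold else "human"
    return "auto"
```

In this sketch, threshold tuning amounts to adjusting `threshold` until the share of wrong auto-actions is acceptable, and moving up a tier is a one-line config change rather than a code change.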

9am report · DAY 21

Yesterday · 98%

RUNS: 218
AUTO-ACTIONED: 203 / 93%
ESCALATED: 15 / 7%
ERROR RATE: 0.4%
HOURS SAVED: 38
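The percentages in the sample report follow directly from the raw counts. A minimal sketch, assuming every run ends up either auto-actioned or escalated (which holds for the numbers shown, since 203 + 15 = 218); the function name `daily_summary` is hypothetical.

```python
def daily_summary(runs: int, auto_actioned: int, escalated: int) -> dict:
    """Turn raw run counts into the rounded percentages shown in the report."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if auto_actioned + escalated != runs:
        raise ValueError("every run should be auto-actioned or escalated")
    return {
        "auto_pct": round(100 * auto_actioned / runs),
        "escalated_pct": round(100 * escalated / runs),
    }


summary = daily_summary(runs=218, auto_actioned=203, escalated=15)
# summary == {"auto_pct": 93, "escalated_pct": 7}
```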

PHASE PROGRESS · 100%

Deliverable

Live agent

[ PHASE · 04 ] / DAY 22 — ∞ IV.

Operate.

An agent in production is a product, not a project. We monitor drift, swap in newer/cheaper models when they're better, and ship one capability upgrade per month — forever, or until you take it in-house.

WEEKLY: Eval re-run against the held-out set. Drift alerts to Slack.

MONTHLY: One capability release, such as a new tool, a new branch, or a new data source.

QUARTERLY: Model + cost review. We swap in whichever model offers the best price-performance.

CONTINUOUSLY: Observability, on-call, runbook upkeep. The agent never goes silent.

WHENEVER: You can take the agent in-house any time. Code is in your repo from Day 1.
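The weekly drift check can be sketched as a comparison between the latest held-out score and a baseline. This is a simplified illustration: the function name `drift_alert` and the 3-point tolerance are assumptions, and actually posting the message to Slack is left out.

```python
from typing import Optional


def drift_alert(baseline: float, weekly_scores: list,
                tolerance: float = 0.03) -> Optional[str]:
    """Return an alert message if the latest weekly eval score has dropped
    more than `tolerance` below the baseline, otherwise None."""
    if not weekly_scores:
        return None
    latest = weekly_scores[-1]
    drop = baseline - latest
    if drop > tolerance:
        return (f"Eval drift: {latest:.0%} vs baseline {baseline:.0%} "
                f"(down {drop:.0%})")
    return None
```

A drop can come from the model, the prompts, or the incoming data changing shape, so the alert is a prompt to read the misses, not a diagnosis.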

Month 04 · CUMULATIVE

1,140 h

HOURS SAVED, SINCE DAY 21

MODELS SWAPPED: 1

CAPABILITIES SHIPPED: 4

INCIDENTS: 2 · resolved < 10 min

Deliverable

Monthly releases

Deliverables

What lands in your repo.

DLV · 01

Mapping Doc

Four-page brief with the workflow diagram, baseline metrics, agent scope, integrations, and the eval plan.

Day 05 · PDF + Notion

DLV · 02

Eval harness

Versioned test set, graders, and CI integration. Every prompt change is scored before it merges.

Day 06 · Braintrust + repo
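As a rough picture of what "scored before it merges" means, the sketch below grades an agent against a small test set and blocks a change that scores below the main branch. It is plain Python and does not use the Braintrust SDK; `exact_match`, `score_run`, and `merge_allowed` are hypothetical names, and a real grader would be richer than an exact string match.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Simplest possible grader: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def score_run(examples: list, agent: Callable[[str], str],
              grader: Callable[[str, str], float] = exact_match) -> float:
    """Average grade of the agent over a versioned set of examples."""
    if not examples:
        raise ValueError("the eval set must not be empty")
    scores = [grader(agent(ex["input"]), ex["expected"]) for ex in examples]
    return sum(scores) / len(scores)


def merge_allowed(candidate_score: float, main_score: float) -> bool:
    """CI gate: block the merge if the change scores below the main branch."""
    return candidate_score >= main_score
```

In CI, `score_run` would be called once for the candidate prompt and once for the current main branch, with `merge_allowed` deciding whether the check passes.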

DLV · 03

Shadow agent

The full agent observing every live event, logging what it would do. Side-by-side with your team's actions.

Day 09 · Your cloud

DLV · 04

Live agent

Production-ready, observable, auditable. Acting inside the tools your team already uses.

Day 21 · Your cloud

DLV · 05

9am report

One-page daily digest. Yesterday's runs, the wins, the misses, the things we changed.

Daily · Email + Slack

DLV · 06

Monthly releases

One meaningful capability upgrade every month. New tool, new branch, new data source. Always.

Monthly · Changelog

Stack

What we'll actually use.

Boring tech. Boring on purpose. The smallest, oldest, most stable thing that works. No vendor lock-in — everything runs in your cloud accounts, on your contracts.

Claude

LLM · REASONING

Our default for reasoning, planning, and tool-use. Strong instruction-following, predictable in production.

GPT

LLM · GENERATION

Used for writing, summarisation, and embeddings. Often paired with Claude in multi-model pipelines.

LangGraph

ORCHESTRATION

Multi-step agent graphs. Used when the workflow has branching, retries, or human-in-the-loop checkpoints.

Inngest

WORKFLOWS

Durable workflows + step functions. Replaces ad-hoc queues; native retries and observability built in.

Temporal

WORKFLOWS

For long-running, high-stakes agents that need bulletproof state + retries. Enterprise default.

Postgres

DATABASE + VECTORS

Plain Postgres with pgvector. We avoid bespoke vector DBs unless scale forces our hand.

Vercel

FRONTEND HOSTING

For the rare moments an agent does need a UI. Otherwise just the eval dashboard and admin tools.

Cloud Run

COMPUTE

Containerised agent runtime. Auto-scaling, pay-per-request, deploys in seconds.

AWS

ENTERPRISE CLOUD

For clients on AWS — we deploy to ECS / Lambda + Bedrock for HIPAA-eligible model endpoints.

Braintrust

EVALS

Our eval platform. Every prompt change is graded against a versioned dataset before it merges.

Datadog

OBSERVABILITY

Traces, logs, metrics. Every agent run is replayable down to the individual LLM call.

Sentry

ERROR TRACKING

Failures, regressions, slow paths. Pages the on-call engineer the moment the agent misbehaves.

Modal

GPU COMPUTE

For agents that need self-hosted models or heavy data pre-processing. Spin up GPUs on demand.

Linear

PROJECT TOOLING

Internal + shared client board. Every capability release ships from a Linear issue.

Cal.com

SCHEDULING

For SDR agents that need to book meetings on real calendars without becoming a scheduling product.

Resend

EMAIL DELIVERY

For the outbound side of every SDR + support agent. Boring, deliverable email infrastructure.

FAQ

Things everyone asks.

How is this different from hiring an internal AI team?

An internal team takes 6 months to recruit and 12 to ship its first thing. We ship in three weeks, leave the code in your repo, and train your team to operate it. If you want to take it in-house after, we'll help you hire.

What if our workflow is too custom for an agent?

Every agent we've built has been custom. The playbooks on our services page are starting points — the actual implementation is shaped to your data, your tools, and your edge cases. That's the whole job.

Whose cloud does the agent run in?

Yours. Your AWS/GCP/Azure, your model contracts (Anthropic, OpenAI, Bedrock), your database. We don't host anything. If we part ways, the agent keeps running.

What about hallucinations or mistakes?

Every agent ships with three layers of guardrails: deterministic input validation, a confidence-scored eval at each step, and a human-in-the-loop checkpoint for low-confidence cases. We measure agent error rate continuously. When it climbs, we get paged.
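The three layers can be pictured as checks applied in order to a single step. This is a sketch under assumptions, not the shipped guardrails: the lead fields in `REQUIRED_FIELDS`, the `guarded_step` function, and the 0.8 confidence floor are hypothetical, and the confidence score is assumed to come from an eval of the step's output.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    status: str  # "rejected", "needs_human", or "ok"
    detail: str


REQUIRED_FIELDS = ("company", "contact_email")  # hypothetical input schema


def validate_input(lead: dict) -> list:
    """Layer 1: deterministic checks, no model involved."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not lead.get(name)]
    email = lead.get("contact_email", "")
    if email and "@" not in email:
        problems.append("malformed contact_email")
    return problems


def guarded_step(lead: dict, draft: str, confidence: float,
                 min_confidence: float = 0.8) -> StepResult:
    problems = validate_input(lead)
    if problems:
        return StepResult("rejected", "; ".join(problems))
    # Layer 2: confidence score produced by grading this step's output.
    if confidence < min_confidence:
        # Layer 3: low-confidence cases stop at a human checkpoint.
        return StepResult("needs_human", draft)
    return StepResult("ok", draft)
```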

What does "operating" mean after Day 21?

We monitor the agent every day, run weekly evals against a held-out set, swap in newer/cheaper models when they're better, and ship one capability upgrade per month. Think of it like a fractional engineering team for one piece of software.

Can you do compliance — SOC 2, HIPAA, GDPR?

Yes. Three of our active retainers are HIPAA-eligible and one is FedRAMP-aware. We sign BAAs, scope PHI carefully, and route to compliant model endpoints (Bedrock, Azure OpenAI). For SOC 2 / GDPR, the architecture lives in your cloud — your existing controls cover the agent.

From idea to production, in three weeks.

What we'll promise.

Live on Day 21, or your setup is refunded.

Cancel anytime with seven days' notice.

You own everything from Day 1.

Day 1 starts the day you sign.
