01 / 12
SPACE F

Vibe Coding Nights · ClawCamp · 20 min + workshop

PART 01 The context

AGENT
ORCHESTRATION

Context Engineering & Meta Prompting

By Michalis Vasileiadis & Rayyan Zahid

The claw is the law

Cold open · July 2025

Real incident

An AI agent deleted a live production database.
Then lied about it.

During a 12-day Replit test, an agent ignored a code freeze, wiped 1,206 executives and 1,196 companies from prod, fabricated 4,000 fake users, then claimed rollback was impossible. It wasn't.

This is not an agent problem.
This is a context problem.

The thesis

Agent
orchestration
is downstream
of context.

Everything else is theater.

First · the input

CONTEXT.

Molt your context.

PREVIEW Up next ·context engineering

More than the prompt.

Definition · June–Sept 2025

Context
engineering.

Prompt = one line.
Context = everything the model sees at inference.

Tools, memory, retrieved docs, tool results, system prompt, message history, instructions, file tree. The prompt is a tiny fraction.

"The delicate art and science of filling the context window with just the right information for the next step." Andrej Karpathy · Jun 2025
"Becoming the most important skill an AI engineer can develop." Harrison Chase · LangChain
"Find the smallest possible set of high-signal tokens." Anthropic Engineering
PREVIEW Up next ·context rot

Longer isn't smarter.

Chroma study · 18 models

Context rot
is real.

Every frontier model, Claude, GPT, Gemini, Qwen, degrades before hitting the window limit.

Models don't use context uniformly. Every new token depletes the attention budget.

Accuracy vs context length

1K tokens
95%
8K
82%
32K
71%
100K
58%
200K
41%
500K
27%

Directional. Source: trychroma.com/research/context-rot

PREVIEW Up next ·failure modes

It all falls apart.

Breunig taxonomy · 2025

Four ways context dies.

01
Poison

A hallucination enters context and gets referenced forever. (Gemini played Pokémon and invented goals it couldn't achieve.)

02
Distract

Past ~100k, agents favor repeating history over inventing new plans. Memory beats reasoning.

03
Confuse

More tools = worse performance. Berkeley's function-calling leaderboard confirms it across every model.

04
Clash

Sharded prompts with conflicting guidance drop scores 39%. o3 fell 98 → 64.

Every failure mode is a context mode. Fix the context, the agent gets smarter for free.

PREVIEW Up next ·three moves

Aim up. Ship.

Optimize these three

Three moves
to win.

01
Compress

Smallest high-signal tokens. Prune tool results. Summarize history. Kill stale files.

02
Isolate

Subagents with fresh 200K windows. Anthropic: +90.2% on research evals vs single agent.

03
Persist

State in *.md files, not in the conversation. Survives /clear.

Optimize these three and you optimize everything downstream.

PREVIEW Move 01 · compress
82 tokens 13

Fewer tokens. Same meaning.

Move 01 · Compress, literally

Talk like
caveman.

"Why use many token
when few token do trick."

Drop articles, filler, hedging, pleasantries.
Keep facts, verbs, code.
The model's reasoning is unchanged. Only the speech compresses.

Before · 82 tokens

Certainly! I'd be happy to help you debug this. Looking at the code you've shared, it appears that there might potentially be an issue with how the async function is handling the promise. It's possible the function returns before the promise resolves, which could lead to unexpected behavior. Would you like me to walk through it?

After · 13 tokens · -84%

async fire. promise not cook yet. add await line 12. ugh.
~65% average output reduction across real tasks. More headroom before compaction. Cheaper subagent chains. github.com/juliusbrussee/caveman
PREVIEW Move 02 · isolate
>

Fork a clean shell.

Move 02 · Isolate, literally

Fork a
clean shell.

Without subagent · main context

Read · Read · ... × 30 · Grep · Grep · Edit · Bash · FAIL · Edit · Bash · PASS · "here's what I found"

With subagent · main context

[subagent] Fixed null check in auth.py:47. All tests pass.

Subagent gets its own fresh 200K window. Noise stays inside. Main gets one summary. Anthropic multi-agent: +90.2% vs single agent.

PREVIEW Before you even type
you

The bill before the meal.

Before you even type

The hidden tax.

MCPs, skills, CLAUDE.md. All load before your first message.

MCP
loads every turn

playwright: ~13.6K tokens.
sqlite-tools: ~13.4K.
Heavy kit: 66K+ before you type.
Tool defs reload on every request.

SKILL
loads on demand

Skills inject only when invoked.
Prefer skills over MCPs for rarely-used tools. Same capability, zero baseline cost.

MD
always loaded

CLAUDE.md prepended to every session.
Cap at 200 lines. Essentials only.
Over-stuffed memory = over-taxed attention.

A heavy kit can burn 25–33% of your 200K window before the first token of your request. That's attention budget you've already spent.

Next · the prompts

META
PROMPTING.

Compile your prompts.

PREVIEW Up next ·meta prompting

Prompts writing prompts.

Meta prompting

Let the AI
write the prompts.

Stop hand-crafting.
Start compiling.

"DSPy is my context engineering tool of choice." Tobi Lütke · Shopify CEO

Tools that write prompts for you

  • Anthropic Prompt Generator
    Console → describe your task → Claude drafts a production prompt
  • OpenAI Generate
    Playground button: generates prompts + schemas + functions
  • DSPy
    "Programming, not prompting." Compilers mutate prompts via LLM reflection.
  • TextGrad
    LLM-generated textual gradients optimize prompts like backprop.

Now · the loop & the library

BUILD.

Ship the agent.

PREVIEW Up next ·GSD

Close the loop.

github.com/gsd-build/get-shit-done · 54K ★

GSD puts it all
in one loop.

Phase 01 Research

4 parallel researchers. Fresh subagents. Main thread never sees the noise.

Phase 02 Plan

Atomic XML plans. A plan-checker loops up to 3×. Meta-prompting in action.

Phase 03 Execute

Wave-based. Each plan runs in its own fresh 200K window.

Phase 04 Verify

A separate agent checks. Can't be yes-biased by the executor.

"Your main context stays at 30–40%. The work happens in fresh subagent contexts." GSD README

State lives in files, not conversation

.planning/
PROJECT.md
PLAN.md
STATE.md
VERIFICATION.md
# install npx get-shit-done-cc@latest
PREVIEW Up next ·Agent SDK
{ }

Same engine. Your code.

code.claude.com/docs/en/agent-sdk

Now you build.

Claude Agent SDK is Claude Code as a library.
Same tools. Same loop. Same guardrails.

import asyncio from claude_agent_sdk import query, ClaudeAgentOptions async def main(): async for msg in query( prompt="Fix the bug in auth.py", options=ClaudeAgentOptions( allowed_tools=["Read", "Edit", "Glob"] ), ): print(msg) asyncio.run(main())

Primitives you compose

  • Tools: Read, Edit, Bash, Glob, Grep, WebSearch, MCP servers
  • Subagents: spawn specialists, isolate context
  • Hooks: PreToolUse, PostToolUse, Stop. Audit & gate.
  • Sessions: persist, resume, fork
  • Your Claude Code skills: via setting_sources=["project"]

Beginner: first agent in 12 lines. Intermediate: production multi-agent systems in CI/CD, using the same engine.

PART 02 The molt

Building
reliability
for your
autonomous agent.

By Rayyan Zahid

Part 1 was the context.
Part 2 is the chassis.

PREVIEW Up next ·the thesis

Stack the primitives.

Demo → prod

Your agent
demoed beautifully.

It failed
in production.

Reliability is not a property
of the model.

It's a stack of primitives you build around it. the thesis of Part 02

Six primitives. Stack them and your agent becomes boring.
Boring compounds.

PREVIEW Up next ·primitive 01
result result result result result

Every call. Same shape.

PRIMITIVE 01 Typed verbs

Agents decide
on shapes.

"Poka-yoke your tools. Change the arguments so that it is harder to make mistakes." Anthropic

⨯ Before · shape roulette

# returns ??? agent guesses data = telegram.fetch_unread(50) if data: for m in data.get("msgs", []): handle(m) # crashes on rate limit # silent None on auth expired

✓ After · total function

from kernel.capabilities.telegram import get_unread r = get_unread(limit=50) if r.ok: for m in r.messages: triage(m) elif r.error_kind == "rate_limit": sleep(r.wait_s) elif r.error_kind == "auth_expired": notify()
PREVIEW Up next ·primitive 02

Name the failures.

PRIMITIVE 02 Error taxonomy

Errors are part
of your API.

Agents branch deterministically on error_kind. Not by parsing stack traces.

Telegram · kernel.capabilities.telegram

KindWhenAction
auth_expired Session revoked / 2FA Surface. Re-auth interactively.
rate_limit FloodWaitError Wait. Don't hammer.
transient Network, timeout Retry once, backoff.
unknown Anything else Surface. Never silent.

"No silent retries. Classify and surface." · kernel design rule

PREVIEW Up next ·primitive 03

Log everything.

PRIMITIVE 03 The ledger

You can't improve
what you don't measure.

data/kernel_telemetry.jsonl

10:14:02.311 telegram.get_unread ok 1.42s 10:14:03.890 gmail.get_inbox ok 0.81s 10:14:04.122 calendar.get_today ok 0.33s 10:14:18.907 telegram.send_message fail rate_limit 3.2s 10:14:22.551 telegram.send_message ok 0.94s 10:14:31.014 linkedin.get_inbox ok 2.18s

What the dashboard surfaces

  • Per-verb success rate: which tools are flaky
  • p50 / p95 latency: which tools hog the turn
  • Error distribution: auth vs rate vs transient
  • Agent feedback loop: agents can query their own history

Fire-and-forget. A logging failure never breaks a verb.

PREVIEW Up next ·primitive 04

Gate before act.

PRIMITIVE 04 Hard gates

Don't trust agents
to remember.
Gate the action.

PreToolUse blocks before.
PostToolUse verifies after.
Deterministic. Seatbelts.

⨯ Blocked

.claude/settings.json

"hooks": { "PreToolUse": [{ "matcher": "Bash", "command": "block-prod-writes.js" }], "Stop": [{ "command": "atomic-commit-gate.js" }] }

Real hooks shipping in this repo:
atomic-commit-gate: no session-end with a dirty tree.
require-agent-name: every Agent call must be tagged.

PREVIEW Up next ·primitive 05a

One wire, not five.

PRIMITIVE 05a Closest to the metal

Prefer raw CDP
over browser tools.

Playwright MCP
~114K
DevTools MCP
~50K
Raw CDP
~7K

Tokens per 10-step automation · measured by Zechner / Zhang · ~16× delta

linkedin/nav.py · the nav abstraction

def goto_messaging(cdp) -> bool: cdp.navigate(MESSAGING_URL) # Phase 1: route changed if not cdp.wait_for( "location.href.includes('/messaging')", attempts=20, ): return False # Phase 2: DOM populated return cdp.wait_for( "document.querySelectorAll('li.msg-conv').length >= 1", attempts=30, )
"Playwright drives the browser.
DevTools MCP debugs one." Steve Kinney
  • Session reuse: attach to logged-in Chrome
  • Nav is the single source for selectors
  • Two-phase waits: route then DOM, not sleep(2)
  • Fallback: Playwright MCP for weird JS flows
PREVIEW Up next ·primitive 05b

Find the POST.

PRIMITIVE 05b API first

Hunt for POSTs.
Then rate-limit yourself.

DevTools → Network → filter: Fetch/XHR

GET /api/v1/session meh
GET /api/v1/me/preferences meh
POST /api/v1/events/{id}/publish ★ gold
POST /api/v1/messages/send ★ gold
GET /analytics/ping skip

Replicate in plain HTTP. Copy cookies. Done.

AWS Full Jitter · Marc Brooker (2015)

# the canonical formula sleep = random(0, min(cap, base * 2**attempt))
Telegram send throttle
~12 msg / minute · Full Jitter on FloodWait

Rate-limit yourself or the platform will do it. LinkedIn: 24–72h on first offense, permanent on third.

  • API beats browser.
  • POST beats GET-scraping.
  • Restraint beats speed.
PREVIEW Up next ·primitive 06

Sit next to it.

PRIMITIVE 06 Tight cycles

Sit next to
your agent.
Save months.

01 Validate

Input passes schema? Target exists?

02 Summarize

Show intent in one paragraph.

03 Approve

Human greenlights. Or redirects.

04 Act

Execute. Log. Report result.

Don't let the agent drift autonomously for weeks. It converges to garbage. Start tight. Widen the loop as trust compounds.

"The most successful AI products aren't purely agentic loops, but combine deterministic code with strategically placed LLM decision points." Dexter Horthy · 12-Factor Agents

You're not training the model.
You're training the harness around it.

PREVIEW Up next ·synthesis

All six. Stacked.

SYNTHESIS Stack them

Six primitives.
Ship boring agents.

01 Typed verbs

Predictable shapes. .ok / .error_kind.

02 Error taxonomy

Errors are part of your API. Agents branch on kind.

03 Ledger

Every call logged. Per-verb metrics. Self-feedback loop.

04 Hard gates

PreToolUse / PostToolUse hooks. Seatbelts.

05 CDP & API first

Closest to metal. Hunt POSTs. Rate-limit yourself.

06 HITL

Tight cycles. Train the harness. Save months.

Boring agents compound.
Exciting agents crash.

Now: open working session

Now
you molt.

Find a real pain.
Instrument it. GSD, the Agent SDK, or your own kernel.
Get Michalis & Rayyan 1:1.

Ship something that didn't exist an hour ago.
New shell. Same soul.

↗ BUILD