Vibe Coding Nights · ClawCamp · 20 min + workshop
AGENT
ORCHESTRATION
Context Engineering & Meta Prompting
The claw is the law
Cold open · July 2025
An AI agent deleted a live production database.
Then lied about it.
During a 12-day Replit test, an agent ignored a code freeze, wiped records for 1,206 executives and 1,196 companies from prod, fabricated 4,000 users, then claimed rollback was impossible. It wasn't.
This is not an agent problem.
This is a context problem.
The thesis
Agent
orchestration
is downstream
of context.
Everything else is theater.
First · the input
CONTEXT.
Molt your context.
More than the prompt.
Definition · June–Sept 2025
Context
engineering.
Prompt = one line.
Context = everything the model sees at inference.
Tools, memory, retrieved docs, tool results, system prompt, message history, instructions, file tree. The prompt is a tiny fraction.
"The delicate art and science of filling the context window with just the right information for the next step." Andrej Karpathy · Jun 2025
"Becoming the most important skill an AI engineer can develop." Harrison Chase · LangChain
"Find the smallest possible set of high-signal tokens." Anthropic Engineering
Longer isn't smarter.
Chroma study · 18 models
Context rot
is real.
Every frontier model, Claude, GPT, Gemini, Qwen, degrades before hitting the window limit.
Models don't use context uniformly. Every new token depletes the attention budget.
[Chart: accuracy vs. context length. Directional only. Source: trychroma.com/research/context-rot]
It all falls apart.
Breunig taxonomy · 2025
Four ways context dies.
Poisoning: A hallucination enters context and gets referenced forever. (Gemini played Pokémon and invented goals it couldn't achieve.)
Distraction: Past ~100k tokens, agents favor repeating history over inventing new plans. Memory beats reasoning.
Confusion: More tools = worse performance. Berkeley's function-calling leaderboard confirms it across every model.
Clash: Sharded prompts with conflicting guidance drop scores 39%. o3 fell 98 → 64.
Every failure mode is a context mode. Fix the context, the agent gets smarter for free.
Aim up. Ship.
Optimize these three
Three moves
to win.
Smallest high-signal tokens. Prune tool results. Summarize history. Kill stale files.
Subagents with fresh 200K windows. Anthropic: +90.2% on research evals vs single agent.
State in *.md files, not in the conversation. Survives /clear.
Optimize these three and you optimize everything downstream.
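A minimal sketch of "prune tool results", assuming history is a list of dicts with `role` and `content` keys. That schema, the `prune_tool_results` helper, and its `keep_last` / `max_chars` knobs are illustrative assumptions, not any SDK's API.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str      # "user", "assistant", or "tool" (assumed schema)
    content: str


def prune_tool_results(history: list[Message], keep_last: int = 2,
                       max_chars: int = 200) -> list[Message]:
    """Keep the newest `keep_last` tool results whole; truncate older ones."""
    tool_indexes = [i for i, m in enumerate(history) if m["role"] == "tool"]
    # Guard keep_last == 0: a [-0:] slice would select every index.
    recent = set(tool_indexes[-keep_last:]) if keep_last > 0 else set()
    pruned: list[Message] = []
    for i, msg in enumerate(history):
        is_stale_tool = msg["role"] == "tool" and i not in recent
        if is_stale_tool and len(msg["content"]) > max_chars:
            stub = msg["content"][:max_chars] + " [truncated]"
            pruned.append({"role": "tool", "content": stub})
        else:
            pruned.append(msg)
    return pruned
```

User and assistant turns pass through untouched. Only stale tool output shrinks, which is usually where the bulk of the tokens sits.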
Fewer tokens. Same meaning.
Move 01 · Compress, literally
Talk like
caveman.
"Why use many token
when few token do trick."
Drop articles, filler, hedging, pleasantries.
Keep facts, verbs, code.
The model's reasoning is unchanged.
Only the speech compresses.
Before · 82 tokens
After · 13 tokens · -84%
~65% average output reduction across real tasks. More headroom before compaction. Cheaper subagent chains. github.com/juliusbrussee/caveman
Fork a clean shell.
Move 02 · Isolate, literally
Without subagent · main context
With subagent · main context
Subagent gets its own fresh 200K window. Noise stays inside. Main gets one summary. Anthropic multi-agent: +90.2% vs single agent.
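The isolation idea, sketched with plain lists standing in for context windows. `run_in_subagent` and `noisy_research` are hypothetical names for illustration, and no model is called here. The point is the shape: noise accumulates in a throwaway context, and only one summary string crosses back.

```python
from typing import Callable

Context = list[str]  # stand-in for a context window: one string per message


def noisy_research(context: Context) -> str:
    """Hypothetical subagent body: piles up tool output, returns one summary."""
    for page in range(1, 6):
        context.append(f"tool result: raw dump of page {page}")
    return f"summary: read {len(context) - 1} pages, 1 relevant finding"


def run_in_subagent(task: str, body: Callable[[Context], str]) -> str:
    """Run `body` against a fresh context; only the return value escapes."""
    fresh: Context = [f"task: {task}"]
    return body(fresh)  # `fresh` is discarded when this function returns


main_context: Context = ["user: find the rate limit for the API"]
main_context.append(run_in_subagent("research rate limits", noisy_research))
print(main_context)
```

The main context grows by one entry, no matter how many pages the subagent read.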
The bill before the meal.
Before you even type
The hidden tax.
MCPs, skills, CLAUDE.md.
All load before your first message.
playwright: ~13.6K tokens.
sqlite-tools: ~13.4K.
Heavy kit: 66K+ before you type.
Tool defs reload on every request.
Skills inject only when invoked.
Prefer skills over MCPs for rarely-used tools.
Same capability, zero baseline cost.
CLAUDE.md prepended to every session.
Cap at 200 lines. Essentials only.
Over-stuffed memory = over-taxed attention.
A heavy kit can burn 25–33% of your 200K window before the first token of your request. That's attention budget you've already spent.
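The arithmetic behind that claim, using only the figures quoted on this slide. The helper name `share_of_window` is an illustrative assumption.

```python
WINDOW_TOKENS = 200_000  # window size quoted on the slide

# Baseline costs quoted on the slide; the dict keys are labels only.
baseline = {"playwright": 13_600, "sqlite-tools": 13_400}


def share_of_window(tokens: int, window: int = WINDOW_TOKENS) -> float:
    """Return the percentage of the window consumed by `tokens`."""
    return 100 * tokens / window


two_tools = sum(baseline.values())   # 27,000 tokens
heavy_kit = 66_000                   # "Heavy kit: 66K+" from the slide

print(f"two MCP servers: {share_of_window(two_tools):.1f}% of the window")
print(f"heavy kit:       {share_of_window(heavy_kit):.1f}% of the window")
```

Two servers alone cost 13.5% of the window. The 66K kit costs 33%, the top of the 25–33% range.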
Next · the prompts
META
PROMPTING.
Compile your prompts.
Prompts writing prompts.
Meta prompting
Let the AI
write the prompts.
Stop hand-crafting.
Start compiling.
"DSPy is my context engineering tool of choice." Tobi Lütke · Shopify CEO
Tools that write prompts for you
- Anthropic Prompt Generator: Console → describe your task → Claude drafts a production prompt
- OpenAI Generate: Playground button that generates prompts + schemas + functions
- DSPy: "Programming, not prompting." Compilers mutate prompts via LLM reflection.
- TextGrad: LLM-generated textual gradients optimize prompts like backprop.
Now · the loop & the library
BUILD.
Ship the agent.
Close the loop.
github.com/gsd-build/get-shit-done · 54K ★
GSD puts it all
in one loop.
4 parallel researchers. Fresh subagents. Main thread never sees the noise.
Atomic XML plans. A plan-checker loops up to 3×. Meta-prompting in action.
Wave-based. Each plan runs in its own fresh 200K window.
A separate agent checks. Can't be yes-biased by the executor.
"Your main context stays at 30–40%. The work happens in fresh subagent contexts." GSD README
State lives in files, not conversation
.planning/PROJECT.md · PLAN.md · STATE.md · VERIFICATION.md
Same engine. Your code.
code.claude.com/docs/en/agent-sdk
Now you build.
Claude Agent SDK is Claude Code as a library.
Same tools. Same loop. Same guardrails.
Primitives you compose
- Tools: Read, Edit, Bash, Glob, Grep, WebSearch, MCP servers
- Subagents: spawn specialists, isolate context
- Hooks: PreToolUse, PostToolUse, Stop. Audit & gate.
- Sessions: persist, resume, fork
- Your Claude Code skills: via setting_sources=["project"]
Beginner: first agent in 12 lines. Intermediate: production multi-agent systems in CI/CD, using the same engine.
Building
reliability
for your
autonomous agent.
Part 1 was the context.
Part 2 is the chassis.
Stack the primitives.
Demo → prod
Your agent
demoed beautifully.
It failed
in production.
Reliability is not a property
of the model.
It's a stack of primitives you build around it. · The thesis of Part 02
Six primitives. Stack them and your agent becomes boring.
Boring compounds.
Every call. Same shape.
Agents decide
on shapes.
"Poka-yoke your tools. Change the arguments so that it is harder to make mistakes." Anthropic
⨯ Before · shape roulette
✓ After · total function
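One way to get a total function, as a sketch. The `Result` dataclass, the `total` decorator, and the `parse_port` example are assumptions for illustration. The `.ok` / `.error_kind` fields follow the shape named later in this deck.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Result:
    ok: bool
    data: Any = None
    error_kind: Optional[str] = None   # e.g. "invalid_input", "unknown"
    message: str = ""


def total(fn: Callable[..., Any]) -> Callable[..., Result]:
    """Wrap `fn` so that every call returns a Result and never raises."""
    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Result:
        try:
            return Result(ok=True, data=fn(*args, **kwargs))
        except (ValueError, TypeError) as exc:
            return Result(ok=False, error_kind="invalid_input", message=str(exc))
        except Exception as exc:  # anything unclassified still gets a shape
            return Result(ok=False, error_kind="unknown", message=str(exc))
    return wrapper


@total
def parse_port(raw: str) -> int:
    port = int(raw)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

Every call, good or bad, comes back as the same four fields. The agent never has to guess whether it got a value, `None`, or an exception.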
Name the failures.
Errors are part
of your API.
Agents branch deterministically on error_kind.
Not by parsing stack traces.
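A sketch of branching on kind, assuming a small fixed vocabulary. The `ErrorKind` values, the `RateLimited` exception, and the step names in `NEXT_STEP` are hypothetical. A real kernel would define its own.

```python
from enum import Enum


class ErrorKind(str, Enum):
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    TRANSIENT = "transient"
    UNKNOWN = "unknown"


class RateLimited(Exception):
    """Hypothetical: raised by a client wrapper on an HTTP 429."""


def classify(exc: Exception) -> ErrorKind:
    """Map an exception to a kind once, at the tool boundary."""
    if isinstance(exc, PermissionError):
        return ErrorKind.AUTH
    if isinstance(exc, RateLimited):
        return ErrorKind.RATE_LIMIT
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    return ErrorKind.UNKNOWN


# One deterministic next step per kind. No stack-trace parsing.
NEXT_STEP = {
    ErrorKind.AUTH: "stop_and_ask_human",
    ErrorKind.RATE_LIMIT: "back_off_then_retry",
    ErrorKind.TRANSIENT: "retry_once",
    ErrorKind.UNKNOWN: "surface_to_human",
}


def next_step(error_kind: str) -> str:
    """Branch on the kind string; unrecognized kinds are surfaced, not retried."""
    try:
        kind = ErrorKind(error_kind)
    except ValueError:
        kind = ErrorKind.UNKNOWN
    return NEXT_STEP[kind]
```

Classification happens once, where the exception is raised. Everything downstream sees a short string and a lookup table.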
Telegram · kernel.capabilities.telegram
"No silent retries. Classify and surface." · kernel design rule
Log everything.
You can't improve
what you don't measure.
data/kernel_telemetry.jsonl
What the dashboard surfaces
- Per-verb success rate: which tools are flaky
- p50 / p95 latency: which tools hog the turn
- Error distribution: auth vs rate vs transient
- Agent feedback loop: agents can query their own history
Fire-and-forget. A logging failure never breaks a verb.
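A sketch of fire-and-forget JSONL telemetry plus one dashboard metric. The record fields (`ts`, `verb`, `ok`, `latency_ms`, `error_kind`) and both function names are assumptions. The actual schema of data/kernel_telemetry.jsonl is not shown in this deck.

```python
import json
import time
from collections import defaultdict
from pathlib import Path
from typing import Optional


def log_call(path: Path, verb: str, ok: bool, latency_ms: float,
             error_kind: Optional[str] = None) -> None:
    """Append one JSON line per call; swallow any logging failure."""
    record = {"ts": time.time(), "verb": verb, "ok": ok,
              "latency_ms": latency_ms, "error_kind": error_kind}
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    except OSError:
        pass  # fire-and-forget: telemetry must never break the verb


def success_rate_by_verb(path: Path) -> dict[str, float]:
    """Read the log back and compute the per-verb success rate."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            totals[rec["verb"]] += 1
            wins[rec["verb"]] += int(rec["ok"])
    return {verb: wins[verb] / totals[verb] for verb in totals}
```

Because the log is plain JSONL, the same file that feeds a dashboard can be queried by the agent itself.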
Gate before act.
Don't trust agents
to remember.
Gate the action.
PreToolUse blocks before.
PostToolUse verifies after.
Deterministic. Seatbelts.
.claude/settings.json
Real hooks shipping in this repo:
atomic-commit-gate: no session-end with a dirty tree.
require-agent-name: every Agent call must be tagged.
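A sketch of a PreToolUse gate, not one of the hooks named above. It assumes the hook contract as I understand it from the Claude Code hooks documentation: the event arrives as JSON on stdin with `tool_name` and `tool_input`, and exit code 2 blocks the call with stderr fed back to the model. Verify that against the current docs before relying on it. The deny-list is illustrative.

```python
import json
import sys

# Illustrative deny-list; tune for your own repo.
BLOCKED_SUBSTRINGS = ("rm -rf", "drop database", "git push --force")


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, reason). Exit code 2 asks the harness to block."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    tool_input = payload.get("tool_input") or {}
    command = str(tool_input.get("command", "")).lower()
    for needle in BLOCKED_SUBSTRINGS:
        if needle in command:
            return 2, f"blocked by gate: command contains '{needle}'"
    return 0, ""


def main() -> int:
    """Read one hook event from stdin and report the decision."""
    code, reason = decide(json.load(sys.stdin))
    if reason:
        print(reason, file=sys.stderr)
    return code
```

Saved as a script, the file would end with `sys.exit(main())` and be registered under a PreToolUse entry in .claude/settings.json. Keeping `decide` a pure function means the gate can be unit-tested without running an agent.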
One wire, not five.
Prefer raw CDP
over browser tools.
Tokens per 10-step automation · measured by Zechner / Zhang · ~16× delta
linkedin/nav.py · the nav abstraction
"Playwright drives the browser.
DevTools MCP debugs one." Steve Kinney
- Session reuse: attach to logged-in Chrome
- Nav is the single source for selectors
- Two-phase waits: route then DOM, not sleep(2)
- Fallback: Playwright MCP for weird JS flows
Find the POST.
Hunt for POSTs.
Then rate-limit yourself.
DevTools → Network → filter: Fetch/XHR
Replicate in plain HTTP. Copy cookies. Done.
AWS Full Jitter · Marc Brooker (2015)
Rate-limit yourself or the platform will do it for you. LinkedIn: 24–72h restriction on first offense, permanent ban on third.
- API beats browser.
- POST beats GET-scraping.
- Restraint beats speed.
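Full Jitter, as described in the AWS write-up cited above, sleeps a uniformly random time between zero and the capped exponential delay. A minimal sketch follows. The retry wrapper, its parameter names, and the choice to retry only on `ConnectionError` are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full Jitter: uniform random in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base: float = 1.0, cap: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `fn` on ConnectionError, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error, do not swallow it
            sleep(full_jitter_delay(attempt, base, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be tested without waiting. The final failure is re-raised, in keeping with "classify and surface".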
Sit next to it.
Sit next to
your agent.
Save months.
Input passes schema? Target exists?
Show intent in one paragraph.
Human greenlights. Or redirects.
Execute. Log. Report result.
Don't let the agent drift autonomously for weeks. It converges to garbage. Start tight. Widen the loop as trust compounds.
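The four steps above can be sketched as one supervised function. The `Action` shape, `supervised_run`, and the injected `approve` / `execute` callables are hypothetical. In a real session `approve` might prompt a human in a terminal or a chat thread.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    verb: str      # hypothetical action name, e.g. "send_message"
    target: str
    payload: str


def supervised_run(action: Action,
                   known_targets: set[str],
                   approve: Callable[[str], bool],
                   execute: Callable[[Action], str],
                   log: list[str]) -> str:
    """Validate, state intent, wait for a human, then execute and report."""
    # 1. Validate: input has the expected shape and the target exists.
    if not action.payload.strip():
        return "rejected: empty payload"
    if action.target not in known_targets:
        return f"rejected: unknown target {action.target!r}"
    # 2. Intent: one short paragraph a human can read in seconds.
    intent = f"I will {action.verb} to {action.target}: {action.payload[:80]}"
    # 3. Approval: the human greenlights or redirects.
    if not approve(intent):
        log.append(f"declined: {intent}")
        return "declined by human"
    # 4. Execute, log, report.
    result = execute(action)
    log.append(f"done: {intent} -> {result}")
    return result
```

Widening the loop later is a matter of swapping `approve` for a policy that auto-approves the action types that have earned trust.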
"The most successful AI products aren't purely agentic loops, but combine deterministic code with strategically placed LLM decision points." Dexter Horthy · 12-Factor Agents
You're not training the model.
You're training the harness around it.
All six. Stacked.
Six primitives.
Ship boring agents.
Predictable shapes. .ok / .error_kind.
Errors are part of your API. Agents branch on kind.
Every call logged. Per-verb metrics. Self-feedback loop.
PreToolUse / PostToolUse hooks. Seatbelts.
Closest to metal. Hunt POSTs. Rate-limit yourself.
Tight cycles. Train the harness. Save months.
Boring agents compound.
Exciting agents crash.
Now: open working session
Now
you molt.
Find a real pain.
Instrument it. GSD, the Agent SDK, or your own kernel.
Get Michalis & Rayyan 1:1.
Ship something that didn't exist an hour ago.
New shell. Same soul.
Read deeper · every claim sourced
The archive.
Part 01 · Context Engineering
Part 02 · Reliability
ClawCamp · Vibe Coding Nights
By Michalis Vasileiadis & Rayyan Zahid · The claw is the law