Inside Claude Code: A Prompt Engineering Masterpiece

How I intercepted their API calls and discovered why sophisticated prompting still rules AI agents

After hours of reverse-engineering Claude Code, I discovered something that changes how we should think about AI agents. Everyone assumed the magic was in the UX polish or orchestration, but the real breakthrough is happening at the prompt level, and it's more sophisticated than anyone realises.

Here's the full story behind my investigation, plus the extracted prompts and configs that didn't make the video cut.

Why I Became Obsessed

The obsession started when Claude Code launched and I couldn't shake the feeling that something was different. Everyone was talking about the UX, but that felt like surface-level analysis.

So I built my own testing framework.

The Benchmark Setup (details that didn't make the video):

  • 6 realistic coding tasks across different complexity levels

  • Head-to-head comparison: Claude Code vs Cursor (both using Sonnet 4)

  • Measured completion rates, iteration efficiency, and "getting stuck" frequency (a minimal sketch of the bookkeeping follows this list)

  • Tracked specific failure modes for each agent
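If you want to replicate a similar comparison, here's a minimal sketch of how the per-task results can be recorded and aggregated. The field names and scoring below are illustrative, not my exact harness.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str         # e.g. "fix-flaky-test" (hypothetical task name)
    agent: str        # "claude-code" or "cursor"
    completed: bool   # reached an acceptable solution
    iterations: int   # prompt/response rounds needed
    got_stuck: bool   # required manual rescue mid-task

def summarise(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Aggregate completions, average iterations, and stuck-rate per agent."""
    summary: dict[str, dict[str, float]] = {}
    for agent in sorted({r.agent for r in results}):
        runs = [r for r in results if r.agent == agent]
        summary[agent] = {
            "tasks_completed": sum(r.completed for r in runs),
            "avg_iterations": sum(r.iterations for r in runs) / len(runs),
            "stuck_rate": sum(r.got_stuck for r in runs) / len(runs),
        }
    return summary
```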

The Results: Claude Code won 4 of the 6 tasks to Cursor's 2 and required fewer iterations to reach acceptable solutions. This wasn't just UX polish; something fundamental was different.

Within a month, I'd shifted 90% of my development to Claude Code. Since both tools were running the same model, the difference had to be in the orchestration. I was determined to reverse-engineer it.

The Investigation

Failed Decompilation: My attempt to decompile the 9MB CLI went nowhere, but it led me to WebCrack, now my go-to for analysing bundled JavaScript. The 443k-line deobfuscated output was still unreadable, but WebCrack itself became invaluable for other projects.

The Breakthrough: I remembered that Claude Code accepts custom base URLs, which meant it was making raw Anthropic API calls under the hood. Intercepting its HTTPS traffic was the key piece of the puzzle.
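I used ProxyMan for this, but any intercepting HTTPS proxy will do. As a rough illustration (a substitute for what I actually ran), a mitmproxy addon along these lines can pull the system prompt and tool names out of each request to the Anthropic Messages API, assuming the CLI's traffic is routed through the proxy and the proxy's CA certificate is trusted.

```python
# intercept_prompts.py - run with: mitmdump -s intercept_prompts.py
# Illustrative substitute for ProxyMan; logs what each Anthropic API call contains.
import json

from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # The Messages API lives at /v1/messages; the system prompt and tool
    # definitions travel in the JSON request body alongside the conversation.
    if "anthropic.com" in flow.request.pretty_host and "/v1/messages" in flow.request.path:
        body = json.loads(flow.request.get_text() or "{}")
        print("SYSTEM PROMPT:", body.get("system"))
        print("TOOLS:", [tool.get("name") for tool in body.get("tools", [])])
```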

What I Found: Watch the video for the full technical breakdown, but the key insight? Every great behaviour in Claude Code traces back to sophisticated prompt engineering with detailed examples and workflow definitions.

The Stuff That Didn't Make the Cut

Anthropic's Own Prompt Evolution

Even Anthropic constantly tunes its prompts for the web and mobile versions of Claude. Since Sonnet 4's release in May, they've shipped two edits and extended the original 1,800-word system prompt to over 2,500 words. Here are two lines from the additions to illustrate the granularity this tuning requires:

Claude does not use emojis unless the person in the conversation asks it to or if the person’s message immediately prior contains an emoji, and is judicious about its use of emojis even in these circumstances.

Claude never curses unless the human asks for it or curses themselves, and even in those circumstances, Claude remains reticent to use profanity.

This proves that even with the most advanced models, detailed prompting remains essential.

The Evaluation Framework

If you're planning any prompt tuning (and you should), build proper evaluations first. I use the following (a minimal sketch of the first is below the list):

  • Automated test suites for correctness

  • Human rubrics for code quality

  • Performance benchmarks for efficiency
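For the first bullet, the sketch promised above: run each task's test suite against the agent-generated solution and record pass/fail. The directory layout and names here are hypothetical; adapt them to however your benchmark tasks are organised.

```python
import subprocess
from pathlib import Path

def run_correctness_eval(tasks_root: Path) -> dict[str, bool]:
    """Run the pytest suite of every task under tasks_root and record pass/fail."""
    results: dict[str, bool] = {}
    for task in sorted(p for p in tasks_root.iterdir() if p.is_dir()):
        proc = subprocess.run(
            ["pytest", "-q", str(task / "tests")],  # hypothetical tasks/<name>/tests layout
            capture_output=True,
            text=True,
        )
        results[task.name] = proc.returncode == 0  # 0 means every test passed
    return results

if __name__ == "__main__":
    for name, passed in run_correctness_eval(Path("tasks")).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```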

Evaluations alone could fill a 20-minute video; let me know if you want me to cover this (vote below).

Would you like a video covering best practices for evals when building AI applications?

If yes, please note any specific areas of interest.


Other Tool Masterclasses

The bash tool definition is another prompt engineering masterpiece: 8 usage examples, explicit do's and don'ts, and error handling patterns. If you've ever wondered how it makes git commits so accurately, that workflow is spelled out in the tool description too. The level of detail is extraordinary.
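I won't reproduce the real thing here, but to make the pattern concrete, here's a hedged sketch of what a tool definition with examples baked into its description can look like in the Anthropic API's tool format. The wording, examples, and schema are mine, not the extracted bash tool text.

```python
# Illustrative only - not Claude Code's actual bash tool definition.
# The point: usage examples and do's/don'ts live inside the description itself,
# so the model re-reads them on every single turn.
bash_tool = {
    "name": "bash",
    "description": (
        "Run a shell command in the user's repository.\n\n"
        "Usage examples:\n"
        "  - Check the working tree before committing: git status\n"
        "  - Run the test suite: npm test\n\n"
        "Do: quote paths that contain spaces.\n"
        "Don't: use interactive flags (e.g. git rebase -i) that need a TTY.\n"
        "If a command fails, read stderr and fix the cause before retrying."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The shell command to run."}
        },
        "required": ["command"],
    },
}
```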

Your Action Items

If you're building agents:

  1. Repetition is your friend: Critical behaviours need reinforcement across multiple prompt sections

  2. Examples > descriptions: Abstract instructions alone fail; concrete examples succeed

  3. Test across models: What works for Sonnet won't work for GPT-5

  4. Build evals first: Vibes-based assessment doesn't scale

If you're using Claude Code:

  1. Check out the full system prompt (link below) to understand how to give better instructions

  2. Use the same repetition strategy in your CLAUDE.md files

  3. Be specific about workflows - don't assume it knows your preferences (see the hypothetical snippet below)
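For example, a CLAUDE.md entry might look like this hypothetical snippet, with the critical rule stated up front and repeated inside the workflow it guards, and the workflow itself spelled out step by step:

```markdown
# CLAUDE.md (hypothetical example)

## Critical rules
- Never commit directly to main. Always work on a feature branch.

## Workflow: shipping a change
1. Create a branch first: `git checkout -b feat/<short-name>` (never commit to main).
2. Run `npm test` and fix any failures before asking me to review.
3. Open a PR with a one-paragraph summary; do not merge it yourself.
```

Notice the main-branch rule appears twice: that's the same repetition pattern from the system prompt, applied at the project level.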

The Complete Extracted Materials

ProxyMan - HTTP/HTTPS proxy for intercepting API traffic.

WebCrack - Deobfuscation for bundled JavaScript.

WebStorm - My IDE choice for navigating large codebases.

Going Deeper: Community Tiers

A bunch of you have asked how to get more direct access to my research process and influence future investigations. I've set up two community tiers:

Contributor ($5.99/month): Vote on future video topics, access to monthly live AMAs, and first access to any courses or technical guides I create.

Core Maintainer ($19.99/month): Everything above, plus you can submit questions for AMAs, suggest video topics for future investigations directly, and get priority replies to your comments on each newsletter issue.

This isn't replacing the free newsletter - these issues will always be comprehensive. The tiers are for people who want to shape the direction and get direct access to my process.

What's Next?

Next issue: The testing best practices I've developed through months of daily Claude Code usage - workflows that go far beyond the official docs and can 10x your output.

Quick ask: What's your biggest Claude Code frustration? Reply to this email - I'll feature solutions in the next issue.

Happy shipping, and I’ll see you in the next one.

Yifan