Video Script: How the Agent Harness & LLMs Work Under the Hood

Target audience: Developers curious about what's happening inside GitHub Copilot
Length: ~15–20 minutes
Style: Conversational, educational, demo-heavy


[00:00] INTRO

Hey. So you've been using GitHub Copilot — or maybe some other AI coding tool — and it just kind of works. You type something, it does stuff, files change, terminals run, code gets written.

But have you ever wanted to know what's actually happening? Like, really happening?

That's what this video is. We're going to go from the very bottom — how a language model generates text one token at a time — all the way up to the agent harness that's orchestrating all those tool calls you see in Copilot's chat panel.

And here's why this matters: once you understand the mechanics, you become a fundamentally better prompt writer. You stop guessing and you start engineering.

Let's go.


[00:45] SECTION 1: Tokens — The Atoms of LLMs

Before we talk about agents or loops or tools, we need to talk about tokens. Because everything in an LLM is tokens.

A token is not a character, and it's not a word. It's a chunk of characters — often a few letters, sometimes a whole common word. "function" is one token. "unbelievable" might be two. The word "tokenization" is like three.

[DEMO: Open a browser to platform.openai.com/tokenizer]

Type something like: async function getUserById(id: string) — and look at how it breaks down. Each colored chunk is a token. This is literally what the model sees.

Here's the thing that blows people's minds: the model never sees your text. It sees a list of integers. Token 1234, token 5678, token 91. That's it. Your entire prompt — your system message, your file contents, your question — it's all just a big array of numbers.

Now, why does this matter practically?

Because models have a context window — a maximum number of tokens they can process at once. GPT-4o is 128k. Claude 3.5 Sonnet is 200k. When you attach a file in Copilot, those tokens count. When you have a long chat history, those count. When there are tool results being fed back in — they all count.

You are always working within a budget. Understanding that shapes everything about how you interact with these systems.
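For a gut check on that budget, a common rule of thumb is that typical English text and code run around four characters per token. Here's a tiny sketch of that heuristic. The four-characters-per-token ratio and the 128k window are illustrative assumptions, not exact figures for any particular model:

// Rough token estimate: ~4 characters per token is a common ballpark.
// This is a heuristic, not a real tokenizer.
function estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
}

const contextWindow = 128_000;  // e.g. a 128k-token model
const fileContents = "...";     // whatever you're about to attach

const used = estimateTokens(fileContents);
console.log(`~${used} tokens, ${((used / contextWindow) * 100).toFixed(1)}% of the window`);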


[01:45] SECTION 2: How Models Actually Generate Text

Okay, tokens are the input. But how does the model generate its response?

Here's the fundamental insight: language models are autoregressive. That's a fancy word for a dead-simple idea: the model generates one token at a time, and each token it generates becomes part of the input for the next prediction.

Think about that for a second.

You ask: "What is 2 + 2?"

The model looks at your entire prompt and asks: what is the single most probable next token? It might output "The". Now the input is your prompt plus "The". What's the next most probable token? "answer". Now it's your prompt plus "The answer". And so on: "is", "4", ".".

Each forward pass through the model produces exactly one token. To generate 100 tokens, the model runs 100 times. To generate 1000 tokens — like a detailed code implementation — that's 1000 runs.
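If you wrote that loop yourself, it would look something like this. It's a minimal sketch: predictNextToken and the end-of-sequence sentinel are hypothetical stand-ins for what the model actually does internally, but the shape of the loop is the point.

// Sketch of autoregressive decoding.
// predictNextToken() stands in for one full forward pass through the model.
declare function predictNextToken(tokens: number[]): number;

const END_OF_SEQUENCE = 0; // illustrative sentinel token id

function generate(promptTokens: number[], maxNewTokens: number): number[] {
    const tokens = [...promptTokens];
    for (let i = 0; i < maxNewTokens; i++) {
        const next = predictNextToken(tokens); // one forward pass, one token
        if (next === END_OF_SEQUENCE) break;
        tokens.push(next);                     // the output becomes part of the next input
    }
    return tokens.slice(promptTokens.length);  // return only the newly generated tokens
}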

[DEMO: In VS Code, open Copilot Chat, ask something complex and watch the response stream in]

See how it streams? That's not a UX trick — that's literally the model computing one token at a time and the extension streaming each one to you as it's generated.

This has a huge implication for how you prompt. The model commits as it generates. Once it's written const result = await db.query( it's kind of locked into finishing that thought. It can't go back. So if you want the model to think through a problem before answering, you have to tell it to. "Think step by step" isn't just a vibe — it's giving the model space to work through reasoning tokens before it commits to the final answer tokens.


[03:00] SECTION 3: Attention — Why Context Matters

One more foundational concept before we get to the fun stuff: attention.

The mechanism that makes transformers (and by extension, LLMs) so powerful is the attention mechanism. The idea is that when the model is predicting the next token, it doesn't treat all previous tokens equally. It attends to the ones that are most relevant to what it's currently generating.

If you're generating the variable name after const user = await, the model is heavily attending to the tokens "user" and "await" — not to the import statements at the top of your file.
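To make that concrete, here's a stripped-down sketch of the attention calculation. Real models do this across many heads and layers with learned projections, so treat it as a toy illustration of the idea: score every previous token against the current position, softmax the scores, and you get a set of weights that say where the model is "looking."

// Toy single-head attention: how much should the current position "look at" each previous token?
function dot(a: number[], b: number[]): number {
    return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function softmax(scores: number[]): number[] {
    const max = Math.max(...scores);
    const exps = scores.map(s => Math.exp(s - max));
    const total = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / total);
}

// query: the position being generated; keys: every previous token's representation.
// Returns one weight per previous token; higher weight means more attention.
function attentionWeights(query: number[], keys: number[][]): number[] {
    const scale = Math.sqrt(query.length);
    return softmax(keys.map(k => dot(query, k) / scale));
}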

Here's the practical implication: position and proximity in your prompt matter. Content near the end of your context window — close to where the model is generating — tends to get more attention than content buried at the top.

This is why the system prompt is important but recent context is often more influential. It's also why when you're debugging something with Copilot, it helps to paste the actual error right before your question rather than in some attachment that's further up in the conversation.

The model is a very smart but very present-moment creature. Give it what it needs close to where it's working.


[04:15] SECTION 4: The Agent Harness — What It Actually Is

Alright. Now we get to the good stuff. You've seen Copilot in agent mode — it reads files, edits code, runs terminal commands, fixes errors. That's not magic. That's a loop.

Let's look at the actual code.

[DEMO: Open the vscode-copilot-chat repo — specifically src/extension/intents/node/toolCallingLoop.ts]

Here's the _runLoop method — this is the actual while loop that drives the entire agent:

private async _runLoop(outputStream, token): Promise<IToolCallLoopResult> {
    let i = 0;
    let lastResult: IToolCallSingleResult | undefined;

    while (true) {
        if (lastResult && i++ >= this.options.toolCallLimit) {
            // hit the limit — stop or ask to continue
            lastResult = this.hitToolCallLimit(outputStream, lastResult);
            break;
        }

        const result = await this.runOne(outputStream, i, token);
        lastResult = result;
        this.toolCallRounds.push(result.round);

        if (!result.round.toolCalls.length || result.response.type !== ChatFetchResponseType.Success) {
            // No tool calls = model is done. Run stop hooks, then break.
            break;
        }
    }
}

That's it. That's the agent harness. A while loop. Each iteration — each runOne — sends the current state to the model and gets back a response. If the response contains tool calls, execute them and loop. If it doesn't, we're done.

[DEMO: Look at src/extension/intents/common/agentConfig.ts]

export function getAgentMaxRequests(accessor: ServicesAccessor): number {
    return configurationService.getNonExtensionConfig<number>('chat.agent.maxRequests') ?? 200;
}

The default max iterations is 200. In VS Code settings you can see this as chat.agent.maxRequests. That's the leash. If the agent needs more than 200 tool calls to do your task... you probably need to break the task down.


[05:30] SECTION 5: How Tool Calling Works

So within that loop, how does the model actually call a tool?

Here's the mental model: you don't give the model a function to call. You give the model a description of a function — its name, what it does, and the shape of its parameters. The model then, in its response, can output a structured block that says "I want to call this tool with these arguments."

The harness handles the actual execution. The model just produces the intent.
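Here's roughly what that looks like on the wire: a tool description going in, and the model's "I want to call it" block coming back. This is shaped like the OpenAI-style function-calling format that many APIs and harnesses use; the names are illustrative, and the exact wire format Copilot uses internally may differ.

// What the harness sends the model: a description of the tool, not the tool itself.
const readFileTool = {
    type: "function",
    function: {
        name: "read_file",
        description: "Read the contents of a file in the workspace",
        parameters: {
            type: "object",
            properties: {
                filePath: { type: "string", description: "Workspace-relative path to the file" },
            },
            required: ["filePath"],
        },
    },
};

// What the model sends back: structured intent, with the arguments as a JSON string.
const exampleToolCall = {
    id: "call_abc123",
    type: "function",
    function: {
        name: "read_file",
        arguments: '{"filePath": "src/auth/middleware.ts"}',
    },
};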

[DEMO: Open src/extension/tools/common/toolNames.ts]

Look at this ToolName enum — this is the full catalogue of tools the Copilot agent has access to:

export enum ToolName {
    ReadFile = 'read_file',
    FindTextInFiles = 'grep_search',
    EditFile = 'insert_edit_into_file',
    ReplaceString = 'replace_string_in_file',
    CoreRunInTerminal = 'run_in_terminal',
    CoreGetTerminalOutput = 'get_terminal_output',
    Codebase = 'semantic_search',
    GetErrors = 'get_errors',
    CoreRunTest = 'runTests',
    FetchWebPage = 'fetch_webpage',
    // ... and about 40 more
}

read_file, grep_search, insert_edit_into_file, run_in_terminal — these are the verbs. When you ask Copilot to "fix the bug in my auth middleware," it reads the file first. Then it reads related files. Then it calls insert_edit_into_file. Then it might run the tests with runTests. Each of those is one iteration of the while loop.

[DEMO: In VS Code Copilot Chat, ask it to fix something simple and expand the tool call details]

See each tool call in the UI? Each one is: model says "I want to call read_file with path src/auth/middleware.ts" → VS Code executes that → result gets sent back to the model → model decides what to do next.

The model is directing traffic. The tools are doing the actual work.


[07:00] SECTION 6: How the Prompt Is Assembled Each Round

Here's something most people don't realize: on every single iteration of that loop, the entire conversation history is rebuilt and sent to the model. It's not like the model has memory of previous rounds. It has to be told everything, every time.

[DEMO: Open src/extension/prompts/node/agent/agentPrompt.tsx]

Here's how the prompt is structured. Every request contains, in order:

  1. System message — "You are an expert AI programming assistant working in VS Code..."
  2. Custom instructions — your .github/copilot-instructions.md or workspace instructions
  3. Global context — the current workspace structure, open files, environment info
  4. Conversation history — all previous turns with their tool calls and results
  5. Current user message — what you just typed
  6. Current tool call rounds — results from tools called in this turn

That entire thing — potentially hundreds of thousands of tokens — gets assembled and sent to the model on every tool call iteration. This is expensive. This is why context window management matters.
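In message terms, the rebuild looks roughly like this. It's a simplified sketch of the idea, not the actual prompt-tsx code in agentPrompt.tsx:

// Simplified: rebuild the full message list from scratch on every iteration of the loop.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

function buildPrompt(
    systemPrompt: string,
    customInstructions: string,
    workspaceContext: string,
    history: Message[],            // previous turns, including their tool calls and results
    userMessage: string,
    currentToolResults: Message[]  // results from tools called so far in this turn
): Message[] {
    return [
        { role: "system", content: systemPrompt },
        { role: "system", content: customInstructions },
        { role: "system", content: workspaceContext },
        ...history,
        { role: "user", content: userMessage },
        ...currentToolResults,
    ];
}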


[08:15] SECTION 7: Prompt Caching — The Secret to Making Agents Affordable

Here's where it gets really clever. Sending the entire prompt on every iteration would be brutally slow and expensive... except for prompt caching.

Prompt caching works like this: if the beginning of your prompt is identical to a previous request, the API provider doesn't re-process those tokens. They're cached. You only pay (in time and compute) for the new tokens at the end.

[DEMO: Open src/extension/intents/node/cacheBreakpoints.ts]

Look at this comment block — this is the cache strategy written right in the code:

 * Prompt cache breakpoint strategy:
 *
 * The prompt is structured like:
 * - System message
 * - Custom instructions
 * - Global context message (has prompt-tsx cache breakpoint)
 * - History
 * - Current user message with extra context
 * - Current tool call rounds
 *
 * During the agentic loop, each request will have a hit on the previous
 * tool result message.

The team places up to 4 "cache breakpoints" at strategic positions in the prompt. A cache breakpoint is a signal to the API: "everything up to here is stable — cache it."

The system message? Almost never changes — always cached. The conversation history from previous turns? Cached. Only the new tool result at the end is new. So even though the model is seeing 50,000 tokens on each loop iteration, it might only be computing on the last 2,000.

This is why agent mode doesn't take 30 seconds per tool call.

The practical takeaway: your system prompt and early context should be as stable as possible. Don't put dynamic content near the top. Put it near the bottom, close to where the model is actually generating. That way you maximize cache hits.
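For a concrete picture of what a breakpoint looks like from the caller's side, here's the Anthropic Messages API version, where you mark a content block with cache_control. The model name and prompt text are just placeholders. Other providers expose the same idea differently; OpenAI, for instance, caches long stable prompt prefixes automatically.

// Anthropic-style cache breakpoint: everything up to and including this block is stable, so cache it.
const request = {
    model: "claude-3-5-sonnet-latest",  // placeholder model name
    max_tokens: 1024,
    system: [
        {
            type: "text",
            text: "You are an expert AI programming assistant...",  // stable prefix, rarely changes
            cache_control: { type: "ephemeral" },                   // the cache breakpoint
        },
    ],
    messages: [
        { role: "user", content: "Fix the bug in my auth middleware." },
    ],
};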


[09:30] SECTION 8: Context Management — When the Window Fills Up

Even with prompt caching, there's a hard limit. After enough tool call rounds, the conversation history gets too big to fit in the context window. What happens then?

[DEMO: Open src/extension/prompts/node/agent/backgroundSummarizer.ts]

Meet the BackgroundSummarizer. This is a state machine — Idle → InProgress → Completed — that runs in parallel with the agent loop. While the model is doing its next tool call, the summarizer is sending an older version of the conversation to a different model call to compress it into a summary.

export const enum BackgroundSummarizationState {
    Idle = 'Idle',
    InProgress = 'InProgress',
    Completed = 'Completed',
    Failed = 'Failed',
}

When the agent prompt gets big enough, instead of including the full history, it substitutes in the summary. The model sees: "Here's a condensed version of what happened before, and here's what we're doing right now."

This is why very long agent sessions sometimes feel like the model "forgets" details from early on. It literally does — the details got summarized away. If you're working on something complex and you've been going for 20+ tool call rounds, consider starting a new session and providing fresh context rather than continuing in an increasingly compressed conversation.
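Reduced to a sketch, the substitution looks something like this. It's an illustration of the pattern, not the actual vscode-copilot-chat implementation, and it reuses the Message type and estimateTokens helper from the earlier sketches:

// When the history no longer fits the budget, swap it for a summary plus the most recent rounds.
function historyForPrompt(history: Message[], summary: string | undefined, tokenBudget: number): Message[] {
    const fullSize = history.reduce((sum, m) => sum + estimateTokens(m.content), 0);
    if (summary === undefined || fullSize <= tokenBudget) {
        return history; // everything still fits, send it all
    }
    const recent = history.slice(-4); // keep the last few rounds verbatim
    return [
        { role: "user", content: `Summary of the conversation so far:\n${summary}` },
        ...recent,
    ];
}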


[10:45] SECTION 9: Subagents — The Loop Within the Loop

One more architectural thing worth knowing: the loop can go recursive.

[DEMO: Back to src/extension/tools/common/toolNames.ts, scroll to subagent tools]

CoreRunSubagent = 'runSubagent',
SearchSubagent = 'search_subagent',
ExecutionSubagent = 'execution_subagent',

The main agent can spin up subagents — separate model invocations with their own loops. The search subagent, for instance, specializes in semantic codebase search. The execution subagent handles terminal commands in a sandboxed way.

When you're in agent mode and Copilot is really going to town — reading files, searching, editing, running commands — there's a good chance it's orchestrating multiple subagent loops simultaneously. The parent agent acts as the orchestrator; subagents are specialized workers.

Each subagent is its own ToolCallingLoop instance with its own tool call limit, its own context, and its own telemetry span. They're tracked via an invocationId that links them back to the parent request.
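Conceptually, a subagent is just a tool whose implementation happens to be another loop. Something like this sketch, where the class shape and options are hypothetical rather than the real ToolCallingLoop wiring:

// Hypothetical shape for illustration; the real ToolCallingLoop has a different constructor and API.
interface SubagentOptions {
    toolCallLimit: number;   // its own leash, separate from the parent's
    tools: string[];         // the subset of tools this specialist gets
    invocationId: string;    // links telemetry back to the parent request
}

declare class ToolCallingLoop {
    constructor(task: string, options: SubagentOptions);
    run(): Promise<{ finalResponse: string }>;
}

// The parent agent sees the nested loop's answer as an ordinary tool result.
async function runSubagent(task: string, options: SubagentOptions): Promise<string> {
    const loop = new ToolCallingLoop(task, options);
    const result = await loop.run();
    return result.finalResponse;
}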


[11:45] SECTION 10: Practical Takeaways — Being a Better Prompter

Okay. You now know how the machine works. Here's how to use that knowledge.

1. Give context early and explicitly.

The model doesn't know your codebase. It has to discover it via tools. Every read_file call is a round-trip through the loop. If you paste the relevant code snippet directly in your message, you skip that discovery phase. Faster, cheaper, more accurate.

2. Be specific about what you want, not just what the problem is.

"This doesn't work" forces the model to diagnose and fix — multiple reasoning steps. "The getUserById function on line 47 of src/db/users.ts returns undefined when the user doesn't exist — it should throw a NotFoundError instead" gives the model a complete picture. One targeted fix.

3. Understand the token budget.

Attaching a 10,000-line file when you only care about 50 lines wastes your context window. More importantly, it buries the relevant content in noise. The model's attention is real — help it focus.

4. Structure your instructions like the model structures its prompt.

General rules at the top (system message territory). Specific constraints for this task near the bottom (user message territory). This maps to how the model processes its own prompt.

5. In agent mode — let it run, but set up guardrails first.

The agent loop is powerful. But if you point it at an unclear task with no constraints, it will make decisions you don't want. Copilot instructions files (.github/copilot-instructions.md) are processed as part of the system prompt — they're cached, they're always there. Use them.

[DEMO: Show a .github/copilot-instructions.md file with a few useful rules]

Something like: "Always write TypeScript, never JavaScript. Never modify test files unless explicitly asked. Prefer named exports over default exports." These instructions cost tokens once (due to caching) and pay dividends on every single agent run.


[13:15] SECTION 11: Building Your Own Agent

If you want to build your own agent — whether using the VS Code extension API, the OpenAI API, or Anthropic — the pattern is the same one we saw in the source code.

while True:
    prompt = build_prompt(history, tools, current_request)
    response = call_model(prompt)

    if response.has_tool_calls:
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            history.append((tool_call, result))
    else:
        return response.text

That's it. The complexity is in the details: how you build the prompt, which tools you expose, how you handle errors, how you manage context when the window fills up. But the skeleton is that simple while loop.
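If you want that skeleton as real code, here's a minimal TypeScript sketch against the OpenAI Node SDK. executeTool is a hypothetical dispatcher you'd write yourself, and the tool definitions are whatever you choose to expose:

import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical dispatcher: map a tool name plus parsed arguments to real work, return a string result.
declare function executeTool(name: string, args: unknown): Promise<string>;

async function runAgent(
    messages: OpenAI.ChatCompletionMessageParam[],
    tools: OpenAI.ChatCompletionTool[],
    maxRounds = 50
): Promise<string> {
    for (let round = 0; round < maxRounds; round++) {
        const response = await client.chat.completions.create({ model: "gpt-4o", messages, tools });
        const message = response.choices[0].message;

        // No tool calls means the model is done: return the final answer.
        if (!message.tool_calls?.length) {
            return message.content ?? "";
        }

        // The model's tool-call turn goes into history, then each result is appended after it.
        messages.push(message);
        for (const call of message.tool_calls) {
            if (call.type !== "function") continue;
            const result = await executeTool(call.function.name, JSON.parse(call.function.arguments));
            messages.push({ role: "tool", tool_call_id: call.id, content: result });
        }
    }
    throw new Error("Hit the round limit without a final response");
}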

And now that you've seen the production implementation in vscode-copilot-chat — the ToolCallingLoop, the BackgroundSummarizer, the cache breakpoints, the subagent routing — you have a reference point. You can see what "production quality" looks like and apply those ideas to whatever you're building.


[14:30] WRAP UP

Let's recap the mental model:

  • Tokens are the atoms. Everything is tokens. You're always working within a budget.
  • Autoregressive generation means the model generates one token at a time and commits as it goes. Give it space to reason before it answers.
  • Attention means proximity matters. Important context goes near the end of your prompt, close to where the model is generating.
  • The agent loop is a while loop. Each iteration builds a prompt, calls the model, executes any tool calls, appends results, and repeats until the model stops calling tools.
  • Prompt caching makes multi-round agents practical. Keep your early context stable to maximize cache hits.
  • Context management (the BackgroundSummarizer) means very long sessions compress. Start fresh when things get complicated.

None of this is magic. It's engineering. And the more you understand the engineering, the better you get at working with it rather than wondering why it's doing what it's doing.

The source code is there to read.

github.com/microsoft/vscode-copilot-chat — everything we looked at today is in src/extension/intents/node/toolCallingLoop.ts, src/extension/intents/node/agentIntent.ts, and src/extension/prompts/node/agent/.

Go read it. You'll be surprised how much makes sense now.

See you in the next one.


DEMO CHECKLIST

Timestamp | Demo | File/URL
01:00 | Token visualization | https://platform.openai.com/tokenizer
02:30 | Streaming response | VS Code Copilot Chat panel
04:15 | _runLoop while loop | src/extension/intents/node/toolCallingLoop.ts
04:45 | Max iterations config | src/extension/intents/common/agentConfig.ts
05:30 | ToolName enum | src/extension/tools/common/toolNames.ts
06:00 | Live tool calls | VS Code Copilot Chat, expand tool details
07:00 | AgentPrompt structure | src/extension/prompts/node/agent/agentPrompt.tsx
08:15 | Cache breakpoints comment | src/extension/intents/node/cacheBreakpoints.ts
09:30 | BackgroundSummarizer state machine | src/extension/prompts/node/agent/backgroundSummarizer.ts
10:45 | Subagent tool names | src/extension/tools/common/toolNames.ts
13:15 | Pseudocode agent loop | (write on screen)
14:00 | Example copilot-instructions.md | local file

Word count: ~2,400 spoken words | Estimated runtime: 16–18 minutes
