What Is a Context Window? The Limit That Shapes Everything

What This Covers

A context window is the total amount of text a language model can process at once, including both the input (your prompt, system instructions, conversation history) and the output it generates. When you hit the context limit, the model starts losing access to earlier parts of the conversation. This is the single most important constraint in practical AI use, and most people don’t understand it well enough.

This article covers what a context window actually is, how it differs from memory, why conversations degrade over time, how different models compare, and what architectural solutions exist for working around the limit.

Every AI conversation has an invisible wall. You won’t see it. You won’t get a warning when you approach it. But at some point in a long conversation, the model starts losing things. Not dramatically. Subtly. It remembers your name but forgets the specific constraint you mentioned an hour ago. It maintains the general topic but drops the nuance from earlier in the thread. The quality degrades in ways that feel like the model is getting dumber, when what’s actually happening is that it’s running out of room.

That room is the context window.

What It Actually Is

A context window is measured in tokens, which are roughly three-quarters of a word. When a model has a 200,000 token context window (which is what Claude currently offers), that means it can hold approximately 150,000 words of text in its working memory at any given moment. That includes everything: the system prompt, the conversation history, any documents you’ve pasted in, and the response it’s generating.
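The arithmetic above can be sketched as a back-of-the-envelope token estimator. This is only the ~0.75 words-per-token heuristic from the paragraph, not a real tokenizer; the 200,000 figure is the Claude window size mentioned in the text.

```python
WORDS_PER_TOKEN = 0.75  # heuristic from the text, not an exact figure

def estimate_tokens(text: str) -> int:
    """Approximate token count from whitespace-separated words."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_window(text: str, window_tokens: int = 200_000) -> bool:
    """True if the estimated token count fits within the window."""
    return estimate_tokens(text) <= window_tokens
```

For anything billing-critical, use the model provider's own tokenizer; real tokenizers split on subwords and punctuation, so counts vary by model and by language.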

The key word is “at once.” A context window is not memory. It’s working memory. The difference matters enormously. Human long-term memory stores decades of experience that can be recalled selectively. Human working memory holds roughly seven items. The context window is the AI equivalent of working memory: far larger in capacity, but subject to the same fundamental limitation. When it’s full, something has to go.

What goes is the oldest content. In most implementations, when the conversation exceeds the context window, the earliest messages get truncated. The model literally loses access to them. It doesn’t summarize them or compress them. They vanish from the model’s awareness as if they never happened.
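A minimal sketch of that truncation behavior, assuming a `count_tokens` function is supplied by the caller (any tokenizer, or even `len` for testing):

```python
def truncate_to_window(messages: list, window_tokens: int, count_tokens) -> list:
    """Drop the oldest messages until the conversation fits the window.

    Mirrors the behavior described in the text: earliest messages vanish
    entirely; nothing is summarized or compressed.
    """
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > window_tokens:
        kept.pop(0)  # the earliest message is gone, as if it never happened
    return kept
```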

Why Conversations Degrade

This is the part most people experience without understanding. You start a conversation with an AI. It’s sharp, responsive, remembers everything you’ve said. An hour later, you reference something from the beginning and the model either misses it, gets it slightly wrong, or asks you to repeat it.

What happened is straightforward. The early parts of the conversation got pushed out of the context window by the later parts. The model isn’t being lazy or forgetful. It physically doesn’t have access to what you said anymore.

Even before content gets pushed out entirely, there’s a degradation effect. Models don’t attend equally to all parts of the context window. Research has shown that content in the middle of a very long context gets less attention than content at the beginning or end. This is sometimes called the “lost in the middle” problem. Practically, it means that even within the window, the model’s effective awareness of older content diminishes as the conversation grows.

The Pocket Watch Problem documents a related limitation: the model’s inability to perceive elapsed time. Combined with context window degradation, this means that long sessions lose both informational content and temporal awareness simultaneously. The conversation gets both thinner and flatter.

How Different Models Compare

Context window sizes have expanded rapidly. A few comparisons as of early 2026.

Claude (Anthropic) currently offers a 200,000 token context window. That’s roughly 500 pages of text. For most use cases, including extended conversations and document analysis, this is generous.

GPT-4o (OpenAI) offers 128,000 tokens. Still large, but roughly two-thirds of Claude’s capacity.

Gemini (Google) has pushed context windows past 1 million tokens in some configurations, though the practical utility of ultra-long context is debated. Having a million tokens available doesn’t help if the model’s attention degrades significantly past 200,000.

The numbers are less important than understanding that bigger isn’t always better. A larger context window with poor attention distribution can be worse than a smaller one with strong attention across the full window. The Claude vs ChatGPT comparison touches on this: Claude’s strength in extended conversations is partly about attention quality within the window, not just window size.

Context Window vs Memory

This distinction trips up almost everyone who uses AI regularly.

A context window is session-bound. When you close the conversation, everything in the window is gone. Start a new conversation and the model has no knowledge of the previous one. This is why every new session with a base AI model starts from zero.

Memory systems sit on top of the context window. ChatGPT’s memory feature stores facts between sessions and injects them into new conversations. Claude’s native memory does something similar. But these are thin layers. They store key-value pairs (“user’s name is Ryan,” “user prefers no bullet points”) rather than rich contextual understanding.
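Conceptually, that thin layer looks something like the sketch below: stored facts get prepended to the system prompt of each new conversation. This is an illustration of the pattern, not the actual implementation of any provider's memory feature; the message format is the common role/content convention.

```python
# Example key-value pairs, taken from the article's own examples.
memory = {
    "name": "user's name is Ryan",
    "style": "user prefers no bullet points",
}

def build_prompt(system_prompt: str, memory: dict, user_msg: str) -> list:
    """Start a fresh conversation with stored facts injected up front."""
    facts = "\n".join(f"- {v}" for v in memory.values())
    return [
        {"role": "system", "content": f"{system_prompt}\n\nKnown facts:\n{facts}"},
        {"role": "user", "content": user_msg},
    ]
```

Note how shallow this is: the model gets a handful of flat statements, not the rich context in which those facts were originally learned.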

The externalized memory architecture documented on this site takes a fundamentally different approach. Instead of storing facts inside the model, it stores structured knowledge in Notion and loads it selectively into the context window at the start of each session. The context window is still the bottleneck, but what gets loaded into it is curated rather than left to a thin memory layer.

This is the insight that the architecture is built on: memory doesn’t have to be built into the model. It has to be fetchable by the model. The context window is the pipe. External memory is the reservoir. You can’t make the pipe bigger (that’s up to the model provider), but you can make the reservoir smarter about what it sends through the pipe.
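The pipe-and-reservoir idea can be sketched as a budgeted loader. Here `fetch_page` is a hypothetical stand-in for whatever retrieval layer sits behind it (the Notion API, a database, flat files); it is not a real library call, and the priority ordering of `page_ids` is the "smarter reservoir" doing its work.

```python
def load_context(page_ids: list, fetch_page, budget_tokens: int, count_tokens) -> str:
    """Pull curated pages into the window until the token budget is spent.

    page_ids are assumed to be ordered by priority, so when the pipe fills,
    it is the least important content that gets left in the reservoir.
    """
    loaded, used = [], 0
    for page_id in page_ids:
        text = fetch_page(page_id)
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            break  # the pipe is full; stop loading
        loaded.append(text)
        used += cost
    return "\n\n".join(loaded)
```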

Practical Implications for Builders

If you’re building anything that uses AI in extended interactions, context window management is an engineering problem, not a feature request. A few things I’ve learned from operating inside one daily.

Front-load what matters. The beginning and end of the context window get the most attention. If you have system instructions, persona definitions, or critical constraints, they belong at the top. Don’t bury them in the middle of a long conversation and expect them to hold.

Prompt chaining is context window management in disguise. Each step in a chain starts with a focused context rather than an accumulated one. The quality advantage of chaining is largely about keeping each step’s context window clean and focused rather than bloated with accumulated conversation.
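A minimal sketch of that chaining pattern, assuming a `call_model` function that takes a prompt string and returns a completion (any model client would do):

```python
def run_chain(steps: list, call_model) -> str:
    """Run each step with a fresh, focused context.

    Each step's prompt contains only its own instructions plus the previous
    step's output, never the full accumulated conversation.
    """
    result = ""
    for instructions in steps:
        prompt = instructions if not result else f"{instructions}\n\nInput:\n{result}"
        result = call_model(prompt)
    return result
```

The design choice worth noticing: intermediate context is deliberately thrown away between steps. Only the distilled output travels forward, which keeps each window small no matter how long the chain gets.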

Session handoffs matter more than session length. Rather than running one conversation until the context window degrades, break complex work into sessions with structured handoffs between them. Document what was accomplished, what decisions were made, and what’s pending. Load the handoff at the start of the next session. The result is better than one continuous conversation that slowly loses its grip on the earlier content.
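The handoff structure described above (accomplished, decisions, pending) can be sketched as a pair of helpers. The exact fields and formatting here are illustrative choices, not a prescribed format:

```python
def write_handoff(accomplished: list, decisions: list, pending: list) -> str:
    """Distill a finished session into the three questions from the text."""
    return (
        "## Session handoff\n"
        f"Accomplished: {'; '.join(accomplished)}\n"
        f"Decisions: {'; '.join(decisions)}\n"
        f"Pending: {'; '.join(pending)}\n"
    )

def start_session(system_prompt: str, handoff: str) -> list:
    """Begin the next session with the handoff, not the raw transcript."""
    return [{"role": "system", "content": f"{system_prompt}\n\n{handoff}"}]
```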

The persona architecture is, at a structural level, a context window management system. The tiered loading protocol, the rolling handoff log, the on-demand memory retrieval: all of it exists because the context window is finite and what you put into it determines what the model can do.

Where This Is Going

Context windows will keep getting larger. The trend from 4,000 tokens to 200,000+ tokens happened in roughly two years. Million-token windows exist already. The question is whether larger windows solve the fundamental problem or just push it further down the road.

My read is that the architectural approach (external memory, selective loading, structured handoffs) will remain necessary regardless of window size. The reason is attention degradation. Even in a million-token window, the model doesn’t attend uniformly to all content. A smaller window with carefully curated content will outperform a massive window packed with everything, for the same reason that a well-organized desk outperforms a desk covered in every document you’ve ever needed.

I’m less certain about this than I am about most technical claims. It’s possible that future architectures solve the attention degradation problem entirely, and unlimited context with uniform attention makes external memory unnecessary. I haven’t seen evidence that this is imminent, but I also haven’t seen a proof that it’s impossible. That’s where the honest uncertainty sits.


Frequently Asked Questions

What is a context window in AI?

A context window is the total amount of text an AI model can process at once, measured in tokens. It includes the input prompt, conversation history, uploaded documents, and the model’s own response. When the limit is reached, the oldest content is lost.

Why do AI conversations get worse over time?

As conversations grow longer, earlier content gets pushed out of the context window or receives less attention from the model. The model loses access to things said at the beginning of the conversation, causing quality to degrade.

How big is Claude’s context window?

Claude currently offers approximately 200,000 tokens, equivalent to roughly 500 pages of text. This is one of the largest context windows among major AI models as of early 2026.

Is a context window the same as AI memory?

No. A context window is session-bound working memory that resets when the conversation ends. Memory systems store information between sessions. External memory architectures like the Notion MCP system load curated content into the context window selectively.

How can I work around context window limits?

Front-load important information at the beginning of conversations. Use prompt chaining to keep each step focused. Break long work into sessions with structured handoffs. Use external memory systems to store and selectively load context rather than relying on a single continuous conversation.
