Fundamentals

Context Engineering: The Hidden Skill Behind Every Good AI System

The difference between AI systems that work and AI systems that embarrass you in production is almost never the model. It's the context. Here's how to engineer context like an adult.

Ricardo Ramirez

Founder · Sprintt

April 2, 2026 · 9 min read
Agents · Context · AI Engineering

Ask ten AI engineers what separates a good production system from a demo, and nine will say "the prompts." They are wrong.

Ask the tenth, and they will say "the context." They are right — and the gap between that answer and the popular one is why most organizations are still shipping AI systems that break the first time a real user touches them.

Context engineering is the skill of deciding — at every step of an AI workflow — what information the model sees, in what form, and why. It is the skill that matters most. It is the skill that is taught least. Here is a working theory of the craft, distilled from the client engagements where we had to make production systems behave.

Context is everything the model isn't

Start with a model. The model is a fixed function. It takes a string in, it produces a string out. Given the same string in, it produces (approximately) the same string out.

Everything that varies — the user's question, the relevant documents, the system's instructions, the tool outputs, the conversation history, the examples, the current date — is context. Context is the entire input, minus the weights.

Which means: the quality of any AI-powered output is a function of the quality of the context you construct. Period. A perfect model with bad context produces bad output. A modest model with excellent context produces excellent output. The model ceiling has risen; the context floor is still wherever the builder left it.

The four failure modes

Most production AI failures trace to one of these four context problems. Name the failure mode first; the fix becomes obvious.

Failure 1: Missing context

The model was asked to make a decision it couldn't make, because the relevant information was not in the input. Classic example: a customer-service agent asked to handle a refund request for an order it cannot see.

The fix is retrieval. Fetch the relevant data before calling the model. This is not fancy; it is table stakes. If you are not routinely pulling records from your database, your CRM, your docs, your logs into the model's input, you are not building AI systems — you are doing creative writing with a chatbot.
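A minimal sketch of the pattern, with `get_order` and `client` standing in for whatever data layer and model client your stack actually uses:

```python
def handle_refund(order_id: str, user_message: str) -> str:
    order = get_order(order_id)  # placeholder: your DB/CRM lookup
    order_context = (
        f"Order {order_id}: placed {order['date']}, total ${order['total']}, "
        f"status {order['status']}, refund_eligible={order['refund_eligible']}"
    )
    # `client` is a placeholder for your model client.
    return client.generate(
        system="You are a customer-service agent. Decide the refund "
               "using ONLY the order data provided.",
        user=f"Order data:\n{order_context}\n\nCustomer message:\n{user_message}",
    )
```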

Failure 2: Noisy context

The model was given everything — including irrelevant data — and the signal got buried. Classic example: stuffing the last 50 messages of a conversation into the context when only the last 5 are relevant.

The fix is filtering and ranking. Before you pass context to the model, pick which pieces are relevant. Throw away the rest. This is unglamorous engineering work. It is also the difference between a system that scales and one that produces worse output as the conversation grows.
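The cheapest version of this filter looks something like the sketch below; a real system would score relevance with embeddings or a small classifier rather than keyword overlap:

```python
# Sketch: keep the last few turns verbatim, pass older turns through
# a relevance filter. Keyword overlap is a stand-in for a real scorer.

def select_messages(messages, current_query, keep_recent=5):
    recent = messages[-keep_recent:]
    older = messages[:-keep_recent]
    query_terms = set(current_query.lower().split())
    relevant_older = [
        m for m in older
        if query_terms & set(m["content"].lower().split())
    ]
    return relevant_older + recent
```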

Failure 3: Stale context

The context was right at one point in time and has since gone out of date. Classic example: a product recommendation system quoting prices from six months ago because the price list was loaded into the system prompt.

The fix is freshness — treat stale data as a first-class concern. Either refetch at runtime, or build explicit "this data is from X; do not quote prices older than Y days" guardrails into the prompt.
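A sketch of the freshness check, assuming records carry an ISO timestamp and a hypothetical `refetch_product` helper:

```python
from datetime import datetime, timedelta, timezone

MAX_PRICE_AGE = timedelta(days=7)

def price_context(product: dict) -> str:
    # Assumes `fetched_at` is an ISO-8601 timestamp with timezone info.
    fetched_at = datetime.fromisoformat(product["fetched_at"])
    if datetime.now(timezone.utc) - fetched_at > MAX_PRICE_AGE:
        product = refetch_product(product["id"])  # hypothetical live refetch
        fetched_at = datetime.fromisoformat(product["fetched_at"])
    return (
        f"{product['name']}: ${product['price']} "
        f"(as of {fetched_at.date()}; do not quote prices older than "
        f"{MAX_PRICE_AGE.days} days)"
    )
```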

Failure 4: Misshapen context

The model was given the right data in the wrong form, and it couldn't use it. Classic example: dumping a raw database row with 40 fields into the prompt when the model only needs three.

The fix is shaping. Extract the relevant fields, label them clearly, discard the rest. Models are remarkably good at extracting structure from chaos — but much better at using pre-structured information.
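The shaping step can be this small; the field names here are illustrative:

```python
FIELDS = ("name", "plan", "renewal_date")  # the three fields that matter

def shape_row(row: dict) -> str:
    """Turn a raw 40-field row into a compact, labeled snippet."""
    return "\n".join(f"{field}: {row[field]}" for field in FIELDS)

# The model sees three labeled lines, nothing else:
#   name: Jane Doe
#   plan: Pro (annual)
#   renewal_date: 2026-06-01
```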

The context stack

Every real production AI call is actually a stack of context layers, each with a different lifecycle and a different purpose. Knowing the stack is the first step to managing it.

From most stable to most volatile:

Layer 1: System identity

"You are a senior financial analyst for a mid-market lending firm..."

This is the most durable layer. It rarely changes between invocations. Its job is to set the frame for everything else: role, tone, constraints.

Rule of thumb: if you're editing this layer more than once a week, you're using it wrong. Put volatile context elsewhere.

Layer 2: Persistent knowledge

"Our company operates under Regulation Z. Our risk policy requires..."

The stuff the model needs to know about your organization that doesn't change per-request but might change per-quarter. Regulatory frameworks, brand voice, operating principles, definitional knowledge.

Rule of thumb: this layer is a canonical document, edited deliberately, reviewed by stakeholders. Treat it like a company handbook, because it functionally is one.

Layer 3: User/session state

"The user is Jane, a returning customer with three active loans. Last conversation was on March 12 about..."

The state that varies per user but is stable within a session. Loaded at session start; carried forward until the session ends.

Rule of thumb: what counts as "session" is your choice. For a long-running assistant, a session might be a week. For a short transactional call, a session is a single request.

Layer 4: Retrieved evidence

"Here are the three most relevant documents to the current question..."

The data fetched specifically for this turn. Product catalog entries, CRM records, knowledge-base articles, compliance documents. This is the layer where most engineering effort goes — because it is the hardest.

Rule of thumb: retrieval is a ranking problem, not a lookup problem. Anyone can pull 100 documents. The skill is returning the 3 that matter.

Layer 5: Conversational history

"User: ... / Assistant: ... / User: ..."

The running dialogue. Grows with every turn. Without management, it overwhelms everything else.

Rule of thumb: plan for summarization from day one. As the conversation grows past a threshold, compress older turns into a summary. Keep the last N turns verbatim, summarize the rest.

Layer 6: Task instruction

"The user just asked about their loan payment. Respond..."

The specific ask for this turn. This is the only layer that should contain the immediate request.

Rule of thumb: put this layer last. Models attend most strongly to the most-recent content.
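Assembled in order, the stack might look like this minimal sketch, assuming plain string concatenation (chat-message formats vary by framework):

```python
# Sketch: assemble the stack most-stable-first, task instruction last.
# Layer contents are illustrative strings.

def build_prompt(identity, knowledge, session, evidence, history, task):
    layers = [
        ("System identity", identity),        # Layer 1: rarely changes
        ("Persistent knowledge", knowledge),  # Layer 2: changes per quarter
        ("User/session state", session),      # Layer 3: per session
        ("Retrieved evidence", evidence),     # Layer 4: per turn
        ("Conversation history", history),    # Layer 5: grows per turn
        ("Current task", task),               # Layer 6: last, attended most
    ]
    return "\n\n".join(f"## {name}\n{content}" for name, content in layers)
```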

The engineering you actually have to do

The stack above is the architecture. The engineering is how you populate each layer efficiently, reliably, and within the token budget.

Retrieval: build it like a search engine, not a database lookup

Most teams build their first RAG system as a simple vector lookup: embed the query, find the top-K nearest documents, pass them in. This works until it doesn't.

It fails because vector similarity is a lossy proxy for relevance. "How do I change my password?" and "I want to update my security settings" are semantically close and should match; "I want to change my plan" is semantically similar to both but is a different request entirely.

Production retrieval systems combine: (1) embedding-based semantic search, (2) keyword-based search for exact matches and entities, (3) metadata filtering for recency/permissions, (4) a reranker that takes the top 50 candidates and re-scores them with a more expensive model, (5) deduplication and diversification so you don't return three near-identical docs.
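One plausible shape for that pipeline; every helper here (`semantic_search`, `keyword_search`, `rerank`) is a placeholder for your own search stack:

```python
def retrieve(query: str, user, k: int = 3) -> list:
    # (1) + (2): cast a wide net with both search modes
    candidates = semantic_search(query, k=50) + keyword_search(query, k=50)
    # (3): metadata filtering for recency and permissions
    candidates = [
        d for d in candidates
        if d.visible_to(user) and not d.is_stale()
    ]
    # (4): re-score the survivors with a more expensive reranker
    ranked = rerank(query, candidates)
    # (5): dedupe and diversify so we don't return near-identical docs
    seen, results = set(), []
    for doc in ranked:
        if doc.content_hash not in seen:
            seen.add(doc.content_hash)
            results.append(doc)
        if len(results) == k:
            break
    return results
```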

If your retrieval is a single `similarity_search(query, k=5)` call, it will fail in production. Build it like a search engine — because it is one.

Summarization: compress or lose the plot

For long-running agents, conversation history is the single biggest context liability. A naive approach keeps every turn in context; by turn 50, most of your token budget is being spent re-sending old turns the model no longer needs verbatim.

The pattern: after N turns, summarize the first M turns into a compact summary. Keep the last K turns verbatim. The summary lives alongside the verbatim turns.

Where it gets interesting: the summary is its own prompt. You are building a system where one model produces a summary that another call to the same model will consume. The craft is in the summary's format — bullet points of key facts, open commitments, entity identities. Not a narrative. A dossier.
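A minimal sketch of the rolling compression, where `summarize_turns` stands in for a cheap model call prompted to produce that dossier:

```python
SUMMARIZE_AFTER = 20  # N: turn count that triggers compression
KEEP_VERBATIM = 6     # K: recent turns always kept verbatim

def compact_history(summary, turns):
    if len(turns) <= SUMMARIZE_AFTER:
        return summary, turns
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    # Placeholder for a model call prompted to emit a dossier:
    # key facts, open commitments, entity identities. Not a narrative.
    summary = summarize_turns(summary, older)
    return summary, recent
```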

Structured extraction: transform raw data into context

Raw data is rarely the best context. A JSON blob with 40 fields burns tokens. A well-shaped paragraph extracted from that blob communicates more to the model for less.

Pattern: before passing data into the model, run it through a transformation step — often a small, cheap model call — that extracts exactly the fields and phrasing you want. The transformation is reusable and cached. The main model gets clean, shaped context.
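A sketch of that transformation step, with `cheap_model` as a placeholder for a small, inexpensive model client:

```python
EXTRACTION_PROMPT = """From the JSON record below, extract only:
- customer name
- current plan and renewal date
- open support tickets (count and topics)
Write them as labeled lines. Ignore every other field.

{record}"""

def shape_record(record_json: str) -> str:
    # Cache on the raw record so repeat requests skip the call.
    return cheap_model.generate(EXTRACTION_PROMPT.format(record=record_json))
```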

Tool results: summarize before re-ingesting

When an agent calls a tool, the result goes back into the context. An unbounded tool result (e.g., a grep that returns 10,000 matches) will poison the agent's context for the rest of the session.

Pattern: wrap noisy tools in a subagent that runs the tool, processes the raw output, and returns a short summary to the main agent. The main agent sees "200 matches across 15 files, top pattern: X." It does not see the raw 10,000 matches.
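A sketch of the wrapper, with `run_grep` and `small_model` as placeholders:

```python
MAX_RAW_CHARS = 2_000

def grep_tool(pattern: str, path: str) -> str:
    raw = run_grep(pattern, path)  # placeholder; output may be enormous
    if len(raw) <= MAX_RAW_CHARS:
        return raw  # small results pass through untouched
    # A subagent digests the raw output into a bounded report.
    return small_model.generate(
        "Summarize these grep results in under 5 lines: total matches, "
        "files involved, dominant patterns.\n\n" + raw[:50_000]
    )
```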

The non-obvious truth about tokens

Teams tend to optimize token usage for cost. This is the wrong metric. The right metric is attention.

A model has a fixed attention budget. The more irrelevant content you pack into the context, the more thinly the model attends to each piece. A 100-token prompt with three highly relevant documents will often outperform a 5,000-token prompt with the same three documents plus 4,900 tokens of noise. Not because the model "forgets" the noise — because the noise dilutes the signal.

Which means: context engineering is not about packing more in. It is about packing less in, more deliberately. The craft is in omission.

The skill curve

New AI engineers spend most of their time on prompt phrasing. Experienced AI engineers spend most of their time on context — what to retrieve, how to rank, how to summarize, how to shape, what to omit. The phrasing of the prompt is a small fraction of the overall system, and usually settled early.

If you are hiring for AI engineering, the interview signal to look for is not "can this person write a clever prompt." It is "can this person describe the context pipeline for a production system, end to end, and explain why each piece exists." The craft is downstream of the prompt. It lives in the context.

Get the context right, and a modest prompt produces great output. Get the context wrong, and no prompt will save you.


Sprintt builds production AI systems where context is a first-class concern — not an afterthought. If your AI system is hallucinating, drifting, or inconsistent in ways no amount of prompt tweaking seems to fix, it's a context problem. Book a 30-minute call and we'll find it.

Written by

Ricardo Ramirez

Founder of Sprintt. Product leader, practitioner, and operator — not an academic or a theorist. Writes about the gap between AI strategy and shipped production systems, because closing that gap is the only thing Sprintt does.
