Antrophic이 말하는 효과적인 컨텍스트 엔지니어링

프롬프트 엔지니어링과 컨텍스트 엔지니어링의 차이

Oct 4. 2025

컨텍스트 엔지니어링과 관련하여서 Antrophic에서 잘 정리된 자료가 있어서 번역과 함께 기록해보려고 한다! 공부하면서 내가 헷갈렸던 개념들에 대해서도 정리해보았다.

https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

Effective context engineering for AI agents

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

Context engineering vs. prompt engineering At Anthropic, we view context engineering as the natural progression of prompt engineering. Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes (see our docs for an overview and useful prompt engineering strategies). Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.

Anthropic에서는 컨텍스트 엔지니어링을 프롬프트 엔지니어링의 자연스러운 진화로 보고 있습니다.

프롬프트 엔지니어링은 최적의 결과를 위해 LLM 지시문을 작성하고 구성하는 방법을 의미합니다(개요와 유용한 프롬프트 엔지니어링 전략은 문서를 참고하세요).

반면, 컨텍스트 엔지니어링은 LLM 추론 과정에서 최적의 토큰 집합(정보)을 선별·유지하는 전략을 의미하며, 여기에는 프롬프트 외부에서 컨텍스트에 포함될 수 있는 모든 정보가 포함됩니다.

프롬프트 엔지니어링과 컨텍스트 엔지니어링의 차이

프롬프트 엔지니어링

인공지능에게 "어떻게 질문하느냐"에 집중하는 방법. 즉, 원하는 답변을 얻기 위해 AI에게 전달하는 지시문(프롬프트)을 잘 작성하고 구성하는 기술. 학생에게 단순히 “이 문제 풀어봐”라고 하는 대신, “1단계부터 과정을 설명하면서 풀어줘”라고 말하면 원하는 방식의 답을 더 잘 얻을 수 있음

컨텍스트 엔지니어링

단순히 질문만 다듬는 게 아니라, AI가 답변을 만들 때 참고하는 전체 상황(context) 을 관리하는 방법. 시스템 지침, 이전 대화 기록, 외부 데이터, 도구 사용 정보 등이 포함됨. 즉, AI가 문제를 풀 때 사용할 수 있는 교과서, 노트, 힌트 전체를 어떻게 구성하고 유지할지를 다루는 것

In the early days of engineering with LLMs, prompting was the biggest component of AI engineering work, as the majority of use cases outside of everyday chat interactions required prompts optimized for one-shot classification or text generation tasks. As the term implies, the primary focus of prompt engineering is how to write effective prompts, particularly system prompts. However, as we move towards engineering more capable agents that operate over multiple turns of inference and longer time horizons, we need strategies for managing the entire context state (system instructions, tools, Model Context Protocol (MCP), external data, message history, etc).

LLM을 활용한 엔지니어링 초기에는 프롬프트가 AI 엔지니어링 작업의 가장 큰 비중을 차지했습니다. 일상적인 대화 인터랙션을 제외한 대부분의 활용 사례는 원샷(one-shot) 분류나 텍스트 생성 작업을 위해 최적화된 프롬프트가 필요했기 때문입니다. 이름에서 알 수 있듯이 프롬프트 엔지니어링의 핵심 초점은 효과적인 프롬프트 작성 방법, 특히 시스템 프롬프트 작성법이었습니다.

하지만 이제는 여러 차례 추론(multi-turn inference) 과 장기적 상호작용(longer time horizons) 에서 동작하는 더 강력한 에이전트를 만들기 위해, 전체 컨텍스트 상태(시스템 지침, 도구, MCP(Model Context Protocol), 외부 데이터, 메시지 기록 등)를 관리하는 전략이 필요합니다.

While some models exhibit more gentle degradation than others, this characteristic emerges across all models. Context, therefore, must be treated as a finite resource with diminishing marginal returns. Like humans, who have limited working memory capacity, LLMs have an “attention budget” that they draw on when parsing large volumes of context. Every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available to the LLM. This attention scarcity stems from architectural constraints of LLMs. LLMs are based on the transformer architecture, which enables every token to attend to every other token across the entire context. This results in n² pairwise relationships for n tokens. As its context length increases, a model's ability to capture these pairwise relationships gets stretched thin, creating a natural tension between context size and attention focus. Additionally, models develop their attention patterns from training data distributions where shorter sequences are typically more common than longer ones. This means models have less experience with, and fewer specialized parameters for, context-wide dependencies.

일부 모델은 다른 모델보다 성능 저하가 완만하게 나타나기도 하지만, 이러한 특성은 모든 모델에서 공통적으로 드러납니다. 따라서 컨텍스트는 한정된 자원으로, 추가될수록 점점 효용이 감소하는 특성을 가진다고 볼 수 있습니다. 인간이 작업 기억(working memory) 용량에 제한이 있는 것처럼, LLM도 방대한 컨텍스트를 처리할 때 활용할 수 있는 일종의 “주의(attention) 예산” 이 존재합니다. 새로운 토큰이 추가될 때마다 이 예산이 일정 부분 소모되므로, LLM이 활용할 수 있는 토큰을 신중하게 선별(큐레이션) 해야 합니다.

이러한 주의력의 희소성은 LLM의 구조적 제약에서 비롯됩니다. LLM은 트랜스포머(transformer) 아키텍처를 기반으로 하며, 이 구조에서는 컨텍스트 내의 모든 토큰이 다른 모든 토큰에 주의를 기울일 수 있습니다. 그 결과, 토큰 수가 n일 경우 n² 개의 쌍(pairwise) 관계가 발생합니다.

따라서 컨텍스트 길이가 길어질수록 모델이 이러한 관계를 포착하는 능력은 얇아지고, 컨텍스트 크기와 주의 집중도 사이에 자연스러운 긴장 관계가 형성됩니다. 게다가 모델은 학습 과정에서 주로 짧은 시퀀스를 더 자주 접하기 때문에, 긴 시퀀스 전체에 걸친 의존성에 대해서는 경험도 적고, 이에 특화된 파라미터도 부족합니다.

Antrophic이 말하는 효과적인 컨텍스트의 구성 요소

Given that LLMs are constrained by a finite attention budget, good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome. Implementing this practice is much easier said than done, but in the following section, we outline what this guiding principle means in practice across the different components of context. System prompts should be extremely clear and use simple, direct language that presents ideas at the right altitude for the agent. The right altitude is the Goldilocks zone between two common failure modes. At one extreme, we see engineers hardcoding complex, brittle logic in their prompts to elicit exact agentic behavior. This approach creates fragility and increases maintenance complexity over time. At the other extreme, engineers sometimes provide vague, high-level guidance that fails to give the LLM concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics to guide behavior.

LLM은 제한된 주의(attention) 예산 속에서 동작하기 때문에, 좋은 컨텍스트 엔지니어링이란 원하는 결과를 얻을 가능성을 극대화할 수 있는 최소한의 고신호(high-signal) 토큰 집합을 찾는 것을 의미합니다. 하지만 이를 실제로 구현하는 일은 말처럼 쉽지 않습니다. 다음 섹션에서는 이 기본 원칙이 컨텍스트의 다양한 구성 요소에서 실제로 어떻게 적용되는지를 설명합니다.

시스템 프롬프트(system prompt) 는 매우 명확해야 하며, 단순하고 직접적인 언어를 사용해 에이전트가 이해하기 적합한 수준에서 아이디어를 제시해야 합니다.
여기서 말하는 ‘적합한 수준(altitude)’은 흔히 발생하는 두 가지 실패 모드 사이의 골디락스 존(Goldilocks zone, 적정 지점) 을 의미합니다.

한쪽 극단에서는, 엔지니어가 복잡하고 취약한 로직을 프롬프트에 하드코딩하여 특정한 행동을 강제로 이끌어내려 합니다. 이런 방식은 시간이 지날수록 유연성이 떨어지고 유지보수 복잡도가 증가합니다.

다른 극단에서는, 엔지니어가 모호하고 추상적인 지침만 제공하여 LLM이 원하는 출력을 만들어낼 수 있는 구체적 신호를 받지 못하거나, 잘못된 전제(공유된 맥락이 있다고 가정) 위에 의존하게 만듭니다.

따라서 최적의 수준이란, 행동을 효과적으로 유도할 만큼 충분히 구체적이면서도, 동시에 모델이 행동을 안내받을 수 있는 강력한 휴리스틱(heuristic) 을 제공할 수 있을 정도로 유연해야 합니다.

해당 내용을 해석해보면, 프롬프트 안에 세세한 규칙과 로직을 다 넣어버리면, 단기적으로는 원하는 행동을 정확히 유도할 수 있지만, 작은 변경에도 깨지기 쉽고 유지보수도 어렵다는 점을 의미하는 것 같습니다.

반대로, 너무 추상적이고 모호한 경우, “알아서 잘 해줘” 수준으로 막연한 지시를 주면, 모델은 원하는 행동을 추측할 수밖에 없어요. 결국 기대한 결과와 어긋나거나, 잘못된 전제를 가지고 답을 내놓을 수 있습니다.

Antrophic이 말하는 효과적인 프롬프트 구성

We recommend organizing prompts into distinct sections (like <background_information>, <instructions>, ## Tool guidance, ## Output description, etc) and using techniques like XML tagging or Markdown headers to delineate these sections, although the exact formatting of prompts is likely becoming less important as models become more capable. Regardless of how you decide to structure your system prompt, you should be striving for the minimal set of information that fully outlines your expected behavior. (Note that minimal does not necessarily mean short; you still need to give the agent sufficient information up front to ensure it adheres to the desired behavior.) It’s best to start by testing a minimal prompt with the best model available to see how it performs on your task, and then add clear instructions and examples to improve performance based on failure modes found during initial testing.

저희는 프롬프트를 명확히 구분된 섹션으로 구성할 것을 권장합니다.
예를 들어 <background_information>, <instructions> , ## Tool guidance, ## Output description

같은 구분을 두고, 각 섹션을 나누기 위해 XML 태깅이나 Markdown 헤더와 같은 기법을 사용할 수 있습니다. 다만, 모델이 점점 더 정교해짐에 따라 프롬프트의 정확한 포맷 자체는 점점 덜 중요해지고 있습니다. 시스템 프롬프트를 어떤 구조로 작성하든, 반드시 기대하는 행동을 완전히 설명할 수 있는 최소한의 정보 집합을 담아야 합니다.

여기서 “최소한(minimal)”이라는 말이 꼭 “짧음”을 의미하는 것은 아닙니다. 원하는 행동을 보장하기 위해서는 여전히 충분한 정보를 제공해야 합니다. 가장 좋은 방법은 최소한의 프롬프트를 가지고 가장 성능이 좋은 모델에서 먼저 테스트해보고, 이후 초기 테스트 과정에서 발견된 실패 사례를 토대로 명확한 지침과 예시를 점차 추가해가는 것입니다.

Antrophic이 말하는 AI Agent

본문에서는 에이전트란 LLM이 도구를 자율적으로 반복(loop) 사용하면서 작동하는 것이라고 정의하고 있습니다.

In Building effective AI agents, we highlighted the differences between LLM-based workflows and agents. Since we wrote that post, we’ve gravitated towards a simple definition for agents: LLMs autonomously using tools in a loop. Working alongside our customers, we’ve seen the field converging on this simple paradigm. As the underlying models become more capable, the level of autonomy of agents can scale: smarter models allow agents to independently navigate nuanced problem spaces and recover from errors. We’re now seeing a shift in how engineers think about designing context for agents. Today, many AI-native applications employ some form of embedding-based pre-inference time retrieval to surface important context for the agent to reason over. As the field transitions to more agentic approaches, we increasingly see teams augmenting these retrieval systems with “just in time” context strategies.

효과적인 AI 에이전트 구축(Building effective AI agents) 글에서, 우리는 LLM 기반 워크플로우와 에이전트의 차이를 강조한 바 있습니다. 그 이후로, 저희는 에이전트를 다음과 같이 간단히 정의하게 되었습니다:
에이전트란 LLM이 도구를 자율적으로 반복(loop) 사용하면서 작동하는 것.

고객들과 협업하면서 관찰한 결과, 이 분야 역시 이러한 단순한 패러다임으로 수렴하고 있음을 확인했습니다. 기본 모델이 점점 더 강력해짐에 따라, 에이전트의 자율성 수준도 확장됩니다. 즉, 더 똑똑한 모델일수록 에이전트가 스스로 미묘한 문제 공간을 탐색하고 오류에서 회복할 수 있습니다.

또한 최근에는 엔지니어들이 에이전트를 위한 컨텍스트 설계 방식을 바라보는 관점에도 변화가 일어나고 있습니다. 현재 많은 AI 네이티브 애플리케이션들은 사전 추론 단계(pre-inference time) 에서 임베딩 기반 검색을 활용하여, 에이전트가 추론에 참고할 중요한 컨텍스트를 표면화(surface)하고 있습니다.

하지만 이 분야가 점차 에이전트형 접근(agentic approaches) 으로 전환됨에 따라, 팀들이 이러한 검색 시스템을 “적시(just in time)” 컨텍스트 전략으로 보강하는 사례가 점점 늘어나고 있습니다.

Rather than pre-processing all relevant data up front, agents built with the “just in time” approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools. Anthropic’s agentic coding solution Claude Code uses this approach to perform complex data analysis over large databases. The model can write targeted queries, store results, and leverage Bash commands like head and tail to analyze large volumes of data without ever loading the full data objects into context. This approach mirrors human cognition: we generally don’t memorize entire corpuses of information, but rather introduce external organization and indexing systems like file systems, inboxes, and bookmarks to retrieve relevant information on demand. Beyond storage efficiency, the metadata of these references provides a mechanism to efficiently refine behavior, whether explicitly provided or intuitive. To an agent operating in a file system, the presence of a file named test_utils.py in a tests folder implies a different purpose than a file with the same name located in src/core_logic.py. Folder hierarchies, naming conventions, and timestamps all provide important signals that help both humans and agents understand how and when to utilize information.

모든 관련 데이터를 사전에 전처리(pre-processing) 해서 넣는 대신, “적시(just in time)” 접근법으로 만들어진 에이전트는 가벼운 식별자(파일 경로, 저장된 쿼리, 웹 링크 등) 만 유지합니다. 그리고 런타임에 도구를 사용해 필요한 데이터를 동적으로 불러와 컨텍스트에 삽입합니다.

Anthropic의 에이전트형 코딩 솔루션인 Claude Code가 이 방식을 활용하는 대표적인 사례입니다. Claude Code는 대규모 데이터베이스에서 복잡한 데이터 분석을 수행할 때, 전체 데이터를 통째로 불러오지 않고, 목표 지향적 쿼리를 작성하고, 결과를 저장하며, head, tail같은 Bash 명령어를 활용해 대량의 데이터를 효율적으로 분석합니다. 이는 인간의 인지 방식과 유사합니다. 즉, 우리는 방대한 정보를 통째로 암기하지 않고, 파일 시스템, 이메일함, 북마크 같은 외부 조직화·색인 시스템을 이용해 필요할 때마다 정보를 검색해옵니다.

In certain settings, the most effective agents might employ a hybrid strategy, retrieving some data up front for speed, and pursuing further autonomous exploration at its discretion. The decision boundary for the ‘right’ level of autonomy depends on the task. Claude Code is an agent that employs this hybrid model: CLAUDE.md files are naively dropped into context up front, while primitives like glob and grep allow it to navigate its environment and retrieve files just-in-time, effectively bypassing the issues of stale indexing and complex syntax trees. The hybrid strategy might be better suited for contexts with less dynamic content, such as legal or finance work. As model capabilities improve, agentic design will trend towards letting intelligent models act intelligently, with progressively less human curation. Given the rapid pace of progress in the field, "do the simplest thing that works" will likely remain our best advice for teams building agents on top of Claude.

특정 환경에서는, 가장 효과적인 에이전트가 하이브리드 전략(hybrid strategy)을 활용할 수 있습니다. 즉, 일부 데이터는 미리 불러와 속도를 확보하고, 이후 필요에 따라 자율적으로 추가 탐색을 수행하는 방식입니다. 어떤 수준의 자율성이 ‘적절한’지는 과업의 성격에 따라 달라집니다.

예를 들어 Claude Code는 이러한 하이브리드 모델을 채택합니다. CLAUDE.md 파일은 단순하게 처음부터 컨텍스트에 넣고, glob, grep 같은 기본 도구를 활용해 환경을 탐색하며 파일을 적시에 불러와 사용합니다. 이 방식은 구식 인덱싱(stale indexing) 문제나 복잡한 구문 트리(syntax tree) 문제를 효과적으로 회피할 수 있습니다. 이 방식은 법률이나 금융처럼 콘텐츠가 비교적 덜 동적인 분야에서 특히 적합할 수 있습니다.

모델의 역량이 발전함에 따라, 에이전트 설계는 점차 지능적인 모델이 스스로 더 지능적으로 행동하도록 하여 인간의 큐레이션 개입을 점점 줄여가는 방향으로 나아갈 것입니다.

The art of compaction lies in the selection of what to keep versus what to discard, as overly aggressive compaction can result in the loss of subtle but critical context whose importance only becomes apparent later. For engineers implementing compaction systems, we recommend carefully tuning your prompt on complex agent traces. Start by maximizing recall to ensure your compaction prompt captures every relevant piece of information from the trace, then iterate to improve precision by eliminating superfluous content. An example of low-hanging superfluous content is clearing tool calls and results – once a tool has been called deep in the message history, why would the agent need to see the raw result again? One of the safest lightest touch forms of compaction is tool result clearing, most recently launched as a feature on the Claude Developer

압축의 핵심은 무엇을 남기고 무엇을 버릴지 선택하는 과정에 있습니다. 압축을 지나치게 공격적으로 하면, 나중에야 중요성이 드러나는 미묘하지만 중요한 컨텍스트가 손실될 수 있습니다.

따라서 압축 시스템을 구현하는 엔지니어들에게는, 복잡한 에이전트 실행 기록(agent trace) 에 대해 프롬프트를 신중히 조정할 것을 권장합니다.

먼저 재현율(recall) 을 최대화하여, 압축 프롬프트가 실행 기록의 모든 관련 정보를 포착하도록 합니다.

이후 반복(iteration)을 통해 정밀도(precision) 를 높여, 불필요한 내용을 제거해 나갑니다.

불필요한 내용을 쉽게 제거할 수 있는 예시로는 도구 호출 및 그 결과(tool calls and results) 가 있습니다. 특정 도구가 대화 기록 깊은 곳에서 이미 호출되었다면, 그 원시 결과(raw result)를 에이전트가 다시 볼 필요는 없습니다.

가장 안전하면서도 부담이 적은 압축 방법 중 하나가 바로 도구 결과 정리(tool result clearing) 입니다. 이 기능은 최근 Claude Developer Platform에 새롭게 추가된 기능이기도 합니다.

Structured note-taking, or agentic memory, is a technique where the agent regularly writes notes persisted to memory outside of the context window. These notes get pulled back into the context window at later times. This strategy provides persistent memory with minimal overhead. Like Claude Code creating a to-do list, or your custom agent maintaining a NOTES.md file, this simple pattern allows the agent to track progress across complex tasks, maintaining critical context and dependencies that would otherwise be lost across dozens of tool calls. Claude playing Pokémon demonstrates how memory transforms agent capabilities in non-coding domains. The agent maintains precise tallies across thousands of game steps—tracking objectives like "for the last 1,234 steps I've been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10." Without any prompting about memory structure, it develops maps of explored regions, remembers which key achievements it has unlocked, and maintains strategic notes of combat strategies that help it learn which attacks work best against different opponents.

구조화된 노트 필기(Structured note-taking) 또는 에이전트 메모리(agentic memory) 는 에이전트가 정기적으로 노트를 작성해, 컨텍스트 윈도우 바깥의 메모리에 저장하는 기법을 말합니다. 이렇게 저장된 노트는 이후 필요할 때 다시 컨텍스트 윈도우로 불러와 활용됩니다.

이 전략은 최소한의 오버헤드로 지속적인 메모리(persistent memory)를 제공합니다.
예를 들어, Claude Code가 할 일 목록(to-do list) 을 작성하거나, 맞춤형 에이전트가 NOTES.md 파일을 유지하는 것처럼, 단순한 패턴을 통해 에이전트는 복잡한 작업에서도 진행 상황을 추적할 수 있습니다. 이를 통해 원래라면 수십 번의 도구 호출 과정에서 잃어버렸을 핵심 컨텍스트와 의존성을 보존할 수 있습니다.

또 다른 예로 Claude가 포켓몬 게임을 플레이하는 사례가 있습니다. 여기서 메모리는 비코딩 분야에서도 에이전트의 역량을 크게 확장합니다.

에이전트는 수천 번의 게임 진행 단계에서 정확한 집계를 유지합니다. 예: “지난 1,234번의 스텝 동안 Route 1에서 포켓몬을 훈련했고, 피카츄는 목표 10레벨 중 8레벨을 올렸다.”

메모리 구조에 대한 별도의 지시 없이도, 에이전트는 탐험한 지역의 지도를 만들고, 해금한 주요 업적을 기억하며, 전투 전략 노트를 유지합니다.

이를 통해 어떤 공격이 어떤 상대에게 효과적인지 학습하고 전략적으로 행동할 수 있게 됩니다.

Context engineering represents a fundamental shift in how we build with LLMs. As models become more capable, the challenge isn't just crafting the perfect prompt—it's thoughtfully curating what information enters the model's limited attention budget at each step. Whether you're implementing compaction for long-horizon tasks, designing token-efficient tools, or enabling agents to explore their environment just-in-time, the guiding principle remains the same: find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome. The techniques we've outlined will continue evolving as models improve. We're already seeing that smarter models require less prescriptive engineering, allowing agents to operate with more autonomy. But even as capabilities scale, treating context as a precious, finite resource will remain central to building reliable, effective agents.

컨텍스트 엔지니어링(context engineering) 은 LLM을 활용하는 방식에서의 근본적인 전환을 의미합니다. 모델이 점점 더 강력해질수록, 도전 과제는 단순히 “완벽한 프롬프트를 작성하는 것”이 아니라, 매 단계에서 모델의 제한된 주의(attention) 예산 안에 어떤 정보를 넣을지 신중하게 큐레이션하는 것이 됩니다.

장기 작업을 위한 압축(compaction), 토큰 효율적인 도구 설계, 또는 적시(just-in-time) 방식으로 에이전트가 환경을 탐색하도록 하는 것이든, 모든 전략의 핵심 원칙은 동일합니다:
원하는 결과를 얻을 가능성을 극대화할 수 있는 최소한의 고신호(high-signal) 토큰 집합을 찾는 것.

우리가 설명한 기법들은 모델이 발전함에 따라 계속 진화할 것입니다. 이미 더 똑똑한 모델들은 세세한 엔지니어링에 덜 의존하면서도 더 큰 자율성으로 동작할 수 있게 해주고 있습니다. 그러나 모델의 성능이 아무리 확장되더라도, 컨텍스트를 소중하고 유한한 자원으로 다루는 것은 신뢰할 수 있고 효과적인 에이전트를 구축하는 데 여전히 핵심으로 남을 것입니다.

keyword

스타트업

지은 직업 기획자

Product Manager를 메인으로, 제품을 성장시키는 모든 일을 하고 있습니다.

구독자 53

작가의 이전글Veo 3가 유튜브에 탑재된다.해커톤 평가의 비효율성, 왜 AI로 해결해야 했을까?작가의 다음글