AgentDiet: trimming the fat from LLM agent trajectories
The cost of running an LLM agent is dominated not by what the model generates, but by what it has to read. Every tool call appends new content to the conversation history, and by the time a coding agent finishes a multi-step task it may have consumed over a million tokens — most of them repetitive noise from steps that are long past.
AgentDiet (Xiao et al. 2026) addresses this with a straightforward idea: attach a cheaper reflection LLM to the agent loop that compresses recent steps before their cost compounds across the rest of the trajectory. The paper reports 39.9–59.7% fewer input tokens with no measurable drop in task success.
1 The problem
A typical coding agent runs in a tight loop: read the full trajectory, predict the next action, execute it, append the result, repeat. The trajectory starts small — system prompt, task description — and grows with every step. Xiao et al. sampled 100 real trajectories from Trae Agent¹ and found that the average trajectory consumed 48.4K tokens across 40 steps, with tool results accounting for 63% of that (Xiao et al. 2026). Because every token in step k is re-read at every subsequent step, the accumulated token usage per issue reached 1.0M tokens.
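That 1.0M figure is just arithmetic on the re-reading. If the trajectory grows roughly linearly to its final length, step k re-reads a prefix of about k/n of it, and those prefixes sum to about half of n times the final length. A quick sanity check (the linear-growth assumption is mine, not the paper's):

```python
# Back-of-envelope: accumulated input when every step re-reads the
# trajectory so far, assuming roughly linear growth to the final length.
final_tokens = 48_400  # average final trajectory length (from the paper)
n_steps = 40           # average number of steps

accumulated = sum(final_tokens * k / n_steps for k in range(1, n_steps + 1))
print(f"{accumulated / 1e6:.2f}M tokens")  # ~0.99M, matching the ~1.0M reported
```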
Manually inspecting those trajectories revealed three recurring categories of waste:
| Type | Example |
|---|---|
| Useless | `__pycache__/`, `.git/`, `.venv/` listings from an early `ls` |
| Redundant | `str_replace_editor` echoes back the full surrounding code on every edit |
| Expired | 29 passing test names in a `pytest` run — only the one failure matters |
These are not edge cases. They appear in almost every trajectory, and removing them before they accumulate is the core idea behind AgentDiet.
2 A sliding-window reflection module
Rather than modifying the agent, AgentDiet introduces a separate reflection module — a second, cheaper LLM that processes one past step at a time and compresses it.
The key question is when to compress. Too early and the reflector discards context the agent still needs; too late and the step has already been re-read many times. The paper settles on a sliding window governed by three hyperparameters:
- a — lag: compress step s - a when the agent reaches step s
- b — context width: give the reflector b surrounding steps to reason with
- θ — threshold: skip the reflection call if the step is shorter than θ tokens, and discard the result if the saving is smaller than θ
Algorithm 1 from the paper adds only one block to a standard agent loop:
```
for s in 1 .. s_max:
    m_assis ← LLM_agent(T)                  # normal agent step
    if done: return
    E, m_tool ← ExecTool(m_assis)
    T[s] ← (m_assis, m_tool)                # append to trajectory

    # ── AgentDiet reflection ─────────────────────────────────────
    if s > a:
        target = T[s - a]
        if length(target) > θ:
            ctx = T[max(0, s-a-b) : s]      # sliding window
            reduced = LLM_reflect(ctx, target)
            if length(target) - length(reduced) > θ:
                T[s - a] ← reduced          # apply if saving is real
    # ─────────────────────────────────────────────────────────────
```
The agent's own LLM never sees the compression happen; it simply reads a trajectory that has quietly become shorter.
The paper also tried giving the agent an erase tool it could call when it judged a step to be stale. That did not work: models like Claude 4 Sonnet have memorised the standard procedure for program repair and press on with the task even when explicitly instructed to erase instead. Moving the reduction to a separate module, invisible to the agent, sidesteps this entirely.
3 The reflection prompt
The reflector receives the target step wrapped in XML tags along with its sliding-window context, and must output only the compressed version. The prompt has four parts: a high-level job description, the expected input/output format, the three kinds of waste to remove, and a set of loss-prevention rules.
The rules are straightforward: keep all error messages and tracebacks in full, keep the final test-run summary line, never remove the step's tool call or its key conclusion, and if nothing should be removed, output the step unchanged. The paper's Figure 2 shows a concrete case: step 19 of a `pytest` run containing the full list of 29 passing tests plus one failure. AgentDiet compressed it from 1,995 tokens to 259 tokens, an 87% reduction, by replacing the passing list with `[individual test lines omitted; mostly PASSED]` while keeping the failure and the summary line intact.
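A minimal sketch of that four-part prompt, with illustrative wording rather than the paper's actual text:

```python
# Sketch of the reflection prompt's four-part structure; the wording is mine.
REFLECT_PROMPT = """\
You compress one step of an agent trajectory. You receive the target step
wrapped in <step>...</step> tags, plus a few surrounding steps as context.
Output ONLY the compressed target step, in the same format.

Remove three kinds of waste:
1. Useless: boilerplate such as __pycache__/, .git/, .venv/ listings.
2. Redundant: code an editor echoed back that already appears elsewhere.
3. Expired: long success output where only the failure still matters.

Loss-prevention rules:
- Keep all error messages and tracebacks in full.
- Keep the final test-run summary line.
- Never remove the step's tool call or its key conclusion.
- If nothing should be removed, output the step unchanged.
"""
```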
4 Implementation
The AgentDiet class wraps any agent loop and supports Anthropic, Vertex AI, and any OpenAI-compatible backend for the agent. The reflection model always uses the OpenAI-compatible path, so you can point it at a cheap API or a local Ollama model.
Each step is serialised into an XML envelope that the reflector reads and rewrites:
```python
from dataclasses import dataclass, field


@dataclass
class Step:
    step_id: int
    assistant_content: str  # LLM reasoning + tool call
    tool_result: str        # environment response

    def serialize(self) -> str:
        return (
            f'<step id="{self.step_id}">\n'
            f"{self.assistant_content}\n"
            f"<result>{self.tool_result}</result>\n"
            f"</step>"
        )

    def token_estimate(self) -> int:
        # Rough whitespace-token count: cheap, and only used for thresholding.
        return len(self.serialize().split())
```
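A quick look at the envelope this produces, with illustrative values:

```python
step = Step(
    step_id=19,
    assistant_content="<tool_call>pytest tests/ -v</tool_call>",
    tool_result="29 passed, 1 failed",
)
print(step.serialize())
# <step id="19">
# <tool_call>pytest tests/ -v</tool_call>
# <result>29 passed, 1 failed</result>
# </step>
print(step.token_estimate())  # 10
```

`Trajectory` keeps both the live (compressed) steps and a snapshot of the originals, so savings can be measured precisely: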
```python
@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    original_steps: list[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)
        # Frozen snapshot: compression replaces entries in `steps` with
        # new Step objects, so the copies kept here stay untouched.
        self.original_steps.append(step)
```

`_maybe_compress` is called after every agent step. The double-threshold check ensures the reflection module pays for itself before it is invoked, and only applies a result if the saving is real:
```python
def _maybe_compress(self, trajectory: Trajectory, s: int) -> None:
    if s <= self.a:
        return
    target_idx = s - self.a - 1  # step s-a, 0-indexed
    target_step = trajectory.steps[target_idx]

    l_orig = target_step.token_estimate()
    if l_orig <= self.theta:
        return  # step is short; reflection overhead not worth it

    ctx_before = trajectory.steps[max(0, target_idx - self.b): target_idx]
    ctx_after = trajectory.steps[target_idx + 1: s]
    context = ctx_before + ctx_after

    reduced_step = self._compress_step(target_step, context)
    l_reduced = reduced_step.token_estimate()
    if l_orig - l_reduced > self.theta:
        trajectory.steps[target_idx] = reduced_step  # apply only if saving > θ
```
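`_compress_step` does the actual reflection call. The excerpt above doesn't show it, so here is a minimal sketch, assuming an OpenAI-compatible client stored on `self.reflect_client` and the `REFLECT_PROMPT` sketched in section 3; the real method may differ:

```python
import re

def _compress_step(self, target: Step, context: list[Step]) -> Step:
    # Sketch only. Sends the sliding-window context plus the target step
    # to the reflection model and parses the rewritten <result> back out.
    ctx_text = "\n".join(s.serialize() for s in context)
    resp = self.reflect_client.chat.completions.create(
        model=self.reflect_cfg.model,
        messages=[
            {"role": "system", "content": REFLECT_PROMPT},
            {"role": "user",
             "content": f"Context:\n{ctx_text}\n\nTarget:\n{target.serialize()}"},
        ],
    )
    reply = resp.choices[0].message.content
    m = re.search(r"<result>(.*)</result>", reply, re.DOTALL)
    if m is None:
        return target  # malformed reply: keep the step as-is (fail-safe)
    return Step(target.step_id, target.assistant_content, m.group(1))
```

The minimal setup with a local Ollama model for reflection: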
```python
from agent_diet import AgentDiet, ModelConfig, ToolDef

agent_cfg = ModelConfig(
    model="claude-sonnet-4-6",
    provider="anthropic",
)
reflect_cfg = ModelConfig(
    model="qwen2.5:7b",
    base_url="http://localhost:11434/v1",
)

agent = AgentDiet(
    agent_cfg=agent_cfg,
    reflect_cfg=reflect_cfg,
    tools=my_tools,
    system=my_system_prompt,
    a=2,        # compress step s-2 when at step s
    b=1,        # one step of context for the reflector
    theta=500,  # skip if step < 500 tokens or saving < 500 tokens
    s_max=50,
)

trajectory = agent.run(task, exec_tool_fn)
```
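Because the trajectory keeps frozen originals, measuring the saving after a run is direct. Note this compares final trajectory sizes, not the accumulated re-read cost:

```python
live = sum(s.token_estimate() for s in trajectory.steps)
orig = sum(s.token_estimate() for s in trajectory.original_steps)
print(f"final trajectory: {live} vs {orig} tokens ({1 - live / orig:.0%} saved)")
```

5 A small experiment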
To validate the implementation I ran a mini-eval with three simple bug-fixing scenarios against a fake file system:
- multiply — test expects `multiply(2,3) == 7`; should be `6`
- add — test expects `add(1,2) == 4`; should be `3`
- power — test expects `power(2,3) == 9.0`; should be `8.0`
Each scenario runs twice: once with AgentDiet active (a=2, b=1, θ=50) and once with compression disabled (θ=∞). The agent is Claude Sonnet 4.6; the reflector is Claude Haiku 4.5.
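Disabling compression needs no second code path: with θ set to infinity, no step ever passes the first length gate in `_maybe_compress`, so the reflector is never called. A sketch of the baseline configuration, reusing the configs from section 4:

```python
# Baseline: θ = ∞ means `l_orig <= self.theta` always holds, so
# _maybe_compress returns early and the trajectory stays uncompressed.
baseline = AgentDiet(
    agent_cfg=agent_cfg,
    reflect_cfg=reflect_cfg,
    tools=my_tools,
    system=my_system_prompt,
    a=2, b=1,
    theta=float("inf"),
    s_max=50,
)
```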
All three tasks resolved correctly in both modes. Token savings ranged from 31% to 35%, consistent with the paper’s numbers on shorter, lower-step tasks.
The per-step view shows where the savings come from. [Figure: per-step token usage, omitted.]
6 Results from the paper
The paper ran AgentDiet on two benchmarks with two agent LLMs each: SWE-bench Verified (200 Python GitHub issues) and Multi-SWE-bench Flash (300 issues across seven languages). The reflection model was fixed as GPT-5 mini with a=2, b=1, θ=500.
AgentDiet removes 40–60% of input tokens across all four configurations. When the reflection module’s own cost is included — roughly 5–15% overhead — the net computational cost reduction is 21–36%. Pass rates stay within ±2 percentage points of the original agent in every configuration. In the Gemini 2.5 Pro + Multi-SWE-bench Flash case the pass rate actually improved by one point, because shorter trajectories prevented the model from hitting context-length-induced instability: instances reaching the 100-step limit dropped from 66 to 26.
Multi-SWE-bench Flash covers seven languages, and the savings are consistent across them. [Figure: per-language token savings, omitted.]
The pass-rate deltas scatter around zero. A few languages improve slightly; a few dip by a point or two. Nothing systematic. The efficiency gains hold across the board.
7 Why it works
Several properties make AgentDiet robust despite its simplicity.
The agent never sees the compression. The reflection module modifies stored steps, not live messages — the agent keeps its full working memory at each step and only encounters shorter histories on subsequent reads.
The lag of a = 2 is a safety cushion: the two most recent steps are never touched, so the reflector cannot clobber information the agent is actively using.
The double gating with θ bounds the overhead from the reflection model to roughly 5–10% of total cost: reflection is skipped entirely for short steps, and its result is discarded if the saving is small. A badly compressed step is simply kept as-is.
There is also some evidence that long contexts actively hurt LLM performance rather than just slowing it down. Removing waste may help the agent reason better, which would explain why Gemini 2.5 Pro improved on both pass rate and step count on the harder benchmark.
Finally, the reflection model does not need to be particularly capable — it just needs to follow four rules and recognise boilerplate. Small local models work. The double threshold handles the cases where they do not.
8 Conclusions
AgentDiet is a reminder that inference-time efficiency for agents is largely unexplored territory. The key insight is that trajectory waste is not a side-effect of agent behaviour — it is structural. Tool outputs like directory listings, test runs, and file contents carry a large fixed overhead that multiplies with trajectory length. Compressing them once, early, before they compound, is cheap and effective.
The implementation is a thin wrapper around any existing agent loop — around 250 lines of Python — the hyperparameter search is small, and the prompting needed to get the reflector to behave correctly was modest. For a 21–36% cost reduction with no performance loss, that is a reasonable trade.
The artefact repository is at https://doi.org/10.6084/m9.figshare.30073654.
Footnotes
1. Ranked first on SWE-bench Verified at the time of writing.