AgentDiet: trimming the fat from LLM agent trajectories

LLM · Agents · Efficiency · Python
LLM agents spend most of their token budget re-reading stale context from previous steps. AgentDiet fixes that with a cheap reflection module that quietly compresses past steps while the agent works.
Published May 1, 2026

The cost of running an LLM agent is dominated not by what the model generates, but by what it has to read. Every tool call appends new content to the conversation history, and by the time a coding agent finishes a multi-step task it may have consumed over a million tokens — most of them repetitive noise from steps that are long past.

AgentDiet (Xiao et al. 2026) addresses this with a straightforward idea: attach a cheaper reflection LLM to the agent loop that compresses recent steps before they compound further into the trajectory. The paper reports 39.9–59.7% fewer input tokens with no measurable drop in task success.

1 The problem

A typical coding agent runs in a tight loop: read the full trajectory, predict the next action, execute it, append the result, repeat. The trajectory starts small — system prompt, task description — and grows with every step. Xiao et al. sampled 100 real trajectories from Trae Agent¹ and found that the average trajectory consumed 48.4K tokens across 40 steps, with tool results accounting for 63% of that (Xiao et al. 2026). Because every token in step k is re-read at every subsequent step, the accumulated token usage per issue reached 1.0M tokens.
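Those two numbers are consistent with simple quadratic accumulation. Assuming, as a simplification, that the 48.4K tokens are spread evenly over the 40 steps, every step's prefix is re-read at each later step, so the cumulative input is the sum of all prefix lengths:

# back-of-envelope check: cumulative input = sum of prefix lengths
steps = 40
tokens_per_step = 48_400 / steps                  # ~1,210 tokens appended per step
total_read = sum(s * tokens_per_step for s in range(1, steps + 1))
print(f"{total_read / 1e6:.1f}M tokens read")     # ~1.0M, matching the paper's figure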

Manually inspecting those trajectories revealed three recurring categories of waste:

Type       Example
Useless    __pycache__/, .git/, .venv/ listings from an early ls
Redundant  str_replace_editor echoes back the full surrounding code on every edit
Expired    29 passing test names in a pytest run; only the one failure matters

These are not edge cases. They appear in almost every trajectory, and removing them before they accumulate is the core idea behind AgentDiet.

2 A sliding-window reflection module

Rather than modifying the agent, AgentDiet introduces a separate reflection module — a second, cheaper LLM that processes one past step at a time and compresses it.

The key question is when to compress. Too early and the reflector discards context the agent still needs; too late and the step has already been re-read many times. The paper settles on a sliding window governed by three hyperparameters:

  • a (lag): compress step s - a when the agent reaches step s
  • b (context width): give the reflector b surrounding steps to reason with
  • θ (threshold): skip the reflection call if the step is shorter than θ tokens, or discard the result if the saving is smaller than θ

Algorithm 1 from the paper adds only one block to a standard agent loop:

for s in 1 .. s_max:
    m_assis ← LLM_agent(T)             # normal agent step
    if done: return
    E, m_tool ← ExecTool(m_assis)
    T[s] ← (m_assis, m_tool)           # append to trajectory

    # ── AgentDiet reflection ─────────────────────────────────────
    if s > a:
        target = T[s - a]
        if length(target) > θ:
            ctx = T[max(0, s-a-b) : s]         # sliding window
            reduced = LLM_reflect(ctx, target)
            if length(target) - length(reduced) > θ:
                T[s - a] ← reduced             # apply if saving is real
    # ─────────────────────────────────────────────────────────────

The agent’s own LLM never sees the compression — it only ever reads the silently shorter trajectory.

The paper also tried giving the agent an erase tool it could call when it judged a step to be stale. That did not work. Models like Claude 4 Sonnet have memorised the standard procedure for program repair and continue the task even when explicitly instructed to erase instead. Moving the reduction to a separate module, invisible to the agent, sidesteps this entirely.

3 The reflection prompt

The reflector receives the target step wrapped in XML tags along with its sliding-window context, and must output only the compressed version. The prompt has four parts: a high-level job description, the expected input/output format, the three kinds of waste to remove, and a set of loss-prevention rules.

The rules are straightforward — keep all error messages and tracebacks in full, keep the final test-run summary line, never remove the step’s tool call or its key conclusion, and if nothing should be removed, output the step unchanged. The paper’s Figure 2 shows a concrete case: step 19 of a pytest run containing the full list of 29 passing tests plus one failure. AgentDiet compressed it from 1,995 tokens to 259 tokens — 87% — by replacing the passing list with [individual test lines omitted; mostly PASSED] while keeping the failure and the summary line intact.
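The paper's exact wording is not reproduced here, but a system prompt in that four-part shape might look like the following (paraphrased sketch, not the paper's artefact):

REFLECT_SYSTEM_PROMPT = """\
You compress one step of an agent trajectory.

Input: several <step> blocks for context, plus one target step to rewrite.
Output: the target step only, in the same XML format, with waste removed.

Remove three kinds of waste from tool results:
1. Useless content (e.g. __pycache__/, .git/, .venv/ directory listings).
2. Redundant content (e.g. unchanged code the editor echoed back).
3. Expired content (e.g. names of passing tests once the run is summarised).

Loss-prevention rules:
- Keep all error messages and tracebacks in full.
- Keep the final test-run summary line.
- Never remove the step's tool call or its key conclusion.
- If nothing should be removed, output the step unchanged.
"""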

4 Implementation

The AgentDiet class wraps any agent loop and supports Anthropic, Vertex AI, and any OpenAI-compatible backend for the agent. The reflection model always uses the OpenAI-compatible path, so you can point it at a cheap API or a local Ollama model.

Each step is serialised into an XML envelope that the reflector reads and rewrites:

from dataclasses import dataclass

@dataclass
class Step:
    step_id: int
    assistant_content: str   # LLM reasoning + tool call
    tool_result: str         # environment response

    def serialize(self) -> str:
        return (
            f'<step id="{self.step_id}">\n'
            f"{self.assistant_content}\n"
            f"<result>{self.tool_result}</result>\n"
            f"</step>"
        )

    def token_estimate(self) -> int:
        # rough whitespace count; θ comparisons only need to be roughly right
        return len(self.serialize().split())
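For the pytest example from the previous section, the envelope of step 19 looks like this (contents abbreviated):

print(Step(19, "Run the test suite", "1 failed, 29 passed").serialize())
# <step id="19">
# Run the test suite
# <result>1 failed, 29 passed</result>
# </step>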

Trajectory keeps both the live (compressed) steps and a snapshot of the originals, so savings can be measured precisely:

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    original_steps: list[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)
        self.original_steps.append(step)   # compression replaces, not mutates

    def __getitem__(self, i: int) -> Step:
        return self.steps[i]               # live (compressed) view

    def __setitem__(self, i: int, step: Step) -> None:
        self.steps[i] = step
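Because the originals are kept, reporting the realised saving is trivial. A small helper along these lines (my addition, not part of the paper's listing) suffices:

def tokens_saved(trajectory: Trajectory) -> int:
    # estimated tokens removed so far: originals minus live (compressed) steps
    return sum(
        orig.token_estimate() - live.token_estimate()
        for orig, live in zip(trajectory.original_steps, trajectory.steps)
    )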

_maybe_compress is called after every agent step. The double-threshold check ensures the reflection call is only made when it can plausibly pay for itself, and its result is only applied if the saving is real:

def _maybe_compress(self, trajectory: Trajectory, s: int) -> None:
    if s <= self.a:
        return                  # nothing old enough to compress yet

    target_idx  = s - self.a - 1           # step s - a, zero-indexed
    target_step = trajectory[target_idx]
    l_orig      = target_step.token_estimate()

    if l_orig <= self.theta:
        return                  # step is short; reflection overhead not worth it

    ctx_before = trajectory.steps[max(0, target_idx - self.b): target_idx]
    ctx_after  = trajectory.steps[target_idx + 1: s]
    context    = ctx_before + ctx_after

    reduced_step = self._compress_step(target_step, context)
    l_reduced    = reduced_step.token_estimate()

    if l_orig - l_reduced > self.theta:
        trajectory[target_idx] = reduced_step   # apply only if saving > θ
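_compress_step itself is not shown above. A minimal sketch, assuming an OpenAI-compatible client for the reflector and (as a simplification) treating the whole response as the rewritten tool result, might look like this:

from openai import OpenAI

def compress_step(client: OpenAI, model: str,
                  target: Step, context: list[Step]) -> Step:
    ctx_xml = "\n".join(s.serialize() for s in context)
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            # prompt sketch from section 3
            {"role": "system", "content": REFLECT_SYSTEM_PROMPT},
            {"role": "user",
             "content": f"<context>\n{ctx_xml}\n</context>\n\n{target.serialize()}"},
        ],
    )
    reduced = (resp.choices[0].message.content or "").strip()
    if not reduced:
        return target   # reflector returned nothing; keep the original
    # a real implementation would parse the returned <step> XML back into a
    # Step; here we simply substitute the response as the tool result
    return Step(target.step_id, target.assistant_content, reduced)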

The minimal setup with a local Ollama model for reflection:

from agent_diet import AgentDiet, ModelConfig, ToolDef

agent_cfg = ModelConfig(
    model="claude-sonnet-4-6",
    provider="anthropic",
)
reflect_cfg = ModelConfig(
    model="qwen2.5:7b",
    base_url="http://localhost:11434/v1",
)

agent = AgentDiet(
    agent_cfg=agent_cfg,
    reflect_cfg=reflect_cfg,
    tools=my_tools,
    system=my_system_prompt,
    a=2,        # compress step s-2 when at step s
    b=1,        # one step of context for the reflector
    theta=500,  # skip if step < 500 tokens or saving < 500 tokens
    s_max=50,
)

trajectory = agent.run(task, exec_tool_fn)

5 A small experiment

To validate the implementation I ran a mini-eval with three simple bug-fixing scenarios against a fake file system:

  • multiply: the buggy multiply(2, 3) returns 7 instead of 6
  • add: the buggy add(1, 2) returns 4 instead of 3
  • power: the buggy power(2, 3) returns 9.0 instead of 8.0
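The harness details are not important, but for concreteness a scenario in this style reduces to a buggy file plus a failing test in the fake file system; something like (layout is my own, hypothetical):

fake_fs = {
    "calc.py": (
        "def multiply(a, b):\n"
        "    return a * b + 1   # bug: off by one\n"
    ),
    "tests/test_calc.py": (
        "from calc import multiply\n\n"
        "def test_multiply():\n"
        "    assert multiply(2, 3) == 6\n"
    ),
}
# The agent sees the failing assertion (got 7, expected 6), edits calc.py,
# and re-runs the tests until the suite passes.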

Each scenario runs twice: once with AgentDiet active (a = 2, b = 1, θ = 50) and once with compression disabled (θ = ∞). The agent is Claude Sonnet 4.6; the reflector is Claude Haiku 4.5.

[Figure: mini-eval results for the three bug-fixing scenarios, compressed vs baseline. All tasks succeeded in both modes.]

All three tasks resolved correctly in both modes. Token savings ranged from 31% to 35%, consistent with the paper's numbers for shorter tasks with fewer steps.

The per-step view shows where the savings come from:

[Figure: per-step token breakdown for the ‘multiply’ scenario. Stars mark steps the reflector compressed; the gap between the red and blue cumulative lines is the running saving.]

6 Results from the paper

The paper ran AgentDiet on two benchmarks with two agent LLMs each: SWE-bench Verified (200 Python GitHub issues) and Multi-SWE-bench Flash (300 issues across seven languages). The reflection model was fixed as GPT-5 mini with a = 2, b = 1, θ = 500.

[Figure: paper results (Table 4). Left: percentage of input tokens removed. Right: pass rate for AgentDiet vs the unmodified original agent; labels show the percentage-point difference.]

AgentDiet removes 40–60% of input tokens across all four configurations. When the reflection module’s own cost is included — roughly 5–15% overhead — the net computational cost reduction is 21–36%. Pass rates stay within ±2 percentage points of the original agent in every configuration. In the Gemini 2.5 Pro + Multi-SWE-bench Flash case the pass rate actually improved by one point, because shorter trajectories prevented the model from hitting context-length-induced instability: instances reaching the 100-step limit dropped from 66 to 26.

Multi-SWE-bench Flash covers seven languages, and the savings are consistent across them:

[Figure: per-language breakdown on Multi-SWE-bench Flash (Table 5). Left: input tokens removed. Right: pass rate delta relative to the original agent (positive = improvement).]

The pass-rate deltas scatter around zero. A few languages improve slightly; a few dip by a point or two. Nothing systematic. The efficiency gains hold across the board.

7 Why it works

Several properties make AgentDiet robust despite its simplicity.

The agent never sees the compression. The reflection module modifies stored steps, not live messages — the agent keeps its full working memory at each step and only encounters shorter histories on subsequent reads.

The lag of a = 2 is a safety cushion: the two most recent steps are never touched, so the reflector cannot clobber information the agent is actively using.

The double gating with θ bounds the overhead from the reflection model to roughly 5–10% of total cost: reflection is skipped entirely for short steps, and its result is discarded if the saving is small. A badly compressed step is simply kept as-is.

There is also some evidence that long contexts actively hurt LLM performance rather than just slowing it down. Removing waste may help the agent reason better, which would explain why Gemini 2.5 Pro improved on both pass rate and step count on the harder benchmark.

Finally, the reflection model does not need to be particularly capable — it just needs to follow four rules and recognise boilerplate. Small local models work. The double threshold handles the cases where they do not.

8 Conclusions

AgentDiet is a reminder that inference-time efficiency for agents is largely unexplored territory. The key insight is that trajectory waste is not a side-effect of agent behaviour — it is structural. Tool outputs like directory listings, test runs, and file contents carry a large fixed overhead that multiplies with trajectory length. Compressing them once, early, before they compound, is cheap and effective.

The implementation is a thin wrapper around any existing agent loop — around 250 lines of Python — the hyperparameter search is small, and the prompting needed to get the reflector to behave correctly was modest. For a 21–36% cost reduction with no performance loss, that is a reasonable trade.

The artefact repository is at https://doi.org/10.6084/m9.figshare.30073654.

References

Xiao, Yuan-An, Pengfei Gao, Chao Peng, and Yingfei Xiong. 2026. “Reducing Cost of LLM Agents with Trajectory Reduction.” Proceedings of the ACM on Software Engineering 3 (FSE). https://doi.org/10.1145/3729355.

Footnotes

  1. Ranked first on SWE-bench Verified at the time of writing.