CodeAct 2026: When AI Agents Write Code Instead of Calling JSON Tools

Posted on: 5/14/2026 10:31:20 AM

In 2024, most AI Agents communicated with tools through a familiar format: JSON tool calls. At each step, the LLM emits a JSON block invoking exactly one function — clean, parseable, validatable. But by late 2025 and across 2026, a new wave is upending this pattern: CodeAct — agents write Python code directly (instead of JSON), execute it in a sandbox, and use the result for the next step. Hugging Face's Smolagents defaults to CodeAct. Manus AI's breakout in early 2026 also bet on "code as the universal action space". The original CodeAct paper by Wang et al. (ICML 2024) measured striking gains: +20% success rate, -30% steps. Why did such a seemingly small format change reshape the industry?

+20%: Success rate vs JSON tool calls (CodeAct paper)
-30%: Number of steps (= 30% fewer LLM calls)
7+: Sandbox runtimes Smolagents supports (E2B, Modal, Pyodide...)
2024: Year CodeAct was published — mainstream by 2026

1. What is CodeAct — and how is it different from JSON tool calls?

CodeAct (short for Code as Action) is a paradigm where each "action" of an AI Agent is not a JSON object calling a single tool, but an executable code snippet (typically Python) — where tools are exposed as functions, and the LLM is free to compose them with variables, loops, and conditionals.

A concrete example. Suppose an agent must "find the 5 largest cities in Vietnam, get the population of each, then sum them up".

Old way — JSON Tool Call (ReAct pattern)

Step 1: LLM emits {"tool": "search_cities", "args": {"country": "VN", "limit": 5}} → runtime executes → returns 5 cities.
Step 2: LLM emits {"tool": "get_population", "args": {"city": "Ho Chi Minh"}} → returns 9.3M.
Steps 3–6: Repeat 4 more times for the remaining cities.
Step 7: LLM mentally sums the numbers in context → final answer.
Total: 7 steps, 7 LLM calls, error-prone arithmetic.

New way — CodeAct

Step 1: LLM emits a code block:

cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]
total = sum(populations)
print(f"Total: {total:,}")

The sandbox runs it → prints Total: 22,320,000 → agent reads and replies. Total: 2 steps, 2 LLM calls, math handled by Python — no errors.

2. A short history of CodeAct

2022 — ReAct (Yao et al., Princeton + Google)
The classic paper that defined the Thought → Action → Observation loop. Actions were initially free-form text (search query, click). Foundation for every later agent framework, but format-agnostic.
06/2023 — OpenAI Function Calling
OpenAI released function calling: LLMs were fine-tuned to emit valid JSON matching a schema. The industry standardized immediately — Anthropic, Google, Mistral all copied. One LLM call = one tool call.
02/2024 — CodeAct paper (Wang et al., UIUC)
"Executable Code Actions Elicit Better LLM Agents" — proved across 17 LLMs that emitting code instead of JSON raises success rate by up to 20% and cuts step count by 30% on API-Bank.
12/2024 — Hugging Face Smolagents
HF released an agent library that is code-first by default. CodeAgent is the primary class; ToolCallingAgent is just an alternative. The first mainstream signal.
2025 — Manus AI, OpenHands
Manus AI took off as a general-purpose agent using code execution as its sole action space. OpenHands (formerly OpenDevin) also moved to CodeAct.
2026 — Code execution becomes industry standard
Anthropic's bash + code execution tool, OpenAI Code Interpreter as a built-in, Apple ML Research endorsing CodeAct as the most effective action format. JSON tool calls remain but are pushed toward simple cases or smaller models that can't yet write code well.

3. CodeAct architecture in production

A standard CodeAct system has 5 components — notably, the sandbox is now mandatory, not optional.

graph TB
    subgraph User["User"]
        Q["Question / Task"]
    end
    subgraph Agent["Agent Loop"]
        LLM["LLM (planner + code writer)"]
        PARSE["Code Parser / Validator"]
    end
    subgraph Sandbox["Sandbox Runtime (CRITICAL)"]
        EXE["Python Interpreter (E2B / Pyodide / Docker / Modal)"]
        TOOL["Tools as Python functions"]
    end
    subgraph State["State Management"]
        VAR["Variables (persisted across turns)"]
        OUT["stdout / stderr"]
    end
    Q --> LLM
    LLM -->|"Code block"| PARSE
    PARSE -->|"AST validated"| EXE
    EXE <--> TOOL
    EXE --> VAR
    EXE --> OUT
    OUT -->|"Observation"| LLM
    VAR -.->|"Reuse next turn"| EXE

    classDef user fill:#e94560,stroke:#fff,color:#fff
    classDef agent fill:#16213e,stroke:#fff,color:#fff
    classDef sandbox fill:#ff9800,stroke:#fff,color:#fff
    classDef state fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    class Q user
    class LLM,PARSE agent
    class EXE,TOOL sandbox
    class VAR,OUT state

Figure 1 — The CodeAct loop: LLM emits code → executes → observation feeds back to LLM. Variables are kept across turns to enable complex composition.

5 components

  1. LLM Planner + Coder: One model both plans and writes code. Must be Python-fluent (Claude Sonnet, GPT-4 class or above) — this is why CodeAct underperforms with sub-7B models.
  2. Code Parser/Validator: Extracts code blocks from LLM output (usually inside ```python fences), AST-checks to block dangerous imports before sending to the sandbox (see the sketch after this list).
  3. Sandbox Runtime: Mandatory. 2026 options: E2B (Firecracker microVM), Pyodide+Deno (WebAssembly), Modal, Docker, Daytona, Azure Container Apps Dynamic Sessions.
  4. Tools as Functions: Tools are not JSON schemas but Python functions pre-imported into the namespace. The LLM can inspect docstring/signature.
  5. State Persistence: Variables must persist across turns (no reset each time). Sandboxes need to be stateful — this is where pure Pyodide is harder than microVMs.
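
For illustration, here is a minimal sketch of the kind of AST check component 2 performs. The allowlist, blocklist, and function name are assumptions for this example, not Smolagents' actual implementation:

import ast

# Hypothetical policy — adjust to your own tool set.
ALLOWED_IMPORTS = {"statistics", "math", "json", "collections"}
BLOCKED_CALLS = {"exec", "eval", "__import__", "compile", "open"}

def validate_code(source: str) -> list[str]:
    """Return a list of policy violations found in an LLM-generated snippet."""
    violations = []
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):          # block imports outside the allowlist
            violations += [f"import not allowed: {a.name}" for a in node.names
                           if a.name.split(".")[0] not in ALLOWED_IMPORTS]
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:     # block dangerous builtins
                violations.append(f"call not allowed: {node.func.id}")
    return violations

# Only code with zero violations is forwarded to the sandbox.
print(validate_code('import os\nos.system("rm -rf /")'))  # ['import not allowed: os']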

4. Comparing CodeAct vs JSON Tool Call

Criterion               | JSON Tool Call                               | CodeAct
Output format           | Schema-conforming JSON object                | Python snippet (Markdown code block)
Tools per step          | 1                                            | Many (loops, branches allowed)
Auxiliary computation   | LLM does math/sort in head (error-prone)     | Python handles it precisely
Mid-step error handling | Must round-trip back to the LLM on each fail | try/except inline, no extra LLM call
Composability           | Low — tools are independent                  | High — tool A's output feeds tool B directly
LLM requirement         | Any model with function calling              | Model must be Python-fluent (≥7B, ideally ≥30B)
Security                | Simple — JSON validation suffices            | Complex — sandbox isolation mandatory
Debuggability           | JSON traces are easy to read                 | Need to log code + stdout + variable state
Per-step latency        | Low (one tool call)                          | Higher (sandbox cold start ~50-200ms)
End-to-end latency      | High (many LLM round trips)                  | Low (fewer round trips)

When to choose CodeAct?

  • Tasks needing multi-tool composition in one step (small data pipelines, ETL).
  • Need for computation/aggregation over results (sum, sort, group).
  • Logic with loops/conditionals (process each item in a list).
  • You already have sandbox infrastructure (E2B, Modal, Daytona...).

When to keep JSON Tool Call?

  • Simple, single-step tools (send email, create Jira ticket).
  • Small models (Llama 3.2 3B, Mistral 7B) — code generation is weak.
  • Environments that cannot host a sandbox (regulated industries, edge devices).
  • Need for readable audit logs for compliance — JSON is easier to track than code.

5. Walkthrough — Building a CodeAgent with Smolagents

Smolagents is the cleanest CodeAct implementation. The code below builds an agent that writes its own Python to complete tasks.

from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def search_cities(country: str, limit: int = 5) -> list[str]:
    """Find the largest cities in a country.
    Args:
        country: ISO country code (VN, US, ...)
        limit: Number of cities to return
    """
    # Mocked — production would call a real API
    return ["Ho Chi Minh", "Hanoi", "Da Nang", "Hai Phong", "Can Tho"][:limit]

@tool
def get_population(city: str) -> int:
    """Return the population of a city."""
    data = {"Ho Chi Minh": 9300000, "Hanoi": 8500000,
            "Da Nang": 1230000, "Hai Phong": 2050000, "Can Tho": 1240000}
    return data.get(city, 0)

model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-6")

agent = CodeAgent(
    tools=[search_cities, get_population],
    model=model,
    executor_type="e2b",   # : e2b | docker | local
    additional_authorized_imports=["statistics"],
    max_steps=5,
)

result = agent.run("What's the total population of the 5 largest cities in Vietnam? Include standard deviation.")
print(result)

At runtime, the agent will produce code similar to:

cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]

import statistics
total = sum(populations)
stdev = statistics.stdev(populations)
print(f"Total: {total:,}")
print(f"StDev: {stdev:,.0f}")

A single LLM call accomplishes finding the cities, fetching populations, summing the total, and computing statistics — something a JSON tool call would need 7-8 round trips for.

6. Security — The sandbox is non-negotiable

CodeAct is only safe if the sandbox is solid. Letting LLM-generated code run in the main process means even a mild prompt injection could run os.system("rm -rf /"). A comparison of popular 2026 runtimes:

Runtime                               | Isolation                               | Cold start | Stateful        | Best for
E2B                                   | Firecracker microVM                     | ~150ms     | Yes             | Production agents, multi-tenant
Modal                                 | gVisor + container                      | ~500ms     | Yes             | Compute-heavy / GPU workloads
Daytona                               | Container + LXC                         | ~200ms     | Yes             | Dev environment + agent
Azure Container Apps Dynamic Sessions | Hyper-V + Code Interpreter              | ~300ms     | Yes (60 min)    | Enterprise Microsoft stack
Pyodide + Deno                        | WebAssembly + permission flags          | ~50ms      | Hard (per-call) | Edge, lightweight, single-tenant
Docker                                | Linux namespaces (weaker than microVM)  | ~1-2s      | Yes             | Dev/PoC, not multi-tenant prod
Local Python (DON'T)                  | None                                    | 0ms        | Yes             | Never

5 mandatory security checks

  1. Network egress filter: block outbound traffic except an allowlist (prevent data exfiltration).
  2. Read-only filesystem except /tmp; never mount secrets into the sandbox.
  3. CPU + memory + wall-clock limits: e.g. 30s, 512MB — enough for legitimate tasks, blocks infinite loops (see the sketch after this list).
  4. AST validation before exec: block __import__, exec, eval, open("/etc/passwd").
  5. One sandbox per user: never share a sandbox between tenants.
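
To make check 3 concrete, here is a minimal sketch for a POSIX dev setup; hosted sandboxes like E2B or Modal expose equivalent limits as configuration, and the numbers are illustrative, not recommendations:

import resource, subprocess, sys, tempfile

WALL_CLOCK_S = 30
MEMORY_BYTES = 512 * 1024 * 1024

def _apply_limits():
    # Runs in the child process before the code starts: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (WALL_CLOCK_S, WALL_CLOCK_S))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))

def run_limited(code: str) -> subprocess.CompletedProcess:
    """Execute a snippet in a separate Python process with hard resource limits."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, path],
        preexec_fn=_apply_limits,   # POSIX only
        timeout=WALL_CLOCK_S,       # wall-clock guard; raises TimeoutExpired
        capture_output=True, text=True,
    )

print(run_limited("print(sum(range(10)))").stdout)  # 45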

7. Observation Loop and cross-turn state

The subtlest part of CodeAct is the multi-turn loop. After each turn, the agent must "see" three things: (1) the code that ran, (2) stdout/stderr, (3) variables still alive in the namespace.

sequenceDiagram
    participant U as User
    participant A as Agent (LLM)
    participant S as Sandbox
    U->>A: "Analyze this week's error logs"
    A->>S: code: logs = fetch_logs(days=7)
    S-->>A: stdout: "(fetched 12,450 records)" + var: logs
    Note over A: LLM sees var logs is ready
    A->>S: code: errors = [l for l in logs if l.level=="ERROR"]<br/>top = Counter(e.module for e in errors).most_common(5)<br/>print(top)
    S-->>A: stdout: [("payment", 234), ("auth", 189), ...]
    A->>U: "The payment module errors most (234 times)..."

Figure 2 — State (var logs) is preserved across turns, letting the LLM reference it without re-fetching.

This is a major difference from JSON tool calls: in the old pattern, every observation must be stuffed into the next turn's prompt context. With CodeAct, large observations (12,450 log records) stay in the sandbox's memory; the LLM only references the variable name — saving an enormous amount of token budget. A minimal sketch of such a stateful executor follows.
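
The sketch below keeps one persistent namespace per conversation and captures (and truncates) stdout before it re-enters the prompt. It runs code in-process purely for illustration — in production this logic lives inside the sandbox (section 6) — and all names are assumptions:

import contextlib, io

class StatefulExecutor:
    """One namespace per conversation, reused on every turn."""

    def __init__(self, tools: dict, max_stdout: int = 10_000):
        self.namespace = dict(tools)            # tools pre-imported as plain functions
        self.max_stdout = max_stdout

    def run(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)      # variables persist in self.namespace
        except Exception as e:
            return f"[error] {type(e).__name__}: {e}"
        out = buf.getvalue()
        if len(out) > self.max_stdout:          # keep the observation small (see Mistake 3)
            out = out[:self.max_stdout] + "\n...[truncated]"
        return out

# Turn 1 defines `logs`; turn 2 reuses it without re-fetching.
ex = StatefulExecutor(tools={"fetch_logs": lambda days: ["ERROR payment"] * 3})
print(ex.run("logs = fetch_logs(days=7)\nprint(len(logs), 'records fetched')"))
print(ex.run("errors = [l for l in logs if 'ERROR' in l]\nprint(len(errors), 'errors')"))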

8. 6 common mistakes when deploying CodeAct

Mistake 1 — Not resetting namespace across users

Stateful sandboxes are convenient, but if user A and user B share a container, A's variables leak to B. Fix: one sandbox instance per conversation/user, or an explicit %reset between sessions.

Mistake 2 — Letting the LLM import any library

Risk of supply chain attacks (LLM prompt-injected into importing a malicious package). Fix: additional_authorized_imports allowlist; block runtime pip install.

Mistake 3 — Not truncating long stdout

Agent print()s 50,000 lines → stdout enters the prompt → context window blown. Fix: have the sandbox auto-truncate stdout above 10KB and hint the LLM to paginate.

Mistake 4 — Not separating Final Answer from Code

The LLM sometimes writes both code and a final answer in the same response. Fix: Smolagents uses a special final_answer() tool — the agent must call it explicitly to terminate; a sketch of the idea follows.
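
One common way to implement such an explicit terminator — a sketch of the idea, not Smolagents' actual internals — is a final_answer function that raises a sentinel exception the agent loop catches:

class FinalAnswer(Exception):
    """Sentinel carrying the answer out of the execution loop."""
    def __init__(self, value):
        self.value = value

def final_answer(value):
    """Exposed to the LLM as a tool; calling it ends the agent loop."""
    raise FinalAnswer(value)

# Inside the agent loop (sketch):
# try:
#     exec(generated_code, namespace)   # the generated code may call final_answer(...)
# except FinalAnswer as done:
#     return done.value                 # terminate with the explicit answer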

Mistake 5 — High cold-start latency

Each new conversation spawns a sandbox → 2-3s first-call latency. Fix: a warmed-up sandbox pool (E2B and Modal both support this), or Pyodide for short tasks.

Mistake 6 — Skipping observability

Unlike JSON traces, generated code is hard to debug retroactively. Fix: log code + stdout + stderr + execution time + a variables snapshot to a trace store (Langfuse, LangSmith, or build your own on ClickHouse); a minimal trace-record sketch follows.
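
As an illustration of what one trace record could contain — the schema and field names are assumptions, not the API of Langfuse or LangSmith:

import json, time
from dataclasses import dataclass, asdict, field

@dataclass
class StepTrace:
    """Everything needed to replay or debug one agent step later."""
    step: int
    code: str
    stdout: str
    stderr: str
    duration_ms: float
    variables: dict = field(default_factory=dict)  # small repr snapshot, not full objects

def traced_run(executor, step: int, code: str) -> StepTrace:
    start = time.perf_counter()
    stdout = executor.run(code)                    # e.g. the StatefulExecutor sketched above
    trace = StepTrace(
        step=step, code=code, stdout=stdout, stderr="",
        duration_ms=(time.perf_counter() - start) * 1000,
        variables={k: repr(v)[:200] for k, v in executor.namespace.items()
                   if not callable(v)},
    )
    print(json.dumps(asdict(trace)))               # simplest durable format: one JSON line per step
    return trace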

9. CodeAct in multi-agent systems

When scaling to multiple agents, CodeAct unlocks an interesting pattern: the orchestrator agent calls sub-agents like functions. Smolagents has the concept of managed_agents — sub-agents are wrapped as callables in the parent's namespace.

from smolagents import CodeAgent

researcher = CodeAgent(tools=[web_search, scrape], model=model, name="researcher",
                       description="Search and read web content")
analyst = CodeAgent(tools=[run_sql, plot_chart], model=model, name="analyst",
                    description="Analyze data and produce charts")

orchestrator = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[researcher, analyst],
    max_steps=10,
)

orchestrator.run("Find OpenAI's 3 main competitors in 2026, get their revenue, plot a comparison chart")

The orchestrator might write code like:

competitors_text = researcher(query="3 main competitors of OpenAI in 2026 and their revenue")
chart_url = analyst(query=f"Plot a bar chart from this data: {competitors_text}")
print(chart_url)

This is the purest expression of "agents as functions" — no complex message bus, no pub/sub, just Python function calls inside a sandbox.

10. Future — Is code really the "universal action space"?

Apple ML Research (2026) calls CodeAct "the most expressive action format we have today". But open questions remain:

3 directions under research

  • Domain-specific languages: Is general-purpose Python necessary, or are narrower DSLs (SQL, Cypher, JAX) sufficient and safer?
  • Type-safe code agents: Could having LLMs write TypeScript/Rust reduce runtime bugs vs Python?
  • Hybrid format: JSON for simple tools, code for composition — as Anthropic does with bash + structured tools side by side.

One thing is clear in 2026: the new generation of agent frameworks all natively support code execution (Smolagents, OpenHands, Manus, Letta), while older frameworks (LangChain, AutoGen) are scrambling to add code-mode to keep up. If you're building a new agent system today, CodeAct deserves serious consideration as the default — provided you invest properly in your sandbox.

11. Conclusion

CodeAct isn't "magic" — it's simply using the right tool for the right job. JSON is good for RPC; code is good for composition. As agents take on increasingly complex tasks (data analysis, multi-step reasoning, orchestration), JSON tool calls reveal limits in round-trip count and composition ability. CodeAct gives the LLM back a language designed for computers to understand — code.

The price you pay is complexity. Don't deploy CodeAct without a clear isolation strategy. E2B, Modal, Daytona, Azure Container Apps Dynamic Sessions — pick one and invest seriously. The reward is smarter, faster, cheaper agents (fewer LLM calls) — and most importantly: agents that fix mid-flight errors themselves instead of looping back to ask you.
