CodeAct 2026: When AI Agents Write Code Instead of Calling JSON Tools

Posted on: 5/14/2026 10:31:20 AM

In 2024, most AI Agents communicated with tools through a familiar format: JSON tool calls. At each step, the LLM emits a JSON block invoking exactly one function — clean, parseable, validatable. But by late 2025 and across 2026, a new wave is upending this pattern: CodeAct — agents write Python code directly (instead of JSON), execute it in a sandbox, and use the result for the next step. Hugging Face's Smolagents defaults to CodeAct. Manus AI's breakout in early 2026 also bet on "code as the universal action space". The original CodeAct paper by Wang et al. (ICML 2024) measured striking gains: +20% success rate, -30% steps. Why did such a seemingly small format change reshape the industry?

+20%: Success rate vs JSON tool calls (CodeAct paper)
-30%: Number of steps (= 30% fewer LLM calls)
7+: Sandbox runtimes Smolagents supports (E2B, Modal, Pyodide...)
2024: Year CodeAct was published — mainstream by 2026

1. What is CodeAct — and how is it different from JSON tool calls?

CodeAct (short for Code as Action) is a paradigm where each "action" of an AI Agent is not a JSON object calling a single tool, but an executable code snippet (typically Python) — where tools are exposed as functions, and the LLM is free to compose them with variables, loops, and conditionals.

A concrete example. Suppose an agent must "find the 5 largest cities in Vietnam, get the population of each, then sum them up".

Old way — JSON Tool Call (ReAct pattern)

Step 1: LLM emits {"tool": "search_cities", "args": {"country": "VN", "limit": 5}} → runtime executes → returns 5 cities.
Step 2: LLM emits {"tool": "get_population", "args": {"city": "Ho Chi Minh"}} → returns 9.3M.
Steps 3–6: Repeat 4 more times for the remaining cities.
Step 7: LLM mentally sums the numbers in context → final answer.
Total: 7 steps, 7 LLM calls, error-prone arithmetic.

New way — CodeAct

Step 1: LLM emits a code block:

cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]
total = sum(populations)
print(f"Total: {total:,}")

The sandbox runs it → prints Total: 22,320,000 → agent reads and replies. Total: 2 steps, 2 LLM calls, math handled by Python — no errors.

2. A short history of CodeAct

2022 — ReAct (Yao et al., Princeton + Google)
The classic paper that defined the Thought → Action → Observation loop. Actions were initially free-form text (search query, click). Foundation for every later agent framework, but format-agnostic.
06/2023 — OpenAI Function Calling
OpenAI released function calling: LLMs were fine-tuned to emit valid JSON matching a schema. The industry standardized immediately — Anthropic, Google, Mistral all copied. One LLM call = one tool call.
02/2024 — CodeAct paper (Wang et al., UIUC)
"Executable Code Actions Elicit Better LLM Agents" — proved across 17 LLMs that emitting code instead of JSON raises success rate by up to 20% and cuts step count by 30% on API-Bank.
12/2024 — Hugging Face Smolagents
HF released an agent library that is code-first by default. CodeAgent is the primary class; ToolCallingAgent is just an alternative. The first mainstream signal.
2025 — Manus AI, OpenHands
Manus AI took off as a general-purpose agent using code execution as its sole action space. OpenHands (formerly OpenDevin) also moved to CodeAct.
2026 — Code execution becomes industry standard
Anthropic's bash + code execution tool, OpenAI Code Interpreter as a built-in, Apple ML Research endorsing CodeAct as the most effective action format. JSON tool calls remain but are pushed toward simple cases or smaller models that can't yet write code well.

3. CodeAct architecture in production

A standard CodeAct system has 5 components — notably, the sandbox is now mandatory, not optional.

graph TB
    subgraph User["User"]
        Q["Question / Task"]
    end
    subgraph Agent["Agent Loop"]
        LLM["LLM (planner + code writer)"]
        PARSE["Code Parser / Validator"]
    end
    subgraph Sandbox["Sandbox Runtime (CRITICAL)"]
        EXE["Python Interpreter (E2B / Pyodide / Docker / Modal)"]
        TOOL["Tools as Python functions"]
    end
    subgraph State["State Management"]
        VAR["Variables (persisted across turns)"]
        OUT["stdout / stderr"]
    end
    Q --> LLM
    LLM -->|"Code block"| PARSE
    PARSE -->|"AST validated"| EXE
    EXE <--> TOOL
    EXE --> VAR
    EXE --> OUT
    OUT -->|"Observation"| LLM
    VAR -.->|"Reuse next turn"| EXE

    classDef user fill:#e94560,stroke:#fff,color:#fff
    classDef agent fill:#16213e,stroke:#fff,color:#fff
    classDef sandbox fill:#ff9800,stroke:#fff,color:#fff
    classDef state fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    class Q user
    class LLM,PARSE agent
    class EXE,TOOL sandbox
    class VAR,OUT state

Figure 1 — The CodeAct loop: LLM emits code → executes → observation feeds back to LLM. Variables are kept across turns to enable complex composition.

5 components

  1. LLM Planner + Coder: One model both plans and writes code. Must be Python-fluent (Claude Sonnet, GPT-4 class or above) — this is why CodeAct underperforms with sub-7B models.
  2. Code Parser/Validator: Extracts code blocks from LLM output (usually inside ```python fences), AST-checks to block dangerous imports before sending to the sandbox (see the sketch after this list).
  3. Sandbox Runtime: Mandatory. 2026 options: E2B (Firecracker microVM), Pyodide+Deno (WebAssembly), Modal, Docker, Daytona, Azure Container Apps Dynamic Sessions.
  4. Tools as Functions: Tools are not JSON schemas but Python functions pre-imported into the namespace. The LLM can inspect docstring/signature.
  5. State Persistence: Variables must persist across turns (no reset each time). Sandboxes need to be stateful — this is where pure Pyodide is harder than microVMs.
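
For illustration, here is a minimal sketch of the kind of AST check component 2 performs. The allowlist, blocklist, and function name are assumptions for this example, not Smolagents' actual implementation:

import ast

# Hypothetical policy — adjust to your own tool set.
ALLOWED_IMPORTS = {"statistics", "math", "json", "collections"}
BLOCKED_CALLS = {"exec", "eval", "__import__", "compile", "open"}

def validate_code(source: str) -> list[str]:
    """Return a list of policy violations found in an LLM-generated snippet."""
    violations = []
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):          # block imports outside the allowlist
            violations += [f"import not allowed: {a.name}" for a in node.names
                           if a.name.split(".")[0] not in ALLOWED_IMPORTS]
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:     # block dangerous builtins
                violations.append(f"call not allowed: {node.func.id}")
    return violations

# Only code with zero violations is forwarded to the sandbox.
print(validate_code('import os\nos.system("rm -rf /")'))  # ['import not allowed: os']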

4. Comparing CodeAct vs JSON Tool Call

Criterion               | JSON Tool Call                               | CodeAct
Output format           | Schema-conforming JSON object                | Python snippet (Markdown code block)
Tools per step          | 1                                            | Many (loops, branches allowed)
Auxiliary computation   | LLM does math/sort in head (error-prone)     | Python handles it precisely
Mid-step error handling | Must round-trip back to the LLM on each fail | try/except inline, no extra LLM call
Composability           | Low — tools are independent                  | High — tool A's output feeds tool B directly
LLM requirement         | Any model with function calling              | Model must be Python-fluent (≥7B, ideally ≥30B)
Security                | Simple — JSON validation suffices            | Complex — sandbox isolation mandatory
Debuggability           | JSON traces are easy to read                 | Need to log code + stdout + variable state
Per-step latency        | Low (one tool call)                          | Higher (sandbox cold start ~50-200ms)
End-to-end latency      | High (many LLM round trips)                  | Low (fewer round trips)

When to choose CodeAct?

  • Tasks needing multi-tool composition in one step (small data pipelines, ETL).
  • Need for computation/aggregation over results (sum, sort, group).
  • Logic with loops/conditionals (process each item in a list).
  • You already have sandbox infrastructure (E2B, Modal, Daytona...).

When to keep JSON Tool Call?

  • Simple, single-step tools (send email, create Jira ticket).
  • Small models (Llama 3.2 3B, Mistral 7B) — code generation is weak.
  • Environments that cannot host a sandbox (regulated industries, edge devices).
  • Need for readable audit logs for compliance — JSON is easier to track than code.

5. Walkthrough — Building a CodeAgent with Smolagents

Smolagents is the cleanest CodeAct implementation. The code below builds an agent that writes its own Python to complete tasks.

from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def search_cities(country: str, limit: int = 5) -> list[str]:
    """Find the largest cities in a country.
    Args:
        country: ISO country code (VN, US, ...)
        limit: Number of cities to return
    """
    # Mocked — production would call a real API
    return ["Ho Chi Minh", "Hanoi", "Da Nang", "Hai Phong", "Can Tho"][:limit]

@tool
def get_population(city: str) -> int:
    """Return the population of a city."""
    data = {"Ho Chi Minh": 9300000, "Hanoi": 8500000,
            "Da Nang": 1230000, "Hai Phong": 2050000, "Can Tho": 1240000}
    return data.get(city, 0)

model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-6")

agent = CodeAgent(
    tools=[search_cities, get_population],
    model=model,
    executor_type="e2b",   # : e2b | docker | local
    additional_authorized_imports=["statistics"],
    max_steps=5,
)

result = agent.run("What's the total population of the 5 largest cities in Vietnam? Include standard deviation.")
print(result)

At runtime, the agent will produce code similar to:

cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]

import statistics
total = sum(populations)
stdev = statistics.stdev(populations)
print(f"Total: {total:,}")
print(f"StDev: {stdev:,.0f}")

A single LLM call accomplishes finding the cities, fetching populations, summing the total, and computing statistics — something a JSON tool call would need 7-8 round trips for.

6. Security — The sandbox is non-negotiable

CodeAct is only safe if the sandbox is solid. Letting LLM-generated code run in the main process means even a mild prompt injection could run os.system("rm -rf /"). A comparison of popular 2026 runtimes:

Runtime                               | Isolation                               | Cold start | Stateful        | Best for
E2B                                   | Firecracker microVM                     | ~150ms     | Yes             | Production agents, multi-tenant
Modal                                 | gVisor + container                      | ~500ms     | Yes             | Compute-heavy / GPU workloads
Daytona                               | Container + LXC                         | ~200ms     | Yes             | Dev environment + agent
Azure Container Apps Dynamic Sessions | Hyper-V + Code Interpreter              | ~300ms     | Yes (60 min)    | Enterprise Microsoft stack
Pyodide + Deno                        | WebAssembly + permission flags          | ~50ms      | Hard (per-call) | Edge, lightweight, single-tenant
Docker                                | Linux namespaces (weaker than microVM)  | ~1-2s      | Yes             | Dev/PoC, not multi-tenant prod
Local Python (DON'T)                  | None                                    | 0ms        | Yes             | Never

5 mandatory security checks

  1. Network egress filter: block outbound traffic except an allowlist (prevent data exfiltration).
  2. Read-only filesystem except /tmp; never mount secrets into the sandbox.
  3. CPU + memory + wall-clock limits: e.g. 30s, 512MB — enough for legitimate tasks, blocks infinite loops (see the sketch after this list).
  4. AST validation before exec: block __import__, exec, eval, open("/etc/passwd").
  5. One sandbox per user: never share a sandbox between tenants.
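
To make check 3 concrete, here is a minimal sketch for a POSIX dev setup; hosted sandboxes like E2B or Modal expose equivalent limits as configuration, and the numbers are illustrative, not recommendations:

import resource, subprocess, sys, tempfile

WALL_CLOCK_S = 30
MEMORY_BYTES = 512 * 1024 * 1024

def _apply_limits():
    # Runs in the child process before the code starts: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (WALL_CLOCK_S, WALL_CLOCK_S))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))

def run_limited(code: str) -> subprocess.CompletedProcess:
    """Execute a snippet in a separate Python process with hard resource limits."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, path],
        preexec_fn=_apply_limits,   # POSIX only
        timeout=WALL_CLOCK_S,       # wall-clock guard; raises TimeoutExpired
        capture_output=True, text=True,
    )

print(run_limited("print(sum(range(10)))").stdout)  # 45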

7. Observation Loop and cross-turn state

The subtlest part of CodeAct is the multi-turn loop. After each turn, the agent must "see" three things: (1) the code that ran, (2) stdout/stderr, (3) variables still alive in the namespace.

sequenceDiagram
    participant U as User
    participant A as Agent (LLM)
    participant S as Sandbox
    U->>A: "Analyze this week's error logs"
    A->>S: code: logs = fetch_logs(days=7)
    S-->>A: stdout: "(fetched 12,450 records)" + var: logs
    Note over A: LLM sees var logs is ready
    A->>S: code: errors = [l for l in logs if l.level=="ERROR"]<br/>top = Counter(e.module for e in errors).most_common(5)<br/>print(top)
    S-->>A: stdout: [("payment", 234), ("auth", 189), ...]
    A->>U: "The payment module errors most (234 times)..."

Figure 2 — State (var logs) is preserved across turns, letting the LLM reference it without re-fetching.

This is a major difference from JSON tool calls: in the old pattern, every observation must be stuffed into the next turn's prompt context. With CodeAct, large observations (12,450 log records) stay in the sandbox's memory; the LLM only references the variable name — saving an enormous amount of token budget. A minimal sketch of such a stateful executor follows.
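
The sketch below keeps one persistent namespace per conversation and captures (and truncates) stdout before it re-enters the prompt. It runs code in-process purely for illustration — in production this logic lives inside the sandbox (section 6) — and all names are assumptions:

import contextlib, io

class StatefulExecutor:
    """One namespace per conversation, reused on every turn."""

    def __init__(self, tools: dict, max_stdout: int = 10_000):
        self.namespace = dict(tools)            # tools pre-imported as plain functions
        self.max_stdout = max_stdout

    def run(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)      # variables persist in self.namespace
        except Exception as e:
            return f"[error] {type(e).__name__}: {e}"
        out = buf.getvalue()
        if len(out) > self.max_stdout:          # keep the observation small (see Mistake 3)
            out = out[:self.max_stdout] + "\n...[truncated]"
        return out

# Turn 1 defines `logs`; turn 2 reuses it without re-fetching.
ex = StatefulExecutor(tools={"fetch_logs": lambda days: ["ERROR payment"] * 3})
print(ex.run("logs = fetch_logs(days=7)\nprint(len(logs), 'records fetched')"))
print(ex.run("errors = [l for l in logs if 'ERROR' in l]\nprint(len(errors), 'errors')"))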

8. 6 common mistakes when deploying CodeAct

Mistake 1 — Not resetting namespace across users

Stateful sandboxes are convenient, but if user A and user B share a container, A's variables leak to B. Fix: one sandbox instance per conversation/user, or an explicit %reset between sessions.

Mistake 2 — Letting the LLM import any library

Risk of supply chain attacks (LLM prompt-injected into importing a malicious package). Fix: additional_authorized_imports allowlist; block runtime pip install.

Mistake 3 — Not truncating long stdout

Agent print()s 50,000 lines → stdout enters the prompt → context window blown. Fix: have the sandbox auto-truncate stdout above 10KB and hint the LLM to paginate.

Mistake 4 — Not separating Final Answer from Code

The LLM sometimes writes both code and a final answer in the same response. Fix: Smolagents uses a special final_answer() tool — the agent must call it explicitly to terminate; a sketch of the idea follows.
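
One common way to implement such an explicit terminator — a sketch of the idea, not Smolagents' actual internals — is a final_answer function that raises a sentinel exception the agent loop catches:

class FinalAnswer(Exception):
    """Sentinel carrying the answer out of the execution loop."""
    def __init__(self, value):
        self.value = value

def final_answer(value):
    """Exposed to the LLM as a tool; calling it ends the agent loop."""
    raise FinalAnswer(value)

# Inside the agent loop (sketch):
# try:
#     exec(generated_code, namespace)   # the generated code may call final_answer(...)
# except FinalAnswer as done:
#     return done.value                 # terminate with the explicit answer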

Mistake 5 — High cold-start latency

Each new conversation spawns a sandbox → 2-3s first-call latency. Fix: a warmed-up sandbox pool (E2B and Modal both support this), or Pyodide for short tasks.

Mistake 6 — Skipping observability

Unlike JSON traces, generated code is hard to debug retroactively. Fix: log code + stdout + stderr + execution time + a variables snapshot to a trace store (Langfuse, LangSmith, or build your own on ClickHouse); a minimal trace-record sketch follows.
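
As an illustration of what one trace record could contain — the schema and field names are assumptions, not the API of Langfuse or LangSmith:

import json, time
from dataclasses import dataclass, asdict, field

@dataclass
class StepTrace:
    """Everything needed to replay or debug one agent step later."""
    step: int
    code: str
    stdout: str
    stderr: str
    duration_ms: float
    variables: dict = field(default_factory=dict)  # small repr snapshot, not full objects

def traced_run(executor, step: int, code: str) -> StepTrace:
    start = time.perf_counter()
    stdout = executor.run(code)                    # e.g. the StatefulExecutor sketched above
    trace = StepTrace(
        step=step, code=code, stdout=stdout, stderr="",
        duration_ms=(time.perf_counter() - start) * 1000,
        variables={k: repr(v)[:200] for k, v in executor.namespace.items()
                   if not callable(v)},
    )
    print(json.dumps(asdict(trace)))               # simplest durable format: one JSON line per step
    return trace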

9. CodeAct in multi-agent systems

When scaling to multiple agents, CodeAct unlocks an interesting pattern: the orchestrator agent calls sub-agents like functions. Smolagents has the concept of managed_agents — sub-agents are wrapped as callables in the parent's namespace.

from smolagents import CodeAgent

researcher = CodeAgent(tools=[web_search, scrape], model=model, name="researcher",
                       description="Search and read web content")
analyst = CodeAgent(tools=[run_sql, plot_chart], model=model, name="analyst",
                    description="Analyze data and produce charts")

orchestrator = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[researcher, analyst],
    max_steps=10,
)

orchestrator.run("Find OpenAI's 3 main competitors in 2026, get their revenue, plot a comparison chart")

The orchestrator might write code like:

competitors_text = researcher(query="3 main competitors of OpenAI in 2026 and their revenue")
chart_url = analyst(query=f"Plot a bar chart from this data: {competitors_text}")
print(chart_url)

This is the purest expression of "agents as functions" — no complex message bus, no pub/sub, just Python function calls inside a sandbox.

10. Future — Is code really the "universal action space"?

Apple ML Research (2026) calls CodeAct "the most expressive action format we have today". But open questions remain:

3 directions under research

  • Domain-specific languages: Is general-purpose Python necessary, or are narrower DSLs (SQL, Cypher, JAX) sufficient and safer?
  • Type-safe code agents: Could having LLMs write TypeScript/Rust reduce runtime bugs vs Python?
  • Hybrid format: JSON for simple tools, code for composition — as Anthropic does with bash + structured tools side by side.

One thing is clear in 2026: the new generation of agent frameworks all natively support code execution (Smolagents, OpenHands, Manus, Letta), while older frameworks (LangChain, AutoGen) are scrambling to add code-mode to keep up. If you're building a new agent system today, CodeAct deserves serious consideration as the default — provided you invest properly in your sandbox.

11. Conclusion

CodeAct isn't "magic" — it's simply using the right tool for the right job. JSON is good for RPC; code is good for composition. As agents take on increasingly complex tasks (data analysis, multi-step reasoning, orchestration), JSON tool calls reveal limits in round-trip count and composition ability. CodeAct gives the LLM back a language designed for computers to understand — code.

The price you pay is complexity. Don't deploy CodeAct without a clear isolation strategy. E2B, Modal, Daytona, Azure Container Apps Dynamic Sessions — pick one and invest seriously. The reward is smarter, faster, cheaper agents (fewer LLM calls) — and most importantly: agents that fix mid-flight errors themselves instead of looping back to ask you.
