CodeAct 2026: When AI Agents Write Code Instead of Calling JSON Tools
Posted on: 5/14/2026 10:31:20 AM
Table of contents
- 1. What is CodeAct — and how is it different from JSON tool calls?
- 2. A short history of CodeAct
- 3. CodeAct architecture in production
- 4. Comparing CodeAct vs JSON Tool Call
- 5. Walkthrough — Building a CodeAgent with Smolagents
- 6. Security — The sandbox is non-negotiable
- 7. Observation Loop and cross-turn state
- 8. 6 common mistakes when deploying CodeAct
- 9. CodeAct in multi-agent systems
- 10. Future — Is code really the "universal action space"?
- 11. Conclusion
In 2024, most AI Agents communicated with tools through a familiar format: JSON tool calls. Each step, the LLM emits a JSON block invoking exactly one function — clean, parseable, validatable. But by late 2025 and across 2026, a new wave is upending this pattern: CodeAct — agents write Python code directly (instead of JSON), execute it in a sandbox, and use the result for the next step. Hugging Face's Smolagents defaults to CodeAct. Manus AI's breakout in early 2026 also bet on "code as the universal action space". The original CodeAct paper by Wang et al. (ICML 2024) measured striking gains: +20% success rate, -30% steps. Why did such a seemingly small format change reshape the industry?
1. What is CodeAct — and how is it different from JSON tool calls?
CodeAct (short for Code as Action) is a paradigm where each "action" of an AI Agent is not a JSON object calling a single tool, but an executable code snippet (typically Python) — where tools are exposed as functions, and the LLM is free to compose them with variables, loops, and conditionals.
A concrete example. Suppose an agent must "find the 5 largest cities in Vietnam, get the population of each, then sum them up".
Old way — JSON Tool Call (ReAct pattern)
Step 1: LLM emits {"tool": "search_cities", "args": {"country": "VN", "limit": 5}} → runtime executes → returns 5 cities.
Step 2: LLM emits {"tool": "get_population", "args": {"city": "Ho Chi Minh"}} → returns 9.3M.
Steps 3–6: Repeat 4 more times for the remaining cities.
Step 7: LLM mentally sums the numbers in context → final answer.
Total: 7 steps, 7 LLM calls, error-prone arithmetic.
New way — CodeAct
Step 1: LLM emits a code block:
cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]
total = sum(populations)
print(f"Total: {total:,}")
Sandbox runs → prints Total: 24,500,000 → agent reads and replies. Total: 2 steps, 2 LLM calls, math handled by Python — no errors.
2. A short history of CodeAct
The idea was formalized by Wang et al. in the CodeAct paper (ICML 2024). Hugging Face's Smolagents then adopted it wholesale: CodeAgent is the primary class; ToolCallingAgent is just an alternative. That was the first mainstream signal that code-as-action was becoming the default.
3. CodeAct architecture in production
A standard CodeAct system has 5 components — notably, the sandbox is now mandatory, not optional.
graph TB
subgraph User["User"]
Q["Question / Task"]
end
subgraph Agent["Agent Loop"]
LLM["LLM (planner + code writer)"]
PARSE["Code Parser / Validator"]
end
subgraph Sandbox["Sandbox Runtime (CRITICAL)"]
EXE["Python Interpreter (E2B / Pyodide / Docker / Modal)"]
TOOL["Tools as Python functions"]
end
subgraph State["State Management"]
VAR["Variables (persisted across turns)"]
OUT["stdout / stderr"]
end
Q --> LLM
LLM -->|"Code block"| PARSE
PARSE -->|"AST validated"| EXE
EXE <--> TOOL
EXE --> VAR
EXE --> OUT
OUT -->|"Observation"| LLM
VAR -.->|"Reuse next turn"| EXE
classDef user fill:#e94560,stroke:#fff,color:#fff
classDef agent fill:#16213e,stroke:#fff,color:#fff
classDef sandbox fill:#ff9800,stroke:#fff,color:#fff
classDef state fill:#f8f9fa,stroke:#e94560,color:#2c3e50
class Q user
class LLM,PARSE agent
class EXE,TOOL sandbox
class VAR,OUT state
Figure 1 — The CodeAct loop: LLM emits code → executes → observation feeds back to LLM. Variables are kept across turns to enable complex composition.
5 components
- LLM Planner + Coder: One model both plans and writes code. Must be Python-fluent (Claude Sonnet, GPT-4 class or above) — this is why CodeAct underperforms with sub-7B models.
- Code Parser/Validator: Extracts code blocks from LLM output (usually inside ```python fences), AST-checks to block dangerous imports before sending to the sandbox (see the sketch after this list).
- Sandbox Runtime: Mandatory. 2026 options: E2B (Firecracker microVM), Pyodide+Deno (WebAssembly), Modal, Docker, Daytona, Azure Container Apps Dynamic Sessions.
- Tools as Functions: Tools are not JSON schemas but Python functions pre-imported into the namespace. The LLM can inspect docstring/signature.
- State Persistence: Variables must persist across turns (no reset each time). Sandboxes need to be stateful — this is where pure Pyodide is harder than microVMs.
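A minimal sketch of what the Code Parser/Validator step can look like, using Python's ast module — the allowlist and blocklist below are illustrative, not Smolagents' actual internals:
import ast

ALLOWED_IMPORTS = {"statistics", "math", "json"}        # illustrative allowlist
BLOCKED_NAMES = {"__import__", "exec", "eval", "open"}  # illustrative blocklist

def validate_code(code: str) -> str:
    """Reject dangerous constructs before the code reaches the sandbox."""
    tree = ast.parse(code)  # raises SyntaxError on malformed code
    for node in ast.walk(tree):
        # Block imports outside the allowlist
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    raise ValueError(f"import of '{alias.name}' not allowed")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                raise ValueError(f"import from '{node.module}' not allowed")
        # Block references to dangerous builtins
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            raise ValueError(f"use of '{node.id}' not allowed")
    return code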
4. Comparing CodeAct vs JSON Tool Call
| Criterion | JSON Tool Call | CodeAct |
|---|---|---|
| Output format | Schema-conforming JSON object | Python snippet (Markdown code block) |
| Tools per step | 1 | Many (loops, branches allowed) |
| Auxiliary computation | LLM does math/sort in head (error-prone) | Python handles it precisely |
| Mid-step error handling | Must round-trip back to LLM each fail | try/except inline, no extra LLM call (see sketch below) |
| Composability | Low — tools are independent | High — tool A's output feeds tool B directly |
| LLM requirement | Any model with function calling | Model must be Python-fluent (≥7B, ideally ≥30B) |
| Security | Simple — JSON validation suffices | Complex — isolation mandatory |
| Debuggability | JSON traces are easy to read | Need to log code + stdout + variable state |
| Per-step latency | Low (one tool call) | Higher (sandbox cold start ~50-200ms) |
| End-to-end latency | High (many LLM round trips) | Low (fewer round trips) |
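To make the "mid-step error handling" row concrete, here is the kind of code a CodeAct agent can emit to retry a flaky tool within a single step — the retry policy and the TimeoutError being caught are illustrative, reusing the city tools from section 1:
populations = []
for city in cities:
    # Retry the flaky tool inline instead of round-tripping to the LLM
    for attempt in range(3):
        try:
            populations.append(get_population(city=city))
            break
        except TimeoutError:
            if attempt == 2:
                print(f"giving up on {city} after 3 attempts")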
When to choose CodeAct?
- Tasks needing multi-tool composition in one step (small data pipelines, ETL).
- Need for computation/aggregation over results (sum, sort, group).
- Logic with loops/conditionals (process each item in a list).
- You already have sandbox infrastructure (E2B, Modal, Daytona...).
When to keep JSON Tool Call?
- Simple, single-step tools (send email, create Jira ticket).
- Small models (Llama 3.2 3B, Mistral 7B) — code generation is weak.
- Environments that cannot host a sandbox (regulated industries, edge devices).
- Need for readable audit logs for compliance — JSON is easier to track than code.
5. Walkthrough — Building a CodeAgent with Smolagents
Smolagents is the cleanest CodeAct implementation. The code below builds an agent that writes its own Python to complete tasks.
from smolagents import CodeAgent, LiteLLMModel, tool
@tool
def search_cities(country: str, limit: int = 5) -> list[str]:
"""Find the largest cities in a country.
Args:
country: ISO country code (VN, US, ...)
limit: Number of cities to return
"""
# Mocked — production would call a real API
return ["Ho Chi Minh", "Hanoi", "Da Nang", "Hai Phong", "Can Tho"][:limit]
@tool
def get_population(city: str) -> int:
"""Return the population of a city."""
data = {"Ho Chi Minh": 9300000, "Hanoi": 8500000,
"Da Nang": 1230000, "Hai Phong": 2050000, "Can Tho": 1240000}
return data.get(city, 0)
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-6")
agent = CodeAgent(
tools=[search_cities, get_population],
model=model,
executor_type="e2b", # : e2b | docker | local
additional_authorized_imports=["statistics"],
max_steps=5,
)
result = agent.run("What's the total population of the 5 largest cities in Vietnam? Include standard deviation.")
print(result)
At runtime, the agent will produce code similar to:
cities = search_cities(country="VN", limit=5)
populations = [get_population(city=c) for c in cities]
import statistics
total = sum(populations)
stdev = statistics.stdev(populations)
print(f"Total: {total:,}")
print(f"StDev: {stdev:,.0f}")
A single LLM call accomplishes finding the cities, fetching populations, summing the total, and computing statistics — something a JSON tool call would need 7-8 round trips for.
6. Security — The sandbox is non-negotiable
CodeAct is only safe if the sandbox is solid. Letting LLM-generated code run on the main process means a mild prompt injection could run os.system("rm -rf /"). Comparison of popular 2026 runtimes:
| Runtime | Isolation | Cold start | Stateful | Best for |
|---|---|---|---|---|
| E2B | Firecracker microVM | ~150ms | Yes | Production agents, multi-tenant |
| Modal | gVisor + container | ~500ms | Yes | Compute-heavy / GPU workloads |
| Daytona | Container + LXC | ~200ms | Yes | Dev environment + agent |
| Azure Container Apps Dynamic Sessions | Hyper-V + Code Interpreter | ~300ms | Yes (60 min) | Enterprise Microsoft stack |
| Pyodide + Deno | WebAssembly + permission flags | ~50ms | Hard (per-call) | Edge, lightweight, single-tenant |
| Docker | Linux namespaces (weaker than microVM) | ~1-2s | Yes | Dev/PoC, not multi-tenant prod |
| Local Python (DON'T) | None | 0ms | Yes | Never |
5 mandatory security checks
- Network egress filter: block outbound traffic except an allowlist (prevent data exfiltration).
- Read-only filesystem except /tmp; never mount secrets into the sandbox.
- CPU + memory + wall-clock limits: 30s, 512MB — enough for legitimate tasks, blocks infinite loops (see the sketch after this list).
- AST validation before exec: block __import__, exec, eval, open("/etc/passwd").
- One sandbox per user: never share sandboxes between tenants.
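As a minimal sketch of the resource-limit check, the snippet below runs generated code as a subprocess on a Linux host with CPU, memory, and wall-clock caps. The limits and the subprocess approach are illustrative — managed runtimes such as E2B or Modal expose equivalent settings:
import resource
import subprocess

def run_limited(code_path: str, timeout_s: int = 30, mem_bytes: int = 512 * 1024**2):
    """Run untrusted code with CPU, memory, and wall-clock limits (Linux only)."""
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))   # memory (address space)
    return subprocess.run(
        ["python", code_path],
        preexec_fn=set_limits,   # apply rlimits in the child before exec
        timeout=timeout_s,       # wall-clock limit enforced by the parent
        capture_output=True,
        text=True,
    )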
7. Observation Loop and cross-turn state
The subtlest part of CodeAct is the multi-turn loop. After each turn, the agent must "see" three things: (1) the code that ran, (2) stdout/stderr, (3) variables still alive in the namespace.
sequenceDiagram
participant U as User
participant A as Agent (LLM)
participant S as Sandbox
U->>A: "Analyze this week's error logs"
A->>S: code: logs = fetch_logs(days=7)
S-->>A: stdout: "(fetched 12,450 records)" + var: logs
Note over A: LLM sees var logs is ready
A->>S: code: errors = [l for l in logs if l.level=="ERROR"]<br/>top = Counter(e.module for e in errors).most_common(5)<br/>print(top)
S-->>A: stdout: [("payment", 234), ("auth", 189), ...]
A->>U: "The payment module errors most (234 times)..."
Figure 2 — State (var logs) is preserved across turns, letting the LLM reference it without re-fetching.
This is a major difference from JSON tool calls: in the old pattern, every observation must be stuffed into the next turn's prompt context. With CodeAct, large observations (12,450 log records) stay in memory, the LLM only references the variable name — saving enormous amounts of token budget.
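A minimal sketch of this cross-turn state, assuming a single long-lived namespace dict that every turn's code executes against (a real sandbox such as E2B or a Jupyter kernel keeps the equivalent state server-side):
import contextlib
import io

class StatefulExecutor:
    """Keep variables alive across turns so later code can reference them by name."""
    def __init__(self):
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)  # same dict every turn -> variables persist
        return buf.getvalue()

executor = StatefulExecutor()
executor.run("logs = list(range(12450))")                            # turn 1: fetch once
print(executor.run("print(len([l for l in logs if l % 7 == 0]))"))   # turn 2: reuse 'logs'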
8. 6 common mistakes when deploying CodeAct
Mistake 1 — Not resetting namespace across users
Stateful sandboxes are convenient, but if user A and user B share a container, A's variables leak to B. Fix: One sandbox instance per conversation/user, or an explicit %reset between sessions.
Mistake 2 — Letting the LLM import any library
Risk of supply chain attacks (LLM prompt-injected into importing a malicious package). Fix: additional_authorized_imports allowlist; block runtime pip install.
Mistake 3 — Not truncating long stdout
The agent print()s 50,000 lines → stdout enters the prompt → context window blown. Fix: have the sandbox auto-truncate stdout above ~10KB and hint the LLM to paginate (see the sketch below).
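A minimal truncation helper as a sketch — the 10KB cap and head/tail split are illustrative defaults, not a framework setting:
MAX_STDOUT_BYTES = 10_000  # illustrative cap

def truncate_stdout(output: str, limit: int = MAX_STDOUT_BYTES) -> str:
    """Keep the head and tail of oversized output and tell the LLM it was cut."""
    data = output.encode()
    if len(data) <= limit:
        return output
    head = data[: limit // 2].decode(errors="ignore")
    tail = data[-(limit // 2):].decode(errors="ignore")
    return f"{head}\n... [truncated {len(data) - limit} bytes, use pagination] ...\n{tail}"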
Mistake 4 — Not separating Final Answer from Code
The LLM sometimes writes both code and a final answer in the same response. Fix: Smolagents uses a special final_answer() tool — the agent must call it explicitly to terminate.
Mistake 5 — High cold-start latency
Each new conversation spawns a sandbox → 2-3s first-call latency. Fix: a warmed-up sandbox pool (E2B and Modal both support this), or Pyodide for short tasks.
Mistake 6 — Skipping observability
Unlike JSON, generated code is hard to debug retroactively. Fix: Log code + stdout + stderr + execution time + variables snapshot to a trace store (Langfuse, LangSmith, or build your own on ClickHouse).
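As a sketch of what to capture per step, the record below writes one JSON line per execution to a local file — the field names and file path are illustrative; Langfuse and LangSmith have their own SDKs:
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class StepTrace:
    """One executed step: enough to replay and debug it later."""
    code: str
    stdout: str
    stderr: str
    duration_ms: float
    variables: dict = field(default_factory=dict)   # small snapshot, not full objects
    timestamp: float = field(default_factory=time.time)

def log_step(trace: StepTrace, path: str = "agent_traces.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")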
9. CodeAct in multi-agent systems
When scaling to multiple agents, CodeAct unlocks an interesting pattern: the orchestrator agent calls sub-agents like functions. Smolagents has the concept of managed_agents — sub-agents are wrapped as callables in the parent's namespace.
from smolagents import CodeAgent
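# web_search, scrape, run_sql, plot_chart are assumed to be @tool functions defined
# elsewhere; model is the LiteLLMModel from section 5.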
researcher = CodeAgent(tools=[web_search, scrape], model=model, name="researcher",
description="Search and read web content")
analyst = CodeAgent(tools=[run_sql, plot_chart], model=model, name="analyst",
description="Analyze data and produce charts")
orchestrator = CodeAgent(
tools=[],
model=model,
managed_agents=[researcher, analyst],
max_steps=10,
)
orchestrator.run("Find OpenAI's 3 main competitors in 2026, get their revenue, plot a comparison chart")
The orchestrator might write code like:
competitors_text = researcher(query="3 main competitors of OpenAI in 2026 and their revenue")
chart_url = analyst(query=f"Plot a bar chart from this data: {competitors_text}")
print(chart_url)
This is the purest expression of "agents as functions" — no complex message bus, no pub/sub, just Python function calls inside a sandbox.
10. Future — Is code really the "universal action space"?
Apple ML Research (2026) calls CodeAct "the most expressive action format we have today". But open questions remain:
3 directions under research
- Domain-specific languages: Is general-purpose Python necessary, or are narrower DSLs (SQL, Cypher, JAX) sufficient and safer?
- Type-safe code agents: Could having LLMs write TypeScript/Rust reduce runtime bugs vs Python?
- Hybrid format: JSON for simple tools, code for composition — as Anthropic does with bash + structured tools side by side.
One thing is clear in 2026: the new generation of agent frameworks all natively support code execution (Smolagents, OpenHands, Manus, Letta), while older frameworks (LangChain, AutoGen) are scrambling to add code-mode to keep up. If you're building a new agent system today, CodeAct deserves serious consideration as the default — provided you invest properly in your sandbox.
11. Conclusion
CodeAct isn't "magic" — it's simply using the right tool for the right job. JSON is good for RPC; code is good for composition. As agents take on increasingly complex tasks (data analysis, multi-step reasoning, orchestration), JSON tool calls reveal limits in round-trip count and composition ability. CodeAct gives the LLM back a language designed for computers to understand — code.
The price you pay is complexity. Don't deploy CodeAct without a clear isolation strategy. E2B, Modal, Daytona, Azure Container Apps Dynamic Sessions — pick one and invest seriously. The reward is smarter, faster, cheaper agents (fewer LLM calls) — and most importantly: agents that fix mid-flight errors themselves instead of looping back to ask you.
References
- Wang, X. et al. — Executable Code Actions Elicit Better LLM Agents (ICML 2024)
- Apple Machine Learning Research — CodeAct: Your LLM Agent Acts Better when Generating Code
- Hugging Face — Introducing smolagents: simple agents that write actions in code
- Hugging Face Docs — smolagents documentation
- Hugging Face Agents Course — Writing actions as code snippets or JSON blobs
- GitHub — xingyaoww/code-act — official repo
- GitHub — huggingface/smolagents
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.