Program-Aided Language Models (PAL): make the model write code instead of doing math
PAL has the LLM produce executable code (typically Python) that solves the problem, then runs that code deterministically. The result: arithmetic and logic that's correct by construction instead of guessed by the model.
Ask GPT-4 to compute 847 × 326 + 1729 in plain prose. It'll guess. Sometimes correctly, sometimes not, often confidently wrong. Models are not calculators; they're probabilistic next-token predictors. Math that compounds across many digits is exactly where they fail in subtle, hard-to-catch ways.
Now ask the same question with a different framing: "Write Python that computes 847 × 326 + 1729 and prints the result." The model writes three lines of correct code. Run that code. Get the answer. Always correct, always verifiable, always reproducible.
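Concretely, the generated code for that reframed prompt might look like this (illustrative; the exact output varies by model):

```python
# Illustrative example of what the model tends to produce for the reframed prompt
result = 847 * 326 + 1729
print(result)
```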
The whole idea in one line
That's Program-Aided Language Models (PAL): have the LLM produce executable code that solves the problem, then run the code deterministically. Math the model would have botched becomes math Python solves correctly — by construction.
The mental model: offload computation to deterministic tools#
Models excel at translation — natural language → some other representation. They struggle at execution — actually computing results that require precise arithmetic or state tracking.
PAL plays to the strength and avoids the weakness. The model translates the natural-language problem into code (its strength). A Python interpreter executes that code (the interpreter's strength: exact, deterministic computation). The two together solve the problem far more reliably than either alone.
It's the same principle behind tool use more broadly — LLMs as orchestrators, not computers. PAL is the specific case where the "tool" is a code interpreter.
The basic PAL pattern#
- Pose the problem. Natural language input from the user or upstream task.
- Generate code. Ask the model to write executable code that solves it. Include constraints (which library, which language, output format).
- Execute in a sandbox. Run the code in an isolated environment with limited time and no network/file-system access.
- Use the output. The code's stdout (or returned value) is the answer. Either return it directly to the user or feed it back to the LLM for natural-language formatting.
Solve the problem below by writing Python code.
Rules:
- The code must be self-contained — no imports beyond Python's standard library.
- Print the final answer to stdout.
- Do not include explanatory prose, only the code.
Problem:
"""
{{problem}}
"""
Python:

Your application code parses the response (it should be just code), runs it in a sandbox (e.g., a sub-process or a hosted code-execution API), and captures stdout. That's your answer.
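A minimal sketch of that execute-and-capture step, assuming the reply has already been reduced to bare Python source (the helper name is illustrative, and a bare sub-process is the simplest option, not real isolation — see the sandboxing section below):

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    """Run untrusted generated code in a separate Python process and return its stdout.

    A sketch only: a plain sub-process is NOT real isolation. In production, wrap
    this in a container or use a hosted code-execution API.
    """
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site-packages
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # hard cap on wall-clock time
    )
    if result.returncode != 0:
        raise RuntimeError(f"Generated code failed:\n{result.stderr}")
    return result.stdout.strip()

# answer = run_generated_code("print(847 * 326 + 1729)")
```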
Why PAL beats CoT for computational tasks#
Chain-of-Thought exposes reasoning in tokens. That works for logic and planning, where the model's limitation is single-pass attention, not arithmetic. For arithmetic specifically:
- CoT is unreliable. The model can correctly state "I need to multiply 847 by 326" and then produce the wrong number. Reasoning correctly about arithmetic doesn't make the arithmetic correct.
- PAL is exact. Once the code is correct, the result is correct. No probabilistic sampling, no accumulating errors.
- PAL is auditable. If something's wrong, you can read the generated code and find the bug. A wrong CoT answer requires you to re-derive the correct one to find the error.
Sandboxing is non-negotiable#
Generated code is untrusted. Even when you're sure the prompt produces deterministic, safe code, never run it in an environment that can:
- Access your network or file system
- Run for unlimited time
- Use unlimited memory
- Import arbitrary libraries
Production patterns:
- Hosted code-execution APIs. OpenAI's code interpreter, Riza, E2B, Replit's API. They run the code in isolated containers with sane defaults.
- WebAssembly sandboxes. Pyodide for in-browser Python execution — useful for client-side PAL.
- Sub-process with restrictions. Docker container, restricted user, no network, a time limit, a memory limit. Works but takes more setup; a sketch follows the callout below.
Never eval() generated code in your main process
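A minimal sketch of that sub-process-with-restrictions option, shelling out to Docker from Python (the image tag, resource limits, and helper name are illustrative assumptions, not a vetted configuration):

```python
import subprocess

def run_in_docker(code: str, timeout_s: float = 10.0) -> str:
    """Run generated code in a throwaway container with no network and capped resources."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",     # no network access
            "--memory", "256m",      # memory cap
            "--cpus", "1",           # CPU cap
            "--read-only",           # read-only root filesystem
            "python:3.12-slim",      # illustrative image; pin whatever you actually vet
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,           # wall-clock cap enforced by the parent process
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout.strip()
```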
When PAL pays off#
PAL vs. alternatives
| If your situation is… | Reach for… | Why |
|---|---|---|
| Multi-step arithmetic, financial calculation | PAL | Exact computation; the model can't botch the math |
| String manipulation, data parsing | PAL | Code is more reliable than prose for transformations |
| Date/time arithmetic | PAL | datetime library handles edge cases the model gets wrong |
| Combinatorics, simulation, game-state evaluation | PAL | Code can iterate; prose CoT can't hold deep state |
| Pure reasoning (no computation) | CoT, not PAL | Code is overhead when the bottleneck is reasoning, not arithmetic |
| Subjective writing tasks | No PAL | No code can write a paragraph for you |
| Latency-critical interactive use | PAL with caching, or skip | Code execution adds 200ms+; cache identical problems |
Going further: production patterns#
Modern tool-use APIs implement PAL natively#
OpenAI's code interpreter, Anthropic's tool use, and Gemini's code execution all let the model invoke a code-running tool as part of normal tool-use flow. You define a run_code tool; the model calls it when it decides math is the right approach. Cleaner than hand-rolled PAL prompts; same idea underneath. See agent tools.
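For example, with Anthropic's tool-use API the wiring might look roughly like this (the tool schema, model ID, and the run_generated_code helper from the earlier sketch are illustrative; OpenAI and Gemini expose the same shape under different names):

```python
import anthropic

client = anthropic.Anthropic()

run_code_tool = {
    "name": "run_code",
    "description": "Execute Python code in a sandbox and return its stdout.",
    "input_schema": {
        "type": "object",
        "properties": {"code": {"type": "string", "description": "Python source to run"}},
        "required": ["code"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID
    max_tokens=1024,
    tools=[run_code_tool],
    messages=[{"role": "user", "content": "What is 847 * 326 + 1729?"}],
)

# When the model decides computation is needed, it emits a tool_use block.
# Run that code in your sandbox, then send the result back as a tool_result message.
for block in response.content:
    if block.type == "tool_use" and block.name == "run_code":
        stdout = run_generated_code(block.input["code"])  # sandbox helper from the earlier sketch
```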
Result explanation as a second pass#
If users want a natural-language answer, run a second LLM call that takes the original question + the code + the code's output and produces a friendly explanation. The user sees prose; the computation underneath is exact.
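A sketch of that second pass, assuming a generic complete(prompt) function that wraps whichever chat API you use:

```python
def explain_result(question: str, code: str, output: str, complete) -> str:
    """Second LLM call: turn the exact computed result into a plain-language answer."""
    prompt = (
        "A user asked the question below. The Python code shown was generated to solve it "
        "and produced the given output. Write a short, plain-language answer for the user. "
        "Do not recompute anything; trust the output.\n\n"
        f"Question:\n{question}\n\nCode:\n{code}\n\nOutput:\n{output}"
    )
    return complete(prompt)
```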
Fallback for code that errors#
Sometimes generated code crashes (syntax error, runtime exception, infinite loop). Pattern: catch the error, feed it back to the model with a "the code above produced this error; please fix it" prompt, retry up to N times. After N, fall through to plain CoT or surface the error to the user.
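A sketch of that retry loop, assuming generate_code (an LLM call that returns bare Python source) and the run_generated_code sandbox helper from earlier:

```python
def solve_with_retries(problem: str, generate_code, max_attempts: int = 3) -> str:
    """Generate code, execute it, and on failure feed the error back for a fix."""
    prompt = problem
    for _ in range(max_attempts):
        code = generate_code(prompt)
        try:
            return run_generated_code(code)  # sandbox helper sketched earlier
        except Exception as err:
            # Hand the model its own code plus the error and ask for a corrected version.
            prompt = (
                f"{problem}\n\nYour previous code:\n{code}\n\n"
                f"It failed with:\n{err}\n\nFix the code. Output only the corrected code."
            )
    raise RuntimeError(f"PAL failed after {max_attempts} attempts")  # fall back to plain CoT or surface the error
```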
Pair with reasoning models#
Reasoning models (o1, o3, Claude with extended thinking, Gemini thinking) often produce cleaner, more correct code on the first try. PAL with a reasoning model is particularly powerful for complex multi-step computations — the reasoning model handles the planning, code handles the execution.
Common mistakes#
- Running generated code in your main process. Never. Sandbox always, even for "safe" prompts.
- No timeout on execution. A bad while True loop takes down your worker. Always cap execution time.
- Using PAL for non-computational tasks. Code can't write a poem. PAL is the wrong tool for subjective or pure-reasoning tasks.
- Trusting the code without checking the result. Generated code can have logic bugs that produce a confidently wrong number. For high-stakes tasks, run a verifier (different model, sanity check on the output magnitude, etc.).
- Not parsing the response strictly. Models sometimes wrap code in markdown fences or add explanation. Strip these; or use a stop-sequence based prompt that ends right after the code block.
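A sketch of strict parsing: grab the first fenced block if the reply contains one, otherwise treat the whole reply as code (the fence pattern is an assumption about typical model output, not a guarantee):

```python
import re

# Matches a triple-backtick fenced block (with or without a language tag) and captures its body.
FENCE_RE = re.compile(r"`{3}[^\n]*\n(.*?)`{3}", re.DOTALL)

def extract_code(reply: str) -> str:
    """Return the code portion of a model reply, stripping markdown fences and stray prose."""
    match = FENCE_RE.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()  # no fence found: assume the reply is already bare code
```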
Quick reference#
The 60-second summary
What it is: the model writes code that solves the problem; you run the code in a sandbox; the code's output is the answer.
Why it shines: arithmetic, logic, structured manipulation. Math the model botches becomes math Python solves correctly.
The non-negotiables: sandboxed execution, time limits, memory limits, no unrestricted network/filesystem access.
When to skip: pure reasoning, subjective tasks, latency-critical use without caching.
Modern shortcut: vendor tool-use APIs (OpenAI code interpreter, Anthropic tool use, Gemini code execution) implement PAL natively. Use those when you can.
What to read next#
For the broader pattern of LLMs invoking external tools, see ReAct and agent tools. For the alternative when computation isn't the bottleneck, see Chain-of-Thought. For per-model code-execution support, see the ChatGPT, Claude, and Gemini guides.