Program-Aided Language Models (PAL): make the model write code instead of doing math
PAL has the LLM produce executable code (typically Python) that solves the problem, then runs that code deterministically. The result: arithmetic and logic that's correct by construction instead of guessed by the model.
Ask GPT-4 to compute 847 × 326 + 1729 in plain prose. It'll guess. Sometimes correctly, sometimes not, often confidently wrong. Models are not calculators; they're probabilistic next-token predictors. Math that compounds across many digits is exactly where they fail in subtle, hard-to-catch ways.
Now ask the same question with a different framing: "Write Python that computes 847 × 326 + 1729 and prints the result." The model writes three lines of correct code. Run that code. Get the answer. Always correct, always verifiable, always reproducible.
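Concretely, the generated code for that reframed prompt might look like this (illustrative; the exact output varies by model):

```python
# Illustrative example of what the model tends to produce for the reframed prompt
result = 847 * 326 + 1729
print(result)
```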
The whole idea in one line
That's Program-Aided Language Models (PAL): have the LLM produce executable code that solves the problem, then run the code deterministically. Math the model would have botched becomes math Python solves correctly — by construction.
The mental model: offload computation to deterministic tools#
Models excel at translation — natural language → some other representation. They struggle at execution — actually computing results that require precise arithmetic or state tracking.
PAL plays to the strength and avoids the weakness. The model translates the natural-language problem into code (its strength). A Python interpreter executes that code (the interpreter's strength: exact, deterministic computation). The two together solve the problem far more reliably than either alone.
It's the same principle behind tool use more broadly — LLMs as orchestrators, not computers. PAL is the specific case where the "tool" is a code interpreter.
The basic PAL pattern#
- Pose the problem. Natural language input from the user or upstream task.
- Generate code. Ask the model to write executable code that solves it. Include constraints (which library, which language, output format).
- Execute in a sandbox. Run the code in an isolated environment with limited time and no network/file-system access.
- Use the output. The code's stdout (or returned value) is the answer. Either return it directly to the user or feed it back to the LLM for natural-language formatting.
Solve the problem below by writing Python code.
Rules:
- The code must be self-contained — no imports beyond Python's standard library.
- Print the final answer to stdout.
- Do not include explanatory prose, only the code.
Problem:
"""
{{problem}}
"""
Python:

Your application code parses the response (it should be just code), runs it in a sandbox (e.g., a sub-process or a hosted code-execution API), and captures stdout. That's your answer.
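A minimal sketch of that execute-and-capture step, assuming the reply has already been reduced to bare Python source (the helper name is illustrative, and a bare sub-process is the simplest option, not real isolation — see the sandboxing section below):

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    """Run untrusted generated code in a separate Python process and return its stdout.

    A sketch only: a plain sub-process is NOT real isolation. In production, wrap
    this in a container or use a hosted code-execution API.
    """
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site-packages
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # hard cap on wall-clock time
    )
    if result.returncode != 0:
        raise RuntimeError(f"Generated code failed:\n{result.stderr}")
    return result.stdout.strip()

# answer = run_generated_code("print(847 * 326 + 1729)")
```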
Why PAL beats CoT for computational tasks#
Chain-of-Thought exposes reasoning in tokens. That works for logic and planning, where the model's limitation is single-pass attention, not arithmetic. For arithmetic specifically:
- CoT is unreliable. The model can correctly state "I need to multiply 847 by 326" and then produce the wrong number. Reasoning correctly about arithmetic doesn't make the arithmetic correct.
- PAL is exact. Once the code is correct, the result is correct. No probabilistic sampling, no accumulating errors.
- PAL is auditable. If something's wrong, you can read the generated code and find the bug. A wrong CoT answer requires you to re-derive the correct one to find the error.
Sandboxing is non-negotiable#
Generated code is untrusted. Even when you're sure the prompt produces deterministic, safe code, never run it in an environment that can:
- Access your network or file system
- Run for unlimited time
- Use unlimited memory
- Import arbitrary libraries
Production patterns:
- Hosted code-execution APIs. OpenAI's code interpreter, Riza, E2B, Replit's API. They run the code in isolated containers with sane defaults.
- WebAssembly sandboxes. Pyodide for in-browser Python execution — useful for client-side PAL.
- Sub-process with restrictions. Docker container, restricted user, no network, a time limit, a memory limit. Works but takes more setup; a sketch follows the callout below.
Never eval() generated code in your main process
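A minimal sketch of that sub-process-with-restrictions option, shelling out to Docker from Python (the image tag, resource limits, and helper name are illustrative assumptions, not a vetted configuration):

```python
import subprocess

def run_in_docker(code: str, timeout_s: float = 10.0) -> str:
    """Run generated code in a throwaway container with no network and capped resources."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",     # no network access
            "--memory", "256m",      # memory cap
            "--cpus", "1",           # CPU cap
            "--read-only",           # read-only root filesystem
            "python:3.12-slim",      # illustrative image; pin whatever you actually vet
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,           # wall-clock cap enforced by the parent process
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout.strip()
```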
When PAL pays off#
PAL vs. alternatives
| If your situation is… | Reach for… | Why |
|---|---|---|
| Multi-step arithmetic, financial calculation | PAL | Exact computation; the model can't botch the math |
| String manipulation, data parsing | PAL | Code is more reliable than prose for transformations |
| Date/time arithmetic | PAL | datetime library handles edge cases the model gets wrong |
| Combinatorics, simulation, game-state evaluation | PAL | Code can iterate; prose CoT can't hold deep state |
| Pure reasoning (no computation) | CoT, not PAL | Code is overhead when the bottleneck is reasoning, not arithmetic |
| Subjective writing tasks | No PAL | No code can write a paragraph for you |
| Latency-critical interactive use | PAL with caching, or skip | Code execution adds 200ms+; cache identical problems |
Going further: production patterns#
Modern tool-use APIs implement PAL natively#
OpenAI's code interpreter, Anthropic's tool use, and Gemini's code execution all let the model invoke a code-running tool as part of normal tool-use flow. You define a run_code tool; the model calls it when it decides math is the right approach. Cleaner than hand-rolled PAL prompts; same idea underneath. See agent tools.
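For example, with Anthropic's tool-use API the wiring might look roughly like this (the tool schema, model ID, and the run_generated_code helper from the earlier sketch are illustrative; OpenAI and Gemini expose the same shape under different names):

```python
import anthropic

client = anthropic.Anthropic()

run_code_tool = {
    "name": "run_code",
    "description": "Execute Python code in a sandbox and return its stdout.",
    "input_schema": {
        "type": "object",
        "properties": {"code": {"type": "string", "description": "Python source to run"}},
        "required": ["code"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID
    max_tokens=1024,
    tools=[run_code_tool],
    messages=[{"role": "user", "content": "What is 847 * 326 + 1729?"}],
)

# When the model decides computation is needed, it emits a tool_use block.
# Run that code in your sandbox, then send the result back as a tool_result message.
for block in response.content:
    if block.type == "tool_use" and block.name == "run_code":
        stdout = run_generated_code(block.input["code"])  # sandbox helper from the earlier sketch
```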
Result explanation as a second pass#
If users want a natural-language answer, run a second LLM call that takes the original question + the code + the code's output and produces a friendly explanation. The user sees prose; the computation underneath is exact.
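A sketch of that second pass, assuming a generic complete(prompt) function that wraps whichever chat API you use:

```python
def explain_result(question: str, code: str, output: str, complete) -> str:
    """Second LLM call: turn the exact computed result into a plain-language answer."""
    prompt = (
        "A user asked the question below. The Python code shown was generated to solve it "
        "and produced the given output. Write a short, plain-language answer for the user. "
        "Do not recompute anything; trust the output.\n\n"
        f"Question:\n{question}\n\nCode:\n{code}\n\nOutput:\n{output}"
    )
    return complete(prompt)
```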
Fallback for code that errors#
Sometimes generated code crashes (syntax error, runtime exception, infinite loop). Pattern: catch the error, feed it back to the model with a "the code above produced this error; please fix it" prompt, retry up to N times. After N, fall through to plain CoT or surface the error to the user.
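A sketch of that retry loop, assuming generate_code (an LLM call that returns bare Python source) and the run_generated_code sandbox helper from earlier:

```python
def solve_with_retries(problem: str, generate_code, max_attempts: int = 3) -> str:
    """Generate code, execute it, and on failure feed the error back for a fix."""
    prompt = problem
    for _ in range(max_attempts):
        code = generate_code(prompt)
        try:
            return run_generated_code(code)  # sandbox helper sketched earlier
        except Exception as err:
            # Hand the model its own code plus the error and ask for a corrected version.
            prompt = (
                f"{problem}\n\nYour previous code:\n{code}\n\n"
                f"It failed with:\n{err}\n\nFix the code. Output only the corrected code."
            )
    raise RuntimeError(f"PAL failed after {max_attempts} attempts")  # fall back to plain CoT or surface the error
```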
Pair with reasoning models#
Reasoning models (o1, o3, Claude with extended thinking, Gemini thinking) often produce cleaner, more correct code on the first try. PAL with a reasoning model is particularly powerful for complex multi-step computations — the reasoning model handles the planning, code handles the execution.
Common mistakes#
- Running generated code in your main process. Never. Sandbox always, even for "safe" prompts.
- No timeout on execution. A bad while True loop takes down your worker. Always cap execution time.
- Using PAL for non-computational tasks. Code can't write a poem. PAL is the wrong tool for subjective or pure-reasoning tasks.
- Trusting the code without checking the result. Generated code can have logic bugs that produce a confidently wrong number. For high-stakes tasks, run a verifier (different model, sanity check on the output magnitude, etc.).
- Not parsing the response strictly. Models sometimes wrap code in markdown fences or add explanation. Strip these; or use a stop-sequence based prompt that ends right after the code block.
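A sketch of strict parsing: grab the first fenced block if the reply contains one, otherwise treat the whole reply as code (the fence pattern is an assumption about typical model output, not a guarantee):

```python
import re

# Matches a triple-backtick fenced block (with or without a language tag) and captures its body.
FENCE_RE = re.compile(r"`{3}[^\n]*\n(.*?)`{3}", re.DOTALL)

def extract_code(reply: str) -> str:
    """Return the code portion of a model reply, stripping markdown fences and stray prose."""
    match = FENCE_RE.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()  # no fence found: assume the reply is already bare code
```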
Quick reference#
The 60-second summary
What it is: the model writes code that solves the problem; you run the code in a sandbox; the code's output is the answer.
Why it shines: arithmetic, logic, structured manipulation. Math the model botches becomes math Python solves correctly.
The non-negotiables: sandboxed execution, time limits, memory limits, no unrestricted network/filesystem access.
When to skip: pure reasoning, subjective tasks, latency-critical use without caching.
Modern shortcut: vendor tool-use APIs (OpenAI code interpreter, Anthropic tool use, Gemini code execution) implement PAL natively. Use those when you can.
What to read next#
For the broader pattern of LLMs invoking external tools, see ReAct and agent tools. For the alternative when computation isn't the bottleneck, see Chain-of-Thought. For per-model code-execution support, see the ChatGPT, Claude, and Gemini guides.