Why LLMs Cannot Do Math and Excel - In Case You Tried and Failed
llms cannot do math.
llms cannot do excel.
but llms can code.
LLMs are much better at writing Python than Excel formulas: 90%+ accuracy on Python and SQL, but only around 77% on spreadsheet formulas. Same models, same logic, a double-digit accuracy gap.
Why LLMs Cannot Do Math
LLMs don't calculate. They predict tokens.
When you ask "what's 247 × 183?" the model isn't multiplying. It's pattern-matching against similar problems in training data. Sometimes it gets lucky. Often it doesn't.
This isn't a bug. It's architecture. Transformers process language, not numbers. They can approximate arithmetic for simple cases, but anything requiring precision fails unpredictably.
The problem gets worse with larger numbers, decimals, and multi-step calculations. Each step introduces error. By the end, you're nowhere close.
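The contrast is easy to see in a few lines of Python: the arithmetic that token prediction gets wrong unpredictably is exact and repeatable once it runs as code.

```python
# Exact arithmetic that token prediction handles unreliably:
product = 247 * 183          # the multiplication from above
print(product)               # 45201

# Multi-step chains stay exact too -- no per-step drift:
total = sum(n * n for n in range(1, 101))  # 1^2 + 2^2 + ... + 100^2
print(total)                 # 338350
```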
Why LLMs Struggle with Excel
The problem isn't formulas. It's the spreadsheet interface.
Humans see a spreadsheet as a grid. Column B is next to Column A. Row 5 is below row 4. Headers define meaning.
LLMs see text: "B:C", "A2", "D:D". No grid. No spatial relationships. Just cell addresses.
When you ask "sum Q3 revenue by region," a human glances at the grid and knows which cells to reference. An LLM has to infer spatial relationships from address strings. That's like navigating a building with GPS coordinates instead of seeing the hallways.
NL2Formula is a research benchmark that tests how well AI models convert natural language questions into spreadsheet formulas. Researchers give models plain English requests like "calculate total sales for Q3" along with table descriptions, then measure whether the generated formula actually works. The research identified three failure modes:
- Wrong cell references. The LLM infers the wrong index from the table description.
- Spatial reasoning errors. It can't "see" that column C is next to column B.
- Multi-step breakdown. Complex nested formulas fail to execute.
The Silent Failure Problem
Excel formula errors don't crash. They return wrong numbers.
An LLM rarely writes =SUM(A1:A10 with a missing parenthesis, and when it does, the syntax error is caught immediately.
It writes =VLOOKUP(A2, B:C, 2, FALSE) when your data needed =INDEX(C:C, MATCH(A2, B:B, 0)). Syntactically perfect. Executes clean. Returns plausible results. Wrong results.
At 77% accuracy, roughly 1 in 4 generated Excel formulas is wrong, most often because it references the wrong cells. You won't know until you check manually.
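The same silent-failure shape is easy to reproduce outside Excel. Here is a toy lookup in plain Python (hypothetical product table, illustration only): an off-by-one column index executes cleanly and returns a plausible number that happens to be wrong.

```python
# Toy table: each row is (product, list_price, sale_price).
rows = [
    ("widget", 9.99, 7.99),
    ("gadget", 24.99, 19.99),
]

def lookup(name, col):
    """Return column `col` of the row whose first field matches `name`."""
    for row in rows:
        if row[0] == name:
            return row[col]

# Correct: the sale price of a gadget.
print(lookup("gadget", 2))   # 19.99

# Off-by-one column -- syntactically fine, runs without error,
# returns a plausible-looking number that is simply wrong:
print(lookup("gadget", 1))   # 24.99
```

No exception, no #REF! equivalent. Only a manual check against the data catches it.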
How to Use LLMs for Larger Mathematical Problems
LLMs can't compute, but they can write code that computes. That's the key insight.
Python, SQL, R. These languages work because they make relationships explicit. Variable names describe meaning. Operations chain logically. There's no spatial reasoning required.
```python
filtered = df[df['region'] == 'East']
revenue = filtered['amount'].sum()
```
The LLM doesn't need to know that "region" is in column A or "amount" is in column D. It references data by name, not by position. The code reads like instructions, not coordinates.
This is why the same model that fails at Excel can write working Python. It's not smarter. It's working in a format that matches how it processes language.
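That name-based access can be shown end to end without pandas or a spreadsheet at all. A minimal sketch with made-up toy records:

```python
# Toy records -- column order is irrelevant because fields have names.
records = [
    {"region": "East", "amount": 1200},
    {"region": "West", "amount": 800},
    {"region": "East", "amount": 450},
]

# Filter and aggregate by meaning, not by cell address:
revenue = sum(r["amount"] for r in records if r["region"] == "East")
print(revenue)  # 1650
```

Nothing here requires knowing which "column" comes first. That is exactly the spatial reasoning a spreadsheet formula demands and this code does not.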
The Research: Code Beats Pure Reasoning
The evidence is clear. When LLMs write code instead of trying to calculate directly, accuracy jumps significantly.
The MATH benchmark tests competition-level mathematical reasoning. Models struggled at around 10% accuracy when reasoning alone. With code execution, they hit 50% or higher. Same problems, same models, 5x improvement.
GSM8K covers grade school math word problems. Models reach 80% accuracy through pure reasoning. When they write and execute Python code, they exceed 90%. A 10-point gain just by switching from "think through it" to "write code for it."
PAL (Program-Aided Language models) and PoT (Program of Thought) research formalized this approach. Instead of asking LLMs to reason step-by-step in natural language, have them write executable code. The code handles computation. The LLM handles translation.
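The PAL/PoT loop can be sketched in a few lines. Here `ask_llm` is a hypothetical stand-in for any model API call, and the returned program is hard-coded to show the kind of code a PAL-style prompt elicits for a GSM8K-style word problem.

```python
def ask_llm(prompt: str) -> str:
    # Hypothetical: a real system would call a model API here.
    # We hard-code the sort of program a PAL-style prompt produces.
    return (
        "eggs_per_day = 16\n"
        "eaten = 3\n"
        "baked = 4\n"
        "price = 2\n"
        "answer = (eggs_per_day - eaten - baked) * price\n"
    )

problem = ("A duck lays 16 eggs per day. The owner eats 3 and bakes 4 "
           "into muffins, then sells the rest at $2 each. Daily revenue?")

namespace = {}
exec(ask_llm(problem), namespace)   # the computation happens in Python
print(namespace["answer"])          # 18
```

The model only translates English into a program; the interpreter does the arithmetic. (In production you would sandbox the `exec`, of course.)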
What This Means
The pattern is consistent across benchmarks: LLMs score 10 to 20 points higher on math when they write code than when they reason directly, and on harder benchmarks the gap widens to several-fold.
This isn't surprising. LLMs are language models trained on code. They've seen millions of examples of Python solving math problems. They haven't seen neurons doing multiplication.
The practical implication: stop asking LLMs to calculate. Ask them to write programs that calculate.