Why LLMs Cannot Do Math and Excel - In Case You Tried and Failed
llms cannot do math.
llms cannot do excel.
but llms can code.
LLMs are much better at writing Python than Excel formulas: 90%+ accuracy on Python and SQL, but only around 77% on spreadsheet formulas. Same models, same logic, a double-digit accuracy gap.
Why LLMs Cannot Do Math
LLMs don't calculate. They predict tokens.
When you ask "what's 247 × 183?" the model isn't multiplying. It's pattern-matching against similar problems in training data. Sometimes it gets lucky. Often it doesn't.
This isn't a bug. It's architecture. Transformers process language, not numbers. They can approximate arithmetic for simple cases, but anything requiring precision fails unpredictably.
The problem gets worse with larger numbers, decimals, and multi-step calculations. Each step introduces error. By the end, you're nowhere close.
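The contrast is easy to see in a few lines of Python: the arithmetic that token prediction gets wrong unpredictably is exact and repeatable once it runs as code.

```python
# Exact arithmetic that token prediction handles unreliably:
product = 247 * 183          # the multiplication from above
print(product)               # 45201

# Multi-step chains stay exact too -- no per-step drift:
total = sum(n * n for n in range(1, 101))  # 1^2 + 2^2 + ... + 100^2
print(total)                 # 338350
```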
Why LLMs Struggle with Excel
The problem isn't formulas. It's the spreadsheet interface.
Humans see a spreadsheet as a grid. Column B is next to Column A. Row 5 is below row 4. Headers define meaning.
LLMs see text: "B:C", "A2", "D:D". No grid. No spatial relationships. Just cell addresses.
When you ask "sum Q3 revenue by region," a human glances at the grid and knows which cells to reference. An LLM has to infer spatial relationships from address strings. That's like navigating a building with GPS coordinates instead of seeing the hallways.
NL2Formula is a research benchmark that tests how well AI models convert natural language questions into spreadsheet formulas. Researchers give models plain English requests like "calculate total sales for Q3" along with table descriptions, then measure whether the generated formula actually works. The research identified three failure modes:
- Wrong cell references. The LLM infers the wrong index from the table description.
- Spatial reasoning errors. It can't "see" that column C is next to column B.
- Multi-step breakdown. Complex nested formulas fail to execute.
The Silent Failure Problem
Excel formula errors don't crash. They return wrong numbers.
An LLM rarely writes =SUM(A1:A10 with a missing parenthesis, and when it does, the syntax error is caught immediately.
It writes =VLOOKUP(A2, B:C, 2, FALSE) when your data needed =INDEX(C:C, MATCH(A2, B:B, 0)). Syntactically perfect. Executes clean. Returns plausible results. Wrong results.
At 77% accuracy, roughly 1 in 4 generated Excel formulas is wrong, most often because it references the wrong cells. You won't know until you check manually.
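The same silent-failure shape is easy to reproduce outside Excel. Here is a toy lookup in plain Python (hypothetical product table, illustration only): an off-by-one column index executes cleanly and returns a plausible number that happens to be wrong.

```python
# Toy table: each row is (product, list_price, sale_price).
rows = [
    ("widget", 9.99, 7.99),
    ("gadget", 24.99, 19.99),
]

def lookup(name, col):
    """Return column `col` of the row whose first field matches `name`."""
    for row in rows:
        if row[0] == name:
            return row[col]

# Correct: the sale price of a gadget.
print(lookup("gadget", 2))   # 19.99

# Off-by-one column -- syntactically fine, runs without error,
# returns a plausible-looking number that is simply wrong:
print(lookup("gadget", 1))   # 24.99
```

No exception, no #REF! equivalent. Only a manual check against the data catches it.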
How to Use LLMs for Larger Mathematical Problems
LLMs can't compute, but they can write code that computes. That's the key insight.
Python, SQL, R. These languages work because they make relationships explicit. Variable names describe meaning. Operations chain logically. There's no spatial reasoning required.
```python
filtered = df[df['region'] == 'East']
revenue = filtered['amount'].sum()
```
The LLM doesn't need to know that "region" is in column A or "amount" is in column D. It references data by name, not by position. The code reads like instructions, not coordinates.
This is why the same model that fails at Excel can write working Python. It's not smarter. It's working in a format that matches how it processes language.
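That name-based access can be shown end to end without pandas or a spreadsheet at all. A minimal sketch with made-up toy records:

```python
# Toy records -- column order is irrelevant because fields have names.
records = [
    {"region": "East", "amount": 1200},
    {"region": "West", "amount": 800},
    {"region": "East", "amount": 450},
]

# Filter and aggregate by meaning, not by cell address:
revenue = sum(r["amount"] for r in records if r["region"] == "East")
print(revenue)  # 1650
```

Nothing here requires knowing which "column" comes first. That is exactly the spatial reasoning a spreadsheet formula demands and this code does not.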
The Research: Code Beats Pure Reasoning
The evidence is clear. When LLMs write code instead of trying to calculate directly, accuracy jumps significantly.
The MATH benchmark tests competition-level mathematical reasoning. Models struggled at around 10% accuracy when reasoning alone. With code execution, they hit 50% or higher. Same problems, same models, 5x improvement.
GSM8K covers grade school math word problems. Models reach 80% accuracy through pure reasoning. When they write and execute Python code, they exceed 90%. A 10-point gain just by switching from "think through it" to "write code for it."
PAL (Program-Aided Language models) and PoT (Program of Thought) research formalized this approach. Instead of asking LLMs to reason step-by-step in natural language, have them write executable code. The code handles computation. The LLM handles translation.
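The PAL/PoT loop can be sketched in a few lines. Here `ask_llm` is a hypothetical stand-in for any model API call, and the returned program is hard-coded to show the kind of code a PAL-style prompt elicits for a GSM8K-style word problem.

```python
def ask_llm(prompt: str) -> str:
    # Hypothetical: a real system would call a model API here.
    # We hard-code the sort of program a PAL-style prompt produces.
    return (
        "eggs_per_day = 16\n"
        "eaten = 3\n"
        "baked = 4\n"
        "price = 2\n"
        "answer = (eggs_per_day - eaten - baked) * price\n"
    )

problem = ("A duck lays 16 eggs per day. The owner eats 3 and bakes 4 "
           "into muffins, then sells the rest at $2 each. Daily revenue?")

namespace = {}
exec(ask_llm(problem), namespace)   # the computation happens in Python
print(namespace["answer"])          # 18
```

The model only translates English into a program; the interpreter does the arithmetic. (In production you would sandbox the `exec`, of course.)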
What This Means
The pattern is consistent across benchmarks: LLMs score 10 to 20 points higher on math when they write code than when they reason directly, and on harder benchmarks the gap widens to several-fold.
This isn't surprising. LLMs are language models trained on code. They've seen millions of examples of Python solving math problems. They haven't seen neurons doing multiplication.
The practical implication: stop asking LLMs to calculate. Ask them to write programs that calculate.