Second, chain-of-thought prompting has larger
performance gains for more-complicated prob-
lems. For instance, for GSM8K (the dataset
with the lowest baseline performance), perfor-
mance more than doubled for the largest GPT
and PaLM models. On the other hand, for Sin-
gleOp, the easiest subset of MAWPS which only
requires a single step to solve, performance im-
provements were either negative or very small
(see Appendix Table 3).
Third, chain-of-thought prompting via GPT-3
175B and PaLM 540B compares favorably to
prior state of the art, which typically finetunes a
task-specific model on a labeled training dataset.
Figure 4 shows how PaLM 540B uses chain-of-
thought prompting to achieve new state of the art
on GSM8K, SVAMP, and MAWPS (though note
that standard prompting already passed the prior
best for SVAMP). On the other two datasets,
AQuA and ASDiv, PaLM with chain-of-thought
prompting reaches within 2% of the state of the
art (Appendix Table 2).
To better understand why chain-of-thought
prompting works, we manually examined model-
generated chains of thought by LaMDA 137B
for GSM8K. Of 50 random examples where the
model returned the correct final answer, all of
the generated chains of thought were also log-
ically and mathematically correct except two
that coincidentally arrived at the correct answer
(see Appendix D.1, and Table 8 for examples
of correct model-generated chains of thought).
We also randomly examined 50 random sam-
ples for which the model gave the wrong answer.
The summary of this analysis is that 46% of the
chains of thought were almost correct, barring
minor mistakes (calculator error, symbol map-
ping error, or one reasoning step missing), and that the other 54% of the chains of thought had major
errors in semantic understanding or coherence (see Appendix D.2). To provide a small insight into
why scaling improves chain-of-thought reasoning ability, we performed a similar analysis of errors
made by PaLM 62B and whether those errors were fixed by scaling to PaLM 540B. The summary
is that scaling PaLM to 540B fixes a large portion of one-step missing and semantic understanding
errors in the 62B model (see Appendix A.1).
3.3 Ablation Study
The observed benefits of using chain-of-thought prompting raises the natural question of whether the
same performance improvements can be conferred via other types of prompting. Figure 5 shows an
ablation study with three variations of chain of thought described below.
Equation only. One reason for why chain-of-thought prompting might help is that it produces the
mathematical equation to be evaluated, and so we test a variation where the model is prompted
to output only a mathematical equation before giving the answer. Figure 5 shows that equation
only prompting does not help much for GSM8K, which implies that the semantics of the questions
in GSM8K are too challenging to directly translate into an equation without the natural language
reasoning steps in chain of thought. For datasets of one-step or two-step problems, however, we find
that equation only prompting does improve performance, since the equation can be easily derived
from the question (see Appendix Table 6).
5