MONOGRAPHS ON
STATISTICS AND APPLIED PROBABILITY
General Editors
D.R. Cox, D.V. Hinkley, N. Reid, D.B. Rubin and B.W. Silverman
1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-free Statistical Methods J.S. Maritz (1981)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikainen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Compositional Data J. Aitchison (1986)
26 Density Estimation for Statistical and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Method, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.-T. Fang, S. Kotz and K. Ng (1989)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic Designs J.A. John (1987)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1990)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation: A State-Space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Chaos and Networks: Statistical and Probabilistic Aspects Edited by O. Barndorff-Nielsen et al. (1993)
51 Number Theoretic Methods in Statistics K.-T. Fang and W. Yuan (1993)
52 Inference and Asymptotics O. Barndorff-Nielsen and D.R. Cox (1993)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikainen and M. Pesonen (1993)
54 Statistical Concepts and Applications in Medicine J. Aitchison and I.J. Lauder (1994)
55 Predictive Inference S. Geisser (1993)
56 Model-Free Curve Estimation M. Tarter and M. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R. Tibshirani (1993)
(Full details concerning this series are available from the Publishers.)
(Full details concerning this series are available from the Publishers.)

An Introduction to the Bootstrap
Bradley Efron
Department of Statistics
Stanford University
and
Robert J. Tibshirani
Department of Preventative Medicine and Biostatistics
and Department of Statistics, University of Toronto
CHAPMAN & HALL/CRC
Boca Raton London New York Washington, D.C.

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
Milton Park, Abingdon
Oxon OX14 4RN
© 1994 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
25 24 23 22 21 20 19 18 17 16 15 14 13
International Standard Book Number-13: 978-0-412-04231-7 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have
been made to publish reliable data and information, but the author and publisher cannot assume responsibility
for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if
permission to publish in this form has not been obtained. If any copyright material has not been acknowledged
please write and let us know so we may rectify it in any future reprint.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration
for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate
system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only
for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Efron, Bradley.
An introduction to the bootstrap/Brad Efron, Rob Tibshirani.
p. cm.
Includes bibliographical references and index.
ISBN 0-412-04231-2
1. Bootstrap (Statistics). I. Tibshirani, Robert. II. Title.
QA276.8.E3745 1993
519.5’44— dc20
93-4489
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

TO CHERYL, CHARLIE, RYAN AND JULIE
AND TO THE MEMORY OF
RUPERT G. MILLER, JR.

Contents
Preface
1 Introduction
1.1 An overview of this book
1.2 Information for instructors
1.3 Some of the notation used in the book
2 The accuracy of a sample mean
2.1 Problems
3 Random samples and probabilities
3.1 Introduction
3.2 Random samples
3.3 Probability theory
3.4 Problems
4 The empirical distribution function and the plug-in principle
4.1 Introduction
4.2 The empirical distribution function
4.3 The plug-in principle
4.4 Problems
5 Standard errors and estimated standard errors
5.1 Introduction
5.2 The standard error of a mean
5.3 Estimating the standard error of the mean
5.4 Problems
6 The bootstrap estimate of standard error
6.1 Introduction
6.2 The bootstrap estimate of standard error
6.3 Example: the correlation coefficient
6.4 The number of bootstrap replications B
6.5 The parametric bootstrap
6.6 Bibliographic notes
6.7 Problems
7 Bootstrap standard errors: some examples
7.1 Introduction
7.2 Example 1: test score data
7.3 Example 2: curve fitting
7.4 An example of bootstrap failure
7.5 Bibliographic notes
7.6 Problems
8 More complicated data structures
8.1 Introduction
8.2 One-sample problems
8.3 The two-sample problem
8.4 More general data structures
8.5 Example: lutenizing hormone
8.6 The moving blocks bootstrap
8.7 Bibliographic notes
8.8 Problems
9 Regression models
9.1 Introduction
9.2 The linear regression model
9.3 Example: the hormone data
9.4 Application of the bootstrap
9.5 Bootstrapping pairs vs bootstrapping residuals
9.6 Example: the cell survival data
9.7 Least median of squares
9.8 Bibliographic notes
9.9 Problems
10 Estimates of bias
10.1 Introduction
10.2 The bootstrap estimate of bias
10.3 Example: the patch data
10.4 An improved estimate of bias
10.5 The jackknife estimate of bias
10.6 Bias correction
10.7 Bibliographic notes
10.8 Problems
11 The jackknife
11.1 Introduction
11.2 Definition of the jackknife
11.3 Example: test score data
11.4 Pseudo-values
11.5 Relationship between the jackknife and bootstrap
11.6 Failure of the jackknife
11.7 The delete-d jackknife
11.8 Bibliographic notes
11.9 Problems
12 Confidence intervals based on bootstrap “tables”
12.1 Introduction
12.2 Some background on confidence intervals
12.3 Relation between confidence intervals and hypothesis tests
12.4 Student’s t interval
12.5 The bootstrap-t interval
12.6 Transformations and the bootstrap-t
12.7 Bibliographic notes
12.8 Problems
13 Confidence intervals based on bootstrap percentiles
13.1 Introduction
13.2 Standard normal intervals
13.3 The percentile interval
13.4 Is the percentile interval backwards?
13.5 Coverage performance
13.6 The transformation-respecting property
13.7 The range-preserving property
13.8 Discussion
13.9 Bibliographic notes
13.10 Problems
14 Better bootstrap confidence intervals
14.1 Introduction
14.2 Example: the spatial test data
14.3 The BCa method
14.4 The ABC method
14.5 Example: the tooth data
14.6 Bibliographic notes
14.7 Problems
15 Permutation tests
15.1 Introduction
15.2 The two-sample problem
15.3 Other test statistics
15.4 Relationship of hypothesis tests to confidence intervals and the bootstrap
15.5 Bibliographic notes
15.6 Problems
16 Hypothesis testing with the bootstrap
16.1 Introduction
16.2 The two-sample problem
16.3 Relationship between the permutation test and the bootstrap
16.4 The one-sample problem
16.5 Testing multimodality of a population
16.6 Discussion
16.7 Bibliographic notes
16.8 Problems
17 Cross-validation and other estimates of prediction error
17.1 Introduction
17.2 Example: hormone data
17.3 Cross-validation
17.4 Cp and other estimates of prediction error
17.5 Example: classification trees
17.6 Bootstrap estimates of prediction error
17.6.1 Overview
17.6.2 Some details
17.7 The .632 bootstrap estimator
17.8 Discussion
17.9 Bibliographic notes
17.10 Problems
18 Adaptive estimation and calibration
18.1 Introduction
18.2 Example: smoothing parameter selection for curve fitting
18.3 Example: calibration of a confidence point
18.4 Some general considerations
18.5 Bibliographic notes
18.6 Problems
19 Assessing the error in bootstrap estimates
19.1 Introduction
19.2 Standard error estimation
19.3 Percentile estimation
19.4 The jackknife-after-bootstrap
19.5 Derivations
19.6 Bibliographic notes
19.7 Problems
20 A geometrical representation for the bootstrap and jackknife
20.1 Introduction
20.2 Bootstrap sampling
20.3 The jackknife as an approximation to the bootstrap
20.4 Other jackknife approximations
20.5 Estimates of bias
20.6 An example
20.7 Bibliographic notes
20.8 Problems
21 An overview of nonparametric and parametric inference
21.1 Introduction
21.2 Distributions, densities and likelihood functions
21.3 Functional statistics and influence functions
21.4 Parametric maximum likelihood inference
21.5 The parametric bootstrap
21.6 Relation of parametric maximum likelihood, bootstrap and jackknife approaches
21.6.1 Example: influence components for the mean
21.7 The empirical cdf as a maximum likelihood estimate
21.8 The sandwich estimator
21.8.1 Example: Mouse data
21.9 The delta method
21.9.1 Example: delta method for the mean
21.9.2 Example: delta method for the correlation coefficient
21.10 Relationship between the delta method and infinitesimal jackknife
21.11 Exponential families
21.12 Bibliographic notes
21.13 Problems
22 Further topics in bootstrap confidence intervals
22.1 Introduction
22.2 Correctness and accuracy
22.3 Confidence points based on approximate pivots
22.4 The BCa interval
22.5 The underlying basis for the BCa interval
22.6 The ABC approximation
22.7 Least favorable families
22.8 The ABCq method and transformations
22.9 Discussion
22.10 Bibliographic notes
22.11 Problems
23 Efficient bootstrap computations
23.1 Introduction
23.2 Post-sampling adjustments
23.3 Application to bootstrap bias estimation
23.4 Application to bootstrap variance estimation
23.5 Pre- and post-sampling adjustments
23.6 Importance sampling for tail probabilities
23.7 Application to bootstrap tail probabilities
23.8 Bibliographic notes
23.9 Problems
24 Approximate likelihoods
24.1 Introduction
24.2 Empirical likelihood
24.3 Approximate pivot methods
24.4 Bootstrap partial likelihood
24.5 Implied likelihood
24.6 Discussion
24.7 Bibliographic notes
24.8 Problems
25 Bootstrap bioequivalence
25.1 Introduction
25.2 A bioequivalence problem
25.3 Bootstrap confidence intervals
25.4 Bootstrap power calculations
25.5 A more careful power calculation
25.6 Fieller’s intervals
25.7 Bibliographic notes
25.8 Problems
26 Discussion and further topics
26.1 Discussion
26.2 Some questions about the bootstrap
26.3 References on further topics
Appendix: software for bootstrap computations
Introduction
Some available software
S language functions
References
Author index
Subject index

Preface
Dear friend, theory is all gray,
and the golden tree of life is green.
Goethe, from “Faust”
The ability to simplify means to eliminate the unnecessary so that
the necessary may speak.
Hans Hoffmann
Statistics is a subject of amazingly many uses and surprisingly
few effective practitioners. The traditional road to statistical
knowledge is blocked, for most, by a formidable wall of mathematics.
Our approach here avoids that wall. The bootstrap is a computer-based
method of statistical inference that can answer many real
statistical questions without formulas. Our goal in this book is to
arm scientists and engineers, as well as statisticians, with
computational techniques that they can use to analyze and understand
complicated data sets.
The word “understand” is an important one in the previous
sentence. This is not a statistical cookbook. We aim to give the reader
a good intuitive understanding of statistical inference.
One of the charms of the bootstrap is the direct appreciation it
gives of variance, bias, coverage, and other probabilistic
phenomena. What does it mean that a confidence interval contains the
true value with probability .90? The usual textbook answer
appears formidably abstract to most beginning students. Bootstrap
confidence intervals are directly constructed from real data sets,
using a simple computer algorithm. This doesn’t necessarily make
it easy to understand confidence intervals, but at least the
difficulties are the appropriate conceptual ones, and not mathematical
muddles.

Much of the exposition in our book is based on the analysis of
real data sets. The mouse data, the stamp data, the tooth data,
the hormone data, and other small but genuine examples, are an
important part of the presentation. These are especially valuable if
the reader can try his own computations on them. Personal
computers are sufficient to handle most bootstrap computations for
these small data sets.
This book does not give a rigorous technical treatment of the
bootstrap, and we concentrate on the ideas rather than their
mathematical justification. Many of these ideas are quite sophisticated,
however, and this book is not just for beginners. The
presentation starts off slowly but builds in both its scope and depth. More
mathematically advanced accounts of the bootstrap may be found
in papers and books by many researchers that are listed in the
Bibliographic notes at the end of the chapters.
We would like to thank Andreas Buja, Anthony Davison, Peter
Hall, Trevor Hastie, John Rice, Bernard Silverman, James Stafford
and Sami Tibshirani for making very helpful comments and
suggestions on the manuscript. We especially thank Timothy Hesterberg
and Cliff Lunneborg for the great deal of time and effort that they
spent on reading and preparing comments. Thanks to Maria-Luisa
Gardner for providing expert advice on the “rules of punctuation.”
We would also like to thank numerous students at both Stanford
University and the University of Toronto for pointing out errors
in earlier drafts, and colleagues and staff at our universities for
their support. Thanks to Tom Glinos of the University of Toronto
for maintaining a healthy computing environment. Karola DeCleve
typed much of the first draft of this book, and maintained
vigilance against errors during its entire history. All of this was done
cheerfully and in a most helpful manner, for which we are truly
grateful. Trevor Hastie provided expert “S” and TeX advice, at
crucial stages in the project.
We were lucky to have not one but two superb editors working
on this project. Bea Schube got us going, before starting her
retirement; Bea has done a great deal for the statistics profession
and we wish her all the best. John Kimmel carried the ball after
Bea left, and did an excellent job. We thank our copy-editor Jim
Gerónimo for his thorough correction of the manuscript, and take
responsibility for any errors that remain.
The first author was supported by the National Institutes of
Health and the National Science Foundation. Both groups have
supported the development of statistical theory at Stanford,
including much of the theory behind this book. The second author
would like to thank his wife Cheryl for her understanding and
support during this entire project, and his parents for a lifetime
of encouragement. He gratefully acknowledges the support of the
Natural Sciences and Engineering Research Council of Canada.
Palo Alto and Toronto
June 1993
Bradley Efron
Robert Tibshirani

CHAPTER 1
Introduction
Statistics is the science of learning from experience, especially
experience that arrives a little bit at a time. The earliest information
science was statistics, originating in about 1650. This century has
seen statistical techniques become the analytic methods of choice
in biomedical science, psychology, education, economics,
communications theory, sociology, genetic studies, epidemiology, and other
areas. Recently, traditional sciences like geology, physics, and
astronomy have begun to make increasing use of statistical methods
as they focus on areas that demand informational efficiency, such as
the study of rare and exotic particles or extremely distant galaxies.
Most people are not natural-born statisticians. Left to our own
devices we are not very good at picking out patterns from a sea
of noisy data. To put it another way, we are all too good at
picking out non-existent patterns that happen to suit our purposes.
Statistical theory attacks the problem from both ends. It provides
optimal methods for finding a real signal in a noisy background,
and also provides strict checks against the overinterpretation of
random patterns.
Statistical theory attempts to answer three basic questions:
(1) How should I collect my data?
(2) How should I analyze and summarize the data that I’ve
collected?
(3) How accurate are my data summaries?
Question 3 constitutes part of the process known as statistical
inference. The bootstrap is a recently developed technique for making
certain kinds of statistical inferences. It is only recently developed
because it requires modern computer power to simplify the often
intricate calculations of traditional statistical theory.
The explanations that we will give for the bootstrap, and other
computer-based methods, involve explanations of traditional ideas
in statistical inference. The basic ideas of statistics haven’t changed,
but their implementation has. The modern computer lets us
apply these ideas flexibly, quickly, easily, and with a minimum of
mathematical assumptions. Our primary purpose in the book is to
explain when and why bootstrap methods work, and how they can
be applied in a wide variety of real data-analytic situations.
All three basic statistical concepts, data collection, summary and
inference, are illustrated in the New York Times excerpt of Figure
1.1. A study was done to see if small aspirin doses would prevent
heart attacks in healthy middle-aged men. The data for the
aspirin study were collected in a particularly efficient way: by a
controlled, randomized, double-blind study. One half of the subjects
received aspirin and the other half received a control substance, or
placebo, with no active ingredients. The subjects were randomly
assigned to the aspirin or placebo groups. Both the subjects and the
supervising physicians were blinded to the assignments, with the
statisticians keeping a secret code of who received which substance.
Scientists, like everyone else, want the project they are working on
to succeed. The elaborate precautions of a controlled, randomized,
blinded experiment guard against seeing benefits that don’t exist,
while maximizing the chance of detecting a genuine positive effect.
The summary statistics in the newspaper article are very simple:
                  heart attacks
                  (fatal plus non-fatal)    subjects
aspirin group:            104                11037
placebo group:            189                11034
We will see examples of much more complicated summaries in later
chapters. One advantage of using a good experimental design is a
simplification of its results. What strikes the eye here is the lower
rate of heart attacks in the aspirin group. The ratio of the two
rates is
$$\hat{\theta} = \frac{104/11037}{189/11034} = .55 \tag{1.1}$$
If this study can be believed, and its solid design makes it very
believable, the aspirin-takers only have 55% as many heart attacks
as placebo-takers.
Of course we are not really interested in $\hat{\theta}$, the estimated ratio.
What we would like to know is $\theta$, the true ratio, that is the ratio

HEART ATTACK RISK
FOUND TO BE CUT
BY TAKING ASPIRIN
LIFESAVING EFFECTS SEEN
Study Finds Benefit of Tablet
Every Other Day Is Much
Greater Than Expected
By HAROLD M. SCHMECK Jr.
A major nationwide study shows that
a single aspirin tablet every other day
can sharply reduce a man’s risk of
heart attack and death from heart attack.
The lifesaving effects were so dramatic
that the study was halted in mid-December
so that the results could be
reported as soon as possible to the participants
and to the medical profession
in general.
The magnitude of the beneficial effect
was far greater than expected, Dr.
Charles H. Hennekens of Harvard,
principal investigator in the research,
said in a telephone interview. The risk
of myocardial infarction, the technical
name for heart attack, was cut almost
in half.
‘Extreme Beneficial Effect’
A special report said the results
showed “a statistically extreme beneficial
effect” from the use of aspirin. The
report is to be published Thursday in
The New England Journal of Medicine.
In recent years smaller studies have
demonstrated that a person who has
had one heart attack can reduce the
risk of a second by taking aspirin, but
there had been no proof that the beneficial
effect would extend to the general
male population.
Dr. Claude Lenfant, the director of
the National Heart Lung and Blood Institute,
said the findings were “extremely
important,” but he said the
general public should not take the report
as an indication that everyone
should start taking aspirin.
Figure 1.1. Front-page news from the New York Times of January 27,
1987. Reproduced by permission of the New York Times.

we would see if we could treat all subjects, and not just a sample of
them. The value $\hat{\theta} = .55$ is only an estimate of $\theta$. The sample seems
large here, 22071 subjects in all, but the conclusion that aspirin
works is really based on a smaller number, the 293 observed heart
attacks. How do we know that $\hat{\theta}$ might not come out much less
favorably if the experiment were run again?
This is where statistical inference comes in. Statistical theory
allows us to make the following inference: the true value of $\theta$ lies
in the interval
$$.43 < \theta < .70 \tag{1.2}$$
with 95% confidence. Statement (1.2) is a classical confidence
interval, of the type discussed in Chapters 12-14, and 22. It says that
if we ran a much bigger experiment, with millions of subjects, the
ratio of rates probably wouldn’t be too much different than (1.1).
We almost certainly wouldn’t decide that $\theta$ exceeded 1, that is that
aspirin was actually harmful. It is really rather amazing that the
same data that give us an estimated value, $\hat{\theta} = .55$ in this case,
also can give us a good idea of the estimate’s accuracy.
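The text does not say which formula produced interval (1.2). As a rough illustration only, the sketch below uses one common large-sample approximation, a normal interval on the log of the ratio of rates; this choice of method and the function name `ratio_interval` are assumptions, not the authors' stated procedure. Applied to the heart attack counts, it gives limits close to those in (1.2).

```python
import math


def ratio_interval(events_1, n_1, events_2, n_2, z=1.96):
    """Ratio of two rates with an approximate 95% interval.

    Assumption: a normal approximation on the log scale, which may not
    be the method behind interval (1.2) in the text.
    """
    ratio = (events_1 / n_1) / (events_2 / n_2)
    # Approximate standard error of log(ratio) for two independent groups.
    se_log = math.sqrt(1 / events_1 - 1 / n_1 + 1 / events_2 - 1 / n_2)
    half_width = z * se_log
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)


if __name__ == "__main__":
    # Heart attack counts: aspirin group 104 of 11037, placebo group 189 of 11034.
    estimate, lower, upper = ratio_interval(104, 11037, 189, 11034)
    print(f"estimated ratio {estimate:.2f}, interval ({lower:.2f}, {upper:.2f})")
```

The interval is built on the log scale because a ratio of rates is positive and its sampling distribution is skewed; exponentiating the endpoints keeps both limits above zero.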
Statistical inference is serious business. A lot can ride on the
decision of whether or not an observed effect is real. The aspirin
study tracked strokes as well as heart attacks, with the following
results:
                  strokes    subjects
aspirin group:      119       11037
placebo group:       98       11034          (1.3)
For strokes, the ratio of rates is
$$\hat{\theta} = \frac{119/11037}{98/11034} = 1.21 \tag{1.4}$$
It now looks like taking aspirin is actually harmful. However the
interval for the true stroke ratio $\theta$ turns out to be
$$.93 < \theta < 1.59 \tag{1.5}$$
with 95% confidence. This includes the neutral value $\theta = 1$, at
which aspirin would be no better or worse than placebo vis-a-vis
strokes. In the language of statistical hypothesis testing, aspirin
was found to be significantly beneficial for preventing heart attacks,
but not significantly harmful for causing strokes. The opposite
conclusion had been reached in an older, smaller study concerning men
who had experienced previous heart attacks. The aspirin treatment
remains mildly controversial for such patients.
The bootstrap is a data-based simulation method for statistical
inference, which can be used to produce inferences like (1.2) and
(1.5). The use of the term bootstrap derives from the phrase to
pull oneself up by one’s bootstrap, widely thought to be based on
one of the eighteenth century Adventures of Baron Munchausen,
by Rudolph Erich Raspe. (The Baron had fallen to the bottom of
a deep lake. Just when it looked like all was lost, he thought to
pick himself up by his own bootstraps.) It is not the same as the
term “bootstrap” used in computer science meaning to “boot” a
computer from a set of core instructions, though the derivation is
similar.
Here is how the bootstrap works in the stroke example. We
create two populations: the first consisting of 119 ones and
11037 - 119 = 10918 zeroes, and the second consisting of 98 ones and
11034 - 98 = 10936 zeroes. We draw with replacement a sample of 11037
items from the first population, and a sample of 11034 items from
the second population. Each of these is called a bootstrap sample.
From these we derive the bootstrap replicate of $\hat{\theta}$:
$$\hat{\theta}^* = \frac{\text{Proportion of ones in bootstrap sample \#1}}{\text{Proportion of ones in bootstrap sample \#2}}$$
We repeat this process a large number of times, say 1000 times,
and obtain 1000 bootstrap replicates $\hat{\theta}^*$. This process is easy to
implement on a computer, as we will see later. These 1000 replicates
contain information that can be used to make inferences from our
data. For example, the standard deviation turned out to be 0.17
in a batch of 1000 replicates that we generated. The value 0.17
is an estimate of the standard error of the ratio of rates $\hat{\theta}$. This
indicates that the observed ratio $\hat{\theta} = 1.21$ is only a little more than
one standard error larger than 1, and so the neutral value $\theta = 1$
cannot be ruled out. A rough 95% confidence interval like (1.5)
can be derived by taking the 25th and 975th largest of the 1000
replicates, which in this case turned out to be (.93, 1.60).
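The resampling procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names, the fixed seed, and the use of the standard `random` and `statistics` modules are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_ratio_replicates(ones_1, n_1, ones_2, n_2, n_reps=1000, seed=0):
    """Bootstrap replicates of the ratio of the two proportions of ones."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    # The two populations of ones and zeroes described in the text.
    population_1 = [1] * ones_1 + [0] * (n_1 - ones_1)
    population_2 = [1] * ones_2 + [0] * (n_2 - ones_2)
    replicates = []
    for _ in range(n_reps):
        # Draw with replacement, keeping each group's original size.
        prop_1 = sum(rng.choices(population_1, k=n_1)) / n_1
        prop_2 = sum(rng.choices(population_2, k=n_2)) / n_2
        # prop_2 == 0 is possible in principle but vanishingly unlikely here.
        replicates.append(prop_1 / prop_2)
    return replicates


def summarize(replicates):
    """Standard deviation and rough 95% percentile limits of the replicates."""
    ordered = sorted(replicates)
    n = len(ordered)
    std_error = statistics.stdev(ordered)
    # With n = 1000 these are the 25th and 975th ordered values.
    lower = ordered[max(n * 25 // 1000, 1) - 1]
    upper = ordered[max(n * 975 // 1000, 1) - 1]
    return std_error, lower, upper


if __name__ == "__main__":
    # Stroke counts: aspirin group 119 of 11037, placebo group 98 of 11034.
    reps = bootstrap_ratio_replicates(119, 11037, 98, 11034)
    se, lo, hi = summarize(reps)
    print(f"bootstrap standard error: {se:.2f}")
    print(f"rough 95% interval: ({lo:.2f}, {hi:.2f})")
```

Because the replicates are random, the printed values change with the seed and will not match the text exactly; they should fall near the standard error of 0.17 and the interval (.93, 1.60) quoted above.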
In this simple example, the confidence interval derived from the
bootstrap agrees very closely with the one derived from statistical
theory. Bootstrap methods are intended to simplify the calculation
of inferences like (1.2) and (1.5), producing them in an automatic
way even in situations much more complicated than the aspirin
study.

The terminology of statistical summaries and inferences, like
regression, correlation, analysis of variance, discriminant analysis,
standard error, significance level and confidence interval, has
become the lingua franca of all disciplines that deal with noisy data.
We will be examining what this language means and how it works
in practice. The particular goal of bootstrap theory is a computer-
based implementation of basic statistical concepts. In some ways it
is easier to understand these concepts in computer-based contexts
than through traditional mathematical exposition.
1.1 An overview of this book
This book describes the bootstrap and other methods for assessing
statistical accuracy. The bootstrap does not work in isolation but
rather is applied to a wide variety of statistical procedures. Part
of the objective of this book is to expose the reader to many exciting
and useful statistical techniques through real-data examples. Some
of the techniques described include nonparametric regression,
density estimation, classification trees, and least median of squares
regression.
Here is a chapter-by-chapter synopsis of the book. Chapter 2
introduces the bootstrap estimate of standard error for a simple
mean. Chapters 3-5 contain some basic background material,
and may be skimmed by readers eager to get to the details of
the bootstrap in Chapter 6. Random samples, populations, and
basic probability theory are reviewed in Chapter 3. Chapter 4
defines the empirical distribution function estimate of the
population, which simply estimates the probability of each of n data items
to be 1/n. Chapter 4 also shows that many familiar statistics can
be viewed as “plug-in” estimates, that is, estimates obtained by
plugging in the empirical distribution function for the unknown
distribution of the population. Chapter 5 reviews standard error
estimation for a mean, and shows how the usual textbook formula
can be derived as a simple plug-in estimate.
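As a small worked illustration of the plug-in idea (the notation is an assumption made here, not copied from Chapter 5): for data $x_1, \ldots, x_n$ with mean $\bar{x}$, the standard error of the mean is the population standard deviation divided by $\sqrt{n}$. Plugging in the empirical distribution, which puts probability $1/n$ on each $x_i$, replaces the unknown standard deviation with its empirical value:

```latex
\widehat{\mathrm{se}}(\bar{x}) = \frac{\hat{\sigma}}{\sqrt{n}},
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
\quad\Longrightarrow\quad
\widehat{\mathrm{se}}(\bar{x}) = \left\{\frac{1}{n^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\right\}^{1/2}
```

This differs from the familiar $s/\sqrt{n}$ only in using the divisor $n$ rather than $n - 1$ in the variance, a difference that is negligible unless $n$ is very small.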
The bootstrap is defined in Chapter 6, for estimating the stan�
dard error of a statistic from a single sample. The bootstrap stan�
dard error estimate is a plug-in estimate that rarely can be com�
puted exactly; instead a simulation (“resampling”) method is used
for approximating it.
Chapter 7 describes the application of bootstrap standard er�
rors in two complicated examples: a principal components analysis

Page 24
AN OVERVIEW OF THIS BOOK
7
and a curve fitting problem.
Up to this point, only one-sample data problems have been
discussed. The application of the bootstrap to more complicated data
structures is discussed in Chapter 8. A two-sample problem and
a time-series analysis are described.
Regression analysis and the bootstrap are discussed and
illustrated in Chapter 9. The bootstrap estimate of standard error is
applied in a number of different ways and the results are discussed
in two examples.
The use of the bootstrap for estimation of bias is the topic of
Chapter 10, and the pros and cons of bias correction are
discussed. Chapter 11 describes the jackknife method in some detail.
We see that the jackknife is a simple closed-form approximation to
the bootstrap, in the context of standard error and bias estimation.
The use of the bootstrap for construction of confidence intervals
is described in Chapters 12, 13 and 14. There are a number of
different approaches to this important topic and we devote quite
a bit of space to them. In Chapter 12 we discuss the bootstrap-t
approach, which generalizes the usual Student’s t method for
constructing confidence intervals. The percentile method (Chapter
13) uses instead the percentiles of the bootstrap distribution to
define confidence limits. The BCa (bias-corrected and accelerated)
interval makes important corrections to the percentile interval and
is described in Chapter 14.
Chapter 15 covers permutation tests, a time-honored and
useful set of tools for hypothesis testing. Their close relationship with
the bootstrap is discussed; Chapter 16 shows how the bootstrap
can be used in more general hypothesis testing problems.
Prediction error estimation arises in regression and classification
problems, and we describe some approaches for it in Chapter 17.
Cross-validation and bootstrap methods are described and
illustrated. Extending this idea, Chapter 18 shows how the
bootstrap and cross-validation can be used to adapt estimators to a set
of data.
Like any statistic, bootstrap estimates are random variables and
so have inherent error associated with them. When using the
bootstrap for making inferences, it is important to get an idea of the
magnitude of this error. In Chapter 19 we discuss the
jackknife-after-bootstrap method for estimating the standard error of a
bootstrap quantity.
Chapters 20-25 contain more advanced material on selected
topics, and delve more deeply into some of the material introduced
in the previous chapters. The relationship between the bootstrap
and jackknife is studied via the “resampling picture” in Chapter
20. Chapter 21 gives an overview of non-parametric and
parametric inference, and relates the bootstrap to a number of other
techniques for estimating standard errors. These include the delta
method, Fisher information, infinitesimal jackknife, and the
sandwich estimator.
Some advanced topics in bootstrap confidence intervals are
discussed in Chapter 22, providing some of the underlying basis
for the techniques introduced in Chapters 12-14. Chapter 23
describes methods for efficient computation of bootstrap estimates
including control variates and importance sampling. In Chapter
24 the construction of approximate likelihoods is discussed. The
bootstrap and other related methods are used to construct a
“nonparametric” likelihood in situations where a parametric model is
not specified.
Chapter 25 describes in detail a bioequivalence study in which
the bootstrap is used to estimate power and sample size. In
Chapter 26 we discuss some general issues concerning the bootstrap and
its role in statistical inference.
Finally, the Appendix contains a description of a number of
different computer programs for the methods discussed in this book.
1.2 Information for instructors
We envision that this book can provide the basis for (at least)
two different one semester courses. An upper-year undergraduate
or first-year graduate course could be taught from some or all of
the first 19 chapters, possibly covering Chapter 25 as well (both
authors have done this). In addition, a more advanced graduate
course could be taught from a selection of Chapters 6-19, and a
selection of Chapters 20-26. For an advanced course, supplementary
material might be used, such as Peter Hall’s book The Bootstrap
and Edgeworth Expansion or journal papers on selected technical
topics. The Bibliographic notes in the book contain many
suggestions for background reading.
We have provided numerous exercises at the end of each
chapter. Some of these involve computing, since it is important for the
student to get hands-on experience for learning the material. The
bootstrap is most effectively used in a high-level language for data
analysis and graphics. Our language of choice (at present) is “S”
(or “S-PLUS”), and a number of S programs appear in the
Appendix. Most of these programs could be easily translated into
other languages such as Gauss, Lisp-Stat, or Matlab. Details on
the availability of S and S-PLUS are given in the Appendix.
1.3 Some of the notation used in the book
Lower case bold letters such as $\mathbf{x}$ refer to vectors, that is,
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$. Matrices are denoted by upper case bold letters
such as $\mathbf{X}$, while a plain uppercase letter like $X$ refers to a random
variable. The transpose of a vector is written as $\mathbf{x}^T$. A superscript
“$*$” indicates a bootstrap random variable: for example, $\mathbf{x}^*$
indicates a bootstrap data set generated from a data set $\mathbf{x}$. Parameters
are denoted by Greek letters such as $\theta$. A hat on a letter indicates
an estimate, such as $\hat{\theta}$. The letters $F$ and $G$ refer to populations. In
Chapter 21 the same symbols are used for the cumulative
distribution function of a population. $I_C$ is the indicator function equal to
1 if condition $C$ is true and 0 otherwise. For example, $I_{\{x<2\}} = 1$
if $x < 2$ and 0 otherwise. The notation $\mathrm{tr}(A)$ refers to the trace
of the matrix $A$, that is, the sum of the diagonal elements. The
derivatives of a function $g(x)$ are denoted by $g'(x)$, $g''(x)$ and so
on.
The notation
$$F \rightarrow (x_1, x_2, \ldots, x_n)$$
indicates an independent and identically distributed sample drawn
from $F$. Equivalently, we also write $x_i \overset{\mathrm{i.i.d.}}{\sim} F$ for $i = 1, 2, \ldots, n$.
Notation such as $\#\{x_i > 3\}$ means the number of $x_i$'s greater
than 3. $\log x$ refers to the natural logarithm of $x$.

CHAPTER 2
The accuracy of a sample mean
The bootstrap is a computer-based method for assigning measures
of accuracy to statistical estimates. The basic idea behind the
bootstrap is very simple, and goes back at least two centuries. After
reviewing some background material, this book describes the
bootstrap method, its implementation on the computer, and its
application to some real data analysis problems. First though, this chapter
focuses on the one example of a statistical estimator where we
really don’t need a computer to assess accuracy: the sample mean.
In addition to previewing the bootstrap, this gives us a chance to
review some fundamental ideas from elementary statistics. We
begin with a simple example concerning means and their estimated
accuracies.
Table 2.1 shows the results of a small experiment, in which 7 out
of 16 mice were randomly selected to receive a new medical
treatment, while the remaining 9 were assigned to the non-treatment
(control) group. The treatment was intended to prolong survival
after a test surgery. The table shows the survival time following
surgery, in days, for all 16 mice.
Did the treatment prolong survival? A comparison of the means
for the two groups offers preliminary grounds for optimism. Let
$x_1, x_2, \ldots, x_7$ indicate the lifetimes in the treatment group, so $x_1 =
94, x_2 = 197, \ldots, x_7 = 23$, and likewise let $y_1, y_2, \ldots, y_9$ indicate
the control group lifetimes. The group means are
$$\bar{x} = \sum_{i=1}^{7} x_i/7 = 86.86 \quad \text{and} \quad \bar{y} = \sum_{i=1}^{9} y_i/9 = 56.22, \qquad (2.1)$$
so the difference $\bar{x} - \bar{y}$ equals 30.63, suggesting a considerable
life-prolonging effect for the treatment.
But how accurate are these estimates? After all, the means (2.1)
are based on small samples, only 7 and 9 mice, respectively. In
order to answer this question, we need an estimate of the accuracy
of the sample means $\bar{x}$ and $\bar{y}$. For sample means, and essentially
only for sample means, an accuracy formula is easy to obtain.

Table 2.1. The mouse data. Sixteen mice were randomly assigned to a
treatment group or a control group. Shown are their survival times, in
days, following a test surgery. Did the treatment prolong survival?

Group        Data                            (Sample   Mean    Estimated
                                              Size)            Standard Error
Treatment:   94 197 16 38 99 141 23          (7)       86.86   25.24
Control:     52 104 146 10 51 30 40 27 46    (9)       56.22   14.14
Difference:                                            30.63   28.93

The estimated standard error of a mean $\bar{x}$ based on $n$
independent data points $x_1, x_2, \ldots, x_n$, $\bar{x} = \sum_{i=1}^{n} x_i/n$, is given by the
formula
$$\sqrt{\frac{s^2}{n}}, \qquad (2.2)$$
where $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n - 1)$. (This formula, and standard
errors in general, are discussed more carefully in Chapter 5.) The
standard error of any estimator is defined to be the square root of
its variance, that is, the estimator’s root mean square variability
around its expectation. This is the most common measure of an
estimator’s accuracy. Roughly speaking, an estimator will be less
than one standard error away from its expectation about 68% of
the time, and less than two standard errors away about 95% of the
time.
If the estimated standard errors in the mouse experiment were
very small, say less than 1, then we would know that $\bar{x}$ and $\bar{y}$ were
close to their expected values, and that the observed difference of
30.63 was probably a good estimate of the true survival-prolonging
capability of the treatment. On the other hand, if formula (2.2)
gave big estimated standard errors, say 50, then the difference
estimate would be too inaccurate to depend on.
The actual situation is shown at the right of Table 2.1. The
estimated standard errors, calculated from (2.2), are 25.24 for $\bar{x}$
and 14.14 for $\bar{y}$. The standard error for the difference $\bar{x} - \bar{y}$ equals
$28.93 = \sqrt{25.24^2 + 14.14^2}$ (since the variance of the difference of
two independent quantities is the sum of their variances). We see
that the observed difference 30.63 is only 30.63/28.93 = 1.05
estimated standard errors greater than zero. Readers familiar with
hypothesis testing theory will recognize this as an insignificant
result, one that could easily arise by chance even if the treatment
really had no effect at all.
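The arithmetic behind Table 2.1 can be checked with a few lines of code. The following is only a sketch, written in Python rather than the S language used in the Appendix; the function name `std_error_mean` is ours and is not taken from the book's programs. Its output should agree closely with the table, apart from small differences in the final digits.

```python
import math

treatment = [94, 197, 16, 38, 99, 141, 23]          # Table 2.1
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]    # Table 2.1


def std_error_mean(data):
    """Estimated standard error of the mean, formula (2.2): sqrt(s^2 / n)."""
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)  # divisor n - 1
    return math.sqrt(s2 / n)


mean_x = sum(treatment) / len(treatment)
mean_y = sum(control) / len(control)
se_x = std_error_mean(treatment)
se_y = std_error_mean(control)

# Variances of independent quantities add, so the standard error of the
# difference is the square root of the sum of the squared standard errors.
diff = mean_x - mean_y
se_diff = math.sqrt(se_x ** 2 + se_y ** 2)
ratio = diff / se_diff

print(f"treatment: mean {mean_x:.2f}, se {se_x:.2f}")
print(f"control:   mean {mean_y:.2f}, se {se_y:.2f}")
print(f"difference {diff:.2f}, se {se_diff:.2f}, ratio {ratio:.2f}")
```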
There are more precise ways to verify this disappointing result
(e.g. the permutation test of Chapter 15), but usually, as in this
case, estimated standard errors are an excellent first step toward
thinking critically about statistical estimates. Unfortunately
standard errors have a major disadvantage: for most statistical
estimators other than the mean there is no formula like (2.2) to provide
estimated standard errors. In other words, it is hard to assess the
accuracy of an estimate other than the mean.
Suppose, for example, we want to compare the two groups in
Table 2.1 by their medians rather than their means. The two medians
are 94 for treatment and 46 for control, giving an estimated
difference of 48, considerably more than the difference of the means.
But how accurate are these medians? Answering such questions is
where the bootstrap, and other computer-based techniques, come
in. The remainder of this chapter gives a brief preview of the
bootstrap estimate of standard error, a method which will be fully
discussed in succeeding chapters.
Suppose we observe independent data points $x_1, x_2, \ldots, x_n$, for
convenience denoted by the vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, from which
we compute a statistic of interest $s(\mathbf{x})$. For example the data might
be the $n = 9$ control group observations in Table 2.1, and $s(\mathbf{x})$
might be the sample mean.
The bootstrap estimate of standard error, invented by Efron in
1979, looks completely different than (2.2), but in fact it is closely
related, as we shall see. A bootstrap sample $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$ is
obtained by randomly sampling $n$ times, with replacement, from
the original data points $x_1, x_2, \ldots, x_n$. For instance, with $n = 7$ we
might obtain $\mathbf{x}^* = (x_5, x_7, x_5, x_4, x_7, x_3, x_1)$.
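Drawing a single bootstrap sample amounts to picking n indices at random, with replacement, and reading off the corresponding data points. A minimal sketch in Python follows; the seed and variable names are arbitrary choices of ours, and the particular indices drawn will differ from the example in the text.

```python
import random

rng = random.Random(0)  # fixed, arbitrary seed so the run is repeatable

x = [94, 197, 16, 38, 99, 141, 23]  # treatment group, Table 2.1
n = len(x)

# Draw n indices independently and uniformly from 1..n, with replacement,
# so the same index may appear more than once.
indices = [rng.randint(1, n) for _ in range(n)]

# The bootstrap sample x* consists of the data points at those indices.
x_star = [x[i - 1] for i in indices]

print("indices:", indices)
print("x*     :", x_star)
```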

Figure 2.1. Schematic of the bootstrap process for estimating the
standard error of a statistic $s(\mathbf{x})$. $B$ bootstrap samples are generated from
the original data set. Each bootstrap sample has $n$ elements, generated
by sampling with replacement $n$ times from the original data set.
Bootstrap replicates $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots, s(\mathbf{x}^{*B})$ are obtained by calculating the
value of the statistic $s(\mathbf{x})$ on each bootstrap sample. Finally, the
standard deviation of the values $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots, s(\mathbf{x}^{*B})$ is our estimate of
the standard error of $s(\mathbf{x})$.
Figure 2.1 is a schematic of the bootstrap process. The
bootstrap algorithm begins by generating a large number of
independent bootstrap samples $\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$, each of size $n$. Typical
values for $B$, the number of bootstrap samples, range from 50 to
200 for standard error estimation. Corresponding to each bootstrap
sample is a bootstrap replication of $s$, namely $s(\mathbf{x}^{*b})$, the value of
the statistic $s$ evaluated for $\mathbf{x}^{*b}$. If $s(\mathbf{x})$ is the sample median, for
instance, then $s(\mathbf{x}^*)$ is the median of the bootstrap sample. The
bootstrap estimate of standard error is the standard deviation of
the bootstrap replications,
$$\widehat{\mathrm{se}}_{\mathrm{boot}} = \Big\{ \sum_{b=1}^{B} [s(\mathbf{x}^{*b}) - s(\cdot)]^2/(B - 1) \Big\}^{1/2}, \qquad (2.3)$$
where $s(\cdot) = \sum_{b=1}^{B} s(\mathbf{x}^{*b})/B$. Suppose $s(\mathbf{x})$ is the mean $\bar{x}$. In this
case, standard probability theory tells us (Problem 2.5) that as $B$
gets very large, formula (2.3) approaches
$$\Big\{ \sum_{i=1}^{n} (x_i - \bar{x})^2/n^2 \Big\}^{1/2}. \qquad (2.4)$$
This is almost the same as formula (2.2). We could make it
exactly the same by multiplying definition (2.3) by the factor
$[n/(n - 1)]^{1/2}$, but there is no real advantage in doing so.

Table 2.2. Bootstrap estimates of standard error for the mean and
median; treatment group, mouse data, Table 2.1. The median is less
accurate (has larger standard error) than the mean for this data set.

B:        50      100     250     500     1000    ∞
mean:     19.72   23.63   22.32   23.79   23.02   23.36
median:   32.21   36.35   34.46   36.72   36.48   37.83

Table 2.2 shows bootstrap estimated standard errors for the
mean and the median, for the treatment group mouse data of
Table 2.1. The estimated standard errors settle down to limiting
values as the number of bootstrap samples $B$ increases. The limiting
value 23.36 for the mean is obtained from (2.4). The formula for
the limiting value 37.83 for the standard error of the median is
quite complicated: see Problem 2.4 for a derivation.
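The procedure summarized in Figure 2.1 and formula (2.3) can be sketched in a few lines. The Python code below is an illustration under our own naming (`bootstrap_se` and `ideal_se_mean` are hypothetical helpers, not the S programs of the Appendix). Because the bootstrap samples are random, the values printed for finite B will not reproduce Table 2.2 exactly; only the limiting value from (2.4) is deterministic.

```python
import math
import random
import statistics


def bootstrap_se(data, stat, B, rng):
    """Bootstrap standard error (2.3): sd of B bootstrap replications."""
    n = len(data)
    reps = []
    for _ in range(B):
        # One bootstrap sample: n draws with replacement from the data.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(sample))
    return statistics.stdev(reps)  # divisor B - 1, as in (2.3)


def ideal_se_mean(data):
    """Limiting value (2.4) of the bootstrap standard error of the mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n ** 2)


treatment = [94, 197, 16, 38, 99, 141, 23]  # Table 2.1
rng = random.Random(1)  # arbitrary seed

for B in (50, 100, 250, 500, 1000):
    se_mean = bootstrap_se(treatment, statistics.mean, B, rng)
    se_median = bootstrap_se(treatment, statistics.median, B, rng)
    print(f"B = {B:4d}  mean: {se_mean:6.2f}  median: {se_median:6.2f}")

limit = ideal_se_mean(treatment)
print(f"limit for the mean from (2.4): {limit:.2f}")
```

Passing the statistic as a function argument is what lets the same routine serve for the mean, the median, or any other computable statistic.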
We are now in a position to assess the precision of the
difference in medians between the two groups. The bootstrap procedure
described above was applied to the control group, producing a
standard error estimate of 11.54 based on $B = 100$ replications ($B = \infty$
gave 9.73). Therefore, using $B = 100$, the observed difference of 48
has an estimated standard error of $\sqrt{36.35^2 + 11.54^2} = 38.14$, and
hence is 48/38.14 = 1.26 standard errors greater than zero. This
ratio is larger than the 1.05 obtained for the difference in means,
but is still insignificant.
For most statistics we don’t have a formula for the limiting value
of the standard error, but in fact no formula is needed. Instead
we use the numerical output of the bootstrap program, for some
convenient value of $B$. We will see in Chapters 6 and 19 that $B$
in the range 50 to 200 usually makes $\widehat{\mathrm{se}}_{\mathrm{boot}}$ a good standard error
estimator, even for estimators like the median. It is easy to write
a bootstrap program that works for any computable statistic $s(\mathbf{x})$,
as shown in Chapter 6 and the Appendix. With these programs
in place, the data analyst is free to use any estimator, no matter
how complicated, with the assurance that he or she will also have
a reasonable idea of the estimator’s accuracy. The price, a factor
of perhaps 100 in increased computation, has become affordable as
computers have grown faster and cheaper.
Standard errors are the simplest measures of statistical
accuracy. Later chapters show how bootstrap methods can assess more
complicated accuracy measures, like biases, prediction errors, and
confidence intervals. Bootstrap confidence intervals add another
factor of 10 to the computational burden. The payoff for all this
computation is an increase in the statistical problems that can be
analyzed, a reduction in the assumptions of the analysis, and the
elimination of the routine but tedious theoretical calculations
usually associated with accuracy assessment.
2.1 Problems
2.1 * Suppose that the mouse survival times were expressed in
weeks instead of days, so that the entries in Table 2.1 were
all divided by 7.
(a) What effect would this have on $\bar{x}$ and on its estimated
standard error (2.2)? Why does this make sense?
(b) What effect would this have on the ratio of the
difference $\bar{x} - \bar{y}$ to its estimated standard error?
2.2 Imagine the treatment group in Table 2.1 consisted of $R$
repetitions of the data actually shown, where $R$ is a positive
integer. That is, the treatment data consisted of $R$ 94’s, $R$ 197’s,
etc. What effect would this have on the estimated standard
error (2.2)?
2.3 It is usually true that the error of a statistical estimator
decreases at a rate of about 1 over the square root of the sample
size. Does this agree with the result of Problem 2.2?
2.4 Let $x_{(1)} < x_{(2)} < x_{(3)} < x_{(4)} < x_{(5)} < x_{(6)} < x_{(7)}$ be an
ordered sample of size $n = 7$. Let $\mathbf{x}^*$ be a bootstrap sample,
and $s(\mathbf{x}^*)$ be the corresponding bootstrap replication of the
median. Show that
(a) $s(\mathbf{x}^*)$ equals one of the original data values $x_{(i)}$,
$i = 1, 2, \ldots, 7$.
(b) † $s(\mathbf{x}^*)$ equals $x_{(i)}$ with probability
$$p(i) = \sum_{j=0}^{3} \Big\{ \mathrm{Bi}\Big(j; n, \frac{i-1}{n}\Big) - \mathrm{Bi}\Big(j; n, \frac{i}{n}\Big) \Big\}, \qquad (2.5)$$
where $\mathrm{Bi}(j; n, p)$ is the binomial probability $\binom{n}{j} p^j (1 - p)^{n-j}$.
[The numerical values of $p(i)$ are .0102, .0981, .2386, .3062,
.2386, .0981, .0102. These values were used to compute
$\widehat{\mathrm{se}}_{\mathrm{boot}}\{\text{median}\} = 37.83$, for $B = \infty$, Table 2.2.]
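The numerical values quoted in Problem 2.4 can be reproduced directly from formula (2.5), as in the sketch below. It evaluates the formula rather than deriving it, assumes the sample size is odd, and uses helper names of our own (`binom_pmf`, `median_probs`). The limiting standard error is computed as the standard deviation of the discrete distribution that puts probability p(i) on the i-th ordered data value.

```python
import math


def binom_pmf(j, n, p):
    """Binomial probability Bi(j; n, p)."""
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)


def median_probs(n):
    """p(i), i = 1..n, from (2.5); assumes n is odd (n = 7 in the text)."""
    m = (n - 1) // 2  # upper limit of the sum; equals 3 when n = 7
    probs = []
    for i in range(1, n + 1):
        total = sum(
            binom_pmf(j, n, (i - 1) / n) - binom_pmf(j, n, i / n)
            for j in range(m + 1)
        )
        probs.append(total)
    return probs


ordered = sorted([94, 197, 16, 38, 99, 141, 23])  # treatment group
p = median_probs(len(ordered))

# Mean and standard deviation of the bootstrap median under p(i).
mean_median = sum(pi * x for pi, x in zip(p, ordered))
se_median = math.sqrt(
    sum(pi * (x - mean_median) ** 2 for pi, x in zip(p, ordered))
)

print([round(pi, 4) for pi in p])
print(round(se_median, 2))
```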
2.5 Apply the weak law of large numbers to show that expression
(2.3) approaches expression (2.4) as $B$ goes to infinity.
† Indicates a difficult or more advanced problem.

CHAPTER 3
Random samples and
probabilities
3.1 Introduction
Statistics is the theory of accumulating information, especially
information that arrives a little bit at a time. A typical statistical
situation was illustrated by the mouse data of Table 2.1. No one
mouse provides much information, since the individual results are
so variable, but seven or nine mice considered together begin to
be quite informative. Statistical theory concerns the best ways of
extracting this information. Probability theory provides the
mathematical framework for statistical inference. This chapter reviews
the simplest probabilistic model used to model random data: the
case where the observations are a random sample from a single
unknown population, whose properties we are trying to learn from
the observed data.
3.2 Random samples
It is easiest to visualize random samples in terms of a finite
population or “universe” $\mathcal{U}$ of individual units $U_1, U_2, \ldots, U_N$, any one
of which is equally likely to be selected in a single random draw.
The population of units might be all the registered voters in an
area undergoing a political survey, all the men that might
conceivably be selected for a medical experiment, all the high schools
in the United States, etc. The individual units have properties we
would like to learn, like a political opinion, a medical survival time,
or a graduation rate. It is too difficult and expensive to examine
every unit in $\mathcal{U}$, so we select for observation a random sample of
manageable size.
A random sample of size $n$ is defined to be a collection of $n$
units $u_1, u_2, \ldots, u_n$ selected at random from $\mathcal{U}$. In principle the
sampling process goes as follows: a random number device
independently selects integers $j_1, j_2, \ldots, j_n$, each of which equals any
value between 1 and $N$ with probability $1/N$. These integers
determine which members of $\mathcal{U}$ are selected to be in the random sample,
$u_1 = U_{j_1}, u_2 = U_{j_2}, \ldots, u_n = U_{j_n}$. In practice the selection process
is seldom this neat, and the population $\mathcal{U}$ may be poorly defined,
but the conceptual framework of random sampling is still useful for
understanding statistical inference. (The methodology of good
experimental design, for example the random assignment of selected
units to Treatment or Control groups as was done in the mouse
experiment, helps make random sampling theory more applicable
to real situations like that of Table 2.1.)
Our definition of random sampling allows a single unit $U_i$ to
appear more than once in the sample. We could avoid this by insisting
that the integers $j_1, j_2, \ldots, j_n$ be distinct, called “sampling
without replacement.” It is a little simpler to allow repetitions, that is
to “sample with replacement”, as in the previous paragraph. If the
size $n$ of the random sample is much smaller than the population
size $N$, as is usually the case, the probability of sample repetitions
will be small anyway. See Problem 3.1. Random sampling always
means sampling with replacement in what follows, unless otherwise
stated.
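The sampling scheme just described, and the chance that it produces no repeated units, can be explored numerically. The Python sketch below is only an illustration; the function names and the choices of n and N are ours. The probability that n draws with replacement from N units are all distinct is the product of (N - k)/N for k = 0, 1, ..., n - 1.

```python
import random


def random_sample_indices(n, N, rng):
    """Integers j_1, ..., j_n, each uniform on 1..N, drawn independently."""
    return [rng.randint(1, N) for _ in range(n)]


def prob_no_repeats(n, N):
    """Chance that n draws with replacement from N units are all distinct."""
    prob = 1.0
    for k in range(n):
        prob *= (N - k) / N  # the (k+1)-th draw must avoid k earlier units
    return prob


rng = random.Random(0)  # arbitrary seed
print(random_sample_indices(15, 82, rng))

# How the chance of a repetition-free sample of size 15 changes with N.
for N in (82, 1000, 100000):
    print(N, round(prob_no_repeats(15, N), 4))
```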
Having selected a random sample $u_1, u_2, \ldots, u_n$, we obtain one
or more measurements of interest for each unit. Let $x_i$ indicate
the measurements for unit $u_i$. The observed data are the
collection of measurements $x_1, x_2, \ldots, x_n$. Sometimes we will denote the
observed data $(x_1, x_2, \ldots, x_n)$ by the single symbol $\mathbf{x}$.
We can imagine making the measurements of interest on
every member $U_1, U_2, \ldots, U_N$ of $\mathcal{U}$, obtaining values $X_1, X_2, \ldots, X_N$.
This would be called a census of $\mathcal{U}$.
The symbol $\mathcal{X}$ will denote the census of measurements
$(X_1, X_2, \ldots, X_N)$. We will also refer to $\mathcal{X}$ as the population of
measurements, or simply the population, and call $\mathbf{x}$ a random sample of
size $n$ from $\mathcal{X}$. In fact, we usually can’t afford to conduct a census,
which is why we have taken a random sample. The goal of
statistical inference is to say what we have learned about the population $\mathcal{X}$
from the observed data $\mathbf{x}$. In particular, we will use the bootstrap
to say how accurately a statistic calculated from $x_1, x_2, \ldots, x_n$ (for
instance the sample median) estimates the corresponding quantity
for the whole population.

Table 3.1. The law school data. A random sample of size n = 15 was
taken from the collection of N = 82 American law schools participating
in a large study of admission practices. Two measurements were made
on the entering classes of each school in 1973: LSAT, the average score
for the class on a national law test, and GPA, the average undergraduate
grade-point average for the class.

School  LSAT  GPA      School  LSAT  GPA
1       576   3.39     9       651   3.36
2       635   3.30     10      605   3.13
3       558   2.81     11      653   3.12
4       578   3.03     12      575   2.74
5       666   3.44     13      545   2.76
6       580   3.07     14      572   2.88
7       555   3.00     15      594   2.96
8       661   3.43

Table 3.1 shows a random sample of size $n = 15$ drawn from
a population of $N = 82$ American law schools. What is actually
shown are two measurements made on the entering classes of 1973
for each school in the sample: LSAT, the average score of the class
on a national law test, and GPA, the average undergraduate grade
point average achieved by the members of the class. In this case
the measurement $x_i$ on $u_i$, the $i$th member of the sample, is the
pair
$$x_i = (\mathrm{LSAT}_i, \mathrm{GPA}_i) \qquad i = 1, 2, \ldots, 15.$$
The observed data $x_1, x_2, \ldots, x_n$ is the collection of 15 pairs of
numbers shown in Table 3.1.
This example is an artificial one because the census of data
$X_1, X_2, \ldots, X_{82}$ was actually made. In other words, LSAT and
GPA are available for the entire population of $N = 82$ schools.
Figure 3.1 shows the census data and the sample data. Table 3.2
gives the entire population of $N$ measurements.
Figure 3.1. The left panel is a scatterplot of the (LSAT, GPA) data
for all N = 82 law schools; circles indicate the n = 15 data points
comprising the “observed sample” of Table 3.1. The right panel shows
only the observed sample. In problems of statistical inference, we are
trying to infer the situation on the left from the picture on the right.

In a real statistical problem, like that of Table 3.1, we would see
only the sample data, from which we would be trying to infer the
properties of the population. For example, consider the 15 LSAT
scores in the observed sample. These have mean 600.27 with
estimated standard error 10.79, based on the data in Table 3.1 and
formula (2.2). There is about a 68% chance that the true LSAT
mean, the mean for the entire population from which the observed
data was sampled, lies in the interval $600.27 \pm 10.79$.
We can check this result, since we are dealing with an
artificial example for which the complete population data are known.
The mean of all 82 LSAT values is 597.55, lying nicely within the
predicted interval $600.27 \pm 10.79$.
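The sample mean and standard error quoted above follow from the LSAT column of Table 3.1 and formula (2.2). A brief Python check is sketched here; the variable names are ours, and the population mean 597.55 is simply copied from the text rather than recomputed.

```python
import math

# LSAT scores for the 15 sampled schools, Table 3.1.
lsat = [576, 635, 558, 578, 666, 580, 555, 661,
        651, 605, 653, 575, 545, 572, 594]

n = len(lsat)
mean = sum(lsat) / n
s2 = sum((v - mean) ** 2 for v in lsat) / (n - 1)
se = math.sqrt(s2 / n)  # formula (2.2)

population_mean = 597.55  # value quoted in the text for all 82 schools
inside = mean - se <= population_mean <= mean + se

print(f"mean {mean:.2f}, se {se:.2f}, population mean inside: {inside}")
```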
3.3 Probability theory
Statistical inference concerns learning from experience: we observe
a random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and wish to infer properties
of the complete population $\mathcal{X} = (X_1, X_2, \ldots, X_N)$ that yielded
the sample. Probability theory goes in the opposite direction: from
the composition of a population $\mathcal{X}$ we deduce the properties of a
random sample $\mathbf{x}$, and of statistics calculated from $\mathbf{x}$. Statistical
inference as a mathematical science has been developed almost
exclusively in terms of probability theory. Here we will review briefly
some fundamental concepts of probability, including probability
distributions, expectations, and independence.

Table 3.2. The population of measurements (LSAT, GPA), for the
universe of 82 law schools. The data in Table 3.1 was sampled from this
population. The +’s indicate the sampled schools.

school  LSAT  GPA      school  LSAT  GPA      school  LSAT  GPA
1       622   3.23     28      632   3.29     56      641   3.28
2       542   2.83     29      587   3.16     57      512   3.01
3       579   3.24     30      581   3.17     58      631   3.21
4+      653   3.12     31+     605   3.13     59      597   3.32
5       606   3.09     32      704   3.36     60      621   3.24
6+      576   3.39     33      477   2.57     61      617   3.03
7       620   3.10     34      591   3.02     62      637   3.33
8       615   3.40     35+     578   3.03     63      572   3.08
9       553   2.97     36+     572   2.88     64      610   3.13
10      607   2.91     37      615   3.37     65      562   3.01
11      558   3.11     38      606   3.20     66      635   3.30
12      596   3.24     39      603   3.23     67      614   3.15
13+     635   3.30     40      535   2.98     68      546   2.82
14      581   3.22     41      595   3.11     69      598   3.20
15+     661   3.43     42      575   2.92     70+     666   3.44
16      547   2.91     43      573   2.85     71      570   3.01
17      599   3.23     44      644   3.38     72      570   2.92
18      646   3.47     45+     545   2.76     73      605   3.45
19      622   3.15     46      645   3.27     74      565   3.15
20      611   3.33     47+     651   3.36     75      686   3.50
21      546   2.99     48      562   3.19     76      608   3.16
22      614   3.19     49      609   3.17     77      595   3.19
23      628   3.03     50+     555   3.00     78      590   3.15
24      575   3.01     51      586   3.11     79+     558   2.81
25      662   3.39     52+     580   3.07     80      611   3.16
26      627   3.41     53+     594   2.96     81      564   3.02
27      608   3.04     54      594   3.05     82+     575   2.74
                       55      560   2.93

As a first example, let x represent the outcome of rolling a fair
die so x is equally likely to be 1,2,3,4,5, or 6. We write this in
probability notation as
$$\mathrm{Prob}\{x = k\} = 1/6 \quad \text{for } k = 1, 2, 3, 4, 5, 6. \qquad (3.1)$$
A random quantity like x is often called a random variable.
Probabilities are idealized or theoretical proportions. We can
imagine a universe $\mathcal{U} = \{U_1, U_2, \ldots, U_N\}$ of possible rolls of the
die, where $U_j$ completely describes the physical act of the $j$th roll,
with corresponding results $\mathcal{X} = (X_1, X_2, \ldots, X_N)$. Here $N$ might
be very large, or even infinite. The statement $\mathrm{Prob}\{x = 5\} = 1/6$
means that a randomly selected member of $\mathcal{X}$ has a 1/6 chance of
equaling 5, or more simply that 1/6 of the members of $\mathcal{X}$ equal 5.
Notice that probabilities, like proportions, can never be less than
0 or greater than 1.
For convenient notation define the frequencies $f_k$,
$$f_k = \mathrm{Prob}\{x = k\}, \qquad (3.2)$$
so the fair die has $f_k = 1/6$ for $k = 1, 2, \ldots, 6$. The probability
distribution of a random variable $x$, which we will denote by $F$, is
any complete description of the probabilistic behavior of $x$. $F$ is
also called the probability distribution of the population $\mathcal{X}$. Here
we can take $F$ to be the vector of frequencies
$$F = (f_1, f_2, \ldots, f_6) = (1/6, 1/6, \ldots, 1/6). \qquad (3.3)$$
An unfair die would be one for which $F$ did not equal
$(1/6, 1/6, \ldots, 1/6)$.
Note: In many books, the symbol $F$ is used for the cumulative
probability distribution function $F(x_0) = \mathrm{Prob}\{x \le x_0\}$ for $-\infty <
x_0 < \infty$. This is an equally valid description of the probabilistic
behavior of $x$, but it is only convenient for the case where $x$ is a real
number. We will also be interested in cases where $x$ is a vector, as
in Table 3.1, or an even more general object. This is the reason for
defining $F$ as any description of $x$'s probabilities, rather than the
specific description in terms of the cumulative probabilities. When
no confusion can arise, in later chapters we use symbols like $F$ and
$G$ to represent cumulative distribution functions.
Some probability distributions arise so frequently that they have
received special names. A random variable $x$ is said to have the
binomial distribution with size $n$ and probability of success $p$,
denoted
$$x \sim \mathrm{Bi}(n, p), \qquad (3.4)$$
if its frequencies are
$$f_k = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, 1, 2, \ldots, n. \qquad (3.5)$$
Here $n$ is a positive integer, $p$ is a number between 0 and 1, and
$\binom{n}{k}$ is the binomial coefficient $n!/[k!(n - k)!]$. Figure 3.2 shows the
distribution $F = (f_0, f_1, \ldots, f_n)$ for $x \sim \mathrm{Bi}(n, p)$, with $n = 25$
and $p$ = .25, .50, and .90. We also write $F = \mathrm{Bi}(n, p)$ to indicate
situation (3.4).
Let $A$ be a set of integers. Then the probability that $x$ takes a
value in $A$, or more simply the probability of $A$, is
$$\mathrm{Prob}\{x \in A\} = \mathrm{Prob}\{A\} = \sum_{k \in A} f_k. \qquad (3.6)$$
For example if $A = \{1, 3, 5, \ldots, 25\}$ and $x \sim \mathrm{Bi}(25, p)$, then $\mathrm{Prob}\{A\}$
is the probability that a binomial random variable of size 25 and
probability of success $p$ equals an odd integer. Notice that since $f_k$
is the theoretical proportion of times $x$ equals $k$, the sum $\sum_{k \in A} f_k =
\mathrm{Prob}\{A\}$ is the theoretical proportion of times $x$ takes its value in
$A$.
The sample space of $x$, denoted $S_x$, is the collection of possible
values $x$ can have. For a fair die, $S_x = \{1, 2, \ldots, 6\}$, while $S_x =
\{0, 1, 2, \ldots, n\}$ for a $\mathrm{Bi}(n, p)$ distribution. By definition, $x$ occurs
in $S_x$ every time, that is, with theoretical proportion 1, so
$$\mathrm{Prob}\{S_x\} = \sum_{k \in S_x} f_k = 1. \qquad (3.7)$$
For any probability distribution on the integers the frequencies $f_k$
are nonnegative numbers summing to 1.
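Formulas (3.5) through (3.7) are easy to evaluate numerically. The sketch below, in Python with a helper name of our own (`binomial_frequencies`), computes the frequencies of Bi(25, .25), checks that they sum to 1, and adds up the frequencies over the odd integers as in the example following (3.6). No claim is made here about the numerical value of that probability beyond what the code prints.

```python
import math


def binomial_frequencies(n, p):
    """Frequencies f_0, ..., f_n of Bi(n, p), formula (3.5)."""
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k)
            for k in range(n + 1)]


f = binomial_frequencies(25, 0.25)

total = sum(f)                    # Prob{S_x}, formula (3.7)
A = range(1, 26, 2)               # the odd integers 1, 3, ..., 25
prob_odd = sum(f[k] for k in A)   # Prob{A}, formula (3.6)

print(f"sum of frequencies: {total:.6f}")
print(f"Prob{{x is odd}}: {prob_odd:.6f}")
```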
In our examples so far, the sample space $S_x$ has been a subset
of the integers. One of the convenient things about probability
distributions is that they can be defined on quite general spaces.
Consider the law school data of Figure 3.1. We might take $S_x$ to
be the positive quadrant of the plane,
$$S_x = \mathcal{R}^{2+} = \{(y, z) : y > 0, z > 0\}. \qquad (3.8)$$
(This includes values like $x = (10^6, 10^9)$, but it doesn’t hurt to let
$S_x$ be too big.) For a subset $A$ of $S_x$ we would still write $\mathrm{Prob}\{A\}$
to indicate the probability that $x$ occurs in $A$.
For example, we could take
$$A = \{(y, z) : 0 < y < 600, 0 < z < 3.0\}. \qquad (3.9)$$
A law school $x$ is in $A$ if its 1973 entering class had LSAT less than
600 and GPA less than 3.0. In this case we happen to know the
complete population $\mathcal{X}$: it is the 82 points indicated on the left
panel of Figure 3.1 and in Table 3.2. Of these, 16 are in $A$, so
$$\mathrm{Prob}\{A\} = 16/82 = .195. \qquad (3.10)$$
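Because the whole population is listed in Table 3.2, the proportion in (3.10) can be obtained by direct counting. The Python sketch below transcribes the 82 (LSAT, GPA) pairs in school order and counts those falling in the set A of (3.9); the variable names are ours. The same list also gives the population LSAT mean quoted earlier.

```python
# (LSAT, GPA) for schools 1..82, transcribed from Table 3.2.
population = [
    (622, 3.23), (542, 2.83), (579, 3.24), (653, 3.12), (606, 3.09),
    (576, 3.39), (620, 3.10), (615, 3.40), (553, 2.97), (607, 2.91),
    (558, 3.11), (596, 3.24), (635, 3.30), (581, 3.22), (661, 3.43),
    (547, 2.91), (599, 3.23), (646, 3.47), (622, 3.15), (611, 3.33),
    (546, 2.99), (614, 3.19), (628, 3.03), (575, 3.01), (662, 3.39),
    (627, 3.41), (608, 3.04), (632, 3.29), (587, 3.16), (581, 3.17),
    (605, 3.13), (704, 3.36), (477, 2.57), (591, 3.02), (578, 3.03),
    (572, 2.88), (615, 3.37), (606, 3.20), (603, 3.23), (535, 2.98),
    (595, 3.11), (575, 2.92), (573, 2.85), (644, 3.38), (545, 2.76),
    (645, 3.27), (651, 3.36), (562, 3.19), (609, 3.17), (555, 3.00),
    (586, 3.11), (580, 3.07), (594, 2.96), (594, 3.05), (560, 2.93),
    (641, 3.28), (512, 3.01), (631, 3.21), (597, 3.32), (621, 3.24),
    (617, 3.03), (637, 3.33), (572, 3.08), (610, 3.13), (562, 3.01),
    (635, 3.30), (614, 3.15), (546, 2.82), (598, 3.20), (666, 3.44),
    (570, 3.01), (570, 2.92), (605, 3.45), (565, 3.15), (686, 3.50),
    (608, 3.16), (595, 3.19), (590, 3.15), (558, 2.81), (611, 3.16),
    (564, 3.02), (575, 2.74),
]

N = len(population)

# Set A of (3.9): LSAT below 600 and GPA below 3.0.
count_A = sum(1 for lsat, gpa in population if lsat < 600 and gpa < 3.0)
prob_A = count_A / N

mean_lsat = sum(lsat for lsat, _ in population) / N

print(f"N = {N}, #A = {count_A}, Prob{{A}} = {prob_A:.3f}")
print(f"population LSAT mean = {mean_lsat:.2f}")
```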

Figure 3.2. The frequencies $f_0, f_1, \ldots, f_n$ for the binomial distributions
$\mathrm{Bi}(n, p)$, $n = 25$ and $p$ = .25, .50, and .90. The points have been
connected by lines to enhance visibility.
Here the idealized proportion Prob{A} is an actual proportion.
Only in cases where we have a complete census of the population
is it possible to directly evaluate probabilities as proportions.
The probability distribution F of x is still defined to be any
complete description of x’s probabilities. In the law school example,
F can be described as follows: for any subset A of $S_x = \mathcal{R}^{2+}$,
$$\mathrm{Prob}\{x \in A\} = \#\{X_j \in A\}/82, \tag{3.11}$$
where $\#\{X_j \in A\}$ is the number of the 82 points in the left panel
of Figure 3.1 that lie in A. Another way to say the same thing is
that F is a discrete distribution putting probability (or frequency)
1/82 on each of the indicated 82 points.
Probabilities can be defined continuously, rather than discretely
as in (3.6) or (3.11). The most famous example is the normal (or
Gaussian, or bell-shaped) distribution. A real-valued random variable
x is defined to have the normal distribution with mean $\mu$ and
variance $\sigma^2$, written
$$x \sim N(\mu, \sigma^2) \quad \text{or} \quad F = N(\mu, \sigma^2), \tag{3.12}$$
if
$$\mathrm{Prob}\{x \in A\} = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \tag{3.13}$$
for any subset A of the real line $\mathcal{R}^1$. The integral in (3.13) is over
the values of $x \in A$.
There are higher dimensional versions of the normal distribution,
which involve taking integrals similar to (3.13) over multidimensional
sets A. We won’t need continuous distributions for
development of the bootstrap (though they will appear later in
some of the applications) and will avoid mathematical derivations
based on calculus. As we shall see, one of the main incentives for the
development of the bootstrap is the desire to substitute computer
power for theoretical calculations involving special distributions.
The expectation of a real-valued random variable x, written E(x),
is its average value, where the average is taken over the possible
outcomes of x weighted according to its probability distribution F.
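Thus
$$E(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x \sim \mathrm{Bi}(n,p), \tag{3.14}$$
and
$$E(x) = \int_{-\infty}^{\infty} x \, \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \quad \text{for } x \sim N(\mu, \sigma^2). \tag{3.15}$$
It is not difficult to show that E(x) = np for x ~ Bi(n,p), and
E(x) = $\mu$ for $x \sim N(\mu, \sigma^2)$. (See Problems 3.6 and 3.7.)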
We sometimes write the expectation as $E_F(x)$, to indicate that
the average is taken with respect to the distribution F.
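As a numerical check on (3.14), the weighted average can be evaluated directly. The Python sketch below is illustrative only: the function names `binomial_frequencies` and `expectation` are hypothetical, and the values n = 25, p = .25 are borrowed from Figure 3.2 rather than prescribed by the text.

```python
from math import comb

def binomial_frequencies(n, p):
    """Frequencies f_0, ..., f_n of the Bi(n, p) distribution."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def expectation(values, frequencies):
    """Average of the values, weighted by their frequencies."""
    return sum(v * f for v, f in zip(values, frequencies))

n, p = 25, 0.25
f = binomial_frequencies(n, p)
mean = expectation(range(n + 1), f)  # compare with n * p
```

Up to floating-point rounding, `mean` agrees with np, which is the result stated after (3.15).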
Suppose r = g(x) is some function of the random variable x.
Then E(r), the expectation of r, is the theoretical average of g(x)
weighted according to the probability distribution of x. For example,
if $x \sim N(\mu, \sigma^2)$ and $r = x^3$, then
$$E(r) = \int_{-\infty}^{\infty} x^3 \, \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx. \tag{3.16}$$
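Probabilities are a special case of expectations. Let A be a subset
of $S_x$, and take $r = I_{\{x \in A\}}$, where $I_{\{x \in A\}}$ is the indicator function
$$I_{\{x \in A\}} = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases} \tag{3.17}$$
Then E(r) equals $\mathrm{Prob}\{x \in A\}$, or equivalently
$$E(I_{\{x \in A\}}) = \mathrm{Prob}\{x \in A\}. \tag{3.18}$$
For example if $x \sim N(\mu, \sigma^2)$, then
$$E(I_{\{x \in A\}}) = \int_{-\infty}^{\infty} I_{\{x \in A\}} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx, \tag{3.19}$$
which is $\mathrm{Prob}\{x \in A\}$ according to (3.13).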
The notion of an expectation as a theoretical average is very
general, and includes cases where the random variable x is not
real-valued. In the law school situation, for instance, we might
be interested in the expectation of the ratio of LSAT and GPA.
Writing x = (y,z) as in (3.8), then r = y/z, and the expectation
of r is
$$E(\mathrm{LSAT}/\mathrm{GPA}) = \frac{1}{82} \sum_{j=1}^{82} (y_j / z_j) \tag{3.20}$$
where $X_j = (y_j, z_j)$ is the jth point in Table 3.2. Numerical evaluation
of (3.20) gives E(LSAT/GPA) = 190.8.
Let $\mu_x = E_F(x)$, for x a real-valued random variable with distribution
F. The variance of x, indicated by $\sigma_x^2$ or just $\sigma^2$, is defined
to be the expected value of $y = (x - \mu_x)^2$. In other words, $\sigma_x^2$ is the
theoretical average squared distance of a random variable x from
its expectation $\mu_x$,
$$\sigma_x^2 = E_F(x - \mu_x)^2. \tag{3.21}$$
The variance of $x \sim N(\mu, \sigma^2)$ equals $\sigma^2$; the variance of
x ~ Bi(n,p) equals $np(1-p)$, see Problem 3.9. The standard deviation
of a random variable is defined to be the square root of its
variance.
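The definition (3.21) can be evaluated exactly for a discrete distribution. The following is a minimal sketch, assuming Python's `fractions` module for exact arithmetic; the helper names are hypothetical, and n = 25, p = .90 are taken from Figure 3.2 only as a convenient example.

```python
from fractions import Fraction
from math import comb

def binomial_frequencies(n, p):
    """Frequencies f_0, ..., f_n of the Bi(n, p) distribution."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def mean_and_variance(values, frequencies):
    """Expectation and variance (3.21) of a discrete distribution."""
    values = list(values)
    mu = sum(v * f for v, f in zip(values, frequencies))
    var = sum((v - mu) ** 2 * f for v, f in zip(values, frequencies))
    return mu, var

n, p = 25, Fraction(9, 10)
mu, var = mean_and_variance(range(n + 1), binomial_frequencies(n, p))
sd = float(var) ** 0.5  # standard deviation: square root of the variance
```

Because the arithmetic is exact, `var` can be compared directly with the closed form np(1 - p) quoted above.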
Two random variables y and z are said to be independent if
$$E[g(y)h(z)] = E[g(y)]\,E[h(z)] \tag{3.22}$$
for all functions g(y) and h(z). Independence is well named: (3.22)
implies that the random outcome of y doesn’t affect the random
outcome of z, and vice-versa.
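To see this, let B and C be subsets of $S_y$ and $S_z$ respectively,
the sample spaces of y and z, and take g and h to be the indicator
functions $g(y) = I_{\{y \in B\}}$ and $h(z) = I_{\{z \in C\}}$. Notice that
$$I_{\{y \in B\}} I_{\{z \in C\}} = \begin{cases} 1 & \text{if } y \in B \text{ and } z \in C \\ 0 & \text{otherwise.} \end{cases} \tag{3.23}$$
So $I_{\{y \in B\}} I_{\{z \in C\}}$ is the indicator function of the intersection
$\{y \in B\} \cap \{z \in C\}$. Then by (3.18) and the independence definition
(3.22),
$$\mathrm{Prob}\{(y, z) \in B \cap C\} = E(I_{\{y \in B\}} I_{\{z \in C\}}) = E(I_{\{y \in B\}})\,E(I_{\{z \in C\}}) = \mathrm{Prob}\{y \in B\}\,\mathrm{Prob}\{z \in C\}. \tag{3.24}$$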
Looking at Figure 3.1, we can see that (3.24) does not hold for
the law school example, see Problem 3.10, so LSAT and GPA are
not independent.
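The failure of (3.24) for the law school population can be checked with the counts quoted in this chapter: 16 of the 82 schools lie in the set A of (3.9), while Problem 3.10 reports 43 schools with LSAT below 600 and 17 with GPA below 3.0. The short Python sketch below uses only those three counts; the variable names are hypothetical.

```python
from fractions import Fraction

N = 82                        # size of the law school population
prob_B = Fraction(43, N)      # Prob{LSAT < 600}, count from Problem 3.10
prob_C = Fraction(17, N)      # Prob{GPA < 3.0}, count from Problem 3.10
prob_joint = Fraction(16, N)  # Prob{LSAT < 600 and GPA < 3.0}, from (3.10)

product = prob_B * prob_C     # what (3.24) would require under independence
independent = (prob_joint == product)
```

A single pair of sets B and C for which the joint probability differs from the product is enough to rule out independence.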
Whether or not y and z are independent, expectations follow the
simple addition rule
$$E[g(y) + h(z)] = E[g(y)] + E[h(z)]. \tag{3.25}$$
In general,
$$E\left[\sum_{i=1}^{n} g_i(x_i)\right] = \sum_{i=1}^{n} E[g_i(x_i)] \tag{3.26}$$
for any functions $g_i$ of any n random variables $x_1, x_2, \ldots, x_n$.
Random sampling with replacement guarantees independence: if
$x = (x_1, x_2, \ldots, x_n)$ is a random sample of size n from a population
$\mathcal{X}$, then all n observations $x_i$ are identically distributed and
mutually independent of each other. In other words, all of the $x_i$
have the same probability distribution F, and
$$E_F[g_1(x_1) g_2(x_2) \cdots g_n(x_n)] = E_F[g_1(x_1)]\,E_F[g_2(x_2)] \cdots E_F[g_n(x_n)] \tag{3.27}$$
for any functions $g_1, g_2, \ldots, g_n$. (This is almost a definition of what
random sampling means.) We will write
$$F \to (x_1, x_2, \ldots, x_n) \tag{3.28}$$
to indicate that $x = (x_1, x_2, \ldots, x_n)$ is a random sample of size n
from a population with probability distribution F. This is sometimes
written as
$$x_i \overset{\text{i.i.d.}}{\sim} F, \qquad i = 1, 2, \ldots, n, \tag{3.29}$$
where i.i.d. stands for independent and identically distributed.
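In practice a random sample such as (3.28) is produced by drawing independent, uniformly distributed indices and reading off the corresponding population members. The sketch below is one way this might look in Python; the function `random_sample`, the small population, and the fixed seed are all assumptions made for illustration.

```python
import random

def random_sample(population, n, rng):
    """Draw n members with replacement using i.i.d. uniform indices."""
    N = len(population)
    indices = [rng.randrange(N) for _ in range(n)]  # each uniform on 0..N-1
    return [population[j] for j in indices]

rng = random.Random(0)                 # fixed seed, for reproducibility only
population = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical population, N = 8
x = random_sample(population, 5, rng)
```

Because sampling is with replacement, the same member can appear more than once, and n may exceed N.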
3.4 Problems
3.1 A random sample of size n is taken with replacement from
a population of size N. Show that the probability of having
no repetitions in the sample is given by the product
$$\prod_{j=0}^{n-1} \left(1 - \frac{j}{N}\right).$$
3.2 Why might you suspect that the sample of 15 law schools in
Table 3.1 was obtained by sampling without replacement,
rather than with replacement?
3.3 The mean GPA for all 82 law schools is 3.13. How does this
compare with the mean GPA for the observed sample of 15
law schools in Table 3.1? Is this difference compatible with
the estimated standard error (2.2)?
3.4 Denote the mean and standard deviation of a set of numbers
$X_1, X_2, \ldots, X_N$ by $\bar{X}$ and S respectively, where
$$\bar{X} = \sum_{j=1}^{N} X_j / N, \qquad S = \Big\{ \sum_{j=1}^{N} (X_j - \bar{X})^2 / N \Big\}^{1/2}.$$
(a) A sample $x_1, x_2, \ldots, x_n$ is selected from $X_1, X_2, \ldots, X_N$
by random sampling with replacement. Denote the standard
deviation of the sample average $\bar{x} = \sum_{i=1}^{n} x_i / n$,
usually called the standard error of $\bar{x}$, by se($\bar{x}$). Use a
basic result of probability theory to show that
$$\mathrm{se}(\bar{x}) = \frac{S}{\sqrt{n}}.$$
(b) † Suppose instead that $x_1, x_2, \ldots, x_n$ is selected by
random sampling without replacement (so we must have
$n \le N$), show that
$$\mathrm{se}(\bar{x}) = \frac{S}{\sqrt{n}} \left[ \frac{N-n}{N-1} \right]^{1/2}.$$
(c) We see that sampling without replacement gives a
smaller standard error for $\bar{x}$. Proportionally how much
smaller will it be in the case of the law school data?
3.5 Given a random sample $x_1, x_2, \ldots, x_n$, the empirical probability
of a set A is defined to be the proportion of the sample
in A, written
$$\widehat{\mathrm{Prob}}\{A\} = \#\{x_i \in A\}/n. \tag{3.30}$$
(a) Find $\widehat{\mathrm{Prob}}\{A\}$ for the data in Table 3.1, with A as
given in (3.9).
(b) The standard error of an empirical probability is
$[\mathrm{Prob}\{A\} \cdot (1 - \mathrm{Prob}\{A\})/n]^{1/2}$. How many standard errors
is $\widehat{\mathrm{Prob}}\{A\}$ from Prob{A}, given in (3.10)?
3.6 A very simple probability distribution F puts probability on
only two outcomes, 0 or 1, with frequencies
$$f_0 = 1 - p, \qquad f_1 = p. \tag{3.31}$$
This is called the Bernoulli distribution. Here p is a number
between 0 and 1. If $x_1, \ldots, x_n$ is a random sample from F,
then elementary probability theory tells us that the sum
$$s = x_1 + x_2 + \cdots + x_n \tag{3.32}$$
has the binomial distribution (3.5),
$$s \sim \mathrm{Bi}(n, p). \tag{3.33}$$
(a) Show that the empirical probability (3.30) satisfies
$$n \cdot \widehat{\mathrm{Prob}}\{A\} \sim \mathrm{Bi}(n, \mathrm{Prob}\{A\}). \tag{3.34}$$
(Expression (3.34) can also be written as
$\widehat{\mathrm{Prob}}\{A\} \sim \mathrm{Bi}(n, \mathrm{Prob}\{A\})/n$.)
(b) Prove that if x ~ Bi(n,p), then E(x) = np.
3.7 Without using calculus, give a symmetry argument to show
that E(x) = $\mu$ for $x \sim N(\mu, \sigma^2)$.

3.8 Suppose that y and z are independent random variables,
with variances $\sigma_y^2$ and $\sigma_z^2$.
(a) Show that the variance of y + z is the sum of the
variances
$$\sigma_{y+z}^2 = \sigma_y^2 + \sigma_z^2. \tag{3.35}$$
(In general, the variance of the sum is the sum of the variances
for independent random variables $x_1, x_2, \ldots, x_n$.)
(b) Suppose $F \to (x_1, x_2, \ldots, x_n)$ where the probability
distribution F has expectation $\mu$ and variance $\sigma^2$. Show
that $\bar{x}$ has expectation $\mu$ and variance $\sigma^2/n$.
3.9 Use the results in Problems 3.6 and 3.8 to show that
$\sigma^2 = np(1-p)$ for x ~ Bi(n,p).
3.10 Forty-three of the 82 points in Table 3.2 have LSAT < 600;
17 of the 82 points have GPA < 3.0. Why do we know that
LSAT and GPA are not independent?
3.11 In the discussion of random sampling, $j_1, j_2, \ldots, j_n$ were
taken to be independent integers having a uniform distribution
on the numbers 1, 2, ..., N. That is, $j_1, j_2, \ldots, j_n$ is
itself a random sample, say
$$F_{1:N} \to (j_1, j_2, \ldots, j_n), \tag{3.36}$$
where $F_{1:N}$ is the discrete distribution having frequencies
$f_j = 1/N$, for j = 1, 2, ..., N. In practice, we depend on
our computer’s random number generator to give us (3.36).
If (3.36) holds, then a random sample as defined in this
chapter has the “i.i.d.” property defined in (3.29). Give a
brief argument why this is so.
† Indicates a difficult or more advanced problem.

CHAPTER 4
The empirical distribution
function and the plug-in
principle
4.1 Introduction
Problems of statistical inference often involve estimating some aspect
of a probability distribution F on the basis of a random sample
drawn from F. The empirical distribution function, which we will
call $\hat{F}$, is a simple estimate of the entire distribution F. An obvious
way to estimate some interesting aspect of F, like its mean
or median or correlation, is to use the corresponding aspect of $\hat{F}$.
This is the “plug-in principle.” The bootstrap method is a direct
application of the plug-in principle, as we shall see in Chapter 6.
4.2 The empirical distribution function
Having observed a random sample of size n from a probability
distribution F,
$$F \to (x_1, x_2, \ldots, x_n), \tag{4.1}$$
the empirical distribution function $\hat{F}$ is defined to be the discrete
distribution that puts probability 1/n on each value $x_i$,
i = 1, 2, ..., n. In other words, $\hat{F}$ assigns to a set A in the sample space
of x its empirical probability
$$\widehat{\mathrm{Prob}}\{A\} = \#\{x_i \in A\}/n, \tag{4.2}$$
the proportion of the observed sample $x = (x_1, x_2, \ldots, x_n)$ occurring
in A. We will also write $\mathrm{Prob}_{\hat{F}}\{A\}$ to indicate (4.2). The
hat symbol “^” always indicates quantities calculated from the
observed data.

Table 4.1. A random sample of 100 rolls of the die. The outcomes
1,2,3,4,5,6 occurred 13,19,10,17,14,27 times, respectively, so the empirical
distribution is (.13, .19, .10, .17, .14, .27).
6 3 2 4 6 6 6 5 3 6 2 2 6 2 3 1 5 1
6 6 4 1 5 3 6 6 4 1 4 2 5 6 6 5 5 3
6 2 6 6 1 4 1 5 6 1 6 3 3 2 2 2 5 2
2 4 1 4 5 6 6 6 2 2 4 6 1 2 2 2 5 1
5 3 5 4 2 1 4 6 6 5 6 4 6 4 3 6 4 1
4 5 4 4 2 3 2 1 4 6
Consider the law school sample of size n = 15, shown in Table 3.1
and in the right panel of Figure 3.1. The empirical distribution $\hat{F}$
puts probability 1/15 on each of the 15 data points. Five of the 15
points lie in the set $A = \{(y, z) : 0 < y < 600, \; 0 < z < 3.00\}$,
so $\widehat{\mathrm{Prob}}\{A\} = 5/15 = .333$. Notice that we get a different empirical
probability for the set $\{0 < y < 600, \; 0 < z \le 3.00\}$, since one of
the 15 data points has GPA = 3.00, LSAT < 600.
Table 4.1 shows a random sample of n = 100 rolls of a die:
$x_1 = 6, x_2 = 3, x_3 = 2, \ldots, x_{100} = 6$. The empirical distribution $\hat{F}$
puts probability 1/100 on each of the 100 outcomes. In cases like
this, where there are repeated values, we can express $\hat{F}$ more economically
as the vector of observed frequencies $\hat{f}_k$, k = 1, 2, ..., 6,
$$\hat{f}_k = \#\{x_i = k\}/n. \tag{4.3}$$
For the data in Table 4.1, $\hat{F}$ = (.13, .19, .10, .17, .14, .27).
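The reduction from the 100 individual rolls to the frequency vector (4.3) is a simple tally. The Python sketch below transcribes the rolls from Table 4.1; the order of a few entries within a row may differ from the printed table, which does not affect the counts. The names `rolls`, `counts` and `f_hat` are hypothetical.

```python
from collections import Counter

# The 100 die rolls of Table 4.1 (transcribed; within-row order is approximate)
rolls = [
    6, 3, 2, 4, 6, 6, 6, 5, 3, 6, 2, 2, 6, 2, 3, 1, 5, 1,
    6, 6, 4, 1, 5, 3, 6, 6, 4, 1, 4, 2, 5, 6, 6, 5, 5, 3,
    6, 2, 6, 6, 1, 4, 1, 5, 6, 1, 6, 3, 3, 2, 2, 2, 5, 2,
    2, 4, 1, 4, 5, 6, 6, 6, 2, 2, 4, 6, 1, 2, 2, 2, 5, 1,
    5, 3, 5, 4, 2, 1, 4, 6, 6, 5, 6, 4, 6, 4, 3, 6, 4, 1,
    4, 5, 4, 4, 2, 3, 2, 1, 4, 6,
]

n = len(rolls)
counts = Counter(rolls)                       # number of times each face occurs
f_hat = [counts[k] / n for k in range(1, 7)]  # observed frequencies, as in (4.3)
```

The list `f_hat` carries the same empirical distribution as the full list of rolls, only in a shorter form.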
An empirical distribution is a list of the values taken on by the
sample $x = (x_1, x_2, \ldots, x_n)$, along with the proportion of times
each value occurs. Often each value occurring in the sample appears
only once, as with the law data. Repetitions, as with the die of
Table 4.1, allow the list to be shortened. In either case each of
the n data points is assigned probability 1/n by the empirical
distribution.
Is it obvious that we have not lost information in going from the
full data set $(x_1, x_2, \ldots, x_{100})$ in Table 4.1 to the reduced representation
in terms of the frequencies? No, but it is true. It can be
proved that the vector of observed frequencies $\hat{F} = (\hat{f}_1, \hat{f}_2, \ldots)$ is
a sufficient statistic for the true distribution $F = (f_1, f_2, \ldots)$. This
means that all of the information about F contained in x is also
contained in $\hat{F}$.

Table 4.2. Rainfall data. The yearly rainfall, in inches, in Nevada City,
California, 1873 through 1978. An example of time series data.
          0    1    2    3    4    5    6    7    8    9
1870                    80   40   65   46   68   32   58
1880     60   61   60   45   48   63   44   66   39   35
1890     44  104   36   45   69   50   72   57   53   30
1900     40   56   55   46   46   72   50   68   71   37
1910     64   46   69   31   33   61   56   55   40   37
1920     40   34   60   54   52   20   49   43   62   44
1930     33   45   30   53   32   38   56   63   52   79
1940     30   62   75   70   60   34   54   51   35   53
1950     44   53   73   80   54   52   40   77   52   75
1960     42   43   39   54   70   40   73   41   75   43
1970     80   60   59   41   67   83   56   29   21
The sufficiency theorem assumes that the data have been generated
by random sampling from some distribution F. This is certainly
not always true. For example the mouse data of Table 2.1
involve two probability distributions, one for Treatment and one for
Control. Table 4.2 shows a time series of 106 numbers: the annual
rainfall in Nevada City, California from 1873 through 1978. We
could calculate the empirical distribution $\hat{F}$ for this data set, but
it would not include any of the time series information, for example,
whether high numbers follow high numbers. Later, in Chapter 8, we will
see how to apply bootstrap methods to situations like the rainfall
data. For now we are restricting attention to data obtained by random
sampling from a single distribution, the so-called one-sample
situation. This is not as restrictive as it sounds. In the mouse data
example, for instance, we can apply one-sample results separately
to the Treatment and Control populations.
In applying statistical theory to real problems, the answers to
questions of interest are usually phrased in terms of probability
distributions. We might ask if the die giving the data in Table 4.1
is fair. This is equivalent to asking if the die’s probability distribution
F equals (1/6,1/6,1/6,1/6,1/6,1/6). In the law school example,
the question might be how correlated are LSAT and GPA. In
terms of F, the distribution of x = (y,z) = (LSAT, GPA), this is
a question about the value of the population correlation coefficient
$$\mathrm{corr}(y, z) = \frac{\sum_{j=1}^{82} (Y_j - \mu_y)(Z_j - \mu_z)}{\left[ \sum_{j=1}^{82} (Y_j - \mu_y)^2 \sum_{j=1}^{82} (Z_j - \mu_z)^2 \right]^{1/2}}, \tag{4.4}$$
where $(Y_j, Z_j)$ is the jth point in the law school population $\mathcal{X}$, and
$\mu_y = \sum_{j=1}^{82} Y_j / 82$, $\mu_z = \sum_{j=1}^{82} Z_j / 82$.
When the probability distribution F is known (i.e. when we have
a complete census of the population $\mathcal{X}$), answering such questions
involves no more than arithmetic. For the law school population,
the census in Table 3.2 gives $\mu_y = 597.5$, $\mu_z = 3.13$, and
$$\mathrm{corr}(y, z) = .761. \tag{4.5}$$
This is the original definition of “statistics.” Usually we don’t have
a census. Then we need statistical inference, the more modern statistical
theory for inferring properties of F from a random sample
x.
If we had available only the law school sample of size 15, Table 3.1,
we could estimate corr(y,z) by the sample correlation coefficient
$$\widehat{\mathrm{corr}}(y, z) = \frac{\sum_{i=1}^{15} (y_i - \hat{\mu}_y)(z_i - \hat{\mu}_z)}{\left[ \sum_{i=1}^{15} (y_i - \hat{\mu}_y)^2 \sum_{i=1}^{15} (z_i - \hat{\mu}_z)^2 \right]^{1/2}} \tag{4.6}$$
where $(y_i, z_i)$ is the ith point in Table 3.1, i = 1, 2, ..., 15, and
$\hat{\mu}_y = \sum_{i=1}^{15} y_i / 15$, $\hat{\mu}_z = \sum_{i=1}^{15} z_i / 15$. Table 3.1 gives $\hat{\mu}_y = 600.3$,
$\hat{\mu}_z = 3.09$, and
$$\widehat{\mathrm{corr}}(y, z) = .776. \tag{4.7}$$
Here is another example of a plug-in estimate. Suppose we are
interested in estimating the probability of an LSAT score greater
than 600, that is
$$\theta = \frac{1}{82} \sum_{j=1}^{82} I_{\{Y_j > 600\}}. \tag{4.8}$$
Since 39 of the 82 LSAT scores exceed 600, $\theta$ = 39/82 = 0.48. The
plug-in estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{15} \sum_{i=1}^{15} I_{\{y_i > 600\}}, \tag{4.9}$$
the sample proportion of LSAT scores above 600. Six of the 15
LSAT scores exceed 600, so $\hat{\theta}$ = 6/15 = 0.4.
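Both plug-in estimates for the law school sample, the correlation (4.6) and the proportion (4.9), can be recomputed in a few lines. The sketch below is illustrative: the 15 (LSAT, GPA) pairs are transcribed from Table 3.1, which is not reproduced in this section, so they should be checked against that table; the function name `plug_in_correlation` is hypothetical.

```python
from math import sqrt

# (LSAT, GPA) for the 15 sampled law schools; transcribed from Table 3.1
law_sample = [
    (576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
    (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
    (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96),
]

def plug_in_correlation(pairs):
    """Sample correlation coefficient, as in (4.6)."""
    n = len(pairs)
    mu_y = sum(y for y, _ in pairs) / n
    mu_z = sum(z for _, z in pairs) / n
    s_yz = sum((y - mu_y) * (z - mu_z) for y, z in pairs)
    s_yy = sum((y - mu_y) ** 2 for y, _ in pairs)
    s_zz = sum((z - mu_z) ** 2 for _, z in pairs)
    return s_yz / sqrt(s_yy * s_zz)

corr_hat = plug_in_correlation(law_sample)
# Sample proportion of LSAT scores above 600, as in (4.9)
theta_hat = sum(1 for y, _ in law_sample if y > 600) / len(law_sample)
```

With these values `corr_hat` rounds to the .776 of (4.7), and `theta_hat` equals 6/15.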
For the die of Table 4.1, we don’t have census data but only the
sample x, so any questions about the fairness of the die must be
answered by inference from the empirical frequencies
$$\hat{F} = (\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_6) = (.13, .19, .10, .17, .14, .27). \tag{4.10}$$
Discussions of statistical inference are phrased in terms of parameters
and statistics. A parameter is a function of the probability
distribution F. A statistic is a function of the sample x. Thus
corr(y, z), (4.4), is a parameter of F, while $\widehat{\mathrm{corr}}(y, z)$, (4.6), is a
statistic based on x. Similarly $f_k$ is a parameter of F in the die
example, while $\hat{f}_k$ is a statistic, k = 1, 2, 3, ..., 6.
We will sometimes write parameters directly as functions of F,
say
$$\theta = t(F). \tag{4.11}$$
This notation emphasizes that the value $\theta$ of the parameter is obtained
by applying some numerical evaluation procedure $t(\cdot)$ to the
distribution function F. For example if F is a probability distribution
on the real line, the expectation can be thought of as the
parameter
$$\theta = t(F) = E_F(x). \tag{4.12}$$
Here t(F) gives $\theta$ by the expectation process, that is, the average
value of x weighted according to F. For a given distribution F such
as F = Bi(n,p) we can evaluate t(F) = np. Even if F is unknown,
the form of t(F) tells us the functional mapping that inputs F and
outputs $\theta$.
4.3 The plug-in principle
The plug-in principle is a simple method of estimating parameters
from samples. The plug-in estimate of a parameter $\theta = t(F)$ is
defined to be
$$\hat{\theta} = t(\hat{F}). \tag{4.13}$$
In other words, we estimate the function $\theta = t(F)$ of the probability
distribution F by the same function of the empirical distribution
$\hat{F}$, $\hat{\theta} = t(\hat{F})$. (Statistics like (4.13) that are used to estimate parameters
are sometimes called summary statistics, as well as estimates
and estimators.)
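One way to picture (4.13) is to write t as a function that accepts any discrete distribution, then call it once with F and once with the empirical distribution. In the Python sketch below, a distribution is represented as a dictionary mapping values to probabilities; that representation, the function name `t`, and the choice of a fair die as the "true" F are assumptions made purely for illustration. The counts are those of Table 4.1.

```python
from fractions import Fraction

def t(dist):
    """The parameter t(F) = E_F(x) for a distribution {value: probability}."""
    return sum(value * prob for value, prob in dist.items())

# Hypothetical true distribution: a fair die
F_fair = {k: Fraction(1, 6) for k in range(1, 7)}

# Empirical distribution of the 100 rolls in Table 4.1
counts = {1: 13, 2: 19, 3: 10, 4: 17, 5: 14, 6: 27}
F_hat = {k: Fraction(c, 100) for k, c in counts.items()}

theta = t(F_fair)     # parameter under the assumed fair die
theta_hat = t(F_hat)  # plug-in estimate: same function, empirical distribution
```

The same function `t` is applied in both calls; only its argument changes, which is the whole content of the plug-in principle.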
We have already used the plug-in principle in estimating $f_k$ by
$\hat{f}_k$, and in estimating corr(y, z) by $\widehat{\mathrm{corr}}(y, z)$. To see this, note that
our law school population F can be written as $F = (f_1, f_2, \ldots, f_{82})$
where each $f_j$, the probability of the jth law school, has value 1/82.
This is the probability distribution on $\mathcal{X}$, the 82 law school pairs.
The population correlation coefficient can be written as
$$\mathrm{corr}(y, z) = \frac{\sum_{j=1}^{82} f_j (Y_j - \mu_y)(Z_j - \mu_z)}{\left[ \sum_{j=1}^{82} f_j (Y_j - \mu_y)^2 \sum_{j=1}^{82} f_j (Z_j - \mu_z)^2 \right]^{1/2}} \tag{4.14}$$
where
$$\mu_y = \sum_{j=1}^{82} f_j Y_j, \qquad \mu_z = \sum_{j=1}^{82} f_j Z_j. \tag{4.15}$$
Setting each $f_j = 1/82$ gives expression (4.4). Now for our sample
$(x_1, x_2, \ldots, x_{15})$, the sample frequency $\hat{f}_j$ is the proportion of sample
points equal to $X_j$:
$$\hat{f}_j = \#\{x_i = X_j\}/15, \qquad j = 1, 2, \ldots, 82. \tag{4.16}$$
For the sample of Table 3.1, $\hat{f}_1 = 0$, $\hat{f}_2 = 0$, $\hat{f}_3 = 0$, $\hat{f}_4 = 1/15$ etc.
Now plugging these values $\hat{f}_j$ into expressions (4.15) and (4.14)
gives $\hat{\mu}_y$, $\hat{\mu}_z$ and $\widehat{\mathrm{corr}}(y, z)$ respectively. That is, $\hat{\mu}_y$, $\hat{\mu}_z$ and $\widehat{\mathrm{corr}}(y, z)$
are plug-in estimates of $\mu_y$, $\mu_z$ and corr(y, z).
In general, the plug-in estimate of an expectation $\theta = E_F(x)$ is
$$\hat{\theta} = E_{\hat{F}}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}. \tag{4.17}$$
How good is the plug-in principle? It is usually quite good, if
the only available information about F comes from the sample
x. Under this circumstance $\hat{\theta} = t(\hat{F})$ cannot be improved upon
as an estimator of $\theta = t(F)$, at least not in the usual asymptotic
($n \to \infty$) sense of statistical theory. For example if $\hat{f}_k$ is the plug-in
frequency estimate $\#\{x_i = k\}/n$, then
$$\hat{f}_k \sim \mathrm{Bi}(n, f_k)/n \tag{4.18}$$
as in Problem 3.6. In this case the estimator $\hat{f}_k$ is unbiased for
$f_k$, $E(\hat{f}_k) = f_k$, with variance $f_k(1 - f_k)/n$. This is the smallest
possible variance for an unbiased estimator of $f_k$.

We will use the bootstrap to study the bias and standard error
of the plug-in estimate $\hat{\theta} = t(\hat{F})$. The bootstrap’s virtue is that
it produces biases and standard errors in an automatic way, no
matter how complicated the functional mapping $\theta = t(F)$ may be.
We will see that the bootstrap itself is an application of the plug-in
principle.
The plug-in principle is less good in situations where there is
information about F other than that provided by the sample x. We
might know, or assume, that F is a member of a parametric family,
like the family of multivariate normal distributions. Or we might
be in a regression situation, where we have available a collection
of random samples x(z) depending on a predictor variable z. Then
even if we are only interested in $F_{z_0}$, the distribution function for
some specific value $z_0$ of z, there may be information about $F_{z_0}$
in the other samples x(z), especially those for which z is near $z_0$.
Regression models are discussed in Chapters 7 and 9.
The plug-in principle and the bootstrap can be adapted to parametric
families and to regression models. See Section 6.5 of Chapter
6 and Chapter 9. For the next few chapters we assume that we are
in the situation where we have only the one random sample x from
a completely unknown distribution F. This is called the one-sample
nonparametric setup.
4.4 Problems
4.1 Say carefully why the plug-in estimate of the expectation of
a real-valued random variable is $\bar{x}$, the sample average.
4.2 We would like to estimate the variance $\sigma^2$ of a real-valued random
variable x, having observed a random sample
$x_1, x_2, \ldots, x_n$. What is the plug-in estimate of $\sigma^2$?
4.3 (a) Show that the standard error of an empirical frequency
$\hat{f}_k$ is $\sqrt{f_k(1 - f_k)/n}$. (You can use the result in Problem
3.5b.)
(b) Do you believe that the die used to generate Table 4.1
is fair?
4.4 Suppose a random variable x has possible values 1, 2, 3, ....
Let A be a subset of the positive integers.
(a) Show that $\widehat{\mathrm{Prob}}\{A\} = \sum_{k \in A} \hat{f}_k$.
(b) Compare Problems 4.3a and 3.5b, and conclude that
the observed frequencies $\hat{f}_k$ are not independent of each
other.
(c) Say in words why the observed frequencies aren’t independent.

CHAPTER 5
Standard errors and estimated
standard errors
5.1 Introduction
Summary statistics such as $\hat{\theta} = t(\hat{F})$ are often the first outputs of
a data analysis. The next thing we want to know is the accuracy of
$\hat{\theta}$. The bootstrap provides accuracy estimates by using the plug-in
principle to estimate the standard error of a summary statistic.
This is the subject of Chapter 6. First we will discuss estimation
of the standard error of a mean, where the plug-in principle can
be carried out explicitly.
5.2 The standard error of a mean
Suppose that x is a real-valued random variable with probability
distribution F. Let us denote the expectation and variance of F
by the symbols $\mu_F$ and $\sigma_F^2$ respectively,
$$\mu_F = E_F(x), \qquad \sigma_F^2 = \mathrm{var}_F(x) = E_F[(x - \mu_F)^2]. \tag{5.1}$$
These are the quantities called $\mu_x$ and $\sigma_x^2$ in Chapter 3. Here
we are emphasizing the dependence on F. The alternative notation
“$\mathrm{var}_F(x)$” for the variance, sometimes abbreviated to var(x),
means the same thing as $\sigma_F^2$. In what follows we will sometimes
write
$$x \sim (\mu_F, \sigma_F^2) \tag{5.2}$$
to indicate concisely the expectation and variance of x.
Now let $(x_1, \ldots, x_n)$ be a random sample of size n from the distribution
F. The mean of the sample $\bar{x} = \sum_{i=1}^{n} x_i / n$ has expectation
$\mu_F$ and variance $\sigma_F^2/n$,
$$\bar{x} \sim (\mu_F, \sigma_F^2/n). \tag{5.3}$$

In other words, the expectation of $\bar{x}$ is the same as the expectation
of a single x, but the variance of $\bar{x}$ is 1/n times the variance of x.
See Problem 3.8b. This is the reason for taking averages: the larger
n is, the smaller var($\bar{x}$) is, so bigger n means a better estimate of
$\mu_F$.
The standard error of the mean $\bar{x}$, written $\mathrm{se}_F(\bar{x})$ or se($\bar{x}$), is the
square root of the variance of $\bar{x}$,
$$\mathrm{se}_F(\bar{x}) = [\mathrm{var}_F(\bar{x})]^{1/2} = \sigma_F/\sqrt{n}. \tag{5.4}$$
Standard error is a general term for the standard deviation of a
summary statistic.¹ Standard errors are the most common way of indicating
statistical accuracy. Roughly speaking, we expect $\bar{x}$ to be less than
one standard error away from $\mu_F$ about 68% of the time, and less
than two standard errors away from $\mu_F$ about 95% of the time.
These percentages are based on the central limit theorem. Under
quite general conditions on F, the distribution of $\bar{x}$ will be
approximately normal as n gets large, which we can write as
$$\bar{x} \;\dot{\sim}\; N(\mu_F, \sigma_F^2/n). \tag{5.5}$$
The expectation $\mu_F$ and variance $\sigma_F^2/n$ in (5.5) are exact, only the
normality being approximate. Using (5.5), a table of the normal
distribution gives
$$\mathrm{Prob}\Big\{|\bar{x} - \mu_F| < \frac{\sigma_F}{\sqrt{n}}\Big\} \doteq .683, \qquad \mathrm{Prob}\Big\{|\bar{x} - \mu_F| < \frac{2\sigma_F}{\sqrt{n}}\Big\} \doteq .954, \tag{5.6}$$
as illustrated in Figure 5.1. One of the advantages of the bootstrap
is that we do not have to rely entirely on the central limit
theorem. Later we will see how to get accuracy statements like
(5.6) directly from the data (see Chapters 12-14 on bootstrap confidence
intervals). It will then be clear that (5.6), which is correct
for large values of n, can sometimes be quite inaccurate for the
sample size actually available. Keeping this in mind, it is still true
that the standard error of an estimate usually gives a good idea of
its accuracy.
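The two probabilities in (5.6) come from the standard normal distribution and can be reproduced without a table. The following Python sketch uses the error function from the standard library; the helper name `prob_within` is hypothetical.

```python
from math import erf, sqrt

def prob_within(c):
    """Prob{|Z| < c} for a standard normal random variable Z."""
    return erf(c / sqrt(2))

within_one = prob_within(1)  # within one standard error of the mean
within_two = prob_within(2)  # within two standard errors of the mean
```

Rounded to three decimals, these agree with the values .683 and .954 in (5.6). They describe the normal approximation only, not the exact distribution of the sample mean for finite n.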
¹ In some books, the term “standard error” is used to denote an estimated
standard deviation, that is, an estimate of $\sigma_F$ based on the data. That
differs from our usage of the term.
Figure 5.1. For large values of n, the mean $\bar{x}$ of a random sample from F
will have an approximate normal distribution with mean $\mu_F$ and variance
$\sigma_F^2/n$.
A simple example shows the limitations of the central limit theorem
approximation. Suppose that F is a distribution that puts
probability on only two outcomes, 0 or 1, as in Problem 3.6, say
$$\mathrm{Prob}_F\{x = 1\} = p \quad \text{and} \quad \mathrm{Prob}_F\{x = 0\} = 1 - p. \tag{5.7}$$
Here p is a parameter of F, often called the probability of success,
having a value between 0 and 1. A random sample
$F \to (x_1, x_2, \ldots, x_n)$ can be thought of as n independent flips of a coin
having probability of success (or of “heads”, or of x = 1) equaling
p. Then the sum $s = \sum_{i=1}^{n} x_i$ is the number of successes in n
independent flips of the coin; s has the binomial distribution (3.3),
$$s \sim \mathrm{Bi}(n, p). \tag{5.8}$$
The average $\bar{x} = s/n$ equals $\hat{p}$, the plug-in estimate of p. Distribution
(5.7) has $\mu_F = p$, $\sigma_F^2 = p(1-p)$, so (5.3) gives
$$\hat{p} \sim (p, \; p(1-p)/n) \tag{5.9}$$
for the mean and variance of $\hat{p}$. In other words, $\hat{p}$ is an unbiased
estimate of p, $E(\hat{p}) = p$, with standard error
$$\mathrm{se}(\hat{p}) = \left[ \frac{p(1-p)}{n} \right]^{1/2}. \tag{5.10}$$
Figure 5.2 shows the central limit theorem working for the binomial
distribution with n = 25, p = .25 and p = .90. (Problem
5.3 says what is actually plotted in Figure 5.2.) The central limit
theorem gives a good approximation to the binomial distribution
for n = 25, p = .25, but is somewhat less good for n = 25, p = .9.
Figure 5.2. Comparison of the binomial distribution with the normal
distribution suggested by the central limit theorem; n = 25, p = .25 and
p = .90. The smooth curves are the normal densities, see Problem 5.3;
circles indicate the binomial probabilities (3.5). The approximation is
good for p = .25, but is somewhat off for p = .90.
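The comparison behind Figure 5.2 can be approximated numerically. The Python sketch below evaluates the standard error (5.10) and measures the largest gap between the binomial probabilities and the matching normal density, for the two cases n = 25, p = .25 and p = .90. The function names and the use of the maximum absolute gap as a summary are assumptions made for illustration; the figure itself is a plot, not a single number.

```python
from math import comb, exp, pi, sqrt

def binomial_prob(k, n, p):
    """Probability of k successes under Bi(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def clt_approx(k, n, p):
    """Normal density with mean np and variance np(1-p), evaluated at k."""
    var = n * p * (1 - p)
    return exp(-0.5 * (k - n * p) ** 2 / var) / sqrt(2 * pi * var)

def se_p_hat(n, p):
    """Standard error of the sample proportion, formula (5.10)."""
    return sqrt(p * (1 - p) / n)

def max_gap(n, p):
    """Largest absolute difference between the two curves over k = 0..n."""
    return max(abs(binomial_prob(k, n, p) - clt_approx(k, n, p))
               for k in range(n + 1))

gap_25 = max_gap(25, 0.25)
gap_90 = max_gap(25, 0.90)
```

Under this summary the gap for p = .90 is larger than for p = .25, consistent with the statement that the approximation is less good when p is near 1.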
5.3 Estimating the standard error of the mean
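Suppose that we have in hand a random sample of numbers
$F \to x_1, x_2, \ldots, x_n$, such as the n = 9 Control measurements for the
mouse data of Table 2.1. We compute the estimate $\bar{x}$ for the expectation
$\mu_F$, equaling 56.22 for the mouse data, and want to know
the standard error of $\bar{x}$. Formula (5.4), $\mathrm{se}_F(\bar{x}) = \sigma_F/\sqrt{n}$, involves
the unknown distribution F and so cannot be directly used.
At this point we can use the plug-in principle: we substitute $\hat{F}$
for F in the formula $\mathrm{se}_F(\bar{x}) = \sigma_F/\sqrt{n}$. The plug-in estimate of
$\sigma_F = [E_F(x - \mu_F)^2]^{1/2}$ is
$$\hat{\sigma} = \sigma_{\hat{F}} = \Big\{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \Big\}^{1/2}, \tag{5.11}$$
since $\mu_{\hat{F}} = \bar{x}$ and $E_{\hat{F}}\, g(x) = \frac{1}{n} \sum_{i=1}^{n} g(x_i)$ for any function g. This
gives the estimated standard error $\widehat{\mathrm{se}}(\bar{x}) = \mathrm{se}_{\hat{F}}(\bar{x})$,
$$\widehat{\mathrm{se}}(\bar{x}) = \sigma_{\hat{F}}/\sqrt{n} = \Big\{ \sum_{i=1}^{n} (x_i - \bar{x})^2 / n^2 \Big\}^{1/2}. \tag{5.12}$$
For the mouse Control group data, $\widehat{\mathrm{se}}(\bar{x}) = 13.33$.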
Formula (5.12) is slightly different than the usual estimated
standard error (2.2). That is because $\sigma_F$ is usually estimated by
$\bar{\sigma} = \{ \sum_{i=1}^{n} (x_i - \bar{x})^2/(n - 1) \}^{1/2}$ rather than by $\hat{\sigma}$, (5.11). Dividing by
n - 1 rather than n makes $\bar{\sigma}^2$ unbiased for $\sigma_F^2$. For most purposes
$\hat{\sigma}$ is just as good as $\bar{\sigma}$ for estimating $\sigma_F$.
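The two estimates can be compared directly on the mouse Control data. The sketch below is illustrative: the nine survival times are transcribed from Table 2.1, which is not reproduced in this section, so they should be checked against that table; the variable names are hypothetical.

```python
from math import sqrt

# Survival times (days) of the 9 Control mice; transcribed from Table 2.1
control = [52, 104, 146, 10, 50, 31, 40, 27, 46]

n = len(control)
x_bar = sum(control) / n
sum_sq = sum((x - x_bar) ** 2 for x in control)

se_plug_in = sqrt(sum_sq / n**2)         # plug-in estimate, formula (5.12)
se_usual = sqrt(sum_sq / (n * (n - 1)))  # divides by n - 1 instead of n
```

The two differ by the factor $\sqrt{(n-1)/n}$, which is about 0.94 for n = 9 and approaches 1 as n grows.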
Notice that we have used the plug-in principle twice: first to
estimate the expectation $\mu_F$ by $\mu_{\hat{F}} = \bar{x}$, and then to estimate
the standard error $\mathrm{se}_F(\bar{x})$ by $\mathrm{se}_{\hat{F}}(\bar{x})$. The bootstrap estimate of
standard error, which is the subject of Chapter 6, amounts to using
the plug-in principle to estimate the standard error of an arbitrary
statistic $\hat{\theta}$. Here we have seen that if $\hat{\theta} = \bar{x}$, then this approach
leads to (almost) the usual estimate of standard error. As we will
see, the advantage of the bootstrap is that it can be applied to
virtually any statistic $\hat{\theta}$, not just the mean $\bar{x}$.
5.4 Problems
5.1 Formula (5.4) exemplifies a general statistical truth: most
estimates of unknown quantities improve at a rate proportional
to the square root of the sample size. Suppose that it
were necessary to know $\mu_F$ for the mouse Control group with
a standard error of no more than 3 days. How many more
Control mice should be sampled?
5.2 State clearly why $\hat{p} = s/n$ is the plug-in estimate of p for the
binomial situation (5.8).
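5.3 Figure 5.2 compares the function
$$\binom{n}{nx} p^{nx} (1-p)^{n(1-x)} \quad \text{for } x = 0, 1/25, 2/25, \ldots, 1$$
with
$$\frac{1}{n} \, \frac{1}{\sqrt{2\pi p(1-p)/n}} \exp\left\{ -\frac{1}{2} \left( \frac{x - p}{\sqrt{p(1-p)/n}} \right)^2 \right\} \quad \text{for } x \in [0, 1].$$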