An Introduction to the Bootstrap

MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

General Editors

D.R. Cox, D.V. Hinkley, N. Reid, D.B. Rubin and B.W. Silverman

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-free Statistical Methods J.S. Maritz (1981)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikainen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Compositional Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Method, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.-T. Fang, S. Kotz and K. Ng (1989)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic Designs J.A. John (1987)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1990)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation: A State-Space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Chaos and Networks: Statistical and Probabilistic Aspects Edited by O. Barndorff-Nielsen et al. (1993)
51 Number Theoretic Methods in Statistics K.-T. Fang and W. Yuan (1993)
52 Inference and Asymptotics O. Barndorff-Nielsen and D.R. Cox (1993)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikainen and M. Pesonen (1993)
54 Statistical Concepts and Applications in Medicine J. Aitchison and I.J. Lauder (1994)
55 Predictive Inference S. Geisser (1993)
56 Model-Free Curve Estimation M. Tarter and M. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R. Tibshirani (1993)

(Full details concerning this series are available from the Publishers.)

An Introduction to the Bootstrap

Bradley Efron

Department of Statistics

Stanford University

and

Robert J. Tibshirani

Department of Preventative Medicine and Biostatistics

and Department of Statistics, University of Toronto

CHAPMAN & HALL/CRC

Boca Raton London New York Washington, D.C.

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

Milton Park, Abingdon
Oxon OX14 4RN

© 1994 by Taylor & Francis Group, LLC

Chapman & Hall/CRC is an imprint of Taylor & Francis Group

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper

International Standard Book Number-13: 978-0-412-04231-7 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have

been made to publish reliable data and information, but the author and publisher cannot assume responsibility

for the validity of all materials or the consequences of their use. The authors and publishers have attempted to

trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if

permission to publish in this form has not been obtained. If any copyright material has not been acknowledged

please write and let us know so we may rectify it in any future reprint

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,

mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and

recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com

(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive,

Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration

for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate

system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only

for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Efron, Bradley.

An introduction to the bootstrap/Brad Efron, Rob Tibshirani.

p. cm.

Includes bibliographical references and index.

ISBN 0-412-04231-2

1. Bootstrap (Statistics). I. Tibshirani, Robert. II. Title.

QA276.8.E3745 1993

519.5’44— dc20

93-4489

Visit the Taylor & Francis Web site at

http://www.taylorandfrancis.com

and the CRC Press Web site at

http://www.crcpress.com

TO CHERYL, CHARLIE, RYAN AND JULIE

AND TO THE MEMORY OF

RUPERT G. MILLER, JR.

Contents

Preface

1 Introduction
1.1 An overview of this book
1.2 Information for instructors
1.3 Some of the notation used in the book

2 The accuracy of a sample mean
2.1 Problems

3 Random samples and probabilities
3.1 Introduction
3.2 Random samples
3.3 Probability theory
3.4 Problems

4 The empirical distribution function and the plug-in principle
4.1 Introduction
4.2 The empirical distribution function
4.3 The plug-in principle
4.4 Problems

5 Standard errors and estimated standard errors
5.1 Introduction
5.2 The standard error of a mean
5.3 Estimating the standard error of the mean
5.4 Problems

6 The bootstrap estimate of standard error
6.1 Introduction
6.2 The bootstrap estimate of standard error
6.3 Example: the correlation coefficient
6.4 The number of bootstrap replications B
6.5 The parametric bootstrap
6.6 Bibliographic notes
6.7 Problems

7 Bootstrap standard errors: some examples
7.1 Introduction
7.2 Example 1: test score data
7.3 Example 2: curve fitting
7.4 An example of bootstrap failure
7.5 Bibliographic notes
7.6 Problems

8 More complicated data structures
8.1 Introduction
8.2 One-sample problems
8.3 The two-sample problem
8.4 More general data structures
8.5 Example: luteinizing hormone
8.6 The moving blocks bootstrap
8.7 Bibliographic notes
8.8 Problems

9 Regression models
9.1 Introduction
9.2 The linear regression model
9.3 Example: the hormone data
9.4 Application of the bootstrap
9.5 Bootstrapping pairs vs bootstrapping residuals
9.6 Example: the cell survival data
9.7 Least median of squares
9.8 Bibliographic notes
9.9 Problems

10 Estimates of bias
10.1 Introduction
10.2 The bootstrap estimate of bias
10.3 Example: the patch data
10.4 An improved estimate of bias
10.5 The jackknife estimate of bias
10.6 Bias correction
10.7 Bibliographic notes
10.8 Problems

11 The jackknife
11.1 Introduction
11.2 Definition of the jackknife
11.3 Example: test score data
11.4 Pseudo-values
11.5 Relationship between the jackknife and bootstrap
11.6 Failure of the jackknife
11.7 The delete-d jackknife
11.8 Bibliographic notes
11.9 Problems

12 Confidence intervals based on bootstrap “tables”
12.1 Introduction
12.2 Some background on confidence intervals
12.3 Relation between confidence intervals and hypothesis tests
12.4 Student’s t interval
12.5 The bootstrap-t interval
12.6 Transformations and the bootstrap-t
12.7 Bibliographic notes
12.8 Problems

13 Confidence intervals based on bootstrap percentiles
13.1 Introduction
13.2 Standard normal intervals
13.3 The percentile interval
13.4 Is the percentile interval backwards?
13.5 Coverage performance
13.6 The transformation-respecting property
13.7 The range-preserving property
13.8 Discussion
13.9 Bibliographic notes
13.10 Problems

14 Better bootstrap confidence intervals
14.1 Introduction
14.2 Example: the spatial test data
14.3 The BCa method
14.4 The ABC method
14.5 Example: the tooth data
14.6 Bibliographic notes
14.7 Problems

15 Permutation tests
15.1 Introduction
15.2 The two-sample problem
15.3 Other test statistics
15.4 Relationship of hypothesis tests to confidence intervals and the bootstrap
15.5 Bibliographic notes
15.6 Problems

16 Hypothesis testing with the bootstrap
16.1 Introduction
16.2 The two-sample problem
16.3 Relationship between the permutation test and the bootstrap
16.4 The one-sample problem
16.5 Testing multimodality of a population
16.6 Discussion
16.7 Bibliographic notes
16.8 Problems

17 Cross-validation and other estimates of prediction error
17.1 Introduction
17.2 Example: hormone data
17.3 Cross-validation
17.4 Cp and other estimates of prediction error
17.5 Example: classification trees
17.6 Bootstrap estimates of prediction error
17.6.1 Overview
17.6.2 Some details
17.7 The .632 bootstrap estimator
17.8 Discussion
17.9 Bibliographic notes
17.10 Problems

18 Adaptive estimation and calibration
18.1 Introduction
18.2 Example: smoothing parameter selection for curve fitting
18.3 Example: calibration of a confidence point
18.4 Some general considerations
18.5 Bibliographic notes
18.6 Problems

19 Assessing the error in bootstrap estimates
19.1 Introduction
19.2 Standard error estimation
19.3 Percentile estimation
19.4 The jackknife-after-bootstrap
19.5 Derivations
19.6 Bibliographic notes
19.7 Problems

20 A geometrical representation for the bootstrap and jackknife
20.1 Introduction
20.2 Bootstrap sampling
20.3 The jackknife as an approximation to the bootstrap
20.4 Other jackknife approximations
20.5 Estimates of bias
20.6 An example
20.7 Bibliographic notes
20.8 Problems

21 An overview of nonparametric and parametric inference
21.1 Introduction
21.2 Distributions, densities and likelihood functions
21.3 Functional statistics and influence functions
21.4 Parametric maximum likelihood inference
21.5 The parametric bootstrap
21.6 Relation of parametric maximum likelihood, bootstrap and jackknife approaches
21.6.1 Example: influence components for the mean
21.7 The empirical cdf as a maximum likelihood estimate
21.8 The sandwich estimator
21.8.1 Example: Mouse data
21.9 The delta method
21.9.1 Example: delta method for the mean
21.9.2 Example: delta method for the correlation coefficient
21.10 Relationship between the delta method and infinitesimal jackknife
21.11 Exponential families
21.12 Bibliographic notes
21.13 Problems

22 Further topics in bootstrap confidence intervals
22.1 Introduction
22.2 Correctness and accuracy
22.3 Confidence points based on approximate pivots
22.4 The BCa interval
22.5 The underlying basis for the BCa interval
22.6 The ABC approximation
22.7 Least favorable families
22.8 The ABCq method and transformations
22.9 Discussion
22.10 Bibliographic notes
22.11 Problems

23 Efficient bootstrap computations
23.1 Introduction
23.2 Post-sampling adjustments
23.3 Application to bootstrap bias estimation
23.4 Application to bootstrap variance estimation
23.5 Pre- and post-sampling adjustments
23.6 Importance sampling for tail probabilities
23.7 Application to bootstrap tail probabilities
23.8 Bibliographic notes
23.9 Problems

24 Approximate likelihoods
24.1 Introduction
24.2 Empirical likelihood
24.3 Approximate pivot methods
24.4 Bootstrap partial likelihood
24.5 Implied likelihood
24.6 Discussion
24.7 Bibliographic notes
24.8 Problems

25 Bootstrap bioequivalence
25.1 Introduction
25.2 A bioequivalence problem
25.3 Bootstrap confidence intervals
25.4 Bootstrap power calculations
25.5 A more careful power calculation
25.6 Fieller’s intervals
25.7 Bibliographic notes
25.8 Problems

26 Discussion and further topics
26.1 Discussion
26.2 Some questions about the bootstrap
26.3 References on further topics

Appendix: software for bootstrap computations
Introduction
Some available software
S language functions

References

Author index

Subject index

Preface

Dear friend, theory is all gray,

and the golden tree of life is green.

Goethe, from “Faust”

The ability to simplify means to eliminate the unnecessary so that

the necessary may speak.

Hans Hoffmann

Statistics is a subject of amazingly many uses and surprisingly
few effective practitioners. The traditional road to statistical
knowledge is blocked, for most, by a formidable wall of mathematics.
Our approach here avoids that wall. The bootstrap is a computer-based
method of statistical inference that can answer many real
statistical questions without formulas. Our goal in this book is to
arm scientists and engineers, as well as statisticians, with
computational techniques that they can use to analyze and understand
complicated data sets.

The word “understand” is an important one in the previous sentence.
This is not a statistical cookbook. We aim to give the reader
a good intuitive understanding of statistical inference.

One of the charms of the bootstrap is the direct appreciation it
gives of variance, bias, coverage, and other probabilistic phenomena.
What does it mean that a confidence interval contains the
true value with probability .90? The usual textbook answer appears
formidably abstract to most beginning students. Bootstrap
confidence intervals are directly constructed from real data sets,
using a simple computer algorithm. This doesn’t necessarily make
it easy to understand confidence intervals, but at least the
difficulties are the appropriate conceptual ones, and not mathematical
muddles.

Much of the exposition in our book is based on the analysis of
real data sets. The mouse data, the stamp data, the tooth data,
the hormone data, and other small but genuine examples, are an
important part of the presentation. These are especially valuable if
the reader can try his own computations on them. Personal computers
are sufficient to handle most bootstrap computations for
these small data sets.

This book does not give a rigorous technical treatment of the
bootstrap, and we concentrate on the ideas rather than their
mathematical justification. Many of these ideas are quite sophisticated,
however, and this book is not just for beginners. The presentation
starts off slowly but builds in both its scope and depth. More
mathematically advanced accounts of the bootstrap may be found
in papers and books by many researchers that are listed in the
Bibliographic notes at the end of the chapters.

We would like to thank Andreas Buja, Anthony Davison, Peter
Hall, Trevor Hastie, John Rice, Bernard Silverman, James Stafford
and Sami Tibshirani for making very helpful comments and suggestions
on the manuscript. We especially thank Timothy Hesterberg
and Cliff Lunneborg for the great deal of time and effort that they
spent on reading and preparing comments. Thanks to Maria-Luisa
Gardner for providing expert advice on the “rules of punctuation.”
We would also like to thank numerous students at both Stanford
University and the University of Toronto for pointing out errors
in earlier drafts, and colleagues and staff at our universities for
their support. Thanks to Tom Glinos of the University of Toronto
for maintaining a healthy computing environment. Karola DeCleve
typed much of the first draft of this book, and maintained vigilance
against errors during its entire history. All of this was done
cheerfully and in a most helpful manner, for which we are truly
grateful. Trevor Hastie provided expert “S” and TeX advice, at
crucial stages in the project.

We were lucky to have not one but two superb editors working
on this project. Bea Schube got us going, before starting her
retirement; Bea has done a great deal for the statistics profession
and we wish her all the best. John Kimmel carried the ball after
Bea left, and did an excellent job. We thank our copy-editor Jim
Gerónimo for his thorough correction of the manuscript, and take
responsibility for any errors that remain.

The first author was supported by the National Institutes of
Health and the National Science Foundation. Both groups have
supported the development of statistical theory at Stanford,
including much of the theory behind this book. The second author
would like to thank his wife Cheryl for her understanding and
support during this entire project, and his parents for a lifetime
of encouragement. He gratefully acknowledges the support of the
Natural Sciences and Engineering Research Council of Canada.

Palo Alto and Toronto
June 1993

Bradley Efron
Robert Tibshirani

CHAPTER 1

Introduction

Statistics is the science of learning from experience, especially
experience that arrives a little bit at a time. The earliest information
science was statistics, originating in about 1650. This century has
seen statistical techniques become the analytic methods of choice
in biomedical science, psychology, education, economics,
communications theory, sociology, genetic studies, epidemiology, and other
areas. Recently, traditional sciences like geology, physics, and
astronomy have begun to make increasing use of statistical methods
as they focus on areas that demand informational efficiency, such as
the study of rare and exotic particles or extremely distant galaxies.

Most people are not natural-born statisticians. Left to our own
devices we are not very good at picking out patterns from a sea
of noisy data. To put it another way, we are all too good at picking
out non-existent patterns that happen to suit our purposes.
Statistical theory attacks the problem from both ends. It provides
optimal methods for finding a real signal in a noisy background,
and also provides strict checks against the overinterpretation of
random patterns.

Statistical theory attempts to answer three basic questions:

(1) How should I collect my data?

(2) How should I analyze and summarize the data that I’ve collected?

(3) How accurate are my data summaries?

Question 3 constitutes part of the process known as statistical
inference. The bootstrap is a recently developed technique for making
certain kinds of statistical inferences. It is only recently developed
because it requires modern computer power to simplify the often
intricate calculations of traditional statistical theory.

The explanations that we will give for the bootstrap, and other
computer-based methods, involve explanations of traditional ideas
in statistical inference. The basic ideas of statistics haven’t changed,
but their implementation has. The modern computer lets us apply
these ideas flexibly, quickly, easily, and with a minimum of
mathematical assumptions. Our primary purpose in the book is to
explain when and why bootstrap methods work, and how they can
be applied in a wide variety of real data-analytic situations.

All three basic statistical concepts, data collection, summary and
inference, are illustrated in the New York Times excerpt of Figure
1.1. A study was done to see if small aspirin doses would prevent
heart attacks in healthy middle-aged men. The data for the aspirin
study were collected in a particularly efficient way: by a controlled,
randomized, double-blind study. One half of the subjects
received aspirin and the other half received a control substance, or
placebo, with no active ingredients. The subjects were randomly
assigned to the aspirin or placebo groups. Both the subjects and the
supervising physicians were blinded to the assignments, with the
statisticians keeping a secret code of who received which substance.
Scientists, like everyone else, want the project they are working on
to succeed. The elaborate precautions of a controlled, randomized,
blinded experiment guard against seeing benefits that don’t exist,
while maximizing the chance of detecting a genuine positive effect.

The summary statistics in the newspaper article are very simple:

                  heart attacks
                  (fatal plus non-fatal)    subjects
aspirin group:            104                 11037
placebo group:            189                 11034

We will see examples of much more complicated summaries in later
chapters. One advantage of using a good experimental design is a
simplification of its results. What strikes the eye here is the lower
rate of heart attacks in the aspirin group. The ratio of the two
rates is

$$\hat{\theta} = \frac{104/11037}{189/11034} = .55. \tag{1.1}$$

If this study can be believed, and its solid design makes it very
believable, the aspirin-takers only have 55% as many heart attacks
as placebo-takers.
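The arithmetic behind (1.1) is easy to check by machine. The short Python sketch below is an illustration only and is not part of the original analysis; the function name `rate_ratio` is hypothetical.

```python
def rate_ratio(events_a, n_a, events_b, n_b):
    """Ratio of two event rates: (events_a / n_a) / (events_b / n_b)."""
    return (events_a / n_a) / (events_b / n_b)

# Heart attack counts and group sizes from the aspirin study summary.
theta_hat = rate_ratio(104, 11037, 189, 11034)
print(round(theta_hat, 2))  # 0.55
```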

Of course we are not really interested in $\hat{\theta}$, the estimated ratio.
What we would like to know is $\theta$, the true ratio, that is the ratio

HEART ATTACK RISK FOUND TO BE CUT BY TAKING ASPIRIN

LIFESAVING EFFECTS SEEN

Study Finds Benefit of Tablet Every Other Day Is Much Greater Than Expected

By HAROLD M. SCHMECK Jr.

A major nationwide study shows that a single aspirin tablet every
other day can sharply reduce a man’s risk of heart attack and death
from heart attack.

The lifesaving effects were so dramatic that the study was halted in
mid-December so that the results could be reported as soon as possible
to the participants and to the medical profession in general.

The magnitude of the beneficial effect was far greater than expected,
Dr. Charles H. Hennekens of Harvard, principal investigator in the
research, said in a telephone interview. The risk of myocardial
infarction, the technical name for heart attack, was cut almost in half.

‘Extreme Beneficial Effect’

A special report said the results showed “a statistically extreme
beneficial effect” from the use of aspirin. The report is to be
published Thursday in The New England Journal of Medicine.

In recent years smaller studies have demonstrated that a person who
has had one heart attack can reduce the risk of a second by taking
aspirin, but there had been no proof that the beneficial effect would
extend to the general male population.

Dr. Claude Lenfant, the director of the National Heart Lung and Blood
Institute, said the findings were “extremely important,” but he said
the general public should not take the report as an indication that
everyone should start taking aspirin.

Figure 1.1. Front-page news from the New York Times of January 27,
1987. Reproduced by permission of the New York Times.

we would see if we could treat all subjects, and not just a sample of
them. The value $\hat{\theta} = .55$ is only an estimate of $\theta$. The sample seems
large here, 22071 subjects in all, but the conclusion that aspirin
works is really based on a smaller number, the 293 observed heart
attacks. How do we know that $\hat{\theta}$ might not come out much less
favorably if the experiment were run again?

This is where statistical inference comes in. Statistical theory
allows us to make the following inference: the true value of $\theta$ lies
in the interval

$$.43 < \theta < .70 \tag{1.2}$$

with 95% confidence. Statement (1.2) is a classical confidence
interval, of the type discussed in Chapters 12-14, and 22. It says that
if we ran a much bigger experiment, with millions of subjects, the
ratio of rates probably wouldn’t be too much different than (1.1).
We almost certainly wouldn’t decide that $\theta$ exceeded 1, that is that
aspirin was actually harmful. It is really rather amazing that the
same data that give us an estimated value, $\hat{\theta} = .55$ in this case,
also can give us a good idea of the estimate’s accuracy.
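The text does not say which classical formula produced (1.2). As a hedged illustration, one standard textbook construction, a normal approximation for the logarithm of a ratio of two independent binomial proportions, gives limits close to those quoted. The choice of method and the function name `log_ratio_interval` are assumptions made for this sketch.

```python
import math

def log_ratio_interval(events_a, n_a, events_b, n_b, z=1.96):
    """Approximate 95% interval for a ratio of rates, using a normal
    approximation on the log scale (an assumed, standard construction)."""
    ratio = (events_a / n_a) / (events_b / n_b)
    # Large-sample standard error of log(ratio) for two independent
    # binomial proportions.
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    return ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)

# Heart attack data from the aspirin study.
lower, upper = log_ratio_interval(104, 11037, 189, 11034)
print(f"{lower:.2f} < theta < {upper:.2f}")
```

For the heart attack data the limits come out at roughly .43 and .70, in line with (1.2), although agreement to two decimals does not show that this is the calculation the authors used.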

Statistical inference is serious business. A lot can ride on the

decision of whether or not an observed effect is real. The aspirin

study tracked strokes as well as heart attacks, with the following

results:

                  strokes    subjects
aspirin group:      119        11037
placebo group:       98        11034          (1.3)

For strokes, the ratio of rates is

$$\hat{\theta} = \frac{119/11037}{98/11034} = 1.21. \tag{1.4}$$

It now looks like taking aspirin is actually harmful. However the
interval for the true stroke ratio $\theta$ turns out to be

$$.93 < \theta < 1.59 \tag{1.5}$$

with 95% confidence. This includes the neutral value $\theta = 1$, at
which aspirin would be no better or worse than placebo vis-à-vis
strokes. In the language of statistical hypothesis testing, aspirin
was found to be significantly beneficial for preventing heart attacks,
but not significantly harmful for causing strokes. The opposite
conclusion had been reached in an older, smaller study concerning men
who had experienced previous heart attacks. The aspirin treatment
remains mildly controversial for such patients.

The bootstrap is a data-based simulation method for statistical

inference, which can be used to produce inferences like (1.2) and

(1.5). The use of the term bootstrap derives from the phrase to

pull oneself up by one’s bootstrap, widely thought to be based on

one of the eighteenth century Adventures of Baron Munchausen,

by Rudolph Erich Raspe. (The Baron had fallen to the bottom of

a deep lake. Just when it looked like all was lost, he thought to

pick himself up by his own bootstraps.) It is not the same as the

term “bootstrap” used in computer science meaning to “boot” a

computer from a set of core instructions, though the derivation is

similar.

Here is how the bootstrap works in the stroke example. We create
two populations: the first consisting of 119 ones and
11037-119=10918 zeroes, and the second consisting of 98 ones and
11034-98=10936 zeroes. We draw with replacement a sample of 11037
items from the first population, and a sample of 11034 items from
the second population. Each of these is called a bootstrap sample.
From these we derive the bootstrap replicate of $\hat{\theta}$:

$$\hat{\theta}^* = \frac{\text{Proportion of ones in bootstrap sample \#1}}{\text{Proportion of ones in bootstrap sample \#2}}.$$

We repeat this process a large number of times, say 1000 times,
and obtain 1000 bootstrap replicates $\hat{\theta}^*$. This process is easy to
implement on a computer, as we will see later. These 1000 replicates
contain information that can be used to make inferences from our
data. For example, the standard deviation turned out to be 0.17
in a batch of 1000 replicates that we generated. The value 0.17
is an estimate of the standard error of the ratio of rates $\hat{\theta}$. This
indicates that the observed ratio $\hat{\theta} = 1.21$ is only a little more than
one standard error larger than 1, and so the neutral value $\theta = 1$
cannot be ruled out. A rough 95% confidence interval like (1.5)
can be derived by taking the 25th and 975th largest of the 1000
replicates, which in this case turned out to be (.93, 1.60).
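The resampling recipe just described can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions rather than the authors' own program: the function name, the seed, and the use of NumPy are choices made here. One shortcut is also taken. Drawing n items with replacement from a population of ones and zeroes yields a binomially distributed number of ones, so the sketch samples that count directly instead of building the populations explicitly.

```python
import numpy as np

def bootstrap_ratio_replicates(events_1, n_1, events_2, n_2,
                               n_boot=1000, seed=0):
    """Bootstrap replicates of a ratio of proportions for two 0/1 samples."""
    rng = np.random.default_rng(seed)
    # Population k holds events_k ones and n_k - events_k zeroes. Drawing
    # n_k items from it with replacement gives a Binomial(n_k, events_k / n_k)
    # number of ones, so we sample that count directly.
    ones_1 = rng.binomial(n_1, events_1 / n_1, size=n_boot)
    ones_2 = rng.binomial(n_2, events_2 / n_2, size=n_boot)
    return (ones_1 / n_1) / (ones_2 / n_2)

# Stroke data: 119 of 11037 (aspirin) and 98 of 11034 (placebo).
replicates = bootstrap_ratio_replicates(119, 11037, 98, 11034)

se_boot = replicates.std(ddof=1)        # bootstrap estimate of standard error
ordered = np.sort(replicates)
interval = (ordered[24], ordered[974])  # 25th and 975th of the ordered values

print(f"bootstrap standard error: {se_boot:.2f}")
print(f"rough 95% interval: ({interval[0]:.2f}, {interval[1]:.2f})")
```

Because the replicates are random, the exact figures change with the seed and with the number of replicates; they should land near the 0.17 and (.93, 1.60) reported in the text, without matching them digit for digit.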

In this simple example, the confidence interval derived from the

bootstrap agrees very closely with the one derived from statistical

theory. Bootstrap methods are intended to simplify the calculation

of inferences like (1.2) and (1.5), producing them in an automatic

way even in situations much more complicated than the aspirin

study.

The terminology of statistical summaries and inferences, like
regression, correlation, analysis of variance, discriminant analysis,
standard error, significance level and confidence interval, has
become the lingua franca of all disciplines that deal with noisy data.
We will be examining what this language means and how it works
in practice. The particular goal of bootstrap theory is a
computer-based implementation of basic statistical concepts. In some ways it
is easier to understand these concepts in computer-based contexts
than through traditional mathematical exposition.

1.1 An overview of this book

This book describes the bootstrap and other methods for assessing
statistical accuracy. The bootstrap does not work in isolation but
rather is applied to a wide variety of statistical procedures. Part
of the objective of this book is to expose the reader to many exciting
and useful statistical techniques through real-data examples. Some
of the techniques described include nonparametric regression,
density estimation, classification trees, and least median of squares
regression.

Here is a chapter-by-chapter synopsis of the book. Chapter 2
introduces the bootstrap estimate of standard error for a simple
mean. Chapters 3-5 contain some basic background material,
and may be skimmed by readers eager to get to the details of
the bootstrap in Chapter 6. Random samples, populations, and
basic probability theory are reviewed in Chapter 3. Chapter 4
defines the empirical distribution function estimate of the
population, which simply estimates the probability of each of n data items
to be 1/n. Chapter 4 also shows that many familiar statistics can
be viewed as “plug-in” estimates, that is, estimates obtained by
plugging in the empirical distribution function for the unknown
distribution of the population. Chapter 5 reviews standard error
estimation for a mean, and shows how the usual textbook formula
can be derived as a simple plug-in estimate.
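As a small preview of the plug-in idea, the sketch below puts probability 1/n on each data point, computes the variance of that empirical distribution, and divides by n to estimate the standard error of the mean. The function name and the four data values are hypothetical, chosen only for illustration. Note that the plug-in variance divides by n rather than the n - 1 often seen in textbooks, which matters little once n is moderately large.

```python
import math

def plug_in_se_of_mean(data):
    """Plug-in estimate of the standard error of the sample mean."""
    n = len(data)
    mean = sum(data) / n
    # Variance of the empirical distribution: each point has probability 1/n.
    var_hat = sum((x - mean) ** 2 for x in data) / n
    return math.sqrt(var_hat / n)

sample = [2.0, 4.0, 4.0, 6.0]  # hypothetical data, for illustration only
se_hat = plug_in_se_of_mean(sample)
print(se_hat)
```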

The bootstrap is defined in Chapter 6, for estimating the standard error
of a statistic from a single sample. The bootstrap standard error estimate
is a plug-in estimate that rarely can be computed exactly; instead a
simulation (“resampling”) method is used for approximating it.

Chapter 7 describes the application of bootstrap standard errors in two
complicated examples: a principal components analysis and a curve fitting
problem.

Up to this point, only one-sample data problems have been discussed. The
application of the bootstrap to more complicated data structures is
discussed in Chapter 8. A two-sample problem and a time-series analysis
are described.

Regression analysis and the bootstrap are discussed and illustrated in
Chapter 9. The bootstrap estimate of standard error is applied in a number
of different ways and the results are discussed in two examples.

The use of the bootstrap for estimation of bias is the topic of Chapter
10, and the pros and cons of bias correction are discussed. Chapter 11
describes the jackknife method in some detail. We see that the jackknife
is a simple closed-form approximation to the bootstrap, in the context of
standard error and bias estimation.

The use of the bootstrap for construction of confidence intervals is
described in Chapters 12, 13 and 14. There are a number of different
approaches to this important topic and we devote quite a bit of space to
them. In Chapter 12 we discuss the bootstrap-t approach, which generalizes
the usual Student’s t method for constructing confidence intervals. The
percentile method (Chapter 13) uses instead the percentiles of the
bootstrap distribution to define confidence limits. The BCa interval
(bias-corrected and accelerated) makes important corrections to the
percentile interval and is described in Chapter 14.

Chapter 15 covers permutation tests, a time-honored and useful set of
tools for hypothesis testing. Their close relationship with the bootstrap
is discussed; Chapter 16 shows how the bootstrap can be used in more
general hypothesis testing problems.

Prediction error estimation arises in regression and classification
problems, and we describe some approaches for it in Chapter 17.
Cross-validation and bootstrap methods are described and illustrated.
Extending this idea, Chapter 18 shows how the bootstrap and
cross-validation can be used to adapt estimators to a set of data.

Like any statistic, bootstrap estimates are random variables and so have
inherent error associated with them. When using the bootstrap for making
inferences, it is important to get an idea of the magnitude of this error.
In Chapter 19 we discuss the jackknife-after-bootstrap method for
estimating the standard error of a bootstrap quantity.

Chapters 20-25 contain more advanced material on selected topics, and
delve more deeply into some of the material introduced in the previous
chapters. The relationship between the bootstrap and jackknife is studied
via the “resampling picture” in Chapter 20. Chapter 21 gives an overview
of non-parametric and parametric inference, and relates the bootstrap to a
number of other techniques for estimating standard errors. These include
the delta method, Fisher information, infinitesimal jackknife, and the
sandwich estimator.

Some advanced topics in bootstrap confidence intervals are discussed in
Chapter 22, providing some of the underlying basis for the techniques
introduced in Chapters 12-14. Chapter 23 describes methods for efficient
computation of bootstrap estimates including control variates and
importance sampling. In Chapter 24 the construction of approximate
likelihoods is discussed. The bootstrap and other related methods are used
to construct a “nonparametric” likelihood in situations where a parametric
model is not specified.

Chapter 25 describes in detail a bioequivalence study in which the
bootstrap is used to estimate power and sample size. In Chapter 26 we
discuss some general issues concerning the bootstrap and its role in
statistical inference.

Finally, the Appendix contains a description of a number of different
computer programs for the methods discussed in this book.

1.2 Information for instructors

We envision that this book can provide the basis for (at least) two
different one-semester courses. An upper-year undergraduate or first-year
graduate course could be taught from some or all of the first 19 chapters,
possibly covering Chapter 25 as well (both authors have done this). In
addition, a more advanced graduate course could be taught from a selection
of Chapters 6-19, and a selection of Chapters 20-26. For an advanced
course, supplementary material might be used, such as Peter Hall’s book
The Bootstrap and Edgeworth Expansion or journal papers on selected
technical topics. The Bibliographic notes in the book contain many
suggestions for background reading.

We have provided numerous exercises at the end of each chapter. Some of
these involve computing, since it is important for the student to get
hands-on experience for learning the material. The bootstrap is most
effectively used in a high-level language for data analysis and graphics.
Our language of choice (at present) is “S” (or “S-PLUS”), and a number of
S programs appear in the Appendix. Most of these programs could be easily
translated into other languages such as Gauss, Lisp-Stat, or Matlab.
Details on the availability of S and S-PLUS are given in the Appendix.

1.3 Some of the notation used in the book

Lower case bold letters such as $\mathbf{x}$ refer to vectors, that is,
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$. Matrices are denoted by upper case
bold letters such as $\mathbf{X}$, while a plain uppercase letter like $X$
refers to a random variable. The transpose of a vector is written as
$\mathbf{x}^T$. A superscript “*” indicates a bootstrap random variable:
for example, $\mathbf{x}^*$ indicates a bootstrap data set generated from
a data set $\mathbf{x}$. Parameters are denoted by Greek letters such as
$\theta$. A hat on a letter indicates an estimate, such as $\hat{\theta}$.
The letters $F$ and $G$ refer to populations. In Chapter 21 the same
symbols are used for the cumulative distribution function of a population.
$I_C$ is the indicator function equal to 1 if condition $C$ is true and 0
otherwise. For example, $I_{\{x<2\}} = 1$ if $x < 2$ and 0 otherwise. The
notation $\mathrm{tr}(A)$ refers to the trace of the matrix $A$, that is,
the sum of the diagonal elements. The derivatives of a function $g(x)$ are
denoted by $g'(x)$, $g''(x)$ and so on.

The notation

$$F \to (x_1, x_2, \ldots, x_n)$$

indicates an independent and identically distributed sample drawn from
$F$. Equivalently, we also write $x_i \overset{\text{i.i.d.}}{\sim} F$ for
$i = 1, 2, \ldots, n$.

Notation such as $\#\{x_i > 3\}$ means the number of $x_i$'s greater than
3. $\log x$ refers to the natural logarithm of $x$.


CHAPTER 2

The accuracy of a sample mean

The bootstrap is a computer-based method for assigning measures of
accuracy to statistical estimates. The basic idea behind the bootstrap is
very simple, and goes back at least two centuries. After reviewing some
background material, this book describes the bootstrap method, its
implementation on the computer, and its application to some real data
analysis problems. First though, this chapter focuses on the one example
of a statistical estimator where we really don’t need a computer to assess
accuracy: the sample mean. In addition to previewing the bootstrap, this
gives us a chance to review some fundamental ideas from elementary
statistics. We begin with a simple example concerning means and their
estimated accuracies.

Table 2.1 shows the results of a small experiment, in which 7 out of 16
mice were randomly selected to receive a new medical treatment, while the
remaining 9 were assigned to the non-treatment (control) group. The
treatment was intended to prolong survival after a test surgery. The table
shows the survival time following surgery, in days, for all 16 mice.

Did the treatment prolong survival? A comparison of the means for the two
groups offers preliminary grounds for optimism. Let $x_1, x_2, \ldots, x_7$
indicate the lifetimes in the treatment group, so $x_1 = 94$, $x_2 = 197$,
$\ldots$, $x_7 = 23$, and likewise let $y_1, y_2, \ldots, y_9$ indicate the
control group lifetimes. The group means are

$$\bar{x} = \sum_{i=1}^{7} x_i/7 = 86.86 \quad \text{and} \quad \bar{y} = \sum_{i=1}^{9} y_i/9 = 56.22, \tag{2.1}$$

so the difference $\bar{x} - \bar{y}$ equals 30.63, suggesting a
considerable life-prolonging effect for the treatment.

But how accurate are these estimates? After all, the means (2.1) are based
on small samples, only 7 and 9 mice, respectively. In order to answer this
question, we need an estimate of the accuracy of the sample means
$\bar{x}$ and $\bar{y}$. For sample means, and essentially only for sample
means, an accuracy formula is easy to obtain.

Table 2.1. The mouse data. Sixteen mice were randomly assigned to a
treatment group or a control group. Shown are their survival times, in
days, following a test surgery. Did the treatment prolong survival?

Group        Data                            (Sample size)   Mean    Estimated standard error
Treatment:   94 197 16 38 99 141 23          (7)             86.86   25.24
Control:     52 104 146 10 51 30 40 27 46    (9)             56.22   14.14
Difference:                                                  30.63   28.93

The estimated standard error of a mean $\bar{x}$ based on $n$ independent
data points $x_1, x_2, \ldots, x_n$, $\bar{x} = \sum_{i=1}^{n} x_i/n$, is
given by the formula

$$\sqrt{s^2/n}, \tag{2.2}$$

where $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n-1)$. (This formula, and
standard errors in general, are discussed more carefully in Chapter 5.)
The standard error of any estimator is defined to be the square root of
its variance, that is, the estimator’s root mean square variability around
its expectation. This is the most common measure of an estimator’s
accuracy. Roughly speaking, an estimator will be less than one standard
error away from its expectation about 68% of the time, and less than two
standard errors away about 95% of the time.
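
As a concrete check of formula (2.2), here is a minimal Python sketch. The function name `standard_error_of_mean` is ours, not the book's. The text quotes only x1 = 94, x2 = 197 and x7 = 23 for the treatment group; the full list below is assumed to be the seven published mouse-data values, and it does reproduce the quoted mean and standard error.

```python
import math

def standard_error_of_mean(data):
    """Estimated standard error of the mean, formula (2.2): sqrt(s^2 / n)."""
    n = len(data)
    mean = sum(data) / n
    # Sample variance s^2 with divisor n - 1, as in the text.
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
    return math.sqrt(s2 / n)

# Treatment-group survival times in days (assumed values, see the lead-in).
treatment = [94, 197, 16, 38, 99, 141, 23]

treatment_mean = sum(treatment) / len(treatment)
treatment_se = standard_error_of_mean(treatment)
print(f"mean = {treatment_mean:.2f}, estimated standard error = {treatment_se:.2f}")
```

With these seven values the sketch gives a mean of 86.86 and an estimated standard error of 25.24, matching the treatment row of Table 2.1.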

If the estimated standard errors in the mouse experiment were very small,
say less than 1, then we would know that $\bar{x}$ and $\bar{y}$ were
close to their expected values, and that the observed difference of 30.63
was probably a good estimate of the true survival-prolonging capability of
the treatment. On the other hand, if formula (2.2) gave big estimated
standard errors, say 50, then the difference estimate would be too
inaccurate to depend on.

The actual situation is shown at the right of Table 2.1. The estimated
standard errors, calculated from (2.2), are 25.24 for $\bar{x}$ and 14.14
for $\bar{y}$. The standard error for the difference $\bar{x} - \bar{y}$
equals $28.93 = \sqrt{25.24^2 + 14.14^2}$ (since the variance of the
difference of two independent quantities is the sum of their variances).
We see that the observed difference 30.63 is only 30.63/28.93 = 1.05
estimated standard errors greater than zero. Readers familiar with
hypothesis testing theory will recognize this as an insignificant result,
one that could easily arise by chance even if the treatment really had no
effect at all.
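
The combination rule used above can be written as a small helper. This is only an illustrative sketch: `se_of_difference` is a hypothetical name, and the inputs are the rounded values quoted in Table 2.1.

```python
import math

def se_of_difference(se_a, se_b):
    """Standard error of a difference of two independent estimates."""
    # Variances of independent quantities add, so the standard errors
    # combine as the square root of the sum of their squares.
    return math.sqrt(se_a ** 2 + se_b ** 2)

# Rounded standard errors and difference quoted in Table 2.1.
se_diff = se_of_difference(25.24, 14.14)
ratio = 30.63 / se_diff
print(f"se of difference = {se_diff:.2f}, difference / se = {ratio:.2f}")
```

The standard error of the difference comes out at 28.93, and the ratio is only slightly above 1, in line with the discussion above.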

There are more precise ways to verify this disappointing result (e.g. the
permutation test of Chapter 15), but usually, as in this case, estimated
standard errors are an excellent first step toward thinking critically
about statistical estimates. Unfortunately standard errors have a major
disadvantage: for most statistical estimators other than the mean there is
no formula like (2.2) to provide estimated standard errors. In other
words, it is hard to assess the accuracy of an estimate other than the
mean.

Suppose, for example, we want to compare the two groups in Table 2.1 by
their medians rather than their means. The two medians are 94 for
treatment and 46 for control, giving an estimated difference of 48,
considerably more than the difference of the means. But how accurate are
these medians? Answering such questions is where the bootstrap, and other
computer-based techniques, come in. The remainder of this chapter gives a
brief preview of the bootstrap estimate of standard error, a method which
will be fully discussed in succeeding chapters.

Suppose we observe independent data points $x_1, x_2, \ldots, x_n$, for
convenience denoted by the vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$,
from which we compute a statistic of interest $s(\mathbf{x})$. For example
the data might be the $n = 9$ control group observations in Table 2.1, and
$s(\mathbf{x})$ might be the sample mean.

The bootstrap estimate of standard error, invented by Efron in 1979, looks
completely different than (2.2), but in fact it is closely related, as we
shall see. A bootstrap sample $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$
is obtained by randomly sampling $n$ times, with replacement, from the
original data points $x_1, x_2, \ldots, x_n$. For instance, with $n = 7$ we
might obtain $\mathbf{x}^* = (x_5, x_7, x_5, x_4, x_7, x_3, x_1)$.
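
Drawing a bootstrap sample is a one-line operation in most languages. The sketch below uses Python's standard `random` module. The function name `bootstrap_sample` and the string labels standing in for the data points are illustrative choices, not part of the book.

```python
import random

def bootstrap_sample(data, rng):
    """Draw one bootstrap sample: n draws with replacement from the n data points."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]  # labels for 7 data points
x_star = bootstrap_sample(data, rng)
print(x_star)
```

Because sampling is with replacement, some labels typically appear more than once in `x_star` and others not at all, just as in the example above.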


Figure 2.1. Schematic of the bootstrap process for estimating the standard
error of a statistic $s(\mathbf{x})$. $B$ bootstrap samples are generated
from the original data set. Each bootstrap sample has $n$ elements,
generated by sampling with replacement $n$ times from the original data
set. Bootstrap replicates $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots,
s(\mathbf{x}^{*B})$ are obtained by calculating the value of the statistic
$s(\mathbf{x})$ on each bootstrap sample. Finally, the standard deviation
of the values $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots,
s(\mathbf{x}^{*B})$ is our estimate of the standard error of
$s(\mathbf{x})$.

Figure 2.1 is a schematic of the bootstrap process. The bootstrap
algorithm begins by generating a large number of independent bootstrap
samples $\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$, each
of size $n$. Typical values for $B$, the number of bootstrap samples,
range from 50 to 200 for standard error estimation. Corresponding to each
bootstrap sample is a bootstrap replication of $s$, namely
$s(\mathbf{x}^{*b})$, the value of the statistic $s$ evaluated for
$\mathbf{x}^{*b}$. If $s(\mathbf{x})$ is the sample median, for instance,
then $s(\mathbf{x}^*)$ is the median of the bootstrap sample. The
bootstrap estimate of standard error is the standard deviation of the
bootstrap replications,

$$\widehat{\mathrm{se}}_{\mathrm{boot}} = \Big\{ \sum_{b=1}^{B} [s(\mathbf{x}^{*b}) - s(\cdot)]^2/(B-1) \Big\}^{1/2}, \tag{2.3}$$

where $s(\cdot) = \sum_{b=1}^{B} s(\mathbf{x}^{*b})/B$. Suppose
$s(\mathbf{x})$ is the mean $\bar{x}$. In this case, standard probability
theory tells us (Problem 2.5) that as $B$ gets very large, formula (2.3)
approaches

$$\Big\{ \sum_{i=1}^{n} (x_i - \bar{x})^2/n^2 \Big\}^{1/2}. \tag{2.4}$$

This is almost the same as formula (2.2). We could make it exactly the
same by multiplying definition (2.3) by the factor $[n/(n-1)]^{1/2}$, but
there is no real advantage in doing so.

Table 2.2. Bootstrap estimates of standard error for the mean and median;
treatment group, mouse data, Table 2.1. The median is less accurate (has
larger standard error) than the mean for this data set.

B:         50      100     250     500     1000    ∞
mean:      19.72   23.63   22.32   23.79   23.02   23.36
median:    32.21   36.35   34.46   36.72   36.48   37.83

Table 2.2 shows bootstrap estimated standard errors for the mean and the
median, for the treatment group mouse data of Table 2.1. The estimated
standard errors settle down to limiting values as the number of bootstrap
samples $B$ increases. The limiting value 23.36 for the mean is obtained
from (2.4). The formula for the limiting value 37.83 for the standard
error of the median is quite complicated: see Problem 2.4 for a
derivation.
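
The whole algorithm fits in a few lines of code. The following Python sketch implements formula (2.3) for an arbitrary statistic and compares it with the limiting value (2.4) for the mean. The function names, the random seed and the choice B = 2000 are our own, and the treatment-group values are assumed to be the published mouse data (they reproduce the limit 23.36 quoted above). It is a sketch under those assumptions, not the program from the book's Appendix.

```python
import math
import random
import statistics

def bootstrap_se(data, statistic, B, rng):
    """Bootstrap estimate of standard error, formula (2.3)."""
    n = len(data)
    replications = []
    for _ in range(B):
        # One bootstrap sample: n draws with replacement from the data.
        x_star = [data[rng.randrange(n)] for _ in range(n)]
        replications.append(statistic(x_star))
    # Standard deviation of the B replications, with divisor B - 1.
    return statistics.stdev(replications)

def sample_mean(xs):
    return sum(xs) / len(xs)

def plug_in_se_of_mean(data):
    """Limiting value (2.4) of the bootstrap standard error of the mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n ** 2)

treatment = [94, 197, 16, 38, 99, 141, 23]  # assumed treatment-group data
rng = random.Random(1)
se_mean = bootstrap_se(treatment, sample_mean, B=2000, rng=rng)
se_median = bootstrap_se(treatment, statistics.median, B=2000, rng=rng)
limit_mean = plug_in_se_of_mean(treatment)
print(f"bootstrap se(mean)   = {se_mean:.2f}  (limit from (2.4): {limit_mean:.2f})")
print(f"bootstrap se(median) = {se_median:.2f}")
```

The bootstrap values vary from run to run with the seed and with B, which is exactly the behaviour Table 2.2 illustrates. Only the limit from (2.4), 23.36, is fixed.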

We are now in a position to assess the precision of the difference in
medians between the two groups. The bootstrap procedure described above
was applied to the control group, producing a standard error estimate of
11.54 based on $B = 100$ replications ($B = \infty$ gave 9.73). Therefore,
using $B = 100$, the observed difference of 48 has an estimated standard
error of $\sqrt{36.35^2 + 11.54^2} = 38.14$, and hence is 48/38.14 = 1.26
standard errors greater than zero. This ratio is larger than the one
obtained for the difference in means, but is still insignificant.

For most statistics we don’t have a formula for the limiting value of the
standard error, but in fact no formula is needed. Instead we use the
numerical output of the bootstrap program, for some convenient value of
$B$. We will see in Chapters 6 and 19 that $B$ in the range 50 to 200
usually makes $\widehat{\mathrm{se}}_{\mathrm{boot}}$ a good standard
error estimator, even for estimators like the median. It is easy to write
a bootstrap program that works for any computable statistic
$s(\mathbf{x})$, as shown in Chapter 6 and the Appendix. With these
programs in place, the data analyst is free to use any estimator, no
matter how complicated, with the assurance that he or she will also have a
reasonable idea of the estimator’s accuracy. The price, a factor of
perhaps 100 in increased computation, has become affordable as computers
have grown faster and cheaper.

Standard errors are the simplest measures of statistical accuracy. Later
chapters show how bootstrap methods can assess more complicated accuracy
measures, like biases, prediction errors, and confidence intervals.
Bootstrap confidence intervals add another factor of 10 to the
computational burden. The payoff for all this computation is an increase
in the statistical problems that can be analyzed, a reduction in the
assumptions of the analysis, and the elimination of the routine but
tedious theoretical calculations usually associated with accuracy
assessment.

2.1 Problems

2.1 * Suppose that the mouse survival times were expressed in weeks
    instead of days, so that the entries in Table 2.1 were all divided
    by 7.

    (a) What effect would this have on $\bar{x}$ and on its estimated
        standard error (2.2)? Why does this make sense?

    (b) What effect would this have on the ratio of the difference
        $\bar{x} - \bar{y}$ to its estimated standard error?

2.2 Imagine the treatment group in Table 2.1 consisted of $R$ repetitions
    of the data actually shown, where $R$ is a positive integer. That is,
    the treatment data consisted of $R$ 94’s, $R$ 197’s, etc. What effect
    would this have on the estimated standard error (2.2)?

2.3 It is usually true that the error of a statistical estimator decreases
    at a rate of about 1 over the square root of the sample size. Does
    this agree with the result of Problem 2.2?

2.4 Let $x_{(1)} < x_{(2)} < x_{(3)} < x_{(4)} < x_{(5)} < x_{(6)} < x_{(7)}$
    be an ordered sample of size $n = 7$. Let $\mathbf{x}^*$ be a
    bootstrap sample, and $s(\mathbf{x}^*)$ be the corresponding bootstrap
    replication of the median. Show that

    (a) $s(\mathbf{x}^*)$ equals one of the original data values
        $x_{(i)}$, $i = 1, 2, \ldots, 7$.

    (b) † $s(\mathbf{x}^*)$ equals $x_{(i)}$ with probability

        $$p(i) = \sum_{j=0}^{3} \Big\{ \mathrm{Bi}\big(j; n, \tfrac{i-1}{n}\big) - \mathrm{Bi}\big(j; n, \tfrac{i}{n}\big) \Big\}, \tag{2.5}$$

        where $\mathrm{Bi}(j; n, p)$ is the binomial probability
        $\binom{n}{j} p^j (1-p)^{n-j}$. [The numerical values of $p(i)$
        are .0102, .0981, .2386, .3062, .2386, .0981, .0102. These values
        were used to compute
        $\widehat{\mathrm{se}}_{\mathrm{boot}}\{\mathrm{median}\} = 37.83$,
        for $B = \infty$, Table 2.2.]

2.5 Apply the weak law of large numbers to show that expression (2.3)
    approaches expression (2.4) as $B$ goes to infinity.

† Indicates a difficult or more advanced problem.
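
The numerical values quoted in Problem 2.4 can be checked directly from formula (2.5). The Python sketch below is only such a check, not a derivation. The helper names are ours, the upper summation limit (n - 1)/2 generalizes the value 3 used for n = 7 (our assumption, for odd n), and the treatment-group values are assumed to be the published mouse data.

```python
import math

def binomial_pmf(j, n, p):
    """Bi(j; n, p) = C(n, j) p^j (1 - p)^(n - j)."""
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)

def median_probabilities(n):
    """p(i), i = 1..n, from (2.5): chance the bootstrap median is x_(i), n odd."""
    m = (n - 1) // 2  # upper summation limit; equals 3 when n = 7
    return [
        sum(binomial_pmf(j, n, (i - 1) / n) - binomial_pmf(j, n, i / n)
            for j in range(m + 1))
        for i in range(1, n + 1)
    ]

ordered = sorted([94, 197, 16, 38, 99, 141, 23])  # assumed treatment-group data
p = median_probabilities(len(ordered))

# Exact (B = infinity) bootstrap standard error of the median:
# the standard deviation of a variable taking value x_(i) with probability p(i).
mean_star = sum(pi * x for pi, x in zip(p, ordered))
se_median = math.sqrt(sum(pi * (x - mean_star) ** 2 for pi, x in zip(p, ordered)))
print([round(pi, 4) for pi in p])
print(f"se_boot(median) for B = infinity: {se_median:.2f}")
```

The probabilities sum to 1, are symmetric about the middle order statistic, and give a standard error of 37.83, agreeing with the last column of Table 2.2.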


CHAPTER 3

Random samples and

probabilities

3.1 Introduction

Statistics is the theory of accumulating information, especially
information that arrives a little bit at a time. A typical statistical
situation was illustrated by the mouse data of Table 2.1. No one mouse
provides much information, since the individual results are so variable,
but seven or nine mice considered together begin to be quite informative.
Statistical theory concerns the best ways of extracting this information.
Probability theory provides the mathematical framework for statistical
inference. This chapter reviews the simplest probabilistic model used to
model random data: the case where the observations are a random sample
from a single unknown population, whose properties we are trying to learn
from the observed data.

3.2 Random samples

It is easiest to visualize random samples in terms of a finite population
or “universe” $\mathcal{U}$ of individual units $U_1, U_2, \ldots, U_N$,
any one of which is equally likely to be selected in a single random draw.
The population of units might be all the registered voters in an area
undergoing a political survey, all the men that might conceivably be
selected for a medical experiment, all the high schools in the United
States, etc. The individual units have properties we would like to learn,
like a political opinion, a medical survival time, or a graduation rate.
It is too difficult and expensive to examine every unit in $\mathcal{U}$,
so we select for observation a random sample of manageable size.

A random sample of size $n$ is defined to be a collection of $n$ units
$u_1, u_2, \ldots, u_n$ selected at random from $\mathcal{U}$. In
principle the sampling process goes as follows: a random number device
independently selects integers $j_1, j_2, \ldots, j_n$, each of which
equals any value between 1 and $N$ with probability $1/N$. These integers
determine which members of $\mathcal{U}$ are selected to be in the random
sample, $u_1 = U_{j_1}, u_2 = U_{j_2}, \ldots, u_n = U_{j_n}$. In practice
the selection process is seldom this neat, and the population
$\mathcal{U}$ may be poorly defined, but the conceptual framework of
random sampling is still useful for understanding statistical inference.
(The methodology of good experimental design, for example the random
assignment of selected units to Treatment or Control groups as was done in
the mouse experiment, helps make random sampling theory more applicable to
real situations like that of Table 2.1.)

Our definition of random sampling allows a single unit $U_i$ to appear
more than once in the sample. We could avoid this by insisting that the
integers $j_1, j_2, \ldots, j_n$ be distinct, called “sampling without
replacement.” It is a little simpler to allow repetitions, that is to
“sample with replacement”, as in the previous paragraph. If the size $n$
of the random sample is much smaller than the population size $N$, as is
usually the case, the probability of sample repetitions will be small
anyway. See Problem 3.1. Random sampling always means sampling with
replacement in what follows, unless otherwise stated.
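
The two sampling schemes, and the chance that sampling with replacement produces no repetitions, can be sketched as follows. The function names are illustrative, and the population sizes in the last line are chosen only to show the effect of N being much larger than n.

```python
import random

def sample_with_replacement(N, n, rng):
    """Select integers j_1, ..., j_n independently, each uniform on 1..N."""
    return [rng.randint(1, N) for _ in range(n)]

def sample_without_replacement(N, n, rng):
    """Select n distinct integers from 1..N."""
    return rng.sample(range(1, N + 1), n)

def prob_no_repetitions(N, n):
    """Probability that n draws with replacement from N units are all distinct."""
    prob = 1.0
    for k in range(n):
        # The (k+1)-th draw must avoid the k units already selected.
        prob *= (N - k) / N
    return prob

rng = random.Random(0)
print(sample_with_replacement(82, 15, rng))
print(sample_without_replacement(82, 15, rng))
print(prob_no_repetitions(82, 15), prob_no_repetitions(10_000, 15))
```

The product formula in `prob_no_repetitions` is the usual "birthday problem" calculation. It approaches 1 as N grows with n held fixed.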

Having selected a random sample $u_1, u_2, \ldots, u_n$, we obtain one or
more measurements of interest for each unit. Let $x_i$ indicate the
measurements for unit $u_i$. The observed data are the collection of
measurements $x_1, x_2, \ldots, x_n$. Sometimes we will denote the
observed data $(x_1, x_2, \ldots, x_n)$ by the single symbol $\mathbf{x}$.

We can imagine making the measurements of interest on every member
$U_1, U_2, \ldots, U_N$ of $\mathcal{U}$, obtaining values
$X_1, X_2, \ldots, X_N$. This would be called a census of $\mathcal{U}$.

The symbol $\mathcal{X}$ will denote the census of measurements
$(X_1, X_2, \ldots, X_N)$. We will also refer to $\mathcal{X}$ as the
population of measurements, or simply the population, and call
$\mathbf{x}$ a random sample of size $n$ from $\mathcal{X}$. In fact, we
usually can’t afford to conduct a census, which is why we have taken a
random sample. The goal of statistical inference is to say what we have
learned about the population $\mathcal{X}$ from the observed data
$\mathbf{x}$. In particular, we will use the bootstrap to say how
accurately a statistic calculated from $x_1, x_2, \ldots, x_n$ (for
instance the sample median) estimates the corresponding quantity for the
whole population.


Table 3.1. The law school data. A random sample of size n = 15 was taken
from the collection of N = 82 American law schools participating in a
large study of admission practices. Two measurements were made on the
entering classes of each school in 1973: LSAT, the average score for the
class on a national law test, and GPA, the average undergraduate
grade-point average for the class.

School  LSAT  GPA     School  LSAT  GPA
1       576   3.39    9       651   3.36
2       635   3.30    10      605   3.13
3       558   2.81    11      653   3.12
4       578   3.03    12      575   2.74
5       666   3.44    13      545   2.76
6       580   3.07    14      572   2.88
7       555   3.00    15      594   2.96
8       661   3.43

Table 3.1 shows a random sample of size $n = 15$ drawn from a population
of $N = 82$ American law schools. What is actually shown are two
measurements made on the entering classes of 1973 for each school in the
sample: LSAT, the average score of the class on a national law test, and
GPA, the average undergraduate grade point average achieved by the members
of the class. In this case the measurement $x_i$ on $u_i$, the $i$th
member of the sample, is the pair

$$x_i = (\mathrm{LSAT}_i, \mathrm{GPA}_i), \quad i = 1, 2, \ldots, 15.$$

The observed data $x_1, x_2, \ldots, x_n$ is the collection of 15 pairs of
numbers shown in Table 3.1.

This example is an artificial one because the census of data
$X_1, X_2, \ldots, X_{82}$ was actually made. In other words, LSAT and GPA
are available for the entire population of $N = 82$ schools. Figure 3.1
shows the census data and the sample data. Table 3.2 gives the entire
population of $N$ measurements.

In a real statistical problem, like that of Table 3.1, we would see only
the sample data, from which we would be trying to infer the properties of
the population. For example, consider the 15 LSAT scores in the observed
sample. These have mean 600.27 with estimated standard error 10.79, based
on the data in Table 3.1 and formula (2.2). There is about a 68% chance
that the true LSAT mean, the mean for the entire population from which the
observed data was sampled, lies in the interval 600.27 ± 10.79.

Figure 3.1. The left panel is a scatterplot of the (LSAT, GPA) data for
all N = 82 law schools; circles indicate the n = 15 data points comprising
the “observed sample” of Table 3.1. The right panel shows only the
observed sample. In problems of statistical inference, we are trying to
infer the situation on the left from the picture on the right.

We can check this result, since we are dealing with an artificial example
for which the complete population data are known. The mean of all 82 LSAT
values is 597.55, lying nicely within the predicted interval
600.27 ± 10.79.
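
The numbers in this check can be reproduced from the 15 sampled LSAT scores of Table 3.1. The short Python sketch below applies formula (2.2). The variable names are ours, and the population mean is simply the value quoted in the text rather than something the sketch computes.

```python
import math

# LSAT scores of the 15 sampled schools, Table 3.1.
lsat = [576, 635, 558, 578, 666, 580, 555, 661,
        651, 605, 653, 575, 545, 572, 594]

n = len(lsat)
mean = sum(lsat) / n
s2 = sum((x - mean) ** 2 for x in lsat) / (n - 1)
se = math.sqrt(s2 / n)              # formula (2.2)
lower, upper = mean - se, mean + se
population_mean = 597.55            # census value quoted in the text
print(f"{mean:.2f} +/- {se:.2f} gives the interval ({lower:.2f}, {upper:.2f})")
print("population mean inside interval:", lower <= population_mean <= upper)
```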

3.3 Probability theory

Statistical inference concerns learning from experience: we observe a
random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and wish to infer
properties of the complete population
$\mathcal{X} = (X_1, X_2, \ldots, X_N)$ that yielded the sample.
Probability theory goes in the opposite direction: from the composition of
a population $\mathcal{X}$ we deduce the properties of a random sample
$\mathbf{x}$, and of statistics calculated from $\mathbf{x}$. Statistical
inference as a mathematical science has been developed almost exclusively
in terms of probability theory. Here we will review briefly some
fundamental concepts of probability, including probability distributions,
expectations, and independence.

Table 3.2. The population of measurements (LSAT, GPA), for the universe of
82 law schools. The data in Table 3.1 was sampled from this population.
The +’s indicate the sampled schools.

school  LSAT  GPA    school  LSAT  GPA    school  LSAT  GPA
1       622   3.23   28      632   3.29   56      641   3.28
2       542   2.83   29      587   3.16   57      512   3.01
3       579   3.24   30      581   3.17   58      631   3.21
4+      653   3.12   31+     605   3.13   59      597   3.32
5       606   3.09   32      704   3.36   60      621   3.24
6+      576   3.39   33      477   2.57   61      617   3.03
7       620   3.10   34      591   3.02   62      637   3.33
8       615   3.40   35+     578   3.03   63      572   3.08
9       553   2.97   36+     572   2.88   64      610   3.13
10      607   2.91   37      615   3.37   65      562   3.01
11      558   3.11   38      606   3.20   66      635   3.30
12      596   3.24   39      603   3.23   67      614   3.15
13+     635   3.30   40      535   2.98   68      546   2.82
14      581   3.22   41      595   3.11   69      598   3.20
15+     661   3.43   42      575   2.92   70+     666   3.44
16      547   2.91   43      573   2.85   71      570   3.01
17      599   3.23   44      644   3.38   72      570   2.92
18      646   3.47   45+     545   2.76   73      605   3.45
19      622   3.15   46      645   3.27   74      565   3.15
20      611   3.33   47+     651   3.36   75      686   3.50
21      546   2.99   48      562   3.19   76      608   3.16
22      614   3.19   49      609   3.17   77      595   3.19
23      628   3.03   50+     555   3.00   78      590   3.15
24      575   3.01   51      586   3.11   79+     558   2.81
25      662   3.39   52+     580   3.07   80      611   3.16
26      627   3.41   53+     594   2.96   81      564   3.02
27      608   3.04   54      594   3.05   82+     575   2.74
                     55      560   2.93

As a first example, let $x$ represent the outcome of rolling a fair die so
$x$ is equally likely to be 1, 2, 3, 4, 5, or 6. We write this in
probability notation as

$$\mathrm{Prob}\{x = k\} = 1/6 \quad \text{for } k = 1, 2, 3, 4, 5, 6. \tag{3.1}$$

A random quantity like $x$ is often called a random variable.

Probabilities are idealized or theoretical proportions. We can imagine a
universe $\mathcal{U} = \{U_1, U_2, \ldots, U_N\}$ of possible rolls of
the die, where $U_j$ completely describes the physical act of the $j$th
roll, with corresponding results $\mathcal{X} = (X_1, X_2, \ldots, X_N)$.
Here $N$ might be very large, or even infinite. The statement
$\mathrm{Prob}\{x = 5\} = 1/6$ means that a randomly selected member of
$\mathcal{X}$ has a 1/6 chance of equaling 5, or more simply that 1/6 of
the members of $\mathcal{X}$ equal 5. Notice that probabilities, like
proportions, can never be less than 0 or greater than 1.

For convenient notation define the frequencies $f_k$,

$$f_k = \mathrm{Prob}\{x = k\}, \tag{3.2}$$

so the fair die has $f_k = 1/6$ for $k = 1, 2, \ldots, 6$. The probability
distribution of a random variable $x$, which we will denote by $F$, is any
complete description of the probabilistic behavior of $x$. $F$ is also
called the probability distribution of the population $\mathcal{X}$. Here
we can take $F$ to be the vector of frequencies

$$F = (f_1, f_2, \ldots, f_6) = (1/6, 1/6, \ldots, 1/6). \tag{3.3}$$

An unfair die would be one for which $F$ did not equal
$(1/6, 1/6, \ldots, 1/6)$.

Note: In many books, the symbol $F$ is used for the cumulative probability
distribution function $F(x_0) = \mathrm{Prob}\{x \le x_0\}$ for
$-\infty < x_0 < \infty$. This is an equally valid description of the
probabilistic behavior of $x$, but it is only convenient for the case
where $x$ is a real number. We will also be interested in cases where $x$
is a vector, as in Table 3.1, or an even more general object. This is the
reason for defining $F$ as any description of $x$'s probabilities, rather
than the specific description in terms of the cumulative probabilities.
When no confusion can arise, in later chapters we use symbols like $F$ and
$G$ to represent cumulative distribution functions.

Some probability distributions arise so frequently that they have received
special names. A random variable $x$ is said to have the binomial
distribution with size $n$ and probability of success $p$, denoted

$$x \sim \mathrm{Bi}(n, p), \tag{3.4}$$

if its frequencies are

$$f_k = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, 2, \ldots, n. \tag{3.5}$$

Here $n$ is a positive integer, $p$ is a number between 0 and 1, and
$\binom{n}{k}$ is the binomial coefficient $n!/[k!(n-k)!]$. Figure 3.2
shows the distribution $F = (f_0, f_1, \ldots, f_n)$ for
$x \sim \mathrm{Bi}(n, p)$, with $n = 25$ and $p = .25$, $.50$, and $.90$.
We also write $F = \mathrm{Bi}(n, p)$ to indicate situation (3.4).

Let $A$ be a set of integers. Then the probability that $x$ takes a value
in $A$, or more simply the probability of $A$, is

$$\mathrm{Prob}\{x \in A\} = \mathrm{Prob}\{A\} = \sum_{k \in A} f_k. \tag{3.6}$$

For example if $A = \{1, 3, 5, \ldots, 25\}$ and
$x \sim \mathrm{Bi}(25, p)$, then $\mathrm{Prob}\{A\}$ is the probability
that a binomial random variable of size 25 and probability of success $p$
equals an odd integer. Notice that since $f_k$ is the theoretical
proportion of times $x$ equals $k$, the sum
$\sum_{k \in A} f_k = \mathrm{Prob}\{A\}$ is the theoretical proportion of
times $x$ takes its value in $A$.

The sample space of $x$, denoted $S_x$, is the collection of possible
values $x$ can have. For a fair die, $S_x = \{1, 2, \ldots, 6\}$, while
$S_x = \{0, 1, 2, \ldots, n\}$ for a $\mathrm{Bi}(n, p)$ distribution. By
definition, $x$ occurs in $S_x$ every time, that is, with theoretical
proportion 1, so

$$\mathrm{Prob}\{S_x\} = \sum_{k \in S_x} f_k = 1. \tag{3.7}$$

For any probability distribution on the integers the frequencies $f_j$ are
nonnegative numbers summing to 1.
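
Formulas (3.5), (3.6) and (3.7) translate directly into code. The Python sketch below computes the binomial frequencies for n = 25 and the three values of p used in Figure 3.2, then evaluates the probability of the set of odd integers. The function names are illustrative.

```python
import math

def binomial_frequencies(n, p):
    """Frequencies f_k of Bi(n, p), formula (3.5), for k = 0, 1, ..., n."""
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def prob_of_set(frequencies, A):
    """Prob{A} = sum of f_k over k in A, formula (3.6)."""
    return sum(frequencies[k] for k in A)

odd = range(1, 26, 2)  # A = {1, 3, 5, ..., 25}
for p in (0.25, 0.50, 0.90):
    f = binomial_frequencies(25, p)
    # sum(f) illustrates (3.7); prob_of_set(f, odd) illustrates (3.6).
    print(p, sum(f), prob_of_set(f, odd))
```

For p = .50 the probability of an odd outcome is exactly one half, which gives a quick sanity check on the implementation.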

In our examples so far, the sample space $S_x$ has been a subset of the
integers. One of the convenient things about probability distributions is
that they can be defined on quite general spaces. Consider the law school
data of Figure 3.1. We might take $S_x$ to be the positive quadrant of the
plane,

$$S_x = \mathcal{R}^{2+} = \{(y, z),\ y > 0,\ z > 0\}. \tag{3.8}$$

(This includes values like $x = (10^6, 10^9)$, but it doesn’t hurt to let
$S_x$ be too big.) For a subset $A$ of $S_x$ we would still write
$\mathrm{Prob}\{A\}$ to indicate the probability that $x$ occurs in $A$.

For example, we could take

$$A = \{(y, z) : 0 < y < 600, \; 0 < z < 3.0\}. \tag{3.9}$$

A law school $x \in A$ if its 1973 entering class had LSAT less than
600 and GPA less than 3.0. In this case we happen to know the
complete population $\mathcal{X}$; it is the 82 points indicated on the left

panel of Figure 3.1 and in Table 3.2. Of these, 16 are in A, so

Prob{A} = 16/82 = .195.

(3.10)

Figure 3.2. The frequencies $f_0, f_1, \ldots, f_n$ for the binomial distributions
Bi(n,p), n = 25 and p = .25, .50, and .90. The points have been connected
by lines to enhance visibility.

Here the idealized proportion Prob{A} is an actual proportion.

Only in cases where we have a complete census of the population

is it possible to directly evaluate probabilities as proportions.

The probability distribution F of x is still defined to be any

complete description of x’s probabilities. In the law school example,

F can be described as follows: for any subset A of $S_x = \mathcal{R}^{2+}$,

$$\mathrm{Prob}\{x \in A\} = \#\{X_j \in A\}/82, \tag{3.11}$$

where $\#\{X_j \in A\}$ is the number of the 82 points in the left panel

of Figure 3.1 that lie in A. Another way to say the same thing is

that F is a discrete distribution putting probability (or frequency)

1/82 on each of the indicated 82 points.

Probabilities can be defined continuously, rather than discretely
as in (3.6) or (3.11). The most famous example is the normal (or
Gaussian, or bell-shaped) distribution. A real-valued random variable
x is defined to have the normal distribution with mean $\mu$ and
variance $\sigma^2$, written

$$x \sim N(\mu, \sigma^2) \quad \text{or} \quad F = N(\mu, \sigma^2), \tag{3.12}$$

if

$$\mathrm{Prob}\{x \in A\} = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \tag{3.13}$$

for any subset A of the real line $\mathcal{R}^1$. The integral in (3.13) is over
the values of $x \in A$.

There are higher dimensional versions of the normal distribution,
which involve taking integrals similar to (3.13) over multidimensional
sets A. We won’t need continuous distributions for

development of the bootstrap (though they will appear later in

some of the applications) and will avoid mathematical derivations

based on calculus. As we shall see, one of the main incentives for the

development of the bootstrap is the desire to substitute computer

power for theoretical calculations involving special distributions.

The expectation of a real-valued random variable x, written E(x),

is its average value, where the average is taken over the possible

outcomes of x weighted according to its probability distribution F.

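Thus

$$E(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x \sim \mathrm{Bi}(n, p), \tag{3.14}$$

and

$$E(x) = \int_{-\infty}^{\infty} x \, \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \quad \text{for } x \sim N(\mu, \sigma^2). \tag{3.15}$$

It is not difficult to show that E(x) = np for x ~ Bi(n,p), and
E(x) = $\mu$ for $x \sim N(\mu, \sigma^2)$. (See Problems 3.6 and 3.7.)
We sometimes write the expectation as $E_F(x)$, to indicate that
the average is taken with respect to the distribution F.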

Suppose r = g(x) is some function of the random variable x.

Then E(r), the expectation of r, is the theoretical average of g(x)

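weighted according to the probability distribution of x. For example
if $x \sim N(\mu, \sigma^2)$ and $r = x^3$, then

$$E(r) = \int_{-\infty}^{\infty} x^3 \, \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx. \tag{3.16}$$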

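Probabilities are a special case of expectations. Let A be a subset
of $S_x$, and take $r = I_{\{x \in A\}}$ where $I_{\{x \in A\}}$ is the indicator function

$$I_{\{x \in A\}} = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases} \tag{3.17}$$

Then E(r) equals Prob{x ∈ A}, or equivalently

$$E(I_{\{x \in A\}}) = \mathrm{Prob}\{x \in A\}. \tag{3.18}$$

For example if $x \sim N(\mu, \sigma^2)$, then

$$E(I_{\{x \in A\}}) = \int_{-\infty}^{\infty} I_{\{x \in A\}} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx, \tag{3.19}$$

which is Prob{x ∈ A} according to (3.13).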
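Relation (3.18) also suggests a way to approximate a probability by computer: average the indicator function over many simulated values of x. The sketch below is an illustration under our own assumptions: the set A = {x : x ≤ μ + σ}, the values μ = 0 and σ = 1, the number of draws, and the seed are arbitrary choices, not taken from the text.

```python
import random
from math import erf, sqrt

mu, sigma = 0.0, 1.0
upper = mu + sigma  # hypothetical set A = {x : x <= mu + sigma}

def indicator(x):
    """I_{x in A} as in (3.17): 1 if x lies in A, 0 otherwise."""
    return 1.0 if x <= upper else 0.0

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [rng.gauss(mu, sigma) for _ in range(100_000)]

# Average of the indicator over the simulated draws approximates E(I_{x in A}).
mc_estimate = sum(indicator(x) for x in draws) / len(draws)

# Prob{x in A} from the integral (3.13), evaluated with the error function.
exact = 0.5 * (1.0 + erf((upper - mu) / (sigma * sqrt(2.0))))
print(mc_estimate, exact)
```

The two numbers should agree to about two decimal places; the simulated value changes slightly with the seed and the number of draws.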

The notion of an expectation as a theoretical average is very

general, and includes cases where the random variable x is not

real-valued. In the law school situation, for instance, we might

be interested in the expectation of the ratio of LSAT and GPA.

Writing x = (y,z) as in (3.8), then r = y/z, and the expectation
of r is

$$E(\mathrm{LSAT}/\mathrm{GPA}) = \frac{1}{82} \sum_{j=1}^{82} (Y_j / Z_j) \tag{3.20}$$

where $X_j = (Y_j, Z_j)$ is the jth point in Table 3.2. Numerical evaluation
of (3.20) gives E(LSAT/GPA) = 190.8.

Let $\mu_x = E_F(x)$, for x a real-valued random variable with distribution
F. The variance of x, indicated by $\sigma_x^2$ or just $\sigma^2$, is defined
to be the expected value of $y = (x - \mu_x)^2$. In other words, $\sigma_x^2$ is the
theoretical average squared distance of a random variable x from
its expectation $\mu_x$,

$$\sigma_x^2 = E_F(x - \mu_x)^2. \tag{3.21}$$

The variance of $x \sim N(\mu, \sigma^2)$ equals $\sigma^2$; the variance of x ~
Bi(n,p) equals np(1 - p), see Problem 3.9. The standard deviation
of a random variable is defined to be the square root of its
variance.
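For a discrete distribution, the expectation (3.14) and the variance (3.21) are just weighted sums over the frequencies, so the stated values np and np(1 - p) can be confirmed numerically for any particular n and p. This is only a numerical check, not a proof; the function name `binomial_moments` is our own.

```python
from math import comb

def binomial_moments(n, p):
    """Mean and variance of Bi(n, p), computed directly from the frequencies (3.5)."""
    freqs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * f for k, f in enumerate(freqs))               # (3.14)
    var = sum((k - mean) ** 2 * f for k, f in enumerate(freqs))  # (3.21)
    return mean, var

n, p = 25, 0.25
mean, var = binomial_moments(n, p)
print(mean, n * p)            # both should be close to 6.25
print(var, n * p * (1 - p))   # both should be close to 4.6875
```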

Two random variables y and z are said to be independent if

$$E[g(y)h(z)] = E[g(y)]\,E[h(z)] \tag{3.22}$$

for all functions g(y) and h(z). Independence is well named: (3.22)
implies that the random outcome of y doesn’t affect the random
outcome of z, and vice-versa.

To see this, let B and C be subsets of $S_y$ and $S_z$ respectively,
the sample spaces of y and z, and take g and h to be the indicator
functions $g(y) = I_{\{y \in B\}}$ and $h(z) = I_{\{z \in C\}}$. Notice that

$$I_{\{y \in B\}} I_{\{z \in C\}} = \begin{cases} 1 & \text{if } y \in B \text{ and } z \in C \\ 0 & \text{otherwise.} \end{cases} \tag{3.23}$$

So $I_{\{y \in B\}} I_{\{z \in C\}}$ is the indicator function of the intersection
$\{y \in B\} \cap \{z \in C\}$. Then by (3.18) and the independence definition
(3.22),

$$\mathrm{Prob}\{(y, z) \in B \cap C\} = E(I_{\{y \in B\}} I_{\{z \in C\}}) = E(I_{\{y \in B\}})\,E(I_{\{z \in C\}}) = \mathrm{Prob}\{y \in B\}\,\mathrm{Prob}\{z \in C\}. \tag{3.24}$$

Looking at Figure 3.1, we can see that (3.24) does not hold for

the law school example, see Problem 3.10, so LSAT and GPA are

not independent.
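The failure of (3.24) for the law school population can be seen from three counts quoted in this chapter: 16 of the 82 schools lie in the set A of (3.9), while Problem 3.10 states that 43 have LSAT below 600 and 17 have GPA below 3.0. A small sketch of the comparison follows; the variable names are ours.

```python
N = 82                   # size of the law school population
n_lsat_below_600 = 43    # count quoted in Problem 3.10
n_gpa_below_3 = 17       # count quoted in Problem 3.10
n_both = 16              # count quoted with (3.10)

joint = n_both / N                                      # left side of (3.24)
product = (n_lsat_below_600 / N) * (n_gpa_below_3 / N)  # right side of (3.24)
print(round(joint, 3), round(product, 3))
```

If LSAT and GPA were independent the two printed values would coincide; here the joint proportion is clearly the larger of the two.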

Whether or not y and z are independent, expectations follow the

simple addition rule

$$E[g(y) + h(z)] = E[g(y)] + E[h(z)]. \tag{3.25}$$

In general,

$$E\Big[\sum_{i=1}^{n} g_i(x_i)\Big] = \sum_{i=1}^{n} E[g_i(x_i)] \tag{3.26}$$

for any functions $g_i$ of any n random variables $x_1, x_2, \ldots, x_n$.

Random sampling with replacement guarantees independence: if
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a random sample of size n from a population
$\mathcal{X}$, then all n observations are identically distributed and
mutually independent of each other. In other words, all of the $x_i$
have the same probability distribution F, and

$$E_F[g_1(x_1) g_2(x_2) \cdots g_n(x_n)] = E_F[g_1(x_1)]\,E_F[g_2(x_2)] \cdots E_F[g_n(x_n)] \tag{3.27}$$

for any functions $g_1, g_2, \ldots, g_n$. (This is almost a definition of what
random sampling means.) We will write

$$F \to (x_1, x_2, \ldots, x_n) \tag{3.28}$$

to indicate that $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a random sample of size n
from a population with probability distribution F. This is sometimes
written as

$$x_i \overset{\text{i.i.d.}}{\sim} F, \quad i = 1, 2, \ldots, n, \tag{3.29}$$

where i.i.d. stands for independent and identically distributed.
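In practice a random sample with replacement is produced by drawing n independent, uniformly distributed indices and reading off the corresponding population members. A minimal sketch follows, assuming Python's standard `random` module; the five-member population and the function name `random_sample` are hypothetical and serve only as illustration.

```python
import random

def random_sample(population, n, rng):
    """Draw n members with replacement; each draw is uniform over the population."""
    # Indices run over 0, ..., N-1 here, rather than 1, ..., N as in the text.
    indices = [rng.randrange(len(population)) for _ in range(n)]
    return [population[j] for j in indices]

population = ["a", "b", "c", "d", "e"]   # hypothetical population of size N = 5
sample = random_sample(population, 8, random.Random(1))
print(sample)
```

Because the draws are made with replacement, the sample size may exceed the population size, and repeated members are to be expected.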

3.4 Problems

3.1 A random sample of size n is taken with replacement from

a population of size N. Show that the probability of having

no repetitions in the sample is given by the product

$$\prod_{j=0}^{n-1} \Big(1 - \frac{j}{N}\Big).$$

3.2 Why might you suspect that the sample of 15 law schools in

Table (3.1) was obtained by sampling without replacement,

rather than with replacement?

3.3 The mean GPA for all 82 law schools is 3.13. How does this

compare with the mean GPA for the observed sample of 15

law schools in Table 3.1? Is this difference compatible with

the estimated standard error (2.2)?

3.4 Denote the mean and standard deviation of a set of numbers
$X_1, X_2, \ldots, X_N$ by $\bar{X}$ and S respectively, where

$$\bar{X} = \sum_{j=1}^{N} X_j / N, \qquad S = \Big\{\sum_{j=1}^{N} (X_j - \bar{X})^2 / N\Big\}^{1/2}.$$

(a) A sample $x_1, x_2, \ldots, x_n$ is selected from $X_1, X_2, \ldots, X_N$
by random sampling with replacement. Denote the standard
deviation of the sample average $\bar{x} = \sum_{i=1}^{n} x_i / n$,
usually called the standard error of $\bar{x}$, by se($\bar{x}$). Use a
basic result of probability theory to show that

$$\mathrm{se}(\bar{x}) = \frac{S}{\sqrt{n}}.$$

(b) † Suppose instead that $x_1, x_2, \ldots, x_n$ is selected by
random sampling without replacement (so we must have
n < N), show that

$$\mathrm{se}(\bar{x}) = \frac{S}{\sqrt{n}} \Big[\frac{N - n}{N - 1}\Big]^{1/2}.$$

Sampling without replacement therefore gives a
smaller standard error for $\bar{x}$. Proportionally how much
smaller will it be in the case of the law school data?

3.5 Given a random sample $x_1, x_2, \ldots, x_n$, the empirical probability
of a set A is defined to be the proportion of the sample
in A, written

$$\widehat{\mathrm{Prob}}\{A\} = \#\{x_i \in A\}/n. \tag{3.30}$$

(a) Find $\widehat{\mathrm{Prob}}\{A\}$ for the data in Table 3.1, with A as
given in (3.9).

(b) The standard error of an empirical probability is
$[\mathrm{Prob}\{A\} \cdot (1 - \mathrm{Prob}\{A\})/n]^{1/2}$. How many standard errors
is $\widehat{\mathrm{Prob}}\{A\}$ from Prob{A}, given in (3.10)?

3.6 A very simple probability distribution F puts probability on

only two outcomes, 0 or 1, with frequencies

$$f_0 = 1 - p, \qquad f_1 = p. \tag{3.31}$$

This is called the Bernoulli distribution. Here p is a number
between 0 and 1. If $x_1, \ldots, x_n$ is a random sample from F,
then elementary probability theory tells us that the sum

$$s = x_1 + x_2 + \cdots + x_n \tag{3.32}$$

has the binomial distribution (3.5),

$$s \sim \mathrm{Bi}(n, p). \tag{3.33}$$

(a) Show that the empirical probability (3.30) satisfies

$$n \cdot \widehat{\mathrm{Prob}}\{A\} \sim \mathrm{Bi}(n, \mathrm{Prob}\{A\}). \tag{3.34}$$

(Expression (3.34) can also be written as
$\widehat{\mathrm{Prob}}\{A\} \sim \mathrm{Bi}(n, \mathrm{Prob}\{A\})/n$.)

(b) Prove that if x ~ Bi(n,p), then E(x) = np.

3.7 Without using calculus, give a symmetry argument to show
that E(x) = $\mu$ for $x \sim N(\mu, \sigma^2)$.

3.8 Suppose that y and z are independent random variables,
with variances $\sigma_y^2$ and $\sigma_z^2$.

(a) Show that the variance of y + z is the sum of the
variances

$$\sigma_{y+z}^2 = \sigma_y^2 + \sigma_z^2. \tag{3.35}$$

(In general, the variance of the sum is the sum of the variances
for independent random variables $x_1, x_2, \ldots, x_n$.)

(b) Suppose $F \to (x_1, x_2, \ldots, x_n)$ where the probability
distribution F has expectation $\mu$ and variance $\sigma^2$. Show
that $\bar{x}$ has expectation $\mu$ and variance $\sigma^2/n$.

3.9 Use the results in Problems (3.6) and (3.8) to show that
$\sigma^2 = np(1 - p)$ for x ~ Bi(n,p).

3.10 Forty-three of the 82 points in Table 3.2 have LSAT < 600;

17 of the 82 points have GPA < 3.0. Why do we know that

LSAT and GPA are not independent?

3.11 In the discussion of random sampling, $j_1, j_2, \ldots, j_n$ were
taken to be independent integers having a uniform distribution
on the numbers 1, 2, ..., N. That is, $j_1, j_2, \ldots, j_n$ is
itself a random sample, say

$$F_{1:N} \to (j_1, j_2, \ldots, j_n), \tag{3.36}$$

where $F_{1:N}$ is the discrete distribution having frequencies
$f_j = 1/N$, for j = 1, 2, ..., N. In practice, we depend on

our computer’s random number generator to give us (3.36).

If (3.36) holds, then a random sample as defined in this

chapter has the “i.i.d.” property defined in (3.29). Give a

brief argument why this is so.

† Indicates a difficult or more advanced problem.

Page 48

CHAPTER 4

The empirical distribution

function and the plug-in

principle

4.1 Introduction

Problems of statistical inference often involve estimating some aspect
of a probability distribution F on the basis of a random sample
drawn from F. The empirical distribution function, which we will
call $\hat{F}$, is a simple estimate of the entire distribution F. An obvious
way to estimate some interesting aspect of F, like its mean
or median or correlation, is to use the corresponding aspect of $\hat{F}$.
This is the “plug-in principle.” The bootstrap method is a direct
application of the plug-in principle, as we shall see in Chapter 6.

4.2 The empirical distribution function

Having observed a random sample of size n from a probability
distribution F,

$$F \to (x_1, x_2, \ldots, x_n), \tag{4.1}$$

the empirical distribution function $\hat{F}$ is defined to be the discrete
distribution that puts probability 1/n on each value $x_i$, i =
1, 2, ..., n. In other words, $\hat{F}$ assigns to a set A in the sample space
of x its empirical probability

$$\widehat{\mathrm{Prob}}\{A\} = \#\{x_i \in A\}/n, \tag{4.2}$$

the proportion of the observed sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ occurring
in A. We will also write $\mathrm{Prob}_{\hat{F}}\{A\}$ to indicate (4.2). The
hat symbol “^” always indicates quantities calculated from the
observed data.

Table 4.1. A random sample of 100 rolls of the die. The outcomes
1,2,3,4,5,6 occurred 13,19,10,17,14,27 times, respectively, so the empirical
distribution is (.13, .19, .10, .17, .14, .27).

6 3 2 4 6 6 6 5 3 6 2 2 6 2 3 1 5 1

6 6 4 1 5

3 6 6 4 1 4 2 5 6 6 5

6 2 6 6 1 4 1 5 6 1 6 3 3 2 2 2 5

2 4 1 4 5 6 6 6 2 2 4 6 1 2 2 2 5

5 3 5 4 2 1 4 6 6 5 6 4 6 4 3 6 4 1

4 5 4 4 2 3 2 1 4 6

Consider the law school sample of size n = 15, shown in Table 3.1
and in the right panel of Figure 3.1. The empirical distribution $\hat{F}$
puts probability 1/15 on each of the 15 data points. Five of the 15
points lie in the set A = {(y, z) : 0 < y < 600, 0 < z < 3.00},
so $\widehat{\mathrm{Prob}}\{A\}$ = 5/15 = .333. Notice that we get a different empirical
probability for the set {0 < y < 600, 0 < z ≤ 3.00}, since one of
the 15 data points has GPA = 3.00, LSAT < 600.

Table 4.1 shows a random sample of n = 100 rolls of a die:
$x_1 = 6, x_2 = 3, x_3 = 2, \ldots, x_{100} = 6$. The empirical distribution $\hat{F}$
puts probability 1/100 on each of the 100 outcomes. In cases like
this, where there are repeated values, we can express $\hat{F}$ more economically
as the vector of observed frequencies $\hat{f}_k$, k = 1, 2, ..., 6,

$$\hat{f}_k = \#\{x_i = k\}/n. \tag{4.3}$$

For the data in Table 4.1, $\hat{F}$ = (.13, .19, .10, .17, .14, .27).
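The passage from counts to the empirical distribution (4.3), and from there to an empirical probability (4.2), takes only a few lines of code. The sketch below starts from the counts given in the caption of Table 4.1; the set of even outcomes is our own illustrative choice of A, and the names `f_hat` and `empirical_prob` are hypothetical.

```python
# Counts of the outcomes 1..6, as given in the caption of Table 4.1.
counts = {1: 13, 2: 19, 3: 10, 4: 17, 5: 14, 6: 27}

n = sum(counts.values())                       # sample size, 100
f_hat = {k: c / n for k, c in counts.items()}  # observed frequencies, as in (4.3)

def empirical_prob(A):
    """Empirical probability of a set A of outcomes: the sum of f_hat_k over k in A."""
    return sum(f_hat[k] for k in A if k in f_hat)

p_even = empirical_prob({2, 4, 6})   # illustrative set: the even outcomes
print(f_hat, p_even)
```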

An empirical distribution is a list of the values taken on by the
sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, along with the proportion of times
each value occurs. Often each value occurring in the sample appears
only once, as with the law data. Repetitions, as with the die of
Table 4.1, allow the list to be shortened. In either case each of
the n data points is assigned probability 1/n by the empirical
distribution.

Is it obvious that we have not lost information in going from the
full data set $(x_1, x_2, \ldots, x_{100})$ in Table 4.1 to the reduced representation
in terms of the frequencies? No, but it is true. It can be
proved that the vector of observed frequencies $\hat{F} = (\hat{f}_1, \hat{f}_2, \ldots)$ is
a sufficient statistic for the true distribution $F = (f_1, f_2, \ldots)$. This
means that all of the information about F contained in x is also
contained in $\hat{F}$.

Table 4.2. Rainfall data. The yearly rainfall, in inches, in Nevada City,
California, 1873 through 1978. An example of time series data.

1870   80 40 65 46 68 32 58
1880   61 60 45 48 63 44 66 39 35
1890   44 104 36 45 69 50 72 57 53 30
1900   56 55 46 46 72 50 68 71 37
1910   46 69 31 33 61 56 55 40 37
1920   34 60 54 52 20 49 43 62 44
1930   45 30 53 32 38 56 63 52 79
1940   62 75 70 60 34 54 51 35 53
1950   53 73 80 54 52 40 77 52 75
1960   43 39 54 70 40 73 41 75 43
1970   60 59 41 67 83 56 29 21

The sufficiency theorem assumes that the data have been generated
by random sampling from some distribution F. This is certainly
not always true. For example the mouse data of Table 2.1
involve two probability distributions, one for Treatment and one for
Control. Table 4.2 shows a time series of 106 numbers: the annual
rainfall in Nevada City, California from 1873 through 1978. We
could calculate the empirical distribution $\hat{F}$ for this data set, but
it would not include any of the time series information, for example,
whether high numbers follow high numbers. Later, in Chapter 8, we will
see how to apply bootstrap methods to situations like the rainfall
data. For now we are restricting attention to data obtained by random
sampling from a single distribution, the so-called one-sample

situation. This is not as restrictive as it sounds. In the mouse data

example, for instance, we can apply one-sample results separately

to the Treatment and Control populations.

In applying statistical theory to real problems, the answers to

questions of interest are usually phrased in terms of probability

distributions. We might ask if the die giving the data in Table 4.1
is fair. This is equivalent to asking if the die’s probability distribution
F equals (1/6,1/6,1/6,1/6,1/6,1/6). In the law school example,
the question might be how correlated are LSAT and GPA. In
terms of F, the distribution of x = (y,z) = (LSAT, GPA), this is
a question about the value of the population correlation coefficient

$$\mathrm{corr}(y, z) = \frac{\sum_{j=1}^{82} (Y_j - \mu_y)(Z_j - \mu_z)}{[\sum_{j=1}^{82} (Y_j - \mu_y)^2 \sum_{j=1}^{82} (Z_j - \mu_z)^2]^{1/2}}, \tag{4.4}$$

where $(Y_j, Z_j)$ is the jth point in the law school population $\mathcal{X}$, and
$\mu_y = \sum_{j=1}^{82} Y_j/82$, $\mu_z = \sum_{j=1}^{82} Z_j/82$.

When the probability distribution F is known (i.e. when we have

a complete census of the population X), answering such questions

involves no more than arithmetic. For the law school population,

the census in Table 3.2 gives $\mu_y = 597.5$, $\mu_z = 3.13$, and

$$\mathrm{corr}(y, z) = .761. \tag{4.5}$$

This is the original definition of “statistics.” Usually we don’t have
a census. Then we need statistical inference, the more modern statistical
theory for inferring properties of F from a random sample x.

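If we had available only the law school sample of size 15, Table 3.1,
we could estimate corr(y,z) by the sample correlation coefficient

$$\widehat{\mathrm{corr}}(y, z) = \frac{\sum_{i=1}^{15} (y_i - \hat{\mu}_y)(z_i - \hat{\mu}_z)}{[\sum_{i=1}^{15} (y_i - \hat{\mu}_y)^2 \sum_{i=1}^{15} (z_i - \hat{\mu}_z)^2]^{1/2}} \tag{4.6}$$

where $(y_i, z_i)$ is the ith point in Table 3.1, i = 1, 2, ..., 15, and
$\hat{\mu}_y = \sum_{i=1}^{15} y_i/15$, $\hat{\mu}_z = \sum_{i=1}^{15} z_i/15$. Table 3.1 gives $\hat{\mu}_y = 600.3$,
$\hat{\mu}_z = 3.09$, and

$$\widehat{\mathrm{corr}}(y, z) = .776. \tag{4.7}$$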

Here is another example of a plug-in estimate. Suppose we are
interested in estimating the probability of a LSAT score greater
than 600, that is

$$\theta = \frac{1}{82} \sum_{j=1}^{82} I_{\{Y_j > 600\}}. \tag{4.8}$$

Since 39 of the 82 LSAT scores exceed 600, $\theta$ = 39/82 = 0.48. The
plug-in estimate of $\theta$ is

$$\hat{\theta} = \frac{1}{15} \sum_{i=1}^{15} I_{\{y_i > 600\}}, \tag{4.9}$$

the sample proportion of LSAT scores above 600. Six of the 15
LSAT scores exceed 600, so $\hat{\theta}$ = 6/15 = 0.4.

For the die of Table 4.1, we don’t have census data but only the

sample x, so any questions about the fairness of the die must be

answered by inference from the empirical frequencies

$$\hat{F} = (\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_6) = (.13, .19, .10, .17, .14, .27). \tag{4.10}$$

Discussions of statistical inference are phrased in terms of parameters
and statistics. A parameter is a function of the probability
distribution F. A statistic is a function of the sample x. Thus
corr(y,z), (4.4), is a parameter of F, while $\widehat{\mathrm{corr}}(y, z)$, (4.6), is a
statistic based on x. Similarly $f_k$ is a parameter of F in the die
example, while $\hat{f}_k$ is a statistic, k = 1, 2, 3, ..., 6.

We will sometimes write parameters directly as functions of F,

say

$$\theta = t(F). \tag{4.11}$$

This notation emphasizes that the value $\theta$ of the parameter is obtained
by applying some numerical evaluation procedure t(·) to the
distribution function F. For example if F is a probability distribution
on the real line, the expectation can be thought of as the
parameter

$$\theta = t(F) = E_F(x). \tag{4.12}$$

Here t(F) gives $\theta$ by the expectation process, that is, the average
value of x weighted according to F. For a given distribution F such
as F = Bi(n,p) we can evaluate t(F) = np. Even if F is unknown,
the form of t(F) tells us the functional mapping that inputs F and
outputs $\theta$.

4.3 The plug-in principle

The plug-in principle is a simple method of estimating parameters

from samples. The plug-in estimate of a parameter $\theta = t(F)$ is
defined to be

$$\hat{\theta} = t(\hat{F}). \tag{4.13}$$

In other words, we estimate the function $\theta = t(F)$ of the probability
distribution F by the same function of the empirical distribution
$\hat{F}$, $\hat{\theta} = t(\hat{F})$. (Statistics like (4.13) that are used to estimate parameters
are sometimes called summary statistics, as well as estimates
and estimators.)
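One way to make (4.13) concrete is to write the functional t(·) as a procedure that accepts any discrete distribution, given as a list of values and a list of probabilities, and then to hand it the empirical distribution. The sketch below is one possible arrangement, not the only one; the helper names `expectation`, `prob_above` and `plug_in`, and the five-point sample, are hypothetical.

```python
def expectation(values, probs):
    """t(F) = E_F(x) for a discrete distribution with the given values and probabilities."""
    return sum(p * v for v, p in zip(values, probs))

def prob_above(threshold):
    """Build the functional t(F) = Prob_F{x > threshold}."""
    def t(values, probs):
        return sum(p for v, p in zip(values, probs) if v > threshold)
    return t

def plug_in(t, sample):
    """Evaluate t at the empirical distribution: probability 1/n on each observed value."""
    n = len(sample)
    return t(sample, [1.0 / n] * n)

sample = [3.0, 1.0, 4.0, 1.0, 5.0]            # hypothetical data
theta_hat_mean = plug_in(expectation, sample)
theta_hat_prob = plug_in(prob_above(2.5), sample)
print(theta_hat_mean, theta_hat_prob)
```

The same function t could equally be applied to a known population distribution, which is exactly the sense in which the parameter and its plug-in estimate share one definition.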

We have already used the plug-in principle in estimating $f_k$ by
$\hat{f}_k$, and in estimating corr(y,z) by $\widehat{\mathrm{corr}}(y, z)$. To see this, note that
our law school population F can be written as $F = (f_1, f_2, \ldots, f_{82})$
where each $f_j$, the probability of the jth law school, has value 1/82.
This is the probability distribution on $\mathcal{X}$, the 82 law school pairs.
The population correlation coefficient can be written as

$$\mathrm{corr}(y, z) = \frac{\sum_{j=1}^{82} f_j (Y_j - \mu_y)(Z_j - \mu_z)}{[\sum_{j=1}^{82} f_j (Y_j - \mu_y)^2 \sum_{j=1}^{82} f_j (Z_j - \mu_z)^2]^{1/2}}, \tag{4.14}$$

where

$$\mu_y = \sum_{j=1}^{82} f_j Y_j, \qquad \mu_z = \sum_{j=1}^{82} f_j Z_j. \tag{4.15}$$

Setting each $f_j = 1/82$ gives expression (4.4). Now for our sample
$(x_1, x_2, \ldots, x_{15})$, the sample frequency $\hat{f}_j$ is the proportion of sample
points equal to $X_j$:

$$\hat{f}_j = \#\{x_i = X_j\}/15, \quad j = 1, 2, \ldots, 82. \tag{4.16}$$

For the sample of Table 3.1, $\hat{f}_1 = 0$, $\hat{f}_2 = 0$, $\hat{f}_3 = 0$, $\hat{f}_4 = 1/15$ etc.
Now plugging these values $\hat{f}_j$ into expressions (4.15) and (4.14)
gives $\hat{\mu}_y$, $\hat{\mu}_z$ and $\widehat{\mathrm{corr}}(y, z)$ respectively. That is, $\hat{\mu}_y$, $\hat{\mu}_z$ and $\widehat{\mathrm{corr}}(y, z)$
are plug-in estimates of $\mu_y$, $\mu_z$ and corr(y,z).

In general, the plug-in estimate of an expectation $\theta = E_F(x)$ is

$$\hat{\theta} = E_{\hat{F}}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}. \tag{4.17}$$

How good is the plug-in principle? It is usually quite good, if

the only available information about F comes from the sample

x. Under this circumstance $\hat{\theta} = t(\hat{F})$ cannot be improved upon
as an estimator of $\theta = t(F)$, at least not in the usual asymptotic
($n \to \infty$) sense of statistical theory. For example if $\hat{f}_k$ is the plug-in
frequency estimate $\#\{x_i = k\}/n$, then

$$\hat{f}_k \sim \mathrm{Bi}(n, f_k)/n \tag{4.18}$$

as in Problem 3.6. In this case the estimator $\hat{f}_k$ is unbiased for
$f_k$, $E(\hat{f}_k) = f_k$, with variance $f_k(1 - f_k)/n$. This is the smallest
possible variance for an unbiased estimator of $f_k$.
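The claims about the mean and variance of the plug-in frequency can be examined by simulation. The sketch below repeatedly generates samples of size n = 100 in which the outcome k occurs with probability 1/6, as it would for a fair die, and records the resulting plug-in frequency each time. The number of repetitions and the seed are arbitrary choices of ours, and the function name `simulate_f_hat` is hypothetical.

```python
import random

def simulate_f_hat(f_k, n, reps, rng):
    """Return reps simulated values of the plug-in frequency #{x_i = k}/n."""
    # Each observation equals k with probability f_k, independently of the others.
    return [sum(rng.random() < f_k for _ in range(n)) / n for _ in range(reps)]

f_k, n = 1 / 6, 100
values = simulate_f_hat(f_k, n, 5000, random.Random(0))

sim_mean = sum(values) / len(values)
sim_var = sum((v - sim_mean) ** 2 for v in values) / len(values)
theory_var = f_k * (1 - f_k) / n   # the variance stated after (4.18)
print(sim_mean, f_k)
print(sim_var, theory_var)
```

The simulated mean should fall close to 1/6 and the simulated variance close to the theoretical value, with small discrepancies that shrink as the number of repetitions grows.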

We will use the bootstrap to study the bias and standard error
of the plug-in estimate $\hat{\theta} = t(\hat{F})$. The bootstrap’s virtue is that
it produces biases and standard errors in an automatic way, no
matter how complicated the functional mapping $\theta = t(F)$ may be.

We will see that the bootstrap itself is an application of the plug-in

principle.

The plug-in principle is less good in situations where there is

information about F other than that provided by the sample x. We

might know, or assume, that F is a member of a parametric family,

like the family of multivariate normal distributions. Or we might

be in a regression situation, where we have available a collection
of random samples x(z) depending on a predictor variable z. Then
even if we are only interested in $F_{z_0}$, the distribution function for
some specific value $z_0$ of z, there may be information about $F_{z_0}$
in the other samples x(z), especially those for which z is near $z_0$.
Regression models are discussed in Chapters 7 and 9.

The plug-in principle and the bootstrap can be adapted to parametric
families and to regression models. See Section 6.5 of Chapter

6 and Chapter 9. For the next few chapters we assume that we are

in the situation where we have only the one random sample x from

a completely unknown distribution F. This is called the one-sample

nonparametric setup.

4.4 Problems

4.1 Say carefully why the plug-in estimate of the expectation of
a real-valued random variable is $\bar{x}$, the sample average.

4.2 We would like to estimate the variance $\sigma^2$ of a real-valued random
variable x, having observed a random sample
$x_1, x_2, \ldots, x_n$. What is the plug-in estimate of $\sigma^2$?

4.3 (a) Show that the standard error of an empirical frequency
$\hat{f}_k$ is $\sqrt{f_k(1 - f_k)/n}$. (You can use the result in problem
3.5b.)

(b) Do you believe that the die used to generate Table 4.1

is fair?

4.4 Suppose a random variable x has possible values 1, 2, 3, ....
Let A be a subset of the positive integers.

(a) Show that $\widehat{\mathrm{Prob}}\{A\} = \sum_{k \in A} \hat{f}_k$.

(b) Compare problems 4.3a and 3.5b, and conclude that
the observed frequencies $\hat{f}_k$ are not independent of each
other.

Page 56

CHAPTER 5

Standard errors and estimated

standard errors

5.1 Introduction

Summary statistics such as $\hat{\theta} = t(\hat{F})$ are often the first outputs of
a data analysis. The next thing we want to know is the accuracy of
$\hat{\theta}$. The bootstrap provides accuracy estimates by using the plug-in

principle to estimate the standard error of a summary statistic.

This is the subject of Chapter 6. First we will discuss estimation

of the standard error of a mean, where the plug-in principle can

be carried out explicitly.

5.2 The standard error of a mean

Suppose that x is a real-valued random variable with probability
distribution F. Let us denote the expectation and variance of F
by the symbols $\mu_F$ and $\sigma_F^2$ respectively,

$$\mu_F = E_F(x), \qquad \sigma_F^2 = \mathrm{var}_F(x) = E_F[(x - \mu_F)^2]. \tag{5.1}$$

These are the quantities called $\mu_x$ and $\sigma_x^2$ in Chapter 3. Here
we are emphasizing the dependence on F. The alternative notation
“$\mathrm{var}_F(x)$” for the variance, sometimes abbreviated to var(x),
means the same thing as $\sigma_F^2$. In what follows we will sometimes
write

$$x \sim (\mu_F, \sigma_F^2) \tag{5.2}$$

to indicate concisely the expectation and variance of x.

Now let $(x_1, \ldots, x_n)$ be a random sample of size n from the distribution
F. The mean of the sample $\bar{x} = \sum_{i=1}^{n} x_i/n$ has expectation
$\mu_F$ and variance $\sigma_F^2/n$,

$$\bar{x} \sim (\mu_F, \sigma_F^2/n). \tag{5.3}$$

In other words, the expectation of $\bar{x}$ is the same as the expectation
of a single x, but the variance of $\bar{x}$ is 1/n times the variance of x.
See Problem 3.8b. This is the reason for taking averages: the larger
n is, the smaller var($\bar{x}$) is, so bigger n means a better estimate of $\mu_F$.

The standard error of the mean $\bar{x}$, written $\mathrm{se}_F(\bar{x})$ or se($\bar{x}$), is the
square root of the variance of $\bar{x}$,

$$\mathrm{se}_F(\bar{x}) = [\mathrm{var}_F(\bar{x})]^{1/2} = \sigma_F/\sqrt{n}. \tag{5.4}$$

Standard error is a general term for the standard deviation of a
summary statistic.¹ Standard errors are the most common way of indicating
statistical accuracy. Roughly speaking, we expect $\bar{x}$ to be less than
one standard error away from $\mu_F$ about 68% of the time, and less
than two standard errors away from $\mu_F$ about 95% of the time.

These percentages are based on the central limit theorem. Under
quite general conditions on F, the distribution of $\bar{x}$ will be
approximately normal as n gets large, which we can write as

$$\bar{x} \mathrel{\dot\sim} N(\mu_F, \sigma_F^2/n). \tag{5.5}$$

The expectation $\mu_F$ and variance $\sigma_F^2/n$ in (5.5) are exact, only the
normality being approximate. Using (5.5), a table of the normal
distribution gives

$$\mathrm{Prob}\{|\bar{x} - \mu_F| < \sigma_F/\sqrt{n}\} \doteq .683, \qquad \mathrm{Prob}\{|\bar{x} - \mu_F| < 2\sigma_F/\sqrt{n}\} \doteq .954, \tag{5.6}$$

as illustrated in Figure 5.1. One of the advantages of the bootstrap
is that we do not have to rely entirely on the central limit
theorem. Later we will see how to get accuracy statements like
(5.6) directly from the data (see Chapters 12-14 on bootstrap confidence
intervals). It will then be clear that (5.6), which is correct

for large values of n, can sometimes be quite inaccurate for the

sample size actually available. Keeping this in mind, it is still true

that the standard error of an estimate usually gives a good idea of

its accuracy.
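The percentages in (5.6) can be checked by simulation for any distribution F whose mean and standard deviation are known. The sketch below uses the uniform distribution on [0, 1], which has mean 1/2 and variance 1/12; this distribution, the sample size n = 30, the number of repetitions and the seed are all our own illustrative assumptions.

```python
import random
from math import sqrt

def coverage(n, reps, rng):
    """Fractions of simulated sample means within 1 and within 2 standard errors of mu_F."""
    mu, sigma = 0.5, sqrt(1 / 12)   # mean and sd of the uniform distribution on [0, 1]
    se = sigma / sqrt(n)            # standard error of the mean, formula (5.4)
    within1 = within2 = 0
    for _ in range(reps):
        xbar = sum(rng.random() for _ in range(n)) / n
        within1 += abs(xbar - mu) < se
        within2 += abs(xbar - mu) < 2 * se
    return within1 / reps, within2 / reps

c1, c2 = coverage(n=30, reps=20_000, rng=random.Random(0))
print(c1, c2)   # to be compared with .683 and .954 in (5.6)
```

For a markedly skewed F or a very small n the agreement may be noticeably worse, which is the limitation discussed next.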

¹ In some books, the term “standard error” is used to denote an estimated
standard deviation, that is, an estimate of $\sigma_F$ based on the data. That
differs from our usage of the term.

Figure 5.1. For large values of n, the mean $\bar{x}$ of a random sample from F
will have an approximate normal distribution with mean $\mu_F$ and variance
$\sigma_F^2/n$.

A simple example shows the limitations of the central limit theorem
approximation. Suppose that F is a distribution that puts
probability on only two outcomes, 0 or 1, as in problem 3.6, say

$$\mathrm{Prob}_F\{x = 1\} = p \quad \text{and} \quad \mathrm{Prob}_F\{x = 0\} = 1 - p. \tag{5.7}$$

Here p is a parameter of F, often called the probability of success,
having a value between 0 and 1. A random sample
$F \to (x_1, x_2, \ldots, x_n)$ can be thought of as n independent flips of a coin
having probability of success (or of “heads”, or of x = 1) equaling
p. Then the sum $s = \sum_{i=1}^{n} x_i$ is the number of successes in n
independent flips of the coin; s has the binomial distribution (3.5),

$$s \sim \mathrm{Bi}(n, p). \tag{5.8}$$

The average $\bar{x} = s/n$ equals $\hat{p}$, the plug-in estimate of p. Distribution
(5.7) has $\mu_F = p$, $\sigma_F^2 = p(1 - p)$, so (5.3) gives

$$\hat{p} \sim (p, \; p(1 - p)/n) \tag{5.9}$$

for the mean and variance of $\hat{p}$. In other words, $\hat{p}$ is an unbiased
estimate of p, $E(\hat{p}) = p$, with standard error

$$\mathrm{se}(\hat{p}) = \Big[\frac{p(1 - p)}{n}\Big]^{1/2}. \tag{5.10}$$

Figure 5.2 shows the central limit theorem working for the binomial
distribution with n = 25, p = .25 and p = .90. (Problem
5.3 says what is actually plotted in Figure 5.2.) The central limit
theorem gives a good approximation to the binomial distribution
for n = 25, p = .25, but is somewhat less good for n = 25, p = .9.

Figure 5.2. Comparison of the binomial distribution with the normal
distribution suggested by the central limit theorem; n = 25, p = .25 and
p = .90. The smooth curves are the normal densities, see problem 5.3;
circles indicate the binomial probabilities (3.5). The approximation is
good for p = .25, but is somewhat off for p = .90.
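The quality of the normal approximation in the two cases can also be compared numerically. The sketch below measures the largest gap between the binomial frequencies (3.5) and the normal density with the same mean np and variance np(1 - p), evaluated at the integers 0, 1, ..., n. It works on the scale of the count s rather than the proportion s/n, so it is a rough stand-in for the comparison drawn in Figure 5.2 and not necessarily the same plot; the function name `max_abs_error` is our own.

```python
from math import comb, exp, pi, sqrt

def max_abs_error(n, p):
    """Largest gap between the Bi(n, p) frequencies and the N(np, np(1-p)) density at k = 0..n."""
    mean, var = n * p, n * p * (1 - p)
    worst = 0.0
    for k in range(n + 1):
        f_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        density = exp(-((k - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)
        worst = max(worst, abs(f_k - density))
    return worst

err_25 = max_abs_error(25, 0.25)
err_90 = max_abs_error(25, 0.90)
print(err_25, err_90)
```

The larger discrepancy for p = .90 reflects the skewness of the binomial distribution when p is near 0 or 1.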

5.3 Estimating the standard error of the mean

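Suppose that we have in hand a random sample of numbers
$F \to x_1, x_2, \ldots, x_n$, such as the n = 9 Control measurements for the
mouse data of Table 2.1. We compute the estimate $\bar{x}$ for the expectation
$\mu_F$, equaling 56.22 for the mouse data, and want to know
the standard error of $\bar{x}$. Formula (5.4), $\mathrm{se}_F(\bar{x}) = \sigma_F/\sqrt{n}$, involves
the unknown distribution F and so cannot be directly used.

At this point we can use the plug-in principle: we substitute $\hat{F}$
for F in the formula $\mathrm{se}_F(\bar{x}) = \sigma_F/\sqrt{n}$. The plug-in estimate of
$\sigma_F = [E_F(x - \mu_F)^2]^{1/2}$ is

$$\hat{\sigma} = \sigma_{\hat{F}} = \Big\{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\Big\}^{1/2}, \tag{5.11}$$

since $\mu_{\hat{F}} = \bar{x}$ and $E_{\hat{F}}\, g(x) = \frac{1}{n} \sum_{i=1}^{n} g(x_i)$ for any function g. This
gives the estimated standard error $\widehat{\mathrm{se}}(\bar{x}) = \mathrm{se}_{\hat{F}}(\bar{x})$,

$$\widehat{\mathrm{se}}(\bar{x}) = \hat{\sigma}/\sqrt{n} = \Big\{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n^2\Big\}^{1/2}. \tag{5.12}$$

For the mouse Control group data, $\widehat{\mathrm{se}}(\bar{x})$ = 13.33.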

Formula (5.12) is slightly different than the usual estimated
standard error (2.2). That is because $\sigma_F$ is usually estimated by
$\bar{\sigma} = \{\sum_{i=1}^{n} (x_i - \bar{x})^2/(n - 1)\}^{1/2}$ rather than by $\hat{\sigma}$, (5.11). Dividing by
n - 1 rather than n makes $\bar{\sigma}^2$ unbiased for $\sigma_F^2$. For most purposes
$\hat{\sigma}$ is just as good as $\bar{\sigma}$ for estimating $\sigma_F$.

Notice that we have used the plug-in principle twice: first to
estimate the expectation $\mu_F$ by $\mu_{\hat{F}} = \bar{x}$, and then to estimate
the standard error $\mathrm{se}_F(\bar{x})$ by $\mathrm{se}_{\hat{F}}(\bar{x})$. The bootstrap estimate of
standard error, which is the subject of Chapter 6, amounts to using
the plug-in principle to estimate the standard error of an arbitrary
statistic $\hat{\theta}$. Here we have seen that if $\hat{\theta} = \bar{x}$, then this approach
leads to (almost) the usual estimate of standard error. As we will
see, the advantage of the bootstrap is that it can be applied to
virtually any statistic $\hat{\theta}$, not just the mean $\bar{x}$.
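The relation between the plug-in standard error (5.12) and the usual estimate that divides by n - 1 can be seen on any small data set. The sketch below uses nine hypothetical measurements (they are not the mouse data of Table 2.1); the function names `plug_in_se` and `usual_se` are our own.

```python
from math import sqrt

def plug_in_se(sample):
    """Plug-in estimate of the standard error of the mean, formula (5.12)."""
    n = len(sample)
    xbar = sum(sample) / n
    return sqrt(sum((x - xbar) ** 2 for x in sample) / n**2)

def usual_se(sample):
    """Usual estimate, dividing the sum of squares by n - 1 before dividing by n."""
    n = len(sample)
    xbar = sum(sample) / n
    return sqrt(sum((x - xbar) ** 2 for x in sample) / (n * (n - 1)))

sample = [12.0, 7.0, 3.0, 15.0, 8.0, 10.0, 4.0, 9.0, 11.0]   # hypothetical data, n = 9
ratio = plug_in_se(sample) / usual_se(sample)
print(plug_in_se(sample), usual_se(sample), ratio)
```

Whatever the data, the ratio of the two estimates equals the square root of (n - 1)/n, about 0.943 when n = 9, so the difference matters little except for very small samples.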

5.4 Problems

5.1 Formula (5.4) exemplifies a general statistical truth: most
estimates of unknown quantities improve at a rate proportional
to the square root of the sample size. Suppose that it
were necessary to know $\mu_F$ for the mouse Control group with

a standard error of no more than 3 days. How many more

Control mice should be sampled?

5.2 State clearly why $\hat{p} = s/n$ is the plug-in estimate of p for the

binomial situation (5.8).

5.3 Figure 5.2 compares the function

for

x = 0,1/25,2/25,-- -,1

with

1 _____ 1

n '

exP{- o

y/2 irp(l — p)/n

2 i^/np(l - p)

x — np i 2

} for a;€[0,l].