SCIENTIFIC INFERENCE
SIMON VAUGHAN
University of Leicester
University Printing House, Cambridge CB2 8BS, United Kingdom
www.cambridge.org
Information on this title: www.cambridge.org/9781107607590
© S. Vaughan 2013
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2013
Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication data
Vaughan, Simon, 1976– author.
Scientific inference : learning from data / Simon Vaughan.
pages cm
Includes bibliographical references and index.
ISBN 978-1-107-02482-3 (hardback) – ISBN 978-1-107-60759-0 (paperback)
1. Mathematical statistics – Textbooks. I. Title.
QA276.V34 2013
519.5 – dc23 2013021427
ISBN 978-1-107-02482-3 Hardback
ISBN 978-1-107-60759-0 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
For my family
For the student

Science is not about certainty, it is about dealing rigorously with uncertainty. The
tools for this are statistical. Statistics and data analysis are therefore an essential
part of the scientific method and modern scientific practice, yet most students of
physical science get little explicit training in statistical practice beyond basic error
handling. The aim of this book is to provide the student with both the knowledge and
the practical experience to begin analysing new scientific data, to allow progress
to more advanced methods and to gain a more statistically literate approach to
interpreting the constant flow of data provided by modern life.
More specifically, if you work through the book you should be able to accomplish
the following.
• Explain aspects of the scientific method, types of logical reasoning and data analysis, and be able to critically analyse statistical and scientific arguments.
• Calculate and interpret common quantitative and graphical statistical summaries.
• Use and interpret the results of common statistical tests for difference and association, and straight line fitting.
• Use the calculus of probability to manipulate basic probability functions.
• Apply and interpret model fitting, using e.g. least squares, maximum likelihood.
• Evaluate and interpret confidence intervals and significance tests.
Students have asked me whether this is a book about statistics or data analysis or
statistical computing. My answer is that they are so closely connected it is difficult
to untangle them, and so this book covers areas of all three.
The skills and arguments discussed in the book are highly transferable: statistical
presentations of data are used throughout science, business, medicine, politics and
the news media. An awareness of the basic methods involved will better enable you
to use and critically analyse such presentations – this is sometimes called statistical
literacy.
In order to understand the book, you need to be familiar with the mathematical
methods usually taught in the first year of a physics, engineering or chemistry
degree (differential and integral calculus, basic matrix algebra), but this book is
designed so that the probability and statistics content is entirely self-contained.
For the instructor
This book was written because I could not find a suitable textbook to use as the
basis of an undergraduate course on scientific inference, statistics and data analysis.
Although there are good books on different aspects of introductory statistics, those
intended for physicists seem to target a post-graduate audience and cover either
too much material or too much detail for an undergraduate-level first course. By
contrast, the ‘Intro to stats’ books aimed at a broader audience (e.g. biologists,
social scientists, medics) tend to cover topics that are not so directly applicable
for physical scientists. And the books aimed at mathematics students are usually
written in a style that is inaccessible to most physics students, or in a recipe-book
style (aimed at science students) that provides ready-made solutions to common
problems but develops little understanding along the way.
This book is different. It focuses on explaining and developing the practice and
understanding of basic statistical analysis, concentrating on a few core ideas that
underpin statistical and data analysis, such as the visual display of information,
modelling using the likelihood function, and simulating random data. Key con-
cepts are developed using several approaches: verbal exposition in the main text,
graphical explanations, case studies drawn from some of history’s great physics
experiments, and example computer code to perform the necessary calculations.1
The result is that, after following all these approaches, the student should both
understand the ideas behind statistical methods and have experience in applying
them in practice.
The book is intended for use as a textbook for an introductory course on data
analysis and statistics (with a bias towards students in physics) or as self-study
companion for professionals and graduate students. The book assumes familiarity with calculus and linear algebra, but no previous exposure to probability or statistics.

1 These are based on R, a freely available software package for data analysis and statistics, and one used in many statistics textbooks.
The emphasis on a few core ideas and their practical applications means that
some subjects usually covered in introductory statistics texts are given little or
no treatment here. Rigorous mathematical proofs are not covered – the interested
reader can easily consult any good reference work on probability theory or math-
ematical statistics to check these. In addition, we do not cover some topics of
‘classical’ statistics that are dealt with in other introductory works. These topics
include
• more advanced distribution functions (beta, gamma, multinomial, . . . )
• ANOVA and the generalised linear model
• characteristic functions and the theory of moments
• decision and information theories
• non-parametric tests
• experimental design
• time series analysis
• multivariate analysis (principal components, clustering, . . . )
• survival analysis
• spatial data analysis.
Upon completion of this book the student should be in a much better position to
understand any of these topics from any number of more advanced or comprehen-
sive texts.
Perhaps the ‘elephant in the room’ question is: what about Bayesian methods?
Unfortunately, owing to practical limitations there was not room to include full
chapters developing Bayesian methods. I hope I have designed the book in such a
way that it is not wholly frequentist or Bayesian. The emphasis on model fitting
using the likelihood function (Chapter 6) could be seen as the first step towards a
Bayesian analysis (i.e. implicitly using flat priors and working towards the posterior
mode). Fortunately, there are many good books on Bayesian data analysis that can
then be used to develop Bayesian ideas explicitly. I would recommend Gelman et al.
(2003) generally and Sivia and Skilling (2006) or Gregory (2005) for physicists in
particular. Albert (2007) also gives a nice ‘learn as you compute’ introduction to
Bayesian methods using R.
1
Science and statistical data analysis
Why should a scientist bother with statistics? Because science is about dealing
rigorously with uncertainty, and the tools to accomplish this are statistical. Statistics
and data analysis are an indispensable part of modern science.
In scientific work we look for relationships between phenomena, and try to
uncover the underlying patterns or laws. But science is not just an ‘armchair’ activ-
ity where we can make progress by pure thought. Our ideas about the workings
of the world must somehow be connected to what actually goes on in the world.
Scientists perform experiments and make observations to look for new connec-
tions, test ideas, estimate quantities or identify qualities of phenomena. However,
experimental data are never perfect. Statistical data analysis is the set of tools that
helps scientists handle the limitations and uncertainties that always come with data.
The purpose of statistical data analysis is insight not just numbers. (That’s why
the book is called Scientific Inference and not something more like Statistics for
Physics.)
1 The terms ‘hypothesis’, ‘model’ and ‘theory’ have slightly different meanings but are often used interchangeably in casual discussions. A theory is usually a reasonably comprehensive, abstract framework (of definitions, assumptions and relations or equations) for describing generally a set of phenomena, one that has been tested and has found at least some degree of acceptance. Examples of scientific theories are classical mechanics, thermodynamics, germ theory, the kinetic theory of gases, plate tectonics etc. A model is usually more specific. It might be the application of a theory to a particular situation, e.g. a classical mechanics model of the orbit of Jupiter.
Hypotheses may come from some more general theory, or may be more ad hoc,
based on intuition or guesswork about the way some phenomenon might work.
Experiments or observations of the phenomenon can be made, and the results com-
pared with the predictions of the hypothesis. This comparison allows one to test
the model and/or estimate any unknown parameters. Any mismatch between data
and model predictions, or other unpredicted findings in the data, may suggest ways
to revise or change the model. This process of learning about hypotheses from data
is scientific inference. One may enter the cycle at any point: by proposing a model,
making predictions from an existing model, collecting data on some phenomenon
or using data to test a model or estimate some of its parameters. In many areas of
modern science, the different aspects have become so specialised that few, if any,
researchers practice all of these activities (from theory to experiment and back),
but all scientists need an appreciation of the other steps in order to understand the
‘big picture’. This book focuses on the induction/inference part of the chain.
1.2 Inference
The process of drawing conclusions based on what is already known is called
inference. There are two types of reasoning process used in inference: deductive
and non-deductive.
theorems, and so on. A theorem2 is something like ‘A ⇒ B’, which simply says
that the truth value of A is transferred to B, but it does not, in and of itself, assert
that A or B are true. If we happen to know that A is indeed true, the theorem tells
us that B must also be true. The box gives a simple proof that there is no largest
prime number, a purely deductive argument that leads to an ineluctable conclusion.
Box 1.1
Deduction example – proof of no largest prime number
• Suppose there is a largest prime number; call this p_N, the Nth prime.
• Make a list of each and every prime number: p_1 = 2, p_2 = 3, p_3 = 5, . . . until p_N.
• Now form a new number q from the product of the N primes in the list, and add one:

$$q = 1 + \prod_{i=1}^{N} p_i = 1 + (p_1 \times p_2 \times \cdots \times p_N) \qquad (1.1)$$

• Dividing q by any prime in the list leaves remainder 1, so q is divisible by none of p_1, p_2, . . . , p_N.
• Therefore q is either itself prime, or divisible by some prime not in the list; either way there must be a prime larger than p_N, contradicting the premise.
The conclusion is unavoidable given the premises. (This type of argument is given
the technical name modus ponens by philosophers of logic.) If some theory is true
we can predict that its consequences must also be true. This applies to probabilistic
as well as deterministic theories. Later on we consider flipping coins, rolling dice,
and other random events. Although we cannot precisely predict the outcome of
2 It is worth noting here that the logical implication used above, e.g. B ⇒ A, does not mean that A can be derived
from B, but only that if B is true then A must also be true, or that the propositions ‘B is true’ and ‘B and A are
both true’ must have the same truth value (both true, or both false).
individual events (they are random!), we can derive frequencies for the various
outcomes in repeated events.
You can see that inductive reasoning does not have the same power as deductive
reasoning: a conclusion arrived at by deductive reasoning is necessarily true if the
premises are true, whereas a conclusion arrived at by inductive reasoning is not
necessarily true, it is based on incomplete information. We cannot deduce (prove)
that the Sun will rise tomorrow, but nevertheless we do have confidence that it
will. We might say that deductive reasoning concerns statements that are either
true or false, whereas inductive reasoning concerns statements whose truth value
is unknown, about which we are better to speak in terms of ‘degree of belief’ or
‘confidence’. Let’s see an example:
Major premise : All monkeys we have studied like grapes
Minor premise : Zippy is a monkey
Conclusion : Zippy likes grapes.
The conclusion is not unavoidable, other conclusions are allowed. There is no
logical contradiction in concluding
Conclusion : Zippy does not like grapes.
But the premises do give us some information. It seems plausible, even probable,
that Zippy likes grapes.
Again the conclusion is not unavoidable, other conclusions are valid. Perhaps
someone else ate the banana. But the original conclusion seems to be in some sense
the simplest of those allowed. This kind of reasoning, from observed data to an
explanation, is used all the time in science.
Induction and abduction are closely related. When we make an inductive infer-
ence from the limited observed data (‘the monkeys in our sample like grapes’) to
unobserved data (‘Zippy likes grapes’) it is as if we implicitly passed through a
theory (‘all monkeys like grapes’) and then deduced the conclusion from this.
If your experiment needs statistics, you ought to have done a better experiment!4
Rutherford probably didn’t say this, or didn’t mean for it to be taken at face value.
Nevertheless, statistician Bradley Efron, about a hundred years later, contrasted this
simplistic view with the challenges of modern science (Efron, 2005):
Rutherford lived in a rich man’s world of scientific experimentation, where nature gen-
erously provided boatloads of data, enough for the law of large numbers to squelch any
noise. Nature has gotten more tight-fisted with modern physicists. They are asking harder
questions, ones where the data is thin on the ground, and where efficient inference becomes
a necessity. In short, they have started playing in our ball park.
But it is not just scientists who use (or should use) statistical data analysis. Any
time you have to draw conclusions from data you will make use of these skills.
This is true for particle physics as well as journalism: whether the data form part of your research or come from a medical test you were given, you need to be able to understand and interpret them properly, making inferences using methods built on the same basic principles.
Data reduction This is the process of converting raw data into something more
useful or meaningful to the experimenter: for example, converting the voltage
changes in a particle detector (e.g. a proportional counter) into the records of
the times and energies of individual particle detections. In turn, these may be
further reduced into an energy spectrum for a specific type of particle.
4 The earliest reference to this phrase I can find is Bailey (1967, ch. 2, p. 23).
5 ‘Data’ is the plural of ‘datum’ and means ‘items of information’, although it has now become acceptable to use
‘data’ as a singular mass noun rather like ‘information’.
Exploratory data analysis is all about summarising the data in ways that might
provide clues about their nature, and inferential data analysis is about making
reasonable and justified inferences based on the data and some set of hypotheses.
Figure 1.2 Illustration of the distinct concepts of accuracy and precision as applied
to the positions of ‘shot’ on a target.
produced from our measurement(s). The differences between samples are due to
randomness in the experiment or measurement processes.
6 To a statistician, ‘error’ is a technical term for the discrepancy between what is observed and what is expected.
errors’) in the design and execution of the experiment, but the reality is that such
errors can never be completely eliminated.
It is important to distinguish between accuracy and precision. These two con-
cepts are illustrated in Figure 1.2. Precise data are narrowly spread, whereas accu-
rate data have values that fall (on average) around the true value. Precision is an
indicator of variation within the data and accuracy is a measure of variation between
the data and some ‘true’ value. These apply to direct measurements of simple
quantities and also to more complicated estimates of derived quantities (Chapters 6
and 7).
Univariate data concern only one variable (e.g. the temperature of each star in
a sample).
Bivariate data concern two variables (e.g. the temperatures and luminosities of
stars in a sample). Each data point contains two values, like the coordinates
of a point on a plane.
Multivariate data concern several variables (e.g. temperature, luminosity, dis-
tance etc. of stars). Each data point is a point in an N-dimensional space, or
an N-dimensional vector.
As mentioned previously, there are two main roles that variables play.
Explanatory variables (sometimes known as independent variables) are
manipulated or chosen by the experimenter/observer in order to examine
change in other variables.
Response variables (sometimes known as dependent variables) are observed in
order to examine how they change as a function of the explanatory variables.
For example, if we recorded the voltage across a circuit element as we drive it with
different AC frequencies, the frequency would be the explanatory variable, and
the response variable would be the voltage. Usually the error in the explanatory
variable is far smaller than, and can be neglected by comparison with, the error on
the response variables.
1.7 Language
The technical language used by statisticians can be quite different from that com-
monly used by scientists, and this language barrier is one of the reasons that science
students (and professional researchers!) have such a hard time with statistics books
and papers. Even within disciplines there are disagreements over the meaning and
uses of particular terms.
For example, physicists often say they measure or even determine the value of
some physical quantity. A statistician might call this estimation. Physicists tend
to use words like error and uncertainty interchangeably and rather imprecisely.
In these cases, where conventional statistical language or notation offers a more
precise definition, we shall use it. This is a deliberate choice. By using terminology
and notation more like that of a formal statistics course, and less like that of an
undergraduate laboratory manual, we hope to give the readers more scope for using
and developing their knowledge and skills. It should be easier to understand more
advanced texts on aspects of data analysis or statistics, and understand analyses
from other fields (e.g. biology, medicine).
This means that we do not explicitly make use of the definitions set out in the
Guide to the Expression of Uncertainty in Measurement (GUM, 2008). The doc-
ument (now with revisions and several supplements) is intended to establish an
industrial standard for the expression of uncertainty. Its recommendations included
categorising uncertainty into ‘type A’ (estimated based on statistical treatment of
a sample of data) and ‘type B’ (evaluated by other means), using ‘standard uncer-
tainty’ for the standard deviation of an estimator, ‘coverage factor’ for a multiplier
on the ‘combined standard uncertainty’. And so on. These recommendations may
be valuable within some fields such as metrology, but they are not standard in most
physics laboratories (research or teaching) as of 2013, and are unlikely to be taken
contain examples using the R computing environment for you to work through
yourself. We rely heavily on examples to illustrate the main ideas, and these are
based on real data. The datasets are discussed in Appendix B.
In outline, the rest of the book is organised as follows.
• Chapter 2 discusses numerical and graphical summaries of data, and the basics of exploratory data analysis.
• Chapter 3 introduces some of the basic recipes of statistical analyses, such as looking for difference of the mean, or estimating the gradient of a straight line relationship.
• Chapter 4 introduces the concept of probability, starting with discrete, random events. We then discuss the rules of the probability calculus and develop the theory of random variables.
• Chapter 5 extends the discussion of probability to discuss some of the most frequently encountered distributions (and also mentions, in passing, the central limit theorem).
• Chapter 6 discusses the fitting of simple models to data and the estimation of model parameters.
• Chapter 7 considers the uncertainty on the parameter estimates, and model testing (i.e. comparing predictions of hypotheses to data).
• Chapter 8 discusses Monte Carlo methods, computer simulations of random experiments that can be used to solve difficult statistical problems.
• Appendix A describes how to get started in the computer environment R used in the examples throughout the text.
• Appendix B introduces the data case studies used throughout the text.
• Appendix C provides a refresher on combinations and permutations.
• Appendix D discusses the construction of confidence intervals (extending the discussion from Chapter 7).
• A glossary can be found on p. 217.
• A list of the notation can be found on p. 224.
2
Statistical summaries of data
How should you summarise a dataset? This is what descriptive statistics and
statistical graphics are for. A statistic is just a number computed from a data
sample. Descriptive statistics provide a means for summarising the properties of
a sample of data (many numbers or values) so that the most important results
can be communicated effectively (using few numbers). Numerical and graphical
methods, including descriptive statistics, are used in exploratory data analysis
(EDA) to simplify the uninteresting and reveal the exceptional or unexpected in
data.
2.1 Plotting data
the experimenter. Different plots are suitable depending on the number and type of
the response variable.
Figure 2.1 Histogram of the 100 Michelson speed of light measurements (vertical axis: frequency).
2.2.1 Histogram
One way to simplify univariate data is to produce a histogram. A histogram is
a diagram that uses rectangles to represent frequency, where the areas of each
rectangle are proportional to the frequencies. To produce a histogram one must
first choose the locations of the bins into which the data are to be divided, then one
simply counts the number of data points that fall within each bin. See Figure 2.1
(and R.box 2.1).
A histogram contains less information than the original data – we know how
many data points fell within a particular bin (e.g. the 700–800 bin in Figure 2.1),
but we have lost the information about which points and their exact values. What
we have lost in information we hope to gain in clarity; looking at the histogram it
is clear how the data are distributed, where the ‘central’ value is and how the data
points are spread around it.
R.Box 2.1
Histograms
The R command to produce and plot a histogram is hist(). The following shows
how to produce a basic histogram from Michelson’s data (see Appendix B,
section B.1):
hist(morley$Speed)
We can specify (roughly) how many histogram bins to use by using the breaks
argument, and we can also alter the colour of the histogram and the labels as follows:
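hist(morley$Speed, breaks=20, col="grey",
     xlab="Speed - 299,000 (km/s)",
     main="Michelson data")   # the breaks, col and label values are illustrative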
This hist() command is quite flexible. See the help pages for more information
(type ?hist).
R.Box 2.2
A simple bar chart
There are two simple ways to produce bar charts using R. Let’s illustrate this using the
Rutherford and Geiger data (see Appendix B, section B.2):
Figure 2.2 Bar chart showing the Rutherford and Geiger (1910) data of the fre-
quency of alpha particle decays. The data comprise recordings of scintillations in
7.5 s intervals, over 2608 intervals, and this plot shows the frequency distribution
of scintillations per interval.
xlab="Rate (counts/interval)",
ylab="Frequency", lwd=5)
The first line produces a very basic plot using the type="h" argument. The second
line produces an improved plot with user-defined axis labels, thicker lines/bars and no
box enclosing the data area. An alternative is to use the specialised command
barplot().
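# a minimal sketch using the same (assumed) rate and freq vectors as above
barplot(freq, space=0.5, names.arg=rate,
        xlab="Rate (counts/interval)", ylab="Frequency")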
Here the argument space=0.5 determines the sizes of the gaps between the bars, and
names.arg defines the labels for the x-axis. If the data were categorical, we could
produce a bar chart by setting the names.arg argument to the list of categories.
Figure 2.3 Illustration of the mean as the balance point of a set of weights. The
data are the first 20 of the Michelson data points.
guess of the centre we could instead calculate and quote the mean of the sample,
defined by
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.1)$$

where x_i (i = 1, 2, . . . , n) are the individual data points in the sample and n is the
size of the sample. If x are our data, then x̄ is the conventional symbol for the
sample mean. The sample mean is just the sum of all the data points, divided by
the number of data points. Strictly, this is the arithmetic mean. The mean of the
first 20 Michelson data values is 909 km s−1 :
$$\bar{x} = \frac{1}{20}(850 + 740 + 900 + 1070 + 930 + 850 + \cdots + 960) = 909.$$
One way to view the mean is as the balancing point of the data stretched out
along a line. If we have n equal weights and place them along a line at locations
corresponding to each data point, the mean is the one location along the line where
the weights balance, as illustrated in Figure 2.3.
The mean is not the only way to characterise the centre of a sample. The sample
median is the middle point of the data. If the size of the sample, n, is odd, the
median is the middle value, i.e. the (n + 1)/2th largest value. If n is even, the
median is the mean of the middle two values (the n/2th and n/2 + 1th ordered
values). The median has the sometimes desirable property that it is not so easily
swayed by a few extreme points. A single outlying point in a dataset could have a
dramatic effect on the sample mean, but for moderately large n one outlier will have
little effect on the median. The median of the first 20 light speed measurements is
940 km s−1 , which is not so different from the mean – take a look at Figure 2.1 and
notice that the histogram is quite symmetrical about the mean.
The last measure of the centre we shall discuss is the sample mode, which is
simply the value that occurs most frequently. If the variable is continuous, with no
repeating values, the peak of a histogram is taken to be the mode. Often there is
more than one mode; in the case of the 100 speed of light values, there are two
values that occur most frequently (810 and 880 km s−1 occur 10 times each). Once we bin the Michelson data into a histogram it becomes clear that the distribution has a single mode around 800–850 km s−1 (see Figure 2.1).

Figure 2.4 Illustration of the locations of the mean, median and mode for an asymmetric distribution, p(x), where x is some random variable.
Now we have three measures of centrality, but the one that is used the most is
the mean, often just called the average. If we have some theoretical distribution of
data spread over some range, we may calculate the mean, median and mode using
methods discussed in Chapter 5.
Figure 2.4 illustrates how the three different measures differ for some theoretical
distribution. The mean is like the centre of gravity of the distribution (if we imagine
it to be a distribution of mass density along a line); the median is simply the 50%
point, i.e. the point that divides the curve into halves with equal areas (equal
mass) on each side; the mode is the peak of the distribution (the densest point).
If the distribution is symmetrical about some point, the mean and median will be
the same, and if it is symmetrical about a single peak then the mode will also
be the same, but in general the three measures differ.
R.Box 2.3
Mean, median and mode in R
We can use R to calculate means and medians quite easily using the appropriately
named mean() and median() commands. The variable morley$Speed contains
the 100 speed values of Michelson. To calculate the mean and median, and add on the
offset (299 000 km s−1 ), type
mean(morley$Speed) + 299000
median(morley$Speed) + 299000
The modal value is not quite as easy to calculate as the mean or median since there is
no built-in function for this. One simple way to find the mode is to view a histogram
of the data and select the value corresponding to the peak.
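For data with repeated values, a short sketch using the built-in table() function will also recover the modal value(s); as before, we add the offset back on:

tab <- table(morley$Speed)                          # frequency of each distinct value
as.numeric(names(tab)[tab == max(tab)]) + 299000    # the most frequent value(s)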
2.4 Dispersion in data: variance and standard deviation
Box 2.1
Different averages
Imagine a room containing 100 working adults randomly selected from the
population. Then Bill Gates walks into the room. What happens to the mean wealth of
the people in the room? What about the median or mode? These different measures of
‘centre’ react very differently to an extreme outlier (such as Bill Gates). What will
happen to the average height (mean, median and mode) of the people in the room if
the world’s tallest man walks in?
What is the average number of legs for an adult human? The mode and the median
are surely two, but the mean number of legs is slightly less than two!
Table 2.1 Illustration of the computation of variance using the first n = 20 data values from Michelson's speed of light data. Here x_i are the data values, and the sample mean is their sum divided by n: x̄ = 18 180/20 = 909 km s−1. The x_i − x̄ are the deviations, which always sum to zero. The squared deviations are positive (or zero) valued and sum to a non-negative number. The sum of squared deviations divided by n − 1 gives the sample variance: s² = 209 180/19 = 11 009.47 km² s⁻².

i                           1        2       3        4       5    ···      20       sum
Data x_i (km s⁻¹)         850      740     900     1070     930    ···     960    18 180
x_i − x̄ (km s⁻¹)          −59     −169      −9      161      21    ···      51         0
(x_i − x̄)² (km² s⁻²)     3481   28 561      81   25 921     441    ···    2601   209 180
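Written out explicitly, the sample variance computed in the final row of Table 2.1 is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$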
The sample variance is always non-negative (i.e. either zero or positive), and will not have the same units as the data. If the x_i are in units of kg, the sample mean will have the same units (kg) but the sample variance will be in units of kg². The standard deviation is the positive square root of the sample variance, i.e. s = √(s²), and has the same units as the data x_i. Standard deviation is a measure of the typical deviation of the data points from the sample mean. Sometimes this is called the RMS: the root mean square (of the data after subtracting the mean).
Box 2.2
Why 1/(n − 1) in the sample variance?
The sample variance is normalised by a factor 1/(n − 1), where a factor 1/n might
seem more natural if we want the mean of the squared deviations. As discussed above,
the sum of the deviations (x_i − x̄) is always zero. Given the sample mean, the last deviation can be found once we know the other n − 1 deviations, and so when we average the squared deviations we divide by the number of independent elements, i.e. n − 1. This is known as Bessel's correction.
Using 1/(n − 1) makes the resulting estimate unbiased. Bias is the difference
between an average statistic and the true value that it is supposed to estimate, and an
unbiased statistic gives the right result when given a sufficient amount of data (i.e. in
the limit of large n). For more details of the bias in the variance, see section 5.2.2 of
Barlow (1989), or any good book on mathematical statistics.
The variance, or standard deviation, gives us a measure of the spread of the data in the sample. If we had two samples, one with s² = 1.0 and one with s² = 1.7, we would know that the typical deviation (from the mean) is about 30% larger in the second sample (recall that s = √(s²)).
R.Box 2.4
Variance and standard deviation
R has functions to calculate variances and standard deviations. For example, in order
to calculate the mean, variance and standard deviation of the numbers 1, 2, . . . , 50:
x <- 1:50
mean(x)
var(x)
sd(x)
To do the same for the Michelson data, first copy the speeds into a new array; this saves us having to use the prefix morley$ every time we wish to access these data (the boxes below use this Speed array):
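Speed <- morley$Speed   # copy, so we can write Speed rather than morley$Speed
mean(Speed)
var(Speed)
sd(Speed)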
R.Box 2.5
Calculating with subarrays
If we want to calculate the variance for each of Michelson’s five ‘experiments’ (each
one is a block of 20 consecutive values) individually, we could use
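# for example, to pick out experiment 2:
mask <- morley$Expt == 2   # TRUE where the experiment number is 2
sub <- Speed[mask]         # the 20 speeds from experiment 2
var(sub)                   # variance of this subset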
Note the use of the double equals sign (==) in testing for equality. The first line forms an array mask, the same size as the Speed array, with values that are TRUE where the condition is met (i.e. Expt == 2), and FALSE elsewhere. The second line forms a subarray from Speed by taking only those elements that occur where mask is TRUE. The third line shows how to compute the variance of this subset of the original data.
We can repeat this process using a loop as follows:
for (i in 1:5) {
print(var(Speed[morley$Expt==i]))
}
This looks quite complicated, so let's unpack it. The first part for (i in 1:5) {...} defines a loop. The second part (inside the curly brackets) defines what is to happen each time around the loop. The loop runs once for each of i = 1, 2, 3, 4, 5,
and each time round it prints the variance of the sample of data with the corresponding
experiment number i. The following may help illustrate the way loops are written
in R:
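# a toy example of the loop syntax: the body runs once per value of i
for (i in 1:3) {
  print(i^2)   # prints 1, then 4, then 9
}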
R.Box 2.6
Tukey’s five-number summary
There are two functions in R to calculate variations on Tukey’s five-number summary.
The first is
fivenum(0:100)
fivenum(Speed)
Here the reported values for the first, second (median) and third quartiles are given as
the closest actual data values. There is a variation on this:
summary(0:100)
summary(Speed)
The two methods differ slightly in how the quartiles are calculated. Note that the
summary() command calculates the mean for free.
2.6 Error bars, standard errors and precision
R.Box 2.7
Standard errors in R
There is no single command to compute the standard error in R, but one may make
use of the var() function to make the calculation simple. For example, to compute
the mean, variance and standard error of the Michelson data
Figure 2.5 The sample means for each of the five ‘experiments’ of Michelson,
each comprising 20 measurements. The standard errors for each mean are indicated
by the error bars. Notice the sidebars at the end of each error bar. These help define
the ends of each error bar, but may clutter the graphic when there are a lot of data to
present. The dotted line shows the modern value for the speed of light in air. From
this graphic, one can start to make inferences about Michelson’s measurements.
x <- morley$Speed
mean(x)
var(x)
sqrt( var(x) / length(x) )
R.Box 2.8
Standard errors by group, part 1
It is possible to calculate a statistic (e.g. mean or variance) for each of the five
experiments in an efficient manner by first re-organising the data into a matrix. Once
this is done we can make use of some powerful matrix tools in R. In the following
example, the speed data are converted to a matrix with 20 rows (and therefore five
columns, one for each ‘experiment’) called speed.
speed <- matrix(morley$Speed, nrow=20)
speed
[,1] [,2] [,3] [,4] [,5]
[1,] 850 960 880 890 890
[2,] 740 940 880 810 840
[3,] 900 960 880 810 780
It is important to check that the matrix is arranged in the right way. Here we see all the
data from first experiment in the first column – compare with the output of
morley$Speed[morley$Expt == 1]
R.Box 2.9
Standard errors by group, part 2
With the Michelson data arranged in a matrix, we can use the apply() command to
apply any function, e.g. mean() or var(), to every row or column of the matrix. For
example, to calculate the mean and variance of the data in each column, and then store
the results in new data objects, we can use
speed.mean <- apply(speed, 2, mean)
speed.var <- apply(speed, 2, var)
speed.var
The command apply(speed, 2, var) takes the matrix called speed and applies
the function var() to each of its columns to calculate the variance. You could also
use mean, sd, sum, or any other valid R command. The second argument (i.e. 2) specifies that columns should be analysed. If instead we used 1, we would get the variance
over each row. This approach, applying the same function over rows or columns of an
array, is usually faster (on large datasets) and more elegant than using loops.
R.Box 2.10
Standard errors by group, part 3
Finally, the standard errors for the five ‘experiments’ are just the square roots of these
variances divided by the number of data points in each experiment. We find the
number of data points in each column using the command apply() to apply the
length() function (we know the answer is 20).
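# a minimal sketch; the names speed.n and results are illustrative
speed.n <- apply(speed, 2, length)     # number of points per column (20 each)
se <- sqrt(speed.var / speed.n)        # standard error of each column mean
results <- data.frame(speed.mean, speed.var, speed.n, se)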
Remember that R is case sensitive, so se is not the same object as SE. The last line
uses the four new vectors (of the means, variance, lengths and standard errors) as
columns of a new object, a data frame (similar to a matrix but the columns may be
formed from different types of data).
R.Box 2.11
Plotting error bars
There are several ways to add error bars to a graphic in R. One way is using the
segments() command to draw a series of line segments between x− error and
x+ error. If we have sample means with standard errors (as in the previous box), we
may plot them as follows:
Expt <- 1:length(speed.mean)
plot(Expt, speed.mean, ylim=c(780,950), pch=16,
bty="l", xlab="Experiment",
ylab="Speed - 299,000 (km/s)")
segments(Expt, speed.mean-se, Expt, speed.mean+se)
where the second line plots the data and the third line adds the error bars. The
segments command takes as its input segments(x0,y0,x1,y1) and draws lines
between coordinates (x0,y0) and (x1,y1). A variation on this is to use the arrows
command to give each error bar a sidebar (as in Figure 2.5):
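arrows(Expt, speed.mean-se, Expt, speed.mean+se,
       code=3, angle=90, length=0.1)   # two-sided 'arrows' with flat heads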
where the first four arguments give the coordinates of the endpoints (as for the
segments() command), and the last three define two-sided arrows (code=3 means
draw an arrow head at both ends of the arrow), with flat arrow heads (angle=90) and
the extent of the arrow heads (length=0.1).
R.Box 2.12
Scatter plots in R
The R command plot() will produce a basic scatter plot from two (equal length)
arrays of numbers. The Hipparcos data shown in Figure 2.6 are described in
Appendix B (section B.4). Using the reduced data file hip_clean.txt we can produce a simple plot:
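# read the file into a data frame (assumes it sits in the working directory)
hip <- read.table("hip_clean.txt", header=TRUE)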
This creates a data array called hip that contains the contents of the file: 14 columns
and 5740 rows of data. A simple scatter plot may be produced using
plot(hip$BV, hip$V)
However, with a little more effort we can do much better than this.
The simplest way to visualise data with two continuous variables is a scatter plot,
where each data point (pair of numbers) is treated as a coordinate and is marked
with a symbol on the x–y plane. Scatter diagrams are used to reveal relationships
between pairs of variables, and are among the most widely used diagrams in all
of science. They can be enormously powerful; indeed, some of the most important
diagrams and relations in science were discovered by examination of scatter plots.
Figure 2.6 shows one such example from astronomy. This is a Hertzsprung–
Russell diagram (sometimes known as a colour–magnitude diagram) and shows the
luminosity against colour index for a sample of nearby stars. Each point represents
a star, the horizontal position of the points represents the B − V colour index
(a simple measure of the colour of the star, which depends on its temperature),
and the vertical position represents the absolute magnitude (an upside-down and
logarithmic measure of the luminosity). When these two variables are used to
construct a scatter diagram for a sample of stars, it is clear there is a great deal of
structure in the data, patterns that would not be at all obvious by examination of a
table of numbers, or of graphical examination of either variable separately.
R.Box 2.13
Basic scatter plot design
The following command shows how to produce a better scatter plot:
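# a sketch of the plot call; the xlim values and axis labels are illustrative
plot(hip$BV, hip$V.abs, pch=1, cex=0.5,
     ylim=c(16, -3), xlim=c(-0.5, 2.5),
     xlab="B-V (mag)", ylab="V (mag)", bty="n")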
Figure 2.6 Example of a scatter plot showing data on 5740 stars using data from
the Hipparcos astronomy satellite. Plotted is the V -band (green) absolute (distance
corrected) magnitude against the B − V colour index (difference between B and
V -band magnitudes, a blue–green colour). Each point represents a star: brighter
(smaller magnitude) stars are at the top, bluer stars are on the left. The plot clearly
reveals structure in the data: most stars fall in the band from top left to bottom
right, with a small island in the top right. This type of diagram is of fundamental
importance in stellar astrophysics. For comparison we also show the histograms
of each of the two variables (V and B − V ) separately. The structure in the data
is only apparent when looking at the two variables together using a scatter plot.
Here we have plotted Vabs , the absolute magnitude stored in the V.abs column (not
the apparent magnitude in the V column), against B − V . The pch=1 argument
selects a plot symbol (1 is a hollow circle); cex=0.5 makes the symbols smaller than
default. A small, hollow symbol was chosen here to reduce the clutter from the large
number of points to be plotted.
The option ylim=c(16, -3) sets the range of the vertical axis to run from 16 at
the bottom to −3 at the top. The xlim argument is used to control the horizontal axis
span. The arguments xlab and ylab are for setting the axis labels, and finally
bty="n" defines what type of box to enclose the plot in ("n" means no box).
For more information on the arguments that can be changed within the plot()
command, try ?plot and ?par.
How does one decide which observable to plot on the horizontal axis, and
which on the vertical axis? In an experiment one usually studies the response of
some variable(s) to changes in experimenter-controlled explanatory variables, in
which case the explanatory variable is plotted along the horizontal axis and the
response variable plotted along the vertical axis. However, it is often the case that
neither variable is obviously an explanatory variable. For example, if we recorded