Christian Heumann · Michael Schomaker
Shalabh
Introduction to
Statistics and
Data Analysis
With Exercises, Solutions and
Applications in R
Christian Heumann
Department of Statistics
Ludwig-Maximilians-Universität München
München, Germany
Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Kanpur, India
Michael Schomaker
Centre for Infectious Disease Epidemiology and Research
University of Cape Town
Cape Town, South Africa
The success of the open-source statistical software “R” has made a significant
impact on the teaching and research of statistics in the last decade. Analysing data is
now easier and more affordable than ever, but choosing the most appropriate
statistical methods remains a challenge for many users. To understand and interpret
software output, it is necessary to engage with the fundamentals of statistics.
However, many readers do not feel comfortable with complicated mathematics.
In this book, we attempt to find a healthy balance between explaining statistical
concepts comprehensively and showing their application and interpretation using R.
This book will benefit beginners and self-learners from various backgrounds as
we complement each chapter with various exercises and detailed and
comprehensible solutions. The results involving mathematics and rigorous proofs are separated
from the main text, where possible, and are kept in an appendix for interested
readers. Our textbook covers material that is generally taught in introductory-level
statistics courses to students from various backgrounds, including sociology,
biology, economics, psychology, medicine, and others. Most often, we introduce
the statistical concepts using examples and illustrate the calculations both manually
and using R.
However, while we provide a gentle introduction to R (in the appendix), this is
not a software book. Our emphasis lies on explaining statistical concepts correctly
and comprehensively, using exercises and software to delve deeper into the subject
matter and learn about the conceptual challenges that the methods present.
This book’s homepage, http://chris.userweb.mwn.de/book/, contains additional
material, most notably the software codes needed to answer the software exercises,
and data sets. In the remainder of this book, we will use grey boxes
to introduce the relevant R commands. In many cases, the code can be directly
pasted into R to reproduce the results and graphs presented in the book; in others,
the code is abbreviated to improve readability and clarity, and the detailed code can
be found online.
Part I
Descriptive Statistics
1 Introduction and Framework
Let us first introduce some terminology and related notations used in this book.
The units on which we measure data—such as persons, cars, animals, or plants—
are called observations. These units/observations are represented by the Greek
letter ω, and the set of all units, the population, by Ω.
Example 1.1.1
• If we are interested in the social conditions under which Indian people live, then
we would define all inhabitants of India as Ω and each of its inhabitants as ω. If we
want to collect data from a few inhabitants, then those would represent a sample
from the total population.
• Investigating the economic power of Africa’s platinum industry would require
treating each platinum-related company as ω, whereas all platinum-related companies
would be collected in Ω. A few companies $\omega_1, \omega_2, \ldots, \omega_n$ comprise a sample of
all companies.
• We may be interested in collecting information about those participating in a
statistics course. All participants in the course constitute the population Ω, and
each participant refers to a unit or observation ω.
1.2 Variables
This definition states that a variable X takes a value x for each observation ω ∈ Ω,
whereby all possible values are contained in the set S.
Example 1.2.1
Qualitative variables are variables which take values x that cannot be ordered in
a logical or natural way. For example, the colour of the eye and the type of transport
used to travel are qualitative variables: there is no reason to list blue eyes before brown
eyes (or vice versa), nor does it make sense to list buses before trains (or vice versa).
Quantitative variables represent measurable quantities. The values which these
variables can take can be ordered in a logical and natural way. Examples of
quantitative variables are
• size of shoes,
• price for houses,
• number of semesters studied, and
• weight of a person.
Discrete variables are variables which can only take a finite number of values.
All qualitative variables are discrete, such as the colour of the eye or the region of
a country. But also quantitative variables can be discrete: the size of shoes or the
number of semesters studied would be discrete because the number of values these
variables can take is limited.
Variables which can take an infinite number of values are called continuous
variables. Examples are the time it takes to travel to university, the length of an
antelope, and the distance between two planets. Sometimes, it is said that continuous
variables are variables which are “measured rather than counted”. This is a rather
informal definition which helps to understand the difference between discrete and
continuous variables. The crucial point is that continuous variables can, in theory,
take an infinite number of values; for instance, the height of a person may be recorded
as 172 cm. However, the actual height on the measuring tape might be 172.3 cm which
was rounded off to 172 cm. If one had a better measuring instrument, we may have
obtained 172.342 cm. But the real height of this person is a number with infinitely
many decimal places such as 172.342975328… cm. No matter what we eventually
report or obtain, a variable which can take an infinite number of values is defined to
be a continuous variable.
1.2.3 Scales
The thoughts and considerations from above indicate that different variables contain
different amounts of information. A useful classification of these considerations is
given by the concept of the scale of a variable. This concept will help us in the
remainder of this book to identify which methods are the appropriate ones to use in
a particular setting.
Nominal scale. The values of a nominal variable cannot be ordered. Examples are
the gender of a person (male–female) or the status of an application (pending–not
pending).
Ordinal scale. The values of an ordinal variable can be ordered. However, the
differences between these values cannot be interpreted in a meaningful way. For
example, the possible values of education level (none–primary education–secondary
education–university degree) can be ordered meaningfully, but the differences
between these values cannot be interpreted. Likewise, the satisfaction with a
product (unsatisfied–satisfied–very satisfied) is an ordinal variable because the values
this variable can take can be ordered, but the differences between
“unsatisfied–satisfied” and “satisfied–very satisfied” cannot be compared in a numerical way.
Continuous scale. The values of a continuous variable can be ordered. Furthermore,
the differences between these values can be interpreted in a meaningful way. For
instance, the height of a person refers to a continuous variable because the values
can be ordered (170 cm, 171 cm, 172 cm, …), and differences between these
values can be compared (the difference between 170 and 171 cm is the same
as the difference between 171 and 172 cm). Sometimes, the continuous scale is
divided further into subscales. While in the remainder of the book we typically
do not need these classifications, it is still useful to reflect on them:
Interval scale. Only differences between values, but not ratios, can be interpreted.
An example for this scale would be temperature (measured in °C): the difference
between −2 °C and 4 °C is 6 °C, but the ratio 4/(−2) = −2 does not mean that
−2 °C is twice as cold as 4 °C.
Ratio scale. Both differences and ratios can be interpreted. An example is speed:
60 km/h is 40 km/h more than 20 km/h. Moreover, 60 km/h is three times as fast
as 20 km/h because the ratio between them is 3.
Absolute scale. The absolute scale is the same as the ratio scale, with the
exception that the values are measured in “natural” units. An example is “number of
semesters studied” where no artificial unit such as km/h or °C is needed: the
values are simply 1, 2, 3, ….
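In R, the scale of a variable can be reflected in the way it is stored. The following is a minimal sketch rather than code from the book: the object names and example values are hypothetical, and only base R functions (factor(), is.ordered(), diff()) are used.

```r
# Hypothetical example values; only base R is used.

# Nominal scale: categories without any order
status <- factor(c("pending", "not pending", "pending"))

# Ordinal scale: categories with an order, but no interpretable differences
satisfaction <- factor(
  c("satisfied", "unsatisfied", "very satisfied"),
  levels  = c("unsatisfied", "satisfied", "very satisfied"),
  ordered = TRUE
)

# Continuous scale: values can be ordered and differences interpreted
height <- c(170, 171, 172)

is.ordered(status)                  # FALSE: no order is defined
satisfaction[1] > satisfaction[2]   # TRUE: "satisfied" ranks above "unsatisfied"
diff(height)                        # differences between neighbouring values
```

Storing an ordinal variable as an ordered factor allows comparisons such as ">" but, in line with the definition above, no arithmetic on the categories.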
Sometimes, data may be available only in a summarized form: instead of the original
value, one may only know the category or group the value belongs to. For example,
• it is often convenient in a survey to ask for the income (per year) by means of
groups: [€0–€20,000), [€20,000–€30,000), …, > €100,000;
• if there are many political parties in an election, those with a low number of voters
are often summarized in a new category “Other Parties”;
• instead of capturing the number of claims made by an insurance company customer,
the variable “claimed” may denote whether or not the customer claimed at all
(yes–no).
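The grouping described above can be sketched in R as follows. The incomes and claim counts are hypothetical, and the class limits between €30,000 and €100,000 are collapsed into one interval because the text does not list them. Note that the last interval in this sketch includes €100,000 itself, which is a simplification.

```r
# Hypothetical yearly incomes (in euros) and numbers of insurance claims
income <- c(12000, 25000, 31000, 99000, 150000)
claims <- c(0, 2, 0, 1, 5)

# Group incomes into left-closed intervals [a, b);
# right = FALSE makes each interval include its lower limit
income_group <- cut(
  income,
  breaks = c(0, 20000, 30000, 100000, Inf),
  right  = FALSE
)
table(income_group)

# Reduce the number of claims to a yes/no variable
claimed <- ifelse(claims > 0, "yes", "no")
table(claimed)
```

After grouping, only the category of each value is known; the original incomes and claim counts cannot be recovered from income_group or claimed.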
When collecting data, we may ask ourselves how to facilitate this in detail and
how much data needs to be collected. The latter question will be partly answered
in Sect. 9.5; but in general, we can think of collecting data either on all subjects of
interest, such as in a national census, or on a representative sample of the population.
Most commonly, we gather data on a sample (described in Part I of this book) and
then draw conclusions about the population of interest (discussed in Part III of
this book). A sample might either be chosen by us or obtained through third parties
(hospitals, government agencies), or created during an experiment. This depends on
the context as described below.
Survey. A survey typically (but not always) collects data by asking questions (in
person or by phone) or providing questionnaires to study participants (as a printout
or online). For example, an opinion poll before a national election provides evidence
about the future government: potential voters are asked by phone which party they are
going to vote for in the next election; on the day of the election, this information can
be updated by asking the same question to a sample of voters who have just delivered
their vote at the polling station (so-called exit poll). A behavioural research survey
may ask members of a community about their knowledge and attitudes towards drug
use. For this purpose, the study coordinators can send people with a questionnaire
to this community and interview members of randomly selected households.
Ideally, a survey is conducted in a way which makes the chosen sample
representative of the population of interest. If a marketing company interviews people in
a pedestrian zone to find their views about a new chocolate bar, then these people
may not be representative of those who will potentially be interested in this product.
Similarly, if students are asked to fill in an online survey to evaluate a lecture, it
may turn out that those who participate are on average less satisfied than those who
do not. Survey sampling is a complex topic on its own. The interested reader may
consult Groves et al. (2009) or Kauermann and Küchenhoff (2011).
Experiment. Experimental data is obtained in “controlled” settings. This can mean
many things, but essentially it is data which is generated by the researcher with full
control over one or many variables of interest. For instance, suppose there are two
competing toothpastes, both of which promise to reduce pain for people with sensitive
teeth. If the researcher decided to randomly assign toothpaste A to half of the study
participants, and toothpaste B to the other half, then this is an experiment because
it is only the researcher who decides which toothpaste is to be used by any of the
participants. It is not decided by the participant. The data of the variable toothpaste
is controlled by the experimenter. Consider another example where the production
process of a product can potentially be reduced by combining two processes. The
management could decide to implement the new process in three production facilities,
but leave it as it is in the other facilities. The production process for the different
units (facilities) is therefore under control of the management. However, if each
facility could decide for themselves if they wanted a change or not, it would not be
an experiment because factors not directly controlled by the management, such as the
leadership style of the facility manager, would determine which process is chosen.
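The random assignment in the toothpaste example could be carried out in R along the following lines. This is only a sketch under assumptions: the number of participants, the seed, and all object names are hypothetical.

```r
set.seed(1)  # arbitrary seed so that the assignment can be reproduced

n_participants <- 20  # hypothetical study size (must be even here)

# Shuffle a vector containing equally many "A" and "B" labels
toothpaste <- sample(rep(c("A", "B"), each = n_participants / 2))

assignment <- data.frame(
  participant = 1:n_participants,
  toothpaste  = toothpaste
)
table(assignment$toothpaste)   # half of the participants per toothpaste
```

Because sample() permutes the labels at random, it is the researcher's procedure, and not the participant, that decides who receives which toothpaste.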
Observational Data. Observational data is data which is collected routinely, without
a researcher designing a survey or conducting an experiment. Suppose a blood sample
is drawn from each patient with a particular acute infection when they arrive at a
hospital. This data may be stored in the hospital’s folders and later accessed by a
researcher who is interested in studying this infection. Or suppose a government
institution monitors where people live and move to. This data can later be used to
explore migration patterns.
Primary and Secondary Data. Primary data is data we collect ourselves, i.e. via a
survey or experiment. Secondary data, in contrast, is collected by someone else. For
example, data from a national census, publicly available databases, previous research
studies, government reports, historical data, and data from the internet, among others,
are secondary data.
Data is prepared and stored in a standard form so that statistical analyses can be
applied to it. The data is stored in a data matrix (= data set) with p columns and n rows
(see Fig. 1.2). Each row corresponds to an observation/unit ω and each column to
a variable X. This means that, for example, the entry in the fourth row and second
column ($x_{42}$) describes the value of the fourth observation on the second variable.
The examples below will illustrate the concept of a data set in more detail.
Example 1.4.1 Suppose five students take examinations in music, mathematics,
biology, and geography. Their marks, measured on a scale between 0 and 100 (where
100 is the best mark), can be written down as illustrated in Fig. 1.3. Note that each
row refers to a student and each column to a variable. We consider a larger data set
in the next example.
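A data matrix of this kind can be sketched in R with data.frame(). The marks below are hypothetical and are not the values shown in the book's figure; the object and row names are assumptions as well.

```r
# Hypothetical marks; these are NOT the values from the book's figure
marks <- data.frame(
  music       = c(50, 62, 74, 86, 98),
  mathematics = c(55, 61, 67, 73, 79),
  biology     = c(40, 52, 64, 76, 88),
  geography   = c(45, 57, 69, 81, 93),
  row.names   = paste("Student", LETTERS[1:5])
)

dim(marks)    # n = 5 rows (observations) and p = 4 columns (variables)
marks[4, 2]   # x_42: fourth observation on the second variable (mathematics)
```

The index order [row, column] mirrors the notation $x_{42}$: the first index refers to the observation, the second to the variable.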
Example 1.4.2 Consider the data set described in Appendix A.4. A pizza delivery
service captures information related to each delivery, for example the delivery time,
the temperature of the pizza, the name of the driver, the date of the delivery, the
name of the branch, and many more. To capture the data of all deliveries during one
month, we create a data matrix. Each row refers to a particular delivery, therefore
representing the observations of the data. Each column refers to a variable. In Fig. 1.4,
the variables $X_1$ (delivery time in minutes), $X_2$ (temperature in °C), and $X_{12}$ (name
of branch) are listed.
The first row tells us about the features of the first pizza delivery: the delivery
time was 35.1 min, the pizza arrived with a temperature of 68.3 °C, and the pizza
was delivered from the branch in the East of the city. In total, there were n = 1266
deliveries. For nominal variables, such as branch, we may decide to produce a coding
list, as illustrated in Table 1.1: instead of referring to the branches as “East”, “West”,
and “Centre”, we may simply call them 1, 2, and 3. As we will see in Chap. 11, this
has benefits for some analysis methods, though this is not needed in general.
If some values are missing, for example because they were never captured or even
lost, then this requires special attention. In Table 1.1, we assign missing values the
number “4” and therefore treat them as a separate category. If we work with statistical
software (see below), we may need other coding such as NA in the statistical software
R or . in Stata. More detail can be found in Appendix A.
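The coding list and the treatment of missing values can be illustrated with a short R sketch. The six branch values are hypothetical; the codes 1, 2, 3 for East, West, Centre and the code 4 for missing values follow the description above.

```r
# Hypothetical branch values for six deliveries; NA marks a missing value
branch <- c("East", "West", NA, "Centre", "East", "West")

# Coding list: East = 1, West = 2, Centre = 3
branch_factor <- factor(branch, levels = c("East", "West", "Centre"))
branch_code   <- as.integer(branch_factor)
branch_code                # 1 2 NA 3 1 2

# Treat missing values as a separate category with code 4
branch_code4 <- ifelse(is.na(branch_code), 4L, branch_code)
branch_code4               # 1 2 4 3 1 2

sum(is.na(branch))         # number of missing values: 1
```

Keeping NA instead of the code 4 is often safer in R, because many functions recognise NA as missing, whereas 4 would be treated as an ordinary category.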
There are a number of statistical software packages which allow data collection,
management, and, most importantly, analysis. In this book, we focus on the statistical
software R which is freely available at http://cran.r-project.org/. A gentle
introduction to R is provided in Appendix A. A data matrix can be created manually using
commands such as matrix(), data.frame(), and others. Any data can be edited
using edit(). However, typically analysts have already typed their data into
databases or spreadsheets, for example in Excel, Access, or MySQL. In most of these
applications, it is possible to save the data as an ASCII file (.dat), as a tab-delimited
file (.txt), or as a comma-separated values file (.csv). All of these formats allow easy
switching between different software and database applications. Such data can easily
be read into R by means of the following commands:
setwd('C:/directory')               # set the working directory
read.table('pizza_delivery.dat')    # ASCII file
read.table('pizza_delivery.txt')    # tab-delimited file
read.csv('pizza_delivery.csv')      # comma-separated values file
where setwd specifies the working directory. Alternatively, loading the library
foreign allows the import of data from many different statistical software
packages, notably Stata, SAS, Minitab, SPSS, among others. A detailed description of
data import and export can be found in the respective R manual available at
http://cran.r-project.org/doc/manuals/r-release/R-data.pdf. Once the data is read into R,
it can be viewed with
fix() # option 1
View() # option 2
We can also get an overview of the data directly in the R console by displaying
only the top lines of the data with head(). Both approaches are visualized in Fig. 1.5
for the pizza data introduced in Example 1.4.2.
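The import and inspection steps can be tried without the original files by writing a small data set to a temporary file first. In this sketch, the column names and all values except the first row are hypothetical; the first row repeats the first delivery described in Example 1.4.2.

```r
# Hypothetical miniature data set; the column names are assumptions
deliveries <- data.frame(
  time        = c(35.1, 25.2, 45.6),
  temperature = c(68.3, 71.0, 53.4),
  branch      = c("East", "East", "West")
)

# Write to a temporary CSV file so the example does not depend on local paths
csv_file <- tempfile(fileext = ".csv")
write.csv(deliveries, csv_file, row.names = FALSE)

# Read the file back and inspect it in the console
pizza_small <- read.csv(csv_file)
head(pizza_small, n = 2)   # first two rows
str(pizza_small)           # variable names and types
```

Assigning the result of read.csv() to an object, here pizza_small, is what makes the data available for later commands such as head() or View().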
1.5 Key Points and Further Issues
Note:
1.6 Exercises
Exercise 1.1 Describe both the population and the observations for the following
research questions:
Exercise 1.2 A national park conducts a study on the behaviour of their leopards.
A few of the park’s leopards are registered and receive a GPS device which allows
measuring the position of the leopard. Use this example to describe the following
concepts: population, sample, observation, value, and variable.
Exercise 1.3 Which of the following variables are qualitative, and which are
quantitative? Specify which of the quantitative variables are discrete and which are
continuous:
Time to travel to work, shoe size, preferred political party, price for a canteen meal, eye
colour, gender, wavelength of light, customer satisfaction on a scale from 1 to 10, delivery
time for a parcel, blood type, number of goals in a hockey match, height of a child, subject
line of an email.
Exercise 1.5 Make yourself familiar with the pizza data set from Appendix A.4.
(a) First, browse through the introduction to R in Appendix A. Then, read in the
data.
(b) View the data both in the R data editor and in the R console.
(c) Create a new data matrix which consists of the first 5 rows and first 5 variables
of the data. Print this data set on the R console. Now, save this data set in your
preferred format.
(d) Add a new variable “NewTemperature” to the data set which converts the
temperature from °C to °F.
(e) Attach the data and list the values from the variable “NewTemperature”.
(f) Use “?” to make yourself familiar with the following commands: str, dim,
colnames, names, nrow, ncol, head, and tail. Apply these commands
to the data to get more information about it.
Exercise 1.6 Consider the research questions of describing parents’ attitudes towards
immunization, what proportion of them wants immunization against chicken pox for
their last-born child, and whether this proportion differs by gender and age.
(a) Which data collection method is the most suitable one to answer the above
questions: survey or experiment?
(b) How would you capture the attitudes towards immunization in a single variable?
(c) Which variables are needed to answer all the above questions? Describe the scale
of each of them.
(d) Reflect on what an appropriate data set would look like. Now, given this data
set, try to write down the above research questions as precisely as possible.
Discrete Data. Let us first consider a simple example to illustrate our notation.
Example 2.1.1 Suppose there are ten people in a supermarket queue. Each of them
is either coded as “F” (if the person is female) or “M” (if the person is male). The
collected data may look like
M, F, M, F, M, M, M, F, M, M.
There are now two categories in the data: male (M) and female (F). We use $a_1$ to refer
to the male category and $a_2$ to refer to the female category. Since there are seven male
and three female people in the queue, we have 7 values in category $a_1$, denoted as
$n_1 = 7$, and 3 values in category $a_2$, denoted as $n_2 = 3$. The number of observations
in a particular category is called the absolute frequency. It follows that $n_1 = 7$ and
$n_2 = 3$ are the absolute frequencies of $a_1$ and $a_2$, respectively. Note that
$n_1 + n_2 = n = 10$, which is the same as the total number of collected observations.
We can also calculate the relative frequencies of $a_1$ and $a_2$ as
$f_1 = f(a_1) = \frac{n_1}{n} = \frac{7}{10} = 0.7 = 70\,\%$ and
$f_2 = f(a_2) = \frac{n_2}{n} = \frac{3}{10} = 0.3 = 30\,\%$, respectively. This gives us
information about the proportions of males and females in the queue.
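The frequencies of Example 2.1.1 can be checked in R. The sketch below uses the ten observations listed in the example; the object names are hypothetical, and table() and prop.table() are base R functions.

```r
# The ten observations from the supermarket queue
queue <- c("M", "F", "M", "F", "M", "M", "M", "F", "M", "M")

abs_freq <- table(queue)            # absolute frequencies n_j
rel_freq <- prop.table(abs_freq)    # relative frequencies f_j = n_j / n

abs_freq["M"]   # 7
abs_freq["F"]   # 3
rel_freq["M"]   # 0.7
sum(rel_freq)   # 1
```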
We now extend these concepts to a general framework for the summary of data
on discrete variables. Suppose there are k categories denoted as $a_1, a_2, \ldots, a_k$
with $n_j$ ($j = 1, 2, \ldots, k$) observations in category $a_j$. The absolute frequency $n_j$ is
defined as the number of units in the jth category $a_j$. The sum of absolute frequencies
equals the total number of units in the data: $\sum_{j=1}^{k} n_j = n$. The relative frequency
of the jth class is defined as
$$f_j = f(a_j) = \frac{n_j}{n}, \quad j = 1, 2, \ldots, k. \qquad (2.1)$$
The relative frequencies always lie between 0 and 1 and $\sum_{j=1}^{k} f_j = 1$.
Grouped Continuous Data. Data on continuous variables usually has a large number
(k) of different values. Sometimes k may even be the same as n and in such a case
the relative frequencies become $f_j = \frac{1}{n}$ for all j. However, it is possible to define
intervals in which the observed values are contained.
Example 2.1.2 Consider the following n = 20 results of the written part of a driving
licence examination (a maximum of 100 points could be achieved):
28, 35, 42, 90, 70, 56, 75, 66, 30, 89, 75, 64, 81, 69, 55, 83, 72, 68, 73, 16.
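We can summarize the results in class intervals such as 0–20, 21–40, 41–60, 61–80,
and 81–100, and the data can be presented as follows:
Class intervals              0–20    21–40   41–60   61–80   81–100
Absolute frequencies $n_j$   1       3       3       9       4
Relative frequencies $f_j$   0.05    0.15    0.15    0.45    0.20
We have $\sum_{j=1}^{5} n_j = 20 = n$ and $\sum_{j=1}^{5} f_j = 1$.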
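The grouped frequencies of Example 2.1.2 can be computed in R with cut() and table(). This is a sketch with hypothetical object names; the class limits follow the intervals named in the example.

```r
# Results of the written driving licence examination (n = 20)
exam_points <- c(28, 35, 42, 90, 70, 56, 75, 66, 30, 89,
                 75, 64, 81, 69, 55, 83, 72, 68, 73, 16)

# Assign each result to its class interval
classes <- cut(
  exam_points,
  breaks = c(0, 20, 40, 60, 80, 100),
  labels = c("0-20", "21-40", "41-60", "61-80", "81-100"),
  include.lowest = TRUE
)

n_j <- table(classes)             # absolute frequencies
f_j <- n_j / length(exam_points)  # relative frequencies

data.frame(class = names(n_j), n_j = as.vector(n_j), f_j = as.vector(f_j))
sum(n_j)   # 20
sum(f_j)   # 1
```

With these breaks, cut() uses right-closed intervals such as (20, 40], which agrees with the classes 21–40, 41–60, and so on for integer-valued points.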