Christian Heumann · Michael Schomaker · Shalabh

Introduction to Statistics and Data Analysis
With Exercises, Solutions and Applications in R
Christian Heumann
Department of Statistics
Ludwig-Maximilians-Universität München
München
Germany

Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Kanpur
India

Michael Schomaker
Centre for Infectious Disease Epidemiology
and Research
University of Cape Town
Cape Town
South Africa

ISBN 978-3-319-46160-1 ISBN 978-3-319-46162-5 (eBook)


DOI 10.1007/978-3-319-46162-5

Library of Congress Control Number: 2016955516

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The success of the open-source statistical software “R” has made a significant
impact on the teaching and research of statistics in the last decade. Analysing data is
now easier and more affordable than ever, but choosing the most appropriate
statistical methods remains a challenge for many users. To understand and interpret
software output, it is necessary to engage with the fundamentals of statistics.
However, many readers do not feel comfortable with complicated mathematics.
In this book, we attempt to find a healthy balance between explaining statistical
concepts comprehensively and showing their application and interpretation using R.
This book will benefit beginners and self-learners from various backgrounds as
we complement each chapter with various exercises and detailed and comprehensible
solutions. The results involving mathematics and rigorous proofs are separated
from the main text, where possible, and are kept in an appendix for interested
readers. Our textbook covers material that is generally taught in introductory-level
statistics courses to students from various backgrounds, including sociology,
biology, economics, psychology, medicine, and others. Most often, we introduce
the statistical concepts using examples and illustrate the calculations both manually
and using R.
However, while we provide a gentle introduction to R (in the appendix), this is
not a software book. Our emphasis lies on explaining statistical concepts correctly
and comprehensively, using exercises and software to delve deeper into the subject
matter and learn about the conceptual challenges that the methods present.
This book’s homepage, http://chris.userweb.mwn.de/book/, contains additional
material, most notably the software codes needed to answer the software exercises,
and data sets. In the remainder of this book, we will use grey boxes
to introduce the relevant R commands. In many cases, the code can be directly
pasted into R to reproduce the results and graphs presented in the book; in others,
the code is abbreviated to improve readability and clarity, and the detailed code can
be found online.


Many years of teaching experience, from undergraduate to postgraduate level,
went into this book. The authors hope that the reader will enjoy reading it and find it a
useful reference for learning. We welcome critical feedback to improve future
editions of this book. Comments can be sent to christian.heumann@stat.uni-muenchen.de,
shalab@iitk.ac.in, and michael.schomaker@uct.ac.za, who contributed equally to this book.
We thank Melanie Schomaker for producing some of the figures and giving
graphical advice, Alice Blanck from Springer for her continuous help and support,
and Lyn Imeson for her dedicated commitment which improved the earlier versions
of this book. We are grateful to our families who have supported us during the
preparation of this book.

Christian Heumann, München, Germany
Michael Schomaker, Cape Town, South Africa
Shalabh, Kanpur, India
November 2016
Contents

Part I Descriptive Statistics

1 Introduction and Framework
1.1 Population, Sample, and Observations
1.2 Variables
1.2.1 Qualitative and Quantitative Variables
1.2.2 Discrete and Continuous Variables
1.2.3 Scales
1.2.4 Grouped Data
1.3 Data Collection
1.4 Creating a Data Set
1.4.1 Statistical Software
1.5 Key Points and Further Issues
1.6 Exercises
2 Frequency Measures and Graphical Representation of Data
2.1 Absolute and Relative Frequencies
2.2 Empirical Cumulative Distribution Function
2.2.1 ECDF for Ordinal Variables
2.2.2 ECDF for Continuous Variables
2.3 Graphical Representation of a Variable
2.3.1 Bar Chart
2.3.2 Pie Chart
2.3.3 Histogram
2.4 Kernel Density Plots
2.5 Key Points and Further Issues
2.6 Exercises
3 Measures of Central Tendency and Dispersion
3.1 Measures of Central Tendency
3.1.1 Arithmetic Mean
3.1.2 Median and Quantiles
3.1.3 Quantile–Quantile Plots (QQ-Plots)
3.1.4 Mode
3.1.5 Geometric Mean
3.1.6 Harmonic Mean
3.2 Measures of Dispersion
3.2.1 Range and Interquartile Range
3.2.2 Absolute Deviation, Variance, and Standard Deviation
3.2.3 Coefficient of Variation
3.3 Box Plots
3.4 Measures of Concentration
3.4.1 Lorenz Curve
3.4.2 Gini Coefficient
3.5 Key Points and Further Issues
3.6 Exercises
4 Association of Two Variables
4.1 Summarizing the Distribution of Two Discrete Variables
4.1.1 Contingency Tables for Discrete Data
4.1.2 Joint, Marginal, and Conditional Frequency Distributions
4.1.3 Graphical Representation of Two Nominal or Ordinal Variables
4.2 Measures of Association for Two Discrete Variables
4.2.1 Pearson’s χ² Statistic
4.2.2 Cramer’s V Statistic
4.2.3 Contingency Coefficient C
4.2.4 Relative Risks and Odds Ratios
4.3 Association Between Ordinal and Continuous Variables
4.3.1 Graphical Representation of Two Continuous Variables
4.3.2 Correlation Coefficient
4.3.3 Spearman’s Rank Correlation Coefficient
4.3.4 Measures Using Discordant and Concordant Pairs
4.4 Visualization of Variables from Different Scales
4.5 Key Points and Further Issues
4.6 Exercises

Part II Probability Calculus

5 Combinatorics
5.1 Introduction
5.2 Permutations
5.2.1 Permutations without Replacement
5.2.2 Permutations with Replacement
5.3 Combinations
5.3.1 Combinations without Replacement and without Consideration of the Order
5.3.2 Combinations without Replacement and with Consideration of the Order
5.3.3 Combinations with Replacement and without Consideration of the Order
5.3.4 Combinations with Replacement and with Consideration of the Order
5.4 Key Points and Further Issues
5.5 Exercises
6 Elements of Probability Theory
6.1 Basic Concepts and Set Theory
6.2 Relative Frequency and Laplace Probability
6.3 The Axiomatic Definition of Probability
6.3.1 Corollaries Following from Kolmogorov’s Axioms
6.3.2 Calculation Rules for Probabilities
6.4 Conditional Probability
6.4.1 Bayes’ Theorem
6.5 Independence
6.6 Key Points and Further Issues
6.7 Exercises
7 Random Variables
7.1 Random Variables
7.2 Cumulative Distribution Function (CDF)
7.2.1 CDF of Continuous Random Variables
7.2.2 CDF of Discrete Random Variables
7.3 Expectation and Variance of a Random Variable
7.3.1 Expectation
7.3.2 Variance
7.3.3 Quantiles of a Distribution
7.3.4 Standardization
7.4 Tschebyschev’s Inequality
7.5 Bivariate Random Variables
7.6 Calculation Rules for Expectation and Variance
7.6.1 Expectation and Variance of the Arithmetic Mean
7.7 Covariance and Correlation
7.7.1 Covariance
7.7.2 Correlation Coefficient
7.8 Key Points and Further Issues
7.9 Exercises
8 Probability Distributions
8.1 Standard Discrete Distributions
8.1.1 Discrete Uniform Distribution
8.1.2 Degenerate Distribution
8.1.3 Bernoulli Distribution
8.1.4 Binomial Distribution
8.1.5 Poisson Distribution
8.1.6 Multinomial Distribution
8.1.7 Geometric Distribution
8.1.8 Hypergeometric Distribution
8.2 Standard Continuous Distributions
8.2.1 Continuous Uniform Distribution
8.2.2 Normal Distribution
8.2.3 Exponential Distribution
8.3 Sampling Distributions
8.3.1 χ²-Distribution
8.3.2 t-Distribution
8.3.3 F-Distribution
8.4 Key Points and Further Issues
8.5 Exercises

Part III Inductive Statistics

9 Inference
9.1 Introduction
9.2 Properties of Point Estimators
9.2.1 Unbiasedness and Efficiency
9.2.2 Consistency of Estimators
9.2.3 Sufficiency of Estimators
9.3 Point Estimation
9.3.1 Maximum Likelihood Estimation
9.3.2 Method of Moments
9.4 Interval Estimation
9.4.1 Introduction
9.4.2 Confidence Interval for the Mean of a Normal Distribution
9.4.3 Confidence Interval for a Binomial Probability
9.4.4 Confidence Interval for the Odds Ratio
9.5 Sample Size Determinations
9.6 Key Points and Further Issues
9.7 Exercises
10 Hypothesis Testing
10.1 Introduction
10.2 Basic Definitions
10.2.1 One- and Two-Sample Problems
10.2.2 Hypotheses
10.2.3 One- and Two-Sided Tests
10.2.4 Type I and Type II Error
10.2.5 How to Conduct a Statistical Test
10.2.6 Test Decisions Using the p-Value
10.2.7 Test Decisions Using Confidence Intervals
10.3 Parametric Tests for Location Parameters
10.3.1 Test for the Mean When the Variance is Known (One-Sample Gauss Test)
10.3.2 Test for the Mean When the Variance is Unknown (One-Sample t-Test)
10.3.3 Comparing the Means of Two Independent Samples
10.3.4 Test for Comparing the Means of Two Dependent Samples (Paired t-Test)
10.4 Parametric Tests for Probabilities
10.4.1 One-Sample Binomial Test for the Probability p
10.4.2 Two-Sample Binomial Test
10.5 Tests for Scale Parameters
10.6 Wilcoxon–Mann–Whitney (WMW) U-Test
10.7 χ²-Goodness-of-Fit Test
10.8 χ²-Independence Test and Other χ²-Tests
10.9 Key Points and Further Issues
10.10 Exercises
11 Linear Regression
11.1 The Linear Model
11.2 Method of Least Squares
11.2.1 Properties of the Linear Regression Line
11.3 Goodness of Fit
11.4 Linear Regression with a Binary Covariate
11.5 Linear Regression with a Transformed Covariate
11.6 Linear Regression with Multiple Covariates
11.6.1 Matrix Notation
11.6.2 Categorical Covariates
11.6.3 Transformations
11.7 The Inductive View of Linear Regression
11.7.1 Properties of Least Squares and Maximum Likelihood Estimators
11.7.2 The ANOVA Table
11.7.3 Interactions
11.8 Comparing Different Models
11.9 Checking Model Assumptions
11.10 Association Versus Causation
11.11 Key Points and Further Issues
11.12 Exercises
Appendix A: Introduction to R
Appendix B: Solutions to Exercises
Appendix C: Technical Appendix
Appendix D: Visual Summaries
References
Index
About the Authors

Prof. Christian Heumann is a professor at the Ludwig-Maximilians-Universität
München, Germany, where he teaches students in Bachelor and Master programs
offered by the Department of Statistics, as well as undergraduate students in the
Bachelor of Science programs in business administration and economics. His
research interests include statistical modeling, computational statistics and all
aspects of missing data.

Dr. Michael Schomaker is a Senior Researcher and Biostatistician at the Centre
for Infectious Disease Epidemiology & Research (CIDER), University of Cape
Town, South Africa. He received his doctoral degree from the University of
Munich. He has taught undergraduate students for many years and has written
contributions for various introductory textbooks. His research focuses on missing
data, causal inference, model averaging and HIV/AIDS.

Prof. Shalabh is a Professor at the Indian Institute of Technology Kanpur, India.
He received his Ph.D. from the University of Lucknow (India) and completed his
post-doctoral work at the University of Pittsburgh (USA) and University of Munich
(Germany). He has over twenty years of experience in teaching and research. His
main research areas are linear models, regression analysis, econometrics,
measurement error models, missing data models and sampling theory.

Part I Descriptive Statistics

1 Introduction and Framework

Statistics is a collection of methods which help us to describe, summarize, interpret,
and analyse data. Drawing conclusions from data is vital in research, administration,
and business. Researchers are interested in understanding whether a medical
intervention helps in reducing the burden of a disease, how personality relates to
decision-making, whether a new fertilizer increases the yield of crops, how a political
system affects trade policy, who is going to vote for a political party in the
next election, what are the long-term changes in the population of a fish species,
and many more questions. Governments and organizations may be interested in the
life expectancy of a population, the risk factors for infant mortality, geographical
differences in energy usage, migration patterns, or reasons for unemployment. In
business, identifying people who may be interested in a certain product, optimizing
prices, and evaluating the satisfaction of customers are possible areas of interest.
No matter what the question of interest is, it is important to collect data in a
way which allows its analysis. The representation of collected data in a data set or
data matrix allows the application of a variety of statistical methods. In the first
part of the book, we are going to introduce methods which help us in describing
data, and the second and third parts of the book focus on inferential statistics, which
means drawing conclusions from data. In this chapter, we are going to introduce the
framework of statistics which is needed to properly collect, administer, evaluate, and
analyse data.

1.1 Population, Sample, and Observations

Let us first introduce some terminology and related notations used in this book.
The units on which we measure data (such as persons, cars, animals, or plants)
are called observations. These units/observations are represented by the Greek
symbol ω. The collection of all units is called the population and is represented by Ω.
When we refer to ω ∈ Ω, we mean a single unit out of all units, e.g. one person out of
all persons of interest. If we consider a selection of observations ω₁, ω₂, …, ωₙ, then
these observations are called a sample. A sample is always a subset of the population,
{ω₁, ω₂, …, ωₙ} ⊆ Ω.

C. Heumann et al., Introduction to Statistics and Data Analysis,
DOI 10.1007/978-3-319-46162-5_1

Example 1.1.1

• If we are interested in the social conditions under which Indian people live, then
we would define all inhabitants of India as Ω and each of its inhabitants as ω. If we
want to collect data from a few inhabitants, then those would represent a sample
from the total population.
• Investigating the economic power of Africa’s platinum industry would require
treating each platinum-related company as ω, whereas all platinum-related companies
would be collected in Ω. A few companies ω₁, ω₂, …, ωₙ comprise a sample of
all companies.
• We may be interested in collecting information about those participating in a
statistics course. All participants in the course constitute the population Ω, and
each participant refers to a unit or observation ω.
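
The set notation used in these examples can be made concrete with a small sketch. The book itself works in R; the Python snippet below is only an illustration. It treats a hypothetical population of course participants as a set and draws a sample from it. The participant labels and the helper `draw_sample` are assumptions introduced for this example and do not come from the text.

```python
import random

# Hypothetical population Ω: all participants of a statistics course.
population = {"Anna", "Ben", "Chen", "Divya", "Emeka", "Fatima"}


def draw_sample(units, n, seed=0):
    """Return n distinct units drawn at random from the given population."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    # Sets are unordered, so sort first to get a stable sequence to sample from.
    return set(rng.sample(sorted(units), n))


sample = draw_sample(population, n=3)

# A sample is always a subset of the population: {ω1, ..., ωn} ⊆ Ω.
print(sample <= population)
print(len(sample))
```

Whichever three participants are drawn, the subset check holds, which is all that the definition of a sample requires.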

Remark 1.1.1 Sometimes, the concept of a population is not applicable or is difficult
to imagine. As an example, imagine that we measure the temperature in New Delhi
every hour. A sample would then be the time series of temperatures in a specific
time window, for example from January to March 2016. A population in the sense of
observational units does not exist here. But now assume that we measure temperatures
in several different cities; then, all the cities form the population, and a sample is any
subset of the cities.

1.2 Variables

If we have specified the population of interest for a specific research question, we
can think of what is of interest about our observations. A particular feature of these
observations can be collected in a statistical variable X. Any information we are
interested in may be captured in such a variable. For example, if our observations
refer to human beings, X may describe marital status, gender, age, or anything else
which may relate to a person. Of course, we can be interested in many different
features, each of them collected in a different variable $X_i$, $i = 1, 2, \ldots, p$. Each
observation ω takes a particular value for X. If X refers to gender, each observation,
i.e. each person, has a particular value x which refers to either “male” or “female”.
The formal definition of a variable is

$$X : \Omega \to S, \qquad \omega \mapsto x. \tag{1.1}$$

This definition states that a variable X takes a value x for each observation ω ∈ Ω,
where all possible values are contained in the set S.

Example 1.2.1

• If X refers to gender, possible x-values are contained in S = {male, female}. Each
observation ω is either male or female, and this information is summarized in X.
• Let X be the country of origin for a car. Possible values to be taken by an observation
ω (i.e. a car) are S = {Italy, South Korea, Germany, France, India, China, Japan,
USA, …}.
• A variable X which refers to age may take any value between 1 and 125. Each
person ω is assigned a value x which represents the age of this person.
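
The mapping in (1.1) can be mimicked in R. The following is only a minimal sketch: the unit names omega1 to omega3 and their values are invented for illustration. The names of the vector play the role of the units ω, the entries are the values x, and the factor levels list the set S.

```r
# Set S of possible values of the variable X = gender
S <- c("male", "female")

# Hypothetical units (names) and the values they take (entries)
X <- factor(c(omega1 = "female", omega2 = "male", omega3 = "female"),
            levels = S)

X["omega2"]   # the value x assigned to the unit omega2
levels(X)     # the set S
```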

1.2.1 Qualitative and Quantitative Variables

Qualitative variables are variables which take values x that cannot be ordered in
a logical or natural way. For example,

• the colour of the eye,
• the name of a political party, and
• the type of transport used to travel to work

are all qualitative variables. There is no reason to list blue eyes before brown
eyes (or vice versa), nor does it make sense to list buses before trains (or vice versa).
Quantitative variables represent measurable quantities. The values which these
variables can take can be ordered in a logical and natural way. Examples of
quantitative variables are

• size of shoes,
• price for houses,
• number of semesters studied, and
• weight of a person.

Remark 1.2.1 It is common to assign numbers to qualitative variables for practical
purposes in data analyses (see Sect. 1.4 for more detail). For instance, if we consider
the variable “gender”, then each observation can take either the “value” male or
female. We may decide to assign 1 to female and 0 to male and use these numbers
instead of the original categories. However, this is arbitrary, and we could have also
chosen “1” for male and “0” for female, or “2” for male and “10” for female. There
is no logical and natural order on how to arrange male and female, and thus, the
variable gender remains a qualitative variable, even after using numbers for coding
the values that X can take.
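
The arbitrariness of such codings can be made concrete in R. The sketch below uses a made-up vector of five observations; the two codings are the ones mentioned in the remark, and the factor at the end is one possible way to keep the labels without suggesting an order.

```r
# Hypothetical observations of the qualitative variable "gender"
gender <- c("male", "female", "female", "male", "female")

# Two equally valid (and equally arbitrary) numeric codings
code_a <- ifelse(gender == "female", 1, 0)    # female = 1, male = 0
code_b <- ifelse(gender == "male", 2, 10)     # male = 2, female = 10

# Each code of one scheme corresponds to exactly one code of the other,
# so both codings carry the same information
table(code_a, code_b)

# A factor stores the labels and does not invite arithmetic on the codes
gender_f <- factor(gender, levels = c("male", "female"))
table(gender_f)
```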

1.2.2 Discrete and Continuous Variables

Discrete variables are variables which can only take a finite number of values.
All qualitative variables are discrete, such as the colour of the eye or the region of
a country. Quantitative variables can also be discrete: the size of shoes or the
number of semesters studied would be discrete because the number of values these
variables can take is limited.
Variables which can take an infinite number of values are called continuous
variables. Examples are the time it takes to travel to university, the length of an
antelope, and the distance between two planets. Sometimes, it is said that continuous
variables are variables which are “measured rather than counted”. This is a rather
informal definition which helps to understand the difference between discrete and
continuous variables. The crucial point is that continuous variables can, in theory,
take an infinite number of values; for instance, the height of a person may be recorded
as 172 cm. However, the actual height on the measuring tape might be 172.3 cm which
was rounded off to 172 cm. With a better measuring instrument, we might have
obtained 172.342 cm. But the real height of this person is a number with infinitely
many decimal places such as 172.342975328… cm. No matter what we eventually
report or obtain, a variable which can take an infinite number of values is defined to
be a continuous variable.

1.2.3 Scales

The thoughts and considerations from above indicate that different variables contain
different amounts of information. A useful classification of these considerations is
given by the concept of the scale of a variable. This concept will help us in the
remainder of this book to identify which methods are the appropriate ones to use in
a particular setting.

Nominal scale. The values of a nominal variable cannot be ordered. Examples are
the gender of a person (male–female) or the status of an application (pending–not
pending).

Ordinal scale. The values of an ordinal variable can be ordered. However, the
differences between these values cannot be interpreted in a meaningful way. For
example, the possible values of education level (none–primary education–secondary
education–university degree) can be ordered meaningfully, but the differences
between these values cannot be interpreted. Likewise, the satisfaction with a
product (unsatisfied–satisfied–very satisfied) is an ordinal variable because the values
this variable can take can be ordered, but the differences between “unsatisfied–
satisfied” and “satisfied–very satisfied” cannot be compared in a numerical way.

Continuous scale. The values of a continuous variable can be ordered. Furthermore,
the differences between these values can be interpreted in a meaningful way. For
instance, the height of a person refers to a continuous variable because the values
can be ordered (170 cm, 171 cm, 172 cm, …), and differences between these
values can be compared (the difference between 170 and 171 cm is the same
as the difference between 171 and 172 cm). Sometimes, the continuous scale is
divided further into subscales. While in the remainder of the book we typically
do not need these classifications, it is still useful to reflect on them:

Interval scale. Only differences between values, but not ratios, can be interpreted.
An example for this scale would be temperature (measured in °C): the difference
between −2 °C and 4 °C is 6 °C, but the ratio 4/(−2) = −2 does not mean that
4 °C is “−2 times as warm” as −2 °C.

Ratio scale. Both differences and ratios can be interpreted. An example is speed:
60 km/h is 40 km/h more than 20 km/h. Moreover, 60 km/h is three times as fast
as 20 km/h because the ratio between them is 3.

Absolute scale. The absolute scale is the same as the ratio scale, with the
exception that the values are measured in “natural” units. An example is “number of
semesters studied” where no artificial unit such as km/h or °C is needed: the
values are simply 1, 2, 3, ….
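
The scales can be illustrated with a short R sketch. All data values below are invented, and the conversion factor from km/h to miles per hour is approximate. Unordered and ordered factors are one possible representation of nominal and ordinal variables, and the temperature and speed examples show why ratios are meaningful only on a ratio scale.

```r
# Nominal: categories without order (hypothetical data)
status <- factor(c("pending", "not pending", "pending"))

# Ordinal: categories with an order but no interpretable differences
satisfaction <- factor(c("satisfied", "unsatisfied", "very satisfied"),
                       levels = c("unsatisfied", "satisfied", "very satisfied"),
                       ordered = TRUE)
satisfaction[2] < satisfaction[1]    # comparisons are defined for ordered factors

# Interval scale: differences are meaningful, ratios are not
celsius    <- c(-2, 4)
fahrenheit <- 32 + 1.8 * celsius
diff(celsius)                        # difference of 6 degrees Celsius
celsius[2] / celsius[1]              # ratio in degrees Celsius ...
fahrenheit[2] / fahrenheit[1]        # ... differs from the ratio in Fahrenheit

# Ratio scale: ratios do not depend on the unit
kmh <- c(20, 60)
mph <- 0.621371 * kmh                # approximate conversion factor
kmh[2] / kmh[1]
mph[2] / mph[1]
```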

1.2.4 Grouped Data

Sometimes, data may be available only in a summarized form: instead of the original
value, one may only know the category or group the value belongs to. For example,

• it is often convenient in a survey to ask for the income (per year) by means of
groups: [€0–€20,000), [€20,000–€30,000), …, > €100,000;
• if there are many political parties in an election, those with a low number of voters
are often summarized in a new category “Other Parties”;
• instead of capturing the number of claims made by an insurance company customer,
the variable “claimed” may denote whether or not the customer claimed at all
(yes–no).

If data is available in grouped form, we call the respective variable capturing
this information a grouped variable. Sometimes, these variables are also known as
categorical variables. This is, however, not a complete definition because categorical
variables refer to any type of variable which takes a finite, possibly small, number of
values. Thus, any discrete and/or nominal and/or ordinal and/or qualitative variable
may be regarded as a categorical variable. Any grouped or categorical variable which
can only take two values is called a binary variable.
To gain a better understanding of how the definitions from the above sections
relate to each other, see Fig. 1.1. Qualitative data is always discrete, but quantitative
data can be both discrete (e.g. size of shoes or a grouped variable) and continuous
(e.g. temperature). Nominal variables are always qualitative and discrete (e.g. colour
of the eye), whereas continuous variables are always quantitative (e.g. temperature).
Categorical variables can be both qualitative (e.g. colour of the eye) and quantitative
(e.g. satisfaction level on a scale from 1 to 5). Categorical variables are never continuous.
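
Grouping can be sketched in R with cut(). The incomes and claim counts below are made up, and the break points between €30,000 and €100,000 are assumptions because the text only lists the first two groups and the last one. Note also that the last interval here includes €100,000 itself.

```r
# Hypothetical yearly incomes in euros
income <- c(12500, 20000, 27300, 45000, 99999, 150000)

# Left-closed, right-open groups [a, b), as in the survey example
breaks <- c(0, 20000, 30000, 50000, 100000, Inf)
income_group <- cut(income, breaks = breaks, right = FALSE, dig.lab = 6)
table(income_group)

# Binary variable: did a customer claim at all? (hypothetical claim counts)
claims  <- c(0, 2, 0, 1, 5)
claimed <- ifelse(claims > 0, "yes", "no")
table(claimed)
```

With right = FALSE, a value that equals a break point, such as 20000, falls into the group that starts at that value.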

Fig. 1.1 Summary of variable classifications

1.3 Data Collection

When collecting data, we may ask ourselves how exactly to go about it and
how much data needs to be collected. The latter question will be partly answered
in Sect. 9.5; but in general, we can think of collecting data either on all subjects of
interest, such as in a national census, or on a representative sample of the population.
Most commonly, we gather data on a sample (described in Part I of this book) and
then draw conclusions about the population of interest (discussed in Part III of
this book). A sample might either be chosen by us or obtained through third parties
(hospitals, government agencies), or created during an experiment. This depends on
the context as described below.
Survey. A survey typically (but not always) collects data by asking questions (in
person or by phone) or providing questionnaires to study participants (as a printout
or online). For example, an opinion poll before a national election provides evidence
about the future government: potential voters are asked by phone which party they are
going to vote for in the next election; on the day of the election, this information can
be updated by asking the same question to a sample of voters who have just cast
their vote at the polling station (a so-called exit poll). A behavioural research survey
may ask members of a community about their knowledge and attitudes towards drug
use. For this purpose, the study coordinators can send people with a questionnaire
to this community and interview members of randomly selected households.
Ideally, a survey is conducted in a way which makes the chosen sample
representative of the population of interest. If a marketing company interviews people in
a pedestrian zone to find their views about a new chocolate bar, then these people
may not be representative of those who will potentially be interested in this product.
Similarly, if students are asked to fill in an online survey to evaluate a lecture, it
may turn out that those who participate are on average less satisfied than those who
do not. Survey sampling is a complex topic on its own. The interested reader may
consult Groves et al. (2009) or Kauermann and Küchenhoff (2011).
Experiment. Experimental data is obtained in “controlled” settings. This can mean
many things, but essentially it is data which is generated by the researcher with full
control over one or many variables of interest. For instance, suppose there are two
competing toothpastes, both of which promise to reduce pain for people with sensitive
teeth. If the researcher decided to randomly assign toothpaste A to half of the study
participants, and toothpaste B to the other half, then this is an experiment because
it is only the researcher who decides which toothpaste is to be used by any of the
participants. It is not decided by the participant. The data of the variable toothpaste
is controlled by the experimenter. Consider another example where the production
time of a product can potentially be reduced by combining two processes. The
management could decide to implement the new process in three production facilities,
but leave it as it is in the other facilities. The production process for the different
units (facilities) is therefore under control of the management. However, if each
facility could decide for themselves if they wanted a change or not, it would not be
an experiment because factors not directly controlled by the management, such as the
leadership style of the facility manager, would determine which process is chosen.
Observational Data. Observational data is data which is collected routinely, without
a researcher designing a survey or conducting an experiment. Suppose a blood sample
is drawn from each patient with a particular acute infection when they arrive at a
hospital. This data may be stored in the hospital’s folders and later accessed by a
researcher who is interested in studying this infection. Or suppose a government
institution monitors where people live and move to. This data can later be used to
explore migration patterns.
Primary and Secondary Data. Primary data is data we collect ourselves, i.e. via a
survey or experiment. Secondary data, in contrast, is collected by someone else. For
example, data from a national census, publicly available databases, previous research
studies, government reports, historical data, and data from the internet, among others,
are secondary data.

1.4 Creating a Data Set

Data is prepared and stored in a standard way for statistical analysis:
the data is stored in a data matrix (= data set) with p columns and n rows
(see Fig. 1.2). Each row corresponds to an observation/unit ω and each column to
a variable X. This means that, for example, the entry in the fourth row and second
column ($x_{42}$) describes the value of the fourth observation on the second variable.
The examples below will illustrate the concept of a data set in more detail.

ω     Variable 1   Variable 2   ···   Variable p
1     x11          x12          ···   x1p
2     x21          x22          ···   x2p
⋮     ⋮            ⋮                  ⋮
n     xn1          xn2          ···   xnp

Fig. 1.2 Data set or data matrix

ω           Music   Mathematics   Biology   Geography
Student A   65      70            85        45
Student B   77      82            80        60
Student C   78      73            93        68
Student D   88      71            63        58
Student E   75      83            63        57

Fig. 1.3 Data set of marks of five students

Example 1.4.1 Suppose five students take examinations in music, mathematics, biol-
ogy, and geography. Their marks, measured on a scale between 0 and 100 (where
100 is the best mark), can be written down as illustrated in Fig. 1.3. Note that each
row refers to a student and each column to a variable. We consider a larger data set
in the next example.
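
As a minimal sketch, the marks from Fig. 1.3 can be entered in R as a data frame. The object name marks is chosen here for illustration only.

```r
# Marks of the five students from Fig. 1.3
marks <- data.frame(
  Music       = c(65, 77, 78, 88, 75),
  Mathematics = c(70, 82, 73, 71, 83),
  Biology     = c(85, 80, 93, 63, 63),
  Geography   = c(45, 60, 68, 58, 57),
  row.names   = paste("Student", LETTERS[1:5])
)

dim(marks)      # n = 5 rows (observations), p = 4 columns (variables)
marks[4, 2]     # x_42: mark of Student D in Mathematics, i.e. 71
```

Indexing with [row, column] follows the same convention as the subscripts in the data matrix of Fig. 1.2.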

Example 1.4.2 Consider the data set described in Appendix A.4. A pizza delivery
service captures information related to each delivery, for example the delivery time,
the temperature of the pizza, the name of the driver, the date of the delivery, the
name of the branch, and many more. To capture the data of all deliveries during one
month, we create a data matrix. Each row refers to a particular delivery, therefore
representing the observations of the data. Each column refers to a variable. In Fig. 1.4,
the variables $X_1$ (delivery time in minutes), $X_2$ (temperature in °C), and $X_{12}$ (name
of branch) are listed.

Delivery   Delivery Time   Temperature   ···   Branch
1          35.1            68.3          ···   East (1)
2          25.2            71.0          ···   East (1)
⋮          ⋮               ⋮                   ⋮
1266       35.7            60.8          ···   West (2)

Fig. 1.4 Pizza data set



Table 1.1 Coding list for branch

Variable   Values    Code
Branch     East      1
           West      2
           Centre    3
           Missing   4

The first row tells us about the features of the first pizza delivery: the delivery
time was 35.1 min, the pizza arrived with a temperature of 68.3 °C, and the pizza
was delivered from the branch in the East of the city. In total, there were n = 1266
deliveries. For nominal variables, such as branch, we may decide to produce a coding
list, as illustrated in Table 1.1: instead of referring to the branches as “East”, “West”,
and “Centre”, we may simply call them 1, 2, and 3. As we will see in Chap. 11, this
has benefits for some analysis methods, though this is not needed in general.
If some values are missing, for example because they were never captured or were
lost, then this requires special attention. In Table 1.1, we assign missing values the
number “4” and therefore treat them as a separate category. If we work with statistical
software (see below), we may need a different coding, such as NA in the statistical
software R or . in Stata. More detail can be found in Appendix A.
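
The coding list of Table 1.1 can be applied in R roughly as follows. This is a sketch with invented codes, not the actual pizza data: the labels are attached with factor(), and the code 4 is deliberately left out of the levels so that R turns it into NA.

```r
# Hypothetical branch codes as they might be typed in; 4 stands for "missing"
branch_code <- c(1, 1, 2, 3, 4, 2)

# Attach the labels of the coding list in Table 1.1; values that are not
# among the levels (here the code 4) become NA
branch <- factor(branch_code, levels = 1:3,
                 labels = c("East", "West", "Centre"))

branch
is.na(branch)                    # TRUE for the observation coded as 4
table(branch, useNA = "ifany")   # frequency table including the missing value
```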

Another consideration when collecting data is that of transformations: we may
have captured the velocity of cars in kilometres per hour, but may need to present
the data in miles per hour; we have captured the temperature in degrees Celsius,
whereas we need to communicate results in degrees Fahrenheit, or we have created a
satisfaction score which we want to range from −5 to +5, while the score currently
runs from 0 to 20. This is not a problem at all. We can simply create a new variable
which reflects the required transformation. However, valid transformations depend
on the scale of a variable. Variables on an interval scale can use transformations of
the following kind:

$$g(x) = a + bx, \qquad b > 0. \tag{1.2}$$

For ratio scales, only the following transformations are valid:

$$g(x) = bx, \qquad b > 0. \tag{1.3}$$

In the above equation, a is set to 0 because ratios only stay the same if we respect a
variable’s natural point of origin.

Example 1.4.3 The temperature in °F relates to the temperature in °C as follows:

Temperature in °F = 32 + 1.8 · Temperature in °C,

which has the form $g(x) = a + bx$ with a = 32 and b = 1.8.
This means that 25 °C relates to (32 + 1.8 · 25) °F = 77 °F. If $X_1$ is a variable
representing temperature in °C, we can simply create a new variable $X_2$ which is
temperature in °F. Since temperature is measured on an interval scale, this
transformation is valid.
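
A transformation of this kind can be sketched in R as follows. The helper function linear_transform() is a hypothetical name introduced here and is not part of R. For the satisfaction score, the constants follow from solving a + b · 0 = −5 and a + b · 20 = 5, which gives a = −5 and b = 0.5.

```r
# g(x) = a + b * x, a linear transformation with b > 0
linear_transform <- function(x, a, b) {
  stopifnot(b > 0)
  a + b * x
}

# Celsius to Fahrenheit (interval scale): a = 32, b = 1.8
celsius    <- c(0, 25, 100)
fahrenheit <- linear_transform(celsius, a = 32, b = 1.8)
fahrenheit

# Rescale a score from [0, 20] to [-5, 5]: a = -5, b = 0.5
score     <- c(0, 5, 10, 20)
score_new <- linear_transform(score, a = -5, b = 0.5)
score_new
```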

Changing currencies is also possible. If we would like to represent the price of a
product not in South African Rand but in €, we simply apply the transformation

Price in € = b · Price in South African Rand,

where b is the currency exchange rate (in € per Rand).

1.4.1 Statistical Software

There are a number of statistical software packages which allow data collection,
management, and, most importantly, analysis. In this book, we focus on the statistical
software R which is freely available at http://cran.r-project.org/. A gentle
introduction to R is provided in Appendix A. A data matrix can be created manually using
commands such as matrix(), data.frame(), and others. Any data can be edited
using edit(). However, typically analysts have already typed their data into
databases or spreadsheets, for example in Excel, Access, or MySQL. In most of these
applications, it is possible to save the data as an ASCII file (.dat), as a tab-delimited
file (.txt), or as a comma-separated values file (.csv). All of these formats allow easy
switching between different software and database applications. Such data can easily
be read into R by means of the following commands:

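setwd('C:/directory')                      # set the working directory
# Store the imported data under a name (pizza is an example name);
# only one of the following three lines is needed, depending on the file type
pizza <- read.table('pizza_delivery.dat')  # ASCII file
pizza <- read.table('pizza_delivery.txt')  # tab-delimited file
pizza <- read.csv('pizza_delivery.csv')    # comma-separated values file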

where setwd specifies the working directory. Alternatively, loading the library
foreign allows the import of data from many different statistical software
packages, notably Stata, SAS, Minitab, SPSS, among others. A detailed description of
data import and export can be found in the respective R manual available at
http://cran.r-project.org/doc/manuals/r-release/R-data.pdf. Once the data is read into R,
it can be viewed with

fix(pizza)   # option 1: data editor (pizza is assumed to hold the imported data)
View(pizza)  # option 2: data viewer

We can also get an overview of the data directly in the R console by displaying
only the top lines of the data with head(). Both approaches are visualized in Fig. 1.5
for the pizza data introduced in Example 1.4.2.
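
The import step can be tried without access to the original files. The sketch below is self-contained: it writes a small made-up excerpt to a temporary .csv file and reads it back in. The column names time, temperature, and branch are assumptions for this illustration and need not match the names in the actual pizza data set.

```r
# Small made-up excerpt in the spirit of the pizza data
pizza_demo <- data.frame(time        = c(35.1, 25.2, 35.7),
                         temperature = c(68.3, 71.0, 60.8),
                         branch      = c("East", "East", "West"))

# Write to a temporary .csv file and read it back in
csv_file <- tempfile(fileext = ".csv")
write.csv(pizza_demo, csv_file, row.names = FALSE)
pizza <- read.csv(csv_file)

head(pizza, n = 2)   # first two rows on the console
str(pizza)           # variable names and types
```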

Fig. 1.5 Viewing data in R

1.5 Key Points and Further Issues

Note:

• The scale of variables is not only a formalism but an essential framework
for choosing the correct analysis methods. This is particularly relevant
for association analysis (Chap. 4), statistical tests (Chap. 10), and linear
regression (Chap. 11).
• Even if variables are measured on a nominal scale (i.e. if they are
categorical/qualitative), we may choose to assign a number to each category
of this variable. This eases the implementation of some analysis methods
introduced later in this book.
• Data is usually stored in a data matrix where the rows represent the
observations and the columns are variables. It can be analysed with
statistical software. We use R (R Core Team 2016) in this book. A
gentle introduction is provided in Appendix A and throughout the book.
A more comprehensive introduction can be found in other books, for
example in Albert and Rizzo (2012), Crawley (2013), or Ligges (2008).
Even advanced books, e.g. Adler (2012) or Everitt and Hothorn (2011),
can offer insights to beginners.

1.6 Exercises

Exercise 1.1 Describe both the population and the observations for the following
research questions:

(a) Evaluation of the satisfaction of employees from an airline.
(b) Description of the marks of students from an assignment.
(c) Comparison of two drugs which deal with high blood pressure.

Exercise 1.2 A national park conducts a study on the behaviour of their leopards.
A few of the park’s leopards are registered and receive a GPS device which allows
measuring the position of the leopard. Use this example to describe the following
concepts: population, sample, observation, value, and variable.

Exercise 1.3 Which of the following variables are qualitative, and which are quan-
titative? Specify which of the quantitative variables are discrete and which are
continuous:

Time to travel to work, shoe size, preferred political party, price for a canteen meal, eye
colour, gender, wavelength of light, customer satisfaction on a scale from 1 to 10, delivery
time for a parcel, blood type, number of goals in a hockey match, height of a child, subject
line of an email.

Exercise 1.4 Identify the scale of the following variables:

(a) Political party voted for in an election
(b) The difficulty of different levels in a computer game
(c) Production time of a car
(d) Age of turtles
(e) Calendar year
(f) Price of a chocolate bar
(g) Identification number of a student
(h) Final ranking at a beauty contest
(i) Intelligence quotient.

Exercise 1.5 Make yourself familiar with the pizza data set from Appendix A.4.

(a) First, browse through the introduction to R in Appendix A. Then, read in the
data.
(b) View the data both in the R data editor and in the R console.
(c) Create a new data matrix which consists of the first 5 rows and first 5 variables
of the data. Print this data set on the R console. Now, save this data set in your
preferred format.
(d) Add a new variable “NewTemperature” to the data set which converts the
temperature from °C to °F.

(e) Attach the data and list the values from the variable “NewTemperature”.
(f) Use “?” to make yourself familiar with the following commands: str, dim,
colnames, names, nrow, ncol, head, and tail. Apply these commands
to the data to get more information about it.

Exercise 1.6 Consider the research questions of describing parents’ attitudes towards
immunization, what proportion of them wants immunization against chicken pox for
their last-born child, and whether this proportion differs by gender and age.

(a) Which data collection method is the most suitable one to answer the above
questions: survey or experiment?
(b) How would you capture the attitudes towards immunization in a single variable?
(c) Which variables are needed to answer all the above questions? Describe the scale
of each of them.
(d) Reflect on what an appropriate data set would look like. Now, given this data
set, try to write down the above research questions as precisely as possible.

→ Solutions to all exercises in this chapter can be found on p. 321


2 Frequency Measures and Graphical Representation of Data

In Chap. 1, we highlighted that different variables contain different levels of
information. When summarizing or visualizing one or more variable(s), it is this information
which determines the appropriate statistical methods to use.
Suppose we are interested in studying the employment opportunities and starting
salaries of university graduates with a master’s degree. Let the variable X denote the
starting salaries measured in €/year. Now suppose 100 graduate students provide
their initial salaries. Let us write down the salary of the first student as $x_1$, the
salary of the second student as $x_2$, and so on. We therefore have 100 observations
$x_1, x_2, \ldots, x_{100}$. How can we summarize those 100 values best to extract meaningful
information from them? The answer to this question depends upon several aspects
like the nature of the recorded data, e.g. how many observations have been obtained
(either small in number or large in number) or how the data was recorded (either
exact values were obtained or the values were obtained in intervals). For example, the
starting salaries may be obtained as exact values, say 51,500 €/year, 32,350 €/year,
etc. Alternatively, these values could have been summarized in categories such as low
income (<30,000 €/year), medium income (30,000–50,000 €/year), high income
(50,000–70,000 €/year), and very high income (>70,000 €/year). Another approach
is to ask whether the students were employed or not after graduating and record the
data in terms of “yes” or “no”. It is evident that the latter classification is less detailed
than the grouped income data which is less detailed than the exact data. Depending on
which conceptualization of “starting salary” we use, we need to choose the approach
to summarize the data, that is the 100 values relating to the 100 graduated students.

2.1 Absolute and Relative Frequencies

Discrete Data. Let us first consider a simple example to illustrate our notation.

© Springer International Publishing Switzerland 2016 17


C. Heumann et al., Introduction to Statistics and Data Analysis,
DOI 10.1007/978-3-319-46162-5_2

Example 2.1.1 Suppose there are ten people in a supermarket queue. Each of them
is either coded as “F” (if the person is female) or “M” (if the person is male). The
collected data may look like
M, F, M, F, M, M, M, F, M, M.
There are now two categories in the data: male (M) and female (F). We use $a_1$ to refer
to the male category and $a_2$ to refer to the female category. Since there are seven male
and three female customers, we have 7 values in category $a_1$, denoted as $n_1 = 7$, and 3
values in category $a_2$, denoted as $n_2 = 3$. The number of observations in a particular
category is called the absolute frequency. It follows that $n_1 = 7$ and $n_2 = 3$ are the
absolute frequencies of $a_1$ and $a_2$, respectively. Note that $n_1 + n_2 = n = 10$, which
is the same as the total number of collected observations. We can also calculate
the relative frequencies of $a_1$ and $a_2$ as $f_1 = f(a_1) = \frac{n_1}{n} = \frac{7}{10} = 0.7 = 70\,\%$ and
$f_2 = f(a_2) = \frac{n_2}{n} = \frac{3}{10} = 0.3 = 30\,\%$, respectively. This gives us information about
the proportions of male and female customers in the queue.
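
In R, absolute and relative frequencies for this example can be obtained with table() and prop.table(). The following is a minimal sketch in which the object names queue, n_j, and f_j are chosen for illustration.

```r
# The ten observations from Example 2.1.1
queue <- c("M", "F", "M", "F", "M", "M", "M", "F", "M", "M")

n_j <- table(queue)          # absolute frequencies
f_j <- prop.table(n_j)       # relative frequencies n_j / n

n_j
f_j
sum(n_j)                     # equals n = 10
sum(f_j)                     # equals 1
```

Note that table() sorts the categories alphabetically, so F is listed before M.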

We now extend these concepts to a general framework for the summary of data
on discrete variables. Suppose there are k categories denoted as $a_1, a_2, \ldots, a_k$
with $n_j$ ($j = 1, 2, \ldots, k$) observations in category $a_j$. The absolute frequency $n_j$ is
defined as the number of units in the jth category $a_j$. The sum of absolute frequencies
equals the total number of units in the data: $\sum_{j=1}^{k} n_j = n$. The relative frequencies
of the jth class are defined as

$$f_j = f(a_j) = \frac{n_j}{n}, \qquad j = 1, 2, \ldots, k. \tag{2.1}$$

The relative frequencies always lie between 0 and 1 and $\sum_{j=1}^{k} f_j = 1$.
Grouped Continuous Data. Data on continuous variables usually has a large number
(k) of different values. Sometimes k may even be the same as n and in such a case
the relative frequencies become $f_j = \frac{1}{n}$ for all j. However, it is possible to define
intervals in which the observed values are contained.

Example 2.1.2 Consider the following n = 20 results of the written part of a driving
licence examination (a maximum of 100 points could be achieved):
28, 35, 42, 90, 70, 56, 75, 66, 30, 89, 75, 64, 81, 69, 55, 83, 72, 68, 73, 16.
We can summarize the results in class intervals such as 0–20, 21–40, 41–60, 61–80,
and 81–100, and the data can be presented as follows:

Class intervals        0–20         21–40        41–60        61–80        81–100
Absolute frequencies   n_1 = 1      n_2 = 3      n_3 = 3      n_4 = 9      n_5 = 4
Relative frequencies   f_1 = 1/20   f_2 = 3/20   f_3 = 3/20   f_4 = 9/20   f_5 = 4/20

We have $\sum_{j=1}^{5} n_j = 20 = n$ and $\sum_{j=1}^{5} f_j = 1$.
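
The class frequencies can be checked in R with cut() and table(). This is a sketch under the assumption that all scores are integers, so that right-closed breaks at 20, 40, 60, and 80 reproduce the classes 0–20, 21–40, and so on. The object names are chosen for illustration.

```r
# Results from Example 2.1.2
results <- c(28, 35, 42, 90, 70, 56, 75, 66, 30, 89,
             75, 64, 81, 69, 55, 83, 72, 68, 73, 16)

# Class intervals; include.lowest = TRUE puts a score of 0 into the first class
classes <- cut(results, breaks = c(0, 20, 40, 60, 80, 100),
               labels = c("0-20", "21-40", "41-60", "61-80", "81-100"),
               include.lowest = TRUE)

n_j <- table(classes)              # absolute frequencies
f_j <- n_j / length(results)       # relative frequencies

data.frame(class    = names(n_j),
           absolute = as.vector(n_j),
           relative = as.vector(f_j))
```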
Random documents with unrelated
content Scribd suggests to you:
Turning our survey to the course of the Danube, we note that
several Magdalenian stations extend into the provinces of Lower
Austria, chief among them being both the open 'loess' station of
Aggsbach, and that of Gobelsburg; there is also the Hundssteig near
Krems, better known as the station of Krems, and the cavern known
as the Gudenushöhle; in the latter station the characteristic bâtons,
javelins, and bone needles have been found.[BB]

Fig. 244. The open loess station of Aggsbach, on the Danube, near Krems. After Obermaier.
The cavern district of Moravia attracted a relatively large population,
and among the numerous stations are the grottos of Kr̆ íz̆ , Žitný,
Kostelík, Bycis̆ kala, Schoschuwka, Balcarovaskala, Kůlna, and
Lautsch. Near the Russian border bone implements like those of
Gudenushöhle on the Danube have been found at the station of
Kůlna, and the industrial stratification of Šipka is very clear. Not far
from Cracow, across the Russian border, the caverns in the region of
Ojcow were entered by men carrying the Magdalenian culture.
Another site in Russia is the grotto of Mas̆ zycka, and characteristic
Magdalenian harpoons, needles, and bâtons de commandement with
other implements have also been found to the eastward, in the
neighborhood of Kiev, in the Ukraine.

Decline of the Magdalenian Culture


The highest point touched by the Crô-Magnon race in the middle or
high Magdalenian appears to correspond broadly with the cold arid
period of climate in the interval between the Bühl and Gschnitz
advances in the Alpine region, during which the steppe mammals
spread widely over southwestern Europe. The saiga antelope, for
example, a highly characteristic steppe type, is represented in one of
the most skilful bone carvings found in the late Magdalenian layers
of Mas d'Azil; also the steppe type of horse is frequently represented
in the most advanced engravings of late Magdalenian times. How far
this cold, relatively dry climate influenced the artistic and creative
energy of the Crô-Magnons is largely a matter of conjecture. The
entirely independent records of La Madeleine, of Schweizersbild, and
of Kesslerloch concur in associating the highest stage of
Magdalenian history of art with the predominance of the steppe
fauna and evidences of a cold dry climate. That the mammoth still
abounded is seen in the mammoth engravings which are superposed
on those of the bison in Font-de-Gaume.

Fig. 245. Front and side views of a saiga antelope carved upon a bone dart-thrower
from the Magdalenian deposits of Mas d'Azil. After Piette.

The succeeding life period is that of the retreat of the tundra and
steppe mammals and of the increasing rarity of the reindeer and of
the mammoth in southwestern Europe; it corresponds broadly with
the returning cold and moist climate of the second Postglacial
advance known in the Alps as the Gschnitz stage. With the spread of
the forests and the retreat to the north of the reindeer, the principal
source both of the supply of food and clothing and of all the bone
implements of industry and of the chase, a new set of life conditions
may have gradually become established. If it is true, as most
students of geographical conditions and of the climate maintain, that
Europe at the same time became more densely forested, the chase
may have become more difficult, and the Crô-Magnons may have
begun to depend more and more upon the life of the streams and
the art of fishing. It is generally agreed that the harpoons were
chiefly used for fishing and that many of the microlithic flints, which
now begin to appear more abundantly, may have been attached to a
shaft for the same purpose. We know that similar microliths were
used as arrow points in predynastic Egypt.
Breuil(35) observes very significant industrial changes in closing
Magdalenian times: first, the beginning of small geometric forms of
flints suggesting the Tardenoisian types; second, the occasional use
of stag horn in place of reindeer horn; third, a modification in the
form of bone implements toward the patterns of Azilian times;
fourth, the rapid decline—one may almost say sudden disappearance
—of the artistic spirit. Schematic and conventional designs begin to
take the place of the free realistic art of the middle Magdalenian.
Thus the decline of the Crô-Magnons as a powerful race may have
been due partly to environmental causes and the abandonment of
their vigorous nomadic mode of life, or it may be that they had
reached the end of a long cycle of psychic development, which we
have traced from the beginning of Aurignacian times. We know as a
parallel that in the history of many civilized races a period of great
artistic and industrial development may be followed by a period of
stagnation and decline without any apparent environmental causes.

Crô-Magnon Descendants in Modern Europe


We might attribute this great change, which affected all of western
Europe, to the extinction of the Crô-Magnon race were it not for the
existing evidence that the race survived throughout the Azilian-
Tardenoisian or close of the Upper Palæolithic. On the close of the
Palæolithic the race broke up throughout western Europe into many
colonies, which can perhaps be traced into Neolithic and even into
recent times. The anatomical evidence for this survival theory chiefly
consists of the highly characteristic form of the head.
In Europe a very broad face and a long, narrow cranium is such an
infrequent combination that anthropologists maintain that it affords
a means of identifying the descendants of the prehistoric Crô-
Magnon race wherever they persist to-day. Since Dordogne was the
geographic centre of the race in Upper Palæolithic times, is it merely
a coincidence that Dordogne is still the centre of a similar type?
Ripley(36) has given us a valuable résumé of our present knowledge
of this subject. The most significant trait of the long-headed people
of Dordogne is that in many cases the face is almost as broad as in
the normal Alpine round-headed type; in other words, it is strongly
disharmonic; in profile the back part of the head rises and in front
view the head is narrowed at the top; the skull is very low-vaulted;
the brow ridges are prominent; the nose is well formed; the cheek-
bones are prominent, and the powerful cheek muscles give a
peculiarly rugged cast to the countenance. The appearance,
however, is not repellent, but more often open and kindly. The men
are of medium height, but very susceptible to environment as
regards stature; they are tall in fertile places, and stunted in less
prosperous districts. They are not degenerate at all, but keen and
alert of mind. The present people of Dordogne agree with but one
other type of men known to anthropologists, namely, the ancient
Crô-Magnon race. The geographical evidence that here in Dordogne
we have to do with the survivors of the real Crô-Magnon race seems
to be sustained by a comparison of the characteristics of the
prehistoric skulls found at Crô-Magnon, Laugerie Basse, and
elsewhere in Dordogne, with the heads of the types of to-day. The
cranial indices of the prehistoric skulls, varying from 70 per cent to
73 per cent, correspond with indices of the living head of 72 per
cent to 75 per cent. None of the people of Dordogne are quite so
long-headed as this, the average index of the living head in an
extreme district being 76 per cent; but within the whole population
there are much lower indices.
The probability of direct descent becomes stronger when we
consider the disharmonic low-skulled shape of the Crô-Magnon head
and the remarkable elongation of the skull at the back. In the
prehistoric Crô-Magnons the brows were strongly developed, the eye
orbits low, the chin prominent. The facial type has been
characterized by de Quatrefages(37) as follows: "The eye depressed
beneath the orbital vault; the nose straight rather than arched; the
lips somewhat thick, the jaw and the cheek-bones strongly
developed, the complexion very brown, the hair very dark and
growing low on the forehead—a whole which, without being
attractive, was in no way repulsive."
In southern France we observe a continuity not only of the head
form but of the prevalence of black hair and eyes. Why should this
Crô-Magnon type have survived at this point and have disappeared
elsewhere? In order to consider the particular cause of this
persistence of a Palæolithic race, we must, with Ripley, broaden our
horizon, and consider the whole southwest from the Mediterranean
to Brittany as a unit.
The survival is partly attributed to favorable geographical
environment and partly to geological and racial barriers. On the
north the intrusion of the Teutonic race was shut off and competition
was narrowed down to the Crô-Magnon and Alpine types.
If the people of Dordogne are veritable survivors of the Crô-
Magnons of the Upper Palæolithic, they certainly represent the
oldest living race in western Europe, and is it not extremely
significant that the most primitive language in Europe, that of the
Basques of the northern Pyrenees, is spoken near by, only 200 miles
to the southwest? Is there possibly a connection between the
original language of the Crô-Magnons, a race which once crowded
the region of the Cantabrian Mountains and the Pyrenees, and the
existing agglutinative language of the Basques, which is totally
different from all the European tongues? This hypothesis, suggested
by Ripley,(38) is very well worth considering, for it is not
inconceivable that the ancestors of the Basques conquered the Crô-
Magnons and subsequently acquired their language.
The prehistoric Crô-Magnon men would seem, therefore, to have
remained in or near their early settlements through all the changes
of time and the vicissitudes of history. "It is, perhaps," observes
Ripley, "the most striking instance known of a persistency of
population unchanged through thousands of years."
The geographic extension of this race was once very much wider
than it is to-day. The classical skull of Engis, Belgium, belongs to this
type. It has been traced from Alsace in the east to the Atlantic in the
west. Ranke asserts that it is to be found to-day in the hills of
Thuringia, and that it was a prevalent type there in the past.
Verneau considers that it was the type prevailing among the extinct
Guanches of the Canary Islands. Collignon(39) has identified it in
northern Africa, and regards the Crô-Magnons as a subvariety of the
Mediterranean race, an opinion consistent at least with the
archæological evidence that this race came into Europe with the
Aurignacian culture, which was circum-Mediterranean in distribution.
Traces of Crô-Magnon head formation are found among the living
Berbers.
At present, however, this race is believed to survive only in a few
isolated localities, namely, in Dordogne, at a small spot in Landes,
near the Garonne in southern France, and at Lannion in Brittany,
where nearly one-third of the population is of the Crô-Magnon type.
It is said to survive on the island of Oléron off the west coast of
France, and there is evidence of similar descent to be found among
the people of the islands of northern Holland. The people of Trysil,
on the Scandinavian peninsula, are characterized as having
disharmonic features, possibly representing an outcrop of the Crô-
Magnon type.
Our interest in the fate of the Crô-Magnons is so great that the
Guanche theory may also be considered; it is known to be favored
by many anthropologists: von Behr, von Luschan, Mehlis, and
especially by Verneau. The Guanches were a race of people who
formerly spread all over the Canary Islands and who preserved their
primitive characteristics even after their conquest by Spain in the
fifteenth century. The differences from the supposed modern Crô-
Magnon type may be mentioned first. The skin of the Guanches is
described by the poet Viana as light-colored, and Verneau considers
that the hair was blond or light chestnut and the eyes blue; the
coloring, however, is somewhat conjectural. The features of
resemblance to the ancient Crô-Magnons are numerous. The
minimum stature of the men was 5 feet 7 inches, and the maximum
6 feet 7 inches; in one locality the average male stature was over 6
feet. The women were comparatively small. The most striking
characters of the head were the fine forehead, the extremely long
skull, and the pentagonal form of the cranium, when seen from
above, caused by the prominence of the parietals—a Crô-Magnon
characteristic. Among the insignia of the chiefs was the arm-bone of
an ancestor; the skull also was carefully preserved. The offensive
weapons in warfare consisted of three stones, a club, and several
knives of obsidian; the defensive weapon was a simple lance. The
Guanches used wooden swords with great skill. The habitation of all
the people was in large, well-sheltered caverns, which honeycombed
the sides of the mountains; all the walls of these caverns were
decorated; the ceilings were covered with a uniform coat of red
ochre, while the walls were decorated with various geometric
designs in red, black, gray, and white. Hollowed-out stones served
as lamps. We may conclude with Verneau that there is evidence,
although not of a very convincing kind, that the Guanches were
related to the Crô-Magnons.(40) His observations on these supposed
Crô-Magnons of the Canary Islands are cited in the Appendix, Note
V. We regret that Verneau in his memoir(41) does not present his
more recent views in regard to the prehistoric distribution of this
great race.

(1) Breuil, 1912.7, p. 203.
(2) Op. cit., p. 205.
(3) James, 1902.1.
(4) Heim, 1894.1, p. 184.
(5) Schmidt, 1912.1, p. 262.
(6) Fraunholz, 1911.1.
(7) Geikie, 1914.1, pp. 25, 26.
(8) Boule, 1899.1.
(9) Breuil, 1912.7, pp. 203-205.
(10) Obermaier, 1912.1, pp. 341, 342.
(11) Martin, R., 1914.1, pp. 15, 16.
(12) Verworn, 1914.1.
(13) Op. cit., p. 646.
(14) Breuil, 1912.7, p. 201.
(15) Lartet, 1875.1.
(16) Breuil, 1912.7, p. 213.
(17) Schmidt, 1912.1, p. 136.
(18) Breuil, op. cit., pp. 216, 217.
(19) Breuil, 1909.3.
(20) Op. cit., p. 410.
(21) Cartailhac, 1906.1, pp. 227, 228.
(22) Rivière, 1897.1; 1897.2.
(23) Reinach, 1913.1.
(24) Breuil, 1912.1, p. 202.
(25) Cartailhac, 1908.1.
(26) Capitan, 1908.1, pp. 501-514.
(27) Ibid., 1910.1, pp. 59-132.
(28) Breuil, 1912.1, pp. 196, 197.
(29) Schmidt, 1912.1, p. 116.
(30) Fraunholz, 1911.1.
(31) Schmidt, 1912.1, p. 154.
(32) Déchelette, 1908.1, vol. I, pp. 191-194.
(33) Nehring, 1880.1; 1896.1.
(34) Bayer, 1912.1, pp. 13-21.
(35) Breuil, 1912.7, pp. 212, 216.
(36) Ripley, 1899.1, pp. 39, 165, 173, 174-179, 211, 406.
(37) Op. cit., p. 176.
(38) Op. cit., p. 181.
(39) Collignon, 1890.1.
(40) Verneau, 1891.1.
(41) Ibid., 1906.1.

CHAPTER VI

CLOSE OF THE OLD STONE AGE—INVASION OF NEW RACES—HISTORY OF THE MAS
D'AZIL, OF FÈRE-EN-TARDENOIS—FOREST ENVIRONMENT AND LIFE—ORIGIN OF
THE AZILIAN-TARDENOISIAN CULTURE—CHARACTERS AND CUSTOMS OF THE NEW
RACES—TRANSITION TO THE NEOLITHIC AND RELATIONS OF THE OLD AND NEW
RACES—APPARENT CHIEF LINES OF HUMAN DESCENT AND OF HUMAN MIGRATION
INTO WESTERN EUROPE.

We have now reached the very close of the Old Stone Age, a period
which is believed to extend between 10,000 and 7,000 years before
the present era. The entrance to the final cultures of the Upper
Palæolithic, known as the Azilian-Tardenoisian, marks a transition
even more abrupt than that witnessed in any preceding stage. It is
not a development; it is a revolution. The artistic spirit entirely
disappears; there is no trace of animal engraving or sculpture;
painting is found only on flattened pebbles or in schematic or
geometric designs on wall surfaces. Of bone implements only
harpoons and polishers remain, and even these are of inferior
workmanship and without any trace of art. The flint industry
continues the degeneration begun in the Magdalenian and exhibits a
new life and impulse only in the fashioning of the extremely small or
microlithic tools and weapons known as 'Tardenoisian.' Both bone
and flint weapons of the chase disappear, yet the stag is hunted and
its horns are used in the manufacture of harpoons. This is the 'Age
of the Stag,' the final stage of the 'Cave Period' in western Europe,
and is subsequent to the 'Age of the Reindeer' in the south.
It would appear as if the very same regions formerly occupied by the
great hunting Crô-Magnon race from Aurignacian to Magdalenian
times were now inhabited by a race or races largely employed in
fishing. The country is thickly forested. The climate is still cold and
extremely moist, and human life everywhere is in the grottos or
entrances to the caverns.

Invasion of Four New Races in Closing Upper Palæolithic Times


How far this revolution is due to the decline of the Crô-Magnon race
and how far to the invasion of one or more new races is very difficult
to determine in the absence of the anatomical evidence derived from
skeletal remains. Two new races had certainly found their way along
the Danube as shown in the burials of Ofnet, in eastern Bavaria; one
is extremely broad-headed and perhaps of central Asiatic origin,
while the other is extremely long-headed and perhaps of southerly
or Mediterranean origin. It is possible that these two races
correspond respectively with the easterly and southerly industrial
influences which are observed in the Azilian-Tardenoisian stage. The
former is the first brachycephalic race to enter western Europe, for it
will be recalled that all the previous races, the Crô-Magnons, the
Brünns, and the Neanderthals, are dolichocephalic. The long-headed
race found at Ofnet is very clearly distinguished from the
disharmonic long-headed Crô-Magnon race by the narrowness of the
face; in other words, it is an harmonic type of head and face, which
may have been Mediterranean in origin, like the so-called
'Mediterranean race' of Sergi.
This fresh invasion of western Europe by two races arriving by one
or more of the great migration routes from the vast Eurasiatic
mainland to the east, races with a relatively high brain development,
is certainly one of the most surprising features of the close of the
Palæolithic Period, for we have long been accustomed to think that
these fresh easterly and southerly invasions began only in Neolithic
times.
As the Upper Palæolithic draws to an end, there is, according to
Breuil, still another industrial influence making itself felt: it comes
from the northeast along the shores of the Baltic.
Putting together all the fragmentary evidence which we possess, we
may regard western Europe at the close of the Old Stone Age as
peopled by four and possibly by five distinct races, as follows:
5. Arriving late in Palæolithic times, a race along the
shores of the Baltic, known only by its Maglemose
industry; possibly a Teutonic race.
4. A south Mediterranean race, known only by its
Tardenoisian industry, migrating along the northern shores
of Africa and spreading over Spain; with a conventional
and schematic art; probably an advance wave of the true
'Mediterranean' race of Sergi; possibly identical with race
3 below. (The same as Race 4, p. 278.)
3. A long-headed race found at Ofnet, in eastern Bavaria;
possibly a branch of the true 'Mediterranean' race 4
above, but not related to the Brünn. (Possibly the same as
Race 4.)
2. The newly arriving Furfooz-Grenelle race, broad-
headed; known along the Danube at Ofnet, in eastern
Bavaria, and northward in Belgium; possibly a branch of
the 'Alpine' race. (The same as Race 5, p. 278.)
1. The surviving Crô-Magnons, in a stage of industrial
decline, pursuing the Azilian industry, probably inhabiting
France and northern Spain.
The broad-headed Ofnet race mentioned above is apparently the
same as the Furfooz-Grenelle race, and may also correspond with
the existing Alpine-Celtic race of western Europe. The long-headed
race of Ofnet may correspond with the existing 'Mediterranean' race
of Sergi.
The presence of the Crô-Magnon race in western Europe during
Azilian-Tardenoisian times is not sustained, so far as we know, by
any anatomical evidence, but is suggested by the mode of burial of
two skeletons found by Piette in the Azilian deposits of the station of
Mas d'Azil. This burial, like that of Ofnet, is typical of Upper
Palæolithic and not of Neolithic times. These skeletons lay in the
'Azilian' layer (VI) described below. As the smaller bones were
missing, Piette concluded that the remains had been for some time
exposed to the weather before burial, and that the larger bones had
been scraped and cleaned with flint knives, and then colored red
with oxide of iron before interment. According to other authorities,
the traces of scraping and cleaning are doubtful; there can be no
question, however, that the separation of the bones of the skeleton
and the use of coloring matter constitute strong evidence that this
Azilian burial was the work of members of the Crô-Magnon race.
In addition to what we have said as to the survival of the Crô-
Magnon race in the preceding chapter, the opinion of Cartailhac(1)
may be cited: "The race of Crô-Magnon is well determined. There is
no doubt about their high stature, and Topinard is not the only one
who believes that they were blonds. We have traced them through
the 'Reindeer Period' into the Neolithic Epoch, where they were
widely distributed and positively related either to the ancient or
actual populations of modern France, being especially characteristic
of our region [France] and of the western Mediterranean. While the
race of Crô-Magnon predominated in the south and in the west, that
of Furfooz predominated in the northeast of France and in Belgium.
These brachycephals were probably brown-haired or of dark
coloring."
But before observing further the characters of these four or five
races, let us examine their industries.

Discovery of the Azilian Type Station


As remarked above, it is believed that these industries prevailed
between 7,000 and 10,000 years before our era, that is, between the
close of Magdalenian times and the beginning of the Neolithic or
New Stone Age. This transition period corresponds with the interval
in which the Azilian-Tardenoisian culture swept all over western
Europe and completely replaced the Magdalenian. From Castillo in
the Cantabrian Mountains of northern Spain to Ofnet on the upper
Danube there is a complete replacement by this new culture. The
Magdalenian culture does not linger anywhere; it is totally
eliminated; the suddenness of the change both in the animal life and
in the industry is nowhere more clearly indicated than at the type
station of Mas d'Azil in southern France, which may now be
described.
In 1887 Edouard Piette commenced his exploration of the deposits in
the great cavern of Mas d'Azil. This station takes its name from the
little hamlet of Mas d'Azil in the foot-hills of the Pyrenees about forty
miles southwest from Toulouse. Here the River Arize winds for a
quarter of a mile through a lofty natural tunnel traversed by the
highway from St. Girons to Carcassonne. A rich layer of Magdalenian
deposits first attracted Piette's attention, and he found here some of
the finest examples of late Magdalenian art, but above these
deposits he discovered a hitherto unrecognized industrial stage, to
which he gave the name Azilian. The Azilian layers yielded over one
thousand specimens of flattened and double-barbed harpoons made
of the horns of the stag, thus widely differing from the late
Magdalenian harpoons which are rounded and made of the horns of
the reindeer. The entire succession of deposits, as explored by
Piette, is an epitome of the prehistory of Europe from early
Magdalenian times to the Age of Bronze, and should be compared
with the successive deposits of Castillo (p. 164), Sirgenstein (p.
202), Ofnet (p. 476), and Schweizersbild (p. 447).

Fig. 246. Western entrance to the great station of Mas d'Azil. "Here
the River Arize winds for a quarter of a mile through a lofty natural
tunnel traversed by the highway from St. Girons to Carcassonne."
Photograph by N. C. Nelson.

The Mas d'Azil section is as follows:
Prehistoric and Neolithic
IX. Iron implements, pottery of the Gauls. At the top
Gallo-Roman remains, glass and glazed pottery.
VIII. Middle Neolithic and Age of Bronze; layer of pottery,
polished stone implements, traces of copper and of
bronze.
VII. Dawn of the Neolithic. Fauna includes the horse, urus,
stag, and wild boar. Chipped and polished flints, awls and
polishers in bone; harpoons rare. Beginnings of pottery.
Upper Palæolithic
VI. Azilian, red archæological layer, masses of peroxide of
iron. Extremely moist climate. Broad flat harpoons of stag
horn perforated at the base, numerous flattened and
painted pebbles (galets), flints of degenerate Magdalenian
form, especially small rounded planers and knife flakes,
awls and polishers in bone. No trace of reindeer in the
fire-hearths; stag abundant, also roe-deer and brown
bear; wild boar, wild cattle, beaver, a variety of birds. No
trace of polished stone implements. Interred in this layer,
beneath the deposits of streaked cinders and quite
undisturbed, two human skeletons were found, which
Piette believed had been macerated with flints and then
colored red with peroxide of iron.
V. Sterile finely stratified loam layer, a flood deposit of the
River Arize.
IV. Late Magdalenian culture layer; twelve double-rowed
harpoons made of reindeer horn, a few fashioned from
stag horn; numerous engravings and sculptures in bone.
Remains of the reindeer rare in the hearths; those of the
royal stag (Cervus elaphus) abundant.
III. A sterile flood deposit of the River Arize.
II. Middle and Early Magdalenian culture layers, with barbed
harpoons of reindeer horn; flint implements of early
Magdalenian type, bone needles. Bones of the reindeer
abundant.
I. Gravel deposits. Interspersed fire-hearths.
The total thickness of these culture deposits is 8.03 m., or 26 feet 4
inches. The Azilian type layer (VI) containing flat harpoons of stag
horn and painted pebbles, intercalated between the deposits of the
Reindeer Age and the Neolithic layers, is, on account of its
stratigraphic position, the most interesting and instructive of all the
sites representing this phase of transition; and Piette was fully
justified in giving to the corresponding culture period the name of
Azilian.(2)
The transformation of art and industry, indicated in the Azilian
culture layer, is as decided as that in the animal life. We observe in
this layer no trace of the animal engravings or sculptures which
occur so abundantly in the late Magdalenian layer below; the use of
pigments is confined to the paintings of schematic or geometric
figures on the flattened pebbles. There is no suggestion of art in any
of the bone implements, and the harpoons of stag horn are rudely
fashioned; this type of harpoon appears to be the chief survivor of
the rich variety of implements noted in the Magdalenian layer below.
The stag horn harpoon, moreover, is fashioned with far less skill than
the beautiful Magdalenian harpoons; like them it has two rows of
barbs, but they are not cut with the same delicacy and exactness. As
to the form of the new model, it is explained by the nature of the
new material; the interior of the stag horn being composed of a
spongy tissue, could not be utilized as could the harder and more
compact interior of the reindeer horn; the craftsman, therefore, was
obliged to fashion his harpoon out of the exterior of one side of the
stag horn, and in consequence to make it flat.

Fig. 247. Typical Azilian harpoons of stag horn. After de Mortillet.
287. A single-rowed harpoon from Mas d'Azil. 288. Harpoon with
perforated base from the shelter of La Tourasse, Haute-Garonne. 289.
Double-rowed harpoon from the same shelter. 290. A similar harpoon
with the barbs alternate instead of opposite, from Mas d'Azil. 291.
Harpoon with triangular base and round perforation from the Grotte de
la Vache, near Tarascon. All one-third actual size, except 291, which
is four-ninths actual size.

There are no bone needles, no javelins or sagaies; nor are there any
of the beautifully carved weapons of bone. There is also a reduction
in the uses to which the split bones are put, such as the large lissoirs
or polishers. The bone implements appear to be derived from an
impoverished late Aurignacian stage; the same is true of the flint
implements, for we observe a return of the keeled scraper (grattoir
caréné). There is also a return of certain types of graving tools and
of the knife-like form of the flake; even some of the small geometric
types of flints resemble those of the Aurignacian levels.
The many shells of the moisture-loving snail Helix nemoralis, found
in the fire-hearths of Mas d'Azil are proofs of the humidity of the
climate, a fact confirmed by the contemporary flood deposits of the
Arize. The frequent and heavy rains drove the last few
representatives of the steppe fauna away to the north. These
climatic conditions favored the formation of peat-bogs, so frequent
to-day in the north of France, and also the growth of vast forests,
inhabited by the stag, which extended over the whole country.
The pebbles of Mas d'Azil are painted on one side with peroxide of
iron, a deposit of which is found in the neighborhood of the cave.
The color, mixed in shells of Pecten, or in hollowed pebbles or on flat
stones, was applied either with the finger or with a brush. The many
enigmatic designs consist chiefly of parallel bands, rows of discs or
points, bands with scalloped edges, cruciform designs, ladder-like
patterns (scalariform) such as are found in the 'Azilian' engravings
and paintings of the caverns, and undulating lines. These graphic
combinations resemble certain syllabic and alphabetic characters of
the Ægean, Cypriote, Phœnician, and Greco-Latin inscriptions.
However curious these resemblances may be, they are not sufficient
to warrant any theory connecting the signs on the painted pebbles
of the Azilians with the alphabetic characters of the oldest known
systems of writing.(3) Piette attempted to explain some of the
exceedingly crude designs on these pebbles as a system of notation,
others as pictographs and religious symbols, and some few as
genuine alphabetical signs, and suggested that the cavern of Mas
d'Azil was an Upper Palæolithic school where reading, reckoning,
writing, and the symbols of the sun were learned and taught. The
very wide distribution of these symbolic pebbles and the painting of
similar designs on the walls of the caverns certainly prove that they
had some religious or economic significance, which may be revealed
by subsequent research.

Fig. 248. Azilian galets coloriés, flat, painted pebbles, from the
type station of Mas d'Azil. After Piette.

The Tardenoisian Type Station


Turning from the region of the Pyrenees in Azilian times, we observe
the region lying between the Seine and the Meuse in northern
France as the scene of a contemporary industry. At the station of
Fère-en-Tardenois, in the Department of the Aisne, is found an
especially large number of the pygmy flints;(4) these present various
geometric forms, including the primitive triangular, as well as the
rhomboidal, trapezoidal, and semicircular; together, they were
designated by de Mortillet as Tardenoisian flints, and in 1896, in
monographing this microlithic flint industry, he traced them
throughout France, Belgium, England, Portugal, Spain, Italy,
Germany, and Russia, also along the southern Mediterranean
through Algiers, Tunis, Egypt, and eastward into Syria and even
India.
These geometric flints were at first attributed to a primitive invasion
which was supposed to have occurred at the beginning of Neolithic
times; thus the Tardenoisian industry was considered as
contemporaneous with that of the Campignian, which is early
Neolithic. It was further observed that the topographical location of
the stations closely followed the borders of ocean inlets, or of river
courses, and when the food materials found in the hearths were
compared, it appeared that these flints were used principally by
fishermen or tribes subsisting upon fish. From an examination of the
flints, it would appear that a very large number of them were
adapted for insertion in small harpoons, or that those of grooved
form might even have been used as fish-hooks. Thus the picture was
drawn of a population of fishermen. The Tardenoisian, therefore,
was for a long time regarded as contemporaneous with the early
Neolithic rather than with the close of Palæolithic times, but as
exploration proceeded it was found that neither the remains of
domestic animals nor any traces of pottery occur in any of these
Tardenoisian deposits, which consequently have nothing in common
with the true Neolithic culture.
The problem was finally solved in 1909, when the grotto of Valle
near Gibaja, Santander, in northern Spain, was discovered by Breuil
and Obermaier.(5) Here was a classic Azilian deposit containing all
the well-known Azilian types of bone implements, such as fine
harpoons, carvings in deer horn, bone javelins, polishers of deer
bone, flint flakes resembling those of the late Magdalenian, also
microlithic flints of typical geometric Tardenoisian form. This
discovery established the fact that the lower levels of the
Tardenoisian industry were not really to be distinguished from the
Azilian, for here beneath layers with painted pebbles and harpoons
of Azilian style were harpoons with single and double rows of barbs
of Magdalenian pattern, but cut in stag horn instead of reindeer
horn.
The mammalian life in this true Azilian-Tardenoisian layer includes
the chamois, roe-deer, wild boar, and urus, or wild cattle. In a layer
just below, which represents the close of the Magdalenian industrial
period, there are found, although rarely, remains of the reindeer, an
animal hitherto unknown in this part of Spain, also the wild boar, the
bison, the ibex, and the lynx. After this discovery it could no longer
be questioned that the Azilian and Tardenoisian were contemporary.
As to the relation of these two industries, Breuil remarks(6) that the
prolongation of the Tardenoisian types of flints is observed in Italy
and in Belgium, but neither the term 'Tardenoisian' nor the term
'Azilian' is sufficiently comprehensive to embrace the totality of these
little industries, which will finally be distinguished clearly from each
other. Of the two the Azilian represents the prolongation of an
ancient period of industry, the progress of which was apparently
from south to north, as we can trace the distribution of the
characteristic flat harpoons of deer horn from the Cantabrian
Mountains and the Pyrenees, through southern and central France,
to Belgium, England, and the western coast of Scotland. The later
industrial phase, the Tardenoisian, with its geometric trapeziform
flints, originally appears along the southern Mediterranean in Tunis
and to the eastward in the Crimea, while in France it represents a
final phase of the Palæolithic, closely approaching the period of the
earliest Neolithic or pre-Campignian hearths common along the
Danube and observed in the vicinity of Liége. Thus the most
comprehensive term by which to designate the ensemble of these
implements, in Europe at least, would be Azilian-Tardenoisian.
Fig. 249. Small geometric flints characteristic of the Tardenoisian
industry. After de Mortillet. 295 to 303, 321, 322, 326. From various
sites in northern France. 311. Uchaux, Vaucluse, France. 305, 315,
320. Valley of the Meuse, Belgium. 312, 313. Cabeço da Arruda,
Portugal. 304, 314. Italy. 317, 318, 329. Tunis. 325. Egypt. 306, 310,
324, 328. Kizil-Koba, Crimea. 307 to 309, 316, 319, 323, 327. India.
All one-half actual size.

Environment and Mammalian Life


It appears that the chief geographic change during this period was a
subsidence of the northern coasts of Europe and an advance of the
sea causing the circulation of warm oceanic currents and a more
humid climate favorable to reforestation.
To the north, in Belgium, the tundra fauna lingered during the
extension of the early Tardenoisian industry, for here we still find
remains of the reindeer, the arctic fox, and the arctic hare mingled in
the fire-hearths with flints of Tardenoisian type. This, observes
Obermaier, constitutes proof that the Tardenoisian, with the Azilian,
must be placed at the very close of Postglacial time and with the
final stage of Upper Palæolithic industry.
To the south, in the region of Dordogne and the Pyrenees, the
tundra fauna had entirely disappeared, as well as that of the steppes
and of the alpine heights; the prevailing animal in the forests is the
royal stag, adapted to forests of temperate type and associated with
the Eurasiatic forest and meadow fauna which now dominated
western Europe.
The only survivor of the great African-Asiatic fauna is the lion, which
appears in the late Palæolithic stations in the region of the Pyrenees;
the arctic wolverene also gives the fauna a Postglacial aspect, for,
like the lion, it is never found in central or western Europe after the
close of Upper Palæolithic times. Other enemies of the herbivorous
fauna were the wolf and the brown bear.
Besides the red deer, or stag, the forests at this time were filled with
roe-deer. To the south in the Pyrenees the moose still survived, and
to the north there were still found herds of reindeer which survived
in central Europe as late as the twelfth century. Wild boars were
numerous, and in the streams were found the beaver and the otter.
In the forest borders and in the meadows hares and rabbits were
abundant. Through the forests and meadows of southern France and
along the borders of the Danube ranged the wild cattle (Bos
primigenius). It would appear from our limited knowledge of the life
of Azilian-Tardenoisian times that bison were found chiefly in the
northern parts of Europe. There is little direct evidence in regard to
the wild horse, the remains of which do not occur in the hearths of
Azilian times.
Our knowledge of the life of the Spanish peninsula at a period
closely succeeding this is indirectly derived from the animal frescos
in certain caverns of northern Spain, which were formerly attributed
to the Upper Palæolithic but are now referred rather to the early
Neolithic. Here are found representations of the ibex, the stag, the
fallow deer, the wild cattle, and also of the wild horses. This would
indicate that wild horses were still roaming all over western Europe
at the close of Upper Palæolithic times. The presence of the moose
in late Palæolithic times at Alpera, on the high plateaus of Spain, has
been determined; this animal has also been found in the Pyrenees
during the Azilian stage.(7)
The great contrast between the mammalian life of Magdalenian and
that of Azilian-Tardenoisian times is witnessed in the stations along
the upper Danube, as described by Koken.(8) In Höhlefels,
Schmiechenfels, and Propstfels, associated with implements of the
late Magdalenian industry, are found ten types of animals belonging
to the forests and four characteristic of the forests and meadows, or
fourteen species altogether. With these are mingled two alpine
forms, the ibex and the alpine shrew; also two types of mammals
belonging to the steppes, and no less than six mammals and birds
from the tundras, namely, the reindeer, the arctic fox, the ermine,
the arctic hare, the banded lemming, and the arctic ptarmigan.
In wide contrast to this assemblage of late Magdalenian life on the
upper Danube, there appear in Azilian times along the shores of the
middle Danube in the stations of Ofnet and of Istein the following
characteristic forest forms: Sus scrofa ferus (wild boar), Cervus
elaphus (stag), Capreolus capreolus (roe-deer), Bos (?) primigenius
(urus), Lepus (rabbit or hare), Ursus arctos (brown bear), Felis leo
(lion), Gulo luscus (common wolverene), Lynchus lynx (lynx), Vulpes
(fox), Mustela martes (marten), Castor fiber (European beaver), Mus
(field-mouse), Turdus (thrush). It thus appears that the alpine, the
steppe, and the tundra faunæ had entirely disappeared from this
region.

Origin and Distribution of the Azilian-Tardenoisian Industry


This industry represents the last stage of the Old Stone Age. The
decline in the art of fashioning flints, begun in Magdalenian times,
appears to continue in the Azilian-Tardenoisian. As to the tiny
symmetrical flints which are characteristic of this period, among the
microliths of almost all the late Magdalenian stations pre-
Tardenoisian forms are found which may be regarded as prototypes
of the geometric Tardenoisian flints;(9) this represents a new fashion
established in flint-making under influences coming from the south.
There was also a natural or local Azilian evolution from the
Magdalenian types and technique. In general the flint implements
which had so long prevailed in western Europe become smaller in
diameter and more carelessly retouched, showing marked
deterioration even from the late Magdalenian stages. For the
preparation of hides and the fashioning of bone we discover
unsymmetrical planing tools (grattoirs), also small, well-formed oval
scrapers (racloirs), and microlithic scrapers. Borers (perçoirs) with
oblique ends and gravers (burins) made of small flakes are the types
of implements which most frequently occur, but the great variety of
borers, so characteristic of the Aurignacian and the Magdalenian
industries, had entirely disappeared in Azilian times.
The marks of industrial degeneration are also conspicuous in the
bone implements, which show a very great deterioration in number
and quality as compared with the Magdalenian, and which are
principally confined to three types—the harpoons, the awls
(poinçons), and the smoothers (lissoirs), together with very small
bone borers (perçoirs). The distinctive feature of the Azilian bone
industry is the flat harpoon of stag horn; it is known that the use of
stags' antlers for fashioning harpoons began in the late Magdalenian,
when most of them were still being fashioned from reindeer horn.
These flat Azilian harpoons succeed the type of the double-rowed,
cylindrical harpoons of the late Magdalenian, and are found mainly
where the rivers, lakes, or pools offered favorable conditions for
fishing. Thus the Azilian bone-harpoon industry, like the Tardenoisian
microlithic flint industry, was largely pursued by fisherfolk.