CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC
MATHEMATICS

Editorial Board:

R. Gill, Department of Mathematics, Utrecht University


B.D. Ripley, Department of Statistics, University of Oxford
S. Ross, Department of Industrial Engineering, University of California, Berkeley
M. Stein, Department of Statistics, University of Chicago
D. Williams, School of Mathematical Sciences, University of Bath

This series of high-quality upper-division textbooks and expository monographs covers
all aspects of stochastic applicable mathematics. The topics range from pure and applied
statistics to probability theory, operations research, optimization, and mathematical
programming. The books contain clear presentations of new developments in the field and
also of the state of the art in classical methods. While emphasizing rigorous treatment of
theoretical methods, the books also contain applications and discussions of new techniques
made possible by advances in computational practice.

Already published
1. Bootstrap Methods and Their Application, A.C. Davison and D.V. Hinkley
2. Markov Chains, J. Norris
3. Asymptotic Statistics, A.W. van der Vaart
4. Wavelet Methods for Time Series Analysis, D.B. Percival and A.T. Walden
5. Bayesian Methods, T. Leonard and J.S.J. Hsu
6. Empirical Processes in M-Estimation, S. van de Geer
7. Numerical Methods of Statistics, J. Monahan
8. A User's Guide to Measure-Theoretic Probability, D. Pollard
9. The Estimation and Tracking of Frequency, B.G. Quinn and E.J. Hannan
Data Analysis and Graphics
Using R - an Example-based Approach
John Maindonald
Centre for Bioinformation Science, John Curtin School of Medical Research
and Mathematical Sciences Institute, Australian National University

and
John Braun
Department of Statistical and Actuarial Science University of Western Ontario

CAMBRIDGE
UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa

© Cambridge University Press 2003

This book is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.

First published 2003

Reprinted 2004

Printed in the United States of America

Typeface Times 10/13 pt System LaTeX 2ε [TB]

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication data


Maindonald, J. H. (John Hilary), 1937-
Data analysis and graphics using R : an example-based approach / John Maindonald and John Braun.
p. cm. - (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references and index.
ISBN 0 521 81336 0
1. Statistics - Data processing. 2. Statistics - Graphic methods - Data processing. 3. R (Computer program
language) I. Braun, John, 1963- II. Title. III. Cambridge series on statistical and probabilistic mathematics.
QA276.4.M245 2003
519.5'0285-dc21 2002031560

ISBN 0 521 81336 0 hardback


It is easy to lie with statistics. It is hard to tell the truth without statistics.
[Andrejs Dunkels]

. . . technology tends to overwhelm common sense.

[D. A. Freedman]
Contents

Preface

A Chapter by Chapter Summary

1 A Brief Introduction to R
1.1 A Short R Session
1.1.1 R must be installed!
1.1.2 Using the console (or command line) window
1.1.3 Reading data from a file
1.1.4 Entry of data at the command line
1.1.5 Online help
1.1.6 Quitting R
1.2 The Uses of R
1.3 The R Language
1.3.1 R objects
1.3.2 Retaining objects between sessions
1.4 Vectors in R
1.4.1 Concatenation - joining vector objects
1.4.2 Subsets of vectors
1.4.3 Patterned data
1.4.4 Missing values
1.4.5 Factors
1.5 Data Frames
1.5.1 Variable names
1.5.2 Applying a function to the columns of a data frame
1.5.3* Data frames and matrices
1.5.4 Identification of rows that include missing values
1.6 R Packages
1.6.1 Data sets that accompany R packages
1.7 Looping
1.8 R Graphics
1.8.1 The function plot() and allied functions
1.8.2 Identification and location on the figure region
1.8.3 Plotting mathematical symbols
1.8.4 Row by column layouts of plots
1.8.5 Graphs - additional notes
1.9 Additional Points on the Use of R in This Book
1.10 Further Reading
1.11 Exercises
2 Styles of Data Analysis
2.1 Revealing Views of the Data
2.1.1 Views of a single sample
2.1.2 Patterns in grouped data
2.1.3 Patterns in bivariate data - the scatterplot
2.1.4* Multiple variables and times
2.1.5 Lattice (trellis style) graphics
2.1.6 What to look for in plots
2.2 Data Summary
2.2.1 Mean and median
2.2.2 Standard deviation and inter-quartile range
2.2.3 Correlation
2.3 Statistical Analysis Strategies
2.3.1 Helpful and unhelpful questions
2.3.2 Planning the formal analysis
2.3.3 Changes to the intended plan of analysis
2.4 Recap
2.5 Further Reading
2.6 Exercises
3 Statistical Models
3.1 Regularities
3.1.1 Mathematical models
3.1.2 Models that include a random component
3.1.3 Smooth and rough
3.1.4 The construction and use of models
3.1.5 Model formulae
3.2 Distributions: Models for the Random Component
3.2.1 Discrete distributions
3.2.2 Continuous distributions
3.3 The Uses of Random Numbers
3.3.1 Simulation
3.3.2 Sampling from populations
3.4 Model Assumptions
3.4.1 Random sampling assumptions - independence
3.4.2 Checks for normality
3.4.3 Checking other model assumptions
3.4.4 Are non-parametric methods the answer?
3.4.5 Why models matter - adding across contingency
tables
3.5 Recap
3.6 Further Reading
3.7 Exercises
4 An Introduction to Formal Inference
4.1 Standard Errors
4.1.1 Population parameters and sample statistics
4.1.2 Assessing accuracy - the standard error
4.1.3 Standard errors for differences of means
4.1.4* The standard error of the median
4.1.5* Resampling to estimate standard errors: bootstrapping
4.2 Calculations Involving Standard Errors: the t-Distribution
4.3 Confidence Intervals and Hypothesis Tests
4.3.1 One- and two-sample intervals and tests for means
4.3.2 Confidence intervals and tests for proportions
4.3.3 Confidence intervals for the correlation
4.4 Contingency Tables
4.4.1 Rare and endangered plant species
4.4.2 Additional notes
4.5 One-Way Unstructured Comparisons
4.5.1 Displaying means for the one-way layout
4.5.2 Multiple comparisons
4.5.3 Data with a two-way structure
4.5.4 Presentation issues
4.6 Response Curves
4.7 Data with a Nested Variation Structure
4.7.1 Degrees of freedom considerations
4.7.2 General multi-way analysis of variance designs
4.8* Resampling Methods for Tests and Confidence Intervals
4.8.1 The one-sample permutation test
4.8.2 The two-sample permutation test
4.8.3 Bootstrap estimates of confidence intervals
4.9 Further Comments on Formal Inference
4.9.1 Confidence intervals versus hypothesis tests
4.9.2 If there is strong prior information, use it!
4.10 Recap
4.11 Further Reading
4.12 Exercises
5 Regression with a Single Predictor
5.1 Fitting a Line to Data
5.1.1 Lawn roller example
5.1.2 Calculating fitted values and residuals
5.1.3 Residual plots
5.1.4 The analysis of variance table
5.2 Outliers, Influence and Robust Regression
5.3 Standard Errors and Confidence Intervals
5.3.1 Confidence intervals and tests for the slope
5.3.2 SEs and confidence intervals for predicted values
5.3.3* Implications for design
5.4 Regression versus Qualitative ANOVA Comparisons
5.5 Assessing Predictive Accuracy
5.5.1 Training/test sets, and cross-validation
5.5.2 Cross-validation - an example
5.5.3* Bootstrapping
5.6 A Note on Power Transformations
5.7 Size and Shape Data
5.7.1 Allometric growth
5.7.2 There are two regression lines!
5.8 The Model Matrix in Regression
5.9 Recap
5.10 Methodological References
5.11 Exercises

6 Multiple Linear Regression


6.1 Basic Ideas: Book Weight and Brain Weight Examples
6.1.1 Omission of the intercept term
6.1.2 Diagnostic plots
6.1.3 Further investigation of influential points
6.1.4 Example: brain weight
6.2 Multiple Regression Assumptions and Diagnostics
6.2.1 Influential outliers and Cook's distance
6.2.2 Component plus residual plots
6.2.3* Further types of diagnostic plot
6.2.4 Robust and resistant methods
6.3 A Strategy for Fitting Multiple Regression Models
6.3.1 Preliminaries
6.3.2 Model fitting
6.3.3 An example - the Scottish hill race data
6.4 Measures for the Comparison of Regression Models
6.4.1 R² and adjusted R²
6.4.2 AIC and related statistics
6.4.3 How accurately does the equation predict?
6.4.4 An external assessment of predictive accuracy
6.5 Interpreting Regression Coefficients - the Labor Training Data
6.6 Problems with Many Explanatory Variables
6.6.1 Variable selection issues
6.6.2 Principal components summaries
6.7 Multicollinearity
6.7.1 A contrived example
6.7.2 The variance inflation factor (VIF)
6.7.3 Remedying multicollinearity
6.8 Multiple Regression Models - Additional Points
6.8.1 Confusion between explanatory and dependent variables
6.8.2 Missing explanatory variables
6.8.3* The use of transformations
6.8.4* Non-linear methods - an alternative to transformation?
6.9 Further Reading
6.10 Exercises
7 Exploiting the Linear Model Framework
7.1 Levels of a Factor - Using Indicator Variables
7.1.1 Example - sugar weight
7.1.2 Different choices for the model matrix when there are
factors
7.2 Polynomial Regression
7.2.1 Issues in the choice of model
7.3 Fitting Multiple Lines
7.4* Methods for Passing Smooth Curves through Data
7.4.1 Scatterplot smoothing - regression splines
7.4.2 Other smoothing methods
7.4.3 Generalized additive models
7.5 Smoothing Terms in Multiple Linear Models
7.6 Further Reading
7.7 Exercises
8 Logistic Regression and Other Generalized Linear Models
8.1 Generalized Linear Models
8.1.1 Transformation of the expected value on the left
8.1.2 Noise terms need not be normal
8.1.3 Log odds in contingency tables
8.1.4 Logistic regression with a continuous explanatory variable
8.2 Logistic Multiple Regression
8.2.1 A plot of contributions of explanatory variables
8.2.2 Cross-validation estimates of predictive accuracy
8.3 Logistic Models for Categorical Data - an Example
8.4 Poisson and Quasi-Poisson Regression
8.4.1 Data on aberrant crypt foci
8.4.2 Moth habitat example
8.4.3* Residuals, and estimating the dispersion
8.5 Ordinal Regression Models
8.5.1 Exploratory analysis
8.5.2* Proportional odds logistic regression
8.6 Other Related Models
8.6.1* Loglinear models
8.6.2 Survival analysis
8.7 Transformations for Count Data
8.8 Further Reading
8.9 Exercises
9 Multi-level Models, Time Series and Repeated Measures
9.1 Introduction
9.2 Example - Survey Data, with Clustering
9.2.1 Alternative models
9.2.2 Instructive, though faulty, analyses
9.2.3 Predictive accuracy
9.3 A Multi-level Experimental Design
9.3.1 The ANOVA table
9.3.2 Expected values of mean squares
9.3.3* The sums of squares breakdown
9.3.4 The variance components
9.3.5 The mixed model analysis
9.3.6 Predictive accuracy
9.3.7 Different sources of variance - complication or focus
of interest?
9.4 Within and between Subject Effects - an Example
9.5 Time Series - Some Basic Ideas
9.5.1 Preliminary graphical explorations
9.5.2 The autocorrelation function
9.5.3 Autoregressive (AR) models
9.5.4* Autoregressive moving average (ARMA) models - theory
9.6* Regression Modeling with Moving Average Errors - an Example
9.7 Repeated Measures in Time - Notes on the Methodology
9.7.1 The theory of repeated measures modeling
9.7.2 Correlation structure
9.7.3 Different approaches to repeated measures analysis
9.8 Further Notes on Multi-level Modeling
9.8.1 An historical perspective on multi-level models
9.8.2 Meta-analysis
9.9 Further Reading
9.10 Exercises

10 Tree-based Classification and Regression


10.1 The Uses of Tree-based Methods
10.1.1 Problems for which tree-based regression may be used
10.1.2 Tree-based regression versus parametric approaches
10.1.3 Summary of pluses and minuses
10.2 Detecting Email Spam - an Example
10.2.1 Choosing the number of splits
10.3 Terminology and Methodology
10.3.1 Choosing the split - regression trees
10.3.2 Within and between sums of squares
10.3.3 Choosing the split - classification trees
10.3.4 The mechanics of tree-based regression - a trivial
example
10.4 Assessments of Predictive Accuracy
10.4.1 Cross-validation
10.4.2 The training/test set methodology
10.4.3 Predicting the future
10.5 A Strategy for Choosing the Optimal Tree
10.5.1 Cost-complexity pruning
10.5.2 Prediction error versus tree size
10.6 Detecting Email Spam - the Optimal Tree
10.6.1 The one-standard-deviation rule
10.7 Interpretation and Presentation of the rpart Output
10.7.1 Data for female heart attack patients
10.7.2 Printed Information on Each Split
10.8 Additional Notes
10.9 Further Reading
10.10 Exercises
11 Multivariate Data Exploration and Discrimination
11.1 Multivariate Exploratory Data Analysis
11.1.1 Scatterplot matrices
11.1.2 Principal components analysis
11.2 Discriminant Analysis
11.2.1 Example - plant architecture
11.2.2 Classical Fisherian discriminant analysis
11.2.3 Logistic discriminant analysis
11.2.4 An example with more than two groups
11.3 Principal Component Scores in Regression
11.4* Propensity Scores in Regression Comparisons - Labor Training Data
11.5 Further Reading
11.6 Exercises
12 The R System - Additional Topics
12.1 Graphs in R
12.2 Functions - Some Further Details
12.2.1 Common useful functions
12.2.2 User-written R functions
12.2.3 Functions for working with dates
12.3 Data input and output
12.3.1 Input
12.3.2 Data output
12.4 Factors - Additional Comments
12.5 Missing Values
12.6 Lists and Data Frames
12.6.1 Data frames as lists
12.6.2 Reshaping data frames; reshape()
12.6.3 Joining data frames and vectors - cbind()
12.6.4 Conversion of tables and arrays into data frames
12.6.5* Merging data frames - merge()
12.6.6 The function sapply() and related functions
12.6.7 Splitting vectors and data frames into lists - split()
12.7 Matrices and Arrays
12.7.1 Outer products
12.7.2 Arrays
12.8 Classes and Methods
12.8.1 Printing and summarizing model objects
12.8.2 Extracting information from model objects
12.9 Databases and Environments
12.9.1 Workspace management
12.9.2 Function environments, and lazy evaluation
12.10 Manipulation of Language Constructs
12.11 Further Reading
12.12 Exercises
Epilogue - Models

Appendix - S-PLUS Differences

References
Index of R Symbols and Functions
Index of Terms
Index of Names
Preface

This book is an exposition of statistical methodology that focuses on ideas and concepts,
and makes extensive use of graphical presentation. It avoids, as much as possible, the use
of mathematical symbolism. It is particularly aimed at scientists who wish to do statis-
tical analyses on their own data, preferably with reference as necessary to professional
statistical advice. It is intended to complement more mathematically oriented accounts of
statistical methodology. It may be used to give students with a more specialist statistical
interest exposure to practical data analysis.
The authors can claim, between them, 40 years of experience in working with researchers
from many different backgrounds. Initial drafts of the monograph were constructed from
notes that the first author prepared for courses for researchers, first of all at the University of
Newcastle (Australia) over 1996-1997, and greatly developed and extended in the course
of work in the Statistical Consulting Unit at The Australian National University over 1998-
2001. We are grateful to those who have discussed their research with us, brought us
their data for analysis, and allowed us to use it in the examples that appear in the present
monograph. At least these data will not, as often happens once data have become the basis
for a published paper, gather dust in a long-forgotten folder!
We have covered a range of topics that we consider important for many different areas
of statistical application. This diversity of sources of examples has benefits, even for those
whose interests are in one specific application area. Ideas and applications that are useful in
one area often find use elsewhere, even to the extent of stimulating new lines of investigation.
We hope that our book will stimulate such cross-fertilization. As is inevitable in a book that
has this broad focus, there will be specific areas - perhaps epidemiology, or psychology, or
sociology, or ecology - that will regret the omission of some methodologies that they find
important.
We use the R system for the computations. The R system implements a dialect of the
influential S language that is the basis for the commercial S-PLUS system. It follows
S in its close linkage between data analysis and graphics. Its development is the result
of a co-operative international effort, bringing together an impressive array of statistical
computing expertise. It has quickly gained a wide following, among professionals and non-
professionals alike. At the time of writing, R users are restricted, for the most part, to a
command line interface. Various forms of graphical user interface will become available in
due course.
The R system has an extensive library of packages that offer state-of-the-art abilities.
Many of the analyses that they offer were not, 10 years ago, available in any of the standard
packages. What did data analysts do before we had such packages? Basically, they adapted
more simplistic (but not necessarily simpler) analyses as best they could. Those whose
skills were unequal to the task did unsatisfactory analyses. Those with more adequate skills
carried out analyses that, even if not elegant and insightful by current standards, were often
adequate. Tools such as are available in R have reduced the need for the adaptations that
were formerly necessary. We can often do analyses that better reflect the underlying science.
There have been challenging and exciting changes from the methodology that was typically
encountered in statistics courses 10 or 15 years ago.
The best any analysis can do is to highlight the information in the data. No amount of
statistical or computing technology can be a substitute for good design of data collection,
for understanding the context in which data are to be interpreted, or for skill in the use of
statistical analysis methodology. Statistical software systems are one of several components
of effective data analysis.
The questions that statistical analysis is designed to answer can often be stated sim-
ply. This may encourage the layperson to believe that the answers are similarly simple.
Often, they are not. Be prepared for unexpected subtleties. Effective statistical analysis
requires appropriate skills, beyond those gained from taking one or two undergraduate
courses in statistics. There is no good substitute for professional training in modern tools
for data analysis, and experience in using those tools with a wide range of data sets. No-
one should be embarrassed that they have difficulty with analyses that involve ideas that
professional statisticians may take 7 or 8 years of professional training and experience to
master.

Influences on the Modern Practice of Statistics


The development of statistics has been motivated by the demands of scientists for a method-
ology that will extract patterns from their data. The methodology has developed in a synergy
with the relevant supporting mathematical theory and, more recently, with computing. This
has led to methodologies and supporting theory that are a radical departure from the method-
ologies of the pre-computer era.
Statistics is a young discipline. Only in the 1920s and 1930s did the modern framework of
statistical theory, including ideas of hypothesis testing and estimation, begin to take shape.
Different areas of statistical application have taken these ideas up in different ways, some of
them starting their own separate streams of statistical tradition. Gigerenzer et al. (1989) ex-
amine the history, commenting on the different streams of development that have influenced
practice in different research areas.
Separation from the statistical mainstream, and an emphasis on "black box" approaches,
have contributed to a widespread exaggerated emphasis on tests of hypotheses, to a ne-
glect of pattern, to the policy of some journal editors of publishing only those studies that
show a statistically significant effect, and to an undue focus on the individual study. Any-
one who joins the R community can expect to witness, and/or engage in, lively debate
that addresses these and related issues. Such debate can help ensure that the demands of
scientific rationality do in due course win out over influences from accidents of historical
development.

New Tools for Statistical Computing


We have drawn attention to advances in statistical computing methodology. These have led
to new powerful tools for exploratory analysis of regression data, for choosing between
alternative models, for diagnostic checks, for handling non-linearity, for assessing the pre-
dictive power of models, and for graphical presentation. In addition, we have new computing
tools that make it straightforward to move data between different systems, to keep a record
of calculations, to retrace or adapt earlier calculations, and to edit output and graphics into
a form that can be incorporated into published documents.
One can think of an effective statistical analysis package as a workshop (this analogy
appears in a simpler form in the JMP Start Statistics Manual (SAS Institute Inc. 1996,
p. xiii)). The tools are the statistical and computing abilities that the package provides.
The layout of the workshop, the arrangement both of the tools and of the working area,
is important. It should be easy to find each tool as it is needed. Tools should float back of
their own accord into the right place after use! In other words, we want a workshop where
mending the rocking chair is a pleasure!
The workshop analogy is worth pursuing further. Different users have different require-
ments. A hobbyist workshop will differ from a professional workshop. The hobbyist may
have less sophisticated tools, and tools that are easy to use without extensive training or ex-
perience. That limits what the hobbyist can do. The professional needs powerful and highly
flexible tools, and must be willing to invest time in learning the skills needed to use them.
Good graphical abilities, and good data manipulation abilities, should be a high priority for
the hobbyist statistical workshop. Other operations should be reasonably easy to implement
when carried out under the instructions of a professional. Professionals also require top rate
graphical abilities. The focus is more on flexibility and power, both for graphics and for
computation. Ease of use is important, but not at the expense of power and flexibility.

A Note on the R System


The R system implements a dialect of the S language that was developed at AT&T Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. Versions of R are available,
at no cost, for 32-bit versions of Microsoft Windows, for Linux and other Unix systems, and
for the Macintosh. It is available through the Comprehensive R Archive Network (CRAN).
Go to http://cran.r-project.org/, and find the nearest mirror site.
The citation for John Chambers' 1998 Association for Computing Machinery Software
award stated that S has "forever altered how people analyze, visualize and manipulate data."
The R project enlarges on the ideas and insights that generated the S language. We are
grateful to the R Core Development Team, and to the creators of the various R packages,
for bringing into being the R system - this marvellous tool for scientific and statistical
computing, and for graphical presentation.

Acknowledgements
Many different people have helped us with this project. Winfried Theis (University of
Dortmund, Germany) and Detlef Steuer (University of the Federal Armed Forces, Hamburg,
Germany) helped with technical aspects of working with LaTeX, with setting up a cvs server
to manage the LaTeX files, and with helpful comments. Lynne Billard (University of Georgia,
USA), Murray Jorgensen (University of Waikato, NZ) and Berwin Turlach (University of
Western Australia) gave valuable help in the identification of errors and text that required
clarification. Susan Wilson (Australian National University) gave welcome encouragement.
Duncan Murdoch (University of Western Ontario) helped set up the DAAG package. Thanks
also to Cath Lawrence (Australian National University) for her Python program that allowed
us to extract the R code, as and when required, from our LaTeX files. The failings that remain
are, naturally, our responsibility.
There are a large number of people who have helped with the providing of data sets.
We give a list, following the list of references for the data near the end of the book. We
apologize if there is anyone that we have inadvertently failed to acknowledge. Finally,
thanks to David Tranah of Cambridge University Press, for his encouragement and help in
bringing the writing of this monograph to fruition.

References
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Krüger, L. 1989. The Empire
of Chance. Cambridge University Press.
SAS Institute Inc. 1996. JMP Start Statistics. Duxbury Press, Belmont, CA.
These (and all other) references also appear in the consolidated list of references near the
end of the book.

Conventions
Text that is R code, or output from R, is printed in a verbatim text style. For example,
in Chapter 1 we will enter data into an R object that we call austpop. We will use the
plot() function to plot these data. The names of R packages, including our own DAAG
package, are printed in italics.
Starred exercises and sections identify more technical items that can be skipped at a first
reading.

Web sites for supplementary information


The DAAG package, the R scripts that we present, and other supplementary information,
are available from
http://cbis.anu.edu/DAAG
http://www.stats.uwo.ca/DAAG

Solutions to exercises
Solutions to selected exercises are available from the website
http://www.maths.anu.edu.au/~johnm/r-book.html
See also www.cambridge.org/0521813360
A Chapter by Chapter Summary

Chapter 1: A Brief Introduction to R


This chapter aims to give enough information on the use of R to get readers started.
Note R's extensive online help facilities. Users who have a basic minimum knowledge
of R can often get needed additional information from the help pages as the demand arises.
A facility in using the help pages is an important basic skill for R users.

Chapter 2: Styles of Data Analysis


Knowing how to explore a set of data upon encountering it for the first time is an important
skill. What graphs should one draw?
Different types of graph give different views of the data. Which views are likely to be
helpful?
Transformations, especially the logarithmic transformation, may be a necessary prelim-
inary to data analysis.
There is a contrast between exploratory data analysis, where the aim is to allow the data
to speak for themselves, and confirmatory analysis (which includes formal estimation and
testing), where the form of the analysis should have been largely decided before the data
were collected.
Statistical analysis is a form of data summary. It is important to check, as far as this is
possible, that summarization has captured crucial features of the data. Summary statistics,
such as the mean or correlation, should always be accompanied by examination
of a relevant graph. For example, the correlation is a useful summary, if at all, only if
the relationship between two variables is linear. A scatterplot allows a visual check on
linearity.
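
The point about correlation can be sketched with the anscombe data frame that is supplied with R. This is an illustrative choice of ours, not an example taken from the chapter. The four x-y pairs have almost the same correlation, yet the scatterplots show that only some of the relationships are close to linear.

```r
# Illustrative sketch: near-identical correlations, very different scatterplots.
# anscombe is a built-in data frame with columns x1..x4 and y1..y4.
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(cors, 3))

# Draw the four scatterplots in a 2 by 2 layout, to check linearity by eye
oldpar <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 16)
}
par(oldpar)   # restore the previous layout
```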

Chapter 3: Statistical Models


Formal data analyses assume an underlying statistical model, whether or not it is explicitly
written down.
Many statistical models have two components: a signal (or deterministic) component;
and a noise (or error) component.
Data from a sample (commonly assumed to be randomly selected) are used to fit the
model by estimating the signal component.
The fitted model determines fitted or predicted values of the signal. The residuals (which
estimate the noise component) are what remain after subtracting the fitted values from the
observed values of the signal.
The normal distribution is widely used as a model for the noise component.
Haphazardly chosen samples should be distinguished from random samples. Inference
from haphazardly chosen samples is inevitably hazardous. Self-selected samples are par-
ticularly unsatisfactory.

Chapter 4: An Introduction to Formal Inference


Formal analysis of data leads to inferences about the population(s) from which the data were
sampled. Statistics that can be computed from given data are used to convey information
about otherwise unknown population parameters.
The inferences that are described in this chapter require randomly selected samples from
the relevant populations.
A sampling distribution describes the theoretical distribution of sample values of a statis-
tic, based on multiple independent random samples from the population.
The standard deviation of a sampling distribution has the name standard error.
For sufficiently large samples, the normal distribution provides a good approximation to
the true sampling distribution of the mean or a difference of means.
A confidence interval for a parameter, such as the mean or a difference of means, has the
form
statistic ± t-critical-value × standard error.
Such intervals give an assessment of the level of uncertainty when using a sample statistic
to estimate a population parameter.
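
As a minimal sketch of this form, the following computes a 95% confidence interval for a mean by hand and compares it with the interval from t.test(). The use of the built-in women data and the 95% level are assumptions made for illustration only.

```r
# Sketch: statistic +/- t-critical-value x standard error, for a mean
x <- women$height                  # built-in data; an illustrative choice
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
tcrit <- qt(0.975, df = n - 1)     # t critical value for a 95% interval
ci <- mean(x) + c(-1, 1) * tcrit * se
print(ci)

# The interval that t.test() reports, for comparison
print(t.test(x)$conf.int)
```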
Another viewpoint is that of hypothesis testing. Is there sufficient evidence to believe that
there is a difference between the means of two different populations?
Checks are essential to determine whether it is plausible that confidence intervals and
hypothesis tests are valid. Note however that plausibility is not proof!
Standard chi-squared tests for two-way tables assume that items enter independently into
the cells of the table. Even where such a test is not valid, the standardized residuals from
the "no association" model can give useful insights.
In the one-way layout, in which there are several independent sets of sample values,
one for each of several groups, data structure (e.g. compare treatments with control, or
focus on a small number of "interesting" contrasts) helps determine the inferences that are
appropriate. In general, it is inappropriate to examine all possible comparisons.
In the one-way layout with quantitative levels, a regression approach is usually
appropriate.

Chapter 5: Regression with a Single Predictor


Correlation can be a crude and unduly simplistic summary measure of dependence between
two variables. Wherever possible, one should use the richer regression framework to gain
deeper insights into relationships between variables.
The line or curve for the regression of a response variable y on a predictor x is different
from the line or curve for the regression of x on y. Be aware that the inferred relationship
is conditional on the values of the predictor variable.
The model matrix, together with estimated coefficients, allows for calculation of predicted
or fitted values and residuals.
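
A small sketch may make the role of the model matrix concrete. It assumes the built-in cars data and a straight line model, both chosen here purely for illustration. Multiplying the model matrix by the estimated coefficients reproduces the fitted values, and subtracting these from the observed response gives the residuals.

```r
# Sketch: fitted values and residuals from the model matrix and coefficients
fit <- lm(dist ~ speed, data = cars)
X <- model.matrix(fit)             # columns: (Intercept) and speed
b <- coef(fit)                     # estimated coefficients

fitted_by_hand <- drop(X %*% b)    # matrix product gives the fitted values
resid_by_hand <- cars$dist - fitted_by_hand

# Both agree with what R's extractor functions return
print(all.equal(unname(fitted_by_hand), unname(fitted(fit))))
print(all.equal(unname(resid_by_hand), unname(resid(fit))))
```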
Following the calculations, it is good practice to assess the fitted model using standard
forms of graphical diagnostics.
Simple alternatives to straight line regression using the data in their raw form are
transforming x and/or y,
using polynomial regression,
fitting a smooth curve.
For size and shape data the allometric model is a good starting point. This model assumes
that regression relationships among the logarithms of the size variables are linear.

Chapter 6: Multiple Linear Regression


Scatterplot matrices may provide useful insight, prior to fitting a regression model.
Following the fitting of a regression, one should examine relevant diagnostic plots.
Each regression coefficient estimates the effect of changes in the corresponding explana-
tory variable when other explanatory variables are held constant.
The use of a different set of explanatory variables may lead to large changes in the
coefficients for those variables that are in both models.
Selective influences in the data collection can have a large effect on the fitted regression
relationship.
For comparing alternative models, the AIC or equivalent statistic (including Mallows
Cp), can be useful. The R^2 statistic has limited usefulness.
If the effect of variable selection is ignored, the estimate of predictive power can be
grossly inflated.
When regression models are fitted to observational data, and especially if there are a
number of explanatory variables, estimated regression coefficients can give misleading
indications of the effects of those individual variables.
The most useful test of predictive power comes from determining the predictive accuracy
that can be expected from a new data set.
Cross-validation is a powerful and widely applicable method that can be used for assessing
the expected predictive accuracy in a new sample.

Chapter 7: Exploiting the Linear Model Framework


In the study of regression relationships, there are many more possibilities than regression
lines! If a line is adequate, use that. But one is not limited to lines!
A common way to handle qualitative factors in linear models is to make the initial level
the baseline, with estimates for other levels estimated as offsets from this baseline.
Polynomials of degree n can be handled by introducing into the model matrix, in addition
to a column of values of x, columns corresponding to x^2, x^3, ..., x^n. Typically, n = 2, 3 or 4.
Multiple lines are fitted as an interaction between the variable and a factor with as many
levels as there are different lines.
Scatterplot smoothing, and smoothing terms in multiple linear models, can also be handled
within the linear model framework.

Chapter 8: Logistic Regression and Other Generalized Linear Models


Generalized linear models (GLMs) are an extension of linear models, in which a function
of the expectation of the response variable y is expressed as a linear model. A further
generalization is that y may have a binomial or Poisson or other non-normal distribution.
Common important GLMs are the logistic model and the Poisson regression model.
Survival analysis may be seen as a further specific extension of the GLM framework.

Chapter 9: Multi-level Models, Time Series and Repeated Measures


In a multi-level model, the random component possesses structure; it is a sum of distinct
error terms.
Multi-level models that exhibit suitable balance have traditionally been analyzed within
an analysis of variance framework. Unbalanced multi-level designs require the more general
multi-level modeling methodology.
Observations taken over time often exhibit time-based dependence. Observations that are
close together in time may be more highly correlated than those that are widely separated.
The autocorrelation function can be used to assess levels of serial correlation in time series.
Repeated measures models have measurements on the same individuals at multiple points
in time and/or space. They typically require the modeling of a correlation structure similar
to that employed in analyzing time series.

Chapter 10: Tree-based Classification and Regression


Tree-based models make very weak assumptions about the form of the classification or
regression model. They make limited use of the ordering properties of continuous or ordinal
explanatory variables. They are unsuitable for use with small data sets.
Tree-based models can be an effective tool for analyzing data that are non-linear and/or
involve complex interactions.
The decision trees that tree-based analyses generate may be complex, giving limited
insight into model predictions.
Cross-validation, and the use of training and test sets, are essential tools both for choosing
the size of the tree and for assessing expected accuracy on a new data set.

Chapter 11: Multivariate Data Exploration and Discrimination


Principal components analysis is an important multivariate exploratory data analysis tool.
Examples are presented of the use of two alternative discrimination methods - logistic
regression including multivariate logistic regression, and linear discriminant analysis.
Both principal components analysis, and discriminant analysis, allow the calculation of
scores, which are values of the principal components or discriminant functions, calculated
observation by observation. The scores may themselves be used as variables in, e.g., a
regression analysis.

Chapter 12: The R System - Additional Topics


This final chapter gives pointers to some of the further capabilities of R. It hints at the
marvellous power and flexibility that are available to those who extend their skills in the
use of R beyond the basic topics that we have treated. The information in this chapter is
intended, also, for use as a reference in connection with the computations of earlier chapters.
A Brief Introduction to R

This first chapter is intended to introduce readers to the basics of R. It should provide an
adequate basis for running the calculations that are described in later chapters.
In later chapters, the R commands that handle the calculations are, mostly, confined to
footnotes. Sections are included at the ends of several of the chapters that give further
information on the relevant features in R. Most of the R commands will run without change
in S-PLUS.

1.1 A Short R Session


1.1.1 R must be installed!
An up-to-date version of R may be downloaded from http://cran.r-project.org/ or from the
nearest mirror site. Installation instructions are provided at the web site for installing R in
Windows, Unix, Linux, and various versions of the Macintosh operating system. Various
contributed packages are now a part of the standard R distribution, but a number are not;
any of these may be installed as required. Data sets that are mentioned in this book have
been collected into a package that we have called DAAG. This is available from the web
pages http://cbis.anu.edu.au/DAAG and http://www.stats.uwo.ca/DAAG.

1.1.2 Using the console (or command line) window


The command line prompt (>) is an invitation to start typing in commands or expressions.
R evaluates and prints out the result of any expression that is typed in at the command line
in the console window (multiple commands may appear on the one line, with the semicolon
(;) as the separator). This allows the use of R as a calculator. For example, type in 2+2 and
press the Enter key. Here is what appears on the screen:

> 2+2
[1] 4
>

The first element is labeled [1] even when, as here, there is just one element! The >
indicates that R is ready for another command.

Table 1.1: The contents of the file ACTpop.txt (columns: Year, ACT).

In a sense this chapter, and much of the rest of the book, is a discussion of what is
possible by typing in statements at the command line. Practice in the evaluation of arithmetic
expressions will help develop the needed conceptual and keyboard skills. Here are simple
examples:

> 2*3*4*5            # * denotes 'multiply'
[1] 120
> sqrt(10)           # the square root of 10
[1] 3.162278
> pi                 # R knows about pi
[1] 3.141593
> 2*pi*6378          # Circumference of Earth at Equator (km);
                     # radius is 6378 km
[1] 40074.16

Anything that follows a # on the command line is taken as a comment and ignored by R.
There is also a continuation prompt that appears when, following a carriage return, the
command is still not complete. By default, the continuation prompt is + (in this book we
will omit both the prompt (>) and the continuation prompt (+), whenever command line
statements are given separately from output).

1.1.3 Reading data from a file


Our first goal is to read a text file into R using the read.table() function. Table 1.1
displays population in thousands, for Australian Capital Territory (ACT) at various times
since 1917. The command that follows assumes that the reader has entered the contents of
Table 1.1 into a text file called ACTpop.txt.
When an R session is started, it has a working directory where, by default, it looks for
any files that are requested. The following statement will read in the data from a file that is
in the working directory (the working directory can be changed during the course of an R
session; see Subsection 1.3.2):

ACTpop <- read.table("ACTpop.txt", header=TRUE)


Figure 1.1: ACT population, in thousands, at various times between 1917 and 1997.

This reads in the data, and stores them in the data frame ACTpop. The <- is a left angle
bracket (<) followed by a minus sign (-). It means "the values on the right are assigned to
the name on the left". Note the use of header=TRUE to ensure that R uses the first line to
get header information for the columns.
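
The effect of header=TRUE can be tried out without the ACTpop.txt file. The sketch below writes a small file with the same two-column layout to a temporary location and reads it back. The file name and the numbers are hypothetical; they are not the values of Table 1.1.

```r
# Hypothetical file in the Year/ACT layout; the numbers are invented
# for illustration and are NOT the contents of Table 1.1.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Year ACT",
             "1900 10",
             "1910 20",
             "1920 30"), demo_file)

demo <- read.table(demo_file, header = TRUE)   # first line gives column names
print(demo)

# Without header=TRUE, the first line is read as data and R supplies
# the default column names V1 and V2
demo_nohdr <- read.table(demo_file)
print(names(demo_nohdr))
```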
Type in ACTpop at the command line prompt, and the data will be displayed almost
as they appear in Table 1.1 (the only difference is the introduction of row labels in the
R output). The object ACTpop is an example of a data frame. Data frames are the usual
way for organizing data sets in R. More information about data frames can be found in
Section 1.5.
Case is significant for names of R objects or commands. Thus, ACTPOP is different from
ACTpop. (For file names on Microsoft Windows systems, the Windows conventions apply,
and case does not distinguish file names. On Unix systems letters that have a different case
are treated as different.)
We now plot the ACT population between 1917 and 1997 by using

plot(ACT ~ Year, data=ACTpop, pch=16)

The option pch=16 sets the plotting character to solid black dots. Figure 1.1 shows the
graph.
We can make various modifications to this basic plot. We can specify more informative
axis labels, change the sizes of the text and of the plotting symbol, add a title, and so on.
More information is given in Section 1.8.

1.1.4 Entry of data at the command line


Table 1.2 gives, for each amount by which an elastic band is stretched over the end of a ruler,
the distance that the band traveled when released. We can use data.frame() to input
these (or other) data directly at the command line. We will assign the name elasticband
to the data frame:

elasticband <- data.frame(stretch=c(46,54,48,50,44,42,52),
                          distance=c(148,182,173,166,109,141,166))

Table 1.2: Distance (cm) versus stretch (mm), for elastic band data (columns: Stretch, Distance).

The constructs c(46,54,48,50,44,42,52) and c(148,182,173,166,109,141,166) join
("concatenate") the separate numbers together into a single vector object.
See later, Subsection 1.4.1. These vector objects then become columns in the data frame.

1.1.5 Online help


Before getting deeply into the use of R, it is well to take time to master the help facilities.
Such an investment of time will pay dividends. R's help files are comprehensive, and are
frequently upgraded. Type in help(help) to get information on the help features of the
system that is in use. To get help on, e.g., plot(), type in
help(plot)

Different R implementations offer different choices of modes of access into the help pages.
Thus Microsoft Windows systems offer a choice between a form of help that displays the
relevant help file, html help, and compiled html help. The choice between these different
modes of access is made at startup. See help(Rprofile) for details.
Two functions that can be highly useful in searching for functions that perform a desired
task are apropos() and help.search(). We can best explain their use by giving
specific examples. Thus try
apropos("sort")       # Try, also, apropos("sor")
# This lists all functions whose names include the
# character string "sort".
help.search("sort")   # Note that the argument is "sort"
# This lists all functions that have the word 'sort' as
# an alias or in their title.

Note also example(). This initiates the running of examples, if available, of the use
of the function specified by the function argument. For example:
example(image)
# for a 2 by 2 layout of the last 4 plots, precede with
# par(mfrow=c(2,2))
# to prompt for each new graph, precede with par(ask=T)

Much can be learned from experimenting with R functions. It may be helpful to create
a simple artificial data set with which to experiment. Another possibility is to work with a
subset of the data set to which the function will, finally, be applied. For extensive experi-
mentation it is best to create a new workspace where one can work with copies of any user
data sets and functions.
Among the abilities that are documented in the help pages, there will be some that bring
pleasant and unexpected surprises. There may be insightful and helpful examples. There
are often references to related functions. In most cases there are technical references that
give the relevant theory. While the help pages are not intended to be an encyclopedia on
statistical methodology, they contain much helpful commentary on the methods whose
implementation they document. It can help enormously, before launching into use of an R
function, to check the relevant help page!

1.1.6 Quitting R
One exits or quits R by using the q() function:

q()

There will be a message asking whether to save the workspace image. Clicking Yes
(the safe option) will save all the objects that remain in the workspace - any that
were there at the start of the session and any that were added (and not removed) during the
session.
Depending on the implementation, alternatives may be to click on the File menu and then
on Exit, or to click on the x in the top right hand corner of the R window. (Under Linux,
clicking on x exits from the program, but without asking whether to save the workspace
image.)
Note: In order to quit from the R session we had to type q(). This is because q is a
function. Typing q on its own, without the parentheses, displays the text of the function on
the screen. Try it!
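
The same distinction holds for any R function. The following sketch uses sqrt() in place of q(), so that it can be run without ending the session; the choice of sqrt() is ours.

```r
# A function name on its own refers to the function object;
# adding parentheses calls the function.
print(is.function(q))   # q is an object whose mode is "function"
print(sqrt)             # displays the function; nothing is computed
root <- sqrt(16)        # the parentheses cause the function to be called
print(root)
```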

1.2 The Uses of R


R has extensive capabilities for statistical analysis, that will be used throughout this book.
These are embedded in an interactive computing environment that is suited to many different
uses. Here we draw attention to abilities, beyond simple one-line calculations, that are not
primarily statistical.

R will give numerical or graphical data summaries


An important class of R object is the data frame. R uses data frames to store rectangular
arrays in which the columns may be vectors of numbers or factors or text strings. Data
frames are central to the way that all the more recent R routines process data. For now, think
of data frames as rather like a matrix, where the rows are observations and the columns are
variables.

As a first example, consider the data frame cars that is in the base package. This has
two columns (variables), with the names speed and dist. Typing in summary(cars)
gives summary information on these variables.

> data(cars)        # Gives access to the data frame cars
> summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00

Thus, we can immediately see that the range of speeds (first column) is from 4 mph to
25 mph, and that the range of distances (second column) is from 2 feet to 120 feet.
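
The ranges read off from the summary can also be obtained directly. This is a minimal sketch that uses range() and sapply(), both of which are described later in this chapter.

```r
# Ranges of the two columns of the built-in cars data frame
speed_range <- range(cars$speed)   # minimum and maximum speed (mph)
dist_range <- range(cars$dist)     # minimum and maximum distance (feet)
print(speed_range)
print(dist_range)

# Both columns at once: a matrix with one column per variable
ranges <- sapply(cars, range)
print(ranges)
```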

R has extensive abilities for graphical presentation


The main R graphics function is plot(), which we have already encountered. In addition,
there are functions for adding points and lines to existing graphs, for placing text at specified
positions, for specifying tick marks and tick labels, for labeling axes, and so on. Some details
are given in Section 1.8.

R is an interactive programming language


Suppose we want to calculate the Fahrenheit temperatures that correspond to Celsius
temperatures 25, 26, ..., 30. Here is a good way to do this in R:

> celsius <- 25:30
> fahrenheit <- 9/5*celsius+32
> conversion <- data.frame(Celsius=celsius, Fahrenheit=fahrenheit)
> print(conversion)
  Celsius Fahrenheit
1      25       77.0
2      26       78.8
3      27       80.6
4      28       82.4
5      29       84.2
6      30       86.0

1.3 The R Language


R is a functional language that uses many of the same symbols and conventions as the widely
used general purpose language C and its successors C++ and Java. There is a language core
that uses standard forms of algebraic notation, allowing the calculations that were described
in Subsection 1.1.2. Functions - supplied as part of R or one of its packages, or written by
the user - allow the limitless extension of this language core.

1.3.1 R objects

All R entities, including functions and data structures, exist as objects that can be operated
on as data. Type in ls() (or objects()) to see the names of all objects in the workspace.
One can restrict the names to those with a particular pattern, e.g., starting with the letter "p".
(Type in help(ls) and help(grep) for more details. The pattern-matching conventions
are those used for grep(), which is modeled on the Unix grep command. For example,
ls(pattern="p") lists all object names that include the letter "p". To get all object
names that start with the letter "p", specify ls(pattern="^p").) As noted earlier,
typing the name of an object causes the printing of its contents.
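
The two patterns can be compared in a small sketch. So that the result does not depend on what happens to be in the reader's workspace, the objects are placed in a separate environment; the environment and the object names are hypothetical.

```r
# A scratch environment with four hypothetical objects
demo_env <- new.env()
assign("pop", 1, envir = demo_env)
assign("plotdata", 2, envir = demo_env)
assign("temp", 3, envir = demo_env)
assign("x", 4, envir = demo_env)

all_names <- ls(demo_env)
has_p <- ls(demo_env, pattern = "p")      # "p" anywhere in the name
starts_p <- ls(demo_env, pattern = "^p")  # "p" at the start of the name
print(all_names)
print(has_p)
print(starts_p)
```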
It is often possible and desirable to operate on objects - vectors, arrays, lists and so on - as
a whole. This largely avoids the need for explicit loops, leading to clearer code. Section 1.2
gave an example.
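
For comparison, here is the temperature conversion of Section 1.2 written both ways. The loop version is our own addition, included only to show what the whole-object calculation makes unnecessary.

```r
celsius <- 25:30

# Whole-object calculation: one statement handles every element
fahrenheit_vec <- 9/5 * celsius + 32

# The same calculation with an explicit loop
fahrenheit_loop <- numeric(length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit_loop[i] <- 9/5 * celsius[i] + 32
}

print(all.equal(fahrenheit_vec, fahrenheit_loop))
```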

1.3.2 Retaining objects between sessions


Upon quitting an R session, we have recommended saving the workspace image. It is easiest
to allow R to save the image, in the current working directory, with the file name .RData
that R uses as the default. The image will then be loaded automatically when a new session
is started in that directory. The objects that were in the workspace at the end of the earlier
session will again be available.
Many R users find it convenient to work with multiple workspaces, typically with a
different working directory for each different workspace. As a preliminary to loading a new
workspace, it will usually be desirable to save the current workspace, and then to clear all
objects from it. These operations may be performed from the menu, or alternatively there are
the commands save.image() for saving the current workspace, rm(list = ls()) for
clearing the workspace, setwd() for changing the working directory, and load() for
loading a new workspace. For details, see the help pages for these functions. See also
Subsection 12.9.1.
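
The save, clear and reload cycle can be tried safely with the following sketch. It is not the workflow itself: to avoid touching the reader's real workspace, it uses save() and load() on a scratch environment and a temporary file, where the commands above would use save.image() and the default .RData file. The environment and file names are hypothetical.

```r
# Scratch environment standing in for the workspace
ws <- new.env()
assign("celsius", 25:30, envir = ws)
assign("fahrenheit", 9/5 * (25:30) + 32, envir = ws)

# Save every object in ws to a temporary image file (cf. save.image())
image_file <- tempfile(fileext = ".RData")
save(list = ls(ws), file = image_file, envir = ws)

# Clear the scratch environment (cf. rm(list = ls()))
rm(list = ls(ws), envir = ws)
n_after_clear <- length(ls(ws))

# Load the saved objects back in
load(image_file, envir = ws)
restored <- ls(ws)
print(restored)
```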
One should avoid cluttering the workspace with objects that will not again be needed.
Before saving the current workspace, type ls() to get a complete list of objects. Then
remove unwanted objects using

rm(<obj1>, <obj2>, ...)

where the names of objects that are to be removed should appear in place of <obj1>,
<obj2>, .... For example, to remove the objects celsius and fahrenheit from the
workspace image before quitting, type

rm(celsius, fahrenheit)
q()

In general, we have left it to the reader to determine which objects should be removed once
calculations are complete.

1.4 Vectors in R
Vectors may have mode "logical", "numeric", "character" or "list". Examples of vectors
are

> c(T, F, F, F, T, T, F)
[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE

> c("Canberra", "Sydney", "Canberra", "Sydney")
[1] "Canberra" "Sydney"   "Canberra" "Sydney"

The first two vectors above are numeric, the third is logical, and the fourth is a character
vector. Note the use of the global variables F ( =FALSE) and T ( =TRUE) as a convenient
shorthand when logical values are entered.
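
The mode of a vector can be checked with mode(). In the sketch below the particular numeric values are illustrative choices of ours.

```r
num1 <- c(2, 3, 5, 2, 7, 1)     # numeric, entered element by element
num2 <- 3:10                    # numeric, generated as a sequence
lgl <- c(TRUE, FALSE, FALSE)    # logical
chr <- c("Canberra", "Sydney")  # character

modes <- sapply(list(num1, num2, lgl, chr), mode)
print(modes)

# T and F are ordinary variables and can be overwritten by the user;
# TRUE and FALSE are reserved words and cannot.
```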

1.4.1 Concatenation - joining vector objects


The c in c(2, 3, 5, 2, 7, 1) is an abbreviation for "concatenate". The meaning
is: "Collect these numbers together to form a vector." We can concatenate two vectors.
In the following, we form numeric vectors x and y, that we then concatenate to form a
vector z:
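
A minimal sketch of such a concatenation follows. The values given to y are an assumption made for illustration.

```r
x <- c(2, 3, 5, 2, 7, 1)
y <- c(10, 15, 12)      # illustrative values
z <- c(x, y)            # the elements of x, followed by those of y
print(z)
print(length(z))        # length(x) + length(y)
```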

1.4.2 Subsets of vectors


Note three common ways to extract subsets of vectors.

1. Specify the indices of the elements that are to be extracted, e.g.,

> x <- c(3, 11, 8, 15, 12)    # Assign to x the values 3,
                              # 11, 8, 15, 12
> x[c(2,4)]                   # Elements in positions 2
[1] 11 15                     # and 4 only

2. Use negative subscripts to omit the elements in nominated subscript positions (take
care not to mix positive and negative subscripts):

> x[-c(2,3)]    # Remove the elements in positions 2 and 3
[1]  3 15 12

3. Specify a vector of logical values. The elements that are extracted are those for which
the logical value is TRUE. Thus, suppose we want to extract values of x that are greater
than 10.

> x > 10
[1] FALSE  TRUE FALSE  TRUE  TRUE
> x[x > 10]
[1] 11 15 12

1.4.3 Patterned data


Use, for example, 5:15 to generate all integers in a range, here between 5 and 15 inclusive.

Conversely, 15:5 will generate the sequence in the reverse order.


The function seq() is more general. For example:

> seq(from=5, to=22, by=3)    # The first value is 5. The final
                              # value is <= 22
[1]  5  8 11 14 17 20

The function call can be abbreviated to

> seq(5, 22, 3)

To repeat the sequence (2, 3, 5) four times over, enter

> rep(c(2,3,5), 4)

Patterned character vectors are also possible:

> c(rep("female",3), rep("male",2))
[1] "female" "female" "female" "male"   "male"

1.4.4 Missing values


The missing value symbol is NA. As an example, we may set

y <- c(1, NA, 3, 0, NA)

Note that any arithmetic operation or relation that involves NA generates an NA. Specifically,
be warned that y[y==NA] <- 0 leaves y unchanged. The reason is that all elements of
y==NA evaluate to NA. This does not identify an element of y, and there is no assignment.
To replace all NAs by 0, use the function is.na(), thus

y[is.na(y)] <- 0

The functions mean(), median(), range(), and a number of other functions, take the
argument na.rm=TRUE; i.e. remove NAs, then proceed with the calculation. By default,
these and related functions will return NA when there are NAs. By default, the table() function
ignores NAs.
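
These points can be checked on the vector y that was set up above. The sketch below is our own; the useNA argument of table() is shown as one way to have NAs counted.

```r
y <- c(1, NA, 3, 0, NA)

print(y == NA)       # every element is NA, so no element is identified
print(is.na(y))      # TRUE in the positions that hold NA

m_default <- mean(y)              # the NAs propagate to the result
m_narm <- mean(y, na.rm = TRUE)   # mean of the values 1, 3 and 0

tab <- table(y)                       # NAs are left out of the counts
tab_na <- table(y, useNA = "ifany")   # NAs are counted as a category

y0 <- y
y0[is.na(y0)] <- 0   # replace all NAs by 0
print(y0)
```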

1.4.5 Factors
A factor is stored internally as a numeric vector with values 1, 2, 3, ..., k. The value k is
the number of levels. The levels are character strings.
Consider a survey that has data on 691 females and 692 males. If the first 691 are females
and the next 692 males, we can create a vector of strings that holds the values thus:
gender <- c(rep("female",691), rep("male",692))

We can change this vector to a factor, by entering


gender <- factor(gender)

Internally, the factor gender is stored as 691 1s, followed by 692 2s. It has stored with it
a table that holds the information

1 female
2 male

In most contexts that seem to demand a character string, the 1 is translated into female
and the 2 into male. The values female and male are the levels of the factor. By default,
the levels are chosen to be in sorted order for the data type from which the factor was
formed, so that female precedes male. Hence:

> levels(gender)
[1] "female" "male"

Note that if gender had been an ordinary character vector, the outcome of the above
levels command would have been NULL. The order of the factor levels determines the
order of appearance of the levels in graphs and tables that use this information. To cause
male to come before female, use

gender <- factor(gender, levels=c("male", "female"))

This syntax is available both when the factor is first created, and later to change the order
in an existing factor. Take care that the level names are correctly spelled. For example,
specifying "Male" in place of "male" in the levels argument will cause all values
that were "male" to be coded as missing.
One advantage of factors is that the memory required for storage is less than for the
corresponding character vector when there are multiple values for each factor level, and the
levels are long character strings.
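
The behavior of factors that is described in this subsection can be seen on a small scale. The following sketch uses a shortened version of the gender vector (3 females and 2 males, an arbitrary choice) so that the internal codes are easy to inspect.

```r
gender_chr <- c(rep("female", 3), rep("male", 2))
gender <- factor(gender_chr)

print(levels(gender))        # levels in sorted order
print(as.integer(gender))    # the internal codes
print(levels(gender_chr))    # NULL: a character vector has no levels

# Change the order of the levels
gender2 <- factor(gender_chr, levels = c("male", "female"))
print(as.integer(gender2))

# A misspelled level name causes the affected values to become missing
bad <- factor(gender_chr, levels = c("Male", "female"))
n_missing <- sum(is.na(bad))
print(n_missing)
```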

1.5 Data Frames


Data frames are fundamental to the use of the R modeling and graphics functions. A data
frame is a generalization of a matrix, in which different columns may have different modes.
All elements of any column must, however, have the same mode, i.e. all numeric, or all
factor, or all character, or all logical.
Included with our data sets is Cars93.summary, created from the Cars93 data
set in the Venables and Ripley MASS package, and included in our DAAG package. In
order to access it, we need first to install it and then to load it into the workspace,
thus:

> library(DAAG)          # load the DAAG package
> data(Cars93.summary)   # copy Cars93.summary into the workspace
> Cars93.summary
        Min.passengers Max.passengers No.of.cars abbrev
Compact              4              6         16      C
Large                6              6         11      L
Midsize              4              6         22      M
Small                4              5         21     Sm
Sporty               2              4         14     Sp
Van                  7              8          9      V

Notice that, before we could access Cars93.summary, we had first to use
data(Cars93.summary) to copy it into the workspace. This differs from the S-PLUS
behavior, where such data frames in packages that have been loaded are automatically
available for access.
The data frame has row labels (accessed using row.names(Cars93.summary))
Compact, Large, .... The column names (accessed using names(Cars93.summary))
are Min.passengers (i.e. the minimum number of passengers for cars in
this category), Max.passengers, No.of.cars, and abbrev. The first three columns
are numeric, and the fourth is a factor. Use the function class() to check this, e.g. enter
class(Cars93.summary$abbrev).
There are several ways to access the columns of a data frame. Any of the following will
pick out the fourth column of the data frame Cars93.summary and store it in the vector
type (also allowed is Cars93.summary[4]. This gives a data frame with the single
column abbrev).

type <- Cars93.summary$abbrev
type <- Cars93.summary[,4]
type <- Cars93.summary[,"abbrev"]
type <- Cars93.summary[[4]]    # Take the object that is stored
                               # in the fourth list element.

In each case, one can view the contents of the object type by entering type at the command
line, thus:

> type
[1] C L M Sm Sp V
Levels: C L M Sm Sp V
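
Readers who do not have the DAAG package installed can check the equivalence of these forms on a hand-built copy of the data frame. The name Cars93.demo is hypothetical; the values are those printed above. The argument stringsAsFactors=TRUE is an assumption needed in recent versions of R, where character columns are no longer converted to factors by default.

```r
# Hand-built stand-in for Cars93.summary, using the values printed above
Cars93.demo <- data.frame(
  Min.passengers = c(4, 6, 4, 4, 2, 7),
  Max.passengers = c(6, 6, 6, 5, 4, 8),
  No.of.cars = c(16, 11, 22, 21, 14, 9),
  abbrev = c("C", "L", "M", "Sm", "Sp", "V"),
  row.names = c("Compact", "Large", "Midsize", "Small", "Sporty", "Van"),
  stringsAsFactors = TRUE
)

by_dollar <- Cars93.demo$abbrev
by_number <- Cars93.demo[, 4]
by_name <- Cars93.demo[, "abbrev"]
by_list <- Cars93.demo[[4]]
one_col <- Cars93.demo[4]     # a data frame with the single column abbrev

print(class(by_dollar))       # the column itself is a factor
print(class(one_col))         # single brackets return a data frame
```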

It is often convenient to use the attach() function:

> attach(Cars93.summary)
# R can now access the columns of Cars93.summary directly
> abbrev
[1] C L M Sm Sp V
Levels: C L M Sm Sp V
> detach("Cars93.summary")
# Not strictly necessary, but tidiness is a good habit!
# In R, detach(Cars93.summary) is an acceptable alternative

Detaching data frames that are no longer in use reduces the risk of a clash of variable names,
e.g., two different attached data frames that have a column with the name abbrev, or an
abbrev both in the workspace and in an attached data frame.
In Windows versions, use of e d i t ( ) allows access to a spreadsheet-like display of a
data frame or of a vector. Users can then directly manipulate individual entries or perform
data entry operations as with a spreadsheet. For example,

To close the spreadsheet, click on the File menu and then on Close.

1.5.1 Variable names


The names() function can be used to determine variable names in a data frame. As an
example, consider the New York air quality data frame that is included with the base R
package. To determine the variables in this data frame, type

data(airquality)
names(airquality)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

The names() function serves a second purpose. To change the name of the abbrev
variable (the fourth column) in the Cars93.summary data frame to code, type

names(Cars93.summary)[4] <- "code"

If we want to change all of the names, we could do something like

names(Cars93.summary) <- c("minpass", "maxpass", "number", "code")
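Both uses of names() can be tried without touching the original data. The sketch below works on a copy of the built-in airquality data frame. The copy aq and the replacement names are our own choices for illustration, and we assume a version of R in which the built-in data sets are available without a prior data() call.

```r
aq <- airquality                  # work on a copy of the built-in data frame
names(aq)[4] <- "Temperature"     # rename the fourth column only
names(aq)
names(aq) <- tolower(names(aq))   # replace all of the names at once
names(aq)
names(airquality)                 # the original is unchanged
```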

1.5.2 Applying a function to the columns of a data frame


The sapply() function is a useful tool for calculating statistics for each column of a data
frame. The first argument to sapply() is a data frame. The second argument is the name
of a function that is to be applied to each column. Consider the women data frame.
> data(women)
> women                   # Display the data
  height weight
1     58    115
2     59    117
3     60    120

In order to compute averages of each column, type

> sapply(women, mean)     # Apply mean() to each of the columns
height weight
  65.0  136.7
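The function given to sapply() need not return a single number, and further arguments can be passed through to it. The following sketch, which uses only the built-in women and airquality data frames, illustrates both points.

```r
sapply(women, range)                     # a 2 x 2 matrix: min and max by column
sapply(airquality, mean, na.rm = TRUE)   # na.rm = TRUE is passed on to mean()
```

In the second call, the missing values in airquality would otherwise give NA for the affected columns.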

1.5.3* Data frames and matrices


The numerical values in the data frame women might alternatively be stored in a matrix
with the same dimensions, i.e., 15 rows x 2 columns. More generally, any data frame where
all columns hold numeric data can alternatively be stored as a matrix. This can speed up
some mathematical and other manipulations when the number of elements is large, e.g., of
the order of several hundreds of thousands. For further details, see Section 12.7. Note that:
The names() function cannot be used with matrices.
Above, we used sapply() to extract summary information about the columns of the
data frame women. If women had been a matrix with the same numerical values in the
same layout, the result would have been quite different, and uninteresting - the effect is
to apply the function mean to each individual element of the matrix.
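The difference is easy to check. The sketch below converts women to a matrix (women.mat is our own name) and contrasts sapply() with apply(), which is one way to work column by column on a matrix.

```r
women.mat <- as.matrix(women)        # 15 x 2 numeric matrix
length(sapply(women.mat, mean))      # one result per element, 30 in all
apply(women.mat, 2, mean)            # column means: apply over dimension 2
names(women.mat)                     # NULL: names() does not give column names
colnames(women.mat)                  # use colnames() with a matrix
```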

1.5.4 Identification of rows that include missing values


Many of the modeling functions will fail unless action is taken to handle missing values. Two
functions that are useful for checking on missing values are complete.cases() and
na.omit(). The following code shows how we can identify rows that hold missing values.
> data(possum)     # Precede, if necessary, with library(DAAG)
> possum[!complete.cases(possum), ]
     case site Pop sex age hdlngth skullw totlngth taill
BB36   41    2 Vic   f   5    88.4   57.0       83  36.5
BB41   44    2 Vic   m  NA    85.1   51.5       76  35.5
BB45   46    2 Vic   m  NA    91.4   54.4       84  35.0
     footlgth earconch  eye chest belly
BB36       NA     40.3 15.9  27.0  30.5
BB41     70.3     52.6 14.4  23.0  27.0
BB45     72.8     51.2 14.4  24.5  35.0

The function na.omit() omits any rows that contain missing values. For example
newpossum <- na.omit(possum)     # Has three fewer rows than possum
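Readers who do not have the DAAG package to hand can try the same two functions on the built-in airquality data frame, which also has missing values. This is a sketch only; the object names incomplete and aq.clean are our own.

```r
incomplete <- airquality[!complete.cases(airquality), ]   # rows with any NA
nrow(incomplete)
aq.clean <- na.omit(airquality)                           # drop those rows
nrow(airquality) - nrow(aq.clean)                         # equals nrow(incomplete)
```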

1.6 R Packages
The recommended R distribution includes a number of packages in its library. Note in
particular base, eda, ts (time series), and MASS. We will make frequent use both of the
MASS package and of our own DAAG package. DAAG, and other packages that are not
included with the default distribution, can be readily downloaded and installed.
Installed packages, unless loaded automatically, must then be loaded prior to use. The
base package is automatically loaded at the beginning of the session. To load any other
installed package, use the library() function. For example,
library(MASS)     # Loads the MASS package

1.6.1 Data sets that accompany R packages


Type in data() to get a list of data sets (mostly data frames) in all packages that are in
the current search path. To get information on the data sets that are included in the base
package, specify
data(package="base")     # NB. Specify 'package', not 'library'.

Replace "base" by the name of any other installed package, as required (type in
library() to get the names of the installed packages).
In order to bring any of these data frames into the workspace, the user must specifically
request it. (Ensure that the relevant package is loaded.) For example, to access the data set
airquality from the base package, type in
data(airquality)     # Load airquality into the workspace

Such objects should be removed (rm(airquality)) when they are not for the time being
required. They can be loaded again as occasion demands.

1.7* Looping
A simple example of a for loop is¹
> for (i in 1:5) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

¹ Other looping constructs are

repeat <expression>           # place break somewhere inside
while (x > 0) <expression>    # Or (x < 0), or etc.
Here <expression> is an R statement, or a sequence of statements that are enclosed within braces.

Here is a possible way to estimate population growth rates for each of the Australian states
and territories:
data(austpop)                  # population figures for all
                               # Australian states
growth.rates <- numeric(8)     # numeric(8) creates a numeric
                               # vector with 8 elements, all set
                               # equal to 0
for (j in seq(2,9)) {
  growth.rates[j-1] <- (austpop[9, j]-austpop[1, j])/
                        austpop[1, j] }
growth.rates <- data.frame(growth.rates)
row.names(growth.rates) <- names(austpop[c(-1,-10)])
# We have used row.names() to name the rows of the data frame

The result is

NSW
Vic
Qld
SA
WA
Tas
NT
ACT

Avoiding loops - sapply()

The above computation can also be done using the sapply() function mentioned in
Subsection 1.5.2:

> sapply(austpop[,-c(1,10)], function(x){(x[9]-x[1])/x[1]})
   NSW    Vic    Qld     SA     WA    Tas     NT    ACT
  2.30   2.27   3.98   2.36   4.88   1.46  36.40 102.33

Note that in contrast to the example in Subsection 1.5.2, we now have an inline function,
i.e. one that is defined on the fly and does not have or need a name. The effect is to assign
the columns of the data frame austpop[, -c(1,10)], in turn, to the function argument
x. With x replaced by each column in turn, the function returns (x[9]-x[1])/x[1].
In R there is often a better alternative, perhaps using one of the built-in functions, to
writing an explicit loop. Loops can incur severe computational overhead.
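The equivalence of the two approaches can be checked on a small example. The sketch below uses a hypothetical data frame, pop, whose figures are made up for illustration and are not the austpop data. It computes the relative growth from the first to the last row, first with a loop and then with sapply().

```r
# Hypothetical population figures (thousands) for three regions;
# these numbers are made up and are not the austpop data.
pop <- data.frame(year = c(1917, 1957, 1997),
                  A = c(100, 150, 250),
                  B = c(200, 300, 400),
                  C = c(50, 60, 75))
n <- nrow(pop)

# Explicit loop over columns 2 to 4
rates.loop <- numeric(3)
for (j in 2:4) {
  rates.loop[j - 1] <- (pop[n, j] - pop[1, j]) / pop[1, j]
}
names(rates.loop) <- names(pop)[-1]

# The same calculation with an inline function
rates.sapply <- sapply(pop[, -1], function(x) (x[n] - x[1]) / x[1])

all.equal(rates.loop, rates.sapply)
```

The sapply() version names its results automatically, which the loop version has to do as a separate step.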

1.8 R Graphics
The functions plot(), points(), lines(), text(), mtext(), axis(),
identify(), etc. form a suite that plots graphs and adds features to the graph. To see
some of the possibilities that R offers, enter

demo(graphics)

Press the Enter key to move to each new graph.



1.8.1 The function plot() and allied functions


The basic command is

plot(y ~ x)

where x and y must be the same length.

Readers may find the following plots interesting (note that sin() expects angles to be
in radians. Multiply angles that are given in degrees by π/180 to get radians):

Readers might show the second of these graphs to their friends, asking them to identify the
pattern! By holding with the left mouse button on the lower border until a double sided arrow
appears and dragging upwards, the vertical dimension of the graph sheet can be shortened.
If sufficiently shortened, the pattern becomes obvious. The eye has difficulty in detecting
patterns of change where the angle of slope is close to the horizontal or close to the vertical.
Then try this:

par(mfrow=c(3,1))     # Gives a 3 by 1 layout of plots
plot((1:50)*0.92, sin((1:50)*0.92))
par(mfrow=c(1,1))
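As a reminder of the conversion from degrees to radians, here is a small sketch. The vectors deg and rad and the choice of 15 degree steps are our own, for illustration only.

```r
deg <- seq(0, 360, by = 15)        # angles in degrees
rad <- deg * pi / 180              # convert to radians for sin()
plot(rad, sin(rad), type = "b",
     xlab = "Angle (radians)", ylab = "sin(angle)")
sin(90 * pi / 180)                 # sine of 90 degrees
```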

Here are two further examples.

attach(elasticband)        # R now knows where to find stretch
                           # and distance
plot(stretch, distance)    # Alternative: plot(distance ~ stretch)
detach(elasticband)

attach(austpop)
plot(year, ACT, type="l")  # Join the points ("l" = "line")
detach(austpop)

Fine control - parameter settings

When it is necessary to change the default parameter settings, use the par() function. We
have already used par(mfrow=c(m, n)) to get an m by n layout of graphs on a page.
Here is another example:

par(cex=1.25)

increases the text and plot symbol size 25% above the default. Adding mex=1.25 makes
room in the margin to accommodate the increased text size.
Figure 1.2: Brain weight versus body weight, for the primates data frame. (The x-axis is
body weight in kg; the labeled points are Human, Chimp, Gorilla, Rhesus monkey and
Potar monkey.)

It is good practice to store the existing settings, so that they can be restored later. For this,
specify, e.g.,

oldpar <- par(cex=1.25, mex=1.25)   # Use par(oldpar) to restore
                                    # earlier settings

The size of the axis annotation can be controlled, independently of the setting of cex, by
specifying a value for cex.axis. Similarly, cex.lab may be used to control the
size of the axis labels.
Type in help(par) to get a list of all the parameter settings that are available with
par().
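The store-and-restore idiom can be checked directly, because par() returns the previous values of whatever settings it changes. A minimal sketch follows; when it is run non-interactively, R draws to its default file device rather than to the screen.

```r
oldpar <- par(cex = 1.25, mex = 1.25)   # returns the previous values
par("cex")                              # now 1.25
plot(1:10)                              # drawn with larger text and symbols
par(oldpar)                             # restore the earlier settings
par("cex")                              # back to the earlier value
```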

Adding points, lines, text and axis annotation


Use the points() function to add points to a plot. Use the lines() function to add
lines to a plot. The text() function places text anywhere on the plot. (Actually points()
and lines() are identical, differing only in the default setting for the parameter type. The
default setting for points() is type = "p", and that for lines() is type = "l".
Explicitly setting type = "p" causes either function to plot points, type = "l" gives
lines.) The function mtext(text, side, line, ...) adds text in the margin of the
current plot. The sides are numbered 1 (x-axis), 2 (y-axis), 3 (top) and 4 (right vertical
axis). The axis() function gives fine control over axis ticks and labels.
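The following sketch puts these functions together on made-up data. The data values, the label text and the tick positions are our own choices for illustration.

```r
x <- 1:10
plot(x, x^2, type = "n", axes = FALSE,     # set up the plot region only
     xlab = "x", ylab = "y")
points(x, x^2, pch = 16)                   # add points
lines(x, 10 * x)                           # add a line
text(2, 80, "y = x^2 (points)", adj = 0)   # text at plot co-ordinates
mtext("Added in the top margin", side = 3, line = 1)
axis(1, at = c(1, 5, 10))                  # x-axis with chosen tick positions
axis(2)                                    # default y-axis
box()                                      # frame around the plot region
```

Starting with type = "n" and axes = FALSE gives an empty plot region, so that every visible feature comes from one of the functions described above.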

Use of the text() function to label points


In Figure 1.2 we have put labels on the points.
We begin with code that will give a crude version of Figure 1.2. The function
row.names() extracts the row names, which we then use as labels. We then use the
function text() to add text labels to the points.

data(primates)      # The DAAG package must be loaded
attach(primates)    # Needed if primates is not already
                    # attached.
plot(Bodywt, Brainwt, xlim=c(0, 300))
                    # Specify xlim so that there is room for the labels
text(x=Bodywt, y=Brainwt, labels=row.names(primates), adj=0)
                    # adj=0 gives left-justified text
detach(primates)

Figure 1.3: Each of 17 panelists compared two milk samples for sweetness. One of the samples had
one unit of additive, while the other had four units of additive.

The resulting graph would be adequate for identifying points, but it is not a presentation
quality graph. We now note the changes that are needed to get Figure 1.2. In Figure 1.2 we
use the xlab (x-axis) and ylab (y-axis) parameters to specify meaningful axis titles. We
move the labeling to one side of the points by including appropriate horizontal and vertical
offsets. We multiply chw <- par()$cxy[1] by 0.1 to get a horizontal offset that is
one tenth of a character width, and similarly for chh <- par()$cxy[2] in a vertical
direction. We use pch=16 to make the plot character a heavy black dot. This helps make
the points stand out against the labeling.
Here is the R code for Figure 1.2:
attach(primates)
plot(x=Bodywt, y=Brainwt, pch=16, xlab="Body weight (kg)",
     ylab="Brain weight (g)", xlim=c(0,300), ylim=c(0,1500))
chw <- par()$cxy[1]
chh <- par()$cxy[2]
text(x=Bodywt+chw, y=Brainwt+c(-.1, 0, 0, .1, 0)*chh,
     labels=row.names(primates), adj=0)
detach(primates)

Where xlim and/or ylim is not set explicitly, the range of data values determines the
limits. In any case, the axis is by default extended by 4% relative to those limits.
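The 4% extension can be seen by inspecting par("usr"), which holds the actual extents of the plot region. This is a sketch with limits chosen to make the arithmetic easy; it assumes the default axis style.

```r
plot(1:10, xlim = c(0, 10), ylim = c(0, 20))
usr <- par("usr")       # actual axis extents: x1, x2, y1, y2
usr[1:2]                # 0 - 0.04*10 and 10 + 0.04*10
usr[3:4]                # 0 - 0.04*20 and 20 + 0.04*20
```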

Rug plots
The function rug() adds vertical bars, showing the distribution of data values, along one
or both of the x- and y-axes of an existing plot. Figure 1.3 has rugs on both the x- and
y-axes. Data were from a tasting session where each of 17 panelists assessed the sweetness
of each of two milk samples, one with four units of additive, and the other with one unit of
additive. The code that produced Figure 1.3 is

data(milk)     # From the DAAG package
xyrange <- range(milk)
plot(four ~ one, data = milk, xlim = xyrange, ylim = xyrange,
     pch = 16)
rug(milk$one)
rug(milk$four, side = 2)
abline(0, 1)

Histograms and density plots


We mention these here for completeness. They will be discussed in Subsection 2.1.1.

The use of color


Try the following:

theta <- (1:50)*0.92
plot(theta, sin(theta), col=1:50, pch=16, cex=4)
points(theta, cos(theta), col=51:100, pch=15, cex=4)
palette()     # Names of the 8 colors in the default
              # palette

Points are in the eight distinct colors of the default palette, one of which is "white". These
are recycled as necessary.
The default palette is a small selection from the built-in colors. The function colors()
returns the 657 names of the built-in colors, some of them aliases for the same color. The
following repeats the plots above, but now using the first 100 of the 657 built-in colors.

theta <- (1:50)*0.92
plot(theta, sin(theta), col=colors()[1:50], pch=16, cex=4)
points(theta, cos(theta), col=colors()[51:100], pch=15, cex=4)
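The sizes of the two sets of colors, and the recycling of the palette, can be checked as follows. This is a sketch only; the vector cols is our own, and it spells out the recycling that R does implicitly when col is given numbers larger than the palette size.

```r
length(palette())            # number of colors in the current palette
length(colors())             # number of built-in color names
cols <- rep(palette(), length.out = 50)   # explicit recycling to 50 values
plot(1:50, rep(1, 50), col = cols, pch = 16, cex = 2)
```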

1.8.2 Identification and location on the figure region


Two functions are available for this purpose. Draw the graph first, then call one or other of
these functions:

identify() labels points;
locator() prints out the co-ordinates of points.

In either case, the user positions the cursor at the location for which co-ordinates are required,
and clicks the left mouse button. Depending on the platform, the identification or labeling
of points may be terminated by pointing outside of the graphics area and clicking, or by
clicking with a button other than the first. The process will anyway terminate after some
default number n of points, which the user can set. (For identify() the default setting
is the number of data points, while for locator() the default is 512.)
Figure 1.4: The y-axis label is a mathematical expression. (The x-axis, labeled p, runs
from 0.0 to 1.0.)

As an example, identify two of the plotted points on the primates scatterplot:


attach(primates)
plot(Bodywt, Brainwt)
identify(Bodywt, Brainwt, n=2)     # Now click near 2 plotted points
detach(primates)

1.8.3 Plotting mathematical symbols


Both text() and mtext() allow replacement of the text string by a mathematical
expression. In plot(), either or both of xlab and ylab can be an algebraic expression.
Figure 1.4 was produced with
p <- (0:100)/100
plot(p, sqrt(p*(1-p)), ylab=expression(sqrt(p(1-p))), type="l")
Type help(plotmath) to get details of available forms of mathematical expression.
The final plot from demo(graphics) shows some of the possibilities for plotting
mathematical symbols. There are brief further details in Section 12.10.
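Here is a further sketch, on a made-up curve, that uses expressions in xlab, ylab, text() and mtext(). The particular expressions and the positions of the annotations are our own choices for illustration.

```r
x <- seq(0.1, 5, length.out = 50)
plot(x, x^2 * exp(-x), type = "l",
     xlab = expression(x),
     ylab = expression(x^2 * e^{-x}))
text(3, 0.2, expression(hat(y) == alpha + beta * x))   # expression in text()
mtext(expression(sqrt(p * (1 - p))), side = 3)         # and in mtext()
```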

1.8.4 Row by column layouts of plots


There are several ways to do this. Here, we will demonstrate two of them.

Multiple plots, each with its own margins


As noted in earlier sections, the parameter mfrow can be used to configure the graphics
sheet so that subsequent plots appear row by row, one after the other in a rectangular
layout, on the one page. For a column by column layout, use mfcol. The following
example gives a plot that displays four different transformations of the Animals
data.
par(mfrow=c(2,2))     # 2 by 2 layout on the page
library(MASS)         # Animals is in the MASS package
data(Animals)         # Needed if Animals is not already loaded
attach(Animals)
plot(body, brain)
plot(sqrt(body), sqrt(brain))
plot((body)^0.1, (brain)^0.1)
plot(log(body), log(brain))
detach("Animals")
par(mfrow=c(1,1))     # Restore to 1 figure per page

Figure 1.5: Total length of possums versus age, for each combination of population (the Australian
state of Victoria or other) and sex (female or male). Further details of these data are in Subsection 2.1.1.

Multiple panels - the lattice function xyplot()

The function xyplot() in the lattice package gives a rows by columns (x by y) layout
of panels in which the axis labeling appears in the outer margins. Figure 1.5 is an example.
Enter

> library(lattice)
> data(possum)                      # DAAG must be loaded
> table(possum$Pop, possum$sex)     # Graph reflects layout of this
                                    # table
         f  m
  Vic   24 22
  other 19 39
> xyplot(totlngth ~ age | sex*Pop, data=possum)

Note that, as we saw in Subsection 1.5.4, there are missing values for age in rows 44 and
46 that xyplot() has silently omitted. The factors that determine the layout of the panels,
i.e., sex and Pop in Figure 1.5, are known as conditioning variables.

There will be further discussion of the lattice package in Subsection 2.1.5. It has functions
that offer a similar layout for many different types of plot. To see further examples of the
use of xyplot(), and of some of the other lattice functions, type in
example(xyplot)

Further points to note about the lattice package are:


The lattice package implements trellis style graphics, as used in Cleveland (1993). This
is why functions that control stylistic features (color, plot characters, line type, etc.) have
trellis as part of their name.
Lattice graphics functions cannot be mixed with the graphics functions discussed earlier
in this subsection. It is not possible to use points(), lines(), text(), etc., to
add features to a plot that has been created using a lattice graphics function. Instead,
it is necessary to use functions that are special to lattice - lpoints(), llines(),
ltext(), etc.
For inclusion, inside user functions, of statements that will print lattice graphs, see
the note near the end of Subsection 2.1.5. An explicit print statement is typically
required, e.g.
print(xyplot(totlngth ~ age | sex*Pop, data=possum))
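The need for print() can be illustrated with a small user function. In this sketch the function name show.tooth is hypothetical, and we use the built-in ToothGrowth data frame so that DAAG need not be loaded. We assume that the lattice package is installed, as it is in standard R distributions.

```r
library(lattice)

# Hypothetical helper: draws a conditioned scatterplot of the built-in
# ToothGrowth data. The print() call is what makes the graph appear
# when the function is called.
show.tooth <- function() {
  p <- xyplot(len ~ dose | supp, data = ToothGrowth)
  print(p)
  invisible(p)
}
tg.plot <- show.tooth()
class(tg.plot)               # "trellis"
```

A call to xyplot() creates a graphics object; the graph is drawn only when that object is printed. At the command line this happens automatically, but inside a function it must be requested.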

1.8.5 Graphs - additional notes


Graphics devices
On most systems, x11() will open a new graphics window. See help(x11).
On Macintosh systems that do not have an X11 driver, use macintosh(). See
help(Devices) for a list of devices that can be used for writing to a file or to hard
copy. Use dev.off() to close the currently active graphics device.
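As an example of a device that writes to a file, the following sketch uses pdf(), one of the devices listed by help(Devices). The file name women.pdf and the page dimensions are our own choices for illustration.

```r
outfile <- file.path(tempdir(), "women.pdf")  # illustrative file name
pdf(outfile, width = 6, height = 4)           # open a PDF device (inches)
plot(weight ~ height, data = women)
dev.off()                                     # close the device, writing the file
file.exists(outfile)
```

The file is complete only once dev.off() has been called.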

The shape of the graph sheet


It is often desirable to control the shape of the graph page. For example, we might want
the individual plots to be rectangular rather than square. The function x11() sets up a
graphics page on the screen display. It takes arguments width (in inches), height (in
inches) and pointsize (in 1/72 of an inch). The setting of pointsize (default = 12)
determines character height.²

² It is the relative sizes of these parameters that matter for screen display or for incorporation into Word and similar programs.
Once pasted (from the clipboard) or imported into Word, graphs can be enlarged or shrunk by pointing at one corner, holding
down the left mouse button, and pulling.

Plot methods for objects other than vectors

We have seen how to plot a numeric vector y against a numeric vector x. The plot
function is a generic function that also has special methods for "plotting" various
different classes of object. For example, one can give a data frame as the argument to
plot. Try

data(trees)     # Load data frame trees (base package)
plot(trees)     # Has the same effect as pairs(trees)

The pairs() function will be important when we come to discuss multiple regression.
See Subsection 6.1.4, and later examples in that chapter.

Good and bad graphs


There is a difference!
Draw graphs so that they are unlikely to mislead, make sure that they focus the eye on
features that are important, and avoid distracting features. In scatterplots, the intention is
typically to draw attention to the points. If there are not too many of them, drawing them
as heavy black dots or other symbols will focus attention on the points, rather than on a
fitted line or curve or on the axes. If they are numerous, dots are likely to overlap. It then
makes sense to use open symbols. Where there are many points that overlap, the ink will
be denser. If there are many points, it can be helpful to plot points in a shade of gray.³
Where the horizontal scale is continuous, patterns of change that are important to identify
should have an angle of slope in the approximate range 20° to 70°. (This was the point of
the sine curve example in Subsection 1.8.1.)
There is a huge choice and range of colors. Colors, or gray scales, can often be used to
useful effect to distinguish groupings in the data. Bear in mind that the eye has difficulty in
focusing simultaneously on widely separated colors that appear close together on the same
graph.

1.9 Additional Points on the Use of R in This Book


Functions
Functions are integral to the use of the R language. Perhaps the most important topic that we
have left out of this chapter is a description of how users can write their own functions. User-
written functions are used in exactly the same way as built-in functions. Subsection 12.2.2
describes how users may write their own functions. Examples will appear from time to time
through the book.
An incidental advantage from putting code into functions is that the workspace is not
then cluttered with objects that are local to the function.
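As a foretaste, here is a minimal sketch of a user-written function. The function growth.rate is hypothetical; it packages the growth rate calculation of Section 1.7 so that it can be applied to any numeric vector.

```r
# Hypothetical function: relative change from the first to the last
# element of a numeric vector.
growth.rate <- function(x) {
  n <- length(x)             # n exists only inside the function
  (x[n] - x[1]) / x[1]
}
growth.rate(c(100, 150, 250))
sapply(women, growth.rate)   # used in the same way as a built-in function
```

The object n is local to the function, and calling growth.rate() does not create an n in the workspace.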

Setting the number of decimal places in output


Often, calculations will, by default, give more decimal places of output than are useful. In
the output that we give, we often reduce the number of decimal places below what R gives
by default. The options() function can be used to make a global change to the number

³ ## Example of plotting with different shades of gray
plot(1:4, 1:4, pch=16, col=c("gray20", "gray40",
     "gray60", "gray40"), cex=2)