Author(s): John Maindonald, John Braun
ISBN(s): 9780521861168, 0511250592
Edition: 2
Year: 2006
Language: english
Data Analysis and Graphics Using R, Second Edition
Join the revolution ignited by the ground-breaking R system! Starting with an introduction
to R, covering standard regression methods, then presenting more advanced topics, this
book guides users through the practical and powerful tools that the R system provides. The
emphasis is on hands-on analysis, graphical display and interpretation of data. The many
worked examples, taken from real-world research, are accompanied by commentary on
what is done and why. A website provides computer code and data sets, allowing readers
to reproduce all analyses. Updates and solutions to selected exercises are also available.
Assuming basic statistical knowledge and some experience of data analysis, the book is
ideal for research scientists, final-year undergraduate or graduate level students of applied
statistics, and practicing statisticians. It is both for learning and for reference.
This second edition reflects changes in R since 2003. There is new material on
survival analysis, random coefficient models and the handling of high-dimensional data.
The treatment of regression methods has been extended, including a brief discussion
of errors in predictor variables. Both text and code have been revised throughout, and
where possible simplified. New graphs have been added.
John Maindonald is Visiting Fellow at the Centre for Mathematics and its
Applications, Australian National University. He has collaborated extensively with
scientists in a wide range of application areas, from medicine and public health to
population genetics, machine learning, economic history and forensic linguistics.
John Braun is Associate Professor of Statistical and Actuarial Sciences, University of
Western Ontario. He has collaborated with biostatisticians, biologists, psychologists and
most recently has become involved with a network of forestry researchers.
Data Analysis and Graphics
Using R – an Example-Based Approach
Second Edition
CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC
MATHEMATICS
Editorial Board
Already published
1. Bootstrap Methods and Their Application, by A. C. Davison and D. V. Hinkley
2. Markov Chains, by J. Norris
3. Asymptotic Statistics, by A. W. van der Vaart
4. Wavelet Methods for Time Series Analysis, by Donald B. Percival and Andrew
T. Walden
5. Bayesian Methods, by Thomas Leonard and John S. J. Hsu
6. Empirical Processes in M-Estimation, by Sara van de Geer
7. Numerical Methods of Statistics, by John F. Monahan
8. A User’s Guide to Measure Theoretic Probability, by David Pollard
9. The Estimation and Tracking of Frequency, by B. G. Quinn and E. J. Hannan
10. Data Analysis and Graphics using R, by John Maindonald and W. John Braun
11. Statistical Models, by A. C. Davison
12. Semiparametric Regression, by D. Ruppert, M. P. Wand, R. J. Carroll
13. Exercises in Probability, by Loic Chaumont and Marc Yor
14. Statistical Analysis of Stochastic Processes in Time, by J. K. Lindsey
15. Measure Theory and Filtering, by Lakhdar Aggoun and Robert Elliott
16. Essentials of Statistical Inference, by G. A. Young and R. L. Smith
17. Elements of Distribution Theory, by Thomas A. Severini
18. Statistical Mechanics of Disordered Systems, by Anton Bovier
19. The Coordinate-Free Approach to Linear Models, by Michael J. Wichura
20. Random Graph Dynamics, by Rick Durrett
Data Analysis and Graphics
Using R – an Example-Based Approach
Second Edition
John Maindonald
Centre for Mathematics and its Applications, Australian National University
and
W. John Braun
Department of Statistical and Actuarial Science, University of Western Ontario
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press has no responsibility for the persistence or accuracy of urls
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.
It is easy to lie with statistics. It is hard to tell the truth without statistics.
[Andrejs Dunkels]
1 A brief introduction to R
1.1 An overview of R
1.1.1 A short R session
1.1.2 The uses of R
1.1.3 Online help
1.1.4 Further steps in learning R
1.2 Data input, packages and the search list
1.2.1 Reading data from a file
1.2.2 R packages
1.3 Vectors, factors and univariate time series
1.3.1 Vectors in R
1.3.2 Concatenation – joining vector objects
1.3.3 Subsets of vectors
1.3.4 Patterned data
1.3.5 Missing values
1.3.6 Factors
1.3.7 Time series
1.4 Data frames and matrices
1.4.1 The attaching of data frames
1.4.2 Aggregation, stacking and unstacking
1.4.3∗ Data frames and matrices
1.5 Functions, operators and loops
1.5.1 Built-in functions
1.5.2 Generic functions and the class of an object
1.5.3 User-written functions
1.5.4 Relational and logical operators and operations
1.5.5 Selection and matching
1.5.6 Functions for working with missing values
1.5.7∗ Looping
1.6 Graphics in R
1.6.1 The function plot( ) and allied functions
1.6.2 The use of color
3 Statistical models
3.1 Regularities
3.1.1 Deterministic models
References
This book is an exposition of statistical methodology that focuses on ideas and concepts,
and makes extensive use of graphical presentation. It avoids, as much as possible, the
use of mathematical symbolism. It is particularly aimed at scientists who wish to do
statistical analyses on their own data, preferably with reference as necessary to professional
statistical advice. It is intended to complement more mathematically oriented accounts of
statistical methodology. It may be used to give students with a more specialist statistical
interest exposure to practical data analysis.
While no prior knowledge of specific statistical methods or theory is assumed, there is
a demand that readers bring with them, or quickly acquire, some modest level of statistical
sophistication. Readers should have some prior exposure to statistical methodology, some
prior experience of working with real data, and be comfortable with the typing of analysis
commands into the computer console. Some prior familiarity with regression and with
analysis of variance will be helpful.
We cover a range of topics that are important for many different areas of statistical
application. As is inevitable in a book that has this broad focus, there will be investigators
working in specific areas – perhaps epidemiology, or psychology, or sociology, or
ecology – who will regret the omission of some methodologies that they find important.
We comment extensively on analysis results, noting inferences that seem well-founded,
and noting limitations on inferences that can be drawn. We emphasize the use of graphs
for gaining insight into data – in advance of any formal analysis, for understanding the
analysis, and for presenting analysis results.
The data sets that we use as a vehicle for demonstrating statistical methodology have
been generated by researchers in many different fields, and have in many cases featured
in published papers. As far as possible, our account of statistical methodology comes
from the coalface, where the quirks of real data must be faced and addressed. Features
that may challenge the novice data analyst have been retained. The diversity of examples
has benefits, even for those whose interest is in a specific application area. Ideas and
applications that are useful in one area often find use elsewhere, even to the extent
of stimulating new lines of investigation. We hope that our book will stimulate such
cross-fertilization.
To summarize: the strengths of this book include the directness of its encounter with
research data, its advice on practical data analysis issues, the inclusion of code that
reproduces analyses, careful critiques of analysis results, attention to graphical and other
presentation issues, and the use of examples drawn from across the range of statistical
applications.
John Braun wrote the initial drafts of Subsections 4.7.3, 4.7.4, 5.5.3, 6.8.5, 8.4.1
and Section 9.3. Initial drafts of remaining material were, mostly, from John Maindonald’s
hand. A substantial part was derived, initially, from the lecture notes of courses
for researchers, at the University of Newcastle (Australia) over 1996–1997 and at The
Australian National University over 1998–2001. Both of us have worked extensively
over the material in these chapters. John Braun has taken primary responsibility for
maintenance of the DAAG package.
The R system
We use the R system for the computations. The R system implements a dialect of the
influential S language, developed at AT&T Bell Laboratories by Rick Becker, John
Chambers and Allan Wilks, which is the basis for the commercial S-PLUS system.
It follows S in its close linkage between data analysis and graphics. Versions of R
are available, at no charge, for 32-bit versions of Microsoft Windows, for Linux and
other Unix systems, and for the Macintosh. It is available through the Comprehensive
R Archive Network (CRAN). Go to http://cran.r-project.org/, and find the
nearest mirror site.
The development model used for R has proved highly effective in marshalling high
levels of computing expertise for continuing improvement, for identifying and fixing
bugs, and for responding quickly to the evolving needs and interests of the statistical
community. Oversight of “base R” is handled by the R Core Team, whose members are
widely drawn internationally. Use is made of code, bug fixes and documentation from
the wider R user community. Especially important are the large number of packages that
supplement base R, and that anyone is free to contribute. Once installed, these attach
seamlessly into the base system.
Many of the analyses offered by R’s packages were not, 10 years ago, available in
any of the standard statistical packages. What did data analysts do before we had such
packages? Basically, they adapted more simplistic (but not necessarily simpler) analyses
as best they could. Those whose skills were unequal to the task did unsatisfactory
analyses. Those with more adequate skills carried out analyses that, even if not elegant
and insightful by current standards, were often adequate. Tools such as are available
in R have reduced the need for the adaptations that were formerly necessary. We can
often do analyses that better reflect the underlying science. There have been challenging
and exciting changes from the methodology that was typically encountered in statistics
courses 10 or 15 years ago.
In the ongoing development of R, priorities have been: the provision of good data
manipulation abilities; flexible and high-quality graphics; the provision of data analysis
methods that are both insightful and adequate for the whole range of application area
demands; seamless integration of the different components of R; and the provision of
interfaces to other systems (editors, databases, the web, etc.) that R users may require.
Ease of use is important, but not at the expense of power, flexibility and checks against
answers that are potentially misleading.
Depending on the user’s level of skill with R, there will be some relatively routine
tasks where another system may seem simpler to use. Note however the availability of
interfaces, notably John Fox’s Rcmdr, that give a graphical user interface (GUI) to a
limited part of R. Such interfaces will develop and improve as time progresses. They may
in due course, for many users, be the preferred means of access to R. Be aware that the
demand for simple tools will commonly place limitations on the tasks that can, without
professional assistance, be satisfactorily undertaken.
Primarily, R is designed for scientific computing and for graphics. Among the packages
that have been added are many that are not obviously statistical – for drawing and coloring
maps, for map projections, for plotting data collected by balloon-borne weather instruments,
for creating color palettes, for working with bitmap images, for solving sudoku puzzles,
for creating magic squares, for reading and handling shapefiles, for solving ordinary
differential equations, for processing various types of genomic data, and so on. Check
through the list of R packages that can be found on any of the CRAN sites, and you may
be surprised at what you find!
The citation for John Chambers’ 1998 Association for Computing Machinery Software
award stated that S has “forever altered how people analyze, visualize and manipulate
data.” The R project enlarges on the ideas and insights that generated the S language.
We are grateful to the R Core Team, and to the creators of the various R
packages, for bringing into being the R system – this marvellous tool for scientific
and statistical computing, and for graphical presentation. We list at the end of the
reference section the authors and compilers of packages that have been used in this
book.
who joins the R community can expect to witness, and/or engage in, lively debate that
addresses these and related issues. Such debate can help ensure that the demands of
scientific rationality do in due course win out over influences from accidents of historical
development.
Acknowledgments
Many different people have helped us with this project. Winfried Theis (University
of Dortmund, Germany) and Detlef Steuer (University of the Federal Armed Forces,
Hamburg, Germany) helped with technical aspects of working with LaTeX, with setting
up a CVS server to manage the LaTeX files, and with helpful comments. Lynne Billard
(University of Georgia, USA), Murray Jorgensen (University of Waikato, NZ) and Berwin
Turlach (University of Western Australia) gave valuable help in the identification of errors
and text that required clarification. Susan Wilson (Australian National University) gave
welcome encouragement. Duncan Murdoch (University of Western Ontario) helped set
up the DAAG package, and has supplied valuable technical advice. Thanks also to Cath
Lawrence (Australian National University) for her Python program that allowed us to
extract the R code, as and when required, from our LaTeX files; this has now at length
become an R function. Many of the tables in this book were generated, in first draft form,
using the xtable() function from the xtable package for R.
For this second edition, Brian Ripley (University of Oxford) has gone through the
manuscript and made extensive comments, leading to important corrections and improvements.
We are most grateful to him, and to others who have commented on the manuscript.
Alan Welsh (Australian National University) has been helpful in working through points
where it has seemed difficult to get the emphasis right. Once again, Duncan Murdoch
has given much useful technical advice. Others who have made helpful comments and/or
pointed out errors include Jeff Wood (Australian National University), Nader Tajvidi
(University of Lund), Paul Murrell (University of Auckland, on Section 14.11), Graham
Williams (http://www.togaware.com, on Chapter 1) and Yang Yang (University of
Western Ontario, on Chapter 10). The failings that remain are, naturally, our responsibility.
A strength of this book is the extent to which it has drawn on data from many different
sources. We give a list, following the list of references for the data near the end of the
book, of individuals and/or organizations to whom we are grateful for allowing use of
data. We are grateful to those who have allowed us to use their data. At least these
data will not, as often happens once data have become the basis for a published paper,
gather dust in a long-forgotten folder! We are grateful, also, to the many researchers who,
in their discussions with us, have helped stimulate our thinking and understanding. We
apologize if there is anyone that we have inadvertently failed to acknowledge.
Diana Gillooly of Cambridge University Press, taking over from David Tranah for this
new edition, has been a marvellous source of advice and encouragement throughout the
revision process.
Conventions
Text that is R code, or output from R, is printed in a verbatim text style. For example,
in Chapter 1 we will enter data into an R object that we call austpop. We will use the
plot() function to plot these data. The names of R packages, including our own DAAG
package, are printed in italics.
Starred exercises and sections identify more technical items that can be skipped at a
first reading.
Solutions to exercises
Solutions to selected exercises, R scripts that have all the code from the book and other
supplementary materials are available via the link given at
http://www.maths.anu.edu.au/~johnm/r-book
1
A brief introduction to R
This first chapter introduces readers to the basics of R. It provides the minimum of
information that is needed for running the calculations that are described in later chapters.
The first section may cover most of what is immediately necessary. The rest of the chapter
may be used as a reference. Chapter 14 extends this material considerably.
Most of the R commands will run without change in S-PLUS.
1.1 An overview of R
1.1.1 A short R session
R must be installed!
An up-to-date version of R may be downloaded from a Comprehensive R Archive
Network (CRAN) mirror site. There are links at http://cran.r-project.org/.
Installation instructions are provided at the web site for installing R in Windows, Unix,
Linux, and version 10 of the Macintosh operating system. Various contributed packages
are now a part of the standard R distribution, but a number are not; any of these may be
installed as required. Data sets that are mentioned in this book, and that are not (in most
cases) available in other packages, have been collected into our DAAG package that is
available from CRAN sites.
For most Windows users, R can be installed by clicking on the icon that appears on
the desktop once the Windows binary has been downloaded from CRAN. An installation
program will then guide the user through the process. By default, an R icon will be placed
on the user’s desktop. The R system can be started by double-clicking on that icon.
The DAAG package can be installed under Windows by starting R and clicking on the
Packages Menu. From that menu, choose Install Packages. If a mirror site has not been
set earlier, this gives a pop-up menu from which a site must be chosen. Once this choice
is made, a new pop-up window appears with the entire list of available R packages.
Clicking on DAAG will cause it to be downloaded and installed.
Year Carbon
1 1800 8
2 1850 54
3 1900 534
4 1950 1630
5 2000 6611
semicolon (;) as the separator). This allows the use of R as a calculator. For example,
type 2+2 and press the Enter key. Here is what appears on the screen:
> 2+2
[1] 4
>
The first element is labeled [1] even when, as here, there is just one element! The final
> prompt indicates that R is ready for another command.
In a sense this chapter, and much of the rest of the book, is a discussion of what
is possible by typing in statements at the command line. Practice in the evaluation of
arithmetic expressions will help develop the needed conceptual and keyboard skills. Here
are simple examples:
> 2*3*4*5 # * denotes ’multiply’
[1] 120
> sqrt(10) # the square root of 10
[1] 3.162278
> pi # R knows about pi
[1] 3.141593
> 2*pi*6378 # Circumference of earth at equator (km)
# (radius at equator is 6378 km)
[1] 40074.16
Anything that follows a # on the command line is taken as comment and ignored by R.
A continuation prompt, by default +, appears following a carriage return when the
command is not yet complete. (In this book we will omit both the prompt (>) and the
continuation prompt (+), whenever command line statements are given separately from
output.)
Figure 1.1 Plot of Carbon against Year, for the data in Table 1.1.
• The <- is a left angle bracket (<) followed by a minus sign (-). It means “the values
on the right are assigned to the name on the left.”
• The objects Year and Carbon are vectors which were each formed by joining
(concatenating) separate numbers together. Thus c(8, 54, 534, 1630, 6611)
joined the numbers 8, 54, 534, 1630, 6611 together to form the vector Carbon. See
Subsection 1.3.2 for further details on this.
• The construct Carbon ~ Year is a graphics formula. The plot() function interprets
this formula to mean “Plot Carbon as a function of Year” or “Plot Carbon
on the y-axis against Year on the x-axis.”
• The setting pch=16 (where pch is “plot character”) gives a solid black dot.
• Case is significant for names of R objects or commands. Thus, Carbon is different
from carbon.
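
The commands that these notes describe can be sketched as follows. This is a minimal reconstruction rather than a verbatim listing: it assumes that Year and Carbon hold the five values shown in Table 1.1, and it uses only the c() and plot() calls that the notes mention.

```r
# Values as listed in Table 1.1
Year <- c(1800, 1850, 1900, 1950, 2000)
Carbon <- c(8, 54, 534, 1630, 6611)

# Plot Carbon (y-axis) against Year (x-axis), using solid dots
plot(Carbon ~ Year, pch=16)
```

Because case is significant, typing carbon in place of Carbon at this point would give an "object not found" error.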
We can make various modifications to this basic plot. We can specify more informative
axis labels, change the sizes of the text and of the plotting symbol, add a title, and so on.
More information is given in Section 1.6.
4 1950 1630
5 2000 6611
> rm(Year, Carbon) # These are no longer required
The vector objects year and carbon become columns in the data frame.
An alternative to the plotting command that gave Figure 1.1 is then:
plot(carbon ~ year, data=fossilfuel, pch=16)
The data=fossilfuel argument instructs plot() to start its search for each of
carbon and year by looking among the columns of fossilfuel.
There are several ways to identify columns by name. Here, note that the second column
can be referred to as fossilfuel[, 2], or as fossilfuel[, "carbon"], or as
fossilfuel$carbon.
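
As a quick check that the three forms select the same column, here is a small sketch. It assumes that fossilfuel has the columns year and carbon with the values of Table 1.1; the data frame is rebuilt here so that the example is self-contained, and the names by_position, by_name and by_dollar are introduced only for illustration.

```r
# Rebuild the data frame (assumed layout: columns year and carbon)
fossilfuel <- data.frame(year=c(1800, 1850, 1900, 1950, 2000),
                         carbon=c(8, 54, 534, 1630, 6611))

by_position <- fossilfuel[, 2]         # second column, by position
by_name <- fossilfuel[, "carbon"]      # same column, by name
by_dollar <- fossilfuel$carbon         # same column, using $

# Both comparisons should return TRUE
identical(by_position, by_name)
identical(by_name, by_dollar)
```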
Data frames are the preferred way for organizing data sets that are of modest size. For
now, think of data frames as a rectangular row by column layout, where the rows are
observations and the columns are variables. More information about data frames can be
found in Section 1.4. Subsection 1.2.1 will demonstrate the reading of data from a file,
entering them into a data frame.
Objects that the user creates or copies from elsewhere go into the user workspace. In
order to list the contents of the workspace, type:
ls()
The only object left over from the current session should be fossilfuel. There may
additionally be objects that are left over from previous sessions (if any) in the same
directory, and that were loaded when the session started.
Quitting R
Use the q() function to quit (exit) from R:
q()
There will be a message asking whether to save the workspace image. Clicking Yes has the
effect that, before quitting, all the objects that remain in the workspace are saved
in an “image” file that has the name .RData. This file is an “image” of the workspace
immediately before quitting the session, and will be used to restore the workspace when
a new session is again started in that directory. (Note that while delaying the saving of
important objects till the end of the session is acceptable when working in a learning
mode, it is not a good strategy when using R in production mode. Advice on saving and
backing up through the course of the session will be given in Section 1.8 and, in more
detail, in Subsection 14.1.2.)
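
One way to save objects before the end of a session is sketched below. This is an illustration only, not the procedure of Section 1.8: the object name results and the file names are hypothetical, and files are written to a temporary directory so that nothing in the working directory is overwritten.

```r
# Hypothetical object that we do not want to lose
results <- c(a=1, b=2)

# Save one named object to a file of our choosing
backup_file <- file.path(tempdir(), "results.RData")
save(results, file=backup_file)

# Remove the object, then restore it from the file
rm(results)
load(backup_file)
results

# save.image() saves the whole workspace; with no file argument
# it writes .RData in the working directory, as q() does
save.image(file=file.path(tempdir(), "workspace.RData"))
```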
Depending on the implementation, alternatives to typing q() may be to click on the
File menu and then on Exit, or to click on the × in the top right-hand corner of the R
window. (Under Linux, depending on the window manager that is used, clicking on ×
may exit from the program, but without asking whether to save the workspace image.
Check the behavior on your installation.)
Note: In order to quit the R session we had to type q(). The round brackets are
necessary because q is a function. Typing q on its own, without the brackets, displays
the text of the function on the screen. Try it!
> range(fossilfuel$carbon)
[1] 8 6611
By no means are R’s abilities limited to numerical calculation. Here are examples that
involve character strings:
> ## 4 cities
> fourcities <- c("Toronto", "Canberra", "New York", "London")
> ## display in alphabetical order
> sort(fourcities)
[1] "Canberra" "London" "New York" "Toronto"
> ## Find the number of characters in "Toronto"
> nchar("Toronto")
[1] 7
>
> ## Find the number of characters in all four city names at once
> nchar(fourcities)
[1] 7 8 8 6
Thus, we can immediately see that the range of speeds (first column) is from 4 mph to
25 mph, and that the range of distances (second column) is from 2 feet to 120 feet.
Graphical alternatives to summary(), including histograms and boxplots, are discussed
and demonstrated in Sections 1.7 and 2.1. Try for example:
hist(cars$speed)
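
The ranges quoted above can be checked directly. The sketch below assumes that the data frame in question is the cars data set from R's datasets package, with columns speed and dist, as the hist() call suggests.

```r
# Numerical summary of both columns
summary(cars)

# Ranges of speed (mph) and stopping distance (feet)
speed_range <- range(cars$speed)
dist_range <- range(cars$dist)
speed_range
dist_range

# A boxplot is another graphical alternative to summary()
boxplot(cars$dist)
```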
frequently upgraded. Type help(help) to get information on the help features of the
system that is in use. To get help on, for example, plot(), type:
help(plot)
The functions apropos() and help.search() offer means for searching for
functions that perform a desired task. Specific examples seem the best way to explain
their use. Thus try:
example(image)
# for a 2 by 2 layout of the last 4 plots, precede with
# par(mfrow=c(2,2))
# To prompt for each new graph, precede with par(ask=TRUE)
# When finished, turn off the prompts with par(ask=FALSE)
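
For the search functions apropos() and help.search() mentioned above, a minimal sketch might look as follows. The search string "sort" is chosen only for illustration, and the search is restricted to the base package (an assumption made here to keep it fast).

```r
# Names of objects on the search list whose names contain "sort"
sort_names <- apropos("sort")
sort_names

# Search the help pages of the base package for "sort"
hits <- help.search("sort", package="base")
nrow(hits$matches)   # number of matching help entries
```

apropos() matches against object names only, whereas help.search() also looks in the titles and keywords of help pages, so the two can return quite different results.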
In learning to use a new function, it may be helpful to create a simple artificial data
set, or to create a small subset from a larger data set, and use this for experimentation.
For extensive experimentation, consider moving to a new working directory and working
with copies of any user data sets and functions.
The help pages, while not an encyclopedia on statistical methodology, have much
helpful information on the methods whose implementation they document. Some of
the abilities that they document will bring pleasant surprises. There are many insightful
and helpful examples, there are references to related functions, and there
are references to papers and books that give the relevant theory. It can help enormously,
before launching into the use of an R function, to check the relevant help
page!
Wide-ranging searches
The function RSiteSearch() makes it possible (assuming a live internet connection)
to search R manuals and help pages, and the R-help mailing list archives, for key words or
phrases. It has a parameter restrict that allows some limited targeting of the search.
See help(RSiteSearch) for details.
Note the use of header=TRUE to ensure that R uses the first line to get header information
for the columns.
Type fossilfuel at the command line prompt, and the data will be displayed almost
as they appear in Table 1.1 (the only difference is the introduction of row labels in the R
output). Data read into R with the read.table() function are automatically converted
to a data frame.
We have assumed that the fields in fuel.txt are separated by spaces and/or tabs, as
allowed by the default setting (sep="") for read.table(). Other parameter settings
are sometimes required; note in particular:
reads data from a file fuel.csv where the fields have been separated by commas.
Consult the help page for read.table() for other options.
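
The two ways of reading the data can be sketched as follows. The file names fuel.txt and fuel.csv are those used in the text; so that the example runs anywhere, the files are first written to a temporary directory with the Table 1.1 values, a step that is an assumption of this sketch and not part of the original workflow.

```r
# Create small example files (assumed contents: the Table 1.1 values)
fuel <- data.frame(year=c(1800, 1850, 1900, 1950, 2000),
                   carbon=c(8, 54, 534, 1630, 6611))
txt_file <- file.path(tempdir(), "fuel.txt")
csv_file <- file.path(tempdir(), "fuel.csv")
write.table(fuel, txt_file, row.names=FALSE, quote=FALSE)
write.table(fuel, csv_file, row.names=FALSE, quote=FALSE, sep=",")

# Fields separated by spaces and/or tabs: the default, sep=""
from_txt <- read.table(txt_file, header=TRUE)

# Fields separated by commas: set sep=","
from_csv <- read.table(csv_file, header=TRUE, sep=",")

from_txt
```

The function read.csv() is a wrapper for read.table() whose defaults include header=TRUE and sep=",".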
1.2.2 R packages
The recommended R distribution includes a number of packages in its library. These
are collections of functions and data. We will make frequent use both of the MASS
package (Venables and Ripley, 2002) and of our own DAAG package. DAAG, and other
packages that are not included with the default distribution, can be readily downloaded
and installed.
The base package, the stats package, the datasets package and several other packages,
are automatically attached at the beginning of a session. Other installed packages must
be explicitly attached prior to use. Use sessionInfo() to see which packages are
currently attached. To attach any further installed package, use the library() function.
For example:
> library(DAAG)
> sessionInfo()
R version 2.0.1, 2004-11-15, powerpc-apple-darwin6.8
Replace "datasets" by the name of any other installed package, as required (type
library() to get the names of the installed packages). In most packages that
are intended for recent versions of R (2.0.0 or later), these data sets become available
automatically once the package has been attached. They will be brought into
the workspace when and if required. (In older versions of R, or in packages that
have not implemented the lazy data mechanism, explicit use of a command of the
form data(airquality) may be necessary, bringing the data object into the user’s
workspace.)
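
A short sketch of these commands follows, using the datasets package and its airquality data frame. The variable name available is introduced here only for illustration.

```r
# List the data sets that the datasets package provides
available <- data(package="datasets")
head(available$results[, "Item"])

# Explicitly bring one data set into the workspace
# (needed only where the lazy data mechanism is not in use)
data(airquality)
head(airquality, 3)
```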
1.3.1 Vectors in R
The vector modes that will be described here (there are others) are “numeric,” “logical,”
and “character.” Examples of vectors are:
> c(2, 3, 5, 2, 7, 1)
[1] 2 3 5 2 7 1
> c(T, F, F, F, T, T, F)
[1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE
The first two vectors above are numeric, the third is logical, and the fourth is a character
vector. Note the use of the global variables F(=FALSE) and T(=TRUE) as a convenient
shorthand when logical values are entered.
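
The three modes can be checked with mode(), as in the sketch below. The character vector and the variable names are hypothetical, chosen for illustration.

```r
num_vec <- c(2, 3, 5, 2, 7, 1)
seq_vec <- 3:10                         # the numbers 3, 4, ..., 10
lgl_vec <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
chr_vec <- c("Canberra", "Sydney")      # hypothetical values

mode(num_vec)   # "numeric"
mode(seq_vec)   # "numeric"
mode(lgl_vec)   # "logical"
mode(chr_vec)   # "character"
```

Because T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words, the longer forms are the safer choice in scripts and functions.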
> y
[1] 10 15 12
> z <- c(x, y)
> z
[1] 2 3 5 2 7 1 10 15 12
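
The output above is consistent with the following sketch, in which the values of x and y are inferred from the printed results and are not taken from a listing in the text.

```r
x <- c(2, 3, 5, 2, 7, 1)   # inferred from the first six elements of z
y <- c(10, 15, 12)
z <- c(x, y)               # join x and y into a single vector
z
```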
1. Specify the indices of the elements that are to be extracted, for example:
> x <- c(3, 11, 8, 15, 12) # Assign to x the values 3, 11, 8, 15, 12
> x[c(2,4)] # Elements in positions 2 and 4 only
[1] 11 15
2. Use negative subscripts to omit the elements in nominated subscript positions (take
care not to mix positive and negative subscripts):
> x[-c(2,3)] # Remove the elements in positions 2 and 3
[1] 3 15 12
3. Specify a vector of logical values. The elements that are extracted are those for which
the logical value is TRUE. Thus, suppose we want to extract values of x that are
greater than 10:
> x > 10
[1] FALSE TRUE FALSE TRUE TRUE
> x[x > 10]
[1] 11 15 12
Elements of vectors can be given names. In that case, elements can be extracted by
name:
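
A minimal sketch of extraction by name follows. The vector heights and its names and values are hypothetical.

```r
# A named numeric vector (hypothetical heights in cm)
heights <- c(Ann=170, Ben=185, Cal=178)

# Extract elements by name
selected <- heights[c("Ben", "Cal")]
selected

# The names can be inspected, or changed, with names()
names(heights)
```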
To replace all NAs by -999 (in most circumstances a bad idea) use the function
is.na(), thus:
> ## Replace all NAs by -999 (in general, a bad idea)
> seal.lung[is.na(seal.lung)] <- -999
> seal.lung
[1] 605.0 436.0 380.0 493.9 -999.0 550.0 470.0 592.5 605.0
[10] 799.9 995.0 785.0 910.0 1115.0 1142.6 1465.0 1250.0 1580.0
[19] 2000.0 1474.4 -999.0 1220.0 1790.0 1510.0 -999.0 -999.0 2735.0
[28] -999.0 2380.0 -999.0
Using a code such as -999 for missing values requires continual watchfulness to ensure
that it is never treated as a legitimate numeric value.
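The danger can be seen in a small sketch. The vector v below is illustrative and is not the seal.lung data.

```r
# Small illustrative vector with two missing values
v <- c(605, 436, NA, 550, NA)

mean_na_rm <- mean(v, na.rm = TRUE)   # NAs are excluded from the calculation

coded <- v
coded[is.na(coded)] <- -999           # sentinel code replaces each NA
mean_coded <- mean(coded)             # silently distorted by the -999 values

print(c(mean_na_rm, mean_coded))
print(mean(v))                        # NA: missing values propagate by default
```

With NA the problem announces itself (the mean is NA until na.rm=TRUE is given), whereas the coded version returns a plausible-looking but meaningless number.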
Missing values are discussed further in Subsection 1.5.6 and Section 14.5. For vectors
of mode numeric, other legal values that may require special attention are NaN (not a
number; generated, e.g., by 0/0), Inf (e.g., 1/0) and -Inf.
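The following sketch shows how these special values arise and how the base R functions is.nan(), is.na() and is.finite() distinguish them. The vector vals is illustrative.

```r
vals <- c(0/0, 1/0, -1/0, NA, 5)   # NaN, Inf, -Inf, NA and an ordinary number
print(vals)

nan_flag    <- is.nan(vals)        # TRUE only for NaN
na_flag     <- is.na(vals)         # TRUE for both NaN and NA
finite_flag <- is.finite(vals)     # TRUE only for ordinary finite numbers

print(nan_flag)
print(na_flag)
print(finite_flag)
```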
1.3.6 Factors
A factor is stored internally as a numeric vector with values 1, 2, 3, ..., k. The value k is
the number of levels. The levels are character strings.
Consider a survey that has data on 691 females and 692 males. If the first 691 are
females and the next 692 males, we can create a vector of strings that holds the values thus:
gender <- c(rep("female",691), rep("male",692))
Once this vector has been converted to a factor, by entering gender <- factor(gender),
it is stored internally as 691 1s, followed by 692 2s. It has stored with
it a table that holds the information:
1 female
2 male
In most contexts that seem to demand a character string, the 1 is translated into female
and the 2 into male. The values female and male are the levels of the factor. By
default, the levels are in sorted order for the data type from which the factor was formed,
so that female precedes male. Hence:
> levels(gender)
[1] "female" "male"
Note that if gender had been an ordinary character vector, the outcome of the above
levels command would have been NULL.
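The steps above can be checked directly. In this sketch as.integer() is used to expose the internal codes; the names levels_chr and codes are illustrative.

```r
gender <- c(rep("female", 691), rep("male", 692))
levels_chr <- levels(gender)     # NULL: a character vector has no levels

gender <- factor(gender)         # convert the character vector to a factor
print(levels(gender))            # levels in sorted order

codes <- as.integer(gender)      # the internal integer codes
print(table(codes))              # counts of the codes 1 and 2
print(table(gender))             # the same counts, labelled by level
```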
The order of the factor levels determines the order of appearance of the levels in graphs
and tables that use this information. To cause male to come before female, use:
This syntax is available both when the factor is first created, and later to change the order
in an existing factor. Take care that the level names are correctly spelled. For example,
specifying "Male" in place of "male" in the levels argument will cause all values
that were "male" to be coded as missing.
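The reordering and the misspelling hazard described above can be sketched as follows, assuming that the levels argument of factor() is the syntax referred to. The short factor used here is illustrative.

```r
gender <- factor(c("female", "male", "male", "female"))

# Reorder the levels so that male comes first; the values are unchanged
reordered <- factor(gender, levels = c("male", "female"))
print(levels(reordered))

# A misspelled level name ("Male") matches nothing, so those values become NA
misspelt <- factor(gender, levels = c("Male", "female"))
print(misspelt)
n_missing <- sum(is.na(misspelt))
print(n_missing)
```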
Note finally the function ordered(), which generates factors whose values can be
compared using the relational operators <, <=, >, >=, == and !=. Ordered factors are
appropriate for use with ordered categorical data. See Section 14.4 for further details.
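A brief sketch of an ordered factor follows. The variable stress and its levels are hypothetical.

```r
# Hypothetical ordered categorical data; levels are given from lowest to highest
stress <- ordered(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"))
print(stress)

above_low <- stress > "low"      # comparison uses the level order
print(above_low)

highest <- max(stress)           # max() is meaningful for ordered factors
print(highest)
```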
The function ts() converts numeric vectors into time series objects. Frequently used
arguments of ts() are start, frequency and end. The following turns numjobs
into a time series, which can then be plotted:
numjobs <- ts(numjobs, start=1995, frequency = 12)
plot(numjobs)
Use the function window() to extract a subset of the time series. For example, the
following extracts the series up to time 1996.25, that is, the first 15 months after the
initial January 1995 value (16 monthly values in all):
first15 <- window(numjobs, end=1996.25)
Multivariate time series are also possible. See Subsections 2.1.5 and 14.7.7.
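The behaviour of ts() and window() can be tried on a synthetic series. The values 1, 2, ..., 24 below stand in for real data and are not the numjobs data. Note that time 1996.25 corresponds to April 1996, so the window ending there holds 16 monthly values; the (year, period) form of the end argument is an alternative way to state the end point.

```r
# Synthetic monthly series for 1995 and 1996 (values are placeholders)
series <- ts(1:24, start = 1995, frequency = 12)

upto_apr <- window(series, end = 1996.25)      # decimal time: April 1996
print(length(upto_apr))
print(end(upto_apr))                           # year and month of the last value

upto_mar <- window(series, end = c(1996, 3))   # (year, month) form: March 1996
print(length(upto_mar))
```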
> Cars93.summary
        Min.passengers Max.passengers No.of.cars abbrev
Compact              4              6         16      C
Large                6              6         11      L
Midsize              4              6         22      M
Small                4              5         21     Sm
Sporty               2              4         14     Sp
Van                  7              8          9      V
Footnote 1:
## Alternatively, obtain from data frame jobs (DAAG)
library(DAAG)
numjobs <- jobs$Prairies
The first three columns are numeric, and the fourth is a factor. Use the function class()
to check this, for example, enter class(Cars93.summary$abbrev). (The classification
of objects into classes is discussed in Subsection 1.5.2.)
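For readers without the data to hand, the sketch below rebuilds the data frame by hand from the printed table and checks the class of every column. Creating abbrev with factor() is an assumption made so that the result matches the description; in recent versions of R, data.frame() leaves character columns as character unless told otherwise.

```r
# Hand-built copy of the printed table (values copied from the output above)
Cars93.summary <- data.frame(
  Min.passengers = c(4, 6, 4, 4, 2, 7),
  Max.passengers = c(6, 6, 6, 5, 4, 8),
  No.of.cars     = c(16, 11, 22, 21, 14, 9),
  abbrev         = factor(c("C", "L", "M", "Sm", "Sp", "V")),
  row.names      = c("Compact", "Large", "Midsize", "Small", "Sporty", "Van")
)

col_classes <- sapply(Cars93.summary, class)   # class of each column
print(col_classes)
print(class(Cars93.summary$abbrev))
```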
On most systems, use of edit() allows access to a spreadsheet-like display of a data
frame or of a vector, where entries can be edited or new data added. For example:
To close the spreadsheet, click on the File menu and then on Close. On Linux systems,
click on Quit to exit.
Similarly, the tail() function displays the last few rows of a data frame. The default
for the second argument (number of rows to display) is 6.
The functions head() and tail() are available also for use with objects other than
data frames.
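A short sketch of head() and tail(), using an illustrative data frame and the built-in vector letters:

```r
df <- data.frame(id = 1:10, sq = (1:10)^2)   # illustrative data frame

first_rows <- head(df)        # first 6 rows by default
last_rows  <- tail(df, 3)     # second argument sets the number of rows
print(first_rows)
print(last_rows)

first_letters <- head(letters, 4)   # head() also works on vectors
print(first_letters)
```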
The square bracket notation offers a more flexible alternative (any subset of rows can
be extracted):
> Cars93.summary[1:3, ]
        Min.passengers Max.passengers No.of.cars abbrev
Compact              4              6         16      C
Large                6              6         11      L
Midsize              4              6         22      M
Detaching data frames that are no longer in use reduces the risk of a clash of variable
names, for example, two different attached data frames that have a column with the name
Min.passengers, or a Min.passengers both in the workspace and in an attached
data frame.
The attaching of a data frame extends the search list, which is the list of “databases”
where R looks for objects. See Subsection 14.1 for more details on this and other uses of
attach().
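The effect of attaching and detaching on the search list can be seen in the sketch below. The data frame pass is hypothetical. The final line uses with(), a base R function that gives similar access to columns without changing the search list; it is offered here as an alternative, not as the text's recommendation.

```r
# Hypothetical two-row data frame
pass <- data.frame(Min.passengers = c(4, 6), Max.passengers = c(6, 6))

attach(pass)                             # adds "pass" to the search list
attached_now <- "pass" %in% search()
total_min <- sum(Min.passengers)         # column found via the search list
detach(pass)                             # removes it again
attached_after <- "pass" %in% search()
print(c(attached_now, attached_after))

# with() evaluates an expression inside the data frame, with no attach needed
total_min_with <- with(pass, sum(Min.passengers))
print(c(total_min, total_min_with))
```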
The subset() function offers an alternative way to extract rows and columns. For
example, the following extracts the first two rows:
subset(Cars93.summary, subset=c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE))
# the subset argument must be a logical vector; see help(subset) for details
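In current versions of R the subset argument must evaluate to a logical vector, and it is more often written as a condition on the columns. The sketch below uses a cut-down, hand-built data frame (Cars93.mini is a hypothetical name) to show both row selection and column selection with the select argument.

```r
# Cut-down illustrative data frame, values taken from the printed table
Cars93.mini <- data.frame(
  Min.passengers = c(4, 6, 4, 4, 2, 7),
  No.of.cars     = c(16, 11, 22, 21, 14, 9),
  row.names      = c("Compact", "Large", "Midsize", "Small", "Sporty", "Van")
)

# First two rows, expressed as a logical condition on the row number
first_two <- subset(Cars93.mini, subset = seq_len(nrow(Cars93.mini)) <= 2)
print(first_two)

# Rows meeting a condition on a column, keeping one column only
big <- subset(Cars93.mini, subset = No.of.cars > 15, select = No.of.cars)
print(big)
```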
Another possibility is the use of the function cbind() to combine two or more vectors
of the same length and type together into a matrix, thus:
fossilfuelmat <- cbind(year=c(1800, 1850, 1900, 1950, 2000),
carbon=c(8, 54, 534, 1630, 6611))
More generally, any data frame where all columns hold data that is all of the same
type, that is, all numeric or all character or all logical, can alternatively be stored as a
matrix. Storage of numeric data in matrix rather than data frame format can speed up some
mathematical and other manipulations when the number of elements is large, for example,
of the order of several hundreds of thousands. For further details, see Section 14.6.
Note that:
• Matrix elements are stored in column order in one long vector, that is, columns
are stacked one above the other, with the first column first. It is straightforward, as
explained in Section 14.6, to change between a matrix with m rows and n columns,
and a vector of length mn.
• The extraction of submatrices has the same syntax as for data frames. Thus
fossilfuelmat[2:3,] extracts rows 2 and 3 of the matrix fossilfuelmat.
(Be careful not to omit the comma, causing the matrix to be treated as one long vector.)
• The names() function returns NULL when the argument is a matrix. Note however
the functions rownames() and colnames(), which can be used either with data
frames or matrices.
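The three points above can be verified on fossilfuelmat itself. The names as_vec, rows23, no_comma and back are illustrative.

```r
fossilfuelmat <- cbind(year = c(1800, 1850, 1900, 1950, 2000),
                       carbon = c(8, 54, 534, 1630, 6611))
print(dim(fossilfuelmat))              # 5 rows, 2 columns

# Column order storage: the year column comes first, then carbon
as_vec <- as.vector(fossilfuelmat)
print(as_vec)
back <- matrix(as_vec, nrow = 5)       # vector of length mn back to m by n

# With the comma, rows 2 and 3; without it, elements 2 and 3 of the long vector
rows23   <- fossilfuelmat[2:3, ]
no_comma <- fossilfuelmat[2:3]
print(rows23)
print(no_comma)

# names() returns NULL for a matrix; colnames() gives the column names
print(names(fossilfuelmat))
print(colnames(fossilfuelmat))
```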
When using a function for the first time, consult its help page for information on how it
handles missing values.