R Programming for Data Science
Roger D. Peng
© 2014 - 2016 Roger D. Peng
Also By Roger D. Peng
The Art of Data Science
Exploratory Data Analysis with R
Report Writing for Data Science in R
Contents

1. Stay in Touch!

2. Preface

3. History and Overview of R
3.1 What is R?
3.2 What is S?
3.3 The S Philosophy
3.4 Back to R
3.5 Basic Features of R
3.6 Free Software
3.7 Design of the R System
3.8 Limitations of R
3.9 R Resources

4. Getting Started with R
4.1 Installation
4.2 Getting started with the R interface

5. R Nuts and Bolts
5.1 Entering Input
5.2 Evaluation
5.3 R Objects
5.4 Numbers
5.5 Attributes
5.6 Creating Vectors
5.7 Mixing Objects
5.8 Explicit Coercion
5.9 Matrices
5.10 Lists
5.11 Factors
5.12 Missing Values
5.13 Data Frames
5.14 Names
5.15 Summary

6. Getting Data In and Out of R
6.1 Reading and Writing Data
6.2 Reading Data Files with read.table()
6.3 Reading in Larger Datasets with read.table
6.4 Calculating Memory Requirements for R Objects

7. Using the readr Package

8. Using Textual and Binary Formats for Storing Data
8.1 Using dput() and dump()
8.2 Binary Formats

9. Interfaces to the Outside World
9.1 File Connections
9.2 Reading Lines of a Text File
9.3 Reading From a URL Connection

10. Subsetting R Objects
10.1 Subsetting a Vector
10.2 Subsetting a Matrix
10.3 Subsetting Lists
10.4 Subsetting Nested Elements of a List
10.5 Extracting Multiple Elements of a List
10.6 Partial Matching
10.7 Removing NA Values

11. Vectorized Operations
11.1 Vectorized Matrix Operations

12. Dates and Times
12.1 Dates in R
12.2 Times in R
12.3 Operations on Dates and Times
12.4 Summary

13. Managing Data Frames with the dplyr package
13.1 Data Frames
13.2 The dplyr Package
13.3 dplyr Grammar
13.4 Installing the dplyr package
13.5 select()
13.6 filter()
13.7 arrange()
13.8 rename()
13.9 mutate()
13.10 group_by()
13.11 %>%
13.12 Summary

14. Control Structures
14.1 if-else
14.2 for Loops
14.3 Nested for loops
14.4 while Loops
14.5 repeat Loops
14.6 next, break
14.7 Summary

15. Functions
15.1 Functions in R
15.2 Your First Function
15.3 Argument Matching
15.4 Lazy Evaluation
15.5 The ... Argument
15.6 Arguments Coming After the ... Argument
15.7 Summary

16. Scoping Rules of R
16.1 A Diversion on Binding Values to Symbol
16.2 Scoping Rules
16.3 Lexical Scoping: Why Does It Matter?
16.4 Lexical vs. Dynamic Scoping
16.5 Application: Optimization
16.6 Plotting the Likelihood
16.7 Summary

17. Coding Standards for R

18. Loop Functions
18.1 Looping on the Command Line
18.2 lapply()
18.3 sapply()
18.4 split()
18.5 Splitting a Data Frame
18.6 tapply
18.7 apply()
18.8 Col/Row Sums and Means
18.9 Other Ways to Apply
18.10 mapply()
18.11 Vectorizing a Function
18.12 Summary

19. Regular Expressions
19.1 Before You Begin
19.2 Primary R Functions
19.3 grep()
19.4 grepl()
19.5 regexpr()
19.6 sub() and gsub()
19.7 regexec()
19.8 Summary

20. Debugging
20.1 Something’s Wrong!
20.2 Figuring Out What’s Wrong
20.3 Debugging Tools in R
20.4 Using traceback()
20.5 Using debug()
20.6 Using recover()
20.7 Summary

21. Profiling R Code
21.1 Using system.time()
21.2 Timing Longer Expressions
21.3 The R Profiler
21.4 Using summaryRprof()
21.5 Summary

22. Simulation
22.1 Generating Random Numbers
22.2 Setting the random number seed
22.3 Simulating a Linear Model
22.4 Random Sampling
22.5 Summary

23. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
23.1 Synopsis
23.2 Loading and Processing the Raw Data
23.3 Results

24. About the Author


1. Stay in Touch!
Thanks for purchasing this book. If you are interested in hearing more from me about things that
I’m working on (books, data science courses, podcast, etc.), you can do two things.
First, I encourage you to join my mailing list of Leanpub Readers¹. On this list I send out updates
of my own activities as well as occasional comments on data science current events. I’ll also let you
know what my co-conspirators Jeff Leek and Brian Caffo are up to because sometimes they do really
cool stuff.
Second, I have a regular podcast called Not So Standard Deviations² that I co-host with Dr. Hilary
Parker, a Senior Data Analyst at Etsy. On this podcast, Hilary and I talk about the craft of data
science and discuss common issues and problems in analyzing data. We’ll also compare how data
science is approached in both academia and industry contexts and discuss the latest industry trends.
You can listen to recent episodes on our SoundCloud page or you can subscribe to it in iTunes³ or
your favorite podcasting app.
Thanks again for purchasing this book and please do stay in touch!
¹http://eepurl.com/bAJ3zj
²https://soundcloud.com/nssd-podcast
³https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570

2. Preface
I started using R in 1998 when I was a college undergraduate working on my senior thesis.
The version was 0.63. I was an applied mathematics major with a statistics concentration and
I was working with Dr. Nicolas Hengartner on an analysis of word frequencies in classic texts
(Shakespeare, Milton, etc.). The idea was to see if we could identify the authorship of each of the
texts based on how frequently they used certain words. We downloaded the data from Project
Gutenberg and used some basic linear discriminant analysis for the modeling. The work was
eventually published¹ and was my first ever peer-reviewed publication. I guess you could argue
it was my first real “data science” experience.
Back then, no one was using R. Most of my classes were taught with Minitab, SPSS, Stata, or
Microsoft Excel. The cool people on the cutting edge of statistical methodology used S-PLUS. I
was working on my thesis late one night and I had a problem. I didn’t have a copy of any of those
software packages because they were expensive and I was a student. I didn’t feel like trekking over
to the computer lab to use the software because it was late at night.
But I had the Internet! After a couple of Yahoo! searches I found a web page for something called R,
which I figured was just a play on the name of the S-PLUS package. From what I could tell, R was a
“clone” of S-PLUS that was free. I had already written some S-PLUS code for my thesis so I figured
I would try to download R and see if I could just run the S-PLUS code.
It didn’t work. At least not at first. It turns out that R is not exactly a clone of S-PLUS and quite a few
modifications needed to be made before the code would run in R. In particular, R was missing a lot of
statistical functionality that had existed in S-PLUS for a long time already. Luckily, R’s programming
language was pretty much there and I was able to more or less re-implement the features that were
missing in R.
After college, I enrolled in a PhD program in statistics at the University of California, Los Angeles.
At the time the department was brand new and they didn’t have a lot of policies or rules (or classes,
for that matter!). So you could kind of do what you wanted, which was good for some students and
not so good for others. The Chair of the department, Jan de Leeuw, was a big fan of XLisp-Stat and
so all of the department’s classes were taught using XLisp-Stat. I diligently bought my copy of Luke
Tierney’s book² and learned to really love XLisp-Stat. It had a number of features that R didn’t have
at all, most notably dynamic graphics.
But ultimately, there were only so many parentheses that I could type, and still all of the research-
level statistics was being done in S-PLUS. The department didn’t really have a lot of copies of S-PLUS
lying around so I turned back to R. When I looked around at my fellow students, I realized that I
was basically the only one who had any experience using R. Since there was a budding interest in R
around the department, I decided to start a “brown bag” series where every week for about an hour
I would talk about something you could do in R (which wasn’t much, really). People seemed to like
it, if only because there wasn’t really anyone to turn to if you wanted to learn about R.
¹http://amstat.tandfonline.com/doi/abs/10.1198/000313002100#.VQGiSELpagE
²http://www.amazon.com/LISP-STAT-Object-Oriented-Environment-Statistical-Probability/dp/0471509167/
By the time I left grad school in 2003, the department had essentially switched over from XLisp-Stat
to R for all its work (although there were a few holdouts). Jan discusses the rationale for the
transition in a paper³ in the Journal of Statistical Software.
In the next step of my career, I went to the Department of Biostatistics⁴ at the Johns Hopkins
Bloomberg School of Public Health, where I have been for the past 12 years. When I got to Johns
Hopkins people already seemed into R. Most people had abandoned S-PLUS a while ago and were
committed to using R for their research. Of all the available statistical packages, R had the most
powerful and expressive programming language, which was perfect for someone developing new
statistical methods.
However, we didn’t really have a class that taught students how to use R. This was a problem because
most of our grad students were coming into the program having never heard of R. Most likely in
their undergraduate programs, they used some other software package. So along with Rafael Irizarry,
Brian Caffo, Ingo Ruczinski, and Karl Broman, I started a new class to teach our graduate students
R and a number of other skills they’d need in grad school.
The class was basically a weekly seminar where one of us talked about a computing topic of interest.
I gave some of the R lectures in that class and when I asked people who had heard of R before, almost
no one raised their hand. And no one had actually used it before. The main selling point at the time
was “It’s just like S-PLUS but it’s free!” A lot of people had experience with SAS or Stata or SPSS. A
number of people had used something like Java or C/C++ before and so I often used that as a reference
frame. No one had ever used a functional style of programming language like Scheme or Lisp.
To this day, I still teach the class, known as Biostatistics 140.776 (“Statistical Computing”). However,
the nature of the class has changed quite a bit over the past 10 years. The population of students
(mostly first-year graduate students) has shifted to the point where many of them have been
introduced to R as undergraduates. This trend mirrors the overall trend with statistics where we
are seeing more and more students do undergraduate majors in statistics (as opposed to, say,
mathematics). Eventually, by 2008–2009, when I’d asked how many people had heard of or used
R before, everyone raised their hand. However, even at that late date, I still felt the need to convince
people that R was a “real” language that could be used for real tasks.
R has grown a lot in recent years, and is being used in so many places now, that I think it’s
essentially impossible for a person to keep track of everything that is going on. That’s fine, but
it makes “introducing” people to R an interesting experience. Nowadays in class, students are often
teaching me something new about R that I’ve never seen or heard of before (they are quite good
at Googling around for themselves). I feel no need to “bring people over” to R. In fact it’s quite the
opposite–people might start asking questions if I weren’t teaching R.
³http://www.jstatsoft.org/v13/i07
⁴http://www.biostat.jhsph.edu
This book comes from my experience teaching R in a variety of settings and through different stages
of its (and my) development. Much of the material has been taken from my Statistical Computing
class as well as the R Programming⁵ class I teach through Coursera.
I’m looking forward to teaching R to people as long as people will let me, and I’m interested in
seeing how the next generation of students will approach it (and how my approach to them will
change). Overall, it’s been just an amazing experience to see the widespread adoption of R over the
past decade. I’m sure the next decade will be just as amazing.
⁵https://www.coursera.org/course/rprog
3. History and Overview of R
There are only two kinds of languages: the ones people complain about and the ones
nobody uses —Bjarne Stroustrup

Watch a video of this chapter¹

3.1 What is R?
This is an easy question to answer. R is a dialect of S.

3.2 What is S?
S is a language that was developed by John Chambers and others at the old Bell Telephone
Laboratories, originally part of AT&T Corp. S was initiated in 1976² as an internal statistical analysis
environment—originally implemented as Fortran libraries. Early versions of the language did not
even contain functions for statistical modeling.
In 1988 the system was rewritten in C and began to resemble the system that we have today (this
was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the white
book) documents the statistical analysis functionality. Version 4 of the S language was released in
1998 and is the version we use today. The book Programming with Data by John Chambers (the
green book) documents this version of the language.
Since the early 1990s the life of the S language has gone down a rather winding path. In 1993 Bell Labs
gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. In 2004
Insightful purchased the S language from Lucent for $2 million. In 2006, Alcatel purchased Lucent
Technologies, and the combined company is now called Alcatel-Lucent.
Insightful sold its implementation of the S language under the product name S-PLUS and built a
number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”. In 2008 Insightful was
acquired by TIBCO for $25 million. As of this writing TIBCO is the current owner of the S language
and is its exclusive developer.
The fundamentals of the S language itself have not changed dramatically since the publication of the
Green Book by John Chambers in 1998. In 1998, S won the Association for Computing Machinery’s
Software System Award, a highly prestigious award in the computer science field.
¹https://youtu.be/STihTnVSZnI
²http://cm.bell-labs.com/stat/doc/94.11.ps

3.3 The S Philosophy


The general S philosophy is important to understand for users of S and R because it sets the stage for
the design of the language itself, which many programming veterans find a bit odd and confusing.
In particular, it’s important to realize that the S language had its roots in data analysis, and did not
come from a traditional programming language background. Its inventors were focused on figuring
out how to make data analysis easier, first for themselves, and then eventually for others.
In Stages in the Evolution of S³, John Chambers writes:

“[W]e wanted users to be able to begin in an interactive environment, where they
did not consciously think of themselves as programming. Then as their needs became
clearer and their sophistication increased, they should be able to slide gradually into
programming, when the language and system aspects would become more important.”

The key part here was the transition from user to developer. They wanted to build a language that
could easily service both “people”. More technically, they needed to build a language that would
be suitable for interactive data analysis (more command-line based) as well as for writing longer
programs (more traditional programming language-like).

3.4 Back to R
The R language came into use quite a bit after S had been developed. One key limitation of the S
language was that it was only available in a commercial package, S-PLUS. In 1991, R was created
by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In
1993 the first announcement of R was made to the public. Ross’s and Robert’s experience developing
R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics:

Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal
of Computational and Graphical Statistics, 5(3):299–314, 1996

In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the
GNU General Public License⁴ to make R free software. This was critical because it allowed for the
source code for the entire R system to be accessible to anyone who wanted to tinker with it (more
on free software later).
In 1996, public mailing lists were created (the R-help and R-devel lists) and in 1997 the R Core
Group was formed, containing some people associated with S and S-PLUS. Currently, the core group
controls the source code for R and is solely able to check in changes to the main R source tree. Finally,
in 2000 R version 1.0.0 was released to the public.
³http://www.stat.bell-labs.com/S/history.html
⁴http://www.gnu.org/licenses/gpl-2.0.html
3.5 Basic Features of R


In the early days, a key feature of R was that its syntax was very similar to S, making it easy for
S-PLUS users to switch over. While R’s syntax is nearly identical to that of S, R’s semantics,
though superficially similar to S, are quite different. In fact, R is technically much closer to the Scheme
language than it is to the original S language when it comes to how R works under the hood.
Today R runs on almost any standard computing platform and operating system. Its open source
nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R
has been reported to be running on modern tablets, phones, PDAs, and game consoles.
One nice feature that R shares with many popular open source projects is frequent releases. These
days there is a major annual release, typically in October, where major new features are incorporated
and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed.
The frequent releases and regular release cycle indicate active development of the software and
ensure that bugs will be addressed in a timely manner. Of course, while the core developers control
the primary source tree for R, many people around the world make contributions in the form of new
features, bug fixes, or both.
Another key advantage that R has over many other statistical packages (even today) is its sophisti-
cated graphics capabilities. R’s ability to create “publication quality” graphics has existed since the
very beginning and has generally been better than competing packages. Today, with many more
visualization packages available than before, that trend continues. R’s base graphics system allows
for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems,
like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data.
R has maintained the original S philosophy, which is that it provides a language that is useful
for interactive work and is also a powerful programming language for developing new tools. This
allows the user, who takes existing tools and applies them to data, to slowly but surely become a
developer who is creating new tools.
Finally, one of the joys of using R has nothing to do with the language itself, but rather with the
active and vibrant user community. In many ways, a language is successful inasmuch as it creates a
platform with which many people can create new things. R is that platform and thousands of people
around the world have come together to make contributions to R, to develop packages, and help
each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly
active for over a decade now and there is considerable activity on web sites like Stack Overflow.

3.6 Free Software


A major advantage that R has over many other statistical packages is that it’s free in the sense
of free software (it’s also free in the sense of free beer). The copyright for the primary source code
for R is held by the R Foundation⁵ and is published under the GNU General Public License version 2.0⁶.
⁵http://www.r-project.org/foundation/
According to the Free Software Foundation, with free software, you are granted the following four
freedoms⁷

• The freedom to run the program, for any purpose (freedom 0).
• The freedom to study how the program works, and adapt it to your needs (freedom 1). Access
to the source code is a precondition for this.
• The freedom to redistribute copies so you can help your neighbor (freedom 2).
• The freedom to improve the program, and release your improvements to the public, so that
the whole community benefits (freedom 3). Access to the source code is a precondition for
this.

You can visit the Free Software Foundation’s web site⁸ to learn a lot more about free software. The
Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web
site⁹ is an interesting read if you happen to have some spare time.

3.7 Design of the R System


The primary R system is available from the Comprehensive R Archive Network¹⁰, also known as
CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.
The R system is divided into 2 conceptual parts:

1. The “base” R system that you download from CRAN: Linux¹¹, Windows¹², Mac¹³, Source Code¹⁴
2. Everything else.

R functionality is divided into a number of packages.

• The “base” R system contains, among other things, the base package which is required to run
R and contains the most fundamental functions.
• The other packages contained in the “base” system include utils, stats, datasets, graphics,
grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.

⁶http://www.gnu.org/licenses/gpl-2.0.html
⁷http://www.gnu.org/philosophy/free-sw.html
⁸http://www.fsf.org
⁹https://stallman.org
¹⁰http://cran.r-project.org
¹¹http://cran.r-project.org/bin/linux/
¹²http://cran.r-project.org/bin/windows/
¹³http://cran.r-project.org/bin/macosx/
¹⁴http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz

• There are also “Recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth,
lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.

When you download a fresh installation of R from CRAN, you get all of the above, which represents
a substantial amount of functionality. However, there are many other packages available:

• There are over 4000 packages on CRAN that have been developed by users and programmers
around the world.
• There are also many packages associated with the Bioconductor project¹⁵.
• People often make packages available on their personal websites; there is no reliable way to
keep track of how many packages are available in this fashion.
• There are a number of packages being developed on repositories like GitHub and BitBucket
but there is no reliable listing of all these packages.
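
To see which of the base and Recommended packages are present in a particular installation, one option is the installed.packages() function from the utils package, which accepts a priority argument. This is only a minimal sketch; the exact lists depend on the version of R that is installed, so no output is shown.

```r
## Packages that ship with the "base" R system
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

## The "Recommended" packages, if they are installed
rec_pkgs <- rownames(installed.packages(priority = "recommended"))
print(rec_pkgs)
```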

3.8 Limitations of R
No programming language or statistical analysis system is perfect. R certainly has a number of
drawbacks. For starters, R is essentially based on technology that is almost 50 years old, going back to the
original S system developed at Bell Labs. There was originally little built-in support for dynamic or
3-D graphics (but things have improved greatly since the “old days”).
Another commonly cited limitation of R is that objects must generally be stored in physical memory.
This is in part due to the scoping rules of the language, but R generally is more of a memory hog
than other statistical packages. However, there have been a number of advancements to deal with
this, both in the R core and also in a number of packages developed by contributors. Also, computing
power and capacity have continued to grow over time and the amount of physical memory that can be
installed on even a consumer-level laptop is substantial. While we will likely never have enough
physical memory on a computer to handle the increasingly large datasets that are being generated,
the situation has gotten quite a bit easier over time.
At a higher level one “limitation” of R is that its functionality is based on consumer demand and
(voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your
job to implement it (or you need to pay someone to do it). The capabilities of the R system generally
reflect the interests of the R user community. As the community has ballooned in size over the past
10 years, the capabilities have similarly increased. When I first started using R, there was very little
in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some
of those communities have adopted R and we are seeing more code being written for those kinds of
applications.
If you want to know my general views on the usefulness of R, you can see them here in the following
exchange on the R-help mailing list with Douglas Bates and Brian Ripley in June 2004:
¹⁵http://bioconductor.org

Roger D. Peng: I don’t think anyone actually believes that R is designed to make
everyone happy. For me, R does about 99% of the things I need to do, but sadly, when I
need to order a pizza, I still have to pick up the telephone.

Douglas Bates: There are several chains of pizzerias in the U.S. that provide for Internet-
based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it’s
only a matter of time before you will have a pizza-ordering function available.

Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface under R for
Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we
presume as that is where the GraphApp author hails from). Alternatively, a Padovian
has no need of ordering pizzas with both home and neighbourhood restaurants ….

At this point in time, I think it would be fairly straightforward to build a pizza ordering R package
using something like the RCurl or httr packages. Any takers?

3.9 R Resources

Official Manuals
As far as getting started with R by reading stuff, there is of course this book. Also, available from
CRAN¹⁶ are

• An Introduction to R¹⁷
• R Data Import/Export¹⁸
• Writing R Extensions¹⁹: Discusses how to write and organize R packages
• R Installation and Administration²⁰: This is mostly for building R from the source code
• R Internals²¹: This manual describes the low level structure of R and is primarily for developers
and R core members
• R Language Definition²²: This documents the R language and, again, is primarily for develop-
ers
¹⁶http://cran.r-project.org
¹⁷http://cran.r-project.org/doc/manuals/r-release/R-intro.html
¹⁸http://cran.r-project.org/doc/manuals/r-release/R-data.html
¹⁹http://cran.r-project.org/doc/manuals/r-release/R-exts.html
²⁰http://cran.r-project.org/doc/manuals/r-release/R-admin.html
²¹http://cran.r-project.org/doc/manuals/r-release/R-ints.html
²²http://cran.r-project.org/doc/manuals/r-release/R-lang.html

Useful Standard Texts on S and R


• Chambers (2008). Software for Data Analysis, Springer
• Chambers (1998). Programming with Data, Springer: This book is not about R, but it describes
the organization and philosophy of the current version of the S language, and is a useful
reference.
• Venables & Ripley (2002). Modern Applied Statistics with S, Springer: This is a standard
textbook in statistics and describes how to use many statistical methods in R. This book has
an associated R package (the MASS package) that comes with every installation of R.
• Venables & Ripley (2000). S Programming, Springer: This book is a little old but is still relevant
and accurate. Despite its title, this book is useful for R also.
• Murrell (2005). R Graphics, Chapman & Hall/CRC Press: Paul Murrell wrote and designed
much of the graphics system in R and this book essentially documents the underlying details.
This is not so much a “user-level” book as a developer-level book. But it is an important book
for anyone interested in designing new types of graphics or visualizations.
• Wickham (2014). Advanced R, Chapman & Hall/CRC Press: This book by Hadley Wickham
covers a number of areas including object-oriented programming, functional programming,
profiling and other advanced topics.

Other Resources
• Major technical publishers like Springer, Chapman & Hall/CRC have entire series of books
dedicated to using R in various applications. For example, Springer has a series of books called
Use R!.
• A longer list of books can be found on the CRAN web site²³.

²³http://www.r-project.org/doc/bib/R-books.html
4. Getting Started with R
4.1 Installation
The first thing you need to do to get started with R is to install it on your computer. R works on
pretty much every platform available, including the widely available Windows, Mac OS X, and Linux
systems. If you want to watch a step-by-step tutorial on how to install R for Mac or Windows, you
can watch these videos:

• Installing R on Windows¹
• Installing R on the Mac²

There is also an integrated development environment available for R that is built by RStudio. I really
like this IDE—it has a nice editor with syntax highlighting, there is an R object viewer, and there are
a number of other nice features that are integrated. You can see how to install RStudio here

• Installing RStudio³

The RStudio IDE is available from RStudio’s web site⁴.

4.2 Getting started with the R interface


After you install R you will need to launch it and start writing R code. Before we get to exactly how
to write R code, it’s useful to get a sense of how the system is organized. In these two videos I talk
about where to write code and how to set your working directory, which lets R know where to find
all of your files.

• Writing code and setting your working directory on the Mac⁵


• Writing code and setting your working directory on Windows⁶
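
The working directory can also be inspected and changed from R code with getwd() and setwd(). The sketch below uses a temporary directory as a stand-in for a project folder; in practice you would pass the path to your own folder.

```r
old <- getwd()       ## where R currently looks for files
setwd(tempdir())     ## switch to another directory (a temporary one here)
getwd()              ## confirm the change
list.files()         ## files visible from the new working directory
setwd(old)           ## switch back to the original directory
```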

¹http://youtu.be/Ohnk9hcxf9M
²https://youtu.be/uxuuWXU-7UQ
³https://youtu.be/bM7Sfz-LADM
⁴http://rstudio.com
⁵https://youtu.be/8xT3hmJQskU
⁶https://youtu.be/XBcvH1BpIBo

5. R Nuts and Bolts
5.1 Entering Input
At the R prompt we type expressions. The <- symbol is the assignment operator.

> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"

The grammar of the language determines whether an expression is complete or not.

x <- ## Incomplete expression

The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.
This is the only comment character in R. Unlike some other languages, R does not support multi-line
comments or comment blocks.

5.2 Evaluation
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated
expression is returned. The result may be auto-printed.

> x <- 5 ## nothing printed


> x ## auto-printing occurs
[1] 5
> print(x) ## explicit printing
[1] 5

The [1] shown in the output indicates that x is a vector and 5 is its first element.
Typically with interactive work, we do not explicitly print objects with the print function; it is much
easier to just auto-print them by typing the name of the object and hitting return/enter. However,
when writing scripts, functions, or longer programs, there is sometimes a need to explicitly print
objects because auto-printing does not work in those settings.
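
A small sketch of the difference between auto-printing and explicit printing inside a function. The function name show_value is made up for this illustration.

```r
## A hypothetical function used only to illustrate printing behavior
show_value <- function() {
  x <- 5
  x              ## evaluated, but nothing is printed inside a function
  print(x)       ## explicit printing always produces output
  invisible(x)   ## return the value without auto-printing it
}

## Capture whatever the call prints to the console
out <- capture.output(show_value())
out              ## holds the single line "[1] 5" produced by print()
```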
When an R vector is printed you will notice that an index for the vector is printed in square brackets
[] on the side. For example, see this integer sequence of length 21.


> x <- 10:30


> x
 [1] 10 11 12 13 14 15 16 17 18 19 20 21
[13] 22 23 24 25 26 27 28 29 30

The numbers in the square brackets are not part of the vector itself; they are merely part of the
printed output.
With R, it’s important that one understand that there is a difference between the actual R object
and the manner in which that R object is printed to the console. Often, the printed output may have
additional bells and whistles to make the output more friendly to the users. However, these bells and
whistles are not inherently part of the object.
Note that the : operator is used to create integer sequences.

5.3 R Objects
R has five basic or “atomic” classes of objects:

• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)

The most basic type of R object is a vector. Empty vectors can be created with the vector() function.
There is really only one rule about vectors in R, which is that A vector can only contain objects
of the same class.
But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later.
A list is represented as a vector but can contain objects of different classes. Indeed, that’s usually
why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data analysis and
I won’t cover them here.
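
As a quick check of the classes listed above, the class() function reports the class of any object. The values used here are arbitrary examples.

```r
## One value of each atomic class
class("hello")   ## "character"
class(3.14)      ## "numeric"
class(2L)        ## "integer"
class(1 + 2i)    ## "complex"
class(TRUE)      ## "logical"

## An empty vector of a given class, created with vector()
v <- vector("character", length = 0)
length(v)        ## 0

## A list, unlike an atomic vector, can mix classes
l <- list(1L, "a", TRUE)
sapply(l, class) ## "integer" "character" "logical"
```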

5.4 Numbers
Numbers in R are generally treated as numeric objects (i.e. double precision real numbers). This
means that even if you see a number like “1” or “2” in R, which you might think of as integers, they
are likely represented behind the scenes as numeric objects (so something like “1.00” or “2.00”). This
isn’t important most of the time…except when it is.

If you explicitly want an integer, you need to specify the L suffix. So entering 1 in R gives you a
numeric object; entering 1L explicitly gives you an integer object.
There is also a special number Inf which represents infinity. This allows us to represent entities like
1 / 0. This way, Inf can be used in ordinary calculations; e.g. 1 / Inf is 0.

The value NaN represents an undefined value (“not a number”); e.g. 0 / 0. NaN can also be thought of
as a missing value (more on that later).
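
The points in this section can be tried directly at the prompt. A short sketch:

```r
class(1)           ## "numeric"
class(1L)          ## "integer"
identical(1, 1L)   ## FALSE: same value, different storage type
1 == 1L            ## TRUE: the comparison coerces the integer to numeric

1 / 0              ## Inf
1 / Inf            ## 0
0 / 0              ## NaN
is.nan(0 / 0)      ## TRUE
```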

5.5 Attributes
R objects can have attributes, which are like metadata for the object. These metadata can be very
useful in that they help to describe the object. For example, column names on a data frame help to
tell us what data are contained in each of the columns. Some examples of R object attributes are

• names, dimnames
• dimensions (e.g. matrices, arrays)
• class (e.g. integer, numeric)
• length
• other user-defined attributes/metadata

Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects
contain attributes, in which case the attributes() function returns NULL.
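
Here is a small sketch of reading and setting attributes. The attribute name "units" is an arbitrary, user-defined example and has no special meaning to R.

```r
x <- 1:3
attributes(x)            ## NULL: a plain vector has no attributes

names(x) <- c("a", "b", "c")
attributes(x)            ## now a list with a $names element

m <- matrix(1:6, nrow = 2)
attributes(m)$dim        ## 2 3

## A user-defined attribute (the name "units" is arbitrary)
attr(x, "units") <- "cm"
attr(x, "units")         ## "cm"
```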

5.6 Creating Vectors


The c() function can be used to create vectors of objects by concatenating things together.

> x <- c(0.5, 0.6) ## numeric


> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex

Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE. However,
in general one should try to use the explicit TRUE and FALSE values when indicating logical values.
The T and F values are primarily there for when you’re feeling lazy.
You can also use the vector() function to initialize vectors.

> x <- vector("numeric", length = 10)


> x
[1] 0 0 0 0 0 0 0 0 0 0

5.7 Mixing Objects


There are occasions when different classes of R objects get mixed together. Sometimes this happens
by accident but it can also happen on purpose. So what happens with the following code?

> y <- c(1.7, "a") ## character


> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character

In each case above, we are mixing objects of two different classes in a vector. But remember that
the only rule about vectors says this is not allowed. When different objects are mixed in a vector,
coercion occurs so that every element in the vector is of the same class.
In the example above, we see the effect of implicit coercion. What R tries to do is find a way to
represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what
you want and…sometimes not. For example, combining a numeric object with a character object
will create a character vector, because numbers can usually be easily represented as strings.
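
To see what implicit coercion actually did, it helps to look at both the resulting values and their class. The variable names y1, y2 and y3 are used here only to keep the three results apart.

```r
y1 <- c(1.7, "a")     ## numeric 1.7 becomes the string "1.7"
y2 <- c(TRUE, 2)      ## TRUE becomes the number 1
y3 <- c("a", TRUE)    ## TRUE becomes the string "TRUE"

y1; class(y1)         ## "1.7" "a"    and "character"
y2; class(y2)         ## 1 2          and "numeric"
y3; class(y3)         ## "a" "TRUE"   and "character"
```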

5.8 Explicit Coercion


Objects can be explicitly coerced from one class to another using the as.* functions, if available.

> x <- 0:6


> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.

> x <- c("a", "b", "c")


> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA

When nonsensical coercion takes place, you will usually get a warning from R.
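
In a script it is often useful to find out which elements failed to convert. One possible approach, sketched below with made-up input, is to compare the missing values before and after coercion.

```r
x <- c("1", "2", "three")

## suppressWarnings() hides the coercion warning; use it only when
## the NAs are checked explicitly afterwards, as is done here
n <- suppressWarnings(as.numeric(x))
n                  ## 1 2 NA

## Positions that were not missing before but are NA after coercion
bad <- which(is.na(n) & !is.na(x))
x[bad]             ## "three"
```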

5.9 Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector
of length 2 (number of rows, number of columns).

> m <- matrix(nrow = 2, ncol = 3)


> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner
and running down the columns.

> m <- matrix(1:6, nrow = 2, ncol = 3)


> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
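
If the data are laid out row by row instead, the matrix() function has a byrow argument that changes the fill order. A brief sketch comparing the two:

```r
## Default: fill down the columns
m1 <- matrix(1:6, nrow = 2, ncol = 3)

## byrow = TRUE: fill across the rows instead
m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

m1[1, ]   ## 1 3 5
m2[1, ]   ## 1 2 3
```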

Matrices can also be created directly from vectors by adding a dimension attribute.

> m <- 1:10


> m
[1] 1 2 3 4 5 6 7 8 9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.

> x <- 1:3


> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

5.10 Lists
Lists are a special type of vector that can contain elements of different classes. Lists are a very
important data type in R and you should get to know them well. Lists, in combination with the
various “apply” functions discussed later, make for a powerful combination.
Lists can be explicitly created using the list() function, which takes an arbitrary number of
arguments.

> x <- list(1, "a", TRUE, 1 + 4i)


> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

We can also create an empty list of a prespecified length with the vector() function.

> x <- vector("list", length = 5)


> x
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL
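
A common reason to preallocate a list like this is to fill it in afterwards. The sketch below uses a made-up list called results and stores a different kind of object in each slot.

```r
## Preallocate a list, then fill each slot with a different kind of object
results <- vector("list", length = 3)
results[[1]] <- 1:5
results[[2]] <- "some text"
results[[3]] <- c(TRUE, FALSE)

sapply(results, class)    ## "integer" "character" "logical"
sapply(results, length)   ## 5 1 2
```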

5.11 Factors
Factors are used to represent categorical data and can be unordered or ordered. One can think of
a factor as an integer vector where each integer has a label. Factors are important in statistical
modeling and are treated specially by modelling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing. Having a
variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
Factor objects can be created with the factor() function.

> x <- factor(c("yes", "yes", "no", "yes", "no"))


> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
 no yes
  2   3
> ## See the underlying representation of factor
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"

Often factors will be automatically created for you when you read a dataset in using a function like
read.table(). Those functions often default to creating factors when they encounter data that look
like characters or strings.
The order of the levels of a factor can be set using the levels argument to factor(). This can be
important in linear modelling because the first level is used as the baseline level.

> x <- factor(c("yes", "yes", "no", "yes", "no"))


> x ## Levels are put in alphabetical order
[1] yes yes no yes no
Levels: no yes
> x <- factor(c("yes", "yes", "no", "yes", "no"),
+ levels = c("yes", "no"))
> x
[1] yes yes no yes no
Levels: yes no
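
The effect of the level order can be seen in the underlying integer codes. The sketch below also uses relevel(), a function in the stats package that changes the baseline level of an existing factor; the variable names are made up for the illustration.

```r
answers <- c("yes", "yes", "no", "yes", "no")

f1 <- factor(answers)                           ## alphabetical levels
f2 <- factor(answers, levels = c("yes", "no"))  ## explicit level order

levels(f1)       ## "no" "yes"
levels(f2)       ## "yes" "no"
as.integer(f1)   ## 2 2 1 2 1
as.integer(f2)   ## 1 1 2 1 2

## relevel() moves the chosen level to the front of an existing factor
f3 <- relevel(f1, ref = "yes")
levels(f3)       ## "yes" "no"
```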

5.12 Missing Values


Missing values are denoted by NA, or by NaN for undefined mathematical operations.

• is.na() is used to test objects if they are NA


• is.nan() is used to test for NaN
• NA values have a class also, so there are integer NA, character NA, etc.
• A NaN value is also NA but the converse is not true

> ## Create a vector with NAs in it


> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE

> ## Now create a vector with both NA and NaN values


> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE TRUE TRUE FALSE
> is.nan(x)
[1] FALSE FALSE TRUE FALSE FALSE
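
The typed NA constants make the “NA values have a class” point concrete. The last two lines go slightly beyond the text above: they show the na.rm argument that many summary functions, such as mean(), accept for skipping missing values.

```r
class(NA)              ## "logical"
class(NA_integer_)     ## "integer"
class(NA_real_)        ## "numeric"
class(NA_character_)   ## "character"

is.na(NaN)    ## TRUE: NaN counts as missing
is.nan(NA)    ## FALSE: NA is not NaN

x <- c(1, 2, NA, 10, 3)
mean(x)                 ## NA
mean(x, na.rm = TRUE)   ## 4
```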

5.13 Data Frames


Data frames are used to store tabular data in R. They are an important type of object in R and
are used in a variety of statistical modeling applications. Hadley Wickham’s package dplyr¹ has an
optimized set of functions designed to work efficiently with data frames.
Data frames are represented as a special type of list where every element of the list has to have the
same length. Each element of the list can be thought of as a column and the length of each element
of the list is the number of rows.
Unlike matrices, data frames can store different classes of objects in each column. Matrices must
have every element be the same class (e.g. all integers or all numeric).
In addition to column names, indicating the names of the variables or predictors, data frames have
a special attribute called row.names which indicates information about each row of the data frame.
Data frames are usually created by reading in a dataset using read.table() or read.csv().
However, data frames can also be created explicitly with the data.frame() function or they can be
coerced from other types of objects like lists.
Data frames can be converted to a matrix by calling data.matrix(). While it might seem that the
as.matrix() function should be used to coerce a data frame to a matrix, almost always, what you
want is the result of data.matrix().

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))


> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
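
The difference between as.matrix() and data.matrix() mentioned above shows up as soon as a data frame has a non-numeric column. A small sketch with made-up data:

```r
df <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

m1 <- as.matrix(df)     ## everything becomes character
m2 <- data.matrix(df)   ## the factor is replaced by its integer codes

typeof(m1)              ## "character"
typeof(m2)              ## "integer"
m2[, "group"]           ## 1 2 1
```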

¹https://github.com/hadley/dplyr

5.14 Names
R objects can have names, which is very useful for writing readable code and self-describing objects.
Here is an example of assigning names to an integer vector.

> x <- 1:3


> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los Angeles")
> x
   New York     Seattle Los Angeles
          1           2           3
> names(x)
[1] "New York" "Seattle" "Los Angeles"

Lists can also have names, which is often very useful.

> x <- list("Los Angeles" = 1, Boston = 2, London = 3)


> x
$`Los Angeles`
[1] 1

$Boston
[1] 2

$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston" "London"

Matrices can have both column and row names.

> m <- matrix(1:4, nrow = 2, ncol = 2)


> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4

Column names and row names can be set separately using the colnames() and rownames()
functions.

> colnames(m) <- c("h", "f")


> rownames(m) <- c("x", "z")
> m
  h f
x 1 3
z 2 4

Note that for data frames, there is a separate function for setting the row names, the row.names()
function. Also, data frames do not have column names, they just have names (like lists). So to set
the column names of a data frame just use the names() function. Yes, I know it’s confusing. Here’s a
quick summary:

Object        Set column names    Set row names
data frame    names()             row.names()
matrix        colnames()          rownames()
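
The summary above can be tried out on a small data frame and a small matrix. The object names and labels below are made up for the illustration.

```r
## Data frame: names() for columns, row.names() for rows
df <- data.frame(a = 1:2, b = c("x", "y"))
names(df) <- c("count", "label")
row.names(df) <- c("first", "second")
names(df)        ## "count" "label"
row.names(df)    ## "first" "second"

## Matrix: colnames() for columns, rownames() for rows
m <- matrix(1:4, nrow = 2)
colnames(m) <- c("h", "f")
rownames(m) <- c("x", "z")
dimnames(m)      ## a list holding the row names, then the column names
```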

5.15 Summary
There are a variety of different built-in data types in R. In this chapter we have reviewed the following:

• atomic classes: numeric, logical, character, integer, complex


• vectors, lists
• factors
• missing values
• data frames and matrices

All R objects can have attributes that help to describe what is in the object. Perhaps the most useful
attribute is names, such as column and row names in a data frame, or simply names in a vector or
list. Attributes like dimensions are also important as they can modify the behavior of objects, like
turning a vector into a matrix.
6. Getting Data In and Out of R
6.1 Reading and Writing Data
Watch a video of this section¹
There are a few principal functions for reading data into R.

• read.table, read.csv, for reading tabular data


• readLines, for reading lines of a text file
• source, for reading in R code files (inverse of dump)
• dget, for reading in R code files (inverse of dput)
• load, for reading in saved workspaces
• unserialize, for reading single R objects in binary form

There are, of course, many R packages that have been developed to read in all kinds of other datasets,
and you may need to resort to one of these packages if you are working in a specific area.
There are analogous functions for writing data to files

• write.table, for writing tabular data to text files (e.g. CSV) or connections
• writeLines, for writing character data line-by-line to a file or connection
• dump, for dumping a textual representation of multiple R objects
• dput, for outputting a textual representation of an R object
• save, for saving an arbitrary number of R objects in binary format (possibly compressed) to
a file.
• serialize, for converting an R object into a binary format for outputting to a connection (or
file).
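
To show how the reading and writing functions pair up, here is a small round-trip sketch. It writes to temporary files created with tempfile(), and the objects x and y are made up for the example.

```r
y <- data.frame(a = 1:2, b = c("x", "y"))

## dput()/dget(): textual representation of a single object
f1 <- tempfile(fileext = ".R")
dput(y, file = f1)
y2 <- dget(f1)
all.equal(y, y2)      ## TRUE

## dump()/source(): textual representation of several objects, by name
x <- "foo"
f2 <- tempfile(fileext = ".R")
dump(c("x", "y"), file = f2)
rm(x, y)
source(f2)            ## recreates x and y

## save()/load(): binary format
f3 <- tempfile(fileext = ".rda")
save(x, y, file = f3)
rm(x, y)
load(f3)              ## restores x and y under their original names
```

Note that dget() returns the object so you choose its name, while source() and load() recreate objects under the names they had when they were written.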

6.2 Reading Data Files with read.table()


The read.table() function is one of the most commonly used functions for reading data. The help
file for read.table() is worth reading in its entirety if only because the function gets used a lot
(run ?read.table in R). I know, I know, everyone always says to read the help file, but this one is
actually worth reading.
The read.table() function has a few important arguments:
¹https://youtu.be/Z_dc_FADyi4


• file, the name of a file, or a connection


• header, logical indicating if the file has a header line
• sep, a string indicating how the columns are separated
• colClasses, a character vector indicating the class of each column in the dataset
• nrows, the number of rows in the dataset. By default read.table() reads an entire file.
• comment.char, a character string indicating the comment character. This defaults to "#". If there
are no commented lines in your file, it’s worth setting this to be the empty string "".
• skip, the number of lines to skip from the beginning
• stringsAsFactors, should character variables be coded as factors? In versions of R before 4.0.0 this
defaults to TRUE because back in the old days, if you had data that were stored as strings, it was because
those strings represented levels of a categorical variable (since R 4.0.0 the default is FALSE). Now we
have lots of data that is text data and they don’t always represent categorical variables. So you may want
to set this to be FALSE in those cases. If you always want this to be FALSE, you can set a global option via
options(stringsAsFactors = FALSE). I’ve never seen as much heat generated on discussion
forums about an R function argument as about the stringsAsFactors argument. Seriously.

For small to moderately sized datasets, you can usually call read.table without specifying any other
arguments

> data <- read.table("foo.txt")

In this case, R will automatically

• skip lines that begin with a #


• figure out how many rows there are (and how much memory needs to be allocated)
• figure out what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more efficiently. The read.csv() function
is identical to read.table except that some of the defaults are set differently (like the sep argument).
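
As an illustration of spelling these arguments out, the sketch below writes a tiny made-up file to a temporary location and reads it back with both functions. The file contents and column names are assumptions made for the example.

```r
## Write a small example file (contents are made up for illustration)
path <- tempfile(fileext = ".txt")
writeLines(c("# a comment line",
             "id,name,score",
             "1,alice,9.5",
             "2,bob,7.25"), path)

## read.table() with the main arguments given explicitly
d1 <- read.table(path, header = TRUE, sep = ",",
                 colClasses = c("integer", "character", "numeric"),
                 nrows = 2, comment.char = "#")

## read.csv() already defaults to header = TRUE and sep = ",",
## but its comment.char default is "", so it is set explicitly here
d2 <- read.csv(path, comment.char = "#",
               colClasses = c("integer", "character", "numeric"))

str(d1)
str(d2)
```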

6.3 Reading in Larger Datasets with read.table


Watch a video of this section²
With much larger datasets, there are a few things that you can do that will make your life easier and
will prevent R from choking.

• Read the help page for read.table, which contains many hints
²https://youtu.be/BJYYIJO3UFI

• Make a rough calculation of the memory required to store your dataset (see the next section
for an example of how to do this). If the dataset is larger than the amount of RAM on your
computer, you can probably stop right here.
• Set comment.char = "" if there are no commented lines in your file.
• Use the colClasses argument. Specifying this option instead of using the default can make
read.table run MUCH faster, often twice as fast. In order to use this option, you have to know
the class of each column in your data frame. If all of the columns are “numeric”, for example,
then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes
of each column is the following:

> initial <- read.table("datatable.txt", nrows = 100)


> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)

• Set nrows. This doesn’t make R run faster but it helps with memory usage. A mild overestimate
is okay. You can use the Unix tool wc to calculate the number of lines in a file.
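
Putting the hints above together, here is a self-contained sketch. It generates a made-up file of 1000 rows as a stand-in for a large dataset, and it counts lines with readLines() rather than the Unix wc tool, which is an assumption that only suits files small enough to read as text.

```r
## Create a stand-in for a large file (made-up data, 1000 rows)
path <- tempfile(fileext = ".txt")
big <- data.frame(id = 1:1000, value = rnorm(1000),
                  label = sample(letters, 1000, replace = TRUE))
write.table(big, path, row.names = FALSE, quote = FALSE)

## Step 1: read a few rows to learn the column classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

## Step 2: count the data rows (subtract the header line)
n <- length(readLines(path)) - 1

## Step 3: read the full file with everything specified
tabAll <- read.table(path, header = TRUE, colClasses = classes,
                     nrows = n, comment.char = "")
dim(tabAll)   ## 1000 3
```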

In general, when using R with larger datasets, it’s also useful to know a few things about your
system.

• How much memory is available on your system?


• What other applications are in use? Can you close any of them?
• Are there other users logged into the same system?
• What operating system are you using? Some operating systems can limit the amount of memory
a single process can access.

6.4 Calculating Memory Requirements for R Objects


Because R stores all of its objects in physical memory, it is important to be cognizant of how much
memory is being used up by all of the data objects residing in your workspace. One situation where
it’s particularly important to understand memory requirements is when you are reading a new
dataset into R. Fortunately, it’s easy to make a back-of-the-envelope calculation of how much memory
will be required by a new dataset.
For example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are
numeric data. Roughly, how much memory is required to store this data frame? Well, on most
modern computers double precision floating point numbers³ are stored using 64 bits of memory, or
8 bytes. Given that information, you can do the following calculation

³http://en.wikipedia.org/wiki/Double-precision_floating-point_format

1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes
                                  = 1,440,000,000 / 2^20 bytes/MB
                                  = 1,373.29 MB
                                  = 1.34 GB

So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that
much RAM. However, you need to be aware of

• what other programs might be running on your computer, using up RAM
• what other R objects might already be taking up RAM in your workspace

Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your
computer (or at least your R session). This is an unpleasant experience that usually requires
you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So
make sure to do a rough calculation of memory requirements before reading in a large dataset.
You’ll thank me later.
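
The rough calculation from this section is easy to do in R itself. This is a minimal sketch; the scaled-down data frame at the end is my own addition, there only to check the estimate against what object.size() reports without allocating the full 1.34 GB.

```r
## Back-of-the-envelope estimate: rows x columns x 8 bytes per numeric
n_rows <- 1500000
n_cols <- 120
bytes <- n_rows * n_cols * 8
mb <- bytes / 2^20
gb <- mb / 2^10
round(c(bytes = bytes, MB = mb, GB = gb), 2)

## Sanity check on a scaled-down data frame (1,500 rows instead of 1,500,000);
## object.size() reports somewhat more than the estimate because of overhead
small <- as.data.frame(matrix(0, nrow = 1500, ncol = n_cols))
estimate_small <- 1500 * n_cols * 8
actual_small <- as.numeric(object.size(small))
c(estimate = estimate_small, actual = actual_small)
```

The estimate ignores per-object overhead such as column names and attributes, so treat it as a lower bound.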
7. Using the readr Package
The readr package was recently developed by Hadley Wickham to deal with reading in large flat
files quickly. The package provides replacements for functions like read.table() and read.csv().
The analogous functions in readr are read_table() and read_csv(). These functions are often much
faster than their base R analogues and provide a few other nice features such as progress meters.
For the most part, you can use read_table() and read_csv() pretty much anywhere you might
use read.table() and read.csv(). In addition, if there are non-fatal problems that occur while
reading in the data, you will get a warning and the returned data frame will have some information
about which rows/observations triggered the warning. This can be very helpful for “debugging”
problems with your data before you get neck deep in data analysis.
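
As a rough illustration of the warning behavior described above, here is a minimal sketch that assumes the readr package is installed. The file contents and column names (id, score) are hypothetical. The parsing warning is silenced only to keep the output short; normally you would want to see it.

```r
## This sketch only runs if the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  ## Hypothetical CSV with one bad value ("oops") in a numeric column
  path <- tempfile(fileext = ".csv")
  writeLines(c("id,score", "1,3.5", "2,oops", "3,4.0"), path)

  ## Declare the column types up front, much like colClasses in read.table()
  dat <- suppressWarnings(
    readr::read_csv(path, col_types = readr::cols(
      id = readr::col_integer(),
      score = readr::col_double()
    ))
  )

  ## The unparseable value is read as NA; problems() lists what went wrong
  print(dat)
  print(readr::problems(dat))
}
```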

8. Using Textual and Binary Formats for Storing Data
Watch a video of this chapter¹
There are a variety of ways that data can be stored, including structured text files like CSV or tab-
delimited, or more complex binary formats. However, there is an intermediate format that is textual,
but not as simple as something like CSV. The format is native to R and is somewhat readable because
of its textual nature.
One can create a more descriptive representation of an R object by using the dput() or dump()
functions. The dump() and dput() functions are useful because the resulting textual format is
editable, and in the case of corruption, potentially recoverable. Unlike writing out a table or CSV file,
dump() and dput() preserve the metadata (sacrificing some readability), so that another user doesn’t
have to specify it all over again. For example, we can preserve the class of each column of a table or
the levels of a factor variable.
Textual formats can work much better with version control programs like Subversion or Git, which
can only track changes meaningfully in text files. In addition, textual formats can be longer-lived;
if there is corruption somewhere in the file, it can be easier to fix the problem because one can just
open the file in an editor and look at it (although this would probably only be done in a worst case
scenario!). Finally, textual formats adhere to the Unix philosophy², if that means anything to you.
There are a few downsides to using these intermediate textual formats. The format is not very space-
efficient, because all of the metadata is specified. Also, it is really only partially readable. In some
instances it might be preferable to have data stored in a CSV file and then have a separate code file
that specifies the metadata.

8.1 Using dput() and dump()


One way to pass data around is by deparsing the R object with dput() and reading it back in (parsing
it) using dget().

¹https://youtu.be/5mIPigbNDfk
²http://www.catb.org/esr/writings/taoup/


> ## Create a data frame
> y <- data.frame(a = 1, b = "a")
> ## Print 'dput' output to console
> dput(y)
structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a",
"b"), row.names = c(NA, -1L), class = "data.frame")

Notice that the dput() output is in the form of R code and that it preserves metadata like the class
of the object, the row names, and the column names.
The output of dput() can also be saved directly to a file.

> ## Send 'dput' output to a file
> dput(y, file = "y.R")
> ## Read in 'dput' output from a file
> new.y <- dget("y.R")
> new.y
a b
1 1 a

Multiple objects can be deparsed at once using the dump function and read back in using source.

> x <- "foo"
> y <- data.frame(a = 1L, b = "a")

We can dump() R objects to a file by passing a character vector of their names.

> dump(c("x", "y"), file = "data.R")
> rm(x, y)

The inverse of dump() is source().

> source("data.R")
> str(y)
'data.frame': 1 obs. of 2 variables:
$ a: int 1
$ b: Factor w/ 1 level "a": 1
> x
[1] "foo"
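
To make the point about metadata concrete, here is a minimal sketch that sends the same data frame through a CSV file and through dput()/dget(). The data frame and its factor levels are made up for illustration. How read.csv() treats the text column depends on your R version (character in R 4.0 and later, a factor with alphabetically sorted levels before that), but in neither case does the original level order come back.

```r
## A data frame whose factor levels are deliberately not in alphabetical order
y <- data.frame(a = 1:3,
                b = factor(c("lo", "hi", "lo"), levels = c("lo", "hi")))

## Round trip through a CSV file: the factor metadata is not stored
csv_path <- tempfile(fileext = ".csv")
write.csv(y, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

## Round trip through dput()/dget(): the metadata travels with the data
r_path <- tempfile(fileext = ".R")
dput(y, file = r_path)
from_dput <- dget(r_path)

levels(from_csv$b)       ## not the original c("lo", "hi")
levels(from_dput$b)      ## the original level order
identical(from_dput, y)
```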

8.2 Binary Formats


The complement to the textual format is the binary format, which is sometimes necessary to use
for efficiency purposes, or because there’s just no useful way to represent data in a textual manner.
Also, with numeric data, one can often lose precision when converting to and from a textual format,
so it’s better to stick with a binary format.
The key functions for converting R objects into a binary format are save(), save.image(), and
serialize(). Individual R objects can be saved to a file using the save() function.

> a <- data.frame(x = rnorm(100), y = runif(100))
> b <- c(3, 4.4, 1 / 3)
>
> ## Save 'a' and 'b' to a file
> save(a, b, file = "mydata.rda")
>
> ## Load 'a' and 'b' into your workspace
> load("mydata.rda")

If you have a lot of objects that you want to save to a file, you can save all objects in your workspace
using the save.image() function.

> ## Save everything to a file
> save.image(file = "mydata.RData")
>
> ## load all objects in this file
> load("mydata.RData")

Notice that I’ve used the .rda extension when using save() and the .RData extension when using
save.image(). This is just my personal preference; you can use whatever file extension you want.
The save() and save.image() functions do not care. However, .rda and .RData are fairly common
extensions and you may want to use them because they are recognized by other software.
The serialize() function is used to convert individual R objects into a binary format that can be
communicated across an arbitrary connection. This may get sent to a file, but it could get sent over
a network or other connection.
When you call serialize() on an R object, the output will be a raw vector coded in hexadecimal
format.

> x <- list(1, 2, 3)
> serialize(x, NULL)
[1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 13 00 00 00 03 00
[24] 00 00 0e 00 00 00 01 3f f0 00 00 00 00 00 00 00 00 00 0e 00 00 00 01
[47] 40 00 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 40 08 00 00 00 00 00
[70] 00

If you want, this can be sent to a file, but in that case you are better off using something like save().
The benefit of the serialize() function is that it is the only way to perfectly represent an R object
in an exportable format, without losing precision or any metadata. If that is what you need, then
serialize() is the function for you.
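
The precision issue can be seen with a small experiment. This is a minimal sketch that assumes the default settings of write.table(), which writes numbers with up to 15 significant digits; unserialize() is the base R counterpart of serialize().

```r
b <- c(3, 4.4, 1 / 3)

## Textual round trip: write the numbers as text, then read them back
txt <- tempfile(fileext = ".txt")
write.table(b, txt, row.names = FALSE, col.names = FALSE)
b_text <- scan(txt, quiet = TRUE)

## Binary round trip: serialize to a raw vector, then unserialize
b_bin <- unserialize(serialize(b, NULL))

identical(b_text, b)   ## the value 1/3 was truncated on the way out
identical(b_bin, b)    ## the binary copy is bit-for-bit the same
all.equal(b_text, b)   ## the text copy is still equal up to a small tolerance
```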
9. Interfaces to the Outside World
Watch a video of this chapter¹
Data are read in using connection interfaces. Connections can be made to files (most common) or to
other more exotic things.

• file, opens a connection to a file
• gzfile, opens a connection to a file compressed with gzip
• bzfile, opens a connection to a file compressed with bzip2
• url, opens a connection to a webpage

In general, connections are powerful tools that let you navigate files or other external objects.
Connections can be thought of as a translator that lets you talk to objects that are outside of R.
Those outside objects could be anything from a database or a simple text file to a web service API.
Connections allow R functions to talk to all these different external objects without you having to
write custom code for each object.

9.1 File Connections


Connections to text files can be created with the file() function.

> str(file)
function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"),
raw = FALSE)

The file() function has a number of arguments that are common to many other connection
functions so it’s worth going into a little detail here.

• description is the name of the file
• open is a code indicating what mode the file should be opened in

The open argument allows for the following options:

• “r” open file in read only mode


¹https://youtu.be/Pb01WoJRUtY


• “w” open a file for writing (and initializing a new file)


• “a” open a file for appending
• “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)

In practice, we often don’t need to deal with the connection interface directly as many functions for
reading and writing data just deal with it in the background.
For example, if one were to explicitly use connections to read a CSV file into R, it might look like
this:

> ## Create a connection to 'foo.txt'
> con <- file("foo.txt")
>
> ## Open connection to 'foo.txt' in read-only mode
> open(con, "r")
>
> ## Read from the connection
> data <- read.csv(con)
>
> ## Close the connection
> close(con)

which is the same as

> data <- read.csv("foo.txt")

In the background, read.csv() opens a connection to the file foo.txt, reads from it, and closes the
connection when it’s done.
The above example shows the basic approach to using connections. Connections must be opened,
then they are read from or written to, and then they are closed.
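
The open, use, close cycle works the same way for writing. Here is a minimal sketch that uses the "w", "a", and "r" modes in turn on a temporary file; the file contents are made up for illustration.

```r
path <- tempfile(fileext = ".txt")

## "w": create the file and write two lines
con <- file(path, open = "w")
writeLines(c("first", "second"), con)
close(con)

## "a": reopen the same file and append a line
con <- file(path, open = "a")
writeLines("third", con)
close(con)

## "r": reopen for reading and get everything back
con <- file(path, open = "r")
lines <- readLines(con)
close(con)
lines
```

Opening the file again with "w" instead of "a" would have started a new, empty file and discarded the first two lines.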

9.2 Reading Lines of a Text File


Text files can be read line by line using the readLines() function. This function is useful for reading
text files that may be unstructured or contain non-standard data.

> ## Open connection to gz-compressed text file
> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
[1] "1080" "10-point" "10th" "11-point" "12-point" "16-point"
[7] "18-point" "1st" "2" "20-point"

For more structured text data like CSV files or tab-delimited files, there are other functions like
read.csv() or read.table().
The above example used the gzfile() function which is used to create a connection to files
compressed using the gzip algorithm. This approach is useful because it allows you to read from
a file without having to uncompress the file first, which would be a waste of space and time.
There is a complementary function writeLines() that takes a character vector and writes each
element of the vector one line at a time to a text file.
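
Here is a minimal sketch of writeLines() and readLines() working together through a gzfile() connection. The temporary file and the words in it are hypothetical stand-ins for something like words.gz.

```r
gz_path <- tempfile(fileext = ".gz")

## Write a character vector, one element per line, to a compressed file
con <- gzfile(gz_path, "w")
writeLines(c("alpha", "beta", "gamma"), con)
close(con)

## Read back only the first two lines without uncompressing the file on disk
con <- gzfile(gz_path, "r")
first_two <- readLines(con, n = 2)
close(con)
first_two
```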

9.3 Reading From a URL Connection


The readLines() function can be useful for reading in lines of webpages. Since web pages are
basically text files that are stored on a remote server, there is conceptually not much difference
between a web page and a local text file. However, we need R to negotiate the communication
between your computer and the web server. This is what the url() function can do for you, by
creating a url connection to a web server.
This code might take time depending on your connection speed.

> ## Open a URL connection for reading
> con <- url("http://www.jhsph.edu", "r")
>
> ## Read the web page
> x <- readLines(con)
>
> ## Print out the first few lines
> head(x)
[1] "<!DOCTYPE html>"
[2] "<html lang=\"en\">"
[3] ""
[4] "<head>"
[5] "<meta charset=\"utf-8\" />"
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"

Reading in a simple web page is sometimes useful, particularly if data are embedded in the
web page somewhere. More commonly, however, we can use a URL connection to read in specific
data files that are stored on web servers.

Using URL connections can be useful for producing a reproducible analysis, because the code
essentially documents where the data came from and how they were obtained. This approach
is preferable to opening a web browser and downloading a dataset by hand. Of course, the code you
write with connections may not be executable at a later date if things on the server side are changed
or reorganized.
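
Reading a data file through a URL connection might look like the following. This is a minimal sketch that runs offline: a local CSV file exposed through a file:// URL stands in for a file on a web server, and the dataset and its column names (station, pm25) are made up. It assumes a Unix-style absolute path; in a real analysis the URL would be an http:// or https:// address.

```r
## Stand-in for a remote data file: a local CSV behind a file:// URL
local_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(station = c("A", "B"), pm25 = c(12.1, 8.4)),
          local_csv, row.names = FALSE)
data_url <- paste0("file://", normalizePath(local_csv, winslash = "/"))

## The URL in the script documents where the data came from
con <- url(data_url)
pollution <- read.csv(con)   ## read.csv() opens and closes the connection
pollution
```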
10. Subsetting R Objects
Watch a video of this section¹
There are three operators that can be used to extract subsets of R objects.

• The [ operator always returns an object of the same class as the original. It can be used to
select multiple elements of an object.
• The [[ operator is used to extract elements of a list or a data frame. It can only be used to
extract a single element and the class of the returned object will not necessarily be a list or
data frame.
• The $ operator is used to extract elements of a list or data frame by literal name. Its semantics
are similar to those of [[.
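
The difference between the three operators is easiest to see on a small list. This is a minimal sketch; the list and its element names (foo, bar) are made up for illustration.

```r
x <- list(foo = 1:4, bar = 0.6)

## [ returns an object of the same class as the original: a list
sub_single <- x["foo"]
class(sub_single)

## [[ extracts the element itself: here an integer vector
elem <- x[["foo"]]
class(elem)

## $ extracts by literal name, much like [[
x$bar

## Only [ can select more than one element at a time
sub_multi <- x[c("foo", "bar")]
length(sub_multi)
```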

10.1 Subsetting a Vector


Vectors are basic objects in R and they can be subsetted using the [ operator.

> x <- c("a", "b", "c", "c", "d", "a")
> x[1] ## Extract the first element
[1] "a"
> x[2] ## Extract the second element
[1] "b"

The [ operator can be used to extract multiple elements of a vector by passing the operator an integer
sequence. Here we extract the first four elements of the vector.

> x[1:4]
[1] "a" "b" "c" "c"

The sequence does not have to be in order; you can specify any arbitrary integer vector.

> x[c(1, 3, 4)]
[1] "a" "c" "c"

We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a
given condition. For example, here we want the elements of x that come lexicographically after the
letter “a”.
¹https://youtu.be/VfZUZGUgHqg


> u <- x > "a"
> u
[1] FALSE TRUE TRUE TRUE TRUE FALSE
> x[u]
[1] "b" "c" "c" "d"

Another, more compact, way to do this would be to skip the creation of a logical vector and just
subset the vector directly with the logical expression.

> x[x > "a"]
[1] "b" "c" "c" "d"

10.2 Subsetting a Matrix


Watch a video of this section²
Matrices can be subsetted in the usual way with (i,j) type indices. Here, we create a simple 2 × 3
matrix with the matrix() function.

> x <- matrix(1:6, 2, 3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

We can access the (1, 2) or the (2, 1) element of this matrix using the appropriate indices.

> x[1, 2]
[1] 3
> x[2, 1]
[1] 2

Indices can also be missing. This behavior is used to access entire rows or columns of a matrix.

> x[1, ] ## Extract the first row
[1] 1 3 5
> x[, 2] ## Extract the second column
[1] 3 4

Dropping matrix dimensions


By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather
than a 1 × 1 matrix. Often, this is exactly what we want, but this behavior can be turned off by
setting drop = FALSE.
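
A minimal sketch of the drop argument, reusing the 2 × 3 matrix from above:

```r
x <- matrix(1:6, 2, 3)

## Default behavior: the result is a plain vector with no dimensions
single <- x[1, 2]
dim(single)

## With drop = FALSE the result stays a 1 x 1 matrix
single_mat <- x[1, 2, drop = FALSE]
dim(single_mat)

## The same applies when extracting a whole row
row_mat <- x[1, , drop = FALSE]
dim(row_mat)
```

Keeping the dimensions can matter when later code expects a matrix, for example when calling dim() or doing matrix multiplication on the result.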
²https://youtu.be/FzjXesh9tRw
On the night of Sunday, August 26th, 1883, the blackness of the
dust-clouds, now much thicker than ever in the Straits of Sunda and
adjacent parts of Sumatra and Java, was only occasionally illumined
by lurid flashes from the volcano. The Krakatoan thunders were on
the point of attaining their complete development. At the town of
Batavia, a hundred miles distant, there was no quiet that night. The
houses trembled with the subterranean violence, and the windows
rattled as if heavy artillery were being discharged in the streets. And
still these efforts seemed to be only rehearsing for the supreme
display. By ten o’clock on the morning of Monday, August 27th,
1883, the rehearsals were over and the performance began. An
overture, consisting of two or three introductory explosions, was
succeeded by a frightful convulsion which tore away a large part of
the island of Krakatoa and scattered it to the winds of heaven. In
that final effort all records of previous explosions on this earth were
completely broken.
This supreme effort it was which produced the mightiest noise
that, so far as we can ascertain, has ever been heard on this globe.
It must have been indeed a loud noise which could travel from
Krakatoa to Batavia and preserve its vehemence over so great a
distance; but we should form a very inadequate conception of the
energy of the eruption of Krakatoa if we thought that its sounds
were heard by those merely a hundred miles off. This would be little
indeed compared with what is recorded, on testimony which it is
impossible to doubt.
THE EARLY STAGE OF THE ERUPTION OF KRAKATOA.
(From a Photograph taken on May 27th, 1883.)

Westward from Krakatoa stretches the wide expanse of the Indian
Ocean. On the opposite side from the Straits of Sunda lies the island
of Rodriguez, the distance from Krakatoa being almost three
thousand miles. It has been proved by evidence which cannot be
doubted that the booming of the great volcano attracted the
attention of an intelligent coastguard on Rodriguez, who carefully
noted the character of the sounds and the time of their occurrence.
He had heard them just four hours after the actual explosion, for this
is the time the sound occupied on its journey.
We shall better realise the extraordinary vehemence of this
tremendous noise if we imagine a similar event to take place in
localities more known to most of us than are the far Eastern seas.
If Vesuvius were vigorous enough to thunder forth like Krakatoa,
how great would be the consternation of the world! Such a report
might be heard by King Edward at Windsor, and by the Czar of all
the Russias at Moscow. It would astonish the German Emperor and
all his subjects. It would penetrate to the seclusion of the Sultan at
Constantinople. Nansen would still have been within its reach when
he was furthest north, near the Pole. It would have extended to the
sources of the Nile, near the Equator. It would have been heard by
Mohammedan pilgrims at Mecca. It would have reached the ears of
exiles in Siberia. No inhabitant of Persia would have been beyond its
range, while passengers on half the liners crossing the Atlantic
would also catch the mighty reverberation.
The subject is of such exceptional interest that I may venture on
another illustration. Let us suppose that a similar earth-shaking
event took place in a central position in the United States. Let us
say, for example, that an explosion occurred at Pike’s Peak as
resonant as that from Krakatoa. It would certainly startle not a little
the inhabitants of Colorado far and wide. The ears of dwellers in the
neighbouring States would receive a considerable shock. With
lessening intensity the sound would spread much further around—
indeed, it might be heard all over the United States. The sonorous
waves would roll over to the Atlantic coast, they would be heard on
the shores of the Pacific. Florida would not be too far to the south,
nor Alaska too remote to the north. If, indeed, we could believe that
the sound would travel as freely over the great continent as it did
across the Indian Ocean, then we may boldly assert that every ear in
North America might listen to the thunder from Pike’s Peak, if it
rivalled Krakatoa. The reverberation might even be audible by skin-
clad Eskimos amid the snows of Greenland, and by naked Indians
sweltering on the Orinoco. Can we doubt that Krakatoa made the
greatest noise that has ever been recorded?

Fig. 27.—Spread of the Air-wave from Krakatoa to the Antipodes.
(From the Royal Society’s Reports.)

Among the many other incidents connected with this explosion, I
may specially mention the wonderful system of divergent ripples that
started in our atmosphere from the point at which the eruption took
place. I have called them ripples, from the obvious resemblance
which they bear to the circular expanding ripples produced by
raindrops which fall upon the still surface of water. But it would be
more correct to say that these objects were a series of great
undulations which started from Krakatoa and spread forth in ever-
enlarging circles through our atmosphere. The initial impetus was so
tremendous that these waves spread for hundreds and thousands of
miles. They diverged, in fact, until they put a mighty girdle round the
earth, on a great circle of which Krakatoa was the pole. The
atmospheric waves, with the whole earth now well in their grasp,
advanced into the opposite hemisphere. In their further progress
they had necessarily to form gradually contracting circles, until at
last they converged to a point in Central America, at the very
opposite point of the diameter of our earth, eight thousand miles
from Krakatoa. Thus the waves completely embraced the earth.
Every part of our atmosphere had been set into a tingle by the great
eruption. In Great Britain the waves passed over our heads, the air
in our streets, the air in our houses, trembled from the volcanic
impulse. The very oxygen supplying our lungs was responding also
to the supreme convulsion which took place ten thousand miles
away. It is needless to object that this could not have taken place
because we did not feel it. Self-registering barometers have enabled
these waves to be followed unmistakably all over the globe.
Such was the energy with which these vibrations were initiated at
Krakatoa, that even when the waves thus arising had converged to
the point diametrically opposite in South America their vigour was
not yet exhausted. The waves were then, strange to say, reflected
back from their point of convergence to retrace their steps to
Krakatoa. Starting from Central America, they again described a
series of enlarging circles, until they embraced the whole earth.
Then, advancing into the opposite hemisphere, they gradually
contracted until they had regained the Straits of Sunda, from which
they had set forth about thirty-six hours previously. Here was,
indeed, a unique experience. The air-waves had twice gone from
end to end of this globe of ours. Even then the atmosphere did not
subside until, after some more oscillations of gradually fading
intensity, at last they became evanescent.
But, besides these phenomenal undulations, this mighty incident
at Krakatoa has taught us other lessons on the constitution of our
atmosphere. We previously knew little, or I might almost say
nothing, as to the conditions prevailing above the height of ten miles
overhead. We were almost altogether ignorant of what the wind
might be at an altitude of, let us say, twenty miles. It was Krakatoa
which first gave us a little information which was greatly wanted.
How could we learn what winds were blowing at a height four times
as great as the loftiest mountain on the earth, and twice as great as
the loftiest altitude to which a balloon has ever soared? We could
neither see these winds nor feel them. How, then, could we learn
whether they really existed? No doubt a straw will show the way the
wind blows, but there are no straws up there. There was nothing to
render the winds perceptible until Krakatoa came to our aid.
Krakatoa drove into those winds prodigious quantities of dust.
Hundreds of cubic miles of air were thus deprived of that invisibility
which they had hitherto maintained. They were thus compelled to
disclose those movements about which, neither before nor since,
have we had any opportunity of learning.
With eyes full of astonishment men watched those vast volumes of
Krakatoa dust start on a tremendous journey. Westward the dust of
Krakatoa took its way. Of course, everyone knows the so-called
trade-winds on our earth’s surface, which blow steadily in fixed
directions, and which are of such service to the mariner. But there is
yet another constant wind. We cannot call it a trade-wind, for it
never has rendered, and never will render, any service to navigation.
It was first disclosed by Krakatoa. Before the occurrence of that
eruption no one had the slightest suspicion that far up aloft, twenty
miles over our heads, a mighty tempest is incessantly hurrying with
a speed much greater than that of the awful hurricane which once
laid so large a part of Calcutta on the ground, and slew so many of
its inhabitants. Fortunately for humanity, this new trade-wind does
not come within less than twenty miles of the earth’s surface. We
are thus preserved from the fearful destruction that its
unintermittent blasts would produce, blasts against which no tree
could stand, and which would, in ten minutes, do as much damage
to a city as would the most violent earthquake. When this great wind
had become charged with the dust of Krakatoa, then, for the first
and, I may add, for the only time, it stood revealed to human vision.
Then it was seen that this wind circled round the earth in the vicinity
of the Equator, and completed its circuit in about thirteen days.
Please observe the contrast between this wind of which we are
now speaking and the waves to which we have just referred. The
waves were merely undulations or vibrations produced by the blow
which our atmosphere received from the explosion of Krakatoa, and
these waves were propagated through the atmosphere much in the
same way as sound waves are propagated. Indeed, these waves
moved with the same velocity as sound. But the current of air of
which we are now speaking was not produced by Krakatoa; it
existed from all time, before Krakatoa was ever heard of, and it
exists at the present moment, and will doubtless exist as long as the
earth’s meteorological arrangements remain as they are at present.
All that Krakatoa did was simply to provide the charges of dust by
which for one brief period this wind was made visible.
In the autumn of 1883 the newspapers were full of accounts of
strange appearances in the heavens. The letters containing these
accounts poured in upon us from residents in Ceylon; they came
from residents in the West Indies, and from other tropical places. All
had the same tale to tell. Sometimes experienced observers assured
us that the sun looked blue; sometimes we were told of the
amazement with which people beheld the moon draped in vivid
green. Other accounts told of curious halos, and, in short, of the
signs in the sun, the moon, and the stars, which were exceedingly
unusual, even if we do not say that they were absolutely
unprecedented.
Those who wrote to tell of the strange hues that the sun
manifested to travellers in Ceylon, or to planters in Jamaica, never
dreamt of attributing the phenomena to Krakatoa, many thousands
of miles away. In fact, these observers knew nothing at the time of
the Krakatoa eruption, and probably few of them, if any, had ever
heard that such a place existed. It was only gradually that the belief
grew that these phenomena were due to Krakatoa. But when the
accounts were carefully compared, and when the dates were studied
at which the phenomena were witnessed in the various localities, it
was demonstrated that these phenomena, notwithstanding their
worldwide distribution, had certainly arisen from the eruption in this
little island in the Straits of Sunda. It was most assuredly Krakatoa
that painted the sun and the moon, and produced the other strange
and weird phenomena in the tropics.
After a little time we learned what had actually happened. The
dust manufactured by the supreme convulsion was whirled round
the earth in the mighty atmospheric current into which the volcano
discharged it. As the dust-cloud was swept along by this
incomparable hurricane, it showed its presence in the most glorious
manner by decking the sun and the moon in hues of unaccustomed
splendour and beauty. The blue colour in the sky under ordinary
circumstances is due to particles in the air, and when the ordinary
motes of the sunbeam were reinforced by the introduction of the
myriads of motes produced by Krakatoa, even the sun itself
sometimes showed a blue tint. Thus the progress of the great dust-
cloud was traced out by the extraordinary sky effects it produced,
and from the progress of the dust-cloud we inferred the movements
of the invisible air current which carried it along. Nor need it be
thought that the quantity of material projected from Krakatoa should
have been inadequate to produce effects of this worldwide
description. Imagine that the material which was blown to the winds
of heaven by the supreme convulsion of Krakatoa could be all
recovered and swept into one vast heap. Imagine that the heap
were to have its bulk measured by a vessel consisting of a cube one
mile long, one mile broad, and one mile deep; it has been estimated
that even this prodigious vessel would have to be filled to the brim
at least ten times before all the products of Krakatoa had been
measured.
It was in the late autumn of 1883 that the marvellous series of
celestial phenomena connected with the great eruption began to be
displayed in Great Britain. Then it was that the glory of the ordinary
sunsets was enhanced by a splendour which has dwelt in the
memory of all those who were permitted to see them. The
frontispiece of this volume contains a view of the sunset as seen at
Chelsea at 4.40 p.m. on November 26th, 1883. The picture was
painted from nature by Mr. W. Ascroft, and is given in the great work
on Krakatoa which was published by the Royal Society. There is not
the least doubt that it was the dust from Krakatoa which produced
the beauty of those sunsets, and so long as that dust remained
suspended in our atmosphere, so long were strange signs to be
witnessed in the heavenly bodies. But the dust which had been
borne with unparalleled violence from the interior of the volcano, the
dust which had been shot aloft by the vehemence of the eruption to
an altitude of twenty miles, the dust which had thus been whirled
round and round our earth for perhaps a dozen times or more in this
air current, which carried it round in less than a fortnight, was
endowed with no power to resist for ever the law of gravitation
which bids it fall to the earth. It therefore gradually sank
downwards. Owing, however, to the great height to which it had
been driven, owing to the impetuous nature of the current by which
it was hurried along, and owing to the exceedingly minute particles
of which it was composed, the act of sinking was greatly protracted.
Not until two years after the original explosion had all the particles
with which the air was charged by the great eruption finally subsided
on the earth.
At first there were some who refused to believe that the glory of
the sunsets in London could possibly be due to a volcano in the
Straits of Sunda, at a distance from England which was but little
short of that of Australia. But the gorgeous phenomena in England
were found to be simultaneous with like phenomena in other places
all round the earth. Once again the comparison of dates and other
circumstances proved that Krakatoa was the cause of these
exceptional and most interesting appearances.
Nor was the incident without a historical parallel, for has not
Tennyson told us of the call to St. Telemachus—
“Had the fierce ashes of some fiery peak
Been hurl’d so high they ranged about the globe?
For day by day, thro’ many a blood-red eve,
In that four-hundredth summer after Christ,
The wrathful sunset glared....”
CHAPTER X.

SPIRAL AND PLANETARY NEBULÆ.

A Substitute for History—Photograph of the Great Spiral taken at the Lick
Observatory—Solar System Relations Unimportant—Chaotic Nebulæ—Lord
Rosse’s Great Discovery—Dr. Roberts’ Photographs—The Astonishing
Discovery of Professor Keeler—The Perspective of the Spirals—The Spiral
Nebulæ are not Gaseous—The Spiral is a Nebula in an advanced Stage of
Development—Character of the Great Nebula in Andromeda.

In a great college in America a new educational experiment has
been tried with some success. Instead of the instruction in history
which students receive in most other institutions, an attempt has
been made in this college to give instruction in a very different
manner, which it is believed will not be of less educational value than
the more ordinary processes of teaching. In the course of study to
which I am now referring the student is invited not so much to
consider the history of the development of the Constitution of one
particular country, as to make a broad survey of the different
Constitutions under which the several countries of the world are at
this moment governed. The promoters of this scheme believe that
many of the intellectual advantages which are ordinarily expected to
be gained by the study of the history of one country may be secured
equally well by studying only existing conditions, provided that
attention is given to several countries which have arrived at different
stages of civilisation.
Without attempting to say how far the study of the existing
Constitutions of France and Germany, America and Australia, Turkey
and India, Morocco and Fiji, might be justly used to supersede the
study of English history, it may at least be urged that if we had no
annals from which history could be compiled it might be instructive
to employ such a substitute for historical studies as is here
suggested. This is, indeed, the course which we are compelled to
take in our study of that great chapter in earth-history which we are
discussing in these pages. It is obvious from the nature of the case
that it can never be possible for us to obtain direct testimony as to
what occurred in the bringing together of the materials of this globe.
We must, therefore, look abroad through the universe, and see
whether we can find, from the study of other systems at present in
various stages of their evolution, illustrations of the incidents which
we may presume to have occurred in the early stages of our own
history.
If Kant had never lived, if Laplace had never announced his
Nebular Theory, if the discoveries of Sir William Herschel had not
been made, I still venture to think that a due consideration of the
remarkable photograph of the famous Great Spiral, which was
obtained at the famous Lick Observatory in California, would have
suggested the high probability of that doctrine which we describe as
the Nebular Theory.
Fig. 28.—The Great Spiral Nebula (Lick Observatory).
(From the Royal Astronomical Series.)

If an artist thoroughly versed in the great facts of astronomy had
been commissioned to represent the nebular origin of our system as
perfectly as a highly cultivated yet disciplined imagination would
permit, I do not think he could have designed anything which could
answer the purpose more perfectly than does that picture which is
now before us. We might wish indeed that Kant and Laplace and
Herschel could have lived to see this marvellous natural illustration
of their views, for photographs were of course unthought of in those
days, and, I need hardly say, for any one celestial nebula that
could have been known in the times of Laplace, hundreds are now
within the reach of astronomers.
We entreat special attention to this picture which Nature has
herself given us, and which represents what we may not
unreasonably conclude to be a system in a state of formation. Let
me say at once that our solar system, however imposing it may be
from our point of view, is but of infinitesimal importance as
compared with the system which is here in the course of
development. It is sometimes urged that it is difficult to imagine how
a system so large as ours could have been produced by
condensation from a primæval nebula. The best answer is found in
the fact that the Great Spiral now before us may be considered to
exhibit at this very moment a system in actual evolution, the central
body of which is certainly thousands of times, and not improbably
millions of times, greater than the sun, and of which the attending
planets or other revolving bodies are framed on a scale immensely
transcending that of even Jupiter himself. The details of this
remarkable nebula seem to illustrate those particular features which
had been previously assigned to the primæval nebula of our system,
long before any photograph was available for the purpose of their
study.
In the Great Nebula in Orion, to which we have already referred,
as well as in many other similar objects which we might also have
adduced, the nebulous material from which, after long ages, new
systems may result was shown in an extremely chaotic state.
It was little more than an irregular stain of light on the sky. But in
the picture of the Great Spiral which is before us (Fig. 28) it is
manifest that the evolution of the system has reached an advanced
stage; such considerable progress has been made in the actual
formation that the final form seems to be shadowed forth. The
luminosity is no longer diffused in a chaotic condition; it has formed
into spirals, and become much condensed at the centre and
somewhat condensed in other regions. As we now see it, the object
seems to represent a system much more advanced in its formation
than any of the other great nebulæ with which we have compared it.
In comparison with it the evolution of such an object as the Great
Nebula in Orion can hardly be said to have begun. But in the Great
Spiral many portions of the nebula have already become outlined
into masses which, though still far from resembling the planets in
the solar system, have at least made some approach thereto, while
the central portions are being drawn together, just as we may
conceive the great primæval fire-mist to have drawn together in the
actual formation of the sun.
The famous nebula which we are discussing, and which is
generally known as the Great Spiral, is found in the constellation of
Canes Venatici, very near the end star in the tail of the Great Bear,
and one-fourth of the way from it to Cor Caroli. It will be easy to find
it from the indication given in the adjoining Fig. 29. As a nebulous
spot it is an object which can be seen with any moderately good
telescope, but to detect those details which indicate the spiral
structure demands an instrument of first-class power. This object
had indeed been studied by many astronomers before Lord Rosse
turned his colossal reflector upon it. Then it was that the wonderful
whirlpool structure was first discovered, and thus the earliest spiral
nebula became known.
Fig. 29.—How to Find the Great Spiral Nebula.

In those days there were few telescopes of great power, and none
of those instruments appeared able to deal with this nebula
sufficiently to reveal its spiral character. The announcement of the
discovery of the spiral constitution of this object was therefore
received with incredulity by some astronomers, who believed, or
professed to believe, that the spiral lines of nebulous matter which
Lord Rosse described so faithfully, existed only in the imagination of
the astronomer. Indeed, in one notable instance, it was alleged that
these features were to be attributed to actual imperfections in the
unrivalled telescope. The incredulity widely prevalent in the middle
of the last century about the existence of the spiral nebulæ may be
paralleled by the incredulity about other discoveries in more recent
years. When a highly skilled observer, using an instrument of
adequate power, and, it may be, enjoying unequalled opportunities
for good work, testifies to certain discoveries; when he has
employed in the verification of his observations the skill and
experience that years of practice have procured for him, it is futile
for those who have not the like opportunities, either from the want
of instruments of adequate power or from climatic difficulties, to
deny the truth of discoveries because they are not able to verify
them. It was absurd for astronomers to refuse assent to the great
discoveries of Lord Rosse simply because instruments inferior to his
would not show the spiral structure.
In due time, one astronomer after another began to admit that
possibly the remarkable form which Lord Rosse announced as
characteristic of some nebulæ might not be a mere figment of the
imagination. The complete vindication of Lord Rosse’s great
discovery was not, however, attained until that wonderful advance in
the arts of astronomy when the photographic plate was called in to
supplement, or rather vastly to extend, the powers of the eye. Dr.
Isaac Roberts not only showed by a magnificent photograph that the
Great Spiral discovered by Lord Rosse was just as Lord Rosse had
described it, he not only showed that the other spirals announced by
Lord Rosse were equally entitled to the name, but, with the newly
acquired powers that the photographic plate placed at his disposal,
he was able to show that many other nebulæ, which had been
frequently observed and had even been sketched, possessed further
features too faint and delicate to be seen by any human eye, even
with the help of the most powerful telescope. These further features
were discovered because they came within the ken of the intensely
acute perception of the photographic plate. On the plate these
features which the camera showed were added to those which the
eye had already perceived, and when these additions were made it
was not infrequently found that the nebula assumed the form of a
spiral. But the most remarkable circumstance has still to be added.
Some of the plates exposed by Dr. Roberts show clear and
unmistakable photographs of spiral nebulæ as exquisite in detail as
the Great Spiral itself, but yet so faint that they have never been
seen by the eye in any telescope whatever, though they could not
elude the photographic plate. Thus, Dr. Roberts not only confirmed
in the most splendid manner that really great discovery of the spiral
nebulæ of which the honour belongs to Lord Rosse, but the eminent
photographic astronomer added many other spirals of the greatest
interest to the list of those objects which Lord Rosse had himself
given.
Though these discoveries placed the fact of the existence of spiral
nebulæ in an impregnable position, and though they greatly
increased the interest with which astronomers study such objects,
yet another step had to be taken before the spiral nebula attained
the position of extraordinary importance as a celestial object which
must now be acknowledged to be its due.
Fig. 30.—A Group of Nebulæ (Lord Rosse).
(3440, 3445 in N.G.C.)
(From the Scientific Transactions of the Royal Dublin Society.)

We have already had occasion (page 67) to mention the
marvellous discoveries of nebulæ which the lamented Professor
Keeler made with the Crossley Reflector at the Lick Observatory. We
have explained that his discoveries have shown the number of
nebulæ in the heavens to be probably at least twenty times that
which previous observations would have authorised us in asserting.
