Taylor Arnold
Lauren Tilton
Humanities
Data in R
Exploring Networks, Geospatial Data,
Images, and Text
Second Edition
Quantitative Methods in the Humanities
and Social Sciences
Series Editors
Thomas DeFanti, Calit2, University of California San Diego, La Jolla, CA, USA
Anthony Grafton, Princeton University, Princeton, NJ, USA
Thomas E. Levy, Calit2, University of California San Diego, La Jolla, CA, USA
Lev Manovich, The Graduate Center, CUNY, New York, NY, USA
Alyn Rockwood, KAUST, Boulder, CO, USA
Quantitative Methods in the Humanities and Social Sciences is a book series
designed to foster research-based conversation with all parts of the university
campus – from buildings of ivy-covered stone to technologically savvy walls
of glass. Scholarship from international researchers and the esteemed editorial
board represents the far-reaching applications of computational analysis, statistical
models, computer-based programs, and other quantitative methods. Methods are
integrated in a dialogue that is sensitive to the broader context of humanistic study
and social science research. Scholars, including among others historians, archaeologists,
new media specialists, classicists and linguists, promote this interdisciplinary
approach. These texts teach new methodological approaches for contemporary
research. Each volume exposes readers to a particular research method. Researchers
and students then benefit from exposure to subtleties of the larger project or corpus
of work in which the quantitative methods come to fruition.
Editorial Board:
Thomas DeFanti, University of California, San Diego & University of Illinois at
Chicago
Anthony Grafton, Princeton University
Thomas E. Levy, University of California, San Diego
Lev Manovich, The Graduate Center, CUNY
Alyn Rockwood, King Abdullah University of Science and Technology
Publishing Editor for the series at Springer: Faith Su, faith.su@springer.com
Taylor Arnold • Lauren Tilton
Humanities Data in R
Exploring Networks, Geospatial Data,
Images, and Text
Second Edition
Taylor Arnold, University of Richmond, Richmond, VA, USA
Lauren Tilton, University of Richmond, Richmond, VA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2015, 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
The first edition of this book, published in 2015, was written as digital humanities was
fully entering the lexicon of the academy. Debates over ideas such as computation,
digital, and data ensued. Questions such as what does it mean to think of sources
as data, or “humanities data,” were posed by Miriam Posner [75], while Jessica
Marie Johnson brought the longer history of quantification to ask pressing questions
about the process and effect of continuing to turn people into data [47]. Amid
these questions and debates, cultural institutions such as the Library of Congress
made an incredible commitment to digitization and open data, making sources once
only accessible in person available in digital formats that were now amenable to
computational methods. What could be possible with all these sources of data?
We set out to demonstrate how methods from text, spatial, and image analyses
could animate humanities fields by rethinking of our sources as data and using
programming, specifically the language R. This was a rather radical move at the
time, when humanities fields were particularly resistant to the idea of thinking of
materials such as books, photographs, and TV as the subject of analysis through
counting and probabilities, much less algorithms and modeling. The field of digital
humanities was pushing against this impulse, particularly led by scholars in digital
history and what we now call computational literary studies. For those interested in
learning how to bring them together, they were still often on their own. For many,
programming and humanities inquiry still seemed like a contradiction.
Yet, as graduate students, one in American Studies (Lauren) and the other in
Statistics (Taylor), bringing together humanities data such as historical photographs
with computational methods such as mapping seemed incredibly powerful. Our
work building photogrammar.org, and the project's positive reception, demonstrated
the possibilities of layering mapping, text analysis, and image analysis to
further the study of visual culture. Computational methods did not replace all
the training of humanities fields, but rather fit with the experimentation, transdisciplinarity,
and creativity that American Studies articulated as central to its
project. At the same time, fields such as Statistics were continuing their emphasis
on mathematical theory, often disconnected from many of the realities of working
with actual data and the methodological problems that the messiness of human data
poses.
The second edition is a significant revision, with almost every aspect of the text
rewritten in some way. The biggest difference is the incorporation of the set
of R packages commonly known as the tidyverse, consisting at its core of the
packages ggplot2 and dplyr. These packages have grown significantly in stability
and popularity over the past decade. They allow the kinds of functionality that we
wanted to highlight in the first version of the book, but do so with less code while
being backed by theoretical models of how data processing should work. These
features make them perfect elements to use for an introduction to R for working
with humanities data.
As before, Part I introduces the R programming language and key concepts for
working with data. Exploratory data analysis (EDA) remains a key concept and
philosophy. EDA is an approach for analyzing and summarizing data to identify
patterns (and outliers). It is also a way of knowing that is amenable to the kinds
of questions and heuristics that animate how humanistic fields approach studying
the human experience. Based on years of teaching, we have come to realize how
important understanding data collection is to data analysis yet how few resources
there are, so we have added Chap. 5: Collecting Data and Chap. 12: Data Formats to
address perhaps the most time-consuming part, collecting and organizing data.
Part II of the text is still organized around data types. We have decided to reorder
the chapters because of our approach to data. In this edition, we wanted to show how
one can layer types of analysis using the same data set. Rather than each chapter
introducing a new data set, we build our analysis of Wikipedia data from Chaps. 6
to 8 as we move from text to networks to temporal data. Chapter 8: Temporal
Data is a new chapter given the importance of time information, particularly if
we want to study change over time. Chapter 9: Spatial Data returns to the data
that was used in Part I to show how we can layer the information with additional
data. Chapter 10: Image Data introduces a new data set of 1940s photographs to
apply computer vision. While we are always hesitant of hype about technological
change, particularly given all the current (generative) AI boosterism, a significant
methodological shift in the last 10 years is the advances in computer vision,
particularly the ascent of deep learning. We now focus on several of the most popular
tasks such as object detection, and how we can also layer them with additional
methods such as networks. The reorganization, additional chapters, and new data
sets are a part of trying to demonstrate how layering methods can add context and
nuance to our analysis.
Humanities Data
We now return to the term “humanities data.” For us, this means any data that is
engaged with analyzing any aspect of human societies and cultures. This is bigger
than any disciplinary or institutional formation. When we are working with the
messiness of human creativity and meaning, we are engaged in a challenging task,
particularly when we want to understand peoples’ beliefs, values, and behaviors,
whether today or in the past. This is inherently a transdisciplinary project that
traverses any walls that we try to build through academic journals, departments,
scholarly associations, and the university itself. Working with humanities data
happens in industry and beyond. Working with this data carefully, ethically, and
precisely takes collaboration. The book is designed to provide the groundwork for
those who seek to engage with and analyze the data that documents, shapes, and
communicates who we are, where we have been, and the worlds we are building.
No book can do everything, and our orientation is centered around the United
States. The goal of this book is to walk readers through the methods and provide
the code that will give one the resources and confidence to computationally explore
humanities data. Data and methods such as image analysis are the subject of tens of
thousands of articles and books. At the end of each chapter and through our citations,
we offer further reading to start connecting with the wide range of scholarship on
each of these chapters. We also do not go directly into all the debates over the
epistemology and ontology of data and statistics itself; we find a great place to start
is with Lisa Gitelman’s “Raw Data” is an Oxymoron [36] and Chris Wiggins and
Matthew L. Jones’s How Data Happened: A History from the Age of Reason to
the Age of Algorithm [104]. Along with work by danah boyd, Kate Crawford, Safiya
Noble, and Meredith Broussard, we find Catherine d'Ignazio and Lauren Klein's
Data Feminism also to be a great place to start when it comes to data ethics and
justice [30].
Zooming out, there is significant domain-specific scholarship to draw on to
see the power of humanities data analysis. There are series and journals such as
Current Research in Digital History, Debates in the Digital Humanities, Digital
Scholarship in the Humanities, Journal of Cultural Analytics, Journal of Open
Source Software, and the new journal Computational Humanities Research along
with digital humanities special issues in journals like American Quarterly, Cinema
Journal, and Digital Humanities Quarterly. There are books like Ted Underwood's
Distant Horizons [87], Andrew Piper's Enumerations [73], and our own Distant
Viewing [7] that offer theories for computational methods. As well, there are
domain-specific works such as Cameron Blevins’ Paper Trails: The US Post and
the Making of the American West [16] and Lincoln Mullen’s America’s Public Bible
[63] that show how computational methods provide key evidence for scholarship in
religious studies, US history, and rhetorical studies. We offer the work above as a
starting point for the rich conversations and debates around humanities data.
Supplementary Materials
We make extensive use of example datasets through this text. Particular care was
taken to use data in the public domain, or otherwise freely and openly accessible.
Whenever possible, subsets of larger archives were used instead of smaller one-
off datasets. This approach has the dual benefit that these larger sets are often of
independent interest, as well as providing an easy source of additional data for
use in course projects, lectures, and further study. These datasets are available (or
linked to) through the book's website, https://humanitiesdata.org/.
Acknowledgments
For the first edition, it would not have been possible to write this text without
the collaboration and support offered by our many colleagues, friends, and family.
In particular, we would like to thank those who agreed to read and comment on
the early drafts of this text: Carol Chiodo, Jay Emerson, Alex Gil, Jason Heppler,
Matthew Jockers, Mike Kane, Lev Manovich, Laura Wexler, Jeri Wieringa, and two
anonymous readers.
For the second edition, we are deeply appreciative of the University of Richmond,
which has given us the time and resources to pursue a second edition. We
are grateful to Justin Wigard, who read a complete draft and offered crucial
feedback, and Agnieszka Szymanska, who provided guidance in countless ways.
Working with Rob Nelson and the Digital Scholarship Lab (DSL) has been
incredible; their commitment to bringing together digital humanities and social
justice through award-winning projects like Mapping Inequality continues to inspire.
We are also grateful to our departments—Rhetoric and Communication and Math
and Statistics—along with Dean Jenny Cavanaugh, whose support, generosity, and
deep commitment to the liberal arts is a model for us all. It is a special place where
the University President takes the time to engage with faculty’s scholarship. Thank
you, Kevin Hallock, for your time and leadership. And finally, to the awesome UR
students who took our classes and helped us refine our teaching and shared in the
joys and challenges of working with humanities data.
Part I Core
1 Working with Data in R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Working with R and R Markdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Running R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Functions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Loading Data in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Formatting R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.9 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 EDA I: Grammar of Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Text Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Lines and Bars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Optional Aesthetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Labels and Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7 Conventions for Graphics Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 EDA II: Organizing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Choosing Rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Data and Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Selecting Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Arranging Rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Summarize and Group By . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Geometries for Summaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Mutate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.9 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Part I
Core
Chapter 1
Working with Data in R
1.1 Introduction
In this book, we focus on tools and techniques for exploratory data analysis or EDA.
Initially described in John Tukey’s classic text by the same name, EDA is a general
approach to examining data through visualizations and broad summary statistics
[19, 85]. It prioritizes studying data directly in order to generate hypotheses and
ascertain general trends prior to, and often in lieu of, formal statistical modeling.
The growth in both data volume and complexity has further increased the need
for a careful application of these exploratory techniques. In the intervening 50
years, techniques for EDA have enjoyed great popularity within statistics, computer
science, and many other data-driven fields and professions.
The histories of the R programming language and EDA are deeply entwined.
Concurrent with Tukey’s development of EDA, Rick Becker, John Chambers,
and Allan Wilks of Bell Labs began developing software designed specifically
for statistical computing. By 1980, the “S” language was released for general
distribution outside Bell Labs. It was followed by a popular series of books and
updates, including “New S” and “S-Plus” [10–12, 21]. In the early 1990s, Ross
Ihaka and Robert Gentleman produced a fully open-source implementation of S
called "R." The name was chosen both because R is the letter preceding "S" in the
alphabet and because it is the shared first initial of the authors' names. Their
implementation has become the de
facto tool in the field of statistics and is often cited as being amongst the top 20 used
programming languages in the world. Without the interactive console and flexible
graphics engine of a language such as R, modern data analysis techniques would be
largely intractable. Conversely, without the tools of EDA, R would likely still have
been a welcome simplification to programming in lower-level languages but would
have played a far less pivotal role in the development of applied statistics.
The historical context of these two topics underscores the motivation for studying
both concurrently. In addition, we see this book as contributing to efforts to bring
new communities to learn from and to help shape data analysis by offering other
points of entry.
1.2 Setup
While it is possible to read this book as a conceptual text, we expect that the majority
of readers will eventually want to follow along with the code and examples that are
given throughout the text. The first step in doing so is to obtain a working copy
of R. The Comprehensive R Archive Network, known as CRAN, is the official
home of the R language and supplies download instructions according to a user’s
operating system (i.e., Mac, Windows, Linux): http://cran.r-project.org/.
Other download options exist for advanced users, up to and including a custom
build from the source code. We make no assumptions throughout this text regarding
which operating system or method of obtaining or accessing R readers have chosen.
In the rare cases where differences exist based on these options, they will be
explicitly addressed. While one can work from the terminal, we recommend using
an integrated development environment (IDE) to more easily see the code and data.
A piece of open-source software called the RStudio IDE is highly recommended:
https://posit.co/download/rstudio-desktop/. When installed in conjunction
with the R environment, RStudio provides a convenient way of running R
code and seeing the output in a single window. We will show in the next section
screenshots from running R code in RStudio.
In addition to the R software, walking through the examples in this text requires
access to the datasets we explore. Care has been taken to ensure that these are all in
the public domain so as to make it easy for us to redistribute to readers. The materials
and download instructions can be found at https://humanitiesdata.org/. A
complete copy of the code from the book is also provided to make replicating (and
extending) the results as easy as possible.
1.3 Working with R and R Markdown
The supplemental materials for this book include all the data and code needed to
replicate all of the analyses and visualizations in this book. We include the exact
same code that will be printed in the book. We have used the R Markdown file
format, which has an .Rmd extension, to store this code, with a file corresponding
to each chapter in the text. The R Markdown file format is a great choice for data
analysis because it allows us to mix code and descriptions within the same file [51].
In fact, we even wrote the text of this book in the R Markdown format before
converting it into LaTeX for printing.
The RStudio environment offers a convenient format for viewing and editing R
Markdown files. If we open an R Markdown file in RStudio, we should see a window
similar to the one shown in Fig. 1.2. We made this image on a recent version of
macOS; the specific view may be slightly different on Windows and may change
Fig. 1.2 Default view of an R Markdown file in RStudio shown in a recent version of macOS
slightly depending on the screen size and the version of RStudio being used. On the
left is the actual file itself. Some output and other helpful bits of information are
shown on the right. There is also a Console window, which we generally will not
need. We have minimized it in the graphic, which we often do whenever working
on a smaller screen.
Looking at the R Markdown file, notice that the file has parts that are on a
white background and other parts that are on a gray background. The white parts
correspond to text and the gray parts to code. In order to run the code, and to see
the output, click on the green triangle play button on the upper-right corner of each
block. When we run code to read or create a new dataset, the data will be listed in
the Environment tab in the upper-right-hand side of RStudio. Finally, clicking on
the data will open a spreadsheet version of the data that we can view to understand
the structure of our data and to see all the columns that are available for analysis.
As with any digital file, it is a good idea to make sure to save the notebook
frequently. Keep in mind, however, that only the text and code itself is saved.
The results (plots, tables, and other output) are not automatically stored. While
counterintuitive at first, this is a helpful feature because the code is much smaller
compared to the results. Saving the code helps to keep the file sizes small and tidy.
If we would like to save the results in a way that can be shared with others, we need
to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of
the notebook. After running all the code from scratch, the knit function will produce
an HTML version of our script that we can open in a web browser.
1.4 Running R Code
Now, let’s see some examples of how to run R code. In this book, we will show
snippets of R code and the output rather than a screenshot of the entire RStudio
session. However, each snippet should be thought of as occurring inside one of the
gray boxes in an R Markdown file. In one of its most basic
forms, R can be used as a fancy calculator. We can add 1 and 1 by typing 1+1
into the code chunk of an R Markdown file. Hitting the run button will display the
output (2) below. An example in RStudio is shown in Fig. 1.2. In the book, we will
write this code and output using a black box with the R code written inside of it.
Any output will be shown below, with each line preceded by two hash tags. An
example is given below.
1 + 1
## [1] 2
We will often see numbers in the output surrounded by square brackets, such as the
[1] in the output above. These are a common cause of confusion and worry for
new users of R. These numbers are simply counting the values in the output. In the
example above, the [1] shows that the value 2 is the first value output from our
code.
In addition to just returning a value, running R code can also result in storing
values through the creation of new objects within R. Objects in R are used to store
anything—such as numbers, datasets, functions, or models—that we want to use
again later. Each object has a name associated with it that we can use to access it in
future code. To create an object, we will use the <- (arrow) symbol with the name on
the left-hand side of the arrow and code that produces the object on the right-hand
side. For example, we can create a new object called mynum with a value of 8 by
running the following code.
mynum <- 3 + 5
Notice that the code here did not print any results because the result was saved as
a new object. We can now use our new object mynum exactly the same way that we
would use the number 8. For example, we can add 1 to it to get the number nine:
mynum + 1
## [1] 9
Object names must start with a letter but can also use underscores and periods. We
recommend using only lowercase letters and underscores. That makes it easier to
read the code later on without needing to remember if and where we used capital
letters.
1.5 Functions in R
A function in R is something that takes a set of input values and returns an output
value. Generally, a function will have a format similar to that given in the code here:
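function_name(arg1 = input1, arg2 = input2)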
Where arg1 and arg2 are the names of the inputs to the function (they are fixed)
and input1 and input2 are the values that we will assign to them. The number
of arguments is not always two, however. There may be any number of arguments,
including zero. Also, there may be additional optional arguments that have default
values that can be modified. Let us look at an example function: seq. This function
returns a sequence of numbers. We can give the function two input arguments: the
starting point from and the ending point to.
seq(from = 1, to = 100)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
## [13] 13 14 15 16 17 18 19 20 21 22 23 24
## [25] 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48
## [49] 49 50 51 52 53 54 55 56 57 58 59 60
## [61] 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84
## [85] 85 86 87 88 89 90 91 92 93 94 95 96
## [97] 97 98 99 100
The function returns a sequence of numbers starting from 1 and ending at 100
in increments of 1. Here, we see the benefit of the square brackets in the output;
the [13] at the start of the second line indicates that the second line starts on the
13th value of the output. In addition to specifying arguments by name, we can also
pass arguments by position. When specifying arguments by position, we need to
know and use the default ordering of the arguments. Below is an example of another
equivalent way to write the code to produce a sequence of integers from 1 to 100, this
time without the argument names. (For the sake of saving space, we will sometimes
not display the output of our code, as is the case here.)
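seq(1, 100)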
How did we know the inputs to each function and what they do? In this text, we
will explain the names and usage of the required inputs to new functions as they
are introduced. In order to learn more about all of the possible inputs to a function,
we can look at a function’s documentation. For packages to be on CRAN, they
must include information about each of the inputs to a function and the values that
are returned. In order to see the documentation, we can run a line of code that starts
with a question mark followed by the name of the function, as in the example below.
In RStudio, the information about the function will then show up in the lower-left
corner of the IDE. An example of the page is shown in Fig. 1.3.
?seq
As shown in the documentation page, there is also an optional argument, called by,
that controls the spacing between each of the numbers. By default, the by argument
is equal to 1, but we can change it to spread the points out by different intervals. For
example, below are the half-numbers between 1 and 10.
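seq(from = 1, to = 10, by = 0.5)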
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## [11] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
We will learn how to use numerous functions in the coming chapters, each of which
will help us in exploring and understanding data. In order to do this, we need to first
load our data into R, which we will show in the next section.
1.6 Loading Data in R
In this book, we will be working with data that is stored in a tabular format.
Figure 1.4 shows an example of a tabular dataset consisting of information about
metropolitan regions in the United States supplied by the US Census Bureau.
These regions are called core-based statistical areas or CBSA. In Fig. 1.4, we
have ten rows and five columns. Each row of the dataset represents a particular
metropolitan region. We call each of the rows an observation. The columns in a
tabular dataset represent the measurements that we record for each observation.
These measurements are called variables.
In our example dataset, we have five variables which record the name of the
region, the quadrant of the country that the region exists in, the population of the
region in millions of people, the density given in tens of thousands of people per
square kilometer, and the median age of all people living in the region. More details
are given in the following section.
A larger version of this dataset, with more regions and variables, is included
in the book’s supplemental materials as a comma-separated value (CSV) file. We
will make extensive use of this dataset in the following chapters as a common
example for creating visualizations and performing data manipulation. In order to
read in the dataset, we use the function read_csv from the readr package [100].
In order to make the functions from readr available, we need to run the line of
code: library(tidyverse). As mentioned above, tidyverse will automatically
load several packages at once that we will use throughout this book. In each chapter,
we will assume that this package has already been loaded without including the
explicit library command. All other packages will be loaded once per chapter as
needed.
library(tidyverse)
We call this function with the path to where the file is located relative to where this
script is stored. If we are running the R Markdown notebooks from the supplemental
materials, the data will be called cbsa_acs.csv and will be stored in a folder called
data. The following code will load the CBSA dataset into R, save it as an object
called cbsa, and print out the first several rows. The output dataset is stored as a
type of R object called a tibble.
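# read the CBSA dataset from the supplemental materials and print it
cbsa <- read_csv("data/cbsa_acs.csv")
cbsa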
## # A tibble: 934 x 13
##    name         geoid quad    lon   lat   pop density
##    <chr>        <dbl> <chr> <dbl> <dbl> <dbl>   <dbl>
##  1 New York     35620 NE    -74.1  40.8 20.0    1051.
##  2 Los Angeles  31080 W    -118.   34.2 13.2    1041.
##  3 Chicago      16980 NC    -88.0  41.7  9.61    509.
##  4 Dallas       19100 S     -97.0  32.8  7.54    323.
##  5 Houston      26420 S     -95.4  29.8  7.05    317.
##  6 Washington   47900 S     -77.5  38.8  6.33    364.
##  7 Philadelphia 37980 NE    -75.3  39.9  6.22    506.
##  8 Miami        33100 S     -80.5  26.2  6.11    430.
##  9 Atlanta      12060 S     -84.4  33.7  6.03    263.
## 10 Boston       14460 NE    -71.1  42.6  4.91    518.
## # 924 more rows
## # 6 more variables: age_median <dbl>,
## #   hh_income_median <dbl>, percent_own <dbl>,
## #   rent_1br_median <dbl>, rent_perc_income <dbl>,
## #   division <chr>
Notice that the display shows that there are a total of 934 rows and 13 columns. Or,
with our terms defined above, there are 934 observations and 13 variables. Only the
first ten observations and seven variables are shown in the output. At the bottom, the
names of the additional variables are given. As described above, if we run this code in
RStudio, we can view a full tabular version of the tibble by clicking on the dataset
name in the Environment tab.
The abbreviations in angle brackets below the variable names tell us the types
of data stored in each column. The abbreviation <chr>, which is seen below name,
quad (quadrant), and division, indicates that these columns contain character
data. Character data can consist of any sequence of letters, numbers, spaces, and
punctuation marks. Character variables are often used to represent fixed categories,
such as the quadrant and division of each CBSA region. They can also provide
unique identifiers and descriptions for each row, such as the name of the CBSA
region in our example. Values in a character vector are commonly called strings
throughout R documentation, a convention that we will follow in this text by using
it as a synonym for a character value.
The other abbreviation we see in the tibble from the CBSA data is <dbl>, which
indicates that a column contains numeric data. The abbreviation stands for double,
a historical designation of numeric data indicating how much computer memory is
needed to store a single value. While not seen in this example here, the abbreviation
<int> is used as an alternative abbreviation to indicate that a column contains
integer values (i.e., whole numbers). There are limited practical differences between
doubles and integers when working with R code; we will refer to any variable of
either type as numeric data.
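As a quick illustration, base R's typeof function reports the storage type underlying a value:
typeof(1)   # the literal 1 is stored as a "double"
typeof(1L)  # the L suffix creates an "integer"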
Knowing the types of data for each column is important because, as we will
see throughout the book, they will affect the kinds of visualizations and analysis
that can be applied. The data types in the tibble are automatically determined by
the read_csv function. An optional argument col_types can be set to specify an
alternative, or we can modify data types after the tibble has been created using the
techniques shown in Chap. 3. The character and numeric data types are by far the
most common. Other possible options are explored in Chap. 7 (dates and times),
Chap. 9 (spatial variables), and Chap. 11 (lists and logical values).
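For instance, a minimal sketch of overriding a default with col_types, forcing the geoid column to be read as character data so that any leading zeros are preserved, might look like the following; the exact specification depends on the columns in the file.
cbsa <- read_csv(
  "data/cbsa_acs.csv",
  col_types = cols(geoid = col_character())
)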
1.7 Datasets
Throughout this book, we will use multiple datasets to illustrate different concepts
and show how each approach can be used across multiple application domains. We
draw on data that animates humanities inquiry in areas such as American Studies,
history, literary studies, and visual culture studies. While we will briefly reintroduce
new datasets as they appear, for readers making their way selectively through the
text, we offer a somewhat more detailed description of the main datasets that we
will use in this section.
To introduce the concept of EDA, we will make sustained use of the CBSA
dataset in Chaps. 2–5 to demonstrate new concepts in data visualization and
manipulation. As described above, the data comes from an annual survey conducted
by the US Census Bureau called the American Community Survey (ACS). The
survey consists of data collected from a sample of 3.5 million households in the
United States. Outside of the constitutionally mandated decennial census, this is
the largest survey completed by the Census Bureau. It asks several dozen questions
covering topics such as gender, race, income, housing, education, and transportation.
Aggregated data are released on a regular schedule, with summaries over one-,
three-, and five-year periods. Our data comes from the five-year summary from the
most recently published version (2021) at the time of writing. We selected a small set
of measurements that we felt did not require extensive background knowledge while
capturing variations across the country. As seen in the table above, we have selected
the median age, median household income (USD), the percentage of households
owning their housing, the median rent for a one-bedroom apartment (USD), and the
median household spending on rent.
The American Community Survey aggregates data to a variety of different
geographic regions. Most regions correspond to political boundaries, such as states,
counties, and cities. One particularly interesting set of geographic regions is the core-based
statistical areas, or CBSAs. These regions, of which there are nearly a thousand,
are defined by the US Office of Management and Budget. Regions are defined in
the documentation as “an area containing a large population nucleus and adjacent
communities that have a high degree of integration with that nucleus.” We chose
these regions for our dataset because their social, rather than political, definition
makes them particularly well suited for humanities research questions. Our dataset
includes a short, common name for each CBSA, as well as a unique identifier
(geoid), and several geographic categorizations derived from spatial data provided
by the Census Bureau. All of the code to produce this dataset, using the tidycensus
package within R, is included in the book’s supplementary materials [91].
The core chapters of the book also make use of a dataset illustrating the relative
change in the price of various food items for over 140 years in the United States.
This collection was published by David S. Jacks alongside his article “From
boom to bust: a typology of real commodity prices in the long run” [44]. The data is
organized with one observation per year and variables capturing the relative price of
each of thirteen food commodities. We can read this dataset into R using the same
function that we used for the CBSA dataset, shown below.
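A minimal sketch, assuming the file in the supplemental materials is named food_prices.csv:
food_prices <- read_csv("data/food_prices.csv")  # file name assumed
food_prices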
## # A tibble: 146 x 14
##    year   tea sugar peanuts coffee cocoa wheat   rye
##   <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1  1870  129.  151.    203.   88.1  78.8  88.1  103.
## 2  1871  132.  167.    222.  109.   66.7 118.   105.
## 3  1872  134.  162.    189.  140.   71.6 122.   102.
All of the prices are given on a relative scale where 100 is equal to the price in 1900.
We will use this dataset to show how to build data visualizations that show change
over time. It will also be useful for our study of table pivots in Chap. 5.
Part II turns to data types. The first three application chapters focus on text
analysis, network analysis, and temporal analysis, respectively. While these three
chapters introduce different methods, we will make use of a consistent core dataset
across all three that we have created from Wikipedia. Specifically, we have a
dataset consisting of the text, links, page views, and change histories of a set of
75 Wikipedia pages sampled from a set of British authors. These data are contained
in several different tables, each of which will be introduced as needed. The main
metadata for the set of 75 pages is shown in the data loaded by the following code.
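A sketch of that code, with a hypothetical file name standing in for the metadata table in the supplemental materials:
meta <- read_csv("data/wiki_meta.csv")  # file name is hypothetical
meta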
## # A tibble: 75 x 7
##    doc_id            born  died era   gender link  short
##    <chr>            <dbl> <dbl> <chr> <chr>  <chr> <chr>
##  1 Marie de France   1160  1215 Early female Mari  Mari
##  2 Geoffrey Chaucer  1343  1400 Early male   Geof  Chau
##  3 John Gower        1330  1408 Early male   John  Gower
##  4 William Langland  1332  1386 Early male   Will  Lang
##  5 Margery Kempe     1373  1438 Early female Marg  Kempe
##  6 Thomas Malory     1405  1471 Early male   Thom  Malo
##  7 Thomas More       1478  1535 Sixt  male   Thom  More
##  8 Edmund Spenser    1552  1599 Sixt  male   Edmu  Spen
##  9 Walter Raleigh    1552  1618 Sixt  male   Walt  Rale
## 10 Philip Sidney     1554  1586 Sixt  male   Phil  Sidn
## # 65 more rows
We decided to use Wikipedia data because it is freely available and can be easily
generated in the same format for other collections of pages corresponding to nearly
any topic of interest. Wikipedia is also helpful because it allows us to look
at pages in other languages, which will allow us to demonstrate how to extend our
techniques to texts that are not in English. Finally, we will return to the Wikipedia
data in Chap. 12 to demonstrate how to build a dataset (specifically, this one) by
calling an API from within R using the httr package [95].
Several other datasets will be used throughout the book within a single chapter.
For example, Chap. 9 on spatial data makes use of a dataset showing the location
of French cities and Parisian metro stops as a source in our study of geographic
data. Chapter 10 on image data shows a collection of documentary photographs and
associated metadata in our analysis of images. As these datasets are each used in only
one section of the book, we will describe them in more detail when they appear.
1.9 Extensions
Each chapter in this book contains a short, concluding section of extensions on the
main material. These include references for further study, additional R packages,
and other suggested methods that may be of interest to the study of each specific
type of humanities data.
In this chapter, we will mention a few standard R references that might be useful
to use in parallel or in sequence with our text. The classic introduction to the core R
language is An Introduction to R by William Venables and David Smith [89]. This
is freely available directly on the same CRAN website where the R language itself
is hosted. The content is quite terse to read linearly, but it serves as a great reference
for anyone coming from another programming language who wants to learn how to
do lower-level programming tasks. We briefly cover some of this material in Chap. 12,
but in nowhere near as much detail.
For the higher-level version of R that we are using in the second edition of this
book, the standard reference is Wickham, Çetinkaya-Rundel, and Grolemund’s R
for Data Science [97]. This open-access book roughly follows the same material
covered in the first and third parts of our text. It introduces far more extensions and
often exhaustively explains all of the optional arguments to new functions. It is a
great reference text after learning the basics and can be useful as a primary text when
guided within a classroom environment to provide more motivation and context to
each technique. It does not have any material for modeling textual, network, spatial,
or image data.
When working through the code in this book’s supplemental materials, as
mentioned above, we will need to run code using the R Markdown format. More
information about the format and what can be done with it can be found in R
Markdown: The Definitive Guide [109]. The philosophy behind the format can be
found in the corresponding research focused on reproducible research pipelines
[107, 108]. Recently, Quarto, a new extension of the R Markdown format, has
quickly gained in popularity [74]. It provides an almost backward-compatible
version of R Markdown while extending its functionality, including allowing other
programming languages to be mixed in.
Chapter 2
EDA I: Grammar of Graphics
2.1 Introduction
In the grammar of graphics, we describe a data visualization by mapping columns
of a dataset onto elements of a plot. For example, we specify which
column corresponds to the horizontal axis of the plot and which one corresponds to
the vertical axis of the plot. It is also possible to describe elements such as color,
shape, and size of elements of the plot by associating these quantities with columns
in the data. Finally, we need to provide the geometry that will be used in the plot.
The geometry describes the kinds of objects that are associated with each row of
the data. A common example is the points geometry, which associates a single point
with each observation.
We can show how to use the grammar of graphics by starting with the CBSA
data that we introduced in the previous chapter, where each row is associated with
a particular metropolitan region in the United States. The first plot we will make is
a scatterplot that investigates the relationship between the median price of a one-
bedroom apartment and the population density of the metropolitan region. In the
language of the grammar of graphics, we can start to describe this visualization by
providing the name of the dataset in R (cbsa). Next, we associate the horizontal
axis (called the x aesthetic) with the column in the data named density. The
vertical axis (the y aesthetic) can similarly be associated with the column named
rent_1br_median. We will make a scatterplot, with each point on the plot
describing one of our metropolitan regions, which leads us to use a point geometry.
Our plot will allow us to understand the relationship between city density and rental
prices.
In R, we need to use some special functions to indicate all of this information
and to instruct the program to produce a plot. We start by indicating the name of the
underlying dataset and piping it into a special function called ggplot that indicates
that we want to create a data visualization. The plot itself is created by adding—
literally, with the plus sign—the function geom_point. This function indicates that
we want to add a points geometry to the plot. Inside of the geometry function, we
apply the function aes (short for aesthetics), which indicates that we want to specify
the mappings between components of the plot and column names in our dataset.
Code to write this using the values described in the previous paragraph is given
below. A breakdown of the role of each component is detailed in Fig. 2.1.
cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median))
## # A tibble: 30 x 4
##    name        quad  density rent_1br_median
##    <chr>       <chr>   <dbl>           <dbl>
##  1 New York    NE      1051.            1430
##  2 Los Angeles W       1041.            1468
##  3 Chicago     NC       509.            1060
##  4 Dallas      S        323.            1106
##  5 Houston     S        317.             997
##  6 Washington  S        364.            1601
Fig. 2.1 Diagram of how the elements of the grammar of graphics correspond to elements of the
code and visualization
Fig. 2.2 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey
Running the code above from an R Markdown file opened in RStudio will show the
desired visualization right below the block of code. Within this book, we will show
the results of plots within figures. The plot here is shown in Fig. 2.2. In this plot,
each row of our dataset, a CBSA region, is represented as a point in the plot. The
location of each point is determined by the density and median rent price for a one-
bedroom apartment in the corresponding region. Notice that R has automatically
made several choices for the plot that we did not explicitly indicate in the code, for
example, the range of values on the two axes, the axis labels, the grid lines, and
the marks along the grid. R has also automatically picked the color, size, and shape
of the points. While the defaults work as a good starting point, it is often useful to
modify these values; we will see how to change these aspects of the plot in later
sections of this chapter.
Scatterplots are typically used to understand the relationship between two
numeric values. What does our first plot, shown in Fig. 2.2, tell us about the
relationship between city density and median rent? There is not a clear trend
between these two variables. Rather, the plot of these two economic metrics clusters
the regions into several groups. We see a couple of regions with a very high density
but only moderately large rental prices, one city with unusually high rental prices,
and the rest of the regions fairly uniformly distributed in the lower-left corner of the
plot. Let’s see if we can give some more context to the plot by adding additional
information.
2.2 Text Geometry
cbsa |>
  ggplot() +
  geom_text(aes(
    x = density, y = rent_1br_median, label = name
  ))
The plot generated by the code is shown in Fig. 2.3. We can now see which region
has the highest rents (San Francisco). And, we can identify which regions have the
highest density (New York and Los Angeles). We can also identify regions such as
Detroit that are relatively dense but inexpensive or regions such as Denver that are
not particularly dense but still one of the more expensive regions to rent in. While
we have added only a single additional piece of information to the plot, each of
the labels uniquely identifies each row of the data. This allows anyone familiar with
metropolitan regions in the United States to bring many more characteristics of each
Fig. 2.3 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, short descriptive names of the regions are included
data point to the plot through their own knowledge. For example, while the plot does
not include any information about overall population, anyone who knows the largest
cities in the United States can use the plot to see that the two most dense cities (New
York and Los Angeles) are also the most populous. And, while the plot does not have
information about the location of the regions, if we know the general geography of
the country, it is possible to see that many of the cities that are expensive but not
particularly dense (Portland, Denver, Seattle, and San Diego) are on the West Coast.
These observations point to the power of including labels on a scatterplot.
While the text plot adds additional contextual information compared to the
scatterplot, it does have some shortcomings. Some of the labels for points at the
edges of the plot fall off and become truncated. Labels for points in the lower-left
corner of the plot start to overlap one another and become difficult to read. These
issues will only grow if we increase the number of regions in our dataset. Also, it is
not entirely clear what part of the label corresponds to the density of the cities. Is it
the center of the label, the start of the label, or the end of the label? We could add a
note that the value is the center of the label, but that becomes cumbersome to have
to constantly remember and to remind ourselves and others about.
To start addressing these issues, we can add the points back into the plot with
the labels. We could do this in R by adding the two geometry layers (geom_point
and geom_text) one after the other. This will make it clearer which x-axis value
each region is associated with but at the same time will make the names of the
cities even more difficult to read. To fix the second problem, we will replace the text
geometry with a different geometry called geom_text_repel. It also places labels
on the plot but has special logic that avoids intersecting labels. Instead, labels are
moved away from the data points and connected (when needed) by a line segment.
As with the text geometry, the text repel geometry requires specifying x, y, and
label aesthetics. Below is the code to make both of these modifications.
library(ggrepel)

cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median)) +
  geom_text_repel(aes(
    x = density, y = rent_1br_median, label = name
  ))
The output of the plot with the points and text repelled labels is shown in Fig. 2.4.
Notice that the repel feature has attempted to avoid writing labels that intersect
one another. It has also tried to avoid having the labels intersect the points and avoid
Fig. 2.4 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, short descriptive names of the regions are included but offset from the points to make
the plot easier to read
having the labels get pushed outside of the plot. Since the points indicate the specific
values of the density and median rents, the labels are free to float around as long as it
is clear which label is associated with each point. Some of the labels do still become
a bit busy in the lower left-hand corner; this could be fixed by making the size of
the labels slightly smaller, which we will learn how to do later in the chapter. Once
the number of points becomes larger, it will eventually not be possible to label all
of the points. Several strategies exist for dealing with this, such as only labeling a
subset of the points. We will see these techniques as they arise in our examples. The
ggplot2 package and communities online have an entire ecosystem of strategies
for increasing interpretability and adding context to plots, providing strategies for
using the exploratory and visual power of data visualization to garner insights from
humanities data.
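2.3 Lines and Bars
A line geometry connects the observations in order along the horizontal axis, making it a natural choice for values measured over time. A minimal sketch of the code behind the tea-price plot in Fig. 2.6, assuming the food_prices dataset from Chap. 1, is:
food_prices |>
  ggplot() +
  geom_line(aes(x = year, y = tea))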
Fig. 2.6 Plot of the price of tea in standardized units (100 is the price in 1900) over time
The output of this visualization, shown in Fig. 2.6, allows us to see the change over
time of the tea prices. Notice that the relative price decreased fairly steadily from
1870 through to 1920. It had a few sudden drops and reversals in the 1920s and
1930s, before increasing again in the 1950s. The relative cost of tea then decreased
again fairly steadily from the mid-1950s through to the end of the data range in
2015.
Another common usage of a visualization is to see the value of a numeric column
of the dataset relative to a character column of the dataset. It is possible to represent
such a relationship with a geom_point layer. However, it is often more visually
meaningful to use a bar for each category and the height or length of the bar
representing the numeric value. This type of plot is most common when showing
the counts of different categories, something we will see in the next chapter, but
can also be used in any situation where a numeric value is associated with different
categories. To create a plot with bars, we use the geom_col function, providing both
x and y aesthetics. R will automatically create vertical bars if we have a character
variable associated with the x aesthetic and horizontal bars if we have one in the
y aesthetic. Putting the character variable on the y-axis usually makes it easier to
read the labels, so we recommend it in most cases. In the code block below, we have
the commands to create a bar plot of the population in each region from the CBSA
dataset, which will be shown in Fig. 2.7.
Fig. 2.7 Plot of the population of the largest 30 core-based statistical areas in the United States,
showing their population from the 2021 American Community Survey
cbsa |>
  ggplot() +
  geom_col(aes(x = pop, y = name))
One of the first things that stands out in the output shown in Fig. 2.7 is that the
regions are ordered alphabetically from bottom to top. The visualization would be
much more useful and readable if we could reorder the categories on the y-axis. This
is also something that we will address in the following chapter. For now, we can see
how ggplot2 is offering a range of plot types to see our data from different angles.
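Although a fuller treatment is deferred to the next chapter, a minimal sketch of one common approach, using base R's reorder() to sort the bars by population, could look like this:
cbsa |>
  ggplot() +
  geom_col(aes(x = pop, y = reorder(name, pop)))
Here, reorder() converts name into a factor whose levels follow pop, so the bars are arranged with population increasing from bottom to top.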
We can add additional context through additional aesthetics.
2.4 Optional Aesthetics
In the previous sections, we have shown how visualizations can be built out of
geometry layers, where each geometry is associated with a dataset and a collection
of variable mappings known as aesthetics. The point, line, and bar geometries
require x and y aesthetics; the text and text repel geometries also require an
aesthetic named label. In addition to the required aesthetics, each geometry
type also has a number of optional aesthetics that we can use to add additional
information to the plot. For example, most geometries have a color aesthetic. The
syntax for describing this is exactly the same as with the required aesthetics: we
place the name of the aesthetic followed by the name of the associated variable.
Let's see what happens when we add a color aesthetic to our scatterplot
by relating the column called quad to the aesthetic named color. Below is the
corresponding code; the output is shown in Fig. 2.8.
cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  ))
Fig. 2.8 Plot of the largest 30 core-based statistical areas in the United States, showing their density and the median price to rent a one-bedroom apartment from the 2021 American Community Survey. Here, the points are colored based on the quadrant in which the city is found in the United States
The result of associating a column in the dataset with a color produces a new
variation of the original scatterplot. We have the same set of points and locations on
the plot, as well as the same axes. However, now each color has been automatically
associated with a region and every point has been colored according to the region
column associated with each row of the data. The mapping between colors and categories is shown in an automatically generated legend placed alongside the plot.
It is also possible to set an aesthetic to a fixed value rather than mapping it to a column of the dataset. To do this, we place the value outside of the aes() function. For example, the following code colors every point with the fixed color "olivedrab".
cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median
  ), color = "olivedrab")
While minor, the changed notation for specifying fixed aesthetics is a common source of confusing errors for users new to the grammar of graphics, so be careful to follow the correct syntax of arguments as in the code above. One can interchange the fixed and variable aesthetic terms, and their relative order should not affect the output. Just be sure to put fixed terms after closing the aes() command (Fig. 2.9).
While each geometry can have different required and optional aesthetics, the
ggplot2 package tries as much as possible to use a common set of terms for the
aesthetics in each geometry. We have already seen the x, y, and label aesthetics
in the previous sections and just introduced the color aesthetic. Color can also
be used to change the color of a line plot or the color of the font in a text or text
repel geometry. For applications such as the bar plot, we might want to modify both
the border and interior colors of the bars; these are set separately by the color
and fill aesthetics, respectively. The size aesthetic can be used to set the size
of the points in a scatterplot or the font size of the labels in a text geometry. The
shape aesthetic is used to modify the shape of the points. An aesthetic named
alpha controls the opacity of points, with a value of 1 being the default and 0
being completely invisible. Some of these, such as alpha, are most frequently used
with fixed values, but if needed, almost all can be given a variable mapping as well.
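As a brief illustration, the following sketch maps the size of each point to the pop column while fixing the opacity at a constant value (the specific alpha value is just an example):
cbsa |>
  ggplot() +
  geom_point(
    aes(x = density, y = rent_1br_median, size = pop),
    alpha = 0.6  # a fixed value, so it sits outside aes()
  )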
Fig. 2.9 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. The color of the points has been changed to a dark green called “olivedrab”
2.5 Scales
R makes many choices for us automatically when creating any plot. In our example
above, Fig. 2.8, in which we set the color of the points to follow another variable
in the dataset, R handles the details of how to pick the specific colors and sizes.
It has figured out how large to make the axes, where to add tick marks, and where to
draw grid lines. Letting R deal with these details is convenient because it frees us
up to focus on the data itself. Sometimes, such as when preparing to produce plots
for external distribution, or when the defaults are particularly hard to interpret, it is
useful to manually adjust these details. This is exactly what scales were designed
for.
Each aesthetic within the grammar of graphics is associated with a scale. Scales
detail how a plot should relate aesthetics to the concrete, perceivable features in a
plot. For example, a scale for the x aesthetic will describe the smallest and largest
values on the x-axis. It will also code information about how to label the x-axis.
Similarly, a color scale describes which colors correspond to each category in a
dataset and how to format a legend for the plot. In order to change or modify the
default scales, we add an additional function to the code. The order of the scales
relative to the geometries does not affect the output; by convention, scales are usually
grouped after the geometries.
For example, a popular alternative to the default color palette shown in our
previous plot is the function scale_color_viridis_d(). It constructs a set of
colors that is color-blind friendly, looks nice when printed in black and white, and
displays fine on bad projectors. After specifying that the color of a geometry should
vary with a column in the dataset, we select the viridis color scale by adding the
function as an extra line in the plot. An example is shown in the following code.
cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  )) +
  scale_color_viridis_d()
The output shown in Fig. 2.10 shows that the colors are now given by a range from dark purple to bright yellow in place of the rainbow of colors in the default plot. As with the categories in the bar plot, the ordering of the unique colors is given alphabetically.
Fig. 2.10 Plot of the largest 30 core-based statistical areas in the United States, showing their density and the median price to rent a one-bedroom apartment from the 2021 American Community Survey. Here, the points are colored based on the quadrant in which the city is found in the United States, with a color-blind friendly color scale
The viridis scale also comes in a continuous variant, scale_color_viridis_c(), for use when the color aesthetic is mapped to a numeric column such as pop, as in the following code.
cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = pop
  )) +
  scale_color_viridis_c()
Scales can also adjust the axes themselves. The functions scale_x_continuous() and scale_y_continuous() control features such as the breaks, grid lines, and limits of the x- and y-axes. For example, the following code asks for roughly ten labeled breaks on the x-axis, removes the minor grid lines, and fixes the range of the y-axis to run from 0 to 2000.
cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median)) +
  scale_x_continuous(n.breaks = 10, minor_breaks = NULL) +
  scale_y_continuous(limits = c(0, 2000))
Finally, there are two special scale types that can be useful for working with colors.
In some cases, we may already have a column in our dataset that explicitly describes
the color of an observation; here, it would make sense to use these colors directly. To
do that, we can add the scale scale_color_identity() to the plot. Another type of
scale that can be useful for colors is scale_color_manual(). Here, it is possible to
describe exactly which color should be used for each category. Below is the syntax
for producing manually defined colors for each region in the CBSA dataset.
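A sketch of this syntax, using hypothetical category labels and colors for the quad column (the actual values of quad are not shown in this excerpt, so the names below are assumptions):
cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  )) +
  scale_color_manual(values = c(
    "NE" = "maroon",    # hypothetical category names;
    "NW" = "navy",      # match these to the actual values
    "SE" = "olivedrab", # found in the quad column
    "SW" = "goldenrod"
  ))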