
Quantitative Methods in the Humanities
and Social Sciences

Taylor Arnold
Lauren Tilton

Humanities
Data in R
Exploring Networks, Geospatial Data,
Images, and Text
Second Edition
Quantitative Methods in the Humanities
and Social Sciences

Series Editors
Thomas DeFanti, Calit2, University of California San Diego, La Jolla, CA, USA
Anthony Grafton, Princeton University, Princeton, NJ, USA
Thomas E. Levy, Calit2, University of California San Diego, La Jolla, CA, USA
Lev Manovich, Graduate Center, The Graduate Center, CUNY, New York, NY, USA
Alyn Rockwood, KAUST, Boulder, CO, USA
Quantitative Methods in the Humanities and Social Sciences is a book series
designed to foster research-based conversation with all parts of the university
campus – from buildings of ivy-covered stone to technologically savvy walls
of glass. Scholarship from international researchers and the esteemed editorial
board represents the far-reaching applications of computational analysis, statistical
models, computer-based programs, and other quantitative methods. Methods are
integrated in a dialogue that is sensitive to the broader context of humanistic study
and social science research. Scholars, including among others historians, archaeologists,
new media specialists, classicists and linguists, promote this interdisciplinary
approach. These texts teach new methodological approaches for contemporary
research. Each volume exposes readers to a particular research method. Researchers
and students then benefit from exposure to subtleties of the larger project or corpus
of work in which the quantitative methods come to fruition.

Editorial Board:
Thomas DeFanti, University of California, San Diego & University of Illinois at
Chicago
Anthony Grafton, Princeton University
Thomas E. Levy, University of California, San Diego
Lev Manovich, The Graduate Center, CUNY
Alyn Rockwood, King Abdullah University of Science and Technology
Publishing Editor for the series at Springer: Faith Su, faith.su@springer.com
Taylor Arnold • Lauren Tilton

Humanities Data in R
Exploring Networks, Geospatial Data,
Images, and Text

Second Edition
Taylor Arnold
University of Richmond
Richmond, VA, USA

Lauren Tilton
University of Richmond
Richmond, VA, USA

ISSN 2199-0956
ISSN 2199-0964 (electronic)
Quantitative Methods in the Humanities and Social Sciences
ISBN 978-3-031-62565-7
ISBN 978-3-031-62566-4 (eBook)
https://doi.org/10.1007/978-3-031-62566-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2015, 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

If disposing of this product, please recycle the paper.


Preface

Published in 2015, the first edition of this book was written as digital humanities was
fully entering the lexicon of the academy. Debates over ideas such as computation,
digital, and data ensued. Questions such as what does it mean to think of sources
as data, or “humanities data,” were posed by Miriam Posner [75], while Jessica
Marie Johnson brought the longer history of quantification to ask pressing questions
about the process and effect of continuing to turn people into data [47]. Amid
these questions and debates, cultural institutions such as the Library of Congress
made an incredible commitment to digitization and open data, making sources once
only accessible in person available in digital formats that were now amenable to
computational methods. What could be possible with all these sources of data?
We set out to demonstrate how methods from text, spatial, and image analyses
could animate humanities fields by rethinking our sources as data and using
programming, specifically the language R. This was a rather radical move at the
time, when humanities fields were particularly resistant to the idea of thinking of
materials such as books, photographs, and TV as the subject of analysis through
counting and probabilities, much less algorithms and modeling. The field of digital
humanities was pushing against this impulse, particularly led by scholars in digital
history and what we now call computational literary studies. Those interested in
learning how to bring them together were still often on their own. For many,
programming and humanities inquiry still seemed like a contradiction.
Yet, as graduate students, one in American Studies (Lauren) and the other in
Statistics (Taylor), bringing together humanities data such as historical photographs
with computational methods such as mapping seemed incredibly powerful. Our
work building photogrammar.org, and the project’s positive reception, demonstrated
the possibilities of layering mapping, text analysis, and image analysis to
further the study of visual culture. Computational methods did not replace all
the training of humanities fields, but rather fit with the experimentation,
transdisciplinarity, and creativity that American Studies articulated as central to its
project. At the same time, fields such as Statistics were continuing their emphasis
on mathematical theory, often disconnected from many of the realities of working
with actual data and the methodological problems that the messiness of human data
elicited. An openness to thinking across these boundaries is a significant reason why
this book exists.
Our advisors Laura Wexler and Jay Emerson along with graduate colleague Carol
Chiodo at Yale fundamentally understood what was possible, supporting us when
others questioned these two, perhaps precocious, graduate students. We eagerly
joined exciting projects like the Programming Historian and work by Matthew
Jockers and Lev Manovich, to both of whom we are deeply grateful for their support,
to demonstrate how computational methods could be a part of the methodological
toolkit of the humanities. Rather than being designed for industry or a very technical
audience, Humanities Data in R filled a need for a book designed to introduce
computational methods to audiences who were interested in the sources that served
as primary evidence for understanding the human experience.
Fast forward almost a decade, and a fair amount has changed. We are now
tenured professors at a flourishing small liberal arts college where interdisciplinarity
is celebrated. We teach digital humanities across the Department of Rhetoric
and Communication, Department of Mathematics and Statistics, and programs
in American Studies and Data Science. At the same time, the rapid ascent of
data science over the past 5 years has mostly silenced debates over whether the
humanities should be involved with data and computation. In fact, many of us
are noting how data and computation have never needed the humanities more.
Humanities scholars should be key interlocutors in interpreting the findings of
computational analysis of humanities data as well as have important insights into
the ethical and social impact of computational methods. One goal of this book is
to provide the programming and methodological background to be a part of these
interdisciplinary conversations and debates.
On the computational side, fields such as Statistics are now grappling
with the realities of working with messy data. It was already a decade ago that
Taylor realized that the most complicated data came from sources that animated the
humanities. How does one work with film, for example? The data is multimodal,
defies easy classifications, and breaks computer vision algorithms. To use the
gendered logic that permeates so many discussions of academia, humanities fields
weren’t some soft, squishy area of study that was easier, but rather worked with
the hard, complex sources and data that challenged what was seen as a given in
statistical and computational fields. We co-author because we believe that inter-
and transdisciplinary scholarship is key to the (digital) humanities and data science,
and we have so much to learn from each other. We see this book as a part of that
exchange, and as a resource for anyone who wants to work with humanities data.

Preface to Second Edition

The second edition is a significant revision, with almost every aspect of the text
rewritten in some way. The biggest difference is the incorporation of the set
of R packages commonly known as the tidyverse, consisting at its core of the
packages ggplot2 and dplyr. These packages have grown significantly in stability
and popularity over the past decade. They allow the kinds of functionality that we
wanted to highlight in the first version of the book, but do so with less code while
being backed by theoretical models of how data processing should work. These
features make them perfect elements to use for an introduction to R for working
with humanities data.
As before, Part I introduces the R programming language and key concepts for
working with data. Exploratory data analysis (EDA) remains a key concept and
philosophy. EDA is an approach to analyzing and summarizing data in order to identify
patterns (and outliers). It is also a way of knowing that is amenable to the kinds
of questions and heuristics that animate how humanistic fields approach studying
the human experience. Based on years of teaching, we have come to realize how
important understanding data collection is to data analysis yet how few resources
there are, so we have added Chap. 5: Collecting Data and Chap. 12: Data Formats to
address perhaps the most time-consuming part, collecting and organizing data.
Part II of the text is still organized around data types. We have decided to reorder
the chapters because of our approach to data. In this edition, we wanted to show how
one can layer types of analysis using the same data set. Rather than each chapter
introducing a new data set, we build our analysis of Wikipedia data from Chaps. 6
to 8 as we move from text to networks to temporal data. Chapter 8: Temporal
Data is a new chapter given the importance of time information, particularly if
we want to study change over time. Chapter 9: Spatial Data returns to the data
that was used in Part I to show how we can layer the information with additional
data. Chapter 10: Image Data introduces a new data set of 1940s photographs to
apply computer vision. While we are always wary of hype about technological
change, particularly given all the current (generative) AI boosterism, a significant
methodological shift in the last 10 years is the advances in computer vision,
particularly the ascent of deep learning. We now focus on several of the most popular
tasks such as object detection, and how we can also layer them with additional
methods such as networks. The reorganization, additional chapters, and new data
sets are a part of trying to demonstrate how layering methods can add context and
nuance to our analysis.

Humanities Data

We now return to the term “humanities data.” For us, this means any data that is
engaged with analyzing any aspect of human societies and cultures. This is bigger
than any disciplinary or institutional formation. When we are working with the
messiness of human creativity and meaning, we are engaged in a challenging task,
particularly when we want to understand peoples’ beliefs, values, and behaviors,
whether today or in the past. This is inherently a transdisciplinary project that
traverses any walls that we try to build through academic journals, departments,
scholarly associations, and the university itself. Working with humanities data
happens in industry and beyond. Working with this data carefully, ethically, and
precisely takes collaboration. The book is designed to provide the groundwork for
those who seek to engage with and analyze the data that documents, shapes, and
communicates who we are, where we have been, and the worlds we are building.
No book can do everything, and our orientation is centered around the United
States. The goal of this book is to walk readers through the methods and provide
the code that will give one the resources and confidence to computationally explore
humanities data. Data and methods such as image analysis are the subject of tens of
thousands of articles and books. At the end of each chapter and through our citations,
we offer further reading to start connecting with the wide range of scholarship on
each of these chapters. We also do not go directly into all the debates over the
epistemology and ontology of data and statistics itself; we find a great place to start
is with Lisa Gitelman’s “Raw Data” Is an Oxymoron [36] and Chris Wiggins and
Matthew L. Jones’s How Data Happened: A History from the Age of Reason to
the Age of Algorithms [104]. Along with work by danah boyd, Kate Crawford, Safiya
Noble, and Meredith Broussard, we find Catherine D’Ignazio and Lauren Klein’s
Data Feminism to also be a great place to start when it comes to data ethics and
justice [30].
Zooming out, there is significant domain-specific scholarship to draw on to
see the power of humanities data analysis. There are series and journals such as
Current Research in Digital History, Debates in the Digital Humanities, Digital
Scholarship in the Humanities, Journal of Cultural Analytics, Journal of Open
Source Software, and the new journal Computational Humanities Research along
with digital humanities special issues in journals like American Quarterly, Cinema
Journal, and Digital Humanities Quarterly. There are books like Ted Underwood’s
Distant Horizons [87], Andrew Piper’s Enumerations [73], and our own Distant
Viewing [7] that offer theories for computational methods. As well, there are
domain-specific works such as Cameron Blevins’ Paper Trails: The US Post and
the Making of the American West [16] and Lincoln Mullen’s America’s Public Bible
[63] that show how computational methods provide key evidence for scholarship in
religious studies, US history, and rhetorical studies. We offer the work above as a
starting point for the rich conversations and debates around humanities data.

Supplementary Materials

We make extensive use of example datasets throughout this text. Particular care was
taken to use data in the public domain, or otherwise freely and openly accessible.
Whenever possible, subsets of larger archives were used instead of smaller one-off
datasets. This approach has the dual benefit that these larger sets are often of
independent interest and provide an easy source of additional data for
use in course projects, lectures, and further study. These datasets are available (or
linked to) from the text’s website: http://humanitiesdata.org. Complete code
snippets from the text, further references, and additional links and notes are also
included on that site and will continue to be updated.

Acknowledgments

For the first edition, it would not have been possible to write this text without
the collaboration and support offered by our many colleagues, friends, and family.
In particular, we would like to thank those who agreed to read and comment on
the early drafts of this text: Carol Chiodo, Jay Emerson, Alex Gil, Jason Heppler,
Matthew Jockers, Mike Kane, Lev Manovich, Laura Wexler, Jeri Wieringa, and two
anonymous readers.
For the second edition, we are deeply appreciative of the University of Richmond,
which has given us the time and resources to pursue a second edition. We
are grateful to Justin Wigard, who read a complete draft and offered crucial
feedback, and Agnieska Szymanska, who provided guidance in countless ways.
Working with Rob Nelson and the Digital Scholarship Lab (DSL) has been
incredible; their commitment to bringing together digital humanities and social
justice through award-winning projects like Mapping Inequality continues to inspire.
We are also grateful to our departments—Rhetoric and Communication and Math
and Statistics—along with Dean Jenny Cavanaugh, whose support, generosity, and
deep commitment to the liberal arts are a model for us all. It is a special place where
the University President takes the time to engage with faculty’s scholarship. Thank
you, Kevin Hallock, for your time and leadership. And finally, thank you to the awesome UR
students who took our classes, helped us refine our teaching, and shared in the
joys and challenges of working with humanities data.

Taylor Arnold
Lauren Tilton
Richmond, VA, USA
April 2024
Contents

Part I Core

1 Working with Data in R
1.1 Introduction
1.2 Setup
1.3 Working with R and R Markdown
1.4 Running R Code
1.5 Functions in R
1.6 Loading Data in R
1.7 Datasets
1.8 Formatting R Code
1.9 Extensions

2 EDA I: Grammar of Graphics
2.1 Introduction
2.2 Text Geometry
2.3 Lines and Bars
2.4 Optional Aesthetics
2.5 Scales
2.6 Labels and Themes
2.7 Conventions for Graphics Code
2.8 Extensions

3 EDA II: Organizing Data
3.1 Introduction
3.2 Choosing Rows
3.3 Data and Layers
3.4 Selecting Columns
3.5 Arranging Rows
3.6 Summarize and Group By
3.7 Geometries for Summaries
3.8 Mutate
3.9 Extensions

4 EDA III: Restructuring Data
4.1 Introduction
4.2 Joining by Relation
4.3 Mutating and Filtering Joins
4.4 Pivot Longer
4.5 Pivot Wider
4.6 Patterns for Table Pivots
4.7 Extensions

5 Collecting Data
5.1 Introduction
5.2 Rectangular Data
5.3 Naming Variables
5.4 What Goes in a Cell
5.5 Dates
5.6 Output Format
5.7 Data Dictionary
5.8 Summary of Data Collection Guidelines
5.9 Extensions

Part II Data Types

6 Textual Data
6.1 Introduction
6.2 Working with a Textual Corpus
6.3 Natural Language Processing Pipeline
6.4 Term Frequency-Inverse Document Frequency (TF-IDF)
6.5 Document Distance
6.6 Dimensionality Reduction
6.7 Word Relationships
6.8 Texts in Other Languages
6.9 Extensions

7 Network Data
7.1 Introduction
7.2 Creating a Network Object
7.3 Centrality
7.4 Clusters
7.5 Co-citation Networks
7.6 Directed Networks
7.7 Distance Networks
7.8 Nearest Neighbor Networks
7.9 Extensions

8 Temporal Data
8.1 Introduction
8.2 Temporal Data and Ordering
8.3 Date Objects
8.4 Datetime Objects
8.5 Language and Time Zones
8.6 Manipulating Dates and Datetimes
8.7 Window Functions and Range Joins
8.8 Extensions

9 Spatial Data
9.1 Introduction
9.2 Spatial Points
9.3 Polygons
9.4 Spatial Metrics
9.5 Spatial Joins
9.6 Raster Maps
9.7 Extensions

10 Image Data
10.1 Introduction
10.2 Loading Images
10.3 Pixels and Color
10.4 Computer Vision
10.5 Object Detection
10.6 Face Detection
10.7 Pose Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.8 Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.9 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

Part III Additional Methods


11 Programming in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.3 Data Types and Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
11.4 Selecting and Modifying Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11.5 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
11.6 Control Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
11.7 Functional Programming. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.8 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12 Data Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.2 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
12.3 Regular Expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
12.4 JSON Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
12.5 XML and HTML Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.6 XML Path Language (XPath). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
12.7 Building Datasets Through an API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270


12.8 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Part I
Core
Chapter 1
Working with Data in R

1.1 Introduction

In this book, we focus on tools and techniques for exploratory data analysis or EDA.
Initially described in John Tukey’s classic text by the same name, EDA is a general
approach to examining data through visualizations and broad summary statistics
[19, 85]. It prioritizes studying data directly in order to generate hypotheses and
ascertain general trends prior to, and often in lieu of, formal statistical modeling.
The growth in both data volume and complexity has further increased the need
for a careful application of these exploratory techniques. In the intervening 50
years, techniques for EDA have enjoyed great popularity within statistics, computer
science, and many other data-driven fields and professions.
The histories of the R programming language and EDA are deeply entwined.
Concurrent with Tukey’s development of EDA, Rick Becker, John Chambers,
and Allan Wilks of Bell Labs began developing software designed specifically
for statistical computing. By 1980, the “S” language was released for general
distribution outside Bell Labs. It was followed by a popular series of books and
updates, including “New S” and “S-Plus” [10–12, 21]. In the early 1990s, Ross
Ihaka and Robert Gentleman produced a fully open-source implementation of S
called “R.” The name refers both to the “previous letter in the alphabet” and to
the shared initial of the authors’ first names. Their implementation has become the de
facto tool in the field of statistics and is often cited as being among the 20 most widely used
programming languages in the world. Without the interactive console and flexible
graphics engine of a language such as R, modern data analysis techniques would be
largely intractable. Conversely, without the tools of EDA, R would likely still have
been a welcome simplification to programming in lower-level languages but would
have played a far less pivotal role in the development of applied statistics.
The historical context of these two topics underscores the motivation for studying
both concurrently. In addition, we see this book as contributing to efforts to bring
new communities to learn from and to help shape data analysis by offering other

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024


T. Arnold, L. Tilton, Humanities Data in R, Quantitative Methods in the Humanities
and Social Sciences, https://doi.org/10.1007/978-3-031-62566-4_1

Fig. 1.1 Diagram of the process of exploratory data analysis

fields of study to engage with [4]. It is an attempt to provide an introduction for


students and scholars in the humanities and the humanistic social sciences to both
EDA and R. It also shows how data analysis with humanities data can be a powerful
method for humanistic inquiry. A visual summary of the steps of EDA is shown in
Fig. 1.1. We will see that the core chapters in this text map onto the steps outlined
in the diagram.

1.2 Setup

While it is possible to read this book as a conceptual text, we expect that the majority
of readers will eventually want to follow along with the code and examples that are
given throughout the text. The first step in doing so is to obtain a working copy
of R. The Comprehensive R Archive Network, known as CRAN, is the official
home of the R language and supplies download instructions according to a user’s
operating system (i.e., Mac, Windows, Linux): http://cran.r-project.org/.
Other download options exist for advanced users, up to and including a custom
build from the source code. We make no assumptions throughout this text regarding
which operating system or method of obtaining or accessing R readers have chosen.
In the rare cases where differences exist based on these options, they will be
explicitly addressed. While one can work from the terminal, we recommend using
an integrated development environment (IDE) to more easily see the code and data.
A piece of open-source software called the RStudio IDE is highly recommended:
https://posit.co/download/rstudio-desktop/. When installed in conjunction
with the R environment, RStudio provides a convenient way of running R
code and seeing the output in a single window. In the next section, we show
screenshots of R code running in RStudio.
In addition to the R software, walking through the examples in this text requires
access to the datasets we explore. Care has been taken to ensure that these are all in
the public domain so as to make it easy for us to redistribute to readers. The materials
and download instructions can be found at https://humanitiesdata.org/. A
complete copy of the code from the book is also provided to make replicating (and
extending) the results as easy as possible.
A major selling point of R is its extensive collection of user-contributed
add-ons, called packages. Details of how to install packages are included in the
supplemental materials. Specifically, the supplemental materials have a document
called setup.Rmd. Opening this in RStudio provides instructions for installing
all the packages that are needed throughout this book. Like R itself, all the
packages used here are free and open-source software, thanks to a robust community
dedicated to developing and expanding R.
As mentioned in the preface, we make heavy use in this text of a set of R packages
known as the tidyverse. These include ggplot2, readr, dplyr, and tidyr. The meta-
package tidyverse can be loaded to automatically load all the other associated R
packages. One of the other packages included in this book is hdir (Humanities Data
in R), which contains a set of wrapper functions specifically created for the text.
This package, like all the others used in this book, is released under an open-source
license and can be reused in other projects.
Learning to program is hard and invariably questions and issues will arise
in the process (even the most experienced users require help with surprisingly
high frequency). As a first source of help, searching a question or error message
online will often pull up one of the many third-party question and answer sites,
such as http://stackoverflow.com/, which are heavily frequented by new and
advanced R users alike. If we cannot find an immediate answer to a question, the
next best step is to find some local, in-person help. While we have done our best with
this static text to explain the concepts for working with R, nothing beats talking to
a real-life person. As a final step, we could post questions directly on third-party
sites. It may take a few days to get a response, but usually someone helpful from
the R community will answer. We invite everyone to participate in the community
by being active on forums, contributing packages, and supporting colleagues and
friends. There are also great groups like R-Ladies (rladies.org) and regional
groups that can provide further connections (see: r-community.org).

1.3 Working with R and R Markdown

The supplemental materials for this book include all the data and code needed to
replicate all of the analyses and visualizations in this book. We include the exact
same code that will be printed in the book. We have used the R Markdown file
format, which has an .Rmd extension, to store this code, with a file corresponding
to each chapter in the text. The R Markdown file format is a great choice for data
analysis because it allows us to mix code and descriptions within the same file [51].
In fact, we even wrote the text of this book in the R Markdown format before
converting it into LaTeX for printing.
The RStudio environment offers a convenient format for viewing and editing R
Markdown files. If we open an R Markdown file in RStudio, we should see a window
similar to the one shown in Fig. 1.2. We made this image on a recent version of
macOS; the specific view may be slightly different on Windows and may change

Fig. 1.2 Default view of an R Markdown file in RStudio shown in a recent version of macOS

slightly depending on the screen size and the version of RStudio being used. On the
left is the actual file itself. Some output and other helpful bits of information are
shown on the right. There is also a Console window, which we generally will not
need. We have minimized it in the graphic, which we often do whenever working
on a smaller screen.
Looking at the R Markdown file, notice that the file has parts that are on a
white background and other parts that are on a gray background. The white parts
correspond to text and the gray parts to code. In order to run the code, and to see
the output, click on the green triangle play button on the upper-right corner of each
block. When we run code to read or create a new dataset, the data will be listed in
the Environment tab in the upper-right-hand side of RStudio. Finally, clicking on
the data will open a spreadsheet version of the data that we can view to understand
the structure of our data and to see all the columns that are available for analysis.
As with any digital file, it is a good idea to make sure to save the notebook
frequently. Keep in mind, however, that only the text and code are saved.
The results (plots, tables, and other output) are not automatically stored. While
counterintuitive at first, this is a helpful feature because the code is much smaller
compared to the results. Saving the code helps to keep the file sizes small and tidy.
If we would like to save the results in a way that can be shared with others, we need
to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of
the notebook. After running all the code from scratch, the knit function will produce
an HTML version of our script that we can open in a web browser.

1.4 Running R Code

Now, let’s see some examples of how to run R code. In this book, we will show
snippets of R code and the output rather than a screenshot of the entire RStudio
session. Keep in mind, though, that each snippet should be thought of as occurring
inside of one of the gray boxes in an R Markdown file. In one of its most basic
forms, R can be used as a fancy calculator. We can add 1 and 1 by typing 1+1
into the code chunk of an R Markdown file. Hitting the run button will display the
output (2) below. An example in RStudio is shown in Fig. 1.2. In the book, we will
write this code and output using a black box with the R code written inside of it.
Any output will be shown below, with each line preceded by two hash signs. An
example is given below.

1 + 1

## [1] 2

We will often see numbers in the output surrounded by square brackets, such as the
[1] in the output above. These are a common cause of confusion and worry for
new users of R. These numbers are simply counting the values in the output. In the
example above, the [1] shows that the value 2 is the first output from our
code.
In addition to just returning a value, running R code can also result in storing
values through the creation of new objects within R. Objects in R are used to store
anything—such as numbers, datasets, functions, or models—that we want to use
again later. Each object has a name associated with it that we can use to access it in
future code. To create an object, we will use the <- (arrow) symbol with the name on
the left-hand side of the arrow and code that produces the object on the right-hand
side. For example, we can create a new object called mynum with a value of 8 by
running the following code.

mynum <- 3 + 5

Notice that the code here did not print any results because the result was saved as
a new object. We can now use our new object mynum exactly the same way that we
would use the number 8. For example, we can add it to 1 to get the number 9:

mynum + 1

## [1] 9

Object names should start with a letter and can also contain digits, underscores, and periods. We
recommend using only lowercase letters and underscores. That makes it easier to
read the code later on without needing to remember if and where we used capital
letters.
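As a brief sketch of how assignment and naming fit together, the snippet below stores two values and reuses them to create a third object. The object names (page_count, words_per_page, total_words) and the numbers are hypothetical, chosen only to illustrate the lowercase-and-underscore convention.

```r
# Hypothetical values, used only to illustrate assignment and reuse
page_count <- 12
words_per_page <- 250

# Existing objects can be combined to create further objects
total_words <- page_count * words_per_page

# Typing the name of an object on its own prints its value
total_words
```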

1.5 Functions in R

A function in R is something that takes a set of input values and returns an output
value. Generally, a function will have a format similar to that given in the code here:

function_name(arg1 = input1, arg2 = input2)

Where arg1 and arg2 are the names of the inputs to the function (they are fixed)
and input1 and input2 are the values that we will assign to them. The number
of arguments is not always two, however. There may be any number of arguments,
including zero. Also, there may be additional optional arguments that have default
values that can be modified. Let us look at an example function: seq. This function
returns a sequence of numbers. We can give the function two input arguments: the
starting point from and the ending point to.

seq(from = 1, to = 100)

## [1] 1 2 3 4 5 6 7 8 9 10 11 12
## [13] 13 14 15 16 17 18 19 20 21 22 23 24
## [25] 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48
## [49] 49 50 51 52 53 54 55 56 57 58 59 60
## [61] 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84
## [85] 85 86 87 88 89 90 91 92 93 94 95 96
## [97] 97 98 99 100

The function returns a sequence of numbers starting from 1 and ending at 100
in increments of 1. Here, we see the benefit of the square brackets in the output;
the [13] at the start of the second line indicates that the second line starts on the
13th value of the output. In addition to specifying arguments by name, we can also
pass arguments by position. When specifying arguments by position, we need to
know and use the default ordering of the arguments. Below is an example of another
equivalent way to write the code to produce a sequence of integers from 1 to 100, this
time without the argument names. (For the sake of saving space, we will sometimes
not display the output of our code, as is the case here.)

seq(1, 100)
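
One way to check that the two calling styles agree is to store both results and compare them. This is a minimal sketch; the object names by_name and by_position are our own, and identical is a base R function that is not otherwise used in this chapter.

```r
# Arguments passed by name can be given in any order
by_name <- seq(to = 100, from = 1)

# Arguments passed by position must follow the documented order
by_position <- seq(1, 100)

# Returns TRUE when the two results are exactly the same
identical(by_name, by_position)
```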

How did we know the inputs to each function and what they do? In this text, we
will explain the names and usage of the required inputs to new functions as they

Fig. 1.3 Example documentation page for the function “seq”

are introduced. In order to learn more about all of the possible inputs to a function,
we can look at a function’s documentation. For packages to be on CRAN, they
must include information about each of the inputs to a function and the values that
are returned. In order to see the documentation, we can run a line of code that starts
with a question mark followed by the name of the function, as in the example below.
In RStudio, the information about the function will then show up in the Help pane of
the IDE (the lower-right corner in the default layout). An example of the page is shown in Fig. 1.3.

?seq
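
The argument names can also be inspected from code rather than from the help page. The sketch below uses the base R functions formals and args, which are not introduced in the text, and it assumes that the arguments of interest belong to seq.default, the method that seq uses for ordinary numbers.

```r
# List the argument names of the default method for seq
names(formals(seq.default))

# Print the arguments together with their default values
args(seq.default)
```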

As shown in the documentation page, there is also an optional argument, called by,
that controls the spacing between each of the numbers. By default, the by argument
is equal to 1, but we can change it to spread the points out by different intervals. For
example, below are the numbers from 1 to 10 in steps of 0.5.

seq(from = 1, to = 10, by = 0.5)

## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## [11] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
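
As a small illustration of how optional arguments interact, the same points can be requested by count rather than by spacing. This sketch relies on the length.out argument listed on the documentation page for seq; the object names halves and same_points are our own.

```r
# Spacing of 0.5 between 1 and 10
halves <- seq(from = 1, to = 10, by = 0.5)

# The number of values produced
length(halves)

# Ask for the same number of evenly spaced points instead
same_points <- seq(from = 1, to = 10, length.out = 19)

# Compare the two results, allowing for tiny rounding differences
all.equal(halves, same_points)
```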

We will learn how to use numerous functions in the coming chapters, each of which
will help us in exploring and understanding data. In order to do this, we need to first
load our data into R, which we will show in the next section.

1.6 Loading Data in R

In this book, we will be working with data that is stored in a tabular format.
Figure 1.4 shows an example of a tabular dataset consisting of information about
metropolitan regions in the United States supplied by the US Census Bureau.
These regions are called core-based statistical areas or CBSA. In Fig. 1.4, we
have ten rows and five columns. Each row of the dataset represents a particular
metropolitan region. We call each of the rows an observation. The columns in a
tabular dataset represent the measurements that we record for each observation.
These measurements are called variables.
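
To make the vocabulary concrete, here is a minimal sketch that builds a three-row table by hand and counts its observations and variables. It assumes the tibble package (part of the tidyverse) is installed; the object name regions is our own, and the values are taken from the first rows of the CBSA dataset printed later in this chapter.

```r
library(tibble)

# Three observations (rows) described by three variables (columns)
regions <- tibble(
  name = c("New York", "Los Angeles", "Chicago"),
  quad = c("NE", "W", "NC"),
  pop = c(20.0, 13.2, 9.61)
)

nrow(regions)  # number of observations
ncol(regions)  # number of variables
```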

Fig. 1.4 Example of a tabular dataset



In our example dataset, we have five variables which record the name of the
region, the quadrant of the country that the region exists in, the population of the
region in millions of people, the density given in tens of thousands of people per
square kilometer, and the median age of all people living in the region. More details
are given in the following section.
A larger version of this dataset, with more regions and variables, is included
in the book’s supplemental materials as a comma-separated value (CSV) file. We
will make extensive use of this dataset in the following chapters as a common
example for creating visualizations and performing data manipulation. In order to
read in the dataset, we use the function read_csv from the readr package [100].
In order to make the functions from readr available, we need to run the line of
code: library(tidyverse). As mentioned above, tidyverse will automatically
load several packages at once that we will use throughout this book. In each chapter,
we will assume that this package has already been loaded without including the
explicit library command. All other packages will be loaded once per chapter as
needed.

library(tidyverse)

We call the read_csv function with the path to where the file is located relative to where this
script is stored. If we are running the R Markdown notebooks from the supplemental
materials, the data will be called acs_cbsa.csv and will be stored in a folder called
data. The following code will load the CBSA dataset into R, save it as an object
called cbsa, and print out the first several rows. The output dataset is stored as a
type of R object called a tibble.

cbsa <- read_csv(file.path("data", "acs_cbsa.csv"))


cbsa

## # A tibble: 934 x 13
##    name         geoid quad     lon   lat   pop density
##    <chr>        <dbl> <chr>  <dbl> <dbl> <dbl>   <dbl>
##  1 New York     35620 NE     -74.1  40.8 20.0    1051.
##  2 Los Angeles  31080 W     -118.   34.2 13.2    1041.
##  3 Chicago      16980 NC     -88.0  41.7  9.61    509.
##  4 Dallas       19100 S      -97.0  32.8  7.54    323.
##  5 Houston      26420 S      -95.4  29.8  7.05    317.
##  6 Washington   47900 S      -77.5  38.8  6.33    364.
##  7 Philadelphia 37980 NE     -75.3  39.9  6.22    506.
##  8 Miami        33100 S      -80.5  26.2  6.11    430.
##  9 Atlanta      12060 S      -84.4  33.7  6.03    263.
## 10 Boston       14460 NE     -71.1  42.6  4.91    518.
## # 924 more rows
## # 6 more variables: age_median <dbl>,
## #   hh_income_median <dbl>, percent_own <dbl>,
## #   rent_1br_median <dbl>, rent_perc_income <dbl>,
## #   division <chr>

Notice that the display shows that there are a total of 934 rows and 13 columns. Or,
with our terms defined above, there are 934 observations and 13 variables. Only the
first ten observations and seven variables are shown in the output. At the bottom, the
names of the additional variables are given. As described above, if we run this in
RStudio, we can view a full tabular version of the tibble by clicking on the dataset
name in the Environment tab.
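The same workflow can be tried without the supplemental materials. The sketch below writes a tiny CSV file to a temporary location and reads it back with read_csv. The file contents and the object names tmp and small are hypothetical stand-ins, and show_col_types = FALSE is an optional readr argument (available in recent versions of the package) that suppresses the message about guessed column types.

```r
library(readr)

# Write a small, hypothetical CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,quad,pop", "New York,NE,20.0", "Chicago,NC,9.61"), tmp)

# Read the file back in as a tibble
small <- read_csv(tmp, show_col_types = FALSE)

dim(small)    # number of rows, then number of columns
names(small)  # the variable names
```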
The abbreviations in angle brackets below the variable names tell us the types
of data stored in each column. The abbreviation <chr>, which is seen below name,
quad (quadrant), and division, indicates that these columns contain character
data. Character data can consist of any sequence of letters, numbers, spaces, and
punctuation marks. Character variables are often used to represent fixed categories,
such as the quadrant and division of each CBSA region. They can also provide
unique identifiers and descriptions for each row, such as the name of the CBSA
region in our example. Values in a character vector are commonly called strings
throughout R documentation, a convention that we will follow in this text by using
it as a synonym for a character value.
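A short sketch of strings outside of a table may help here. The object names region_name and quads are our own, and the values are taken from the CBSA output above; class, nchar, and is.character are base R functions.

```r
# A single string
region_name <- "New York"
class(region_name)

# Count the characters in the string, including the space
nchar(region_name)

# Several strings stored together as a character vector
quads <- c("NE", "W", "NC", "S")
is.character(quads)
```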
The other abbreviation we see in the tibble from the CBSA data is <dbl>, which
indicates that a column contains numeric data. The abbreviation stands for double,
a historical designation of numeric data indicating how much computer memory is
needed to store a single value. While not seen in this example here, the abbreviation
<int> is used as an alternative abbreviation to indicate that a column contains
integer values (i.e., whole numbers). There are limited practical differences between
doubles and integers when working with R code; we will refer to any variable of
either type as numeric data.
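The distinction can be seen directly with the base R function typeof. In this sketch, the L suffix, which is not used elsewhere in this chapter, is R's way of writing an integer literal.

```r
typeof(1)   # a plain number is stored as a double
typeof(1L)  # the L suffix requests an integer

# Both kinds count as numeric data
is.numeric(1)
is.numeric(1L)

# Mixing the two in arithmetic gives a double
typeof(1L + 0.5)
```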
Knowing the types of data for each column is important because, as we will
see throughout the book, they will affect the kinds of visualizations and analysis
that can be applied. The data types in the tibble are automatically determined by
the read_csv function. An optional argument col_types can be set to specify an
alternative, or we can modify data types after the tibble has been created using the
techniques shown in Chap. 3. The character and numeric data types are by far the
most common. Other possible options are explored in Chap. 8 (dates and times),
Chap. 9 (spatial variables), and Chap. 11 (lists and logical values).
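As a hedged illustration of the col_types argument, the sketch below reads the same small file twice: once letting read_csv guess the types and once forcing an identifier column to be read as character data. The file contents and the object names tmp, guessed, and typed are hypothetical; cols and col_character are functions from the readr package.

```r
library(readr)

# A small, hypothetical file with a numeric-looking identifier
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,geoid", "New York,35620", "Chicago,16980"), tmp)

# By default, the type of geoid is guessed from the values
guessed <- read_csv(tmp, show_col_types = FALSE)

# col_types overrides the guess for the named column only
typed <- read_csv(tmp, col_types = cols(geoid = col_character()))

class(guessed$geoid)
class(typed$geoid)
```

Storing identifiers as strings is one possible design choice, since they are labels rather than quantities to be added or averaged.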

1.7 Datasets

Throughout this book, we will use multiple datasets to illustrate different concepts
and show how each approach can be used across multiple application domains. We
draw on data that animates humanities inquiry in areas such as American Studies,
history, literary studies, and visual culture studies. We will briefly reintroduce
each dataset as it appears, for readers making their way selectively through the
text; in this section, we offer a somewhat more detailed description of the main
datasets that we will use.
To introduce the concept of EDA, we will make sustained use of the CBSA
dataset in Chaps. 2–5 to demonstrate new concepts in data visualization and
manipulation. As described above, the data comes from an annual survey conducted
by the US Census Bureau called the American Community Survey (ACS). The
survey consists of data collected from a sample of 3.5 million households in the
United States. Outside of the constitutionally mandated decennial census, this is
the largest survey completed by the Census Bureau. It asks several dozen questions
covering topics such as gender, race, income, housing, education, and transportation.
Aggregated data are released on a regular schedule, with summaries over one-,
three-, and five-year periods. Our data comes from the five-year summary from the
most recently published version (2021) at the time of writing. We selected a small set
of measurements that we felt did not require extensive background knowledge while
capturing variations across the country. As seen in the table above, we have selected
the median age, median household income (USD), the percentage of households
owning their housing, the median rent for a one-bedroom apartment (USD), and the
median household spending on rent.
The American Community Survey aggregates data to a variety of different
geographic regions. Most regions correspond to political boundaries, such as states,
counties, and cities. One particularly interesting set of geographic regions is the
core-based statistical areas or CBSA. These regions, of which there are nearly a thousand,
are defined by the US Office of Management and Budget. Regions are defined in
the documentation as “an area containing a large population nucleus and adjacent
communities that have a high degree of integration with that nucleus.” We chose
these regions for our dataset because their social, rather than political, definition
makes them particularly well suited for humanities research questions. Our dataset
includes a short, common name for each CBSA, as well as a unique identifier
(geoid), and several geographic categorizations derived from spatial data provided
by the Census Bureau. All of the code to produce this dataset, using the tidycensus
package within R, is included in the book’s supplementary materials [91].
The core chapters of the book also make use of a dataset illustrating the relative
change in the price of various food items for over 140 years in the United States.
This collection was published as is by David S. Jacks for his publication “From
boom to bust: a typology of real commodity prices in the long run” [44]. The data is
organized with one observation per year and variables capturing the relative price of
each of thirteen food commodities. We can read this dataset into R using the same
function that we used for the CBSA dataset, shown below.

food_prices <- read_csv(file.path("data", "food_prices.csv"))


food_prices

## # A tibble: 146 x 14
##     year   tea sugar peanuts coffee cocoa wheat   rye
##    <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1  1870  129.  151.    203.   88.1  78.8  88.1 103.
##  2  1871  132.  167.    222.  109.   66.7 118.  105.
##  3  1872  134.  162.    189.  140.   71.6 122.  102.
##  4  1873  136.  154.    179.  173.   65.8 116.  106.
##  5  1874  146.  153.    231.  187.   69.9 113.  126.
##  6  1875  149.  150.    197.  176.   69.4 110.  116.
##  7  1876  150.  160.    172.  184.   80.7 114.  106.
##  8  1877  149.  189.    153.  198.   87.8 144.   97.0
##  9  1878  150.  165.    160.  169.   96.0 115.   91.6
## 10  1879  144.  158.    133.  149.  108.  118.  113.
## # 136 more rows
## # 6 more variables: rice <dbl>, corn <dbl>,
## #   barley <dbl>, pork <dbl>, beef <dbl>, lamb <dbl>

All of the prices are given on a relative scale where 100 is equal to the price in 1900.
We will use this dataset to show how to build data visualizations that show change
over time. It will also be useful for our study of table pivots in Chap. 5.
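A relative scale of this kind can be sketched in a few lines: each price is divided by the price in the base year and multiplied by 100. The raw prices below are hypothetical and are not taken from the dataset, and the object names are our own; we do not know the exact procedure used to construct the published series.

```r
# Hypothetical raw prices; these are not values from the dataset
year <- c(1899, 1900, 1901)
price <- c(4.0, 5.0, 5.5)

# Pick out the price in the base year
base_price <- price[year == 1900]

# Rescale so that the base year is equal to 100
relative_price <- price / base_price * 100

relative_price
```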
Part II turns to data types. The first three application chapters focus on text
analysis, network analysis, and temporal analysis, respectively. While these three
chapters introduce different methods, we will make use of a consistent core dataset
across all three that we have created from Wikipedia. Specifically, we have a
dataset consisting of the text, links, page views, and change histories of a set of
75 Wikipedia pages sampled from a set of British authors. These data are contained
in several different tables, each of which will be introduced as needed. The main
metadata for the set of 75 pages is shown in the data loaded by the following code.

meta <- read_csv(file.path("data", "wiki_uk_meta.csv.gz"))


meta

## # A tibble: 75 x 7
##    doc_id            born  died era   gender link  short
##    <chr>            <dbl> <dbl> <chr> <chr>  <chr> <chr>
##  1 Marie de France   1160  1215 Early female Mari  Mari
##  2 Geoffrey Chaucer  1343  1400 Early male   Geof  Chau
##  3 John Gower        1330  1408 Early male   John  Gower
##  4 William Langland  1332  1386 Early male   Will  Lang
##  5 Margery Kempe     1373  1438 Early female Marg  Kempe
##  6 Thomas Malory     1405  1471 Early male   Thom  Malo
##  7 Thomas More       1478  1535 Sixt  male   Thom  More
##  8 Edmund Spenser    1552  1599 Sixt  male   Edmu  Spen
##  9 Walter Raleigh    1552  1618 Sixt  male   Walt  Rale
## 10 Philip Sidney     1554  1586 Sixt  male   Phil  Sidn
## # 65 more rows

We decided to use Wikipedia data because it is freely available and can be easily
generated in the same format for other collections of pages that correspond to nearly
any other topic of interest. Wikipedia is also helpful because it allows us to look
at pages in other languages, which will allow us to demonstrate how to extend our
techniques to texts that are not in English. Finally, we will return to the Wikipedia
data in Chap. 12 to demonstrate how to build a dataset (specifically, this one) by
calling an API from within R using the httr package [95].
Several other datasets will be used throughout the book within a single chapter.
For example, Chap. 9 on spatial data makes use of a dataset showing the location
of French cities and Parisian metro stops as a source in our study of geographic
data. Chapter 10 on image data shows a collection of documentary photographs and
associated metadata in our analysis of images. As these datasets are used only in one
section of the book, we will describe them in more detail when they first appear.

1.8 Formatting R Code

It is very important to properly format R code in a consistent way. Even though
the code may run without errors and produce the desired results, keeping the code
well formatted will make it easier to read and debug. We will use the following
guidelines throughout this book:
1. One space before and after an equals sign or assignment arrow.
2. One space after a comma, but no space before a comma.
3. One space around mathematical operations (such as + and *).
4. If a line of code becomes too long, split the argument to a function into separate
lines, indenting the code two additional spaces.
We have found it makes our life a lot easier if we use these rules right from the start
and whenever we are writing R code.
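The four guidelines above can be seen together in a short sketch. The object names (`values`, `rescaled`, `label`) and the numbers are illustrative assumptions, chosen only to show the spacing and indentation.

```r
# Guideline 1: one space on each side of the assignment arrow.
values <- c(129, 132, 134)

# Guidelines 2 and 3: a space after each comma (none before) and
# spaces around mathematical operations.
rescaled <- values * 2 + 1

# Guideline 4: a long call is split across lines, with the arguments
# indented two additional spaces.
label <- paste(
  "Rescaled values:",
  paste(rescaled, collapse = ", ")
)
```

The code would run identically without the spaces and line breaks; the formatting only affects how easily we can read it later.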

1.9 Extensions

Each chapter in this book contains a short, concluding section of extensions on the
main material. These include references for further study, additional R packages,
and other suggested methods that may be of interest to the study of each specific
type of humanities data.
In this chapter, we will mention a few standard R references that might be useful
to use in parallel or in sequence with our text. The classic introduction to the core R
language is An Introduction to R by William Venables and David Smith [89]. This
is freely available directly on the same CRAN website where the R language itself
is hosted. The content is quite terse to read linearly, but it serves as a great reference
for anyone coming from another programming language who wants to learn how to
do lower-level programming tasks. We briefly cover some of this material in Chap. 12
but not in anywhere near as much detail.
For the higher-level version of R that we are using in the second edition of this
book, the standard reference is Wickham, Çetinkaya-Rundel, and Grolemund’s R
for Data Science [97]. This open-access book roughly follows the same material
covered in the first and third parts of our text. It introduces far more extensions and
often exhaustively explains all of the optional arguments to new functions. It is a
great reference text after learning the basics and can be useful as a primary text when
guided within a classroom environment to provide more motivation and context to
each technique. It does not have any material for modeling textual, network, spatial,
or image data.
When working through the code in this book’s supplemental materials, as
mentioned above, we will need to run code using the R Markdown format. More
information about the format and what can be done with it can be found in R
Markdown: The Definitive Guide [109]. The philosophy behind the format can be
found in the related work on reproducible research pipelines
[107, 108]. Recently, Quarto, a new extension of the R Markdown format, has
quickly gained in popularity [74]. It provides an almost backward-compatible
version of R Markdown while extending the functionality to allow mixing in other
programming languages.
Chapter 2
EDA I: Grammar of Graphics

2.1 Introduction

As we outlined in Chap. 1, the concept of exploratory data analysis (EDA) is key
to our approach. As a result, data visualization is one of the most important tasks
and powerful tools for the analysis of data. We start our study of exploratory data
analysis with visualization because it offers the best immediate payoff for how
statistical programming can help understand datasets of any size. Visualizations also
benefit those new to programming because it is relatively easy to verify
that our code is working. We can just look at the output and see if the resulting plot
is what we expected. Finally, data visualizations can be useful for even very small
collections of data.
In this chapter, we will learn and use the ggplot2 package for building informative
graphics [94, 106]. The package makes it easy to build fairly complex graphics
in a way that is guided by a general theory of data visualization. The only downside
is that, because it is built around a theoretical model rather than many one-off
solutions for different tasks, it has a somewhat steeper initial learning curve. The
chapter is designed to get us started using the package to make a variety of different
data visualizations.
The core idea of the grammar of graphics is that visualizations are composed
of independent layers. The term “grammar” is used to describe visualizations
because the theory builds connections between elements of the dataset to elements
of a visualization. It builds up complex elements from smaller ones, much like a
grammar provides relations between words in order to generate larger phrases and
sentences. To describe a specific layer, we need to specify several elements. First, we
need to specify the dataset from which data will be taken to construct the plot. Next,
we have to specify a set of mappings called aesthetics that describe how elements
of the plot are related to columns in our data. For example, we often indicate which
column corresponds to the horizontal axis of the plot and which one corresponds to
the vertical axis of the plot. It is also possible to describe elements such as color,
shape, and size of elements of the plot by associating these quantities with columns
in the data. Finally, we need to provide the geometry that will be used in the plot.
The geometry describes the kinds of objects that are associated with each row of
the data. A common example is the points geometry, which associates a single point
with each observation.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
T. Arnold, L. Tilton, Humanities Data in R, Quantitative Methods in the Humanities
and Social Sciences, https://doi.org/10.1007/978-3-031-62566-4_2

We can show how to use the grammar of graphics by starting with the CBSA
data that we introduced in the previous chapter, where each row is associated with
a particular metropolitan region in the United States. The first plot we will make is
a scatterplot that investigates the relationship between the median price of a one-
bedroom apartment and the population density of the metropolitan region. In the
language of the grammar of graphics, we can start to describe this visualization by
providing the name of the dataset in R (cbsa). Next, we associate the horizontal
axis (called the x aesthetic) with the column in the data named density. The
vertical axis (the y aesthetic) can similarly be associated with the column named
rent_1br_median. We will make a scatterplot, with each point on the plot
describing one of our metropolitan regions, which leads us to use a point geometry.
Our plot will allow us to understand the relationship between city density and rental
prices.
In R, we need to use some special functions to indicate all of this information
and to instruct the program to produce a plot. We start by indicating the name of the
underlying dataset and piping it into a special function called ggplot that indicates
that we want to create a data visualization. The plot itself is created by adding—
literally, with the plus sign—the function geom_point. This function indicates that
we want to add a points geometry to the plot. Inside of the geometry function, we
apply the function aes (short for aesthetics), which indicates that we want to specify
the mappings between components of the plot and column names in our dataset.
Code to write this using the values described in the previous paragraph is given
below. A breakdown of the role of each component is detailed in Fig. 2.1.

cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median))

select(cbsa, name, quad, density, rent_1br_median)

## # A tibble: 30 x 4
## name quad density rent_1br_median
## <chr> <chr> <dbl> <dbl>
## 1 New York NE 1051. 1430
## 2 Los Angeles W 1041. 1468
## 3 Chicago NC 509. 1060
## 4 Dallas S 323. 1106
## 5 Houston S 317. 997
## 6 Washington S 364. 1601
## 7 Philadelphia NE 506. 1083
## 8 Miami S 430. 1230
## 9 Atlanta S 263. 1181
## 10 Boston NE 518. 1390
## # 20 more rows

Fig. 2.1 Diagram of how the elements of the grammar of graphics correspond to elements of the
code and visualization
Fig. 2.2 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey

Running the code above from an R Markdown file opened in RStudio will show the
desired visualization right below the block of code. Within this book, we will show
the results of plots within figures. The plot here is shown in Fig. 2.2. In this plot,
each row of our dataset, a CBSA region, is represented as a point in the plot. The
location of each point is determined by the density and median rent price for a one-
bedroom apartment in the corresponding region. Notice that R has automatically
made several choices for the plot that we did not explicitly indicate in the code, for
example, the range of values on the two axes, the axis labels, the grid lines, and
the marks along the grid. R has also automatically picked the color, size, and shape
of the points. While the defaults work as a good starting point, it is often useful to
modify these values; we will see how to change these aspects of the plot in later
sections of this chapter.
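One way to see the choices that were made automatically is to inspect the values computed for a layer. The sketch below assumes a small stand-in table, `cbsa_small`, built from the first six rows printed earlier (values as displayed, so densities are rounded); it uses the `layer_data()` function from ggplot2 to return the data behind the first layer, including default columns we never specified.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed above.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p <- cbsa_small |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median))

# One row per point; x and y come from our mappings, while columns such
# as colour, size, and shape were filled in with default values.
built <- layer_data(p, 1)
names(built)
```

Printing `p` draws the plot itself; the `built` table simply makes the defaults visible so that we know what there is to modify.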
Scatterplots are typically used to understand the relationship between two
numeric values. What does our first plot, shown in Fig. 2.2, tell us about the
relationship between city density and median rent? There is not a clear trend
between these two variables. Rather, the plot of these two economic metrics clusters
the regions into several groups. We see a couple of regions with a very high density
but only moderately large rental prices, one city with unusually high rental prices,
and the rest of the regions fairly uniformly distributed in the lower-left corner of the
plot. Let’s see if we can give some more context to the plot by adding additional
information.

2.2 Text Geometry

A common critique of computational methods is that they obscure a closer
understanding of each individual object of study in an attempt to search for
numeric patterns. This is certainly an important caution; computational analysis
of humanities data should always be paired with close analysis. However, it does
not always have to be the case that visualizations reduce complex collections to a
few numerical summaries. This is particularly so when working with a dataset that
has a relatively small number of observations. Looking back at our first scatterplot,
how could we recover a closer analysis of individual cities while also looking for
general patterns between the two economic variables? One option is to add labels
indicating the names of the regions. These names would let anyone looking at the
plot add their own understanding of the individual regions as an additional
layer of information as they interpret the plot.
Adding the names of the regions can be done by using another type of geometry
called a text geometry. This geometry is created with the function geom_text. For
each row of a given dataset, this geometry adds a small textual label. As with the
point geometry, it requires us to specify which columns of our data correspond to the
x and y aesthetics. These values tell the plot where to place the label. Additionally,
the text geometry requires an aesthetic called label that indicates the column of the
dataset that the label should take its text from. In our case, we will use the column
called name to make textual labels on the plot, a reminder that this is a column name
from the data that we loaded into R. The code block below produces a text label
plot by changing the geometry type in the previous example and adding the
additional aesthetic.

cbsa |>
  ggplot() +
  geom_text(aes(
    x = density, y = rent_1br_median, label = name
  ))

The plot generated by the code is shown in Fig. 2.3. We can now see which region
has the highest rents (San Francisco). And, we can identify which regions have the
highest density (New York and Los Angeles). We can also identify regions such as
Detroit that are relatively dense but inexpensive or regions such as Denver that are
not particularly dense but are still among the more expensive regions to rent in. While
we have added only a single additional piece of information to the plot, each
label uniquely identifies a row of the data. This allows anyone familiar with
metropolitan regions in the United States to bring many more characteristics of each

Fig. 2.3 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, short descriptive names of the regions are included

data point to the plot through their own knowledge. For example, while the plot does
not include any information about overall population, anyone who knows the largest
cities in the United States can use the plot to see that the two most dense cities (New
York and Los Angeles) are also the most populous. And, while the plot does not have
information about the location of the regions, if we know the general geography of
the country, it is possible to see that many of the cities that are expensive but not
particularly dense (Portland, Denver, Seattle, and San Diego) are on the West Coast.
These observations point to the power of including labels on a scatterplot.
While the text plot adds additional contextual information compared to the
scatterplot, it does have some shortcomings. Some of the labels for points at the
edges of the plot fall off and become truncated. Labels for points in the lower-left
corner of the plot start to overlap one another and become difficult to read. These
issues will only grow if we increase the number of regions in our dataset. Also, it is
not entirely clear what part of the label corresponds to the density of the cities. Is it
the center of the label, the start of the label, or the end of the label? We could add a
note that the value is the center of the label, but that becomes somewhat cumbersome
to have to constantly remember and remind ourselves and others about.
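As a small aside on the question of which part of a label marks the data value, the text geometry accepts a fixed `hjust` value that controls horizontal justification. The sketch below is one possible adjustment rather than the approach taken in this chapter: it assumes a stand-in table `cbsa_small` built from the first six rows printed earlier and sets `hjust = 0` so that each label starts at its density value instead of being centered on it.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p_left <- cbsa_small |>
  ggplot() +
  geom_text(
    aes(x = density, y = rent_1br_median, label = name),
    hjust = 0  # fixed value: the left edge of each label sits at the x value
  )
```

Left-justified labels make the reference point unambiguous, but labels near the right edge of the plot are then even more likely to be cut off.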
To start addressing these issues, we can add the points back into the plot with
the labels. We could do this in R by adding the two geometry layers (geom_point
and geom_text) one after the other. This will make it clearer where on the x-axis
each region is located but at the same time will make the names of the
cities even more difficult to read. To fix the second problem, we will replace the text
geometry with a different geometry called geom_text_repel. It also places labels
on the plot but has special logic that avoids intersecting labels. Instead, labels are
moved away from the data points and connected (when needed) by a line segment.
As with the text geometry, the text repel geometry requires specifying x, y, and
label aesthetics. Below is the code to make both of these modifications.

library(ggrepel)

cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median)) +
  geom_text_repel(aes(
    x = density, y = rent_1br_median, label = name
  ))

The output of the plot with the points and text repelled labels is shown in Fig. 2.4.
Notice that the repel feature has attempted to avoid writing labels that intersect
one another. It has also tried to avoid having the labels intersect the points and avoid

Fig. 2.4 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, short descriptive names of the regions are included but offset from the points to make
the plot easier to read
having the labels get pushed outside of the plot. Since the points indicate the specific
values of the density and median rents, the labels are free to float around as long as it
is clear which label is associated with each point. Some of the labels do still become
a bit busy in the lower left-hand corner; this could be fixed by making the size of
the labels slightly smaller, which we will learn how to do later in the chapter. Once
the number of points becomes larger, it will eventually not be possible to label all
of the points. Several strategies exist for dealing with this, such as only labeling a
subset of the points. We will see these techniques as they arise in our examples. The
ggplot2 package and communities online have an entire ecosystem of strategies
for increasing interpretability and adding context to plots, providing strategies for
using the exploratory and visual power of data visualization to garner insights from
humanities data.
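Labeling only a subset of the points, with slightly smaller text, might look like the following sketch. It assumes a stand-in table `cbsa_small` built from the first six rows printed earlier, uses the plain text geometry rather than the repel variant so that it depends only on ggplot2, and picks an arbitrary threshold of 1400 for which regions to label. The `data` argument gives the text layer its own, smaller dataset.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p_subset <- cbsa_small |>
  ggplot() +
  # All regions are drawn as points.
  geom_point(aes(x = density, y = rent_1br_median)) +
  # Only regions above the (arbitrary) rent threshold receive a label.
  geom_text(
    aes(x = density, y = rent_1br_median, label = name),
    data = subset(cbsa_small, rent_1br_median > 1400),
    size = 3,   # smaller than the default label size
    vjust = -1  # place the label just above its point
  )
```

Every point still appears on the plot, so the overall pattern is preserved while only the regions of interest are named.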

2.3 Lines and Bars

There are a large number of different geometries supplied by the ggplot2 package, in
addition to the even larger collection of extensions by other R packages. We will
look at two other types of geometries in this section that allow us to investigate
common relationships between pairs of columns of a dataset. Other geometries
will be discussed throughout the book as the need arises, and the full list of
geometries can be found in the ggplot2 package’s documentation. A summary of
all the geometries shown in this chapter is given in Fig. 2.5.
For a moment, we will switch gears and look at the food prices dataset, which was
introduced in the previous chapter. This data contains one row for every year from
1870 through 2015, with relative prices for thirteen different food items across the
United States [44]. Consider a visualization showing the change in the price of tea
over the 146 years in the dataset. We could create a scatterplot where each point is a
row of the data, the x aesthetic captures the year of each record, and the y aesthetic
measures the relative cost of tea. This visualization would be fine and could roughly
help us understand the changes in relative prices for this commodity. A common
visualization type, however, for data of this format is a line plot, where the price in
each year is connected by a line to the price in the subsequent year. To create such
a plot, we can use the geom_line geometry. This is most commonly used when the
horizontal axis measures some unit of time but can represent other quantities that we
expect to continuously and smoothly change between measurements on the x-axis.
The line geometry requires the same aesthetics as the point geometry and can be
created with the same syntax, as shown in the following block of code.

food_prices |>
  ggplot() +
  geom_line(aes(x = year, y = tea))

Fig. 2.5 Examples of common geometries used in the grammar of graphics



Fig. 2.6 Plot of the price of tea in standardized units (100 is the price in 1900) over time

The output of this visualization, shown in Fig. 2.6, allows us to see the change over
time of the tea prices. Notice that the relative price decreased fairly steadily from
1870 through to 1920. It had a few sudden drops and reversals in the 1920s and
1930s, before increasing again in the 1950s. The relative cost of tea then decreased
again fairly steadily from the mid-1950s through to the end of the data range in
2015.
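Because the line and point geometries share the same required aesthetics, they can also be layered in one plot. The sketch below assumes a stand-in table `food_small` holding the years 1870 to 1879 and the first price column of the table printed in the previous chapter; that column's name is not visible in the printed output, so the name `price_index` is a placeholder.

```r
library(ggplot2)

# First ten years and the first price column as printed in Chap. 1;
# the column name price_index is a placeholder.
food_small <- data.frame(
  year = 1870:1879,
  price_index = c(129, 132, 134, 136, 146, 149, 150, 149, 150, 144)
)

p_line <- food_small |>
  ggplot() +
  # The line connects each year to the next one.
  geom_line(aes(x = year, y = price_index)) +
  # The points mark the individual yearly observations.
  geom_point(aes(x = year, y = price_index))
```

Showing both layers makes it clear where the actual measurements are and where the line is only interpolating between them.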
Another common usage of a visualization is to see the value of a numeric column
of the dataset relative to a character column of the dataset. It is possible to represent
such a relationship with a geom_point layer. However, it is often more visually
meaningful to use a bar for each category and the height or length of the bar
representing the numeric value. This type of plot is most common when showing
the counts of different categories, something we will see in the next chapter, but
can also be used in any situation where a numeric value is associated with different
categories. To create a plot with bars, we use the geom_col function, providing both
x and y aesthetics. R will automatically create vertical bars if we have a character
variable associated with the x aesthetic and horizontal bars if we have one in the
y aesthetic. Putting the character variable on the y-axis usually makes it easier to
read the labels, so we recommend it in most cases. In the code block below, we have
the commands to create a bar plot of the population in each region from the CBSA
dataset, which will be shown in Fig. 2.7.
Fig. 2.7 Plot of the population of the largest 30 core-based statistical areas in the United States,
showing their population from the 2021 American Community Survey

cbsa |>
  ggplot() +
  geom_col(aes(x = pop, y = name))

One of the first things that stands out in the output shown in Fig. 2.7 is that the
regions are ordered alphabetically from bottom to top. The visualization would be
much more useful and readable if we could reorder the categories on the y-axis. This
is also something that we will address in the following chapter. For now, we can see
how ggplot2 is offering a range of plot types to see our data from different angles.
We can add additional context through additional aesthetics.
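The way the bar orientation follows the placement of the character variable can be sketched by swapping the two aesthetics. The example assumes a stand-in table `cbsa_small` built from the first six rows printed earlier; since the population column is not shown in that output, the bars here show density instead.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

# Character variable on the y-axis: horizontal bars.
p_horizontal <- cbsa_small |>
  ggplot() +
  geom_col(aes(x = density, y = name))

# Character variable on the x-axis: vertical bars.
p_vertical <- cbsa_small |>
  ggplot() +
  geom_col(aes(x = name, y = density))
```

With only six short names both versions are legible, but as the names get longer or more numerous, the horizontal version tends to remain readable for longer.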

2.4 Optional Aesthetics

In the previous sections, we have shown how visualizations can be built out of
geometry layers, where each geometry is associated with a dataset and a collection
of variable mappings known as aesthetics. The point, line, and bar geometries
require x and y aesthetics; the text and text repel geometries also require an
aesthetic named label. In addition to the required aesthetics, each geometry

Fig. 2.8 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, the points are colored based on the quadrant in which the city is found in the United
States

type also has a number of optional aesthetics that we can use to add additional
information to the plot. For example, most geometries have a color aesthetic. The
syntax for describing this is exactly the same as with the required aesthetics: we
place the name of the aesthetic followed by the name of the associated variable.
Let’s see what happens when we add a color aesthetic to our scatterplot
by relating the column called quad to the aesthetic named color. Below is the
corresponding code; the output is shown in Fig. 2.8.

cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  ))

The result of associating a column in the dataset with a color produces a new
variation of the original scatterplot. We have the same set of points and locations on
the plot, as well as the same axes. However, now each color has been automatically
associated with a region and every point has been colored according to the value of
the quad column in the corresponding row of the data. The mapping between colors and
region names is shown in an automatically created legend on the right-hand side of
the plot. The ability to add additional information to the plot by specifying a single
aesthetic speaks to how powerful the grammar of graphics is in terms of quickly
producing informative visualizations of data. In the first edition of this text, which
used the built-in graphics system in R, it was necessary to write nearly a dozen lines
of code to produce a similar plot. Now that we are able to use the ggplot2 package,
this process has been greatly simplified.
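To check how the color aesthetic assigns one color per category, we can inspect the values computed for the layer. This sketch assumes a stand-in table `cbsa_small` built from the first six rows printed earlier and uses `layer_data()` from ggplot2 to look at the colors actually chosen.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p_color <- cbsa_small |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  ))

# Cross-tabulate each quadrant against the color assigned to its points.
built_color <- layer_data(p_color, 1)
table(cbsa_small$quad, built_color$colour)
```

Each quadrant should line up with exactly one color, which is the same correspondence that the automatically created legend displays.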
In the previous example, we changed the color aesthetic from the fixed default of
black to a color that changes with another variable. It is also possible to specify an
alternative, fixed value for any aesthetic. We can draw on the color names available
in R. For example, we might want to change all of the points to be a shade of green.
This can be done with a small change to the function call. To do this, we set the
color aesthetic to the name of a color, such as “red.” However, unlike with variable
aesthetics, the mapping needs to be done outside of the aes() function but still
within the geom_* function. Below is an example of the code to redo our plot with
a different color; we use a color called “olivedrab,” which in print is much more
aesthetically pleasing than its name might at first suggest.

cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median
  ), color = "olivedrab")

While minor, the changed notation for specifying fixed aesthetics is a common
source of confusing errors for users new to the grammar of graphics, so be careful to
follow the correct syntax of arguments as in the code above. One can interchange the
fixed and variable aesthetic commands, and the relative order should not affect the
output. Just be sure to put fixed terms after finishing the aes() command (Fig. 2.9).
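The following sketch contrasts the correct placement of a fixed color with the common mistake of putting it inside `aes()`. It assumes a stand-in table `cbsa_small` built from the first six rows printed earlier. When the color name is placed inside `aes()`, ggplot2 treats the string as a one-category variable and passes it through the default color scale, so the points are not drawn in the requested color.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

# Correct: the fixed value sits outside aes() but inside the geometry.
p_fixed <- cbsa_small |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median), color = "olivedrab")

# Common mistake: inside aes(), the string is mapped like a data column.
p_mapped <- cbsa_small |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median, color = "olivedrab"))

# Compare the colors that each plot actually uses.
unique(layer_data(p_fixed, 1)$colour)
unique(layer_data(p_mapped, 1)$colour)
```

The mistaken version also produces a legend with a single entry labeled "olivedrab", which is usually the first visible hint that a fixed value ended up inside `aes()`.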
While each geometry can have different required and optional aesthetics, the
ggplot2 package tries as much as possible to use a common set of terms for the
aesthetics in each geometry. We have already seen the x, y, and label aesthetics
in the previous sections and just introduced the color aesthetic. Color can also
be used to change the color of a line plot or the color of the font in a text or text
repel geometry. For applications such as the bar plot, we might want to modify both
the border and interior colors of the bars; these are set separately by the color
and fill aesthetics, respectively. The size aesthetic can be used to set the size
of the points in a scatterplot or the font size of the labels in a text geometry. The
shape aesthetic is used to modify the shape of the points. An aesthetic named
alpha controls the opacity of points, with a value of 1 being the default and 0
being completely invisible. Some of these, such as alpha, are most frequently used
with fixed values, but if needed, almost all can be given a variable mapping as well.
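Several of these aesthetics can be combined in a single layer. The sketch below assumes a stand-in table `cbsa_small` built from the first six rows printed earlier; it maps the shape of each point to the quad column while setting the size and opacity to fixed values, which are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p_multi <- cbsa_small |>
  ggplot() +
  geom_point(
    # Variable aesthetic: one point shape per quadrant.
    aes(x = density, y = rent_1br_median, shape = quad),
    # Fixed aesthetics: larger, half-transparent points.
    size = 4,
    alpha = 0.5
  )
```

As before, the variable mapping goes inside `aes()` and generates a legend, while the fixed values are given as separate arguments to the geometry.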
Fig. 2.9 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. The color of the points has been changed to a dark green called “olivedrab”

2.5 Scales

R makes many choices for us automatically when creating any plot. In our example
above, Fig. 2.8, in which we set the color of the points to follow another variable
in the dataset, R handles the details of how to pick the specific colors and sizes.
It has figured out how large to make the axes, where to add tick marks, and where to
draw grid lines. Letting R deal with these details is convenient because it frees us
up to focus on the data itself. Sometimes, such as when preparing to produce plots
for external distribution, or when the defaults are particularly hard to interpret, it is
useful to manually adjust these details. This is exactly what scales were designed
for.
Each aesthetic within the grammar of graphics is associated with a scale. Scales
detail how a plot should relate aesthetics to the concrete, perceivable features in a
plot. For example, a scale for the x aesthetic will describe the smallest and largest
values on the x-axis. It will also encode information about how to label the x-axis.
Similarly, a color scale describes which colors correspond to each category in a
dataset and how to format a legend for the plot. In order to change or modify the
default scales, we add an additional function to the code. The order of the scales
relative to the geometries does not affect the output; by convention, scales are usually
grouped after the geometries.
For example, a popular alternative to the default color palette shown in our
previous plot is the function scale_color_viridis_d(). It constructs a set of
colors that is color-blind friendly, looks nice when printed in black and white, and
displays fine on bad projectors. After specifying that the color of a geometry should
vary with a column in the dataset, we specify the viridis color scale by adding the
function as an extra line in the plot. An example is shown in the following code.

cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = quad
  )) +
  scale_color_viridis_d()

The output shown in Fig. 2.10 shows that the colors are now given by a range from
dark purple to bright yellow in place of the rainbow of colors in the default plot.
As with the categories in the bar plot, the ordering of the unique colors is given

Fig. 2.10 Plot of the largest 30 core-based statistical areas in the United States, showing their
density and the median price to rent a one-bedroom apartment from the 2021 American Community
Survey. Here, the points are colored based on the quadrant in which the city is found in the United States,
with a color-blind friendly color scale

by putting the categories in alphabetical order. Changing this requires modifying
the dataset before passing it to the plot, something that we will discuss in the next
chapter. Note that the _d at the end of the scale function indicates that the colors are
used to create a set of mappings for a character variable (it stands for “discrete”).
There is also a complementary function scale_color_viridis_c that produces
a similar set of colors when making the color of the points change according to
a numeric variable. The code below demonstrates the continuous case, where the
population is treated as a numeric variable.

cbsa |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, color = pop
  )) +
  scale_color_viridis_c()

Many other scales exist to control a variety of aesthetics. For example,
scale_size_area can be used to make the size of the points proportional to
one of the other columns in a dataset. There are also several scales to control the x
and y axes. For example, we can add scale_x_log10() and scale_y_log10()
to a plot to produce values on a logarithmic scale, which can be very useful when
working with heavily skewed datasets. We will use this in later chapters as needed.
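A sketch combining these scales is given below. It assumes a stand-in table `cbsa_small` built from the first six rows printed earlier and, as an arbitrary choice for illustration, maps the size aesthetic to the median rent column. With the logarithmic scales in place, the positions that ggplot2 computes are the base-10 logarithms of the original values, which we can confirm with `layer_data()`.

```r
library(ggplot2)

# Stand-in for the full cbsa table: the first six rows printed earlier.
cbsa_small <- data.frame(
  name = c("New York", "Los Angeles", "Chicago", "Dallas", "Houston", "Washington"),
  quad = c("NE", "W", "NC", "S", "S", "S"),
  density = c(1051, 1041, 509, 323, 317, 364),
  rent_1br_median = c(1430, 1468, 1060, 1106, 997, 1601)
)

p_scales <- cbsa_small |>
  ggplot() +
  geom_point(aes(
    x = density, y = rent_1br_median, size = rent_1br_median
  )) +
  scale_x_log10() +   # logarithmic x-axis
  scale_y_log10() +   # logarithmic y-axis
  scale_size_area()   # point size driven by the mapped column

# Positions after the scale transformation has been applied.
built_scales <- layer_data(p_scales, 1)
head(built_scales[, c("x", "y", "size")])
```

The axis labels on the plot still show the original units; only the spacing between values changes.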
The default scale for the x-axis is called scale_x_continuous. A corresponding
function scale_y_continuous is the default for the y-axis. Adding these to a
plot on their own has no visible effect. However, there are many helpful optional
arguments that we can provide to these functions that change the way a plot is
displayed. Setting n.breaks within one of these scales tells R the (approximate)
number of labels to put on the axis. Also, making minor_breaks equal to NULL
turns off the minor grid lines. We can set the value limits to a pair of numbers
in order to describe the starting and ending range on a plot. Below is the code to
produce the plot in Fig. 2.11, which shows the same data as our original scatterplot,
but now with modified grid lines, axis labels, and vertical range.

cbsa |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median)) +
  scale_x_continuous(n.breaks = 10, minor_breaks = NULL) +
  scale_y_continuous(limits = c(0, 2000))

Finally, there are two special scale types that can be useful for working with colors.
In some cases, we may already have a column in our dataset that explicitly describes
the color of an observation; here, it would make sense to use these colors directly. To
do that, we can add the scale scale_color_identity to the plot. Another type of
scale that can be useful for colors is scale_color_manual. Here, it is possible to
describe exactly which color should be used for each category. Below is the syntax
for producing manually defined colors for each region in the CBSA dataset.
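
The following is only a hedged sketch of the two color scales described above, not the book's own listing. It assumes a small stand-in table, cbsa_demo, in which the column names quad and point_color, the category labels, and the colors are all hypothetical choices made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for `cbsa`; the `quad` and `point_color` columns,
# their values, and the chosen colors are assumptions for illustration.
cbsa_demo <- data.frame(
  density = c(120, 450, 980, 2300),
  rent_1br_median = c(850, 1100, 1400, 1900),
  quad = c("NE", "S", "NC", "W"),
  point_color = c("navy", "firebrick", "darkgreen", "orange")
)

# Manual scale: a named vector maps each category to a color.
p_manual <- cbsa_demo |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median, color = quad)) +
  scale_color_manual(values = c(
    "NE" = "navy", "S" = "firebrick", "NC" = "darkgreen", "W" = "orange"
  ))

# Identity scale: the mapped column already holds the colors to draw.
p_identity <- cbsa_demo |>
  ggplot() +
  geom_point(aes(x = density, y = rent_1br_median, color = point_color)) +
  scale_color_identity()
```

Naming the entries of the values vector ties each color to a category explicitly, so the result does not depend on the alphabetical ordering of the categories.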
"I think I can still hear the mias breathing," replied the Venetian.
"It would be prudent to put a dart into him before we go down."
He raised the blowpipe and blew into it hard. The silent dart flew
swiftly and lodged itself in the chest of the man of the woods.
A dull grunt was heard, but shortly afterwards the giant ape's
breathing ceased.
"Now we can go down," said Albani.
"No, sir!" exclaimed the cabin boy.
"Why? They are both dead."
"Look there, near the bushes."
The Venetian and the sailor looked in the direction indicated and saw
emerging from the bushes an ape that was already more than a metre tall
and of sturdy build. It advanced hesitantly toward the group formed by
the mias and the boa, uttering moans that had something human about them.
"It is the orangutan's young," said Albani.
"So it was a female," said the sailor. "Poor little thing! Will it be
able to live on its own?"
"It is already well grown," replied Albani.
"Shall we let it go?"
"I think it could be useful to us, Enrico."
"That young ape!"
"We will make a capable and sturdy servant of him."
"But when he grows big he will do us in, sir."
"The Dayaks often adopt them and have never had cause to complain. In
captivity they seem to lose their fierce instincts. That mias, with his
extraordinary strength, will be able to render us great services."
"Then let us go and get him."
"I will take care of him, sir," said Piccolo Tonno. "I am very fond of
monkeys."
They slid down the bamboo poles that served them as a ladder and
approached the young mias, which kept circling its dead mother,
uttering shrill moans.
The sailor seized it by the arms and tried to drag it into the
enclosure, but received such a mighty shove that he fell with his legs
in the air.
"Earthquake! What strength!" he exclaimed.
"Let us win him over gently," said Albani.
He began to stroke it and offered it some fruit. The little mias was
wary at first, but ended up accepting and greedily devouring the
delicious durian pulp.
Little by little, by being offered more and more fruit, it was lured
into the enclosure, and the sailor tied it up with a stout rope without
receiving any further shoves.
"He will soon get used to it," said Albani. "In two weeks he will
follow us like a puppy, and in a month we will have an excellent
servant and a skilful provider of fruit. Let us leave him in peace now
and go back to sleep."
Chapter XII
The Monkeys Fishing for Crabs

Ten days had passed since the capture of the little mias, but the
Robinsons, although they had not yet left the coast to attempt an
exploration of the interior or of the great forests to the south, in
which they might find many valuable resources, had not remained idle.
They had made themselves many indispensable items: a table, some chairs
and some containers, using the thick stems of giant bamboo; comfortable
hammocks, using pieces of sail; and a water conduit that ran from the
spring discovered in the middle of the woods and ended in the enclosure.
They had also broken up a patch of ground, using hoes made from the iron
rods of the yards, in the hope of finding useful seeds in some corner of
the island, and they had dug traps, but without success, since the large
game seemed to have abandoned that coast.
They had, however, managed to catch some birds, which they had shut up
in a kind of aviary, built with great patience by the sailor using
rattan fibres and young bamboo.
To get hold of those birds they had had to obtain a very sticky kind of
birdlime, extracted from the giunta wan (Erceola elastica), a climbing
plant belonging to the dogbane family (Apocynaceae), which yields a kind
of gum used by the Malays precisely for catching birds.
With that birdlime they had managed to obtain several pairs of buceros
rhinoceros, commonly called toucans or rhinoceros hornbills, large and
outlandish birds with feathers black above and white below, a tail
thirty or more centimetres long, and an enormous beak as long as the
bird's whole body, reddish-yellow in colour and topped by a bony
protuberance shaped like a large comma.
They had also caught some great arguses, superb birds larger than
peacocks, which seem to wear a veritable mantle of black feathers with
whitish streaks and reddish-brown spots, and which have tails more than
half a metre long ending in two slightly curved feathers; and several
pairs of magnificent doves, so called because they are the most
beautiful and graceful of all. They are as large as Spanish pigeons, but
their breast feathers are of a blue hue with coppery glints and those of
the back dark green with golden glints.
These birds had soon grown tame and no longer fled when they saw the
cabin boy approach, bringing them large quantities of seeds as well as
earthworms and breadcrumbs.
One morning, however, the enclosure too began to fill up. The sailor had
noticed that some monkeys frequently made their way to the beach shortly
before daybreak, but he had never managed to get near them, nor to find
out what they went to do by the sea.
Driven by curiosity, he decided to lie in wait near some rocks, together
with the cabin boy. Having agreed on it, one morning they rose before
the stars had even begun to pale, leaving Mr. Albani sleeping soundly in
his hammock.
They went down the shore near the little bay and hid behind some rocks
to await the arrival of the monkeys.
"Let's see what they come to do," said the sailor to the cabin boy.
"Could they be coming for a bath?" asked Piccolo Tonno.
"I have never seen a monkey in the water; indeed, I believe they fear it
as cats do."
"Then they must be coming to take the sea-water cure. You know it is an
excellent purgative."
"Yes, you joker."
"Or perhaps they have a boat and go for pleasure trips on the sea?"
"No, they must be going fishing," said the sailor, laughing.
"I wouldn't be surprised, Enrico. They have a mania for imitating what
men do."
"Hush! Here they are!"
"Already?"
"Dawn is about to break."
The monkeys were indeed arriving. There were ten or twelve of them,
forty to fifty centimetres tall, with dark fur, and they resembled
langurs of the genus Semnopithecus.
They advanced in single file, with comical gravity and in silence. They
went down the shore, lined up on the rocks and began to examine the
water with great attention.
The two sailors, seized by the liveliest curiosity, did not miss a
single movement.
Suddenly they saw them turn their backs to the sea and dip their long
hairy tails into the water, making them sway gently.
"I told you they were coming for a bath," murmured Piccolo Tonno.
"With their tails!" exclaimed Enrico, shaking his head. "I think they
have another purpose. Oh! This is odd! Have you ever seen monkeys
fishing?"
One monkey, after making an ugly grimace as if it had felt a sharp pain,
had swiftly pulled out its tail, whipping it rapidly back and forth.
Something that had fastened onto that appendage flew into the air and
fell against a nearby rock with a dull thud.
"Stag's horns!" exclaimed the sailor, astonished. "They are fishing for
crabs!"
It was quite true: that troop of monkeys was fishing for sea crabs,
using a most curious, but also painful, method.
Since those crustaceans live in the underwater crevices of the rocks,
the cunning monkeys would go and tease them with their tails, and when
they felt them grip, with a lightning movement they tore them from their
element and, with a whirling motion, hurled them against the stones of
the shore, breaking their shells.
That done, they drew out the tasty flesh with their hooked fingers and
devoured it with great relish.
"I have never seen anything like it," said the sailor, more and more
astonished.
"Look here! What if we imitated them?" exclaimed the cabin boy.
"And which tail would you dip in?"
"Our hands."
The monkeys fishing for crabs. (Page 77)

"To have them ruined? Do you think those monkeys feel no pain? Look at
the ugly faces they pull when they feel their tails being pinched.
But... look! The fishing seems to be going badly!"
Two monkeys that had dipped in their tails were howling desperately, but
were no longer able to pull their appendages back. In vain they braced
themselves with hands and feet and strained furiously: the crabs seemed
unwilling to leave the water and come out of their holes.
Their companions were about to rush to their aid when the sailor leapt
out of hiding, shouting:
"At them, Piccolo Tonno!"
The troop fled swiftly, but the two captives, despite their tugging,
remained on the beach.
The two sailors were quick to seize them and, with two vigorous pulls,
freed the tails, hauling up two crabs as big as a hat, which did not let
go of their prey until they had been killed.
"Come with us, my dears," said Enrico. "We will take you to keep the
mias company."
They took the two captives by the arms and, despite their protests and
their bites, dragged them into the enclosure.
"More servants?" asked the Venetian, who was coming down from the hut.
"It seems you mean to be well waited on."
"No, sir," said the sailor, laughing. "We are bringing two fishermen who
will supply us with delicious crabs. Have you ever seen monkeys
fishing?"
"For crabs?"
"Yes."
"I have seen quite a few, especially in Java."
"Well! And I thought I was telling you an astounding piece of news."
"It is very old news to me, Enrico," said Albani. "Sciancatello!"
The one who went by that name was the mias. Piccolo Tonno had so named
him because the ape was somewhat lame (sciancato), perhaps as a result
of some tumble from the top of a very tall tree.
The young mias, which had by now grown fond of its masters, although it
was always in a sad, melancholy mood like all of its kind, and which now
walked freely about the enclosure without ever straying, left the hut
that had been built for it on hearing the Venetian's voice and went to
look with curiosity at the newcomers.
These, however, on seeing him before them, at first showed keen
apprehension; then, feeling themselves free, they tried to climb up the
fence to escape into the nearby woods. But Sciancatello, like a good
watchman, was quick to grab them by the tail and pull them down,
announcing his impending anger with dull grunts; then, to make them
understand that they owed him obedience, he dealt each of them a kick so
masterly that it sent them spinning twice in the air.
"Well done, Sciancatello!" cried the two sailors, splitting their sides
with laughter.
"With such a teacher they will soon become docile," said the Venetian.
"Do you think so, sir?" asked the sailor.
"I am sure of it, and I am counting heavily on their docility for the
planned expedition to the top of that mountain."
"To leave them here in Sciancatello's company?"
"On the contrary, Enrico; I intend to take them with us and entrust part
of our baggage to them."
The two sailors burst into Homeric laughter.
"I mean it seriously," said Albani. "Our monkeys will follow us as
porters."
"Then I will teach them to cook, sir," said the cabin boy.
"So we can eat more tail hairs than soup!" exclaimed the sailor. "No, I
want no such assistants. I would rather teach them to gather dry wood
for the fire."
"And to go to the spring to fetch water."
"So be it, Piccolo Tonno. Ah, what fine servants! Mr. Albani, I assure
you that when I landed on this island I never hoped to have servants as
well, on top of the bread and the many useful things you have procured
for us."
"You are easily satisfied."
"Do you think I have any reason to complain?"
"No, but I intend to get you more. When we have explored the forests, I
hope to return with many things we still lack. I want abundance to reign
here and for us, accustomed as we are to civilised life, to want for
nothing."
"But what more do you expect to get from the plants?"
"A great many things yet."
"You make me curious. When shall we make this excursion?"
"In a couple of days. I am anxious to get to know this island; we do not
yet know whether it is large or small, inhabited or uninhabited. Today
we will begin our preparations."
"But we lack nothing, sir. We have bread, we can take some birds with
us, water is at our disposal, and we even have liquor. What more do you
want?"
"A tent."
"We still have some sails."
"True, but we need knapsacks to hold our provisions."
"The sails will give us those."
"But how will you sew the canvas?"
"The devil! It is always the same old story: we lack everything. But
where shall we find needles? We certainly cannot make them."
"Then we must look for them."
"But where?"
"The fish will supply them with their bones. The northern peoples, the
Eskimos, the Samoyeds, the Chukchi and so on, as I have already told
you, sew their clothes using fish bones, and we will do the same."
"But these fish have to be caught, and we have no hooks."
"Fortunately the plants will give us those."
"Which ones?" asked the sailor, astonished.
"Bamboo again. The kind called hauer-tgiutgiuk, or Blume's bamboo, has
curved thorns that can serve as hooks."
"Let us go and look for them, sir, and then we will go fishing. I am
impatient to set out and get to know a little of the land that shelters
us."
"Let us go, Enrico; I too am curious to know the domain of the Italian
Robinsons."
Chapter XIII
Through the Forests

On 18 September, that is, twenty-five days after landing on that island,
the castaways set out to explore their domain, if not wholly, at least
in part.
Not yet knowing the extent of that land, they had decided to reach the
summit of the high mountain, certain that from there they could take in
all the coasts and form a more or less accurate idea of their
possession.
They had provided themselves with some thirty kilograms of bread packed
in sturdy canvas sacks, carefully sewn now that they had obtained the
desired needles from the bones of some large fish; with weapons, and
with both poisoned and unpoisoned darts for bringing down, if not big
game, at least birds; with a few litres of tuwak, a strong and excellent
liquor made from the fermented sap of the Arenga saccharifera; with
salt; and also with meat, having wrung the necks of their largest birds.
The two monkeys followed them, carrying in their sacks the cooking pot,
a few plates and the forks, while Sciancatello, already strong, carried
the tent and part of the bread.
The two monkeys had at first shown themselves unwilling to carry their
share of the baggage, but the orangutan, who had armed himself with a
cudgel, had soon tamed them, and they marched under his supervision, he
being ready to beat out on their shoulders a piece of music fit to draw
howls of pain.
The winged world was awakening under the sudden invasion of light. Amid
the leaves of the trees and bushes, bejewelled with the night's dew, the
most beautiful birds fluttered in groups, their many-coloured feathers,
with glints of gold, silver or copper, shimmering softly under the first
luminous rays of the day star rising on the horizon.
The graceful epimachus ruffled their velvety, brilliant feathers, which
looked as if sprinkled with flecks of gold, and their long slender
tails; the beautiful chimachus, birds as large as a pigeon, with the
fore part of the body jet black with golden streaks and the hind part
pure white, and a tail formed of very long, curled barbs, preened one
another with their very thin but quite long beaks; the charmasyna, a
kind of parrot with red and yellow plumage streaked with black, began
their discordant and tiresome chatter, while the splendid golden
parotias, sparkling with a thousand colours, motionless on the highest
treetops, got drunk on sunshine, letting the five barbs planted on their
heads, each ending in a kind of tuft, sway gracefully in the puffs of
the sea breeze.
Myriads of insects were also flitting in all directions: dazzling
butterflies of extraordinary size crossed one another above the flowers
or around the plant pitchers of the calamus that were still open; little
red, yellow and blue butterflies; and even battalions of little flying
lizards, called draco by the Malays, odd little creatures twenty
centimetres long, with a flattened tail and with their legs joined by a
membrane that serves as wings and allows them to make flights of twenty
and even thirty metres.
The castaways, having passed the bamboo plantation that stretched along
a long strip of coast, made their way into the forest, bearing a little
to the east, since it seemed to them that on that side the mountain was
less rugged and also less wooded.
They soon found themselves forced, however, to slow their march, for
that part of the great forest was very dense and prevented them from
proceeding in a straight line.
Thousands upon thousands of trees intertwined their leafy branches or
their feathery leaves, preventing the sun's rays from reaching the
ground. The rich and varied Malay flora had all its specimens there.
There were beautiful camphor trees, with trunks so thick that five men
could not have encircled them with their arms, and which gave off a
pungent scent; splendid sunda-matune, or sad trees, so called because
their flowers, which give off an exquisite perfume, open only at night;
arbours of pepper, trailing plants that twine around the trees, with
leaves resembling those of our bean plants and with aromatic grains
arranged in little clusters, first green, then red and finally brown
when fully ripe; great upas trees, also called bohon-upas, slender, over
thirty metres tall and covered with broad leaves that formed superb
parasols; nutmeg trees, plants resembling our laurels, six to seven
metres tall, already laden with ripe nuts that gave off pungent scents;
clove trees with their branches already bristling with those aromatic
little bunches that, once well dried, are sold under the name of cloves;
then, mixed together in confusion, bound and wrapped by very long
rattans that formed veritable nets, there were hundreds of the trees
that produce benzoin, a fragrant resin that oozes out when the trunk of
that fir-like tree is cut; cinnamon trees; cotton trees that produce a
kind of silky floss; colossal teaks with their rot-proof wood; ironwood
trees, whose branches are made into very heavy clubs that cannot be
splintered, so tough are the fibres of that wood; and an infinity of
very valuable gum trees.
Fruit trees, however, were not lacking. From time to time, amid that
chaos of vegetation, the castaways came upon mangosteens laden with
those delicious fruits that yield a white, delicate pulp, divided into
segments, which melts in the mouth like ice cream; or mangoes, called
buâ-mamplan by the Malays, but of inferior quality, being for the most
part impregnated with a strong smell of resin; or pombo, very large and
juicy oranges; or nephelium trees, which bear fruits enclosing a white,
semi-transparent, juicy pulp, sweet but slightly tart.
The castaways did not let those opportunities slip by to gather plenty
of the best fruit. This task fell to Sciancatello, who lent himself to
it with the best grace in the world, clambering to the highest tops of
the trees to pick the largest and ripest.
Towards ten in the morning, after covering at least six kilometres, a
considerable distance considering the long detours they were forced to
make to find ways through and the numerous obstacles, they found
themselves before a forest of trees with gigantic leaves and a majestic
appearance. On seeing them, Mr. Albani could not hold back a cry of
delight.
"A forest of banana trees!" he exclaimed. "We shall treat ourselves to a
feast of delicious fruit, my friends, which will also vary our supply of
bread."
"Bananas?" asked the sailor.
"Yes, Enrico."
"I have only ever eaten them as fruit."
"And I tell you that they can also take the place of bread and can be
used to make exquisite dishes. When they are ripe, that is, when the
starch has completely disappeared, having turned into sugar, they are
good only as fruit; but when the skins are still green, roasted under
the ashes, they can take the place of bread, being rich in starch.
"At that stage the fruit can also be sliced, dried in the sun and kept
for a long time.
"If they are younger still, they can be eaten in a sauce, or, when they
are close to ripeness, they can be made into exquisite fritters. Let us
go and gather some, friends."
That grove was marvellous, being made up of thousands of plants. Among
herbaceous plants, none rivals the banana for richness of foliage and
for majesty.
In hot climates these plants reach gigantic proportions, and not
infrequently the leaves attain a height of four or five metres and a
width of one metre or even more.
Many of those plants were already struggling to bear enormous bunches,
laden with elongated, slightly curved fruit enclosing a tender, fragrant
pulp. There were several species, but Mr. Albani plundered those called
pisang-mas, which bear smaller fruit of a fine golden-yellow colour and
are the best.
They lit a fire in the shade of a plant with monstrous leaves and made
an appetising breakfast of ripe bananas and of green bananas cooked
under the ashes. The monkeys and Sciancatello were not forgotten and had
a real feast of that fruit.
Water was lacking, although the ground was dampish, but Mr. Albani soon
discovered some nepenthes on the edge of the forest they had crossed
shortly before.
These plants are the strangest imaginable. They belong to the climbers,
and their leaves are rolled into the shape of pitchers, fitted with a
kind of lid that lowers at night and rises by day.
During the night the plants absorb moisture from the soil and collect it
in those pitchers, which often hold as much as half a litre. It is not,
however, clear, fresh water as is generally believed, since those
vessels serve as a tomb for a great many insects; but it is enough to
quench one's thirst and is, for the rest, perfectly good.
After a rest of an hour or so, the party set off again, climbing the
first spurs of the mountain, but through forests that were still dense
and very tangled.
They had already covered a kilometre when Sciancatello stopped abruptly,
uttering low growls and showing signs of a certain agitation.
"Hey, Sciancatello, what's the matter?" asked the sailor. "Have you
scented a tiger?"
The mias seemed to be listening with deep attention, as if trying to
catch some indistinct sound. He looked at the treetops, then studied the
bushes, and his face showed now annoyance, now contentment.
"Has he gone mad?" asked Piccolo Tonno.
"Or has he got colic?" asked the sailor instead. "He has certainly
devoured too many bananas."
"No," said Albani. "He has heard something."
"But I see nothing, nor do I hear anything."
"Would you claim to have hearing as keen as that son of the forest,
Enrico?"
Suddenly the orang stretched his immense mouth from ear to ear and let
out a roaring burst of laughter.
"Hey, Sciancatello!" shouted the sailor. "Have the bananas had the
effect on you of a good stiff drink? If you are drunk, we will give you
a shower, my boy."
The orang was no longer listening to him. With an imperious gesture he
had signalled to the two monkeys to follow him and had made for a very
tall tree covered with very thick foliage, and he had begun to study it,
continuing to show his joy with bursts of laughter.
"Could there be some fruit up there that monkeys are fond of?" asked the
sailor.
"I see nothing but leaves," replied the cabin boy. "But... don't you
hear that buzzing?"
"Yes," said the Venetian. "Oh! Now I understand! Don't you see that
cloud of insects up there?"
"Yes, yes!" the two sailors confirmed.
"They are wild bees, and our orang is getting ready to plunder the hive
and eat the honey."
"The glutton!" exclaimed the sailor. "But I will not let him eat it all.
The devil! I want to make ring cakes!"
"Quiet," said the Venetian.
"What did you hear?"
"A grunt."
"Where?"
"Up there, among the leaves."
"Could Sciancatello have found a rival?"
"I believe so, Enrico, for those bees seem very frightened to me."
"Another mias, perhaps?"
"I do not know."
"A nasty encounter, Mr. Albani."
"We have the deadly darts."
"Sciancatello is climbing," said the cabin boy.
Indeed the orang, after a brief hesitation, had begun the climb, but he
proceeded with a certain wariness and carried his cudgel with him.
From time to time he stopped to listen and raised his face as if trying
to make out some animal that seemed to be hiding among the foliage; then
he shook his head and resumed the climb.
On reaching the first branches he stood up, clasped the trunk of the
tree and, gathering his strength, began to shake it furiously, uttering
muffled barks that sounded like fits of coughing: it was his way of
showing his anger.
Grunts were heard from above, and then a black mass was seen coming down
the trunk.
"A beast!" yelled the cabin boy.
Sciancatello, seeing the animal within reach, dealt it a blow with the
cudgel so tremendous that it drew a real howl from it, then tried with a
kick to send it tumbling down; but the other, which was gripping the
trunk tightly, held fast.
Shortly afterwards, however, it was seen sliding down the tree with
great speed and then dropping to the ground as a result of one last,
more furious shake from the orang.
Chapter XIV
Honey and Sweet Potatoes

The animal that had wanted to cheat Sciancatello of the honey was as
large as a Newfoundland dog, but shorter in the leg, with a somewhat
pointed muzzle and black, very glossy fur. It resembled the black bears
in every respect, but was more elongated and also seemed much more
agile.
As soon as it found itself on the ground, it did not try to face the men
but to take to its heels into the forest; Mr. Albani, however, who knew
what kind of animal he was dealing with, brought it to the ground with
four blows of his cudgel and then, quickly pulling out a rope, tied it
around its neck, saying:
"Easy, my dear fellow; we have an enclosure at our hut and you will be
very comfortable there."
At that moment the orang was heard furiously shaking the tree again and
uttering cries of rage, followed by a dull thud that sounded like a
tremendous blow with the cudgel.
Another animal, like the first, came hurtling down the tree and fell
almost at the sailor's feet. The latter thought it best to imitate the
Venetian: with two blows of his cudgel he stunned the disturber of the
bees, then tied it up securely with the help of the cabin boy.
"Well done, friends," said Albani. "A male and a female! We will breed
them, and in a few months we too will have excellent meat."
"But do tell us what beasts they are, sir," said the sailor.
"They are bears."
"Earthquake! Bears!" exclaimed the sailor, leaping back.
"Are you afraid?"
"If they are bears, I have reason to be frightened."
"They are harmless, Enrico. Those of Borneo and of all the Malay islands
are not ferocious like the others. As you see, they are smaller than all
the other species, and although they have teeth and claws they hardly
ever use them, and they shun man. This double capture will be of great
benefit to us, for we will raise bear cubs that will provide us, from
time to time, with succulent roasts."
"And the honey?" asked the cabin boy. "That rascal Sciancatello will
devour it all."
"Ah! The scoundrel!" yelled the sailor. "He is eating my ring cakes.
Hey, Sciancatello! Come down or I will break my cudgel across your back,
you ugly glutton!"
The orang seemed to have gone deaf. He could be heard breaking branches
and shaking the leaves, while the bees fled in swarms, buzzing. The
glutton was no doubt plundering the hive.
The sailor, furious, fearing that he would be unable to taste the honey
or make his ring cakes, tried to shake the tree to force the orang down,
but in vain.
The Venetian and the cabin boy, for their part, were roaring with
laughter.
"Enough, you glutton!" the sailor went on yelling. "Come down or I will
send you to join your mother with a dart that will finish you off. Come
down, you greedy thief!"
The mias remained deaf to that storm of insults and threats, and the
sailor grew angrier still, believing him busy stuffing himself with
honey.
"Farewell, ring cakes," said the cabin boy, still laughing. "This time
it is Sciancatello who eats the sweet."
"Earthquake of Genoa!" thundered the sailor. "I will give him such a
lesson that he will throw up all the honey! I will smash his bones!"
"Here he comes," said Albani. "It seems he has finished his breakfast."
Indeed Sciancatello was coming down through the branches and leaves, but
without haste. He seemed hampered by something he was carrying, for with
one hand he was holding a bulky bundle.
"What is that rogue towing?" asked the sailor.
"He must be bringing us the wax, with which we will make good candles,"
said Piccolo Tonno.
"I will make him eat it after the honey! I don't give a fig for the wax!
Come down, you scoundrel, and I will give your shoulders a good
stroking!"
Sciancatello kept coming down, but always with great care, holding the
bundle tight.
"The sly one!" exclaimed the cabin boy. "And they say monkeys are less
intelligent than men!"
"Why?" asked Enrico.
"Don't you see that he has put the honeycombs from the hive in the tent
he was carrying slung across his shoulder?"
"Hey! Look! A drop! Thunder! It's honey!"
The sailor, who was standing under the tree, had received a large drop
on his face and realised that it was honey. His brow cleared.
"Could Sciancatello be more honest than I thought?" he murmured.
The mias, emerging from the branches, slid down the trunk like a true
gymnast and, once on the ground, opened the tent, which was oozing honey
on all sides.
It was full of honeycombs, not squeezed dry of their delicious juice but
still full. The sailor jumped four times around the tree, then opened
his arms and hugged the hairy ape to his chest, exclaiming:
"Give me a hug, my boy! You are the most honest of all the monkeys and
all the orangutans on earth!"
Sciancatello deserved that praise, for instead of plundering the hive
for his own benefit, he was bringing the combs intact to his masters.
The sailor wasted no time. He rolled up his sleeves, had the pot
handed to him, and set about squeezing the wax, pressing out large
drops of fragrant honey.
He soon realized that the pot was not big enough to hold all the
juice, but Mr. Albani hastened to provide other containers by shaping
watertight cones from the broad leaves of an areca palm.
When the operation was finished, they reckoned their supply at twelve
kilograms, after deducting a few kilograms given to the honest
Sciancatello and the two monkeys.
"So many ring cakes!" exclaimed the sailor. "Good heavens!.. We'll
eat our fill of them."
"But there is one thing you haven't thought of, Enrico," said Albani.
"How are we going to cross the woods with these containers?... The
mountain is still high, my friend."
"Thunder!... But I won't leave my honey here, sir. The bears or the
monkeys would eat it."
"I believe it, and besides, we cannot take the bears along with us."
"Leave me here and climb the mountain yourselves."
"Won't you be afraid of the tigers?"
"I have the blowpipe, and the darts are poisoned."
"We'll leave Sciancatello with you too; he is a good companion who
knows how to wield his cudgel stoutly."
"When will you be back?..."