Data Science Fundamentals with R, Python, and Open Data
Marco Cremonini
University of Milan
Italy
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under
Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of
the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department,
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affiliates in the United States and other countries and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product
or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing
this book, they make no representations or warranties with respect to the accuracy or completeness of the contents
of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional where appropriate.
Further, readers should be aware that websites listed in this work may have changed or disappeared between when
this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or
any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer
Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or
fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Contents
Preface
About the Companion Website
Introduction
Index
Preface
Two questions come along with every new text that aims to teach someone something. The first is,
Who is it addressed to? and the second is, Why does it have precisely those contents, organized in
that way? These two questions, for this text, have perhaps even greater relevance than they usually
do, because for both, the answer is unconventional (or at least not entirely conventional) and to
some, it may seem surprising. It shouldn't be; or, even better, the answers should make the surprise
a pleasant one.
Let’s start with the first question: Who is the target of a text that introduces the fundamentals
of two programming languages, R and Python, for the discipline called data science? Those who
study to become data scientists, computer scientists, or computer engineers, it seems obvious, right?
Instead, it is not so. For sure, future data scientists, computer scientists, and computer engineers
could find this text useful. However, the real recipients should be others, simply all the others, the
non-specialists, those who do not work or study to make IT or data science their main profession.
Those who study to become or already are sociologists, political scientists, economists, psychol-
ogists, marketing or human resource management experts, and those aiming to have a career in
business management and in managing global supply chains and distribution networks. Also, those
studying to be biologists, chemists, geologists, climatologists, or even physicians. Then there are
law students, human rights activists, experts of traditional and social media, memes and social net-
works, linguists, archaeologists, and paleontologists (I’m not joking, there really are fabulous data
science works applied to linguistics, archeology, and paleontology). Certainly, in this roundup, I
have forgotten many who deserved to be mentioned just like the others. Don't feel left out. And the artists,
I almost forgot! There are cross-pollinations of incredible interest between art, data science, and data
visualization. Art absorbs and re-elaborates, and in a certain way, this is also what data science does: it
absorbs and re-elaborates. Finally, there are also all those who just don’t know yet what they want
to be; they will figure it out along the way, and having certain tools can come in handy in many
cases.
Everyone can successfully learn the fundamentals of data science and the use of these computa-
tional tools, even starting from just a few basic computer skills, with some effort and time, of course, necessary
but reasonable. Everyone could find opportunities for application in all, or almost all, existing pro-
fessions, sciences, humanities, and cultural fields. And above all, without the need to take on the
role of computer scientist or data scientist when you already have other roles to take on, which
rightly demand time and dedication.
Therefore, not considering computer scientists and data scientists as the principal recipients of
this book is not meant to diminish their role; it is simply that for them there is no need to explain
why a book that presents programming languages for data science has, at least in theory, something
to do with what they typically do.
It is to the much wider audience of non-specialists that the exhortation to learn the fundamentals
of data science should be addressed, explaining that they do not have to transform themselves
into computer scientists to be able to do so (or, even worse, into geeks), something that, with excellent
reasons difficult to dispute, they have no intention of doing. It doesn't matter if they have always been
convinced of being "unfit for computer stuff," or that, frankly, the rhetoric of the past twenty years about
"digital natives," "being a coder," or "joining the digital revolution" just sounds annoying. None of
this should matter; time to move on. How? Everyone should look at what digital skills and tech-
nologies would be useful for their own discipline and do the training for those goals. Do you want
to be a computer scientist or a data scientist? Well, do it; there is no shortage of possibilities. Do you
want to be an economist, a biologist, or a marketing expert? Very well, do it, but you must not be cut
off from adequate training on digital methodologies and tools from which you will benefit, as much
as you are not cut off from a legal, statistical, historical, or sociological training if this knowledge
is part of the skills needed for your profession or education. What is the objection that is usually
made? No one can know everything, and generalists end up knowing a little of everything and
nothing adequately. It’s as true as clichés are, but that’s not what we’re talking about. A doctor who
acquires statistical or legal training is no less a doctor for this; on the contrary, in many cases she/he
is able to carry out the medical profession in a better way. No one reproaches an economist who
becomes an expert in statistical analysis that she/he should have taken a degree in statistics. And
soon (indeed already now), to the same economist who will become an expert in machine learning
techniques for classification problems for fintech projects, no one, hopefully, will reproach that as
an economist she/he should leave those skills to computer scientists. Like it or not, computer skills
are spreading, and will spread more and more, among non-computer scientists; it is a matter of base
rates, notoriously easy to misinterpret, as all students who have taken an introductory course
in statistics know.
Let's consider the second question: Why does this text present two languages instead of just one,
as is usually done? Isn't it enough to learn just one? Which is better? A friend of mine told me he's
heard that Python is famous; the other one he has never heard of. Come on, seriously, two? It's a
miracle if I learn half of just one! Stop. That's enough.
It's not a competition or a beauty contest between programming languages, and not even a ques-
tion of cheering as with sports teams, where you have to choose one: rooting for none is admissible, but you
can't root for two. R and Python are tools, in some ways complex, not necessarily complicated,
professional, but also within anyone’s reach. Above all, they are the result of the continuous work
of many people; they are evolving objects and are extraordinary teaching aids for those who want
to learn. Speaking of evolution, a recent and interesting one is the increasingly frequent conver-
gence between the two languages presented in this text. Convergence means the possibility of
coordinated, alternating, and complementary use: combine the benefits of both, exploit what
is innovative in one and what the other already has, and, above all, gain the real didactic value of learning not to
be afraid of changing technology, because much of what you learned with one will be found, and will
be useful, with the other. There is another reason, and this one is more specific. It is true that Python is
so famous that almost everyone has heard its name, while only relatively few know R, except that
practically everyone involved in data science knows it and most of them use it, and that's for a
pretty simple reason: it's a great tool with a large community of people who have been contribut-
ing new features for many years. What about Python? Python is used by millions of people, mainly
to build web services, so it has enormous application possibilities. A part of the Python ecosystem has specialized
in data science and is growing rapidly, taking advantage of the ease of extension to dynamic and
web-oriented applications. One last piece of information: learning your first programming language
could look difficult. The learning curve (so called because it describes how fast you learn) is steep at first: you struggle
at the very beginning, but after a while it softens, and you run. This is for the first one. Same ramp
to climb with the second one too? Not at all. Attempting an estimate, I would say that just one-third
of the effort is needed to learn the second, a bargain that probably few are aware of. Therefore, let’s
do both of them.
One last comment, because one could certainly think that this discussion is only valid in theory and that
putting it into practice is quite another thing. Over the years I have required hundreds of social
science students to learn the fundamentals of both R and Python for data science, and I can tell
you that it is true that most of them struggled initially, and some complained more or less aloud that
they were unfit. Then they learned very quickly, and they ended up demonstrating that it was possible for
them to acquire excellent computational skills without having to transform into computer scientists
or data scientists (to tell the truth, someone did transform into one, but that's fine too), without
possessing nonexistent digital-native genius, without having to be anything other than what they
study for: future experts in social sciences, management, human resources, or economics. And what
is true for them is certainly true for everyone. This is the pleasant surprise.
About the Companion Website
www.wiley.com/go/DSFRPythonOpenData
● Software
Introduction
This text introduces the fundamentals of data science using two main programming languages and
open-source technologies: R and Python. These are accompanied by their respective application
contexts, formed by tools to support coding scripts, i.e. logical sequences of instructions with the
aim of producing certain results or functionalities. The tools can be of the command line interface
(CLI) type, consoles operated through textual commands, or of the integrated development
environment (IDE) type, interactive tools that support the use of the languages. Other elements
that make up the application context are the supplementary libraries that provide functions
beyond the basic ones coming with the language, package managers for the
automated management of the download and installation of new libraries, online documentation,
cheat sheets, tutorials, and online forums of discussion and help for users. This context, formed
by a language, tools, additional features, discussions between users, and online documentation
produced by developers, is what we mean when we say "R" and "Python," not the simple program-
ming language tool, which by itself would be very little. It is like talking only about the engine
when instead you want to explain how to drive a car on busy roads.
R and Python, together and with the meaning just described, represent the knowledge to start
approaching data science, carry out the first simple steps, complete the educational examples, get
acquainted with real data, consider more advanced features, familiarize oneself with other real
data, experiment with particular cases, analyze the logic behind mechanisms, gain experience with
more complex real data, analyze online discussions on exceptional cases, look for data sources in
the world of open data, think about the results to be obtained, even more sources of data now
to put together, familiarize yourself with different data formats, with large datasets, with datasets
that will drive you crazy before obtaining a workable version, and finally be ready to move to other
technologies, other applications, uses, types of results, projects of ever-increasing complexity. This
is the journey that starts here, and as discussed in the preface, it is within the reach of anyone who
puts some effort and time into it. A single book, of course, cannot contain everything, but it can
help to start, proceed in the right direction, and accompany for a while.
With this text, we will start from the elementary steps to gain speed quickly. We will use simplified
teaching examples, but also immediately familiarize ourselves with the type of data that exists in
reality, rather than in the unreality of the teaching examples. We will finish by addressing some
elaborate examples, in which even the inconsistencies and errors that are part of daily reality will
emerge, requiring us to find solutions.
Approach
It often happens that students dealing with these contents, especially the younger ones, initially
find it difficult to figure out the right way to approach their studies in order to learn effectively.
One of the main causes of this difficulty lies in the fact that many are accustomed to the idea that
the goal of learning is to never make mistakes. This is not surprising, indeed, since it is the criterion
adopted by many exams: the more mistakes, the lower the grade. This is not the place to discuss the
effectiveness of exam methodologies or teaching philosophies; we are pragmatists, and the goal is
to learn R and Python, computational logic, and everything that revolves around it. But it is pre-
cisely from a wholly pragmatic perspective that the problem of the inadequacy of the approach that
seeks to minimize errors arises, and this for at least two good reasons. The first is that inevitably
the goal of never making mistakes leads to mnemonic study. Sequences of passages, names, formu-
las, sentences, and specific cases are memorized, and the variability of the examples considered is
reduced, tending toward schematism. The second reason is simply that trying to never fail is exactly
the opposite of what it takes to effectively learn R and Python and any digital technology.
Learning computational skills for data science necessarily requires a hands-on approach. This
involves carrying out many practical exercises, meticulously redoing those proposed by the text, but
also varying them, introducing modifications, and replicating them with different data. All the
didactic examples can obviously be modified, and all those based on open data can easily
be varied as well. Instead of certain information, other information could be used, and instead of a certain result, a
slightly different one could be sought, or different data made available by the same source could be
tried. Proceeding methodically (being methodical, meticulous, and patient are fundamental traits
for effective learning) is the way to go. Returning to the methodological doubts that often afflict
students when they start, the following golden rule applies, which must necessarily be emphasized
because it is of fundamental importance: exercises are used to make mistakes, an exercise without
errors is useless.
Open Data
The use of open data, right from the first examples and to a much greater extent than examples with
simplified educational datasets, is one of the characteristics, perhaps the main one, of this text.
Twenty-six datasets taken from open data are used, sourced from the United States and other countries, large
international organizations (the World Bank and the United Nations), as well as charities and inde-
pendent research institutes, gender discrimination observatories, and government agencies for air
traffic control, energy production and consumption, pollutant emissions, and other environmental
information. This also includes data made available by cities like Milan, Berlin, and New York City.
This selection is just a drop in the sea of open data available and constantly growing in terms of
quantity and quality.
Using open data to the extent it has been done in this text is a precise choice that certainly imposes
an additional effort on those who undertake the learning path, a choice based both on personal
experience in teaching the fundamentals of data science to students of social and political sciences
(every year I have increasingly anticipated the use of open data), and on the fundamental drawback
of carrying out examples and exercises mainly with didactic cases, which are inevitably unreal and
unrealistic. Of course, the didactic cases, also present in this text, are perfectly fit for showing a
specific functionality, an effect or behavior of the computational tool. As mentioned before, though,
the issue at stake is about learning to drive in urban traffic, not just understanding some engine
mechanics, and in the end the only way to do that is … driving in traffic; there is no alternative. For us
it is the same: anyone who works with data knows that one of the fundamental skills is to prepare
the data for analysis (before that, there would be the skill of finding the data) and also that this task can easily
be the most time- and effort-demanding part of the whole job. Studying mainly with simplified
teaching examples erases this fundamental part of knowledge and experience; for this reason, such examples
are always unreal and unrealistic, however you try to fix them. There is no alternative to getting
your hands on, and banging your head against, real data, handling datasets even of hundreds of thousands
or millions of rows (the largest one we use in this text has more than 500 000 rows, the data of all US
domestic flights of January 2022), with their errors, explanations that must be read and are sometimes
misinterpreted, and even cases where data was recorded inconsistently (we will see one quite amusing
example of this kind). Familiarity with real data should be achieved as soon as possible, to figure out their
typical characteristics and the fact that behind data there are organizations made up of people, and
it is thanks to them if we can extract new information and knowledge. You need to arm yourself
with patience and untangle, one step at a time, each knot. This is part of the fundamentals to learn.
One book alone can’t cover everything; we’ve already said it and it’s obvious. However, the point
to decide is what to leave out. One possibility is that the author tries to discuss as many different
topics as she/he can think of. This is the encyclopedic model, popular but not very compatible with a
reasonably limited number of pages. It is no coincidence that the most famous of the encyclopedias
have dozens of ponderous volumes. The short version of the encyclopedic model is a “synthesis,”
i.e. a reasonably short overview that is necessarily not very thorough and has to simplify complex
topics. Many educational books choose this form, which has the advantage of the breadth of topics
combined with a fair amount of simplification.
This book has a hybrid form, from this point of view. It is broader than the standard because
it includes two languages instead of one, but it doesn’t have the form of synthesis because it
focuses on a certain specific type of data and functionality: data frames, with the final addition of
lists/dictionaries, transformation and pivoting operations, group indexing, aggregation, advanced
transformations and data frame joins, and on these issues, it goes into the details. Basically, it
offers the essential toolbox for data science.
What’s left out? Very much, indeed. The techniques and tools for data visualization, descriptive
and predictive models, including machine learning techniques, obviously the statistical analysis
part (although this is traditionally an autonomous part), technologies for "Big Data," i.e. distributed,
scalable software infrastructures capable of managing not only a lot of data but above all data
streams, i.e. real-time data flows, and the many web-oriented extensions, starting from data col-
lection techniques from websites up to integration with dynamic dashboards and web services,
are not included. Also not treated are specialized standards, such as those for climate data, financial
data, and biomedical data, and the codings used by some of the large international institutions.
The list could go on.
This additional knowledge, which is part of data science, deserves to be learned. For this, you
need the fundamentals that this book presents. Once equipped with them, it’s the personal interests
and the cultural and professional path of each reader that play the main role, driving in one direc-
tion or another. But again, once it has been verified firsthand that it is possible, regardless of
one’s background, to profitably acquire the fundamentals of the discipline with R and Python, any
further insights and developments can be tackled, in exactly the same way, with the same approach
and spirit used to learn the fundamentals.
1 Open-Source Tools for Data Science
In this first section, we introduce the main tools for the R environment: the R language and the
RStudio IDE (integrated development environment). The first is an open-source programming
language developed by the community, specifically for statistical analysis and data science; the
second is an open-source development tool produced by Posit (www.posit.co), the company formerly called
RStudio, and representing the standard IDE for R-based data science projects. Posit offers a freeware
version of RStudio called RStudio Desktop that fully supports all features for R development; it has
been used (v. 2022.07.2) in the preparation of all the R code presented in this book. Commercial
versions of RStudio add supporting features typical of managing production software in corporate
environments. An alternative to RStudio Desktop is RStudio Cloud, the same IDE offered as a
cloud service. Graphically and functionally, the cloud version is exactly the same as
the desktop one; however, its free usage has limitations.
The official distribution of the R language and the RStudio IDE are just the starting points though.
This is what distinguishes an open-source technology from a proprietary one. With an open-source
technology actively developed by a large online community, as is the case for R, the official dis-
tribution provides the basic functionality and, on top of that, layers of additional, advanced, or
specialized features can be stacked, all of them developed by the open-source community. There-
fore, it is a constantly evolving environment, not a commercial product subject to the typical life
cycle mostly mandated by corporate marketing. What is better, an open-source or a proprietary
tool? This is an ill-posed question, mostly irrelevant in generic terms because the only reasonable
answer is, “It depends.” The point is that they are different in a number of fundamental ways.
With R, we will use many features provided by additional packages to be installed on top of the
base distribution. This is the normal course of action and is exactly what everybody using this
technology is supposed to do in order to support the goal of a certain data analysis or data science
project. Clearly, the additional features employed in the examples of this book are not all those avail-
able, and neither are all those somehow important, that would be simply impossible to cover. New
features come out continuously, so in learning the fundamentals, it is important to practice with
the approach, familiarize yourself with the environment, and exercise with the most fundamental
tools, so as to be perfectly able to explore the new features and tools that become available.
Just keep in mind that these are professional-grade tools, not merely didactic ones to be aban-
doned after the training period. Thousands of experienced data scientists use these tools in their
daily jobs and for top-level data science projects, so the instruments you start knowing and handling
are powerful.
1.1.1 R Language
CRAN (the Comprehensive R Archive Network, https://cloud.r-project.org/) is the official online
archive for all R versions and software packages available to install. CRAN is mirrored on a number
of servers worldwide, so, in practice, it is always available.
The R base package is compatible with all desktop platforms: Windows, MacOS, and
Linux. The installation is guided through a standard wizard and is effortless. Mobile platforms
such as iOS and Android, as well as hybrid products, like the Chromebook, are not supported. For
old operating system versions, the currently available version of R might not be compatible. In that
case, under R Binaries, all previous versions of R are accessible, the most recent compatible one
can be installed with confidence, and all the important features will be available.
At the end of the installation, a link to an R execution file will be created in the programs or
applications menu/folder. That is not the R language, but an old-fashioned IDE that comes with
the language. You do not need that if you use RStudio, as is recommended. You just need to install
the R language, that is all.
strictly necessary. Given the few specialized features a package manager must have, it should come
without any surprise that modern package managers have their origins in classical command line
tools. Actually, they still exist and thrive; they are often used as command line tools both in R and
Python environments, just because they are simple to use and have limited options.
At any rate, a graphical interface exists, and RStudio offers it with the tab Packages in the Q4
quadrant. It is simple, just a list of installed packages and a selection box indicating if a package is
also loaded or not. Installing and loading a package are two distinct operations. Installing means
retrieving the executable code, for example, by downloading it from CRAN and configuring it in the
local system. Loading a package means making its functionalities available for a certain script, which
translates into the fundamental function library(<name of the package to load>).
Ticking the box beside a package in the RStudio package manager will execute on the R Console
(quadrant Q2) the corresponding library() instruction. Therefore, using the console or ticking
the box for loading a package is exactly the same.
However, neither of them is a good way to proceed, when we are writing R scripts, because a
script should be reproducible, or at least understandable by others, at a later time, possibly a long
time later. This means that all information necessary for reproducing it should be explicit, and if the
list of packages to be loaded is defined externally by ticking selection boxes or running commands
on the console, that knowledge is hidden, and it will be more difficult to understand exactly all the
features of the script. So the correct way to proceed is to explicitly write all necessary library()
instructions in the script, loading all required packages.
The opposite operation of loading a package is unloading it, which is certainly less frequent;
normally, it is not needed in scripts. From the RStudio interface, it could be executed by unticking
a package or by executing the corresponding instruction detach("package:<name of the
package>", unload=TRUE).
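A minimal sketch of these operations, using dplyr purely as an example package name:
# Installing is a one-off operation: download the package (e.g. from CRAN) and configure it locally.
install.packages("dplyr")
# Loading must be repeated in every script that uses the package; write it explicitly at the top.
library(dplyr)
# Unloading, rarely needed in scripts, is the opposite of loading.
detach("package:dplyr", unload = TRUE)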
A reasonable doubt may arise about the reason why installed packages are not just all loaded
by default. Why bother with this case-by-case procedure? The reason is memory, the RAM, in
particular, that is not only finite and shared by all processes executed on the computer, but is often
a scarce resource that should be used efficiently. Loading all installed packages, which could be
dozens or even hundreds, when normally just a few are needed by the script in execution, is clearly
a very inefficient way of using the RAM. In short, we bother with the manual loading of packages
to save memory space, which is good when we have to execute computations on data.
Installing R packages is straightforward. The interactive button Install in tab Packages is handy
and provides all the functionalities we need. From the window that opens, the following choices
should be made:
● Install from: From which repository should the package be downloaded? Options are: CRAN,
the official online repository, which is the default and the normal case; and Package Archive File, which is
only useful if the package to install has been saved locally, as may happen for experimental
packages not available from CRAN or GitHub, a rare combination. Packages available from
GitHub can be retrieved and installed with a specialized command (githubinstall
("PackageName")).
● Packages: The name of the package(s) to install; the autocomplete feature looks up names from
CRAN.
● Install to library: The installation path on the local system depends on the R version currently
installed.
● Install dependencies: Dependencies are logical relationships between different packages. It is
customary for new packages to exploit features of existing packages, for many reasons, either
as core or as ancillary functionalities with respect to the features provided by the new
package. In this case, those functionalities are not reimplemented, but the package providing
them is logically linked to the new one. This, in short, is the meaning of dependencies. It means
that when a package is installed, if it has dependencies, those should be installed too (with the
required version). This option, when selected, automatically takes care of all dependencies,
downloading and installing them if not present. The alternative is to manually download and
install the packages required as dependencies by a certain package. The automatic choice is
usually the most convenient (a console equivalent is sketched after this list). Errors may arise because of dependencies, for example, when for
any reason the downloading of a package fails, or the version installed is not compatible. In
those cases, the problem should be fixed manually, by installing either the missing dependencies
or the ones with the correct version.
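The console equivalent of the interactive Install button is the install.packages() function; as a minimal sketch, with janitor chosen only as an example package name:
# Download the package from CRAN, automatically installing its dependencies as well.
install.packages("janitor", dependencies = TRUE)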
In any case, tidyverse is widely used, and for this, it is useful to spend some time reading the
description of the packages included in it because this provides a glimpse into what most data
science projects use, the types of operations and more general features. In our examples, most of
the functions we will use are defined in one of the tidyverse packages, with some exceptions that
will be introduced.
The installation of tidyverse is the standard one, through the RStudio package manager, or the
console with command install.packages("tidyverse"). Loading it in a script is done
with library(tidyverse) for a whole lot of packages, or alternatively, for single packages
such as library(readr), where readr is the name of a package contained in tidyverse. In all
cases, after the execution of a library instruction, the console shows if and what packages have
been loaded.
In all our examples, it should be assumed that the first instruction to be executed is
library(tidyverse), even when not explicitly specified.
1.2 Python Language and Tools
Python's environment is more heterogeneous than R's, mostly because of the different scope of the
language: Python is a general-purpose language mostly used for web and mobile applications and
in a myriad of other cases, data science among them. This implies that several options are avail-
able as a convenient setup for data science projects. Here, one of the most popular is considered,
but there are good reasons to make different choices.
The first issue to deal with is that, until now, there is no data science Python IDE comparable
to RStudio for R, which is the de facto standard and offers almost everything that is needed. In
Python, you have to choose whether to go with a classical IDE for coding (there are many, which
is fine, but they are not much tailored to data science wrangling operations) or with an IDE based
on the computational notebook format (just notebook, for short). The notebook
format is a hybrid document that combines formatted text, usually based on Markdown syntax,
with blocks of executable code. For several reasons, mostly the usefulness, in many contexts, of
having both formatted text and executable code, and the ease of use of these tools, IDEs based on the
notebook format have become popular for Python data science and data analysis. The code in the
examples of the following chapters has been produced and tested using the main one of these IDEs,
JupyterLab (https://jupyterlab.readthedocs.io/en/latest/). It is widely adopted, well-documented,
easy to install, and free to use. If you are going to write short blocks of Python code with associated
descriptions, a typical situation in data science, it is a good choice. If you have to write a massive
amount of code, then a classical IDE is definitely better. Jupyter notebooks are textual files with
the canonical extension .ipynb and an internal structure based on the JSON format.
So, the environment we need has the Python base distribution, a package manager, for the same
reasons we need it with R, the two packages specifically developed for data science functionali-
ties called NumPy and pandas, and the notebook-based IDE JupyterLab. These are the pieces. In
order to have them installed and set up, there are two ways of proceeding: one is easy but ineffi-
cient, and the other is a little less easy but more efficient. Below, with A and B, the two options are
summarized.
A. A single installer package, equipped with a graphical wizard, installs and sets up everything
that is needed, but also much more than you will likely ever use, for a total required memory
space of approximately 5 GB on your hard disk or SSD memory.
B. A manual procedure individually installs the required components: first Python and the pack-
age manager, then the data science libraries NumPy and pandas, and finally the JupyterLab IDE.
This requires using the command line shell (or terminal) to run the few installation instructions
for the package manager, but the occupied memory space is just approximately 400 MB (a concrete sketch of such commands is given right below).
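As an illustration only, assuming, purely for the sake of example, that Python and pip are the base distribution and package manager of choice (the text's own setup and package manager may differ, e.g. conda), option B boils down to a few shell commands like these:
pip install numpy pandas jupyterlab    # install the data science libraries and the notebook IDE
jupyter lab                            # launch JupyterLab in the default browser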
Either way, the result is the Python setup for learning the fundamentals of data science, ready
and working. The little difficulty of option B, i.e. using the command line to install
components, is truly minimal and, in any case, the whole program described in this book is about
familiarizing yourself with command line tools for writing R or Python scripts, so nobody should be worried
about a few almost identical commands to run with the package manager.
So, the personal suggestion is to try option B, as described in the following: it is operationally
better and teaches some useful skills. At worst, it is always possible to backtrack and go with
the easier option A on a second try.
This way, a list of the options is shown. The most useful are:
When a package is installed or uninstalled, on the command line appears a request to confirm
the operation; the syntax is [n/Y], with n for No and Y for Yes.
should be at hand and regularly consulted, no matter whether the handbook is used for learning.
A handbook and the technical documentation serve different purposes and complete each other;
they are never alternatives.
All the Python scripts and fragments of code presented in the following chapters assume that
both libraries have been loaded with the following instructions, which should appear as the first
ones to be executed:
import numpy as np
import pandas as pd
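Assuming those two lines have been executed, here is a minimal illustration of the np and pd aliases in use, with made-up values:
# A tiny hand-made data frame; np.nan marks a missing value.
df = pd.DataFrame({"city": ["Milan", "Berlin", "New York"],
                   "value": [1.5, np.nan, 3.0]})
print(df.head())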
Its name recalls the original symbol used as a separator, the comma, perfectly adapted
to the Anglo-Saxon convention for floating point numbers and to mostly numerical data. With the
diffusion to European users and textual data, the comma became problematic, being used for the
numeric decimal part and often in sentences, so the semicolon appeared as an alternative separator,
less frequently used, and unrelated to numbers. But again, even the semicolon might be problem-
atic when used in sentences, so the tabulation (tab character) became another typical separator.
These three are the normal ones, so to say, since there is no formal specification mandating what
symbols could act as separators, and they are the ones we should expect to find used. But other,
uncommon symbols could be encountered, such as the vertical bar (|).
Ultimately, the point is that whatever separator is used in a certain CSV dataset, we can eas-
ily recognize it, for example by visually inspecting the text file, and then access the dataset
correctly. So, one needs to pay attention, but the separator is never a real problem.
As a convention, not a strict rule, CSV files use the extension .csv (e.g. dataset1.csv), meaning
that it is a text-based, tabular dataset. When the tab is used as a separator, it is common to indicate
it by means of the .tsv file extension (e.g. dataset2.tsv), which is just good practice and a kind way to
inform the user that tab characters have been placed in the dataset, since they are not very evident
at first sight. Expect, however, to also find datasets using tabs as separators but named with the .csv extension.
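As a sketch of how a reading function deals with a non-default separator, pandas' read_csv() accepts a sep argument, and a decimal argument for the European decimal comma (the file content below is invented):
import io
import pandas as pd

# A semicolon-separated dataset that uses the comma as decimal mark.
text = "country;value\nItaly;1,5\nFrance;2,3\n"
df = pd.read_csv(io.StringIO(text), sep=";", decimal=",")
print(df)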
Ambiguous cases arise anyway. A real example is shown in the fragment of Figure 1.3. This is
a real dataset with information about countries, a very common type of information. Countries
have official denominations that have to be respected, especially in works with a certain degree of
formality. If you look closely, you will notice something strange about the Democratic Republic of
the Congo.
The name is written differently in order to maintain the coherence of the alphabetic order, so
it became Congo, Democratic Republic of the. Fine for what concerns the order, but it complicates
things for the CSV syntax because now the comma after Congo is part of the official name. It cannot
be omitted or replaced with another symbol; that is the official name when the alphabetic order
must be preserved. But commas also act as separators in this CSV, and now they are no longer
unambiguously defined as separators.
How can we resolve this problem? We have already excluded the possibility of arbitrarily chang-
ing the denomination of the country; that is not possible. Could we replace all the commas acting as
separators with another symbol, a semicolon for example? In theory yes, we could, but in
practice it might be much more complicated than a simple replacement, because there could be
other cases like Congo, Democratic Republic of the, and for all of them the comma in the name
should not be replaced. It is not that easy to make sure not to introduce further errors.
Looking at Figure 1.3, we see the standard solution for this case – double quotes have been used to
enclose the textual content with the same symbol used as a separator (comma, in this case). This tells
the function reading the CSV to consider the whole text within double quotes as a single element
value and ignore the presence of symbols used as separators. Single quotes work fine too unless they
are used as apostrophes in the text.
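A minimal sketch of this behavior, on an invented two-row file: the comma enclosed in double quotes is kept as part of the value, not treated as a separator.
import io
import pandas as pd

text = 'country,code\n"Congo, Democratic Republic of the",COD\nDenmark,DNK\n'
df = pd.read_csv(io.StringIO(text))
print(df["country"].tolist())
# ['Congo, Democratic Republic of the', 'Denmark']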
This solution solves most cases, but not all. What if the text contains all of them, double quotes,
single quotes, and commas? For example, a sentence like this: First came John "The Hunter" Western;
then his friend Bill li’l Rat Thompson, followed by the dog Sausage. How could we possibly put this
in a CSV in a way that it is recognized as a unique value? There is a comma and a semicolon; single
or double quotes do not help in this case because they are part of the sentence. We might replace all
commas as separators with tabs or other symbols, but as already said, it could be risky and difficult.
There is a universal solution called escaping, which makes use of the escape symbol, typically
the backslash (\). The escape symbol is interpreted as having a special meaning, which
is to consider the following character just literally, stripped of any syntactical meaning, such as
that of a separator symbol. Thus, our sentence could be put into a CSV and considered
as a single value again by using double quotes, but being careful to escape the double quotes inside
the sentence: “First came John \“The Hunter\” Western; then his friend Bill li’l Rat Thompson,
followed by the dog Sausage.” This way, the CSV syntax is unambiguous.
Finally, what if the textual value contains a backslash? For example: The sentient AI wrote “Hi
Donna, be ready, it will be a bumpy ride,” and then executed \start.
We know we have to escape the double quotes, but what about the backslash, which will be inter-
preted as an escape symbol? Escape the escape symbol, so it will be interpreted literally: “The sentient
AI wrote \“Hi Donna, be ready, it will be a bumpy ride,\” and then executed \\start.” Using single
quotes, we do not need to escape the double quotes in this case: ‘The sentient AI wrote “Hi Donna,
be ready, it will be a bumpy ride,” and then executed \\start.’
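A sketch of backslash escaping in practice, using Python's standard csv module on a variant of the invented sentence above; escapechar tells the reader to take the following character literally, and doublequote=False disables the alternative convention of doubling quotes.
import csv
import io

line = '"First came John \\"The Hunter\\" Western; then his friend Bill, followed by the dog Sausage."\n'
reader = csv.reader(io.StringIO(line), escapechar="\\", doublequote=False)
print(next(reader))
# ['First came John "The Hunter" Western; then his friend Bill, followed by the dog Sausage.']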
Questions
1.1 (R/Python)
CSV is ...
A A proprietary data format, human-readable
B An open data format, human-readable
C A proprietary format, not human-readable
D An open data format, not human-readable
(R: B)
1.2 (R/Python)
A CSV dataset has ...
A A tabular data organization
B A hierarchical data organization
C Metadata for general information
D An index
(R: A)
1.3 (R/Python)
A valid CSV dataset has ...
A No value with spaces
B Possibly different separators
C No missing value
D Equal number of elements for each row
(R: D)
1.4 (R/Python)
Which are considered legitimate separators in a CSV dataset?
1.5 (R/Python)
What is the usage case for quotes and double quotes in a CSV dataset?
A There is no usage case for them
B As separators
C As commonly used symbols in strings
D To embrace an element value containing the same symbol used as separator
(R: D)
1.6 (R/Python)
What is the escape character?
A It is quotes or double quotes
B It is the separator symbol
C It is the symbol used to specify that the following character/symbol should be considered
at face value, not interpreted as having a special meaning
D It is the symbol used to specify that the following word should be considered at face value,
not interpreted as having a special meaning
(R: C)
2 Simple Exploratory Data Analysis
Having read a dataset, the first activity usually performed afterward is to figure out the main charac-
teristics of those data and make sense of them. This means understanding the organization of the
data, their types, and some initial information on their values. For data of numerical type, simple
statistical information can be obtained; these are usually called descriptive statistics and often
include basic information like the arithmetic mean, the median, maximum and minimum values,
and quartiles. Clearly, other, more detailed statistical information could be easily obtained from a
series of numerical values.
This activity is often called simple exploratory data analysis, where the adjective “simple” dis-
tinguishes this basic and quick analysis, performed to grasp the main features of a dataset, from
thorough exploratory data analyses performed with more sophisticated statistical tools
and methods.
However, the few requirements of this initial approach to a new dataset should not be erroneously
considered unimportant. On the contrary, basic descriptive statistics offered by common tools may
reveal important features of a dataset that could help decide how to proceed, show the presence of
anomalous values, or indicate specific data wrangling operations to execute. It is important to ded-
icate attention to the information provided by a simple exploratory data analysis. Real datasets are
typically too large to be visually inspected; therefore, in order to start collecting some information
about the data, tools are needed, and descriptive statistics are the first among them.
R and Python offer common functionalities to obtain descriptive statistics together with other
utility functions, which allow getting information on a dataset, from its size to the unique values
of a column/variable, names, indexes, and so forth. These are simple but essential features and
familiarity with them should be acquired.
Before presenting the first list of these functions, all of which will be used in the numerous
examples that follow throughout the book, another very relevant issue is introduced: the
almost inevitable presence of missing values.
Missing values are literally what their name means – some elements of a dataset may have no value.
This must be understood literally, not metaphorically – a missing value is the absence of value, not
a value with an undefined meaning like a space, a tab, some symbols like ?, ---, or *, or something
like the string “Unknown,” “Undefined,” and the like. All these cases are not missing values; they
are actual values with a possibly undefined meaning.
Missing values are very common in real datasets to the point that it is a much safer assumption to
expect their presence than the opposite. The absence of values may happen for a number of reasons,
from a sensor that failed to take a measurement to an observer that failed to collect a certain data
point. Errors of any sort may result in missing values, or they might be just the expected result of a
certain data collection process; for example, a dataset reporting road incidents may have values only
if there has been at least one incident on a certain road in the reference period of time; otherwise, the
element remains empty. Many other reasons for the presence of missing values could be envisioned.
The important point when dealing with real datasets is never to exclude the possibility that missing values
are present; doing so could lead to severe errors if missing values are left unaccounted for. For this reason, the presence of
missing values must always be carefully verified, and appropriate actions for dealing with them
decided on a case-by-case basis.
We will dedicate a specific section in the following chapter to the tools and methods for analyzing
and managing missing values in R and Python. Here, it is important to grasp the fact that they
need to be analyzed, and that their presence forces us to decide what to do with them. There is no
general recipe to apply; it has to be decided on a case-by-case basis.
Once we have ascertained that missing values are present in a dataset, we should gather more
information about them: where are they (in which columns/variables and rows/observations), and
how many are they? Special functions will assist us in answering these questions, but then it will
be our turn to evaluate how to proceed. Three general alternatives typically lie in front of us:
1. Write an actual value for elements with missing values.
2. Delete rows/observations or columns/variables with missing values.
3. Do not modify the data and handle missing values explicitly in each operation.
For the first one, the obvious problem is to decide what value should be written in place of the
missing values. In the example of the road incidents dataset, if we know for sure that when a road
reports zero incidents, the corresponding element has no value, we may reasonably decide to write
zero in place of missing values. The criterion is correct, given that we know for sure how
values are inserted. Is there any possible negative consequence? Is it possible that we are arbitrarily
and mistakenly modifying some data? Maybe. What if a missing value was instead present because
there was an error in reporting or entering the data? We are setting zero for incidents on a certain
road when the true value would have been a certain number, possibly a relevant one. Is the fact of
having replaced the missing value with a number a better or worse situation than having kept the
original missing value? This is something we should think about and decide.
The second alternative is similar. If we omit rows or columns with all missing values in their
elements, then we are likely simplifying the data without any particular alteration. But what if, as
it is much more common, a certain row or column has only some elements with missing values and
others with valid values? When we omit that row or column, we are omitting the valid values too,
so we are definitely altering the data. Is that alteration relevant to our analysis? Did we clearly
understand the consequences? Again, these are questions we should think about and decide on
a case-by-case basis because it would depend on the specific context and data and the particular
analysis we are carrying out.
So, is the third alternative always the best one? Again, it depends. With that option, we have the
burden of dealing with missing values in every operation we perform on data. Functions will get
a little more complicated, the logic could also become somewhat more involved, chances of making
mistakes increase with the added complication, time increases, the level of attention should be
higher, and so forth. It is a trade-off, meaning there is no general recipe; we have to think about it
and decide. But yes, the third alternative is the safest in general; the data are not modified, which
is always a golden rule, but even being the safest, it does not guarantee that errors are avoided.
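Anticipating tools that will be presented in the following chapters, here is a minimal R sketch of the three alternatives, on an invented road-incidents table (tidyverse loaded):
library(tidyverse)

# An invented table with one missing value (NA).
incidents <- tibble(road = c("A1", "A4", "SS36"),
                    n_incidents = c(3, NA, 1))

# 1. Write an actual value in place of missing values (here zero, assuming we know that is its meaning).
filled <- replace_na(incidents, list(n_incidents = 0))

# 2. Delete rows/observations with missing values (any valid values on the same row are lost too).
dropped <- drop_na(incidents, n_incidents)

# 3. Keep the data unmodified and handle missing values explicitly in each operation.
mean_incidents <- mean(incidents$n_incidents, na.rm = TRUE)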
2.2 R: Descriptive Statistics and Utility Functions
Table 2.1 lists some of the main R utility functions to gather descriptive statistics and other general
information on data frames.
It is useful to familiarize yourself with these functions and, without anticipating how to read
datasets, which is the subject of the next chapter, predefined data frames made available by base
R and several packages are helpful for exercising. For example, package datasets is installed with
the base R configuration and contains small didactic datasets, most of them having been around for quite
a long time. It is a somewhat vintage-style experience to use those data, to be honest.
Additional packages, for example those installed with tidyverse, often contain didactic datasets
that are usually definitely more recent than those from package datasets. For example, readers
fond of classic sci-fi might appreciate dataset starwars, included in package dplyr, with
data about the Star Wars saga's characters. It is a nice dataset for exercising. For other options,
there exists the command data(), to be executed on the RStudio console. It produces the list of predefined
datasets contained in all loaded packages (pay attention to this: it is not sufficient to have the pack-
age installed, it has to be loaded with library()).
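For example, a first look at starwars could go like this (assuming tidyverse, and therefore dplyr, has been loaded):
data()             # list the predefined datasets of all loaded packages
?starwars          # help page describing the starwars dataset from dplyr
head(starwars)     # first rows of the dataset, with the column names
summary(starwars)  # simple descriptive statistics, as described in Table 2.1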
Table 2.1 Main R utility functions for descriptive statistics and general information on data frames.
summary(): It is the main function for collecting simple descriptive statistics on data frame columns. For each column, it returns the data type (numerical, character, and logical). For numerical columns, it adds maximum and minimum values, mean and median, 1st and 3rd quartiles, and, if present, the number of missing values.
str(), glimpse(): They are equivalent in practice, with str() defined in package utils of the R base configuration and glimpse() defined in package dplyr, included in tidyverse. They provide a synthetic representation of the information on a data frame, like its size, column names, types, and the values of the first elements.
head(), tail(): They are among the most basic R functions and allow visualizing the few topmost (head) or bottommost (tail) rows of a command output. For example, we will often use head() to watch the first rows of a data frame and its header with column names. It is possible to specify the number of rows to show (e.g. head(10)); otherwise, the default applies (i.e. six rows).
View(), view(): Basically the same and, when package tibble, included in tidyverse, is loaded, one is an alias of the other. They visualize a data frame and other structured data like lists by launching the native RStudio viewer, which offers a graphical spreadsheet-style representation and a few features. It is useful for small data frames, but it quickly becomes unusable when the size increases.
unique(): It returns the list of unique values in a series. Particularly useful when applied to columns as unique(df$col_name).
names(): It returns the column names of a data frame with names(df), and variable names with lists. It is particularly useful.
class(): It returns the data type of an R object, like numeric, character, logical, and data frame.
length(): It returns the length of an R object (careful, this is not the number of characters), like the number of elements in a vector or the number of columns of a data frame.
nrow(), ncol(): They return, respectively, the number of rows and columns in a data frame.
To read the data of a preinstalled dataset, it suffices to write its name in the RStudio console and
press Enter, or to use View(dataset_name).
Here, we see an example, with dataset msleep included in package ggplot2, part of tidyverse. It
contains data regarding sleep times and weights for some mammal species. More information could
be obtained by accessing help online by executing ?msleep on the RStudio console.
Below is the textual visualization on the command console.
library(tidyverse)
msleep
# A tibble: 83 × 11
name genus vore order conse…1 sleep…2 sleep…3 sleep…4 awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl monkey Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mountain be… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greater sho… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domest… 4 0.7 0.667 20
6 Three-toed … Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 Northern fu… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vesper mouse Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domest… 10.1 2.9 0.333 13.9
10 Roe deer Capr… herbi Arti… lc 3 NA NA 21
# … with 73 more rows, 1 more variable: bodywt <dbl>, and abbreviated variable
# names 1 conservation, 2 sleep_total, 3 sleep_rem, 4 sleep_cycle
From the two visualizations, a detail should be noted: some values are indicated as NA, which
stands for Not Available. It is the visual notation R uses to indicate a missing value, and it may
appear with some variations, like <NA>.
The meaning is that there is no value corresponding to that element. It is a user-friendly notation
that makes it more evident where missing values are. It does not mean that the element contains
a value made of the two letters N and A. Not at all: it is a missing value, there is nothing there,
a void.
Then, as is often the case, there are exceptions, but we are not anticipating them. The important
thing is that the notation NA is just a visual help to see where missing values are.
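A tiny illustration (the values are arbitrary) makes the point that the missing element holds nothing, not the letters N and A:

x <- c(1, NA, 3)
is.na(x)     # TRUE only for the missing element
x == "NA"    # FALSE or NA, never TRUE: no element contains the text "NA"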
Let us see what function summary() returns.
summary(msleep)
...
The result shows the columns of data frame msleep, and, for each one, some information.
Columns of type character show very little information; numerical columns have the descriptive
statistics that we have mentioned before and, where present, the number of missing values (e.g.
column sleep_cycle, NA’s: 51).
With function str(), we obtain a general overview. Same with function glimpse().
str(msleep)
2.3 Python: Descriptive Statistics and Utility Functions

The main pandas methods and attributes are summarized below.

.describe()
It is the main method for obtaining descriptive statistics. For each numerical column, it shows the number of values, maximum and minimum values, arithmetic mean and median, and the quartiles.

.info()
It provides particularly useful information like the size of the data frame and, for each column, its name, the type (object for characters, int64 for integers, float64 for floating point numbers, and bool for logical values), and the number of non-null values (meaning that the column length minus the number of non-null values gives the number of missing values in the column).

.head(), .tail()
They visualize the topmost (head) or bottommost (tail) rows of a data frame. It is possible to specify the number of rows to show (e.g. df.head(10)); otherwise the default applies (i.e. five rows).

.unique()
It returns the unique values of a series. Particularly useful when applied to columns, as df['col_name'].unique().

.columns, .index
They return, respectively, the list of column names and the labels of the row index. Column names are formally the names of the column index, and both the row index and the column index may have multiple levels of names (multi-indexes).

.dtypes
It returns the list of columns with the corresponding data type. The same information is included in that returned by .info().

.size
It returns the total number of elements of an object, such as an array or a data frame (for a data frame, rows times columns). Missing values, if present, are included in the count.

.shape
It returns the number of rows and columns of a data frame as a tuple. To retrieve a single dimension, it can be referenced as .shape[0] for the number of rows and .shape[1] for the number of columns.
In the standard configuration of Python and of its typical data science libraries, there are no
predefined datasets to use for exercising.
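Still, a small hand-made data frame is enough to try the methods just listed. The following sketch uses made-up column names and values, loosely echoing the msleep example:

import pandas as pd
import numpy as np

# A tiny, made-up data frame with one missing value
df = pd.DataFrame({
    "name": ["Cheetah", "Owl monkey", "Cow"],
    "sleep_total": [12.1, 17.0, 4.0],
    "sleep_rem": [np.nan, 1.8, 0.7],
})

df.info()        # column names, types, and non-null counts
df.describe()    # descriptive statistics of the numerical columns
df.head(2)       # first two rows
df.shape         # (number of rows, number of columns)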
Pandas versions prior to 2.0.0 make it possible to create small test data frames by means
of functions pd.util.testing.makeMixedDataFrame() and
pd.util.testing.makeMissingDataframe(). The first one produces a small and very simple
data frame, and the second produces a slightly larger data frame with random values that also contains missing values.
To try the utility functions before reading actual datasets, we can run the two generating func-
tions, save the results, and apply the utility functions to them.
test1= pd.util.testing.makeMixedDataFrame()
test1
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
test2= pd.util.testing.makeMissingDataframe()
test2
A B C D
tW1QQvy0vf 0.451947 0.595209 0.233377 NaN
UCIUoAMHgo -1.627037 -1.116419 -0.393027 0.188878
SCc6D4RLxc 0.077580 -0.884746 0.688926 1.475203
0gTyFDzQli -0.125091 -0.533044 0.847568 -0.110436
InfV0yg8IH 0.575489 NaN -0.070264 -0.928023
S0o4brfQXb -0.965100 -1.368942 -0.358428 0.487762
CDmeMkic4o -0.348701 -0.427534 1.636490 -1.444168
OCi7RQZXaB 1.271422 1.216927 -0.232399 -0.985385
XEQvFbfp0X 0.207598 NaN -0.417492 -0.087897
UBt6uuJrsi -0.571392 -2.824272 0.200751 -0.778646
XPQTn1MN1N 0.725473 0.554177 1.520446 0.599409
saxiRCPV8f -0.351244 1.338322 -0.514414 -0.333148
The two results, data frames test1 and test2, can be inspected with the functions we have pre-
sented; particular attention should be paid to the number of columns of test2 (four, not five, as
one might mistakenly believe at first sight) and to the index, which has no name but does have values
(e.g. tW1QQvy0vf, UCIUoAMHgo, and SCc6D4RLxc). The unique() method is not very useful on
test2 because its values, being random, are likely all different; it is better to use it only on test1.
It should be observed that the notation used by Python to visually represent the missing values in
test2 is NaN, which stands for Not a Number. This may induce one to think that there should be
different representations for the different data types. Luckily, this is not the case, so we will still see
NaN even when missing values are in an object (character) or logical (Boolean) column. We may see
NaT (Not a Time) for missing values in datetime columns, but NaN and NaT are fully compatible, so
we should not worry too much. Python also provides a more general keyword for missing values, None,
which does not carry the heritage of NaN as a numeric data type. Strictly speaking, NaN should be
used for numerical data types and None for nonnumerical ones; in practice, however, they have
become essentially equivalent notations, and especially when using pandas, NaN is the default for
all missing values. In short, NaN is fine, and it could be None too.
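A quick sketch shows this behavior: whether a missing entry is written as None or as np.nan, pandas displays it as NaN and .isna() detects both.

import pandas as pd
import numpy as np

s = pd.Series([1.0, None, np.nan])
s            # both missing entries are displayed as NaN
s.isna()     # True for both missing positions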
In the newer pandas versions 2.0.0+, the two testing functions have been deprecated and are
no longer available in the official testing module. It is still possible to use them by accessing the
internal _testing module, for example test = pd._testing.makeMixedDataFrame().
This could easily change in future versions of pandas; therefore, it might be better to be patient
a little longer and test the utility functions on real datasets when, in the next chapter, we learn
how to read them.
Questions
2.1 (R/Python)
A simple exploratory data analysis is ...
A Needed when a thorough statistical analysis is required
B Sometimes useful
C Always useful to gather general information about the data
D Always useful and specific to the expected result of the project
(R: C)
2.2 (R/Python)
Descriptive statistics ...
A Require good statistical knowledge
B Are a synonym for full statistical analysis
C Are performed with specific statistical tools
D Require just basic statistical knowledge
(R: D)
2.3 (R/Python)
Missing values analysis is ...
A Needed when a thorough statistical analysis is required
B Sometimes useful
C Always useful to gather general information about the data
D Always useful and specific to the expected result of the project
(R: C)
2.4 (R/Python)
Missing values should be managed...
A By replacing them with actual values
B By replacing them with actual values, deleting the corresponding observations, or deciding on a
case-by-case basis
C By deleting corresponding observations
D Do not care, they are irrelevant
(R: B)
2.5 (R/Python)
When handling missing values, what is the most important aspect to consider?
A Arbitrarily modifying data (either by replacing missing values or deleting corresponding
observations) is a critical operation to perform, requiring extreme care for the possible
consequences
B Being sure to replace them with true values
C Being sure not to delete observations without missing values
D There are no important aspects to consider
(R: A)
2.6 (R/Python)
What is the typical usage case for head and tail functions/methods?
A To extract the first few or the last few rows of a data frame
B To sort values in ascending or descending order
C To check the first few or the last few rows of a dataset
D To visually inspect the first few or the last few rows of a data frame
(R: D)
2.7 (R/Python)
What is the typical usage case for the unique function/method?
A To select unique values from a data frame
B To sort unique values in ascending or descending order
2.8 (R/Python)
The notations NA (R) or NaN (Python) for missing values mean that ...
A A missing value is represented by the string NA (R) or NaN (Python)
B They are formal notations, but an element with a missing value has no value at all
C They are functions for handling missing values
D They are special kinds of missing values
(R: B)