Data Science Fundamentals with R, Python, and Open Data

Marco Cremonini
University of Milan
Italy
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.


Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under
Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of
the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department,
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affiliates in the United States and other countries and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product
or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing
this book, they make no representations or warranties with respect to the accuracy or completeness of the contents
of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional where appropriate.
Further, readers should be aware that websites listed in this work may have changed or disappeared between when
this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or
any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer
Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or
fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data Applied for:

Hardback ISBN: 9781394213245

Cover Design: Wiley


Cover Image: © Andriy Onufriyenko/Getty Images

Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India



Contents

Preface xiii
About the Companion Website xvii
Introduction xix

1 Open-Source Tools for Data Science 1


1.1 R Language and RStudio 1
1.1.1 R Language 2
1.1.2 RStudio Desktop 2
1.1.3 Package Manager 2
1.1.4 Package Tidyverse 4
1.2 Python Language and Tools 5
1.2.1 Option A: Anaconda Distribution 6
1.2.2 Option B: Manual Installation 6
1.2.3 Google Colab 7
1.2.4 Packages NumPy and Pandas 7
1.3 Advanced Plain Text Editor 8
1.4 CSV Format for Datasets 8
Questions 10

2 Simple Exploratory Data Analysis 13


2.1 Missing Values Analysis 13
2.2 R: Descriptive Statistics and Utility Functions 15
2.3 Python: Descriptive Statistics and Utility Functions 17
Questions 19

3 Data Organization and First Data Frame Operations 23


3.1 R: Read CSV Datasets and Column Selection 24
3.1.1 Reading a CSV Dataset 26
3.1.1.1 Reading Errors 27
3.1.2 Selection by Column Name 29
3.1.3 Selection by Column Index Position 30
3.1.4 Selection by Range 31
3.1.5 Selection by Exclusion 32
3.1.6 Selection with Selection Helper 35
3.2 R: Rename and Relocate Columns 36

3.3 R: Slicing, Column Creation, and Deletion 38


3.3.1 Subsetting and Slicing 39
3.3.2 Column Creation 42
3.3.3 Column Deletion 43
3.3.4 Calculated Columns 44
3.3.5 Function mutate() and Data Masking 44
3.4 R: Separate and Unite Columns 45
3.4.1 Separation 46
3.4.2 Union 48
3.5 R: Sorting Data Frames 49
3.5.1 Sorting by Multiple Columns 50
3.5.2 Sorting by an External List 51
3.6 R: Pipe 55
3.6.1 Forward Pipe 55
3.6.2 Pipe in Base R 57
3.6.2.1 Variant 57
3.6.3 Parameter Placeholder 58
3.7 Python: Column Selection 59
3.7.1 Selecting Columns from Dataset Read 61
3.7.2 Selecting Columns from a Data Frame 62
3.7.3 Selection by Positional Index, Range, or with Selection Helper 63
3.7.4 Selection by Exclusion 64
3.8 Python: Rename and Relocate Columns 67
3.8.1 Standard Method 67
3.8.2 Functions rename() and reindex() 67
3.9 Python: NumPy Slicing, Selection with Index, Column Creation and Deletion 69
3.9.1 NumPy Array Slicing 69
3.9.2 Slicing of Pandas Data Frames 70
3.9.3 Methods .loc and .iloc 73
3.9.4 Selection with Selection Helper 77
3.9.5 Creating and Deleting Columns 79
3.9.6 Functions insert() and assign() 80
3.10 Python: Separate and Unite Columns 81
3.10.1 Separate 81
3.10.2 Unite 84
3.11 Python: Sorting Data Frame 85
3.11.1 Sorting Columns 85
3.11.2 Sorting Index Levels 86
3.11.3 From Indexed to Non-indexed Data Frame 88
3.11.4 Sorting by an External List 89
Questions 91

4 Subsetting with Logical Conditions 99


4.1 Logical Operators 99
4.2 R: Row Selection 101
4.2.1 Operator %in% 104
4.2.2 Boolean Mask 105

4.2.3 Examples 106


4.2.3.1 Wrong Disjoint Condition 107
4.2.4 Python: Row Selection 114
4.2.5 Boolean Mask, Base Selection Method 115
4.2.6 Row Selection with query() 119
Questions 121

5 Operations on Dates, Strings, and Missing Values 127


5.1 R: Operations on Dates and Strings 129
5.1.1 Date and Time 129
5.1.1.1 Datetime Data Type 129
5.1.2 Parsing Dates 130
5.1.3 Using Dates 132
5.1.4 Selection with Logical Conditions on Dates 133
5.1.5 Strings 136
5.2 R: Handling Missing Values and Data Type Transformations 141
5.2.1 Missing Values as Replacement 142
5.2.1.1 Keywords for Missing Values 142
5.2.2 Introducing Missing Values in Dataset Reads 143
5.2.3 Verifying the Presence of Missing Values 144
5.2.3.1 Functions any(), all(), and colSums() 146
5.2.4 Replacing Missing Values 147
5.2.5 Omit Rows with Missing Values 149
5.2.6 Data Type Transformations 150
5.3 R: Example with Dates, Strings, and Missing Values 154
5.3.1 When an Invisible Hand Messes with Your Data 158
5.3.2 Base Method 159
5.3.3 A Better Heuristic 162
5.3.4 Specialized Functions 162
5.3.4.1 Function parse_date_time() 162
5.3.5 Result Comparison 165
5.4 Python: Operations on Dates and Strings 165
5.4.1 Date and Time 165
5.4.1.1 Function pd.to_datetime() 165
5.4.1.2 Function datetime.datetime.strptime() 167
5.4.1.3 Locale Configuration 168
5.4.1.4 Function datetime.datetime.strftime() 169
5.4.1.5 Pandas Timestamp Functions 169
5.4.2 Selection with Logical Conditions on Dates 171
5.4.3 Strings 172
5.5 Python: Handling Missing Values and Data Type Transformations 173
5.5.1 Missing Values as Replacement 173
5.5.1.1 Function pd.replace() 175
5.5.2 Introducing Missing Values in Dataset Reads 175
5.5.3 Verifying the Presence of Missing Values 176
5.5.4 Selection with Missing Values 178
5.5.5 Replacing Missing Values with Actual Values 179

5.5.6 Modifying Values by View or by Copy 180


5.5.7 Data Type Transformations 182
5.6 Python: Examples with Dates, Strings, and Missing Values 182
5.6.1 Example 1: Eurostat 182
5.6.2 Example 2: Open Data Berlin 186
Questions 190

6 Pivoting and Wide-long Transformations 195


6.1 R: Pivoting 197
6.1.1 From Long to Wide 197
6.1.2 From Wide to Long 199
6.1.3 GOV.UK: Gender Pay Gap 200
6.2 Python: Pivoting 202
6.2.1 From Wide to Long with Columns 203
6.2.2 From Long to Wide with Columns 204
6.2.3 Wide-long Transformation with Index Levels 206
6.2.4 Indexed Data Frame 207
6.2.4.1 Function unstack() 208
6.2.4.2 Function stack() 211
6.2.5 From Long to Wide with Elements of Numeric Type 213
Questions 216

7 Groups and Operations on Groups 221


7.1 R: Groups 222
7.1.1 Groups and Group Indexes 224
7.1.1.1 Function group_by() 224
7.1.1.2 Index Details 226
7.1.2 Aggregation Operations 227
7.1.2.1 Functions group_by() and summarize() 227
7.1.2.2 Counting Rows: function n() 228
7.1.2.3 Arithmetic Mean: function mean() 228
7.1.2.4 Maximum and Minimum Values: Functions max() and min() 230
7.1.2.5 Summing Values: function sum() 231
7.1.2.6 List of Aggregation Functions 232
7.1.3 Sorting Within Groups 232
7.1.4 Creation of Columns in Grouped Data Frames 234
7.1.5 Slicing Rows on Groups 236
7.1.5.1 Functions slice_*() 236
7.1.5.2 Combination of Functions filter() and rank() 238
7.1.6 Calculated Columns with Group Values 242
7.2 Python: Groups 244
7.2.1 Group Index and Aggregation Operations 247
7.2.1.1 Functions groupby() and aggregate() 247
7.2.1.2 Counting Rows, Computing Arithmetic Means, and Sum for Each Group 247
7.2.2 Names on Columns with Aggregated Values 251
7.2.3 Sorting Columns 252
7.2.4 Sorting on Index Levels 254

7.2.5 Slicing Rows on Groups 255


7.2.5.1 Functions nlargest() and nsmallest() 259
7.2.6 Calculated Columns with Group Values 259
7.2.7 Sorting Within Groups 261
Questions 265

8 Conditions and Iterations 271


8.1 R: Conditions and Iterations 272
8.1.1 Conditions 272
8.1.1.1 Function if_else() 275
8.1.1.2 Function case_when() 276
8.1.1.3 Function if() and Constructs If-else and If-else If-else 277
8.1.2 Iterations 278
8.1.2.1 Function for() 278
8.1.2.2 Function foreach() 280
8.1.3 Nested Iterations 280
8.1.3.1 Replacing a Single-Element Value 282
8.1.3.2 Iterate on the First Column 283
8.1.3.3 Iterate on all Columns 283
8.2 Python: Conditions and Iterations 284
8.2.1 Conditions 284
8.2.1.1 Function if() 285
8.2.1.2 Constructs If-else and If-elif-else 285
8.2.1.3 Function np.where() 286
8.2.1.4 Function np.select() 287
8.2.1.5 Functions pd.where() and pd.mask() 289
8.2.2 Iterations 291
8.2.2.1 Functions for() and while() 291
8.2.3 Nested Iterations 294
8.2.3.1 Execution Time 296
8.2.4 Iterating on Multi-index 297
8.2.4.1 Function join() 300
8.2.4.2 Function items() 301
Questions 302

9 Functions and Multicolumn Operations 307


9.1 R: User-defined Functions 308
9.1.1 Using Functions 309
9.1.2 Data Masking 312
9.1.3 Anonymous Functions 315
9.2 R: Multicolumn Operations 316
9.2.1 Base Method 316
9.2.1.1 Functions apply(), lapply(), and sapply() 316
9.2.1.2 Mapping 319
9.2.2 Mapping and Anonymous Functions: purrr-style Syntax 321
9.2.3 Conditional Mapping 321
9.2.4 Subsetting Rows with Multicolumn Logical Condition 323

9.2.4.1 Combination of Functions filter() and if_any() 323


9.2.5 Multicolumn Transformations 324
9.2.5.1 Combination of Functions mutate() and across() 324
9.2.6 Introducing Missing Values 325
9.2.7 Use Cases and Execution Time Measurement 326
9.2.7.1 Case 1 327
9.2.7.2 Case 2 328
9.3 Python: User-defined and Lambda Functions 330
9.3.1 User-defined Functions 330
9.3.1.1 Lambda Functions 333
9.3.2 Python: Multicolumn Operations 334
9.3.2.1 Execution Time 336
9.3.3 General Case 337
9.3.3.1 Function apply() 337
Questions 342

10 Join Data Frames 347


10.1 Basic Concepts 348
10.1.1 Keys of a Join Operation 349
10.1.2 Types of Join 350
10.1.3 R: Join Operation 351
10.1.4 Join Functions 354
10.1.4.1 Function inner_join() 354
10.1.4.2 Function full_join() 356
10.1.4.3 Functions left_join() and right_join() 357
10.1.4.4 Function merge() 357
10.1.5 Duplicated Keys 358
10.1.6 Special Join Functions 363
10.1.6.1 Semi Join 363
10.1.6.2 Anti Join 365
10.2 Python: Join Operations 369
10.2.1.1 Function merge() 371
10.2.1.2 Inner Join 372
10.2.1.3 Outer/Full Join 374
10.2.2 Join Operations with Indexed Data Frames 375
10.2.3 Duplicated Keys 378
10.2.4 Special Join Types 384
10.2.4.1 Semi Join: Function isin() 384
10.2.4.2 Anti Join: Variants 386
Questions 389

11 List/Dictionary Data Format 393


11.1 R: List Data Format 395
11.1.1 Transformation of List Columns to Ordinary Rows and Columns 401
11.1.1.1 Other Options 403
11.1.2 Function map in List Column Transformations 406
11.2 R: JSON Data Format and Use Cases 410

11.2.1 Memory Problem when Reading Very Large Datasets 421


11.3 Python: Dictionary Data Format 422
11.3.1 Methods 424
11.3.2 From Dictionary to Data Frame With a Single Level of Nesting 427
11.3.2.1 Functions pd.DataFrame() and pd.DataFrame.from_dict() 427
11.3.3 From Dictionary to Data Frame with Several Levels of Nesting 429
11.3.3.1 Function pd.json_normalize() and Join Operation 429
11.3.4 Python: Use Cases with JSON Datasets 436
Questions 443

Index 447

Preface

Two questions come along with every new text that aims to teach someone something. The first is,
Who is it addressed to? and the second is, Why does it have precisely those contents, organized in
that way? These two questions, for this text, have perhaps even greater relevance than they usually
do, because for both, the answer is unconventional (or at least not entirely conventional) and to
some, it may seem surprising. It shouldn't be; or better yet, the answers should make it a pleasant surprise.
Let’s start with the first question: Who is the target of a text that introduces the fundamentals
of two programming languages, R and Python, for the discipline called data science? Those who
study to become data scientists, computer scientists, or computer engineers, it seems obvious, right?
Instead, it is not so. For sure, future data scientists, computer scientists, and computer engineers
could find this text useful. However, the real recipients should be others, simply all the others, the
non-specialists, those who do not work or study to make IT or data science their main profession.
Those who study to become or already are sociologists, political scientists, economists, psychol-
ogists, marketing or human resource management experts, and those aiming to have a career in
business management and in managing global supply chains and distribution networks. Also, those
studying to be biologists, chemists, geologists, climatologists, or even physicians. Then there are
law students, human rights activists, experts of traditional and social media, memes and social net-
works, linguists, archaeologists, and paleontologists (I’m not joking, there really are fabulous data
science works applied to linguistics, archeology, and paleontology). Certainly, in this roundup, I
have forgotten many who deserved to be mentioned like the others. Don’t feel left out. The artists
I forgot! There are contaminations between art, data science, and data visualization of incredible
interest. Art absorbs and re-elaborates, and in a certain way, this is also what data science does: it
absorbs and re-elaborates. Finally, there are also all those who just don’t know yet what they want
to be; they will figure it out along the way, and having certain tools can come in handy in many
cases.
Everyone can successfully learn the fundamentals of data science and the use of these computa-
tional tools, even with a few basic computer skills, with some efforts and time, of course, necessary
but reasonable. Everyone could find opportunities for application in all, or almost all, existing pro-
fessions, sciences, humanities, and cultural fields. And above all, without the need to take on the
role of computer scientist or data scientist when you already have other roles to take on, which
rightly demand time and dedication.
Therefore, the fact of not considering computer scientists and data scientists as the principal
recipients of this book is not to diminish their role for non-existent reasons, but because for them
there is no need to explain why a book that presents programming languages for data science has,
at least in theory, something to do with what they typically do.

It is to the much wider audience of non-specialists that the exhortation to learn the fundamentals
of data science should be addressed, explaining that they do not have to transform themselves
into computer scientists to be able to do so (or even worse, into geeks), which, for excellent
reasons that are difficult to dispute, they have no intention of doing. It doesn't matter if they have always been
convinced to be “unfit for computer stuff,” and that, frankly, the rhetoric of past twenty years about
“digital natives,” “being a coder,” or “joining the digital revolution” sounds just annoying. None of
this should matter, time to move on. How? Everyone should look at what digital skills and tech-
nologies would be useful for their own discipline and do the training for those goals. Do you want
to be a computer scientist or a data scientist? Well, do it; there is no shortage of possibilities. Do you
want to be an economist, a biologist, or a marketing expert? Very well, do it, but you must not be cut
off from adequate training on digital methodologies and tools from which you will benefit, as much
as you are not cut off from a legal, statistical, historical, or sociological training if this knowledge
is part of the skills needed for your profession or education. What is the objection that is usually
made? No one can know everything, and generalists end up knowing a little of everything and
nothing adequately. It’s as true as clichés are, but that’s not what we’re talking about. A doctor who
acquires statistical or legal training is no less a doctor for this; on the contrary, in many cases she/he
is able to carry out the medical profession in a better way. No one reproaches an economist who
becomes an expert in statistical analysis that she/he should have taken a degree in statistics. And
soon (indeed already now), to the same economist who will become an expert in machine learning
techniques for classification problems for fintech projects, no one, hopefully, will reproach that as
an economist she/he should leave those skills to computer scientists. Like it or not, computer skills
are spreading and will do so more and more among non-computer scientists, it’s a matter of base
rate, notoriously easy to be misinterpreted, as all students who have taken an introductory course
in statistics know.
Let's consider the second question: Why does this text present two languages instead of just one,
as is usually done? Isn't it enough to learn just one? Which is better? A friend of mine told me he's
heard that Python is famous, the other one he has never heard of. Come on, seriously two? It’s a
miracle if I learn half of just one! Stop. That’s enough.
It's not a competition or a beauty contest between programming languages, and not even a ques-
tion of cheering, as with sports teams, where you have to choose one (rooting for none is admissible,
but you can't root for two). R and Python are tools, in some ways complex, not necessarily complicated,
professional, but also within anyone’s reach. Above all, they are the result of the continuous work
of many people; they are evolving objects and are extraordinary teaching aids for those who want
to learn. Speaking of evolution, a recent and interesting one is the increasingly frequent conver-
gence between the two languages presented in this text. Convergence means the possibility of
coordinated, alternating, and complementary use: combine the benefits of both, exploit what
is innovative in one and what the other has; above all, there is real didactic value in learning not to
be afraid of changing technology, because much of what you learned with one will be found again and will
be useful with the other. There is another reason, this one more specific. It is true that Python is
so famous that almost everyone has heard its name while only relatively few know R, except that
practically everyone involved in data science knows it and most of them use it, and that's for a
pretty simple reason: It’s a great tool with a large community of people who have been contribut-
ing new features for many years. What about Python? Python is used by millions of people, mainly
to make web services, so it has enormous application possibilities. A part of the Python ecosystem has specialized
in data science and is growing rapidly, taking advantage of the ease of extension to dynamic and
web-oriented applications. One last piece of information: Learning the first programming language
could look difficult. The learning curve, meaning how fast you learn, is steep at first: you struggle
at the very beginning, but after a while it softens, and you run. This is for the first one. Same ramp
to climb with the second one too? Not at all. Attempting an estimate, I would say that just one-third
of the effort is needed to learn the second, a bargain that probably few are aware of. Therefore, let’s
do both of them.
One last comment because one could certainly think that this discussion is only valid in theory,
putting it into practice is quite another thing. Over the years I have required hundreds of social
science students to learn the fundamentals of both R and Python for data science and I can tell
you that it is true that most of them struggled initially, some complained more or less aloud that
they were unfit, then they learned very quickly and ended up demonstrating that it was possible for
them to acquire excellent computational skills without having to transform into computer scientists
or data scientists (to tell the truth, someone transformed into one, but that’s fine too), without
possessing nonexistent digital native geniuses, without having to be anything other than what they
study for, future experts in social sciences, management, human resources, or economics, and what
is true for them is certainly true for everyone. This is the pleasant surprise.

Milan, Italy, 2023
Marco Cremonini

About the Companion Website

This book is accompanied by a student companion website.

www.wiley.com/go/DSFRPythonOpenData

The student website includes:


● MCQs

● Software

Introduction

This text introduces the fundamentals of data science using two main programming languages and
open-source technologies: R and Python. These are accompanied by the respective application
contexts formed by tools to support coding scripts, i.e. logical sequences of instructions with the
aim of producing certain results or functionalities. The tools can be of the command line interface
(CLI) type, which are consoles to be used with textual commands, or of the integrated development
environment (IDE) type, which are interactive tools supporting the use of the languages. Other elements
that make up the application context are the supplementary libraries that provide additional
functions beyond the basic ones coming with the language, package managers for the
automated management of the download and installation of new libraries, online documentation,
cheat sheets, tutorials, and online forums of discussion and help for users. This context, formed
by a language, tools, additional features, discussions between users, and online documentation
produced by developers, is what we mean when we say "R" and "Python," not the simple program-
ming language tool, which by itself would be very little. It is like talking only about the engine
when instead you want to explain how to drive a car on busy roads.
R and Python, together and with the meaning just described, represent the knowledge to start
approaching data science, carry out the first simple steps, complete the educational examples, get
acquainted with real data, consider more advanced features, familiarize oneself with other real
data, experiment with particular cases, analyze the logic behind mechanisms, gain experience with
more complex real data, analyze online discussions on exceptional cases, look for data sources in
the world of open data, think about the results to be obtained, even more sources of data now
to put together, familiarize yourself with different data formats, with large datasets, with datasets
that will drive you crazy before obtaining a workable version, and finally be ready to move to other
technologies, other applications, uses, types of results, projects of ever-increasing complexity. This
is the journey that starts here, and as discussed in the preface, it is within the reach of anyone who
puts some effort and time into it. A single book, of course, cannot contain everything, but it can
help to start, proceed in the right direction, and accompany for a while.
With this text, we will start from the elementary steps to gain speed quickly. We will use simplified
teaching examples, but also immediately familiarize ourselves with the type of data that exists in
reality, rather than in the unreality of the teaching examples. We will finish by addressing some
elaborate examples, in which even the inconsistencies and errors that are part of daily reality will
emerge, requiring us to find solutions.

Approach

It often happens that students dealing with these contents, especially the younger ones, initially
find it difficult to figure out the right way to approach their studies in order to learn effectively.
One of the main causes of this difficulty lies in the fact that many are accustomed to the idea that
the goal of learning is to never make mistakes. This is not surprising, indeed, since it's the criterion
adopted by many exams: the more mistakes, the lower the grade. This is not the place to discuss the
effectiveness of exam methodologies or teaching philosophies; we are pragmatists, and the goal is
to learn R and Python, computational logic, and everything that revolves around it. But it is pre-
cisely from a wholly pragmatic perspective that the problem of the inadequacy of the approach that
seeks to minimize errors arises, and this for at least two good reasons. The first is that inevitably
the goal of never making mistakes leads to mnemonic study. Sequences of passages, names, formu-
las, sentences, and specific cases are memorized, and the variability of the examples considered is
reduced, tending toward schematism. The second reason is simply that trying to never fail is exactly
the opposite of what it takes to effectively learn R and Python and any digital technology.
Learning computational skills for data science necessarily requires a hands-on approach. This
involves carrying out many practical exercises, meticulously redoing those proposed by the text, but
also varying them, introducing modifications, and replicating them with different data. All the
didactic examples can obviously be modified, but also those with open data can easily
be varied. Instead of certain information, others could be used, and instead of a certain result, a
slightly different one could be sought, or different data made available by the same source could be
tried. Proceeding methodically (being methodical, meticulous, and patient are fundamental traits
for effective learning) is the way to go. Returning to the methodological doubts that often afflict
students when they start, the following golden rule applies, which must necessarily be emphasized
because it is of fundamental importance: exercises are used to make mistakes, an exercise without
errors is useless.

Open Data
The use of open data, right from the first examples and to a much greater extent than examples with
simplified educational datasets, is one of the characteristics, perhaps the main one, of this text. The
datasets taken from open data are 26, sourced from the United States and other countries, large
international organizations (the World Bank and the United Nations), as well as charities and inde-
pendent research institutes, gender discrimination observatories, and government agencies for air
traffic control, energy production and consumption, pollutant emissions, and other environmental
information. This also includes data made available by cities like Milan, Berlin, and New York City.
This selection is just a drop in the sea of open data available and constantly growing in terms of
quantity and quality.
Using open data to the extent it has been done in this text is a precise choice that certainly imposes
an additional effort on those who undertake the learning path, a choice based both on personal
experience in teaching the fundamentals of data science to students of social and political sciences
(every year I have increasingly anticipated the use of open data), and on the fundamental drawback
of carrying out examples and exercises mainly with didactic cases, which are inevitably unreal and
unrealistic. Of course, the didactic cases, also present in this text, are perfectly fit for showing a
specific functionality, an effect or behavior of the computational tool. As mentioned before, though,
the issue at stake is about learning to drive in urban traffic, not just understanding some engine
mechanics, and at the end the only way to do that is … driving in traffic, there’s no alternative. For us
it is the same, anyone who works with data knows that one of the fundamental skills is to prepare
the data for analysis (first there would be that of finding the data) and also that this task can easily
be the most time- and effort-demanding part of the whole job. Studying mainly with simplified
teaching examples erases this fundamental part of knowledge and experience, for this reason, they
are always unreal and unrealistic, however you try to fix them. There is no alternative to putting
your hands and banging your head on real data, handling datasets even of hundreds of thousands
or millions of rows (the largest one we use in this text has more than 500 000 rows, the data of all US
domestic flights of January 2022) with their errors, explanations that must be read and sometimes
misinterpreted, even with cases where data was recorded inconsistently (we will see one quite
amusing case of this kind). Familiarity with real data should be achieved as soon as possible, to figure out their
typical characteristics and the fact that behind data there are organizations made up of people, and
it is thanks to them if we can extract new information and knowledge. You need to arm yourself
with patience and untangle, one step at a time, each knot. This is part of the fundamentals to learn.

What You Don’t Learn

One book alone can’t cover everything; we’ve already said it and it’s obvious. However, the point
to decide is what to leave out. One possibility is that the author tries to discuss as many different
topics as she/he can think of. This is the encyclopedic model, popular but not very compatible with a
reasonably limited number of pages. It is no coincidence that the most famous of the encyclopedias
have dozens of ponderous volumes. The short version of the encyclopedic model is a “synthesis,”
i.e. a reasonably short overview that is necessarily not very thorough and has to simplify complex
topics. Many educational books choose this form, which has the advantage of the breadth of topics
combined with a fair amount of simplification.
This book has a hybrid form, from this point of view. It is broader than the standard because
it includes two languages instead of one, but it doesn’t have the form of synthesis because it
focuses on a certain specific type of data and functionality: data frames, with the final addition of
lists/dictionaries, transformation and pivoting operations, group indexing, aggregation, advanced
transformations and data frame joins, and on these issues, it goes into the details. Basically, it
offers the essential toolbox for data science.
What’s left out? Very much, indeed. The techniques and tools for data visualization, descriptive
and predictive models, including machine learning techniques, obviously the statistical analysis
part (although this is traditionally an autonomous part), technologies for "Big Data," i.e. distributed,
scalable software infrastructures capable of managing not only a lot of data but above all data
streams, i.e. real-time data flows, and the many web-oriented extensions, starting from data col-
lection techniques from websites up to integration with dynamic dashboards and web services,
are not included. Again, there are specialized standards, such as those for climate data, financial
data, biomedical data, and coding used by some of the large international institutions that are not
treated. The list could go on.
This additional knowledge, which is part of data science, deserves to be learned. For this, you
need the fundamentals that this book presents. Once equipped with them, it’s the personal interests
and the cultural and professional path of each one to play the main role, driving in a certain direc-
tion or in another. But again, once it has been verified firsthand that it is possible, regardless of
one’s background, to profitably acquire the fundamentals of the discipline with R and Python, any
further insights and developments can be tackled, in exactly the same way, with the same approach
and spirit used to learn the fundamentals.
1 Open-Source Tools for Data Science

1.1 R Language and RStudio

In this first section, we introduce the main tools for the R environment: the R language and the
RStudio IDE (integrated development environment). The first is an open-source programming
language developed by the community, specifically for statistical analysis and data science; the
second is an open-source development tool produced by Posit (www.posit.co), formerly called
RStudio, representing the standard IDE for R-based data science projects. Posit offers a freeware
version of RStudio called RStudio Desktop that fully supports all features for R development; it has
been used (v. 2022.07.2) in the preparation of all the R code presented in this book. Commercial
versions of RStudio add supporting features typical of managing production software in corporate
environments. An alternative to RStudio Desktop is RStudio Cloud, the same IDE offered as a
service on a cloud premise. Graphically and functionally, the cloud version is exactly the same as
the desktop one; however, its free usage has limitations.
The official distribution of the R language and the RStudio IDE are just the starting points though.
This is what distinguishes an open-source technology from a proprietary one. With an open-source
technology actively developed by a large online community, as is the case for R, the official dis-
tribution provides the basic functionality and, on top of that, layers of additional, advanced, or
specialistic features could be stacked, all of them developed by the open-source community. There-
fore, it is a constantly evolving environment, not a commercial product subject to the typical life
cycle mostly mandated by corporate marketing. What is better, an open-source or a proprietary
tool? This is an ill-posed question, mostly irrelevant in generic terms because the only reasonable
answer is, “It depends.” The point is that they are different in a number of fundamental ways.
With R, we will use many features provided by additional packages to be installed on top of the
base distribution. This is the normal course of action and is exactly what everybody using this
technology is supposed to do in order to support the goal of a certain data analysis or data science
project. Clearly, the additional features employed in the examples of this book are not all those avail-
able, nor even all the important ones, which would be simply impossible to cover. New
features come out continuously, so in learning the fundamentals, it is important to practice with
the approach, familiarize yourself with the environment, and exercise with the most fundamental
tools, so as to be perfectly able to explore the new features and tools that become available.
Just keep in mind that these are professional-grade tools, not merely didactic ones to be aban-
doned after the training period. Thousands of experienced data scientists use these tools in their
daily jobs and for top-level data science projects, so the instruments you start knowing and handling
are powerful.


1.1.1 R Language
CRAN (the Comprehensive R Archive Network, https://cloud.r-project.org/) is the official online
archive for all R versions and software packages available to install. CRAN is mirrored on a number
of servers worldwide, so, in practice, it is always available.
The R base package is compatible with all desktop platforms: Windows, MacOS, and
Linux. The installation is guided through a standard wizard and is effortless. Mobile platforms
such as iOS and Android, as well as hybrid products, like the Chromebook, are not supported. For
old operating system versions, the currently available version of R might not be compatible. In that
case, under R Binaries, all previous versions of R are accessible, the most recent compatible one
can be installed with confidence, and all the important features will be available.
At the end of the installation, a link to an R execution file will be created in the programs or
applications menu/folder. That is not the R language, but an old-fashioned IDE that comes with
the language. You do not need that if you use RStudio, as is recommended. You just need to install
the R language, that is all.

1.1.2 RStudio Desktop


The RStudio Desktop is an integrated development environment (IDE) for R programming, recently
enhanced with features to interpret Python scripts too (https://posit.co/download/rstudio-
desktop/). In short, this means that it is a tool offering a graphical interface that accommodates
most of the necessary functionalities for developing projects using R, which is a separate compo-
nent, as we have seen in the previous section. The RStudio IDE is unanimously considered one of
the best available IDEs, being complete, robust, and consistent throughout the versions. For this
reason, there is not much competition in that market, at least until now. It is simply the safest and
best choice. The icons of R and of RStudio might be confused at first, since they both show a big R.
It is important to familiarize yourself with RStudio’s layout because of the many functionalities
available and useful in data science projects. The layout is divided into four main quadrants, as
shown in Figure 1.1, with quadrant Q1 appearing only when an R script is created from the
drop-down menu of the top-left icon.
● Q1: The quadrant for editing the R code, with different scripts shown in separate tabs on top.
● Q2: The main feature is the R Console, where single command line instructions can be executed
and the output of the execution of an R script appears.
● Q3: Information about the environment is provided through this quadrant, such as R objects
(variables) created in memory during the execution of code; Python objects too could be shown
if software allowing for integration between the two languages is used.
● Q4: A multifunction quadrant allowing for exploring the local file system (tab Files), visualizing
graphics (tab Plots), the R package manager (tab Packages), and online documentation (tab Help).

1.1.3 Package Manager


The package manager is a key component of open-source environments, frequently used for updat-
ing a configuration, adding new functionalities, duplicating a configuration for testing purposes,
and so forth. Installing new components is a common and recurrent activity in environments like
R and Python, so it has to be simple and efficient. This is the crucial role of a package manager.
A package manager is typically a piece of software with few functionalities that basically
revolve around listing the installed packages, updating them, searching for new ones, installing
them, and removing useless packages. Everything else is basically accessory features that are not
strictly necessary. Given the few specialized features a package manager must have, it should come
without any surprise that modern package managers have their origins in classical command line
tools. Actually, they still exist and thrive; they are often used as command line tools both in R and
Python environments, just because they are simple to use and have limited options.
At any rate, a graphical interface exists, and RStudio offers it with the tab Packages in the Q4
quadrant. It is simple, just a list of installed packages and a selection box indicating if a package is
also loaded or not. Installing and loading a package are two distinct operations. Installing means
retrieving the executable code, for example, by downloading it from CRAN and configuring it in the
local system. Loading a package means making its functionalities available for a certain script, which
translates into the fundamental function library(<name of the package to load>).
Ticking the box beside a package in the RStudio package manager will execute on the R Console
(quadrant Q2) the corresponding library() instruction. Therefore, using the console or ticking
the box for loading a package is exactly the same.
However, neither of them is a good way to proceed, when we are writing R scripts, because a
script should be reproducible, or at least understandable by others, at a later time, possibly a long
time later. This means that all information necessary for reproducing it should be explicit, and if the
list of packages to be loaded is defined externally by ticking selection boxes or running commands
on the console, that knowledge is hidden, and it will be more difficult to understand exactly all the
features of the script. So the correct way to proceed is to explicitly write all necessary library()
instructions in the script, loading all required packages.
The opposite operation of loading a package is unloading it, which is certainly less frequent;
normally, it is not needed in scripts. From the RStudio interface, it could be executed by unticking
a package or by executing the corresponding instruction detach("package:<name of the
package>", unload=TRUE).
A reasonable doubt may arise about the reason why installed packages are not just all loaded
by default. Why bother with this case-by-case procedure? The reason is memory, the RAM, in
particular, that is not only finite and shared by all processes executed on the computer, but is often
a scarce resource that should be used efficiently. Loading all installed packages, which could be
dozens or even hundreds, when normally just a few are needed by the script in execution, is clearly
a very inefficient way of using the RAM. In short, we bother with the manual loading of packages
to save memory space, which is good when we have to execute computations on data.
Installing R packages is straightforward. The interactive button Install in tab Packages is handy
and provides all the functionalities we need. From the window that opens, the following choices
should be made:

● Install from: From which repository should the package be downloaded? Options are: CRAN,
the official online repository, which is the default and the normal case, and Package Archive File,
only useful if the package to install has been saved locally, which may happen for experimental
packages not available from CRAN, a rare case. Packages available from
GitHub could be retrieved and installed with a specialized command (githubinstall
("PackageName")).
● Packages: The name of the package(s) to install; the autocomplete feature looks up names from
CRAN.
● Install to library: The installation path on the local system depends on the R version currently
installed.
● Install dependencies: Dependencies are logical relationships between different packages. It is
customary for new to packages exploit features of previous packages for many reasons, either
because they are core or ancillary functionalities with respect to the features provided by the
package. In this case, those functionalities are not reimplemented, but the package providing
them is logically linked to the new one. This, in short, is the meaning of dependencies. It means
that when a package is installed, if it has dependencies, those should be installed too (with the
required version). This option, when selected, automatically takes care of all dependencies,
downloading and installing them, if not present. The alternative is to manually download and
install the packages required as dependencies by a certain package. The automatic choice is
usually the most convenient. Errors may arise because of dependencies, for example, when for
any reason, the downloading of a package fails, or the version installed is not compatible. In
those cases, the problem should be fixed manually, either by installing the missing dependencies
or the one with the correct version.
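
For reference, the same operations performed through the Install window can be run from the R Console. A minimal sketch, with a hypothetical package name:

# Install a package from CRAN, resolving its dependencies automatically
install.packages("PackageName", dependencies=TRUE)

# Install a package hosted on GitHub (requires package githubinstall)
library(githubinstall)
githubinstall("PackageName")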

1.1.4 Package Tidyverse


The main package we use in this book is called tidyverse (https://www.tidyverse.org/). It is a
particular package because it does not directly provide new features, but rather groups a bunch of
other packages, which are then installed all at once, and these provide the additional features. In
a way, tidyverse is a shortcut created to simplify the life of people approaching data science with R,
instead of installing a certain number of packages individually, common to the majority of projects,
they have been encapsulated in just one package that does the whole job.
There are criticisms of this way of doing things based on the assumption that only necessary
packages should be installed and, most of all, loaded. This principle is correct and should be
followed as a general rule. However, a trade-off is also reasonable in most cases. Therefore, you
may install tidyverse and then load only specific packages in a script, or just load the whole
lot contained in tidyverse. Usually, it does not make much difference; you can choose without
worrying too much about this.

In any case, tidyverse is widely used, and for this, it is useful to spend some time reading the
description of the packages included in it because this provides a glimpse into what most data
science projects use, the types of operations and more general features. In our examples, most of
the functions we will use are defined in one of the tidyverse packages, with some exceptions that
will be introduced.
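
To inspect the exact list of packages bundled in tidyverse on your own installation, a quick sketch using the helper function that the tidyverse package itself provides (assuming it is already installed):

# List the packages included in tidyverse
library(tidyverse)
tidyverse_packages()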
The installation of tidyverse is the standard one, through the RStudio package manager, or the
console with command install.packages("tidyverse"). Loading it in a script is done
with library(tidyverse) for a whole lot of packages, or alternatively, for single packages
such as library(readr), where readr is the name of a package contained in tidyverse. In all
cases, after the execution of a library instruction, the console shows if and what packages have
been loaded.
In all our examples, it should be assumed that the first instruction to be executed is
library(tidyverse), even when not explicitly specified.
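
In script form, the whole sequence looks as follows (the installation is run once, not in every script):

# One-time installation from CRAN
install.packages("tidyverse")

# In the script: load the whole tidyverse, or single packages only
library(tidyverse)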

1.2 Python Language and Tools

Python’s environment is more heterogeneous than R’s, mostly because of the different scope of the
language – Python is a general-purpose language mostly used for web and mobile applications, and
in a myriad of other cases, data science is among them – which implies that several options are avail-
able as a convenient setup for data science projects. Here, one of the most popular is considered,
but there are good reasons to make different choices.
The first issue to deal with is that, until now, there is not a data science Python IDE comparable
to RStudio for R, which is the de facto standard and offers almost everything that is needed. In
Python, you have to choose if you want to go with a classical IDE for coding; there are many, which
is fine, but they are not much tailored for data science wrangling operations; or if you want to go
with an IDE based on the computational notebook format (just notebook for short). The notebook
format is a hybrid document that combines formatted text, usually based on a Markdown syntax
and blocks of executable code. For several reasons, mostly related to utility in many contexts to
have both formatted text and executable code, and the ease of use of these tools, IDEs based on the
notebook format have become popular for Python data science and data analysis. The code in the
examples of the following chapters has been produced and tested using the main one of these IDEs,
JupyterLab (https://jupyterlab.readthedocs.io/en/latest/). It is widely adopted, well-documented,
easy to install, and free to use. If you are going to write short blocks of Python code with associated
descriptions, a typical situation in data science, it is a good choice. If you have to write a massive
amount of code, then a classical IDE is definitely better. Jupyter notebooks are textual files with
canonical extension .ipynb and an internal structure in JSON format.
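
As an illustration of this structure, a minimal empty notebook reduces to a JSON document similar to the following sketch (exact fields and version numbers may vary):

{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}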
So, the environment we need has the Python base distribution, a package manager, for the same
reasons we need it with R, the two packages specifically developed for data science functionali-
ties called NumPy and pandas, and the notebook-based IDE JupyterLab. These are the pieces. In
order to have them installed and set up, there are two ways of proceeding: one is easy but ineffi-
cient, and the other is a little less easy but more efficient. Below, with A and B, the two options are
summarized.

A. A single installer package, equipped with a graphical wizard, installs and sets up everything
that is needed, but also much more than you will likely ever use, for a total required memory
space of approximately 5 GB on your hard disk or SSD memory.

B. A manual procedure individually installs the required components: first Python and the pack-
age manager, then the data science libraries NumPy and pandas, and finally the JupyterLab IDE.
This requires using the command line shell (or terminal) to run the few installation instructions
for the package manager, but the occupied memory space is just approximately 400 MB.

Both ways, the result is the Python setup for learning the fundamentals of data science, get-
ting ready, and working. The little difficulty of the B option, i.e. using the command line to install
components, is truly minimal and, in any case, the whole program described in this book is about
familiarizing with command line tools for writing R or Python scripts, so nobody should be worried
about a few almost identical commands to run with the package manager.
So, the personal suggestion is to try the B option, as described in the following, operationally
better and able to teach some useful skills. At worst, it is always possible to backtrack and go with
the easier A option on a second try.

1.2.1 Option A: Anaconda Distribution


Option A is easy to explain. There is a tool called Anaconda Distribution (https://www.anaconda
.com/products/distribution) that provides everything needed for an initial Python data science
environment. It contains all the components we have listed as well as tons of other tools and
libraries. In addition, it offers a desktop application called Anaconda Navigator, which is basically
a graphical interface to the package manager conda. Unfortunately, this interface is quite bulky.
From this interface, it is also possible to launch the JupyterLab IDE.

1.2.2 Option B: Manual Installation


Option B requires a few steps:

Step 1: Python and package manager installation.


From the official repository of Python distribution (https://www.python.org/downloads/), the
installer for the latest (or previous) distribution could be downloaded and launched. A graphi-
cal wizard guides the process. In the end, the Python language with its basic libraries will be
installed, together with pip, the standard Python package manager (conda, Anaconda's package
manager, is the main alternative, installed with the Anaconda tools of option A). The differences
between the two are minimal, and for all our concerns,
they are equivalent. Even the syntax of the commands is basically the same. The only recom-
mendation is to choose one and continue using that one for package management; this avoids
some possible problems with dependencies. We show the examples using pip, but conda is fine
as well.
Step 2: Installing data science packages NumPy, pandas, and JupyterLab IDE.
From a shell (e.g. Terminal on MacOS, Powershell on Windows), to run the package manager, it
suffices to type pip (or conda, for the other one) and press return.

This way, a list of the options is shown. The most useful are:

● pip list: list all installed packages with the version.
● pip install <package_name>: install a package.
● pip uninstall <package_name>: uninstall a package.

When a package is installed or uninstalled, a confirmation request appears on the command line; the syntax is [n/Y], with n for No and Y for Yes.

Figure 1.2 Example of starting JupyterLab

The commands we need to run are easy:

pip install numpy


pip install pandas
pip install jupyterlab

That is all, not difficult at all.


To start JupyterLab, again from the shell, execute jupyter lab. You will see that the local service has started and, after a little while, a new tab opens in the default browser with the JupyterLab interface. Figure 1.2 shows a screenshot of starting JupyterLab with some useful information, such as how to stop it (use Control-C and confirm the decision to stop it), where the working directory is, and the URL to copy and paste into the browser to reopen the tab if it was accidentally closed.

1.2.3 Google Colab


An alternative to JupyterLab is the cloud platform Google Colaboratory or Colab for short
(https://colab.research.google.com/). It is accessible with a Google account and makes use of
Google Drive for managing files and storing notebook documents. It is fully compatible with
Jupyter notebooks. Those produced by Colab still have extension ipynb and are perfectly shareable
between JupyterLab and Colab. The service is reliable, has quickly improved in just a few years,
and is free to use at the moment of writing, so it should definitely be considered as a viable option,
with pros and cons of a cloud platform evaluated for convenience.

1.2.4 Packages NumPy and Pandas


As already mentioned, NumPy (https://numpy.org/) and pandas (https://pandas.pydata.org/)
are the data science-specific packages for Python. The first handles arrays and matrices, while
the second is completely devoted to managing data frames, so we will make extensive use of the
functionalities it offers. Both sites have updated and searchable technical documentation, which
can also be downloaded, representing an indispensable supplement to the material, explanations,
and examples of this book. For any programming language, the technical documentation should be kept at hand and regularly consulted, regardless of the handbook used for learning. A handbook and the technical documentation serve different purposes and complement each other; they are never alternatives.
All the Python scripts and fragments of code presented in the following chapters assume that
both libraries have been loaded with the following instructions, which should appear as the first
ones to be executed:

import numpy as np
import pandas as pd

1.3 Advanced Plain Text Editor


Another very useful, almost necessary, tool is an advanced plain text editor. We will often need to inspect a dataset, that is, a file containing data, whose format in our case is almost always tabular text. We will use proprietary formats only to show how to use Microsoft Excel datasets, which are common for very small datasets; many other proprietary formats exist and have specific ways to be accessed. However, in our reference environments, the standard format for datasets is the text file, and we will focus on it.
Text files could be opened with many common tools present in the standard configurations of
all platforms, from Windows Notepad to MacOS Text Edit and others. However, those are basic
tools for generic text files. Inspecting a very large dataset has different requirements; we need more features, like advanced search and replace, line numbers, pattern matching, and so on. The good news is that these advanced features are supported by many advanced plain
text editors, readily available for all platforms. Among others, two widely used are Notepad++
(https://notepad-plus-plus.org/), only for Windows, and Sublime Text (https://www.sublimetext
.com/download), for Windows and MacOS. But again, many others exist, and readers are encour-
aged to explore the different alternatives and test them before choosing one.

1.4 CSV Format for Datasets


Finally, we reach one key component of the data science arsenal with R and Python, which is
the CSV format (comma-separated values), the gold standard for all open data. This does not mean that all the open data one could find is available in CSV format; of course not, it could be offered in many different formats, open or proprietary. But if there is one format for datasets that can be considered the standard, that is CSV.
The format is extremely simple and has the minimum of requirements:
● It is a tabular format composed of rows and columns.
● Each row has an equal number of values separated by a common separator symbol and ended by
a return (i.e. new line).
● Columns, defined by values in the same position for each row, have the same number of elements.
That is all it takes to define a CSV. Its simplicity is its main feature, of course, making it platform- and vendor-independent, not subject to versioning, human-readable, and also efficiently processable by computational means.
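
As an illustration, a minimal hypothetical CSV file satisfying these requirements might look as follows (column names and values are invented):

name,capital,population
France,Paris,68000000
Italy,Rome,59000000
Japan,Tokyo,125000000

Each row has the same number of values, separated by commas and terminated by a new line; the first row conventionally holds the column names.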

Its name recalls the original symbol used as a separator, the comma, perfectly adapted to the Anglo-Saxon convention for floating point numbers and to mostly numerical data. With the diffusion to European users and to textual data, the comma became problematic, being used for the decimal part of numbers and often within sentences, so the semicolon appeared as an alternative separator, less frequent in text and unrelated to numbers. But even the semicolon may be problematic when used in sentences, so the tabulation (tab character) became another typical separator. These three are the usual ones, so to speak, since there is no formal specification mandating which symbols may act as separators, and we should expect to find all of them in use. Other, less common symbols could be encountered as well, such as the vertical bar (|).
Ultimately, the point is that whatever separator is used in a certain CSV dataset, we can easily recognize it, for example by visually inspecting the text file, and then access the dataset correctly. So, some attention is needed, but the separator is never a real problem.
As a convention, not a strict rule, CSV files use the extension .csv (e.g. dataset1.csv), meaning that the file is a text-based, tabular dataset. When the tab is used as a separator, it is common to indicate it by means of the .tsv file extension (e.g. dataset2.tsv), which is just good practice and a kind way to inform users that tab characters have been placed in the dataset, since they may not be evident at first sight. Expect, however, to also find datasets using tabs as separators named with the .csv extension.
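
Anticipating the reading functions covered in the next chapter, a reader typically declares the separator explicitly when it is not the comma. A minimal sketch with pandas (the file name is hypothetical):

import pandas as pd

# sep declares the separator; here the tab character of a .tsv file
df = pd.read_csv('dataset2.tsv', sep='\t')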
Ambiguous cases arise anyway. A real example is shown in the fragment of Figure 1.3. This is
a real dataset with information about countries, a type of widely common information. Countries
have official denominations that have to be respected, especially in works with a certain degree of
formality. If you look closely, you will notice something strange about the Democratic Republic of
the Congo.
The name is written differently in order to maintain the coherence of the alphabetic order, so
it became Congo, Democratic Republic of the. Fine for what concerns the order, but it complicates
things for the CSV syntax because now the comma after Congo is part of the official name. It cannot
be omitted or replaced with another symbol; that is the official name when the alphabetic order
must be preserved. But commas also act as separators in this CSV, and now they are no longer
unambiguously defined as separators.
How can we resolve this problem? We have already excluded the possibility of arbitrarily chang-
ing the denomination of the country, that is not possible. Could we replace all the commas as
separators with another symbol, like a semicolon, for example? In theory yes, we could, but in practice it might be much more complicated than a simple replacement, because there could be other cases like Congo, Democratic Republic of the, and for all of them the comma in the name should not be replaced. It is not easy to guarantee that no further errors are introduced.
Looking at Figure 1.3, we see the standard solution for this case – double quotes have been used to enclose the textual content that contains the same symbol used as a separator (the comma, in this case). This tells the function reading the CSV to consider the whole text within double quotes as a single element value and to ignore the presence of symbols otherwise used as separators. Single quotes work fine too, unless they are used as apostrophes in the text.
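
For instance, a hypothetical fragment of such a CSV could look like this (the code and continent columns are invented for illustration):

country,code,continent
Colombia,COL,South America
"Congo, Democratic Republic of the",COD,Africa
Costa Rica,CRI,North America

The comma inside the quoted name is read as part of the value, not as a separator.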
Figure 1.3 Ambiguity between textual character and separator symbol

This solution solves most cases, but not all. What if the text contains all of them: double quotes, single quotes, and commas? For example, a sentence like this: First came John "The Hunter" Western; then his friend Bill li’l Rat Thompson, followed by the dog Sausage. How could we possibly put this
in a CSV in a way that it is recognized as a unique value? There is a comma and a semicolon; single
or double quotes do not help in this case because they are part of the sentence. We might replace all
commas as separators with tabs or other symbols, but as already said, it could be risky and difficult.
There is a universal solution called escaping, which makes use of the escape symbol, which
typically is the backslash (\). The escape symbol is interpreted as having a special meaning, which
is to consider the following character just literally, destitute of any syntactical meaning, such as a
separator symbol or any other meaning. Thus, our sentence could be put into a CSV and considered
as a single value again by using double quotes, but being careful to escape the double quotes inside
the sentence: “First came John \“The Hunter\” Western; then his friend Bill li’l Rat Thompson,
followed by the dog Sausage.” This way, the CSV syntax is unambiguous.
Finally, what if the textual value contains a backslash? For example: The sentient AI wrote “Hi
Donna, be ready, it will be a bumpy ride,” and then executed \start.
We know we have to escape the double quotes, but what about the backslash that will be inter-
preted as an escape symbol? Escape the escape symbol, so it will be interpreted literally: “The sentient AI wrote \“Hi Donna, be ready, it will be a bumpy ride\” and then executed \\start.” Using single quotes, we do not need to escape the double quotes in this case: ‘The sentient AI wrote “Hi Donna, be ready, it will be a bumpy ride” and then executed \\start.’
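
As a hedged sketch of how this is handled in practice, pandas' read_csv accepts the parameters quotechar and escapechar; the data string below is invented, and parsing details may vary slightly across parser engines:

import io
import pandas as pd

# A CSV fragment where the quoted field contains a semicolon and
# escaped double quotes (\" is the escaped quote inside the field)
data = 'id,sentence\n1,"First came John \\"The Hunter\\" Western; then Bill"\n'

# escapechar tells the parser to read the character after \ literally
df = pd.read_csv(io.StringIO(data), quotechar='"', escapechar='\\')
print(df.loc[0, 'sentence'])
# First came John "The Hunter" Western; then Bill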

Questions

1.1 (R/Python)
CSV is ...
A A proprietary data format, human-readable
B An open data format, human-readable
C A proprietary format, not human-readable
D An open data format, not human-readable
(R: B)

1.2 (R/Python)
A CSV dataset has ...
A A tabular data organization
B A hierarchical data organization
C Metadata for general information
D An index
(R: A)

1.3 (R/Python)
A valid CSV dataset has ...
A No value with spaces
B Possibly different separators
C No missing value
D Equal number of elements for each row
(R: D)

1.4 (R/Python)
Which are considered legitimate separators in a CSV dataset?

A Only comma, semicolon, and tab


B Only comma and semicolon
C Comma, semicolon, and tab are the most typical, but other symbols are possible (e.g. pipe,
dash)
D All sorts of characters, symbols, or strings
(R: C)

1.5 (R/Python)
What is the usage case for quotes and double quotes in a CSV dataset?
A There is no usage case for them
B As separators
C As commonly used symbols in strings
D To embrace an element value containing the same symbol used as separator
(R: D)

1.6 (R/Python)
What is the escape character?
A It is quotes or double quotes
B It is the separator symbol
C It is the symbol used to specify that the following character/symbol should be considered
at face value, not interpreted as having a special meaning
D It is the symbol used to specify that the following word should be considered at face value,
not interpreted as having a special meaning
(R: C)
2 Simple Exploratory Data Analysis

Having read a dataset, the first activity usually made afterward is to figure out the main charac-
teristics of those data and make sense of them. This means understanding the organization of the
data, their types, and some initial information on their values. For data of numerical type, simple
statistical information can be obtained; these are usually called descriptive statistics and often
include basic information like the arithmetic mean, the median, maximum and minimum values,
and quartiles. Clearly, other, more detailed statistical information could be easily obtained from a
series of numerical values.
This activity is often called simple exploratory data analysis, where the adjective “simple” dis-
tinguishes this basic and quick analysis performed to grasp the main features of a dataset with
respect to thorough exploratory data analyses performed with more sophisticated statistical tools
and methods.
However, the few requirements of this initial approach to a new dataset should not be erroneously
considered unimportant. On the contrary, basic descriptive statistics offered by common tools may
reveal important features of a dataset that could help decide how to proceed, show the presence of
anomalous values, or indicate specific data wrangling operations to execute. It is important to ded-
icate attention to the information provided by a simple exploratory data analysis. Real datasets are
typically too large to be visually inspected; therefore, in order to start collecting some information
about the data, tools are needed, and descriptive statistics are the first among them.
R and Python offer common functionalities to obtain descriptive statistics together with other
utility functions, which allow getting information on a dataset, from its size to the unique values
of a column/variable, names, indexes, and so forth. These are simple but essential features and
familiarity with them should be acquired.
A first list of these functions is presented below, and all of them will be used in the numerous examples that follow throughout the book. Before that, another very relevant issue is introduced: the almost inevitable presence of missing values.

2.1 Missing Values Analysis

Missing values are literally what their name means – some elements of a dataset may have no value.
This must be understood literally, not metaphorically – a missing value is the absence of value, not
a value with an undefined meaning like a space, a tab, some symbols like ?, ---, or *, or something
like the string “Unknown,” “Undefined,” and the like. All these cases are not missing values; they
are actual values with a possibly undefined meaning.
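
To see the difference in practice, here is a minimal pandas sketch (the column name and placeholder strings are invented) converting such placeholder values into true missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'status': ['ok', 'Unknown', '---', 'ok']})

# 'Unknown' and '---' are actual string values, not missing values;
# they can be turned into true missing values (NaN) explicitly
df = df.replace({'Unknown': np.nan, '---': np.nan})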


Missing values are very common in real datasets to the point that it is a much safer assumption to
expect their presence than the opposite. The absence of values may happen for a number of reasons,
from a sensor that failed to take a measurement to an observer that failed to collect a certain data
point. Errors of any sort may result in missing values, or they might be just the expected result of a
certain data collection process, for example, a dataset reporting road incident may have values only
if there has been at least one incident on a certain road in the reference period of time; otherwise, the
element remains empty. Many other reasons for the presence of missing values could be envisioned.
The important point when dealing with real datasets is not to exclude the presence of missing values.
That could lead to severe errors if missing values are unaccounted for. For this reason, the presence of
missing values must always be carefully verified, and appropriate actions for dealing with them are
decided on a case-by-case basis.
We will dedicate a specific section in the following chapter to the tools and methods for analyzing and managing missing values in R and Python. Here, it is important to grasp that missing values must be analyzed, and that their presence forces us to decide what to do with them. There is no general recipe to apply; it has to be decided on a case-by-case basis.
Once we have ascertained that missing values are present in a dataset, we should gather more
information about them – where are they? (in which columns/variables, rows/observations), and
how many are they? Special functions will assist us in answering these questions, but then it will
be our turn to evaluate how to proceed. Three general alternatives typically lie in front of us:
1. Write an actual value for elements with missing values.
2. Delete rows/observations or columns/variables with missing values.
3. Do not modify the data and handle missing values explicitly in each operation.
For the first one, the obvious problem is to decide what value should be written in place of the
missing values. In the example of the road incidents dataset, if we know for sure that when a road
reports zero incidents, the corresponding element has no value, we may reasonably think to write
zero in place of missing values. The criterion is correct, given the fact that we know for sure how
values are inserted. Is there any possible negative consequence? Is it possible that we are arbitrarily
and mistakenly modifying some data? Maybe. What if a missing value was instead present because
there was an error in reporting or entering the data? We are setting zero for incidents on a certain
road when the true value would have been a certain number, possibly a relevant one. Is the fact of
having replaced the missing value with a number a better or worse situation than having kept the
original missing value? This is something we should think about and decide.
Similar is the second alternative. If we omit rows or columns with all missing values in their
elements, then we are likely simplifying the data without any particular alteration. But what if, as
it is much more common, a certain row or column has only some elements with missing values and
others with valid values? When we omit that row or column, we are omitting the valid values too,
so we are definitely altering the data. Is that alteration relevant to our analysis? Did we clearly
understand the consequences? Again, these are questions we should think about and decide on
a case-by-case basis because it would depend on the specific context and data and the particular
analysis we are carrying out.
So, is the third alternative always the best one? Again, it depends. With that option, we have the burden of dealing with missing values in every operation we perform on the data. Functions get a little more complicated, the logic may become more convoluted, the chances of making mistakes increase accordingly, more time is needed, and the level of attention has to be higher. It is a trade-off, meaning there is no general recipe; we have to think about it and decide. But yes, the third alternative is generally the safest: the data are not modified, which is always a golden rule. Yet even the safest option does not guarantee the absence of errors.
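
Anticipating the dedicated section of the next chapter, the following minimal pandas sketch (data frame and values are invented) illustrates the three alternatives:

import numpy as np
import pandas as pd

df = pd.DataFrame({'road': ['A1', 'A2', 'A3'],
                   'incidents': [3.0, np.nan, 1.0]})

# Alternative 1: write an actual value in place of missing values
filled = df.fillna({'incidents': 0})

# Alternative 2: delete rows with missing values
dropped = df.dropna()

# Alternative 3: leave the data unchanged and handle missing values
# explicitly in each operation (skipna=True is pandas' default)
mean_incidents = df['incidents'].mean(skipna=True)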

2.2 R: Descriptive Statistics and Utility Functions

Table 2.1 lists some of the main R utility functions to gather descriptive statistics and other general
information on data frames.
It is useful to familiarize yourself with these functions and, without anticipating how to read datasets, which is the subject of the next chapter, the predefined data frames made available by base R and several packages are helpful for exercising. For example, package datasets is installed with the base R configuration and contains small didactic datasets, most of which have been around for quite a long time. It is a somewhat vintage-style experience to use those data, to be honest.
Additional packages, for example those installed with tidyverse, often contain didactic datasets that are usually much more recent than those from package datasets. For example, readers fond of classic sci-fi might appreciate dataset starwars, included in package dplyr, with data about the Star Wars saga's characters. It is a nice dataset for exercising. For other options, there is the command data(), to be executed in the RStudio console. It produces the list of predefined datasets contained in all loaded packages (pay attention to this: it is not sufficient to have a package installed, it has to be loaded with library()).

Table 2.1 R utility functions.

Function Description

summary() It is the main function for collecting simple descriptive statistics on data frame
columns. For each column, it returns the data type (numerical, character, and
logical). For numerical columns, it adds maximum and minimum values, mean
and median, 1st and 3rd quartiles, and if present, the number of missing values.
str() They are equivalent in practice, with str() defined in package utils of R base
glimpse() configuration and glimpse() defined in package dplyr, included in tidyverse.
They provide a synthetic representation of information on a data frame, like its
size, column names, types, and values of the first elements.
head() They are among the most basic R functions and allow visualizing the few
tail() topmost (head) or bottommost (tail) rows of a command output. For example,
we will often use head() to watch the first rows of a data frame and its header
with column names. It is possible to specify the number of rows to show (e.g.
head(10)); otherwise, the default applies (i.e. six rows).
View() Basically, the same and, when package tibble, included in tidyverse, is loaded, the
view() first is an alias of the second. They visualize a data frame and other structured
data like lists by launching the native RStudio viewer, which offers a graphical
spreadsheet-style representation and few features. It is useful for small data
frames, but it becomes quickly unusable when the size increases.
unique() It returns the list of unique values in a series. Particularly useful when applied to
columns as unique(df$col_name).
names() It returns column names of a data frame with names(df) and variable names
with lists. It is particularly useful.
class() It returns the data type of an R object, like numeric, character, logical, and data
frame.
length() It returns the length of an R object (careful, this is not the number of characters),
like the number of elements in a vector or the number of columns.
nrow() They return, respectively, the number of rows and columns in a data frame.
ncol()

To read the data of a preinstalled dataset, it suffices to write its name in the RStudio console and press Return, or to use View(dataset_name).
Here, we see an example, with dataset msleep included in package ggplot2, part of tidyverse. It
contains data regarding sleep times and weights for some mammal species. More information could
be obtained by accessing help online by executing ?msleep on the RStudio console.
Below is the textual visualization on the command console.

library(tidyverse)

msleep

# A tibble: 83 × 11
name genus vore order conse…1 sleep…2 sleep…3 sleep…4 awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl monkey Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mountain be… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greater sho… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domest… 4 0.7 0.667 20
6 Three-toed … Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 Northern fu… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vesper mouse Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domest… 10.1 2.9 0.333 13.9
10 Roe deer Capr… herbi Arti… lc 3 NA NA 21
# … with 73 more rows, 1 more variable: bodywt <dbl>, and abbreviated variable
# names 1 conservation, 2 sleep_total, 3 sleep_rem, 4 sleep_cycle

Here is the equivalent tabular visualization.

name              genus      vore   order         conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt  bodywt

Cheetah           Acinonyx   carni  Carnivora     lc                   12.1        NA          NA     11.9      NA     50
Owl monkey        Aotus      omni   Primates      NA                   17.0       1.8          NA      7.0  0.0155    0.48
Cow               Bos        herbi  Artiodactyla  domesticated          4.0       0.7        0.67     20.0   0.423   600
...               ...        ...    ...           ...                   ...       ...         ...      ...     ...    ...
Human             Homo       omni   Primates      NA                    8.0       1.9        1.50     16.0   1.320    62
Mongoose lemur    Lemur      herbi  Primates      vu                    9.5       0.9          NA     14.5      NA    1.67
African elephant  Loxodonta  herbi  Proboscidea   vu                    3.3        NA          NA     20.7   5.712  6654

From the two visualizations, a detail should be noted – some values are indicated as NA, which
stands for Not Available. It is the visual notation R uses to indicate a missing value. It may have
some variation like <NA>.
The meaning is that there is no value corresponding to the element. It is a user-friendly notation
to make it more evident where missing values are. It does not mean that in a certain element, there
is a value corresponding to the two letters N and A, NA. Not at all, it is a missing value, there is
nothing there, a void.

Then, as is often the case, there are exceptions, but we are not anticipating them. The important
thing is that the notation NA is just a visual help to see where missing values are.
Let us see what function summary() returns.

summary(msleep)

name genus vore


Length:83 Length:83 Length:83
Class :character Class :character Class :character
Mode :character Mode :character Mode :character

...

conservation sleep_total sleep_rem sleep_cycle


Length:83 Min. : 1.90 Min. :0.100 Min. :0.1167
Class :character 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
Mode :character Median :10.10 Median :1.500 Median :0.3333
Mean :10.43 Mean :1.875 Mean :0.4396
3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
Max. :19.90 Max. :6.600 Max. :1.5000
NA’s :22 NA’s :51
...

The result shows the columns of data frame msleep, and, for each one, some information.
Columns of type character show very little information; numerical columns have the descriptive
statistics that we have mentioned before and, where present, the number of missing values (e.g.
column sleep_cycle, NA’s: 51).
With function str(), we obtain a general overview. Same with function glimpse().

str(msleep)

tibble [83 × 11] (S3: tbl_df/tbl/data.frame)


$ name : chr [1:83] "Cheetah" "Owl monkey" ...
$ genus : chr [1:83] "Acinonyx" "Aotus" ...
$ vore : chr [1:83] "carni" "omni" ...
$ order : chr [1:83] "Carnivora" "Primates" ...
$ conservation: chr [1:83] "lc" NA ...
$ sleep_total : num [1:83] 12.1 17 14.4 14.9 ...
$ sleep_rem : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 ...
$ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
$ awake : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 ...
$ brainwt : num [1:83] NA 0.0155 NA 0.00029 ...
$ bodywt : num [1:83] 50 0.48 1.35 0.019 600 ...

2.3 Python: Descriptive Statistics and Utility Functions


With Python, we have similar utility functions providing descriptive statistics and other general
information about a data frame or a series. In Table 2.2, some of the main utility functions are
listed.

Table 2.2 Python utility functions.

Function Description

.describe() Its main function is to obtain descriptive statistics. For each numerical
column, it shows the number of values, maximum and minimum values,
arithmetic mean and median, and quartiles.
.info() It provides particularly useful information like the size of the data frame,
and for each column its name, the type (object for characters, int64 for
integers, float64 for floating point numbers, and bool if logical), and the
number of non-null values (meaning that the column length minus the
number of non-null values gives the number of missing values in a
column).
.head() They visualize the topmost (head) or bottommost (tail) rows of a data
.tail() frame. It is possible to specify the number of rows to show (e.g.
df.head(10)); otherwise the default applies (i.e. five rows).
.unique() It returns a list of unique values in a series. Particularly useful when
applied to columns as df['col_name'].unique().
.columns They return, respectively, the list of column names and the list of names
.index of the row index. Column names are formally the names of the column
index, and both the row and the column index may have multi-index
names.
.dtypes It returns the list of columns with the corresponding data type. The same
information is included in those returned by .info().
.size It returns the length of a Python object, such as the number of elements
in an array or a data frame. If missing values are present, they are
included in the total length.
.shape It returns the number of rows and columns of a data frame as a tuple.
To retrieve a single dimension, it can be referenced as .shape[0] for the
number of rows, and .shape[1] for the number of columns.

In the standard configuration of Python and of its typical data science libraries, there are no
predefined datasets to use for exercising.
Pandas versions previous to 2.0.0 make it possible to create test data frames with ran-
dom values by means of functions pd.util.testing.makeMixedDataFrame() and
pd.util.testing.makeMissingDataframe(). With the first one, a small and very simple
data frame is produced, and the second produces a little larger data frame with also missing values.
To try the utility functions before reading actual datasets, we could try the two generating func-
tions, save the result, and test the utility functions.

test1= pd.util.testing.makeMixedDataFrame()
test1
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07

test2= pd.util.testing.makeMissingDataframe()
test2
A B C D
tW1QQvy0vf 0.451947 0.595209 0.233377 NaN
UCIUoAMHgo -1.627037 -1.116419 -0.393027 0.188878
SCc6D4RLxc 0.077580 -0.884746 0.688926 1.475203
0gTyFDzQli -0.125091 -0.533044 0.847568 -0.110436
InfV0yg8IH 0.575489 NaN -0.070264 -0.928023
S0o4brfQXb -0.965100 -1.368942 -0.358428 0.487762
CDmeMkic4o -0.348701 -0.427534 1.636490 -1.444168
OCi7RQZXaB 1.271422 1.216927 -0.232399 -0.985385
XEQvFbfp0X 0.207598 NaN -0.417492 -0.087897
UBt6uuJrsi -0.571392 -2.824272 0.200751 -0.778646
XPQTn1MN1N 0.725473 0.554177 1.520446 0.599409
saxiRCPV8f -0.351244 1.338322 -0.514414 -0.333148

The two results, data frames test1 and test2, could be inspected with the functions we have pre-
sented; particular attention should be paid to the number of columns of test2 (four, not five, as
could be mistakenly believed at first sight) and the index, which has no name but several values
(e.g. tW1QQvy0vf, UCIUoAMHgo, and SCc6D4RLxc). The unique() method in this case will not
be useful because values, being random, are likely all different; it is better to use it only on test1.
It is to be observed that the notation used by Python to visually represent the missing values in
test2 is NaN, which stands for Not a Number. This may induce one to think that there should be
different representations for the different data types. Luckily, this is not the case, so we will still see
NaN even when missing values are in an object (character) or logical (Boolean) column. We may see
NaT (Not a Time) for missing values in datetime column, but NaN and NaT are fully compatible, so
we should not worry too much. Python provides a more general keyword for missing values – None,
which does not carry the heritage of NaN as a numeric data type. Strictly speaking, NaN should be
used for numerical data types, while None for nonnumerical ones; however, in practice, they have
become much like equivalent notations, and especially when using pandas, NaN is the default for
all missing values. In short, NaN is fine, it could be None too.
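
A tiny sketch shows this equivalence in pandas:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, None])

# In a numerical series, pandas converts None to NaN as well
print(s.isna())    # False, True, True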
In the newer Pandas version 2.0.0+, the two functions for testing have been deprecated and are
no longer available in the official testing module. It is still possible to use them by accessing the
internal _testing module, such as – test= pd._testing.makeMixedDataFrame().
This could easily change in future versions of pandas; therefore, it might be better to be patient
a little longer and test the utility functions on real datasets when, in the next chapter, we will learn
how to read them.
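
Alternatively, without waiting, a small test data frame can simply be built by hand (a minimal sketch: the values are invented) and inspected with the utility functions of Table 2.2:

import numpy as np
import pandas as pd

# A small hand-made data frame mixing types and a missing value
test = pd.DataFrame({'A': [0.0, 1.0, 2.0, np.nan],
                     'B': ['foo1', 'foo2', None, 'foo4'],
                     'C': pd.date_range('2009-01-01', periods=4)})

test.info()        # size, column types, count of non-null values
test.describe()    # descriptive statistics of the numerical column
test.shape         # (4, 3): number of rows and columns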

Questions
2.1 (R/Python)
A simple exploratory data analysis is ...
A Needed when a thorough statistical analysis is required
B Sometimes useful
C Always useful to gather general information about the data
D Always useful and specific to the expected result of the project
(R: C)

2.2 (R/Python)
Descriptive statistics ...
A Require good statistical knowledge
B Are a synonym for full statistical analysis
C Are performed with specific statistical tools
D Require just basic statistical knowledge
(R: D)

2.3 (R/Python)
Missing values analysis is ...
A Needed when a thorough statistical analysis is required
B Sometimes useful
C Always useful to gather general information about the data
D Always useful and specific to the expected result of the project
(R: C)

2.4 (R/Python)
Missing values should be managed...
A By replacing them with actual values
B By replacing them with actual values, deleting the corresponding observations, or handling them explicitly, deciding on a case-by-case basis
C By deleting corresponding observations
D Do not care, they are irrelevant
(R: B)

2.5 (R/Python)
When handling missing values, what is the most important aspect to consider?
A Arbitrarily modifying data (either by replacing missing values or deleting corresponding
observations) is a critical operation to perform, requiring extreme care for the possible
consequences
B Being sure to replace them with true values
C Being sure not to delete observations without missing values
D There are no important aspects to consider
(R: A)

2.6 (R/Python)
What is the typical usage case for head and tail functions/methods?
A To extract the first few or the last few rows of a data frame
B To sort values in ascending or descending order
C To check the first few or the last few rows of a dataset
D To visually inspect the first few or the last few rows of a data frame
(R: D)

2.7 (R/Python)
What is the typical usage case for the unique function/method?
A To select unique values from a data frame
B To sort unique values in ascending or descending order

C To show the unique values of a column/variable


D To check the first few or the last few unique values of a data frame
(R: C)

2.8 (R/Python)
The notations NA (R) or NaN (Python) for missing values mean that ...
A A missing value is represented by the string NA (R) or NaN (Python)
B They are formal notations, but an element with a missing value has no value at all
C They are functions for handling missing values
D They are special kinds of missing values
(R: B)
Other documents randomly have
different content
riser. These two lines are of course at right angles, or, as the
carpenter would say; they are square. This string shows four
complete cuts, and part of a fifth cut for treads, and five complete
cuts for risers. The bottom of the string at W is cut off at the line of
the floor on which it is supposed to rest. The line C is the line of the
first riser. This riser is cut lower than any of the other risers,
because, as above explained, the thickness of the first tread is
always taken off it; thus, if the tread is 1½ inches thick, the riser in
this case would only require to be 6¼ inches wide, as 7¾-1½ =
6¼.
The string must be cut so that the line at W will be only 6¼
inches from the line at 8⅓, and these two lines must be parallel.
The first riser and tread having been satisfactorily dealt with, the
rest can easily be marked off by simply sliding the pitch-board along
the line A until the outer end of the line 8⅓ on the pitch-board
strikes the outer end of the line 7¾ on the string, when another
tread and another riser are to be marked off. The remaining risers
and treads are marked off in the same manner.

Fig. 17. Showing Method of Using Pitch-Board.

Sometimes there may be a little difficulty at the top of the stairs,


in fitting the string to the trimmer or joists; but, as it is necessary
first to become expert with the pitch-board, the method of trimming
the well or attaching the cylinder to the string will be left until other
matters have been discussed.
Fig. 18 shows a portion of the stairs in position. S and S show
the strings, which in this case are cut square; that is, the part of the
string to which the riser is joined is cut square across, and the butt
or end wood of the riser is seen. In this case, also, the end of the
tread is cut square off, and flush with the string and riser. Both
strings in this instance are open strings. Usually, in stairs of this kind,
the ends of the treads are rounded off similarly to the front of the
tread, and the ends project over the strings the same distance that
the front edge projects over the riser. If a moulding or cove is used
under the nosing in front, it should be carried round on the string to
the back edge of the tread and cut off square, for in this case the
back edge of the tread will be square. A riser is shown at R, and it
will be noticed that it runs down behind the tread on the back edge,
and is either nailed or screwed to the tread. This is the American
practice, though in England the riser usually rests on the tread,
which extends clear back to string as shown at the top tread in the
diagram. It is much better, however, for general purposes, that the
riser go behind the tread, as this tends to make the whole stairway
much stronger.
Housed strings are those which carry the treads and risers
without their ends being seen. In an open stair, the wall string only
is housed, the other ends of the treads and risers resting on a cut
string, and the nosings and mouldings being returned as before
described.

Fig. 18. Portion of Stair in Position.


The manner of housing is shown in
Fig. 19, in which the treads T T and the
risers R R are shown in position, secured
in place respectively by means of wedges
X X and F F, which should be well
covered with good glue before insertion
in the groove. The housings are generally
made from ½ to ⅝ inch deep, space for
the wedge being cut to suit. Fig. 19. Showing Method of
Housing Treads and Risers.
In some closed stairs in which there is
a housed string between the newels, the
string is double-tenoned into the shanks of both newels, as shown in
Fig. 20. The string in this example is made 12¾ inches wide, which
is a very good width for a string of this kind; but the thickness
should never be less than 1½ inches. The upper newel is made
about 5 feet 4 inches long from drop to top of cap. These strings are
generally capped with a subrail of some kind, on which the baluster,
if any, is cut-mitered in. Generally a groove, the width of the square
of the balusters, is worked on the top of the subrail, and the
balusters are worked out to fit into this groove; then pieces of this
material, made the width of the groove and a little thicker than the
groove is deep, are cut so as to fit in snugly between the ends of the
balusters resting in the groove. This makes a solid job; and the
pieces between the balusters may be made of any shape on top,
either beveled, rounded, or moulded, in which case much is added
to the appearance of the stairs.
Fig. 20. Showing Method of Connecting Housed String to Newels.

Fig. 21 exhibits the method of attaching the


rail and string to the bottom newel. The dotted
lines indicate the form of the tenons cut to fit
the mortises made in the newel to receive them.
Fig. 22 shows how the string fits against the
newel at the top; also the trimmer E, to which
the newel post is fastened. The string in this
case is tenoned into the upper newel post the
Fig. 21. Method of same way as into the lower one.
Connecting The open string shown in Fig. 23 is a portion
Rail and String of a finished string, showing nosings and cove
to Bottom Newel.
returned and finishing against the face of the string. Along the lower
edge of the string is shown a bead or moulding, where the plaster is
finished.
A portion of a stair of the better class is shown in Fig. 24. This is
an open, bracketed string, with returned nosings and coves and
scroll brackets. These brackets are made about ⅜ inch thick, and
may be in any desirable pattern. The end next the riser should be
mitered to suit; this will require the riser to be ⅜ inch longer than
the face of the string. The upper part of the bracket should run
under the cove moulding; and the tread should project over the
string the full ⅜ inch, so as to cover the bracket and make the face
even for the nosing and the cove moulding to fit snugly against the
end of the tread and the face of the bracket. Great care must be
taken about this point, or endless trouble will follow. In a bracketed
stair of this kind, care must be taken in placing the newel posts, and
provision must be made for the extra ⅜ inch due to the bracket. The
newel post must be set out from the string ⅜ inch, and it will then
align with the baluster.

Fig. 22. Connections of String and Trimmer at Upper Newel Post.


Fig. 23. Portion of Finished String,
Showing Returned Nosings and
Coves, also Bead Moulding.

Fig. 24. Portion of Open, Bracketed String Stair,


with Returned Nosings and Coves,
Scroll Brackets,and Bead Moulding.
We have now described several methods of dealing with strings;
but there are still a few other points connected with these members,
both housed and open, that it will be necessary to explain; before
the young workman can proceed to build a fair flight of stairs. The
connection of the wall string to the lower and upper floors, and the
manner of affixing the outer or cut string to the upper joist and to
the newel, are matters that must not be overlooked. It is the
intention to show how these things are accomplished, and how the
stairs are made strong by the addition of rough strings or bearing
carriages.

Fig. 25. Side Elevation of Part of Stair with


Open, Cut and Mitered String.
Fig. 26. Plan of Part of Stair Shown in Fig. 25.
Fig. 25 gives a side view of part of a stair of the better class, with
one open, cut and mitered string. In Fig. 26, a plan of this same
stairway, W S shows the wall string; R S, the rough string, placed
there to give the structure strength; and O S, the outer or cut and
mitered string. At A A the ends of the risers are shown, and it will be
noticed that they are mitered against a vertical or riser line of the
string, thus preventing the end of the riser from being seen. The
other end of the riser is in the housing in the wall string. The outer
end of the tread is also mitered at the nosing, and a piece of
material made or worked like the nosing is mitered against or
returned at the end of the tread. The end of this returned piece is
again returned on itself back to the string, as shown at N in Fig. 25.
The moulding, which is ⅝-inch cove in this case, is also returned on
itself back to the string.
The mortises shown at B B B B (Fig. 26), are for the balusters. It
is always the proper thing to saw the ends of the treads ready for
the balusters before the treads are attached to the string; then,
when the time arrives to put up the rail, the back ends of the
mortises can be cut out, when the treads will be ready to receive the
balusters. The mortises are dovetailed, and, of course, the tenons on
the balusters must be made to suit. The treads are finished on the
bench; and the return nosings are fitted to them and tacked on, so
that they may be taken off to insert the balusters when the rail is
being put in position.
Fig. 27 shows the manner in which a wall string is finished at the
foot of the stairs. S shows the string, with moulding wrought on the
upper edge. This moulding may be a simple ogee, or may consist of
a number of members; or it may be only a bead; or, again, the edge
of the string may be left quite plain; this will be regulated in great
measure by the style of finish in the hall or other part of the house
in which the stairs are placed. B shows a portion of a baseboard, the
top edge of which has the same finish as the top edge of the string.
B and A together show the junction of the string and base. F F show
blocks glued in the angles of the steps to make them firm and solid.

Fig. 27. Showing How Wall String


is Finished at Foot of Stair.
Fig. 28. Showing How Wall String
is Finished at Top of Stair.
Fig. 28 shows the manner in which the wall string S is finished at
the top of the stairs. It will be noticed that the moulding is worked
round the ease-off at A to suit the width of the base at B. The string
is cut to fit the floor and to butt against the joist. The plaster line
under the stairs and on the ceiling, is also shown.
Fig. 29. Showing How a Cut or Open String
is Finished at Foot of Stair.
Fig. 29 shows a cut or open string at the foot of a stairway, and
the manner of dealing with it at its junction with the newel post K.
The point of the string should be mortised into the newel 2 inches, 3
inches, or 4 inches, as shown by the dotted lines; and the mortise in
the newel should be cut near the center, so that the center of the
baluster will be directly opposite the central line of the newel post.
The proper way to manage this, is to mark the central line of the
baluster on the tread, and then make this line correspond with the
central line of the newel post. By careful attention to this point,
much trouble will be avoided where a turned cap is used to receive
the lower part of the rail.
The lower riser in a stair of this kind will be somewhat shorter
than the ones above it, as it must be cut to fit between the newel
and the wall string. A portion of the tread, as well as of the riser, will
also butt against the newel, as shown at W.
If there is no spandrel or wall under the open string, it may run
down to the floor as shown by the dotted line at O. The piece O is
glued to the string, and the moulding is worked on the curve. If
there is a wall under the string S, then the base B, shown by the
dotted lines, will finish against the string, and it should have a
moulding on its upper edge, the same as that on the lower edge of
the string, if any, this moulding being mitered into the one on the
string. When there is a base, the piece O is of course dispensed
with.
The square of the newel should run down by the side of a joist
as shown, and should be firmly secured to the joist either by spiking
or by some other suitable device. If the joist runs the other way, try
to get the newel post against it, if possible, either by furring out the
joist or by cutting a portion off the thickness of the newel. The
solidity of a stair and the firmness of the rail, depend very much
upon the rigidity of the newel post. The above suggestions are
applicable where great strength is required, as in public buildings. In
ordinary work, the usual method is to let the newel rest on the floor.

Fig. 30. Showing How a Cut or Open String


is Finished at Top of Stair.
Fig. 30 shows how the cut string is finished at the top of the
stairs. This illustration requires no explanation after the instructions
already given.
Thus far, stairs having a newel only at the bottom have been
dealt with. There are, however, many modifications of straight and
return stairs which have from two to four or six newels. In such
cases, the methods of treating strings at their finishing points must
necessarily be somewhat different from those described; but the
general principles, as shown and explained, will still hold good.

Well-Hole.
Before proceeding to describe and illustrate neweled stairs, it will
be proper to say something about the well-hole, or the opening
through the floors, through which the traveler on the stairs ascends
or descends from one floor to another.
Fig. 31 shows a well-hole, and the manner of trimming it. In this
instance the stairs are placed against the wall; but this is not
necessary in all cases, as the well-hole may be placed in any part of
the building.
The arrangement of the trimming varies according as the joists
are at right angles to, or are parallel to, the wall against which the
stairs are built. In the former case (Fig. 31, A) the joists are cut
short and tusk-tenoned into the heavy trimmer T T, as shown in the
cut. This trimmer is again tusk-tenoned into two heavy joists T J and
T J, which form the ends of the well-hole. These heavy joists are
called trimming joists; and, as they have to carry a much heavier
load than other joists on the same floor, they are made much
heavier. Sometimes two or three joists are placed together, side by
side, being bolted or spiked together to give them the desired unity
and strength. In constructions requiring great strength, the tail and
header joists of a well-hole are suspended on iron brackets.
If the opening runs parallel with the joists (Fig. 31, B), the timber
forming the side of the well-hole should be left a little heavier than
the other joists, as it will have to carry short trimmers (T J and T J)
and the joists running into them. The method here shown is more
particularly adapted to brick buildings, but there is no reason why
the same system may not be applied to frame buildings.
Usually in cheap, frame buildings, the trimmers T T are spiked
against the ends of the joists, and the ends of the trimmers are
supported by being spiked to the trimming joists T J, T J. This is not
very workmanlike or very secure, and should not be done, as it is
not nearly so strong or durable as the old method of framing the
joists and trimmers together.
Fig. 32 shows a stair with three newels and a platform. In this
example, the first tread (No. 1) stands forward of the newel post
two-thirds of its width. This is not necessary in every case, but it is
sometimes done to suit conditions in the hallway. The second newel
is placed at the twelfth riser, and supports the upper end of the first
cut string and the lower end of the second cut string. The platform
(12) is supported by joists which are framed into the wall and are
fastened against a trimmer running from the wall to the newel along
the line 12. This is the case only when the second newel runs down
to the floor.
Fig. 31. Showing Ways of Trimming Well-Hole when Joists Run in
Different Directions.

If the second newel does not run to the floor, the framework
supporting the platform will need to be built on studding. The third
newel stands at the top of the stairs, and is fastened to the joists of
the second floor, or to the trimmer, somewhat after the manner of
fastening shown in Fig. 29. In this example, the stairs have 16 risers
and 15 treads, the platform or landing (12) making one tread. The
figure 16 shows the floor in the second story.
This style of stair will require a well-hole in shape about as
shown in the plan; and where strength is required, the newel at the
top should run from floor to floor, and act as a support to the joists
and trimmers on which the second floor is laid.
Perhaps the best way for a beginner to go about building a
stairway of this type, will be to lay out the work on the lower floor in
the exact place where the stairs are to be erected, making
everything full size. There will be no difficulty in doing this; and if
the positions of the first riser and the three newel posts are
accurately defined, the building of the stairs will be an easy matter.
Plumb lines can be raised from the lines on the floor, and the
positions of the platform and each riser thus easily determined. Not
only is it best to line out on the floor all stairs having more than one
newel; but in constructing any kind of stair it will perhaps be safest
for a beginner to lay out in exact position on the floor the points
over which the treads and risers will stand. By adopting this rule,
and seeing that the strings, risers, and treads correspond exactly
with the lines on the floor, many cases of annoyance will be avoided.
Many expert stair-builders, in fact, adopt this method in their
practice, laying out all stairs on the floor, including even the carriage
strings, and they cut out all the material from the lines obtained on
the floor. By following this method, one can see exactly the
requirements in each particular case, and can rectify any error
without destroying valuable material.
Fig. 32. Stair with Three Newels and a Platform.

Laying Out.
In order to afford the student a clear idea of what is meant by
laying out on the floor, an example of a simple close-string stair is
given. In Fig. 33, the letter F shows the floor line; L is the landing or
platform; and W is the wall line. The stair is to be 4 feet wide over
strings; the landing, 4 feet wide; the height from floor to landing, 7
feet; and the run from start to finish of the stair, 8 feet 8½ inches.
The first thing to determine is the dimensions of the treads and
risers. The wider the tread, the lower must be the riser, as stated
before. No definite dimensions for treads and risers can be given, as
the steps have to be arranged to meet the various difficulties that
may occur in the working out of the construction; but a common
rule is this: Make the width of the tread, plus twice the rise, equal to
24 inches. This will give, for an 8-inch tread, an 8-inch rise; for a 9-
inch tread, a 7½-inch rise; for a 10-inch tread, a 7-inch rise, and so
on. Having the height (7 feet) and the run of the flight (8 feet 8½-
inches), take a rod about one inch square, and mark on it the height
from floor to landing (7 feet), and the length of the going or run of
the flight (8 feet 8½ inches). Consider now what are the dimensions
which can be given to the treads and risers, remembering that there
will be one more riser than the number of treads. Mark off on the
rod the landing, forming the last tread. If twelve risers are desired,
divide the height (namely, 7 feet) by 12, which gives 7 inches as the
rise of each step. Then divide the run (namely, 8 feet 8½ inches) by
11, and the width of the tread is found to be 9½ inches.
Great care must be taken in making the pitch-board for marking
off the treads and risers on the string. The pitch-board may be made
from dry hardwood about ⅜-inch thick. One end and one side must
be perfectly square to each other; on the one, the width of the tread
is set off, and on the other the height of the riser. Connect the two
points thus obtained, and saw the wood on this line. The addition of
a gauge-piece along the longest side of the triangular piece,
completes the pitch-board, as was illustrated in Fig. 15.
The length of the wall and outer string can be ascertained by
means of the pitch-board. One side and one edge of the wall string
must be squared; but the outer string must be trued all round. On
the strings, mark the positions of the treads and risers by using the
pitch-board as already explained (Fig. 17). Strings are usually made
11 inches wide, but may be made 12½ inches wide if necessary for
strength.
Fig. 33. Method of Laying Out a Simple, Close-String Stair.

After the widths of risers and treads have been determined, and
the string is ready to lay out, apply the pitch-board, marking the first
riser about 9 inches from the end; and number each step in
succession. The thickness of the treads and risers can be drawn by
using thin strips of hardwood made the width of the housing
required. Now allow for the wedges under the treads and behind the
risers, and thus find the exact width of the housing, which should be
about ⅝-inch deep; the treads and risers will require to be made
1¼ inches longer than shown in the plan, to allow for the housings
at both ends.
Before putting the stair together, be sure that it can be taken into
the house and put in position without trouble. If for any reason it
cannot be put in after being put together, then the parts must be
assembled, wedged, and glued up at the spot.
It is essential in laying out a plan on the floor, that the exact
positions of the first and last risers be ascertained, and the height of
the story wherein the stair is to be placed. Then draw a plan of the
hall or other room in which the stairs will be located, including
surrounding or adjoining parts of the room to the extent of ten or
twelve feet from the place assigned for the foot of the stair. All the
doorways, branching passages, or windows which can possibly come
in contact with the stair from its commencement to its expected
termination or landing, must be noted. The sketch must necessarily
include a portion of the entrance hall in one part, and of the lobby or
landing in another, and on it must be laid out all the lines of the stair
from the first to the last riser.
The height of the story must next be exactly determined and
taken on the rod; then, assuming a height of risers suitable to the
place, a trial is made by division in the manner previously explained,
to ascertain how often this height is contained in the height of the
story. The quotient, if there is no remainder, will be the number of
risers required. Should there be a remainder on the first division, the
operation is reversed, the number of inches in the height being
made the dividend and the before-found quotient the divisor; and
the operation of reduction by division is carried on till the height of
the riser is obtained to the thirty-second part of an inch. These
heights are then set off as exactly as possible on the story rod, as
shown in Fig. 33.
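The trial division and its reversal are easy to follow in code. The
sketch below assumes the height of the story is known in inches and
that a trial rise has been chosen to suit the place; the story used
in the example (9 feet 4 inches, with a trial rise of 7½ inches) is
invented for illustration, and the rounding to thirty-seconds
mirrors the text.

    from fractions import Fraction

    def story_rod(height_in, trial_rise_in):
        # First division: how often the trial rise is contained
        # in the height of the story.
        risers = round(height_in / trial_rise_in)
        # Reversed division: the height becomes the dividend and
        # the quotient the divisor, carried to the nearest 1/32 inch.
        rise = Fraction(round(Fraction(height_in) / risers * 32), 32)
        return risers, rise

    # A story 9 feet 4 inches (112 inches) high, trial rise 7.5 in.
    print(story_rod(112, 7.5))  # 15 risers of 239/32 = 7 15/32 in.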
The next operation is to show the risers on the sketch. This the
workman will find no trouble in arranging, and no arbitrary rule can
be given.
A part of the foregoing may appear to be repetition; but it is not,
for it must be remembered that scarcely any two flights of stairs are
alike in run, rise, or pitch, and any departure in any one dimension
from these conditions leads to a new series of dimensions that must
be dealt with independently. The principle laid down, however,
applies to all straight flights of stairs; and the student who has
followed closely and retained the pith of what has been said, will, if
he has a fair knowledge of the use of tools, be fairly equipped for
laying out and constructing a plain, straight stair with a straight rail.
Plain stairs may have one platform, or several; and they may turn
to the right or to the left, or, rising from a platform or landing, may
run in an opposite direction from their starting point.
When two flights are necessary for a story, it is desirable that
each flight should consist of the same number of steps; but this, of
course, will depend on the form of the staircase, the situation and
height of doors, and other obstacles to be passed under or over, as
the case may be.
In Fig. 32, a stair is shown with a single platform or landing and
three newels. The first part of this stair corresponds, in number of
risers, with the stair shown in Fig. 33; the second newel runs down
to the floor, and helps to sustain the landing. This newel may simply
be a 4 by 4-inch post, or the whole space may be inclosed with the
spandrel of the stair. The second flight starts from the platform just
as the first flight starts from the lower floor, and both flights may be
attached to the newels in the manner shown in Fig. 29. The bottom
tread in Fig. 32 is rounded off against the square of the newel post;
but this cannot well be done if the stairs start from the landing, as the
tread would project too far onto the platform. Sometimes, in high-
class stairs, provision is made for the first tread to project well onto
the landing.
If there are more platforms than one, the principles of
construction will be the same; so that whenever the student grasps
the full conditions governing the construction of a single-platform
stair, he will be prepared to lay out and construct the body of any
stair having one or more landings. The method of laying out,
making, and setting up a hand-rail will be described later.
Stairs formed with treads each of equal width at both ends, are
named straight flights; but stairs having treads wider at one end
than the other are known by various names, as winding stairs, dog-
legged stairs, circular stairs, or elliptical stairs. A tread with parallel
sides, having the same width at each end, is called a flyer; while one
having one wide end and one narrow, is called a winder. These
terms will often be made use of in what follows.
The elevation and plan of the stair shown in Fig. 34 may be
called a dog-legged stair with three winders and six flyers. The
flyers, however, may be extended to any number. The housed strings
to receive the winders are shown. These strings show exactly the
manner of construction. The shorter string, in the corner from 1 to
4, which is shown in the plan to contain the housing of the first
winder and half of the second, is put up first, the treads being
leveled by aid of a spirit level; and the longer upper string is put in
place afterwards, butting snugly against the lower string in the
corner. It is then fastened firmly to the wall. The winders are cut
snugly around the newel post, and well nailed. Their risers will stand
one above another on the post; and the straight string above the
winders will enter the post on a line with the top edge of the
uppermost winder.
Fig. 34. Elevation and Plan of Dog-Legged Stair
with Three Winders and Six Flyers.

Platform stairs are often constructed so that one flight will run in
a direction opposite to that of the other flight, as shown in Fig. 35. In
cases of this kind, the landing or platform requires to have a length
more than double that of the treads, in order that both flights may
have the same width. Sometimes, however, and for various reasons,
the upper flight is made a little narrower than the lower; but this
expedient should be avoided whenever possible, as its adoption
unbalances the stairs. In the example before us, eleven treads, not
including the landing, run in one direction; while four treads,
including the landing, run in the opposite direction; or, as workmen
put it, the stair “returns on itself.” The elevation shown in Fig. 36
illustrates the manner in which the work is executed. The various
parts are shown as follows:

Fig. 35. Plan of Platform Stair Returning on Itself.

Fig. 36. Elevation Showing Construction of Platform Stair
of which Plan is Given in Fig. 35.

Fig. 37. Section of Top Landing, Baluster, and Rail.

Fig. 37 is a section of the top landing, with baluster and rail.
Fig. 38 is part of the long newel, showing mortises for the strings.
Fig. 39 represents part of the bottom newel, showing the string,
moulding on the outside, and cap.
Fig. 40 is a section of the top string, enlarged.
Fig. 41 is the newel at the bottom, as cut out to receive the bottom
step.

It must be remembered that there is a cove under each tread. This
may be nailed in after the stairs are put together, and it adds
greatly to the appearance.