The Essentials of
Data Science
Knowledge Discovery Using R
Chapman & Hall/CRC
The R Series
Series Editors
John M. Chambers, Department of Statistics, Stanford University, Stanford, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, Davis, California, USA
Hadley Wickham, RStudio, Boston, Massachusetts, USA
Aims and Scope
This book series reflects the recent rapid growth in the development and application
of R, the programming language and software environment for statistical computing
and graphics. R is now widely used in academic research, education, and industry.
It is constantly growing, with new versions of the core software released regularly
and more than 10,000 packages available. It is difficult for the documentation to
keep pace with the expansion of the software, and this vital book series provides a
forum for the publication of books covering many aspects of the development and
application of R.
The scope of the series is wide, covering three main threads:
• Applications of R to specific disciplines such as biology, epidemiology,
genetics, engineering, finance, and the social sciences.
• Using R for the study of topics of statistical methodology, such as linear and
mixed modeling, time series, Bayesian methods, and missing data.
• The development of R, including programming, building packages, and
graphics.
The books will appeal to programmers and developers of R software, as well as
applied statisticians and data analysts in many fields. The books will feature
detailed worked examples and R code fully integrated into the text, ensuring their
usefulness to researchers, practitioners and students.
Published Titles
Stated Preference Methods Using R, Hideo Aizaki, Tomoaki Nakatani,
and Kazuo Sato
Using R for Numerical Analysis in Science and Engineering,
Victor A. Bloomfield
Event History Analysis with R, Göran Broström
Extending R, John M. Chambers
Computational Actuarial Science with R, Arthur Charpentier
Testing R Code, Richard Cotton
The R Primer, Second Edition, Claus Thorn Ekstrøm
Statistical Computing in C++ and R, Randall L. Eubank and
Ana Kupresanin
Basics of Matrix Algebra for Statistics with R, Nick Fieller
Reproducible Research with R and RStudio, Second Edition,
Christopher Gandrud
R and MATLAB®, David E. Hiebeler
Statistics in Toxicology Using R, Ludwig A. Hothorn
Nonparametric Statistical Methods Using R, John Kloke and
Joseph McKean
Displaying Time Series, Spatial, and Space-Time Data with R,
Oscar Perpiñán Lamigueiro
Programming Graphical User Interfaces with R, Michael F. Lawrence
and John Verzani
Analyzing Sensory Data with R, Sébastien Lê and Thierry Worch
Parallel Computing for Data Science: With Examples in R, C++
and CUDA, Norman Matloff
Analyzing Baseball Data with R, Max Marchi and Jim Albert
Growth Curve Analysis and Visualization Using R, Daniel Mirman
R Graphics, Second Edition, Paul Murrell
Introductory Fisheries Analyses with R, Derek H. Ogle
Data Science in R: A Case Studies Approach to Computational
Reasoning and Problem Solving, Deborah Nolan and Duncan Temple Lang
Multiple Factor Analysis by Example Using R, Jérôme Pagès
Customer and Business Analytics: Applied Data Mining for Business
Decision Making Using R, Daniel S. Putler and Robert E. Krider
Flexible Regression and Smoothing: Using GAMLSS in R,
Mikis D. Stasinopoulos, Robert A. Rigby, Gillian Z. Heller, Vlasios Voudouris,
and Fernanda De Bastiani
Implementing Reproducible Research, Victoria Stodden, Friedrich Leisch,
and Roger D. Peng
Graphical Data Analysis with R, Antony Unwin
Using R for Introductory Statistics, Second Edition, John Verzani
Advanced R, Hadley Wickham
The Essentials of Data Science: Knowledge Discovery Using R,
Graham J. Williams
bookdown: Authoring Books and Technical Documents with R Markdown,
Yihui Xie
Dynamic Documents with R and knitr, Second Edition, Yihui Xie
The Essentials of
Data Science
Knowledge Discovery Using R
Graham J. Williams
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20170616
International Standard Book Number-13: 978-1-138-08863-4 (Paperback)
International Standard Book Number-13: 978-1-4987-4000-5 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
For Catharina
Anam Cara
To Sean and Anita
Quiet lights that shine
Blessings
Preface
From data we derive information and by combining different bits
of information we build knowledge. It is then with wisdom that we
deploy knowledge into enterprises, governments, and society. Data
is core to every organisation as we continue to digitally capture
volumes and a variety of data at an unprecedented velocity. The
demand for data science continues to grow substantially with a
shortfall of data scientists worldwide.
Professional data scientists combine a good grounding in computer
science and statistics with an ability to explore through the
space of data to make sense of the world. Data science relies on
their aptitude and art for observation, mathematics, and logical
reasoning.
This book introduces the essentials of data analysis and machine
learning as the foundations for data science. It uses the free
and open source software R (R Core Team, 2017) which is freely
available to anyone. All are permitted, and indeed encouraged, to
read the source code to learn, understand, verify, and extend it.
Being open source we also have the assurance that the software
will always be available. R is supported by a worldwide network
of some of the world’s leading statisticians and professional data
scientists.
Features
A key feature of this book, differentiating it from other textbooks
on data science, is the focus on the hands-on end-to-end process.
It covers data analysis including loading data into R, wrangling
the data to improve its quality and utility, visualising the data to
gain understanding and insight, and, importantly, using machine
learning to discover knowledge from the data.
This book brings together the essentials of doing data science
based on over 30 years of the practice and teaching of data science.
It presents a programming-by-example approach that allows
students to quickly achieve outcomes whilst building a skill set
and knowledge base, without getting sidetracked into the details
of programming.
The book systematically develops an end-to-end process flow
for data science. It focuses on creating templates to support those
activities. The templates serve as a starting point and can readily
incorporate different datasets with minimal change to the scripts
or programs. The templates are incrementally introduced in two
chapters (Chapter 3 for data analysis and Chapter 7 for predictive
machine learning) with supporting chapters demonstrating their
usage.
Production and Typographical Conventions
This book has been typeset by the author using LaTeX and R’s
knitr (Xie, 2016). All R code segments included in the book are
run at the time of typesetting the book and the results displayed
are directly and automatically obtained from R itself.
Because all R code and screenshots are automatically gener-
ated, the output we see in the book should be reproducible by the
reader. All code is run on a 64-bit deployment of R on an Ubuntu
GNU/Linux system. Running the same code on other systems
(particularly on 32-bit systems) may result in slight variations
in the results of the numeric calculations performed by R.
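As a rough check of the platform on which code is being run, base R exposes the relevant details. The following is a minimal sketch of our own rather than code from the book: the variable names are illustrative, and the tolerance-based comparison using all.equal() is one common way to guard against small floating-point differences between systems.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
is_64bit <- pointer_bytes == 8

# Smallest positive number x such that 1 + x != 1 in double precision.
eps <- .Machine$double.eps

# An exact comparison of floating-point results is fragile...
exact_match <- (0.1 + 0.2) == 0.3

# ...whereas all.equal() compares the values within a small tolerance.
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))

cat("64-bit R:", is_64bit, "\n")
cat("Machine epsilon:", eps, "\n")
cat("Exact match:", exact_match, " Tolerant match:", close_match, "\n")
```

When comparing our own results with those printed in the book, a tolerant comparison of this kind is usually more informative than testing for exact equality.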
Sample code used to illustrate the interactive sessions using
R does not include the R prompt, which by default is “> ”. Nor
does it include the usual continuation prompt, which by default
consists of “+ ”. The continuation prompt is used by R when a
single command extends over multiple lines to indicate that R is
still waiting for input from the user. For our purposes, including
the continuation prompt makes it more difficult to cut-and-paste
from the examples in the electronic version of the book.
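To illustrate the convention, the following small sketch (our own example, not one from the book) shows a single command spread over three lines. At the console R would display “> ” before the first line and “+ ” before the second and third. Both are omitted here so that the code can be pasted directly into R.

```r
# One command spread over several lines. The unclosed parenthesis tells R
# that the command is incomplete, so the console shows "+ " until it closes.
total <- sum(1,
             2,
             3)

# Print the result.
total
## [1] 6
```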
R code examples will appear as code blocks like the example
code block shown below. The code block here uses
rattle::rattleInfo() to report on the versions of the R software
and many packages used at the time of compiling this book.
rattle::rattleInfo()
## Rattle: version 5.0.14 CRAN 4.1.0
## R: version 3.4.0 (2017-04-21)
##
## Sysname: Linux
## Release: 4.10.0-22-generic
## Version: #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017
## Nodename: leno
## Machine: x86_64
## Login: gjw
## User: gjw
## Effective_user: gjw
##
## Installed Dependencies
## ada: version 2.0-5
## amap: version 0.8-14
## arules: version 1.5-2
## biclust: version 1.2.0
## bitops: version 1.0-6
## cairoDevice: version 2.24
## cba: version 0.2-19
## cluster: version 2.0.6
## colorspace: version 1.3-2
## corrplot: version 0.77
## descr: version 1.1.3
## doBy: version 4.5-15
## dplyr: version 0.7.0
....
In providing example output from commands, at times long
lines and long output will be replaced with ... and .... respectively.
While most examples will illustrate the output exactly as it
appears in R, there will be times where the format will be modified
slightly to fit publication limitations. This might involve removing
or adding blank lines.
The R code as well as the templates are available from the
book’s web site at https://essentials.togaware.com.
Currency
New versions of R are released regularly and as R is free and
open source software a sensible approach is to upgrade whenever
possible. This is common practice in the open source community,
maintaining systems with the latest “patch level” of the software.
This will ensure tracking of bug fixes, security patches, and new
features.
The above code block identifies that version 3.4.0 of R is used
throughout this book.
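A quick way to confirm which version of R is installed, and whether it is at least the version used here, is sketched below using only base R. The variable names are our own, and the comparison against 3.4.0 simply reflects the version reported in the code block above.

```r
# The version of the running R as a comparable version object.
current <- getRversion()

# The version of R reported by rattle::rattleInfo() in this book.
book_version <- package_version("3.4.0")

# TRUE when the installed R is the same as or newer than the book's version.
up_to_date <- current >= book_version

cat("Running R", as.character(current), "\n")
cat("At least 3.4.0:", up_to_date, "\n")
```

Installed packages can similarly be brought up to date from within R using update.packages().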
Acknowledgments
This book is a follow-on from the Rattle book (Williams, 2011).
Whilst the Rattle book introduces data mining with limited exposure
to the underlying R code, this book begins the journey into
coding with R. As with the Rattle book, this book came about from
a desire to share experiences in using and deploying data science
tools and techniques through R. The material draws from the practice
of data science as well as from material developed for teaching
machine learning, data mining, and data science to undergraduate
and graduate students and for professionals developing new skills.
Colleagues including budding and experienced data scientists
have provided the motivation for the sharing of these accessible
templates and reference material. Thank you.
With gratitude I thank my wife, Catharina, and children, Sean
and Anita, who have supported and encouraged my enthusiasm
for open source software and data science.
Graham J. Williams
Contents
Preface
List of Figures
List of Tables
1 Data Science
1.1 Exercises
2 Introducing R
2.1 Tooling For R Programming
2.2 Packages and Libraries
2.3 Functions, Commands and Operators
2.4 Pipes
2.5 Getting Help
2.6 Exercises
3 Data Wrangling
3.1 Data Ingestion
3.2 Data Review
3.3 Data Cleaning
3.4 Variable Roles
3.5 Feature Selection
3.6 Missing Data
3.7 Feature Creation
3.8 Preparing the Metadata
3.9 Preparing for Model Building
3.10 Save the Dataset
3.11 A Template for Data Preparation
3.12 Exercises
4 Visualising Data
4.1 Preparing the Dataset
4.2 Scatter Plot
4.3 Bar Chart
4.4 Saving Plots to File
4.5 Adding Spice to the Bar Chart
4.6 Alternative Bar Charts
4.7 Box Plots
4.8 Exercises
5 Case Study: Australian Ports
5.1 Data Ingestion
5.2 Bar Chart: Value/Weight of Sea Trade
5.3 Scatter Plot: Throughput versus Annual Growth
5.4 Combined Plots: Port Calls
5.5 Further Plots
5.6 Exercises
6 Case Study: Web Analytics
6.1 Sourcing Data from CKAN
6.2 Browser Data
6.3 Entry Pages
6.4 Exercises
7 A Pattern for Predictive Modelling
7.1 Loading the Dataset
7.2 Building a Decision Tree Model
7.3 Model Performance
7.4 Evaluating Model Generality
7.5 Model Tuning
7.6 Comparison of Performance Measures
7.7 Save the Model to File
7.8 A Template for Predictive Modelling
7.9 Exercises
8 Ensemble of Predictive Models
8.1 Loading the Dataset
8.2 Random Forest
8.3 Extreme Gradient Boosting
8.4 Exercises
9 Writing Functions in R
9.1 Model Evaluation
9.2 Creating a Function
9.3 Function for ROC Curves
9.4 Exercises
10 Literate Data Science
10.1 Basic LaTeX Template
10.2 A Template for our Narrative
10.3 Including R Commands
10.4 Inline R Code
10.5 Formatting Tables Using Kable
10.6 Formatting Tables Using XTable
10.7 Including Figures
10.8 Add a Caption and Label
10.9 Knitr Options
10.10 Exercises
11 R with Style
11.1 Why We Should Care
11.2 Naming
11.3 Comments
11.4 Layout
11.5 Functions
11.6 Assignment
11.7 Miscellaneous
11.8 Exercises
Bibliography
Index
List of Figures
2.1 RStudio: Initial layout.
2.2 RStudio: Ready to program in R.
2.3 RStudio: Running the R program.
2.4 Daily temperature 3pm.
3.1 Target variable distribution
4.1 Scatter plot of the weatherAUS dataset
4.2 Bar Chart
4.3 Stacked bar chart
4.4 A decorated stacked bar chart
4.5 A decorated stacked filled bar chart
4.6 Multiple bars with overlapping labels
4.7 Rotating labels in a plot
4.8 Rotating the plot
4.9 Reordering labels
4.10 A traditional box and whiskers plot
4.11 A violin plot
4.12 A violin plot with a box plot overlay
4.13 Violin/box plot by location
4.14 Visualise the first set of clustered locations
4.15 Visualise the second set of clustered locations
5.1 Faceted dodged bar plot.
5.2 Faceted dodged bar plot.
5.3 Labelled scatter plot with inset
5.4 Labelled scatter plot
5.5 Faceted bar plot with embedded bar plot
5.6 Horizontal bar chart
5.7 Horizontal bar chart with multiple stacks
5.8 Simple bar chart with dodged and labelled bars
6.1 Month by month external browser visits.
6.2 Month by month internal browser visits.
6.3 Views and visits per month
6.4 Views and visits per month (log scale)
6.5 Faceted plot of external and internal visits/views
7.1 Decision tree variable importance
7.2 Decision tree visualisation
7.3 ROC curve for decision tree over training dataset
7.4 Risk chart for rpart on training dataset.
7.5 ROC curve for decision tree over validation dataset
7.6 Risk chart for rpart on validation dataset.
7.7 An ROC curve for a decision tree on the testing dataset
7.8 A risk chart for the testing dataset
8.1 Random forest variable importance
8.2 ROC for random forest over validation dataset
8.3 Risk chart random forest validation dataset
8.4 Random forest ROC over training dataset
8.5 Random forest risk chart over training dataset
8.6 Extreme gradient boosting variable importance
8.7 ROC for extreme gradient boosting
8.8 Risk chart for extreme gradient boosting
9.1 ROC curve plotted using our own aucplot()
9.2 ROC curve with a caption
10.1 Creating a new R Sweave document in RStudio.
10.2 Ready to compile to PDF within RStudio.
10.3 Resulting PDF Document.
10.4 The 3pm temperature for four locations
List of Tables
6.1 External versus internal visits.
6.2 External versus internal browsers.
7.1 Performance measures for decision tree model.
8.1 Performance measures for the random forest model.
8.2 Performance measures extreme gradient boosting
10.1 Example xtable.
10.2 Remove row numbers.
10.3 Decimal points.
10.4 Large numbers.
10.5 Large numbers formatted.
10.6 Extended caption.
1 Data Science
Over the past decades we have progressed toward today’s capability
to identify, collect, and store electronically a massive amount
of data. Today we are data rich, information driven, and knowledge
hungry, though, we may argue, wisdom scant. Data surrounds us
everywhere we look. Data exhibits every facet of everything we
know and do. We are today capturing and storing a subset of this
data electronically, converting the data that surrounds us by digitising
it to make it accessible for computation and analysis. We
are digitising data at a rate we have never before been capable of.
There is now so much data captured and even more yet to come
that much of it remains to be analysed and fully utilised.
Data science is a broad tag capturing the endeavour of analysing
data to derive information and, from that information, knowledge.
Data scientists apply an ever-changing and vast collection of
techniques and technology from mathematics, statistics, machine
learning and artificial intelligence to decompose complex problems
into smaller tasks to deliver insight and knowledge. The knowledge
captured from the smaller tasks can then be synthesised with wisdom
to form an understanding of the whole and to drive the development
of today’s intelligent computer-based applications.
The role of a data scientist is to perform the transformations
that make sense of the data in an evidence-based endeavour delivering
the knowledge deployed with wisdom. Data scientists resolve
the obscure into the known.* Such a synthesis delivers real benefit
from the science: benefit for business, industry, government,
environment, and humanity in general. Indeed, every organisation
today is or should be a data-driven organisation.
* Science is analytic description, philosophy is synthetic interpretation.
Science wishes to resolve the whole into parts, the organism into organs,
the obscure into the known. (Durant, 1926)
A data scientist brings to a task a deep collection of computer
skills using a variety of tools. They also bring particularly strong
intuitions about how to tackle complex problems. Tasks are undertaken
by resolving the whole into its parts. They explore, visualise,
analyse, and model the data to then synthesise new understandings
that come together to build our knowledge of the whole. With
a desire and hunger for continual learning, we find that data scientists
are always on the lookout for opportunities to improve how
things are done—how to do better what we did yesterday.
Finding such requisite technical skills and motivation in one
person is rare—data scientists are truly scarce and the demand
for their services continues to grow as we find ourselves every day
with more data being captured from the world around us.
In this chapter we introduce the concept of the data scientist.
We identify a progression of skill from the data technician, through
data analyst and data miner, to data scientist. We also consider
how we might deploy a data science capability.
With the goal of capturing knowledge as models of our world
from data we consider the toolkits used by data scientists to do so.
We introduce the most powerful software system for data science
today, called R (R Core Team, 2017). R is open source and free
software that is available to anyone and everyone. It offers us the
freedom to use the software however we desire. Using this software
we can discover, learn, explore, experience, extend, and share the
algorithms for data science.
The Art of Data Science
As data scientists we ply the art of excavating data for knowledge
discovery (Williams, 2011). As scientists we are also truly
artists. Computer science courses over the past 30 years have
shared the foundations of programming languages, software engineering,
databases, artificial intelligence, machine learning, and
now data mining and data science. A consistent theme has been
that what we do as computer and data scientists is an art. Programming
presents us with a language through which we express
ourselves. We can use this language to communicate in a sophisticated
manner with others. Our role is not to simply write code
for computers to execute systematically but to express our views
in an elegant form which we communicate both for execution by
a computer and importantly for others to read, to marvel, and to
enjoy.
As will become evident through the pages of this book, data
scientists aim to not only turn data and information into intelligent
applications, but also to gain insight and new knowledge from
this data and information and to share these discoveries. Data
scientists must also clearly communicate in such an elegant way so
as to resolve the obscure and to make it known in a form that is
accessible and a pleasure to read—in a form that makes us proud
to share, want to read, and to continue to learn. This is the art of
data science.
The Data Scientist
A data scientist combines a deep understanding of machine learning
algorithms and statistics together with a strong foundation in
software engineering and computer science and a well-developed
ability to program with data. Data scientists cross over a variety of
application domains and use their intuition to drive discoveries. As
data scientists we experiment so as to deploy the right algorithm
implemented within the right tool suite on the right data made
available through the right infrastructure to deliver outcomes for
the right problems.
The journey to becoming a data scientist begins with a solid
background in mathematics, statistics and computer science and
an enthusiasm for software engineering and programming computers.
Their careers often begin as a data technician, where skillful use
of SQL and other data technologies including Hadoop is brought
to bear to ingest and fuse data captured from multiple sources.
A data analyst adds further value to the extracted data and
may rely on basic statistical and visual analytics supported by
business intelligence software tools. A data analyst may also
identify data quality issues and iterate with the data technician
to explore the quality and veracity of the data. The role of a data
analyst is to inform so as to support with evidence any decision
making.
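As a minimal sketch of this kind of review, the following uses airquality, a small dataset distributed with R and chosen here purely for illustration (it is not a dataset from this book), to summarise the data and to count the missing values in each variable. The variable names ds, missing, and complete are our own.

```r
# A small dataset shipped with R, used here only as an illustration.
ds <- airquality

# Basic statistical summary of each variable.
summary(ds)

# A simple data quality check: the number of missing values per variable.
missing <- colSums(is.na(ds))
missing

# Proportion of observations that have no missing values at all.
complete <- mean(complete.cases(ds))
complete
```

A count of missing values like this is often the first thing a data analyst would take back to the data technician when querying the quality of a data source.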
The journey then proceeds to the understanding of machine
learning and advanced statistics where we begin to fathom the
world based on the data we have captured and stored digitally.
We begin to program with data in building models of the world
that embody knowledge discoveries that can then improve our understanding
of the world. Data miners apply a variety of tools to
the increasingly larger volumes of data becoming more available in
a variety of formats. By building models of the world—by learning
from our interactions with the world captured through data—we
can begin to understand and to build our knowledge base from
which we can reason about the world.
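The idea of building a model from data can be sketched in a few lines. This example is our own rather than one of the book's templates: it fits a simple linear model, using lm() from the stats package, to the cars dataset that is distributed with R, and then uses the model to estimate an outcome for a given input. The names model and pred are illustrative.

```r
# Model stopping distance (ft) as a function of speed (mph) for cars.
model <- lm(dist ~ speed, data = cars)

# The fitted coefficients are the knowledge captured by the model.
coef(model)

# Use the model to estimate the stopping distance at 21 mph.
pred <- predict(model, newdata = data.frame(speed = 21))
pred
```

The same pattern of fitting a model and then predicting from it carries over to more capable model builders, though the choice of lm() here is only an assumption made to keep the sketch short.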
The final destination is the art of data science. Data scientists
are driven by intuition in exploring through data to discover the
unknown. This is not something that can be easily taught. The
data scientist brings to bear a philosophy to the knowledge they
discover. They synthesise this knowledge in different ways to give
them the wisdom to decide how to communicate and take action.
A continual desire to challenge, grow and learn in the art, and
to drive the future rather than be pushed along by it, is the final
ingredient.
It is difficult to be prescriptive about the intangible skills of
a data scientist. Through this book we develop the foundational
technical skills required to work with data. We explore the basic
skill set for the data scientist.
Through hands-on experience we will come to realise that we
need to program with our data as data scientists. Perhaps there
will be a time when intelligent systems themselves can exhibit the
required capabilities and sensitivities of today’s most skilled data
scientists, but it is not currently foreseeable. Our technology will
continue to develop and we will be able to automate many tasks in
support of the data scientist, but that intuition that distinguishes
the skilled data scientist from the prescriptive practitioner will
remain elusive.
To support the data scientists we also develop through this
book two templates for data science. These scripts provide a starting
point for the data processing and modelling phases of the data
science task. They can be reused for each new project and will
grow for each data scientist over time to capture their own style
and focus.
Creating a Data Science Capability
Creating a data science capability can be cost-effective in terms of
software and hardware requirements. The software for data science
is readily available and regularly improving. It is also generally free
(as in libre) and open source software (FLOSS). Today, even the
hardware platforms need not be expensive as we migrate computation
to the cloud where we can share resources and only consume
resources when required.
The expense in creating a data science capability is in acquiring
expert data scientists, that is, in building the team. Traditionally,
information technology organisations focused on delivering a centrally
controlled platform hosted on premise by the IT Department. Large
and expensive computers running singularly vetted and extremely
expensive statistical software suites were deployed. Pre-specified
requirements were provided through a tender process which often
took many months or even years. The traditional funding models
for many organisations preferred on-premise expenses instead
of otherwise much more cost-effective, flexible and dynamic data
science platforms combining FLOSS with cloud.
The key message from many years of an evolving data science
capability is that the focus must be on the skills of the practitioner
more than the single vendor provided software/hardware platform.
Oddly enough this is quite obvious, yet it is quite a challenge for
the era mentality of the IT Department and its role as the director
rather than the supporter of business. Recent years have seen the
message continue to be lost. Slowly, though, we continue to realise
the importance of business driving data science rather than IT
technology being the driver.
The principles of the business drivers allowing the data scient-
ists to direct the underlying support from IT, rather than vice-
versa, were captured by the Analyst First* movement in the early
2000s. Whilst we still see the technology first approach driven by
* http://analystfirst.com/core-principles/
6 1 Data Science
vested interests, many organisations are now coming to realise the
importance of placing business-driven data science before IT.
The Analyst First movement collected together principles for
guiding the implementation of a data science capability. Some of
the key principles, relevant to our environment today, can be para-
phrased as:
• A data science team can be created with minimal expense;
• Data science, done properly, is scalable;
• The human is the most essential, valuable and rare resource;
• The scientist is the focus of successful data science investment;
• Data science is not information technology;
• Data scientists require advanced/flexible software/hardware;
• There is no “standard operating environment” for data science;
• Data science infrastructure is agile, dynamic, and scalable.
It is perhaps not surprising that large organisations have
struggled with deploying data science. Traditional IT departments
have driven the provision of infrastructure for an organisation
and can become disengaged from the actual business drivers. This
has been their role traditionally: to source software and hardware
for specific tasks as they understand them, to go out to tender for
the products, then to provision and support the platform over many
years.
The traditional approach to creating a data science team is
then for the IT department, driven by business demands, to edu-
cate themselves about the technology. Often the IT department
will invite vendors with their own tool suites to tender for a single
solution. A solution is chosen, often without consulting the actual
data scientists, and implemented within the organisation. Over
many years this approach has regularly failed.
It is interesting to instead consider how an open source product
like R* has become the tool of choice today for the data scientist.
The open source community has over 30 years of experience in
delivering powerful, excellent solutions by bringing together skilled
and passionate developers with the right tools. The focus is on
allowing these solutions to work together to solve actual problems.
Since the early 1990s when R became available its popularity
has grown from a handful of users to perhaps several million users
today. No vendor has been out there selling the product. Indeed,
the entrenched vendors have had to work very hard to retain their
market position in the face of a community of users realising the
power of open source data science suites. In the end they cannot
compete with an open source community of many thousands of
developers and statisticians providing state-of-the-art technology
through free and open source software like R. Today data scientists
themselves are driving the technology requirements with a focus
on solving their own problems.
The world has moved on. We need to recognise that data sci-
ence requires flexibility and agility within an ever-changing land-
scape. Organisations have unnecessarily invested millions in on-
premise infrastructure including software and hardware. Now the
software is generally available to all and the hardware can be
sourced as and only when required.
Within this context, open source software running on computer
servers located in the cloud delivers a flexible platform of
choice for data science practitioners. Platforms in the cloud today
provide a completely managed, regularly maintained and updated,
secure and comprehensive environment for the data scientist. We
no longer require significant investment in corporately managed,
dedicated and centrally controlled IT infrastructure.
Closed and Open Source Software
Irrespective of whether software can be obtained freely through a
free download or for a fee from a vendor, an important requirement
for innovation and benefit is that the software source code
* R, a statistical software package, is the software we use throughout this
book.
be available. We should have the freedom to review the source
code to ensure the software implements the functions correctly
and accurately and to simply explore, learn, discover, and share.
Where we have the capability we should be able to change and
enhance the software to suit our ever-changing and increasingly
challenging needs. Indeed, we can then share our enhancements
with the community so that we can build on the shoulders of what
has gone before. This is what we refer to by the use of free in free
(as in libre) open source software (FLOSS). It is not a reference
to the cost of the software and indeed a vendor is quite at liberty
to charge for the software.
Today’s Internet is built on free and open source software.
Many web servers run the free and open source Apache soft-
ware. Nearly every modem and router is running the open source
GNU/Linux operating system. There are more installations of the
free and open source Linux kernel running on devices today than
any other operating system ever—Android is a free and open
source operating system running the Linux kernel. For big data
Hadoop, Spark, and their family of related products are all free
and open source software. The free and open source model has
matured significantly over the past 30 years to deliver a well-oiled
machine that today serves the software world admirably. This is
highlighted by the adoption of free and open source practices and
philosophies by today’s major internet companies.
Traditionally commercial software is closed source. This
presents challenges to the effective use and reuse of that software.
Instead of being able to build on the shoulders of those who have
gone before us, we must reinvent the wheel. Often the wheel is re-
implemented a multitude of times. Competition is not a bad thing
per se but closed source software generally hinders progress. Over
the past two decades we have witnessed a variety of excellent ma-
chine learning software products disappear. The efforts that went
into that software were lost. Instead we might recognise business
models that compensate for the investments but share the benefits
and ensure we can build on rather than reinvent.