
The Essentials of Data Science: Knowledge Discovery Using R, 1st Edition, Graham J. Williams

The document provides information about the book 'The Essentials of Data Science Knowledge Discovery Using R' by Graham J. Williams, which focuses on data analysis and machine learning using the R programming language. It emphasizes a hands-on approach to data science, covering data wrangling, visualization, and machine learning techniques. The book is part of a series aimed at supporting the growing demand for data science education and practice.


The Essentials of
Data Science
Knowledge Discovery Using R
Chapman & Hall/CRC
The R Series

Series Editors
John M. Chambers
Department of Statistics
Stanford University
Stanford, California, USA

Torsten Hothorn
Division of Biostatistics
University of Zurich
Switzerland

Duncan Temple Lang
Department of Statistics
University of California, Davis
Davis, California, USA

Hadley Wickham
RStudio
Boston, Massachusetts, USA

Aims and Scope


This book series reflects the recent rapid growth in the development and application
of R, the programming language and software environment for statistical computing
and graphics. R is now widely used in academic research, education, and industry.
It is constantly growing, with new versions of the core software released regularly
and more than 10,000 packages available. It is difficult for the documentation to
keep pace with the expansion of the software, and this vital book series provides a
forum for the publication of books covering many aspects of the development and
application of R.
The scope of the series is wide, covering three main threads:
• Applications of R to specific disciplines such as biology, epidemiology,
genetics, engineering, finance, and the social sciences.
• Using R for the study of topics of statistical methodology, such as linear and
mixed modeling, time series, Bayesian methods, and missing data.
• The development of R, including programming, building packages, and
graphics.
The books will appeal to programmers and developers of R software, as well as
applied statisticians and data analysts in many fields. The books will feature
detailed worked examples and R code fully integrated into the text, ensuring their
usefulness to researchers, practitioners and students.
Published Titles
Stated Preference Methods Using R, Hideo Aizaki, Tomoaki Nakatani,
and Kazuo Sato
Using R for Numerical Analysis in Science and Engineering,
Victor A. Bloomfield
Event History Analysis with R, Göran Broström
Extending R, John M. Chambers
Computational Actuarial Science with R, Arthur Charpentier
Testing R Code, Richard Cotton
The R Primer, Second Edition, Claus Thorn Ekstrøm
Statistical Computing in C++ and R, Randall L. Eubank and
Ana Kupresanin
Basics of Matrix Algebra for Statistics with R, Nick Fieller
Reproducible Research with R and RStudio, Second Edition,
Christopher Gandrud
R and MATLAB®, David E. Hiebeler
Statistics in Toxicology Using R, Ludwig A. Hothorn
Nonparametric Statistical Methods Using R, John Kloke and
Joseph McKean
Displaying Time Series, Spatial, and Space-Time Data with R,
Oscar Perpiñán Lamigueiro
Programming Graphical User Interfaces with R, Michael F. Lawrence
and John Verzani
Analyzing Sensory Data with R, Sébastien Lê and Thierry Worch
Parallel Computing for Data Science: With Examples in R, C++
and CUDA, Norman Matloff
Analyzing Baseball Data with R, Max Marchi and Jim Albert
Growth Curve Analysis and Visualization Using R, Daniel Mirman
R Graphics, Second Edition, Paul Murrell
Introductory Fisheries Analyses with R, Derek H. Ogle
Data Science in R: A Case Studies Approach to Computational
Reasoning and Problem Solving, Deborah Nolan and Duncan Temple Lang
Multiple Factor Analysis by Example Using R, Jérôme Pagès
Customer and Business Analytics: Applied Data Mining for Business
Decision Making Using R, Daniel S. Putler and Robert E. Krider
Flexible Regression and Smoothing: Using GAMLSS in R,
Mikis D. Stasinopoulos, Robert A. Rigby, Gillian Z. Heller, Vlasios Voudouris,
and Fernanda De Bastiani
Implementing Reproducible Research, Victoria Stodden, Friedrich Leisch,
and Roger D. Peng
Graphical Data Analysis with R, Antony Unwin
Using R for Introductory Statistics, Second Edition, John Verzani
Advanced R, Hadley Wickham
The Essentials of Data Science: Knowledge Discovery Using R,
Graham J. Williams
bookdown: Authoring Books and Technical Documents with R Markdown,
Yihui Xie
Dynamic Documents with R and knitr, Second Edition, Yihui Xie
The Essentials of
Data Science
Knowledge Discovery Using R

Graham J. Williams
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170616

International Standard Book Number-13: 978-1-138-08863-4 (Paperback)


International Standard Book Number-13: 978-1-4987-4000-5 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
For Catharina
Anam Cara

To Sean and Anita


Quiet lights that shine
Blessings
Preface

From data we derive information and by combining different bits
of information we build knowledge. It is then with wisdom that we
deploy knowledge into enterprises, governments, and society. Data
is core to every organisation as we continue to digitally capture
volumes and a variety of data at an unprecedented velocity. The
demand for data science continues to grow substantially with a
shortfall of data scientists worldwide.
Professional data scientists combine a good grounding in com-
puter science and statistics with an ability to explore through the
space of data to make sense of the world. Data science relies on
their aptitude and art for observation, mathematics, and logical
reasoning.
This book introduces the essentials of data analysis and ma-
chine learning as the foundations for data science. It uses the free
and open source software R (R Core Team, 2017) which is freely
available to anyone. All are permitted, and indeed encouraged, to
read the source code to learn, understand, verify, and extend it.
Being open source we also have the assurance that the software
will always be available. R is supported by a worldwide network
of some of the world’s leading statisticians and professional data
scientists.

Features
A key feature of this book, differentiating it from other textbooks
on data science, is the focus on the hands-on end-to-end process.
It covers data analysis including loading data into R, wrangling
the data to improve its quality and utility, visualising the data to
gain understanding and insight, and, importantly, using machine
learning to discover knowledge from the data.
This book brings together the essentials of doing data science
based on over 30 years of the practice and teaching of data
science. It presents a programming-by-example approach that allows
students to quickly achieve outcomes whilst building a skill set
and knowledge base, without getting sidetracked into the details
of programming.
The book systematically develops an end-to-end process flow
for data science. It focuses on creating templates to support those
activities. The templates serve as a starting point and can readily
incorporate different datasets with minimal change to the scripts
or programs. The templates are incrementally introduced in two
chapters (Chapter 3 for data analysis and Chapter 7 for predictive
machine learning) with supporting chapters demonstrating their
usage.

Production and Typographical Conventions


This book has been typeset by the author using LaTeX and R’s
knitr (Xie, 2016). All R code segments included in the book are
run at the time of typesetting the book and the results displayed
are directly and automatically obtained from R itself.
Because all R code and screenshots are automatically generated,
the output we see in the book should be reproducible by the
reader. All code is run on a 64-bit deployment of R on an Ubuntu
GNU/Linux system. Running the same code on other systems
(particularly on 32-bit systems) may result in slight variations
in the results of the numeric calculations performed by R.
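Readers who want to confirm what kind of R build they are running can query it directly. The following is a minimal sketch, not taken from the book, that uses base R's `.Machine` and `R.version` objects; the variable names are illustrative only.

```r
# Illustrative check (not from the book): is this a 64-bit build of R?
pointer_bytes <- .Machine$sizeof.pointer   # 8 on 64-bit builds, 4 on 32-bit
is_64bit <- pointer_bytes == 8

cat("Architecture:", R.version$arch, "\n")
cat("64-bit build:", is_64bit, "\n")

# Floating point results can differ in the last few digits between systems,
# so compare numbers with a tolerance rather than with ==.
x <- sqrt(2)^2
cat("Exact comparison with 2:", x == 2, "\n")
cat("Comparison within tolerance:", isTRUE(all.equal(x, 2)), "\n")
```

Comparing with `all.equal()` rather than `==` is one way to keep checks of numeric output robust to the small platform differences mentioned above.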
Sample code used to illustrate the interactive sessions using
R does not include the R prompt, which by default is “> ”. Nor
does it include the usual continuation prompt, which by default
consists of “+ ”. The continuation prompt is used by R when a
single command extends over multiple lines to indicate that R is
still waiting for input from the user. For our purposes, including
the continuation prompt makes it more difficult to cut-and-paste
from the examples in the electronic version of the book.
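To make the two prompts concrete, the sketch below shows (in comments) roughly how a command split over two lines appears in an interactive console, followed by the prompt-free form that pastes cleanly. The variable name `total` is made up for this illustration; `getOption()` is base R.

```r
# In an interactive console the command below would appear with prompts:
#   > total <- sum(1,
#   +              2, 3)
# Without the prompts it can be pasted directly into R:
total <- sum(1,
             2, 3)
print(total)

# The prompt strings themselves are ordinary R options.
getOption("prompt")
getOption("continue")
```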
R code examples will appear as code blocks like the example
code block shown over the page. The code block here uses
rattle::rattleInfo() to report on the versions of the R software
and many packages used at the time of compiling this book.

rattle::rattleInfo()

## Rattle: version 5.0.14 CRAN 4.1.0
## R: version 3.4.0 (2017-04-21)
##
## Sysname: Linux
## Release: 4.10.0-22-generic
## Version: #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017
## Nodename: leno
## Machine: x86_64
## Login: gjw
## User: gjw
## Effective_user: gjw
##
## Installed Dependencies
## ada: version 2.0-5
## amap: version 0.8-14
## arules: version 1.5-2
## biclust: version 1.2.0
## bitops: version 1.0-6
## cairoDevice: version 2.24
## cba: version 0.2-19
## cluster: version 2.0.6
## colorspace: version 1.3-2
## corrplot: version 0.77
## descr: version 1.1.3
## doBy: version 4.5-15
## dplyr: version 0.7.0
....

In providing example output from commands, at times long
lines and long output will be replaced with ... and .... respectively.
While most examples will illustrate the output exactly as it
appears in R, there will be times where the format will be modified
slightly to fit publication limitations. This might involve removing
or adding blank lines.
The R code as well as the templates are available from the
book’s web site at https://essentials.togaware.com.

Currency
New versions of R are released regularly and as R is free and
open source software a sensible approach is to upgrade whenever
possible. This is common practice in the open source community,
maintaining systems with the latest “patch level” of the software.
This will ensure tracking of bug fixes, security patches, and new
features.
The above code block identifies that version 3.4.0 of R is used
throughout this book.
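One way to see which version of R is installed, and how it compares with the 3.4.0 used here, is sketched below. This is an illustrative snippet using base R functions only; the variable names are assumptions and are not part of the book's templates.

```r
# Illustrative only: compare the running R with the version used in the book.
book_version <- package_version("3.4.0")
current_version <- getRversion()

cat("Running:", R.version.string, "\n")
if (current_version < book_version) {
  message("This R is older than 3.4.0; consider upgrading.")
} else {
  message("This R is at least version 3.4.0.")
}

# Installed packages can be brought up to date with update.packages(),
# which is not run here because it needs network access.
```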

Acknowledgments
This book is a follow-on from the Rattle book (Williams, 2011).
Whilst the Rattle book introduces data mining with limited
exposure to the underlying R code, this book begins the journey into
coding with R. As with the Rattle book, this book came about from
a desire to share experiences in using and deploying data science
tools and techniques through R. The material draws from the
practice of data science as well as from material developed for teaching
machine learning, data mining, and data science to undergraduate
and graduate students and for professionals developing new skills.
Colleagues including budding and experienced data scientists
have provided the motivation for the sharing of these accessible
templates and reference material. Thank you.
With gratitude I thank my wife, Catharina, and children, Sean
and Anita, who have supported and encouraged my enthusiasm
for open source software and data science.

Graham J. Williams
Contents

Preface

List of Figures

List of Tables

1 Data Science
1.1 Exercises

2 Introducing R
2.1 Tooling For R Programming
2.2 Packages and Libraries
2.3 Functions, Commands and Operators
2.4 Pipes
2.5 Getting Help
2.6 Exercises

3 Data Wrangling
3.1 Data Ingestion
3.2 Data Review
3.3 Data Cleaning
3.4 Variable Roles
3.5 Feature Selection
3.6 Missing Data
3.7 Feature Creation
3.8 Preparing the Metadata
3.9 Preparing for Model Building
3.10 Save the Dataset
3.11 A Template for Data Preparation
3.12 Exercises

4 Visualising Data
4.1 Preparing the Dataset
4.2 Scatter Plot
4.3 Bar Chart
4.4 Saving Plots to File
4.5 Adding Spice to the Bar Chart
4.6 Alternative Bar Charts
4.7 Box Plots
4.8 Exercises

5 Case Study: Australian Ports
5.1 Data Ingestion
5.2 Bar Chart: Value/Weight of Sea Trade
5.3 Scatter Plot: Throughput versus Annual Growth
5.4 Combined Plots: Port Calls
5.5 Further Plots
5.6 Exercises

6 Case Study: Web Analytics
6.1 Sourcing Data from CKAN
6.2 Browser Data
6.3 Entry Pages
6.4 Exercises

7 A Pattern for Predictive Modelling
7.1 Loading the Dataset
7.2 Building a Decision Tree Model
7.3 Model Performance
7.4 Evaluating Model Generality
7.5 Model Tuning
7.6 Comparison of Performance Measures
7.7 Save the Model to File
7.8 A Template for Predictive Modelling
7.9 Exercises

8 Ensemble of Predictive Models
8.1 Loading the Dataset
8.2 Random Forest
8.3 Extreme Gradient Boosting
8.4 Exercises

9 Writing Functions in R
9.1 Model Evaluation
9.2 Creating a Function
9.3 Function for ROC Curves
9.4 Exercises

10 Literate Data Science
10.1 Basic LaTeX Template
10.2 A Template for our Narrative
10.3 Including R Commands
10.4 Inline R Code
10.5 Formatting Tables Using Kable
10.6 Formatting Tables Using XTable
10.7 Including Figures
10.8 Add a Caption and Label
10.9 Knitr Options
10.10 Exercises

11 R with Style
11.1 Why We Should Care
11.2 Naming
11.3 Comments
11.4 Layout
11.5 Functions
11.6 Assignment
11.7 Miscellaneous
11.8 Exercises

Bibliography

Index
List of Figures

2.1 RStudio: Initial layout.
2.2 RStudio: Ready to program in R.
2.3 RStudio: Running the R program.
2.4 Daily temperature 3pm.

3.1 Target variable distribution

4.1 Scatter plot of the weatherAUS dataset
4.2 Bar Chart
4.3 Stacked bar chart
4.4 A decorated stacked bar chart
4.5 A decorated stacked filled bar chart
4.6 Multiple bars with overlapping labels
4.7 Rotating labels in a plot
4.8 Rotating the plot
4.9 Reordering labels
4.10 A traditional box and whiskers plot
4.11 A violin plot
4.12 A violin plot with a box plot overlay
4.13 Violin/box plot by location
4.14 Visualise the first set of clustered locations
4.15 Visualise the second set of clustered locations

5.1 Faceted dodged bar plot.
5.2 Faceted dodged bar plot.
5.3 Labelled scatter plot with inset
5.4 Labelled scatter plot
5.5 Faceted bar plot with embedded bar plot
5.6 Horizontal bar chart
5.7 Horizontal bar chart with multiple stacks
5.8 Simple bar chart with dodged and labelled bars

6.1 Month by month external browser visits.
6.2 Month by month internal browser visits.
6.3 Views and visits per month
6.4 Views and visits per month (log scale)
6.5 Faceted plot of external and internal visits/views

7.1 Decision tree variable importance
7.2 Decision tree visualisation
7.3 ROC curve for decision tree over training dataset
7.4 Risk chart for rpart on training dataset.
7.5 ROC curve for decision tree over validation dataset
7.6 Risk chart for rpart on validation dataset.
7.7 An ROC curve for a decision tree on the testing dataset
7.8 A risk chart for the testing dataset

8.1 Random forest variable importance
8.2 ROC for random forest over validation dataset
8.3 Risk chart random forest validation dataset
8.4 Random forest ROC over training dataset
8.5 Random forest risk chart over training dataset
8.6 Extreme gradient boosting variable importance
8.7 ROC for extreme gradient boosting
8.8 Risk chart for extreme gradient boosting

9.1 ROC curve plotted using our own aucplot()
9.2 ROC curve with a caption

10.1 Creating a new R Sweave document in RStudio.
10.2 Ready to compile to PDF within RStudio.
10.3 Resulting PDF Document.
10.4 The 3pm temperature for four locations
List of Tables

6.1 External versus internal visits.
6.2 External versus internal browsers.

7.1 Performance measures for decision tree model.

8.1 Performance measures for the random forest model.
8.2 Performance measures extreme gradient boosting

10.1 Example xtable.
10.2 Remove row numbers.
10.3 Decimal points.
10.4 Large numbers.
10.5 Large numbers formatted.
10.6 Extended caption.
1
Data Science

Over the past decades we have progressed toward today’s capability
to identify, collect, and store electronically a massive amount
of data. Today we are data rich, information driven, and knowledge
hungry, though, we may argue, wisdom scant. Data surrounds us
everywhere we look. Data exhibits every facet of everything we
know and do. We are today capturing and storing a subset of this
data electronically, converting the data that surrounds us by di-
gitising it to make it accessible for computation and analysis. We
are digitising data at a rate we have never before been capable of.
There is now so much data captured and even more yet to come
that much of it remains to be analysed and fully utilised.
Data science is a broad tag capturing the endeavour of ana-
lysing data into information into knowledge. Data scientists apply
an ever-changing and vast collection of techniques and technology
from mathematics, statistics, machine learning and artificial in-
telligence to decompose complex problems into smaller tasks to
deliver insight and knowledge. The knowledge captured from the
smaller tasks can then be synthesised with wisdom to form an un-
derstanding of the whole and to drive the development of today’s
intelligent computer-based applications.
The role of a data scientist is to perform the transformations
that make sense of the data in an evidence-based endeavour deliv-
ering the knowledge deployed with wisdom. Data scientists resolve
the obscure into the known.* Such a synthesis delivers real bene-
fit from the science—benefit for business, industry, government,
environment, and humanity in general. Indeed, every organisation
today is or should be a data-driven organisation.
* Science is analytic description, philosophy is synthetic interpretation.
Science wishes to resolve the whole into parts, the organism into organs, the
obscure into the known. (Durant, 1926)

A data scientist brings to a task a deep collection of computer
skills using a variety of tools. They also bring particularly strong
intuitions about how to tackle complex problems. Tasks are under-
taken by resolving the whole into its parts. They explore, visualise,
analyse, and model the data to then synthesise new understand-
ings that come together to build our knowledge of the whole. With
a desire and hunger for continually learning we find that data sci-
entists are always on the lookout for opportunities to improve how
things are done—how to do better what we did yesterday.
Finding such requisite technical skills and motivation in one
person is rare—data scientists are truly scarce and the demand
for their services continues to grow as we find ourselves every day
with more data being captured from the world around us.
In this chapter we introduce the concept of the data scientist.
We identify a progression of skill from the data technician, through
data analyst and data miner, to data scientist. We also consider
how we might deploy a data science capability.
With the goal of capturing knowledge as models of our world
from data we consider the toolkits used by data scientists to do so.
We introduce the most powerful software system for data science
today, called R (R Core Team, 2017). R is open source and free
software that is available to anyone and everyone. It offers us the
freedom to use the software however we desire. Using this software
we can discover, learn, explore, experience, extend, and share the
algorithms for data science.

The Art of Data Science


As data scientists we ply the art of excavating data for know-
ledge discovery (Williams, 2011). As scientists we are also truly
artists. Computer science courses over the past 30 years have
shared the foundations of programming languages, software en-
gineering, databases, artificial intelligence, machine learning, and
now data mining and data science. A consistent theme has been
that what we do as computer and data scientists is an art. Pro-
gramming presents us with a language through which we express
ourselves. We can use this language to communicate in a sophist-
icated manner with others. Our role is not to simply write code
for computers to execute systematically but to express our views
in an elegant form which we communicate both for execution by
a computer and importantly for others to read, to marvel, and to
enjoy.
As will become evident through the pages of this book, data
scientists aim to not only turn data and information into intelligent
applications, but also to gain insight and new knowledge from
this data and information and to share these discoveries. Data
scientists must also clearly communicate in such an elegant way so
as to resolve the obscure and to make it known in a form that is
accessible and a pleasure to read—in a form that makes us proud
to share, want to read, and to continue to learn. This is the art of
data science.

The Data Scientist


A data scientist combines a deep understanding of machine learn-
ing algorithms and statistics together with a strong foundation in
software engineering and computer science and a well-developed
ability to program with data. Data scientists cross over a variety of
application domains and use their intuition to drive discoveries. As
data scientists we experiment so as to deploy the right algorithm
implemented within the right tool suite on the right data made
available through the right infrastructure to deliver outcomes for
the right problems.
The journey to becoming a data scientist begins with a solid
background in mathematics, statistics and computer science and
an enthusiasm for software engineering and programming comput-
ers. Their careers often begin as a data technician where skillful use
of SQL and other data technologies including Hadoop are brought
to bear to ingest and fuse data captured from multiple sources.
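As a small illustration of what fusing data from multiple sources can mean in practice, the sketch below joins two made-up tables on a shared key using base R's `merge()`. The data frames, column names, and values are invented for this example and do not come from the book.

```r
# Hypothetical data from two sources sharing a customer identifier.
customers <- data.frame(id = c(1, 2, 3),
                        name = c("Ann", "Bo", "Cy"),
                        stringsAsFactors = FALSE)
orders <- data.frame(id = c(1, 1, 3),
                     amount = c(20, 35, 10))

# Fuse the sources: keep every customer, even those without orders,
# which mirrors a SQL LEFT JOIN.
fused <- merge(customers, orders, by = "id", all.x = TRUE)
print(fused)
```

Setting `all.x = TRUE` keeps customers with no matching order and fills their `amount` with `NA`, which makes gaps between the sources visible rather than silently dropping rows.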
A data analyst adds further value to the extracted data and
may rely on basic statistical and visual analytics supported by
business intelligence software tools. A data analyst may also
identify data quality issues and iterate with the data technician
to explore the quality and veracity of the data. The role of a data
analyst is to inform so as to support with evidence any decision
making.
The journey then proceeds to the understanding of machine
learning and advanced statistics where we begin to fathom the
world based on the data we have captured and stored digitally.
We begin to program with data in building models of the world
that embody knowledge discoveries that can then improve our un-
derstanding of the world. Data miners apply a variety of tools to
the increasingly larger volumes of data becoming more available in
a variety of formats. By building models of the world—by learning
from our interactions with the world captured through data—we
can begin to understand and to build our knowledge base from
which we can reason about the world.
The final destination is the art of data science. Data scientists
are driven by intuition in exploring through data to discover the
unknown. This is not something that can be easily taught. The
data scientist brings to bear a philosophy to the knowledge they
discover. They synthesise this knowledge in different ways to give
them the wisdom to decide how to communicate and take action.
A continual desire to challenge, grow and learn in the art, and
to drive the future rather than be pushed along by it, is the final
ingredient.
It is difficult to be prescriptive about the intangible skills of
a data scientist. Through this book we develop the foundational
technical skills required to work with data. We explore the basic
skill set for the data scientist.
Through hands-on experience we will come to realise that we
need to program with our data as data scientists. Perhaps there
will be a time when intelligent systems themselves can exhibit the
required capabilities and sensitivities of today’s most skilled data
scientists, but it is not currently foreseeable. Our technology will
continue to develop and we will be able to automate many tasks in
support of the data scientist, but that intuition that distinguishes
the skilled data scientist from the prescriptive practitioner will
remain elusive.
To support the data scientists we also develop through this
book two templates for data science. These scripts provide a start-
ing point for the data processing and modelling phases of the data
science task. They can be reused for each new project and will
grow for each data scientist over time to capture their own style
and focus.

Creating a Data Science Capability


Creating a data science capability can be cost-effective in terms of
software and hardware requirements. The software for data science
is readily available and regularly improving. It is also generally free
(as in libre) and open source software (FLOSS). Today, even the
hardware platforms need not be expensive as we migrate computa-
tion to the cloud where we can share resources and only consume
resources when required.
The expense in creating a data science capability is in acquiring
expert data scientists, that is, in building the team. Traditionally,
information technology organisations focused on delivering a centrally
controlled platform hosted on premise by the IT Department. Large
and expensive computers running singularly vetted and extremely
expensive statistical software suites were deployed. Pre-specified
requirements were provided through a tender process which often
took many months or even years. The traditional funding models
for many organisations preferred on-premise expenses instead
of otherwise much more cost-effective, flexible and dynamic data
science platforms combining FLOSS with cloud.
The key message from many years of an evolving data science
capability is that the focus must be on the skills of the practitioner
more than the single vendor provided software/hardware platform.
Oddly enough this is quite obvious, yet it is quite a challenge for
the era mentality of the IT Department and its role as the director
rather than the supporter of business. Recent years have seen the
message continue to be lost. Slowly though we continue to realise
the importance of business driving data science rather than IT
technology being the driver.
The principles of the business drivers allowing the data scientists
to direct the underlying support from IT, rather than vice-versa,
were captured by the Analyst First* movement in the early
2000s. Whilst we still see the technology first approach driven by
vested interests, many organisations are now coming to realise the
importance of placing business driven data science before IT.

* http://analystfirst.com/core-principles/

The Analyst First movement collected together principles for
guiding the implementation of a data science capability. Some of
the key principles, relevant to our environment today, can be para-
phrased as:

• A data science team can be created with minimal expense;
• Data science, done properly, is scalable;
• The human is the most essential, valuable and rare resource;
• The scientist is the focus of successful data science investment;
• Data science is not information technology;
• Data scientists require advanced/flexible software/hardware;
• There is no “standard operating environment” for data science;
• Data science infrastructure is agile, dynamic, and scalable.

It is perhaps not surprising that large organisations have
struggled with deploying data science. Traditional IT departments
have driven the provision of infrastructure for an organisation
and can become disengaged from the actual business drivers. This
has been their role traditionally, to source software and hardware
for specific tasks as they understand it, to go out to tender for
the products, then provision and support the platform over many
years.
The traditional approach to creating a data science team is
then for the IT department, driven by business demands, to edu-
cate themselves about the technology. Often the IT department
will invite vendors with their own tool suites to tender for a single
solution. A solution is chosen, often without consulting the actual
data scientists, and implemented within the organisation. Over
many years this approach has regularly failed.
It is interesting to instead consider how an open source product
like R* has become the tool of choice today for the data scientist.
The open source community has over 30 years of experience in
delivering powerful, excellent solutions by bringing together skilled
and passionate developers with the right tools. The focus is on
allowing these solutions to work together to solve actual problems.
Since the early 1990s when R became available its popularity
has grown from a handful of users to perhaps several million users
today. No vendor has been out there selling the product. Indeed,
the entrenched vendors have had to work very hard to retain their
market position in the face of a community of users realising the
power of open source data science suites. In the end they cannot
compete with an open source community of many thousands of
developers and statisticians providing state-of-the-art technology
through free and open source software like R. Today data scientists
themselves are driving the technology requirements with a focus
on solving their own problems.
The world has moved on. We need to recognise that data sci-
ence requires flexibility and agility within an ever-changing land-
scape. Organisations have unnecessarily invested millions in on-
premise infrastructure including software and hardware. Now the
software is generally available to all and the hardware can be
sourced as and only when required.
Within this context then open source software running on com-
puter servers located in the cloud delivers a flexible platform of
choice for data science practitioners. Platforms in the cloud today
provide a completely managed, regularly maintained and updated,
secure and comprehensive environment for the data scientist. We
no longer require significant investment in corporately managed,
dedicated and centrally controlled IT infrastructure.

Closed and Open Source Software


Irrespective of whether software can be obtained freely through a
free download or for a fee from a vendor, an important requirement
for innovation and benefit is that the software source code
be available. We should have the freedom to review the source
code to ensure the software implements the functions correctly
and accurately and to simply explore, learn, discover, and share.
Where we have the capability we should be able to change and
enhance the software to suit our ever-changing and increasingly
challenging needs. Indeed, we can then share our enhancements
with the community so that we can build on the shoulders of what
has gone before. This is what we refer to by the use of free in free
(as in libre) open source software (FLOSS). It is not a reference
to the cost of the software and indeed a vendor is quite at liberty
to charge for the software.

* R, a statistical software package, is the software we use throughout this
book.

Today’s Internet is built on free and open source software.
Many web servers run the free and open source Apache soft-
ware. Nearly every modem and router is running the open source
GNU/Linux operating system. There are more installations of the
free and open source Linux kernel running on devices today than
any other operating system ever—Android is a free and open
source operating system running the Linux kernel. For big data
Hadoop, Spark, and their family of related products are all free
and open source software. The free and open source model has
matured significantly over the past 30 years to deliver a well-oiled
machine that today serves the software world admirably. This is
highlighted by the adoption of free and open source practices and
philosophies by today’s major internet companies.
Traditionally commercial software is closed source. This
presents challenges to the effective use and reuse of that software.
Instead of being able to build on the shoulders of those who have
gone before us, we must reinvent the wheel. Often the wheel is re-
implemented a multitude of times. Competition is not a bad thing
per se but closed source software generally hinders progress. Over
the past two decades we have witnessed a variety of excellent ma-
chine learning software products disappear. The efforts that went
into that software were lost. Instead we might recognise business
models that compensate for the investments but share the benefits
and ensure we can build on rather than reinvent.