Data Science
Concepts and Practice
Second Edition
Vijay Kotu
Bala Deshpande
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2019 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing
from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies
and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than
as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they
should be mindful of their own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for
any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any
use or operation of any methods, products, instructions, or ideas contained in the material herein.
Contents
FOREWORD
PREFACE
ACKNOWLEDGMENTS
Chapter 4 Classification
4.1 Decision Trees
4.2 Rule Induction
Foreword
A lot has happened since the first edition of this book was published in
2014. Hardly a day goes by without news on data science, machine learning,
or artificial intelligence in the media. It is interesting that many of
those news articles have a skeptical, if not outright negative, tone. All this
underlines two things: data science and machine learning are finally becoming
mainstream, and people know shockingly little about them. Readers of this
book will certainly do better in this regard. It continues to be a valuable
resource, educating readers not only about how to use data science in
practice but also about how the fundamental concepts work.
Data science and machine learning are fast-moving fields, which is why this
second edition reflects a lot of the changes in our field. While we used to
talk a lot about “data mining” and “predictive analytics” only a couple of
years ago, we have now settled on the term “data science” for the broader
field. And even more importantly: it is now commonly understood that
machine learning is at the core of many current technological breakthroughs.
These are truly exciting times for all the people working in our field!
I have seen situations where data science and machine learning had an
incredible impact. But I have also seen situations where this was not the
case. What was the difference? In most cases where organizations fail with
data science and machine learning, they had used those techniques in
the wrong context. Data science models are not very helpful if you only
have one big decision you need to make. Analytics can still help you in
such cases by giving you easier access to the data you need to make this
decision, or by presenting the data in a consumable fashion. But at the
end of the day, those single big decisions are often strategic. Building a
machine learning model to help you make this decision is not worth
doing. And often such a model does not yield better results than just
making the decision on your own.
Here is where data science and machine learning can truly help: these
advanced models deliver the most value whenever you need to make lots of
similar decisions quickly. Good examples of this are:
• Defining the price of a product in markets with rapidly changing
  demands.
• Making offers for cross-selling on an e-commerce platform.
• Approving credit or not.
• Detecting customers with a high risk of churn.
• Stopping fraudulent transactions.
• And many others.
A human being with access to all relevant data could make each of those
decisions in a matter of seconds or minutes. But they cannot do so at scale
without data science, since they would need to make this type of
decision millions of times, every day. Consider sifting through a customer
base of 50 million clients every day to identify those with a high churn risk.
That is impossible for any human being, but no problem at all for a machine
learning model.
So, the biggest value of artificial intelligence and machine learning is not to
support us with those big strategic decisions. Machine learning delivers the
most value when we operationalize models and automate millions of decisions.
One of the shortest descriptions of this phenomenon comes from Andrew
Ng, who is a well-known researcher in the field of AI. Andrew describes what
AI can do as follows: “If a typical person can do a mental task with less than
one second of thought, we can probably automate it using AI either now or
in the near future.”
I agree with him on this characterization. And I like that Andrew puts the
emphasis on automation and operationalization of those models—because
this is where the biggest value is. The only thing I disagree with is the time
unit he chose. It is safe to already go with a minute instead of a second.
However, the quick pace of change as well as the ubiquity of data science
also underline the importance of laying the right foundations. Keep in
mind that machine learning is not completely new. It has been an active
field of research since the 1950s. Some of the algorithms used today have
even been around for more than 200 years now. And the first deep learning
models were developed in the 1960s with the term “deep learning” being
coined in 1984. Those algorithms are well understood now. And understanding
their basic concepts will help you to pick the right algorithm for
the right task.
To support you with this, some additional chapters on deep learning and
recommendation systems have been added to the book. Another focus area is
using text analytics and natural language processing. It became clear in the
past years that the most successful predictive models have been using
unstructured input data in addition to the more traditional tabular formats.
Finally, expansion of Time Series Forecasting should get you started on one
of the most widely applied data science techniques in the business.
More algorithms could mean that there is a risk of increased complexity. But
thanks to the simplicity of the RapidMiner platform and the many practical
examples throughout the book this is not the case here. We continue our
journey towards the democratization of data science and machine learning.
This journey continues until data science and machine learning are as
ubiquitous as data visualization or Excel. Of course, we cannot magically transform
everybody into a data scientist overnight, but we can give people the tools to
help them on their personal path of development. This book is the only tour
guide you need on this journey.
Ingo Mierswa
Founder
RapidMiner Inc.
Massachusetts, USA
Preface
Vijay Kotu
California, USA
Bala Deshpande, PhD
Michigan, USA
Acknowledgments
Writing a book is one of the most interesting and challenging endeavors one
can embark on. We grossly underestimated the effort it would take and the
fulfillment it brings. This book would not have been possible without the
support of our families, who granted us enough leeway in this time-
consuming activity. We would like to thank the team at RapidMiner, who
provided great help with everything, ranging from technical support to
reviewing the chapters to answering questions on the features of the product.
Our special thanks to Ingo Mierswa for setting the stage for the book through
the foreword. We greatly appreciate the thoughtful and insightful comments
from our technical reviewers: Doug Schrimager from Slalom Consulting,
Steven Reagan from L&L Products, and Philipp Schlunder, Tobias Malbrecht
and Ingo Mierswa from RapidMiner. Thanks to Mike Skinner of Intel and to
Dr. Jacob Cybulski of Deakin University in Australia. We had great support
and stewardship from the Elsevier and Morgan Kaufmann team: Glyn Jones,
Ana Claudia Abad Garcia, and Sreejith Viswanathan. Thanks to our colleagues
and friends for all the productive discussions and suggestions regarding
this project.
CHAPTER 1
Introduction
scrubbing, or standardizing the data is still required, before the learning
algorithms can begin to crunch them. But what may change is the automation
available to do this. While today this process is iterative and requires
analysts’ awareness of the best practices, soon smart automation may be
deployed. This will allow the focus to be put on the most important aspect
of data science: interpreting the results of the analysis in order to make
decisions. This will also increase the reach of data science to a wider audience.
When it comes to data science techniques, is there a core set of procedures
and principles one must master? It turns out that a vast majority of data science
practitioners today use a handful of very powerful techniques to accomplish
their objectives: decision trees, regression models, deep learning, and clustering
(Rexer, 2013). A majority of the data science activity can be accomplished
using relatively few techniques. However, as with all 80/20 rules, the long tail,
which is made up of a large number of specialized techniques, is where the
value lies, and depending on what is needed, the best approach may be a
relatively obscure technique or a combination of several not so commonly used
procedures. Thus, it will pay off to learn data science and its methods in a
systematic way, and that is what is covered in these chapters. But, first, how are
the often-used terms artificial intelligence (AI), machine learning, and data
science explained?
[Figure: diagram relating artificial intelligence, machine learning, and data
science. Labels shown: linguistics, vision, language synthesis, robotics,
sensor, planning, support vector machines, kNN, decision trees, Bayesian
learning, deep learning, recommendation engines, data preparation, text
mining, statistics, process mining, time series forecasting, visualization,
processing paradigms, experimentation.]
FIGURE 1.1
Artificial intelligence, machine learning, and data science.
FIGURE 1.2
Traditional program and machine learning.
from experience. Experience for machines comes in the form of data. Data
that is used to teach machines is called training data. Machine learning turns
the traditional programming model upside down (Fig. 1.2). A program, a set
of instructions to a computer, transforms input signals into output signals
using predetermined rules and relationships. Machine learning algorithms,
also called “learners”, take both the known input and output (training data)
to figure out a model for the program which converts input to output. For
example, many organizations like social media platforms, review sites, or
forums are required to moderate posts and remove abusive content. How can
machines be taught to automate the removal of abusive content? The
machines need to be shown examples of both abusive and non-abusive posts
with a clear indication of which ones are abusive. The learners will generalize a
pattern based on certain words or sequences of words in order to conclude
whether the overall post is abusive or not. The model can take the form of a
set of “if-then” rules. Once the rules or model are developed,
machines can start categorizing any new posts.
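The contrast between a traditional program and a learner can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a method taken from the book: the training posts are made up, and the function names (tokenize, learn_rules, classify) are hypothetical. The learner is deliberately naive. It turns a word into an "if the post contains this word, then abusive" rule when the word appears in at least one abusive training post and in no non-abusive one.

```python
# Illustrative only: made-up training data and a deliberately naive learner.

def tokenize(post):
    """Lowercase a post and split it into a set of alphabetic words."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in post.lower())
    return set(cleaned.split())


def learn_rules(training_data):
    """Learn 'if the post contains <word> then abusive' rules.

    A word becomes a rule when it occurs in at least one abusive
    training post and in no non-abusive training post.
    """
    abusive_words, clean_words = set(), set()
    for post, is_abusive in training_data:
        (abusive_words if is_abusive else clean_words).update(tokenize(post))
    return abusive_words - clean_words


def classify(post, rules):
    """Apply the learned if-then rules to a new, unseen post."""
    return bool(tokenize(post) & rules)


# Known inputs paired with known outputs (hypothetical examples).
training_data = [
    ("You are an idiot", True),
    ("What a stupid idiot", True),
    ("You are an expert", False),
    ("What a great answer", False),
]

rules = learn_rules(training_data)
print(sorted(rules))
print(classify("Only an idiot would write this", rules))
print(classify("Thanks, this answer is great", rules))
```

In a traditional program, a person would write the word list by hand. Here the list is derived from labeled examples, so adding or correcting training posts changes the rules without changing the code. A practical learner would also need to handle word sequences, misspellings, and words that appear in both classes.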
Data science is the business application of machine learning, artificial
intelligence, and other quantitative fields like statistics, visualization, and
mathematics. It is an interdisciplinary field that extracts value from data. In the
context of how data science is used today, it relies heavily on machine
learning and is sometimes called data mining. Examples of data science use cases
are: recommendation engines that can recommend movies for a particular
user, a fraud alert model that detects fraudulent credit card transactions, a
model that finds customers who will most likely churn next month, or a model
that predicts revenue for the next quarter.