Data Science: Concepts and Practice 2nd Edition- eBook PDF instant download

The document is a promotional eBook download for 'Data Science: Concepts and Practice, 2nd Edition' by Vijay Kotu and Bala Deshpande, along with links to other related eBooks. It outlines the contents of the book, which covers various data science concepts, processes, and algorithms. The book aims to educate readers on the practical applications of data science and machine learning in various fields.

Data Science
Concepts and Practice
Second Edition

Vijay Kotu
Bala Deshpande
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2019 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing
from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies
and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than
as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they
should be mindful of their own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for
any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any
use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-12-814761-0

For Information on all Morgan Kaufmann publications

visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jonathan Simpson

Acquisition Editor: Glyn Jones
Editorial Project Manager: Ana Claudia Abad Garcia
Production Project Manager: Sreejith Viswanathan
Cover Designer: Greg Harris
Typeset by MPS Limited, Chennai, India
Dedication

To all the mothers in our lives

Contents

FOREWORD
PREFACE
ACKNOWLEDGMENTS

Chapter 1 Introduction

1.1 AI, Machine Learning, and Data Science
1.2 What is Data Science?
1.3 Case for Data Science
1.4 Data Science Classification
1.5 Data Science Algorithms
1.6 Roadmap for This Book
References

Chapter 2 Data Science Process

2.1 Prior Knowledge
2.2 Data Preparation
2.3 Modeling
2.4 Application
2.5 Knowledge
References

Chapter 3 Data Exploration

3.1 Objectives of Data Exploration
3.2 Datasets
3.3 Descriptive Statistics
3.4 Data Visualization
3.5 Roadmap for Data Exploration
References

Chapter 4 Classification

4.1 Decision Trees
4.2 Rule Induction
4.3 k-Nearest Neighbors
4.4 Naïve Bayesian
4.5 Artificial Neural Networks
4.6 Support Vector Machines
4.7 Ensemble Learners
References

Chapter 5 Regression Methods

5.1 Linear Regression
5.2 Logistic Regression
5.3 Conclusion
References

Chapter 6 Association Analysis

6.1 Mining Association Rules
6.2 Apriori Algorithm
6.3 Frequent Pattern-Growth Algorithm
6.4 Conclusion
References

Chapter 7 Clustering

7.1 k-Means Clustering
7.2 DBSCAN Clustering
7.3 Self-Organizing Maps
References

Chapter 8 Model Evaluation

8.1 Confusion Matrix
8.2 ROC and AUC
8.3 Lift Curves
8.4 How to Implement
8.5 Conclusion
References

Chapter 9 Text Mining

9.1 How It Works
9.2 How to Implement
9.3 Conclusion
References

Chapter 10 Deep Learning

10.1 The AI Winter
10.2 How It Works
10.3 How to Implement
10.4 Conclusion
References

Chapter 11 Recommendation Engines

11.1 Recommendation Engine Concepts
11.2 Collaborative Filtering
11.3 Content-Based Filtering
11.4 Hybrid Recommenders
11.5 Conclusion
References

Chapter 12 Time Series Forecasting

12.1 Time Series Decomposition
12.2 Smoothing Based Methods
12.3 Regression Based Methods
12.4 Machine Learning Methods
12.5 Performance Evaluation
12.6 Conclusion
References

Chapter 13 Anomaly Detection

13.1 Concepts
13.2 Distance-Based Outlier Detection
13.3 Density-Based Outlier Detection
13.4 Local Outlier Factor
13.5 Conclusion
References

Chapter 14 Feature Selection

14.1 Classifying Feature Selection Methods
14.2 Principal Component Analysis
14.3 Information Theory-Based Filtering
14.4 Chi-Square-Based Filtering
14.5 Wrapper-Type Feature Selection
14.6 Conclusion
References

Chapter 15 Getting Started with RapidMiner

15.1 User Interface and Terminology
15.2 Data Importing and Exporting Tools
15.3 Data Visualization Tools
15.4 Data Transformation Tools
15.5 Sampling and Missing Value Tools
15.6 Optimization Tools
15.7 Integration with R
15.8 Conclusion
References

COMPARISON OF DATA SCIENCE ALGORITHMS

ABOUT THE AUTHORS
INDEX
PRAISE
Foreword

A lot has happened since the first edition of this book was published in
2014. There is hardly a day without news on data science, machine learning,
or artificial intelligence in the media. It is interesting that many of those
news articles have a skeptical, if not outright negative, tone. All this
underlines two things: data science and machine learning are finally becoming
mainstream, and people know shockingly little about them. Readers of this
book will certainly do better in this regard. It continues to be a valuable
resource that not only teaches how to use data science in practice, but
also explains how the fundamental concepts work.
Data science and machine learning are fast-moving fields, which is why this
second edition reflects a lot of the changes in our field. While we used to
talk a lot about “data mining” and “predictive analytics” only a couple of
years ago, we have now settled on the term “data science” for the broader
field. And even more importantly: it is now commonly understood that
machine learning is at the core of many current technological breakthroughs.
These are truly exciting times for everyone working in our field!
I have seen situations where data science and machine learning had an
incredible impact. But I have also seen situations where this was not the
case. What was the difference? In most cases where organizations fail with
data science and machine learning, they have used those techniques in
the wrong context. Data science models are not very helpful if you only
have one big decision you need to make. Analytics can still help you in
such cases by giving you easier access to the data you need to make this
decision, or by presenting the data in a consumable fashion. But at the
end of the day, those single big decisions are often strategic. Building a
machine learning model to help you make such a decision is not worth
doing, and such a model often does not yield better results than just making
the decision on your own.

Here is where data science and machine learning can truly help: these
advanced models deliver the most value whenever you need to make lots of
similar decisions quickly. Good examples for this are:
• Defining the price of a product in markets with rapidly changing
  demands.
• Making offers for cross-selling in an E-Commerce platform.
• Approving credit or not.
• Detecting customers with a high risk for churn.
• Stopping fraudulent transactions.
• And many others.
A human being with access to all relevant data could make each of those
decisions in a matter of seconds or minutes. But they cannot keep up without
data science, since they would need to make this type of decision millions
of times, every day. Consider sifting through your customer base of
50 million clients every day to identify those with a high churn risk.
Impossible for any human being. But no problem at all for a machine learning
model.
So, the biggest value of artificial intelligence and machine learning is not to
support us with those big strategic decisions. Machine learning delivers the
most value when we operationalize models and automate millions of decisions.
One of the shortest descriptions of this phenomenon comes from Andrew
Ng, who is a well-known researcher in the field of AI. Andrew describes what
AI can do as follows: “If a typical person can do a mental task with less than
one second of thought, we can probably automate it using AI either now or
in the near future.”
I agree with him on this characterization. And I like that Andrew puts the
emphasis on automation and operationalization of those models—because
this is where the biggest value is. The only thing I disagree with is the time
unit he chose. It is safe to already go with a minute instead of a second.
However, the quick pace of changes as well as the ubiquity of data science
also underlines the importance of laying the right foundations. Keep in
mind that machine learning is not completely new. It has been an active
field of research since the 1950s. Some of the algorithms used today have
even been around for more than 200 years now. And the first deep learning
models were developed in the 1960s with the term “deep learning” being
coined in 1984. Those algorithms are well understood now. And
understanding their basic concepts will help you to pick the right algorithm
for the right task.
To support you with this, some additional chapters on deep learning and
recommendation systems have been added to the book. Another focus area is
using text analytics and natural language processing. It became clear in the
past years that the most successful predictive models have been using
unstructured input data in addition to the more traditional tabular formats.
Finally, the expanded coverage of time series forecasting should get you
started on one of the most widely applied data science techniques in business.
More algorithms could mean that there is a risk of increased complexity. But
thanks to the simplicity of the RapidMiner platform and the many practical
examples throughout the book this is not the case here. We continue our
journey towards the democratization of data science and machine learning.
This journey continues until data science and machine learning are as
ubiquitous as data visualization or Excel. Of course, we cannot magically transform
everybody into a data scientist overnight, but we can give people the tools to
help them on their personal path of development. This book is the only tour
guide you need on this journey.

Ingo Mierswa
Founder
RapidMiner Inc.
Massachusetts, USA
Preface

Our goal is to introduce you to Data Science.

We will provide you with a survey of the fundamental data science concepts
as well as step-by-step guidance on practical implementations—enough to
get you started on this exciting journey.

WHY DATA SCIENCE?

We have run out of adjectives and superlatives to describe the growth trends
of data. The technology revolution has brought about the need to process,
store, analyze, and comprehend large volumes of diverse data in meaningful
ways. However, the value of the stored data is zero unless it is acted upon.
The scale of data volume and variety places new demands on organizations to
quickly uncover hidden relationships and patterns. This is where data science
techniques have proven to be extremely useful. They are increasingly finding
their way into the everyday activities of many business and government
functions, whether in identifying which customers are likely to take their
business elsewhere, or in mapping flu pandemics using social media signals.
Data science is a compilation of techniques that extract value from data.
Some of the techniques used in data science have a long history and trace
their roots to applied statistics, machine learning, visualization, logic, and
computer science. Some techniques have only just reached the popularity they
deserve. Most emerging technologies go through what is termed the “hype
cycle.” This is a way of contrasting the amount of hyperbole or hype versus
the productivity that is engendered by the emerging technology. The hype
cycle has three main phases: peak of inflated expectation, trough of
disillusionment, and plateau of productivity. The third phase refers to the
mature and value-generating phase of any technology. The hype cycle for data
science indicates that it is in this mature phase. Does this imply that data
science has stopped growing or has reached a saturation point? Not at all. On
the contrary, this discipline has grown beyond the scope of its initial
applications in marketing and has advanced to applications in technology,
internet-based fields, health care, government, finance, and manufacturing.

WHY THIS BOOK?

The objective of this book is two-fold: to help clarify the basic concepts
behind many data science techniques in an easy-to-follow manner; and to
prepare anyone with a basic grasp of mathematics to implement these
techniques in their organizations without the need to write any lines of
programming code.
Beyond its practical value, we wanted to show you that the data science
learning algorithms are elegant, beautiful, and incredibly effective. You will
never look at data the same way once you learn the concepts of the learning
algorithms.
To make the concepts stick, you will have to build data science models.
While there are many data science tools available to execute algorithms and
develop applications, the approaches to solving a data science problem are
similar among these tools. We wanted to pick a fully functional, open source,
free to use, graphical user interface-based data science tool so readers can
follow the concepts and implement the data science algorithms. RapidMiner, a
leading data science platform, fit the bill and, thus, we used it as a
companion tool to implement the data science algorithms introduced in every
chapter.

WHO CAN USE THIS BOOK?

The concepts and implementations described in this book are geared towards
business, analytics, and technical professionals who use data every day. You,
the reader of the book, will get a comprehensive understanding of the
different data science techniques that can be used for prediction and for
discovering patterns, be prepared to select the right technique for a given
data problem, and be able to create a general-purpose analytics process.
We have tried to follow a process to describe this body of knowledge. Our
focus has been on introducing about 30 key algorithms that are in widespread
use today. We present these algorithms in the framework of:
1. A high-level practical use case for each algorithm.
2. An explanation of how the algorithm works in plain language. Many
algorithms have a strong foundation in statistics and/or computer
science. In our descriptions, we have tried to strike a balance between
being accessible to a wider audience and being academically rigorous.
3. A detailed review of implementation using RapidMiner, by describing
the commonly used setup and parameter options using a sample data
set. You can download the processes from the companion website
www.IntroDataScience.com and we recommend you follow along by
building an actual data science process.
Analysts, finance, engineering, marketing, and business professionals, or
anyone who analyzes data, most likely will use data science techniques in their
job either now or in the near future. For business managers who are one step
removed from the actual data science process, it is important to know what
is possible and not possible with these techniques so they can ask the right
questions and set proper expectations. While basic spreadsheet analysis and
the slicing and dicing of data through standard business intelligence tools
will continue to form the foundations of data exploration in business, data
science techniques are necessary to establish the full edifice of analytics
in organizations.

Vijay Kotu
California, USA
Bala Deshpande, PhD
Michigan, USA
Acknowledgments

Writing a book is one of the most interesting and challenging endeavors one
can embark on. We grossly underestimated the effort it would take and the
fulfillment it brings. This book would not have been possible without the
support of our families, who granted us enough leeway in this time-
consuming activity. We would like to thank the team at RapidMiner, who
provided great help with everything, ranging from technical support to
reviewing the chapters to answering questions on the features of the product.
Our special thanks to Ingo Mierswa for setting the stage for the book through
the foreword. We greatly appreciate the thoughtful and insightful comments
from our technical reviewers: Doug Schrimager from Slalom Consulting,
Steven Reagan from L&L Products, and Philipp Schlunder, Tobias Malbrecht
and Ingo Mierswa from RapidMiner. Thanks to Mike Skinner of Intel and to
Dr. Jacob Cybulski of Deakin University in Australia. We had great support
and stewardship from the Elsevier and Morgan Kaufmann team: Glyn Jones,
Ana Claudia Abad Garcia, and Sreejith Viswanathan. Thanks to our
colleagues and friends for all the productive discussions and suggestions
regarding this project.

CHAPTER 1

Introduction

Data science is a collection of techniques used to extract value from data. It
has become an essential tool for any organization that collects, stores, and
processes data as part of its operations. Data science techniques rely on
finding useful patterns, connections, and relationships within data. Because
the term is a buzzword, there is a wide variety of definitions and criteria
for what constitutes data science. Data science is also commonly referred to
as knowledge discovery, machine learning, predictive analytics, and data
mining. However, each term has a slightly different connotation depending on
the context. In this chapter, we attempt to provide a general overview of
data science and point out its important features, purpose, taxonomy, and
methods.
In spite of the present growth and popularity, the underlying methods of
data science are decades if not centuries old. Engineers and scientists have
been using predictive models since the beginning of the nineteenth century.
Humans have always been forward-looking creatures and predictive sciences
are manifestations of this curiosity. So, who uses data science today? Almost
every organization and business. Sure, the methods that now fall under data
science were not always called “data science.” The use of the term science in
data science indicates that the methods are evidence based, and are built on
empirical knowledge, more specifically historical observations.
As the ability to collect, store, and process data has increased, in line with
Moore’s Law (which implies that computing hardware capabilities double
every two years), data science has found increasing applications in many
diverse fields. Just decades ago, building a production-quality regression
model took several dozen hours (Parr Rud, 2001). Technology has
come a long way. Today, sophisticated machine learning models involving
hundreds of predictors and millions of records can be run in a matter of a
few seconds on a laptop computer.
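To make the doubling claim above concrete, the compounding can be worked out in a few lines of Python. This is only an illustrative sketch: it assumes an idealized, constant doubling period of exactly two years, and the function name capability_multiplier is ours, not something defined in the book or in RapidMiner.

```python
def capability_multiplier(years: float, doubling_period_years: float = 2.0) -> float:
    """Growth factor after `years`, assuming capability doubles once per period.

    The fixed two-year period is an idealization used only for illustration.
    """
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Print the compounded growth factor for a few time horizons.
    for years in (2, 10, 20):
        factor = capability_multiplier(years)
        print(f"After {years:>2} years: about {factor:,.0f}x")
```

Under this assumption, ten doublings over twenty years compound to a factor of 2^10 = 1,024, which is why a task that once took hours can plausibly shrink to seconds.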
The process involved in data science, however, has not changed since those
early days and is not likely to change much in the foreseeable future. To get
meaningful results from any data, a major effort preparing, cleaning,
scrubbing, or standardizing the data is still required before the learning
algorithms can begin to crunch them. But what may change is the automation
available to do this. While today this process is iterative and requires
analysts’ awareness of the best practices, soon smart automation may be
deployed. This will allow the focus to be put on the most important aspect
of data science: interpreting the results of the analysis in order to make
decisions. This will also increase the reach of data science to a wider
audience.
Data Science. DOI: https://doi.org/10.1016/B978-0-12-814761-0.00001-0
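The preparation effort described above can be pictured with a small Python sketch. It is not the book's method (the book uses RapidMiner processes rather than code); it simply assumes that missing values are marked as None and that "standardizing" means z-score scaling. The function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def clean_and_standardize(values):
    """Drop missing entries (None), then rescale to zero mean and unit variance.

    Assumption for illustration: standardizing means z-score scaling with the
    population standard deviation.
    """
    cleaned = [float(v) for v in values if v is not None]
    if not cleaned:
        return []  # nothing left after cleaning
    mu = mean(cleaned)
    sigma = pstdev(cleaned)
    if sigma == 0:
        # All remaining values are identical, so every z-score is zero.
        return [0.0 for _ in cleaned]
    return [(v - mu) / sigma for v in cleaned]


# Hypothetical raw attribute with two missing records.
raw_incomes = [52000, None, 61000, 48000, None, 75000]
standardized = clean_and_standardize(raw_incomes)
print([round(z, 2) for z in standardized])
```

Even in this toy case the analyst has to make choices (drop or impute missing records, which scaling to use), which is the iterative, judgment-heavy work the paragraph refers to.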
When it comes to data science techniques, is there a core set of procedures
and principles one must master? It turns out that a vast majority of data science
practitioners today use a handful of very powerful techniques to accomplish
their objectives: decision trees, regression models, deep learning, and clustering
(Rexer, 2013). A majority of the data science activity can be accomplished
using relatively few techniques. However, as with all 80/20 rules, the long tail,
which is made up of a large number of specialized techniques, is where the
value lies, and depending on what is needed, the best approach may be a
relatively obscure technique or a combination of several not so commonly used
procedures. Thus, it will pay off to learn data science and its methods in a
systematic way, and that is what is covered in these chapters. But first, how do
the often-used terms artificial intelligence (AI), machine learning, and data
science relate to one another?

1.1 AI, MACHINE LEARNING, AND DATA SCIENCE

Artificial intelligence, machine learning, and data science are all related to
each other. Unsurprisingly, they are often used interchangeably and conflated
with each other in popular media and business communication. However,
the three fields are distinct, depending on the context. Fig. 1.1 shows
the relationship between artificial intelligence, machine learning, and
data science.
Artificial intelligence is about giving machines the capability of mimicking
human behavior, particularly cognitive functions. Examples include facial
recognition, automated driving, and sorting mail based on postal code. In some
cases, machines have far exceeded human capabilities (sorting thousands of
pieces of mail in seconds) and in other cases we have barely scratched the
surface (search “artificial stupidity”). There is quite a range of techniques
that fall under artificial intelligence: linguistics, natural language processing,
decision science, bias, vision, robotics, planning, etc. Learning is an
important part of human capability. In fact, many other living organisms can
learn.

FIGURE 1.1
Artificial intelligence, machine learning, and data science.
[Figure labels. Artificial intelligence: linguistics, vision, language synthesis, robotics, sensor, planning. Machine learning and data science: support vector machines, kNN, decision trees, Bayesian learning, deep learning, text mining, time series forecasting, recommendation engines, data preparation, statistics, process mining, visualization, processing paradigms, experimentation.]

FIGURE 1.2
Traditional program and machine learning.

Machine learning, which can be considered either a sub-field or one of the
tools of artificial intelligence, provides machines with the capability of
learning from experience. Experience for machines comes in the form of data.
Data that is used to teach machines is called training data. Machine learning
turns the traditional programming model upside down (Fig. 1.2). A program, a
set of instructions to a computer, transforms input signals into output signals
using predetermined rules and relationships. Machine learning algorithms,
also called “learners”, take both the known input and output (training data)
to figure out a model for the program which converts input to output. For
example, many organizations like social media platforms, review sites, or
forums are required to moderate posts and remove abusive content. How can
machines be taught to automate the removal of abusive content? The
machines need to be shown examples of both abusive and non-abusive posts
with a clear indication of which one is abusive. The learners will generalize a
pattern based on certain words or sequences of words in order to conclude
whether the overall post is abusive or not. The model can take the form of a
set of “if-then” rules. Once the data science rules or model is developed,
machines can start categorizing the disposition of any new posts.
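The abusive-post example can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the training posts, the function names (tokenize, learn_rules, classify), and the rule "flag any word seen in at least two abusive posts and in no non-abusive post" are all hypothetical and are not taken from the book. Real learners generalize in far more robust ways, but the shape is the same: known inputs and outputs go in, and a set of “if-then” rules comes out.

```python
from collections import Counter

# Hypothetical training data: (post text, label) pairs, where the label
# is True for abusive posts and False for non-abusive posts.
TRAINING_DATA = [
    ("you are an idiot and a loser", True),
    ("what an idiot move, total garbage", True),
    ("nobody likes you, loser", True),
    ("thanks for an honest and helpful review", False),
    ("great product, you are very helpful", False),
    ("what a lovely day for a walk", False),
]


def tokenize(text):
    """Split a post into lowercase words, dropping surrounding punctuation."""
    stripped = (token.strip(".,!?").lower() for token in text.split())
    return [word for word in stripped if word]


def learn_rules(examples, min_posts=2):
    """Learn a set of flagged words from labeled posts.

    Assumed toy rule: a word is flagged if it appears in at least
    `min_posts` abusive posts and in no non-abusive post.
    """
    abusive_counts = Counter()
    clean_words = set()
    for text, label in examples:
        words = set(tokenize(text))  # count each word once per post
        if label:
            abusive_counts.update(words)
        else:
            clean_words.update(words)
    return {
        word
        for word, count in abusive_counts.items()
        if count >= min_posts and word not in clean_words
    }


def classify(post, rules):
    """Apply the learned rules: IF the post contains a flagged word THEN abusive."""
    return any(word in rules for word in tokenize(post))


rules = learn_rules(TRAINING_DATA)
print(sorted(rules))
print(classify("You are such a LOSER!", rules))
print(classify("Thanks, very helpful", rules))
```

With this toy data the learner keeps only the words "idiot" and "loser", so the first new post is flagged and the second is not. Note that nobody wrote the rule by hand: changing the training data changes the rules, which is the inversion of the traditional programming model described above.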
Data science is the business application of machine learning, artificial
intelligence, and other quantitative fields like statistics, visualization, and
mathematics. It is an interdisciplinary field that extracts value from data. In the
context of how data science is used today, it relies heavily on machine
learning and is sometimes called data mining. Examples of data science use cases
are: recommendation engines that can recommend movies for a particular
user, a fraud alert model that detects fraudulent credit card transactions, a
model that finds customers who will most likely churn next month, or a model
that predicts revenue for the next quarter.

1.2 WHAT IS DATA SCIENCE?

Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with
thousands of variables. Data science utilizes certain specialized computational
methods in order to discover meaningful and useful structures within a dataset.
The discipline of data science coexists and is closely associated with a number
of related areas such as database systems, data engineering, visualization, data
analysis, experimentation, and business intelligence (BI). We can further define
data science by investigating some of its key features and motivations.

1.2.1 Extracting Meaningful Patterns

Knowledge discovery in databases is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns or
relationships within a dataset in order to make important decisions (Fayyad,
Piatetsky-Shapiro, & Smyth, 1996). Data science involves inference and
iteration of many different hypotheses. One of the key aspects of data science is
the process of generalization of patterns from a dataset. The generalization
should be valid, not just for the dataset used to observe the pattern, but also
for new unseen data. Data science is also a process with defined steps, each