DATA MINING
A Tutorial-Based Primer
SECOND EDITION
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
PUBLISHED TITLES
ACCELERATING DISCOVERY: MINING UNSTRUCTURED INFORMATION FOR
HYPOTHESIS GENERATION
Scott Spangler
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE
DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY,
AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING: A TUTORIAL-BASED PRIMER, SECOND EDITION
Richard J. Roiger
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES, SECOND EDITION
Luís Torgo
EVENT MINING: ALGORITHMS AND APPLICATIONS
Tao Li
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY,
SECOND EDITION
Harvey J. Miller and Jiawei Han
GRAPH-BASED SOCIAL MEDIA ANALYSIS
Ioannis Pitas
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
HEALTHCARE DATA ANALYTICS
Chandan K. Reddy and Charu C. Aggarwal
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS
AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND
LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR
ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO
CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS
APPLICATIONS
Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS,
AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY,
ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
TEXT MINING AND VISUALIZATION: CASE STUDIES USING OPEN-SOURCE
TOOLS
Markus Hofmann and Andrew Chisholm
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX
DECOMPOSITIONS
David Skillicorn
DATA MINING
A Tutorial-Based Primer
SECOND EDITION
Richard J. Roiger
This book was previously published by Pearson Education, Inc.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
BIBLIOGRAPHY, 461
INDEX, 465
List of Figures
Figure 5.15 A scatterplot comparing age and life insurance promotion. 156
Figure 5.16 A decision tree process model. 157
Figure 5.17 A decision tree for the credit card promotion database. 158
Figure 5.18 A decision tree in descriptive form. 158
Figure 5.19 A list of operator options. 159
Figure 5.20 Customer churn—A training and test set scenario. 160
Figure 5.21 Removing instances of unknown outcome from the churn data set. 161
Figure 5.22 Partitioning the customer churn data. 162
Figure 5.23 The customer churn data set. 163
Figure 5.24 Filter Examples has removed all instances of unknown outcome. 163
Figure 5.25 A decision tree for the customer churn data set. 164
Figure 5.26 Output of the Apply Model operator. 164
Figure 5.27 A performance vector for the customer churn data set. 165
Figure 5.28 Adding a subprocess to the main process window. 166
Figure 5.29 A subprocess for data preprocessing. 167
Figure 5.30 Creating and saving a decision tree model. 168
Figure 5.31 Reading and applying a saved model. 169
Figure 5.32 An Excel file stores model predictions. 169
Figure 5.33 Testing a model using cross-validation. 170
Figure 5.34 A subprocess to read and filter customer churn data. 171
Figure 5.35 Nested subprocesses for cross-validation. 171
Figure 5.36 Performance vector for a decision tree tested using cross-validation. 172
Figure 5.37 Subprocess for the Tree to Rules operator. 174
Figure 5.38 Building a model with the Tree to Rules operator. 174
Figure 5.39 Rules generated by the Tree to Rules operator. 175
Figure 5.40 Performance vector for the customer churn data set. 175
Figure 5.41 A process design for rule induction. 176
Figure 5.42 Adding the Discretize by Binning operator. 177
Figure 5.43 Covering rules for customer churn data. 177
Figure 5.44 Performance vector for the covering rules of Figure 5.43. 178
Figure 5.45 Process design for subgroup discovery. 179
Figure 5.46 Subprocess design for subgroup discovery. 179
Figure 5.47 Rules generated by the Subgroup Discovery operator. 180
Figure 5.48 Ten rules identifying likely churn candidates. 181
Figure 5.49 Generating association rules for the credit card promotion database. 182
Figure 5.50 Preparing data for association rule generation. 183
Figure 5.51 Interface for listing association rules. 184
Figure 5.52 Association rules for the credit card promotion database. 184
Figure 5.53 Market basket analysis template. 185
Figure 5.54 The pivot operator rotates the example set. 186
Figure 5.55 Association rules for the Market Basket Analysis template. 186
Figure 5.56 Process design for clustering gamma-ray burst data. 188
Figure 5.57 A partial clustering of gamma-ray burst data. 189
Figure 5.58 Three clusters of gamma-ray burst data. 189
Figure 5.59 Decision tree illustrating a gamma-ray burst clustering. 190
Figure 5.60 A descriptive form of a decision tree showing a clustering
of gamma-ray burst data. 190
Figure 5.61 Benchmark performance for nearest neighbor classification. 192
Figure 5.62 Main process design for nearest neighbor classification. 192
Figure 5.63 Subprocess for nearest neighbor classification. 193
Figure 5.64 Forward selection subprocess for nearest neighbor classification. 193
Figure 5.65 Performance vector when forward selection is used for choosing
attributes. 194
Figure 5.66 Unsupervised clustering for attribute evaluation. 197
Figure 6.1 A seven-step KDD process model. 200
Figure 6.2 The Acme credit card database. 203
Figure 6.3 A process model for detecting outliers. 205
Figure 6.4 Two outlier instances from the diabetes patient data set. 206
Figure 6.5 Ten outlier instances from the diabetes patient data set. 207
Figure 7.1 Components for supervised learning. 222
Figure 7.2 A normal distribution. 225
Figure 7.3 Random samples from a population of 10 elements. 226
Figure 7.4 A process model for comparing three competing models. 239
Figure 7.5 Subprocess for comparing three competing models. 240
Figure 7.6 Cross-validation test for a decision tree with maximum depth = 5. 240
Figure 7.7 A matrix of t-test scores. 241
Figure 7.8 ANOVA comparing three competing models. 241
Figure 7.9 ANOVA operators for comparing nominal and numeric attributes. 242
Figure 7.10 The grouped ANOVA operator comparing class and maximum heart
rate. 243
Figure 7.11 The ANOVA matrix operator for the cardiology patient data set. 243
Figure 7.12 A process model for creating a lift chart. 244
Figure 7.13 Preprocessing the customer churn data set. 245
Figure 7.14 Output of the Apply Model operator for the customer churn data set. 245
Figure 7.15 Performance vector for customer churn. 246
Figure 7.16 A Pareto lift chart for customer churn. 247
Figure 8.1 A fully connected feed-forward neural network. 254
Figure 8.2 The sigmoid evaluation function. 257
Figure 8.3 A 3 × 3 Kohonen network with two input-layer nodes. 260
Figure 8.4 Connections for two output-layer nodes. 266
Figure 9.1 Graph of the XOR function. 272
Figure 9.2 XOR training data. 273
Figure 9.3 Satellite image data. 274
Figure 9.4 Weka's four graphical user interfaces (GUIs) for XOR training. 275
Figure 9.5 Backpropagation learning parameters. 276
Figure 9.6 Architecture for the XOR function. 278
Figure 9.7 XOR training output. 278
Figure 10.17 Green and red have been removed from the satellite image data set. 305
Figure 10.18 Correlation matrix for the satellite image data set. 306
Figure 10.19 Neural network model for predicting customer churn. 307
Figure 10.20 Preprocessing the customer churn data. 308
Figure 10.21 Cross-validation subprocess for customer churn. 308
Figure 10.22 Performance vector for customer churn. 309
Figure 10.23 Process for creating and saving a neural network model. 309
Figure 10.24 Process model for reading and applying a neural network model. 310
Figure 10.25 Neural network output for predicting customer churn. 310
Figure 10.26 SOM process model for the cardiology patient data set. 312
Figure 10.27 Clustered instances of the cardiology patient data set. 312
Figure 11.1 RapidMiner’s naïve Bayes operator. 325
Figure 11.2 Subprocess for applying naïve Bayes to customer churn data. 326
Figure 11.3 Naïve Bayes Distribution Table for customer churn data. 326
Figure 11.4 Naïve Bayes performance vector for customer churn data. 327
Figure 11.5 Life insurance promotion by gender. 328
Figure 11.6 Naïve Bayes model with output attribute = LifeInsPromo. 329
Figure 11.7 Predictions for the life insurance promotion. 329
Figure 11.8 Hyperplanes separating the circle and star classes. 330
Figure 11.9 Hyperplanes passing through their respective support vectors. 331
Figure 11.10 Maximal margin hyperplane separating the star and circle classes. 335
Figure 11.11 Loading the nine instances of Figure 11.8 into the Explorer. 338
Figure 11.12 Invoking SMO model. 339
Figure 11.13 Disabling data normalization/standardization. 339
Figure 11.14 The SMO-created MMH for the data shown in Figure 11.8. 340
Figure 11.15 Applying mySVM to the cardiology patient data set. 341
Figure 11.16 Normalized cardiology patient data. 342
Figure 11.17 Equation of the MMH for the cardiology patient data set. 342
Figure 11.18 Actual and predicted output for the cardiology patient data. 343
Figure 11.19 Performance vector for the cardiology patient data. 343
Figure 11.20 A linear regression model for the instances of Figure 11.8. 345
Figure 11.21 Main process window for applying RapidMiner’s linear regression
operator to the gamma-ray burst data set. 346
Figure 11.22 Subprocess windows for the Gamma Ray burst experiment. 346
Figure 11.23 Linear regression—actual and predicted output for the gamma-ray
burst data set. 347
Figure 11.24 Summary statistics and the linear regression equation for the
gamma-ray burst data set. 347
Figure 11.25 Scatterplot diagram showing the relationship between t90 and t50. 348
Figure 11.26 Performance vector resulting from the application of linear
regression to the gamma-ray burst data set. 348
Figure 11.27 A generic model tree. 349
Figure 11.28 The logistic regression equation. 351
Figure 12.1 A Cobweb-created hierarchy. 363
Figure 12.2 Applying EM to the gamma-ray burst data set. 366
Figure 12.3 Removing correlated attributes from the gamma-ray burst data set. 367
Figure 12.4 An EM clustering of the gamma-ray burst data set. 367
Figure 12.5 Summary statistics for an EM clustering of the gamma-ray burst data set. 368
Figure 12.6 Decision tree representing a clustering of the gamma-ray burst data set. 368
Figure 12.7 The decision tree of Figure 12.6 in descriptive form. 369
Figure 12.8 Classes of the sensor data set. 370
Figure 12.9 Generic object editor allows us to specify the number of clusters. 370
Figure 12.10 Classes to clusters summary statistics. 371
Figure 12.11 Unsupervised genetic clustering. 372
Figure 13.1 A process model for extracting historical market data. 380
Figure 13.2 Historical data for XIV. 381
Figure 13.3 Time-series data with numeric output. 382
Figure 13.4 Time-series data with categorical output. 383
Figure 13.5 Time-series data for processing with RapidMiner. 383
Preface
Data mining is the process of finding interesting patterns in data. The objective of data mining is to use discovered patterns to help explain current behavior or to predict future outcomes. Several aspects of the data mining process can be studied. A single book cannot concentrate on all areas of the data mining process. Although we furnish some detail about all aspects of data mining and knowledge discovery, our primary focus is on model building and testing, as well as on interpreting and validating results.
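To make the book's central workflow concrete before the tutorials begin, the short sketch below builds and tests a classification model in Python. It is a hypothetical illustration only, not material from the text (the tutorials themselves use the code-free Weka and RapidMiner tools), and it assumes the scikit-learn library is installed: it trains a decision tree on scikit-learn's bundled iris data and checks its predictions against a held-out test set.

# A minimal sketch (not from the text): model building and testing with a
# decision tree, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split the data into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Model building: fit a decision tree to the training data.
model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(X_train, y_train)

# Model testing: apply the model to unseen data and evaluate the result.
predictions = model.predict(X_test)
print("Test set accuracy:", accuracy_score(y_test, predictions))

The same three steps, preparing the data, building a model, and evaluating it on data the model has not yet seen, are the steps the RapidMiner and Weka tutorials walk through graphically.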
OBJECTIVES
I wrote the text to facilitate the following student learning goals:
• Understand what data mining is and how data mining can be employed to solve real
problems
• Recognize whether a data mining solution is a feasible alternative for a specific
problem
• Step through the knowledge discovery process and write a report about the results of
a data mining session
• Know how to apply data mining software tools to solve real problems
• Apply basic statistical and nonstatistical techniques to evaluate the results of a data
mining session
• Recognize several data mining strategies and know when each strategy is appropriate
• Develop a comprehensive understanding of how several data mining techniques
build models to solve problems
• Develop a general awareness about the structure of a data warehouse and how a data
warehouse can be used
• Understand what online analytical processing (OLAP) is and how it can be applied
to analyze data
• Chapter 5 is all about data mining using RapidMiner Studio, a powerful open-source
and code-free version of RapidMiner’s commercial product. RapidMiner uses a
drag and drop workflow paradigm for building models to solve complex problems.
RapidMiner’s intuitive user interface, visualization capabilities, and assortment of
operators for preprocessing and mining data are second to none.
• This edition covers what are considered to be the top 10 data mining algorithms
(Wu and Kumar, 2009). Nine of the algorithms are used in one or more tutorials.
• Tutorials have been added for attribute selection, dealing with imbalanced data, outlier analysis, time-series analysis, and mining textual data.
• Over 90% of the tutorials are presented using both Weka and RapidMiner. This
allows readers maximum flexibility for their hands-on data mining experience.
INTENDED AUDIENCE
I developed most of the material for this book while teaching a one-semester data mining
course open to students majoring or minoring in business or computer science. In writing
this text, I directed my attention toward four groups of individuals.
CHAPTER FEATURES
I take the approach that model building is both an art and a science best understood from
the perspective of learning by doing. My view is supported by several features found within
the pages of the text. The following is a partial list of these features.
• Simple, detailed examples. I remove much of the mystery surrounding data mining
by presenting simple, detailed examples of how the various data mining techniques
build their models. Because of its tutorial nature, the text is appropriate as a self-study
guide as well as a college-level textbook for a course about data mining and knowledge discovery.
• Overall tutorial style. All examples in Chapters 4, 5, 9, and 10 are tutorials. Selected
sections in Chapters 6, 7, 11, 12, 13, and 14 offer easy-to-follow, step-by-step tutorials
for performing data analytics. All selected section tutorials are highlighted for easy
differentiation from regular text.
• Data sets for data mining. A variety of data sets from business, medicine, and science
are ready for data mining.
• Key term definitions. Each chapter introduces several key terms. A list of definitions
for these terms is provided at the end of each chapter.
• Review questions ask basic questions about the concepts and content found
within each chapter. The questions are designed to help determine if the reader
understands the major points conveyed in each chapter.
• Data mining questions require the reader to use one or several data mining tools
to perform data mining sessions.
CHAPTER CONTENT
The ordering of the chapters and the division of the book into separate parts are based on several years of experience in teaching courses on data mining. Section I introduces material that is fundamental to understanding the data mining process. The presentation is informal and easy to follow. Basic data mining concepts, strategies, and techniques are introduced. Students learn about the types of problems that can be solved
with data mining.
Once the basic concepts are understood, Section II provides the tools for knowledge
discovery with detailed tutorials taking you through the knowledge discovery process.
The fact that data preprocessing is fundamental to successful data mining is emphasized. Also, special attention is given to formal data mining evaluation techniques.
Section III is all about neural networks. A conceptual and detailed presentation is offered
for feed-forward networks trained with backpropagation learning and self-organizing
maps for unsupervised clustering. Section III contains several tutorials for neural network
learning with Weka and RapidMiner.
Section IV focuses on several specialized techniques. Topics of current interest such as
time-series analysis, textual data mining, imbalanced and streaming data, as well as Web-
based data mining are described.
• Chapter 1 offers an overview of data analytics and all aspects of the data mining process. Special emphasis is placed on helping the student determine when data mining
is an appropriate problem-solving strategy.
• Chapter 2 presents a synopsis of several common data mining strategies and techniques. Basic methods for evaluating the outcome of a data mining session are described.
• Chapter 3 details a decision tree algorithm, the Apriori algorithm for producing association rules, a covering rule algorithm, the K-means algorithm for unsupervised
clustering, and supervised genetic learning. Tools are provided to help determine
which data mining techniques should be used to solve specific problems.
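As a taste of how Chapter 3 traces its algorithms step by step, the sketch below implements the K-means procedure in plain Python. It is a hypothetical illustration, not code from the book: the data points, the choice of k = 2, and the fixed number of iterations are all assumptions made for this example.

# A minimal K-means sketch (not from the book): repeatedly assign each point
# to its nearest center, then move each center to the mean of its points.
import random

def kmeans(points, k, iterations=10):
    centers = random.sample(points, k)  # initial centers drawn from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: place each point in the cluster of its nearest center.
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: recompute each center as the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centers, clusters = kmeans(data, k=2)
print("Cluster centers:", centers)

Random initialization means different runs can settle on different centers; the sketch keeps the logic short rather than robust.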
INSTRUCTOR SUPPLEMENTS
The following supplements are provided to help the instructor organize lectures and write
examinations:
• PowerPoint slides. Each figure and table in the text is part of a PowerPoint presentation. These slides are also offered in PDF format.
• A second set of slides containing the screenshots seen as you work through the
tutorials in Chapters 4 through 14.
• All RapidMiner processes used in the tutorials, demonstrations, and end-of-chapter
exercises are readily available together with simple installation instructions.
• Test questions. Several test questions are provided for each chapter.
• Answers to selected exercises. Answers are given for most of the end-of-chapter
exercises.
• Lesson planner. The lesson planner contains ideas for lecture format and points for
discussion. The planner also provides suggestions for using selected end-of-chapter
exercises in a laboratory setting.
Please note that these supplements are available to qualified instructors only. Contact
your CRC sales representative or get help by visiting https://www.crcpress.com/contactus
to access this material. Supplements will be updated as needed.
• Cover the following sections to gain enough knowledge to understand the tutorials
presented in later chapters.
• If Weka is your choice, at a minimum, work through Sections 4.1, 4.2, and 4.7 of
Chapter 4.
• If you are focusing on RapidMiner, cover at least Sections 5.1 and 5.2 of Chapter 5.
• Here is a summary of the tutorials given in Chapters 6, 7, 11, 12, 13, and 14.
• Chapter 6: RapidMiner is used to provide a tutorial on outlier analysis.
• Chapter 7: Tutorials are presented using RapidMiner’s T-Test and ANOVA opera-
tors for comparing model performance.
• Chapter 11: Both RapidMiner and Weka are used for tutorials highlighting the naïve Bayes classifier and support vector machines.
• Chapter 12: RapidMiner and Weka are used to illustrate unsupervised clustering
with the EM (Expectation Maximization) algorithm.
• Chapter 13: Both RapidMiner and Weka are employed for time-series analysis.
RapidMiner is used for a tutorial on textual data mining. Weka is employed for
a tutorial on ROC curves. RapidMiner is used to give an example of ensemble
learning.
• Chapter 14: Tutorials are given for creating simple and multidimensional MS
Excel pivot tables.
• Chapter 9 is about neural networks using Weka. Chapter 10 employs RapidMiner
to cover the same material. There are advantages to examining at least some of the
material in both chapters. Weka's neural network function is able to mine data having a numeric output attribute, and RapidMiner's self-organizing map operator can
perform dimensionality reduction as well as unsupervised clustering.
Acknowledgments
I am indebted to my editor Randi Cohen for the confidence she placed in me and for allowing me the freedom to make critical decisions about the content of the text. I am very grateful to Dr. Mark Polczynski, whose constructive comments were particularly helpful during revisions of the manuscript. Finally, I am most deeply indebted to my wife Suzanne for her extreme patience, helpful comments, and consistent support.
Author
I
Data Mining Fundamentals