Exploratory Data Analysis
with MATLAB®
Second Edition
The interface between the computer and statistical sciences is increasing, as each discipline
seeks to harness the power and resources of the other. This series aims to foster the integration
between the computer sciences and statistical, numerical, and probabilistic methods by
publishing a broad range of reference works, textbooks, and handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Wendy L. Martinez
Angel R. Martinez
Jeffrey L. Solka
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Martinez, Wendy L.
Exploratory data analysis with MATLAB / Wendy L. Martinez, Angel Martinez,
Jeffrey Solka. -- 2nd ed.
p. cm. -- (Chapman & Hall/CRC computer science and data analysis series)
Includes bibliographical references and index.
ISBN 978-1-4398-1220-4 (hardback)
1. Multivariate analysis. 2. MATLAB. 3. Mathematical statistics. I. Martinez, Angel
R. II. Solka, Jeffrey L., 1955- III. Title.
QA278.M3735 2010
519.5’35--dc22 2010044042
Deborah,
Jeffrey,
Robbi (the middle child),
and Lisa (Principessa)
Table of Contents
Part I
Introduction to Exploratory Data Analysis
Chapter 1
Introduction to Exploratory Data Analysis
1.1 What is Exploratory Data Analysis
1.2 Overview of the Text
1.3 A Few Words about Notation
1.4 Data Sets Used in the Book
1.4.1 Unstructured Text Documents
1.4.2 Gene Expression Data
1.4.3 Oronsay Data Set
1.4.4 Software Inspection
1.5 Transforming Data
1.5.1 Power Transformations
1.5.2 Standardization
1.5.3 Sphering the Data
1.6 Further Reading
Exercises
Part II
EDA as Pattern Discovery
Chapter 2
Dimensionality Reduction — Linear Methods
2.1 Introduction
2.2 Principal Component Analysis — PCA
2.2.1 PCA Using the Sample Covariance Matrix
2.2.2 PCA Using the Sample Correlation Matrix
2.2.3 How Many Dimensions Should We Keep?
2.3 Singular Value Decomposition — SVD
2.4 Nonnegative Matrix Factorization
EDA2ed.book Page viii Thursday, November 11, 2010 8:51 AM
Chapter 3
Dimensionality Reduction — Nonlinear Methods
3.1 Multidimensional Scaling — MDS
3.1.1 Metric MDS
3.1.2 Nonmetric MDS
3.2 Manifold Learning
3.2.1 Locally Linear Embedding
3.2.2 Isometric Feature Mapping — ISOMAP
3.2.3 Hessian Eigenmaps
3.3 Artificial Neural Network Approaches
3.3.1 Self-Organizing Maps
3.3.2 Generative Topographic Maps
3.3.3 Curvilinear Component Analysis
3.4 Summary and Further Reading
Exercises
Chapter 4
Data Tours
4.1 Grand Tour
4.1.1 Torus Winding Method
4.1.2 Pseudo Grand Tour
4.2 Interpolation Tours
4.3 Projection Pursuit
4.4 Projection Pursuit Indexes
4.4.1 Posse Chi-Square Index
4.4.2 Moment Index
4.5 Independent Component Analysis
4.6 Summary and Further Reading
Exercises
Chapter 5
Finding Clusters
5.1 Introduction
5.2 Hierarchical Methods
Chapter 6
Model-Based Clustering
6.1 Overview of Model-Based Clustering
6.2 Finite Mixtures
6.2.1 Multivariate Finite Mixtures
6.2.2 Component Models — Constraining the Covariances
6.3 Expectation-Maximization Algorithm
6.4 Hierarchical Agglomerative Model-Based Clustering
6.5 Model-Based Clustering
6.6 MBC for Density Estimation and Discriminant Analysis
6.6.1 Introduction to Pattern Recognition
6.6.2 Bayes Decision Theory
6.6.3 Estimating Probability Densities with MBC
6.7 Generating Random Variables from a Mixture Model
6.8 Summary and Further Reading
Exercises
Chapter 7
Smoothing Scatterplots
7.1 Introduction
7.2 Loess
7.3 Robust Loess
7.4 Residuals and Diagnostics with Loess
7.4.1 Residual Plots
7.4.2 Spread Smooth
7.4.3 Loess Envelopes — Upper and Lower Smooths
7.5 Smoothing Splines
7.5.1 Regression with Splines
7.5.2 Smoothing Splines
7.5.3 Smoothing Splines for Uniformly Spaced Data
7.6 Choosing the Smoothing Parameter
Part III
Graphical Methods for EDA
Chapter 8
Visualizing Clusters
8.1 Dendrogram
8.2 Treemaps
8.3 Rectangle Plots
8.4 ReClus Plots
8.5 Data Image
8.6 Summary and Further Reading
Exercises
Chapter 9
Distribution Shapes
9.1 Histograms
9.1.1 Univariate Histograms
9.1.2 Bivariate Histograms
9.2 Boxplots
9.2.1 The Basic Boxplot
9.2.2 Variations of the Basic Boxplot
9.3 Quantile Plots
9.3.1 Probability Plots
9.3.2 Quantile-Quantile Plot
9.3.3 Quantile Plot
9.4 Bagplots
9.5 Rangefinder Boxplot
9.6 Summary and Further Reading
Exercises
Chapter 10
Multivariate Visualization
10.1 Glyph Plots
10.2 Scatterplots
10.2.1 2-D and 3-D Scatterplots
10.2.2 Scatterplot Matrices
10.2.3 Scatterplots with Hexagonal Binning
Appendix A
Proximity Measures
A.1 Definitions
A.1.1 Dissimilarities
A.1.2 Similarity Measures
A.1.3 Similarity Measures for Binary Data
A.1.4 Dissimilarities for Probability Density Functions
A.2 Transformations
A.3 Further Reading
Appendix B
Software Resources for EDA
B.1 MATLAB Programs
B.2 Other Programs for EDA
B.3 EDA Toolbox
Appendix C
Description of Data Sets
Appendix D
Introduction to MATLAB
D.1 What Is MATLAB?
D.2 Getting Help in MATLAB
D.3 File and Workspace Management
Appendix E
MATLAB Functions
E.1 MATLAB
E.2 Statistics Toolbox
E.3 Exploratory Data Analysis Toolbox
E.4 EDA GUI Toolbox
In the past several years, many advancements have been made in the area of
exploratory data analysis, and it soon became apparent that it was time to
update this text. In particular, many innovative approaches have been
developed for dimensionality reduction, clustering, and visualization.
We list below some of the major changes and additions in the second
edition.
In a spirit similar to the first edition, this text is not focused on the
theoretical aspects of the methods. Rather, the main focus of this book is on
the use of the EDA methods. So, we do not dwell so much on implementation
and algorithmic details. Instead, we show students and practitioners how the
http://lib.stat.cmu.edu
http://pi-sigma.info
Please review the readme file for installation instructions and information
on any changes.
For MATLAB product information, please contact:
Disclaimers
1. Any MATLAB programs and data sets that are included with the
book are provided in good faith. The authors, publishers, or
distributors do not guarantee their accuracy and are not responsible
for the consequences of their use.
2. Some of the MATLAB functions provided with the EDA Toolboxes
were written by other researchers, and they retain the copyright.
One of the goals of our first book, Computational Statistics Handbook with
MATLAB® [2002], was to show some of the key concepts and methods of
computational statistics and how they can be implemented in MATLAB.1 A
core component of computational statistics is the discipline known as
exploratory data analysis or EDA. Thus, we see this book as a complement to
the first one with similar goals: to make exploratory data analysis techniques
available to a wide range of users.
Exploratory data analysis is an area of statistics and data analysis, where
the idea is to first explore the data set, often using methods from descriptive
statistics, scientific visualization, data tours, dimensionality reduction, and
others. This exploration is done without any (hopefully!) pre-conceived
notions or hypotheses. Indeed, the idea is to use the results of the exploration
to guide and to develop the subsequent hypothesis tests, models, etc. It is
closely related to the field of data mining, and many of the EDA tools
discussed in this book are part of the toolkit for knowledge discovery and
data mining.
This book is intended for a wide audience that includes scientists,
statisticians, data miners, engineers, computer scientists, biostatisticians,
social scientists, and those in any other discipline who must deal with the
analysis of raw data. We also hope this book can be useful in a classroom setting at the
senior undergraduate or graduate level. Exercises are included with each
chapter, making it suitable as a textbook or supplemental text for a course in
exploratory data analysis, data mining, computational statistics, machine
learning, and others. Readers are encouraged to look over the exercises
because new concepts are sometimes introduced in them. Exercises are
computational and exploratory in nature, so there is often no unique answer!
As for the background required for this book, we assume that the reader
has an understanding of basic linear algebra. For example, one should have
a familiarity with the notation of linear algebra, array multiplication, a matrix
inverse, determinants, an array transpose, etc. We also assume that the
reader has had introductory probability and statistics courses. Here one
should know about random variables, probability distributions and density
functions, basic descriptive measures, regression, etc.
In a spirit similar to the first book, this text is not focused on the theoretical
aspects of the methods. Rather, the main focus of this book is on the use of the
1 MATLAB® and Handle Graphics® are registered trademarks of The MathWorks, Inc.
http://lib.stat.cmu.edu
Please review the readme file for installation instructions and information
on any changes. M-files that contain the MATLAB commands for the
exercises are also available for download.
We also make the disclaimer that our MATLAB code is not necessarily the
most efficient way to accomplish the task. In many cases, we sacrificed
efficiency for clarity. Please refer to the example M-files for alternative
MATLAB code, courtesy of Tom Lane of The MathWorks, Inc.
We describe the EDA Toolbox in greater detail in Appendix B. We also
provide website information for other tools that are available for download
(at no cost). Some of these toolboxes and functions are used in the book and
others are provided for informational purposes. Where possible and
appropriate, we include some of this free MATLAB code with the EDA
Toolbox to make it easier for the reader to follow along with the examples
and exercises.
We assume that the reader has the Statistics Toolbox (Version 4 or higher)
from The MathWorks, Inc. Where appropriate, we specify whether the
function we are using is in the main MATLAB software package, Statistics
Toolbox, or the EDA Toolbox. The development of the EDA Toolbox was
mostly accomplished with MATLAB Version 6.5 (Statistics Toolbox, Version
4); so the code should work if this is what you have. However, a new release
of MATLAB and the Statistics Toolbox was introduced in the middle of
writing this book; so we also incorporate information about new
functionality provided in these versions.
We would like to acknowledge the invaluable help of the reviewers: Chris
Fraley, David Johannsen, Catherine Loader, Tom Lane, David Marchette,
and Jeffrey Solka. Their many helpful comments and suggestions resulted in
a better book. Any shortcomings are the sole responsibility of the authors. We
owe a special thanks to Jeffrey Solka for programming assistance with finite
mixtures and to Richard Johnson for allowing us to use his Data
Visualization Toolbox and updating his functions. We would also like to
acknowledge all of those researchers who wrote MATLAB code for methods
described in this book and also made it available for free. We thank the
editors of the book series in Computer Science and Data Analysis for
including this text. We greatly appreciate the help and patience of those at
CRC Press: Bob Stern, Rob Calver, Jessica Vakili, and Andrea Demby. Finally,
we are indebted to Naomi Fernandes and Tom Lane at The MathWorks, Inc.
for their special assistance with MATLAB.
Disclaimers
1. Any MATLAB programs and data sets that are included with the
book are provided in good faith. The authors, publishers, or
distributors do not guarantee their accuracy and are not responsible
for the consequences of their use.
2. Some of the MATLAB functions provided with the EDA Toolbox
were written by other researchers, and they retain the copyright.
References are given in Appendix B and in the help section of
each function. Unless otherwise specified, the EDA Toolbox is
provided under the GNU license specifications:
http://www.gnu.org/copyleft/gpl.html
3. The views expressed in this book are those of the authors and do
not necessarily represent the views of the United States Department
of Defense or its components.
Part I
Introduction to Exploratory Data Analysis
Chapter 1
Introduction to Exploratory Data Analysis
evaluate the precision associated with the results. EDA and CDA should not
be used separately from each other, but rather they should be used in a
complementary way. The analyst explores the data looking for patterns and
structure that lead to hypotheses and models.
Tukey’s book on EDA was written at a time when computers were not
widely available and the data sets tended to be somewhat small, especially
by today’s standards. So, Tukey developed methods that could be
accomplished using pencil and paper, such as the familiar box-and-whisker
plots (also known as boxplots) and the stem-and-leaf display. He also included
discussions of data transformation, smoothing, slicing, and others. Since our
book is written at a time when computers are widely available, we go beyond
what Tukey used in EDA and present computationally intensive methods for
pattern discovery and statistical visualization. However, our philosophy of
EDA is the same: those engaged in it are data detectives.
Tukey [1980], expanding on his ideas of how exploratory and confirmatory
data analysis fit together, presents a typical straight-line methodology for
CDA; its steps follow:
Forming the question involves issues such as: What can or should be asked?
What designs are possible? How likely is it that a design will give a useful
answer? The ideas and methods of EDA play a role in this process. In
conclusion, Tukey states that EDA is an attitude, a flexibility, and some graph
paper.
A small, easily read book on EDA written from a social science perspective
is the one by Hartwig and Dearing [1979]. They describe the CDA mode as
one that answers questions such as “Do the data confirm hypothesis XYZ?”
EDA, by contrast, tends to ask “What can the data tell me about relationship
XYZ?” Hartwig and Dearing specify two principles for EDA: skepticism and
openness. This might involve visualization of the data to look for anomalies
or patterns, the use of resistant (or robust) statistics to summarize the data,
openness to the transformation of the data to gain better insights, and the
generation of models.
Some of the ideas of EDA and their importance to teaching statistics were
discussed by Chatfield [1985]. He called the topic initial data analysis or
IDA. While Chatfield agrees with the EDA emphasis on starting with the
noninferential approach in data analysis, he also stresses the need to look
at how the data were collected and what the objectives of the analysis are,
and to use EDA/IDA as part of an integrated approach to statistical inference.
Hoaglin [1982] provides a summary of EDA in the Encyclopedia of Statistical
Sciences. He describes EDA as the “flexible searching for clues and evidence”
and confirmatory data analysis as “evaluating the available evidence.” In his
summary, he states that EDA encompasses four themes: resistance, residuals,
re-expression, and display.
Resistant data analysis pertains to those methods where an arbitrary
change in a data point or small subset of the data yields a small change in the
result. A related idea is robustness, which has to do with how sensitive an
analysis is to departures from the assumptions of an underlying probabilistic
model.
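As a minimal illustration of resistance, consider the following sketch. It is our own example written in Python, not code from the EDA Toolbox, and the data values are made up. It arbitrarily changes one observation and compares how far the mean and the median move in response.

```python
from statistics import mean, median

# Hypothetical sample, made up for illustration only.
data = [2.1, 2.4, 2.5, 2.7, 3.0]

# Arbitrarily change a single observation.
corrupted = data.copy()
corrupted[-1] = 300.0

# Change in each summary caused by the single altered point.
mean_shift = abs(mean(corrupted) - mean(data))
median_shift = abs(median(corrupted) - median(data))

print(f"shift in mean:   {mean_shift:.2f}")
print(f"shift in median: {median_shift:.2f}")
```

In this small sample the median does not move at all, while the mean is pulled far from the bulk of the data. That is the sense in which the median is a resistant summary and the mean is not.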
Residuals are what we have left over after a summary or fitted model has
been subtracted out. We can write this as
residual = data - fit.
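To make the idea of data minus fit concrete, here is a minimal Python sketch of our own (again, not code from the book's toolbox). It assumes that the fit is simply the sample median of some made-up values; any other summary or fitted model could take its place.

```python
from statistics import median

# Hypothetical observations, made up for illustration only.
data = [4.0, 5.5, 6.0, 7.5, 12.0]

# Assumption: the "fit" is the simplest possible summary, the sample median.
fit = median(data)

# residual = data - fit, computed for each observation.
residuals = [x - fit for x in data]

print(residuals)
```

The largest residual belongs to the observation that sits farthest from the summary, which is why residuals are a natural place to look for structure that the fit has missed.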
charts, Andrews’ curves, and Andrews’ images. The ability to interact with
the plot to uncover structure or patterns is important, and we present some
of the standard methods such as linking and brushing. We also connect both
sections by revisiting the idea of the grand tour and show how that can be
implemented with Andrews’ curves and parallel coordinate plots.
We realize that other topics can be considered part of EDA, such as
descriptive statistics, outlier detection, robust data analysis, probability
density estimation, and residual analysis. However, these topics are beyond
the scope of this book. Descriptive statistics are covered in introductory
statistics texts, and since we assume that readers are familiar with this subject
matter, there is no need to provide explanations here. Similarly, we do not
emphasize residual analysis as a stand-alone subject, mostly because this is
widely discussed in other books on regression and multivariate analysis.
We do cover some density estimation, such as model-based clustering
(Chapter 6) and histograms (Chapter 9). The reader is referred to Scott [1992]
for an excellent treatment of the theory and methods of multivariate density
estimation in general or Silverman [1986] for kernel density estimation. For
more information on MATLAB implementations of density estimation the
reader can refer to Martinez and Martinez [2007]. Finally, we will likely
encounter outlier detection as we go along in the text, but this topic, along
with robust statistics, will not be covered as a stand-alone subject. There are
several books on outlier detection and robust statistics. These include
Hoaglin, Mosteller, and Tukey [1983], Huber [1981], and Rousseeuw and
Leroy [1987]. A rather dated paper on the topic is Hogg [1974].
We use MATLAB® throughout the book to illustrate the ideas and to show
how they can be implemented in software. Much of the code used in the
examples and to create the figures is freely available, either as part of the
downloadable toolbox included with the book or on other internet sites. This
information will be discussed in more detail in Appendix B. For MATLAB
product information, please contact:
To get the most out of this book, readers should have a basic understanding
of matrix algebra. For example, one should be familiar with determinants, a
matrix transpose, the trace of a matrix, etc. We recommend Strang [1988,
1993] for those who need to refresh their memories on the topic. We do not
use any calculus in this book, but a solid understanding of algebra is always
useful in any situation. We expect readers to have knowledge of the basic
concepts in probability and statistics, such as random samples, probability
distributions, hypothesis testing, and regression.
1 The notation m × n is read “m by n,” and it means that we have m rows and n columns in an
array. It will be clear from the context whether this indicates matrix dimensions or
multiplication.