Exploratory Multivariate Analysis
by Example Using R
The interface between the computer and statistical sciences is increasing, as each discipline
seeks to harness the power and resources of the other. This series aims to foster the integration
between the computer sciences and statistical, numerical, and probabilistic methods by
publishing a broad range of reference works, textbooks, and handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
François Husson
Sébastien Lê
Jérôme Pagès
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made
to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all
materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all
material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in
any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that
have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Husson, François.
Exploratory multivariate analysis by example using R / François Husson, Sébastien Lê, Jérôme
Pagès.
p. cm. -- (Chapman & Hall/CRC computer science & data analysis)
Summary: “An introduction to exploratory techniques for multivariate data analysis, this book
covers the key methodology, including principal components analysis, correspondence analysis,
mixed models, and multiple factor analysis. The authors take a practical approach, with examples
leading the discussion of the methods and many graphics to emphasize visualization. They present
the concepts in the most intuitive way possible, keeping mathematical content to a minimum
or relegating it to the appendices. The book includes examples that use real data from a range of
scientific disciplines and implemented using an R package developed by the authors.”-- Provided
by publisher.
Includes bibliographical references and index.
ISBN 978-1-4398-3580-7 (hardback)
1. Multivariate analysis. 2. R (Computer program language) I. Lê, Sébastien. II. Pagès, Jérôme. III.
Title. IV. Series.
QA278.H87 2010
519.5’3502855133--dc22 2010040339
Preface
4 Clustering
  4.1 Data — Issues
  4.2 Formalising the Notion of Similarity
    4.2.1 Similarity between Individuals
      4.2.1.1 Distances and Euclidean Distances
      4.2.1.2 Example of Non-Euclidean Distance
      4.2.1.3 Other Euclidean Distances
      4.2.1.4 Similarities and Dissimilarities
    4.2.2 Similarity between Groups of Individuals
  4.3 Constructing an Indexed Hierarchy
    4.3.1 Classic Agglomerative Algorithm
    4.3.2 Hierarchy and Partitions
  4.4 Ward’s Method
    4.4.1 Partition Quality
    4.4.2 Agglomeration According to Inertia
    4.4.3 Two Properties of the Agglomeration Criterion
    4.4.4 Analysing Hierarchies, Choosing Partitions
  4.5 Direct Search for Partitions: K-means Algorithm
    4.5.1 Data — Issues
    4.5.2 Principle
    4.5.3 Methodology
  4.6 Partitioning and Hierarchical Clustering
    4.6.1 Consolidating Partitions
    4.6.2 Mixed Algorithm
  4.7 Clustering and Principal Component Methods
    4.7.1 Principal Component Methods Prior to AHC
    4.7.2 Simultaneous Analysis of a Principal Component Map and Hierarchy
  4.8 Example: The Temperature Dataset
    4.8.1 Data Description — Issues
    4.8.2 Analysis Parameters
    4.8.3 Implementation of the Analysis
  4.9 Example: The Tea Dataset
    4.9.1 Data Description — Issues
    4.9.2 Constructing the AHC
    4.9.3 Defining the Clusters
  4.10 Dividing Quantitative Variables into Classes
Appendix
  A.1 Percentage of Inertia Explained by the First Component or by the First Plane
  A.2 R Software
    A.2.1 Introduction
    A.2.2 The Rcmdr Package
    A.2.3 The FactoMineR Package
Bibliography
Index
1.2 Objectives
The data table can be considered either as a set of rows (individuals) or as a
set of columns (variables), thus raising a number of questions relating to these
different types of objects.
TABLE 1.1
Some Examples of Datasets
Field       Individuals     Variables                     $x_{ik}$
Ecology     Rivers          Concentration of pollutants   Concentration of pollutant k in river i
Economics   Years           Economic indicators           Indicator value k for year i
Genetics    Patients        Genes                         Expression of gene k for patient i
Marketing   Brands          Measures of satisfaction      Value of measure k for brand i
Pedology    Soils           Granulometric composition     Content of component k in soil i
Biology     Animals         Measurements                  Measure k for animal i
Sociology   Social classes  Time by activity              Time spent on activity k by individuals from social class i
TABLE 1.2
The Orange Juice Data
                 Odour      Odour       Pulp   Intensity  Acidity  Bitterness  Sweetness
                 intensity  typicality         of taste
Pampryl amb.     2.82       2.53        1.66   3.46       3.15     2.97        2.60
Tropicana amb.   2.76       2.82        1.91   3.23       2.55     2.08        3.32
Fruvita fr.      2.83       2.88        4.00   3.45       2.42     1.76        3.38
Joker amb.       2.76       2.59        1.66   3.37       3.05     2.56        2.80
Tropicana fr.    3.20       3.02        3.69   3.12       2.33     1.97        3.34
Pampryl fr.      3.07       2.73        3.34   3.54       3.31     2.63        2.90
[Figure 1.1: three scatterplot panels, each plotting variable k against variable j.]
FIGURE 1.1
Representation of 40 individuals described by two variables: j and k.
linked to both groups. In the example, each group can be represented by one
single variable as the variables within each group are very strongly correlated.
We refer to these variables as synthetic variables.
[Figure 1.2: six scatterplot panels. A: variable k against j; B: variable l against j; C: variable l against k; D: variable m against j; E: variable m against k; F: variable m against l.]
FIGURE 1.2
Representation of the relationships between four variables: j, k, l, and m,
taken two-by-two.
TABLE 1.3
Orange Juice Data: Correlation Matrix
                    Odour      Odour       Pulp    Intensity  Acidity  Bitterness  Sweetness
                    intensity  typicality          of taste
Odour intensity      1.00       0.58        0.66   −0.27      −0.15    −0.15        0.23
Odour typicality     0.58       1.00        0.77   −0.62      −0.84    −0.88        0.92
Pulp content         0.66       0.77        1.00   −0.02      −0.47    −0.64        0.63
Intensity of taste  −0.27      −0.62       −0.02    1.00       0.73     0.51       −0.57
Acidity             −0.15      −0.84       −0.47    0.73       1.00     0.91       −0.90
Bitterness          −0.15      −0.88       −0.64    0.51       0.91     1.00       −0.98
Sweetness            0.23       0.92        0.63   −0.57      −0.90    −0.98        1.00
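As a rough check on how the entries of such a matrix arise, the sketch below recomputes two coefficients from the columns of Table 1.2. It is written in plain Python rather than the R used by the book; the function name `pearson` and the variable names are illustrative choices, not taken from the source.

```python
import math

# Three columns of Table 1.2, in row order:
# Pampryl amb., Tropicana amb., Fruvita fr., Joker amb., Tropicana fr., Pampryl fr.
acidity = [3.15, 2.55, 2.42, 3.05, 2.33, 3.31]
bitterness = [2.97, 2.08, 1.76, 2.56, 1.97, 2.63]
sweetness = [2.60, 3.32, 3.38, 2.80, 3.34, 2.90]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of the centred values.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_bitter_sweet = pearson(bitterness, sweetness)
r_acid_bitter = pearson(acidity, bitterness)
print(round(r_bitter_sweet, 2), round(r_acid_bitter, 2))
```

The two printed values should agree with the Bitterness/Sweetness and Acidity/Bitterness entries of Table 1.3 (−0.98 and 0.91) after rounding to two decimals.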
If two individuals have similar values across all K variables in the table, they
are also close in the space $\mathbb{R}^K$. The study of the data table can thus be
conducted geometrically by studying the distances between individuals. We
are therefore interested in the set of all individuals in $\mathbb{R}^K$, that is, the cloud
of individuals (denoted $N_I$). Analysing the distances between individuals is
tantamount to studying the shape of this cloud of points. Figure 1.3
illustrates a cloud of points within a space $\mathbb{R}^K$ for K = 3.
FIGURE 1.3
Flight of a flock of starlings illustrating a scatterplot in $\mathbb{R}^K$.
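To make the geometric reading concrete, the following sketch computes Euclidean distances between three rows of Table 1.2, treated as points in a seven-dimensional space. It uses the raw (unstandardised) values purely for illustration; the helper name `euclidean` is hypothetical.

```python
import math

# Rows of Table 1.2; the seven values follow the table's column order.
pampryl_amb = [2.82, 2.53, 1.66, 3.46, 3.15, 2.97, 2.60]
joker_amb = [2.76, 2.59, 1.66, 3.37, 3.05, 2.56, 2.80]
fruvita_fr = [2.83, 2.88, 4.00, 3.45, 2.42, 1.76, 3.38]


def euclidean(a, b):
    """Euclidean distance between two individuals described by K variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_near = euclidean(pampryl_amb, joker_amb)   # similar profiles
d_far = euclidean(pampryl_amb, fruvita_fr)   # dissimilar profiles
print(round(d_near, 3), round(d_far, 3))
```

Pampryl amb. and Joker amb. have similar values on every descriptor and end up close together, whereas Pampryl amb. and Fruvita fr. differ strongly (notably on pulp) and end up far apart.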
The shape of cloud $N_I$ remains the same even when translated. The data
are also centred, which corresponds to considering $x_{ik} - \bar{x}_k$ rather than $x_{ik}$.
Geometrically, this amounts to making the centre of mass of the cloud,
$G_I$ (with coordinates $\bar{x}_k$ for $k = 1, ..., K$), coincide with the origin of reference (see
Figure 1.4). Centring presents technical advantages and is always conducted
in PCA.
The operation of reduction (also referred to as standardising), which consists
of considering $(x_{ik} - \bar{x}_k)/s_k$ rather than $x_{ik}$, modifies the shape of the
cloud by harmonising its variability in all the directions of the original vectors
(i.e., the K variables). Geometrically, it means choosing the standard deviation
$s_k$ as the unit of measurement in direction k. This operation is essential if the
variables are not expressed in the same units. Even when the units of measurement
do not differ, this operation is generally preferable as it attaches
the same importance to each variable. Therefore, we will assume this to be
the case from here on. PCA is said to be standardised when the variables are
centred and reduced, and unstandardised when the variables are only
centred. When not otherwise specified, it may be assumed that we are using
standardised PCA.
FIGURE 1.4
Scatterplot of the individuals in $\mathbb{R}^K$.
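The two operations can be sketched on the data of Table 1.2 as follows. This is a minimal plain-Python illustration, not the book's R code; the function name `standardise` is hypothetical, and the standard deviation is computed with divisor I (the number of individuals), which is an assumption since the excerpt does not state the divisor.

```python
import math

# Table 1.2: rows are the six juices, columns the seven sensory descriptors.
juices = [
    [2.82, 2.53, 1.66, 3.46, 3.15, 2.97, 2.60],  # Pampryl amb.
    [2.76, 2.82, 1.91, 3.23, 2.55, 2.08, 3.32],  # Tropicana amb.
    [2.83, 2.88, 4.00, 3.45, 2.42, 1.76, 3.38],  # Fruvita fr.
    [2.76, 2.59, 1.66, 3.37, 3.05, 2.56, 2.80],  # Joker amb.
    [3.20, 3.02, 3.69, 3.12, 2.33, 1.97, 3.34],  # Tropicana fr.
    [3.07, 2.73, 3.34, 3.54, 3.31, 2.63, 2.90],  # Pampryl fr.
]


def standardise(table):
    """Centre each column, then divide it by its standard deviation."""
    n_rows, n_cols = len(table), len(table[0])
    means = [sum(row[k] for row in table) / n_rows for k in range(n_cols)]
    # Assumption: divisor n_rows (not n_rows - 1) for the standard deviation.
    sds = [
        math.sqrt(sum((row[k] - means[k]) ** 2 for row in table) / n_rows)
        for k in range(n_cols)
    ]
    return [[(row[k] - means[k]) / sds[k] for k in range(n_cols)] for row in table]


z = standardise(juices)
col_means = [sum(row[k] for row in z) / len(z) for k in range(len(z[0]))]
col_vars = [sum(row[k] ** 2 for row in z) / len(z) for k in range(len(z[0]))]
```

After the transformation every column has mean 0 (the cloud's centre of mass sits at the origin) and variance 1 (each variable contributes equally to the distances).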
Comment: Weighting Individuals
So far we have assumed that all individuals have the same weight. This holds
in almost all applications and is assumed throughout. Nevertheless,
generalising to arbitrary weights poses no conceptual or practical
problems (a double weight is equivalent to two identical individuals), and most
software packages, including FactoMineR, allow for this possibility (FactoMineR
is a package dedicated to Factor Analysis and Data Mining with R; see
Section A.2.3 in the Appendix). For example, it may be useful to assign a different
weight to each individual after having rectified a sample. In all cases, it is
convenient to consider that the sum of the weights is equal to 1. If the individuals
are all given the same weight, each is assigned a weight of 1/I.
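The equivalence between a double weight and a duplicated individual can be checked on a single variable. The sketch below uses the odour typicality column of Table 1.2; the function name `weighted_mean` is illustrative, and the weights are rescaled so that they sum to 1.

```python
def weighted_mean(values, weights):
    """Mean of values under weights rescaled so that they sum to 1."""
    total = sum(weights)
    p = [w / total for w in weights]
    return sum(pi * v for pi, v in zip(p, values))


# Odour typicality column of Table 1.2 (six juices).
typicality = [2.53, 2.82, 2.88, 2.59, 3.02, 2.73]

# Equal weights: each of the I = 6 individuals receives 1/I.
equal = weighted_mean(typicality, [1] * 6)

# Doubling the weight of the first individual...
double_first = weighted_mean(typicality, [2, 1, 1, 1, 1, 1])

# ...gives the same centre of gravity as adding an identical copy of it.
duplicated = weighted_mean([typicality[0]] + typicality, [1] * 7)
```

Here `double_first` and `duplicated` agree up to floating-point error, which is the sense in which a double weight stands for two identical individuals.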
the distances are less distorted and the representations take up more space
on the image. The image is a projection of a three-dimensional object in a
two-dimensional space.
FIGURE 1.5
Two-dimensional representations of fruits: from left to right, an avocado, a
melon and a banana. Each row corresponds to a different representation.
The convention for notation uses mechanical terms: O is the centre of gravity,
$OH_i$ is a vector and the criterion is the inertia of the projection of $N_I$. The
criterion, which consists of maximising the variance of the projected points, is
perfectly appropriate.
Remark
If the individuals are weighted with different weights $p_i$, the maximised
criterion is $\sum_{i=1}^{I} p_i \, OH_i^2$.
In some rare cases, it might be interesting to search for the best axial
representation of cloud $N_I$ alone. This best axis is obtained in the same way:
find the component $u_1$ for which $\sum_{i=1}^{I} OH_i^2$ is maximum (where $H_i$ is the
projection of i on $u_1$). It can be shown that plane P contains component $u_1$ (the
“best” plane contains the “best” component): in this case, these representations
are said to be nested. An illustration of this property is presented in
Figure 1.6. Planets, which are in a three-dimensional space, are traditionally
represented on a component. This component determines their positions as
well as possible in terms of their distances from one another (in terms of inertia
of the projected cloud). We can also represent planets on a plane according
to the same principle: to maximise the inertia of the projected scatterplot
(on the plane). This best plane representation also contains the best axial
representation.
[Figure 1.6: the Sun and the planets (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto) positioned along a single axis and on a plane.]
FIGURE 1.6
The best axial representation is nested in the best plane representation of the
solar system (18 February 2008).
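The search for the best axis can be sketched numerically on the standardised orange juice data of Table 1.2. The code below is an illustration under stated assumptions, not the book's procedure: it uses power iteration (a generic method swapped in here) on the correlation matrix to approximate $u_1$, equal weights $p_i = 1/I$, and a standard deviation with divisor I. All names are hypothetical.

```python
import math

juices = [
    [2.82, 2.53, 1.66, 3.46, 3.15, 2.97, 2.60],  # Pampryl amb.
    [2.76, 2.82, 1.91, 3.23, 2.55, 2.08, 3.32],  # Tropicana amb.
    [2.83, 2.88, 4.00, 3.45, 2.42, 1.76, 3.38],  # Fruvita fr.
    [2.76, 2.59, 1.66, 3.37, 3.05, 2.56, 2.80],  # Joker amb.
    [3.20, 3.02, 3.69, 3.12, 2.33, 1.97, 3.34],  # Tropicana fr.
    [3.07, 2.73, 3.34, 3.54, 3.31, 2.63, 2.90],  # Pampryl fr.
]
I, K = len(juices), len(juices[0])

# Centre and reduce each variable (divisor I is an assumption).
means = [sum(row[k] for row in juices) / I for k in range(K)]
sds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in juices) / I) for k in range(K)]
z = [[(row[k] - means[k]) / sds[k] for k in range(K)] for row in juices]

# Correlation matrix (1/I) Z'Z; its leading eigenvector is the best axis.
corr = [[sum(z[i][j] * z[i][k] for i in range(I)) / I for k in range(K)] for j in range(K)]

# Power iteration: repeatedly apply the matrix and renormalise.
u = [1.0] + [0.0] * (K - 1)
for _ in range(200):
    v = [sum(corr[j][k] * u[k] for k in range(K)) for j in range(K)]
    norm = math.sqrt(sum(c * c for c in v))
    u = [c / norm for c in v]


def projected_inertia(points, axis):
    """Sum over individuals of p_i * OH_i^2 with equal weights p_i = 1/I."""
    return sum(sum(x * a for x, a in zip(row, axis)) ** 2 for row in points) / len(points)


projected = projected_inertia(z, u)
total = sum(sum(x * x for x in row) for row in z) / I  # equals K after standardising
share = projected / total

# For comparison: projecting on an original variable axis keeps inertia 1 out of K.
first_variable_axis = [1.0] + [0.0] * (K - 1)
baseline = projected_inertia(z, first_variable_axis)
print(round(100 * share, 2), round(100 * baseline / total, 2))
```

The share obtained on the fitted axis should be close to the 67.77% reported for the first dimension in Figure 1.7, while any single original variable retains only 1/7 of the total inertia.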
Remark
When variables are centred but not standardised, the matrix to be
diagonalised is the variance–covariance matrix.
1.3.2.4 Example
The distance between two orange juices is calculated using their seven sensory
descriptors. We decided to standardise the data to give each descriptor
equal influence. Figure 1.7 is obtained from the first two components of the
PCA and corresponds to the best plane for representing the cloud of individuals
in terms of projected inertia. The inertia projected on the plane is the sum
of the first two eigenvalues, that is, 86.82% (= 67.77% + 19.05%) of the total inertia
of the cloud of points.
The first principal component, that is, the principal axis of variability
between the orange juices, separates the two orange juices Tropicana fr. and
Pampryl amb. According to Table 1.2, these orange juices are the most extreme
in terms of odour typicality and strongly opposed in terms of
bitterness: Tropicana fr. is the most typical and one of the least bitter (only
Fruvita fr. is less bitter), while Pampryl amb. is the least typical and the most
bitter. The second component, that
is, the property that separates the orange juices most significantly once the
[Figure 1.7: the six orange juices (Pampryl fr., Tropicana fr., Fruvita fr., Pampryl amb., Joker amb., Tropicana amb.) plotted on Dim 1 (67.77%), horizontal, and Dim 2 (19.05%), vertical.]
FIGURE 1.7
Orange juice data: plane representation of the scatterplot of individuals.
TABLE 1.4
Orange Juice Data: Correlation between Variables and First Two Components
                     F1      F2
Odour intensity      0.46    0.75
Odour typicality     0.99    0.13
Pulp content         0.72    0.62
Intensity of taste  −0.65    0.43
Acidity             −0.91    0.35
Bitterness          −0.93    0.19
Sweetness            0.95   −0.16
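The correlations in Table 1.4 can be tied back to the percentages of inertia quoted for the two dimensions. The sketch below relies on a standard property of standardised PCA that is not stated in this excerpt: the eigenvalue of a component equals the sum of the squared correlations between the variables and that component, and the total inertia equals the number of variables K. The variable names are illustrative.

```python
# Table 1.4: correlation of each variable with F1 and F2.
correlations = {
    "Odour intensity": (0.46, 0.75),
    "Odour typicality": (0.99, 0.13),
    "Pulp content": (0.72, 0.62),
    "Intensity of taste": (-0.65, 0.43),
    "Acidity": (-0.91, 0.35),
    "Bitterness": (-0.93, 0.19),
    "Sweetness": (0.95, -0.16),
}
K = len(correlations)

# Eigenvalue of each component = sum of squared correlations (standardised PCA).
eig1 = sum(f1 ** 2 for f1, _ in correlations.values())
eig2 = sum(f2 ** 2 for _, f2 in correlations.values())

# Percentage of inertia = eigenvalue / total inertia, with total inertia = K.
pct1 = 100 * eig1 / K
pct2 = 100 * eig2 / K
print(f"{pct1:.2f}% + {pct2:.2f}% = {pct1 + pct2:.2f}%")
```

The results land close to the 67.77% and 19.05% reported for the two dimensions; the small gap comes from the two-decimal rounding of the correlations in the table.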
[Figure 1.8: the seven variables plotted by their correlations with the first two dimensions, with Dimension 2 (19.05%) on the vertical axis; the coordinates 0.72 and 0.62 of pulp content (labelled Pulpiness) are marked.]
FIGURE 1.8
Orange juice data: visualisation of the correlation coefficients between
variables and the principal components F1 and F2.