Introduction to Algorithms for Data Mining and Machine Learning
Xin-She Yang
Middlesex University
School of Science and Technology
London, United Kingdom
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2019 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or
mechanical, including photocopying, recording, or any information storage and retrieval system, without
permission in writing from the publisher. Details on how to seek permission, further information about the
Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center
and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other
than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments described herein. In using such information or methods
they should be mindful of their own safety and the safety of others, including parties for whom they have a
professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-817216-2
Contents

1 Introduction to optimization
1.1 Algorithms
1.1.1 Essence of an algorithm
1.1.2 Issues with algorithms
1.1.3 Types of algorithms
1.2 Optimization
1.2.1 A simple example
1.2.2 General formulation of optimization
1.2.3 Feasible solution
1.2.4 Optimality criteria
1.3 Unconstrained optimization
1.3.1 Univariate functions
1.3.2 Multivariate functions
1.4 Nonlinear constrained optimization
1.4.1 Penalty method
1.4.2 Lagrange multipliers
1.4.3 Karush–Kuhn–Tucker conditions
1.5 Notes on software
2 Mathematical foundations
2.1 Convexity
2.1.1 Linear and affine functions
2.1.2 Convex functions
2.1.3 Mathematical operations on convex functions
2.2 Computational complexity
2.2.1 Time and space complexity
2.2.2 Complexity of algorithms
2.3 Norms and regularization
2.3.1 Norms
2.3.2 Regularization
2.4 Probability distributions
2.4.1 Random variables
2.4.2 Probability distributions
2.4.3 Conditional probability and Bayesian rule
2.4.4 Gaussian process
2.5 Bayesian network and Markov models
2.6 Monte Carlo sampling
2.6.1 Markov chain Monte Carlo
2.6.2 Metropolis–Hastings algorithm
2.6.3 Gibbs sampler
2.7 Entropy, cross entropy, and KL divergence
2.7.1 Entropy and cross entropy
2.7.2 KL divergence
2.8 Fuzzy rules
2.9 Data mining and machine learning
2.9.1 Data mining
2.9.2 Machine learning
2.10 Notes on software
3 Optimization algorithms
3.1 Gradient-based methods
3.1.1 Newton's method
3.1.2 Newton's method for multivariate functions
3.1.3 Line search
3.2 Variants of gradient-based methods
3.2.1 Stochastic gradient descent
3.2.2 Subgradient method
3.2.3 Conjugate gradient method
3.3 Optimizers in deep learning
3.4 Gradient-free methods
3.5 Evolutionary algorithms and swarm intelligence
3.5.1 Genetic algorithm
3.5.2 Differential evolution
3.5.3 Particle swarm optimization
3.5.4 Bat algorithm
3.5.5 Firefly algorithm
3.5.6 Cuckoo search
3.5.7 Flower pollination algorithm
3.6 Notes on software
Bibliography
Index
About the author
Xin-She Yang obtained his PhD in Applied Mathematics from the University of Oxford. He then worked at Cambridge University and the National Physical Laboratory (UK) as a Senior Research Scientist. He is now a Reader at Middlesex University London and an elected Bye-Fellow at Cambridge University.
He is also the IEEE Computational Intelligence Society (CIS) Chair for the Task Force on Business Intelligence and Knowledge Management, Director of the International Consortium for Optimization and Modelling in Science and Industry (iCOMSI), and an Editor of Springer's book series Springer Tracts in Nature-Inspired Computing (STNIC).
With more than 20 years of research and teaching experience, he has authored 10 books and edited more than 15 books. He has published more than 200 research papers in international peer-reviewed journals and conference proceedings, with more than 36,800 citations, and he has been on the prestigious lists of Clarivate Analytics and Web of Science highly cited researchers in 2016, 2017, and 2018. He serves on the editorial boards of many international journals, including the International Journal of Bio-Inspired Computation, Elsevier's Journal of Computational Science (JoCS), the International Journal of Parallel, Emergent and Distributed Systems, and the International Journal of Computer Mathematics. He is also the Editor-in-Chief of the International Journal of Mathematical Modelling and Numerical Optimisation.
Preface
Both data mining and machine learning are becoming popular subjects for university courses and industrial applications. This popularity is partly driven by the Internet and social media because they generate a huge amount of data every day, and the understanding of such big data requires sophisticated data mining techniques. In addition, many applications such as facial recognition and robotics have extensively used machine learning algorithms, leading to the increasing popularity of artificial intelligence. From a more general perspective, both data mining and machine learning are closely related to optimization. After all, in many applications we have to minimize costs, errors, energy consumption, and environmental impact and to maximize sustainability, productivity, and efficiency. Many problems in data mining and machine learning are therefore formulated as optimization problems so that they can be solved by optimization algorithms, and optimization techniques are thus closely linked to many techniques in data mining and machine learning.
Courses on data mining, machine learning, and optimization are often compulsory for students studying computer science, management science, engineering design, operations research, data science, finance, and economics. All students have to develop a certain level of data modeling skills so that they can process and interpret data for classification, clustering, curve-fitting, and prediction. They should also be familiar with machine learning techniques that are closely related to data mining so as to carry out problem solving in many real-world applications. This book provides an introduction to all the major topics for such courses, covering the essential ideas of all key algorithms and techniques for data mining, machine learning, and optimization.
Though there are over a dozen good books on such topics, most of these books are either too specialized, with a specific readership, or too lengthy (often over 500 pages). This book fills the gap with a compact and concise approach, focusing on the key concepts, algorithms, and techniques at an introductory level. The main approach of this book is informal, theorem-free, and practical. The informal approach covers all fundamental topics required for data mining and machine learning, so the readers can gain a basic knowledge of all the important algorithms with a focus on their key ideas, without worrying about tedious, rigorous mathematical proofs. In addition, the practical approach provides about 30 worked examples so that the readers can see how each step of the algorithms and techniques works. Thus, the readers can build their understanding and confidence gradually, in a step-by-step manner. Furthermore, with the minimal requirements of basic high-school mathematics and some basic calculus, such an informal and practical style also enables the readers to learn the contents by self-study and at their own pace.
This book is suitable for undergraduates and graduates who wish to rapidly develop the fundamental knowledge of data mining, machine learning, and optimization. It can also be used by students and researchers as a reference to review and refresh their knowledge in data mining, machine learning, optimization, computer science, and data science.
Xin-She Yang
January 2019 in London
Acknowledgments
I would like to thank all my students and colleagues who have given valuable feedback and comments on some of the contents and examples of this book. I would also like to thank my editors, J. Scott Bentley and Michael Lutz, and the staff at Elsevier for their professionalism. Last but not least, I thank my family for all the help and support.
Xin-She Yang
January 2019
1 Introduction to optimization

Contents
1.1 Algorithms
1.1.1 Essence of an algorithm
1.1.2 Issues with algorithms
1.1.3 Types of algorithms
1.2 Optimization
1.2.1 A simple example
1.2.2 General formulation of optimization
1.2.3 Feasible solution
1.2.4 Optimality criteria
1.3 Unconstrained optimization
1.3.1 Univariate functions
1.3.2 Multivariate functions
1.4 Nonlinear constrained optimization
1.4.1 Penalty method
1.4.2 Lagrange multipliers
1.4.3 Karush–Kuhn–Tucker conditions
1.5 Notes on software
This book introduces the fundamental concepts and algorithms related to optimization, data mining, and machine learning. The main requirement is some understanding of high-school mathematics and basic calculus; however, we will review and introduce some of the mathematical foundations in the first two chapters.
1.1 Algorithms

An algorithm is an iterative, step-by-step procedure for computation. The detailed procedure can be a simple description, an equation, or a series of descriptions in combination with equations. Finding the roots of a polynomial, checking whether a natural number is prime, and generating random numbers are all tasks that can be carried out by algorithms.
Example 1
A simple algorithm for estimating the square root of $a > 0$ is the iterative formula
$x_{k+1} = \frac{1}{2}\Big(x_k + \frac{a}{x_k}\Big).$ (1.1)
As an example, if $x_0 = 1$ and $a = 4$, then we have
$x_1 = \frac{1}{2}\Big(1 + \frac{4}{1}\Big) = 2.5.$ (1.2)
Similarly, we have
$x_2 = \frac{1}{2}\Big(2.5 + \frac{4}{2.5}\Big) = 2.05, \qquad x_3 = \frac{1}{2}\Big(2.05 + \frac{4}{2.05}\Big) \approx 2.000610,$ (1.3)
$x_4 \approx 2.00000009,$ (1.4)
which is very close to the true value of $\sqrt{4} = 2$. The accuracy of this iterative formula or algorithm is high because it achieves an accuracy of about seven decimal places after just four iterations.
The convergence is very quick if we start from a different initial value such as $x_0 = 10$ or even $x_0 = 100$. However, for an obvious reason, we cannot start with $x_0 = 0$ due to division by zero.
Finding $\sqrt{a}$ is equivalent to solving the equation
$f(x) = x^2 - a = 0,$ (1.5)
which is again equivalent to finding the roots of the polynomial $f(x)$. We know that Newton's root-finding algorithm can be written as
$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)},$ (1.6)
where $f'(x)$ is the first derivative or gradient of $f(x)$. In this case, we have $f'(x) = 2x$, and thus Newton's formula becomes
$x_{k+1} = x_k - \frac{x_k^2 - a}{2x_k},$ (1.7)
which can be rearranged to give the iterative formula (1.1).
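As a quick illustration, the iteration (1.7) can be coded in a few lines of Python. The following is a minimal sketch (the function name and printout are illustrative):

def newton_sqrt(a, x0=1.0, n_iter=5):
    """Iterate x_{k+1} = x_k - (x_k^2 - a)/(2 x_k), that is, Eq. (1.7)."""
    x = x0
    for k in range(1, n_iter + 1):
        x = x - (x**2 - a) / (2 * x)   # same as x = (x + a/x) / 2
        print(f"x_{k} = {x:.10f}")
    return x

newton_sqrt(4.0)   # the iterates converge rapidly to 2.0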
1.2 Optimization

1.2.1 A simple example

As a simple example, consider the design of a cylindrical container with radius $r$ and height $h$, where the objective is to minimize the surface area $S$ (and thus the cost of materials) for a given volume
$V = \pi r^2 h.$ (1.12)
There are only two design variables, $r$ and $h$, and one objective function $S$ to be minimized. Obviously, if there were no capacity constraint, then we could choose not to build the container, and the cost of materials would be zero for $r = 0$ and $h = 0$. However, the constraint requirement means that we have to build a container with the fixed volume $V_0 = \pi r^2 h = 10\ \text{m}^3$. Therefore, this optimization problem can be written as
minimize $S = 2\pi r^2 + 2\pi r h,$ (1.13)
subject to
$\pi r^2 h = V_0 = 10.$ (1.14)
To solve this problem, we can first use the equality constraint to reduce the number of design variables by solving for $h$. So we have
$h = \frac{V_0}{\pi r^2}.$ (1.15)
Substituting it into (1.13), we get
$S = 2\pi r^2 + 2\pi r h = 2\pi r^2 + 2\pi r \frac{V_0}{\pi r^2} = 2\pi r^2 + \frac{2V_0}{r}.$ (1.16)
This is a univariate function of $r$. From basic calculus we know that the minimum or maximum can occur at a stationary point, where the first derivative is zero, that is,
$\frac{dS}{dr} = 4\pi r - \frac{2V_0}{r^2} = 0,$ (1.17)
which gives
$r^3 = \frac{V_0}{2\pi}, \quad \text{or} \quad r = \sqrt[3]{\frac{V_0}{2\pi}}.$ (1.18)
Thus, the height is
$\frac{h}{r} = \frac{V_0/(\pi r^2)}{r} = \frac{V_0}{\pi r^3} = 2.$ (1.19)
This means that the height is twice the radius: $h = 2r$. Thus, the minimum surface area is
$S_* = 2\pi r^2 + \frac{2V_0}{r} = 6\pi r^2 = \frac{6\pi}{\sqrt[3]{4\pi^2}}\, V_0^{2/3}.$ (1.20)
It is worth pointing out that this optimal solution is based on the assumption or requirement to design a cylindrical container. If we decide to use a sphere with radius $R$, we know that its volume and surface area are
$V_0 = \frac{4\pi}{3} R^3, \qquad S = 4\pi R^2.$ (1.21)
We can solve for $R$ directly:
$R^3 = \frac{3V_0}{4\pi}, \quad \text{or} \quad R = \sqrt[3]{\frac{3V_0}{4\pi}},$ (1.22)
which gives the surface area
$S = 4\pi \Big(\frac{3V_0}{4\pi}\Big)^{2/3} = \frac{4\pi\sqrt[3]{9}}{\sqrt[3]{16\pi^2}}\, V_0^{2/3}.$ (1.23)
Since $6\pi/\sqrt[3]{4\pi^2} \approx 5.5358$ and $4\pi\sqrt[3]{9}/\sqrt[3]{16\pi^2} \approx 4.83598$, we have $S < S_*$; that is, the surface area of a sphere is smaller than the minimum surface area of a cylinder with the same volume. In fact, for the same $V_0 = 10$, we have
$S(\text{sphere}) = \frac{4\pi\sqrt[3]{9}}{\sqrt[3]{16\pi^2}}\, V_0^{2/3} \approx 22.45,$ (1.24)
which is smaller than $S_* \approx 25.69$ for the cylinder.
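These numbers are easy to verify numerically. A minimal Python sketch, using only the standard library (variable names are illustrative), is:

import math

V0 = 10.0  # fixed volume in cubic meters

# Optimal cylinder from Eq. (1.18): r = (V0/(2*pi))^(1/3) and h = 2r
r = (V0 / (2 * math.pi)) ** (1 / 3)
h = 2 * r
S_cyl = 2 * math.pi * r**2 + 2 * math.pi * r * h

# Sphere of the same volume from Eq. (1.22)
R = (3 * V0 / (4 * math.pi)) ** (1 / 3)
S_sph = 4 * math.pi * R**2

print(f"cylinder: r = {r:.4f}, h = {h:.4f}, S* = {S_cyl:.2f}")  # S* ~ 25.69
print(f"sphere:   R = {R:.4f}, S  = {S_sph:.2f}")               # S  ~ 22.45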
This highlights the importance of the choice of design type (here in terms of shape) before we can do any truly useful optimization. Obviously, there are many other factors that can influence the choice of design, including the manufacturability of the design, stability of the structure, ease of installation, space availability, and so on. For a container, in most applications, a cylinder may be much easier to produce than a sphere, and thus the overall cost may be lower in practice. Though there are many factors to be considered in engineering design, for the purpose of optimization, here we focus only on the improvement and optimization of a design with well-posed mathematical formulations.
1.2.2 General formulation of optimization

In general, an optimization problem can be formulated as
minimize $f(x)$, $x \in \mathbb{R}^D$,
subject to $\phi_j(x) = 0$ $(j = 1, \ldots, M)$, $\psi_k(x) \le 0$ $(k = 1, \ldots, N)$, (1.25)
where $f(x)$, $\phi_j(x)$, and $\psi_k(x)$ are scalar functions of the design vector $x$. Here the components $x_i$ of $x = (x_1, \ldots, x_D)^T$ are called design or decision variables, and they can be continuous, discrete, or a mixture of these two. The vector $x$ is often called the decision vector, which varies in a $D$-dimensional space $\mathbb{R}^D$.
It is worth pointing out that we use a column vector here for $x$ (thus with the transpose $T$). We can also use a row vector $x = (x_1, \ldots, x_D)$, and the results will be the same. Different textbooks may use slightly different formulations. Once we are aware of such minor variations, they should cause no difficulty or confusion.
In addition, the function f (x) is called the objective function or cost function,
φj (x) are constraints in terms of M equalities, and ψk (x) are constraints written as
N inequalities. So there are M + N constraints in total. The optimization problem
formulated here is a nonlinear constrained problem. Here the inequalities $\psi_k(x) \le 0$ are written in the "less than or equal to" form; they can also be written as "greater than or equal to" via a simple transformation, multiplying both sides by $-1$.
The space spanned by the decision variables is called the search space RD , whereas
the space formed by the values of the objective function is called the objective or
response space, and sometimes the landscape. The optimization problem essentially
maps the domain RD or the space of decision variables into the solution space R (or
the real axis in general).
The objective function $f(x)$ can be either linear or nonlinear. If the constraints $\phi_j$ and $\psi_k$ are all linear, it becomes a linearly constrained problem. Furthermore, when $\phi_j$, $\psi_k$, and the objective function $f(x)$ are all linear, it becomes a linear programming problem [35]. If the objective is at most quadratic with linear constraints, then it is called a quadratic programming problem. If all the values of the decision variables can only be integers, then this type of linear programming is called integer programming or integer linear programming.
On the other hand, if no constraints are specified and thus xi can take any values
in the real axis (or any integers), then the optimization problem is referred to as an
unconstrained optimization problem.
As a very simple example of optimization problems without any constraints, we
discuss the search of the maxima or minima of a univariate function.
Figure 1.2 A simple multimodal function $f(x) = x^2 e^{-x^2}$.
Example 2
For example, to find the maximum of the univariate function
$f(x) = x^2 e^{-x^2}, \quad -\infty < x < \infty,$ (1.26)
is a simple unconstrained problem, whereas minimizing an objective $f(x_1, x_2)$
subject to
$x_1 \ge 1, \quad x_2 - 2 = 0$ (1.28)
is a simple constrained minimization problem.
It is worth pointing out that the objectives are explicitly known in all the optimiza-
tion problems to be discussed in this book. However, in reality, it is often difficult to
quantify what we want to achieve, but we still try to optimize certain things such as the
degree of enjoyment or service quality on holiday. In other cases, it may be impossible
to write the objective function in any explicit form mathematically.
From basic calculus we know that, for a given curve described by $f(x)$, its gradient $f'(x)$ describes the rate of change. When $f'(x) = 0$, the curve has a horizontal tangent at that particular point, which makes it a point of special interest. In fact, the maximum or minimum of a curve can occur only at a stationary point $x_*$ satisfying
$f'(x_*) = 0.$ (1.29)
Example 3
To find the minimum of $f(x) = x^2 e^{-x^2}$ (see Fig. 1.2), we have the stationary condition $f'(x) = 0$, that is,
$f'(x) = 2x e^{-x^2} + x^2(-2x)e^{-x^2} = 2x(1 - x^2)e^{-x^2} = 0,$
whose solutions are $x = 0$ and $x = \pm 1$. The second derivative is
$f''(x) = 2e^{-x^2}(1 - 5x^2 + 2x^4),$
and the two maxima occur at $x_* = \pm 1$ with $f_{\max} = e^{-1}$. At $x = 0$, we have $f''(0) = 2 > 0$, and thus the minimum of $f(x)$ occurs at $x_* = 0$ with $f_{\min}(0) = 0$.

Figure 1.3 (a) Feasible domain with nonlinear inequality constraints $\psi_1(x)$ and $\psi_2(x)$ (left) and linear inequality constraint $\psi_3(x)$. (b) An example with an objective of $f(x) = x^2$ subject to $x \ge 2$ (right).
Whatever the objective is, we have to evaluate it many times. In most cases, the
evaluations of the objective functions consume a substantial amount of computational
power (which costs money) and design time. Any efficient algorithm that can reduce
the number of objective evaluations saves both time and money.
In mathematical programming, there are many important concepts, and we will
first introduce a few related concepts: feasible solutions, optimality criteria, the strong
local optimum, and weak local optimum.
For example, for $f(x) = x^3$, the stationary condition gives $f'(x_*) = f'(0) = 0$. In fact, $f(x) = x^3$ has a saddle point at $x_* = 0$ because $f'(0) = 0$, and $f''(x) = 6x$ changes sign from $f''(0^+) > 0$ to $f''(0^-) < 0$ as $x$ moves from positive to negative.
Example 4
For example, to find the maximum or minimum of a univariate function $f(x)$, we first have to find its stationary points $x_*$, where the first derivative $f'(x)$ is zero. For the function considered here, the stationary points are
$x_* = -1, \quad x_* = 2, \quad x_* = 0.$
From basic calculus we know that a maximum requires $f''(x_*) \le 0$, whereas a minimum requires $f''(x_*) \ge 0$.
At $x_* = -1$, we have
$f''(x_*) = -23,$
so $x_* = -1$ corresponds to a local maximum.
The maximization of $f(x)$ can be converted to the minimization of $-f(x)$. For this reason, optimization problems can be expressed as either minimization or maximization, depending on the context and the convenience of the formulation. In fact, in the optimization literature, some books formulate all optimization problems in terms of maximization, whereas others write them in terms of minimization, though they are in essence dealing with the same problems.
1.3.2 Multivariate functions

For a multivariate function such as $f(x, y) = x^2 + y^2$, the stationary conditions are
$\frac{\partial f}{\partial x} = 2x + 0 = 0, \qquad \frac{\partial f}{\partial y} = 0 + 2y = 0,$ (1.32)
whose solution is $x_* = y_* = 0$. The second partial derivatives form the Hessian matrix
$H = \begin{pmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x \partial y \\ \partial^2 f/\partial y \partial x & \partial^2 f/\partial y^2 \end{pmatrix}.$ (1.33)
Since
$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x},$ (1.34)
we can conclude that the Hessian matrix is always symmetric. In the case of $f(x, y) = x^2 + y^2$, it is easy to check that the Hessian matrix is
$H = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}.$ (1.35)
At the stationary point $(x_*, y_*)$, let $\Delta = \det(H) = f_{xx} f_{yy} - f_{xy}^2$. If $\Delta > 0$ and $f_{xx} > 0$, then $(x_*, y_*)$ is a local minimum. If $\Delta > 0$ but $f_{xx} < 0$, then it is a local maximum. If $\Delta = 0$, then the test is inconclusive, and we have to use other information such as higher-order derivatives. However, if $\Delta < 0$, then it is a saddle point. A saddle point is a special point where a local minimum occurs along one direction, whereas a local maximum occurs along another (orthogonal) direction.
Example 5
To minimize $f(x, y) = (x - 1)^2 + x^2 y^2$, we have
$\frac{\partial f}{\partial x} = 2(x - 1) + 2x y^2 = 0, \qquad \frac{\partial f}{\partial y} = 0 + 2x^2 y = 0.$ (1.37)
The second condition gives $y = 0$ or $x = 0$. Substituting $y = 0$ into the first condition, we have $x = 1$. However, $x = 0$ does not satisfy the first condition. Therefore, we have a solution $x_* = 1$ and $y_* = 0$.
For our example with $f = (x - 1)^2 + x^2 y^2$, we have
$\frac{\partial^2 f}{\partial x^2} = 2y^2 + 2, \quad \frac{\partial^2 f}{\partial x \partial y} = 4xy, \quad \frac{\partial^2 f}{\partial y \partial x} = 4xy, \quad \frac{\partial^2 f}{\partial y^2} = 2x^2.$ (1.38)
At the stationary point $(x_*, y_*) = (1, 0)$, the Hessian matrix becomes
$H = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},$
which is positive definite because its repeated eigenvalue $2$ is positive. Alternatively, we have $\Delta = 4 > 0$ and $f_{xx} = 2 > 0$. Therefore, $(1, 0)$ is a local minimum.
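The stationary point and the Hessian in Example 5 can also be verified symbolically. The following minimal sketch assumes the SymPy library is available:

import sympy as sp

x, y = sp.symbols('x y', real=True)
f = (x - 1)**2 + x**2 * y**2

# Stationary points: solve grad f = 0
grad = [sp.diff(f, v) for v in (x, y)]
print(sp.solve(grad, [x, y], dict=True))   # -> [{x: 1, y: 0}]

# Hessian at the stationary point (1, 0)
H = sp.hessian(f, (x, y))
print(H.subs({x: 1, y: 0}))                # -> Matrix([[2, 0], [0, 2]])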
1.4.1 Penalty method

The idea of the penalty method is to transform a constrained optimization problem into an unconstrained one by incorporating the constraints into a penalized objective
$\Pi(x, \mu_i, \nu_j) = f(x) + \sum_{i=1}^{M} \mu_i \phi_i^2(x) + \sum_{j=1}^{N} \nu_j \max\{0, \psi_j(x)\}^2,$ (1.43)
where $\mu_i \gg 1$ and $\nu_j \ge 0$.
For example, let us solve the following minimization problem:
minimize $f(x) = (x - 1)^2, \quad x \in \mathbb{R}$,
subject to
$x \ge a,$
where $a$ is a given value. Obviously, without this constraint, the minimum value occurs at $x = 1$ with $f_{\min} = 0$. If $a < 1$, then the constraint will not affect the result. However, if $a > 1$, then the minimum should occur at the boundary $x = a$ (which can be obtained by inspecting or visualizing the objective function and the constraint). Now we can define a penalty function $\Pi(x)$ using a penalty parameter $\mu \gg 1$. Writing the constraint as $\psi(x) = a - x \le 0$, we have
$\Pi(x) = (x - 1)^2 + \mu \max\{0, a - x\}^2.$
Ideally, the formulation using the penalty method should be properly designed so
that the results will not depend on the penalty coefficient, or at least the dependence
should be sufficiently weak.
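To see this dependence concretely, consider a rough numerical sketch for the problem above (NumPy assumed; the grid search is only for illustration, and any one-dimensional minimizer would do):

import numpy as np

def penalty_obj(x, a, mu):
    # Pi(x) = (x - 1)^2 + mu * max(0, a - x)^2
    return (x - 1)**2 + mu * np.maximum(0.0, a - x)**2

a = 2.0                              # constraint x >= 2, so the true optimum is x = 2
xs = np.linspace(-1.0, 5.0, 60001)   # fine grid over a bracketing interval
for mu in (1, 10, 100, 1000):
    x_star = xs[np.argmin(penalty_obj(xs, a, mu))]
    print(f"mu = {mu:5d}: x* ~ {x_star:.4f}")   # approaches 2 as mu grows

For $a = 2$ the penalized minimum lies at $x = (1 + \mu a)/(1 + \mu)$, which tends to the true boundary solution $x = a$ only as $\mu \to \infty$; this is exactly the dependence on the penalty coefficient mentioned above.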
1.4.2 Lagrange multipliers

Another method for solving a constrained optimization problem is the method of Lagrange multipliers. Consider the minimization of $f(x)$ subject to the equality constraint
$h(x) = 0.$ (1.48)
Then we can combine the objective function $f(x)$ with the equality (or, more generally, $M$ equalities $h_j(x) = 0$) to form a new function, called the Lagrangian,
$L(x, \lambda_j) = f(x) + \sum_{j=1}^{M} \lambda_j h_j(x).$ (1.51)
The optimality conditions require that
$\frac{\partial L}{\partial x_i} = \frac{\partial f}{\partial x_i} + \sum_{j=1}^{M} \lambda_j \frac{\partial h_j}{\partial x_i} = 0 \quad (i = 1, \ldots, n), \qquad \frac{\partial L}{\partial \lambda_j} = h_j = 0 \quad (j = 1, \ldots, M).$ (1.52)
Example 6
For the well-known monkey surface $f(x, y) = x^3 - 3xy^2$, the function does not have a unique maximum or minimum. In fact, the point $x = y = 0$ is a saddle point. However, if we impose an extra equality $x - y^2 = 1$, we can formulate the optimization problem
minimize $f(x, y) = x^3 - 3xy^2$,
subject to
$h(x, y) = x - y^2 = 1.$
The Lagrangian becomes $L = x^3 - 3xy^2 + \lambda(x - y^2 - 1)$, and the optimality conditions are
$\frac{\partial L}{\partial x} = 3x^2 - 3y^2 + \lambda = 0, \qquad \frac{\partial L}{\partial y} = 0 - 6xy + (-2\lambda y) = 0,$
$\frac{\partial L}{\partial \lambda} = x - y^2 - 1 = 0.$
The second condition gives $y = 0$ or $\lambda = -3x$. For $y = 0$, the third condition gives $x = 1$, and then $\lambda = -3$ from the first. For $\lambda = -3x$, eliminating $\lambda$ and using $y^2 = x - 1$ lead to $(x - 1)^2 = 0$, so there is no additional solution with $y \ne 0$ in the real domain. Therefore, the optimality occurs at $(1, 0)$ with $f_{\min} = 1$.
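The optimality conditions of Example 6 can also be solved directly by a computer algebra system. A minimal sketch, assuming SymPy is available, is:

import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
L = x**3 - 3*x*y**2 + lam * (x - y**2 - 1)   # the Lagrangian of Example 6

eqs = [sp.diff(L, v) for v in (x, y, lam)]
print(sp.solve(eqs, [x, y, lam], dict=True))
# -> [{x: 1, y: 0, lambda: -3}], confirming the optimum at (1, 0)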
1.4.3 Karush–Kuhn–Tucker conditions

For a general nonlinear optimization problem
minimize $f(x)$, $x \in \mathbb{R}^n$,
subject to $\phi_i(x) = 0$ $(i = 1, \ldots, M)$, $\psi_j(x) \le 0$ $(j = 1, \ldots, N)$, (1.53)
the Karush–Kuhn–Tucker (KKT) conditions can be stated as follows. If all the functions are continuously differentiable at a local minimum $x_*$, then there exist constants $\lambda_0, \lambda_1, \ldots, \lambda_N$ and $\mu_1, \ldots, \mu_M$ such that
$\lambda_0 \nabla f(x_*) + \sum_{i=1}^{M} \mu_i \nabla \phi_i(x_*) + \sum_{j=1}^{N} \lambda_j \nabla \psi_j(x_*) = 0,$ (1.54)
$\psi_j(x_*) \le 0, \qquad \lambda_j \psi_j(x_*) = 0 \quad (j = 1, 2, \ldots, N),$ (1.55)
where $\lambda_j \ge 0$ $(j = 0, 1, \ldots, N)$, and the multipliers cannot all be zero, that is, $\sum_{j=0}^{N} \lambda_j + \sum_{i=1}^{M} |\mu_i| > 0$. This is essentially a generalized method of Lagrange multipliers. However, there is a possibility of degeneracy when $\lambda_0 = 0$ under certain conditions.
It is worth pointing out that such KKT conditions can be useful for proving theorems and sometimes for gaining insight into certain types of problems. However, they are not really helpful in practice in the sense that they do not give any indication where the optimal solutions may lie in the search domain so as to guide the search process.
Optimization problems, especially highly nonlinear multimodal problems, are usually difficult to solve. However, if we are mainly concerned with local optimal or suboptimal solutions (not necessarily global optimal solutions), there are relatively efficient methods such as interior-point methods, trust-region methods, the simplex method, sequential quadratic programming, and swarm intelligence based methods [151]. All these methods have been implemented in a diverse range of software packages. Interested readers can refer to more advanced literature.
1.5 Notes on software

A wide range of software packages and libraries is available for optimization,1 for data mining and machine learning,2 and for deep learning.3

1 https://en.wikipedia.org/wiki/List_of_optimization_software
2 https://en.wikipedia.org/wiki/Category:Data_mining_and_machine_learning_software
3 https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
2 Mathematical foundations

Contents
2.1 Convexity
2.1.1 Linear and affine functions
2.1.2 Convex functions
2.1.3 Mathematical operations on convex functions
2.2 Computational complexity
2.2.1 Time and space complexity
2.2.2 Complexity of algorithms
2.3 Norms and regularization
2.3.1 Norms
2.3.2 Regularization
2.4 Probability distributions
2.4.1 Random variables
2.4.2 Probability distributions
2.4.3 Conditional probability and Bayesian rule
2.4.4 Gaussian process
2.5 Bayesian network and Markov models
2.6 Monte Carlo sampling
2.6.1 Markov chain Monte Carlo
2.6.2 Metropolis–Hastings algorithm
2.6.3 Gibbs sampler
2.7 Entropy, cross entropy, and KL divergence
2.7.1 Entropy and cross entropy
2.7.2 KL divergence
2.8 Fuzzy rules
2.9 Data mining and machine learning
2.9.1 Data mining
2.9.2 Machine learning
2.10 Notes on software
Though the main requirement of this book is basic calculus, we will still briefly review
some basic concepts concerning functions and basic calculus and then introduce some
new concepts. The readers can skip this chapter if they are already familiar with such
topics.
2.1 Convexity

A function such as
$f(x, y) = x^2 + y^2$
depends on two independent variables. This function maps the domain $\mathbb{R}^2$ (for $-\infty < x < \infty$ and $-\infty < y < \infty$) to $f$ on the real axis as its range, so we use the notation $f : \mathbb{R}^2 \to \mathbb{R}$ to denote this.
In general, a function $f(x, y, z, \ldots)$ maps $n$ independent variables to $m$ dependent variables, and we use the notation $f : \mathbb{R}^n \to \mathbb{R}^m$ to mean that the domain of the function is a subset of $\mathbb{R}^n$, whereas its range is a subset of $\mathbb{R}^m$. The domain of a function is sometimes denoted by $\mathrm{dom}(f)$ or $\mathrm{dom}\,f$.
The inputs or independent variables can often be written as a vector. For simplicity, we often use a vector $x = (x, y, z, \ldots)^T = (x_1, x_2, \ldots, x_n)^T$ for multiple variables. Therefore, $f(x)$ is often used to mean $f(x, y, z, \ldots)$ or $f(x_1, x_2, \ldots, x_n)$.
2.1.1 Linear and affine functions

A function $L(x)$ is called linear if
$L(x + y) = L(x) + L(y) \quad \text{and} \quad L(\alpha x) = \alpha L(x)$
for any vectors $x$ and $y$ and any scalar $\alpha$.

Example 7
To see if $f(x) = f(x_1, x_2) = 2x_1 + 3x_2$ is linear, we use
$f(\alpha x + \beta y) = 2(\alpha x_1 + \beta y_1) + 3(\alpha x_2 + \beta y_2) = \alpha(2x_1 + 3x_2) + \beta(2y_1 + 3y_2) = \alpha f(x) + \beta f(y).$
Therefore, this function is indeed linear. It can also be written in the vector form
$f(x) = \begin{pmatrix} 2 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = a \cdot x = a^T x, \qquad a = \begin{pmatrix} 2 \\ 3 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.$
Thus, an affine set is always convex, but a convex set is not necessarily affine. A function $f(x)$ is called convex if
$f(\alpha x + \beta y) \le \alpha f(x) + \beta f(y),$ (2.4)
where
$\alpha \ge 0, \quad \beta \ge 0, \quad \alpha + \beta = 1.$ (2.5)
Example 8
For example, the convexity of $f(x) = x^2 - 1$ requires
$(\alpha x + \beta y)^2 - 1 \le \alpha(x^2 - 1) + \beta(y^2 - 1),$
that is (using $\alpha + \beta = 1$),
$\alpha x^2 + \beta y^2 - (\alpha x + \beta y)^2 \ge 0.$
Expanding the left-hand side, we have
$\alpha x^2 + \beta y^2 - \alpha^2 x^2 - 2\alpha\beta xy - \beta^2 y^2 = \alpha(1 - \alpha)(x - y)^2 = \alpha\beta(x - y)^2 \ge 0,$
which is always true because $\alpha, \beta \ge 0$ and $(x - y)^2 \ge 0$. Therefore, $f(x) = x^2 - 1$ is convex for all $x \in \mathbb{R}$.
2.2 Computational complexity

To compare the order of a function $f(x)$ with that of another function $g(x)$ as $x \to x_0$, we consider the limit of $f(x)/g(x)$. If this limit is a finite, nonzero constant $K$, we write
$f = O(g).$ (2.8)
The big O notation means that $f$ is asymptotically equivalent to the order of $g(x)$. If the limit is unity, or $K = 1$, then we say that $f(x)$ is asymptotically equivalent to $g(x)$. In this particular case, we write
$f \sim g.$ (2.9)
Similarly, if the ratio $f/g$ tends to zero, we write
$f = o(g).$ (2.11)
If $g > 0$, then $f = o(g)$ is equivalent to $f \ll g$ (that is, $f$ is much less than $g$).
Example 9
For example, for all $x \in \mathbb{R}$, we have
$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!} + \cdots,$ (2.12)
which can be approximated as
$e^x \approx 1 + x + O(x^2) \approx 1 + x + \frac{x^2}{2} + o(x),$ (2.13)
depending on the accuracy of the approximation of interest.
It is worth pointing out that the expressions in computational complexity are mostly concerned with functions such as $f(n)$ of an input of problem size $n$, where $n \in \mathbb{N}$ is an integer in the set of natural numbers $\mathbb{N} = \{1, 2, 3, \ldots\}$.
For example, for the functions $f(n) = 10n^2 + 20n + 100$ and $g(n) = 5n^2$, we have
$f(n) = O(g(n))$ (2.14)
for every sufficiently large $n$. When $n$ is sufficiently large, $n^2$ is much larger than $n$ (i.e., $n^2 \gg n$), so the $n^2$ terms dominate both expressions. To emphasize the input $n$, we often write
$f(n) = O(g(n)) = O(n^2).$ (2.15)
In addition, $f(n)$ is in general a polynomial-like function of $n$, which may include not only terms such as $n^3$ and $n^2$ but also terms such as $n^{2.5}$ or $\log(n)$. Therefore, $f(n) = 100n^3 + 20n^{2.5} + 25n\log(n) + 123n$ is a valid expression in the context of computational complexity. In this case, we have
$f(n) = O(n^3).$ (2.16)
Here we always implicitly assume that $n$ is sufficiently large and that the base of the logarithm is 2.
To measure how easy or hard a problem is to solve, we need to estimate its computational complexity. We cannot simply ask how long it takes to solve a particular problem instance because the actual computational time depends on both the hardware and the software used to solve it. Thus, time does not make much sense in this context. A useful measure of complexity should be independent of the hardware and software used. However, such complexity is closely linked to the algorithms used.
In computational complexity, we often use the word "problem" to mean a class of problems of the same type and the word "instance" to mean a specific example of a problem class. Thus, $Ax = b$ is a problem (class) in linear algebra, whereas
$\begin{pmatrix} 2 & 3 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 8 \\ 3 \end{pmatrix}$ (2.17)
is a specific instance.
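This particular instance can be solved numerically in one line (NumPy assumed):

import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 1.0]])
b = np.array([8.0, 3.0])
print(np.linalg.solve(A, b))   # -> [1. 2.], that is, x = 1 and y = 2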
Example 10
The multiplication of two n × n matrices A and B using simple matrix multiplication rules has
a complexity of O(n3 ). There are n rows and n columns for each matrix, and their product C
has n × n entries. To get each entry, we need to carry out the multiplication of a row of A by a
corresponding column of B and calculate their sum, and thus the complexity is O(n). As there
are n × n = n2 entries, the overall complexity is O(n2 ) × O(n) = O(n3 ).
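The $O(n^3)$ count in Example 10 corresponds directly to a naive triple loop. A minimal pure-Python sketch (written for clarity rather than speed) is:

def matmul(A, B):
    """Naive product of two n x n matrices: O(n^3) multiplications."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):           # n rows of A
        for j in range(n):       # n columns of B -> n^2 entries of C
            for k in range(n):   # O(n) work per entry
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# -> [[19.0, 22.0], [43.0, 50.0]]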
In the rest of this book, we analyze different algorithms; the complexity to be given
is usually the arithmetic complexity of an algorithm under discussion.
2.3.1 Norms
In general, a vector in an $n$-dimensional space ($n \ge 1$) can be written as a column vector
$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = (x_1, x_2, \ldots, x_n)^T$ (2.18)
or a row vector
$x = (x_1 \ x_2 \ \ldots \ x_n).$ (2.19)
A simple transpose ($T$) can convert a column vector into its corresponding row vector. The length of $x$ can be written as
$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2},$ (2.20)
which is the Cartesian $L_2$-norm. The dot product, also called the inner product, of two vectors $u$ and $v$ is defined as
$u^T v \equiv u \cdot v = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n.$ (2.22)
More generally, the $L_p$-norm of a vector $x$ is defined as
$\|x\|_p = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}, \quad p > 0.$
The three most widely used norms are $p = 1$, $2$, and $\infty$ [160]. When $p = 2$, it becomes the Cartesian $L_2$-norm discussed before. When $p = 1$, the $L_1$-norm is given by
$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|.$
For $p = \infty$, it becomes the largest absolute component, that is,
$\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_n|\} = x_{\max},$
where we have used the fact that $|x_i/x_{\max}| < 1$ (except for one component, say, $|x_k| = x_{\max}$). Thus, $\lim_{p \to \infty} |x_i/x_{\max}|^p \to 0$ for all $i \ne k$, and the sum of all the ratio terms tends to 1, that is,
$\lim_{p \to \infty} \Big(\sum_i \Big|\frac{x_i}{x_{\max}}\Big|^p\Big)^{1/p} = 1.$ (2.28)
In general, for any two vectors $u$ and $v$ in the same space, we have the triangle inequality
$\|u + v\| \le \|u\| + \|v\|.$

Example 11
For the two vectors $u = (1\ 2\ 3)^T$ and $v = (1\ {-2}\ {-1})^T$, we have
$\|u\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \qquad \|v\| = \sqrt{1^2 + (-2)^2 + (-1)^2} = \sqrt{6},$
and
$w = u + v = (1 + 1 \quad 2 + (-2) \quad 3 + (-1))^T = (2\ 0\ 2)^T$
with norm
$\|w\| = \sqrt{2^2 + 0^2 + 2^2} = 2\sqrt{2} \approx 2.828,$
which is indeed smaller than $\|u\| + \|v\| = \sqrt{14} + \sqrt{6} \approx 6.191$.
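These norms are easy to reproduce numerically. The following minimal sketch, assuming NumPy is available, checks Example 11 and the triangle inequality:

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, -2.0, -1.0])
w = u + v

for p in (1, 2, np.inf):
    print(p, np.linalg.norm(u, p), np.linalg.norm(v, p), np.linalg.norm(w, p))

# Triangle inequality: ||u + v|| <= ||u|| + ||v||
print(np.linalg.norm(w) <= np.linalg.norm(u) + np.linalg.norm(v))   # True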
Figure 2.2 Different p-norms for p = 1, 2, and ∞ (left) as well as p = 1/2 and p = 4 (right).
2.3.2 Regularization

In many applications such as curve-fitting and machine learning, overfitting can be a serious issue, and one way to avoid overfitting is to use regularization. Loosely speaking, regularization adds some penalty term to the objective or loss function so as to constrain certain model parameters. For example, in the method of least squares and many learning algorithms, the objective is to minimize the loss function $L$, which represents the errors between the data labels $y_i$ and the predictions $f_i = f(x_i)$ for $m$ data points $(x_i, y_i)$, $i = 1, 2, \ldots, m$, that is,
$L = \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2,$ (2.30)
which is essentially the squared $L_2$-norm of the errors $E_i = y_i - f_i$. The model prediction $f(x, w)$ usually has model parameters such as $w = (w_1, w_2, \ldots, w_K)$ for simple polynomial curve-fitting. In general, a prediction model can have $K$ different model parameters, and overfitting can occur if the model becomes too complex with too many parameters so that the oscillations become significant. Thus, a penalty term in terms of some norm of the model parameters is usually added to the loss function. For example, the well-known Tikhonov regularization uses the $L_2$-norm, and we have
minimize $\sum_{i=1}^{m} \big(y_i - f(x_i, w)\big)^2 + \lambda \|w\|^2,$ (2.31)
where $\lambda > 0$ is the penalty parameter. Obviously, other norms can be used. For example, the Lasso method uses the $L_1$-norm for regularization, which gives
minimize $\frac{1}{m}\sum_{i=1}^{m} \big(y_i - f(x_i, w)\big)^2 + \lambda \|w\|_1.$ (2.32)
We will introduce both the method of least squares and the Lasso method in later chapters.
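As a concrete sketch of (2.31), the following Python code fits a penalized polynomial in closed form (NumPy assumed; the data, polynomial degree, and value of $\lambda$ are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy labels

degree, lam = 9, 1e-3
X = np.vander(x, degree + 1)        # polynomial design matrix

# Minimize ||y - Xw||^2 + lam ||w||^2, whose closed-form solution is
# w = (X^T X + lam I)^{-1} X^T y  (Tikhonov/ridge regression)
w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
print(np.round(w, 3))               # penalized weights remain moderate

In contrast, the $L_1$ penalty in (2.32) has no such closed-form solution and is typically handled by iterative methods, which is one reason the Lasso method deserves separate treatment.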
2.4 Probability distributions

2.4.1 Random variables

A discrete random variable $X$ takes a set of discrete values $x_i$ with corresponding probabilities $p(x_i)$, and all the probabilities must sum to one, that is,
$\sum_{i=1}^{n} p(x_i) = 1.$ (2.33)
For example, the outcomes of tossing a fair coin form a sample space. The outcome of a head (H) is an event with probability $P(H) = 1/2$, and the outcome of a tail (T) is also an event with probability $P(T) = 1/2$. The sum of both probabilities should be one, that is,
$P(H) + P(T) = \frac{1}{2} + \frac{1}{2} = 1.$ (2.34)
The cumulative probability function of $X$ is defined by
$P(X \le x) = \sum_{x_i \le x} p(x_i).$ (2.35)
Two main measures for a random variable $X$ with a given probability distribution $p(x)$ are its mean and variance. The mean $\mu$, or expectation $E[X]$, is defined by
$\mu \equiv E[X] \equiv \langle X \rangle = \int x\, p(x)\, dx$ (2.36)
for a continuous distribution, where the integration is over the appropriate limits. If the random variable is discrete, then the integration becomes the weighted sum
$E[X] = \sum_i x_i\, p(x_i).$ (2.37)
The variance $\mathrm{var}[X] = \sigma^2$ is the expectation of the squared deviation, that is, $E[(X - \mu)^2]$. We have
$\sigma^2 \equiv \mathrm{var}[X] = E[(X - \mu)^2] = \int (x - \mu)^2 p(x)\, dx.$ (2.38)
The square root of the variance, $\sigma = \sqrt{\mathrm{var}[X]}$, is called the standard deviation.
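For a discrete distribution, these quantities reduce to weighted sums. A minimal sketch (NumPy assumed; the fair six-sided die is an illustrative example) is:

import numpy as np

x = np.arange(1, 7)        # outcomes 1, 2, ..., 6
p = np.full(6, 1 / 6)      # uniform probabilities summing to 1

mu = np.sum(x * p)                    # E[X] = 3.5
var = np.sum((x - mu)**2 * p)         # E[(X - mu)^2] ~ 2.9167
print(mu, var, np.sqrt(var))          # mean, variance, standard deviation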
The above definition of the mean $\mu = E[X]$ is essentially the first moment if we define the $k$th moment of a random variable $X$ (with a probability density distribution $p(x)$) by
$\mu_k \equiv E[X^k] = \int x^k p(x)\, dx \quad (k = 1, 2, 3, \ldots).$ (2.39)
Similarly, the $k$th central moment is defined by
$\nu_k \equiv E[(X - \mu)^k] = \int (x - \mu)^k p(x)\, dx,$
where $\mu$ is the mean (the first moment). Thus, the zeroth central moment ($k = 0$) is the sum of all probabilities, which gives $\nu_0 = 1$. The first central moment is $\nu_1 = 0$. The second central moment $\nu_2$ is the variance $\sigma^2$, that is, $\nu_2 = \sigma^2$.