MEAP Edition
Manning Early Access Program
Math and Architectures of Deep Learning
Version 10
Welcome to the Manning Early Access Program (MEAP) for Math and Architectures of Deep Learning.
This membership gives you access to the developing manuscript along with its resources, which include fully functional Python/PyTorch code, downloadable and executable via Jupyter notebook.
Deep learning is a complex subject. On one hand, it is deeply theoretical, with extensive mathematical backing. Indeed, without a good intuitive understanding of the mathematical underpinnings, one is doomed to merely running off-the-shelf, pre-packaged models without understanding them fully. Such models often do not lend themselves well to the exact problem one needs to solve, and one is helpless if any change or re-architecting is necessary. On the other hand, deep learning is also intensely practical, requiring significant Python programming skills on platforms like TensorFlow and PyTorch. Failure to master those leaves one unable to solve any real problem.
The author feels that there is a dearth of books that address both of these aspects of the subject in a connected fashion. That is what led to the genesis of this book.
The author will feel justified in his efforts if these pages help the reader become a successful exponent of the art and science of deep learning.
Please post all comments, questions and suggestions in the liveBook's Discussion Forum.
Sincerely,
Krishnendu Chaudhury
1 An overview of machine learning and deep learning
Deep learning has transformed computer vision, natural language and speech processing in particular, and artificial intelligence in general. From a bag of semi-discordant tricks, none of which worked satisfactorily on real-life problems, artificial intelligence has become a formidable tool to solve real problems faced by industry, at scale. This is nothing short of a revolution going on under our very noses. If one wants to stay ahead of the curve in this revolution, it is imperative to understand the underlying principles and abstractions, rather than simply memorizing the "how to" steps of some hands-on guide. This is where the mathematics comes in.
In this first chapter we give an overview of deep learning. This requires us to use some concepts that will be explained in subsequent chapters. The reader should not worry if there are some open questions at the end of this chapter. This chapter is aimed at orienting one's mind towards this difficult subject. As individual concepts get clearer in subsequent chapters, the reader should consider coming back and giving this chapter a re-read.
Consider, for instance, a cat's brain deciding whether to run away from, ignore, or approach the object in front of it, based on inputs like the perceived hardness of the object in front, the perceived sharpness of the object in front, etc. This is an instance of a classification problem, where the output is one out of a set of possible classes.
Some other examples of classification problems in life:
buy vs hold vs sell a certain stock, from inputs like the price history of the stock and recent changes in its price
object recognition (from an image), e.g.:
– is this a car or a giraffe
– is this a human or a non-human
– is this an inanimate object or a living object
– face recognition: is this Tom or Dick or Mary or Einstein or Messi
action recognition from video, e.g.:
– is this person running or not running
– is this person picking something up or not
– is this person doing something violent or not
Natural Language Processing (aka NLP) on digital documents, e.g.:
– does this news article belong to the realm of politics or sports
– does this query phrase match a particular article in the archive
etc.
Sometimes life requires a quantitative estimation as opposed to a classification. A lion's brain needs to estimate how long a jump should be so as to land on top of its prey, by processing inputs like the speed of the prey, the distance to the prey, etc. Another instance of quantitative estimation is estimating a house price based on inputs like current income, crime statistics for the neighborhood, etc.
Some other examples of quantitative estimations required by life:
object localization from an image: identifying the rectangle bounding the location of an object
stock price prediction from historical stock prices and other world events
similarity score between a pair of documents
Sometimes, a classification output can be generated from a quantitative estimate. For instance, the cat brain described above can combine the inputs (hardness, sharpness, etc.) to generate a quantitative threat score. If that threat score is high, the cat runs away. If the threat score is near zero, the cat ignores the object in front. If the threat score is negative, the cat approaches the object in front and purrs.
Many of these examples are pictorially depicted in Fig 1.1.
In each of these instances, there is a machine, viz., a brain, that transforms sensory or knowledge inputs into decisions or quantitative estimates. The goal of machine learning is to emulate that machine.
One must note that machine learning has a long way to go before it can catch up with the human brain. The human brain can single-handedly deal with thousands, if not millions, of such problems. On the other hand, at its present state of development, machine learning can hardly create a single general-purpose machine that makes a wide variety of decisions and estimates. We mostly make separate machines to solve individual tasks (a stock picker, a car recognizer, etc.).
At this point, one might ask: wait, converting inputs to outputs - isn't that exactly what computers have been doing for the last thirty or more years? What is this paradigm shift I am hearing about? The answer: it is a paradigm shift because we do not provide a step-by-step instruction set, viz., a program, to the machine to convert the input to the output. Instead, we develop a mathematical model for the problem.
Let us illustrate the idea with an example. For the sake of simplicity and concreteness,
we will consider a hypothetical cat brain that needs to make only one decision in life: whether to run away from the object in front, ignore it, or approach and purr. This decision is the output of the model we will discuss. And, in this toy example, the decision is made based on only two quantitative inputs (aka features): the perceived hardness of the object in front and its perceived sharpness (as depicted in Fig 1.1). We do not provide any step-by-step instruction such as "if sharpness is greater than some threshold then run away". Instead, we try to identify a parameterized function that takes the input and converts it to the desired decision or estimate. The simplest such function is a weighted sum of the inputs:

$$y = w_0 x_0 + w_1 x_1 + b$$

where $x_0$ denotes the perceived hardness and $x_1$ the perceived sharpness. The weights $w_0$, $w_1$ and the bias $b$ are the parameters of the function. The output $y$ can be interpreted as a threat score. If the threat score exceeds a threshold, the cat runs away. If it is close to 0, the cat ignores the object. If the threat score is negative, the cat approaches and purrs. For more complex tasks, we will use more sophisticated functions.
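To make this concrete, here is a minimal Python sketch of such a parameterized cat-brain model. The particular weight values, the decision threshold, and the function names are illustrative assumptions, not the book's code.

def threat_score(hardness, sharpness, w0, w1, b):
    # weighted sum of the two input features plus a bias
    return w0 * hardness + w1 * sharpness + b

def cat_decision(y, threshold=0.5):
    # convert the quantitative threat score into one of three classes
    if y > threshold:
        return "run away"
    elif y < 0:
        return "approach and purr"
    else:
        return "ignore"

# example usage with arbitrary (untrained) parameter values
y = threat_score(hardness=0.99, sharpness=0.97, w0=1.0, w1=1.0, b=-1.0)
print(y, cat_decision(y))   # 0.96 -> "run away"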
Note that the weights are not known at first; we need to estimate them. This is done through a process called model training.
In the most popular variety of machine learning, called supervised learning, we prepare the training data before we commence training. Training data comprises example input items, each with its corresponding desired output.¹ Training data is often created manually, i.e., a human goes over every single input item and produces the desired output (aka target output). This is usually the most arduous part of doing machine learning.
For instance, in our hypothetical cat brain example, some possible training data items are

input: (hardness = 0.01, sharpness = 0.02) → threat = −0.90 → decision: "approach and purr"
input: (hardness = 0.50, sharpness = 0.60) → threat = 0.01 → decision: "ignore"
input: (hardness = 0.99, sharpness = 0.97) → threat = 0.90 → decision: "run away"

where the input values of hardness and sharpness are assumed to lie between 0 and 1.

¹ If you have some experience with machine learning, you will realize that I am talking about "supervised" learning here. There are also machines that do not need known outputs to learn - the so-called "unsupervised" machines - we will talk about them later.
What exactly happens during training? Answer: we iteratively process the input training data items. For each input item, we know the desired (aka target) output. On each iteration, we adjust the model weight values in a way that the output of the model function on that specific input item gets at least a little bit closer to the corresponding target output. For instance, suppose at a given iteration the weight values are $w_0 = 20$ and $w_1 = 10$, with bias $b = 50$. On the input (hardness = 0.01, sharpness = 0.02), we get an output threat score $y = 50.4$, which is quite different from the desired $y = -0.9$. We adjust the weights, for instance by reducing the bias, so that $w_0 = 20$, $w_1 = 10$ and $b = 40$. The corresponding threat score $y = 40.4$ is still nowhere near the desired value, but it has moved closer. After doing this on many training data items, the weights start approaching their ideal values. Note that how to identify the adjustments to the weight values is not discussed here. It needs somewhat deeper math and will be discussed later.
As stated above, this process of iteratively tuning the weights is called training or learning. At the beginning of learning, the weights have random values, so the machine's outputs often do not match the desired outputs. But with time, more training iterations happen and the machine "learns" to generate the correct output. That is when the model is ready for deployment in the real world. Given an arbitrary input, the model will (hopefully) emit something close to the desired output during inferencing.
Come to think of it, that is probably how living brains work. They contain equivalents of mathematical models for various tasks. Here, the weights are the strengths of the connections (aka synapses) between the different neurons in the brain. In the beginning, the parameters are untuned and the brain repeatedly makes mistakes. E.g., a baby's brain often makes mistakes in identifying edible objects - anybody who has had a child will know what we are talking about. But each example tunes the parameters (eating green and white rectangular things with $ signs invites much scolding - one should not eat them in the future, etc.). Eventually this machine tunes its parameters to yield better results.
One subtle point should be noted here. During training, the machine tunes its parameters so that it produces the desired outcome on the training data inputs only. Of course, it sees only a small fraction of all possible inputs during training - we are not building a lookup table from known inputs to known outputs here. Hence, when this machine gets released into the world, it mostly runs on input data it has never seen before. What guarantee do we have that it will generate the right outcome on never-before-seen data? Frankly, there is no guarantee. Only, in most real-life problems, the inputs are not really random. They have a pattern. Hopefully, the machine will see enough during training to capture that pattern. Then, its output on unseen input will be close to the desired value. The closer the distribution of the training data is to real life, the likelier that becomes.
Let us briefly recapitulate the main ideas from the previous section. In machine learning, we try to solve problems that can be abstractly viewed as transforming a set of inputs to an output. The output is either a class or an estimated value. Since we do not know the true transformation function, we try to come up with a model function. We start by designing, using our physical understanding of the problem, a model function with tunable parameter values that could serve as a proxy for the true function. This is the model architecture, and the tunable parameters are also known as weights. The simplest model architecture is one where the output is a weighted sum of the input values. Determining the model architecture does not fully determine the model - we still need to determine the actual parameter values (weights). That is where training comes in. During training, we find an optimal set of weights that transforms the training inputs into outputs that match the corresponding training outputs as closely as possible. Then we deploy this machine in the world - now its weights are estimated and the function is fully determined - and on any input, it simply applies the function and generates an output. This is called inferencing. Of course, training inputs are only a fraction of all possible inputs, so there is no guarantee that inferencing will yield a desired result on all real inputs. The success of the model depends on the appropriateness of the chosen model architecture and the quality and quantity of the training data.
In this context, the author would like to note that after mastering machine learning, the biggest struggle faced by a practitioner turns out to be the procurement of training data. It is common practice, when one can afford it, to use humans to hand-generate the outputs corresponding to the training data inputs (these target outputs are sometimes referred to as ground truth). This process is known as human labeling.²

² This chapter is a lightweight overview of machine/deep learning. As such, it mildly relies upon mathematical concepts that we will introduce later. The reader is encouraged to read this chapter now, nonetheless, and perhaps re-read it after the chapters on vectors and matrices have been digested.
MODEL ESTIMATION
Now for the all-important step. We need to estimate the function which transforms the input vector to the output. With a slight abuse of terms, we will denote this function, as well as the output, by y. In mathematical notation, we want to estimate $y(\vec{x})$.
Of course, we do not know the ideal function. We will try to estimate this unknown function from the training data. This is accomplished in two steps:
1 Model Architecture Selection: designing a parameterized function that we expect to be a good proxy or surrogate for the unknown ideal function.
2 Training: estimating the parameters of that chosen function such that the outputs on training inputs match the corresponding outputs as closely as possible.
Note that b is a slightly special parameter. It is a constant that does not get multiplied with any of the inputs. It is common practice in machine learning to refer to it as the bias, while the other parameters, which get multiplied with inputs, are called weights.
MODEL TRAINING
Once the model architecture is chosen, we know the exact parametric function we are going to use to model the unknown function $y(\vec{x})$ that transforms inputs to outputs. We still need to estimate the function's parameters. Thus, we have a function with unknown parameters, and the parameters are to be estimated from a set of inputs with known outputs (training data). We will choose the parameters so that the outputs on the training data inputs match the corresponding outputs as closely as possible.
It should be noted that this problem has been studied by mathematicians and is known as a function fitting problem. What changed with the advent of machine learning, however, is the sheer scale. In machine learning, we deal with training data comprising millions and millions of items. This changed the philosophy of the solution. Mathematicians used a "closed-form solution", where the parameters are estimated by directly solving equations involving all the training data items together. In machine learning, one goes for iterative solutions, where one deals with a few, perhaps a single, training data item at a time. In an iterative solution, there is no need to hold the entire training data in the computer's memory. One simply loads small portions of it at a time and deals with only that portion. We will exemplify this with our cat brain example.
Let the training data comprise N + 1 inputs $\vec{x}^{(0)}, \vec{x}^{(1)}, \cdots, \vec{x}^{(N)}$. Here each $\vec{x}^{(i)}$ is a 2 × 1 vector denoting a single training data input instance. The corresponding desired threat values (outputs) are $y_{gt}^{(0)}, y_{gt}^{(1)}, \cdots, y_{gt}^{(N)}$, say (here the subscript gt denotes ground truth). Equivalently, we can say the training data comprises N + 1 (input, output) pairs:

$$\left(\vec{x}^{(0)}, y_{gt}^{(0)}\right), \left(\vec{x}^{(1)}, y_{gt}^{(1)}\right), \cdots, \left(\vec{x}^{(N)}, y_{gt}^{(N)}\right)$$

Suppose $\vec{w}$ denotes the (as yet unknown) optimal parameters for the model. Then, given an arbitrary input $\vec{x}$, the machine will estimate a threat value of $y_{predicted} = \vec{w}^T\vec{x} + b$. On the $i^{th}$ training data pair $\left(\vec{x}^{(i)}, y_{gt}^{(i)}\right)$, the machine will estimate

$$y_{predicted}^{(i)} = \vec{w}^T\vec{x}^{(i)} + b$$

while the desired output is $y_{gt}^{(i)}$. Thus the squared error (aka loss) made by the machine on the $i^{th}$ training data instance is³

$$e_i^2 = \left(y_{predicted}^{(i)} - y_{gt}^{(i)}\right)^2$$

The overall loss on the entire training data set is obtained by adding the loss from each individual training data instance:

$$E^2 = \sum_{i=0}^{N} e_i^2 = \sum_{i=0}^{N}\left(y_{predicted}^{(i)} - y_{gt}^{(i)}\right)^2 = \sum_{i=0}^{N}\left(\vec{w}^T\vec{x}^{(i)} + b - y_{gt}^{(i)}\right)^2$$
The goal of training is to find the set of model parameters (aka weights) $\vec{w}, b$ that minimizes the total error E. Exactly how we do this will be described later.
In most cases, it is not possible to come up with a closed-form solution for the optimal $\vec{w}, b$. Instead, we take the iterative approach depicted in Algorithm 1. In Algorithm 1, we start with random parameter values and keep tuning the parameters so that the total error goes down at least a little bit. We keep doing this until the error becomes sufficiently small.
In a purely mathematical sense, one continues the iterations until the error is minimal. But in practice, one often stops when the results are accurate enough for the problem being solved. It is worth re-emphasizing that error here refers only to the error on the training data.

³ In this context, it should be noted that it is common practice to square the error/loss to make it sign-independent. If we desired an output of, say, 10, we are equally happy/unhappy if the output is 9.5 or 10.5. Thus, an error of +5 or −5 is effectively the same; hence we make the error sign-independent.
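The following is a minimal NumPy sketch of this iterative idea for the weighted-sum cat-brain model. The gradient-descent style update, the learning rate, and the toy data values are illustrative assumptions; the book derives the actual update rule in later chapters.

import numpy as np

# toy training data: (hardness, sharpness) -> target threat score
X = np.array([[0.01, 0.02], [0.50, 0.60], [0.99, 0.97]])
y_gt = np.array([-0.90, 0.01, 0.90])

rng = np.random.default_rng(0)
w = rng.standard_normal(2)   # start with random weights
b = 0.0
lr = 0.1                     # learning rate (assumed)

for _ in range(1000):
    y_pred = X @ w + b                     # model outputs on all training inputs
    err = y_pred - y_gt                    # signed errors
    E2 = np.sum(err ** 2)                  # total squared loss
    # one possible way to nudge the parameters so that the loss decreases a little:
    w -= lr * 2 * (X.T @ err) / len(y_gt)
    b -= lr * 2 * err.mean()

print(w, b, E2)   # the parameters gradually approach values that fit the training data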
INFERENCING
Finally, a trained machine (with optimal parameters $\vec{w}^*, b^*$) is deployed in the world. It will receive new inputs $\vec{x}$ and will infer $y_{predicted}(\vec{x}) = \vec{w}^{*T}\vec{x} + b^*$. Classification happens by thresholding $y_{predicted}$, as shown in equation 1.2.
Let us continue with our example cat brain model to illustrate the idea. As stated earlier, our feature space is 2D, with two coordinate axes: $X_0$ signifying hardness and $X_1$ signifying sharpness.⁴ Individual points in this 2D space will be denoted by coordinate values $(x_0, x_1)$, in lower case. This is depicted in Fig 1.2. As shown in the diagram, a good way to model the threat score is to measure the distance from the line $x_0 + x_1 = 1$. From coordinate geometry, in a 2D space with coordinate axes $X_0$ and $X_1$, the signed distance of a point $(a, b)$ from the line $x_0 + x_1 = 1$ is $y = \frac{a+b-1}{\sqrt{2}}$. Examining the sign of y, we can determine which side of the separator line the input point lies on.
Figure 1.2: Geometrical view of machine learning: the 2D input point space for the cat brain model. The bottom left corner shows low-hardness, low-sharpness objects ('-' signs), while the top right corner shows high-hardness, high-sharpness objects ('+' signs). The intermediate values are near the diagonal ('$' signs). In this simple situation, mere observation tells us that the threat score can be proxied by the signed distance, y, from the diagonal line $x_0 + x_1 - 1 = 0$. One can make the run/ignore/approach decision by thresholding y: values close to zero imply ignore, positive values imply run away, and negative values imply approach and purr. From high school geometry, the distance of an arbitrary input point $(x_0 = a, x_1 = b)$ from the line $x_0 + x_1 - 1 = 0$ is $\frac{a+b-1}{\sqrt{2}}$. Thus, the function $y(x_0, x_1) = \frac{x_0 + x_1 - 1}{\sqrt{2}}$ is a possible model for the cat brain threat estimator function. Training should converge to $w_0 = \frac{1}{\sqrt{2}}$, $w_1 = \frac{1}{\sqrt{2}}$ and $b = -\frac{1}{\sqrt{2}}$.
⁴ We use $X_0$, $X_1$ as coordinate symbols instead of the more familiar X, Y, so as not to run out of symbols when going to higher-dimensional spaces.
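As a quick sanity check of this geometric view, the sketch below compares the signed-distance formula with the equivalent weighted-sum form of the model; the helper names are assumptions for illustration.

import numpy as np

w = np.array([1 / np.sqrt(2), 1 / np.sqrt(2)])   # the weights training should converge to
b = -1 / np.sqrt(2)                              # the corresponding bias

def threat_from_distance(a, b_coord):
    # signed distance of the point (a, b_coord) from the line x0 + x1 - 1 = 0
    return (a + b_coord - 1) / np.sqrt(2)

def threat_from_weighted_sum(x):
    return w @ x + b

x = np.array([0.99, 0.97])           # a hard, sharp object
print(threat_from_distance(*x))      # ~0.679
print(threat_from_weighted_sum(x))   # same value: the two forms agree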
The geometric view holds in higher dimensions too. In general, an n-dimensional input vector $\vec{x}$ is mapped to an m-dimensional output vector (usually m < n) in such a way that the problem becomes much simpler in the output space. An example with a 3D feature space is shown in Figure 1.3.
In a regressor, the model tries to emit a desired value given a specific input. For instance, the first stage (the threat score estimator) of the cat brain model in section 1.3 is a regressor model.
Classifiers, on the other hand, have a set of pre-specified classes. Given a specific input, they try to emit the class to which the input belongs. For instance, the full cat brain model has 3 classes: (i) run away, (ii) ignore, (iii) approach and purr. Thus, it takes an input (hardness and sharpness values) and emits an output decision (aka class).
In this example, we convert a regressor into a classifier by thresholding the output of the regressor (see equation 1.2). It is also possible to create models that directly output the class without having an intervening regressor.
Figure 1.3: Geometrical View of Machine Learning: A model maps the points from
input (feature) space to an output space where it is easier to separate the classes. For
instance, in this figure, input feature points belonging to two classes, red and green,
are distributed over the volume of a cylinder in a 3D feature space. The model unfurls
the cylinder into a rectangle. The feature points get mapped onto a 2D planar output
space where the two classes can be discriminated with a simple linear separator.
For such complex tasks, the model function (equivalently, the separator between classes) will be a non-linear function. For instance, see the curved separator in Fig. 1.4. Another example is shown in Figure 1.5: classifying the points in the 2D plane into the two classes indicated in blue and red requires a non-linear model.
Non-linear models make sense from the function approximation point of view as well. Ultimately, our goal is to approximate very complex and highly non-linear functions that model the classification or estimation processes demanded by life. Intuitively, it seems better to use non-linear functions to model them.
Figure 1.4: The two classes (indicated by '+' and '-') cannot be separated by a line; a curved separator is needed. In 3D, this is equivalent to saying that no plane can separate the classes; a curved surface is necessary. In still higher-dimensional spaces, this is equivalent to saying that no hyper-plane can separate the classes; a curved hyper-surface is needed.
Figure 1.5: The two classes (indicated by blue and red colors respectively) cannot be separated by a line; a non-linear (curved) separator is needed.
Thus, a popular model architecture (still relatively simple) is to take the sigmoid (which itself has no parameters) of the weighted sum of the inputs. The sigmoid imparts the non-linearity. This architecture is able to handle relatively more complex classification tasks than the weighted sum alone. In fact, equation 1.6 depicts the basic building block of a neural network.
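As a concrete illustration, here is a minimal NumPy sketch of this building block, a sigmoid applied to a weighted sum of the inputs; the specific values are assumptions for demonstration.

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1), imparting the non-linearity
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # basic building block: sigmoid of a weighted sum of the inputs
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.99, 0.97])   # hardness, sharpness
w = np.array([3.0, 2.0])     # example weights
b = -2.5                     # example bias
print(neuron(x, w, b))       # a value between 0 and 1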
Table 1.1: A glimpse into the background and foreground variations a typical deep learning system (here a dog image recognizer) has to deal with
The astute reader might notice that the following equations do not have an explicit bias term. That is because, for simplicity of notation, we have rolled it into the set of weights and assumed that one of the inputs, say $x_0$, is 1, so that the corresponding weight $w_0$ acts as the bias.
···
Final Layer (L): generates the m + 1 visible outputs from the $n_{L-1}$ hidden outputs of the previous layer:

$$h_0^{(L)} = \sigma\left(w_{00}^{(L)} h_0^{(L-1)} + w_{01}^{(L)} h_1^{(L-1)} + \cdots + w_{0\,n_{L-1}}^{(L)} h_{n_{L-1}}^{(L-1)}\right)$$
$$h_1^{(L)} = \sigma\left(w_{10}^{(L)} h_0^{(L-1)} + w_{11}^{(L)} h_1^{(L-1)} + \cdots + w_{1\,n_{L-1}}^{(L)} h_{n_{L-1}}^{(L-1)}\right)$$
$$\vdots$$
$$h_m^{(L)} = \sigma\left(w_{m0}^{(L)} h_0^{(L-1)} + w_{m1}^{(L)} h_1^{(L-1)} + \cdots + w_{m\,n_{L-1}}^{(L)} h_{n_{L-1}}^{(L-1)}\right) \quad (1.9)$$
This machine, depicted in Fig 1.7, can be incredibly powerful, and we can adjust its expressive power systematically to fit the problem at hand. This, then, is a neural network. We will devote the rest of the book to studying it.
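To make the layered structure tangible, the following NumPy sketch performs a forward pass through such a stack of sigmoid layers; the layer sizes and random weights are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# layer sizes: 2 inputs -> 4 hidden -> 3 hidden -> 1 output (assumed for illustration)
sizes = [2, 4, 3, 1]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights, biases):
        # each layer: sigmoid of a weighted sum of the previous layer's outputs
        h = sigmoid(W @ h + b)
    return h

print(forward(np.array([0.99, 0.97])))   # the (untrained) network's output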
Chapter Summary
In this chapter we gave an overview of machine learning, leading all the way up to deep learning. The ideas were illustrated with a toy cat brain example. Some mathematical notions (e.g., vectors) were used in this chapter without proper introduction. The reader is encouraged to revisit this chapter after vectors and matrices have been introduced.
The author would like to leave the reader with the following mental pictures from this chapter:
Machine learning is a fundamentally different paradigm of computing. In traditional computing, one provides a step-by-step instruction sequence to the computer, telling it what to do. In machine learning, one builds a mathematical model that tries to approximate the unknown function that generates a classification or estimation from inputs.
The mathematical nature of the model function is stipulated by the physical nature and complexity of the classification or estimation task. Models have parameters. Parameter values are estimated from training data - inputs with known outputs. The parameter values are optimized so that the model output is as close as possible to the training outputs on the training inputs.
An alternative, geometric view of a machine is as a transformation that maps points in a multi-dimensional input space to points in an output space.
The more complex the classification/estimation task, the more complex the approximating function. In machine learning parlance, complex tasks need machines with higher expressive power. Higher expressive power comes from non-linearity (e.g., the sigmoid function, see 8.1) and layered combinations of simpler machines. This takes us to deep learning, which is nothing but a multi-layered non-linear machine.
Complex model functions are often built by combining simpler basis functions.
Tighten your seat belts, the fun is about to get more intense.
2 Introduction to Vectors, Matrices and Tensors from a Machine Learning and Data Science point of view
At its core, machine learning, indeed all computer software, is about number crunching. One inputs a set of numbers to the machine and gets back a different set of numbers as output. However, this cannot be done randomly. It is important to organize these numbers appropriately and group them into meaningful objects that go into and come out of the machine. This is where vectors and matrices come in. These are concepts that mathematicians have been using for centuries - we are simply reusing them in machine learning. In this chapter, we will study vectors and matrices, primarily from a machine learning point of view. Starting from the basics, we will quickly graduate to advanced concepts, restricting ourselves to topics that have relevance to machine learning.
We provide Jupyter notebook based Python implementations for most of the concepts discussed in this and other chapters. Complete, fully functional code that can be downloaded and executed (after installing Python and Jupyter notebook) can be found at http://mng.bz/KMQ4. The code relevant to this chapter can be found at http://mng.bz/d4nz.
2.1 Vectors and their role in Machine Learning and Data Science
Let us revisit the machine learning model for the cat brain that was introduced in section 1.3. It takes two numbers as input, representing the hardness and sharpness of the object in front of the cat. The cat brain processes the input and generates an output threat score, which leads to a run away, ignore, or approach and purr decision. The two input numbers usually appear together, and it is handy to group them into a single object. This object will be an ordered sequence of two numbers, the first one representing hardness and the second one representing sharpness. Such an object is a perfect example of a vector.
Thus, a vector can be thought of as an ordered sequence of two or more numbers, also known as an array of numbers.¹ Vectors constitute a compact way of denoting a set of numbers that together represent some entity. In this book, vectors will be represented by lower case letters with an overhead arrow, and arrays by square brackets. For instance, the input to the cat brain model in section 1.3 was a vector $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$, where $x_0$ represents hardness and $x_1$ represents sharpness.
The outputs of machine learning models, too, are often represented as vectors. For instance, consider an object recognition model that takes an image as input and emits a set of numbers indicating the probabilities that the image contains a dog, a human or a cat, respectively. The output of such a model will be a 3-element vector $\vec{y} = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}$, where the number $y_0$ denotes the probability that the image contains a dog, $y_1$ denotes the probability that the image contains a human, and $y_2$ denotes the probability that the image contains a cat. Table 2.1 shows some possible input images and corresponding output vectors.
In multi-layered machines like neural networks, the input and output of each layer will be vectors. We also typically represent the parameters of the model function (see section 1.3) as vectors. This is illustrated below in section 2.3.
One particularly significant notion in machine learning and data science is the idea of a feature vector. This is essentially a vector that describes various properties of the object being dealt with in a particular machine learning problem. We will illustrate the idea with an example from the world of Natural Language Processing (NLP). Suppose we have a set of documents. We want to create a document retrieval system where, given a new document, we have to retrieve "similar" documents in the system. This essentially boils down to estimating the similarity between documents in a quantitative fashion. We will study this problem in detail later; for now we note that the most natural way to approach it is to create a feature vector for each document that quantitatively describes the document. Later, in section 2.5.6, we will see how to measure the similarity between these vectors; for now let us focus on simply creating descriptor vectors for the documents. A popular way to do this is to choose a set of interesting words (we typically exclude from this list words like "and", "if", "to", which are present in all documents), count the number of occurrences of the interesting words in each document, and make a vector of these counts. Table 2.2 shows a toy example with 6 documents and corresponding feature vectors. For simplicity, we have considered only two of the possible set of words ("gun" and "violence", in plural or singular, upper or lower case).

¹ In mathematics, vectors can have an infinite number of elements. Such vectors cannot be expressed as arrays, but we will mostly ignore them in this book.

Table 2.1: Input images and corresponding output vectors denoting the probabilities that the image contains a dog and/or human and/or cat, respectively. Possible output vector for the top left image: [0.9 0.01 0.1]; for the top right image: [0.9 0.01 0.9]; for the bottom left image: [0.01 0.99 0.01]; for the bottom right image: [0.88 0.9 0.001].
The sequence of pixels in an image can also be viewed as a feature vector. Neural networks for computer vision tasks usually expect this feature vector as input.
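A minimal sketch of such word-count feature vectors is shown below; the two toy documents and the chosen word list are assumptions for illustration, not the documents of Table 2.2.

import re

interesting_words = ["gun", "violence"]   # the chosen "interesting" words

def feature_vector(document):
    # count occurrences of each interesting word (singular or plural, any case)
    tokens = re.findall(r"[a-z]+", document.lower())
    return [sum(tok in (w, w + "s") for tok in tokens) for w in interesting_words]

doc1 = "Gun violence has risen. Guns are everywhere."
doc2 = "The festival was peaceful and joyous."
print(feature_vector(doc1))   # [2, 1]
print(feature_vector(doc2))   # [0, 0]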
2.1.1 Geometric View of Vectors and its significance in Machine Learning and Data Science
Vectors can also be viewed geometrically. The simplest example is a 2-element vector $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$. Its 2 elements can be taken to be x and y, Cartesian coordinates in a 2-dimensional space. Then the vector corresponds to a point in that space. Vectors with n elements represent points in an n-dimensional space. The ability to see the inputs and outputs of machine learning models as points allows us to view the model itself as a geometric transformation that maps input points to output points in some high-dimensional space. We have already seen this once in section 1.4. It is an enormously powerful concept that we will keep utilizing throughout the book.
Table 2.2: Example toy documents and corresponding feature vectors describing them. Words eligible for the feature vector are colored in red. The first element of the feature vector indicates the number of occurrences of the word "gun", the second of "violence".

We will briefly touch upon a subtle issue here. A vector represents the position of one point with respect to another. Furthermore, an array of coordinate values, like $\begin{bmatrix} x \\ y \end{bmatrix}$, describes the position of one point in a given coordinate system. See Figure 2.1 for an intuitive understanding of this. For instance, consider the plane of a page of this book. Suppose we want to reach the top right corner point of the page from the bottom left corner. Let us call the bottom left corner O and the top right corner P. We can travel the width (8.5 inches) rightwards to reach the bottom right corner and then travel the height (11 inches) upwards to reach the top right corner. Thus, if we choose a coordinate system with the bottom left corner as origin, the X axis along the width, and the Y axis along the height, the point P corresponds to the array representation $\begin{bmatrix} 8.5 \\ 11 \end{bmatrix}$. But we could also have traveled along the diagonal from the bottom left to the top right corner to reach P from O. Either way, we end up at the same point P. Thus we have a conundrum. The vector $\vec{OP}$ represents the abstract geometric notion of the position of P with respect to O, independent of our choice of coordinate axes. On the other hand, the array representation depends on the choice of coordinate system. E.g., the array $\begin{bmatrix} 8.5 \\ 11 \end{bmatrix}$ represents the top right corner point P only under a specific choice of coordinate axes (parallel to the sides of the page) and a reference point (the bottom left corner). Ideally, we should specify the coordinate system along with the array representation to be unambiguous. How come, then, we never do so in machine learning? The answer: in machine learning, it does not matter what exactly the coordinate system is, as long as we stick to one fixed coordinate system.
There are explicit rules (which we will study below) that state how a vector's array representation transforms when the coordinate system changes. We will invoke them when necessary. All vectors used in a machine learning computation must consistently use the same coordinate system, or must be transformed appropriately.
One other point: planar spaces, e.g., the plane of the paper on which this book is written, are 2-dimensional (abbreviated 2D). The mechanical world we live in is 3-dimensional (3D). Human imagination usually fails to see higher dimensions. In machine learning and data science, we will often talk of spaces with thousands of dimensions. You may not be able to see those spaces in your mind, but that is not a crippling limitation. You can use 3-dimensional analogues in your head. They work in a surprisingly large variety of cases. However, it is important to bear in mind that this is not always true. Some examples where lower-dimensional intuitions fail at higher dimensions will be shown later.
Figure 2.1: A vector describing the position of point P with respect to point O. The basic mental picture to have is an arrowed line. This agrees with the definition of a vector we learnt in high school: a vector has a magnitude (the length of the arrowed line) and a direction (indicated by the arrow). On a plane, this is equivalent to the ordered pair of numbers x, y, whose geometric interpretations are as shown in the figure. In this context, it is worthwhile to note that only the relative positions of the points O and P matter. If both points are moved, keeping their relationship intact, the vector does not change.
2.2 Python code to create and access vectors and sub-vectors, slice and dice vectors, via Numpy and PyTorch parallel code
In this book, we will try to familiarize the reader with numpy, PyTorch and similar programming paradigms alongside the relevant mathematics. Knowledge of Python basics is assumed. The reader is strongly encouraged to try out all the code snippets in this book after installing the appropriate packages, like numpy and PyTorch.
All the Python code in this book is produced via Jupyter notebook. A summarized recapitulation of the theoretical material presented in code is provided right above each code snippet.
The fully functional code demonstrating how to create vectors and access their elements, in Python Numpy as well as PyTorch, can be found at http://mng.bz/xm8q.
import numpy as np
import torch

v = np.array([1, 2, 3, 4, 5, 6])       # example vector; values assumed for illustration

first_element = v[0]                   # square bracket operator lets us access individual vector elements
third_element = v[2]
second_to_fifth_elements = v[1:5]      # colon operator slices off a range of elements from the vector
first_to_third_elements = v[:3]        # nothing before the colon denotes the beginning of the array
last_two_elements = v[-2:]             # nothing after the colon denotes the end of the array
num_elements_in_v = len(v)

v1 = torch.from_numpy(v)               # Torch tensor built from the numpy vector v (assumed setup)
u = v1.numpy()                         # its Numpy version (assumed setup)
diff = v1.sub(torch.from_numpy(u))     # difference between the Torch tensor and its Numpy version is zero
2.3 Matrices and their role in Machine Learning and Data Science
Sometimes it is not sufficient to group a set of numbers into a vector. We have to collect several vectors into another group. For instance, consider the input for training a machine learning model. Here we have several input instances, each comprising a sequence of numbers. As seen in section 2.1, the sequence of numbers belonging to a single input instance can be grouped into a vector. How do we represent the entire collection of input instances? This is where the concept of a matrix, from the world of mathematics, comes in handy. A matrix can be viewed as a rectangular array of numbers, arranged in a fixed count of rows and columns. Each row of a matrix is a vector, and so is each column. Thus a matrix can be thought of as a collection of row vectors. It can also be viewed as a collection of column vectors. We can represent the entire set of numbers that constitutes the training input to a machine learning model as a matrix, with each row vector corresponding to a single training instance.
Table 2.3: Example training dataset for our toy machine-learning-based cat brain
The matrix X in equation 2.1 is a compact representation of the training dataset for this problem.²

$$X = \begin{bmatrix} 0.11 & 0.09 \\ 0.01 & 0.02 \\ 0.98 & 0.91 \\ 0.12 & 0.21 \\ 0.98 & 0.99 \\ 0.85 & 0.87 \\ 0.03 & 0.14 \\ 0.55 & 0.45 \\ 0.49 & 0.51 \\ 0.99 & 0.01 \\ 0.02 & 0.89 \\ 0.31 & 0.47 \\ 0.55 & 0.29 \\ 0.87 & 0.76 \\ 0.63 & 0.74 \end{bmatrix} \quad \text{(example cat-brain dataset matrix)} \quad (2.1)$$

² We will usually use upper case letters to symbolize matrices.
Each row of matrix X is a particular input instance. Different rows represent different input instances. Thus, moving along a row, one encounters successive elements of a single input vector. Moving along a column, one encounters elements of different input instances. Notice that an individual element is now indexed by 2 numbers, as opposed to 1 in a vector. Thus the $0^{th}$ row is the vector $\begin{bmatrix} x_{00} & x_{01} \end{bmatrix}$, representing the $0^{th}$ input instance.
MATRIX REPRESENTATION OF DIGITAL IMAGES
Digital images, too, are often represented as matrices. Here, each element represents the brightness at a specific pixel position (x, y coordinate) of the image. Typically, the brightness value is normalized to an integer in the range 0 to 255: 0 is black, 255 is white, 128 is a mid-gray, etc.³ Following is an example of a tiny image, 9 pixels wide and 4 pixels high:

$$I_{4,9} = \begin{bmatrix} 0 & 8 & 16 & 24 & 32 & 40 & 48 & 56 & 64 \\ 64 & 72 & 80 & 88 & 96 & 104 & 112 & 120 & 128 \\ 128 & 136 & 144 & 152 & 160 & 168 & 176 & 184 & 192 \\ 192 & 200 & 208 & 216 & 224 & 232 & 240 & 248 & 255 \end{bmatrix} \quad (2.2)$$

The brightness increases gradually from left to right and also from top to bottom. $I_{0,0}$ represents the top left pixel, which is black. $I_{3,8}$ represents the bottom right pixel, which is white. The intermediate pixels are various shades of gray between black and white. The actual image looks as shown in Figure 2.2.

³ In digital computers, numbers in the range 0..255 can be represented with a single byte of storage; hence this choice.
2.4 Python Code: Introduction to Matrices, Tensors and Images via Numpy and PyTorch parallel code
For programming purposes, one can think of tensors as multi-dimensional arrays. Scalars are 0-dimensional tensors. Vectors are 1-dimensional tensors. Matrices are 2-dimensional tensors. RGB images are 3-dimensional tensors (color channels × height × width). A batch of 64 images is a 4-dimensional tensor (64 × color channels × height × width).
2.4.1 Python Numpy code for introduction to Tensors, Matrices and Images
import numpy as np
import torch

# X is the 15 x 2 numpy array holding the cat-brain training data of equation 2.1 (assumed defined earlier)
# Ranges of rows and columns can be specified via the colon operator to slice off (extract) sub-matrices
first_3_training_examples = X[:3, ]    # extract the first 3 training examples (rows)

# All images are tensors. An RGB image of height H and width W is a 3-tensor of shape [3, H, W]
# 4 x 9 single channel image shown in Fig 2.2
I49 = np.array([[0, 8, 16, 24, 32, 40, 48, 56, 64],
                [64, 72, 80, 88, 96, 104, 112, 120, 128],
                [128, 136, 144, 152, 160, 168, 176, 184, 192],
                [192, 200, 208, 216, 224, 232, 240, 248, 255]],
               dtype=np.uint8)

# cat-brain training data input: 15 examples, each with 2 values - hardness, sharpness
# 15 x 2 torch tensor, each element a 64-bit float, created by directly specifying values
Y = torch.tensor([[0.11, 0.09], [0.01, 0.02], [0.98, 0.91],
                  [0.12, 0.21], [0.98, 0.99], [0.85, 0.87],
                  [0.03, 0.14], [0.55, 0.45], [0.49, 0.51],
                  [0.99, 0.01], [0.02, 0.89], [0.31, 0.47],
                  [0.55, 0.29], [0.87, 0.76], [0.63, 0.74]],
                 dtype=torch.float64)

# Torch tensors can be converted to Numpy arrays; the two arrays are equivalent
np.allclose(X, Y.numpy(), rtol=1e-7)
np.allclose(torch.from_numpy(X), Y, 1e-7)

print(Y[3, :])       # slicing operations of numpy arrays work on torch tensors too
print(Y[3:5, 1:2])
2.5 Basic Vector and Matrix operations in Machine Learning and Data Science
In this section we will introduce several basic vector and matrix operations, along with examples demonstrating their significance in image processing, computer vision and machine learning. It is meant to be an application-centric introduction to linear algebra. It is not meant to be a comprehensive review of matrix and vector operations, for which the reader is referred to a textbook on linear algebra.
Figure 2.5: Image corresponding to the transpose of matrix $I_{4,9}$, shown in equation 2.3. Transposing flips the image about its main diagonal, which looks like the original rotated by 90° and mirrored.
Consider again the image matrix of equation 2.2. It and its transpose $I_{4,9}^T = I_{9,4}$ are shown below:
$$I_{4,9} = \begin{bmatrix} 0 & 8 & 16 & 24 & 32 & 40 & 48 & 56 & 64 \\ 64 & 72 & 80 & 88 & 96 & 104 & 112 & 120 & 128 \\ 128 & 136 & 144 & 152 & 160 & 168 & 176 & 184 & 192 \\ 192 & 200 & 208 & 216 & 224 & 232 & 240 & 248 & 255 \end{bmatrix}$$

$$I_{4,9}^T = I_{9,4} = \begin{bmatrix} 0 & 64 & 128 & 192 \\ 8 & 72 & 136 & 200 \\ 16 & 80 & 144 & 208 \\ 24 & 88 & 152 & 216 \\ 32 & 96 & 160 & 224 \\ 40 & 104 & 168 & 232 \\ 48 & 112 & 176 & 240 \\ 56 & 120 & 184 & 248 \\ 64 & 128 & 192 & 255 \end{bmatrix} \quad (2.3)$$
By comparing equation 2.2 and equation 2.3, one can easily see that one matrix can be obtained from the other by interchanging the row and column indices. This operation is known as matrix transposition.
Formally, the transpose of a matrix $A_{m,n}$ with m rows and n columns is another matrix, with n rows and m columns, denoted $A^T_{n,m}$, such that $A^T_{ij} = A_{ji}$. For example, the value at row 0, column 6 of matrix $I_{4,9}$ is 48. In the transposed matrix, the same value appears at row 6, column 0. In matrix parlance, $I_{4,9}[0,6] = I_{4,9}^T[6,0] = I_{9,4}[6,0] = 48$.
Vector transposition is really a special case of matrix transposition (since all vectors are matrices: a column vector with n elements is an n × 1 matrix). For instance, an arbitrary vector and its transpose are shown in equations 2.4 and 2.5:

$$\vec{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad (2.4)$$

$$\vec{v}^T = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \quad (2.5)$$
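A quick NumPy sketch of these transposition facts; the small matrix and variable names are assumptions for illustration.

import numpy as np

A = np.array([[0, 8, 16],
              [64, 72, 80]])        # a small 2 x 3 matrix for illustration

At = A.T                            # numpy transpose: rows become columns
print(At.shape)                     # (3, 2)
print(A[0, 2], At[2, 0])            # the same value appears with row/column indices swapped

v = np.array([[1], [2], [3]])       # a 3 x 1 column vector
print(v.T)                          # its transpose: the 1 x 3 row vector [[1 2 3]]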
2.5.2 Dot Product of two vectors and its role in Machine Learning and Data Science
In section 1.3 we saw the simplest of machine learning models, where the output is generated by taking a weighted sum of the inputs (and then adding a constant bias value). This model/machine is characterized by the weights $w_0, w_1$ and the bias b. Take the rows of Table 2.3. E.g., for row 0, the input values are: hardness of the approaching object = 0.11 and sharpness = 0.09. The corresponding model output will be $y = w_0 \times 0.11 + w_1 \times 0.09 + b$. In fact, the goal of training is to choose $w_0, w_1$ and b such that the model outputs are as close as possible to the known outputs: i.e., $y = w_0 \times 0.11 + w_1 \times 0.09 + b$ should be as close to −0.8 as possible, $y = w_0 \times 0.01 + w_1 \times 0.02 + b$ should be as close to −0.97 as possible, etc. In general, given an input instance $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$, the model output is $y = x_0 w_0 + x_1 w_1 + b$.
We will keep returning to the above model throughout the chapter. But in this subsection, let us consider a different question. In this toy example we have only 2 values per input instance. That implies we have only 3 model parameters: 2 weights $w_0, w_1$ and 1 bias b. Hence it is not very messy to write the model output flat out as $y = x_0 w_0 + x_1 w_1 + b$. Is there a compact way to represent the model output on a specific input instance, irrespective of the size of the input?
It turns out the answer is yes: we can use an operation called the dot product from the world of mathematics. We have already seen in section 2.1 that an individual instance of model input can be compactly represented by a vector, say $\vec{x}$ (it can have any number of input values). We can also represent the set of weights as a vector $\vec{w}$; it will have the same number of items as the input vector. The model output is then obtained via the dot product operation on the two vectors. The dot product is the sum of the element-wise products of the two vectors $\vec{x}$ and $\vec{w}$, as shown below.
Formally, given two vectors $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}$, the dot product of the two vectors is defined as

$$\vec{x} \cdot \vec{w} = x_0 w_0 + x_1 w_1 + \cdots + x_n w_n \quad (2.6)$$

In other words, the sum of the products of the corresponding elements of two vectors $\vec{a}$ and $\vec{b}$ is called the dot product of the two vectors, denoted $\vec{a} \cdot \vec{b}$.
Note that the dot product notation can compactly represent the model output as $y = \vec{w} \cdot \vec{x} + b$. The representation does not increase in size even when the number of inputs and weights is large.
Consider our (by now familiar) cat brain example again. Suppose the weight vector is $\vec{w} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$ and the bias value is b = 5. Then the model output for the $0^{th}$ input instance from Table 2.3 will be $\begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 2 \end{bmatrix} + 5 = 0.11 \times 3 + 0.09 \times 2 + 5 = 5.51$.
It is another matter that these are bad choices for the weight and bias parameters, since the model output 5.51 is a far cry from the desired output −0.8. We will soon see how to obtain better parameter values. For now, we just need to note that the dot product offers a neat way to represent the simple weighted sum model output.
The dot product is defined only if the two vectors have the same dimension.
Sometimes the dot product is also referred to as the inner product, denoted $\langle \vec{a}, \vec{b} \rangle$. Strictly speaking, the phrase inner product is a bit more general; it applies to infinite-dimensional vectors as well. In this book, we will often use the terms interchangeably, sacrificing mathematical rigor for enhanced understanding.
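This is also exactly what NumPy's dot product computes; a minimal sketch using the same numbers (variable names assumed):

import numpy as np

x = np.array([0.11, 0.09])   # hardness, sharpness of the 0th training instance
w = np.array([3.0, 2.0])     # example weight vector
b = 5.0                      # example bias

y = np.dot(w, x) + b         # weighted-sum model output expressed as a dot product
print(y)                     # 5.51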
As we saw above, the weighted sum model output on a single input instance, say $\begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix}$, can be represented using a vector-vector dot product: $\vec{w} \cdot \vec{x} + b = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix} + 5$. Now, as depicted in equation 2.1, during training we are dealing with many training data instances at the same time. In fact, in real life, we typically deal with hundreds of thousands of input instances, each having hundreds of values. Is there a way to represent this compactly, such that it is independent of the count of input instances and their sizes?
Again, it turns out the answer is yes. We can use the idea of matrix-vector multiplication from the world of mathematics. The product of a matrix X and a column vector $\vec{w}$ is another vector, denoted $X\vec{w}$. Its elements are the dot products between the row vectors of X and the column vector $\vec{w}$. E.g., given the model weight vector $\vec{w} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$ and the bias value b = 5, the outputs on the toy training dataset of our familiar cat-brain model (equation 2.1) can be obtained via the following steps:
model (equation 2.1) can be obtained via the following steps
0.11 0.09 0.11 × 3 + 0.09 × 2 = 0.51
0.01 0.02 0.01 × 3 + 0.02 × 2 = 0.07
0.98 0.91 0.98 × 3 + 0.91 × 2 = 4.76
0.12 0.21 0.12 × 3 + 0.21 × 2 = 0.78
0.98 0.99 0.98 × 3 + 0.99 × 2 = 4.92
0.85 0.87 0.85 × 3 + 0.87 × 2 = 4.29
0.03 0.14 0.03 × 3 + 0.14 × 2 = 0.37
3
0.45 = 0.55 × 3 + 0.45 × 2 = 2.55 (2.7)
0.55
2
0.49 × 3 + 0.51 × 2 = 2.49
0.49 0.51
0.99 × 3 + 0.01 × 2 = 2.99
0.99 0.01
0.02 × 3 + 0.89 × 2 = 1.84
0.02 0.89
0.31 0.47 0.31 × 3 + 0.47 × 2 = 1.87
0.55 × 3 + 0.29 × 2 = 2.23
0.55 0.29
0.87 0.76 0.87 × 3 + 0.76 × 2 = 4.13
0.63 0.74 0.63 × 3 + 0.74 × 2 = 3.37
Adding the bias value of 5, the model output on the toy training dataset is

$$\vec{y} = \begin{bmatrix} 5 + 0.51 \\ 5 + 0.07 \\ 5 + 4.76 \\ 5 + 0.78 \\ 5 + 4.92 \\ 5 + 4.29 \\ 5 + 0.37 \\ 5 + 2.55 \\ 5 + 2.49 \\ 5 + 2.99 \\ 5 + 1.84 \\ 5 + 1.87 \\ 5 + 2.23 \\ 5 + 4.13 \\ 5 + 3.37 \end{bmatrix} = \begin{bmatrix} 5.51 \\ 5.07 \\ 9.76 \\ 5.78 \\ 9.92 \\ 9.29 \\ 5.37 \\ 7.55 \\ 7.49 \\ 7.99 \\ 6.84 \\ 6.87 \\ 7.23 \\ 9.13 \\ 8.37 \end{bmatrix} \quad (2.8)$$

In general, the product of an m × n matrix A and an n × 1 column vector $\vec{b}$ is an m × 1 vector $\vec{c}$, whose elements are the dot products of the rows of A with $\vec{b}$. That is, $A\vec{b} = \vec{c}$, or

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = \begin{bmatrix} c_1 = a_{11}b_1 + a_{12}b_2 + \cdots + a_{1n}b_n \\ c_2 = a_{21}b_1 + a_{22}b_2 + \cdots + a_{2n}b_n \\ \vdots \\ c_m = a_{m1}b_1 + a_{m2}b_2 + \cdots + a_{mn}b_n \end{bmatrix} \quad (2.9)$$
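The whole computation of equations 2.7 and 2.8 is one line of NumPy; a minimal sketch (variable names assumed):

import numpy as np

X = np.array([[0.11, 0.09], [0.01, 0.02], [0.98, 0.91], [0.12, 0.21],
              [0.98, 0.99], [0.85, 0.87], [0.03, 0.14], [0.55, 0.45],
              [0.49, 0.51], [0.99, 0.01], [0.02, 0.89], [0.31, 0.47],
              [0.55, 0.29], [0.87, 0.76], [0.63, 0.74]])
w = np.array([3.0, 2.0])
b = 5.0

y = X @ w + b    # matrix-vector product plus bias: one output per training row
print(y)         # [5.51 5.07 9.76 ... 8.37], matching equation 2.8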
More generally, two matrices can be multiplied: the product of an m × n matrix A and an n × p matrix B is an m × p matrix C = AB, whose element $c_{ij}$ is the dot product of the $i^{th}$ row of A with the $j^{th}$ column of B. The following example, multiplying a 3 × 2 matrix A with a 2 × 2 matrix B, illustrates the idea:

$$A_{3,2} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \qquad B_{2,2} = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$

$$C_{3,2} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} c_{11} = a_{11}b_{11} + a_{12}b_{21} & c_{12} = a_{11}b_{12} + a_{12}b_{22} \\ c_{21} = a_{21}b_{11} + a_{22}b_{21} & c_{22} = a_{21}b_{12} + a_{22}b_{22} \\ c_{31} = a_{31}b_{11} + a_{32}b_{21} & c_{32} = a_{31}b_{12} + a_{32}b_{22} \end{bmatrix}$$
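For completeness, a tiny NumPy check of this matrix-matrix product rule (the values are assumed for illustration):

import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # 3 x 2
B = np.array([[7, 8],
              [9, 10]])         # 2 x 2

C = A @ B                       # 3 x 2 result; C[i, j] is the dot product of row i of A with column j of B
print(C[0, 0], 1 * 7 + 2 * 9)   # both print 25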
At this point, the astute reader may already have noted that the dot product is a special case of matrix multiplication. For instance, the dot product between two vectors $\vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$ and $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$ is equivalent to transposing either of the two vectors and then doing a matrix multiplication with the other. In other words,

$$\vec{w} \cdot \vec{x} = \vec{w}^T\vec{x} = \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \vec{x}^T\vec{w} = \begin{bmatrix} x_0 & x_1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 x_0 + w_1 x_1$$

The idea works in higher dimensions too. In general, given two vectors $\vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}$, the dot product of the two vectors is

$$\vec{x} \cdot \vec{w} = \vec{w}^T\vec{x} = \begin{bmatrix} w_0 & w_1 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \vec{x}^T\vec{w} = \begin{bmatrix} x_0 & x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix} = x_0 w_0 + x_1 w_1 + \cdots + x_n w_n \quad (2.10)$$
Transpose of Matrix Products: given two matrices A and B, where the number of columns of A matches the number of rows of B (i.e., it is possible to multiply them), the transpose of the product is the product of the individual transposes, in reversed order: $(AB)^T = B^T A^T$. The rule also applies to matrix-vector multiplication: $(A\vec{b})^T = \vec{b}^T A^T$.
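A quick NumPy verification of this reversal rule (the matrices are arbitrary, assumed for illustration):

import numpy as np

A = np.arange(6).reshape(2, 3)       # 2 x 3
B = np.arange(12).reshape(3, 4)      # 3 x 4

lhs = (A @ B).T                      # transpose of the product
rhs = B.T @ A.T                      # product of the transposes, in reversed order
print(np.array_equal(lhs, rhs))      # True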
2.5.4 Length of a Vector aka L2 norm and its role in Machine Learning
Suppose a machine learning model was supposed to output a target value ȳ but it output y instead. We are interested in the error made by the model. The error is the difference between the target and the actual outputs.
We would like to make one important note here. When computing errors, we are only interested in how far from ideal the computed value is. We do not care whether the computed value is bigger or smaller than the ideal. For instance, if the target (ideal) value is 2, the computed values 1.5 and 2.5 are equally in error - we are equally happy or unhappy with either of them. Hence, it is common practice to square error values. Thus, for instance, if the target value is 2 and the computed value is 1.5, the error is (1.5 − 2)² = 0.25. If the computed value is 2.5, the error is again (2.5 − 2)² = 0.25. The squaring operation essentially eliminates the sign of the error value. We can then follow it up with a square root, but it is OK not to. One might ask: but wait, squaring alters the value of the quantity; don't we care about the exact values of the error? The answer is, we usually don't; we only care about the relative values of errors. If the target is 2, all we want is that the error for an output value of, say, 2.1 is less than the error for an output value of 2.5; the exact values of the errors do not matter.
Let us now continue with our discussion of machine learning model error. As seen earlier in section 2.5.3, given a model weight vector, say $\vec{w} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, and the bias value b = 5, the weighted sum model output on a single input instance, say $\begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix}$, is $\begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 2 \end{bmatrix} + 5 = 5.51$. The corresponding target (ideal) output, from Table 2.3, is −0.8. The squared error $e^2 = (-0.8 - 5.51)^2 = 39.82$ gives us an idea of how good or bad the model parameters 3, 2, 5 are. If instead we use the weight vector $\vec{w} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and bias value −1, we get the model output $\vec{w} \cdot \vec{x} + b = \begin{bmatrix} 0.11 \\ 0.09 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} - 1 = -0.8$. The output is exactly the same as the target. The corresponding squared error is $e^2 = (-0.8 - (-0.8))^2 = 0$. This zero error immediately tells us that 1, 1, −1 is a much better choice of model parameters than 3, 2, 5.
What happens when we have multiple inputs, as during training a model? In equation 2.8 we have seen that, given the toy training dataset from Table 2.3, a simple weighted sum model with weights 3, 2 and bias 5 will generate the output vector

$$\vec{y} = \begin{bmatrix} 5.51 \\ 5.07 \\ 9.76 \\ 5.78 \\ 9.92 \\ 9.29 \\ 5.37 \\ 7.55 \\ 7.49 \\ 7.99 \\ 6.84 \\ 6.87 \\ 7.23 \\ 9.13 \\ 8.37 \end{bmatrix}$$

From Table 2.3 we also see that the target output vector is

$$\vec{\bar{y}} = \begin{bmatrix} -0.8 \\ -0.97 \\ 0.89 \\ -0.67 \\ 0.97 \\ 0.72 \\ -0.83 \\ 0.00 \\ 0.00 \\ 0.00 \\ -0.09 \\ -0.22 \\ -0.16 \\ 0.63 \\ 0.37 \end{bmatrix}$$
The differences between the model outputs and the targets over the entire training set can be expressed as a vector:

$$\vec{y} - \vec{\bar{y}} = \begin{bmatrix} 5.51 \\ 5.07 \\ 9.76 \\ 5.78 \\ 9.92 \\ 9.29 \\ 5.37 \\ 7.55 \\ 7.49 \\ 7.99 \\ 6.84 \\ 6.87 \\ 7.23 \\ 9.13 \\ 8.37 \end{bmatrix} - \begin{bmatrix} -0.8 \\ -0.97 \\ 0.89 \\ -0.67 \\ 0.97 \\ 0.72 \\ -0.83 \\ 0.00 \\ 0.00 \\ 0.00 \\ -0.09 \\ -0.22 \\ -0.16 \\ 0.63 \\ 0.37 \end{bmatrix} = \begin{bmatrix} 6.31 \\ 6.04 \\ 8.87 \\ 6.45 \\ 8.95 \\ 8.57 \\ 6.20 \\ 7.55 \\ 7.49 \\ 7.99 \\ 6.93 \\ 7.09 \\ 7.39 \\ 8.50 \\ 8.00 \end{bmatrix}$$
We can square the individual elements of the difference vector to obtain a squared error vector. However, to get a proper feel for the overall error during training, we would like to obtain a single number. What we would really like to do is to square each element of the difference vector and then add those squares to yield a single number. Recalling equation 2.10, this is exactly what happens if we take the dot product of the difference vector with itself. That happens to be the definition of the squared magnitude, or length, or L2 norm, of a vector: the dot product of the vector with itself. In the above example, this single number quantifies the overall training error for the parameter choice $w_0 = 3$, $w_1 = 2$, $b = 5$.
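A minimal NumPy sketch of this computation, reusing the model outputs and targets above (variable names assumed):

import numpy as np

y = np.array([5.51, 5.07, 9.76, 5.78, 9.92, 9.29, 5.37, 7.55,
              7.49, 7.99, 6.84, 6.87, 7.23, 9.13, 8.37])        # model outputs (equation 2.8)
y_bar = np.array([-0.8, -0.97, 0.89, -0.67, 0.97, 0.72, -0.83, 0.0,
                  0.0, 0.0, -0.09, -0.22, -0.16, 0.63, 0.37])   # targets from Table 2.3

diff = y - y_bar
total_squared_error = np.dot(diff, diff)   # squared L2 norm: dot product of the vector with itself
print(total_squared_error)                 # a single number summarizing the training error
print(np.linalg.norm(diff))                # the L2 norm itself (square root of the above)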