Demystifying Deep Learning
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
Douglas J. Santry
University of Kent, United Kingdom
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have
changed or disappeared between when this work was written and when it is read. Neither the
publisher nor authors shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
Contents
1 Introduction 1
1.1 AI/ML – Deep Learning? 5
1.2 A Brief History 6
1.3 The Genesis of Models 9
1.3.1 Rise of the Empirical Functions 9
1.3.2 The Biological Phenomenon and the Analogue 13
1.4 Numerical Computation – Computer Numbers Are Not ℝeal 14
1.4.1 The IEEE 754 Floating Point System 15
1.4.2 Numerical Coding Tip: Think in Floating Point 18
1.5 Summary 20
1.6 Projects 21
4 Training Classifiers 73
4.1 Backpropagation for Classifiers 73
4.1.1 Likelihood 74
4.1.2 Categorical Loss Functions 75
4.2 Computing the Derivative of the Loss 77
4.2.1 Initiate Backpropagation 80
4.3 Multilabel Classification 81
4.3.1 Binary Classification 82
4.3.2 Training A Multilabel Classifier ANN 82
4.4 Summary 84
4.5 Projects 85
9 Vistas 175
9.1 The Limits of ANN Learning Capacity 175
9.2 Generative Adversarial Networks 177
9.2.1 GAN Architecture 178
9.2.2 The GAN Loss Function 180
9.3 Reinforcement Learning 183
9.3.1 The Elements of Reinforcement Learning 185
9.3.2 A Trivial RL Training Algorithm 187
9.4 Natural Language Processing Transformed 193
9.4.1 The Challenges of Natural Language 195
9.4.2 Word Embeddings 195
9.4.3 Attention 198
9.4.4 Transformer Blocks 200
9.4.5 Multi-Head Attention 204
9.4.6 Transformer Applications 205
9.5 Neural Turing Machines 207
9.6 Summary 210
9.7 Projects 210
Glossary 221
References 229
Index 243
Acronyms
AI artificial intelligence
ANN artificial neural network
BERT bidirectional encoder representations from transformers
BN Bayesian network
BPG backpropagation
CNN convolutional neural network
CNN classifying neural network
DL deep learning
FFFC feed forward fully connected
GAN generative adversarial network
GANN generative artificial neural network
GPT generative pre-trained transformer
LLM large language model
LSTM long short term memory
ML machine learning
MLE maximum likelihood estimator
MSE mean squared error
NLP natural language processing
RL reinforcement learning
RNN recurrent neural network
SGD stochastic gradient descent
1 Introduction
Interest in deep learning (DL) is increasing every day. It has escaped from the
research laboratories and become a daily fact of life. The achievements and poten-
tial of DL are reported in the lay news and form the subject of discussion at dinner
tables, cafes, and pubs across the world. This is an astonishing change of fortune
considering the technology upon which it is founded was pronounced a research
dead end in 1969 (131) and largely abandoned.
The universe of DL is a veritable alphabet soup of bewildering acronyms. There
are artificial neural networks (ANNs), RNNs, LSTMs, CNNs, generative adversarial
networks (GANs), and more are introduced every day. The types and applica-
tions of DL are proliferating rapidly, and the acronyms grow in number with them.
As DL is successfully applied to new problem domains this trend will continue.
Since 2015 the number of artificial intelligence (AI) patents filed per annum has
been growing at a rate of 76.6% and shows no signs of slowing down (169). The
growth rate speaks to the increasing investment in DL and suggests that it is still
accelerating.
DL is based on ANNs. Often just "neural networks" is written, with the "artificial"
implied. ANNs attempt to mathematically model biological assemblies of neurons.
The initial goal of research into ANNs was to realize AI in a computer. The motiva-
tion and means were to mimic the biological mechanisms of cognitive processes in
animal brains. This led to the idea of modeling the networks of neurons in brains.
If biological neural networks could be modeled accurately with mathematics, then
computers could be programmed with the models. Computers would then be able
to perform tasks that were previously thought only possible by humans; the dream
of the electronic brain was born (151). Two problem domains were of particular
interest: natural language processing (NLP), and image recognition. These were
areas where brains were thought to be the only viable instrument; today, these
applications are only the tip of the iceberg.
Figure 1.1 Examples of GAN-generated cats. The matrix on the left contains examples
from the training set. The matrix on the right contains GAN-generated cats. The cats on
the right do not exist; they were generated by the GAN. Source: Karras et al. (81).
example of a generative language model. Images and videos can also be generated.
A GANN can draw an image; this is very different from learning to recognize
an image. A powerful means of building GANNs is with GANs (50); again, very
much an alphabet soup. As an example, a GAN can be taught impressionist
painting by training it with pictures by the impressionist masters. The GAN will
then produce a novel painting very much in the genre of impressionism. The
quality of the images generated is remarkable. Figure 1.1 displays an example
of cats produced by a GAN (81). The GAN was trained to learn what cats look
like and produce examples. The object is to produce photorealistic synthetic
cats. Products such as Adobe Photoshop have included this facility for general
use by the public (90). In the sphere of video and audio, GANs are producing
the so-called “deep fake” videos that are of very high quality. Deep fakes are
becoming increasingly difficult for humans to detect. In the age of information
war and disinformation, the ramifications are serious. GANs are performing
tasks at levels undreamt of a few decades ago; the quality can be striking, and even
troubling. As new applications are identified for GANs the resources dedicated
to improving them will continue to grow and produce ever more spectacular
results.
3 The word, learn, is in bold in Mitchell’s text. The author clearly wished to emphasize the
nature of the exercise.
1930s, Alonzo Church had described his Lambda Calculus model of computation
(21), and his student, Alan Turing, had defined his Turing Machine4 (152), both
formal models of computation. The age of modern computation was dawning.
Warren McCulloch and Walter Pitts wrote a number of papers that proposed arti-
ficial neurons to simulate Turing machines (164). Their first paper was published
in 1943. They showed that artificial neurons could implement logic and arithmetic
functions. Their work hypothesized networks of artificial neurons cooperating to
implement higher-level logic. They did not implement or evaluate their ideas, but
researchers had now begun thinking about artificial neurons.
Donald Hebb, an eminent psychologist, wrote a book in 1949 postulating a learn-
ing rule for artificial neurons (65). It is an unsupervised learning rule. While the rule
itself is numerically unstable, the rule contains many of the ingredients of modern
ANNs. Hebb's neurons computed state based on the scalar product and weighted
the connections between the individual neurons. Connections between neurons
were reinforced based on use. While modern learning rules and network topolo-
gies are different, Hebb’s work was prescient. Many of the elements of modern
ANNs are recognizable such as a neuron’s state computation, response propaga-
tion, and a general network of weighted connections.
The next step to modern ANNs was Frank Rosenblatt’s perceptron (130). Rosen-
blatt published his first paper in 1958. Building on Hebb’s neuron, he proposed an
updated supervised learning rule called the perceptron rule. Rosenblatt was inter-
ested in computer vision. His first implementation was in software on an IBM 704
mainframe (it had 18 k of memory!). Perceptrons were eventually implemented in
hardware. The machine was a contemporary marvel fitted with an array of 20 × 20
cadmium sulfide photocells used to create a 400 pixel input image. The New York
Times reported it with the headline, “Electronic Brain Teaches Itself.” Hebb’s neu-
ron state was improved with the introduction of a bias, an innovation still very
important today. Perceptrons were capable of learning linear decision boundaries,
that is, the categories of classification had to be linearly separable.
The next milestone was a paper by Widrow and Hoff in 1960 that proposed a
new learning rule, the delta rule. It was more numerically stable than the percep-
tron learning rule. Their research system was called ADALINE (15) and used least
squares to train the network. Like Rosenblatt’s early work, ADALINE was imple-
mented in hardware with memistors. The follow-up system, MADALINE (163),
included multiple layers of perceptrons, another step toward modern ANNs. It
suffered from a similar limitation as Rosenblatt’s perceptrons in that it could only
address linearly separable problems; it was a composition of linear classifiers.
In 1969, Minsky and Papert published a book that cast a pall over ANN
research (106). They demonstrated that ANNs, as they were understood at that
point, suffer from an inherent limitation. It was argued that ANNs could never
solve “interesting” problems; but the assertion was based on the assumption
that ANNs could never practically handle nonlinear decision boundaries. They
famously used the example of the XOR logic gate. As the XOR truth table could
not be learnt by an ANN, and XOR is a trivial concept when compared to image
recognition and other applications, they concluded that ANNs were not appropriate
for the latter applications. As most interesting problems are nonlinear, including
vision and NLP, they concluded that the ANN was a research dead end. Their
book had the effect of chilling research in ANNs for many years as the AI com-
munity accepted their conclusion. It coincided with a general reassessment of the
practicality of AI research in general and the beginning of the first “AI Winter.”
The fundamental problem facing ANN researchers was how to train multiple
layers of an ANN to solve nonlinear problems. While there were multiple inde-
pendent developments, Rumelhart, Hinton, and Williams are generally credited
with the work that described the backpropagation of error algorithm in the context
of training ANNs (34). This was published in 1986, and backpropagation of error
remains the basis of the majority of modern ANN training algorithms. Their
method provided a means of training ANNs to learn
nonlinear problems reliably.
It was also in 1986 that Rina Dechter coined the term, “Deep Learning” (30).
The usage was not what is meant by DL today. She was describing a backtracking
algorithm for theorem proving with Prolog programs.
The confluence of two trends, the dissemination of the backpropagation algo-
rithm and the advent of widely available workstations, led to unprecedented exper-
imentation and advances in ANNs. By 1989, in a space of just 3 years, ANNs had
been successfully trained to recognize hand-written digits in the form of postal
codes from the United States Postal Service. This feat was achieved by a team led
by Yann LeCun at AT&T Labs (91). The work had all the recognizable features of
DL, but the term had not yet been applied to neural networks in that sense. The
system would evolve into LeNet-5, a classic DL model. The renewed interest in
ANN research has continued unbroken down to this day. In 2006, Hinton et al.
described a multi-layered belief network that was described as a “Deep Belief Net-
work,” (67). The usage arguably led to referring to deep neural networks as DL.
The introduction of AlexNet in 2012 demonstrated how to efficiently use GPUs
to train DL models (89). AlexNet set records in image recognition benchmarks.
Since AlexNet, DL models have dominated most machine learning applications; it
has heralded the DL age of machine learning.
We leave our abridged history here and conclude with a few thoughts. As
the computing power required to train ANNs grows ever cheaper, access to
the resources required for research becomes more widely available. The IBM
Supercomputer, ASCI White, cost US$110 million in 2001 and occupied a
special-purpose room. It had 8192 processors for a total of 123 billion transistors with a
peak performance of 12.3 TFLOPS.5 In 2023, an Apple Mac Studio costs US$4000,
contains 114 billion transistors, and offers peak performance of 20 TFLOPS. It sits
quietly and discreetly on a desk. In conjunction with improvements in hardware,
there is a change in the culture of disseminating results. The results of research
are proliferating in an ever more timely fashion.6 The papers themselves are also
recognizing that describing the algorithms is not the only point of interest. Papers
are including experimental methodology and setup more frequently, making
it easier to reproduce results. This is made possible by ever cheaper and more
powerful hardware. Clearly, the DL boom has just begun.
ẍ = g, (1.2)

which can be integrated once to give

ẋ = ∫ g dt = gt, (1.3)

which in turn can be integrated to produce the desired model, t = f(h):

h ≡ x = (g/2) t² ⟹ t = √(2h/g) = f(h). (1.4)
This yields an analytical solution obtained from the constraint, which was
obtained from a natural law. Of course this is a very trivial example, and often an
analytical solution is not available. Under those circumstances, the modeler must
resort to numerical methods to solve the equations, but it illustrates the historical
approach.
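As a concrete aside (not from the text), the short Python sketch below evaluates the analytical model t = f(h) = √(2h/g) for a few drop heights; the value of g and the sample heights are assumptions chosen only for illustration.

import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant; air resistance ignored)

def drop_time(height_m: float) -> float:
    """Analytical model t = f(h) = sqrt(2h / g) for an object dropped from rest."""
    return math.sqrt(2.0 * height_m / G)

for h in (1.0, 10.0, 100.0):  # illustrative heights in meters
    print(f"h = {h:6.1f} m  ->  t = {drop_time(h):.3f} s")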
Figure 1.2 The graph of t = f(h) is plotted with empty circles as points; the vertical axis
shows Δt (s). The ANN(h)'s predictions are crosses plotted over them. The points are
from the training dataset. The ANN seems to have learnt the function.
the use of ANNs to make predictions that are more accurate (22). There are many
more examples.
None of this is to say that DL software is inappropriate for use or not fit for
purpose, quite the contrary, but it is important to have some perspective on the
nature of the simulation and the fundamental differences.
8 The “C” language types float and double often correspond with the 32-bit and 64-bit types
respectively, but it is not a language requirement. Python’s float is the double-precision type
(64-bit).
9 The oldest number system that we know about, Sumerian (c. 3,000 BC), was sexagesimal, base
60, and survives to this day in the form of minutes and seconds.
1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1
Figure 1.3 IEEE-754 representation of the value −4.050000 ⋅ 10⁻⁸. The exponent is
biased and centered about 127, and the mantissa is assumed to have a leading "1."
The integers are straightforward, but representing real numbers requires more
elaboration. The correct root was written in scientific notation, −4.050000 ⋅ 10⁻⁸.
There are three distinct elements in this form of a number. The mantissa, or sig-
nificand, is the sequence of significant digits, and its length is the precision, 8 in
this case. It is written with a single digit to the left of the decimal point and mul-
tiplied to obtain the correct order of magnitude. This is done by raising the radix,
10 in this case, to the power of the exponent, −8. The IEEE-754 format for 32-bit
floating point values encodes these values to represent a number, see Figure 1.3.
So what can be represented with this system?
Consider a decimal number, abc.def; each position represents an order of mag-
nitude. For decimal numbers, the positions represent:

a ⋅ 100 + b ⋅ 10 + c ⋅ 1 + d ⋅ 1/10 + e ⋅ 1/100 + f ⋅ 1/1000,

while the binary number abc.def represents:

a ⋅ 4 + b ⋅ 2 + c ⋅ 1 + d ⋅ 1/2 + e ⋅ 1/4 + f ⋅ 1/8.
Some floating point examples are $0.5_{10} = 0.1_2$ and $0.125_{10} = 0.001_2$. So far so
good, but what of something "simple" such as $0.1_{10}$? Decimal 0.1 is represented
as $0.0\overline{0011}_2$, where the bar denotes the sequence that is repeated ad infinitum. This
can be written in scientific notation as $1.\overline{1001}_2 \cdot 2^{-4}$. Using the 32-bit IEEE encod-
ing, its representation is 00111101110011001100110011001101. The first bit is the
sign bit. The following 8 bits form the exponent and the remaining 23 bits com-
prise the mantissa. There are two seeming mistakes. First, the exponent is 123. For
efficiently representing normal numbers, the IEEE exponent is biased, that is, cen-
tered about 127 ∶ 127 − 4 = 123. The second odd point is in the mantissa. As the
first digit is always one it is implied, so the encoding of the mantissa starts at the
first digit to the right of the first 1 of the binary representation, so effectively there
can be 24 bits of precision. The programmer does not need to be aware of this – it all
happens automatically in the implementation and the computer language (such
as C++ and Python). Converting IEEE 754 back to decimal, we get 0.100000001.¹⁰
10 What is 10% of a billion dollars? This is a sufficiently serious problem for financial software
that banks often use specialized routines to ensure that the money is correct.
Observe that even a simple number like 1/10 cannot be represented exactly in the
IEEE 32-bit format, just as 1/3 is difficult for decimal: $0.\overline{3}_{10}$.
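The bit-level encoding discussed above can be inspected directly. The Python sketch below is an illustrative aside (the helper name is ours, not the book's): it packs 0.1 into a 32-bit float and prints the sign, biased exponent, and mantissa fields.

import struct

def float32_bits(x: float) -> str:
    """Return the IEEE-754 single-precision bit pattern of x as a 32-character string."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return format(as_int, "032b")

bits = float32_bits(0.1)
print(bits)               # 00111101110011001100110011001101
print(bits[0])            # sign bit: 0 (positive)
print(int(bits[1:9], 2))  # biased exponent: 123, i.e. 127 - 4
print(bits[9:])           # 23-bit mantissa; the leading 1 is implied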
Let 𝔽 ⊂ ℝ be the set of IEEE single-precision floating point numbers. Being
finite, 𝔽 will have minimum and maximum elements. They are 1.17549435E-38
and 3.40282347E+38, respectively. Any operation that strays outside of that range
will not have a result. Values less than the minimum are said to underflow, and
values that exceed the maximum are said to overflow. Values within the supported
range are known as normal numbers. Even within the allowed range, the set is
not continuous and a means of mapping values from ℝ onto 𝔽 is required, that
is, we need a function, fl(x): ℝ → 𝔽, and the means prescribed by the IEEE 754
standard is rounding.
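A quick way to see the edges of 𝔽 is to push a single-precision value past them. The NumPy sketch below is a minimal illustration; the constants are simply the normal-range limits quoted above.

import numpy as np

big = np.float32(3.40282347e38)    # near the maximum normal float32 value
print(big * np.float32(2.0))       # overflows to inf (NumPy warns about the overflow)

tiny = np.float32(1.17549435e-38)  # near the minimum normal float32 value
print(tiny * np.float32(1e-10))    # underflows: the product rounds to 0.0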
All IEEE arithmetical operations are performed with extra bits of precision.
This ensures that a computation will produce a value that is within the error
bounds. The result is rounded to the nearest element of 𝔽, with ties going
to the even value. Specifying ties may appear to be overly prescriptive, but
deterministic computational results are very important. IEEE offers 4 rounding
modes,¹¹ but rounding to the nearest value in 𝔽 is usually the default. Rounding
error is subject to precision. Given a real number, 1.0, what is the next largest
number? There is no answer. There is an answer for floating point numbers,
and this gap is the machine epsilon, or unit roundoff. For the double-precision
IEEE-754 standard, the machine epsilon is 2.2204460492503131e-16. The width
of a proton is 8.83e-16 m, so this is quite small (computation at that scale would
choose a more appropriate unit than meters, such as Angstroms, but this does
demonstrate that double precision is very useful). The machine epsilon gives the
programmer an idea of the error when results are rounded. Denote the machine
epsilon as u. The rounding error is |fl(x) − x| ≤ ½u. This quantity can be used to
calculate rigorous bounds on the accuracy of computations and algorithms when
required.
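To make the machine epsilon tangible, the sketch below (an illustrative aside) halves a candidate value until adding it to 1.0 no longer changes the result, and compares the answer with the value the Python platform reports.

import sys

# Find the double-precision machine epsilon: the gap between 1.0 and the
# next representable number.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

print(eps)                     # 2.220446049250313e-16
print(sys.float_info.epsilon)  # the platform's reported machine epsilon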
Let us revisit our computation of the smallest root, which was done in double
precision.
p was very close to the result of the square root. For our values of p and
q, p ≈ √(p² + q), and so performing p − √(p² + q) (p minus almost p) canceled out all of
the information in the result. This effect is known as catastrophic cancellation,
but it is widely misunderstood. A common misconception is that it is a bad
practice to subtract floating point numbers that are close together, or add floating
point numbers that are far apart, but that is not necessarily true. In this case, the
subtraction has merely exposed an earlier problem that no longer has anywhere
to hide. The square root is 12,345,678.000000040500003, but it was rounded to
The choice of 𝜖 will be a suitably small value that the application can toler-
ate. In general, comparison of a computed floating point value should be done
as abs(𝛼 − x), where x is the computed quantity and 𝛼 is the quantity that is being
tested for. Note that printing the contents of I to the screen may appear to produce
the exact values of zero and one, but print format routines, the routines whose job
is to convert IEEE variables to a printable decimal string, do all kinds of things and
often mislead.
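In code, the comparison pattern described here looks like the following sketch; the helper name and the tolerance value are illustrative choices, not prescribed by the text.

def close_enough(alpha: float, x: float, eps: float = 1e-9) -> bool:
    """Test a computed value x against the target alpha within a tolerance eps."""
    return abs(alpha - x) < eps

total = sum([0.1] * 10)          # accumulates rounding error
print(total == 1.0)              # often False: exact equality is unreliable
print(close_enough(1.0, total))  # True: the result is within tolerance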
A great contribution of the IEEE system is its quality as a progressive system.
Many computer architectures used to treat floating point errors, such as division
by zero, as terminal. The program would be aborted. IEEE systems continue to
make progress following division by zero and simply record the error condition in
the result. Division by zero and √−1 result in the special nonnormal value "not a
number" (NaN) being recorded in the result. Overflow and underflow are also recorded
as the two special values, ±∞ (if used they will produce a NaN). These error
values need to be detected because if they are used, then the error will contaminate
all further use of the tainted value. Often there is a semantic course of action that
can be adopted to fix the error before proceeding. They are notoriously difficult to
1.5 Summary
1.6 Projects
In Chapter 1, it was stated that deep learning (DL) models are based on artificial
neural networks (ANNs). In this chapter, deep learning will be defined more pre-
cisely, though the definition remains quite loose. This will be done by connecting
deep learning to ANNs more concretely. It was also claimed that ANNs can be
interpreted as programmable functions. In this chapter, we describe what those
functions look like and how ANNs compute values. Like a function, an ANN accepts inputs and com-
putes an output. How an ANN turns the input into the output is detailed. We also
introduce the notation and abstractions that we use in the rest of the text.
A deep learning model is a model built with an artificial neural network, that is,
a network of artificial neurons. The neurons are perceptrons. Networks require a
topology. The topology is specified as a hyperparameter to the model. This includes
the number of neurons and how they are connected. The topology determines how
information flows in the resulting ANN configuration and some of the properties
of the ANN. In broad terms, ANNs have many possible topologies, and indeed
there are an infinite number of possibilities. Determining good topologies for a
particular problem can be challenging; what works in one problem domain may
not (probably will not) work in a different domain. ANNs have many applications
and come in many types and sizes. They can be used for classification, regression,
and even generative purposes such as producing a picture. The different domains
often have different topologies dictated by the application. Most ANN applications
employ domain-specific techniques and adopt trade-offs to produce the desired
result. For every application, there are a myriad of ANN configurations and param-
eters, and what works well for one application may not work for others. If this
seems confusing – it is (107; 161). A good rule of thumb is to keep it as simple as
possible (93).
This section presents the rudiments of ANN topology. One topology in particu-
lar, the feed-forward and fully-connected (FFFC) topology, is adopted. There is
no loss of generality as all the principles and concepts presented still apply to
other topologies. The focus on a simple topology lends itself to clearer explana-
tions. To make matters more concrete, we begin with a simple example presented
in Figure 2.1. We can see that an ANN is comprised of neurons (nodes), connec-
tions, and many numbers. Observing the figure, it is clear that we can interpret an
ANN as a directed graph, G(N, E). The nodes, or vertices, of the graph are neurons.
Neurons that communicate immediately are connected with a directed edge. The
direction of the edge determines the flow of the signal.
The nodes in the graph are neurons. The neuron abstraction is at the heart of
the ANN. Neurons in ANNs are generally perceptrons. Information, that is, sig-
nals, flow through the network along the directed edges through the neurons. The
arrow indicates that a signal is coming from a source neuron and going to a target
Figure 2.1 A trained ANN that has learnt the sine function. The circles, graph nodes, are
neurons. The arrows on the edges determine which direction the communication
flows.
neuron. There are no rules respecting the number of edges. Most neurons in an
ANN are both a source and a target. Any neuron that is not a source is an output
neuron. Any neuron that is not a target is an input neuron. Input neurons are the
entry point to the graph, and the arguments to the ANN are supplied there. Out-
put neurons provide the final result of the computation. Thus, given ŷ = ANN(x),
x goes to the input neurons and ŷ is read from the output neurons. In the example,
x is the angle, and ŷ is the sine of x.
Each neuron has an independent internal state. A neuron’s state is computed
from the input signals from connected source neurons. The neuron computes
its internal state to build its own signal, also known as a response. This internal
state is then propagated in turn through the directed edges to its target neurons.
The inceptive stage for the computation is the provision of the input arguments
to the input neurons of the ANN. The input neurons compute their states and
propagate them to their target neurons. This is the first signal that triggers the
rest of the activity. The signals propagate through the network, forcing neurons
to update state along the way, until the signal reaches the output neurons, and
then the computation is finished. Only the state in the output neurons, that is,
the output neurons’ responses, matter to an application as they comprise the
“answer,” ŷ .
Neurons are connected with edges. The edges are weighted by a real number.
The weights determine the behavior of the signals as received by the target
neuron – they are the crux of an ANN. It is the values of the weights that
determine whether an ANN is sine, cosine, eˣ – whatever the desired function.
The weights in Figure 2.1 make the ANN sine. The ANN in Figure 2.2 is a cosine.
The graphs in Figure 2.3 present their respective plots for 32 random points. Both
ANNs have the same topologies, but they have very different weights. It is the
weights that determine what an ANN computes. The topology can be thought
of as supporting the weights by ensuring that there is a sufficient number of them
to solve a problem; this is called the learning capacity of an ANN.
The weights of an ANN are the parameters of the model. The task of training
a neural network is determining the weights, w. This is reflected in the notation
ŷ = ANN(x; w) or ŷ = ANN(x|w) where ŷ is conditioned on the vector of parame-
ters, w. Given a training set, the act of training an ANN is reconciling the weights
in a model with the examples in the training set. The fitted weights should then
produce a model that emits the correct output for a given input. Training sets con-
sisting of sine and cosine produced the ANNs in the trigonometric examples,
respectively. Both sine and cosine are continuous functions. As such building
models for them are examples of regression, we are explaining observed data from
the past to make predictions in the future.
The graph of an ANN can take many forms. Without loss of generality, but
for clarity of exposition, we choose a particular topology, the fully-connected
Figure 2.2 A trained ANN that has learnt the cosine function. The only differences with
the sine model are the weights. Both ANNs have the same topologies.
Figure 2.3 The output of two ANNs is superimposed on the ground truth for the trained
sine and cosine ANN models. The ANN predictions are plotted as crosses. The empty
circles are taken from the training set and comprise the ground truth.
feed-forward architecture, as the basis for all the ANNs that we will discuss.
The principles are the same for all ANNs, but the simplicity of the feed-forward
topology is pedagogically attractive.
Information in the feed-forward ANN flows acyclically from a single input layer,
through a number of hidden layers, and finally to a single output layer; the signals
are always moving forward through the graph. The ANN is specified as a set of
layers. Layers are sets of related, peer neurons. A layer is specified as the number
of neurons that it contains, the layer’s width. The number of layers in an ANN is
referred to as its depth. All the neurons in a layer share source neurons, specified
as a layer, and target neurons, again, specified as a layer. All of a layer’s source
neurons form a layer as do its target neurons. There are no intralayer connections.
In the language of graph theory, isolating a layer produces a tripartite graph. Thus,
a layer is sandwiched between a shallower, source neuron layer, and a deeper target
layer.
The set of layers can be viewed as a stack. Consider the topology in Figure 2.1: with
respect to the stack analogy, the input layer is the top, or the shallowest layer, and
the output layer is the bottom, or the deepest layer. The argument is supplied to
the input layer and the answer read from the output layer. The layers between the
input and output layers are known as hidden layers.
It is the presence of hidden layers that characterizes an ANN as a deep learning
ANN. There is no consensus on how many hidden layers are required to qualify as
deep learning, but the loosest definition is at least 1 hidden layer. A single hidden
layer does not intuitively seem very deep, but its existence in an ANN does put
it in a different generation of model. Rosenblatt’s original implementations were
single layers of perceptrons, but he speculated on deeper arrangements in his book
(130). It was not clear what value multiple layers of perceptrons had, given his
linear training methods. Modern deep learning models of 20+ hidden layers are
common, and they continue to grow deeper and wider.
The process of computing the result of an ANN begins with supplying the argu-
ment to the function at the input layer. Every input neuron in the input layer
receives a copy of the full ANN argument. Once every neuron in the input layer
has computed its state with the arguments to the ANN, the input layer is ready
to propagate the result to the next layer. As the signals percolate through the ANN,
each layer accepts its source signals from the previous layer, computes the new
state, and then propagates the result to the next layer. This continues until the
final layer, the output layer, is reached. The output layer contains the result of
the ANN.
To further simplify our task, we specify that the feed-forward ANN is fully con-
nected, sometimes also called dense. At any given layer, every neuron is connected
to every source neuron in the shallower layer; recursively, this implies that every
neuron in a given layer is a source neuron for the next layer.
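To make the mechanics concrete, the NumPy sketch below propagates a signal through a small fully-connected feed-forward network with the 3, 2, 1 layer widths of the running example. The random weights, the biases, and the tanh state computation are illustrative assumptions; the text develops the actual neuron state computation in the next section.

import numpy as np

rng = np.random.default_rng(0)
widths = [1, 3, 2, 1]  # scalar input followed by layer widths 3, 2, 1
weights = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.normal(size=m) for m in widths[1:]]

def ann(x: float) -> np.ndarray:
    """Feed the argument forward, layer by layer, to the output layer."""
    signal = np.atleast_1d(x)
    for W, b in zip(weights, biases):
        signal = np.tanh(W @ signal + b)  # each layer's response, propagated deeper
    return signal

print(ann(0.5))  # the response of the single output neuron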
Figure 2.4 The sine model in more detail. The layers are labeled. The left is shallowest
and the right is the deepest. Every neuron in a layer is fully connected with its shallower
neighbor. This ANN is specified by the widths of its layers, 3, 2, 1, so the ANN
has a depth of 3.
Let us now reexamine the ANN implementing sine in Figure 2.4 in terms of
layers. We see that there are 3 layers. The first layer, the input layer, has 3 neu-
rons. The input layer accepts the predictors, the inputs of the model. The hidden
layer has 2 neurons, and the output layer has one neuron. There can be as many
hidden layers as desired, and they can also be as wide as needed. The depths and
the widths are the hyperparameters of the model. The number of layers and their
widths should be kept as limited as possible (93). As the number of weights,
that is, trainable parameters, grows, the size of the ANN increases exponentially. Too
many weights also leads to other problems that will be examined in later chapters.
As sine is a scalar function, there can only be one output neuron in the ANN's
output layer; that is where the answer (sine) can be found. Notice, however, that
the number of input neurons is not similarly constrained. Sine has only one pre-
dictor (argument), but observe that there can be any number of neurons in the
input layer. They will each receive a copy of the arguments.
The mechanics of signal propagation form the basis of how ANNs compute.
Having seen how the signals flow in a feed-forward ANN, it remains to examine