Demystifying Deep Learning
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board


Sarah Spurgeon, Editor in Chief

Jón Atli Benediktsson, Anjan Bose, James Duncan, Amin Moeness, Desineni Subbaram Naidu, Behzad Razavi, Jim Lyke, Hai Li, Brian Johnson, Ahmet Murat Tekalp, Jeffrey Reed, Diomidis Spinellis, Adam Drobot, Tom Robertazzi
Demystifying Deep Learning

An Introduction to the Mathematics of Neural Networks

Douglas J. Santry
University of Kent, United Kingdom
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.


Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have
changed or disappeared between when this work was written and when it is read. Neither the
publisher nor authors shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data Applied for:

Hardback ISBN: 9781394205608

Cover Design: Wiley


Cover Image: © Yuichiro Chino/Getty Images

Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India



Contents

About the Author ix


Acronyms x

1 Introduction 1
1.1 AI/ML – Deep Learning? 5
1.2 A Brief History 6
1.3 The Genesis of Models 9
1.3.1 Rise of the Empirical Functions 9
1.3.2 The Biological Phenomenon and the Analogue 13
1.4 Numerical Computation – Computer Numbers Are Not ℝeal 14
1.4.1 The IEEE 754 Floating Point System 15
1.4.2 Numerical Coding Tip: Think in Floating Point 18
1.5 Summary 20
1.6 Projects 21

2 Deep Learning and Neural Networks 23


2.1 Feed-Forward and Fully-Connected Artificial Neural Networks 24
2.2 Computing Neuron State 29
2.2.1 Activation Functions 29
2.3 The Feed-Forward ANN Expressed with Matrices 31
2.3.1 Neural Matrices: A Convenient Notation 32
2.4 Classification 33
2.4.1 Binary Classification 34
2.4.2 One-Hot Encoding 36
2.4.3 The Softmax Layer 38
2.5 Summary 39
2.6 Projects 40

3 Training Neural Networks 41


3.1 Preparing the Training Set: Data Preprocessing 42
3.2 Weight Initialization 45
3.3 Training Outline 47
3.4 Least Squares: A Trivial Example 49
3.5 Backpropagation of Error for Regression 51
3.5.1 The Terminal Layer (Output) 54
3.5.2 Backpropagation: The Shallower Layers 57
3.5.3 The Complete Backpropagation Algorithm 61
3.5.4 A Word on the Rectified Linear Unit (ReLU) 62
3.6 Stochastic Sine 64
3.7 Verification of a Software Implementation 66
3.8 Summary 70
3.9 Projects 71

4 Training Classifiers 73
4.1 Backpropagation for Classifiers 73
4.1.1 Likelihood 74
4.1.2 Categorical Loss Functions 75
4.2 Computing the Derivative of the Loss 77
4.2.1 Initiate Backpropagation 80
4.3 Multilabel Classification 81
4.3.1 Binary Classification 82
4.3.2 Training A Multilabel Classifier ANN 82
4.4 Summary 84
4.5 Projects 85

5 Weight Update Strategies 87


5.1 Stochastic Gradient Descent 87
5.2 Weight Updates as Iteration and Convex Optimization 92
5.2.1 Newton’s Method for Optimization 93
5.3 RPROP+ 96
5.4 Momentum Methods 99
5.4.1 AdaGrad and RMSProp 100
5.4.2 ADAM 101
5.5 Levenberg–Marquardt Optimization for Neural Networks 103
5.6 Summary 108
5.7 Projects 109

6 Convolutional Neural Networks 111


6.1 Motivation 112
6.2 Convolutions and Features 113

6.3 Filters 117


6.4 Pooling 119
6.5 Feature Layers 120
6.6 Training a CNN 123
6.6.1 Flatten and the Gradient 123
6.6.2 Pooling and the Gradient 124
6.6.3 Filters and the Gradient 125
6.7 Applications 129
6.8 Summary 130
6.9 Projects 130

7 Fixing the Fit 133


7.1 Quality of the Solution 133
7.2 Generalization Error 134
7.2.1 Bias 134
7.2.2 Variance 135
7.2.3 The Bias-Variance Trade-off 136
7.2.4 The Bias-Variance Trade-off in Context 138
7.2.5 The Test Set 138
7.3 Classification Performance 140
7.4 Regularization 143
7.4.1 Forward Pass During Training 143
7.4.2 Forward Pass During Normal Inference 145
7.4.3 Backpropagation of Error 146
7.5 Advanced Normalization 148
7.5.1 Batch Normalization 149
7.5.2 Layer Normalization 154
7.6 Summary 156
7.7 Projects 157

8 Design Principles for a Deep Learning Training Library 159


8.1 Computer Languages 160
8.2 The Matrix: Crux of a Library Implementation 164
8.2.1 Memory Access and Modern CPU Architectures 165
8.2.2 Designing Matrix Computations 168
8.2.2.1 Convolutions as Matrices 170
8.3 The Framework 171
8.4 Summary 173
8.5 Projects 173

9 Vistas 175
9.1 The Limits of ANN Learning Capacity 175
9.2 Generative Adversarial Networks 177
9.2.1 GAN Architecture 178
9.2.2 The GAN Loss Function 180
9.3 Reinforcement Learning 183
9.3.1 The Elements of Reinforcement Learning 185
9.3.2 A Trivial RL Training Algorithm 187
9.4 Natural Language Processing Transformed 193
9.4.1 The Challenges of Natural Language 195
9.4.2 Word Embeddings 195
9.4.3 Attention 198
9.4.4 Transformer Blocks 200
9.4.5 Multi-Head Attention 204
9.4.6 Transformer Applications 205
9.5 Neural Turing Machines 207
9.6 Summary 210
9.7 Projects 210

Appendix A Mathematical Review 211


A.1 Linear Algebra 211
A.1.1 Vectors 211
A.1.2 Matrices 212
A.1.3 Matrix Properties 214
A.1.4 Linear Independence 215
A.1.5 The QR Decomposition 215
A.1.6 Least Squares 215
A.1.7 Eigenvalues and Eigenvectors 216
A.1.8 Hadamard Operations 216
A.2 Basic Calculus 217
A.2.1 The Product Rule 217
A.2.2 The Chain Rule 218
A.2.3 Multivariable Functions 218
A.2.4 Taylor Series 218
A.3 Advanced Matrices 219
A.4 Probability 219

Glossary 221
References 229
Index 243

About the Author

Douglas J. Santry, PhD, MSc, is a Lecturer in Computer Science at the University of Kent, UK. Prior to his current position, he worked extensively in industry at Apple Computer Corp., NetApp, and Goldman Sachs. At NetApp, he conducted research into embedded and real-time machine learning techniques.

Acronyms

AI artificial intelligence
ANN artificial neural network
BERT bidirectional encoder representations from transformers
BN Bayesian network
BPG backpropagation
CNN convolutional neural network
CNN classifying neural network
DL deep learning
FFFC feed forward fully connected
GAN generative adversarial network
GANN generative artificial neural network
GPT generative pre-trained transformer
LLM large language model
LSTM long short term memory
ML machine learning
MLE maximum likelihood estimator
MSE mean squared error
NLP natural language processing
RL reinforcement learning
RNN recurrent neural network
SGD stochastic gradient descent
1 Introduction

Interest in deep learning (DL) is increasing every day. It has escaped from the research laboratories and become a daily fact of life. The achievements and potential of DL are reported in the lay news and form the subject of discussion at dinner tables, cafes, and pubs across the world. This is an astonishing change of fortune considering the technology upon which it is founded was pronounced a research dead end in 1969 (131) and largely abandoned.

The universe of DL is a veritable alphabet soup of bewildering acronyms. There are artificial neural networks (ANNs), RNNs, LSTMs, CNNs, Generative Adversarial Networks (GANs), and more are introduced every day. The types and applications of DL are proliferating rapidly, and the acronyms grow in number with them. As DL is successfully applied to new problem domains, this trend will continue. Since 2015, the number of artificial intelligence (AI) patents filed per annum has been growing at a rate of 76.6% and shows no signs of slowing down (169). The growth rate speaks to the increasing investment in DL and suggests that it is still accelerating.

DL is based on ANNs. Often only "neural networks" is written, and the "artificial" is implied. ANNs attempt to mathematically model biological assemblies of neurons. The initial goal of research into ANNs was to realize AI in a computer. The motivation and means were to mimic the biological mechanisms of cognitive processes in animal brains. This led to the idea of modeling the networks of neurons in brains. If biological neural networks could be modeled accurately with mathematics, then computers could be programmed with the models. Computers would then be able to perform tasks that were previously thought only possible by humans; the dream of the electronic brain was born (151). Two problem domains were of particular interest: natural language processing (NLP) and image recognition. These were areas where brains were thought to be the only viable instrument; today, these applications are only the tip of the iceberg.

In the field of image recognition, DL has achieved spectacular results and, by some metrics, is out-performing humans. Image recognition is the task of finding and identifying objects in an image. DL has a better record than humans (13; 63) at recognizing the ImageNet (32) test suite, an important database of millions of photographs. Computer vision has become so reliable for some tasks that it is common for motor cars to offer features based on reliable computer vision, and in some cases, cars can even drive themselves. In airports and shopping malls, we are continually monitored by CCTV, but often it is a computer, not a human, performing the monitoring (39). Some CCTV monitors look for known thieves and automatically alert the staff, or even the local police, when they are spotted in a shop (165). This can lead to problems. When courts and the police do not understand how to interpret the results of the software, great injustices can follow.

One such example is that of Robert Julian-Borchak Williams (66). Mr. Williams' case is a cautionary tale. AI image recognition software is not evidence and does not claim to be. It is meant to point law enforcement in a promising direction of investigation; it is a complement to investigation, not a substitute. But too often the police assume the computer's hint is a formal allegation and treat it as such. Mr. Williams was accused by the police of a crime that he did not commit. The police were acting on information from AI image recognition software, but they were convinced because they did not understand what the computer was telling them. A computer suggested that the video of a shoplifter in a shop could be Mr. Williams. As a result, a warrant was obtained on the basis of the computer's identification. All the "safeguards," such as corroborating evidence, despite being formal policy of the police department, were ignored, and Mr. Williams had a nightmare visited upon him. He was arrested, processed, and charged with no effort on the part of the police to confirm the computer's suggestion. This scenario has grown so frequent that there are real concerns about the police and the courts using AI technology as an aid to their work. Subsequently, Amazon, IBM, and Microsoft withdrew their facial recognition software from police use pending federal regulation (59). DL, like any tool, must be used responsibly to provide the greatest benefit and mitigate harm.
DL ANNs have also made tremendous progress in the field of NLP. Natural language is how people communicate, such as English or Japanese. Computers are just elaborate calculators, and they have no capacity for inference or context; hence, people use programming languages to talk to computers. The current state-of-the-art NLP is based on transformers (155) (see Section 9.4 for details). Transformers have led to rapid progress in language models and NLP tools since 2017. Moreover, progress in NLP systems is outstripping the test suites. A popular language comprehension benchmark, the General Language Understanding Benchmark (GLUE) (158), was quickly mastered by research systems, leading to its replacement by SuperGLUE in the space of a year (159). SuperGLUE will soon be upgraded. Another important benchmark, the Stanford Question Answering Dataset 2.0 (SQuAD) (121), has also been mastered¹ and is anticipating an update to increase the challenge. The test suites are currently too easy for modern NLP systems. This is impressive, as the bar was not set low per se: DL ANNs are, on average, outperforming humans in both test suites, so it can be argued that the test suites are genuinely challenging.

¹ The leaderboard shows 90% is now a common score: https://rajpurkar.github.io/SQuAD-explorer. The human score is 89%.
Of particular note is OpenAI's ChatGPT; it has dazzled the world (128). The author recently had to change the questions for his university course assignments because the students were using ChatGPT to produce complete answers. Because ChatGPT can understand English, some students were cutting and pasting the question, in plain English, into the ChatGPT prompt and doing the reverse with the response. ChatGPT is able to produce Python code that is correct. The irony of students using DL to cheat on AI coursework was not lost on him.

A lot of the debate surrounding ChatGPT has centered on its abilities, what it can and cannot do reliably, but such debates miss the point. The true import of ChatGPT is not what it can do today. ChatGPT is not perfect, and its creators never claimed it was; far from it. The free version used by most of the world was made available to aid in identifying and fixing problems. ChatGPT is a point in a trend. The capabilities of ChatGPT today are not important; the real point is the implication of what language models will be capable of in five to ten years. The coming language models will clearly be extremely powerful. Businesses and professions that think they are safe because ChatGPT is not perfect are taking terrible risks. There is a misconception that it is low-skilled jobs that will experience the most change, and that the professions will remain untouched as they have been for decades. This is a mistake. The real application of DL is not in low-skilled jobs. Factories and manufacturing were already disrupted starting in the 1970s with the introduction of automation. DL is going to make the professions, such as medicine and law, more productive. It is the high-skilled jobs that are going to experience the most disruption. A study by OpenAI examining the potential of its language models suggested that up to 80% of the US workforce would experience some form of change resulting from language models (38). This may be a conservative estimate.
Perhaps one of the most interesting advances of DL is the emergence of systems that produce meaningful content. The systems mentioned so far either classify, inflect (e.g. translate), or "comprehend" input data. Systems that produce material instead of consuming it are known as generative. When produced with DL, they are known as a Generative Artificial Neural Network (GANN). ChatGPT is an example of a generative language model. Images and videos can also be generated. A GANN can draw an image; this is very different from learning to recognize an image. A powerful means of building GANNs is with GANs (50); again, very much an alphabet soup. As an example, a GAN can be taught impressionist painting by training it with pictures by the impressionist masters. The GAN will then produce a novel painting very much in the genre of impressionism. The quality of the images generated is remarkable. Figure 1.1 displays an example of cats produced by a GAN (81). The GAN was trained to learn what cats look like and produce examples. The object is to produce photorealistic synthetic cats. Products such as Adobe Photoshop have included this facility for general use by the public (90). In the sphere of video and audio, GANs are producing the so-called "deep fake" videos that are of very high quality. Deep fakes are becoming increasingly difficult for humans to detect. In the age of information war and disinformation, the ramifications are serious. GANs are performing tasks at levels undreamt of a few decades ago; the quality can be striking, and even troubling. As new applications are identified for GANs, the resources dedicated to improving them will continue to grow and produce ever more spectacular results.

Figure 1.1 Examples of GAN-generated cats. The matrix on the left contains examples from the training set. The matrix on the right contains GAN-generated cats. The cats on the right do not exist; they were generated by the GAN. Source: Karras et al. (81).

1.1 AI/ML – Deep Learning?


It is all too common to see the acronym AI/ML, which stands for artificial intelligence/machine learning, and worse to see the terms used interchangeably. AI, as the name implies, is the study of simulating or creating intelligence, and even of defining intelligence. It is a field of study encompassing many areas including, but not limited to, machine learning. AI researchers can also be biologists (histology and neurology), psychologists, mathematicians, computer scientists, and philosophers. What is intelligence? What are the criteria for certifying something as intelligent? These are philosophical questions as much as technical challenges. How can AI be defined without an understanding of "natural" intelligence? That is a question that lies more in the biological realm than in that of technology. Machine learning is a subfield of AI. DL and ANNs are a subfield of machine learning.

The polymath Alan Turing suggested what has come to be known as the Turing Test² in 1950 (153). He argued that if a machine could fool a human by convincing the human that it is human too, then the computer is "intelligent." He proposed concealing a human and a computer and linking them over a teletype to a third party, a human evaluator. If the human evaluator could not distinguish between the human and the computer, then, effectively, the computer could be deemed "intelligent." It is an extremely controversial assertion, but it was a useful one in 1950, and it has formed an invaluable basis for discussion ever since. An influential argument put forward in 1980 by the philosopher John Searle asserts that real intelligence can never be realized in a digital computer. Searle argued that a machine that could pass the Turing test was not necessarily intelligent. He proposed a thought experiment called the Chinese Room (135). The Turing test was constrained to be performed in Chinese, and it was accepted that a machine could be programmed to pass the test. Searle argued that there is an important distinction between simulating Chinese and understanding Chinese; the latter is the true mark of intelligence. He characterized the difference as "weak AI" and "strong AI." A computer executing a program of instructions is not thinking, and Searle argued that is all a computer could ever do. There is a large body of literature debating the point, some of which predates Turing's contribution and dates back to Leibniz (96; 98). OpenAI's recent offering, ChatGPT, is a perfect example of this dichotomy. The lay press speculates (128) on whether it is intelligent, but clearly it is an example of "weak AI." The product can pass the Turing test, but it is not intelligent.

² Alan Turing called it the "imitation game."

To understand what machine learning is, one must place it in relation to AI. It is a means of realizing some aspect of AI in a digital computer; it is a subfield of AI. Tom Mitchell, who wrote a seminal text on machine learning (105), provides a useful definition of machine learning: "A computer program is said to learn³ from experience, E, with respect to some class of tasks, T, and performance measure, P, if its performance at tasks in T, as measured by P, improves with experience E." [Page 2]. Despite first appearances, this really is a very concise definition. So while it is clear that machine learning is related to AI, the reverse is not necessarily true. DL and ANNs are, in turn, specializations of machine learning. Thus, while DL is a specialization of AI, not all AI topics are necessarily connected to DL.

³ The word, learn, is in bold in Mitchell's text. The author clearly wished to emphasize the nature of the exercise.
The object of this book is to present a canonical mathematical basis for DL concisely and directly, with an emphasis on practical implementation; as such, the reference approach is consciously eschewed. It is not an attempt to cover everything, as it cannot. The field of DL has advanced to the point where both its depth and breadth call for a series of books. But a brief history of DL is clearly indicated. DL evolved from ANNs, and so the history begins with them. The interested reader is directed to (31) for a more thorough history and to (57; 127) for a stronger biological motivation.

1.2 A Brief History


ANNs are inspired by, and originally attempted to simulate, biological neural networks. Naturally, research into biological neural networks predated ANNs. During the nineteenth century, great strides were taken, and it was an interdisciplinary effort. As physicists began to explain electricity and scientists placed chemistry on a firm scientific footing, the basis was created for a proper understanding of the biological phenomena that depended on them. Advances in grinding lenses, combined with a better appreciation of the light condenser, led to a dramatic increase in the quality of microscopes. The stage was set for histologists, biologists, and anatomists to make progress in identifying and understanding tissues and cell differentiation. Putting all those pieces together yielded new breakthroughs in understanding the biology of living things in every sphere. Alexander Bain and William James made independent seminal contributions (8; 76). They postulated that physical action, the movement of muscles, was directed and controlled by neurons in the brain and communicated with electrical signals. Santiago Ramón y Cajal (167) and Sir Charles Sherrington (136) put the study of neurology on a firm footing with their descriptions of neurons and synapses; both would go on to win Nobel prizes for their contributions, in 1906 and 1932, respectively.

By the 1940s, a firm understanding of biological neurons had been developed. Computer science was nascent, but fundamental results were developed. In the 1930s, Alonzo Church had described his Lambda Calculus model of computation (21), and his student, Alan Turing, had defined his Turing Machine⁴ (152), both formal models of computation. The age of modern computation was dawning. Warren McCulloch and Walter Pitts wrote a number of papers that proposed artificial neurons to simulate Turing machines (164). Their first paper was published in 1943. They showed that artificial neurons could implement logic and arithmetic functions. Their work hypothesized networks of artificial neurons cooperating to implement higher-level logic. They did not implement or evaluate their ideas, but researchers had now begun thinking about artificial neurons.

⁴ It was Church who coined the term, Turing Machine.
Donald Hebb, an eminent psychologist, wrote a book in 1949 postulating a learning rule for artificial neurons (65). It is a supervised learning rule. While the rule itself is numerically unstable, it contains many of the ingredients of modern ANNs. Hebb's neurons computed state based on the scalar product, and the connections between the individual neurons were weighted. Connections between neurons were reinforced based on use. While modern learning rules and network topologies are different, Hebb's work was prescient. Many of the elements of modern ANNs are recognizable, such as a neuron's state computation, response propagation, and a general network of weighted connections.
The next step to modern ANNs was Frank Rosenblatt's perceptron (130). Rosenblatt published his first paper in 1958. Building on Hebb's neuron, he proposed an updated supervised learning rule called the perceptron rule. Rosenblatt was interested in computer vision. His first implementation was in software on an IBM 704 mainframe (it had 18 k of memory!). Perceptrons were eventually implemented in hardware. The machine was a contemporary marvel, fitted with an array of 20 × 20 cadmium sulfide photocells used to create a 400 pixel input image. The New York Times reported it with the headline, "Electronic Brain Teaches Itself." Hebb's neuron state was improved with the introduction of a bias, an innovation still very important today. Perceptrons were capable of learning linear decision boundaries; that is, the categories of classification had to be linearly separable.

The next milestone was a paper by Widrow and Hoff in 1960 that proposed a new learning rule, the delta rule. It was more numerically stable than the perceptron learning rule. Their research system was called ADALINE (15) and used least squares to train the network. Like Rosenblatt's early work, ADALINE was implemented in hardware, with memistors. The follow-up system, MADALINE (163), included multiple layers of perceptrons, another step toward modern ANNs. It suffered from a similar limitation as Rosenblatt's perceptrons in that it could only address linearly separable problems; it was a composition of linear classifiers.
In 1969, Minsky and Papert published a book that cast a pall on ANN research (106). They demonstrated that ANNs, as they were understood at that point, suffer from an inherent limitation. It was argued that ANNs could never solve "interesting" problems; but the assertion was based on the assumption that ANNs could never practically handle nonlinear decision boundaries. They famously used the example of the XOR logic gate. As the XOR truth table could not be learnt by an ANN, and XOR is a trivial concept when compared to image recognition and other applications, they concluded that the latter applications were not appropriate. As most interesting problems are nonlinear, including vision and NLP, they concluded that the ANN was a research dead end. Their book had the effect of chilling research in ANNs for many years as the AI community accepted their conclusion. It coincided with a general reassessment of the practicality of AI research in general and the beginning of the first "AI Winter."
The fundamental problem facing ANN researchers was how to train multiple layers of an ANN to solve nonlinear problems. While there were multiple independent developments, Rumelhart, Hinton, and Williams are generally credited with the work that described the backpropagation of error algorithm in the context of training ANNs (34). This was published in 1986. Backpropagation of error is still the basis of the majority of modern ANN training algorithms. Their method provided a means of training ANNs to learn nonlinear problems reliably.

It was also in 1986 that Rina Dechter coined the term "Deep Learning" (30). The usage was not what is meant by DL today; she was describing a backtracking algorithm for theorem proving with Prolog programs.

The confluence of two trends, the dissemination of the backpropagation algorithm and the advent of widely available workstations, led to unprecedented experimentation and advances in ANNs. By 1989, in the space of just 3 years, ANNs had been successfully trained to recognize hand-written digits in the form of postal codes from the United States Postal Service. This feat was achieved by a team led by Yann LeCun at AT&T Labs (91). The work had all the recognizable features of DL, but the term had not yet been applied to neural networks in that sense. The system would evolve into LeNet-5, a classic DL model. The renewed interest in ANN research has continued unbroken down to this day. In 2006, Hinton et al. described a multi-layered belief network called a "Deep Belief Network" (67). The usage arguably led to referring to deep neural networks as DL. The introduction of AlexNet in 2012 demonstrated how to efficiently use GPUs to train DL models (89). AlexNet set records in image recognition benchmarks. Since AlexNet, DL models have dominated most machine learning applications; it has heralded the DL Age of machine learning.
We leave our abridged history here and conclude with a few thoughts. As the computing power required to train ANNs grows ever cheaper, access to the resources required for research becomes more widely available. The IBM supercomputer ASCI White cost US$110 million in 2001 and occupied a special purpose room. It had 8192 processors, for a total of 123 billion transistors, with a peak performance of 12.3 TFLOPS.⁵ In 2023, an Apple Mac Studio costs US$4000, contains 114 billion transistors, and offers peak performance of 20 TFLOPS. It sits quietly and discreetly on a desk. In conjunction with improvements in hardware, there is a change in the culture of disseminating results. The results of research are proliferating in an ever more timely fashion.⁶ The papers themselves are also recognizing that describing the algorithms is not the only point of interest. Papers are including experimental methodology and setup more frequently, making it easier to reproduce results. This is made possible by ever cheaper and more powerful hardware. Clearly, the DL boom has just begun.

⁵ TFLOPS (teraflops): trillions of floating point operations per second.
⁶ For example, sites such as https://arxiv.org/list/cs.LG/recent offer researchers and the community early peer review and uncopyrighted access to research. The time frames are convenient for research, not journal deadlines.

1.3 The Genesis of Models


A model is an attempt to mimic some phenomenon. It can take the form of a sculpture, a painting, or a mathematical explanation of observations of the natural world. People have been modeling the world since the dawn of civilization. The models of interest in this book are quantitative mathematical models. People build quantitative models to understand the world and use the models to make predictions. With accurate predictions comes the capacity to exploit and manipulate natural phenomena. Humans walked on the moon because accurate models of gravity, among many other things, were possible. Building quantitative models requires many technologies: writing, the invention of numbers and a means of operating on them (arithmetic), and finally mathematics. In its simplest form, a model is a mathematical function. In essence, building a model means developing a mathematical function that makes accurate predictions; the scientific method is an extraordinarily successful example of this. DL ANNs are forms of models, but before we examine them, let us examine how models have traditionally been developed.

1.3.1 Rise of the Empirical Functions


People have been building models for millennia. The traditional means of doing so is to write down a constrained set of equations and then solve them. For millennia, the constraints have been in the form of natural laws or similar phenomena. The laws are often discovered scientifically. Ibn al-Haytham and Galileo Galilei (45) independently invented the scientific method, which, when combined with the calculus (invented independently by Newton and Leibniz in the 1660s), a century later led to an explosion of understanding of the natural world. The scientist gathers data, interprets it, and composes a law in the form of an equation that explains it. For example, Newton's law of gravity is
\[ \text{Force of Gravity} = \frac{G m_1 m_2}{r^2}, \tag{1.1} \]

where G = 6.674 ⋅ 10⁻¹¹ m³ ⋅ kg⁻¹ s⁻² in SI units, r is the distance between the two objects, and the mᵢ are the masses of the objects.
Using the equation for gravity, one can build models by writing an equation and then solving it. The law of gravity acts as the constraint. Natural laws are discovered by scientists collecting, analyzing, and interpreting the data to discern the relationships between the variables, and the result is an interpretable model. Once natural laws have been published, such as the conservation of mass, scientists and engineers can use them to build models of systems of interest. This is done for exciting things like the equations of motion for rockets and dull things like designing the plumbing for an apartment building; mathematical models are everywhere.

The process of building a model begins with writing down a set of constraints in the form of a system of differential equations and then solving them. To illustrate, consider the trivial problem of producing a model that computes the time to fall for an object from a height, h, near the surface of the Earth. The object's motion is constrained by gravity. The classical means of proceeding is to use Newton's laws and write down a constraint. Acceleration near the surface of the Earth can be approximated with the constant g (9.80665 m/s²). Employing Newton's notation for derivatives, we obtain the following equation of motion (acceleration in this case) based on the physical constraint:

ẍ = g. (1.2)

The equation can be integrated to obtain the velocity (ignoring friction),

\[ \dot{x} = \int g \, dt = g t, \tag{1.3} \]

which in turn can be integrated to produce the desired model, t = f(h),

\[ h \equiv x = \frac{g}{2} t^2 \implies t = \sqrt{\frac{2h}{g}} = f(h). \tag{1.4} \]

This yields an analytical solution obtained from the constraint, which was obtained from a natural law. Of course, this is a very trivial example, and often an analytical solution is not available. Under those circumstances, the modeler must resort to numerical methods to solve the equations, but it illustrates the historical approach.
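
As a quick numerical check of Eq. (1.4), the short C sketch below computes the time to fall; the height of 100 m is an arbitrary illustrative value.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double g = 9.80665;     /* m/s^2, standard gravity          */
    const double h = 100.0;       /* m, arbitrary illustrative height */
    double t = sqrt(2.0 * h / g); /* Eq. (1.4): t = sqrt(2h/g)        */
    printf("t = %.3f s\n", t);    /* prints roughly 4.516 s           */
    return 0;
}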

With modern computers, another approach to obtaining a function, f̂(h), is possible: an ANN can be used. Instead of constraining the system with a natural law, it is constrained empirically, with a dataset. ANNs are trained with supervised learning techniques. They can be thought of as functions that start as raw clay. Supervised training moulds the clay into the desired shape (an accurate model), and the desired model is specified with a dataset; that is, the dataset defines the model, not a natural law. To demonstrate, the example of f(h) is revisited.

Training the ANN is done with supervised learning techniques. The raw clay of the untrained ANN function needs to be defined by data, so the first step is to collect data. This is done by measuring the time to fall from a number of different heights. This results in a dataset of the form {(h₁, t₁), … , (h_N, t_N)}, where each tuple consists of a height and the time to fall to the ground from it. Using the data, the ANN can be trained, and we obtain

t = ANN(h) ≡ f̂(h). (1.5)

Once trained, the ANN model approximates the analytical solution, f̂(h) ≈ f(h).

There are now two models, hopefully producing the same results, but arrived at with completely different techniques. The results of both are depicted in Figure 1.2. There are, however, some meaningful differences. First, the ANN is a black box; it may be correct, but nothing can really be said about it. The final model does not admit of interpretability. The analytical result can be used to predict asymptotic behavior, or its variables can be rearranged for further insights. Moreover, the analytical solution was obtained by rearranging the solution of the differential equation, Eq. (1.4). Second, the training of the ANN uses far more compute resources, memory and CPU, than the analytical solution. And finally, assembling the dataset is a great deal of trouble. In this trivial example, someone already did that and arrived at the gravitational constant, g. Comparing the two methods, the ANN approach seems like a great deal more trouble.
This begs the question: given the seeming disadvantages of ANNs, why would anyone ever use them? The answer lies in the differences between the approaches, the seeming "disadvantages." The ANN approach, training with raw data, did not require any understanding or insight into the underlying process that produced the data to build an accurate model, none. The model was constrained empirically, by the data, and no constraint in the form of a natural law or principle was required. This is extremely useful for many modern problems of interest.

Consider the problem of classifying black-and-white digital images as one of either a cat, a dog, or a giraffe; we need a function. The function is f: ℝᴹ → 𝕂, where 𝕂 is the set { cat, dog, giraffe } and M is the resolution of the image. For such applications, empirically specifying the function is the only means of obtaining the model. There are no other constraints available: Einstein cannot help with a natural law, the Black–Scholes equation is of no use, nor can a principle such as "no arbitrage" be invoked. There is no natural starting point. The underlying process is unknown, and probably unknowable. The fact that we have no insight into the process producing the data is no hindrance at all. There is a drawback in that the resulting model is not interpretable, but nevertheless the approach has been immensely successful. Using supervised learning techniques for this application imposes the requirement to collect a set of images and label them with the correct answer (one of the three possible answers). Thus, ignoring the need for interpretability or an understanding of the generating process, it is possible to accurately model a whole new set of applications.

[Figure 1.2 plot: Δt = f(height), with Δt (s) on the vertical axis and Height (m) on the horizontal axis.]

Figure 1.2 The graph of t = f(h) is plotted with empty circles as points. The ANN(h)'s predictions are crosses plotted over them. The points are from the training dataset. The ANN seems to have learnt the function.
Even for applications where natural laws exist, leading to a system of constraints, ANNs are beginning to enjoy some success. Combinatoric problems such as protein folding have been successfully addressed with ANNs (16). ANNs are better at predicting the shapes of proteins than approaches that solve the differential equations and quantum mechanical constraints. For large problems lacking an analytical solution, such as predicting the paths of hurricanes, there is growing investment in the use of ANNs to make predictions that are more accurate (22). There are many more examples.

1.3.2 The Biological Phenomenon and the Analogue


Finally, it is worth bearing in mind the inherent differences between the DL models composed of ANNs and animal brains. ANNs were motivated by, and attempted to simulate, biological neuron assemblies (Hebb and Rosenblatt were psychologists). Owing to the success of DL, the nature of the simulation is often lost while retaining the connection; this can be unfortunate.

It must not be forgotten that biological neural networks are physical; they are cells, "hardware." Biological neurons operate independently, asynchronously, and concurrently; they are the unit of computation. In this sense, a brain is a biological parallel computer. ANNs are software simulating biological hardware on a completely different computer architecture. There are inherent differences between the biological instance and the simulation that render the ANN inefficient. A simulated neuron must wait to have its state updated when a signal "arrives." The delay is owing to waiting in a queue for its turn on a CPU core, the simulation's unit of computation. Biological neurons are the "CPUs" and continually update themselves. A human brain has approximately 100 billion neurons with an average of 10,000 synapses (connections) each (79), and they do not need to wait their turn to compute state: they are the state. The ANN simulation must queue all its virtual neurons serially for a chance on a CPU to update their state. To do this efficiently, DL models typically impose strong restrictions on the topology of the network of virtual neuron connections. ANN software is simulating a parallel computer on a serial computer. Even allowing for the parallelism of GPUs, the simulation is still O(number of neurons). The characteristics are different too: a biological neural network is an analog computer, and modern computers are digital.

The nature of a computer is also very much at variance with an animal brain. A human brain uses around 20 W of energy (79). An Intel Xeon CPU consumes between 200 and 300 W, as do GPUs. The power usage of the GPU farms used to train Google's BERT or NVIDIA's GANN is measured in kilowatts. Training language models can cost US$8,000,000 just for the electricity (19). It is also common to compare biological neurons to transistors. It is a fine example of an apples-to-oranges comparison. Transistors have switching times on the order of 10⁻⁹ seconds; biological neuron switching times are on the order of 10⁻³ seconds. Transistors typically have 3 static connections, while neurons can have thousands of connections. A neuron's set of connections, its synapses, can dynamically adapt to changing circumstances. Neurons perform very different functions, and a great many transistors would be required to implement the functionality of a single neuron.

None of this is to say that DL software is inappropriate for use or not fit for purpose; quite the contrary. But it is important to have some perspective on the nature of the simulation and the fundamental differences.

1.4 Numerical Computation – Computer Numbers Are Not ℝeal

Before presenting DL algorithms, it should be emphasized that ANNs are mathematical algorithms implemented on digital computers. When reading this text, it is important to understand that naïve computer implementations of mathematical methods can lead to surprising⁷ results. Blindly typing equations into a computer is often a recipe for trouble, and DL is no exception. Arithmetic is different on a computer and varies in unexpected ways, as will be seen. Unlike the normal arithmetical operations of addition and subtraction, most computer implementations of them are not associative, distributive, or commutative. The reader is encouraged to peruse this section with a computer and experiment, to aid in understanding the pitfalls.

⁷ "Surprising" is an engineering and scientific euphemism for unwelcome.

Consider the interval S = [1, 2] ⊂ ℤ, a subset of the integers. The cardinality of S, |S|, is two. Intervals of the integers are countable. Now consider S = [1, 2] ⊂ ℝ. The real number line is continuous, a characteristic relied upon by the calculus. So, in this case, |S| = ∞. Indeed, S = [1, 1.0000001] ⊂ ℝ also has a cardinality of infinity. Equations are generally derived assuming that ℝ is available, but the real number line does not exist in a computer. Computers simulate ℝ with a necessarily discrete (finite) set called floating point numbers. This has profound implications when programming. Two mathematically equivalent algorithms can behave completely differently when implemented on the same digital computer.

By far the most common implementation of floating point numbers on modern digital computers is the IEEE-754 standard for floating point values (25). First agreed in 1985, it has been continually updated ever since. Intel's x86 family of processors implements it, as do Apple's ARM chips, such as the Mx family of SoCs. It is often misunderstood as simply a format for representing floating point numbers, but it is actually a complete system defining behavior and operations, including the handling of errors. This is extremely important, as running a program on different CPU architectures that are IEEE-754 compliant will yield the same numerical results. The most common IEEE-754 floating point types are the 32-bit ("single-precision") and the 64-bit ("double-precision") formats.⁸ Computer languages usually expose them as native types, and the programmer uses them without realizing it. It is immediately clear that, by their very nature of being finite, the IEEE representations can only represent a finite subset of the real numbers. A 32-bit format for floating point numbers can, at most, represent 2³² values; that is a long way from infinity.

⁸ The "C" language types float and double often correspond with the 32-bit and 64-bit types, respectively, but it is not a language requirement. Python's float is the double-precision type (64-bit).
To illustrate the pitfalls of floating point arithmetic, we present a simple computer experiment following the presentation of Forsythe (41) and the classic linear algebra text, Matrix Computations (49). Consider the polynomial ax² + bx + c. The quadratic formula, a seemingly innocuous equation known by all school children, computes the roots with the following:

\[ \text{root} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \tag{1.6} \]

There are two roots. At first glance, this appears to be a trivial equation to implement. For the smallest root of the quadratic (a = 1, b = −2p, and c = −q):

\[ \text{root}_{-} = p - \sqrt{p^2 + q}, \tag{1.7} \]

or, alternatively,

\[ \text{root}_{-} = \frac{-q}{p + \sqrt{p^2 + q}}. \tag{1.8} \]

Both of these forms are mathematically equivalent, but they are very different when implemented in a computer. Letting p = 12,345,678 and q = 1, and trying both methods, two different answers are obtained (assuming an IEEE 754 double-precision implementation): −4.097819e-08 and −4.05e-08, respectively. Only the latter root is correct, despite both equations being mathematically equivalent (verify this!). To understand what has occurred, and how to avoid it, we must examine how floating point numbers are represented in a computer.
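
The experiment is easy to reproduce. A minimal C sketch (assuming a C99 compiler; link with the math library):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double p = 12345678.0, q = 1.0;
    const double s = sqrt(p * p + q);
    printf("Eq. (1.7): %e\n", p - s);        /* -4.097819e-08: cancellation */
    printf("Eq. (1.8): %e\n", -q / (p + s)); /* -4.050000e-08: correct      */
    return 0;
}

Equation (1.8) avoids subtracting two nearly equal quantities, which is why it survives the rounding; the mechanism is explained below.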

1.4.1 The IEEE 754 Floating Point System


Digital computers represent quantities in binary form, that is, base 2 (the base is also known as the radix). Modern humans think in decimal.⁹ People write numbers with an implied radix of 10, but there is nothing special about decimal numbers. For example, 2₁₀ = 10₂ and 10₁₀ = 1010₂. In everyday life, people drop the subscript, as the base of 10 is assumed.

⁹ The oldest number system that we know about, Sumerian (c. 3,000 BC), was sexagesimal, base 60, and survives to this day in the form of minutes and seconds.

[Figure 1.3: a 32-bit word with sign bit 1, exponent (8 bits) 01100110, and mantissa (23 bits) 01011011111001000111001.]

Figure 1.3 IEEE-754 representation for the value of 4.050000 ⋅ 10⁻⁸. The exponent is biased and centered about 127, and the mantissa is assumed to have a leading "1."

The integers are straightforward, but representing real numbers requires more elaboration. The correct root was written in scientific notation: −4.050000 ⋅ 10⁻⁸. There are three distinct elements in this form of a number. The mantissa, or significand, is the sequence of significant digits, and its length is the precision, 8 in this case. It is written with a single digit to the left of the decimal point and multiplied to obtain the correct order of magnitude. This is done by raising the radix, 10 in this case, to the power of the exponent, −8. The IEEE-754 format for 32-bit floating point values encodes these elements to represent a number; see Figure 1.3. So what can be represented with this system?
Consider a decimal number, abc.def; each position represents an order of magnitude. For decimal numbers, the positions represent

\[ a \cdot 100 + b \cdot 10 + c \cdot 1 + d \cdot \frac{1}{10} + e \cdot \frac{1}{100} + f \cdot \frac{1}{1000}, \]

while binary numbers look like

\[ a \cdot 4 + b \cdot 2 + c \cdot 1 + d \cdot \frac{1}{2} + e \cdot \frac{1}{4} + f \cdot \frac{1}{8}. \]
Some floating point examples are 0.5₁₀ = 0.1₂ and 0.125₁₀ = 0.001₂. So far so good, but what of something "simple" such as 0.1₁₀? Decimal 0.1 is represented as 0.0001100110011…₂, where the block 0011 is repeated ad infinitum. This can be written in binary scientific notation as 1.10011001…₂ ⋅ 2⁻⁴. Using the 32-bit IEEE encoding, its representation is 00111101110011001100110011001101. The first bit is the sign bit. The following 8 bits form the exponent, and the remaining 23 bits comprise the mantissa. There are two seeming mistakes. First, the exponent is 123. For efficiently representing normal numbers, the IEEE exponent is biased, that is, centered about 127: 127 − 4 = 123. The second odd point is in the mantissa. As the first digit is always one, it is implied, so the encoding of the mantissa starts at the first digit to the right of the first 1 of the binary representation; effectively, there can be 24 bits of precision. The programmer does not need to be aware of this – it all happens automatically in the implementation and the computer language (such as C++ and Python). Converting the IEEE 754 value back to decimal, we get 0.100000001.¹⁰

¹⁰ What is 10% of a billion dollars? This is a sufficiently serious problem for financial software that banks often use specialized routines to ensure that the money is correct.

Observe that even a simple number like 1/10 cannot be represented exactly in the IEEE 32-bit format, just as 1/3 is difficult for decimal: 0.333…₁₀.
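
This is easy to observe directly; a minimal C sketch that prints the values actually stored for 0.1 at single and double precision:

#include <stdio.h>

int main(void) {
    float  f = 0.1f; /* rounded to the nearest 32-bit value */
    double d = 0.1;  /* rounded to the nearest 64-bit value */
    printf("%.9f\n", f);  /* prints 0.100000001             */
    printf("%.17f\n", d); /* prints 0.10000000000000001     */
    return 0;
}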
Let 𝔽 ⊂ ℝ be the set of IEEE single-precision floating point numbers. Being
finite 𝔽 will have minimum and maximum elements. They are 1.17549435E-38
and 3.40282347E+38, respectively. Any operation that strays outside of that range
will not have a result. Values less than the minimum are said to underflow, and
values that exceed the maximum are said to overflow. Values within the supported
range are known as normal numbers. Even within the allowed range, the set is
not continuous and a means of mapping values from ℝ onto 𝔽 is required, that
is, we need a function, fl(x) ∶ ℝ → 𝔽 , and the means prescribed by the IEEE 754
standard is rounding.
All IEEE arithmetical operations are performed with extra bits of precision.
This ensures that a computation will produce a value that is superior to the
bounds. The result is rounded to the nearest element of 𝔽 , with ties going
to the even value. Specifying ties may appear to be overly prescriptive, but
deterministic computational results are very important. IEEE offers 4 rounding
modes,11 but rounding to nearest value in 𝔽 is usually the default. Rounding
error is subject to precision. Given a real number, 1.0, what is the next largest
number? There is no answer. There is an answer for floating point numbers,
and this gap is the machine epsilon, or unit roundoff. For the double precision,
IEEE-754 standard the machine epsilon is 2.2204460492503131e-16. The width
of a proton is 8.83e-16 m, so this is quite small (computation at that scale would
choose a more appropriate unit than meters, such as Angstroms, but this does
demonstrate that double precision is very useful). The machine epsilon gives the
programmer an idea of the error when results are rounded. Denote the machine
epsilon as u. The rounding error is |fl(x) − x| ≤ ½ u |x|. This quantity can be used to
calculate rigorous bounds on the accuracy of computations and algorithms when
required.
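The gap is easy to inspect. A short C sketch using the standard nextafter routine to find the first double above 1.0:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    double next = nextafter(1.0, 2.0);   /* the first double greater than 1.0 */
    printf("gap above 1.0: %.17g\n", next - 1.0);
    printf("DBL_EPSILON:   %.17g\n", DBL_EPSILON);   /* the same quantity */
    return 0;
}

Both lines print 2.2204460492503131e-16, the machine epsilon quoted above.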
Let us revisit our computation of the smallest root, which was done in double
precision.
p was very close to the result of the square root. For our values of p and
q, p ≈ √(p² + q), and so performing p − √(p² + q) (p minus almost p) canceled out all of
the information in the result. This effect is known as catastrophic cancellation,
but it is widely misunderstood. A common misconception is that it is a bad
practice to subtract floating point numbers that are close together, or add floating
point numbers that are far apart, but that is not necessarily true. In this case, the
subtraction has merely exposed an earlier problem that no longer has anywhere
to hide. The square root is 12,345,678.000000040500003, but it was rounded to

¹¹ In C, the current rounding mode is reported by the FLT_ROUNDS macro from float.h and can be changed with fesetround from fenv.h. Python does not have a standard means of specifying the IEEE rounding mode.
12,345,678.000000041 so that the result can fit in a double-precision value. In
a large number like that the relative error is manageable, but when subtracted
from a nearby number the rounding error is completely exposed; the relative
error explodes. The correct rule of thumb is to be careful with results that are
contaminated with rounding error. The relative error needs to be minimized. In
this case, producing the final result with a division was much safer. Theoretically,
both methods should produce the same result, and on paper they do, but in practice
the minutiae of computer arithmetic are important.
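The effect is easy to reproduce. The sketch below assumes the earlier computation used p = 12,345,678 and q = 1 (the constants are assumptions here; any large p with a small q behaves the same way) and contrasts the subtraction with the algebraically equivalent division q∕(√(p² + q) + p):

#include <stdio.h>
#include <math.h>

int main(void) {
    double p = 12345678.0, q = 1.0;
    double s = sqrt(p * p + q);     /* already contaminated by rounding */
    double bad  = s - p;            /* subtraction exposes the rounding error */
    double good = q / (s + p);      /* division keeps the relative error small */
    printf("subtraction: %.17g\n", bad);
    printf("division:    %.17g\n", good);
    return 0;
}

The division retains far more correct digits than the subtraction, even though the two expressions are identical on paper.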
The properties of addition and subtraction over 𝔽 also differ from those over ℝ; for example,
associativity does not always hold: (x + y) + z ≠ x + (y + z). Consider a machine
with 3 digits of precision; it is easy to show that on such a computer the usual rules of
arithmetic do not necessarily apply. Setting x, y, and z to 1.24, −1.23, and 0.001,
respectively, the two sides produce different results. The left-hand side yields 0.011, but
the right-hand side yields 0.01: fl(−1.23 + 0.001) loses the precision required to
contain the answer, so the order mattered. Evaluating the left-hand side first reduced
the two large numbers to a small one before combining it with the other small number,
which preserved the correct answer. Planning the sequence of operations can be very important.
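The same surprise occurs in ordinary double precision; a one-line experiment in C:

#include <stdio.h>

int main(void) {
    double x = 0.1, y = 0.2, z = 0.3;
    printf("(x + y) + z = %.17g\n", (x + y) + z);   /* 0.60000000000000009 */
    printf("x + (y + z) = %.17g\n", x + (y + z));   /* 0.59999999999999998 */
    return 0;
}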

1.4.2 Numerical Coding Tip: Think in Floating Point

A common algorithmic activity is to loop waiting for some value to reach 0.0, such
as a residual error. A terrible mistake is to code something like:

while (error != 0.0) {
    // some code
}
The error may never reach 0.0 exactly, or it may pass straight through it. A better way is to
test with an inequality, such as ≥, against a tolerance. Even if it is “mathematically impossible” for the error in
question to be less than zero, when working with 𝔽 instead of ℝ there are always
nasty surprises. Paranoia is the only means of reducing the chance of terrible and
undebuggable errors.
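A minimal sketch of the safer pattern, here using a Newton iteration for √2; the tolerance of 1e-12 is an assumption and must be chosen per application:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double tol = 1e-12;          /* assumed; application-specific */
    double x = 1.0;                    /* Newton iteration for sqrt(2) */
    double error = fabs(x * x - 2.0);
    while (error >= tol) {             /* never: while (error != 0.0) */
        x = 0.5 * (x + 2.0 / x);
        error = fabs(x * x - 2.0);
    }
    printf("x = %.17g\n", x);
    return 0;
}

Had the loop tested error != 0.0, it might never have terminated: there is no guarantee the residual ever lands exactly on 0.0.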
In general, the same caution should be exercised when testing the result of any
computation. For example, when verifying that a matrix inversion is correct, an
obvious test is to compute A⁻¹A = I. A simple test for correctness might look like
Algorithm 1.1.
Algorithm 1.1 will almost certainly fail, even though the implementation of the
algorithm that computed it is correct. The expected 1s and 0s are rarely so precise
and have tiny residuals. In general, it is best to use a tolerance when testing for a
desired value. A better way of verifying the result is Algorithm 1.2.
Algorithm 1.1 Unsafe verification of an identity matrix
1: for i ∈ rows do
2:    for j ∈ columns do
3:       if i == j then
4:          assert (I[i,j] == 1.0)    ⊳ I[i,j] more likely to be 1.000021
5:       else
6:          assert (I[i,j] == 0.0)    ⊳ I[i,j] more likely to be -0.0000309
7:       end if
8:    end for
9: end for

Algorithm 1.2 The use of a tolerance to verify a matrix
1: for i ∈ rows do
2:    for j ∈ columns do
3:       if i == j then
4:          assert (abs(1 - I[i,j]) ≤ 𝜖)    ⊳ Subtract the expected 1 to leave a residual near 0
5:       else
6:          assert (abs(I[i,j]) ≤ 𝜖)    ⊳ Residuals can be negative, hence abs
7:       end if
8:    end for
9: end for

The choice of 𝜖 will be a suitably small value that the application can toler-
ate. In general, comparison of a computed floating point value should be done
as abs(𝛼 − x), where x is the computed quantity and 𝛼 is the quantity that is being
tested for. Note that printing the contents of I to the screen may appear to produce
the exact values of zero and one, but print format routines, whose job is to convert
IEEE variables to a printable decimal string, perform their own rounding and often mislead.
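A C realization of Algorithm 1.2 might look like the sketch below; the dimension N and the tolerance EPS are assumptions chosen for illustration:

#include <math.h>
#include <assert.h>

#define N   3
#define EPS 1e-9

void verify_identity(double I[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (i == j)
                assert(fabs(1.0 - I[i][j]) <= EPS);   /* diagonal is ~1 */
            else
                assert(fabs(I[i][j]) <= EPS);         /* off-diagonal is ~0 */
}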
A great contribution of the IEEE system is that it is progressive. Many computer
architectures used to treat floating point errors, such as division by zero, as terminal:
the program would be aborted. IEEE systems continue to make progress following
division by zero and simply record the error condition in the result. Indeterminate
operations such as 0∕0 and √−1 produce the special nonnormal value “not a
number” (NaN) in the result. Overflow, and division of a nonzero number by zero, are
recorded as the two special values, ±∞ (and operations such as ∞ − ∞ in turn produce a NaN). These error
values need to be detected because if they are used, then the error will contaminate
all further use of the tainted value. Often there is a semantic course of action that
can be adopted to fix the error before proceeding. They are notoriously difficult to
debug so it is critical to catch numerical problems early. At key points in a program,
it can be useful to check for unwanted errors. Many languages offer facilities to test
for the special values. POSIX C implements isnan and isinf to test for errors, and
Python’s math package implements both as well.
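For example, in C (the Python equivalents are math.isnan and math.isinf):

#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;
    double a = zero / zero;    /* the indeterminate 0/0 yields a NaN */
    double b = 1.0 / zero;     /* a nonzero dividend yields +infinity */
    if (isnan(a)) printf("a is NaN\n");
    if (isinf(b)) printf("b is infinite\n");
    return 0;
}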
We conclude with some rules of thumb to bear in mind when writing floating
point code:
● Avoid subtracting quantities when one of them is contaminated with error (such
as round off) – this is the root of catastrophic cancellation.
● Avoid computing quantities with much larger intermediate values than the
result. Such computations need to be designed carefully if unavoidable. The
canonical example is computing the variance with 𝜎² = 𝔼(x²) − 𝔼(x)²; see the sketch after this list.
● Consider the ramifications when implementing a mathematical expression
and anticipate problems. Expanding or simplifying an expression may have
consequences. Prefer division to subtraction. Plan the sequence of operations
carefully.
● Overflow and underflow should be monitored when possible.
● Check for IEEE error conditions when they are possible.
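The variance example deserves a demonstration. The sketch below contrasts the two formulas on sample values that are large relative to their spread; the data are assumptions chosen to make the cancellation visible:

#include <stdio.h>

int main(void) {
    double x[] = { 1e8 + 4, 1e8 + 7, 1e8 + 13, 1e8 + 16 };
    int n = 4;
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < n; ++i) { s += x[i]; s2 += x[i] * x[i]; }
    double mean = s / n;
    double naive = s2 / n - mean * mean;   /* E(x^2) - E(x)^2 cancels badly */
    double ss = 0.0;
    for (int i = 0; i < n; ++i)
        ss += (x[i] - mean) * (x[i] - mean);
    double twopass = ss / n;               /* E((x - mean)^2): the exact 22.5 */
    printf("naive:    %g\n", naive);       /* typically prints 22, not 22.5 */
    printf("two-pass: %g\n", twopass);
    return 0;
}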
These are only rules of thumb, but thinking in floating point improves the chances of
getting things right. ANNs can require trillions of floating point operations to train and
millions to compute, so when things go wrong the debugging can be extremely
challenging. A single NaN will void all progress, but at least it makes the problem
visible. More pernicious are the silent problems, such as the inaccuracies resulting
from catastrophic cancellation – so plan well.

1.5 Summary

While DL has become synonymous with AI in the public’s imagination, it is a
subfield of machine learning, which in turn is a specialization of AI. ANNs are
models produced by constraining a system empirically with a dataset. They are
of use when there is no other convenient constraint available, such as a law of
nature. The resultant model is usually not interpretable. The real number line does
not exist in a digital computer. When dealing with any kind of computer model,
care must be taken that the arithmetic has not gone wrong. Appendix I provides
a brief review of the mathematics required to understand the text. It also includes
references to further reading.
1.6 Projects

1. Formulate a definition of intelligence. Complement it with a set of testable
criteria. If the criteria are empirical, compose a single score that summarizes
the “intelligence” of a system that was tested.
2. Contrive a formula that exhibits catastrophic cancellation. How did you verify
that it does indeed produce the wrong answer?
3. In your favorite computer language, initialize a variable x to 0.0. Write a loop
that adds 0.1 to x 10 times. The test x == 1.0 evaluates to false (do not
just print the result). What went wrong?
2 Deep Learning and Neural Networks
In Chapter 1, it was stated that deep learning (DL) models are based on artificial
neural networks (ANNs). In this chapter, deep learning is defined more precisely,
though the definition remains quite loose. This is done by connecting deep learning
to ANNs more concretely. It was also claimed that ANNs can be interpreted as
programmable functions. In this chapter, we describe what those functions look like
and how ANNs compute values. Like a function, an ANN accepts inputs and com-
putes an output. How an ANN turns the input into the output is detailed. We also
introduce the notation and abstractions that we use in the rest of the text.
A deep learning model is a model built with an artificial neural network, that is,
a network of artificial neurons. The neurons are perceptrons. Networks require a
topology. The topology is specified as a hyperparameter to the model. This includes
the number of neurons and how they are connected. The topology determines how
information flows in the resulting ANN configuration and some of the properties
of the ANN. In broad terms, ANNs admit many possible topologies; indeed, there
are infinitely many. Determining good topologies for a
particular problem can be challenging; what works in one problem domain may
not (probably will not) work in a different domain. ANNs have many applications
and come in many types and sizes. They can be used for classification, regression,
and even generative purposes such as producing a picture. The different domains
often have different topologies dictated by the application. Most ANN applications
employ domain-specific techniques and adopt trade-offs to produce the desired
result. For every application, there are a myriad of ANN configurations and param-
eters, and what works well for one application may not work for others. If this
seems confusing – it is (107; 161). A good rule of thumb is to keep it as simple as
possible (93).


2.1 Feed-Forward and Fully-Connected Artificial Neural Networks
This section presents the rudiments of ANN topology. One topology in particular,
the feed-forward fully-connected (FFFC) topology, is adopted. There is
no loss of generality, as all the principles and concepts presented still apply to
other topologies. The focus on a simple topology lends itself to clearer explana-
tions. To make matters more concrete, we begin with a simple example presented
in Figure 2.1. We can see that an ANN is composed of neurons (nodes), connec-
tions, and many numbers. Observing the figure, it is clear that we can interpret an
ANN as a directed graph, G(N, E). The nodes, or vertices, of the graph are neurons.
Neurons that communicate directly are connected with a directed edge. The
direction of the edge determines the flow of the signal.
The nodes in the graph are neurons. The neuron abstraction is at the heart of
the ANN. Neurons in ANNs are generally perceptrons. Information, that is, sig-
nals, flow through the network along the directed edges through the neurons. The
arrow indicates that a signal is coming from a source neuron and going to a target

1 1 1
–0
.17
33
7

–0.
538

–1.
094
84
13
–2.
768

1.6
17

72
76
–0.58
–2.45
5

08
01

768
58

42
.74

0.7
–2

Input 0.93266 Output


–1
.37
2.9

80
–0.65

5
36
38

412

37
33
0.6
1
–1.677

885
1.4

605
1.3

Figure 2.1 A trained ANN that has learnt the sine function. The circles, graph nodes, are
neurons. The arrows on the edges determine which direction the communication
flows.

neuron. There are no rules respecting the number of edges. Most neurons in an
ANN are both a source and a target. Any neuron that is not a source is an output
neuron. Any neuron that is not a target is an input neuron. Input neurons are the
entry point to the graph, and the arguments to the ANN are supplied there. Out-
put neurons provide the final result of the computation. Thus, given ŷ = ANN(x),
x goes to the input neurons and ŷ is read from the output neurons. In the example,
x is the angle, and ŷ is the sine of x.
Each neuron has an independent internal state. A neuron’s state is computed
from the input signals from connected source neurons. The neuron computes
its internal state to build its own signal, also known as a response. This internal
state is then propagated in turn through the directed edges to its target neurons.
The inceptive stage for the computation is the provision of the input arguments
to the input neurons of the ANN. The input neurons compute their states and
propagate them to their target neurons. This is the first signal that triggers the
rest of the activity. The signals propagate through the network, forcing neurons
to update state along the way, until the signal reaches the output neurons, and
then the computation is finished. Only the state in the output neurons, that is,
the output neurons’ responses, matter to an application as they comprise the
“answer,” ŷ .
Neurons are connected with edges. The edges are weighted by a real number.
The weights determine the behavior of the signals as received by the target
neuron – they are the crux of an ANN. It is the values of the weights that
determine whether an ANN is sine, cosine, eˣ – whatever the desired function.
The weights in Figure 2.1 make the ANN sine. The ANN in Figure 2.2 is a cosine.
The graphs in Figure 2.3 present their respective plots for 32 random points. Both
ANNs have the same topologies, but they have very different weights. It is the
weights that determine what an ANN computes. The topology can be thought
of as supporting the weights by ensuring that there is a sufficient number of them
to solve a problem; this is called the learning capacity of an ANN.
The weights of an ANN are the parameters of the model. The task of training
a neural network is determining the weights, w. This is reflected in the notation
ŷ = ANN(x; w) or ŷ = ANN(x|w) where ŷ is conditioned on the vector of parame-
ters, w. Given a training set, the act of training an ANN is reconciling the weights
in a model with the examples in the training set. The fitted weights should then
produce a model that emits the correct output for a given input. Training sets
sampled from sine and cosine produced the ANNs in the trigonometric examples,
respectively. Both sine and cosine are continuous functions, so building
models for them is an exercise in regression: we are explaining observed data from
the past to make predictions in the future.
The graph of an ANN can take many forms. Without loss of generality, but
for clarity of exposition, we choose a particular topology, the fully-connected

[Figure 2.2 here: the same network as Figure 2.1, redrawn with the cosine model’s weights.]
Figure 2.2 A trained ANN that has learnt the cosine function. The only differences with
the sine model are the weights. Both ANNs have the same topologies.

[Figure 2.3 here: two panels, “sine vs. ANN” and “cosine vs. ANN”, plotting the function values against θ.]

Figure 2.3 The output of two ANNs is superimposed on the ground truth for the trained
sine and cosine ANN models. The ANN predictions are plotted as crosses. The empty
circles are taken from the training set and comprise the ground truth.

feed-forward architecture, as the basis for all the ANNs that we will discuss.
The principles are the same for all ANNs, but the simplicity of the feed-forward
topology is pedagogically attractive.
Information in the feed-forward ANN flows acyclically from a single input layer,
through a number of hidden layers, and finally to a single output layer; the signals
are always moving forward through the graph. The ANN is specified as a set of
layers. Layers are sets of related, peer, neurons. A layer is specified as the number
of neurons that it contains, the layer’s width. The number of layers in an ANN is
referred to as its depth. All the neurons in a layer share source neurons, specified
as a layer, and target neurons, again, specified as a layer. All of a layer’s source
neurons form a layer as do its target neurons. There are no intralayer connections.
In the language of graph theory, isolating a layer produces a tripartite graph. Thus,
a layer is sandwiched between a shallower, source neuron layer, and a deeper target
layer.
The set of layers can be viewed as a stack. Consider the topology in Figure 2.1: with
respect to the stack analogy, the input layer is the top, or shallowest, layer, and
the output layer is the bottom, or deepest, layer. The argument is supplied to
the input layer and the answer read from the output layer. The layers between the
input and output layers are known as hidden layers.
It is the presence of hidden layers that characterizes an ANN as a deep learning
ANN. There is no consensus on how many hidden layers are required to qualify as
deep learning, but the loosest definition is at least 1 hidden layer. A single hidden
layer does not intuitively seem very deep, but its existence in an ANN does put
it in a different generation of model. Rosenblatt’s original implementations were
single layers of perceptrons, but he speculated on deeper arrangements in his book
(130). It was not clear what value multiple layers of perceptrons had given his
linear training methods. Modern deep learning models of 20+ hidden layers are
common, and they continue to grow deeper and wider.
The process of computing the result of an ANN begins with supplying the argu-
ment to the function at the input layer. Every input neuron in the input layer
receives a copy of the full ANN argument. Once every neuron in the input layer
has computed its state with the arguments to the ANN, the input layer is ready
to propagate the result to next layer. As the signals percolate through the ANN,
each layer accepts its source signals from the previous layer, computes the new
state, and then propagates the result to the next layer. This continues until the
final layer is reached; the output layer. The output layer contains the result of
the ANN.
To further simplify our task, we specify that the feed-forward ANN is fully con-
nected, sometimes also called dense. At any given layer, every neuron is connected
to every source neuron in the shallower layer; recursively, this implies that every
neuron in a given layer is a source neuron for the next layer.
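The propagation just described can be made concrete with a short C sketch. Each target neuron gathers the weighted responses of every source neuron in the shallower layer and applies an activation to the sum; the sigmoid activation, the sizes, and the weights below are all assumptions for illustration (the bias terms play the role of the constant 1 neurons drawn at the top of the figures):

#include <stdio.h>
#include <math.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Propagate one fully-connected layer: dst[i] is the activation of the
   weighted sum of all source responses plus a bias. */
static void forward_layer(int n_src, int n_dst,
                          double w[n_dst][n_src],   /* one weight per edge */
                          double bias[n_dst],
                          double src[n_src],        /* shallower responses */
                          double dst[n_dst]) {      /* this layer's responses */
    for (int i = 0; i < n_dst; ++i) {
        double z = bias[i];
        for (int j = 0; j < n_src; ++j)
            z += w[i][j] * src[j];
        dst[i] = sigmoid(z);
    }
}

int main(void) {
    /* A 3-2-1 topology like Figure 2.4, with made-up weights. */
    double x[3] = { 0.5, 0.5, 0.5 };    /* the input layer's responses */
    double h[2], y[1];
    double w1[2][3] = { { 0.1, -0.2, 0.3 }, { -0.4, 0.5, -0.6 } };
    double b1[2]    = { 0.0, 0.1 };
    double w2[1][2] = { { 0.7, -0.8 } };
    double b2[1]    = { 0.2 };
    forward_layer(3, 2, w1, b1, x, h);  /* input layer -> hidden layer */
    forward_layer(2, 1, w2, b2, h, y);  /* hidden layer -> output layer */
    printf("output: %f\n", y[0]);
    return 0;
}

The nested loops mirror the figures directly: the outer loop visits each target neuron, and the inner loop collects the weighted signal arriving on every edge from the shallower layer.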
[Figure 2.4 here: the sine network redrawn with its columns labeled Input layer, Hidden layer, and Output layer.]

Figure 2.4 The sine model in more detail. The layers are labeled. The left is shallowest
and the right is the deepest. Every neuron in a layer is fully connected with its shallower
neighbor. This ANN is specified by the widths of its layers, 3, 2, 1, so the ANN
has a depth of 3.

Let us now reexamine the ANN implementing sine in Figure 2.4 in terms of
layers. We see that there are 3 layers. The first layer, the input layer, has 3 neu-
rons. The input layer accepts the predictors, the inputs of the model. The hidden
layer has 2 neurons, and the output layer has one neuron. There can be as many
hidden layers as desired, and they can also be as wide as needed. The depth and
the layer widths are hyperparameters of the model. The number of layers and their
widths should be kept as limited as possible (93). As layers are added or widened,
the number of weights, that is, trainable parameters, grows rapidly, and with it the
cost of the ANN. Too many weights also lead to other problems that will be examined in later chapters.
As sine is a scalar function, there can be only one output neuron in the ANN’s
output layer; that is where the answer (sine) can be found. Notice, however, that
the number of input neurons is not similarly constrained. Sine has only one pre-
dictor (argument), but there can be any number of neurons in the
input layer. Each of them receives a copy of the argument.
The mechanics of signal propagation form the basis of how ANNs compute.
Having seen how the signals flow in a feed-forward ANN, it remains to examine how the individual neurons compute their responses.