FROM STATISTICAL PHYSICS TO DATA-DRIVEN
MODELLING
From Statistical Physics to Data-Driven
Modelling
with Applications to Quantitative Biology
Simona Cocco
Rémi Monasson
Francesco Zamponi
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Simona Cocco, Rémi Monasson, and Francesco Zamponi 2022
The moral rights of the authors have been asserted
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2022937922
ISBN 978–0–19–886474–5
DOI: 10.1093/oso/9780198864745.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Preface
…methods, and not act as mere consumers of statistical packages. We pursue this objective
without emphasis on mathematical rigour, but with a constant effort to develop in-
tuition and show the deep connections with standard statistical physics. While the
content of the book can be thought of as a minimal background for scientists in the
contemporary data era, it is by no means exhaustive. Our objective will be truly ac-
complished if readers then actively seek to deepen their experience and knowledge by
reading advanced machine learning or statistical inference textbooks.
As mentioned above, a large part of what follows is based on the course we gave
at ENS from 2017 to 2021. We are grateful to A. Di Gioacchino, F. Aguirre-Lopez,
and all the course students for carefully reading the manuscript and signalling the
typos or errors to us. We are also deeply indebted to Jean-François Allemand and Maxime
Dahan, who first thought that such a course, covering subjects not always part of the
standard curriculum in physics, would be useful, and who strongly supported us. We
dedicate the present book to the memory of Maxime, who tragically disappeared four
years ago.
1 Introduction to Bayesian inference
This first chapter presents basic notions of Bayesian inference, starting with the def-
initions of elementary objects in probability, and Bayes’ rule. We then discuss two
historically motivated examples of Bayesian inference, in which a single parameter has
to be inferred from data.
Fig. 1.1 A. A large complex system includes many components (black dots) that interact
together (arrows). B. An observer generally has access to a limited part of the system and
can measure the behaviour of the components therein, e.g. their characteristic activities over
time.
Fig. 1.2 Probabilistic description of models and data in the Bayesian framework. Each point
in the space of model parameters (triangle in the left panel) defines a distribution of possible
observations over the data space (shown by the ellipses). In turn, a specific data set (diamond
in right panel) corresponding to experimental measurements is compatible with a portion of
the model space: it defines a distribution over the models.
data and the system under investigation are considered to be stochastic. To be more
precise, we assume that there is a joint distribution of the data configurations and
of the defining parameters of the system (such as the sets of microscopic interactions
between the components or of external actions due to the environment), see figure 1.2.
The observations collected through an experiment can be thought of as a realisation
of the distribution of the configurations conditioned by the (unknown) parameters of
the system. The latter can thus be inferred through the study of the probability of the
parameters conditioned by the data. Bayesian inference offers a framework connecting
these two probability distributions, and allowing us in fine to characterise the system
from the data. We now introduce the definitions and notations needed to properly set
this framework.
We denote by
y ∈ {a1 , a2 , · · · , aq } = A , (1.1)
and by
pi = Prob(y = ai ) = p(y = ai ) = p(y) (1.2)
the probability that y takes a given value y = ai , which will be equivalently written
in one of the above forms depending on the context.
For a two dimensional variable
y = (y1 , y2 ) ∈ A × B (1.3)
the conditional probability of y1 given y2 is defined as
p(y1 |y2 ) = \frac{p(y1 , y2 )}{p(y2 )} .   (1.7)
Note that this definition makes sense only if p(y2 ) > 0, but of course if p(y2 ) = 0, then
also p(y1 , y2 ) = 0. The event y2 never happens, so the conditional probability does not
make sense. Furthermore, according to Eq. (1.7), p(y1 |y2 ) is correctly normalised:
\sum_{y1 ∈ A} p(y1 |y2 ) = \frac{\sum_{y1 ∈ A} p(y1 , y2 )}{p(y2 )} = 1 .   (1.8)
This simple identity has a deep meaning for inference, which we will now discuss.
Suppose that we have an ensemble of M “data points” yi ∈ RL , which we denote
by Y = {yi }i=1,··· ,M , generated from a model with D (unknown to the observer)
“parameters”, which we denote by θ ∈ RD . We can rewrite Bayes’ rule as
p(θ|Y ) = \frac{p(Y |θ)\, p(θ)}{p(Y )} .   (1.10)
The denominator p(Y ) is called the “evidence”, and is expressed in terms of the prior
and of the likelihood through
p(Y ) = \int dθ\, p(Y |θ)\, p(θ) .   (1.11)
where N is the (unknown) total number of tanks available to the Germans, and
Y = {y1 , · · · , yM } ∈ NM are the (known) factory numbers of the M destroyed tanks.
Note that, obviously, M ≤ yM ≤ N . Our goal is to infer N given Y .
In order to use Bayes’ rule, we first have to make an assumption about the prior
knowledge p(N ). For simplicity, we will assume p(N ) to be uniform in the interval
[1, Nmax ], p(N ) = 1/Nmax . A value of Nmax is easily estimated in practical applications,
but for convenience we can also take the limit Nmax → ∞ assuming p(N ) to be
a constant for all N . Note that in this limit p(N ) is not normalisable (it is called
“an improper prior”), but as we will see this may not be a serious problem: if M is
sufficiently large, the normalisation of the posterior probability is guaranteed by the
likelihood, and the limit Nmax → ∞ is well defined.
The second step consists in proposing a model of the observations and in computing
the associated likelihood. We will make the simplest assumption that the destroyed
tanks are randomly and uniformly sampled from the total number of available tanks.
Note that this assumption could be incorrect in practical applications, as for example
the Germans could have decided to send to the front the oldest tanks first, which
would bias the yi towards smaller values. The choice of the likelihood expresses our
modelling of the data generation process. If the true model is unknown, we have to
make a guess, which has an impact on our inference result.
There are \binom{N}{M} ways to choose M ordered numbers y1 < y2 < · · · < yM in [1, N ].
Assuming that all these choices have equal probability, the likelihood of a set Y given
N is
p(Y |N ) = \binom{N}{M}^{-1} \mathbb{1}(1 ≤ y1 < y2 < · · · < yM ≤ N ) ,   (1.13)
because all the ordered M -uples are equally probable. Here, 1(c) denotes the “indicator
function” of a condition c, which is one if c is satisfied, and zero otherwise.
Given the prior and the likelihood, using Bayes’ rule, we have
p(N |Y ) = \frac{p(Y |N )\, p(N )}{\sum_{N'} p(Y |N')\, p(N')} ,   (1.14)
p(N |Y ) = \frac{p(Y |N )}{\sum_{N'} p(Y |N')} = \frac{\binom{N}{M}^{-1} \mathbb{1}(N ≥ yM )}{\sum_{N' ≥ yM} \binom{N'}{M}^{-1}} = p(N |yM ; M ) .   (1.15)
Note that, because Y is now given and N is the variable, the condition N ≥ yM >
yM −1 > · · · > y1 ≥ 1 is equivalent to N ≥ yM , and as a result the posterior probability
of N depends on the number of data, M (here considered as a fixed constant), and on
the largest observed value, yM , only. It can be shown [1] that the denominator of the
posterior is
\sum_{N' ≥ yM} \binom{N'}{M}^{-1} = \binom{yM}{M}^{-1} \frac{yM}{M − 1} ,   (1.16)
p(N |yM ; M ) = \frac{\binom{N}{M}^{-1} \mathbb{1}(N ≥ yM )}{\binom{yM}{M}^{-1} \frac{yM}{M−1}} .   (1.17)
Fig. 1.3 Illustration of the posterior probability for the German tank problem, for M = 10
and yM = 100. The dashed vertical lines locate the typical (= yM ) and average values of N .
⟨N⟩ ≃ yM + \frac{yM − M}{M} + · · · ,   (1.19)
i.e. ⟨N⟩ is only slightly larger than yM . This formula has a simple interpretation.
For large M , we can assume that the yi are roughly equally spaced in the interval [1, yM ],
with average spacing ∆y = (yM − M )/M . The formula then says that the estimate
⟨N⟩ = yM + ∆y lies where the next observation would be expected.
3. The variance of N ,
σ_N² = ⟨N²⟩ − ⟨N⟩² = \frac{(M − 1)(yM − 1)(yM − M + 1)}{(M − 2)²(M − 3)} .   (1.20)
The variance gives us an estimate of the quality of the prediction for N . Note that
for large M , assuming yM ∝ M , we get σN /⟨N⟩ ∝ 1/M , hence the relative stan-
dard deviation decreases with the number of observations. Notice that Eq. (1.18)
for the mean and Eq. (1.20) for the variance make sense only if, respectively,
M > 2 and M > 3. Again, we see a consequence of the use of an improper prior:
moments of order up to K are defined only if the number M of observations is
at least K + 2.
For example, one can take M = 10 observations, and y10 = 100. A plot of the posterior
is given in figure 1.3. Then, ⟨N⟩ = 111.4, and the standard deviation is σN = 13.5.
One can also ask what the probability is that N is larger than some threshold, which
is arguably the most important piece of information if you are planning the Normandy
landings. With the above choices, one finds
p(N > 150|y10 = 100) = 2.2 · 10^{−2} ,
p(N > 200|y10 = 100) = 1.5 · 10^{−3} ,   (1.21)
p(N > 250|y10 = 100) = 2 · 10^{−4} .
More generally,
p(N > N_{lower bound} |yM ) ∝ \left( \frac{N_{lower bound}}{⟨N⟩} \right)^{−(M−1)} .   (1.22)
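As a concrete check of these numbers, the posterior of Eq. (1.17) can be evaluated numerically. The following minimal Python sketch (not part of the original text) sums the unnormalised weights \binom{N}{M}^{-1} up to a large cutoff N_cut, which plays the role of Nmax; it reproduces ⟨N⟩ ≈ 111.4, σN ≈ 13.5 and the tail probabilities of Eq. (1.21):

import numpy as np
from scipy.special import gammaln

def log_binom(n, k):
    # log of the binomial coefficient C(n, k), via log-Gamma functions
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def tank_posterior(y_max, M, N_cut=100000):
    # Posterior p(N | y_M, M) of Eq. (1.17) for N = y_max, ..., N_cut,
    # with the improper uniform prior truncated at N_cut
    N = np.arange(y_max, N_cut + 1)
    log_w = -log_binom(N, M)              # p(Y|N) = 1 / C(N, M)
    w = np.exp(log_w - log_w.max())
    return N, w / w.sum()

N, post = tank_posterior(y_max=100, M=10)
mean = np.sum(N * post)
std = np.sqrt(np.sum(N**2 * post) - mean**2)
print(mean, std)                          # ~ 111.4 and ~ 13.5
print(post[N > 150].sum(), post[N > 200].sum(), post[N > 250].sum())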
with α = y + 1 and β = M − y + 1, where Γ(x) = \int_0^∞ dθ\, θ^{x−1} e^{−θ} stands for the Gamma
function. The following properties of the beta distribution are known:
• The typical value of θ, i.e. the one which maximizes the Beta density, is θ∗ = \frac{α−1}{α+β−2} .
• The average value is ⟨θ⟩ = \frac{α}{α+β} .
• The variance is σ_θ² = \frac{αβ}{(α+β)²(α+β+1)} .
A first simple question is: what would be the distribution of θ if there were no girls
observed? In that case, we would have y = 0 and
The most likely value would then be θ∗ = 0, but the average value would be ⟨θ⟩ = 1/M .
This result is interesting: observing no girls among M births should not necessarily be
interpreted as meaning that their birth rate θ is really equal to zero, but rather that θ is
likely to be smaller than 1/M , as the expected number of events to be observed before
seeing one girl is 1/θ. From this point of view, it is more reasonable to estimate that
θ ∼ 1/M than θ = 0.
In the case of Laplace’s data, from the observed numbers, we obtain
The possibility that θ = 0.5 seems then excluded, because θ∗ ≈ ⟨θ⟩ differs from 0.5 by
much more than the standard deviation,
but we would like to quantify more precisely the probability that, yet, the “true” value
of θ is equal to, or larger than 0.5.
where
fθ∗ (θ) = −θ∗ log θ − (1 − θ∗ ) log(1 − θ) . (1.31)
A plot of fθ∗ (θ) for the value of θ∗ that corresponds to Laplace’s observations is given
in figure 1.4. fθ∗ (θ) has a minimum when its argument θ reaches the typical value θ∗ .
Fig. 1.4 Illustration of the posterior minus log-probability fθ∗ (θ) = − log p(θ|θ∗ , M )/M for
Laplace’s birth rate problem.
Because of the factor M in the exponent, for large M , a minimum of fθ∗ (θ) induces a
very sharp maximum in p(θ|θ∗ ; M ).
We can use this property to compute the normalisation factor at the denominator
in Eq. (1.24). Expanding fθ∗ (θ) in the vicinity of θ = θ∗ , we obtain
fθ∗ (θ) = fθ∗ (θ∗ ) + \frac{1}{2} (θ − θ∗ )² f''_{θ∗}(θ∗ ) + O((θ − θ∗ )³) ,   (1.32)
with
f''_{θ∗}(θ∗ ) = \frac{1}{θ∗ (1 − θ∗ )} .   (1.33)
In other words, next to its peak value the posterior distribution is roughly Gaussian,
and this statement is true for all θ away from θ∗ by deviations of the order of M −1/2 .
This property helps us compute the normalisation integral,
\int_0^1 dθ\, e^{−M fθ∗ (θ)} ∼ \int_0^1 dθ\, e^{−M [fθ∗ (θ∗ ) + \frac{1}{2θ∗ (1−θ∗ )}(θ−θ∗ )²]}
                       ≃ e^{−M fθ∗ (θ∗ )} × \sqrt{2πθ∗ (1 − θ∗ )/M} ,   (1.34)
because we can extend the Gaussian integration interval to the whole real line without
affecting the dominant order in M . We deduce the expression for the normalised
posterior density of birth rates,
p(θ|θ∗ ; M ) ∼ \frac{e^{−M [fθ∗ (θ) − fθ∗ (θ∗ )]}}{\sqrt{2πθ∗ (1 − θ∗ )/M}} .   (1.35)
In order to compute the integral in Eq. (1.29), we need to study the regime where
θ − θ∗ is of order 1, i.e. a “large deviation” of θ. For this, we use Eq. (1.35) and expand
it around θ = 0.5, i.e.
p(θ > 0.5|θ∗ ; M ) = \int_{0.5}^{1} dθ\, \frac{e^{−M [fθ∗ (θ) − fθ∗ (θ∗ )]}}{\sqrt{2πθ∗ (1 − θ∗ )/M}}
                  = \frac{e^{M fθ∗ (θ∗ )}}{\sqrt{2πθ∗ (1 − θ∗ )/M}} \int_{0.5}^{1} dθ\, e^{−M [fθ∗ (0.5) + f'_{θ∗}(0.5)(θ−0.5) + ··· ]}   (1.36)
                  ∼ \frac{e^{−M [fθ∗ (0.5) − fθ∗ (θ∗ )]}}{f'_{θ∗}(0.5) \sqrt{2πθ∗ (1 − θ∗ ) M}} .
With Laplace’s data, this expression for the posterior probability that θ ≥ 0.5 could
be evaluated to give
p(θ > 0.5|θ∗ ; M ) ∼ 1.15 · 10−42 , (1.37)
which provides a convincing statistical proof that, indeed, θ < 0.5.
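The number in Eq. (1.37) can be checked numerically. The sketch below (not part of the original text) compares the saddle-point formula of Eq. (1.36) with the exact tail of the Beta posterior; the actual counts behind Laplace’s data are not reproduced in this excerpt, so the total number of births M = 493 472 used here, together with θ∗ = 0.490291, is an assumption based on the historical Paris record:

import numpy as np
from scipy.stats import beta

M = 493_472              # assumed total number of recorded births (historical figure)
theta_star = 0.490291    # fraction of girls, as quoted in the text

def f(theta, ts=theta_star):
    # minus log-likelihood per observation, Eq. (1.31)
    return -ts * np.log(theta) - (1.0 - ts) * np.log(1.0 - theta)

# Saddle-point estimate of p(theta > 0.5 | theta*; M), Eq. (1.36)
fprime_half = 2.0 * (1.0 - 2.0 * theta_star)        # f'_{theta*}(0.5)
log_p = (-M * (f(0.5) - f(theta_star))
         - np.log(fprime_half * np.sqrt(2 * np.pi * theta_star * (1 - theta_star) * M)))
print(np.exp(log_p))                                 # ~ 1e-42, cf. Eq. (1.37)

# Exact tail of the Beta posterior, with alpha = y + 1, beta = M - y + 1
y = round(M * theta_star)                            # number of girls
print(beta.sf(0.5, y + 1, M - y + 1))                # same order of magnitude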
We conclude this discussion with three remarks:
1. The above calculation shows that the posterior probability that θ deviates from its
typical value θ∗ decays exponentially with the number of available observations,
p(θ − θ∗ > a|θ∗ ; M ) ∼ e^{−M [fθ∗ (θ∗ + a) − fθ∗ (θ∗ )]} .   (1.38)
where the denominator comes from the integration over the harmonic fluctuations
of the particle around the bottom of the potential. We see that the above expres-
sion is identical to Eq. (1.36) upon the substitutions θ∗ → x∗ , 0.5 → x, fθ∗ →
U, M → 1/T . Not surprisingly, having more observations reduces the uncertainty
about θ and thus the effective temperature.
1.5 Tutorial 1: diffusion coefficient from single-particle tracking
1.5.1 Problem
We consider a particle undergoing diffusive motion in the plane, with position r(t) =
(x(t), y(t)) at time t. The diffusion coefficient (supposed to be isotropic) is denoted by
D, and we assume that the average velocity vanishes. Measurements give access to the
positions (xi , yi ) of the particles at times ti , where i is a positive integer running from
1 to M .
Data:
Several trajectories of the particle can be downloaded from the book webpage2 , see
tutorial 1 repository. Each file contains a three-column array (ti , xi , yi ), where ti is the
time, xi and yi are the measured coordinates of the particle, and i is the measurement
index, running from 1 to M . The unit of time is seconds and displacements are in µm.
Questions:
1. Write a script to read the data. Start with the file dataN1000d2.5.dat, and plot the
trajectories in the (x, y) plane. What are their characteristics? How do they fill
the space? Plot the displacement ri = \sqrt{xi² + yi²} as a function of time. Write the
random-walk relation between displacement and time in two dimensions, defining
the diffusion coefficient D. Give a rough estimate of the diffusion coefficient from
the data.
2. Write down the probability density p({xi , yi }|D; {ti }) of the time series {xi , yi }i=1,...,M
given D, and considering the measurement times as known, fixed parameters. De-
duce, using Bayes’ rule, the posterior probability density for the diffusion coeffi-
cient, p(D|{xi , yi }; {ti }).
3. Calculate analytically the most likely value of the diffusion coefficient, its average
value, and its variance, assuming a uniform prior on D.
4. Plot the posterior distribution of D obtained from the data. Compute, for the
given datasets, the values of the mean and of the variance of D, and its most
likely value. Compare the results obtained with different numbers M of measurements.
2 https://github.com/StatPhys2DataDrivenModel/DDM_Book_Tutorials
5. Imagine that the data correspond to a spherical object diffusing in water (of
viscosity η = 10^{−3} Pa s). Use the Einstein-Stokes relation,
D = \frac{kB T}{6πη ℓ} ,   (1.40)
(here ℓ is the radius of the spherical object and η is the viscosity of the medium) to
deduce the size of the object. Biological objects going from molecules to bacteria
display diffusive motions, and have characteristic size ranging from nm to µm.
For proteins ` ≈ 1 − 10 nm, while for viruses ` ≈ 20 − 300 nm and for bacteria,
` ≈ 2 − 5 µm. Among the molecules or organisms described in table 1.1, which
ones could have a diffusive motion similar to that displayed by the data?
object                                    ℓ (nm)
small protein (lysozyme, 100 residues)    1
large protein (1000 residues)             10
influenza virus                           100
small bacterium (E. coli)                 2000
Table 1.1 Characteristic lengths for several biological objects.
6. In many cases the motion of particles is not confined to a plane. Assuming that
(xi , yi ) are the projections of the three-dimensional position of the particle in the
plane perpendicular to the imaging device (microscope), how should the procedure
above be modified to infer D?
1.5.2 Solution
Data Analysis. The trajectory in the (x, y) plane given in the data file for M = 1000
is plotted in figure 1.5A. It has the characteristics of a random walk: the space is not
regularly filled, but the trajectory densely explores one region before “jumping” to
another region.
The displacement r = \sqrt{x² + y²} as a function of time is plotted in figure 1.5B. On
average, it grows as the square root of the time, but on a single trajectory we observe
large fluctuations. The random walk in two dimensions is described by the relation
⟨r²(t)⟩ = 4 D t ,   (1.41)
where D is the diffusion coefficient whose physical dimensions are [D] = l² t^{−1}. Here
lengths are measured in µm and times in seconds. A first estimate of D from the data
can be obtained by just considering the largest time and estimating
D0 = \frac{r²(t_{max})}{4 t_{max}} ,   (1.42)
which gives D0 = 1.20 µm2 s−1 for the data set with M = 1000. Another estimate
of D can be obtained as the average of the square displacement from one data point
Fig. 1.5 Data file with M = 1000. A. Trajectory of the particle. B. Displacement from the
origin as a function of time.
to the next one divided by the time interval. We define the differences between two
successive positions and between two successive recording times,
δxi = xi+1 − xi ,   δyi = yi+1 − yi ,   δti = ti+1 − ti .   (1.43)
Note that i = 1, . . . , M − 1. The square displacement in a time step is δri² = δxi² + δyi²
and the estimate of D is
D1 = \frac{1}{4(M − 1)} \sum_{i=1}^{M−1} \frac{δri²}{δti} ,   (1.44)
giving D1 = 2.47 µm2 s−1 for the same data set. These estimates are compared with
the trajectory in figure 1.5B.
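A possible implementation of these two estimates (a sketch, not the official notebook; the displacement in D0 is measured from the first recorded position, which may differ slightly from the convention used for the quoted value) is:

import numpy as np

# Columns of the data file: t_i, x_i, y_i (times in s, positions in um)
data = np.loadtxt("dataN1000d2.5.dat")
t, x, y = data[:, 0], data[:, 1], data[:, 2]

# Crude estimate from the final displacement, Eq. (1.42)
r2_final = (x[-1] - x[0])**2 + (y[-1] - y[0])**2
D0 = r2_final / (4 * (t[-1] - t[0]))

# Estimate from the mean square displacement per time step, Eq. (1.44)
dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
D1 = np.mean((dx**2 + dy**2) / dt) / 4

print(f"D0 = {D0:.2f} um^2/s, D1 = {D1:.2f} um^2/s")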
Posterior distribution. Due to diffusion δxi and δyi are Gaussian random variables
with variances 2D δti . We have
p(δxi |D; δti ) = \frac{1}{\sqrt{4πD δti}}\, e^{−δxi²/(4 D δti)} ,   (1.45)
and p(δyi |D; δti ) has the same form. The probability of a time series of increments
{δxi , δyi }i=1,...,M −1 , given D is therefore:
p({δxi , δyi }|D; {δti }) = \prod_{i=1}^{M−1} \frac{1}{4πD δti}\, e^{−(δxi² + δyi²)/(4 D δti)}
                        = C\, e^{−B/D} D^{−(M−1)} ,   (1.46)
where C = \prod_{i=1}^{M−1} \frac{1}{4π δti} and B = \sum_{i=1}^{M−1} \frac{δri²}{4 δti} . Note that to infer D we do not need the
absolute values of (xi , yi ), but only their increments on each time interval.
We consider an improper uniform prior p(D) = const. This can be thought of as a uniform
prior in [Dmin , Dmax ], in the limit Dmin → 0 and Dmax → ∞. Thanks to the likelihood,
the posterior remains normalisable in this limit.
Note that, introducing
D∗ = \frac{B}{M − 1} = \frac{1}{4(M − 1)} \sum_{i=1}^{M−1} \frac{δri²}{δti} = D1 ,   (1.48)
p(D|D∗ ; M ) ∝ e^{−(M−1) f_{D∗}(D)} ,   f_{D∗}(D) = \frac{D∗}{D} + \log D .   (1.50)
The most likely value of D is precisely D∗ , which is the minimum of fD∗ (D) and the
maximum of the posterior, and coincides with the previous estimate D1 .
The average value of D can also be computed by the same change of variables,
⟨D⟩ = \frac{M − 1}{M − 3}\, D∗ ,   (1.51)
and its variance is
σ_D² = \frac{(M − 1)²}{(M − 3)²(M − 4)}\, (D∗)² .   (1.52)
Numerical analysis of the data. From the trajectories given in the data files we obtain
the results given in table 1.2. An example of the posterior distribution is given in
figure 1.6.
Note that for large values of M it is not possible to calculate directly the (M − 3)!
in the posterior distribution, Eq. (1.49). It is better to use Stirling’s formula,
(M − 3)! ≈ \sqrt{2π}\, (M − 3)^{M−3+1/2}\, e^{−(M−3)} ,   (1.53)
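In practice one can work directly in log space, replacing the factorial by the log-Gamma function. The sketch below assumes the inverse-Gamma normalisation B^{M−2}/(M − 3)!, a form consistent with Eq. (1.46) and the moments (1.51)-(1.52) (Eq. (1.49) itself is not reproduced in this excerpt), and also returns the summaries of Eqs. (1.48), (1.51) and (1.52):

import numpy as np
from scipy.special import gammaln

def log_posterior_D(D, dr2, dt):
    # log p(D | data) for 2d diffusion: inverse-Gamma form with the assumed
    # normalisation B^(M-2)/(M-3)!, evaluated via gammaln to avoid overflow
    M = len(dr2) + 1                     # number of recorded positions
    B = np.sum(dr2 / (4 * dt))
    return (M - 2) * np.log(B) - gammaln(M - 2) - B / D - (M - 1) * np.log(D)

def posterior_summaries(dr2, dt):
    # Most likely value, mean and standard deviation, Eqs. (1.48), (1.51), (1.52)
    M = len(dr2) + 1
    D_star = np.sum(dr2 / dt) / (4 * (M - 1))
    D_mean = (M - 1) / (M - 3) * D_star
    D_std = np.sqrt((M - 1)**2 / ((M - 3)**2 * (M - 4))) * D_star
    return D_star, D_mean, D_std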
Data file             M      D     D∗     ⟨D⟩    σD
dataN10d2.5.dat       10     2.5   1.43   1.84   0.75
dataN100d2.5.dat      100    2.5   2.41   2.46   0.25
dataN1000d2.5.dat     1000   2.5   2.47   2.48   0.08
Table 1.2 Results for the inferred diffusion constant and its standard deviation (all
in µm² s^{−1}), for trajectories of different length M .
Fig. 1.6 Posterior distribution for the diffusion coefficient D given the data, for several values
of M .
Diffusion constant and characteristic size of the diffusing object. The order of mag-
nitude of the diffusion constant can be obtained from the Einstein-Stokes relation
D = \frac{kB T}{6πη ℓ} , where ℓ is the radius of the object (here considered as spherical), and
η is the viscosity of the medium. Considering the viscosity of water, η = 10^{−3} Pa s,
and kB T = 4 × 10^{−21} J, one obtains the orders of magnitude given in table 1.3.
Therefore, the data could correspond to an influenza virus diffusing in water.
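This order of magnitude is easily checked by inverting Eq. (1.40) for ℓ, as in the short sketch below (the value D ≈ 2.5 µm² s^{−1} is the one inferred from the data):

import numpy as np

kT = 4e-21      # J, room temperature (value used in the text)
eta = 1e-3      # Pa s, viscosity of water
D = 2.5e-12     # m^2/s, i.e. 2.5 um^2/s as inferred from the data

# Invert the Einstein-Stokes relation D = kT / (6 pi eta l) for the radius l
ell = kT / (6 * np.pi * eta * D)
print(f"radius ~ {ell * 1e9:.0f} nm")    # ~ 85 nm, the scale of an influenza virus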
Ref. [5] reports the following values: for a small protein (lysozyme) D = 10^{−6} cm² s^{−1},
and for a tobacco virus D = 4 · 10^{−8} cm² s^{−1}, in agreement with the above orders of
magnitude. In Ref. [4], the diffusion coefficient of protein complexes inside bacteria,
with widths approximately equal to 300 − 400 nm, is estimated to be equal to
D = 10^{−2} µm² s^{−1}. Differences with the order of magnitude given above are due to
the fact that the diffusion is confined and the medium is the interior of the cell, with
larger viscosity than water.
2 Asymptotic inference and information
In this chapter we will consider the case of asymptotic inference, in which a large
number of data is available and a comparatively small number of parameters have
to be inferred. In this regime, there exists a deep connection between inference and
information theory, whose description will require us to introduce the notions of en-
tropy, Fisher information, and Shannon information. Last of all, we will see how the
maximum entropy principle is, in practice, related to Bayesian inference.
where the average is over p. The name “cross entropy” derives from the fact that this
is not properly an entropy, because p and q are different. It coincides with the entropy
for p = q,
Sc (p, p) ≡ S(p) = − \sum_y p(y) \log p(y) .   (2.4)
so that
−DKL (p||q) = ⟨log z(y)⟩_p ≤ ⟨z(y)⟩_p − 1 = 0 ,   (2.8)
as a consequence of the concavity inequality log x ≤ x − 1. Note that the equality is
reached if z = 1 for all y such that p(y) > 0. As a consequence DKL (p||q) is positive
and vanishes for p = q only.
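These definitions are straightforward to implement for discrete distributions; the following minimal sketch (not part of the original text) checks the positivity of the KL divergence on the birth-rate example:

import numpy as np

def cross_entropy(p, q):
    # S_c(p, q) = - sum_y p(y) log q(y), with natural logarithms
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    # S(p) = S_c(p, p), Eq. (2.4)
    return cross_entropy(p, p)

def kl_divergence(p, q):
    # D_KL(p||q) = S_c(p, q) - S(p) >= 0, with equality only for p = q
    return cross_entropy(p, q) - entropy(p)

p = [0.490291, 0.509709]     # girl/boy frequencies of Laplace's example
q = [0.5, 0.5]
print(kl_divergence(p, q))   # ~ 1.9e-4, cf. Eq. (2.17)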
We have rewritten in Eq. (2.9) the product of the likelihoods of the data points as
the exponential of the sum of their logarithms. This is useful because, while prod-
ucts of random variables converge badly, sums of many random variables enjoy nice
convergence properties. To be more precise, let us fix θ; the log-likelihood of a data
point, say, yi , is a random variable with value log p(yi |θ), because the yi are randomly
extracted from p(yi |θ̂). The law of large numbers ensures that, with probability one1 ,
\frac{1}{M} \sum_{i=1}^{M} \log p(yi |θ) \xrightarrow{M→∞} \int dy\, p(y|θ̂) \log p(y|θ) .   (2.10)
where
Sc (θ̂, θ) = − \int dy\, p(y|θ̂) \log p(y|θ)   (2.12)
is precisely the cross entropy Sc (p, q) of the true distribution p(y) = p(y|θ̂) and of the
inferred distribution q(y) = p(y|θ).
As shown in section 2.1.2, the cross entropy can be expressed as the sum of the
entropy and of the KL divergence, see Eq. (2.5),
Sc (θ̂, θ) = S(θ̂) + DKL (θ̂||θ) ,   (2.13)
where we use the shorthand notation DKL (θ̂||θ) = DKL (p(y|θ̂)||p(y|θ)). Due to the
positivity of the KL divergence, the cross entropy Sc (θ̂, θ) enjoys two important prop-
erties:
¹ If we furthermore assume that the variance of such a random variable exists,
\int dy\, p(y|θ̂) \big[\log p(y|θ)\big]² < ∞ ,
then the distribution of the average of M such random variables becomes Gaussian with mean given
by the right hand side of Eq. (2.10) and variance scaling as 1/M .
Fig. 2.1 Illustration of the evolution of the posterior probability with the number M of
data, here for Laplace’s birth rate problem, see Eq. (1.35). Another example was given in
figure 1.6.
p(θ|Y ) = \frac{e^{−M Sc (θ̂,θ)}}{\int dθ\, e^{−M Sc (θ̂,θ)}} .   (2.14)
In the large–M limit the integral in the denominator is dominated by the minimal
value of Sc (θ̂, θ) at θ = θ̂, equal to the entropy S(θ̂) so we obtain, to exponential
order in M ,
p(θ|Y ) ∼ e−M [Sc (θ̂,θ)−S(θ̂)] = e−M DKL (θ̂||θ) . (2.15)
In figure 2.1 we show a sketch of the posterior distribution for Laplace’s problem
discussed in section 1.4. The concentration of the posterior with increasing values of
M is easily observed.
The KL divergence DKL (θ̂||θhyp ) controls how the posterior probability of the
hypothesis θ = θhyp varies with the number M of accumulated data. More precisely,
for θhyp 6= θ̂, the posterior probability that θ ≈ θhyp is exponentially small in M . For
any small ,
Prob(|θ − θhyp | < ) = e−M DKL (θ̂||θhyp )+O(M ) , (2.16)
and the rate of decay is given by the KL divergence DKL (θ̂||θhyp ) > 0. Hence, the
probability that |θhyp − θ| < ε becomes extremely small for M ≫ 1/DKL (θ̂||θhyp ).
The inverse of DKL can therefore be interpreted as the number of data needed to
recognise that the hypothesis θhyp is wrong.
We have already seen an illustration of this property in the study of Laplace’s birth
rate problem in section 1.4 (and we will see another one in section 2.1.6). The real value
θ̂ of the birth rate of girls was unknown but could be approximated by maximizing
the posterior, with the result θ̂ ≈ θ∗ = 0.490291. We then asked the probability that
girls had (at least) the same birth rate as boys, i.e. of θhyp ≥ 0.5. The cross entropy
of the binary random variable yi = 0 (girl) or 1 (boy) is Sc (θ̂, θ) = fθ̂ (θ) defined in
Eq. (1.31) and shown in figure 1.4. The rate of decay of this hypothesis is then
DKL (θ̂||θhyp ) = fθ̂ (θhyp ) − fθ̂ (θ̂) ≈ 1.88 · 10^{−4} ,   (2.17)
meaning that about 5000 observations are needed to rule out that the probability that
a newborn will be a girl is larger than 0.5. This is smaller than the actual number
M of available data by two orders of magnitude, which explains the extremely small
value of the probability found in Eq. (1.37).
2.1.5 Irrelevance of the prior in asymptotic inference
Let us briefly discuss the role of the prior in asymptotic inference. In presence of a
non-uniform prior over θ, p(θ) ∼ exp(−π(θ)), Eq. (2.11) is modified into
p(θ|Y ) ∝ p(Y |θ) p(θ) ≈ exp[−M Sc (θ̂, θ) − π(θ)] .   (2.18)
All the analysis of section 2.1.4 can be repeated, with the replacement
Sc (θ̂, θ) → Sc (θ̂, θ) + \frac{1}{M} π(θ) .   (2.19)
Provided π(θ) is a finite and smooth function of θ, the inclusion of the prior becomes
irrelevant for M → ∞, as it modifies the cross entropy by a term of the order of
1/M . In other words, because the likelihood is exponential in M and the prior does
not depend on M , the prior is irrelevant in the large M limit. Of course, this general
statement holds only if p(θ̂) > 0, i.e. if the correct value of θ is not excluded a priori.
This observation highlights the importance of avoiding imposing a too restrictive prior.
2.1.6 Variational approximation and bounds on free energy
In most situations, the distribution from which data are drawn may be unknown, or
intractable. A celebrated example borrowed from statistical mechanics is given by the
Ising model over L binary spins (variables) y = (y1 = ±1, · · · , yL = ±1), whose Gibbs
distribution reads (in this section we work with temperature T = 1 for simplicity)
pG (y) = \frac{e^{−H(y)}}{Z} ,   with   H(y) = −J \sum_{⟨ij⟩} yi yj .   (2.20)
If the sum runs only over pairs ⟨ij⟩ of variables that are nearest neighbour on a
d–dimensional cubic lattice, the model is intractable: the calculation of Z requires
Fig. 2.2 Illustration of the Kullback-Leibler divergence between the intractable distribution
pG (y) and a variational family pm (y).
where S(pm ) is the entropy of pm . Hence, the intractable free energy F is bounded
from above by a variational free energy Fm that depends on m. Minimising DKL (pm ||pG )
(see figure 2.2) or, equivalently, Fm , allows us to obtain the best (lowest) upper bound
to F .
As an illustration let us consider the Ising model in Eq. (2.20) again, defined on a d-
dimensional cubic lattice. We choose the variational family of factorized distributions,
pm (y) = \prod_{i=1}^{L} \frac{1 + m yi}{2} ,   (2.24)
which is much easier to handle because the variables are independent. Note that here
m = \sum_y yi pm (y) represents the magnetisation of any spin i within distribution pm ,
and is a real number in the [−1, 1] range. It is straightforward to compute the average
value of the Ising energy and of the entropy, with the results
⟨H(y)⟩_{pm} = − \frac{1}{2}\, J L (2d)\, m² ,   (2.25)
and
S(pm ) = L \Big[ − \frac{1+m}{2} \log \frac{1+m}{2} − \frac{1−m}{2} \log \frac{1−m}{2} \Big] .   (2.26)
Minimisation of Fm over m shows that the best magnetisation is the solution of the
self-consistent equation,
mopt = tanh(2d J mopt ) ,   (2.27)
which coincides with the well-known equation for the magnetisation in the mean-field
approximation to the Ising model. Hence, the mean-field approximation can be seen as
a search for the best independent-spin distribution, in the sense of having the smallest
KL divergence with the Ising distribution.
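Equation (2.27) is easily solved by fixed-point iteration, as in the sketch below (not part of the original text); the nonzero solution only exists above the mean-field critical point 2dJ = 1:

import numpy as np

def mean_field_magnetisation(J, d, m0=0.5, tol=1e-12, max_iter=100000):
    # Fixed-point iteration of m <- tanh(2 d J m), Eq. (2.27)
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(2 * d * J * m)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m_new

print(mean_field_magnetisation(J=0.3, d=3))   # 2dJ = 1.8 > 1: m_opt ~ 0.93
print(mean_field_magnetisation(J=0.1, d=3))   # 2dJ = 0.6 < 1: m_opt ~ 0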
In addition, we know, from section 2.1.4, that the gap ∆F = Fm − F is related to
the quality of the approximation. Suppose somebody gives you M configurations y of
L Ising spins drawn from pm , and asks you whether they were drawn from the Gibbs
distribution pG or from pm . Our previous analysis shows that it is possible to answer
in a reliable way only if M ≫ 1/∆F . Minimising over m may thus be seen as choosing
the value of m that makes the answer as hard as possible, i.e. for which the similarity
between the configurations produced by the two distributions is the largest.
The results of section 2.1 show that when many data are available, M → ∞, these
two estimators coincide, and also coincide with the ground truth, θ ∗ = θ̂.
In words, unbiased estimators provide the correct prediction θ̂ if they are fully av-
eraged over the data distribution. By constrast, biased estimators may overestimate
or underestimate the true value of the parameters in a systematic way. Note that the
MAP and Bayesian estimators defined above are in general biased, except for M → ∞.
Examples of unbiased estimators are the following.
• Consider M independent and identically distributed random variables yi with
mean µ and variance V . An unbiased estimator for the mean is
µ∗ (Y ) = \frac{1}{M} \sum_i yi .   (2.31)
The average of each yi is in fact equal to µ, and so is the sample average µ∗ (Y ).
An unbiased estimator for the variance is V∗ (Y ) = \frac{1}{M−1} \sum_i (yi − µ∗ (Y ))² , as can
be checked by the reader. Note the presence of the factor M − 1 instead of the
naive M at the denominator.
• For the German tank problem discussed in section 1.3, the reader can show that
the estimator for the number of tanks, N ∗ = yM + ∆y with ∆y = yM − yM −1 is
unbiased.
Unbiased estimators whose variance goes to zero when M → ∞ are particularly de-
sirable. We now discuss a bound on the variance of an unbiased estimator, known as
the Cramer-Rao bound.
Asymptotically, the Fisher information gives therefore the variance of θ when sampled
from the posterior distribution of θ,
Var(θ) = \frac{1}{M Iy (θ̂)} ,   (2.34)
so M should be larger than 1/Iy (θ̂) for the inference to be in the asymptotic regime.
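As a simple illustration (an added sketch, using the standard Bernoulli result not spelled out in this excerpt): for a Bernoulli variable with parameter θ, the Fisher information of a single observation is Iy(θ) = 1/[θ(1 − θ)], and the variance of the sample mean, an unbiased estimator, saturates the bound 1/(M Iy(θ̂)):

import numpy as np

rng = np.random.default_rng(0)
theta_hat, M, n_rep = 0.3, 100, 50000

# Fisher information of a single Bernoulli observation
I_y = 1.0 / (theta_hat * (1.0 - theta_hat))

# Unbiased estimator: the sample mean over M observations, repeated n_rep times
samples = rng.binomial(1, theta_hat, size=(n_rep, M))
theta_star = samples.mean(axis=1)
print(theta_star.var(), 1.0 / (M * I_y))   # both ~ theta(1-theta)/M = 2.1e-3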
Outside of the asymptotic regime, the equality in Eq. (2.34) becomes a lower bound
for the variance, called the Cramer-Rao bound. For any unbiased estimator θ∗ (Y ),
Var(θ∗ ) = \sum_Y p(Y |θ̂)\, [θ∗ (Y ) − θ̂]² ≥ \frac{1}{I_Y^{total}(θ̂)} ,   (2.35)
where
I_Y^{total}(θ̂) = − \sum_Y p(Y |θ̂)\, \frac{∂²}{∂θ²} \log p(Y |θ) \Big|_{θ=θ̂}   (2.36)
is the total Fisher information of the joint distribution of the M data points.
Proof of the Cramer-Rao bound. First we note that for an unbiased estimator
\sum_Y p(Y |θ̂)\, (θ∗ (Y ) − θ̂) = 0   (2.37)
is true for each θ̂. We can then differentiate the above equation with respect to θ̂, and
get
\sum_Y \frac{∂ p(Y |θ̂)}{∂ θ̂}\, (θ∗ (Y ) − θ̂) − 1 = 0 ,   (2.38)
which can be rewritten as
\sum_Y p(Y |θ̂)\, \frac{∂ \log p(Y |θ̂)}{∂ θ̂}\, (θ∗ (Y ) − θ̂) = 1 .   (2.39)
Introducing
α(Y ) = \sqrt{p(Y |θ̂)}\, \frac{∂ \log p(Y |θ̂)}{∂ θ̂} ,
β(Y ) = \sqrt{p(Y |θ̂)}\, (θ∗ (Y ) − θ̂) ,   (2.40)
Eq. (2.39) reads \sum_Y α(Y ) β(Y ) = 1, so that the Cauchy-Schwarz inequality can be applied
to obtain
\Big\{ \sum_Y p(Y |θ̂) \Big[ \frac{∂}{∂ θ̂} \log p(Y |θ̂) \Big]² \Big\} \Big\{ \sum_Y p(Y |θ̂)\, (θ∗ (Y ) − θ̂)² \Big\} ≥ 1 ,   (2.42)
which gives
Var(θ∗ (Y )) ≥ \frac{1}{\sum_Y p(Y |θ̂) \big[ \frac{∂}{∂ θ̂} \log p(Y |θ̂) \big]²} .   (2.43)
Observing that
I_Y^{total}(θ̂) = − \sum_Y p(Y |θ̂)\, \frac{∂²}{∂ θ̂²} \log p(Y |θ̂) = − \sum_Y \frac{∂²}{∂ θ̂²} p(Y |θ̂) + \sum_Y \frac{1}{p(Y |θ̂)} \Big( \frac{∂ p(Y |θ̂)}{∂ θ̂} \Big)²
             = \sum_Y p(Y |θ̂) \Big( \frac{∂ \log p(Y |θ̂)}{∂ θ̂} \Big)² ,   (2.44)
because the term \sum_Y \frac{∂²}{∂ θ̂²} p(Y |θ̂) = \frac{∂²}{∂ θ̂²} \sum_Y p(Y |θ̂) = 0 vanishes by normalisation of
p(Y |θ̂), we recover the Cramer-Rao bound, Eq. (2.35).
The fluctuations of the estimators around their unbiased averages are quantified by the
covariance matrix,
Cij = Cov(θi∗ , θj∗ ) = \sum_Y p(Y |θ̂)\, (θi∗ (Y ) − θ̂i )(θj∗ (Y ) − θ̂j ) .   (2.46)
The Cramer-Rao bound then states that Mij = Cij − (I −1 )ij is a positive definite
matrix.
Fig. 2.3 Illustration of the elementary gain in information following the observation of an
event of probability p, I(p) = C log(1/p); here, C = 1/ log 2, and the information is expressed
in bits.
data. By contrast, if the score is large (positive or negative), then the data bring
a lot of information on the parameters. The average value of the score is zero
⟨s⟩ = \sum_Y p(Y |θ)\, \frac{∂}{∂θ} \log p(Y |θ) = \sum_Y \frac{∂ p(Y |θ)}{∂θ} = 0 ,   (2.48)
Hence, larger Fisher information corresponds to larger score variance, and, infor-
mally speaking, to better model identifiability.
• The Fisher information is degraded upon transformation: y → F (y), where F is
a deterministic or random function,
The extreme case is when F is equal to a constant, in which case all the informa-
tion is lost. This property is expected to be valid for any measure of information.
function I(p), which characterises the elementary information carried by the observa-
tion of an event of probability p, should be monotonically decreasing with p; such a
candidate function is sketched in figure 2.3.
Admissible I(p) must fulfill additional requirements:
• I(p = 1) = 0: there is no gain of information following the observation of an event
happening with certainty.
• I(p) ≥ 0: the information is always positive.
• I(p12 ) = I(p1 ) + I(p2 ) if p12 = p1 p2 : the information is additive for independent
events.
• I(p12 ) ≤ I(p1 )+I(p2 ): for correlated events, realisation of the first event is already
telling us something about the possible outcomes for the second event. Hence, the
information gain from the second event is smaller than what we would have if we
had not observed the first event, i.e. I(p2 ).
These properties, together with a more sophisticated constraint of composability [6],
are satisfied by a unique function
I(p) = C \log \frac{1}{p} ,   (2.51)
2 To see that, consider the KL divergence between a distribution p(y) and the uniform distribution,
punif (y):
DKL (p||punif ) = \sum_y p(y) \log_2 [p(y)/punif (y)] = L − S(p) .   (2.53)
Boltzmann constant kB rather than 1/ log 2. In statistical physics the entropy charac-
terizes the degeneracy of microscopic configurations corresponding to a given macro-
scopic state. Based on Shannon’s interpretation we may see the entropy as how much
information we are missing about the microscopic configurations when knowing the
macroscopic state of the system only.
where
S[p(y|x)] = − \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)} = − \sum_x p(x) \sum_y p(y|x) \log p(y|x)   (2.56)
equal to the average gain in information coming from the joint observation of x and y.
Fig. 2.4 The maximal entropy model is the distribution p1 with the smallest reduction in
Shannon entropy with respect to the uniform model among all possible models p1 , p2 , p3 , . . .
satisfying the constraints imposed by the data.
it is reasonable to suppose that all events are equally likely, i.e. p(y) = 1/N . The
entropy of this uniform distribution is log2 N , and is maximal as we have seen above.
This model is, in some sense, maximally “ignorant” and constrains the events as little
as possible.
Suppose now that we have some knowledge about the distribution of y, for example
we are given its average value,
m = ⟨y⟩ = \sum_y p(y)\, y .   (2.58)
What can we say about p(y)? This is in general a very ill-defined question, as there
are infinitely many distributions p(y) having m as their mean. Yet, not all these
distributions p(y) have the same Shannon entropy. The MEP stipulates that we should
choose p(y) with maximal entropy among the distributions fulfilling Eq. (2.58). In this
way the difference between the entropy of the maximally agnostic (uniform) model and
that of the constrained model is minimal, as sketched in figure 2.4. In other words, we
have used as little as possible the constraint imposed on us.
In order to find p(y), one then has to maximise its entropy, while enforcing the
normalisation constraint together with Eq. (2.58). This can be done by introducing
two Lagrange multipliers λ and µ. We define
S(p, λ, µ) = − \sum_y p(y) \log p(y) − λ \Big( \sum_y p(y) − 1 \Big) − µ \Big( \sum_y p(y)\, y − m \Big) ,   (2.59)
and write that the functional derivative with respect to the distribution should vanish
at the maximum
\frac{δS(p)}{δp(y)} = − \log p(y) − 1 − λ − µ y = 0 .   (2.60)
This gives
p(y) = \frac{e^{µ y}}{\sum_{y'} e^{µ y'}} ,   (2.61)
where we have computed λ to fulfill the normalisation constraint. The resulting distri-
bution is then exponential, and the value of µ is determined by imposing Eq. (2.58).
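Numerically, µ can be obtained by one-dimensional root finding; the sketch below (an illustration, not part of the original text) does so for a variable taking a finite set of values, e.g. a die constrained to have mean 4.5:

import numpy as np
from scipy.optimize import brentq

def maxent_distribution(values, m):
    # Maximum-entropy distribution p(y) ~ exp(mu*y) on a finite set of values,
    # with mu fixed by the constraint <y> = m, cf. Eqs. (2.58) and (2.61)
    values = np.asarray(values, float)

    def mean_minus_m(mu):
        logw = mu * values
        w = np.exp(logw - logw.max())       # stabilised Boltzmann weights
        p = w / w.sum()
        return np.sum(p * values) - m

    mu = brentq(mean_minus_m, -50.0, 50.0)  # assumes m lies strictly inside the range
    logw = mu * values
    w = np.exp(logw - logw.max())
    return mu, w / w.sum()

mu, p = maxent_distribution(np.arange(1, 7), 4.5)
print(mu, p)     # exponentially increasing probabilities over the six faces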
It is easy to extend the above calculation to higher-dimensional variables y =
(y1 , y2 , ..., yL ). For instance, if we are provided with the value of the means
mi = ⟨yi ⟩ = \sum_y p(y)\, yi ,   (2.62)
then the MEP tells us that the most parsimonious use of these constraints corresponds
to the distribution
p(y) = \frac{e^{\sum_i µi yi}}{\sum_{y'} e^{\sum_i µi y'_i}} ,   (2.63)
³ As an illustration, assume we are given the first and second moments, respectively m1 and m2 ,
of a random scalar variable y. According to the MEP the log-probability of this variable is a linear
combination of O1 (y) = y and O2 (y) = y², i.e. the maximal entropy distribution is Gaussian.
2.4 Tutorial 2: entropy and information in neural spike trains
2.4.1 Problem
The spiking activity of a population of L = 40 retina ganglion cells in response to visual
stimuli was recorded [11]. During the recording a natural movie, lasting T = 26.5 s, is
presented Nr = 120 times. We want to analyse the information and noise content of
the spike train for each neuron, as a function of its spiking frequency.
Data:
The spiking times can be downloaded from the book webpage4 , see tutorial 2 repos-
itory. Data are taken from Ref. [11]. The data file contains a one-column array (ti ),
where ti is the spiking time of neuron i in seconds and ranging between 0 and
Nr × T = 3180 s. They are separated by the number 4000 followed by the neuron
label going from 4 to 63 (not to be confused with the neuron index i, running from 1
to L = 40).
Questions:
In the tutorial 2 repository, you are given a Jupyter notebook tut2 start.ipynb that
reads the data and makes a raster plot of the spikes of the first neuron, for the first 10
seconds and the first 10 repetitions of the stimulus. Complete the notebook or write
your own code to answer the following questions.
1. Entropy of a Poisson process. To analyse the spike train of a single neuron, we
discretise the time in elementary time-bin windows of length ∆t, where typically
∆t = 10 ms, and translate the activity of the neuron in a time bin into a binary
symbol σ ∈ {0, 1}. Here, σ = 0 corresponds to no spike, while σ = 1 corresponds
to at least one spike.
1a. Theory: Consider the activity of the neuron as a Poisson process with fre-
quency f : write the entropy Sp (f, ∆t) and the entropy rate (entropy per unit
of time) of the Poisson process as a function of the frequency f and of the
time-bin width ∆t.
1b. Numerics: Plot the entropy Sp (f, ∆t) and the entropy rate obtained in point
1a in the range of experimentally observed frequencies, e.g. f = 4 Hz, as a
function of ∆t.
2. Information conveyed by the spike train about the stimulus. To study
the reproducibility of the neural activity following a stimulus, we consider words
4 https://github.com/StatPhys2DataDrivenModel/DDM_Book_Tutorials
extending over ℓ consecutive time bins, i.e. of ℓ symbols. For a word duration of
ℓ × ∆t = 100 ms with ∆t = 10 ms (i.e. ℓ = 10), there are NW = 2^{10} = 1024 possible such
words representing the activity of a neuron over 100 ms. As explained in Ref. [8] one can
extract from the data the probability of each possible word W and then estimate
the entropy of this distribution:
Stotal = − \sum_W p(W ) \log_2 p(W )   bits .   (2.64)
and, averaging over the time bins b, the noise entropy
Snoise = \Big⟨ − \sum_W p(W |b) \log_2 p(W |b) \Big⟩_b ,
where ⟨. . .⟩ denotes the average over all possible times b, with time resolution ∆t.
Snoise is called noise entropy because it reflects the non-deterministic response to
the stimulus. If the noise entropy is zero, the response is deterministic: the same
word will be repeated after each given stimulus. The average information I that
the spike train provides about the stimulus is defined as I = Stotal − Snoise .
2a. Theory: The probability p(W |b) can be thought of as the probability of the
random variable W conditioned to the stimulus at time b × ∆t, itself consid-
ered as a random variable; different times t correspond to different random
draws of the stimulus. Under this interpretation, show that I is the mutual
information between W and the stimulus, and as a consequence I ≥ 0.
2b. Numerics: What is the distribution of Stotal , Snoise and I over the 40 neural
cells and ` = 10 time bins? What is, on average, the variability of the spike
trains used to convey information about the stimulus? Plot Stotal (i), Snoise (i),
I(i), and the ratio I(i)/Stotal (i) for the i = 1 . . . 40 neurons as a function of
their spiking frequencies fi . Compare Stotal (i) with the entropy of a Poisson
process.
3. Error on the estimate of I due to the limited number of data. How does I
depend on the number of repetitions of the experiment used in the data analysis
and on `? Check if the quantity of information per second converges. Are 120
repetitions and ` = 10 enough?
4. Time-dependence of the neuronal activity. Plot the time-bin dependent
entropy S(b) for one neuron. For the same neuron, compute the average spiking
frequency f (b) in a small time interval around time bin b. Compare the frequency
modulation f (b) with S(b). Discuss the relation between S(b) and the entropy of
a Poisson process with modulated frequency f (b).
2.4.2 Solution
Analytical calculations:
1a. Consider a Poisson process with frequency f in a time bin ∆t. The probability
that the Poisson process emits no spike in the time interval ∆t, that is, of σ = 0,
is
p(σ = 0) = e−f ∆t , hence p(σ = 1) = 1 − e−f ∆t . (2.67)
The entropy in bits of such a binary variable is
Sp (f, ∆t) = − \sum_σ p(σ) \log_2 p(σ) = \frac{e^{−f ∆t} f ∆t − (1 − e^{−f ∆t}) \log(1 − e^{−f ∆t})}{\log 2} .   (2.68)
The entropy per unit of time is simply Sp (f, ∆t)/∆t.
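A direct implementation of Eq. (2.68) (a sketch, not the official notebook) is:

import numpy as np

def poisson_bin_entropy(f, dt):
    # Entropy in bits of the binary symbol sigma, Eq. (2.68); valid for dt > 0
    p0 = np.exp(-f * dt)       # p(sigma = 0): no spike in the bin
    p1 = 1.0 - p0              # p(sigma = 1): at least one spike
    return -(p0 * np.log2(p0) + p1 * np.log2(p1))

dt = np.linspace(1e-3, 2.0, 400)            # bin widths in seconds
S = poisson_bin_entropy(4.0, dt)            # f = 4 Hz, as in figure 2.6
rate = S / dt                               # entropy per unit time (bits/s)

The maximum of S is reached when p(σ = 0) = p(σ = 1) = 0.5, i.e. at ∆t = log 2/f ≈ 0.17 s for f = 4 Hz.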
2a. We consider a discrete time index of bins, b = 1, 2, · · · , B = T /∆t. Now, consider
that a time bin b (and hence a corresponding stimulus) is chosen randomly with
uniform probability p(b) = 1/B. We call p(b, W ) the joint probability of choosing
a bin b and a word W , and then p(W |b) = p(W, b)/p(b). We have
p(W ) = \sum_b p(W |b)\, p(b) ,   (2.69)
and
Snoise = \sum_b p(b)\, S(b) ,   S(b) = − \sum_W p(W |b) \log_2 p(W |b) .   (2.70)
We can therefore identify the noise entropy with the conditional entropy, Snoise =
S[p(W |b)], as defined in Eq. (2.56). If the relation between b and W is determin-
istic, then Snoise = 0. Therefore
Stotal − Snoise = − \sum_b p(b) \Big[ \sum_W p(W |b) \log_2 p(W ) − \sum_W p(W |b) \log_2 \frac{p(W, b)}{p(b)} \Big]
               = \sum_{W,b} p(W, b) \log_2 \frac{p(W, b)}{p(W )\, p(b)} = I(W, b) .   (2.71)
The average information I is the mutual information between the stimulus and
the neural activity, as defined in Eq. (2.54).
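As a bridge to the data analysis below, the following sketch (not the official notebook) shows how Stotal, Snoise and I could be estimated once the spike train of one neuron has been binarised into an array of 0/1 symbols of shape (number of repetitions, number of time bins); the binarisation step, the choice of word positions and any finite-sampling corrections are left out and may differ from the original analysis:

import numpy as np
from collections import Counter

def word_entropies(sigma, ell):
    # sigma[r, b] = 1 if the neuron spiked in time bin b of repetition r, else 0.
    # Words are the binary strings of ell consecutive bins; returns
    # (S_total, S_noise, I) in bits per word, cf. Eqs. (2.64), (2.70), (2.71).
    n_rep, B = sigma.shape

    def entropy(counts, n):
        p = np.array(list(counts.values()), float) / n
        return -np.sum(p * np.log2(p))

    words_by_bin = [[tuple(sigma[r, b:b + ell]) for r in range(n_rep)]
                    for b in range(B - ell + 1)]

    all_words = [w for ws in words_by_bin for w in ws]
    S_total = entropy(Counter(all_words), len(all_words))       # entropy of p(W)
    S_noise = np.mean([entropy(Counter(ws), n_rep)              # average of S(b)
                       for ws in words_by_bin])
    return S_total, S_noise, S_total - S_noise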
Data analysis:
The Jupyter notebook tutorial2.ipynb contains as parameters the time bin ∆t, the
length of the words `s , and the number of repetitions Nrp (out of the total Nr = 120)
that are considered to calculate the entropy.
The raster plot of data for the first neuron is shown in figure 2.5 for the first 10
seconds of recording and the first 10 repetitions of the stimulus. It can be seen that
the neuron fires in a quite precise and reproducible way during the stimulations.
Fig. 2.5 Raster plot of the activity of the first recorded neuron, in the first 10 seconds of
the recording, for the first 10 repetitions of the stimulus.
1b. The entropy Sp and the entropy rate Sp /∆t are plotted in figure 2.6 for a cell
spiking at f = 4 Hz as a function of ∆t. The entropies have a non-trivial behaviour
upon changing ∆t. The maximum of the entropy is reached for ∆t such that
p(σ = 0) = p(σ = 1) = 0.5; for other values of ∆t the entropy decreases. The
entropy per unit of time decreases upon increasing ∆t.
Fig. 2.6 Entropy (A) and entropy per unit time (B) of a Poissonian spiking event in a time
bin ∆t for a frequency f = 4 Hz, as a function of the time bin ∆t.
2b. Stotal (i), Snoise (i) and I(i), for the recorded neurons i = 1, · · · , 40 as a function of
their frequencies are plotted in figure 2.7, for ∆t = 10 ms and words of duration
`s = 100 ms, corresponding to ` = 10 binary symbols.
As shown in figure 2.7A, the total entropy rate varies between 1 and 25 bits/s.
Note that if the neurons were following a Poisson process at frequency fi , then
each word would be composed of independent bits, and the total entropy of
Fig. 2.7 Total entropy and noise entropy (A) and information rate (B) as a function of the
cell frequency for ∆t = 10 ms, ` = 10. In panel A, the full line is the entropy per unit time
of a Poisson process, Sp (f, ∆t = 10 ms).
neuron i would be
Stotal = ℓ Sp (fi , ∆t)   ⇒   \frac{Stotal}{ℓ × ∆t} = \frac{Sp (fi , ∆t)}{∆t} .   (2.72)
This curve is reported as a full line in figure 2.7A, and provides an upper bound
to the total entropy. Indeed, correlations between spikes lower the entropy with
respect to a purely Poisson process.
The noise entropy is approximately half of the total entropy, and therefore
the information rate I = (Stotal − Snoise )/∆t varies between 2 and 12 bits/s, see
figure 2.7B, and is roughly proportional to the neuron spiking frequency. The more
active cells have larger entropy and information rate. The orders of magnitude
of the entropies and information rates are in agreement with Ref. [9, figure 3],
which also studied recordings in the retina, for ganglion cells spiking at the same
frequencies.
Also, in good agreement with Ref. [8], in which a motion-sensitive neuron in the
fly visual system is studied, the noise entropy is of the order of half the total
entropy. So, on average, half of the variability of the spike train is used to convey
information about the stimulus. The ratio I/Stotal is shown in figure 2.8.
3. One can vary the number of repetitions of the experiment used in the data anal-
ysis: Nr = 1 can be used to check that the noise entropy is zero, while the total
entropy is not. Moreover one can study the dependence of the results with the
number of repetitions and check if Nr = 120 is enough to have a stable result.
In figure 2.9A, the information I is plotted versus Nr for one neuron (the first
recorded).
The entropy should reach a constant value in the infinite word-length limit, i.e.
for the value of `s = ` × ∆t at which the process is uncorrelated. As an example
we show in figure 2.9B the evolution of I with `s , for the first recorded neuron.
Very similar results are obtained, in both cases, for the other neurons.