FROM STATISTICAL PHYSICS TO DATA-DRIVEN
MODELLING
From Statistical Physics to Data-Driven
Modelling
with Applications to Quantitative Biology
Simona Cocco
Rémi Monasson
Francesco Zamponi
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Simona Cocco, Rémi Monasson, and Francesco Zamponi 2022
The moral rights of the authors have been asserted
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2022937922
ISBN 978–0–19–886474–5
DOI: 10.1093/oso/9780198864745.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Preface
…methods, and not act as mere consumers of statistical packages. We pursue this objective
without emphasis on mathematical rigour, but with a constant effort to develop in-
tuition and show the deep connections with standard statistical physics. While the
content of the book can be thought of as a minimal background for scientists in the
contemporary data era, it is by no means exhaustive. Our objective will be truly ac-
complished if readers then actively seek to deepen their experience and knowledge by
reading advanced machine learning or statistical inference textbooks.
As mentioned above, a large part of what follows is based on the course we gave
at ENS from 2017 to 2021. We are grateful to A. Di Gioacchino, F. Aguirre-Lopez,
and all the course students for carefully reading the manuscript and signalling the
typos or errors to us. We are also deeply indebted to Jean-François Allemand and Maxime
Dahan, who first thought that such a course, covering subjects not always part of the
standard curriculum in physics, would be useful, and who strongly supported us. We
dedicate the present book to the memory of Maxime, who tragically disappeared four
years ago.
1 Introduction to Bayesian inference
This first chapter presents basic notions of Bayesian inference, starting with the def-
initions of elementary objects in probability, and Bayes’ rule. We then discuss two
historically motivated examples of Bayesian inference, in which a single parameter has
to be inferred from data.
Fig. 1.1 A. A large complex system includes many components (black dots) that interact
together (arrows). B. An observer generally has access to a limited part of the system and
can measure the behaviour of the components therein, e.g. their characteristic activities over
time.
Fig. 1.2 Probabilistic description of models and data in the Bayesian framework. Each point
in the space of model parameters (triangle in the left panel) defines a distribution of possible
observations over the data space (shown by the ellipses). In turn, a specific data set (diamond
in right panel) corresponding to experimental measurements is compatible with a portion of
the model space: it defines a distribution over the models.
data and the system under investigation are considered to be stochastic. To be more
precise, we assume that there is a joint distribution of the data configurations and
of the defining parameters of the system (such as the sets of microscopic interactions
between the components or of external actions due to the environment), see figure 1.2.
The observations collected through an experiment can be thought of as a realisation
of the distribution of the configurations conditioned by the (unknown) parameters of
the system. The latter can thus be inferred through the study of the probability of the
parameters conditioned by the data. Bayesian inference offers a framework connecting
these two probability distributions, and allowing us in fine to characterise the system
from the data. We now introduce the definitions and notations needed to properly set
this framework.
We denote by
y ∈ {a1 , a2 , · · · , aq } = A , (1.1)
and by
pi = Prob(y = ai ) = p(y = ai ) = p(y) (1.2)
the probability that y takes a given value y = ai , which will be equivalently written
in one of the above forms depending on the context.
For a two dimensional variable
y = (y1 , y2 ) ∈ A × B (1.3)
the conditional probability of y1 given y2 is defined as
p(y1 |y2 ) = \frac{p(y1 , y2 )}{p(y2 )} .   (1.7)
Note that this definition makes sense only if p(y2 ) > 0, but of course if p(y2 ) = 0, then
also p(y1 , y2 ) = 0. The event y2 never happens, so the conditional probability does not
make sense. Furthermore, according to Eq. (1.7), p(y1 |y2 ) is correctly normalised:
\sum_{y1 ∈ A} p(y1 |y2 ) = \frac{\sum_{y1 ∈ A} p(y1 , y2 )}{p(y2 )} = 1 .   (1.8)
This simple identity has a deep meaning for inference, which we will now discuss.
Suppose that we have an ensemble of M “data points” yi ∈ RL , which we denote
by Y = {yi }i=1,··· ,M , generated from a model with D (unknown to the observer)
“parameters”, which we denote by θ ∈ RD . We can rewrite Bayes’ rule as
p(θ|Y ) = \frac{p(Y |θ)\, p(θ)}{p(Y )} .   (1.10)
The denominator p(Y ) is called the “evidence”, and is expressed in terms of the prior
and of the likelihood through
p(Y ) = \int dθ\, p(Y |θ)\, p(θ) .   (1.11)
where N is the (unknown) total number of tanks available to the Germans, and
Y = {y1 , · · · , yM } ∈ NM are the (known) factory numbers of the M destroyed tanks.
Note that, obviously, M ≤ yM ≤ N . Our goal is to infer N given Y .
In order to use Bayes’ rule, we first have to make an assumption about the prior
knowledge p(N ). For simplicity, we will assume p(N ) to be uniform in the interval
[1, Nmax ], p(N ) = 1/Nmax . A value of Nmax is easily estimated in practical applications,
but for convenience we can also take the limit Nmax → ∞ assuming p(N ) to be
a constant for all N . Note that in this limit p(N ) is not normalisable (it is called
“an improper prior”), but as we will see this may not be a serious problem: if M is
sufficiently large, the normalisation of the posterior probability is guaranteed by the
likelihood, and the limit Nmax → ∞ is well defined.
The second step consists in proposing a model of the observations and in computing
the associated likelihood. We will make the simplest assumption that the destroyed
tanks are randomly and uniformly sampled from the total number of available tanks.
Note that this assumption could be incorrect in practical applications, as for example
the Germans could have decided to send to the front the oldest tanks first, which
would bias the yi towards smaller values. The choice of the likelihood expresses our
modelling of the data generation process. If the true model is unknown, we have to
make a guess, which has an impact on our inference result.
There are \binom{N}{M} ways to choose M ordered numbers y1 < y2 < · · · < yM in [1, N ].
Assuming that all these choices have equal probability, the likelihood of a set Y given
N is
p(Y |N ) = \binom{N}{M}^{-1} \mathbb{1}(1 ≤ y1 < y2 < · · · < yM ≤ N ) ,   (1.13)
because all the ordered M -uples are equally probable. Here, 1(c) denotes the “indicator
function” of a condition c, which is one if c is satisfied, and zero otherwise.
Given the prior and the likelihood, using Bayes’ rule, we have
p(N |Y ) = \frac{p(Y |N )\, p(N )}{\sum_{N'} p(Y |N')\, p(N')} ,   (1.14)
p(N |Y ) = \frac{p(Y |N )}{\sum_{N'} p(Y |N')} = \frac{\binom{N}{M}^{-1} \mathbb{1}(N ≥ yM )}{\sum_{N' ≥ yM} \binom{N'}{M}^{-1}} = p(N |yM ; M ) .   (1.15)
Note that, because Y is now given and N is the variable, the condition N ≥ yM >
yM −1 > · · · > y1 ≥ 1 is equivalent to N ≥ yM , and as a result the posterior probability
of N depends on the number of data, M (here considered as a fixed constant), and on
the largest observed value, yM , only. It can be shown [1] that the denominator of the
posterior is
\sum_{N' ≥ yM} \binom{N'}{M}^{-1} = \binom{yM}{M}^{-1} \frac{yM}{M − 1} ,   (1.16)
p(N |yM ; M ) = \frac{\binom{N}{M}^{-1} \mathbb{1}(N ≥ yM )}{\binom{yM}{M}^{-1} \frac{yM}{M−1}} .   (1.17)
Fig. 1.3 Illustration of the posterior probability for the German tank problem, for M = 10
and yM = 100. The dashed vertical lines locate the typical (= yM ) and average values of N .
⟨N⟩ ≃ yM + \frac{yM − M}{M} + · · · ,   (1.19)
i.e. ⟨N⟩ is only slightly larger than yM . This formula has a simple interpretation.
For large M , we can assume that the yi are roughly equally spaced in the interval [1, yM ],
with average spacing ∆y = (yM − M )/M . The formula then says that the estimate
⟨N⟩ = yM + ∆y lies where the next observation would be expected.
3. The variance of N ,
σ_N² = ⟨N²⟩ − ⟨N⟩² = \frac{(M − 1)(yM − 1)(yM − M + 1)}{(M − 2)²(M − 3)} .   (1.20)
The variance gives us an estimate of the quality of the prediction for N . Note that
for large M , assuming yM ∝ M , we get σN /⟨N⟩ ∝ 1/M , hence the relative stan-
dard deviation decreases with the number of observations. Notice that Eq. (1.18)
for the mean and Eq. (1.20) for the variance make sense only if, respectively,
M > 2 and M > 3. Again, we see a consequence of the use of an improper prior:
moments of order up to K are defined only if the number M of observations is
at least K + 2.
For example, one can take M = 10 observations, and y10 = 100. A plot of the posterior
is given in figure 1.3. Then, ⟨N⟩ = 111.4, and the standard deviation is σN = 13.5.
One can also ask what the probability is that N is larger than some threshold, which
is arguably the most important piece of information if you are planning the Normandy
landings. With the above choices, one finds
p(N > 150|y10 = 100) = 2.2 · 10^{−2} ,
p(N > 200|y10 = 100) = 1.5 · 10^{−3} ,   (1.21)
p(N > 250|y10 = 100) = 2 · 10^{−4} .
More generally,
p(N > N_{lower bound} |yM ) ∝ \left( \frac{N_{lower bound}}{⟨N⟩} \right)^{−(M−1)} .   (1.22)
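As a concrete check of these numbers, the posterior of Eq. (1.17) can be evaluated numerically. The following minimal Python sketch (not part of the original text) sums the unnormalised weights \binom{N}{M}^{-1} up to a large cutoff N_cut, which plays the role of Nmax; it reproduces ⟨N⟩ ≈ 111.4, σN ≈ 13.5 and the tail probabilities of Eq. (1.21):

import numpy as np
from scipy.special import gammaln

def log_binom(n, k):
    # log of the binomial coefficient C(n, k), via log-Gamma functions
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def tank_posterior(y_max, M, N_cut=100000):
    # Posterior p(N | y_M, M) of Eq. (1.17) for N = y_max, ..., N_cut,
    # with the improper uniform prior truncated at N_cut
    N = np.arange(y_max, N_cut + 1)
    log_w = -log_binom(N, M)              # p(Y|N) = 1 / C(N, M)
    w = np.exp(log_w - log_w.max())
    return N, w / w.sum()

N, post = tank_posterior(y_max=100, M=10)
mean = np.sum(N * post)
std = np.sqrt(np.sum(N**2 * post) - mean**2)
print(mean, std)                          # ~ 111.4 and ~ 13.5
print(post[N > 150].sum(), post[N > 200].sum(), post[N > 250].sum())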
with α = y + 1 and β = M − y + 1, where Γ(x) = \int_0^∞ dθ\, θ^{x−1} e^{−θ} stands for the Gamma
function. The following properties of the beta distribution are known:
• The typical value of θ, i.e. the one which maximizes the Beta density, is θ∗ = \frac{α−1}{α+β−2} .
• The average value is ⟨θ⟩ = \frac{α}{α+β} .
• The variance is σ_θ² = \frac{αβ}{(α+β)²(α+β+1)} .
A first simple question is: what would be the distribution of θ if there were no girls
observed? In that case, we would have y = 0 and
The most likely value would then be θ∗ = 0, but the average value would be ⟨θ⟩ = 1/M .
This result is interesting: observing no girls among M births should not necessarily be
interpreted as meaning that their birth rate θ is really equal to zero, but rather that θ is
likely to be smaller than 1/M , as the expected number of events to be observed before
seeing one girl is 1/θ. From this point of view, it is more reasonable to estimate that
θ ∼ 1/M than θ = 0.
In the case of Laplace’s data, from the observed numbers, we obtain
The possibility that θ = 0.5 seems then excluded, because θ∗ ≈ ⟨θ⟩ differs from 0.5 by
much more than the standard deviation,
but we would like to quantify more precisely the probability that, yet, the “true” value
of θ is equal to, or larger than 0.5.
where
fθ∗ (θ) = −θ∗ log θ − (1 − θ∗ ) log(1 − θ) . (1.31)
A plot of fθ∗ (θ) for the value of θ∗ that corresponds to Laplace’s observations is given
in figure 1.4. fθ∗ (θ) has a minimum when its argument θ reaches the typical value θ∗ .
Fig. 1.4 Illustration of the posterior minus log-probability fθ∗ (θ) = − log p(θ|θ∗ , M )/M for
Laplace’s birth rate problem.
Because of the factor M in the exponent, for large M , a minimum of fθ∗ (θ) induces a
very sharp maximum in p(θ|θ∗ ; M ).
We can use this property to compute the normalisation factor at the denominator
in Eq. (1.24). Expanding fθ∗ (θ) in the vicinity of θ = θ∗ , we obtain
fθ∗ (θ) = fθ∗ (θ∗ ) + \frac{1}{2} (θ − θ∗ )² f''_{θ∗}(θ∗ ) + O((θ − θ∗ )³) ,   (1.32)
with
f''_{θ∗}(θ∗ ) = \frac{1}{θ∗ (1 − θ∗ )} .   (1.33)
In other words, next to its peak value the posterior distribution is roughly Gaussian,
and this statement is true for all θ away from θ∗ by deviations of the order of M −1/2 .
This property helps us compute the normalisation integral,
\int_0^1 dθ\, e^{−M fθ∗ (θ)} ∼ \int_0^1 dθ\, e^{−M [fθ∗ (θ∗ ) + \frac{1}{2θ∗ (1−θ∗ )}(θ−θ∗ )²]}
                       ≃ e^{−M fθ∗ (θ∗ )} × \sqrt{2πθ∗ (1 − θ∗ )/M} ,   (1.34)
because we can extend the Gaussian integration interval to the whole real line without
affecting the dominant order in M . We deduce the expression for the normalised
posterior density of birth rates,
p(θ|θ∗ ; M ) ∼ \frac{e^{−M [fθ∗ (θ) − fθ∗ (θ∗ )]}}{\sqrt{2πθ∗ (1 − θ∗ )/M}} .   (1.35)
In order to compute the integral in Eq. (1.29), we need to study the regime where
θ − θ∗ is of order 1, i.e. a “large deviation” of θ. For this, we use Eq. (1.35) and expand
it around θ = 0.5, i.e.
p(θ > 0.5|θ∗ ; M ) = \int_{0.5}^{1} dθ\, \frac{e^{−M [fθ∗ (θ) − fθ∗ (θ∗ )]}}{\sqrt{2πθ∗ (1 − θ∗ )/M}}
                  = \frac{e^{M fθ∗ (θ∗ )}}{\sqrt{2πθ∗ (1 − θ∗ )/M}} \int_{0.5}^{1} dθ\, e^{−M [fθ∗ (0.5) + f'_{θ∗}(0.5)(θ−0.5) + ··· ]}   (1.36)
                  ∼ \frac{e^{−M [fθ∗ (0.5) − fθ∗ (θ∗ )]}}{f'_{θ∗}(0.5) \sqrt{2πθ∗ (1 − θ∗ ) M}} .
With Laplace’s data, this expression for the posterior probability that θ ≥ 0.5 could
be evaluated to give
p(θ > 0.5|θ∗ ; M ) ∼ 1.15 · 10−42 , (1.37)
which provides a convincing statistical proof that, indeed, θ < 0.5.
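The number in Eq. (1.37) can be checked numerically. The sketch below (not part of the original text) compares the saddle-point formula of Eq. (1.36) with the exact tail of the Beta posterior; the actual counts behind Laplace’s data are not reproduced in this excerpt, so the total number of births M = 493 472 used here, together with θ∗ = 0.490291, is an assumption based on the historical Paris record:

import numpy as np
from scipy.stats import beta

M = 493_472              # assumed total number of recorded births (historical figure)
theta_star = 0.490291    # fraction of girls, as quoted in the text

def f(theta, ts=theta_star):
    # minus log-likelihood per observation, Eq. (1.31)
    return -ts * np.log(theta) - (1.0 - ts) * np.log(1.0 - theta)

# Saddle-point estimate of p(theta > 0.5 | theta*; M), Eq. (1.36)
fprime_half = 2.0 * (1.0 - 2.0 * theta_star)        # f'_{theta*}(0.5)
log_p = (-M * (f(0.5) - f(theta_star))
         - np.log(fprime_half * np.sqrt(2 * np.pi * theta_star * (1 - theta_star) * M)))
print(np.exp(log_p))                                 # ~ 1e-42, cf. Eq. (1.37)

# Exact tail of the Beta posterior, with alpha = y + 1, beta = M - y + 1
y = round(M * theta_star)                            # number of girls
print(beta.sf(0.5, y + 1, M - y + 1))                # same order of magnitude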
We conclude this discussion with three remarks:
1. The above calculation shows that the posterior probability that θ deviates from its
typical value θ∗ decays exponentially with the number of available observations,
p(θ − θ∗ > a|θ∗ ; M ) ∼ e^{−M [fθ∗ (θ∗ + a) − fθ∗ (θ∗ )]} .   (1.38)
where the denominator comes from the integration over the harmonic fluctuations
of the particle around the bottom of the potential. We see that the above expres-
sion is identical to Eq. (1.36) upon the substitutions θ∗ → x∗ , 0.5 → x, fθ∗ →
U, M → 1/T . Not surprisingly, having more observations reduces the uncertainty
about θ and thus the effective temperature.
1.5 Tutorial 1: diffusion coefficient from single-particle tracking
1.5.1 Problem
We consider a particle undergoing diffusive motion in the plane, with position r(t) =
(x(t), y(t)) at time t. The diffusion coefficient (supposed to be isotropic) is denoted by
D, and we assume that the average velocity vanishes. Measurements give access to the
positions (xi , yi ) of the particles at times ti , where i is a positive integer running from
1 to M .
Data:
Several trajectories of the particle can be downloaded from the book webpage2 , see
tutorial 1 repository. Each file contains a three-column array (ti , xi , yi ), where ti is the
time, xi and yi are the measured coordinates of the particle, and i is the measurement
index, running from 1 to M . The unit of time is seconds and displacements are in µm.
Questions:
1. Write a script to read the data. Start with the file dataN1000d2.5.dat, and plot the
trajectories in the (x, y) plane. What are their characteristics? How do they fill
the space? Plot the displacement ri = \sqrt{xi² + yi²} as a function of time. Write the
random-walk relation between displacement and time in two dimensions, defining
the diffusion coefficient D. Give a rough estimate of the diffusion coefficient from
the data.
2. Write down the probability density p({xi , yi }|D; {ti }) of the time series {xi , yi }i=1,...,M
given D, and considering the measurement times as known, fixed parameters. De-
duce, using Bayes’ rule, the posterior probability density for the diffusion coeffi-
cient, p(D|{xi , yi }; {ti }).
3. Calculate analytically the most likely value of the diffusion coefficient, its average
value, and its variance, assuming a uniform prior on D.
4. Plot the posterior distribution of D obtained from the data. Compute, for the
given datasets, the values of the mean and of the variance of D, and its most
likely value. Compare the results obtained with different numbers M of measurements.
2 https://github.com/StatPhys2DataDrivenModel/DDM_Book_Tutorials
5. Imagine that the data correspond to a spherical object diffusing in water (of
viscosity η = 10^{−3} Pa s). Use the Einstein-Stokes relation,
D = \frac{kB T}{6πη ℓ} ,   (1.40)
(here ℓ is the radius of the spherical object and η is the viscosity of the medium) to
deduce the size of the object. Biological objects going from molecules to bacteria
display diffusive motions, and have characteristic size ranging from nm to µm.
For proteins ` ≈ 1 − 10 nm, while for viruses ` ≈ 20 − 300 nm and for bacteria,
` ≈ 2 − 5 µm. Among the molecules or organisms described in table 1.1, which
ones could have a diffusive motion similar to that displayed by the data?
object                                    ℓ (nm)
small protein (lysozyme, 100 residues)    1
large protein (1000 residues)             10
influenza virus                           100
small bacterium (E. coli)                 2000
Table 1.1 Characteristic lengths for several biological objects.
6. In many cases the motion of particles is not confined to a plane. Assuming that
(xi , yi ) are the projections of the three-dimensional position of the particle in the
plane perpendicular to the imaging device (microscope), how should the procedure
above be modified to infer D?
1.5.2 Solution
Data Analysis. The trajectory in the (x, y) plane given in the data file for M = 1000
is plotted in figure 1.5A. It has the characteristics of a random walk: the space is not
regularly filled, but the trajectory densely explores one region before “jumping” to
another region.
The displacement r = \sqrt{x² + y²} as a function of time is plotted in figure 1.5B. On
average, it grows as the square root of the time, but on a single trajectory we observe
large fluctuations. The random walk in two dimensions is described by the relation
⟨r²(t)⟩ = 4 D t ,   (1.41)
where D is the diffusion coefficient whose physical dimensions are [D] = l² t^{−1}. Here
lengths are measured in µm and times in seconds. A first estimate of D from the data
can be obtained by just considering the largest time and estimating
D0 = \frac{r²(t_{max})}{4 t_{max}} ,   (1.42)
which gives D0 = 1.20 µm2 s−1 for the data set with M = 1000. Another estimate
of D can be obtained as the average of the square displacement from one data point
Fig. 1.5 Data file with M = 1000. A. Trajectory of the particle. B. Displacement from the
origin as a function of time.
to the next one divided by the time interval. We define the differences between two
successive positions and between two successive recording times,
δxi = xi+1 − xi ,   δyi = yi+1 − yi ,   δti = ti+1 − ti .   (1.43)
Note that i = 1, . . . , M − 1. The square displacement in a time step is δri² = δxi² + δyi²
and the estimate of D is
D1 = \frac{1}{4(M − 1)} \sum_{i=1}^{M−1} \frac{δri²}{δti} ,   (1.44)
giving D1 = 2.47 µm2 s−1 for the same data set. These estimates are compared with
the trajectory in figure 1.5B.
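A possible implementation of these two estimates (a sketch, not the official notebook; the displacement in D0 is measured from the first recorded position, which may differ slightly from the convention used for the quoted value) is:

import numpy as np

# Columns of the data file: t_i, x_i, y_i (times in s, positions in um)
data = np.loadtxt("dataN1000d2.5.dat")
t, x, y = data[:, 0], data[:, 1], data[:, 2]

# Crude estimate from the final displacement, Eq. (1.42)
r2_final = (x[-1] - x[0])**2 + (y[-1] - y[0])**2
D0 = r2_final / (4 * (t[-1] - t[0]))

# Estimate from the mean square displacement per time step, Eq. (1.44)
dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
D1 = np.mean((dx**2 + dy**2) / dt) / 4

print(f"D0 = {D0:.2f} um^2/s, D1 = {D1:.2f} um^2/s")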
Posterior distribution. Due to diffusion δxi and δyi are Gaussian random variables
with variances 2D δti . We have
p(δxi |D; δti ) = \frac{1}{\sqrt{4πD δti}}\, e^{−δxi²/(4 D δti)} ,   (1.45)
and p(δyi |D; δti ) has the same form. The probability of a time series of increments
{δxi , δyi }i=1,...,M −1 , given D is therefore:
p({δxi , δyi }|D; {δti }) = \prod_{i=1}^{M−1} \frac{1}{4πD δti}\, e^{−(δxi² + δyi²)/(4 D δti)}
                        = C\, e^{−B/D} D^{−(M−1)} ,   (1.46)
where C = \prod_{i=1}^{M−1} \frac{1}{4π δti} and B = \sum_{i=1}^{M−1} \frac{δri²}{4 δti} . Note that to infer D we do not need the
absolute values of (xi , yi ), but only their increments on each time interval.
We consider an improper uniform prior p(D) = const. This can be thought of as a uniform
prior in [Dmin , Dmax ], in the limit Dmin → 0 and Dmax → ∞. Thanks to the likelihood,
the posterior remains normalisable in this limit.
Note that, introducing
D∗ = \frac{B}{M − 1} = \frac{1}{4(M − 1)} \sum_{i=1}^{M−1} \frac{δri²}{δti} = D1 ,   (1.48)
p(D|D∗ ; M ) ∝ e^{−(M−1) f_{D∗}(D)} ,   f_{D∗}(D) = \frac{D∗}{D} + \log D .   (1.50)
The most likely value of D is precisely D∗ , which is the minimum of fD∗ (D) and the
maximum of the posterior, and coincides with the previous estimate D1 .
The average value of D can also be computed by the same change of variables,
⟨D⟩ = \frac{M − 1}{M − 3}\, D∗ ,   (1.51)
and its variance is
σ_D² = \frac{(M − 1)²}{(M − 3)²(M − 4)}\, (D∗)² .   (1.52)
Numerical analysis of the data. From the trajectories given in the data files we obtain
the results given in table 1.2. An example of the posterior distribution is given in
figure 1.6.
Note that for large values of M it is not possible to calculate directly the (M − 3)!
in the posterior distribution, Eq. (1.49). It is better to use Stirling’s formula,
(M − 3)! ≈ \sqrt{2π}\, (M − 3)^{M−3+1/2}\, e^{−(M−3)} ,   (1.53)
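In practice one can work directly in log space, replacing the factorial by the log-Gamma function. The sketch below assumes the inverse-Gamma normalisation B^{M−2}/(M − 3)!, a form consistent with Eq. (1.46) and the moments (1.51)-(1.52) (Eq. (1.49) itself is not reproduced in this excerpt), and also returns the summaries of Eqs. (1.48), (1.51) and (1.52):

import numpy as np
from scipy.special import gammaln

def log_posterior_D(D, dr2, dt):
    # log p(D | data) for 2d diffusion: inverse-Gamma form with the assumed
    # normalisation B^(M-2)/(M-3)!, evaluated via gammaln to avoid overflow
    M = len(dr2) + 1                     # number of recorded positions
    B = np.sum(dr2 / (4 * dt))
    return (M - 2) * np.log(B) - gammaln(M - 2) - B / D - (M - 1) * np.log(D)

def posterior_summaries(dr2, dt):
    # Most likely value, mean and standard deviation, Eqs. (1.48), (1.51), (1.52)
    M = len(dr2) + 1
    D_star = np.sum(dr2 / dt) / (4 * (M - 1))
    D_mean = (M - 1) / (M - 3) * D_star
    D_std = np.sqrt((M - 1)**2 / ((M - 3)**2 * (M - 4))) * D_star
    return D_star, D_mean, D_std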
Data file             M      D     D∗     ⟨D⟩    σD
dataN10d2.5.dat       10     2.5   1.43   1.84   0.75
dataN100d2.5.dat      100    2.5   2.41   2.46   0.25
dataN1000d2.5.dat     1000   2.5   2.47   2.48   0.08
Table 1.2 Results for the inferred diffusion constant and its standard deviation (all
in µm² s^{−1}), for trajectories of different length M .
Fig. 1.6 Posterior distribution for the diffusion coefficient D given the data, for several values
of M .
Diffusion constant and characteristic size of the diffusing object. The order of mag-
nitude of the diffusion constant can be obtained from the Einstein-Stokes relation
D = \frac{kB T}{6πη ℓ} , where ℓ is the radius of the object (here considered as spherical), and
η is the viscosity of the medium. Considering the viscosity of water, η = 10^{−3} Pa s,
and kB T = 4 × 10^{−21} J, one obtains the orders of magnitude given in table 1.3.
Therefore, the data could correspond to an influenza virus diffusing in water.
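This order of magnitude is easily checked by inverting Eq. (1.40) for ℓ, as in the short sketch below (the value D ≈ 2.5 µm² s^{−1} is the one inferred from the data):

import numpy as np

kT = 4e-21      # J, room temperature (value used in the text)
eta = 1e-3      # Pa s, viscosity of water
D = 2.5e-12     # m^2/s, i.e. 2.5 um^2/s as inferred from the data

# Invert the Einstein-Stokes relation D = kT / (6 pi eta l) for the radius l
ell = kT / (6 * np.pi * eta * D)
print(f"radius ~ {ell * 1e9:.0f} nm")    # ~ 85 nm, the scale of an influenza virus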
Ref. [5] reports the following values: for a small protein (lysozyme) D = 10^{−6} cm² s^{−1},
and for a tobacco virus D = 4 · 10^{−8} cm² s^{−1}, in agreement with the above orders of
magnitude. In Ref. [4], the diffusion coefficient of protein complexes inside bacteria,
with widths approximately equal to 300 − 400 nm, is estimated to be equal to
D = 10^{−2} µm² s^{−1}. Differences with the order of magnitude given above are due to
the fact that the diffusion is confined and the medium is the interior of the cell, with
larger viscosity than water.
2 Asymptotic inference and information
In this chapter we will consider the case of asymptotic inference, in which a large
number of data is available and a comparatively small number of parameters have
to be inferred. In this regime, there exists a deep connection between inference and
information theory, whose description will require us to introduce the notions of en-
tropy, Fisher information, and Shannon information. Last of all, we will see how the
maximum entropy principle is, in practice, related to Bayesian inference.
where the average is over p. The name “cross entropy” derives from the fact that this
is not properly an entropy, because p and q are different. It coincides with the entropy
for p = q,
Sc (p, p) ≡ S(p) = − \sum_y p(y) \log p(y) .   (2.4)
so that
−DKL (p||q) = ⟨log z(y)⟩_p ≤ ⟨z(y)⟩_p − 1 = 0 ,   (2.8)
as a consequence of the concavity inequality log x ≤ x − 1. Note that the equality is
reached if z = 1 for all y such that p(y) > 0. As a consequence DKL (p||q) is positive
and vanishes for p = q only.
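These definitions are straightforward to implement for discrete distributions; the following minimal sketch (not part of the original text) checks the positivity of the KL divergence on the birth-rate example:

import numpy as np

def cross_entropy(p, q):
    # S_c(p, q) = - sum_y p(y) log q(y), with natural logarithms
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    # S(p) = S_c(p, p), Eq. (2.4)
    return cross_entropy(p, p)

def kl_divergence(p, q):
    # D_KL(p||q) = S_c(p, q) - S(p) >= 0, with equality only for p = q
    return cross_entropy(p, q) - entropy(p)

p = [0.490291, 0.509709]     # girl/boy frequencies of Laplace's example
q = [0.5, 0.5]
print(kl_divergence(p, q))   # ~ 1.9e-4, cf. Eq. (2.17)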
We have rewritten in Eq. (2.9) the product of the likelihoods of the data points as
the exponential of the sum of their logarithms. This is useful because, while prod-
ucts of random variables converge badly, sums of many random variables enjoy nice
convergence properties. To be more precise, let us fix θ; the log-likelihood of a data
point, say, yi , is a random variable with value log p(yi |θ), because the yi are randomly
extracted from p(yi |θ̂). The law of large numbers ensures that, with probability one1 ,
\frac{1}{M} \sum_{i=1}^{M} \log p(yi |θ) \xrightarrow{M→∞} \int dy\, p(y|θ̂) \log p(y|θ) .   (2.10)
where
Sc (θ̂, θ) = − \int dy\, p(y|θ̂) \log p(y|θ)   (2.12)
is precisely the cross entropy Sc (p, q) of the true distribution p(y) = p(y|θ̂) and of the
inferred distribution q(y) = p(y|θ).
As shown in section 2.1.2, the cross entropy can be expressed as the sum of the
entropy and of the KL divergence, see Eq. (2.5),
Sc (θ̂, θ) = S(θ̂) + DKL (θ̂||θ) ,   (2.13)
where we use the shorthand notation DKL (θ̂||θ) = DKL (p(y|θ̂)||p(y|θ)). Due to the
positivity of the KL divergence, the cross entropy Sc (θ̂, θ) enjoys two important prop-
erties:
¹ If we furthermore assume that the variance of such a random variable exists,
\int dy\, p(y|θ̂) \big[\log p(y|θ)\big]² < ∞ ,
then the distribution of the average of M such random variables becomes Gaussian with mean given
by the right hand side of Eq. (2.10) and variance scaling as 1/M .
Fig. 2.1 Illustration of the evolution of the posterior probability with the number M of
data, here for Laplace’s birth rate problem, see Eq. (1.35). Another example was given in
figure 1.6.
p(θ|Y ) = \frac{e^{−M Sc (θ̂,θ)}}{\int dθ\, e^{−M Sc (θ̂,θ)}} .   (2.14)
In the large–M limit the integral in the denominator is dominated by the minimal
value of Sc (θ̂, θ) at θ = θ̂, equal to the entropy S(θ̂) so we obtain, to exponential
order in M ,
p(θ|Y ) ∼ e−M [Sc (θ̂,θ)−S(θ̂)] = e−M DKL (θ̂||θ) . (2.15)
In figure 2.1 we show a sketch of the posterior distribution for Laplace’s problem
discussed in section 1.4. The concentration of the posterior with increasing values of
M is easily observed.
The KL divergence DKL (θ̂||θhyp ) controls how the posterior probability of the
hypothesis θ = θhyp varies with the number M of accumulated data. More precisely,
for θhyp 6= θ̂, the posterior probability that θ ≈ θhyp is exponentially small in M . For
any small ,
Prob(|θ − θhyp | < ) = e−M DKL (θ̂||θhyp )+O(M ) , (2.16)
and the rate of decay is given by the KL divergence DKL (θ̂||θhyp ) > 0. Hence, the
probability that |θhyp − θ| < ε becomes extremely small for M ≫ 1/DKL (θ̂||θhyp ).
The inverse of DKL can therefore be interpreted as the number of data needed to
recognise that the hypothesis θhyp is wrong.
We have already seen an illustration of this property in the study of Laplace’s birth
rate problem in section 1.4 (and we will see another one in section 2.1.6). The real value
θ̂ of the birth rate of girls was unknown but could be approximated by maximizing
the posterior, with the result θ̂ ≈ θ∗ = 0.490291. We then asked the probability that
girls had (at least) the same birth rate as boys, i.e. of θhyp ≥ 0.5. The cross entropy
of the binary random variable yi = 0 (girl) or 1 (boy) is Sc (θ̂, θ) = fθ̂ (θ) defined in
Eq. (1.31) and shown in figure 1.4. The rate of decay of this hypothesis is then
DKL (θ̂||θhyp ) = fθ̂ (θhyp ) − fθ̂ (θ̂) ≈ 1.88 · 10^{−4} ,   (2.17)
meaning that about 5000 observations are needed to rule out that the probability that
a newborn will be a girl is larger than 0.5. This is smaller than the actual number
M of available data by two orders of magnitude, which explains the extremely small
value of the probability found in Eq. (1.37).
2.1.5 Irrelevance of the prior in asymptotic inference
Let us briefly discuss the role of the prior in asymptotic inference. In presence of a
non-uniform prior over θ, p(θ) ∼ exp(−π(θ)), Eq. (2.11) is modified into
p(θ|Y ) ∝ p(Y |θ) p(θ) ≈ exp[−M Sc (θ̂, θ) − π(θ)] .   (2.18)
All the analysis of section 2.1.4 can be repeated, with the replacement
Sc (θ̂, θ) → Sc (θ̂, θ) + \frac{1}{M} π(θ) .   (2.19)
Provided π(θ) is a finite and smooth function of θ, the inclusion of the prior becomes
irrelevant for M → ∞, as it modifies the cross entropy by a term of the order of
1/M . In other words, because the likelihood is exponential in M and the prior does
not depend on M , the prior is irrelevant in the large M limit. Of course, this general
statement holds only if p(θ̂) > 0, i.e. if the correct value of θ is not excluded a priori.
This observation highlights the importance of avoiding imposing a too restrictive prior.
2.1.6 Variational approximation and bounds on free energy
In most situations, the distribution from which data are drawn may be unknown, or
intractable. A celebrated example borrowed from statistical mechanics is given by the
Ising model over L binary spins (variables) y = (y1 = ±1, · · · , yL = ±1), whose Gibbs
distribution reads (in this section we work with temperature T = 1 for simplicity)
pG (y) = \frac{e^{−H(y)}}{Z} ,   with   H(y) = −J \sum_{⟨ij⟩} yi yj .   (2.20)
If the sum runs only over pairs ⟨ij⟩ of variables that are nearest neighbour on a
d–dimensional cubic lattice, the model is intractable: the calculation of Z requires
Fig. 2.2 Illustration of the Kullback-Leibler divergence between the intractable distribution
pG (y) and a variational family pm (y).
where S(pm ) is the entropy of pm . Hence, the intractable free energy F is bounded
from above by a variational free energy Fm that depends on m. Minimising DKL (pm ||pG )
(see figure 2.2) or, equivalently, Fm , allows us to obtain the best (lowest) upper bound
to F .
As an illustration let us consider the Ising model in Eq. (2.20) again, defined on a d-
dimensional cubic lattice. We choose the variational family of factorized distributions,
pm (y) = \prod_{i=1}^{L} \frac{1 + m yi}{2} ,   (2.24)
which is much easier to handle because the variables are independent. Note that here
m = \sum_y yi pm (y) represents the magnetisation of any spin i within distribution pm ,
and is a real number in the [−1, 1] range. It is straightforward to compute the average
value of the Ising energy and of the entropy, with the results
⟨H(y)⟩_{pm} = − \frac{1}{2}\, J L (2d)\, m² ,   (2.25)
and
S(pm ) = L \Big[ − \frac{1+m}{2} \log \frac{1+m}{2} − \frac{1−m}{2} \log \frac{1−m}{2} \Big] .   (2.26)
Minimisation of Fm over m shows that the best magnetisation is the solution of the
self-consistent equation,
mopt = tanh(2d J mopt ) ,   (2.27)
which coincides with the well-known equation for the magnetisation in the mean-field
approximation to the Ising model. Hence, the mean-field approximation can be seen as
a search for the best independent-spin distribution, in the sense of having the smallest
KL divergence with the Ising distribution.
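Equation (2.27) is easily solved by fixed-point iteration, as in the sketch below (not part of the original text); the nonzero solution only exists above the mean-field critical point 2dJ = 1:

import numpy as np

def mean_field_magnetisation(J, d, m0=0.5, tol=1e-12, max_iter=100000):
    # Fixed-point iteration of m <- tanh(2 d J m), Eq. (2.27)
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(2 * d * J * m)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m_new

print(mean_field_magnetisation(J=0.3, d=3))   # 2dJ = 1.8 > 1: m_opt ~ 0.93
print(mean_field_magnetisation(J=0.1, d=3))   # 2dJ = 0.6 < 1: m_opt ~ 0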
In addition, we know, from section 2.1.4, that the gap ∆F = Fm − F is related to
the quality of the approximation. Suppose somebody gives you M configurations y of
L Ising spins drawn from pm , and asks you whether they were drawn from the Gibbs
distribution pG or from pm . Our previous analysis shows that it is possible to answer
in a reliable way only if M ≫ 1/∆F . Minimising over m may thus be seen as choosing
the value of m that makes the answer as hard as possible, i.e. for which the similarity
between the configurations produced by the two distributions is the largest.
The results of section 2.1 show that when many data are available, M → ∞, these
two estimators coincide, and also coincide with the ground truth, θ ∗ = θ̂.
In words, unbiased estimators provide the correct prediction θ̂ if they are fully av-
eraged over the data distribution. By constrast, biased estimators may overestimate
or underestimate the true value of the parameters in a systematic way. Note that the
MAP and Bayesian estimators defined above are in general biased, except for M → ∞.
Examples of unbiased estimators are the following.
• Consider M independent and identically distributed random variables yi with
mean µ and variance V . An unbiased estimator for the mean is
µ∗ (Y ) = \frac{1}{M} \sum_i yi .   (2.31)
The average of each yi is in fact equal to µ, and so is the sample average µ∗ (Y ).
An unbiased estimator for the variance is V∗ (Y ) = \frac{1}{M−1} \sum_i (yi − µ∗ (Y ))² , as can
be checked by the reader. Note the presence of the factor M − 1 instead of the
naive M at the denominator.
• For the German tank problem discussed in section 1.3, the reader can show that
the estimator for the number of tanks, N ∗ = yM + ∆y with ∆y = yM − yM −1 is
unbiased.
Unbiased estimators whose variance goes to zero when M → ∞ are particularly de-
sirable. We now discuss a bound on the variance of an unbiased estimator, known as
the Cramer-Rao bound.
Asymptotically, the Fisher information gives therefore the variance of θ when sampled
from the posterior distribution of θ,
Var(θ) = \frac{1}{M Iy (θ̂)} ,   (2.34)
so M should be larger than 1/Iy (θ̂) for the inference to be in the asymptotic regime.
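As a simple illustration (an added sketch, using the standard Bernoulli result not spelled out in this excerpt): for a Bernoulli variable with parameter θ, the Fisher information of a single observation is Iy(θ) = 1/[θ(1 − θ)], and the variance of the sample mean, an unbiased estimator, saturates the bound 1/(M Iy(θ̂)):

import numpy as np

rng = np.random.default_rng(0)
theta_hat, M, n_rep = 0.3, 100, 50000

# Fisher information of a single Bernoulli observation
I_y = 1.0 / (theta_hat * (1.0 - theta_hat))

# Unbiased estimator: the sample mean over M observations, repeated n_rep times
samples = rng.binomial(1, theta_hat, size=(n_rep, M))
theta_star = samples.mean(axis=1)
print(theta_star.var(), 1.0 / (M * I_y))   # both ~ theta(1-theta)/M = 2.1e-3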
Outside of the asymptotic regime, the equality in Eq. (2.34) becomes a lower bound
for the variance, called the Cramer-Rao bound. For any unbiased estimator θ∗ (Y ),
Var(θ∗ ) = \sum_Y p(Y |θ̂)\, [θ∗ (Y ) − θ̂]² ≥ \frac{1}{I_Y^{total}(θ̂)} ,   (2.35)
where
I_Y^{total}(θ̂) = − \sum_Y p(Y |θ̂)\, \frac{∂²}{∂θ²} \log p(Y |θ) \Big|_{θ=θ̂}   (2.36)
is the total Fisher information of the joint distribution of the M data points.
Proof of the Cramer-Rao bound. First we note that for an unbiased estimator
\sum_Y p(Y |θ̂)\, (θ∗ (Y ) − θ̂) = 0   (2.37)
is true for each θ̂. We can then differentiate the above equation with respect to θ̂, and
get
\sum_Y \frac{∂ p(Y |θ̂)}{∂ θ̂}\, (θ∗ (Y ) − θ̂) − 1 = 0 ,   (2.38)
which can be rewritten as
\sum_Y p(Y |θ̂)\, \frac{∂ \log p(Y |θ̂)}{∂ θ̂}\, (θ∗ (Y ) − θ̂) = 1 .   (2.39)
Introducing
α(Y ) = \sqrt{p(Y |θ̂)}\, \frac{∂ \log p(Y |θ̂)}{∂ θ̂} ,
β(Y ) = \sqrt{p(Y |θ̂)}\, (θ∗ (Y ) − θ̂) ,   (2.40)
Eq. (2.39) reads \sum_Y α(Y ) β(Y ) = 1, so that the Cauchy-Schwarz inequality can be applied
to obtain
\Big\{ \sum_Y p(Y |θ̂) \Big[ \frac{∂}{∂ θ̂} \log p(Y |θ̂) \Big]² \Big\} \Big\{ \sum_Y p(Y |θ̂)\, (θ∗ (Y ) − θ̂)² \Big\} ≥ 1 ,   (2.42)
which gives
Var(θ∗ (Y )) ≥ \frac{1}{\sum_Y p(Y |θ̂) \big[ \frac{∂}{∂ θ̂} \log p(Y |θ̂) \big]²} .   (2.43)
Observing that
I_Y^{total}(θ̂) = − \sum_Y p(Y |θ̂)\, \frac{∂²}{∂ θ̂²} \log p(Y |θ̂) = − \sum_Y \frac{∂²}{∂ θ̂²} p(Y |θ̂) + \sum_Y \frac{1}{p(Y |θ̂)} \Big( \frac{∂ p(Y |θ̂)}{∂ θ̂} \Big)²
             = \sum_Y p(Y |θ̂) \Big( \frac{∂ \log p(Y |θ̂)}{∂ θ̂} \Big)² ,   (2.44)
because the term \sum_Y \frac{∂²}{∂ θ̂²} p(Y |θ̂) = \frac{∂²}{∂ θ̂²} \sum_Y p(Y |θ̂) = 0 vanishes by normalisation of
p(Y |θ̂), we recover the Cramer-Rao bound, Eq. (2.35).
The fluctuations of the estimators around their unbiased averages are quantified by the
covariance matrix,
Cij = Cov(θi∗ , θj∗ ) = \sum_Y p(Y |θ̂)\, (θi∗ (Y ) − θ̂i )(θj∗ (Y ) − θ̂j ) .   (2.46)
The Cramer-Rao bound then states that Mij = Cij − (I −1 )ij is a positive definite
matrix.
Fig. 2.3 Illustration of the elementary gain in information following the observation of an
event of probability p, I(p) = C log(1/p); here, C = 1/ log 2, and the information is expressed
in bits.
data. By contrast, if the score is large (positive or negative), then the data bring
a lot of information on the parameters. The average value of the score is zero
⟨s⟩ = \sum_Y p(Y |θ)\, \frac{∂}{∂θ} \log p(Y |θ) = \sum_Y \frac{∂ p(Y |θ)}{∂θ} = 0 ,   (2.48)
Hence, larger Fisher information corresponds to larger score variance, and, infor-
mally speaking, to better model identifiability.
• The Fisher information is degraded upon transformation: y → F (y), where F is
a deterministic or random function,
The extreme case is when F is equal to a constant, in which case all the informa-
tion is lost. This property is expected to be valid for any measure of information.
function I(p), which characterises the elementary information carried by the observa-
tion of an event of probability p, should be monotonically decreasing with p; such a
candidate function is sketched in figure 2.3.
Admissible I(p) must fulfill additional requirements:
• I(p = 1) = 0: there is no gain of information following the observation of an event
happening with certainty.
• I(p) ≥ 0: the information is always positive.
• I(p12 ) = I(p1 ) + I(p2 ) if p12 = p1 p2 : the information is additive for independent
events.
• I(p12 ) ≤ I(p1 )+I(p2 ): for correlated events, realisation of the first event is already
telling us something about the possible outcomes for the second event. Hence, the
information gain from the second event is smaller than what we would have if we
had not observed the first event, i.e. I(p2 ).
These properties, together with a more sophisticated constraint of composability [6],
are satisfied by a unique function
I(p) = C \log \frac{1}{p} ,   (2.51)
2 To see that, consider the KL divergence between a distribution p(y) and the uniform distribution,
punif (y):
DKL (p||punif ) = \sum_y p(y) \log_2 [p(y)/punif (y)] = L − S(p) .   (2.53)
Boltzmann constant kB rather than 1/ log 2. In statistical physics the entropy charac-
terizes the degeneracy of microscopic configurations corresponding to a given macro-
scopic state. Based on Shannon’s interpretation we may see the entropy as how much
information we are missing about the microscopic configurations when knowing the
macroscopic state of the system only.
where
S[p(y|x)] = − \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)} = − \sum_x p(x) \sum_y p(y|x) \log p(y|x)   (2.56)
equal to the average gain in information coming from the joint observation of x and y.
Fig. 2.4 The maximal entropy model is the distribution p1 with the smallest reduction in
Shannon entropy with respect to the uniform model among all possible models p1 , p2 , p3 , . . .
satisfying the constraints imposed by the data.
it is reasonable to suppose that all events are equally likely, i.e. p(y) = 1/N . The
entropy of this uniform distribution is log2 N , and is maximal as we have seen above.
This model is, in some sense, maximally “ignorant” and constrains the events as little
as possible.
Suppose now that we have some knowledge about the distribution of y, for example
we are given its average value,
m = ⟨y⟩ = \sum_y p(y)\, y .   (2.58)
What can we say about p(y)? This is in general a very ill-defined question, as there
are infinitely many distributions p(y) having m as their mean. Yet, not all these
distributions p(y) have the same Shannon entropy. The MEP stipulates that we should
choose p(y) with maximal entropy among the distributions fulfilling Eq. (2.58). In this
way the difference between the entropy of the maximally agnostic (uniform) model and
that of the constrained model is minimal, as sketched in figure 2.4. In other words, we
have used as little as possible the constraint imposed on us.
In order to find p(y), one then has to maximise its entropy, while enforcing the
normalisation constraint together with Eq. (2.58). This can be done by introducing
two Lagrange multipliers λ and µ. We define
S(p, λ, µ) = − \sum_y p(y) \log p(y) − λ \Big( \sum_y p(y) − 1 \Big) − µ \Big( \sum_y p(y)\, y − m \Big) ,   (2.59)
and write that the functional derivative with respect to the distribution should vanish
at the maximum
\frac{δS(p)}{δp(y)} = − \log p(y) − 1 − λ − µ y = 0 .   (2.60)
This gives
p(y) = \frac{e^{µ y}}{\sum_{y'} e^{µ y'}} ,   (2.61)
where we have computed λ to fulfill the normalisation constraint. The resulting distri-
bution is then exponential, and the value of µ is determined by imposing Eq. (2.58).
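Numerically, µ can be obtained by one-dimensional root finding; the sketch below (an illustration, not part of the original text) does so for a variable taking a finite set of values, e.g. a die constrained to have mean 4.5:

import numpy as np
from scipy.optimize import brentq

def maxent_distribution(values, m):
    # Maximum-entropy distribution p(y) ~ exp(mu*y) on a finite set of values,
    # with mu fixed by the constraint <y> = m, cf. Eqs. (2.58) and (2.61)
    values = np.asarray(values, float)

    def mean_minus_m(mu):
        logw = mu * values
        w = np.exp(logw - logw.max())       # stabilised Boltzmann weights
        p = w / w.sum()
        return np.sum(p * values) - m

    mu = brentq(mean_minus_m, -50.0, 50.0)  # assumes m lies strictly inside the range
    logw = mu * values
    w = np.exp(logw - logw.max())
    return mu, w / w.sum()

mu, p = maxent_distribution(np.arange(1, 7), 4.5)
print(mu, p)     # exponentially increasing probabilities over the six faces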
It is easy to extend the above calculation to higher-dimensional variables y =
(y1 , y2 , ..., yL ). For instance, if we are provided with the value of the means
mi = ⟨yi ⟩ = \sum_y p(y)\, yi ,   (2.62)
then the MEP tells us that the most parsimonious use of these constraints corresponds
to the distribution
p(y) = \frac{e^{\sum_i µi yi}}{\sum_{y'} e^{\sum_i µi y'_i}} ,   (2.63)
³ As an illustration, assume we are given the first and second moments, respectively m1 and m2 ,
of a random scalar variable y. According to the MEP the log-probability of this variable is a linear
combination of O1 (y) = y and O2 (y) = y², i.e. the maximal entropy distribution is Gaussian.
2.4 Tutorial 2: entropy and information in neural spike trains
2.4.1 Problem
The spiking activity of a population of L = 40 retina ganglion cells in response to visual
stimuli was recorded [11]. During the recording a natural movie, lasting T = 26.5 s, is
presented Nr = 120 times. We want to analyse the information and noise content of
the spike train for each neuron, as a function of its spiking frequency.
Data:
The spiking times can be downloaded from the book webpage4 , see tutorial 2 repos-
itory. Data are taken from Ref. [11]. The data file contains a one-column array (ti ),
where ti is the spiking time of neuron i in seconds and ranging between 0 and
Nr × T = 3180 s. They are separated by the number 4000 followed by the neuron
label going from 4 to 63 (not to be confused with the neuron index i, running from 1
to L = 40).
Questions:
In the tutorial 2 repository, you are given a Jupyter notebook tut2 start.ipynb that
reads the data and makes a raster plot of the spikes of the first neuron, for the first 10
seconds and the first 10 repetitions of the stimulus. Complete the notebook or write
your own code to answer the following questions.
1. Entropy of a Poisson process. To analyse the spike train of a single neuron, we
discretise the time in elementary time-bin windows of length ∆t, where typically
∆t = 10 ms, and translate the activity of the neuron in a time bin into a binary
symbol σ ∈ {0, 1}. Here, σ = 0 corresponds to no spike, while σ = 1 corresponds
to at least one spike.
1a. Theory: Consider the activity of the neuron as a Poisson process with fre-
quency f : write the entropy Sp (f, ∆t) and the entropy rate (entropy per unit
of time) of the Poisson process as a function of the frequency f and of the
time-bin width ∆t.
1b. Numerics: Plot the entropy Sp (f, ∆t) and the entropy rate obtained in point
1a in the range of experimentally observed frequencies, e.g. f = 4 Hz, as a
function of ∆t.
2. Information conveyed by the spike train about the stimulus. To study
the reproducibility of the neural activity following a stimulus, we consider words
4 https://github.com/StatPhys2DataDrivenModel/DDM_Book_Tutorials
extending over ℓ consecutive time bins, i.e. of ℓ symbols. For a word duration of
ℓ × ∆t = 100 ms with ∆t = 10 ms (i.e. ℓ = 10), there are NW = 2^{10} = 1024 possible such
words representing the activity of a neuron over 100 ms. As explained in Ref. [8] one can
extract from the data the probability of each possible word W and then estimate
the entropy of this distribution:
Stotal = − \sum_W p(W ) \log_2 p(W )   bits .   (2.64)
and, averaging over the time bins b, the noise entropy
Snoise = \Big⟨ − \sum_W p(W |b) \log_2 p(W |b) \Big⟩_b ,
where ⟨. . .⟩ denotes the average over all possible times b, with time resolution ∆t.
Snoise is called noise entropy because it reflects the non-deterministic response to
the stimulus. If the noise entropy is zero, the response is deterministic: the same
word will be repeated after each given stimulus. The average information I that
the spike train provides about the stimulus is defined as I = Stotal − Snoise .
2a. Theory: The probability p(W |b) can be thought of as the probability of the
random variable W conditioned to the stimulus at time b × ∆t, itself consid-
ered as a random variable; different times t correspond to different random
draws of the stimulus. Under this interpretation, show that I is the mutual
information between W and the stimulus, and as a consequence I ≥ 0.
2b. Numerics: What is the distribution of Stotal , Snoise and I over the 40 neural
cells and ` = 10 time bins? What is, on average, the variability of the spike
trains used to convey information about the stimulus? Plot Stotal (i), Snoise (i),
I(i), and the ratio I(i)/Stotal (i) for the i = 1 . . . 40 neurons as a function of
their spiking frequencies fi . Compare Stotal (i) with the entropy of a Poisson
process.
3. Error on the estimate of I due to the limited number of data. How does I
depend on the number of repetitions of the experiment used in the data analysis
and on `? Check if the quantity of information per second converges. Are 120
repetitions and ` = 10 enough?
4. Time-dependence of the neuronal activity. Plot the time-bin dependent
entropy S(b) for one neuron. For the same neuron, compute the average spiking
frequency f (b) in a small time interval around time bin b. Compare the frequency
modulation f (b) with S(b). Discuss the relation between S(b) and the entropy of
a Poisson process with modulated frequency f (b).
2.4.2 Solution
Analytical calculations:
1a. Consider a Poisson process with frequency f in a time bin ∆t. The probability
that the Poisson process emits no spike in the time interval ∆t, that is, of σ = 0,
is
p(σ = 0) = e−f ∆t , hence p(σ = 1) = 1 − e−f ∆t . (2.67)
The entropy in bits of such a binary variable is
Sp (f, ∆t) = − \sum_σ p(σ) \log_2 p(σ) = \frac{e^{−f ∆t} f ∆t − (1 − e^{−f ∆t}) \log(1 − e^{−f ∆t})}{\log 2} .   (2.68)
The entropy per unit of time is simply Sp (f, ∆t)/∆t.
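A direct implementation of Eq. (2.68) (a sketch, not the official notebook) is:

import numpy as np

def poisson_bin_entropy(f, dt):
    # Entropy in bits of the binary symbol sigma, Eq. (2.68); valid for dt > 0
    p0 = np.exp(-f * dt)       # p(sigma = 0): no spike in the bin
    p1 = 1.0 - p0              # p(sigma = 1): at least one spike
    return -(p0 * np.log2(p0) + p1 * np.log2(p1))

dt = np.linspace(1e-3, 2.0, 400)            # bin widths in seconds
S = poisson_bin_entropy(4.0, dt)            # f = 4 Hz, as in figure 2.6
rate = S / dt                               # entropy per unit time (bits/s)

The maximum of S is reached when p(σ = 0) = p(σ = 1) = 0.5, i.e. at ∆t = log 2/f ≈ 0.17 s for f = 4 Hz.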
2a. We consider a discrete time index of bins, b = 1, 2, · · · , B = T /∆t. Now, consider
that a time bin b (and hence a corresponding stimulus) is chosen randomly with
uniform probability p(b) = 1/B. We call p(b, W ) the joint probability of choosing
a bin b and a word W , and then p(W |b) = p(W, b)/p(b). We have
p(W ) = \sum_b p(W |b)\, p(b) ,   (2.69)
and
Snoise = \sum_b p(b)\, S(b) ,   S(b) = − \sum_W p(W |b) \log_2 p(W |b) .   (2.70)
We can therefore identify the noise entropy with the conditional entropy, Snoise =
S[p(W |b)], as defined in Eq. (2.56). If the relation between b and W is determin-
istic, then Snoise = 0. Therefore
Stotal − Snoise = − \sum_b p(b) \Big[ \sum_W p(W |b) \log_2 p(W ) − \sum_W p(W |b) \log_2 \frac{p(W, b)}{p(b)} \Big]
               = \sum_{W,b} p(W, b) \log_2 \frac{p(W, b)}{p(W )\, p(b)} = I(W, b) .   (2.71)
The average information I is the mutual information between the stimulus and
the neural activity, as defined in Eq. (2.54).
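As a bridge to the data analysis below, the following sketch (not the official notebook) shows how Stotal, Snoise and I could be estimated once the spike train of one neuron has been binarised into an array of 0/1 symbols of shape (number of repetitions, number of time bins); the binarisation step, the choice of word positions and any finite-sampling corrections are left out and may differ from the original analysis:

import numpy as np
from collections import Counter

def word_entropies(sigma, ell):
    # sigma[r, b] = 1 if the neuron spiked in time bin b of repetition r, else 0.
    # Words are the binary strings of ell consecutive bins; returns
    # (S_total, S_noise, I) in bits per word, cf. Eqs. (2.64), (2.70), (2.71).
    n_rep, B = sigma.shape

    def entropy(counts, n):
        p = np.array(list(counts.values()), float) / n
        return -np.sum(p * np.log2(p))

    words_by_bin = [[tuple(sigma[r, b:b + ell]) for r in range(n_rep)]
                    for b in range(B - ell + 1)]

    all_words = [w for ws in words_by_bin for w in ws]
    S_total = entropy(Counter(all_words), len(all_words))       # entropy of p(W)
    S_noise = np.mean([entropy(Counter(ws), n_rep)              # average of S(b)
                       for ws in words_by_bin])
    return S_total, S_noise, S_total - S_noise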
Data analysis:
The Jupyter notebook tutorial2.ipynb contains as parameters the time bin ∆t, the
length of the words `s , and the number of repetitions Nrp (out of the total Nr = 120)
that are considered to calculate the entropy.
The raster plot of data for the first neuron is shown in figure 2.5 for the first 10
seconds of recording and the first 10 repetitions of the stimulus. It can be seen that
the neuron fires in a quite precise and reproducible way during the stimulations.
Fig. 2.5 Raster plot of the activity of the first recorded neuron, in the first 10 seconds of
the recording, for the first 10 repetitions of the stimulus.
1b. The entropy Sp and the entropy rate Sp /∆t are plotted in figure 2.6 for a cell
spiking at f = 4 Hz as a function of ∆t. The entropies have a non-trivial behaviour
upon changing ∆t. The maximum of the entropy is reached for ∆t such that
p(σ = 0) = p(σ = 1) = 0.5; for other values of ∆t the entropy decreases. The
entropy per unit of time decreases upon increasing ∆t.
Fig. 2.6 Entropy (A) and entropy per unit time (B) of a Poissonian spiking event in a time
bin ∆t for a frequency f = 4 Hz, as a function of the time bin ∆t.
2b. Stotal (i), Snoise (i) and I(i), for the recorded neurons i = 1, · · · , 40 as a function of
their frequencies are plotted in figure 2.7, for ∆t = 10 ms and words of duration
`s = 100 ms, corresponding to ` = 10 binary symbols.
As shown in figure 2.7A, the total entropy rate varies between 1 and 25 bits/s.
Note that if the neurons were following a Poisson process at frequency fi , then
each word would be composed of independent bits, and the total entropy of
Fig. 2.7 Total entropy and noise entropy (A) and information rate (B) as a function of the
cell frequency for ∆t = 10 ms, ` = 10. In panel A, the full line is the entropy per unit time
of a Poisson process, Sp (f, ∆t = 10 ms).
neuron i would be
Stotal = ℓ Sp (fi , ∆t)   ⇒   \frac{Stotal}{ℓ × ∆t} = \frac{Sp (fi , ∆t)}{∆t} .   (2.72)
This curve is reported as a full line in figure 2.7A, and provides an upper bound
to the total entropy. Indeed, correlations between spikes lower the entropy with
respect to a purely Poisson process.
The noise entropy is approximately half of the total entropy, and therefore
the information rate I = (Stotal − Snoise )/∆t varies between 2 and 12 bits/s, see
figure 2.7B, and is roughly proportional to the neuron spiking frequency. The more
active cells have larger entropy and information rate. The orders of magnitude
of the entropies and information rates are in agreement with Ref. [9, figure 3],
which also studied recordings in the retina, for ganglion cells spiking at the same
frequencies.
Also, in good agreement with Ref. [8], in which a motion-sensitive neuron in the
fly visual system is studied, the noise entropy is of the order of half the total
entropy. So, on average, half of the variability of the spike train is used to convey
information about the stimulus. The ratio I/Stotal is shown in figure 2.8.
3. One can vary the number of repetitions of the experiment used in the data anal-
ysis: Nr = 1 can be used to check that the noise entropy is zero, while the total
entropy is not. Moreover one can study the dependence of the results with the
number of repetitions and check if Nr = 120 is enough to have a stable result.
In figure 2.9A, the information I is plotted versus Nr for one neuron (the first
recorded).
The entropy should reach a constant value in the infinite word-length limit, i.e.
for the value of `s = ` × ∆t at which the process is uncorrelated. As an example
we show in figure 2.9B the evolution of I with `s , for the first recorded neuron.
Very similar results are obtained, in both cases, for the other neurons.