Theorem of probability theory
Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference (in particular, to James–Stein estimation and empirical Bayes methods) and to portfolio choice theory.[1] The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.
Note that the name "Stein's lemma" is also commonly used[2] to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. This result is also known as the Chernoff–Stein lemma[3] and is not related to the lemma discussed in this article.
Statement
Suppose $X$ is a normally distributed random variable with expectation $\mu$ and variance $\sigma^{2}$.
Further suppose $g$ is a differentiable function for which the two expectations $E\bigl(g(X)(X-\mu)\bigr)$ and $E\bigl(g'(X)\bigr)$ both exist.
(The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.)
Then
$$E\bigl(g(X)(X-\mu)\bigr)=\sigma^{2}E\bigl(g'(X)\bigr).$$
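The identity is straightforward to check numerically. Below is a minimal Monte Carlo sketch, assuming NumPy is available; the choice $g(x)=\sin x$ and the particular values of $\mu$ and $\sigma$ are arbitrary illustrations, not part of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8                      # arbitrary example parameters
X = rng.normal(mu, sigma, size=1_000_000)

g = np.sin                                # any differentiable g with finite expectations
g_prime = np.cos

lhs = np.mean(g(X) * (X - mu))            # Monte Carlo estimate of E[g(X)(X - mu)]
rhs = sigma**2 * np.mean(g_prime(X))      # sigma^2 * estimate of E[g'(X)]
print(lhs, rhs)                           # the two estimates agree up to Monte Carlo error
```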
Multidimensional
More generally, suppose $X$ and $Y$ are jointly normally distributed. Then
$$\operatorname{Cov}\bigl(g(X),Y\bigr)=\operatorname{Cov}(X,Y)\,E\bigl(g'(X)\bigr).$$
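The covariance form can be checked the same way. A minimal sketch, assuming NumPy; the covariance matrix and the choice $g=\tanh$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])              # example joint covariance of (X, Y)
X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

g = np.tanh                               # any differentiable g
g_prime = lambda x: 1.0 / np.cosh(x)**2

lhs = np.cov(g(X), Y)[0, 1]               # sample Cov(g(X), Y)
rhs = cov[0, 1] * np.mean(g_prime(X))     # Cov(X, Y) * estimate of E[g'(X)]
print(lhs, rhs)                           # agreement up to Monte Carlo error
```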
For a general multivariate Gaussian random vector $(X_{1},\dots,X_{n})\sim N(\mu,\Sigma)$ it follows that
$$E\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot E\bigl(\nabla g(X)\bigr).$$
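A minimal Monte Carlo sketch of the vector identity, assuming NumPy; the mean, covariance, and the choice $g(x)=\sin(a^{\top}x)$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])              # example mean vector and covariance matrix
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

a = np.array([0.7, -0.2, 0.4])                   # g(x) = sin(a . x), so grad g(x) = cos(a . x) * a
gX = np.sin(X @ a)
grad_gX = np.cos(X @ a)[:, None] * a

lhs = np.mean(gX[:, None] * (X - mu), axis=0)    # estimate of E[g(X)(X - mu)], a vector
rhs = Sigma @ np.mean(grad_gX, axis=0)           # Sigma . estimate of E[grad g(X)]
print(lhs, rhs)                                  # the two vectors agree up to Monte Carlo error
```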
Similarly, when $\mu=0$,
$$E[\partial_{i}g(X)]=E\bigl[g(X)(\Sigma^{-1}X)_{i}\bigr],\qquad E[\partial_{i}\partial_{j}g(X)]=E\bigl[g(X)\bigl((\Sigma^{-1}X)_{i}(\Sigma^{-1}X)_{j}-\Sigma_{ij}^{-1}\bigr)\bigr].$$
Gradient descent
Stein's lemma can be used to stochastically estimate a gradient:
$$\nabla E_{\epsilon\sim\mathcal{N}(0,I)}\bigl(g(x+\Sigma^{1/2}\epsilon)\bigr)=\Sigma^{-1/2}\,E_{\epsilon\sim\mathcal{N}(0,I)}\bigl(g(x+\Sigma^{1/2}\epsilon)\,\epsilon\bigr)\approx\Sigma^{-1/2}\,\frac{1}{N}\sum_{i=1}^{N}g(x+\Sigma^{1/2}\epsilon_{i})\,\epsilon_{i},$$
where $\epsilon_{1},\dots,\epsilon_{N}$ are IID samples from the standard normal distribution $\mathcal{N}(0,I)$. This form has applications in Stein variational gradient descent[4] and Stein variational policy gradient.[5]
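A minimal sketch of this estimator, assuming NumPy; the helper name `stein_gradient_estimate` and the test function $g(v)=v^{\top}v$ are illustrative choices, not part of the methods cited above:

```python
import numpy as np

def stein_gradient_estimate(g, x, Sigma_sqrt, n_samples=1_000_000, rng=None):
    """Monte Carlo estimate of grad_x E_{eps ~ N(0, I)}[g(x + Sigma^{1/2} eps)]
    via Stein's lemma.  g must accept a batch of points of shape (n, d)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, x.shape[0]))
    values = g(x + eps @ Sigma_sqrt.T)                       # g(x + Sigma^{1/2} eps_i), shape (n,)
    return np.linalg.inv(Sigma_sqrt) @ (values[:, None] * eps).mean(axis=0)

# Example: for g(v) = v . v the Gaussian-smoothed gradient at x is exactly 2x.
x = np.array([1.0, -2.0])
Sigma_sqrt = 0.1 * np.eye(2)                                 # small isotropic smoothing
print(stein_gradient_estimate(lambda V: (V * V).sum(axis=1), x, Sigma_sqrt))  # approx [2, -4]
```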
Proof
The probability density function of the univariate normal distribution with expectation 0 and variance 1 is
$$\varphi(x)=\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}.$$
Since
$$\int x\exp(-x^{2}/2)\,dx=-\exp(-x^{2}/2),$$
we get from integration by parts:
$$E[g(X)X]=\frac{1}{\sqrt{2\pi}}\int g(x)\,x\exp(-x^{2}/2)\,dx=\frac{1}{\sqrt{2\pi}}\int g'(x)\exp(-x^{2}/2)\,dx=E[g'(X)].$$
The case of general variance $\sigma^{2}$ follows by substitution.
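The integration-by-parts step can also be verified symbolically for a concrete $g$; the following sketch, assuming SymPy is available, uses the illustrative choice $g(x)=x^{3}$:

```python
import sympy as sp

x = sp.symbols('x', real=True)
g = x**3                                        # a concrete differentiable g
phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)    # standard normal density

lhs = sp.integrate(g * x * phi, (x, -sp.oo, sp.oo))           # E[g(X) X]
rhs = sp.integrate(sp.diff(g, x) * phi, (x, -sp.oo, sp.oo))   # E[g'(X)]
print(sp.simplify(lhs), sp.simplify(rhs))       # both evaluate to 3
```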
Generalizations
Isserlis' theorem is equivalently stated as
$$\operatorname{E}\bigl(X_{1}f(X_{1},\ldots,X_{n})\bigr)=\sum_{i=1}^{n}\operatorname{Cov}(X_{1},X_{i})\operatorname{E}\bigl(\partial_{X_{i}}f(X_{1},\ldots,X_{n})\bigr),$$
where $(X_{1},\dots,X_{n})$ is a zero-mean multivariate normal random vector.
Suppose $X$ is in an exponential family, that is, $X$ has the density
$$f_{\eta}(x)=\exp\bigl(\eta'T(x)-\Psi(\eta)\bigr)h(x).$$
Suppose this density has support $(a,b)$, where $a$ and $b$ may be $-\infty$ and $\infty$, and suppose that, as $x\to a$ or $x\to b$, $\exp(\eta'T(x))\,h(x)\,g(x)\to 0$, where $g$ is any differentiable function such that $E|g'(X)|<\infty$, or that $\exp(\eta'T(x))\,h(x)\to 0$ when $a$ and $b$ are finite. Then
$$E\left[\left(\frac{h'(X)}{h(X)}+\sum\eta_{i}T_{i}'(X)\right)\cdot g(X)\right]=-E[g'(X)].$$
The derivation is the same as in the special case, namely integration by parts.
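As a concrete check, the exponential distribution with rate $\lambda$ fits this framework with $h(x)=1$, $T(x)=x$, $\eta=-\lambda$ on support $(0,\infty)$, so the identity reduces to $\lambda E[g(X)]=E[g'(X)]$ for differentiable $g$ with $g(0)=0$. A minimal Monte Carlo sketch, assuming NumPy; the rate and the choice $g(x)=x^{2}$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 1.7                                   # example rate; X ~ Exp(lam)
# With h(x) = 1, T(x) = x, eta = -lam, the identity reads
#   -lam * E[g(X)] = -E[g'(X)]   for differentiable g with g(0) = 0.
X = rng.exponential(scale=1.0 / lam, size=1_000_000)

g = lambda x: x**2                          # g(0) = 0, so the boundary condition holds
g_prime = lambda x: 2 * x

print(-lam * np.mean(g(X)), -np.mean(g_prime(X)))   # agreement up to Monte Carlo error
```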
If we only know that $X$ has support $\mathbb{R}$, then it could be the case that $E|g(X)|<\infty$ and $E|g'(X)|<\infty$ but $\lim_{x\to\infty}f_{\eta}(x)g(x)\neq 0$. To see this, simply put $g(x)=1$ and let $f_{\eta}(x)$ have infinitely many spikes towards infinity while remaining integrable. One such example could be adapted from
$$f(x)={\begin{cases}1&x\in[n,n+2^{-n}){\text{ for some }}n\in\mathbb{N}\\0&{\text{otherwise}}\end{cases}}$$

smoothed so that $f$ is differentiable.
Extensions to elliptically contoured distributions also exist.[6][7][8]
References
^ Ingersoll, J. (1987). Theory of Financial Decision Making. Rowman and Littlefield. pp. 13–14.
^ Csiszár, Imre; Körner, János (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press. p. 14. ISBN 9781139499989.
^ Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory. John Wiley & Sons, New York. ISBN 9781118585771.
^ Liu, Qiang; Wang, Dilin (2019-09-09). "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm". arXiv:1608.04471 [stat.ML].
^ Liu, Yang; Ramachandran, Prajit; Liu, Qiang; Peng, Jian (2017-04-07). "Stein Variational Policy Gradient". arXiv:1704.02399 [cs.LG].
^ Cellier, Dominique; Fourdrinier, Dominique; Robert, Christian (1989). "Robust shrinkage estimators of the location parameter for elliptically symmetric distributions". Journal of Multivariate Analysis. 29 (1): 39–52. doi:10.1016/0047-259X(89)90075-4.
^ Hamada, Mahmoud; Valdez, Emiliano A. (2008). "CAPM and option pricing with elliptically contoured distributions". The Journal of Risk & Insurance. 75 (2): 387–409. CiteSeerX 10.1.1.573.4715. doi:10.1111/j.1539-6975.2008.00265.x.
^ Landsman, Zinoviy; Nešlehová, Johanna (2008). "Stein's Lemma for elliptical random vectors". Journal of Multivariate Analysis. 99 (5): 912–927. doi:10.1016/j.jmva.2007.05.006.