Note that the expression of Pinsker's inequality depends on which base of logarithm is used in the definition of the KL-divergence. $D_{KL}(P \parallel Q)$ is defined using $\ln$ (logarithm in base $e$), whereas $D(P \parallel Q)$ is typically defined with $\log_2$ (logarithm in base 2). Then,
$$D_{KL}(P \parallel Q) = \ln(2)\, D(P \parallel Q).$$
Given the above comments, there is an alternative statement of Pinsker's inequality in some literature that relates information divergence to variation distance:
$$V(P, Q) \le \sqrt{2 \ln(2)\, D(P \parallel Q)},$$
where $V(P, Q) = \sum_{x} |P(x) - Q(x)|$ is the variation distance (twice the total variation distance $\delta(P, Q)$).
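As a quick numerical sanity check of the two formulations (illustrative only; the helper names `kl_nats` and `variation_distance` are ours, not from any cited source), the sketch below computes the divergence in both logarithm bases for a small pair of distributions and verifies both forms of the inequality:

```python
import math

def kl_nats(p, q):
    """KL divergence in nats (natural log); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variation_distance(p, q):
    """V(P, Q) = sum_x |P(x) - Q(x)|, twice the total variation distance."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]

d_kl = kl_nats(p, q)           # D_KL, natural log (nats)
d_bits = d_kl / math.log(2)    # D, base-2 log (bits)
v = variation_distance(p, q)
delta = v / 2                  # total variation distance

# Pinsker's inequality: delta <= sqrt(D_KL / 2)
assert delta <= math.sqrt(d_kl / 2)
# Equivalent base-2 statement: V <= sqrt(2 ln(2) D)
assert v <= math.sqrt(2 * math.log(2) * d_bits)
```

Since $D_{KL} = \ln(2)\,D$ and $V = 2\delta$, the two assertions are algebraically the same inequality, which the check confirms on this example.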
Note that the bound from Pinsker's inequality is vacuous for any distributions where $D_{KL}(P \parallel Q) > 2$, since the total variation distance is at most $1$. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber[3] (see also Tsybakov[4]):
$$\delta(P, Q) \le \sqrt{1 - e^{-D_{KL}(P \parallel Q)}}.$$
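To see the difference between the two bounds, the sketch below (illustrative; the helper names are ours) evaluates both upper bounds on $\delta(P, Q)$ as the divergence grows. Pinsker's bound exceeds $1$ once $D_{KL} > 2$ and is then vacuous, while the Bretagnolle–Huber bound always stays below $1$:

```python
import math

def pinsker_bound(d_kl):
    """Upper bound on delta(P, Q) from Pinsker's inequality."""
    return math.sqrt(d_kl / 2)

def bretagnolle_huber_bound(d_kl):
    """Upper bound on delta(P, Q) from Bretagnolle-Huber."""
    return math.sqrt(1 - math.exp(-d_kl))

# For D_KL > 2 Pinsker's bound is vacuous (exceeds 1),
# while the Bretagnolle-Huber bound never exceeds 1.
assert pinsker_bound(5.0) > 1.0
assert bretagnolle_huber_bound(5.0) < 1.0
assert bretagnolle_huber_bound(20.0) < 1.0
```

Note that for small divergences Pinsker's bound is the tighter of the two; the Bretagnolle–Huber bound only wins once $D_{KL}$ is large.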
History
Pinsker first proved the inequality with a greater constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.[5]
Inverse problem
A precise inverse of the inequality cannot hold: for every $\varepsilon > 0$, there are distributions $P_\varepsilon, Q$ with $\delta(P_\varepsilon, Q) \le \varepsilon$ but $D_{KL}(P_\varepsilon \parallel Q) = \infty$. An easy example is given by the two-point space $\{0, 1\}$ with $Q(0) = 0$ and $P_\varepsilon(0) = \varepsilon$.[6]
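The two-point counterexample can be checked directly; in the sketch below (illustrative, with a hand-rolled `kl_nats` helper that returns infinity when absolute continuity fails), the variation distance is exactly $\varepsilon$ while the divergence is infinite, because $P_\varepsilon$ puts mass on a point where $Q$ puts none:

```python
import math

def kl_nats(p, q):
    """KL divergence in nats; infinite if p puts mass where q does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

eps = 1e-6
p = [eps, 1 - eps]   # P_eps on the two-point space {0, 1}
q = [0.0, 1.0]       # Q assigns zero probability to the point 0

# On a two-point space, the total variation distance is |P(0) - Q(0)|.
delta = abs(p[0] - q[0])
assert delta == eps              # arbitrarily small variation distance...
assert kl_nats(p, q) == math.inf # ...yet infinite divergence
```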
However, an inverse inequality holds on finite spaces $X$ with a constant depending on $Q$.[7] More specifically, it can be shown that with the definition
$$\alpha_Q := \min_{x \in X : Q(x) > 0} Q(x)$$
we have for any measure $P$ which is absolutely continuous to $Q$:
$$\frac{1}{2} D_{KL}(P \parallel Q) \le \frac{1}{\alpha_Q}\, \delta(P, Q)^2.$$
As a consequence, if $Q$ has full support (i.e. $Q(x) > 0$ for all $x \in X$), then
$$\delta(P, Q)^2 \le \frac{1}{2} D_{KL}(P \parallel Q) \le \frac{1}{\alpha_Q}\, \delta(P, Q)^2.$$
^ Yeung, Raymond W. (2008). Information Theory and Network Coding. Hong Kong: Springer. p. 26. ISBN 978-0-387-79233-0.
^ Bretagnolle, J.; Huber, C. (1978). "Estimation des densités: risque minimax". Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977). Lecture Notes in Math. 649. Berlin: Springer. pp. 342–363. Lemma 2.1 (in French).
^ Tsybakov, Alexandre B. (2009). Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original; translated by Vladimir Zaiats. Springer Series in Statistics. New York: Springer. xii+214 pp. ISBN 978-0-387-79051-0. Equation 2.25.
^ The divergence becomes infinite whenever one of the two distributions assigns probability zero to an event while the other assigns it a nonzero probability (no matter how small); see e.g. Basu, Mitra; Ho, Tin Kam (2006). Data Complexity in Pattern Recognition. Springer. p. 161. ISBN 9781846281723.
^ See Lemma 4.1 in Götze, Friedrich; Sambale, Holger; Sinulis, Arthur (2019). "Higher order concentration for functions of weakly dependent random variables". Electronic Journal of Probability. 24. arXiv:1801.06348. doi:10.1214/19-EJP338. S2CID 52200238.
Further reading
Thomas M. Cover and Joy A. Thomas: Elements of Information Theory, 2nd edition, Wiley-Interscience, 2006
Nicolò Cesa-Bianchi and Gábor Lugosi: Prediction, Learning, and Games, Cambridge University Press, 2006