Bretagnolle–Huber inequality

In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions $P$ and $Q$ by a concave and bounded function of the Kullback–Leibler divergence $D_{\mathrm {KL} }(P\parallel Q)$ . The bound can be viewed as an alternative to the well-known Pinsker's inequality: when $D_{\mathrm {KL} }(P\parallel Q)$ is large (larger than 2 for instance.^[1]), Pinsker's inequality is vacuous, while Bretagnolle–Huber remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds relying on hypothesis testing^[2]

Formal statement

Preliminary definitions

Let $P$ and $Q$ be two probability distributions on a measurable space $({\mathcal {X}},{\mathcal {F}})$ . Recall that the total variation between $P$ and $Q$ is defined by

d_{\mathrm {TV} }(P,Q)=\sup _{A\in {\mathcal {F}}}\{|P(A)-Q(A)|\}.

The Kullback-Leibler divergence is defined as follows:

D_{\mathrm {KL} }(P\parallel Q)={\begin{cases}\int _{\mathcal {X}}\log {\bigl (}{\frac {dP}{dQ}}{\bigr )}\,dP&{\text{if }}P\ll Q,\\[1mm]+\infty &{\text{otherwise}}.\end{cases}}

In the above, the notation $P\ll Q$ stands for absolute continuity of $P$ with respect to $Q$ , and ${\frac {dP}{dQ}}$ stands for the Radon–Nikodym derivative of $P$ with respect to $Q$ .

General statement

The Bretagnolle–Huber inequality says:

d_{\mathrm {TV} }(P,Q)\leq {\sqrt {1-\exp(-D_{\mathrm {KL} }(P\parallel Q))}}\leq 1-{\frac {1}{2}}\exp(-D_{\mathrm {KL} }(P\parallel Q))

Alternative version

The following version is directly implied by the bound above but some authors^[2] prefer stating it this way. Let $A\in {\mathcal {F}}$ be any event. Then

P(A)+Q({\bar {A}})\geq {\frac {1}{2}}\exp(-D_{\mathrm {KL} }(P\parallel Q))

where ${\bar {A}}=\Omega \smallsetminus A$ is the complement of $A$ .

Indeed, by definition of the total variation, for any $A\in {\mathcal {F}}$ ,

{\begin{aligned}Q(A)-P(A)\leq d_{\mathrm {TV} }(P,Q)&\leq 1-{\frac {1}{2}}\exp(-D_{\mathrm {KL} }(P\parallel Q))\\&=Q(A)+Q({\bar {A}})-{\frac {1}{2}}\exp(-D_{\mathrm {KL} }(P\parallel Q))\end{aligned}}

Rearranging, we obtain the claimed lower bound on $P(A)+Q({\bar {A}})$ .

Proof

We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),^[3] which differ from the original proof^[4] (see C.Canonne's note ^[1] for a modernized retranscription of their argument).

The proof is in two steps:

1. Prove using Cauchy–Schwarz that the total variation is related to the Bhattacharyya coefficient (right-hand side of the inequality):

1-d_{\mathrm {TV} }(P,Q)^{2}\geq \left(\int {\sqrt {PQ}}\right)^{2}

2. Prove by a clever application of Jensen’s inequality that

\left(\int {\sqrt {PQ}}\right)^{2}\geq \exp(-D_{\mathrm {KL} }(P\parallel Q))

Step 1:

First notice that

d_{\mathrm {TV} }(P,Q)=1-\int \min(P,Q)=\int \max(P,Q)-1

To see this, denote

A^{*}=\arg \max _{A\in \Omega }|P(A)-Q(A)|

and without loss of generality, assume that

P(A^{*})>Q(A^{*})

such that

d_{\mathrm {TV} }(P,Q)=P(A^{*})-Q(A^{*})

. Then we can rewrite

d_{\mathrm {TV} }(P,Q)=\int _{A^{*}}\max(P,Q)-\int _{A^{*}}\min(P,Q)

And then adding and removing

\int _{\bar {A^{*}}}\max(P,Q){\text{ or }}\int _{\bar {A^{*}}}\min(P,Q)

we obtain both identities.

Then

{\begin{aligned}1-d_{\mathrm {TV} }(P,Q)^{2}&=(1-d_{\mathrm {TV} }(P,Q))(1+d_{\mathrm {TV} }(P,Q))\\&=\int \min(P,Q)\int \max(P,Q)\\&\geq \left(\int {\sqrt {\min(P,Q)\max(P,Q)}}\right)^{2}\\&=\left(\int {\sqrt {PQ}}\right)^{2}\end{aligned}}

because

PQ=\min(P,Q)\max(P,Q).

Step 2:

We write

(\cdot )^{2}=\exp(2\log(\cdot ))

and apply Jensen's inequality:

{\begin{aligned}\left(\int {\sqrt {PQ}}\right)^{2}&=\exp \left(2\log \left(\int {\sqrt {PQ}}\right)\right)\\&=\exp \left(2\log \left(\int P{\sqrt {\frac {Q}{P}}}\right)\right)\\&=\exp \left(2\log \left(\operatorname {E} _{P}\left[\left({\sqrt {\frac {P}{Q}}}\right)^{-1}\,\right]\right)\right)\\&\geq \exp \left(\operatorname {E} _{P}\left[-\log \left({\frac {P}{Q}}\right)\right]\right)=\exp(-D_{KL}(P,Q))\end{aligned}}

Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.

Examples of applications

Sample complexity of biased coin tosses

Source:^[1]

The question is How many coin tosses do I need to distinguish a fair coin from a biased one?

Assume you have 2 coins, a fair coin (Bernoulli distributed with mean $p_{1}=1/2$ ) and an $\varepsilon$ -biased coin ( $p_{2}=1/2+\varepsilon$ ). Then, in order to identify the biased coin with probability at least $1-\delta$ (for some $\delta >0$ ), at least

n\geq {\frac {1}{2\varepsilon ^{2}}}\log \left({\frac {1}{2\delta }}\right).

In order to obtain this lower bound we impose that the total variation distance between two sequences of $n$ samples is at least $1-2\delta$ . This is because the total variation upper bounds the probability of under- or over-estimating the coins' means. Denote $P_{1}^{n}$ and $P_{2}^{n}$ the respective joint distributions of the $n$ coin tosses for each coin, then

We have

{\begin{aligned}(1-2\delta )^{2}&\leq d_{\mathrm {TV} }\left(P_{1}^{n},P_{2}^{n}\right)^{2}\\[4pt]&\leq 1-e^{-D_{\mathrm {KL} }(P_{1}^{n}\parallel P_{2}^{n})}\\[4pt]&=1-e^{-nD_{\mathrm {KL} }(P_{1}\parallel P_{2})}\\[4pt]&=1-e^{-n{\frac {\log(1/(1-4\varepsilon ^{2}))}{2}}}\end{aligned}}

The result is obtained by rearranging the terms.

Information-theoretic lower bound for k-armed bandit games

In multi-armed bandit, a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequence on hypothesis testing (see Chapter 15 of Bandit Algorithms^[2]).

History

The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar.^[4] Alexandre Tsybakov's book^[3] features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.^[5]

References

^ ^a ^b ^c Canonne, Clément (2022). "A short note on an inequality between KL and TV". arXiv:2202.07198 [math.PR].
^ ^a ^b ^c Lattimore, Tor; Szepesvari, Csaba (2020). Bandit Algorithms (PDF). Cambridge University Press. Retrieved 18 August 2022.
^ ^a ^b Tsybakov, Alexandre B. (2010). Introduction to nonparametric estimation. Springer Series in Statistics. Springer. doi:10.1007/b13794. ISBN 978-1-4419-2709-5. OCLC 757859245. S2CID 42933599.
^ ^a ^b Bretagnolle, J.; Huber, C. (1978), "Estimation des densités : Risque minimax", Séminaire de Probabilités XII, Lecture notes in Mathematics, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 342–363, doi:10.1007/bfb0064610, ISBN 978-3-540-08761-8, S2CID 122597694, retrieved 2022-08-20
^ Gerchinovitz, Sébastien; Ménard, Pierre; Stoltz, Gilles (2020-05-01). "Fano's Inequality for Random Variables". Statistical Science. 35 (2). arXiv:1702.05985. doi:10.1214/19-sts716. ISSN 0883-4237. S2CID 15808752.

This page was last edited on 3 September 2023, at 22:14

From Wikipedia, the free encyclopedia