Machine Learning Theory - 5 Important Inequalities

April 7, 2016

There are five important inequalities in Machine Learning Theory: Markov's inequality, Chebyshev's inequality, the Chernoff-Hoeffding bound, Hoeffding's inequality, and McDiarmid's inequality.

Markov’s Inequality

If $X$ is a nonnegative random variable and $\epsilon>0$, then

$P(X\geq\epsilon)\leq\frac{E[X]}{\epsilon}$

Proof for Markov’s Inequality

First of all, we define indicator function:

For an event $A$, let $I(A)$ be the indicator of the event:

$I(A)=\begin{cases}1 & \text{if } A \text{ occurs}\\ 0 & \text{otherwise}\end{cases}$

We have $X\geq\epsilon I(X\geq\epsilon)$ (if $X\geq\epsilon$ the right-hand side is $\epsilon$, otherwise it is $0$ and $X\geq0$). Taking expectations on both sides, we get

$E[X]\geq\epsilon E[I(X\geq\epsilon)]=\epsilon P(X\geq\epsilon)$

which rearranges to Markov's inequality.
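A quick Monte Carlo sanity check of the inequality; the choice of an Exponential(1) variable (which is nonnegative with $E[X]=1$) and the threshold $\epsilon=2$ are illustrative assumptions, not part of the original argument:

```python
import random

# Monte Carlo sanity check of Markov's inequality: P(X >= eps) <= E[X]/eps.
# X ~ Exponential(1) is an illustrative choice of nonnegative random variable.
random.seed(0)
m = 100_000
samples = [random.expovariate(1.0) for _ in range(m)]  # E[X] = 1

eps = 2.0
tail = sum(x >= eps for x in samples) / m  # empirical P(X >= eps)
bound = sum(samples) / m / eps             # empirical E[X] / eps

print(tail, bound)
```

For this distribution the true tail is $e^{-2}\approx0.135$, well below the bound of roughly $0.5$; Markov's inequality is loose but always valid.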

Chebyshev’s Inequality

Let $X$ be a random variable with finite expected value and finite non-zero variance. Then for any real number $\epsilon>0$:

$P(\vert X-E[X]\vert\geq\epsilon)\leq\frac{Var(X)}{\epsilon^2}$

Proof for Chebyshev’s Inequality

From Markov's inequality, we define the nonnegative random variable $Y=(X-E[X])^2$ and apply the inequality with threshold $\epsilon^2$:

$P(\vert X-E[X]\vert\geq\epsilon)=P((X-E[X])^2\geq\epsilon^2)\leq\frac{E[(X-E[X])^2]}{\epsilon^2}=\frac{Var(X)}{\epsilon^2}$
The previous two inequalities are easier to understand than the three that follow.
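Chebyshev's bound can be sanity-checked the same way; the Uniform(0, 1) distribution (with $E[X]=\frac{1}{2}$, $Var(X)=\frac{1}{12}$) and $\epsilon=0.4$ are illustrative choices:

```python
import random

# Monte Carlo sanity check of Chebyshev's inequality:
# P(|X - E[X]| >= eps) <= Var(X)/eps^2.
# X ~ Uniform(0, 1) is an illustrative choice: E[X] = 1/2, Var(X) = 1/12.
random.seed(0)
m = 100_000
samples = [random.random() for _ in range(m)]

eps = 0.4
tail = sum(abs(x - 0.5) >= eps for x in samples) / m  # true value is 0.2
bound = (1 / 12) / eps**2                             # about 0.52

print(tail, bound)
```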

Chernoff-Hoeffding Bound

The following theorem is a special case of the Chernoff bound.

Let $x_1, \cdots, x_m$ be i.i.d. Bernoulli random variables, where for every $i$:

$P(x_i=1)=p, \quad P(x_i=0)=1-p$

We have

$P\left(\left\vert\frac{1}{m}\sum_{i=1}^mx_i-p\right\vert>\epsilon\right)\leq2e^{-2\epsilon^2m}$
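The bound is easy to test empirically: repeatedly draw $m$ Bernoulli samples and count how often the empirical mean deviates from $p$ by more than $\epsilon$. The parameters $p$, $m$, $\epsilon$, and the number of trials below are illustrative assumptions:

```python
import math
import random

# Monte Carlo sketch of the Chernoff-Hoeffding bound
# P(|1/m * sum(x_i) - p| > eps) <= 2 * exp(-2 * eps^2 * m)
# for i.i.d. Bernoulli(p). The parameters below are illustrative.
random.seed(0)
p, m, eps, trials = 0.3, 100, 0.1, 20_000

def deviates():
    mean = sum(random.random() < p for _ in range(m)) / m
    return abs(mean - p) > eps

tail = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * eps**2 * m)  # = 2 * exp(-2), about 0.27

print(tail, bound)
```

The empirical deviation probability comes out far below the bound, as expected, since the bound must hold for every $p$, $m$, and $\epsilon$ simultaneously.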

Moment-Generating Functions (MGFs)

The MGF of a random variable $X$ is defined as

$M_X(t)=E[e^{tX}]$

MGF has some important properties:

  1. If $M_X(t)$ is defined for $t\in(-t^\ast,t^\ast)$ for some $t^\ast>0$, then
    • All the moments of $X$ exist, and $M_X(t)$ has derivatives of all orders at $t=0$, where $\forall k, E[X^k]=\frac{d^kM_X}{dt^k}\big\vert_{t=0}$
    • $\forall t\in(-t^\ast,t^\ast)$, $M_X(t)$ has the Taylor expansion $M_X(t)=\sum_k\frac{E[X^k]}{k!}t^k$
  2. If $X$ and $Y$ are independent random variables, then $M_{X+Y}(t)=M_X(t)M_Y(t)$
  3. For any constants $a$ and $b$, $M_{aX+b}(t)=e^{bt}M_X(at)$
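Property 1 can be checked numerically: for $X\sim\text{Bernoulli}(p)$ the MGF has the closed form $M_X(t)=(1-p)+pe^t$, and every moment equals $p$ (since $X^k=X$). Finite differences at $t=0$ should recover these moments; the values of $p$ and the step size $h$ below are illustrative:

```python
import math

# Property 1 of MGFs: derivatives at t = 0 give the moments.
# For X ~ Bernoulli(p), M_X(t) = (1 - p) + p*e^t and E[X^k] = p for k >= 1.
p = 0.3
M = lambda t: (1 - p) + p * math.exp(t)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)            # central difference, approx. M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / h**2   # approx. M''(0) = E[X^2]

print(first, second)  # both approximately 0.3
```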

Jensen’s Inequality

Let $X$ be a random variable, and $f$ be a convex function. Then

$f(E[X])\leq E[f(X)]$
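A simple numerical illustration with the convex function $f(x)=x^2$, where the inequality reads $E[X]^2\leq E[X^2]$ and the gap is exactly $Var(X)$; the Normal(1, 2) distribution is an illustrative choice:

```python
import random

# Numerical illustration of Jensen's inequality with the convex f(x) = x^2:
# f(E[X]) = E[X]^2 <= E[X^2] = E[f(X)]; the gap is exactly Var(X).
# X ~ Normal(mean 1, std 2) is an illustrative choice.
random.seed(0)
samples = [random.gauss(1.0, 2.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
mean_sq = sum(x * x for x in samples) / len(samples)

print(mean**2, mean_sq)  # about 1.0 vs about 5.0
```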

Proof for Chernoff-Hoeffding Bound

Define $y_i=x_i-p$ and $\bar{y}=\frac{1}{m}\sum_{i=1}^my_i$. We need to show $P(\vert\bar{y}\vert>\epsilon)\leq2e^{-2\epsilon^2m}$.

Consider each of $P(\bar{y}>\epsilon)$ and $P(\bar{y}<-\epsilon)$ separately.

For $t>0$, look at

$P(\bar{y}>\epsilon)=P(e^{t\bar{y}}>e^{t\epsilon})\leq\frac{E[e^{t\bar{y}}]}{e^{t\epsilon}}=e^{-t\epsilon}M_{\bar{y}}(t)$

where the inequality is Markov's.

Lemma 1: For all $t>0$, $M_{y_i}(t)\leq e^{t^2/8}$

Proof: Recall

$M_{y_i}(t)=E[e^{t(x_i-p)}]=e^{-tp}(pe^t+(1-p))$

Look at $g(t)=-tp+\ln(pe^t+(1-p))$, so that $M_{y_i}(t)=e^{g(t)}$. By Taylor's theorem, for some $t^\ast\in[0,t]$:

$g(t)=g(0)+g^{\prime}(0)t+\frac{g^{\prime\prime}(t^\ast)}{2}t^2$

and direct computation gives $g(0)=0$ and $g^{\prime}(0)=0$.

We can show that $\forall t^\ast$, $g^{\prime\prime}(t^\ast)\leq\frac{1}{4}$: writing $s=pe^{t^\ast}$ and $q=1-p$, we get $g^{\prime\prime}(t^\ast)=\frac{sq}{(s+q)^2}\leq\frac{1}{4}$, since $uv\leq\frac{(u+v)^2}{4}$ for all $u,v\geq0$.

Thus $g(t)\leq\frac{t^2}{8}$, and so $M_{y_i}(t)=e^{g(t)}\leq e^{t^2/8}$.
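Lemma 1 can be verified numerically on a grid; the ranges of $p$ and $t$ below are illustrative (the lemma in fact holds for all real $t$, as a special case of Hoeffding's lemma):

```python
import math

# Grid check of Lemma 1: for y = x - p with x ~ Bernoulli(p),
# M_y(t) = e^{-tp} * (p*e^t + 1 - p) <= e^{t^2 / 8}.
# The grids over p and t are illustrative ranges.
ok = True
for i in range(1, 20):
    p = i / 20                      # p in {0.05, ..., 0.95}
    for j in range(1, 51):
        t = j / 10                  # t in {0.1, ..., 5.0}
        mgf = math.exp(-t * p) * (p * math.exp(t) + 1 - p)
        ok = ok and mgf <= math.exp(t * t / 8) + 1e-12  # small float tolerance

print(ok)
```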

Using properties of MGFs, $M_{\bar{y}}(t)=\prod_{i=1}^mM_{y_i}(\frac{t}{m})$

Using Lemma 1, $M_{\bar{y}}(t)\leq\prod_{i=1}^m\exp\left(\frac{t^2}{8m^2}\right)=\exp\left(\frac{t^2}{8m}\right)$

Combining, $P(\bar{y}>\epsilon)\leq\exp\left(\frac{t^2}{8m}-t\epsilon\right)$. Pick the $t$ that minimizes this bound: setting the derivative of the exponent to zero gives $t=4m\epsilon$, which yields $P(\bar{y}>\epsilon)\leq e^{-2\epsilon^2m}$.
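The optimization step can be double-checked numerically: a grid scan of the exponent $\frac{t^2}{8m}-t\epsilon$ agrees with the calculus answer $t=4m\epsilon$. The values of $m$ and $\epsilon$ below are illustrative:

```python
import math

# The argument above bounds P(ybar > eps) by exp(t^2/(8m) - t*eps).
# Calculus gives the minimizer t = 4*m*eps with value exp(-2*eps^2*m);
# a coarse grid scan agrees. m and eps are illustrative choices.
m, eps = 100, 0.1
exponent = lambda t: t * t / (8 * m) - t * eps

t_opt = 4 * m * eps                                      # 40.0
t_best = min((t / 10 for t in range(1, 1000)), key=exponent)

print(t_best, math.exp(exponent(t_opt)))  # 40.0 and exp(-2)
```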

The other direction is the same: applying the argument to $-y_i$ bounds $P(\bar{y}<-\epsilon)$ by the same quantity, and a union bound gives the factor of $2$.

Hoeffding’s Inequality

Let $x_1, \cdots, x_m$ be independent random variables, where for every $i$:

$x_i\in[a_i,b_i]$ with probability $1$

For any $\epsilon>0$, we have

$P\left(\left\vert\frac{1}{m}\sum_{i=1}^m(x_i-E[x_i])\right\vert\geq\epsilon\right)\leq2\exp\left(-\frac{2m^2\epsilon^2}{\sum_{i=1}^m(b_i-a_i)^2}\right)$
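A Monte Carlo sketch of this bound, using Uniform(0, 1) variables so that $a_i=0$, $b_i=1$ and $\sum_i(b_i-a_i)^2=m$; the parameters $m$, $\epsilon$, and the trial count are illustrative:

```python
import math
import random

# Monte Carlo sketch of Hoeffding's inequality for independent x_i in [a_i, b_i].
# Here x_i ~ Uniform(0, 1), so a_i = 0, b_i = 1, and sum((b_i - a_i)^2) = m.
# The parameters m, eps, and trials are illustrative choices.
random.seed(0)
m, eps, trials = 50, 0.15, 20_000

def deviates():
    mean = sum(random.random() for _ in range(m)) / m  # E[x_i] = 1/2
    return abs(mean - 0.5) >= eps

tail = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * m**2 * eps**2 / m)  # denominator: sum of (b_i - a_i)^2 = m

print(tail, bound)
```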

McDiarmid’s Inequality

Let $f:\mathbb{R}^m\rightarrow\mathbb{R}$ be such that $\forall i$ and all $x_1, \cdots, x_i, \cdots, x_m, x_i^{\prime}$:

$\vert f(x_1,\cdots,x_i,\cdots,x_m)-f(x_1,\cdots,x_i^{\prime},\cdots,x_m)\vert\leq c_i$

Let $x_1, \cdots, x_m$ be independent random variables, then

$P\left(\vert f(x_1,\cdots,x_m)-E[f(x_1,\cdots,x_m)]\vert\geq\epsilon\right)\leq2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^mc_i^2}\right)$

Note: Hoeffding's inequality is a special case of McDiarmid's inequality: just set

$f(x_1,\cdots,x_m)=\frac{1}{m}\sum_{i=1}^mx_i, \quad c_i=\frac{b_i-a_i}{m}$
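The reduction can be checked by computing both exponents for the sample mean and confirming they coincide; the values of $m$, $[a_i,b_i]$, and $\epsilon$ below are illustrative:

```python
# Hoeffding's bound recovered from McDiarmid's: for the sample mean
# f(x_1, ..., x_m) = (1/m) * sum(x_i), changing one x_i within [a_i, b_i]
# changes f by at most c_i = (b_i - a_i)/m, and the two exponents coincide.
# The values of m, a_i, b_i, and eps are illustrative choices.
m, eps = 50, 0.15
a = [0.0] * m
b = [1.0] * m

c = [(bi - ai) / m for ai, bi in zip(a, b)]
mcdiarmid_exp = -2 * eps**2 / sum(ci**2 for ci in c)
hoeffding_exp = -2 * m**2 * eps**2 / sum((bi - ai)**2 for ai, bi in zip(a, b))

print(mcdiarmid_exp, hoeffding_exp)  # identical exponents
```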

References (in Chinese):

  • 切比雪夫不等式到底是个什么概念? ("What exactly is Chebyshev's inequality?"), Zhihu answer by 姚岑卓 — covers Markov's inequality and Chebyshev's inequality
  • 我只要这么多样本! ("I only need this many samples!"), Zhihu column 比生活简单多了, by 张熤 — covers the Chernoff bound, moment-generating functions, Hoeffding's inequality, and McDiarmid's inequality