Probability distribution whose tail probabilities decay at least as fast as those of some Gaussian distribution.
In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.
Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, and then analyze each separately. In probability, this division usually takes the form: "everything interesting happens near the center; the tail event is so rare that it may safely be ignored." Subgaussian distributions are worthy of study because the gaussian distribution is well understood, and so we can give sharp bounds on the rarity of the tail event. Similarly, the subexponential distributions are also worthy of study.
Formally, the probability distribution of a random variable $X$ is called subgaussian if there is a positive constant $C$ such that for every $t \ge 0$,
$$\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/C^2).$$
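For example, a standard normal random variable $Z \sim N(0,1)$ itself satisfies this definition with $C = \sqrt{2}$; a short worked derivation (included here for illustration, using the standard Chernoff calculation):
$$\operatorname{P}(|Z| \ge t) = 2\operatorname{P}(Z \ge t) \le 2\inf_{\lambda > 0} e^{-\lambda t}\operatorname{E}[e^{\lambda Z}] = 2\inf_{\lambda > 0} e^{-\lambda t + \lambda^2/2} = 2e^{-t^2/2}.$$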
There are many equivalent definitions. For example, a random variable $X$ is sub-Gaussian iff its tail distribution function is bounded from above (up to a constant) by that of a Gaussian:
$$\operatorname{P}(|X| \ge t) \le c\operatorname{P}(|Z| \ge t) \quad \text{for all } t \ge 0,$$
where $c \ge 1$ is a constant and $Z$ is a mean zero Gaussian random variable.[1]: Theorem 2.6
The subgaussian norm of $X$, denoted as $\|X\|_{\psi_2}$, is
$$\|X\|_{\psi_2} = \inf\{c > 0 : \operatorname{E}[\exp(X^2/c^2)] \le 2\}.$$
In other words, it is the Orlicz norm of $X$ generated by the Orlicz function $\Phi(u) = e^{u^2} - 1$. By condition (4) below, subgaussian random variables can be characterized as exactly those random variables with finite subgaussian norm.
The following conditions, with positive constants $K_1, \dots, K_5$, are equivalent ways of saying that $X$ is subgaussian:
(1) Tail bound: $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$.
(2) Moment bound: $(\operatorname{E}[|X|^p])^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$.
(3) MGF bound for $X^2$: $\operatorname{E}[\exp(\lambda^2 X^2)] \le \exp(K_3^2\lambda^2)$ for all $\lambda$ with $|\lambda| \le 1/K_3$.
(4) $\operatorname{E}[\exp(X^2/K_4^2)] \le 2$.
(5) MGF bound (assuming $\operatorname{E}[X] = 0$): $\operatorname{E}[\exp(\lambda X)] \le \exp(K_5^2\lambda^2)$ for all $\lambda \in \mathbb{R}$.
Furthermore, the constant is the same in the definitions (1) to (5), up to an absolute constant. So, for example, given a random variable satisfying (1) and (2), the minimal constants $K_1, K_2$ in the two definitions satisfy $c_1K_1 \le K_2 \le c_2K_1$, where $c_1, c_2$ are positive constants independent of the random variable.
From the proof, we can extract a cycle of three inequalities:
If $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$ for all $t \ge 0$, then $(\operatorname{E}[|X|^p])^{1/p} \le cK\sqrt{p}$ for all $p \ge 1$.
If $(\operatorname{E}[|X|^p])^{1/p} \le K\sqrt{p}$ for all $p \ge 1$, then $\operatorname{E}[\exp(X^2/(cK)^2)] \le 2$.
If $\operatorname{E}[\exp(X^2/K^2)] \le 2$, then $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$ for all $t \ge 0$.
(Here $c > 0$ denotes an absolute constant, not necessarily the same at each occurrence.)
In particular, the constants provided by the definitions are the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of $X$.
Similarly, because a bound of type (4) can be converted into a bound of type (3), and vice versa, by Jensen's inequality at the cost of a positive multiplicative constant in the value of $K$, the definitions (3) and (4) are also equivalent up to a constant.
Basic properties — * If $X$ is subgaussian, and $c > 0$, then $cX$ is subgaussian and $\|cX\|_{\psi_2} = c\|X\|_{\psi_2}$.
* (Triangle inequality) If $X, Y$ are subgaussian, then $\|X + Y\|_{\psi_2} \le \|X\|_{\psi_2} + \|Y\|_{\psi_2}$.
* (Chernoff bound) If $X$ is subgaussian, then for all $t \ge 0$, $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/\|X\|_{\psi_2}^2)$ (a numerical illustration appears below).
Here, $f \lesssim g$ means that $f \le Cg$, where the positive constant $C$ is independent of $f$ and $g$.
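As a numerical illustration of the Chernoff-bound property (a minimal sketch, not taken from the cited sources), note that a standard normal $Z$ satisfies $\operatorname{E}[e^{Z^2/c^2}] = (1 - 2/c^2)^{-1/2}$ for $c^2 > 2$, so $\|Z\|_{\psi_2} = \sqrt{8/3}$; the Python snippet below (sample size and seed are arbitrary choices) compares the resulting tail bound with a Monte Carlo estimate of the true tail.

import numpy as np

rng = np.random.default_rng(0)

# psi_2 norm of a standard normal: E[exp(Z^2/c^2)] = (1 - 2/c^2)^(-1/2) = 2  =>  c^2 = 8/3.
psi2_norm = np.sqrt(8.0 / 3.0)

z = rng.standard_normal(1_000_000)
for t in [0.5, 1.0, 2.0, 3.0]:
    empirical_tail = np.mean(np.abs(z) >= t)            # estimate of P(|Z| >= t)
    chernoff_bound = 2 * np.exp(-t**2 / psi2_norm**2)   # 2 exp(-t^2 / ||Z||_psi2^2)
    print(f"t={t}: empirical {empirical_tail:.4f} <= bound {chernoff_bound:.4f}")

The empirical tail stays below the bound for every $t$, as the property guarantees.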
Subgaussian deviation bound — If $X$ is subgaussian, then $\|X - \operatorname{E}[X]\|_{\psi_2} \lesssim \|X\|_{\psi_2}$.
Proof
By the triangle inequality, $\|X - \operatorname{E}[X]\|_{\psi_2} \le \|X\|_{\psi_2} + \|\operatorname{E}[X]\|_{\psi_2}$. Now we have $\|\operatorname{E}[X]\|_{\psi_2} = |\operatorname{E}[X]|/\sqrt{\ln 2} \lesssim \operatorname{E}[|X|]$. By the equivalence of definitions (2) and (4) of subgaussianity (the case $p = 1$ of the moment bound), we have $\operatorname{E}[|X|] \lesssim \|X\|_{\psi_2}$.
Independent subgaussian sum bound — If $X_1, \dots, X_n$ are mean-zero, subgaussian, and independent, then $\left\|\sum_{i=1}^n X_i\right\|_{\psi_2}^2 \lesssim \sum_{i=1}^n \|X_i\|_{\psi_2}^2$.
Proof
If independent, then use that the cumulant generating function of a sum of independent random variables is additive. That is, $\ln\operatorname{E}\left[e^{\lambda\sum_i X_i}\right] = \sum_i \ln\operatorname{E}\left[e^{\lambda X_i}\right]$, so if each $X_i$ satisfies definition (5) with constant $K_i$, then $\operatorname{E}\left[e^{\lambda\sum_i X_i}\right] \le e^{\left(\sum_i K_i^2\right)\lambda^2}$; converting between the constant in definition (5) and the $\psi_2$ norm only costs an absolute constant factor.
If not independent, then by Hölder's inequality, for any $p, q > 1$ with $\tfrac1p + \tfrac1q = 1$ we have (for two subgaussian variables $X, Y$ with constants $K_X, K_Y$ in definition (5))
$$\operatorname{E}\left[e^{\lambda(X+Y)}\right] \le \left(\operatorname{E}\left[e^{p\lambda X}\right]\right)^{1/p}\left(\operatorname{E}\left[e^{q\lambda Y}\right]\right)^{1/q} \le \exp\left(\left(pK_X^2 + qK_Y^2\right)\lambda^2\right).$$
Solving the optimization problem
$$\min\left\{pK_X^2 + qK_Y^2 : \tfrac1p + \tfrac1q = 1\right\} = (K_X + K_Y)^2,$$
we obtain the result: without independence, only the weaker (triangle-inequality-type) bound $K_{X+Y} \le K_X + K_Y$ holds.
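For completeness, the optimization in the last step can be solved by elementary calculus (a worked derivation, included for illustration): writing $q = \frac{p}{p-1}$,
$$\frac{d}{dp}\left(pK_X^2 + \frac{p}{p-1}K_Y^2\right) = K_X^2 - \frac{K_Y^2}{(p-1)^2} = 0 \quad\Longrightarrow\quad p = 1 + \frac{K_Y}{K_X},\qquad q = 1 + \frac{K_X}{K_Y},$$
and substituting back gives the minimum value $K_X(K_X + K_Y) + K_Y(K_X + K_Y) = (K_X + K_Y)^2$.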
Corollary — Linear sums of subgaussian random variables are subgaussian.
Partial converse (Matoušek 2008, Lemma 2.4) — If $\operatorname{E}[X] = 0$, and $\operatorname{P}(|X| \ge t) \le Ce^{-bt^2}$ for all $t \ge 0$, then $\operatorname{E}[e^{\lambda X}] \le e^{C'\lambda^2}$ for all $\lambda \in \mathbb{R}$, where $C'$ depends on $C$ and $b$ only.
Proof
Let $F$ be the CDF of $X$; by replacing $X$ with $-X$ if necessary, we may assume $\lambda > 0$. The proof splits the integral defining the MGF, $\operatorname{E}[e^{\lambda X}] = \int_{-\infty}^{\infty} e^{\lambda x}\,dF(x)$, into two halves, one with $x < 0$ and one with $x \ge 0$, and bounds each one respectively.
Since $e^{\lambda x} \le 1 + \lambda x + \tfrac12\lambda^2x^2$ for $x \le 0$,
$$\int_{-\infty}^{0} e^{\lambda x}\,dF(x) \le \operatorname{P}(X < 0) + \lambda\int_{-\infty}^{0} x\,dF(x) + \tfrac12\lambda^2\operatorname{E}[X^2],$$
where the tail assumption makes $\operatorname{E}[X^2] \le C/b$ a constant depending only on $C$ and $b$.
For the second term, upper bound it by a summation over unit intervals, using the tail assumption:
$$\int_{0}^{\infty} e^{\lambda x}\,dF(x) \le \sum_{k=0}^{\infty} e^{\lambda(k+1)}\operatorname{P}(X \ge k) \le \sum_{k=0}^{\infty} Ce^{\lambda(k+1) - bk^2}.$$
When $\lambda$ is at most a constant depending only on $b$, a finer version of the same estimates (using $\operatorname{E}[X] = 0$ to cancel the first-order terms coming from the two halves) gives $\operatorname{E}[e^{\lambda X}] \le 1 + O(\lambda^2) \le e^{O(\lambda^2)}$, with implied constants depending only on $C$ and $b$.
When $\lambda$ is larger than that constant, by drawing out the curve of $k \mapsto \lambda(k+1) - bk^2$, and comparing the summation with the maximum $\lambda + \lambda^2/(4b)$ of that curve, we find that $\sum_{k=0}^{\infty} e^{\lambda(k+1)-bk^2} \le c(b)\,e^{\lambda + \lambda^2/(4b)}$ for a constant $c(b)$ depending only on $b$. Now verify that in both cases $\operatorname{E}[e^{\lambda X}] \le e^{C'\lambda^2}$, where $C'$ depends on $C$ and $b$ only (in the second case $\lambda$ is at most a constant multiple of $\lambda^2$, so the lower-order terms can be absorbed into the exponent).
Corollary (Matoušek 2008, Lemma 2.2) — Let $X_1, \dots, X_n$ be independent random variables with the same upper subgaussian tail: $\operatorname{P}(X_i \ge t) \le Ce^{-bt^2}$ for all $t \ge 0$. Also, $\operatorname{E}[X_i] = 0$. Then for any unit vector $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$, the linear sum $\sum_{i=1}^n \alpha_i X_i$ has a subgaussian tail:
$$\operatorname{P}\left(\sum_{i=1}^n \alpha_i X_i \ge t\right) \le C''e^{-b''t^2} \quad\text{for all } t \ge 0,$$
where $C''$ and $b''$ depend only on $C$ and $b$.
Gaussian concentration inequality for Lipschitz functions (Tao 2012, Theorem 2.1.12.) — If $f: \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz, and $X$ is a standard gaussian vector in $\mathbb{R}^n$, then $f(X)$ concentrates around its expectation at a rate
$$\operatorname{P}\left(f(X) - \operatorname{E}[f(X)] \ge t\right) \le C\exp\left(-\frac{ct^2}{L^2}\right)$$
for absolute constants $C, c > 0$, and similarly for the other tail.
Proof
By shifting and scaling, it suffices to prove the case where $L = 1$ and $\operatorname{E}[f(X)] = 0$.
Since every 1-Lipschitz function is uniformly approximable by 1-Lipschitz smooth functions (by convolving with a mollifier), it suffices to prove it for 1-Lipschitz smooth functions.
Now it remains to bound the cumulant generating function $\ln\operatorname{E}[e^{\lambda f(X)}]$ by $O(\lambda^2)$; the tail bound then follows by the Chernoff argument (the implication from definition (5) to definition (1)).
To exploit the Lipschitzness, we introduce $Y$, an independent copy of $X$; then by Jensen's inequality (applied to the expectation over $Y$, using $\operatorname{E}[f(Y)] = 0$),
$$\operatorname{E}_X\left[e^{\lambda f(X)}\right] = \operatorname{E}_X\left[e^{\lambda(f(X) - \operatorname{E}_Y[f(Y)])}\right] \le \operatorname{E}_{X,Y}\left[e^{\lambda(f(X) - f(Y))}\right].$$
By the circular symmetry of gaussian variables, we introduce the path $X_\theta := Y\cos\theta + X\sin\theta$, $\theta \in [0, \pi/2]$, so that $X_0 = Y$ and $X_{\pi/2} = X$. This has the benefit that its derivative $X_\theta' = -Y\sin\theta + X\cos\theta$ is a standard gaussian vector independent of $X_\theta$.
Writing $f(X) - f(Y) = \int_0^{\pi/2} \frac{d}{d\theta} f(X_\theta)\,d\theta = \int_0^{\pi/2} \nabla f(X_\theta)\cdot X_\theta'\,d\theta$ and applying Jensen's inequality to the average over $\theta$,
$$e^{\lambda(f(X) - f(Y))} \le \frac{2}{\pi}\int_0^{\pi/2} e^{\frac{\pi\lambda}{2}\nabla f(X_\theta)\cdot X_\theta'}\,d\theta.$$
Now take its expectation. The expectation within the integral is over the joint distribution of $(X_\theta, X_\theta')$, but since the joint distribution of $(X_\theta, X_\theta')$ is exactly the same for every $\theta$ (namely, that of a pair of independent standard gaussian vectors), we have
$$\operatorname{E}\left[e^{\lambda(f(X) - f(Y))}\right] \le \operatorname{E}\left[e^{\frac{\pi\lambda}{2}\nabla f(X)\cdot Y}\right].$$
Conditional on $X$, the quantity $\nabla f(X)\cdot Y$ is normally distributed, with mean zero and variance $\|\nabla f(X)\|_2^2 \le 1$ (by the Lipschitz condition), so
$$\operatorname{E}\left[e^{\frac{\pi\lambda}{2}\nabla f(X)\cdot Y}\right] = \operatorname{E}\left[e^{\frac{\pi^2\lambda^2}{8}\|\nabla f(X)\|_2^2}\right] \le e^{\frac{\pi^2\lambda^2}{8}},$$
which is the required bound on $\operatorname{E}[e^{\lambda f(X)}]$.
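The following Python sketch (an illustration, not part of the cited proof; the dimension, sample size, and seed are arbitrary choices) checks the statement numerically for the 1-Lipschitz function $f(x) = \|x\|_2$ in dimension $n = 100$. The proof above yields the rate $\exp(-2t^2/\pi^2)$; the sharp form of the inequality gives $\exp(-t^2/2)$, which is used as the reference here.

import numpy as np

rng = np.random.default_rng(1)
n, samples = 100, 200_000

x = rng.standard_normal((samples, n))
f = np.linalg.norm(x, axis=1)      # f(x) = ||x||_2 is 1-Lipschitz
mean_f = f.mean()                  # roughly sqrt(n), i.e. about 10 here

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(f - mean_f >= t)
    reference = np.exp(-t**2 / 2)  # sharp gaussian concentration rate
    print(f"t={t}: P(f(X) - E f(X) >= {t}) ~ {empirical:.2e} <= {reference:.2e}")

The deviations of $f(X)$ from its mean are of order 1 and do not grow with the dimension, which is the content of the inequality.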
Suppose that $X$ is subgaussian with variance proxy $s^2$, meaning that $\operatorname{E}\left[e^{\lambda(X - \operatorname{E}[X])}\right] \le e^{\lambda^2 s^2/2}$ for all $\lambda \in \mathbb{R}$. Expanding the cumulant generating function as $\lambda \to 0$:
$$\tfrac12\lambda^2 s^2 \ge \ln\operatorname{E}\left[e^{\lambda(X - \operatorname{E}[X])}\right] = \tfrac12\lambda^2\operatorname{Var}[X] + O(\lambda^3),$$
we find that $\operatorname{Var}[X] \le s^2$. At the edge of possibility, we define that a random variable satisfying $\operatorname{E}\left[e^{\lambda(X - \operatorname{E}[X])}\right] \le e^{\lambda^2\operatorname{Var}[X]/2}$ for all $\lambda \in \mathbb{R}$ is called strictly subgaussian.
By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.
Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.
Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.
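These examples are easy to check numerically. The Python sketch below (an illustration only; the grid of $\lambda$ values is an arbitrary choice) compares the moment generating function of the symmetric uniform distribution on $[-1, 1]$, namely $\operatorname{E}[e^{\lambda X}] = \sinh(\lambda)/\lambda$, with $e^{\lambda^2\operatorname{Var}[X]/2} = e^{\lambda^2/6}$.

import numpy as np

lambdas = np.linspace(0.01, 10, 1000)   # by symmetry it suffices to check lambda > 0
mgf = np.sinh(lambdas) / lambdas        # E[exp(lambda X)] for X ~ Uniform[-1, 1]
proxy = np.exp(lambdas**2 / 6)          # exp(lambda^2 Var[X] / 2), with Var[X] = 1/3

# Strict subgaussianity means mgf <= proxy for every lambda.
print("strictly subgaussian on this grid:", bool(np.all(mgf <= proxy)))
print("largest ratio mgf/proxy:", float(np.max(mgf / proxy)))

The ratio stays at or below 1, consistent with strict subgaussianity.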
The optimal variance proxy is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet[6], Kumaraswamy, triangular[7], truncated Gaussian, and truncated exponential.[8]
Let $p, q$ be two positive numbers with $p + q = 1$. Let $X$ follow the centered Bernoulli distribution $p\delta_q + q\delta_{-p}$, so that it has mean zero; then its optimal variance proxy is $\frac{p - q}{2(\ln p - \ln q)}$ (interpreted as $\frac14$ when $p = q = \frac12$).[5] Its subgaussian norm is $t$, where $t$ is the unique positive solution to $p\,e^{q^2/t^2} + q\,e^{p^2/t^2} = 2$.
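The following Python sketch (illustration only; the value $p = 0.8$, the grid of $\lambda$ values, and the bisection bracket are arbitrary choices) checks both statements numerically: it verifies the bound $\operatorname{E}[e^{\lambda X}] \le e^{\lambda^2 s^2/2}$ with the stated variance proxy $s^2$ on a grid, and recovers the subgaussian norm by solving $p\,e^{q^2/t^2} + q\,e^{p^2/t^2} = 2$ by bisection.

import numpy as np

p = 0.8
q = 1.0 - p

# Stated optimal variance proxy of the centered Bernoulli p*delta_q + q*delta_{-p}.
s2 = (p - q) / (2 * (np.log(p) - np.log(q)))

lambdas = np.linspace(-20, 20, 4001)
mgf = p * np.exp(lambdas * q) + q * np.exp(-lambdas * p)   # E[exp(lambda X)]
print("MGF bound holds:", bool(np.all(mgf <= np.exp(lambdas**2 * s2 / 2) + 1e-12)))

# Subgaussian norm: unique positive root t of p*exp(q^2/t^2) + q*exp(p^2/t^2) = 2.
def g(t):
    return p * np.exp(q**2 / t**2) + q * np.exp(p**2 / t**2) - 2.0

lo, hi = 0.1, 10.0                  # g is decreasing in t, with g(lo) > 0 > g(hi)
for _ in range(100):                # plain bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
print("subgaussian norm of X:", 0.5 * (lo + hi))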
Let $X$ be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, $X$ takes values $-1$ and $1$ with probabilities $1/2$ each. Since $\operatorname{E}[\exp(X^2/c^2)] = e^{1/c^2} \le 2$ if and only if $c \ge \frac{1}{\sqrt{\ln 2}}$, it follows that $\|X\|_{\psi_2} = \frac{1}{\sqrt{\ln 2}}$, and hence $X$ is a subgaussian random variable.
Density of a mixture of three normal distributions (μ = 5, 10, 15, σ = 2) with equal weights. Each component is shown as a weighted density (each integrating to 1/3)
Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.
So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector whose tails decay fast in every direction, in the following sense.
Let $X$ be a random vector taking values in $\mathbb{R}^n$.
Define $\|X\|_{\psi_2} := \sup_{v \in S^{n-1}} \|\langle X, v\rangle\|_{\psi_2}$, where $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$.
$X$ is subgaussian iff $\|X\|_{\psi_2} < \infty$.
Theorem. (Theorem 3.4.6 [2]) For any positive integer $n$, the random vector $X$ distributed uniformly on the sphere $\sqrt{n}\,S^{n-1}$ of radius $\sqrt{n}$ is subgaussian, with $\|X\|_{\psi_2} \le C$ for an absolute constant $C$.
This is not so surprising, because as $n \to \infty$, the projection of $X$ to the first coordinate converges in distribution to the standard normal distribution.
Theorem. (over a finite set) If $X_1, \dots, X_n$ are mean-zero subgaussian random variables with a common variance proxy $\sigma^2$, that is, $\operatorname{E}[e^{\lambda X_i}] \le e^{\lambda^2\sigma^2/2}$ for all $\lambda$, then (see the numerical sketch below)
$$\operatorname{E}\left[\max_i X_i\right] \le \sigma\sqrt{2\ln n}, \qquad \operatorname{P}\left(\max_i X_i > t\right) \le n\,e^{-t^2/(2\sigma^2)},$$
$$\operatorname{E}\left[\max_i |X_i|\right] \le \sigma\sqrt{2\ln(2n)}, \qquad \operatorname{P}\left(\max_i |X_i| > t\right) \le 2n\,e^{-t^2/(2\sigma^2)}.$$
Theorem. (over a convex polytope) Fix a finite set of vectors $v_1, \dots, v_m$. If $X$ is a random vector, such that each $\langle X, v_i\rangle$ is a mean-zero subgaussian random variable with variance proxy $\sigma^2$, then the above 4 inequalities hold, with $\sup_{v \in P}\langle X, v\rangle$ replacing $\max_i X_i$ (and with $m$ in place of $n$).
Here, $P$ is the convex polytope spanned by the vectors $v_1, \dots, v_m$.
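A numerical sketch of the finite-set bounds above, for independent standard normal variables (so $\sigma = 1$); the dimensions, sample size, and seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
n, trials = 1000, 20_000

x = rng.standard_normal((trials, n))     # each row: X_1, ..., X_n with sigma = 1
print("E[max_i X_i]   ~", x.max(axis=1).mean(),
      "  bound sqrt(2 ln n)   =", np.sqrt(2 * np.log(n)))
print("E[max_i |X_i|] ~", np.abs(x).max(axis=1).mean(),
      "  bound sqrt(2 ln(2n)) =", np.sqrt(2 * np.log(2 * n)))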
Theorem. (over a ball) If $X$ is a random vector in $\mathbb{R}^n$, such that $\|\langle X, v\rangle\|_{\psi_2} \le \sigma$ for all $v$ on the unit sphere $S^{n-1}$, then
$$\operatorname{E}\left[\sup_{\|v\|_2 \le 1}\langle X, v\rangle\right] = \operatorname{E}\left[\|X\|_2\right] \le C\sigma\sqrt{n}.$$
For any $\delta > 0$, with probability at least $1 - \delta$,
$$\|X\|_2 \le C\sigma\left(\sqrt{n} + \sqrt{\ln(1/\delta)}\right),$$
where $C$ is an absolute constant.
Theorem. (Theorem 2.6.1 [2]) There exists a positive constant $C$ such that given any number of independent mean-zero subgaussian random variables $X_1, \dots, X_N$,
$$\left\|\sum_{i=1}^N X_i\right\|_{\psi_2}^2 \le C\sum_{i=1}^N \|X_i\|_{\psi_2}^2.$$
Theorem. (Hoeffding's inequality) (Theorem 2.6.3 [2]) There exists a positive constant $c$ such that given any number of independent mean-zero subgaussian random variables $X_1, \dots, X_N$, any $a = (a_1, \dots, a_N) \in \mathbb{R}^N$, and any $t \ge 0$,
$$\operatorname{P}\left(\left|\sum_{i=1}^N a_i X_i\right| \ge t\right) \le 2\exp\left(-\frac{ct^2}{K^2\|a\|_2^2}\right), \qquad K = \max_i \|X_i\|_{\psi_2}.$$
Theorem. (Bernstein's inequality) (Theorem 2.8.1 [2]) There exists a positive constant $c$ such that given any number of independent mean-zero subexponential random variables $X_1, \dots, X_N$, and any $t \ge 0$,
$$\operatorname{P}\left(\left|\sum_{i=1}^N X_i\right| \ge t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{\sum_{i=1}^N \|X_i\|_{\psi_1}^2}, \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right),$$
where $\|\cdot\|_{\psi_1}$ denotes the subexponential norm.
Theorem. (Khinchine inequality) (Exercise 2.6.5 [2]) There exists a positive constant $C$ such that given any number of independent mean-zero, variance-one subgaussian random variables $X_1, \dots, X_N$, any $a = (a_1, \dots, a_N) \in \mathbb{R}^N$, and any $p \ge 2$,
$$\left(\sum_{i=1}^N a_i^2\right)^{1/2} \le \left(\operatorname{E}\left[\left|\sum_{i=1}^N a_i X_i\right|^p\right]\right)^{1/p} \le CK\sqrt{p}\left(\sum_{i=1}^N a_i^2\right)^{1/2}, \qquad K = \max_i \|X_i\|_{\psi_2}.$$
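As a sanity check of the Hoeffding-type bound, note that for a Rademacher sum the classical Hoeffding inequality gives the explicit bound $\operatorname{P}\left(\left|\sum_i a_i\varepsilon_i\right| \ge t\right) \le 2\exp\left(-t^2/(2\|a\|_2^2)\right)$. The Python sketch below (illustration only; the coefficient vector, sample size, and seed are arbitrary choices) compares this bound with simulation.

import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 500_000

a = rng.uniform(0.0, 1.0, size=n)                # a fixed coefficient vector
eps = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
s = eps @ a

for t in [2.0, 4.0, 6.0]:
    empirical = np.mean(np.abs(s) >= t)
    hoeffding = 2 * np.exp(-t**2 / (2 * np.dot(a, a)))
    print(f"t={t}: empirical {empirical:.5f} <= Hoeffding {hoeffding:.5f}")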
The Hanson-Wright inequality states that if a random vector $X$ is subgaussian in a certain sense, then any quadratic form of this vector, $X^\top AX$, is also subgaussian/subexponential. Further, the upper bound on the tail of $X^\top AX$ is uniform: it depends on the matrix $A$ only through its Frobenius and operator norms.
A weak version of the following theorem was proved in (Hanson, Wright, 1971).[11] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.
Theorem.[12][13] There exists a constant $c > 0$, such that:
Let $n$ be a positive integer. Let $X_1, \dots, X_n$ be independent random variables, such that each satisfies $\operatorname{E}[X_i] = 0$ and $\|X_i\|_{\psi_2} \le K$. Combine them into a random vector $X = (X_1, \dots, X_n)$. For any $n \times n$ matrix $A$ and every $t \ge 0$, we have
$$\operatorname{P}\left(\left|X^\top AX - \operatorname{E}\left[X^\top AX\right]\right| > t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|}\right)\right),$$
where $\|A\|_F = \left(\sum_{i,j} A_{ij}^2\right)^{1/2}$ is the Frobenius norm of the matrix, and $\|A\| = \max_{\|x\|_2 = 1}\|Ax\|_2$ is the operator norm of the matrix.
In words, the quadratic form $X^\top AX$ has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.
In the statement of the theorem, the constant $c$ is an "absolute constant", meaning that it has no dependence on $n$, $K$, or the matrix $A$. It is a mathematical constant much like pi and e.
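A Monte Carlo sketch (illustration only; the matrix, dimensions, sample size, and seed are arbitrary choices) of the theorem for Rademacher coordinates: the empirical tail of $|X^\top AX - \operatorname{E}[X^\top AX]|$ is listed next to the rate $\min\left(t^2/(K^4\|A\|_F^2),\, t/(K^2\|A\|)\right)$ appearing in the exponent. The theorem asserts that the tail is at most $2e^{-c\,\cdot\,\mathrm{rate}}$ for some absolute constant $c$, so the tail should shrink as the rate grows.

import numpy as np

rng = np.random.default_rng(4)
n, trials = 30, 200_000
K = 1 / np.sqrt(np.log(2))             # psi_2 norm of a Rademacher variable

A = rng.standard_normal((n, n))        # a fixed matrix
fro2 = np.linalg.norm(A, "fro") ** 2
op = np.linalg.norm(A, 2)

X = rng.choice([-1.0, 1.0], size=(trials, n))
quad = np.einsum("ti,ij,tj->t", X, A, X)
dev = np.abs(quad - np.trace(A))       # E[X^T A X] = trace(A) for these coordinates

for t in [50.0, 100.0, 150.0]:
    tail = np.mean(dev > t)
    rate = min(t**2 / (K**4 * fro2), t / (K**2 * op))
    print(f"t={t}: empirical tail {tail:.2e}, Hanson-Wright rate {rate:.2f}")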
Theorem (subgaussian concentration).[12] There exists a constant $c > 0$, such that:
Let $n, m$ be positive integers. Let $X_1, \dots, X_n$ be independent random variables, such that each satisfies $\operatorname{E}[X_i] = 0$, $\operatorname{E}[X_i^2] = 1$, and $\|X_i\|_{\psi_2} \le K$. Combine them into a random vector $X = (X_1, \dots, X_n)$. For any $m \times n$ matrix $A$ and every $t \ge 0$, we have
$$\operatorname{P}\left(\left|\,\|AX\|_2 - \|A\|_F\,\right| > t\right) \le 2\exp\left(-\frac{ct^2}{K^4\|A\|^2}\right).$$
In words, the random vector $AX$ is concentrated on a spherical shell of radius $\|A\|_F$, such that $\|AX\|_2 - \|A\|_F$ is subgaussian, with subgaussian norm $\lesssim K^2\|A\|$.
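A quick simulation (illustration only; the matrix, dimensions, sample size, and seed are arbitrary choices) of the subgaussian concentration theorem with Rademacher coordinates, showing that $\|AX\|_2$ concentrates around $\|A\|_F$ with fluctuations on the scale of the operator norm $\|A\|$.

import numpy as np

rng = np.random.default_rng(5)
m, n, trials = 50, 200, 50_000

A = rng.standard_normal((m, n)) / np.sqrt(n)   # a fixed m x n matrix
X = rng.choice([-1.0, 1.0], size=(trials, n))  # mean-zero, unit-variance subgaussian coordinates

norms = np.linalg.norm(X @ A.T, axis=1)        # ||A X||_2 for each sample
print("||A||_F                =", np.linalg.norm(A, "fro"))
print("mean of ||AX||_2       =", norms.mean())
print("std of ||AX||_2        =", norms.std())   # of order ||A|| (the operator norm)
print("||A|| (operator norm)  =", np.linalg.norm(A, 2))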
Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge: Cambridge University Press.
Kahane, J. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica. 19: 1–25. doi:10.4064/sm-19-1-1-25.
Buldygin, V. V.; Kozachenko, Yu. V. (1980). "Sub-Gaussian random variables". Ukrainian Mathematical Journal. 32 (6): 483–489. doi:10.1007/BF01087176.
Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
Marchal, Olivier; Arbel, Julyan (2017). "On the sub-Gaussianity of the Beta and Dirichlet distributions". Electronic Communications in Probability. 22. arXiv:1705.00048. doi:10.1214/17-ECP92.
Arbel, Julyan; Marchal, Olivier; Nguyen, Hien D. (2020). "On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables". ESAIM: Probability and Statistics. 24: 39–55. arXiv:1901.09188. doi:10.1051/ps/2019018.
Barreto, Mathias; Marchal, Olivier; Arbel, Julyan (2024). "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables". arXiv:2403.08628 [math.ST].
Tao, Terence (2012). Topics in Random Matrix Theory. Graduate Studies in Mathematics. Providence, RI: American Mathematical Society. ISBN 978-0-8218-7430-1.
Rudelson, Mark; Vershynin, Roman (2010). "Non-asymptotic theory of random matrices: extreme singular values". Proceedings of the International Congress of Mathematicians 2010. pp. 1576–1602. arXiv:1003.2990. doi:10.1142/9789814324359_0111.
Zajkowski, K. (2020). "On norms in some class of exponential type Orlicz spaces of random variables". Positivity. 24 (5): 1231–1240. arXiv:1709.02970. doi:10.1007/s11117-019-00729-6.