The central limit theorem (CLT) states that sample means are approximately normally distributed for large sample sizes. This is crucial for statistical applications in any field of research that deals with randomness. Any measure of interest that has inherent noise (blood measures, neurophysiological activity, social behavior) can be analyzed by knowing how the mean of a sample of such measures is distributed.
So what does the CLT look like? It requires multiple independently and identically distributed (iid.) measures \(X_i\) with \(i\in 1..n\). The iid. assumption describes a situation in which the observations \(X_i\) come from the same randomness mechanism (e.g., many participants receiving the same medication; no one receives a higher dosage than another, unless that is the random process we are interested in) and are independent draws from this mechanism. Then, if we were to repeat this experiment — draw \(n\) observations and compute the mean — over and over again, the computed means would be approximately normally distributed.
Theorem 1. (Central Limit Theorem)
Let \(X_1\), \(X_2\), …, \(X_n\) be iid. as \(X\). Assume that all moments are finite and, in particular, assume without loss of generality that \(E[X] = 0\) and \(Var[X] = 1\). Let \(Z_n = \frac{1}{\sqrt{n}}\sum\limits_{i=1}^n X_i\) (we multiply the mean by \(\sqrt{n}\), i.e., the sum by \(\frac{\sqrt{n}}{n} = \frac{1}{\sqrt{n}}\), to make \(Z_n\) have variance 1). Then
\[\begin{align*} Z_n \overset{d}{\longrightarrow} \mathcal{N}(0,1) \quad \text{as } n\rightarrow \infty \text{,} \end{align*}\] that is, the distribution of \(Z_n\) converges to the standard normal distribution.
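As a quick numerical sanity check (a sketch in Python; the uniform distribution for \(X\) and the sample sizes are arbitrary choices), we can draw \(Z_n\) many times and see that its distribution looks standard normal:

```python
import math
import random

# Z_n = (X_1 + ... + X_n) / sqrt(n) for X_i uniform on [-sqrt(3), sqrt(3)],
# a decidedly non-normal distribution with E[X] = 0 and Var[X] = 1.
rng = random.Random(0)
n, trials = 50, 20_000
zs = []
for _ in range(trials):
    s = sum(rng.uniform(-math.sqrt(3), math.sqrt(3)) for _ in range(n))
    zs.append(s / math.sqrt(n))

mean_z = sum(zs) / trials
var_z = sum(z * z for z in zs) / trials
within_one_sd = sum(abs(z) <= 1 for z in zs) / trials
print(mean_z, var_z, within_one_sd)  # roughly 0, 1, and 0.68
```

About 68% of the simulated means fall within one standard deviation of 0, as expected for a standard normal distribution.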
We will first look at a short overview of the proof. In it, we will use technicalities about characteristic functions, which I explain later. If you are a mathematician, this structure may appear unusual to you because, most of the time, people write all the required lemmas before the important theorem. But here, I focus on the core proof first and only add the details later.
To start with, the characteristic function of a random variable \(X\) is \(\psi_X(t) = E[e^{itX}]\), which describes the moments of \(X\). Remember that \[\begin{align*} e^c = \frac{c^0}{0!} + \frac{c^1}{1!} + \frac{c^2}{2!} + \frac{c^3}{3!} + ... \end{align*}\]
Therefore, we have that the \(m\)-th derivative of \(\psi_X(t/i) = E[e^{tX}]\) at \(t=0\) is the \(m\)-th moment, because the terms with powers of \(t\) below \(m\) are removed by taking the derivative and the terms with powers of \(t\) above \(m\) are removed by evaluating at \(t=0\).
\[\begin{align*} \left( \frac{d^m}{dt^m} \psi_X(t/i) \right) \rvert_{t=0} &= \left( \frac{d^m}{dt^m} E\left[e^{tX}\right] \right) \rvert_{t=0} \\ &= \left( \frac{d^m}{dt^m} E\left[\frac{(tX)^0}{0!} + \frac{(tX)^1}{1!} + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + ... \right] \right) \rvert_{t=0} \\ &= \left( \frac{d^m}{dt^m} \left( \frac{t^0 E[X^0]}{0!} + \frac{t^1 E[X^1]}{1!} + \frac{t^2 E[X^2]}{2!} + \frac{t^3 E[X^3]}{3!} + ... \right) \right) \rvert_{t=0} \\ &= E[X^m] \\ \end{align*}\]
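We can check this numerically for a small example (a sketch; the fair-coin variable and the step size \(h\) are arbitrary choices): for \(X\) uniform on \(\{-1, +1\}\) we have \(E[e^{tX}] = \cosh(t)\), and finite differences at \(t = 0\) recover the moments \(E[X] = 0\) and \(E[X^2] = 1\):

```python
import math

def mgf_coin(t):
    # E[e^{tX}] for X uniform on {-1, +1} is cosh(t)
    return 0.5 * (math.exp(t) + math.exp(-t))

h = 1e-4
# first derivative at t = 0 approximates E[X] = 0
m1 = (mgf_coin(h) - mgf_coin(-h)) / (2 * h)
# second derivative at t = 0 approximates E[X^2] = 1
m2 = (mgf_coin(h) - 2 * mgf_coin(0.0) + mgf_coin(-h)) / h**2
print(m1, m2)  # roughly 0 and 1
```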
The core idea of the proof is showing that, in the limit, all but the first two moments vanish when taking the mean over multiple independent draws of the same random variable.
Proof.
\(Z_n\) is a sum of scaled random variables \(\frac{X_i}{\sqrt{n}}\). This sum has a characteristic function that is the product of individual characteristic functions (see Lemma 2):
\[\begin{align*} \psi_{Z_n}(t) &= \prod\limits_{i=1}^n \psi_{\frac{X_i}{\sqrt{n}}}(t) \\ &= [E[e^{itX/\sqrt{n}}]]^n \\ &= \left[E\left[ \frac{(itX/\sqrt{n})^0}{0!} + \frac{(itX/\sqrt{n})^1}{1!} + \frac{(itX/\sqrt{n})^2}{2!} + \frac{(itX/\sqrt{n})^3}{3!} + \dots \right]\right]^n \\ &= \left[ 1 + \frac{(itE[X])}{\sqrt{n}} - \left(\frac{t^2E[X^2]}{2n}\right) + O\left(\frac{1}{n^{3/2}}\right) + \dots \right]^n \\ &= \left[ 1 - \left(\frac{t^2}{2n}\right) + O\left(\frac{1}{n^{3/2}}\right) \right]^n \\ \end{align*}\]
We consider the limit of this characteristic function as \(n \rightarrow \infty\).
\[\begin{align*} \lim_{n\rightarrow \infty} \psi_{Z_n}(t) &= \lim_{n\rightarrow \infty} \left[ 1 - \frac{t^2}{2n} + O\left(\frac{1}{n^{3/2}}\right) \right]^n \\ &= \lim_{n\rightarrow \infty} \left[ \left(1 - \frac{t^2}{2n}\right)^n + \sum_{k=1}^n \binom{n}{k} \left(1 - \frac{t^2}{2n}\right)^{n-k} O\left(\frac{1}{n^{3/2}}\right)^k \right] \\ &= e^{-\frac{t^2}{2}} + \lim_{n\rightarrow \infty} \left[ \sum_{k=1}^n \binom{n}{k} \left(1 - \frac{t^2}{2n}\right)^{n-k} O\left(\frac{1}{n^{3/2}}\right)^k \right] \end{align*}\]
The first part produces \(e^{-t^2/2}\) because \(\lim_{n\rightarrow\infty}\left(1 + \frac{c}{n}\right)^n = e^c\). The second part vanishes as \(n\) approaches infinity because, for \(n > t^2/2\), \(\left\lvert 1 - \frac{t^2}{2n} \right\rvert \leq 1\) and \(\binom{n}{k} \leq n^k\), so that
\[\begin{align*} \left\lvert \lim_{n\rightarrow \infty} \left[ \sum_{k=1}^n \binom{n}{k} \left(1 - \frac{t^2}{2n}\right)^{n-k} O\left(\frac{1}{n^{3/2}}\right)^k \right] \right\rvert & \leq \lim_{n\rightarrow \infty} \left[ \sum_{k=1}^n n^k \cdot O\left(\frac{1}{n^{3k/2}}\right) \right] \\ & = \lim_{n\rightarrow \infty} \left[ \sum_{k=1}^n O\left(\frac{1}{n^{k/2}}\right) \right] \\ & = \lim_{n\rightarrow \infty} O\left(\frac{1}{n^{1/2}}\right) \\ & = 0 \\ \end{align*}\]
This is just the characteristic function of the standard normal distribution (Lemma 3). Since characteristic functions uniquely determine the probability distribution (Lemma 4), the distribution of \(Z_n\) approaches the standard normal distribution as \(n\rightarrow \infty\).
\[\begin{align*} \lim_{n\rightarrow \infty} \psi_{Z_n}(t) = e^{-\frac{t^2}{2}} \end{align*}\]
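For the fair-coin example this convergence can even be checked in closed form (a sketch; the coin distribution and the value \(t = 1.5\) are arbitrary choices): each \(\psi_{X_i/\sqrt{n}}(t) = \cos(t/\sqrt{n})\), so \(\psi_{Z_n}(t) = \cos(t/\sqrt{n})^n\), which approaches \(e^{-t^2/2}\):

```python
import math

def psi_zn_coin(t, n):
    # For X_i uniform on {-1, +1}: psi_{X_i/sqrt(n)}(t) = cos(t/sqrt(n)),
    # so the characteristic function of Z_n is the n-fold product.
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5
target = math.exp(-t**2 / 2)
err_100 = abs(psi_zn_coin(t, 100) - target)
err_10000 = abs(psi_zn_coin(t, 10_000) - target)
print(err_100, err_10000)  # the error shrinks as n grows
```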
Lemma 2 (Sum of random variables, product of characteristic functions).
The sum of two independent random variables \(Z = X + Y\) has as its characteristic function the product of the two variables’ characteristic functions, \(\psi_Z(t) = \psi_X(t)\psi_Y(t)\).
Proof \[\begin{align*} \psi_Z(t) &= E\left[e^{itZ}\right] \\ &= E\left[e^{it(X+Y)}\right] \\ &= E\left[e^{itX}e^{itY}\right] \\ &= E\left[e^{itX}\right]E\left[e^{itY}\right] \\ &= \psi_X(t)\psi_Y(t) \end{align*}\]
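A small numerical illustration (a sketch with two independent fair coins; the value \(t = 0.7\) is arbitrary): the sum \(Z = X + Y\) takes values \(-2, 0, 2\) with probabilities \(1/4, 1/2, 1/4\), and its characteristic function equals \(\psi_X(t)\psi_Y(t) = \cos^2(t)\):

```python
import cmath
import math

def cf(dist, t):
    # characteristic function E[e^{itX}] of a discrete random variable,
    # given as a list of (value, probability) pairs
    return sum(p * cmath.exp(1j * t * x) for x, p in dist)

coin = [(-1, 0.5), (1, 0.5)]
z = [(-2, 0.25), (0, 0.5), (2, 0.25)]  # distribution of the sum of two coins

t = 0.7
lhs = cf(z, t)
rhs = cf(coin, t) * cf(coin, t)
print(lhs, rhs)  # both equal cos(t)^2
```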
Lemma 3 (Characteristic functions of the standard normal distribution)
A random variable \(X\) with the standard normal probability distribution has the characteristic function \(\psi_X(t) = e^{-t^2/2}\).
Proof \[\begin{align*} \psi_X(t) &= E\left[e^{itX}\right] \\ &= \int\limits_{-\infty}^{\infty} e^{itx} f(x) dx \\ &= \int\limits_{-\infty}^{\infty} e^{itx} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} dx \\ &= \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2itx)} dx \\ &= \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2itx + (it)^2 - (it)^2)} dx \\ &= \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{\infty} e^{-\frac{1}{2}(x - it)^2} e^{-\frac{t^2}{2}} dx \\ &= e^{-\frac{t^2}{2}} \int\limits_{-\infty}^{\infty} f(x-it) dx \\ \end{align*}\]
The latter integral is equal to 1 because it is essentially the integral of a normal probability density. The imaginary offset does not change the result of the integral (a fully rigorous argument would shift the contour of integration). We can substitute \(y = x-it\), \(\frac{dy}{dx} = 1\), so that \(\int_{-\infty}^{\infty} f(x-it) dx = \int_{-\infty}^{\infty} f(y) dy = 1\). Then we get the characteristic function as desired.
\[\begin{align*} \psi_X(t) = e^{-\frac{t^2}{2}} \end{align*}\]
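We can verify Lemma 3 numerically by approximating \(E[e^{itX}]\) with a Riemann sum (a sketch; the integration range \([-10, 10]\), the step count, and \(t = 1.3\) are arbitrary choices):

```python
import cmath
import math

def normal_cf_numeric(t, lo=-10.0, hi=10.0, steps=200_000):
    # midpoint Riemann sum of E[e^{itX}] for X ~ N(0, 1); the density
    # outside [-10, 10] is negligible
    dx = (hi - lo) / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += cmath.exp(1j * t * x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2 * math.pi)

t = 1.3
val = normal_cf_numeric(t)
print(val, math.exp(-t * t / 2))
```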
Lemma 4 (Probability distributions are uniquely determined by their characteristic function)
Let \(\psi_X(t)\) be the characteristic function of \(X\); then the probability mass of \(X\) in the interval \([a, b]\) is recovered by
\[\begin{align*} \lim_{T\rightarrow\infty} \frac{1}{2\pi} \int\limits_{-T}^{T} \frac{e^{-ita}-e^{-itb}}{it} \psi_X(t) dt = P(a < X < b) + \frac{P(X=a) + P(X=b)}{2} \text{.} \end{align*}\]
Proof. The two exponential terms in the integral can be rewritten as
\[\begin{align*} \int\limits_{-T}^T \frac{e^{itc}}{2it} dt &= \int_0^T \frac{e^{itc}}{2it} dt + \int_{-T}^0 \frac{e^{itc}}{2it} dt \\ &= \int_0^T \frac{e^{itc}}{2it} dt + \int_{0}^T \frac{e^{-itc}}{-2it} dt \\ &= \int_0^T \frac{\left(\cos(tc) + i\sin(tc)\right) - \left(\cos(tc) - i\sin(tc)\right)}{2it} dt \\ &= \int_0^T \frac{\sin(tc)}{t} dt \text{.} \end{align*}\]
Coming back to the equation we want to prove here, we replace the characteristic function by its definition and obtain:
\[\begin{align*} \lim_{T\rightarrow\infty} \frac{1}{2\pi} \int\limits_{-T}^{T} \frac{e^{-ita}-e^{-itb}}{it} \psi_X(t) dt &= \lim_{T\rightarrow\infty} \frac{1}{2\pi} \int\limits_{-T}^{T} \frac{e^{-ita}-e^{-itb}}{it} \int\limits_{-\infty}^{\infty} e^{itx}f(x) dx dt \\ &= \int\limits_{-\infty}^{\infty} \lim_{T\rightarrow\infty} \frac{1}{2\pi} \int\limits_{-T}^{T} \frac{e^{it(x-a)}-e^{it(x-b)}}{it} f(x) dt dx \\ &= \int\limits_{-\infty}^{\infty} \frac{1}{\pi} \lim_{T\rightarrow\infty} \int\limits_{0}^{T} \frac{\sin(t(x-a))}{t} - \frac{\sin(t(x-b))}{t} dt f(x) dx \\ &= \int\limits_{(-\infty, a)} 0 \cdot f(x) dx + \int\limits_{[a, a]} \frac{1}{2} \cdot f(x) dx \\ & ~~~~~~ + \int\limits_{(a, b)} 1 \cdot f(x) dx \\ & ~~~~~~ + \int\limits_{[b, b]} \frac{1}{2} \cdot f(x) dx + \int\limits_{(b, \infty)} 0 \cdot f(x) dx \\ &= P(a < X < b) + \frac{1}{2}\left(P(X = a) + P(X = b) \right) \end{align*}\]
where we used the solution to the inner integral from Lemma 5. With that, the probability mass of any interval is exactly retrieved and, therefore, characteristic functions and probability distributions are in a one-to-one relationship.
Lemma 5 (Limit of the integral of \(\sin(tc)/t\))
\[\begin{align*} \lim_{T\rightarrow\infty} \int_0^T \frac{\sin(tc)}{t} dt = { \begin{cases} -\frac{\pi}{2}, & \text{for } c < 0 \\ +\frac{\pi}{2}, & \text{for } c > 0 \\ 0, & \text{for } c = 0 \end{cases} } \end{align*}\]
Proof. For this proof, we need the Laplace transform \(\mathcal{L}\{f\}(s) = \int_{0}^\infty e^{-st} f(t) dt\). The Laplace transform has a special property for functions of the form \(f(t)/t\) (with the idea that \(f(t) = \sin(t)\) later).
\[\begin{align*} \mathcal{L}\left\{ \frac{f(t)}{t} \right\}(s) &= \int\limits_{0}^{\infty} \frac{f(t)}{t} e^{-st} dt \\ &= \int\limits_{0}^{\infty} \frac{f(t)}{t} \left( e^{- \lim_{r \rightarrow s} rt} - e^{- \lim_{r \rightarrow \infty} rt} \right) dt \\ &= \int\limits_{0}^{\infty} \frac{f(t)}{t} \left( e^{-rt} \vert_{r=\infty}^{r=s} \right) dt \\ &= \int\limits_{0}^{\infty} f(t) \left( \int\limits_{\infty}^{s} - e^{-rt} dr \right) dt \\ &= \int\limits_{0}^{\infty} f(t) \left( \int\limits_{s}^{\infty} e^{-rt} dr \right) dt \\ &= \int\limits_{s}^{\infty} \int\limits_{0}^{\infty} f(t) e^{-rt} dt dr \\ &= \int\limits_{s}^{\infty} \mathcal{L}\left\{f(t)\right\}(r) dr \end{align*}\]
With this, we can prove the lemma. We first need to get rid of the constant \(c\) in the \(\sin(tc)\) function with a change of variables: we substitute \(u=tc\), \(t = u/c\), and \(du = c\cdot dt\). Since we consider the limit to infinity, this constant simply drops out if it is positive, which we assume for now \((c>0)\).
\[\begin{align*} \lim_{T\rightarrow\infty} \int_0^T \frac{\sin(tc)}{t} dt &= \lim_{T\rightarrow\infty} \int_0^{Tc} \frac{\sin(u)}{u/c} \frac{du}{c} \\ &= \int_0^{\infty} \frac{\sin(u)}{u} du \\ \end{align*}\]
For negative constants, \(c<0\), the integral goes from \(-\infty\) to \(0\) and the proof can be written analogously with an additional minus sign at the end. For \(c=0\), the integral is simply \(0\).
Now we multiply the integrand by \(e^{-su}\) and take the limit \(s\rightarrow 0\). In this limit the factor tends to 1 and the integral is unchanged, but it lets us recognize the integrand as a Laplace transform. For this Laplace transform, we make use of the special form of \(\sin(t)/t\) as introduced initially.
\[\begin{align*} \lim_{T\rightarrow\infty} \int_0^T \frac{\sin(tc)}{t} dt &= \int_0^{\infty} \frac{\sin(u)}{u} du \\ &= \lim_{s\rightarrow 0} \int_0^{\infty} \frac{\sin(u)}{u} e^{-su} du \\ &= \lim_{s\rightarrow 0} \mathcal{L}\left\{ \frac{\sin(u)}{u} \right\}(s) \\ &= \lim_{s\rightarrow 0} \int_s^{\infty} \mathcal{L}\left\{ \sin(u) \right\}(r) dr \\ \end{align*}\]
We solve this Laplace transform (Lemma 6) and then the integral (Lemma 7) to arrive at the desired statement.
\[\begin{align*} \lim_{T\rightarrow\infty} \int_0^T \frac{\sin(tc)}{t} dt &= \lim_{s\rightarrow 0} \mathcal{L}\left\{ \frac{\sin(u)}{u} \right\}(s) \\ &= \lim_{s\rightarrow 0} \int_s^{\infty} \frac{1}{1+r^2} dr \\ &= \lim_{s\rightarrow 0} \left[ \tan^{-1}(r) \right]_{r=s}^{r=\infty} \\ &= \frac{\pi}{2} - 0 \\ &= \frac{\pi}{2} \\ \end{align*}\]
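Lemma 5 can also be checked numerically (a sketch; the truncation point \(T\) and step size are arbitrary, and the truncated tail contributes an error of order \(1/(|c|T)\)):

```python
import math

def sine_integral(c, T=500.0, steps=500_000):
    # midpoint rule for the integral of sin(t*c)/t over (0, T]
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        total += math.sin(t * c) / t
    return total * dt

pos = sine_integral(1.0)   # should be near  pi/2
neg = sine_integral(-2.0)  # should be near -pi/2
print(pos, neg)
```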
Lemma 6 (Laplace transform of \(\sin(t)\)).
\[\begin{align*} \mathcal{L}\{\sin(t)\}(s) = \frac{1}{1+s^2} \end{align*}\]
Proof. First, we need the Laplace transform of \(e^{at}\). \[\begin{align*} \mathcal{L}\{e^{at}\}(s) &= \int\limits_{0}^{\infty} e^{-st} e^{at} dt \\ &= \int\limits_{0}^{\infty} e^{(a-s)t} dt \\ &= \frac{1}{a-s} e^{(a-s)t} \vert_{t=0}^{t=\infty} \\ &= \frac{1}{a-s} e^{(a-s)\infty} - \frac{1}{a-s} e^{(a-s)\cdot0} \\ &\overbrace{=}^{s>a} 0 - \frac{1}{a-s} \\ &= \frac{1}{s-a} \\ \end{align*}\]
Now we can solve the Laplace transform of \(\sin(t)\) using the equality \((e^{it} - e^{-it})/(2i) = \left(\cos(t) + i\sin(t) - \cos(-t) - i\sin(-t)\right)/(2i) = 2i\sin(t)/(2i) = \sin(t)\).
\[\begin{align*} \mathcal{L}\{\sin(t)\}(s) &= \mathcal{L}\bigg\{ \frac{e^{it} - e^{-it}}{2i} \bigg\}(s) \\ &= \frac{1}{2i} \mathcal{L}\{ e^{it} - e^{-it} \} \\ &= \frac{1}{2i} \left( \frac{1}{s-i} - \frac{1}{s-(-i)} \right) \\ &= \frac{1}{2i} \left( \frac{(s+i)}{(s+i)(s-i)} - \frac{s-i}{(s+i)(s-i)} \right) \\ &= \frac{1}{2i} \left( \frac{2i}{s^2 - i^2} \right) \\ &= \frac{1}{s^2 + 1} \\ \end{align*}\]
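A quick numerical check of Lemma 6 (a sketch; truncating the integral at \(T = 60\) leaves an error of order \(e^{-sT}\), and the test values of \(s\) are arbitrary):

```python
import math

def laplace_sin(s, T=60.0, steps=600_000):
    # midpoint rule for the integral of e^{-st} sin(t) over [0, T]
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        total += math.exp(-s * t) * math.sin(t)
    return total * dt

l_half = laplace_sin(0.5)
l_two = laplace_sin(2.0)
print(l_half, 1 / (1 + 0.5**2))
print(l_two, 1 / (1 + 2.0**2))
```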
Lemma 7 (Derivative of the inverse tangent).
\[\begin{align*} \int \frac{1}{1+s^2} ds = \tan^{-1}(s) \end{align*}\]
Proof. The derivative of the tangent is
\[\begin{align*} \frac{d}{ds} \tan(s) &= \frac{d}{ds} \frac{\sin(s)}{\cos(s)} \\ &= \frac{\sin'(s)\cos(s) - \sin(s)\cos'(s)}{\cos^2(s)} \\ &= \frac{\cos^2(s) + \sin^2(s)}{\cos^2(s)} \\ &= 1 + \tan^2(s) \text{,} \end{align*}\]
and the derivative of the inverse therefore is
\[\begin{align*} \frac{d}{ds}\tan^{-1}(s) &= \frac{1}{\tan'(\tan^{-1}(s))} \\ &= \frac{1}{1 + \tan^2(\tan^{-1}(s))} \\ &= \frac{1}{1 + s^2} \text{.} \end{align*}\]
I couldn’t find a self-contained proof of the central limit theorem, so I made one myself in the hope that it helps someone. I took parts from various sources, mostly from https://sas.uwaterloo.ca/~dlmcleis/s901/chapt6.pdf, but also from https://www.youtube.com/c/papaflammy/.
For the last 10 nights, we ordered from a new pizza place. They promise their delivery time to be 30 minutes on average. We noted down the delivery times and computed an average of \(\bar{x} = 43\) minutes with a standard deviation of \(s = 15\) minutes. Is it plausible that they deliver after 30 minutes on average and we just got unlucky, or should we believe they underestimate their delivery time?
Given a sample \(X_1, X_2, ..., X_N\) of normally distributed random variables with expected value \(\mu\), we want to know whether the sample has an unusually large mean. In particular, we want to compute the probability of obtaining a mean that is equal to or larger than the observed mean, \(P(\bar{X} > \bar{x})\).
If we were also given the variance \(\sigma^2\) of the \(X_i\)s, we would already know that the mean \(\bar{X}\) is normally distributed with expected value \(\mu\) and variance \(\sigma^2/N\). Based on this, we could compute
\[\begin{align*} z = \frac{\bar{x} - \mu}{\sigma/\sqrt{N}} \end{align*}\]
knowing that it is standard normally distributed, and from that compute the probability of obtaining the observed or a larger \(z\) (in R with 1 - pnorm(z)). But realistically, we don’t know \(\sigma\). We also don’t know \(\mu\) but will hypothesize a value \(\mu_0\); that is, we just take a value and test whether the mean is unusually far away from it, to decide whether \(\mu_0\) is a plausible hypothesis or not. But crucially, we will have to replace \(\sigma^2\) by an estimate, \(s^2\). Then, instead of a \(z\) value, we can only compute what we will call a \(t\) value:
\[\begin{align*} t = \frac{\bar{x} - \mu_0}{s/\sqrt{N}} \text{~.} \end{align*}\]
But what distribution does \(t\) follow now that we have to estimate the variance?
We can obtain the probability of an equal or larger \(t\) value in R with pt(t, df, lower.tail = FALSE), where df is the degrees of freedom \(\nu = N-1\).
On the main street in our town, the speed limit is 50 km/h. Recently, an open speed meter was put up that displays the driving speed of each car. We observed 25 cars and computed their average speed, \(\bar{x} = 51\) km/h, and standard deviation, \(s = 6\) km/h. We calculate a \(t\) value as
\[\begin{align*} t = \frac{\bar{x} - \mu_0}{s/\sqrt{N}} = \frac{51 - 50}{6/\sqrt{25}} = \frac{5}{6} = 0.83 \text{~.} \end{align*}\]
With R we compute that the probability of obtaining this \(t\) value or a larger one is 21% (pt(0.83, df = 24, lower.tail = FALSE)). It is pretty likely that we obtain such a deviation just by chance. We do not reject the hypothesis that cars on average drive the allowed 50 km/h.
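The same 21% can be cross-checked with a Monte Carlo simulation (a sketch in Python rather than the R used elsewhere; the true \(\sigma = 6\) is an arbitrary choice, since the \(t\) value is scale-free): we simulate samples under the null hypothesis \(\mu_0 = 50\) and count how often \(t \geq 0.83\):

```python
import math
import random

rng = random.Random(0)
N, trials, t_obs = 25, 50_000, 5 / 6
mu0 = 50.0
exceed = 0
for _ in range(trials):
    xs = [rng.gauss(mu0, 6.0) for _ in range(N)]
    xbar = sum(xs) / N
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (N - 1))
    t_val = (xbar - mu0) / (s / math.sqrt(N))
    if t_val >= t_obs:
        exceed += 1
p_hat = exceed / trials
print(p_hat)  # roughly 0.21
```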
I took the proof from Shoichi Midorikawa, https://shoichimidorikawa.github.io/Lec/ProbDistr/t-e.pdf.
Here, I will derive the normal distribution from a simple and intuitive setting. This is mostly based on a video from MolloyMaths.
Imagine throwing darts at a dartboard. We are interested in the distribution \(w\) of where the darts hit the board. We will assume that hitting a certain location on the two-dimensional board depends only on the radius: how far away from the center the dart hits. Specifically, we want the probability of the dart hitting a location \((x,y)\) to depend only on \(\sqrt{x^2 + y^2}\). We call this distribution \(w\):
\[\begin{align*} w\left(\sqrt{x^2 + y^2}\right) \end{align*}\]
We will further assume that deviations from the center in horizontal and vertical directions are independent of each other: Just because the dart hits too far to the left, doesn’t mean it hits too far up or too far down. Further, we assume that deviations in both directions follow the same mechanism. Specifically, we define the distribution \(f\) of deviations along either dimension, \(f(x)\) and \(f(y)\). The distribution of deviations in two dimensions \(w\) is then the product (because of the independence assumption) of the deviation in each dimension:
\[\begin{align} w\left(\sqrt{x^2 + y^2}\right) = f(x)f(y) \label{eq:wff} \end{align}\]
Interestingly, under these assumptions, \(w\) and \(f\) are tightly coupled, as we notice by setting \(y = 0\):
\[\begin{align*} w\left(\sqrt{x^2 + 0^2}\right) &= f(x)f(0) \\ w\left(x\right) &= \lambda \cdot f(x) \end{align*}\]
In other words, the distribution in two dimensions (how likely deviations of a certain radius are) is equal to the distribution in one dimension (for example, horizontal deviations) scaled by a certain factor \(\lambda = f(0)\). Let’s put this knowledge into Equation \(\ref{eq:wff}\) and get rid of the scaling:
\[\begin{align*} \lambda \cdot f\left(\sqrt{x^2 + y^2}\right) &= f(x)f(y) \\ \frac{ f\left(\sqrt{x^2 + y^2}\right) }{ \lambda } &= \frac{f(x)}{\lambda} \frac{f(y)}{\lambda} \\ g\left(\sqrt{x^2 + y^2}\right) &= g(x) g(y) \numberthis \label{eq:ggg} \\ \end{align*}\]
The function \(g\) is just a scaling of the distribution \(f\). The latter is the probability density of deviations along one dimension (for example for horizontal deviations).
Up to now, we took three assumptions: the probability of a hit depends only on its distance from the center; deviations along the horizontal and vertical dimensions are independent; and deviations along both dimensions follow the same distribution. From these, we derived that the distribution of deviations along the radius (\(w\)) must be equal to the distribution of deviations along the horizontal and vertical dimensions (\(f\)): Compared to Equation \(\ref{eq:wff}\), where \(w\) and \(f\) are different functions, we derived in Equation \(\ref{eq:ggg}\) that everything follows one functional shape \(g\) (up to scaling).
Now the question is: what form must a function have to conform to the structure in Equation \(\ref{eq:ggg}\)? One possibility is \(g(x) = e^{x^2}\) because then:
\[\begin{align*} g\left(\sqrt{x^2 + y^2}\right) &= g(x) g(y) \\ e^{\sqrt{x^2 + y^2}^2} &= e^{x^2} e^{y^2} \\ e^{x^2 + y^2} &= e^{x^2} e^{y^2} ~~~~~~ \text{This is true because of rules of exponents.} \end{align*}\]
However, the problem is that \(g(x) = e^{x^2}\) does not have a finite integral and therefore cannot be scaled to become a probability distribution. But we can overcome this very easily by introducing a negative constant in the exponent:
\[g(x) = e^{- x^2}\].
We could introduce any other negative constant, for example \(g(x) = e^{-2 \cdot x^2}\). But things will be easier if we go with \(-1\) now. We will later introduce a scaling parameter here to modify the variance, so it doesn’t matter which constant we use at this point.
But that is it! That is the core of the normal distribution: It follows the assumptions we started out with. Now all we have to do is make it nice. For example, we first scale it to a proper distribution.
What we originally wanted is the probability distribution \(f\) which is a scaling of the function \(g\) we just found, \(f(x) = \lambda g(x)\). Since we want \(f\) to be a probability distribution, the integral of \(f(x)\) is supposed to be 1, \(\int_{-\infty}^{\infty} f(x) dx = \int_{-\infty}^{\infty} \lambda g(x) dx = 1\). So the question is, what is \(\lambda\)?
For this, we need to find the integral of \(e^{-x^2}\).
\[\begin{align*} I = \int_{-\infty}^{\infty} e^{-x^2} dx \end{align*}\]
We approach this problem using a similar idea as before: we go into two dimensions by looking not for \(I\) but for \(I^2\).
\[\begin{align*} I^2 = I\cdot I &= \int_{-\infty}^{\infty} e^{-x^2} dx \cdot \int_{-\infty}^{\infty} e^{-y^2} dy \\ &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)} dx dy \\ \end{align*}\]
The double integral means that we have to go to every coordinate \((x,y)\) in \(\mathbb{R}^2\) (every pair of real valued numbers). Thus, we have to go through all coordinates. You can visualize this as going through a sheet of checkered paper line by line: For every \(x\) (row) you have to go to every \(y\) (column) and thereby you visit every cell. Now we switch the perspective: Instead of going line by line, we start in the center of the sheet and go outwards in circles. Thus, we increase the radius of these circles from 0 (center of the sheet) to infinity and instead of walking lines from left to right we walk along these circles. This also covers the whole sheet as does going line by line. But the growing circles perspective allows us to reformulate the double integral:
\[\begin{align*} I^2 &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)} dx dy \\ &= \int_{0}^{\infty} \int_{0}^{2\pi} e^{-r^2} \, r \, d\theta \, dr \\ &= \int_{0}^{\infty} 2 \pi r e^{-r^2} dr \\ \end{align*}\]
In the second line, we go from radius 0 to infinity and, for each radius, walk around the circle of that radius using the angle \(\theta\); a small arc at radius \(r\) has length \(r \, d\theta\), which is where the factor \(r\) comes from. In the third line, integrating over \(\theta\) simply yields the circumference of a circle with radius \(r\): \(2\pi r\).
What did we gain by this change of perspective? Now we essentially have an integral of \(r e^{-r^2}\) instead of just \(e^{-x^2}\). It doesn’t look like it but this is easier to solve. The idea is to use the derivative of the quadratic function \(r^2\).
\[\begin{align*} \frac{d (r^2)}{dr} = 2r \\ d (r^2) = 2r dr \\ \end{align*}\]
Luckily, we find the right-hand side in the integral we want to solve:
\[\begin{align*} I^2 &= \int_{0}^{\infty} 2 \pi r e^{-r^2} dr \\ &= \pi \int_{0}^{\infty} e^{-r^2} 2r dr \\ &= \pi \int_{0}^{\infty} e^{-r^2} dr^2 \\ &= \pi \int_{0}^{\infty} e^{-\rho} d\rho \\ \end{align*}\]
Now, the integral is easy because the derivative of \(e^{-\rho}\) is just \(-e^{-\rho}\) so that the integral of \(e^{-\rho}\) is just \(-e^{-\rho}\):
\[\begin{align*} I^2 &= \pi \left[-e^{-\rho}\right]^{\infty}_0 \\ &= \pi \left[ -e^{-\infty} - -e^{-0} \right] \\ &= \pi \left[ (-0) - (-1) \right] \\ &= \pi \left[ 1 \right] \\ &= \pi \\ \end{align*}\]
With that, we know that the integral we were originally looking for is:
\[\begin{align*} \int_{-\infty}^{\infty} e^{-x^2} dx = I = \sqrt{I^2} = \sqrt{\pi} \text{~.} \numberthis \label{eq:gaussint} \end{align*}\]
This is a famous integral called the Gaussian integral. With it, we have found one possible \(f(x)\)!
\[\begin{align*} f(x) = \lambda g(x) = \frac{1}{\sqrt{\pi}} e^{-x^2} \end{align*}\]
With \(\int_{-\infty}^{\infty} \lambda g(x) dx = \lambda \int_{-\infty}^{\infty} g(x) dx = \lambda \sqrt{\pi}\), we have now learned that \(\lambda = \frac{1}{\sqrt{\pi}}\) in order to get a probability distribution \(f\) that has integral 1.
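The Gaussian integral in Equation \(\ref{eq:gaussint}\) is easy to confirm numerically (a sketch; the cutoff at \(\pm 8\) and the step count are arbitrary choices):

```python
import math

def gaussian_integral(lo=-8.0, hi=8.0, steps=160_000):
    # midpoint rule for the integral of e^{-x^2}; the tails beyond |x| = 8
    # contribute on the order of 1e-28
    dx = (hi - lo) / steps
    return sum(math.exp(-((lo + (k + 0.5) * dx) ** 2)) for k in range(steps)) * dx

val = gaussian_integral()
print(val, math.sqrt(math.pi))
```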
Let us find the variance \(\sigma^2\) of distribution \(f\).
\[\begin{align*} \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx = \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{\pi}} e^{- x^2} dx \end{align*}\]
For this, we used that \(\mu = 0\) because \(f\) is symmetric around 0 and therefore has an expected value of 0.
Our strategy is to integrate by parts. The rule is \(\int u(x) v'(x) dx = u(x) v(x) - \int u'(x) v(x) dx\). If you remember from earlier that integrating \(xe^{-x^2}\) is relatively easy (OK, easy is a strong word, let’s say it is not impossible), you may see the reason why we chose \(u\) and \(v'\) like this:
\[\begin{align*} \sigma^2 = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \underbrace{x}_{u} \cdot \underbrace{x e^{- x^2}}_{v'} dx \text{.} \numberthis \label{eq:partialintegral} \end{align*}\]
Now we only have to figure out \(u'\) and \(v\) and put everything together. First, \(u'\) is easy.
\[\begin{align*} u{'} = \frac{du}{dx} = \frac{dx}{dx} = 1 \end{align*}\]
Second, \(v\) is the same as we have seen earlier. With \(v' = xe^{-x^2}\), we have \(v = \int xe^{-x^2} dx\). Again, we use that \(\frac{ d x^2 }{dx} = 2x \iff \frac{dx^2}{2} = x dx\). Therefore, \(v = \frac{1}{2} \int e^{-x^2} dx^2 = -\frac{1}{2}e^{-x^2}\).
Now we puzzle together the partial integral from Equation \(\ref{eq:partialintegral}\).
\[\begin{align*} \int\limits_{-\infty}^{\infty} \underbrace{x}_{u} \cdot \underbrace{x e^{- x^2}}_{v'} dx &= \left[ \underbrace{x}_{u} \cdot \underbrace{-\frac{1}{2} e^{- x^2}}_{v} \right]^\infty_{-\infty} - \int\limits_{-\infty}^{\infty} \underbrace{1}_{u'} \cdot \underbrace{-\frac{1}{2} e^{- x^2}}_{v} dx \\ &= \left[ 0 - 0\right] - \int_{-\infty}^{\infty} \underbrace{1}_{u'} \cdot \underbrace{-\frac{1}{2} e^{- x^2}}_{v} dx \\ &= \frac{1}{2} \int_{-\infty}^{\infty} e^{- x^2} dx \\ &= \frac{1}{2} \sqrt{\pi} \end{align*}\]
The first part of \(u(x) \cdot v(x)\) becomes zero because as \(x\) goes to \(\pm \infty\), \(x \cdot \frac{1}{-2} e^{- x^2}\) goes to zero. If you are skeptical—and I am happy if you are!—you can verify this with l’Hospital:
\[\begin{align*} \lim_{x\rightarrow \pm\infty} x \left(- \frac{1}{2} e^{- x^2} \right) = -\frac{1}{2} \lim_{x\rightarrow\pm\infty} \frac{x}{e^{x^2}} \overbrace{=}^{\text{l'Hospital}} -\frac{1}{2} \lim_{x\rightarrow\pm\infty}\frac{ 1 }{ 2x e^{x^2} } = 0 \end{align*}\]
With this, the variance is
\[\begin{align*} \sigma^2 = \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{\pi}} e^{- x^2} dx = \frac{1}{2} \cdot \frac{1}{\sqrt{\pi}} \cdot \sqrt{\pi} = \frac{1}{2} \end{align*}\]
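We can confirm the value \(\sigma^2 = \frac{1}{2}\) numerically (a sketch, again with a midpoint rule and an arbitrary cutoff at \(\pm 8\)):

```python
import math

def variance_of_f(lo=-8.0, hi=8.0, steps=160_000):
    # midpoint rule for the variance integral of f(x) = e^{-x^2} / sqrt(pi)
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += x * x * math.exp(-x * x)
    return total * dx / math.sqrt(math.pi)

v = variance_of_f()
print(v)  # close to 0.5
```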
So we have implicitly set the variance to be \(\frac{1}{2}\). Now we want to introduce a scaling parameter to choose this variance freely.
A random variable \(X\) that is distributed according to \(f(x)\) has expected value \(E[X] = 0\) and variance \(Var[X] = \frac{1}{2}\), as seen before. A linear transformation \(Y = \sqrt{2}\sigma X + \mu\) changes the expected value and variance to \(E[Y] = \sqrt{2}\sigma E[X] + \mu = \mu\) and \(Var[Y] = 2\sigma^2 Var[X] = \sigma^2\), respectively. We therefore want to derive the distribution \(f_{\mu, \sigma^2}(y)\) from \(f(x)\). For this, we go through the cumulative distribution function.
\[\begin{align*} F_{\mu, \sigma^2}(y) &= P(Y \leq y) \\ &= P(\sqrt{2}\sigma X + \mu \leq y) \\ &= P\left(X \leq \frac{y - \mu}{\sqrt{2}\sigma}\right) \\ &= F\left(\frac{y - \mu}{\sqrt{2}\sigma}\right) \end{align*}\]
From this, we can take the derivative and obtain \(f_{\mu, \sigma^2}(y)\):
\[\begin{align*} \frac{d F_{\mu, \sigma^2}(y)}{dy} = f_{\mu, \sigma^2}(y) = \frac{d F\left(\frac{y - \mu}{\sqrt{2}\sigma}\right)}{dy} = f\left(\frac{y - \mu}{\sqrt{2}\sigma}\right) \cdot \frac{1}{\sqrt{2}\sigma} \end{align*}\]
Thus, if we replace \(x\) in \(f(x)\) by \(\frac{x-\mu}{\sqrt{2}\sigma}\), we get the function we are after. \[\begin{align*} f_{\mu, \sigma^2}(x) = f\left(\frac{x - \mu}{\sqrt{2}\sigma}\right) \cdot \frac{1}{\sqrt{2}\sigma} = \frac{1}{\sqrt{2\pi}\sigma} e^{- \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2} \end{align*}\]
Taken together, we have derived the functional shape of a normal distribution from a few natural assumptions. We then computed the variance of one particular normal distribution. We then added location and variance parameters to get the typical form of the normal distribution.
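As a final sanity check (a sketch; \(\mu = 2\) and \(\sigma = 1.5\) are arbitrary choices), we can verify numerically that the derived density integrates to 1 and has the intended mean and variance:

```python
import math

def normal_pdf(x, mu, sigma):
    # f_{mu, sigma^2}(x) as derived above
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma = 2.0, 1.5
lo, hi, steps = mu - 10 * sigma, mu + 10 * sigma, 300_000
dx = (hi - lo) / steps
mass = mu_hat = var_hat = 0.0
for k in range(steps):
    x = lo + (k + 0.5) * dx
    p = normal_pdf(x, mu, sigma) * dx
    mass += p
    mu_hat += x * p
    var_hat += (x - mu) ** 2 * p
print(mass, mu_hat, var_hat)  # close to 1, mu, sigma^2
```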
If we compute \(F = \frac{SQ_E/df_E}{SQ_R/df_R}\) with numerator and denominator being \(\chi^2\)-distributed variables divided by their degrees of freedom, the resulting distribution is an \(F\) distribution.
It remains to show that the integral indeed evaluates to 1, that is, that the gamma distribution \(f_G(g) = \frac{b^a}{\Gamma(a)}g^{a-1}e^{-bg}\) is a proper probability density.
I heavily adapted the proof from https://statproofbook.github.io/P/f-pdf.html [retrieved 2022-11-20] to be simpler and to do without the Jacobian matrix.
If \(Z\sim\mathcal{N}(0, 1^2)\), then \(X = Z^2\) is a \(\chi^2\)-distributed random variable (“chi-squared”) with one degree of freedom, \(X \sim \chi^2(1)\). If a series of \(k\) independent squared standard normally distributed random variables is summed, the result follows a \(\chi^2\) distribution with \(k\) degrees of freedom: \(Z_1, Z_2, ..., Z_k \overset{iid}{\sim} \mathcal{N}(0,1)\), \(X = \sum_{i=1}^k Z_i^2\) \(\rightarrow X \sim \chi^2(k)\).
\(\chi^2\) distribution
\[ f(x) = \begin{cases} \frac{x ^ {\frac{k}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{k}{2}} \cdot \Gamma\left(\frac{k}{2}\right) }, & \text{ for }x>0 \\ 0, & \text{ otherwise } \\ \end{cases} \]
Expected value \(E(X) = k\)
Variance \(Var(X) = 2k\)
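The facts \(E(X) = k\) and \(Var(X) = 2k\) can be checked with a quick Monte Carlo simulation (a sketch in Python rather than R; \(k = 5\) and the trial count are arbitrary choices):

```python
import random

rng = random.Random(1)
k, trials = 5, 50_000
xs = []
for _ in range(trials):
    # chi-squared with k degrees of freedom: sum of k squared standard normals
    xs.append(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)))

sample_mean = sum(xs) / trials
sample_var = sum((x - sample_mean) ** 2 for x in xs) / trials
print(sample_mean, sample_var)  # roughly k and 2k
```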
Functions in R: dchisq(x, k), pchisq, qchisq, rchisq
The \(\chi^2\) distribution becomes particularly interesting when we later look at the distribution of the sample variance, because there the individual observations \(X_i\) of the participants are, among other things, squared: \(S^2 = \frac{1}{n-1}\sum_{i = 1}^n (X_i - \bar{X})^2\). Indeed, \(S^2\) follows a scaled \(\chi^2(n-1)\) distribution, but we will look at that more closely next semester. We know this formula with lowercase letters, \(s^2\), \(x_i\), and \(\bar{x}\), because in descriptive statistics we only took the perspective of concrete realizations. Later, however, we will also treat them as random variables in order to uncover the random mechanism behind them: when we compute a sample variance, we obtain only one concrete number (a realization), but which realizations occur with which probability? We will come back to that.
par(mar = c(5.1, 4.1, 4.1, 2.1))
par(mfrow = c(1, 2))
x = seq(0, 20, .01)
f = dchisq(x, 5)
F = pchisq(x, 5)
plot(x, f,
xlab = 'Realization x', ylab = 'Probability density f(x)',
type = 'l')
plot(x, F,
xlab = 'Realization x', ylab = bquote('Cumulative distribution F(x) = P('~X<=x*')'), pch = 16, ylim = c(0, 1), type = 'l')
We start by assuming that we know the expected value \(\mu\) and variance \(\sigma^2\). With this, we define the \(z\)-transformed individual variables \(Z_i = \frac{X_i - \mu}{\sigma}\). We additionally assume that \(X_i\) and therefore also \(Z_i\) is normally distributed (which is plausible in the case where \(X_i\) already is an average over multiple RTs).
Instead of considering \((X_i - \bar{X})^2\) in the usual variance formula, we will consider \(Z_i^2 = \left(\frac{X_i - \mu}{\sigma}\right)^2\). How is \(Z_i^2\) distributed? Since \(Z_i\) is standard normally distributed, we can translate the probability mass with the usual transformation of random variables. In general, if \(Y = g(X)\) and \(g\) is monotone, then \(f_Y(y) = f_X\left(g^{-1}(y)\right) \cdot \frac{1}{g'\left(g^{-1}(y)\right)}\).
Here, we have \(Y = Z_i^2\), \(X = Z_i\), and \(g(x) = x^2\) with \(g^{-1}(y) = \sqrt{y}\) and \(g'(x) = 2x\). Monotonicity only holds for non-negative values, but since the density of \(Z_i\) is symmetric around 0 and \(g\) maps \(\pm z\) to the same value, we can simply multiply by 2.
\[\begin{align*} f_{Z_i^2}(x) &= 2 f_{Z_i}( \sqrt{x} ) \cdot \frac{1}{ 2 \sqrt{x} } \\ &= 2 \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(\sqrt{x})^2 } \cdot \frac{1}{ 2 \sqrt{x} } \\ &= \frac{ x^{-\frac{1}{2}} e^{-\frac{x}{2} } }{ \sqrt{2\pi} } \numberthis \label{eq:chisquare_1} \\ \end{align*}\]
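As a quick numerical sanity check (base R only), the density we just derived should coincide with R's built-in \(\chi^2\) density with one degree of freedom:

```r
# Density of Z_i^2 as derived above
f_Zsq = function(x) x^(-1/2) * exp(-x/2) / sqrt(2 * pi)

# Compare against R's chi-square density with 1 degree of freedom
x = seq(0.1, 10, .1)
max(abs(f_Zsq(x) - dchisq(x, df = 1)))  # numerically zero
```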
Next, we consider the sum \(S_N = \sum_{i=1}^N Z_i^2\) and how it is distributed. Ordinarily, we would have to compute the convolution of the \(Z_i^2\)s and hope to find a pattern that generalizes when we add yet another one of them. Doing so yields the following distribution, which we will instead prove by induction.
\[\begin{align*} f_{S_N}(x) = \frac{ x^{\frac{N}{2} - 1} e^{-\frac{x}{2}} }{ 2^\frac{N}{2} \Gamma\left(\frac{N}{2}\right) } \end{align*}\]
To see that this is indeed the correct distribution for \(N=1\), compare it to Equation \(\ref{eq:chisquare_1}\): \(f_{S_1} = f_{Z_i^2}\) because \(\Gamma(1/2) = \sqrt{\pi}\) and \(2^{\frac{1}{2}}\sqrt{\pi} = \sqrt{2\pi}\). This is the induction base. Now let us assume the formula holds for \(N\) and show that it then also holds for \(N+1\) by adding another one of the \(Z_i^2\)s.
\[\begin{align*} f_{S_{N+1}}(x) &= \int_{0}^x f_{S_{N}}(s) \cdot f_{Z_i^2}(x-s) ds \\ &= \int_{0}^x \frac{ s^{\frac{N}{2} - 1} e^{-\frac{s}{2}} }{ 2^\frac{N}{2} \Gamma\left(\frac{N}{2}\right) } \cdot \frac{ (x-s)^{-\frac{1}{2}} e^{-\frac{(x-s)}{2} } }{ \sqrt{2}\Gamma\left(\frac{1}{2}\right) } ds \\ &= \frac{e^{-\frac{x}{2}}}{2^{\frac{N+1}{2}}} \cdot \int_{0}^x \frac{ s^{\frac{N}{2} - 1} }{ \Gamma\left(\frac{N}{2}\right) } \cdot \frac{ (x-s)^{-\frac{1}{2}} }{ \Gamma\left(\frac{1}{2}\right) } ds \\ \end{align*}\]
We will substitute \(s = xu\) to extract \(x\) out of the integral entailing \(ds = x du\). Then, we will add the desired factor \(\Gamma\left(\frac{N+1}{2}\right)\). The integral will resolve to 1 which we show afterwards.
\[\begin{align*} f_{S_{N+1}}(x) &= \frac{e^{-\frac{x}{2}}}{2^{\frac{N+1}{2}}} \cdot \int_{0}^1 \frac{ (xu)^{\frac{N}{2} - 1} }{ \Gamma\left(\frac{N}{2}\right) } \cdot \frac{ (x-xu)^{-\frac{1}{2}} }{ \Gamma\left(\frac{1}{2}\right) } x^1 du \\ &= \frac{x ^ {\frac{N+1}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{N+1}{2}}} \cdot \int_{0}^1 \frac{ u^{\frac{N}{2} - 1} }{ \Gamma\left(\frac{N}{2}\right) } \cdot \frac{ (1-u)^{-\frac{1}{2}} }{ \Gamma\left(\frac{1}{2}\right) } du \\ &= \frac{x ^ {\frac{N+1}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{N+1}{2}} \cdot \Gamma\left(\frac{N+1}{2}\right) } \cdot \underbrace{ \int_{0}^1 \frac{ \Gamma\left(\frac{N+1}{2}\right) }{ \Gamma\left(\frac{N}{2}\right) \cdot \Gamma\left(\frac{1}{2}\right) } u^{\frac{N}{2} - 1} (1-u)^{-\frac{1}{2}} du }_{= 1~\text{as we show next}} \label{eq:eulerbetaintegral} \numberthis \\ &= \frac{x ^ {\frac{N+1}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{N+1}{2}} \cdot \Gamma\left(\frac{N+1}{2}\right) } \\ \end{align*}\]
So as long as we show that the integral resolves to 1, we have proven the distribution of \(f_{S_N}\) for any \(N\).
The function inside the integral is the density of a so-called Beta distribution.
\[\begin{align*} f_{\text{Beta}(\alpha, \beta)}(x) = \frac{ \Gamma\left(\alpha + \beta\right) }{ \Gamma\left(\alpha\right) \cdot \Gamma\left(\beta\right) } x^{\alpha - 1} (1-x)^{\beta-1} \end{align*}\]
In particular, we have a Beta distribution with \(\alpha = N/2\) and \(\beta = 1/2\). The Beta distribution lives on the domain \(x\in[0,1]\), so it makes sense that this integral resolves to 1: a density should integrate to 1 over its domain.
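We can cross-check this Beta density against R's built-in dbeta; for illustration, take \(N = 5\), i.e., \(\alpha = 5/2\) and \(\beta = 1/2\) (the value of \(N\) is an arbitrary choice):

```r
# Beta density built from gamma functions, with alpha = N/2 and beta = 1/2 for N = 5
a = 5/2; b = 1/2
u = seq(0.05, 0.95, .05)
f = gamma(a + b) / (gamma(a) * gamma(b)) * u^(a - 1) * (1 - u)^(b - 1)
max(abs(f - dbeta(u, a, b)))  # numerically zero
```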
The way we show this is by considering how gamma function values interact. The gamma function \(\Gamma\) is defined as
\[\begin{align*} \Gamma(x) = \int_0^\infty t^{x-1} e^{-t} dt \text{~.} \end{align*}\]
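In R, we can verify this integral definition against the built-in gamma(), e.g., at the value \(1/2\) used throughout this section:

```r
# Gamma function via numerical integration of its defining integral
Gamma_int = function(a) integrate(function(t) t^(a - 1) * exp(-t), 0, Inf)$value

c(Gamma_int(1/2), gamma(1/2), sqrt(pi))  # all approximately 1.7725
```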
The product of two gamma function values can be resolved as follows. Instead of using \(t\) as the integration variable, we use \(x^2\) and \(y^2\). We then rewrite the integrals in terms of \(dx\) and \(dy\) using \(d(x^2) = 2xdx\) and \(d(y^2) = 2ydy\); the additional factors of \(x\) and \(y\) are absorbed into the powers of \(x\) and \(y\).
\[\begin{align*} \Gamma(a) \cdot \Gamma(b) &= \int_0^\infty (x^2)^{a-1} e^{-(x^2)} d(x^2) \cdot \int_0^\infty (y^2)^{b-1} e^{-(y^2)} d(y^2) \\ &= 2 \int_0^\infty x^{2a-1} e^{-(x^2)} dx \cdot 2 \cdot \int_0^\infty y^{2b-1} e^{-(y^2)} dy \\ &= 4 \int_0^\infty \int_0^\infty x^{2a-1} e^{-(x^2)} y^{2b-1} e^{-(y^2)} dxdy \\ \end{align*}\]
We switch from Cartesian to polar coordinates, \(dxdy = rdrd\theta\) with \(x = r\cos\theta\) and \(y = r\sin\theta\). Since \(x\) and \(y\) only range over the positive quadrant, \(\theta\) runs from \(0\) to \(\frac{\pi}{2}\). This allows us to extract an integral over \(r^2\) that is again a Gamma function.
\[\begin{align*} \Gamma(a) \cdot \Gamma(b) &= 4 \int_0^\infty \int_0^{\frac{\pi}{2}} (r\cos\theta)^{2a-1} e^{-(r\cos\theta)^2} (r\sin\theta)^{2b-1} e^{-(r\sin\theta)^2} rdrd\theta \\ &= 4 \int_0^\infty r^{2a-1} r^{2b-1} e^{-(r\cos\theta)^2} e^{-(r\sin\theta)^2} rdr \int_0^{\frac{\pi}{2}} (\cos\theta)^{2a-1} (\sin\theta)^{2b-1} d\theta \\ &= 2 \int_0^\infty r^{2a+2b-2} e^{-r^2(\cos^2\theta + \sin^2\theta)} 2rdr \int_0^{\frac{\pi}{2}} (\cos\theta)^{2a-1} (\sin\theta)^{2b-1} d\theta \\ &= 2 \int_0^\infty (r^2)^{a+b-1} e^{-r^2} d(r^2) \int_0^{\frac{\pi}{2}} (\cos\theta)^{2a-1} (\sin\theta)^{2b-1} d\theta \\ &= 2 \Gamma(a+b) \cdot \underbrace{\int_0^{\frac{\pi}{2}} (\cos\theta)^{2a-1} (\sin\theta)^{2b-1} d\theta}_{\text{~Evaluating this integral will be the last step.}} \\ \end{align*}\]
We will substitute \(t = (\sin\theta)^2\) such that \(dt = 2\sin\theta\cos\theta d\theta\) and \(1-t = (\cos\theta)^2\); as \(\theta\) runs from \(0\) to \(\frac{\pi}{2}\), \(t\) runs from \(0\) to \(1\).
\[\begin{align*} \Gamma(a) \cdot \Gamma(b) &= \Gamma(a+b) \cdot \int_0^{\frac{\pi}{2}} (\cos\theta)^{2a-2} (\sin\theta)^{2b-2} 2\sin\theta\cos\theta d\theta \\ &= \Gamma(a+b) \cdot \int_0^{1} t^{a-1} (1-t)^{b-1} dt \\ 1 &= \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \cdot \int_0^{1} t^{a-1} (1-t)^{b-1} dt \\ \end{align*}\]
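This identity (Euler's Beta integral) is easy to verify numerically, for example with the arbitrary choice \(a = 3/2\) and \(b = 1/2\):

```r
# Euler's Beta integral versus the corresponding ratio of gamma functions
a = 3/2; b = 1/2
lhs = integrate(function(t) t^(a - 1) * (1 - t)^(b - 1), 0, 1)$value
rhs = gamma(a) * gamma(b) / gamma(a + b)
c(lhs, rhs, beta(a, b))  # all approximately equal
```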
With \(\alpha = N/2\) and \(\beta = 1/2\), we can now see that the remaining integral in Equation \(\ref{eq:eulerbetaintegral}\) is indeed 1.
This concludes showing that \(S_N\), the sum of \(N\) squared standard normals \(Z_i^2\), follows the so-called \(\chi^2\) distribution \(f_{S_N}\) with \(N\) degrees of freedom.
Now we come to the more realistic case in which we do not know the expected value \(\mu\). Up to now we have shown that
\[\begin{align*} S_N = \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 \end{align*}\]
is \(\chi^2\)-distributed with \(N\) degrees of freedom. Now we replace the expected value by the mean.
\[\begin{align*} S{'}_N &= \sum_{i=1}^N \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu + \mu - \bar{X}}{\sigma} \right)^2 \\ &= \sum_{i=1}^N \left( \left( \frac{X_i - \mu}{\sigma} \right)^2 + \frac{ 2(X_i - \mu)(\mu - \bar{X}) + (\mu - \bar{X})^2 }{\sigma^2} \right) \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 + \frac{ 2\sum_{i=1}^N (X_i\mu - X_i\bar{X} - \mu^2 + \mu\bar{X}) + \sum_{i=1}^N \left( \mu^2 - 2\mu\bar{X} + \bar{X}^2 \right) }{\sigma^2} \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 + \frac{ 2N\bar{X}\mu - 2N\bar{X}^2 - 2N\mu^2 + 2N\mu\bar{X} + N\mu^2 - 2N\mu\bar{X} + N\bar{X}^2 }{\sigma^2} \\ \end{align*}\]
\[\begin{align*} S{'}_N &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 + \frac{ - N\bar{X}^2 - N\mu^2 + 2N\mu\bar{X} }{\sigma^2} \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 - \frac{ N(\bar{X}^2 + \mu^2 - 2\mu\bar{X}) }{\sigma^2} \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 - \frac{ (\bar{X} - \mu)^2 }{\sigma^2/N} \\ &= \sum_{i=1}^N \left( \frac{X_i - \mu}{\sigma} \right)^2 - \left( \frac{ \bar{X} - \mu }{\sigma/\sqrt{N}} \right)^2 \\ \end{align*}\]
Thus, when using the mean instead of the expected value, we get a random variable \(S{'}_N\) that consists of a part distributed as \(\chi^2\) with \(N\) degrees of freedom, minus an independent (see next section) random variable that follows a \(\chi^2\) distribution with one degree of freedom. Taken together, \(S{'}_N\) follows a \(\chi^2\) distribution with \(N-1\) degrees of freedom.
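A short simulation makes this plausible (a sketch; the sample size and distribution parameters below are arbitrary choices):

```r
# Simulate S'_N = sum(((X_i - Xbar)/sigma)^2) and compare its first two
# moments to those of a chi-square distribution with N-1 degrees of freedom
set.seed(2)
N = 10; mu = 100; sigma = 15
Sprime = replicate(20000, {
  x = rnorm(N, mean = mu, sd = sigma)
  sum((x - mean(x))^2) / sigma^2
})
# chi^2(N-1) has mean N-1 and variance 2(N-1)
c(mean(Sprime), N - 1, var(Sprime), 2 * (N - 1))
```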
To show that the mean of a sample of normally distributed variables and its variance are independent, we first need to show that two jointly normally distributed random variables are independent if and only if their correlation is 0.
Assume \(X_1\) and \(X_2\) are normally distributed. Then, one can always find standard normally distributed and independent \(Z_1\) and \(Z_2\) such that
\[\begin{align*} X_1 = \sigma_1 Z_1 + \mu_1 \text{,} \quad X_2 = \sigma_2 \left( \rho Z_1 + \sqrt{1-\rho^2} Z_2 \right) + \mu_2 \text{.} \end{align*}\]
Here, \(\rho\) is the correlation of \(X_1\) and \(X_2\). Since \(Z_1\) and \(Z_2\) are independent, we can express the joint density of \(X_1\) and \(X_2\) by a transformation of the joint density of \(Z_1\) and \(Z_2\).
\[\begin{align*} F(x_1, x_2) &= P(X_1 \leq x_1, X_2 \leq x_2) \\ &= P(\sigma_1 Z_1 + \mu_1 \leq x_1, \sigma_2 \left( \rho Z_1 + \sqrt{1-\rho^2} Z_2 \right) + \mu_2 \leq x_2) \\ &= P\left(Z_1 \leq \left( \frac{x_1 - \mu_1 }{\sigma_1} \right), Z_2 \leq \frac{1}{\sqrt{1-\rho^2}} \left( \frac{x_2 - \mu_2}{\sigma_2} \right) - \frac{\rho}{\sqrt{1-\rho^2}} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) \right) \\ &= G\left( \left( \frac{x_1 - \mu_1 }{\sigma_1} \right), \frac{1}{\sqrt{1-\rho^2}} \left( \frac{x_2 - \mu_2}{\sigma_2} \right) - \frac{\rho}{\sqrt{1-\rho^2}} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) \right) \end{align*}\]
Here, \(G(z_1, z_2)\) is the joint cumulative distribution function of two independent standard normal variables. Taking derivatives, we get
\[\begin{align*} f(x_1, x_2) &= \frac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2) \\ &= \frac{1}{\sigma_1} \frac{1}{\sigma_2\sqrt{1-\rho^2}} \frac{1}{2\pi} e^{ -\frac{1}{2} \left( \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) ^2 + \frac{1}{1-\rho^2} \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 - \frac{2\rho}{1-\rho^2} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) \left( \frac{x_2 - \mu_2 }{\sigma_2} \right) + \frac{\rho^2}{1-\rho^2}\left( \frac{x_1 - \mu_1 }{\sigma_1} \right)^2 \right)} \\ &= \frac{1}{\sigma_1} \frac{1}{\sigma_2\sqrt{1-\rho^2}} \frac{1}{2\pi} e^{ -\frac{1}{2} \left( \frac{1}{1-\rho^2} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) ^2 + \frac{1}{1-\rho^2} \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 - \frac{2\rho}{1-\rho^2} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) \left( \frac{x_2 - \mu_2 }{\sigma_2} \right) \right)} \\ \end{align*}\]
If we now set \(\rho = 0\), we find that the two variables \(X_1\) and \(X_2\) become independent.
\[\begin{align*} f(x_1, x_2) &= \frac{1}{\sigma_1} \frac{1}{\sigma_2} \frac{1}{2\pi} e^{ -\frac{1}{2} \left( \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) ^2 + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right)} \\ &= \frac{1}{\sqrt{2\pi}\sigma_1} e^{ -\frac{1}{2} \left( \frac{x_1 - \mu_1 }{\sigma_1} \right) ^2 } \cdot \frac{1}{\sqrt{2\pi}\sigma_2} e^{ -\frac{1}{2} \left( \frac{x_2 - \mu_2 }{\sigma_2} \right) ^2 } \\ &= f_1(x_1) \cdot f_2(x_2). \end{align*}\]
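To double-check the algebra, here is the derived bivariate density with \(\rho = 0\), compared against the product of the two marginal normal densities (the evaluation point and parameters are arbitrary choices):

```r
# Bivariate normal density as derived above
f_biv = function(x1, x2, mu1, mu2, s1, s2, rho) {
  z1 = (x1 - mu1) / s1
  z2 = (x2 - mu2) / s2
  1 / (2 * pi * s1 * s2 * sqrt(1 - rho^2)) *
    exp(-(z1^2 + z2^2 - 2 * rho * z1 * z2) / (2 * (1 - rho^2)))
}

# With rho = 0 it factorizes into the product of the marginals
lhs = f_biv(1.3, -0.4, mu1 = 0, mu2 = 1, s1 = 1, s2 = 2, rho = 0)
rhs = dnorm(1.3, mean = 0, sd = 1) * dnorm(-0.4, mean = 1, sd = 2)
c(lhs, rhs)  # equal
```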
Define \(A = (X_1 - \bar{X}, X_2 - \bar{X}, ..., X_N - \bar{X})\). Each entry of \(A\) and \(\bar{X}\) is normally distributed because all \(X_i\) are normally distributed. The covariance between each entry of \(A\) and \(\bar{X}\) is zero:
\[\begin{align*} Cov(X_i - \bar{X}, \bar{X}) = Cov(X_i, \bar{X}) - Cov(\bar{X}, \bar{X}) = Cov\left(X_i, \frac{1}{N}\sum_{j=1}^N X_j\right) - Var(\bar{X}) = \frac{1}{N}\sigma^2 - \frac{\sigma^2}{N} = 0 \\ \end{align*}\]
Here, the third step uses the independence of the \(X_i\): only the \(j = i\) term contributes to the covariance.
Therefore, \(A\) is independent of \(\bar{X}\). But note that \(s^2\) is a function of \(A\). Therefore, \(s^2\) is also independent of \(\bar{X}\).
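A quick simulation illustrates the zero covariance (a sketch; sample size and parameters are arbitrary choices):

```r
# Empirically check that Cov(X_i - Xbar, Xbar) is (close to) zero
set.seed(3)
sims = replicate(20000, {
  x = rnorm(8, mean = 5, sd = 2)
  c(x[1] - mean(x), mean(x))  # first deviation from the mean, and the mean
})
cov(sims[1, ], sims[2, ])  # close to 0
```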
With that, the estimated variance \(s^2 = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})^2\) follows a \(\chi^2\) distribution with \(N-1\) degrees of freedom, scaled by a factor involving the true variance and the sample size: \(s^2 \sim \frac{\sigma^2}{N-1} \chi^2_{N-1}\).
\[\begin{align*} f_{s^2}(x) = \frac{N-1}{\sigma^2} f_{S_{N-1}}\left( \frac{N-1}{\sigma^2} x \right) \end{align*}\]
Thus, if we have a sample of \(N=36\) participants and a true variance \(\sigma^2 = 50^2\), the empirical variance \(s^2\) follows a scaled \(\chi^2\) distribution with \(35\) degrees of freedom, \(f_{s^2}(x) = \frac{35}{2500} f_{S_{35}}\left( \frac{35}{2500} x \right)\).
set.seed(1)
N = 30; sigma = 50
# Simulate 10,000 sample variances of N normal observations each
s2 = replicate(10000, var(rnorm(N, sd = sigma)))
plot(density(s2), main = bquote('Distribution of '*s^2), lwd = 3)
# Overlay the analytical, scaled chi-square density with N-1 degrees of freedom
support = seq(min(s2), max(s2), length.out = 100)
points(support, (N-1)/sigma^2 * dchisq((N-1)/sigma^2 * support, df = N-1),
type = 'l', col = 'orange', lwd = 3, lty = 2)
legend('topright', lty = 1:2, lwd = 3,
col = c('black', 'orange'),
legend = c('Simulated', 'Analytical'))