Variance, Covariance, Correlation and Covariance Matrix
Variance is a special case of covariance in which the two random variables are the same. Correlation is standardized covariance: it is unit-agnostic, so you can compare any two pairs of random variables without worrying about their units.
Covariance and Variance
Let us look at covariance first. Covariance is defined as below:
$Cov(X,Y)=E[(X - E(X))(Y - E(Y))]$
Using linearity of expectation, you can easily get
$Cov(X,Y)=E(XY) - E(X)E(Y)$
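To see why, expand the product inside the expectation and apply linearity:
$$Cov(X,Y) = E[XY - XE(Y) - YE(X) + E(X)E(Y)] = E(XY) - E(X)E(Y) - E(Y)E(X) + E(X)E(Y) = E(XY) - E(X)E(Y)$$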
In statistics, I always like to use a non-trivial example to demonstrate a concept, so here I will demonstrate one discrete case and one continuous case. 1) Let's say we have two discrete random variables X and Y with joint probability mass function $f(x,y)$,
and that E(X) = 2 and E(Y) = 3. We can now easily calculate Cov(X,Y) as below:
$Cov(X,Y)= \sum f(x,y)(x-E(X))(y-E(Y))$
$= 0.15*(1-2)*(2-3) + 0*(1-2)*(3-3)+0.15*(1-2)*(4-3)$
$+0.1*(2-2)*(2-3)+0.15*(2-2)*(3-3)+0*(2-2)*(4-3)$
$+0*(3-2)*(2-3)+0.15*(3-2)*(3-3)+0.3*(3-2)*(4-3) = 0.3$
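If you want to check this by machine, here is a minimal Python sketch that evaluates the sum above. The joint probabilities are read off from the terms of the calculation, and the means E(X) = 2 and E(Y) = 3 are taken as given in the example.

```python
# joint pmf f(x, y), read off from the terms of the calculation above
pmf = {
    (1, 2): 0.15, (1, 3): 0.00, (1, 4): 0.15,
    (2, 2): 0.10, (2, 3): 0.15, (2, 4): 0.00,
    (3, 2): 0.00, (3, 3): 0.15, (3, 4): 0.30,
}

mean_x, mean_y = 2, 3  # E(X) and E(Y) as given in the example

# Cov(X, Y) = sum over all (x, y) of f(x, y) * (x - E(X)) * (y - E(Y))
cov = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in pmf.items())
print(cov)  # 0.3 (up to floating-point rounding)
```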
2) The continuous case is more complex because a double integral is involved. Let's say we have two random variables X and Y with the joint density below, where the factor $\frac{1}{4}$ normalizes the density so that it integrates to 1:
$f(x,y)=\frac{1}{4}(1-x)(1-y) , \quad -1\leq x,y \leq 1$
To calculate the covariance, we compute $E(XY)$, $E(X)$ and $E(Y)$ separately and then use the second equation, $Cov(X,Y)=E(XY) - E(X)E(Y)$.
$E(X) = \int_{-1}^1 \frac{1}{2}x(1-x)\,dx \int_{-1}^1 \frac{1}{2}(1-y)\,dy = -\frac{1}{3}$
$E(Y) = \int_{-1}^1 \frac{1}{2}y(1-y)\,dy \int_{-1}^1 \frac{1}{2}(1-x)\,dx = -\frac{1}{3}$
$E(XY) = \int_{-1}^1 \frac{1}{2}x(1-x)\,dx \int_{-1}^1 \frac{1}{2}y(1-y)\,dy = \frac{1}{9}$
So the covariance is $\frac{1}{9} - (-\frac{1}{3})(-\frac{1}{3}) = 0$. This is expected: $f(x,y)$ factors into the product of the marginal densities $\frac{1}{2}(1-x)$ and $\frac{1}{2}(1-y)$, so X and Y are independent. When the covariance is 0, two variables are said to be uncorrelated. The covariance of two independent variables is always 0, but the converse is not always true: uncorrelated variables are not necessarily independent. For example, let X be uniformly distributed in [-1,1] and let Y = X*X. Then $E(XY) = E(X^3) = 0$ and $E(X) = 0$, so Cov(X,Y) = 0, even though Y is completely determined by X.
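As a sanity check, the expectations above can be verified symbolically. This is a minimal sketch using sympy (my choice of tool; any CAS or numerical integrator would do), and it also checks the uniform X with Y = X*X example.

```python
import sympy as sp

x, y = sp.symbols('x y')

# normalized joint density on [-1, 1] x [-1, 1]
f = sp.Rational(1, 4) * (1 - x) * (1 - y)

total = sp.integrate(f, (x, -1, 1), (y, -1, 1))        # should be 1
EX = sp.integrate(x * f, (x, -1, 1), (y, -1, 1))       # -1/3
EY = sp.integrate(y * f, (x, -1, 1), (y, -1, 1))       # -1/3
EXY = sp.integrate(x * y * f, (x, -1, 1), (y, -1, 1))  # 1/9
print(total, EX, EY, EXY, EXY - EX * EY)               # 1, -1/3, -1/3, 1/9, 0

# uncorrelated but dependent: X uniform on [-1, 1], Y = X^2
g = sp.Rational(1, 2)                          # density of X on [-1, 1]
EX2 = sp.integrate(x * g, (x, -1, 1))          # E(X)   = 0
EY2 = sp.integrate(x**2 * g, (x, -1, 1))       # E(X^2) = 1/3
EXY2 = sp.integrate(x * x**2 * g, (x, -1, 1))  # E(X^3) = 0
print(EXY2 - EX2 * EY2)                        # 0, yet Y is a function of X
```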
Point Estimator for Variance
You often see the equation below:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i- \bar{X})^2$$
This should not be confused with the variance of a discrete distribution whose $n$ outcomes are all equally likely. The equation above is the sample variance, a point estimator used to estimate the true variance, and it is a bit confusing at first because it looks exactly like the equal-probability variance formula.

For instance, suppose we measure the heights of 100 randomly selected males, sampling with replacement to make each selection i.i.d. (because the population is much larger than the sample, sampling with replacement hardly matters: it is very unlikely we pick the same person twice). Each sample can be used as a point estimator. For example, say the first male is 185 cm tall. You could use this single observation to boldly claim that the population mean is 185 cm and that the variance is some $\sigma^2\,cm^2$. For the mean, that is actually an unbiased estimate, but a bad one, because its mean squared error is large. If we use all 100 samples instead, we get a much better estimator.

The $\frac{1}{n}$ estimator above is slightly biased; the unbiased version is $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i- \bar{X})^2, n > 1$, where $\bar{X}$ is the sample mean. When $n$ is large enough, the two are almost the same. The takeaway point is that the $\frac{1}{n}$ here does NOT come from an equal-probability assumption; it simply averages the squared deviations within the sample. I will write a separate blog about point estimators, so I will stop expanding on this here.
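To see the size of the bias, here is a small simulation sketch; the normal population, its parameters, and the sample size are assumptions chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 49.0   # assume heights ~ Normal(175 cm, sd 7 cm), so the true variance is 49 cm^2
n = 10            # a small sample size makes the bias visible
trials = 200_000  # repeat the experiment many times and average the estimates

samples = rng.normal(175.0, np.sqrt(true_var), size=(trials, n))

biased = samples.var(axis=1, ddof=0)    # 1/n version
unbiased = samples.var(axis=1, ddof=1)  # 1/(n-1) version (Bessel's correction)

print(biased.mean())    # about 44.1, i.e. true_var * (n - 1) / n
print(unbiased.mean())  # about 49.0, the true variance
```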
Covariance and Correlation
Correlation is standardized covariance which is unit-less. By definition, correlation is calculated as below:
$$Corr(X,Y) = \frac {Cov(X,Y)} {\sigma_X \sigma_Y}$$
If we plug in $Cov(X,Y)$, we get
$$Corr(X,Y) = \frac {E[(X - E(X))(Y - E(Y))]} {\sigma_X \sigma_Y}$$
We can use linearity of expectation again to move the denominator inside the expectation, as below:
$$Corr(X,Y) = E[(\frac {X-E(X)} {\sigma_X})(\frac {Y-E(Y)} {\sigma_Y})]$$
The intuition behind this is that covariance is not unit-less, so we need a unit-less quantity, which is correlation. For instance, a pair of variables measured in kilometers will often have a much larger covariance than a pair measured in nanometers, but you cannot conclude from that alone that the first pair has the stronger relationship. Dividing by the standard deviations standardizes the covariance, so correlations can be compared even when the measurements are not on the same scale.
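A quick numerical illustration of this point (the data below are simulated, with made-up scales, just to show the effect of units):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical paired measurements expressed in kilometers
x_km = rng.normal(50.0, 5.0, 1_000)
y_km = 0.8 * x_km + rng.normal(0.0, 2.0, 1_000)

# exactly the same quantities expressed in nanometers (1 km = 1e12 nm)
x_nm, y_nm = x_km * 1e12, y_km * 1e12

print(np.cov(x_km, y_km)[0, 1], np.cov(x_nm, y_nm)[0, 1])            # covariance explodes with the unit change
print(np.corrcoef(x_km, y_km)[0, 1], np.corrcoef(x_nm, y_nm)[0, 1])  # correlation stays the same
```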
Covariance Matrix
The covariance matrix holds the covariances between every pair of random variables in a vector of random variables. I personally find it much more straightforward to demonstrate this with an example, so let's use another non-trivial one here. Say we have three random variables $\theta_1,\theta_2,\theta_3$. This can be expressed as a vector of random variables as
$\theta = \begin {bmatrix} \theta_1,\theta_2,\cdots,\theta_n \end{bmatrix} , n = 3$. You can easily extend this to any $n$ larger than 3 in your data set. Let's say the 3 random variables each have 4 measurements, as below:
$$\begin {pmatrix}
\theta_1 & \theta_2 & \theta_3 \\
3 & 4 & 4 \\
3 & 2 & 2 \\
4 & 3 & 3\\
2 & 3 & 3 \\
\end{pmatrix}$$
The covariance matrix has entries $K_{i,j} = Cov(\theta_i,\theta_j) = E[(\theta_i - E(\theta_i))(\theta_j - E(\theta_j))]$, or in matrix form $K = E[(\theta - \mu_{\theta})^T(\theta - \mu_{\theta})]$. Back to the example: we have 3 r.v.s and 4 measurements of each, so we can easily calculate the mean of each r.v.: $E[\theta_1] = E[\theta_2] = E[\theta_3] = 3$. The next step is to subtract its mean from each measurement, which gives the zero-centered matrix below:
$$\begin {pmatrix}
\theta_1 & \theta_2 & \theta_3 \\
0 & 1 & 1 \\
0 & -1 & -1 \\
1 & 0 & 0\\
-1 & 0 & 0 \\
\end{pmatrix}$$
In the next step, we multiply the transpose of this centered matrix by the centered matrix itself, and then divide by the number of measurements, which is 4. Note that in this example we assume each measurement is equally likely and thus has probability 1/4, which is why we divide by 4 in the last step. So we get the final variance-covariance matrix:
$$\begin {pmatrix}
\theta & \theta_1 & \theta_2 & \theta_3 \\
\theta_1 & 0.5& 0 & 0 \\
\theta_2 & 0& 0.5 & 0.5 \\
\theta_3 & 0& 0.5 & 0.5 \\
\end{pmatrix}$$
Notice: 1) the final variance-covariance matrix is 3 by 3, and its size depends only on the number of random variables; 2) the matrix is symmetric about its diagonal, because $Cov(\theta_i, \theta_j) = Cov (\theta_j,\theta_i)$; 3) the diagonal holds the variances of the individual random variables.
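Here is a minimal numpy sketch of the same calculation; it reproduces the matrix above and cross-checks it against numpy's built-in estimator (bias=True makes numpy divide by n instead of n - 1):

```python
import numpy as np

# rows are the 4 measurements, columns are theta_1, theta_2, theta_3
data = np.array([
    [3, 4, 4],
    [3, 2, 2],
    [4, 3, 3],
    [2, 3, 3],
], dtype=float)

centered = data - data.mean(axis=0)             # subtract each column's mean
cov_matrix = centered.T @ centered / len(data)  # divide by n = 4 measurements

print(cov_matrix)
# [[0.5 0.  0. ]
#  [0.  0.5 0.5]
#  [0.  0.5 0.5]]

print(np.cov(data, rowvar=False, bias=True))    # same result
```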