Variance and standard deviation

From WikiEducator

Variance and standard deviation are related concepts.

Variance describes, mathematically, how close the observations in a data set (data points) are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of each data point from the mean of the data.

Standard deviation is the square root of the variance.

Understanding the calculations

From the definitions, we can see that both statistics are built on an average: the variance is the average of the squared differences of each data point from the mean, and the standard deviation is its square root. Although the variance is a useful statistic in many ways, the number itself is not easily interpretable with respect to the actual data points. Let's continue by way of an example.



Example: Number of migrating falcons
The following are the number of peregrine falcons seen passing a bird observation point each day for 8 days during peak migration season:
7, 15, 5, 9, 3, 11, 13, 9

To find the variance and standard deviation of the number of falcons observed each day:

Step 1: Calculate the mean of the sample, [math]\bar{x}[/math].

[math]\bar{x} = \frac{(7 + 15 + 5 + 9 + 3 + 11 + 13 + 9)}{8} = \frac{72}{8} = 9[/math]

Step 2: Calculate the difference (deviation) of each data point from the mean; see middle column in table below.

Step 3: Square each of the differences (deviations); see rightmost column in table below.

Number of falcons Deviation from mean Squared deviation
[math]7[/math] [math]7 - 9 = -2[/math] [math](-2)^2 = 4[/math]
[math]15[/math] [math]15 - 9 = 6[/math] [math](6)^2 = 36[/math]
[math]5[/math] [math]5 - 9 = -4[/math] [math](-4)^2 = 16[/math]
[math]9[/math] [math]9 - 9 = 0[/math] [math](0)^2 = 0[/math]
[math]3[/math] [math]3 - 9 = -6[/math] [math](-6)^2 = 36[/math]
[math]11[/math] [math]11 - 9 = 2[/math] [math](2)^2 = 4[/math]
[math]13[/math] [math]13 - 9 = 4[/math] [math](4)^2 = 16[/math]
[math]9[/math] [math]9 - 9 = 0[/math] [math](0)^2 = 0[/math]

Step 4: Sum the squared deviations and divide by n-1 (1 less than the sample size)[1] to create the variance (average squared deviation).

[math]variance = \frac{4+36+16+0+36+4+16+0}{7} = \frac{112}{7} = 16[/math]

Step 5: The standard deviation of the data is the square root of the variance:

[math]standard\ deviation = \sqrt{16} = 4[/math]

Interpretation: In step 1 we calculated that the average number of peregrine falcons passing the bird observation point each day was 9. Knowing the standard deviation, we can say that the daily counts typically differed from this mean of 9 by about 4 falcons.
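The five steps above can be traced in a few lines of Python. This is only a sketch of the arithmetic in the example; the variable names are chosen here for illustration.

```python
import math

# Daily falcon counts from the example above.
counts = [7, 15, 5, 9, 3, 11, 13, 9]

n = len(counts)
mean = sum(counts) / n                      # Step 1: mean of the sample
deviations = [x - mean for x in counts]     # Step 2: deviation of each point
squared = [d ** 2 for d in deviations]      # Step 3: squared deviations
variance = sum(squared) / (n - 1)           # Step 4: divide the sum by n - 1
std_dev = math.sqrt(variance)               # Step 5: square root of the variance

print(mean, variance, std_dev)  # 9.0 16.0 4.0
```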





Why bother to take the square root of the variance?

The variance does not lend itself to easy interpretation as it is the average of the squared deviations. To make this point more dramatically, the original data points in the example above are the number of peregrine falcons that pass a bird observatory each day for 8 days. The units for the variance are falcons[math]^2[/math] (falcons squared), which is uninterpretable.

The standard deviation is the square root of the variance, thus creating a statistic that has the same measurement units as the data points. In the example above, the mean is 9 falcons and the standard deviation is 4 falcons.

Why not use the sum of the deviations directly?

You may be wondering why the standard deviation is not based on the sum of the deviations. Why do we use such a complicated method, first calculating the variance and then the standard deviation? The reason is simple: the sum of the differences between each data point and the mean is 0, because the positive and negative deviations cancel. This is always the case. If you are not convinced, try summing the "Deviation from mean" column in the table above.
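A quick check with the falcon data illustrates the point. The sketch below sums the raw deviations and, for comparison, the squared deviations; the variable names are illustrative only.

```python
# Daily falcon counts from the example above.
counts = [7, 15, 5, 9, 3, 11, 13, 9]
mean = sum(counts) / len(counts)

deviations = [x - mean for x in counts]

# Positive and negative deviations cancel exactly.
total_deviation = sum(deviations)

# Squaring removes the signs, so the total no longer cancels.
total_squared = sum(d ** 2 for d in deviations)

print(total_deviation)  # 0.0
print(total_squared)    # 112.0
```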

The formula for variance and standard deviation

The formula for the variance, [math]s^2[/math], of a set of data points is:

[math]s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}[/math]

or more compactly,

[math]s^2 = \frac{\sum{(x_i - \bar{x})^2}}{n-1}[/math]

where [math]x_i\,[/math] represents each of the [math]n[/math] data points (with [math]i[/math] running from 1 to [math]n[/math]), [math]\bar{x}[/math] is the mean of the set of data points and [math]n\,[/math] is the number of data points in the data set.

The standard deviation, [math]s[/math], is the square root of the variance, [math]s^2[/math]:

[math]s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n-1}}[/math]
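The formulas translate directly into code. The sketch below defines two helper functions (the names `sample_variance` and `sample_std_dev` are made up for this illustration) and compares them with `variance` and `stdev` from Python's standard `statistics` module, which also divide by n-1.

```python
import math
import statistics


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_std_dev(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))


falcons = [7, 15, 5, 9, 3, 11, 13, 9]

# Hand-written formulas next to the standard-library equivalents.
print(sample_variance(falcons), statistics.variance(falcons))
print(sample_std_dev(falcons), statistics.stdev(falcons))
```

Both pairs should agree, giving a variance of 16 and a standard deviation of 4 for the falcon data.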

Using a frequency table to calculate variance and standard deviation

When a frequency table is used, the variance can also be given by the formula:

[math]s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f}[/math]

Where ƒ is the frequency of each individual value and ∑ƒ is the sum of the individual frequencies, which is the same as n. For grouped or continuous data the same formula is used, but the value of x is taken to be the mid-point of each group. For example, if the groups are 10-14, 15-19, 20-24 and so on, then the mid-points will be 12, 17 and 22 respectively. These become the class representatives and thus assume the value of x in the calculation.

Example

The frequency distribution table below shows the number of hours that a group of 220 students spent in private study within a given month.

Number of hours spent in private study during the month

Hours Number of students
10–14 2
15–19 12
20–24 23
25–29 60
30–34 77
35–39 38
40–44 8
(a) Find:
(i) The mean of the distribution
(ii) The standard deviation of the distribution
(b) Assuming that the distribution is normal, calculate the percentage of students who studied for 35.85 hours or more.

Solutions


To work out these values, it helps to set out the necessary information in a table.

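Hours Midpoint ([math]x[/math]) Frequency ([math]f[/math]) [math]xf[/math] [math]x - \bar{x}[/math] [math](x - \bar{x})^2[/math] [math]f(x - \bar{x})^2[/math]
10–14 12 2 24 -17.82 317.6 635.2
15–19 17 12 204 -12.82 164.4 1,972.8
20–24 22 23 506 -7.82 61.2 1,407.6
25–29 27 60 1,620 -2.82 8.0 480.0
30–34 32 77 2,464 2.18 4.8 369.6
35–39 37 38 1,406 7.18 51.6 1,960.8
40–44 42 8 336 12.18 148.4 1,187.2

Totals: [math]\sum f = 220[/math], [math]\sum xf = 6,560[/math], [math]\sum f(x - \bar{x})^2 = 8,013.2[/math]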

(a)(i) Mean

[math]\bar{x} = \frac{\sum xf}{\sum f} = \frac{6560}{220} = 29.82[/math]


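(ii) Standard deviation

[math]S = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} = \sqrt{\frac{8013.2}{220}} = \sqrt{36.42} = 6.03[/math]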
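The grouped calculation can be reproduced in Python as a check. This is a minimal sketch that assumes, as the worked solution does, that each class is represented by its mid-point and that the sum of squared deviations is divided by ∑ƒ; the variable names are illustrative. Because the code does not round intermediate values, its sum of squared deviations (about 8,002.7) differs slightly from the table's 8,013.2, but the standard deviation still rounds to 6.03.

```python
import math

# Class limits (lower, upper) and frequencies from the table above.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [2, 12, 23, 60, 77, 38, 8]

# Each class is represented by its mid-point.
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(freqs)                                            # sum of f
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n   # sum of xf over sum of f

# Frequency-weighted sum of squared deviations from the mean.
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
std_dev = math.sqrt(sum_sq / n)

print(n, round(mean, 2), round(std_dev, 2))  # 220 29.82 6.03
```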

(b) Percentage who studied for 35.85 hours or more:

On examination, we see that 35.85 is the same as the mean (29.82) plus one standard deviation (6.03). Thus the percentage we want is the area under the normal curve beyond +1σ from the mean. From the normal curve we get:

13.6% + 2.15% + 0.15% = 15.9%

Thus we can say that 15.9% of the students in that group studied for 35.85 hours or more during the given month.

Note that in this case, normality of the distribution is only an assumption that enables us to do the statistical analysis. Such assumptions are often made, since in real life it is rare to come across perfectly normal distributions.

With today's computing technology, the researcher does not have to go through the rigours of calculating such statistical measures by hand. Ordinary calculators and computers can easily give the values when fed with the correct details.
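As an illustration of letting the computer do the work, the tail percentage in part (b) can be obtained with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). The sketch assumes, as the solution does, a normal distribution with mean 29.82 and standard deviation 6.03; the variable names are illustrative.

```python
from statistics import NormalDist

# Normal model assumed in part (b): mean 29.82 hours, standard deviation 6.03 hours.
study_hours = NormalDist(mu=29.82, sigma=6.03)

# Share of the distribution at or above 35.85 hours (one standard deviation above the mean).
share_above = 1 - study_hours.cdf(35.85)

print(f"{share_above:.1%}")  # 15.9%
```

The result agrees with the 15.9% read from the normal curve above.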

Notes

  1. When a sample is used to estimate the variance of a population, n-1 is used in the denominator rather than n. For additional explanation see the section on Estimating the variance in Wikipedia.

Acknowledgements

Ideas and language were inspired by: