Statistical estimation means inferring information about reality from samples drawn from it, usually because a sample is easier to examine than the whole. The sample is treated as a picture of the reality, or population: "all voters in society", "all salmon in the Norwegian fjords", "all products in the total production", "the total population of the country", and so on.
The question is how sure we can be, and how accurately we can know, that the picture drawn from the characteristics of the sample matches the same characteristics in the population.
Some commonly used measures of position, central tendency and dispersion quantify these characteristics in a meaningful way, for example the arithmetic mean (or average), the median and the standard deviation, which are defined a few lines below. These and other measures can be applied both to the population and to the sample.
Suppose you look at a sample consisting of the following observations:
2, 4, 7, 13, -7, 23, 12
Number of elements in the sample (N): 7
Arithmetic mean (or average) (Xg = Σ Xj/N): (2+4+7+13-7+23+12)/7 = 54/7 = 7.71
Median (M: the value at or below which 50% of the observations are found) = 7
(The midrange, ½(Xmax + Xmin) = ½(23 + (-7)) = 8, is sometimes used as a rough substitute for M.)
Standard deviation [s = √(Σ(Xj - Xg)^2/N)] = √[((2-7.71)^2 + (4-7.71)^2 + (7-7.71)^2 + (13-7.71)^2 + (-7-7.71)^2 + (23-7.71)^2 + (12-7.71)^2)/7] = √(543.43/7) = 8.81
The standard deviation is built from the squared deviations from the mean, because the raw deviations cancel out: the positive and negative deviations from the mean always sum to zero. Taking the square root of the mean squared deviation then gives a typical ("average") deviation from the arithmetic mean, expressed in the same units as the data.
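The measures above can be checked with a few lines of Python. This is only a sketch using the standard library's statistics module; the variable names are ours, and pstdev is chosen because it divides by N, as the formula above does.

```python
import statistics

# The seven observations from the example above.
sample = [2, 4, 7, 13, -7, 23, 12]

n = len(sample)
mean = statistics.mean(sample)                 # sum of the values divided by N
median = statistics.median(sample)             # middle value of the sorted sample
midrange = (max(sample) + min(sample)) / 2     # halfway between smallest and largest
sd = statistics.pstdev(sample)                 # square root of sum((x - mean)^2) / N

# The raw deviations cancel out, which is why they are squared first.
deviation_sum = sum(x - mean for x in sample)

print(f"N = {n}, mean = {mean:.2f}, median = {median}, midrange = {midrange}")
print(f"standard deviation (divisor N) = {sd:.2f}")
print(f"sum of raw deviations = {deviation_sum:.10f}")
```

Note that statistics.stdev, which divides by N - 1 instead of N, would give a slightly larger value for the same data.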
If the mean of the sampling distribution of a statistic equals the corresponding population parameter, the statistic is called an unbiased estimator of that parameter, and its values are called unbiased estimates. The same holds for the variance.
The arithmetic mean and the variance of a sample are unbiased if their expectations equal the corresponding population parameters.
But the mean of the sampling distribution of variances is μ(s^2) = σ^2(N - 1)/N, where σ^2 is the variance of the population and N is the sample size (the formula is based on mathematics not included here). The sample variance s^2 is therefore a biased estimate of the population variance σ^2.
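The factor (N - 1)/N can be illustrated without any probability theory by listing every possible sample. The sketch below assumes a small, purely hypothetical population of five values and draws all samples of size 2 with replacement; none of these numbers come from the text.

```python
import itertools
import statistics

# Hypothetical population, chosen only for illustration.
population = [2, 3, 6, 8, 11]
n = 2  # sample size

sigma2 = statistics.pvariance(population)  # population variance (divisor Np)

# Every possible sample of size n drawn with replacement (5^2 = 25 samples).
samples = list(itertools.product(population, repeat=n))

# Mean of the sample variances computed with divisor N (biased).
mean_of_variances = statistics.mean(statistics.pvariance(s) for s in samples)

# Mean of the sample variances computed with divisor N - 1 (unbiased).
mean_of_corrected = statistics.mean(statistics.variance(s) for s in samples)

print("population variance:          ", sigma2)
print("mean of variances (divisor N):", mean_of_variances)
print("sigma^2 * (N - 1) / N:        ", sigma2 * (n - 1) / n)
print("mean of corrected variances:  ", mean_of_corrected)
```

On average the divisor-N variance falls short of the population variance by exactly the factor (N - 1)/N, while the divisor-(N - 1) version recovers it.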
We have to distinguish between the parameters (as they are called, rather than variables) of the population and those of the sampling distribution of a statistic S:
μ : the population mean
σ^2 : the population variance
σ : the standard deviation of the population
μ_S : the mean of the sampling distribution of the statistic S
σ_S : the standard deviation of the sampling distribution of the statistic S
We then want to assess, under different assumptions, the reliability of a single point estimate of a parameter, or of an interval estimate (given by two point estimates), compared with the corresponding parameter of the population. Reliability refers to the error and precision of an estimate.
In the following we assume that our sampling distribution is approximately normal:
Confidence Estimates of Population Parameters
This means that the observations are distributed as illustrated by the normal (bell-shaped) curve above.
Most observations fall close to the average, with small positive and negative deviations from it. The larger the deviation, the fewer the observations. Here it looks as if the average is zero; we have simply placed the origin of the coordinate system at the average to illustrate the problem in general. Many phenomena in reality can be measured as a distribution around an average or median. If we assume that the sample shares the characteristic in focus, we can compare the sample with reality in a meaningful and manageable way.
If the sampling distribution of a statistic S is approximately normal (which is true for many statistics when the sample size is N ≥ 30, a size that is easy to reach), we can expect the actual sample statistic S to lie in the following intervals:
μ_S - σ_S to μ_S + σ_S about 68.27% of the time
μ_S - 2σ_S to μ_S + 2σ_S about 95.45% of the time
μ_S - 3σ_S to μ_S + 3σ_S about 99.73% of the time
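These three percentages can be checked numerically. A minimal sketch in Python, assuming an exactly normal distribution; the helper name coverage is ours, and it relies on the identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z.

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * coverage(k):.2f}%")
```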
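Confidence level %    z
99.73                 3.00
99                    2.58
98                    2.33
95.45                 2.00
95                    1.96
90                    1.645
68.27                 1.00
z is simply the multiple of σ_S that gives the desired confidence level.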
It follows that S ± 1.96σ_S is a 95% confidence interval for μ_S: for 95% of all samples this interval contains μ_S.
More generally, the confidence limits are given by S ± zσ_S, where z is read from the table above depending on the desired confidence level.
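The multiplier z for any confidence level can also be computed instead of looked up. The sketch below assumes Python 3.8 or later, where statistics.NormalDist is available; the function name z_for_confidence is ours. For a two-sided interval with confidence level c, z is the point below which a fraction (1 + c)/2 of the standard normal distribution lies.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided critical value z for a confidence level given as a fraction, e.g. 0.95."""
    return NormalDist().inv_cdf((1 + level) / 2)


for level in (0.6827, 0.90, 0.95, 0.9545, 0.98, 0.99, 0.9973):
    print(f"{100 * level:6.2f}%  z = {z_for_confidence(level):.3f}")
```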
For the population mean μ estimated by the sample mean Xg, mathematics not included here shows that the confidence limits are:
Xg ± zσ/√N if the sample is drawn from an infinite population, or with replacement from a finite population.
Xg ± (zσ/√N)√[(Np - N)/(Np - 1)] if the sampling is without replacement from a population of finite size Np.
These confidence limits assume N ≥ 30; the sample standard deviation can then be used as an estimate of σ.
Confidence Intervals for Proportions
We are working with a sample of size N drawn from a population in which the phenomenon is binomial (yes or no, heads or tails, etc.). P is the proportion of one outcome (say heads) in the sample of size N, p is the corresponding proportion in the population, and q = 1 - p. It can be proved mathematically that the confidence limits for the population proportion p are given by:
P ± z√(pq/N) = P ± z√(p(1 - p)/N), if the sampling is from an infinite population or with replacement from a finite population.
P ± z√(pq/N)·√[(Np - N)/(Np - 1)], if the sampling is without replacement from a population of finite size Np.
Since p is unknown, we can use the sample proportion P in place of p if N ≥ 30.
Confidence Intervals for Standard Deviation
Confidence limits for the standard deviation σ of a normally distributed population, as estimated from a sample with standard deviation s, are:
s ± zσ_s = s ± zσ/√(2N)
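As a rough illustration, the limits can be computed as follows. The sketch assumes that the unknown σ in the formula is replaced by the sample standard deviation s, which is only reasonable for large samples; the function name and the numbers (s = 0.05, N = 200) are hypothetical and not taken from the text.

```python
import math


def sd_confidence_limits(s, n, z):
    """Approximate confidence limits s ± z*s/sqrt(2N) for a population standard deviation.

    Assumption: sigma is estimated by the sample standard deviation s.
    """
    half_width = z * s / math.sqrt(2 * n)
    return s - half_width, s + half_width


# Hypothetical values: s = 0.05 from a sample of 200, 95% confidence (z = 1.96).
low, high = sd_confidence_limits(0.05, 200, 1.96)
print(f"95% limits for sigma: {low:.4f} to {high:.4f}")
```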
Unbiased and efficient estimates
A sample of six measurements of the length of a special tool was recorded as 2.23, 2.27, 2.28, 2.20, 2.26, 2.25. Determine unbiased and efficient estimates of the true mean and the true variance:
An unbiased and efficient estimate of the true mean (the mean of the population):
Xg = (2.23+2.27+2.28+2.20+2.26+2.25)/6 = 13.49/6 = 2.25
An unbiased and efficient estimate of the true variance (the variance of the population):
= [N/(N - 1)]s^2 = [Σ(X - Xg)^2]/(N - 1)
= [(2.23 - 2.25)^2 + (2.27 - 2.25)^2 + (2.28 - 2.25)^2 + (2.20 - 2.25)^2 + (2.26 - 2.25)^2 + (2.25 - 2.25)^2]/(6 - 1) = 0.0043/5 = 0.00086
Notice that √0.00086 = 0.0293 is an estimate of the true standard deviation, but this estimate is neither unbiased nor efficient: taking the square root of an unbiased variance estimate does not give an unbiased estimate of the standard deviation.
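The same estimates can be reproduced in Python. This is a sketch with the standard library only; statistics.variance divides by N - 1 and statistics.pvariance divides by N, so the two differ by the factor N/(N - 1) used above. The code works with the unrounded mean, so intermediate digits may differ slightly from the hand calculation.

```python
import statistics

# The six length measurements from the example.
lengths = [2.23, 2.27, 2.28, 2.20, 2.26, 2.25]
n = len(lengths)

mean_estimate = statistics.mean(lengths)          # estimate of the true mean
biased_variance = statistics.pvariance(lengths)   # s^2, divisor N
variance_estimate = statistics.variance(lengths)  # divisor N - 1
sd_estimate = variance_estimate ** 0.5            # estimate of sigma (not unbiased)

print(f"mean estimate     = {mean_estimate:.2f}")
print(f"variance estimate = {variance_estimate:.5f}")
print(f"std dev estimate  = {sd_estimate:.4f}")
```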
Confidence Interval Estimates for Population Measures of Position
Determine the 98% and 90% confidence limits for the population mean, given a sample mean of 5.44 and a standard deviation of 0.05 calculated from a sample of 100 drawn from an infinite population.
Imagine, looking at the normal distribution curve and the corresponding table above, that you cut off the part of the curve representing 1% of the observations at each end. You are then left with 98% confidence. Following the table, this means choosing z = 2.33, which leaves 49% of the observations on each side of the mean: 2 × 49 = 98.
The 98% confidence limits are Xg ± 2.33σ/√N = 5.44 ± 2.33(0.05)/√100 =
5.44 ± 0.012
The 90% confidence limits are 5.44 ± 1.645(0.05)/√100 = 5.44 ± 0.008
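The calculation for the mean can be wrapped in a small function. This is a sketch under the assumptions stated earlier: a large sample, σ estimated by the sample standard deviation, z = 2.33 for 98% and z = 1.645 for 90%. The function name and the finite population size of 1000 in the last call are hypothetical and serve only to show the finite population correction.

```python
import math


def mean_confidence_limits(sample_mean, sigma, n, z, population_size=None):
    """Confidence limits sample_mean ± z*sigma/sqrt(n).

    If population_size is given, sampling is assumed to be without replacement
    from a finite population, and the finite population correction
    sqrt((Np - n) / (Np - 1)) is applied.
    """
    half_width = z * sigma / math.sqrt(n)
    if population_size is not None:
        half_width *= math.sqrt((population_size - n) / (population_size - 1))
    return sample_mean - half_width, sample_mean + half_width


low98, high98 = mean_confidence_limits(5.44, 0.05, 100, 2.33)
low90, high90 = mean_confidence_limits(5.44, 0.05, 100, 1.645)
# Hypothetical finite population of 1000 items, sampled without replacement.
low_fin, high_fin = mean_confidence_limits(5.44, 0.05, 100, 2.33, population_size=1000)

print(f"98%: {low98:.3f} to {high98:.3f}")
print(f"90%: {low90:.3f} to {high90:.3f}")
print(f"98% with finite population correction: {low_fin:.3f} to {high_fin:.3f}")
```

The correction always narrows the interval, and it matters less the larger the population is compared with the sample.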
A poll of 200 voters chosen at random from all voters in a district indicates that 55% of them would vote for a particular political candidate. Find the 95% and 99% confidence limits for the proportion of all the voters in favour of the candidate.
The confidence limits for the population (all voters in the district), when we use the sample proportion P to estimate p, are:
95%: P ± 1.96σ_P = P ± 1.96√[p(1 - p)/N] = 0.55 ± 1.96√(0.55*0.45/200) = 0.55 ± 0.07
99%: P ± 2.58√(0.55*0.45/200) = 0.55 ± 0.09
How much would it help if the poll sample were doubled to 400?
95%: 0.55 ± 1.96√(0.55*0.45/400) = 0.55 ± 0.05
99%: 0.55 ± 2.58√(0.55*0.45/400) = 0.55 ± 0.06
What are the results if the poll covered 900 voters?
95%: 0.55 ± 1.96√(0.55*0.45/900) = 0.55 ± 0.03
99%: 0.55 ± 2.58√(0.55*0.45/900) = 0.55 ± 0.04
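The effect of the sample size on the margin is easy to tabulate. A minimal sketch, assuming sampling from an effectively infinite population and using the sample proportion in place of p; the names proportion_margin and p_hat are ours.

```python
import math


def proportion_margin(p_hat, n, z):
    """Half-width z*sqrt(p_hat*(1 - p_hat)/n) of a confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


p_hat = 0.55
for n in (200, 400, 900):
    m95 = proportion_margin(p_hat, n, 1.96)
    m99 = proportion_margin(p_hat, n, 2.58)
    print(f"N = {n}: 95% -> {p_hat} ± {m95:.2f}, 99% -> {p_hat} ± {m99:.2f}")
```

Because the margin shrinks with the square root of N, doubling the sample reduces it by a factor of about 1.4 rather than 2.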
How large a sample of voters has to be chosen in order to be confident at the 95% level, and at the 99.73% level, that the candidate will be elected? (We assume he has to get 50% of the votes to be elected.)
0.55 ± z√(0.55*0.45/N) = 0.55 ± 0.497z/√N
0.497z/√N has to be at most 0.05, so that the lower limit does not fall below 0.50. Working with the unrounded product 0.55*0.45 = 0.2475, this means N ≥ z^2(0.2475)/0.05^2 = 99z^2.
95% (z = 1.96): N ≥ 99*1.96^2 = 380.3
N = 381
99.73% (z = 3): N ≥ 99*3^2 = 891
N = 891
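The required sample size can also be computed directly by solving z√(pq/N) ≤ 0.05 for N. The sketch below is one way to do it; the function name is ours, and it uses exact fractions so that rounding up to a whole voter is not disturbed by floating-point error. It works from the unrounded product 0.55*0.45 rather than the rounded 0.497, so its results can differ by one from a hand calculation with rounded intermediates.

```python
import math
from fractions import Fraction


def required_sample_size(p_hat, z, margin):
    """Smallest N with z*sqrt(p_hat*(1 - p_hat)/N) <= margin.

    Arguments are decimal strings (e.g. "0.55") so they convert to exact fractions.
    """
    p = Fraction(p_hat)
    z = Fraction(z)
    e = Fraction(margin)
    n_exact = z ** 2 * p * (1 - p) / e ** 2
    return math.ceil(n_exact)


n_95 = required_sample_size("0.55", "1.96", "0.05")
n_9973 = required_sample_size("0.55", "3", "0.05")
print("95% confidence:   N =", n_95)
print("99.73% confidence: N =", n_9973)
```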
With the sample of 900 considered above, you will notice that the result was more than sufficient at the 99% level, since 0.55 - 0.04 = 0.51 > 0.50.
The decisive point of statistical estimation is to compare the measures calculated in the sample with the corresponding measures in the population, and then to make sure that the comparison is safe enough. That is its strength: you do not mix up reality and sample, you compare them. Perhaps you cannot compare everything in a meaningful way, and it is good to know when that is impossible. The reason why the sample is still used so much, even in the age of EDP (electronic data processing), is of course that not everything has yet been registered about everything, and thank God for that. And even if it had been, or one day will be, a sample will still be much easier to handle in practice.