We then divide this sum by the number of observations as a scaling factor. Without this scaling, we could get a very high variance simply by observing a lot of data, so we divide by the total number of observations. However, this is the formula for the population variance. To calculate the variance of a sample, we instead divide by the number of observations minus one (n − 1).
If you Google why the sample variance is calculated differently, you will get a variety of answers. Most of them do not actually help me understand (1) the problem and (2) why the solution is the solution that it is. So, below I am going to try to figure it out in a way that actually makes conceptual and intuitive sense to me.
The problem with using the population variance formula to calculate the variance of a sample is that it is biased. It is biased in that it produces an underestimation of the true variance. We simulate a population of data points from a uniform distribution with a range from 1 to Below I show the histogram that represents our population.
The variance is 8. To start, we can draw a single sample of size 5. Say we do that and get the following values: 7, 6, 3, 5, 5. In the former case, this will result in 1. Below I show the results of draws from our population. I simulated drawing samples of size 2 to 10, each different times. We see that the biased measure of variance is indeed biased. The average variance is lower than the true variance (indicated by the dashed line) for each sample size. We also see that the unbiased variance is indeed unbiased.
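To make the comparison concrete, here is a minimal Python sketch. It first applies both formulas to the sample 7, 6, 3, 5, 5, then repeats the sampling experiment for sample sizes 2 to 10. The population's upper bound and size, the number of draws, and the seed are not given above, so the values used here (integers 1 to 10, 1,000 points, 20,000 draws, seed 0) are assumptions, as are the function names.

```python
import random


def biased_variance(xs):
    """Sum of squared deviations from the mean, divided by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def unbiased_variance(xs):
    """Same sum of squared deviations, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


# The single sample from the text: squared deviations sum to 8.8.
sample = [7, 6, 3, 5, 5]
sample_biased = biased_variance(sample)      # 8.8 / 5, about 1.76
sample_unbiased = unbiased_variance(sample)  # 8.8 / 4, about 2.2

# Assumed population: 1,000 integers drawn uniformly from 1 to 10.
rng = random.Random(0)
population = [rng.randint(1, 10) for _ in range(1000)]

# Dividing by N is the correct formula when we hold the whole population.
true_variance = biased_variance(population)


def average_estimates(sample_size, n_draws=20000):
    """Average both estimators over many samples drawn with replacement."""
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(n_draws):
        draw = rng.choices(population, k=sample_size)
        total_biased += biased_variance(draw)
        total_unbiased += unbiased_variance(draw)
    return total_biased / n_draws, total_unbiased / n_draws


# Map each sample size to (average biased, average unbiased) estimates.
results = {n: average_estimates(n) for n in range(2, 11)}

if __name__ == "__main__":
    print(f"single sample: biased={sample_biased:.2f}, unbiased={sample_unbiased:.2f}")
    print(f"true variance: {true_variance:.3f}")
    for n, (avg_biased, avg_unbiased) in results.items():
        print(f"n={n:2d}  biased={avg_biased:.3f}  unbiased={avg_unbiased:.3f}")
```

For the single sample, the two formulas give about 1.76 and 2.2. In the simulation, the average biased estimate should sit near (n − 1)/n times the true variance, so the shortfall shrinks as the sample size grows, while the average unbiased estimate should stay close to the true variance.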
On average, the sample variance matches the population variance. The results of using the biased measure of variance reveal several clues for understanding the solution to the bias. We see that the amount of bias is larger when the sample size is smaller.

So let me write this down. For the population, we are calculating a parameter. And when we attempt to calculate something for a sample, we would call that a statistic.
So how do we think about the mean for a population? Well, first of all, we denote it with the Greek letter mu. And we essentially take every data point in our population. So we take the sum of every data point. So we start at the first data point and we go all the way to the capital Nth data point. So every data point we add up. So this is the i-th data point, so x sub 1 plus x sub 2 all the way to x sub capital N.
And then we divide by the total number of data points we have. Well, how do we calculate the sample mean? Well, the sample mean-- we do a very similar thing with the sample.
And we denote it with an x with a bar over it. And that's going to be taking every data point in the sample, so going up to a lowercase n, adding them up, so this is the sum of all the data points in our sample, and then dividing by the number of data points that we actually had. Now, the other thing that we're trying to calculate for the population, which was a parameter, and then we'll also try to calculate it for the sample and estimate it for the population, was the variance, which was a measure of how dispersed the data points are, or how much they vary from the mean.
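Before moving on to variance, the two means just described can be written out as follows. The index i, capital N for the population size, and lowercase n for the sample size follow the wording above.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```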
So let's write variance right over here. And how do we denote and calculate variance for a population? Well, for a population, we'd say that the variance, and we use the Greek letter sigma squared, is equal to what you can view as the mean of the squared distances from the population mean. So what we do is we take, for each data point, so i equal 1 all the way to capital N, we take that data point and subtract from it the population mean.
So if you want to calculate this, you'd want to figure this out. Well, that's one way to do it. We'll see there's other ways to do it, where you can calculate them at the same time.
But the easiest or the most intuitive is to calculate the mean first, then for each of the data points take the data point, subtract the mean from it, and square it, then add up those squared differences and divide by the total number of data points you have. Now, we get to the interesting part: sample variance. When people talk about sample variance, there are several tools in their toolkits, several ways to calculate it. One way is the biased sample variance, which is a biased estimator of the population variance.
And that's usually denoted by s with a subscript n. And how do we calculate the biased estimator? Well, we would calculate it very similarly to how we calculated the variance right over here.
But we would do it for our sample, not our population. So for every data point in our sample, and we have n of them, we take that data point. And from it, we subtract our sample mean and square the result. Then we add those up and divide by the number of data points that we have.
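In symbols, the population variance and the biased sample variance described above can be written side by side. The text only says "s with a subscript n"; writing it with a square, as s_n^2, is a common convention and an assumption here.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

The only differences are the sample mean in place of the population mean and lowercase n in place of capital N.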
But we already talked about it in the last video. How would we find-- what is our best unbiased estimate of the population variance? This is usually what we're trying to get at. We're trying to find an unbiased estimate of the population variance.
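For reference, the usual unbiased estimator of the population variance divides the same sum of squared differences by n − 1 rather than n. The symbol s_{n-1}^2 is a notational assumption here, chosen to mirror the subscript-n notation used for the biased version.

```latex
s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
          = \frac{n}{n-1}\, s_n^2
```

The second equality shows that the unbiased estimate is always the biased one scaled up by n/(n − 1).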
When you divide by a smaller number you get a larger number.
Let's think about what a larger vs. If the sample variance is larger, then there is a greater chance that it captures the true population variance. Because we are trying to reveal information about a population by calculating the variance from a sample, we probably do not want to underestimate the variance.
There was a good post here on CV that will give you some good insight. Hope this helps!