Goodness-of-Fit
The basic idea of the chi-squared goodness-of-fit test is to measure how well observed counts match the counts a model tells us to expect. It is used with categorical variables, as we will see.
Example (from the textbook)
We have a list of 256 CEOs and the months they were born in.
Births | Month |
---|---|
23 | Jan |
20 | Feb |
18 | Mar |
23 | April |
20 | May |
19 | June |
18 | July |
21 | Aug |
19 | Sep |
22 | Oct |
24 | Nov |
29 | Dec |
Now, we have to wonder: Is this special at all? If a person is equally likely to be born in any month, we expect 256/12 = 21.333 CEOs in each month. So, could the differences we see come solely from randomness?
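The expected count is easy to check by machine. This is only a small sketch: the variable names (births, expected) are made up here for illustration and are not from the textbook.

```python
# Observed births per month, Jan through Dec, copied from the table above.
births = [23, 20, 18, 23, 20, 19, 18, 21, 19, 22, 24, 29]

total = sum(births)              # number of CEOs in the sample
expected = total / len(births)   # uniform model: the same count in every month

print(total)                     # 256
print(round(expected, 3))        # 21.333
```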
Null Hypothesis
We've seen this kind of question before: it calls for a hypothesis test. Our null hypothesis is that births are uniformly distributed across the 12 months. Our alternative hypothesis is that they aren't.
Assumptions
As with all of our tests, we must check certain assumptions.
Counted Data Assumption: Data are counts for categories. This determines whether we can actually use chi-squared.
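Independence Assumption: The counts in the categories should be independent of each other. In practice, we check that the individuals are a random sample that is representative of the population.
Sample Size Assumption: We should expect at least 5 individuals in each category. This is about the expected counts (21.333 per month here), not the observed ones.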
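The test statistic is chi^2 = sum of (Observed - Expected)^2 / Expected over the 12 months, with 12 - 1 = 11 degrees of freedom. For the table above, chi^2 is about 5.09. Surprisingly, or perhaps not surprisingly, we got a p-value of about 93%. This means that if births were truly uniform, we'd see counts at least as far from the expected 21.333 as ours about 93% of the time. So it is very reasonable that the differences between months are simply random.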
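To see where a p-value like this comes from, here is a rough sketch of the calculation using only Python's standard library. The helper chi2_sf is a name invented for these notes, not a library function; it builds the upper-tail chi-squared probability for whole-number degrees of freedom from a standard recurrence. The result should land at roughly 0.93.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-squared with integer df >= 1.

    Hypothetical helper for these notes. Starts from the known tails for
    df = 1 (erfc) or df = 2 (exp) and adds one term per step of 2 in df.
    """
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(x / 2))
    else:
        k, q = 2, math.exp(-x / 2)
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# Observed births per month, Jan through Dec, from the table above.
births = [23, 20, 18, 23, 20, 19, 18, 21, 19, 22, 24, 29]
expected = sum(births) / len(births)       # uniform model

# Sum of (Observed - Expected)^2 / Expected over all categories.
chi2 = sum((obs - expected) ** 2 / expected for obs in births)
df = len(births) - 1                       # categories minus one

p_value = chi2_sf(chi2, df)
print(round(chi2, 3), df)                  # 5.094 11
print(p_value)
```

A large chi^2 means the counts are far from what the model expects, so the p-value is always the upper tail.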
Be careful
A large p-value does not prove that births are actually uniformly distributed. We can only reject the null hypothesis or fail to reject it.
Test of Homogeneity
I'm not going to do an example, but it's similar to what we did before. Let's say we had data from 1990, 2000, and 2010 on what high school students did after graduating: College, Job, Military, or Travel. In our first example, the expected counts were all 21.3. For this example, the null hypothesis is that the choices are distributed the same way in every year, so we expect each year's percentages to equal the overall percentages (all three years combined). The degrees of freedom are (R-1)(C-1), so with 4 rows (choices) and 3 columns (years) it would be (4-1)(3-1) = 6. We find the chi^2 value the same way: sum (Observed - Expected)^2 / Expected over every cell.
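A sketch of how the expected counts and degrees of freedom would be computed for a table like this. The counts below are HYPOTHETICAL, invented purely to show the arithmetic; they are not from the textbook or any real survey.

```python
# HYPOTHETICAL counts, made up for illustration only.
# Rows: College, Job, Military, Travel. Columns: 1990, 2000, 2010.
observed = [
    [300, 350, 400],
    [150, 120, 100],
    [30, 20, 20],
    [20, 10, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under homogeneity, each year has the same split as the combined data,
# so expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

rows, cols = len(observed), len(observed[0])
df = (rows - 1) * (cols - 1)
print(df)  # 6
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick way to check the arithmetic.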
Test of Independence
A question like this would arise if you ask "Does going to the tattoo parlor affect getting Hepatitis C?" So, you'd take one group of people and record two categorical variables for each person: Hepatitis C or no Hepatitis C, and whether they got a tattoo (1) in a parlor, (2) at home, or (3) never. If the two variables are independent, you expect the same proportion of Hepatitis C cases in each tattoo group. The degrees of freedom are still (R-1)(C-1) = (3-1)(2-1) = 2. In a test of homogeneity, we have a single categorical variable measured on two or more populations. In a test of independence, we have two categorical variables measured on one population. The calculation is the same, but the conclusions are different: homogeneity is about whether several groups share the same distribution, while independence is about whether two variables are associated within one population.
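The whole calculation for a two-way table can be sketched in a few lines. Everything here is an assumption made for illustration: chi2_two_way is a hypothetical helper name, and the counts are INVENTED, not real Hepatitis C data.

```python
import math

def chi2_two_way(observed):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts.

    Hypothetical helper for these notes, not a library function.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# INVENTED counts for illustration only.
# Rows: tattoo in a parlor, tattoo at home, never. Columns: Hep C, no Hep C.
observed = [
    [20, 80],
    [10, 90],
    [30, 370],
]

stat, df = chi2_two_way(observed)
p_value = math.exp(-stat / 2)   # closed form that is valid only when df == 2
print(df)                       # 2
print(round(stat, 3))           # 13.889
```

The same function works for a test of homogeneity, since only the interpretation changes. For other degrees of freedom the p-value needs a chi-squared table or a statistics library.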