00:01
Welcome back for lecture 9
What we're going to discuss
is categorical data analysis.
00:06
Let's begin with an example to motivate
what we're going to do in this lecture
Is your zodiac sign a useful predictor of
how successful you will be later in life?
Fortune magazine collected the zodiac signs of
the heads of 256 of the largest 400 companies
Below is a table of the number
of births under each sign.
00:24
The question is: are the variations in the
number of births per sign just due to chance,
or are successful people more likely to
be born under some signs than others?
So now we need to analyze the data in the table,
and the first step is to figure out what won't work.
00:39
Well, the one-sample z-test for
proportions will not work here,
because there are twelve proportions we need to
be concerned about, one for each zodiac sign.
00:48
Similarly, the two-sample z-procedures
that we talked about won't work here
So then the question
becomes, what will work?
Well, we're going to do what's
called a Goodness-of-Fit test,
so let's see how we do it.
We're aiming to determine whether a higher
percentage of successful people
are born under some signs
than under others.
01:07
We would initially hypothesize that the proportions
of successful people born under each sign are equal.
01:14
So let's let p_i be the true proportion of successful
people born under sign i, where i goes from 1 to 12.
Our hypotheses are,
our null hypothesis will be that all
12 of these proportions are equal
and our alternative will be that at least
one proportion differs from the others.
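Written out as a small notational sketch (since the twelve proportions must sum to 1, equality forces each one to be 1/12):

```latex
H_0:\ p_1 = p_2 = \cdots = p_{12} = \tfrac{1}{12}
\qquad\text{vs.}\qquad
H_a:\ p_i \neq \tfrac{1}{12}\ \text{for at least one } i
```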
01:33
Just like with any other test,
if we want to do the Goodness-of-Fit test,
we have to have some conditions satisfied.
01:40
First and foremost, we need to make
sure that our data are counted data.
01:44
So the values in each cell of the table must
be counts of the number of observations
in the category corresponding to that cell.
01:52
Randomization.
01:54
The individuals who have been counted are a
random sample from a population of interest.
01:59
Thirdly, the 10% condition.
02:02
The sample again has to be
less than 10% of the population.
Fourth, the expected cell frequency condition.
02:09
We should expect to see at least
five individuals in each cell.
02:14
So let's look at the expected
cell frequency condition.
02:17
For each cell, the expected
count is n times p_0i,
where n is the sample size and p_0i is
the hypothesized proportion in category i.
In this example, n is 256; that's the
number of people that we sampled.
02:34
We're hypothesizing that
each proportion is equal.
02:37
So for each one, p_0i is 1/12.
02:42
Therefore, the expected count in each cell
is 256 times 1/12, or 21 and one-third.
02:49
So the expected cell frequency
condition is easily satisfied here.
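As a quick sketch in Python, using only the numbers quoted above, the expected count and the cell-frequency check look like this:

```python
n = 256        # number of executives sampled
k = 12         # number of zodiac signs (cells)
p0 = 1 / k     # hypothesized proportion for each sign under H0

expected = n * p0                 # expected count per cell
print(expected, expected >= 5)    # 21.333... True -> condition satisfied
```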
02:54
Do we have counted data?
Yes.
02:56
The data in our astrology example are
counts of people born under each sign.
03:01
Do we have randomization?
We don't have a random sample here,
but births should be randomly
distributed throughout the year,
so we can assume independence.
03:11
10% condition.
03:12
This is definitely not satisfied,
but it's still reasonable to think
that the births are independent, so we're
not too worried about this here.
03:20
Have we verified the expected
cell frequency condition?
We already did, above.
03:25
So what about the mechanics
of the Goodness-of-Fit test?
Our test statistic is given by this
character, which we call chi-square:
the sum over all cells of the observed
minus the expected frequency, squared,
divided by the expected frequency.
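In symbols, with O the observed and E the expected frequency in each cell:

```latex
\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}
```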
03:39
Under the null hypothesis, our test statistic
has what is called a chi-square distribution
with k minus 1 degrees of freedom where
k is the number of cells in the table
For a level alpha test of our null hypothesis, we reject the
null hypothesis for large values of the test statistic,
in other words, we reject when our expected
frequencies under the null hypothesis are far away
from the observed frequencies.
04:05
Formally, we reject our null
hypothesis when our test statistic
takes a value larger than the critical value
for a chi-squared distribution with k minus 1 degrees of freedom.
These critical values can be found in the table
that I'm going to show you on the next slide.
04:19
We match up the degrees of freedom, and the
significance level to find the critical value,
just like we did with the t-table.
04:25
p-values are hard to find by hand and
require the use of software,
so we're going to restrict ourselves here to
just finding critical values for this test.
04:34
So here's the table: down the left margin
of the table are degrees of freedom,
and across the top are significance levels.
04:40
So we match up the degrees of
freedom with the significance level,
and the cell corresponding to that match
gives you the critical chi-squared value.
So let's try one.
04:50
Let's use the chi-squared table to find the critical
value for a level alpha equals .05 test in our example.
04:57
We know that since there are 12 cells,
12 counts, we have 11 degrees of freedom.
05:02
So in the table, we match up 11
degrees of freedom with .05 from the top,
and we get a critical value of 19.6751.
Therefore, we're going to reject our null hypothesis if
our chi-squared statistic is larger than 19.6751.
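If you do have software handy, the table lookup can be reproduced; a minimal sketch using SciPy's chi-square quantile function:

```python
from scipy import stats

# Upper 5% critical point of a chi-square distribution with 11 df
critical_value = stats.chi2.ppf(1 - 0.05, df=11)
print(critical_value)   # 19.675...
```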
05:21
So let's try it.
05:23
The test statistic is the chi-squared statistic, which
is the sum of the observed minus the expected frequency,
squared, divided by the expected frequency.
When we carry out that calculation,
we get a test statistic value of 5.09.
This is smaller than the critical value of 19.6751,
so we do not reject the null hypothesis.
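Putting the whole test together in Python. Note that the transcript doesn't reproduce the zodiac table, so the twelve counts below are placeholders to show the mechanics, not the actual Fortune data:

```python
import numpy as np
from scipy import stats

# Placeholder counts -- substitute the actual observed births per sign.
observed = np.array([23, 20, 18, 23, 24, 19, 18, 21, 19, 22, 24, 25])
expected = np.full(12, observed.sum() / 12)   # 21.333... per cell under H0

# Chi-square statistic by hand, then via SciPy's goodness-of-fit test.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, p_value = stats.chisquare(observed, f_exp=expected)

critical = stats.chi2.ppf(0.95, df=11)
print(chi2_stat, critical, chi2_stat > critical)   # reject H0 only if True
```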
05:43
What does this tell us?
Well what it tells us is that there
is no evidence at the 5% level
that the zodiac sign of a person is a predictor
of their success level later in life.
05:54
So that's the chi-square
test for Goodness-of-Fit.
05:57
Sometimes, we wanna compare several
distributions and see if they're the same.
06:01
For example, many universities survey their graduating
classes to determine their plans after graduation.
06:08
What we see here is a two-way table for a class of
graduates from several colleges at the university.
Each cell shows the number of graduates
that made a particular choice.
06:18
So down the left side are the plans,
and across the top are the colleges of the school.
So the hypotheses are as follows.
The null hypothesis is that students' post-graduation
plans are distributed the same way for all four colleges.
06:36
The alternative hypothesis is that students' plans
do not have the same distribution in each college.
06:42
And mathematically, it's kind of a pain to write out, so
we're going to leave these hypotheses as they are for now.
Just like we have before, we have certain
conditions we need to have satisfied.
06:53
First of all, we need to make
sure that we have counted data.
All of our cells represent counts
of graduates in a category.
07:00
We need independence or randomization.
Again, this is not a random sample,
but it can reasonably be assumed
that the students' plans are largely
independent of each other, so we're good there.
07:10
Third, the expected cell frequency condition.
07:14
We again need the expected frequency
in each cell to be at least five
So let's check them out, let's see
how we compute expected frequencies.
07:23
For the agriculture students who were employed: overall,
685, or about 47%, of the 1456 students were employed.
If the distributions were all the same, then 47% of the
448 agriculture graduates, or 210.769, would be employed.
07:43
Likewise, 47% of the 374 engineering graduates
would be expected to be employed, or 175.955.
So what we've done is we've taken to get
the expected frequency and cell i j,
we take the total in row i times the total in column
j and then divide by the total number observed
So here what we took for the
agriculture and employed cell,
was 685 times 448 divided by the total
of 1456 and that gave us the 210.769
And we can do that same thing for
every other cell in the table
So here's the table of
the expected frequencies.
08:21
All of these are much larger than five so we're
okay on the expected cell frequency condition.
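A quick sketch verifying the two expected counts quoted above, using only the marginal totals mentioned in the lecture:

```python
employed = 685       # row total: graduates who were employed
agriculture = 448    # column total: agriculture graduates
engineering = 374    # column total: engineering graduates
total = 1456         # grand total

print(employed * agriculture / total)   # 210.769...
print(employed * engineering / total)   # 175.954... (175.955 rounded)
```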
08:27
The test statistic is calculated
exactly the same way as before.
08:31
It's the sum of the observed minus the expected
frequency squared, divided by the expected frequency.
08:38
Under the null hypothesis, if the
table has r rows and c columns,
then the test statistic has a chi-square distribution
with (r minus 1) times (c minus 1) degrees of freedom.
08:49
So in this example, we have
three rows and four columns
so we have 3 minus 1 times 4 minus
1 or six degrees of freedom.
08:59
If we're testing at the alpha equals .05 level
of significance, then we're going to reject H0
if our chi-square statistic
is larger than 12.5916;
we get this from the table by matching up 6 degrees
of freedom with the 5% significance level at the top.
When we calculate the test statistic, putting the counts
into that long formula that we've seen before,
we get a chi-squared
statistic of 93.66.
So we reject the null hypothesis, and we
conclude that there is evidence of a difference
in the distributions of post-graduation
plans among the graduates.
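As a sketch of that decision in Python. The full 3-by-4 table isn't reproduced in the transcript, so the statistic is taken from the lecture; with the table in hand, scipy.stats.chi2_contingency(table) would return the statistic, p-value, degrees of freedom, and expected-count table in one call:

```python
from scipy import stats

df = (3 - 1) * (4 - 1)                   # (r - 1)(c - 1) = 6
critical = stats.chi2.ppf(0.95, df=df)   # 12.5916
chi2_stat = 93.66                        # value computed in the lecture

print(chi2_stat > critical)              # True -> reject H0
```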
09:32
Again, as a side note, looking at the
table of expected frequencies
shows that the expected cell frequency
condition is satisfied, so we're good on that.
09:42
Finally, we want to look at the
chi-square test of independence.
09:45
One question that we often ask is, is one
variable related to changes in the other?
So let's look at a study that involves
hepatitis C and tattoo status.
09:56
A study examines 626 people being
treated for non-blood related diseases
to see whether the risk of hepatitis C was related to
whether people had tattoos and where they got them from.
10:07
The data are summarized in
this table that you see below.
10:10
We have the tattoo status down the side, whether they got
the tattoo in a parlor, got it elsewhere, or don't have one,
10:15
and hepatitis C or no
hepatitis C across the top.
10:20
So the natural question then is:
is the chance of having hepatitis C independent of
where they got the tattoo, or whether they have one?
This would mean that the overall distribution of hepatitis C
is equal to the conditional distribution of hepatitis C
given the tattoo status,
for all tattoo statuses.
10:38
So this is very much like the test for homogeneity,
where we're testing for equality of distributions,
and the mechanics are exactly the same.
10:46
The only difference here is that in the test for
homogeneity, we look at two or more different populations,
but here the two categorical variables
are measured on only one population.
So how do we do it?
Well, the first thing, just like with any other
hypothesis test, is to set up our hypotheses.
11:04
Our null hypothesis in this example will be that
tattoo status and hepatitis C are independent.
11:10
Our alternative then is that
they're not independent.
11:14
So again we have the conditions.
11:16
Counted data? Yes, the data
represented in the table are counts.
11:21
Independence.
11:22
The people in the study are likely to be
independent of each other so we're good there.
11:26
Randomization.
11:28
The data are from a retrospective study of patients
being treated for something other than hepatitis.
They're not a random sample, but they're
likely to be representative of a population,
so the randomization condition is still okay.
11:39
The 10% condition?
Well, 626 people is fewer than 10% of all
people with tattoos or with hepatitis C,
so the 10% condition is fine.
Let's look at expected cell frequency.
11:52
The calculation of expected cell
frequencies is exactly the same as before:
for cell (i, j), the expected frequency in
that cell is equal to the total in row i
times the total in column j, divided
by the total number observed.
12:05
All of these need to be at least 5.
So here's the table of the expected frequencies,
and what we see is that we have a couple of problems:
the parlor-tattoo-with-hepatitis-C cell and the
elsewhere-tattoo-with-hepatitis-C cell are both less than 5.
So we don't quite have the expected
cell frequency condition satisfied.
12:22
Just so we can get a feel for the mechanics of the
test, let's use this data and carry it out anyway.
12:28
Under the null hypothesis, the test
statistic has a chi-square distribution
with r minus 1 times c
minus 1 degrees of freedom
where again r is the number of
rows, c is the number of columns
In our example, we have three rows, two columns so
the test statistic has a chi-square distribution
with 3 minus 1 times 2 minus
1 or 2 degrees of freedom.
12:50
Therefore, if we're looking to carry out
the test at the alpha equals 0.05 level,
then we reject the null hypothesis if our chi-squared
statistic takes a value at least as large as 5.9915,
and again, we find this in the table.
13:05
What we observed is our chi-squared
statistic is equal to 57.91.
13:11
So we reject the null hypothesis and
we conclude that there is evidence
at the 5% level that hepatitis C and
tattoo status are not independent.
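Again as a minimal sketch (the statistic is taken from the lecture, since the raw 3-by-2 table isn't reproduced here):

```python
from scipy import stats

df = (3 - 1) * (2 - 1)                   # (r - 1)(c - 1) = 2
critical = stats.chi2.ppf(0.95, df=df)   # 5.9915
chi2_stat = 57.91                        # value computed in the lecture

print(chi2_stat > critical)              # True -> reject H0
print(stats.chi2.sf(chi2_stat, df=df))   # p-value, far below 0.05
```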
13:22
In categorical data analysis, there are
a bunch of things that can go wrong
so let's look at some of the
things that we want to avoid.
13:28
First of all, do not use the chi-square
methods unless your data are counts.
13:33
Second, we want to beware of large samples.
13:35
It may seem strange to say that we should
beware of large sample sizes,
but your degrees of freedom do not
increase with sample size here;
they're determined by the number of categories
and the number of cells, not the total sample size.
With a very large sample, even small, unimportant deviations
can turn out statistically significant,
so we want to be careful with that.
13:54
Finally, do not say that one variable depends on
the other just because they're not independent.
That statement implies causation,
which we can't always infer.
14:03
So what have we done in this lecture?
Well we have talked about three
tests for categorical data.
14:08
We talked about the test for a particular
distribution, which is the Goodness-of-Fit test.
14:14
We talked about the chi-square test for homogeneity where we
looked to see if a group of distributions are equal to each other,
and we looked at the chi-square
test of independence
to see whether or not two variables are
independent within one population.
14:28
We finished up by looking at some of the
pitfalls of categorical data analysis
and the things that we want to
avoid when trying to do it.
14:35
This is the end of lecture 9 and I look forward
to seeing you back again for lecture 10.