> table(infert$education,infert$parity)
and you get a very sparse tabulation of the parity (number of births) by educational attainment. You try the enhanced version of this function,
> xtabs(~education+parity,data=infert)
and are faced with a slightly more informative display. Unfortunately,
you know that Bronwyn will want to know what percentage
of women who completed high school had 2 or fewer children and Hans will have to
have a chi-squared test for every contingency table. Let's see what can be done.
R follows the precepts of a bunch of
brilliant people at Bell Labs in making statistics modular. That is,
individual functions do fairly simple, general things very well, and
intelligently combining the modules will do almost anything that you want. The
beginner's problem is usually figuring out which functions will do the
particular things that they want.
We'll use the example data frame infert provided with R to illustrate how to
build on that. First, let's find and retrieve the data.
> data()
...
freeny     Freeny's Revenue Data
infert     Secondary infertility matched case-control study
iris       Edgar Anderson's Iris Data as data.frame
...
> data(infert)
A quick summary of the data will reveal that parity ranges from 1 to 6. This will have to be reduced to two categories. That's pretty easy to do by assigning the output of a logical comparison.
> gt2<-infert$parity>2
> table(infert$education,gt2)
...
The observant reader may ask why the comparison "greater than" was used rather than "less than or equal to". Convenience is the answer. By default, R orders factors, and FALSE (0) is less than TRUE (1). Using "greater than" here gets the factors "right way round", rather than having "more than 2" in the first column and "less than or equal to 2" in the second. When factors are coded as labels, they are ordered alphabetically. You can explicitly order factors if you wish.
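If you would rather not rely on the default ordering, here is a minimal sketch of ordering a factor explicitly. The labels "Up to two" and "Three or more" and the name grp are made up for this illustration; they were chosen because their alphabetical order is the wrong way round.

```r
data(infert)

# Labels whose alphabetical order would put the larger group first
grp <- ifelse(infert$parity > 2, "Three or more", "Up to two")
default.levels <- levels(factor(grp))   # alphabetical ordering
default.levels

# Give the levels explicitly to control the column order
grp <- factor(grp, levels = c("Up to two", "Three or more"))
tab <- table(infert$education, grp)
tab
```

The columns of the table follow the order of the levels, so the "Up to two" column now comes first.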
This is still a pretty laconic table which will have to be explained. Putting
the dimnames in will help.
> table(infert$education,gt2,dnn=c("Education","Parity"))
It would also be nice if there were some descriptive labels rather than just
"FALSE" and "TRUE". The really useful function ifelse() will do the trick.
> gt2<-ifelse(infert$parity>2,"Over 2","2 or less")
> table(infert$education,gt2,dnn=c("Education","Parity"))
Notice how the labels have been doctored so that they will be in the
conventional order. Now we have a reasonable looking contingency table, but
what about Bronwyn's percentages and Hans' chi-squares? We're going to have to
go a bit beyond what table() will do to get output that will satisfy them.
Let's go through the function format.xtab().
First, we check that the minimal data is there, then get the base table from
which to derive the rest of the information. In order to calculate the
percentages, we'll need the row and column sums. These can be calculated in one
hit by using apply(). Next up come the row and column names. Here, formatC()
pops up. Plain old format() would have formatted each set of labels to the
length of the longest label plus 1, but if we want a neat table, we want all of
the labels to be the same length. Also notice that the fieldwidth has been
given the default value of 10, allowing the user to shrink or expand the
columns. dnn is given a default value if none was passed, and we're ready to go.
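The setup steps just described might look roughly like the sketch below. This is not the body of format.xtab(); the names basetab, row.counts, col.counts, row.labels and col.labels are assumptions made for the illustration.

```r
data(infert)
gt2 <- ifelse(infert$parity > 2, "Over 2", "2 or less")
basetab <- table(infert$education, gt2)

# Row and column sums with apply(): margin 1 is rows, margin 2 is columns
row.counts <- apply(basetab, 1, sum)
col.counts <- apply(basetab, 2, sum)
grand.total <- sum(basetab)

# format() sizes each set of labels by the longest label in that set,
# so row and column labels can end up with different widths
format(rownames(basetab))
format(colnames(basetab))

# formatC() pads every label to the same fixed width
fieldwidth <- 10
row.labels <- formatC(rownames(basetab), width = fieldwidth)
col.labels <- formatC(colnames(basetab), width = fieldwidth)
row.labels
col.labels
```

Changing fieldwidth shrinks or expands every column at once, which is why it makes sense as an argument with a default.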
First come the variable names (dnn) and the column names, then each of the
rows, starting with the cell counts and row counts, the cell row percentages
and the overall row percentages and then the cell column percentages. After
that come the column counts and grand total and the column percentages.
Finally, if a chi-square test was ordered by including the argument chisq=T,
the rather complicated bit at the bottom to print out the values of the
chi-square test will do its stuff. It would be simpler just to run the
chi-square test and let it print itself, but we would then get variables
labeled as v1 and v2, which might be confusing. You'll also notice when you run
this function that chisq.test() warns you that some of the cells have smaller
than recommended counts. You may wish to recode educational attainment to two
categories as an exercise.
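If you just want the numbers that Bronwyn and Hans are after, without the formatted layout, something like the following sketch should do. It uses prop.table() and chisq.test() from base R rather than format.xtab(), and the names tab, row.pct, col.pct and x2 are made up for the example.

```r
data(infert)
gt2 <- ifelse(infert$parity > 2, "Over 2", "2 or less")
tab <- table(infert$education, gt2, dnn = c("Education", "Parity"))

# Row percentages (within each education level) and column percentages
row.prop <- prop.table(tab, 1)
col.prop <- prop.table(tab, 2)
row.pct <- round(100 * row.prop, 1)
col.pct <- round(100 * col.prop, 1)
row.pct
col.pct

# Keep the test result and print only the pieces we want;
# chisq.test() may warn about small expected counts here
x2 <- chisq.test(tab)
cat("Chi-squared =", round(x2$statistic, 3),
    " df =", x2$parameter,
    " p =", format.pval(x2$p.value, digits = 3), "\n")
```

Picking the statistic, degrees of freedom and p-value out of the result is what lets you label the output however you like.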
format.xtab() is called by the xtab() function. If you pass it a two element
formula, it will act just like format.xtab().
> xtab(v1~v2,mydata)
If you ask for more than two dimensions, it will print out hierarchical counts
and percentages of all levels of variables starting at the last one in the
formula. When it gets to the first two, it will print out 2D contingency tables
for those variables. It gets silly pretty quickly. Both table() and ftable()
will also display multi-way crosstabulations.
This is also an introduction to the use of recursion, in which a function calls itself until whatever test you have set is satisfied. In this case, the function stops calling itself when there are at most two variables to be crosstabulated.
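To see the recursive idea on its own, here is a stripped-down sketch. The function nested.tab() is hypothetical and is not the xtab() function described above: it takes a data frame and a vector of variable names instead of a formula, and it prints plain tables without percentages.

```r
# Hypothetical helper: recursive cross-tabulation by variable name
nested.tab <- function(df, vars, prefix = "") {
  if (length(vars) <= 2) {
    # Stopping test: at most two variables left, so tabulate them
    if (nzchar(prefix)) cat(prefix, "\n")
    print(table(df[vars]))
    return(invisible(1))
  }
  # Otherwise peel off the last variable and call ourselves
  # once for each of its levels, on the matching rows only
  last <- vars[length(vars)]
  ntables <- 0
  for (lev in sort(unique(as.character(df[[last]])))) {
    rows <- as.character(df[[last]]) == lev
    ntables <- ntables +
      nested.tab(df[rows, , drop = FALSE], vars[-length(vars)],
                 paste0(prefix, last, " = ", lev, "  "))
  }
  # Return the number of two-way tables printed
  invisible(ntables)
}

data(infert)
infert$gt2 <- ifelse(infert$parity > 2, "Over 2", "2 or less")
n <- nested.tab(infert, c("education", "gt2", "induced"))
```

Each call either prints a table or hands a smaller problem back to itself, so the recursion is guaranteed to stop once only two variables remain.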
Get the nachos, you deserve it.
For more information, see Introduction to R: Frequency tables from factors