%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Chapter 05, Horton et al. using mosaic} %\VignettePackage{Sleuth2} \documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Statistical Sleuth in R: Chapter 5} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{The Statistical Sleuth in R: \\ Chapter 5} \author{ Kate Aloisio \and Ruobing Zhang \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(knitr) opts_chunk$set( dev="pdf", fig.path="figures/", fig.height=3, fig.width=4, out.width=".47\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA # turn off commenting of ouput (but perhaps we should not do this either ) @ <>= require(Sleuth2) require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice set.seed(123) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Second Edition of the \emph{Statistical Sleuth} (2002) by Fred Ramsey and Dan Schafer. More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.amherst.edu/~nhorton/sleuth}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks @ Once this is installed, it can be loaded by running the command: <>= require(mosaic) @ This needs to be done once per session. In addition the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth2} package. <>= install.packages('Sleuth2') # note the quotation marks @ <>= require(Sleuth2) @ We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme options(digits=3) @ The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 5: Comparisons Among Several Samples using R. \section{Diet and lifespan} Does restricting the diet of female mice lead to increased lifespan? This is the question addressed in case study 5.1 in the \emph{Sleuth}. \subsection{Summary statistics and graphical display} We begin by reading the data and summarizing the variables. <<>>= summary(case0501) favstats(Lifetime ~ Diet, data=case0501) @ There were a total of \Sexpr{nrow(case0501)} female mice. These mice were randomly assigned to one of \Sexpr{length(unique(case0501[, "Diet"]))} diets. Their lifetimes were then recorded, as shown in Display 5.2 (page 115 of the \emph{Sleuth}). <<>>= bwplot(Lifetime ~ Diet, data=case0501) # Display 5.1 @ <>= densityplot(~ Lifetime, groups=Diet, auto.key=TRUE, data=case0501) @ \subsection{One-way ANOVA} First we fit the one way analysis of variance (ANOVA) model, using all of the groups. <>= anova(lm(Lifetime ~ Diet, data=case0501)) @ There is evidence of a highly statistically significant difference between the diets. By default, the use of the linear model (regression) function displays the pairwise differences between the first group and each of the other groups. Note that the overall test of the model is the same. <<>>= summary(lm(Lifetime ~ Diet, data=case0501)) @ The reference group is \emph{NP}, followed by \emph{N/N85, lopro, N/R50, R/R50, N/R40}. \subsection{Pairwise comparisons} Next we used contrasts for the results on page 121, Display 5.7, and part {\bf(a)} on page 115: <>= require(gmodels) # N/N85 vs N/R50 fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(0, -1, 0, 1, 0, 0), conf.int=0.95) @ The results for {\bf(b)} on page 115-116: <>= # N/R50 vs R/R50 (b) fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(0, 0, 0, -1, 1, 0), conf.int=0.95) @ The results for {\bf(c)} on page 116: <>= # N/R40 vs N/R50 (c) fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(0, 0, 0, -1, 0, 1), conf.int=0.95) # N/N85 vs N/R40 fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(0, -1, 0, 0, 0, 1), conf.int=0.95) @ The results for {\bf(d)} on page 116: <>= # N/R50 vs N/R50 lopro (d) fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(0, 0, 1, -1, 0, 0), conf.int=0.95) @ The results for {\bf(e)} on page 116: <>= # N/N85 vs NP (e) fit.contrast(lm(Lifetime ~ Diet, data=case0501), "Diet", c(-1, 1, 0, 0, 0, 0), conf.int=0.95) @ Another way of viewing these results is through a model table, which displays the differences between the grand mean and the group means. <<>>= model.tables(aov(lm(Lifetime ~ Diet, data=case0501))) @ Another way of calculating the above results is done with the following code: <<>>= mean(Lifetime ~ Diet, data=case0501)-mean(~ Lifetime, data=case0501) @ \subsection{Other analyses} We will next demonstrate how to calculate the quantities on page 120 (Display 5.6). <<>>= df = length(case0501$Diet) - length(unique(case0501$Diet)); df sdvals = with(case0501, tapply(Lifetime, Diet, sd)); sdvals nvals = with(case0501, tapply(Lifetime, Diet, length)); nvals pooledsd = sum(sdvals*nvals)/sum(nvals); pooledsd @ Note that the pooled standard deviation reported in chapter 5 is not the same as the root MSE from the ANOVA. For the rest of this document we will use the ANOVA estimate of the root mean squared error. \subsection{Residual analysis and diagnostics} The residuals versus fitted graph does not demonstrate dramatic lack of fit (though some of the mice had very small residuals). The following figure is akin to Display 5.14 (page 132). <>= aov1 = aov(lm(Lifetime ~ Diet, data=case0501)) plot(aov1, which=1) @ The quantile plot of the residuals indicates that the normality assumption may be violated. <>= plot(aov1, which=2) plot(aov1, which=3) @ \section{Spock Conspiracy Trial} Did Dr. Benjamin Spock have a fair trial? More specifically, were women underrepresented on his jury pool? This is the question considered in case study 5.2 in the \emph{Sleuth}. \subsection{Summary statistics and graphical display} We begin by reading the data and summarizing the variables. <<>>= summary(case0502) case0502$Judge = with(case0502, as.factor(Judge)) favstats(Percent ~ Judge, data=case0502) @ There were a total of \Sexpr{nrow(case0502)} venires. They compared Spock's judge with \Sexpr{length(unique(subset(case0502, Judge!="Spock's")[, "Judge"]))} other judges. The precent of women within each venire was recorded as shown in Display 5.4 (page 117 of the \emph{Sleuth}). <<>>= bwplot(Percent ~ Judge, data=case0502) # Display 5.5 (page 118) @ <>= densityplot(~ Percent, groups=Judge, auto.key=TRUE, data=case0502) @ \subsection{One-way ANOVA} First we fit the one way analysis of variance (ANOVA) model, with all of the groups. These results are summarized on page 118 and shown in Display 5.10 (page 127). <<>>= aov1 = anova(lm(Percent ~ Judge, data=case0502)); aov1 @ By default, the use of the linear model (regression) function displays the pairwise differences between the first group and each of the other groups. Note that the overall test of the model is the same. <<>>= summary(lm(Percent ~ Judge, data=case0502)) @ <<>>= model.tables(aov(lm(Percent ~ Judge, data=case0502))) @ Then we can fit the one way analysis of variance $F$-test of whether the mean percentage is the same for judges A-F (page 118). <<>>= with(subset(case0502, Judge!="Spock's"), anova(lm(Percent ~ Judge))) @ \subsection{Additional analyses} Now we will demonstrate how to fit the reduced model comparing Spock's judge to a combination of the other judges. First we create a 2 level version of the grouping variable. <<>>= case0502$twoJudge = as.character(case0502$Judge) case0502$twoJudge[case0502$Judge!="Spock's"] = "notspock" tally(twoJudge ~ Judge, format="count", data=case0502) @ Recall that the book calculates the extra sum of squares as (2,190.90 - 1864.45)/(44-39)) / (1864.45 / 39) = 1.37, with df 5 and 39. P(F $\textgreater$ 1.366) = 0.26 (page 130). Below are the calculations for the results found on page 128. <<>>= numdf1 = aov1["Residuals", "Df"]; numdf1 # Within ss1 = aov1["Residuals", "Sum Sq"]; ss1 # Within aov2 = anova(lm(Percent ~ as.factor(twoJudge), data=case0502)); aov2 df2 = aov2["Residuals", "Df"]; df2 # Spock and others ss2 = aov2["Residuals", "Sum Sq"]; ss2 # Spock and others Fstat = ((ss2 - ss1)/(df2 - numdf1)) / (ss1 / numdf1); Fstat 1-pf(Fstat, length(levels(case0502$Judge))-2, numdf1) @ We can also compare the two models using ANOVA (Display 5.12, page 130). <<>>= anova(lm(Percent ~ as.factor(twoJudge), data=case0502), lm(Percent ~ as.factor(Judge), data=case0502)) @ There are some other ways to compare whether the other judges differ from Dr. Spock's judge in their female composition using contrasts. <<>>= # test all of the other judges vs. Spock's judge using a contrast page 118 fit.contrast(lm(Percent ~ Judge, data=case0502), "Judge", c(-6, 1, 1, 1, 1, 1, 1), conf.int=0.95) # calculate the 95% confidence interval for Dr. Spock's jury female composition page 118 estimable(lm(Percent ~ Judge, data=case0502), c(1,0,0,0,0,0,0), conf.int=0.95) @ \subsubsection{Kruskal-Wallis Nonparametric Analysis of Variance} For the results of the Kruskal-Wallis test on page 136 we can use the following code: <<>>= kruskal.test(Percent ~ Judge, data=case0502) @ \end{document}