%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Chapter 13, Horton et al. using mosaic} %\VignettePackage{Sleuth2} \documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Statistical Sleuth in R: Chapter 13} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{The Statistical Sleuth in R: \\ Chapter 13} \author{ Kate Aloisio \and Ruobing Zhang \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(knitr) opts_chunk$set( dev="pdf", fig.path="figures/", fig.height=3, fig.width=4, out.width=".57\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA # turn off commenting of ouput (but perhaps we should not do this either ) @ <>= require(Sleuth2) require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice set.seed(123) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Second Edition of the \emph{Statistical Sleuth} (2002) by Fred Ramsey and Dan Schafer. More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.amherst.edu/~nhorton/sleuth}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks @ Once this is installed, it can be loaded by running the command: <>= require(mosaic) @ This needs to be done once per session. In addition the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth2} package. <>= install.packages('Sleuth2') # note the quotation marks @ <>= require(Sleuth2) @ We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice options(digits=4, show.signif.stars=FALSE) @ The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 13: The Analysis of Variance for Two-Way Classifications using R. \section{Intertidal seaweed grazers} This wicked complicated trial is a subset of a factorial design (6 of the possible 2 by 2 by 2 combination of factors) plus blocking. This randomized block design is analyzed in case study 13.1 in the \emph{Sleuth}. \subsection{Data coding, summary statistics and graphical display} We begin by reading the data, performing the necessary transformations and summarizing the variables. <<>>= # logit transformation case1301 = transform(case1301, logitcover = log(Cover/(100-Cover))) @ <<>>= summary(case1301) favstats(logitcover~Treat, data=case1301) @ There were a total of \Sexpr{nrow(case1301)} rock plots free of seaweed. These plots where split into \Sexpr{length(unique(case1301[,"Block"]))} blocks based on location. Each block contained \Sexpr{nrow(subset(case1301, Block=="B1"))} plots. Then \Sexpr{length(unique(case1301[,"Treat"]))} treatments were randomly assigned to plots within each block. Therefore there were two plots per treatment within each block, as shown in Display 13.2 (page 377 of the \emph{Sleuth}). We can check for evidence of nonadditivity using interaction plots. For a figure akin to Display 13.7 on page 383 we can use the following code: <>= with(case1301, interaction.plot(Block, Treat, Cover)) @ This figure shows evidence of nonadditivity. However as the authors note the type of nonadditivity seen in this figure may be removed by transformations. In addition, the residual plot from the saturated model (shown below and is akin to Display 13.8 on page 384) has a distinct funnel shape, also indicating a need for transformation. <>= plot(aov(Cover ~ Block*Treat, data=case1301), which=1) @ After the log transformation, we can then observe an interaction plot on the log transformed data akin to Display 13.9 on page 385. <>= with(case1301, interaction.plot(Block, Treat, logitcover)) @ \subsection{Models} Then we can create an ANOVA for the nonadditive model estimating the log of the seaweed regeneration ratio as summarized on page 385 (Display 13.10). <<>>= anova(lm(logitcover ~ Block*Treat, data=case1301)) @ This model has an $R^2$ of \Sexpr{round(summary(lm(logitcover ~ Block*Treat, data=case1301))$r.squared*100, 2)}$\%$, an adjusted $R^2$ of \Sexpr{round(summary(lm(logitcover ~ Block*Treat, data=case1301))$adj.r.squared*100, 2)}$\%$, and an estimated SD of \Sexpr{round(summary(lm(logitcover ~ Block*Treat, data=case1301))$sigma, 4)}. Notice that the interaction term has a large $p$-value, \Sexpr{round(anova(lm(logitcover ~ Block*Treat, data=case1301))["Block:Treat", "Pr(>F)"], 4)}, suggesting that the data may be more consistent with an additive model. We can then compare these results to an ANOVA for the additive model estimating the log of the seaweed regeneration ratio as shown in Display 13.11 on page 387. <<>>= anova(lm(logitcover ~ Block+Treat, data=case1301)) @ This model has an $R^2$ of \Sexpr{round(summary(lm(logitcover ~ Block+Treat, data=case1301))$r.squared*100, 2)}$\%$, an adjusted $R^2$ of \Sexpr{round(summary(lm(logitcover ~ Block+Treat, data=case1301))$adj.r.squared*100, 2)}$\%$, and an estimated SD of \Sexpr{round(summary(lm(logitcover ~ Block+Treat, data=case1301))$sigma, 4)}. Next we can assess the fit of the additive model through diagnostic plots. First we can check the linearity assumption. <>= plot(aov(logitcover ~ Block+Treat, data=case1301), which=1) @ From this plot is appears that the linearity assumption seems reasonable. We will need to assume independence based on the information given. Next we will assess the normality assumption for the additive model. <>= case1301$resid = residuals(aov(logitcover ~ Block+Treat, data=case1301)) histogram(~ resid, type='density', density=TRUE, data=case1301) @ From this figure normality seems reasonable as well. Now we can assess equality of variance. <>= plot(aov(logitcover ~ Block+Treat, data=case1301), which=3) @ From this figure, the assumption of equal variance seems to be somewhat problematic, as seen in the curvature of the lowess line. Lastly we can look for influential points and/or high leverage with the additive model. <>= plot(aov(logitcover ~ Block+Treat, data=case1301), which=4) @ From this figure we can obtain certain plots that appear to be influential points. <<>>= case1301[c(13, 22, 87),] @ \subsection{Linear combinations} First we can observe the Block and Treatment averages and the Block and Treatment effects from Display 13.12 (page 388). For the effects we used: <<>>= model.tables(aov(lm(logitcover ~ Block*Treat, data=case1301)), type="effects") @ For the means we changed the {\tt type} attribute to {\tt "means"}: <<>>= model.tables(aov(lm(logitcover ~ Block*Treat, data=case1301)), type="means") @ To answer specific questions of interest regarding subgroup comparisons we can use linear combinations. The \emph{Sleuth} proposes five questions as detailed on pages 289-390. The code for results of these questions is displayed below and these results are also interpreted on pages 389-390 and summarized in Display 13.13. For this model the reference group is \emph{control} followed by \emph{f, fF, L, Lf, LfF}. <<>>= require(gmodels) lm1 = lm(logitcover ~ Treat+Block, data=case1301); coef(lm1) large = rbind('Large fish' = c(0, -1/2, 1/2, 0, -1/2, 1/2)) small = rbind('Small fish' = c(-1/2, 1/2, 0, -1/2, 1/2, 0)) limpets = rbind('Limpets' = c(-1/3, -1/3, -1/3, 1/3, 1/3, 1/3)) limpetsSmall = rbind('Limpets X Small' = c(1, -1/2, -1/2, -1, 1/2, 1/2)) limpetsLarge = rbind('Limpets X Large' = c(0, 1, -1, 0, -1, 1)) fit.contrast(lm1, "Treat", large, conf.int=.95) fit.contrast(lm1, "Treat", small, conf.int=.95) fit.contrast(lm1, "Treat", limpets, conf.int=.95) fit.contrast(lm1, "Treat", limpetsSmall, conf.int=.95) fit.contrast(lm1, "Treat", limpetsLarge, conf.int=.95) @ To attain the confidence intervals discussed in the ``Summary of Statistical Findings" (page 376) we need to exponential the lower and upper bounds of the above 95$\%$ confidence intervals. Therefore, for the limpets estimation, the corresponding 95$\%$ confidence interval is (\Sexpr{round(exp(fit.contrast(lm1, "Treat", limpets, conf.int=.95)[,"lower CI"]), 3)}, \Sexpr{round(exp(fit.contrast(lm1, "Treat", limpets, conf.int=.95)[,"upper CI"]), 3)}). The resulting large fish 95$\%$ confidence interval is (\Sexpr{round(exp(fit.contrast(lm1, "Treat", large, conf.int=.95)[,"lower CI"]), 3)}, \Sexpr{round(exp(fit.contrast(lm1, "Treat", large, conf.int=.95)[,"upper CI"]), 3)}). Lastly for the estimation of the regeneration ratio for small fish the 95$\%$ confidence interval is (\Sexpr{round(exp(fit.contrast(lm1, "Treat", small, conf.int=.95)[,"lower CI"]), 3)}, \Sexpr{round(exp(fit.contrast(lm1, "Treat", small, conf.int=.95)[,"upper CI"]), 3)}). \section{Pygmalion effect} Does expected excellence affect performance? More specifically, does telling a manager that some of the supervisees are ``superior'' affect the supervisor's perception of their performance (Pygmalion effect)? This is the question addressed in case study 13.2 in the \emph{Sleuth}. \subsection{Statistical summary} We begin by reading the data and summarizing the variables. <<>>= summary(case1302) case1302$newTreat = relevel(case1302$Treat, ref="Control") @ There were a total of \Sexpr{nrow(case1302)} platoons. For each of the \Sexpr{length(unique(case1302[, "Company"]))} companies, one platoon received the Pygmalion treatment and two platoons were control, with the exception of one company that only had one control platoon. Therefore, there were \Sexpr{nrow(subset(case1302, newTreat=="Pygmalion"))} Pygmalion platoons and \Sexpr{nrow(subset(case1302, newTreat=="Control"))} control platoons. As shown in Display 13.3 (page 378 of the \emph{Sleuth}). \subsection{Graphical presentation} The following figure displays an interaction plot for the Pygmalion dataset, akin to Display 13.14 on page 392. <>= with(case1302, interaction.plot(Company, newTreat, Score)) @ \subsection{Two way ANOVA (fit using multiple linear regression model)} We can then use multiple linear regression models for the additive and nonadditive models and compare them using the two-way ANOVA. The following is similar to Display 13.16 (page 394). <<>>= lm1 = lm(Score ~ Company*newTreat, data=case1302); summary(lm1) lm2 = lm(Score ~ Company+newTreat, data=case1302); summary(lm2) # Display 13.18 page 395 anova(lm1) anova(lm2) anova(lm2, lm1) @ Lastly we can observe the residual plot from the fit of the additive model, akin to Display 13.17 on page 395. <>= plot(lm2, which=1) @ \subsection{Randomization Methods} As introduced in Chapter 4, we can construct a randomization distribution by considering the distribution of a test statistic over all possible ways the randomization could have turned out. For the Pygmalion data we can construct a randomization distribution for the $t$-statistic of the treatment effect as discussed on pages 397-398. <>= mod = lm(Score ~ Company+newTreat, data=case1302) obs = summary(mod)$coefficients["newTreatPygmalion", "t value"] obs nulldist = do(10000) * summary(lm(Score ~ shuffle(Company)+shuffle(newTreat), data=case1302))$coefficients["shuffle(newTreat)Pygmalion", "t value"] histogram(~ result, groups=result >= obs, v=obs, data=nulldist) # akin to Display 13.20 page 398 tally(~ result >= obs, format="proportion", data=nulldist) @ From this simulation we observed that the proportion of $t$-statistics that were as extreme or more extreme than our observed $t$-statistic (\Sexpr{round(obs, 3)}) is \Sexpr{tally(~ result >= obs, format="proportion", data=nulldist)["TRUE"]}. \end{document}