%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Chapter 11, Horton et al. using mosaic} %\VignettePackage{Sleuth2} \documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Statistical Sleuth in R: Chapter 11} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{The Statistical Sleuth in R: \\ Chapter 11} \author{ Kate Aloisio\and Ruobing Zhang \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(knitr) opts_chunk$set( dev="pdf", fig.path="figures/", fig.height=3, fig.width=5, out.width=".72\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA # turn off commenting of ouput (but perhaps we should not do this either ) @ <>= require(Sleuth2) require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme set.seed(123) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Second Edition of the \emph{Statistical Sleuth} (2002) by Fred Ramsey and Dan Schafer. More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.amherst.edu/~nhorton/sleuth}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks @ Once this is installed, it can be loaded by running the command: <>= require(mosaic) @ This needs to be done once per session. In addition the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth2} package. <>= install.packages('Sleuth2') # note the quotation marks @ <>= require(Sleuth2) @ We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice options(digits=3, show.signif.stars=FALSE) @ The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 11: Model Checking and Refinement using R. \section{Alcohol metabolism in men and women} How do men and women metabolise alcohol? This is the question addressed in case study 11.1 in the \emph{Sleuth}. \subsection{Data coding, summary statistics and graphical display} We begin by reading the data and summarizing the variables. <<>>= summary(case1101) @ A total of \Sexpr{nrow(case1101)} volunteers were included in this data. There were \Sexpr{nrow(subset(case1101, Sex=="Female"))} females and \Sexpr{nrow(subset(case1101, Sex=="Male"))} males. As recorded in Display 9.2 (page 237 of the \emph{Sleuth}). The following is a graphical display of the variables akin to Display 11.2 (page 306). <>= xyplot(Metabol ~ Gastric | Sex+Alcohol, data=case1101, auto.key=TRUE, xlab="Gastric AD activity (mu mol/min/g of tissue)", ylab="first pass metabolism (mmol/liter-hour)") @ \subsection{Multiple regression} First we can fit a full model for estimating \emph{metabolism} given a subjects \emph{gastric AD activity}, whether they are \emph{alcoholic} and \emph{gender}. This first model is summarized on page 315 (Display 11.9). <>= case1101 = transform(case1101, Sex = factor(Sex, levels = c("Male", "Female"))) case1101 = transform(case1101, Alcohol = factor(Alcohol, levels = c("Non-alcoholic", "Alcoholic"))) lm1 = lm(Metabol ~ Gastric+Sex+Alcohol+Gastric*Sex+Sex*Alcohol+ Gastric*Alcohol+Gastric*Sex*Alcohol, data=case1101); summary(lm1) @ Next we can calculate a number of model diagnostics, including leverage, studentized resids and Cook's distance (pages 319--320). <>= require(MASS) @ <<>>= case1101 = transform(case1101, hat = hatvalues(lm1)) case1101 = transform(case1101, studres = studres(lm1)) case1101 = transform(case1101, cooks = cooks.distance(lm1)) case1101[31,] @ The following is a residual plot for the full model akin to Display 11.7 (page 313). <<>>= xyplot(residuals(lm1) ~ fitted(lm1), xlab="Fitted values", ylab="Residuals", type=c("p", "r", "smooth")) @ From these diagnostics it appears that observations 31 and 32 may be influential points. Therefore, we next re-fit the full model excluding these two observations. The following results are found in Display 11.9 and discussed on page 315. <<>>= case11012 = case1101[-c(31, 32),] lm2 = lm(Metabol ~ Gastric+Sex+Alcohol+Gastric*Sex+Sex*Alcohol+ Gastric*Alcohol+Gastric*Sex*Alcohol, data=case11012); summary(lm2) @ \subsection{Refining the Model} This section addresses the process of refining the model. We first tested the lack of fit for the removal of {\tt Alcohol} as shown in Display 11.13 (page 322). <>= lm3 = lm(Metabol ~ Gastric+Sex+Gastric*Sex, data=case11012); summary(lm3) anova(lm3, lm2) # page 322 @ Next we assessed a model without an intercept which is scientifically plausible as summarized in Display 11.14 (page 323). <>= lm4 = lm(Metabol ~ Gastric+Gastric:Sex -1 , data=case11012); summary(lm4) anova(lm4, lm3) @ Note that the ``Summary of Statistical Findings" section (page 306) is based on this final model. \section{Blood brain barrier} Neuroscientists working to better understand the blood brain barrier have infused rats with cells to induce brain tumors. This is the topic addressed in case study 11.2 in the \emph{Sleuth}. \subsection{Data coding and summary statistics} We begin by reading the data, performing transformations where needed and summarizing the variables. <>= case1102 = transform(case1102, Y = Brain/Liver) case1102 = transform(case1102, logliver = log(Liver)) case1102 = transform(case1102, logbrain = log(Brain)) case1102 = transform(case1102, SAC = as.factor(Time)) case1102 = transform(case1102, logy = log(Brain/Liver)) case1102 = transform(case1102, logtime = log(Time)) case1102 = transform(case1102, Treat = relevel(Treat, ref="NS")) summary(case1102) @ A total of \Sexpr{nrow(case1102)} rats were included in this experiment. Each rat was given either the barrier solution (n = \Sexpr{nrow(subset(case1102, Treat=="BD"))}) or a normal saline solution (n = \Sexpr{nrow(subset(case1102, Treat=="NS"))}). Then variables of interest were calculated and are displayed in Display 11.4 (page 308 of the \emph{Sleuth}). We can graphically relationships between the variables using a pairs plot. <>= smallds = case1102[,c("logy", "logbrain","logliver","Treat", "SAC")] pairs(smallds) @ \subsection{Graphical presentation} The following displays a scatterplot of log ratio (Y) as a function of log time, akin to Display 11.5 on page 309. <<>>= xyplot(Y ~ Time, group=Treat, scales=list(y=list(log=TRUE), x=list(log=TRUE)), auto.key=TRUE, data=case1102) @ The following graphs are akin to the second and third plots in Display 11.16 on page 326. <<>>= case1102 = transform(case1102, female = ifelse(Sex=="F", 1, 0)) xyplot(logy ~ jitter(female), xlab="Sex", type=c("p", "r", "smooth"), data=case1102) @ <<>>= xyplot(logy ~ jitter(Days), type=c("p", "r", "smooth"), data=case1102) @ \subsection{Multiple regression} We first fit a model that reflects the initial investigation. This is the proposed model from page 311. <>= lm1 = lm(logy ~ SAC+Treat+SAC*Treat+Days+Sex+ Weight+Loss+Tumor, data=case1102); summary(lm1) @ We can then display a residual plot to assess the fit of the above model. This is provided in Display 11.6 (page 312). <<>>= xyplot(residuals(lm1) ~ fitted(lm1), xlab="Fitted values", ylab="Residuals", type=c("p", "r", "smooth")) @ \subsection{Refining the model} Lastly, we fit a refined model. These results can be found in Display 11.17 (page 327). <>= lm2 = lm(logy ~ SAC+Treat, data=case1102); summary(lm2) anova(lm2, lm1) @ \end{document}