%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Chapter 10, Horton et al. using mosaic} %\VignettePackage{Sleuth3} \documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Statistical Sleuth in R: Chapter 10} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{The Statistical Sleuth in R: \\ Chapter 10} \author{ Linda Loi \and Kate Aloisio \and Ruobing Zhang \and Nicholas J. Horton\thanks{Department of Mathematics and Statistics, Smith College, nhorton@smith.edu} } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= require(knitr) opts_chunk$set( dev="pdf", fig.path="figures/", fig.height=3, fig.width=4, out.width=".47\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA # turn off commenting of ouput (but perhaps we should not do this either ) @ <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(Sleuth3) require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice set.seed(123) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Third Edition of the \emph{Statistical Sleuth} (2013) by Fred Ramsey and Dan Schafer. More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.math.smith.edu/~nhorton/sleuth3}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks @ Once this is installed, it can be loaded by running the command: <>= require(mosaic) @ This needs to be done once per session. In addition the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth3} package. <>= install.packages('Sleuth3') # note the quotation marks @ <>= require(Sleuth3) @ We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice options(digits=3) @ The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 10: Inferential Tools for Multiple Regression using R. \section{Galileo's data on the motion of falling bodies} Galileo investigated the relationship between height and horizontal distance. This is the question addressed in case study 10.1 in the \emph{Sleuth}. \subsection{Data coding, summary statistics and graphical display} We begin by reading the data and summarizing the variables. <<>>= summary(case1001) favstats(~ Distance, data=case1001) @ There we a total of \Sexpr{nrow(case1001)} trials of Galileo's experiment. For each trial, he recorded the initial height and then measured the horizontal distance as shown in Display 10.1 (page 272). We can start to explore this relationship by creating a scatterplot of Galileo's horizontal distances versus initial heights. The following graph is akin to Display 10.2 (page 273). <<>>= xyplot(Distance ~ Height, data=case1001) @ \subsection{Models} The first model that we created is a cubic model as interpreted on page 273 and summarized in Display 10.13 (page 291). <<>>= lm1 = lm(Distance ~ Height+I(Height^2)+I(Height^3), data=case1001); summary(lm1) @ We next decrease the polynomial for \emph{Height} by one degree to obtain a quadratic model as interpreted on page 273 and summarized in Display 10.7 (page 281). This model is used for most of the following results. <<>>= lm2 = lm(Distance ~ Height+I(Height^2), data=case1001); summary(lm2) @ The following figure presents the predicted values from the quadratic model using the original data points akin to Display 10.2 (page 273). <<>>= case1001$pred = predict(lm2) xyplot(pred+Distance ~ Height, auto.key=TRUE, data=case1001) @ To obtain the expected values of $\hat{\mu}\left(\mathrm{Distance}|\mathrm{Height} = 0 \right)$ and $\hat{\mu}\left(\mathrm{Distance}|\mathrm{Height} = 250\right)$, we used the {\tt predict()} command with the quadratic model as shown in Display 10.7 (page 281). <<>>= predict(lm2, interval="confidence", data.frame(Height=c(0, 250))) @ We can also verify the above confidence interval calculations with the following code: <<>>= 355.1+c(-1, 1)*6.62*qt(.975, 4) @ To verify numbers on page 284, an interval for the predicted values , we used the following code: <<>>= predict(lm2, interval="predict", data.frame(Height=c(0, 250))) @ Lastly, we produced an ANOVA for the quadratic model interpreted on page 288 (Display 10.11). <<>>= anova(lm2) @ \section{Echolocation in bats} How do bats make their way about in the dark? Echolocation requires a lot of energy. Does it depend on mass and species? This is the question addressed in case study 10.2 in the \emph{Sleuth}. \subsection{Data coding, summary statistics and graphical display} We begin by reading the data, performing transformations where necessary and summarizing the variables. <<>>= case1002 = transform(case1002, Type = factor(Type, levels = c("non-echolocating bats","non-echolocating birds", "echolocating bats"))) case1002$logmass = log(case1002$Mass); case1002$logenergy = log(case1002$Energy) summary(case1002) favstats(Mass ~ Type, data=case1002) favstats(Energy ~ Type, data=case1002) @ A total of \Sexpr{nrow(case1002)} flying vertebrates were included in this study. There were \Sexpr{nrow(subset(case1002, Type=="echolocating bats"))} echolocating bats, \Sexpr{nrow(subset(case1002, Type=="non-echolocating bats"))} non-echolocating bats, and \Sexpr{nrow(subset(case1002, Type=="non-echolocating birds"))} non-echolocating birds. For each subject their \emph{mass} and \emph{flight energy expenditure} were recorded as shown in Display 10.3 (page 274). We can next observe the pattern between log(energy expenditure) as a function of log(body mass) for each group with a scatterplot. The following figure is akin to Display 10.4 (page 275). <>= xyplot(Energy ~ Mass, group=Type, scales=list(y=list(log=TRUE), x=list(log=TRUE)), auto.key=TRUE, data=case1002) @ \subsection{Multiple regression} We first evaluate a multiple regression model for log(energy expenditure) given type of species and log(body mass) as defined on page 276 and shown in Display 10.6 (page 277). <<>>= lm1 = lm(logenergy ~ logmass+Type, data=case1002); summary(lm1) @ Next, we calculate confidence intervals for the coefficients which are interpreted on page 278. <<>>= confint(lm1) exp(confint(lm1)) @ Since the significance of a model depends on which variables are included, the \emph{Sleuth} proposes two other models, one only looking at the type of flying animal and the other allows the three groups to have different straight-line regressions with \emph{mass}. These two models are displayed below and discussed on pages 278-279. <<>>= summary(lm(logenergy ~ Type, data=case1002)) summary(lm(logenergy ~ Type * logmass, data=case1002)) @ To construct the confidence bands discussed on page 282 and shown in Display 10.9 (page 283) we used the following code: <<>>= pred = predict(lm1, se.fit=TRUE, newdata=data.frame(Type=c("non-echolocating birds", "non-echolocating birds"), logmass=c(log(100), log(400)))) pred.fit = pred$fit[1]; pred.fit pred.se = pred$se.fit[1]; pred.se multiplier = sqrt(4*qf(.95, 4, 16)); multiplier lower = exp(pred.fit-pred.se*multiplier); lower upper = exp(pred.fit+pred.se*multiplier); upper # for the other reference points pred2 = predict(lm1, se.fit=TRUE, newdata=data.frame(Type=c("non-echolocating bats", "non-echolocating bats"), logmass=c(log(100), log(400)))) pred3 = predict(lm1, se.fit=TRUE, newdata=data.frame(Type=c("echolocating bats", "echolocating bats"), logmass=c(log(100), log(400)))) table10.9 = rbind(c("Intercept estimate", "Standard error"), round(cbind(pred2$fit, pred2$se.fit), 4), round(cbind(pred3$fit, pred3$se.fit), 4)); table10.9 @ Next we can assess the model by evaluating the extra sums of squares $F$-test for testing the equality of intercepts in the parallel regression lines model as shown in Display 10.10 (page 287). <<>>= lm2 = lm(logenergy ~ logmass, data=case1002) anova(lm2, lm1) @ We can also compare the full model with interaction terms and the reduced model (without interaction terms) with the extra sum of squares $F$-test as described in Display 10.12 (page 290). <<>>= lm3 = lm(logenergy ~ logmass*Type, data=case1002) anova(lm3, lm1) @ Another way to test the equality of the groups is by using linear combinations which we can attain using the {\tt estimable()} command as follows. These results can be found on page 276 and 289. <<>>= require(gmodels) estimable(lm1, c(0, 0, -1, 1)) @ \end{document}