--- title: "Training models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Training models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, message=FALSE} library(LSTbook) ``` To "train a model" involves three components: 1. A data frame with training data 2. A model specification naming the response variable and the explanatory variables. This is formatted in the same tilde-expression manner as for `lm()` and `glm()`. 3. A model-fitting function that puts (1) and (2) together into a **model object**. Examples of model-fitting functions are `lm()` and `glm()`. In *Lessons in Statistical Thinking* and the corresponding `{LSTbook}` package, we almost always use `model_train()` Once the model object has been constructed, you can plot the model, create summaries such as regression reports or ANOVA reports, and evaluate the model for new inputs, etc. ## Using `model_train()` `model_train()` is a wrapper around some commonly used model-fitting functions from the `{stats}` package, particularly `lm()` and `glm()`. It's worth explaining motivation for introducing a new model-fitting function. 1. `model_train()` is pipeline ready. Example: `Galton |> model_train(height ~ mother)` 2. `model_train()` has internal logic to figure out automatically which type of model (e.g. linear, binomial, poisson) to fit. (You can also specify this with the `family=` argument.) The automatic nature of `model_train()` means, e.g., you can use it with neophyte students for logistic regression without having to introduce a new function. 3. `model_train()` saves a copy of the training data as an attribute of the model object being produced. This is helpful in plotting the model, cross-validation, etc., particularly when the model specification involves nonlinear explanatory terms (e.g., `splines::ns(mother, 3)`) ## Using a model object As examples, consider these two models: - modeling `height` of a (fully grown) child with the `sex` of the child, and the `mother`'s and `father`'s height. Linear regression is an appropriate technique here. ```{r} height_model <- mosaicData::Galton |> model_train(height ~ sex + mother + father) ``` - modeling the probability that a voter will vote in an election (`primary2006`) given the household size (`hhsize`), `yearofbirth` and whether the voter voted in a previous primary election (`primary2004`). Since having voted is a yes or no proposition, *logistic* regression is an appropriate technique. ```{r} vote_model <- Go_vote |> model_train(zero_one(primary2006, one = "voted") ~ yearofbirth * primary2004 * hhsize * yearofbirth ) ``` Note that the `zero_one()` marks the response variable as a candidate for logistic regression. The output of `model_train()` is in the format of whichever `{stats}` package function has been used, e.g. `lm()` or `glm()`. (The training data is stored as an "attribute," meaning that it is invisible.) Consequently, you can use the model object as an input to whatever model-plotting or summarizing function you like. In *Lessons in Statistical Thinking* we use `{LSTbook}` functions for plotting and summarizing: - `model_plot()` - `R2()` - `conf_interval()` - Late in *Lessons*, `regression_summary()` and `anova_summary()` Let's apply some of these to the modeling examples introduced above. 
Let's apply some of the others to the modeling examples introduced above.

```{r}
height_model |> model_plot()
height_model |> conf_interval()
vote_model |> model_plot()
vote_model |> R2()
```

The `model_eval()` function from this package allows you to provide inputs and receive the model output, with a prediction interval by default. (For logistic regression, only a confidence interval is available.)

```{r}
vote_model |> model_eval(yearofbirth = c(1960, 1980), primary2004 = "voted", hhsize = 4)
```
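The same pattern applies to the linear model, where the default output includes a prediction interval. A minimal sketch, mirroring the call style shown above (the input values here are arbitrary, chosen only for illustration):

```{r}
# Evaluate the height model at illustrative inputs; a prediction
# interval accompanies the output by default for linear regression.
height_model |> model_eval(sex = c("F", "M"), mother = 64, father = 68)
```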