--- title: "Linear Regression" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Linear Regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(dplyr) library(tidypredict) library(parsnip) ``` ## Highlights & Limitations - **Supports prediction intervals**, it uses the `qr.solve()` function to parse the interval coefficient of each term. - Supports categorical variables and interactions - Only *treatment* contrast (`contr.treatment`) are supported. - `offset` is supported - Categorical variables are supported - In-line functions in the formulas are **not supported**: - OK - `wt ~ mpg + am` - OK - `mutate(mtcars, newam = paste0(am))` and then `wt ~ mpg + newam` - Not OK - `wt ~ mpg + as.factor(am)` - Not OK - `wt ~ mpg + as.character(am)` ## How it works ```{r} library(dplyr) library(tidypredict) df <- mtcars %>% mutate(char_cyl = paste0("cyl", cyl)) %>% select(mpg, wt, char_cyl, am) model <- lm(mpg ~ wt + char_cyl, offset = am, data = df) ``` It returns a SQL query that contains the coefficients (`model$coefficients`) operated against the correct variable or categorical variable value. In most cases the resulting SQL is one short `CASE WHEN` statement per coefficient. It appends the `offset` field or value, if one is provided. ```{r} library(tidypredict) tidypredict_sql(model, dbplyr::simulate_mssql()) ``` Alternatively, use `tidypredict_to_column()` if the results are the be used or previewed in `dplyr`. ```{r} df %>% tidypredict_to_column(model) %>% head(10) ``` ## Prediction intervals Use `tidypredict_sql_interval()` to get the SQL query that operates the prediction interval. The `interval` defaults to 0.95 ```{r} tidypredict_sql_interval(model, dbplyr::simulate_mssql()) ``` Prediction intervals also works in the `tidypredict_to_column()`, just set the `add_interval` argument to `TRUE`. ```{r} df %>% tidypredict_to_column(model, add_interval = TRUE) %>% head(10) ``` ## Under the hood The parser reads several parts of the `lm` object to tabulate all of the needed variables. One entry per coefficient is added to the final table, those entries will have the results of `qr.solve()` already operated and placed in the correct column, they will have a `qr_` prefix. There will be one `qr_` column per coefficient. Other variables are added at the end. Some variables are not required for every parsed model. For example, `offset` is listed because it's part of the formula (call) of the model, if there were no offset in a given model, that line would not exist. ```{r} pm <- parse_model(model) str(pm, 2) ``` The output from `parse_model()` is transformed into a `dplyr`, a.k.a Tidy Eval, formula. All categorical variables are operated using `if_else()`. ```{r} tidypredict_fit(model) ``` A function to put together the Tidy Eval interval formula is also supported ```{r} tidypredict_interval(model) ``` From there, the Tidy Eval formula can be used anywhere where it can be operated. `tidypredict` provides three paths: - Use directly inside `dplyr`, `mutate(df, !! tidypredict_fit(model))` - Use `tidypredict_to_column(model)` to a piped command set - Use `tidypredict_to_sql(model)` to retrieve the SQL statement The same applies to the prediction interval functions. ## How it performs Testing the `tidypredict` results is easy. The `tidypredict_test()` function automatically uses the `lm` model object's data frame, to compare `tidypredict_fit()`, and `tidypredict_interval()` to the results given by `predict()` ```{r} tidypredict_test(model) ``` To run with prediction intervals set the `include_intervals` argument to `TRUE` ```{r} tidypredict_test(model, include_intervals = TRUE) ``` ## parsnip `tidypredict` also supports `lm()` model objects fitted via the `parsnip` package. ```{r} library(parsnip) parsnip_model <- linear_reg() %>% set_engine("lm") %>% fit(mpg ~ wt + cyl, offset = am, data = mtcars) tidypredict_fit(parsnip_model) ```