---
title: "lm function"
date: today
date-format: long
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{lm function}
  %\VignetteEngine{quarto::pdf}
  %\VignetteEncoding{UTF-8}
---

OLS is by far the most important estimation method used in econometrics. It is implemented in **R** by the `lm` function. A deep understanding of the way this function is implemented, of its arguments, of its results and of how to use them is particularly useful, as other estimation functions (for example `glm` for generalized linear models) have a lot in common with `lm`. Moreover, if one wishes to implement one's own estimation function, it is advisable to reuse at least some of the elements of `lm`.

```{r }
#| echo: false
#| message: false
library(micsr)
random_group <- transform(random_group, wage = exp(lnwh))
```

For illustration, we use the `random_group` data set, which analyzes the effect of a training program on wages. The response is the hourly wage `wage` and the main covariate is a dummy for training. The `group` variable contains integers from -2 to 3, with negative values for treated individuals and positive values for untreated individuals. We first create this dummy, called `treated`:

```{r }
random_group <- transform(random_group, treated = ifelse(group < 0, 1, 0))
```

## A basic model

We then estimate the model, using as controls the number of children `child` and `age`.

```{r }
mod0 <- lm(log(wage) ~ treated + age + child, data = random_group)
```

The result is an object of class `lm`, with the following 12 elements:

```{r }
names(mod0)
```

We won't say much about `"qr"` and `"effects"`, which are technical elements of the least squares computation. `"coefficients"`, `"residuals"` and `"fitted.values"` need no further comment. The length of `"residuals"` and `"fitted.values"` is $N$, that of `"coefficients"` is $K+1$ (if there is an intercept). `"rank"` is the rank of the model matrix. If the matrix has full rank, the rank equals the number of columns of the model matrix ($K + 1$), which is also the number of fitted coefficients.^[Actually, the number of non-`NA` coefficients.]

```{r }
#| collapse: true
mod0$rank
length(mod0$coefficients)
```

`"df.residual"` is then $N - K - 1$. `"terms"` is a terms object, which is a kind of representation of the formula of the model:

```{r }
mod0$terms
```

`"xlevels"` is, for this model, a useless list of length 0. We'll see later the usefulness of this element. `"assign"` is a vector of integers indicating the relation between the variables and the coefficients:

```{r }
#| collapse: true
mod0$assign
```

Once again, in our simple case, it is useless: the intercept is at position 0, the first covariate at position 1 and so on.

All the elements of the model can be extracted using the `$` operator, for example `mod0$coefficients` for the coefficients. A safer method to extract elements of the model is to use generic functions with quasi-homonymous names:

```{r }
#| collapse: true
coef(mod0)
df.residual(mod0)
resid(mod0) |> head()
fitted(mod0) |> head()
```

```{r }
#| results: false
terms(mod0)
```

Several other generics extract information about the model that is not stored among the elements of the results, for example the covariance matrix of the estimates, the residual standard error and the number of observations:

```{r }
#| collapse: true
vcov(mod0)
sigma(mod0)
nobs(mod0)
```
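As an illustration of how these extractors combine, the usual table of coefficients (the one reported by `summary(mod0)`) can be recomputed by hand. The following chunk is just a sketch using the generics presented above:

```{r }
#| collapse: true
# standard errors: square roots of the diagonal of the covariance matrix
se <- sqrt(diag(vcov(mod0)))
# t statistics and two-sided p-values, based on N - K - 1 degrees of freedom
t_stat <- coef(mod0) / se
p_val <- 2 * pt(abs(t_stat), df = df.residual(mod0), lower.tail = FALSE)
cbind(estimate = coef(mod0), se, t_stat, p_val)
```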
The next two sections are devoted to the last two elements of the `lm` object, `"model"` and `"call"`.

## Model frame

The element `"model"` of the `lm` object is the model frame. It's a special data frame:

```{r }
head(mod0$model, 3)
```

Instead of using the `$` operator, the `model.frame` function can be used to extract the model frame:

```{r }
#| results: false
model.frame(mod0)
```

For this basic model, it looks like a subset of the columns of the original data frame, but note that the first column contains the response, which is `log(wage)` and not `wage`. A model frame has several attributes:

```{r }
#| collapse: true
names(attributes(mod0$model))
```

The only interesting attribute is `"terms"`, the others being the row and column names (as for ordinary data frames). From the model frame, the elements of the model can be extracted. For this basic model, there are two elements, the matrix of covariates $X$ and the vector of responses $y$. To extract the model matrix, we use `model.matrix` with two arguments, the model terms and the model frame:

```{r }
model.matrix(terms(mod0), model.frame(mod0)) |> head(3)
```

More simply, the same model matrix can be extracted using the fitted object as the only argument:

```{r }
#| results: hide
#| collapse: true
model.matrix(mod0)
```

The model matrix looks like the model frame, except that the response has been removed and a column of ones (called `(Intercept)`) has been added. Actually, there is a fundamental difference between the model frame and the model matrix, even in this very simple example. A model frame is a data frame, which is a list of vectors of the same length. Because all the vectors have the same length, it has a tabular representation and therefore looks like a matrix. A model matrix is a matrix, which is, for **R**, a vector with an attribute indicating its dimension, i.e., the number of rows and columns.

```{r }
dim(model.matrix(mod0))
```

Being a vector, a matrix contains only elements of the same mode, for example numeric. The model response is extracted using `model.response` with the model frame as its only argument:

```{r }
#| collapse: true
mod0 |> model.frame() |> model.response() |> head()
```

## Call and update

`"call"` is an object of class `"call"`, which corresponds to the call to the function that resulted in the `mod0` object:

```{r }
#| collapse: true
class(mod0$call)
mod0$call
```

Note that all the arguments are named, even though we didn't name the first argument (`formula`) while calling `lm`. The call is very useful as it makes it possible to update the model. The default `update` method is: `update(object, formula., ..., evaluate = TRUE)`. Its first argument is a fitted model and the second one is a formula. For example, to update `mod0` by adding the `female`, `single`, `migrant` and `temp` covariates (dummies for females, singles, migrants and temporary jobs), instead of rewriting the whole command:

```{r }
mod1 <- lm(log(wage) ~ treated + age + child + female + single + migrant + temp,
           data = random_group)
```

we can update `mod0` with a new formula:

```{r }
mod1 <- mod0 |> update(log(wage) ~ treated + age + child + female + single +
                           migrant + temp)
```

or, even more simply, we can use dots in the formula, which mean "as before" and make it possible to add or remove variables from the initial formula:

```{r }
mod1 <- update(mod0, . ~ . + female + single + migrant + temp)
```
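The dots also make it easy to remove covariates. Purely as an illustration, we can drop `child` from `mod1` and check the names of the remaining coefficients:

```{r }
#| collapse: true
# remove a covariate from the previous formula using the dots
mod1 |> update(. ~ . - child) |> coef() |> names()
```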
We have seen previously that the default `update` method has a `...` argument. This means that, in a call to `update`, we can use an argument of `lm`, which is then passed to `lm` while updating the model. As an example, `lm` has a `subset` argument, which is a logical expression that selects the subset of the sample on which the estimation is performed. For example, starting from `mod0`, if one wishes to add the four dummies as previously and also to select the individuals aged at least 20, we can use:

```{r }
#| eval: false
lm(log(wage) ~ treated + age + child + female + single + migrant + temp,
   data = random_group, subset = age >= 20)
```

Or, much more simply:

```{r }
#| eval: false
update(mod0, . ~ . + female + single + migrant + temp, subset = age >= 20)
```

## Missing values

When values of some of the variables used in the regression are missing for some observations, one has to indicate how to deal with this problem. The default behavior corresponds to the `na.action` element of `options()`, which is the list of global options:

```{r }
#| collapse: true
options()$na.action
```

`na.omit` means removing from the model frame all the lines for which there is at least one missing value. There are a couple of other possible values, one of them being `na.fail`, which stops the estimation and returns an error. `lm` has a `na.action` argument with which the desired action can be indicated. For example, if we update the model with `na.action` set to `"na.fail"`:

```{r }
#| results: false
mod1 |> update(na.action = "na.fail")
```

we get exactly the same results, as there are no missing values for the subset of variables we used. Now consider adding the `ten` covariate, which is the tenure in months and has some missing values. Setting `na.action = "na.fail"` as previously, we'll then get an error message:

```{r }
#| eval: false
mod1 |> update(. ~ . + ten, na.action = "na.fail")
```

Using the default `"na.omit"` value, we get:

```{r }
mod2 <- mod1 |> update(. ~ . + ten)
names(mod2)
```

and there is now a 13^th^ element named `na.action`:

```{r }
unname(mod2$na.action)
```

This is an object of class `"omit"` containing a vector of integers indicating the positions of the lines that have been removed because of missing values. We used `unname` because this is actually a named vector and, as in our case, when there are no row names, the "name" is the row number (so that the name of the first element of the vector, 61, is "61"). This object is also returned as an attribute of the model frame:

```{r }
#| collapse: true
mod2$model |> attributes() |> names()
```

## Factors

Now we introduce two further covariates, `fsize` (firm size) and `edu` (education). These covariates are **factors** (categorical variables) and their categories are called **levels**. The levels can be extracted using the `levels` function:

```{r }
#| collapse: true
levels(random_group$fsize)
levels(random_group$edu)
```

We introduce these two factors in the regression and we also introduce a quadratic term for `age`. This is done using the `poly` function inside the formula:

```{r }
mod3 <- lm(log(wage) ~ treated + poly(age, 2) + child + fsize + edu + female +
               single + migrant + temp + ten, data = random_group)
```

There is now a supplementary element called `"contrasts"`. But first consider the already existing `"assign"` and `"xlevels"` elements:

```{r }
mod3$assign
```

`"assign"` is now a vector of length 14. The first element is 0 (for the intercept); the 13 remaining ones indicate the link between the position of the coefficients and the position of the covariates in the formula:

```{r }
names(coef(mod3))
```

For example, the 3^rd^ and 4^th^ values of `"assign"` are 2, which is the position of the `age` covariate, now with two coefficients. The 8^th^ and 9^th^ values of `"assign"` are 5; they correspond to the two dummies introduced in the regression for the 5^th^ covariate in the formula, which is `edu`.
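The `"assign"` vector can therefore be used to retrieve all the coefficients associated with a given covariate. As a small illustration (this is just a convenience, not something `lm` does for us), here is how to extract the two `edu` coefficients:

```{r }
#| collapse: true
# coefficients whose position corresponds to the 5th covariate (edu)
coef(mod3)[mod3$assign == 5]
```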
Let's now have a look at the `"xlevels"` element, which was, in the previous example, a list of length 0:

```{r }
mod3$xlevels
```

This is now a named list containing character vectors indicating the different levels of the factors. Note that, for a factor with $J$ levels, only $J-1$ dummies are introduced in the regression, the one corresponding to the first level being omitted. The `contrasts` element is:

```{r }
mod3$contrasts
```

It's a named list indicating how the contrasts are computed from the factors. The default value is `"contr.treatment"` and consists, as we have seen previously, in creating $J-1$ dummies for all the levels of the factor but the first. Other values are possible and can be set individually for every factor using the `contrasts` argument of `lm`:

```{r }
mod4 <- update(mod3, contrasts = list(edu = "contr.sum", fsize = "contr.helmert"))
mod4$contrasts
```

The model frame now looks like:

```{r }
head(mod4$model, 3)
```

The `age` column has been replaced by an $N \times 2$ matrix containing the two terms of the polynomial. The two factors each have their own column. The model matrix $X$ is now:

```{r }
head(model.matrix(mod3), 3)
head(model.matrix(mod4), 3)
```

The fundamental difference between the model frame (a list) and the model matrix (a vector with a dimension) is even clearer than in the previous simple example. The model frame contains a column called `"edu"` which is a factor. In the model matrix, this column is replaced by $J-1=2$ numeric columns, and we can see that the numerical coding of the variable depends on the contrasts used.

## Weights and offset

Instead of minimizing the sum of squared residuals $\sum_n (y_n - \beta ^ \top x_n) ^ 2$, **weights** $w_n$ can be used so that the objective function becomes $\sum_n w_n (y_n - \beta ^ \top x_n) ^ 2$, leading to the weighted least squares estimator. Weights can be indicated in `lm` using the `weights` argument, which is the unquoted name of the column of the data frame that contains the weights. For the `random_group` data set, the variable that contains the sample weights is `samplew`:

```{r }
mod4 <- update(mod3, weights = samplew)
```

The `lm` object now has a 15^th^ element called `"weights"`, which contains the vector of weights of length $N$. The model frame is now:

```{r }
model.frame(mod4) |> head(3)
```

Note the last column, called `"(weights)"`, that contains the weights. The weights can be extracted from the model frame using `model.weights`:

```{r }
model.weights(model.frame(mod4)) |> head(3)
```

An **offset** is a variable added to the regression equation, but with no associated coefficient (or, equivalently, with a coefficient set to 1). For example, in the equation $y_n = \alpha + \beta_1 x_{n1} + \beta_2 x_{n2} + x_{n3} + \epsilon_n$, $x_{n3}$ is an offset. Note also that the same equation can be rewritten as $y_n - x_{n3} = \alpha + \beta_1 x_{n1} + \beta_2 x_{n2} + \epsilon_n$; in this case, the new response is $y_n - x_{n3}$ and there is no offset anymore.
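To make this equivalence concrete, here is a small sketch on the `random_group` data (the choice of `ten` as the offset variable and of the $0.01$ scaling is completely arbitrary and purely illustrative): adding a term as an offset or subtracting it from the response gives the same slope estimates.

```{r }
#| collapse: true
# the same equation, written with an offset or with a transformed response
m_off <- lm(log(wage) ~ treated + age + offset(0.01 * ten), data = random_group)
m_sub <- lm(log(wage) - 0.01 * ten ~ treated + age, data = random_group)
all.equal(coef(m_off), coef(m_sub))
```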
We have seen in `mod3` that the coefficients for the `"Intermediate"` and `"High"` education levels are respectively $0.23$ and $0.50$, which means approximately 25% and 50% higher wages compared to the reference level of this factor, which is `"Low"`. Coercing the three levels of the factor to $z = 1, 2, 3$ (which is actually the internal representation of the factor) and denoting $v = 3/4 + 1/4 z$, we get the values of 1, 1.25 and 1.50. Therefore, we could get approximately the same results as `mod3` by removing `edu` and adding $v$ as an offset. There are two equivalent ways to introduce an offset: in the formula, using `offset(v)`, or by setting the `offset` argument of `lm` to `v`:

```{r }
random_group <- transform(random_group, v = 3/4 + 1/4 * as.numeric(edu))
mod5 <- update(mod4, . ~ . + offset(v) - edu)
mod6 <- update(mod4, . ~ . - edu, offset = v)
```

The two fitted models are identical; they contain a supplementary element called `"offset"`. The offset can be extracted from the model frame using `model.offset`:

```{r }
model.offset(model.frame(mod5)) |> head(3)
```

## The internals

To conclude the presentation of the `lm` function, we'll describe the internals of the function. We first create a simple `mylm` function with the same arguments as `lm`, but which returns only the call, obtained using the function `match.call`:

```{r }
mylm <- function (formula, data, subset, weights, na.action, method = "qr",
                  model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
                  singular.ok = TRUE, contrasts = NULL, offset, ...){
    mf <- match.call()
    mf
}
```

We then use this function with our most advanced example:

```{r }
mf <- mylm(log(wage) ~ treated + poly(age, 2) + child + fsize + female +
               single + migrant + temp + ten,
           random_group, subset = age > 20, weights = samplew,
           contrasts = list(fsize = "contr.helmert"), offset = v)
mf
cl <- mf
```

We saved a copy of the result, as we'll need it later on. The next lines of `lm` work on the call (the `mf` object returned by our `mylm` function). The call being a list, the idea is first to keep only the arguments of the call that are useful to compute the model frame. This is done using the `match` function on the names of the call:

```{r }
names(mf)
m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
           names(mf), 0L)
m
```

`m` is an integer vector indicating the positions of the arguments we wish to keep in the call; note that the position of `na.action` is 0 because we didn't use this argument in our call to `mylm`. The first element of `mf` is unnamed; it is the name of the function (`mylm`). We then extract from `mf` its first element and those given by the vector `m`:

```{r }
mf <- mf[c(1L, m)]
mf
```

Then, we change the first element (the name of the function) from `mylm` to `model.frame`, using the `quote` function:

```{r }
mf[[1L]] <- quote(stats::model.frame)
mf
```

The call is then evaluated using the `eval` function and the result is the model frame:

```{r }
mf <- eval(mf)
head(mf, 3)
names(attributes(mf))
```

We then extract the terms and the different components of the model:

```{r }
mt <- attr(mf, "terms")
y <- model.response(mf)
w <- model.weights(mf)
offset <- model.offset(mf)
x <- model.matrix(mt, mf, list(fsize = "contr.helmert"))
```

Note that the model matrix is constructed using as third argument the `contrasts` argument of `lm`. The estimation is then performed by `lm.wfit` (if there were no weights, `lm.fit` would have been used), which takes as arguments the components of the model previously extracted:

```{r }
z <- lm.wfit(x, y, w, offset = offset)
names(z)
```

We can see that the resulting object contains a subset of the elements returned by `lm`.
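To see precisely which elements are still missing at this stage, we can, as a simple check, compare these names with those of one of the models previously fitted by `lm`:

```{r }
#| collapse: true
# elements of a complete lm object that lm.wfit does not return
setdiff(names(mod4), names(z))
```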
We then assign the `"lm"` class to the object:

```{r }
class(z) <- "lm"
```

And we add to `z` the elements that are not returned by `lm.wfit`:

```{r }
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
z$model <- mf
```

`na.action` is an attribute of the model frame, `contrasts` is an attribute of the model matrix and `xlevels` is obtained using the `.getXlevels` function with the terms and the model frame as arguments. We then get our `lm` object that contains the fitted model:

```{r }
z
```

Finally, we include in our `mylm` function all the lines we have just described:

```{r }
mylm <- function (formula, data, subset, weights, na.action, method = "qr",
                  model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
                  singular.ok = TRUE, contrasts = NULL, offset, ...){
    # save the full call, to be stored in the fitted object
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    # keep only the arguments needed to compute the model frame
    m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
               names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    # extract the components of the model
    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    offset <- model.offset(mf)
    x <- model.matrix(mt, mf, contrasts)
    # weighted least squares fit
    z <- lm.wfit(x, y, w, offset = offset)
    class(z) <- "lm"
    # add the elements that lm.wfit doesn't return
    z$na.action <- attr(mf, "na.action")
    z$offset <- offset
    z$contrasts <- attr(x, "contrasts")
    z$xlevels <- .getXlevels(mt, mf)
    z$call <- cl
    z$terms <- mt
    z$model <- mf
    z
}
```

to get a simplified clone of `lm`:

```{r }
mod6 <- mylm(log(wage) ~ treated + poly(age, 2) + child + fsize + female +
                 single + migrant + temp + ten,
             random_group, subset = age > 20, weights = samplew,
             contrasts = list(fsize = "contr.helmert"), offset = v)
```
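As a final, informal check, we can verify that this simplified clone gives the same coefficients as `lm` itself (the object name `mod6_lm` is ours, used only for this comparison):

```{r }
#| collapse: true
# fit the same specification with lm and compare the coefficients
mod6_lm <- lm(log(wage) ~ treated + poly(age, 2) + child + fsize + female +
                  single + migrant + temp + ten,
              random_group, subset = age > 20, weights = samplew,
              contrasts = list(fsize = "contr.helmert"), offset = v)
all.equal(coef(mod6), coef(mod6_lm))
```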