This vignette explains how to provide choice data for RprobitB, either by preparing empirical data with prepare() or by simulating data with simulate().
As a first step, we recommend specifying the model formula.
The model formula is specified using a formula object, let’s call it form.
The structure of form is choice ~ A | B | C, where
choice is the discrete choice we aim to explain,
A are alternative and choice situation specific covariates with a generic coefficient (we call them covariates of type 1),
B are choice situation specific covariates with alternative specific coefficients (we call them covariates of type 2),
and C are alternative and choice situation specific covariates with alternative specific coefficients (we call them covariates of type 3).
Keep the following rules in mind:
By default, alternative specific constants are added to the model [1]. They can be removed by adding + 0 in the second spot, e.g. choice ~ A | B + 0.
To exclude covariates of the trailing categories, either use 0, e.g. choice ~ A | B | 0, or just leave this part out and write choice ~ A | B. However, to exclude covariates of the leading categories, we have to use 0, e.g. choice ~ A | 0 | C.
To include more than one covariate of the same category, use +, e.g. choice ~ A1 + A2 | B.
If we don’t want to include any covariates of the second category but we want to estimate alternative specific constants, add 1 in the second spot, e.g. choice ~ A | 1. The expression choice ~ A | 0 is interpreted as no covariates of the second category and no alternative specific constants.
To have random effects for specific variables, we need to define a character vector re of the corresponding variable names. To have random effects for the alternative specific constants, include "ASC" in re.
Say we want to explain the choice of transportation means by the variables cost, income, and travel_time. We furthermore want to add alternative specific constants.
The cost of an alternative is obviously alternative specific. However, we can argue that it does not matter for which alternative we spend our money. Therefore, we want to estimate a generic coefficient for cost, which makes it a covariate of type 1.
The income of a decision maker is constant across alternatives, but can have a different influence on the alternatives. It is therefore a covariate of type 2.
The travel_time is a covariate of type 3: it is alternative specific, but in contrast to cost, we can imagine that spending time on public transportation is different from spending time in one’s own car.
Therefore, we specify:
form = choice ~ cost | income | travel_time
We would typically expect heterogeneity in preferences regarding spending money on a means of transportation. Therefore, we impose a random effect on cost:
re = "cost"
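To see how the three parts of form line up with the covariate types, the formula can be taken apart with base R alone. The following is a minimal sketch, not the parsing that RprobitB performs internally; the helper split_parts is a hypothetical name introduced for illustration.

```r
form <- choice ~ cost | income | travel_time

# Hypothetical helper: recursively split an expression at its `|` operators.
split_parts <- function(expr) {
  if (is.call(expr) && identical(expr[[1]], as.name("|"))) {
    c(split_parts(expr[[2]]), list(expr[[3]]))
  } else {
    list(expr)
  }
}

# form[[2]] is the left-hand side, form[[3]] the right-hand side.
response <- all.vars(form[[2]])
parts <- split_parts(form[[3]])

# Variable names per part; a part that is just 0 or 1 yields no names.
covariates_by_type <- lapply(parts, all.vars)
names(covariates_by_type) <- paste("type", seq_along(parts))
print(covariates_by_type)
```

In this sketch, covariates_by_type lists cost under type 1, income under type 2 and travel_time under type 3.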
This section explains how to prepare empirical data for estimation using the function prepare().
Say we have a data set with empirical choice data, let’s call it choice_data. It must meet the following requirements:
It must be a data frame.
It must be in wide format, that is, each row represents one choice occasion.
It must contain a column named id, which contains a unique identifier for each decision maker.
It must contain a column named choice, where choice must match the name of the dependent variable in form.
For each alternative specific covariate p in form and each choice alternative j, choice_data must contain a column named p_j.
For each covariate q that is constant across alternatives (covariate of type 2), choice_data must contain a column named q.
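The requirements above can be checked by hand before calling prepare(). The sketch below builds a small made-up data set for two alternatives and verifies that all expected columns exist. The values and the helper variables (alt_specific, alt_constant, required) are assumptions for illustration only and are not part of RprobitB.

```r
# Toy data in wide format: one row per choice occasion (all values made up).
choice_data <- data.frame(
  id              = c(1, 1, 2, 2),
  choice          = c("car", "bus", "bus", "car"),
  cost_car        = c(4.0, 3.5, 5.0, 4.2),
  cost_bus        = c(2.0, 2.5, 1.8, 2.2),
  income          = c(3000, 3000, 5000, 5000),
  travel_time_car = c(20, 25, 30, 28),
  travel_time_bus = c(35, 40, 45, 38)
)

alternatives <- c("car", "bus")
alt_specific <- c("cost", "travel_time")  # covariates of type 1 and 3
alt_constant <- "income"                  # covariates of type 2

# Expected columns: id, choice, every type 2 covariate q,
# and p_j for every alternative specific covariate p and alternative j.
required <- c(
  "id", "choice", alt_constant,
  as.vector(outer(alt_specific, alternatives, paste, sep = "_"))
)
missing_cols <- setdiff(required, names(choice_data))
print(missing_cols)
```

An empty missing_cols means the column naming matches the convention described above.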
To prepare choice_data for estimation, we must call
data = prepare(form = form, choice_data = choice_data)
The function prepare() has the following optional arguments:
alternatives: We may not want to consider all alternatives in choice_data. In that case, we can specify a character vector alternatives with selected names of alternatives.
re: The character vector of variable names of form with random effects.
id: A character, the name of the column in choice_data that contains a unique identifier for each decision maker. The default is "id".
standardize: A character vector of variable names of form that get standardized, see below.
Let’s prepare the Train data set of the mlogit package for estimation. We consider the covariates price (type 1), time, comfort and change (each of type 3), where we link price and time to random effects [2].
data("Train", package = "mlogit")
data = prepare(form = choice ~ price | 0 | time + comfort + change,
               choice_data = Train,
               re = c("price", "time"))
This section explains how to simulate choice data using the function simulate().
If we want to simulate the choices of N decision makers in T choice occasions [3] among J alternatives from our model formulation form, we have to call
data = simulate(form = form, N = N, T = T, J = J)
The function simulate() has the following optional arguments:
re: The character vector of variable names of form with random effects.
alternatives: A character vector of length J with the names of the choice alternatives.
distr: A named list of number generation functions from which the covariates are drawn. Each element of distr must be of the form "cov" = list("name" = "<name of the number generation function>", ...), where cov is the name of the covariate [4] and ... are required parameters for the number generation function. Covariates for which no distribution is specified are drawn from a standard normal distribution. Possible number generation functions are
functions of the type r* from base R (e.g. rnorm), where all required parameters (except for n) must be specified,
the function sample, where all required parameters (except for size) must be specified.
standardize: A character vector of variable names of form that get standardized, see below.
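To illustrate the structure of distr, the following base R sketch turns each element into an actual call of the named number generation function. It only mimics the convention described above and is not RprobitB’s implementation; draw_covariate is a hypothetical helper.

```r
# Hypothetical helper: draw n values according to one element of distr.
draw_covariate <- function(spec, n) {
  fun  <- spec[["name"]]               # name of the generation function
  args <- spec[names(spec) != "name"]  # remaining user-supplied parameters
  # r* functions take the number of draws as `n`, sample() takes it as `size`.
  size_arg <- if (fun == "sample") "size" else "n"
  do.call(fun, c(args, setNames(list(n), size_arg)))
}

set.seed(1)
distr <- list(
  "cost"   = list("name" = "rnorm", sd = 3),
  "income" = list("name" = "sample", x = (1:10) * 1e3, replace = TRUE)
)

# Five draws per covariate.
draws <- lapply(distr, draw_covariate, n = 5)
str(draws)
```

This also shows why n and size are excluded from the parameters we specify: they are determined by the number of values that need to be drawn.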
We can specify true parameter values by adding values for
alpha, the fixed coefficient vector,
C, the number (greater than or equal to 1) of latent classes of decision makers,
s, the vector of class weights,
b, the matrix of class means as columns,
Omega, the matrix of class covariance matrices as columns,
Sigma, the differenced error term covariance matrix,
Sigma_full, the full error term covariance matrix.
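As an illustration of the shapes involved, the sketch below builds s, b and Omega for a hypothetical model with two random effects and C = 2 latent classes. All numbers are made up, and storing each covariance matrix column-wise via as.vector() is an assumption about the layout, not something stated above.

```r
C <- 2

# Class weights: one entry per class, summing to 1.
s <- c(0.7, 0.3)

# Class means: one column per class, one row per random effect.
b <- cbind(c(-1, 0.5), c(1, -0.5))

# Class covariance matrices (symmetric, positive definite).
Omega_1 <- matrix(c(1, 0.3, 0.3, 2), nrow = 2, ncol = 2)
Omega_2 <- diag(2)

# Assumed layout: each covariance matrix is flattened into one column.
Omega <- cbind(as.vector(Omega_1), as.vector(Omega_2))

print(dim(b))
print(dim(Omega))
```

With P random effects, b has P rows and Omega has P^2 rows under this layout, each with C columns.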
We revisit our example of the simulated choice of transportation means, where we already specified:
form = choice ~ cost | income | travel_time
re = "cost"
Let us now simulate the choices of N = 100 decision makers in T = 10 choice occasions on the J = 3 alternatives “car”, “bus” and “train”. We want C = 2 true latent classes and specific distributions [5] for our covariates:
N = 100
T = 10
J = 3
alternatives = c("car", "bus", "train")
distr = list("cost" = list("name" = "rnorm", sd = 3),
             "income" = list("name" = "sample", x = (1:10)*1e3, replace = TRUE),
             "travel_time_car" = list("name" = "rlnorm", meanlog = 1),
             "travel_time_bus" = list("name" = "rlnorm", meanlog = 2))
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, distr = distr, C = 2)
Both simulate() and prepare() have the optional input standardize, which is a character vector of names of covariates that get standardized, i.e. rescaled to mean 0 and standard deviation 1.
Covariates of type 1 or 3 have to be addressed by covariate_alternative, e.g. travel_time_car.
If standardize = "all", all covariates get standardized.
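Standardizing a covariate amounts to subtracting its mean and dividing by its standard deviation. The following base R sketch shows the computation on made-up income values; whether RprobitB divides by the sample standard deviation, as sd() does, is an assumption here.

```r
# Made-up income values for five choice occasions.
income <- c(1000, 3000, 5000, 7000, 9000)

# Standardize: center at the mean, scale by the sample standard deviation.
income_std <- (income - mean(income)) / sd(income)

# The result has mean 0 and standard deviation 1 (up to rounding error).
print(mean(income_std))
print(sd(income_std))
```

For a covariate of type 1 or 3, the same computation would be applied separately to each column, e.g. to travel_time_car and travel_time_bus.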
In our example of the simulated choice of transportation means, scaling income is reasonable and can improve model fitting. For demonstration purposes, we also standardize travel_time for each alternative:
standardize = c("income", "travel_time_car", "travel_time_bus",
                "travel_time_train")
# parm: list of true parameter values, assumed to be defined beforehand
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, parm = parm, distr = distr,
                standardize = standardize)
We can check whether the data preparation or simulation worked as expected using the summary() function. The columns z and re indicate standardized and random effect covariates, respectively. The rest of the output is self-explanatory.
summary(data)
[1] Alternative specific constants can be interpreted as covariates of type 2. Due to the dummy variable trap, we cannot estimate alternative specific constants for all the alternatives. Therefore, they are added for all except for the last alternative.
[2] Note that alternative specific constants are excluded here.
[3] T can be either a positive number, representing a fixed number of choice occasions for each decision maker, or a vector of length N, i.e. a decision maker specific number of choice occasions.
[4] For a covariate cov of type 1 or 3, you can name the element of distr either cov (to draw the covariate for all alternatives from the same distribution) or cov_alternative (to draw the covariate for a specific alternative from a specific distribution).
[5] Note that the cost covariate for all alternatives is drawn from the same distribution. Also note that since we did not specify a distribution for travel_time_train, this covariate is drawn from a standard normal distribution.