--- title: "maxent.ot Vignette" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{maxent.ot Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning=FALSE ) ``` ```{r setup} # Import packages library(maxent.ot) ``` # 1 Overview ## 1.1 Why use the Maximum Entropy (MaxEnt) model? Optimality Theory (OT) (Prince & Smolensky 1993) is one of the leading theories in phonology, and is a very successful model of categorical language data. In OT, ranked constraints pick a single winning surface representation (SR) for each underlying representation (UR). Nevertheless, a large proportion of language data is probabilistic. Moreover, psycholinguistic experiments (*e.g.* artificial language learning) often produce probabilistic results too. Thus, multiple models (*e.g.* Stochastic OT (Boersma 1997), Partial Ordering OT (Antilla 1997), Harmonic Grammar (Legendre, Miyata & Smolensky 1990), among others) have been put forward to augment the basic structure of OT, in order to model probabilistic data. MaxEnt (Goldwater & Johnson 2003) is a one such model. In MaxEnt, constraint weights are adjusted in order to find the weights that maximize the probability of the observed data. One benefit of MaxEnt over other probabilistic models is that the mathematics behind it are well understood and convergence proofs for learning are available (Berger *et al.* 1996). The `maxent.ot` package implements the MaxEnt model. There are three functions in this package: * `optimize_weights()`: Optimizes weights given a data set & optional biases. * `predict_probabilities()`: Predict probabilities of candidates based on their violation profiles & constraint weights. * `compare_models()`: Compares two or more model fits to the same data set to determine which provides the best fit, using variety of methods. ## 1.2 OT: A quick recap We now present a quick recap of OT in order to illustrate how OT and MaxEnt differ in choosing among outputs ([eval]{.smallcaps}) given the same data set (same [con]{.smallcaps} \& [gen]{.smallcaps}). In OT, ranked constraints choose one winning SR candidate per UR. Consider the OT tableaux below (*Figure 1*). ```{r, echo=FALSE, fig.cap="Fig 1: Two OT tableaux.", out.width = '40%'} knitr::include_graphics("../man/figures/OT_tableau_simple.jpg") ``` For the UR /ur-1/, Con1 prefers [sr1-1] while Con2 prefers [sr1-2]. Since Con1 outranks Con2, [sr1-1] is the sole winner, and is predicted to be the only SR for this particular UR (*i.e.* a categorical outcome). By the very same logic, the SR [sr2-1] is the sole realization of the UR /ur-2/ (*i.e.* Con1 prefers [sr2-1] to [sr2-2]). Notice that Con2 plays no role in deciding the outcome (hence the shading). ## 1.3 The MaxEnt model defined In contrast, MaxEnt produces probabilistic outcomes. In MaxEnt, constraints have weights (rather than being ranked with respect to each other). The predicted probability of an SR candidate given its UR depends on the weights of the constraints, and is calculated according to the following formula: $$P(SR_i|UR_i; w) = \frac{1}{Z_w(UR_i)} exp(- \Sigma_{k=1}^mw_k f_k(SR_i, UR_i))$$ where * $m$ is the number of constraints * $w_k$ is the weight of constraint $k$ * $f_k(SR_i, UR_i)$ is the number of violations of constraint $k$ incurred by UR-SR pair <$UR_i, SR_i$> * $Z_w(UR_i)$ is a normalization term defined as $$Z(UR_i) = \Sigma_{SR \in \mathcal{S}(UR_i)} exp(- \Sigma_{k=1}^m w_k f_k(SR, UR_i))$$ where $\mathcal{S}(UR_i)$ is the set of surface realizations of input $UR_i$. ## 1.4 Calculating MaxEnt probabilities: Step-by-step We'll walk through the MaxEnt tableaux in *Figure 2* to illustrate how the equations in §1.3 are applied to calculate the probability of a particular UR-SR pair. ```{r, echo=FALSE, fig.cap="Fig 2: Two MaxEnt tableaux.", out.width = '85%'} knitr::include_graphics("../man/figures/maxEnt_tableau.jpg") ``` Notice that these two MaxEnt tableaux have the same URs, SRs, constraints, and constraint violation profiles as the two OT tableaux above. However, instead of being ranked, these constraints now have weights (*e.g.* Con1 weight = 1.6; Con2 weight = 0.8). There are also four additional columns tagged on to the right of the tableaux, which we use to illustrate how the predicted probabilities are computed. ### 1.4.1 Number of violations, $f_k(SR_i, UR_i)$ An important variable that shows up in the equations above is: $f_k(SR_i, UR_i)$. This refers to the number of violations of constraint $k$ incurred by the UR-SR pair, <$UR_i, SR_i$>. For example, the UR-SR pair $<$/ur-2/, [sr2-2]$>$ picks up two violations of Con2, so $f_{Con2}(\text{sr2-2}, \text{ur-2}) = 2$ (yellow box). ### 1.4.2 Penalty Next, we move on to the penalty. Penalty has also been called "harmony" in other descriptions of the MaxEnt model. To calculate the penalty for the UR-SR pair <$UR_i, SR_i$>, we take the weighted sum of its constraint violations: $$\text{penalty for <}UR_i, SR_i\text{> } = \text{ } \Sigma_{k=1}^m w_k f_k(SR_i, UR_i)$$ (Recall that $w_k$ is the weight of constraint $k$, and $m$ is the number of constraints.) Let's apply this equation to calculate the penalty for the UR-SR pair $<$/ur-2/, [sr2-2]$>$. This pair picks up one violation of Con1 and two violations of Con2. The weighted sum (green box) is thus: \begin{align} \text{penalty for } &= \Sigma_{k=1}^2 w_k f_k(\text{sr2-2, ur-2}) \\ &= weight_{Con1} \times 1 + weight_{Con2} \times 2 \\ &= 1.6 \times 1 + 0.8 \times 2 \\ &= 3.2 \end{align} The penalties for other UR-SR pairs are similarly calculated. For example, the penalty for the UR-SR pair $<$/ur-2/, [sr2-1]$>$ is: \begin{align} \text{penalty for } &= \Sigma_{k=1}^2 w_k f_k(\text{sr2-1, ur-2}) \\ &= weight_{Con1} \times 0 + weight_{Con2} \times 1 \\ &= 1.6 \times 0 + 0.8 \times 1 \\ &= 0.8 \end{align} Notice that $m=2$ because there are two constraints. ### 1.4.3 MaxEnt score The next column gives the MaxEnt score for each UR-SR pair. $$\text{MaxEnt score} = exp(-\text{penalty})$$ The Reader may have also encountered this term being called the "potential" in other descriptions of MaxEnt. Look at the MaxEnt score column of *Figure 2* to see this equation applied. ### 1.4.4 Normalization, $Z$ We then need to normalize the MaxEnt score in order to get probabilities. The next column provides the $Z$ value for each UR. $Z$ is short for Zustandssumme, "sum over states", and has also been known as the normalization constant, or the canonical partition function. $Z$ is the sum of the MaxEnt scores of all UR-SR pairs arising from a particular UR. Here's a simplified way to write the equation for the normalization constant of $UR_i$: $$Z(UR_i) = \Sigma_{j=1}^l m_j$$ where $l$ is the number of UR-SR pairs whose UR is $UR_i$, and $m_j$ is the MaxEnt score of the $j^{th}$ UR-SR pair whose UR is $UR_i$. Let's apply this equation to calculate the normalization constant for /ur-2/ (blue box): \begin{align} Z(\text{ur-2}) &= \Sigma_{i=1}^2m_i \\ &= 0.499 + 0.041 \\ &= 0.490 \end{align} Here, $l = 2$ because there are only two UR-SR pairs whose UR is /ur-2/. We basically add the MaxEnt scores of these two pairs to get $Z(\text{ur-2})$. ### 1.4.5 Predicted probability The predicted probability is shown in the final column. The probability of a particular SR for a given UR, $UR_i$, is this pair's MaxEnt score divided by the normalization constant associated with $UR_i$: $$P(SR_i|UR_i) = \frac{\text{MaxEnt score of <}UR_i, SR_i\text{>}}{Z(UR_i)}$$ Let's apply this equation to calculate the probability of [sr2-2] given /ur-2/ (magenta box): \begin{align} P(\text{sr2-2}|\text{ur-2}) &= \frac{\text{MaxEnt score of }}{Z(\text{ur-2})} \\ &= \frac{0.449}{0.490} \\ &= 0.083 \end{align} Congratulations! You now know how to calculate the probabilities of an SR given its UR when provided with constraint weights. Next, we'll discuss some characteristics of the probability distribution and of the penalty score. ## 1.5 Characteristics of the MaxEnt model The probability distribution defined by the first equation above is also known as the Boltzmann distribution or the Gibbs distribution. Like the Boltzmann distribution, the probability of a particular SR given the UR it arises from is a function of two variables: (1) its penalty and (2) temperature. **As a particular SR's penalty increases, its probability decreases** (assuming everything else remains unchanged). The graph below visualizes how a MaxEnt score varies with penalty (*Figure 3*). Essentially, a higher penalty results in a lower MaxEnt score. (Note: It is not possible to graph how probability varies with penalty, since the actual probability depends on the penalties of the other SRs arising from the same UR, so we'll use the MaxEnt score as an intermediary.) ``` {r, echo=FALSE, fig.cap="Fig 3: How MaxEnt score varies with penalty.", out.width = '85%'} knitr::include_graphics("../man/figures/MaxEnt_vs_penalty.jpg") ``` ``` {r, include=FALSE} # This creates the .jpg image for the graph in the chunk above. # Ideally, use this (i.e. create graph rather than use an image)... # ...but this currently renders small plot & ugly big spaces for labels # Hack to set up window curve(exp(-x), from=0, to=5, col="white", xlab="penalty", ylab="MaxEnt score") # Add horizontal & vertical lines at y=0 & x=0 abline(h=0, col="gray") abline(v=0, col="gray") # Extend curve to to R-edge of window curve(exp(-x), from=0, to=5.25, col="blue", add=TRUE) ``` Recall that the probability of an SR given its UR is directly proportional to its MaxEnt score: $$P(SR_i|UR_i) \propto M_{UR_i, SR_i}$$ where $M_{UR_i, SR_i}$ is the MaxEnt score of the UR-SR pair <$UR_i, SR_i$>. We know this because $$P(SR_i|UR_i) = M_{UR_i, SR_i} \times \frac{1}{Z(UR_i)}$$ Hence, the relationship between penalty and probability is clear -- a higher penalty results in lower probability (all else remaining unchanged). The graph also indicates that as penalty increases, the MaxEnt score approaches but never quite reaches 0. Thus, SR candidates with infinitely high penalties still receive a MaxEnt score greater than zero. This means that, theoretically, such SRs will receive some probability. Nevertheless, the probabilities that such SRs receive will be so tiny that they can be treated as not occurring. This is useful when using MaxEnt to model a data set that has a mix of categorical and probabilistic phenomena. For categorical phenomena, the predicted probability that is returned will often be something like: 99.998% SR1, 0.002% SR2. Such outcomes can be treated as essentially categorical, so **MaxEnt is also capable of modeling categorical data** despite being a probabilistic model. Temperature is rarely used when modeling linguistic data, so the current simplified formula we've introduced above will do for now. We will introduce the final formulation of the probability distribution in §5 when we discuss the `temperature` argument of the `predict_probabilities()` function. It turns out that temperature can be a useful tool in **adjusting model fit to particular psycholinguistic tasks**. An additional feature of the MaxEnt model arises from how the penalty score is calculated. Because each violation contributes to the penalty score, multiple violations of a lower-weighted constraint can work together to make an SR less probable than another SR that has only one violation of a higher-weighted constraint. This phenomenon is called the **"gang-up" effect**. This effect has been observed in phonological phenomena, and has also been reported experimentally (Breiss 2020). # 2 Fitting the model (basic): `optimize_weights()` ## 2.1 Preparing the data file Let's imagine we have a data set with multiple tableaux, and would like to find constraint weights that maximize the probability of our data. In other words, we would like to find constraint weights that produce a distribution over UR-SR pairs that matches the observed frequencies of our data as closely as possible. In order to do that, we first need to know how to structure our data set. Take a look at the data set below, loaded from the `sample_data_frame` data set included in this package. ```{r} # Get paths to sample data file data_file <- system.file( "extdata", "sample_data_frame.csv", package = "maxent.ot" ) # Take a look at the structure of the sample data data_table <- read.table(data_file, header=FALSE) data_table ``` This data set has two tableaux: one for Input1 and another for Input2. There are two constraints: Constraint1 & Constraint2. The columns are organized as follows: * Column 1: Input (underlying representation) * Column 2: Output (surface representation) * Column 3: Frequency + Both token counts & relative frequency can be used for the function `optimize_weights`. + Token counts are recommended if the User intends to use the function `compare_models` down the line to compare multiple trained models. In particular, relative frequency won't work with the `aic_c` & `bic` methods of `compare_models`. * Column 4 onward: Constraints + Each constraint takes up one column. The first & second rows are only used for Column 4 onward (*i.e.* constraints). Full constraint names are in the first row, while their shortened names are in the second. Tableaux are inserted from the third row inwards. The data set is a tab-delimited `.txt` file. The input file is in OTSoft-style format (Hayes, Tesar & Zuraw 2003). Rather than using OTSoft format, you can also pass a data frame/data table/tibble directly into the `optimize_weights` function. The format in this case is a bit different. ```{r} data_frame_file <- system.file( "extdata", "sample_data_frame.csv", package = "maxent.ot" ) data_frame <- read.csv(data_frame_file) data_frame ``` Note that in this data frame the inputs column is completely specified and there is only a single row of constraint names. ## 2.2 Finding weights to fit the model: No bias Now that we've familiarized ourselves with our data set, let's find a set of constraint weights that maximizes the probability of the data. By default, constraint weights are limited to finite non-negative values with an upper bound of $1000$. The User may modify the upper bound by specifying another value for the optional argument `upper_bound`. Using the function `optimize_weights()`, we'll train a model for the data set above, and store the result (our trained model) in the object `simple_model`: ```{r dataForSimpleCase} # Fit weights to data using file path simple_model <- optimize_weights(data_frame) ``` You can also fit a model using a data frame as input. ```{r} simple_model <- optimize_weights(data_frame) ``` Let's take a closer look at our object `simple_model`: ``` {r} # View the model we've fit (no biases used) simple_model ``` The object `simple_model` has 4 relevant named attributes: * `weights`: the trained weights for each constraint. This is a named list. * `loglik`: the log likelihood of the weights given the data + This is a measure of how well the weights fit the observed data. * `k`: the number of constraints * `n`: the number of data points in the training set ## 2.2.1 Retrieving specific model attributes We can retrieve any specific model attribute that we'd like. For example, if we'd like to see only the learned constraint weights for `simple_model`, we can pull them up as follows: ```{r simpleModelWeight} # Get learned weights of model we've fit (no biases used) simple_model$weights ``` Here we see that the trained model happened to learn the same weight (14.2) for both Constraint1 and Constraint2. Likewise, we can retrieve any other model attribute. For example, here's the log likelihood: ```{r simpleModelLL} # Get log likelihood of model we've fit (no biases used) simple_model$loglik ``` This log likelihood is, in fact, at ceiling (at the maximum possible likelihood), which is indicative of overfitting. The issue of overfitting and the introduction of priors to reduce overfitting will be taken up in §3.3. But first, let's gain a better understanding of how log likelihood guides training. ## 2.3 Fitting the model to data: Explained The `optimize_weights()` function implements a learner that adjusts constraint weights in order to fit the model. The learner's objective is to maximize the value of the objective function, and weight adjustments are made with this objective in mind. The objective function, $J(w)$, is composed of two terms: $$J(w) = LL(w|D) - B(w)$$ where $LL(w|D)$ is the log likelihood of the weights $w$ given the data $D$, and $B(w)$ is the price to pay for choosing the particular vector of weights $w$. The first term, $LL(w|D)$, is a measure of how well a model's weights $w$ fit the observed data $D$. Models that more closely match the observed frequencies have a higher $LL(w|D)$ than those with poorer matches. $LL(w|D)$ ranges between negative infinity and zero. The second term, $B(w)$, is used when we want to nudge the model towards particular preferred weights (*e.g.* by using biases or priors). Models with weights that are closer to the preferred ones pay a smaller price (smaller $B(w)$) than those with more aberrant weights. $B(w)$ ranges between zero and positive infinity. Higher values of $LL(w|D)$ and lower values of $B(w)$ both contribute to increasing the value of the objective function $J(w)$. Recall that the learner's objective is to arrive at as high a value of $J(w)$ as possible (it seeks to maximize the value of $J(w)$). In other words, the learner adjusts weights towards weight values that: (1) improve the model's match to the observed frequencies: make $LL(w|D)$ higher (2) are closer to the preferred weight values: make $B(w)$ lower In the rest of this section, we discuss the log likelihood of the weights, $LL(w|D)$, in greater depth. We leave the discussion of $B(w)$ to §3, where priors are introduced. ### 2.3.1 Defining log likelihood When no priors are used, maximizing the objective function $J(w)$ is equivalent to maximizing the log likelihood of the weights $LL(w|D)$ alone. Since we've not used any priors so far, in all the preceding cases, the sole aim of the learner was to seek weights that maximized the log likelihood alone. The log likelihood of the weights is: $$LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)$$ Let's unpack the terms of this equation. On the left-hand side, we have $LL(w|D)$, where * $LL(w|D)$ is the log likelihood of the weights $w$ given the observed data $D$ * $w$ is the vector of constraint weights * $D$ is the training data On the right-hand side: * $P(y_i|x_i; w)$ is the probability of realizing underlying $x_i$ as surface $y_i$ under the weights $w$ * $n$ is the number of observed data points + *i.e.* the sum of the frequency column in the input * $y_i$ is the observed surface form of the $i^{th}$ data point, and $x_i$ is the input form that $y_i$ arose from Notice how the weights of each individual constraint (which are found in $w$) are the only parameters available for the model to adjust. ### 2.3.2 What guides the weight changes?: Maximizing log likelihood Recall that the learner takes in a distribution over training examples, $D$. For the data set we've been using, $D$ is: $<$Input1, Output1-1$>$: 1, $<$Input1, Output1-2$>$: 1, $<$Input2, Output2-1$>$: 1. $<$Input2, Output2-2$>$ is not part of $D$ because it isn't observed. In the absence of priors, learner's sole goal is to maximize the log likelihood alone. In other words, the learner seeks the weights that produce a distribution over these three training examples that matches the observed frequencies as closely as possible. The learner adjusts the weights with this goal in mind. ```{r, echo=FALSE, fig.cap="Table 1: Log likelihood of M1's & M2's weights.", out.width = '85%'} knitr::include_graphics("../man/figures/logLike_table_2.jpg") ``` Let's imagine we have two models, *M1* & *M2*, whose weights (whatever they may be) produce the predicted distributions P(output|input) shown in the final two columns of *Table 1*. For ease of comparison, the observed relative frequencies are shown in the third-to-last column. Right away, we can see that *M1* matches the observed frequencies much better than *M2* does. In other words, the observed data are more probable under the weights of *M1* than they are under the weights of *M2*. Though we haven't yet applied the equation above to calculate the log likelihood of these two model's weights, we can expect that *M1*'s weights will produce a higher log likelihood than *M2*'s. Now we'll apply the log likelihood equation to calculate the log likelihood of the weights of *M1* \& *M2*. Recall that log likelihoods may range from negative infinity to 0, with higher log likelihoods indicating that a model's weights produce a better fit to the observed data. Here's the log likelihood equation again: $$LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)$$ Applying the equation to *M1* ($w_{M1}$ refers to *M1*'s weights): \begin{align} LL(w_{M1}|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w) \\ &= ln(.5) + ln(.5) + ln(1) \\ &= -1.386 \end{align} Applying the equation to *M2* ($w_{M2}$ refers to *M2*'s weights): \begin{align} LL(w_{M2}|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w) \\ &= ln(.25) + ln(.75) + ln(1) \\ &= -2.367 \end{align} *M1*'s weights produce a higher log likelihood than *M2*'s, which indicates that *M1* is a better model than *M2*. This is because *M1*'s weights produce a predicted distribution over observed UR-SR pairs that is closer to the observed frequencies than *M2*'s weights do. (Notice that only observed data are included in the calculations above. Despite Output2-2 getting 50\% of the predicted probability in *M2*, it does not contribute to the log likelihood.) Hence, aiming to maximize the probability of the training data results in finding weights that produce a predicted distribution that is as close to the observed frequencies as possible. For example, we can imagine the learner starting with the weights that produce a distribution that is very distant from the frequencies of the observed training data. With its goal of finding weights that maximize the log likelihood, the learner moves to the weights of *M2*, and eventually arrives at the weights of *M1*, where training is halted and *M1*'s weights are the weights of the final trained model. # 3 Fitting the model with priors: `optimize_weights()` Here's the general formula for the objective function, $J(w)$, repeated here for ease of reference: $$J(w) = LL(w|D) - B(w)$$ In the previous section (§2), we ignored the second term, so the objective function, $J(w)$, depended solely upon the log likelihood of the current weights given the training data. In this section, we'll make use of the second term, $B(w)$. This term is also known by several names such as the "bias term", the "prior", and in special cases, the "regularization term". The bias term allows us to nudge the learner towards particular weights by introducing an added price to pay for weights that differ from the ones that the User specifies. Some use cases for the bias term include using the bias term as simple regularization to avoid overfitting (Goldwater & Johnson 2003), and the modeling of proposed phonological learning biases (Wilson 2006, White 2013, Mayer 2021, among others). Worked examples using the bias term for regularization and for modeling learning biases are provided in §3.3 and §3.4 respectively. ## 3.1 Unpacking the math behind the bias term Here's the formula for the bias term, which determines the price to be paid for choosing particular constraint weights: $$B(w) = \Sigma_{k=1}^m \frac{(w_k - \mu_k)^2}{2\sigma_k^2}$$ * $m$ is the number of constraints * $w_k$ is the weight of constraint $k$ * $\mu_k$ and $\sigma_k$ parameterize a normal distribution that acts as a prior for the value of $w_k$ + $\mu_k$: constraint $k$'s ideal weight + $\sigma_k$: indicates tolerance for deviation from ideal weight, $\mu_k$ Every constraint, $k$, whose weight $w_k$ differs from $\mu_k$ will thus incur an additional price that needs to be paid. This price is subtracted from the log likelihood ($LL(w|D)$), and thus makes the objective function, $J(w)$ smaller (notice the negative sign preceding the second term). Recall that the learner's goal is to maximize the objective function, $J(w)$ (*i.e.* get it as high as possible). The price to be paid depends upon two factors: (1) Distance from the ideal weight; (2) Tolerance for weight discrepancy. (1) Price to be paid increases as the distance from the ideal weight increases. The distance from the ideal weight is $w_k - \mu_k$. For example, let's set the ideal weight for constraint $k$ at 0 ($\mu_k$ = 0). A model that gives constraint $k$ a weight of 1.5 will accrue a lower price to pay than one that gives it a weight of 3.5, as the former has a smaller weight discrepancy. More specifically, the value that is subtracted (*i.e.* price to pay) for constraint $k$ is directly proportional to the square of the weight discrepancy, all else being equal. $$ \text{value to be subtracted} \propto (w_k - \mu_k)^2$$ (2) Price to be paid increases as the tolerance for weight variation decreases. $\sigma_k$ determines how costly the weight discrepancy is for constraint $k$. The smaller $\sigma_k$ is, the lower the tolerance for a particular weight discrepancy. For example, let's imagine that we set the ideal weight for constraint $k$, at 0 ($\mu_k$ = 0), and that the constraint's actual weight, $w_k$ is 1.5; the weight discrepancy is thus 1.5. This same weight discrepancy results in a higher price to pay in a model that has $\sigma_k$ = 0.5 (lower tolerance) than in a model that has $\sigma_k$ = 1 (higher tolerance). More specifically, the value that is subtracted (*i.e.* price to pay) for constraint $k$ is inversely proportional to the square of the tolerance, all else being equal. $$ \text{value to be subtracted} \propto (\frac{1}{\sigma_k})^2$$ Here's the equation for the objective function again: $$J(w) = LL(w|D) - B(w)$$ When maximizing $J(w)$, there is generally a trade-off between model fit ($LL(w|D)$) and the bias term ($B(w)$), as the bias term pushes weights towards the ones specified by the User while the log likelihood pushes weights towards the ones that produce the best model fit. The strength of the bias term is determined by $\sigma$. Specifically, if $\sigma$ is too small, the trained weights may barely move from the specified ideal ones. If the User notices that their models don't appear to have learned anything (*i.e.* the trained weights appear to be the same as the specified ideal weights), it is advisable to increase $\sigma$ (*i.e.* increase the tolerance for weight discrepancy), so that weights may differ from the ideal ones without incurring too great a price. ## 3.2 Using the prior: Which argument to use? There are three arguments that may be used to pass in values for the bias parameters $\mu$ and $\sigma$. The arguments are: * `bias_file` * `mu` * `sigma` To use the argument `bias_file`, pass in a bias file with the structure below. ``` {r} # Get paths to toy data and bias files. data_frame_file <- system.file( "extdata", "sample_data_frame.csv", package = "maxent.ot" ) bias_file <- system.file( "extdata", "sample_bias_data_frame.csv", package = "maxent.ot" ) # Take a look at the structure of the bias_file data_frame <- read.csv(data_frame_file) bias_data_frame <- read.csv(bias_file) bias_data_frame ``` The bias file is a `.csv`. Constraints appear in the rows. The first column specifies the ideal weight, $\mu$, and the second column specifies the tolerance for weight discrepancy, $\sigma$. In the example above, both constraints happen to have the same values for $\mu$. However, if the User so wishes, they may specify different values of $\mu$ for different constraints. The same applies to $\sigma$. Below, we fit weights with the biases specified in a file. ``` {r} # Fit weights with biases specified in file optimize_weights(data_frame, bias_data_frame) ``` The arguments `mu` and `sigma` allow us to specify biases in either scalar or vector form. This enables us to specify different values of $\mu$ for each constraint, if we so desire. The same applies to $\sigma$. If you specify bias terms as scalars, all constraints will share the same values. Below, we fit weights with the bias parameters specified in scalar form. ``` {r} # Fit weights with biases specified as scalars optimize_weights( data_frame, mu = 0, sigma = 1000 ) ``` If you specify bias terms as vectors, each constraint can have different values. We combine the individual constraint parameter values into a vector using the `c()` function (*e.g.* `c(100, 200)`). ``` {r} # Fit weights with biases specified in vector form optimize_weights( data_frame, mu = c(1, 2), sigma = c(100, 200) ) ``` A mix of scalar and vector biases is also possible: ``` {r} # Fit weights with a mix of scalar and vector biases optimize_weights( data_frame, mu = c(1, 2), sigma = 1000 ) ``` If only one of the two bias parameters is present, an error will be raised. (*e.g.* When values are provided for $\mu$, but there are none for $\sigma$.) ## 3.3 Overfitting & regularization ### 3.3.1 Ceiling log likelihood In §2.2 we noted that the log likelihood attained by `simple_model` was in fact at ceiling. ``` {r} # Get log likelihood of simple model (no biases) simple_model$loglik ``` Each data set has a theoretical maximum log likelihood, which can be calculated. Recall that the data set had 3 data points (*i.e.* observed outputs). (Frequency shows up in the third column.) ``` {r} # View the data that was used to train `simple_model` read.table(data_file, header=FALSE, sep='\t') ``` * For Input1, there were 2 outputs, each of which surfaced once. + Observed $P(\text{Output1-1}|\text{Input1}) = 0.5$ + Observed $P(\text{Output1-2}|\text{Input1}) = 0.5$ * For Input2, 1 output surfaced once. + Observed $P(\text{Output2-1}|\text{Input2}) = 1$ Let's calculate the theoretical ceiling log likelihood. Here's the equation for the log likelihood, repeated for easy reference: $$LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)$$ A trained model that matches the observed frequencies perfectly would have attained the ceiling log likelihood. We'll imagine just such a perfect trained model that matches the observed frequencies perfectly, and calculate its log likelihood: \begin{align} LL(w|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w)\\ &= ln(.5) + ln(.5) + ln(1) \\ &= -1.386 \end{align} Since the log likelihood attained by our perfect trained model matches that of `simple_model`, we conclude that `simple_model` also attained the ceiling log likelihood. In other words, the distribution produced by `simple_model`'s weights matches the observed frequencies of the training data perfectly. ### 3.3.2 Regularization as a strategy to combat overfitting Trained models that fit the training data too well often cannot generalize to unseen data. Such models are said to **overfit** the training data. One strategy to reduce overfitting is to include a regularization term. The intended effect of regularization is to keep the weights as low as possible. To put this into practice, we'll make use of the optional arguments `mu` & `sigma` of the function `optimize_weights()`. To keep weights low, we'll set `mu = 0` & `sigma = 1`. This means that we believe that every constraint should: * Have an ideal weight = 0 * Have standard deviation = 1 Since every constraint has the same ideal weight & the same standard deviation (tolerance for weight discrepancy), we can make use of the optional arguments `mu` & `sigma` to input $\mu$ and $\sigma$ just once for all constraints. ```{r scalarBiases} # Train regularized model with mu=0 & sigma=1 regularized_model <- optimize_weights( data_frame, mu=0, sigma=1 ) ``` Inspecting the trained regularized model, we now see a value for the attribute `bias_params`: ```{r} # Take a took at the regularized model trained with scalar biases regularized_model ``` We can retrieve this attribute alone: ```{r} # View the bias parameters we used during model training regularized_model$bias_params ``` Constraints are ordered in rows. The first column gives the $\mu$ and the second column the $\sigma$ for the constraint in question. Let's take a look at the trained weights for the regularized model: ```{r biasModelAttributes} # Get learned weights for the regularized model trained with scalar biases regularized_model$weights ``` The trained regularized model has learned that Constraint 1 has weight = 0.105, and Constraint 2 has weight = 0.316. Notice that the weights for both constraints are now closer to 0 (*cf.* the unbiased model where trained weights for both constraints were 14.2). We attained this by including a prior term that tagged on a price to pay for weights whose value did not exactly match the ideal weight. In particular the price to be paid increased the further a constraint's weight was from the ideal weight $0$. Notice also that the log likelihood of the regularized model's weights is below the ceiling of $-1.386$ ($-1.945 < -1.386$): ```{r biasModelObjFn} # Get value of objective function for trained regularized model regularized_model$loglik ``` The regularized model provides a worse fit to the observed data than the unregularized model does; but the regularized model is probably able to better generalize to unseen data. **Retrieving the weight of 1 constraint** Let's imagine that we're only interested in a tracking the weight of a subset of the constraints. For example, we're interested only in Constraint 1, and want to track the weight learned for this constraint in the unbiased and the simple bias (regularized) model. We use index `1` to pick out the first constraint from the named list `your_model_name$weights`. Similarly, if we were interested in the second constraint, we'd use index `2`, and so on. ```{r} # Get weight of Constraint 1 (simple model) cons1_noBias <- simple_model$weights[1] # Get weight of Constraint 1 (regularized model) cons1_simpleBias <- regularized_model$weights[1] # Compare learned weights of Constraint 1 in the unbiased model... # ...and in the regularized one print(paste("In the unbiased model, Constraint 1's weight is:", sprintf(cons1_noBias, fmt = '%#.3f'))) print(paste("In the regularized model, Constraint 1's weight is:", sprintf(cons1_simpleBias, fmt = '%#.3f'))) ``` ## 3.4 Incorporating substantive effects Another reason to make use of the prior term is to model substantive effects. We can set different preferred weights ($\mu$) for different constraints, by setting different tolerances for deviation from the preferred weight ($\sigma$), or by using a combination of the two. We'll walk through a simplified version of White (2017), in which substantive effects were used to model psycholinguistic results. ### 3.4.1 The phenomenon: Productivity of saltation (White 2017) Consider the following artificial grammar learning experiment (White 2017). Participants were trained on either a p→v alternation (experimental condition) or a b→v alternation (control condition). During the response phase, participants were tested on two alternations. In the first, they were tested on producing the alternation they were trained on ("test"). In the second, they were tested on extending the alternation to the other unseen pair ("extension"). That is, for those trained on the p→v alternation, the b→v alternation was considered the "untrained alternation/extension", and vice versa. **Experimental result to be modeled:** Extension to untrained sounds occurred more frequently in the experimental condition (trained on p→v; extended to b→v) at 73% than in the control condition (trained on b→v; extended to p→v) at 20%. The percentage alternation by participants in the response phase is shown below. ```{r echo=FALSE} # Create matrix with 3 columns tab <- matrix(c(95, 90, 73, 20), ncol=2, byrow=TRUE) # Define column names and row names of matrix colnames(tab) <- c('Expt cdn', 'Control cdn') rownames(tab) <- c('trained', 'untrained') # Convert matrix to table tab <- as.table(tab) # View table tab ``` Given that the perceptual distance between [b] & [v] is smaller than that between [p] & [v], substantive effects arising from perceptual distance appear to play an important role. ### 3.4.2 Modeling the extension of p→v (trained) to b→v (untrained) So how can we introduce these substantive factors into the model? Intuitively, we want to make it more difficult to map [p] to [v] than it is to map [b] to [v]. We will use *Map(x,y) constraints (Zuraw 2007, 2013), which prohibit the mapping between sound x and sound y. We give \*Map(p,v) a higher preferred weight than \*Map(b,v), which means that the mapping from [p] to [v] is more strongly prohibited than the mapping from [b] to [v]. Following White, we'll set $\mu$'s of these constraints to the following preferred weights: * $\mu$ of \*Map(b,v) = 1.3 * $\mu$ of \*Map(p,v) = 3.65 Markedness constraints \*V[-cont]V and \*V[-voi]V drive the alternation from stops to [v] intervocally. These have a preferred weight of 0 to reflect that intervocalic stops and voiceless sounds were allowed in the participants' native languages. The training data for the experimental condition includes only the p→v alternation, reflecting the training phase encountered by the subjects in the experimental condition: ```{r} # Get path to data file (experimental condition) white_salt_data_file <- system.file( "extdata", "sample_whiteSalt_data_file.txt", package = "maxent.ot" ) # File is in OTSoft format, so first convert it to data frame white_salt_data_frame <- otsoft_tableaux_to_df(white_salt_data_file) # View training data (experimental condition) white_salt_data_frame ``` We now train the bias model for the experimental condition, making use of the argument `mu`. We want the value of $\mu$ to be 0, 0, 1.3 and 3.65 for the four ordered constraints [\*V[-cont]V, \*V[-voi]V, \*Map(b,v), \*Map(p,v)] respectively. We have to combine these four values into a vector using the `c()` function (*i.e.* `c(0, 0, 1.3, 3.65)`). Here, we choose to have the same tolerance for deviation from the preferred weight for all constraints, so we'll use the argument `sigma`. ```{r} # Fit model with preferred weights for experimental condition white_salt_bias_model = optimize_weights( white_salt_data_frame, mu=c(0, 0, 1.3, 3.65), sigma=sqrt(0.6) ) # View trained weights (expt cdn with bias) white_salt_bias_model$weights ``` Now let's train an unbiased model for the same experimental condition. We'll average out the preferred weights for the two *Map constraints, which results in each of these constraint having preferred weight = 2.27. ```{r} # Fit unbiased model for experimental condition white_salt_noBias_model = optimize_weights( white_salt_data_frame, mu=c(0, 0, 2.27, 2.27), sigma=sqrt(0.6) ) # View trained weights (expt cdn, no bias) white_salt_noBias_model$weights ``` At this point, we've essentially trained the model on only the p→v alternation. Next, we'll see how well our trained models perform on both the test task (p→v) as well as on the extension task (b→v). Notice how both these phases (training and response) reflect the experimental setup with human subjects. We'll need a new test file that includes the observed frequencies for both the test (p→v) and extension (b→v) alternations. We'll pass both the models trained above as well as the new test file to the function `predict_probabilities()`. We'll take a closer look at both the function `predict_probabilities()`, and the structure of the test file when we introduce `predict_probabilities()` in §4 below. For now, we'll just focus our attention on the predicted probability calculated by this function. ```{r} # Get path to test file (experimental condition) white_salt_test_file <- system.file( "extdata", "sample_whiteSalt_test_file.txt", package = "maxent.ot" ) # File is in OTSoft format, so convert first white_salt_test_df <- otsoft_tableaux_to_df(white_salt_data_file) # View test data (experimental condition) white_salt_test_df # Predict probabilities with weights trained with bias (expt cdn) predict_probabilities( white_salt_test_df, white_salt_bias_model$weights ) # Predict probabilities with weights trained without bias (expt cdn) predict_probabilities( white_salt_test_df, white_salt_noBias_model$weights ) ``` For ease of reference, the predicted percentage of alternation for the trained biased and unbiased models are summarized in the table below: ```{r echo=FALSE} # Create matrix with 3 columns tab <- matrix(c(91, 93, 78, 45), ncol=2, byrow=TRUE) # Define column names and row names of matrix colnames(tab) <- c('Biased model', 'Unbiased model') rownames(tab) <- c('Trained (95%)', 'Untrained (73%)') # Convert matrix to table tab <- as.table(tab) # View table print("Predicted % of alternation during response phase (experimental condition)") tab ``` Recall that during the response phase, participants performed the following two alternations at the following rates: * [p] → [v] (trained "test"): 95% * [b] → [v] (untrained "extension"): 73% Overall, the biased model provided a better fit to the observed data. Predicted probability differed by no more than 5.5%, with the greatest discrepancy showing up for the untrained [b] → [v] alternation. The unbiased model, in contrast, performed poorly. In particular, the untrained [b] → [v] alternation appeared to be close to chance (45%). ### 3.4.3 Modeling the extension of b→v (trained) to p→v (untrained) Let's now train a biased and an unbiased model for the control condition. The training data includes only the b→v alternation, reflecting the training phase of the control condition: ```{r} # Get path to data file (control condition) white_control_data_file <- system.file( "extdata", "sample_whiteCtrl_data_file.txt", package = "maxent.ot" ) white_control_df <- otsoft_tableaux_to_df(white_control_data_file) # View training data (control condition) white_control_df # Fit model with preferred weights for control condition white_control_bias_model = optimize_weights( white_control_df, mu=c(0, 0, 1.3, 3.65), sigma=sqrt(0.6) ) # View trained weights (control cdn with bias) white_control_bias_model$weights # Fit unbiased model for control condition white_control_noBias_model = optimize_weights( white_control_df, mu=c(0, 0, 2.27, 2.27), sigma=sqrt(0.6) ) # View trained weights (control cdn, no bias) white_control_noBias_model$weights ``` As with the experimental condition above, we'll now subject the two trained models to the response phase, which includes the test (b→v) and extension (p→v) tasks. ```{r} # Get path to test file (control condition) white_control_test_file <- system.file( "extdata", "sample_whiteCtrl_test_file.txt", package = "maxent.ot" ) # File is in OTSoft format, so convert first white_control_test_df <- otsoft_tableaux_to_df(white_control_test_file) # View test data (experimental condition) white_control_test_df # View test data (control condition) read.table(white_control_test_file, header=FALSE, sep='\t') # Predict probabilities with weights trained with bias (control cdn) predict_probabilities( white_control_test_df, white_control_bias_model$weights ) # Predict probabilities with weights trained with bias (control cdn) predict_probabilities( white_control_test_df, white_control_noBias_model$weights ) ``` For ease of reference, the predicted percentage of alternation for the trained biased and unbiased models are presented in the table below: ```{r echo=FALSE} # Create matrix with 3 columns tab <- matrix(c(87, 87, 26, 61), ncol=2, byrow=TRUE) # Define column names and row names of matrix colnames(tab) <- c('Biased model', 'Unbiased model') rownames(tab) <- c('Trained (90%)', 'Untrained (20%)') # Convert matrix to table tab <- as.table(tab) # View table print("Predicted % of alternation during response phase (control condition)") tab ``` During the testing phase, participants performed the following two alternations at the following rates: * [b] → [v] (trained "test"): 90% * [p] → [v] (untrained "extension"): 20% The biased model matches the observed data relatively well. The unbiased model, in contrast, greatly overgeneralizes to the untrained p→v alternation. ### 3.4.4 Conclusion Overall, the biased model, which incorporated substantive biases, performed much better than the unbiased one (no substantive influences) at modeling human subject responses during the response phase. The unbiased model both greatly underpredicted extension to the b→v alternation in the experimental condition and greatly overpredicted the extension to the p→v alternation in the control condition. In contrast, the biased model neither vastly under- nor overpredicted on any of the response tasks. For the biased model, we introduced different preferred weights for separate constraints by utilizing the `mu` argument of the `optimize_weights()` function. These preferred weights were chosen to represent the substantive effects that arose from perceptual distances. The resulting trained biased model was able to match the observed frequencies of alternations produced by human subjects during the response phase much more closely than its unbiased counterpart in all experimental conditions, thus demonstrating the utility of incorporating substantive biases for linguistic modeling. # 4 Making predictions: `predict_probabilities()` The function `predict_probabilities()` predicts the probabilities of SR candidates based on their violation profiles and the constraint weights. The function `predict_probabilities()` requires at the very least, the following two arguments in this order: (1) `test_file`: An input file + Its structure must be identical to the input file for `optimize_weights()` (2) `constraint_weights`: Constraint weights which will be used to make predictions ## 4.1 Get predictions with trained weights Let's imagine that we want to make a prediction using the weights of a model that we've trained using `optimize_weights()`. If we've trained a model using `optimize_weights()` and stored the trained model in the object `fit_model_a`, we can specify the constraint weights by passing in the `weights` attribute of the model: `fit_model_a$weights`. ``` {r} # Get paths to toy data file data_file_a <- system.file( "extdata", "sample_data_frame.csv", package="maxent.ot" ) df_a <- read.csv(data_file_a) # Fit weights to data (no biases) fit_model_a <- optimize_weights(df_a) # Predict probabilities for the same input predict_probabilities(df_a, fit_model_a$weights) ``` The returned object has two attributes: (1) `loglik`: The log likelihood (2) `predictions`: A prediction data frame The structure of the prediction data frame is as follows: * The first $n$ columns are essentially identical to the input file, where $n$ is the number of columns in the input file * Third-from-last column: Predicted probability * Second-from-last column: Observed probability * Last column: Error + *error~i~* = *predicted probability~i~* - *observed probability~i~*, where *i* is a row + Positive value: overprediction + Negative value: underprediction **A point of caution: Duplicate UR entries** When there are multiple instances of the same UR, both `Predicted Probability` and `Observed Probability` are normalized over all instances of the UR. However, `Error` is still computed row-wise. This leads to issues with the interpretation of `Error`. Duplicate UR entries may arise when modeling experimental data. For example, we'll imagine that our experiment had two unique URs: *Input1* and *Input2*. The first participant picked *Output1-1* for *Input1* and *Output2-1* for *Input2*. The second participant picked *Output1-2* for *Input1* and failed to respond for *Input2*. We thus get the frequencies below (data_file_b). ``` {r} # Data has repeated URs (absolute frequency) # Get paths to toy data file data_file_b <- system.file( "extdata", "sample_data_file_double_aFreq.txt", package="maxent.ot" ) df_b <- otsoft_tableaux_to_df(data_file_b) # Take a look at the structure of the sample data with duplicate URs df_b # Here's the structure of the same data without duplicate URs df_a ``` Notice that both `df_a` and `df_b` describe the same data, albeit in a different manner. If we'd collapsed all instances of repeated URs from `df_b`, we'd the get `df_a`. However, the resulting `Error` from both data sets differ. ``` {r} # Fit weights to data (no biases) fit_model_b <- optimize_weights(df_b) # Predict probabilities for the same input (duplicate URs) predict_probabilities(df_b, fit_model_b$weights) ``` Interpreting the `Error` for `df_a` is straightforward, while the `Error` for `df_b` is uninterpretable. (*e.g.* Regarding `df_b`, what does it mean when the the first instance of *Output1-2* has a 25% overprediction error?) For this reason, it is strongly recommended to check one's data set for duplicate URs, in order to collapse duplicate URs into a single UR if the User desires to make use of the `Error` values. Alternatively, one could engage in post-hoc data wrangling to achieve interpretable error values if the uncollapsed data set is otherwise desired. ## 4.2 Get predictions with User-chosen weights Let's imagine that we're curious about how particular weights would perform with our data set. Perhaps we've seen these weights reported in a paper and we'll like to try them out. Below, we demonstrate how to prepare these weights so that they'll be compatible with `predict_probabilities()`. Let's say we want to make predictions with Con1's weight being 1.5 and Con2's weight being 2.5. We have to put these weights into a list, and convert it to the appropriate type. ``` {r} # Get predictions with User-chosen constraint weights # Make a list of weights # Be sure to order the weights in exactly the same order... # ...in which their respective constraints appear in the input file! my_wt_ls <- list(1.5, 2.5) # Convert to double object my_wts <- as.double(my_wt_ls) ``` Now we're ready to make predictions with our hand-selected weights. ``` {r} # Get predictions predict_probabilities(df_a, my_wts) ``` ## 4.3 Save predictions to file The `predict_probabilities()` function provides the option of saving the prediction data frame to a file. In the example below, we use the following optional arguments to save the prediction data frame: * `output_path`: path and file name + By specifying `"C:\\path\\to\\dir\\your_file_name.txt"`, the result is saved to the file `your_file_name.txt`, which will be in the directory `C:\path\to\dir`. * `out_sep`: specifies delimiter + Defaults to tabs `"\t"` ``` {r} # Save predicted result to file # predict_probabilities( # data_file_a, my_wts, # output_path = "C:\\path\\to\\dir\\your_file_name.txt", # out_sep = "\t" # ) ``` # 5 The full MaxEnt model: With Temperature ## 5.1 The full MaxEnt model defined As promised, we now present the generalized version of the model with temperature included: $$P(SR_i|UR_i; w) = \frac{1}{Z_w(UR_i)} exp(- \Sigma_{k=1}^m \frac{w_k f_k(SR_i, UR_i)}{T})$$ where $T$ is the new variable, temperature. As before, $m$ is the number of constraints, $w_k$ is the weight of constraint $k$, $f_k(SR_i, UR_i)$ is the number of violations of constraint $k$ incurred by the UR-SR pair <$UR_i, SR_i$>, and $Z_w(UR_i)$ is a normalization term defined as $$Z(UR_i) = \Sigma_{SR \in \mathcal{S}(UR_i)} exp(- \Sigma_{k=1}^m \frac{w_k f_k(SR, UR_i)}{T})$$ where $\mathcal{S}(UR_i)$ is the set of SRs arising from the UR, $UR_i$. Temperature is a variable whose value the User may adjust only in the `predict_probabilities()` function. When using `predict_probabilities()`, if the User doesn't specify a temperature, the value for temperature defaults to 1. Similarly for `optimize_weights()`, where there is no argument available for the User to specify temperature, the temperature can be assumed to be 1. Thus, all the cases that we've encountered in the preceding sections when we applied `predict_probabilities()` or `optimize_weights()` can be assumed to have $T=1$, since we haven't yet used the argument `temperature`. Notice that when $T=1$, the generalized model introduced here is equivalent to the simplified one defined in §1.3. ## 5.2 The effect of temperature on probability **As temperature increases, the probability distribution becomes less polarized.** That is, as temperature increases, the probabilities of the SRs of a given UR move towards equal probability with each other. For example, if a particular UR has two SRs, higher values of $T$ will move the probability of each SR towards $0.5$. We'll use the hand-picked weights `my_wts` to illustrate this effect. First, we get predictions when $T=1$, $T=3$, and $T=5$, and store the results in the objects `t1_pred`, `t3_pred`, and `t5_pred` respectively. ``` {r} # Get paths to toy data file data_file_c <- system.file( "extdata", "sample_data_file_2.txt", package="maxent.ot" ) # Get predictions T=1 # We'll let temperature default to 1 t1_pred <- predict_probabilities(data_file_c, my_wts) # Get predictions T=3 t3_pred <- predict_probabilities(data_file_c, my_wts, temperature=3) # Get predictions T=5 t5_pred <- predict_probabilities(data_file_c, my_wts, temperature=5) ``` Next, we'll construct a data frame to easily compare the prediction probabilities across the different temperatures. Notice that we can extract the predicted probabilities like so: ``` {r} # View predicted probability of t1_pred t1_pred$predictions[, 'Predicted'] ``` Now we'll create the data frame: ```{r} # Select columns we want, and store them in lettered variables a <- t1_pred$predictions[, 'Input'] b <- t1_pred$predictions[, 'Output'] c <- t1_pred$predictions[, 'Predicted'] d <- t3_pred$predictions[, 'Predicted'] e <- t5_pred$predictions[, 'Predicted'] # Join variables to create data frame temperature_df <- data.frame(a, b, c, d, e) # Rename columns names(temperature_df) <- c('Input', 'Output', 'T=1', 'T=3', 'T=5') # View the data frame temperature_df ``` We'll track the predicted probabilities of the outputs of *Input1*: ```{r echo=FALSE} # Create matrix with 3 columns tab <- matrix(c(73, 27, 58, 42, 55, 45), ncol=2, byrow=TRUE) # Define column names and row names of matrix colnames(tab) <- c('Output1-1 (%)', 'Output1-2 (%)') rownames(tab) <- c('T=1', 'T=3', 'T-5') # Convert matrix to table tab <- as.table(tab) # View table tab ``` Thus, as temperature increases, we see that the predicted distributions over the outputs of a given input become less polarized. That is, the probabilities of the various outputs for one given input move closer towards equiprobability with each other. Since there are two outputs here, the value that these predicted probabilities move towards is $1/2 = 0.5$. If there were three outputs (as with *Input2*), the value that these predicted probabilities move towards would be $1/3 = 0.333$. The graphs below visualize the depolarization of the probability distribution over outputs as temperature increases. ``` {r, echo=FALSE} # Create separate data frames for Input1 and Input2 ur1_df <- temperature_df[1:2, 3:5] ur2_df <- temperature_df[3:5, 3:5] # save old par oldpar <- par(mar=c(5, 4, 4, 4), xpd=TRUE) # Plot distribution of outputs for Input1 barplot(as.matrix(ur1_df), main = "Input 1", xlab = "Temperature", ylab = "% output", axes = TRUE, legend.text = c("Out1-1", "Out1-2"), args.legend = list(x = "topright", inset = c(-0.55, 0), # Move legend outside plot cex = .7) # Legend font size ) # Plot distribution of outputs for Input2 par(mar=c(5, 4, 4, 4), xpd=TRUE) barplot(as.matrix(ur2_df), main = "Input 2", xlab = "Temperature", ylab = "% output", axes = TRUE, legend.text = c("Out2-1", "Out2-2", "Out2-3"), args.legend = list(x = "topright", inset = c(-0.55, 0), cex = .7) ) # Reset par par(oldpar) ``` ## 5.3 Using temperature: Experimental data from "2 alternative forced choice" tasks The temperature variable can be used to generate less categorical predictions in a way that is independent of the constraint weights. This may come in useful when modeling psycholinguistic tasks. Consider the 2 alternative forced choice (2AFC) response task, in which participant responses are often less categorical (*i.e.* less polarized) than expected. In a 2AFC task, participants are presented with two options, and must pick one. Why this effect occurs is still unclear. Possible explanations have been put forth such as: selecting a choice that one would not have otherwise volunteered being easier than volunteering said choice without being prompted. Regardless of why this occurs, being able to model depolarization is crucial when one wishes to ask questions that encompass both the lexicon and psycholinguistic experimental results. In one such work, Hayes, Zuraw, Siptar & Londe (2009) (henceforth HZSL) asked whether certain types of constraints were under- or over-learned based on participant responses to a wug test, when compared to the lexicon. In what follows, we walk the Reader through a replication of HZSL (2009) to demonstrate how temperature can be utilized to make lexicon-based models and models trained on human subject responses comparable. ### 5.3.1 Hungarian backness harmony In Hungarian, the dative suffix has two allomorphs: [-nɔk] and [-nɛk]. A front rounded vowel anywhere in the stem triggers the front allomorph. When the closest vowel is a back vowel, the back allomorph is triggered. However, if a back vowel is separated from the suffix by one or more neutral vowels, variation is observed. Likewise, if a stem consists of only neutral vowels, variation is observed. In addition, several unnatural data patterns were observed. For example, stems with a final bilabial stop favored the front allomorph. Such patterns were deemed unnatural because (non-dorsal) consonants are generally thought not to have any effect on vowel harmony. Constraints associated with vowel-triggered harmony were deemed "natural" while those associated with consonant-triggered ones were deemed "unnatural". In addition, monosyllabic Ci:C stems take a back allomorph (historically, this was a back vowel), so the constraint USE_BACK/Ci:C__ was also considered unnatural. HZSL were interested in assessing whether the natural and unnatural constraints governing the zones of variation were under- or over-learned. They designed a wug test where Hungarian speakers were presented with a wug stem as well as the allomorphs [-nɔk] and [-nɛk]. The speakers had to pick one of the two suffixes to go with the stem, so this was a 2AFC task. ### 5.3.2 Challenges with assessing under-/over-learning When one wishes to assess whether humans under- or over-learn certain constraints, it is natural to want to compare the lexicon-trained weights (*Grammar L*) directly to the wug-test-trained weights (*Grammar W*). However, such direct comparisons can be problematic for several reasons: (1) The lexicon and the wug test are based on different types of words + Wug tests are often carefully designed to capture a only a particular part of the grammar that the linguist is interested in + In other words, entire types of words may be missing from the wug test. + For example, HZSL found that in the lexicon, all stems whose final nucleus contained a back vowel (B-type stems) always picked a back suffix. - Lexicon model (*Grammar L*): Agree(back, local) has very high weight. + However, such stems were very uninteresting to HZSL, who were interested in obtaining wug results for stems in the zones of variation. + HZSL excluded such B-type stems from their wug test: - Wug model (*Grammar W*): Agree(back, local) has extremely low weight, since the learner never encountered any candidate that violated this constraint. (2) Even if all word types are included in the wug test, the proportions between the different types must be identical to that of the lexicon in order for the wug & lexicon grammars to be comparable. (3) Human subject responses are depolarized if experimental task is 2AFC ### 5.3.3 A rigorous approach to assessing under-/over-learning Instead of directly comparing a constraint's weight in *Grammar L* to its weight in *Grammar W*, here's what HZSL did: (1) HSZL simulated a group of idealized Hungarian speakers who had learned the weights from the lexicon perfectly (*i.e.* they interact with the world using *Grammar L*). (2) This group of idealized Hungarian speakers then performed the very same wug test that the group of human subjects did, randomly picking the back or front suffix for each stem according to their grammar (*i.e.* the weights in *Grammar L*). (3) The simulated wug test result was then fit to produce a grammar, *Grammar S*. + Notice how this mirrored the human wug test results that were fit to produce the grammar, *Grammar W*. + In other words, both *Grammar S* and *Grammar W* were trained on data sets that had the exact same types and proportion of stems. Comparing *Grammar S* to *Grammar W*: * We know that *Grammar S* arose from the wug test results of a group of idealized speakers using *Grammar L*, because HZSL made these simulated speakers perform the wug test using *Grammar L*. * We also have *Grammar W*, whose weights we know arose from the wug test results of human Hungarian speakers, but we don't know what grammar these speakers used to perform the wug test. * If indeed the human Hungarian speakers had learned their lexicon perfectly, then they should have used *Grammar L* to perform the wug test. * Consequently, *Grammar S* and *Grammar W* should not differ. * However, if a particular constraint was under-learned by the human speakers, then its weight should be lower in *Grammar W* than it is in *Grammar S*. * The same applied for over-learned constraints, which would have a higher weight in *Grammar W* than in *Grammar S*. A huge collection of *Grammar S*: * Since the stem types in the wug test were those in the zones of variation, that lone wug test performed by one group of idealized speakers might produce very different suffix choices than the one performed by the human speakers due to chance alone. * To mitigate this, HZSL ran 10,000 groups of simulated idealized speakers. * For each constraint *C*, they now had a distribution of its expected idealized weight (from 10,000 *Grammar S*'s) * This approach, estimating the value of a variable via random sampling, is called the Monte Carlo method. * If constraint *C*'s weight from *Grammar W* fell in the left tail of the idealized *C* distribution, they interpreted this to mean that *C* was under-learned. * If constraint *C*'s weight from *Grammar W* fell in the right tail of the idealized *C* distribution, then *C* was over-learned. Let's take stock of where we are. In §5.3.2, we saw that there were three challenges to directly comparing the lexicon and wug grammars. Simulating idealized speakers performing the same wug task as human speakers solved the first two issues. The third issue, simulating depolarized 2AFC wug responses, will be taken up in the next section (§5.3.4), where the hyperparameter `temperature` will be used to effect depolarization. ### 5.3.4 Generating depolarized predictions for a 2AFC task How would a simulated speaker go about picking a suffix for a wug stem? For each stem, the simulated speaker's grammar (*i.e.* *Grammar L*) generates a probability distribution over the two suffix options. This probability distribution guides the simulated speaker's suffix choice. For example, if the probability distribution is: * back suffix: 90% * front suffix: 10% then 9 out of 10 times, the simulated speaker will pick the back suffix. We will utilize the `predict_probabilities()` function to generate the probability distribution over the two suffix options of a given stem: * `test_file`: Hungarian wug test data + Since we want to get the predicted probabilities over the suffix options of the various wug stems. * `constraint_weights`: *Grammar L* + Since these idealized speakers have learned the lexicon perfectly, we'll use the weights that were learned by training on the lexicon. * `temparature`: `T=1.5` + To capture the depolarization effect for 2AFC tasks, we need $T>1$ + Following HZSL, we'll specifically use $T=1.5$. See HZSL (2009) for details on why they chose this value. ``` {r} # Get paths to sample Hungarian wug data hu_file <- system.file( "extdata", "sample_hu_wug.txt", package = "maxent.ot" ) hu_data <- otsoft_tableaux_to_df(hu_file) # We'll use the weights learned by training on the lexicon as reported by HZSL on (p.848). lex_wts_ls <- list(5.39, 1.69, 1.46, 3.00, 4.04, 2.45, 1.08, .91, 1.73, 2.42) # Make a named list (optional) names(lex_wts_ls) <- list( 'AGREE(back,nonlocal)', 'AGREE(front,local)', 'AGREE(nonhigh_front,local)', 'AGREE(low_front,local)', 'AGREE(double_front,local)', 'USE_FRONT/bilabial__', 'USE_FRONT/[+cor,+son]__', 'USE_FRONT/sibilant__', 'USE_FRONT/CC__', 'USE_BACK/[C0i:C0]__' ) # Convert to double object (required type as argument of predict_probabilities()) lex_wts <- as.double(lex_wts_ls) # Predict probabilities with temperature set to 1.5 # Notice also that "UTF-8" is an option for encoding, which comes in useful here hu_pred <- predict_probabilities( hu_data, lex_wts, temperature = 1.5, encoding = "UTF-8" ) ``` Let's take a look at some of the predicted wug probabilities produced by *Grammar L*: ``` {r} # Let's view some of the predicted probabilities at T=1.5 # We've dropped the constraint violations for easier viewing # Both predicted and observed probability are conditioned over URs # e.g. Observed probability in rows 3 & 4 don't sum to 1 # because the stem [étt] appears twice in this data set. head(hu_pred$predictions[, -4:-(ncol(hu_pred$predictions)-3)]) # For curiosity's sake, let's compare: # With default temperature=1... # ... predicted probability is more polarized hu_pred_t1 <- predict_probabilities( hu_data, lex_wts, encoding = "UTF-8" ) head(hu_pred_t1$predictions[, -4:-(ncol(hu_pred_t1$predictions)-3)]) ``` ### 5.3.5 Monte-carlo simulations based on the predicted probabilities above We've used *Grammar L* to generate predicted probabilities for wug suffixes at $T=1.5$, and stored the probability distributions in the object `hu_pred`. Next, we use these probability distributions to guide our simulated subjects as they pick suffixes for wug stems. We use the function `monte_carlo_weights()` to simulate 500 groups of idealized Hungarian speakers picking suffixes for the wug stems. This function only requires two arguments: the prediction data frame (we use `hu_pred`), and the number of simulations. This function produces a complete wug test response by randomly choosing one suffix option for each trial in `hu_pred` based on the the predicted probabilities in `hu_pred`. (This function automatically converts predicted P(SR|UR) in `hu_pred` to predicted P(SR|trial).) It then calls `optimize_weights()` to fit this simulated wug response (*i.e.* to produce *Grammar S*), and records the weights learned. It repeats this process for the specified number of simulations. It returns a data frame with $m$ rows and $n$ columns, where $m$ is the number of simulations and $n$ is the number of constraints. That is, each row represents an instance of *Grammar S*. The following code was used to produce the 500 simulated complete wug test responses and train their corresponding grammars (*Grammar S*'s): ``` {r} # Learn weights for 500 simulated wug tests # monte_carlo_weights(hu_pred, 500) ``` Each complete wug test consisted of 1703 trials, so the process took a very long time to run. To save time, we saved the 500 sets of constraint weights (500 *Grammar S*'s) and present them below: ``` {r} # Get path to the saved constraint weights trained on the 500 simulated wug tests hu_500simul_wts_path <- system.file( "extdata", "hu_500simuls_wts.txt", package = "maxent.ot" ) # Store constraint weights in object `hu_500simul_wts` hu_500simul_wts <- read.table(hu_500simul_wts_path, header=TRUE, sep='\t') hu_500simul_wts <- data.frame(hu_500simul_wts) # Rename columns names(hu_500simul_wts) <- c( 'AGREE(back,nonlocal)', 'AGREE(front,local)', 'AGREE(nonhigh_front,local)', 'AGREE(low_front,local)', 'AGREE(double_front,local)', 'USE_FRONT/bilabial__', 'USE_FRONT/[+cor,+son]__', 'USE_FRONT/sibilant__', 'USE_FRONT/CC__', 'USE_BACK/[C0i:C0]__' ) # View the first 6 rows of hu_500simul_wts # i.e. the first 6 Grammar S's head(hu_500simul_wts) ``` **Some housekeeping** Notice that duplicate UR entries are present in the prediction data frame `hu_pred`. This is because the set-up of the `monte_carlo_weights()` function crucially depends upon predicted P(SR|trial) in order to randomly give one SR in each trial a frequency of 1. Consecutive rows are considered to belong to the same trial if they have the same UR, but not a repeated SR. When creating your prediction data frame, it is important to ensure that trials are not collapsed into a single entry. The caution against duplicate UR entries highlighted in §4.1 doesn't apply here because the `Error` column of the `hu_pred` file isn't used by `monte_carlo_weights()`. The process took a long time to run so we performed only 500 (rather than 10,000) simulations. The Reader is invited to uncomment the code below if they would like to try running the `monte_carlo_weights()` function. ``` {r} # Learn weights for 5 simulated wug tests #hu_simul_wts <- monte_carlo_weights(hu_pred, 5) #hu_simul_wts ``` ### 5.3.6 Are unnatural constraints under-learned? Now that we have a distribution of idealized constraint weights for each of the 10 constraints (from 500 *Grammar S*'s), let's find out whether the weights trained on human wug responses (*Grammar W*) were under-, over-, or perfectly learned. First, we'll have to fit a model to the human wug responses (*i.e.* produce *Grammar W*): ``` {r} # Fit model for human wug response human_wug_model <- optimize_weights(hu_data) ``` Let's view the constraint weights of *Grammar W* in relation to the average idealized weights from the 500 *Grammar S*'s: ``` {r} # Weights trained on human wug response human_wt <- human_wug_model$weights # Average idealized wug weights ideal_wt <- colMeans(hu_500simul_wts) # Create data frame wt_df <- data.frame(human_wt, ideal_wt) # View weights trained on human wug responses & average idealized wug weights print(wt_df, digits = 3) ``` All unnatural constraints (those beginning with USE...) appear to be under-learned by humans. For example, USE_FRONT/bilabial__ has a lower weight when trained on human wug responses (1.04) than the average of the weights obtained by training on simulated idealized wug responses (1.63). The plots below visualize where the weights trained on human wug tests (blue vertical dashed line) fall in relation to the distribution of the weights trained on the simulated idealized wug tests (histogram). ``` {r} # Plot weights for the five unnatural constraints by looping through them # For i-th constraint for (i in 6:10) { hist(hu_500simul_wts[[i]], breaks=15, xlab="weight", main = colnames(hu_500simul_wts[i])) abline(v=human_wug_model$weights[i], col="red", lwd=3, lty=2) } ``` # 6 Comparing models: `compare_models()` Let's imagine that we've trained multiple models, perhaps one with 3 constraints & another with 2 constraints. The larger model has more constraints to play with, and so usually fits the data better. If our only criterion to determine the best model was to assess only model fit to training data, larger models would generally do better. Nevertheless, we can still ask whether including additional constraints is worth the better fit. For example, if including one additional constraint, improves model fit drastically, the inclusion of this constraint is probably justified. However, if the inclusion of this constraint only improves model fit by a minuscule amount, we'd be better off removing this constraint from our model. There are several methods that can be used to quantify the trade-off between the number of constraints and the model fit. Here's where the function `compare_models` becomes useful. This function includes four comparison methods: * `lrt`: Likelihood ratio test. + Compares two models. + One model's constraints are a proper subset of the other's (*i.e.* nested models) + chi-squared test is performed to determine whether the additional constraints are necessary for the improvement in fit to training data. * `aic`: Akaike Information Criterion (AIC). + Any number of models can be compared. + No requirement that models be nested. * `aic_c`: AIC corrected for small sample size. + Same restrictions as the AIC method above. + Use when sample size (*i.e.* token count) is small. * `bic`: Bayesian Information Criterion. + Same restrictions as AIC method above. The formulas for the AIC-C and the BIC rely on sample sizes. The sample size is the calculated by summing over the column of surface form frequencies. If you wish to use the AIC-C or the BIC comparison methods, make sure that the values in this column are token counts, not relative frequencies. Using relative frequencies will produce invalid results. ## 6.1 Likelihood ratio test ### 6.1.1 Likelihood ratio test: Background Two requirements must be met in order to perform the likelihood ratio test. First, exactly two models must be compared. Second, these models must be nested. Here, we have two data sets `small_data` & `large_data` (thus meeting the first criterion). `large_data` has the very same constraints (C1 & C3) that `small_data` has, plus an additional constraint, C2. Thus, the nested criterion is met. ``` {r} # Get paths to toy data files # This file has two constraints small_file <- system.file( "extdata", "sample_data_file_small.txt", package = "maxent.ot" ) small_df <- otsoft_tableaux_to_df(small_file) # This file has three constraints large_file <- system.file( "extdata", "sample_data_file_large_otsoft.txt", package = "maxent.ot" ) large_df <- otsoft_tableaux_to_df(large_file) # View small_data small_df # View large_data large_df ``` We'll train both models and store the trained models in the objects `small_model` & `large_model`: ``` {r} # Fit weights to both data sets small_model <- optimize_weights(small_df) large_model <- optimize_weights(large_df) ``` Now we'll perform the likelihood ratio test on the two trained models. To use the `lrt` method of `compare_models()`, we'll need three arguments: (1) object storing results of first model (2) object storing results of second model (3) `method = 'lrt'` The models can appear in either order, but they must both appear before the `method` argument. ``` {r} # Compare models using the likelihood ratio test compare_models(small_model, large_model, method='lrt') ``` This method returns a data frame with a single row and the following columns: * `description`: The names of the two models being compared, with the larger model appearing first. * `chi_sq`: chi-squared value. * `k_delta`: degrees of freedom (difference in the number of constraints between the two model). * `p-value`: p-value. The *p*-value was $0.239$. This indicates that the difference in model fit to data was not significant at the $.05$ significance level. Hence, we are not justified in including the additional constraint, C2. The model with fewer constraints is the better one. ### 6.1.2 Likelihood ratio test: Explained The likelihood ratio, $LR$, is calculated as follows: $$LR = 2(LL_2 - LL_1)$$ where $LL_2$ is the log likelihood of the larger model (the one with more constraints) and $LL_1$ is the log likelihood of the smaller model. The likelihood ratio $LR$ is set as the $X^2$-value, which is used to perform a $X^2$-test. The $X^2$-value should range between 0 & positive infinity, with a larger value indicating a greater improvement in model fit. The difference in the number of constraints is set as the degrees of freedom. The *p*-value tells us whether the difference in likelihood between the two models is significant. If it turns out to be significant, then we are justified in including the additional constraints in order to achieve the better model fit. This test runs on the assumption that the larger model indeed produces a better fit to the data. If this criterion isn't met, there is no way in which the larger model is better than the smaller one since (1) the larger model does worse on model fit, and (2) even with more constraints, the larger model couldn't fit the data better than the smaller model Mathematically, this translates to having a negative $X^2$-value, which is uninterpretable. An error message will be produced and no $X^2$-test will be performed if such a scenario is encountered. ### 6.1.3 Likelihood ratio won't work for non-nested models We'll attempt to perform a likelihood ratio test on two non-nested models. We re-use small_data, which has constraints C1 & C3. We introduce a new data set, large_data_b, which has constraints C1, C2 & C4. ``` {r} # Get paths to toy data files # This file has three constraints large_file_b <- system.file( "extdata", "sample_data_file_large_b.txt", package = "maxent.ot" ) large_df_b <- otsoft_tableaux_to_df(large_file_b) # View small_data small_df # View large_data_b large_df_b # Fit weights to data set large_data_b large_model_b <- optimize_weights(large_df_b) ``` The program is unable to perform the likelihood ratio test, and will produce an error message. ``` {r, error=TRUE} # Compare models using the likelihood ratio test compare_models(small_model, large_model_b, method='lrt') ``` ## 6.2 Akaike Information Criterion (AIC) ### 6.2.1 AIC: Background The AIC value for a particular model is calculated as follows: $$AIC = 2k - 2LL$$ where $k$ is the number of constraints, and $LL$ is the log likelihood of the current model. A smaller AIC score indicates a better model. Both the number of constraints and the log likelihood contribute to the AIC score. All else being equal, having a smaller number of constraints or having a higher log likelihood is indicative of a better model. Recall that log likelihood ($-\infty < LL \leq 0$) quantifies model fit. ### 6.2.2 Compare models using the AIC method Two or more models may be passed to `compare_models()` when using the `aic` method. To use the `aic` method of `compare_models()`, we'll need $m+1$ arguments, where $m$ is the number of models that we want compared. * First argument: object storing results of first model * Second argument: object storing results of second model * ... * $m^{th}$ argument: object storing results of $m^{th}$ model * $(m+1)^{th}$ argument: `method = 'aic'` The models can appear in any order, but they must all appear before the `method` argument. The returned data frame contains the same number of rows as models passed in. The models are sorted in increasing order of AIC (*i.e.* best first). The data frame has the following columns: * `model`: The name of the model. * `k`: The number of constraints. * `aic`: The model's AIC score. * `aic.delta`: The difference between the current model's AIC score and the AIC score of the "best" model. * `aic.wt`: The model's AIC weight: This reflects the relative likelihood (or conditional probability) that this model is the "best" model in the set. * `cum.wt`: The cumulative sum of AIC weights up to and including this model. * `ll`: The log likelihood of this model. ``` {r} # Compare models using AIC compare_models(small_model, large_model, method='aic') ``` In the data frame above, the small model has an AIC score of 8.16 while the large model has an AIC score of 8.77. This indicates that the small model is better. How much better is the small model compared to the large model? The column `aic.wt` gives the probability that the current model is the best one of the entire set of models passed in. So the small model has a 57.6% chance of being the best model, and the large model has a 43.7% chance of being the best model. ### 6.2.3 AIC weight (`aic.wt`) explained The AIC weight of a model is calculated by applying the softmax function to the negative halved-distance between current model's AIC score and the AIC score of the best model: $$W_i = \frac{exp(-0.5\delta_i)}{\Sigma_{j=1}^m exp(-0.5\delta_j)}$$ where $\delta_i$ is the difference in AIC score between model $i$ and the best model, and $m$ is the number of models being compared. We'll break this equation down further, and walk through its application step-by-step. The values of the intermediate terms we'll use (*e.g.* $\delta$; the raw AIC weight, $r$) are shown in *Figure 5*. ```{r, echo=FALSE, fig.cap="Fig 5: Calculating AIC weights: Intermediate steps", out.width = '70%'} knitr::include_graphics("../man/figures/aicWeight_table.jpg") ``` #### 6.2.3.1 AIC score distance, $\delta$ (`aic.delta`) First, we'll calculate the difference in AIC score between model $i$ and the best model, $\delta_i$. This value shows up in the `aic.delta` column of the returned data frame. $$\delta_i = aic_i - aic_{best}$$ where $aic_i$ is the AIC score of the $i^{th}$ model, and $aic_{best}$ is the AIC score of the best model. Let's calculate the AIC score distance for the large model, $\delta_{large\_model}$: \begin{align} \delta_{large\_model} &= aic_{large\_model} - aic_{best\_model} \\ &= aic_{large\_model} - aic_{small\_model} \\ &= 8.773 - 8.159 \\ &= 0.614 \end{align} and for the small model, $\delta_{small\_model}$: \begin{align} \delta_{small\_model} &= aic_{small\_model} - aic_{best\_model} \\ &= aic_{small\_model} - aic_{small\_model} \\ &= 8.773 - 8.773 \\ &= 0 \end{align} #### 6.2.3.2 Raw AIC weight, $r$ Next, we'll define the raw AIC weight: $$r_i = e^{(-\delta_i/2)}$$ where $r_i$ is the raw AIC weight of model $i$. Let's calculate the raw AIC weight for the large model: \begin{align} r_{large\_model} &= e^{-\delta_{large\_model}/2} \\ &= e^{-0.614/2} \\ &= 0.736 \end{align} Notice that $\delta = 0$ for the best model, and $\delta \gt 0$ for the rest of the models. This results in the best model's raw AIC weight $r$ being $e^0 = 1$, and the worse models receiving a raw AIC weight less than $1$. #### 6.2.3.3 Actual AIC weight: `aic.wt` Next, we'll perform normalization to get the actual AIC weights shown in the `aic.wt` column. The true AIC weight for the $i^{th}$ model, $W_i$, is just its raw AIC weight, $r_i$, divided by the sum of raw AIC weights of all models that were passed in to be compared: $$W_i= \frac{r_i}{\sum_j^m r_j}$$ where $m$ is the number of models passed in to `compare_models()`, and $r_j$ is the raw AIC weight of the $j^{th}$ such model. Applying this equation, the AIC weight of the large model is: \begin{align} W_{large\_model} &= \frac{r_{large\_model}}{\sum_j^2 r_j} \\ &= \frac{0.736}{1 + 0.736} \\ &= 0.424 \end{align} ## 6.3 Bayesian Information Criterion (BIC) ### 6.3.1 BIC: Background The BIC is an alternative measure of assessing the trade-off between the number of constraints and model fit. The BIC replaces the AIC's first term ($2k$) with $k \ln(n)$. $$BIC = k \ln(n) - 2 LL$$ where $k$ is the number of constraints, $n$ is the sample size, and $LL$ is the log likelihood of the current model. As with the AIC, a smaller BIC score indicates a better model. For both the AIC and the BIC, the first term penalizes models that have more constraints, $k$. However, the number of constraints, $k$, is scaled differently in these two measures. For the BIC, $k$ is scaled by $\ln(n)$, where $n$ is the number of samples. As the number of observed tokens increases, the penalty accrued by models that have more constraints increases (albeit at a decreasing rate since $k$ is scaled by $\ln(n)$ rather than $n$). In contrast, the AIC scales $k$ with the constant $2$, so sample size has no effect on the penalty accrued by having more constraints, $k$. At $7$ observed tokens and below, a model's BIC score is smaller than its AIC score. At $8$ observed tokens and above, a model's BIC score is larger than its AIC score (since $\ln(7) < 2 < \ln(8)$). A caveat: Though unlikely to occur, it is advisable to check that the sample size is greater than $1$. At a sample size of $1$, the effect of the number of constraints, $k$, on the BIC score is canceled out. The BIC score becomes directly proportional to the log likelihood alone ($BIC = k \ln(1) -2LL = -2LL$). ### 6.3.2 Compare models using the BIC method As with the AIC, two or more models may be passed to `compare_models()` when using the `bic` method. The arguments passed in to `compare_models` for the `bic` method are identical to that for the `aic` method, except that the method argument should now be: `method='bic'`. Hence, to use the `bic` method of `compare_models()`, we'll need $m+1$ arguments, where $m$ is the number of models that we want compared. * First argument: object storing results of first model * Second argument: object storing results of second model * ... * $m^{th}$ argument: object storing results of $m^{th}$ model * $(m+1)^{th}$ argument: `method = 'bic'` The models can appear in any order, but they must all appear before the `method` argument. ``` {r} # Compare models using BIC compare_models(small_model, large_model, method='bic') ``` The returned data frame from the BIC method has the same structure as that of the AIC, with one additional column: * `n`: The number of samples in the data that the model is fit to. The `bic` and `bic.wt` columns are the BIC counterparts to the `aic` and `aic.wt` columns in the AIC data frame. Looking at the data frame above, we see that the small model (smaller BIC score) again performs better than the large model (large BIC score). The conditional probability that the small model is the best of all models passed to `compare_models()` is 53.6%. ## 6.4 Akaike Information Criterion corrected for small sample size (AIC-C) ### 6.4.1 AIC-C: Background The Akaike Information Criterion for small sample sizes (AIC-C) yet another alternative method of comparing the trade-off between the number of constraints and the model fit. Like the BIC, AIC-C also takes the sample size into account. The AIC-C modifies the AIC by introducing a third term. $$AIC_c = 2k - 2 LL + \frac{2k^2 + 2k}{n-k-1}$$ As before, $k$ is the number of constraints, $n$ is the sample size, and $LL$ is the log likelihood of the current model. This newly introduced third term further increases the cost of having additional constraints. As $n$ approaches infinity, the third term's denominator approaches 0, so the AIC-C becomes equivalent to the AIC for large sample sizes. ### 6.4.2 Compare models using the AIC-C method As with the AIC and the BIC, two or more models may be passed to `compare_models()` when using the `aic_c` method. The arguments passed in to `compare_models` for the `aic_c` method are identical to that for the `aic` and `bic` methods, except that the method argument should now be: `method='aic_c'`. Hence, to use the `bic` method of `compare_models()`, we'll need $m+1$ arguments, where $m$ is the number of models that we want compared. * First argument: object storing results of first model * Second argument: object storing results of second model * ... * $m^{th}$ argument: object storing results of $m^{th}$ model * $(m+1)^{th}$ argument: `method = 'aic_c'` The models can appear in any order, but they must all appear before the `method` argument. ``` {r} # Get paths to toy data files # This file has 2 constraints small_file_c <- system.file( "extdata", "sample_data_file_small_c.txt", package = "maxent.ot" ) small_df_c <- otsoft_tableaux_to_df(small_file_c) # This file has 3 constraints large_file_c <- system.file( "extdata", "sample_data_file_large_c.txt", package = "maxent.ot" ) large_df_c <- otsoft_tableaux_to_df(large_file_c) # Fit weights to both data sets small_model_c <- optimize_weights(small_df_c) large_model_c <- optimize_weights(large_df_c) # Compare models using AIC-C compare_models(small_model_c, large_model_c, method='aic_c') ``` The returned data frame from the AIC-C method has the same structure as that of the BIC. The `aicc` and `aicc.wt` columns are the AIC-C counterparts to the `aic` and `aic.wt` columns in the AIC data frame respectively. In the data frame above, the small model (smaller AIC-C score) performs better than the large model (larger AIC-C score). The conditional probability that the small model is the best of all the models that were passed to `compare_models()` is $97.4%$. ### 6.4.3 A potential pitfall Here's the formula for AIC-C again: $$AIC_c = 2k - 2 LL + \frac{2k^2 + 2k}{n-k-1}$$ When $n-k \leq 1$ for a particular model, the third term's denominator is 0. This results in an infinitely large AIC-C value. In the example below, this issue crops up for both the large and the small models. ``` {r} # Problematic data set: all models have n-k <= 1 # Compare models using AIC-C compare_models(small_model, large_model, method='aic_c') ``` In the data frame above, none of the models produce a valid AIC-C score. Hence the probability that one particular model is the best cannot be calculated (`aicc.wt` is `NaN` "Not a Number"). One way to side-step this issue is to scale the sample size up. For example, the problematic data sets had their sample sizes increased by a factor of 2 to create the non-problematic data sets: ``` {r} # View problematic data set # sample size: n = 3 # maximum number of constraints for valid aic_c score: 3-2 = 1 # actual maximum number of constraints used: 3 large_df # View non-problematic data set # sample size: n = 6 # maximum number of constraints for valid aic_c score: 6-2 = 4 # actual maximum number of constraints used: 3 large_df_c ``` In the case where only one (*i.e.* `large_model_d`) of multiple models produces an invalid AIC-C score, the model with the invalid AIC-C score will have an `aicc.wt` of 0. The `aicc.wt` for the other models with valid AIC-C scores will be calculated as per normal (see data frame below). ``` {r} # 1 Problematic data set: large_data_d has n-k <= 1 # 2 Non-problematic data sets # sample size: n = 5 # maximum number of constraints for valid aic_c score: 5-2 = 3 # Get paths to toy data files # This file has 2 constraints: Good small_file_d <- system.file( "extdata", "sample_data_file_small_d.txt", package = "maxent.ot" ) small_df_d <- otsoft_tableaux_to_df(small_file_d) # This file has 3 constraints: Good med_file_d <- system.file( "extdata", "sample_data_file_med_d.txt", package = "maxent.ot" ) med_df_d <- otsoft_tableaux_to_df(med_file_d) # This file has 4 constraints: Problematic large_file_d <- system.file( "extdata", "sample_data_file_large_d.txt", package = "maxent.ot" ) large_df_d <- otsoft_tableaux_to_df(large_file_d) # Fit weights for all data sets small_model_d <- optimize_weights(small_df_d) med_model_d <- optimize_weights(med_df_d) large_model_d <- optimize_weights(large_df_d) # Compare models using AIC-C compare_models(small_model_d, med_model_d, large_model_d, method='aic_c') ```