\input{share/preamble} %\VignetteIndexEntry{Simulation of insurance data} %\VignettePackage{actuar} %\SweaveUTF8 \title{Simulation of insurance data with \pkg{actuar}} \author{Christophe Dutang \\ Université Paris Dauphine \\[3ex] Vincent Goulet \\ Université Laval \\[3ex] Mathieu Pigeon \\ Université du Québec à Montréal \\[3ex] Louis-Philippe Pouliot \\ Université Laval} \date{} <>= library(actuar) options(width = 52, digits = 4) @ \begin{document} \maketitle \section{Introduction} \label{sec:introduction} Package \pkg{actuar} provides functions to facilitate the generation of random variates from various probability models commonly used in actuarial applications. From the simplest to the most sophisticated, these functions are: \begin{enumerate} \item \code{rmixture} to simulate from discrete mixtures; \item \code{rcompound} to simulate from compound models (and a simplified version, \code{rcompois} to simulate from the very common compound Poisson model); \item \code{rcomphierarc} to simulate from compound models where both the frequency and the severity components can have a hierarchical structure. \end{enumerate} \section{Simulation from discrete mixtures} \label{sec:rmixture} A random variable is said to be a discrete mixture of the random variables with probability density functions $f_1, \dots, f_n$ if its density can be written as \begin{equation} \label{eq:mixture} f(x) = p_1 f_1(x) + \dots + p_n f_n(x) = \sum_{i = 1}^n p_i f_i(x), \end{equation} where $p_1, \dots, p_n$ are probabilities (or weights) such that $p_i \geq 0$ and $\sum_{i = 1}^n p_i = 1$. Function \code{rmixture} makes it easy to generate random variates from such mixtures. The arguments of the function are: \begin{enumerate} \item \code{n} the number of variates to generate; \item \code{probs} a vector of values that will be normalized internally to create the probabilities $p_1, \dots, p_n$; \item \code{models} a vector of expressions specifying the simulation models corresponding to the densities $f_1, \dots, f_n$. \end{enumerate} The specification of simulation models follows the syntax of \code{rcomphierarc} (explained in greater detail in \autoref{sec:rcomphierarc}). In a nutshell, the models are expressed in a semi-symbolic fashion using an object of mode \code{"expression"} where each element is a complete call to a random number generation function, with the number of variates omitted. The following example should clarify this concept. \begin{example} Let $X$ be a mixture between two exponentials: one with mean $1/3$ and one with mean $1/7$. The first exponential has twice as much weight as the second one in the mixture. Therefore, the density of $X$ is \begin{equation*} f(x) = \frac{2}{3} (3 e^{-3x}) + \frac{1}{3} (7 e^{-7x}) \\ = 2 e^{-3x} + \frac{7}{3} e^{-7x}. \end{equation*} The following expression generates $10$ random variates from this density using \code{rmixture}. <>= rmixture(10, probs = c(2, 1), models = expression(rexp(3), rexp(7))) @ \qed \end{example} See also \autoref{ex:comppois} for a more involved application combining simulation from a mixture and simulation from a compound Poisson model. \section{Simulation from compound models} \label{sec:rcompound} Actuaries often need to simulate separately the frequency and the severity of claims for compound models of the form \begin{equation} \label{eq:definition-S} S = C_1 + \dots + C_N, \end{equation} where $C_1, C_2, \dots$ are the mutually independent and identically distributed random variables of the claim amounts, each independent of the frequency random variable $N$. Function \code{rcompound} generates variates from the random variable $S$ when the distribution of both random variables $N$ and $C$ is non hierarchical; for the more general hierarchical case, see \autoref{sec:rcomphierarc}. The function has three arguments: \begin{enumerate} \item \code{n} the number of variates to generate; \item \code{model.freq} the frequency model (random variable $N$); \item \code{model.sev} the severity model (random variable $C$). \end{enumerate} Arguments \code{model.freq} and \code{model.sev} are simple R expressions consisting of calls to a random number generation function with the number of variates omitted. This is of course similar to argument \code{models} of \code{rmixture}, only with a slightly simpler syntax since one does not need to wrap the calls in \code{expression}. Function \code{rcomppois} is a simplified interface for the common case where $N$ has a Poisson distribution and, therefore, $S$ is compound Poisson. In this function, argument \code{model.freq} is replaced by \code{lambda} that takes the value of the Poisson parameter. \begin{example} Let $S \sim \text{Compound Poisson}(1.5, F)$, where $1.5$ is the value of the Poisson parameter and $F$ is the cumulative distribution function of a gamma distribution with shape parameter $\alpha = 3$ and rate parameter $\lambda = 2$. We obtain variates from the random variable $S$ using \code{rcompound} or \code{rcompois} as follows: <>= rcompound(10, rpois(1.5), rgamma(3, 2)) rcomppois(10, 1.5, rgamma(3, 2)) @ Specifying argument \code{SIMPLIFY = FALSE} to either function will return not only the variates from $S$, but also the underlying variates from the random variables $N$ and $C_1, \dots, C_N$: <>= rcomppois(10, 1.5, rgamma(3, 2), SIMPLIFY = FALSE) @ \qed \end{example} \begin{example} \label{ex:comppois} Theorem~9.7 of \cite{LossModels4e} states that the sum of compound Poisson random variables is itself compound Poisson with Poisson parameter equal to the sum of the Poisson parameters and severity distribution equal to the mixture of the severity models. Let $S = S_1 + S_2 + S_3$, where $S_1$ is compound Poisson with mean frequency $\lambda = 2$ and severity Gamma$(3, 1)$; $S_2$ is compound Poisson with $\lambda = 1$ and severity Gamma$(5, 4)$; $S_3$ is compound Poisson with $\lambda = 1/2$ and severity Lognormal$(2, 1)$. By the aforementioned theorem, $S$ is compound Poisson with $\lambda = 2 + 1 + 1/2 = 7/2$ and severity density \begin{equation*} f(x) = \frac{4}{7} \left( \frac{1}{\Gamma(3)} x^2 e^{-x} \right) + \frac{2}{7} \left( \frac{4^5}{\Gamma(5)} x^4 e^{-4x} \right) + \frac{1}{7} \phi(\ln x - 2). \end{equation*} Combining \code{rcomppois} and \code{rmixture} we can generate variates of $S$ using the following elegant expression. <>= x <- rcomppois(1e5, 3.5, rmixture(probs = c(2, 1, 0.5), expression(rgamma(3), rgamma(5, 4), rlnorm(2, 1)))) @ One can verify that the theoretical mean of $S$ is $6 + 5/4 + (e^{5/2})/2 = 13.34$. Now, the empirical mean based on the above sample of size $10^5$ is: <>= mean(x) @ \qed \end{example} \section{Simulation from compound hierarchical models} \label{sec:rcomphierarc} Hierarchical probability models are widely used for data classified in a tree-like structure and in Bayesian inference. The main characteristic of such models is to have the probability law at some level in the classification structure be conditional on the outcome in previous levels. For example, adopting a bottom to top description of the model, a simple hierarchical model could be written as \begin{equation} \label{eq:basic_model} \begin{split} X_t|\Lambda, \Theta &\sim \text{Poisson}(\Lambda) \\ \Lambda|\Theta &\sim \text{Gamma}(3, \Theta) \\ \Theta &\sim \text{Gamma}(2, 2), \end{split} \end{equation} where $X_t$ represents actual data. The random variables $\Theta$ and $\Lambda$ are generally seen as uncertainty, or risk, parameters in the actuarial literature; in the sequel, we refer to them as mixing parameters. The example above is merely a multi-level mixture of models, something that is simple to simulate ``by hand''. The following R expression will yield $n$ variates of the random variable $X_t$: <>= rpois(n, rgamma(n, 3, rgamma(n, 2, 2))) @ However, for categorical data common in actuarial applications there will usually be many categories --- or \emph{nodes} --- at each level. Simulation is then complicated by the need to always use the correct parameters for each variate. Furthermore, one may need to simulate both the frequency and the severity of claims for compound models of the form \eqref{eq:definition-S}. This section briefly describes function \code{rcomphierarc} and its usage. \cite{Goulet:simpf:2008} discuss in more details the models supported by the function and give more thorough examples. \subsection{Description of hierarchical models} \label{sec:rcomphierarc:description} We consider simulation of data from hierarchical models. We want a method to describe these models in R that meets the following criteria: \begin{enumerate} \item simple and intuitive to go from the mathematical formulation of the model to the R formulation and back; \item allows for any number of levels and nodes; \item at any level, allows for any use of parameters higher in the hierarchical structure. \end{enumerate} A hierarchical model is completely specified by the number of nodes at each level and by the probability laws at each level. The number of nodes is passed to \code{rcomphierarc} by means of a named list where each element is a vector of the number of nodes at a given level. Vectors are recycled when the number of nodes is the same throughout a level. Probability models are expressed in a semi-symbolic fashion using an object of mode \code{"expression"}. Each element of the object must be named --- with names matching those of the number of nodes list --- and should be a complete call to an existing random number generation function, but with the number of variates omitted. Hierarchical models are achieved by replacing one or more parameters of a distribution at a given level by any combination of the names of the levels above. If no mixing is to take place at a level, the model for this level can be \code{NULL}. \begin{example} Consider the following expanded version of model \eqref{eq:basic_model}: \begin{align*} X_{ijt}|\Lambda_{ij}, \Theta_i &\sim \text{Poisson}(\Lambda_{ij}), & t &= 1, \dots, n_{ij} \\ \Lambda_{ij}|\Theta_i &\sim \text{Gamma}(3, \Theta_i), & j &= 1, \dots, J_i \\ \Theta_i &\sim \text{Gamma}(2, 2), & i &= 1, \dots, I, \end{align*} with $I = 3$, $J_1 = 4$, $J_2 = 5$, $J_3 = 6$ and $n_{ij} \equiv n = 10$. Then the number of nodes and the probability model are respectively specified by the following expressions. \begin{Schunk} \begin{Verbatim} list(Theta = 3, Lambda = c(4, 5, 6), Data = 10) \end{Verbatim} \end{Schunk} \begin{Schunk} \begin{Verbatim} expression(Theta = rgamma(2, 2), Lambda = rgamma(3, Theta), Data = rpois(Lambda)) \end{Verbatim} \end{Schunk} \qed \end{example} Storing the probability model requires an expression object in order to avoid evaluation of the incomplete calls to the random number generation functions. Function \code{rcomphierarc} builds and executes the calls to the random generation functions from the top of the hierarchical model to the bottom. At each level, the function \begin{enumerate} \item infers the number of variates to generate from the number of nodes list, and \item appropriately recycles the mixing parameters simulated previously. \end{enumerate} The actual names in the list and the expression object can be anything; they merely serve to identify the mixing parameters. Furthermore, any random generation function can be used. The only constraint is that the name of the number of variates argument is \code{n}. In addition, \code{rcomphierarc} supports usage of weights in models. These usually modify the frequency parameters to take into account the ``size'' of an entity. Weights are used in simulation wherever the name \code{weights} appears in a model. \subsection[Usage of rcomphierarc]{Usage of \code{rcomphierarc}} \label{sec:rcomphierarc:usage} Function \code{rcomphierarc} can simulate data for structures where both the frequency model and the severity model are hierarchical. It has four main arguments: \begin{enumerate} \item \code{nodes} for the number of nodes list; \item \code{model.freq} for the frequency model; \item \code{model.sev} for the severity model; \item \code{weights} for the vector of weights in lexicographic order, that is all weights of entity 1, then all weights of entity 2, and so on. \end{enumerate} The function returns the variates in a list of class \code{"portfolio"} with a \code{dim} attribute of length two. The list contains all the individual claim amounts for each entity. Since every element can be a vector, the object can be seen as a three-dimension array with a third dimension of potentially varying length. The function also returns a matrix of integers giving the classification indexes of each entity in the portfolio. The package also defines methods for four generic functions to easily access key quantities for each entity of the simulated portfolio: \begin{enumerate} \item a method of \code{aggregate} to compute the aggregate claim amounts $S$; \item a method of \code{frequency} to compute the number of claims $N$; \item a method of \code{severity} (a generic function introduced by the package) to return the individual claim amounts $C_j$; \item a method of \code{weights} to extract the weights matrix. \end{enumerate} In addition, all methods have a \code{classification} and a \code{prefix} argument. When the first is \code{FALSE}, the classification index columns are omitted from the result. The second argument overrides the default column name prefix; see the \code{rcomphierarc.summaries} help page for details. The following example illustrates these concepts in detail. \begin{example} Consider the following compound hierarchical model: \begin{equation*} S_{ijt} = C_{ijt1} + \dots + C_{ijt N_{ijt}}, \end{equation*} for $i = 1, \dots, I$, $j = 1, \dots, J_i$, $t = 1, \dots, n_{ij}$ and with \begin{align*} N_{ijt}|\Lambda_{ij}, \Phi_i &\sim \text{Poisson}(w_{ijt} \Lambda_{ij}) & C_{ijtu}|\Theta_{ij}, \Psi_i &\sim \text{Lognormal}(\Theta_{ij}, 1) \notag \\ \Lambda_{ij}|\Phi_i &\sim \text{Gamma}(\Phi_i, 1) & \Theta_{ij}|\Psi_i &\sim N(\Psi_i, 1) \\ \Phi_i &\sim \text{Exponential}(2) & \Psi_i &\sim N(2, 0.1). \notag \end{align*} (Note how weights modify the Poisson parameter.) Using as convention to number the data level 0, the above is a two-level compound hierarchical model. Assuming that $I = 2$, $J_1 = 4$, $J_2 = 3$, $n_{11} = \dots = n_{14} = 4$ and $n_{21} = n_{22} = n_{23} = 5$ and that weights are simply simulated from a uniform distribution on $(0.5, 2.5)$, then simulation of a data set with \code{rcomphierarc} is achieved with the following expressions. <>= set.seed(3) @ <>= nodes <- list(cohort = 2, contract = c(4, 3), year = c(4, 4, 4, 4, 5, 5, 5)) mf <- expression(cohort = rexp(2), contract = rgamma(cohort, 1), year = rpois(weights * contract)) ms <- expression(cohort = rnorm(2, sqrt(0.1)), contract = rnorm(cohort, 1), year = rlnorm(contract, 1)) wijt <- runif(31, 0.5, 2.5) pf <- rcomphierarc(nodes = nodes, model.freq = mf, model.sev = ms, weights = wijt) @ Object \code{pf} is a list of class \code{"portfolio"} containing, among other things, the aforementioned two-dimension list as element \code{data} and the classification matrix (subscripts $i$ and $j$) as element \code{classification}: <>= class(pf) pf$data pf$classification @ The output of \code{pf\$data} is not much readable. If we were to print the results of \code{rcomphierarc} this way, many users would wonder what \code{Numeric,\emph{n}} means. (It is actually R's way to specify that a given element in the list is a numeric vector of length $n$ --- the third dimension mentioned above.) To ease reading, the \code{print} method for objects of class \code{"portfolio"} only prints the simulation model and the number of claims in each node: <>= pf @ By default, the method of \code{aggregate} returns the values of $S_{ijt}$ in a regular matrix (subscripts $i$ and $j$ in the rows, subscript $t$ in the columns). The method has a \code{by} argument to get statistics for other groupings and a \code{FUN} argument to get statistics other than the sum: <>= aggregate(pf) aggregate(pf, by = c("cohort", "year"), FUN = mean) @ The method of \code{frequency} returns the values of $N_{ijt}$. It is mostly a wrapper for the \code{aggregate} method with the default \code{sum} statistic replaced by \code{length}. Hence, arguments \code{by} and \code{FUN} remain available: <>= frequency(pf) frequency(pf, by = "cohort") @ The method of \code{severity} returns the individual variates $C_{ijtu}$ in a matrix similar to those above, but with a number of columns equal to the maximum number of observations per entity, \begin{displaymath} \max_{i, j} \sum_{t = 1}^{n_{ij}} N_{ijt}. \end{displaymath} Thus, the original period of observation (subscript $t$) and the identifier of the severity within the period (subscript $u$) are lost and each variate now constitute a ``period'' of observation. For this reason, the method provides an argument \code{splitcol} in case one would like to extract separately the individual severities of one or more periods: <>= severity(pf) severity(pf, splitcol = 1) @ Finally, the weights matrix corresponding to the data in object \code{pf} is <>= weights(pf) @ Combined with the argument \code{classification = FALSE}, the above methods can be used to easily compute loss ratios: <>= aggregate(pf, classif = FALSE) / weights(pf, classif = FALSE) @ \qed \end{example} \begin{example} \cite{Scollnik:2001:MCMC} considers the following model for the simulation of claims frequency data in a Markov Chain Monte Carlo (MCMC) context: \begin{align*} S_{it}|\Lambda_i, \alpha, \beta &\sim \text{Poisson}(w_{ij} \Lambda_i) \\ \Lambda_i|\alpha, \beta &\sim \text{Gamma}(\alpha, \beta) \\ \alpha &\sim \text{Gamma}(5, 5) \\ \beta &\sim \text{Gamma}(25, 1) \end{align*} for $i = 1, 2, 3$, $j = 1, \dots, 5$ and with weights $w_{it}$ simulated from \begin{align*} w_{it}|a_i, b_i &\sim \text{Gamma}(a_i, b_i) \\ a_i &\sim U(0, 100) \\ b_i &\sim U(0, 100). \end{align*} Strictly speaking, this is not a hierarchical model since the random variables $\alpha$ and $\beta$ are parallel rather than nested. Nevertheless, with some minor manual intervention, function \code{rcomphierarc} can simulate data from this model. First, one simulates the weights (in lexicographic order) with <>= set.seed(123) @ <>= wit <- rgamma(15, rep(runif(3, 0, 100), each = 5), rep(runif(3, 0, 100), each = 5)) @ Second, one calls \code{rcomphierarc} to simulate the frequency data. The key here consists in manually inserting the simulation of the shape and rate parameters of the gamma distribution in the model for $\Lambda_i$. Finally, wrapping the call to \code{rcomphierarc} in \code{frequency} will immediately yield the matrix of observations: <>= frequency(rcomphierarc(list(entity = 3, year = 5), expression(entity = rgamma(rgamma(1, 5, 5), rgamma(1, 25, 1)), year = rpois(weights * entity)), weights = wit)) @ \qed \end{example} One will find more examples of \code{rcomphierarc} usage in the \code{simulation} demo file. The function was used to simulate the data in \cite{Goulet_cfs}. %% References \bibliography{actuar} \end{document} %%% Local Variables: %%% mode: noweb %%% coding: utf-8 %%% TeX-master: t %%% End: