The KSPM package was implemented to fit the single and multiple kernel semi-parametric models for continuous outcome. The package includes the most popular kernel functions, allows kernel interactions and test of variance for each kernel part. The summary(), plot() and predict() commands are available and are similar to those included in standard regression packages. We added a graphical tool for the interpretation of individual effect of variables based on pointwise derivatives. Finally, a variable selection procedure is implemented for the single kernel semi-parametric model.

Some key theoretical details are given in the first section. Then, two examples illustrate how to use the package. The first example is about prediction of movie ratings according to conventional and social media features. The second example is a temporal set of data on energy consumption and meteorology.

Note that the KSPM package depends on CompQuadForm, expm and DEoptim packages that should be installed before the importation of KSPM via library() command.

1 Key theoretical details

1.1 The model

1.1.1 Single kernel semi-parametric model

Let \(Y = (Y_1, ..., Y_n)^T\) be a \(n \times 1\) vector of continuous outcomes where \(Y_i\) is the value for subject \(i = 1, ..., n\), and \(X\) and \(Z\) are \(n \times p\) and \(n \times q\) matrices of predictors, such that \(Y\) depends on \(X\) and \(Z\) as follows:

\[Y = X\beta + h(Z) + e\]

where \(\beta\) is a \(p \times 1\) vector of regression coefficients for the linear parametric part, \(h(.)\) is an unknown smooth function and \(e\) is an \(n \times 1\) vector of independent and homoscedastic errors such as \(e_i \sim \mathcal{N}(0, \sigma^2)\). \(X\beta\) is referred to as the linear part of the model and we assume that \(X\) contains an intercept. \(h(Z)\) is referred to as the kernel part. The function \(h(.)\) lies in \(\mathcal{H}_k\), the function space generated by a kernel function \(k(.,.)\). \(K\) is the \(n \times n\) Gram matrix of \(k(.,.)\) such that \(K_{ij} = k(Z_i, Z_j)\) represents the similarity between subjects \(i\) and \(j\) according to \(Z\). In the estimation process, \(h()\) is expressed as a linear combination of \(k(.,.)\) such that:

\[h = K \alpha\]

1.1.2 Multiple kernel semi-parametric model

Now suppose there are \(L\) matrices \(Z_1, ..., Z_L\) of dimension \(n \times q_1\), …, \(n \times q_L\), respectively. Keeping a similar notation, \(Y\) can be assumed to depend on \(X\) and \(Z_1\), …, \(Z_L\) as follows:

\[Y = X\beta + \sum\limits_{\ell = 1}^L h_{\ell}(Z_{\ell}) + e\]

where \(\forall \ell \in \left\lbrace 1, ..., L \right\rbrace\), each \(h_{\ell}\) is an unknown smooth function from \(\mathcal{H}_{k_{\ell}}\), the function space generated by a kernel function \(k_{\ell}(.,.)\) whose Gram matrix is noted \(K_{\ell}\).

1.1.3 Kernel interactions

Suppose there exists an interaction between two sets of variables \(Z_1\) and \(Z_2\), \(n \times q_1\) and \(n \times q_2\) matrices. We assume that \(Y\) depends on \(X\), \(Z_1\) and \(Z_2\) as follows:

\[Y = X\beta + h_1(Z_1) + h_2(Z_2) + h_{12}(Z_1, Z_2) + e\]

where \(h_1(.)\), \(h_2(.)\) and \(h_{12}(.,.)\) are unknown smooth functions from \(\mathcal{H}_{k_1}\), \(\mathcal{H}_{k_2}\) and \(\mathcal{H}_{k_{12}}\) respectively. The \(h_{12}(.,.)\) function represents the interaction term. Its associated kernel function \(k_{12}(.,.)\) is defined as the product of kernel functions \(k_1(.,.)\) and \(k_2(.,.)\). Obviously, this 2-way interaction model could be generalized to higher order interaction terms.

1.2 Kernel functions

Below is the list of all kernel functions suported in this package, where \((i,j)\) represents a pair of subjects with covariates \(Z_i\) and \(Z_j\), respectively:

Linear kernel function: \(k(Z_i, Z_j) = Z_i^T Z_j\)
Polynomial kernel function: \(k(Z_i, Z_j) = (\rho \, Z_i^T Z_j + \gamma)^d\)
Gaussian kernel function: \(k(Z_i, Z_j) = exp(-\parallel Z_i - Z_j\parallel^2 / \rho) \; \rho > 0\)
Sigmoid kernel function: \(k(Z_i, Z_j) = tanh(\rho \, Z_i^T Z_j + \gamma)\)
Inverse quadratic kernel function: \(k(Z_i, Z_j) = (\parallel Z_i - Z_j\parallel^2 + \gamma)^{-1/2}\)
Equality kernel function: \(k(Z_i, Z_j) = 1 \; \text{if} \; Z_i = Z_j \; \text{and} \; 0 \; \text{otherwise}\).

Of note, users can define their own kernel functions by providing the corresponding kernel Gram matrices.

Kernel functions involve tuning parameters, noted \(\rho\) and \(\gamma\) above. In practice, these tuning parameters are chosen by the user or estimated by the package.

1.3 Estimation of parameters

The parameter estimates are obtained by maximizing the penalized log-likelihood associated with the kernel semi-parametric model. Penalisation and tuning parameters are estimated by minimizing the leave-one-out error. Estimated parameters are:

\(\beta\): linear regression coefficients
\(\alpha_{\ell}\): kernel regression coefficients (\(\ell \in \left\lbrace 1,..,L \right\rbrace\))
\(\lambda_{\ell}\): penalisation parameters (\(\ell \in \left\lbrace 1,..,L \right\rbrace\))
\(\rho, \gamma\): tuning parameters if needed (optional for each kernel)

1.4 Tests of hypotheses in KSPM

For either a single kernel or a multiple kernel semi-parametric model, a standard test of interest is \(H_0: h_{\ell}(.) = 0\), ie. there is no effect, singly or jointly, of a set of variables on the outcome. If interest lies in a global test for multiple kernel semi-parametric models, the null hypothesis is \(H_0: h_1(.) = ... = h_L(.) = 0\). Both tests are available in the package. They consist in score test for variance component in the equivalent linear mixed effect model.

1.5 Interpretation of variable effects through derivatives

A kernel represents similarity between subjects through combining a set of variables in a possible complex way, that may include their possible interactions. Hence, the general effect of a kernel on an outcome is hard to see and difficult to interpret. Nevertheless, the marginal effect of each variable in the kernel may be illustrated using a partial derivative function. Indeed, the derivative measures how the change in a variable of interest impacts the outcome in an intuitively understandable way.

2 Example: movie ratings dataset

library(KSPM)

2.1 Data importation

The conventional and social media movies (csm) dataset contains data on 187 movies from 2014 and 2015. To predict ratings, movies are described using a set of 6 variables called conventional features (genre, sequel, budget in USD, gross income in USD, number of screens in USA) and another set of 6 variables called social media features (aggregate actor followers on Twitter, number of views, likes, dislikes, comments of movie trailer on Youtube, sentiment score). The dataset may be loaded from the package using the data() command. This data set is a subset of the one available on the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and fully described in Ahmed et al. (2015).

data(csm) 
head(csm)

2.2 Predicting ratings using the set of conventional features

2.2.1 Fitting the model

The model is fitted using the kspm() command. Its principal arguments are:

response: outcome variable (ratings in our example)
linear: linear covariates (in our example, no linear term is specified by the user and only an intercept is included in the linear part of the model)
kernel: kernel covariates (in our example: the set of conventional features)
data: dataset csm in our example

The kernel argument is defined by a formula of Kernel() objects. In our example, we assume a Gaussian kernel function to fit the joint effect of the conventional features on ratings. By leaving rho = NULL, we do not provide any value for the \(\rho\) parameter and it will be estimated at the same time as the penalization parameter by minimizing the leave-one-out error. Alternatively, \(\rho\) can fixed by the user using the argument rho=.

csm.fit1 <- kspm(response = "Ratings", kernel = ~Kernel(~Gross + Budget + Screens + Sequel, kernel.function = "gaussian", rho = 61.22), data = csm)

While the model is running, the kspm() command prints some indications about the kernel(s) included in the model. Standard functions such as summary() can also be used to print results.

summary(csm.fit1)

The summary output gives information about the sample size used in the model, the residual distribution, the coefficient for the linear part (similarly to other regression R packages), and the penalization parameter and score test associated with the kernel part. In such a complex model, the number of free parameters, i.e. the standard definition of the degrees of freedom of a model is undefined, and we use instead the effective degrees of freedom of the model, which is not necessary a whole number. However, our definition for effective degrees of freedom is similar to the one used in linear regression models and depends on the trace of the hat matrix.

2.2.2 Diagnostic plots

The plot() command displays standard diagnostic plots for regression models, such as residuals versus fitted values, residuals versus leverage or residuals versus cook’s distance, using the argument which=. Note that outliers are anotated.

par(mfrow = c(2,2), mar = c(5, 5, 5, 2))
plot(csm.fit1, which = c(1, 3, 5), cex.lab = 1.5, cex = 1.3)
hist(csm$Ratings, cex.lab = 1.5, main = "Histogram of Ratings", xlab = "Ratings")

Of note, the lower tail of residuals distribution is not as expected under normality. The histogram of ratings shows that could be due to the skewed distribution of the response variable.

2.2.3 Interpretation of coefficients through pointwise derivatives plots

Like \(\beta\) coefficients in linear regression models, the derivative function may help to interpret the effect of each variable individually on the outcome. The derivatives() command computes partial derivatives of the estimated kernel function according to each variable included in the kernel. Therefore, it illustrates their individual effects on the outcome at each point of the dataset.

The sign of the derivatives corresponds to the direction of variation of the function capturing the effect of variable on the outcome. The plot() command applied to a derivatives object gives a graphical interpretation of these effects.

par(mfrow = c(1,2), mar = c(5,5,5,2))
plot(derivatives(csm.fit1), subset = "Gross", main = "Pointwise derivatives according to Gross Income", cex.main = 0.8)
plot(derivatives(csm.fit1), subset = "Screens", col = csm$Sequel, pch = 16, main = "Pointwise derivatives according to \n Number of Screens and Sequel", cex.main = 0.8, ylim = c(-0.4, 0.8))
legend("topleft", fill = palette()[1:7], legend = 1:7, title = "Sequel", horiz = TRUE, cex = 0.7)

By default, the X-axis label gives the name of the variable and, in brackets, the kernel in which it is included. When only one kernel is included in the kernel, it is named Ker1. Because genre and sequel variables, even if they are numeric, refer to categorical variables, it is possible to easily highlight some pattern of interaction between these variables and other covariates. Derivatives according to Gross income are mostly positive meaning that higher Gross income is associated with higher ratings. However, the slope of this effect decreases as the gross income grows. However it is difficult to interpret the derivatives for Gross income higher than 2.5e+08 since this category includes only a few movies. According to the second graphic, whether a movie has sequels seems to interact with the effect of the number of screens on the ratings. Indeed when the movie is the first or second release (Sequel = 1 or 2), the derivatives are always negative meaning that higher the number of screens on which the movie is projected, lower the ratings, but the effect seems to be stronger for first release than second. No clear pattern can be observed for subsequent sequels. It is difficult to know if this reveals an absence of effect or whether there are simply too few movies past sequel 2 in these data.

2.2.4 Comparing choices of kernel function

We fit a second model assuming a polynomial kernel function with fixed tuning parameters (\(\rho=1\), \(\gamma=1\) and \(d=2\)).

csm.fit2 <- kspm(response = "Ratings", kernel = ~Kernel(~Gross + Budget + Screens + Sequel, kernel.function = "polynomial", rho = 1, gamma = 1, d = 2), data = csm, level = 0)

This model can be compared to the previous one using some information criteria such as the AIC or BIC. By default the extractAIC() command gives the AIC. A better model is a model with a lower AIC value.

extractAIC(csm.fit1)
extractAIC(csm.fit2)

2.3 Predicting ratings using both conventional and social media features

Of note, in this section kspm() function is applied to a multiple kernel semi-parametric model such that it could be time-consuming.

2.3.1 Fitting the model

Now, we assume a model with three kernel parts, one for conventional features, one for social media features, and the third for their interaction. As for the single kernel case, we use the kspm() command. We modify only the kernel= argument which becomes a formula of two Kernel() objects, one for conventional features and one for social media features. Without interaction, the Kernel() objects are separated with “+”. In presence of interaction, they are separated with “:” (only the interaction) or “\(*\)” (interaction and main effects).

Here, we propose to use the gaussian kernel function for each set of features although different kernels could be used. We fixed the tuning parameters the values estimated for each kernel separately. This model includes 2 kernels and their interaction. Thus, three penalization parameters should be estimated. The estimation can take a long time.

csm.fit3 <- kspm(response = "Ratings", linear = NULL, kernel = ~Kernel(~ Gross + Budget + Screens + Sequel, kernel.function = "gaussian", rho = 61.22) * Kernel(~ Sentiment + Views + Likes + Dislikes + Comments + Aggregate.Followers, kernel.function = "gaussian", rho = 1.562652), data = csm)

The model includes three kernels: the kernel of conventional features (Ker1), the kernel of social media features (Ker2) and their interaction (Ker1:Ker2).

The summary() command returns the p-value for the test of interaction between Ker1 and Ker2, using the option kernel.test = “Ker1:Ker2”. It also returns the p-value for the global test using the option global.test = TRUE.

Adding social media features to the model improved the predictions, as indicated by the increased adjusted \(R^2\). However, smooth function associated with the kernel interaction does not significantly differ from the null, leading to the conclusion that there is no interaction effect between conventional and social media features on the ratings.

2.3.2 Prediction for new data

Suppose we want to predict the ratings that will be attributed to the three artificial movies described below, according to the model csm.fit3.

Gross	Budget	Screens	Sequel	Sentiment	Views	Likes	Dislikes	Comments	Aggregate.Followers
5.0e+07	1.8e+08	3600	2	1	293021	3698	768	336	4530000
50000	5.2e+05	210	1	2	7206	2047	49	70	350000
10000	1.3e+03	5050	1	10	5692061	5025	305	150	960000

First, we build new data frames for each kernel:

# new data frame for Ker1
newdata.Ker1 <- data.frame(Genre = c(1, 3, 8), Gross = c(5.0e+07, 50000, 10000),Budget = c(1.8e+08, 5.2e+05, 1.3e+03), Screens = c(3600, 210, 5050), Sequel = c(2, 1, 1))

# new data frame for Ker2
newdata.Ker2 <- data.frame(Sentiment = c(1, 2, 10), Views = c(293021, 7206, 5692061), Likes = c(3698, 2047, 5025), Dislikes = c(768, 49, 305), Comments = c(336, 70, 150), Aggregate.Followers = c(4530000, 350000, 960000))

Then, we use the predict() command to obtain predictions:

new.predictions <- predict(csm.fit3, newdata.kernel = list(Ker1 = newdata.Ker1, Ker2 = newdata.Ker2), interval = "prediction")
new.predictions

2.3.3 Prediction for current data

We may obtain the predictions (fitted values) for original data directly from the model.

head(csm.fit3$fitted.value)

We may obtain the predictions for the original data directly from the model or from the predict() function. In the second case, confidence intervals may be additionally obtained.

pred <- predict(csm.fit3, interval = "confidence")
head(pred)

The following code displays the prediction values against the true values.

plot(csm$Ratings, pred$fit, xlim = c(2, 10), ylim = c(2, 10), xlab = "Observed ratings", ylab = "Predicted ratings", cex.lab = 1.5)
abline(a = 0, b = 1, col = "red", lty = 2)

2.4 An example of variables selection procedure

Of note, in this section kspm() function is ask to estimate tuning parameters at the same time than penalisation parameter such that it could be time-consuming.

We assume a single kernel semi-parametric model with a gaussian kernel function. The kernel part contains the set of social media features. We want to select the relevant variables to be included in the kernel. We performed a stepwise variables selection procedure based on AIC, by letting \(\rho\) to vary at each iteration. To do so, we first fitted the full model including all features:

# NOT RUN
csm.fit4 <- kspm(response = "Ratings", linear = NULL, kernel = ~Kernel(~ Sentiment + Views + Likes + Dislikes + Comments + Aggregate.Followers, kernel.function = "gaussian"), data = csm, level = 0, control = kspmControl(parallel = TRUE))

Then, we apply the stepKSPM() command on the full model as follows:

# NOT RUN
stepKSPM(csm.fit4, kernel.lower = ~1, kernel.upper = ~ Sentiment + Views + Likes + Dislikes + Comments + Aggregate.Followers, direction = "both", k = 2, kernel.param = "change", data = csm)

The stepKSPM() command prints the details of each iteration. At the end, the model excluded the variables Sentiment and Views.

KSPM: an R package for Kernel Semi-Prametric Models

Catherine Schramm, Aurelie Labbe, Celia Greenwood

2020-08-08