KSPM: an R package for Kernel Semi-Prametric Models

Catherine Schramm, Aurelie Labbe, Celia Greenwood

2020-08-08

 

The KSPM package was implemented to fit the single and multiple kernel semi-parametric models for continuous outcome. The package includes the most popular kernel functions, allows kernel interactions and test of variance for each kernel part. The summary(), plot() and predict() commands are available and are similar to those included in standard regression packages. We added a graphical tool for the interpretation of individual effect of variables based on pointwise derivatives. Finally, a variable selection procedure is implemented for the single kernel semi-parametric model.

Some key theoretical details are given in the first section. Then, two examples illustrate how to use the package. The first example is about prediction of movie ratings according to conventional and social media features. The second example is a temporal set of data on energy consumption and meteorology.

 

Note that the KSPM package depends on CompQuadForm, expm and DEoptim packages that should be installed before the importation of KSPM via library() command.

 

1 Key theoretical details

1.1 The model

1.1.1 Single kernel semi-parametric model

Let \(Y = (Y_1, ..., Y_n)^T\) be a \(n \times 1\) vector of continuous outcomes where \(Y_i\) is the value for subject \(i = 1, ..., n\), and \(X\) and \(Z\) are \(n \times p\) and \(n \times q\) matrices of predictors, such that \(Y\) depends on \(X\) and \(Z\) as follows:

\[Y = X\beta + h(Z) + e\]

where \(\beta\) is a \(p \times 1\) vector of regression coefficients for the linear parametric part, \(h(.)\) is an unknown smooth function and \(e\) is an \(n \times 1\) vector of independent and homoscedastic errors such as \(e_i \sim \mathcal{N}(0, \sigma^2)\). \(X\beta\) is referred to as the linear part of the model and we assume that \(X\) contains an intercept. \(h(Z)\) is referred to as the kernel part. The function \(h(.)\) lies in \(\mathcal{H}_k\), the function space generated by a kernel function \(k(.,.)\). \(K\) is the \(n \times n\) Gram matrix of \(k(.,.)\) such that \(K_{ij} = k(Z_i, Z_j)\) represents the similarity between subjects \(i\) and \(j\) according to \(Z\). In the estimation process, \(h()\) is expressed as a linear combination of \(k(.,.)\) such that:

\[h = K \alpha\]

1.1.2 Multiple kernel semi-parametric model

Now suppose there are \(L\) matrices \(Z_1, ..., Z_L\) of dimension \(n \times q_1\), …, \(n \times q_L\), respectively. Keeping a similar notation, \(Y\) can be assumed to depend on \(X\) and \(Z_1\), …, \(Z_L\) as follows:

\[Y = X\beta + \sum\limits_{\ell = 1}^L h_{\ell}(Z_{\ell}) + e\]

where \(\forall \ell \in \left\lbrace 1, ..., L \right\rbrace\), each \(h_{\ell}\) is an unknown smooth function from \(\mathcal{H}_{k_{\ell}}\), the function space generated by a kernel function \(k_{\ell}(.,.)\) whose Gram matrix is noted \(K_{\ell}\).

1.1.3 Kernel interactions

Suppose there exists an interaction between two sets of variables \(Z_1\) and \(Z_2\), \(n \times q_1\) and \(n \times q_2\) matrices. We assume that \(Y\) depends on \(X\), \(Z_1\) and \(Z_2\) as follows:

\[Y = X\beta + h_1(Z_1) + h_2(Z_2) + h_{12}(Z_1, Z_2) + e\]

where \(h_1(.)\), \(h_2(.)\) and \(h_{12}(.,.)\) are unknown smooth functions from \(\mathcal{H}_{k_1}\), \(\mathcal{H}_{k_2}\) and \(\mathcal{H}_{k_{12}}\) respectively. The \(h_{12}(.,.)\) function represents the interaction term. Its associated kernel function \(k_{12}(.,.)\) is defined as the product of kernel functions \(k_1(.,.)\) and \(k_2(.,.)\). Obviously, this 2-way interaction model could be generalized to higher order interaction terms.

1.2 Kernel functions

Below is the list of all kernel functions suported in this package, where \((i,j)\) represents a pair of subjects with covariates \(Z_i\) and \(Z_j\), respectively:

Of note, users can define their own kernel functions by providing the corresponding kernel Gram matrices.

Kernel functions involve tuning parameters, noted \(\rho\) and \(\gamma\) above. In practice, these tuning parameters are chosen by the user or estimated by the package.

1.3 Estimation of parameters

The parameter estimates are obtained by maximizing the penalized log-likelihood associated with the kernel semi-parametric model. Penalisation and tuning parameters are estimated by minimizing the leave-one-out error. Estimated parameters are:

1.4 Tests of hypotheses in KSPM

For either a single kernel or a multiple kernel semi-parametric model, a standard test of interest is \(H_0: h_{\ell}(.) = 0\), ie. there is no effect, singly or jointly, of a set of variables on the outcome. If interest lies in a global test for multiple kernel semi-parametric models, the null hypothesis is \(H_0: h_1(.) = ... = h_L(.) = 0\). Both tests are available in the package. They consist in score test for variance component in the equivalent linear mixed effect model.

1.5 Interpretation of variable effects through derivatives

A kernel represents similarity between subjects through combining a set of variables in a possible complex way, that may include their possible interactions. Hence, the general effect of a kernel on an outcome is hard to see and difficult to interpret. Nevertheless, the marginal effect of each variable in the kernel may be illustrated using a partial derivative function. Indeed, the derivative measures how the change in a variable of interest impacts the outcome in an intuitively understandable way.

 

2 Example: movie ratings dataset

library(KSPM)

2.1 Data importation

The conventional and social media movies (csm) dataset contains data on 187 movies from 2014 and 2015. To predict ratings, movies are described using a set of 6 variables called conventional features (genre, sequel, budget in USD, gross income in USD, number of screens in USA) and another set of 6 variables called social media features (aggregate actor followers on Twitter, number of views, likes, dislikes, comments of movie trailer on Youtube, sentiment score). The dataset may be loaded from the package using the data() command. This data set is a subset of the one available on the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and fully described in Ahmed et al. (2015).

2.2 Predicting ratings using the set of conventional features

2.2.1 Fitting the model

The model is fitted using the kspm() command. Its principal arguments are:

  • response: outcome variable (ratings in our example)
  • linear: linear covariates (in our example, no linear term is specified by the user and only an intercept is included in the linear part of the model)
  • kernel: kernel covariates (in our example: the set of conventional features)
  • data: dataset csm in our example

The kernel argument is defined by a formula of Kernel() objects. In our example, we assume a Gaussian kernel function to fit the joint effect of the conventional features on ratings. By leaving rho = NULL, we do not provide any value for the \(\rho\) parameter and it will be estimated at the same time as the penalization parameter by minimizing the leave-one-out error. Alternatively, \(\rho\) can fixed by the user using the argument rho=.

While the model is running, the kspm() command prints some indications about the kernel(s) included in the model. Standard functions such as summary() can also be used to print results.

The summary output gives information about the sample size used in the model, the residual distribution, the coefficient for the linear part (similarly to other regression R packages), and the penalization parameter and score test associated with the kernel part. In such a complex model, the number of free parameters, i.e. the standard definition of the degrees of freedom of a model is undefined, and we use instead the effective degrees of freedom of the model, which is not necessary a whole number. However, our definition for effective degrees of freedom is similar to the one used in linear regression models and depends on the trace of the hat matrix.

2.2.2 Diagnostic plots

The plot() command displays standard diagnostic plots for regression models, such as residuals versus fitted values, residuals versus leverage or residuals versus cook’s distance, using the argument which=. Note that outliers are anotated.

Of note, the lower tail of residuals distribution is not as expected under normality. The histogram of ratings shows that could be due to the skewed distribution of the response variable.

2.2.3 Interpretation of coefficients through pointwise derivatives plots

Like \(\beta\) coefficients in linear regression models, the derivative function may help to interpret the effect of each variable individually on the outcome. The derivatives() command computes partial derivatives of the estimated kernel function according to each variable included in the kernel. Therefore, it illustrates their individual effects on the outcome at each point of the dataset.

The sign of the derivatives corresponds to the direction of variation of the function capturing the effect of variable on the outcome. The plot() command applied to a derivatives object gives a graphical interpretation of these effects.

By default, the X-axis label gives the name of the variable and, in brackets, the kernel in which it is included. When only one kernel is included in the kernel, it is named Ker1. Because genre and sequel variables, even if they are numeric, refer to categorical variables, it is possible to easily highlight some pattern of interaction between these variables and other covariates. Derivatives according to Gross income are mostly positive meaning that higher Gross income is associated with higher ratings. However, the slope of this effect decreases as the gross income grows. However it is difficult to interpret the derivatives for Gross income higher than 2.5e+08 since this category includes only a few movies. According to the second graphic, whether a movie has sequels seems to interact with the effect of the number of screens on the ratings. Indeed when the movie is the first or second release (Sequel = 1 or 2), the derivatives are always negative meaning that higher the number of screens on which the movie is projected, lower the ratings, but the effect seems to be stronger for first release than second. No clear pattern can be observed for subsequent sequels. It is difficult to know if this reveals an absence of effect or whether there are simply too few movies past sequel 2 in these data.

2.2.4 Comparing choices of kernel function

We fit a second model assuming a polynomial kernel function with fixed tuning parameters (\(\rho=1\), \(\gamma=1\) and \(d=2\)).

This model can be compared to the previous one using some information criteria such as the AIC or BIC. By default the extractAIC() command gives the AIC. A better model is a model with a lower AIC value.

2.3 Predicting ratings using both conventional and social media features

Of note, in this section kspm() function is applied to a multiple kernel semi-parametric model such that it could be time-consuming.

2.3.1 Fitting the model

Now, we assume a model with three kernel parts, one for conventional features, one for social media features, and the third for their interaction. As for the single kernel case, we use the kspm() command. We modify only the kernel= argument which becomes a formula of two Kernel() objects, one for conventional features and one for social media features. Without interaction, the Kernel() objects are separated with “+”. In presence of interaction, they are separated with “:” (only the interaction) or “\(*\)” (interaction and main effects).

Here, we propose to use the gaussian kernel function for each set of features although different kernels could be used. We fixed the tuning parameters the values estimated for each kernel separately. This model includes 2 kernels and their interaction. Thus, three penalization parameters should be estimated. The estimation can take a long time.

The model includes three kernels: the kernel of conventional features (Ker1), the kernel of social media features (Ker2) and their interaction (Ker1:Ker2).

The summary() command returns the p-value for the test of interaction between Ker1 and Ker2, using the option kernel.test = “Ker1:Ker2”. It also returns the p-value for the global test using the option global.test = TRUE.

Adding social media features to the model improved the predictions, as indicated by the increased adjusted \(R^2\). However, smooth function associated with the kernel interaction does not significantly differ from the null, leading to the conclusion that there is no interaction effect between conventional and social media features on the ratings.

2.3.2 Prediction for new data

Suppose we want to predict the ratings that will be attributed to the three artificial movies described below, according to the model csm.fit3.

Gross Budget Screens Sequel Sentiment Views Likes Dislikes Comments Aggregate.Followers
5.0e+07 1.8e+08 3600 2 1 293021 3698 768 336 4530000
50000 5.2e+05 210 1 2 7206 2047 49 70 350000
10000 1.3e+03 5050 1 10 5692061 5025 305 150 960000

First, we build new data frames for each kernel:

Then, we use the predict() command to obtain predictions:

2.3.3 Prediction for current data

We may obtain the predictions (fitted values) for original data directly from the model.

We may obtain the predictions for the original data directly from the model or from the predict() function. In the second case, confidence intervals may be additionally obtained.

The following code displays the prediction values against the true values.

2.4 An example of variables selection procedure

Of note, in this section kspm() function is ask to estimate tuning parameters at the same time than penalisation parameter such that it could be time-consuming.

We assume a single kernel semi-parametric model with a gaussian kernel function. The kernel part contains the set of social media features. We want to select the relevant variables to be included in the kernel. We performed a stepwise variables selection procedure based on AIC, by letting \(\rho\) to vary at each iteration. To do so, we first fitted the full model including all features:

Then, we apply the stepKSPM() command on the full model as follows:

The stepKSPM() command prints the details of each iteration. At the end, the model excluded the variables Sentiment and Views.

3 Illustration 2: consumption of energy

Our second example is a set of data on energy consumed each hour on Ouessant island (France) from September the 13th, 2015 to October the 4th, 2015. The data set also contains corresponding meteorologic data such as temperature (Celsius degrees), pressure (Pa) and humidity rate (g/m\(^3\)). These measures were derived from those that were made publicly available by Electricite de France on https://iles-ponant-edf-sei.opendatasoft.com and Meteo France on https://www.infoclimat.fr. In total the energy data set contains 504 measurements.

3.1 Data importation

These data demonstrate strong periodicity depending on time of day. The following code has been generated to display the data.

To predict the energy consumption, we will consider the 408 first measurements (17 days) as the training set. The other 96 days will be used as a test set.

3.2 A first model

We fit a single kernel semi-parametric model to the training data. We assumed that energy depends linearly on temperature, and therefore this variable is included in the linear part of the model. The other meteorologic data as well as hours in 24-hour numeric format are included in the kernel part of the model. We used a gaussian kernel.

3.3 Impact of the tuning parameter \(\rho\) on predictions and derivatives

We recomputed the model with different values of the tuning parameter \(\rho\) to explore sensitivity of the results to this key kernel paramater. The actual value is:

Other values were chosen ten fold smaller and larger than the estimated value of \(0.70\).

We display the predictions obtained on both the training and test data sets, as well as the derivatives, as a function of the hours variable,for the three models energy.fit1, energy.fit2 and energy.fit3 that differ only on the value of the tuning parameter \(\rho\).

### parameters for figures panel
par(oma = c(1, 4, 6, 1))
par(mfrow = c(4,3), mar = c(5,5,1,1))

### kspm.fit1 (rho = 0.7)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit1, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit1, newdata.linear =  energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit1), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))

### kspm.fit3 (rho = 0.07)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit2, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit2, newdata.linear =  energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit2), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))

### kspm.fit2 (rho = 7)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit3, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit3, newdata.linear =  energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit3), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))

# Legends
plot.new()
legend("topleft", lty = c(1,1,2), col = c("black", "red", "blue"), legend = c("True data", "Predictions", "Confidence intervals"), cex = 2, bty = "n")
plot.new()
legend("topleft", lty = c(1,1,2), col = c("black", "red", "blue"), legend = c("True data", "Predictions", "Prediction intervals"), cex = 2,  bty = "n")
plot.new()
legend("topleft", pch = 1, col = c("black"), legend = c("Pointwise derivatives \n 1 point = 1 measure"), cex = 2, bty = "n")


### legends on the left
par(fig = c(0,0.05,0,0.25), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.8, "Legend", srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.26,0.5), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 7")), srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.5,0.72), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 0.07")), srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.72,0.97), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 0.7")), srt = 90, cex = 2, adj = 0.5)

### legends on the top
par(fig = c(0.05,0.36,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Predictions on training data set",  cex = 2, adj = 0.5)
par(fig = c(0.37,0.68,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Predictions on test data set",  cex = 2, adj = 0.5)
par(fig = c(0.69,1,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Derivatives for hour.num",  cex = 2, adj = 0.5)

The first row corresponds to the model with the value of \(\rho\) estimated by the model. Predictions fit well for both the training and the test sets. The derivative graph shows that between 6h00 and 12h00 and between 18h00 and 23h00 the derivatives are positive. It is coherent with the increase of energy consumption during these times of the day. Inversely, between 0h00 and 6h00 and between 12h00 and 18h00, the energy consumption decreases as showed by negative values of derivatives. One might have expected that derivative values should be close between 0h00 and 23h00 but the consumption peak observed at 23h00 everyday may explain a sudden change in derivative values that is difficult to smooth.

4 References

Ahmed M, Jahangir M, Afzal H, Majeed A, Siddiqi I (2015). “Using Crowd-Source Based Features from Social Media and Conventional Features to Predict the Movies Popularity.” In IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pp 273-278. IEEE.