MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models

Ji, Heyang

Beyaztas, Ufuk

Escobar, Nicolas

Luan, Yuanyuan

Chen, Xiwei

Zhang, Mengli

Zoh, Roger

Xue, Lan

Tekwe, Carmen

Abstract

Functional data analysis is a statistical approach used to analyze data that appear as functions or images. Functional data analysis methods can be used to analyze device-based measures such as physical activity and sleep. Although device-based measures are more objective than self-reported measures of physical activity patterns or sleep activity, device-based measures can still be prone to measurement error. When assessing associations between scalar-valued outcomes and device-based measures, scalar-on-function regression models that correct for measurement error can be applied. We develop the MECfda package for several scalar-on-function regression models including multi-level generalized scalar-on-function regression models and functional quantile linear regression models. The developed package implements several bias-correction methods that account for the presence of multiple functional and scalar covariates prone to measurement error in various scalar-on-function regression settings.

Introduction

Functional data analysis (FDA) is an essential statistical approach for handling high-dimensional data that appear in the form of functions or images [1–3]. In many applications, data are collected continuously or at frequent time intervals, resulting in complex functions over time. When the objective is to study the relationship between a scalar response and both functional and scalar covariates, scalar-on-function regression models become a natural choice.

Although self-reported measures (e.g., dietary intake) are known to suffer from measurement error [4], recent studies have demonstrated that even device-based measurements (such as those from wearable devices) can be error prone [5, 6]. Failing to correct for measurement error in either scalar or function-valued covariates may lead to biased parameter estimates.

The goal of the MECfda package is to provide solutions for a range of scalar-on-function regression models—including multi-level generalized scalar-on-function regression models and functional quantile regression models—while incorporating several bias-correction methods for measurement error in scalar-on-function regression problem.

This package is developed in R.

library(MECfda)

Scalar-on-Function Linear Regression Models

The general form of the scalar-on-function regression model is given by \[T(F_{Y_i|X_i,Z_i}) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) dt + (1,Z_i^T)\gamma\] where

A statistical functional is defined as a mapping from a probability distribution to a real number. Statistical functionals represent quantifiable properties of probability distributions, (e.g., an expectation, a variance or a quantile).

Two common model types supported in this package are:

  1. Multi-level Generalized Scalar-on-Function Regression:
    This model is formulated as \[ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = g\Bigl\{E(Y_i\mid X_i,Z_i)\Bigr\} = g\left\{\int_\mathbb{R}ydF_{Y_i|X_i,Z_i}(y)\right\}, \] where \(g(\cdot)\) is a strictly monotonic link function analogous to that used in generalized linear models.

  2. Scalar-on-Function Quantile Regression:
    This model is expressed as \[ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = Q_{Y_i\mid X_i,Z_i}(\tau) = F_{Y_i\mid X_i,Z_i}^{-1}(\tau), \] where \(\tau \in (0,1)\) denotes the quantile of interest.

Representation of Functional Data

Function-valued variables (functional variables) are typically recorded as measurements at specific time points, resulting in an \(n \times m\) data matrix \((x_{ij})\), where each row corresponds to an observation (subject) and each column to a measurement time.

Functional Data

Function-valued variables (functional variables) are typically recorded as the measurements of the values of the functions at specific (time) points in the domain. The data of a function-valued variable are often presented in the form of a matrix, \((x_{ij})_{n\times m}\), where \(x_{ij} = f_i(t_j)\), represents the value of \(f_i(t_j)\), each row corresponds to an observation (subject) and each column to a measurement time.

S4 Class functional_variable

In the MECfda package, the s4 class functional_variable represents function-valued data stored in matrix form. The main slots include:

  • X: The data matrix \((x_{ij})\);
  • t_0 and period: The left endpoint and length of the function’s domain, respectively (The class only support the domain in the format of interval);
  • t_points: A vector of the time points at which the measurements are recorded.

A dedicated method, dim, returns the number of subjects and the number of measurement points for a functional_variable object.

fv = functional_variable(
  X = matrix(rnorm(10*24),10,24),
  t_0 = 0,
  period = 1,
  t_points = (0:9)/10
)
dim(fv)
#>     subject time_points 
#>          10          24

Basis Expansion

Let \(\{\rho_{k}\}_{k=1}^\infty\) be a complete basis for \(L^2(\Omega)\). For an arbitrary function \(\beta(t) \in L^2(\Omega)\), there exists a sequence of (real) number \(\{c_k\}\) such that \[ \beta(t) = \sum_{k=1}^\infty c_k \, \rho_k(t). \]

For the function-valued variable \(X_i(t)\), we have \[ \int_\Omega \beta(t) X_i(t) \, dt = \int_\Omega X_i(t) \sum_{k=1}^\infty c_{k}\rho_{k}(t) dt = \sum_{k=1}^\infty c_k \int_\Omega \rho_k(t) X_i(t) \, dt. \]

In practice, for \(\int_\Omega\beta(t) X_i(t) dt\) in statistical models, we truncate the infinite series to \(p\) terms (\(p<\infty\)), using \(c_{k} \int_\Omega X_i(t) \rho_{k}(t) dt\) to approximate \(\int_\Omega\beta(t) X_i(t) dt\) so that the problem reduces to estimating a finite number of parameters \(c_1, \dots, c_p\), and treating \(\int_\Omega \rho_k(t) X_i(t) \, dt\) as new covariates. The number of basis, \(p\), can be related to the sample size, \(n\), and the \(p\) is in a form of \(p(n)\).

In practice, we may not necessarily use the truncated complete basis of \(L^2\) function space. Instead, we can just use a finite sequence of linearly independent functions as the basis function. Common choices for the basis include the Fourier basis, b-spline basis, and eigenfunction basis.

In numerical computing level, performing basis expansion methods on function-valued variable data in matrix form as we mentioned is to compute \((b_{ik})_{n\times p}\), where \(b_{ik} = \int_\Omega f(t)\rho_k(t) dt\).

Fourier Basis

On the interval \([t_0, t_0+T]\), the Fourier basis consists of

\[ \frac{1}{2}, \quad \cos\left(\frac{2\pi}{T}k(x-t_0)\right), \quad \sin\left(\frac{2\pi}{T}k(x-t_0)\right), \quad k = 1, 2, \dots \]

S4 Class Fourier_series

In the MECfda package, s4 class Fourier_series represents a Fourier series of the form

\[ \frac{a_0}{2} + \sum_{k=1}^{p_a} a_k \cos\left(\frac{2\pi k (x-t_0)}{T}\right) + \sum_{k=1}^{p_b} b_k \sin\left(\frac{2\pi k (x-t_0)}{T}\right),\quad x \in [t_0, t_0+T]. \]

Its main slots include:

  • double_constant: The value of \(a_0\);
  • cos and sin: The coefficients \(a_k\) and \(b_k\) for the cosine and sine terms, respectively;
  • k_cos and k_sin: The frequency values corresponding to the cosine and sine coefficients;
  • t_0 and period: The left endpoint, \(t_0\), and length, \(T\), of the domain (interval \([t_0, t_0+T]\)), respectively.

Additional methods such as plot, FourierSeries2fun, and extractCoef are provided for visualization, function evaluation, and coefficient extraction.

fsc = Fourier_series(
  double_constant = 3,
  cos = c(0,2/3),
  sin = c(1,7/5),
  k_cos = 1:2,
  k_sin = 1:2,
  t_0 = 0,
  period = 1
)
plot(fsc)

FourierSeries2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
extractCoef(fsc)
#> $a_0
#> 0 
#> 3 
#> 
#> $a_k
#>         1         2 
#> 0.0000000 0.6666667 
#> 
#> $b_k
#>   1   2 
#> 1.0 1.4

The object fsc represents the summation \[\frac32 + \frac23 \cos(2\pi\cdot2x) + \sin(2\pi x) + \frac75\sin(2\pi\cdot2x).\]

B-spline Basis

A b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\) on the interval \([t_0, t_{k+1}]\) is defined recursively as follows: - For \(p = 0\), \[ B_{i,0}(x) = \begin{cases} I_{(t_i, t_{i+1}]}(x), & i = 0, 1, \dots, k, \\ 0, & \text{otherwise}. \end{cases} \] - For \(p > 0\), the recursion is given by \[ B_{i,p}(x) = \frac{x-t_i}{t_{i+p}-t_i} B_{i,p-1}(x) + \frac{t_{i+p+1}-x}{t_{i+p+1}-t_{i+1}} B_{i+1,p-1}(x). \]

At discontinuities in the interval \((t_0,t_k)\), the basis function is defined as its limit, \(B_{i,p}(x) = \lim_{t\to x} B_{i,p}(t)\), to ensure continuity.

S4 Classes bspline_basis and bspline_series

The s4 class bspline_basis represents a b-spline basis, \(\{B_{i,p}(x)\}_{i=-p}^{k}\), on the interval \([t_0, t_{k+1}]\). Its key slots include:

  • Boundary.knots: The boundary \([t_0, t_{k+1}]\), (default is \([0,1]\));
  • knots: The internal spline knots, \((t_1,\dots,t_k)\), (automatically generated as equally spaced, \(t_j = t_0 + j\cdot\frac{t_{k+1}-t_0}{k+1}\), if not assigned);
  • intercept: Indicates whether an intercept is included (default is TRUE);
  • df: Degrees of freedom of the basis, which is the number of the splines, equal to \(p + k + 1\) (by default \(k =0\) and \(\text{df} = p+1\));
  • degree: The degree of the spline, which is the degree of piecewise polynomials, \(p\), (default is 3).

The s4 class bspline_series represents the linear combination

\[ \sum_{i=0}^{k} b_i B_{i,p}(x), \]

where the slot coef stores the coefficients \(b_i\) (\(i = 0,\dots,k\)) and bspline_basis specifies the associated basis (represented by a bspline_basis object).

Methods such as plot and bsplineSeries2fun are provided for visualization and function evaluation.

bsb = bspline_basis(
  Boundary.knots = c(0,24),
  df             = 7,
  degree         = 3
)
bss = bspline_series(
  coef = c(2,1,3/4,2/3,7/8,5/2,19/10),
  bspline_basis = bsb
)
plot(bss)

bsplineSeries2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

The object bsb represents \(\{B_{i,3}(x)\}_{i=-3}^{0}\), and the object bss represents
\[2B_{i,-3}(x)+B_{i,-2}(x)+\frac34B_{i,-1}(x)+\frac23B_{i,0}(x) + \frac78B_{i,1}(x) + \frac52B_{i,2}(x) +\frac{19}{10}B_{i,3}(x),\] where \(x\in[t_0,t_4]\) and \(t_0=0\), \(t_k = t_{k-1}+6\) ,\(k=1,2,3,4\).

Eigenfunction basis

Suppose \(K(s,t)\in L^2(\Omega\times \Omega)\), \(f(t)\in L^2(\Omega)\). Then \(K\) induces an linear operator \(\mathcal{K}\) by \[(\mathcal{K}f)(x) = \int_{\Omega} K(t,x)f(t)dt\] If \(\xi \in L^2(\Omega)\) such that \[\mathcal{K}\xi = \lambda \xi\] where \(\lambda\in \mathbb{C}\), we call \(\xi\) an eigenfunction/eigenvector of \(\mathcal{K}\), and \(\lambda\) an eigenvalue associated with \(\xi\).

For a stochastic process \(\{X(t),t\in\Omega\}\) we call the orthogonal basis, \(\{\xi_k\}_{k=1}^\infty\) corresponding to eigenvalues \(\{\lambda_k\}_{k=1}^\infty\) (\(\lambda_1\geq\lambda_2\geq\dots\)), induced by \[K(s,t)=\operatorname{Cov}(X(t),X(s))\] a functional principal component (FPC) basis for \(L^2(\Omega)\).

The eigenfunction basis relies on the covariance function \(K(s,t)\). And we usually cannot analytically express the equation of the basis funcitons in eigenfunction basis. In practice, the covariance function is estimated from the data and the eigenfunctions are computed numerically.

Numerical Basis

Sometimes, analytical expressions for the basis functions are not available. In these cases, they can be represented numerically. Let \(\{\rho_k\}_{k=1}^\infty\) denote a basis of the function space. We can numerically approximate the basis by storing the values of a finite subset of these functions at a finite set of points in the domain, i.e., \(\rho_k(t_j)\) for \(k=1,\dots,p\) and \(j=1,\dots,m\).

S4 Class numeric_basis and numericBasis_series

In this package, the s4 class numeric_basis is designed to represent a finite sequence of functions \(\{\rho_k\}_{k=1}^p\) by their values at a finite set of points within their domain. In this context, all functions share the same domain, which is assumed to be an interval. The key slots include:

  • basis_function: The matrix \((\zeta_{jk})_{m \times p}\), where each element is defined as \(\zeta_{jk} = \rho_k(t_j)\) for \(j = 1, \dots, m\) and \(k = 1, \dots, p\). In this matrix, each row corresponds to a point in the domain, and each column corresponds to a specific basis function.
  • t_points: A numeric vector that represents the points in the domain at which the basis functions are evaluated. The \(j\)-th element of this vector corresponds to the \(j\)-th row of the basis_function matrix.
  • t_0 and period: The left endpoint and length of the domain, respectively.

Additionally, the package provides an s4 class numericBasis_series, which represents a linear combination of the basis functions represented by a numeric_basis object. The key slots include:

  • coef: The linear coefficients for the summation series.
  • numeric_basis: Function basis as represented by a numeric_basis object.

Function basis2fun

The generic function basis2fun is provided for Fourier_series, bspline_series, and numericBasis_series objects. For a Fourier_series object, it is equivalent to FourierSeries2fun; for a bspline_series object, it is equivalent to bsplineSeries2fun; For a numericBasis_series object, it is equivalent to numericBasisSeries2fun.

basis2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
basis2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

Basis expansion methods for the functional_variable class

The MECfda pakcage provide the methods fourier_basis_expansion, bspline_basis_expansion, FPC_basis_expansion, and numeric_basis_expansion for the class functional_variable, which perform basis expansion using Fourier basis, b-spline basis, functional principal component basis, and numerical basis respectively.

data(MECfda.data.sim.0.0)
fv = MECfda.data.sim.0.0$FC[[1]]
BE.fs = fourier_basis_expansion(fv,5L)
BE.bs = bspline_basis_expansion(fv,5L,3L)

Numerical Computation of Integrals

The package approximates the integral

\[ \int_{\Omega} \rho_{k}(t) X_{i}(t) \, dt \approx \frac{\mu(\Omega)}{|T|} \sum_{t \in T} \rho_{k}(t) X_{i}(t), \]

where \(\mu(\cdot)\) denote the Lebesgue measure. \(T\) denotes the set of measurement time points for \(X_i(t)\) and \(|T|\) represents the cardinal number of \(T\).

Scalar-on-Function Linear Regression in MECfda

fcRegression

fcRegression(Y, FC, Z, formula.Z, family = gaussian(link = "identity"),
             basis.type = c("Fourier", "Bspline"), basis.order = 6L, 
             bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting generalized linear mixed-effect models—including ordinary linear models and generalized linear models with fixed and random effects—by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

\[ g\bigl(E(Y_i\mid X_i,Z_i)\bigr) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) \, dt + (1,Z_i^T)\gamma. \]

It supports one or multiple function-valued covariates as fixed effects, and zero, one, or multiple scalar covariates, which can be specified as either fixed or random effects. The response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments Y, FC, and Z, respectively, allowing great flexibility in data format.

Other key arguments include:

The function returns an s3 object of class fcRegression, which is a list containing the following elements:

  1. regression_result: The result of the regression analysis.
  2. FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
  3. function.basis.type: The type of basis used.
  4. basis.order: The number of basis functions used, as specified in the input.
  5. data: The original data provided to the model.
  6. bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

Additionally, the predict method can be used to obtain predicted values from the fitted model, and the fc.beta method is available to evaluate the estimated function-valued coefficient parameters \(\beta_l(t)\).

data(MECfda.data.sim.0.0)
res = fcRegression(FC = MECfda.data.sim.0.0$FC, 
                   Y=MECfda.data.sim.0.0$Y, 
                   Z=MECfda.data.sim.0.0$Z,
                   family = gaussian(link = "identity"),
                   basis.order = 5, basis.type = c('Bspline'),
                   formula.Z = ~ Z_1 + (1|Z_2))
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.500129 5.690171 2.388979 5.441011 4.821000

fcQR

fcQR(Y, FC, Z, formula.Z, tau = 0.5, basis.type = c("Fourier", "Bspline"),
     basis.order = 6L, bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting quantile linear regression models by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

\[Q_{Y_i|X_i,Z_i}(\tau) = \sum_{l=1}^L\int_{\Omega_l} \beta_l(\tau,t) X_{li}(t) dt + (1,Z_i^T)\gamma.\] The function supports one or multiple function-valued covariates, as well as zero, one, or multiple scalar-valued covariates. Data input follows the same guidelines as for the fcRegression function, and the treatment of scalar covariates is determined by the formula.Z argument. The primary difference between fcQR and fcRegression is that the quantile linear regression model does not incorporate random effects. The quantile \(\tau\) is specified by the argument tau, and the type and parameters of the basis functions are defined by the basis.type, basis.order, and bs_degree arguments, just as in fcRegression.

The function returns an s3 object of class fcQR, which is a list containing the following elements:

  1. regression_result: The result of the regression analysis.
  2. FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
  3. function.basis.type: The type of basis used.
  4. basis.order: The number of basis functions used, as specified in the input.
  5. data: The original data provided to the model.
  6. bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

Additionally, the predict method can be used to obtain predicted values from the fitted model, and the fc.beta method is available to evaluate the estimated function-valued coefficient parameters \(\beta_l(t)\).

data(MECfda.data.sim.0.0)
res = fcQR(FC = MECfda.data.sim.0.0$FC, 
           Y=MECfda.data.sim.0.0$Y, 
           Z=MECfda.data.sim.0.0$Z,
           tau = 0.5,
           basis.order = 5, basis.type = c('Bspline'),
           formula.Z = ~ .)
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.497759 5.682573 2.404464 5.440699 4.830085

Measurement Error Models and Bias Correction Estimation Methods

Data collected in real-world settings often include measurement error, especially function-valued data. Measurement error in a data set may result in estimation bias. The *MECfda** package also provides bias correction estimation method functions for certain linear regression models for use with data containing measurement error.

ME.fcRegression_MEM

ME.fcRegression_MEM(
  data.Y,
  data.W,
  data.Z,
  method = c("UP_MEM", "MP_MEM", "average"),
  t_interval = c(0, 1),
  t_points = NULL,
  d = 3,
  family.W = c("gaussian", "poisson"),
  family.Y = "gaussian",
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3,
  smooth = FALSE,
  silent = TRUE
)

Wearable monitoring devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behaviors, such as sleep quality and physical activity.
Continuous monitoring often collect data in 60-second epochs over multiple days, resulting in high-dimensional multi-level longitudinal curves that are best described and analyzed as multi-level functional data.
Although researchers have previously addressed measurement error in scalar covariates prone to error, less work has been done for correcting measurement error in multi-level high dimensional curves prone to heteroscedastic measurement error. Therefore, Luan et. al. proposed mixed effects-based-model-based (MEM) methods for bias correction due to measurement error in multi-level functional data from the exponential family of distributions that are prone to complex heteroscedastic measurement error.

They first developed MEM methods to adjust for biases due to the presence of measurement error in multi-level generalized functional regression models.
They assumed that the distributions of the function-valued covariates prone to measurement error belong to the exponential family. This assumption allows for a more general specification of the distributions of error-prone functional covariates compared to current approaches that often entail normality assumptions for these observed measures. The approach adopted by Luan et al. allows a nonlinear association between the true measurement and the observed measurement prone to measurement error.
Second, they treated the random errors in the observed measures as complex heteroscedastic errors from the Gaussian distribution with covariance error functions. Third, these methods can be used to evaluate relationships between multi-level function-valued covariates with complex measurement error structures and scalar outcomes with distributions in the exponential family.
Fourth, they treat the function-valued covariate as an observed measure of the true function-valued unobserved latent covariate.
Additionally, these methods employ a point-wise method for fitting the multi-level functional MEM approach, avoiding the need to compute complex and intractable integrals that would be required in traditional approaches for reducing biases due to measurement error in multi-level functional data [7].

Their statistical model is as follows: \[\begin{align*} &g(E(Y_i|X_i,Z_i)) = \int_{\Omega} \beta(t) X_{i}(t) dt + (1,Z_i^T)\gamma\\ &h(E(W_{ij}(t)|X_i(t))) = X_i(t)\\ &X_i(t) = \mu_x(t) + \varepsilon_{xi}(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, function-valued covariate \(X_i(t)\) is repeatedly measured with error as \(W_{ij}(t)\). The model also includes the following assumptions:

  1. \(Y_i|X_i,Z_i\sim EF(\cdot)\), \(EF\) refers to an exponential family distribution.
  2. \(g(\cdot)\) and \(h(\cdot)\) are known to be strictly monotone, twice continuously differentiable functions.
  3. \(Cov\{X_i(t),W_{ij}(t)\} \neq 0\),
  4. \(W_{ij}(t)|X_i(t)\sim EF(\cdot)\)
  5. \(f_{Y_i|W_{ij}(t),X_i(t)}(\cdot) = f_{Y_i|X_i(t)}(\cdot)\), \(f\) refers to probability density function.
  6. \(X_i(t)\sim GP\{\mu_x(t),\Sigma_{xx}\}\), \(GP\) refers to the Gaussian process.

They proposed a MEM estimation method to correct for bias caused by the presence of measurement error in the function-valued covariate. This method allows for the investigation of a nonlinear measurement error model, in which the relationship between the true and observed measurements is not constrained to be linear, and the distribution assumption for the observed measurement is relaxed to encompass the exponential family rather than being limited to a Gaussian distribution.

The MEM approach is a two stage method that employs functional mixed-effects models. The first stage of the MEM approach involves using a functional mixed-effects model
to approximate the true measure \(X_i(t)\) with the repeated observed measurement \(W_{ij}(t)\). The MEM approach is primarily based on the assumptions that \(h[E\{W_{ij}(t)|X_i(t)\}] = X_i(t)\) and \(X_i(t) = \mu_x(t) + \varepsilon_{xi}(t)\). That is, in the functional mixed-effects model containing one fixed intercept and one random intercept, the random intercept is assumed to to be the subject-wise deviation of \(X_i(t)\) from the mean process \(\mu_x(t)\), and the fixed intercept is assumed to represent \(\mu_x(t)\). The second stage involves replacing \(X_i(t)\) in the regression model with the resulting approximation of \(X_i(t)\) from the first stage. The MEM approach employs point-wise (UP_MEM) and multi-point-wise (MP_MEM) estimation procedures to avoid potential computational complexities caused by analyses of multi-level functional data and computations of potentially intractable and complex integrals.

The MECfda package provides the function ME.fcRegression_MEM for applying bias-correction estimation methods. In this function, the response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments data.Y, data.W, and data.Z, respectively.

Other key arguments include:

The function ME.fcRegression_MEM returns an object of class fcRegression.

The package also provide a function, MEM_X_hat. This function can return the \(\hat X(t)\) in this bias-correction method.

data(MECfda.data.sim.0.1)
res = ME.fcRegression_MEM(data.Y = MECfda.data.sim.0.1$Y,
                          data.W = MECfda.data.sim.0.1$W,
                          data.Z = MECfda.data.sim.0.1$Z,
                          method = 'UP_MEM',
                          family.W = "gaussian",
                          basis.type = 'Bspline')

ME.fcQR_IV.SIMEX

ME.fcQR_IV.SIMEX(
  data.Y,
  data.W,
  data.Z,
  data.M,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3
)

Health a major concerns for many people, and as technology advances, wearable devices have become the mainstream method for monitoring and evaluating individual physical activity levels. Despite personal preferences in brands and feature design, the accuracy of the data presented is what makes the device a great product. These functional data collected from devices are generally considered more accurate and subjective compared to other objective methods like questionnaires and self-reports. Because physical activity levels are not directly observable, algorithms are used generate corresponding summary reports of different activity level using complex data (e.g. steps, heart rate).
However, these results may differ depending on which device is used. And aside from variation in hardware, data collected could also vary by the combinations between how the device is worn and the activity of interest. Although current devices may be sufficiently accurate to monitor general physical activity levels, more precise data may enable additional functions such as detecting irregular heart rhythms or radiation exposures that would contribute toward improving the health of the general public and elevating the overall well-being of society.

Quantile regression is a tool that was developed to meet the need for modeling complex relationships between a set of covariates and quantiles of an outcome. In obesity studies, the effects of interventions or covariates on body mass index (BMI) are most commonly estimated using ordinary least square methods. However, mean regression approaches are unable to address these effects for individuals whose BMI values fall in the upper quantile of the distribution. Compared with traditional linear regression approaches, quantile regression approaches make no assumptions regarding the distribution of residuals and are robust to outlying observations.

Tekwe et. al.  proposed a simulation extrapolation (SIMEX) estimation method that uses an instrumental variable to correct for bias due to measurement error in scalar-on-function quantile linear regression. They demonstrated the usefulness of this method by evaluating the association between physical activity and obesity using the National Health and Nutrition Examination Survey (NHANES) dataset [8]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta(t) X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is a measured instrumental variable. The model also includes the following assumptions:

  1. \(Cov\{X_i(t),U_i(s)\} = 0\),
  2. \(Cov\{M_i(t),U_i(s)\} = 0\),
  3. \(E(W_{i}(t)|X_i(t)) = X_i(t)\)
  4. \(U_i(t)\sim GP\{\mathcal{0},\Sigma_{uu}\}\), \(GP\) refers to Gaussian process.

for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).

Their bias correction estimation method uses a two-stage strategy to correct for measurement error in a function-valued covariate and then fits a linear quantile regression model. In the first stage, an instrumental variable is used to estimate the covariance matrix associated with the measurement error. In the second stage, SIMEX is used to correct for measurement error in the function-valued covariate.

The MECfda package provides the function ME.fcQR_IV.SIMEX for applying its bias-correction estimation method in scalar-on-function quantile regression. The arguments are described as follows:

The function ME.fcQR_IV.SIMEX returns an object of class ME.fcQR_IV.SIMEX, which is a list containing the following elements:

  1. coef.X: A Fourier_series or bspline_series object representing the function-valued coefficient parameter of the function-valued covariate.
  2. coef.Z: The estimated linear coefficients of the scalar covariates.
  3. coef.all: The original estimate of the linear coefficients.
  4. function.basis.type: The type of function basis used.
  5. basis.order: The number of basis functions used (as specified in the input).
  6. t_interval: A two-element vector representing the interval, which defines the domain of the function-valued covariate.
  7. t_points: The sequence of measurement (time) points.
  8. formula: The regression model.
  9. formula.Z: A formula object containing only the scalar covariates.
  10. zlevels: The levels of any categorical (non-continuous) scalar covariates.

This comprehensive structure ensures that users can flexibly input their data and clearly specify the modeling framework for bias-correction in scalar-on-function quantile regression.

rm(list = ls())
data(MECfda.data.sim.0.2)
res = ME.fcQR_IV.SIMEX(data.Y = MECfda.data.sim.0.2$Y,
                       data.W = MECfda.data.sim.0.2$W,
                       data.Z = MECfda.data.sim.0.2$Z,
                       data.M = MECfda.data.sim.0.2$M,
                       tau = 0.5,
                       basis.type = 'Bspline')

ME.fcQR_CLS

ME.fcQR_CLS(
  data.Y,
  data.W,
  data.Z,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  grid_k,
  grid_h,
  degree = 45,
  observed_X = NULL
)

Zhang et. al.  proposed a corrected loss score estimation method for scalar-on-function quantile linear regression to correct for bias due to measurement error [9]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t) \end{align*}\] where the response variable \(Y_i\) and the scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\).

  1. \(E[U_i(t)]=0\).
  2. \(Cov\{U_i(t),U_i(s)\} = \Sigma_U(s,t)\), where \(\Sigma_U(s,t)\) is unknown.
  3. \(U_i(t)\) are i.i.d for different \(i\).

Zhang et al. proposed a new corrected loss function for a partially functional linear quantile model with functional measurement error in this manuscript. They established a corrected quantile objective function for an observed measurement, which is an unbiased estimator of the quantile objective function that would be obtained if the true measurements were available. The estimators of the regression parameters are obtained by optimizing the resulting corrected loss function. The resulting estimator of the regression parameters is shown to be consistent.

The MECfda package provides a function, ME.fcQR_CLS, for implementing their bias-correction estimation method. Below is a description of its arguments and return values:

The function returns an object of class ME.fcQR_CLS (a list) containing the following elements:

  1. estimated_beta_hat: Estimated coefficients from the corrected loss function (including the functional component).
  2. estimated_beta_t: The estimated functional curve.
  3. SE_est: The estimated parametric variance (returned only if observed_X is not NULL).
  4. estimated_Xbasis: The basis matrix used in the estimation.
  5. res_naive: Results from the naive (uncorrected) method.
rm(list = ls())
data(MECfda.data.sim.0.1)
res = ME.fcQR_CLS(data.Y = MECfda.data.sim.0.1$Y,
                  data.W = MECfda.data.sim.0.1$W,
                  data.Z = MECfda.data.sim.0.1$Z,
                  tau = 0.5,
                  grid_k = 4:7,
                  grid_h = 1:2)

ME.fcLR_IV

ME.fcLR_IV(
  data.Y,
  data.W,
  data.M,
  t_interval = c(0, 1),
  t_points = NULL,
  CI.bootstrap = F
)

Tekwe et. al.  proposed a bias correction estimation method for scalar-on-function linear regression model with measurement error using an instrumental variable [6]. Their statistical model is as follows: \[\begin{align*} &Y_i = \int_0^1 \beta(t)X_i(t)dt + \varepsilon_i\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is an measured instrumental variable. They included the following additional assumptions:

  1. \(E\varepsilon_i(t) = 0\),
  2. \(EU_i(t) = 0\),
  3. \(E\eta_i(t) = 0\),
  4. \(Cov\{X_i(t),\varepsilon_i\} = 0\),
  5. \(Cov\{M_i(t),\varepsilon_i\} = 0\),
  6. \(Cov\{M_i(t),U_i(s)\} = 0\),

for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).

The MECfda package provides the function ME.fcLR_IV for implementing their bias-correction estimation method. The arguments are as follows:

The function returns an object of class ME.fcLR_IV, which is a list containing the following elements:

  1. beta_tW: Parameter estimates.
  2. CI: Confidence interval, which is returned only if CI.bootstrap is TRUE.
rm(list = ls())
data(MECfda.data.sim.0.3)
res = ME.fcLR_IV(data.Y = MECfda.data.sim.0.3$Y,
                 data.W = MECfda.data.sim.0.3$W,
                 data.M = MECfda.data.sim.0.3$M)

Simulated Data Generation

The package MECfda includes some functions to generate simulated data to test the functions in the package.

References

1. Wang J-L, Chiou J-M, Müller H-G. Functional data analysis. Annual Review of Statistics and its application. 2016;3:257–95.
2. Ramsay JO, Silverman BW. Applied functional data analysis: Methods and case studies. Springer; 2002.
3. Ramsay JO, Dalzell C. Some tools for functional data analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1991;53:539–61.
4. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: A modern perspective. Chapman; Hall/CRC; 2006.
5. Crainiceanu CM, Staicu A-M, Di C-Z. Generalized multilevel functional regression. Journal of the American Statistical Association. 2009;104:1550–61.
6. Tekwe CD, Zoh RS, Yang M, Carroll RJ, Honvoh G, Allison DB, et al. Instrumental variable approach to estimating the scalar-on-function regression model with measurement error with application to energy expenditure assessment in childhood obesity. Statistics in medicine. 2019;38:3764–81.
7. Luan Y, Zoh RS, Cui E, Lan X, Jadhav S, Tekwe CD. Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors. arXiv preprint arXiv:230512624. 2023.
8. Tekwe CD, Zhang M, Carroll RJ, Luan Y, Xue L, Zoh RS, et al. Estimation of sparse functional quantile regression with measurement error: A SIMEX approach. Biostatistics. 2022;23:1218–41.
9. Zhang M, Xue L, Tekwe CD, Bai Y, Qu A. Partially functional linear quantile regression model with measurement errors. Statistica Sinica. 2023;33:2257–80.
10. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2017.