Abstract

Functional data analysis is a statistical approach used to analyze data that appear as functions or images. Functional data analysis methods can be used to analyze device-based measures such as physical activity and sleep. Although device-based measures are more objective than self-reported measures of physical activity patterns or sleep activity, device-based measures can still be prone to measurement error. When assessing associations between scalar-valued outcomes and device-based measures, scalar-on-function regression models that correct for measurement error can be applied. We develop the MECfda package for several scalar-on-function regression models including multi-level generalized scalar-on-function regression models and functional quantile linear regression models. The developed package implements several bias-correction methods that account for the presence of multiple functional and scalar covariates prone to measurement error in various scalar-on-function regression settings.

Scalar-on-Function Linear Regression Models

The general form of the scalar-on-function regression model is given by \[T(F_{Y_i|X_i,Z_i}) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) dt + (1,Z_i^T)\gamma\] where

\(Y_i\) is the scalar response variable;
\(X_{li}(t)\) denotes the \(l\)th function-valued covariate defined on the domain \(\Omega_l\);
\(\beta_l(t)\) is the corresponding function-valued regression coefficient, and both \(\beta_l\) and \(X_{li}\) belong to \(L^2(\Omega_l)\);
\(Z_i = (Z_{1i}, \dots, Z_{qi})^T\) represents the scalar covariates;
\(\gamma = (\gamma_0, \gamma_1, \dots, \gamma_q)^T\) is the parameter vector for the scalar covariates;
\(F_{Y_i\mid X_i,Z_i}\) is the cumulative distribution function of \(Y_i\) conditional on \(X_i\) and \(Z_i\);
\(T(\cdot)\) is a statistical functional.

A statistical functional is defined as a mapping from a probability distribution to a real number. Statistical functionals represent quantiﬁable properties of probability distributions, (e.g., an expectation, a variance or a quantile).

Two common model types supported in this package are:

Multi-level Generalized Scalar-on-Function Regression:
This model is formulated as \[ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = g\Bigl\{E(Y_i\mid X_i,Z_i)\Bigr\} = g\left\{\int_\mathbb{R}ydF_{Y_i|X_i,Z_i}(y)\right\}, \] where \(g(\cdot)\) is a strictly monotonic link function analogous to that used in generalized linear models.
Scalar-on-Function Quantile Regression:
This model is expressed as \[ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = Q_{Y_i\mid X_i,Z_i}(\tau) = F_{Y_i\mid X_i,Z_i}^{-1}(\tau), \] where \(\tau \in (0,1)\) denotes the quantile of interest.

Representation of Functional Data

Function-valued variables (functional variables) are typically recorded as measurements at specific time points, resulting in an \(n \times m\) data matrix \((x_{ij})\), where each row corresponds to an observation (subject) and each column to a measurement time.

Functional Data

Function-valued variables (functional variables) are typically recorded as the measurements of the values of the functions at specific (time) points in the domain. The data of a function-valued variable are often presented in the form of a matrix, \((x_{ij})_{n\times m}\), where \(x_{ij} = f_i(t_j)\), represents the value of \(f_i(t_j)\), each row corresponds to an observation (subject) and each column to a measurement time.

S4 Class `functional_variable`

In the MECfda package, the s4 class functional_variable represents function-valued data stored in matrix form. The main slots include:

X: The data matrix \((x_{ij})\);
t_0 and period: The left endpoint and length of the function’s domain, respectively (The class only support the domain in the format of interval);
t_points: A vector of the time points at which the measurements are recorded.

A dedicated method, dim, returns the number of subjects and the number of measurement points for a functional_variable object.

fv = functional_variable(
  X = matrix(rnorm(10*24),10,24),
  t_0 = 0,
  period = 1,
  t_points = (0:9)/10
)
dim(fv)
#>     subject time_points 
#>          10          24

Basis Expansion

Let \(\{\rho_{k}\}_{k=1}^\infty\) be a complete basis for \(L^2(\Omega)\). For an arbitrary function \(\beta(t) \in L^2(\Omega)\), there exists a sequence of (real) number \(\{c_k\}\) such that \[ \beta(t) = \sum_{k=1}^\infty c_k \, \rho_k(t). \]

For the function-valued variable \(X_i(t)\), we have \[ \int_\Omega \beta(t) X_i(t) \, dt = \int_\Omega X_i(t) \sum_{k=1}^\infty c_{k}\rho_{k}(t) dt = \sum_{k=1}^\infty c_k \int_\Omega \rho_k(t) X_i(t) \, dt. \]

In practice, for \(\int_\Omega\beta(t) X_i(t) dt\) in statistical models, we truncate the infinite series to \(p\) terms (\(p<\infty\)), using \(c_{k} \int_\Omega X_i(t) \rho_{k}(t) dt\) to approximate \(\int_\Omega\beta(t) X_i(t) dt\) so that the problem reduces to estimating a finite number of parameters \(c_1, \dots, c_p\), and treating \(\int_\Omega \rho_k(t) X_i(t) \, dt\) as new covariates. The number of basis, \(p\), can be related to the sample size, \(n\), and the \(p\) is in a form of \(p(n)\).

In practice, we may not necessarily use the truncated complete basis of \(L^2\) function space. Instead, we can just use a finite sequence of linearly independent functions as the basis function. Common choices for the basis include the Fourier basis, b-spline basis, and eigenfunction basis.

In numerical computing level, performing basis expansion methods on function-valued variable data in matrix form as we mentioned is to compute \((b_{ik})_{n\times p}\), where \(b_{ik} = \int_\Omega f(t)\rho_k(t) dt\).

Fourier Basis

On the interval \([t_0, t_0+T]\), the Fourier basis consists of

\[ \frac{1}{2}, \quad \cos\left(\frac{2\pi}{T}k(x-t_0)\right), \quad \sin\left(\frac{2\pi}{T}k(x-t_0)\right), \quad k = 1, 2, \dots \]

S4 Class `Fourier_series`

In the MECfda package, s4 class Fourier_series represents a Fourier series of the form

\[ \frac{a_0}{2} + \sum_{k=1}^{p_a} a_k \cos\left(\frac{2\pi k (x-t_0)}{T}\right) + \sum_{k=1}^{p_b} b_k \sin\left(\frac{2\pi k (x-t_0)}{T}\right),\quad x \in [t_0, t_0+T]. \]

Its main slots include:

double_constant: The value of \(a_0\);
cos and sin: The coefficients \(a_k\) and \(b_k\) for the cosine and sine terms, respectively;
k_cos and k_sin: The frequency values corresponding to the cosine and sine coefficients;
t_0 and period: The left endpoint, \(t_0\), and length, \(T\), of the domain (interval \([t_0, t_0+T]\)), respectively.

Additional methods such as plot, FourierSeries2fun, and extractCoef are provided for visualization, function evaluation, and coefficient extraction.

fsc = Fourier_series(
  double_constant = 3,
  cos = c(0,2/3),
  sin = c(1,7/5),
  k_cos = 1:2,
  k_sin = 1:2,
  t_0 = 0,
  period = 1
)
plot(fsc)

FourierSeries2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
extractCoef(fsc)
#> $a_0
#> 0 
#> 3 
#> 
#> $a_k
#>         1         2 
#> 0.0000000 0.6666667 
#> 
#> $b_k
#>   1   2 
#> 1.0 1.4

The object fsc represents the summation \[\frac32 + \frac23 \cos(2\pi\cdot2x) + \sin(2\pi x) + \frac75\sin(2\pi\cdot2x).\]

B-spline Basis

A b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\) on the interval \([t_0, t_{k+1}]\) is defined recursively as follows: - For \(p = 0\), \[ B_{i,0}(x) = \begin{cases} I_{(t_i, t_{i+1}]}(x), & i = 0, 1, \dots, k, \\ 0, & \text{otherwise}. \end{cases} \] - For \(p > 0\), the recursion is given by \[ B_{i,p}(x) = \frac{x-t_i}{t_{i+p}-t_i} B_{i,p-1}(x) + \frac{t_{i+p+1}-x}{t_{i+p+1}-t_{i+1}} B_{i+1,p-1}(x). \]

At discontinuities in the interval \((t_0,t_k)\), the basis function is defined as its limit, \(B_{i,p}(x) = \lim_{t\to x} B_{i,p}(t)\), to ensure continuity.

S4 Classes `bspline_basis` and `bspline_series`

The s4 class bspline_basis represents a b-spline basis, \(\{B_{i,p}(x)\}_{i=-p}^{k}\), on the interval \([t_0, t_{k+1}]\). Its key slots include:

Boundary.knots: The boundary \([t_0, t_{k+1}]\), (default is \([0,1]\));
knots: The internal spline knots, \((t_1,\dots,t_k)\), (automatically generated as equally spaced, \(t_j = t_0 + j\cdot\frac{t_{k+1}-t_0}{k+1}\), if not assigned);
intercept: Indicates whether an intercept is included (default is TRUE);
df: Degrees of freedom of the basis, which is the number of the splines, equal to \(p + k + 1\) (by default \(k =0\) and \(\text{df} = p+1\));
degree: The degree of the spline, which is the degree of piecewise polynomials, \(p\), (default is 3).

The s4 class bspline_series represents the linear combination

\[ \sum_{i=0}^{k} b_i B_{i,p}(x), \]

where the slot coef stores the coefficients \(b_i\) (\(i = 0,\dots,k\)) and bspline_basis specifies the associated basis (represented by a bspline_basis object).

Methods such as plot and bsplineSeries2fun are provided for visualization and function evaluation.

bsb = bspline_basis(
  Boundary.knots = c(0,24),
  df             = 7,
  degree         = 3
)
bss = bspline_series(
  coef = c(2,1,3/4,2/3,7/8,5/2,19/10),
  bspline_basis = bsb
)
plot(bss)

bsplineSeries2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

The object bsb represents \(\{B_{i,3}(x)\}_{i=-3}^{0}\), and the object bss represents
\[2B_{i,-3}(x)+B_{i,-2}(x)+\frac34B_{i,-1}(x)+\frac23B_{i,0}(x) + \frac78B_{i,1}(x) + \frac52B_{i,2}(x) +\frac{19}{10}B_{i,3}(x),\] where \(x\in[t_0,t_4]\) and \(t_0=0\), \(t_k = t_{k-1}+6\) ,\(k=1,2,3,4\).

Eigenfunction basis

Suppose \(K(s,t)\in L^2(\Omega\times \Omega)\), \(f(t)\in L^2(\Omega)\). Then \(K\) induces an linear operator \(\mathcal{K}\) by \[(\mathcal{K}f)(x) = \int_{\Omega} K(t,x)f(t)dt\] If \(\xi \in L^2(\Omega)\) such that \[\mathcal{K}\xi = \lambda \xi\] where \(\lambda\in \mathbb{C}\), we call \(\xi\) an eigenfunction/eigenvector of \(\mathcal{K}\), and \(\lambda\) an eigenvalue associated with \(\xi\).

For a stochastic process \(\{X(t),t\in\Omega\}\) we call the orthogonal basis, \(\{\xi_k\}_{k=1}^\infty\) corresponding to eigenvalues \(\{\lambda_k\}_{k=1}^\infty\) (\(\lambda_1\geq\lambda_2\geq\dots\)), induced by \[K(s,t)=\operatorname{Cov}(X(t),X(s))\] a functional principal component (FPC) basis for \(L^2(\Omega)\).

The eigenfunction basis relies on the covariance function \(K(s,t)\). And we usually cannot analytically express the equation of the basis funcitons in eigenfunction basis. In practice, the covariance function is estimated from the data and the eigenfunctions are computed numerically.

Numerical Basis

Sometimes, analytical expressions for the basis functions are not available. In these cases, they can be represented numerically. Let \(\{\rho_k\}_{k=1}^\infty\) denote a basis of the function space. We can numerically approximate the basis by storing the values of a finite subset of these functions at a finite set of points in the domain, i.e., \(\rho_k(t_j)\) for \(k=1,\dots,p\) and \(j=1,\dots,m\).

S4 Class `numeric_basis` and `numericBasis_series`

In this package, the s4 class numeric_basis is designed to represent a finite sequence of functions \(\{\rho_k\}_{k=1}^p\) by their values at a finite set of points within their domain. In this context, all functions share the same domain, which is assumed to be an interval. The key slots include:

basis_function: The matrix \((\zeta_{jk})_{m \times p}\), where each element is defined as \(\zeta_{jk} = \rho_k(t_j)\) for \(j = 1, \dots, m\) and \(k = 1, \dots, p\). In this matrix, each row corresponds to a point in the domain, and each column corresponds to a specific basis function.
t_points: A numeric vector that represents the points in the domain at which the basis functions are evaluated. The \(j\)-th element of this vector corresponds to the \(j\)-th row of the basis_function matrix.
t_0 and period: The left endpoint and length of the domain, respectively.

Additionally, the package provides an s4 class numericBasis_series, which represents a linear combination of the basis functions represented by a numeric_basis object. The key slots include:

coef: The linear coefficients for the summation series.
numeric_basis: Function basis as represented by a numeric_basis object.

Function `basis2fun`

The generic function basis2fun is provided for Fourier_series, bspline_series, and numericBasis_series objects. For a Fourier_series object, it is equivalent to FourierSeries2fun; for a bspline_series object, it is equivalent to bsplineSeries2fun; For a numericBasis_series object, it is equivalent to numericBasisSeries2fun.

basis2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
basis2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

Basis expansion methods for the `functional_variable` class

The MECfda pakcage provide the methods fourier_basis_expansion, bspline_basis_expansion, FPC_basis_expansion, and numeric_basis_expansion for the class functional_variable, which perform basis expansion using Fourier basis, b-spline basis, functional principal component basis, and numerical basis respectively.

data(MECfda.data.sim.0.0)
fv = MECfda.data.sim.0.0$FC[[1]]
BE.fs = fourier_basis_expansion(fv,5L)
BE.bs = bspline_basis_expansion(fv,5L,3L)

Numerical Computation of Integrals

The package approximates the integral

\[ \int_{\Omega} \rho_{k}(t) X_{i}(t) \, dt \approx \frac{\mu(\Omega)}{|T|} \sum_{t \in T} \rho_{k}(t) X_{i}(t), \]

where \(\mu(\cdot)\) denote the Lebesgue measure. \(T\) denotes the set of measurement time points for \(X_i(t)\) and \(|T|\) represents the cardinal number of \(T\).

Scalar-on-Function Linear Regression in MECfda

fcRegression

fcRegression(Y, FC, Z, formula.Z, family = gaussian(link = "identity"),
             basis.type = c("Fourier", "Bspline", "FPC"), basis.order = 6L, 
             bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting generalized linear mixed-effect models—including ordinary linear models and generalized linear models with fixed and random effects—by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

\[ g\bigl(E(Y_i\mid X_i,Z_i)\bigr) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) \, dt + (1,Z_i^T)\gamma. \]

It supports one or multiple function-valued covariates as fixed effects, and zero, one, or multiple scalar covariates, which can be specified as either fixed or random effects. The response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments Y, FC, and Z, respectively, allowing great flexibility in data format.

Response Variable:
The response can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame or matrix with a column name is recommended so that the response variable’s name is clearly specified.
Function-Valued Covariates:
The function-valued covariates can be input as a functional_variable object, a matrix, a data frame, or a list of these objects. If a single functional_variable object (or matrix or data frame) is provided via FC, the model includes one function-valued covariate. If a list is provided, the model accommodates multiple function-valued covariates, with each element in the list representing a distinct covariate.
Scalar Covariates:
Scalar covariates can be provided as a matrix, data frame, atomic vector, or even NULL if no scalar covariates are present. If Z is omitted or set to NULL, no scalar covariate is included and the argument formula.Z should also be NULL or omitted. When an atomic vector is used for Z, it represents a single scalar covariate (without a name). Hence, even when only one scalar covariate is included, it is recommended to supply it as a matrix or data frame with a column name. The formula.Z argument specifies which parts of Z to use and whether scalar covariates should be treated as fixed or random effects.

Other key arguments include:

family: Specifies the response variable’s distribution and the link function to be used.
basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.
basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for \(k = 1, \dots, p_f\)), basis.order equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\), basis.order is set to \(k+p+1\). For the FPC basis, basis.order represents the number of functional principal components.
bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.

The function returns an s3 object of class fcRegression, which is a list containing the following elements:

regression_result: The result of the regression analysis.
FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
function.basis.type: The type of basis used.
basis.order: The number of basis functions used, as specified in the input.
data: The original data provided to the model.
bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

Additionally, the predict method can be used to obtain predicted values from the fitted model, and the fc.beta method is available to evaluate the estimated function-valued coefficient parameters \(\beta_l(t)\).

data(MECfda.data.sim.0.0)
res = fcRegression(FC = MECfda.data.sim.0.0$FC, 
                   Y=MECfda.data.sim.0.0$Y, 
                   Z=MECfda.data.sim.0.0$Z,
                   family = gaussian(link = "identity"),
                   basis.order = 5, basis.type = c('Bspline'),
                   formula.Z = ~ Z_1 + (1|Z_2))
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.500129 5.690171 2.388979 5.441011 4.821000

fcQR

fcQR(Y, FC, Z, formula.Z, tau = 0.5, 
     basis.type = c("Fourier", "Bspline", "FPC"),
     basis.order = 6L, bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting quantile linear regression models by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

\[Q_{Y_i|X_i,Z_i}(\tau) = \sum_{l=1}^L\int_{\Omega_l} \beta_l(\tau,t) X_{li}(t) dt + (1,Z_i^T)\gamma.\] The function supports one or multiple function-valued covariates, as well as zero, one, or multiple scalar-valued covariates. Data input follows the same guidelines as for the fcRegression function, and the treatment of scalar covariates is determined by the formula.Z argument. The primary difference between fcQR and fcRegression is that the quantile linear regression model does not incorporate random effects. The quantile \(\tau\) is specified by the argument tau, and the type and parameters of the basis functions are defined by the basis.type, basis.order, and bs_degree arguments, just as in fcRegression.

The function returns an s3 object of class fcQR, which is a list containing the following elements:

regression_result: The result of the regression analysis.
FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
function.basis.type: The type of basis used.
basis.order: The number of basis functions used, as specified in the input.
data: The original data provided to the model.
bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

data(MECfda.data.sim.0.0)
res = fcQR(FC = MECfda.data.sim.0.0$FC, 
           Y=MECfda.data.sim.0.0$Y, 
           Z=MECfda.data.sim.0.0$Z,
           tau = 0.5,
           basis.order = 5, basis.type = c('Bspline'),
           formula.Z = ~ .)
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.497759 5.682573 2.404464 5.440699 4.830085

Measurement Error Models and Bias Correction Estimation Methods

Data collected in real-world settings often include measurement error, especially function-valued data. Measurement error in a data set may result in estimation bias. The *MECfda** package also provides bias correction estimation method functions for certain linear regression models for use with data containing measurement error.

ME.fcRegression_MEM

ME.fcRegression_MEM(
  data.Y,
  data.W,
  data.Z,
  method = c("UP_MEM", "MP_MEM", "average"),
  t_interval = c(0, 1),
  t_points = NULL,
  d = 3,
  family.W = c("gaussian", "poisson"),
  family.Y = "gaussian",
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3,
  smooth = FALSE,
  silent = TRUE
)

Wearable monitoring devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behaviors, such as sleep quality and physical activity.
Continuous monitoring often collect data in 60-second epochs over multiple days, resulting in high-dimensional multi-level longitudinal curves that are best described and analyzed as multi-level functional data.
Although researchers have previously addressed measurement error in scalar covariates prone to error, less work has been done for correcting measurement error in multi-level high dimensional curves prone to heteroscedastic measurement error. Therefore, Luan et. al. proposed mixed effects-based-model-based (MEM) methods for bias correction due to measurement error in multi-level functional data from the exponential family of distributions that are prone to complex heteroscedastic measurement error.

They first developed MEM methods to adjust for biases due to the presence of measurement error in multi-level generalized functional regression models.
They assumed that the distributions of the function-valued covariates prone to measurement error belong to the exponential family. This assumption allows for a more general specification of the distributions of error-prone functional covariates compared to current approaches that often entail normality assumptions for these observed measures. The approach adopted by Luan et al. allows a nonlinear association between the true measurement and the observed measurement prone to measurement error.
Second, they treated the random errors in the observed measures as complex heteroscedastic errors from the Gaussian distribution with covariance error functions. Third, these methods can be used to evaluate relationships between multi-level function-valued covariates with complex measurement error structures and scalar outcomes with distributions in the exponential family.
Fourth, they treat the function-valued covariate as an observed measure of the true function-valued unobserved latent covariate.
Additionally, these methods employ a point-wise method for fitting the multi-level functional MEM approach, avoiding the need to compute complex and intractable integrals that would be required in traditional approaches for reducing biases due to measurement error in multi-level functional data [7].

Their statistical model is as follows: \[\begin{align*} &g(E(Y_i|X_i,Z_i)) = \int_{\Omega} \beta(t) X_{i}(t) dt + (1,Z_i^T)\gamma\\ &h(E(W_{ij}(t)|X_i(t))) = X_i(t)\\ &X_i(t) = \mu_x(t) + \varepsilon_{xi}(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, function-valued covariate \(X_i(t)\) is repeatedly measured with error as \(W_{ij}(t)\). The model also includes the following assumptions:

\(Y_i|X_i,Z_i\sim EF(\cdot)\), \(EF\) refers to an exponential family distribution.
\(g(\cdot)\) and \(h(\cdot)\) are known to be strictly monotone, twice continuously differentiable functions.
\(Cov\{X_i(t),W_{ij}(t)\} \neq 0\),
\(W_{ij}(t)|X_i(t)\sim EF(\cdot)\)
\(f_{Y_i|W_{ij}(t),X_i(t)}(\cdot) = f_{Y_i|X_i(t)}(\cdot)\), \(f\) refers to probability density function.
\(X_i(t)\sim GP\{\mu_x(t),\Sigma_{xx}\}\), \(GP\) refers to the Gaussian process.

They proposed a MEM estimation method to correct for bias caused by the presence of measurement error in the function-valued covariate. This method allows for the investigation of a nonlinear measurement error model, in which the relationship between the true and observed measurements is not constrained to be linear, and the distribution assumption for the observed measurement is relaxed to encompass the exponential family rather than being limited to a Gaussian distribution.

The MEM approach is a two stage method that employs functional mixed-effects models. The first stage of the MEM approach involves using a functional mixed-effects model
to approximate the true measure \(X_i(t)\) with the repeated observed measurement \(W_{ij}(t)\). The MEM approach is primarily based on the assumptions that \(h[E\{W_{ij}(t)|X_i(t)\}] = X_i(t)\) and \(X_i(t) = \mu_x(t) + \varepsilon_{xi}(t)\). That is, in the functional mixed-effects model containing one fixed intercept and one random intercept, the random intercept is assumed to to be the subject-wise deviation of \(X_i(t)\) from the mean process \(\mu_x(t)\), and the fixed intercept is assumed to represent \(\mu_x(t)\). The second stage involves replacing \(X_i(t)\) in the regression model with the resulting approximation of \(X_i(t)\) from the first stage. The MEM approach employs point-wise (UP_MEM) and multi-point-wise (MP_MEM) estimation procedures to avoid potential computational complexities caused by analyses of multi-level functional data and computations of potentially intractable and complex integrals.

The MECfda package provides the function ME.fcRegression_MEM for applying bias-correction estimation methods. In this function, the response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments data.Y, data.W, and data.Z, respectively.

data.Y (Response Variable):
The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame with a column name is recommended so that the response variable is explicitly identified.
data.W (Measurement Data):
This argument represents \(W\), the observed measurement of the true function-valued covariate \(X\) in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to a separate observation.
data.Z (Scalar Covariates):
The scalar covariates can be input as a matrix, data frame, or atomic vector. If the model does not include any scalar covariates, data.Z can be omitted or set to NULL. For a single scalar covariate, although an atomic vector is acceptable, a data frame or matrix with a column name is recommended. For multiple scalar covariates, use a matrix or data frame with named columns.

Other key arguments include:

method: Specifies the method for constructing the substitute for \(X\); available options are 'UP_MEM', 'MP_MEM', and 'average'.
t_interval: Defines the domain of the function-valued covariate as a two-element vector (default is c(0,1), representing the interval \([0,1]\)).
t_points: Specifies the sequence of measurement time points (default is NULL).
d: When using the MP_MEM method, this argument sets the number of measurement points involved (default is 3, which is also the minimum value).
family.W: Specifies the distribution of \(W\) given \(X\); available options are 'gaussian' and 'poisson'.
family.Y: Describes the error distribution and link function for the response variable (see stats::family for details).
formula.Z: Indicates which parts of data.Z to include in the model and whether to treat the scalar covariates as fixed or random effects.
basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.
basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for \(k = 1, \dots, p_f\)), basis.order equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\), basis.order is set to \(k+p+1\). For the FPC basis, basis.order represents the number of functional principal components.
bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.
smooth: Specifies whether to smooth the substitution for \(X\) (default is FALSE).
silent: Controls whether the function displays its progress during execution (default is TRUE).

The function ME.fcRegression_MEM returns an object of class fcRegression.

The package also provide a function, MEM_X_hat. This function can return the \(\hat X(t)\) in this bias-correction method.

data(MECfda.data.sim.0.1)
res = ME.fcRegression_MEM(data.Y = MECfda.data.sim.0.1$Y,
                          data.W = MECfda.data.sim.0.1$W,
                          data.Z = MECfda.data.sim.0.1$Z,
                          method = 'UP_MEM',
                          family.W = "gaussian",
                          basis.type = 'Bspline')

ME.fcQR_IV.SIMEX

ME.fcQR_IV.SIMEX(
  data.Y,
  data.W,
  data.Z,
  data.M,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3
)

Health a major concerns for many people, and as technology advances, wearable devices have become the mainstream method for monitoring and evaluating individual physical activity levels. Despite personal preferences in brands and feature design, the accuracy of the data presented is what makes the device a great product. These functional data collected from devices are generally considered more accurate and subjective compared to other objective methods like questionnaires and self-reports. Because physical activity levels are not directly observable, algorithms are used generate corresponding summary reports of different activity level using complex data (e.g. steps, heart rate).
However, these results may differ depending on which device is used. And aside from variation in hardware, data collected could also vary by the combinations between how the device is worn and the activity of interest. Although current devices may be sufficiently accurate to monitor general physical activity levels, more precise data may enable additional functions such as detecting irregular heart rhythms or radiation exposures that would contribute toward improving the health of the general public and elevating the overall well-being of society.

Quantile regression is a tool that was developed to meet the need for modeling complex relationships between a set of covariates and quantiles of an outcome. In obesity studies, the effects of interventions or covariates on body mass index (BMI) are most commonly estimated using ordinary least square methods. However, mean regression approaches are unable to address these effects for individuals whose BMI values fall in the upper quantile of the distribution. Compared with traditional linear regression approaches, quantile regression approaches make no assumptions regarding the distribution of residuals and are robust to outlying observations.

Tekwe et. al. proposed a simulation extrapolation (SIMEX) estimation method that uses an instrumental variable to correct for bias due to measurement error in scalar-on-function quantile linear regression. They demonstrated the usefulness of this method by evaluating the association between physical activity and obesity using the National Health and Nutrition Examination Survey (NHANES) dataset [8]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta(t) X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is a measured instrumental variable. The model also includes the following assumptions:

\(Cov\{X_i(t),U_i(s)\} = 0\),
\(Cov\{M_i(t),U_i(s)\} = 0\),
\(E(W_{i}(t)|X_i(t)) = X_i(t)\)
\(U_i(t)\sim GP\{\mathcal{0},\Sigma_{uu}\}\), \(GP\) refers to Gaussian process.

for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).

Their bias correction estimation method uses a two-stage strategy to correct for measurement error in a function-valued covariate and then fits a linear quantile regression model. In the first stage, an instrumental variable is used to estimate the covariance matrix associated with the measurement error. In the second stage, SIMEX is used to correct for measurement error in the function-valued covariate.

The MECfda package provides the function ME.fcQR_IV.SIMEX for applying its bias-correction estimation method in scalar-on-function quantile regression. The arguments are described as follows:

data.Y (Response Variable):
The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. A one-column data frame with a column name is recommended for clarity.
data.W (Measurement Data):
This argument corresponds to \(W\), the observed measurement of \(X\) in the statistical model. It should be provided as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point.
data.Z (Scalar Covariates):
The scalar covariate data can be omitted (or set to NULL) if no scalar covariates are included in the model. Alternatively, it can be provided as an atomic vector (for a single scalar covariate) or as a matrix/data frame (for multiple scalar covariates). A data frame with column names is recommended.
data.M (Instrumental Variable):
This argument provides the data for \(M\), the instrumental variable. It should be supplied as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point.
tau (Quantile Level):
Specifies the quantile level \(\tau \in (0,1)\) for the quantile regression model. The default is 0.5.
t_interval (Domain of the Function-Valued Covariate):
A two-element vector defining the interval over which the function-valued covariate is defined (default is c(0,1), representing the interval \([0,1]\)).
t_points (Measurement Time Points):
Specifies the sequence of time points at which measurements are taken. The default is NULL.
formula.Z (Scalar Covariate Formula):
This argument determines which components of data.Z are included in the model and how the scalar covariates are treated (i.e., as fixed or random effects).
basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.
basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for \(k = 1, \dots, p_f\)), basis.order equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\), basis.order is set to \(k+p+1\). For the FPC basis, basis.order represents the number of functional principal components.
bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.

The function ME.fcQR_IV.SIMEX returns an object of class ME.fcQR_IV.SIMEX, which is a list containing the following elements:

coef.X: A Fourier_series or bspline_series object representing the function-valued coefficient parameter of the function-valued covariate.
coef.Z: The estimated linear coefficients of the scalar covariates.
coef.all: The original estimate of the linear coefficients.
function.basis.type: The type of function basis used.
basis.order: The number of basis functions used (as specified in the input).
t_interval: A two-element vector representing the interval, which defines the domain of the function-valued covariate.
t_points: The sequence of measurement (time) points.
formula: The regression model.
formula.Z: A formula object containing only the scalar covariates.
zlevels: The levels of any categorical (non-continuous) scalar covariates.

This comprehensive structure ensures that users can flexibly input their data and clearly specify the modeling framework for bias-correction in scalar-on-function quantile regression.

rm(list = ls())
data(MECfda.data.sim.0.2)
res = ME.fcQR_IV.SIMEX(data.Y = MECfda.data.sim.0.2$Y,
                       data.W = MECfda.data.sim.0.2$W,
                       data.Z = MECfda.data.sim.0.2$Z,
                       data.M = MECfda.data.sim.0.2$M,
                       tau = 0.5,
                       basis.type = 'Bspline')

ME.fcQR_CLS

ME.fcQR_CLS(
  data.Y,
  data.W,
  data.Z,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  grid_k,
  grid_h,
  degree = 45,
  observed_X = NULL
)

Zhang et. al. proposed a corrected loss score estimation method for scalar-on-function quantile linear regression to correct for bias due to measurement error [9]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t) \end{align*}\] where the response variable \(Y_i\) and the scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\).

\(E[U_i(t)]=0\).
\(Cov\{U_i(t),U_i(s)\} = \Sigma_U(s,t)\), where \(\Sigma_U(s,t)\) is unknown.
\(U_i(t)\) are i.i.d for different \(i\).

Zhang et al. proposed a new corrected loss function for a partially functional linear quantile model with functional measurement error in this manuscript. They established a corrected quantile objective function for an observed measurement, which is an unbiased estimator of the quantile objective function that would be obtained if the true measurements were available. The estimators of the regression parameters are obtained by optimizing the resulting corrected loss function. The resulting estimator of the regression parameters is shown to be consistent.

The MECfda package provides a function, ME.fcQR_CLS, for implementing their bias-correction estimation method. Below is a description of its arguments and return values:

data.Y (Response Variable):
The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name.
data.W (Measurement Data for \(W\)):
This argument represents \(W\), the observed measurement of \(X\) in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to an observation.
data.Z (Scalar Covariates):
Scalar covariates can be omitted (or set to NULL) if no scalar covariates are included in the model, provided as an atomic vector for a single scalar covariate, or as a matrix or data frame for multiple covariates. The recommended format is a data frame with column names.
tau (Quantile Level):
Specifies the quantile \(\tau \in (0,1)\) to be used in the quantile regression model. The default value is 0.5.
t_interval (Domain of the Function-Valued Covariate):
A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is c(0,1), representing the interval \([0,1]\).
t_points (Measurement Time Points):
Specifies the sequence of measurement (time) points. The default is NULL.
grid_k (Candidate Basis Numbers):
An atomic vector where each element is a candidate number for the basis functions.
grid_h (Candidate Tuning Parameter Values):
A non-zero atomic vector where each element is a candidate value for the tuning parameter.
degree (Degree for Derivative and Integral Calculations):
This argument is used when computing derivatives and integrals, with a default value of 45—which is sufficient for most scenarios.
observed_X (Observed \(X\) Data for Variance Estimation):
Used for estimating parametric variance; the default value is NULL.

The function returns an object of class ME.fcQR_CLS (a list) containing the following elements:

estimated_beta_hat: Estimated coefficients from the corrected loss function (including the functional component).
estimated_beta_t: The estimated functional curve.
SE_est: The estimated parametric variance (returned only if observed_X is not NULL).
estimated_Xbasis: The basis matrix used in the estimation.
res_naive: Results from the naive (uncorrected) method.

rm(list = ls())
data(MECfda.data.sim.0.1)
res = ME.fcQR_CLS(data.Y = MECfda.data.sim.0.1$Y,
                  data.W = MECfda.data.sim.0.1$W,
                  data.Z = MECfda.data.sim.0.1$Z,
                  tau = 0.5,
                  grid_k = 4:7,
                  grid_h = 1:2)

ME.fcLR_IV

ME.fcLR_IV(
  data.Y,
  data.W,
  data.M,
  t_interval = c(0, 1),
  t_points = NULL,
  CI.bootstrap = F
)

Tekwe et. al. proposed a bias correction estimation method for scalar-on-function linear regression model with measurement error using an instrumental variable [6]. Their statistical model is as follows: \[\begin{align*} &Y_i = \int_0^1 \beta(t)X_i(t)dt + \varepsilon_i\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is an measured instrumental variable. They included the following additional assumptions:

\(E\varepsilon_i(t) = 0\),
\(EU_i(t) = 0\),
\(E\eta_i(t) = 0\),
\(Cov\{X_i(t),\varepsilon_i\} = 0\),
\(Cov\{M_i(t),\varepsilon_i\} = 0\),
\(Cov\{M_i(t),U_i(s)\} = 0\),

for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).

The MECfda package provides the function ME.fcLR_IV for implementing their bias-correction estimation method. The arguments are as follows:

data.Y (Response Variable):
The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name.
data.W (Measurement Data for \(W\)):
This argument represents \(W\), the observed measurement of \(X\) in the statistical model. It should be provided as a data frame or matrix, where each row represents a subject and each column corresponds to a measurement (time) point.
data.M (Instrumental Variable):
This argument provides the data for \(M\), the instrumental variable. It should be provided as a data frame or matrix, with each row representing a subject and each column representing a measurement (time) point.
t_interval (Domain of the Function-Valued Covariate):
A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is c(0,1), representing the interval \([0,1]\).
t_points (Measurement Time Points):
Specifies the sequence of measurement (time) points. The default is NULL.
CI.bootstrap (Bootstrap Confidence Interval):
A logical flag indicating whether to return a confidence interval using the bootstrap method. The default is FALSE.

The function returns an object of class ME.fcLR_IV, which is a list containing the following elements:

beta_tW: Parameter estimates.
CI: Confidence interval, which is returned only if CI.bootstrap is TRUE.

rm(list = ls())
data(MECfda.data.sim.0.3)
res = ME.fcLR_IV(data.Y = MECfda.data.sim.0.3$Y,
                 data.W = MECfda.data.sim.0.3$W,
                 data.M = MECfda.data.sim.0.3$M)

MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models

Ji, Heyang

Beyaztas, Ufuk

Escobar, Nicolas

Luan, Yuanyuan

Chen, Xiwei

Zhang, Mengli

Zoh, Roger

Xue, Lan

Tekwe, Carmen

Introduction

Scalar-on-Function Linear Regression Models

Representation of Functional Data

Functional Data

S4 Class `functional_variable`

Basis Expansion

Fourier Basis

S4 Class `Fourier_series`

B-spline Basis

S4 Classes `bspline_basis` and `bspline_series`

Eigenfunction basis

Numerical Basis

S4 Class `numeric_basis` and `numericBasis_series`

Function `basis2fun`

Basis expansion methods for the `functional_variable` class

Numerical Computation of Integrals

Scalar-on-Function Linear Regression in MECfda

fcRegression

fcQR

Measurement Error Models and Bias Correction Estimation Methods

ME.fcRegression_MEM

ME.fcQR_IV.SIMEX

ME.fcQR_CLS

ME.fcLR_IV

Simulated Data Generation

References

MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models

Ji, Heyang

Beyaztas, Ufuk

Escobar, Nicolas

Luan, Yuanyuan

Chen, Xiwei

Zhang, Mengli

Zoh, Roger

Xue, Lan

Tekwe, Carmen

Introduction

Scalar-on-Function Linear Regression Models

Representation of Functional Data

Functional Data

S4 Class functional_variable

Basis Expansion

Fourier Basis

S4 Class Fourier_series

B-spline Basis

S4 Classes bspline_basis and bspline_series

Eigenfunction basis

Numerical Basis

S4 Class numeric_basis and numericBasis_series

Function basis2fun

Basis expansion methods for the functional_variable class

Numerical Computation of Integrals

Scalar-on-Function Linear Regression in MECfda

fcRegression

fcQR

Measurement Error Models and Bias Correction Estimation Methods

ME.fcRegression_MEM

ME.fcQR_IV.SIMEX

ME.fcQR_CLS

ME.fcLR_IV

Simulated Data Generation

References

S4 Class `functional_variable`

S4 Class `Fourier_series`

S4 Classes `bspline_basis` and `bspline_series`

S4 Class `numeric_basis` and `numericBasis_series`

Function `basis2fun`

Basis expansion methods for the `functional_variable` class