In this document we present details of the benchmark of the
`tensorEVD()`

routine against the `eigen()`

function of the ‘base’ R-package (R Core Team 2021) in performing the
eigenvalue decomposition (EVD) of a Hadamard matrix product involving
two covariance structure matrices \(\textbf{K}_1\) and \(\textbf{K}_2\),

\[
\textbf{K} =
(\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2).
\] We assessed the computational time used to derive
eigenvectors, the accuracy of the approximation provided by
`tensorEVD()`

, and the dimension of the resulting basis.
Likewise, we evaluated the performance of the approximation of the
Hadamard \(\textbf{K}\) provided by the
`tensorEVD()`

method in Gaussian linear models in terms of
variance components estimates and cross-validation prediction
accuracies.

The data used in these benchmarks was generated by the
Genomes-To-Fields (G2F) Initiative (Lima *et al*. 2023) which was
curated and expanded by adding environmental covariates (EC) by
Lopez-Cruz *et al*. (2023). We used the subset of the data
corresponding to the northern testing locations that includes \(n=59,069\) records for 4 traits (grain
yield, anthesis, silking, and anthesis-silking interval) from \(n_G = 4,344\) hybrids and \(n_E = 97\) environments.

We used a data analysis pipeline as shown below with folders
`code`

, `data`

, `output`

,
`parms`

, and `source`

.

```
pipeline
├── code
│ ├── 1_simulation.R
│ ├── 2_model_components.R
│ ├── 3_ANOVA_GxE_model.R
│ ├── 4_get_variance_GxE_model.R
│ └── 5_10F_CV_GxE_model.R
├── data
├── output
├── parms
└── source
├── ECOV.csv
├── GENO.csv
└── PHENO.csv
```

Folder `source`

contains the phenotypic (file
`PHENO.csv`

), SNPs (file `GENO.csv`

), and ECs
(file `ECOV.csv`

) data from G2F. These files can be
downloaded from the Figshare repository (https://doi.org/10.6084/m9.figshare.22776806).

Folder `code`

contains the R-scripts to implement the
sequence of analyses detailed in the next sections and can be downloaded
from this link.

The R-scripts were run on the MSU high-performance computing center
(HPCC) (https://icer.msu.edu/hpcc/hardware) as a batch job
script. The header of the scripts contains the shebang line
`#!/usr/bin/env Rscript`

in the first line followed by job
requirements (e.g., memory, number of CPUs, run time) that are specified
using the `SLURM`

scheduler by adding the prefix
`#SBATCH`

at the beginning of each request instruction line,
for example

```
#!/usr/bin/env Rscript
#SBATCH --time=03:59:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=84G
#SBATCH --constraint=intel18
```

Each R-script is submitted to the HPCC using the `sbatch`

command, for instance

```
cd /mnt/scratch/quantgen/TENSOR_EVD/pipeline/code
sbatch 1_simulation.R
```

The R-code below show how to obtain the subset of the data corresponding to the northern locations. Next, this data subset is used to derive a genetic (GRM, VanRaden 2008) for the \(n_G = 4,344\) hybrids as \(\textbf{K}_G = \textbf{X}\textbf{X}'/trace(\textbf{X}\textbf{X}')\), where \(\textbf{X}\) is the matrix of centered SNPs (hybrids in rows, SNPs in columns). Likewise, an environmental relationship matrix (ERM) is derived for the \(n_E = 97\) environments as \(\textbf{K}_E = \textbf{W}\textbf{W}'/trace(\textbf{W}\textbf{W}')\) where \(\textbf{W}\) is the matrix of centered and scaled ECs (environments in rows, ECs in columns).

```
library(data.table)
setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
PHENO <- read.csv("source/PHENO.csv")
GENO <- fread("source/GENO.csv", data.table=FALSE)
ECOV <- read.csv("source/ECOV.csv", row.names=1)
# Select North region
PHENO <- PHENO[PHENO$region %in% 'North',]
PHENO$year_loc <- factor(as.character(PHENO$year_loc))
PHENO$genotype <- factor(as.character(PHENO$genotype))
save(PHENO, file="data/pheno.RData")
# Calculate the GRM
ID <- GENO[,1]
GENO <- as.matrix(GENO[,-1])
rownames(GENO) <- ID
X <- scale(GENO, center=TRUE, scale=FALSE)
KG <- tcrossprod(X)
KG <- KG[levels(PHENO$genotype),levels(PHENO$genotype)]
KG <- KG/mean(diag(KG))
save(KG, file="data/GRM.RData")
# Calculate the ERM
ECOV <- ECOV[,-grep("HI30_",colnames(ECOV))]
KE <- tcrossprod(scale(ECOV))
KE <- KE[levels(PHENO$year_loc),levels(PHENO$year_loc)]
KE <- KE/mean(diag(KE))
save(KE, file="data/ERM.RData")
```

After running this code, R-files `pheno.RData`

,
`GRM.RData`

, and `ERM.RData`

with the phenotypic
northern subset, GRM, and ERM, respectively, are to be saved in the
folder `data`

.

```
pipeline
:
├── data
: ├── ERM.RData
├── GRM.RData
└── pheno.RData
```

We formed Hadamard products \(\textbf{K} =
(\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2)\)
between the GRM (as \(\textbf{K}_1\))
and the ERM (as \(\textbf{K}_2\)) of
various sizes by sampling hybrids (\(n_G=100,500,1000\)), environments (\(n_E=10,30,50\)), and the level of
replication needed to complete a total sample size of \(n=10000,20000,30000\). Then, we factorized
the resulting Hadamard product matrix using the R-base function
`eigen()`

as well as using `tensorEVD()`

, deriving
as many eigenvectors as needed to explain a proportion \(\alpha=0.90,0.95,0.98\) of the total
variance. We implemented 10 replicates of each experiment.

The R-script 1_simulation.R
was used to perform the analysis for a given combination of parameters
(`nG`

, `nE`

, `n`

, `alpha`

,
and `replicate`

). To do this, first, we created an array with
all combinations of the parameters. A `data.frame`

object
called `JOBS`

containing \(3\times3\times3\times3\times10=810\) rows
and the \(5\) parameters in columns,
was created using the `expand.grid`

R-function and saved in
folder `parms`

```
JOBS <- expand.grid(nG = c(100,500,1000),
nE = c(10,30,50),
n = c(10000,20000,30000),
alpha = c(0.90,0.95,0.98),
replicate = 1:10)
dim(JOBS); head(JOBS)
#[1] 810 5
# nG nE n alpha replicate
#1 100 10 10000 0.9 1
#2 500 10 10000 0.9 1
#3 1000 10 10000 0.9 1
#4 100 30 10000 0.9 1
#5 500 30 10000 0.9 1
#6 1000 30 10000 0.9 1
save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS1.RData")
```

To perform all the \(3\times3\times3\times3\times10=810\) cases
(each row of the object `JOBS`

) we submitted the R-script for
multi-job implementation by specifying in the script header a job array
through the `SLURM`

option
`#SBATCH --array=1-810`

, each value of the array is read in
the R-code with the instruction
`Sys.getenv("SLURM_ARRAY_TASK_ID")`

.

The output file `simulation_results.txt`

generated after
running the previous R-script contains at each row the results from each
experiment case. The information of the experiment are given in the
first columns followed by the results on computation time, approximation
accuracy (measured with the Frobenius and CMD metrics), and the number
of eigenvectors (associated to the \(\alpha\)-value) for the
`eigen()`

and `tensorEVD()`

methods.

```
pipeline
:
├── output
: └── simulation
└── simulation_results.txt
```

The first rows of this file are shown below for the results presented in the manuscript (the file can be found in this link).

```
out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_simulation.txt"))
head(out[,1:7])
```

```
## alpha nG nE n replicate nComb nPosEigen
## 1 0.9 100 10 10000 1 1000 1000
## 2 0.9 500 10 10000 1 4329 4329
## 3 0.9 1000 10 10000 1 6376 6376
## 4 0.9 100 30 10000 1 2899 2899
## 5 0.9 500 30 10000 1 7253 7253
## 6 0.9 1000 30 10000 1 8483 8483
```

`head(out[,8:12]) # results from the eigen function`

```
## time_eigen Frobenius_eigen CMD_eigen nPC_eigen pPC_eigen
## 1 242.120 144.4331 0.004126274 406 0.4060000
## 2 243.954 139.0350 0.001844430 968 0.2236082
## 3 248.545 124.0434 0.002342244 1566 0.2456085
## 4 240.986 105.3499 0.002186993 765 0.2638841
## 5 251.789 111.8236 0.001962019 1737 0.2394871
## 6 254.031 104.0453 0.001778629 2298 0.2708947
```

`head(out[,13:17]) # results from the tensorEVD function`

```
## time_tensorEVD Frobenius_tensorEVD CMD_tensorEVD nPC_tensorEVD pPC_tensorEVD
## 1 0.122 139.11850 0.003451564 423 0.4230000
## 2 0.283 129.80934 0.001340386 1128 0.2605683
## 3 0.856 114.21707 0.001431199 2079 0.3260665
## 4 0.241 98.92667 0.001561782 837 0.2887202
## 5 0.836 98.57860 0.001082447 2397 0.3304839
## 6 1.512 88.48203 0.000876179 3828 0.4512555
```

The following R-code chunks can be used to the reproduce Figures 1-3 that are presented in the manuscript. The code for the plots can be found in this link and can be loaded into R using

`source("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/misc/functions.R")`

**Computation time**

R-code below creates a plot for the EVD computation time ratio
(`eigen()`

/`tensorEVD()`

) (Figure 1).

```
# Some data edits
out$alpha <- factor(100*out$alpha)
out$nG <- paste0("n[G]*' = '*",out$nG,"L")
out$nE <- paste0("n[E]*' = '*",out$nE,"L")
out$nG <- factor(out$nG, levels = unique(out$nG))
out$nE <- factor(out$nE, levels = unique(out$nE))
# Reshaping the data
measure <- c("time","Frobenius","CMD","nPC","pPC")
dat <- melt_data(out, id=c("nG","nE","n","alpha"),
measure=paste0(measure,"_"),
value.name=measure, variable.name="method")
color1 <- c('90%'="navajowhite2", '95%'="chocolate1", '98%'="red4")
color2 <- c(eigen="#E69F00", tensorEVD="#009E73", eigs="#56B4E9",
trlan="#CC79A7", chol="#D55E00")
# Figure 1: Computation time ratio (eigen/tensorEVD)
dat0 <- out[out$alpha != "100",]
dat0$alpha <- factor(paste0(dat0$alpha,"%"))
dat0$ratio <- log10(dat0$time_eigen/dat0$time_tensorEVD)
dat0$n <- dat0$n/1000
breaks0 <- seq(1,4,by=1)
figure1 <- make_plot(dat0, type="line", x='n', y='ratio', group="alpha",
group.label=NULL, facet="nG", facet2="nE", facet.type="grid",
xlab="Sample size (x1000)",
ylab="Computation time ratio (eigen/tensorEVD)",
group.color=color1, nSD=0, errorbar.size=0,
breaks.y=breaks0, labels.y=sprintf("%.f",10^breaks0),
scales="fixed")
#print(figure1)
```

**Approximation accuracy**

R-code below produces the approximation accuracy plots using the
Frobenius norm (Figure 2) of the difference between the Hadamard matrix
and the approximation provided by `eigen()`

and
`tensorEVD()`

.

```
dat0 <- dat[dat$method %in% c("eigen","tensorEVD") & dat$alpha!="100",]
dat0$method <- factor(as.character(dat0$method))
dat0$alpha <- factor(as.character(dat0$alpha))
# Figure 2: Approximation accuracy using Frobenious norm
figure2 <- make_plot(dat0, x='alpha', y='Frobenius',
group="method", by="n", facet="nG", facet2="nE", facet.type="grid",
xlab=bquote(alpha~"x100% of variance of K"),
ylab=expression("Frobenius norm ("~abs(abs(K-hat(K)))[F]~")"),
by.label="Sample size", breaks.y=seq(0,500,by=100),
group.color=color2, rect.by.height=-0.05, ylim=c(0,NA))
#print(figure2)
```

**Dimension reduction**

The following R-code creates the plot (Figure 3) showing the number
of eigenvectors produced by the `eigen()`

and
`tensorEVD()`

methods, relative to the rank of the Hadamard
matrix (number of eigenvectors with positive eigenvalue).

```
figure3 <- make_plot(dat0, x='alpha', y='pPC',
group="method", by="n", facet="nG", facet2="nE", facet.type="grid",
xlab=bquote(alpha~"x100% of variance of K"),
ylab="Number of eigenvectors/rank", by.label="Sample size",
group.color=color2, rect.by.height=-0.05,
hline=1, hline.color="red2", ylim=c(0,NA))
#print(figure3)
```

We analyzed each trait (grain yield, anthesis, silking, and
anthesis-silking interval) with a Gaussian reaction norm \(G\times E\) model (Jarquín *et al*.
2014) in which the trait phenotype (\(y_{ijk}\)) is modeled as the sum of the
main effect of hybrid (\(G_i\)), main
effect of environment (\(E_j\)), and
the hybrid\(\times\)environment
interaction (\(GE_{ij}\)) term, this
is

\[ y_{ijk} = \mu + G_i + E_j + GE_{ij} + \varepsilon_{ijk}. \]

Above, \(\mu\) is an intercept and \(i\), \(j\), and \(k\) are indices for the hybrids, environment, and replicate, respectively. The term \(\varepsilon_{ijk}\) is an error term assumed to be independently and identically Gaussian distributed as \(\varepsilon_{ijk} \sim N(0,\sigma_{\varepsilon}^2)\), with \(\sigma_{\varepsilon}^2\) variance parameter associated to the error. Hybrid, environment, and interaction effects were assumed to be multivariate normally distributed with zero mean and effect-specific covariance matrices, specifically \(\textbf{G}\sim MVN(\textbf{0},\sigma_G^2 \textbf{K}_G)\), \(\textbf{E}\sim MVN(\textbf{0},\sigma_E^2 \textbf{K}_E)\), and \(\textbf{GE}\sim MVN(\textbf{0},\sigma_{GE}^2 \textbf{K})\), where

\[ \textbf{K} = (\textbf{Z}_1\textbf{K}_G\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_E\textbf{Z}'_2) \]

is a Hadamard product between the GRM \(\textbf{K}_G\) and the ERM \(\textbf{K}_E\), and \(\sigma_G^2\), \(\sigma_E^2\), and \(\sigma_{GE}^2\) are variance parameters associated to \(\textbf{G}\), \(\textbf{E}\) and \(\textbf{GE}\), respectively.

We used a ‘BRR’ equivalence of the above model by fitting the model

\[ \boldsymbol{y} = \boldsymbol{\mu}+\textbf{X}_G\boldsymbol{\beta}_1+\textbf{X}_E\boldsymbol{\beta}_2 + \textbf{X}_{GE}\boldsymbol{\beta}_3 + \boldsymbol{\varepsilon} \]

where the predictors \(\textbf{X}_G\), \(\textbf{X}_E\), and \(\textbf{X}_{GE}\) are the scaled
eigenvectors of the covariance matrices \(\textbf{K}_G\), \(\textbf{K}_E\), and \(\textbf{K}\), respectively, and the
regression coefficients are assumed to be distributed \(\boldsymbol{\beta}_1\sim
MVN(\textbf{0},\sigma_{G}^2 \textbf{I})\), \(\boldsymbol{\beta}_2\sim
MVN(\textbf{0},\sigma_{E}^2 \textbf{I})\), and \(\boldsymbol{\beta}_3\sim
MVN(\textbf{0},\sigma_{GE}^2 \textbf{I})\). First, we obtained
the decomposition \(\textbf{K}_G=\textbf{V}_1\textbf{D}_1\textbf{V}'_1\)
and \(\textbf{K}_E=\textbf{V}_2\textbf{D}_2\textbf{D}'_2\)
using the *eigen* R-function. Next, for a given proportion \(\alpha\) of variance, we obtained the
decomposition of the Hadamard \(\tilde{\textbf{K}}_{\alpha} =
\tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}\tilde{\textbf{V}}_{\alpha}\)
using the `eigen()`

and `tensorEVD()`

approaches.
Finally, we derived the scaled eigenvectors \(\textbf{X}_G =
\textbf{V}_1\textbf{D}_1^{1/2}\), \(\textbf{X}_E =
\textbf{V}_2\textbf{D}_2^{1/2}\), and \(\tilde{\textbf{X}}_{GE,\alpha} =
\tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}^{1/2}\).
The later was done for values \(\alpha=0.90,0.95,0.98,1.00\).

The calculation of all these matrices was carried out using the
R-script 2_model_components.R
for a given \(\alpha\)-value. We
replicated this task 5 times to obtain an average (and SD) computing
time of the decomposition, to this end, we created a job array for each
combination of parameters `alpha`

and `replicate`

as follows

```
JOBS <- expand.grid(alpha = c(0.90,0.95,0.98,1.00),
replicate = 1:5)
dim(JOBS); head(JOBS)
#[1] 20 2
# alpha replicate
#1 0.90 1
#2 0.95 1
#3 0.98 1
#4 1.00 1
#5 0.90 2
#6 0.95 2
save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS2.RData")
```

The script was submitted using a job array for the \(4\times5=20\) combination of parameters
(each row of the object `JOBS`

) using the `SLURM`

option `#SBATCH --array=1-20`

. After running the R-script,
the following files are created in the folder
`output/genomic_prediction/model_comps`

.

```
pipeline
:
├── output
: └── genomic_prediction
└── model_comps
├── XE.RData
├── XG.RData
├── XGE_90_eigen.RData
├── XGE_90_tensorEVD.RData
├── XGE_95_eigen.RData
├── XGE_95_tensorEVD.RData
├── XGE_98_eigen.RData
├── XGE_98_tensorEVD.RData
├── XGE_100_eigen.RData
└── timing_EVD.txt
```

The \(G\times E\) model was
implemented as a BRR using the *BLRXy*() function from the ‘BGLR’
R-package (Pérez-Rodríguez and de los Campos 2022) for each combination
of trait, method, and \(\alpha\)-value,
with 5 replicates each to present an average (and SD) of the results.
The model was run for \(\alpha=1.0\)
for the *eigen* method only. The *BLRXy*() function fits
the model and generates samples from the posterior distribution of the
regression coefficients \(\boldsymbol{\beta}_1\), \(\boldsymbol{\beta}_2\), and \(\boldsymbol{\beta}_3\) using the Gibbs
sampler. We run the BGLR with `nIter=50000`

and
`burnIn=5000`

parameters.

We implemented the model using the R-script 3a_fit_GxE_model.R
for a given combination of parameters (`trait`

,
`method`

, `alpha`

, and `replicate`

). We
created an array with all combinations of parameters as follows

```
JOBS <- expand.grid(trait = c("yield","anthesis","silking","ASI"),
method = c("eigen","tensorEVD"),
alpha = c(0.90,0.95,0.98,1.00),
replicate = 1:5)
JOBS <- JOBS[-which(JOBS$alpha==1.00 & JOBS$method=="tensorEVD"),]
dim(JOBS); head(JOBS)
#[1] 140 4
# trait method alpha replicate
#1 yield eigen 0.9 1
#2 anthesis eigen 0.9 1
#3 silking eigen 0.9 1
#4 ASI eigen 0.9 1
#5 yield tensorEVD 0.9 1
#6 anthesis tensorEVD 0.9 1
save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS3.RData")
```

The script was submitted using a job array for all the \((4\times2\times4\times5) -
(4\times1\times1\times5)=140\) combination of parameters (each
row of object `JOBS`

) using the `SLURM`

option
`#SBATCH --array=1-140`

. The outputs generated by the
R-script are saved in folder
`output/genomic_prediction/ANOVA`

in a sub-folder
corresponding to each `trait`

, `method`

,
`alpha`

, and `replicate`

combination as shown
below. The posterior samples for coefficients \(\boldsymbol{\beta}_1\), \(\boldsymbol{\beta}_2\), and \(\boldsymbol{\beta}_3\) are stored in files
`ETA_G_b.bin`

, `ETA_E_b.bin`

, and
`ETA_GE_b.bin`

, respectively. For instance, the file
`ETA_G_b.bin`

is a \(q\times
p\) matrix containing at each row the samples \(\hat{\boldsymbol{\beta}}_1^{(1)},\hat{\boldsymbol{\beta}}_1^{(2)},...,\hat{\boldsymbol{\beta}}_1^{(q)}\).

```
pipeline
:
├── output
: └── genomic_prediction
:
└── ANOVA
├── yield
: ├── eigen
: ├── alpha_90
: ├── rep_1
: ├── ETA_E_b.bin
├── ETA_E_varB.dat
├── ETA_G_b.bin
├── ETA_G_varB.dat
├── ETA_GE_b.bin
├── ETA_GE_varB.dat
├── fm.RData
├── mu.dat
└── varE.dat
```

**Obtaining total variance**

We used the sample files to obtain the total variance of hybrid
(\(\textbf{X}_G\boldsymbol{\beta}_1^{(k)}\)),
environment (\(\textbf{X}_E\boldsymbol{\beta}_2^{(k)}\)),
hybrid\(\times\)environment interaction
(\(\textbf{X}_{GE}\boldsymbol{\beta}_3^{(k)}\)),
and error (\(\boldsymbol{\varepsilon}\)) terms in the
BRR model, and reported the average across all samples. As we
standardized the phenotype to have unit variance, these variances can be
seen as the proportion of the phenotypic variance explained by each
model component. We performed this task using the R-script 3b_get_variance_GxE_model.R
using a job array for all the \(140\)
jobs (each row in the object `JOBS`

). The code will create a
`data.frame`

stored in the file `VC.RData`

in the
corresponding sub-folder in folder
`output/genomic_prediction/ANOVA`

.

**Collecting results**

Once the previous R-script has been run for all the jobs, the following code can be used to collect into a single table all the individual variance components results from all jobs.

```
setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
load("parms/JOBS3.RData")
prefix <- "output/genomic_prediction/ANOVA"
out <- c()
for(k in 1:nrow(JOBS))
{
trait <- as.vector(JOBS[k,"trait"])
method <- as.vector(JOBS[k,"method"])
alpha <- as.vector(JOBS[k,"alpha"])
replicate <- as.vector(JOBS[k,"replicate"])
suffix <- paste0(trait,"/",method,"/alpha_",100*alpha,"/rep_",replicate,"/VC.RData")
filename <- paste0(prefix,"/",suffix)
if(file.exists(filename)){
load(filename)
out <- rbind(out, VC)
}else{
message("File not found: '",suffix,"'")
}
}
```

The first rows of this `data.frame`

are displayed below
for the results presented in the manuscript (the file can be found in
this link).

```
out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_ANOVA.txt"))
head(out)
```

```
## trait method alpha replicate source mean SD
## 1 yield eigen 0.9 1 G 0.06956522 0.004202744
## 2 yield eigen 0.9 1 E 0.48419062 0.011957294
## 3 yield eigen 0.9 1 GE 0.07582496 0.004068338
## 4 yield eigen 0.9 1 Error 0.39192312 0.002409890
## 5 anthesis eigen 0.9 1 G 0.12780639 0.004926835
## 6 anthesis eigen 0.9 1 E 0.85284787 0.008838661
```

**Visualizing results**

R-code below can be used to create Figure 4 in the manuscript showing the average proportion of phenotypic variance of grain yield explained by each model. The same code can be used for traits anthesis, silking, and anthesis-silking interval (Supplementary Figures 8, 9, and 10, respectively).

```
out$alpha <- factor(paste0(100*out$alpha,"%"), levels=c("100%","98%","95%","90%"))
out$source <- factor(out$source, levels=c("G","E","GE","Error"))
trait <- c("yield", "anthesis", "silking", "ASI")[1]
myfun <- function(x) sprintf('%.3f', x)
# Figure 4: Phenotypic variance of yield
dat <- out[out$trait==trait,]
figure4 <- make_plot(dat, x='alpha', y='mean', SD="SD",
group="method", facet="source",
xlab=bquote(alpha~"x100% of variance of K"),
ylab=paste0("Proportion of variance of ",trait),
group.color=color2, scales="free_y",
ylabels=myfun, text=myfun, ylim=c(0,NA))
#print(figure4)
```

We evaluated the performance of the approximation \(\tilde{\textbf{K}}_{\alpha} =
\tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}\tilde{\textbf{V}}_{\alpha}\)
of the kernel \(\textbf{K}\) provided
by the `eigen()`

and `tensorEVD()`

methods in the
\(G\times E\) model in terms of
prediction accuracy. We conducted a 10-fold cross-validation (CV) with
hybrids assigned to folds. We predicted all the records of hybrids in
the \(k^{th}\) fold using a model
trained with all records from hybrids in the remaining 9 folds. The
model was implemented for each combination of trait and method for \(\alpha=0.90,0.95,0.98\).

The folds were previously created by Lopez-Cruz *et al*.
(2023) and are provided in the column `CV_10fold`

of the
phenotypic data file. The number records in each fold ranges between
5436 and 6277.

```
setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
load("data/pheno.RData")
table(PHENO$CV_10fold)
# 1 2 3 4 5 6 7 8 9 10
#6180 6277 6246 5785 6160 5858 5492 5660 5436 5975
```

We implemented this CV using the R-script 4_10F_CV_GxE_model.R
which fits the model for a given fold for a given combination of trait,
method, and \(\alpha\)-value. To this
end, we created an array with all combinations of parameters
(`trait`

, `method`

, `alpha`

, and
`fold`

) as follows

```
JOBS <- expand.grid(trait = c("yield","anthesis","silking","ASI"),
method = c("eigen","tensorEVD"),
alpha = c(0.90,0.95,0.98),
fold = 1:10)
dim(JOBS); head(JOBS)
#[1] 240 4
# trait method alpha fold
#1 yield eigen 0.9 1
#2 anthesis eigen 0.9 1
#3 silking eigen 0.9 1
#4 ASI eigen 0.9 1
#5 yield tensorEVD 0.9 1
#6 anthesis tensorEVD 0.9 1
save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS4.RData")
```

The script was submitted using a job array for all the \(4\times2\times3\times10=240\) combination
of parameters (each row of object `JOBS`

) using the
`SLURM`

option `#SBATCH --array=1-240`

. The
outputs generated by the R-script are saved in folder
`output/genomic_prediction/10F_CV`

in a sub-folder
corresponding to each `trait`

, `method`

, and
`alpha`

combination as shown below. Each file
`results_fold_*.RData`

contains a table with the predicted
and observed values within each fold.

```
pipeline
:
├── output
: └── genomic_prediction
:
└── 10F_CV
├── yield
: ├── eigen
: ├── alpha_90
├── results_fold_1.RData
├── results_fold_2.RData
├── results_fold_3.RData
├── results_fold_4.RData
├── results_fold_5.RData
├── results_fold_6.RData
├── results_fold_7.RData
├── results_fold_8.RData
├── results_fold_9.RData
└── results_fold_10.RData
```

**Within-environment prediction accuracy**

The following R-code can be used to calculate the within environment
(column `year_loc`

) correlation between observed and
predicted phenotypes. The code will collect first the results in files
`results_fold_*.RData`

for all folds from each job (a
combination of `trait`

, `method`

, and
`alpha`

). The results from all jobs are to be saved in a
single table.

```
setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
source("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/misc/functions.R")
load("parms/JOBS4.RData")
prefix <- "output/genomic_prediction/10F_CV"
dat <- c()
for(trait in levels(JOBS$trait)){
for(method in levels(JOBS$method)){
for(alpha in unique(JOBS$alpha)){
out0 <- c()
for(fold in unique(JOBS$fold)){
suffix <- paste0(trait,"/",method,"/alpha_",100*alpha,"/results_fold_",fold,".RData")
filename <- paste0(prefix,"/",suffix)
if(file.exists(filename)){
load(filename)
out0 <- rbind(out0, out)
}else{
message("File not found: '",suffix,"'")
}
}
tmp <- get_corr(out0, by="year_loc")
dat <- rbind(dat, data.frame(trait,method,alpha,tmp))
}
}
}
# Reshaping the data
dat$trait <- factor(dat$trait, levels=levels(JOBS$trait))
out <- reshape2::dcast(dat, trait+alpha+year_loc+nRecords~method, value.var="correlation")
tmp <- reshape2::dcast(dat, trait+alpha+year_loc+nRecords~method, value.var="SE")[,levels(JOBS$method)]
colnames(tmp) <- paste0(colnames(tmp),".SE")
out <- data.frame(out, tmp)
```

The first rows of this table are displayed below for the results presented in the manuscript (the file can be found in this link).

```
out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_10F_CV.txt"))
head(out)
```

```
## trait alpha year_loc nRecords eigen tensorEVD eigen.SE tensorEVD.SE
## 1 yield 0.9 2014-IAH1 1557 0.3847606 0.3837099 0.02340692 0.02341801
## 2 yield 0.9 2014-ILH1 462 0.3956203 0.3982038 0.04282128 0.04276919
## 3 yield 0.9 2014-INH1 477 0.5379250 0.5497122 0.03867916 0.03832868
## 4 yield 0.9 2014-MNH1 443 0.5814797 0.5814698 0.03874100 0.03874133
## 5 yield 0.9 2014-MOH2 495 0.5256614 0.5253639 0.03831333 0.03832160
## 6 yield 0.9 2014-NEH1 448 0.2133686 0.2117438 0.04626095 0.04627769
```

**Visualizing results**

R-code below can be used to create Figure 5 in the manuscript showing
the within environment prediction correlation using the
`tensorEVD()`

and `eigen()`

methods for each
combination of trait and \(\alpha\)-value.

```
out$trait <- factor(out$trait, levels=unique(out$trait))
out$alpha <- factor(paste0(100*out$alpha,"%"), levels=c("98%","95%","90%"))
# Figure 5: Within environment prediction correlation
rg <- range(c(out$eigen,out$tensorEVD))
if(requireNamespace("ggplot2", quietly=TRUE)){
figure5 <- ggplot2::ggplot(out, ggplot2::aes(tensorEVD, eigen)) +
ggplot2::geom_abline(color="gray70", linetype="dashed") +
ggplot2::geom_point(fill="#56B4E9", shape=21, size=1.4) +
ggplot2::facet_grid(trait ~ alpha) +
ggplot2::theme_bw() + ggplot2::xlim(rg) + ggplot2::ylim(rg)
}
#print(figure5)
```

Henderson C. R., 1985 Best linear unbiased prediction of nonadditive genetic merits in noninbred populations. J. Anim. Sci. 60: 111–117.

Jarquín D., J. Crossa, X. Lacaze, P. Du Cheyron, J. Daucourt, et al., 2014 A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127: 595–607.

Lima D. C., J. D. Washburn, J. I. Varela, Q. Chen, J. L. Gage, et al., 2023 Genomes to Fields 2022 Maize genotype by Environment Prediction Competition. BMC Res. Notes 16.

Lopez-Cruz M., F. Aguate, J. Washburn, S. K. Dayane, C. Lima, et al., 2023 Leveraging Data from the Genomes to Fields Initiative to Investigate G×E in Maize in North America. Nat. Comm. (in press)

Perez-Rodriguez P., and G. de los Campos, 2022 Additions to the BGLR
R-package: a new function for biobank size data and Bayesian
multivariate models, pp. 1486–1489 in *Proceedings of 12th World
Congress on Genetics Applied to Livestock Production (WCGALP)*,
Rotterdam.

R Core Team, 2021 *R: A Language and environment for statistical
computing*. R Foundation for Statistical Computing, Vienna,
Austria.