--- title: "Basic VWP Preprocessing" author: "Vincent Porretta" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Basic VWP Preprocessing} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Before using VWPre Before using this package a number of steps are required: First, your eye gaze data must have been collected using an SR Research Eyelink eye tracker. Second, your data must have been exported using SR Research Data Viewer software. For this basic example, it is assumed that you have specified an interest period relative to the onset of the critical stimulus in Data Viewer. However, this package is also able to preprocess data without a specified relative interest period. If you have not aligned your data to a particular message in Data Viewer, please refer to the [Message Alignment](VWPre_Message_Alignment.html) vignette for functions related to this. The Sample Report should be exported along with all available columns (this will ensure that you have all of the necessary columns for the functions contained in this package to work). Additionally, it is preferable to export to a .txt file rather than a .xlsx file. The following preprocessing assumes that, in your experiment, interest area IDs and Labels were assigned consistently to the object types displayed on the screen. For example, in a typical VWP experiment, the target was always in interest area 1, the competitor was always in interest area 2, et cetera. This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects. If, instead, your interest areas were static and you have columns indicating the location of each object for each trial, you will need to reassign your interest areas. Specific functions for this are available in this package; please see the [Interest Areas](VWPre_Interest_Areas.html) vignette for illustration. Once that is complete, you can follow the preprocessing procedure below. Note that the functions presented here are capable of handling data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source code to accommodate the number needed (please contact the package maintainer for an example). Lastly, the functions included here, internally make use of `dplyr` for manipulating and restructuring data. For more information about `dplyr`, please refer to its reference manual and extensive collection of vignettes. ```{r global_options, include=FALSE} knitr::opts_chunk$set(fig.width=6, fig.height=4, warning=FALSE) ``` ```{r, echo=FALSE, eval=TRUE, message=FALSE} library(VWPre) data(VWdat) ``` ## Loading the package and the data First, load the sample report. By default, Data Viewer will assign "." to missing values; therefore it is important to include this in the na.strings parameter, so R will know how to handle any missing data. ```{r, eval= FALSE, echo=TRUE, results='asis'} library(VWPre) VWdat <- read.table("1000HzData.txt", header = T, sep = "\t", na.strings = c(".", "NA")) ``` However, for the purposes of this vignette we will use the sample dataset included in the package. ```{r, eval= FALSE, echo=TRUE, results='asis'} data(VWdat) ``` ## Preparing the data ### Verifying and creating necessary columns In order for the functions in the package to work appropriately, the data need to be in a specific format. 
## Preparing the data

### Verifying and creating necessary columns

In order for the functions in the package to work appropriately, the data need to be in a specific format. The `prep_data` function examines the presence and class of specific columns (`LEFT_INTEREST_AREA_ID`, `RIGHT_INTEREST_AREA_ID`, `LEFT_INTEREST_AREA_LABEL`, `RIGHT_INTEREST_AREA_LABEL`, `TIMESTAMP`, and `TRIAL_INDEX`) to ensure they are present in the data and appropriately assigned (e.g., categorical variables are encoded as factors). It also checks for the columns `SAMPLE_MESSAGE`, `RIGHT_GAZE_X`, `RIGHT_GAZE_Y`, `LEFT_GAZE_X`, and `LEFT_GAZE_Y`, which are not required for basic preprocessing, but are needed to use the functions `align_msg` and `custom_ia`.

Additionally, the `Subject` parameter is used to specify the column corresponding to the subject identifier. Typical Data Viewer output contains a column called `RECORDING_SESSION_LABEL`, which contains the subject identifier. The function will rename it `Subject` and will ensure it is encoded as a factor.

If your data contain a column corresponding to an item identifier, please specify it in the `Item` parameter. In doing so, the function will standardize the name of the column to `Item` and will ensure it is encoded as a factor. If you don't have an item identifier column, this parameter defaults to NA.

Lastly, a new column called `Event` will be created, which indexes each unique recording sequence and corresponds to the combination of `Subject` and `TRIAL_INDEX`. This Event variable is required internally for subsequent operations. Should you choose to define the Event variable differently, you can override the default; however, do so cautiously, as the variable must index each time sequence in the data uniquely, and an inappropriate choice may impact subsequent operations. Upon completion, the function prints a summary indicating the results.

```{r, eval=TRUE, echo=TRUE, results='asis'}
dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(VWdat)
gc()
```

### Remove unnecessary columns

At this point, it is safe to remove the columns which were output by Data Viewer, but which are not needed for preprocessing with this package. Removing these will reduce the amount of system memory consumed and result in a final dataset that consumes less disk space. This is done straightforwardly using the function `rm_extra_DVcols`. By default, it will remove all the Data Viewer columns that are not needed for preprocessing (if they are present in the data). However, if desired, it is possible to keep specific columns from this set using the `Keep` parameter, which accommodates a string or character vector. If you are using the sample data set included in this package, this step is not necessary, as these columns have already been removed.

```{r, eval= FALSE, echo=TRUE, results='asis'}
dat0 <- rm_extra_DVcols(dat0, Keep = c("RIGHT_PUPIL_SIZE", "LEFT_PUPIL_SIZE"))
```
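Because `Keep` also accepts a plain string, retaining a single column is a minimal variant of the call above:

```{r, eval= FALSE, echo=TRUE, results='asis'}
# Keep only the right-eye pupil size column
dat0 <- rm_extra_DVcols(dat0, Keep = "RIGHT_PUPIL_SIZE")
```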
### Relabel NA samples as outside any interest area

When the data were loaded, samples that were outside of any interest area were labeled as NA. The `relabel_na` function examines the interest area columns (`LEFT_INTEREST_AREA_ID`, `RIGHT_INTEREST_AREA_ID`, `LEFT_INTEREST_AREA_LABEL`, and `RIGHT_INTEREST_AREA_LABEL`) for cells containing NAs. It then assigns 0 to the ID columns and "Outside" to the LABEL columns to indicate those eye gaze samples which fell outside of the interest areas defined in the study. The number of interest areas you defined in your experiment should be supplied to the parameter `NoIA`.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat1 <- relabel_na(data = dat0, NoIA = 4)
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat0)
gc()
```

Notice that the output informs us that the number of levels in `LEFT_INTEREST_AREA_LABEL` does not match the number of interest areas listed in `NoIA`. This is because we only have data from the right eye (hence, all samples in `LEFT_INTEREST_AREA_LABEL` are listed as "Outside").

### Check encoding of interest areas

The subsequent preprocessing requires that the interest area IDs are numerically coded, with values ranging from 0 (i.e., outside all interest areas) up to a maximum of 8. So, it is important to check that the IDs present in the data set conform to this. The `check_ia` function does just this and indicates how those IDs are mapped to the interest area labels.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_ia(data = dat1)
```

If your interest area IDs do not conform to the required coding, or you would like to create new labels for your existing interest areas, please consult the [Interest Areas](VWPre_Interest_Areas.html) vignette. That vignette illustrates how to relabel existing interest area codings (as well as how to remap the gaze data to entirely new interest areas, should you so desire).

## Creating the Time series column

The function `create_time_series` creates a time series (a new column called `Time`), which is required for subsequent processing, plotting, and modeling of the data. It is common to export a period of time prior to the onset of the stimulus as a baseline. In this case, an adjustment (equal to the duration of the baseline period) must be applied to the time series, specified in the `Adjust` parameter. In effect, the adjustment simply subtracts the given value from each time point. So, a positive value will shift the zero point forward (making the initial zero a negative time value), while a negative value will shift the zero point backward (making the initial zero a positive time value). An example illustrating this can be found in the [Message Alignment](VWPre_Message_Alignment.html) vignette. In the example below, the data were exported with a 100 ms pre-stimulus interval.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat2 <- create_time_series(data = dat1, Adjust = 100)
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat1)
gc()
```

Note that if you have used the `align_msg` function (illustrated in the [Message Alignment](VWPre_Message_Alignment.html) vignette), you may need to specify a column name in `Adjust`. That column can be used to apply a recording-event-specific adjustment to each trial. Consult that vignette for further details.

The function `check_time_series` can be used to verify the time series. It outputs the unique start times present in the data. These will be the same standardized time point relative to the stimulus if you have exported your data from Data Viewer with a pre-defined interest period relative to a message. By specifying the parameter `ReturnData = T`, the function can return a summary data frame that can be used to inspect the start time of each event. As you can see below, by providing `Adjust` with a positive value, we have effectively shifted the zero point forward along the number line, causing the first sample to have a negative time value.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_time_series(data = dat2)
```
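To capture that summary for later inspection rather than just printing it, the same call can be made with `ReturnData` set to TRUE (a small sketch of the option described above):

```{r, eval=FALSE, echo=TRUE}
# Store the per-event start times in a data frame for inspection
start_times <- check_time_series(data = dat2, ReturnData = TRUE)
head(start_times)
```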
Another way to check that your time series has been created correctly is to use the `check_msg_time` function. By providing the appropriate message text, we can see that the onset of our target now occurs at Time = 0. Note that the `Msg` parameter can handle exact matches or matches based on regular expressions. As with `check_time_series`, the parameter `ReturnData = T` will return a summary data frame that can be used to inspect the message time of each event.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_msg_time(data = dat2, Msg = "TargetOnset")
```

If you do not remember the messages in your data, you can output all existing messages and their corresponding timestamps using `check_all_msgs`. Additionally and optionally, the output of the function can be saved using the parameter `ReturnData = T`.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_all_msgs(data = dat2)
```

## Selecting which eye to use

Depending on the design of the study, the right, the left, or both eyes may have been recorded during the experiment. Data Viewer outputs gaze data by placing it in separate columns for each eye (`LEFT_INTEREST_AREA_ID`, `LEFT_INTEREST_AREA_LABEL`, `RIGHT_INTEREST_AREA_ID`, `RIGHT_INTEREST_AREA_LABEL`). However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment. The function `select_recorded_eye` provides the functionality for this purpose, returning three new columns (`IA_ID`, `IA_LABEL`, `IA_Data`).

The function `select_recorded_eye` requires that the parameter `Recording` be specified. This parameter instructs the function about which eye(s) was used to record the gaze data. It takes one of four possible strings:

* "LandR" should be used when any participant had both eyes recorded.
* "LorR" should be used when some participants had their left eye recorded and others had their right eye recorded.
* "L" should be used when all participants had their left eye recorded.
* "R" should be used when all participants had their right eye recorded.

If in doubt, use the function `check_eye_recording`, which will do a quick check to see if `LEFT_INTEREST_AREA_ID` and `RIGHT_INTEREST_AREA_ID` contain data. It will then suggest the appropriate `Recording` parameter setting. When in complete doubt, use "LandR". The "LandR" setting requires an additional parameter (`WhenLandR`) to be specified. This instructs the function to select either the right eye or the left eye when data exist for both.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_eye_recording(data = dat2)
```

After executing, the function prints a summary of the output. While the function `check_eye_recording` indicated that the parameter `Recording` should be set to "R", the example below sets the parameter to "LandR", which can act as a "catch-all". Consequently, in the summary, it can be seen that there were only recordings in the right eye.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat3 <- select_recorded_eye(data = dat2, Recording = "LandR", WhenLandR = "Right")
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat2)
gc()
```
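If, instead, the recorded eye varied across participants (the "LorR" case described above), `WhenLandR` is not needed and the call would look along these lines (a sketch, not evaluated here):

```{r, eval=FALSE, echo=TRUE, results='asis'}
# For designs where some participants were recorded from the left eye
# and others from the right eye
dat3 <- select_recorded_eye(data = dat2, Recording = "LorR")
```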
When set to "OffScreen" or "Both", `ScreenSize` must be supplied as a numeric vector of the X and Y dimensions of the computer sceen used during the experiment. ```{r, eval= FALSE, echo=TRUE, results='asis'} dat3 <- mark_trackloss(dat3, Type = "Both", ScreenSize = c(1920, 1080)) ``` Once the samples corresponding to trackloss have been identified, events with less than the required amount of quality data can be removed from the data set, using the function `rm_trackloss_events`. The argument `RequiredData` represents the percentage of data (non-trackloss) required in order to retain the event. In the example below, each event must contain 75\% quality data, in other words, no more than 25\% trackloss. ```{r, eval= FALSE, echo=TRUE, results='asis'} dat3 <- rm_trackloss_events(dat3, RequiredData = 75) ``` ## Binning the data In order to obtain proportion looks, it is necessary to bin the data. That is, group samples in chunks of time, count the number of samples in each of the interest areas, and calculate the proportions based on the counts. The sampling rate at which the eye gaze data were recorded must be provided. For Eyelink trackers, this is typically 250Hz, 500Hz, or 1000Hz. If in doubt, use the function `check_samplingrate` to determine it. The sampling rate can then be supplied to the function `bin_prop`. ```{r, eval= TRUE, echo=TRUE, results='asis'} check_samplingrate(dat3) ``` Note that the `check_samplingrate` function returns a printed message indicating the sampling rate(s) present in the data. Optionally, it can return a new column called `SamplingRate` by specifying the parameter `ReturnData` as TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by the sampling rate before proceeding to the next processing step. The function `bin_prop` calculates the proportion of looks (samples) to each interest area in a particular span of time (bin size). In order to do this, it is necessary to supply the parameters `BinSize` and `SamplingRate`. `BinSize` should be specified in milliseconds, representing the chunk of time within which to calculate the proportions. Not all bin sizes work for all sampling rates, due to downsampling constraints. If unsure which are appropriate for your current sampling rate, use the `ds_options` function. When provided with the current sampling rate in `SamplingRate` (see above), the function will return a printed summary of the bin size options and their corresponding downsampled rate. By default, this returns the whole number downsampling rates users are likely to want; however, it can also return all possible (valid) downsampling rates, even if they are not round numbers. ```{r, eval= TRUE, echo=TRUE, results='asis'} ds_options(SamplingRate = 1000) ``` The `SamplingRate` parameter in `bin_prop` should be specified in Hertz (see `check_samplingrate`), representing the original sampling rate of the data and the `BinSize` should be specified in milliseconds (see `ds_options`), representing the span of time over which to calculate the proportion. The `bin_prop` function returns new columns corresponding to each interest area ID (e.g., `IA_1_C`, `IA_1_P`). The extension '\_C' indicates the count of samples in the bin and the extension '\_P' indicates the proportion. 
The `SamplingRate` parameter in `bin_prop` should be specified in Hertz (see `check_samplingrate`), representing the original sampling rate of the data, and `BinSize` should be specified in milliseconds (see `ds_options`), representing the span of time over which to calculate the proportion. The `bin_prop` function returns new columns corresponding to each interest area ID (e.g., `IA_1_C`, `IA_1_P`). The extension '\_C' indicates the count of samples in the bin and the extension '\_P' indicates the proportion.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat3)
gc()
```

In performing the calculation, the function effectively downsamples the data. To check this and to learn the new sampling rate, simply call the function `check_samplingrate` again.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_samplingrate(dat4)
```

## Empirical logits

Proportions are inherently bounded between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure (as well as weights which estimate the variance). The calculations contained in this package are based on: Barr, D. J. (2008). Analyzing 'visual world' eyetracking data using multilevel logistic regression. *Journal of Memory and Language*, *59*(4), 457--474. However, they have been modified to allow greater flexibility.

When using an empirical logit transformation, it is important to keep two things in mind. The first is the number of observations (or samples) on which to base the calculation. Typically, this is the number of samples per bin, which varies depending on your original sampling rate and bin size. To determine the number of samples per bin present in the data, use the function `check_samples_per_bin`.

```{r, eval= TRUE, echo=TRUE, results='asis'}
check_samples_per_bin(dat4)
```

However, a user may choose to define a different number of observations (because the number of samples is inherently linked to the sampling rate). Note, though, that changing this value can drastically impact the results of the transformation and weight calculations. There are some safeguards within the transformation function to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter `ObsOverride`). So, if in doubt, it is safest to use the number of samples present in your data (as indicated by `check_samples_per_bin`).

The second thing to keep in mind is the constant to be added in the transformation. Note that by default the calculation uses a constant of 0.5; however, the user can specify a different value to be used. If you are interested in visualizing the effect of both the number of observations and the constant on the result of the empirical logit transformation and weight calculations, please refer to the [Plotting](VWPre_Plotting.html) vignette, which illustrates and discusses the function `plot_transformation_app`.

The function `transform_to_elogit` transforms the proportions to empirical logits and also calculates a weight for each value. The weight estimates the variance in each bin (because the variance of the logit depends on the mean). This is particularly important for regression analyses and should be specified in the model call (e.g., weight = 1 / `IA_1_wts`). As mentioned above, the function takes the number of observations in the parameter `ObsPerBin`. Here we use the number of samples per bin present in the data.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat5 <- transform_to_elogit(dat4, NoIA = 4, ObsPerBin = 20)
```
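For intuition, the Barr (2008) formulas take the following form for a bin with a given number of observations, of which some number fall in the interest area. This is a by-hand illustration of the calculation with the default constant of 0.5, not the package internals:

```{r, eval=FALSE, echo=TRUE}
# Empirical logit and weight for a single bin, following Barr (2008):
# 20 observations, 15 of which fall in the interest area, constant = 0.5
y <- 15; Obs <- 20; const <- 0.5
elogit <- log((y + const) / (Obs - y + const))     # ~ 1.04
weight <- 1 / (y + const) + 1 / (Obs - y + const)  # ~ 0.25
```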
## Binomial data

Some researchers may prefer to perform a binomial analysis. Therefore, the function `create_binomial` uses the (previously calculated) proportions and number of observations to create a success/failure column for each interest area. This column is then a suitable response variable for logistic regression of the time series. As with the empirical logit transformation, a user may choose to define a number of observations that is different from the number of samples per bin. Because this can create artifacts in the scaling, or imply more samples than are present in the data, safeguards are in place to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter `ObsOverride`).

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat5a <- create_binomial(data = dat4, NoIA = 4, ObsPerBin = 20)
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat4, dat5a)
gc()
```

By default, the function will create a success/failure column for each interest area in the data; however, it is also possible to create a custom column comparing looks between two specific interest areas. This is done by specifying the parameter `CustomBinom` with a vector of two integers (e.g., CustomBinom = c(1,2)), in which the two integers correspond to the IDs of the desired interest areas.

## Fasttrack function

For advanced users who have worked with the package functions before and who are familiar with the required steps and output, there is a meta-function, called `fasttrack`, which runs through the previous functions and outputs a dataframe with either empirical logits or binomial data. Note that using this function will still require the user to manually remove unneeded columns (see above). This meta-function takes as parameters all the required arguments to the component functions. It also assumes that dynamic interest areas were used and do not need to be relabeled/reassigned, and that an interest period was defined in Data Viewer relative to the critical stimulus, thus not requiring separate message alignment. Again, this is only recommended for users who have previously worked with visual world data and the functions contained in this package, and who are confident that their data meet the requirements/assumptions of the `fasttrack` function.

```{r, eval = FALSE, echo=TRUE, results='asis'}
dat5b <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid",
  EventColumns = c("Subject", "TRIAL_INDEX"), NoIA = 4, Adjust = 100,
  Recording = "LandR", WhenLandR = "Right", BinSize = 20, SamplingRate = 1000,
  ObsPerBin = 20, Constant = 0.5, Output = "ELogit")
```

## Renaming interest area columns

Some may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme. To do so, use the function `rename_columns`. This will convert column names like `IA_1_C` and `IA_2_P` to `IA_Target_C` and `IA_Rhyme_P`, respectively. This will perform the operation on all the `IA_` columns for up to 8 interest areas.

```{r, eval= TRUE, echo=TRUE, results='asis'}
dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme",
  IA3="OnsetComp", IA4="Distractor"))
```

You can now check the column names in the data.

```{r, eval= TRUE, echo=TRUE, results='asis'}
colnames(dat6)
```

```{r, eval=TRUE, echo=FALSE, results='hide'}
rm(dat5, dat6)
gc()
```
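With meaningful names in place, the empirical logit columns and their weights slot naturally into a model call. The sketch below uses a hypothetical model formula and assumes the `lme4` package and the renamed columns `IA_Target_ELogit` and `IA_Target_wts`; it is not evaluated here:

```{r, eval=FALSE, echo=TRUE}
library(lme4)
# Hypothetical model: the weights enter as 1 / variance estimate (see above)
mod <- lmer(IA_Target_ELogit ~ Time + (1 | Subject),
            data = dat6, weights = 1 / IA_Target_wts)
```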
## Saving the data

### Subsetting and ordering

Before embarking on a statistical analysis, it is probably necessary to take a couple of steps, such as paring the data down to only the columns which will be needed later and ensuring the data are ordered appropriately. This is straightforward using `dplyr`.

```{r, eval=FALSE, echo=TRUE, results='asis'}
FinalDat <- dat5 %>%
  # Select just the columns you want
  select(Subject, Item, Time, starts_with("IA"), Event, TRIAL_INDEX, Rating, Exp) %>%
  # Order the data by Subject, Trial, and Time
  arrange(Subject, TRIAL_INDEX, Time)
```

### Saving to a file

Save the resulting dataset to a .rda file, using compression to make it more compact (though this will add to the amount of time it takes to save).

```{r, eval=FALSE, echo=TRUE, results='asis'}
save(FinalDat, file = "FinalDat.rda", compress = "xz")
```

## Plotting

You are now ready to plot your data. Please refer to the [Plotting](VWPre_Plotting.html) vignette for details on the various plotting functions contained in the package.
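As a preview, a grand average of the fixation proportions can be plotted along the lines below. This is a sketch only; the `plot_avg` arguments shown here should be confirmed against the Plotting vignette:

```{r, eval=FALSE, echo=TRUE}
# A first look at average proportions between 0 and 1000 ms
# (argument usage assumed; see the Plotting vignette for details)
plot_avg(data = dat5, type = "proportion", xlim = c(0, 1000),
         IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme",
                       IA_3_P = "OnsetComp", IA_4_P = "Distractor"))
```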