---
title: "Select, reshape, and filter data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Select, reshape, and filter data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = TRUE,
  echo = TRUE,
  comment = "#>",
  dpi = 120,
  fig.align = "center",
  out.width = "80%"
)
```

The package `forcis` provides [several functions](https://docs.ropensci.org/forcis/reference/index.html#select-and-filters-tools) to filter, reshape, and select FORCIS data. This vignette shows how to use them. Except for `select_taxonomy()`, all functions presented in this vignette are optional: whether you need them depends on your research questions. You can filter data by species, time range, ocean, etc.


## Setup

First, let's import the required packages.

```{r setup}
library(forcis)
```

Before proceeding, let's download the latest version of the FORCIS database.

```{r 'download-db', eval=FALSE}
# Create a data/ folder ----
dir.create("data")

# Download latest version of the database ----
download_forcis_db(path = "data", version = NULL)
```

The vignette will use the plankton nets data of the FORCIS database. Let's import the latest release of the data.

```{r 'load-data', echo=FALSE}
file_name <- system.file(
  file.path("extdata", "FORCIS_net_sample.csv"),
  package = "forcis"
)
net_data <- read.csv(file_name)
```

```{r 'load-data-user', eval=FALSE}
# Import net data ----
net_data <- read_plankton_nets_data(path = "data")
```

**NB:** In this vignette, we use a subset of the plankton nets data, not the whole dataset.


## Selecting columns

### Select a taxonomy

The FORCIS database provides three different taxonomies: 

- `OT`: original taxonomy, i.e. the initial list of species names and attributes (e.g., shell pigmentation, coiling direction) as reported in various datasets and studies.
- `VT`: validated taxonomy, i.e. a refined version of the original taxonomy that resolves issues of synonymy (different names for the same taxon) and shifting taxonomic concepts.
- `LT`: lumped taxonomy, i.e. a simplified version of the validated taxonomy. It merges taxa that are difficult to distinguish across datasets (morphospecies), ensuring consistency and comparability in broader analyses. 

See the [associated data paper](https://doi.org/10.1038/s41597-023-02264-2) for further information.

After importing the data and before going any further, the next step is to choose the taxonomy to use for the analyses. **This is mandatory to avoid duplicated records**.

Let's use the function `select_taxonomy()` to select the **VT** taxonomy (validated taxonomy):

```{r 'select-taxo'}
# Select taxonomy ----
net_data_vt <- net_data |>
  select_taxonomy(taxonomy = "VT")

net_data_vt
```
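
As a quick check, the species columns that remain after the selection can be listed by their suffix. This is a minimal sketch that assumes species columns end with the taxonomy code, as in `g_glutinata_VT`; the pattern and the object name `vt_columns` are illustrations, not part of the `forcis` API.

```{r 'check-taxo'}
# List species columns carrying the VT suffix (assumed naming pattern) ----
vt_columns <- grep("_VT$", colnames(net_data_vt), value = TRUE)

# First few species columns ----
head(vt_columns)
```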


### Select required columns

Because FORCIS data contain more than 100 columns, the function `select_forcis_columns()` can be used to reduce the number of columns, which makes the data easier to handle and speeds up some computations.

By default, only required columns listed in `get_required_columns()` (required by some functions of the package like `compute_*()` and `plot_*()`) and species columns will be kept.


```{r 'select-columns'}
# Remove non-required columns (optional) ----
net_data_vt <- net_data_vt |>
  select_forcis_columns()

net_data_vt
```

You can also use the argument `cols` to keep additional columns.


## Filtering rows

The `filter_by_*()` functions are optional and their use depends on your research questions.


### Filter by month of data collection

The `filter_by_month()` function filters observations based on the **month of sampling**. It requires two arguments: the data and a numeric vector with values between 1 and 12 (argument `months`).

```{r 'filter-by-month'}
# Filter data by sampling month ----
net_data_vt_july_aug <- net_data_vt |>
  filter_by_month(months = 7:8)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_july_aug)
```
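
The months do not have to be consecutive. As an illustrative sketch, the following chunk keeps samples collected in December, January, and February. The object name `net_data_vt_winter` is only an example.

```{r 'filter-by-month-winter'}
# Keep December, January, and February (non-consecutive values) ----
net_data_vt_winter <- net_data_vt |>
  filter_by_month(months = c(12, 1, 2))

# Number of filtered records ----
nrow(net_data_vt_winter)
```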


### Filter by year of data collection

The `filter_by_year()` function filters observations based on the **year of sampling**. It requires two arguments: the data and a numeric vector with the years of interest (argument `years`).

```{r 'filter-by-year'}
# Filter data by sampling year ----
net_data_vt_9020 <- net_data_vt |>
  filter_by_year(years = 1990:2020)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_9020)
```
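
Since each `filter_by_*()` function takes the data as its first argument, the temporal filters can be chained in a single pipeline. The sketch below combines the two calls shown above. The object name `net_data_vt_9020_july_aug` is only an example, and the result may be small or empty with the data subset used in this vignette.

```{r 'filter-by-year-month'}
# Keep samples collected in July or August between 1990 and 2020 ----
net_data_vt_9020_july_aug <- net_data_vt |>
  filter_by_year(years = 1990:2020) |>
  filter_by_month(months = 7:8)

# Number of filtered records ----
nrow(net_data_vt_9020_july_aug)
```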


### Filter by bounding box

The function `filter_by_bbox()` can be used to filter FORCIS data by a spatial bounding box (argument `bbox`).

Let's filter the plankton net data by a spatial rectangle located in the Indian Ocean.

```{r 'filter-by-bbox'}
# Filter by spatial bounding box ----
net_data_vt_bbox <- net_data_vt |>
  filter_by_bbox(bbox = c(45, -61, 82, -24))

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_bbox)
```

Note that the argument `bbox` can be either an object of class `bbox` (package `sf`) or a vector of four numeric values defining a rectangular bounding box (in the example above: minimum longitude, minimum latitude, maximum longitude, and maximum latitude). If a vector of numeric values is provided, coordinates must be expressed in the WGS 84 system (`epsg=4326`).
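
The same filter can be written with an object of class `bbox`. This is a minimal sketch that assumes the numeric vector used above is ordered as `xmin`, `ymin`, `xmax`, `ymax`. The object names `indian_bbox` and `net_data_vt_bbox_obj` are only examples.

```{r 'filter-by-bbox-sf'}
# Build a bbox object in WGS 84 (assumed to match the extent used above) ----
indian_bbox <- sf::st_bbox(
  c(xmin = 45, ymin = -61, xmax = 82, ymax = -24),
  crs = sf::st_crs(4326)
)

# Filter with the bbox object ----
net_data_vt_bbox_obj <- net_data_vt |>
  filter_by_bbox(bbox = indian_bbox)

# Number of filtered records ----
nrow(net_data_vt_bbox_obj)
```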

Let's check the spatial extent by converting the original and filtered data into spatial layers (`sf` objects) with the function `data_to_sf()`.

```{r 'check-bbox'}
# Convert data to sf objects ----
net_data_vt_sf <- net_data_vt |>
  data_to_sf()

net_data_vt_bbox_sf <- net_data_vt_bbox |>
  data_to_sf()

# Original spatial extent ----
sf::st_bbox(net_data_vt_sf)

# Spatial extent of filtered records ----
sf::st_bbox(net_data_vt_bbox_sf)
```


### Filter by ocean

The function `filter_by_ocean()` can be used to filter FORCIS data by one or several oceans (argument `ocean`).

Let's filter the plankton net data located in the Indian Ocean.

```{r 'filter-by-ocean'}
# Filter by ocean name ----
net_data_vt_indian <- net_data_vt |>
  filter_by_ocean(ocean = "Indian Ocean")

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_indian)
```

Use the function `get_ocean_names()` to retrieve the names of the world's oceans according to the IHO Sea Areas dataset version 3 (used in this package).

```{r 'get-ocean-names'}
# Get ocean names ----
get_ocean_names()
```
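
Because the argument `ocean` accepts one or several names, a vector can be passed to keep samples from more than one ocean. In this sketch, `"Southern Ocean"` is an assumed name: the call to `intersect()` keeps only the names actually returned by `get_ocean_names()`, so an invalid name is silently dropped. The object names are only examples.

```{r 'filter-by-oceans'}
# Candidate names ("Southern Ocean" is an assumption); keep valid ones only ----
oceans <- intersect(c("Indian Ocean", "Southern Ocean"), get_ocean_names())

# Filter by several oceans ----
net_data_vt_oceans <- net_data_vt |>
  filter_by_ocean(ocean = oceans)

# Number of filtered records ----
nrow(net_data_vt_oceans)
```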


### Filter by spatial polygon

The function `filter_by_polygon()` can be used to filter FORCIS data by a spatial polygon (argument `polygon`).

Let's filter the plankton net data by a spatial polygon defining the boundaries of the Indian Ocean.

```{r 'filter-by-polygon'}
# Import spatial polygon ----
file_name <- system.file(
  file.path("extdata", "IHO_Indian_ocean_polygon.gpkg"),
  package = "forcis"
)

indian_ocean <- sf::st_read(file_name, quiet = TRUE)

# Filter by polygon ----
net_data_vt_poly <- net_data_vt |>
  filter_by_polygon(polygon = indian_ocean)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_poly)
```


### Filter by species

The `filter_by_species()` function allows users to filter FORCIS data for one or more species.

It takes a `data.frame` (or a `tibble`) and a vector of species names (argument `species`).

Let's subset the plankton net data to keep only two species: *G. glutinata* and *C. nitida*.

```{r 'filter-by-species'}
# Filter by species ----
net_data_vt_glutinata_nitida <- net_data_vt |>
  filter_by_species(species = c("g_glutinata_VT", "c_nitida_VT"))

# Dimensions of original data ----
dim(net_data_vt)

# Dimensions of filtered data ----
dim(net_data_vt_glutinata_nitida)
```

**Important:** The `filter_by_species()` function does not remove rows (samples) but columns: it removes the columns of all other species. To keep only samples where at least one of these two species has been detected, we can use:


```{r 'filter-counts'}
# Keep samples with positive counts ----
net_data_vt_glutinata_nitida <- net_data_vt_glutinata_nitida |>
  dplyr::filter(g_glutinata_VT > 0 | c_nitida_VT > 0)

# Number of filtered records ----
nrow(net_data_vt_glutinata_nitida)
```
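
To require that both species occur in the same sample, the condition can be written with `&` instead of `|`. This variant is a sketch built on the calls shown above. The object name `net_data_vt_both` is only an example.

```{r 'filter-counts-both'}
# Keep samples where both species have positive counts ----
net_data_vt_both <- net_data_vt |>
  filter_by_species(species = c("g_glutinata_VT", "c_nitida_VT")) |>
  dplyr::filter(g_glutinata_VT > 0 & c_nitida_VT > 0)

# Number of filtered records ----
nrow(net_data_vt_both)
```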

## Reshaping

### Convert to long format

The `convert_to_long_format()` function converts FORCIS data into a long format.

```{r 'reshape-data'}
# Convert to long format ----
net_data_long <- convert_to_long_format(net_data)

# Dimensions of original data ----
dim(net_data)

# Dimensions of reshaped data ----
dim(net_data_long)
```

Two columns have been created: `taxa` (taxon names) and `counts` (taxon counts).

```{r 'reshape-data-2'}
# Column names ----
colnames(net_data_long)
```
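
The long format makes grouped summaries straightforward. The sketch below sums the `counts` column per taxon with `dplyr`. It assumes that `counts` is numeric and ignores missing values. Samples may report counts in different units, so treat the totals as an illustration of the data structure rather than as a result. The names `taxa_totals` and `total_counts` are only examples.

```{r 'reshape-data-3'}
# Total counts per taxon (missing values ignored) ----
taxa_totals <- net_data_long |>
  dplyr::group_by(taxa) |>
  dplyr::summarise(total_counts = sum(counts, na.rm = TRUE)) |>
  dplyr::arrange(dplyr::desc(total_counts))

# First rows of the summary ----
head(taxa_totals)
```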