---
title: "Select, reshape, and filter data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Select, reshape, and filter data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = TRUE,
  echo = TRUE,
  comment = "#>",
  dpi = 120,
  fig.align = "center",
  out.width = "80%"
)
```

The package `forcis` provides [many functions](https://docs.ropensci.org/forcis/reference/index.html#select-and-filters-tools)
to filter, reshape, and select FORCIS data. This vignette shows how to use these
functions. With the exception of `select_taxonomy()`, all functions presented in
this vignette are optional and depend on your research questions. You can filter
data by species, time range, ocean, etc.

## Setup

First, let's load the required package.

```{r setup}
library(forcis)
```

Before proceeding, let's download the latest version of the FORCIS database.

```{r 'download-db', eval=FALSE}
# Create a data/ folder ----
dir.create("data")

# Download latest version of the database ----
download_forcis_db(path = "data", version = NULL)
```

This vignette uses the plankton nets data of the FORCIS database. Let's import
the latest release of the data.

```{r 'load-data', echo=FALSE}
file_name <- system.file(
  file.path("extdata", "FORCIS_net_sample.csv"),
  package = "forcis"
)

net_data <- read.csv(file_name)
```

```{r 'load-data-user', eval=FALSE}
# Import net data ----
net_data <- read_plankton_nets_data(path = "data")
```

**NB:** In this vignette, we use a subset of the plankton nets data, not the
whole dataset.

## Selecting columns

### Select a taxonomy

The FORCIS database provides three different taxonomies:

- `OT`: original taxonomy, i.e. the initial list of species names and attributes
  (e.g., shell pigmentation, coiling direction) as reported in various datasets
  and studies.
- `VT`: validated taxonomy, i.e.
  a refined version of the original taxonomy that resolves issues of synonymy
  (different names for the same taxon) and shifting taxonomic concepts.
- `LT`: lumped taxonomy, i.e. a simplified version of the validated taxonomy. It
  merges taxa that are difficult to distinguish across datasets (morphospecies),
  ensuring consistency and comparability in broader analyses.

See the [associated data paper](https://doi.org/10.1038/s41597-023-02264-2) for
further information.

After importing the data and before going any further, the next step is to
choose the taxonomic level for the analyses. **This is mandatory to avoid
duplicated records**.

Let's use the function `select_taxonomy()` to select the **VT** taxonomy
(validated taxonomy):

```{r 'select-taxo'}
# Select taxonomy ----
net_data_vt <- net_data |>
  select_taxonomy(taxonomy = "VT")

net_data_vt
```

### Select required columns

Because FORCIS data contain more than 100 columns, the function
`select_forcis_columns()` can be used to slim the data down, which makes it
easier to handle and speeds up some computations.

By default, only the species columns and the required columns listed in
`get_required_columns()` (needed by some functions of the package, such as
`compute_*()` and `plot_*()`) are kept.

```{r 'select-columns'}
# Remove non-required columns (optional) ----
net_data_vt <- net_data_vt |>
  select_forcis_columns()

net_data_vt
```

You can also use the argument `cols` to keep additional columns.

## Filtering rows

The `filter_by_*()` functions are optional and their use depends on your
research questions.

### Filter by month of data collection

The `filter_by_month()` function filters observations based on the **month of
sampling**. It requires two arguments: the data and a numeric vector with values
between 1 and 12.

```{r 'filter-by-month'}
# Filter data by sampling month ----
net_data_vt_july_aug <- net_data_vt |>
  filter_by_month(months = 7:8)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_july_aug)
```

### Filter by year of data collection

The `filter_by_year()` function filters observations based on the **year of
sampling**. It requires two arguments: the data and a numeric vector with the
years of interest.

```{r 'filter-by-year'}
# Filter data by sampling year ----
net_data_vt_9020 <- net_data_vt |>
  filter_by_year(years = 1990:2020)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_9020)
```

### Filter by bounding box

The function `filter_by_bbox()` can be used to filter FORCIS data by a spatial
bounding box (argument `bbox`).

Let's filter the plankton net data by a spatial rectangle located in the Indian
Ocean.

```{r 'filter-by-bbox'}
# Filter by spatial bounding box ----
net_data_vt_bbox <- net_data_vt |>
  filter_by_bbox(bbox = c(45, -61, 82, -24))

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_bbox)
```

Note that the argument `bbox` can be either an object of class `bbox` (package
`sf`) or a vector of four numeric values defining a rectangular bounding box. If
a vector of numeric values is provided, coordinates must be expressed in the
WGS 84 system (`epsg=4326`).

Let's check the spatial extent by converting these two tibbles into spatial
layers (`sf` objects) with the function `data_to_sf()`.

```{r 'check-bbox'}
# Convert to spatial layers (sf objects) ----
net_data_vt_sf <- net_data_vt |>
  data_to_sf()

net_data_vt_bbox_sf <- net_data_vt_bbox |>
  data_to_sf()

# Original spatial extent ----
sf::st_bbox(net_data_vt_sf)

# Spatial extent of filtered records ----
sf::st_bbox(net_data_vt_bbox_sf)
```

### Filter by ocean

The function `filter_by_ocean()` can be used to filter FORCIS data by one or
several oceans (argument `ocean`).

Let's keep only the plankton net data located in the Indian Ocean.

```{r 'filter-by-ocean'}
# Filter by ocean name ----
net_data_vt_indian <- net_data_vt |>
  filter_by_ocean(ocean = "Indian Ocean")

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_indian)
```

Use the function `get_ocean_names()` to retrieve the names of the world's oceans
as defined in the IHO Sea Areas dataset version 3 (used in this package).

```{r 'get-ocean-names'}
# Get ocean names ----
get_ocean_names()
```

### Filter by spatial polygon

The function `filter_by_polygon()` can be used to filter FORCIS data by a
spatial polygon (argument `polygon`).

Let's filter the plankton net data by a spatial polygon defining the boundaries
of the Indian Ocean.

```{r 'filter-by-polygon'}
# Import spatial polygon ----
file_name <- system.file(
  file.path("extdata", "IHO_Indian_ocean_polygon.gpkg"),
  package = "forcis"
)

indian_ocean <- sf::st_read(file_name, quiet = TRUE)

# Filter by polygon ----
net_data_vt_poly <- net_data_vt |>
  filter_by_polygon(polygon = indian_ocean)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_poly)
```

### Filter by species

The `filter_by_species()` function allows users to filter FORCIS data for one or
more species. It takes a `data.frame` (or a `tibble`) and a vector of species
names (argument `species`).

Let's subset the plankton net data to keep only two species: *G. glutinata* and
*C. nitida*.

```{r 'filter-by-species'}
# Filter by species ----
net_data_vt_glutinata_nitida <- net_data_vt |>
  filter_by_species(species = c("g_glutinata_VT", "c_nitida_VT"))

# Dimensions of original data ----
dim(net_data_vt)

# Dimensions of filtered data ----
dim(net_data_vt_glutinata_nitida)
```

**Important:** The `filter_by_species()` function does not remove rows (samples)
but columns: it removes other species columns.

To only keep samples where these two species have been detected, we can use:

```{r 'filter-counts'}
# Keep samples with positive counts ----
net_data_vt_glutinata_nitida <- net_data_vt_glutinata_nitida |>
  dplyr::filter(g_glutinata_VT > 0 | c_nitida_VT > 0)

# Number of filtered records ----
nrow(net_data_vt_glutinata_nitida)
```

## Reshaping

### Convert to long format

The `convert_to_long_format()` function converts FORCIS data into a long format.

```{r 'reshape-data'}
# Convert to long format ----
net_data_long <- convert_to_long_format(net_data)

# Dimensions of original data ----
dim(net_data)

# Dimensions of reshaped data ----
dim(net_data_long)
```

Two columns have been created: `taxa` (taxon names) and `counts` (taxon counts).

```{r 'reshape-data-2'}
# Column names ----
colnames(net_data_long)
```
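
To make the wide-to-long idea concrete, here is a minimal base R sketch on a toy table. It is only an illustration and not how `convert_to_long_format()` is implemented: the `sample_id` column and the count values are hypothetical, and only the `taxa` and `counts` column names are taken from the output described above.

```{r 'toy-long-format'}
# Toy wide table (hypothetical values): one row per sample, one column per taxon
toy_wide <- data.frame(
  sample_id      = c("s1", "s2"),
  g_glutinata_VT = c(3, 0),
  c_nitida_VT    = c(0, 5)
)

# Names of the taxon columns to stack ----
taxa_cols <- c("g_glutinata_VT", "c_nitida_VT")

# Stack taxon columns: one row per sample x taxon combination ----
toy_long <- data.frame(
  sample_id = rep(toy_wide$sample_id, times = length(taxa_cols)),
  taxa      = rep(taxa_cols, each = nrow(toy_wide)),
  counts    = unlist(toy_wide[taxa_cols], use.names = FALSE)
)

toy_long

# Dimensions of the wide and long toy tables ----
dim(toy_wide)
dim(toy_long)
```

In this sketch each sample appears once per taxon, so the long table has fewer columns but more rows than the wide one.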