--- title: "Comparison to Christensen et al. 2018" author: "Renata Diaz, Juniper Simonis, and Hao Ye" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comparison to Christensen et al. 2018} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: run_models: FALSE --- # Introduction This document provides a side-by-side comparison of **`LDATS`** (version 0.2.7) results with analysis from [Christensen et al. 2018](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.2373). Due to the size and duration of model runs, we use pre-generated model output from the [LDATS-replications repo](https://github.com/weecology/LDATS-replications). ## Summary | Step | Changes from Christensen et al 2018 to `LDATS` | Effect on comparison | Recommendations for future users | |:---------------|:------------------------|:-----------------------|:----------------------------| |Data|Paper adjusts abundances according to unit effort, while `LDATS` uses raw capture numbers.|None: run comparison using adjusted data|Use raw, unweighted abundances.| |LDA model selection|Paper conservatively overestimated the number of parameters for calculating AIC for model selection. `LDATS` calculates AIC appropriately.|Paper LDA selects 4 topics, while `LDATS` finds 6. Compare all combinations of paper and `LDATS` LDA models and changepoint models| Use `LDATS` AIC calculation.| |Changepoint model document weights|Paper weighted all sampling periods equally regardless of abundance. `LDATS` by default weights the information from each sampling period according to the number of individuals captured (i.e. the amount of information gleaned about the underlying community composition).|None; use `weights = NULL` to set all weights equal to 1 for LDATS| Weight sampling periods according to abundance| |Overall LDA + changepoint results| All combinations of LDA + changepoint model find 4 changepoints at approximately the same time steps|Choice of LDA model has more of an effect than choice of changepoint model|`LDATS` reflects best practices, but the paper methods will produce qualitatively similar results.| # Setup ## LDATS Installation To obtain the most recent version of **`LDATS`**, install the most recent version from GitHub: ```r install.packages("devtools") devtools::install_github("weecology/LDATS") ``` Load in the **`LDATS`** package. ```r library(LDATS) set.seed(42) nseeds <- 200 nit <- 10000 ``` To run the analyses here, you will also need to download **`dplyr`**, **`gridExtra`**, **`multipanel`**, **`RColorBrewer`**, **`RCurl`**, and **`reshape2`** as the manuscript's code relies on these packages. ```r install.packages(c("dplyr", "gridExtra", "multipanel", "RColorBrewer", "RCurl", "reshape2")) ``` ## Running the Models Because both the Latent Dirichlet Allocation (LDA) and time series components of the analysis can take a long time to run (especially with the settings above for the number of seeds and iterations), we will use pre-generated model outputs and turn off certain code chunks that run the models using a global `rmarkdown` parameter, `run_models = FALSE`. To change this functionality, you can re-render this file with: ```r rmarkdown::render("paper-comparison.Rmd", params = list(run_models = TRUE)) ``` ## Download Analysis Scripts and Data Files We're going to download analysis scripts, data files, and model objects, so we use a temporary location for storage: ```r vignette_files <- "." ``` To replicate the Christensen et al. 2018 analysis, we download some of the original scripts & data files from [Extreme-events-LDA repo](https://github.com/emchristensen/Extreme-events-LDA), as well as some raw data files from the [PortalData repo](https://github.com/weecology/PortalData), which are stored in the [LDATS-replications repo](https://github.com/weecology/LDATS-replications): Main Analysis Scripts: * `rodent_LDA_analysis.R` - main script for analyzing rodent community change using LDA * `rodent_data_for_LDA.R` - contains a function that creates the rodent data table used in analyses * `AIC_model_selection.R` - contains functions for calculating AIC for different candidate LDA models * `changepointmodel.r` - contains change-point model code * `LDA-distance.R` - function for computing Hellinger distance analyses Data: * `Rodent_table_dat.csv` - table of rodent data, created by rodent_data_for_LDA.R * `moon_dates.csv` - table of census dates (downloaded from the PortalData repository) * `Portal_rodent_trapping.csv` - table of trapping effort (downloaded from the PortalData repository) * `paper_dat.csv` - rodent data table from Christensen et al. 2018 * `paper_dates.csv` - dates used in Christensen et al. 2018 * `paper_covariates.csv` - table of dates and covariate data Figure scripts: * `LDA_figure_scripts.R` - contains functions for making main plots in manuscript (Fig 1). Called from rodent_LDA_analysis.R ```r test_file <- file.path(vignette_files, "scripts", "rodent_LDA_analysis.r") if (!file.exists(test_file)){ # scripts dir.create(file.path(vignette_files, "scripts"), showWarnings = FALSE) github_path <- "https://raw.githubusercontent.com/weecology/LDATS-replications/master/scripts/" files_to_download <- c("rodent_LDA_analysis.r", "rodent_data_for_LDA.r", "AIC_model_selection.R", "changepointmodel.r", "LDA-distance.R", "LDA_figure_scripts.R") for (file in files_to_download) { download.file(url = paste0(github_path, file), destfile = file.path(vignette_files, "scripts", file)) } # data dir.create(file.path(vignette_files, "data"), showWarnings = FALSE) github_path <- "https://raw.githubusercontent.com/weecology/LDATS-replications/master/data/" files_to_download <- c("moon_dates.csv", "Portal_rodent_trapping.csv", "Rodent_table_dat.csv", "paper_dat.csv", "paper_dates.csv", "paper_covariates.csv") for (file in files_to_download) { download.file(url = paste0(github_path, file), destfile = file.path(vignette_files, "data", file)) } } ``` ## Download Model Outputs We also have pre-generated model outputs that we download from the [LDATS-replications repo](https://github.com/weecology/LDATS-replications): LDA models: * `ldats_ldamodel.RDS` - the best LDA model as selected by LDATS * `paper_ldamodel.RDS` - the best LDA model as selected by the Christensen et al. analysis Changepoint outputs: * `ldats_ldats.RDS`, `ldats_ldats_cpt.RDS`, `ldats_ldats_cpt_dates.RDS` - the posterior distribution, count, and dates of changepoints, using the LDATS LDA model and the LDATS changepoint selection * `ldats_paper.RDS`, `ldats_paper_cpt.RDS`, `ldats_paper_cpt_dates.RDS - the posterior distribution, count, and dates of changepoints, using the LDATS LDA model and the paper's changepoint selection * `paper_ldats.RDS`, `paper_ldats_cpt.RDS`, `paper_ldats_cpt_dates.RDS - the posterior distribution, count, and dates of changepoints, using the paper LDA model and the LDATS changepoint selection * `paper_paper.RDS`, `paper_paper_cpt.RDS`, `paper_paper_cpt_dates.RDS - the posterior distribution, count, and dates of changepoints, using the paper LDA model and the paper's changepoint selection * `cpt_dates.RDS` - summary table of changepoint results across models Figures * `lda_distances.png` - figure showing the variance in the topics identified by the paper's LDA model code * `paper_paper_cpt_plot.png` - figure showing the time series results for the paper analysis of the paper LDA * `ldats_paper_cpt_plot.png` - figure showing the time series results for the paper analysis of the LDATS LDA * `annual_hist.RDS` - function for making histogram of change points over years ```r test_file <- file.path(vignette_files, "output", "ldats_ldamodel.RDS") if (!file.exists(test_file)){ dir.create(file.path(vignette_files, "output"), showWarnings = FALSE) github_path <- "https://raw.githubusercontent.com/weecology/LDATS-replications/master/output/" files_to_download <- c("ldats_ldamodel.RDS", "paper_ldamodel.RDS", "ldats_ldats.RDS", "ldats_paper.RDS", "paper_ldats.RDS", "paper_paper.RDS", "ldats_rodents_adjusted.RDS", "rodents.RDS", "ldats_paper_cpt.RDS", "ldats_paper_cpt_dates.RDS", "ldats_ldats_cpt.RDS", "ldats_ldats_cpt_dates.RDS", "paper_paper_cpt.RDS", "paper_paper_cpt_dates.RDS", "paper_ldats_cpt.RDS", "paper_ldats_cpt_dates.RDS", "annual_hist.RDS", "cpt_dates.RDS", "lda_distances.png", "paper_paper_cpt_plot.png", "ldats_paper_cpt_plot.png") for (file in files_to_download){ download.file(url = paste0(github_path, file), destfile = file.path(vignette_files, "output", file), mode = "wb") } } ``` # Data Comparison The dataset of Portal rodents on control plots is included in the LDATS package: ```r data(rodents) head(rodents[[1]]) #> BA DM DO DS NA. OL OT PB PE PF PH PI PL PM PP RF RM RO SF SH SO #> 1 0 13 0 2 2 0 0 0 1 1 0 0 0 0 3 0 0 0 0 0 0 #> 2 0 20 1 3 2 0 0 0 0 4 0 0 0 0 2 0 0 0 0 0 0 #> 3 0 21 0 8 4 0 0 0 1 2 0 0 0 0 1 0 0 0 0 0 0 #> 4 0 21 3 12 4 2 3 0 1 1 0 0 0 0 0 0 0 0 0 0 0 #> 5 0 16 1 9 5 2 1 0 0 2 0 0 0 0 0 0 1 0 0 0 0 #> 6 0 17 1 13 5 1 5 0 0 3 0 0 0 0 0 0 0 0 0 0 0 ``` We can compare this against the data used in Christensen et al: ```r # parameters for subsetting the full Portal rodents data periods <- 1:436 control_plots <- c(2, 4, 8, 11, 12, 14, 17, 22) species_list <- c("BA", "DM", "DO", "DS", "NA", "OL", "OT", "PB", "PE", "PF", "PH", "PI", "PL", "PM", "PP", "RF", "RM", "RO", "SF", "SH", "SO") source(file.path(vignette_files, "scripts", "rodent_data_for_LDA.r")) # assemble `paper_dat`, the data from Christensen et al. 2018 paper_dat <- create_rodent_table(period_first = min(periods), period_last = max(periods), selected_plots = control_plots, selected_species = species_list) # assemble `paper_covariates`, the associated dates and covariate data moondat <- read.csv(file.path(vignette_files, "data", "moon_dates.csv"), stringsAsFactors = F) paper_dates <- moondat %>% dplyr::filter(period %>% dplyr::between(min(periods), max(periods))) %>% dplyr::pull(censusdate) %>% as.Date() paper_covariates <- data.frame( index = seq_along(paper_dates), date = paper_dates, year_continuous = lubridate::decimal_date(paper_dates)) %>% dplyr::mutate( sin_year = sin(year_continuous * 2 * pi), cos_year = cos(year_continuous * 2 * pi) ) ``` ## Compare the data from Christensen et al. with the included data in `LDATS` ```r compare <- rodents[[1]] == paper_dat length(which(rowSums(compare) < ncol(compare))) #> [1] 16 ``` There are 16 rows where the data included in LDATS differs from the paper data. This is because the LDATS data is not adjusted to account for trapping effort, while the paper data is, by dividing all census counts by the actual number of plots trapped and multiplying by 8 to account for incompletely-trapped censuses. To confirm this, refer to lines 36-46 in `rodent_data_for_LDA.r`: ```r # retrieve data on number of plots trapped per month trap_table = read.csv('https://raw.githubusercontent.com/weecology/PortalData/master/Rodents/Portal_rodent_trapping.csv') trap_table_controls = filter(trap_table, plot %in% selected_plots) nplots_controls = aggregate(trap_table_controls$sampled,by=list(period = trap_table_controls$period),FUN=sum) # adjust species counts by number of plots trapped that month r_table_adjusted = as.data.frame.matrix(r_table) for (n in 1:436) { #divide by number of control plots actually trapped (should be 8) and multiply by 8 to estimate captures as if all plots were trapped r_table_adjusted[n,] = round(r_table_adjusted[n,]/nplots_controls$x[n]*8) } ``` We can run the same procedure on the LDATS data to verify that we obtain a data.frame that matches. ```r # get the trapping effort for each sample trap_table <- read.csv(file.path(vignette_files, "data", "Portal_rodent_trapping.csv")) trap_table_controls <- dplyr::filter(trap_table, plot %in% control_plots) nplots_controls <- aggregate(trap_table_controls$sampled, by = list(period = trap_table_controls$period), FUN = sum) # adjust species counts by number of plots trapped that month # divide by number of control plots actually trapped (should be 8) and # multiply by 8 to estimate captures as if all plots were trapped ldats_rodents_adjusted <- as.data.frame.matrix(rodents[[1]]) ldats_rodents_adjusted[periods, ] <- round(ldats_rodents_adjusted[periods, ] / nplots_controls$x[periods] * 8) ``` Now we can compare the adjusted LDATS dataset with both the original ldats dataset and the dataset from the paper: ```r compare_raw <- rodents[[1]] == ldats_rodents_adjusted length(which(rowSums(compare_raw) < ncol(compare_raw))) compare_adjusted <- ldats_rodents_adjusted == paper_dat length(which(rowSums(compare_adjusted) < ncol(compare_adjusted))) ``` Because the LDA procedure weights the information from documents (census periods) according to the number of words (rodents captured), we now believe it is most appropriate to run the LDA on _unadjusted_ trapping data, and we recommend that users of LDATS do so. However, to maintain consistency with Christensen et al 2018, we will proceed using the _adjusted_ rodent table in this vignette. ```r rodents[[1]] <- paper_dat ``` The LDATS rodent data comes with a `document_covariate_table`, which we will use later as the predictor variables for the changepoint models. In this table, time is expressed as new moon numbers. Later we will want to be able to interpret the results in terms of census dates. We will add a column to the `document_covariate_table` to convert new moon numbers to census dates. We will not reference this column in any of the formulas we pass to the changepoint models, so it will be ignored until we need it. ```r head(rodents$document_covariate_table) #> newmoon sin_year cos_year #> 1 1 -0.2470 -0.9690 #> 2 2 -0.6808 -0.7325 #> 3 3 -0.9537 -0.3008 #> 4 4 -0.9813 0.1925 #> 5 5 -0.7583 0.6519 #> 6 6 -0.3537 0.9354 ``` ```r new_cov_table <- dplyr::left_join(rodents$document_covariate_table, dplyr::select(moondat, newmoonnumber, censusdate), by = c("newmoon" = "newmoonnumber")) %>% dplyr::rename(date = censusdate) rodents$document_covariate_table <- new_cov_table ``` # Identify community groups using LDA While LDATS can run start-to-finish with `LDATS::LDA_TS`, here we will work through the process function-by-function to isolate differences with the paper. For a breakdown of the `LDA_TS` pipeline, see the [`codebase` vignette](https://weecology.github.io/LDATS/articles/LDATS_codebase.html). First, we run the LDA models from LDATS to identify the number of topics: ```r ldats_ldas <- LDATS::LDA_set(document_term_table = rodents$document_term_table, topics = 2:6, nseeds = nseeds) ldats_ldamodel <- LDATS::select_LDA(LDA_models = ldats_ldas)[[1]] saveRDS(ldats_ldamodel, file = file.path(vignette_files, "ldats_ldamodel.RDS")) ``` Second, we run the LDA models from Christensen et al. to do the same task: ```r source(file.path(vignette_files, "scripts", "AIC_model_selection.R")) source(file.path(vignette_files, "scripts", "LDA-distance.R")) # Some of the functions require the data to be stored in the `dat` variable dat <- paper_dat # Fit a bunch of LDA models with different seeds # Only use even numbers for seeds because consecutive seeds give identical results seeds <- 2 * seq(nseeds) # repeat LDA model fit and AIC calculation with a bunch of different seeds to test robustness of the analysis best_ntopic <- repeat_VEM(paper_dat, seeds, topic_min = 2, topic_max = 6) hist(best_ntopic$k, breaks = seq(from = 0.5, to = 9.5), xlab = "best # of topics", main = "") # 2b. how different is species composition of 4 community-types when LDA is run with different seeds? # ================================================================== # get the best 100 seeds where 4 topics was the best LDA model seeds_4topics <- best_ntopic %>% filter(k == 4) %>% arrange(aic) %>% head(min(100, nseeds)) %>% pull(SEED) # choose seed with highest log likelihood for all following analyses # (also produces plot of community composition for "best" run compared to "worst") png(file.path(vignette_files, "output", "lda_distances.png"), width = 800, height = 400) dat <- paper_dat # calculate_LDA_distance has some required named variables best_seed <- calculate_LDA_distance(paper_dat, seeds_4topics) dev.off() mean_dist <- unlist(best_seed)[2] max_dist <- unlist(best_seed)[3] # ================================================================== # 3. run LDA model # ================================================================== ntopics <- 4 SEED <- unlist(best_seed)[1] # For the paper, use seed 206 ldamodel <- LDA(paper_dat, ntopics, control = list(seed = SEED), method = "VEM") saveRDS(ldamodel, file = file.path(vignette_files, "paper_ldamodel.RDS")) ``` ```r knitr::include_graphics(file.path(vignette_files, "output", "lda_distances.png")) ```