---
title: "Get started"
author: "Luke A McGuinness"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: yes
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{medrxivr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r,echo=FALSE, message=FALSE,warning=FALSE}
# Delete when done
library(medrxivr)
library(dplyr)

knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)

```

An increasingly important source of health-related bibliographic content are preprints - preliminary versions of research articles that have yet to undergo peer review. The two preprint repositories most relevant to health-related sciences are [medRxiv](https://www.medrxiv.org/) and [bioRxiv](https://www.biorxiv.org/), both of which are operated by the Cold Spring Harbor Laboratory.

The goal of the `medrxivr` R package is two-fold. In the first instance, it provides programmatic access to the [Cold Spring Harbour Laboratory (CSHL) API](https://api.biorxiv.org/), allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc) into R. The package also provides access to a maintained static snapshot of the medRxiv repository (see [Data sources](#medrxiv-data)). Secondly, `medrxivr` provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.

## Installation

You can install the development version of this package using:

``` {r}
devtools::install_github("mcguinlu/medrxivr")
library(medrxivr)
```

## Data sources

### medRxiv data

`medrixvr` provides two ways to access medRxiv data:  

  - `mx_api_content(server = "medrxiv")` creates a local copy of all data available from the medRxiv API at the time the function is run.
  
``` {r}
# Get a copy of the database from the live medRxiv API endpoint
preprint_data <- mx_api_content()  
```
  
  - `mx_snapshot()` provides access to a static snapshot of the medRxiv database. The snapshot is created each morning at 6am using `mx_api_content()` and is stored as CSV file in the [medrxivr-data repository](https://github.com/mcguinlu/medrxivr-data). This method does not rely on the API (which can become unavailable during peak usage times) and is usually faster (as it reads data from a CSV rather than having to re-extract it from the API). Discrepancies between the most recent static snapshot and the live database can be assessed using `mx_crosscheck()`.
  
``` {r}
# Get a copy of the database from the daily snapshot
preprint_data <- mx_snapshot()  
```
  
The relationship between the two methods for the medRxiv database is summarised in the figure below:
  
``` {r eval = TRUE, echo = FALSE, out.width = "500px", out.height = "400px"}

knitr::include_graphics("data_sources.png")
```
  
### bioRxiv data

Only one data source exists for the bioRxiv repository: 

  - `mx_api_content(server = "biorxiv")` creates a local copy of all data available from the bioRxiv API endpoint at the time the function is run. __Note__: due to it's size, downloading a complete copy of the bioRxiv repository in this manner takes a long time (~ 1 hour). 

``` {r}
# Get a copy of the database from the live bioRxiv API endpoint
preprint_data <- mx_api_content(server = "biorxiv")
```

## Performing your search

Once you have created a local copy of either the medRxiv or bioRxiv preprint database, you can pass this object (`preprint_data` in the examples above) to `mx_search()` to search the preprint records using an advanced search strategy.

``` {r}

# Perform a simple search
results <- mx_search(data = preprint_data,
                     query ="dementia")

# Perform an advanced search
topic1  <- c("dementia","vascular","alzheimer's")  # Combined with Boolean OR
topic2  <- c("lipids","statins","cholesterol")     # Combined with Boolean OR
myquery <- list(topic1, topic2)                    # Combined with Boolean AND

results <- mx_search(data = preprint_data,
                     query = myquery)

```

## Dataset description

The dataset (in this case, `results`) returned by the search function above contains 14 variables: 

```{r, eval = TRUE, echo = FALSE}

mx_variables <-
  data.frame(
    Variable = c(
         "ID"      ,
         "title"   ,
         "abstract",
         "authors" ,
         "date"    ,
         "category",
         "doi"     ,
         "version" ,
         "author_corresponding",
         "author_corresponding_institution",
         "link_page",
         "link_pdf" ,
         "license"  ,
         "published"
    ),
    Description = c(
      "Unique identifier",
      "Preprint title",
      "Preprint abstract",
      "Author list in the format 'LastName, InitalOfFirstName.' (e.g. McGuinness, L.). Authors are seperated by a semi-colon.",
      "Date the preprint was posted, in the format YYYYMMDD.",
      "On submission, medRxiv asks authors to classify their preprint into one of a set number of subject categories.",
      "Preprint Digital Object Identifier.",
      "Preprint version number. As authors can update their preprint at any time, this indicates which version of a given preprint the record refers to.", 
      "Corresponding authors name.",
      "Corresponding author's institution.",
      "Link to preprint webpage. The \"?versioned=TRUE\" is required, as otherwise, the URL will resolve to the most recent version of the article (assuming there is >1 version available).",
      "Link to preprint PDF. This is used by `mx_download()` to download a copy of the PDF for that preprint.",
      "Preprint license",
      "If the preprint was subsequently published in a peer-reviewed journal, this variable contains the DOI of the published version."
    )
  )


knitr::kable(mx_variables, format = "html") %>%
  kableExtra::kable_styling(full_width = F) %>%
  kableExtra::column_spec(1, bold = T, border_right = T) %>%
  kableExtra::column_spec(2, width = "30em")
```

## Export records identified by your search to a .BIB file

`medrxivr` provides a helper function to export your search results to a .BIB file so that they can be easily imported into a reference manager (e.g. Zotero, Mendeley)

```{r, eval = FALSE}

mx_export(data = mx_results,
          file = tempfile(fileext = ".bib"))

```

## Download PDFs for records identified by your search

Pass the results of your search above (`results`) to the `mx_download()` function to download a copy of the PDF for each record. 

```{r, eval = FALSE}

mx_download(results,        # Object returned by mx_search
            tempdir(),      # Temporary directory to save PDFs to 
            create = TRUE)  # Create the directory if it doesn't exist

```


## Further guidance

Please see the *[medrxivr website](https://docs.ropensci.org/medrxivr/index.html)* vignette for extended guidance on developing search strategies and for detailed instructions on interacting with the Cold Springs Harbour API for medRxiv and bioRxiv.