---
title: "Getting Started with basepenguins"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with basepenguins}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include = FALSE}
# Set up a temp directory and set it as the root directory for all chunks
temp_dir <- tempdir()
knitr::opts_knit$set(root.dir = temp_dir)
```

## Introduction

The **basepenguins** package provides tools to convert R scripts and R Markdown/Quarto documents (or other specified file types) that use the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) package to use the versions of `penguins` and `penguins_raw` from **datasets** (R ≥ 4.5.0).

With R ≥ 4.5.0, the popular Palmer Penguins datasets are now directly available without loading the **palmerpenguins** package. This makes them more accessible, especially for new R users and for teaching purposes. However, there are some differences between the variable names in the **palmerpenguins** package and those in R's **datasets** package:

| palmerpenguins    | datasets     |
|-------------------|--------------|
| bill_length_mm    | bill_len     |
| bill_depth_mm     | bill_dep     |
| flipper_length_mm | flipper_len  |
| body_mass_g       | body_mass    |

These shorter variable names in the base R version were chosen for more compact code and data display. It does mean, however, that for those wanting to use R’s version of `penguins`, it isn’t simply a case of removing the call to `library(palmerpenguins)` or replacing `palmerpenguins` with `datasets` in `data("penguins", package = "palmerpenguins")` and the script still running.

The **basepenguins** package takes care of converting files by removing the call to **palmerpenguins** and making the necessary conversions to variable names, ensuring that the resulting scripts still run using the **datasets** (R ≥ 4.5.0) versions of `penguins` and `penguins_raw`.


```{r setup}
library(basepenguins)
```

## Package features

The **basepenguins** package provides four functions to convert files:

- `convert_files()`: Convert specified files to new output locations
- `convert_files_inplace()`: Convert files in-place
- `convert_dir()`: Convert files in a specified directory and its subdirectories to a new output directory, preserving nesting structure
- `convert_dir_inplace()`: Convert files in a directory in-place

If using `convert_files_inplace()` or `convert_dir_inplace()`, 
we recommend doing so in conjunction with a version-control system such as git, 
so that any changes can be easily checked.

Additionally, there are helper functions:

- `example_files()` and `example_dir()`: Access example files included in the package
- `output_paths()`: Generate modified file paths
- `files_to_convert()`: List files in a directory with specified extensions

## What changes when converting files?

When a file is 'convertible', i.e. contains a call to `library(palmerpenguins)` or `data("penguins", package = "palmerpenguins")` and has one of the specified extensions (by default `"R"`, `"r"`, `"qmd"`, `"rmd"`, `"Rmd"`), the conversion makes these changes:

- Replaces `library(palmerpenguins)` (or same with `palmerpenguins` in quotes) with the empty string`""`
- Replaces `data("penguins", package = "palmerpenguins")` (with any style of quotes) with `data("penguins", package = "datasets")` 
- Replaces variable names:
  - `bill_length_mm` → `bill_len`
  - `bill_depth_mm` → `bill_dep`
  - `flipper_length_mm` → `flipper_len`
  - `body_mass_g` → `body_mass`
- Replaces `ends_with("_mm")` with `starts_with("flipper_"), starts_with("bill_")`

## Example directory and files

The package includes an example directory with four example files to demonstrate how the conversion works. 
These are accessible through `example_files()` and `example_dir()`.

```{r}
# List all example files
example_files()
```

These example files include:

- `penguins.R`: An R script using the **palmerpenguins** package
- `no_penguins.Rmd`: An Rmarkdown file that includes `ends_with("_mm")` but *not* in the context of the **palmerpenguins** package
- `nested/penguins.qmd`: A Quarto document using the **palmerpenguins** package
- `nested/not_a_script.md`: Contains `library(palmerpenguins)`, but is not a script type that is converted by default

You can examine the content of any of these files, e.g.:

```{r}
penguins_script <- example_files("penguins.R")
cat(readLines(penguins_script), sep = "\n")
```

The `example_dir()` function returns the path to the directory containing all example files.
It also has a `copy.dir` argument that allows you to copy all the example files to a new directory. 
This is especially useful for testing the conversion functions that modify files in-place 
without affecting the original example files distributed with the package:

```{r}
# Copy all example files to a new subdirectory of the working directory
example_dir("examples")

# List the files in the copied directory
list.files("examples", recursive = TRUE)
```

Note that for the purposes of this vignette (and to adhere to CRAN policies), 
the working directory has been set to a `tempdir` and all new directories and files
are written there, using relative paths.

## Converting files

The package offers two main approaches to converting files: creating new converted versions with `convert_files()` or modifying files in place with`convert_files_inplace()`.

Let's start by converting a single file to see how it works:

```{r}
# Convert a single file to a new output file
convert_files(penguins_script, "converted_penguins.R")
```

```{r}
# Look at the converted file
cat(readLines("converted_penguins.R"), sep = "\n")
```

Notice how the function has:

- Removed the `library(palmerpenguins)` line
- Replaced the variable names used in **palmerpenguins** with their **datasets** equivalents
- Modified `ends_with("_mm")` to use `starts_with()` patterns instead

Both the `input` and `output` parameters of `convert_files()` take a vector of file paths, 
allowing you to convert multiple files at once. 

If you want to overwrite the original files rather than creating new ones, you can use `convert_files_inplace()`, 
which works exactly the same as `convert_files()`, except that it doesn't take an `output` argument - it is simply a convenience wrapper around `convert_files(input, input, extensions)`.

## Return values and messages

All the `convert_*()` functions invisibly return a list with two components:

- `changed`: Files that were modified
- `not_changed`: Files that were not modified (either they don't have the specified extensions or they don't use the **palmerpenguins** package)

If the `output` paths are different than the `input` paths, 
the values in the `changed` and `not_changed` vectors will be subsets of `output`,
and they will be named with the corresponding `input` paths. 
If files are overwritten, then the values in `changed` and `not_changed` 
will be subsets of `input` and the vectors will not be named.

This list is returned invisibly for two reasons:

1. If many files are converted, and/or absolute file paths are used, this list can occupy a lot of console space
2. With the list occupying a lot of console space, messages generated by the functions might be missed

The `convert_*()` functions generate messages in the following circumstances:

- If any files are changed, a message recommending you check the changed output files
- If any R Markdown or Quarto documents are changed, a message prompting you to re-knit or re-render them
- If any `ends_with("_mm")` substitutions are made, 
  a message with the output file paths and line numbers of those changes

## Converting a directory

To convert all convertible files in a directory (and its subdirectories), use `convert_dir()`.
We'll use the `"examples"` directory that we created above with the call to `example_dir("examples")`.

```{r}
result <- convert_dir("examples", "converted_examples")
result
```

To convert all files in a directory in place, use `convert_dir_inplace()`.
A useful call is `convert_dir_inplace(".")` to overwrite all convertible files in the working directory, 
though we don't run that here, demonstrating on a fresh copy of the example directory instead.

```{r}
example_dir("in_place_dir")

result <- convert_dir_inplace("in_place_dir")
result
```

## Helper functions

### Finding files with specific extensions

When working with large directories, the `files_to_convert()` function helps you find files with specific extensions that might be candidates for conversion:

```{r}
# List all files with convertible extensions in a directory
potential_files <- files_to_convert("examples")
potential_files
```

It's important to note that `files_to_convert()` only filters files by their extensions and does **not** look for `palmerpenguins` in their content. 

By default, this function looks for files with extensions `"R"`, `"r"`, `"qmd"`, `"rmd"`, or `"Rmd"`. You can specify different extensions if needed, or return absolute file paths. See `files_to_convert()` for further details:

```{r}
# Only look for R scripts
files_to_convert("examples", extensions = "R")
```

```{r}
# All extensions
files_to_convert("examples", extensions = NULL)
```

### Generating output paths

When converting files to new locations, the `output_paths()` function helps generate appropriate output paths, based on the input paths (which are preserved as names). These can then be passed to the `output` argument in `convert_files()`.
By default, `output_paths()` adds a `"_new"` suffix to the file name, but other suffixes, or prefixes, can be specified. Other output directories can also be given:

```{r}
input_files <- files_to_convert("examples")

# Default
output_paths(input_files)
```

```{r}
# Generate output paths with prefix instead, in new directory
output_paths(input_files, prefix = "base_", suffix = "", dir = "~/output")
```

## Considerations regarding the `ends_with("_mm")` substitution

The [**palmerpenguins** Get started vignette](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) 
has examples of using `ends_with("_mm")` within calls to `dplyr::select()`, as a convenient way to select the
`flipper_length_mm`, `bill_length_mm` and `bill_depth_mm` columns. 

This pattern presents a design challenge for **basepenguins**. 
We need a way to select the `flipper_len`, `bill_len` and `bill_dep` columns.

The most obvious substition for `ends_with("_mm")` is therefore `flipper_len, starts_with("bill_")`, 
which preserves the use of a [**tidyselect**](https://tidyselect.r-lib.org) function. 
However, suppose we have a previous call to `dplyr::select()`, and have converted the file with the above. 
Then following code will generate an error, because `flipper_len` is no longer available to be selected:

```{r eval = FALSE}
penguins |>
  select(bill_len, bill_dep) |>
  select(flipper_len, starts_with("bill_"))
```

Although the above example is contrived, we don't want to break anyone's code,
so instead we replace `ends_with("_mm")` with: 

```r
starts_with("flipper_"), starts_with("bill_")
```

This won't error, even if there are no column names starting with `"flipper_"` or `"bill_"`. 
However, we shouldn't ever really need `starts_with("flipper_")` as there is only one column in `penguins` that meets that criteria,
so we suggest manually checking this substitution and either replacing `starts_with("flipper_")` with `flipper_len` 
if `flipper_len` is still a column in the data frame, or removing `starts_with("flipper_")` entirely if not.

To facilitate this, the `convert_*()` functions all print a message indicating where these substitutions were made, 
to help you manually review and potentially refine these changes if desired.

The use of the `ends_with("_mm")` pattern with the `penguins` dataset is also the reason why we only convert files if `library(palmerpenguins)` or `data("penguins", package = "palmerpenguins")` is found in the file. 
It is possible to imagine different data frames for which this selector could be used, 
and we don't want to inadvertently alter those. We provide an example file to demonstrate this:


```{r}
no_penguins_file <- "examples/no_penguins.Rmd"
cat(readLines(no_penguins_file), sep = "\n")
```

```{r}
# Pass it to a convert function
convert_files(no_penguins_file, "no_penguins_converted.Rmd")

# The content doesn't change
cat(readLines("no_penguins_converted.Rmd"), sep = "\n")
```

Even though this file contains `ends_with("_mm")`, and is an R Markdown file, 
it doesn't use the **palmerpenguins** package, so no substitutions are made. 
Notice also that there were no messages generated when `convert_files()` was called, 
indicating that none of the input files changed.

## Final considerations
### Class

The versions of `penguins` and `penguins_raw` in R ≥ 4.5.0's **datasets** package will always (just) have class `data.frame`.
In contrast, the **palmerpenguins** versions will have classes `tbl_df`, `tbl` and `data.frame` if the **[tibble](https://tibble.tidyverse.org)** package is installed on your computer (and just class `data.frame` if not).

### `penguins_raw`

The versions of `penguins_raw` in **palmerpenguins** and **datasets** are identical, except potentially for their class, as described above. 
No specific changes are made to `penguins_raw` by the `convert_*()` functions in **basepenguins**, 
but by removing the call to `library(palmerpenguins)`, the **datasets** version will be used in any scripts,
which is always a `data.frame` (never a `tbl_df`).

### The **palmerpenguins** package

Note that the **palmerpenguins** package provides features that are not in R,
such as vignettes and articles on the [package website](https://allisonhorst.github.io/palmerpenguins/index.html). 
The package also contains the data in two csv files and provides a [function](https://allisonhorst.github.io/palmerpenguins/reference/path_to_file.html) to access them. 
And, of course, Allison Horst's wonderful [penguins artwork](https://allisonhorst.github.io/palmerpenguins/articles/art.html)!
The **palmerpenguins** package will remain on CRAN and keep its package website.

We are extremely grateful to the authors of **palmerpenguins**, 
Allison Horst, Alison Hill and Kristen Gorman, 
for their support for adding the Palmer Penguins data to **datasets**, 
and their enthusiasm about **basepenguins**.