Customize ExPanD

Joachim Gassen

2020-12-06

By default, ExPanD offers a set of sample modification and data exploration/visualization components that support a typical EDA workflow. In many situations, however, your data does not need all of these components, or you might want to center the user experience on a subset of them. In addition, you might want to change the ordering of components, add additional information or even allow the user to define additional variables interactively. You can customize ExPanD along the following dimensions: selecting the components to include and their order, adding variable definitions, presetting the variables and specifications that components display, allowing users to define additional variables, and adding HTML blocks between components.

This vignette will guide you through each of these opportunities by developing a customized ExPanD app step by step. The code in the vignette is additive: each code block repeats the setup of the previous ones, so you can source any block independently from the others to see how the ‘ExPanD’ app is changed by the respective customization step.

Select components to include and their order

When you start ExPanD in default mode, e.g., by

library(ExPanDaR)
library(gapminder)

ExPanD(df = gapminder, cs_id = "country", ts_id = "year")

ExPanD will include all available components in their default order:

Name Description
sample_selection A drop down menu to select the sample to be used and whether it should be balanced
subset_factor A drop down menu to select a factor on which to limit the sample and an additional menu to select the value of that factor on which you want to focus the analysis
grouping A drop down menu that gives you the option to name a grouping factor. The grouping factor allows the user to focus certain components on a sub-sample defined by a value of that factor. Also, the grouping component contains the dialog for outlier treatment
bar_chart A component displaying a bar chart reporting the observations by time period
missing_values A visual displaying the frequency of missing values for each variable and time period
udvars A component that allows users to define additional variables (see below)
descriptive_table A descriptive table for all numerical and logical variables contained in the sample
histogram A histogram to display the distribution of a chosen numerical or logical variable
ext_obs A list displaying the 5 most extreme observations for a chosen variable, overall or by time period
by_group_bar_graph A bar graph that visualizes a chosen descriptive statistic by a chosen grouping factor
by_group_violin_graph A by group violin plot, where users can choose the variable and the grouping factor
trend_graph A graph that visualizes the development of up to three variables over time
quantile_trend_graph A graph that visualizes the distribution of one chosen variable over time
by_group_trend_graph A grouped time trend graph that visualizes the development of one variable across groups over time
corrplot A visual representation of Spearman and Pearson correlations of the numerical and logical variables in the sample. Exact correlations are displayed when you hover above the respective cell with your mouse
scatter_plot A scatter plot, where you can present up to 4 dimensions of your data (x, y, size, color). You can choose whether you want a LOESS line to be displayed and whether you want your sample to be limited to 1,000 observations
regression A regression component where you can estimate a linear or logit regression, with up to two fixed effect levels and standard error clustering

To select the components that you want to display, you have two options: First, you can simply indicate the components that you want to omit. The remaining components stay in the original order. As an example, the ‘gapminder’ data set does not have any missing values and is a balanced sample. This means that we can safely omit the sample selection component as well as the display of missing values.

library(ExPanDaR)
library(gapminder)
data(gapminder)

ExPanD(df = gapminder, cs_id = "country", ts_id = "year", 
       components = c(sample_selection = FALSE, missing_values = FALSE))

Alternatively, you can choose the components that you want to include. This also gives you the option to sort the components in your preferred order. Let us assume that we want our app to focus on the Preston curve association of GDP per capita and life expectancy in the ‘gapminder’ data. So maybe we want to include the following components:

library(ExPanDaR)
library(gapminder)
data(gapminder)

ExPanD(df = gapminder, cs_id = "country", ts_id = "year", 
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE, 
                      regression = TRUE))
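Since the components are selected by name, a typo in a name is easy to make. The following is a minimal sketch of a pre-flight check in base R. The helper unknown_components() is hypothetical and not part of ‘ExPanDaR’; the list of names is copied from the table above, plus html_block, which is used later in this vignette.

```r
# Component names as listed in the table above, plus 'html_block'.
known_components <- c(
  "sample_selection", "subset_factor", "grouping", "bar_chart",
  "missing_values", "udvars", "descriptive_table", "histogram",
  "ext_obs", "by_group_bar_graph", "by_group_violin_graph",
  "trend_graph", "quantile_trend_graph", "by_group_trend_graph",
  "corrplot", "scatter_plot", "regression", "html_block"
)

# Hypothetical helper: returns the names in 'components' that are
# not in the list above (an empty character vector if all are known).
unknown_components <- function(components) {
  setdiff(names(components), known_components)
}

# Example with a deliberate typo in 'scatter_plot'.
components <- c(descriptive_table = TRUE,
                by_group_violin_graph = TRUE,
                scater_plot = TRUE,
                regression = TRUE)

unknown_components(components)
```

Running the check before calling ExPanD() flags the misspelled name without having to start the app.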

OK. While we are at it, we can also include an informative title and a short intro sentence.

library(ExPanDaR)
library(gapminder)
data(gapminder)

ExPanD(df = gapminder, cs_id = "country", ts_id = "year", 
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE, 
                      regression = TRUE))

Add variable definitions

When you look at the descriptive table you see that it does not offer self-explanatory variable names. Also, no tool-tips are displayed when you hover over the variable names with your mouse.

To add the tool-tip functionality, we need to specify a df_def data frame that describes the variables. The upside of this is that we no longer need to specify the cross-sectional and time series identifiers, as these are also defined in the data frame.

library(ExPanDaR)
library(gapminder)
data(gapminder)

df_def <- data.frame(
  var_name = names(gapminder),
  var_def = c("Name of the country",
              "Continent where country is located",
              "Year of data",
              "Life expectancy in years at birth",
              "Population in million",
              "Gross Domestic Product (GDP) per capita"),
  type = c("cs_id", "factor", "ts_id", rep("numeric", 3))
)

gapminder$pop <- gapminder$pop / 1e6

ExPanD(df = gapminder,
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      quantile_trend_graph = TRUE, 
                      scatter_plot = TRUE, 
                      regression = TRUE),
       df_def = df_def)

Now tool-tips for the descriptive table inform about our variables. Neat.
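Because df_def has to line up with the data, it can help to check it before starting the app. The sketch below is one possible base R check, not something ‘ExPanDaR’ provides: the helper check_df_def() is hypothetical, the toy data stands in for ‘gapminder’, and the rule that at least one cs_id and one ts_id must be present is an assumption derived from the example above.

```r
# Hypothetical helper: returns a character vector of problems found,
# or an empty vector if df_def looks consistent with df.
check_df_def <- function(df, df_def) {
  required_cols <- c("var_name", "var_def", "type")
  missing_cols <- setdiff(required_cols, names(df_def))
  if (length(missing_cols) > 0) {
    return(paste("df_def lacks column(s):",
                 paste(missing_cols, collapse = ", ")))
  }
  problems <- character(0)
  # Variables in the data that have no definition.
  undocumented <- setdiff(names(df), df_def$var_name)
  if (length(undocumented) > 0) {
    problems <- c(problems, paste("no definition for:",
                                  paste(undocumented, collapse = ", ")))
  }
  # Definitions that refer to variables not in the data.
  stray <- setdiff(df_def$var_name, names(df))
  if (length(stray) > 0) {
    problems <- c(problems, paste("defined but not in data:",
                                  paste(stray, collapse = ", ")))
  }
  # Assumption: both identifier types need to be declared.
  for (id_type in c("cs_id", "ts_id")) {
    if (!any(df_def$type == id_type)) {
      problems <- c(problems, paste("no variable of type", id_type))
    }
  }
  problems
}

# Toy data standing in for the real data set.
toy <- data.frame(country = c("A", "A", "B", "B"),
                  year = c(2000, 2005, 2000, 2005),
                  lifeExp = c(70.1, 71.3, 55.2, 57.8))

toy_def <- data.frame(
  var_name = c("country", "year", "lifeExp"),
  var_def = c("Name of the country", "Year of data",
              "Life expectancy in years at birth"),
  type = c("cs_id", "ts_id", "numeric"),
  stringsAsFactors = FALSE
)

check_df_def(toy, toy_def)
```

The same call with gapminder and the df_def from above should work the same way.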

Customize components so that they display certain variables and specifications

When you scroll down the components you will see that the scatter plot is not showing the typical Preston curve by default. We want to change that so that the scatter plot is based on the full sample and includes a LOESS line. It should show GDP per capita as X and life expectancy as Y, with population defining the size of the data points and the continent their color. Also, we want to estimate a panel model of life expectancy with country and year fixed effects, two-way clustered standard errors and GDP per capita as explanatory variable (Population is time constant in the ‘gapminder’ data set and as such subsumed by country fixed effects).

There are two ways to achieve that, a simple one and a hard one.

First, the simple one: Make the desired changes in the app, save the configuration of the app to a file whose location you can remember (button towards the bottom of the page), read that file with readRDS() into your environment and provide the resulting list as the parameter config_list to the ExPanD() function call. Below, I assume that you saved the configuration of the app as my_config.RDS to the working directory.

library(ExPanDaR)
library(gapminder)
data(gapminder)

df_def <- data.frame(
  var_name = names(gapminder),
  var_def = c("Name of the country",
              "Continent where country is located",
              "Year of data",
              "Life expectancy in years at birth",
              "Population in million",
              "Gross Domestic Product (GDP) per capita"),
  type = c("cs_id", "factor", "ts_id", rep("numeric", 3))
)

gapminder$pop <- gapminder$pop / 1e6

clist <- readRDS("my_config.RDS")

ExPanD(df = gapminder,
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE, 
                      regression = TRUE),
       df_def = df_def,
       config_list = clist)
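The two ways can also be combined: load a saved configuration and override only a few entries in code. The following sketch assumes that the saved configuration is a plain named list, as the readRDS() call above suggests. To keep it self-contained, it writes a small stand-in list to a temporary file instead of reading my_config.RDS; the entries of that stand-in are illustrative and much shorter than a real saved configuration.

```r
# Stand-in for a configuration saved from the app (illustrative content).
saved <- list(scatter_x = "year",
              scatter_y = "pop",
              scatter_loess = FALSE)
cfg_file <- tempfile(fileext = ".RDS")
saveRDS(saved, cfg_file)

# Read the configuration back in, as you would with my_config.RDS.
clist <- readRDS(cfg_file)

# Override selected entries; entries not named here are kept as saved.
overrides <- list(scatter_x = "gdpPercap",
                  scatter_y = "lifeExp",
                  scatter_loess = TRUE)
clist <- utils::modifyList(clist, overrides)

# Inspect the resulting list before passing it as config_list.
str(clist)
```

The resulting clist can then be passed to ExPanD() via config_list, just like the list read directly from the file.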

Now the hard one. You can specify the changes that you want to apply to the default configuration directly in the code. Take a look at clist to understand how the data is structured. I hope that most list member names are somewhat self-explanatory.

library(ExPanDaR)
library(gapminder)
data(gapminder)

df_def <- data.frame(
  var_name = names(gapminder),
  var_def = c("Name of the country",
              "Continent where country is located",
              "Year of data",
              "Life expectancy at birth, in years",
              "Population in million",
              "Gross Domestic Product (GDP) per capita in US-$, inflation-adjusted"),
  type = c("cs_id", "factor", "ts_id", rep("numeric", 3)),
  stringsAsFactors = FALSE
)

gapminder$pop <- gapminder$pop / 1e6

clist <- list(
  scatter_x = "gdpPercap",
  scatter_y = "lifeExp",
  scatter_size = "pop",
  scatter_color = "continent",
  scatter_loess = TRUE,
  scatter_sample = FALSE,
  
  reg_y = "lifeExp",
  reg_x = "gdpPercap",
  reg_fe1 = "country",
  reg_fe2 = "year",
  cluster = "4" # Now this is hard to guess 
  # 1: none, 2: first FE, 3: second FE, 4: both FE
)

ExPanD(df = gapminder,
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE, 
                      regression = TRUE),
       df_def = df_def,
       config_list = clist)

Now we have our customized components well documented in code. The scatter plot exhibits the Preston curve. The association is non-linear, potentially explaining the negative coefficient in the panel model. Maybe we want to allow the user to address this issue on-the-fly by defining additional variables?

Allow users to generate additional variables

This can be achieved by adding a component udvars.

library(ExPanDaR)
library(gapminder)
data(gapminder)

df_def <- data.frame(
  var_name = names(gapminder),
  var_def = c("Name of the country",
              "Continent where country is located",
              "Year of data",
              "Life expectancy in years at birth",
              "Population in million",
              "Gross Domestic Product (GDP) per capita"),
  type = c("cs_id", "factor", "ts_id", rep("numeric", 3)),
  stringsAsFactors = FALSE
)

gapminder$pop <- gapminder$pop / 1e6

clist <- list(
  scatter_x = "gdpPercap",
  scatter_y = "lifeExp",
  scatter_size = "pop",
  scatter_color = "continent",
  scatter_loess = TRUE,
  scatter_sample = FALSE,
  
  reg_y = "lifeExp",
  reg_x = "gdpPercap",
  reg_fe1 = "country",
  reg_fe2 = "year",
  cluster = "4" # Now this is hard to guess 1: none, 2: first FE, 3: second FE, 4: both FE
)

ExPanD(df = gapminder,
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE, 
                      udvars = TRUE,
                      regression = TRUE),
       df_def = df_def,
       config_list = clist)

This generates another component in your app that allows the user to define additional variables. It is positioned just below the scatter plot. For fun and giggles, you can define logGdpPercap <- log(gdpPercap) and verify with the scatter plot that a log-transformed explanatory variable makes the association more linear, which is also reflected in a now positive coefficient in the regression model. If you worry about security: the code that is passed to the server via the user variable definition component is sandboxed in the sense that it only includes the variables of the sample and the functions listed in the help text of the component.
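The claim about linearity can also be checked outside the app. The sketch below compares the pooled Pearson correlation of life expectancy with raw and with log-transformed GDP per capita. The helper compare_linearity() is hypothetical, and the pooled correlation is only a rough indicator of linearity; it is not the fixed effects regression that the app estimates. The call is guarded so that the block also runs when ‘gapminder’ is not installed.

```r
# Hypothetical helper: correlation of y with x and with log(x).
# A higher 'logged' value indicates a more linear association
# after the log transformation.
compare_linearity <- function(df, y = "lifeExp", x = "gdpPercap") {
  c(raw = cor(df[[y]], df[[x]]),
    logged = cor(df[[y]], log(df[[x]])))
}

# Apply to the gapminder data if the package is available.
if (requireNamespace("gapminder", quietly = TRUE)) {
  print(compare_linearity(gapminder::gapminder))
}
```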

Time to move to the finishing touches.

Add HTML blocks to support the flow of your app

From time to time it might be useful to include some additional text or other info in between components. For that, you have the option to include an arbitrary number of HTML blocks as additional components. Each HTML block will be wrapped in a fluid row, so you can use columns similar to what you would do in shiny.

library(ExPanDaR)
library(gapminder)
data(gapminder)

df_def <- data.frame(
  var_name = names(gapminder),
  var_def = c("Name of the country",
              "Continent where country is located",
              "Year of data",
              "Life expectancy in years at birth",
              "Population in million",
              "Gross Domestic Product (GDP) per capita"),
  type = c("cs_id", "factor", "ts_id", rep("numeric", 3)),
  stringsAsFactors = FALSE
)

gapminder$pop <- gapminder$pop / 1e6

clist <- list(
  scatter_x = "gdpPercap",
  scatter_y = "lifeExp",
  scatter_size = "pop",
  scatter_color = "continent",
  scatter_loess = TRUE,
  scatter_sample = FALSE,
  
  reg_y = "lifeExp",
  reg_x = "gdpPercap",
  reg_fe1 = "country",
  reg_fe2 = "year",
  cluster = "4" # Now this is hard to guess 1: none, 2: first FE, 3: second FE, 4: both FE
)

html_blocks <- c(
  paste('<div class="col-sm-2"><h3>Variation of life expectancy',
        "across regions and income levels</h3></div>",
        '<div class="col-sm-10">',
        "<p>&nbsp;</p>As you see below, life expectancy varies widely",
        "across countries and continents. One potential reason for this",
        "variation is the difference in income levels across countries.",
        "This association is visualized by the",
        '<a href="https://en.wikipedia.org/wiki/Preston_curve">',
        "Preston Curve</a> that you also find below.",
        "</div>"),
  paste('<div class="col-sm-2"><h3>Transform variables</h3></div>', 
        '<div class="col-sm-10">',
        "The Preston Curve is far from",
        "linear. Maybe you can come up with a transformation",
        "of GDP per capita that makes the association",
        "a little bit more well behaved?",
        "Use the dialog below to define a transformed",
        "measure of GDP per capita and assess its association",
        "with life expectancy in the scatter plot above.",
        "</div>"),
  paste('<div class="col-sm-2"><h3>Assess Robustness</h3></div>',
        '<div class="col-sm-10">',
        "You see below that the linear regression coefficient",
        "for GDP per capita is <i>negative</i>",
        "and significant in a panel model with country and year",
        "fixed effects.",
        "Does this also hold when you use a log-transformed version",
        "of GDP per capita?",
        "</div>")
)

ExPanD(df = gapminder,
       title = "Explore the Preston Curve",
       abstract = paste("This interactive display uses 'gapminder' data to",
                        "let you explore the Preston Curve. Scroll down and enjoy!"),
       components = c(descriptive_table = TRUE, 
                      html_block = TRUE,
                      by_group_violin_graph = TRUE, 
                      scatter_plot = TRUE,
                      html_block = TRUE,
                      udvars = TRUE,
                      html_block = TRUE,
                      regression = TRUE),
       df_def = df_def,
       config_list = clist,
       html_blocks = html_blocks)
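The three blocks above share the same two-column layout, so the markup can be factored out if you write many of them. The helper make_html_block() below is a hypothetical convenience function, not part of ‘ExPanDaR’. It only builds the character strings that you then pass via html_blocks.

```r
# Hypothetical helper: a narrow heading column (col-sm-2) next to a
# wide text column (col-sm-10), mirroring the blocks defined above.
# All arguments after 'heading' are pasted together with spaces.
make_html_block <- function(heading, ...) {
  paste0(
    '<div class="col-sm-2"><h3>', heading, '</h3></div>',
    '<div class="col-sm-10">', paste(..., sep = " "), '</div>'
  )
}

html_blocks <- c(
  make_html_block("Transform variables",
                  "The Preston Curve is far from linear.",
                  "Maybe you can come up with a transformation",
                  "of GDP per capita that makes the association",
                  "a little bit more well behaved?"),
  make_html_block("Assess Robustness",
                  "Does the negative coefficient also hold when you use",
                  "a log-transformed version of GDP per capita?")
)

# One string per block, in the order in which 'html_block' entries
# appear in the components vector.
length(html_blocks)
```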

This is it. Now we have a customized version of ExPanD focusing on communicating the robustness of the Preston Curve. If you do not want to run the code yourself, you can take a quick look at the customized version here.

If you want to take customization to the next level, you can always fork the code of the ‘ExPanDaR’ package on GitHub. The code for the shiny app is in the inst/app folder of the package.

Enjoy.