ExPanD is a shiny based app building on the functions of the ExPanDaR package. Its purpose is to make panel data exploration fun and easy. Using ExPanD you can
This vignette will guide you through the process of using ExPanD by discussing three use cases. While the first two use macro-economic data to explore the association of gross domestic production (GDP) per capita with life expectancy at birth, the last use case explores the association between financial accounting performance measures and concurrent stock returns.
If you do not use R you can still use the ExPanD app to explore panel data! In this case, access the hosted variant of the ExPanD app here and follow the advice below on how to upload a suitable panel data file for online exploration. No worries: Your data won’t be stored on the server and will be deleted from memory once the connection to the server is closed.
The easiest way to start using ExPanD is to use it with a local data file containing panel data. ExPanD supports Stata, SAS, CSV, Excel and R file formats.
For using ExPanD from within R, you have to install the ExPanDaR package and start ExPanD.
Alternatively, you can simply access the online hosted version of the ExPanD app here (no R required).
After starting ExPanD, you will be greeted with a bare bones file upload dialog.
Now you need a file to explore. Feel free to use whatever you want but for our first use case I will use the well-known gapminder
dataset provided by the gapminder package (click here if you do not know the gapminder initiative).
library(gapminder)
# write.csv(gapminder, file = "gapminder.csv", row.names = FALSE)
head(gapminder, 10)
#> # A tibble: 10 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
To use ExPanD, you need the following:
As you see, the gapminder file contains country-year data. It is organized in a long format by using country
as cross-sectional identifier and year
as time-series identifier. Each of the additional variables is then stored in a separate column. It has one factor (continent
) and three numerical variables (lifeExp
, pop
and gdpPercap
). So, it complies with the above requirements, assuming that is has no duplicates:
OK. Use the commented-out write.csv()
function call above to save the CSV file to your system and the file dialog to load it into ExPanD (if you are not using R, you can download the CSV file here). After uploading the file, two dialog boxes will appear asking you to select the cross-sectional identifier(s) and the time-series identifier.
Select country
as the cross-sectional identifier and year
as the time-series identifier. ExPanD will now process the data to display it so that you can start exploring.
Alternatively, if you are using R you can bypass the file upload dialog by specifying a data frame and its cross-sectional as well as time-series identifiers.
Regardless whether you uploaded the gapminder data via the file dialog or specified the data frame in the ExPanD()
function call, the ExPanD shiny app will start up and look like this.
As can be seen by the bar chart, the gapminder dataset provides a balanced panel of 142 countries with 12 observations per country. The missing values graph shows no missing data across all variables. When you scroll down, you will see that the dataset contains three numerical variables. Play around with the histogram and the extreme observation table to learn more about these. The time trend graph and the quantile time trend communicate good news: the life expectancy is increasing over time world-wide. You can verify that the same holds true for the population of the sample countries and for their GDP per capita. You will also notice that for the latter two the cross-country distribution widens over time.
The gapminder dataset is often used to document the strong positive association between GDP per capita and life expectancy. You can see this association in the correlation plot. The blue ellipsoid above (below) the diagonal visualizes the positive Pearson (Spearman) correlation of the two variables. If you are interested in the exact correlation values, hover over the ellipsoid with your mouse.
The scatter plot and the regression analysis section allow you to explore this association in a little bit more detail. Below you will see a screenshot where I prepared a “Hans Rosling” scatter plot (click here if you do not know the name). In addition, I estimated a by region OLS model with country fixed effects and standard errors clustered by country to verify that the association is not just driven by unobservable time-constant country heterogeneity.
Looking at the scatter plot you notice that there are some observations with extremely high GDP per capita that trigger the LOESS smoother line to get a negative slope. If you hover over the dots with your mouse you will see that these are observations from Kuwait. To what extent are our regression results affected by these extreme observations? To figure this out, scroll up and select to winsorize your data at the 1 % level. After doing this, the figure from above now looks like this.
The association has become more robust across regions and the scatter plot now shows a positive association across the complete range of winsorized GDP per capita.
Continue to play around with your data. Let us assume that at some point you find something that you consider worth preserving so that next time you start ExPanD with the gapminder dataset, it starts directly into the view that you just have. No problem! Just scroll down to the bottom of the page. There, you will find a save dialog (and a load dialog as well, just in case). Save your ExPanD choices to a place that you will remember. The file that will be stored is a plain list, saved as a RDS file. Assuming that you named the file “ExPanD_config.RDS” and stored in your current work directory, you can now start ExPanD right into your favorite analysis by providing this list.
The gapminder dataset contains only three numerical variables. You might wonder how the association between GDP per capita and life expectancy would look like if you include additional test or control variables. In addition, GDP per capita, as a metric affected by growth processes, is far from being normally distributed. Does the association with life expectancy hold when you log transform it? Time for our second use case that re-examines the above presented association by using data provided by the World Bank.
The questions of the last paragraph are typical for exploratory data analysis workflows and ExPanD is equipped to handle them. When started in its “advanced mode”, it provides two samples: A base sample and an analysis sample. You can then define additional variables based on the base sample interactively.
When you call ExPanD without options, it will start into the advanced mode, generating an analysis sample that is identical to the sample that you uploaded. When you start ExPanD by providing it with a data frame at the command line, you decide whether you want to use the “simple” or “advanced” mode. When you prepare a data frame containing variable definitions via the var_def
parameter, ExPanD will start in the advanced mode.
A variable definition file has to contain at least three character columns: var_name
, var_def
and type
. In addition, it can contain a logical column can_be_na
. Let’s take a look at the variable definition data frame for the worldbank
dataset provided by the ExPanDaR package.
head(worldbank_var_def, 10)
#> var_name var_def type can_be_na
#> 1 country country cs_id 0
#> 2 region region factor 0
#> 3 income income factor 0
#> 4 year year ts_id 0
#> 5 time as.numeric(as.character(year))-1960 numeric 0
#> 6 gdp NY.GDP.MKTP.KD numeric 0
#> 7 population SP.POP.TOTL numeric 0
#> 8 gdp_capita NY.GDP.PCAP.KD numeric 0
#> 9 extdebt_gni DT.DOD.DECT.GN.ZS numeric 1
#> 10 debtservice_gni DT.TDS.DPPG.GN.ZS numeric 1
var_name
contains variable names for the analysis sample and var_def
contains the definitions for these variables. The definitions refer to variables contained in the worldbank
dataset (which are conforming to the naming convention of the World Bank). Most definitions are just simple 1:1 transformations of the worldbank
dataset but, as you can see from the definition for time
, you can also use standard R expressions within the scope of the worldbank
data frame. For the R experts: Your definition will be evaluated within a dplyr::mutate()
call on the base data frame grouped by the cross section and ordered by the time-series identifier, so for example that lead()
and lag()
should work as expected.
With the type
variable you specify the nature of the variable that you just defined. Possible values are cs_id
, ts_id
, numeric
, logical
, and factor
. They identify cross-sectional identifier(s), the time-series identifier, numerical variables, Boolean (True/False) type variables, and variables to be treated as grouping factors. Note that the data does not have to have the according class but is has to be coercible to it.
The can_be_na
variable can be omitted. If you do not provide it, it will be set to TRUE
for all variables besides the cross-sectional and time-series identifiers. In the worldbank_var_def
data frame it is set to FALSE
for the variables time
, gdp
, population
, and gpd_capita
, meaning that only observations with non-missing values for these variables will be included in the analysis dataset.
By customizing this data frame that you provide to ExPanD()
via the var_def
parameter, you can design the analysis sample as you wish. An alternative and more interactive approach is to define variables interactively while running ExPanD()
. Let’s try. Run the following code to start ExPanD with the worldbank
base data in advanced mode.
library(ExPanDaR)
ExPanD(df = worldbank,
df_def = worldbank_data_def,
var_def = worldbank_var_def,
config_list = ExPanD_config_worldbank)
What you will see is a similar analysis to the gapminder analysis of the first use case but with a more extensive dataset. The scatter plot and the regression analysis are displayed below.
It shows a positive association of GDP per capita with life expectancy after controlling for public spending on health and income inequality (which happens to be negatively associated with life expectancy). As you see from the table, the number of observations is 1,068. How does this reconcile with the roughly 8,500 observations that the World Bank sample has data for? A quick look at the missing values graph below helps to understand the issue.
While gdp_capita
is available for all observations (remember the can_be_na
variable in the data definition data frame?) and life_expectancy
has good coverage for all but the most recent years, both pubspend_health_gdp
and giniindex
are only available for later years in the sample. giniindex
is also only available for a subset of countries. Taken together, this drastically reduces the sample size of the regression model. Explore whether this has an effect on the documented associations by excluding and including the test variables one-by-one. You will see that the associations are reasonable robust.
Now let’s see whether the distributional properties of the main independent variable of interest have an impact on the association. The screenshot below displays the histogram of gdp_capita
.
This looks like a log-normal distributed variable, so a log transformation should yield a more normally distributed variable. To calculate a logged variant of gdp_capita
we first need to find which World Bank data item gdp_capita
is based on. Hovering with your mouse over the variable name in the descriptive sample, you will see that it is based on the data item NY.GDP.PCP.KD
. When you switch the tab of the descriptive statistics to the base sample, you can see all 72 base data items that the worldbank
dataset contains. Use the dialog above the descriptive statistics as shown below to calculate a log-transformed measure of GDP per capita.
You will see a message window that your variable was successfully generated. How does its histogram look like?
Better. Now let’s see how this new variable is associated with life expectancy. First a quick look at the scatter plot.
Now this looks different than the gapminder plot in the section above as it exhibits a more linear association. Let’s see how our regression model looks like when use log_gdp_capita
instead of gdp_capita
.
The logged version of GDP per capita remains robustly positively associated with life expectancy but now income inequality (as measured by giniindex
) is only marginally associated with life expectancy.
Another thing that one can notice from the scatter plot above is that each country appears to be on its own “trajectory” in terms of life expectancy development. As we also know that in most countries and periods GDP per capita increases over time: Can we be sure that the association of GDP per capita and life expectancy is different from a general time trend in the data? Below, you will find a scatter plot that uses time
as the independent variable.