Case Study: Working with Eurobarometer surveys

The goal of this case study is to analyze trust in the national parliaments, the European Parliament, and the European Commission across Europe, using data from the Eurobarometer.

The Eurobarometer is a biannual survey conducted by the European Commission to monitor public opinion in EU member states and, occasionally, in candidate countries. Each EB wave is devoted to a particular topic, but most waves ask some “trend questions”, i.e. questions that are repeated frequently in the same form. Trust in institutions is one of these trend questions.

Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR, and through the Social Science Data Archive networks. You should cite your source; in our examples, we rely on the GESIS data files. In this case study we use nine waves of the Eurobarometer between 1996 and 2019: 44.2bis (January-March 1996), 51.0 (March-April 1999), 57.1 (March-May 2002), 64.2 (October-November 2005), 69.2 (March-May 2008), 75.3 (May 2011), 81.2 (March 2014), 87.3 (May 2017), and 91.2 (March 2019).

In the Afrobarometer Case Study we have shown how to merge two waves of a survey with a limited number of variables. This workflow is not feasible with Eurobarometer on a PC or laptop, because there are too many large files to handle.

library(retroharmonize)
library(dplyr)
eurobarometer_waves <- file.path("working", dir("working"))
eb_waves <- read_surveys(eurobarometer_waves, .f='read_rds')

We can review if the main descriptive metadata is correctly present with document_waves().

documented_eb_waves <- document_waves(eb_waves) 

Metadata map

We start by extracting metadata from the survey data files and storing them in a tidy table, where each row contains information about one variable from a survey data file. To keep the size manageable, we keep only a few variables: the row ID, the weighting variable, the country code, and the variables that contain “parliament” or “commission” in their names.

eb_trust_metadata <- lapply ( X = eb_waves, FUN = metadata_create )
eb_trust_metadata <- do.call(rbind, eb_trust_metadata)
#let's keep the example manageable:
eb_trust_metadata  <- eb_trust_metadata %>%
  filter ( grepl("parliament|commission|rowid|weight_poststrat|country_id", var_name_orig) )
head(eb_trust_metadata)
#>           filename           id             var_name_orig
#> 1 ZA2828_trust.rds ZA2828_trust                     rowid
#> 2 ZA2828_trust.rds ZA2828_trust                country_id
#> 3 ZA2828_trust.rds ZA2828_trust          weight_poststrat
#> 4 ZA2828_trust.rds ZA2828_trust trust_european_commission
#> 5 ZA2828_trust.rds ZA2828_trust trust_european_parliament
#> 6 ZA2828_trust.rds ZA2828_trust trust_national_parliament
#>                            class_orig
#> 1                           character
#> 2                           character
#> 3                             numeric
#> 4 retroharmonize_labelled_spss_survey
#> 5 retroharmonize_labelled_spss_survey
#> 6 retroharmonize_labelled_spss_survey
#>                                             label_orig  labels valid_labels
#> 1                    unique identifier in za2828_trust      NA           NA
#> 2 nation all samples iso 3166 crosstabulation variable      NA           NA
#> 3                            weight result from target      NA           NA
#> 4                          rely on european commission 1, 2, 3         1, 2
#> 5                          rely on european parliament 1, 2, 3         1, 2
#> 6                          rely on national parliament 1, 2, 3         1, 2
#>   na_labels na_range n_labels n_valid_labels n_na_labels
#> 1        NA       NA        0              0           0
#> 2        NA       NA        0              0           0
#> 3        NA       NA        0              0           0
#> 4         3       NA        3              2           1
#> 5         3       NA        3              2           1
#> 6         3       NA        3              2           1
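The filtering step above can be tried without any survey files. The following is a minimal sketch in base R on a made-up metadata table: the column name var_name_orig mirrors the metadata_create() output shown above, while the file name toy_wave.rds and the variable age_exact are invented for illustration.

```r
# Toy metadata table; the rows are made up for illustration only.
toy_metadata <- data.frame(
  filename      = "toy_wave.rds",
  var_name_orig = c("rowid", "country_id", "weight_poststrat",
                    "trust_european_commission", "age_exact"),
  stringsAsFactors = FALSE
)

# Same regular expression as in the filter() call above
keep_pattern <- "parliament|commission|rowid|weight_poststrat|country_id"

# grepl() returns one TRUE/FALSE per variable name; keep the matching rows
toy_selected <- toy_metadata[grepl(keep_pattern, toy_metadata$var_name_orig), ]
toy_selected$var_name_orig
```

Because the pattern is matched against the variable names, a variable such as age_exact is dropped even if its label happened to mention a parliament.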

The value labels in this example are not too numerous. The only variable that stands out is the one with the “can rely on it” and “cannot rely on it” labels.

collect_val_labels(eb_trust_metadata)
#> [1] "CAN RELY ON IT"    "CANNOT RELY ON IT" "Tend to trust"    
#> [4] "Tend not to trust" "DK"

The following labels were marked by GESIS as missing values:

collect_na_labels(eb_trust_metadata)
#> [1] "DK"                         "NA"                        
#> [3] "Inap. (33 in V6)"           "Inap. (CY-TCC in isocntry)"

We have created a helper function, subset_save_surveys(), that programmatically reads in SPSS files, makes the necessary type conversion to labelled_spss_survey() without harmonization, and saves a small, subsetted rds file. Because this is a native R file, it is far more efficient to handle in the actual workflow.

## You will likely use your own local working directory, or
## tempdir() that will create a temporary directory for your 
## session only. 
working_directory <- tempdir()
# This code is for illustration only; it is not evaluated.
# To replicate the workflow, you need to have the SPSS file names 
# as a list, and you have to set up your own import and export paths.

selected_eb_metadata <- readRDS(
  system.file("eurob", "selected_eb_waves.rds", package = "retroharmonize")
  ) %>%
  mutate ( id = substr(filename,1,6) ) %>%
  rename ( var_label = var_label_std ) %>%
  mutate ( var_name = var_label )

## This code is not evaluated, it is only an example. You are likely 
## to have a directory where you have already downloaded the data
## from GESIS after accepting their terms of use.

subset_save_surveys ( 
  var_harmonization = selected_eb_metadata, 
  selection_name = "trust",
  import_path = gesis_dir, 
  export_path = working_directory )

Harmonize the labels

For easier looping, we wrap the harmonize_values() function with new default settings. It would be tempting to preserve the rely labels as distinct from the trust labels, but if we use the same numeric coding, it will lead to confusion. If you want to keep the difference between the two types of category labels, then the harmonization should be done in a two-step process.

harmonize_eb_trust <- function(x) {
  label_list <- list(
    from = c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
             "^dk", "^inap", "na"), 
    to = c("not_trust", "not_trust", "trust", "trust",
           "do_not_know", "inap", "inap"), 
    numeric_values = c(0,0,1,1, 99997,99999,99999)
  )

  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c("do_not_know"= 99997,
                                 "declined"   = 99998,
                                 "inap"       = 99999 )
  )
}
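To make the from/to mapping more tangible, here is a minimal base R sketch of how such a regex-based relabelling could work. The helper map_labels() is hypothetical and is not part of retroharmonize; it assumes that labels are lower-cased first and that the first matching pattern wins. Whether harmonize_values() resolves overlapping patterns in the same way is an assumption.

```r
# Original value labels, as reported by collect_val_labels() and
# collect_na_labels() above
orig_labels <- c("CAN RELY ON IT", "CANNOT RELY ON IT", "Tend to trust",
                 "Tend not to trust", "DK", "NA", "Inap. (33 in V6)")

from <- c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
          "^dk", "^inap", "na")
to   <- c("not_trust", "not_trust", "trust", "trust",
          "do_not_know", "inap", "inap")

# Hypothetical helper (not a retroharmonize function): returns the
# harmonized label of the first pattern that matches each original label.
map_labels <- function(labels, from, to) {
  vapply(tolower(labels), function(lbl) {
    hit <- which(vapply(from, grepl, logical(1), x = lbl, perl = TRUE))
    if (length(hit) == 0) NA_character_ else to[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

harmonized <- map_labels(orig_labels, from, to)

# Attach the numeric codes used in harmonize_eb_trust()
codes <- c(not_trust = 0, trust = 1, do_not_know = 99997, inap = 99999)
data.frame(orig_labels, harmonized, code = unname(codes[harmonized]))
```

Note that the unanchored pattern "na" also matches inside "inap"; this is harmless here only because both patterns map to the same harmonized label.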

Let’s check that all nine waves are present as expected:

document_waves(eb_waves)
#> # A tibble: 9 x 5
#>   id           filename          ncol  nrow object_size
#>   <chr>        <chr>            <int> <int>       <dbl>
#> 1 ZA2828_trust ZA2828_trust.rds     7 65178     8881288
#> 2 ZA3171_trust ZA3171_trust.rds    14 16179     3150696
#> 3 ZA3639_trust ZA3639_trust.rds    14 16012     3116544
#> 4 ZA4414_trust ZA4414_trust.rds    14 29430     5693504
#> 5 ZA4744_trust ZA4744_trust.rds    14 30170     5833760
#> 6 ZA5481_trust ZA5481_trust.rds    10 31769     5109472
#> 7 ZA5913_trust ZA5913_trust.rds    10 27932     4494592
#> 8 ZA6863_trust ZA6863_trust.rds    14 33180     6411712
#> 9 ZA7562_trust ZA7562_trust.rds     8 27524     3980296

To review the harmonization on a single survey, use pull_survey().

test_trust <- pull_survey(eb_waves, filename = "ZA4414_trust.rds")

Before running our adapted harmonization function, we have this:

test_trust$trust_european_commission[1:16]
#>  [1] 3 1 2 3 1 2 2 2 1 3 1 1 1 1 1 1
#> attr(,"labels")
#>     Tend to trust Tend not to trust                DK 
#>                 1                 2                 3 
#> attr(,"label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"na_values")
#> [1] 3
#> attr(,"ZA4414_name")
#> [1] "v213"
#> attr(,"ZA4414_values")
#> 1 2 3 
#> 1 2 3 
#> attr(,"ZA4414_label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"ZA4414_labels")
#>     Tend to trust Tend not to trust                DK 
#>                 1                 2                 3 
#> attr(,"ZA4414_na_values")
#> [1] 3
#> attr(,"id")
#> [1] "ZA4414"
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"                
#> [3] "haven_labelled"

After performing harmonization, it would look like this:

harmonize_eb_trust(x=test_trust$trust_european_commission[1:16])
#>  [1] 99997     1     0 99997     1     0     0     0     1 99997     1     1
#> [13]     1     1     1     1
#> attr(,"labels")
#>   not_trust       trust do_not_know    declined        inap 
#>           0           1       99997       99998       99999 
#> attr(,"label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"na_values")
#> [1] 99997 99998 99999
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"                
#> [3] "haven_labelled"                     
#> attr(,"survey_id_name")
#> [1] "x"
#> attr(,"survey_id_values")
#>     2     1     3 
#>     0     1 99997 
#> attr(,"survey_id_label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"survey_id_labels")
#>     Tend to trust Tend not to trust                DK 
#>                 1                 2                 3 
#> attr(,"survey_id_na_values")
#> [1] 3
#> attr(,"id")
#> [1] "survey_id"

If you are satisfied with the results, run harmonize_eb_trust() on all nine survey waves. Whenever a variable is missing from a wave, it is filled with inappropriate (inap) missing values.
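The filling of absent variables can be illustrated with a small base R sketch. It is not the harmonize_waves() implementation: the helper fill_missing_vars() and the two toy waves are hypothetical, and only the code 99999 is taken from harmonize_eb_trust() above.

```r
# Two toy "waves"; the second one lacks trust_european_parliament.
wave_a <- data.frame(rowid = c("a1", "a2"),
                     trust_national_parliament = c(1, 0),
                     trust_european_parliament = c(1, 99997))
wave_b <- data.frame(rowid = c("b1", "b2"),
                     trust_national_parliament = c(0, 1))

inap_code <- 99999  # same code as "inap" in harmonize_eb_trust()

# Hypothetical helper: add every absent column, filled with the inap code,
# and return the columns in a common order.
fill_missing_vars <- function(df, all_vars, fill = inap_code) {
  for (v in setdiff(all_vars, names(df))) df[[v]] <- fill
  df[, all_vars, drop = FALSE]
}

all_vars <- union(names(wave_a), names(wave_b))
stacked  <- do.call(rbind, lapply(list(wave_a, wave_b), fill_missing_vars,
                                  all_vars = all_vars))
stacked$trust_european_parliament
```

The rows coming from the second wave carry 99999 in the column they did not originally have, so they can later be told apart from “do not know” answers.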

Harmonize waves

We select a subset of countries (Belgium, Hungary, Italy, Malta, the Netherlands, Poland, and Slovakia) and a subset of variables (the row ID, the country code, the weight, and the three trust variables).

eb_waves_selected <- lapply ( eb_waves, function(x) x %>% select ( 
  any_of (c("rowid", "country_id", "weight_poststrat", 
            "trust_national_parliament", "trust_european_commission", 
            "trust_european_parliament"))) %>%
    filter ( country_id %in% c("NL", "PL", "HU", "SK", "BE", 
                               "MT", "IT")))
harmonized_eb_waves <- harmonize_waves ( 
  waves = eb_waves_selected, 
  .f = harmonize_eb_trust )

We cannot rely on document_waves() anymore, because the result is a single data frame. Let’s have a look at the descriptive metadata.

wave_attributes <- attributes(harmonized_eb_waves)
wave_attributes$id
#> [1] "Waves: ZA2828_trust; ZA3171_trust; ZA3639_trust; ZA4414_trust; ZA4744_trust; ZA5481_trust; ZA5913_trust; ZA6863_trust; ZA7562_trust"
wave_attributes$filename
#> [1] "Original files: ZA2828_trust.rds; ZA3171_trust.rds; ZA3639_trust.rds; ZA4414_trust.rds; ZA4744_trust.rds; ZA5481_trust.rds; ZA5913_trust.rds; ZA6863_trust.rds; ZA7562_trust.rds"
wave_attributes$names
#> [1] "rowid"                     "country_id"               
#> [3] "weight_poststrat"          "trust_national_parliament"
#> [5] "trust_european_commission" "trust_european_parliament"

Analyze the data

The harmonized data can be analyzed in R. The labelled survey data is stored in labelled_spss_survey() vectors, a complex class that retains much of the metadata for reproducibility. Most statistical R packages do not recognize this class, so the data should be passed to them either as numeric data with as_numeric() or as categorical data with as_factor(). (The labelled_spss_survey class vignette explains why you should not fall back on the more generic as.factor() or as.numeric() methods.)
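The risk of a generic numeric conversion can be shown with plain numbers. The sketch below is a base R illustration under the assumption that a naive conversion keeps the missing-value codes as ordinary numbers; it does not call retroharmonize, and the toy vector is made up.

```r
# Harmonized codes as plain numbers: 0 = not_trust, 1 = trust,
# 99997 = do_not_know, 99999 = inap (the codes used in harmonize_eb_trust()).
codes     <- c(1, 0, 99997, 1, 99999, 0)
na_values <- c(99997, 99998, 99999)

# Naive conversion: the missing-value codes are averaged as if they
# were real answers.
naive_mean <- mean(codes)

# Recode the declared missing values to NA first, then average.
clean      <- ifelse(codes %in% na_values, NA_real_, codes)
clean_mean <- mean(clean, na.rm = TRUE)

c(naive = naive_mean, clean = clean_mean)
```

The naive mean is dominated by the large missing-value codes, while the cleaned mean is the share of “trust” answers among the valid responses.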

First, let’s treat the trust variables as factors. A summary of the resulting data allows us to screen for values that are outside of the expected range. In the trust variables, any value other than “trust” and “not trust” that is not defined as missing is unacceptable. In our example, there are no such values.

We also see some basic information about the weighting factors, which in the selected Eurobarometer subset range from below 0.01 to almost 7. The range of these values is pretty large, which needs to be taken into account when analyzing the data.

harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_factor ) %>%
  summary()
#>     rowid            country_id        weight_poststrat
#>  Length:58917       Length:58917       Min.   :0.0095  
#>  Class :character   Class :character   1st Qu.:0.7290  
#>  Mode  :character   Mode  :character   Median :0.9315  
#>                                        Mean   :1.0000  
#>                                        3rd Qu.:1.1908  
#>                                        Max.   :6.9678  
#>  trust_national_parliament trust_european_commission trust_european_parliament
#>  not_trust  :31432         not_trust  :15696         not_trust  :16109        
#>  trust      :22210         trust      :26451         trust      :27701        
#>  do_not_know: 5269         do_not_know:10132         do_not_know: 8468        
#>  declined   :    0         declined   :    0         declined   :    0        
#>  inap       :    6         inap       : 6638         inap       : 6639        
#> 

Now we convert the trust variables to numeric format and look at the summary. The conversion loses the information about the type of the missing values: they are now all lumped together as NA. What we gain is the proportion of positive responses (which ranges from 0.41 for trust in the national parliament to 0.63 for trust in the European Parliament), and the ability to, e.g., construct scales from the binary variables.

numeric_harmonization <- harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_numeric )
summary(numeric_harmonization)
#>     rowid            country_id        weight_poststrat
#>  Length:58917       Length:58917       Min.   :0.0095  
#>  Class :character   Class :character   1st Qu.:0.7290  
#>  Mode  :character   Mode  :character   Median :0.9315  
#>                                        Mean   :1.0000  
#>                                        3rd Qu.:1.1908  
#>                                        Max.   :6.9678  
#>                                                        
#>  trust_national_parliament trust_european_commission trust_european_parliament
#>  Min.   :0.000             Min.   :0.000             Min.   :0.000            
#>  1st Qu.:0.000             1st Qu.:0.000             1st Qu.:0.000            
#>  Median :0.000             Median :1.000             Median :1.000            
#>  Mean   :0.414             Mean   :0.628             Mean   :0.632            
#>  3rd Qu.:1.000             3rd Qu.:1.000             3rd Qu.:1.000            
#>  Max.   :1.000             Max.   :1.000             Max.   :1.000            
#>  NA's   :5275              NA's   :16770             NA's   :15107

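Finally, let’s calculate weighted means of trust in the national parliament, the European Parliament, and the European Commission, for the selected countries, across all EB waves. Note that the code below averages each response multiplied by its post-stratification weight, which equals a true weighted mean only to the extent that the weights of the non-missing responses average 1 within each country.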

numeric_harmonization %>%
  group_by(country_id) %>%
  summarize_at ( vars(contains("trust")), 
                 list(~mean(.*weight_poststrat, na.rm=TRUE))) 
#> # A tibble: 7 x 4
#>   country_id trust_national_parliament trust_european_comm~ trust_european_parl~
#>   <chr>                          <dbl>                <dbl>                <dbl>
#> 1 BE                             0.453                0.610                0.623
#> 2 HU                             0.328                0.639                0.642
#> 3 IT                             0.317                0.603                0.630
#> 4 MT                             0.568                0.753                0.754
#> 5 NL                             0.657                0.664                0.635
#> 6 PL                             0.225                0.658                0.642
#> 7 SK                             0.294                0.613                0.641
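The summary above averages the products of responses and weights. As a point of comparison, here is a minimal base R sketch on made-up numbers that contrasts this with weighted.mean(), which divides by the sum of the weights of the non-missing responses. The vectors x and w are hypothetical and are not taken from the Eurobarometer data.

```r
# Toy responses (1 = trust, 0 = not_trust, NA = missing) and weights
x <- c(1, 0, 1, NA, 0)
w <- c(2.0, 0.5, 1.5, 1.0, 0.5)

# Mean of the weighted responses: sum(x * w) / number of non-missing products
mean_of_products <- mean(x * w, na.rm = TRUE)

# Weighted mean: sum(x * w) / sum of the weights of non-missing responses
weighted <- weighted.mean(x, w, na.rm = TRUE)

c(mean_of_products = mean_of_products, weighted_mean = weighted)
```

The two figures coincide only when the weights of the valid responses average exactly 1 in the group. In a dplyr pipeline, one possible variant would be to summarize each trust variable with weighted.mean(x, weight_poststrat, na.rm = TRUE) per country, which would give somewhat different numbers from the table above.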