webtrackR


webtrackR is an R package to preprocess and analyze web tracking data, i.e., web browsing histories of participants in an academic study. Web tracking data are often collected and analyzed in conjunction with survey data from the same participants. The package is built on top of data.table and can thus comfortably handle very large datasets.

Installation

You can install the development version of webtrackR from GitHub with:

# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")

The CRAN version can be installed with:

install.packages("webtrackR")

S3 class wt_dt

The package defines an S3 class called wt_dt, which inherits most of its functionality from the data.table class. summary and print methods are included in the package.

Each row in a web tracking data set represents a visit. Raw data need to have at least the following three variables: panelist_id, url and timestamp.

The function as.wt_dt assigns the class wt_dt to a raw web tracking data set. It also allows you to specify the names of the raw variables corresponding to panelist_id, url and timestamp. Additionally, it converts the timestamp variable to POSIXct format.

All preprocessing functions check that these three variables are present and throw an error otherwise.
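
To make the requirements on raw data concrete, here is a minimal base R sketch of the same kind of preparation: renaming columns to the three required names, parsing the timestamp into POSIXct, and checking that all required variables are present. It is an illustration only and not the package's implementation. The raw column names (user, link, time), the timestamp format and the time zone are assumptions made up for this example.

```r
# Hypothetical raw export with non-standard column names (made up for illustration).
raw <- data.frame(
  user = c("p1", "p1", "p2"),
  link = c("https://example.com/a", "https://example.com/b", "https://example.org/"),
  time = c("2023-01-01 10:00:00", "2023-01-01 10:00:30", "2023-01-02 08:15:00"),
  stringsAsFactors = FALSE
)

# Rename the raw columns to the three variable names the package expects.
names(raw)[match(c("user", "link", "time"), names(raw))] <-
  c("panelist_id", "url", "timestamp")

# Parse the character timestamps into POSIXct (format and time zone are assumptions).
raw$timestamp <- as.POSIXct(raw$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Check that all required variables are present, and stop with an error otherwise.
required <- c("panelist_id", "url", "timestamp")
missing_vars <- setdiff(required, names(raw))
if (length(missing_vars) > 0) {
  stop("Missing required variables: ", paste(missing_vars, collapse = ", "))
}

str(raw)
```

With real data, as.wt_dt is the function to use; the sketch only shows what the resulting columns should look like.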

Preprocessing

Several other variables can be derived from the raw data, for example the duration of a visit with add_duration and the domain of a URL with extract_domain.

Classification

Summarizing and aggregating

Example code

A typical workflow including preprocessing, classifying and aggregating web tracking data looks like this (using the in-built example data):

library(webtrackR)

# load example data and turn it into wt_dt
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)

# add duration
wt <- add_duration(wt)

# extract domains
wt <- extract_domain(wt)

# drop duplicates (consecutive visits to the same URL within one second)
wt <- deduplicate(wt, within = 1, method = "drop")

# load example domain classification and classify domains
data("domain_list")
wt <- classify_visits(wt, classes = domain_list, match_by = "domain")

# load example survey data and join with web tracking data
data("testdt_survey_w")
wt <- add_panelist_data(wt, testdt_survey_w)

# aggregate number of visits by day and panelist, and by domain class
wt_summ <- sum_visits(wt, timeframe = "date", visit_class = "type")
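
For readers who want to see what two of these steps mean at the row level, the following base R sketch reproduces the deduplication rule described in the comment above (consecutive visits to the same URL within one second) and a count of visits by day and panelist on a small made-up data set. It does not use webtrackR and may differ from how deduplicate and sum_visits are implemented. Treating "within one second" as a gap of at most one second is an assumption.

```r
# Toy visits (made up): the second row repeats the first URL 0.5 seconds later.
base_time <- as.POSIXct("2023-01-01 10:00:00", tz = "UTC")
visits <- data.frame(
  panelist_id = c("p1", "p1", "p1", "p2"),
  url = c("https://example.com/a", "https://example.com/a",
          "https://example.com/b", "https://example.com/a"),
  timestamp = base_time + c(0, 0.5, 30, 86400),
  stringsAsFactors = FALSE
)

# Order by panelist and time so that "consecutive" is well defined.
visits <- visits[order(visits$panelist_id, visits$timestamp), ]
n <- nrow(visits)

# Compare each row with the previous one; the first row is never a duplicate.
same_panelist <- c(FALSE, visits$panelist_id[-1] == visits$panelist_id[-n])
same_url <- c(FALSE, visits$url[-1] == visits$url[-n])
gap_seconds <- c(Inf, diff(as.numeric(visits$timestamp)))

# Assumption: "within one second" means a gap of at most 1 second.
is_duplicate <- same_panelist & same_url & gap_seconds <= 1
deduped <- visits[!is_duplicate, ]

# Count the remaining visits by panelist and day.
deduped$date <- as.Date(deduped$timestamp, tz = "UTC")
counts <- aggregate(url ~ panelist_id + date, data = deduped, FUN = length)
names(counts)[names(counts) == "url"] <- "n_visits"
print(counts)
```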

Analysis

The package also contains functions for the analysis of web tracking data. One example is the analysis of audience networks (Mangold & Scharkow, 2020). More functionality will be added in later versions of the package.

audience_network(wt, cutoff = 3, type = "pmi")
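
As background for the type = "pmi" argument, here is a small base R sketch of pointwise mutual information between the audiences of two domains, PMI(a, b) = log(p(a, b) / (p(a) p(b))), where p(a) is the share of panelists who visited domain a and p(a, b) the share who visited both. That "pmi" refers to pointwise mutual information is an assumption, and the sketch is not taken from the package, so audience_network may weight or filter ties differently. The toy data are made up.

```r
# Toy data (made up): which panelist visited which domain.
visits <- data.frame(
  panelist_id = c("p1", "p1", "p2", "p2", "p3", "p4"),
  domain = c("a.com", "b.com", "a.com", "b.com", "a.com", "c.com"),
  stringsAsFactors = FALSE
)

# Panelist-by-domain incidence matrix: 1 if the panelist visited the domain.
panelists <- unique(visits$panelist_id)
domains <- unique(visits$domain)
incidence <- matrix(0, nrow = length(panelists), ncol = length(domains),
                    dimnames = list(panelists, domains))
incidence[cbind(visits$panelist_id, visits$domain)] <- 1

# Share of panelists visiting each domain, and each pair of domains.
p_single <- colMeans(incidence)
p_joint <- crossprod(incidence) / nrow(incidence)

# Pointwise mutual information; pairs with no shared audience give -Inf.
pmi <- log(p_joint / outer(p_single, p_single))
print(round(pmi, 3))
```

Positive values indicate that two domains share more audience than expected if visits were independent; in a network representation one would typically keep only such pairs as ties.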