Session reconstruction in R

With trace data - from web logs, behavioural logs, really anything to do with user actions - reconstructing sessions (or ‘sessionising’) is essential. It lets an analyst divide up user actions into actual periods of sustained interaction, and from there compute a whole host of useful metrics, from session length to bounce rate.

reconstructr is a library for just that - sessionisation and metric computation - in a way that keeps all the metadata about the events you’re sessionising.

Dividing events into sessions

The nature of a “session” has provided fodder for researchers for years. Most people take an approach based on inactivity thresholds; if someone does not take an action in N seconds, their session has ended and a new one begins on their next logged action.
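
The inactivity-threshold idea can be sketched in a few lines of base R. This is only an illustration of the general technique, not reconstructr's implementation: the toy data, the column names (user, time, session) and the choice of a strict "greater than" comparison are all assumptions made for the example.

```r
# Toy event log: two hypothetical users, timestamps in seconds from an arbitrary origin.
events <- data.frame(
  user = c("a", "a", "a", "b", "b"),
  time = c(0, 100, 5000, 50, 60),
  stringsAsFactors = FALSE
)

threshold <- 1800  # N: seconds of inactivity that end a session (assumed value)

# Order by user, then by time, so each user's events are contiguous.
events <- events[order(events$user, events$time), ]

# Gap to the same user's previous event; NA for each user's first event.
gap <- ave(events$time, events$user, FUN = function(t) c(NA, diff(t)))

# A new session starts at a user's first event, or after a gap above the threshold.
new_session <- is.na(gap) | gap > threshold

# A running count of session starts gives every event a session index.
events$session <- cumsum(new_session)

print(events)
```

Here user "a" ends up with two sessions, because the 4900-second gap before their third event exceeds the threshold, while user "b" has one.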

Using reconstructr, we can conveniently divide events into sessions with the sessionise function. This takes a data.frame of events (along with specifiers of which column contains the user ID, and which column contains the timestamp) and a threshold value (in seconds). When the time between events by a user crosses that threshold, the session ends and a new one begins. We can demonstrate this using reconstructr’s inbuilt session dataset:

library(reconstructr)
data("session_dataset")

str(session_dataset)
'data.frame':   63524 obs. of  3 variables:
 $ uuid     : chr  "47dc43895814861e21a2edf93348c826" "a736822df1890011694e7049cb3abef3" "674d2d00e096a3319874a4347caa1f4a" "f62d315398e6d04a3f2fa02e8ae42d49" ...
 $ timestamp: POSIXlt, format: "2014-01-07 00:00:15" "2014-01-07 00:01:11" "2014-01-07 00:01:54" ...
 $ url      : chr  "https://www.nasa.gov/history/mercury/mercury.html" "https://www.nasa.gov/images/ksclogosmall.gif" "https://www.nasa.gov/elv/hot.gif" "https://www.nasa.gov/facts/faq04.html" ...

sessionised_data <- sessionise(x = session_dataset, timestamp = timestamp, user_id = uuid,
                               threshold = 1800)

str(sessionised_data)
'data.frame':   63524 obs. of  5 variables:
 $ uuid      : chr  "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" ...
 $ timestamp : POSIXlt, format: "2014-01-14 12:47:59" "2014-01-07 14:25:11" "2014-01-09 12:47:17" ...
 $ url       : chr  "https://www.nasa.gov/history/apollo/images/footprint-logo.gif" "https://www.nasa.gov/ksc.html" "https://www.nasa.gov/biomed/threat/gif/beachmousefinsmall.gif" "https://www.nasa.gov/shuttle/resources/orbiters/atlantis.html" ...
 $ session_id: chr  "09cd65049020ed55472a2d8b1f47787e" "9dcb2f610297b3fe2c810907fa90fb8e" "70bcde51eff332d4ac820a90930f0f6e" "70bcde51eff332d4ac820a90930f0f6e" ...
 $ time_delta: int  NA NA NA 45 4 75 274 47 NA 28 ...

Sessionisation adds two new columns: ‘session_id’, a unique per-session ID, and ‘time_delta’, the time between an event and the previous event in the same session. If the event was the first (or only) one in a session, that value will be NA.

Crucially, existing metadata (like URLs, or activity type) is carried along with the session information and not dropped.

From the sessionised data we can compute a whole host of useful metrics, many of which have convenience functions in this package.

Session metrics

An important metric in session data is the bounce rate: the proportion of sessions that included only a single event. This represents (absent data quality issues) the share of sessions where a user took only one action and then simply left.

It can be computed with bounce_rate, which takes a sessionised dataset and produces the percentage of sessions resulting in bounces. Optionally (if you provide an argument for the user_id parameter) it produces the bounce rate on a per-user basis, rather than for the dataset overall:

str(bounce_rate(sessionised_data))

num 20.7
 
str(bounce_rate(sessionised_data, user_id = uuid))

'data.frame':   10000 obs. of  2 variables:
 $ user_id    : chr  "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b2bc1a5438d8d54d4fbec139a2fd5" "001b6e80a14ba8d809c4ff18cdbade40" ...
 $ bounce_rate: num  100 14.3 0 100 100 ...
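
To make the definition concrete, the same quantity can be worked out by hand in base R. This is a sketch of the arithmetic on a small made-up table, assuming only that the data has a per-event user column and session column (named uuid and session_id here). It does not claim to mirror how bounce_rate is implemented.

```r
# Hypothetical sessionised events: one row per event, tagged with user and session.
toy <- data.frame(
  uuid       = c("u1", "u1", "u1", "u2", "u3", "u3"),
  session_id = c("s1", "s1", "s2", "s3", "s4", "s4"),
  stringsAsFactors = FALSE
)

# Number of events in each session.
events_per_session <- table(toy$session_id)

# A bounce is a session with exactly one event; the rate is the percentage of such sessions.
overall <- 100 * mean(events_per_session == 1)

# Per user: collapse to one row per session, then average the bounce flag by user.
per_session <- aggregate(list(events = toy$session_id),
                         by = list(uuid = toy$uuid, session_id = toy$session_id),
                         FUN = length)
per_user <- aggregate(list(bounce_rate = 100 * (per_session$events == 1)),
                      by = list(user_id = per_session$uuid),
                      FUN = mean)

print(overall)
print(per_user)
```

Two of the four toy sessions contain a single event, so the overall rate is 50. User u2, whose only session is a bounce, gets a rate of 100.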

time_on_page works similarly: it calculates the mean (or median) time between events, either for the dataset as a whole or, if the by_session parameter is TRUE, for each session:

str(time_on_page(sessionised_data))

num 146

str(time_on_page(sessionised_data, by_session = TRUE))

'data.frame':   22226 obs. of  2 variables:
 $ session_id  : chr  "00011b1e098848edee7e50a2174fe6ef" "0001f6457a4d09a8c2092278fec89a89" "000451f0869b7eab3582c093ace0253d" "0004c56ace95f92ee12bf9552401f923" ...
 $ time_on_page: num  NaN NaN NaN NaN NaN ...

(It’s not broken: the first few sessions happen to contain only NA time deltas, so there is nothing to average and the result is NaN.)
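
The NaN values are easy to reproduce in base R. The sketch below uses a made-up table with a time_delta column shaped like the one sessionise produces, and assumes the per-session figure is a mean with NAs dropped. That assumption is for illustration and is not taken from the package source.

```r
# Hypothetical sessionised events; time_delta is NA for the first event of a session.
toy <- data.frame(
  session_id = c("s1", "s2", "s2", "s2", "s3"),
  time_delta = c(NA, NA, 30, 90, NA),
  stringsAsFactors = FALSE
)

# Overall mean time between events, ignoring the NA first-event values.
overall_mean <- mean(toy$time_delta, na.rm = TRUE)

# Per-session mean. Single-event sessions have nothing left once NAs are dropped,
# and the mean of an empty numeric vector is NaN.
by_session <- tapply(toy$time_delta, toy$session_id, mean, na.rm = TRUE)

print(overall_mean)
print(by_session)
```

Sessions s1 and s3 each hold a single event, so their per-session value is NaN, while s2 averages its two deltas.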

Finally, session_count and session_length provide easy ways of identifying how many sessions are in a sessionised dataset (overall, or on a per-user basis) and how long those sessions are, respectively.
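
For intuition, both quantities can be derived by hand from the columns that sessionisation adds. The base R sketch below uses made-up data and assumes that a session's length is the sum of its non-NA time deltas, which may not match the definition session_length actually uses.

```r
# Hypothetical sessionised events.
toy <- data.frame(
  uuid       = c("u1", "u1", "u1", "u2"),
  session_id = c("s1", "s1", "s2", "s3"),
  time_delta = c(NA, 40, NA, NA),
  stringsAsFactors = FALSE
)

# Total number of distinct sessions in the dataset.
n_sessions <- length(unique(toy$session_id))

# Sessions per user: count the distinct session IDs within each user.
sessions_per_user <- tapply(toy$session_id, toy$uuid,
                            function(s) length(unique(s)))

# Session length under the stated assumption: the sum of the non-NA time deltas.
# Single-event sessions therefore come out as 0.
length_by_session <- tapply(toy$time_delta, toy$session_id, sum, na.rm = TRUE)

print(n_sessions)
print(sessions_per_user)
print(length_by_session)
```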

Other session functionality

If you have ideas for other functionality that would make processing sessions easier, the best approach is to either request it or add it!