---
title: "1. The sfnetwork data structure"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{1. The sfnetwork data structure}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
knitr::opts_knit$set(global.par = TRUE)
current_geos = numeric_version(sf::sf_extSoftVersion()["GEOS"])
required_geos = numeric_version("3.7.0")
geos37 = current_geos >= required_geos
```

```{r plot, echo=FALSE, results='asis'}
# plot margins
oldpar = par(no.readonly = TRUE)
par(mar = c(1, 1, 1, 1))
# crayon needs to be explicitly activated in Rmd
oldoptions = options()
options(crayon.enabled = TRUE)
# Hooks need to be set to deal with outputs
# thanks to fansi logic
old_hooks = fansi::set_knit_hooks(
  knitr::knit_hooks,
  which = c("output", "message", "error")
)
```

The core of the sfnetworks package is the sfnetwork data structure. It inherits from the tbl_graph class of the [tidygraph package](https://tidygraph.data-imaginist.com/index.html), which itself inherits from the igraph class of the [igraph package](https://igraph.org/). Therefore, sfnetwork objects are recognized by all network analysis algorithms that `igraph` offers (of which there are many, see [here](https://igraph.org/r/doc/)) as well as by the tidy wrappers that `tidygraph` has built around them. It is possible to apply any function from the [tidyverse packages](https://www.tidyverse.org/) for data science directly to a sfnetwork, as long as `tidygraph` has implemented a network-specific method for it. On top of that, `sfnetworks` adds several methods for functions from the [sf package](https://r-spatial.github.io/sf/) for spatial data science, such that you can also apply those directly to the network. This takes away the need to constantly switch between the tbl_graph, tbl_df and sf classes when working with geospatial networks.

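
The inheritance chain described above can be inspected directly. The following is a minimal sketch, assuming the Roxel dataset that ships with the package; the name `demo_net` is only used for illustration.

```{r}
library(sfnetworks)

# Build a network from the roxel dataset that ships with sfnetworks.
demo_net = as_sfnetwork(roxel)

# The class vector shows the inheritance chain described above.
class(demo_net)

# Because of that inheritance, igraph functions accept the object directly.
igraph::vcount(demo_net)
igraph::is_directed(demo_net)
```
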
```{r, message=FALSE}
library(sfnetworks)
library(sf)
library(tidygraph)
library(igraph)
library(ggplot2)
```

## Philosophy

The philosophy of a tbl_graph object is best described by the following paragraph from the [tidygraph introduction](https://www.data-imaginist.com/2017/introducing-tidygraph/): "Relational data cannot in any meaningful way be encoded as a single tidy data frame. On the other hand, both node and edge data by itself fits very well within the tidy concept as each node and edge is, in a sense, a single observation. Thus, a close approximation of tidyness for relational data is two tidy data frames, one describing the node data and one describing the edge data."

Since sfnetwork subclasses tbl_graph, it shares the same philosophy. However, it extends it into the domain of geospatial data analysis, where each observation has a location in geographical space. For that, it brings `sf` into the game. An sf object stores the geographical coordinates of each observation in a standardized format in a geometry list-column, which has a Coordinate Reference System (CRS) associated with it. Thus, in `sfnetworks`, we reformulate the last sentence of the paragraph above as follows: "A close approximation of tidyness for relational *geospatial data* is two *sf objects*, one describing the node data and one describing the edge data."

We do need to make a note here. In a geospatial network, the nodes *always* have coordinates in geographic space, and thus can always be described by an sf object. The edges, however, can also be described by only the indices of the nodes at their ends. This still makes them geospatial, because they connect two specific points in space, but the spatial information is not *explicitly* attached to them. Both representations can be useful. In road networks, for example, it makes sense to explicitly draw a line geometry between two nodes, while in geolocated social networks, it probably does not. `sfnetworks` supports both types.
It can either describe edges as an sf object, with a linestring geometry stored in a geometry list-column, or as a regular data frame, with the spatial information implicitly encoded in the node indices of the endpoints. We refer to these two different types of edges as *spatially explicit edges* and *spatially implicit edges* respectively. In most of the documentation, however, we focus on the first type, and talk about edges as being an sf object with linestring geometries.

## Construction

### From a nodes and edges table

The most basic way to construct a sfnetwork with spatially explicit edges is by providing the `sfnetwork` construction function with one sf object containing the nodes, and another sf object containing the edges. This edges table should include a *from* and *to* column referring to the node indices of the edge endpoints. By node index we mean the position of a node in the nodes table (i.e. its row number). A small toy example:

```{r}
p1 = st_point(c(7, 51))
p2 = st_point(c(7, 52))
p3 = st_point(c(8, 52))
p4 = st_point(c(8, 51.5))

l1 = st_sfc(st_linestring(c(p1, p2)))
l2 = st_sfc(st_linestring(c(p1, p4, p3)))
l3 = st_sfc(st_linestring(c(p3, p2)))

edges = st_as_sf(c(l1, l2, l3), crs = 4326)
nodes = st_as_sf(c(st_sfc(p1), st_sfc(p2), st_sfc(p3)), crs = 4326)

edges$from = c(1, 1, 3)
edges$to = c(2, 3, 2)

net = sfnetwork(nodes, edges)
net
class(net)
```

By default, the created network is a directed network. If you want to create an undirected network, set `directed = FALSE`. Note that for undirected networks, the indices in the *from* and *to* columns are re-arranged such that the *from* index is always smaller than (or equal to, for loop edges) the *to* index. However, the linestring geometries remain unchanged. That means that in undirected networks it can happen that for some edges the *from* index refers to the last point of the edge linestring, and the *to* index to the first point.
The behavior of ordering the indices comes from `igraph` and might be confusing, but remember that in undirected networks the terms *from* and *to* do not have a meaning and can thus be used interchangeably.

```{r}
net = sfnetwork(nodes, edges, directed = FALSE)
net
```

Instead of *from* and *to* columns containing integers that refer to node indices, the provided edges table can also have *from* and *to* columns containing characters that refer to node keys. In that case, you should tell the construction function which column in the nodes table contains these keys. Internally, they will then be converted to integer indices.

```{r}
nodes$name = c("city", "village", "farm")
edges$from = c("city", "city", "farm")
edges$to = c("village", "farm", "village")

edges

net = sfnetwork(nodes, edges, node_key = "name")
net
```

If your edges table does not have linestring geometries, but only references to node indices or keys, you can tell the construction function to create the linestring geometries during construction. This will draw a straight line between the endpoints of each edge.

```{r, fig.show='hold', out.width='50%'}
st_geometry(edges) = NULL

other_net = sfnetwork(nodes, edges, edges_as_lines = TRUE)

plot(net, cex = 2, lwd = 2, main = "Original geometries")
plot(other_net, cex = 2, lwd = 2, main = "Straight lines")
```

A sfnetwork should have a *valid* spatial network structure. For the nodes, this currently means that their geometries should all be of type *POINT*. In the case of spatially explicit edges, edge geometries should all be of type *LINESTRING*, nodes and edges should have the same CRS and endpoints of edges should match their corresponding node coordinates. If your provided data do not meet these requirements, the construction function will throw an error.

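
The requirements listed above can also be checked by hand with plain `sf` functions. The sketch below is only an illustration of what the requirements mean, not the package's internal implementation; the objects `toy_nodes` and `toy_edges` and all other names are hypothetical.

```{r}
library(sf)

# Hypothetical toy data: two nodes and one edge that connects them.
toy_nodes = st_as_sf(
  st_sfc(st_point(c(7, 51)), st_point(c(7, 52)), crs = 4326)
)
toy_edges = st_as_sf(
  st_sfc(st_linestring(rbind(c(7, 51), c(7.5, 51.5), c(7, 52))), crs = 4326)
)
toy_edges$from = 1
toy_edges$to = 2

# Nodes are all points, edges are all linestrings.
nodes_are_points = all(st_geometry_type(toy_nodes) == "POINT")
edges_are_lines = all(st_geometry_type(toy_edges) == "LINESTRING")

# Nodes and edges share the same CRS.
same_crs = st_crs(toy_nodes) == st_crs(toy_edges)

# The first and last coordinate of each edge equal the coordinates of
# its from and to node. st_coordinates() labels each line in column L1.
toy_edge_xy = st_coordinates(toy_edges)
toy_node_xy = st_coordinates(toy_nodes)
is_first = !duplicated(toy_edge_xy[, "L1"])
is_last = !duplicated(toy_edge_xy[, "L1"], fromLast = TRUE)
endpoints_match = all(
  toy_edge_xy[is_first, c("X", "Y"), drop = FALSE] ==
    toy_node_xy[toy_edges$from, , drop = FALSE],
  toy_edge_xy[is_last, c("X", "Y"), drop = FALSE] ==
    toy_node_xy[toy_edges$to, , drop = FALSE]
)

c(nodes_are_points, edges_are_lines, same_crs, endpoints_match)
```
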
```{r, error=TRUE}
st_geometry(edges) = st_sfc(c(l2, l3, l1), crs = 4326)

net = sfnetwork(nodes, edges)
```

You can skip the validity checks if you are already sure your input data meet the requirements, or if you don't care that they don't. To do so, set `force = TRUE`. However, remember that all functions in `sfnetworks` are designed with the assumption that the network has a valid structure.

### From an sf object with linestring geometries

Instead of providing a nodes and edges table that already have a valid network structure, it is also possible to create a network by only providing an sf object with geometries of type *LINESTRING*. This is probably the most convenient and most commonly used way of construction.

It works as follows: the provided lines form the edges of the network, and nodes are created at their endpoints. Endpoints that are shared between multiple lines become one single node.

See below an example using the Roxel dataset that comes with the package. This dataset is an sf object with *LINESTRING* geometries that form the road network of Roxel, a neighborhood in the German city of Münster.

```{r, fig.height=5, fig.width=5}
roxel

net = as_sfnetwork(roxel)
plot(net)
```

Other methods to convert 'foreign' objects into a sfnetwork exist as well, e.g. for SpatialLinesNetwork objects from `stplanr` and linnet objects from `spatstat`. See [here](https://luukvdmeer.github.io/sfnetworks/reference/as_sfnetwork.html) for an overview.

## Activation

A sfnetwork is a multitable object in which the core network elements (i.e. nodes and edges) are embedded as sf objects. However, thanks to the neat structure of `tidygraph`, there is no need to first extract one of those elements before you are able to apply your favorite sf function or tidyverse verb. Instead, there is always one element at a time labeled as *active*. This active element is the target of data manipulation.
All functions from sf and the tidyverse that are called on a sfnetwork are internally applied to that active element. The active element can be changed with the `activate()` verb, i.e. by calling `activate("nodes")` or `activate("edges")`. For example, setting the geographical length of edges as edge weights and subsequently calculating the betweenness centrality of nodes can be done as shown below. Note that `tidygraph::centrality_betweenness()` requires you to *always* explicitly specify which column should be used as edge weights, and whether the network should be treated as directed or not.

```{r}
net %>%
  activate("edges") %>%
  mutate(weight = edge_length()) %>%
  activate("nodes") %>%
  mutate(bc = centrality_betweenness(weights = weight, directed = FALSE))
```

Some functions also have effects outside of the active element. For example, whenever nodes are removed from the network, the edges terminating at those nodes will be removed too. This behavior is *not* symmetric: when removing edges, the endpoints of those edges remain, even if they are not an endpoint of any other edge. This is because by definition edges can never exist without nodes on their ends, while nodes can peacefully exist in isolation.

## Extraction

Not all sf functions and tidyverse verbs can be directly applied to a sfnetwork as described above. That is because there is a clear limitation in the relational data structure that requires rows to maintain their identity. Hence, a verb like `dplyr::summarise()` has no clear application for a network. For sf functions, this also means that the valid spatial network structure should be maintained. That is, functions that summarise geometries of an sf object, or (may) change their *type*, *shape* or *position*, are not supported directly. These are for example most of the [geometric unary operations](https://r-spatial.github.io/sf/reference/geos_unary.html).

These functions cannot be directly applied to a sfnetwork, but no need to panic!
The active element of the network can at any time be extracted with `sf::st_as_sf()` (or `tibble::as_tibble()`). This allows you to continue a specific part of your analysis *outside* of the network structure, using a regular sf object. Afterwards you could join inferred information back into the network. See the vignette about [spatial joins](https://luukvdmeer.github.io/sfnetworks/articles/sfn03_join_filter.html) for more details.

```{r}
net %>%
  activate("nodes") %>%
  st_as_sf()
```

Although we recommend, for reasons of clarity, to always explicitly activate an element before extraction, you can also use a shortcut by providing the name of the element you want to extract as an extra argument to `sf::st_as_sf()`:

```{r}
st_as_sf(net, "edges")
```

## Visualization

The `sfnetworks` package does not (yet?) include advanced visualization options. However, as already demonstrated before, a simple plot method is provided, which gives a quick view of what the network looks like.

```{r, fig.width=5, fig.height=5}
plot(net)
```

If you have `ggplot2` installed, you can also use `ggplot2::autoplot()` to directly create a simple ggplot of the network.

```{r, message=FALSE, fig.width=5, fig.height=5}
autoplot(net) + ggtitle("Road network of Münster Roxel")
```

For advanced visualization, we encourage you to extract nodes and edges as `sf` objects, and use one of the many ways to map those in R, either statically or interactively. Think of sf's default plot method, `ggplot2::geom_sf()`, `tmap`, `mapview`, et cetera.

```{r, fig.height=5, fig.width=5}
net = net %>%
  activate("nodes") %>%
  mutate(bc = centrality_betweenness())

ggplot() +
  geom_sf(data = st_as_sf(net, "edges"), col = "grey50") +
  geom_sf(data = st_as_sf(net, "nodes"), aes(col = bc, size = bc)) +
  ggtitle("Betweenness centrality in Münster Roxel")
```

*Note: it would be great to see this change in the future, for example by good integration with `ggraph`.
Contributions are very welcome regarding this!*

## Spatial information

### Geometries

Geometries of nodes and edges are stored in an 'sf-style' geometry list-column in the nodes and edges tables of the network, respectively. The geometries of the active element of the network can be extracted with the sf function `sf::st_geometry()`, or from any element by specifying the element of interest as an additional argument, e.g. `sf::st_geometry(net, "edges")`.

```{r}
net %>%
  activate("nodes") %>%
  st_geometry()
```

Geometries can be replaced using either `st_geometry(x) = value` or the pipe-friendly `st_set_geometry(x, value)`. However, a replacement that breaks the valid spatial network structure will throw an error.

Replacing a geometry with `NULL` will remove the geometries. Removing edge geometries will result in a sfnetwork with spatially implicit edges. Removing node geometries will result in a tbl_graph, losing the spatial structure.

```{r, fig.show = 'hold', out.width = "50%"}
net %>%
  activate("edges") %>%
  st_set_geometry(NULL) %>%
  plot(draw_lines = FALSE, main = "Edges without geometries")

net %>%
  activate("nodes") %>%
  st_set_geometry(NULL) %>%
  plot(vertex.color = "black", main = "Nodes without geometries")
```

Geometries can also be replaced by using [geometric unary operations](https://r-spatial.github.io/sf/reference/geos_unary.html), as long as they don't break the valid spatial network structure. In practice this means that only `sf::st_reverse()` and `sf::st_simplify()` are supported. When calling `sf::st_reverse()` on the edges of a directed network, not only the geometries will be reversed, but the *from* and *to* columns of the edges will be swapped as well. In the case of undirected networks these columns remain unchanged, since the terms *from* and *to* don't have a meaning in undirected networks and can be used interchangeably. Note that reversing linestrings using `sf::st_reverse()` only works when sf links to a GEOS version of at least 3.7.0.

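
Whether the linked GEOS version meets that requirement can be checked before calling `sf::st_reverse()`. A minimal sketch, using `sf::sf_extSoftVersion()`; the variable name `linked_geos` is only used for illustration.

```{r}
# Version of the GEOS library that sf is linked against.
linked_geos = numeric_version(sf::sf_extSoftVersion()["GEOS"])

# TRUE when the linked GEOS version is at least 3.7.0.
linked_geos >= numeric_version("3.7.0")
```
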
```{r, eval = geos37}
as_sfnetwork(roxel, directed = TRUE) %>%
  activate("edges") %>%
  st_reverse()
```

### Coordinates

The coordinates of the active element of a sfnetwork can be extracted with the sf function `sf::st_coordinates()`, or from any element by specifying the element of interest as an additional argument, e.g. `sf::st_coordinates(net, "edges")`.

```{r}
node_coords = net %>%
  activate("nodes") %>%
  st_coordinates()

node_coords[1:4, ]
```

Besides X and Y coordinates, the features in the network can possibly also have Z and M coordinates.

```{r}
# Currently there are neither Z nor M coordinates.
st_z_range(net)
st_m_range(net)

# Add Z coordinates with value 0 to all features.
# This will affect both nodes and edges, no matter which element is active.
st_zm(net, drop = FALSE, what = "Z")
```

[Coordinate query functions](https://luukvdmeer.github.io/sfnetworks/reference/node_coordinates.html) can be used for the nodes to extract only specific coordinate values. Such query functions are meant to be used inside `dplyr::mutate()` or `dplyr::filter()` verbs. Whenever a coordinate value is not available for a node, `NA` is returned along with a warning. Note also that the two-digit coordinate values are only for printing. The real values contain just as much precision as in the geometry list-column.

```{r}
net %>%
  st_zm(drop = FALSE, what = "Z") %>%
  mutate(X = node_X(), Y = node_Y(), Z = node_Z(), M = node_M())
```

### Coordinate Reference System

The Coordinate Reference System in which the coordinates of the network geometries are stored can be extracted with the sf function `sf::st_crs()`. The CRS in a valid spatial network structure is *always* the same for nodes and edges.

```{r}
st_crs(net)
```

The CRS can be set using either `st_crs(x) = value` or the pipe-friendly `st_set_crs(x, value)`. The CRS will always be set for both the nodes and edges, no matter which element is active. However, setting the CRS only assigns the given CRS to the network.
It does *not* transform the coordinates into a different CRS! Coordinates can be transformed using the sf function `sf::st_transform()`. Since the CRS is the same for nodes and edges, transforming coordinates of the active element into a different CRS will automatically also transform the coordinates of the inactive element into the same target CRS.

```{r}
st_transform(net, 3035)
```

### Precision

The precision in which the coordinates of the network geometries are stored can be extracted with the sf function `sf::st_precision()`. The precision in a valid spatial network structure is *always* the same for nodes and edges.

```{r}
st_precision(net)
```

Precision can be set using `st_set_precision(x, value)`. The precision will always be set for both the nodes and edges, no matter which element is active.

```{r}
net %>%
  st_set_precision(1) %>%
  st_precision()
```

### Bounding box

The bounding box of the active element of a sfnetwork can be extracted with the sf function `sf::st_bbox()`, or from any element by specifying the element of interest as an additional argument, e.g. `sf::st_bbox(net, "edges")`.

```{r}
net %>%
  activate("nodes") %>%
  st_bbox()
```

The bounding boxes of the nodes and edges are not necessarily the same. Therefore, sfnetworks adds the `st_network_bbox()` function to retrieve the combined bounding box of the nodes and edges. In this combined bounding box, the most extreme coordinates of the two individual element bounding boxes are preserved. Hence, the `xmin` value of the network bounding box is the smallest `xmin` value of the node and edge bounding boxes, et cetera.

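
The combination rule described above can be reproduced by hand. The sketch below, assuming the Roxel dataset that ships with the package, takes the smallest minima and largest maxima of the two element bounding boxes and prints the result next to the output of `st_network_bbox()`. All object names are only used for illustration.

```{r}
library(sfnetworks)
library(sf)

demo_net = as_sfnetwork(roxel)

# Bounding boxes of the individual elements, via the extracted sf objects.
demo_node_bbox = st_bbox(st_as_sf(demo_net, "nodes"))
demo_edge_bbox = st_bbox(st_as_sf(demo_net, "edges"))

# Combine them by hand: smallest minima, largest maxima.
manual_bbox = c(
  xmin = min(demo_node_bbox[["xmin"]], demo_edge_bbox[["xmin"]]),
  ymin = min(demo_node_bbox[["ymin"]], demo_edge_bbox[["ymin"]]),
  xmax = max(demo_node_bbox[["xmax"]], demo_edge_bbox[["xmax"]]),
  ymax = max(demo_node_bbox[["ymax"]], demo_edge_bbox[["ymax"]])
)

manual_bbox
st_network_bbox(demo_net)
```
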
```{r, fig.show='hold', out.width = "50%"}
node1 = st_point(c(8, 51))
node2 = st_point(c(7, 51.5))
node3 = st_point(c(8, 52))
node4 = st_point(c(9, 51))
edge1 = st_sfc(st_linestring(c(node1, node2, node3)))

nodes = st_as_sf(c(st_sfc(node1), st_sfc(node3), st_sfc(node4)))
edges = st_as_sf(edge1)
edges$from = 1
edges$to = 2

small_net = sfnetwork(nodes, edges)

node_bbox = st_as_sfc(st_bbox(activate(small_net, "nodes")))
edge_bbox = st_as_sfc(st_bbox(activate(small_net, "edges")))
net_bbox = st_as_sfc(st_network_bbox(small_net))

plot(small_net, lwd = 2, cex = 4, main = "Element bounding boxes")
plot(node_bbox, border = "red", lty = 2, lwd = 4, add = TRUE)
plot(edge_bbox, border = "blue", lty = 2, lwd = 4, add = TRUE)

plot(small_net, lwd = 2, cex = 4, main = "Network bounding box")
plot(net_bbox, border = "red", lty = 2, lwd = 4, add = TRUE)
```

### Attribute-geometry relationships

In sf objects there is the possibility to store information about how attributes relate to geometries (for more information, see [here](https://r-spatial.github.io/sf/articles/sf1.html#how-attributes-relate-to-geometries)). You can get and set this information with the function `sf::st_agr()` (for the setter, you can also use the pipe-friendly version `sf::st_set_agr()`). In a sfnetwork, you can use the same functions to get and set this information for the active element of the network.

Note that the *to* and *from* columns are not really attributes of edges seen from a network analysis perspective, but they are included in the agr factor to ensure smooth interaction with `sf`.

```{r}
net %>%
  activate("edges") %>%
  st_set_agr(c("name" = "constant", "type" = "constant")) %>%
  st_agr()
```

However, be careful, because we are currently not sure if this information survives all functions from `igraph` and `tidygraph`. If you have any issues with this, please let us know in our [issue tracker](https://github.com/luukvdmeer/sfnetworks/issues).

```{r, include = FALSE}
par(oldpar)
options(oldoptions)
```