--- title: "IntervalSurgeon" author: "Daniel Greene" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Usage} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r echo=FALSE} knitr::opts_chunk$set(fig.width=7, fig.height=5, fig.align="center") ``` ## Introduction `IntervalSurgeon` presents functions for intersecting, overlapping, piling and annotating integer-bounded intervals. Underlying algorithms are written in C++ for efficiency where appropriate (with the help of the `Rcpp` package). A typical use case would be for manipulating genomic intervals. For the purposes of this package, intervals are represented as two-column integer matrices where the inclusive start points are in the first column and the exclusive end points are in the second column. ```{r} library(IntervalSurgeon) starts <- 3*1:10 ends <- 5*1:10 intervals <- cbind(starts, ends) ``` The lengths of the intervals are therefore: `intervals[,2]-intervals[,1]`. A key concept in `IntervalSurgeon` is breaking ranges which contain intervals into 'sections' delimited by the unique start/end points in the set. The sections for a set of intervals `x` is therefore a two-column `matrix` representing a set of intervals which partitions the range of `x` by the sorted start and end points. The sorted start and end points can be obtained using the `breaks` function (so named to reflect start/end points of intervals frequently being referred to as 'breakpoints' in genomics), which is equivalent to: `sort(unique(as.integer(intervals)))`). The sections can be computed from the sorted start and end points using the `sections` function. One can use the `depth` and `pile` functions respectively to obtain the depth of intervals over their 'sections' (within which the depth is constant), and the row indices of original intervals which cover each section. The `flatten` function returns a non-touching, non-overlapping set of intervals (as a `matrix`) which overlap at least one of a given set. ```{r} sectioned <- sections(breaks(intervals)) flattened <- flatten(intervals) depths <- depth(intervals) piles <- pile(intervals) ``` ```{r fig.width=7, fig.height=3, echo=FALSE} plot_intervals <- function(ints, y=seq(from=0, to=1, length.out=nrow(ints)), whisker_size=0.1, new_plot=TRUE, xlim=NULL, ...) { if (new_plot) plot(x=NULL, xlim=if (!is.null(xlim)) xlim else range(ints), ylim=range(y)) segments(x0=ints[,1], x1=ints[,2], y0=y, y1=y, ...) for (i in 1:2) segments(y0=y-whisker_size/2, y1=y+whisker_size/2, x0=ints[,i], x1=ints[,i], ...) } oldpar <- par(mar=rep(0, 4)) plot(x=NULL, xlab="", ylab="", axes=FALSE, xlim=range(intervals), ylim=c(0, 6)) cent <- mean(range(intervals)) text(x=cent, pos=3, offset=1, labels="intervals", y=5) with(list(y=4.75+0.5*seq(from=1, to=-1, length.out=nrow(intervals))), plot_intervals(new_plot=FALSE, whisker_size=0, lwd=2, col="black", ints=intervals, y=y)) text(x=cent, pos=3, offset=1, labels="sectioned", y=3.25) text(x=cent, pos=3, offset=1, labels="flattened", y=2) text(x=cent, pos=3, offset=1, labels="depths", y=1) rect(xleft=sectioned[,1], xright=sectioned[,2], ytop=depths/max(depths), ybottom=0, col="light grey") plot_intervals(new_plot=FALSE, lwd=2, col="blue", ints=sectioned, y=3.25) plot_intervals(new_plot=FALSE, lwd=2, col="red", ints=flattened, y=2) par(oldpar) ``` ## Detached intervals `IntervalSurgeon` provides functions which are optimised for dealing with detached (i.e. non-overlapping and non-touching) intervals which are sorted and non-empty. Here, we generate two such sets of intervals: ```{r} x_starts <- 10*1:10 x <- cbind(x_starts, x_starts + 5) y_starts <- 20*1:5 + 1 y <- cbind(y_starts, y_starts + 7) ``` We can determine that they are indeed detached, sorted and non-empty using the `detached_sorted_nonempty` function. ```{r} detached_sorted_nonempty(x) ``` The `IntervalSurgeon` functions for operating on such sets of detached, sorted, non-empty intervals are analogues of the set operation functions in the `base` package, namely: `intersect`, `union` and `setdiff`. Here, the function names are in plural (i.e. with extra s's on the end). ```{r fig.width=7, fig.height=4, echo=FALSE} all_ints <- c( list(x=x, y=y), lapply( lapply(FUN=match.fun, X=setNames(nm=c("intersects", "unions", "setdiffs"))), function(f) f(x,y,check=TRUE) ) ) pts <- breaks(c(x, y)) oldpar <- par(mar=rep(0, 4)) plot(axes=FALSE, x=NULL, xlab="", ylab="", xlim=range(c(x, y)), ylim=c(0.5, length(all_ints)+0.5)) #axis(side=1) abline(v=pts, col="grey") for (i in seq_along(all_ints)) { ypos <- length(all_ints) - i + 1 plot_intervals(new_plot=FALSE, ints=all_ints[[i]], y=ypos, lwd=2, col=rainbow(length(all_ints))[i]) text(x=mean(range(c(x, y))), pos=3, offset=1, labels=names(all_ints)[i], y=ypos) } par(oldpar) ``` For a given set of sorted, non-overlapping intervals, some of the start points might be the same as the end points of adjacent intervals. These intervals are therefore 'touching' and can be 'stitched' together using the `stitch` function. If there were overlaps, the `flatten` function can be used to generate a set of sorted disjoint intervals. Note that `flatten` will also stitch touching intervals, although the `stitch` function is faster (albeit requiring intervals to be sorted). ## Finding overlaps between sets of intervals Information about overlaps between sets of intervals can be obtained by 'joining' the sets. This is analogous to an SQL inner-join of two tables, and can be performed on sets of intervals using the `join` function. This function returns a matrix containing all overlaps of intervals from each set. Each row in the returned matrix corresponds to a specific overlap of intervals with one from each of the sets passed to the function. The `n`th element in a row contains the row index of the interval in the `n`th set of intervals passed to the function. Depending on the value of the `output` argument, there may two additional columns giving the start and end coordinates of the overlap (the default: `output="intervals"`, no extra columns (`output="indices"`) or one additional column giving the row index of the 'section' of the complete set of intervals (`output="sections"`, see `?sections`). ```{r} x <- cbind(3*1:8, 5*1:8) y <- cbind(4*1:4, 7*1:4) join_xy <- join(x, y) head(join_xy) ``` One common task would be to tag intervals with overlapping intervals, perhaps from a different set. For example, this might be useful for tagging a set of genomic intervals with the names of genes which they overlap. The `annotate` function is provided for this exact purpose. ```{r} x <- cbind(0, c(a=10, b=20, c=30)) y <- cbind(c(A=0, B=10, C=20), c(5, 15, 25)) ``` ```{r fig.width=7, fig.height=3.5, echo=FALSE} brks <- breaks(rbind(x, y)) lx <- nrow(x); ly <- nrow(y) oldpar <- par(mar=rep(0, 4)) plot(x=NULL, xlab="", ylab="", axes=FALSE, xlim=range(c(x, y)), ylim=c(0, lx+ly)) abline(v=brks, col="grey") ys <- lx+ly-seq(from=0.5, length.out=lx+ly) plot_intervals(new_plot=FALSE, lwd=2, col=rep(rainbow(2), times=c(lx, ly)), y=ys, ints=rbind(x, y)) text(labels=c(rownames(x), rownames(y)), x=(c(x[,1], y[,1])+c(x[,2], y[,2]))/2, y=ys, pos=3) par(oldpar) ``` ```{r} annotate(x, y) ``` ## Use with genomic intervals Genomic intervals are often represented in R as `data.frame`s with columns corresponding to chromosome name, start position and end position. Obviously intervals do not intersect if they are on different chromosomes, so in order to manipulate such intervals with `IntervalSurgeon`, genome-wide operations must be performed chromosome-at-a-time. Using `split` to create a list of chromosome specific `data.frames`, or looping over the names of chromosomes and subsetting the original table, before `cbind`ing/`as.matrix`ing the start and end columns then makes the data accessible to the `IntervalSurgeon` functions. ```{r echo=FALSE} regions <- data.frame(chr=c("X", 2), start=1:2, end=100:101) genes <- data.frame(chr=c(1, "X", 2, 2), start=1:4, end=2:5) ``` ```{r} regions_annotated_with_genes <- lapply( c(1:22, "X", "Y"), function(chromosome) { regions_on_chr <- as.matrix(regions[regions$chr == chromosome,c("start", "end")]) genes_on_chr <- as.matrix(genes[genes$chr == chromosome,c("start","end")]) annotate(regions_on_chr, genes_on_chr) } ) ```