--- title: "Introduction to disto" subtitle: "Matrix like abstraction for distance matrices" author: | | Srikanth KS 1^[Email: sri.teach@gmail.com] date: "`r Sys.Date()`" output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Introduction_to_disto} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE , comment = "#>" ) ``` ## Introduction > **disto** is a R package that provides a high level API to interface over backends storing distance, dissimilarity, similarity matrices with matrix style extraction, replacement and other utilities. Currently, in-memory dist object backend is supported. ## Why disto? **R** provides "dist" class for storing distance objects. Under the hood, it is a numeric vector storing lower triangular matrix (diagonal excluded) in column order along with a few attributes. There are methods to subset (`[[`), print and coerce them from and to matrices using `as.dist` and `as.matrix` respectively. In general, - Some operations would require coercing the dist object to a matrix. - dist object lacks matrix like extraction ie `d[1:5, ]` would not work. - Neither would the assignment work: `d[1, 2] <- 3` **disto** was conceived to address these issues while keeping dist object as the back-end with the philosophy of minimal copies. This evolved into high-level API for dealing with generic distance objects irrespective of whether the object is in memory, disk or a database. Currently, the bindings are provided for in-memory objects of class 'dist'. ## Examples ### Creating disto and exploration ```{r} library("disto") # create a dist object do <- dist(mtcars) # create a disto connection (does not nake a copy of do) dio <- disto(objectname = "do") # what's dio dio # what does it actually contain unclass(dio) # summary of the distance object underneath summary(dio) # what is the size? size(dio) # what are the names? names(dio) # convert to a dataframe # caveat: costly for large distance matrices head(as.data.frame(dio)) # quick plots plot(dio, type = "dendrogram") plot(dio, type = "heatmap") ``` ### Extract and Replace #### Extract The idea is to provide an interface so that user does not worry about the storage and interacts with a *matrix*-like distance object without coercing as a matrix. Matrix coercion can be costly memory-wise when the dist object is large. ```{r} # what is the distance between 1st and 2nd element # note that this returns a matrix dio[1, 2] # this should be same as above, except the matrix is transposed dio[2, 1] # extract using names/labels dio["Mazda RX4 Wag", "Mazda RX4"] # for a single value extraction, `[[` is efficient as it does less work dio[[3, 4]] # dio[["Mazda RX4 Wag", "Mazda RX4"]] wont work, only integer index is supported in `[[` # neither would dio[[c(1, 2), 3]] # extract dio[1:5, 9:12] # extract mixed dio[1:5, c("Merc 240D", "Merc 230")] # exclude i or j dim(dio[1:2, ]) dim(dio[, 1:2]) dim(dio[,]) # All examples worked in outer product way # Specify product type as inner to extract diagonals only dio[1:5, 9:12, product = "inner"] # use lower triangular indexing dio[k = 1] # same as dio[1, 2] dio[k = 1:5] ``` #### Replace ```{r} # replace a value dio[1, 2] <- 100 # did it replace? dio[1, 2] # did it really replace at source do[1] # yes, it did # replacement is vectorized in inner product sense dio[1:5, 2:6] <- 7:11 dio[1:5, 2:6, product = "inner"] ``` ### 'apply' like function The flow of `as.matrix(do) %>% apply(1, somefunction)` is convenient. `dapply` provides the same without coercion to a matrix. This slower than the above flow but consumes much less memory. `dapply` is parallelized on UNIX-based systems. ```{r} # lets find indexes of five nearest neighbors for each observation/item # function to pick indexes of 5 nearest neighbors # an efficient alternative (with Rcpp) might be better udf <- function(x) order(x)[2:6] hi <- dapply(dio, 1, udf) dim(hi) hi[1:5, 1:5] ``` ### Extract and replace functions for 'dist' objects The workhorse functions for the dist class are `dist_extract` and `dist_replace`. ```{r} dist_extract(do, 1:5, 2:7) do <- dist_replace(do, 1:3, 4:6, 101:103) dist_extract(do, 1:3, 4:6, product = "inner") ``` ---- Author: Srikanth KS, sri.teach@gmail.com URL: https://github.com/talegari/disto BugReports: https://github.com/talegari/disto/issues