This is a patch release to fix some M1 Mac test failures as part of CRAN checks. No changes to non-test code.
"cosine",
"jaccard" or "hellinger" metrics could give
incorrect results if maximally-dissimilar items were in the nearest neighbors. Thank you to Maciej Beręsewicz for the report (https://github.com/jlmelville/rnndescent/issues/14).
New parameter for nnd_knn and rnnd_build: weight_by_degree. If set to TRUE, then the candidate list in nearest neighbor descent is weighted in favor of low-degree items, which should make for a more diverse local join. There is a minor increase in computation but also a minor increase in accuracy.
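For example, a minimal sketch of using weight_by_degree (the dataset and k are arbitrary choices):
library(rnndescent)
# bias the candidate list towards low-degree items during the local join
knn <- nnd_knn(iris[, -5], k = 15, weight_by_degree = TRUE)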
rnnd_build generated an error when preparing the search graph for some metrics (notably cosine and jaccard).
New parameter for prepare_search_graph: use_alt_metric. This behaves like the existing use_alt_metric parameters in other functions and may speed up the index preparation step in e.g. rnnd_build.
New parameter for rnnd_build and prepare_search_graph: prune_reverse. If set to TRUE, the reverse graph will be pruned using the pruning_degree_multiplier parameter before any diversification. This can help prevent an excessive amount of time being spent in the diversification step in the case where an item has a large number of neighbors (in the reverse graph this can be as large as the number of items in the dataset). Pruning of the merged graph still occurs, so this is an additional pruning step. This should have little effect on search results, but for backwards compatibility, the default is FALSE.
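A minimal sketch combining the two new prepare_search_graph parameters (the dataset and k are arbitrary choices):
library(rnndescent)
knn <- nnd_knn(iris[, -5], k = 15)
# prune the reverse graph before diversification and allow a faster
# alternative metric to be used internally during preparation
sg <- prepare_search_graph(iris[, -5], knn, prune_reverse = TRUE, use_alt_metric = TRUE)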
CRAN resubmission to fix lingering UBSAN errors.
Initial CRAN submission.
rnnd_build now always prepares the search graph.
The rnnd_prepare function has been removed. The option to not prepare the search graph during index building only made sense if you were only interested in the k-nearest neighbor graph. Now that rnnd_knn exists for that purpose (see below), the logic of index building has been substantially simplified.
The nn_to_sparse function has been removed.
The merge_knn function has been removed, and merge_knnl has been renamed to merge_knn. If you were running e.g. merge_knn(nn1, nn2), you must now use merge_knn(list(nn1, nn2)). Also, the parameter nn_graphs has been renamed graphs.
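For example, a minimal sketch of merging two runs under the new interface (the dataset and k are arbitrary):
library(rnndescent)
nn1 <- nnd_knn(iris[, -5], k = 15)
nn2 <- nnd_knn(iris[, -5], k = 15)
# the renamed parameter takes a list of graphs
merged <- merge_knn(graphs = list(nn1, nn2))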
New function: rnnd_knn. Behaves a lot like rnnd_build, but only returns the knn graph with no index built. The index can be very large for high-dimensional or large datasets, so this function is useful if you only care about the knn graph and won't ever want to query new data.
New function: neighbor_overlap. Measures the overlap of two knn graphs via their shared indices (see the sketch below). A similar function was used extensively in some vignettes, so it may have sufficient utility to be useful to others.
New parameter for rnnd_query and graph_knn_query: max_search_fraction. This parameter controls the maximum number of nodes that can be searched during querying. If the number of nodes searched exceeds this fraction of the total number of nodes in the graph, the search will be terminated. This can be used in combination with epsilon to avoid excessive search times.
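A rough sketch tying these together (the data split, k and thresholds are arbitrary, and it is assumed that neighbor_overlap accepts the graph lists returned by these functions):
library(rnndescent)
ref <- iris[1:100, -5]
query <- iris[101:150, -5]
index <- rnnd_build(ref, k = 15)
# stop searching once 10% of the nodes have been visited
res <- rnnd_query(index, query, k = 15, epsilon = 0.1, max_search_fraction = 0.1)
# knn graph only, no index kept around
knn1 <- rnnd_knn(ref, k = 15)
knn2 <- rnnd_knn(ref, k = 15)
overlap <- neighbor_overlap(knn1, knn2)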
The spearmanr distance has been fixed.
With n_threads = 0, progress/interrupt monitoring was not occurring.
init parameter of rnnd_query.
rnnd_query: if verbose = TRUE, a summary of the min, max and average number of distance queries will be logged. This can help tune epsilon and max_search_fraction.
The local_scale_nn function has been removed, for similar reasons to the removal of the standalone distance functions. It remains in the localscale branch of the github repo.
The search graph returned by prepare_search_graph is now transposed. This prevents having to repeatedly transpose inside every call to graph_knn_query if multiple queries are being made. You will need to either regenerate any saved search graphs or transpose them with Matrix::t(search_graph).
New functions: rnnd_build, rnnd_query and rnnd_prepare. These functions streamline the process of building a k-nearest neighbor graph, preparing a search graph and querying it.
The bhamming metric no longer exists as a specialized metric. Instead, if you pass a logical matrix to the data, reference or query parameter (depending on the function) and specify metric = "hamming", you will automatically get the binary-specific version of the hamming metric.
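For example, a sketch of the logical-matrix route (the random binary data is arbitrary):
library(rnndescent)
set.seed(42)
# a logical matrix with metric = "hamming" selects the binary-specific code path
X <- matrix(runif(5000) > 0.5, nrow = 100)
knn <- brute_force_knn(X, k = 15, metric = "hamming")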
The hamming and bhamming metrics are now normalized with respect to the number of features, to be consistent with the other binary-style metrics (and PyNNDescent). If you need the old distances, multiply the distance matrix by the number of columns, e.g. do something like:
res <- brute_force_knn(X, k = 15, metric = "hamming")  # k = 15 is an arbitrary choice
res$dist <- res$dist * ncol(X)
The metric l2sqr has been renamed sqeuclidean to be consistent with PyNNDescent.
The metric parameter now accepts a much larger number of metrics. See the rdoc for the full list of supported metrics. Currently, most of the metrics from PyNNDescent which don't require extra parameters are supported. The number of specialized binary metrics has also been expanded.
New parameter for rpf_knn and rpf_build: max_tree_depth. This controls the depth of the tree and was previously set to 100 internally. This default has been doubled to 200 and can now be user-controlled. If verbose = TRUE and the largest leaf in the forest exceeds the leaf_size parameter, a message warning you about this will be logged, indicating that the maximum tree depth has been exceeded. Increasing max_tree_depth may not be the answer: it's more likely there is something unusual about the distribution of the distances in your dataset, and a random initialization might be a better use of your time.
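A minimal sketch of passing the new parameter (the dataset, k, leaf size and depth are arbitrary choices):
library(rnndescent)
# allow deeper trees than the default; watch the log for leaf-size warnings
knn <- rpf_knn(iris[, -5], k = 15, leaf_size = 10, max_tree_depth = 400, verbose = TRUE)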
You can now pass a sparse matrix (a dgCMatrix) to the data, reference or query parameters where you would usually use a dense matrix or data frame. cosine, euclidean, manhattan, hamming and correlation are all available, but the alternative versions available in the dense case, e.g. cosine-preprocess or the binary-specific bhamming for dense data, are not.
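A sketch of the sparse input support (the random sparse matrix is arbitrary):
library(rnndescent)
library(Matrix)
set.seed(42)
# a dgCMatrix is accepted where a dense matrix or data frame would normally go
X <- rsparsematrix(nrow = 100, ncol = 50, density = 0.2)
knn <- nnd_knn(X, k = 15, metric = "euclidean")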
New init option for graph_knn_query: you can now pass an RP forest and initialize with that, e.g. from rpf_build, or by setting ret_forest = TRUE on nnd_knn or rpf_knn (see the sketch below). You may want to cut down the size of the forest used for initialization with rpf_filter first, though (a single tree may be enough). This will also use the metric data in the forest, so setting metric (or use_alt_metric) in the function itself will be ignored.
If the data passed to prepare_search_graph or to graph_knn_query contains missing data, this will no longer cause an error (it still might not be the best idea though).
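A sketch of the forest-based initialization described above (assuming the usual query/reference/reference-graph argument order and that ret_forest = TRUE returns the forest in a forest element):
library(rnndescent)
ref <- iris[1:100, -5]
query <- iris[101:150, -5]
# keep the forest used during the build so it can seed the query search
ref_knn <- nnd_knn(ref, k = 15, init = "tree", ret_forest = TRUE)
# the forest carries its own metric, so metric/use_alt_metric are ignored here
res <- graph_knn_query(query, ref, ref_knn, k = 15, init = ref_knn$forest)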
New function: rpf_knn. Calculates the approximate k-nearest neighbors using a random partition forest.
New function: rpf_build. Builds a random partition forest.
New function: rpf_knn_query. Queries a random partition forest (built with rpf_build) to find the approximate nearest neighbors for the query points (see the sketch below).
New function: rpf_filter. Retains only the best "scoring" trees in a forest, where each tree is scored based on how well it reproduces a given knn.
New option for nnd_knn: init = "tree". Uses the RP forest initialization method.
New parameter for nnd_knn: ret_forest. Returns the search forest used if init = "tree" so it can be used for future searching or filtering.
New parameter for nnd_knn: init_opts. Options that can be passed to the RP forest initialization (same as in rpf_knn).
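A sketch of the rpf_build / rpf_knn_query pair introduced above (the data split and k are arbitrary, and the query/reference/forest argument order is assumed):
library(rnndescent)
ref <- iris[1:100, -5]
query <- iris[101:150, -5]
forest <- rpf_build(ref)
# approximate neighbors of the new points, found via the forest
res <- rpf_knn_query(query, ref, forest, k = 15)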
nnd_knn with n_threads > 0 was reporting double the actual number of iterations. This made the progress bar way too optimistic.
The metric values "cosine" and "correlation" have been renamed "cosine-preprocess" and "correlation-preprocess" respectively. This reflects that they do some preprocessing of the data up front to make subsequent distance calculations faster. I have endeavored to avoid unnecessary allocations or copying in this preprocessing, but there is still a chance of more memory usage.
The cosine and correlation metrics are still available as options, but now use an implementation that doesn't do any preprocessing. The preprocessing and non-preprocessing versions should give the same numerical results, give or take some minor numerical differences, but when the distance should be zero, the preprocessing versions may give values which are slightly different from zero (e.g. 1e-7).
New functions: correlation_distance, cosine_distance, euclidean_distance, hamming_distance, l2sqr_distance and manhattan_distance, for calculating the distance between two vectors. These may be useful for more arbitrary distance calculations than the nearest neighbor routines here, although they won't be as efficient (they do call the same C++ code, though). The cosine and correlation calculations here use the non-preprocessing implementations.
The hamming metric has been changed to a standard definition. The old implementation of the hamming metric, which worked on binary data only, has been renamed to bhamming (contributed by Vitalie Spinu).
A new parameter obs has been added to most functions: set obs = "C" and you can pass the input data in column-oriented format.
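A minimal sketch of the column-oriented input (assuming brute_force_knn is among the functions accepting obs):
library(rnndescent)
# the same observations, stored one per column rather than one per row
Xc <- t(as.matrix(iris[, -5]))
knn <- brute_force_knn(Xc, k = 15, obs = "C")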
The random_knn function used to always return each item as its own neighbor, so that only n_nbrs - 1 of the returned neighbors were actually selected at random. Even I forgot it did that and it doesn't make a lot of sense, so now you really do just get back n_nbrs random selections.
A change to the init parameter of nnd_knn or graph_knn_query: previously, if k was specified and larger than the number of neighbors included in init, this gave an error. Now, init will be augmented with random neighbors to reach the desired k. This could be useful as a way to "restart" a neighbor search from a better-than-random location if k has been found to have been too small initially. Note that the random selection does not take into account the identities of the already chosen neighbors, so duplicates may be included in the augmented result, which will reduce the effective size of the initialized number of neighbors.
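A sketch of the "restart with a larger k" idea (the dataset and k values are arbitrary; see the note above about possible duplicates):
library(rnndescent)
knn10 <- nnd_knn(iris[, -5], k = 10)
# the smaller graph seeds the search and is padded with random neighbors up to k = 20
knn20 <- nnd_knn(iris[, -5], k = 20, init = knn10)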
The block_size and grain_size parameters have been removed from functions. These were related to the amount of work done per thread, but it's not obvious to an outside user how to set these.
Functions report progress (if verbose = TRUE) and respond to user-requested cancellation.
nnd_knn_query has been renamed to graph_knn_query and now more closely follows the current pynndescent graph search method (including backtracking search).
New function: prepare_search_graph, for preparing a search graph from a neighbor graph for use in graph_knn_query, by using reverse nearest neighbors, occlusion pruning and truncation (see the sketch below).
graph_knn_query.
There was a major rewrite of the internal organization of the C++ to be less R-specific.
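A sketch of the search-graph workflow described above (the data split and k are arbitrary, and the usual query/reference/graph argument order is assumed):
library(rnndescent)
ref <- iris[1:100, -5]
query <- iris[101:150, -5]
ref_knn <- nnd_knn(ref, k = 15)
# reverse neighbors, occlusion pruning and truncation happen in here
sg <- prepare_search_graph(ref, ref_knn)
res <- graph_knn_query(query, ref, sg, k = 15)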
The license for rnndescent has changed from GPLv3 to GPLv3 or later.
"correlation". This is (1 minus) the
Pearson correlation.
New function: k_occur, which counts the k-occurrences of each item in the idx matrix, i.e. the number of times an item appears in the k-nearest neighbor lists of the dataset. The distribution of the k-occurrences can be used to diagnose the "hubness" of a dataset. Items with a large k-occurrence (>> k, e.g. 10k) may indicate low accuracy of the approximate nearest neighbor result.
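A minimal sketch of the hubness diagnostic (the dataset and k are arbitrary):
library(rnndescent)
knn <- nnd_knn(iris[, -5], k = 15)
# how often each item shows up in other items' neighbor lists
occurrences <- k_occur(knn$idx)
summary(occurrences)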
To avoid undefined behavior issues, rnndescent now uses an internal implementation of RcppParallel's parallelFor loop that works with std::thread and does not load Intel's TBB library.
Previously, the dqrng sample routines were used from inside a thread, despite dqrng clearly using the R API extensively. That is not ok and causes lots of crashes. There is now a re-implementation of dqrng's sample routines using plain std::vectors in src/rnn_sample.h. That file is licensed under the AGPL (rnndescent as a whole remains GPL3).
New function: merge_knn, to combine two nearest neighbor graphs. Useful for combining the results of multiple runs of nnd_knn or random_knn. Also, merge_knnl, which operates on a list of multiple neighbor graphs, and can provide a speed up over merge_knn if you don't mind storing multiple graphs in memory at once.
nnd_knn with n_threads > 1 and random_knn with n_threads > 1 and order_by_distance = TRUE.
nnd_knn with n_threads > 1 due to the use of a mutex pool.
Mainly an internal clean-up to reduce duplication.
nnd_knn and nnd_knn_query now use the same progress bar as the brute force and random neighbor functions. The old per-iteration logging, which also showed the current distance sum of the knn, can be brought back with the progress = "dist" option.
For random_knn and random_knn_query, when order_by_distance = TRUE and n_threads > 0, the final sorting of the knn graph will be multi-threaded.
n_threads > 0.
New query functions have been added, nnd_knn_query being the most useful, but brute_force_knn_query and random_knn_query are also available. This allows for query data to search reference data, i.e. the returned indices and distances are relative to the reference data, not any other member of query. These methods are also available in multi-threaded mode, and nnd_knn_query has a low and high memory version.
The l2 metric has been renamed to l2sqr to more accurately reflect what it is: the square of the L2 (Euclidean) metric.
New parameter: use_alt_metric. Set it to FALSE if you don't want alternative, faster metrics (which keep the distance ordering of metric) to be used in internal calculations. Currently this only applies to metric = "euclidean", where the squared Euclidean distance is used internally. It is only worth setting this to FALSE if you think the alternative is causing numerical issues (which is a bug, so please report it!).
New parameter: block_size for parallel methods, which determines the amount of work done in parallel before checking for a user interrupt request and updating any progress.
random_knn now returns its results in sorted order. You can turn this off with order_distances = FALSE if you don't need the sorting (e.g. you are using the results as input to something else).
brute_force and random methods should now be correct.
brute_force_knn.
random_knn.
verbose = TRUE.
The fast_rand option has been removed, as it only applied to single-threading and had a negligible effect.
Also, a number of changes inspired by recent work in https://github.com/lmcinnes/pynndescent:
The rho sampling parameter has been removed. The size of the candidates (general neighbors) list is now controlled entirely by max_candidates.
The default value of max_candidates has been reduced to 20.
The use_set logical flag has been replaced by low_memory, which has the opposite meaning. It now also works when using multiple threads. While it follows the pynndescent implementation, it's still experimental, so low_memory = TRUE by default for the moment.
The low_memory = FALSE implementation for n_threads = 0 (originally equivalent to use_set = TRUE) is faster.
The block_size parameter balances the interleaving of queuing updates versus applying them to the current graph.