Weighted Nearest Neighbors

This page documents an implementation of the weighted nearest neighbors algorithm described in Integrated analysis of multimodal single-cell data.

Given a set of cells \(c_1,c_2,\dots,c_k\), we may measure or observe them with different technologies, collecting (for example)

  • electrophysiological feature vectors \(e_1,\dots, e_k \in \mathbb{R}^n\)

  • transcriptomic feature vectors from a reduced-dimensional representation, \(t_1,\dots, t_k \in \mathbb{R}^m\)

  • pairwise Gromov-Wasserstein distances \(m_{ij}=d_{GW}(c_i,c_j)\) for all \(i,j\)

  • (…)

Here we restrict ourselves to measurements which can be expressed as

  • a list of \(k\) vectors in some finite-dimensional Euclidean space

  • a pairwise distance or dissimilarity matrix

The weighted nearest neighbor algorithm integrates measurements from these different modalities into a single notion of similarity, one which reflects the combined similarity of two cells with respect to all of the constituent modalities. It does not assign equal weight to each modality; rather, at each point in the space, each modality is weighted according to the information it contributes in the neighborhood of that point.
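Schematically (this omits the details of how the paper defines the per-modality affinities and the cell-specific weights), the combined similarity of cells \(c_i\) and \(c_j\) takes the form

\(\theta(i,j) = \sum_{m} w_m(i)\,\theta_m(i,j), \qquad \sum_m w_m(i) = 1,\)

where \(\theta_m(i,j)\) is the affinity between \(c_i\) and \(c_j\) under modality \(m\) and \(w_m(i)\) is a weight measuring how informative modality \(m\) is in the neighborhood of \(c_i\).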

We have added two novel steps to the original algorithm in order to incorporate distance matrices such as the Gromov-Wasserstein distance matrix. Specifically, we must

  • estimate the dimensionality of the latent space from the distance matrix

  • embed the distance matrix into a Euclidean space of the appropriate dimension.

To estimate the dimensionality of the latent space, we chose the MADA algorithm, based on the general benchmarking paper Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework, which notes that MADA performs well in applications such as geophysical signal processing and that the authors of the MADA paper “prove the consistency in probability of the presented estimators [and] derive upper bounds on the probability of the estimation-error for large enough values of \(N\).”
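As a rough illustration of the flavor of the estimator (a sketch of MADA's first-moment idea applied directly to a precomputed distance matrix; this is not the implementation used here, and the MADA paper should be consulted for the precise estimators):

    import numpy as np

    def mada_dimension_sketch(dmat: np.ndarray, k: int = 20) -> float:
        """Rough sketch of a MADA-style intrinsic dimension estimate.

        For each point, compare the distance to its k-th nearest neighbor
        with the distance to its ceil(k/2)-th nearest neighbor; the ratio
        reflects how fast volume grows locally, hence the local dimension.
        """
        # Sort each row of the distance matrix; column 0 is the point itself.
        sorted_d = np.sort(dmat, axis=1)
        r_half = sorted_d[:, int(np.ceil(k / 2))]
        r_k = sorted_d[:, k]
        # Local first-moment estimates: d = ln 2 / ln(r_k / r_{k/2}).
        local = np.log(2.0) / np.log(r_k / r_half)
        # Aggregate the local estimates into a single global estimate.
        return float(np.median(local))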

To embed the distance matrix into Euclidean space, we use Isomap, which is a standard algorithm.
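For instance, a minimal sketch of this embedding step using scikit-learn (an assumption for illustration; in practice the number of components would come from the MADA estimate above):

    import numpy as np
    from sklearn.manifold import Isomap

    def embed_distance_matrix(dmat: np.ndarray, n_components: int, n_neighbors: int) -> np.ndarray:
        """Embed a k-by-k distance matrix into n_components-dimensional Euclidean space."""
        iso = Isomap(n_neighbors=n_neighbors, n_components=n_components, metric="precomputed")
        return iso.fit_transform(dmat)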

class Modality(obsv: ndarray[Any, dtype[float64]])

A Modality is a dataset profiling a collection of cells from one perspective or using one technology. It can be constructed using either a set of observations in n-dimensional space (a k by n matrix, where k is the number of cells and n is the dimensionality of the ambient space) or a distance matrix (a k by k matrix). If only a distance matrix is supplied, the constructor chooses an embedding of the points into n-dimensional space using Isomap, so a set of observations in a vector space is preferable when it is available.

If using a distance matrix, a Modality object must be constructed together with a given number of neighbors to consider when constructing the nearest neighbors graph for the Isomap embedding. This number should be at least as high as the number of neighbors you care about when analyzing the output of the WNN embedding.

Parameters

obsv (ndarray[Any, dtype[float64]]) – a k by n matrix of observations, one row per cell
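For example, constructing a Modality from a k by n matrix of observations might look like this (a hypothetical usage sketch with made-up data shapes; the distance-matrix case described above is not shown here):

    import numpy as np
    from cajal.wnn import Modality

    # 100 cells, each described by a 30-dimensional electrophysiological feature vector.
    ephys_features = np.random.default_rng(0).normal(size=(100, 30))
    ephys_modality = Modality(ephys_features)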

wnn(modalities: list[cajal.wnn.Modality], k: int, epsilon: float = 0.0001)

Compute the weighted nearest neighbors pairing, following Integrated analysis of multimodal single-cell data.

This implementation differs from the algorithm published in the paper in a few ways. In particular, we do not L2-normalize the columns of the matrix before we begin.

Parameters
  • modalities (list[cajal.wnn.Modality]) – list of modalities

  • k (int) – how many nearest neighbors to consider

  • epsilon (float) – a numerical stability parameter; it is added to the denominator of a fraction to prevent division by zero.

Returns

A matrix of pairwise similarities (not distances!) which can be used in training a k-nearest neighbors classifier to identify the cells which are, overall, most similar to a query cell from the perspective of multiple modalities.
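A minimal usage sketch (hypothetical data shapes; this assumes the returned similarity matrix is a NumPy array and queries it directly rather than through a classifier):

    import numpy as np
    from cajal.wnn import Modality, wnn

    rng = np.random.default_rng(0)
    k_cells = 100
    ephys = Modality(rng.normal(size=(k_cells, 30)))          # electrophysiological features
    transcriptome = Modality(rng.normal(size=(k_cells, 50)))  # reduced transcriptomic features

    similarities = wnn([ephys, transcriptome], k=15)

    # For each cell, the indices of its 15 most similar cells
    # (dropping column 0, assuming each cell is most similar to itself).
    neighbor_idx = np.argsort(-similarities, axis=1)[:, 1:16]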