Tutorial 3: Computing Morphological Distances in Large Datasets

The Gromov-Wasserstein distance between two cells with 100 points takes about 9ms to compute on a standard desktop computer. The number of pairs grows quadratically with the number of cells, and so the total runtime can become large in datasets with several thousands of cells.

For large datasets we provide two tools to reduce the necessary computation, as well as a hybrid of these.

In [1], the author establishes several lower bounds for the Gromov-Wasserstein (GW) distance. CAJAL implements one of the fastest bounds, the second lower bound (SLB) [2]. For many downstream analyses, such as clustering and dimensional reduction, it is not crucial to know the exact values between disparate cells, and it is enough to know the precise Gromov-Wasserstein distance only for cells that are close to each other in the morphology space. Since the SLB is a fast lower bound to the GW distance, it can be used to quickly identify pairs of cells that are located far apart in the morphology space so that their precise GW distance does not need to be precisely computed.

Let us illustrate how the computation of the SLB using CAJAL works on the same neuronal dataset as in Tutorial 1. We start with the file of intracellular distances computed in Tutorial 1:

[1]:
from cajal.qgw import slb_parallel

slb_parallel(
    "/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",
    out_csv = "/home/jovyan/slb_dists.csv",
    num_processes =8                                 # num_processes can be set to the number of cores on your machine
    )
100%|███████████████████████████████████████████████████████████████████████| 129286/129286 [00:00<00:00, 325877.15it/s]

The SLB is somewhat crude as an approximation of Gromov-Wasserstein, as it is only a lower bound, but it only takes a ~6 seconds to compute for this dataset.

To get a better sense of the SLB accuracy, let us compare the SLB with the GW distance computed for each pair of cells in Tutorial 1:

[1]:
import plotly.io as pio
pio.renderers.default = 'iframe'

from cajal.utilities import read_gw_dists, dist_mat_of_dict
from cajal.run_gw import cell_iterator_csv
import plotly.express

names, _ = zip(*cell_iterator_csv("/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv"))
names=list(names)

_, gw_dist_dict = read_gw_dists("/home/jovyan/swc_bdad_100pts_euclidean_GW_dmat.csv", True)
gw100_dist_table = dist_mat_of_dict(gw_dist_dict, names, as_squareform=False)

_, slb_dist_dict =  read_gw_dists("/home/jovyan/slb_dists.csv", True)
slb_dist_table = dist_mat_of_dict(slb_dist_dict, names, as_squareform=False)

fig = plotly.express.scatter(x=slb_dist_table,
                       y=gw100_dist_table,
                       template="simple_white",
                       labels={
                         "x" : "SLB",
                         "y" : "GW distance"})
fig.update_traces(marker={'size': 1})
fig.show()

The mean and stadard deviation of the relative difference between the GW distance and the SLB in this example are:

[3]:
import numpy

print(numpy.mean((gw100_dist_table-slb_dist_table)/gw100_dist_table))
print(numpy.std((gw100_dist_table-slb_dist_table)/gw100_dist_table))
0.22281620822119072
0.19661884027625978

Although the SLB is only a lower bound for the Gromov-Wasserstein distance, using it alone is already fairly accurate as a classifier. Let us repeat the same analysis presented in Tutorial 1 for predicting the molecular type of neurons but using the SLB instead of the GW distance:

[4]:
import pandas
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import matthews_corrcoef
from scipy.spatial.distance import squareform

metadata = pandas.read_csv("CAJAL/data/cell_types_specimen_details.csv")
metadata.index = [str(m) for m in metadata["specimen__id"]]
metadata = metadata.loc[names]

cre_lines = numpy.array(metadata["line_name"])

clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True)

cvp = cross_val_predict(clf, X=squareform(slb_dist_table), y=cre_lines, cv=cv)

print(matthews_corrcoef(cvp, cre_lines))
0.21540117286365484
/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_split.py:684: UserWarning:

The least populated class in y has only 1 members, which is less than n_splits=7.

The Matthews correlation coefficient is only slightly lower than the one we derived in the Tutorial 1 using the GW distance.

If the SLB distances between all cells are known, one may envision a simple algorithm[3] to compute the \(k\) nearest neighbors of any given cell under the Gromov-Wasserstein metric. If \(k\) is chosen sufficiently large, this is enough to understand the local structure of the morphology space and cluster it.

The second tool we provide is an implementation of the quantized Gromov-Wasserstein distance proposed by Chowdhury, Miller and Needham[4]. Given cells \(X\) and \(Y\), the quantized Gromov-Wasserstein distance is given as follows: 1. Partition the points of \(X\) and \(Y\) into \(n\) clusters. Let \(X^n, Y^n\) be the set of medoids of these clusters; \(X^n\) can be thought of as the best possible approximation to \(X\) in the GW morphology space by a set with at most \(n\) points. 2. Computing the optimal Gromov-Wasserstein transport plan between the subspaces \(X^n\) and \(Y^n\) formed by the medoids of each cluster. 3. Extend this to a global transport plan between \(X\) and \(Y\) by pairing points within paired clusters by their distance from the medoid, and compute the distortion associated to this transport plan. This approximation gives an acceptable tradeoff between precision and computation time.

Below, for each cell we cluster the 100 sampled points into 25 clusters.

[5]:
from cajal.qgw import quantized_gw_parallel

quantized_gw_parallel(
    intracell_csv_loc="/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",
    num_processes=8,
    num_clusters=25,
    out_csv="/home/jovyan/quantized_gw.csv")
100%|█████████████████████████████████████████████████████████████████████████| 129286/129286 [00:32<00:00, 4025.92it/s]

This is about ten times faster to compute than the GW distance and provides a better approximation to the GW distance than the SLB.

[2]:
_, qgw_dist_dict = read_gw_dists("/home/jovyan/quantized_gw.csv", header=True)
qgw_dmat = dist_mat_of_dict(qgw_dist_dict, cell_names=names, as_squareform=False)

fig = plotly.express.scatter(x=qgw_dmat,
                       y=gw100_dist_table,
                       template="simple_white",
                       labels={
                         "x" : "Quantized GW distance",
                         "y" : "GW distance"})
fig.update_traces(marker={'size': 1})
fig.show()

We can see that the mean and stadard deviation of the relative difference between the GW distance and the quantized GW distance are smaller than those for the SLB:

[7]:
print(numpy.mean((gw100_dist_table-qgw_dmat)/gw100_dist_table))
print(numpy.std((gw100_dist_table-qgw_dmat)/gw100_dist_table))
-0.10198643981692142
0.13250463499313633

Finally, CAJAL implements an approach that combine these two tools (quantized Gromov-Wasserstein and SLB) in an integrated analysis method which allows the user to reduce computation time in three simultaneous ways:

  1. by computing only the \(k\) nearest neighbors of each cell and estimating the rest roughly using SLB

  2. by accepting a small fraction of errors in the reported nearest neighbors list of each cell (i.e., 98% of the nearest neighbors are correct)

  3. by using the quantized GW distance as a proxy for the true GW distance

[8]:
from cajal.combined_slb_qgw import combined_slb_quantized_gw
combined_slb_quantized_gw(
    "/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",
    "/home/jovyan/swc_bdad_100pts_euclidean_gw_estimator.csv",
    num_processes=8,
    num_clusters=25,
    accuracy = 0.95,
    nearest_neighbors = 30)

The above command does the following:

  1. Compute the pairwise SLB between any two cells in the given list of intracell distance matrices.

  2. For each cell \(X\), identify the 30 estimated nearest neighbors \(X_1,\dots, X_{30}\) based on the SLB distance matrix, and compute the QGW distance \(QGW_{25}(X,X_k)\) for all 30 pairs.

  3. For each remaining cell pair \(X,Y\), we estimate the probability that \(QGW_{25}(X,Y)\) is small enough to “injure” the existing purported list of nearest neighbors. We sort the cell pairs in descending order by this probability and compute the \(QGW_{25}\) distance between pairs in this list until the expected number of “injuries” remaining is less than 5% of the nearest neighbor table, so that of the reported 30 nearest nearest neighbors to each point, 28.5 are expected to be correct.

  4. For all remaining cells we estimate the correct distance based on the SLB.

As expected, a UMAP representation based on the fast QGW/SLB combined estimator is very close to the UMAP that we computed in Tutorial 1 for the ame data using the full GW distance:

[3]:
import umap
import pandas

_, slbqgw_dist_dict = read_gw_dists("/home/jovyan/swc_bdad_100pts_euclidean_gw_estimator.csv", header=False)
slbqgw_sq_dist = dist_mat_of_dict(slbqgw_dist_dict, cell_names=names)

# Load metadata
metadata = pandas.read_csv("CAJAL/data/cell_types_specimen_details.csv")
metadata.index = [str(m) for m in metadata["specimen__id"]]
metadata = metadata.loc[names]

# Compute UMAP representation
reducer = umap.UMAP(metric="precomputed", random_state=1)
embedding = reducer.fit_transform(slbqgw_sq_dist)

# Visualize UMAP colored by cortical layer
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in names],
                       color = metadata["structure__layer"])
2023-07-20 21:05:35.256736: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/opt/conda/lib/python3.10/site-packages/umap/umap_.py:1780: UserWarning:

using precomputed metric; inverse_transform will be unavailable

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Notes

1.[^](#cite_ref-1) Mémoli, F. P. Gromov–Wasserstein Distances and the Metric Approach to Object Matching. Found Comput Math (2011) 11, 417–487.

2.[^](#cite_ref-2) Specifically we have implemented the expression which appears on the right hand side of the first inequality of Corollary 6.2 on page 462, for p = 2. This expression is not directly named in the paper. We do not compute the quantity Mémoli calls SLB, as it is too computationally expensive for our purposes.

3.[^](#cite_ref-3) A simple algorithm for computing the nearest neighbors of a cell in the Gromov-Wasserstein morphology space if the SLB distance is known is as follows: 1. First, sort all other cells by their SLB2 distance from \(c_0\), as \(c_1, c_2, c_3,\dots\). 2. Next, compute the Gromov-Wasserstein distance \(GW(c_0,c_j)\), as \(j = 1, 2, 3,\dots\). Write \(e^k_j\) for the \(k\)-th element of the set \(GW(c_0,c_1),GW(c_0,c_2),\dots,GW(c_0,c_j)\) when these are ordered from least to greatest. (\(e^k_j\) is only defined when \(k \leq j\)). Continue computing \(GW(c_0,c_j)\) until \(j\) reaches a value \(\ell\) such that for all \(i> \ell\), \(SLB(c_0,c_i) > e^k_\ell\). Because SLB is a lower bound, at this point, the \(k\) nearest neighbors of \(c_0\) are contained in the set \(\{c_1,\dots, c_\ell\}\).

4.[^](#cite_ref-4) Chowdhury, S., Miller, D., Needham, T. (2021). Quantized Gromov-Wasserstein. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science(), vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_49