Tutorial 3: Computing Morphological Distances in Very Large Datasets
The Gromov-Wasserstein distance between two cells with 100 points takes about 9 ms to compute on a standard desktop computer. The number of pairs grows quadratically with the number of cells, and so the total runtime can become large in datasets with several thousands of cells.
For very large datasets we provide two tools to reduce the necessary computation, as well as a hybrid of these.
Reference [1] established several lower bounds for the Gromov-Wasserstein (GW) distance. CAJAL implements one of the fastest bounds, the second lower bound (SLB) [2]. For many downstream analyses, such as clustering and dimensional reduction, it is not crucial to know the exact values between cells with very disparate morphologies, and it is enough to know the precise Gromov-Wasserstein distance only for cells that are close to each other in the morphology space. Since the SLB is a fast lower bound to the GW distance, it can be used to quickly identify pairs of cells that are located far apart in the morphology space so that their precise GW distance does not need to be computed.
Let us illustrate how the computation of the SLB using CAJAL works on the same neuronal dataset as in Tutorial 1. We start with the file of intracellular distances computed in Tutorial 1:
[1]:
# bd = "/home/jovyan/" # Base directory
bd = "/home/patn/dropbox/Data/AllenInstitute"
[2]:
from cajal.qgw import slb_parallel
from os.path import join
slb_parallel(
join(bd,"swc_bdad_100pts_euclidean_icdm.csv"),
out_csv = join(bd,"slb_dists.csv"),
num_processes =8 # num_processes can be set to the number of cores on your machine
)
The SLB is somewhat crude as an approximation of Gromov-Wasserstein, as it is only a lower bound, but it only takes a ~6 seconds to compute for this dataset.
To get a better sense of the SLB accuracy, let us compare the SLB with the GW distance computed for each pair of cells in Tutorial 1:
[3]:
import plotly.io as pio
# Choose the adequate plotly renderer for visualizing plotly graphs in your system
pio.renderers.default = 'notebook_connected'
# pio.renderers.default = 'iframe'
from cajal.utilities import read_gw_dists, dist_mat_of_dict
from cajal.run_gw import cell_iterator_csv
import plotly.express
names, _ = zip(*cell_iterator_csv(join(bd,"swc_bdad_100pts_euclidean_icdm.csv")))
names=list(names)
_, gw_dist_dict = read_gw_dists(join(bd,"swc_bdad_100pts_euclidean_GW_dmat.csv"), True)
gw100_dist_table = dist_mat_of_dict(gw_dist_dict, names, as_squareform=False)
_, slb_dist_dict = read_gw_dists(join(bd,"slb_dists.csv"), True)
slb_dist_table = dist_mat_of_dict(slb_dist_dict, names, as_squareform=False)
fig = plotly.express.scatter(x=slb_dist_table,
y=gw100_dist_table,
template="simple_white",
labels={
"x" : "SLB",
"y" : "GW distance"})
fig.update_traces(marker={'size': 1})
fig.show()
The mean and stadard deviation of the relative difference between the GW distance and the SLB in this example are:
[4]:
import numpy
print(numpy.mean((gw100_dist_table-slb_dist_table)/gw100_dist_table))
print(numpy.std((gw100_dist_table-slb_dist_table)/gw100_dist_table))
0.22240966232782416
0.1966012201739646
Although the SLB is only a lower bound for the Gromov-Wasserstein distance, using it alone is already fairly accurate as a classifier. Let us repeat the same analysis presented in Tutorial 1 for predicting the molecular type of neurons but using the SLB instead of the GW distance:
[5]:
import pandas
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import matthews_corrcoef
from scipy.spatial.distance import squareform
metadata = pandas.read_csv(join(bd,"cell_types_specimen_details.csv"))
metadata.index = [str(m) for m in metadata["specimen__id"]]
metadata = metadata.loc[names]
cre_lines = numpy.array(metadata["line_name"])
clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True)
cvp = cross_val_predict(clf, X=squareform(slb_dist_table), y=cre_lines, cv=cv)
print(matthews_corrcoef(cvp, cre_lines))
0.24036084479003786
/home/patn/PGC012_Gromov_Wasserstein/venv/lib/python3.12/site-packages/sklearn/model_selection/_split.py:776: UserWarning:
The least populated class in y has only 1 members, which is less than n_splits=7.
The Matthews correlation coefficient is only slightly lower than the one we derived in the Tutorial 1 using the GW distance.
If the SLB distances between all cells are known, one may envision a simple algorithm[3] to compute the \(k\) nearest neighbors of any given cell under the Gromov-Wasserstein metric. If \(k\) is chosen sufficiently large, this is enough to understand the local structure of the morphology space and cluster it.
The second tool we provide is an implementation of the quantized Gromov-Wasserstein distance proposed by Chowdhury, Miller and Needham[4]. Given cells \(X\) and \(Y\), the quantized Gromov-Wasserstein distance is given as follows: 1. Partition the points of \(X\) and \(Y\) into \(n\) clusters. Let \(X^n, Y^n\) be the set of medoids of these clusters; \(X^n\) can be thought of as the best possible approximation to \(X\) in the GW morphology space by a set with at most \(n\) points. 2. Compute the optimal Gromov-Wasserstein transport plan between the subspaces \(X^n\) and \(Y^n\) formed by the medoids of each cluster. 3. Extend this to a global transport plan between \(X\) and \(Y\) by pairing points within paired clusters by their distance from the medoid, and compute the distortion associated to this transport plan. This approximation gives an acceptable tradeoff between precision and computation time.
Below, for each cell we cluster the 100 sampled points into 25 clusters.
[6]:
from cajal.qgw import quantized_gw_parallel
quantized_gw_parallel(
intracell_csv_loc=join(bd,"swc_bdad_100pts_euclidean_icdm.csv"),
num_processes=8,
num_clusters=25,
out_csv=join(bd,"quantized_gw.csv"))
Computing pairwise Gromov-Wasserstein distances...
This is about ten times faster to compute than the GW distance and provides a better approximation to the GW distance than the SLB.
[7]:
_, qgw_dist_dict = read_gw_dists(join(bd,"quantized_gw.csv"), header=True)
qgw_dmat = dist_mat_of_dict(qgw_dist_dict, cell_names=names, as_squareform=False)
fig = plotly.express.scatter(x=qgw_dmat,
y=gw100_dist_table,
template="simple_white",
labels={
"x" : "Quantized GW distance",
"y" : "GW distance"})
fig.update_traces(marker={'size': 1})
fig.show()