Tutorial 3: Computing Morphological Distances in Large Datasets

Computing the Gromov-Wasserstein distance between two cells sampled at 100 points each takes about 9 ms on a standard desktop computer. The number of pairs grows quadratically with the number of cells, so the total runtime can become large in datasets with several thousand cells.
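
To make the quadratic growth concrete, the following is a back-of-the-envelope sketch, not part of CAJAL. The helper names `num_pairs` and `estimated_seconds` are hypothetical. It assumes a constant 9 ms per pair on a single core and ignores parallelism and I/O. The value 509 is chosen because 509 cells give 129286 unordered pairs, which matches the count in the progress bar shown later in this tutorial, assuming that bar counts cell pairs.

```python
def num_pairs(n_cells: int) -> int:
    """Number of unordered pairs of distinct cells: n * (n - 1) / 2."""
    return n_cells * (n_cells - 1) // 2


def estimated_seconds(n_cells: int, seconds_per_pair: float = 0.009) -> float:
    """Rough single-core runtime estimate, assuming a constant cost per pair."""
    return num_pairs(n_cells) * seconds_per_pair


if __name__ == "__main__":
    for n in (509, 5000):
        secs = estimated_seconds(n)
        print(f"{n} cells: {num_pairs(n)} pairs, about {secs / 3600:.2f} hours")
```

Under these assumptions, 509 cells need roughly 19 minutes of single-core time, while 5000 cells need about 31 hours.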

For large datasets we provide two tools to reduce the necessary computation, as well as a hybrid of these.

In [1], the author establishes several lower bounds for the Gromov-Wasserstein (GW) distance. CAJAL implements one of the fastest of these, the second lower bound (SLB) [2]. For many downstream analyses, such as clustering and dimensionality reduction, the exact distance between dissimilar cells is not crucial; it is enough to know the precise GW distance only for cells that are close to each other in morphology space. Because the SLB is a fast lower bound for the GW distance, it can quickly identify pairs of cells that lie far apart in morphology space, so that their exact GW distance need not be computed.
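
The pruning idea can be sketched in a few lines. This is a generic illustration of lower-bound pruning for a nearest-neighbor query, not CAJAL's implementation. The function name `nearest_neighbor_with_pruning` and the toy numbers are hypothetical, and `exact_distance` stands in for any expensive distance such as GW.

```python
import numpy as np


def nearest_neighbor_with_pruning(lower_bounds, exact_distance):
    """Find the nearest candidate while skipping exact computations where possible.

    lower_bounds[j] must satisfy lower_bounds[j] <= exact_distance(j) for all j.
    Returns (index of nearest candidate, its exact distance, number of exact calls).
    """
    order = np.argsort(lower_bounds)  # visit the most promising candidates first
    best_idx, best_dist, n_exact = -1, np.inf, 0
    for j in order:
        # Every remaining candidate has a lower bound at least this large,
        # so none of them can beat the best exact distance found so far.
        if lower_bounds[j] >= best_dist:
            break
        d = exact_distance(int(j))
        n_exact += 1
        if d < best_dist:
            best_idx, best_dist = int(j), d
    return best_idx, best_dist, n_exact


# Toy data: "true" distances and a cheap lower bound on each of them.
true_dists = np.array([5.0, 1.0, 3.0, 8.0, 2.0])
bounds = true_dists * np.array([0.5, 0.9, 0.6, 0.7, 0.8])

result = nearest_neighbor_with_pruning(bounds, lambda j: float(true_dists[j]))
print(result)
```

In this toy example only one of the five exact distances has to be evaluated, because every other candidate's lower bound already exceeds the best exact distance found.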

Let us illustrate how the computation of the SLB using CAJAL works on the same neuronal dataset as in Tutorial 1. We start with the file of intracellular distances computed in Tutorial 1:

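from cajal.qgw import slb_parallel

slb_parallel(
    intracell_csv_loc = "/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",  # intracell distance file from Tutorial 1
    out_csv = "/home/jovyan/slb_dists.csv",
    num_processes = 8                                # num_processes can be set to the number of cores on your machine
)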
100%|███████████████████████████████████████████████████████████████████████| 129286/129286 [00:00<00:00, 325877.15it/s]

As an approximation of the Gromov-Wasserstein distance the SLB is somewhat crude, since it is only a lower bound, but it takes only about 6 seconds to compute for this dataset.

To get a better sense of the SLB accuracy, let us compare the SLB with the GW distance computed for each pair of cells in Tutorial 1:

import plotly.io as pio
pio.renderers.default = 'iframe'

from cajal.utilities import read_gw_dists, dist_mat_of_dict
from cajal.run_gw import cell_iterator_csv
import plotly.express

names, _ = zip(*cell_iterator_csv("/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv"))

_, gw_dist_dict = read_gw_dists("/home/jovyan/swc_bdad_100pts_euclidean_GW_dmat.csv", True)
gw100_dist_table = dist_mat_of_dict(gw_dist_dict, names, as_squareform=False)

_, slb_dist_dict =  read_gw_dists("/home/jovyan/slb_dists.csv", True)
slb_dist_table = dist_mat_of_dict(slb_dist_dict, names, as_squareform=False)

fig = plotly.express.scatter(x=slb_dist_table,
                             y=gw100_dist_table,
                             labels={
                                 "x" : "SLB",
                                 "y" : "GW distance"})
fig.update_traces(marker={'size': 1})
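
Alongside the scatter plot, a few summary numbers can help quantify how tight the bound is. The following is a minimal sketch using only NumPy. The helper `summarize_lower_bound` is hypothetical and not part of CAJAL, and the toy arrays stand in for the real distance vectors.

```python
import numpy as np


def summarize_lower_bound(lower, exact, tol=1e-9):
    """Compare a lower bound with exact distances, pair by pair.

    Both inputs are 1-D arrays listing the same pairs in the same order.
    """
    lower = np.asarray(lower, dtype=float)
    exact = np.asarray(exact, dtype=float)
    if lower.shape != exact.shape:
        raise ValueError("lower and exact must have the same shape")
    positive = exact > 0  # avoid dividing by zero for identical cells
    return {
        # Fraction of pairs where the bound actually holds (up to tol).
        "fraction_valid": float(np.mean(lower <= exact + tol)),
        # Mean of lower / exact; values near 1 indicate a tight bound.
        "mean_ratio": float(np.mean(lower[positive] / exact[positive])),
        # Pearson correlation between the bound and the exact distance.
        "pearson_r": float(np.corrcoef(lower, exact)[0, 1]),
    }


# Toy data in which the bound is always half of the exact distance.
toy_exact = np.array([1.0, 2.0, 4.0, 8.0])
toy_lower = 0.5 * toy_exact
summary = summarize_lower_bound(toy_lower, toy_exact)
print(summary)
```

With the tables from this tutorial, the call would be `summarize_lower_bound(slb_dist_table, gw100_dist_table)`, assuming both are 1-D arrays over the same ordering of cell pairs.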