Tutorial 3: Computing Morphological Distances in Very Large Datasets
The Gromov-Wasserstein distance between two cells with 100 points takes about 9 ms to compute on a standard desktop computer. The number of pairs grows quadratically with the number of cells (n cells give n(n-1)/2 pairs), so the total runtime can become large in datasets with several thousand cells.
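To make the quadratic growth concrete, here is a minimal back-of-the-envelope sketch. It assumes the 9 ms per pair quoted above, cells sampled at 100 points, and a single core; the function name `estimate_gw_runtime_seconds` is hypothetical and is not part of CAJAL. Actual timings will depend on your hardware and on how many processes you use.

```python
def estimate_gw_runtime_seconds(num_cells: int, seconds_per_pair: float = 0.009) -> float:
    """Rough single-core time to compute GW distances for all pairs of cells.

    seconds_per_pair defaults to the ~9 ms figure quoted in the text
    (an assumption that holds only for cells sampled at 100 points).
    """
    num_pairs = num_cells * (num_cells - 1) // 2  # unordered pairs of distinct cells
    return num_pairs * seconds_per_pair


for n in (100, 1_000, 5_000, 20_000):
    pairs = n * (n - 1) // 2
    hours = estimate_gw_runtime_seconds(n) / 3600
    print(f"{n:>6} cells: {pairs:>11,} pairs, about {hours:,.2f} single-core hours")
```

Multiplying the number of cells by ten multiplies the estimated runtime by roughly one hundred, which is why the shortcuts described in this tutorial matter for large datasets.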
For very large datasets we provide two tools to reduce the necessary computation, as well as a hybrid of these.
Reference [1] established several lower bounds for the Gromov-Wasserstein (GW) distance. CAJAL implements one of the fastest bounds, the second lower bound (SLB) [2]. For many downstream analyses, such as clustering and dimensionality reduction, it is not crucial to know the exact distances between cells with very disparate morphologies; it is enough to know the precise Gromov-Wasserstein distance only for cells that are close to each other in the morphology space. Since the SLB is a fast lower bound on the GW distance, it can be used to quickly identify pairs of cells that are located far apart in the morphology space, so that their precise GW distance does not need to be computed.
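The pruning idea can be sketched in a few lines. Because the true distance is never smaller than its lower bound, any pair whose lower bound already exceeds a chosen threshold is guaranteed to be farther apart than that threshold and can be skipped. The function `pairs_needing_exact_distance`, the threshold, and the toy values below are hypothetical illustrations, not CAJAL's own routine; the sketch also assumes the lower bounds are stored as a vector in the pair order produced by `itertools.combinations`.

```python
from itertools import combinations

import numpy as np


def pairs_needing_exact_distance(lower_bounds, num_cells, threshold):
    """Return the pairs (i, j), i < j, whose lower bound does not exceed threshold.

    lower_bounds is assumed to be a vector with one entry per pair, in the
    order of itertools.combinations(range(num_cells), 2).
    """
    lower_bounds = np.asarray(lower_bounds, dtype=float)
    all_pairs = list(combinations(range(num_cells), 2))
    if len(all_pairs) != lower_bounds.shape[0]:
        raise ValueError("expected one lower bound per pair of cells")
    # Pairs with lower bound > threshold have true distance > threshold too,
    # so only the remaining pairs need the expensive exact computation.
    return [pair for pair, lb in zip(all_pairs, lower_bounds) if lb <= threshold]


# Toy example with 4 cells (6 pairs); the values are made up for illustration.
toy_lower_bounds = [0.1, 0.9, 0.8, 1.2, 0.7, 0.2]
candidates = pairs_needing_exact_distance(toy_lower_bounds, num_cells=4, threshold=0.5)
print(candidates)  # [(0, 1), (2, 3)]
```

In this toy case only two of the six pairs would need an exact computation. How much is saved in practice depends on how tight the bound is and on how the threshold is chosen.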
Let us illustrate how the computation of the SLB using CAJAL works on the same neuronal dataset as in Tutorial 1. We start with the file of intracellular distances computed in Tutorial 1:
[3]:
bd = "/home/jovyan/" # Base directory
[8]:
from cajal.qgw import slb_parallel
from os.path import join
slb_parallel(
    join(bd, "swc_bdad_100pts_euclidean_icdm.csv"),
    out_csv=join(bd, "slb_dists.csv"),
    num_processes=8,  # num_processes can be set to the number of cores on your machine
)
The SLB is a somewhat crude approximation of the Gromov-Wasserstein distance, since it is only a lower bound, but it takes only about 6 seconds to compute for this dataset.
To get a better sense of the SLB accuracy, let us compare the SLB with the GW distance computed for each pair of cells in Tutorial 1:
[9]:
import plotly.io as pio
# Choose the adequate plotly renderer for visualizing plotly graphs in your system
pio.renderers.default = 'notebook_connected'
# pio.renderers.default = 'iframe'
from cajal.utilities import read_gw_dists, dist_mat_of_dict
from cajal.run_gw import cell_iterator_csv
import plotly.express
# Cell names, in the order in which they appear in the distance file from Tutorial 1
names, _ = zip(*cell_iterator_csv(join(bd, "swc_bdad_100pts_euclidean_icdm.csv")))
names = list(names)
# GW distances from Tutorial 1, in vector (non-squareform) layout
_, gw_dist_dict = read_gw_dists(join(bd, "swc_bdad_100pts_euclidean_GW_dmat.csv"), True)
gw100_dist_table = dist_mat_of_dict(gw_dist_dict, names, as_squareform=False)
# SLB values computed above, arranged using the same list of names
_, slb_dist_dict = read_gw_dists(join(bd, "slb_dists.csv"), True)
slb_dist_table = dist_mat_of_dict(slb_dist_dict, names, as_squareform=False)
# Scatter plot of the SLB against the GW distance
fig = plotly.express.scatter(
    x=slb_dist_table,
    y=gw100_dist_table,
    template="simple_white",
    labels={"x": "SLB", "y": "GW distance"},
)
fig.update_traces(marker={"size": 1})
fig.show()
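Alongside the scatter plot, a few summary numbers can help judge how closely a lower bound tracks the exact distance. The following is a minimal sketch using NumPy only; `summarize_lower_bound` is a hypothetical helper, not a CAJAL function, and the arrays used here are synthetic stand-ins. With the tutorial data one would pass `slb_dist_table` and `gw100_dist_table` instead.

```python
import numpy as np


def summarize_lower_bound(lower_bound, exact, tol=1e-9):
    """Summarize how a lower bound compares with the exact distance, pair by pair."""
    lower_bound = np.asarray(lower_bound, dtype=float)
    exact = np.asarray(exact, dtype=float)
    if lower_bound.shape != exact.shape:
        raise ValueError("both arrays must list the same pairs in the same order")
    gap = exact - lower_bound  # non-negative wherever the bound holds
    return {
        "fraction_bound_holds": float(np.mean(gap >= -tol)),
        "mean_gap": float(gap.mean()),
        "max_gap": float(gap.max()),
        "pearson_r": float(np.corrcoef(lower_bound, exact)[0, 1]),
    }


# Synthetic stand-in data: "exact" values and a bound between 50% and 100% of them.
rng = np.random.default_rng(0)
toy_exact = rng.uniform(1.0, 10.0, size=500)
toy_lower = toy_exact * rng.uniform(0.5, 1.0, size=500)
summary = summarize_lower_bound(toy_lower, toy_exact)
print(summary)
```

A high correlation together with a small mean gap would suggest that the bound is a useful proxy for ranking pairs, while a large maximum gap flags pairs for which the exact distance is still worth computing.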