Tutorial 1: Predicting the Molecular Type of Neurons

To demonstrate some of the main functionalities of CAJAL, here we perform a basic analysis of a set of neuronal morphological reconstructions obtained from the Allen Brain Atlas. To facilitate the analysis, we provide a compressed *.tar.gz archive containing the *.swc files of the 509 neurons used in this example, which can be downloaded directly from this link. In this tutorial we assume that the SWC files are located in the folder /home/jovyan/swc. More information about this dataset can be found on the Allen Brain Atlas website.
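
If you are starting from the compressed archive, you can extract it with Python's standard library before running the tutorial. This is a minimal sketch: the archive name below (swc.tar.gz) is a placeholder for whatever the downloaded file is called, and you may need to adjust the destination so that the extracted *.swc files end up in /home/jovyan/swc.

import tarfile

# Extract the downloaded archive of SWC reconstructions.
# "swc.tar.gz" is a placeholder name; point this at the file you actually downloaded.
with tarfile.open("/home/jovyan/swc.tar.gz", "r:gz") as archive:
    archive.extractall("/home/jovyan")

# After extraction, check that /home/jovyan/swc contains the 509 *.swc files.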

For this analysis, we focus on the morphology of the dendrites and exclude the axons of the neurons. To achieve this, we set structure_ids = [1,3,4], which tells CAJAL to sample points only from the soma (SWC structure ID 1) and the basal (3) and apical (4) dendrites. We sample 100 points from each neuron and compute the Euclidean distance between each pair of sampled points within that neuron using the following code:

[1]:
import cajal.sample_swc
import cajal.swc
from os.path import join
[2]:
bd = "/home/jovyan/" # Base directory
[5]:
cajal.sample_swc.compute_icdm_all_euclidean(
    infolder=join(bd, 'swc'),
    out_csv=join(bd, 'swc_bdad_100pts_euclidean_icdm.csv'),
    preprocess=cajal.swc.preprocessor_eu(
        structure_ids=[1,3,4],
        soma_component_only=False),
    n_sample=100)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▊| 508/509 [02:55<00:00,  2.89it/s]
[5]:
[]

Once the sampling is completed, we compute the Gromov-Wasserstein distance between each pair of neurons. The distance matrix is computed with the following code:

[6]:
import cajal.run_gw

cajal.run_gw.compute_gw_distance_matrix(
    join(bd, 'swc_bdad_100pts_euclidean_icdm.csv'),
    join(bd, 'swc_bdad_100pts_euclidean_GW_dmat.csv'),
    num_processes=8)
[6]:
(array([[  0.        ,  39.20467004,  27.99523385, ..., 117.42895201,
          25.28183545,  96.47219799],
        [ 39.20467004,   0.        ,  46.52045706, ..., 126.25769062,
          27.86360758, 118.10387766],
        [ 27.99523385,  46.52045706,   0.        , ...,  96.17779755,
          42.01646124,  81.50702811],
        ...,
        [117.42895201, 126.25769062,  96.17779755, ...,   0.        ,
         126.25084427,  55.87906252],
        [ 25.28183545,  27.86360758,  42.01646124, ..., 126.25084427,
           0.        , 110.11693068],
        [ 96.47219799, 118.10387766,  81.50702811, ...,  55.87906252,
         110.11693068,   0.        ]]),
 None)

We can visualize the resulting space of cell morphologies using UMAP:

[7]:
import plotly.io as pio

# Choose the appropriate Plotly renderer for displaying figures in your environment
pio.renderers.default = 'notebook_connected'
# pio.renderers.default = 'iframe'

import cajal.utilities
from umap import umap_
import plotly.express

# Read GW distance matrix
cells, gw_dist_dict = cajal.utilities.read_gw_dists(join(bd,"swc_bdad_100pts_euclidean_GW_dmat.csv"), header=True)
gw_dist = cajal.utilities.dist_mat_of_dict(gw_dist_dict, cells)

# Compute UMAP representation
reducer = umap_.UMAP(metric="precomputed", random_state=1)
embedding = reducer.fit_transform(gw_dist)

# Visualize UMAP
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells])
/home/patrick/code/CAJAL/venv/lib/python3.11/site-packages/umap/umap_.py:1858: UserWarning:

using precomputed metric; inverse_transform will be unavailable

/home/patrick/code/CAJAL/venv/lib/python3.11/site-packages/umap/umap_.py:1945: UserWarning:

n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.

Once the user has computed the Gromov-Wasserstein distances between cells, the cells can be clustered using standard techniques. The Louvain and Leiden algorithms are two commonly used methods for identifying communities in large networks; they can be adapted to finite metric spaces by constructing a k-nearest-neighbors graph on top of the metric space. CAJAL provides access to both clustering algorithms. Combined with a low-dimensional embedding such as UMAP, the resulting clusters can be plotted and inspected in two dimensions. Here, we use the Leiden algorithm to cluster the neurons based on their morphology:

[8]:
clusters = cajal.utilities.leiden_clustering(gw_dist, seed=1)
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in clusters])
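
The leiden_clustering helper takes care of the k-nearest-neighbors graph construction described above. Purely for illustration, the following sketch shows the general idea using scikit-learn, python-igraph, and leidenalg; it is not CAJAL's actual implementation, and the parameter choices (10 neighbors, the partition type) are assumptions, so the resulting labels may differ from those returned by leiden_clustering.

import igraph
import leidenalg
from sklearn.neighbors import kneighbors_graph

# Connect each cell to its 10 nearest neighbors in GW distance
knn = kneighbors_graph(gw_dist, n_neighbors=10, metric="precomputed")
sources, targets = knn.nonzero()
graph = igraph.Graph(n=gw_dist.shape[0],
                     edges=list(zip(sources.tolist(), targets.tolist())))

# Community detection on the kNN graph
partition = leidenalg.find_partition(graph,
                                     leidenalg.RBConfigurationVertexPartition,
                                     seed=1)
knn_clusters = partition.membership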

As expected, cells belonging to the same cluster have similar morphology. For example, let us visualize some of the cells in the brown cluster (cluster 8) with the Python package NAVis (which can be installed via pip install navis):

[77]:
import navis

cells_cluster_8 = [n for m, n in zip(clusters, cells) if m==8]
cluster_8 = [navis.read_swc(join(bd,"swc/" + n + ".swc")) for n in cells_cluster_8]

cluster_8[1].plot2d()
cluster_8[2].plot2d()
cluster_8[3].plot2d()
[77]:
(<Figure size 720x720 with 1 Axes>, <Axes3D: >)
[Output figures: 2D plots of three neurons from cluster 8]

NAVis offers many other useful features, including interactive 3D visualization of neurons via plot3d(). More information can be found in the NAVis documentation.
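
For example, assuming a Plotly-based 3D backend is available in your notebook environment, one of the cluster 8 neurons loaded above can be viewed interactively:

# Interactive 3D view of one neuron from cluster 8
cluster_8[1].plot3d()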

We can also compute the medoid of the cluster, i.e., the most central neuron of the cluster (and therefore a good representative of the morphologies present in it), and visualize it:

[78]:
medoid_8 = navis.read_swc(join(bd,'swc/' + cajal.utilities.identify_medoid(cells_cluster_8, gw_dist_dict) + ".swc"))
medoid_8.plot2d()
[78]:
(<Figure size 720x720 with 1 Axes>, <Axes3D: >)
[Output figure: 2D plot of the medoid neuron of cluster 8]

CAJAL allows the user to visualize which parts of each neuron contribute the most to the overall morphological distance. Here, we compare the medoid neurons of clusters 5 and 8. Regions with high distortion values highlight the main morphological differences between the two cells.

[92]:
from cajal.deformation_vis import navis_heatmap

cells_cluster_5 = [n for m, n in zip(clusters, cells) if m==5]
medoid_5 = navis.read_swc(join(bd,'swc/' + cajal.utilities.identify_medoid(cells_cluster_5, gw_dist_dict) + ".swc"))
(heatmap_5, heatmap_8) = navis_heatmap(medoid_5, 200, medoid_8, 200)
[93]:
heatmap_5.plot2d(color_by='distortion', palette='cool')
heatmap_8.plot2d(color_by='distortion', palette='cool')
[93]:
(<Figure size 720x720 with 1 Axes>, <Axes3D: >)
[Output figures: distortion heatmaps of the medoid neurons of clusters 5 and 8]

Let us now study the association of other neuronal features with the morphology of the neurons. The file CAJAL/data/cell_types_specimen_details.csv in the GitHub repository of CAJAL contains metadata for each of the neurons in this example, including the cortical layer, Cre driver line, and other attributes. Here we color the above UMAP representation by the cortical layer of each neuron:

[13]:
import pandas

metadata = pandas.read_csv(join(bd,"CAJAL/data/cell_types_specimen_details.csv"))
metadata.index = [str(m) for m in metadata["specimen__id"]]
metadata = metadata.loc[cells]

plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = metadata["structure__layer"])

As shown in the visualization, different cortical layers seem to be associated with specific regions of the cell morphology space. We can statistically quantify this association using the Laplacian score:

[14]:
import cajal.laplacian_score
import numpy
import pandas
from scipy.spatial.distance import squareform

# Build indicator matrix
layers = numpy.unique(metadata["structure__layer"])
indicator = (numpy.array(metadata["structure__layer"])[:,None] == layers)*1

# Compute the Laplacian score
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(indicator,
                                       gw_dist,
                                       numpy.median(squareform(gw_dist)),
                                       permutations = 5000,
                                       covariates = None,
                                       return_random_laplacians = False)[0])
laplacian.index = layers

print(laplacian)
     feature_laplacians  laplacian_p_values  laplacian_q_values
1              0.976417              0.0002             0.00060
2/3            0.968974              0.0002             0.00060
4              0.969925              0.0002             0.00060
5              0.987595              0.0006             0.00090
6a             0.992887              0.0012             0.00144
6b             0.992448              0.0014             0.00140

We observe that indeed all cortical layers are significantly associated with distinct regions of the cell morphology space, with false discovery rates (FDRs) < 0.05.

We could perform a similar analysis with other features. Alternatively, we could build a classifier to predict the value of each feature based on the position of the cells in the cell morphology space. For example, each neuron in the dataset is derived from a specific Cre driver line, which preferentially labels distinct neuronal types. Neurons from the same Cre driver line therefore tend to have similar morphologies, and a Laplacian score analysis would show that many Cre driver lines are significantly associated with distinct regions of the cell morphology space. As a result, it is possible to predict the Cre driver line of a neuron based on its morphological features.
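
As a sketch of what such an analysis could look like for the Cre driver lines (reusing the metadata, gw_dist, and imports from the cells above, and with the same parameter choices as the cortical-layer analysis):

# Indicator matrix for Cre driver lines, analogous to the cortical-layer analysis above
lines = numpy.unique(metadata["line_name"])
line_indicator = (numpy.array(metadata["line_name"])[:, None] == lines) * 1

# Laplacian scores, p-values, and q-values for each Cre driver line
line_laplacian = pandas.DataFrame(
    cajal.laplacian_score.laplacian_scores(line_indicator,
                                           gw_dist,
                                           numpy.median(squareform(gw_dist)),
                                           permutations=5000,
                                           covariates=None,
                                           return_random_laplacians=False)[0])
line_laplacian.index = lines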

To accomplish this, we train a nearest-neighbors classifier on the GW distance matrix and evaluate its accuracy using 7-fold cross-validation:

[15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cre_lines = numpy.array(metadata["line_name"])

clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True)
accuracy = cross_val_score(clf, X=gw_dist, y=cre_lines, cv=cv)

numpy.mean(accuracy)
/home/patrick/code/CAJAL/venv/lib/python3.11/site-packages/sklearn/model_selection/_split.py:776: UserWarning:

The least populated class in y has only 1 members, which is less than n_splits=7.

[15]:
np.float64(0.30253859534681454)
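
To put this accuracy in context, it can be compared against a naive baseline that always predicts the most frequent Cre line. This is a sketch; the exact baseline value depends on the class distribution in the metadata.

from sklearn.dummy import DummyClassifier

# Chance-level baseline: always predict the most common Cre line.
# DummyClassifier ignores X, so the precomputed distance matrix can be passed unchanged.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X=gw_dist, y=cre_lines, cv=cv)
numpy.mean(baseline)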

Similarly, we can compute the Matthews correlation coefficient of the classification, which gives appropriate weight to misclassifications of elements of the smaller classes:

[16]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef

cvp = cross_val_predict(clf, X=gw_dist, y=cre_lines, cv=cv)

print(matthews_corrcoef(cre_lines, cvp))
0.25780493262431675
/home/patrick/code/CAJAL/venv/lib/python3.11/site-packages/sklearn/model_selection/_split.py:776: UserWarning:

The least populated class in y has only 1 members, which is less than n_splits=7.