Tutorial 1: Predicting the Molecular Type of Neurons

To demonstrate some of the main functionalities of CAJAL, here we perform some basic analysis on a set of neuron morphological reconstructions obtained from the Allen Brain Atlas. To facilitate the analysis, we provide a compressed *.tar.gz file containing the *.SWC files of 509 neurons used in this example, which can be downloaded directly from this link. In this tutorial we assume that the SWC files are located in the folder /home/jovyan/swc. More information about this dataset can be found at:

For this analysis, we focus on the morphology of the dendrites and exclude the axons of the neurons. To achieve this, we set structure_ids = [1,3,4], which tells CAJAL to only sample points from the soma and the basal and apical dendrites. We sample 100 points from each neuron and compute the Euclidean distance between each pair of points in that neuron using the following code:

[2]:
import cajal.sample_swc
import cajal.swc

cajal.sample_swc.compute_icdm_all_euclidean(
    infolder="/home/jovyan/swc",
    out_csv="/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",
    preprocess=cajal.swc.preprocessor_eu(
        structure_ids=[1,3,4],
        soma_component_only=False),
    n_sample=100,
    num_processes=8)  # num_processes can be set to the number of cores on your machine
100%|████████████████████████████████████████████████████████████████████████████████▊| 508/509 [06:02<00:00,  1.40it/s]
[2]:
[]

Once the sampling is completed, we compute the Gromov-Wasserstein distance between each pair of neurons. To compute the Gromov-Wasserstein distance matrix we use the code:

[3]:
import cajal.run_gw

cajal.run_gw.compute_gw_distance_matrix(
    "/home/jovyan/swc_bdad_100pts_euclidean_icdm.csv",
    "/home/jovyan/swc_bdad_100pts_euclidean_GW_dmat.csv",
    num_processes=8)
100%|██████████████████████████████████████████████████████████████████████████| 129286/129286 [03:52<00:00, 556.95it/s]
[3]:
(array([[  0.        ,  76.53525355,  48.81215985, ...,  36.25765651,
          39.63267218, 107.27192268],
        [ 76.53525355,   0.        ,  90.55259238, ...,  69.27173625,
          82.74822498,  50.54451328],
        [ 48.81215985,  90.55259238,   0.        , ...,  26.48503494,
          16.99102489, 129.81156708],
        ...,
        [ 36.25765651,  69.27173625,  26.48503494, ...,   0.        ,
          21.15960915, 107.41792624],
        [ 39.63267218,  82.74822498,  16.99102489, ...,  21.15960915,
           0.        , 121.93211717],
        [107.27192268,  50.54451328, 129.81156708, ..., 107.41792624,
         121.93211717,   0.        ]]),
 None)

We can visualize the resulting space of cell morphologies using UMAP:

[1]:
import plotly.io as pio
pio.renderers.default = 'iframe'

import cajal.utilities
import umap
import plotly.express

# Read GW distance matrix
cells, gw_dist_dict = cajal.utilities.read_gw_dists("/home/jovyan/swc_bdad_100pts_euclidean_GW_dmat.csv", header=True)
gw_dist = cajal.utilities.dist_mat_of_dict(gw_dist_dict, cells)

# Compute UMAP representation
reducer = umap.UMAP(metric="precomputed", random_state=1)
embedding = reducer.fit_transform(gw_dist)

# Visualize UMAP
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells])
2023-07-20 21:31:24.609344: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/opt/conda/lib/python3.10/site-packages/umap/umap_.py:1780: UserWarning: using precomputed metric; inverse_transform will be unavailable
  warn("using precomputed metric; inverse_transform will be unavailable")
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Once the user has computed the Gromov-Wasserstein distances between cells it is possible to cluster the cells using standard clustering techniques. The Louvain and Leiden clustering algorithms are two commonly used algorithms to identify communities within large networks; they can be adapted to finite metric spaces by constructing a k-nearest-neighbors graph on top of the metric space. CAJAL provides access to both of these clustering algorithms. When combined with a low-dimensional embedding tool such as the UMAP algorithm, the user can plot the clusters in a 2-dimensional embedding and visualize them. Here, we use the Leiden algorithm to cluster the neurons based on their morphology:

[2]:
clusters = cajal.utilities.leiden_clustering(gw_dist, seed=1)
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in clusters])

As expected, cells belonging to the same cluster have similar morphology. For example, let us visualize some of the cells in the brown cluster (cluster 8) using the Python package NAVis (which can be installed using pip install navis):

[6]:
import navis

cells_cluster_8 = [n for m, n in zip(clusters, cells) if m==8]
cluster_8 = [navis.read_swc("/home/jovyan/swc/" + n + ".swc") for n in cells_cluster_8]

cluster_8[1].plot2d()
cluster_8[2].plot2d()
cluster_8[3].plot2d()
[6]:
(<Figure size 720x720 with 1 Axes>, <Axes3DSubplot: >)
../_images/notebooks_Example_1_9_1.png
../_images/notebooks_Example_1_9_2.png
../_images/notebooks_Example_1_9_3.png

NAVis offers many other great functionalities, including interactive 3D visualizations of the neurons. More information about those functionalities can be found at the NAVis documentation.

We can also compute the medoid of the cluster, i. e. the most central neuron of the cluster (and therefore a good representative of the morphologies present in the cluster), and visualize it:

[7]:
medoid = navis.read_swc("/home/jovyan/swc/" +
                        cajal.utilities.identify_medoid(cells_cluster_8, gw_dist_dict) +
                        ".swc")
medoid.plot2d()
[7]:
(<Figure size 720x720 with 1 Axes>, <Axes3DSubplot: >)
../_images/notebooks_Example_1_11_1.png

The file CAJAL/data/cell_types_specimen_details.csv in the GitHub repository of CAJAL contains metadata for each of the neurons in this example, including the layer, Cre line, etc. Here we color the above UMAP representation by the cortical layer of each neuron:

[3]:
import pandas

metadata = pandas.read_csv("CAJAL/data/cell_types_specimen_details.csv")
metadata.index = [str(m) for m in metadata["specimen__id"]]
metadata = metadata.loc[cells]

plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = metadata["structure__layer"])

As shown in the visualization, different cortical layers seem to be associated with specific regions of the cell morphology space. We can quantify statistically the association using the Laplacian score:

[9]:
import cajal.laplacian_score
import numpy
import pandas
from scipy.spatial.distance import squareform

# Build indicator matrix
layers = numpy.unique(metadata["structure__layer"])
indicator = (numpy.array(metadata["structure__layer"])[:,None] == layers)*1

# Compute the Laplacian score
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(indicator,
                                       gw_dist,
                                       numpy.median(squareform(gw_dist)),
                                       permutations = 5000,
                                       covariates = None,
                                       return_random_laplacians = False)[0])
laplacian.index = layers

print(laplacian)
     feature_laplacians  laplacian_p_values  laplacian_q_values
1              0.976439              0.0002             0.00048
2/3            0.968968              0.0002             0.00048
4              0.970067              0.0002             0.00048
5              0.987542              0.0002             0.00048
6a             0.992997              0.0020             0.00240
6b             0.992043              0.0022             0.00220

We observe that indeed all cortical layers are significantly associated with distinct regions of the cell morphology space with false discovery rates (FDRs) < 0.05.

We could perform a similar analysis with other features. Alternatively, we could build a classifier to predict the value of each feature based on the position of the cells in the cell morphology space. For example, each neuron in the dataset is derived from a specific Cre driver line, which preferentially labels distinct neuronal types. Neurons from the same Cre driver line therefore tend to have similar morphologies, and a Laplacian score analysis would show that many Cre driver lines are significantly associated with distinct regions of the cell morphology space. As a result, it is possible to predict the Cre driver line of a neuron based on its morphological features.

To accomplish this, we train a nearest-neighbors classifier on the GW distance matrix and evaluate its accuracy using 7-fold cross-validation:

[10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cre_lines = numpy.array(metadata["line_name"])

clf = KNeighborsClassifier(metric="precomputed", n_neighbors=10, weights="distance")
cv = StratifiedKFold(n_splits=7, shuffle=True)
accuracy = cross_val_score(clf, X=gw_dist, y=cre_lines,cv=cv)

numpy.mean(accuracy)
/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_split.py:684: UserWarning:

The least populated class in y has only 1 members, which is less than n_splits=7.

[10]:
0.2829963035442487

Similarly, we can compute the Matthews correlation coefficient of the classification, which appropriately weights the error arising from misclassifying elements of smaller classes:

[11]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef

cvp = cross_val_predict(clf, X=gw_dist, y=cre_lines, cv=cv)

print(matthews_corrcoef(cvp, cre_lines))
0.23180785506113458
/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_split.py:684: UserWarning:

The least populated class in y has only 1 members, which is less than n_splits=7.