Tutorial 2: Genetic Determinants of Neuronal Morphology

We will illustrate the utility of the Laplacian score in identifying genes that contribute to the neuronal plasticity in the C. elegans. This example utilizes a dataset consisting of 799 3D neuronal reconstructions of the C.elegans DVB neuron across various mutant and control strains during days 1 to 5 of adulthood. The dataset can be downloaded from the following folder. In this tutorial we assume that the SWC files are located in the folder CAJAL/data_worm/swc. The DVB neuron is an excitatory GABAergic motor interneuron located in the dorso-rectal ganglion of the worm, and is known to undergo post-developmental neurite outgrowth in males. This outgrowth alters the neuron’s morphology and synaptic connectivity, contributing to changes in the spicule protraction step of male mating behavior. More information about this dataset can be found at:

To begin our analysis, we calculate the Gromov-Wasserstein distance between each pair of cells. For the sake of time, here we just sample 50 points per cell. This computation typically requires 20-30 minutes to complete on a standard desktop computer. A larger number of sampled points would offer better results, but would also increase the computing time.

[2]:
import cajal.sample_swc
import cajal.swc
import cajal.run_gw

cajal.sample_swc.compute_icdm_all_geodesic(
    infolder="CAJAL/data_worm/swc/",
    out_csv="CAJAL/data_worm/c_elegans_icdm.csv",
    preprocess=cajal.swc.preprocessor_geo(
        structure_ids="keep_all_types"),
    n_sample=50,
    num_processes=8)  # num_processes can be set to the number of cores on your machine

cajal.run_gw.compute_gw_distance_matrix(
    "CAJAL/data_worm/c_elegans_icdm.csv",
    "CAJAL/data_worm/c_elegans_gw_dist.csv",
    num_processes=8)
100%|████████████████████████████████████████████████████████████████████████████████▉| 798/799 [00:31<00:00, 25.03it/s]
100%|█████████████████████████████████████████████████████████████████████████| 318801/318801 [02:09<00:00, 2460.13it/s]
[2]:
(array([[0.        , 5.11936112, 3.68814622, ..., 4.50253393, 3.74849009,
         2.41605872],
        [5.11936112, 0.        , 4.16953034, ..., 5.1718854 , 3.86677755,
         3.06617002],
        [3.68814622, 4.16953034, 0.        , ..., 5.37682889, 3.85930797,
         2.66918667],
        ...,
        [4.50253393, 5.1718854 , 5.37682889, ..., 0.        , 3.52210097,
         4.01724968],
        [3.74849009, 3.86677755, 3.85930797, ..., 3.52210097, 0.        ,
         3.07758098],
        [2.41605872, 3.06617002, 2.66918667, ..., 4.01724968, 3.07758098,
         0.        ]]),
 None)

We can generate a UMAP plot that visualizes the cell morphology space, with each point colored according to the age of each worm in days. The metadata for each neuron in this example is provided in the file CAJAL/data/c_elegans_features.csv, which can be found in the GitHub repository of CAJAL. This metadata includes information such as the age of the worm in days and the genotype of each gene (0: wild-type; 1: mutant).

[1]:
import plotly.io as pio
pio.renderers.default = 'iframe'

import cajal.utilities
import umap
import pandas
import plotly.express

# Read GW distance matrix
cells, gw_dist_dict = cajal.utilities.read_gw_dists("CAJAL/data_worm/c_elegans_gw_dist.csv", header=True)
gw_dist = cajal.utilities.dist_mat_of_dict(gw_dist_dict, cells)

# Compute UMAP representation
reducer = umap.UMAP(metric="precomputed", random_state=1)
embedding = reducer.fit_transform(gw_dist)

# Download metadata
metadata = pandas.read_csv("CAJAL/data_worm/c_elegans_features.csv", index_col = "cell_name")

# Visualize UMAP
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in metadata["day"]])
2023-07-20 21:18:46.439529: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/opt/conda/lib/python3.10/site-packages/umap/umap_.py:1780: UserWarning: using precomputed metric; inverse_transform will be unavailable
  warn("using precomputed metric; inverse_transform will be unavailable")
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Unsurprisingly, the age of the worm plays a significant role in shaping the morphology of its neurons. This is evident in the UMAP representation above, which reveals that neurons of different ages cluster in distinct regions of the UMAP. To quantify this association, we can use the Laplacian score:

[4]:
import cajal.laplacian_score
import numpy
from scipy.spatial.distance import squareform

laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(numpy.array(metadata["day"]).reshape(799,1),
                                       gw_dist,
                                       numpy.median(squareform(gw_dist)),
                                       permutations = 5000,
                                       covariates = None,
                                       return_random_laplacians = False)[0])

print(laplacian)
   feature_laplacians  laplacian_p_values  laplacian_q_values
0            0.951357              0.0002              0.0002

A very small p value suggests a strong association between the age of the worm and the morphology of the DVB neuron.

Moving forward, our goal is to identify mutations that impact the morphology of the DVB neuron. To achieve this, we will rely on the Laplacian score once again. However, it is essential to consider the unequal representation of worms with a given genotype across different ages in the dataset. To address this issue, we will account for the uneven distribution of ages for each genotype. As an example, we will investigate the impact of deleterious mutations in the unc-25 gene. Let us first look at their distribution in the cell morphology space:

[2]:
plotly.express.scatter(x=embedding[:,0],
                       y=embedding[:,1],
                       template="simple_white",
                       hover_name=[m + ".swc" for m in cells],
                       color = [str(m) for m in metadata["unc-25"]])

The UMAP representation reveals that cells with a deleterious mutation in unc-25 exhibit similar morphology, a finding supported by the small p-value of the Laplacian score of unc-25 in the cell morphology space:

[6]:
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(numpy.array(metadata["unc-25"]).reshape(799,1),
                                       gw_dist,
                                       numpy.median(squareform(gw_dist)),
                                       permutations = 5000,
                                       covariates = None,
                                       return_random_laplacians = False)[0])

print(laplacian)
   feature_laplacians  laplacian_p_values  laplacian_q_values
0            0.995027              0.0018              0.0018

However, most of the samples with a mutation in unc-25 were obrained from worms with ages 1 or 3 days:

[7]:
metadata.loc[metadata["unc-25"]==1,"day"].value_counts()
[7]:
1    18
3     6
Name: day, dtype: int64

This leads to the question: is the comparable morphology of neurons with a deleterious mutation in unc-25 attributed to the mutation itself or the similar age of the worms? To address this issue, we can employ the Laplacian score but treating the age of the worm as a covariate:

[8]:
laplacian = pandas.DataFrame(cajal.laplacian_score.laplacian_scores(numpy.array(metadata.iloc[:,0:11]),
                                       gw_dist,
                                       numpy.median(squareform(gw_dist)),
                                       permutations = 5000,
                                       covariates = numpy.array(metadata["day"]),
                                       return_random_laplacians = False)[0])
laplacian.index = metadata.columns.values.tolist()[0:11]

print(laplacian)
        feature_laplacians  laplacian_p_values  laplacian_q_values    beta_0  \
nrx-1             0.996896            0.005799            0.012757  1.000429
mir-1             1.000011            0.124175            0.151770  1.009661
unc-49            0.996668            0.007399            0.013564  1.004363
nlg-1             0.994696            0.001000            0.003666  0.960108
unc-25            0.995027            0.002799            0.007698  0.898466
unc-97            0.961754            0.000200            0.002200  0.993265
lim-6             1.000981            0.304539            0.334993  1.036849
lat-2             0.994027            0.000800            0.004399  1.008793
ptp-3             0.999590            0.080184            0.110253  0.985277
sup-17            0.997795            0.014597            0.022938  0.981026
pkd-2             1.001690            0.547690            0.547690  1.000995

          beta_1  beta_1_p_value  regression_coefficients_fstat_p_values  \
nrx-1   0.001071    4.695718e-01                            9.391437e-01
mir-1  -0.008178    7.174294e-01                            5.651412e-01
unc-49 -0.002882    5.790428e-01                            8.419143e-01
nlg-1   0.041295    2.050789e-03                            4.101577e-03
unc-25  0.102830    1.535813e-12                            3.071627e-12
unc-97  0.008190    2.834938e-01                            5.669877e-01
lim-6  -0.035374    9.925361e-01                            1.492786e-02
lat-2  -0.007340    6.945934e-01                            6.108133e-01
ptp-3   0.016133    1.312833e-01                            2.625666e-01
sup-17  0.020424    7.884423e-02                            1.576885e-01
pkd-2   0.000425    4.880074e-01                            9.760148e-01

        laplacian_p_values_post_regression  laplacian_q_values_post_regression
nrx-1                             0.005999                            0.021996
mir-1                             0.084383                            0.116027
unc-49                            0.007199                            0.015837
nlg-1                             0.006999                            0.019246
unc-25                            0.146771                            0.179386
unc-97                            0.000200                            0.002200
lim-6                             0.056789                            0.089239
lat-2                             0.000600                            0.003299
ptp-3                             0.181964                            0.200160
sup-17                            0.041392                            0.075885
pkd-2                             0.554289                            0.554289

Upon examining the table, we note that the q-value of unc-25 shifts from 0.008 to 0.18 after adjusting for the covariate effect. Consistent with this, the F-statistic suggests a considerable impact of the covariate on the Laplacian score of unc-25, as evidenced by the low p-value of the F-statistic.