Tutorial 4: Integrating Multimodal Data with Ternary Plots

In Tutorial 2, we showed how to use the Laplacian score to determine whether a cell feature varies throughout the morphology space. Here, we discuss another technique to relate morphological distances to other cell features; in particular, we relate morphological distances to electrophysiological and molecular features of the cells. This technique allows us to identify variation in cells which is reflected well through at least one of the indicators, but is reflected poorly in the other indicators. It thus reveals “blind spots” in one of the measurement techniques, demonstrating the failure of one of the modes to capture distinctions which are well-reflected in the other modes.

Suppose that we have a set of cells \(c_1,\dots, c_n\), and we associate to each cell an electrophysiological feature vector \(e_1,\dots, e_n\in\mathbb{R}^k\), and a molecular feature vector \(m_1,\dots, m_n\in \mathbb{R}^s\). Both the sets \(\{e_1,\dots, e_n\}\) and \(\{m_1,\dots, m_n\}\) can be regarded as spaces, and we can interpret the distance \(\lVert e_i - e_j\rVert\) as the difference between cells \(c_i,c_j\) which is visible from the point of view of electrophysiology; likewise for the molecular feature vectors. Together with the Gromov-Wasserstein space of morphological distances, this gives three different manifestations of the cell space, where the notion of distance of each space reflects one aspect of the cell.

Speaking broadly, we expect cells that are very different morphologically and very different electrophysiologically to be very different in their gene expression, as well. If there are many cell pairs in the data where large morphological and large electrophysiological distances aren’t accompanied by large transcriptomic differences, it raises the question of why there is substantial variation in one variable not reflected in the others. For example, we might ask whether the transcriptomic methods are sufficiently comprehensive. This leads us to plot morphological, electrophysiological, and transcriptomic distances together.

This tutorial follows the dataset from the paper Phenotypic variation of transcriptomic cell types in mouse motor cortexl by Scala et. al. This folder contains the SWC files and metadata we will use for the analysis; we refer to it below as bd for “base directory”.

[1]:
import os
import scipy
from scipy.spatial.distance import squareform, pdist
from os.path import join
import pandas as pd
import cajal
from cajal.utilities import dist_mat_of_dict, read_gw_dists, louvain_clustering
from cajal.ternary import ternary_distance
import anndata as ad
import scanpy as sc
import umap
import matplotlib.pyplot as plt
import sklearn.manifold

import numpy as np
from cajal import run_gw, sample_swc, swc

# bd = '/home/jovyan/ternary_example'
[2]:
bd = '/mnt/d/Camara Lab Dropbox/Camara Lab\'s shared workspace/Web Linked Files/CAJAL/tutorial4/'

As in the other tutorials, we compute the intracell distance matrices and then the Gromov-Wasserstein distance matrix.

[ ]:
sample_swc.compute_icdm_all_euclidean(
    infolder=join(bd,'ternary_swcs'),
    out_csv=join(bd,'euclidean_100_icdm.csv'),
    n_sample=100,
    num_processes=20,
    preprocess=swc.preprocessor_eu([1,3,4],soma_component_only=False)
)
[ ]:
run_gw.compute_gw_distance_matrix(
    intracell_csv_loc=join(bd,'euclidean_100_icdm.csv'),
    gw_dist_csv_loc=join(bd,'euclidean_100_gw.csv'),
    num_processes=20,
    verbose=True
)

The file “m1_patchseq_meta_data.csv” contains the RNA families for the samples profiled in the study.

[4]:
metadata = pd.read_csv(join(bd,"m1_patchseq_meta_data.csv"), delimiter="\t", index_col=1)
# Start with 1329 rows × 32 columns

cells, gw_dist_dict = read_gw_dists(join(bd,'euclidean_100_gw.csv'),header=True)
[5]:
metadata = metadata.loc[cells]
metadata = metadata.loc[metadata['RNA family']!='low quality']
# We will filter out the RNA families with few elements.
metadata[(metadata['RNA family'].value_counts() > 30)[metadata['RNA family']].array]

# Labels for the classifier
print(metadata['RNA family'].unique())
['ET' 'IT' 'Pvalb' 'Vip' 'Sst' 'Lamp5' 'CT' 'Sncg' 'NP']

The files containing electrophysiological data and the exon counts have been cleaned to discard samples containing NaN values.

[6]:
m1_patchseq_ephys_features=pd.read_csv(join(bd,'m1_patchseq_ephys_features.csv'),index_col='cell id')
[7]:
ephys_data = pd.read_csv(join(bd,'ephys_data.csv'),index_col='cell id')
exon_data = pd.read_csv(join(bd,'exon_data.csv'), index_col='cell id')

We restrict to the samples for which we have all threee - morphological data, exon counts, and electrophysiology data.

[8]:
common_cells = list(set(metadata.index).intersection(set(exon_data.index)).intersection(ephys_data.index))
[9]:
# common_cells = list(set(metadata.index).intersection(set(exon_data.index)).intersection(ephys_data.index))
metadata = metadata.loc[common_cells]
exon_data = exon_data.loc[common_cells]
ephys_data = ephys_data.loc[common_cells]

We normalize the columns in the electrophysiology data so that they have mean 0 and standard deviation 1. The “distance” in the electrophysiology space between two samples is defined to be 1 minus the Pearson correlation coefficient between the data for the samples. This is not a metric, but it is a common technique.

[10]:
ephys_data = (ephys_data - ephys_data.mean())
ephys_data /= ephys_data.std()
ephys_dmat = (1-ephys_data.transpose().corr())
# ephys_dmat = pd.DataFrame(squareform(pdist(ephys_data)), columns = common_cells, index = common_cells)

The following heuristic transformation is meant to narrow the long tails at the end of the distribution of exon counts. As with the electrophysiology data, we use “correlation distance” as our notion of dissimilarity between samples.

[ ]:
exon_data = np.log(5000 * exon_data + 1)
exon_dmat = 1-((exon_data.transpose()).corr(method='pearson'))
[11]:
gw_dmat = dist_mat_of_dict(gw_dist_dict,common_cells)
[11]:
from cajal.ternary import ternary_distance_clusters
ternary_distance_clusters(
    exon_dmat.to_numpy(),
    'T', # Transcriptomic
    ephys_dmat.to_numpy(),
    'E', # Electrophysiological
    gw_dmat,
    'M', # Morphological
    density_estimation = 'gaussian_kde',
    bins = None,
    figsize = 4,
    clusters = metadata['RNA family'].to_numpy(),
    min_cluster_size = 30,
    mpl_params= { 's':1, 'alpha':.3 }
)
!
-3.1779289293017285
6.392055630800374
[11]:
(<Figure size 400x2800 with 7 Axes>,
 [<matplotlib.tri._tricontour.TriContourSet at 0x7ff38bb7e950>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38ba42aa0>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38bb2e5f0>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38bddc550>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38bc2d930>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38bdcfa90>,
  <matplotlib.tri._tricontour.TriContourSet at 0x7ff38bc1a5c0>])
../_images/notebooks_Example_4_20_2.png

Weighted Nearest Neighbors Analysis

In addition to investigating disparities in information between data sets, it is also important to integrate information from multiple data sets into a single coherent whole. For example, in the paper cited above, the authors integrate the transcriptomic space with the electrophysiological feature space via their technique of sparse reduced-rank regression. For this purpose, CAJAL provides the Weighted Nearest Neighbors algorithm developed in Integrated analysis of multimodal single-cell data by Hao, Hao, Andersen-Nissen et. al. and ported to Python by Dylan Kotliar. Because the WNN algorithm was designed for datasets in Euclidean space, CAJAL represents the Gromov-Wasserstein space as a metric subspace of a Euclidean subspace of the same intrinsic dimension of the data using the Isomap embedding.

The Weighted Nearest Neighbors algorithm accepts two spaces (in our example, we consider the morphology space and electrophysiology space) and returns a similarity function which integrates the two data spaces. The relative influence of the two spaces varies throughout the space, and is locally biased at any point toward the space which is contributing more information in a neighborhood of that point.

In this example we show how to integrate an RNA-seq dataset with the Gromov-Wasserstein distance matrix. First, we process the transcriptomic data via a standard pipeline from Scanpy, and we take the first principal components as the definition of our transcriptomic space.

[12]:
def scanpy_load_clean(cell_names):
    # This closely follows the scanpy tutorial

    adata = ad.read_csv(join(bd,'exon_data.csv'))
    adata.var["gm"] = adata.var_names.str.startswith("Gm")
    adata.var["Rik"] = adata.var_names.str.endswith("Rik")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["gm", "Rik"], inplace=True, log1p=True
    )
    sc.pp.filter_cells(adata, min_genes=100)
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.scrublet(adata)
    adata.layers["counts"] = adata.X.copy()
    sc.pp.normalize_total(adata)
    # Logarithmize the data
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata)
    indices = [ adata.obs.index.get_loc(index) for index in cell_names]
    indices =np.array(indices)
    X_pca = adata.obsm['X_pca'][indices]
    return X_pca
[14]:
transcriptome_pca=scanpy_load_clean(common_cells)
transcriptome_dmat = squareform(pdist(transcriptome_pca))

Before introducing WNN, let us look at the morphology space and the transcriptomic space separately.

In the first row, we plot a UMAP of the Gromov-Wasserstein morphology space; in the second row we plot the transcriptomic space.

The first column is colored by applying the Louvain clustering algorithm to the Gromov-Wasserstein space; the second column is colored by inferred transcriptomic type. (The authors of the paper “assign cell types [using] a nearest centroid classifier with Pearson correlation of log-expression across the most variable genes as a distance metric”.)

In the first and fourth plots, the excellent clustering is somewhat tautological, as both the UMAP and the labels are both constructed from the same high-dimensional data; these are our ideal baseline. The second and fourth plots show that the inferred clusters from Gromov-Wasserstein are poorly localized in the transcriptomic space, and conversely.

[15]:
morphology_labels = louvain_clustering(gw_dmat, 10)
RNA_family = metadata['RNA family'].values
[16]:
reducer_precomputed = umap.umap_.UMAP(metric='precomputed',random_state=4)
[17]:
def plot(ax, coords, labels, legend=True):
    # fig, ax = plt.subplots(figsize=(12,12))
    ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
    # a=reducer_precomputed.fit_transform(dmat)
    for label in np.unique(labels):
        xy = coords[np.nonzero([labels == label])[1]]
        ax.plot(xy[:,0], xy[:,1], marker = 'o', linestyle='', ms=3, label=label)
    if legend:
        ax.legend()
[18]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(14,14))
for i in range(2):
    dmat = [gw_dmat,transcriptome_dmat][i]
    v = reducer_precomputed.fit_transform(dmat)
    for j in range(2):
        labels = [morphology_labels,RNA_family][j]
        plot(ax[i,j],v,labels)
/home/patn/.local/lib/python3.10/site-packages/umap/umap_.py:1858: UserWarning: using precomputed metric; inverse_transform will be unavailable
  warn("using precomputed metric; inverse_transform will be unavailable")
/home/patn/.local/lib/python3.10/site-packages/umap/umap_.py:1945: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
/home/patn/.local/lib/python3.10/site-packages/umap/umap_.py:1858: UserWarning: using precomputed metric; inverse_transform will be unavailable
  warn("using precomputed metric; inverse_transform will be unavailable")
../_images/notebooks_Example_4_30_1.png

Beyond quantification, to quantify the degree to which the Gromov-Wasserstein morphology space preserves information about the transcriptomic space and conversely, we train a nearest-neighbors based classifier on one space to predict classes in the other space, and assess the accuracy and Matthews correlation coefficient of this regression.

[19]:
def predict_classes_st_kfold_knn(X, clusters, dist_metric, nn, k, seed):
    # KNN version of https://github.com/applied-systems-biology/Dynamic_SPHARM
    # /blob/2588d3632e9b3d99486a75bfb50d21fc7a7c8070/SPHARM/lib/classification.py#L155
    import random
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import StratifiedKFold,cross_val_score,cross_val_predict

    clf = KNeighborsClassifier(metric=dist_metric, n_neighbors=nn, weights="distance")
    cv = StratifiedKFold(n_splits=k, random_state=seed, shuffle=True)
    accuracy = pd.DataFrame({'Accuracy': cross_val_score(clf, X=X, y=clusters,cv=cv)})
    predicted = pd.DataFrame({'Actual class': clusters,
                              'Predicted class': cross_val_predict(clf, X=X, y=clusters, cv=cv)})
    return accuracy, predicted

def test_dmat(dmat, labels):
    from sklearn.metrics import matthews_corrcoef
    acc, pred = predict_classes_st_kfold_knn(dmat, labels, 'precomputed', 20, 7, 1)
    mcc = matthews_corrcoef(pred.iloc[:,0], pred.iloc[:,1])
    return(acc,mcc)
[20]:
print(test_dmat(gw_dmat, morphology_labels))
print(test_dmat(gw_dmat, RNA_family))
print(test_dmat(transcriptome_dmat, morphology_labels))
print(test_dmat(transcriptome_dmat, RNA_family))
(   Accuracy
0  0.921348
1  0.898876
2  0.943820
3  0.921348
4  0.920455
5  0.954545
6  0.909091, 0.9120086954113424)
(   Accuracy
0  0.516854
1  0.550562
2  0.505618
3  0.460674
4  0.556818
5  0.511364
6  0.431818, 0.39365793453570347)
(   Accuracy
0  0.303371
1  0.280899
2  0.404494
3  0.303371
4  0.261364
5  0.329545
6  0.295455, 0.1945609211452224)
(   Accuracy
0  0.887640
1  0.943820
2  0.910112
3  0.910112
4  0.943182
5  0.920455
6  0.931818, 0.9045697313839676)
/home/patn/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=7.
  warnings.warn(
/home/patn/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=7.
  warnings.warn(
/home/patn/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=7.
  warnings.warn(
/home/patn/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=7.
  warnings.warn(

On 7-fold cross-validation, the accuracy of a k-nearest neighbors classifier(k=20) trained to predict transcriptomic types from the morphology space ranges from 0.38 to 0.54; the Matthews correlation coefficient (a form of accuracy controlled for class size) is .35. The first and fourth tables are references that indicate what an ideal indicator would look like; the results are of course near perfect as the classifier is simply trained using the same information used to construct the clusters.

Once we have defined these spaces, we can apply the weighted nearest neighbors algorithm to construct the similarity function. We construct instances of the modality class for each input. This is the most time-intensive part of the algorithm.

[22]:
from cajal import wnn
[23]:
m_gw = cajal.wnn.Modality.of_dmat(gw_dmat)
m_transcriptome = cajal.wnn.Modality(transcriptome_pca)

For consistency with the paper, the weighted nearest neighbors function returns a similarity matrix with entries ranging between 0 and 1, with 1 being most similar and 0 being least similar. We invert this to give a distance matrix.

[67]:
integrated_space = 1-cajal.wnn.wnn([m_gw, m_transcriptome],k=20)

We visualize the integrated space using t-SNE. We color the points by their RNA family. It is visually clear that the WNN space strongly retains information about the transcriptomic type. The RNA families are well localized.

[68]:
tsne = sklearn.manifold.TSNE(n_components=2,
                             perplexity=30.0,
                             metric='precomputed',
                             init='random',
                             early_exaggeration=4.0,
                             random_state=4
                            )
wnn_embedding = tsne.fit_transform(integrated_space)

fig, ax = plt.subplots(figsize=(8,8))
plot(ax,wnn_embedding, RNA_family)
plt.show()
../_images/notebooks_Example_4_40_0.png

In previous examples we have used UMAP, but the similarity function returned by the weighted nearest neighbors algorithm is incompatible with UMAP; in particular, it is asymmetric. For this reason, the term “space” here is an abuse of language. It is possible to use UMAP, however, by symmetrizing the matrix. We will take this approach for consistency with our previous analyses.

[27]:
def symmetrize(a):
    a = a.copy()
    a[a == 0] = np.max(a)
    b = a + a.T
    b = np.minimum(a,b)
    b = np.minimum(b,a.T)
    d=np.zeros(a.shape[0],dtype=int)
    b[d,d]=0
    return np.array(b)

wnn_dmat = symmetrize(integrated_space)

The apparent clustering behavior is similar in the UMAP visualization.

[69]:
umap_coords = reducer_precomputed.fit_transform(wnn_dmat)
fig, ax = plt.subplots(figsize=(8,8))
plot(ax,umap_coords, RNA_family)
plt.title("RNA families in the integrated space")
plt.show()
../_images/notebooks_Example_4_44_0.png

The k-nearest neighbors regression confirms the apparent clustering.

[31]:
test_dmat(integrated_space, RNA_family)
[31]:
(   Accuracy
 0  0.842697
 1  0.831461
 2  0.820225
 3  0.797753
 4  0.920455
 5  0.875000
 6  0.852273,
 0.8164727885984441)

The clusters defined using the Gromov-Wasserstein metric are still readily visible in the wnn space.

[33]:
fig, ax = plt.subplots(figsize=(8,8))
plot(ax,umap_coords, morphology_labels)
plt.title("Louvain clusters under GW distance")
plt.show()
../_images/notebooks_Example_4_48_0.png

Again, using the k-nearest neighbors analysis, we can confirm that the Gromov-Wasserstein clusters are well-localized in the wnn space.

[34]:
test_dmat(integrated_space, morphology_labels)
[34]:
(   Accuracy
 0  0.865169
 1  0.786517
 2  0.820225
 3  0.752809
 4  0.840909
 5  0.863636
 6  0.761364,
 0.7824876832623626)

The weighted nearest neighbors technique can be used with more than two modalities. We carry out the same analysis to integrate electrophysiological data along with the morphological data and transcriptomic data. We cluster the neurons in the electrophysiology space, and then ask whether this clustering is well-preserved upon passing to the integrated space. There are two ways to construct an instance of the Modality class: either from a set of vectors in R^n (preferable when possible), or using a distance matrix. The input to the cajal.wnn.wnn is a list of Modality objects and a number of nearest neighbors to consider.

We clean the electrophysiology data following the Phenotypic Variation… paper, deleting a few unused electrophysiology variables, and normalizing each column to have the same variance.

[36]:
def clean_electrophys(ephys_data):
    remove_columns = [ 'Afterdepolarization (mV)', 'Burstiness', 'Rebound number of APs', 'Sag area (mV*s)', 'Sag time (s)',
                      'Wildness', 'AP Fano factor',  'Latency @ +20pA current (ms)', 'ISI Fano factor', 'AP amplitude average adaptation index', 'ISI average adaptation index', 'Spike frequency adaptation']
    ephys_data = ephys_data.drop(columns = remove_columns)
    for column in ['AP coefficient of variation', 'ISI coefficient of variation','ISI adaptation index', 'Latency (ms)']:
        assert np.all(ephys_data[column]>0)
        ephys_data[column] = ephys_data[column].apply(np.log)
    for col in ephys_data.columns:
        ephys_data[col] /= np.std(ephys_data[col])
    return ephys_data
[72]:
ephys_data = clean_electrophys(pd.read_csv(join(bd,'ephys_data.csv'),index_col='cell id'))
ephys_data= ephys_data.loc[common_cells].values
m_ephys = cajal.wnn.Modality(ephys_data)
integrated_space = 1-cajal.wnn.wnn([m_gw, m_transcriptome, m_ephys], 20)

Both by visual inspection of the UMAP, and by measuring the accuracy of a k-nearest neighbors based classifier, it is evident that the integrated space incorporates very well the information of each of the constituent spaces.

[73]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18,6))
coords = reducer_precomputed.fit_transform(symmetrize(integrated_space))
ephys_labels = louvain_clustering(squareform(pdist(ephys_data)), 10)
for i in range(3):
    # dmat = [gw_dmat, wnn_ephys_symmetric, ephys_eu_dmat][i]
    labels = [morphology_labels, RNA_family, ephys_labels][i]
    plot(ax[i],coords, labels)
plt.show()
../_images/notebooks_Example_4_55_0.png
[77]:
print("Matthews correlation coefficients with respect to each clustering")
print("Morphology: ", test_dmat(integrated_space, morphology_labels)[1])
print("RNA family: ", test_dmat(integrated_space, RNA_family)[1])
print("Electrophysiology: ", test_dmat(integrated_space, ephys_labels)[1])
Matthews correlation coefficients with respect to each clustering
Morphology:  0.7222868928516553
RNA family:  0.8750761788255491
Electrophysiology:  0.7773983110547608

The

[92]:
for dmat in [gw_dmat, squareform(pdist(transcriptome_pca)), squareform(pdist(ephys_data))]:
    for labels in [morphology_labels, RNA_family, ephys_labels]:
        print("%.3f" % test_dmat(dmat, labels)[1],end=' ')
    print("")
0.912 0.394 0.198
0.195 0.905 0.467
0.155 0.778 0.911
[93]:
for dmat in [1-cajal.wnn.wnn([m_gw, m_ephys],k=20), 1-cajal.wnn.wnn([m_transcriptome, m_ephys],k=20),1-cajal.wnn.wnn([m_gw, m_transcriptome],k=20)]:
    for labels in [morphology_labels, RNA_family, ephys_labels]:
        print("%.3f" % test_dmat(dmat, labels)[1],end=' ')
    print("")
0.800 0.746 0.756
0.221 0.920 0.793
0.782 0.816 0.447
[96]:
met_dmat =( 1-cajal.wnn.wnn([m_gw,m_transcriptome, m_ephys],k=20))
for labels in [morphology_labels, RNA_family, ephys_labels]:
    print("%.3f" % test_dmat(met_dmat, labels)[1],end=' ')
0.722 0.875 0.777