Second Lower Bound and Quantized Gromov-Wasserstein

slb_parallel_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, chunksize: int = 20) → ndarray[Any, dtype[float64]]

Compute the SLB distance in parallel between all cells in cell_dms.

Returns

a square matrix giving the pairwise SLB distances between cells.

Parameters

cell_dms (Collection[ndarray[Any, dtype[float64]]]) – A collection of distance matrices. Probability distributions other than uniform are currently unsupported.
num_processes (int) – How many Python processes to run in parallel
chunksize (int) – How many SLB distances each Python process computes at a time

Return type

ndarray[Any, dtype[float64]]
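
For orientation, the sketch below illustrates the general idea behind a second-lower-bound style comparison under uniform weights: each cell is summarized by the distribution of its pairwise distances, and two cells are compared by matching the quantiles of those two distributions. It is a pure-Python illustration and not the library's implementation. The names distance_distribution and slb_sketch are hypothetical, and the conventions used here (exponent 2, the factor 1/2, diagonal zeros included) are assumptions.

```python
from math import sqrt


def distance_distribution(dm):
    """Return the sorted list of all entries of a square distance matrix.

    Under the uniform distribution on n points, each of the n*n entries
    (diagonal zeros included) carries mass 1/(n*n).
    """
    return sorted(d for row in dm for d in row)


def slb_sketch(dm_a, dm_b):
    """Half the 2-Wasserstein distance between two distance distributions."""
    xs = distance_distribution(dm_a)
    ys = distance_distribution(dm_b)
    n, m = len(xs), len(ys)
    i = j = 0
    total = 0.0
    # Walk both quantile functions over [0, 1] with integer bookkeeping:
    # position t is measured in units of 1/(n*m), so xs[i] covers
    # [i*m, (i+1)*m) and ys[j] covers [j*n, (j+1)*n).
    t = 0
    while i < n and j < m:
        next_t = min((i + 1) * m, (j + 1) * n)
        total += (next_t - t) * (xs[i] - ys[j]) ** 2
        t = next_t
        if t == (i + 1) * m:
            i += 1
        if t == (j + 1) * n:
            j += 1
    return 0.5 * sqrt(total / (n * m))
```

Because only sorted lists of distances are compared, a quantity of this kind is much cheaper to evaluate than a full Gromov-Wasserstein computation, which is why it is useful as a first rough estimate.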

slb_parallel(intracell_csv_loc: str, num_processes: int, out_csv: str, chunksize: int = 20) → None

Compute the SLB distance in parallel between all cells in the CSV file intracell_csv_loc. The file is expected to follow the format described in cajal.run_gw.icdm_csv_validate().

Parameters

num_processes (int) – How many Python processes to run in parallel
chunksize (int) – How many SLB distances each Python process computes at a time
intracell_csv_loc (str) – path to a CSV file containing the intracell distance matrices of the cells to process
out_csv (str) – file path where the CSV file of SLB distances will be written

Return type

None
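
The roles of num_processes and chunksize can be illustrated with the standard library's pool interface, where each worker is handed chunksize pairs at a time and the results fill a symmetric matrix. This is a sketch of the general pattern only, not the library's code: pairwise_matrix and toy_distance are hypothetical names, and a thread pool stands in for a process pool so that the example runs anywhere (multiprocessing.Pool offers the same imap signature).

```python
from itertools import combinations
from multiprocessing.pool import ThreadPool


def toy_distance(pair):
    """Hypothetical stand-in for an expensive distance between two items."""
    a, b = pair
    return abs(a - b)


def pairwise_matrix(items, num_workers, chunksize=20):
    """Fill a symmetric matrix of pairwise distances using a worker pool."""
    n = len(items)
    index_pairs = list(combinations(range(n), 2))
    value_pairs = [(items[i], items[j]) for i, j in index_pairs]
    matrix = [[0.0] * n for _ in range(n)]
    with ThreadPool(num_workers) as pool:
        # imap preserves input order, so results line up with index_pairs.
        results = pool.imap(toy_distance, value_pairs, chunksize=chunksize)
        for (i, j), d in zip(index_pairs, results):
            matrix[i][j] = matrix[j][i] = float(d)
    return matrix
```

A larger chunksize reduces scheduling overhead per task, while a smaller one balances the load more evenly when individual computations vary in cost.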

class quantized_icdm(cell_dm: ndarray[Any, dtype[float64]], p: ndarray[Any, dtype[float64]], num_clusters: int)

This class represents a “quantized” intracell distance matrix, i.e., a metric measure space which has been equipped with a given clustering; it contains additional data which allows for the rapid computation of pairwise GW distances across many cells. Users should only need to understand how to use the constructor.

Parameters

cell_dm (ndarray[Any, dtype[float64]]) – An intracell distance matrix in squareform.
p (ndarray[Any, dtype[float64]]) – A probability distribution on the points of the metric space
num_clusters (int) – How many clusters to subdivide the cell into; the more clusters, the more accuracy, but the longer the computation.
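
To make the notion of a metric measure space equipped with a clustering concrete, the sketch below picks num_clusters representative points, assigns every point to its closest representative, and sums the probability mass in each cluster. It is an illustration under assumptions, not the class's actual construction: quantize_sketch is a hypothetical name, and farthest-point sampling is used here only as a simple stand-in for whatever clustering method the library applies.

```python
def quantize_sketch(cell_dm, p, num_clusters):
    """Summarize a metric measure space by num_clusters representatives.

    Returns (reps, labels, q, rep_dm): representative indices, the cluster
    label of each point, the total probability mass of each cluster, and the
    distance matrix restricted to the representatives.
    """
    n = len(cell_dm)
    reps = [0]  # start from point 0 (arbitrary choice)
    # Distance from each point to its nearest representative so far.
    nearest = list(cell_dm[0])
    while len(reps) < min(num_clusters, n):
        # Farthest-point sampling: add the point that is worst covered.
        new_rep = max(range(n), key=lambda i: nearest[i])
        reps.append(new_rep)
        nearest = [min(nearest[i], cell_dm[new_rep][i]) for i in range(n)]
    # Assign every point to the cluster of its closest representative.
    labels = [min(range(len(reps)), key=lambda k: cell_dm[reps[k]][i])
              for i in range(n)]
    q = [0.0] * len(reps)
    for i, k in enumerate(labels):
        q[k] += p[i]
    rep_dm = [[cell_dm[a][b] for b in reps] for a in reps]
    return reps, labels, q, rep_dm
```

The smaller matrix rep_dm and the cluster masses q are the kind of precomputed summary that lets many pairwise comparisons run quickly; using more clusters keeps more of the original geometry at a higher cost.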

quantized_gw_parallel(intracell_csv_loc: str, num_processes: int, num_clusters: int, out_csv: str, chunksize: int = 20, verbose: bool = False) → None

Compute the quantized Gromov-Wasserstein distance in parallel between all cells in a family of cells.

Parameters

intracell_csv_loc (str) – path to a CSV file containing the cells to process
num_processes (int) – number of Python processes to run in parallel
num_clusters (int) – Each cell will be partitioned into num_clusters many clusters.
out_csv (str) – file path where a CSV file containing the quantized GW distances will be written
chunksize (int) – How many q-GW distances should be computed at a time by each parallel process.
verbose (bool) –

Return type

None

combined_slb_quantized_gw_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool, chunksize: int = 20)

Compute the SLB distance between each pair of cells in cell_dms. Based on this initial estimate of the distances, compute the quantized GW distance between each cell and its nearest neighbors, using num_clusters many clusters, until the correct nearest-neighbors list is obtained for each cell with a high degree of confidence.

The idea is that, for the sake of clustering, we can avoid computing precise pairwise distances between cells which are far apart, because the clustering is not sensitive to changes in large distances. Thus, we want to compute as precisely as possible the pairwise GW distances for (say) the 30 nearest neighbors of each cell, and use a rough estimate beyond that.

Parameters

cell_dms (Collection[ndarray[Any, dtype[float64]]]) – a list or tuple of square distance matrices
num_processes (int) – How many Python processes to run in parallel
num_clusters (int) – Each cell will be partitioned into num_clusters many clusters for the quantized Gromov-Wasserstein distance computation.
chunksize (int) – Number of pairwise cell distance computations done by each Python process at one time.
accuracy (float) – This is a real number between 0 and 1, inclusive.
nearest_neighbors (int) – The algorithm tries to compute the quantized GW distance only for pairs of cells in which one cell is among the nearest_neighbors nearest neighbors of the other; for all other pairs, the SLB distance is used as a rough estimate.
verbose (bool) –
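
The strategy described above can be sketched as a single refinement pass: start from a matrix of rough estimates, rank each item's neighbors by those estimates, and recompute only the closest pairs with a more precise method. This is a simplified illustration, not the library's algorithm. The names refine_nearest_neighbors and precise_distance are hypothetical, and the sketch does not model the accuracy parameter or the repetition until the neighbor lists are trusted.

```python
def refine_nearest_neighbors(rough, precise_distance, nearest_neighbors):
    """Replace rough estimates by precise values for likely near neighbors.

    rough is a symmetric matrix (list of lists) of cheap estimates;
    precise_distance(i, j) is a hypothetical callback returning the more
    expensive distance between items i and j.
    """
    n = len(rough)
    refined = [row[:] for row in rough]
    done = set()
    for i in range(n):
        # Rank the other items by their rough distance to item i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: rough[i][j])
        for j in others[:nearest_neighbors]:
            pair = (min(i, j), max(i, j))
            if pair in done:
                continue  # already refined from the other side
            done.add(pair)
            refined[i][j] = refined[j][i] = precise_distance(i, j)
    return refined, done
```

With n items and k nearest neighbors, at most n*k precise computations are performed instead of n*(n-1)/2, which is where the savings come from when k is much smaller than n.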

combined_slb_quantized_gw(input_icdm_csv_location: str, gw_out_csv_location: str, num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool = False, chunksize: int = 20) → None

This is a wrapper around cajal.qgw.combined_slb_quantized_gw_memory() which adds the associated file I/O. For all parameters not described here, see the docstring for cajal.qgw.combined_slb_quantized_gw_memory().

Parameters

input_icdm_csv_location (str) – file path to a CSV file of intracell distance matrices. For the expected format, see cajal.run_gw.icdm_csv_validate().
gw_out_csv_location (str) – file path where the output GW distances will be written.
num_processes (int) –
num_clusters (int) –
accuracy (float) –
nearest_neighbors (int) –
verbose (bool) –
chunksize (int) –

Returns

None.

Return type

None