Second Lower Bound and Quantized Gromov-Wasserstein
- slb_parallel_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, chunksize: int = 20) ndarray[Any, dtype[float64]]
Compute the SLB distance in parallel between all cells in cell_dms.
- Parameters
cell_dms (Collection[ndarray[Any, dtype[float64]]]) – A collection of distance matrices. Probability distributions other than uniform are currently unsupported.
num_processes (int) – How many Python processes to run in parallel
chunksize (int) – How many SLB distances each Python process computes at a time
- slb_parallel(intracell_csv_loc: str, num_processes: int, out_csv: str, chunksize: int = 20) None
Compute the SLB distance in parallel between all cells in the CSV file intracell_csv_loc. The file is expected to follow the format described in cajal.run_gw.icdm_csv_validate().
- class quantized_icdm(cell_dm: ndarray[Any, dtype[float64]], p: ndarray[Any, dtype[float64]], num_clusters: int)
This class represents a “quantized” intracell distance matrix, i.e., a metric measure space which has been equipped with a given clustering; it contains additional data which allows for the rapid computation of pairwise GW distances across many cells. Users should only need to understand how to use the constructor.
- Parameters
cell_dm (ndarray[Any, dtype[float64]]) – An intracell distance matrix in squareform.
p (ndarray[Any, dtype[float64]]) – A probability distribution on the points of the metric space
num_clusters (int) – How many clusters to subdivide the cell into. More clusters give higher accuracy at the cost of a longer computation.
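The constructor arguments above can be prepared roughly as follows. This is a minimal sketch, not CAJAL's own code: the helper names (`squareform_distance_matrix`, `uniform_distribution`, `build_quantized_icdm`) and the sample points are hypothetical, and the import path `cajal.qgw` for `quantized_icdm` is an assumption based on the module name used elsewhere on this page.

```python
import math


def squareform_distance_matrix(points):
    """Return the full n x n Euclidean distance matrix as nested lists."""
    return [[math.dist(a, b) for b in points] for a in points]


def uniform_distribution(n):
    """Return the uniform probability distribution on n points."""
    return [1.0 / n] * n


def build_quantized_icdm(points, num_clusters):
    """Build a quantized_icdm from raw points (hypothetical helper)."""
    # Local imports, so the helpers above work without NumPy or CAJAL.
    import numpy as np
    from cajal.qgw import quantized_icdm  # assumed import path

    cell_dm = np.asarray(squareform_distance_matrix(points), dtype=np.float64)
    p = np.asarray(uniform_distribution(len(points)), dtype=np.float64)
    return quantized_icdm(cell_dm, p, num_clusters)


# Hypothetical sample points standing in for points sampled from a cell.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dm = squareform_distance_matrix(points)
p = uniform_distribution(len(points))
```

In practice a cell would have many more sample points than num_clusters; the three points here only keep the example small.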
- quantized_gw_parallel(intracell_csv_loc: str, num_processes: int, num_clusters: int, out_csv: str, chunksize: int = 20, verbose: bool = False) None
Compute the quantized Gromov-Wasserstein distance in parallel between all cells in a family of cells.
- Parameters
intracell_csv_loc (str) – path to a CSV file containing the cells to process
num_processes (int) – number of Python processes to run in parallel
num_clusters (int) – Each cell will be partitioned into num_clusters many clusters.
out_csv (str) – file path where a CSV file containing the quantized GW distances will be written
chunksize (int) – How many q-GW distances should be computed at a time by each parallel process.
verbose (bool) –
- Return type
None
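To give a feel for what chunksize controls, the sketch below batches all unordered pairs of cells into groups of at most chunksize. It assumes that "all cells" means each unordered pair is computed once and that work is handed out in batches, in the same spirit as the `chunksize` argument of `multiprocessing.Pool.imap`; whether CAJAL batches its work exactly this way is not stated here. The function name `pair_chunks` is hypothetical.

```python
from itertools import combinations, islice


def pair_chunks(num_cells, chunksize=20):
    """Yield lists of at most `chunksize` unordered cell-index pairs."""
    pairs = combinations(range(num_cells), 2)
    while True:
        chunk = list(islice(pairs, chunksize))
        if not chunk:
            return
        yield chunk


# 10 cells give 10 * 9 / 2 = 45 pairs, split into batches of 20, 20 and 5.
chunks = list(pair_chunks(num_cells=10, chunksize=20))
chunk_sizes = [len(c) for c in chunks]
```

A larger chunksize means fewer, bigger batches per process, which usually reduces inter-process communication overhead but balances load less evenly.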
- combined_slb_quantized_gw_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool, chunksize: int = 20)
Compute the pairwise SLB distances between each pair of cells in cell_dms. Based on this initial estimate of the distances, compute the quantized GW distance (with num_clusters many clusters) between nearest neighbors until the correct nearest-neighbors list is obtained for each cell with a high degree of confidence.
The idea is that for the sake of clustering we can avoid computing the precise pairwise distances between cells which are far apart, because the clustering will not be sensitive to changes in large distances. Thus, we want to compute as precisely as possible the pairwise GW distances for (say) the 30 nearest neighbors of each point, and use a rough estimation beyond that.
- Parameters
cell_dms (Collection[ndarray[Any, dtype[float64]]]) – a list or tuple of square distance matrices
num_processes (int) – How many Python processes to run in parallel
num_clusters (int) – Each cell will be partitioned into num_clusters many clusters for the quantized Gromov-Wasserstein distance computation.
chunksize (int) – Number of pairwise cell distance computations done by each Python process at one time.
out_csv – path to a CSV file where the results of the computation will be written
accuracy (float) – This is a real number between 0 and 1, inclusive.
nearest_neighbors (int) – The algorithm tries to compute only the quantized GW distances between pairs of cells if one is within the first nearest_neighbors neighbors of the other; for all other values, the SLB distance is used to give a rough estimate.
verbose (bool) –
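The pair-selection idea behind nearest_neighbors can be sketched as follows: given a rough distance matrix (for example, from SLB), keep only those pairs where one cell is among the k nearest neighbors of the other. This is an illustration only, not CAJAL's algorithm, which additionally iterates until the neighbor lists are correct with the requested accuracy. The function name `pairs_to_refine` and the sample matrix are hypothetical.

```python
def pairs_to_refine(rough_dist, nearest_neighbors):
    """Return sorted unordered pairs (i, j), i < j, where one cell is among
    the `nearest_neighbors` closest cells of the other under `rough_dist`."""
    n = len(rough_dist)
    selected = set()
    for i in range(n):
        # Rank every other cell by its rough distance to cell i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: rough_dist[i][j],
        )
        for j in others[:nearest_neighbors]:
            selected.add((min(i, j), max(i, j)))
    return sorted(selected)


# Hypothetical rough distances between four cells.
rough = [
    [0.0, 1.0, 5.0, 9.0],
    [1.0, 0.0, 4.0, 8.0],
    [5.0, 4.0, 0.0, 2.0],
    [9.0, 8.0, 2.0, 0.0],
]
refine_k1 = pairs_to_refine(rough, nearest_neighbors=1)
refine_k2 = pairs_to_refine(rough, nearest_neighbors=2)
```

With one neighbor per cell, only two of the six pairs would need the more expensive quantized GW computation; the remaining pairs would keep their rough estimate.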
- combined_slb_quantized_gw(input_icdm_csv_location: str, gw_out_csv_location: str, num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool = False, chunksize: int = 20) None
This is a wrapper around cajal.qgw.combined_slb_quantized_gw_memory() with some associated file I/O. For all parameters not listed here, see the docstring for cajal.qgw.combined_slb_quantized_gw_memory().
- Parameters
input_icdm_csv_location (str) – file path to a CSV file. For the format of the icdm, see cajal.run_gw.icdm_csv_validate().
gw_out_csv_location (str) – Where to write the output GW distances.
num_processes (int) –
num_clusters (int) –
accuracy (float) –
nearest_neighbors (int) –
verbose (bool) –
chunksize (int) –
- Returns
None.
- Return type
None
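A call to this wrapper might look like the sketch below. The keyword names come from the signature above; the import path `cajal.qgw`, the helper names, and all argument values (4 processes, 25 clusters, accuracy 0.97) are illustrative assumptions, not recommendations. The check that nearest_neighbors is at least 1 is also an assumption; only the range of accuracy is stated in the documentation.

```python
def validate_combined_args(accuracy, nearest_neighbors):
    """Check arguments before starting a long computation (hypothetical helper)."""
    # Documented: accuracy is a real number between 0 and 1, inclusive.
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    # Assumption: at least one neighbor must be requested.
    if nearest_neighbors < 1:
        raise ValueError("nearest_neighbors must be at least 1")


def run_combined(input_csv, output_csv, accuracy=0.97, nearest_neighbors=30):
    """Run the combined SLB / quantized GW computation on a CSV of icdms."""
    from cajal.qgw import combined_slb_quantized_gw  # assumed import path

    validate_combined_args(accuracy, nearest_neighbors)
    combined_slb_quantized_gw(
        input_icdm_csv_location=input_csv,
        gw_out_csv_location=output_csv,
        num_processes=4,   # illustrative value
        num_clusters=25,   # illustrative value
        accuracy=accuracy,
        nearest_neighbors=nearest_neighbors,
        verbose=False,
        chunksize=20,
    )
```

Because the function starts several Python processes, a script would typically call `run_combined` from inside an `if __name__ == "__main__":` block, which is required on platforms where `multiprocessing` spawns fresh interpreters.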