Second Lower Bound and Quantized Gromov-Wasserstein

slb_parallel_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, chunksize: int = 20) ndarray[Any, dtype[float64]]

Compute the SLB distance in parallel between all cells in cell_dms.

Parameters
  • cell_dms (Collection[ndarray[Any, dtype[float64]]]) – A collection of distance matrices. Probability distributions other than uniform are currently unsupported.

  • num_processes (int) – How many Python processes to run in parallel

  • chunksize (int) – How many SLB distances each Python process computes at a time

Returns

a square matrix giving pairwise SLB distances between points.

Return type

ndarray[Any, dtype[float64]]
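The docstring does not define SLB. Assuming it refers to the "second lower bound" on the Gromov-Wasserstein distance from the metric-matching literature, which compares the distributions of pairwise distances within each cell, a pure-Python sketch for the uniform-distribution case might look like the following. The helper names are hypothetical, and the library's actual implementation and normalization (for example a constant factor, or whether the diagonal zeros are included) may differ.

```python
from fractions import Fraction
from math import sqrt


def pairwise_distances(dm):
    """Sorted off-diagonal entries (upper triangle) of a square distance matrix."""
    n = len(dm)
    return sorted(dm[i][j] for i in range(n) for j in range(i + 1, n))


def quantile_l2(a, b):
    """L2 distance between the quantile functions of two sorted samples,
    where every value in a sample carries the same weight."""
    m, n = len(a), len(b)
    i = j = 0
    t = Fraction(0)
    total = 0.0
    while i < m and j < n:
        # Both quantile functions are step functions; advance to the next
        # breakpoint of either one and integrate the squared gap over it.
        t_a, t_b = Fraction(i + 1, m), Fraction(j + 1, n)
        t_next = min(t_a, t_b)
        total += float(t_next - t) * (a[i] - b[j]) ** 2
        if t_a == t_next:
            i += 1
        if t_b == t_next:
            j += 1
        t = t_next
    return sqrt(total)


def slb_sketch(dm_x, dm_y):
    """Hypothetical stand-in for one pairwise SLB computation."""
    return quantile_l2(pairwise_distances(dm_x), pairwise_distances(dm_y))


# Two toy "cells": a 3-4-5 triangle and a segment of length 6.
triangle = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
segment = [[0.0, 6.0], [6.0, 0.0]]
print(slb_sketch(triangle, segment))
```

Because the two matrices may have different sizes, the comparison goes through quantile functions rather than an entry-by-entry difference of sorted lists.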

slb_parallel(intracell_csv_loc: str, num_processes: int, out_csv: str, chunksize: int = 20) None

Compute the SLB distance in parallel between all cells in the CSV file intracell_csv_loc. The file is expected to follow the format described in cajal.run_gw.icdm_csv_validate().

Parameters
  • intracell_csv_loc (str) – path to a CSV file containing the intracell distance matrices of the cells to process

  • num_processes (int) – How many Python processes to run in parallel

  • out_csv (str) – path to the CSV file where the SLB distances will be written

  • chunksize (int) – How many SLB distances each Python process computes at a time

Return type

None

class quantized_icdm(cell_dm: ndarray[Any, dtype[float64]], p: ndarray[Any, dtype[float64]], num_clusters: int)

This class represents a “quantized” intracell distance matrix, i.e., a metric measure space which has been equipped with a given clustering; it contains additional data which allows for the rapid computation of pairwise GW distances across many cells. Users should only need to understand how to use the constructor.

Parameters
  • cell_dm (ndarray[Any, dtype[float64]]) – An intracell distance matrix in squareform.

  • p (ndarray[Any, dtype[float64]]) – A probability distribution on the points of the metric space

  • num_clusters (int) – How many clusters to subdivide the cell into. More clusters give higher accuracy at the cost of a longer computation.
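As an illustration of the two array inputs, the sketch below builds a squareform distance matrix and a uniform probability distribution from a few made-up 3D points, using only the standard library. The helper names and the points are hypothetical. The commented-out call assumes that quantized_icdm is importable from cajal.qgw and that NumPy is installed.

```python
from math import dist


def squareform_distance_matrix(points):
    """Full symmetric matrix of Euclidean distances, with zeros on the diagonal."""
    return [[dist(a, b) for b in points] for a in points]


def uniform_distribution(n):
    """Equal probability mass on each of n points."""
    return [1.0 / n] * n


# Hypothetical sample points on a cell.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
cell_dm = squareform_distance_matrix(points)
p = uniform_distribution(len(points))

# Assumed usage, not executed here:
# import numpy as np
# from cajal.qgw import quantized_icdm
# q = quantized_icdm(np.asarray(cell_dm), np.asarray(p), num_clusters=2)
```

The distribution p needs one entry per row of cell_dm, and its entries should sum to 1.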

quantized_gw_parallel(intracell_csv_loc: str, num_processes: int, num_clusters: int, out_csv: str, chunksize: int = 20, verbose: bool = False) None

Compute the quantized Gromov-Wasserstein distance in parallel between all cells in a family of cells.

Parameters
  • intracell_csv_loc (str) – path to a CSV file containing the cells to process

  • num_processes (int) – number of Python processes to run in parallel

  • num_clusters (int) – Each cell will be partitioned into num_clusters many clusters.

  • out_csv (str) – file path where a CSV file containing the quantized GW distances will be written

  • chunksize (int) – How many q-GW distances should be computed at a time by each parallel process.

  • verbose (bool) –

Return type

None
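To give a feel for what chunksize controls, the sketch below enumerates the unordered pairs of cells and groups them into batches of at most chunksize pairs, one batch being the unit of work handed to a process. This is only an illustration of the parameter's meaning; the function name is hypothetical, and how the library actually schedules the work is an assumption.

```python
from itertools import combinations, islice


def chunked_pairs(num_cells, chunksize):
    """Yield lists of at most `chunksize` index pairs (i, j) with i < j."""
    pairs = combinations(range(num_cells), 2)
    while True:
        chunk = list(islice(pairs, chunksize))
        if not chunk:
            return
        yield chunk


# 5 cells give 5 * 4 / 2 = 10 pairs, split here into batches of up to 4.
chunks = list(chunked_pairs(5, 4))
for chunk in chunks:
    print(chunk)
```

A larger chunksize means fewer, bigger batches (less scheduling overhead), while a smaller one spreads the work more evenly across processes.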

combined_slb_quantized_gw_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool, chunksize: int = 20)

Compute the SLB distance between each pair of cells in cell_dms. Based on this initial estimate of the distances, compute the quantized GW distance (with num_clusters many clusters) between each cell and its nearest neighbors, until the correct nearest-neighbors list is obtained for each cell with a high degree of confidence.

The idea is that for the sake of clustering we can avoid computing the precise pairwise distances between cells which are far apart, because the clustering will not be sensitive to changes in large distances. Thus, we want to compute as precisely as possible the pairwise GW distances for (say) the 30 nearest neighbors of each point, and use a rough estimation beyond that.

Parameters
  • cell_dms (Collection[ndarray[Any, dtype[float64]]]) – a list or tuple of square distance matrices

  • num_processes (int) – How many Python processes to run in parallel

  • num_clusters (int) – Each cell will be partitioned into num_clusters many clusters for the quantized Gromov-Wasserstein distance computation.

  • chunksize (int) – Number of pairwise cell distance computations done by each Python process at one time.

  • accuracy (float) – This is a real number between 0 and 1, inclusive.

  • nearest_neighbors (int) – The algorithm tries to compute the quantized GW distance only for pairs of cells in which one is among the nearest_neighbors nearest neighbors of the other; for all other pairs, the SLB distance is used as a rough estimate.

  • verbose (bool) –
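The selection rule described for nearest_neighbors can be sketched as follows: starting from a rough distance matrix (the SLB estimates), keep only those pairs in which one cell is among the k nearest neighbors of the other. The function name and toy data are hypothetical, and the sketch covers a single selection pass only; it does not model the iteration driven by accuracy.

```python
def pairs_to_refine(rough, k):
    """Return the set of pairs (i, j), i < j, where j is among the k nearest
    neighbors of i or i is among the k nearest neighbors of j."""
    n = len(rough)
    selected = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: rough[i][j])
        for j in others[:k]:
            selected.add((min(i, j), max(i, j)))
    return selected


# Toy rough distances: four "cells" at positions 0, 1, 3 and 10 on a line.
positions = [0.0, 1.0, 3.0, 10.0]
rough = [[abs(a - b) for b in positions] for a in positions]

print(sorted(pairs_to_refine(rough, 1)))
```

With k = 1, only 3 of the 6 possible pairs are selected for the more expensive computation; the remaining pairs would keep their rough estimate.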

combined_slb_quantized_gw(input_icdm_csv_location: str, gw_out_csv_location: str, num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool = False, chunksize: int = 20) None

This is a wrapper around cajal.qgw.combined_slb_quantized_gw_memory() that adds file I/O. For all parameters not described here, see the docstring for cajal.qgw.combined_slb_quantized_gw_memory().

Parameters
  • input_icdm_csv_location (str) – path to a CSV file of intracell distance matrices. For the expected format, see cajal.run_gw.icdm_csv_validate().

  • gw_out_csv_location (str) – Where to write the output GW distances.

  • num_processes (int) –

  • num_clusters (int) –

  • accuracy (float) –

  • nearest_neighbors (int) –

  • verbose (bool) –

  • chunksize (int) –

Returns

None.

Return type

None
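A call to the file-based wrapper might look like the sketch below, written against the signature shown above. The file names and all parameter values are hypothetical choices for illustration, not recommended settings, and the import path cajal.qgw is taken from the cross-references in this section.

```python
def run_example():
    """Hypothetical driver for the file-based wrapper."""
    # Imported lazily so this module can be loaded without CAJAL installed.
    from cajal.qgw import combined_slb_quantized_gw

    combined_slb_quantized_gw(
        input_icdm_csv_location="icdm.csv",        # hypothetical input path
        gw_out_csv_location="qgw_distances.csv",   # hypothetical output path
        num_processes=4,
        num_clusters=25,
        accuracy=0.95,          # must lie between 0 and 1, inclusive
        nearest_neighbors=30,
        verbose=False,
        chunksize=20,
    )
```

Since the function runs several Python processes, it is probably safest to call run_example() from under an `if __name__ == "__main__":` guard in a script.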