Combined SLB and Quantized GW Nearest Neighbors Algorithm

combined_slb_quantized_gw_memory(cell_dms: Collection[ndarray[Any, dtype[float64]]], cell_distributions: Optional[Iterable[ndarray[Any, dtype[float64]]]], num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool, chunksize: int = 20, exp_decay: float = 2.0, slb_bins: int = 5, sn: SamplingNumber = 200)

Compute a heuristic approximation to nearest neighbors of cells.

Compute the SLB distance between each pair of cells in cell_dms. Based on this initial estimate of the distances, compute the quantized GW distance (with num_clusters many clusters) between the cells estimated to be nearest to each other, until the correct nearest-neighbors list is obtained for each cell with a high degree of confidence.

The idea is that for the sake of clustering we can avoid computing the precise pairwise distances between cells which are far apart, because the clustering will not be sensitive to changes in large distances. Thus, we want to compute as precisely as possible the pairwise GW distances for (say) the 30 nearest neighbors of each point, and use a rough estimation beyond that.
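The idea above can be illustrated with a minimal, single-pass sketch. It is not the library's implementation, which iterates until a confidence target is met. The names `refine_nearest_neighbors`, `rough`, and `precise` are hypothetical: `rough` stands in for the SLB estimates and `precise` for the quantized GW computation.

```python
from typing import Callable, List


def refine_nearest_neighbors(
    rough: List[List[float]],
    precise: Callable[[int, int], float],
    k: int,
) -> List[List[float]]:
    """Replace the rough distance by a precise one for each cell's k nearest neighbors.

    Neighbors are ranked by the rough estimate; all other entries keep the
    rough value. This is a one-shot simplification of the documented idea.
    """
    n = len(rough)
    refined = [row[:] for row in rough]  # copy so the input is left untouched
    for i in range(n):
        # Rank the other cells by the rough estimate and keep the k closest.
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: rough[i][j])[:k]
        for j in nearest:
            d = precise(i, j)
            refined[i][j] = d
            refined[j][i] = d  # keep the matrix symmetric
    return refined


# Toy data: four "cells" on a line; the rough estimate is half the true distance.
points = [0.0, 1.0, 5.0, 6.0]
true_dist = lambda i, j: abs(points[i] - points[j])
rough_matrix = [[0.5 * true_dist(i, j) for j in range(4)] for i in range(4)]
refined_matrix = refine_nearest_neighbors(rough_matrix, true_dist, k=1)
```

In this toy run, only the pairs (0, 1) and (2, 3) are recomputed precisely; every other entry keeps its rough value.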

Parameters
  • cell_dms (Collection[ndarray[Any, dtype[float64]]]) – a list or tuple of square distance matrices

  • num_processes (int) – How many Python processes to run in parallel

  • num_clusters (int) – Each cell will be partitioned into num_clusters many clusters for the quantized Gromov-Wasserstein distance computation.

  • chunksize (int) – Number of pairwise cell distance computations done by each Python process at one time.

  • out_csv – path to a CSV file where the results of the computation will be written

  • accuracy (float) – This is a real number between 0 and 1, inclusive.

  • nearest_neighbors (int) – The algorithm tries to compute only the quantized GW distances between pairs of cells if one is within the first nearest_neighbors neighbors of the other; for all other values, the SLB distance is used to give a rough estimate.

  • exp_decay (float) – This parameter controls the number of cell pairs computed per iteration of the main loop. At each iteration, the error distribution of SLB vs QGW is re-estimated from the newly collected data, and this distribution informs the choice of which cell pairs to compute next and how many cell pairs still have to be computed. In each iteration, we create a list of all cell pairs which we think still have to be computed, order them by priority, and compute (1/exp_decay) of them, so the total number of iterations is logarithmic in the number of cell pairs. For example, when exp_decay=2.0, we propose a list of M cell pairs whose QGW distance should be computed, compute half of them, then recompute the inferred probability distribution and repeat. Each iteration has a constant overhead (a fraction of a second), which is likely to be insubstantial when large numbers of cells are involved; if each iteration takes 5-10 minutes, the per-iteration overhead is negligible.

  • error_distribution_sampling_method – Controls the approach used to infer the error distribution of SLB vs GW. If “nearest neighbors”, we use the computed GW values for the smallest SLB values to estimate the distribution. If SamplingNumber(n), we additionally sample n values at random from every half-percentile of the SLB distribution. This is more accurate and only adds a constant to the runtime, so it should be preferred.

  • cell_distributions (Optional[Iterable[ndarray[Any, dtype[float64]]]]) –

  • verbose (bool) –

  • slb_bins (int) –

  • sn (SamplingNumber) –
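To make the effect of exp_decay concrete, the following sketch counts loop iterations under a simplifying assumption: the list of pending cell pairs shrinks only by the pairs just computed. In the documented algorithm the list is re-estimated at every iteration, so real counts may differ. The function `count_iterations` is hypothetical and not part of the library.

```python
def count_iterations(num_pairs: int, exp_decay: float = 2.0) -> int:
    """Count iterations needed when each pass computes 1/exp_decay of the pending pairs.

    Assumption: the pending list is never enlarged or pruned by re-estimation.
    """
    if exp_decay <= 1.0:
        raise ValueError("exp_decay must be greater than 1 for the loop to be geometric")
    remaining = num_pairs
    iterations = 0
    while remaining > 0:
        # Compute at least one pair per pass so the loop always terminates.
        batch = max(1, int(remaining / exp_decay))
        remaining -= batch
        iterations += 1
    return iterations


iterations_for_1000 = count_iterations(1000, exp_decay=2.0)
iterations_for_million = count_iterations(10**6, exp_decay=2.0)
```

Under this assumption, a thousandfold increase in the number of pending pairs only adds about ten iterations when exp_decay=2.0, which is the logarithmic growth described for exp_decay.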

combined_slb_quantized_gw(input_icdm_csv_location: str, gw_out_csv_location: str, num_processes: int, num_clusters: int, accuracy: float, nearest_neighbors: int, verbose: bool = False, chunksize: int = 20, exp_decay: float = 2.0, slb_bins: int = 5, sn: SamplingNumber = 200) → None

Read icdms from file, call cajal.qgw.combined_slb_quantized_gw_memory(), write to file.

For all parameters not listed here see the docstring for cajal.qgw.combined_slb_quantized_gw_memory().

Parameters
  • input_icdm_csv_location (str) – File path to a CSV file. For the format of the icdm file, see cajal.run_gw.icdm_csv_validate().

  • gw_out_csv_location (str) – Where to write the output GW distances.

  • num_processes (int) –

  • num_clusters (int) –

  • accuracy (float) –

  • nearest_neighbors (int) –

  • verbose (bool) –

  • chunksize (int) –

  • exp_decay (float) –

  • slb_bins (int) –

  • sn (SamplingNumber) –

Returns

None.

Return type

None
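A call to the file-based entry point might look like the sketch below. The keyword names come from the signature above; the file paths and all numeric values are placeholders chosen for illustration, not recommended settings. The import is deferred into a helper (`run_example`, a hypothetical name) so that the snippet can be loaded without CAJAL installed.

```python
from typing import Any, Dict

# Placeholder arguments; the paths and numeric values are assumptions.
example_kwargs: Dict[str, Any] = {
    "input_icdm_csv_location": "icdm.csv",  # hypothetical input file
    "gw_out_csv_location": "gw_dists.csv",  # hypothetical output file
    "num_processes": 4,
    "num_clusters": 25,
    "accuracy": 0.95,  # must lie between 0 and 1, inclusive
    "nearest_neighbors": 30,
    "verbose": True,
}


def run_example(kwargs: Dict[str, Any] = example_kwargs) -> None:
    """Run the combined SLB / quantized GW computation with the arguments above."""
    # Deferred import: only needed when the computation is actually run.
    from cajal.qgw import combined_slb_quantized_gw

    combined_slb_quantized_gw(**kwargs)
```

The remaining parameters (chunksize, exp_decay, slb_bins, sn) are left at the defaults shown in the signature.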