Running Gromov-Wasserstein

class cajal.run_gw.Distribution

A run_gw.Distribution is a numpy array of shape (n,), with values nonnegative and summing to 1, where n is the number of points in the set.

Value

numpy.typing.NDArray[numpy.float_]
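
As a minimal sketch of the conditions above (not taken from the CAJAL source), the uniform distribution on n points is one valid Distribution. The helper names uniform_distribution and is_distribution are hypothetical and only illustrate the stated constraints.

```python
import numpy as np


def uniform_distribution(n: int) -> np.ndarray:
    """Return the uniform probability distribution on n points (hypothetical helper)."""
    return np.full(n, 1.0 / n, dtype=np.float64)


def is_distribution(a: np.ndarray, tol: float = 1e-9) -> bool:
    """Check the stated conditions: shape (n,), nonnegative values, sum equal to 1."""
    return a.ndim == 1 and bool(np.all(a >= 0)) and abs(float(a.sum()) - 1.0) <= tol


dist = uniform_distribution(4)
```

The tolerance in is_distribution is an assumption; floating-point sums rarely equal 1 exactly.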

class cajal.run_gw.DistanceMatrix

A DistanceMatrix is a numpy array of shape (n, n) where n is the number of points in the space; it should be a symmetric nonnegative matrix with zeros along the diagonal.

Value

numpy.typing.NDArray[numpy.float_]
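
The following sketch builds a matrix that satisfies these conditions from a small point cloud and checks each condition in turn. The Euclidean metric and the helper names are assumptions made for illustration; the documentation above does not prescribe how the matrix is produced.

```python
import numpy as np


def euclidean_distance_matrix(points: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between the rows of `points` (hypothetical helper)."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


def is_distance_matrix(d: np.ndarray, tol: float = 1e-9) -> bool:
    """Check: square, symmetric, nonnegative, zeros along the diagonal."""
    return (
        d.ndim == 2
        and d.shape[0] == d.shape[1]
        and bool(np.allclose(d, d.T, atol=tol))
        and bool(np.all(d >= 0))
        and bool(np.allclose(np.diag(d), 0.0, atol=tol))
    )


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dmat = euclidean_distance_matrix(points)
```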

icdm_csv_validate(intracell_csv_loc: str) None

Raise an exception if the file at intracell_csv_loc fails the formatting tests.

If the formatting tests pass, the function returns None.

Parameters

intracell_csv_loc (str) – The (full) file path for the CSV file containing the intracell distance matrix.

Return type

None

The file format for an intracell distance matrix is as follows:

  • A line whose first character is ‘#’ is discarded as a comment.

  • The first line that is not a comment is discarded as a “header”; this line may contain the titles of the columns.

  • Values are separated by commas. Whitespace is not a separator.

  • The first value in the first non-comment line should be the string ‘cell_id’, and every later value in the first column should be a unique identifier for its cell.

  • All values after the first column should be floats.

  • Not including the cell id in the first column, each row except the header should contain the entries of an intracell distance matrix lying strictly above the diagonal, as in the footnotes of https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html
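
The format above can be illustrated with a short standard-library sketch that renders two toy cells as CSV text. The helper names, the column titles after ‘cell_id’, and the assumption that every cell has the same number of points are all hypothetical choices for the example, not requirements stated by the library.

```python
import io
import math
from itertools import combinations


def condensed_distances(points):
    """Distances strictly above the diagonal, row by row: d(0,1), d(0,2), ..., d(n-2,n-1)."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def icdm_csv_text(cells):
    """Render a {cell_id: points} mapping as CSV text in the layout described above."""
    n = len(next(iter(cells.values())))  # assumption: all cells have n points
    # Column titles other than 'cell_id' are hypothetical labels; the header is discarded.
    header = ["cell_id"] + [f"d_{i}_{j}" for i, j in combinations(range(n), 2)]
    out = io.StringIO()
    out.write("# example intracell distance matrices\n")  # comment line
    out.write(",".join(header) + "\n")
    for cell_id, points in cells.items():
        row = [cell_id] + [repr(d) for d in condensed_distances(points)]
        out.write(",".join(row) + "\n")
    return out.getvalue()


cells = {
    "cell_a": [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    "cell_b": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)],
}
csv_text = icdm_csv_text(cells)
```

For three points, each data row holds the cell id followed by three distances, in the same order as the condensed form used by scipy.spatial.distance.squareform.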

cell_iterator_csv(intracell_csv_loc: str) Iterator[tuple[str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]]]

Parameters

intracell_csv_loc (str) – A full file path to a csv file.

Returns

An iterator over the cells in the CSV file, given as tuples of the form (name, dmat), where dmat is the cell's intracell distance matrix in squareform.

Return type

Iterator[tuple[str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]]]
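
To make the (name, dmat) output concrete, here is a rough sketch of how rows in the file format described for icdm_csv_validate could be turned into squareform matrices. It is an illustrative reimplementation under stated assumptions, not the library's code; iter_cells and condensed_to_square are hypothetical names, and the sketch reads from a list of lines rather than a file path.

```python
import math

import numpy as np


def condensed_to_square(condensed):
    """Rebuild a symmetric (n, n) matrix from its strictly-upper-triangular entries."""
    m = len(condensed)
    n = (1 + math.isqrt(1 + 8 * m)) // 2  # solve m = n * (n - 1) / 2 for n
    if n * (n - 1) // 2 != m:
        raise ValueError(f"{m} values cannot form a condensed distance matrix")
    square = np.zeros((n, n), dtype=np.float64)
    square[np.triu_indices(n, k=1)] = condensed  # row-major order above the diagonal
    return square + square.T


def iter_cells(lines):
    """Yield (name, dmat) pairs, skipping comment lines and the header line."""
    rows = (ln.strip() for ln in lines if ln.strip() and not ln.startswith("#"))
    next(rows, None)  # discard the header
    for row in rows:
        name, *values = row.split(",")
        yield name, condensed_to_square([float(v) for v in values])


example_lines = [
    "# comment",
    "cell_id,d_0_1,d_0_2,d_1_2",
    "cell_a,5.0,10.0,5.0",
]
cells_read = list(iter_cells(example_lines))
```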

cell_pair_iterator_csv(intracell_csv_loc: str, chunk_size: int) Iterator[tuple[tuple[int, str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]], tuple[int, str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]]]]

Iterate over pairs of cells in a CSV in a memory efficient way.

This is almost equivalent to itertools.combinations(cell_iterator_csv(intracell_csv_loc),2) but with more efficient file IO.

Parameters
  • intracell_csv_loc (str) – A full file path to a csv file.

  • chunk_size (int) – How many lines to read from the file at a time. Does not affect output.

Returns

An iterator over pairs of cells. Each entry has the form ((indexA, nameA, distance_matrixA), (indexB, nameB, distance_matrixB)), where indexA and indexB are the line numbers of the two cells in the file and indexA < indexB.

Return type

Iterator[tuple[tuple[int, str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]], tuple[int, str, numpy.ndarray[Any, numpy.dtype[numpy.float64]]]]]
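
The shape of the output can be sketched with itertools.combinations over an in-memory list. This is only an illustration of the pairing order: it holds every cell in memory, which is exactly what the chunked file reading is meant to avoid, and it uses 0-based list positions as indices, which is an assumption. The function name cell_pairs and the placeholder strings standing in for distance matrices are hypothetical.

```python
from itertools import combinations


def cell_pairs(cells):
    """Yield ((indexA, nameA, dmatA), (indexB, nameB, dmatB)) with indexA < indexB."""
    indexed = ((i, name, dmat) for i, (name, dmat) in enumerate(cells))
    yield from combinations(indexed, 2)


# Placeholder strings stand in for the distance matrices.
cells = [("a", "dmat_a"), ("b", "dmat_b"), ("c", "dmat_c")]
pairs = list(cell_pairs(cells))
```

With n cells this produces n(n-1)/2 pairs, so each unordered pair is visited once.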

gw_pairwise_parallel(cells: list[tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], run_gw.Distribution]], num_processes: int, names: Optional[list[str]] = None, gw_dist_csv: Optional[str] = None, gw_coupling_mat_csv: Optional[str] = None, return_coupling_mats: bool = False) tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], Optional[list[tuple[int, int, cajal.run_gw.Matrix]]]]

Compute the pairwise Gromov-Wasserstein distances between cells.

The coupling matrices can optionally be computed as well. If file names are supplied, the output is also written to those files. When computing a large number of coupling matrices, writing them to a file instead of returning them reduces memory consumption.

Parameters
  • cells (list[tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], run_gw.Distribution]]) – A list of pairs (A,a) where A is a squareform intracell distance matrix and a is a probability distribution on the points of A.

  • num_processes (int) – How many Python processes to run in parallel for the computation.

  • names (Optional[list[str]]) – A list of unique cell identifiers, where names[i] is the identifier for cell i. This argument is required if gw_dist_csv is not None, or if gw_coupling_mat_csv is not None, and is ignored otherwise.

  • gw_dist_csv (Optional[str]) – If this field is a string giving a file path, the GW distances will be written to this file. A list of cell names must be supplied.

  • gw_coupling_mat_csv (Optional[str]) – If this field is a string giving a file path, the GW coupling matrices will be written to this file. A list of cell names must be supplied.

  • return_coupling_mats (bool) – Whether the function should return the coupling matrices. For a large number of cells the couplings are large and memory consumption will be high. This argument is independent of whether the coupling matrices are written to a file; one may return the coupling matrices, write them to a file, both, or neither.

Returns

If return_coupling_mats is True, returns (gw_dmat, couplings), where gw_dmat is a square matrix whose (i,j) entry is the GW distance between cells i and j, and couplings is a list of tuples (i, j, coupling_mat), where i and j are positions in the list cells and coupling_mat is a coupling matrix between those two cells. If return_coupling_mats is False, returns (gw_dmat, None).

Return type

tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], Optional[list[tuple[int, int, cajal.run_gw.Matrix]]]]
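
A hedged usage sketch follows, based only on the signature documented above. It prepares the cells argument as a list of (A, a) pairs with uniform distributions and calls gw_pairwise_parallel only if cajal can be imported. The helper with_uniform_distributions and the two toy matrices are hypothetical, and the choice of uniform weights is an assumption, not something this page prescribes.

```python
import numpy as np

try:  # cajal may not be installed where this sketch is run
    from cajal.run_gw import gw_pairwise_parallel
except ImportError:
    gw_pairwise_parallel = None


def with_uniform_distributions(dmats):
    """Pair each squareform matrix A with the uniform distribution on its points."""
    return [(A, np.full(A.shape[0], 1.0 / A.shape[0])) for A in dmats]


# Two toy cells: three points on a line, and an equilateral triangle.
line = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
triangle = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
cells = with_uniform_distributions([line, triangle])

# The guard keeps worker processes from re-running the call on import.
if __name__ == "__main__" and gw_pairwise_parallel is not None:
    gw_dmat, couplings = gw_pairwise_parallel(
        cells, num_processes=2, return_coupling_mats=True
    )
    print(gw_dmat)         # square matrix of pairwise GW distances
    print(len(couplings))  # one (i, j, coupling_mat) tuple per computed pair
```

No output files are requested here, so names is left at its default of None.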

compute_gw_distance_matrix(intracell_csv_loc: str, gw_dist_csv_loc: str, num_processes: int, gw_coupling_mat_csv_loc: Optional[str] = None, return_coupling_mats: bool = False, verbose: Optional[bool] = False) tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], Optional[list[tuple[int, int, cajal.run_gw.Matrix]]]]

Compute the matrix of pairwise Gromov-Wasserstein distances between cells.

This function wraps cajal.run_gw.gw_pairwise_parallel(), except that it reads the intracell distance matrices (ICDMs) from a file rather than from a list. For the ICDM file format, see cajal.run_gw.icdm_csv_validate().

Parameters
  • intracell_csv_loc (str) – A file containing the intracell distance matrices for all cells.

  • gw_dist_csv_loc (str) –

  • num_processes (int) –

  • gw_coupling_mat_csv_loc (Optional[str]) –

  • return_coupling_mats (bool) –

  • verbose (Optional[bool]) –

Return type

tuple[numpy.ndarray[Any, numpy.dtype[numpy.float64]], Optional[list[tuple[int, int, cajal.run_gw.Matrix]]]]

For other parameters see cajal.run_gw.gw_pairwise_parallel().
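
The file-based workflow can be sketched as below, again relying only on the signature documented above. The sketch writes a tiny ICDM file to a temporary directory and calls compute_gw_distance_matrix only if cajal is importable. The file names and the toy distance values are hypothetical.

```python
import os
import tempfile

try:  # cajal may not be installed where this sketch is run
    from cajal.run_gw import compute_gw_distance_matrix
except ImportError:
    compute_gw_distance_matrix = None

workdir = tempfile.mkdtemp()
icdm_path = os.path.join(workdir, "icdm.csv")   # hypothetical input file name
gw_path = os.path.join(workdir, "gw_dist.csv")  # hypothetical output file name

# Header line, then one row per cell: the id followed by d(0,1), d(0,2), d(1,2).
with open(icdm_path, "w") as f:
    f.write("cell_id,d_0_1,d_0_2,d_1_2\n")
    f.write("cell_a,1.0,2.0,1.0\n")
    f.write("cell_b,1.0,1.0,1.0\n")

if __name__ == "__main__" and compute_gw_distance_matrix is not None:
    gw_dmat, couplings = compute_gw_distance_matrix(
        icdm_path, gw_path, num_processes=2
    )
    # return_coupling_mats defaults to False, so couplings should be None here.
    print(gw_dmat)
```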