Unbalanced Gromov-Wasserstein module

We have implemented a Python module that allows the user to compute the unbalanced Gromov-Wasserstein distance between cells. By default, CAJAL ships with a single core version of the algorithm and a multicore version, and the user can uncomment the appropriate line in the package’s setup.py build script to get a version of the algorithm for a GPU using either CUDA or OpenCL. These are disabled by default as the end user must configure their machine so that the CUDA (respectively, OpenCL) header files can be found and all necessary libraries are available. A few other backends can be made available upon request. Our experience shows that the GPU backends are only likely to be useful when the individual UGW problems are very large (i.e., the metric spaces are large)

The user should only import one of the backend modules at a time due to technical limitations of C (C has no namespacing, so there will be symbol conflicts from identically named functions in the two backend modules). Restart the Python interpreter if you want to load a different backend module.

The basic usage is that we import the backend module we want to use (in this case the multicore implementation) and the UGW class. The constructor for the UGW class takes the backend module as its argument, establishes a connection with the library, and returns an object that maintains the internal state of the computation.

from cajal.ugw import _multicore, UGW # Substitute _single_core for single-threaded usage, useful if you want to parallelize at the level of Python processes
UGW_multicore = UGW(_multicore) # For GPU backends, the constructor has to negotiate a connection to the GPU, so it may take a long time to initialize.

The wrapper functions for the C backend are then accessible as methods of this object. For example, the “.from_futhark()” method converts the library’s internal representation of the output to a Numpy array. If the user intends to parallelize at the level of Python processes, each process should instantiate the class. As usual one can call help(UGW_multicore), help(UGW_multicore.ugw_armijo), and so on for documentation of the functions.

Some utilities

ugw_bound(gw_cost: float, rho1: float, rho2: float)
Parameters
  • gw_cost (float) – A known GW transport cost between two cells or metric spaces. Recall that we distinguish between GW cost and GW distance; the GW distance is defined as 0.5 * math.sqrt(gw_cost).

  • rho1 (float) – The first marginal penalty

  • rho2 (float) – The second marginal penalty

Returns

An upper bound on the UGW cost that can be directly computed from the known GW cost and rho1, rho2.

mass_lower_bound(gw_cost: float, rho1: float, rho2: float)
Parameters
  • gw_cost (float) – A known GW transport cost between two cells or metric spaces. Recall that we distinguish between GW cost and GW distance; the GW distance is defined as 0.5 * math.sqrt(gw_cost).

  • rho1 (float) – The first marginal penalty

  • rho2 (float) – The second marginal penalty

Returns

A lower bound on the mass preserved by the transport plan that can be directly computed from the known GW cost and rho1, rho2.

estimate_distr(dmats: ndarray[Any, dtype[float64]], sample_size: int = 100, quantile=0.15)

Select sample_size many pairs of distance matrices from the given set of distance matrices, compute the Gromov-Wasserstein costs for each pair, and return the specified quantile of the observed distribution of GW costs.

Parameters
  • dmats (ndarray[Any, dtype[float64]]) – An array of shape (k, n, n) where k is the number of cells, and n is the number of points sampled from each cell.

  • sample_size (int) – How many cell pairs to sample.

  • quantile – The quantile of the observed distribution to return.

rho_of(gw_cost: float, mass_kept: float)
Parameters
  • gw_cost (float) – The GW cost of the optimal transport plan between two cells. Recall that we distinguish between GW cost and GW distance; the GW distance is defined as 0.5 * math.sqrt(gw_cost).

  • mass_kept (float) – A real number between 0 and 1 indicating a desired lower bound on the mass kept by the UGW transport plan.

Returns

A value for rho such that, if unbalanced GW is run with this parameter, at least proportion mass_kept of the mass of the cells will be created. For simplicity this formula is written for the case where the cells and the GW transport plan have unit mass. This function is the inverse of mass_lower_bound.

UGW class and methods

class cajal.ugw.UGW
ugw_armijo(self, rho1: float, rho2: float, eps: float, A_dmat: ndarray[Any, dtype[float64]], mu: ndarray[Any, dtype[float64]], B_dmat: ndarray[Any, dtype[float64]], nu: ndarray[Any, dtype[float64]], exp_absorb_cutoff: float = 1e+100, safe_for_exp: float = 100.0, tol_sinkhorn: float = 0.0001, tol_outerloop: float = 0.4, mass_kept: Optional[float] = None, gw_cost: Optional[float] = None)

Given two metric measure spaces (A,mu) and (B, nu), compute the unbalanced Gromov-Wasserstein distance between them, using an algorithm based on that of Séjourné, Vialard and Peyré, with some modifications for numerical stability and convergence.

Parameters
  • rho1 (float) – The first marginal penalty coefficient, controls how much the first marginal for the transport plan (sum along rows) is allowed to deviate from mu. Higher is more strict and should give closer marginals.

  • rho2 (float) – The second marginal penalty coefficient, controls how much the second marginal for the transport plan (sum along columns) is allowed to deviate from nu.

  • eps (float) – The entropic regularization coefficient. Increasing this value makes the problem more convex and the algorithm will converge to an answer faster, but it may be inaccurate (far from the true minimum). If it’s set too low there will be numerical instability issues and the function will return NaN.

  • A_dmat (ndarray[Any, dtype[float64]]) – A square pairwise distance matrix of shape (n, n), symmetric and with zeroes along the diagonal.

  • mu (ndarray[Any, dtype[float64]]) – A one-dimensional vector of length n with strictly positive entries. (The algorithm does not currently support zero entries.)

  • B_dmat (ndarray[Any, dtype[float64]]) – A square pairwise distance matrix of shape (m, m), symmetric and with zeroes along the diagonal.

  • nu (ndarray[Any, dtype[float64]]) – A one-dimensional vector of length m with strictly positive entries.

  • exp_absorb_cutoff (float) – A numerical stability parameter. The inner loop, the Sinkhorn algorithm, tries to solve an optimization problem of solving for diagonal matrices A, B to minimize a cost function Cost(AKB) where the matrix K is given. The values of A, K, B are numerically extreme under normal conditions and so we represent A by a pair of vectors (a_bar, u_bar) where a_bar is between exp_absorb_cutoff and 1/exp_absorb_cutoff, and diag(A) = a_bar * e^u_bar. When a_bar exceeds exp_absorb_cutoff, part of a_bar is “absorbed into the exponent” u_bar, thus the name. Set this to any value such that floating point arithmetic operations in the range (1/exp_absorb_cutoff, exp_absorb_cutoff) are reasonably accurate and not too close to overflow.

  • safe_for_exp (float) – A numerical stability parameter. This is used only once during the initialization of the algorithm, the user supplies a number R and the code tries to construct an initial starting matrix K with the property that e^R is an upper bound for all elements in K, and as many as possible of the values of K are “pushed up against” this upper bound. The main risk here is of underflow of values in K; by increasing the values in K we give ourselves more room to maneuver with the spectrum of values that floats can represent. Choose a number such that e^R is big but still is a comfortable distance from the max float value, a good region for addition and multiplication without risk of overflow.

  • tol_sinkhorn (float) – An accuracy parameter, controls the tolerance at which the inner loop exits. Suggest 10^-5 to 10^-9. The inner loop, the Sinkhorn algorithm, tries to solve an optimization problem of solving for diagonal matrices A, B to minimize a cost function Cost(AKB) where the matrix K is given. This gives a series of iterations A_n, B_n, A_n+1, B_n+1, … The exit condition for the loop is when diag(A_n+1)/diag(A_n) is everywhere within 1 +/- tol_sinkhorn.

  • tol_outerloop (float) – An accuracy parameter, controls the tolerance at which the outer loop exits - when the ratio T_n+1/T_n is everywhere within 1 +/- tol_outerloop. (T_n is the nth transport plan found in the main gradient descent)

  • mass_kept (Optional[float]) – The user has the option to supply a real number between 0 and 1, “mass_kept”, which takes precedence over the given rho1 and rho2 if it is not None. If this value is supplied, then the algorithm chooses rho1 and rho2 in such a way as to bound below the mass kept by the transport plan.

  • gw_cost (Optional[float]) – This must be supplied if mass_kept is not None, as it is necessary to compute the appropriate values of rho1, rho2.

Returns

A Numpy array ret of floats of shape (5,), where ret[0] is the GW cost of the transport plan found, ret[1] is the first marginal penalty (not including the rho1 scaling factor), ret[2] is the second marginal penalty (not including the rho2 scaling factor), ret[3] is the entropic regularization cost (not including the scaling factor epsilon), ret[4] is the total UGW_eps cost (the appropriate weighted linear combination of the first four entries)

ugw_armijo_pairwise_increasing(self, ugw_dmat: ndarray[Any, dtype[float64]], increasing_ratio: float, rho: float, eps: float, dmats: Union[str, ndarray[Any, dtype[float64]]], distrs: ndarray[Any, dtype[float64]], exp_absorb_cutoff: float = 1e+100, safe_for_exp: float = 100.0, tol_sinkhorn: float = 0.0001, tol_outerloop: float = 0.4)

This is a post-processing step for ugw_armijo_pairwise. It loops through the output and identifies all pairs of matrices where the algorithm failed to converge, and re-runs the computation for those inputs repeatedly at exponentially increasing values of the parameter epsilon (scaled by increasing_ratio) until the algorithm stabilizes. This is step is useful for situations where you want a complete picture of the whole space of cells even if some of the UGW values are slightly off due to the increased regularization parameter.

Parameters
  • ugw_dmat (ndarray[Any, dtype[float64]]) – The output of the UGW algorithm, an array of shape (n * (n-1)/2, 5), where n is the number of cells in the data set; possibly containing NaN values.

  • increasing_ratio (float) – The multiple to increase epsilon by each time the function fails to converge.

  • rho (float) –

  • eps (float) –

  • dmats (Union[str, ndarray[Any, dtype[float64]]]) –

  • distrs (ndarray[Any, dtype[float64]]) –

  • exp_absorb_cutoff (float) –

  • safe_for_exp (float) –

  • tol_sinkhorn (float) –

  • tol_outerloop (float) –

Other parameters are as in ugw_armijo_pairwise.

ugw_armijo_pairwise(self, mass_kept: float, eps: float, dmats: Union[str, ndarray[Any, dtype[float64]]], distrs: Optional[ndarray[Any, dtype[float64]]] = None, quantile: float = 0.15, increasing_ratio: Optional[float] = 1.1, sample_size: int = 100, exp_absorb_cutoff: float = 1e+100, safe_for_exp: float = 100.0, tol_sinkhorn: float = 0.0001, tol_outerloop: float = 0.4, rho: Optional[float] = None, as_matrix=True, pt_cloud=False)

Given an array of squareform distance matrices of shape (k,n,n), and an array of measures of shape (k,n), compute the pairwise unbalanced Gromov-Wasserstein distance between all of them. Other than replacing two distance matrices with an array of distance matrices, and two distributions with an array of distributions, parameters are as in the function ugw_armijo.

Because the coefficient rho is difficult to interpret and choose, the default behavior of the algorithm is to estimate an appropriate value of rho based on the following heuristic. The user selects a lower bound “mass_kept” for the fraction of the mass that they want to keep between two cells, and the algorithm randomly computes many GW distances between cells in the observed data to estimate the quantile of observed distances specified by the quantile parameter. Using this statistic, a value of rho is calculated which guarantees that for cell pairs whose GW cost is below that threshold, the optimal UGW transport plan will keep at least fraction “mass_kept” of the mass. This will add some overhead to the algorithm because of the time to compute the GW values.

Parameters
  • mass_kept (float) – A real number between 0 and 1, the minimum fraction of mass to be preserved by UGW transport plans between two cells in the same neighborhood.

  • eps (float) – The entropic regularization coefficient. Increasing this value makes the problem more convex and the algorithm will converge to an answer faster, but it may be inaccurate (far from the true minimum). If it’s set too low there will be numerical instability issues and the function will return NaN.

  • dmats (Union[str, ndarray[Any, dtype[float64]]]) – An array of squareform distance matrices of shape (k,n,n), where k is the number of cells and n is the number of points sampled from each cell. Alternatively, dmats can be a string coding a filepath to a csv file containing icdms, in the file format established in cajal.sample_swc, cajal.sample_seg, etc.

  • distrs (Optional[ndarray[Any, dtype[float64]]]) – An array of measures of shape (k,n), where k is the number of cells and n is the number of points per cell. If distrs is None, then the uniform distribution on all cells will be taken.

  • quantile (float) – A real number between 0 and 1, the quantile in the distribution of distances which informs the notion of “same neighborhood” referred to in the parameter mass_kept.

  • increasing_ratio (Optional[float]) – The UGW algorithm is not numerically stable and on large datasets, NaNs in the output are likely. If increasing_ratio is not None, this function makes a second pass through the input and retries the UGW algorithm, each time multiplying the regularization parameter epsilon by a factor of increasing_ratio until it converges. Higher values of epsilon may make the results less accurate in some cases, but it may be more useful than discarding the data for which the algorithm did not converge.

  • sample_size (int) – In order to estimate the appropriate value of rho, before initiating the unbalanced GW computations, the function randomly computes sample_size many GW distances to estimate the quantile of the distribution specified by the parameter “quantile”.

  • rho (Optional[float]) – If this is specified, the sampling routine and estimation of the appropriate rho for the target mass_kept is ignored, and the given value of rho is used for all cell pairs.

  • as_matrix – Return only the final distance matrix, as opposed to the full account of intermediary values. If as_matrix is False, return a pair (distances, rho) where rho was the value found for the given mass, and distances is an array of shape (k * (k-1)/2,5) where the rows correspond to pairs (i,j) of cells with i < j, and the columns are as in ugw_armijo.

  • pt_cloud – If pt_cloud is True then dmats will be interpreted, not as an array of distance matrices, but as an array of point clouds, of shape (k, n, d) where d is the dimension of the ambient Euclidean space the points are sampled from. This may improve performance for the core portion of the computation as computing distances between points may be faster than looking up the precomputed distances in memory.

  • exp_absorb_cutoff (float) –

  • safe_for_exp (float) –

  • tol_sinkhorn (float) –

  • tol_outerloop (float) –

All other values are as in ugw_armijo.

ugw_armijo_euclidean(self, rho: float, eps: float, pt_clouds: ndarray[Any, dtype[float64]], exp_absorb_cutoff: float = 1e+100, safe_for_exp: float = 100.0, tol_sinkhorn: float = 0.0001, tol_outerloop: float = 0.4)

The core of this function is similar to ugw_armijo_pairwise_unif, but the user passes in the array of point clouds rather than distance matrices, and the backend computes the distance matrices dynamically at the moment they are needed. This may reduce memory consumption and cache usage. pt (of shape (k,n,d), where k is the number of cells, n is the number of points per cell, and d is the dimension of the ambient Euclidean space that the points live in).

Note that the ugw_armijo_pairwise function wraps this one while also providing some additional preprocessing and postprocessing, so that function may be more convenient.

Parameters