Math

Note

This section is, for the most part, a direct excerpt from the multiDGD paper [SDKT23].

Notation

| Symbol | Description |
| --- | --- |
| \(Z\) | representation |
| \(X\) | data |
| \(\hat{X}\) | predicted/reconstructed data |
| mod | modality |
| cov | covariate |
| \(\theta\) | decoder parameters |
| \(\phi\) | GMM parameters |
| \(S\) | cell-specific scaling factor |
| \(Y\) | decoder output (predicted normalized count) |
| \(i \in N\) | single sample \(i\) among \(N\) total samples |
| \(k \in K\) | component \(k\) among \(K\) components |
| \(l\) | latent dimension |
| \(c \in C\) | class \(c\) in \(C\) covariate classes |
| \(\mu\) | GMM mean |
| \(\Sigma\) | GMM covariance |
| \(w\) | component coefficient |
| \(\pi\) | component weight |
| \(\alpha\) | Dirichlet alpha |

Probabilistic formulation

The training objective is given by the joint probability

\[p(X,Z,\theta,\phi) = p(X\mid Z, \theta) \, p(Z\mid \phi)\]

which is maximized using maximum a posteriori (MAP) estimation.

In this model, \(p(X\mid Z, \theta)\) is the negative binomial probability mass of the observed count \(x_i\) for cell \(i\), given the predicted mean count and a learned dispersion parameter \(r_{j}\) for each feature \(j\):

\[ \begin{align}\begin{aligned}p( x_{i} \mid z_{i}, \theta , s_{i}) = \prod_{j=1}^D p(x_{ij}\mid z_{i},\theta,s_{i})\\\text{with } p(x_{ij}\mid z_{i},\theta,s_{i}) = \mathcal{NB}(x_{ij}\mid s_i y_{ij},r_j)\end{aligned}\end{align} \]

where \(\mathcal{NB}(x \mid m, r)\) denotes the negative binomial probability mass of a count \(x\) under mean \(m\) and dispersion parameter \(r\), and \(D\) is the number of features. The predicted mean \(s_i y_{ij}\) is the product of the modality-specific total count \(s_i\) of cell \(i\) and the decoder output \(y_{ij}\). This output describes the fraction of counts in cell \(i\) that falls on modality-specific feature \(j\), i.e. the predicted normalized count. These equations hold for each modality (RNA and ATAC) separately, since each modality has its own total count \(s_i\).
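The per-cell likelihood above can be sketched in a few lines of plain Python. This is an illustration only, not the package's implementation: it assumes the common mean/dispersion parameterization of the negative binomial, and the names `nb_log_pmf` and `cell_log_likelihood` as well as the toy values are hypothetical.

```python
import math

def nb_log_pmf(x, mean, r):
    """Log-mass of count x under a negative binomial with the given mean and dispersion r.

    Assumed parameterization: variance = mean + mean**2 / r.
    """
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mean))
        + x * math.log(mean / (r + mean))
    )

def cell_log_likelihood(x_i, y_i, s_i, r):
    """log p(x_i | z_i, theta, s_i) = sum_j log NB(x_ij | s_i * y_ij, r_j)."""
    return sum(nb_log_pmf(x, s_i * y, r_j) for x, y, r_j in zip(x_i, y_i, r))

# Toy cell with three features (made-up numbers, one modality).
x_i = [3, 0, 7]          # observed counts
y_i = [0.3, 0.1, 0.6]    # decoder output: predicted fractions of the total count
s_i = 10.0               # total count of the cell in this modality
r = [2.0, 2.0, 5.0]      # one dispersion parameter per feature

log_lik = cell_log_likelihood(x_i, y_i, s_i, r)
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when the number of features is large.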

The joint probability further contains the objective for the representation to follow the latent distribution, \(p(Z \mid \phi)\). Since the latent distribution is a GMM with parameters \(\phi\), this term is a weighted sum of multivariate Gaussian probability densities

\[p(z_i \mid \phi) = \sum_{k=1}^{K} \pi_k \mathcal{N}_L(z_i \mid \mu_k, \Sigma_k)\]

where \(K\) is the number of GMM components, \(\pi_k\) is the weight of component \(k\), and \(\mathcal{N}_L(z_i \mid \mu, \Sigma)\) is a multivariate Gaussian density of dimension \(L\) (the latent dimension) with mean vector \(\mu\) and covariance matrix \(\Sigma\).
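As a rough illustration of the mixture density above, the following sketch evaluates \(\log p(z_i \mid \phi)\) with a log-sum-exp over components. It assumes diagonal covariance matrices purely to stay dependency-free; the function names and toy parameters are hypothetical and not taken from the package.

```python
import math

def diag_gaussian_log_pdf(z, mu, var):
    """Log-density of a Gaussian with diagonal covariance (assumption made for brevity)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(z, mu, var)
    )

def gmm_log_pdf(z, weights, means, variances):
    """log sum_k pi_k N(z | mu_k, Sigma_k), computed stably via log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_log_pdf(z, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy GMM with K = 2 components in an L = 2 dimensional latent space.
weights = [0.5, 0.5]
means = [[-1.0, 0.0], [1.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

log_density = gmm_log_pdf([0.0, 0.0], weights, means, variances)
```

Subtracting the largest term before exponentiating keeps the sum finite even when a point lies far from every component.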

For new data points, the representation is found by maximizing \(p(x_i \mid z_i, \theta, s_i) \, p(z_i \mid \phi)\) with respect to \(z_i\) only, as all other model parameters are fixed.
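The idea of optimizing only \(z_i\) while everything else stays fixed can be sketched as follows. This is a toy under stated assumptions, not the package's procedure: `toy_decoder` is a hypothetical softmax-linear stand-in for the trained decoder, the GMM uses diagonal covariances, and the optimizer is a plain finite-difference gradient ascent with step halving rather than whatever optimizer the package uses.

```python
import math

def nb_log_pmf(x, mean, r):
    # Negative binomial log-mass with mean/dispersion parameterization (assumed).
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + mean)) + x * math.log(mean / (r + mean)))

def gmm_log_pdf(z, weights, means, variances):
    # Diagonal-covariance GMM log-density via log-sum-exp (assumed for brevity).
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_n = sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
                    for a, m, v in zip(z, mu, var))
        terms.append(math.log(w) + log_n)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def toy_decoder(z, W):
    # Hypothetical stand-in for the trained decoder: softmax(W z) yields fractions y_i.
    logits = [sum(w * a for w, a in zip(row, z)) for row in W]
    top = max(logits)
    e = [math.exp(v - top) for v in logits]
    total = sum(e)
    return [v / total for v in e]

def log_joint(z, x_i, s_i, W, r, gmm):
    # log p(x_i | z_i, theta, s_i) + log p(z_i | phi)
    y = toy_decoder(z, W)
    log_lik = sum(nb_log_pmf(x, s_i * y_j, r_j) for x, y_j, r_j in zip(x_i, y, r))
    return log_lik + gmm_log_pdf(z, *gmm)

def infer_z(x_i, s_i, W, r, gmm, z0, steps=200, lr=0.1, eps=1e-5):
    # Gradient ascent on z only; W, r and the GMM are never updated.
    z = list(z0)
    best = log_joint(z, x_i, s_i, W, r, gmm)
    for _ in range(steps):
        grad = []
        for d in range(len(z)):
            up = z[:d] + [z[d] + eps] + z[d + 1:]
            down = z[:d] + [z[d] - eps] + z[d + 1:]
            grad.append((log_joint(up, x_i, s_i, W, r, gmm)
                         - log_joint(down, x_i, s_i, W, r, gmm)) / (2 * eps))
        candidate = [a + lr * g for a, g in zip(z, grad)]
        value = log_joint(candidate, x_i, s_i, W, r, gmm)
        if value > best:
            z, best = candidate, value   # accept the step
        else:
            lr *= 0.5                    # reject the step and shrink the step size
    return z, best

# Toy setup: 3 features, 2 latent dimensions, 2 GMM components (made-up numbers).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
r = [2.0, 2.0, 2.0]
gmm = ([0.5, 0.5], [[-1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
x_new, s_new, z_start = [8, 1, 1], 10.0, [0.0, 0.0]

start_value = log_joint(z_start, x_new, s_new, W, r, gmm)
z_hat, final_value = infer_z(x_new, s_new, W, r, gmm, z_start)
```

In practice an automatic-differentiation framework would replace the finite differences; the sketch only shows that the objective is the same joint as in training, with the decoder and GMM parameters frozen.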

Modules