Math ==== .. note:: This section is (in most part) a direct excerpt from the multiDGD paper :cite:`Schuster2024`. Notation -------- +---------------------+--------------------------------------------+ | Symbol | Representation | +=====================+============================================+ | :math:`Z` | representation | +---------------------+--------------------------------------------+ | :math:`X` | data | +---------------------+--------------------------------------------+ | :math:`\hat{X}` | predicted/ reconstructed data | +---------------------+--------------------------------------------+ | mod | modality | +---------------------+--------------------------------------------+ | cov | covariate | +---------------------+--------------------------------------------+ | :math:`\theta` | decoder parameters | +---------------------+--------------------------------------------+ | :math:`\phi` | GMM parameters | +---------------------+--------------------------------------------+ | :math:`S` | cell-specific scaling factor | +---------------------+--------------------------------------------+ | :math:`Y` | decoder output (predicted normalized count)| +---------------------+--------------------------------------------+ | :math:`i \in N` | single sample :math:`i` among :math:`N` | | | total samples | +---------------------+--------------------------------------------+ | :math:`k \in K` | component :math:`k` among :math:`K` | | | components | +---------------------+--------------------------------------------+ | :math:`l` | latent dimension | +---------------------+--------------------------------------------+ | :math:`c \in C` | class :math:`c` in :math:`C` covariate | | | classes | +---------------------+--------------------------------------------+ | :math:`\mu` | GMM mean | +---------------------+--------------------------------------------+ | :math:`\Sigma` | GMM covariance | +---------------------+--------------------------------------------+ | :math:`w` | component coefficient | +---------------------+--------------------------------------------+ | :math:`\pi` | component weight | +---------------------+--------------------------------------------+ | :math:`\alpha` | Dirichlet alpha | +---------------------+--------------------------------------------+ Probabilistic formulation ------------------------- The training objective is given by the joint probability .. math:: p(X,Z,\theta,\phi) = p(X\mid Z, \theta) \, p(Z\mid \phi) which is maximized using Maximum a Posteriori estimation. :math:`p(X\mid Z, \theta)` in this model is presented as the Negative Binomial distribution's mass of the observed count :math:`x_i` for cell :math:`i` given the predicted mean count and a learned dispersion parameter :math:`r_{j}` for each feature :math:`j`: .. math:: p( x_{i} \mid z_{i}, \theta , s_{i}) = \prod_{j=1}^D p(x_{ij}\mid z_{i},\theta,s_{i}) \text{with } p(x_{ij}\mid z_{i},\theta,s_{i}) = \mathcal{NB}(x_{ij}\mid s_i y_{ij},r_j) where :math:`\mathcal{NB}(x \mid y, r)` is the negative binomial distribution. Here we calculate the probability mass of the observed count :math:`x_{i,j}` given the negative binomial distribution with mean :math:`s_i y_{i,j}` and dispersion factor :math:`r_j`. The predicted mean :math:`s_i y_{i,j}` is given by the modality-specific total count :math:`s_i` of cell :math:`i` and the decoder output :math:`y_{i,j}`. This output :math:`y_{i,j}` describes the fraction of counts for cell :math:`i` and modality-specific feature :math:`j`, i.e. the predicted normalized count. These equations are valid for each modality (RNA and ATAC) separately, as we have a total count :math:`s` per modality. The joint probability further contains the objective for the representation to follow the latent distribution, :math:`p(Z \mid \phi)`. Since :math:`\phi` is a GMM, this results in the weighted multivariate Gaussian probability density .. math:: p(z_i \mid \phi) = \sum_{k=1}^{K} \pi_k \mathcal{N}_L(z_i \mid \mu_k, \Sigma_k) with :math:`K` as the number of GMM components and :math:`\mathcal{N}_L(z_i \mid \mu, \Sigma)` is a multivariate Gaussian distribution with dimension :math:`L` (the latent dimension), mean vector :math:`\mu` and covariance matrix :math:`\Sigma`. For new data points, the representation is found by maximizing :math:`p(x_i \mid z_i, \theta, s) p(z_i \mid \phi)` only with respect to :math:`z_i`, as all other model parameters are fixed. Modules ------- .. toctree:: :maxdepth: 2 dgd reps gmm covariates