API

class multiDGD.DGD(data, parameter_dictionary=None, scaling='sum', save_dir='./', random_seed=0, model_name='dgd', print_outputs=False)

This is the main class for the Deep Generative Decoder (DGD). Given a MuData or AnnData object, it creates a DGD instance with some hyperparameters chosen from the data. To customize the model, provide a parameter dictionary.

Arguments

data: omicsDataset: Dataset including the train and validation sets and, if applicable, a test set.

Arguments (optional)

parameter_dictionary: dict: Dictionary containing custom hyperparameters for building model instances and training.
scaling: str: Scaling relationship between model output and data. Default is ‘sum’. Only ‘sum’ is currently supported; ‘mean’ is planned.
save_dir: str: Directory where the model is saved. Default is ‘./’.
random_seed: int: Random seed for reproducibility. Default is 0.
model_name: str: Name of the model. Default is ‘dgd’.
print_outputs: bool: Print model outputs. Default is False.
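
A minimal usage sketch of the constructor and the train and save methods documented on this page. It assumes that `data` has already been prepared as described in the data preparation guide; the helper name `build_and_train` is illustrative and not part of multiDGD, and the argument values are simply the documented defaults.

```python
def build_and_train(data, save_dir="./", model_name="dgd"):
    """Build a DGD from prepared data, train it, and save it.

    Illustrative helper, not part of multiDGD. `data` is assumed to be
    prepared in the form the constructor expects.
    """
    # Imported lazily so the sketch can be defined without the package installed.
    import multiDGD

    model = multiDGD.DGD(
        data,
        parameter_dictionary=None,  # fall back to hyperparameters chosen from the data
        scaling="sum",              # the only scaling currently supported
        save_dir=save_dir,
        random_seed=0,
        model_name=model_name,
    )
    # Train with early stopping on the loss, using the documented defaults.
    model.train(n_epochs=500, stop_with="loss", stop_after=10, train_minimum=50)
    # With no arguments, save() uses the directory and name given above.
    model.save()
    return model
```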

Attributes

train_set: omicsDataset: Dataset object derived from input data. Its properties (shape, modality type, observable feature classes) are used to build the remaining instances. val_set and test_set are optional and only initialized if given (see data preparation).
param_dict: dict: Dictionary containing hyperparameters for building model instances and training. Initialized with default parameters and updated with optional user input.
decoder: Decoder: Decoder instance initialized based on desired latent dimensionality and data features.
representation: RepresentationLayer: Learnable representation vectors for the training set. If val_set and/or test_set are available, additional representations are initialized as validation_rep and test_rep.
gmm: GMM: Gaussian mixture model instance for the distribution over latent space. If a clustering observable is provided and no number of components is specified, the number of components is automatically set to the number of classes.
latent: int: Number of latent dimensions.
trained_status: torch.bool: Boolean indicating whether the model has been trained or not.
history: dict: Dictionary containing training history (loss, clustering performance, etc.)
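
The way param_dict is described above (defaults first, then user input on top) can be sketched as a plain dictionary update. This is an assumption about the merge behavior, not the library's code; in particular, the keys shown are hypothetical and a shallow update is assumed.

```python
def merge_parameters(defaults, user_params=None):
    """Return a copy of `defaults` with user-supplied values applied on top."""
    merged = dict(defaults)  # copy so the defaults stay untouched
    if user_params:
        merged.update(user_params)  # shallow update: user values win
    return merged


# Hypothetical keys and values, chosen only for illustration.
defaults = {"latent_dimension": 20, "n_components": 1, "learning_rate": 0.01}
param_dict = merge_parameters(defaults, {"latent_dimension": 10})
print(param_dict)
```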

Attributes (optional)

correction_gmm: Supervised GMM (optional): Gaussian mixture model instance for the distribution over covariate space. The number of components is automatically set to the number of classes in the provided covariate.
correction_rep: RepresentationLayer (optional): Learnable representation vectors for the covariate space. If val_set and/or test_set are available, additional representations are initialized as correction_val_rep and correction_test_rep.

Methods

clustering(split='train')

Return the clustering of the model based on the data split.

This returns numeric labels of which GMM component a sample belongs to.

‘split’ can be ‘train’, ‘validation’, ‘test’ or ‘all’
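
To make "which GMM component a sample belongs to" concrete, here is a small sketch that assigns each representation to the component with the highest weighted density. It assumes isotropic Gaussian components and a hard argmax assignment; the function names are illustrative and the library's GMM may be parameterized differently.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian with per-dimension variance `var`."""
    dim = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (dim * math.log(2 * math.pi * var) + sq_dist / var)


def assign_components(reps, means, variances, weights):
    """Return, per sample, the index of the component with the highest weighted density."""
    labels = []
    for x in reps:
        scores = [
            math.log(w) + log_gaussian(x, m, v)
            for m, v, w in zip(means, variances, weights)
        ]
        labels.append(scores.index(max(scores)))
    return labels


# Toy 2-D representations and a two-component mixture (made-up values).
reps = [[0.1, -0.2], [4.8, 5.1], [0.3, 0.0]]
means = [[0.0, 0.0], [5.0, 5.0]]
labels = assign_components(reps, means, variances=[1.0, 1.0], weights=[0.5, 0.5])
print(labels)  # [0, 1, 0]
```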

decode(rep_shape, i=None)

Decode the representations of a given shape

The shape has to match either the train-, validation- or test-set.

differential_expression(): Placeholder for differential expression analysis

gene2peak(gene_name, testset, gene_ref=None)

Perform a gene2peak in silico knockdown experiment on the testset for a given gene (‘gene_name’).

‘gene_ref’ is the name of a column in the ‘testset.var’ data frame.

If ‘gene_ref’ is not provided, the gene of interest is looked up in the ‘testset.var’ index.

get_accessibility_estimates(dataset, indices): Provided that a representation for the dataset has been learned, returns the normalized (unscaled) model output. Currently only implemented for multiome.

get_covariate_representation(split='train')

Return the covariate representation of the model based on the data split.

‘split’ can be ‘train’, ‘validation’, ‘test’ or ‘all’

Currently only supported for a single covariate.

get_latent_representation(data=None)

Access the learned latent representations of the model

‘data’ is the data object for which the latent representation is requested. If no data is provided, the training representations are returned. Otherwise, the method checks which representations the data belongs to by the number of samples and returns the corresponding representation.
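
The lookup by number of samples can be sketched as follows. This is an illustration under assumptions, not the library's code: the stored representations are modeled as a dictionary of lists, and the helper name is hypothetical.

```python
def pick_representation(n_samples, representations):
    """Return the name of the split whose stored representation has `n_samples` rows."""
    matches = [name for name, rep in representations.items() if len(rep) == n_samples]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one split with {n_samples} samples, found {matches}"
        )
    return matches[0]


# Toy representations: one latent vector per sample (made-up sizes).
representations = {
    "train": [[0.0, 0.0]] * 8,
    "validation": [[0.0, 0.0]] * 3,
    "test": [[0.0, 0.0]] * 2,
}
print(pick_representation(3, representations))  # validation
```

One consequence of matching by sample count is that two splits of equal size cannot be told apart. The sketch raises an error in that case, which may differ from what the library does.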

get_modality_reconstruction(dataset, mod_id): Provided that a representation for the dataset has been learned, returns the model output for a given modality (‘mod_id’).

get_normalized_expression(dataset, indices): Provided that a representation for the dataset has been learned, returns the normalized (unscaled) model output. Currently only implemented for multiome.

get_prediction_errors(predictions, dataset, reduction='sum')

Return the errors of the given predictions.

‘predictions’ is a list of tensors with the model output. ‘dataset’ is the data object for which the error is calculated.

‘reduction’ can be ‘sum’, ‘sample’ or ‘none’ and determines the shape of the returned error.
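
How the three reduction options could change the shape of the result is sketched below for a single samples-by-features matrix. The absolute difference is only a stand-in for the model's actual error measure, and the interpretation of ‘sample’ as one value per sample is an assumption based on the option name.

```python
def prediction_errors(predictions, targets, reduction="sum"):
    """Compute stand-in (absolute) errors and reduce them as requested.

    'none'   -> one error per sample and feature (nested list)
    'sample' -> one error per sample (list)
    'sum'    -> a single number
    """
    errors = [
        [abs(p - t) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(predictions, targets)
    ]
    if reduction == "none":
        return errors
    per_sample = [sum(row) for row in errors]
    if reduction == "sample":
        return per_sample
    if reduction == "sum":
        return sum(per_sample)
    raise ValueError(f"unknown reduction: {reduction!r}")


# Two samples with two features each (made-up values).
predictions = [[1.0, 2.0], [3.0, 4.0]]
targets = [[1.5, 2.0], [2.0, 6.0]]
print(prediction_errors(predictions, targets, "none"))    # [[0.5, 0.0], [1.0, 2.0]]
print(prediction_errors(predictions, targets, "sample"))  # [0.5, 3.0]
print(prediction_errors(predictions, targets, "sum"))     # 3.5
```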

get_reconstruction(dataset): Provided that a representation for the dataset has been learned, returns the model output.

get_representation(split='train')

Return the representation of the model based on the data split.

‘split’ can be ‘train’, ‘validation’, ‘test’ or ‘all’

init_test_set(testdata): Initialize the test set.

is_trained(): Return the training status of the model (bool)

classmethod load(data, save_dir='./', random_seed=0, model_name='dgd')

Load a trained model

‘data’ must be the same data object that was used for training the model. ‘save_dir’ is the directory where the model is saved and ‘model_name’ is the name of the model.

perturbation_experiment(): Placeholder for one-line perturbation experiment

plot_history(export=False): When executed in a Jupyter notebook, plots the training history of the model. Otherwise, provide an export path to save the plot.

plot_latent_space()

Plot the latent space of the model as PCA.

The plotted points are the representations, the GMM means and samples drawn from the GMM.

predict(testdata=None, n_epochs=20, external=False): high-level call of predict_new

predict_from_representation(rep, correction_rep=None)

Predict/decode from a given representation

The representation has to be a RepresentationLayer object.

If a correction_rep is provided, the correction model is used.

predict_new(testdata=None, n_epochs=20, include_correction_error=True, indices_of_new_distribution=None, external=False)

Find the embedding for new datapoints

‘testdata’ is the data object for which the latent representation is requested. If no data is provided, the test split is used.

‘n_epochs’ is the number of epochs the local optimization is run for.

‘external’ is a boolean indicating whether the data is external to the training data. If the data is external, the model will first check if the number of modalities and features match the training data.
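
The compatibility check for external data can be pictured as below. This is a sketch under assumptions: each dataset is summarized as a list with one feature count per modality, and the helper name and error messages are hypothetical.

```python
def check_external_data(train_shapes, new_shapes):
    """Raise ValueError unless modality count and per-modality feature counts match.

    Both arguments are lists holding one feature count per modality.
    """
    if len(new_shapes) != len(train_shapes):
        raise ValueError(
            f"expected {len(train_shapes)} modalities, got {len(new_shapes)}"
        )
    for i, (n_train, n_new) in enumerate(zip(train_shapes, new_shapes)):
        if n_train != n_new:
            raise ValueError(
                f"modality {i}: expected {n_train} features, got {n_new}"
            )
    return True


# Made-up feature counts for a two-modality training set.
train_shapes = [2000, 10000]
print(check_external_data(train_shapes, [2000, 10000]))  # True
```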

save(save_dir=None, model_name=None)

Save trained parameters

‘save_dir’ is the directory where the model is saved and ‘model_name’ is the name of the model. If no alternative directory and name are provided, the default model directory and name are used. ‘save_dir’ and ‘model_name’ can be specified as strings.

The model is saved as a .pt file and the hyperparameters are saved as a .json file.
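
The JSON half of this can be sketched with the standard library alone. The file name pattern below is hypothetical, since this page does not state the names multiDGD uses, and the .pt file is left out because writing it requires PyTorch.

```python
import json
import os
import tempfile


def hyperparameter_path(save_dir, model_name):
    """Build the path of the JSON file (hypothetical naming scheme)."""
    return os.path.join(save_dir, model_name + "_hyperparameters.json")


def save_hyperparameters(param_dict, save_dir, model_name):
    """Write `param_dict` as JSON into `save_dir` and return the file path."""
    os.makedirs(save_dir, exist_ok=True)
    path = hyperparameter_path(save_dir, model_name)
    with open(path, "w") as handle:
        json.dump(param_dict, handle, indent=2)
    return path


def load_hyperparameters(save_dir, model_name):
    """Read the hyperparameters back from the JSON file."""
    with open(hyperparameter_path(save_dir, model_name)) as handle:
        return json.load(handle)


# Round trip in a temporary directory, with a made-up hyperparameter.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_hyperparameters({"latent_dimension": 20}, tmp_dir, "dgd")
    restored = load_hyperparameters(tmp_dir, "dgd")
print(restored)
```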

test(testdata=None, n_epochs=20, external=False): high-level call of predict_new

train(n_epochs=500, stop_with='loss', stop_after=10, train_minimum=50, developer_mode=False)

Training of the model instances (decoder, representation, gmm) for a given number of epochs with early stopping.

Options for early stopping are ‘loss’ and ‘clustering’ (which requires meta_label in DGD init). The early stopping observation interval is ‘stop_after’, and a minimum number of epochs to be trained can be specified with ‘train_minimum’.

‘developer_mode’ defines how progress is displayed. If False, a Jupyter notebook is assumed and progress is shown in a progress bar. If True, a wandb run is initialized, logging many training metrics.
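
How ‘stop_after’ and ‘train_minimum’ could interact is sketched below with a patience-style rule: stop once the loss has not improved for ‘stop_after’ consecutive epochs, but never before ‘train_minimum’ epochs. This rule is an assumption made for illustration; the criterion multiDGD actually applies may differ.

```python
def stopping_epoch(losses, stop_after=10, train_minimum=50):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop only once the minimum training length has been reached.
        if epoch >= train_minimum and epochs_without_improvement >= stop_after:
            return epoch
    return None


# The loss improves for five epochs and then plateaus (made-up values).
losses = [10.0, 9.0, 8.0, 7.0, 6.0] + [6.0] * 20
print(stopping_epoch(losses, stop_after=3, train_minimum=10))  # 10
print(stopping_epoch(losses, stop_after=3, train_minimum=0))   # 8
```

In the first call the plateau is long enough by epoch 8, but training continues until the minimum of 10 epochs is reached.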

view_data_setup()

Print the data setup

This tells you how the data was split and how many samples are in each split.

Modules

`dataset`
`functions`
`latent`
`nn`
`utils`