Data loading and preparation

In this tutorial we walk through the steps to get your data ready for multiDGD.

Tip

You can use multiDGD for multi-omics data and for data from just one modality.

The multi-omics setting is currently only supported for 2 modalities (RNA and ATAC-seq). Support for proteomics data is in the works.

Note

You can use multiDGD with both anndata and mudata objects. The preparation steps are slightly different for each. We will show you how to prepare your data for both types of objects.

Example data (anndata)

Note

Running the following cell is only necessary to get the example data.

import requests, zipfile

# Download
file_name = 'human_bonemarrow.h5ad.zip'
file_url = 'https://api.figshare.com/v2/articles/23796198/files/41740251'

file_response = requests.get(file_url).json()
file_download_url = file_response['download_url']
response = requests.get(file_download_url, stream=True)
with open(file_name, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

# Unzip
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall('.')

Note

Now we can start with the actual tutorial. You can modify the data directory data_dir to point to your own data.

import anndata as ad

data_dir = './human_bonemarrow'

data = ad.read_h5ad(data_dir+'.h5ad')

Anndata object preparation

Note

Now that we have some data, we can get started with multiDGD.

import multiDGD

data = multiDGD.functions.setup_data(
    data,
    modality_key='feature_types', # adata.var column indicating which feature belongs to which modality
    observable_key='cell_type', # cell annotation key to initialize GMM components
    covariate_keys=['Site'], # confounders
    train_fraction=0.9, # default, fraction of data to use for training
    include_test=False, # default is True, whether to make a test set in the train-val-test split
)

Note

You can read more about the details of this function in the API reference. The most important arguments are:

  • modality_key: When using anndata objects with multi-omics data, it is important to specify which variable column indicates the type of the modalities. This can be ignored for mudata objects.

  • observable_key: This is the key in the cell annotation that indicates the cell type. This is used to initialize the GMM components. The model works better when having an estimate of the celltypes.

  • covariate_keys: If you wish to remove batch effects or model other variables separately, you can specify them here with their column names in the anndata observables.

  • train_fraction: multiDGD (like other ML models) needs a train and validation set (at least). Here you specify how much of the data will be used for training. The rest will be used for validation (and testing).

  • include_test: Whether to include a test set in the train-val-test split. If set to False, the split will only contain a train and validation set.

Mudata object preparation

Let’s look at how this would work for mudata objects.

Tip

Working with mudata objects is a bit easier, as the data is already separating modalities.

import mudata as md

data_dir = './data' # path to your data

data = md.read(data_dir+'.h5mu', backed=False)

data = multiDGD.functions.setup_data(
    data,
    observable_key='...',
    covariate_keys=['...'],
    train_fraction=0.9,
    include_test=False,
)

Note

Here, we do not need a modality_key.

Saving the prepared data

Tip

It is always a good idea to save the prepared data for the sake of reproducibility and so you can easily load it later.

import scanpy as sc

data.write_h5ad('./example_data_prepared.h5ad')