multiDGD.dataset

class multiDGD.dataset.omicsDataset(data, scaling_type='sum', split=None)

General Dataset class for single-cell data. Applicable to single modalities and multi-modal data.

Arguments

data: mudata or anndata object: object containing the data
scaling_type: str: how scaling factors are calculated (currently only ‘sum’ is supported)
split: str: which subset of the data to use, if any. Can be ‘train’, ‘validation’ or ‘test’. The default is None, meaning that the full data is used.

Attributes

data: torch.Tensor: tensor with raw counts of shape (n_samples, n_features)
scaling_type: str: variable defining how to calculate scaling factors
n_sample: int: number of samples in dataset
n_features: int: number of total features in dataset
library: torch.Tensor: tensor of scaling factors, one per sample and modality
sparse: bool: whether the data is sparse or not
modalities: list: list of modalities in data
modality_switch: int: position in data tensor where modalities switch
modality_features: list: number of features per modality
meta: np.array: array of sample-wise values of monitored clustering feature (if available)
correction: np.array: array of sample-wise values of correction features (if available)
correction_classes: int: number of classes of correction features (if available)
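
The relationship between modality_features, modality_switch, n_features and library can be illustrated with a minimal sketch. It assumes that ‘sum’ scaling means summing each sample's raw counts within each modality, and that the modalities sit side by side in the concatenated data; both points are assumptions, not taken from the library source. Plain Python lists stand in for the torch tensors, and sum_scaling_factors is a hypothetical helper that is not part of multiDGD.

```python
def sum_scaling_factors(counts, modality_switch):
    """Return one [first_modality_sum, second_modality_sum] pair per sample.

    counts: list of rows, each row holding the concatenated raw counts
            of both modalities for one sample.
    modality_switch: column index where the second modality starts.
    """
    factors = []
    for row in counts:
        first = sum(row[:modality_switch])
        second = sum(row[modality_switch:])
        factors.append([first, second])
    return factors


# Toy data: 2 samples, 3 features from the first modality and 2 from the second.
counts = [
    [1, 0, 2, 4, 1],
    [0, 3, 1, 0, 2],
]
modality_features = [3, 2]
modality_switch = modality_features[0]  # the second modality starts at column 3
n_features = sum(modality_features)

library = sum_scaling_factors(counts, modality_switch)
```

In this toy case library holds one row per sample and one column per modality, which matches the "per-sample and -modality" description of the attribute.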

data_to_tensor(): Make a tensor out of data. In multi-modal cases, modalities are concatenated. This only works with mudata and anndata objects.

get_correction_labels(idx=None): Return values of numerically transformed correction features per sample

get_labels(idx=None): Return sample-specific values of monitored clustering feature (if available)

get_mask(indices): Return a list of tensors that indicate which samples belong to which modality
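
One possible reading of get_mask, sketched under assumptions: the mask is stored as one list of flags per modality, and the method selects the entries for the requested samples. The function name select_modality_mask and the list-of-lists layout are hypothetical stand-ins for illustration; the real method returns torch tensors and may be organised differently.

```python
def select_modality_mask(modality_mask, indices):
    """Select the per-modality availability flags for the given samples."""
    return [[mask[i] for i in indices] for mask in modality_mask]


# One list of flags per modality; True means the sample has that modality.
modality_mask = [
    [True, True, False, True],   # first modality (e.g. GEX)
    [True, False, True, True],   # second modality (e.g. ATAC)
]

# Flags for samples 1 and 2 only, still grouped by modality.
batch_mask = select_modality_mask(modality_mask, [1, 2])
```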

modality_mask

    if self.mosaic and label == 'train':
        # if the train set is mosaic, we use 10% of the paired data to minimize distances between modalities
        # for this we need to copy those samples, add them with each modality option (paired, GEX, ATAC)
        # and structure them in such a way that we can use the representation distances in the loss
        data, data_triangle, self.modality_mask, self.modality_mask_triangle, self.mosaic_train_idx = self._make_mosaic_train_set(data)
        self.data_triangle = torch.Tensor(data_triangle.X.todense())
    elif self.mosaic and label == 'test':
        self.modality_mask = self._get_mosaic_mask(data)
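
The comments in the mosaic training branch describe copying 10% of the paired samples once per modality option. The sketch below illustrates that idea only; it is not the implementation of _make_mosaic_train_set, whose internals are not shown here. The helper make_triangle_subset, its parameters, and the (index, option) tuple layout are all hypothetical.

```python
import random


def make_triangle_subset(paired_indices, fraction=0.1, seed=0):
    """Pick a fraction of paired samples and list each one under every modality option."""
    rng = random.Random(seed)
    n_pick = max(1, round(len(paired_indices) * fraction))
    picked = sorted(rng.sample(paired_indices, n_pick))
    options = ("paired", "GEX", "ATAC")
    # Group by option so the three copies of a sample sit at matching offsets.
    triangle = [(idx, option) for option in options for idx in picked]
    return picked, triangle


paired_indices = list(range(20))
picked, triangle = make_triangle_subset(paired_indices)
```

Grouping the copies by option keeps the three views of each selected sample at matching positions, which is one way to make pairwise representation distances easy to compute in a loss.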