multiDGD.dataset

class multiDGD.dataset.omicsDataset(data, scaling_type='sum', split=None)

General dataset class for single-cell data. Applicable to single-modality and multi-modal data.

Arguments

data: MuData or AnnData object

object containing the single-cell count data

scaling_type: str

how the scaling factors are calculated (currently only ‘sum’ is supported)

Arguments (optional)

split: str

Selects a subset of the data to use. Can be ‘train’, ‘validation’, or ‘test’. The default is None, meaning that the full data is used.

Attributes

data: torch.Tensor

tensor with raw counts of shape (n_samples, n_features)

scaling_type: str

variable defining how to calculate scaling factors

n_sample: int

number of samples in dataset

n_features: int

number of total features in dataset

library: torch.Tensor

tensor of per-sample, per-modality scaling factors

sparse: bool

whether the data is sparse or not

modalities: list

list of modalities in data

modality_switch: int

position in data tensor where modalities switch

modality_features: list

number of features per modality

meta: np.array

array of sample-wise values of monitored clustering feature (if available)

correction: np.array

array of sample-wise values of correction features (if available)

correction_classes: int

number of classes of correction features (if available)
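
To illustrate how `modality_switch`, `modality_features`, and `library` relate to the concatenated data tensor, here is a minimal NumPy sketch. It is not the library's implementation: it assumes that modalities are concatenated along the feature axis and that ‘sum’ scaling means the per-sample total count within each modality. The array names (`rna`, `atac`) and the toy values are hypothetical.

```python
import numpy as np

# Hypothetical toy counts: 3 samples, 4 features in one modality and 2 in the other.
rna = np.array([[1, 0, 2, 1],
                [0, 3, 0, 1],
                [2, 2, 2, 2]])
atac = np.array([[1, 1],
                 [0, 1],
                 [1, 0]])

# Concatenate the modalities along the feature axis: shape (n_samples, n_features).
data = np.concatenate([rna, atac], axis=1)

# Number of features per modality, and the column where the second modality starts.
modality_features = [rna.shape[1], atac.shape[1]]
modality_switch = modality_features[0]

# Assumed meaning of 'sum' scaling: per-sample total count within each modality.
library = np.stack(
    [
        data[:, :modality_switch].sum(axis=1),
        data[:, modality_switch:].sum(axis=1),
    ],
    axis=1,
)
```

Under these assumptions, `library` has shape (n_samples, n_modalities), and slicing `data` at `modality_switch` recovers the individual modalities.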

Methods

data_to_tensor()

Make a tensor out of data. In multi-modal cases, modalities are concatenated. This only works with mudata and anndata objects.

get_correction_labels(idx=None)

Return the numerically encoded values of the correction features per sample

get_labels(idx=None)

Return sample-specific values of monitored clustering feature (if available)

get_mask(indices)

Return a list of tensors that indicate which samples belong to which modality
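
As a rough illustration of what a list of per-modality masks can look like, the following NumPy sketch builds one boolean mask per modality for a set of selected samples. The helper `modality_masks`, its arguments, and the per-sample labels are hypothetical and are not part of multiDGD; the sketch also assumes that a sample labeled "paired" belongs to every modality.

```python
import numpy as np

def modality_masks(sample_modalities, indices, modalities=("GEX", "ATAC")):
    """Hypothetical helper: one boolean mask per modality for the selected samples.

    Assumption: a sample labeled "paired" counts as belonging to every modality.
    """
    selected = np.asarray(sample_modalities)[np.asarray(indices)]
    return [(selected == m) | (selected == "paired") for m in modalities]

# Hypothetical per-sample labels for four samples.
labels = ["paired", "GEX", "ATAC", "GEX"]

# Masks for the first three samples: one array per modality.
masks = modality_masks(labels, [0, 1, 2])
```

Each mask has one entry per selected sample, so it can be used to zero out or skip the part of the loss that belongs to a modality a sample does not have.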

modality_mask
    if self.mosaic and label == 'train':
        # if the train set is mosaic, we use 10% of the paired data to minimize distances between modalities
        # for this we need to copy those samples, add them with each modality option (paired, GEX, ATAC)
        # and structure them in such a way that we can use the representation distances in the loss
        (data, data_triangle, self.modality_mask,
         self.modality_mask_triangle, self.mosaic_train_idx) = self._make_mosaic_train_set(data)
        self.data_triangle = torch.Tensor(data_triangle.X.todense())
    elif self.mosaic and label == 'test':
        self.modality_mask = self._get_mosaic_mask(data)
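
The comments in the mosaic training branch describe copying a fraction of the paired samples and adding each copy under every modality option (paired, GEX, ATAC). The sketch below shows one way such an index layout could be built. It is a hypothetical illustration, not the behavior of `_make_mosaic_train_set`: the function `make_triangle_indices`, the random selection, the block ordering, and the label array are all assumptions.

```python
import numpy as np

def make_triangle_indices(sample_modalities, fraction=0.1, seed=0):
    """Hypothetical sketch: pick a fraction of the paired samples and repeat
    each one under every modality option (paired, GEX, ATAC).

    Assumes at least one sample is labeled "paired".
    """
    labels = np.asarray(sample_modalities)
    paired_idx = np.flatnonzero(labels == "paired")
    n_pick = max(1, int(round(fraction * paired_idx.size)))

    rng = np.random.default_rng(seed)
    picked = np.sort(rng.choice(paired_idx, size=n_pick, replace=False))

    options = ["paired", "GEX", "ATAC"]
    # Block layout: all picked samples as "paired", then as "GEX", then as "ATAC".
    triangle_idx = np.tile(picked, len(options))
    triangle_labels = np.repeat(options, n_pick)
    return triangle_idx, triangle_labels

# Hypothetical mosaic training set: 20 paired, 5 GEX-only, 5 ATAC-only samples.
labels = ["paired"] * 20 + ["GEX"] * 5 + ["ATAC"] * 5
triangle_idx, triangle_labels = make_triangle_indices(labels)
```

With this block layout, position `i` in each block refers to the same original sample, so the representations of the three copies can be compared pairwise when computing a distance term in the loss.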