multiDGD.dataset
- class multiDGD.dataset.omicsDataset(data, scaling_type='sum', split=None)
General dataset class for single-cell data. Applicable to single modalities and multi-modal data.
Arguments
- data: mudata or anndata object
data object containing the data
- scaling_type: str
scaling option used to calculate the scaling factors (currently only ‘sum’ is supported)
Arguments (optional)
- split: str
Restricts the dataset to a subset of the data. Can be ‘train’, ‘validation’, or ‘test’. The default is None, meaning that the full data is used.
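As an illustration of the ‘sum’ scaling option, the sketch below computes one scaling factor per sample and per modality by summing raw counts. It assumes that ‘sum’ means the total count of a sample within each modality, which the documentation does not state explicitly. The function name modality_library is hypothetical, and plain Python lists stand in for torch tensors so the example has no dependencies. The split position is named modality_switch after the attribute of the same name.

```python
def modality_library(counts, modality_switch):
    """Sum raw counts per sample and per modality (assumed meaning of 'sum').

    counts: list of rows, each row holding the features of both modalities
            concatenated along the feature axis.
    modality_switch: column index where the second modality starts.
    """
    library = []
    for row in counts:
        first = sum(row[:modality_switch])   # total counts in modality 1
        second = sum(row[modality_switch:])  # total counts in modality 2
        library.append([first, second])
    return library


# Two samples, three features in the first modality, two in the second.
counts = [
    [1, 2, 3, 0, 1],
    [0, 4, 1, 2, 2],
]
modality_switch = 3

library = modality_library(counts, modality_switch)
# Number of features per modality, derived from the switch position.
modality_features = [modality_switch, len(counts[0]) - modality_switch]
```

Each row of library then holds the two scaling factors for one sample, mirroring the "per-sample and -modality" shape described for the library attribute.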
Attributes
- data: torch.Tensor
tensor with raw counts of shape (n_samples, n_features)
- scaling_type: string
variable defining how to calculate scaling factors
- n_sample: int
number of samples in dataset
- n_features: int
number of total features in dataset
- library: torch.Tensor
tensor of per-sample, per-modality scaling factors
- sparse: bool
whether the data is sparse or not
- modalities: list
list of modalities in data
- modality_switch: int
position in data tensor where modalities switch
- modality_features: list
number of features per modality
- meta: np.array
array of sample-wise values of monitored clustering feature (if available)
- correction: np.array
array of sample-wise values of correction features (if available)
- correction_classes: int
number of classes of correction features (if available)
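The correction and correction_classes attributes, together with the idx=None argument of the label getters, can be pictured with a small sketch. It assumes that "numerically transformed" means mapping each distinct category to an integer code, which is an assumption rather than something the documentation confirms. The names encode_labels and select_labels are hypothetical helpers, not part of multiDGD.

```python
def encode_labels(values):
    """Map each distinct category to an integer code (assumed encoding)."""
    classes = sorted(set(values))
    lookup = {c: i for i, c in enumerate(classes)}
    codes = [lookup[v] for v in values]
    return codes, len(classes)


def select_labels(codes, idx=None):
    """Return all codes, or only those at the given sample indices."""
    if idx is None:
        return codes
    return [codes[i] for i in idx]


# Sample-wise values of a hypothetical correction feature.
correction = ["site1", "site2", "site1", "site3"]

codes, n_classes = encode_labels(correction)
subset = select_labels(codes, idx=[0, 3])
```

Passing idx=None returns the codes for every sample, while a list of indices returns only the matching entries.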
Methods
- data_to_tensor()
Make a tensor out of data. In multi-modal cases, modalities are concatenated. This only works with mudata and anndata objects.
- get_correction_labels(idx=None)
Return values of numerically transformed correction features per sample
- get_labels(idx=None)
Return sample-specific values of monitored clustering feature (if available)
- get_mask(indices)
Return a list of tensors that indicate which samples belong to which modality
- modality_mask
    if self.mosaic and label == 'train':
        # if the train set is mosaic, we use 10% of the paired data to minimize distances between modalities
        # for this we need to copy those samples, add them with each modality option (paired, GEX, ATAC)
        # and structure them in such a way that we can use the representation distances in the loss
        data, data_triangle, self.modality_mask, self.modality_mask_triangle, self.mosaic_train_idx = self._make_mosaic_train_set(data)
        self.data_triangle = torch.Tensor(data_triangle.X.todense())
    elif self.mosaic and label == 'test':
        self.modality_mask = self._get_mosaic_mask(data)
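The comments in the modality_mask entry describe copying 10% of the paired samples once per modality option (paired, GEX, ATAC). The sketch below shows one way such a "triangle" of copies and its masks could be laid out. It is not the implementation of _make_mosaic_train_set: the function make_triangle, its parameters, the random selection, and the (GEX observed, ATAC observed) tuple layout are all assumptions made for illustration.

```python
import random


def make_triangle(paired_idx, fraction=0.1, seed=0):
    """Pick a fraction of paired samples and repeat them per modality option.

    Returns the selected indices, the repeated indices, and one
    (GEX observed, ATAC observed) mask per repeated sample.
    """
    rng = random.Random(seed)
    n_select = max(1, round(len(paired_idx) * fraction))
    selected = sorted(rng.sample(paired_idx, n_select))

    # One entry per modality option; the tuple layout is an assumption.
    options = {
        "paired": (True, True),
        "GEX": (True, False),
        "ATAC": (False, True),
    }

    triangle_idx = []
    triangle_mask = []
    for mask in options.values():
        for i in selected:
            triangle_idx.append(i)      # same sample, copied per option
            triangle_mask.append(mask)  # which modalities stay visible
    return selected, triangle_idx, triangle_mask


# Twenty paired samples, of which 10% (two samples) are copied three times.
selected, triangle_idx, triangle_mask = make_triangle(list(range(20)))
```

Keeping the copies grouped by option means the i-th sample of each group refers to the same cell, which is the kind of structure needed to compare the representations of the three copies in a loss term.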