hrchy_cytocommunity.models.dataset.SpatialOmicsImageDataset
- class hrchy_cytocommunity.models.dataset.SpatialOmicsImageDataset(root, transform=None, pre_transform=None)
Spatial omics dataset loader for HRCHY-CytoCommunity.
This class inherits from torch_geometric.data.InMemoryDataset and is designed to read preprocessed spatial omics graph data files (coordinates, edges, node attributes, and graph indices) from a specified directory and construct PyTorch Geometric torch_geometric.data.Data objects for downstream graph neural network training.
Each sample (region/tissue section) corresponds to one graph, and all graphs are collated into a single dataset stored in processed/SpatialOmicsImageDataset.pt.
- Parameters:
root (str or Path) – Root directory containing the following subfolders:
  - raw/: raw graph files (text format).
  - processed/: where the processed dataset will be saved.
transform (callable, optional) – Data transformation function applied before returning a graph sample. See torch_geometric.transforms.
pre_transform (callable, optional) – Data preprocessing transformation function applied before saving processed data.
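The root layout described in the parameters can be prepared ahead of time. The following is a minimal sketch using only the standard library; the helper name prepare_root and the folder name HRCHY_input are illustrative assumptions, not part of the package.

```python
from pathlib import Path
import tempfile


def prepare_root(root):
    """Create the raw/ and processed/ subfolders under root if missing.

    Hypothetical helper for illustration; not part of hrchy_cytocommunity.
    """
    root = Path(root)
    raw_dir = root / "raw"
    processed_dir = root / "processed"
    raw_dir.mkdir(parents=True, exist_ok=True)
    processed_dir.mkdir(parents=True, exist_ok=True)
    return raw_dir, processed_dir


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir, processed_dir = prepare_root(Path(tmp) / "HRCHY_input")
    layout = sorted(p.name for p in raw_dir.parent.iterdir())
    print(layout)  # ['processed', 'raw']
```

The raw text files would then be copied into raw/ before the dataset is constructed.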
- data
Tensor representation of the concatenated graph dataset.
- Type:
torch_geometric.data.Data
- slices
Indexing dictionary used by PyTorch Geometric to retrieve individual graphs efficiently.
- Type:
dict
- processed_paths
List of output file paths (by default ['SpatialOmicsImageDataset.pt']).
- Type:
list[str]
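To illustrate how the data and slices attributes relate, here is a pure-Python sketch of the concatenate-and-slice idea. It is not PyTorch Geometric's implementation, which stores tensors and keeps one set of boundaries per attribute; the function names collate_rows and get_rows are hypothetical.

```python
def collate_rows(graphs):
    """Concatenate per-graph row lists and record boundary offsets."""
    data, slices = [], [0]
    for rows in graphs:
        data.extend(rows)
        slices.append(len(data))  # end offset of this graph's rows
    return data, slices


def get_rows(data, slices, idx):
    """Recover the rows of graph idx from the concatenated storage."""
    return data[slices[idx]:slices[idx + 1]]


graphs = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # graph 0: three nodes
    [[0.7, 0.8], [0.9, 1.0]],              # graph 1: two nodes
]
x, x_slices = collate_rows(graphs)
print(x_slices)                  # [0, 3, 5]
print(get_rows(x, x_slices, 1))  # [[0.7, 0.8], [0.9, 1.0]]
```

Storing one concatenated array plus offsets is what lets an individual graph be retrieved without keeping a separate object per graph.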
- raw_file_names()
Returns the list of expected raw input files (empty list in this case).
- processed_file_names()
Returns the list of expected processed dataset files.
- download()
Placeholder for downloading data (not implemented).
- process()
Constructs torch_geometric.data.Data objects from input text files under raw_dir. The following files are required for each region name listed in ImageNameList.txt:
  - <region>_EdgeIndex.txt: edge list (tab-delimited, int64)
  - <region>_NodeAttr.txt: node attributes (tab-delimited, float32)
  - <region>_GraphIndex.txt: graph index (int)
The resulting dataset is saved to processed/SpatialOmicsImageDataset.pt.
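Before triggering process(), it can help to confirm that every region has its required files. The sketch below is one way to do that with the standard library, assuming the file naming listed above; the helper missing_raw_files and the region names regionA and regionB are hypothetical.

```python
from pathlib import Path
import tempfile

# File suffixes required per region, as listed in the process() description.
REQUIRED_SUFFIXES = ("_EdgeIndex.txt", "_NodeAttr.txt", "_GraphIndex.txt")


def missing_raw_files(raw_dir):
    """Return the names of required per-region files absent from raw_dir.

    Hypothetical helper for illustration; not part of hrchy_cytocommunity.
    """
    raw_dir = Path(raw_dir)
    lines = (raw_dir / "ImageNameList.txt").read_text().splitlines()
    regions = [line.strip() for line in lines if line.strip()]
    return [
        f"{region}{suffix}"
        for region in regions
        for suffix in REQUIRED_SUFFIXES
        if not (raw_dir / f"{region}{suffix}").is_file()
    ]


# Demonstrate with two regions, only one of which has its files.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "ImageNameList.txt").write_text("regionA\nregionB\n")
    for suffix in REQUIRED_SUFFIXES:
        (raw_dir / f"regionA{suffix}").write_text("0\t1\n")
    missing = missing_raw_files(raw_dir)
    print(missing)
```

An empty result means every region named in ImageNameList.txt has all three files present; it does not validate their contents.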
Notes
The input file ImageNameList.txt must be located in raw_dir and contain one region name per line.
The class automatically symmetrizes edge indices when necessary and converts NumPy arrays to PyTorch tensors.
This class is intended for use with HRCHY-CytoCommunity and is compatible with PyTorch Geometric’s standard data pipeline.
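As an illustration of what symmetrizing an edge list means, the following pure-Python sketch adds the reverse of every edge and removes duplicates. How the class itself performs this step is not shown here; a tensor-based pipeline could instead rely on a utility such as torch_geometric.utils.to_undirected. The function name symmetrize_edges is hypothetical.

```python
def symmetrize_edges(edges):
    """Return a sorted edge list containing both (u, v) and (v, u)."""
    both = set()
    for u, v in edges:
        both.add((u, v))
        both.add((v, u))  # add the reverse direction
    return sorted(both)


# (1, 2) already has its reverse; (0, 1) does not.
edges = [(0, 1), (1, 2), (2, 1)]
sym = symmetrize_edges(edges)
print(sym)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

After symmetrization every edge appears in both directions, which is how an undirected graph is usually represented in an edge_index.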
Examples
>>> from hrchy_cytocommunity.models.dataset import SpatialOmicsImageDataset
>>> dataset = SpatialOmicsImageDataset(root="data/HRCHY_input/")
>>> print(len(dataset))
5
>>> print(dataset[0])
Data(x=[1024, 30], edge_index=[2, 4096], graph_idx=[1])
- __init__(root, transform=None, pre_transform=None)
Methods
__init__(root[, transform, pre_transform])
collate(data_list) – Collates a list of Data or HeteroData objects to the internal storage format of InMemoryDataset.
copy([idx]) – Performs a deep-copy of the dataset.
cpu(*args) – Moves the dataset to CPU memory.
cuda([device]) – Moves the dataset to CUDA memory.
download() – Downloads the dataset to the self.raw_dir folder.
get(idx) – Gets the data object at index idx.
get_summary() – Collects summary statistics for the dataset.
index_select(idx) – Creates a subset of the dataset from specified indices idx.
indices()
len() – Returns the number of data objects stored in the dataset.
load(path[, data_cls]) – Loads the dataset from the file path path.
print_summary([fmt]) – Prints summary statistics of the dataset to the console.
process() – Processes the dataset to the self.processed_dir folder.
save(data_list, path) – Saves a list of data objects to the file path path.
shuffle([return_perm]) – Randomly shuffles the examples in the dataset.
to(device) – Performs device conversion of the whole dataset.
to_datapipe() – Converts the dataset into a torch.utils.data.DataPipe.
to_on_disk_dataset([root, backend, log]) – Converts the InMemoryDataset to a OnDiskDataset variant.
Attributes
has_download – Checks whether the dataset defines a download() method.
has_process – Checks whether the dataset defines a process() method.
num_classes – Returns the number of classes in the dataset.
num_edge_features – Returns the number of features per edge in the dataset.
num_features – Returns the number of features per node in the dataset.
num_node_features – Returns the number of features per node in the dataset.
processed_dir
processed_file_names – The name of the files in the self.processed_dir folder that must be present in order to skip processing.
processed_paths – The absolute filepaths that must be present in order to skip processing.
raw_dir
raw_file_names – The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
raw_paths – The absolute filepaths that must be present in order to skip downloading.