hrchy_cytocommunity.models.dataset.SpatialOmicsImageDataset

class hrchy_cytocommunity.models.dataset.SpatialOmicsImageDataset(root, transform=None, pre_transform=None)

Spatial omics dataset loader for HRCHY-CytoCommunity.

This class inherits from torch_geometric.data.InMemoryDataset. It reads preprocessed spatial omics graph files (coordinates, edges, node attributes, and graph indices) from a specified directory and builds torch_geometric.data.Data objects for downstream graph neural network training.

Each sample (region/tissue section) corresponds to one graph, and all graphs are collated into a single dataset stored in processed/SpatialOmicsImageDataset.pt.

Parameters:
  • root (str or Path) – Root directory containing two subfolders: raw/, which holds the raw graph files (text format), and processed/, where the processed dataset is saved.

  • transform (callable, optional) – Data transformation function applied before returning a graph sample. See torch_geometric.transforms.

  • pre_transform (callable, optional) – Data preprocessing transformation function applied before saving processed data.

data

Tensor representation of the concatenated graph dataset.

Type:

torch_geometric.data.Data

slices

Indexing dictionary used by PyTorch Geometric to retrieve individual graphs efficiently.

Type:

dict
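The relationship between data and slices can be illustrated with a small NumPy sketch. This is a simplified illustration of the idea (flat storage plus per-graph boundaries), not PyTorch Geometric's actual implementation; the names x_all, x_slices, and get_graph are hypothetical.

```python
import numpy as np

# Three toy graphs with 2, 3 and 1 nodes; each node carries 4 features.
# Every node of graph i is filled with the value i so graphs are easy to tell apart.
graphs = [np.full((n, 4), fill_value=i, dtype=np.float32)
          for i, n in enumerate([2, 3, 1])]

# "Collation": concatenate along the node axis and record where each graph starts.
x_all = np.concatenate(graphs, axis=0)
x_slices = np.cumsum([0] + [g.shape[0] for g in graphs])


def get_graph(idx):
    """Recover the node features of graph `idx` from the flat storage."""
    start, end = x_slices[idx], x_slices[idx + 1]
    return x_all[start:end]


print(x_slices.tolist())   # [0, 2, 5, 6]
print(get_graph(1).shape)  # (3, 4)
```

Keeping one large array plus a small table of offsets is what lets an in-memory dataset return any single graph with one slicing operation.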

processed_paths

List of output file paths. By default it holds a single entry, the path to SpatialOmicsImageDataset.pt inside processed_dir.

Type:

list[str]

raw_file_names()

Returns the list of expected raw input files (empty list in this case).

processed_file_names()

Returns the list of expected processed dataset files.

download()

Placeholder for downloading data (not implemented).

process()

Constructs torch_geometric.data.Data objects from input text files under raw_dir. The following files are required for each region name listed in ImageNameList.txt:

  • <region>_EdgeIndex.txt — edge list (tab-delimited, int64)

  • <region>_NodeAttr.txt — node attributes (tab-delimited, float32)

  • <region>_GraphIndex.txt — graph index (int)

The resulting dataset is saved to processed/SpatialOmicsImageDataset.pt.
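The input layout described above can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the region names, the helper write_region, and the temporary root directory are hypothetical, the edge list is assumed to hold one edge per row (two columns), and the graph index is assumed to be a single integer per region. Check these assumptions against your own preprocessing output before relying on them.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_region(raw_dir, region, edges, node_attr, graph_index):
    """Write the three per-region text files (hypothetical helper)."""
    raw_dir = Path(raw_dir)
    # Assumption: one edge per row, two tab-separated integer columns.
    np.savetxt(raw_dir / f"{region}_EdgeIndex.txt", edges, fmt="%d", delimiter="\t")
    # One node per row, one tab-separated float column per feature.
    np.savetxt(raw_dir / f"{region}_NodeAttr.txt", node_attr, fmt="%.6f", delimiter="\t")
    # Assumption: a single integer identifying the graph.
    np.savetxt(raw_dir / f"{region}_GraphIndex.txt", np.array([graph_index]), fmt="%d")


# Build <root>/raw in a temporary location so the sketch is self-contained.
root = Path(tempfile.mkdtemp())
raw_dir = root / "raw"
raw_dir.mkdir(parents=True)

regions = ["region_A", "region_B"]  # hypothetical region names
rng = np.random.default_rng(0)
for i, region in enumerate(regions):
    edges = np.array([[0, 1], [1, 2], [2, 3]], dtype=np.int64)
    node_attr = rng.random((4, 5)).astype(np.float32)
    write_region(raw_dir, region, edges, node_attr, i)

# ImageNameList.txt lists one region name per line.
(raw_dir / "ImageNameList.txt").write_text("\n".join(regions) + "\n")

# Read one file back with the dtype documented for the edge list.
loaded = np.loadtxt(raw_dir / "region_A_EdgeIndex.txt", dtype=np.int64, delimiter="\t")
print(loaded.shape)  # (3, 2)
```

With files laid out this way, passing the root directory to SpatialOmicsImageDataset would be the next step; whether the loader accepts this exact column layout is not confirmed by this page.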

Notes

  • The input file ImageNameList.txt must be located in raw_dir, containing one region name per line.

  • The class automatically symmetrizes edge indices when necessary and converts NumPy arrays to PyTorch tensors.

  • This class is intended for use with HRCHY-CytoCommunity and is compatible with PyTorch Geometric’s standard data pipeline.
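The edge symmetrization mentioned in the notes can be sketched as follows. This is one plausible way to do it in NumPy, not necessarily how the class implements it; symmetrize_edges is a hypothetical helper, and the edge index is assumed to be in the (2, E) layout used by PyTorch Geometric.

```python
import numpy as np


def symmetrize_edges(edge_index):
    """Return a (2, E') array holding every edge in both directions, without duplicates."""
    edge_index = np.asarray(edge_index, dtype=np.int64)
    reversed_edges = edge_index[::-1]  # swap the source and target rows
    both = np.concatenate([edge_index, reversed_edges], axis=1)
    # Drop duplicate columns, e.g. when an edge was already listed in both directions.
    return np.unique(both, axis=1)


# A directed triangle: 0 -> 1, 1 -> 2, 2 -> 0.
edge_index = np.array([[0, 1, 2],
                       [1, 2, 0]])
sym = symmetrize_edges(edge_index)
print(sym.shape)  # (2, 6)
```

The resulting int64 array could then be converted to a tensor with torch.from_numpy before being stored in a Data object.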

Examples

>>> from hrchy_cytocommunity.models.dataset import SpatialOmicsImageDataset
>>> dataset = SpatialOmicsImageDataset(root="data/HRCHY_input/")
>>> print(len(dataset))
5
>>> print(dataset[0])
Data(x=[1024, 30], edge_index=[2, 4096], graph_idx=[1])

__init__(root, transform=None, pre_transform=None)

Methods

__init__(root[, transform, pre_transform])

collate(data_list)

Collates a list of Data or HeteroData objects to the internal storage format of InMemoryDataset.

copy([idx])

Performs a deep-copy of the dataset.

cpu(*args)

Moves the dataset to CPU memory.

cuda([device])

Moves the dataset to CUDA memory.

download()

Downloads the dataset to the self.raw_dir folder.

get(idx)

Gets the data object at index idx.

get_summary()

Collects summary statistics for the dataset.

index_select(idx)

Creates a subset of the dataset from specified indices idx.

indices()

len()

Returns the number of data objects stored in the dataset.

load(path[, data_cls])

Loads the dataset from the file path path.

print_summary([fmt])

Prints summary statistics of the dataset to the console.

process()

Processes the dataset to the self.processed_dir folder.

save(data_list, path)

Saves a list of data objects to the file path path.

shuffle([return_perm])

Randomly shuffles the examples in the dataset.

to(device)

Performs device conversion of the whole dataset.

to_datapipe()

Converts the dataset into a torch.utils.data.DataPipe.

to_on_disk_dataset([root, backend, log])

Converts the InMemoryDataset to an OnDiskDataset variant.

Attributes

data

has_download

Checks whether the dataset defines a download() method.

has_process

Checks whether the dataset defines a process() method.

num_classes

Returns the number of classes in the dataset.

num_edge_features

Returns the number of features per edge in the dataset.

num_features

Returns the number of features per node in the dataset.

num_node_features

Returns the number of features per node in the dataset.

processed_dir

processed_file_names

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

processed_paths

The absolute filepaths that must be present in order to skip processing.

raw_dir

raw_file_names

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

raw_paths

The absolute filepaths that must be present in order to skip downloading.