hrchy_cytocommunity.tools.data_preprocessing.formulate_HRCHYCytoCommunity_input_from_anndata_singlecell

hrchy_cytocommunity.tools.data_preprocessing.formulate_HRCHYCytoCommunity_input_from_anndata_singlecell(adata, sample_id, ct_col, categories, output_dir, graph_id, coarse_gt_col=None, fine_gt_col=None)

Formulate HRCHY-CytoCommunity input files from a single-cell spatial transcriptomics AnnData object.

This function converts an AnnData object containing single-cell spatial transcriptomics data into a series of text files required by HRCHY-CytoCommunity for downstream hierarchical tissue structure identification. The generated files include spatial coordinates, cell type labels, optional ground truth labels, node attribute matrices, and graph index information.

Parameters:
  • adata (anndata.AnnData) – Single-cell spatial transcriptomics data object containing:
      - Spatial coordinates in adata.obsm['spatial'] (array of shape (n_cells, 2))
      - Cell metadata in adata.obs (must include at least the column specified by ct_col)

  • sample_id (str) – Unique sample identifier. Used as prefix for output file names.

  • ct_col (str) – Column name in adata.obs representing cell type labels.

  • categories (list of str) – Ordered list of all possible cell type categories. Used to ensure consistent ordering of node attribute (one-hot) matrices across samples.

  • output_dir (str) – Directory where all generated text files will be saved. If the directory does not exist, it will be created automatically.

  • graph_id (int) – Graph index (integer ID) assigned to this spatial sample, typically used for multi-sample integration.

  • coarse_gt_col (str, optional) – Column name in adata.obs representing coarse-grained ground truth labels. If None, no coarse ground truth file is generated.

  • fine_gt_col (str, optional) – Column name in adata.obs representing fine-grained ground truth labels. If None, no fine ground truth file is generated.

Outputs

The following tab-separated files will be written to output_dir:

  • <sample_id>_Coordinates.txt: spatial coordinates (x, y)

  • <sample_id>_CellTypeLabel.txt: cell type label per cell

  • <sample_id>_NodeAttr.txt: node attribute matrix (one-hot encoding of cell types)

  • <sample_id>_NodeName.txt: names of node attribute dimensions (cell type names)

  • <sample_id>_GraphIndex.txt: integer index of the graph

  • <sample_id>_coarseGT.txt: coarse-level ground truth labels (optional)

  • Fine-level ground truth labels file, written only when fine_gt_col is given (optional)
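
After a run, it can be useful to confirm that the documented files were actually written. The sketch below builds the expected paths from the file-name patterns listed above. Both helpers are hypothetical and not part of hrchy_cytocommunity. The fine-level ground truth file is left out because its file name is not given in this section.

```python
from pathlib import Path

# File-name suffixes taken from the documented output list.
REQUIRED_SUFFIXES = ["Coordinates", "CellTypeLabel", "NodeAttr", "NodeName", "GraphIndex"]


def expected_output_files(output_dir, sample_id, coarse_gt=False):
    """Hypothetical helper: paths the function is documented to write."""
    suffixes = list(REQUIRED_SUFFIXES)
    if coarse_gt:
        # Only expected when coarse_gt_col was passed to the function.
        suffixes.append("coarseGT")
    return [Path(output_dir) / f"{sample_id}_{suffix}.txt" for suffix in suffixes]


def missing_output_files(output_dir, sample_id, coarse_gt=False):
    """Return the documented output paths that do not exist on disk."""
    paths = expected_output_files(output_dir, sample_id, coarse_gt)
    return [path for path in paths if not path.exists()]
```

Calling missing_output_files("data/HRCHY_input/", "sample1", coarse_gt=True) after the conversion should return an empty list if every documented file is present.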

Notes

  • The function assumes that all adata.obs columns used (ct_col, coarse_gt_col, fine_gt_col) contain categorical or string labels.

  • One-hot encoding of cell types against the fixed categories list ensures consistent node attribute dimensions across multiple samples.

  • Missing values in the one-hot matrix are automatically replaced with 0.

  • All files are tab-delimited and saved in plain text for downstream compatibility.
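
The notes on one-hot encoding can be illustrated with a small dependency-free sketch. It is not the library's implementation, which may well use pandas internally. The helper names one_hot_rows and to_tsv are hypothetical, and two behaviours are assumptions: a label outside categories is encoded as an all-zero row, and the tab-delimited text has no header row.

```python
def one_hot_rows(labels, categories):
    """Encode each label as a 0/1 row whose columns follow `categories`."""
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        column = column_of.get(label)
        if column is not None:  # assumption: unknown labels stay all-zero
            row[column] = 1
        rows.append(row)
    return rows


def to_tsv(rows):
    """Render rows as tab-delimited text, one cell per line (no header)."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


categories = ["B_cell", "T_cell", "Macrophage", "Endothelial"]
# This sample contains no Macrophage or Endothelial cells, yet the
# matrix still has four columns, so it lines up with other samples.
rows = one_hot_rows(["T_cell", "B_cell", "T_cell"], categories)
print(to_tsv(rows))
```

Because the column order comes from categories rather than from the labels observed in one sample, cell types absent from a sample become all-zero columns instead of disappearing.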

Examples

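>>> import scanpy as sc
>>> from hrchy_cytocommunity.tools.data_preprocessing import (
...     formulate_HRCHYCytoCommunity_input_from_anndata_singlecell,
... )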
>>> adata = sc.read_h5ad("sample1.h5ad")
>>> categories = ['B_cell', 'T_cell', 'Macrophage', 'Endothelial']
>>> formulate_HRCHYCytoCommunity_input_from_anndata_singlecell(
...     adata=adata,
...     sample_id="sample1",
...     ct_col="cell_type",
...     categories=categories,
...     output_dir="data/HRCHY_input/",
...     graph_id=0,
...     coarse_gt_col="coarse_label",
...     fine_gt_col="fine_label"
... )
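
When several samples are processed, they should share one categories list and each should receive its own graph_id. The sketch below shows one way to prepare those arguments. The helpers shared_categories and plan_runs are hypothetical. Sorting the labels alphabetically and numbering graphs consecutively from 0 are assumptions, since the documentation only requires an ordered list and an integer ID.

```python
def shared_categories(label_columns):
    """Sorted union of cell type labels across all samples."""
    seen = set()
    for labels in label_columns:
        seen.update(str(label) for label in labels)
    return sorted(seen)


def plan_runs(labels_by_sample):
    """Hypothetical helper: one keyword-argument dict per sample.

    labels_by_sample maps sample_id -> iterable of cell type labels
    (for example, adata.obs[ct_col] for that sample).
    """
    categories = shared_categories(labels_by_sample.values())
    return [
        {"sample_id": sample_id, "categories": categories, "graph_id": graph_id}
        for graph_id, sample_id in enumerate(sorted(labels_by_sample))
    ]


runs = plan_runs({
    "sample1": ["B_cell", "T_cell", "T_cell"],
    "sample2": ["Macrophage", "B_cell"],
})
```

Each dict in runs can then be combined with the matching adata, ct_col, and output_dir and passed as keyword arguments to formulate_HRCHYCytoCommunity_input_from_anndata_singlecell.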