hrchy_cytocommunity.tools.data_preprocessing.formulate_HRCHYCytoCommunity_input_from_anndata_singlecell
- hrchy_cytocommunity.tools.data_preprocessing.formulate_HRCHYCytoCommunity_input_from_anndata_singlecell(adata, sample_id, ct_col, categories, output_dir, graph_id, coarse_gt_col=None, fine_gt_col=None)
Formulate HRCHY-CytoCommunity input files from a single-cell spatial transcriptomics AnnData object.
This function converts an AnnData object containing single-cell spatial transcriptomics data into a series of text files required by HRCHY-CytoCommunity for downstream hierarchical tissue structure identification. The generated files include spatial coordinates, cell type labels, optional ground truth labels, node attribute matrices, and graph index information.
- Parameters:
adata (anndata.AnnData) – Single-cell spatial transcriptomics data object containing:
- Spatial coordinates in adata.obsm['spatial'] (array of shape (n_cells, 2))
- Cell metadata in adata.obs (must include at least the column specified by ct_col)
sample_id (str) – Unique sample identifier. Used as the prefix for output file names.
ct_col (str) – Column name in adata.obs representing cell type labels.
categories (list of str) – Ordered list of all possible cell type categories. Used to ensure consistent ordering of node attribute (one-hot) matrices across samples.
output_dir (str) – Directory where all generated text files will be saved. If the directory does not exist, it will be created automatically.
graph_id (int) – Graph index (integer ID) assigned to this spatial sample, typically used for multi-sample integration.
coarse_gt_col (str, optional) – Column name in adata.obs representing coarse-grained ground truth labels. If None, no coarse ground truth file is generated.
fine_gt_col (str, optional) – Column name in adata.obs representing fine-grained ground truth labels. If None, no fine ground truth file is generated.
Outputs
The following tab-separated files will be written to output_dir:
- <sample_id>_Coordinates.txt: spatial coordinates (x, y)
- <sample_id>_CellTypeLabel.txt: cell type label per cell
- <sample_id>_NodeAttr.txt: node attribute matrix (one-hot encoding of cell types)
- <sample_id>_NodeName.txt: names of node attribute dimensions (cell type names)
- <sample_id>_GraphIndex.txt: integer index of the graph
- <sample_id>_coarseGT.txt: coarse-level ground truth labels (optional)
- a fine-level ground truth file, written only when fine_gt_col is given (optional)
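One way to confirm that a run produced the documented files is to build the expected names from sample_id and check them on disk. The following is a minimal sketch and not part of HRCHY-CytoCommunity: the helper names expected_output_files and missing_output_files are hypothetical, and the fine-level ground truth file is left out because its name is not given in this section.

```python
from pathlib import Path

# File name suffixes taken from the documented output list.
REQUIRED_SUFFIXES = [
    "Coordinates",
    "CellTypeLabel",
    "NodeAttr",
    "NodeName",
    "GraphIndex",
]


def expected_output_files(output_dir, sample_id, with_coarse_gt=False):
    """Hypothetical helper: paths the function is documented to write."""
    suffixes = list(REQUIRED_SUFFIXES)
    if with_coarse_gt:
        # Only expected when coarse_gt_col was passed to the function.
        suffixes.append("coarseGT")
    return [Path(output_dir) / f"{sample_id}_{s}.txt" for s in suffixes]


def missing_output_files(output_dir, sample_id, with_coarse_gt=False):
    """Hypothetical helper: expected paths that do not exist on disk."""
    expected = expected_output_files(output_dir, sample_id, with_coarse_gt)
    return [p for p in expected if not p.exists()]
```

Calling missing_output_files("data/HRCHY_input/", "sample1", with_coarse_gt=True) after the conversion should return an empty list if every documented file was written.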
Notes
The function assumes that all adata.obs columns used (ct_col, coarse_gt_col, fine_gt_col) contain categorical or string labels.
Cell types are one-hot encoded against the ordered categories list, which keeps node attribute dimensions and their column order consistent across multiple samples.
Missing values in the one-hot matrix are automatically replaced with 0.
All files are tab-delimited and saved in plain text for downstream compatibility.
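The fixed-order one-hot encoding described in these notes can be sketched as follows. This is an illustration only, assuming NumPy; the helper name one_hot_by_categories is hypothetical and the package's actual implementation may differ. In this sketch a label that is absent from categories yields an all-zero row, which is one possible reading of the zero-fill note.

```python
import numpy as np


def one_hot_by_categories(labels, categories):
    """Hypothetical helper: one-hot encode labels against a fixed category order."""
    # Map each category to its column index; the order of `categories` fixes the columns.
    column_of = {category: j for j, category in enumerate(categories)}
    matrix = np.zeros((len(labels), len(categories)), dtype=int)
    for i, label in enumerate(labels):
        j = column_of.get(label)
        if j is not None:  # labels outside `categories` keep an all-zero row
            matrix[i, j] = 1
    return matrix


categories = ["B_cell", "T_cell", "Macrophage", "Endothelial"]

# Two samples with different cell type content still share the same 4 columns.
sample_a = one_hot_by_categories(["T_cell", "B_cell"], categories)
sample_b = one_hot_by_categories(["Endothelial"], categories)
```

Because the column order comes from categories rather than from the labels observed in each sample, matrices from different samples can be stacked or compared directly.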
Examples
>>> import scanpy as sc
>>> from hrchy_cytocommunity.tools.data_preprocessing import (
...     formulate_HRCHYCytoCommunity_input_from_anndata_singlecell,
... )
>>> adata = sc.read_h5ad("sample1.h5ad")
>>> categories = ['B_cell', 'T_cell', 'Macrophage', 'Endothelial']
>>> formulate_HRCHYCytoCommunity_input_from_anndata_singlecell(
...     adata=adata,
...     sample_id="sample1",
...     ct_col="cell_type",
...     categories=categories,
...     output_dir="data/HRCHY_input/",
...     graph_id=0,
...     coarse_gt_col="coarse_label",
...     fine_gt_col="fine_label"
... )
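When several samples are processed together, the same categories list should be passed to every call and each sample needs its own graph_id. Below is a minimal sketch of preparing those arguments. It assumes that graph IDs are assigned sequentially from 0 and that categories is the sorted union of the cell types observed across samples; both choices, and the helper name plan_calls, are assumptions for illustration rather than requirements stated by the package.

```python
def plan_calls(cell_types_by_sample, ct_col="cell_type",
               output_dir="data/HRCHY_input/"):
    """Hypothetical helper: build shared categories and per-sample keyword arguments.

    cell_types_by_sample maps sample_id -> iterable of cell type labels.
    """
    # Sorted union of all labels gives one category order shared by every sample.
    categories = sorted(
        {ct for labels in cell_types_by_sample.values() for ct in labels}
    )
    calls = []
    # Dicts preserve insertion order, so graph IDs follow the order samples were given.
    for graph_id, sample_id in enumerate(cell_types_by_sample):
        calls.append(
            dict(
                sample_id=sample_id,
                ct_col=ct_col,
                categories=categories,
                output_dir=output_dir,
                graph_id=graph_id,
            )
        )
    return categories, calls


categories, calls = plan_calls(
    {"sample1": ["T_cell", "B_cell"], "sample2": ["Macrophage", "T_cell"]}
)
```

Each dictionary in calls can then be unpacked into the function together with the matching AnnData object, for example formulate_HRCHYCytoCommunity_input_from_anndata_singlecell(adata=adata, **calls[0]).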