Spatial Mapping
This section discusses the algorithmic details of spatial mapping module implemented in STellaris, which was designed for mapping user’s annotated scRNA-seq data to spatial position in best match tissue section curated in our local database.
1. Preprocessing
Prior to mapping single cells to spatial location in tissue section of ST, low-quality cells (gene detected <200 or mitochondrial proportion > 20%) were excluded. CPM normalization was performed followed by log1p transformation. Marker genes were calculated using the t test method for cell type annotation submitted by users.
2. Section blast
To help users find the properly matched ST section, we introduced a screening method called section blast by assessing the similarity between user-uploaded scRNA-seq and the ST datasets in our local database, as inspired by the multimodal intersection analysis (MIA) methods approach described in (https://doi.org/10.1038/s41587-019-0392-8)
In this case, we used a hypergeometric test to measure the overlap level of marker genes (log-transformed fold change >1 and q value <0.05) between all scRNA-seq cell types and all clusters (Leiden clustering; resolution: 1) in the ST data, with the number of intersections of all detected genes in the two modalities as the background for determining the p value. The Benjamini-ochberg method was used for multiple testing correction. To better interpret these p value results, the following calculations were applied: If the p value was less than or equal to 0.5, we performed -np.log10(p) transformation; otherwise, we performed np.log10(1-p). Finally, the average of the highest value in each ST cluster was deemed the metric we called MIA score to measure the similarity between the scRNA-seq and the examined ST data.
3. Coembedding filtering
The integration of the scRNA-seq and ST data was performed using Seurat (https://doi.org/10.1093/nar/gkac781) by projecting these two modalities into shared latent space. To attenuate the confounding effects caused by the incompatibility between the two modalities, we implement a filtering method called coembedding filtering. This method filters out cells that are not well-mixed with ST spots in the shared UMAP space. Specifically, the average Euclidean distance between each spot and its 50 nearest neighbours (which can be adjusted through advanced parameters) is calculated. Then, we determine the mean of these average distances as a threshold that measures the overall proximity of cells to ST spots. Cells that are not within the scope of any ST spots were removed.
4. Spatial cellular map
- Celltrek
To construct a transcriptome-wide map of spatial patterning, we adopted a metric learning approach using a multivariate random forest (RF) model, which was previously described in CellTrek (https://doi.org/10.1038/s41587-022-01233-1). Briefly, this method trains a multivariate RF model on ST data in the shared latent space and then applies the model to the coembedding data of the scRNA-seq and ST data, thereby enabling the prediction of spatial coordinates of single cells by assigning them to the reciprocal nearest ST spots where the similarity is measured by the RF distance metric.
We made some adaptations for thorough evaluation, higher reliability and faster speed, which are as follows:
- We included advanced options to control the degree of redundancy and added an additional option to set the maximum allowed redundancy to suit different scRNA-seq data;
- We replaced point repulsion with the addition of uniform noise to ensure that the spatial coordinates of cells will not be heavily affected by the neighbouring cells;
- We took advantage of R-side parallel processing to speed up execution;
- The summary of spatial mapping and the distribution of the RF distance metric were provided to assess the mapping reliability.
- Tangram
We used Scanpy(https://scanpy.readthedocs.io/en/stable/) to identify marker genes using t test method in the scRNA-seq data. These marker genes were then used as training genes to find the optimal spatial alignment for single cells through the "cell" mapping mode. Subsequently, we mapped each single cell to the most probable ST spot.
- CytoSPACE
We used the default solver lapjv for a fast implementation of the Jonker-Volgenant shortest augmenting path assignment algorithm. To achieve parallel processing, we used the spot subsampling mode and set the number of subsampled cells per partition to the default value of 5,000.
5. Advanced parameters
We provide advanced parameters setting for users before starting spatial mapping:
- KNN number: Number of nearest neighboring cells to determine coembedding filtering cutoff, 0 means skipping coembedding filtering [default: 50].
- Number of spots: Number of top-ranked nearest spots for each cell to keep in sparse graph for spatial mapping, the higher the value, the more spatial locations the cell may be assigned to [default: 10].
- Number of cells: Number of top-ranked nearest cells for each spot to keep in sparse graph for spatial mapping, the higher the value, the more cells may succeed in mapping to spatial locations [default: 10].
- Redundancy: The tolerance of redundancy, which means the maximum number of spots to which a cell is allowed to map. This value must be lower than the smaller value of n_spots and n_cells [default: 1].
6. Cell type colocalization
Based on the spatial coordinates generated by spatial mapping, we measured the spatial distance between any two cell types using Euclidean distance. We then performed k-means clustering (k=3) on the spatial distance of all pairs of cell types, thereby dividing cell type pairs into three distance groups: "near," "medium" and "far."
7. Cell-cell ligand-receptor Interactions
We identified LRIs using CellPhoneDB v4 (https://github.com/ventolab/CellphoneDB) for each cell type pair in different distance groups that were previously defined. We added support for mouse species by converting gene symbols to their corresponding orthologs in human. The LRIs with p values ≤ 0.01 were retained as significant interactions.
8. Description of downloaded files
- We offer reference ST data in h5ad readable with the "anndata" Python package.
- We offer mapped scRNA-seq data in h5ad with predicted spatial coordinates, which is readable with the "anndata" Python package
- We offer a compressed table files package including all table results including:a. single cell coordinate table (sc_coordinate.csv) including some columns:
- “cell_type”: name of the cell type in scRNA-seq.
- “id_new”: new cell id in scRNA-seq.
- “id_st”: id of closest cell in reference ST data.
- “dist”: distance between this cell of scRNA-seq and closest cell/spot of reference ST data.
- “x”: x coordinate of corresponding cell/spot in reference ST data.
- “y”: y coordinate of corresponding cell/spot in reference ST data.
- “x_noise”: predicted x coordinate of this cell in scRNA-seq data with noise.
- “y_noise”: predicted y coordinate of this cell in scRNA-seq data with noise.
b. Euclidean distance matrix between any two cell types and euclidean distance table with group name (cell_types_eucDist.csv, cell_types_eucDist_group.csv);
c. CellPhoneDB output files (means.txt, pvalues.txt, significant_means.txt, deconvoluted.txt);- details of these files: https://github.com/ventolab/CellphoneDB/blob/master/Docs/RESULTS-DOCUMENTATION.md
e. Filtered CellPhoneDB output files (significant_means_lt001.csv): contains mean values for each ligand-receptor interaction (rows) for each cell-cell interacting pair (columns) filtered by P value ≤ 0.01, and the description of columns are same to CellPhoneDB output files. - We offer a compressed pdf figure files package including all pdf figure results, which are more optimized than web graphs, including:
a. Euclidean distance matrix heatmap and boxplot of cell types pairs by groups (cell_types_eucDist.heatmap.pdf, cell_types_eucDist.boxplot.pdf) similar to the web graphs;
b. Dot figures for total or each cell type in a group showing the molecular interactions between cell types (dot_plot.pdf, dot_plot_[group]|[cell type].pdf) similar to web graphs, where means of the average expression level of two interacting molecular are indicated by color;
c. Heatmap showing the total number of interactions between cell types (inter_count_heatmap.pdf) similar to the web graph;