PMID- 35941549 OWN - NLM STAT- MEDLINE VI - 23 IP - 1 TI - SMaSH: a scalable, general marker gene identification framework for single-cell RNA-sequencing. PG - 328 CI - © 2022. The Author(s). LA - eng PT - Journal Article PL - England TA - Bmc Bioinformatics JT - BMC bioinformatics JID - 100965194 IS - 1471-2105 (Electronic) LID - 10.1186/s12859-022-04860-2 [doi] FAU - Nelson, M E AU - Nelson ME AD - European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, CB10 1SD, UK. nelson@ebi.ac.uk. AD - Department of Haematology, University of Cambridge, Cambridge, CB2 0AW, UK. nelson@ebi.ac.uk. AD - Wellcome - Medical Research Council Cambridge Stem Cell Institute, Cambridge, CB2 0AW, UK. nelson@ebi.ac.uk. AD - Open Targets, Wellcome Genome Campus, Cambridge, CB10 1SA, UK. nelson@ebi.ac.uk. FAU - Riva, S G AU - Riva SG AD - Department of Haematology, University of Cambridge, Cambridge, CB2 0AW, UK. AD - Wellcome - Medical Research Council Cambridge Stem Cell Institute, Cambridge, CB2 0AW, UK. AD - Open Targets, Wellcome Genome Campus, Cambridge, CB10 1SA, UK. AD - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1RQ, UK. FAU - Cvejic, A AU - Cvejic A AD - Department of Haematology, University of Cambridge, Cambridge, CB2 0AW, UK. as889@cam.ac.uk. AD - Wellcome - Medical Research Council Cambridge Stem Cell Institute, Cambridge, CB2 0AW, UK. as889@cam.ac.uk. AD - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1RQ, UK. as889@cam.ac.uk. 
IS - 1471-2105 (Linking) RN - 0 (Biomarkers) RN - 63231-63-0 (RNA) SB - IM MH - Animals MH - Biomarkers MH - *Computational Biology/methods MH - Gene Expression Profiling/methods MH - Humans MH - Mice MH - RNA MH - Sequence Analysis, RNA MH - Single-Cell Analysis/methods MH - *Transcriptome OTO - NOTNLM OT - Feature selection OT - Marker genes OT - Single-cell RNA-sequencing PMC - PMC9361618 DCOM- 20220810 LR - 20230606 DP - 2022 Aug 08 DEP - 20220808 AB - BACKGROUND: Single-cell RNA-sequencing is revolutionising the study of cellular and tissue-wide heterogeneity in a large number of biological scenarios, from highly tissue-specific studies of disease to human-wide cell atlases. A central task in single-cell RNA-sequencing analysis design is the calculation of cell type-specific genes in order to study the differential impact of different replicates (e.g. tumour vs. non-tumour environment) on the regulation of those genes and their associated networks. The crucial task is the efficient and reliable calculation of such cell type-specific 'marker' genes. These optimise the ability of the experiment to isolate highly-specific cell phenotypes of interest to the analyser. However, while methods exist that can calculate marker genes from single-cell RNA-sequencing, no such method places emphasis on specific cell phenotypes for downstream study in e.g. differential gene expression or other experimental protocols (spatial transcriptomics protocols for example). Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterise highly-specific and niche populations of cells in numerous different biological data-sets. RESULTS: SMaSH extracts robust and biologically well-motivated marker genes, which characterise a given single-cell RNA-sequencing data-set better than existing computational approaches for general marker gene calculation. 
We demonstrate the utility of SMaSH through its substantial performance improvement over several existing methods in the field. Furthermore, we evaluate the SMaSH markers on spatial transcriptomics data, demonstrating they identify highly localised compartments of the mouse cortex. CONCLUSION: SMaSH is a new methodology for calculating robust marker genes from large single-cell RNA-sequencing data-sets, and has implications for e.g. effective gene identification for probe design in downstream analyses such as spatial transcriptomics experiments. SMaSH has been fully-integrated with the ScanPy framework and provides a valuable bioinformatics tool for cell type characterisation and validation in ever-growing data-sets spanning over 50 different cell types across hundreds of thousands of cells.