Identifying distinct cellular programs from single cell datasets using Topyfic
An open challenge in the analysis of single-cell data is the identification of distinct cellular programs that may be simultaneously expressed in the same cell. Latent Dirichlet allocation (LDA) is a popular statistical method for identifying these recurring patterns, called “topics”, in count data such as large gene expression matrices. Topics are composed of genes with distinct weights that together recreate the underlying gene expression profile of each individual cell. Topics can be analyzed both globally, using topic-trait enrichment, and in individual cells, using structure plots.
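To make the idea of weighted topics concrete, the following toy sketch computes the expected expression profile of one cell as a mixture of two topics. The gene names, topic weights, and topic proportions are invented for illustration and do not come from any dataset or from Topyfic itself.

```python
# Toy illustration of the LDA view of a cell: its expected expression
# profile is a mixture of topics, weighted by the cell's topic proportions.
# All names and numbers below are hypothetical.

genes = ["GeneA", "GeneB", "GeneC", "GeneD"]

# Each topic is a probability distribution over genes (weights sum to 1).
topics = {
    "topic_1": [0.5, 0.5, 0.0, 0.0],
    "topic_2": [0.0, 0.0, 0.5, 0.5],
}

# Each cell has its own topic proportions (also summing to 1).
cell_topic_proportions = {"topic_1": 0.75, "topic_2": 0.25}


def expected_profile(proportions, topics):
    """Return the per-gene expression probabilities implied by a topic mixture."""
    n_genes = len(next(iter(topics.values())))
    profile = [0.0] * n_genes
    for name, share in proportions.items():
        for i, weight in enumerate(topics[name]):
            profile[i] += share * weight
    return profile


profile = expected_profile(cell_topic_proportions, topics)
for gene, prob in zip(genes, profile):
    print(f"{gene}: {prob:.3f}")
```

The per-cell topic proportions used here are the quantities that a structure plot displays, one stacked bar per cell.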
However, because LDA algorithms are randomly initialized, topic definitions may vary substantially from run to run, rendering their interpretation subjective and questionable. The Topyfic package was developed to make LDA reproducible by defining topics according to their reproducibility across many runs. We apply Topyfic to 5xFAD and wild-type mouse brain single-cell and single-nucleus RNA-seq data generated with the Parse Evercode WT assay, obtained from MODEL-AD and the ENCODE database, to recover topics that are associated with specific cell types and genotypes. We further apply Topyfic to the ENCODE postnatal hippocampus and cortex single-nucleus RNA-seq data using a restricted gene vocabulary to recover topics associated with specific cell subtypes and cell states that are shared between homologous cell types in different tissues.
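One simple way to think about reproducibility across runs is to group similar topics from independent runs and keep only the groups that recur in enough of them. The sketch below does this with cosine similarity and a greedy grouping rule. It is an illustrative assumption, not Topyfic's actual algorithm or API; the function names, the similarity threshold, and the toy topic weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gene-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reproducible_topics(runs, threshold=0.9, min_runs=None):
    """Greedily group similar topics across runs and average recurring groups.

    runs: list of runs, each a list of topic vectors (gene weights).
    A topic joins the first group whose founding topic it resembles
    (cosine >= threshold); otherwise it founds a new group. Groups seen
    in at least `min_runs` distinct runs are averaged into a consensus topic.
    """
    if min_runs is None:
        min_runs = len(runs)  # by default, require the topic in every run
    groups = []  # each group is a list of (run_index, topic_vector)
    for run_idx, run in enumerate(runs):
        for topic in run:
            for group in groups:
                if cosine(topic, group[0][1]) >= threshold:
                    group.append((run_idx, topic))
                    break
            else:
                groups.append([(run_idx, topic)])
    consensus = []
    for group in groups:
        if len({run_idx for run_idx, _ in group}) >= min_runs:
            n_genes = len(group[0][1])
            consensus.append(
                [sum(t[i] for _, t in group) / len(group) for i in range(n_genes)]
            )
    return consensus


# Hypothetical topic-gene weights from three LDA runs with different seeds.
# Topic order differs between runs, and run 3 contains a diffuse topic
# that does not match anything found in the other runs.
runs = [
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    [[0.0, 0.0, 0.4, 0.6], [0.6, 0.4, 0.0, 0.0]],
    [[0.45, 0.55, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]],
]

consensus = reproducible_topics(runs)
print(f"{len(consensus)} topic(s) reproduced in all {len(runs)} runs")
```

Lowering `min_runs` trades stringency for recall: a topic that appears in most but not all runs is then retained.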
The Topyfic package is available on GitHub: https://github.com/mortazavilab/Topyfic
"Using large datasets like the ones generated by Parse was important for identifying as many different ‘topics’ as possible."