Multi-omics biology is deeply rooted in data science.
To support the explosion of technologies generating large multi-dimensional datasets, computational biologists are developing efficient data pipelines that combine tools to preprocess, analyze, and visualize data.
The South Australian Genomic Centre (SAGC), a state-wide genomics facility headquartered in the South Australian Health and Medical Research Institute (SAHMRI) building in Adelaide, is at the forefront of these efforts.
The SAGC provides a plethora of multi-omic services, including single cell RNA sequencing (scRNA-Seq) with Parse Evercode assays. A team of computational biologists and statisticians supports the facility’s users by developing customized approaches for data analysis, integration, and visualization.
To understand their vision and its challenges, we interviewed Cathal King, a bioinformatician at SAGC specializing in scRNA-Seq and spatial biology.
Cathal described his experience with Parse data, highlighting what works well and where there is room for further development. He then outlined the improvements he hopes will enhance the user experience in data analysis.
Daniel Diaz, Senior Bioinformatics Application Scientist at Parse, joined to discuss efforts to streamline complex data analysis workflows and integrate diverse -omics data types.
Cathal King (CK): I am a bioinformatician based at SAHMRI, in Adelaide. I am primarily engaged in the analysis phase of single cell RNA-Seq and spatial transcriptomic studies. My role involves exploring the datasets, interpreting them from a biological perspective, and communicating outcomes to fellow researchers. I also handle primary analyses such as alignment and preprocessing of data.
I also collaborate with three other research teams at SAHMRI to develop and disseminate analytical pipelines and methods across these teams and the broader SAHMRI community.
Our introduction to Parse and combinatorial barcoding technologies came through the distributor Decode Science. Joel Bathe from SAGC, our partnerships manager, connects us with companies and researchers interested in trialing new technologies like Parse for single cell data analysis.
CK: The pipeline was well-documented and user-friendly. We set it up and ran it on the SAHMRI High-Performance Computing (HPC) cluster. Using an Excel sample loading sheet for sample data was unusual, but it was necessary to demultiplex Parse data.
To scale a data analysis workflow to large datasets, I think moving from an Excel sheet to an automation-friendly format such as a CSV file would significantly reduce errors and improve efficiency.
Daniel Diaz (DD): I agree, and we are streamlining and simplifying the process now. We are replacing the existing platform with a cloud-based GUI, so users can work through a web interface and better handle large-scale scRNA-Seq datasets.
CK: Integrating sample management directly with the data input process could enhance efficiency. Additionally, incorporating the pipeline into a workflow manager could further optimize the process, something we are exploring as we transition toward Nextflow.
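To illustrate why a CSV sample sheet is easier to automate than an Excel file, here is a minimal sketch of a validator that a workflow could run before demultiplexing. The column names (`sample_name`, `wells`), the semicolon-separated well list, and the 96-well layout are assumptions made for this example; they are not the format used by the Parse pipeline or by SAGC.

```python
import csv
import io
import re

# Hypothetical column names; a real sample sheet may use different ones.
REQUIRED_COLUMNS = ("sample_name", "wells")
# Assumes a 96-well plate with wells named A1..H12.
WELL_PATTERN = re.compile(r"^[A-H]([1-9]|1[0-2])$")


def read_sample_sheet(handle):
    """Parse a CSV sample sheet and return {sample_name: [wells]}.

    Raises ValueError on missing columns, empty fields, malformed wells,
    duplicate sample names, or wells assigned to more than one sample.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    samples = {}
    seen_wells = {}
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        name = (row["sample_name"] or "").strip()
        raw_wells = (row["wells"] or "").split(";")
        wells = [w.strip().upper() for w in raw_wells if w.strip()]
        if not name or not wells:
            raise ValueError(f"line {line_number}: empty sample name or wells")
        if name in samples:
            raise ValueError(f"line {line_number}: duplicate sample {name!r}")
        for well in wells:
            if not WELL_PATTERN.match(well):
                raise ValueError(f"line {line_number}: bad well {well!r}")
            if well in seen_wells:
                owner = seen_wells[well]
                raise ValueError(
                    f"line {line_number}: well {well} already used by {owner!r}"
                )
            seen_wells[well] = name
        samples[name] = wells
    return samples


# Toy sheet held in memory; in practice this would be open("samples.csv").
example = io.StringIO("sample_name,wells\ncontrol,A1;A2\ntreated,B1;B2\n")
sheet = read_sample_sheet(example)
```

Because the sheet is plain text, the same check can run as the first step of a workflow manager such as Nextflow, so a mistyped well fails fast instead of surfacing after alignment.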
CK: We examine the HTML report for genes per cell and reads per cell, which the Parse pipeline presents clearly. Clustering analysis, like k-means or cell clustering visualized on a UMAP, is also critical for understanding cell groupings.
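For readers who have not seen how per-cell metrics such as these are derived, the sketch below computes genes detected per cell and total counts per cell from a toy count table. It is not the Parse pipeline's code. It works on counts in an expression matrix rather than on raw reads, so the second metric is a stand-in for reads per cell, and the function names and input layout are assumptions for this example.

```python
from statistics import median


def per_cell_qc(counts):
    """Return {cell: (genes_detected, total_counts)}.

    counts maps each cell barcode to a {gene: count} dictionary.
    A gene is "detected" in a cell when its count is above zero.
    """
    qc = {}
    for cell, gene_counts in counts.items():
        genes_detected = sum(1 for c in gene_counts.values() if c > 0)
        total_counts = sum(gene_counts.values())
        qc[cell] = (genes_detected, total_counts)
    return qc


def summarize(qc):
    """Collapse per-cell metrics into the medians usually shown in a report."""
    genes = [g for g, _ in qc.values()]
    totals = [t for _, t in qc.values()]
    return {
        "median_genes_per_cell": median(genes),
        "median_counts_per_cell": median(totals),
    }


# Toy data: three cells, a handful of genes.
toy_counts = {
    "cell_1": {"GAPDH": 5, "ACTB": 0, "CD3E": 2},
    "cell_2": {"GAPDH": 1, "ACTB": 3},
    "cell_3": {"CD3E": 4},
}
summary = summarize(per_cell_qc(toy_counts))
# summary == {"median_genes_per_cell": 2, "median_counts_per_cell": 4}
```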
CK: Our experience has been positive, with no issues integrating MGI-generated data into existing workflows.
We fully sequenced a Parse dataset on an MGI sequencer. We have our own demultiplexing pipeline, which takes raw data off the sequencer and converts it to FASTQ files. There were no issues with the Parse data.
Our lab operates both the MGI T7 and G400 models, and we have successfully processed datasets from these machines without compatibility issues. The affordability of MGI sequencing is influencing the market and client preferences.
DD: For mixed-species samples prepared with Parse technology, we append the non-human genome sequence to the human genome, providing a mapping reference. This allows for accurate alignment and analysis of transcripts from both species within the same sample. Customizing genomes in this way is a flexible approach to accommodate diverse experimental designs.
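The idea of appending one genome to another can be sketched as a FASTA concatenation in which every sequence name is given a species prefix, so that identically named chromosomes (for example `chr1` in both species) stay distinguishable after alignment. This is a generic illustration, not the Parse pipeline's implementation; the prefix scheme and the labels `hg38` and `mm10` are assumptions for this example.

```python
import io


def append_genome(fasta_in, fasta_out, species_prefix):
    """Copy FASTA records to fasta_out, prefixing each sequence name.

    Returns the number of sequences written.
    """
    n_sequences = 0
    for line in fasta_in:
        if not line.endswith("\n"):
            line += "\n"  # keep the next genome's header on its own line
        if line.startswith(">"):
            fasta_out.write(f">{species_prefix}_{line[1:]}")
            n_sequences += 1
        else:
            fasta_out.write(line)
    return n_sequences


def build_combined_reference(genomes, fasta_out):
    """Write several genomes into one FASTA.

    genomes is a list of (species_prefix, readable_file_handle) pairs.
    Returns {species_prefix: number_of_sequences}.
    """
    return {
        prefix: append_genome(handle, fasta_out, prefix)
        for prefix, handle in genomes
    }


# Toy in-memory genomes; real inputs would be opened FASTA files.
human = io.StringIO(">chr1 toy\nACGT\n>chr2\nGG")
mouse = io.StringIO(">chr1\nTTTT\n")
combined = io.StringIO()
sequence_counts = build_combined_reference(
    [("hg38", human), ("mm10", mouse)], combined
)
```

The matching gene annotation files would need the same prefixes on their sequence names, otherwise the annotation and the combined genome no longer refer to the same coordinates.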
CK: We have not encountered significant issues with data delivery. As the volume of samples increases, ensuring we have sufficient bioinformatics support is crucial. So far, we have managed well, and the use of Parse is still expanding without major data handling problems.
CK: Currently, I am working on integrating mass spectrometry data with spatial datasets. This process involves aligning samples from lipidomics and proteomics studies conducted via mass spectrometry with corresponding spatial transcriptomics data. The challenge lies in the alignment discrepancies between the datasets, even though the tissue samples may appear similar.
The goal is to unravel complex biological questions, with a particular focus on how spatial gene expression data correlate with protein and lipid profiles in the same tissue regions. Spatial transcriptomics typically provides gene expression data in the context of tissue architecture, offering a map of where gene expression occurs. In some respects, it is therefore analogous to single cell data, if we set aside its image component and accept a trade-off in resolution.
To address this, I have worked on a method where we can overlay and match data points from the two distinct types of analyses. By identifying common points between the spatial and mass spectrometry datasets, I am developing techniques to map and compare the molecular signatures within specific tissue areas.
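One simple way to picture the matching of data points between two modalities is nearest-neighbor pairing of coordinates. The sketch below is not Cathal's method, which the interview does not describe in detail. It assumes both datasets have already been brought into the same coordinate frame, which is precisely the hard part mentioned above, and the names and distance cutoff are hypothetical.

```python
import math


def match_points(spatial_spots, ms_pixels, max_distance):
    """Pair each spatial spot with its nearest mass spectrometry pixel.

    Both inputs map an identifier to (x, y) coordinates expressed in the
    same coordinate frame and units. Spots whose nearest pixel is farther
    than max_distance are left unmatched.
    Returns {spot_id: (pixel_id, distance)}.
    """
    matches = {}
    for spot_id, spot_xy in spatial_spots.items():
        best_id, best_distance = None, math.inf
        for pixel_id, pixel_xy in ms_pixels.items():
            distance = math.dist(spot_xy, pixel_xy)
            if distance < best_distance:
                best_id, best_distance = pixel_id, distance
        if best_distance <= max_distance:
            matches[spot_id] = (best_id, best_distance)
    return matches


# Toy coordinates in arbitrary units.
spots = {"s1": (0, 0), "s2": (10, 10), "s3": (100, 100)}
pixels = {"p1": (1, 0), "p2": (10, 12)}
matched = match_points(spots, pixels, max_distance=5)
# s3 has no pixel within 5 units, so it is left out of the result.
```

The brute-force loop is quadratic in the number of points; for real datasets a spatial index such as a k-d tree would be the usual choice.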
Another project that interests me is the integration of immune profiling with genomic mutations. I am focusing on immune profiling through VDJ sequencing, as it is crucial for understanding the immune repertoire.
I work with a multiple myeloma research group, analyzing sequencing data to identify plasma cell subclones within individual patients and correlating these findings with genomic alterations. This aspect of my research aims to provide deeper insights into the molecular underpinnings of immune responses and disease mechanisms. Repertoire sequencing adds another layer of data we integrate.
DD: Indeed. Our developments in TCR and BCR sequencing kits aim to facilitate this kind of multi-layered analysis, especially for characterizing clonal populations in diseases.
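As a rough illustration of what characterizing clonal populations involves, the sketch below groups cells by identical heavy and light chain CDR3 amino acid sequences and reports each group's share of the sample. Defining a clone by exact CDR3 identity is a simplification chosen for this example; real analyses typically also consider V and J gene usage and somatic hypermutation. The sequences shown are made up.

```python
from collections import Counter


def clonotype_frequencies(cells):
    """Group cells into clonotypes and rank them by size.

    cells maps a cell barcode to a (heavy_cdr3, light_cdr3) tuple.
    Returns a list of (clonotype, n_cells, fraction) sorted from the
    largest clonotype to the smallest.
    """
    counts = Counter(cells.values())
    total = sum(counts.values())
    return [
        (clonotype, n_cells, n_cells / total)
        for clonotype, n_cells in counts.most_common()
    ]


# Made-up CDR3 sequences for four cells; three share one clonotype.
toy_cells = {
    "cell_1": ("CARDYW", "CQQYNSYPLTF"),
    "cell_2": ("CARDYW", "CQQYNSYPLTF"),
    "cell_3": ("CARDYW", "CQQYNSYPLTF"),
    "cell_4": ("CAKGGW", "CQQSYSTPLTF"),
}
ranked = clonotype_frequencies(toy_cells)
# ranked[0] is the dominant clonotype, here 3 of 4 cells (fraction 0.75).
```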
CK: Beyond data integration, a significant focus is on analyzing and interpreting the combined datasets. We explore various analytical approaches to ensure they make biological sense across different technologies. For instance, comparing gene expression signals across platforms and validating spatial data with single cell resolution are critical steps.
CK: Yes, we deal with that quite a lot. Working with matched datasets, as we are doing, is one way to solve the problem. Instead of relying solely on single cell annotation methods, using a gene signature annotation approach might offer better results for identifying and characterizing specific regions within spatial datasets.
Tools like CARD [“Conditional AutoRegressive model-based Deconvolution”] are instrumental in annotating spatial data using single cell datasets. Additionally, comparing gene signatures with repositories like MSigDB can offer more accurate insights for certain samples.
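A gene signature annotation approach can be pictured as scoring each spatial spot against a set of marker genes and assigning the best-scoring label. The sketch below is a deliberately simple version, using the mean expression of the signature genes with no background correction. It is neither CARD nor an MSigDB workflow, and the two short marker lists are toy signatures chosen for this example.

```python
def signature_score(expression, signature):
    """Mean expression of the signature genes in one spot.

    expression maps gene name to a normalized expression value.
    Genes absent from the spot count as zero.
    """
    if not signature:
        raise ValueError("signature must contain at least one gene")
    return sum(expression.get(gene, 0.0) for gene in signature) / len(signature)


def annotate(spots, signatures):
    """Label each spot with the name of its highest-scoring signature."""
    labels = {}
    for spot_id, expression in spots.items():
        scores = {
            name: signature_score(expression, genes)
            for name, genes in signatures.items()
        }
        labels[spot_id] = max(scores, key=scores.get)
    return labels


# Toy signatures built from well-known lineage markers.
toy_signatures = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
}
toy_spots = {
    "spot_1": {"CD3D": 4.0, "CD3E": 2.0, "MS4A1": 1.0},
    "spot_2": {"CD79A": 3.0, "CD19": 3.0},
}
spot_labels = annotate(toy_spots, toy_signatures)
# spot_labels == {"spot_1": "T cell", "spot_2": "B cell"}
```

Published scoring methods usually go further, for example by comparing the signature against a background of control genes, so this should be read as the shape of the idea rather than a recommended scoring scheme.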
CK: On the bioinformatics end, many are guided towards using R or Python due to the availability of specialized data structures tailored for single cell and spatial analyses, such as SingleCellExperiment in R or AnnData in Python. These structures have become a standard, streamlining the initial stages of data analysis. However, the ideal scenario would be a unified data structure that eliminates the need to switch between different formats, potentially simplifying the workflow even further.
The goal at our core facility is not just to perform basic tasks like sequencing and aligning but to provide clients with ready-to-use data objects that encapsulate either single cell or spatial experiment data. This way, customers can immediately proceed with more in-depth analyses.
We envision a user-friendly web interface allowing clients to interact with their data more intuitively, such as visualizing unique molecular identifiers (UMIs) or gene counts per cell or spatial spot.
Moreover, giving the data back to researchers or clients with already annotated gene information or gene signatures would add significant value, allowing them to delve into data analysis more quickly. This approach represents a blend of leveraging existing tools and data structures while pushing for innovations that cater to the practical needs of researchers in bioinformatics.
CK: Yes, but we need to remember that there are multiple approaches to data analysis, and that this information has to be communicated effectively. Users may need help understanding the data and how it was analyzed, or may need the flexibility to adapt methods as required. They also need to know what the dataset includes and the steps taken to obtain the final results, so that the analysis process is transparent and clear.
CK: A new biostatistician with a software engineering background joined our team, focusing primarily on developing workflow pipelines. We aim to integrate these pipelines with a web interface, allowing users to input samples, run through the processing stages—from alignment to potentially further steps—and then easily access the results.
We believe this advancement will enhance our data processing capabilities, though it may require some initial extra effort. Our objective is to have this system operational for at least one or two assays within the current year.
Good luck to Cathal and his team of data scientists in their endeavors. Learn more about SAGC and their data analysis services.