Unlocking the Power of Multi-Omics Biology through Data Science

May 20, 2024

9 min read

Updated: June 3, 2024

Cathal King - Bioinformatician & Single Cell and Spatial Transcriptomics Specialist

Multi-omics biology is deeply rooted in data science.

To support the explosion of technologies generating large multi-dimensional datasets, computational biologists are developing efficient data pipelines that combine tools to preprocess, analyze, and visualize data.

The South Australian Genomic Centre (SAGC), a state-wide genomics facility headquartered in the South Australian Health and Medical Research Institute (SAHMRI) building in Adelaide, is at the forefront of these efforts.

The SAGC provides a plethora of multi-omic services, including single cell RNA sequencing (scRNA-Seq) with Parse Evercode assays. A team of computational biologists and statisticians supports the facility’s users by developing customized approaches for data analysis, integration, and visualization.

To understand their vision and its challenges, we interviewed Cathal King, a bioinformatician at SAGC specializing in scRNA-Seq and spatial biology.

Cathal shared his experience with Parse data, highlighting beneficial aspects and areas for future development. He then shared his aspirations for developments aimed at enhancing the user experience in data analysis.
Daniel Diaz, Senior Bioinformatics Application Scientist at Parse, joined to discuss efforts to streamline complex data analysis workflows and integrate diverse -omics data types.

Cathal, can you tell us your background and how your team became involved with Parse Biosciences and combinatorial barcoding?

Cathal King (CK): I am a bioinformatician based at SAHMRI, in Adelaide. I am primarily engaged in the analysis phase of single cell RNA-Seq and spatial transcriptomic studies. My role involves exploring the datasets, interpreting them from a biological perspective, and communicating outcomes to fellow researchers. I also handle primary analyses such as alignment and preprocessing of data.

I also collaborate with three other research teams at SAHMRI to develop and disseminate analytical pipelines and methods across these teams and the broader SAHMRI community.

Our introduction to Parse and combinatorial barcoding technologies came through the distributor Decode Science. Joel Bathe from SAGC, our partnerships manager, connects us with companies and researchers interested in trialing new technologies like Parse for single cell data analysis.

As a proficient user, what has been your experience with implementing the Parse pipeline? What were your challenges and do you have recommendations?

Joel Bathe - Partnerships Manager at SAGC

CK: The pipeline was well-documented and user-friendly. We set it up and executed it on the SAHMRI High-Performance Computing (HPC) cluster. The use of an Excel sample loading sheet was unique but necessary to demultiplex Parse data.

To optimize a data analysis workflow for large-scale datasets, I think transitioning from an Excel sheet to an automation-friendly format like a CSV file would significantly reduce errors and improve efficiency.
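To illustrate why a plain-text sheet is easier to automate, here is a minimal sketch of a CSV sample sheet check that fails early on common mistakes. The column names `sample_name` and `wells` and the function `read_sample_sheet` are hypothetical, chosen for illustration only; they are not the format of the Parse sample loading sheet.

```python
import csv
import io

# Hypothetical column names; the real sample loading sheet may differ.
REQUIRED_COLUMNS = ("sample_name", "wells")

def read_sample_sheet(handle):
    """Parse a CSV sample sheet into {sample_name: wells}, failing early on errors."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    samples = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["sample_name"] or "").strip()
        wells = (row["wells"] or "").strip()
        if not name or not wells:
            raise ValueError(f"line {line_no}: empty sample_name or wells")
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample {name!r}")
        samples[name] = wells
    return samples

# Toy sheet held in memory; a real run would pass an open file instead.
example = io.StringIO("sample_name,wells\ncontrol,A1-A6\ntreated,A7-A12\n")
print(read_sample_sheet(example))
```

Because the check is a plain function over a text file, it can run as the first step of a scripted pipeline, before any compute time is spent on demultiplexing.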

Daniel Diaz (DD): I agree, and we are streamlining and simplifying the process now. Once the existing platform is replaced with a cloud-based GUI, users will be able to work through a web interface and better handle large-scale scRNA-Seq datasets.

In terms of streamlining, do you see any need for improvements in accommodating multiple samples or customers in a single run?

CK: Integrating sample management directly with the data input process could enhance efficiency. Additionally, incorporating the pipeline into a workflow manager could further optimize the process, something we are exploring as we transition toward Nextflow.

What key features do you look for in a dataset, particularly regarding quality control metrics?

CK: We examine the HTML report for genes per cell and reads per cell, presented clearly in the Parse pipeline. Clustering analysis, like k-means or cell clustering visualized on a UMAP, is also critical for understanding cell groupings.
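As a rough illustration of what these per-cell metrics measure, the sketch below computes total counts and detected genes per cell from a toy list of (cell, gene, count) entries. It is not how the Parse pipeline builds its report. The function name `per_cell_qc` and the toy data are assumptions, and counts per cell stand in here for the reads-per-cell figure.

```python
from collections import defaultdict
from statistics import median

def per_cell_qc(triplets):
    """Return {cell: (total_counts, genes_detected)} from (cell, gene, count) triplets."""
    counts = defaultdict(int)
    genes = defaultdict(set)
    for cell, gene, count in triplets:
        if count > 0:  # a gene is "detected" only if it has at least one count
            counts[cell] += count
            genes[cell].add(gene)
    return {cell: (counts[cell], len(genes[cell])) for cell in counts}

# Toy data: three cells with a handful of genes each.
toy = [
    ("c1", "GAPDH", 5), ("c1", "CD3E", 2),
    ("c2", "GAPDH", 1),
    ("c3", "CD19", 4), ("c3", "GAPDH", 3), ("c3", "MS4A1", 1),
]
qc = per_cell_qc(toy)
median_counts = median(n for n, _ in qc.values())
median_genes = median(g for _, g in qc.values())
print(qc, median_counts, median_genes)
```

Medians are used rather than means because a few very large cells or doublets can skew the average.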

Didn’t you sequence an Evercode kit with an MGI sequencer recently? MGI sequencers are less common in our work, but their cost-effectiveness is making them increasingly popular.

CK: Our experience has been positive, with no issues integrating MGI-generated data into existing workflows.

We fully sequenced a Parse dataset on an MGI sequencer. We have our own demultiplexing pipeline, which takes raw data off the sequencer and converts it to FASTQ files. There were no issues with the Parse data.

Our lab operates both the MGI T7 and G400 models, and we have successfully processed datasets from these machines without compatibility issues. The affordability of MGI sequencing is influencing the market and client preferences.

How do you handle mixed-species samples, such as those involving viruses or bacteria and human cells?

DD: For mixed-species samples prepared with Parse technology, we append the non-human genome sequence to the human genome, providing a mapping reference. This allows for accurate alignment and analysis of transcripts from both species within the same sample. Customizing genomes in this way is a flexible approach to accommodate diverse experimental designs.
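The idea of appending one genome to another can be sketched as a simple FASTA concatenation. This is a minimal sketch under stated assumptions, not the Parse procedure: the function `append_genome` and the `virus_` prefix are hypothetical, and the prefix is added only to keep sequence names from colliding between the two genomes. In practice the gene annotation files would need to be combined in the same way before building the reference.

```python
import io

def append_genome(primary, extra, out, extra_prefix):
    """Write the primary FASTA, then the extra FASTA with prefixed sequence names."""
    for line in primary:
        out.write(line if line.endswith("\n") else line + "\n")
    for line in extra:
        if line.startswith(">"):
            # Prefix header lines so contig names stay unique in the combined file.
            line = ">" + extra_prefix + line[1:]
        out.write(line if line.endswith("\n") else line + "\n")

# Toy in-memory genomes; real inputs would be open FASTA files.
human = io.StringIO(">chr1\nACGT\n")
virus = io.StringIO(">segment1\nGGCC")
combined = io.StringIO()
append_genome(human, virus, combined, extra_prefix="virus_")
print(combined.getvalue())
```

After alignment, the prefix makes it straightforward to tell which species a mapped transcript came from.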

The team of managers, scientists, and bioinformaticians at SAGC.

Shifting focus to the service provider’s logistics, have there been any challenges in delivering data back to customers?

CK: We have not encountered significant issues with data delivery. As the volume of samples increases, ensuring we have sufficient bioinformatics support is crucial. So far, we have managed well, and the use of Parse is still expanding without major data handling problems.

SAGC provides support for a broad range of -omics methods and data analyses, and you handle multi-omics datasets regularly. In your work with these complex datasets, what challenges and insights can you share about integrating different data types from various technologies?

CK: Currently, I am working on integrating mass spectrometry data with spatial datasets. This process involves aligning samples from lipidomics and proteomics studies conducted via mass spectrometry with corresponding spatial transcriptomics data. The challenge lies in the alignment discrepancies between the datasets, even though the tissue samples may appear similar.

The goal is to unravel complex biological questions, with a particular focus on how spatial gene expression data correlate with protein and lipid profiles in the same tissue regions. Spatial transcriptomics typically provides gene expression data in the context of tissue architecture, offering a map of where gene expression occurs. In some respects, it can therefore be treated as analogous to single cell data if we set aside its image component and accept a trade-off in resolution.

To address this, I have worked on a method where we can overlay and match data points from the two distinct types of analyses. By identifying common points between the spatial and mass spectrometry datasets, I am developing techniques to map and compare the molecular signatures within specific tissue areas.
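One simple way to picture matching points between two datasets is nearest-neighbor pairing on coordinates. The sketch below is an illustration only and not the method described in the interview. It assumes both datasets have already been registered to a shared coordinate frame, and the names `match_nearest` and `max_distance` are hypothetical.

```python
import math

def match_nearest(spatial_spots, ms_pixels, max_distance):
    """Pair each spatial spot with its nearest mass spectrometry pixel within max_distance."""
    pairs = {}
    for spot_id, (sx, sy) in spatial_spots.items():
        best_id, best_d = None, max_distance
        for pixel_id, (px, py) in ms_pixels.items():
            d = math.hypot(sx - px, sy - py)  # Euclidean distance
            if d <= best_d:
                best_id, best_d = pixel_id, d
        if best_id is not None:  # spots with no pixel in range stay unmatched
            pairs[spot_id] = (best_id, best_d)
    return pairs

# Toy coordinates in an assumed shared frame.
spots = {"s1": (0, 0), "s2": (10, 10), "s3": (50, 50)}
pixels = {"p1": (1, 0), "p2": (9, 11)}
pairs = match_nearest(spots, pixels, max_distance=3)
print(pairs)
```

The distance cutoff leaves spots unmatched when the two sections do not overlap well, which is safer than forcing a pairing across misaligned regions. For large datasets a spatial index would replace the double loop.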

Another project that interests me is the integration of immune profiling with genomic mutations. I am focusing on immune profiling through VDJ sequencing, as it is crucial for understanding the immune repertoire.

I work with a multiple myeloma research group to analyze sequencing data, identify plasma cell subclones within individual patients, and correlate these findings with genomic alterations. This aspect of my research aims to provide deeper insights into the molecular underpinnings of immune responses and disease mechanisms. Repertoire sequencing adds another layer of data we integrate.

DD: Indeed. Our developments in TCR and BCR sequencing kits aim to facilitate this kind of multi-layered analysis, especially for characterizing clonal populations in diseases.

CK: Beyond data integration, a significant focus is on analyzing and interpreting the combined datasets. We explore various analytical approaches to ensure they make biological sense across different technologies. For instance, comparing gene expression signals across platforms and validating spatial data with single cell resolution are critical steps.

Single cell data lacks a spatial component, but it is more fine-grained. One obtains a spatial outline and tissue mapping, but cell types need a higher degree of precision for their annotation. Do you use any specific software tools for this integration and analysis?

CK: Yes, we deal with that quite a lot. The dataset matching we are working on is one way to solve the problem. Instead of relying solely on single cell annotation methods, using a gene signature annotation approach might offer better results for identifying and characterizing specific regions within spatial datasets.

Tools like CARD [“Conditional AutoRegressive model-based Deconvolution”] are instrumental in annotating spatial data using single cell datasets. Additionally, comparing gene signatures with repositories like MSigDB can offer more accurate insights for certain samples.
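A gene signature approach can be sketched very simply: score each spot by the mean expression of each signature's genes and take the best-scoring label. This is a minimal sketch and not how CARD works, since CARD performs model-based deconvolution. The function names and the two toy signatures are assumptions for illustration and are not taken from MSigDB.

```python
def signature_score(expression, signature):
    """Mean expression of the signature genes in one spot; missing genes count as zero."""
    return sum(expression.get(g, 0.0) for g in signature) / len(signature)

def annotate_spot(expression, signatures):
    """Return the best-scoring signature name and all scores for one spot."""
    scores = {name: signature_score(expression, genes) for name, genes in signatures.items()}
    return max(scores, key=scores.get), scores

# Toy signatures and one toy spot of normalized expression values.
signatures = {
    "T cell": ["CD3E", "CD3D", "CD2"],
    "B cell": ["CD19", "MS4A1", "CD79A"],
}
spot = {"CD3E": 4.0, "CD3D": 2.0, "MS4A1": 1.0}
label, scores = annotate_spot(spot, signatures)
print(label, scores)
```

Real analyses usually compare each score against a background of control genes, since raw means favor signatures made up of highly expressed genes.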

Regarding your analysis process, could you share your perspectives, particularly on how the tools and methodologies you employ integrate with broader projects like atlases or gene annotations?

CK: On the bioinformatics end, many are guided towards using R or Python due to the availability of specialized data structures tailored for single cell and spatial analyses, such as SingleCellExperiment in R or AnnData in Python. These structures have become a standard, streamlining the initial stages of data analysis. However, the ideal scenario would be a unified data structure that eliminates the need to switch between different formats, potentially simplifying the workflow even further.

The goal at our core facility is not just to perform basic tasks like sequencing and aligning but to provide clients with ready-to-use data objects that encapsulate either single cell or spatial experiment data. This way, customers can immediately proceed with more in-depth analyses.

We envision a user-friendly web interface allowing clients to interact with their data more intuitively, such as visualizing unique molecular identifiers (UMIs) or gene counts per cell or spatial spot.

Moreover, giving the data back to researchers or clients with already annotated gene information or gene signatures would add significant value, allowing them to delve into data analysis more quickly. This approach represents a blend of leveraging existing tools and data structures while pushing for innovations that cater to the practical needs of researchers in bioinformatics.

That would save the researcher an enormous amount of work!

CK: Yes, but we need to remember that there are multiple approaches to data analysis, and that this information has to be communicated effectively. Users may need help understanding the data and its analysis process, or need the flexibility to adapt methods as required. Users also need to know what the dataset includes, and the steps taken to obtain the final results, to ensure transparency and clarity in the analysis process.

Is this a very long-range goal, or do you already have somebody programming a web interface right now that might be accessible?

CK: A new biostatistician with a software engineering background joined our team, focusing primarily on developing workflow pipelines. We aim to integrate these pipelines with a web interface, allowing users to input samples, run through the processing stages—from alignment to potentially further steps—and then easily access the results.

We believe this advancement will enhance our data processing capabilities, though it may require some initial extra effort. Our objective is to have this system operational for at least one or two assays within the current year.

Good luck to Cathal and his team of data scientists in their endeavors. Learn more about SAGC and their data analysis services.

About the Author

Laura Tabellini Pierre

Laura Tabellini Pierre, MSc, is a scientific and technical writer at Parse Biosciences with extensive experience in immunology, encompassing both academic and R&D research.
