In recent years, a growing number of -omics techniques have been developed and are now widely used for genome-wide profiling of a variety of parameters and for testing specific hypotheses. A major bottleneck for such studies remains the analysis of the wealth of data being generated.

Our research focuses both on developing fundamental computational and statistical methods for extracting relevant information from data (computational biology) and on integrating information from multiple experiments and -omics techniques to move towards a better understanding of biological systems (bioinformatics).

Our central biological interest is to decipher the molecular processes important for virus-host interaction from the cellular perspective. Building on the fundamental approaches we developed for this purpose, we have also moved into projects in immunology, cancer research and biochemistry, among others, spanning both basic and translational research.

Unlocking the power of high-throughput techniques using newly developed computational methods

For new experimental techniques, there are often no off-the-shelf tools for computational analysis, or the available methods are insufficient to extract all the information, remove bias or control noise.

Ribosome profiling (Ribo-seq) is a technique to identify and quantify translation of ORFs with subcodon resolution. It is based on sequencing ribosome footprints, the RNA fragments that the ribosome protects from enzymatic digestion. Because of the stringent RNase conditions used and the fact that ribosomes translocate from codon to codon, the positions of sequencing reads show a characteristic periodicity with respect to the frame of translation. Roughly speaking, for many reads the P site codon of the corresponding ribosome is at position 12 within the read. Mapping reads to P site codons with such a fixed offset, however, is too inaccurate to properly resolve many ORFs, in particular most of the short upstream ORFs (uORFs). We have developed Probabilistic inference of codon activities by an EM algorithm (PRICE), a new algorithm that estimates P site codon positions with a drastically improved signal-to-noise ratio. Moreover, a new statistical test included in PRICE enables us to reliably resolve complex cases such as those arising from multiple overlapping ORFs. Validation of newly identified ORFs has so far been an unsolved issue. Because peptides derived from short ORFs are degraded immediately after translation, they often escape detection in mass spectrometry experiments. We reasoned that they should nevertheless be well represented in MHC-I complexes, as peptide presentation via this pathway is believed to depend on translation rates rather than on protein abundance. Indeed, we found hundreds of cryptic peptides derived from short ORFs in MHC-I ligandome experiments.
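
The fixed-offset mapping described above, and the way periodicity is read off from it, can be sketched in a few lines of Python. This is only an illustration of the simple deterministic approach, not of PRICE itself. The function names, the coordinate convention (0-based positions on a transcript, ORF given as a half-open interval) and the example reads are assumptions made for this sketch.

```python
# Minimal sketch of deterministic P-site mapping (NOT the PRICE algorithm).
# Assumption: positions are 0-based transcript coordinates and the P-site
# codon starts 12 nt downstream of the read's 5' end, as a rule of thumb.

P_SITE_OFFSET = 12


def p_site_position(read_start, offset=P_SITE_OFFSET):
    """Return the coordinate of the first base of the inferred P-site codon."""
    return read_start + offset


def frame_counts(read_starts, orf_start, orf_end, offset=P_SITE_OFFSET):
    """Count inferred P sites per reading frame within [orf_start, orf_end).

    Index 0 is in-frame with the ORF; indices 1 and 2 are out-of-frame.
    """
    counts = [0, 0, 0]
    for start in read_starts:
        pos = p_site_position(start, offset)
        if orf_start <= pos < orf_end:
            counts[(pos - orf_start) % 3] += 1
    return counts


def signal_to_noise(counts):
    """Ratio of in-frame reads (signal) to out-of-frame reads (noise)."""
    noise = counts[1] + counts[2]
    return counts[0] / noise if noise else float("inf")


if __name__ == "__main__":
    # Hypothetical 5' ends of six reads around an ORF spanning [100, 160).
    reads = [88, 91, 94, 89, 97, 80]
    counts = frame_counts(reads, orf_start=100, orf_end=160)
    print(counts, signal_to_noise(counts))
```

With a single fixed offset, any read whose true P site sits elsewhere ends up in the wrong frame and counts as noise, which is the inaccuracy that a probabilistic assignment aims to reduce.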

Left: Comparison of approaches for mapping reads to codons with respect to signal (total number of reads mapped in-frame) and signal-to-noise ratio (noise: reads mapped out-of-frame to annotated ORFs). Color-coded according to the key to indicate deterministic mapping of read classes defined by length and 5′ mismatch state and of combinations of read classes (basic, ignoring 5′ mismatches; extended, considering 5′ mismatches; top 4, combining the best read classes), and probabilistic mapping by PRICE. Right: Total number of peptides detected in proteome and MHC-I peptidome mass spectrometry experiments. The 1% peptide-identification FDR is indicated by a dashed line. Gray bars represent the peptides from ORFs also identified by ORF-RATER or Rp-Bp (for PRICE) or ORFs also identified by PRICE (for ORF-RATER and Rp-Bp).

Single-cell RNA sequencing (scRNA-seq) has revolutionized our view of RNA biology in individual cells. Current approaches make it possible to profile total RNA levels for thousands of genes in tens or even hundreds of thousands of cells. However, scRNA-seq has one inherent limitation: each cell can only be profiled once. This has several consequences: (i) responses to perturbations cannot be measured directly, (ii) kinetics of transcription (e.g. bursts) cannot be investigated, (iii) short-term changes due to a perturbation or stimulus within a timescale of a few hours are masked by pre-existing RNA and (iv) changes in RNA synthesis and decay cannot be differentiated. We have developed single-cell thiol(SH)-linked alkylation for the metabolic sequencing of RNA (scSLAM-seq), which integrates metabolic RNA labeling, biochemical nucleoside conversion and scRNA-seq to directly record transcriptional activity in single cells. Key to this was a computational approach (GRAND-SLAM) that we recently developed and that allowed us to precisely quantify the new-to-total RNA ratio (NTR) for thousands of genes in individual cells. We used these methods to study the earliest changes in transcription in cytomegalovirus-infected fibroblasts. Our data enabled us to perform dose-response analyses at the single-cell level. This revealed that most of the variability in infection efficacy can be explained by a combination of the cell cycle state at the time of infection and the infection dose. The same was true for the regulation of cellular genes, including interferon-stimulated genes (ISGs) and NF-κB responses. The data also demonstrated that the predominant mode of regulation was not more transcription in each cell, but rather more cells showing any transcriptional activity during the two hours of labeling. scSLAM-seq visualizes transcriptional bursting in unprecedented detail for thousands of genes, based on gene-specific patterns of new and old RNA across cells. We show that these patterns are associated with promoter-intrinsic features (TATA boxes/methylated CpGs).
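
To make the idea of a new-to-total RNA ratio concrete, the following Python sketch estimates an NTR from per-read conversion counts with a two-component binomial mixture fitted by EM. It is a deliberately simplified illustration and should not be read as the GRAND-SLAM implementation: the conversion rate in new RNA (`p_conv`), the background error rate (`p_err`), the assumption that both are known in advance, and all function names are assumptions made for this example.

```python
# Simplified sketch: estimate the new-to-total RNA ratio (NTR) for one gene
# from T-to-C mismatch counts. NOT the GRAND-SLAM implementation.
# Assumptions: each read is summarized as (number of T positions, number of
# observed conversions); p_conv and p_err are known and fixed.
import random
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_ntr(reads, p_conv=0.05, p_err=0.001, n_iter=200):
    """EM estimate of the mixing weight of the 'new RNA' component."""
    ntr = 0.5  # uninformative starting value
    for _ in range(n_iter):
        # E-step: posterior probability that each read stems from new RNA.
        resp = []
        for n_t, k in reads:
            new = ntr * binom_pmf(k, n_t, p_conv)
            old = (1 - ntr) * binom_pmf(k, n_t, p_err)
            resp.append(new / (new + old))
        # M-step: the updated NTR is the mean responsibility.
        ntr = sum(resp) / len(resp)
    return ntr


def simulate_reads(n_reads, true_ntr, n_t=25, p_conv=0.05, p_err=0.001, seed=0):
    """Draw synthetic reads for testing; purely illustrative data."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        p = p_conv if rng.random() < true_ntr else p_err
        k = sum(rng.random() < p for _ in range(n_t))
        reads.append((n_t, k))
    return reads


if __name__ == "__main__":
    reads = simulate_reads(2000, true_ntr=0.3)
    print(round(estimate_ntr(reads), 3))
```

The mixture is needed because many reads from new RNA carry no conversion at all, so simply counting reads with at least one mismatch would underestimate the new fraction.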

Workflow of scSLAM-seq and GRAND-SLAM. 4sU incorporated into new RNA is chemically converted into a cytosine analog after cell lysis or in fixed cells. The resulting mismatches in scRNA-seq reads make it possible to estimate the new-to-total RNA ratio (NTR) per gene and cell by statistical modeling. Distinguishing old and new RNA enables gene regulatory network inference, functional genomics approaches and direct analysis of transcriptional bursting.

Integrative analyses & data science

Many problems in biology can only be solved by utilizing and combining more than one data set from large-scale experiments.

For instance, by integrating several data sets relevant for microRNA targeting (PAR-CLIP, RIP-Chip, LC-MS/MS, 4sU-microarrays), we have shown that viral and cellular microRNAs bind to their target sites in a context-dependent manner and that this context-dependent binding has a context-dependent impact on gene expression. Interestingly, we found that this context cannot be explained by the presence or absence of the microRNA or target mRNA; other factors must be involved that constitute the context (e.g. competition with RNA-binding proteins or stable RNA secondary structures).

Herpesviral genomes are relatively large (e.g. ca. 150 kb for HSV-1) and are known to encode many proteins (e.g. 80 known proteins for HSV-1). Since genome sequences became available, identification of genes and proteins has been based on the prediction of open reading frames (ORFs), which were then extensively validated and characterized in the 1990s. However, modern high-throughput techniques now enable a more unbiased approach to comprehensively and accurately identify genetic elements in genomes of this size. By using a large array of different data sets, we were able to extend the previous annotation of HSV-1 to a total of 201 mRNAs and 284 ORFs. Two important lessons emerged. First, a discovery from a single large-scale data set might just represent an experimental artifact; the key to accuracy is to integrate more than one experimental technique. Second, to really understand translation and which proteins are made, knowledge of the mRNAs (and their transcription start sites) is essential.

Overview of the applied -omics approaches to re-annotate HSV-1. Viral gene expression was analyzed in primary human fibroblasts (HFF). The total RNA-seq, 4sU-seq and ribosome profiling data were recently published. To comprehensively identify transcription start sites (TiSS), we performed cRNA-seq and dRNA-seq as well as RNA-seq on subcellular RNA fractions from mock-, wild-type- and dICP27-infected cells. Furthermore, we reanalyzed recently published PacBio and MinION sequencing data. Translation start site (TaSS) profiling was performed by ribosome profiling following treatment of cells for 30 min with either harringtonine or lactimidomycin. Proteome analysis included two whole-proteome data sets using SILAC and label-free mass spectrometry. The available time points and conditions are indicated by stars.

Products of short open reading frames are rapidly degraded after translation. Therefore, they have the chance to enter the MHC-I peptide presentation pathway (see above). To screen the large number of available immunopeptidomics data sets without the need to build sequence databases from Ribo-seq data, we developed the computational approach Proteogenomic Identification using Stratified Mixture models (Peptide-PRISM). Peptide-PRISM identified thousands of cryptic peptides and showed that cryptic peptides indeed contribute up to 15% of the MHC-I ligands in different tumor samples.
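
The value of stratification can be illustrated with a small Python sketch that contrasts a pooled false discovery rate with per-category rates. Note that this sketch swaps in plain target-decoy counting for the mixture models used by Peptide-PRISM, so it shows the general principle only. The category names, the tuple layout and the toy numbers are assumptions made for this example.

```python
# Sketch of pooled versus stratified FDR estimation by target-decoy counting.
# This is NOT Peptide-PRISM, which uses stratified mixture models; it only
# illustrates why error rates are worth estimating per category.
# Assumption: each peptide-spectrum match is (score, stratum, is_decoy).
from collections import defaultdict


def pooled_fdr(psms, threshold):
    """Estimated FDR (decoys / targets) over all matches above threshold."""
    targets = sum(1 for s, _, d in psms if s >= threshold and not d)
    decoys = sum(1 for s, _, d in psms if s >= threshold and d)
    return decoys / targets if targets else 0.0


def stratified_fdr(psms, threshold):
    """Estimated FDR per stratum; strata without target hits are omitted."""
    targets = defaultdict(int)
    decoys = defaultdict(int)
    for score, stratum, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys[stratum] += 1
            else:
                targets[stratum] += 1
    return {s: decoys[s] / n for s, n in targets.items() if n > 0}


if __name__ == "__main__":
    # Toy data: 100 target and 1 decoy hit in the canonical proteome,
    # 10 target and 2 decoy hits among cryptic (non-canonical) candidates.
    psms = (
        [(10.0, "canonical", False)] * 100
        + [(10.0, "canonical", True)] * 1
        + [(10.0, "cryptic", False)] * 10
        + [(10.0, "cryptic", True)] * 2
    )
    print(pooled_fdr(psms, 5.0))
    print(stratified_fdr(psms, 5.0))
```

In this toy example, the pooled estimate is dominated by the many canonical hits and hides the much higher error rate among the cryptic candidates, which only becomes visible when each category is evaluated separately.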

Peptide-PRISM identifies cryptic peptides in a melanoma sample. (A) Workflow of Peptide-PRISM. (B) Number of peptides identified for sample MM15. Andromeda represents the originally published numbers; classic FDR and Peptide-PRISM are described in the methods. (C) Novel and cryptic peptides consist of the same percentage of netMHCpan 4.0 predicted HLA-I binders as the published peptide set. Novel peptides are proteome-derived peptides identified by our approach but not in the original report. Error bars represent 95% binomial confidence intervals.