Bioinformatics is the interface between computer science and biology. Our interpretation of this very broad definition is:

  1. We develop new computational and statistical methods and tools to make use of data from high-throughput experiments.
  2. We use the information gained from such experiments to obtain new insights into interesting biological questions, with a focus on host cell modulation and immune evasion in various herpesvirus models.

Basic bioinformatics

High-throughput techniques (including next generation sequencing and mass spectrometry) are now commonly used to study systems in biology. Data from such experiments pose two challenges: they are huge, and they are affected by errors inherent to the experiments. Their sheer size necessitates the use of sophisticated methods from various areas of computer science (e.g. algorithmics, databases, machine learning, parallelism, software engineering), and experimental bias or noise must be handled using statistical models.

Next generation sequencing in particular has revolutionized many fields of research. A large variety of parameters for various biological entities can readily be quantified by counting reads that belong to these entities. Entities could for instance be genes or mRNAs in RNA-seq, or binding sites of microRNAs in AGO-PAR-CLIP. Often, the focus is differential quantification where the parameter of interest arises by comparing two conditions (e.g. virus infected cell vs. control, or specific antibody vs. background).

There are many statistical models and computational methods available that deal with quantification based on count data. However, we were the first to develop a model that is directly based on ratios of counts. This has many important implications with respect to experimental issues (under which conditions does bias cancel out) and in particular for entities with few reads (what is the meaning of pseudocounts).

Workflows for differential NGS analysis. Differential analysis of NGS data starts with the aligned reads of two conditions, here exemplified as RNA-seq reads from samples A and B aligned to an mRNA. Existing models take one specific route through the necessary steps defined in the main text: (I) For each sample, reads are aggregated and an appropriate probabilistic model is used to control noise and estimate the sample specific mRNA abundance. (II) These abundance estimates are then divided to give an estimate of the mRNA fold change. Our approach takes a different route by first computing local ratios for all read sequences and then aggregating them using an appropriate noise model for count ratios to estimate the total mRNA fold change. Using a basic noise model for the second step makes both routes equivalent. However, using extensions to it leads to more accurate fold change estimates by exploiting the fact that bias cancels out when taking the ratio of counts of individual sequences.
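The equivalence of the two routes under a basic noise model can be illustrated with a small sketch. It assumes, purely for illustration, that the "basic noise model" is a plain binomial model in which all read sequences of an mRNA share one proportion p = A / (A + B); the function names and the example counts are hypothetical and not taken from our software.

```python
from typing import Sequence


def fold_change_aggregate_first(a: Sequence[int], b: Sequence[int]) -> float:
    """Route (I)/(II): sum the reads per sample, then divide the totals."""
    return sum(a) / sum(b)


def fold_change_ratio_first(a: Sequence[int], b: Sequence[int]) -> float:
    """Alternative route: local proportions per read sequence, then aggregation.

    Assumption: a shared binomial proportion p for all sequences. Its maximum
    likelihood estimate is the mean of the local proportions a_i / (a_i + b_i),
    weighted by the local read totals n_i = a_i + b_i.
    """
    totals = [x + y for x, y in zip(a, b)]
    local = [(x / n, n) for x, n in zip(a, totals) if n > 0]
    p = sum(p_i * n for p_i, n in local) / sum(n for _, n in local)
    if p == 1.0:
        return float("inf")  # no reads at all in sample B
    return p / (1.0 - p)  # convert the proportion into a fold change A/B


# Hypothetical counts of three distinct read sequences in samples A and B
counts_a = [10, 40, 5]
counts_b = [5, 20, 3]
fc_sum = fold_change_aggregate_first(counts_a, counts_b)
fc_ratio = fold_change_ratio_first(counts_a, counts_b)
```

With this simple weighting both functions return the same value (55/28 for the counts above). Only when the aggregation step uses a richer noise model for the local ratios do the two routes diverge.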

Integrative analyses & data science

Many problems in biology can only be solved by utilizing and combining more than one data set from large-scale experiments.

For instance, by integrating several data sets relevant for microRNA targeting (PAR-CLIP, Rip-Chip, LC-MS/MS, 4sU-microarrays), we have shown that viral and cellular microRNAs bind to their target sites in a context-dependent manner and that context-dependent binding has context-dependent impact on gene expression. Interestingly, we found that this context cannot be explained by the presence or absence of the microRNA or target mRNA, but that other factors must be involved that constitute the context (e.g. competition with RNA binding proteins or stable RNA secondary structures).

Herpesviral genomes are relatively large (e.g. ca. 150 kb for HSV-1) and are known to encode many proteins (e.g. 80 known proteins for HSV-1). Since genome sequences became available, identification of genes and proteins was based on the prediction of open reading frames (ORFs), which were then extensively validated and characterized in the 1990s. However, modern high-throughput techniques now enable a more unbiased approach to comprehensively and accurately identify genetic elements in such small genomes. By using a large array of different data sets, we were able to extend the previous annotation of HSV-1 to a total of 201 mRNAs and 284 ORFs. There were two very important lessons to learn. First, a discovery from a single large-scale data set might just represent an experimental artifact; the key to accuracy is to integrate more than one experimental technique. Second, to really understand translation and which proteins are made, knowledge of the mRNAs (and transcription start sites) is essential.

Overview of the applied Omics approaches to re-annotate HSV-1. Viral gene expression was analyzed in primary human fibroblasts (HFF). The total RNA-seq, 4sU-seq and ribosome profiling data were recently published. To comprehensively identify transcription start sites (TiSS), we performed cRNA-seq and dRNA-seq as well as RNA-seq on subcellular RNA fractions from mock, wild-type and dICP27 infected cells. Furthermore, we reanalyzed recently published PacBio and MinION sequencing data. Translation start site (TaSS) profiling was performed by ribosome profiling following treatment of cells for 30 min with either Harringtonine or Lactimidomycin. Proteome analysis included two whole proteome data sets using SILAC and label-free mass spectrometry. The available time points and conditions are indicated by stars.

Unlocking the power of high-throughput techniques using newly developed computational methods

For new experimental techniques, there are often no off-the-shelf tools for their computational analysis (or the available methods are insufficient to extract all the information, to remove bias or to control noise).

Ribosome profiling (or Ribo-seq) is a technique to identify and quantify translation of ORFs with subcodon resolution. It is based on sequencing the ribosome footprints, which are RNA fragments protected by the ribosome from enzymatic digestion. Due to the stringent RNase conditions used and the fact that ribosomes translocate from codon to codon, the positions of sequencing reads show a characteristic periodicity with respect to the frame of translation. For many reads, roughly speaking, the P site codon of the corresponding ribosome is at position 12 within the read. This procedure of mapping reads to P site codons, however, is too inaccurate to properly resolve many ORFs, in particular most of the short upstream ORFs (uORFs). We have developed Probabilistic inference of codon activities by an EM algorithm (PRICE), a new algorithm to estimate P site codon positions with drastically improved signal-to-noise ratio. Moreover, a new statistical test included in PRICE enables us to reliably resolve complex cases such as those arising with multiple overlapping ORFs. Validation of newly identified ORFs has been an unsolved issue so far. Because they are immediately degraded after translation, peptides derived from short ORFs often escape detection in mass spectrometry experiments. We reasoned that they should be well represented in MHC-I complexes, as peptide presentation via this pathway is believed to depend on translation rates and not on protein abundance. Indeed, we found hundreds of cryptic peptides derived from short ORFs in MHC-I ligandome experiments.
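The simple deterministic mapping described above (not PRICE itself) can be sketched as follows. The sketch assumes forward-strand transcript coordinates, treats "position 12" as a 0-based offset from the 5′ end of the read, and uses hypothetical function names and read positions.

```python
from typing import List, Sequence

# Offset of the P site codon within a read, taken from the rule of thumb in
# the text. Treating it as 0-based is an assumption of this sketch.
P_SITE_OFFSET = 12


def p_site_position(read_start: int, offset: int = P_SITE_OFFSET) -> int:
    """Transcript coordinate of the first nucleotide of the P site codon."""
    return read_start + offset


def frame_counts(read_starts: Sequence[int], orf_start: int) -> List[int]:
    """Count reads whose P site falls into frame 0, 1 or 2 of an ORF.

    Frame 0 is the frame of translation. A strong excess of frame 0 over
    frames 1 and 2 is the periodicity signal; reads in frames 1 and 2 count
    as noise.
    """
    counts = [0, 0, 0]
    for start in read_starts:
        frame = (p_site_position(start) - orf_start) % 3
        counts[frame] += 1
    return counts


def in_frame_fraction(counts: Sequence[int]) -> float:
    """Fraction of reads mapped in-frame (0.0 if there are no reads)."""
    total = sum(counts)
    return counts[0] / total if total else 0.0


# Hypothetical 5' read positions around an ORF starting at coordinate 100
example_counts = frame_counts([88, 91, 94, 89], orf_start=100)
```

In this toy example three of four reads land in frame 0. A fixed offset ignores that the true offset varies between reads, which is the inaccuracy that a probabilistic assignment is meant to address.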

Left: Comparison of approaches for mapping reads to codons with respect to signal (total number of reads mapped in-frame) and signal-to-noise ratio (noise: reads mapped out-of-frame to annotated ORFs). Color-coded according to the key to indicate deterministic mapping of read classes defined by length and 5′ mismatch state and of combinations of read classes (basic, ignoring 5′ mismatches; extended, considering 5′ mismatches; top 4, combining the best read classes), and probabilistic mapping by PRICE. Right: Total number of peptides detected in proteome and MHC-I peptidome mass spectrometry experiments. The 1% peptide-identification FDR is indicated by a dashed line. Gray bars represent the peptides from ORFs also identified by ORF-RATER or Rp-Bp (for PRICE) or ORFs also identified by PRICE (for ORF-RATER and Rp-Bp).

Single cell RNA sequencing (scRNA-seq) has revolutionized our view of RNA biology in individual cells. Current approaches allow profiling of the total RNA levels for thousands of genes in tens or even hundreds of thousands of cells. However, scRNA-seq has one inherent limitation: each cell can only be profiled once. This has several consequences: (i) responses to perturbations cannot be measured directly, (ii) kinetics of transcription (e.g. bursts) cannot be investigated, (iii) short-term changes due to a perturbation or stimulus within a timescale of a few hours are masked by pre-existing RNA and (iv) changes in RNA synthesis and decay cannot be differentiated. We have developed single-cell thiol(SH)-linked alkylation for the metabolic sequencing of RNA (scSLAM-seq), which integrates metabolic RNA labeling, biochemical nucleoside conversion and scRNA-seq to directly record transcriptional activity in single cells. Key to this was a new computational approach (GRAND-SLAM) that we recently developed and that allowed us to precisely quantify the new-to-total ratio (NTR) for thousands of genes in individual cells. We utilized these methods to study the earliest changes in transcription in cytomegalovirus-infected fibroblasts. Our data enabled us to perform dose-response analyses at the single cell level. This revealed that most of the variability in infection efficacy can be explained by a combination of the cell cycle state at the time of infection and the infection dose. The same was true for the regulation of cellular genes, including interferon stimulated genes (ISGs) and NF-κB responses. This also demonstrated that the predominant mode of regulation was not more transcription in each cell, but rather more cells showing any transcriptional activity during the two hours of labeling. scSLAM-seq visualizes transcriptional bursting in unprecedented detail for thousands of genes. This is based on specific patterns of new and old RNA across cells for individual genes. We show that these are associated with promoter-intrinsic features (TATA-boxes/methylated CpGs).
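To give an intuition for what estimating a new-to-total ratio involves, the following toy sketch treats the reads of one gene as a two-component binomial mixture: reads from new RNA carry nucleoside conversions at a higher rate than reads from old RNA, which only show sequencing errors. This is an illustrative assumption and not a description of how GRAND-SLAM works internally; the rates, read model and function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Read = Tuple[int, int]  # (observed conversions k, convertible positions n)


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def estimate_ntr(reads: Sequence[Read], p_new: float, p_old: float,
                 iterations: int = 100) -> float:
    """EM estimate of the mixing proportion (NTR); both rates assumed known."""
    ntr = 0.5  # starting value
    for _ in range(iterations):
        resp_sum = 0.0
        for k, n in reads:
            w_new = ntr * binom_pmf(k, n, p_new)
            w_old = (1.0 - ntr) * binom_pmf(k, n, p_old)
            resp_sum += w_new / (w_new + w_old)  # E-step: P(read is new)
        ntr = resp_sum / len(reads)  # M-step: mean responsibility
    return ntr


def simulate_reads(n_reads: int, ntr: float, p_new: float, p_old: float,
                   n_pos: int = 50, seed: int = 0) -> List[Read]:
    """Simulate reads with a known NTR, for checking the estimator only."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        p = p_new if rng.random() < ntr else p_old
        k = sum(rng.random() < p for _ in range(n_pos))
        reads.append((k, n_pos))
    return reads


# Hypothetical rates: 4% conversions in new RNA, 0.1% errors in old RNA
simulated = simulate_reads(2000, ntr=0.3, p_new=0.04, p_old=0.001)
ntr_hat = estimate_ntr(simulated, p_new=0.04, p_old=0.001)
```

On the simulated data the estimate should land close to the true value of 0.3. Many new reads carry no conversion at all, which is why a mixture model is used here instead of simply counting reads with at least one conversion.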