Bioinformatics and Biostatistics

The bioinformatics core is a facility that provides expertise in the analysis of large data sets derived from biological experiments.

We have expertise in scientific programming, database technology, mathematics, statistics and experimental design, genomics and molecular biology.

We will collaborate to whatever degree suits the needs of your project. We very often perform the entire analysis component of a project, but we also act in an advisory role to researchers who want to perform their own analysis and require some statistical advice.

We will also collaborate on a project of any size, whether it lasts a single day or five years.

We will analyse any type of high-throughput data. For example:

  • Single-cell sequencing analysis.
  • ChIP-Seq: Sequencing of protein-bound DNA fragments.
  • RNA-Seq: Sequencing of RNA to determine transcript abundance.
  • Resequencing: Genome-scale resequencing to identify genomic variants.
  • miRNA: Characterisation and quantification of small regulatory RNA molecules.
  • 4C: Structurally associated DNA sequencing.
  • GRO-Seq: To identify the genes that are being transcribed at a certain time point.
  • Tif-Seq: Genome-wide measurement of transcript isoform diversity.
  • Any type of microarray data.

We are able to interpret your data in the context of pathways and biological processes using tools such as Metacore.

Examples of problems we solve:

  • Integration of multiple data sets including public domain data.
  • Transcription factor binding analysis and motif identification.
  • Analysis of time series data.
  • Searching for novel homologues/orthologues.
  • Interpretation of cell motility and morphology data.
  • Survival analysis, e.g. Kaplan-Meier.
  • Protein characterisation.
  • Promoter analysis.
  • Guidance with genome browsers.

We are able to give statistical advice for the analysis of data sets.

We also have the following skills, which may assist your research:

  • Data visualisation and the production of publication-quality figures.
  • Web development and the deployment of tools and data over the internet.
  • Computer programming expertise in R, Perl, C, Java, Python, PHP and Bash.
  • Database programming and the storage of big data.

 

Projects

The ability to measure DNA quantitatively and qualitatively by next-generation sequencing means that any experiment that can produce DNA as a measurable output is now technically possible. The need to analyse these large and complex data sets continues to grow, and the Bioinformatics and Biostatistics Facility (BABS) has grown to 10 analysts in order to handle GRO-Seq, RIP-Seq, 4C and shRNA-Seq as well as more established applications such as ChIP-Seq, exome capture and RNA-Seq.

GRO-Seq

Marco Saponaro, of the Jesper Svejstrup Group, used a new technique called DRB/GRO-seq to investigate genome-wide transcription elongation. Cells were treated with DRB, which reversibly blocks new transcript elongation and in effect synchronises the transcription cycle. After DRB release, transcription elongation resumed and the position of RNA polymerase II (RNAPII) in the body of genes was analysed by extending RNAs with labelled nucleotides and subjecting them to deep sequencing. Increasing the time after DRB release allowed RNAPII to advance further into gene bodies.

BABS analysed the sequence data from this time-series experiment to detect the wave-front of transcription elongation at each locus across the genome. The wave-front position shows how far RNAPII had advanced at a given locus at each time point. The wave-front data from the time series were then used to estimate genome-wide elongation rates.
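As an illustration of the idea only, the sketch below shows one simple way a wave-front could be located and an elongation rate derived from a time series. It is not the method used in the project: the function names, the coverage threshold, the binning and the example numbers are all hypothetical. The wave-front is taken naively as the end of the last bin whose read coverage reaches a threshold, and the rate is the slope of an ordinary least-squares line through wave-front position against time after DRB release.

```python
from typing import Sequence, Tuple


def detect_wavefront(coverage: Sequence[float], bin_size: int, threshold: float) -> int:
    """Return the distance (bp from gene start) of the end of the last bin
    whose coverage is >= threshold, or 0 if no bin qualifies.

    Naive, hypothetical rule: real data would need smoothing and noise handling.
    """
    last_bin = -1
    for index, depth in enumerate(coverage):
        if depth >= threshold:
            last_bin = index
    return (last_bin + 1) * bin_size


def fit_elongation_rate(times_min: Sequence[float],
                        wavefront_bp: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of wavefront_bp = rate * time + offset.

    Returns (rate in bp per minute, offset in bp).
    """
    if len(times_min) != len(wavefront_bp):
        raise ValueError("times and wave-front positions must have equal length")
    n = len(times_min)
    if n < 2:
        raise ValueError("at least two time points are required")
    mean_t = sum(times_min) / n
    mean_x = sum(wavefront_bp) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (x - mean_x) for t, x in zip(times_min, wavefront_bp))
    rate = sxy / sxx
    offset = mean_x - rate * mean_t
    return rate, offset


if __name__ == "__main__":
    # Made-up binned coverage for one gene at one time point (1 kb bins).
    example_coverage = [50, 48, 51, 47, 3, 2, 1]
    print(detect_wavefront(example_coverage, bin_size=1000, threshold=10))

    # Made-up wave-front positions at four times after DRB release.
    times = [10, 20, 30, 40]                  # minutes
    positions = [20000, 40000, 60000, 80000]  # bp from gene start
    print(fit_elongation_rate(times, positions))
```

Fitting a line per gene and then summarising the slopes across genes is one plausible route from per-locus wave-fronts to a genome-wide rate estimate; the numbers above are invented purely to exercise the code.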

Figure 1. The figure displays the results of a DRB/GRO-seq experiment over the CTNNBL1 gene. Reads mapping to the CTNNBL1 locus are shown in red. As time after DRB release increases, the RNAPII advances further into the body of the gene and the wave-front is seen to advance.

VarSLR - algorithm for assessing mutation calling in clinical samples

Accurate mutation calling in clinical tumour samples remains a formidable challenge, confounded by sample complexity, experimental artefacts and algorithmic constraints. In particular, high sequencing error rates (~0.1-1 x 10^-2 per base) entail costly manual review of putative mutations followed by orthogonal validation. Efficient filtering is currently required, given that most mutation callers identify many thousands (in exome sequencing), if not millions (in whole genome sequencing), of candidate mutations per experiment.

To aid in this process, we developed the open-source VarSLR R package to identify somatic nucleotide and insertion-deletion mutations that are likely to be sequencing artefacts. The algorithm incorporates putative confounders of call accuracy (Genome Res. Mar. 2012;22(3):568-576) into stepwise logistic regression models and subsequently classifies variants within a simple, 4-tiered quality schema.
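To make the general shape of this approach concrete, the sketch below scores a variant with a logistic model and maps the resulting probability to one of four tiers. It does not reproduce VarSLR, which is an R package: the covariate names, coefficients and tier cut-offs are hypothetical placeholders, Python is used purely for illustration, and the stepwise model-selection step is omitted entirely.

```python
import math
from typing import Dict, Sequence

# Hypothetical coefficients. In a real analysis these would be estimated by
# fitting a (stepwise-selected) logistic regression to labelled variant calls.
EXAMPLE_COEFFS: Dict[str, float] = {
    "intercept": -2.0,
    "strand_bias": 3.0,        # hypothetical covariate scaled to 0..1
    "low_mapq_fraction": 4.0,  # hypothetical covariate scaled to 0..1
}


def artefact_probability(features: Dict[str, float],
                         coeffs: Dict[str, float] = EXAMPLE_COEFFS) -> float:
    """Logistic model: probability that a candidate variant is an artefact."""
    z = coeffs["intercept"] + sum(coeffs[name] * value
                                  for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def quality_tier(probability: float,
                 cutoffs: Sequence[float] = (0.25, 0.5, 0.75)) -> int:
    """Map an artefact probability to a tier from 1 (least suspect) to 4.

    The three cut-offs are illustrative assumptions, not VarSLR's schema.
    """
    tier = 1
    for cutoff in cutoffs:
        if probability >= cutoff:
            tier += 1
    return tier


if __name__ == "__main__":
    clean_call = {"strand_bias": 0.0, "low_mapq_fraction": 0.0}
    suspect_call = {"strand_bias": 1.0, "low_mapq_fraction": 1.0}
    for call in (clean_call, suspect_call):
        p = artefact_probability(call)
        print(call, round(p, 3), quality_tier(p))
```

Because each variant is scored independently of the others, a scheme of this kind splits naturally across many processes or compute nodes.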

VarSLR is highly scalable and designed to be run in an 'embarrassingly parallel' fashion, thus benefiting from the LRI's high-performance computing facility. Moreover, VarSLR performed with high precision when tested with synthetic and experimental data, and has been successfully applied to numerous projects (e.g. Science, Oct. 2014; 346(6206): 251-6).

Aengus Stewart (Lead) 

aengus.stewart@crick.ac.uk
+44 (0) 20 3796 1702