David Jones

Biomedical Data Science Laboratory

For around half of hereditary human disorders, there is no information on which genes are associated with the disorder. For non-hereditary diseases, far less is known about the interplay between genetic features and how they might influence disease occurrence and progression. In this project, the overall aim is to explore new ideas for linking diseases to human genes by applying so-called "big data" bioinformatics techniques. Over the past 5 years, we have collected a very large amount of both experimental and predicted functional data (calculated using the Legion Supercomputer) for every human gene, e.g. sequence similarity, gene co-expression and predicted gene fusions. So far we have used these data to predict the biological functions of functionally uncharacterised genes with considerable success, e.g. topping the rankings in the international Critical Assessment of protein Function Annotation (CAFA) algorithms challenge in 2011. This work has led to new developments, including a project funded by Elsevier via the UCL Big Data Institute. The exciting possibility that we would like to explore whilst at the Crick is whether we can go beyond gene function prediction and extend these same ideas to predicting novel disease-gene associations.

One avenue we are particularly interested in exploring is whether we can learn disease-gene association links from Mendelian genetic disorders (i.e. inherited disorders), where the causal relationships between genes and disease mechanism are frequently known, and apply these patterns to non-Mendelian diseases, where the relationships are generally not known. Using large-scale machine learning techniques, we will try to predict ab initio which genes might be associated with particular diseases even when there is no known hereditary component.

Predicting gene function and its relationship to disease is key to future developments in translational medicine. Even with all of the collected sequence data and postgenomic data, computational methods for linking function to sequence and sequence variations are urgently needed, as there are still very many genes of unknown function and unknown relationship to disease. At the Crick, immediate and easy access to potential biomedical collaborators working on molecular genetics, cell biology and disease will be invaluable in providing the expert knowledge needed to ensure that the results of this project are meaningful and focussed on the right questions.

More generally, we are very keen to use our expertise in applying state-of-the-art machine learning techniques to difficult biological problems in order to tackle other interesting questions posed by the experimentalists working at the Crick. Encouraging such interdisciplinary collaborations is a major aim of the Crick, and we want to take maximum advantage of this.

Selected Publications

DWA Buchan, F Minneci, TCO Nugent, K Bryson, DT Jones. Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Research (2013) 41: W349-W357

DT Jones, D Cozzetto. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics (2015) 31: 857-863

T Nugent, DT Jones. Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis. Proceedings of the National Academy of Sciences (2012) 109: E1540-E1547

DT Jones, DWA Buchan, D Cozzetto, M Pontil. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics (2012) 28: 184-190

YJ Edwards, AE Lobley, MM Pentony, DT Jones. Insights into the regulation of intrinsically disordered proteins in the human proteome by analyzing sequence and gene expression data. Genome Biol (2009) 10: R50

David Jones

david.t.jones@crick.ac.uk
+44 (0)20 379 63300