Biology has long been the squishy science, dealing with the messy workings of cells, animals and people. But in the 21st Century, it’s increasingly a science driven by data. From DNA sequencing to microscope images and medical records, the challenge is making sense of this torrent of data, and using it to help us all live longer, healthier lives.
Enter machine learning, a toolbox of techniques that allows computers to make sense of huge datasets which would be too mind-boggling for any individual expert to grasp. Whilst techniques like ‘neural networks’, ‘random forests’ and ‘Markov chains’ might sound exotic, they’re actually surprisingly everyday. The phone in your pocket makes use of them all the time to do everything from recognising faces or voice commands, to helping you avoid traffic on your drive home. So what does facial recognition have to do with biology research?
Machine learning takes huge amounts of data and teaches a computer to emphasise the important parts whilst ignoring the irrelevant ones. This challenge is common whether you’re a mobile phone manufacturer or a biologist: picking out the signal (like a voice asking to launch an app to book a taxi, or a DNA sequence crucial to understanding a disease) from the noise (perhaps literal noise in the bar you’re in or, in the biological case, the ‘noise’ of a load of other DNA sequences which aren’t important).
A traditional approach to many problems in science is to consult with experts, find out what factors might be most significant in a given situation and build a model by hand to predict whatever you’re interested in. A machine learning approach, by contrast, would just throw in all the information you have, and allow the computer to sift the important data from the background – often finding things which an experienced researcher might miss, simply because they’re unexpected.
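To give a flavour of the ‘throw everything in’ idea, here is a toy sketch: score every candidate variable by how well it alone separates the two outcome groups, then rank them. Everything below – the patients, the features, the scoring rule – is invented purely for illustration, not taken from any real study.

```python
# A toy sketch of the "throw everything in" approach: score every
# feature by how well it separates two outcome groups, then rank them.
# All numbers here are invented for illustration.

def separation_score(values, outcomes):
    """Absolute difference between the group means, scaled by the overall spread."""
    with_event = [v for v, o in zip(values, outcomes) if o]
    without = [v for v, o in zip(values, outcomes) if not o]
    mean_with = sum(with_event) / len(with_event)
    mean_without = sum(without) / len(without)
    spread = (max(values) - min(values)) or 1.0
    return abs(mean_with - mean_without) / spread

# Each row: (age, cholesterol, shoe_size); outcome: 1 = event, 0 = none.
patients = [(70, 6.1, 42), (65, 5.9, 44), (72, 6.4, 41),
            (45, 4.2, 43), (50, 4.5, 42), (48, 4.1, 44)]
outcomes = [1, 1, 1, 0, 0, 0]

features = ["age", "cholesterol", "shoe_size"]
scores = {name: separation_score([row[i] for row in patients], outcomes)
          for i, name in enumerate(features)}

for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.2f}")
```

Even this crude ranking pushes the irrelevant variable (shoe size) to the bottom without anyone having to tell the computer it doesn’t matter – which is the point of the approach, if a long way from a real model.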
My most recent project used machine learning to predict risk of death in patients with heart disease, and compared its accuracy against state-of-the-art expert-constructed models.
Machine learning did (slightly) better than the experts, outperforming the model we tested against by about 1%. It also allowed us to identify variables which doctors might not have predicted. Along with factors like age and whether or not a patient smoked, my models pulled out a home visit from their GP as a good predictor – not something a cardiologist might say is important in the biology of heart disease, but perhaps a good indication that the patient is too unwell to make it to the doctor themselves, and a useful variable to help the model make accurate predictions.
You can also use machine learning to answer far more fundamental questions in biology. I’ve used neural networks to try to work out where proteins bind to our DNA, and understand how different genes are turned on when they’re needed, and off when they’re not. Other researchers in the Crick use them to work out how proteins fold – a notoriously knotty problem which forms the very basis of the molecular workings of our cells.
Neural networks are also very common in image recognition. This is how, when you do an image search online, the search engine can find you pictures of, say, cats – it can ‘look’ at an image, determine whether it contains a cat with incredible accuracy and, if so, deliver it to your browser. Image recognition technology also has huge potential in biology and medicine – whether identifying tiny structures within our cells, or helping doctors work out whether a tiny shadow on an X-ray is worth further investigation.
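At its simplest, that ‘looking’ is done by artificial neurons: weighted sums of pixel values squashed into a score between 0 and 1. Here is a minimal sketch of a single neuron, with hand-set weights and made-up 3×3 ‘images’, that responds to a vertical line – real networks stack millions of these and learn the weights from data.

```python
import math

# A single artificial neuron -- the building block of the neural
# networks used in image recognition. Weights and the toy 3x3 "images"
# are hand-made for illustration: this neuron fires when the centre
# column of the grid is lit.

def neuron(pixels, weights, bias):
    """Weighted sum of the inputs, squashed through a sigmoid to give 0..1."""
    total = sum(p * w for p, w in zip(pixels, weights)) + bias
    return 1 / (1 + math.exp(-total))

# Weights hand-set to respond to the middle column of a 3x3 grid.
weights = [-1, 2, -1,
           -1, 2, -1,
           -1, 2, -1]
bias = -3

vertical_line = [0, 1, 0,
                 0, 1, 0,
                 0, 1, 0]
horizontal_line = [0, 0, 0,
                   1, 1, 1,
                   0, 0, 0]

print(f"vertical:   {neuron(vertical_line, weights, bias):.2f}")
print(f"horizontal: {neuron(horizontal_line, weights, bias):.2f}")
```

The vertical line scores close to 1 and the horizontal one close to 0. A cat detector works on the same principle, just with vastly more pixels, neurons and layers – and with weights learned from millions of labelled photos rather than set by hand.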
What’s inside the black box?
All of these models have one thing in common: they need huge amounts of data to train them. My model with patients’ medical records used over 500 measurements and diagnoses from about 80,000 patients. Online image recognition systems are often trained on millions of images so they can tell the cats and dogs from cars, buildings and ice creams. This is a common problem when applying machine learning – it often needs far more data than a traditional, handmade model to get good results.
Another difficulty can be in interpreting the outputs of machine learning. Models like neural networks are often accused of having an inscrutable internal structure, where an expert-built model has some kind of rationale behind its every component. If we want to truly understand biology, we need to be able to look inside the black box so we don’t end up making models which are incredibly accurate, but just as confounding as the messy living things they emulate.
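One widely used way to peer inside is permutation importance: shuffle one input at a time and see how far the model’s accuracy falls – inputs whose shuffling hurts most matter most. A minimal sketch, where both the ‘model’ and the patient rows are invented stand-ins:

```python
import random

# A sketch of one way to look inside the black box: permutation
# importance. Shuffle one input at a time and measure how far the
# model's accuracy falls; inputs whose shuffling hurts most matter
# most. The "model" and the data rows below are invented stand-ins.

def model(age, smoker):
    """Toy stand-in for a trained model: flags older smokers."""
    return 1 if age > 60 and smoker else 0

# Each row: (age, smoker, actual outcome).
data = [(70, 1, 1), (65, 1, 1), (72, 1, 1), (50, 0, 0),
        (45, 1, 0), (68, 0, 0), (55, 0, 0), (75, 1, 1)]

def accuracy(rows):
    return sum(model(a, s) == y for a, s, y in rows) / len(rows)

baseline = accuracy(data)
random.seed(0)
drops = {}
for i, name in enumerate(["age", "smoker"]):
    column = [row[i] for row in data]
    random.shuffle(column)
    shuffled = [tuple(v if j == i else x for j, x in enumerate(row))
                for row, v in zip(data, column)]
    drops[name] = baseline - accuracy(shuffled)

for name, drop in drops.items():
    print(f"{name}: accuracy fell by {drop:.2f}")
```

The same trick works on a genuinely opaque model: you never need to see its internals, only to feed it scrambled inputs and watch what happens to its predictions.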
Neural networks, random forests and their ilk nonetheless have huge potential. They allow researchers to get a rapid overview of the huge quantities of data which are becoming increasingly common in biology. We can now build models which are as good as, or even better than, those constructed by people with expertise, but a lot more quickly and easily.
Machine learning is already in your pocket. It’s also making its way into a research paper, doctor’s surgery and hospital near you.