Algorithms used by Facebook, Amazon or Netflix to predict your next film or your favourite post can also understand and predict the biological language of cancer and certain neurodegenerative diseases, according to a study published in the scientific journal PNAS (1). For example, decades of big data were used to “teach” the computer program about the process involved in Alzheimer’s disease.
When you get a recommendation from Netflix about a movie to watch or a reminder from Facebook about an old friend, these platforms use powerful machine learning algorithms that can make educated guesses about what you like.
Recognising the potential to go beyond social media, a team from St John’s College, University of Cambridge, used this machine learning technology to train a language model to assess what happens when something goes awry with proteins inside the body. “Bringing machine learning technology into research into neurodegenerative diseases and cancer is an absolute game-changer. Ultimately, the aim will be to use artificial intelligence to develop targeted drugs to ease symptoms dramatically or to prevent dementia from happening at all”, said Professor Tuomas Knowles, lead author of the study.
Proteins are an obvious target for identifying disease. Thousands of different proteins in the body play critical roles in maintaining normal metabolism. However, their activity can also go badly wrong. In Alzheimer’s disease, for example, which affects around 50 million people worldwide, proteins can go off course and start killing healthy nerve cells. Over time, these proteins clump together in solid masses known as aggregates and cause severe damage to the brain.
Recently, researchers discovered that, in addition to aggregates, proteins can also form liquid-like droplets known as condensates. These droplets have no membrane and can merge freely with one another.
As a starting point, the researchers asked the computer program to learn about these shapeshifting condensates found in cells. They fed the algorithm all the available data on many different proteins, allowing the computer to learn and predict how they behave. In simple terms, this is similar to the predictive-text model that an app such as WhatsApp uses to decide which words to suggest.
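To make the word-suggestion analogy concrete, here is a deliberately tiny sketch in Python. It is not the model used in the study: it simply treats each protein as a string of one-letter amino-acid codes and counts which letter tends to follow which, the way a very basic predictive keyboard counts word pairs. The sequences and the function names (`train_bigram`, `suggest_next`) are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for each residue, which residue follows it in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Walk over adjacent pairs of letters, e.g. "GRG" -> (G, R), (R, G)
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts, residue, k=3):
    """Return up to k residues seen most often after `residue`, most frequent first."""
    followers = counts.get(residue, Counter())
    return [r for r, _ in followers.most_common(k)]


# Made-up toy sequences (assumption: not real protein data).
toy_sequences = ["GRGGFGGRG", "SYGQSYGQ", "GGRGGFRG"]
model = train_bigram(toy_sequences)

print(suggest_next(model, "R"))  # ['G']: in this toy data, R is always followed by G
```

A real language model learns far richer patterns than adjacent pairs, but the principle is the same: the program is never told the rules and infers them from many examples.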
It’s very early days but, without being told directly, the algorithm showed that it can learn to identify which proteins form these condensates inside cells. “Protein condensates have recently attracted a lot of attention in the scientific world because they control key events in the cell such as gene expression (how our DNA is converted into proteins) and protein synthesis (how the cells make proteins)”, said Professor Knowles. “Any defects connected with these protein droplets can lead to diseases such as cancer. This is why bringing natural language processing technology into research into the molecular origins of protein malfunction is vital if we want to be able to correct the grammatical mistakes inside cells that cause disease.”
This technology is developing at an incredibly fast pace, driven by increased computing power and more robust algorithms. The team believe machine learning could transform future research in various fields, including cancer and neurodegenerative diseases. It’s not unreasonable to think that machine learning could uncover connections that researchers would not think to look for. “Machine learning can be free of the limitations of what researchers think are the targets for scientific exploration, and it will mean new connections will be found that we have not even conceived of yet. It is really very exciting indeed”, concluded Dr Kadi Saar, from the University of Cambridge and first author of the study.
(1) Kadi L. Saar, Alexey S. Morgunov, Runzhang Qi, William E. Arter, Georg Krainer, Alpha A. Lee, Tuomas P. J. Knowles. Learning the molecular grammar of protein condensates from sequence determinants and embeddings. Proceedings of the National Academy of Sciences, 2021; 118 (15): e2019053118. DOI: 10.1073/pnas.2019053118