The problem of predicting protein structure from sequence has been definitively solved by the AI programme AlphaFold, winning a well-deserved Nobel prize for its developers. But structure prediction is just one of at least four different problems of protein folding. Here I introduce all four: protein structure prediction, the nature of the protein folding transition, the role of proteins that don’t fold at all, and the importance of protein misfolding, particularly for diseases like Alzheimer’s disease.
The most important contributions yet made by machine learning and artificial intelligence to science are unquestionably DeepMind’s AlphaFold programmes for protein structure prediction, for which Demis Hassabis & John Jumper won the Nobel prize in chemistry in 2024 (shared with David Baker, for closely related work). Proteins are linear macromolecules; each type of protein has a unique one dimensional sequence of amino acids. For many proteins, this 1d sequence encodes a unique three dimensional structure, and it’s this 3d structure which underpins the function of the protein in the operations of the living cell. AlphaFold takes the 1d sequence of a protein and predicts the 3d structure. This is the problem of protein structure prediction, outstanding for half a century, now definitively solved by AI.
The way in which the 1d information in the protein sequence is converted to the 3d information in the structure is known as the problem of protein folding. In part, this is a problem of information – how the 1d amino acid sequence, which itself is a direct mapping of the genetic code stored on the sequences of DNA that constitute a gene, is mapped onto a single three dimensional structure, the native state, in which the relative positions in space of each of the amino acids along the chain are uniquely specified.
It’s this problem of information that AlphaFold has solved – if you know the sequence of a previously unknown protein, AlphaFold will give you a prediction for its structure. This is useful because the sequence is easy and cheap to determine, but it’s time-consuming and hard to measure the 3d structure. It’s important because it should help the design of new drugs and vaccines. For example, if one knows the shape of particular proteins in pathogenic viruses or bacteria, one can design molecules that bind to those proteins to stop them working properly.

A protein in its native state. The enzyme alpha-amylase, which converts starch into glucose. Left: a space filling rendering of the molecule by David Goodsell, from the Protein Database Molecule of the Month, https://pdb101.rcsb.org/motm/74, CC-BY-4.0 license. Right: a schematic diagram showing helical regions and regions of beta-sheet (broad arrows). Image from the RCSB Protein Database (RCSB.org), of molecule 1PPI https://www.rcsb.org/structure/1PPI. Data: M. Qian et al, (1994) Biochemistry 33: 6284-6294
But there’s also a physical problem of protein folding – how does an unfolded protein molecule, a loose, random coil, constantly changing shape as it is buffeted by Brownian motion, find its way through an astronomically large number of possible arrangements to reach the unique native state which is needed for it to fulfil its biological function?
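To get a feel for what "astronomically large" means here, the following is a rough back-of-envelope sketch of the kind of estimate usually associated with Levinthal's paradox. The numbers are illustrative assumptions, not measurements: three conformational states per amino acid, a chain of 100 amino acids, and a molecule that tries out 10^13 arrangements per second. The function name and parameters are hypothetical.

```python
# Back-of-envelope estimate of how long a random search over all chain
# arrangements would take. All default values are illustrative assumptions.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.4e10  # approximate


def exhaustive_search_years(n_residues, states_per_residue=3,
                            samples_per_second=1e13):
    """Return (number of arrangements, years needed to visit each one once)."""
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return conformations, seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    n_conf, years = exhaustive_search_years(100)
    print(f"arrangements to search: {n_conf:.2e}")
    print(f"years for a blind search: {years:.2e}")
    print(f"ages of the universe: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Under these assumptions a blind search would take vastly longer than the age of the universe, whereas real proteins fold in times from microseconds to minutes. The molecule cannot be finding its native state by trial and error alone.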
AlphaFold is a deep learning programme – it’s trained to find correlations between protein structure and sequence from large experimental datasets. It uses two datasets: one consists of 100,000+ proteins of known sequence whose structures have been experimentally determined. The other, a much larger dataset, compares the sequences of homologous proteins from different species, whose structures are likely to be similar. But the physical aspect of the protein folding problem – understanding the nature of the protein folding transition, and the pathway the molecule must take to arrive at a single structure – isn’t addressed by AlphaFold.
A good starting point for thinking about the physical protein folding problem is to recognise that one can divide up the 20 amino acids that proteins are made from into two rough categories – hydrophobic and hydrophilic. It’s easy to understand why a protein molecule in water would arrange itself in a globule with the hydrophobic groups in the middle, protected from contact with water by a layer of hydrophilic (often charged) groups. This would be like a single molecule version of a soap micelle.
But if this was all there was to it, there wouldn’t be a single native state – there are likely to be many possible structures with the hydrophobic groups in the middle and the hydrophilic groups on the outside. In a well-folded protein, the native state must be a single state with the lowest possible energy (free energy, to be accurate).
At a qualitative level, at least, a good understanding of the nature of the protein folding transition has been achieved through the use of computer simulations. There isn’t enough computer power to simulate a protein molecule of any size realistically, but one can make progress with highly simplified models. The key insight from this kind of work is that the property of foldability – the existence of a single native state, and of pathways to find that state from many starting points – is not guaranteed. Foldability is itself an evolved property.
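One widely used toy model of this kind is the HP lattice model, in which each amino acid is reduced to either H (hydrophobic) or P (polar, i.e. hydrophilic) and the chain lives on a square grid. The text above does not say which simplified models it has in mind, so the sketch below is only an illustration under that assumption. It enumerates every shape a short chain can take, scores each shape by counting H–H contacts between beads that are not neighbours along the chain, and reports the lowest energy and how many distinct shapes share it. The function names and the example sequences are hypothetical.

```python
# Minimal sketch of a 2D HP lattice model (an assumed example of a
# "highly simplified model"). Energy is -1 per H-H contact between beads
# that touch on the lattice but are not consecutive along the chain.

def conformations(n):
    """Yield self-avoiding walks of n beads on a square lattice.

    Rotations are removed by fixing the first step to the right; mirror
    images are removed by requiring the first vertical step to go up.
    """
    def extend(path, occupied, has_turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not has_turned and dy == -1:
                continue  # mirror image of a walk that turns up first
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue  # the chain cannot cross itself
            path.append(nxt)
            occupied.add(nxt)
            yield from extend(path, occupied, has_turned or dy != 0)
            path.pop()
            occupied.remove(nxt)

    if n == 1:
        yield ((0, 0),)
        return
    yield from extend([(0, 0), (1, 0)], {(0, 0), (1, 0)}, False)


def energy(seq, coords):
    """Return minus the number of non-bonded H-H lattice contacts."""
    index_at = {c: i for i, c in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        # Look only right and up so that each touching pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                total -= 1
    return total


def ground_state(seq):
    """Return (lowest energy, number of distinct shapes with that energy)."""
    energies = [energy(seq, c) for c in conformations(len(seq))]
    lowest = min(energies)
    return lowest, energies.count(lowest)


if __name__ == "__main__":
    # Arbitrary example sequences, chosen only for illustration.
    for seq in ("HHHHHHHHHH", "HPHPPHHPHH", "PPPPPPPPPP"):
        lowest, degeneracy = ground_state(seq)
        print(seq, "lowest energy:", lowest, "shapes sharing it:", degeneracy)
```

A sequence whose lowest energy is shared by exactly one shape has, in this toy sense, a unique native state; a sequence whose lowest energy is shared by many shapes does not. Exhaustive enumeration like this is only practical for very short chains, and it says nothing about folding pathways, but it illustrates why a unique native state is a special property of particular sequences rather than something every sequence possesses.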
What about proteins that don’t fold, or fold wrongly?
It’s long been known that some proteins don’t have a folded state – one example familiar in everyday life is casein, the main protein in milk that is so important in cheese-making. But one of the surprises of the last couple of decades is the discovery that a high proportion of proteins are either entirely disordered, or contain long regions that are disordered. Intrinsically disordered proteins have no native structure to be determined by classical techniques like X-ray diffraction, and this perhaps is one of the reasons why their importance was neglected for so long.
These intrinsically disordered proteins, and proteins with large intrinsically disordered regions, are particularly prevalent in eukaryotes, where they clearly have important functional roles. Around 30% of all proteins in human cells are disordered, with another 20% containing substantial intrinsically disordered regions. The importance of intrinsically disordered proteins is a challenge to traditional ways of thinking about the ways proteins work. The metaphor that’s often been used is of a lock and key – the idea being that the well defined shape of a protein in its native state will have a cavity whose shape matches a molecule that binds to it. Molecular interactions involving disordered proteins must necessarily be more fluid and promiscuous than this; presumably this flexibility carries with it benefits, as well as creating considerable new complexity. But as of now, much remains unknown about how this might work.
The importance of protein misfolding has been understood for much longer. To give an everyday example, you can’t hatch a chicken from a hard-boiled egg. The major component of egg white is a protein called ovalbumin, which is present in egg white in a well-defined folded state. If one heats up an egg white, ovalbumin partially unfolds. But as everyone knows, when one cools the egg back down, one doesn’t recover the gloopy transparent liquid that one started with – the egg white sets as a soft solid. What’s happened to the ovalbumin in the egg white is that, instead of each molecule folding individually back to its native state, the proteins link up with each other, forming structures called beta-sheets, in which strands from different protein molecules line up in parallel, bound to each other by hydrogen bonds.
The formation of intermolecular beta-sheets is a very common way through which proteins misfold; there is a view that, when protein concentrations are high enough for the molecules to interact, these are the most stable states, more stable than the native state. The resulting structures are very robust and difficult to undo; they are, in fact, quite closely analogous to the crystal structure of the synthetic polymer nylon, a structure which makes nylon a very strong and tough engineering polymer. Sometimes this kind of misfolded protein forms a bit of a shapeless mess – as is the case with cooked egg white. But very often it takes a much more regular form, a fibre, in which parallel bundles of hydrogen bonded protein chains lie perpendicular to the axis of the fibre. These are known as amyloid fibrils, and are notorious for their role in many human diseases.

An amyloid fibril, derived from material taken from the brain of a patient with Alzheimer’s disease. Left: a space filling rendering of the molecule by David Goodsell, from the Protein Database Molecule of the Month, https://pdb101.rcsb.org/motm/189, CC-BY-4.0 license. Right: a schematic diagram of a section of the fibril, showing strands of different protein chains linked together through beta-sheets (broad arrows) perpendicular to the axis of the fibril. Image from the RCSB Protein Database (RCSB.org) of PDB ID 2M4J, https://www.rcsb.org/structure/2M4J. Data: J.X. Lu et al, (2013) Cell 154: 1257-1268
Diseases associated with protein misfolding include the transmissible prion diseases bovine spongiform encephalopathy and Creutzfeldt-Jakob disease, various types of amyloidosis, and, perhaps most significantly, neurodegenerative diseases like Parkinson’s disease and Alzheimer’s disease. It’s long been known that Alzheimer’s disease is associated with the formation of amyloid fibrils in the brain, but the mechanism through which misfolded proteins exert toxic effects is not yet known. The association of Alzheimer’s with amyloid fibrils has motivated a large number of drug candidates for the disease; the depressing (and expensive) failure of all these candidates to date suggests that we still have lots to learn about the mechanisms underlying the disease.
To summarise, there are at least four problems of protein folding. The first, the prediction of 3d structure from 1d sequence, has been definitively solved by AlphaFold.
For the second, on the nature of the transition between unfolded and folded states, we have some key concepts in place from computer simulation of coarse-grained models, such as the importance of smooth folding pathways, and the idea that foldability is itself an evolved property of proteins.
The third problem has emerged more recently – it is motivated by the discovery that many proteins – especially in more complex organisms – don’t fold at all, or have significant regions that are intrinsically disordered. We don’t really know what functions this intrinsic disorder enables, or how those functions are carried out.
The fourth problem is more long-standing – and in some ways we know less now than we thought we did twenty years ago. It concerns the causes and consequences of proteins that don’t fold correctly – and in particular the structures that involve multiple protein molecules binding together, typically in the form of fibrils. We know these are associated with a number of serious, often incurable, diseases – but we are still uncertain about the mechanisms at play, and we don’t know how to cure them.
There remain many open problems in connection with protein folding; AI, having solved the problem of predicting structure from sequence, will no doubt contribute to the solution of these other problems. But there is much new biology – and new physics – that needs to be understood, as well as a continuing need to generate the data that AI needs to operate on.

