Challenges and opportunities in protein function prediction for drug discovery

Adam Sanford

A conversation with Iddo Friedberg, Ph.D., Associate Professor, Iowa State University

Protein function prediction has seen significant advancements over the last decade, coinciding with the rise of new machine learning and artificial intelligence computational methods. In this article, we speak with Iddo Friedberg, an expert in protein function prediction and the co-organizer of the Critical Assessment of Function Annotation (CAFA) experiment, about how accurately models are predicting protein function and where he believes the field is headed next.

CAS: How did you initially become interested in protein function prediction?

Iddo: It was during my postdoc while I was doing structural bioinformatics. I was trying to find a different way to represent protein structures using fragments or a fragment library, and I was starting to relate structure to function. I started asking, “How do sequence and structure relate to function? How can we better predict protein function from sequence and structure?”

Various ways of predicting protein function were being developed at that time, and I thought it would be nice to get to talk to more people about it, but I couldn’t find a conference dedicated to that. So, I went ahead and organized one myself. That was in 2005, a satellite conference at Intelligent Systems for Molecular Biology (ISMB) in Detroit, and then we organized our own standalone conference at the University of California, San Diego (UCSD) in 2006. We've been organizing this conference, the automated computational protein function prediction meeting, ever since as part of ISMB.

CAS: How has the field of automated protein function prediction evolved since you started?

Iddo: We are definitely seeing improvement over the years. In 2010, Predrag Radivojac, now at Northeastern University, Sean D. Mooney from the University of Washington, and I started the CAFA competition to predict a protein’s function from its sequence or structure. The general idea was to associate a protein sequence with a set of ontology terms that will describe its function and to look for ways to improve the computational annotation of a protein function. By having varied approaches and prediction methods within CAFA submissions, the competition also serves as an experiment to assess the methods of protein function prediction on a large scale.

In modeling, we use a metric called the Fmax score (the maximum harmonic mean of precision and recall along a precision-recall curve) to describe how accurately a model is predicting on a scale from 0 to 1, where 1 is ideal. With CAFA 1, we saw an Fmax score of ~0.5, maybe a little bit above, and now, with CAFA 4, it’s at ~0.7. It depends on the ontology aspect: one of the aspects is ~0.7, and another is ~0.4 to 0.5, but it’s getting better.
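To make the Fmax definition concrete, here is a minimal Python sketch under stated assumptions. It sweeps a score threshold, averages precision over proteins that have at least one prediction at that threshold, averages recall over all benchmark proteins, and keeps the best harmonic mean. The function name `fmax`, the dictionary layout, and the placeholder term labels are illustrative assumptions, not CAFA's evaluation code, and the sketch skips steps a real evaluation would likely need, such as propagating terms to their ontology ancestors.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Best harmonic mean of precision and recall over score thresholds.

    predictions: protein -> {term: confidence score in [0, 1]} (hypothetical layout)
    truth:       protein -> non-empty set of experimentally supported terms
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []      # only proteins with at least one prediction
        recall_sum = 0.0     # accumulated over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with placeholder term labels (not real ontology identifiers).
toy_predictions = {
    "P1": {"term_A": 0.9, "term_B": 0.4},
    "P2": {"term_A": 0.8},
}
toy_truth = {"P1": {"term_A"}, "P2": {"term_B"}}
print(round(fmax(toy_predictions, toy_truth), 3))  # prints 0.667
```

In the toy example, the best trade-off comes at a high threshold where only the confident, correct prediction for P1 survives: precision is 1.0, recall is 0.5, and their harmonic mean is about 0.667.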

With recent support from Kaggle, CAFA 5 has 1,675 teams competing to predict the function of a set of proteins. The final evaluation won’t be complete until about January 2024, but I can tell you that we're nearing a mean Fmax of ~0.65 for all three ontologies, which means that somebody is getting to ~0.8, I think, in the molecular function aspect of the Gene Ontology.

We’re also seeing an expansion of the type of expertise needed to make these predictions reliably—the people behind the algorithms. Our CAFA community now includes experts in bioinformatics, experimental biology, biocuration, and bio-ontologies. We’re all working together to improve how we manage the plethora of data we now have access to.

CAS: Are some approaches to protein function prediction working better than others?

Iddo: I have a feeling about what works well now from the literature, and there has been a bit of a switch. What worked in the past for predicting the molecular function aspect is homology transfer. The more sensitive your method for detecting remote homologs, the better it works, and if you could add structural information, even better, because structure reaches a lot farther across evolution than sequence does. These models were relatively straightforward in their application, but even then, behind the scenes, people were already starting to use more machine learning.
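As a rough illustration of homology transfer, the Python sketch below copies the annotations of the most similar annotated sequence to a query, provided the similarity clears a cutoff. Everything here is an assumption made for illustration: the toy sequences, the function names, the 0.5 cutoff, and especially the use of `difflib.SequenceMatcher` as a stand-in for a real alignment or remote homology search tool.

```python
from difflib import SequenceMatcher
from typing import Dict, Set, Tuple


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotations(query: str,
                         annotated: Dict[str, Tuple[str, Set[str]]],
                         min_similarity: float = 0.5) -> Set[str]:
    """Return the terms of the most similar annotated protein, if similar enough.

    annotated: name -> (sequence, set of function terms), a hypothetical layout.
    """
    best_score = 0.0
    best_terms: Set[str] = set()
    for _name, (sequence, terms) in annotated.items():
        score = similarity(query, sequence)
        if score > best_score:
            best_score, best_terms = score, terms
    # Below the cutoff we decline to transfer anything.
    return set(best_terms) if best_score >= min_similarity else set()


# Toy database with made-up sequences and free-text function labels.
toy_database = {
    "kinase_x": ("MKTAYIAKQR", {"kinase activity"}),
    "transporter_y": ("GGGSSSPPPL", {"transporter activity"}),
}
print(transfer_annotations("MKTAYIAKQK", toy_database))
```

The cutoff is where the sensitivity trade-off mentioned above shows up: a more sensitive similarity measure lets the same idea reach more distant relatives before the transfer becomes unreliable.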

By the time CAFA 3 and CAFA 4 rolled around, we had people representing proteins and protein sequences as embeddings to predict function. Embeddings are a multi-dimensional representation of the data that has changed the landscape. In classical predictions, proteins are represented as character strings, and the model looks for similarities between the strings (protein sequences). With embeddings, the prediction takes into account the surrounding context of each part of the protein using a Word2Vec-style model such as ProtVec. These kinds of representations are incredibly sensitive and have very high predictive power. We may find something else, but right now, I think people are really doing very well using embeddings.
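The idea of "surrounding context" can be sketched in a few lines of Python. The example below splits a sequence into overlapping k-mers that act as "words" and lists the neighbors of each word inside a window, which is the kind of (word, context) input a Word2Vec-style model learns from. The function names, the choice of k = 3, and the window size are illustrative assumptions; this is not the ProtVec implementation, and no embedding is actually trained here.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def context_pairs(words: List[str],
                  window: int = 2) -> List[Tuple[str, List[str]]]:
    """Pair each word with its neighbors within `window` positions."""
    pairs = []
    for i, word in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        pairs.append((word, left + right))
    return pairs


# Toy sequence fragment, made up for illustration.
words = kmers("MKTAY")
print(words)  # prints ['MKT', 'KTA', 'TAY']
for word, context in context_pairs(words, window=1):
    print(word, context)
```

A model trained on such pairs places k-mers that occur in similar contexts close together in the embedding space, which is what lets it capture relationships that plain string comparison can miss.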

CAS: As molecular function predictions are improving, how well are we able to predict biological processes and cellular components?

Iddo: In the initial CAFAs, we saw lower scores when trying to predict biological process—it’s lower than molecular function prediction scores across all CAFAs. Say that you know this protein is a kinase, and you know that, in a mouse, it participates in apoptosis. You might ask, “Does it do the same in humans?” A mouse and a human are probably close enough to say, “Yeah, they’re doing the same thing.” But what about humans and Drosophila? Or yeast? That's where homology transfer can fail because the more evolutionarily distant we get, the more differences we see.

We do see it improving. One of the things we tried was to ask the predictors to choose from a pool of keywords to describe their method of prediction. What we saw is that things like hidden Markov models and remote homology searches characterize the top methods for predicting molecular function. But for biological process, the top methods are text mining from papers and genomic context. So, they’re predicting in very different ways.

Cellular component prediction isn’t doing so well, and we're having to look into why that is. One thing is that the ontology structure itself, the cellular component ontology, is hard to work with. If we imagine the molecular function ontology graph as a tree, and the biological process ontology as a neat hierarchical directed acyclic graph, then the cellular component resembles a shrubbery. The structure is a bit different, and we don’t have a good way of describing it right now. It’s also less popular with function predictors. I think they often don't see predicting cellular component as having the same prestige as predicting molecular function or biological process.

CAS: What would you say are the main bottlenecks in the field?

Iddo: We definitely need more: more computing power, more ideas, and more proteins being studied. The whole idea of representing function as ontology is unlike, say, representing structure as marbles in 3D space. With the structure, we might argue about how many marbles we need and how precise they will be, but it's something tangible. Ontology is a term derived from philosophy, and it's almost a subjective representation of reality. It changes over time, both with how we understand nature and also with what research efforts are focused on.

In CAFA, we had a group that actually said, “Let's not try to predict protein function, let’s try to predict what the biocurators will do next.” They predicted what functions would be in the validation data from the biocurators in that round of CAFA—which functions they would have prioritized curating. They didn't have a top-10 method, but they didn't do too badly either. We can't predict something that hasn't yet been experimentally verified somehow and then entered into the data set by a curator, so what’s being curated is really important.

CAS: One of the main bottlenecks is not computational at all then, but what wet lab work is being done and published?

Iddo: It’s the experiments, but also the way they’re published—why not publish the results directly as ontologies? If you discover a new protein sequence, when you write the paper, you are required to deposit the sequence in a repository such as GenBank. Or if you discover a new protein structure, you will deposit the structure in the Protein Data Bank, and there are very strict rules on how to represent the sequence and structure when you deposit them. There should also be guidelines on how to represent your protein function as an ontology, and my thought is that researchers should know how to do that.

Biocurators, quite justifiably, say, “No, researchers should not do that. They have no idea how to use ontologies.” But biocuration is where the bottleneck is right now. There are two camps here: one is that only biocurators should be doing the job because they know how to do it, and the other is that everybody should do it, but that may create annotation errors. It is unclear to me how to resolve this, really. Some say that better training for researchers in the use of ontologies may help, but biocurators have a much better understanding of selecting appropriate ontology terms and the proper relations.

CAS: Do you think that recent advances in natural language processing models (NLPs) can be used for biocuration?

Iddo: I don't see that happening quite yet. I think we need more than a large language model to do that properly. The rules are complex, which is why we use humans to do it. Humans are very good at complex tasks, but I think it’s too complex for NLP models. Another problem is that sometimes ontologies are hard to associate, and you don't always have the correct ontology terms, especially for newly discovered functions, so you need to create new terms and correctly place them in the proper ontology.

Then you have the high-throughput bottleneck. High-throughput experiments generate a lot of data, but they generate whatever that particular assay is good for. This means that the data is usually low-information content. We looked into that in 2013, and we concluded that annotated protein databases are incredibly imbalanced toward results from high-throughput experiments.

For example, we looked at a series of papers that came out in the late 2000s, which were looking into embryonic development in C. elegans. There were whole genome analyses to look at what was involved in embryonic development, and very little other experimental data came out of C. elegans at the time. So, from this data alone, the model could conclude that all C. elegans does is embryonic development because that's all the data is saying. These kinds of biases that come from high-throughput experiments are a problem, and without human intervention, they can give the model a very biased worldview.

CAS: What do you think the big opportunities are in protein function prediction right now?

Iddo: I think that machine learning will definitely make a big difference, and we’ll also see different representations of protein sequences and structures. We still have a bottleneck in terms of our use of ontology, so some people are forgoing the use of it completely and just saying, “This is a homogeneous cluster. This is a function as far as I'm concerned.” The fact that we have AlphaFold and ESMFold to give us decently predicted structural information is going to help a lot.

We also need a deliberate effort from the funding agencies to generate more valid, high-quality, high-information data. We should have fewer consortium grants—which is kind of shooting myself in the foot, really—and let many small and medium-size grants bloom to get some postdocs and students in there to characterize specific functions and create more diversity in the science ecosystem.

CAS: If you could wave a magic wand and solve one problem in your field, what would you solve?

Iddo: The ability to automatically and accurately curate papers and put that information into a computational representation that is valid in the long term. But as I said earlier, there’s a good reason we rely on humans to do this work now. I'm not even necessarily thinking that the Gene Ontology, or any ontology, is the only valid representation in the long term, but I can't think of anything better at the moment. It's difficult because our representation of things using ontologies is a bottleneck, but then again, that's almost like saying the human language is a bottleneck. But we’re getting into philosophy there, and I’m not qualified to talk about that!

Iddo Friedberg is an Associate Professor at Iowa State University, College of Veterinary Medicine. With a Ph.D. in Bioinformatics, Iddo has carved his career in computational metagenomics and protein function prediction. His research aims to make use of the vast quantity of omics data available to predict protein and gene function, and he works to bring like-minded scientists together to discuss automated protein function prediction. As a co-organizer of the Critical Assessment of Function Annotation (CAFA), an experiment designed to provide a large-scale assessment of computational methods dedicated to predicting protein function, Iddo is at the cutting edge of the future of protein function prediction.
