Challenges and opportunities in sequence analysis during drug discovery


A conversation with Jefferson Parker, Ph.D., founder of NullSet Informatics Solutions

As computational biology within drug discovery continues to push forward, new challenges and opportunities arise constantly. Sequence analysis has long been a key aspect of bioinformatics. In this article, we speak with Jefferson Parker, Ph.D., founder of NullSet Informatics Solutions and an expert in data analytics for life sciences, about the new frontiers in sequence analysis for drug discovery.

CAS: As a computational biologist, what roles have you played in drug discovery?

Jefferson: I've done everything from supporting drug discovery labs to translational research groups. Most recently, I worked with clinical teams on operations and development. I've supported pharmacovigilance teams with safety data analyses, worked in business development, and in competitive intelligence. I've even been an embedded member in software engineering teams as a bioinformatics analyst, a sort of translator between the wet lab scientists and the software developers.

CAS: Can you tell us about the role of sequence analysis in drug discovery? Why is it important?

Jefferson: In discovery, you can use sequence analysis for target discovery screening. Looking at what happens at the transcriptomic level, you can map back onto whatever pathway knowledge you have and ask, “What are the most likely upstream causes?” Those are your likely drug targets or something proximal to your drug target in a signaling sense. If your drug was designed for a particular target, hopefully it will help you confirm that you're hitting it.

You can also start looking for patient selection biomarkers. Is there a particular genetic profile, either at the expression level or the DNA sequence level, with various mutations? Are there certain mutations present in patients that make your drug more or less likely to be effective for them? That is an extremely exciting and active area in the pharmaceutical industry. Being able to know before treatment that yes, this drug should work, or no, it probably won't, can really make the difference between success and failure. Additionally, you're not wasting patients' precious time. Especially in a field like oncology, where patients don't have much time, you're not wasting their time with trial and error and multiple lines of therapy with drugs that just won't work.

And so all of that involves sequence analysis. It touches on every aspect of the drug development process.

CAS: Where do you think the biggest challenges are within sequence analysis?

Jefferson: I would say the challenges are actually dropping pretty darn fast because the technology is advancing every year. It used to be that you had these little tiny short reads, and the assembly was such a challenge. Now we’re getting longer reads, and the assembly is still a challenge, but it's less so.

Imagine you drop your copy of War and Peace in a shredder. You have page fragments that are a couple of millimeters by a couple of centimeters, so it's very hard to reassemble that back into a book. But if the fragments of the pages were much larger, longer, and you have a chunk of a paragraph as opposed to a fragment of a word, it's much easier to reassemble the story in the correct order. That's pretty much what is happening now with the transition to longer and longer reads.
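The shredder analogy can be made concrete with a toy sketch. This is an illustration only, not a real assembler and not a method described in the interview: the sample sentence, the function name `ambiguous_joins`, and the idea of counting overlap contexts that have more than one possible continuation are all assumptions chosen for clarity. The sketch treats every window of `read_len` characters as a "read" and counts how many overlaps could be extended in more than one way.

```python
# Toy illustration (hypothetical helper, not a real assembly algorithm):
# longer reads leave fewer ambiguous ways to stitch fragments together.

TEXT = "the cat sat on the mat and the cat ate the rat"


def ambiguous_joins(text: str, read_len: int) -> int:
    """Count overlap contexts that can be continued in more than one way.

    Every window of `read_len` characters is treated as one read. Two reads
    "compete" when they share the same first read_len - 1 characters but
    end in a different character.
    """
    reads = {text[i:i + read_len] for i in range(len(text) - read_len + 1)}
    successors: dict[str, set[str]] = {}
    for read in reads:
        prefix, last_char = read[:-1], read[-1]
        successors.setdefault(prefix, set()).add(last_char)
    # A prefix with two or more possible next characters is ambiguous.
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


if __name__ == "__main__":
    for read_len in (3, 5, 10):
        print(read_len, ambiguous_joins(TEXT, read_len))
```

With three-character reads, the repeated phrases in the sentence create several competing continuations. With ten-character reads, each read spans more than the longest repeat ("the cat "), so no ambiguity remains in this toy case. Real genomes contain far longer repeats, which is why assembly remains a challenge even with long reads.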

Storage is still an issue. Even in my most recent role, we had to move sequence data, and the fastest way to do that was to load it onto a hard drive and drop it in a FedEx box. Instead of moving hundreds of gigabytes or terabytes of data over the internet, it's faster to ship it in a box. Local storage is not the problem; the problem is transmitting a large amount of data from one place to another. These days, once you have the data where it needs to be, you can find enough computing power to run your sequencing project, but getting the data to the machines is still the bottleneck.
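A rough back-of-the-envelope calculation shows why shipping a drive can win. The figures below are assumptions for illustration (a 10 TB data set, sustained link speeds of 100 Mbit/s and 1 Gbit/s, decimal units); they are not numbers from the interview, and the helper `transfer_hours` is hypothetical.

```python
# Back-of-the-envelope sketch with assumed numbers (not from the interview).

def transfer_hours(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Hours needed to move the data at a sustained link speed.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Mbit = 1e6 bits.
    Ignores protocol overhead, retries, and shared bandwidth.
    """
    total_bits = size_terabytes * 1e12 * 8
    seconds = total_bits / (link_megabits_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for mbps in (100, 1000):
        hours = transfer_hours(10, mbps)
        print(f"10 TB at {mbps} Mbit/s: about {hours:.0f} hours")
```

Under these assumptions, 10 TB takes roughly 222 hours (over nine days) at 100 Mbit/s and about 22 hours at a full, uninterrupted 1 Gbit/s. An overnight courier is competitive with the faster link and far ahead of the slower one, and the gap widens as the data set grows.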

Patient-derived samples are a challenge, too. Collecting them is painful, biopsies are invasive, and sick people don't want to have to give multiple samples. Once samples are taken, they're generally formalin-fixed and paraffin-embedded, so any nucleic acid material is going to be degraded to some degree. There are ways to try to extract and use that sort of prepared sample tissue for sequencing, but the quality of the sequence is always going to take a hit.

If you're a small company, the technology is also expensive; these machines cost a lot of money. Similarly, computational biologists are becoming much more common, but they're still not everywhere, and everybody wants to work for the biggest and the best and make the most income. The labor pool is growing, but it’s still limited.

In a way, nobody wants to work on sequence analysis anymore. They all want to make the next great learning model. The focus is no longer on number crunching and data analysis; it's on advanced AI and ML. Everybody wants to be working on the new, hot, shiny technology, and that's not sequence analysis. So that is going to be a challenge soon.

CAS: Do you really need a computational biologist to do sequence analysis now?

Jefferson: If you're doing cookie-cutter, well-established methodologies that are well-developed, validated, and documented, then no, you don't. You don't need someone who can carve a new wheel. There are plenty of off-the-shelf software solutions that can take input data from all the different sequencing machines. You load it in, drag and drop the icons for the pipeline that you want to run, and push go. You go and get coffee or lunch, or, depending on the size of the machine you're working on, you might go home and come back in the morning, and it's done. You don't need somebody like me to do that. Any tech-savvy research associate can run it.

If, on the other hand, you're working with a cutting-edge sequencing machine and trying to derive a new analysis methodology that has not been done before, then there is no off-the-shelf solution. There, you need someone who understands the biology, the input data, the output data, the mathematics, and whatever else. To put all of that together and integrate it into a new solution that does not yet exist, you need a “me” or someone who understands all those pieces.

CAS: You mentioned how people want to work with AI and ML now. Can these technologies be of help with sequence analysis?

Jefferson: With a well-curated data set, AI and ML can definitely help. I know for a fact there are organizations that are applying machine learning technologies to consume the literature and build out knowledge graphs, so that could definitely play a role. Could AI and machine learning help with just sequence assembly? Probably, but I don't know whether that would be overkill.

CAS: Speaking of AI, what are your thoughts on AlphaFold, which performs AI predictions of protein structures?

Jefferson: I think AlphaFold is an absolute game-changer. It gives you a much faster path to a structure, which can feed into things like computer-aided drug design much faster than you could ever do before. You no longer have to have an NMR or crystal structure to have a starting point. Is it as good as a crystal structure? Probably not. The real, measured thing is always going to be better than the simulation. But, in terms of time, you can have something available now. It's going to have an impact that we might not even be seeing yet. I feel like, with AlphaFold, the stone has been dropped into the pond, and it's made an impact, but these are only the first ripples that are forming.

CAS: What do you think is the cutting edge of AI and ML in drug discovery?

Jefferson: It's spatial, which is kind of the next generation of single cell. Multi-omics. Bringing in the DNA, the RNA, the proteins, the metabolomics, and integrating all of it. Even integrating it with cellular pathways and intercellular communication. It's not just the single cell anymore. It's the single cell and the cells next to it and the cells next to those; how are they interacting? That's where it's headed, where it already is.

CAS: Do you think we're going to be creating models of biological systems?

Jefferson: If you'd asked me that when I was in grad school, I would have said humanity does not have the mathematics that can describe a biological system. Biology is complex chemistry, chemistry is complex physics, and physics is complex math. Everything is based on that. Physics is a mathematically solvable problem; it just requires an immense amount of data, and chemistry is, to an extent, the same thing. But biology… I used to believe that we did not, and would not, have the capacity to mathematically model a biological system.

But now, that is probably the direction it will have to go in. Will it require a quantum computer? Maybe? It might be after my lifetime, but I will say now with some confidence that at some point, humanity will be able to have accurate, reliable computational simulations of living systems. And that statement kind of scares me. I know there's a lot of work being done in the digital twin space. Limited first-step scenarios, but digital twins are online and being used in clinical trials now. That's kind of the beginning of it.

CAS: What do you think is needed to drive these new frontiers forward, then? Do we need new algorithms or new frameworks? Or is it really just trying to make everything fit together?

Jefferson: It's all of it. We need new ways of thinking about the problem. It may be applying old algorithms in new ways, or designing and implementing new algorithms. For things like epigenomics and looking at DNA dynamics, or the non-coding RNA space, exome versus everything else, that's different from just sequence analysis. It's a different way of thinking about it. It's still the sequence, but it's not just the sequence. Those different ways of thinking about the issue will require different tools.

CAS: If you could wave a magic wand and solve one problem in sequence analysis and drug discovery, what would you solve? And what impact would that have?

Jefferson: I would make all the data well-annotated and available to everyone. All the proprietary data from companies, institutions, and universities… everywhere. On a well-annotated, well-documented, unified storage platform, freely usable by everyone. Because then there would be enough, and we could solve the big problems.


Jefferson began his research career at MIT, exploring xenobiotic metabolism in the Gram-positive soil bacterium Rhodococcus aetherovorans. He got into computing when faced with an overload of data while trying to annotate the genome to develop DNA microarrays, and he’s been working at the intersection of biology, computing, and mathematics ever since. His career has taken him through small pharma, large pharma, and consulting organizations, including Novartis and Thomson Reuters. Along the way, Jefferson earned a Graduate Certificate in Applied Statistics from Pennsylvania State University and a master’s degree in computer science from Boston University.

Now, Jefferson is forging a new path with his own bioinformatics consulting company, NullSet Informatics Solutions, providing data and analytics, data modeling, and technology project management services.