Challenges and opportunities in creating authority constructs for life sciences

A conversation with Nicole Stobart, Jeff Wilson, and Mark Schmidt from CAS

Authority constructs and identifiers have been used by CAS within the chemistry sector for almost 100 years as the cornerstone of our world-leading curation and indexing. Our life sciences team is now looking to create the same indexing tools for a new industry. In this article, we chat with Nicole Stobart, Senior Capabilities Manager in Life Sciences; Jeff Wilson, Ph.D., Senior Data Science Manager; and Mark Schmidt, Lead Data Steward, about how they are forging this new path using authority constructs.

CAS: How would you describe authority constructs?

Jeff: Authority constructs, from our perspective, are heavily focused on identity. Within any domain, whether it's proteins, chemical substances, or nucleic acids, you want to be able to uniquely identify entities. Historically, in our chemistry collection, we have CAS REGISTRY®, and that's our classic authority. Within CAS REGISTRY, we want to uniquely describe different chemicals and we want to know that each and every time that we encounter the same thing in our curation, it’s identified that way. A simple example would be acetaminophen. Whether you call it acetaminophen, Tylenol, or paracetamol, they are all identified as the same chemical.
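
As a rough illustration of the acetaminophen example, the idea of many names resolving to one identity can be sketched in a few lines of Python. This is a minimal sketch, not how CAS REGISTRY is implemented: the identifier `SUBSTANCE-0001` and the functions `build_name_index` and `resolve` are hypothetical names chosen for the example.

```python
# Toy authority: one identifier, many names.
# "SUBSTANCE-0001" is a made-up placeholder, not a real registry number.
SYNONYMS = {
    "SUBSTANCE-0001": ["acetaminophen", "Tylenol", "paracetamol"],
}


def build_name_index(synonyms):
    """Invert identifier -> names into a case-insensitive name -> identifier map."""
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            index[name.casefold()] = identifier
    return index


NAME_INDEX = build_name_index(SYNONYMS)


def resolve(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return NAME_INDEX.get(name.strip().casefold())


# Every known name leads to the same identity.
print(resolve("Tylenol") == resolve("paracetamol") == resolve("Acetaminophen"))
```

A name that is not in the index returns `None`, which is the point where a human curator would decide whether it is a new synonym or a new entity.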

CAS: Why are authority constructs so important for life scientists?

Nicole: Our current authority constructs work fine for small molecules, but what happens with proteins or enzymes? Yes, there are sequences that can be associated with them, but is a sequence with one amino acid change a different entity? We recognize that researchers are struggling with this. There are different companies and other organizations that have tried to organize it, but nobody has done a completely authoritative organization of these biologics. When we were trying to bring biological information to customers, we found that it was not possible without authoritatively organizing it first. It's not just about collecting heaps and heaps of data; it's also about authoritatively organizing it and making sure that the way that you've organized it is the way that the rest of the world would like to see it organized.

Mark: In the life sciences space, we're trying to identify all the important things and agree on what they are and what to call them—that’s really an authority. In our traditional business, if I've got a chemical substance, I can look at our collection of those and add new information about that substance. We're trying to enable that same thing with life sciences, things like proteins and enzymes, and to do that, we need to first agree on the identities of those.

Jeff: It depends on the researcher and where they are, but I think we can be a lot more consistent and a lot more reliable in the data we present if we can unambiguously describe things. At the simplest level, it goes back to what we've always done: organize data for search, so that when you come into our search tools, like CAS SciFinder, it doesn't matter whether you want to call it cancer or sarcoma; we know the relationship between those terms. We don't force you to search with all these different terms to find the data. We handle that in the background, and you can use just one of the terms to find everything.

Beyond just search and retrieval, once you get into more sophisticated implementations like knowledge graphs and discovering important relationships, it's difficult to build an accurate knowledge graph if your entities are scattered across nodes. If I have a protein that I'm interested in as a target, I want all the data for that protein to be stored at that single node and then for other related entities to be at individual nodes. Otherwise, you get this hopelessly complex knowledge graph where you have 15 nodes for this protein, 12 for that disease, and 7 for the same substance. You create this complex graph, and you can't tell that there are basically 3 entities there that are related in a predictable way, because these identities have not been defined upfront.
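
The scattered-node problem can be shown with a toy graph. The sketch below assumes a small hand-made lookup table (`AUTHORITY`) that maps each name to a placeholder identifier; the names, the identifiers, and the helper functions `nodes` and `canonicalize` are illustrative assumptions, not CAS data or tooling. Edges written with different names for the same protein, disease, and substance collapse to three nodes once every mention is replaced by its identity.

```python
# Hypothetical authority table: name (lowercased) -> placeholder identifier.
AUTHORITY = {
    "egfr": "PROTEIN-1",
    "erbb-1": "PROTEIN-1",
    "epidermal growth factor receptor": "PROTEIN-1",
    "nsclc": "DISEASE-1",
    "non-small cell lung cancer": "DISEASE-1",
    "gefitinib": "SUBSTANCE-1",
    "iressa": "SUBSTANCE-1",
}

# Relationships as they might be extracted from different documents,
# each using whatever name the authors preferred.
RAW_EDGES = [
    ("gefitinib", "EGFR"),
    ("Iressa", "epidermal growth factor receptor"),
    ("ErbB-1", "NSCLC"),
    ("EGFR", "non-small cell lung cancer"),
]


def nodes(edges):
    """Collect the distinct endpoints that appear in a list or set of edges."""
    return {node for edge in edges for node in edge}


def canonicalize(edges, authority):
    """Rewrite each edge using identifiers so that duplicate edges merge."""
    return {
        (authority[source.casefold()], authority[target.casefold()])
        for source, target in edges
    }


raw_nodes = nodes(RAW_EDGES)
merged_edges = canonicalize(RAW_EDGES, AUTHORITY)
merged_nodes = nodes(merged_edges)

print(len(raw_nodes), "nodes before,", len(merged_nodes), "nodes after")
```

In this toy case the raw edges produce seven nodes, and the canonical graph has three nodes joined by two edges, which is the predictable structure that is hidden when identities are not defined upfront.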

CAS: How would you define curation at CAS?

Mark: People tend to use curation to mean two different things. First, there's ingestion and aggregation of data, which we don’t usually refer to as curation here at CAS. Then there’s the kind of curation CAS has a long history with, where scientists look at original sources of information and apply intellectual effort to extract and improve that information and make it available in a more structured way.

Jeff: Most of our scientists are engaged in the curation that Mark described, where they extract key information from primary sources and then improve, standardize, and connect it to other data points, but then there's another layer beyond that. We have a separate group of people who are curating the authority collections themselves by looking at new information as it comes in and deciding: Is this something new? Is it not? How do we organize this? How are they related? What terminology do we use? We don't have the people who are extracting the information from the primary literature making those decisions; instead, we have people who curate the authority collection.

Mark: The key thing is that the authority is never done. You are always adding new things to it that are newly discovered or newly interesting, and you're also adding new synonyms to it to keep it useful.

CAS: What would you say is the advantage of authority constructs for curation?

Jeff: After you start to aggregate identities and terminology with authority constructs, identifying new instances becomes simpler. Once you've collected all the names for an entity in an authority construct, you can instantly recognize and match new mentions of it. Synonyms facilitate the curation.

Mark: The authority constructs help you organize and aggregate all the information around them. So, in the case of proteins, we might not call this protein exactly what you would, but if we know what you call it and what we call it, then we can compile all the information under the same identity—this allows us to bring all the information together.

CAS: Can AI be leveraged to help create authority constructs or curate with them?

Nicole: We need to make sure that we have authoritatively organized and identified whatever it is that we're talking about to enable AI curation. It's really difficult to rely on any sort of machine learning or machine-curated data if it has not all been manually organized or at least thought of and identified in an authoritative way.

Jeff: We have a general philosophy about AI and how it relates to our data in that we leverage the most advanced technology we can to enhance human capabilities. We have hundreds of scientists here who are doing curation work, and if you look at what they do today, it's a lot more sophisticated than what they did 30 years ago. Each time we adopt a new technology, we use that to enable humans to do more sophisticated work. We leverage technology in natural language processing and machine learning to help identify documents and insights, but all of that is then presented to humans, who make the final decision about what's important or what's relevant and how it gets organized.

Once we’ve created that structured data, we do the same thing with technology and AI solutions on the other end. We try to leverage the best technology to show you the things you've asked for and things that are related to it. We can use predictive technology to help you plan syntheses, we have AI engines that help predict what prior art would be, and you can do Boolean-type searches and try to find things. So we're taking our highly structured data, and layering AI technology on top of that—not using AI to replace people.

CAS: How is CAS going to approach defining and identifying biological entities?

Mark: So this is where we start to talk about identity semantics. It’s a really difficult and complicated problem in life sciences, and we are completely aware of that; we are dealing with it on a case-by-case basis every day. We might not get this perfect, but we're going to do our best. We're going to make reasonable, defensible choices, which are useful to the people in the life sciences space. We will not please all of them every time, but we're going to be predictable and clear about it, so people can use the system that we deliver reliably.

When we get to questions like, “Is a one nucleotide difference a new identity or not?”, that’s a choice to make. But if three different single nucleotide polymorphisms become one identity instead of three, we absolutely need to describe all three of those differences and attach them to the one identity that we present. So even when we decide that different things fall under one identity, all the information is still going to be discoverable, connected, and accessible.
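
One way to picture "one identity, every difference still described" is a record that keeps each variant and its supporting observations under a single identifier. The following is a minimal sketch under stated assumptions: the `Identity` class, the identifier `GENE-0001`, the variant labels, and the source notes are all hypothetical and only stand in for real curated data.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """A single identity that still records every variant filed under it."""

    identifier: str
    # Maps a variant label to the list of observations that describe it.
    variants: dict = field(default_factory=dict)

    def add_observation(self, variant, note):
        """Attach an observation to a variant, creating the variant if needed."""
        self.variants.setdefault(variant, []).append(note)

    def describes(self, variant):
        """Report whether this identity already covers the given variant."""
        return variant in self.variants


# Three made-up single nucleotide differences grouped under one identity.
gene = Identity("GENE-0001")
gene.add_observation("variant-A", "reported in source 1")
gene.add_observation("variant-B", "reported in source 2")
gene.add_observation("variant-C", "reported in source 3")
gene.add_observation("variant-A", "confirmed in source 4")

for label, notes in gene.variants.items():
    print(gene.identifier, label, notes)
```

Searching for any of the three variants leads to the same identity, and the notes for each variant remain separately retrievable.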

Initially, we will choose our identity semantics and define them, then as new information arrives, we will decide, “Is this a new identity, or does this add to an existing identity?” We want to utilize as much computing power as we can, but at all times, our ultimate source of truth for those decisions is going to be humans who know the subject area best. That’s how we are going to deliver a trustworthy service.

Jeff: You want to create a clear, consistent, and comprehensive rule set upfront. For people who haven't tried to define entities and aggregate information, it feels very arbitrary, but we find that when you're organizing information, you want to err on the side of being clear and consistent. You can find nuances of the science in specific cases, but implementing rules for everything causes things to get scattered, and makes things harder to find and collect. Instead, it’s better to create a rule that works for 99% of everything. In the chemistry space, we sometimes get correspondence from a scientist who says, “I see you described this thing this way in CAS REGISTRY, and you're wrong, I have data that says it’s better described a different way,” and our usual answer is that our content is organized around discoverability. And that’s the point: the curation needs to consistently lead you to the data you’re looking for, even if it misses some scientific nuances along the way.

CAS: Are you using structure or function to create these identities for life sciences?

Mark: For many biological entities, especially proteins, it’s more function than structure, but it’s often a combination of both. In chemistry, we can easily create authority constructs based entirely on structure, but that’s not the same for life sciences. We have several different authorities we have to solve for—proteins, antibodies, organisms, etc. Each of them will need to be approached in their own unique way. We have to move away from just structure-based identities as we move into life sciences in a way that we have not had to in the past.

CAS: How do you plan to tackle the longevity of defining by function when biological function changes over time and with new research?

Jeff: We always try to future-proof things, and I don’t think we can predict where life sciences will be in ten years, but, as we create consistent constructs, we’ll build some flexibility and extensibility into that. The key to this will be recognizing when to keep using the current construct and when a new branch of science emerges that requires a new construct. People won’t stop wanting to describe proteins, but there might be a subcategory of some kind that arises that needs new authority constructs to be created.

Mark: I think that if you do a solid job of getting reasonable identity semantics at the beginning, then you set yourself up for a future where it's most likely that scientists will subdivide one of your identities into a few more specific identities. If you think about genus-species naming of organisms, it worked for a really long time. The need for sub-species didn't completely invalidate the genus-species model. I think we can set constructs up where they might get more specific about identity than when we started out, but they won’t become completely outdated.

CAS: What makes CAS the right organization to aggregate these fragmented life science authority constructs?

Jeff: There is some self-sustaining nature once you become a worldwide authority on something. If you look at chemical substances, nobody second-guesses CAS REGISTRY. It's the authoritative place. There are some other substance collections, but we are positioned as an organization in a way that most other organizations are not. We are part of the American Chemical Society; our mission isn't just to be profitable, it is to facilitate science. We have the people, the expertise, and the space to tackle this life science data and organize it in the way we did with CAS REGISTRY. I think anything short of us being the authoritative space for proteins when this project is complete is us falling short. That's what we're going to do.

Mark: The desire for agreement on a meaningful set of identities within life sciences is pretty universal, and it’s a problem we’ve already solved in small molecule chemistry. Life scientists are looking for a clear set of identities and the relationships between them to be defined so they can organize data around that structure. We see ourselves wading into that complexity and bringing order to the chaos. When we get far enough with that, people will accept that we know all the proteins that matter to most people and talk about them in a way that most people can make use of. When they don’t see their protein in our collection, they’ll bring it to our attention, so we can add it rather than starting their own entirely separate database. That’s what we’re aiming for—not just having an authority construct collection but being an authority within the life sciences space.

CAS: How is what you’re doing different from what’s already been done?

Mark: I think, in a lot of scientific spaces, people solve a problem for themselves and their colleagues, locally, in a way that works for them. They create a database that has the identities of things they care about, described in a way that they understand, and the information about those. Meanwhile, someone in a slightly overlapping area who's doing some of the same work, and some different work, is choosing different ways to talk about those identities and assembling different information. Then we start to see projects unite some of those smaller spaces to get all those people on the same page and facilitate discovery across those boundaries. So that's already happening in life sciences, but there hasn't been an approach to bring a large amount of it together with one uniting set of identities.

The goal is to get to a place where you can come in with your protein target, with no awareness that it is a member of three or four biological pathways, but when you've found that identity within this set of information, you now see all those connections. Additionally, you might also learn that it's being used as a biomarker for a disease state or treatment outcome. Being able to aggregate all the information from different places around one identity, which you can then find and use, creates value that wasn’t previously accessible.

Jeff: There are a lot of small organizations out there that aggregate a piece of the puzzle. The Disease Ontology exists, UniProt tries to organize proteins, and NCBI has an organism database. Each of these is an authority construct in and of itself, but they're not connected anywhere in any kind of useful way; you have to go here and there. We're really looking for this unified set of data where you can come here and find all the data across life sciences connected and harmonized. In order to do that, we have to take UniProt’s view of the protein, Ensembl’s view of the protein, and the CAS REGISTRY view of the protein, and we have to harmonize that into a single view and a single set of identities. But from there, you would have access back out to all these smaller collections. It's greater than the sum of the parts once they're connected because it reveals things that you couldn't previously find.
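
The harmonization Jeff describes can be pictured as a crosswalk from each source's own accession to one shared identity, with the source accessions kept so that a user can still get back out to the original collections. This is only a sketch under stated assumptions: the accessions (`UP-001`, `ENS-001`, `CAS-001`), the identifier `PROTEIN-1`, the record fields, and the `harmonize` function are hypothetical, and a real crosswalk would be curated by people rather than typed in by hand.

```python
# Hypothetical crosswalk: (source, accession) -> shared identity.
CROSSWALK = {
    ("UniProt", "UP-001"): "PROTEIN-1",
    ("Ensembl", "ENS-001"): "PROTEIN-1",
    ("CAS REGISTRY", "CAS-001"): "PROTEIN-1",
}

# Each source describes the same protein in its own terms (made-up records).
RECORDS = [
    {"source": "UniProt", "accession": "UP-001", "name": "Protein X"},
    {"source": "Ensembl", "accession": "ENS-001", "name": "protein-x gene product"},
    {"source": "CAS REGISTRY", "accession": "CAS-001", "name": "X protein"},
]


def harmonize(records, crosswalk):
    """Group source records under shared identities, keeping links to each source."""
    unified = {}
    for record in records:
        identity = crosswalk[(record["source"], record["accession"])]
        entry = unified.setdefault(identity, {"names": set(), "links": {}})
        entry["names"].add(record["name"])
        entry["links"][record["source"]] = record["accession"]
    return unified


unified = harmonize(RECORDS, CROSSWALK)
for identity, entry in unified.items():
    print(identity, sorted(entry["names"]), entry["links"])
```

The three source records end up under one identity that carries all three names, and the `links` entry preserves the path back to each smaller collection.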

Nicole: We want to be the most comprehensive life science database in the world. We want to solve whatever issues and challenges that our customers are facing, and they are facing this need for authority constructs in life sciences.

CAS: Have you had any pushback from the other existing databases as you move into this space?

Mark: You can’t agree with everyone. At some point, you do have to exert some authority and make some choices. It's hard to disagree with something well-established and entrenched, and we will be trying not to do that along the way. We're not going to convince everyone to start calling everything something different than what they’re used to, but we want to build on that.

The principle is easy: we will align with existing databases. But in practice, it is definitely more difficult to make those choices. If two databases have different words for the same thing, we can either choose one or the other, or choose something new and try to improve on both. It sounds very difficult, I know, but we feel like the difficulty in getting there is exactly the value that we're going to add to the space and exactly the thing that we want to deliver to customers.

Nicole: We definitely don’t want to alienate the people who use existing databases and find value there. What we’d like to do is bring more value and add on to what already exists. We haven’t talked about how other curators think of it yet, so how they feel about it is to be determined!

Jeff: It’s actually the minority of scientists who are engaged in curation, so in some ways, they aren’t the people we need to be concerned about, since we are looking to help the majority. And what we hear from a lot of scientists is that they’re having trouble accessing the data they need.

Nicole: Absolutely! Recently I was talking with a scientist who has created a database of antibody information. He said he did this, not because he wanted to organize antibody data, but because he needed antibody data so that he could run his predictive models. I think this is the situation for a lot of scientists: it’s an impediment in the way of research, and that’s what we plan to fix.

CAS: If you could wave a magic wand and fix one thing about authority constructs, what would it be, and what impact would it have?

Mark: The problem I would solve would be for people to use the same words to mean the same thing. If we just got there, this would be much easier. A huge part of this exercise is taking the words scientists use to describe something, finding the right identity, and then connecting the words to that identity. If we could narrow the language and get everyone to agree on it, we could skip that.

Jeff: For me, it would be having more people understand authority constructs and have the vision and passion to try to create them in useful ways. Even within an organization that supports it, I still spend a lot of my time explaining to people why they're needed, and what the value is. While it’s rewarding to advocate for this, it’s a bit distracting from my favorite part, which is actually working with data and building things.