Five key steps to getting started in chemical and materials informatics
The volume of knowledge developed by scientists has exploded, with scientific publications doubling every nine years. Researchers now struggle to navigate the overwhelming complexity of available research. Machine learning (ML) and artificial intelligence (AI) promise to deliver greater innovation and deeper insight from the scientific literature, but how should organizations prepare to enter this new world?
I recently participated in a panel discussion to explore data-driven chemical and materials discovery. Co-presented by CAS and Citrine Informatics, the panel also featured Erik Sapper, assistant professor specializing in computational materials design at California Polytechnic State University, and Gustavo Guzman, data engineering manager at Citrine Informatics. We discussed how organizations can start to utilize ML to achieve tangible success in their R&D projects.
Combining intuition with data is far more powerful than intuition alone. So whether your organization is just getting started with ML or already ramping up an ML program, I hope the following recommendations will aid the success of your informatics journey.
Step 1: Distill the business goal
To fully realize the value of any informatics initiative, it is important to clearly define what your organization is attempting to achieve. For example, by what measure, and by how much, does your business need to be more accurate when predicting a particular outcome? Is there a time or cost saving required to stay competitive in a process or activity? If you define the business goal in quantifiable and precise terms, then you will be more likely to apply the proper tools, methods and data to achieve success.
Successful ML approaches still require significant human expertise and a solid data foundation. ML does not come up with new ideas on its own. Rather, it can help chemists and materials scientists do their jobs better and more efficiently by identifying underlying data patterns that would be difficult or impossible for a human to uncover. When aligned with key business drivers, ML can inform business-wide research decisions by validating or disproving ideas, by enabling scientists to automate their experimental design activities, or by determining when to continue or conclude a project in the laboratory.
Step 2: Focus data identification and collection on the What and Why
Beyond a clear understanding of the business problem, organizations that see the most gains from ML understand and prepare their data in advance, as well as build and evaluate their models against key performance metrics. In ML-powered prediction and discovery, first consider what you are trying to learn that is central to the question, whether it is a picture, a chemical structure or a polymer formulation. Next, consider why you are seeking this feature set, e.g., to predict molecule density or tensile strength of a formulation. To support learning for your particular model, it is important to have enough data so that your model can achieve the accuracy and results you seek.
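As a minimal sketch of evaluating a model against a key performance metric, the Python example below fits a simple line to training data and checks the error on held-out data against a target. The data values, the tensile-strength scenario and the 1.0 MPa target are all illustrative assumptions, not figures from the panel discussion.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and measured values."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))

# Hypothetical data: filler fraction (x) vs. tensile strength in MPa (y).
train_x = [0.0, 0.1, 0.2, 0.3, 0.4]
train_y = [30.0, 32.0, 34.0, 36.0, 38.0]
# Held-out measurements the model never saw during fitting.
test_x = [0.15, 0.35]
test_y = [33.5, 36.5]

TARGET_MAE_MPA = 1.0  # assumed business target, chosen for illustration

model = fit_line(train_x, train_y)
mae = mean_absolute_error(model, test_x, test_y)
meets_target = mae <= TARGET_MAE_MPA
print(f"held-out MAE = {mae:.2f} MPa, meets target: {meets_target}")
```

The point of the sketch is the order of operations: the target is written down before the model is built, and the model is judged only on data it was not trained on.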
Step 3: Prioritize data quality and variety
A model is only as good as the data powering it. With the right variety and volume of high-quality data, scientists can gain insight from their models to help increase the success of their projects in the lab. Scientific data brings particular challenges, and expertise is needed to account for its nuances and complexities. For example, data is frequently integrated from various sources with different representations and nomenclature. This causes errors when different molecules are incorrectly merged into a single record or the same molecule is disaggregated across multiple records. This challenge can be overcome by developing a unified data management initiative that encompasses planning, execution and tasks such as information normalization. Download the case study to learn how CAS helped a large chemical company develop a unified data management strategy.
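To make the normalization task concrete, here is a small Python sketch that groups records from different sources under one key. It assumes, purely for illustration, that every record carries a CAS Registry Number; the record fields and sample data are hypothetical. The check-digit rule weights the digits 1, 2, 3 and so on from right to left and takes the sum modulo 10. A real pipeline would likely also compare chemical structures, which this sketch does not attempt.

```python
import re
from collections import defaultdict

# 2 to 7 digits, then 2 digits, then a single check digit.
CAS_RN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas_rn(value):
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_RN_PATTERN.match(value)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... from right to left and sum them.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

def normalize_identifier(raw):
    """Strip whitespace and unify dash variants before validation."""
    return raw.strip().replace("\u2013", "-").replace("\u2212", "-")

def merge_records(records):
    """Group names under a validated identifier; set aside suspect records."""
    merged = defaultdict(set)
    rejected = []
    for rec in records:
        key = normalize_identifier(rec["cas_rn"])
        if is_valid_cas_rn(key):
            merged[key].add(rec["name"].strip().lower())
        else:
            rejected.append(rec)  # needs human review, not silent merging
    return dict(merged), rejected

# Hypothetical records from three sources with inconsistent formatting.
records = [
    {"source": "eln", "name": "Water", "cas_rn": "7732-18-5"},
    {"source": "vendor", "name": "water ", "cas_rn": " 7732\u201318\u20135"},
    {"source": "legacy", "name": "Ethanol", "cas_rn": "64-17-5"},
    {"source": "legacy", "name": "Ethanol (typo)", "cas_rn": "64-17-6"},
]
merged, rejected = merge_records(records)
print(merged)
print(len(rejected), "record(s) flagged for review")
```

A passing check digit only catches transcription errors. It does not prove that the identifier was assigned to the right substance, which is where domain expertise comes in.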
In addition to ensuring data quality, it is important to make sure you have the data variety necessary to achieve your business goal. Training a model based only on internal data can limit the output of the algorithm to a narrow domain and prevent generalizability of the model to new data. Adding more variety to a feature set gives a model an expanded view of the scientific landscape, and crucially, brings the ability to make more novel and valuable predictions. Download the case study to learn how CAS custom-curated data to expand training sets and generate innovative ML predictions.
Step 4: Rethink your approach to data management
ML requires new considerations regarding how data is gathered and stored. Simply improving the place where data is stored is insufficient. As Gustavo highlights in the panel discussion, the data storage needs for ML may have special requirements. Laboratories must ensure that their data is machine-readable, computable and stored with enough context to be interoperable. When defining a material, for example, sufficient context is needed to know if one material is comparable to another, or if the commercial source is relevant. Data modeling should capture the entire context; capturing semantic links between data points could be helpful for your end products.
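One way to picture "stored with enough context" is a record that keeps units, test method and commercial source alongside each value. The Python sketch below is an illustrative assumption, not a schema discussed by the panel: the class names, fields and sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Measurement:
    property_name: str
    value: float
    unit: str      # a bare number without a unit is not computable
    method: str    # test method used, needed to judge comparability

@dataclass(frozen=True)
class MaterialRecord:
    material_id: str
    name: str
    supplier: str      # commercial source, in case it turns out to matter
    lot_number: str
    measurements: tuple = ()

def comparable(a, b, property_name):
    """Treat two records as comparable only if unit and method both match."""
    def find(rec):
        return next(
            (m for m in rec.measurements if m.property_name == property_name),
            None,
        )
    ma, mb = find(a), find(b)
    if ma is None or mb is None:
        return False
    return (ma.unit, ma.method) == (mb.unit, mb.method)

# Hypothetical sample records.
rec_a = MaterialRecord("M-001", "Polymer blend A", "Supplier X", "LOT-17",
                       (Measurement("tensile_strength", 41.2, "MPa", "ASTM D638"),))
rec_b = MaterialRecord("M-002", "Polymer blend B", "Supplier Y", "LOT-03",
                       (Measurement("tensile_strength", 39.8, "MPa", "ASTM D638"),))
rec_c = MaterialRecord("M-003", "Polymer blend C", "Supplier X", "LOT-22",
                       (Measurement("tensile_strength", 5900.0, "psi", "ISO 527"),))

payload = json.dumps(asdict(rec_a), indent=2)  # machine-readable export
print(payload)
print(comparable(rec_a, rec_b, "tensile_strength"))
print(comparable(rec_a, rec_c, "tensile_strength"))
```

The comparison rule here is deliberately strict. A production system might convert units or map equivalent methods instead of rejecting the pair, but it can only do so if that context was captured in the first place.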
Given this need for copious, well-structured data, organizations considering ML in their workflows must review their information architecture and IT infrastructure, as these may currently be insufficient for modern informatics approaches. Lack of investment in proper data preparation, modeling and governance will place that burden on your data scientists, which will significantly slow your ML projects and waste data scientists’ time on data engineering tasks.
Step 5: Iterate as you go
Plan big but start small. Ensure that you remain focused on the business problem you set out to solve. Starting with overly complex learning goals that are not supported by your data can cause you to miss opportunities to refine and troubleshoot your projects. Content sets can be added and expanded as you improve your model, gain insight and scale up your ML capabilities. Iterative and sequential ML offer ways to reach experimental design goals faster. Additionally, as Gustavo highlights, remember to seek help from specialized and experienced partners when needed: CAS offers numerous curated, ML-suitable datasets, while Citrine Informatics offers an AI platform for materials and chemical informatics.
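The iterative idea can be sketched as a loop: predict from the data gathered so far, choose the next experiment, measure, and repeat. The Python example below is a toy illustration under stated assumptions. The `run_experiment` function stands in for a real lab measurement, and the inverse-distance predictor with a distance bonus is a simple heuristic chosen for brevity, not the method used by any particular platform.

```python
def run_experiment(x):
    """Stand-in for a lab measurement (hypothetical response, peak at 0.7)."""
    return 1.0 - (x - 0.7) ** 2

def predict(x, measured):
    """Inverse-distance-weighted estimate from the measurements so far."""
    weights = {xi: 1.0 / (abs(x - xi) + 1e-9) for xi in measured}
    total = sum(weights.values())
    return sum(w * measured[xi] for xi, w in weights.items()) / total

def next_candidate(candidates, measured, explore=0.5):
    """Pick the untested candidate with the best prediction plus a bonus
    for being far from anything already measured."""
    def score(x):
        nearest = min(abs(x - xi) for xi in measured)
        return predict(x, measured) + explore * nearest
    untested = [x for x in candidates if x not in measured]
    return max(untested, key=score)

# Candidate settings, e.g. an additive fraction from 0 to 1 in steps of 0.05.
candidates = [i / 20 for i in range(21)]

# Start small: two initial experiments at the extremes.
measured = {x: run_experiment(x) for x in (0.0, 1.0)}
history = [max(measured.values())]

for _ in range(6):  # six further rounds of choose, measure, update
    x_next = next_candidate(candidates, measured)
    measured[x_next] = run_experiment(x_next)
    history.append(max(measured.values()))

best_x = max(measured, key=measured.get)
print(f"best setting so far: {best_x}, after {len(measured)} experiments")
```

Each round adds one data point, so the model and the experimental plan improve together rather than waiting for a large dataset up front.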
Overall, organizations in the chemical and materials research space stand to benefit significantly from ML. Powerful predictions need high-quality, consistently formatted data of reliable integrity, in sufficient volume and with suitable variety. Research groups must proactively consider how to build and store datasets and effectively use available expertise to navigate the complexities of scientific data. Drawing on our experience and knowledge, Gustavo, Erik and I spoke at length about the promise of ML in the full discussion. Listen for insights to help you on your journey to informatics success.
CAS, a division of the American Chemical Society, partners with R&D organizations globally to provide actionable scientific insights that help them plan, innovate, protect their innovations, and predict how new markets and opportunities will evolve. Leverage our unparalleled content, specialized technology, and unmatched human expertise to customize solutions that will give your organization an information advantage.