The importance of data quality in AI applications

Artificial intelligence (AI) is driving innovation at a pace unimaginable a decade ago. AI touches nearly all industries and businesses, improving efficiency, promoting data-driven decision-making, and enhancing security. It is also revolutionizing many fields of scientific research, including chemistry and the life sciences, where advances in AI have led to new ways of discovering materials, accelerated drug discovery, and produced predictive models that can run millions of experiments before you ever step into the lab.

With AI impacting every aspect of our daily lives, it is easy to lose track of what makes AI so powerful and where the technology stumbles. One key factor is the importance of data quality for any AI application.

Defining data quality

Data quality can mean different things to different people, but generally, it consists of a combination of four main factors: accuracy, consistency, completeness, and relevance.

  1. Accuracy ensures that the data is free of errors and mistakes.
  2. Consistency means that the data is represented uniformly.
  3. Completeness makes certain that all the required data is present.
  4. Relevance validates that the data is applicable to the task at hand.

Datasets that adhere to these factors are vital for training an AI algorithm. However, it is also important to consider the consequences of failing to meet these requirements.
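To make the four factors concrete, here is a minimal sketch of the kinds of automated checks one might run on a tabular dataset before training. It uses the open-source pandas library; the column names, sentinel value, and thresholds are illustrative assumptions, not part of any CAS workflow.

```python
import pandas as pd

# Illustrative records; all column names and values are hypothetical.
df = pd.DataFrame({
    "compound_id": ["C1", "C2", "C2", "C4"],
    "melting_point_c": [121.5, -999.0, 87.2, None],  # -999 is a sentinel error code
    "solubility_g_l": [0.5, 1.2, 1.2, 3.4],
})

# Accuracy: flag physically implausible values (below absolute zero).
implausible = df[df["melting_point_c"] < -273.15]

# Consistency: flag records whose identifier appears more than once.
duplicated = df[df.duplicated(subset="compound_id", keep=False)]

# Completeness: count missing values per column.
missing = df.isna().sum()

# Relevance: keep only the columns the task at hand actually needs.
relevant = df[["compound_id", "solubility_g_l"]]

print(implausible, duplicated, missing, sep="\n\n")
```

Checks like these catch only the mechanical issues; as the sections below discuss, each factor also has consequences that no single script can guard against.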

Accuracy

Training models with inaccurate datasets will result in inaccurate models. Low accuracy in a dataset increases the likelihood of poor generalization. If a model does not generalize well, it will perform poorly on data not included in its training set. The ultimate result of poor accuracy is that the model will not be reliable, leading to trust and confidence issues that are hard to overcome after deployment.

Consistency

If your data is inconsistent, bringing new data into your training set will require more effort. Identifying missing fields, changes in units, duplications, and other inconsistencies is a time-consuming process. Most data scientists spend a significant share of their time improving the consistency of datasets, time that could be better spent on AI algorithm development if consistency were prioritized earlier in the data pipeline.
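As a small illustration of what that cleanup involves, the sketch below harmonizes a measurement column recorded in mixed units and drops duplicate rows; the field names and units are assumptions made for the example.

```python
import pandas as pd

# Hypothetical assay values recorded inconsistently: some in g/L, some in mg/L.
df = pd.DataFrame({
    "sample": ["A", "A", "B", "C"],
    "value": [2.0, 2.0, 500.0, 1.5],
    "unit": ["g/L", "g/L", "mg/L", "g/L"],
})

# Convert every measurement to a single unit (g/L) so values are comparable.
to_g_per_l = {"g/L": 1.0, "mg/L": 1e-3}
df["value_g_l"] = df["value"] * df["unit"].map(to_g_per_l)

# Remove exact duplicate measurements introduced by merging sources.
df = df.drop_duplicates(subset=["sample", "value_g_l"])

print(df[["sample", "value_g_l"]])
```

Automating steps like these early in the pipeline is what frees data scientists to spend their time on the models themselves.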

Completeness

Without access to a complete dataset of all the information relevant to a model's intended purpose, there will be gaps in the model's understanding. These gaps can manifest as bias in the model. Biased models are prone to making inappropriate predictions when encountering data that does not resemble their training set, and they raise many ethical and legal concerns, as they may reinforce harmful stereotypes or have a negative social impact on their users.

Relevance

Training a useful model may take longer if the dataset contains data that is irrelevant to the task at hand. Irrelevant data confuses the model during training by forcing it to account for inputs that have no bearing on its final decision. If a model needs more training time because of this irrelevant data, you can expect to spend more time, computational resources, and money before you have a useful model.
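One common way to screen out such data, sketched below with scikit-learn's mutual-information feature selection on synthetic data, is to measure how much signal each feature carries about the target before training begins; the dataset and the choice of selector here are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic dataset: 20 features, but only 5 actually drive the target.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=0.1, random_state=0)

# Keep the 5 features with the highest mutual information with the target,
# discarding columns that would only slow training and confuse the model.
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_relevant = selector.fit_transform(X, y)

print(X.shape, "->", X_relevant.shape)  # (500, 20) -> (500, 5)
```

Dropping the uninformative columns up front shortens training and removes a source of noise.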

CAS ensures data quality with a human touch

Data quality concerns exist everywhere, but CAS recognizes and prioritizes the critical nature of these factors related to scientific data. Chemistry and life science data can be highly varied depending on its source. There are several different ways to represent a molecular structure in chemistry. These include MDL Molfile (MOL), Structure-Data Format (SDF), Simplified Molecular Input Line Entry System (SMILES), and International Chemical Identifier (InChI). There are pros and cons to each of these molecular representations. CAS relies on technologies that map all molecular formats to the CAS REGISTRY®. Where technology alone cannot determine an exact match, human expertise is used to identify the correct structure. It is through this human curation that we maintain a high-quality database of both chemical structures and chemical reactions that are used in many different AI applications, both at CAS and with partners.
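To see how the same structure looks across these formats, the sketch below uses the open-source RDKit toolkit (not a CAS tool) to express one arbitrarily chosen molecule, aspirin, as SMILES, InChI, and an MDL Molfile block.

```python
from rdkit import Chem

# One molecule (aspirin), expressed in several of the formats named above.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

print(Chem.MolToSmiles(mol))    # canonical SMILES
print(Chem.MolToInchi(mol))     # InChI string
print(Chem.MolToMolBlock(mol))  # MDL Molfile (MOL) block; SDF is MOL plus data fields
```

Conversions like these handle the clear-cut cases; it is where an exact match cannot be determined automatically that the human curation described above comes in.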

Similarly, CAS has relied on human curation in our recent expansion into life sciences data. Our global team of scientists and technologists reviews life science data covering pharmacology, biomarkers, and signaling pathways, ensuring the data is both accurate and consistent, two key components of high-quality datasets. When our pharmacology dataset was used to train an existing model for predicting the activity of a molecule against a protein target, we saw a 56% reduction in the difference between experimental and predicted values compared to the baseline model, and a 23% reduction in standard deviation. Results like this demonstrate the importance of the data quality found in all CAS datasets.
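As an illustration of how improvements like these can be quantified (the arrays below are made up; only the metric definitions are standard), one can compare the errors of a baseline model and a retrained model against experimental values:

```python
import numpy as np

# Hypothetical experimental activities and predictions from two models.
experimental = np.array([5.1, 6.3, 7.0, 4.8, 6.9])
baseline_pred = np.array([4.0, 7.5, 5.9, 6.0, 5.6])
retrained_pred = np.array([4.8, 6.6, 6.7, 5.2, 6.5])

# Mean absolute difference between experimental and predicted values.
mae_base = np.mean(np.abs(experimental - baseline_pred))
mae_new = np.mean(np.abs(experimental - retrained_pred))
print(f"error reduction: {(1 - mae_new / mae_base):.0%}")

# Spread of the errors, analogous to the reported standard-deviation reduction.
std_base = np.std(experimental - baseline_pred)
std_new = np.std(experimental - retrained_pred)
print(f"std reduction: {(1 - std_new / std_base):.0%}")
```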

At CAS, we believe in the importance of data quality in AI applications. With human involvement, we generate highly accurate and consistent datasets. Over our more than 100-year history, we have developed deep content and knowledge management expertise, and our scientists rely on it to judge the completeness and relevance of a dataset for a given AI application, identify any gaps, and develop a plan to address them.