Case Study: Data quality impact on algorithmic prediction of biological activity
AI and machine learning are showing great promise for enhancing research productivity in many disciplines. However, implementing these technologies in R&D applications still poses many well-documented challenges, and there are many opportunities to build on the efforts to date for greater success. In fact, one Gartner analyst estimated that as many as 85% of AI-driven projects are not meeting their objectives. As I have written previously, I and many others believe that data quality is one of the primary gaps behind these low success rates.
Measuring the impact of data quality
The question the CAS team recently set out to answer is: can we quantify the impact of data quality on the success of algorithmic predictions? Spoiler alert: we can, and we recently published data from one of our initial experiments in this case study.
Download the case study now to see the details and full results of the experiment.
To build an effective test, our data scientists identified a recently published journal article that highlighted a machine learning classification challenge. The original objective of the study was to algorithmically predict the biological activity of almost 10,000 compounds against five different targets. In the original paper, the authors used extended-connectivity fingerprints (ECFPs), an established and commonly used molecular descriptor, to predict the activity of these compounds. The CAS team took one specific machine learning approach used in the paper, a support vector machine (SVM), and kept it fixed while replacing ECFPs with CAS's enhanced proprietary fingerprints and descriptors, to see how the change in input data alone would affect the accuracy of the results. The initial impact was significant, with an average increase in prediction accuracy of over 30% across targets.
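To make the design of such an experiment concrete, here is a minimal sketch of the idea: hold the classifier fixed, swap only the descriptor, and compare accuracy. It is an illustration under loud assumptions, not the method used in the case study. The compounds are made-up sets of integer feature IDs, the two featurizers (`detailed` and `folded`) are hypothetical stand-ins rather than ECFPs or CAS fingerprints, and a simple nearest-neighbor classifier on Tanimoto similarity stands in for the SVM so that the example needs only the Python standard library.

```python
from typing import Callable, Dict, FrozenSet, Sequence

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def loo_accuracy(fps: Sequence[Fingerprint], labels: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    Assumption: 1-NN is a stand-in for the SVM named in the article.
    Ties are broken in favor of the earliest compound in the list.
    """
    correct = 0
    for i, query in enumerate(fps):
        others = (j for j in range(len(fps)) if j != i)
        nearest = max(others, key=lambda j: tanimoto(query, fps[j]))
        correct += int(labels[nearest] == labels[i])
    return correct / len(fps)


def compare_descriptors(
    compounds: Sequence[FrozenSet[int]],
    labels: Sequence[int],
    featurizers: Dict[str, Callable[[FrozenSet[int]], Fingerprint]],
) -> Dict[str, float]:
    """Score every descriptor with the same classifier and the same data."""
    return {
        name: loo_accuracy([featurize(c) for c in compounds], labels)
        for name, featurize in featurizers.items()
    }


# Hypothetical toy data: each compound is a set of structural feature IDs.
compounds = [
    frozenset({1, 3, 5}), frozenset({1, 3, 7}), frozenset({3, 5, 7}),  # active
    frozenset({2, 4, 6}), frozenset({2, 4, 8}), frozenset({4, 6, 8}),  # inactive
]
labels = [1, 1, 1, 0, 0, 0]

# Hypothetical descriptors: one keeps every feature, one folds them into 3 bits.
featurizers = {
    "detailed": lambda features: frozenset(features),
    "folded": lambda features: frozenset(f % 3 for f in features),
}

results = compare_descriptors(compounds, labels, featurizers)
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

On this toy data the folded descriptor collapses most compounds onto the same bits, so the same classifier can no longer separate actives from inactives. The numbers themselves mean nothing beyond the illustration; the point is that only the input representation changed between the two runs.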
Combining technical excellence with high-quality data for increased success
I have spent nearly 20 years as a data scientist applying digital technologies, including machine learning, to the healthcare space at organizations including NIH, various consulting firms, and two startups, so I've fought data challenges in the trenches myself. The appeal of data science is the incredible gratification and pride that comes from building a well-trained algorithm that delivers predictions enabling better decisions, faster speed to market, and unique insights a human would have missed. However, the reality is that the technology is only a small part of the battle. Adjusting your data strategy can often have as much impact on the success of your projects, if not more.
The unique value of CAS's unparalleled content collection
For this reason, CAS's data collection was a key driver that attracted me when I joined the organization. I have a unique perspective on the value of data of this quality and breadth to impact AI and machine learning project success. CAS's data collection is clean, normalized, and enhanced with semantic connections that have been built by scientists for over 100 years. Covering chemical reactions and substances, biosequences, properties, keywords, and more extracted from global journals, patents, and many other sources daily, our collection checks off the 4 Vs that big data applications commonly require: volume, variety, veracity and velocity.
I get great satisfaction from partnering with research and technology leaders across R&D segments through CAS services engagements to help them overcome data challenges and drive project success. Interested in learning more about how we can help? Contact my team to discuss your specific project needs and challenges.
CAS, a division of the American Chemical Society, partners with R&D organizations globally to provide actionable scientific insights that help them plan, innovate, protect their innovations, and predict how new markets and opportunities will evolve. Leverage our unparalleled content, specialized technology, and unmatched human expertise to customize solutions that will give your organization an information advantage.