CASE STUDY
Custom ML dataset accelerates Selvita's organic synthesis workflow
Industry
CRO Pharmaceuticals
Solution
CAS Custom Services℠

KEY RESULT
600K+
curated reaction entries delivered
KEY RESULT
Days
to deliver vs. months of manual work
KEY RESULT
>1 wk
saved per reaction cluster vs. senior chemist
The Challenge
Disparate and inconsistent open-access data slows machine learning model development
As one of Europe's largest preclinical contract research organizations, Selvita continuously refines its organic synthesis workflows to deliver faster, more efficient solutions to its pharmaceutical partners. To fully harness the power of robotics and automation, the company's chemists sought to cluster compounds from similar synthetic reactions, enabling parallel processing and boosting production throughput. Performing reactions in parallel without compromising yields, however, requires precise control over conditions such as temperature, reagents, and pressure.
Recognizing the central role of Suzuki and amide coupling reactions in both their internal workflows and the broader pharmaceutical industry, Selvita's computational chemists developed a machine learning (ML) model to predict reaction yields and identify optimal conditions for grouping compounds. But building a reliable model meant building a robust training dataset first. Access to trustworthy, structured databases was limited, and the available open-access sources were unverified and inconsistent. Cleaning and validating that data manually would have taken months, diverting the team's focus away from the science it was meant to support.
The Solution
A custom machine learning training dataset built in days
Selvita turned to CAS Custom Services for a tailored solution. Working alongside Selvita's computational chemists, CAS solution consultants first defined the dataset requirements and determined the optimal format to align with the team's objectives. CAS data specialists then searched the human-curated CAS Content Collection™ to identify and extract the relevant information for each target reaction type, significantly reducing the data-cleaning burden that had threatened to consume months of effort.
To ensure the dataset met the quality standards required for reliable model training, CAS content consultants and expert chemists performed quality assurance on each entry using advanced automated tools. Within days, CAS delivered a verified dataset of 400,000 entries for Suzuki reactions and 200,000 for amide coupling reactions. Using these curated datasets, Selvita applied filters to categorize and rank common reagents, solvents, and catalysts, creating the clusters needed for parallel synthesis.
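The filter-and-rank step described above can be pictured with a minimal sketch. It is an illustration only, not Selvita's or CAS's actual method: the record fields, the example values, and the helper names `rank_values` and `cluster_by_conditions` are all hypothetical, and grouping by exact condition match is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the schema or contents of the delivered dataset.
reactions = [
    {"id": "r1", "type": "suzuki", "catalyst": "Pd(PPh3)4", "solvent": "dioxane", "reagent": "K2CO3"},
    {"id": "r2", "type": "suzuki", "catalyst": "Pd(PPh3)4", "solvent": "dioxane", "reagent": "K2CO3"},
    {"id": "r3", "type": "suzuki", "catalyst": "Pd(dppf)Cl2", "solvent": "DMF", "reagent": "Cs2CO3"},
    {"id": "r4", "type": "amide", "catalyst": None, "solvent": "DMF", "reagent": "HATU"},
]


def rank_values(records, field):
    """Return (value, count) pairs for one field, most common first."""
    counts = Counter(r[field] for r in records if r[field] is not None)
    return counts.most_common()


def cluster_by_conditions(records, fields=("catalyst", "solvent", "reagent")):
    """Group reaction ids that share identical values for the given fields."""
    clusters = defaultdict(list)
    for r in records:
        key = tuple(r[f] for f in fields)
        clusters[key].append(r["id"])
    return dict(clusters)


# Filter to one reaction type, then rank and cluster.
suzuki = [r for r in reactions if r["type"] == "suzuki"]
solvent_ranking = rank_values(suzuki, "solvent")
clusters = cluster_by_conditions(suzuki)
```

In this toy example, reactions that share a catalyst, solvent, and reagent fall into the same cluster and are candidates for running in parallel. A real workflow would presumably also account for temperature, pressure, and predicted yield.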
The model was first validated against reserved test sets. Selvita's team then went further, putting the model head-to-head with one of their own senior chemists on a set of reactions that had never been performed before. Within a few hours, the model identified optimal conditions that outperformed the chemist's selections, which had taken the chemist more than a week to produce. These real-world results confirmed the model's ability to reliably predict reaction conditions for efficient synthesis.
The Outcome
Faster drug discovery powered by better data
By partnering with CAS, Selvita avoided months of manual data work and the costly trial-and-error that comes with unreliable training sets. The team moved directly to model development and validation, streamlining essential organic synthesis workflows and strengthening its ability to offer faster, more efficient drug discovery solutions to its pharmaceutical partners. The project demonstrated the major impact of having the right data foundation from the start.
Schedule a free CAS Custom Services℠ consultation