CASE STUDY

Custom ML dataset accelerates Selvita's organic synthesis workflow

Industry

CRO Pharmaceuticals

Solution

CAS Custom Services℠


KEY RESULT

600K+

curated reaction entries delivered

KEY RESULT

Days

to deliver vs. months of manual work

KEY RESULT

>1 wk

saved per reaction cluster vs. senior chemist

The Challenge

Disparate and inconsistent open-access data slows machine learning model development

As one of Europe's largest preclinical contract research organizations, Selvita continuously refines its organic synthesis workflows to deliver faster, more efficient solutions to its pharmaceutical partners. To fully harness the power of robotics and automation, the company's chemists sought to cluster compounds from similar synthetic reactions, enabling parallel processing and boosting production throughput. Performing reactions in parallel without compromising yields, however, requires precise control over conditions such as temperature, reagents, and pressure.

Recognizing the central role of Suzuki and amide coupling reactions in both their internal workflows and the broader pharmaceutical industry, Selvita's computational chemists developed a machine learning (ML) model to predict reaction yields and identify optimal conditions for grouping compounds. But building a reliable model meant building a robust training dataset first. Access to trustworthy, structured databases was limited, and the available open-access sources were unverified and inconsistent. Cleaning and validating that data manually would have taken months, diverting the team's focus away from the science it was meant to support.

"We worked together regularly throughout the process; it was a collaboration that made the project a success."

— Fabrizio Ambrogi, Senior Machine Learning Specialist, Selvita

The Solution

A custom machine learning training dataset built in days

Selvita turned to CAS Custom Services for a tailored solution. Working alongside Selvita's computational chemists, CAS solution consultants first defined the dataset requirements and determined the optimal format to align with the team's objectives. CAS data specialists then searched the human-curated CAS Content Collection™ to identify and extract the relevant information for each target reaction type, significantly reducing the data-cleaning burden that had threatened to consume months of effort.

To ensure the dataset met the quality standards required for reliable model training, CAS content consultants and expert chemists performed quality assurance on each entry using advanced automated tools. Within days, CAS delivered a verified dataset of 400,000 entries for Suzuki reactions and 200,000 for amide coupling reactions. Using these curated datasets, Selvita applied filters to categorize and rank common reagents, solvents, and catalysts, creating the clusters needed for parallel synthesis.
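The filtering and ranking step described above can be pictured with a minimal Python sketch. It is an illustration only, not Selvita's or CAS's actual pipeline: the record fields (`id`, `type`, `catalyst`, `solvent`, `reagent`), the sample values, and the helper names `rank_values` and `cluster_by_conditions` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the schema of the dataset delivered to Selvita.
reactions = [
    {"id": "r1", "type": "suzuki", "catalyst": "Pd(PPh3)4",
     "solvent": "dioxane", "reagent": "K2CO3"},
    {"id": "r2", "type": "suzuki", "catalyst": "Pd(PPh3)4",
     "solvent": "dioxane", "reagent": "K2CO3"},
    {"id": "r3", "type": "suzuki", "catalyst": "Pd(dppf)Cl2",
     "solvent": "DMF", "reagent": "Cs2CO3"},
    {"id": "r4", "type": "amide", "catalyst": None,
     "solvent": "DMF", "reagent": "HATU"},
]


def rank_values(records, field):
    """Return (value, count) pairs for one field, most common first."""
    counts = Counter(r[field] for r in records if r[field] is not None)
    return counts.most_common()


def cluster_by_conditions(records, fields=("catalyst", "solvent", "reagent")):
    """Group reaction ids that share the same combination of conditions."""
    clusters = defaultdict(list)
    for r in records:
        key = tuple(r[f] for f in fields)
        clusters[key].append(r["id"])
    return dict(clusters)


# Filter to one reaction type, rank its catalysts, then form clusters.
suzuki = [r for r in reactions if r["type"] == "suzuki"]
top_catalysts = rank_values(suzuki, "catalyst")
clusters = cluster_by_conditions(suzuki)
```

In this toy example, reactions that share a catalyst, solvent, and reagent fall into the same cluster and could in principle be run side by side. A real workflow would also need to account for conditions such as temperature and pressure.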

The model was first validated against reserved test sets. Selvita's team then went further, putting the model head-to-head with one of their own senior chemists on a set of reactions that had never been performed before. The model identified optimal conditions within a few hours, and they outperformed the selections that the chemist had taken more than a week to produce. These real-world results confirmed the model's ability to reliably predict reaction conditions for efficient synthesis.

The Outcome

Faster drug discovery powered by better data

By partnering with CAS, Selvita avoided months of manual data work and the costly trial-and-error that comes with unreliable training sets. The team moved directly to model development and validation, streamlining essential organic synthesis workflows and strengthening its ability to offer faster, more efficient drug discovery solutions to its pharmaceutical partners. The project demonstrated the major impact of having the right data foundation from the start.
