As our previous article outlined, AI without complete, high-quality training data is like a car without an engine – it may look nice, but it's not going anywhere. AI transparency is equally important to our users, which is why we aim to inform them precisely when and where AI is used so they can make informed decisions.
A well-known risk of generative AI is "hallucination," a scenario where the model creates realistic-sounding output, sometimes complete with falsified citations, that is completely incorrect. These hallucinations cannot be tolerated in R&D, so steps must be taken to avoid them.
In this article, we'll discuss how CAS is uniquely positioned to bridge the gap between AI and scientific discovery. We'll cover how we've strategically leveraged AI, minimizing hallucinations and prioritizing quality training data, to provide our users with critical information to power their research.
Training Data
Precise data is a requirement in scientific R&D. The outcomes of an experiment or test can determine whether a drug makes it to market or if an entire product line must be reformulated.
Training data is a curated corpus that is used to train an AI model. The data may help the model identify patterns, learn tasks, or make predictions. Models are only as good as their data; as the saying goes, "garbage in, garbage out."
For over 100 years, CAS has been the authoritative source of chemistry information, with in-house scientists curating the world's most comprehensive collection of chemistry data. Now, we're expanding our curation expertise to develop AI models that will help drive scientific advancement. Our customers depend on us to offer reliable, consistent data. Making certain that the expert-curated training data we compile is similarly reliable helps ensure that model performance is consistent and aligned with our customers' expectations.
Real Answers, No Hallucinations
For many of us, the first thing that comes to mind when we think of AI is a chatbot generating answers that may be incorrect, depending on the quality of its sources. What if the answers have already been curated, verified, and organized by subject matter experts? In this case, AI can be used to explore and uncover insights in a more streamlined way that also creates confidence.
Transparency
Along our journey to reimagine scientific search and discovery, we've incorporated feedback loops from our customers to inform our planning, and one key feedback theme we've encountered has been trust. New innovations with AI are exciting. However, many AI tools produce outputs that are often non-reproducible, change abruptly with minor modifications to the inputs, and come from a "black box" with no explanation of what was done or how the results were generated. As a result, users lack trust in the scientific results these tools generate.
Maintaining user trust in our content and solutions is paramount. While we can take steps to minimize unreliability in how the large language model (LLM) interfaces between the user and our data, an LLM is inherently prone to variability. Hesitation in trying AI-enhanced features can be mitigated by being transparent about where AI is being used in the product and providing obvious alternate avenues for finding the information users are looking for.
SearchSense in CAS SciFinder
So where does CAS SciFinder fit in? Like the general population, scientific researchers will benefit from spending less time digging through vast amounts of information, a major inefficiency for the scientific community. To ease this burden, we've created a variety of AI-enabled search improvements in CAS SciFinder that we've dubbed SearchSense, streamlining information discovery while maintaining scientific integrity.
Over the last twelve months, CAS has been building the largest authoritative, comprehensive, and trustworthy scientific information search engine – developed for scientists, by scientists. SearchSense combines our best-in-class chemistry collection with AI to provide researchers with faster answers without jeopardizing accuracy.
Core Principles
When we began incorporating AI into CAS SciFinder, there were fundamental standards that we did not want to compromise on, so our solutions would need to adhere to these principles:
- Prioritize accuracy in the content delivery.
- Maintain speed of search response.
- Preserve customer data confidentiality.
How SearchSense Works
SearchSense is unique because it uses AI to interpret search intent rather than generate answers. By understanding the intent of a user's query, CAS SciFinder can point users to the relevant answers and supporting data that have already been carefully curated, adhering to the CAS standards of trustworthiness and correctness. Accuracy of intent interpretation is largely dependent on a robust, quality training corpus. Our search query training data set is based on a collection of over ten thousand data points created by CAS experts. It spans all the different scientific disciplines covered by our content. Maintaining performance across all of these areas is a daunting effort, but we discovered that adding more data, even a few hundred data points, showed a synergistic benefit to the overall accuracy.
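To make the distinction between interpreting intent and generating answers concrete, here is a minimal Python sketch. It is illustrative only and does not describe how SearchSense is implemented: the intent labels, the keyword rules standing in for a trained model, and the CURATED_RECORDS store are all hypothetical. What it shows is that the model's only job is to choose an intent, while every answer comes from a curated lookup.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical intent labels and trigger phrases. A real system would use a
# trained model here; simple keyword rules stand in for it in this sketch.
INTENT_KEYWORDS = {
    "substance_property": ("boiling point", "melting point", "solubility"),
    "reaction_search": ("synthesis", "prepare", "reaction"),
    "literature_search": ("papers", "articles", "review"),
}

# Hypothetical stand-in for expert-curated content.
# Answers are looked up from this store, never generated.
CURATED_RECORDS = {
    "substance_property": ["curated property record"],
    "reaction_search": ["curated reaction record"],
    "literature_search": ["curated reference record"],
    "keyword_search": ["plain keyword search results"],
}


@dataclass(frozen=True)
class Interpretation:
    intent: str
    matched_keyword: Optional[str]


def interpret_intent(query: str) -> Interpretation:
    """Classify the query into an intent without producing any answer text."""
    text = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        for keyword in keywords:
            if keyword in text:
                return Interpretation(intent, keyword)
    # No intent recognized: fall back to an ordinary keyword search.
    return Interpretation("keyword_search", None)


def search(query: str):
    """Route the query by intent, then return curated records for that intent."""
    interpretation = interpret_intent(query)
    return interpretation, CURATED_RECORDS[interpretation.intent]


if __name__ == "__main__":
    for q in ("What is the boiling point of ethanol?", "synthesis of aspirin", "graphene"):
        interpretation, records = search(q)
        print(q, "->", interpretation.intent, records)
```

A query that matches no intent falls back to a plain keyword search in this sketch, which mirrors the idea of always leaving users an alternate route to the information.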
Model Architecture and Performance
The training data was used to train an AI model to determine the search intent of the queries. Speed and accuracy depend on the size of the model used, and although models using more parameters are generally more accurate, research has shown that smaller models with specific use cases and quality training data can produce acceptable accuracy¹. We used a 7-billion-parameter model, which is considerably smaller than the largest models (over 1 trillion parameters). With high-quality training data, the smaller model still achieved acceptable accuracy, and its size allowed us to maintain solid search performance.
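A rough back-of-the-envelope calculation shows why parameter count matters for serving speed and cost. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts only the memory needed to hold the weights. The article does not state the precision or hardware actually used, so these figures are illustrative, and the function name is hypothetical.

```python
def weight_memory_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumes every parameter is stored at the same precision (2 bytes = 16-bit).
    Activations, caches, and runtime overhead are deliberately ignored.
    """
    return num_parameters * bytes_per_parameter / 1e9


small_model_gb = weight_memory_gb(7_000_000_000)        # 7 billion parameters
large_model_gb = weight_memory_gb(1_000_000_000_000)    # 1 trillion parameters

if __name__ == "__main__":
    print(f"7B model weights: {small_model_gb:.0f} GB")
    print(f"1T model weights: {large_model_gb:.0f} GB")
    print(f"Size ratio:       {large_model_gb / small_model_gb:.0f}x")
```

Under these assumptions the smaller model's weights need about 14 GB, against roughly 2,000 GB for a trillion-parameter model, which is one reason a compact model is easier to serve with low latency.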
CAS SciFinder uses a powerful search algorithm to find the most relevant results, but it is custom-built and doesn't use a standard query language. Due to the proprietary nature of our search architecture, we needed to train the AI model to produce output in a language our system could understand so that it could perform the corresponding search. This let us achieve the goal of marrying AI-enabled query interpretation with the proprietary search engine's complex results, bringing relevant data forward to users with less time and effort.
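The idea of having a model emit something a custom search engine can consume can be sketched as a small translation step. Everything below is an assumption made for illustration: the ParsedQuery structure, the allowed intents, and the query syntax are invented here and are not the actual CAS SciFinder query language, which is proprietary.

```python
from dataclasses import dataclass, field

# Hypothetical set of intents the (imaginary) engine accepts.
ALLOWED_INTENTS = {"substance_lookup", "reaction_search", "reference_search"}


@dataclass
class ParsedQuery:
    """Structured output a model might produce for one user query (hypothetical)."""
    intent: str
    terms: dict[str, str] = field(default_factory=dict)


def to_engine_query(parsed: ParsedQuery) -> str:
    """Render a ParsedQuery in an invented engine syntax.

    Model output is validated before it reaches the engine, so an
    unrecognized intent is rejected instead of being executed.
    """
    if parsed.intent not in ALLOWED_INTENTS:
        raise ValueError(f"unsupported intent: {parsed.intent!r}")
    clauses = []
    for key, value in sorted(parsed.terms.items()):
        escaped = value.replace('"', '\\"')  # keep quotes from breaking the clause
        clauses.append(f'{key.upper()}:"{escaped}"')
    return f"{parsed.intent.upper()}(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    parsed = ParsedQuery(
        intent="substance_lookup",
        terms={"substance": "ethanol", "property": "boiling point"},
    )
    print(to_engine_query(parsed))
```

Sorting the terms keeps the rendered query deterministic for the same interpretation, which is a small step toward the reproducibility users expect from search.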
Data Security and Privacy
We understand the sensitive nature of our customers' research, so data confidentiality is important to us. All the models we've developed are proprietary and hosted on premises, so that no data leaves our care. CAS also provides more information on its ethical use of artificial intelligence.
Conclusion
CAS SciFinder is committed to enabling the success of scientific research and discovery, while maintaining the trust our customers value. With the release of SearchSense in CAS SciFinder, we have taken steps to ensure that we are leveraging powerful new technologies in a more intelligent way. Our experts, our best-in-class content collection, and our technology work together to bring more benefits to our users, paving the way for further advancement in scientific research and development.