Myths about Artificial Intelligence and Machine Learning in Drug Formulation

Artificial intelligence (AI) and machine learning approaches have flooded many industries with the promise of groundbreaking advances. But can these technologies also meet the complex demands of drug formulation? In this blog post, we debunk common myths surrounding AI in drug formulation and show why physical simulations remain a suitable, state-of-the-art technique for formulating drugs.

Decoding Machine Learning in Drug Formulation

When we discuss machine learning in drug formulation, it’s essential to understand that it is more than assembling experimental data points. This simplistic view fails to capture the depth of neural networks that analyze complex interactions across multiple variables. Machine learning systems do not just process the visible data; they can reveal hidden dependencies that are invisible to the naked eye, enabling predictions far beyond the initial dataset.

Consider the standard AI approach: every AI model starts with a dataset that describes the property you aim to predict, for instance the solubility of an active pharmaceutical ingredient (API) in various solvents, together with factors such as temperature and chemical structure. A neural network then processes these data points, continually adjusting its parameters to identify patterns and predict solubility under different conditions. The algorithms combine molecular descriptors (functional groups, structures, interaction energies) with further parameters that might influence solubility and search for the underlying correlations between these descriptors and the resulting solubility. Several publications in this field showcase the potential of this approach. Although this process mimics human learning, where neurons form new connections based on experience, the results are only as good as the data available. For AI to succeed in this context, the data must be extensive, high-quality, and highly structured.
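To make this workflow concrete, here is a minimal sketch in Python. It is not a real neural network and not anyone's production model: it trains a single linear "neuron" by gradient descent on a synthetic dataset. The three descriptors, the rule that generates the data, and all numbers are invented for illustration only.

```python
import random


def make_synthetic_dataset(n_points, seed=0):
    """Create invented (descriptors, log-solubility) pairs.

    The descriptors and the linear rule below are assumptions made up for
    this sketch; they are not measured pharmaceutical data.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_points):
        # Hypothetical descriptors, each scaled to the range [-1, 1].
        temperature = rng.uniform(-1.0, 1.0)
        polarity = rng.uniform(-1.0, 1.0)
        h_bonding = rng.uniform(-1.0, 1.0)
        log_solubility = (0.8 * temperature - 0.5 * polarity
                          + 0.3 * h_bonding - 1.0 + rng.gauss(0.0, 0.05))
        dataset.append(([temperature, polarity, h_bonding], log_solubility))
    return dataset


def predict(weights, bias, descriptors):
    """Weighted sum of the descriptors plus a bias term."""
    return sum(w * x for w, x in zip(weights, descriptors)) + bias


def mean_squared_error(weights, bias, dataset):
    """Average squared deviation between prediction and target."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in dataset) / len(dataset)


def train(dataset, epochs=100, learning_rate=0.05):
    """Adjust weights step by step to reduce the error on each data point."""
    weights = [0.0] * len(dataset[0][0])
    bias = 0.0
    for _ in range(epochs):
        for descriptors, target in dataset:
            error = predict(weights, bias, descriptors) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, descriptors)]
            bias -= learning_rate * error
    return weights, bias


data = make_synthetic_dataset(200)
train_set, test_set = data[:150], data[150:]
weights, bias = train(train_set)
print("learned weights:", weights)
print("held-out mean squared error:", mean_squared_error(weights, bias, test_set))
```

The sketch recovers the invented rule only because the dataset is clean, structured, and comparatively large for a three-parameter model. Real formulation data rarely offers that luxury.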

Data: AI’s Achilles’ Heel in Drug Formulation

Advancements in AI have shown impressive results in generating coherent text using AI bots like ChatGPT. These models succeed because they are trained on vast amounts of text from the internet.

But how effectively do these AI models perform when it comes to drug formulation? This question is crucial for understanding their practical utility. Without a robust dataset, it’s nearly impossible for an AI model to predict complex properties like solubility accurately (and solubility predictions are still one of the simpler challenges in formulation development). This is because the complexity of pharmaceutical formulations, with their myriad interacting variables, requires a deep, nuanced understanding of pitfalls and challenges. When neural networks are applied to formulation problems, they frequently fail because the data is insufficient, incorrect, or unavailable. We commonly see attempts to tackle highly complex challenges involving multiple components and variables with only a two-digit number of data points. As a result, these neural networks cannot function as expected, resulting in reports of poor performance for such models (see Kabir et al. in their publication about the programming skills of ChatGPT: ‘Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose. Nonetheless, our user study participants still preferred ChatGPT answers 35% of the time due to their comprehensiveness and well-articulated language style. However, they also overlooked the misinformation in the ChatGPT answers 39% of the time. This implies the need to counter misinformation in ChatGPT answers to programming questions and raise awareness of the risks associated with seemingly correct answers’).

Effective models need tens to hundreds of thousands of high-quality, quantitatively correct data points. Only highly structured and precise data, reflecting the complexity and precision required in drug formulation, can uncover the underlying physical principles for accurate predictions. Qualitative statements such as ‘the API is highly soluble in acetone’ are far from sufficient for training any model. Without reasonable training data, the models will never find patterns or predict correct dependencies, least of all for phenomena that lie outside the training landscape. Often, data is limited due to the unavailability of sufficient material, impurities in early API batches, or the lack of a clear understanding of the solid-state landscape. When AI models are trained on such limited datasets, their performance suffers, leading to unreliable predictions that can misguide the formulation process. This is a significant barrier to the application of AI in this field.
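The problem of predicting outside the training landscape can be shown with a deliberately simple sketch. A straight line is fitted to five solubility values from a narrow temperature window and then asked to predict well outside that window. The exponential "ground truth" and its constants are invented for illustration, and the line fit is only a stand-in for any purely data-driven model trained on too few points.

```python
import math


def true_solubility(temperature_k):
    """Invented ground truth with an exponential temperature dependence."""
    return math.exp(10.0 - 4000.0 / temperature_k)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def relative_error(predicted, actual):
    """Absolute deviation relative to the true value."""
    return abs(predicted - actual) / actual


# Five training points between roughly 20 and 40 degrees Celsius.
train_temperatures = [293.15, 298.15, 303.15, 308.15, 313.15]
train_solubilities = [true_solubility(t) for t in train_temperatures]
slope, intercept = fit_line(train_temperatures, train_solubilities)

inside = 300.0     # within the training window
outside = 353.15   # far outside the training window
error_inside = relative_error(slope * inside + intercept, true_solubility(inside))
error_outside = relative_error(slope * outside + intercept, true_solubility(outside))

print("relative error inside training range: ", error_inside)
print("relative error outside training range:", error_outside)
```

Inside the training window the fit looks acceptable. Outside it, the error grows sharply because the model never saw the curvature that the underlying relationship has.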

When Physical Modeling Outperforms AI in Drug Formulation

Given these substantial challenges with AI, it is crucial to consider alternative approaches that circumvent these data issues. At amofor, we prioritize physical modeling as a predict-first approach. It’s a robust and experimentally validated solution. We have accelerated drug formulation in many client projects with physical simulations, even with only a few available data points.

Physical modeling utilizes thermodynamic principles. This approach is neither a mystical black box nor an incomprehensible network of abstract neurons: all underlying equations and assumptions are transparently communicated and critically discussed. It helps in choosing the right polymer and drug load, predicting shelf life, selecting appropriate solvents, and deciding on a manufacturing method (e.g., spray drying). Unlike AI, which depends heavily on the quantity and quality of training data, physical modeling can accurately predict how unknown amorphous solid dispersion (ASD) formulations will behave. This is because it is based on established physical laws that inherently capture these dependencies.

Physical simulations provide a basis for understanding complex molecular interactions in a way that AI, with its need for massive data inputs, currently cannot offer. Furthermore, physical modeling is about having the right data and understanding the underlying physics. Our experts ensure that each variable and measurement is used strategically to improve the model’s predictive power. We do not stop asking questions about the provided data points until we understand how reliable the data is (e.g., in the case of dynamic sorption data: ‘Are the provided points in a real equilibrium state or still on their way to equilibrium? Can we review the time-dependent data?’ or in the case of solubility data: ‘Did you make sure that the solid state did not undergo any changes?’). This approach is reliable, fast, and cost-effective, making it a practical choice for drug formulation. Most simulations, such as those using the PC-SAFT model with our software SOLCALC, require nothing more than a simple laptop, which underlines the accessibility and computational efficiency of the approach.
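As a rough illustration of what a physics-based prediction looks like, consider the textbook solid-liquid equilibrium relation ln(x * gamma) = -(dH_fus / R) * (1/T - 1/T_m), with the heat-capacity contribution neglected. The sketch below evaluates it in Python. It is not SOLCALC and not a PC-SAFT implementation: the activity coefficient gamma is simply passed in as a constant (1.0 means an ideal solution), whereas a thermodynamic model would compute it as a function of composition and temperature. The melting temperature and enthalpy of fusion used here are hypothetical values, not data for any real API.

```python
import math

GAS_CONSTANT = 8.314462618  # J/(mol*K)


def mole_fraction_solubility(temperature_k, melting_temperature_k,
                             enthalpy_of_fusion_j_per_mol,
                             activity_coefficient=1.0):
    """Mole-fraction solubility of a crystalline solute.

    Assumes the heat-capacity difference between solid and liquid is
    negligible and that the activity coefficient is a known constant
    (a simplification made for this sketch).
    """
    if not 0.0 < temperature_k <= melting_temperature_k:
        raise ValueError("temperature must be positive and not above the melting temperature")
    ln_ideal = -(enthalpy_of_fusion_j_per_mol / GAS_CONSTANT) * (
        1.0 / temperature_k - 1.0 / melting_temperature_k)
    return math.exp(ln_ideal) / activity_coefficient


# Hypothetical melting properties, chosen only for illustration.
MELTING_TEMPERATURE_K = 430.0
ENTHALPY_OF_FUSION_J_PER_MOL = 28000.0

for temperature_k in (298.15, 310.15, 323.15):
    x = mole_fraction_solubility(temperature_k, MELTING_TEMPERATURE_K,
                                 ENTHALPY_OF_FUSION_J_PER_MOL)
    print(temperature_k, "K ->", x)
```

Two melting properties are enough to obtain a temperature-dependent estimate, and every assumption is visible in the equation itself. This is the kind of transparency that a data-hungry correlation model cannot provide from a handful of points.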

Invitation to Connect With Us

AI certainly opens new possibilities in drug formulation, yet the pharmaceutical industry’s complex, multifaceted challenges require more than algorithms; they need practical, reliable solutions. Our physical modeling approach minimizes reliance on extensive data arrays and focuses on fundamental scientific principles.

We invite you to see our physical simulations in action. Provide a few solubility values in organic solvents, and let us show how we can enhance your formulation effectively.

Contact us today to learn how amofor’s physical modeling can enhance your drug formulation efforts. Let’s build a more reliable and effective future for pharmaceutical development together!