PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Accurate Solubility Prediction with Error Bars for Electrolytes: A Machine Learning Approach
Anton Schwaighofer, Timon Schroeter, Sebastian Mika, Antonius ter Laak, Detlev Suelzle and Nikolaus Heinrich
In: 2nd German Conference on Chemoinformatics / 20th CIC Workshop, 12-14 November 2006, Goslar, Germany.


Accurate in-silico models for predicting aqueous solubility are needed in drug design and discovery, and in many other areas of chemical research. First-principles modelling of solubility, however, would be overly complex, since too many physical factors with separate mechanisms are involved in the phase transition from solid to solvated molecules. We present a machine learning approach (a Gaussian Process model) that provides a statistical model of aqueous solubility based on measured data. The model was validated on the well-known set of 1311 compounds by Huuskonen and on an in-house dataset of 632 drug candidates at Schering. We compare our results with those of 14 scientific studies and 6 commercial tools. For 91% of the Huuskonen compounds, our predictions are correct within one order of magnitude, even though the respective compounds were not used in training the model. Existing commercial software achieves 79% correct predictions within one order of magnitude. On the 632 drug candidates (mostly electrolytes), 82% of our predictions are correct within one order of magnitude, compared to only 64% achieved by commercial software. Additional validations with new in-house measured data will be presented as well. In addition to accurate predictions, the proposed machine learning model also provides confidence estimates for each individual prediction.
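As an illustration of the accuracy figures quoted above, the following minimal sketch shows one way a "correct within one order of magnitude" percentage could be computed. It assumes that solubilities are expressed as log10 values and that a prediction counts as correct when it differs from the measurement by at most one log unit; this reading, the function names, and the toy numbers are all hypothetical and are not taken from the paper. A second helper shows how per-prediction error bars (for example, a predictive standard deviation) might be checked against measurements.

```python
def fraction_within_one_log_unit(measured_log_s, predicted_log_s, tolerance=1.0):
    """Fraction of predictions within `tolerance` log10 units of the measurement.

    Assumption: inputs are log10 solubility values; tolerance=1.0 corresponds
    to "within one order of magnitude".
    """
    if len(measured_log_s) != len(predicted_log_s):
        raise ValueError("measured and predicted lists must have equal length")
    if not measured_log_s:
        raise ValueError("at least one compound is required")
    hits = sum(
        1
        for m, p in zip(measured_log_s, predicted_log_s)
        if abs(m - p) <= tolerance
    )
    return hits / len(measured_log_s)


def fraction_within_error_bars(measured_log_s, predicted_log_s, predicted_std, n_std=1.0):
    """Fraction of measurements lying within n_std predicted standard deviations.

    Assumption: the model reports a standard deviation per prediction; this is
    a generic calibration check, not the validation procedure of the paper.
    """
    if not (len(measured_log_s) == len(predicted_log_s) == len(predicted_std)):
        raise ValueError("all input lists must have equal length")
    if not measured_log_s:
        raise ValueError("at least one compound is required")
    hits = sum(
        1
        for m, p, s in zip(measured_log_s, predicted_log_s, predicted_std)
        if abs(m - p) <= n_std * s
    )
    return hits / len(measured_log_s)


# Toy data (invented for illustration only).
measured = [-2.0, -3.5, -1.2, -4.8]
predicted = [-2.4, -2.0, -1.9, -4.1]
stds = [0.5, 1.0, 0.5, 1.0]

example_fraction = fraction_within_one_log_unit(measured, predicted)
example_coverage = fraction_within_error_bars(measured, predicted, stds)
```

With these invented values, three of the four predictions fall within one log unit of the measurement, and two of the four measurements fall inside the one-standard-deviation error bar.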

EPrint Type: Conference or Workshop Item (Oral)
Project Keyword: UNSPECIFIED
Subjects: Theory & Algorithms
ID Code: 2507
Deposited By: Anton Schwaighofer
Deposited On: 22 November 2006