Abstract: Getting the Most out of Molecular Fingerprints: Methods for Fast and Interpretable Prediction from Raw Molecular Substructures

Two-dimensional molecular fingerprints enjoy wide use as a chemical representation in machine learning (ML), largely due to their low computational cost in comparison to that of 3D featurizations. Deep learning methods have tended to produce stronger predictions but are prohibitively expensive for some problems. We will present a fast method for using raw, unhashed Morgan fingerprints to learn from all substructures in a training set and show that these features, combined with simple learners, can compete with 3D featurizations and deep learning models on various publicly available datasets. Fingerprints also present an opportunity for ML model explanation through the substructures they capture, but careful attention must be paid to the correlations among substructure counts, as the presence of a large fragment naturally implies the presence of its constituent sub-fragments. We show the consequent ramifications for the interpretability of ML models.
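As a rough illustration of the two ideas above, the following minimal sketch builds a count matrix over every substructure seen in a training set and then measures how strongly a parent fragment's counts track those of a sub-fragment. It is not the method presented in the talk. The function names (`build_vocabulary`, `to_count_matrix`, `pearson`) and the toy fingerprints are hypothetical, and each molecule is assumed to arrive as a dictionary mapping an unhashed substructure identifier to its count, such as a cheminformatics toolkit might produce from a sparse count Morgan fingerprint.

```python
from math import sqrt


def build_vocabulary(fingerprints):
    """Collect every substructure ID seen in the training fingerprints."""
    return sorted({sub_id for fp in fingerprints for sub_id in fp})


def to_count_matrix(fingerprints, vocabulary):
    """Return one row per molecule and one column per vocabulary entry.

    Substructures that are not in the vocabulary (unseen in training)
    are ignored; substructures missing from a molecule count as zero.
    """
    return [[fp.get(sub_id, 0) for sub_id in vocabulary] for fp in fingerprints]


def pearson(xs, ys):
    """Pearson correlation of two columns (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy, made-up fingerprints for illustration only. Readable names stand in
# for the integer identifiers a real unhashed fingerprint would carry.
train_fps = [
    {"phenol": 1, "benzene_ring": 1, "hydroxyl": 1},
    {"benzene_ring": 1},
    {"hydroxyl": 1},
    {"phenol": 2, "benzene_ring": 2, "hydroxyl": 2},
    {"methyl": 1},
]

vocabulary = build_vocabulary(train_fps)
X_train = to_count_matrix(train_fps, vocabulary)

# A molecule containing a substructure never seen in training ("nitro").
X_new = to_count_matrix([{"benzene_ring": 1, "nitro": 1}], vocabulary)

# Every phenol fragment contains a benzene ring, so the two count columns
# are strongly correlated by construction.
phenol_col = [row[vocabulary.index("phenol")] for row in X_train]
ring_col = [row[vocabulary.index("benzene_ring")] for row in X_train]
r_parent_child = pearson(phenol_col, ring_col)
print(vocabulary, round(r_parent_child, 3))
```

Because the parent fragment can never be present without its sub-fragment, a learner may spread importance between the two columns more or less arbitrarily, which is why per-substructure attributions need to be read with these correlations in mind.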