Compound preprocessing

Compounds are matched and searched in their desalted, standardized forms, using the parent molecule definitions from ChEMBL and PubChem. You can disable this in the global settings, though doing so is not normally recommended.

ChEMBL parent compounds were computed using the ChEMBL structure pipeline, which is open source and well documented. The PubChem process is less transparent. To keep your input consistent with these parent forms, you may want to pre-process your compounds using the ChEMBL structure pipeline or RDKit sanitization.

Depending on the charge, desalting may result in duplicated structures (e.g. two identical anions left after removing a single divalent cation such as Ca²⁺). You can deduplicate the structures by splitting the SMILES string on ".", converting each fragment to an InChIKey, and keeping one fragment per unique key. A deduplicate function in chemserve does this. A simple standardization method is just Chem.of(inchiorsmiles).desalt().deduplicate(), which will also perform RDKit sanitization.
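
The fragment deduplication described above can be sketched in plain Python. This is an illustration only, not the chemserve implementation: the function name deduplicate_fragments and its key parameter are hypothetical. By default, fragments are compared as literal strings, which only catches duplicates that are written identically. For real use you would pass a key function that maps each fragment to an InChIKey, for example with RDKit as shown in the comment.

```python
from typing import Callable


def deduplicate_fragments(
    smiles: str,
    key: Callable[[str], str] = lambda fragment: fragment,
) -> str:
    """Drop repeated fragments from a dot-separated SMILES string.

    `key` maps a fragment to the identifier used to detect duplicates.
    The default compares raw strings. Assuming RDKit is installed, an
    InChIKey-based key could look like:

        from rdkit import Chem
        key = lambda s: Chem.MolToInchiKey(Chem.MolFromSmiles(s))
    """
    seen = set()
    kept = []
    for fragment in smiles.split("."):
        if not fragment:
            continue  # skip empty pieces from stray dots
        identifier = key(fragment)
        if identifier not in seen:
            seen.add(identifier)
            kept.append(fragment)  # keep the first occurrence, in order
    return ".".join(kept)


# Two acetate anions left behind after a divalent counterion was removed.
print(deduplicate_fragments("CC(=O)[O-].CC(=O)[O-]"))  # prints CC(=O)[O-]
```

Passing the comparison in as a key keeps the sketch free of dependencies while still allowing InChIKey comparison, which also catches fragments that are the same structure written as different SMILES strings.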