Compound preprocessing
Compounds are searched in their desalted, standardized forms using the parent
molecule definitions from ChEMBL and PubChem. You can disable this in the global
settings, though doing so is not normally recommended.
ChEMBL parent compounds were computed using the ChEMBL structure pipeline,
which is open source and documented. The PubChem process is more opaque.
You may want to preprocess your compounds using the ChEMBL structure pipeline
or RDKit sanitization.
Depending on the charge, desalting may result in duplicated structures
(e.g. two anions after removing a single Na⁺). You can deduplicate the structures
by splitting the SMILES string on ".", removing repeated fragments, and then
converting back to InChIKeys. The deduplicate function in chemserve does this.
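The splitting step can be sketched with the standard library alone. This is a minimal illustration, not chemserve's implementation: the name deduplicate_fragments is hypothetical, and it assumes the input is already a canonical SMILES string, so that identical fragments are written identically and no ring-closure bond spans a "." separator. Converting the result to InChIKeys would still need a cheminformatics toolkit such as RDKit.

```python
def deduplicate_fragments(smiles: str) -> str:
    """Drop repeated dot-separated fragments, keeping first-seen order.

    Hypothetical helper for illustration only. Fragments are compared
    as plain strings, so the input is assumed to be canonical SMILES.
    """
    seen = set()
    unique = []
    for fragment in smiles.split("."):
        # Skip empty pieces and fragments that were already kept.
        if fragment and fragment not in seen:
            seen.add(fragment)
            unique.append(fragment)
    return ".".join(unique)


# Two acetate anions left behind after desalting collapse to one.
print(deduplicate_fragments("CC(=O)[O-].CC(=O)[O-]"))  # CC(=O)[O-]
```

Comparing fragments as strings only catches duplicates that are spelled the same way. Comparing per-fragment InChIKeys instead would also catch equivalent fragments written differently, at the cost of a toolkit dependency.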
A simple standardization method is Chem.of(inchiorsmiles).desalt().deduplicate(),
which also performs RDKit sanitization.