Complete walkthrough of the PreditX ML Virtual Screening Pipeline
6-Step Pipeline
This guide outlines each step of our pipeline, from data retrieval to external predictions. Whether you're a seasoned computational chemist or new to machine learning, you'll find straightforward instructions here.
Obtain bioactivity data (IC50, Ki, Kd, etc.) for your protein target from ChEMBL/UniProt.
_bioactivity_data_cleaned.xlsx (raw data, standardized to nM)
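As a rough illustration of the "standardized to nM" step, the sketch below converts values reported in different concentration units to nM. The record layout, the field names, and the set of unit spellings are assumptions made for this example, not the pipeline's actual schema.

```python
# Conversion factors from common concentration units to nanomolar (nM).
# The unit spellings handled here are an assumption; extend as needed.
TO_NM = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "µM": 1e3,
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value, unit):
    """Return `value` expressed in nM; raise ValueError for unknown units."""
    try:
        factor = TO_NM[unit.strip()]
    except KeyError:
        raise ValueError(f"Unsupported unit: {unit!r}") from None
    return float(value) * factor


# Hypothetical records, standing in for rows retrieved for a target.
records = [
    {"smiles": "CCO", "type": "IC50", "value": 2.5, "unit": "uM"},
    {"smiles": "c1ccccc1", "type": "Ki", "value": 40.0, "unit": "nM"},
]
for rec in records:
    rec["value_nM"] = to_nanomolar(rec["value"], rec["unit"])
```

Rows with units outside the table raise an error rather than being silently passed through, which makes unexpected units visible during cleaning.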
Convert each value from nM to pIC50, where pIC50 = -log10(nM × 1e-9), and categorize compounds as active, intermediate, or inactive.
_bioactivity_data_labeled.xlsx
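The conversion and labeling can be sketched as follows. The formula is the one given above; the two cutoffs (6.0 and 5.0) are hypothetical values chosen for illustration only, since the thresholds are not stated here.

```python
import math

# Hypothetical cutoffs in pIC50 units, for illustration only.
ACTIVE_MIN = 6.0    # pIC50 >= 6 corresponds to potency <= 1000 nM
INACTIVE_MAX = 5.0  # pIC50 <= 5 corresponds to potency >= 10000 nM


def pic50_from_nm(value_nm):
    """pIC50 = -log10(nM x 1e-9), i.e. -log10 of the molar concentration."""
    if value_nm <= 0:
        raise ValueError("Concentration must be positive")
    return -math.log10(value_nm * 1e-9)


def label_activity(pic50):
    """Assign one of three classes using the assumed cutoffs above."""
    if pic50 >= ACTIVE_MIN:
        return "active"
    if pic50 <= INACTIVE_MAX:
        return "inactive"
    return "intermediate"


# Example values in nM: potent, middling, and weak compounds.
for nm in (10.0, 3000.0, 50000.0):
    p = pic50_from_nm(nm)
    print(f"{nm:>8.0f} nM  pIC50={p:.2f}  {label_activity(p)}")
```

Because pIC50 is a negative logarithm, higher values mean more potent compounds, so the "active" cutoff is the larger of the two numbers.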
Split compounds so the test set has unique scaffolds not found in training. Generate PCA/MDS/Tanimoto plots to illustrate chemical diversity.
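One way to obtain a test set whose scaffolds never appear in training is to group compounds by scaffold and assign whole groups to one side. The sketch below assumes each compound already has a scaffold identifier (for example a Murcko scaffold SMILES computed beforehand with a cheminformatics toolkit); the function name, the 20% test fraction, and the seed are illustrative assumptions, not the pipeline's settings.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2, seed=0):
    """Split compound indices so that no scaffold occurs in both sets.

    `scaffolds` holds one scaffold identifier per compound.
    Returns (train_indices, test_indices).
    """
    # Group compound indices by their scaffold.
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)

    # Shuffle the scaffold groups reproducibly.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the test set group by group until it reaches the target size;
    # a large final group can overshoot the target slightly.
    target = test_fraction * len(scaffolds)
    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return sorted(train), sorted(test)


# Hypothetical scaffold identifiers for ten compounds.
scaffolds = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
train_idx, test_idx = scaffold_split(scaffolds)
```

Assigning whole groups keeps the split leak-free at the scaffold level, at the cost of the test set size being only approximately the requested fraction.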
_training_set.xlsx and _test_set.xlsx
_Tanimoto_Hist_Full.png, _PCA_full.png, _MDS_full.png
Generate Morgan fingerprints and Mordred descriptors, remove low-variance/correlated features, optionally reduce dimensionality via PCA.
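The removal of low-variance and correlated features can be sketched in plain Python as below. This is a minimal illustration under assumed thresholds (variance above 0.01, absolute Pearson correlation at most 0.95) and a greedy keep-first rule; the pipeline's actual cutoffs and selection order are not stated here.

```python
import math
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(columns, min_variance=0.01, max_abs_corr=0.95):
    """Return the names of the feature columns to keep.

    `columns` maps a feature name to its list of values (one per compound).
    Both thresholds are assumptions for this sketch.
    """
    # Step 1: drop near-constant features.
    varying = [n for n, v in columns.items() if pvariance(v) > min_variance]

    # Step 2: greedily keep a feature only if it is not highly correlated
    # with any feature that has already been kept.
    selected = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[s])) <= max_abs_corr
               for s in selected):
            selected.append(name)
    return selected


# Hypothetical descriptor table: "b" duplicates "a", "c" is constant.
features = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [5, 5, 5, 5],
    "d": [4, 1, 3, 2],
}
kept = select_features(features)
```

Filtering on variance first also guarantees that the correlation step never divides by a zero standard deviation.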
.npy arrays containing features, scaled data, and PCA components
.pkl files for the scaler and feature selector
Train multiple ML algorithms with hyperparameter tuning (GridSearchCV) and evaluate them by accuracy, F1, and ROC-AUC. Produce confusion matrices and ROC curves.
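To make the evaluation metrics concrete, the sketch below computes a confusion matrix, accuracy, and F1 from true and predicted labels in plain Python. It does not reproduce the training or the GridSearchCV tuning, and the macro averaging of F1 across classes is an assumption, since the averaging scheme is not stated here. The labels are made-up example data.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count occurrences of each (true_label, predicted_label) pair."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 = 2TP / (2TP + FP + FN) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels for six test compounds.
y_true = ["active", "active", "inactive", "inactive", "intermediate", "active"]
y_pred = ["active", "inactive", "inactive", "inactive", "intermediate", "active"]

matrix = confusion_counts(y_true, y_pred)
print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this example one active compound is predicted inactive, which lowers recall for the active class and precision for the inactive class while leaving the intermediate class untouched.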
A model_evaluation_results_* folder containing:
.pkl files for each trained model
Apply your trained models to new molecules. You can:
Provide a .smi or .txt file (one SMILES per line)
An Excel sheet listing each molecule, its predicted activity, optional probability/confidence scores, and #ActiveVotes (how many models voted Active).
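The #ActiveVotes column can be understood as a simple tally across models. The sketch below assumes each model's predictions are available as a list of labels in the same molecule order; the model names and labels are hypothetical.

```python
def count_active_votes(predictions, active_label="Active"):
    """Return, for each molecule, how many models predicted `active_label`.

    `predictions` maps a model name to its list of predicted labels,
    with every list in the same molecule order.
    """
    per_model = list(predictions.values())
    n_molecules = len(per_model[0])
    if any(len(p) != n_molecules for p in per_model):
        raise ValueError("All models must predict the same molecules")
    return [
        sum(preds[i] == active_label for preds in per_model)
        for i in range(n_molecules)
    ]


# Hypothetical predictions from three models for four molecules.
predictions = {
    "model_a": ["Active", "Inactive", "Active", "Inactive"],
    "model_b": ["Active", "Active", "Inactive", "Inactive"],
    "model_c": ["Active", "Inactive", "Inactive", "Inactive"],
}
votes = count_active_votes(predictions)
```

A molecule that every model labels Active gets the maximum vote count, which gives a quick consensus ranking alongside the per-model probabilities.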
Each line should contain a SMILES string; an optional molecule name can follow after a space or tab. No header line is required.
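A file in this format can be read with a few lines of Python. The sketch below follows the rules just described (SMILES first, optional name after a space or tab, no header); skipping blank lines and the fallback name `mol_<line number>` for unnamed molecules are assumptions of this example.

```python
def parse_smiles_lines(lines):
    """Parse lines of 'SMILES[ whitespace name]' into (smiles, name) pairs."""
    molecules = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines (assumed behaviour)
        # Split on the first run of spaces or tabs only, so that
        # names containing spaces are preserved.
        parts = line.split(None, 1)
        smiles = parts[0]
        name = parts[1].strip() if len(parts) > 1 else f"mol_{line_no}"
        molecules.append((smiles, name))
    return molecules


example = [
    "CCO ethanol",
    "c1ccccc1\tbenzene",
    "",
    "CC(=O)O",
]
parsed = parse_smiles_lines(example)
```

To read from disk, pass an open file object instead of a list, for example `parse_smiles_lines(open(path, encoding="utf-8"))`, where `path` points to your .smi or .txt file.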