User Guide

Complete walkthrough of the PreditX ML Virtual Screening Pipeline

6-Step Pipeline

Welcome!

This guide outlines each step of our pipeline, from data retrieval to external predictions. Whether you're a seasoned computational chemist or new to machine learning, you'll find straightforward instructions here.

Pipeline Steps

1 Data Retrieval & Cleaning

Goal

Obtain bioactivity data (IC50, Ki, Kd, etc.) for your protein target from ChEMBL/UniProt.

Key Outputs

_bioactivity_data_cleaned.xlsx (cleaned bioactivity data, standardized to nM)

Notes
  • Filters out assays with low confidence scores
  • Logs how many compounds were found upon completion
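The cleaning itself is handled by the pipeline, but the nM standardization mentioned above can be sketched roughly as follows. The column names (`standard_value`, `standard_units`) and the unit vocabulary are assumptions based on typical ChEMBL exports, not this pipeline's actual schema:

```python
import pandas as pd

# Conversion factors into nM. Unit labels are assumed (ChEMBL-style),
# not the pipeline's actual vocabulary.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}

def standardize_to_nm(df: pd.DataFrame) -> pd.DataFrame:
    """Convert all activity values to nM; drop rows with unrecognized units."""
    df = df[df["standard_units"].isin(TO_NM)].copy()
    df["value_nM"] = df["standard_value"] * df["standard_units"].map(TO_NM)
    return df

df = pd.DataFrame({
    "standard_value": [500.0, 2.0, 0.5],
    "standard_units": ["nM", "uM", "mM"],
})
result = standardize_to_nm(df)
# result["value_nM"] -> [500.0, 2000.0, 500000.0]
```

Rows whose units cannot be mapped are dropped rather than guessed, which mirrors the conservative filtering this step applies to low-confidence assays.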

2 Data Labeling (pIC50 & Classes)

Goal

Convert IC50 values in nM to pIC50 = -log10(nM × 1e-9) (equivalently, 9 − log10(nM)) and categorize compounds into active, intermediate, or inactive.

Key Outputs

_bioactivity_data_labeled.xlsx

Tips for Non-Technical Users
  • pIC50 is a negative log scale: a higher pIC50 means a lower IC50, i.e., higher potency
  • The pipeline automatically defines cutoffs at the 33rd and 66th percentiles
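The conversion and percentile-based labeling described above can be sketched in a few lines. The function name `label_compounds` and the class names are illustrative; only the formula and the 33rd/66th-percentile cutoffs come from this guide:

```python
import numpy as np
import pandas as pd

def label_compounds(nm_values):
    """Compute pIC50 = -log10(nM * 1e-9) and bin compounds into three
    classes at the 33rd and 66th percentiles (higher pIC50 = more potent)."""
    pic50 = -np.log10(np.asarray(nm_values, dtype=float) * 1e-9)
    lo, hi = np.percentile(pic50, [33, 66])
    classes = np.where(pic50 >= hi, "active",
              np.where(pic50 < lo, "inactive", "intermediate"))
    return pd.DataFrame({"pIC50": pic50, "class": classes})

# 10 nM -> pIC50 8, 100 nM -> 7, ... 100000 nM -> 4
labels = label_compounds([10, 100, 1000, 10000, 100000])
```

Because the cutoffs are percentiles of your own data, the three classes are roughly balanced regardless of how potent the dataset is overall.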

3 Scaffold-based Data Splitting & Diversity Plots

Goal

Split compounds so the test set has unique scaffolds not found in training. Generate PCA/MDS/Tanimoto plots to illustrate chemical diversity.

Key Outputs
  • _training_set.xlsx and _test_set.xlsx
  • Diversity plots: _Tanimoto_Hist_Full.png, _PCA_full.png, _MDS_full.png
Importance: Keeping test-set scaffolds out of training gives a realistic measure of your model's ability to generalize to unseen chemotypes.
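Scaffold extraction is typically done with RDKit (e.g., Bemis–Murcko scaffolds via `rdkit.Chem.Scaffolds.MurckoScaffold`), which is omitted here. This dependency-free sketch shows only the group-wise splitting logic, with scaffold strings assumed precomputed; it is not the pipeline's actual splitter:

```python
import random

def scaffold_split(mol_ids, scaffolds, test_frac=0.2, seed=0):
    """Assign whole scaffold groups to the test set until it reaches
    roughly test_frac of the data; a scaffold never straddles the split."""
    groups = {}
    for mid, scaf in zip(mol_ids, scaffolds):
        groups.setdefault(scaf, []).append(mid)
    order = sorted(groups)              # deterministic base order,
    random.Random(seed).shuffle(order)  # then a seeded shuffle
    test, target = [], test_frac * len(mol_ids)
    for scaf in order:
        if len(test) >= target:
            break
        test.extend(groups[scaf])
    train = [m for m in mol_ids if m not in set(test)]
    return train, test

mols = list(range(10))
scafs = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
train, test = scaffold_split(mols, scafs, test_frac=0.3)
```

Because whole scaffold groups move together, the test fraction is approximate; the guarantee is disjoint scaffolds, not an exact split ratio.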

4 Descriptor Calculation & Feature Selection

Goal

Generate Morgan fingerprints and Mordred descriptors, remove low-variance/correlated features, optionally reduce dimensionality via PCA.

Key Outputs
  • Various .npy arrays containing features, scaled data, and PCA components
  • Pickled objects (.pkl) for the scaler and feature selector
For Technical Users: We rely on scikit-learn for scaling, variance thresholding, correlation filtering, and PCA.
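A minimal sketch of that scikit-learn chain is below. scikit-learn has no built-in pairwise-correlation filter, so that step is shown with NumPy; the thresholds and the function name `select_features` are illustrative, not the pipeline's actual settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def select_features(X, var_thresh=0.0, corr_thresh=0.95, n_components=None):
    """Variance filter -> correlation filter -> scaling -> optional PCA.
    Greedy correlation filter: drop a column if it correlates too highly
    with any earlier column."""
    X = VarianceThreshold(var_thresh).fit_transform(X)
    upper = np.triu(np.abs(np.corrcoef(X, rowvar=False)), k=1)
    keep = [i for i in range(X.shape[1])
            if not (upper[:, i] > corr_thresh).any()]
    X = StandardScaler().fit_transform(X[:, keep])
    if n_components:
        X = PCA(n_components=n_components).fit_transform(X)
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X = np.hstack([X, X[:, :1]])   # add a duplicate (perfectly correlated) column
X_sel = select_features(X)     # duplicate column is filtered out
```

In the real pipeline the fitted scaler and selector are pickled (`.pkl`) so that Step 6 can apply the identical transformation to new molecules.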

5 Model Training & Evaluation

Goal

Train multiple ML algorithms with hyperparameter tuning (GridSearchCV), evaluate them by accuracy, F1, and ROC-AUC. Produce confusion matrices and ROC curves.

Key Outputs

A model_evaluation_results_* folder containing:

  • Model pickle files (.pkl) for each trained model
  • Excel summary of performance metrics
  • Confusion matrix PNGs and ROC curve PNGs
Next Step: Identify the best model (by ROC-AUC or your chosen metric) to use in Step 6.
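The GridSearchCV tuning and ROC-AUC evaluation can be sketched as follows, using synthetic data in place of the Step 4 descriptor matrices. The estimator choice and parameter grid are illustrative, not the pipeline's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the descriptor matrix and labels from Step 4.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="roc_auc",  # tune on the same metric used to pick the best model
    cv=3,
)
grid.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
```

`grid.best_estimator_` is what gets pickled per model; comparing each model's held-out ROC-AUC is how you identify the winner to carry into Step 6.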

6 External Dataset Prediction

Goal

Apply your trained models to new molecules. You can:

  • Paste up to 50 SMILES directly
  • Upload a file of SMILES (formats: .smi or .txt; one SMILES per line)
  • Use built-in dataset (up to 500k molecules) for massive high-throughput virtual screening
New Option: You can optionally activate Consensus Voting to combine predictions from all trained models, giving a more reliable result.
Key Outputs

An Excel sheet listing each molecule, its predicted activity, optional probability/confidence scores, and #ActiveVotes (how many models voted Active).
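The #ActiveVotes column and a simple majority consensus could be computed as sketched below. The function name `consensus_vote` and the majority rule are assumptions for illustration; the guide only states that votes from all trained models are combined:

```python
import numpy as np

def consensus_vote(predictions):
    """predictions: dict of model name -> per-molecule 0/1 labels (1 = Active).
    Returns #ActiveVotes per molecule and a strict-majority consensus label."""
    votes = np.sum(list(predictions.values()), axis=0)
    n_models = len(predictions)
    labels = np.where(votes > n_models / 2, "Active", "Inactive")
    return votes, labels

preds = {"rf": [1, 0, 1], "svm": [1, 0, 0], "knn": [1, 1, 0]}
votes, labels = consensus_vote(preds)
# votes -> [3, 1, 1]; labels -> ["Active", "Inactive", "Inactive"]
```

A unanimous #ActiveVotes count (here, 3 of 3 for the first molecule) is a stronger signal than a bare majority, which is why the count itself is reported alongside the label.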

File Format Hint

Each line should contain a SMILES string; an optional molecule name can follow after a space or tab. No header line is required.
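A parser for that format fits in a few lines; `parse_smiles_lines` is a hypothetical helper, not part of the pipeline's API:

```python
def parse_smiles_lines(lines):
    """Parse 'SMILES [name]' lines: SMILES first, optional name after
    whitespace (space or tab); blank lines are skipped; no header expected."""
    records = []
    for line in lines:
        parts = line.strip().split(None, 1)  # split on first run of whitespace
        if not parts:
            continue
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else ""
        records.append((smiles, name))
    return records

rows = parse_smiles_lines(["CCO ethanol", "c1ccccc1\tbenzene", "CC(=O)O", ""])
# rows -> [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CC(=O)O", "")]
```

Note that a missing name is simply left empty, so files with and without molecule names can be mixed freely.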

For Non-Technical Users: If you only have a handful of molecules, the "paste SMILES" option is easiest. For bigger sets, upload a file or use the built-in dataset.

Need More Help?

Explore additional resources to get the most out of PreditX