User Guide

Complete walkthrough of the PreditX ML Virtual Screening Pipeline

6-Step Pipeline

Welcome!

This guide outlines each step of our pipeline, from data retrieval to external predictions. Whether you're a seasoned computational chemist or new to machine learning, you'll find straightforward instructions here.

Pipeline Steps

1 Data Retrieval & Cleaning

Goal

Obtain bioactivity data (IC50, Ki, Kd, etc.) for your protein target from ChEMBL/UniProt.

Key Outputs

_bioactivity_data_cleaned.xlsx (cleaned bioactivity data, standardized to nM)

Notes
  • Filters out assays with low confidence scores
  • Logs how many compounds were found upon completion
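The cleaning itself is handled by the pipeline, but the nM standardization mentioned above can be sketched roughly as follows. The column names (`standard_value`, `standard_units`) and the unit vocabulary are assumptions based on typical ChEMBL exports, not this pipeline's actual schema:

```python
import pandas as pd

# Conversion factors into nM. Unit labels are assumed (ChEMBL-style),
# not the pipeline's actual vocabulary.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}

def standardize_to_nm(df: pd.DataFrame) -> pd.DataFrame:
    """Convert all activity values to nM; drop rows with unrecognized units."""
    df = df[df["standard_units"].isin(TO_NM)].copy()
    df["value_nM"] = df["standard_value"] * df["standard_units"].map(TO_NM)
    return df

df = pd.DataFrame({
    "standard_value": [500.0, 2.0, 0.5],
    "standard_units": ["nM", "uM", "mM"],
})
result = standardize_to_nm(df)
# result["value_nM"] -> [500.0, 2000.0, 500000.0]
```

Rows whose units cannot be mapped are dropped rather than guessed, which mirrors the conservative filtering this step applies to low-confidence assays.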

2 Data Labeling (pIC50 & Classes)

Goal

Convert IC50 values in nM to pIC50 = -log10(nM × 1e-9) (equivalently, 9 − log10(nM)) and categorize compounds into active, intermediate, or inactive.

Key Outputs

_bioactivity_data_labeled.xlsx

Tips for Non-Technical Users
  • pIC50 is a negative log scale: a higher pIC50 means a lower IC50, i.e., higher potency
  • The pipeline automatically defines cutoffs at the 33rd and 66th percentiles
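The conversion and percentile-based labeling described above can be sketched in a few lines. The function name `label_compounds` and the class names are illustrative; only the formula and the 33rd/66th-percentile cutoffs come from this guide:

```python
import numpy as np
import pandas as pd

def label_compounds(nm_values):
    """Compute pIC50 = -log10(nM * 1e-9) and bin compounds into three
    classes at the 33rd and 66th percentiles (higher pIC50 = more potent)."""
    pic50 = -np.log10(np.asarray(nm_values, dtype=float) * 1e-9)
    lo, hi = np.percentile(pic50, [33, 66])
    classes = np.where(pic50 >= hi, "active",
              np.where(pic50 < lo, "inactive", "intermediate"))
    return pd.DataFrame({"pIC50": pic50, "class": classes})

# 10 nM -> pIC50 8, 100 nM -> 7, ... 100000 nM -> 4
labels = label_compounds([10, 100, 1000, 10000, 100000])
```

Because the cutoffs are percentiles of your own data, the three classes are roughly balanced regardless of how potent the dataset is overall.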

3 Scaffold-based Data Splitting & Diversity Plots

Goal

Split compounds so the test set has unique scaffolds not found in training. Generate PCA/MDS/Tanimoto plots to illustrate chemical diversity.

Key Outputs
  • _training_set.xlsx and _test_set.xlsx
  • Diversity plots: _Tanimoto_Hist_Full.png, _PCA_full.png, _MDS_full.png
Importance: Keeping test-set scaffolds out of training gives a realistic measure of your model's ability to generalize to unseen chemotypes.
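Scaffold extraction is typically done with RDKit (e.g., Bemis–Murcko scaffolds via `rdkit.Chem.Scaffolds.MurckoScaffold`), which is omitted here. This dependency-free sketch shows only the group-wise splitting logic, with scaffold strings assumed precomputed; it is not the pipeline's actual splitter:

```python
import random

def scaffold_split(mol_ids, scaffolds, test_frac=0.2, seed=0):
    """Assign whole scaffold groups to the test set until it reaches
    roughly test_frac of the data; a scaffold never straddles the split."""
    groups = {}
    for mid, scaf in zip(mol_ids, scaffolds):
        groups.setdefault(scaf, []).append(mid)
    order = sorted(groups)              # deterministic base order,
    random.Random(seed).shuffle(order)  # then a seeded shuffle
    test, target = [], test_frac * len(mol_ids)
    for scaf in order:
        if len(test) >= target:
            break
        test.extend(groups[scaf])
    train = [m for m in mol_ids if m not in set(test)]
    return train, test

mols = list(range(10))
scafs = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
train, test = scaffold_split(mols, scafs, test_frac=0.3)
```

Because whole scaffold groups move together, the test fraction is approximate; the guarantee is disjoint scaffolds, not an exact split ratio.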

4 Descriptor Calculation & Feature Selection

Goal

Generate Morgan fingerprints and Mordred descriptors, remove low-variance/correlated features, optionally reduce dimensionality via PCA.

Key Outputs
  • Various .npy arrays containing features, scaled data, and PCA components
  • Pickled objects (.pkl) for the scaler and feature selector
For Technical Users: We rely on scikit-learn for scaling, variance thresholding, correlation filtering, and PCA.
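A minimal sketch of that scikit-learn chain is below. scikit-learn has no built-in pairwise-correlation filter, so that step is shown with NumPy; the thresholds and the function name `select_features` are illustrative, not the pipeline's actual settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def select_features(X, var_thresh=0.0, corr_thresh=0.95, n_components=None):
    """Variance filter -> correlation filter -> scaling -> optional PCA.
    Greedy correlation filter: drop a column if it correlates too highly
    with any earlier column."""
    X = VarianceThreshold(var_thresh).fit_transform(X)
    upper = np.triu(np.abs(np.corrcoef(X, rowvar=False)), k=1)
    keep = [i for i in range(X.shape[1])
            if not (upper[:, i] > corr_thresh).any()]
    X = StandardScaler().fit_transform(X[:, keep])
    if n_components:
        X = PCA(n_components=n_components).fit_transform(X)
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X = np.hstack([X, X[:, :1]])   # add a duplicate (perfectly correlated) column
X_sel = select_features(X)     # duplicate column is filtered out
```

In the real pipeline the fitted scaler and selector are pickled (`.pkl`) so that Step 6 can apply the identical transformation to new molecules.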

5 Model Training & Evaluation

Goal

Train multiple ML algorithms with hyperparameter tuning (GridSearchCV), evaluate them by accuracy, F1, and ROC-AUC. Produce confusion matrices and ROC curves.

Key Outputs

A model_evaluation_results_* folder containing:

  • Model pickle files (.pkl) for each trained model
  • Excel summary of performance metrics
  • Confusion matrix PNGs and ROC curve PNGs
Next Step: Identify the best model (by ROC-AUC or your chosen metric) to use in Step 6.
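The GridSearchCV tuning and ROC-AUC evaluation can be sketched as follows, using synthetic data in place of the Step 4 descriptor matrices. The estimator choice and parameter grid are illustrative, not the pipeline's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the descriptor matrix and labels from Step 4.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="roc_auc",  # tune on the same metric used to pick the best model
    cv=3,
)
grid.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
```

`grid.best_estimator_` is what gets pickled per model; comparing each model's held-out ROC-AUC is how you identify the winner to carry into Step 6.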

6 External Dataset Prediction

Goal

Apply your trained models to new molecules. You can:

  • Paste up to 50 SMILES directly
  • Upload a file of SMILES (formats: .smi or .txt; one SMILES per line)
  • Use built-in dataset (up to 500k molecules) for massive high-throughput virtual screening
New Option: You can optionally activate Consensus Voting to combine predictions from all trained models, giving a more reliable result.
Key Outputs

An Excel sheet listing each molecule, its predicted activity, optional probability/confidence scores, and #ActiveVotes (how many models voted Active).
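The #ActiveVotes column and a simple majority consensus could be computed as sketched below. The function name `consensus_vote` and the majority rule are assumptions for illustration; the guide only states that votes from all trained models are combined:

```python
import numpy as np

def consensus_vote(predictions):
    """predictions: dict of model name -> per-molecule 0/1 labels (1 = Active).
    Returns #ActiveVotes per molecule and a strict-majority consensus label."""
    votes = np.sum(list(predictions.values()), axis=0)
    n_models = len(predictions)
    labels = np.where(votes > n_models / 2, "Active", "Inactive")
    return votes, labels

preds = {"rf": [1, 0, 1], "svm": [1, 0, 0], "knn": [1, 1, 0]}
votes, labels = consensus_vote(preds)
# votes -> [3, 1, 1]; labels -> ["Active", "Inactive", "Inactive"]
```

A unanimous #ActiveVotes count (here, 3 of 3 for the first molecule) is a stronger signal than a bare majority, which is why the count itself is reported alongside the label.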

File Format Hint

Each line should contain a SMILES string; an optional molecule name can follow after a space or tab. No header line is required.
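A parser for that format fits in a few lines; `parse_smiles_lines` is a hypothetical helper, not part of the pipeline's API:

```python
def parse_smiles_lines(lines):
    """Parse 'SMILES [name]' lines: SMILES first, optional name after
    whitespace (space or tab); blank lines are skipped; no header expected."""
    records = []
    for line in lines:
        parts = line.strip().split(None, 1)  # split on first run of whitespace
        if not parts:
            continue
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else ""
        records.append((smiles, name))
    return records

rows = parse_smiles_lines(["CCO ethanol", "c1ccccc1\tbenzene", "CC(=O)O", ""])
# rows -> [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CC(=O)O", "")]
```

Note that a missing name is simply left empty, so files with and without molecule names can be mixed freely.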

For Non-Technical Users: If you only have a handful of molecules, the "paste SMILES" option is easiest. For bigger sets, upload a file or use the built-in dataset.

Need More Help?

Explore additional resources to get the most out of PreditX