Documentation

Technical overview of the PreditX® workflow and outputs.

Platform documentation

Automated target-specific ML virtual screening.

PreditX® is a cloud-deployed machine learning platform for virtual screening. It connects target selection, bioactivity-data retrieval, scaffold-aware model training, external prediction, consensus ranking, and ADMET-supported hit triage in one workflow.

Platform scope

PreditX® is designed for early-stage small-molecule prioritization. It supports target-specific ligand-based model training and downstream screening of user-provided or internal compound libraries.

Input

A user starts from a protein target and selects molecular feature representations and models.

  • Protein target name
  • Descriptor mode
  • Model selection

Training

The platform retrieves, cleans, labels, and prepares target-specific bioactivity data for machine learning.

  • UniProt- and ChEMBL-based retrieval
  • Bioactivity cleaning and standardization
  • Scaffold-aware train/test splitting
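A scaffold-aware split keeps all compounds that share a scaffold on the same side of the train/test boundary. The following is a minimal sketch of that idea, not the PreditX® implementation. It assumes each compound already has a precomputed scaffold key (for example a Bemis-Murcko scaffold SMILES), and the function name `scaffold_split`, the `test_fraction` parameter, and the largest-groups-first ordering are illustrative choices.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both sets.

    `scaffolds` holds one precomputed scaffold key per compound
    (assumed here to be a scaffold SMILES string).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups are placed first, so rarer scaffolds
    # tend to end up in the test set (an illustrative heuristic).
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    max_train = len(scaffolds) - round(len(scaffolds) * test_fraction)

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= max_train:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Hypothetical scaffold keys for ten compounds.
scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1",
    "c1ccncc1", "c1ccc2ccccc2c1", "c1ccccc1", "C1CCNCC1", "c1ccncc1",
]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.3)
```

Because whole scaffold groups are assigned together, the realized test fraction can differ slightly from the requested one.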

Output

The platform returns trained models, performance outputs, prediction tables, ranked candidates, and ADMET annotations.

  • Model outputs
  • Consensus prediction results
  • ADMET workbooks

Data processing and model training

PreditX® automates the preparation steps that are usually spread across multiple tools.

| Stage | Purpose | User-facing value |
| --- | --- | --- |
| Bioactivity retrieval | Collect target-specific activity data from public resources. | Reduces manual dataset preparation. |
| Data cleaning | Filter low-confidence, duplicated, incomplete, or non-standardized records. | Improves reproducibility and reliability. |
| Activity labeling | Convert activity values into machine-learning-ready classes. | Creates active/inactive modeling sets for classification. |
| Scaffold-based split | Separate training and test compounds by chemical scaffold. | Provides a more realistic estimate of generalization to new chemistry. |
| Descriptor calculation | Generate Morgan fingerprints, Mordred descriptors, or combined features. | Allows flexible molecular representation. |
| Model training | Train selected machine learning classifiers on the target-specific dataset. | Creates models for downstream screening. |
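The activity-labeling stage can be illustrated with a short sketch. The units and cut-off below are assumptions for illustration only: potencies are taken to be in nM, converted to a negative log10 molar scale, and labeled active at or above a threshold of 6.0 (1 µM). The threshold PreditX® actually applies is not stated here.

```python
import math


def to_pactivity(value_nm):
    """Convert a potency in nM (for example an IC50) to -log10(molar)."""
    return 9.0 - math.log10(value_nm)


def label_activity(value_nm, threshold=6.0):
    """Return 1 (active) if the pActivity reaches the threshold, else 0."""
    return 1 if to_pactivity(value_nm) >= threshold else 0


# Hypothetical records: (compound name, potency in nM).
records = [("cmpd_a", 12.0), ("cmpd_b", 850.0), ("cmpd_c", 25000.0)]
labels = {name: label_activity(value) for name, value in records}
```

A single hard cut-off is the simplest choice; compounds close to the threshold are the ones most sensitive to assay noise.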

Supported molecular representations

Users can select the feature mode that best fits the target, dataset, and modeling objective.

Morgan fingerprints

Capture local chemical substructures and similarity patterns frequently used in ligand-based screening.

Mordred descriptors

Capture physicochemical, constitutional, and structural molecular properties.

Combined features

Integrate fingerprints and descriptors so that a model can draw on both substructure patterns and physicochemical properties, which can support more robust model training.
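One common way to combine the two representations is to concatenate them per molecule after putting the continuous descriptors on a comparable scale. The sketch below assumes that approach; whether PreditX® scales or concatenates features this way is not stated, and `standardize_columns` and `combine_features` are hypothetical helper names.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each descriptor column; constant columns become all zeros."""
    scaled_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append(
            [(value - mu) / sigma if sigma > 0 else 0.0 for value in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


def combine_features(fingerprints, descriptors):
    """Concatenate fingerprint bits with standardized descriptors, row by row."""
    scaled = standardize_columns(descriptors)
    return [list(bits) + values for bits, values in zip(fingerprints, scaled)]


# Toy inputs: 4-bit fingerprints and two descriptors per molecule.
fingerprints = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
descriptors = [[310.4, 2.1], [295.3, 3.4], [402.5, 1.2]]
combined = combine_features(fingerprints, descriptors)
```

In a real workflow the scaling statistics would normally be computed on the training set only and then reused for test and screening compounds.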

Machine learning models

PreditX® supports a diverse set of supervised classifiers so users can compare performance across model families.

Available model families

  • Random Forest
  • Support Vector Machine
  • XGBoost
  • Logistic Regression
  • k-Nearest Neighbors
  • Multi-Layer Perceptron
  • Decision Tree, Gradient Boosting, Extra Trees, Ridge Classifier, and Gaussian Naive Bayes

Screening and prediction modes

Once models are trained, users can apply them to external molecules or the internal PreditX® compound database.

Pasted SMILES

Use direct SMILES input for small or exploratory molecule lists.

Uploaded molecule files

Upload a SMILES file when screening larger custom libraries.
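For either input route, the molecule list has to be turned into clean records before prediction. The sketch below assumes a simple layout of one molecule per line, with a SMILES string optionally followed by a name; the exact format PreditX® accepts is not specified here. The parser only tidies text and does not check chemical validity.

```python
def parse_smiles_lines(text):
    """Parse 'SMILES [name]' lines into unique (smiles, name) pairs, in order."""
    seen = set()
    molecules = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        smiles = parts[0]
        name = parts[1].strip() if len(parts) > 1 else f"mol_{line_number}"
        if smiles in seen:
            continue  # exact string duplicates only; no canonicalization
        seen.add(smiles)
        molecules.append((smiles, name))
    return molecules


pasted = """
CCO ethanol
c1ccccc1 benzene

CC(=O)Oc1ccccc1C(=O)O
CCO duplicate_ethanol
"""
molecules = parse_smiles_lines(pasted)
```

The same function can be applied to the contents of an uploaded file by reading it as text first.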

Internal database

Screen against a large internal database of purchasable, PAINS-filtered molecules to produce a ranked, diverse candidate shortlist.
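A ranked, diverse shortlist can be built by walking the ranked list and skipping molecules that are too similar to ones already chosen. The sketch below shows one such greedy scheme using Tanimoto similarity on sets of fingerprint on-bits. It is an assumed approach for illustration; the selection method and the `max_similarity` cut-off used by PreditX® are not documented here.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def diverse_shortlist(candidates, size, max_similarity=0.7):
    """Pick top-scoring candidates, skipping near-duplicates of earlier picks."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    picked = []
    for candidate in ranked:
        if len(picked) == size:
            break
        if all(tanimoto(candidate["bits"], p["bits"]) <= max_similarity
               for p in picked):
            picked.append(candidate)
    return picked


# Toy candidates with hypothetical scores and fingerprint on-bits.
candidates = [
    {"id": "m1", "score": 0.95, "bits": {1, 2, 3, 4}},
    {"id": "m2", "score": 0.93, "bits": {1, 2, 3, 4, 5}},  # close analog of m1
    {"id": "m3", "score": 0.90, "bits": {7, 8, 9}},
    {"id": "m4", "score": 0.85, "bits": {1, 2, 10, 11}},
]
shortlist = [c["id"] for c in diverse_shortlist(candidates, size=3)]
```

Here `m2` is skipped because its similarity to `m1` is 0.8, above the assumed cut-off, so the shortlist covers three distinct chemotypes.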

Consensus prediction

Consensus mode combines the outputs of multiple trained models to reduce dependence on a single classifier.

How to interpret consensus outputs

Consensus predictions should be interpreted as prioritization support. The output may include predicted class, active-class probability, confidence category, vote ratio, and the number of models contributing to the decision. These values support ranking and triage, but they are not experimental potency measurements.
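To make these fields concrete, the sketch below derives them from per-model active-class probabilities for a single molecule. It assumes simple majority voting, an unweighted mean probability, and illustrative confidence cut-offs; none of these are confirmed PreditX® definitions, and the model names in the example are placeholders.

```python
from statistics import mean


def consensus(probabilities, threshold=0.5):
    """Combine per-model active-class probabilities for one molecule."""
    values = list(probabilities.values())
    votes = [p >= threshold for p in values]
    vote_ratio = sum(votes) / len(votes)

    # A tie (vote_ratio == 0.5) is treated as inactive in this sketch.
    predicted_class = "active" if vote_ratio > 0.5 else "inactive"

    # Confidence reflects how strongly the models agree, in either direction.
    # The 0.9 and 0.7 cut-offs are illustrative assumptions.
    agreement = max(vote_ratio, 1.0 - vote_ratio)
    if agreement >= 0.9:
        confidence = "high"
    elif agreement >= 0.7:
        confidence = "medium"
    else:
        confidence = "low"

    return {
        "predicted_class": predicted_class,
        "active_probability": mean(values),
        "confidence": confidence,
        "vote_ratio": vote_ratio,
        "n_models": len(values),
    }


result = consensus(
    {"random_forest": 0.91, "svm": 0.78, "xgboost": 0.88, "logistic_regression": 0.42}
)
```

In this example three of four models vote active, giving a vote ratio of 0.75 and a medium confidence category under the assumed cut-offs.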

ADMET-supported prioritization

ADMET output helps users review whether predicted active molecules also have acceptable early developability features.

Reported properties

  • Molecular weight, cLogP, TPSA, hydrogen bond donors and acceptors
  • Rotatable bonds, ring counts, aromatic ring counts, Fsp3, QED
  • Solubility estimates, formal charge, molar refractivity, structural alerts

Decision-support outputs

  • ADMET score and pass/fail flag
  • Risk band and recommended action
  • Applicability-domain assessment
  • Interpretable penalty reasons
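The sketch below shows how a score, a pass/fail flag, a risk band, and penalty reasons could be derived from precomputed properties. The limits are the widely used rule-of-five and Veber-style cut-offs, and the scoring and banding scheme is an assumption made for illustration; it is not the PreditX® scoring method. Property names such as `mol_weight` and `clogp` are hypothetical keys.

```python
RULES = [
    # (property key, upper limit, penalty reason)
    ("mol_weight", 500.0, "molecular weight above 500"),
    ("clogp", 5.0, "cLogP above 5"),
    ("hbd", 5, "more than 5 hydrogen bond donors"),
    ("hba", 10, "more than 10 hydrogen bond acceptors"),
    ("tpsa", 140.0, "TPSA above 140"),
    ("rotatable_bonds", 10, "more than 10 rotatable bonds"),
]


def triage(properties):
    """Return an illustrative score, pass flag, risk band, and penalty reasons."""
    reasons = [reason for key, limit, reason in RULES if properties[key] > limit]
    if properties.get("structural_alerts", 0) > 0:
        reasons.append("structural alert present")

    # Each triggered rule removes an equal share of the score (assumed scheme).
    score = 1.0 - len(reasons) / (len(RULES) + 1)
    if not reasons:
        band = "low"
    elif len(reasons) <= 2:
        band = "moderate"
    else:
        band = "high"
    return {
        "score": round(score, 3),
        "passed": band != "high",
        "risk_band": band,
        "reasons": reasons,
    }


# Hypothetical property values for one predicted-active molecule.
report = triage({
    "mol_weight": 534.6, "clogp": 5.8, "hbd": 2, "hba": 7,
    "tpsa": 96.0, "rotatable_bonds": 6, "structural_alerts": 0,
})
```

Returning the triggered reasons alongside the score keeps the flag interpretable: a reviewer can see which properties drove the penalty.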

Benchmark evidence

The platform has been evaluated across a benchmark of 105 data-rich targets, with 104 evaluable targets included in aggregate performance analysis.

Scaffold-held-out evaluation

PreditX® uses scaffold-aware evaluation to better reflect prospective screening conditions.

Strong benchmark performance

The benchmark showed a median best ROC-AUC of approximately 0.956 across evaluable targets.
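For readers unfamiliar with the summary statistic, the sketch below shows how a "median best ROC-AUC" can be computed: ROC-AUC per model and target, the best model per target, then the median across targets. The numbers are toy values chosen for illustration and are not PreditX® benchmark results; the function names are hypothetical.

```python
from statistics import median


def roc_auc(labels, scores):
    """ROC-AUC as the chance that an active outscores an inactive (ties 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def median_best_auc(auc_by_target):
    """Take the best model AUC per target, then the median across targets."""
    return median(max(model_aucs.values()) for model_aucs in auc_by_target.values())


# Toy test-set labels and scores for one model on one target.
example_auc = roc_auc([1, 1, 0, 0, 1], [0.9, 0.4, 0.35, 0.8, 0.7])

# Toy per-target, per-model AUC values (not benchmark data).
toy_aucs = {
    "target_a": {"rf": 0.91, "svm": 0.94},
    "target_b": {"rf": 0.88, "svm": 0.85},
    "target_c": {"rf": 0.97, "svm": 0.96},
}
summary = median_best_auc(toy_aucs)
```

Because the best model is selected per target, this statistic describes the ceiling across model families rather than the performance of any single classifier.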

Interpretable reliability context

Novelty, split integrity, and threshold-margin analyses help explain where models are expected to perform strongly.

Limitations and responsible use

PreditX® is intended for early-stage prioritization. Model scores, consensus probabilities, and ADMET annotations should guide compound selection, but experimental validation remains necessary. Ligand-based predictions do not directly assess binding pose, receptor conformation, water-mediated interactions, covalent mechanisms, or full pharmacokinetic and toxicological behavior.