Documentation
Technical overview of the PreditX® workflow and outputs.
Automated target-specific ML virtual screening.
PreditX® is a cloud-deployed machine learning platform for virtual screening. It connects target selection, bioactivity-data retrieval, scaffold-aware model training, external prediction, consensus ranking, and ADMET-supported hit triage in one workflow.
Platform scope
PreditX® is designed for early-stage small-molecule prioritization. It supports target-specific ligand-based model training and downstream screening of user-provided or internal compound libraries.
Input
A user starts from a protein target and selects molecular feature representations and models.
- Protein target name
- Descriptor mode
- Model selection
Training
The platform retrieves, cleans, labels, and prepares target-specific bioactivity data for machine learning.
- UniProt- and ChEMBL-based retrieval
- Bioactivity cleaning and standardization
- Scaffold-aware train/test splitting
Output
The platform returns trained models, performance outputs, prediction tables, ranked candidates, and ADMET annotations.
- Model outputs
- Consensus prediction results
- ADMET workbooks
Data processing and model training
PreditX® automates the preparation steps that are usually spread across multiple tools.
| Stage | Purpose | User-facing value |
|---|---|---|
| Bioactivity retrieval | Collect target-specific activity data from public resources. | Reduces manual dataset preparation. |
| Data cleaning | Filter low-confidence, duplicated, incomplete, or non-standardized records. | Improves reproducibility and reliability. |
| Activity labeling | Convert activity values into machine-learning-ready classes. | Creates active/inactive modeling sets for classification. |
| Scaffold-based split | Separate training and test compounds by chemical scaffold. | Provides a more realistic estimate of generalization to new chemistry. |
| Descriptor calculation | Generate Morgan fingerprints, Mordred descriptors, or combined features. | Allows flexible molecular representation. |
| Model training | Train selected machine learning classifiers on the target-specific dataset. | Creates models for downstream screening. |
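The cleaning and labeling stages in the table can be illustrated with a minimal sketch. It assumes each record is a dictionary with hypothetical keys `smiles`, `value_nM`, and `relation`, and it uses an illustrative activity cutoff of pIC50 ≥ 6 (IC50 ≤ 1 µM). Neither the record layout nor the cutoff is taken from PreditX® itself.

```python
import math
from statistics import median

ACTIVE_PIC50 = 6.0  # assumption: illustrative cutoff, not a documented PreditX value


def clean_and_label(records, threshold=ACTIVE_PIC50):
    """Drop incomplete or censored records, merge duplicates, assign classes."""
    by_smiles = {}
    for rec in records:
        smiles = rec.get("smiles")
        value = rec.get("value_nM")
        # Keep only exact measurements ("=") with a positive numeric value.
        if not smiles or value is None or value <= 0 or rec.get("relation") != "=":
            continue
        pic50 = 9.0 - math.log10(value)  # nM converted to -log10(molar)
        by_smiles.setdefault(smiles, []).append(pic50)

    labeled = []
    for smiles, values in by_smiles.items():
        agg = median(values)  # one aggregated value per compound
        labeled.append(
            {"smiles": smiles, "pIC50": round(agg, 2), "active": int(agg >= threshold)}
        )
    return labeled
```

Taking the median of duplicate measurements is one common choice; a stricter pipeline might instead discard compounds whose replicate values disagree strongly.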
Supported molecular representations
Users can select the feature mode that best fits the target, dataset, and modeling objective.
Morgan fingerprints
Capture local chemical substructures and similarity patterns frequently used in ligand-based screening.
Mordred descriptors
Capture physicochemical, constitutional, and structural molecular properties.
Combined features
Concatenate fingerprints and descriptors so that models can draw on both substructure and physicochemical information.
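As a rough illustration of the combined mode, the sketch below concatenates fingerprint bits with standardized descriptor values. It assumes the fingerprints and descriptors have already been computed elsewhere; the function name `combine_features` and the z-score scaling are illustrative choices, not a description of the platform's internals.

```python
from statistics import mean, pstdev


def combine_features(fingerprints, descriptors):
    """Concatenate binary fingerprint bits with z-scored descriptor columns.

    fingerprints: list of equal-length 0/1 lists, one per molecule.
    descriptors:  list of equal-length float lists, one per molecule.
    """
    columns = list(zip(*descriptors))
    stats = [(mean(col), pstdev(col)) for col in columns]
    combined = []
    for bits, desc in zip(fingerprints, descriptors):
        # Constant columns carry no information and are mapped to 0.0.
        scaled = [
            (x - mu) / sd if sd > 0 else 0.0
            for x, (mu, sd) in zip(desc, stats)
        ]
        combined.append(list(bits) + scaled)
    return combined
```

Descriptors are scaled because their ranges differ by orders of magnitude, whereas fingerprint bits are already 0 or 1. In a real workflow the scaling statistics would be estimated on the training set only and then reused for test and screening compounds.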
Machine learning models
PreditX® supports a diverse set of supervised classifiers so users can compare performance across model families.
Available model families
- Random Forest
- Support Vector Machine
- XGBoost
- Logistic Regression
- k-Nearest Neighbors
- Multi-Layer Perceptron
- Decision Tree, Gradient Boosting, Extra Trees, Ridge Classifier, and Gaussian Naive Bayes
Screening and prediction modes
Once models are trained, users can apply them to external molecules or the internal PreditX® compound database.
Pasted SMILES
Use direct SMILES input for small or exploratory molecule lists.
Uploaded molecule files
Upload a SMILES file when screening larger custom libraries.
Internal database
Screen against a large internal database of purchasable, PAINS-filtered molecules to produce a ranked, diverse candidate shortlist.
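How the internal-database shortlist is made diverse is not specified here. One common approach, sketched below as an assumption, is a greedy pick: walk the candidates from highest to lowest score and skip any molecule that is too similar (by Tanimoto similarity) to one already selected. Fingerprints are represented as sets of on-bit indices, and the 0.7 cutoff is illustrative.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def diverse_shortlist(candidates, max_similarity=0.7, size=10):
    """Greedy diversity pick.

    candidates: list of (identifier, score, bit_set) tuples.
    Returns up to `size` (identifier, score) pairs, best score first.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    picked = []
    for cid, score, bits in ranked:
        if len(picked) == size:
            break
        # Accept only if dissimilar to everything already on the shortlist.
        if all(tanimoto(bits, p[2]) < max_similarity for p in picked):
            picked.append((cid, score, bits))
    return [(cid, score) for cid, score, _ in picked]
```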
Consensus prediction
Consensus mode combines the outputs of multiple trained models to reduce dependence on a single classifier.
How to interpret consensus outputs
Consensus predictions should be interpreted as prioritization support. The output may include predicted class, active-class probability, confidence category, vote ratio, and the number of models contributing to the decision. These values support ranking and triage, but they are not experimental potency measurements.
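The fields listed above can be derived from per-model probabilities in a few lines. The following is a minimal sketch under stated assumptions: each model contributes one active-class probability, a model votes "active" at probability ≥ 0.5, and the confidence bands (0.8 and 0.6 agreement) are illustrative. The actual PreditX® aggregation rule may differ.

```python
def consensus(probabilities, threshold=0.5):
    """Aggregate one molecule's predictions.

    probabilities: dict mapping model name -> active-class probability.
    """
    n_models = len(probabilities)
    if n_models == 0:
        raise ValueError("at least one model prediction is required")

    votes = sum(p >= threshold for p in probabilities.values())
    vote_ratio = votes / n_models
    mean_prob = sum(probabilities.values()) / n_models

    # Strict majority is required; a tie is treated conservatively as inactive.
    predicted = "active" if vote_ratio > 0.5 else "inactive"

    # Agreement is the share of models on the winning side of the vote.
    agreement = max(vote_ratio, 1.0 - vote_ratio)
    if agreement >= 0.8:
        confidence = "high"
    elif agreement >= 0.6:
        confidence = "medium"
    else:
        confidence = "low"

    return {
        "predicted_class": predicted,
        "active_probability": round(mean_prob, 3),
        "vote_ratio": vote_ratio,
        "confidence": confidence,
        "n_models": n_models,
    }
```

For example, four models with probabilities 0.9, 0.8, 0.7, and 0.4 give three active votes out of four, so the molecule is called active with a vote ratio of 0.75 and medium confidence under these bands.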
ADMET-supported prioritization
ADMET output helps users review whether predicted active molecules also have acceptable early developability features.
Reported properties
- Molecular weight, cLogP, TPSA, hydrogen bond donors and acceptors
- Rotatable bonds, ring counts, aromatic ring counts, Fsp3, QED
- Solubility estimates, formal charge, molar refractivity, structural alerts
Decision-support outputs
- ADMET score and pass/fail flag
- Risk band and recommended action
- Applicability-domain assessment
- Interpretable penalty reasons
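A rule-based score with interpretable penalty reasons might look like the sketch below. The limits follow the widely used Lipinski and Veber guidelines, but the penalty weights, the pass mark, the band boundaries, and the property keys are all assumptions made for illustration. They are not the scoring scheme PreditX® uses.

```python
# (property key, upper limit, penalty points, reason) - weights are illustrative
RULES = [
    ("mw", 500.0, 15, "molecular weight above 500"),
    ("clogp", 5.0, 15, "cLogP above 5"),
    ("hbd", 5, 10, "more than 5 hydrogen bond donors"),
    ("hba", 10, 10, "more than 10 hydrogen bond acceptors"),
    ("tpsa", 140.0, 15, "TPSA above 140"),
    ("rot_bonds", 10, 10, "more than 10 rotatable bonds"),
    ("alerts", 0, 25, "structural alert present"),
]


def admet_triage(props, pass_mark=70):
    """Score one molecule from precomputed properties and list penalty reasons."""
    score = 100
    reasons = []
    for key, limit, penalty, reason in RULES:
        if props[key] > limit:
            score -= penalty
            reasons.append(reason)

    if score >= 85:
        band = "low"
    elif score >= pass_mark:
        band = "moderate"
    else:
        band = "high"

    return {
        "score": score,
        "passed": score >= pass_mark,
        "risk_band": band,
        "reasons": reasons,
    }
```

Returning the list of triggered rules alongside the score keeps the result auditable: a reviewer can see why a compound was flagged rather than only that it was.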
Benchmark evidence
The platform was evaluated on a benchmark of 105 data-rich targets; 104 of these were evaluable and are included in the aggregate performance analysis.
Scaffold-held-out evaluation
PreditX® uses scaffold-aware evaluation to better reflect prospective screening conditions.
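The core idea of a scaffold-held-out split is that all compounds sharing a scaffold land on the same side, so the test set contains only chemistry the model has not seen. The sketch below assumes a scaffold key (for example a Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit) is already available for each compound. The largest-group-first assignment is a common convention and an assumption here, not a documented detail of PreditX®.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both sets.

    scaffolds: list of scaffold keys, one per compound.
    Returns (train_indices, test_indices).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by key for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n_test_target = int(round(test_fraction * len(scaffolds)))
    train_cap = len(scaffolds) - n_test_target

    train, test = [], []
    for _, members in ordered:
        # A whole group goes to train if it fits, otherwise to test.
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)
```

Because groups are assigned whole, the realized test fraction only approximates the requested one when scaffold groups are large or uneven.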
Strong benchmark performance
Across evaluable targets, the median best ROC-AUC (the highest ROC-AUC achieved by any model for a target, summarized as the median over targets) was approximately 0.956.
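To make the summary statistic concrete, the sketch below computes ROC-AUC from its rank interpretation and then takes the median of the best per-target value. It assumes "median best ROC-AUC" means the best model per target followed by the median across targets. The function names are hypothetical, and the block contains no benchmark data.

```python
from statistics import median


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random active outranks a random
    inactive, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def median_best_auc(per_target):
    """per_target: dict mapping target -> dict of model name -> ROC-AUC."""
    return median(max(model_aucs.values()) for model_aucs in per_target.values())
```

The pairwise formulation is quadratic in the number of compounds, which is fine for illustration; rank-based implementations in standard libraries are preferable for large test sets.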
Interpretable reliability context
Novelty, split integrity, and threshold-margin analyses help explain where models are expected to perform strongly.
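The novelty analysis is not defined in detail here. One plausible reading, sketched below as an assumption, scores a query compound by one minus its nearest-neighbor Tanimoto similarity to the training set, so that 0 means an identical fingerprint was seen in training and 1 means no shared features. Fingerprints are represented as sets of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novelty(query_bits, training_bits):
    """Return 1 - nearest-neighbor Tanimoto similarity to the training set."""
    if not training_bits:
        raise ValueError("training set must not be empty")
    nearest = max(tanimoto(query_bits, t) for t in training_bits)
    return 1.0 - nearest
```

Under this definition, predictions for compounds with high novelty fall further from the training chemistry and would be treated with more caution.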
Limitations and responsible use
PreditX® is intended for early-stage prioritization. Model scores, consensus probabilities, and ADMET annotations should guide compound selection, but experimental validation remains necessary. Ligand-based predictions do not directly assess binding pose, receptor conformation, water-mediated interactions, covalent mechanisms, or full pharmacokinetic and toxicological behavior.