Champion / Challenger
Test new scorecard versions against your production model before making a full switch. Champion/challenger testing lets you validate model performance with real traffic while maintaining a safe fallback.
What is Champion/Challenger Testing?
In credit risk modeling, the champion is the currently active production model — the one whose scores drive lending decisions. A challenger is a new model version you want to evaluate against the champion using real applicant data.
Rather than replacing your champion outright, you deploy the challenger alongside it. Both models score every incoming request, giving you a direct performance comparison without risking your production decisions.
Shadow Mode vs Live Split Mode
Calibr supports two testing modes. Choose based on your risk tolerance and regulatory requirements.
| Aspect | Shadow Mode | Live Split Mode |
|---|---|---|
| Decision maker | Champion only | Split by traffic percentage |
| Challenger scores | Logged but not returned to caller | Returned as the primary score for assigned traffic |
| Risk level | Zero — no impact on decisions | Moderate — challenger scores affect real decisions |
| Use case | Initial validation, regulatory review | A/B testing after shadow validation |
| API response | Always returns champion score; challenger in `shadow_scores` | Returns whichever model was selected for that request |
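For example, a shadow-mode scoring response might look like the following. The field names here are illustrative assumptions, not the exact API schema:

```json
{
  "score": 712,
  "model_version": "v3-champion",
  "shadow_scores": {
    "v4-challenger": 698
  }
}
```

In live split mode, the top-level score would instead come from whichever model the request was assigned to, with the other model's score still logged for comparison.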
Setting It Up
Deploy your first model (champion)
Deploy a scorecard from the Calibr desktop app. The first deployed model automatically becomes the champion.
Deploy a second model version
Build and deploy a new scorecard version. It will be marked as inactive until you configure it as a challenger.
Add as challenger with traffic allocation
In the web dashboard, navigate to your scorecard's deployment settings and add the new version as a challenger. Configure the mode and traffic percentage:
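As an illustration, a challenger configuration might be expressed like this (field names are assumptions for the sketch, not the dashboard's exact schema):

```json
{
  "challenger_version": "v4",
  "mode": "shadow",
  "traffic_pct": 100
}
```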
In shadow mode, set `traffic_pct` to 100 so every request is scored by both models. In live split mode, start with a small percentage (e.g., 10%).
Monitor performance
The web dashboard shows side-by-side metrics for champion and challenger: score distributions, grade distributions, approval rates at various cutoffs, and population stability index (PSI). Compare these over at least two weeks of production traffic.
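PSI measures how much a score distribution has shifted between two samples. A minimal sketch of the standard calculation, binning the baseline sample into quantiles and comparing bin proportions (this illustrates the metric itself, not Calibr's internal implementation):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.

    Bin edges come from the expected (baseline) sample's quantiles;
    proportions are floored at a small epsilon so sparse bins do not
    produce a division by zero or log of zero.
    """
    eps = 1e-4
    expected = sorted(expected)
    # Quantile edges from the baseline sample
    edges = [expected[int(len(expected) * i / bins)] for i in range(1, bins)]

    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            idx = sum(s >= e for e in edges)  # bin index for this score
            counts[idx] += 1
        return [max(c / len(scores), eps) for c in counts]

    e_pct = proportions(expected)
    a_pct = proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(a_pct, e_pct))
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift worth investigating, and above 0.25 as a significant shift.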
Promote or remove
If the challenger outperforms the champion, promote it to become the new champion. The previous champion is archived but not deleted — you can always roll back.
How Parallel Scoring Works
Regardless of mode, all models score every request. The engine loads both the champion and challenger specs, runs the scoring algorithm for each, and then decides which score to return based on the mode and traffic allocation.
This means you always have a complete comparison dataset. Even in live split mode, the champion's score is computed and logged for every request assigned to the challenger, and vice versa.
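The routing decision can be sketched as follows. A deterministic hash of the request ID maps each request to a bucket in [0, 100), so the same application always lands on the same side of the split. Function and field names are assumptions for illustration, not Calibr's actual API:

```python
import hashlib

def route(request_id, champion_score, challenger_score, mode, traffic_pct):
    """Decide which score to return; both scores are always logged.

    Hypothetical sketch: in shadow mode the champion score is always
    returned; in live split mode the challenger handles requests whose
    hash bucket falls below traffic_pct.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    logged = {"champion": champion_score, "challenger": challenger_score}

    if mode == "shadow" or bucket >= traffic_pct:
        return {"score": champion_score, "logged_scores": logged}
    return {"score": challenger_score, "logged_scores": logged}
```

Note that `logged_scores` always carries both results, which is what makes the complete comparison dataset possible regardless of mode.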
Rollback
If a promoted challenger shows degraded performance in production, you can instantly roll back to the previous champion.
Rollback restores the previous champion version and removes the current version from active scoring. Use rollback when you see:
- A significant shift in score distribution (high PSI)
- Unexpected approval/decline rate changes
- A spike in missing value warnings indicating data drift
- Regulatory or compliance concerns with the new model
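The promote-and-rollback lifecycle described above can be sketched as a simple version history, where the previous champion is archived rather than deleted (an illustrative model, not Calibr's implementation):

```python
class Deployment:
    """Minimal sketch of promote/rollback versioning.

    The last entry in the history is the active champion; earlier
    entries are archived champions, so rollback is always possible.
    """

    def __init__(self, first_version):
        self.history = [first_version]

    @property
    def champion(self):
        return self.history[-1]

    def promote(self, challenger_version):
        # The old champion stays in history, archived but recoverable.
        self.history.append(challenger_version)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous champion to roll back to")
        # Remove the current version; the previous champion is active again.
        return self.history.pop()
```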
Best Practices
- Start with shadow mode. Always validate a new model in shadow mode before switching to live split. This gives you a risk-free baseline comparison.
- Compare for at least 2 weeks. Short observation windows can be misleading due to seasonal effects, marketing campaigns, or application volume fluctuations.
- Monitor grade migration. Pay attention to how applicants shift between risk grades — not just the overall score distribution.
- Document the rationale. Regulatory requirements often mandate documentation of model changes. Keep a record of why you promoted or rejected a challenger.
- One challenger at a time. While the system supports multiple challengers, comparing more than one simultaneously makes it harder to isolate which changes drive performance differences.