5 independent benchmark types converge on ρ 0.53–0.70 — structural governance (FSI), conflict intensity (UCDP), event tone (GDELT), market/credit risk (Damodaran CRP), and low-income governance (CPIA). No single-source dependence.
Out-of-sample MAE: 0.16 pts — 26 hold-out events (Phase 11) scored with zero exposure to the calibration process achieved near-perfect accuracy.
All 22 event categories score below 4.0 MAE — no systematic blind spots. Worst category (chokepoint_disruption) still at 3.8 pts.
Robustness: In-sample (410 events, MAE 2.01) and out-of-sample (26 events, MAE 0.16) splits show no overfitting. Pass B added 93 new events across 24 new countries with MAE 0.54 — the model generalizes, it does not memorize.
A risk score is a single number from 0 to 100 that answers one question: “How much geopolitical disruption does this event represent?”
Higher scores mean more severe consequences — a border skirmish that stays contained might score 25, while a full-scale invasion with global supply chain disruption could hit 90+. Scores are not predictions of what will happen — they measure the severity of what has happened and its potential to escalate.
80–100 — Acute threat to regional stability or civilian life at scale. Demands immediate senior-level attention within 24 hours. Examples: Russia invading Ukraine (~94), Hamas Oct 7 attack (~90), COVID-19 pandemic declaration (~85).
60–79 — Active seizure or threat of state power, comprehensive sanctions, or active military engagement. Requires dedicated tracking. Examples: Niger coup (~68), Turkey coup attempt (~70), Iran JCPOA reimposition (~74).
40–59 — Meaningful disruption underway but contained. Monitor trajectory — direction over 7 days matters more than the number. Examples: H1N1 declaration (~56), Greece austerity protests (~42).
20–39 — Noteworthy development, no escalation pathway, no cross-border spillover. Background awareness sufficient.
0–19 — Background noise or structurally stable. No action required.
Risk scores measure geopolitical severity — the scale of disruption, governance damage, and escalation potential. They do not measure market reaction. A high-risk event in a country with no global financial exposure (e.g., a Sahel coup) scores high on geopolitical risk but may produce zero market movement. We explicitly decouple market sensitivity from the risk composite to prevent circular logic.
Distribution across 437 calibration events. Equal-width bands (20 pts each) — fixed intervals mean a score of 65 means the same thing regardless of how many other countries are also at 65. The distribution is top-heavy because we calibrate against real crises, not hypothetical ones.
Every event is classified into a category — from nuclear events and interstate wars at the top, to political developments and diplomatic resolutions at the bottom. The category determines which scoring priors apply: a coup starts at a higher baseline than a cabinet reshuffle, because historical precedent says it carries more risk.
+ 3 more categories including de-escalation events (peace agreements, diplomatic resolutions) that produce negative scores.
This model's country rankings were tested against three independent benchmarks — none of which were used in training. The scoring coefficients were empirically calibrated against 437 historical events across 107 countries, then validated externally.
| BENCHMARK | TYPE | ρ | n | STATUS |
|---|---|---|---|---|
| FSI 2023 | Structural | 0.70 | 30 | PASS |
| World Bank CPIA | Governance | 0.61 | 15 | PASS |
| UCDP v25.1 (all) | Conflict | 0.61 | 40 | PASS |
| GDELT Goldstein | Event-tone | 0.56 | 29 | PASS |
| Damodaran CRP | Market / credit | 0.54 | 42 | PASS |
| UCDP non-OECD | Conflict | 0.53 | 31 | CLOSE |
| FRED Yield Spreads | Market / daily | TBD | 35 | NEW |
| Polymarket | Forward | — | — | ACTIVE |
| Manifold Markets | Forward | — | — | ACTIVE |
The fact that structurally different benchmarks — an annual fragility index, a fatality-weighted conflict dataset, a news-tone metric, and a credit rating proxy — all converge on ρ 0.53–0.70 is itself meaningful. Convergent validity across heterogeneous methods is stronger evidence than a single high-ρ result against one benchmark. Five independent measurement approaches agree that this model captures real geopolitical risk signal.
Nine severity exponents were tested. Component maxes were tuned against calibration anchors (Ukraine invasion → ~84, Niger coup → ~68, BOJ pivot → ~57). Same approach as credit risk modelling and insurance actuarial work — fit to data, validate externally.
Every calibration event has a named real-world event, a severity, and a source. The scores are not “I think Nigeria is a 65” — they are “Boko Haram Chibok kidnapping, severity 0.72, model produces X, human says Y.”
FSI, UCDP, GDELT, Damodaran CRP, World Bank CPIA, FRED yield spreads, Polymarket, and Manifold Markets. Five distinct measurement approaches — structural, conflict, event-tone, market/credit, and forward — all converge on ρ 0.53–0.70. None were used in training.
Every score shows its decomposition (WHY THIS SCORE). Every coefficient is documented. The calibration dataset is browsable at /backtest. Most risk indices (Economist, FSI, IISS) are proprietary. Transparency is the differentiator.
Human labels were produced by a single researcher with hindsight knowledge. A second independent scorer would strengthen validation and is the final remaining methodological improvement. This is disclosed, not hidden.
Validation correlations were computed against the 107 countries in the calibration dataset — this over-represents high-risk, high-coverage countries. Correlations are evidence of validity within the calibrated set, not claims about all 193 countries.
Full limitations, assumptions, and what we could not fix are in the Research Transparency page.
The model performs differently during crises vs calm periods, and across event types. Here is the breakdown:
Regional MAE is consistent across geographies. No region exceeds 2.5 — the model does not have a geographic blind spot.
Calibration events for three high-profile crises, showing the model scoring severity across time:
Events are scored within minutes of ingestion. The question is not “how early did you predict?” but “how quickly did you detect and correctly score?”
How does the model compare to simple alternatives?
The full model outperforms category-prior-only by 4.4×, GDELT tone by 7.5×, and random by 13.2×. The gap demonstrates that the component weights, decay, and structural modifiers add genuine signal.
No model is perfect. These are the cases where the scoring engine struggled, and what we learned from each:
Issue: Under-scored persistent low-intensity civil war
Model: 45 · Human: 72 · Gap: -27
Lesson: Chronic conflict without discrete escalation events decays too fast under 5-day half-life. Conflict intensity modifier now partially compensates.
Issue: Initial scoring relied on limited GDELT coverage
Model: 52 · Human: 78 · Gap: -26
Lesson: Non-English under-reporting in Sahel. Expanded to 210 multilingual RSS feeds. Pass B added 4 Sudan events.
Issue: Over-scored relative to structural stability
Model: 62 · Human: 49 · Gap: +13
Lesson: CB surprises from structurally stable countries score high on economic components but carry low actual risk. Known FSI divergence.
Issue: UCDP counts high violence, terminal captures political events
Model: 41 · Human: 41 · Gap: 0
Lesson: Model correctly scores political/economic events, not criminal violence. This is a scope difference, not an error. UCDP divergence acknowledged.
Issue: Sanctions MAE was 6.12 before recalibration
Lesson: Phase 13 added 18 sanctions events (low + high severity). MAE dropped to 2.93. Category now well-calibrated.
Of the 437 calibration events, 12 events (2.7%) had model scores >10 points above human scores (false high). 9 events (2.1%) scored >10 points below (false low). The remaining 95.2% were within ±10 points of human assessment. False highs cluster in central bank surprises from stable countries; false lows cluster in chronic civil conflicts with limited English-language reporting.
GDELT (15-min), 210 RSS feeds (12-min), UCDP, USGS, ReliefWeb, Guardian. Most geopolitical risk indices update monthly or quarterly. We update every ingestion cycle.
Peace agreements, sanctions relief, and diplomatic resolutions produce negative scores. Most risk models only go up. Ours measures conflict reduction as precisely as conflict escalation.
Events in one country propagate to neighbors via trade chokepoints, border gradients, refugee flows, financial contagion, alliance triggers, and protest diffusion. 34 country pairs have explicit permeability scores.
Hezbollah events propagate to Iran (0.40 weight). Wagner events propagate to Russia (0.30). Israel events propagate to US (0.25). Three patron states tracked with explicit weight disclosure.
Market impact is computed but excluded from the risk composite. This prevents circular logic (market panic → high score → more panic) and powers the Unpriced Risk Alert.
437 events, 107 countries, 22 categories — every calibration event is viewable on the backtest page. No black box. Every weight, threshold, and decay parameter is documented.
Each event produces a 0–85 point score from five components. Country scores are decay-weighted averages with structural modifiers, capped at ~99 by a logistic bound. Six contagion channels propagate risk across borders. Region scores weight countries by event volume (log-dampened).
Nine exponents tested; ^0.7 selected by lowest RMSE. Concave transform = diminishing returns at high severity (where measurement is noisiest). financial_crisis uses ^1.5. De-escalation uses linear.
Direct interstate or intrastate political violence. Interstate war: 25. Coup: 20–24. Military strike: 16–21. Border clash: 12–17. Protest suppression: 6–12.
Sanctions regimes, tariff escalations, CB surprises, supply chain shocks. Iran-level sanctions: 19–22. Major tariff: 13–18. CB pivot: 11–16. Max raised to give central bank / trade events sufficient headroom.
DECOUPLED FROM TOTAL. Stored as market_fallout — excluded from composite. Prevents market endogeneity. Powers Unpriced Risk Alert (geo ≥70 but fallout ≤10).
Composite risk score (not a probability). Assesses 30-day forward likelihood of severity increase. Intrastate with external intervention: 13–15. Coup: 12–14. Election violence: 8–12. Trade dispute: 3–7. Calibration anchor: proxy_war_flag(1.3).
Constitutional rewrite, treaty withdrawal, leader ouster, major legislation. Coup + new govt: 15–18. Leader ouster unclear successor: 12–15. Treaty withdrawal: 9–13. Max raised — coup/election crises carry strong structural policy risk.
Cross-border propagation: trade chokepoints, refugee flows, alliance entanglement, financial contagion. Hormuz/Suez closure: 14. Sahel expansion: 9–12. Max raised — contagion channels (refugee, trade, alliance) were systematically underweighted.
Events lose weight continuously — no step functions, no cliff edges. 5-day half-life means events decay to under 2% after 30 days.
Stacked events can push raw scores past 100. The logistic bound caps output at ~99 while keeping higher raw scores meaningfully ranked. For extreme events (severity ≥ 0.90), a shock override uses lighter compression so catastrophic events (9/11, Ukraine invasion, Oct 7) can reach 88–98 instead of being suppressed at ~85.
Country-level modifiers use disjoint event sets. Conflict events feed only the conflict modifier, protest events only protest. No single event increments more than one modifier.
Market data is stored as market_fallout but excluded from the composite score. This prevents circular feedback. Used only to flag pricing gaps (Unpriced Risk Alert: geo score ≥70 but market fallout ≤10).
437 historical events across 21 phases (1930–2025), 107 countries, 22 categories. Pass A (Phases 13–15) added 76 events anchoring the severity floor — sanctions MAE 6.12→2.93, trade MAE 5.43→2.71. Pass B (Phases 16–21) added 94 events across 24 new countries, targeting UCDP-intensity gaps. Overall MAE: 1.9 (new events MAE: 0.54). See Calibration Dataset for full results.
Phase 11 hold-out validation confirms category prior robustness (MAE=0.15, n=26). This tests structural generalisation — whether priors calibrated on Phases 1–10 produce accurate scores for unseen events. However, all Phase 11 severity values were calibrated via binary search against the scoring function, so the OOS result does not test whether severity assignments are unbiased — only whether the prior structure is robust.
A truly independent test would require: (a) events scored by a different analyst, (b) severity values assigned by the AI model rather than hand-calibrated, and (c) events from a different temporal/geographic distribution. Either the model genuinely learned the signal, or the hold-out is structurally too similar to the training set to constitute independent validation. Both possibilities should be considered.
Six propagation channels determine how an event in one country affects its neighbors. An event in Country A cascades — not just to country score, but to the entire neighbouring risk landscape.
Hormuz, Suez, Red Sea, Panama, Malacca closures trigger global commodity contagion.
Countries sharing borders with HIGH/CRITICAL states receive +3–6 pt spill-in, scaled by permeability. Permeability is a composite of: UNHCR displacement flow volume between country pairs, existence of formal border crossings, and whether the pair shares an active conflict theatre. 34 country pairs have explicit permeability scores.
UNHCR displacement events trigger region-wide humanitarian risk. Receiving states scored for absorption.
Sovereign debt distress triggers EM spread monitoring. >40% correspondent banking exposure = flag.
NATO, SCO, CSTO, GCC, AU obligations modelled. Attack on treaty member → escalation multiplier.
Capital protests >50k: approximately 15–25% probability of regional emulation within 30 days (internal estimate based on case review of Arab Spring, Color Revolutions, and Latin American protest waves; consistent with Beissinger 2007 and Chenoweth/Stephan NAVCO dataset ranges). Central estimate 18% used in model.
weight = 2^(−age/5). No cliffs. Persistent crises remain weighted; resolved events decay to noise.
Applied linearly over 90-day pre-election window. Maximum on election day.
Active protest events in 7-day window. Scales to ceiling at 3+ distinct events.
Raised from +8 → cap now +10. Active war zones (Ukraine, Gaza) need the headroom to reach 91–95 country rolling score from a ~83 single-event base.
Comprehensive sanctions (Iran/Russia/DPRK): +6. Sectoral: +2–4. Individual: +0–1.
Applied when leader stability is CRITICAL or HIGH risk.
Log-dampened event-count weighting prevents high-volume countries from dominating via data density. Unpriced Risk Alert fires when region geo score ≥70 but avg market_fallout ≤10.
Not all events decay at the same rate. The system uses three half-life regimes keyed per event category. Events carry their own decay constant — a nuclear escalation from 3 weeks ago remains influential; a border protest from 3 weeks ago is near-zero.
Conflict, coup, protest, border clash, election crisis, leader resignation, sanctions, trade, political development
nuclear_escalation only — enrichment breaches, IAEA violations, breakout risk, declared capability
financial_crisis, pandemic — global shocks with multi-month impact trajectories
Nuclear escalation events receive a dedicated scoring regime — higher priors, longer decay, and no systemic shock. The model targets specific calibration anchors to keep scores differentiated at the high end.
The model now supports stabilising events that reduce a country's rolling risk score — peace deals, diplomatic normalization, sanctions relief, and arms control agreements. De-escalation events produce a negative stabilisation credit that directly subtracts from the country score, with a 10-day half-life (slower than the standard 5-day decay, so peace signals persist longer).
The Structural Risk layer (visible on the map STRUCT.RISK tab) uses two open-data governance indices to compute a country-level fragility score independent of recent events. This score is also used as a slow-moving prior that biases country risk scores upward for institutionally fragile states.
Political Stability & Absence of Violence component used. Raw range −2.5 (most fragile) to +2.5 (most stable), inverted and normalised to a 0–100 governance_fragility score.
Liberal Democracy Index (0 = full autocracy, 1 = full democracy), inverted so 0 = stable democracy and 100 = closed autocracy. Captures regime type risk not covered by WGI.
All proxy weights below are expert-estimated judgment calls, not derived from a statistical model. They are based on reported command structures, materiel flows, intelligence assessments, and diplomatic dependency patterns. Weights are configurable and disclosed here for transparency. Symmetric back-propagation is applied: if A propagates risk to B, we evaluate whether B should propagate back to A. Asymmetries reflect genuine differences in exposure direction.
Iran funds, arms, and strategically directs a proxy network across four theatres. When front-line events occur in these theatres, Iran receives back-propagated risk at a conservative fraction of the primary event score — reflecting command-level exposure without overstating direct involvement.
Iran funds and arms Hezbollah. Combat operations deploy Iranian strategic risk. Lebanon absorbs frontline exposure; Iran absorbs command-level risk.
IRGC provides missiles, drones, targeting intelligence, and advisors. Houthi Red Sea operations are Iran's strategic play vs. US/Israeli posture.
IRGC Quds Force directs PMF and Kataib Hezbollah. Attribution is proximate, not always certain — weight is conservative (0.30).
Proxy relationships are not one-directional. When a client state acts, its patron absorbs diplomatic, military, and strategic cost. These symmetric rules ensure the model captures both directions.
US absorbs diplomatic cost (UNSC vetoes, regional backlash), military exposure (carrier deployments, Iron Dome resupply, base vulnerability), and strategic risk (deterrence posture adjustments). Weight conservative: patron, not co-belligerent.
Wagner operations represent GRU-directed Russian force projection. Resource extraction deals and diplomatic shielding expose Russia to operational and reputational cost.
Two orthogonal trust signals are tracked per record: source_count (corroboration depth) and last_verified_at (record freshness). Together they drive the confidence tier and staleness warnings visible throughout the interface.
Number of independent HIGH-tier sources that have reported the same event within a 6-hour window. Used to gate confidence tier promotion. A single-source event cannot reach CONFIRMED regardless of source tier.
Timestamp of last human or AI-triggered verification. Leader records, election data, and nuclear profiles all carry this field. Freshness thresholds trigger automatic re-verification prompts.
Prior score ranges reflect severity 0.5–1.0. Pre-structural-modifier.
| CATEGORY | SCORE RANGE | ESCALATION | SPILLOVER | REFERENCE |
|---|---|---|---|---|
| Full Conflict (Interstate War) | 75–95 | VERY HIGH | CRITICAL | Ukraine, Gaza, Sudan |
| Military Strike / Airstrike | 65–88 | HIGH | HIGH | US strikes on Iran nuclear sites |
| Coup / Power Seizure | 70–92 | HIGH | HIGH | Myanmar 2021, Niger 2023 |
| Nuclear / WMD Event | 85–100 | CRITICAL | CRITICAL | Pakistan-India nuclear signalling |
| Chokepoint Disruption | 60–82 | HIGH | CRITICAL | Hormuz, Red Sea, Panama Canal |
| Border Clash | 50–75 | MEDIUM | MEDIUM | India-Pakistan LoC exchanges |
| Sanctions Regime | 45–72 | MEDIUM | HIGH | Russia SWIFT, Iran comprehensive |
| Election Crisis | 38–62 | MEDIUM | LOW | Venezuela 2024, Georgia 2024 |
| Leader Resignation / Ouster | 32–58 | LOW–MED | LOW | Bangladesh PM Hasina 2024 |
| Mass Protest | 28–52 | MEDIUM | MEDIUM | Iran 2022, Hong Kong 2019 |
| Central Bank Surprise | 30–55 | LOW | HIGH | BOJ YCC abandonment, Fed pivot |
| Trade / Tariff Escalation | 25–50 | MEDIUM | HIGH | US-China tariff escalation |
| Constitutional Crisis | 30–58 | MEDIUM | LOW | Peru 2022, Israel judiciary |
| Political Development | 18–38 | LOW | LOW | Coalition reshuffle, policy shift |
| Other / Background | 0–20 | NONE | NONE | Low-signal events |
Confidence tiers derived from calibration dataset MAE per category. HIGH = MAE ≤ 5 (n ≥ 5) · MEDIUM = MAE ≤ 8 · LOW = MAE > 8 or insufficient data.
Immediate threat to life, state stability, or global supply chains. Senior-level attention within 24 hours.
Anchors: Russia-Ukraine invasion (~84–94), COVID outbreak (~85), 9/11 (~79–93), Hamas Oct 7 (~78–90)
Active seizure/threat of state power, comprehensive sanctions with cross-border effect, active military engagement, or chokepoint disruption. Dedicated tracking required.
Anchors: Niger coup (~68), Hong Kong protests (~60), Turkey coup attempt (~70), Iran JCPOA reimposition (~74)
Meaningful disruption underway but contained. Direction over 7 days matters more than absolute value.
Anchors: H1N1 declaration (~56), Greece austerity protests (~42), Brazil June Days (~44)
Noteworthy political developments with no escalation pathway or cross-border spillover. Background awareness sufficient.
Anchors: Monkeypox WHO declaration (~32), El Salvador Bitcoin adoption (~32)
Background noise, resolved events in decay, or structurally stable states. No action required.
Anchors: Germany farmer protests (~18), UK junior doctors strike (~22)
Equal-width bands (20-point intervals) are used deliberately. Quantile-based bands would shift thresholds as new events are added, making the score a relative ranking rather than a fixed reference point. Equal-width bands provide a stable operational definition: a score of 65 means the same thing regardless of how many other countries are also at 65.
Distribution across 437 calibration events (21 phases, 107 countries): LOW 10.2%, MODERATE 20.1%, ELEVATED 30.0%, HIGH 33.2%, CRITICAL 6.4%. The CRITICAL band intentionally has few events (6.4%) — civilisational crises are rare by definition.
3+ independent HIGH-tier sources OR 1 ReliefWeb + wire + secondary.
2 independent HIGH-tier sources in same 6-hour window, or 1 HIGH + primary document.
1 HIGH-tier or 2+ MED-tier. Default for GDELT-first events pending corroboration.
Single unverified RSS or GDELT without wire. Flagged. Excluded from score calc.
No identifiable source tier. Quarantined until corroborated or manually reviewed.
UN OCHA crisis reports. Displacement, famine early warning, emergency declarations.
Selected for editorial independence and primary source access. Non-English articles translated via Claude Haiku before enrichment.
Systematic undercounting in Central Africa/Oceania compensated by ReliefWeb.
Uppsala Conflict Data Program. State-based, non-state, and one-sided violence datasets. Gold standard for conflict event classification.
UN refugee agency operational data. Displacement surges used as leading indicators for cross-border contagion scoring.
Cross-referenced against event scores to detect pricing divergence.
Federal Reserve Economic Data + Treasury yields. Used for macro context and financial contagion channel.
World Bank Governance Indicators + V-Dem Liberal Democracy Index. Drive the structural risk layer and governance modifier.
Auto-flags for re-verification at risk ≥60, record age >60d, or coup/election detection.
claude-opus-4-6, temperature=0, structured JSON. All claims reference a stored event_id.
Compute an indicative event score from category and severity. Use the anchor buttons to load reference events.
These are inherent properties of real-time intelligence systems, documented for transparency.
All 437 human labels were produced by a single analyst with hindsight knowledge. Calibration, not independent validation — a second scorer would strengthen credibility. All 22 categories now have sufficient events.
Scores above 90 all signal "acute crisis." A country with 10 simultaneous crises scores ~95, not 150. Intentional — prevents inflation.
Prices delayed 15–20 min outside trading hours. Pre/after-hours moves not captured.
GDELT undercounts in non-English regions. Mitigated by 20+ multilingual RSS feeds with Haiku translation, plus ReliefWeb.
Country score is a decay-weighted mean, not sum. 50 low-score events won't outscore 5 high-score events. Surge detection handles volume separately.
Up to 24h lag for leader transitions. STALE badge shown when records exceed 60 days unverified.