BeyondSight Protocol of Validated Use (PVU-BS) — English
Version: 1.0 · Status: Active
Spanish version: PVU_BeyondSight_ES.md
1. Purpose
The Protocol of Validated Use (PVU-BS) establishes the minimum evidence standard required to claim that BeyondSight provides validated predictive performance on real-world opinion dynamics data.
It distinguishes clearly between:
- Sample cases (`datasets/pvu_cases/sample_case_*`): synthetic data used only to verify the pipeline runs correctly. No scientific claims may be drawn from them.
- Real validation: N ≥ 10 independent cases with held-out test sets, run under this protocol.
2. Operational Definitions
2.1 Independent Case
A case is the tuple {network, time_series, interventions, metadata} described in a PVU case folder.
Two cases are considered independent if both of the following hold:
- Network non-overlap: the sets of nodes (users/communities) share fewer than 10 % of their members, OR the graph structures are drawn from genuinely distinct populations.
- Temporal non-overlap: their time windows do not coincide with the same unmodelled global shock (e.g., a worldwide pandemic event).
If they do share a shock, they must share a common `cluster_id` in `meta.json`, and evaluation must be done at the cluster level (one DM test per cluster, not per case), correcting for the reduced effective sample size.
2.2 cluster_id Usage
cluster_id groups cases affected by the same structural or temporal confound.
Rules:
- Set `cluster_id` in `meta.json` when cases share a platform, regional event, or overlapping time window.
- Statistical tests and effect sizes are computed per cluster when N_cluster > 1; individual case results are reported as supplementary.
- Cluster-level metrics are the weighted mean of per-case metrics (weight = N_test of each case).
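A minimal sketch of this aggregation rule, assuming each case record carries its `cluster_id`, its MAE, and its test-set size (the field and function names here are illustrative, not the repository's actual schema):

```python
# Sketch of the §2.2 cluster aggregation: weighted mean of per-case MAE,
# weight = N_test of each case. Field names are assumptions for illustration.
from collections import defaultdict

def cluster_metrics(cases: list[dict]) -> dict[str, float]:
    """Weighted mean of per-case MAE per cluster (weight = N_test)."""
    groups = defaultdict(list)
    for case in cases:
        # A case without a cluster_id forms its own singleton cluster.
        groups[case["cluster_id"] or case["case_id"]].append(case)
    return {
        cid: sum(c["mae"] * c["n_test"] for c in members)
             / sum(c["n_test"] for c in members)
        for cid, members in groups.items()
    }
```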
2.3 Target Variable
BeyondSight is validated on a compound target that captures both level and dynamics of opinion polarization:
| Component | Definition | Metric |
|---|---|---|
| Polarization Index P(t) | Variance of the opinion distribution + fraction of agents at extremes (\|opinion\| above an extremity threshold) | MAE, RMSE |
| Turning-Point Skill (TPS) | Ability to predict when a regime transition occurs (local extremum in P) | Precision, Recall, F1, Mean Timing Error |
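For concreteness, one plausible reading of P(t), assuming opinions lie in [−1, 1] and the two components are equally weighted so the index stays in [0, 1]; the exact threshold and weighting are not fixed by the table above:

```python
# One plausible reading of P(t); the 0.8 threshold and the equal weighting
# of the two components are assumptions, not part of the specification.
import numpy as np

def polarization_index(opinions: np.ndarray, threshold: float = 0.8) -> float:
    """Variance of opinions in [-1, 1] plus fraction of agents at extremes,
    averaged so the result stays in [0, 1]."""
    variance = opinions.var()                         # at most 1 on [-1, 1]
    extreme_fraction = (np.abs(opinions) >= threshold).mean()
    return 0.5 * (variance + extreme_fraction)
```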
2.4 Train / Validation / Test Split
| Split | Fraction | Purpose |
|---|---|---|
| Train | 70 % | Model calibration, parameter selection |
| Validation | — | (optional) hyperparameter tuning |
| Test | 30 % | Held-out; must not be touched before final evaluation |
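A chronological split consistent with the table above might look as follows; `split_case` is a hypothetical name, and the runner may implement the split differently:

```python
# Chronological 70/30 split sketch: train on the earliest rows, hold out the
# rest. The test rows must remain untouched until final evaluation (§3).
import pandas as pd

def split_case(ts: pd.DataFrame, train_frac: float = 0.7):
    ts = ts.sort_values("date").reset_index(drop=True)
    cut = int(len(ts) * train_frac)
    return ts.iloc[:cut], ts.iloc[cut:]   # (train, held-out test)
```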
3. Anti-Leakage Rules
The following actions constitute test leakage and invalidate the validation run:
- Looking at test-split metrics (even in aggregate) before the model configuration is frozen.
- Adjusting prompts, temperature, model provider, regime rules, or any parameter after seeing test results, even informally.
- Selecting or discarding cases post-hoc to improve reported scores.
- Running the model multiple times on the same test set and reporting the best run (unless explicitly doing an LLM consistency check under § 6).
- Using test-split observations as training context in the LLM prompt.
Logging: A pre-registration document (see preregistration_template_EN.md) must be committed to the repository before breaking the test seal.
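One way to make the seal auditable is to quote a hash of the frozen configuration in the preregistration document, so any post-hoc change is detectable. This helper is illustrative only and not part of the repository's tooling:

```python
# Hypothetical "test seal" helper: fingerprint the frozen config so the
# preregistration document can pin the exact configuration under test.
import hashlib
import pathlib

def config_fingerprint(path: str = "configs/pvu.yaml") -> str:
    """SHA-256 of the frozen config, to be quoted in the preregistration doc."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
```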
4. Statistical Criterion
4.1 Primary Hypothesis
BeyondSight produces significantly lower MAE on the held-out test set compared to the naive baseline (last-value persistence), after Holm–Bonferroni correction.
4.2 Test Procedure
- Compute forecasts for all baselines and BeyondSight on the test split of each case.
- For each case, run a two-sided Diebold–Mariano (DM) test of BeyondSight vs each baseline using squared-error loss.
- Collect the M raw p-values per case (one per baseline).
- Apply Holm–Bonferroni correction across the M × N comparisons (M baselines × N cases). Use `benchmarks/metrics.py::holm_bonferroni`.
- Report both raw and adjusted p-values (a sketch of both steps follows this list).
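A minimal sketch of the per-case DM test and the Holm step-down, using the simple h = 1 form of the DM statistic with a normal approximation; the canonical implementation lives in `benchmarks/metrics.py`, and the names below are illustrative:

```python
# Sketch of §4.2: DM test with squared-error loss (h = 1, normal
# approximation) and Holm-Bonferroni adjusted p-values.
import numpy as np
from scipy.stats import norm

def dm_test(e_model: np.ndarray, e_baseline: np.ndarray) -> tuple[float, float]:
    """Two-sided Diebold-Mariano test on forecast-error arrays."""
    d = e_model**2 - e_baseline**2            # loss differential per test step
    t = len(d)
    dm = d.mean() / np.sqrt(d.var(ddof=1) / t)
    p = 2 * norm.sf(abs(dm))                  # two-sided normal p-value
    return dm, p

def holm_bonferroni(pvals: list[float]) -> list[float]:
    """Step-down Holm adjustment; returns adjusted p-values in input order."""
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)  # enforce monotonicity, cap at 1
    return adjusted.tolist()
```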
4.3 Effect Sizes (mandatory alongside p-values)
| Metric | Interpretation |
|---|---|
| ΔMAE = MAE_baseline − MAE_BS | Absolute improvement in MAE |
| ΔRMSE | Absolute improvement in RMSE |
| Directional accuracy lift | BS dir. acc. − naive dir. acc. |
| TPS F1 | Precision–Recall balance on turning points |
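Directional accuracy, the only metric above not defined elsewhere in this document, can be read as sign agreement on step-to-step changes; a sketch under that assumption:

```python
# Assumed reading of "directional accuracy": fraction of steps where the
# forecast predicts the correct sign of the change in P.
import numpy as np

def directional_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of steps with the correct sign of the first difference."""
    return (np.sign(np.diff(y_true)) == np.sign(np.diff(y_pred))).mean()
```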
4.4 Acceptance Criteria (by PVU level)
| Level | Min cases | DM vs naive | Effect size | TPS F1 |
|---|---|---|---|---|
| Bronze | 10 | p_adj < 0.05 | ΔMAE > 0 | — |
| Silver | 20 | p_adj < 0.05 vs all baselines | ΔMAE > 5 % (relative) | ≥ 0.50 |
| Gold | 30 | p_adj < 0.05 vs all + external replication | ΔMAE > 10 % (relative) | ≥ 0.70 |
5. Baselines (mandatory)
All of the following must be included in every validation run:
| ID | Name | Description |
|---|---|---|
| B1 | Naive | Last observed value persisted for all forecast steps |
| B2 | Moving Average | Mean of last 4 observations |
| B3 | AR(1) | First-order autoregression fitted by OLS |
| B4 | Random Regime | Random walk with training-calibrated noise |
Implemented in `benchmarks/baselines.py`.
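For reference, illustrative re-implementations of B1–B4; the canonical versions live in `benchmarks/baselines.py`, and the signatures here are assumptions:

```python
# Illustrative sketches of the four mandatory baselines (§5).
import numpy as np

def naive(train: np.ndarray, horizon: int) -> np.ndarray:
    """B1: persist the last observed value for every forecast step."""
    return np.full(horizon, train[-1])

def moving_average(train: np.ndarray, horizon: int, window: int = 4) -> np.ndarray:
    """B2: mean of the last `window` observations, held constant."""
    return np.full(horizon, train[-window:].mean())

def ar1(train: np.ndarray, horizon: int) -> np.ndarray:
    """B3: first-order autoregression fitted by OLS, iterated forward."""
    x, y = train[:-1], train[1:]
    X = np.column_stack([np.ones_like(x), x])
    intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]
    preds, last = [], train[-1]
    for _ in range(horizon):
        last = intercept + slope * last
        preds.append(last)
    return np.array(preds)

def random_regime(train: np.ndarray, horizon: int, seed: int = 42) -> np.ndarray:
    """B4: random walk with step noise calibrated on training increments."""
    rng = np.random.default_rng(seed)
    sigma = np.diff(train).std(ddof=1)
    steps = rng.normal(0.0, sigma, horizon)
    return np.clip(train[-1] + np.cumsum(steps), 0.0, 1.0)  # P stays in [0, 1]
```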
6. LLM Consistency Check
When BeyondSight is run in LLM mode:
- Run the same forecast 5 times with different random seeds.
- Compute the coefficient of variation (CV = std / mean) of the reported MAE across runs.
- If CV > 0.15 (15 %), flag the result and report it; do not suppress it.
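A sketch of the check, where `run_forecast` stands in for however BeyondSight is invoked in LLM mode and is assumed to return that run's MAE:

```python
# Sketch of the §6 consistency check: 5 seeded runs, CV of the MAE,
# flag (never suppress) results above the 15 % threshold.
import numpy as np

def llm_consistency(run_forecast, seeds=(1, 2, 3, 4, 5), cv_limit=0.15):
    maes = np.array([run_forecast(seed=s) for s in seeds])  # MAE per run
    cv = maes.std(ddof=1) / maes.mean()
    if cv > cv_limit:
        print(f"FLAG: MAE coefficient of variation {cv:.2%} exceeds 15 %")
    return cv
```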
7. Reproducibility Requirements
Every validation run must produce and archive:
- `configs/pvu.yaml` (frozen copy with all parameters).
- `reports/validation/<run_id>/metrics.json` (all per-case metrics).
- `reports/validation/<run_id>/report.md` (human-readable summary).
- Git commit SHA of the codebase.
- Python package versions (`pip freeze`).
- LLM provider + model name + temperature (if LLM mode).
- `PYTHONHASHSEED` and `--seed` value.
The runner (`benchmarks/runner.py`) writes `metrics.json` and `report.md` automatically.
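The remaining artifacts (git SHA, package list, seeds) can be captured alongside those files; an illustrative helper, not part of the runner:

```python
# Hypothetical capture of the environment artifacts listed in §7.
import os
import subprocess
import sys

def run_manifest(seed: int) -> dict:
    """Collect git SHA, installed packages, and seed values for archiving."""
    return {
        "git_sha": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "pip_freeze": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True).splitlines(),
        "PYTHONHASHSEED": os.environ.get("PYTHONHASHSEED"),
        "seed": seed,
    }
```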
8. How to Run
```bash
# Offline mode (no API key required — CI default):
PYTHONHASHSEED=42 python -m benchmarks.runner \
  --cases datasets/pvu_cases \
  --offline \
  --out reports/validation/ci \
  --seed 42

# LLM mode (requires OPENROUTER_API_KEY or OPENAI_API_KEY):
PYTHONHASHSEED=42 python -m benchmarks.runner \
  --cases datasets/pvu_cases \
  --llm \
  --out reports/validation/llm_run \
  --seed 42
```
9. Case File Format
Each case lives in its own sub-folder under datasets/pvu_cases/:
```
sample_case_001/
├── timeseries.csv       # required: columns date, P (+ optional volume)
├── interventions.json   # list of {date, label, source}
└── meta.json            # case metadata
```
timeseries.csv columns
| Column | Type | Description |
|---|---|---|
| `date` | string (ISO 8601) | Observation date |
| `P` | float [0, 1] | Polarization index |
| `volume` | float (optional) | Activity volume proxy |
meta.json required fields
| Field | Type | Description |
|---|---|---|
| `case_id` | string | Unique identifier |
| `domain` | string | e.g. `political_opinion` |
| `source` | string | Data origin (or `synthetic`) |
| `cluster_id` | string \| null | Cluster for grouped tests |
| `license` | string | Data license |
| `note` | string | Free-text notes |
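A minimal loader that validates a case folder against the tables above; the function name and checks are illustrative, not the repository's actual API:

```python
# Sketch of a §9 case loader: read the three files and enforce the
# documented required columns and meta.json fields.
import json
import pathlib
import pandas as pd

REQUIRED_META = {"case_id", "domain", "source", "cluster_id", "license", "note"}

def load_case(folder: str):
    root = pathlib.Path(folder)
    ts = pd.read_csv(root / "timeseries.csv", parse_dates=["date"])
    assert {"date", "P"} <= set(ts.columns), "timeseries.csv needs date and P"
    assert ts["P"].between(0, 1).all(), "P must lie in [0, 1]"
    meta = json.loads((root / "meta.json").read_text())
    missing = REQUIRED_META - meta.keys()
    assert not missing, f"meta.json missing fields: {missing}"
    interventions = json.loads((root / "interventions.json").read_text())
    return ts, interventions, meta
```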
10. Glossary
| Term | Definition |
|---|---|
| PVU-BS | Protocol of Validated Use — BeyondSight |
| DM test | Diebold–Mariano test of equal predictive accuracy |
| Holm–Bonferroni | Step-down multiple-comparison correction |
| TPS | Turning-Point Skill (F1 on detected local extrema) |
| EWS | Early Warning Signal (Critical Slowing Down) |
| Test leakage | Any information flow from test split to model design |
| Sample case | Synthetic case for pipeline testing only |
See also: Preregistration template · Validation report template