BeyondSight Protocol of Validated Use (PVU-BS) — English

Version: 1.0 · Status: Active
Spanish version: PVU_BeyondSight_ES.md

1. Purpose

The Protocol of Validated Use (PVU-BS) establishes the minimum evidence standard required to claim that BeyondSight provides validated predictive performance on real-world opinion dynamics data.
It distinguishes clearly between:

Sample cases (datasets/pvu_cases/sample_case_*): synthetic data used only to verify the pipeline runs correctly. No scientific claims may be drawn from them.
Real validation: N ≥ 10 independent cases with held-out test sets, run under this protocol.

2. Operational Definitions

2.1 Independent Case

A case is the tuple {network, time_series, interventions, metadata} described in a PVU case folder.

Two cases are considered independent if both of the following hold:

Network non-overlap: the sets of nodes (users/communities) share fewer than 10 % of their members, OR the graph structures are drawn from genuinely distinct populations.
Temporal non-overlap: their time windows do not coincide with the same unmodelled global shock (e.g., a worldwide pandemic event).
If they do share a shock, they must share a common cluster_id in meta.json and evaluation must be done at the cluster level (one DM test per cluster, not per case), correcting for the reduced effective sample size.

2.2 `cluster_id` Usage

cluster_id groups cases affected by the same structural or temporal confound.
Rules:

Set cluster_id in meta.json when cases share a platform, regional event, or overlapping time window.
Statistical tests and effect sizes are computed per cluster when N_cluster > 1; individual case results are reported as supplementary.
Cluster-level metrics are the weighted mean of per-case metrics (weight = N_test of each case).

2.3 Target Variable

BeyondSight is validated on a compound target that captures both level and dynamics of opinion polarization:

Component	Definition	Metric
Polarization Index P(t)	Variance of the opinion distribution + fraction of agents at extremes (	opinion
Turning-Point Skill (TPS)	Ability to predict when a regime transition occurs (local extremum in P)	Precision, Recall, F1, Mean Timing Error

2.4 Train / Validation / Test Split

Split	Fraction	Purpose
Train	70 %	Model calibration, parameter selection
Validation	—	(optional) hyperparameter tuning
Test	30 %	Held-out; must not be touched before final evaluation

3. Anti-Leakage Rules

The following actions constitute test leakage and invalidate the validation run:

Looking at test-split metrics (even in aggregate) before the model configuration is frozen.
Adjusting prompts, temperature, model provider, regime rules, or any parameter after seeing test results, even informally.
Selecting or discarding cases post-hoc to improve reported scores.
Running the model multiple times on the same test set and reporting the best run (unless explicitly doing a LLM consistency check under § 6).
Using test-split observations as training context in the LLM prompt.

Logging: A pre-registration document (see preregistration_template_EN.md) must be committed to the repository before breaking the test seal.

4. Statistical Criterion

4.1 Primary Hypothesis

BeyondSight produces significantly lower MAE on the held-out test set compared to the naive baseline (last-value persistence), after Holm–Bonferroni correction.

4.2 Test Procedure

Compute forecasts for all baselines and BeyondSight on the test split of each case.
For each case, run a two-sided Diebold–Mariano (DM) test of BeyondSight vs each baseline using squared-error loss.
Collect the M raw p-values per case (one per baseline).
Apply Holm–Bonferroni correction across the M × N comparisons (M baselines × N cases). Use benchmarks/metrics.py::holm_bonferroni.
Report both raw and adjusted p-values.

4.3 Effect Sizes (mandatory alongside p-values)

Metric	Interpretation
ΔMAE = MAE_baseline − MAE_BS	Absolute improvement in MAE
ΔRMSE	Absolute improvement in RMSE
Directional accuracy lift	BS dir. acc. − naive dir. acc.
TPS F1	Precision–Recall balance on turning points

4.4 Acceptance Criteria (by PVU level)

Level	Min cases	DM vs naive	Effect size	TPS F1
Bronze	10	p_adj < 0.05	ΔMAE > 0	—
Silver	20	p_adj < 0.05 vs all baselines	ΔMAE > 5 %	≥ 0.50
Gold	30	p_adj < 0.05 vs all + external replication	ΔMAE > 10 %	≥ 0.70

5. Baselines (mandatory)

All of the following must be included in every validation run:

ID	Name	Description
B1	Naive	Last observed value persisted for all forecast steps
B2	Moving Average	Mean of last 4 observations
B3	AR(1)	First-order autoregression fitted by OLS
B4	Random Regime	Random walk with training-calibrated noise

Implemented in benchmarks/baselines.py.

6. LLM Consistency Check

When BeyondSight is run in LLM mode:

Run the same forecast 5 times with different random seeds.
Compute the coefficient of variation (CV = std / mean) of the reported MAE across runs.
If CV > 0.15 (15 %), flag the result and report it; do not suppress it.

7. Reproducibility Requirements

Every validation run must produce and archive:

configs/pvu.yaml (frozen copy with all parameters).
reports/validation/<run_id>/metrics.json (all per-case metrics).
reports/validation/<run_id>/report.md (human-readable summary).
Git commit SHA of the codebase.
Python package versions (pip freeze).
LLM provider + model name + temperature (if LLM mode).
PYTHONHASHSEED and --seed value.

The runner (benchmarks/runner.py) writes metrics.json and report.md automatically.

8. How to Run

# Offline mode (no API key required — CI default):
PYTHONHASHSEED=42 python -m benchmarks.runner \
    --cases datasets/pvu_cases \
    --offline \
    --out reports/validation/ci \
    --seed 42

# LLM mode (requires OPENROUTER_API_KEY or OPENAI_API_KEY):
PYTHONHASHSEED=42 python -m benchmarks.runner \
    --cases datasets/pvu_cases \
    --llm \
    --out reports/validation/llm_run \
    --seed 42

9. Case File Format

Each case lives in its own sub-folder under datasets/pvu_cases/:

sample_case_001/
├── timeseries.csv        # required: columns date, P (+ optional)
├── interventions.json    # list of {date, label, source}
└── meta.json             # case metadata

`timeseries.csv` columns

Column	Type	Description
`date`	string (ISO 8601)	Observation date
`P`	float [0, 1]	Polarization index
`volume`	float (optional)	Activity volume proxy

`meta.json` required fields

Field	Type	Description
`case_id`	string	Unique identifier
`domain`	string	e.g. `political_opinion`
`source`	string	Data origin (or `synthetic`)
`cluster_id`	string \| null	Cluster for grouped tests
`license`	string	Data license
`note`	string	Free-text notes

10. Glossary

Term	Definition
PVU-BS	Protocol of Validated Use — BeyondSight
DM test	Diebold–Mariano test of equal predictive accuracy
Holm–Bonferroni	Step-down multiple-comparison correction
TPS	Turning-Point Skill (F1 on detected local extrema)
EWS	Early Warning Signal (Critical Slowing Down)
Test leakage	Any information flow from test split to model design
Sample case	Synthetic case for pipeline testing only