Distributionally Robust Statistical Verification with Imprecise Neural Networks

Souradeep Dutta*, Michele Caprio*, Vivian Lin, Matthew Cleaveland*, Kuk Jin Jang,
Ivan Ruchkin*, Oleg Sokolsky, Insup Lee
PRECISE Center, University of Pennsylvania
* former members
HSCC 2025

Self-Evident Truths
• Safety guarantees for autonomous systems with
high-dimensional states are important — but difficult
○ Reachability analysis struggles to scale with the number of state dimensions
○ Statistical methods often assume a fixed input distribution → vulnerable to distribution shifts

Our Contributions (= Agenda)
1. Formulation of distributionally robust statistical
verification
2. Imprecise neural networks (INNs) to represent
state-wise performance uncertainty
3. Scalable uncertainty-guided active learning algorithm
4. Empirical results on MuJoCo with RL controllers

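Distributionally Robust Statistical Verification
• Autonomous system that produces a trajectory from an initial state
• Aspiration: guarantee that the performance ψ meets a threshold ε (ψ ≥ ε)
– Find ε and a distribution set s.t. the above holds over any initial-state distribution in the set
• Formally: a probabilistic guarantee on ψ ≥ ε that holds for every distribution in the set
[Figure: the set of distributions]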

Imprecise Neural Networks (INNs)
• INN := a set of N feedforward NNs
• Ensemble bounds: lower and upper bounds on the performance ψ, computed from the N networks
– Property: the bounds contain the predicted function with a specified average chance
– Uncertainty := the max disagreement among the models at a state
• Idea: use an INN to:
(a) give performance guarantees; (b) guide sampling of states
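
A minimal sketch of the ensemble idea, assuming the bounds are the pointwise minimum and maximum over the N networks (the slides do not preserve the exact formula). The helper names `init_mlp`, `forward`, and `inn_bounds` are hypothetical, and the networks here are untrained random ones used only to show the shapes involved.

```python
import numpy as np

def init_mlp(rng, in_dim, width=50, depth=2):
    """Random weights for a small ReLU network with `depth` hidden layers."""
    sizes = [in_dim] + [width] * depth + [1]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Evaluate one feedforward network on a batch of states x, shape [B, in_dim]."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:          # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h[:, 0]

def inn_bounds(ensemble, x):
    """Assumed envelope: pointwise min/max over members, plus their max disagreement."""
    preds = np.stack([forward(p, x) for p in ensemble])    # shape [N, B]
    lower, upper = preds.min(axis=0), preds.max(axis=0)
    return lower, upper, upper - lower

rng = np.random.default_rng(0)
ensemble = [init_mlp(rng, in_dim=4) for _ in range(3)]     # N = 3 networks
states = rng.uniform(-1.0, 1.0, size=(5, 4))
lower, upper, uncertainty = inn_bounds(ensemble, states)
```

For scalar outputs, the largest pairwise disagreement among the members equals the gap between the largest and smallest prediction, which is why `upper - lower` serves as the uncertainty here.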

Construction of Distribution Sets
• Start from a convex combination of uniform distributions
• Add “contamination” with an arbitrary distribution Q
[Figure: uniform distributions and their combined region, both axes scaled to 1.0; a marked point is not sampled, but within our distribution set]
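
One way to picture a member of such a distribution set is to sample from it. The sketch below assumes the standard contamination form (1 − α)·P + α·Q, where P is a convex combination of uniform distributions over axis-aligned boxes and α is a contamination level. The box shapes, the weights, and the function name `sample_contaminated` are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def sample_contaminated(rng, boxes, weights, alpha, sample_q, n):
    """Draw n states from (1 - alpha) * sum_i w_i * Uniform(box_i) + alpha * Q."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    out = np.empty((n, boxes[0][0].shape[0]))
    from_q = rng.random(n) < alpha                    # draws taken from Q
    which_box = rng.choice(len(boxes), size=n, p=weights)
    for k in range(n):
        if from_q[k]:
            out[k] = sample_q(rng)                    # arbitrary contaminating draw
        else:
            lo, hi = boxes[which_box[k]]
            out[k] = rng.uniform(lo, hi)              # uniform over the chosen box
    return out, from_q

rng = np.random.default_rng(1)
boxes = [(np.array([0.0, 0.0]), np.array([0.4, 0.4])),
         (np.array([0.5, 0.5]), np.array([1.0, 1.0]))]
samples, from_q = sample_contaminated(
    rng, boxes, weights=[0.3, 0.7], alpha=0.1,
    sample_q=lambda r: r.normal(0.5, 0.05, size=2), n=1000)
```

Varying the weights and the choice of Q sweeps out different distributions in the set, which is the sense in which a guarantee over the set is robust to shift.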

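Guarantee on System Performance
• Given the combined region, the Sherlock verifier computes the INN's worst-case lower estimate of ψ over it
• Set the performance threshold ε from that estimate; ε applies to all distributions in the distribution set
• Our distributional guarantee: the expected probability of ψ ≥ ε is greater than the target coverage (1 − λ⁻¹ in the experiments) for all distributions in the set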

Two-Step Overall Approach
1. Active learning: Sample a high-uncertainty batch & train an INN
2. Verification: Instantiate performance bounds with the INN

Active Learning Algorithm
Greedy strategy for exploring the high-dimensional space:
1. Draw initial samples from the state space and label them with the true performance function ψ
2. Train an INN to estimate the bounds on ψ and compute uncertainty
3. Use Sherlock verifier to identify the point with highest uncertainty
4. Sample a batch from the point’s δ-ball neighborhood
5. Retrain the INN with the updated dataset
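
The loop above can be sketched roughly as follows. Several pieces are stand-ins and should be read as assumptions: `psi` is a made-up performance function, each ensemble member is a random-feature regressor rather than a trained DNN, the highest-uncertainty point is found by random search instead of the Sherlock verifier, and the δ-ball is taken in the infinity norm.

```python
import numpy as np

def psi(x):
    """Hypothetical stand-in for the true performance function."""
    return np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 1] ** 2

def fit_member(rng, x, y, width=50):
    """One ensemble member: random ReLU features with a least-squares readout."""
    w = rng.normal(size=(x.shape[1], width))
    b = rng.normal(size=width)
    coef, *_ = np.linalg.lstsq(np.maximum(x @ w + b, 0.0), y, rcond=None)
    return lambda q: np.maximum(q @ w + b, 0.0) @ coef

def uncertainty(members, q):
    """Max disagreement among ensemble members at each query point."""
    preds = np.stack([m(q) for m in members])
    return preds.max(axis=0) - preds.min(axis=0)

def active_learning(rng, dim=2, n_init=30, iters=5, batch=10, delta=0.05, n_members=3):
    x = rng.uniform(0.0, 1.0, size=(n_init, dim))               # step 1
    y = psi(x)
    for _ in range(iters):
        members = [fit_member(rng, x, y) for _ in range(n_members)]   # steps 2 and 5
        cand = rng.uniform(0.0, 1.0, size=(2000, dim))
        center = cand[np.argmax(uncertainty(members, cand))]    # step 3 (random search)
        new_x = center + rng.uniform(-delta, delta, size=(batch, dim))  # step 4
        new_x = np.clip(new_x, 0.0, 1.0)
        x, y = np.vstack([x, new_x]), np.concatenate([y, psi(new_x)])
    members = [fit_member(rng, x, y) for _ in range(n_members)]  # final retrain
    return x, y, members

x, y, members = active_learning(np.random.default_rng(0))
```

Random search scales poorly with dimension, which is one reason a verifier-based maximization is attractive in the setting the talk describes.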

Experiments: Environments
• 10 MuJoCo environments from OpenAI Gym
• Control policies trained using Deep Deterministic Policy Gradient
• Performance function to verify: average reward over T steps
[Figure: example environments: Half-Cheetah (18 dimensions), Ant (29 dimensions), Humanoid (47 dimensions)]
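
As an illustration of the performance function being verified, the sketch below rolls out a closed loop and averages the reward over T steps. The toy scalar system, controller, and reward are hypothetical placeholders for a MuJoCo environment and a DDPG policy.

```python
import numpy as np

def average_reward(x0, policy, step, reward, T):
    """Roll out the closed loop from x0 and return the mean reward over T steps."""
    x, total = np.asarray(x0, dtype=float), 0.0
    for _ in range(T):
        u = policy(x)
        total += reward(x, u)
        x = step(x, u)
    return total / T

# Hypothetical toy system: scalar integrator with a proportional controller.
def policy(x):
    return -0.5 * x

def step(x, u):
    return x + 0.1 * u

def reward(x, u):
    return -float(np.sum(x ** 2))

psi_value = average_reward(np.array([1.0]), policy, step, reward, T=50)
```

Viewed this way, ψ is a deterministic function of the initial state, so a distribution over initial states induces a distribution over ψ.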

Experiments: Baselines
• For training distribution …
• To obtain similar guarantees, the test point must be...
– Conformal Prediction: … from the training distribution.
– Robust Conformal Prediction: … a bounded distance from the training distribution.
– INN (Ours): … from a family of distributions.
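
For readers unfamiliar with the first baseline, here is a sketch of standard split conformal prediction with absolute-residual scores. The talk does not state which score function or predictor the baseline used, so both are assumptions here, and the data are synthetic.

```python
import numpy as np

def split_conformal_interval(pred_cal, y_cal, pred_test, miscoverage=0.05):
    """Split conformal intervals from absolute residuals on a calibration set."""
    scores = np.abs(np.asarray(y_cal) - np.asarray(pred_cal))
    n = scores.size
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1.0 - miscoverage)) / n)
    q = np.quantile(scores, level, method="higher")
    pred_test = np.asarray(pred_test)
    return pred_test - q, pred_test + q

rng = np.random.default_rng(0)
x_cal, x_test = rng.uniform(size=500), rng.uniform(size=2000)
y_cal = 2.0 * x_cal + rng.normal(scale=0.1, size=500)
y_test = 2.0 * x_test + rng.normal(scale=0.1, size=2000)
lo, hi = split_conformal_interval(2.0 * x_cal, y_cal, 2.0 * x_test)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
```

The coverage guarantee of this procedure relies on calibration and test points being exchangeable, which is the assumption that a distribution shift breaks.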

Experiments: Methodology
• Evaluation procedure:
○ Set tightness β, exploration radius δ, and target coverage 1 − λ⁻¹ = 95%
○ Collect a starting set of states
○ Compute the distribution and performance threshold ε
○ Sample initial states from that distribution
○ Get the true ψ values, compare them to the INN/baseline intervals
• Evaluation metrics:
○ Coverage should be above the target 95%
○ Interval width should be low
• Two distributional settings for initial states:
○ In-distribution: matches the INN training distribution
○ Out-of-distribution: not the INN training distribution (but inside our set)
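
The two evaluation metrics can be computed as in the sketch below. The function name `coverage_and_width` and the toy intervals are illustrative; the target follows the 1 − λ⁻¹ convention with λ = 20.

```python
import numpy as np

def coverage_and_width(lower, upper, psi_true):
    """Empirical coverage (%) and mean width of the intervals [lower, upper]."""
    lower, upper, psi_true = map(np.asarray, (lower, upper, psi_true))
    covered = (psi_true >= lower) & (psi_true <= upper)
    return 100.0 * covered.mean(), float(np.mean(upper - lower))

lam = 20
target = 100.0 * (1.0 - 1.0 / lam)      # target coverage in percent

# Toy intervals: three of the four true values fall inside their interval.
cov, width = coverage_and_width(
    lower=[0.0, 1.0, 2.0, 3.0],
    upper=[1.0, 2.0, 3.0, 4.0],
    psi_true=[0.5, 1.5, 2.5, 4.5])
meets_target = cov >= target
```

Coverage and width pull against each other: wider intervals raise coverage trivially, so both numbers need to be reported together.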

Evaluation Highlights
• INNs maintain ≥99% coverage, unlike baselines
• INNs yield slightly larger intervals than baselines
• INNs demonstrate robustness to distribution shift
• Our verification takes dozens of seconds

Results: In-Distribution
                                     Conformal Prediction         INN (Ours)
Environment             State dim.   Coverage (%)  Interval size  Coverage (%)  Interval size
Ant                     29           93            3.2            99            4.2
Half-Cheetah            18           93            4.9 × 10^-1    100           3
Hopper                  12           97            2.9            100           2.2
Humanoid                47           97            2.7            100           3
Humanoid-Standup        47           96            143            99            260
Inverted Double Pend.   6            95            6.7 × 10^-3    100           5.1 × 10^-2
Inverted Pendulum       4            95            1.2 × 10^-3    100           4.1 × 10^-2
Reacher                 8            93            3.3 × 10^-2    100           7.6 × 10^-2
Swimmer                 10           93            2.8 × 10^-3    100           4.9 × 10^-2
Walker2d                18           95            5.2            99            5.8

Conformal prediction does
not always achieve the
target coverage of 95%

INNs always
exceed the target
coverage rate...
… with some
small cost in
interval size.

Results: Out-of-Distribution
Cov. = coverage (%); Size = interval size; the two Robust CP columns allow increasing amounts of distribution shift.
                              CP                      Robust CP (less shift)  Robust CP (more shift)  INN (Ours)
Environment             Dim.  Cov.  Size              Cov.  Size              Cov.  Size              Cov.  Size
Ant                     29    94    3.2               95    3.5               98    4.8               100   4.3
Half-Cheetah            18    …     …                 95    5.2 × 10^-1       98    6.5 × 10^-1       100   1.9
Hopper                  12    97    2.9               98    3.3               99    3.3               100   2.5
Humanoid                47    98    2.7               97    2.9               99    3.5               99    2.8
Humanoid-Standup        47    95    143               97    167               99    297               99    305
Inverted Double Pend.   6     94    6.7 × 10^-3       97    7.5 × 10^-3       99    8.8 × 10^-3       100   5.1 × 10^-2
Inverted Pendulum       4     …     …                 96    1.2 × 10^-3       99    1.2 × 10^-3       100   4.2 × 10^-2
Reacher                 8     91    3.3 × 10^-2       94    3.4 × 10^-2       99    4.2 × 10^-2       100   8.5 × 10^-2
Swimmer                 10    92    2.8 × 10^-3       94    3.0 × 10^-3       98    9.0 × 10^-3       100   5 × 10^-2
Walker2d                18    94    5.2               95    5.4               99    7.1               99    5.2

Conformal prediction
does not always achieve
the target coverage

Robust conformal prediction
almost always achieves the
target coverage

Coverage increases with
the amount of allowable
distribution shift

… but so does interval size.

INNs always
achieve the target
coverage of 95%

Results: Time to Verify
Environment             State dim.   Execution time (s), mean ± std   Execution time (s), mean ± std
Ant                     29           127 ± 199                        184 ± 289
Half-Cheetah            18           10 ± 3                           13 ± 7
Hopper                  12           15 ± 13                          6 ± 3
Humanoid                47           31 ± 11                          149 ± 209
Humanoid-Standup        47           20 ± 5                           51 ± 46
Inverted Double Pend.   6            2.4 ± 1                          2.5 ± 1
Inverted Pendulum       4            1.3 ± 0.3                        1.5 ± 0.3
Reacher                 8            9 ± 3.7                          40 ± 46
Swimmer                 10           5 ± 1.4                          6 ± 2.6
Walker2d                18           8.5 ± 3.8                        11.6 ± 5.1
At design stage,
execution times
are reasonable...
even for higher
dimensions

Execution takes
longer when the
search space is larger

Limitations
Strengths:
• Robustness to distributional (epistemic) uncertainty
• No assumptions on the system dynamics or performance function
• Handles dozens of state dimensions
Limitations:
• Conservatism in both coverage and intervals
• A quite particular shape of the distribution set
• Many hyperparameters to tune

Summary
1. Formulation of distributionally robust statistical verification
2. INNs: imprecise neural networks
3. Scalable active learning
4. MuJoCo experiments

Experimental Setup
• Evaluated on 10 MuJoCo environments: Ant, Humanoid,
Hopper, …
• Control policies trained using Deep Deterministic Policy
Gradient (DDPG).
• Each policy’s performance is the temporal average reward
R_avg over T steps.
• INN architecture: ensemble of 3 DNNs (2 layers, width 50
neurons, ReLU activations)
• Confidence level λ = 20 (95%), exploration δ = 0.05, M = 20
active learning iterations.
• Comparison with Conformal Prediction (CP) and Robust CP
(RCP) under in- and out-of-distribution.

Parameters
• α (alpha) – Contamination level: controls robustness to
distribution shift; higher α means more adversarial tolerance.
• β (beta) – Interval tightness: balances precision of predictions
against conservativeness; higher β yields wider, safer intervals.
• λ (lambda) – Confidence level: sets the strength of probabilistic
guarantees; higher λ gives stronger but wider guarantees.
• δ (delta) – Exploration radius: defines neighborhood size in active
learning; controls granularity of sampling around uncertain points.
• ε (epsilon) – Performance threshold: the guaranteed lower bound
on system performance, derived from learned model and
confidence level.

Parameter Relationships
• β – λ: Together they set the half‑width λβ of the confidence band
around the INN’s central estimate.
• λ – ε: Larger λ (stronger confidence) → wider band λβ →
smaller ε (more conservative).
• α – robustness: α directly controls how adversarial your
allowed shift in the marginal can be.
• δ – exploration: δ fixes the radius of regions you sample
around points of high uncertainty.
• ε – ψ-threshold: ε is finally set to the INN’s worst‑case lower
estimate minus λβ, leading to the main guarantee.
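
The relationships above reduce to a small amount of arithmetic, sketched below. The numeric inputs (a worst-case lower estimate of 2.0 and β = 0.01) are made up for illustration; only the formulas ε = lower estimate − λβ and target coverage 1 − 1/λ come from the slides.

```python
def performance_threshold(worst_case_lower, lam, beta):
    """epsilon = INN worst-case lower estimate minus the half-width lam * beta."""
    return worst_case_lower - lam * beta

def target_coverage(lam):
    """Target coverage 1 - 1/lam, e.g. lam = 20 gives 95%."""
    return 1.0 - 1.0 / lam

# Hypothetical inputs: doubling lam doubles the band and lowers epsilon.
eps_a = performance_threshold(worst_case_lower=2.0, lam=20, beta=0.01)
eps_b = performance_threshold(worst_case_lower=2.0, lam=40, beta=0.01)
cov_a, cov_b = target_coverage(20), target_coverage(40)
```

This makes the trade-off explicit: a larger λ buys a higher target coverage at the price of a more conservative threshold ε.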

Distributional Upper Probabilities
• Convex combination of uniform distributions
• “Contamination” with an arbitrary distribution
• Leads to an upper probability
[Figure: sampling distributions, both axes scaled to 1.0; a marked point is not sampled, but the guarantee holds there]
