Benchmarks

airiskguard publishes reproducible benchmark results for every release. Results are generated by running the built-in benchmark suite against curated datasets for each checker.

How to Run

# Run all checkers
airiskguard-benchmark

# Run specific checkers
airiskguard-benchmark --checkers security compliance

# Save results to JSON
airiskguard-benchmark --output results.json

# JSON output (for CI integration)
airiskguard-benchmark --format json

The benchmark exits with code 1 if any checker scores F1 < 0.5, making it suitable for CI gates.
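The gating rule can be sketched in a few lines of Python. This is only an illustration of the stated rule, not airiskguard's actual implementation: the function name `gate_exit_code` is hypothetical, and the F1 values are copied from the v0.3.0 results table on this page.

```python
# Illustrative sketch of the CI gate rule; not the library's own code.
F1_THRESHOLD = 0.5  # threshold stated in this documentation


def gate_exit_code(f1_by_checker: dict[str, float],
                   threshold: float = F1_THRESHOLD) -> int:
    """Return 1 if any checker's F1 is strictly below the threshold, else 0."""
    failing = [name for name, f1 in f1_by_checker.items() if f1 < threshold]
    return 1 if failing else 0


# F1 scores taken from the v0.3.0 results table.
v030 = {
    "security": 0.919,
    "compliance": 0.800,
    "hallucination": 0.941,
    "bias": 1.000,
    "toxicity": 0.800,
}
print(gate_exit_code(v030))

# Hypothetical regression: one checker falls below the threshold.
regressed = {**v030, "toxicity": 0.45}
print(gate_exit_code(regressed))
```

With the published scores every checker clears the threshold, so the gate passes. A single checker below 0.5 is enough to fail the run.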

v0.3.0 Results

Results on the built-in curated datasets (15 to 35 samples per checker, roughly balanced between positive and negative; see Dataset Details for exact counts).

Checker	Precision	Recall	F1	Accuracy	FPR
`security`	100.0%	85.0%	91.9%	91.4%	0.0%
`compliance`	100.0%	66.7%	80.0%	83.3%	0.0%
`hallucination`	88.9%	100.0%	94.1%	93.3%	14.3%
`bias`	100.0%	100.0%	100.0%	100.0%	0.0%
`toxicity`	100.0%	66.7%	80.0%	83.3%	0.0%

The security, compliance, bias, and toxicity checkers produce zero false positives on this dataset: no benign sample was incorrectly flagged. The hallucination checker has a 14.3% FPR (1 of 7 negative samples) due to aggressive URL detection, which is configurable.

Dataset Details

Each checker is evaluated on a curated dataset of positive (should be flagged) and negative (should pass) samples:

Checker	Samples	Positives	Negatives	Focus
`security`	35	20	15	Prompt injection, jailbreaks, encoding attacks
`compliance`	30	15	15	PII (SSN, email, credit card, phone), prohibited content
`hallucination`	15	8	7	Fabricated URLs, citations, overconfident claims
`bias`	20	10	10	Gender, racial, age, religious, socioeconomic bias
`toxicity`	30	15	15	Threats, hate speech, insults, profanity

Metric Definitions

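In the formulas below, TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives.

Precision: of all flagged samples, the fraction that were correctly flagged, TP / (TP + FP)
Recall: of all samples that should be flagged, the fraction that were caught, TP / (TP + FN)
F1: harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall)
Accuracy: fraction of all samples classified correctly, (TP + TN) / (TP + FP + TN + FN)
FPR: false positive rate, the fraction of benign samples incorrectly flagged, FP / (FP + TN)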

Reproducing Results

from airiskguard.benchmark import run_benchmark_sync

result = run_benchmark_sync()
result.print_table()

# Or save to JSON (to_json() output is written to the file as-is)
with open("benchmark_results.json", "w", encoding="utf-8") as f:
    f.write(result.to_json())

CI Integration

Add to your CI pipeline to catch regressions:

# GitHub Actions example
- name: Run airiskguard benchmark
  run: airiskguard-benchmark --format json --output benchmark_results.json

The command exits with code 1 if any checker drops below F1 = 0.5.
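In GitHub Actions, a non-zero exit code fails the step automatically. For pipelines driven by a Python script instead, a minimal wrapper might look like the sketch below. The helper name `run_gate` is hypothetical. The demonstration uses stand-in commands that simply exit with a given code so the example runs without airiskguard installed. In a real pipeline you would pass `BENCHMARK_CMD`, which mirrors the command shown above.

```python
import subprocess
import sys

# The benchmark command from the GitHub Actions example above.
BENCHMARK_CMD = [
    "airiskguard-benchmark",
    "--format", "json",
    "--output", "benchmark_results.json",
]


def run_gate(command: list[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Stand-in commands that mimic a passing and a failing benchmark run.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print("pass" if run_gate(passing) else "fail")
print("pass" if run_gate(failing) else "fail")
```

A calling script can then end with `sys.exit(0 if run_gate(BENCHMARK_CMD) else 1)` so the benchmark result propagates to the CI system.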