Benchmarks
airiskguard publishes reproducible benchmark results for every release. Results are generated by running the built-in benchmark suite against curated datasets for each checker.
How to Run
```bash
# Run all checkers
airiskguard-benchmark

# Run specific checkers
airiskguard-benchmark --checkers security compliance

# Save results to JSON
airiskguard-benchmark --output results.json

# JSON output (for CI integration)
airiskguard-benchmark --format json
```
The benchmark exits with code 1 if any checker scores F1 < 0.5, making it suitable for CI gates.
v0.3.0 Results
Results on the built-in curated dataset (15 to 35 samples per checker, with a roughly balanced positive/negative split; see Dataset Details).
| Checker | Precision | Recall | F1 | Accuracy | FPR |
|---|---|---|---|---|---|
| security | 100.0% | 85.0% | 91.9% | 91.4% | 0.0% |
| compliance | 100.0% | 66.7% | 80.0% | 83.3% | 0.0% |
| hallucination | 88.9% | 100.0% | 94.1% | 93.3% | 14.3% |
| bias | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% |
| toxicity | 100.0% | 66.7% | 80.0% | 83.3% | 0.0% |
The security, compliance, bias, and toxicity checkers produce zero false positives on this dataset: no benign sample was incorrectly flagged. The hallucination checker has a 14.3% FPR (1 of its 7 negative samples) due to aggressive URL detection, which is configurable.
Dataset Details
Each checker is evaluated on a curated dataset of positive (should be flagged) and negative (should pass) samples:
| Checker | Samples | Positives | Negatives | Focus |
|---|---|---|---|---|
| security | 35 | 20 | 15 | Prompt injection, jailbreaks, encoding attacks |
| compliance | 30 | 15 | 15 | PII (SSN, email, credit card, phone), prohibited content |
| hallucination | 15 | 8 | 7 | Fabricated URLs, citations, overconfident claims |
| bias | 20 | 10 | 10 | Gender, racial, age, religious, socioeconomic bias |
| toxicity | 30 | 15 | 15 | Threats, hate speech, insults, profanity |
Metric Definitions
- Precision: of all flagged samples, the fraction that were correctly flagged, TP / (TP + FP)
- Recall: of all samples that should be flagged, the fraction that were caught, TP / (TP + FN)
- F1: harmonic mean of precision and recall, 2PR / (P + R)
- Accuracy: fraction of all samples classified correctly, (TP + TN) / total
- FPR: false positive rate, the fraction of benign samples incorrectly flagged, FP / (FP + TN)
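As a worked check of these definitions, the sketch below recomputes two rows of the v0.3.0 table from confusion-matrix counts. The counts are an assumption: they are inferred from the published percentages and dataset sizes, not taken from the library's output. `Confusion` and `metrics` are illustrative names and not part of airiskguard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one binary checker."""
    tp: int  # positives correctly flagged
    fp: int  # negatives incorrectly flagged
    fn: int  # positives missed
    tn: int  # negatives correctly passed


def metrics(c: Confusion) -> dict[str, float]:
    """Return the five reported metrics as fractions in [0, 1]."""
    flagged = c.tp + c.fp
    positives = c.tp + c.fn
    negatives = c.fp + c.tn
    total = positives + negatives
    precision = c.tp / flagged if flagged else 0.0
    recall = c.tp / positives if positives else 0.0
    pr_sum = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / pr_sum if pr_sum else 0.0,
        "accuracy": (c.tp + c.tn) / total if total else 0.0,
        "fpr": c.fp / negatives if negatives else 0.0,
    }


# Inferred counts: security has 20 positives and 15 negatives,
# with 85.0% recall (17 of 20 caught) and no false positives.
security = metrics(Confusion(tp=17, fp=0, fn=3, tn=15))

# Inferred counts: hallucination has 8 positives and 7 negatives,
# with 100.0% recall and one false positive (1 of 7 = 14.3% FPR).
hallucination = metrics(Confusion(tp=8, fp=1, fn=0, tn=6))

for name, m in (("security", security), ("hallucination", hallucination)):
    print(name, {k: f"{v:.1%}" for k, v in m.items()})
```

With these counts, the printed percentages match the security and hallucination rows of the results table, which suggests the published figures are consistent with the dataset sizes.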
Reproducing Results
```python
from airiskguard.benchmark import run_benchmark_sync

# Run the full benchmark suite and print the results table
result = run_benchmark_sync()
result.print_table()

# Or save to JSON
with open("benchmark_results.json", "w") as f:
    f.write(result.to_json())
```
CI Integration
Add to your CI pipeline to catch regressions:
```yaml
# GitHub Actions example
- name: Run airiskguard benchmark
  run: airiskguard-benchmark --format json --output benchmark_results.json
```
The command exits with code 1 if any checker drops below F1 = 0.5.
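A project that wants a stricter gate than the built-in F1 = 0.5 cut-off could apply its own threshold to the scores. The following is a minimal sketch under stated assumptions: the layout of `benchmark_results.json` is not documented here, so the function takes a plain mapping of checker name to F1 rather than parsing the file. `failing_checkers` is a hypothetical helper, not part of airiskguard.

```python
def failing_checkers(
    f1_by_checker: dict[str, float], threshold: float = 0.5
) -> list[str]:
    """Return the checkers whose F1 is below the threshold, sorted by name."""
    return sorted(name for name, f1 in f1_by_checker.items() if f1 < threshold)


# F1 scores copied from the v0.3.0 results table, as fractions.
V030_F1 = {
    "security": 0.919,
    "compliance": 0.800,
    "hallucination": 0.941,
    "bias": 1.000,
    "toxicity": 0.800,
}

# Same 0.5 cut-off that the CLI uses for its exit code.
below_default = failing_checkers(V030_F1)
# Hypothetical stricter gate chosen for illustration.
below_strict = failing_checkers(V030_F1, threshold=0.85)

print(below_default)  # []
print(below_strict)   # ['compliance', 'toxicity']
```

In a pipeline, a non-empty result could be turned into a failing step, for example by calling `sys.exit(1)`.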