Benchmarks

airiskguard publishes reproducible benchmark results for every release. Results are generated by running the built-in benchmark suite against curated datasets for each checker.

How to Run

# Run all checkers
airiskguard-benchmark

# Run specific checkers
airiskguard-benchmark --checkers security compliance

# Save results to JSON
airiskguard-benchmark --output results.json

# JSON output (for CI integration)
airiskguard-benchmark --format json

The benchmark exits with code 1 if any checker scores F1 < 0.5, making it suitable for CI gates.

v0.3.0 Results

Results on the built-in curated dataset (15 to 35 samples per checker, roughly balanced between positive and negative; see Dataset Details).

Checker        Precision  Recall  F1      Accuracy  FPR
security       100.0%     85.0%   91.9%   91.4%     0.0%
compliance     100.0%     66.7%   80.0%   83.3%     0.0%
hallucination  88.9%      100.0%  94.1%   93.3%     14.3%
bias           100.0%     100.0%  100.0%  100.0%    0.0%
toxicity       100.0%     66.7%   80.0%   83.3%     0.0%

The security, compliance, bias, and toxicity checkers produce zero false positives on this dataset: none of their benign samples were flagged. The hallucination checker has a 14.3% FPR (1 of its 7 negative samples) due to aggressive URL detection, which is configurable.

Dataset Details

Each checker is evaluated on a curated dataset of positive (should be flagged) and negative (should pass) samples:

Checker        Samples  Positives  Negatives  Focus
security       35       20         15         Prompt injection, jailbreaks, encoding attacks
compliance     30       15         15         PII (SSN, email, credit card, phone), prohibited content
hallucination  15       8          7          Fabricated URLs, citations, overconfident claims
bias           20       10         10         Gender, racial, age, religious, socioeconomic bias
toxicity       30       15         15         Threats, hate speech, insults, profanity

Metric Definitions

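  • Precision: of all flagged samples, the fraction that should have been flagged, TP / (TP + FP)
  • Recall: of all samples that should be flagged, the fraction that were caught, TP / (TP + FN)
  • F1: harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall)
  • Accuracy: fraction of all samples classified correctly, (TP + TN) / (TP + FP + TN + FN)
  • FPR: false positive rate, the fraction of benign samples incorrectly flagged, FP / (FP + TN)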
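To make the definitions concrete, the sketch below recomputes two rows of the v0.3.0 table from confusion-matrix counts. The counts are not published by the benchmark; they are inferred here from the reported percentages and the dataset sizes (for example, 85.0% recall on 20 security positives implies 17 caught and 3 missed). The `Confusion` class and `as_percentages` helper are hypothetical names for illustration and are not part of airiskguard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one checker (hypothetical helper)."""
    tp: int  # positives correctly flagged
    fp: int  # negatives incorrectly flagged
    tn: int  # negatives correctly passed
    fn: int  # positives missed

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def fpr(self) -> float:
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0


def as_percentages(c: Confusion) -> tuple:
    """Return (precision, recall, F1, accuracy, FPR) as percentages, 1 decimal."""
    values = (c.precision, c.recall, c.f1, c.accuracy, c.fpr)
    return tuple(round(100 * v, 1) for v in values)


# Counts inferred from the published tables, not taken from benchmark output.
security = Confusion(tp=17, fp=0, tn=15, fn=3)       # 20 positives, 15 negatives
hallucination = Confusion(tp=8, fp=1, tn=6, fn=0)    # 8 positives, 7 negatives

print(as_percentages(security))       # (100.0, 85.0, 91.9, 91.4, 0.0)
print(as_percentages(hallucination))  # (88.9, 100.0, 94.1, 93.3, 14.3)
```

Both rows match the results table, which also shows why a single false positive on the 7-sample hallucination negative set is enough to produce a 14.3% FPR.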

Reproducing Results

from airiskguard.benchmark import run_benchmark_sync

result = run_benchmark_sync()
result.print_table()

# Or save to JSON (to_json() output is written as-is, so no json import is needed)
with open("benchmark_results.json", "w") as f:
    f.write(result.to_json())

CI Integration

Add to your CI pipeline to catch regressions:

# GitHub Actions example
- name: Run airiskguard benchmark
  run: airiskguard-benchmark --format json --output benchmark_results.json

The command exits with code 1 if any checker drops below F1 = 0.5.
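For CI systems that are driven by a Python script rather than YAML, the same gate can be sketched with `subprocess`. This is a minimal sketch that relies only on the documented command line and exit-code convention; the `benchmark_passed` function and `BENCHMARK_CMD` constant are hypothetical names, not part of airiskguard.

```python
import subprocess
import sys
from typing import Sequence

# Command copied from the GitHub Actions example above.
BENCHMARK_CMD = [
    "airiskguard-benchmark",
    "--format", "json",
    "--output", "benchmark_results.json",
]


def benchmark_passed(command: Sequence[str] = BENCHMARK_CMD) -> bool:
    """Run the benchmark command and return True only if it exits with code 0.

    Hypothetical helper: it treats any non-zero exit code as a failed gate,
    which covers the documented exit code 1 for a checker with F1 < 0.5.
    """
    completed = subprocess.run(list(command), check=False)
    if completed.returncode != 0:
        print(
            f"benchmark gate failed (exit code {completed.returncode})",
            file=sys.stderr,
        )
        return False
    return True
```

A pipeline step could then call `sys.exit(0 if benchmark_passed() else 1)` so that the job status mirrors the benchmark's own exit code.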