SLA & SLO
AegisGate Security Platform — Service Level Agreements and Service Level Objectives by tier
Service Level Agreements & Objectives
AegisGate Security Platform publishes transparent service level commitments for every tier. These aren’t aspirational — they’re measurable against the /health, /metrics, and /api/v1/sla endpoints.
SLA by Tier
| Commitment | Community | Developer | Professional | Enterprise |
|---|---|---|---|---|
| Uptime Target | 99.0% | 99.9% | 99.95% | 99.99% |
| Support Response | Best-effort (community) | 24 hours | 4 hours | 1 hour |
| Support Channel | GitHub Issues | Email + Priority Issues | Dedicated channel | 24/7 + Slack |
| Incident Response | No SLA | 8h (P1), 24h (P2) | 2h (P1), 8h (P2) | 30min (P1), 2h (P2) |
| Data Retention | 7 days | 90 days | 365 days | Custom |
| Allowed Downtime/Year | ~3.7 days | ~8.8 hours | ~4.4 hours | ~53 minutes |
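The downtime figures in the last row follow directly from the uptime targets. As a minimal sketch of that arithmetic, assuming a 365-day year (the function name `allowed_downtime_hours` is hypothetical and not part of AegisGate):

```python
def allowed_downtime_hours(uptime_percent: float, days_per_year: float = 365.0) -> float:
    """Hours of downtime per year permitted by an uptime target given in percent."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The unavailable fraction of the year, converted to hours.
    return (1.0 - uptime_percent / 100.0) * days_per_year * 24.0


# Uptime targets copied from the SLA table above.
TIER_TARGETS = [
    ("Community", 99.0),
    ("Developer", 99.9),
    ("Professional", 99.95),
    ("Enterprise", 99.99),
]

for tier, target in TIER_TARGETS:
    hours = allowed_downtime_hours(target)
    print(f"{tier}: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Under this assumption, 99.0% allows 87.6 hours (3.65 days) per year, 99.9% allows 8.76 hours, 99.95% allows 4.38 hours, and 99.99% allows 52.56 minutes.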
Service Level Objectives
We measure these SLOs continuously via Prometheus metrics exported at /metrics:
| SLO | Target | Window | Metric |
|---|---|---|---|
| API Availability | ≥99.5% | 30d | successful_requests / total_requests |
| Request Latency (P99) | ≤100ms | 30d | histogram_quantile(0.99, request_duration_seconds) |
| MCP Session Availability | ≥99.9% | 30d | active_mcp_sessions / attempted_sessions |
| A2A Auth Success Rate | ≥99.9% | 30d | a2a_auth_success / a2a_auth_total |
| Guardrail Enforcement Rate | 100% | 30d | guardrail_enforced / guardrail_evaluated |
| Audit Log Completeness | 100% | 30d | audit_entries / total_requests |
Why 100% for Guardrails and Audit?
AegisGate is a security product. If a guardrail fails to enforce or an audit log entry goes missing, that’s a security incident — not a performance issue. These SLOs are set to 100% because anything less means a threat could pass through undetected.
The fail-closed architecture ensures that any guardrail error, timeout, or internal failure results in a deny rather than an allow. This is what makes the 100% enforcement target achievable: a failed evaluation still ends in an enforced decision (a deny), never in an unchecked request.
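To illustrate the fail-closed idea, the sketch below wraps a guardrail check so that an exception, a timeout, or anything other than an explicit approval yields a deny. It is not AegisGate's implementation. The names `evaluate_fail_closed`, `ALLOW`, `DENY`, and the 0.5-second default timeout are assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

ALLOW = "allow"
DENY = "deny"


def evaluate_fail_closed(
    check: Callable[[str], bool], payload: str, timeout_s: float = 0.5
) -> str:
    """Return ALLOW only when `check` explicitly returns True within the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(check, payload)
        approved = future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and any exception raised inside the check.
        return DENY
    finally:
        # Do not block the caller on a check that is still running.
        pool.shutdown(wait=False)
    # Anything other than an explicit True is treated as a deny.
    return ALLOW if approved is True else DENY


def slow_check(payload: str) -> bool:
    """Hypothetical check that takes longer than the caller is willing to wait."""
    time.sleep(0.2)
    return True


def broken_check(payload: str) -> bool:
    """Hypothetical check that fails internally."""
    raise RuntimeError("guardrail backend unavailable")


print(evaluate_fail_closed(lambda p: True, "request"))
print(evaluate_fail_closed(broken_check, "request"))
print(evaluate_fail_closed(slow_check, "request", timeout_s=0.05))
```

The design choice worth noting is that the default branch is the deny: the allow path requires a positive result, so a new failure mode cannot silently become an allow.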
Health Check
The /health endpoint verifies all critical dependencies:
{
  "status": "healthy",
  "tier": "community",
  "version": "2.0.1",
  "checks": {
    "proxy": { "enabled": true, "healthy": true },
    "persistence": { "enabled": true, "started": true, "healthy": true },
    "license": { "valid": true, "tier": "community", "healthy": true },
    "certificates": { "valid": true, "healthy": true }
  }
}
If any dependency is unhealthy, the health check returns "status": "unhealthy" with HTTP 503.
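The aggregation rule described above can be sketched as a small function over the `checks` object. This is an illustration, not the server's code. The function name `summarize_health` is hypothetical, HTTP 200 for the healthy case is an assumption (the text only states 503 for unhealthy), and treating an empty check set as unhealthy is a conservative choice made here.

```python
from typing import Any, Dict, Tuple


def summarize_health(checks: Dict[str, Dict[str, Any]]) -> Tuple[str, int]:
    """Derive the overall status and HTTP code from per-dependency checks."""
    # Every dependency must report healthy; no checks at all counts as unhealthy.
    all_healthy = bool(checks) and all(
        check.get("healthy") is True for check in checks.values()
    )
    return ("healthy", 200) if all_healthy else ("unhealthy", 503)


# The checks from the sample /health response above.
checks = {
    "proxy": {"enabled": True, "healthy": True},
    "persistence": {"enabled": True, "started": True, "healthy": True},
    "license": {"valid": True, "tier": "community", "healthy": True},
    "certificates": {"valid": True, "healthy": True},
}

print(summarize_health(checks))

# A single failing dependency flips the overall result.
degraded = dict(checks, certificates={"valid": False, "healthy": False})
print(summarize_health(degraded))
```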
SLA API
The /api/v1/sla endpoint returns the SLA and SLOs for the current tier:
curl -s http://localhost:8443/api/v1/sla | jq
{
  "tier": "community",
  "uptime_target": 99,
  "support_response": "Best-effort (community forum)",
  "support_channel": "GitHub Issues",
  "incident_response": "No SLA — best effort",
  "data_retention_days": 7,
  "slos": [
    {
      "name": "api_availability",
      "target": 99.5,
      "window": "30d",
      "metric": "successful_requests / total_requests",
      "description": "API request success rate"
    }
  ]
}
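A client can read the targets out of this response and compare them with its own measurements. The sketch below parses a trimmed copy of the sample response. It assumes that `target` is a percentage and that availability-style SLOs are lower bounds. The helper names and the measured value of 99.72 are hypothetical.

```python
import json
from typing import Dict

# Trimmed copy of the sample /api/v1/sla response shown above.
SAMPLE_RESPONSE = """
{
  "tier": "community",
  "uptime_target": 99,
  "slos": [
    {"name": "api_availability", "target": 99.5, "window": "30d"}
  ]
}
"""


def slo_targets(body: str) -> Dict[str, float]:
    """Map each SLO name in the response body to its numeric target."""
    doc = json.loads(body)
    return {slo["name"]: float(slo["target"]) for slo in doc.get("slos", [])}


def meets_lower_bound(measured_percent: float, target_percent: float) -> bool:
    """True when a measured percentage is at or above a lower-bound target."""
    return measured_percent >= target_percent


targets = slo_targets(SAMPLE_RESPONSE)
measured_availability = 99.72  # hypothetical measurement, not real data
print(meets_lower_bound(measured_availability, targets["api_availability"]))
```

An upper-bound SLO such as the P99 latency target would need the opposite comparison, so a real client would have to know the direction of each objective.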
Incident Response Priority Levels
| Priority | Definition | Examples |
|---|---|---|
| P1 — Critical | Security product is down or actively compromised | Proxy down, guardrails bypassed, data breach |
| P2 — High | Major feature degraded but functional | MCP sessions failing, A2A auth intermittent |
| P3 — Medium | Minor feature impacted | Dashboard slow, metrics delay |
| P4 — Low | Cosmetic or informational | UI glitch, documentation error |
Measuring Compliance
AegisGate uses Prometheus metrics to track all SLOs. You can monitor compliance with standard PromQL queries:
# API availability over 30 days
sum(rate(aegisgate_requests_total{code!~"5.."}[30d]))
/
sum(rate(aegisgate_requests_total[30d]))
# P99 latency
histogram_quantile(0.99, sum(rate(aegisgate_request_duration_seconds_bucket[30d])) by (le))
# Guardrail enforcement rate
sum(rate(aegisgate_guardrails_enforced_total[30d]))
/
sum(rate(aegisgate_guardrails_evaluated_total[30d]))
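Ratios like these are often easier to act on when expressed as an error budget, meaning the number of failures a target still permits within the window. The following is a minimal sketch of that calculation. The `ErrorBudget` type, the `error_budget` function, and the request counts are hypothetical and not part of AegisGate.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float  # failures the target permits in the window
    remaining: float         # allowed failures minus observed failures
    breached: bool           # True once observed failures exceed the allowance


def error_budget(target_percent: float, total: int, failed: int) -> ErrorBudget:
    """Compute the error budget for a success-rate SLO over one window."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    if failed < 0 or total < failed:
        raise ValueError("expected 0 <= failed <= total")
    allowed = (100.0 - target_percent) * total / 100.0
    remaining = allowed - failed
    return ErrorBudget(allowed, remaining, remaining < 0)


# API availability target of 99.5% with hypothetical counts for a 30-day window.
print(error_budget(99.5, 1_000_000, 1_200))

# A 100% target, as used for guardrails and audit logs, leaves no budget at all.
print(error_budget(100.0, 1_000, 1))
```

With a 100% target the allowance is zero, so a single missed enforcement or audit entry breaches the objective, which matches the reasoning given for those two SLOs.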