Executive Summary

The monitoring service must raise an alert within one minute when plugin logs show consecutive errors or security events, then deliver it to external incident platforms via webhook with retries and downgrade channels. This child scenario focuses on rule parsing, alert composition, webhook delivery, downgrade notifications, and audit reporting to ensure alerts are timely and traceable.

Scope & Guardrails

  • In Scope: Log rule management, sliding-window detection, webhook delivery, retries and downgrades, alert auditing.
  • Out of Scope: Rule configuration UI, downstream ticket workflows inside third-party platforms, offline log replay batches.
  • Environment & Flags: monitoring-service, alert-gateway-v2, webhook-delivery-fallback; depends on log shippers, the alert center, and email/SMS channels.

Participants & Responsibilities

| Scope | Repository | Layer | Responsibilities | Owners |
| --- | --- | --- | --- | --- |
| core-platform | powerx | service | Log rule evaluation, alert construction, webhook dispatch and retries | Matrix Ops (Platform Ops Lead / ops@artisan-cloud.com) |
| ops-tooling | powerx | ops | Downgrade channel governance, alert center states, audit reporting | Iris Chen (Observability Steward / observability@artisan-cloud.com) |

End-to-End Flow

  1. Stage 1 – Log Ingestion: Log agents stream structured plugin logs into the centralized service.
  2. Stage 2 – Rule Detection: The rule engine aggregates events by tenant and flags spikes of error or security keywords within a five-minute sliding window.
  3. Stage 3 – Alert Construction: An alert payload is created with tenant, plugin, summary, recommended actions, and trace references.
  4. Stage 4 – Webhook Delivery: The webhook dispatcher signs requests, handles retries and backoff, and records delivery status.
  5. Stage 5 – Downgrade & Audit: Three consecutive failures trigger downgrade channels (email/SMS); all outcomes feed into the audit store and alert center.
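
The Stage 2 detection step can be illustrated with a minimal sketch. It assumes events arrive in timestamp order and that a rule fires once a tenant accumulates a fixed number of error or security events inside the window. The names `LogEvent` and `SlidingWindowDetector`, the `level` field, and the `threshold` value are hypothetical and are not taken from the actual rule engine or from config/log_rules/*.yaml.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical structured plugin log record."""
    tenant: str
    plugin: str
    level: str        # assumed values: "error", "security", "info", ...
    timestamp: float  # seconds since epoch


class SlidingWindowDetector:
    """Counts matching events per tenant inside a sliding time window."""

    def __init__(self, window_seconds=300.0, threshold=5,
                 levels=("error", "security")):
        self.window_seconds = window_seconds  # five minutes by default
        self.threshold = threshold            # assumed rule parameter
        self.levels = frozenset(levels)
        self._hits = defaultdict(deque)       # tenant -> matching timestamps

    def observe(self, event):
        """Record one event; return True when the tenant's rule matches."""
        if event.level not in self.levels:
            return False
        hits = self._hits[event.tenant]
        hits.append(event.timestamp)
        # Drop timestamps that have fallen out of the window.
        cutoff = event.timestamp - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) >= self.threshold
```

Keeping one deque per tenant keeps each update cheap and keeps tenants isolated from one another. A production engine would also need to handle out-of-order events and suppress repeated matches for the same spike, which this sketch does not attempt.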

Key Interactions & Contracts

  • APIs / Events: STREAM logs.plugin.*, POST /alerts/webhook, EVENT monitoring.alert.updated, POST /alerts/fallback/email.
  • Configs / Schemas: config/log_rules/*.yaml, docs/standards/powerx/backend/integration/06_gateway/EventBus_and_Message_Fabric.md.
  • Security / Compliance: Webhooks require HMAC signatures and replay protection; downgrade notifications must be audited; sensitive log fields are tenant-isolated.
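
As an illustration of the HMAC signature and replay protection requirement, the sketch below signs a webhook body together with a timestamp and a nonce, and rejects stale or repeated requests on the receiving side. The message layout (`timestamp.nonce.body`), the SHA-256 choice, the five-minute skew limit, and the names `sign` and `ReplayGuard` are assumptions made for this example; the document does not specify the actual signing scheme.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 over timestamp, nonce, and body."""
    message = f"{timestamp}.{nonce}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    """Receiver-side check: valid signature, fresh timestamp, unseen nonce."""

    def __init__(self, max_skew_seconds=300):
        self.max_skew_seconds = max_skew_seconds  # assumed tolerance
        self._seen = {}  # nonce -> timestamp of the accepted request

    def verify(self, secret, timestamp, nonce, body, signature, now):
        # Reject requests whose timestamp is too far from the receiver clock.
        if abs(now - timestamp) > self.max_skew_seconds:
            return False
        expected = sign(secret, timestamp, nonce, body)
        # Constant-time comparison avoids leaking signature prefixes.
        if not hmac.compare_digest(expected, signature):
            return False
        # Forget nonces that the timestamp check alone would already reject.
        self._seen = {n: t for n, t in self._seen.items()
                      if now - t <= self.max_skew_seconds}
        if nonce in self._seen:
            return False  # replayed request
        self._seen[nonce] = timestamp
        return True
```

Binding the timestamp and nonce into the signed message means neither can be altered without invalidating the signature, so the nonce cache only has to cover the skew window.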

Usecase Links

  • UC-OPS-MONITORING-WEBHOOK-001 — Log anomalies triggering webhook alerts.

Acceptance Criteria

  1. Alerts are generated and delivered within one minute of a rule match; first-attempt delivery success rate ≥ 97%, cumulative success rate (including retries) ≥ 99%.
  2. Three consecutive webhook failures automatically downgrade to email/SMS; downgrade success rate ≥ 99% and status is visible in the alert center.
  3. Audit storage keeps the full retry history to support traceability and reporting.
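
Criteria 2 and 3 can be sketched together: retry the webhook with exponential backoff, fall back after three consecutive failures, and keep every attempt in a record suitable for the audit store. This is a sketch under stated assumptions, not the dispatcher's implementation. The function `deliver_with_fallback`, the `DeliveryRecord` shape, the status strings, and the backoff schedule are hypothetical, and "three consecutive failures" is interpreted here as three failed attempts for a single alert.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeliveryRecord:
    """Hypothetical audit entry holding the full attempt history."""
    alert_id: str
    attempts: List[dict] = field(default_factory=list)
    status: str = "pending"


def _try(send, payload):
    """Call a sender; treat exceptions as failures and capture the error."""
    try:
        return bool(send(payload)), None
    except Exception as exc:
        return False, repr(exc)


def deliver_with_fallback(alert_id, payload, send_webhook, send_fallback,
                          max_attempts=3, base_delay=1.0, sleep=time.sleep):
    record = DeliveryRecord(alert_id)
    for attempt in range(1, max_attempts + 1):
        ok, error = _try(send_webhook, payload)
        record.attempts.append({"channel": "webhook", "attempt": attempt,
                                "ok": ok, "error": error})
        if ok:
            record.status = "delivered"
            return record
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # All webhook attempts failed: downgrade to the fallback channel.
    ok, error = _try(send_fallback, payload)
    record.attempts.append({"channel": "fallback", "attempt": 1,
                            "ok": ok, "error": error})
    record.status = "downgraded" if ok else "failed"
    return record
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. With a one-minute delivery target, the total backoff budget has to stay well under 60 seconds, which the default schedule here (1 s, then 2 s) does.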

Telemetry & Ops

  • Metrics: monitoring.webhook.delivery_success_rate, monitoring.webhook.retry_total, monitoring.alert.downgrade_total.
  • Alert thresholds: A webhook success rate below 95% over a 15-minute window raises a P1; more than 20 downgrades per day triggers a governance review.
  • Observability sources: Grafana “Alert Delivery”, Ops alert center, reports/_state/ops/monitoring/*.json.
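
The two thresholds above reduce to a small check over counters. The sketch below assumes the caller has already aggregated delivery outcomes for the last 15 minutes and downgrades for the current day; the function name `evaluate_thresholds` and the action labels are hypothetical.

```python
def evaluate_thresholds(successes, failures, downgrades_today,
                        min_success_rate=0.95, max_daily_downgrades=20):
    """Return the ops actions implied by the current counters.

    successes / failures: webhook outcomes in the last 15-minute window.
    downgrades_today: fallback notifications sent so far today.
    """
    actions = []
    total = successes + failures
    # With no traffic in the window there is no rate to evaluate.
    if total > 0 and successes / total < min_success_rate:
        actions.append("raise_p1")
    if downgrades_today > max_daily_downgrades:
        actions.append("governance_review")
    return actions
```

For example, 94 successes and 6 failures in the window give a 94% rate and yield `["raise_p1"]`, while exactly 95% and exactly 20 downgrades trigger nothing because both comparisons are strict.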

Open Issues & Follow-ups

| Risk / Item | Impact | Owner | ETA |
| --- | --- | --- | --- |
| Rule noise causing alert storms | Increased on-call load | Matrix Ops | 2025-11-16 |
| Missing webhook self-check tooling | Alert delivery failures | Iris Chen | 2025-11-18 |

Appendix

  • docs/meta/scenarios/powerx/core-platform/runtime-ops/system-monitoring-and-alerting/primary.md
  • docs/usecases-seeds/SCN-OPS-SYSTEM-MONITORING-001/UC-OPS-MONITORING-WEBHOOK-001.md

Released under the Apache 2.0 License.