
Alerting

Configuring alarms and notifications with CloudWatch.

Critical Alarms

Lambda Error Rate

# SAM template
Resources:
  LambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-lambda-errors'
      AlarmDescription: Alert when the Lambda function reports more than 5 errors in 5 minutes
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref UserFunction
      AlarmActions:
        - !Ref AlertTopic
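The alarm above fires on an absolute error count. To alarm on an actual error *rate* (percentage of failed invocations), CloudWatch metric math can divide `Errors` by `Invocations`. A sketch, assuming the same `UserFunction` and `AlertTopic` resources:

```yaml
# Sketch: error-rate alarm via metric math (assumes UserFunction
# and AlertTopic from the template above)
LambdaErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub '${AWS::StackName}-lambda-error-rate'
    AlarmDescription: Alert when Lambda error rate > 5%
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching  # no invocations -> no alarm
    Metrics:
      - Id: errorRate
        Expression: '100 * errors / invocations'
        Label: Error rate (%)
      - Id: errors
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: !Ref UserFunction
          Period: 300
          Stat: Sum
      - Id: invocations
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Invocations
            Dimensions:
              - Name: FunctionName
                Value: !Ref UserFunction
          Period: 300
          Stat: Sum
    AlarmActions:
      - !Ref AlertTopic
```

Note that when `Metrics` is used, `Period` and `Statistic` move into each `MetricStat` entry instead of the alarm itself.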

SQS DLQ Messages

DeadLetterQueueAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub '${AWS::StackName}-dlq-messages'
    AlarmDescription: Alert when there are messages in the DLQ
    MetricName: ApproximateNumberOfMessagesVisible
    Namespace: AWS/SQS
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Dimensions:
      - Name: QueueName
        Value: !GetAtt DeadLetterQueue.QueueName
    AlarmActions:
      - !Ref AlertTopic
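This alarm assumes `DeadLetterQueue` is actually wired up as a dead-letter queue. A sketch of that wiring, where `MainQueue` is a hypothetical name for the source queue:

```yaml
# Sketch: DLQ wiring for the alarm above (MainQueue is a
# hypothetical name; DeadLetterQueue matches the alarm's dimension)
DeadLetterQueue:
  Type: AWS::SQS::Queue

MainQueue:
  Type: AWS::SQS::Queue
  Properties:
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
      maxReceiveCount: 3  # move a message to the DLQ after 3 failed receives
```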

RDS High CPU

RDSHighCPUAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub '${AWS::StackName}-rds-high-cpu'
    AlarmDescription: Alert when RDS CPU > 80%
    MetricName: CPUUtilization
    Namespace: AWS/RDS
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: DBInstanceIdentifier
        Value: !Ref Database
    AlarmActions:
      - !Ref AlertTopic

SNS Topic for Alerts

AlertTopic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: App Alerts
    Subscription:
      - Endpoint: tech-team@seuapp.com
        Protocol: email
      - Endpoint: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:slack-notifier'
        Protocol: lambda
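A `lambda` subscription only delivers if SNS is allowed to invoke the function. A sketch of the required resource-based permission, assuming the `slack-notifier` function ARN used above:

```yaml
# Sketch: without this permission SNS cannot invoke slack-notifier
SlackNotifierPermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:InvokeFunction
    FunctionName: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:slack-notifier'
    Principal: sns.amazonaws.com
    SourceArn: !Ref AlertTopic
```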

Severity Levels

P0 - Critical

  • System completely down
  • Data loss
  • Security breach

Response: Immediate (< 15 minutes)
Notification: Phone call + Slack

P1 - High

  • A critical feature is broken
  • Severely degraded performance
  • Frequent errors

Response: < 1 hour
Notification: Slack + Email

P2 - Medium

  • A non-critical feature has a problem
  • Intermittently degraded performance
  • Warnings in the logs

Response: < 4 hours
Notification: Slack

P3 - Low

  • Cosmetic issues
  • Suboptimal performance
  • Informational alerts

Response: Next business day
Notification: Email
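One way to implement the severity-to-channel mapping above is a separate SNS topic per tier, so each alarm's `AlarmActions` points at the topic with the right subscribers. A sketch with hypothetical topic names:

```yaml
# Sketch: one topic per severity tier (hypothetical names);
# subscribe paging/Slack/email per the matrix above
CriticalAlertTopic:  # P0/P1 alarms
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: Critical Alerts

StandardAlertTopic:  # P2/P3 alarms
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: Standard Alerts
```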

Notification Channels

Slack

Lambda function that forwards CloudWatch alerts to Slack:

import json
import os
import urllib.request

# Incoming-webhook URL, supplied via environment variable
SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']

def lambda_handler(event, context):
    """Send CloudWatch alert to Slack."""
    message = json.loads(event['Records'][0]['Sns']['Message'])

    alarm_name = message['AlarmName']
    new_state = message['NewStateValue']
    reason = message['NewStateReason']

    color = 'danger' if new_state == 'ALARM' else 'good'

    slack_message = {
        "attachments": [{
            "color": color,
            "title": f"🚨 {alarm_name}",
            "text": reason,
            "fields": [
                {"title": "State", "value": new_state, "short": True},
                {"title": "Timestamp", "value": message['StateChangeTime'], "short": True}
            ]
        }]
    }

    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(slack_message).encode(),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req) as response:
        return {'statusCode': response.status}

PagerDuty (Optional)

For the on-call rotation:

PagerDutyIntegration:
  Type: AWS::SNS::Subscription
  Properties:
    Protocol: https
    TopicArn: !Ref AlertTopic
    Endpoint: https://events.pagerduty.com/integration/<KEY>/enqueue
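HTTPS endpoints can fail transiently, so the subscription can also carry a delivery retry policy. A sketch with illustrative values (the numbers are assumptions, not SNS defaults):

```yaml
# Sketch: HTTPS delivery retries (values are illustrative)
PagerDutyIntegration:
  Type: AWS::SNS::Subscription
  Properties:
    Protocol: https
    TopicArn: !Ref AlertTopic
    Endpoint: https://events.pagerduty.com/integration/<KEY>/enqueue
    DeliveryPolicy:
      healthyRetryPolicy:
        numRetries: 5
        minDelayTarget: 20
        maxDelayTarget: 60
```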

On-Call

Schedule

  • Weekly rotation
  • Coverage: 24/7
  • Backup on-call engineer

Responsibilities

  • Respond to alerts within the SLA
  • Investigate and resolve (or escalate)
  • Document the incident
  • Write a postmortem for P0/P1 incidents

References