Alerting
Alarm and notification configuration with CloudWatch.
Critical Alarms
Lambda Error Rate
# SAM template
Resources:
  LambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-lambda-errors'
      # Errors with Statistic: Sum is a count, not a percentage; a true
      # error rate needs a metric math expression over Errors and Invocations.
      AlarmDescription: Alert when Lambda reports more than 5 errors in 5 minutes
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref UserFunction
      AlarmActions:
        - !Ref AlertTopic
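The `Errors` metric with `Statistic: Sum` is a count of failed invocations, whereas an error rate divides that count by `Invocations`. The difference can be sketched in Python. This only illustrates the arithmetic, not a CloudWatch configuration; the function name `error_rate_percent` and the sample numbers are assumptions.

```python
def error_rate_percent(errors: int, invocations: int) -> float:
    """Return errors as a percentage of invocations (0.0 when idle)."""
    if invocations <= 0:
        # No traffic in the period: report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * errors / invocations


# Hypothetical 5-minute window: 6 errors out of 1,000 invocations.
rate = error_rate_percent(6, 1000)
print(f"{rate:.1f}%")  # 0.6%: above a count threshold of 5, far below a 5% rate
```

A count threshold tends to fire on busy functions and stay quiet on low-traffic ones, so it is worth deciding which of the two behaviours is actually wanted.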
SQS DLQ Messages
  DeadLetterQueueAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-dlq-messages'
      AlarmDescription: Alert when messages in DLQ
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt DeadLetterQueue.QueueName
      AlarmActions:
        - !Ref AlertTopic
RDS High CPU
  RDSHighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-rds-high-cpu'
      AlarmDescription: Alert when RDS CPU > 80%
      MetricName: CPUUtilization
      Namespace: AWS/RDS
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref Database
      AlarmActions:
        - !Ref AlertTopic
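How `Period`, `EvaluationPeriods` and `Threshold` combine can be sketched with a simplified model in Python. It assumes one datapoint per period and ignores CloudWatch's missing-data handling and the optional `DatapointsToAlarm` setting; the function name `alarm_state` and the sample values are hypothetical.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int) -> str:
    """Simplified model: ALARM only if the last N datapoints all exceed the threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical 5-minute CPU averages (percent), oldest first.
print(alarm_state([70, 85, 78], threshold=80, evaluation_periods=2))  # OK
print(alarm_state([70, 85, 91], threshold=80, evaluation_periods=2))  # ALARM
```

With `EvaluationPeriods: 2` and `Period: 300`, a single five-minute spike does not trigger the alarm; CPU has to stay above 80% for two consecutive periods.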
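SNS Topic for Alerts
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: App Alerts
      # The AWS::SNS::Topic property is named Subscription (singular).
      Subscription:
        - Endpoint: tech-team@seuapp.com
          Protocol: email
        # A Lambda subscription also needs an AWS::Lambda::Permission that
        # lets sns.amazonaws.com invoke the function.
        - Endpoint: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:slack-notifier'
          Protocol: lambda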
Severity Levels
P0 - Critical
- System completely down
- Data loss
- Security breach
Response: Immediate (< 15 minutes)
Notification: Phone call + Slack
P1 - High
- Critical feature not working
- Severely degraded performance
- Frequent errors
Response: < 1 hour
Notification: Slack + Email
P2 - Medium
- Non-critical feature with issues
- Intermittently degraded performance
- Warnings in logs
Response: < 4 hours
Notification: Slack
P3 - Low
- Cosmetic issues
- Suboptimal performance
- Informational alerts
Response: Next business day
Notification: Email
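The severity levels above can be captured as a small lookup table so that routing code and documentation stay in sync. This is a minimal Python sketch; the names `SeverityPolicy`, `SEVERITY_POLICIES` and `channels_for` are hypothetical, and `None` is used here to stand for "next business day".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_within: Optional[timedelta]  # None means "next business day"
    channels: tuple


SEVERITY_POLICIES = {
    "P0": SeverityPolicy(timedelta(minutes=15), ("phone", "slack")),
    "P1": SeverityPolicy(timedelta(hours=1), ("slack", "email")),
    "P2": SeverityPolicy(timedelta(hours=4), ("slack",)),
    "P3": SeverityPolicy(None, ("email",)),
}


def channels_for(severity: str) -> tuple:
    """Look up the notification channels for a severity level."""
    return SEVERITY_POLICIES[severity].channels


print(channels_for("P1"))  # ('slack', 'email')
```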
Notification Channels
Slack
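Lambda function that forwards alerts to Slack:
import json
import os
import urllib.request

# Assumption: the webhook URL is supplied through an environment variable
# named SLACK_WEBHOOK_URL; the original snippet left it undefined.
SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']


def lambda_handler(event, context):
    """Send CloudWatch alert to Slack."""
    # SNS delivers the CloudWatch alarm as a JSON string in the Message field.
    message = json.loads(event['Records'][0]['Sns']['Message'])

    alarm_name = message['AlarmName']
    new_state = message['NewStateValue']
    reason = message['NewStateReason']

    # Red attachment for ALARM, green for any other state (OK, INSUFFICIENT_DATA).
    color = 'danger' if new_state == 'ALARM' else 'good'

    slack_message = {
        "attachments": [{
            "color": color,
            "title": f"🚨 {alarm_name}",
            "text": reason,
            "fields": [
                {"title": "State", "value": new_state, "short": True},
                {"title": "Timestamp", "value": message['StateChangeTime'], "short": True}
            ]
        }]
    }

    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(slack_message).encode(),
        headers={'Content-Type': 'application/json'}
    )
    # Supplying data makes this a POST; the timeout keeps a slow webhook
    # from holding the function open until the Lambda timeout.
    urllib.request.urlopen(req, timeout=10)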
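The parsing step can be exercised locally without calling Slack by separating payload construction from the HTTP request. The sketch below assumes a hypothetical helper `build_slack_message` and a hand-written sample event trimmed to the fields the handler reads; real SNS events carry more fields.

```python
import json


def build_slack_message(sns_event: dict) -> dict:
    """Turn an SNS-wrapped CloudWatch alarm notification into a Slack payload."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    state = message["NewStateValue"]
    return {
        "attachments": [{
            "color": "danger" if state == "ALARM" else "good",
            "title": message["AlarmName"],
            "text": message["NewStateReason"],
        }]
    }


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "my-stack-lambda-errors",
                "NewStateValue": "ALARM",
                "NewStateReason": "Threshold Crossed",
                "StateChangeTime": "2024-01-01T00:00:00.000+0000",
            })
        }
    }]
}

payload = build_slack_message(sample_event)
print(payload["attachments"][0]["color"])  # danger
```

Keeping the payload builder free of network calls makes it straightforward to unit test the formatting for each alarm state.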
PagerDuty (Optional)
For on-call rotation:
  PagerDutyIntegration:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: https
      TopicArn: !Ref AlertTopic
      Endpoint: https://events.pagerduty.com/integration/<KEY>/enqueue
On-Call
Schedule
- Weekly rotation
- Coverage: 24/7
- Backup on-call
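A weekly rotation with a backup can be computed from a fixed roster and a start date. This is a minimal Python sketch under stated assumptions: the roster names, the start date, the function name `on_call_for`, and the rule that the backup is the next person in the list are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list, start: date) -> tuple:
    """Return (primary, backup) for the week containing `day`."""
    weeks_elapsed = (day - start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    # Assumption: the backup is whoever is next in the rotation.
    backup = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, backup


# Hypothetical roster; the rotation starts on Monday 2024-01-01.
roster = ["ana", "bruno", "carla"]
print(on_call_for(date(2024, 1, 10), roster, start=date(2024, 1, 1)))
# ('bruno', 'carla')
```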
Responsibilities
- Respond to alerts within the SLA
- Investigate and resolve (or escalate)
- Document the incident
- Write a postmortem for P0/P1 incidents