Alerting
Configuring alarms and notifications with CloudWatch.
Critical Alarms
Lambda Error Rate
# SAM template
Resources:
  LambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-lambda-errors'
      # Errors with Statistic: Sum is a count, so the threshold is 5 errors, not 5%.
      AlarmDescription: Alert when Lambda errors > 5 in 5 minutes
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 300  # seconds
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref UserFunction
      AlarmActions:
        - !Ref AlertTopic
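The alarm above thresholds on `Errors` with `Statistic: Sum`, which is a count of failed invocations. If a percentage error rate is what is wanted, CloudWatch metric math can divide `Errors` by `Invocations`. The sketch below only assembles the keyword arguments one might pass to boto3's `put_metric_alarm`; the helper name `error_rate_alarm_params`, the 5% default, and the example function name are assumptions, not part of the original template.

```python
def error_rate_alarm_params(function_name, topic_arn, threshold_pct=5.0, period=300):
    """Build put_metric_alarm kwargs for a Lambda error-rate (percentage) alarm.

    Hypothetical helper: it only assembles the request, so it runs without AWS access.
    """
    def lambda_metric(metric_id, metric_name):
        # One raw Lambda metric, summed per period. ReturnData=False keeps it
        # as an input to the expression rather than the alarmed series.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "AlarmDescription": f"Lambda error rate > {threshold_pct}%",
        "Metrics": [
            lambda_metric("errors", "Errors"),
            lambda_metric("invocations", "Invocations"),
            {
                # Metric math expression: the only series the alarm evaluates.
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # treat periods without data as OK
        "AlarmActions": [topic_arn],
    }
```

With boto3 installed and credentials configured, the result could be applied with `boto3.client("cloudwatch").put_metric_alarm(**error_rate_alarm_params("user-function", topic_arn))`.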
SQS DLQ Messages
  DeadLetterQueueAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-dlq-messages'
      AlarmDescription: Alert when messages in DLQ
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt DeadLetterQueue.QueueName
      AlarmActions:
        - !Ref AlertTopic
RDS High CPU
  RDSHighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-rds-high-cpu'
      AlarmDescription: Alert when RDS CPU > 80%
      MetricName: CPUUtilization
      Namespace: AWS/RDS
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref Database
      AlarmActions:
        - !Ref AlertTopic
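The `Period`, `EvaluationPeriods`, `Threshold`, and `ComparisonOperator` settings work together: for the RDS alarm above, the average CPU has to exceed 80 in two consecutive 5-minute periods before the alarm fires. A simplified model of that rule is sketched below, assuming no missing data and no `DatapointsToAlarm` override; the function name `is_breaching` is illustrative and not a CloudWatch API.

```python
import operator

# Comparison operators used by the alarms in this section.
COMPARATORS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
}


def is_breaching(datapoints, threshold, comparison, evaluation_periods):
    """Return True if the most recent `evaluation_periods` datapoints all breach.

    `datapoints` is a list of per-period statistic values, oldest first.
    """
    if len(datapoints) < evaluation_periods:
        return False  # not enough data yet to evaluate
    compare = COMPARATORS[comparison]
    recent = datapoints[-evaluation_periods:]
    return all(compare(value, threshold) for value in recent)


# RDS CPU example: one spike is not enough, two consecutive periods are.
print(is_breaching([40, 95, 60], 80, "GreaterThanThreshold", 2))  # False
print(is_breaching([40, 85, 92], 80, "GreaterThanThreshold", 2))  # True
```

Requiring two periods trades a slower alert (up to 10 minutes here) for fewer false alarms on short CPU spikes, while the DLQ alarm uses a single 60-second period because any message in the queue is already actionable.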
SNS Topic for Alerts
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: App Alerts
      Subscription:
        - Endpoint: peopletech@loggi.com
          Protocol: email
Severity Levels
P0 - Critical
- System completely down
- Data loss
- Security breach
Response: Immediate (< 15 minutes)
Notification: Phone call + Google Chat space "Incidents"
P1 - High
- Critical feature not working
- Severely degraded performance
- Frequent errors
Response: < 1 hour
Notification: Google Chat space "Incidents" + peopletech@loggi.com
P2 - Medium
- Non-critical feature with issues
- Occasionally degraded performance
- Warnings in logs
Response: < 4 hours
Notification: Google Chat space "Tech"
P3 - Low
- Cosmetic issues
- Suboptimal performance
- Informational alerts
Response: Next business day
Notification: Email
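The severity levels above can be encoded as data so that routing code and this document stay in sync. A minimal sketch follows; the `SeverityPolicy` type and the `policy_for` helper are hypothetical names, and the P3 response time is modeled as `None` because "next business day" has no fixed duration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_within: Optional[timedelta]  # None means "next business day"
    channels: Tuple[str, ...]


SEVERITY_POLICIES = {
    "P0": SeverityPolicy("Critical", timedelta(minutes=15),
                         ("phone call", 'Google Chat space "Incidents"')),
    "P1": SeverityPolicy("High", timedelta(hours=1),
                         ('Google Chat space "Incidents"', "peopletech@loggi.com")),
    "P2": SeverityPolicy("Medium", timedelta(hours=4),
                         ('Google Chat space "Tech"',)),
    "P3": SeverityPolicy("Low", None, ("email",)),
}


def policy_for(severity):
    """Look up the response policy for a severity code such as "P1"."""
    try:
        return SEVERITY_POLICIES[severity.upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

For example, `policy_for("p1").channels` yields the two P1 notification targets, and an unknown code such as `"P9"` raises `ValueError` instead of silently falling back to a default.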
Notification Channels
Google Chat
Lambda function that sends alerts to Google Chat via a webhook:
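import json
import os
import urllib.request

# Webhook URL of the target Google Chat space, read from the function's
# environment (the variable name is an assumption; the original left it undefined).
GOOGLE_CHAT_WEBHOOK_URL = os.environ['GOOGLE_CHAT_WEBHOOK_URL']


def lambda_handler(event, context):
    """Send CloudWatch alert to Google Chat."""
    # SNS delivers the CloudWatch alarm as a JSON string in the Message field.
    message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = message['AlarmName']
    new_state = message['NewStateValue']
    reason = message['NewStateReason']
    icon = '🚨' if new_state == 'ALARM' else '✅'
    chat_message = {
        "text": f"{icon} *{alarm_name}*\nStatus: {new_state}\nReason: {reason}\nTimestamp: {message['StateChangeTime']}"
    }
    req = urllib.request.Request(
        GOOGLE_CHAT_WEBHOOK_URL,
        data=json.dumps(chat_message).encode('utf-8'),
        headers={'Content-Type': 'application/json; charset=UTF-8'},
    )
    # Passing data makes this a POST; the timeout keeps a slow webhook from
    # holding the function until its own timeout.
    with urllib.request.urlopen(req, timeout=10) as response:
        response.read()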
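To check the message formatting without calling the webhook, the parsing step can be exercised with a hand-built SNS event. The sketch below follows the same parsing steps as the handler in a hypothetical helper, `build_chat_text`; the sample alarm values are made up for illustration.

```python
import json


def build_chat_text(event):
    """Extract the alarm fields from an SNS event and format the chat text."""
    message = json.loads(event['Records'][0]['Sns']['Message'])
    icon = '🚨' if message['NewStateValue'] == 'ALARM' else '✅'
    return (
        f"{icon} *{message['AlarmName']}*\n"
        f"Status: {message['NewStateValue']}\n"
        f"Reason: {message['NewStateReason']}\n"
        f"Timestamp: {message['StateChangeTime']}"
    )


# Hand-built event mirroring the shape SNS passes to Lambda; values are illustrative.
sample_event = {
    'Records': [{
        'Sns': {
            'Message': json.dumps({
                'AlarmName': 'my-stack-lambda-errors',
                'NewStateValue': 'ALARM',
                'NewStateReason': 'Threshold Crossed',
                'StateChangeTime': '2024-01-01T12:00:00.000+0000',
            })
        }
    }]
}

print(build_chat_text(sample_event))
```

Keeping the formatting in a pure function like this makes it testable locally, with the network call left as a thin wrapper in the handler.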
PagerDuty (Optional)
For on-call rotation:
  PagerDutyIntegration:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: https
      TopicArn: !Ref AlertTopic
      Endpoint: https://events.pagerduty.com/integration/<KEY>/enqueue
On-Call
Schedule
- Weekly rotation
- Hours: 24/7
- Backup on-call
Responsibilities
- Respond to alerts within the SLA
- Investigate and resolve (or escalate)
- Document the incident
- Write a postmortem for P0/P1 incidents
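A weekly rotation with a backup can be derived from a roster and a start date. This is a minimal sketch that assumes the backup is simply the next person in the roster; the roster names, the start date, and the `on_call_for` helper are all hypothetical.

```python
from datetime import date


def on_call_for(day, roster, rotation_start):
    """Return (primary, backup) for the week containing `day`.

    Weeks are counted in 7-day blocks from `rotation_start`; the backup is
    the next person in the roster (an assumption, not a stated policy).
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    if len(roster) < 2:
        raise ValueError("need at least two people for primary and backup")
    week_index = (day - rotation_start).days // 7
    primary = roster[week_index % len(roster)]
    backup = roster[(week_index + 1) % len(roster)]
    return primary, backup


roster = ["ana", "bruno", "carla"]  # placeholder names
start = date(2024, 1, 1)            # a Monday, chosen for illustration

print(on_call_for(date(2024, 1, 3), roster, start))   # ('ana', 'bruno')
print(on_call_for(date(2024, 1, 10), roster, start))  # ('bruno', 'carla')
```

Under this scheme each week's backup becomes the following week's primary, so context from open incidents carries over naturally.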