People Tech Docs

Incident Response

Procedure for responding to security incidents and critical failures.

Incident Severity

P0 - Critical

Examples:
- System completely down
- Data breach / security compromise
- Data loss
- Payments not being processed

SLA Response: < 15 minutes
SLA Resolution: < 4 hours

P1 - High

Examples:
- Critical feature not working
- Severely degraded performance (> 10s)
- Errors in > 50% of requests

SLA Response: < 1 hour
SLA Resolution: < 8 hours

P2 - Medium

Examples:
- Problem in a non-critical feature
- Intermittently degraded performance
- Errors in < 10% of requests

SLA Response: < 4 hours
SLA Resolution: < 24 hours

P3 - Low

Examples:
- Cosmetic issues
- Minor bugs
- Suboptimal performance

SLA Response: < 24 hours
SLA Resolution: Next sprint
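
The SLA targets above map naturally onto a lookup table. The following is a minimal sketch, not part of any existing tooling: the `SLA` table and the `sla_deadlines` function are hypothetical names, and the P3 resolution target is represented as `None` because "next sprint" has no fixed duration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Response and resolution targets per severity, copied from the levels above.
# A resolution of None stands for "next sprint" (no fixed deadline).
SLA = {
    "P0": (timedelta(minutes=15), timedelta(hours=4)),
    "P1": (timedelta(hours=1), timedelta(hours=8)),
    "P2": (timedelta(hours=4), timedelta(hours=24)),
    "P3": (timedelta(hours=24), None),
}


def sla_deadlines(
    severity: str, reported_at: datetime
) -> Tuple[datetime, Optional[datetime]]:
    """Return (respond_by, resolve_by) for an incident reported at `reported_at`."""
    response, resolution = SLA[severity]
    respond_by = reported_at + response
    resolve_by = reported_at + resolution if resolution is not None else None
    return respond_by, resolve_by


reported = datetime(2026, 1, 20, 14, 30, tzinfo=timezone.utc)
respond_by, resolve_by = sla_deadlines("P0", reported)
print(respond_by.isoformat())  # 2026-01-20T14:45:00+00:00
print(resolve_by.isoformat())  # 2026-01-20T18:30:00+00:00
```

Since the targets are written as "less than", the returned timestamps are best read as upper bounds rather than as times to aim for.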

Incident Response Process

flowchart TD
    Detect[Detect Incident] --> Assess[Assess Severity]
    Assess --> Notify[Notify Team]
    Notify --> War[War Room]
    War --> Investigate[Investigate Cause]
    Investigate --> Mitigate[Mitigate Impact]
    Mitigate --> Resolve[Resolve Completely]
    Resolve --> Verify[Verify Resolution]
    Verify --> Postmortem[Postmortem]
    Postmortem --> Actions[Action Items]
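
Because the flowchart is strictly linear, an incident tracker could enforce the order of the stages. This is only an illustrative sketch: `STAGES` reuses the node identifiers from the diagram, and `next_stage` is a hypothetical helper.

```python
from typing import Optional

# Stages in the order drawn in the flowchart above (node identifiers).
STAGES = [
    "Detect", "Assess", "Notify", "War", "Investigate",
    "Mitigate", "Resolve", "Verify", "Postmortem", "Actions",
]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after the last one."""
    index = STAGES.index(current)  # raises ValueError for unknown stages
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


print(next_stage("Mitigate"))  # Resolve
print(next_stage("Actions"))   # None
```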

1. Detection

How to detect:
- CloudWatch alarms
- User reports
- Monitoring dashboards
- Synthetic checks

Whoever detects the problem creates the incident:

## Incident #123

**Severity**: P0  
**Reported by**: John Doe  
**Time**: 2026-01-20 14:30 UTC  
**Status**: Investigating

**Symptoms**:
- API returning 502 errors
- 100% of requests failing
- Started approximately 14:25 UTC

**Impact**:
- All users affected
- Cannot login or make purchases

**Channel**: #incidents
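
To keep reports consistent, the report above could be generated from structured data. The sketch below assumes a hypothetical `Incident` dataclass whose fields mirror the report; it is not an existing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    number: int
    severity: str
    reported_by: str
    time: str
    status: str
    symptoms: List[str]
    impact: List[str]
    channel: str = "#incidents"

    def to_markdown(self) -> str:
        """Render the incident in the report format shown above."""
        lines = [
            f"## Incident #{self.number}",
            "",
            f"**Severity**: {self.severity}  ",
            f"**Reported by**: {self.reported_by}  ",
            f"**Time**: {self.time}  ",
            f"**Status**: {self.status}",
            "",
            "**Symptoms**:",
            *[f"- {item}" for item in self.symptoms],
            "",
            "**Impact**:",
            *[f"- {item}" for item in self.impact],
            "",
            f"**Channel**: {self.channel}",
        ]
        return "\n".join(lines)


incident = Incident(
    number=123,
    severity="P0",
    reported_by="John Doe",
    time="2026-01-20 14:30 UTC",
    status="Investigating",
    symptoms=["API returning 502 errors", "100% of requests failing"],
    impact=["All users affected"],
)
print(incident.to_markdown())
```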

2. Notification

P0/P1:
- Post in #incidents on Slack
- Tag @here or @channel
- Call the on-call engineer if outside business hours
- Create an incident on Statuspage (if there is one)

P2/P3:
- Post in #incidents
- Tag the owners
- Not urgent
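
The tagging rules above could be automated with a Slack incoming webhook. The sketch below is an assumption-laden illustration: `build_notification` and `post_to_slack` are hypothetical helpers, the webhook URL is a placeholder, and owners are identified by Slack user IDs. `<!here>` and `<@USER_ID>` are Slack's message syntax for an @here mention and a user mention.

```python
import json
import urllib.request
from typing import Sequence


def build_notification(
    severity: str, number: int, impact: str, owner_ids: Sequence[str] = ()
) -> dict:
    """Build a Slack message payload following the notification rules above."""
    if severity in ("P0", "P1"):
        mentions = "<!here>"  # urgent: notify everyone active in the channel
    else:
        mentions = " ".join(f"<@{uid}>" for uid in owner_ids)  # tag owners only
    prefix = f"{mentions} " if mentions else ""
    return {"text": f"{prefix}Incident #{number} - {severity}: {impact}"}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_notification("P0", 123, "API down, all users affected")
print(payload["text"])  # <!here> Incident #123 - P0: API down, all users affected
# post_to_slack(WEBHOOK_URL, payload)  # WEBHOOK_URL is a placeholder, not a real endpoint
```

Phone calls to the on-call engineer and Statuspage updates are left out of the sketch, since the document does not say which services are used for them.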

3. War Room (P0/P1 only)

- Create a thread in #incidents
- Start a video call (fixed link: meet.google.com/war-room)
- Designate an incident commander
- Post updates every 15-30 minutes

Roles:

- Incident Commander: coordinates the response
- Ops: investigates infrastructure
- Dev: investigates code
- Communications: keeps stakeholders updated

4. Investigation

Checklist:

5. Mitigation

Options:

Rollback (fastest)

git revert <commit-hash>
git push origin main
# Deployment is triggered automatically

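Hotfix (if the rollback does not resolve the issue)

git checkout -b hotfix/critical-fix
# Fix the issue, then stage and commit the change
git add -A
git commit -m "fix: resolve incident #123"
git push -u origin hotfix/critical-fix
# Fast-track the PR and deploy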

Feature Flag (if available)

if not feature_flags.is_enabled('problematic_feature'):
    return fallback_behavior()
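
The snippet above depends on a flag service that is not shown. As a self-contained illustration of the same kill-switch pattern, the sketch below reads flags from environment variables. The `FEATURE_<NAME>` naming scheme and the `is_enabled`, `problematic_feature` and `handle_request` functions are assumptions made for the example, not the project's actual flag system.

```python
import os


def is_enabled(flag: str) -> bool:
    """Hypothetical flag lookup: reads FEATURE_<FLAG> from the environment.

    Flags default to enabled, so a feature only turns off when switched off.
    """
    value = os.environ.get(f"FEATURE_{flag.upper()}", "true")
    return value.strip().lower() in ("1", "true", "yes", "on")


def fallback_behavior() -> str:
    return "fallback"


def problematic_feature() -> str:
    return "new behavior"


def handle_request() -> str:
    # Same guard as above: skip the risky code path while the flag is off.
    if not is_enabled("problematic_feature"):
        return fallback_behavior()
    return problematic_feature()


os.environ["FEATURE_PROBLEMATIC_FEATURE"] = "false"  # flip the kill switch
print(handle_request())  # fallback
```

Disabling a flag mitigates the impact without a deploy, which is why it can be faster than a hotfix when the faulty code path is already behind a flag.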

Scale (if the problem is capacity)

# Increase reserved concurrency
aws lambda put-function-concurrency \
  --function-name app-function \
  --reserved-concurrent-executions 100

6. Resolution

- Deploy the fix
- Verify that the problem is resolved
- Monitor for 30 minutes
- Communicate the resolution
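
The 30-minute monitoring window can be scripted as a polling loop. This is a minimal sketch under stated assumptions: `monitor_stability` is a hypothetical helper, the health check is supplied by the caller, and the one-minute polling interval is an arbitrary choice. The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


def monitor_stability(
    check: Callable[[], bool],
    duration_s: float = 30 * 60,
    interval_s: float = 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until `duration_s` has elapsed.

    Returns False as soon as one check fails, True if every check passed.
    """
    deadline = clock() + duration_s
    while clock() < deadline:
        if not check():
            return False
        sleep(interval_s)
    return True


# Quick demonstration with a tiny window; a real run would use the defaults
# and a check that queries a health endpoint or an alarm state.
print(monitor_stability(lambda: True, duration_s=0.02, interval_s=0.01))  # True
```

A `False` result would mean the incident should go back to mitigation rather than be declared resolved.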

7. Postmortem

When: always for P0 and P1; for P2 if the incident was significant

Template:

# Postmortem: [Incident Title]

**Date**: 2026-01-20  
**Authors**: [Names]  
**Severity**: P0

## Summary

[2-3 sentence summary of what happened]

## Impact

- **Duration**: 14:25 - 15:10 UTC (45 minutes)
- **Users Affected**: ~1,000 (100% of active users)
- **Revenue Impact**: ~$500 in lost sales
- **Systems Affected**: API Gateway, User Lambda

## Timeline

- 14:25 - New version deployed to production
- 14:26 - Alarms started firing
- 14:28 - Incident detected by a user
- 14:30 - War room started
- 14:35 - Root cause identified (validation bug)
- 14:40 - Rollback started
- 14:50 - Rollback complete
- 15:00 - Resolution verified
- 15:10 - Incident closed

## Root Cause

[Detailed explanation of the root cause]

## Resolution

[How it was resolved]

## What Went Well

- Fast detection (3 minutes)
- Effective war room
- Rollback worked perfectly
- Good communication with users

## What Went Wrong

- Bug slipped through the tests
- Deploy during peak hours
- Pre-deploy alarm not configured

## Action Items

- [ ] Add a specific test for this case (#456)
- [ ] Configure pre-deploy alarm (#457)
- [ ] Avoid deploys between 14:00 and 16:00 (#458)
- [ ] Improve staging to catch this (#459)

## Lessons Learned

[Takeaways and conclusions]
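
The durations quoted in the template (45 minutes in total, 3 minutes to detect) follow directly from the timeline. A small sketch of that arithmetic, assuming all timestamps fall on the same day; `minutes_between` and the `timeline` keys are hypothetical names chosen for the example:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the example timeline in the template.
timeline = {
    "deploy": "14:25",
    "detected": "14:28",
    "rollback_started": "14:40",
    "rollback_complete": "14:50",
    "closed": "15:10",
}

print(minutes_between(timeline["deploy"], timeline["detected"]))  # 3 (time to detect)
print(minutes_between(timeline["deploy"], timeline["closed"]))    # 45 (total duration)
print(minutes_between(timeline["rollback_started"], timeline["rollback_complete"]))  # 10
```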

Communication Templates

Initial Notification

🚨 **Incident #123 - P0**

**Status**: Investigating
**Impact**: API down, all users affected
**Started**: 14:25 UTC

We're investigating and will update in 15 minutes.

Thread: [link]

Updates

📊 **Update (14:45 UTC)**

**Status**: Mitigating
**Progress**: Identified root cause, rolling back
**ETA**: 15 minutes

Next update: 15:00 UTC
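
The update message can be generated so that the "Next update" time is always consistent with the update cadence. A minimal sketch, assuming a hypothetical `format_update` helper and a default interval of 15 minutes, the lower end of a 15-30 minute cadence:

```python
from datetime import datetime, timedelta, timezone


def format_update(
    now: datetime,
    status: str,
    progress: str,
    eta_minutes: int,
    interval_minutes: int = 15,
) -> str:
    """Render an update message in the format shown above."""
    next_update = now + timedelta(minutes=interval_minutes)
    return "\n".join([
        f"📊 **Update ({now:%H:%M} UTC)**",
        "",
        f"**Status**: {status}",
        f"**Progress**: {progress}",
        f"**ETA**: {eta_minutes} minutes",
        "",
        f"Next update: {next_update:%H:%M} UTC",
    ])


now = datetime(2026, 1, 20, 14, 45, tzinfo=timezone.utc)
print(format_update(now, "Mitigating", "Identified root cause, rolling back", 15))
```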

Resolution

✅ **Incident #123 - RESOLVED**

**Duration**: 45 minutes (14:25 - 15:10 UTC)
**Root Cause**: Bug in validation logic
**Resolution**: Rolled back to previous version

Service is fully restored. Monitoring for stability.

Postmortem: [link]

Security Incidents

Data Breach Response

Immediately:
- Isolate affected systems
- Preserve evidence
- Notify security team

Assess:
- What data was compromised?
- How many users affected?
- Legal/compliance implications?

Contain:
- Close vulnerability
- Rotate all credentials
- Revoke compromised sessions

Notify:
- Affected users
- Legal team
- Regulators (if LGPD/GDPR applies)

Recover:
- Restore systems
- Enhanced monitoring
- Security audit
