Common Issues

Common problems and their solutions.

Application Issues

API returning 502 errors

Symptoms:
  - Users cannot access API
  - CloudWatch shows Lambda errors
  - Gateway timeout

Common Causes:
  1. Lambda timeout
  2. Database connection issue
  3. Unhandled exception
  4. Memory exceeded

Solutions:

# 1. Check Lambda logs
aws logs tail /aws/lambda/app-function --since 30m | grep ERROR

# 2. Check database connections
psql -h prod-db... -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check Lambda metrics (scoped to app-function)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=app-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 \
  --statistics Sum

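# 4. If the cause is a timeout, increase it temporarily
#    (if the function sits behind API Gateway, its default 29 s integration
#    timeout still applies, so longer requests may fail with 504 instead)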
aws lambda update-function-configuration \
  --function-name app-function \
  --timeout 60
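
To tell the "Lambda timeout" and "Memory exceeded" causes apart, the log output from step 1 can be summarized rather than read line by line. The sketch below is a minimal helper, not part of the existing tooling: the function name `summarize_lambda_failures` is hypothetical, and it assumes the standard Lambda log lines (`Task timed out after ...` and the tab-separated `REPORT RequestId: ...` line with `Memory Size` and `Max Memory Used`).

```shell
# Hypothetical helper: reads Lambda log lines on stdin and counts
#   - invocations that hit the function timeout
#   - invocations whose peak memory reached the configured memory size
summarize_lambda_failures() {
  awk '
    /Task timed out after/ { timeouts++ }
    /REPORT RequestId:/ {
      size = 0; used = 0
      # "Memory Size: " is 13 characters, "Max Memory Used: " is 17
      if (match($0, /Memory Size: [0-9]+/))
        size = substr($0, RSTART + 13, RLENGTH - 13) + 0
      if (match($0, /Max Memory Used: [0-9]+/))
        used = substr($0, RSTART + 17, RLENGTH - 17) + 0
      if (size > 0 && used >= size) saturated++
    }
    END { printf "timeouts=%d memory_saturated=%d\n", timeouts, saturated }
  '
}
```

A typical use would be to pipe `aws logs tail /aws/lambda/app-function --since 30m` into `summarize_lambda_failures`. A non-zero `memory_saturated` count suggests raising the memory size rather than the timeout.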

Users cannot login

Symptoms:
  - Login fails with 401/500
  - "Invalid credentials" even with correct password
  - Timeout on login

Solutions:

# Check auth Lambda
aws logs tail /aws/lambda/app-auth-function --since 30m

# Check database
psql -c "SELECT count(*) FROM users WHERE is_active = true;"

# Check Secrets Manager (JWT secret)
aws secretsmanager get-secret-value --secret-id /app/production/jwt-secret

# Verify token generation
python scripts/test_jwt.py

Slow API responses

Symptoms:
  - Latency > 2s
  - Timeouts
  - Users complaining about slowness

Solutions:

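# 1. Check slow queries (requires the pg_stat_statements extension;
#    on PostgreSQL 13+ the columns are mean_exec_time and max_exec_time)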
psql -c "
SELECT query, calls, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"

# 2. Check Lambda cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/app-function \
  --filter-pattern "REPORT" \
  --query 'events[].message'

# 3. Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# 4. Add index if missing (replace the <table> and <column> placeholders)
psql -c "CREATE INDEX CONCURRENTLY idx_<table>_<column> ON <table>(<column>);"
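
For the cold-start check in step 2, the `REPORT` lines are easier to interpret once they are counted. Lambda adds an `Init Duration` field to the `REPORT` line only for invocations that initialized a new execution environment, so its presence marks a cold start. The helper below is a sketch under that assumption; the name `cold_start_stats` is hypothetical.

```shell
# Hypothetical helper: reads Lambda REPORT lines on stdin and prints
# how many invocations were cold starts and the largest init duration.
cold_start_stats() {
  awk '
    /REPORT RequestId:/ {
      total++
      # "Init Duration: " is 15 characters; the field only appears on cold starts
      if (match($0, /Init Duration: [0-9.]+/)) {
        cold++
        init = substr($0, RSTART + 15, RLENGTH - 15) + 0
        if (init > max) max = init
      }
    }
    END { printf "invocations=%d cold_starts=%d max_init_ms=%.2f\n", total, cold, max }
  '
}
```

One way to feed it is `aws logs tail /aws/lambda/app-function --since 1h --filter-pattern REPORT | cold_start_stats`. If cold starts are a small share of invocations, the latency problem is more likely in the queries checked in steps 1 and 3.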

Infrastructure Issues

Lambda throttling

Symptoms:
  - Invocations failing
  - CloudWatch metric Throttles > 0
  - 429 errors

Solutions:

# Check current concurrency
aws lambda get-function-concurrency --function-name app-function

# Increase reserved concurrency
aws lambda put-function-concurrency \
  --function-name app-function \
  --reserved-concurrent-executions 100

# Check account limits
aws servicequotas get-service-quota \
  --service-code lambda \
  --quota-code L-B99A9384  # Concurrent executions
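
The value 100 above is only an example. A rough way to size reserved concurrency is the relationship AWS documents for Lambda: concurrency is approximately requests per second multiplied by average duration in seconds. The sketch below applies that formula and rounds up; the function name `estimate_concurrency` and the sample numbers are illustrative assumptions, and real traffic bursts need headroom on top of the result.

```shell
# Hypothetical helper: estimate steady-state concurrency.
#   $1 = average requests per second
#   $2 = average invocation duration in milliseconds
estimate_concurrency() {
  awk -v rps="$1" -v ms="$2" 'BEGIN {
    c = rps * ms / 1000      # requests/s * seconds per request
    n = int(c)
    if (n < c) n++           # round up to a whole execution environment
    print n
  }'
}
```

For instance, `estimate_concurrency 200 250` models 200 requests per second at 250 ms each. Keep in mind that reserved concurrency is also a cap, so a value set too low causes throttling by itself.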

SQS messages piling up

Symptoms:
  - ApproximateAgeOfOldestMessage high
  - Messages not being processed
  - DLQ has messages

Solutions:

# Check consumer Lambda
aws logs tail /aws/lambda/app-event-processor --since 30m

# Check if Lambda is being invoked
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=app-event-processor \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Manually process DLQ if needed
python scripts/replay_dlq.py --queue-url <DLQ_URL>

# Purge if messages are invalid (irreversible: deletes every message in the queue)
aws sqs purge-queue --queue-url <QUEUE_URL>

Database connection errors

Symptoms:
  - "Too many connections"
  - Connection timeouts
  - "remaining connection slots are reserved"

Solutions:

# Check current connections
psql -c "
SELECT count(*), state, application_name
FROM pg_stat_activity
GROUP BY state, application_name;"

# Kill idle connections
psql -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';"

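# Increase max_connections (last resort) by attaching a parameter group that sets it.
# max_connections is a static parameter, so it only takes effect after a reboot.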
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --db-parameter-group-name custom-params \
  --apply-immediately
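
The "remaining connection slots are reserved" symptom appears before `max_connections` is fully used, because PostgreSQL holds back `superuser_reserved_connections` slots (3 by default). The sketch below makes that arithmetic explicit. The function name `connection_headroom` is hypothetical, and the inputs are assumed to come from `SELECT count(*) FROM pg_stat_activity`, `SHOW max_connections` and `SHOW superuser_reserved_connections`.

```shell
# Hypothetical helper: how many slots are left for regular (non-superuser) roles.
#   $1 = current connection count
#   $2 = max_connections
#   $3 = superuser_reserved_connections (optional, defaults to 3)
connection_headroom() (
  used=$1
  max=$2
  reserved=${3:-3}
  free=$(( max - reserved - used ))
  if [ "$free" -le 0 ]; then
    echo "exhausted: 0 slots left for regular roles"
  else
    echo "ok: $free slots left for regular roles"
  fi
)
```

The count from `pg_stat_activity` can include background processes on recent PostgreSQL versions, so treat the result as a slightly pessimistic estimate.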

High database CPU

Symptoms:
  - CPU > 80%
  - Slow queries
  - Timeouts

Solutions:

# Identify problematic queries
psql -c "
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start
LIMIT 10;"

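# Kill a specific query (careful: pg_terminate_backend closes the whole
# connection; pg_cancel_backend(<pid>) cancels only the running query)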
psql -c "SELECT pg_terminate_backend(<pid>);"

# Analyze slow queries (on PostgreSQL 13+, use total_exec_time and mean_exec_time)
psql -c "
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;"

# Check for missing indexes
python scripts/suggest_indexes.py

Monitoring Issues

CloudWatch logs not showing

Solutions:

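# Check that the Lambda execution role has CloudWatch Logs permissions
# (get-policy returns the resource-based policy, not the role's permissions)
aws lambda get-function-configuration \
  --function-name app-function --query Role --output text
aws iam list-attached-role-policies --role-name <ROLE_NAME>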

# Check log group exists
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/app

# Check log retention
aws logs describe-log-groups \
  --log-group-name-prefix /aws/lambda/app-function \
  --query 'logGroups[0].retentionInDays'

Alarms not triggering

# Check alarm configuration
aws cloudwatch describe-alarms --alarm-names app-error-rate

# Test alarm manually
aws cloudwatch set-alarm-state \
  --alarm-name app-error-rate \
  --state-value ALARM \
  --state-reason "Testing alarm"

Known Issues

Issue: Cold start latency

Description: The first request after an idle period is slow (> 3s)

Workaround: Provisioned concurrency or periodic warm-up pings

Status: Accepted trade-off
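
If the provisioned concurrency workaround is applied, it has to target a published version or an alias, not `$LATEST`. The wrapper below is a sketch that builds the `aws lambda put-provisioned-concurrency-config` call and supports a dry run so the command can be reviewed first. The function name `enable_provisioned_concurrency`, the `DRY_RUN` variable and the alias name `live` used in the example are assumptions, not existing project conventions.

```shell
# Hypothetical wrapper around put-provisioned-concurrency-config.
#   $1 = function name, $2 = alias or version, $3 = number of warm environments
# Set DRY_RUN=1 to print the command instead of running it.
enable_provisioned_concurrency() (
  set -- aws lambda put-provisioned-concurrency-config \
    --function-name "$1" \
    --qualifier "$2" \
    --provisioned-concurrent-executions "$3"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
)
```

Running `DRY_RUN=1 enable_provisioned_concurrency app-function live 5` prints the command without calling AWS. Provisioned concurrency is billed while it is configured, which is part of why the cold-start latency is treated as an accepted trade-off.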

Issue: Connection pool exhaustion during peak

Description: Database connections are exhausted during peak hours

Workaround: Increase the pool size or use PgBouncer

Fix: On the roadmap (Q2 2026)

Quick Reference

| Issue          | Check              | Fix                           |
|----------------|--------------------|-------------------------------|
| 502 errors     | Lambda logs        | Roll back or increase timeout |
| Slow API       | Slow query log     | Add indexes                   |
| Login fails    | Auth Lambda logs   | Check secrets                 |
| Queue piling   | Consumer Lambda    | Check errors, scale           |
| DB connections | pg_stat_activity   | Kill idle, increase pool      |
| High CPU       | pg_stat_statements | Optimize queries              |

Escalation

If you cannot resolve the issue:

  1. Post in #incidents with details
  2. Tag the on-call engineer
  3. If P0: call the Tech Lead
  4. Start the incident response process

See Incident Response →

References