Common Issues
Common problems and their solutions.
Application Issues
API returning 502 errors
Symptoms:
- Users cannot access API
- CloudWatch shows Lambda errors
- Gateway timeout
Common Causes:
1. Lambda timeout
2. Database connection issue
3. Unhandled exception
4. Memory exceeded
Solutions:
# 1. Check Lambda logs
aws logs tail /aws/lambda/app-function --since 30m | grep ERROR
# 2. Check database connections
psql -h prod-db... -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check Lambda metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=app-function \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# 4. If timeout issue, increase timeout temporarily
aws lambda update-function-configuration \
--function-name app-function \
--timeout 60
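To map the log output from step 1 back to the common causes listed above, a small filter can count known failure signatures. This is a minimal sketch: `classify_lambda_errors` is a hypothetical helper, and the patterns are assumptions based on typical Lambda runtime and PostgreSQL client messages, so adjust them to what your application actually logs.

```shell
# Sketch: count common failure signatures in Lambda log lines read from stdin.
# The patterns below are assumptions; a line may match more than one bucket.
classify_lambda_errors() {
  awk '
    /Task timed out after/                                      { timeout++ }
    /connection refused|too many connections|could not connect/ { db++ }
    /Traceback|Unhandled|\[ERROR\]/                             { exception++ }
    /Runtime\.ExitError|signal: killed/                         { memory++ }
    END {
      printf "timeout=%d db=%d exception=%d memory=%d\n", timeout, db, exception, memory
    }'
}
```

Possible usage: `aws logs tail /aws/lambda/app-function --since 30m | classify_lambda_errors`. The bucket with the highest count suggests which cause to investigate first.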
Users cannot login
Symptoms:
- Login fails with 401/500
- "Invalid credentials" even with correct password
- Timeout on login
Solutions:
# Check auth Lambda
aws logs tail /aws/lambda/app-auth-function --since 30m
# Check database
psql -c "SELECT count(*) FROM users WHERE is_active = true;"
# Check Secrets Manager (JWT secret): confirm the secret exists and when it last changed,
# without printing its value to the terminal
aws secretsmanager describe-secret --secret-id /app/production/jwt-secret
# Verify token generation
python scripts/test_jwt.py
Slow API responses
Symptoms:
- Latency > 2s
- Timeouts
- Users complaining about slowness
Solutions:
# 1. Check slow queries (on PostgreSQL 13+, the columns are mean_exec_time and max_exec_time)
psql -c "
SELECT query, calls, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"
# 2. Check Lambda cold starts
aws logs filter-log-events \
--log-group-name /aws/lambda/app-function \
--filter-pattern '"Init Duration"' \
--query 'events[].message'
# 3. Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
# 4. Add index if missing (replace the placeholders; table and column are reserved words)
psql -c "CREATE INDEX CONCURRENTLY idx_<table>_<column> ON <table>(<column>);"
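For step 2, only cold-start invocations carry an `Init Duration` field in their `REPORT` line, so counting those lines gives a rough cold-start rate. The sketch below assumes the standard `REPORT` line layout; `summarize_cold_starts` is a hypothetical helper name, not part of any existing script.

```shell
# Sketch: summarize cold starts from Lambda REPORT lines read from stdin.
summarize_cold_starts() {
  awk '
    /REPORT RequestId:/ { total++ }
    match($0, /Init Duration: [0-9.]+/) {
      # Skip the 15-character "Init Duration: " prefix to reach the number.
      ms = substr($0, RSTART + 15, RLENGTH - 15) + 0
      cold++
      if (ms > max) max = ms
    }
    END { printf "reports=%d cold_starts=%d max_init_ms=%.2f\n", total, cold, max }'
}
```

Possible usage: `aws logs tail /aws/lambda/app-function --since 1h | summarize_cold_starts`. A high ratio of `cold_starts` to `reports` would point at cold starts rather than slow queries.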
Infrastructure Issues
Lambda throttling
Symptoms:
- Invocations failing
- CloudWatch metric Throttles > 0
- 429 errors
Solutions:
# Check current concurrency
aws lambda get-function-concurrency --function-name app-function
# Increase reserved concurrency
aws lambda put-function-concurrency \
--function-name app-function \
--reserved-concurrent-executions 100
# Check account limits
aws service-quotas get-service-quota \
--service-code lambda \
--quota-code L-B99A9384 # Concurrent executions
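Before picking a reserved concurrency value, it may help to estimate how much the function needs. A common rule of thumb is concurrency ≈ requests per second × average duration in seconds. The helper below is a sketch of that arithmetic only; `estimate_concurrency` and the traffic figures are illustrative assumptions.

```shell
# Sketch: estimate required concurrency, rounded up.
# Usage: estimate_concurrency <requests_per_second> <avg_duration_ms>
estimate_concurrency() {
  rps=$1
  avg_ms=$2
  awk -v rps="$rps" -v ms="$avg_ms" 'BEGIN {
    c = rps * ms / 1000      # concurrent executions needed on average
    r = int(c)
    if (r < c) r++           # round up to a whole execution
    print r
  }'
}
```

For example, `estimate_concurrency 50 400` covers 50 requests per second at 400 ms each. Leave headroom above the result for bursts, and keep the total within the account limit checked above.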
SQS messages piling up
Symptoms:
- ApproximateAgeOfOldestMessage high
- Messages not being processed
- DLQ has messages
Solutions:
# Check consumer Lambda
aws logs tail /aws/lambda/app-event-processor --since 30m
# Check if Lambda is being invoked
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=app-event-processor \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Manually process DLQ if needed
python scripts/replay_dlq.py --queue-url <DLQ_URL>
# Purge only if messages are invalid (deletes every message in the queue and cannot be undone)
aws sqs purge-queue --queue-url <QUEUE_URL>
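Because a purge cannot be reversed, one option is to guard it behind an explicit confirmation. The following is a sketch, not an existing script: `confirm_purge` is a hypothetical helper that succeeds only when the operator types the queue name taken from the last path segment of the queue URL.

```shell
# Sketch: require the operator to type the queue name before purging.
# Returns 0 only when the typed answer matches the last segment of the URL.
confirm_purge() {
  queue_url=$1
  queue_name=${queue_url##*/}
  printf 'Type the queue name (%s) to confirm purge: ' "$queue_name" >&2
  IFS= read -r answer
  [ "$answer" = "$queue_name" ]
}

# Possible usage (QUEUE_URL is a placeholder you set yourself):
#   confirm_purge "$QUEUE_URL" && aws sqs purge-queue --queue-url "$QUEUE_URL"
```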
Database connection errors
Symptoms:
- "Too many connections"
- Connection timeouts
- "remaining connection slots are reserved"
Solutions:
# Check current connections
psql -c "
SELECT count(*), state, application_name
FROM pg_stat_activity
GROUP BY state, application_name;"
# Kill idle connections
psql -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';"
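# Increase max_connections (last resort)
# This command only attaches the parameter group: max_connections must already be set in custom-params,
# and because it is a static parameter the instance must be rebooted before the new value applies.
aws rds modify-db-instance \
--db-instance-identifier prod-db \
--db-parameter-group-name custom-params \
--apply-immediately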
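The "remaining connection slots are reserved" error appears once regular sessions have used every slot except those held back for superusers. The sketch below computes the remaining headroom from the connection count above. `connection_headroom` is a hypothetical helper, and the default of 3 reserved slots assumes PostgreSQL's default `superuser_reserved_connections`; check the actual value with `SHOW superuser_reserved_connections;`.

```shell
# Sketch: slots still available to regular (non-superuser) sessions.
# Usage: connection_headroom <current_connections> <max_connections> [reserved]
connection_headroom() {
  current=$1
  max_connections=$2
  reserved=${3:-3}   # assumed default for superuser_reserved_connections
  echo $(( max_connections - reserved - current ))
}
```

A result at or below zero means new application connections will be rejected until idle sessions are closed or the limit is raised.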
High database CPU
Symptoms:
- CPU > 80%
- Slow queries
- Timeouts
Solutions:
# Identify problematic queries
psql -c "
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start
LIMIT 10;"
# Terminate a specific backend (careful! use pg_cancel_backend to cancel only the running query)
psql -c "SELECT pg_terminate_backend(<pid>);"
# Analyze slow queries (on PostgreSQL 13+, the columns are total_exec_time and mean_exec_time)
psql -c "
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;"
# Check for missing indexes
python scripts/suggest_indexes.py
Monitoring Issues
CloudWatch logs not showing
Solutions:
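# Check the Lambda execution role has CloudWatch Logs permissions
# (get-policy shows only the resource-based policy, not the execution role)
aws lambda get-function-configuration --function-name app-function --query Role
aws iam list-attached-role-policies --role-name <ROLE_NAME>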
# Check log group exists
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/app
# Check log retention
aws logs describe-log-groups \
--log-group-name-prefix /aws/lambda/app-function \
--query 'logGroups[0].retentionInDays'
Alarms not triggering
# Check alarm configuration
aws cloudwatch describe-alarms --alarm-names app-error-rate
# Test alarm manually
aws cloudwatch set-alarm-state \
--alarm-name app-error-rate \
--state-value ALARM \
--state-reason "Testing alarm"
Known Issues
Issue: Cold start latency
Description: The first request after an idle period is slow (> 3s)
Workaround: Provisioned concurrency or periodic warm-up pings
Status: Accepted trade-off
Issue: Connection pool exhaustion during peak
Description: Database connections are exhausted during peak hours
Workaround: Increase the pool size or use PgBouncer
Fix: On the roadmap (Q2 2026)
Quick Reference
| Issue | Check | Fix |
|---|---|---|
| 502 errors | Lambda logs | Rollback or increase timeout |
| Slow API | Slow query log | Add indexes |
| Login fails | Auth Lambda logs | Check secrets |
| Queue piling | Consumer Lambda | Check errors, scale |
| DB connections | pg_stat_activity | Kill idle, increase pool |
| High CPU | pg_stat_statements | Optimize queries |
Escalation
If you cannot resolve the issue:
- Post in #incidents with details
- Tag the on-call engineer
- If P0: call the Tech Lead
- Start the incident response process