RFC-013: Error Handling and Logging Standards

Field	Value
Status	Draft
Author	Technology Team
Created	2026-02-05
Updated	2026-02-05
Meeting	To be scheduled
Deciders	Entire technical team

Status

🟡 Draft - RFC in progress, awaiting presentation

Context and Problem

Logging is documented in docs/observability/cloudwatch.md, but the following are still missing:

Standardized error handling patterns
Enforcement of structured (JSON) logging
Standardized log levels per environment
Mandatory correlation IDs on every request
Centralized error tracking (Sentry or similar)

Why solve this now?

Unstructured logs are hard to query
Debugging without correlation IDs is slow
Error handling is inconsistent across the codebase
There is no visibility into aggregated errors
CloudWatch Logs Insights only discovers fields automatically when logs are structured JSON

Impact of not solving

High MTTR (slow debugging)
Errors are neither aggregated nor tracked
Logs from the same request cannot be correlated
CloudWatch Logs Insights cannot filter or aggregate by field without manual parsing
Production troubleshooting stays difficult

Related documentation:
- docs/observability/cloudwatch.md - Current logging
- docs/development/coding-standards.md - Error handling
- docs/observability/monitoring.md - Correlation IDs
- docs/architecture/api-design.md - Error responses

Proposed Solution

Adopt structured (JSON) logging, mandatory correlation IDs, standardized error patterns, and Sentry for error tracking.

Structured Logging (JSON)

Python (structlog):

import structlog

# Global configuration
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

# Usage
logger.info(
    "user_login_successful",
    user_id=user.id,
    email=user.email,
    ip_address=request.client.host,
    duration_ms=123
)

logger.error(
    "payment_processing_failed",
    user_id=user.id,
    order_id=order.id,
    amount=order.total,
    error_code="STRIPE_ERROR",
    exc_info=True  # Includes the stack trace
)

Output (JSON):

{
  "event": "payment_processing_failed",
  "user_id": 123,
  "order_id": 456,
  "amount": 99.99,
  "error_code": "STRIPE_ERROR",
  "exception": "stripe.error.CardError: Your card was declined...",
  "timestamp": "2026-02-05T14:30:00.123456Z",
  "level": "error",
  "logger": "app.payments",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000"
}

Log Levels per Environment

Development:
- DEBUG: Everything
- INFO: Important events
- WARNING: Potential problems
- ERROR: Errors
- CRITICAL: Critical failures

Staging:
- INFO: Important events
- WARNING: Potential problems
- ERROR: Errors
- CRITICAL: Critical failures

Production:
- INFO: Important business events only
- WARNING: Potential problems
- ERROR: Errors
- CRITICAL: Critical failures
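The per-environment levels above could be applied at startup with a small lookup. The sketch below is a minimal illustration, not the team's actual configuration: the mapping name and both helper functions are hypothetical, and it assumes the environment name arrives through the ENVIRONMENT variable used in the Sentry setup later in this RFC. Because structlog.stdlib.filter_by_level defers to the standard-library logger's level, setting the root logger level is enough to make the structlog pipeline drop lower-priority events.

```python
import logging
import os
from typing import Optional

# Hypothetical mapping derived from the per-environment lists in this RFC.
LOG_LEVEL_BY_ENVIRONMENT = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.INFO,
}


def resolve_log_level(environment: Optional[str]) -> int:
    """Return the minimum log level for an environment name.

    Unknown or missing names fall back to INFO so that a misconfigured
    deployment does not emit DEBUG output by accident.
    """
    key = (environment or "").strip().lower()
    return LOG_LEVEL_BY_ENVIRONMENT.get(key, logging.INFO)


def configure_root_logger() -> int:
    """Apply the resolved level to the stdlib root logger and return it."""
    level = resolve_log_level(os.getenv("ENVIRONMENT"))
    logging.getLogger().setLevel(level)
    return level
```

Staging and production share the same numeric level here; the "business events only" rule for production is a convention about what gets logged at INFO, which a level filter alone cannot enforce.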

What to log:

✅ Log:
- Authentication (login, logout, failures)
- Important operations (create user, payment, etc.)
- Errors and exceptions
- Performance metrics
- Security events
- External API calls

❌ Do not log:
- Passwords or secrets
- Credit card numbers (PCI compliance)
- Sensitive personal data (CPF, the Brazilian taxpayer ID; full address)
- Tokens or API keys
- Medical information (HIPAA)

Sanitization:

def sanitize_log_data(data: dict) -> dict:
    """Remove sensitive data from logs."""
    # Keys mirror the "Do not log" list; extend as new sensitive fields appear
    sensitive_fields = ['password', 'credit_card', 'ssn', 'cpf', 'api_key', 'token']

    sanitized = data.copy()
    for field in sensitive_fields:
        if field in sanitized:
            sanitized[field] = '***REDACTED***'

    return sanitized

logger.info(
    "user_created",
    **sanitize_log_data(user_data)
)
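The helper above inspects only top-level keys and has to be called at every log site. One possible complement, sketched here as an assumption rather than an agreed design, is a function that follows the structlog processor convention (logger, method name, event dict) and redacts nested values too. The key list and all names below are hypothetical; if adopted, it could be placed in the processors list ahead of JSONRenderer so that every event passes through it.

```python
from typing import Any, MutableMapping

# Assumed key names; extend to match the fields the services actually emit.
SENSITIVE_KEYS = frozenset(
    {"password", "credit_card", "ssn", "cpf", "api_key", "token"}
)
REDACTED = "***REDACTED***"


def _redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else _redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [_redact(item) for item in value]
    return value


def redact_sensitive_fields(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Processor-style function: receives the event dict, returns a redacted copy."""
    return _redact(dict(event_dict))
```

Matching is by exact, case-insensitive key name, so a field such as user_password would still pass through; a stricter variant might match on substrings at the cost of more false positives.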

Correlation IDs

Generation and propagation:

# FastAPI middleware
import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def correlation_id_middleware(request: Request, call_next):
    # Reuse the caller's ID if present; otherwise generate a new one
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())

    # Bind to the logging context (cleared first so nothing leaks between requests)
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        correlation_id=correlation_id,
        method=request.method,
        path=request.url.path
    )

    # Process the request
    response = await call_next(request)

    # Echo the ID back on the response
    response.headers["X-Correlation-ID"] = correlation_id

    return response
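The middleware accepts whatever the caller sends in X-Correlation-ID, and that value ends up in every log line and in a response header. A hedged hardening step is to accept the inbound value only when it parses as a UUID and to generate a fresh one otherwise. The helper name below is hypothetical, and the sketch assumes that all callers use UUID-formatted IDs, which the RFC's examples suggest but do not state.

```python
import uuid
from typing import Optional


def resolve_correlation_id(header_value: Optional[str]) -> str:
    """Return the inbound ID if it is a well-formed UUID, else a fresh UUID4."""
    if header_value:
        try:
            # Normalizes to the canonical lowercase, hyphenated form.
            return str(uuid.UUID(header_value.strip()))
        except ValueError:
            pass  # malformed value: fall through and generate a new ID
    return str(uuid.uuid4())
```

In the middleware this would replace the `request.headers.get(...) or str(uuid.uuid4())` expression with `resolve_correlation_id(request.headers.get("X-Correlation-ID"))`.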

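Propagation to downstream services:

import httpx
import structlog

logger = structlog.get_logger()

async def call_external_api(url: str) -> httpx.Response:
    """Call an external API, forwarding the correlation ID."""
    correlation_id = structlog.contextvars.get_contextvars().get("correlation_id")
    # Outside a request no ID may be bound; omit the header rather than send None
    headers = {"X-Correlation-ID": correlation_id} if correlation_id else {}

    # httpx.get is synchronous; awaiting requires an AsyncClient
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)

    logger.info(
        "external_api_called",
        url=url,
        status_code=response.status_code,
        duration_ms=response.elapsed.total_seconds() * 1000
    )

    return response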

Log search:

# CloudWatch Logs Insights
fields @timestamp, event, level, @message
| filter correlation_id = "550e8400-e29b-41d4-a716-446655440000"
| sort @timestamp asc

Error Handling Patterns

Custom Exceptions (Domain-specific):

# app/exceptions.py
from typing import Any, Optional

class AppException(Exception):
    """Base exception for all app exceptions."""
    def __init__(self, message: str, code: str, status_code: int = 500):
        self.message = message
        self.code = code
        self.status_code = status_code
        super().__init__(message)

class ValidationError(AppException):
    """Validation failed."""
    def __init__(self, message: str, field: Optional[str] = None):
        super().__init__(message, "VALIDATION_ERROR", 400)
        self.field = field

class NotFoundError(AppException):
    """Resource not found."""
    def __init__(self, resource: str, resource_id: Any):
        super().__init__(
            f"{resource} with id {resource_id} not found",
            "NOT_FOUND",
            404
        )

class UnauthorizedError(AppException):
    """User not authenticated."""
    def __init__(self):
        super().__init__("Authentication required", "UNAUTHORIZED", 401)

class ForbiddenError(AppException):
    """User not authorized."""
    def __init__(self, action: str):
        super().__init__(
            f"Not authorized to {action}",
            "FORBIDDEN",
            403
        )

Exception Handler (FastAPI):

import structlog
from fastapi import Request
from fastapi.responses import JSONResponse

# `app` and `logger` are the objects created in the earlier snippets

@app.exception_handler(AppException)
async def app_exception_handler(request: Request, exc: AppException):
    """Handle all app exceptions."""

    logger.error(
        "request_failed",
        error_code=exc.code,
        error_message=exc.message,
        status_code=exc.status_code,
        path=request.url.path,
        method=request.method
    )

    return JSONResponse(
        status_code=exc.status_code,
        content={
            "error": {
                "code": exc.code,
                "message": exc.message,
                "request_id": structlog.contextvars.get_contextvars().get("correlation_id")
            }
        }
    )

@app.exception_handler(Exception)
async def generic_exception_handler(request: Request, exc: Exception):
    """Handle unexpected exceptions."""

    logger.critical(
        "unexpected_error",
        error_type=type(exc).__name__,
        error_message=str(exc),
        path=request.url.path,
        exc_info=True
    )

    # Do not expose internal details in production
    return JSONResponse(
        status_code=500,
        content={
            "error": {
                "code": "INTERNAL_SERVER_ERROR",
                "message": "An unexpected error occurred",
                "request_id": structlog.contextvars.get_contextvars().get("correlation_id")
            }
        }
    )

Error Tracking (Sentry)

Setup:

import os

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

def sanitize_sentry_event(event, hint):
    """Remove sensitive data before sending to Sentry."""
    if 'request' in event:
        if 'headers' in event['request']:
            event['request']['headers'].pop('Authorization', None)
    return event

# The hook must be defined before sentry_sdk.init references it
sentry_sdk.init(
    dsn=os.getenv("SENTRY_DSN"),
    environment=os.getenv("ENVIRONMENT"),  # staging, production
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,
    integrations=[FastApiIntegration()],
    before_send=sanitize_sentry_event,  # Remove sensitive data
)
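The hook above removes only a header spelled exactly `Authorization`. Header names are case-insensitive, and other headers can carry credentials as well, so a broader variant may be worth considering. The sketch below is an illustration under assumptions: the header list and function name are hypothetical, and it assumes the event stores request headers as a plain dict, which should be verified against real events before relying on it.

```python
from typing import Any, Dict, Optional

# Assumed set of header names worth masking; adjust to the API in question.
SENSITIVE_HEADERS = frozenset({"authorization", "cookie", "x-api-key"})


def scrub_request_headers(
    event: Dict[str, Any], hint: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """before_send-style hook: mask sensitive headers regardless of letter case."""
    request = event.get("request") or {}
    headers = request.get("headers")
    if isinstance(headers, dict):
        for name in list(headers):
            if name.lower() in SENSITIVE_HEADERS:
                headers[name] = "[Filtered]"
    return event
```

Masking the value instead of deleting the key keeps it visible in Sentry that the header was present, which can help when debugging authentication failures.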

Manual error capturing:

try:
    result = process_payment(order)
except PaymentError as e:
    # Attach context before capturing; context set afterwards is not part of this event
    sentry_sdk.set_context("payment", {
        "order_id": order.id,
        "amount": order.total,
        "payment_method": order.payment_method
    })
    sentry_sdk.capture_exception(e)
    raise

Error Response Format (RFC 7807, obsoleted by RFC 9457)

{
  "type": "https://docs.seuapp.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "Invalid email format",
  "instance": "/api/v1/users/123",
  "errors": [
    {
      "field": "email",
      "message": "Must be valid email format",
      "code": "INVALID_EMAIL"
    }
  ],
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
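The exception handlers earlier in this RFC return an `{"error": {...}}` envelope, while the example above uses the problem-details shape. The following sketch shows one way the problem-details body could be assembled from the same information. It is an assumption-laden illustration: the function name is hypothetical, the base URL is copied from the example above, and deriving the `type` slug from the error code is a convention invented here, not something the RFC specifies.

```python
from typing import Any, Dict, List, Optional

# Base URL taken from the example above; the slug convention is an assumption.
ERROR_DOCS_BASE_URL = "https://docs.seuapp.com/errors"


def build_problem_details(
    code: str,
    title: str,
    status: int,
    detail: str,
    instance: str,
    request_id: Optional[str],
    errors: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Assemble a problem-details body in the shape shown above."""
    slug = code.lower().replace("_", "-")  # VALIDATION_ERROR -> validation-error
    body: Dict[str, Any] = {
        "type": f"{ERROR_DOCS_BASE_URL}/{slug}",
        "title": title,
        "status": status,
        "detail": detail,
        "instance": instance,
        "request_id": request_id,  # extension member
    }
    if errors:
        body["errors"] = errors  # extension member with per-field failures
    return body
```

A handler returning this body would normally also set the `application/problem+json` media type, which the problem-details specification defines for such responses.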

Alternatives Considered

Option 1: Plain-text logging

Pros:
- Human-readable
- Simple

Cons:
- Hard to query
- Requires manual parsing
- CloudWatch Logs Insights cannot discover fields automatically

Why we did not choose it: the field-based querying and aggregation this RFC depends on require structured (JSON) logs.

Option 2: Rollbar instead of Sentry

Pros:
- Cheaper
- Simple

Cons:
- Fewer features
- Fewer integrations
- Sentry is the de facto industry standard

Why we did not choose it: Sentry offers a better return on investment.

Option 3: Optional correlation IDs

Pros:
- Less overhead

Cons:
- Requests without an ID cannot be traced
- Inconsistent

Why we did not choose it: Correlation IDs are essential for debugging distributed systems.

Impact Analysis

Technical Impact

Structured logging becomes mandatory
Correlation IDs on all requests
Sentry for error tracking
Standardized error responses

Business Impact

✅ MTTR reduced by 60%
✅ Errors aggregated and prioritized
✅ Much faster debugging
⚠️ Sentry cost: ~$26/month (10k events)

Risks

Risk: Performance overhead from logging

Mitigation: Asynchronous logging, batching, sampling.
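To make the asynchronous-logging mitigation concrete, here is a minimal sketch using only the standard library's QueueHandler and QueueListener: request threads enqueue records, and a background thread hands them to the real sink. The logger name, the helper function, and the in-memory sink are hypothetical stand-ins for illustration; a real deployment would attach whatever handler ships logs to CloudWatch.

```python
import logging
import logging.handlers
import queue
from typing import List, Tuple


class ListHandler(logging.Handler):
    """Stand-in for a slow sink (file, network); stores formatted messages."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


def build_async_logger(
    name: str, sink: logging.Handler
) -> Tuple[logging.Logger, logging.handlers.QueueListener]:
    """Return (logger, listener); callers of the logger only enqueue records."""
    log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue
    async_logger = logging.getLogger(name)
    async_logger.setLevel(logging.INFO)
    async_logger.propagate = False  # keep records out of the root handlers
    async_logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()  # background thread drains the queue into the sink
    return async_logger, listener


demo_sink = ListHandler()
demo_logger, demo_listener = build_async_logger("async_demo", demo_sink)
demo_logger.info("order_created")
demo_logger.debug("dropped_below_info")  # filtered by the INFO level
demo_listener.stop()  # processes everything still queued, then joins the thread
```

An unbounded queue trades memory for never blocking the request path; a bounded queue with an explicit drop policy is the alternative when log volume can spike.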

Implementation Plan

Phase 1: Structured Logging (Week 1)

Configure structlog
Migrate the main logs
Test in staging

Phase 2: Correlation IDs (Week 2)

Implement the middleware
Propagate to downstream services
Document

Phase 3: Error Patterns (Week 3)

Custom exceptions
Exception handlers
Standardized error responses

Phase 4: Sentry (Week 4)

Set up Sentry
Integrate with the codebase
Train the team

Success Metrics

After 1 month:
- ✅ 100% of logs in JSON
- ✅ Correlation IDs on all requests
- ✅ Sentry configured

After 3 months:
- ✅ MTTR reduced by 60%
- ✅ Errors aggregated and triaged
- ✅ Zero logs containing sensitive data
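The "100% of logs in JSON" and "correlation IDs on all requests" targets can be checked mechanically against a sample of log lines. The sketch below is one hypothetical way to do that; the required field names are taken from the JSON output example earlier in this RFC, and the function name and report keys are invented for illustration.

```python
import json
from typing import Dict, Iterable

# Field names as they appear in the structlog output example in this RFC.
REQUIRED_FIELDS = ("event", "level", "timestamp", "correlation_id")


def audit_log_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines that are JSON objects and carry every required field."""
    counts = {"total": 0, "not_json": 0, "missing_fields": 0, "ok": 0}
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored, not counted
        counts["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["not_json"] += 1
            continue
        if not isinstance(record, dict):
            counts["not_json"] += 1  # valid JSON, but not a log object
        elif all(field in record for field in REQUIRED_FIELDS):
            counts["ok"] += 1
        else:
            counts["missing_fields"] += 1
    return counts
```

Run against an exported log sample, a result where `ok` equals `total` would indicate that both one-month targets hold for that sample.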