Monitoring

Continuous monitoring of the application and its infrastructure.

Dashboards

Production Dashboard

URL: CloudWatch Dashboard

Widgets:

  1. API Request Count (5m)
  2. API Error Rate (%)
  3. API Latency (p50, p90, p99)
  4. Lambda Invocations
  5. Lambda Errors
  6. Lambda Duration
  7. SQS Messages in Queue
  8. SQS DLQ Messages
  9. RDS CPU & Memory
  10. RDS Connections

Business Metrics Dashboard

Custom Metrics:

  • User signups (hourly/daily)
  • Orders created
  • Revenue
  • Active users
  • Conversion rate
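
One way to feed this dashboard is to publish the values as CloudWatch custom metrics. The sketch below builds a `MetricData` payload for boto3's `put_metric_data`. The namespace `App/Business`, the `Environment` dimension, and the metric names are illustrative assumptions, not names taken from this project.

```python
from datetime import datetime, timezone

# Hypothetical namespace; use whatever your dashboard widgets query.
NAMESPACE = "App/Business"

def build_metric_data(values, environment="production", now=None):
    """Turn {"OrdersCreated": (12, "Count"), ...} into a MetricData list."""
    timestamp = now or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "Environment", "Value": environment}],
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": unit,  # e.g. "Count", "Percent" or "None"
        }
        for name, (value, unit) in values.items()
    ]

def publish(client, values):
    """Send the metrics using a client from boto3.client("cloudwatch")."""
    client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(values),
    )
```

Passing the client in keeps the payload builder testable without AWS credentials.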

Key Metrics

Golden Signals

Latency

Target:

  • p50: < 200ms
  • p90: < 500ms
  • p99: < 1s

Alert if: p99 > 2s for 5 minutes
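
To make the percentile targets concrete, here is a minimal sketch that computes p50, p90 and p99 from raw latency samples and reports which targets were missed. It assumes you already have a list of latencies in milliseconds; in production CloudWatch computes these percentiles for you.

```python
from statistics import quantiles

# Targets from this section, in milliseconds.
TARGETS_MS = {"p50": 200, "p90": 500, "p99": 1000}

def latency_percentiles(samples_ms):
    """Return p50/p90/p99 for a list of request latencies (ms)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

def missed_targets(samples_ms):
    """Names of the percentiles that are at or above their target."""
    observed = latency_percentiles(samples_ms)
    return [name for name, limit in TARGETS_MS.items() if observed[name] >= limit]
```

A handful of very slow requests can leave p50 and p90 untouched while p99 breaches, which is why the alert is defined on p99.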

Traffic

Monitoring:

  • Requests per second
  • Concurrent connections
  • Bandwidth usage

Alert if: Unusual spike (> 2x normal)
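
The "2x normal" rule needs a definition of normal. The sketch below assumes normal means the average of recent requests-per-second samples; the function name and the choice of a simple mean are assumptions for illustration.

```python
from statistics import mean

def is_traffic_spike(current_rps, baseline_rps, factor=2.0):
    """True when current traffic exceeds `factor` times the baseline average."""
    if not baseline_rps:
        return False  # no baseline yet, nothing to compare against
    return current_rps > factor * mean(baseline_rps)
```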

Errors

Target: < 1% error rate

Alert if: > 5% error rate for 5 minutes
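
The "for 5 minutes" part matters: the alert should fire only when every one of the last five one-minute buckets is above the threshold. A minimal sketch, assuming per-minute `(errors, total)` counts are available as a list ordered oldest first:

```python
def error_rate(errors, total):
    """Error rate as a percentage; 0.0 when there was no traffic."""
    return 100.0 * errors / total if total else 0.0

def should_alert(per_minute_counts, threshold_pct=5.0, minutes=5):
    """True when the last `minutes` buckets all exceed the threshold."""
    recent = per_minute_counts[-minutes:]
    if len(recent) < minutes:
        return False  # not enough history to call it sustained
    return all(error_rate(e, t) > threshold_pct for e, t in recent)
```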

Saturation

Resources:

  • Lambda concurrent executions (< 80% of limit)
  • Database connections (< 80% of max)
  • Memory usage (< 80%)
  • Disk space (< 80%)
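
The 80% rule can be checked uniformly across resources. This sketch assumes you collect current usage and the corresponding limits into two dictionaries keyed by resource name; the key names in the usage example are hypothetical.

```python
def saturated(usage, limits, threshold=0.80):
    """Return {resource: utilisation} for resources at or above the threshold."""
    report = {}
    for name, used in usage.items():
        ratio = used / limits[name]
        if ratio >= threshold:
            report[name] = round(ratio, 2)
    return report

# Example with hypothetical numbers:
# saturated({"lambda_concurrency": 850, "db_connections": 40},
#           {"lambda_concurrency": 1000, "db_connections": 100})
```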

Health Checks

Endpoint Health

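from datetime import datetime, timezone

from fastapi import APIRouter, Depends, HTTPException
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# get_session, check_s3, check_redis and check_external_api are assumed to be
# defined elsewhere in the application; import them from your own modules.

router = APIRouter(prefix="/health")

@router.get("")
async def health():
    """Basic health check."""
    # datetime.utcnow() is deprecated; return a timezone-aware UTC timestamp.
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc)}

@router.get("/db")
async def health_db(session: AsyncSession = Depends(get_session)):
    """Database health check."""
    try:
        await session.execute(text("SELECT 1"))
        return {"status": "healthy", "database": "connected"}
    except Exception as exc:
        # Keep the original error as the cause without exposing it to clients.
        raise HTTPException(status_code=503, detail="Database unhealthy") from exc

@router.get("/dependencies")
async def health_dependencies():
    """Check external dependencies."""
    # Each check is expected to return True (healthy) or False (unhealthy).
    results = {
        "s3": await check_s3(),
        "redis": await check_redis(),
        "external_api": await check_external_api()
    }

    all_healthy = all(results.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "degraded", "services": results}
    )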
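
The `/dependencies` endpoint awaits each check in turn, so one hung dependency can stall the whole health check. A possible refinement, sketched here with hypothetical helper names `check_with_timeout` and `run_checks`, is to run the checks concurrently and treat timeouts and exceptions as unhealthy:

```python
import asyncio

async def check_with_timeout(check, timeout_s=2.0):
    """Run an async check; exceptions and timeouts count as unhealthy."""
    try:
        return bool(await asyncio.wait_for(check(), timeout=timeout_s))
    except Exception:
        return False

async def run_checks(checks, timeout_s=2.0):
    """Run named checks concurrently and return {name: bool}."""
    names = list(checks)
    results = await asyncio.gather(
        *(check_with_timeout(checks[name], timeout_s) for name in names)
    )
    return dict(zip(names, results))
```

Inside the endpoint this could be called as `await run_checks({"s3": check_s3, "redis": check_redis})`, assuming those check functions are coroutines that take no arguments.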

Real-time Monitoring

CloudWatch Live Tail

# Tail logs in real time
aws logs tail /aws/lambda/app-user-api-production --follow

# Filter by level
aws logs tail /aws/lambda/app-user-api-production \
  --follow \
  --filter-pattern '{ $.level = "ERROR" }'

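# Multiple log groups: `aws logs tail` accepts a single log group, so use
# `aws logs start-live-tail` with log group ARNs
# (<region> and <account-id> are placeholders)
aws logs start-live-tail --log-group-identifiers \
  arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/app-user-api-production \
  arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/app-payment-processor-production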

X-Ray Tracing

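from aws_xray_sdk.core import patch_all, xray_recorder
from aws_xray_sdk.ext.aiohttp.client import aws_xray_trace_config

# Patch supported client libraries (boto3, requests, etc.) so their calls
# are traced. patch_all() does not add FastAPI middleware.
patch_all()

# Outbound aiohttp calls are traced by passing the trace config to the session:
#   aiohttp.ClientSession(trace_configs=[aws_xray_trace_config()])

# Manual subsegments: use capture_async for coroutines
@xray_recorder.capture_async('process_payment')
async def process_payment(order_id: int):
    # Runs inside a 'process_payment' subsegment
    ...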

Synthetic Monitoring

CloudWatch Synthetics (Canaries)

# Canary script
from aws_synthetics.selenium import synthetics_webdriver as webdriver
from aws_synthetics.common import synthetics_logger as logger
from selenium.webdriver.common.by import By

def main():
    browser = webdriver.Chrome()
    try:
        browser.get("https://seuapp.com")

        # Check homepage loads
        assert "People Tech" in browser.title

        # Check login page; find_element raises if the field is missing
        # (find_element_by_id was removed in Selenium 4)
        browser.get("https://seuapp.com/login")
        browser.find_element(By.ID, "email")

        logger.info("Synthetic check passed")
    finally:
        # Always release the browser, even when a check fails
        browser.quit()

def handler(event, context):
    return main()

Configuration:

  • Frequency: every 5 minutes
  • Alert on: 2 consecutive failures
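
The "2 consecutive failures" rule maps naturally onto a CloudWatch alarm on the canary's `SuccessPercent` metric with two evaluation periods of five minutes each. The sketch below only builds the keyword arguments for boto3's `put_metric_alarm`; the alarm name suffix and the arguments `canary_name` and `sns_topic_arn` are assumptions you would replace with your own values.

```python
def canary_alarm_kwargs(canary_name, sns_topic_arn):
    """Keyword arguments for boto3 cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": f"{canary_name}-failing",  # hypothetical naming scheme
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,             # matches the 5-minute canary schedule
        "EvaluationPeriods": 2,    # two consecutive failing periods
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # a canary that stops running also alerts
        "AlarmActions": [sns_topic_arn],
    }

# Usage, assuming AWS credentials are configured:
#   boto3.client("cloudwatch").put_metric_alarm(**canary_alarm_kwargs(name, arn))
```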

Performance Monitoring

APM Metrics

  • Throughput: requests/second
  • Response time: p50, p90, p99
  • Error rate: errors/total requests
  • Apdex score: User satisfaction metric
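
Apdex is the least self-explanatory of these metrics. Given a target response time T, requests at or below T count as satisfied, requests between T and 4T count as tolerating (half weight), and anything slower counts as frustrated. A minimal sketch, assuming T = 0.5s, which is an illustrative value rather than one defined in this document:

```python
def apdex(response_times_s, t=0.5):
    """Apdex = (satisfied + tolerating / 2) / total, with threshold T in seconds."""
    if not response_times_s:
        return None  # undefined without samples
    satisfied = sum(1 for r in response_times_s if r <= t)
    tolerating = sum(1 for r in response_times_s if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_s)
```

A score of 1.0 means every request was satisfied; 0.0 means every request was frustrated.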

Database Query Performance

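-- Slow queries (PostgreSQL 13+; requires the pg_stat_statements extension)
-- On PostgreSQL 12 and earlier the columns are total_time, mean_time, max_time.
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;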

Cost Monitoring

Budget Alerts

MonthlyBudget:
  Type: AWS::Budgets::Budget
  Properties:
    Budget:
      BudgetName: Monthly-App-Budget
      BudgetLimit:
        Amount: 1000
        Unit: USD
      TimeUnit: MONTHLY
      BudgetType: COST
    NotificationsWithSubscribers:
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          Threshold: 80
        Subscribers:
          - SubscriptionType: EMAIL
            Address: billing@seuapp.com

Cost Anomaly Detection

  • AWS Cost Anomaly Detection enabled
  • Alerts for unexpected spend
  • Monthly cost review

References