Monitoring
Continuous monitoring of the application and infrastructure.
Dashboards
Production Dashboard
URL: CloudWatch Dashboard
Widgets:
- API Request Count (5m)
- API Error Rate (%)
- API Latency (p50, p90, p99)
- Lambda Invocations
- Lambda Errors
- Lambda Duration
- SQS Messages in Queue
- SQS DLQ Messages
- RDS CPU & Memory
- RDS Connections
Business Metrics Dashboard
Custom Metrics:
- User signups (hourly/daily)
- Orders created
- Revenue
- Active users
- Conversion rate
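As a rough sketch of how these custom metrics could be shaped for CloudWatch, the helper below builds the `MetricData` list that `put_metric_data` expects. The namespace `App/Business`, the metric names, and the definition of conversion rate (orders per signup) are all assumptions made for illustration; the source does not define them.

```python
from datetime import datetime, timezone

NAMESPACE = "App/Business"  # hypothetical namespace, not from the source


def business_metric_data(signups, orders, revenue, active_users, now=None):
    """Build a MetricData list suitable for CloudWatch put_metric_data."""
    timestamp = now or datetime.now(timezone.utc)
    # Assumption: conversion rate = orders per signup, expressed as a percentage.
    conversion = 100.0 * orders / signups if signups else 0.0
    values = [
        ("UserSignups", signups, "Count"),
        ("OrdersCreated", orders, "Count"),
        ("Revenue", revenue, "None"),
        ("ActiveUsers", active_users, "Count"),
        ("ConversionRate", conversion, "Percent"),
    ]
    return [
        {"MetricName": name, "Value": float(value), "Unit": unit, "Timestamp": timestamp}
        for name, value, unit in values
    ]
```

The result can be passed as `MetricData=...` together with `Namespace=NAMESPACE` to a boto3 CloudWatch client's `put_metric_data`.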
Key Metrics
Golden Signals
Latency
Target:
- p50: < 200ms
- p90: < 500ms
- p99: < 1s
Alert if: p99 > 2s for 5 minutes
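To make the percentile targets concrete, here is a minimal sketch that computes p50/p90/p99 from raw latency samples (in seconds) and compares them with the targets above. It uses the nearest-rank method, which is an assumption; CloudWatch computes percentiles with its own algorithm, so values may differ slightly.

```python
import math

# Latency targets from this section, in seconds, keyed by percentile.
TARGETS = {50: 0.200, 90: 0.500, 99: 1.0}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples):
    """Return {percentile: (value, within_target)} for p50, p90 and p99."""
    report = {}
    for pct, target in TARGETS.items():
        value = percentile(samples, pct)
        report[pct] = (value, value < target)
    return report
```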
Traffic
Monitoring:
- Requests per second
- Concurrent connections
- Bandwidth usage
Alert if: Unusual spike (> 2x normal)
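The "> 2x normal" rule needs a definition of "normal". One possible reading, sketched below, takes the median of recent requests-per-second readings as the baseline; the choice of median and the function name are assumptions.

```python
from statistics import median

SPIKE_FACTOR = 2.0  # "> 2x normal" from the alert rule


def is_traffic_spike(current_rps, baseline_rps, factor=SPIKE_FACTOR):
    """Return True when current_rps exceeds factor times the median baseline."""
    if not baseline_rps:
        return False  # no baseline yet, so nothing to compare against
    return current_rps > factor * median(baseline_rps)
```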
Errors
Target: < 1% error rate
Alert if: > 5% error rate for 5 minutes
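The "for 5 minutes" condition means the threshold must be breached in every one of the last five one-minute windows, not just once. A minimal sketch of that evaluation, assuming per-minute `(errors, total)` counts are already available:

```python
ERROR_RATE_ALERT = 0.05  # alert above 5%
SUSTAINED_MINUTES = 5    # breach must hold for 5 consecutive minutes


def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_alert(per_minute_counts):
    """per_minute_counts: list of (errors, total) tuples, oldest first."""
    recent = per_minute_counts[-SUSTAINED_MINUTES:]
    if len(recent) < SUSTAINED_MINUTES:
        return False  # not enough data to call the breach sustained
    return all(error_rate(e, t) > ERROR_RATE_ALERT for e, t in recent)
```

A single healthy minute inside the window resets the alert, which avoids paging on short bursts.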
Saturation
Resources:
- Lambda concurrent executions (< 80% of limit)
- Database connections (< 80% of max)
- Memory usage (< 80%)
- Disk space (< 80%)
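The 80% rule can be checked uniformly across resources. The sketch below takes a mapping of resource name to `(current, maximum)` and lists those at or above the limit; the resource names and numbers in the usage example are illustrative only.

```python
SATURATION_LIMIT = 0.80  # every resource should stay below 80% of its maximum


def saturated_resources(usage):
    """usage maps a resource name to (current, maximum); return names at or above 80%."""
    return sorted(
        name
        for name, (current, maximum) in usage.items()
        if maximum > 0 and current / maximum >= SATURATION_LIMIT
    )


if __name__ == "__main__":
    print(saturated_resources({
        "lambda_concurrency": (850, 1000),
        "db_connections": (40, 100),
    }))
```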
Health Checks
Endpoint Health
from datetime import datetime, timezone

from fastapi import APIRouter, Depends, HTTPException
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# get_session, check_s3, check_redis and check_external_api are
# application-specific helpers, assumed to be defined elsewhere in the project.

router = APIRouter(prefix="/health")


@router.get("")
async def health():
    """Basic health check."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc)}


@router.get("/db")
async def health_db(session: AsyncSession = Depends(get_session)):
    """Database health check."""
    try:
        await session.execute(text("SELECT 1"))
        return {"status": "healthy", "database": "connected"}
    except Exception as exc:
        # Return 503 so load balancers treat the instance as unhealthy
        raise HTTPException(status_code=503, detail="Database unhealthy") from exc


@router.get("/dependencies")
async def health_dependencies():
    """Check external dependencies."""
    results = {
        "s3": await check_s3(),
        "redis": await check_redis(),
        "external_api": await check_external_api(),
    }
    all_healthy = all(results.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "degraded", "services": results},
    )
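The dependency checks above run one after another, so a single slow dependency delays the whole response. One possible refinement, sketched here with the standard library only, runs them concurrently and treats a timeout or exception as unhealthy. `run_check`, `run_checks`, and the 2-second default are hypothetical, not part of the original code.

```python
import asyncio


async def run_check(check, timeout_seconds=2.0):
    """Await check() and return True only if it finishes in time with a truthy result."""
    try:
        return bool(await asyncio.wait_for(check(), timeout=timeout_seconds))
    except Exception:
        # Timeouts and connection errors both count as unhealthy.
        return False


async def run_checks(checks, timeout_seconds=2.0):
    """Run all named checks concurrently and return {name: bool}."""
    names = list(checks)
    outcomes = await asyncio.gather(
        *(run_check(checks[name], timeout_seconds) for name in names)
    )
    return dict(zip(names, outcomes))
```

Inside the endpoint this could replace the `results` dict with `await run_checks({"s3": check_s3, "redis": check_redis, "external_api": check_external_api})`.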
Real-time Monitoring
CloudWatch Live Tail
# Tail logs in real time
aws logs tail /aws/lambda/app-user-api-production --follow
# Filter by level
aws logs tail /aws/lambda/app-user-api-production \
--follow \
--filter-pattern '{ $.level = "ERROR" }'
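# Multiple log groups: `aws logs tail` accepts a single log group, so use
# start-live-tail with log group ARNs (<region> and <account-id> are placeholders)
aws logs start-live-tail --log-group-identifiers \
  arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/app-user-api-production \
  arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/app-payment-processor-production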
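The `{ $.level = "ERROR" }` filter pattern only matches if the application writes each log event as a JSON object with a `level` field. A minimal sketch of such a formatter using the standard `logging` module follows; the field names other than `level` are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,  # matched by { $.level = "ERROR" }
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level=logging.INFO):
    """Send all root-logger output through JsonFormatter."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```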
X-Ray Tracing
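from aws_xray_sdk.core import patch_all, xray_recorder
from aws_xray_sdk.ext.aiohttp.client import aws_xray_trace_config

# Automatic tracing of supported libraries (boto3, requests, etc.)
patch_all()

# Outgoing aiohttp calls are traced by passing
# trace_configs=[aws_xray_trace_config()] to aiohttp.ClientSession

# Manual subsegments: use capture_async for coroutines
@xray_recorder.capture_async('process_payment')
async def process_payment(order_id: int):
    # Runs inside a 'process_payment' subsegment
    ...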
Synthetic Monitoring
CloudWatch Synthetics (Canaries)
# Canary script
from aws_synthetics.selenium import synthetics_webdriver as webdriver
from aws_synthetics.common import synthetics_logger as logger
from selenium.webdriver.common.by import By


def main():
    browser = webdriver.Chrome()
    try:
        browser.get("https://seuapp.com")
        # Check homepage loads
        assert "People Tech" in browser.title
        # Check login page
        browser.get("https://seuapp.com/login")
        # find_element raises NoSuchElementException if the field is missing
        browser.find_element(By.ID, "email")
        logger.info("Synthetic check passed")
    finally:
        # Always release the browser, even when a check fails
        browser.quit()


def handler(event, context):
    return main()
Configuration:
- Frequency: every 5 minutes
- Alert on: 2 consecutive failures
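One way to express "2 consecutive failures at a 5-minute frequency" is a CloudWatch alarm on the canary's `SuccessPercent` metric in the `CloudWatchSynthetics` namespace. The sketch below only builds the alarm parameters; the alarm name, the canary name, the SNS topic, and the choice to treat missing data as breaching are assumptions.

```python
def canary_alarm_params(canary_name, sns_topic_arn):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{canary_name}-failed",  # hypothetical naming scheme
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,             # one datapoint per 5-minute run
        "EvaluationPeriods": 2,    # look at the last two runs
        "DatapointsToAlarm": 2,    # both must breach: 2 consecutive failures
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: a silent canary is a failure
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 available, the dict can be applied as `boto3.client("cloudwatch").put_metric_alarm(**canary_alarm_params(name, topic_arn))`.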
Performance Monitoring
APM Metrics
- Throughput: requests/second
- Response time: p50, p90, p99
- Error rate: errors/total requests
- Apdex score: User satisfaction metric
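The Apdex score is defined relative to a target response time T: requests at or below T are "satisfied", those between T and 4T are "tolerating" and count half, and the rest are "frustrated". A short sketch of the standard formula, where the value of T is an assumption to be chosen per service:

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total, with threshold T in seconds."""
    if not response_times:
        raise ValueError("no samples")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)
```

For example, with T = 0.5s and samples of 0.1s, 0.3s, 0.6s and 3.0s, two requests are satisfied, one is tolerating and one is frustrated, giving (2 + 0.5) / 4 = 0.625.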
Database Query Performance
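-- Slow queries (PostgreSQL, requires the pg_stat_statements extension)
-- On PostgreSQL 13+, these columns are named total_exec_time,
-- mean_exec_time and max_exec_time
SELECT
    query,
    calls,
    total_time,
    mean_time,
    max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;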
Cost Monitoring
Budget Alerts
MonthlyBudget:
  Type: AWS::Budgets::Budget
  Properties:
    Budget:
      BudgetName: Monthly-App-Budget
      BudgetLimit:
        Amount: 1000
        Unit: USD
      TimeUnit: MONTHLY
      BudgetType: COST
    NotificationsWithSubscribers:
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          Threshold: 80
        Subscribers:
          - SubscriptionType: EMAIL
            Address: billing@seuapp.com
Cost Anomaly Detection
- AWS Cost Anomaly Detection enabled
- Alerts for unexpected spend
- Monthly cost review