RFC-014: Infrastructure as Code Standards
| Field | Value |
|---|---|
| Status | Draft |
| Author | Technology Team |
| Created | 2026-02-05 |
| Updated | 2026-02-05 |
| Meeting | To be scheduled |
| Deciders | Entire technical team |
Status
🟡 Draft - RFC in progress, awaiting presentation
Context and Problem
AWS SAM is documented in docs/infrastructure/aws-sam.md, but the following are still missing:
- Standards for SAM templates (naming, structure)
- Mandatory validation and testing of IaC
- A clear state management strategy
- Standardized multi-environment configuration
- Drift detection and remediation
Why solve this now?
- The infrastructure is growing without shared standards
- SAM templates are inconsistent
- Manual changes cause drift
- Infrastructure testing is manual
- There is no validation before deploy
Impact of not solving it
- Infrastructure diverges between environments
- Manual changes go untracked (drift)
- Broken infrastructure gets deployed
- Environments are hard to replicate
- Infrastructure rollback is impossible
Related documentation:
- docs/infrastructure/aws-sam.md - SAM templates and deployment
- docs/infrastructure/environments.md - Multi-environment configuration
- docs/infrastructure/iam-policies.md - IAM roles
- docs/cicd/github-actions.md - Deploy via CI/CD
- docs/testing/integration-testing.md - SAM Local testing
Proposed Solution
Infrastructure as Code with AWS SAM, backed by clear standards, automated validation, and drift detection.
Template Structure Standards
Naming conventions:
# Resources: PascalCase with a resource-type suffix
Resources:
  UserFunction:           # Lambda
  UsersTable:             # DynamoDB
  UserEventsQueue:        # SQS
  ProductUpdatesTopic:    # SNS
  UserFunctionRole:       # IAM Role
  DatabaseSecurityGroup:  # EC2 SecurityGroup
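The naming conventions above can be checked mechanically before review. The following is a minimal sketch, not part of the RFC: it assumes the template's `Resources` section has already been parsed into a plain dict, and the `TYPE_SUFFIXES` mapping is a hypothetical table inferred from the examples listed above.

```python
import re

# Hypothetical mapping from CloudFormation type to the suffix used in the
# examples above; extend it as the standard grows.
TYPE_SUFFIXES = {
    "AWS::Serverless::Function": "Function",
    "AWS::DynamoDB::Table": "Table",
    "AWS::SQS::Queue": "Queue",
    "AWS::SNS::Topic": "Topic",
    "AWS::IAM::Role": "Role",
    "AWS::EC2::SecurityGroup": "SecurityGroup",
}

PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def naming_violations(resources):
    """Return human-readable violations for a parsed Resources mapping."""
    violations = []
    for logical_id, resource in resources.items():
        if not PASCAL_CASE.match(logical_id):
            violations.append(f"{logical_id}: not PascalCase")
        # Types without an entry in TYPE_SUFFIXES are only checked for casing.
        suffix = TYPE_SUFFIXES.get(resource.get("Type"))
        if suffix and not logical_id.endswith(suffix):
            violations.append(f"{logical_id}: expected suffix '{suffix}'")
    return violations


if __name__ == "__main__":
    sample = {
        "UserFunction": {"Type": "AWS::Serverless::Function"},
        "userEventsQueue": {"Type": "AWS::SQS::Queue"},
        "Orders": {"Type": "AWS::SQS::Queue"},
    }
    for line in naming_violations(sample):
        print(line)
```

A check like this could run in CI next to the linters; resource types missing from the mapping pass through with only the casing check applied.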
Template organization:
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: People API - Main stack
# 1. Global settings
Globals:
  Function:
    Runtime: python3.11
    MemorySize: 512
    Timeout: 30
    Environment:
      Variables:
        ENVIRONMENT: !Ref Environment
        LOG_LEVEL: !Ref LogLevel
# 2. Parameters
Parameters:
  Environment:
    Type: String
    AllowedValues: [staging, production]
    Default: staging
  LogLevel:
    Type: String
    AllowedValues: [DEBUG, INFO, WARNING, ERROR]
    Default: INFO
# 3. Conditions
Conditions:
  IsProduction: !Equals [!Ref Environment, production]
# 4. Resources (grouped by type)
Resources:
  # Lambda Functions
  UserFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: app.handlers.users.handler
      Role: !GetAtt UserFunctionRole.Arn
  # IAM Roles
  UserFunctionRole:
    Type: AWS::IAM::Role
    ...
  # Databases
  Database:
    Type: AWS::RDS::DBInstance
    ...
  # Queues
  UserEventsQueue:
    Type: AWS::SQS::Queue
    ...
# 5. Outputs
Outputs:
  ApiEndpoint:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'
Environment Configuration
samconfig.toml:
version = 0.1
[default]
[default.global.parameters]
stack_name = "people-api"
[default.build.parameters]
use_container = true
[default.deploy.parameters]
capabilities = "CAPABILITY_IAM"
confirm_changeset = true
resolve_s3 = true
[staging]
[staging.deploy.parameters]
stack_name = "people-api-staging"
s3_bucket = "sam-deployments-staging"
s3_prefix = "people-api"
region = "us-east-1"
parameter_overrides = "Environment=staging LogLevel=DEBUG"
[production]
[production.deploy.parameters]
stack_name = "people-api-production"
s3_bucket = "sam-deployments-production"
s3_prefix = "people-api"
region = "us-east-1"
parameter_overrides = "Environment=production LogLevel=INFO"
confirm_changeset = true # Manual confirmation for prod
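Because each environment section repeats its own `stack_name` and `parameter_overrides`, a copy-paste slip (for example, production deploying with `Environment=staging`) is easy to miss. The sketch below is one possible consistency check, assuming the file has already been parsed into nested dicts (for instance with `tomllib` on Python 3.11+). The function names and the rule that `stack_name` ends with the environment name are assumptions drawn from the sample above, not an established convention.

```python
import shlex


def parse_overrides(raw):
    """Turn 'Environment=staging LogLevel=DEBUG' into a dict."""
    return dict(item.split("=", 1) for item in shlex.split(raw) if "=" in item)


def check_environments(config, expected=("staging", "production")):
    """Return problems found in the per-environment deploy parameters."""
    problems = []
    for env in expected:
        params = config.get(env, {}).get("deploy", {}).get("parameters", {})
        if not params:
            problems.append(f"{env}: missing deploy parameters")
            continue
        overrides = parse_overrides(params.get("parameter_overrides", ""))
        if overrides.get("Environment") != env:
            problems.append(
                f"{env}: parameter_overrides sets Environment={overrides.get('Environment')}"
            )
        # Assumed convention: stack names end with the environment name.
        if not params.get("stack_name", "").endswith(f"-{env}"):
            problems.append(f"{env}: stack_name should end with -{env}")
    return problems


if __name__ == "__main__":
    parsed = {
        "staging": {"deploy": {"parameters": {
            "stack_name": "people-api-staging",
            "parameter_overrides": "Environment=staging LogLevel=DEBUG",
        }}},
        "production": {"deploy": {"parameters": {
            "stack_name": "people-api-production",
            "parameter_overrides": "Environment=production LogLevel=INFO",
        }}},
    }
    print(check_environments(parsed) or "samconfig looks consistent")
```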
Validation
Pre-deploy validation:
# SAM validate
sam validate --lint
# CFN Lint
pip install cfn-lint
cfn-lint template.yaml
# Security scanning
pip install checkov
checkov -f template.yaml --framework cloudformation
CI/CD validation:
# .github/workflows/validate-infrastructure.yml
name: Validate Infrastructure
on:
  pull_request:
    paths:
      - 'template.yaml'
      - 'samconfig.toml'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: SAM Validate
        run: sam validate --lint
      - name: CFN Lint
        run: |
          pip install cfn-lint
          cfn-lint template.yaml
      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@master
        with:
          file: template.yaml
          framework: cloudformation
          quiet: false
          soft_fail: false # Block on security issues
Testing
SAM Local:
# Build
sam build
# Start API locally
sam local start-api --port 3000
# Invoke Lambda locally
sam local invoke UserFunction -e events/create-user.json
# Start Lambda endpoint (for testing)
sam local start-lambda
Integration tests:
# tests/infrastructure/test_sam_template.py
import boto3
from moto import mock_aws

REGION = "us-east-1"


@mock_aws
def test_lambda_function_has_correct_permissions():
    """Test Lambda IAM role has necessary permissions."""
    iam = boto3.client("iam", region_name=REGION)
    # create_role_from_template() is a project helper that is not shown here; it is
    # expected to create the role from template.yaml inside the mock and return it.
    role = create_role_from_template()
    # Verify the inline policy is present on the role
    policies = iam.list_role_policies(RoleName=role["RoleName"])
    assert "SecretsManagerAccess" in policies["PolicyNames"]


@mock_aws
def test_sqs_queue_has_dlq():
    """Test all SQS queues have DLQ configured."""
    sqs = boto3.client("sqs", region_name=REGION)
    # The queues from template.yaml must be created in the mock before this point;
    # otherwise list_queues() returns nothing and the loop below never runs.
    queues = sqs.list_queues()
    for queue_url in queues.get("QueueUrls", []):
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["RedrivePolicy"],
        )
        assert "RedrivePolicy" in attrs["Attributes"], f"Queue {queue_url} missing DLQ"
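The DLQ rule can also be checked statically against the template, without creating any mocked resources. This is a rough sketch under two assumptions: the template has been loaded into a plain dict with intrinsic functions in long form (`{"Fn::GetAtt": [...]}`), and dead-letter queues are referenced through `Fn::GetAtt`. The function names are illustrative.

```python
def dlq_targets(resources):
    """Collect logical IDs that other queues reference as deadLetterTargetArn."""
    targets = set()
    for resource in resources.values():
        policy = resource.get("Properties", {}).get("RedrivePolicy", {})
        arn = policy.get("deadLetterTargetArn")
        if isinstance(arn, dict) and "Fn::GetAtt" in arn:
            ref = arn["Fn::GetAtt"]
            # Fn::GetAtt may be a list ["Id", "Arn"] or a string "Id.Arn"
            targets.add(ref[0] if isinstance(ref, list) else ref.split(".")[0])
    return targets


def queues_without_dlq(template):
    """Return sorted logical IDs of SQS queues that lack a RedrivePolicy.

    Queues that serve as a dead-letter target themselves are exempt.
    """
    resources = template.get("Resources", {})
    targets = dlq_targets(resources)
    return sorted(
        logical_id
        for logical_id, resource in resources.items()
        if resource.get("Type") == "AWS::SQS::Queue"
        and logical_id not in targets
        and "RedrivePolicy" not in resource.get("Properties", {})
    )
```

Unlike the mocked test, this catches a missing DLQ even when no queue is ever created during the test run.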
Drift Detection
Process:
- Detection: weekly CloudFormation drift detection
- Alert: if drift is detected, send an alert to Slack
- Review: review the changes (intentional or not?)
- Remediate:
  - If unintentional: revert manually
  - If intentional: update the template (import)
Automation:
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 9 * * 1' # Every Monday 09:00 UTC
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      # AWS credentials and SLACK_WEBHOOK must be made available to this job (not shown here)
      - name: Detect drift
        run: |
          STACK_ID=$(aws cloudformation describe-stacks \
            --stack-name people-api-production \
            --query 'Stacks[0].StackId' \
            --output text)
          # Capture the detection ID returned by detect-stack-drift
          DETECTION_ID=$(aws cloudformation detect-stack-drift \
            --stack-name "$STACK_ID" \
            --query 'StackDriftDetectionId' \
            --output text)
          sleep 30 # Fixed wait for detection; large stacks may need longer
          DRIFT_STATUS=$(aws cloudformation describe-stack-drift-detection-status \
            --stack-drift-detection-id "$DETECTION_ID" \
            --query 'StackDriftStatus' \
            --output text)
          if [ "$DRIFT_STATUS" = "DRIFTED" ]; then
            echo "⚠️ Drift detected in production stack!"
            # Send Slack alert
            curl -X POST "$SLACK_WEBHOOK" -d '{"text": "⚠️ Infrastructure drift detected in production!"}'
            exit 1
          fi
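A fixed `sleep 30` can report a stale or incomplete status when detection takes longer. One alternative is to poll until CloudFormation reports that detection has finished. The sketch below takes the client as a parameter so it can be exercised without AWS access; with boto3 it would be called as `wait_for_drift_status(boto3.client("cloudformation"), "people-api-production")`. The function name, poll interval, and timeout are illustrative assumptions.

```python
import time


def wait_for_drift_status(cfn, stack_name, poll_seconds=5, timeout_seconds=300):
    """Start drift detection and poll until CloudFormation reports a result.

    `cfn` is expected to behave like a boto3 CloudFormation client.
    Returns the StackDriftStatus string, e.g. "DRIFTED" or "IN_SYNC".
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] == "DETECTION_COMPLETE":
            return status["StackDriftStatus"]
        if status["DetectionStatus"] == "DETECTION_FAILED":
            raise RuntimeError(
                status.get("DetectionStatusReason", "drift detection failed")
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(f"drift detection for {stack_name} did not finish")
        time.sleep(poll_seconds)
```

The caller can then alert and exit non-zero when the returned status is `"DRIFTED"`, mirroring the workflow step above.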
Multi-Stack Strategy
Separate stacks:
people-api-core: # Core infrastructure (VPC, DB, cache)
  - VPC
  - RDS
  - ElastiCache
  - Security Groups
people-api-compute: # Compute resources (Lambdas, API Gateway)
  - Lambda functions
  - API Gateway
  - SQS queues
  - SNS topics
people-api-frontend: # Frontend infrastructure
  - S3 bucket
  - CloudFront
  - WAF
Benefits:
- Independent deploys
- Reduced blast radius
- More granular rollback
- Faster deploy times
Secrets e Parameters
Via SAM:
Parameters:
  DatabasePassword:
    Type: String
    NoEcho: true
    Description: Database master password
Resources:
  Function:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          # SSM Parameter (non-sensitive)
          API_BASE_URL: !Sub '{{resolve:ssm:/prod/config/api-url}}'
          # Secrets Manager (sensitive)
          DB_PASSWORD: !Sub '{{resolve:secretsmanager:/prod/database/password:SecretString:password}}'
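To keep the split above honest (sensitive values through Secrets Manager, plain configuration through SSM), a review-time check can flag environment variables whose names look sensitive but whose values are not Secrets Manager dynamic references. This is a heuristic sketch only: the name patterns in `SENSITIVE_NAME` are assumptions, and it expects the variables as a parsed dict in which `!Sub` appears in long form as `{"Fn::Sub": "..."}`.

```python
import re

# Matches CloudFormation dynamic references such as {{resolve:ssm:/path}}.
DYNAMIC_REF = re.compile(r"\{\{resolve:(ssm|ssm-secure|secretsmanager):[^}]+\}\}")
# Assumed heuristic for "this variable probably holds a secret".
SENSITIVE_NAME = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY)", re.IGNORECASE)


def plaintext_secrets(variables):
    """Return names that look sensitive but are not Secrets Manager references."""
    flagged = []
    for name, value in variables.items():
        if not SENSITIVE_NAME.search(name):
            continue
        # Unwrap the long form of !Sub so the reference string can be inspected.
        if isinstance(value, dict) and isinstance(value.get("Fn::Sub"), str):
            value = value["Fn::Sub"]
        match = DYNAMIC_REF.fullmatch(value) if isinstance(value, str) else None
        if match is None or match.group(1) != "secretsmanager":
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    variables = {
        "API_BASE_URL": "{{resolve:ssm:/prod/config/api-url}}",
        "DB_PASSWORD": {
            "Fn::Sub": "{{resolve:secretsmanager:/prod/database/password:SecretString:password}}"
        },
        "API_TOKEN": "abc123",
    }
    print(plaintext_secrets(variables))
```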
Required Tags
# All resources
Tags:
  Environment: !Ref Environment
  Project: people-api
  Team: technology
  CostCenter: engineering
  ManagedBy: cloudformation
  Owner: tech-lead@company.com
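A required-tag policy is easy to enforce with a small check over the parsed template. The sketch below assumes the template is available as a dict and handles both shapes in which tags commonly appear: a map (as on `AWS::Serverless::Function`) and a list of `Key`/`Value` pairs (as on most CloudFormation resources). It does not know which resource types support tags at all, so in practice an allow-list of taggable types would likely be needed; the function names are illustrative.

```python
REQUIRED_TAGS = {"Environment", "Project", "Team", "CostCenter", "ManagedBy", "Owner"}


def tag_keys(tags):
    """Accept both the map form and the Key/Value list form of Tags."""
    if isinstance(tags, dict):
        return set(tags)
    if isinstance(tags, list):
        return {item["Key"] for item in tags if isinstance(item, dict) and "Key" in item}
    return set()


def missing_tags(template):
    """Map each resource's logical ID to the sorted required tags it lacks."""
    report = {}
    for logical_id, resource in template.get("Resources", {}).items():
        present = tag_keys(resource.get("Properties", {}).get("Tags"))
        missing = sorted(REQUIRED_TAGS - present)
        if missing:
            report[logical_id] = missing
    return report
```

An empty report means every resource carries all six tags; any other result lists exactly what to add and where.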
Cost Optimization
Environment-based sizing, complementing the tags used for cost allocation:
# Non-production: cheaper resources
Conditions:
  IsProduction: !Equals [!Ref Environment, production]
Resources:
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceClass: !If [IsProduction, db.t3.medium, db.t3.micro]
      MultiAZ: !If [IsProduction, true, false]
      BackupRetentionPeriod: !If [IsProduction, 30, 7]
Alternatives Considered
Option 1: Terraform instead of SAM
Pros:
- Multi-cloud
- More features
- More readable HCL
Cons:
- Complex state management
- Not AWS-native, unlike SAM
- Learning curve
Why we did not choose it: SAM is simpler and integrated with AWS.
Option 2: CDK (Cloud Development Kit)
Pros:
- Real code (Python/TypeScript)
- Type safety
- Reusability
Cons:
- More complex
- Synthesized templates are hard to read
- Overhead for a small team
Why we did not choose it: SAM YAML is simpler and sufficient.
Option 3: Manual infrastructure (ClickOps)
Pros:
- Fast for prototypes
- No learning curve
Cons:
- Not reproducible
- No versioning
- Disaster recovery impossible
Why we did not choose it: It does not scale, and changes are neither versioned nor reproducible.
Impact Analysis
Technical Impact
- Standardized SAM templates
- Mandatory validation
- Weekly drift detection
- Multi-environment via samconfig
Business Impact
- ✅ Reproducible infrastructure
- ✅ Viable disaster recovery
- ✅ Infrastructure rollback
- ✅ Compliance (auditing)
- ⚠️ Learning curve for SAM
Risks
Risk: a SAM deploy fails and leaves the infrastructure stuck
Mitigation: rigorous validation, testing in staging, rollback procedures.
Implementation Plan
Phase 1: Standards (Week 1)
- Document naming conventions
- Create an example template
- Configure samconfig
Phase 2: Validation (Week 2)
- CI/CD validation
- Security scanning (Checkov)
- Linting (cfn-lint)
Phase 3: Testing (Week 3)
- Integration tests
- SAM Local testing
- Drift detection
Phase 4: Migration (Weeks 4-8)
- Migrate manual resources to SAM
- Document
- Training
Success Metrics
After 1 month:
- ✅ 100% of new infrastructure via SAM
- ✅ Automated validation in CI
- ✅ Zero drift detected
After 3 months:
- ✅ Disaster recovery tested
- ✅ Infrastructure rollback working
- ✅ Well-documented templates