RFC-014: Infrastructure as Code Standards

Field      Value
Status     Draft
Author     Technology Team
Created    2026-02-05
Updated    2026-02-05
Meeting    To be scheduled
Deciders   Entire technical team

Status

🟡 Draft - RFC in progress, awaiting presentation

Context and Problem

AWS SAM is documented in docs/infrastructure/aws-sam.md, but the following are missing:

  • Standards for SAM templates (naming, structure)
  • Mandatory validation and testing of IaC
  • A clear state management strategy
  • Standardized multi-environment configuration
  • Drift detection and remediation

Why solve this now?

  • The infrastructure is growing without standards
  • SAM templates are inconsistent
  • Manual changes cause drift
  • Infrastructure testing is manual
  • There is no validation before deploy

Impact of not solving it

  • Infrastructure diverges between environments
  • Manual changes go untracked (drift)
  • Broken infrastructure gets deployed
  • Environments are hard to replicate
  • Infrastructure rollback is impossible

Related documentation:

  • docs/infrastructure/aws-sam.md - SAM templates and deployment
  • docs/infrastructure/environments.md - Multi-environment configuration
  • docs/infrastructure/iam-policies.md - IAM roles
  • docs/cicd/github-actions.md - Deploy via CI/CD
  • docs/testing/integration-testing.md - SAM Local testing

Proposed Solution

Infrastructure as Code with AWS SAM, backed by clear standards, automated validation, and drift detection.

Template Structure Standards

Naming conventions:

# Resources: PascalCase with a type suffix
Resources:
  UserFunction:             # Lambda
  UsersTable:               # DynamoDB
  UserEventsQueue:          # SQS
  ProductUpdatesTopic:      # SNS
  UserFunctionRole:         # IAM Role
  DatabaseSecurityGroup:    # EC2 SecurityGroup
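
The convention above could be enforced automatically. The following is a minimal sketch, not an existing tool: the `TYPE_SUFFIXES` mapping and the `naming_violations` function are hypothetical names, and the suffix list is only what the examples above imply. It takes a plain mapping of logical ID to resource type, so it does not depend on how the template is parsed.

```python
import re

# Assumed mapping from CloudFormation type to the suffix used in the examples above.
TYPE_SUFFIXES = {
    "AWS::Serverless::Function": "Function",
    "AWS::DynamoDB::Table": "Table",
    "AWS::SQS::Queue": "Queue",
    "AWS::SNS::Topic": "Topic",
    "AWS::IAM::Role": "Role",
    "AWS::EC2::SecurityGroup": "SecurityGroup",
}

# Starts with an uppercase letter and contains only letters and digits.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def naming_violations(resources: dict[str, str]) -> list[str]:
    """Return one message per rule broken by a logical ID.

    `resources` maps logical ID -> CloudFormation resource type.
    """
    problems = []
    for logical_id, resource_type in resources.items():
        if not PASCAL_CASE.match(logical_id):
            problems.append(f"{logical_id}: not PascalCase")
        suffix = TYPE_SUFFIXES.get(resource_type)
        if suffix and not logical_id.endswith(suffix):
            problems.append(f"{logical_id}: expected suffix '{suffix}'")
    return problems
```

Resource types that are not in the mapping are only checked for PascalCase, which keeps the check permissive until the team agrees on more suffixes.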

Template organization:

# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Description: People API - Main stack

# 1. Global settings
Globals:
  Function:
    Runtime: python3.11
    MemorySize: 512
    Timeout: 30
    Environment:
      Variables:
        ENVIRONMENT: !Ref Environment
        LOG_LEVEL: !Ref LogLevel

# 2. Parameters
Parameters:
  Environment:
    Type: String
    AllowedValues: [staging, production]
    Default: staging

  LogLevel:
    Type: String
    AllowedValues: [DEBUG, INFO, WARNING, ERROR]
    Default: INFO

# 3. Conditions
Conditions:
  IsProduction: !Equals [!Ref Environment, production]

# 4. Resources (grouped by type)
Resources:
  # Lambda Functions
  UserFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: app.handlers.users.handler
      Role: !GetAtt UserFunctionRole.Arn

  # IAM Roles
  UserFunctionRole:
    Type: AWS::IAM::Role
    ...

  # Databases
  Database:
    Type: AWS::RDS::DBInstance
    ...

  # Queues
  UserEventsQueue:
    Type: AWS::SQS::Queue
    ...

# 5. Outputs
Outputs:
  ApiEndpoint:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'
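
The section order shown in the template (Globals, Parameters, Conditions, Resources, Outputs) can be checked without a YAML parser, which avoids having to teach the parser about CloudFormation short-form tags such as `!Ref`. This is a rough sketch; `section_order_ok` is a hypothetical helper and it assumes top-level keys start at column 0.

```python
import re

EXPECTED_ORDER = ["Globals", "Parameters", "Conditions", "Resources", "Outputs"]


def section_order_ok(template_text: str) -> bool:
    """True when the standard sections appear in the prescribed relative order."""
    # Top-level keys start at column 0; nested keys are indented and ignored.
    top_level = re.findall(r"^([A-Za-z]+):", template_text, flags=re.MULTILINE)
    present = [key for key in top_level if key in EXPECTED_ORDER]
    expected = [key for key in EXPECTED_ORDER if key in present]
    return present == expected
```

Sections that a template omits (for example, Conditions) are skipped, so only the relative order of the sections that exist is enforced.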

Environment Configuration

samconfig.toml:

version = 0.1

[default]
[default.global.parameters]
stack_name = "people-api"

[default.build.parameters]
use_container = true

[default.deploy.parameters]
capabilities = "CAPABILITY_IAM"
confirm_changeset = true
resolve_s3 = true

# Environments do not inherit from [default], so each one repeats the
# settings it needs (such as capabilities).
[staging]
[staging.deploy.parameters]
stack_name = "people-api-staging"
s3_bucket = "sam-deployments-staging"
s3_prefix = "people-api"
region = "us-east-1"
capabilities = "CAPABILITY_IAM"
parameter_overrides = "Environment=staging LogLevel=DEBUG"

[production]
[production.deploy.parameters]
stack_name = "people-api-production"
s3_bucket = "sam-deployments-production"
s3_prefix = "people-api"
region = "us-east-1"
capabilities = "CAPABILITY_IAM"
parameter_overrides = "Environment=production LogLevel=INFO"
confirm_changeset = true  # Manual confirmation for prod

Validation

Pre-deploy validation:

# SAM validate
sam validate --lint

# CFN Lint
pip install cfn-lint
cfn-lint template.yaml

# Security scanning
pip install checkov
checkov -f template.yaml --framework cloudformation
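
To make the three checks above a single local command, they could be wrapped in a small script that stops at the first failure. This is a sketch only: `first_failure` and `run_exit_code` are hypothetical names, and the tools are assumed to be installed and on the PATH. The runner is injectable so the logic can be tested without invoking the real CLIs.

```python
import subprocess
from typing import Callable, Optional, Sequence

# The same commands listed above, in the order they should run.
CHECKS = [
    ["sam", "validate", "--lint"],
    ["cfn-lint", "template.yaml"],
    ["checkov", "-f", "template.yaml", "--framework", "cloudformation"],
]


def run_exit_code(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def first_failure(
    checks: Sequence[Sequence[str]] = CHECKS,
    runner: Callable[[Sequence[str]], int] = run_exit_code,
) -> Optional[list]:
    """Run the checks in order; return the first failing command, or None."""
    for command in checks:
        if runner(command) != 0:
            return list(command)
    return None
```

A pre-commit hook could call `first_failure()` and exit with a non-zero status when the result is not `None`.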

CI/CD validation:

# .github/workflows/validate-infrastructure.yml
name: Validate Infrastructure

on:
  pull_request:
    paths:
      - 'template.yaml'
      - 'samconfig.toml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: SAM Validate
        run: sam validate --lint

      - name: CFN Lint
        run: |
          pip install cfn-lint
          cfn-lint template.yaml

      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@master
        with:
          file: template.yaml
          framework: cloudformation
          quiet: false
          soft_fail: false  # Block on security issues

Testing

SAM Local:

# Build
sam build

# Start API locally
sam local start-api --port 3000

# Invoke Lambda locally
sam local invoke UserFunction -e events/create-user.json

# Start Lambda endpoint (for testing)
sam local start-lambda

Integration tests:

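# tests/infrastructure/test_sam_template.py
import boto3
from moto import mock_aws

REGION = 'us-east-1'  # boto3 needs a region; matches samconfig.toml


@mock_aws
def test_lambda_function_has_correct_permissions():
    """Test Lambda IAM role has necessary permissions."""
    iam = boto3.client('iam', region_name=REGION)

    # Create role from template.
    # create_role_from_template() is a project helper that is not shown here.
    role = create_role_from_template()

    # Verify that the expected inline policy is attached
    policies = iam.list_role_policies(RoleName=role['RoleName'])
    assert 'SecretsManagerAccess' in policies['PolicyNames']


@mock_aws
def test_sqs_queue_has_dlq():
    """Test all SQS queues have DLQ configured."""
    sqs = boto3.client('sqs', region_name=REGION)

    # Queues must be created inside the mock before this point;
    # with no queues, the loop below would pass without checking anything.
    queue_urls = sqs.list_queues().get('QueueUrls', [])
    assert queue_urls, "No queues found; create them in the mock first"

    for queue_url in queue_urls:
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=['RedrivePolicy']
        )
        # 'Attributes' may be omitted from the response when nothing matches
        assert 'RedrivePolicy' in attrs.get('Attributes', {}), f"Queue {queue_url} missing DLQ"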
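
The DLQ rule can also be checked statically against the template's Resources section, before anything is deployed or mocked. The sketch below is an assumption-laden illustration: `queues_without_dlq` is a hypothetical helper, it expects resources already loaded into a dict with long-form intrinsics (`Fn::GetAtt` as a list), and the queue names in the usage note are invented. A dead-letter queue normally has no RedrivePolicy of its own, so queues that are the target of another queue's policy are excluded.

```python
def queues_without_dlq(resources: dict) -> list[str]:
    """Return logical IDs of SQS queues that neither have nor serve as a DLQ."""
    queues = {
        logical_id: body.get("Properties", {})
        for logical_id, body in resources.items()
        if body.get("Type") == "AWS::SQS::Queue"
    }

    # A queue referenced by another queue's RedrivePolicy is itself a DLQ.
    dlq_ids = set()
    for props in queues.values():
        target = props.get("RedrivePolicy", {}).get("deadLetterTargetArn")
        if isinstance(target, dict):
            # Long-form intrinsic, e.g. {"Fn::GetAtt": ["SomeDlqQueue", "Arn"]}
            ref = target.get("Fn::GetAtt")
            if isinstance(ref, list) and ref:
                dlq_ids.add(ref[0])

    return sorted(
        logical_id
        for logical_id, props in queues.items()
        if "RedrivePolicy" not in props and logical_id not in dlq_ids
    )
```

For a template with `UserEventsQueue` redriving to a hypothetical `UserEventsDlqQueue`, plus an unrelated `OrphanQueue` with no policy, the function would report only `OrphanQueue`.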

Drift Detection

Process:

  1. Detection: weekly CloudFormation drift detection
  2. Alert: if drift is detected, send an alert to Slack
  3. Review: review the changes (intentional or not?)
  4. Remediate:
     • If unintentional: revert manually
     • If intentional: update the template (import)
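
For the alert and review steps, the Slack message is more useful if it names the drifted resources instead of only saying that drift exists. The sketch below assumes the per-resource records have the shape returned by CloudFormation's `describe-stack-resource-drifts` (`LogicalResourceId`, `ResourceType`, `StackResourceDriftStatus`); `drift_alert_text` is a hypothetical helper name.

```python
from typing import Optional


def drift_alert_text(stack_name: str, drifts: list) -> Optional[str]:
    """Build an alert listing drifted resources, or None if nothing drifted."""
    changed = [
        d for d in drifts
        if d.get("StackResourceDriftStatus") in ("MODIFIED", "DELETED")
    ]
    if not changed:
        return None

    lines = [f"Infrastructure drift detected in {stack_name}:"]
    for d in changed:
        lines.append(
            f"- {d['LogicalResourceId']} ({d['ResourceType']}): "
            f"{d['StackResourceDriftStatus']}"
        )
    return "\n".join(lines)
```

Returning `None` for an in-sync stack lets the caller skip the Slack call entirely, so the channel only receives messages that need a review.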

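Automation:

# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 09:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      # Assumes AWS credentials and region were configured by an earlier step (not shown)
      - name: Detect drift
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}  # assumed secret name
        run: |
          STACK_ID=$(aws cloudformation describe-stacks \
            --stack-name people-api-production \
            --query 'Stacks[0].StackId' \
            --output text)

          # Start detection and capture its ID for the status calls below
          DETECTION_ID=$(aws cloudformation detect-stack-drift \
            --stack-name "$STACK_ID" \
            --query 'StackDriftDetectionId' \
            --output text)

          # Poll until detection finishes; its duration depends on stack size
          DETECTION_STATUS=DETECTION_IN_PROGRESS
          while [ "$DETECTION_STATUS" = "DETECTION_IN_PROGRESS" ]; do
            sleep 10
            DETECTION_STATUS=$(aws cloudformation describe-stack-drift-detection-status \
              --stack-drift-detection-id "$DETECTION_ID" \
              --query 'DetectionStatus' \
              --output text)
          done

          if [ "$DETECTION_STATUS" != "DETECTION_COMPLETE" ]; then
            echo "Drift detection did not complete: $DETECTION_STATUS"
            exit 1
          fi

          DRIFT_STATUS=$(aws cloudformation describe-stack-drift-detection-status \
            --stack-drift-detection-id "$DETECTION_ID" \
            --query 'StackDriftStatus' \
            --output text)

          if [ "$DRIFT_STATUS" = "DRIFTED" ]; then
            echo "⚠️ Drift detected in production stack!"
            # Send Slack alert
            curl -X POST -H 'Content-Type: application/json' \
              -d '{"text": "⚠️ Infrastructure drift detected in production!"}' \
              "$SLACK_WEBHOOK"
            exit 1
          fi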
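
Drift detection is asynchronous, so any automation has to wait for it to finish before reading the result. If the check is later moved from shell into a Python script, the wait could look like the sketch below. It assumes a boto3 CloudFormation client is passed in (`describe_stack_drift_detection_status` is the boto3 call; `wait_for_drift_status` and its parameters are hypothetical), and the client is injected so the function can be tested with a stub.

```python
import time


def wait_for_drift_status(client, detection_id, poll_seconds=5.0,
                          max_polls=60, sleep=time.sleep):
    """Poll a drift detection run and return the final StackDriftStatus."""
    for _ in range(max_polls):
        status = client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] == "DETECTION_COMPLETE":
            return status["StackDriftStatus"]  # e.g. "DRIFTED" or "IN_SYNC"
        if status["DetectionStatus"] == "DETECTION_FAILED":
            raise RuntimeError(
                status.get("DetectionStatusReason", "drift detection failed")
            )
        # Still in progress: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"drift detection {detection_id} did not finish in time")
```

With a real client this would be called as `wait_for_drift_status(boto3.client("cloudformation"), detection_id)`, where `detection_id` comes from the `detect_stack_drift` response.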

Multi-Stack Strategy

Separate stacks:

people-api-core:          # Core infrastructure (VPC, DB, cache)
  - VPC
  - RDS
  - ElastiCache
  - Security Groups

people-api-compute:       # Compute resources (Lambdas, API Gateway)
  - Lambda functions
  - API Gateway
  - SQS queues
  - SNS topics

people-api-frontend:      # Frontend infrastructure
  - S3 bucket
  - CloudFront
  - WAF

Benefits:

  • Independent deploys
  • Reduced blast radius
  • More granular rollback
  • Faster deploy times
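
The split only helps if resources stay in the stack that owns them. A review-time check could flag resources declared in the wrong stack. This is a sketch under assumptions: the `STACK_BY_TYPE` mapping is one possible reading of the stack layout above (for example, it treats every S3 bucket as frontend), and `misplaced_resources` is a hypothetical helper.

```python
# Assumed mapping from resource type to the stack that should own it.
STACK_BY_TYPE = {
    "AWS::EC2::VPC": "people-api-core",
    "AWS::RDS::DBInstance": "people-api-core",
    "AWS::ElastiCache::CacheCluster": "people-api-core",
    "AWS::EC2::SecurityGroup": "people-api-core",
    "AWS::Serverless::Function": "people-api-compute",
    "AWS::Serverless::Api": "people-api-compute",
    "AWS::SQS::Queue": "people-api-compute",
    "AWS::SNS::Topic": "people-api-compute",
    "AWS::S3::Bucket": "people-api-frontend",
    "AWS::CloudFront::Distribution": "people-api-frontend",
    "AWS::WAFv2::WebACL": "people-api-frontend",
}


def misplaced_resources(stack_name: str, resources: dict[str, str]) -> list[str]:
    """Return logical IDs whose type is mapped to a different stack.

    `resources` maps logical ID -> CloudFormation resource type.
    Types missing from STACK_BY_TYPE are not flagged.
    """
    return sorted(
        logical_id
        for logical_id, resource_type in resources.items()
        if STACK_BY_TYPE.get(resource_type, stack_name) != stack_name
    )
```

Unmapped types such as IAM roles are allowed in any stack, since each stack typically needs its own.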

Secrets e Parameters

Via SAM:

Parameters:
  DatabasePassword:
    Type: String
    NoEcho: true
    Description: Database master password

Resources:
  Function:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          # SSM Parameter (non-sensitive)
          API_BASE_URL: !Sub '{{resolve:ssm:/prod/config/api-url}}'

          # Secrets Manager (sensitive)
          # Note: resolved at deploy time, so the value is stored in the function configuration
          DB_PASSWORD: !Sub '{{resolve:secretsmanager:/prod/database/password:SecretString:password}}'
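
A lint rule could catch secrets that are written into the template as literals instead of Secrets Manager references. The sketch below is heuristic: the name pattern in `SENSITIVE_NAME` is an assumption about how the team names secret variables, and `plaintext_secrets` is a hypothetical helper that works on an already-extracted mapping of environment variable names to values.

```python
import re

# Assumed naming heuristic for variables that hold secrets.
SENSITIVE_NAME = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY)", re.IGNORECASE)

# Matches a Secrets Manager dynamic reference such as
# {{resolve:secretsmanager:/prod/database/password:SecretString:password}}
SECRET_REFERENCE = re.compile(r"^\{\{resolve:secretsmanager:.+\}\}$")


def plaintext_secrets(variables: dict[str, str]) -> list[str]:
    """Return sensitive-looking variable names whose value is not a secret reference."""
    return sorted(
        name
        for name, value in variables.items()
        if SENSITIVE_NAME.search(name) and not SECRET_REFERENCE.match(str(value))
    )
```

Non-sensitive variables such as `API_BASE_URL` are ignored regardless of their value, so SSM references and literals both pass for them.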

Required Tags

# All resources
Tags:
  Environment: !Ref Environment
  Project: people-api
  Team: technology
  CostCenter: engineering
  ManagedBy: cloudformation
  Owner: tech-lead@company.com
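
Checking the tag set is simple enough to automate. A minimal sketch, assuming the tags of a resource have already been read into a dict (`missing_tags` is a hypothetical helper; the key list mirrors the block above):

```python
REQUIRED_TAGS = ("Environment", "Project", "Team", "CostCenter", "ManagedBy", "Owner")


def missing_tags(tags: dict) -> list[str]:
    """Return the required tag keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]
```

Empty values count as missing, so a tag such as `Owner: ""` does not satisfy the rule.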

Cost Optimization

Environment-based sizing (cost allocation itself relies on the required tags):

# Non-production: cheaper resources
Conditions:
  IsProduction: !Equals [!Ref Environment, production]

Resources:
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceClass: !If [IsProduction, db.t3.medium, db.t3.micro]
      MultiAZ: !If [IsProduction, true, false]
      BackupRetentionPeriod: !If [IsProduction, 30, 7]

Alternatives Considered

Option 1: Terraform instead of SAM

Pros:

  • Multi-cloud
  • More features
  • HCL is more readable

Cons:

  • Complex state management
  • Not AWS-native, unlike SAM
  • Learning curve

Why we did not choose it: SAM is simpler and integrated with AWS.

Option 2: CDK (Cloud Development Kit)

Pros:

  • Code (Python/TypeScript)
  • Type safety
  • Reusability

Cons:

  • More complex
  • Synthesized templates are hard to read
  • Overhead for a small team

Why we did not choose it: SAM YAML is simpler and sufficient.

Option 3: Manual infrastructure (ClickOps)

Pros:

  • Fast for prototypes
  • No learning curve

Cons:

  • Not reproducible
  • No versioning
  • Disaster recovery impossible

Why we did not choose it: It does not scale and leaves no reproducible, versioned record of the infrastructure.

Impact Analysis

Technical Impact

  • Standardized SAM templates
  • Mandatory validation
  • Weekly drift detection
  • Multi-environment via samconfig

Business Impact

  • ✅ Reproducible infrastructure
  • ✅ Viable disaster recovery
  • ✅ Infrastructure rollback
  • ✅ Compliance (auditing)
  • ⚠️ Learning curve for SAM

Risks

Risk: A SAM deploy fails and leaves the infrastructure stuck

Mitigation: Rigorous validation, testing in staging, rollback procedures.

Implementation Plan

Phase 1: Standards (Week 1)

  • Document naming conventions
  • Create an example template
  • Configure samconfig

Phase 2: Validation (Week 2)

  • CI/CD validation
  • Security scanning (Checkov)
  • Linting (cfn-lint)

Phase 3: Testing (Week 3)

  • Integration tests
  • SAM Local testing
  • Drift detection

Phase 4: Migration (Weeks 4-8)

  • Migrate manually created resources to SAM
  • Document
  • Training

Success Metrics

After 1 month:

  • ✅ 100% of new infrastructure created via SAM
  • ✅ Automated validation in CI
  • ✅ Zero drift detected

After 3 months:

  • ✅ Disaster recovery tested
  • ✅ Infrastructure rollback working
  • ✅ Templates well documented

References