Event-Driven Architecture with EventBridge - Lessons from 3 Years in Production

Three years of building serverless event-driven systems with AWS EventBridge. What works, what doesn't, and patterns that actually scale.


After three years of running event-driven architectures in production, I've learned that EventBridge is both powerful and full of sharp edges. This post covers the patterns that survived production load and the mistakes I'd avoid if I were starting today.

Why EventBridge?

When I first started building serverless applications, I reached for SNS for everything. It was simple, familiar, and worked. But as systems grew, SNS's limitations became painful:

  • Weak event filtering (subscription filter policies exist, but rich content-based routing is limited)
  • No built-in schema validation
  • Limited observability into event flows
  • Tight coupling between publishers and subscribers

EventBridge solved these problems while keeping the serverless philosophy: pay per use, no servers to manage, and native AWS integration.

The Architecture That Worked

Here's the pattern I've settled on after multiple iterations:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Service   │────▶│ EventBridge │────▶│   Rules     │
│  (Produces) │     │  (Event Bus)│     │ (Filters)   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                                │
                       ┌────────────────────────┼────────────────────────┐
                       ▼                        ▼                        ▼
                ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
                │   Lambda    │          │   Lambda    │          │   Fargate   │
                │ (Handler A) │          │ (Handler B) │          │  (Worker)   │
                └─────────────┘          └─────────────┘          └─────────────┘

Lesson 1: Event Structure Matters More Than You Think

I started with simple flat events:

{
  "userId": "123",
  "action": "order.created",
  "data": { ... }
}

This was a mistake. After 50+ event types, debugging became a nightmare. Now I use a consistent envelope pattern:

interface DomainEvent<T extends string, P> {
  specversion: "1.0";
  type: T;
  source: string;
  id: string;
  time: string;
  datacontenttype: "application/json";
  data: P;
}
 
// Example usage
type OrderCreatedEvent = DomainEvent<
  "orders.orderCreated.v1",
  {
    orderId: string;
    customerId: string;
    total: number;
    items: Array<{ sku: string; quantity: number }>;
  }
>;

This follows the CloudEvents spec, whose fields map cleanly onto EventBridge's own event structure (type becomes detail-type, source stays source). Benefits:

  • Standard tooling: CloudEvents libraries in every language
  • Schema registry integration: EventBridge Schema Registry understands the structure
  • Debugging: source + type + id gives you a complete trace
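On the wire, the envelope maps onto EventBridge's native fields. Here's a sketch of that mapping; the helper names (makeEvent, toPutEventsEntry) are mine, and the actual PutEvents SDK call is omitted:

```typescript
import { randomUUID } from "node:crypto";

// CloudEvents-style envelope, as defined above.
interface DomainEvent<T extends string, P> {
  specversion: "1.0";
  type: T;
  source: string;
  id: string;
  time: string;
  datacontenttype: "application/json";
  data: P;
}

// Hypothetical helper: build an event with a generated id and timestamp.
function makeEvent<T extends string, P>(
  type: T,
  source: string,
  data: P,
): DomainEvent<T, P> {
  return {
    specversion: "1.0",
    type,
    source,
    id: randomUUID(),
    time: new Date().toISOString(),
    datacontenttype: "application/json",
    data,
  };
}

// Map the envelope onto the shape PutEvents expects: "type" becomes
// DetailType (what rules match as "detail-type"), "source" becomes Source,
// and the whole envelope is preserved in Detail.
function toPutEventsEntry(event: DomainEvent<string, unknown>, busName: string) {
  return {
    EventBusName: busName,
    Source: event.source,
    DetailType: event.type,
    Detail: JSON.stringify(event),
  };
}
```

Rules then match on Source and DetailType, while the full envelope survives in the detail for consumers, debugging, and replay.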

Lesson 2: Use EventBridge Schema Registry

I ignored the Schema Registry for the first year. Big mistake.

// sst.config.ts
const bus = new sst.aws.Bus("Bus", {
  rules: {
    orderCreated: {
      pattern: {
        source: ["orders"],
        "detail-type": ["orders.orderCreated.v1"],
      },
      targets: { handler: "src/order.handler" },
    },
  },
});
 
// Enable schema discovery
new aws.schemas.Discoverer("DiscoveredSchemas", {
  sourceArn: bus.arn,
  description: "Auto-discover event schemas from the bus",
});

With schema discovery enabled, EventBridge automatically infers schemas from your events and stores them in the registry. This enables:

  • Code generation: Generate TypeScript types from discovered schemas
  • Validation: generated bindings catch malformed events in your code (EventBridge itself doesn't reject bad events on ingest)
  • Documentation: Living documentation of your event contracts
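Generated types only help at compile time, so I still guard handler input at runtime. A hand-rolled sketch of that guard (in practice you can derive it from the JSON Schema the registry exports):

```typescript
// Runtime guard for the CloudEvents-style envelope. Hand-rolled here to
// show the shape of the check; a generated validator would do the same.
interface Envelope {
  specversion: "1.0";
  type: string;
  source: string;
  id: string;
  time: string;
  data: unknown;
}

function isEnvelope(value: unknown): value is Envelope {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.specversion === "1.0" &&
    typeof v.type === "string" &&
    typeof v.source === "string" &&
    typeof v.id === "string" &&
    typeof v.time === "string" &&
    !Number.isNaN(Date.parse(v.time)) && // timestamp must parse
    "data" in v
  );
}
```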

Lesson 3: Dead Letter Queues Are Non-Negotiable

In my first production EventBridge implementation, I didn't configure DLQs. When a Lambda handler failed, events were silently dropped. We lost order notifications for 6 hours before noticing.

Now every target gets a DLQ:

// sst.config.ts
const dlq = new sst.aws.Queue("DLQ");
 
const bus = new sst.aws.Bus("Bus", {
  rules: {
    orderCreated: {
      pattern: {
        source: ["orders"],
        "detail-type": ["orders.orderCreated.v1"],
      },
      targets: {
        handler: {
          handler: "src/order.handler",
          deadLetterQueue: dlq.arn,
        },
      },
    },
  },
});

And I monitor DLQ depth with CloudWatch alarms:

new aws.cloudwatch.MetricAlarm("DLQAlarm", {
  metricName: "ApproximateNumberOfMessagesVisible",
  namespace: "AWS/SQS",
  dimensions: { QueueName: dlq.name },
  statistic: "Maximum",
  period: 60,
  threshold: 0,
  evaluationPeriods: 1,
  comparisonOperator: "GreaterThanThreshold",
});
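When the alarm fires and the handler is fixed, the queue gets drained back onto the bus. EventBridge writes the original event into the DLQ message body, so a redrive mostly means parsing it and re-publishing. A sketch of that transform (the SQS receive/delete loop and the PutEvents call are omitted):

```typescript
// A DLQ message body from a failed EventBridge target contains the
// original event in the standard EventBridge format. This rebuilds a
// PutEvents entry from it so the event can be re-published.
interface BridgeEvent {
  source: string;
  "detail-type": string;
  detail: unknown;
}

function redriveEntry(dlqMessageBody: string, busName: string) {
  const original = JSON.parse(dlqMessageBody) as BridgeEvent;
  return {
    EventBusName: busName,
    Source: original.source,
    DetailType: original["detail-type"],
    Detail: JSON.stringify(original.detail),
  };
}
```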

Lesson 4: Event Replay Saved My Job

EventBridge archives are underrated. When I accidentally deployed a bug that processed orders incorrectly, I needed to replay 3 hours of events.

# Enable archive
aws events create-archive \
  --archive-name order-events \
  --event-source-arn $BUS_ARN \
  --retention-days 30
 
# Replay events (the source here is the archive's ARN, not the bus)
aws events start-replay \
  --replay-name fix-orders-2024-03-15 \
  --event-source-arn $ARCHIVE_ARN \
  --event-start-time 2024-03-15T10:00:00Z \
  --event-end-time 2024-03-15T13:00:00Z \
  --destination '{"Arn": "'$BUS_ARN'", "FilterArns": ["'$FIXED_RULE_ARN'"]}'

The replay functionality let me fix the bug, redeploy, and reprocess only the affected events. No manual data fixing required.
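One caveat: a replay re-delivers events to every matching target, so handlers must be idempotent. A minimal dedupe sketch keyed on the event id (the in-memory set stands in for what should be a DynamoDB conditional write in production):

```typescript
// Replays re-deliver events, so dedupe on the event id before doing work.
// The Set is illustrative only; production should use a DynamoDB PutItem
// with attribute_not_exists(id) or similar durable store.
const seen = new Set<string>();

function handleOnce(eventId: string, handler: () => void): boolean {
  if (seen.has(eventId)) return false; // duplicate from a replay, skip
  seen.add(eventId);
  handler();
  return true;
}
```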

Lesson 5: Cross-Account EventBridge Is Worth the Complexity

As we split into multiple AWS accounts (dev/staging/prod/services), I initially used SNS/SQS for cross-account communication. It worked but felt fragile.

EventBridge cross-account delivery is cleaner: a rule in the producer account forwards events to a bus the consumer owns, and the consumer's bus policy grants the producer events:PutEvents:

// In the consumer account: allow the producer to deliver events to this bus
new aws.cloudwatch.EventBusPolicy("CrossAccountPolicy", {
  eventBusName: consumerBus.name,
  policy: JSON.stringify({
    Version: "2012-10-17",
    Statement: [{
      Sid: "AllowProducerAccount",
      Effect: "Allow",
      Principal: { AWS: "arn:aws:iam::PRODUCER_ACCOUNT_ID:root" },
      Action: "events:PutEvents",
      Resource: consumerBus.arn,
    }],
  }),
});
 
// In the producer account: forward matching events to the consumer's bus
const forwardRule = new aws.cloudwatch.EventRule("CrossAccountRule", {
  eventBusName: bus.name,
  eventPattern: JSON.stringify({
    source: ["orders"],
  }),
});
new aws.cloudwatch.EventTarget("CrossAccountTarget", {
  rule: forwardRule.name,
  eventBusName: bus.name,
  arn: "arn:aws:events:REGION:CONSUMER_ACCOUNT_ID:event-bus/Bus",
  roleArn: forwarderRole.arn, // role with events:PutEvents on the consumer bus (definition omitted)
});

This pattern gives you:

  • Centralized event bus in the producer account
  • Decentralized consumers in their own accounts
  • Clear ownership: Producer owns the schema, consumers own their handlers

What I'd Do Differently

  1. Start with schema validation: Use JSON Schema or Protobuf from day one. Migrating 100+ event types later is painful.

  2. Version events aggressively: orders.orderCreated.v1, v2, etc. EventBridge supports filtering by detail-type, making migrations manageable.

  3. Use SaaS integrations sparingly: EventBridge has a long catalog of SaaS partner integrations. They're convenient but create vendor lock-in. I prefer ingesting webhooks into my own event bus.

  4. Monitor event latency: typical EventBridge delivery latency is around half a second, but there's no SLA on it. Add CloudWatch metrics for end-to-end latency.
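For point 4, the envelope already carries the producer timestamp, so a handler can compute end-to-end latency itself. A sketch of the calculation (emitting the value via CloudWatch PutMetricData is omitted):

```typescript
// End-to-end latency = time the handler received the event minus the
// producer timestamp carried in the event's "time" field. The result
// would be published as a custom CloudWatch metric.
function eventLatencyMs(eventTime: string, receivedAt: Date = new Date()): number {
  return receivedAt.getTime() - Date.parse(eventTime);
}
```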

The Code I Use Today

Here's my current SST v3 setup:

// sst.config.ts
export default $config({
  app() {
    return {
      name: "events",
      removal: "remove",
      home: "aws",
    };
  },
  async run() {
    // DLQ for failed events
    const dlq = new sst.aws.Queue("EventDLQ");
 
    // Main event bus with schema discovery
    const bus = new sst.aws.Bus("Bus", {
      rules: {
        // Order events
        orderCreated: {
          pattern: {
            source: ["orders"],
            "detail-type": ["orders.orderCreated.v1"],
          },
          targets: {
            handler: {
              handler: "src/handlers/orderCreated.handler",
              deadLetterQueue: dlq.arn,
            },
          },
        },
 
        // Inventory events
        inventoryUpdated: {
          pattern: {
            source: ["inventory"],
            "detail-type": ["inventory.inventoryUpdated.v1"],
          },
          targets: {
            handler: {
              handler: "src/handlers/inventoryUpdated.handler",
              deadLetterQueue: dlq.arn,
            },
          },
        },
      },
    });
 
    // Archive for replay
    new aws.cloudwatch.EventArchive("EventArchive", {
      name: $app.name,
      eventSourceArn: bus.arn,
      retentionDays: 30,
    });
 
    // DLQ alarm
    new aws.cloudwatch.MetricAlarm("DLQAlarm", {
      metricName: "ApproximateNumberOfMessagesVisible",
      namespace: "AWS/SQS",
      dimensions: { QueueName: dlq.name },
      statistic: "Maximum",
      period: 60,
      threshold: 0,
      evaluationPeriods: 1,
      comparisonOperator: "GreaterThanThreshold",
    });
 
    return {
      busArn: bus.arn,
      dlqArn: dlq.arn,
    };
  },
});

Final Thoughts

EventBridge isn't perfect. The CloudWatch Logs integration is clunky, debugging event flows requires jumping between services, and the 256KB event size limit forces you to use S3 for large payloads.
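The usual workaround for the size limit is a claim-check: upload the payload to S3 and put only a pointer in the event. A sketch of the decision logic (the S3 upload itself is omitted, and the bucket/key naming is illustrative):

```typescript
// Claim-check pattern for the 256 KB PutEvents entry limit: payloads that
// fit go inline, anything larger is uploaded to S3 (call omitted) and the
// event carries only a pointer.
const MAX_DETAIL_BYTES = 256 * 1024;

interface ClaimCheck {
  bucket: string;
  key: string;
}

function detailOrClaimCheck(
  detail: unknown,
  bucket: string,
  key: string,
): { inline: true; detail: string } | { inline: false; pointer: ClaimCheck } {
  const serialized = JSON.stringify(detail);
  if (new TextEncoder().encode(serialized).length < MAX_DETAIL_BYTES) {
    return { inline: true, detail: serialized };
  }
  // Too big: caller uploads `serialized` to s3://bucket/key first
  return { inline: false, pointer: { bucket, key } };
}
```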

But after three years, it's still my default choice for serverless event-driven architectures. The combination of schema discovery, dead letter queues, event replay, and native AWS integrations is hard to beat.

If you're starting a new project, skip SNS. Go straight to EventBridge. Your future self will thank you.