AWS outage today: US-East-1 failure knocks major apps offline

A failure radiating out of AWS’s US-EAST-1 region on Oct 20 knocked everything from Fortnite and Snapchat to banks, HMRC and smart-home gear offline. Recovery came in waves; the architectural lessons are depressingly familiar.

Timeline, at a glance (BST):

  • ~08:11: error rates start spiking, with widespread customer impact visible on major platforms within minutes.
  • ~10:30: many services report partial recovery.
  • Early afternoon: most user-facing apps are back, while background jobs and retries continue to grind through backlogs.
  • Throughout: AWS acknowledges issues centred on the US-EAST-1 region and references elevated error rates and latency across multiple services.

What (likely) broke—and why scope exploded

Even without a finished post-mortem, impact patterns point to control-plane and dependency concentration in US-EAST-1. Modern SaaS stacks often treat Virginia as a global brain: auth tokens, config/feature flags, service discovery, event buses, and metrics backends quietly depend on one region. When that region coughs, supposedly “regional” apps faceplant. This is the same class of failure seen in 2020, 2021 and 2023—only the blast radius is bigger now because more consumer and financial services delegate their critical paths to cloud primitives.
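
To make the concentration concrete, here is a minimal sketch assuming a boto3-based service; the service, table and parameter names are hypothetical. The data plane is regional, but feature flags quietly resolve through us-east-1:

```python
import json
import boto3

# Hypothetical "regional" service deployed in eu-west-1 whose data plane is
# regional, but whose control plane (feature flags in SSM Parameter Store)
# is quietly pinned to us-east-1.

# Data plane: regional, keeps working through a us-east-1 incident.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
orders_table = dynamodb.Table("orders")  # illustrative table name

# Control plane: hard-coded to us-east-1. Every cold start, config refresh
# or deploy that reads this parameter now shares Virginia's fate.
ssm_global = boto3.client("ssm", region_name="us-east-1")

def load_feature_flags() -> dict:
    """Reads flags from a single global region: the singleton smell."""
    resp = ssm_global.get_parameter(Name="/myapp/feature-flags")
    return json.loads(resp["Parameter"]["Value"])
```

Multiply that pattern across auth token issuance, service discovery and event buses, and a single regional incident becomes a global one.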

The architecture smell test

  • Global singleton dependencies: Even with multi-AZ data planes, many teams centralise control (auth, config, queues) in US-EAST-1. That’s a single point of failure wearing a multi-AZ hat.
  • Implicit dependencies: Background jobs (billing, notifications, KMS, logging) can block the front door if timeouts are mis-tuned or retries are unconstrained.
  • Backpressure debt: When upstreams stall, queues balloon and downstreams thrash. Recovery then takes longer than the outage itself, thanks to thundering herds and idempotency gaps (a retry-budget sketch follows this list).
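
As a sketch of what taming that debt looks like in code (the limits and names here are illustrative, not a prescription): bounded attempts, exponential backoff with jitter, and a hard deadline, so a stalled upstream pushes work into a dead-letter queue instead of feeding a thundering herd.

```python
import random
import time

class UpstreamStalled(Exception):
    """Raised when the retry budget or deadline is exhausted."""

def call_with_budget(fn, *, max_attempts=4, base_delay=0.2, max_delay=5.0,
                     deadline=30.0):
    """Bounded retries: capped attempts, full-jitter backoff, hard deadline.

    Unconstrained retries are what turn a regional brown-out into a
    thundering herd; capping attempts and total time keeps queues from
    ballooning while the upstream recovers.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            elapsed = time.monotonic() - start
            if attempt == max_attempts or elapsed >= deadline:
                raise UpstreamStalled("retry budget exhausted; park the job in a DLQ")
            # Full jitter: sleep a random slice of the capped backoff window.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, backoff))
```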

What good looks like (and costs)

  1. Regional isolation by design: Keep state, secrets and feature flags per region; eliminate cross-region sync on the hot path.
  2. Prove failover: Quarterly game-days that actually flip traffic, not tabletop roleplay. Measure user-perceived RTO/RPO—not just green dashboards.
  3. Control-plane redundancy: Host auth/config/queues in ≥2 regions with deterministic failover (see the two-region config sketch after this list). Avoid “one region is more equal than others.”
  4. Idempotency and rate caps: Make retries boring. Budget write-amplification and shield downstreams with circuit breakers.
  5. Operational escape hatches: Feature-flag kill-switches, cached offline modes, and graceful degradation so “login, search, pay” still work.
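
A minimal sketch of deterministic control-plane failover, assuming the flag data is already replicated per region (item 1); the regions, parameter name and timeouts are illustrative. The priority order and fast timeouts are fixed in advance, so failover is mechanical rather than a judgment call made mid-incident.

```python
import json
import boto3
from botocore.config import Config

REGIONS = ["eu-west-1", "eu-west-2"]  # fixed priority; neither is "more equal"
FAST_FAIL = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

def load_config(name: str = "/myapp/feature-flags") -> dict:
    """Reads config from the first healthy region in a fixed order."""
    last_error = None
    for region in REGIONS:
        try:
            ssm = boto3.client("ssm", region_name=region, config=FAST_FAIL)
            resp = ssm.get_parameter(Name=name)
            return json.loads(resp["Parameter"]["Value"])
        except Exception as exc:  # timeouts, throttling, regional outage
            last_error = exc
    # Both regions down: surface it fast so callers can fall back to a cached
    # copy (the escape hatch in item 5) instead of blocking login, search, pay.
    raise RuntimeError("config unavailable in all regions") from last_error
```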

Impact and business takeaways

Gaming (Fortnite/Roblox), comms (Snapchat/Signal), fintech (Coinbase/Robinhood/Venmo/Chime), and UK institutions (Lloyds/BoS/HMRC) were visibly hit. The reputational damage accrues to you, not just Amazon. If US-EAST-1 is embedded in your critical path, regulators and enterprise customers will increasingly ask why.
