A 16-hour outage across Amazon Web Services began with a software bug in DynamoDB’s DNS management system, initially confined to a single AWS region. The DynamoDB DNS Enactor and its companion DNS Planner were updating domain lookup tables to balance load across endpoints; a timing mismatch between the two components set off a race condition that ultimately disabled DynamoDB service across the region.
According to Amazon, the failure cascaded because one Enactor hit unusually long delays while retrying its updates, even as the Planner continued generating new plans. A second Enactor soon began applying those newer plans, so that when the delayed Enactor finally finished, it overwrote fresher records with stale ones, amplifying the conflict and bringing the DNS layer to a halt. The result was a broader outage affecting many AWS services and customer endpoints.
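At its core, the pattern Amazon describes is a classic last-writer-wins race: workers apply configuration plans without checking whether a newer plan has already landed. The Python sketch below is purely illustrative, not AWS's internal code, and every name in it is hypothetical; it simply shows how a delayed worker can clobber a newer update when no version check exists.

```python
import threading
import time

# Illustrative only: two "enactor" workers apply DNS plans without comparing
# plan versions, so a delayed worker can overwrite a newer plan (last writer
# wins). Names and data structures are hypothetical, not AWS internals.

dns_table = {"dynamodb.region.example": None}   # simulated DNS record store
lock = threading.Lock()

def enact(plan_version: int, delay: float) -> None:
    """Apply a DNS plan after an artificial delay, with no staleness check."""
    time.sleep(delay)
    with lock:
        # BUG: nothing verifies that plan_version is newer than what is
        # already applied, so a stale plan can replace a fresh one.
        dns_table["dynamodb.region.example"] = plan_version

# Enactor A picks up plan 1 but stalls; Enactor B applies newer plan 2 first.
a = threading.Thread(target=enact, args=(1, 0.2))
b = threading.Thread(target=enact, args=(2, 0.0))
a.start(); b.start()
a.join(); b.join()

print(dns_table)  # {'dynamodb.region.example': 1} -- the stale plan won
```

The usual guard against this is a monotonically increasing plan version checked with a conditional write, so an older plan can never replace a newer one.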
Industry tracker Ookla reported that its Downdetector service captured more than 17 million user reports of disruptions affecting about 3,500 organizations. The three countries with the most reports were the United States, the United Kingdom, and Germany. Snapchat, AWS, and Roblox were among the services most frequently reported as down.
In response to the incident, AWS said it temporarily disabled the DynamoDB DNS Planner and DNS Enactor automation while engineers worked on a fix. The company emphasized the need for stronger regional resilience and broader adoption of multi-region architectures to prevent similar outages in the future.
Analysts and observers have underscored the risk of relying on a single regional hub for critical infrastructure. The outage is cited as a cautionary tale about designing cloud architectures with redundancy and diversified dependencies so that no single failure can take down everything that depends on it.
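As a concrete illustration of that advice, the sketch below shows one common mitigation pattern: routing reads to a replica in a second region when the primary region's endpoint is unreachable. It assumes a DynamoDB global table already replicated across both regions; the table name, key, and region choices are placeholders, not a prescription from AWS.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical sketch: fall back to a table replica in a second region when
# calls to the primary region fail. Assumes a global table already replicates
# the data; regions, table, and key below are placeholders.

REGIONS = ["us-east-1", "us-west-2"]   # primary first, then fallback

def get_item_with_failover(table_name: str, key: dict) -> dict:
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            return client.get_item(TableName=table_name, Key=key)
        except (ClientError, EndpointConnectionError) as err:
            last_error = err        # try the next region before giving up
    raise last_error

# Example call (placeholder table and key):
# item = get_item_with_failover("orders", {"order_id": {"S": "12345"}})
```

The same pattern extends to writes when data is replicated through a global table, though applications must then tolerate eventual consistency between regions.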