When Automation Becomes the Enemy: Lessons from AWS’s October 2025 DynamoDB Outage

Last week, AWS published their post-mortem for the massive October 19-20 service disruption that brought down DynamoDB and cascaded through dozens of services in us-east-1. Having spent the weekend digesting this remarkably transparent report, I wanted to share some thoughts on what this incident teaches us about modern infrastructure operations.

The Irony of Sophisticated Automation

What strikes me most about this incident isn’t the failure itself, but how AWS’s sophisticated automation systems became their own worst enemy. The DynamoDB team had built an elegant DNS management system with multiple fail-safes, independent components, and careful orchestration. Yet a rare race condition between two DNS Enactors turned that machinery against itself, creating a situation where the automation actively prevented recovery.

Think about that for a moment. The very system designed to ensure high availability became the blocker to restoration. This reminds me of the classic distributed systems principle that complexity is the enemy of reliability. Sometimes our clever solutions create problems we never anticipated.

In my experience running large-scale systems, I’ve learned that automation without escape hatches is a ticking time bomb. You need what I call “break glass” procedures – ways for humans to override the automation when it goes haywire. AWS’s report mentions they had to resort to manual intervention, but given the recovery time, this clearly wasn’t a well-rehearsed procedure.
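To make that concrete, here’s a minimal sketch of what I mean, assuming a simple file-based kill switch (the path and function names are placeholders, not anything AWS-specific): the loop refuses to act while the flag exists, so a human can stop the automation cold without redeploying anything.

```python
import os
import time

# Hypothetical break-glass flag; any shared signal your operators can set works.
KILL_SWITCH = "/etc/automation/BREAK_GLASS"

def automation_cycle(plan_changes, apply_changes):
    """Run one pass of a remediation loop, unless a human has pulled the lever."""
    if os.path.exists(KILL_SWITCH):
        print("break-glass engaged; skipping automated changes")
        return
    changes = plan_changes()
    if changes:
        apply_changes(changes)

def run_forever(plan_changes, apply_changes, interval_s=30):
    # plan_changes and apply_changes are whatever your automation actually does.
    while True:
        automation_cycle(plan_changes, apply_changes)
        time.sleep(interval_s)
```

The point isn’t the file check itself; it’s that the override is built into the loop and rehearsed in drills, not improvised at 3 a.m.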

The Cascade Nobody Saw Coming

The progression from DynamoDB’s DNS failure to a region-wide meltdown reads like a masterclass in cascade failures. First DynamoDB goes down. Then EC2 can’t launch new instances because DWFM, EC2’s DropletWorkflow Manager, depends on DynamoDB. Then Network Manager gets backlogged because it can’t propagate configurations. Then NLB starts flapping because health checks fail on instances with incomplete network configs. Then Lambda, ECS, EKS, and a dozen other services topple like dominoes.

What’s particularly instructive here is how the recovery itself created new problems. When DWFM tried to re-establish leases with hundreds of thousands of droplets simultaneously, it fell into what the report calls “congestive collapse” – essentially a self-induced denial of service. This is a pattern I’ve seen before in distributed systems recovery. The medicine becomes worse than the disease.

The team’s solution was clever though perhaps born of desperation: they selectively restarted DWFM hosts to clear the queues while simultaneously throttling incoming requests. It’s the distributed systems equivalent of turning it off and on again, but with surgical precision. Sometimes the old tricks are the best tricks.
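The general antidote to a recovery stampede is to pace the work instead of unleashing it all at once. Here’s a rough sketch of the pattern, with `renew_lease` standing in for whatever call actually rebuilds the state; the rate limit and jitter values are illustrative, not anything from the report.

```python
import random
import time

def reestablish_leases(droplet_ids, renew_lease, max_per_second=50):
    """Re-establish leases at a bounded, jittered rate instead of all at once."""
    interval = 1.0 / max_per_second
    for droplet_id in droplet_ids:
        renew_lease(droplet_id)
        # Jitter keeps many recovering workers from synchronizing into waves.
        time.sleep(interval * random.uniform(0.5, 1.5))
```

Bounding the rate up front is far cheaper than discovering your system’s congestive-collapse threshold in the middle of an incident.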

DNS: The Service We Forget Until It Breaks

DNS failures are particularly nasty because DNS underpins everything in modern cloud architectures. We use it for service discovery, load balancing, traffic management, and failover. Yet how many of us have really thought through what happens when DNS completely fails? Not when it returns the wrong answer, not when it responds slowly, but when it returns nothing at all?

AWS’s architecture had DNS managing hundreds of thousands of records just for DynamoDB in a single region. That’s not unusual for hyperscale operations, but it does create a massive blast radius when things go wrong. The report mentions that even after restoring the DNS records at 02:25, it took another 15 minutes for customers to regain access due to DNS cache expiration. That’s the insidious nature of DNS failures – even after you fix the problem, you’re waiting for the distributed cache to catch up.

In my own infrastructure work, I’ve started maintaining static IP fallback lists for critical services, distributed through configuration management rather than DNS. It’s not elegant, it doesn’t scale beautifully, but when DNS fails, those hardcoded IPs might be your only lifeline. Sometimes you need belt and suspenders.
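Here’s roughly what that looks like in practice: a sketch that assumes the fallback map is shipped by whatever configuration management you already run, with obviously made-up hostnames and addresses.

```python
import socket

# Hypothetical static fallback map, distributed via configuration management.
FALLBACK_IPS = {
    "payments.internal.example.com": ["10.12.0.21", "10.12.0.22"],
}

def resolve_with_fallback(hostname):
    """Try DNS first; if resolution fails outright, fall back to the static list."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
        if addresses:
            return addresses
    except socket.gaierror:
        pass  # DNS returned nothing at all, the failure mode that matters here
    return FALLBACK_IPS.get(hostname, [])
```

The fallback list will drift if nobody regenerates it, so treat it as a scheduled artifact of your deployment pipeline, not something a human remembers to update.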

Health Checks Gone Wild

The NLB health check situation deserves special attention because it represents a failure mode I’ve seen repeatedly in production systems. Health checks are supposed to remove unhealthy nodes from service, but what happens when the health check system itself becomes confused about what “healthy” means?

In this case, NLB was health-checking instances that hadn’t received their network configurations yet. The instances were fine, the load balancers were fine, but the health checks failed anyway. This created a flapping situation where nodes were constantly being removed and re-added to service. The health check system, trying to do its job, was actually making things worse.

The temporary solution was brutally pragmatic: they just turned off the automatic failover at 09:36. Sometimes the best solution to a misbehaving automatic system is to make it less automatic. This takes courage in a production environment, essentially flying without a safety net, but it was the right call.
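If you want your own health checks to be less trigger-happy, the two basic ingredients are hysteresis and an explicit off switch. A rough sketch, with made-up names and thresholds:

```python
from collections import defaultdict

FAILURES_BEFORE_REMOVAL = 3   # hysteresis: require several consecutive failures
AUTO_FAILOVER_ENABLED = True  # operators can flip this to stop automatic removal

_consecutive_failures = defaultdict(int)

def on_health_check(node_id, passed, remove_from_service):
    """Remove a node only after repeated failures, and only if auto-failover is on."""
    if passed:
        _consecutive_failures[node_id] = 0
        return
    _consecutive_failures[node_id] += 1
    if AUTO_FAILOVER_ENABLED and _consecutive_failures[node_id] >= FAILURES_BEFORE_REMOVAL:
        remove_from_service(node_id)
```

The off switch only helps if the people on call know it exists and have the authority to use it.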

Recovery Lessons from the Trenches

One detail that jumped out at me was how different services recovered at vastly different rates. DynamoDB was essentially fixed by 02:40, but EC2 wasn’t fully operational until 13:50. Lambda had cleared its backlogs by 06:00, only to get hit again at 07:04 when NLB issues terminated a bunch of Lambda’s internal infrastructure.

This staggered recovery pattern suggests teams were operating somewhat independently, each fighting their own fires. While this has advantages in terms of parallel recovery efforts, it can also lead to situations where one team’s recovery efforts inadvertently impact another team’s systems. The Lambda team probably wasn’t thrilled when the NLB issues undid their recovery work.

The report also reveals something about AWS’s internal dependencies that customers rarely see. Redshift, for instance, was sending IAM calls from every region to us-east-1 when resolving user groups. That kind of hidden cross-region dependency is exactly the sort of thing that turns a regional issue into a global one. I’d bet AWS is auditing all their services for similar antipatterns right now.

What This Means for the Rest of Us

So what should those of us running smaller operations take away from this incident? First, we need to acknowledge that if AWS can suffer a 14.5-hour outage, none of us are immune. The question isn’t whether you’ll have a major incident, but whether you’ll be prepared when it happens.

Start by mapping your service dependencies. Really map them, not just the obvious ones but the hidden ones too. Does your service in region A make calls to region B for any reason? Do you have services that share fate because they all depend on the same underlying system? These dependency chains are where cascading failures breed.
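Even a crude inventory helps. Something as simple as the sketch below, a hand-maintained map of who calls whom and where, will surface cross-region edges you didn’t know you had; the services and regions shown are placeholders.

```python
# Hypothetical dependency inventory: (service, region) -> list of (dependency, region)
DEPENDENCIES = {
    ("checkout", "eu-west-1"): [("inventory", "eu-west-1"), ("identity", "us-east-1")],
    ("inventory", "eu-west-1"): [("database", "eu-west-1")],
}

def cross_region_dependencies(deps):
    """Return every edge where a service calls a dependency in a different region."""
    findings = []
    for (service, region), targets in deps.items():
        for target, target_region in targets:
            if target_region != region:
                findings.append((service, region, target, target_region))
    return findings

for service, region, target, target_region in cross_region_dependencies(DEPENDENCIES):
    print(f"{service} ({region}) depends on {target} in {target_region}")
```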

Next, think about your automation’s failure modes. Can your automation get stuck in a state where it prevents recovery? Do you have manual override procedures? Have you actually tested them? I’m reminded of the old military adage that no plan survives contact with the enemy. Your automation won’t survive contact with novel failure modes unless you’ve built in escape hatches.

Consider your approach to health checks and automatic recovery. Are you prepared to disable automatic systems if they start misbehaving? Do you have the observability to know when your health checks are causing more harm than good? Sometimes the cure really is worse than the disease.

Finally, practice your incident response. AWS clearly has talented engineers and solid procedures, yet this incident still took 14.5 hours to fully resolve. Part of that is the sheer scale they operate at, but part of it is likely the rarity of this particular failure mode. The race condition had been lying dormant, waiting for just the right circumstances. When it finally triggered, teams were learning as they went.

The Humbling Truth

This incident is humbling for all of us in the infrastructure space. It shows that even with world-class engineering, extensive automation, and virtually unlimited resources, complex systems can still fail in surprising ways. A race condition that probably existed for years finally found the perfect storm of conditions to manifest. Two DNS Enactors, operating at slightly different speeds, managed to delete the very records they were supposed to protect.

If there’s one thing I’ve learned from two decades in operations, it’s that distributed systems are constantly teaching us new ways they can fail. Each incident adds to our collective knowledge, each post-mortem teaches us something we didn’t know we didn’t know. AWS’s transparency in sharing these details is a gift to the entire industry.

As I write this, thousands of engineers at companies worldwide are probably reviewing their own DNS management systems, their health check configurations, and their service dependencies. They’re asking uncomfortable questions about their own automation systems and whether they too might harbor sleeping dragons. That’s how we get better as an industry, one painful lesson at a time.

The next time someone tells you that modern cloud infrastructure has solved the reliability problem, point them to this incident. We’ve made incredible progress, but complex systems will always find new and creative ways to fail. Our job isn’t to prevent all failures but to ensure we can recover quickly and learn from each one. In that sense, this AWS post-mortem is exactly what good operations looks like: turning a painful outage into a learning opportunity for everyone.