A Cloudflare outage that disrupted numerous websites yesterday was initially suspected to be a hyper-scale DDoS attack, but investigators later traced it to a self-inflicted failure.
The root cause was a permissions change to one of Cloudflare’s database systems, which caused the system to write multiple entries into a “feature file” used by the Bot Management system and pushed the file beyond its expected size. The oversized feature file then propagated across the network to the machines that route traffic, overwhelming the software that reads the file to keep threat-detection models up to date.
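To make that failure mode concrete, here is a minimal Python sketch; it is an illustration, not Cloudflare's code. It assumes a hypothetical generator, `build_feature_file`, that writes one entry per row it receives, and a hypothetical consumer, `load_features`, that enforces a fixed cap, `MAX_FEATURES`. The names, the cap of 200, and the row format are all assumptions made for the example.

```python
MAX_FEATURES = 200  # hypothetical cap enforced by the consumer


def build_feature_file(rows):
    """Write one line per row. Deliberately does not deduplicate (illustrative flaw)."""
    return "\n".join(rows)


def load_features(text, limit=MAX_FEATURES):
    """Parse the file and refuse it if it holds more entries than the cap."""
    features = text.splitlines()
    if len(features) > limit:
        raise ValueError(
            f"feature file has {len(features)} entries, limit is {limit}"
        )
    return features


# Normal case: 120 distinct entries fit comfortably under the cap.
normal_rows = [f"feature_{i}" for i in range(120)]

# Assumed failure case: the upstream source now returns every row twice.
duplicated_rows = normal_rows + normal_rows

assert len(load_features(build_feature_file(normal_rows))) == 120

try:
    load_features(build_feature_file(duplicated_rows))
except ValueError as err:
    print(err)  # feature file has 240 entries, limit is 200
```

In this sketch nothing in the generator is wrong on its own terms; the input simply changed shape, and the consumer had no graceful way to handle a file larger than it was built for.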
The larger-than-expected file affected Cloudflare’s core CDN, security services, and several other network functions, leading to service degradation for many customers.
After the team stopped the propagation of the bloated file and replaced it with an earlier version, core traffic largely returned to normal. Recovery was not immediate: it took roughly two and a half hours for traffic to stabilize across the network.
Cloudflare said it plans to harden the ingestion of internal configuration files, add broader kill switches for features, and improve failure-mode handling to prevent a similar outage in the future.
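The three measures can be sketched together in a few lines of Python. This is a hedged illustration under stated assumptions, not a description of Cloudflare's implementation: the `FeatureStore` class, its `max_entries` limit, its `enabled` flag (standing in for a kill switch), and the one-entry-per-line file format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureStore:
    max_entries: int = 200   # hypothetical size limit for a candidate file
    enabled: bool = True     # hypothetical kill switch for the whole feature
    active: list = field(default_factory=list)  # last known-good entries

    def ingest(self, text: str) -> bool:
        """Validate a candidate file; keep the last known-good set on any failure.

        Returns True if the candidate was accepted, False otherwise.
        """
        if not self.enabled:
            return False  # kill switch is off: ignore new files entirely
        entries = [line.strip() for line in text.splitlines() if line.strip()]
        if not entries:
            return False  # reject empty files
        if len(entries) > self.max_entries:
            return False  # reject oversized files instead of crashing
        if len(set(entries)) != len(entries):
            return False  # reject files that contain duplicate entries
        self.active = entries
        return True


store = FeatureStore(max_entries=3)
assert store.ingest("a\nb") is True
assert store.ingest("a\nb\na\nb") is False   # too large, so it is rejected
assert store.active == ["a", "b"]            # previous version is still served

store.enabled = False                        # operator flips the kill switch
assert store.ingest("c") is False
assert store.active == ["a", "b"]
```

The design choice worth noting is that a bad file results in a rejected update rather than an error that takes down the reader; the service keeps running on the previous version until a valid file arrives.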