Traffic surge preventing much DOI resolution

Incident Report for The DOI Foundation

Postmortem

At 06:30 UTC on 2026-03-17, after over an hour of substantially elevated traffic, four different servers in the backend of doi.org became unstable approximately simultaneously. These servers collectively provide handle resolution for DOIs maintained by Crossref, the DOI Registration Agency which handles the largest resolution load.

doi.org itself continued to function, but resolutions of DOIs maintained by Crossref began to give errors.

We received our first notification that something was amiss at 06:38 UTC, but investigation did not begin until a second notification at 07:39 UTC. Restarting the servers did not resolve the instability: there was enough traffic that they fell over more or less immediately after coming back up.

We quickly tightened doi.org rate limiting (at Cloudflare) in an attempt to control the traffic, but it did not make enough difference; the servers continued to become unstable shortly after restarting. We suspect that the tighter rate limiting might have prevented the issue had it been in place before the traffic surge, while the servers were still running, but it did not reduce the traffic enough to let us bring the servers back up.
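
For illustration only, the kind of per-client budgeting such a rule applies can be sketched in a few lines of Python. This is not Cloudflare's implementation, and the limit and window below are invented numbers, not the values we actually configured:

    import time
    from collections import defaultdict

    LIMIT = 100    # requests allowed per window (hypothetical value)
    WINDOW = 10    # window length in seconds (hypothetical value)

    # client id -> [start of current window, requests seen in that window]
    _counters = defaultdict(lambda: [0.0, 0])

    def allow(client_id: str) -> bool:
        """Return True if this request fits within the client's budget."""
        now = time.monotonic()
        window_start, count = _counters[client_id]
        if now - window_start >= WINDOW:
            _counters[client_id] = [now, 1]   # new window, first request
            return True
        if count < LIMIT:
            _counters[client_id][1] = count + 1
            return True
        return False                          # over budget: reject (e.g. HTTP 429)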

We next tried increasing the resources available to the servers. This also did not help, at least not without further exploration of configuration or software changes: the bottleneck was not memory or CPU directly, but the number of network connections a single machine can handle. So we set in motion an increase in the number of servers.
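
One example of the kind of per-machine ceiling involved (an assumption about typical Linux configuration, not a statement about our specific servers): every open connection consumes a file descriptor, so a process's open-file limit caps concurrent connections no matter how much CPU or memory is available. In Python that limit can be inspected, and the soft limit raised, like this:

    import resource

    # Each open network connection uses a file descriptor, so RLIMIT_NOFILE
    # bounds concurrent connections regardless of CPU or memory headroom.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft}, hard={hard}")

    # A process may raise its soft limit up to the hard limit; raising the
    # hard limit itself requires privileges or system-level configuration.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))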

While waiting on that process, we noticed, belatedly, that the source of the traffic surge identified itself in its User-Agent header. We blocked that User-Agent string, which reduced the traffic enough to bring the servers back online. By 09:30 UTC service was back to normal.
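
The block itself was applied as a Cloudflare rule, but the logic is just a substring match on the User-Agent request header. Purely as an illustrative sketch, with a placeholder string rather than the one actually blocked, equivalent logic at the application layer could look like this WSGI middleware:

    # "ExampleSurgeBot" is a placeholder, not the User-Agent actually blocked.
    BLOCKED_UA_SUBSTRING = "ExampleSurgeBot"

    def block_user_agent(app):
        """Wrap a WSGI app and reject requests from the offending client."""
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if BLOCKED_UA_SUBSTRING in ua:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Blocked\n"]
            return app(environ, start_response)
        return middleware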

Remaining questions and lessons learned

It is still not clear exactly what it was about the traffic at 06:30 UTC that caused such instability. The traffic may have been around five times typical levels, but traffic at that level occurs with some frequency. Some sort of failure cascade seems likely: one machine failed, which increased the traffic to the remaining machines enough that another failed, and so on.
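
To make the suspected cascade concrete, here is a toy model in Python. All of the numbers are invented for illustration; they are not measurements from the incident. If the total load exceeds what the surviving machines can absorb, each failure pushes the rest over their limit in turn:

    CAPACITY = 120   # requests/sec one server can sustain (invented)
    servers = 4
    load = 500       # total requests/sec during the surge (invented)

    # Load is shared evenly; any server pushed past capacity fails and
    # its share is redistributed across the survivors.
    while servers > 0 and load / servers > CAPACITY:
        print(f"{servers} servers: {load / servers:.0f} req/s each -> one fails")
        servers -= 1

    print(f"stable at {servers} servers" if servers else "all servers down")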

We have increased the number of backend servers and have made it easier to increase this number again in the future.

In hindsight we could have resolved the issue much earlier simply by noticing that the source of the increased traffic was easy to block. That will not always be the case, but it was this time, and the downtime was prolonged because we did not take advantage of it quickly. There are also other levers at Cloudflare for controlling traffic more drastically, such as those intended for dealing with a DDoS attack. It would be unfortunate if some legitimate automated use of doi.org were temporarily blocked, but that might be preferable to an extended downtime.

We have put much effort over the years into increasing the reliability and scalability of the higher levels of the doi.org infrastructure, with great success. In that time the backend handle servers have always been rock solid, but it is no surprise that instability due to increased traffic has eventually come for them as well. In the short and medium term we expect that simply having more backend servers will prevent a repeat of this incident, but we will continue to explore other options as traffic continues to increase.

Posted Mar 18, 2026 - 05:02 UTC

Resolved

Monitoring shows traffic flowing normally again. This incident has been resolved.
Posted Mar 17, 2026 - 14:32 UTC

Monitoring

Service is restored. We continue to monitor. Downtime was roughly from 06:30 UTC to 09:30 UTC.
Posted Mar 17, 2026 - 09:57 UTC

Identified

A surge in traffic has affected some of the backend infrastructure of doi.org. Most DOI resolutions are failing. We are working on a fix.
Posted Mar 17, 2026 - 09:14 UTC
This incident affected: DOI Resolution Service.