doi.org outage
Incident Report for The DOI Foundation
Postmortem

The doi.org resolution downtime on 2024-05-29 was caused by the sudden loss of our access to Cloudflare’s “Load Balancing” product, which doi.org has used to steer traffic to appropriate backend servers based on location. We were not able to re-enable “Load Balancing” and instead found a workaround using AWS services, eventually settling on AWS “Global Accelerator”.

We opened a ticket with Cloudflare at the time, but after three weeks we have not received any useful information from them. During this period, Cloudflare noted an issue with billing and subscriptions at https://www.cloudflarestatus.com/incidents/5t270n2ndf0h . We suspect that our issue is related to this, and that it just happened to play out in a way that resulted in a loss of service.

Although widespread billing issues may explain Cloudflare’s lack of response to our ticket, we find it disappointing, especially given that we had actual service downtime as a result.

We have decided to make our use of AWS “Global Accelerator” permanent. For doi.org, this service is significantly less expensive than Cloudflare’s “Load Balancing”, and even appears to have a small positive effect on latency. We continue to use other Cloudflare services, especially for rate limiting and potential DDoS mitigation.

We will continue to monitor our Cloudflare subscriptions, billing, and tickets for any new issues or any explanatory information. And we apologize for the inconvenience caused by this downtime.

Posted Jun 20, 2024 - 02:17 UTC

Resolved
doi.org resolution services were inaccessible for approximately an hour from 00:30 to 01:30 UTC on 2024-05-29.

The issue seems to be caused by some failure of Cloudflare's "Load Balancing" product which is used by doi.org to steer global traffic to appropriate backend servers. We have worked around this issue in order to restore doi.org resolution services, but were not immediately able to restore full geo-steering functionality or determine why the failure happened. We will add information as it becomes available.
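To illustrate the geo-steering role the load balancer plays, here is a minimal sketch in Python. The region names, backend hostnames, and preference ordering are purely hypothetical and are not doi.org's actual configuration; the point is only the pattern of picking the nearest healthy backend and falling back when a health check fails.

```python
# Hypothetical geo-steering sketch. All names below are illustrative,
# not doi.org's real topology.

# Candidate backend resolvers by region (hypothetical hostnames).
BACKENDS = {
    "us-east": "resolver-us.example.org",
    "eu-west": "resolver-eu.example.org",
    "ap-east": "resolver-ap.example.org",
}

# Per-client-region preference order, nearest first (illustrative).
PREFERENCE = {
    "north-america": ["us-east", "eu-west", "ap-east"],
    "europe": ["eu-west", "us-east", "ap-east"],
    "asia-pacific": ["ap-east", "us-east", "eu-west"],
}

def steer(client_region: str, healthy: set) -> str:
    """Return the preferred healthy backend for a client region.

    Walks down the preference list so that when a backend fails its
    health check, traffic fails over to the next-nearest region
    instead of resolution going down entirely.
    """
    for region in PREFERENCE[client_region]:
        if region in healthy:
            return BACKENDS[region]
    raise RuntimeError("no healthy backend available")
```

When the load-balancing layer itself disappears, as happened here, even healthy backends become unreachable, which is why the workaround had to reintroduce a steering layer (in this case AWS's Global Accelerator) rather than simply pointing at a single backend.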

By roughly 03:30 UTC we restored geo-steering by routing through AWS's "Global Accelerator" product. We are still working with Cloudflare to determine the source of the issue.

As of 2024-05-31, Cloudflare has escalated this internally, but has still not fixed the issue or provided more information. We are evaluating AWS Global Accelerator as a permanent replacement for Cloudflare Load Balancing.
Posted May 29, 2024 - 01:46 UTC
This incident affected: DOI Resolution Service.