Cloudflare is a CDN and security services company that went public in 2019. On Tuesday, June 21, 2022, its services were temporarily interrupted, disrupting a large number of services and platforms, including FTX, Discord, Omegle, DoorDash and many more.

Event background

On Tuesday, June 21, 2022, Cloudflare’s services were temporarily interrupted, disrupting a large number of services and platforms. Users could not access sites including FTX, Discord, Omegle, DoorDash, Crunchyroll, NordVPN and Feedly, as well as Zerodha, Medium.com, the news site The Register, Groww, Buffer, iSpirt, Upstox and Social Blade. Even Coinbase, Shopify and League of Legends were partially affected.

In this article, I will explain what Cloudflare is and what kind of company it is, describe Cloudflare’s ties to Web3, and give a technical explanation of the cause of this outage.

What is a CDN (Content Delivery Network)

Before talking about Cloudflare, let’s introduce a concept: the CDN.

What is a CDN?

CDN stands for Content Delivery Network, also called a Content Distribution Network. So what is a content delivery network? It is a system of computers connected to each other over the Internet that uses the server nearest to each user to send music, pictures, videos, applications and other files faster and more reliably, delivering web content to users with high performance, scalability and low cost.

To use an analogy, a CDN is somewhat like JD.com’s logistics model. JD.com sets up logistics points (cache servers) across the country. When someone buys goods from JD.com (a user requests a resource), JD.com uses the delivery address (the CDN resolves the user’s domain name request) to find the nearest or fastest logistics point for delivery (connecting the visiting user to the nearest cache server to transfer the resource).

CDN services ensure fast and reliable distribution of static content, which can be cached and is best suited to storage and distribution over high-speed networks. This frees up backbone network capacity for dynamic content that must be delivered in real time, such as live webcasts, and reduces delay.

Let’s take an example. Suppose there is a British company whose main customers are also in the UK. If a website is set up for this company, the web server is usually located in the UK as well. Even so, there can be delays that hurt the user’s experience of the website. If the delay is caused by network congestion, it can be improved. How?

These problems are mainly solved by increasing the bandwidth between the two points and by optimizing network routing. For example, between London and Oxford, laying more optical fibers between the two places is the simplest way to increase bandwidth.

Note that optical fiber is mostly laid at the same time as other infrastructure, such as submarine cables, railways and highways, is built. As a result, the bandwidth available to us has been increasing over the years. You can think of increasing network bandwidth as widening a road: it is a matter of spending money on paving.


We mentioned network routing earlier. What is routing? The main problem routing solves is which path the communication between two points should take. For example, if the network between London and Oxford is congested, the system can choose another route. Internet routing optimization is a bit like intelligent traffic management. So over the years, network performance has kept improving even though traffic keeps growing.

In layman’s terms, a CDN speeds up a website. Some websites become extremely slow to open for one reason or another, and a CDN is needed to accelerate them.

A CDN is also a relatively advanced network technology that solves the problem of content distribution on the Internet. The problem is similar to one in transportation: no matter how fast a plane is, its speed has a limit, so the longer the distance, the longer the delay. The same is true of networks: if the distance is long, there will be network delay.
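
As a rough worked example of why distance matters, the sketch below estimates one-way propagation delay over optical fiber. It assumes that light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and ignores routing detours, queuing and processing time. The distances are illustrative round numbers, not measurements.

```python
# Approximate speed of light in optical fiber (an assumption: about 2/3 of c).
FIBER_SPEED_KM_PER_S = 200_000


def one_way_delay_ms(distance_km: float) -> float:
    """Estimate one-way propagation delay in milliseconds."""
    return distance_km * 1000 / FIBER_SPEED_KM_PER_S


# Illustrative distances only.
for label, km in [("nearby cache server", 100), ("server across an ocean", 6000)]:
    print(f"{label}: {one_way_delay_ms(km):.1f} ms one way")
```

Even before congestion is considered, the distant server in this sketch costs tens of milliseconds per direction, which is the delay a nearby cache server avoids.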

So if a European user wants to access the content of an American website, the CDN sets up a server in Europe and copies the American content to it. When a European user looks up the domain name, the CDN operator knows that the request comes from Europe, so it returns the IP address of the European server, and the user naturally accesses the European server.
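
The lookup step can be sketched as a toy resolver. This is only an illustration of the idea, not how any particular CDN implements it: the domain, the region names and the IP addresses (taken from ranges reserved for documentation) are all hypothetical.

```python
# Hypothetical CDN zone data: one origin server plus regional cache servers.
CDN_ZONES = {
    "example.com": {
        "origin": "203.0.113.10",  # the American origin server
        "edges": {
            "europe": "192.0.2.10",
            "north_america": "198.51.100.10",
        },
    }
}


def resolve(domain: str, user_region: str) -> str:
    """Return the IP address a CDN-style resolver might hand to this user."""
    zone = CDN_ZONES[domain]
    # Fall back to the origin when no cache server covers the user's region.
    return zone["edges"].get(user_region, zone["origin"])


print(resolve("example.com", "europe"))  # the European cache server
print(resolve("example.com", "asia"))    # no edge here, so the origin
```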

Why are CDN companies usually security companies?

An important feature of CDNs is that they are well placed to provide security: a CDN can see exactly who is accessing a customer’s website, so it can help customers block attacks on the site. Network security is a value-added service for CDN companies. Although CDNs appeared long before cloud computing, they are now generally classified as cloud computing. Monthly billing and billing by traffic are typical cloud computing subscription models. At the same time, CDN servers are not necessarily traditional physical servers; they may also be virtual machines from public cloud operators. So you can now think of a CDN as a cloud computing IaaS service.

Note: Part of the explanation of CDNs in this section comes from the YouTube blogger Lao Ke, who covers technology stocks.

What company is Cloudflare?

Cloudflare officially launched in 2010 and is headquartered in San Francisco, USA. Its main business is CDN and security services: it provides customers with a reverse-proxy-based content delivery network and distributed domain name resolution (distributed DNS). Since 2009, the company has received venture capital investment from firms such as Union Square Ventures, and Baidu took part in Cloudflare’s Series D round. Cloudflare filed for its IPO on August 15, 2019, and its shares began trading on the New York Stock Exchange on September 13, 2019.

In addition, Cloudflare has acquired a series of network service and security companies: StopTheHacker and CryptoSeal in 2014; Eager Platform Co. in 2016; Neumob, S2 Systems, Linc and Zaraz from 2017 onward; and, more recently, Vectrix and Area 1 Security.

Cloudflare’s ties to Web3

Cloudflare is a CDN company that started supporting Web3 development relatively early. Its official website describes Cloudflare as users’ gateway to Web3: through Cloudflare, you can easily access the IPFS and Ethereum networks. The website also notes that Web 1.0 gave the world the ability to disseminate information rapidly, while Web 2.0 made that information interactive. Web 3.0, or Web3, is considered the next iteration of the Internet, built on decentralized technologies such as IPFS and Ethereum.

Web3 underlying infrastructure? Briefly analyze the reason for the interruption of CloudFlare service yesterday

Image from Cloudflare official website

Cloudflare has an IPFS Gateway, which allows customers to enjoy the benefits of IPFS while continuing to use the HTTP protocol.


The Cloudflare Ethereum Gateway lets customers use their own custom domains and send JSON-RPC queries to them over HTTP. Cloudflare manages, maintains and monitors the Web3 infrastructure so that builders can focus on what matters: building dapps. Cloudflare’s pitch is secure, reliable and fast services based on Web3 technology, delivered over its global network.
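
To make "JSON-RPC over HTTP" concrete, here is a minimal sketch that builds the body of such a query. It only constructs the request and does not send it. The gateway URL is a hypothetical custom domain invented for this example; `eth_blockNumber` is a standard Ethereum JSON-RPC method that takes no parameters.

```python
import json

# Hypothetical custom domain pointed at an Ethereum gateway (an assumption).
GATEWAY_URL = "https://web3.example.com"


def build_rpc_request(method: str, params: list, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request body for an HTTP POST."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    )


# Ask for the number of the most recent block.
body = build_rpc_request("eth_blockNumber", [])
print(f"POST {GATEWAY_URL}")
print(body)
```

A dapp would POST this body with a `Content-Type: application/json` header and read the block number from the `result` field of the JSON response.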

Why did Cloudflare experience service outages?

The official explanation for the Cloudflare service outage event on June 21, 2022:

On June 21, 2022, a Cloudflare outage affected traffic in 19 data centers. Unfortunately, these 19 locations handle a large portion of Cloudflare’s global traffic. The outage was caused by a change made as part of a long-running project to improve the resilience of the busiest data centers: a change to the network configuration in those locations interrupted service. The outage started at 06:27 UTC. At 06:58 UTC the first data center came back online, and by 07:42 UTC all data centers were working normally. Depending on where in the world users were located, websites and services that rely on Cloudflare may have been inaccessible; in locations that were not affected, Cloudflare continued to operate normally.

Cloudflare apologized for this outage, which was Cloudflare’s own error and not the result of an attack or other malicious activity.

Background of the architecture change

Over the past 18 months, Cloudflare has been working to transform the architecture of all of its busiest data centers to make them more flexible and resilient. Currently, 19 data centers have been successfully converted to this architecture, which Cloudflare internally calls Multi-Colo PoP (MCP); these 19 data centers are located in: Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, Sao Paulo, San Jose, Singapore, Sydney and Tokyo.

A key part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows Cloudflare to easily disable and enable parts of a data center’s internal network for maintenance or to deal with a problem. In Cloudflare’s diagram of the architecture, this layer is labeled Spine.


Note: A Clos network is a multistage switching network, first formalized by Charles Clos in 1953, that represents a theoretical idealization of practical multistage telephone switching systems. Clos networks are used when the required physical circuit switching exceeds the capacity of the largest feasible single crossbar switch. Their main advantage is that they need far fewer crosspoints than building the entire switching system as one large crossbar switch.
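
The crosspoint saving can be checked with a little arithmetic. The sketch below uses the textbook three-stage Clos layout: r ingress switches of size n x m, m middle switches of size r x r, and r egress switches of size m x n, with m = 2n - 1 as the classic condition for a strictly non-blocking network. The port counts are illustrative and say nothing about the actual size of Cloudflare’s data center networks.

```python
def crossbar_crosspoints(ports: int) -> int:
    """A single crossbar needs one crosspoint per input/output pair."""
    return ports * ports


def clos_crosspoints(n: int, r: int, m: int) -> int:
    """Crosspoints in a three-stage Clos network.

    Ingress: r switches of n x m, middle: m switches of r x r,
    egress: r switches of m x n.
    """
    return 2 * r * n * m + m * r * r


n, r = 10, 10   # 10 ingress switches with 10 inputs each: 100 ports in total
m = 2 * n - 1   # strictly non-blocking condition: m >= 2n - 1

print(crossbar_crosspoints(n * r))  # 10000
print(clos_crosspoints(n, r, m))    # 5700
```

For very small switches the Clos design can actually need more crosspoints than a crossbar; the saving appears and keeps growing as the port count increases.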

This new architecture significantly improves Cloudflare’s reliability and allows Cloudflare to perform maintenance in these locations without disrupting customer traffic. However, because these locations also carry a significant portion of Cloudflare’s traffic, any problem there can have a very wide impact, which is unfortunately what happened in the June 21 outage.

Timeline and Impact of Service Disruption

Cloudflare uses BGP (Border Gateway Protocol, a routing protocol between autonomous systems that runs over TCP). As part of this protocol, operators define policies that decide which prefixes (blocks of adjacent IP addresses) are advertised to peers (the other networks they connect to). These policies are made up of individual components, called terms, which are evaluated in order. The end result is that any given prefix is either advertised or not advertised. A change in policy can mean that a previously advertised prefix is no longer advertised, which is known as being “withdrawn”, and those IP addresses are then no longer reachable on the Internet.

In other words, the operator defines a policy that determines which route prefixes may be advertised. Advertising a prefix means that other BGP edge routers can learn the route, so that other BGP networks learn about route changes. A prefix identifies a block of network addresses connected to the Internet.


While the prefix advertisement policy was being changed, a reordering of the policy’s terms caused Cloudflare to withdraw a critical subset of prefixes.
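
Why term order matters can be shown with a toy model in which the first matching term decides the outcome. This is only a sketch of the general idea, not Cloudflare’s router configuration: the term names, the prefixes (from a range reserved for documentation) and the default action are all hypothetical.

```python
from ipaddress import ip_network

# Hypothetical block of addresses that should always be advertised.
SITE_LOCAL = ip_network("192.0.2.0/24")

# Each term is (name, match function, action).
advertise_site_local = (
    "ADVERTISE-SITE-LOCAL",
    lambda prefix: prefix.subnet_of(SITE_LOCAL),
    "accept",
)
reject_the_rest = ("REJECT-THE-REST", lambda prefix: True, "reject")


def evaluate(policy, prefix):
    """Evaluate terms in order; the first term that matches decides."""
    for _name, matches, action in policy:
        if matches(prefix):
            return action
    return "reject"  # assumed default when no term matches


intended = [advertise_site_local, reject_the_rest]
reordered = [reject_the_rest, advertise_site_local]  # catch-all moved first

prefix = ip_network("192.0.2.0/25")
print(evaluate(intended, prefix))   # accept: the prefix is advertised
print(evaluate(reordered, prefix))  # reject: the prefix is withdrawn
```

The terms themselves are identical in both policies; only their order differs, yet the second policy silently stops advertising the prefix.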

Because those prefixes were withdrawn, Cloudflare engineers had additional difficulty reaching the affected data centers to revert the problematic change, even though Cloudflare has backup procedures in place to deal with such issues.

03:56 UTC: Cloudflare deploys the change to the first location. No location is affected at this stage, because the locations reached so far use the older architecture.

06:17: The change is deployed to Cloudflare’s busiest locations, but not to locations with the MCP (Multi-Colo PoP) architecture.

06:27: The rollout reaches the MCP-enabled locations, and the change is deployed to the spines. This is when the outage began, as the change quickly took these 19 data centers offline.

06:32: Cloudflare declares the incident internally.

06:51: First change made on a router to verify the root cause.

06:58: Root cause found; work begins to revert the problematic change.

07:42: The last revert is completed. This was delayed because network engineers overwrote each other’s changes, undoing earlier reverts, so the problem reappeared sporadically.

08:00: The incident is closed.
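
The key intervals in this timeline can be worked out with a few lines of arithmetic. The sketch below only uses the timestamps listed above (all UTC, same day); the helper name is my own.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


OUTAGE_START = "06:27"
print("incident declared after", minutes_between(OUTAGE_START, "06:32"), "min")
print("root cause found after", minutes_between(OUTAGE_START, "06:58"), "min")
print("last revert done after", minutes_between(OUTAGE_START, "07:42"), "min")
```

So the incident was declared within 5 minutes, the root cause was understood after about half an hour, and the last revert landed 75 minutes after the outage began.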

The importance of these data centers can be clearly seen in Cloudflare’s graph of the number of successful HTTP requests processed globally.


Although the data centers in question account for only 4% of Cloudflare’s total network, the outage affected 50% of total requests.
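
A quick calculation shows how concentrated that traffic is. The sketch assumes that "affected 50% of total requests" means the 19 locations normally handle about half of all requests; the two percentages come from the text above and everything else is derived from them.

```python
def load_ratio(network_share: float, request_share: float) -> float:
    """How much more traffic the affected part of the network carries,
    per unit of network, than the rest of the network does."""
    affected_density = request_share / network_share
    rest_density = (1 - request_share) / (1 - network_share)
    return affected_density / rest_density


# 4% of the network carrying 50% of the requests.
print(f"{load_ratio(0.04, 0.50):.0f}x")
```

Under that assumption, each unit of the affected 4% carries roughly 24 times as many requests as a unit of the remaining 96%, which is why a failure confined to so few locations was felt so widely.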


(The original post includes a small amount of configuration code at this point, which is omitted here. Interested network engineers can read the original post:

https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/ )

Remediation and Next Steps

The impact of this outage was widespread and severe. Cloudflare takes availability very seriously; it has identified several areas for improvement and will continue working to uncover any other gaps that could lead to another outage.

Process: While the MCP program was designed to improve availability, a procedural gap in how Cloudflare updated these data centers caused serious repercussions. Cloudflare does use a staggered rollout strategy for changes like this, but it is not perfect. The deployment process and automation need to include MCP-specific testing and deployment procedures to ensure there are no unintended consequences.

Architecture: A misconfigured router can prevent proper route advertisement, which stops traffic from reaching Cloudflare’s infrastructure. Cloudflare will redesign the policy statement for route advertisements to prevent ordering errors.

Automation: Parts of Cloudflare’s automation suite could have reduced the impact of this incident. Cloudflare will focus on automation improvements that enforce a better staggered rollout policy for network configuration changes and provide automatic “commit-confirm” rollback. The former would have greatly reduced the overall impact, and the latter would have greatly reduced the time to resolution.
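
The idea behind a “commit-confirm” rollback can be sketched as follows: a new configuration is applied provisionally and is reverted automatically unless an operator confirms it before a deadline. This is a toy model of the general technique, not Cloudflare’s tooling; the class, its methods and the timeout value are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class CommitConfirmRouter:
    """Toy router whose config change rolls back unless confirmed in time."""

    def __init__(self, config: str):
        self.config = config
        self._previous = None   # config to restore on rollback
        self._deadline = None   # time by which the change must be confirmed

    def commit_confirmed(self, new_config: str, now: float, timeout: float) -> None:
        """Apply new_config provisionally, remembering the old one."""
        self._previous = self.config
        self.config = new_config
        self._deadline = now + timeout

    def confirm(self) -> None:
        """Make the provisional change permanent."""
        self._previous = None
        self._deadline = None

    def tick(self, now: float) -> None:
        """Roll back if the deadline passed without a confirmation."""
        if self._deadline is not None and now >= self._deadline:
            self.config = self._previous
            self._previous = None
            self._deadline = None


router = CommitConfirmRouter("advertise prefixes")
router.commit_confirmed("withdraw prefixes by mistake", now=0, timeout=600)
router.tick(now=601)   # nobody could reach the router to confirm
print(router.config)   # back to "advertise prefixes"
```

The value of this design is that a change which cuts engineers off from the router undoes itself, because the confirmation that would have made it permanent can never arrive.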

Conclusion

Although Cloudflare has invested heavily in the MCP architecture to improve service availability, it let its customers down with this outage. To customers who lost access to Internet services and digital assets during the outage, and to all users, Cloudflare apologized, and it has begun work on the improvements described above to help ensure a similar incident does not happen again.


