Tuesday will be remembered as the day the internet broke, before swiftly being fixed again. Early in the morning, websites including Amazon, Reddit, Spotify, eBay, Twitch, Pinterest and, unfortunately, CNET went offline due to a major outage at a service called Fastly. Everywhere you looked, there were 503 errors and people complaining they couldn’t access key services and news outlets. It all demonstrated just how much of the internet relies on this largely unheard-of cloud computing service.
After investigating what happened, Fastly published a blog post explaining exactly what went down, and it turns out the whole incident was triggered by just a single, unnamed Fastly customer.
In mid-May, Fastly issued a software deployment that contained a bug which, if triggered in specific circumstances, could take down vast swaths of its network. The bug lay dormant until June 8, when one Fastly customer inadvertently triggered it during a “valid configuration change,” causing 85% of the company’s network to return errors.
“We detected the disruption within 1 minute, then identified and isolated the cause, and disabled the configuration,” said Fastly’s Senior Vice President of Engineering and Infrastructure Nick Rockwell in the blog post. “Within 49 minutes, 95% of our network was operating as normal. This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
What happened during the Fastly outage?
At around 2:58 a.m. PT, Fastly’s status update page noted an error, saying “we’re currently investigating potential impact to performance with our CDN [content delivery network] services.” Shortly thereafter, reports emerged on Twitter of major news publications including the BBC, CNN and The New York Times being offline. Twitter itself was still running, although the server that hosted its emojis went down, leading to some odd-looking tweets.
Rather than isolated incidents affecting individual sites, it turned out this was a massive outage that had brought much of the internet to its knees. Across the world, people were receiving Error: 503 messages as they tried to access sites, including some vital services, such as the UK government’s gov.uk web properties.
Almost an hour later, at 3:44 a.m. PT (6:44 a.m. ET, on the cusp of the US East Coast workday, and coming up on noon in the UK), Fastly updated its status page again to say the issue had been identified and a fix was being implemented. At 4:10 a.m. PT, the company tweeted: “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online.”
The same message was sent to CNET as a comment by Fastly spokespeople.
What is Fastly?
Fastly is a cloud computing service provider, headquartered in San Francisco, that’s been around since 2011. In 2017, it launched an edge cloud platform designed to bring websites closer to the people who use them. Effectively this means that if you’re accessing a website hosted in another country, it will store some of that website closer to you so that there’s no need to waste bandwidth by going to fetch all of that website’s content from far away every time you need it.
This makes for faster website load times, and the platform also optimizes images, videos and other high-payload content so they show up quickly and smoothly when you land on a web page. Among the boasts on the company’s website, it says it made loading pages on BuzzFeed 50% faster and allowed The New York Times to simultaneously handle 2 million readers on election night. Edge computing also performs vital cybersecurity functions, protecting sites from DDoS attacks and bots, as well as providing a web application firewall.
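The caching idea described above can be illustrated with a small Python sketch. This is a toy model, not Fastly’s actual implementation: the `EdgeCache` class, the `origin` function and the 60-second freshness window are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy model of one edge location caching content from a distant origin."""

    def __init__(self, fetch_from_origin: Callable[[str], str], ttl_seconds: float = 60.0):
        self._fetch = fetch_from_origin   # hypothetical function that reaches the origin
        self._ttl = ttl_seconds           # assumed freshness window for a cached copy
        self._store: Dict[str, Tuple[float, str]] = {}
        self.origin_fetches = 0           # counts trips to the far-away origin

    def get(self, path: str) -> str:
        now = time.monotonic()
        cached = self._store.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]              # cache hit: served from the nearby copy
        body = self._fetch(path)          # cache miss: fetch from the origin
        self.origin_fetches += 1
        self._store[path] = (now, body)
        return body


def origin(path: str) -> str:
    # Stand-in for a web server hosted in another country.
    return f"<html>content for {path}</html>"


edge = EdgeCache(origin, ttl_seconds=60.0)
first = edge.get("/news")    # goes to the origin
second = edge.get("/news")   # answered from the edge copy
print(edge.origin_fetches)
```

Two requests for the same page result in only one trip to the origin, which is where the bandwidth and load-time savings come from.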
Because Fastly sits between the back-end web servers and the front-facing internet as we see it, any errors on its part can make whole websites unavailable. The localized nature of the edge cloud platform also means that errors don’t affect all regions in the same way at the same time (although people all across the world reported experiencing problems on Tuesday).
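To make that concrete, here is a minimal Python sketch, assuming a simplified model in which each region is served by its own edge location. The `EdgePop` class, the region names and the `origin` function are hypothetical; the point is only that a faulty edge returns an error even though the origin server behind it is perfectly healthy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


def origin(path: str) -> str:
    # The publisher's own server, which stays healthy throughout.
    return f"page for {path}"


@dataclass
class EdgePop:
    """Hypothetical edge location sitting between visitors and the origin."""
    region: str
    healthy: bool = True

    def handle(self, path: str, fetch: Callable[[str], str]) -> Tuple[int, str]:
        if not self.healthy:
            # The edge fails before the request ever reaches the origin.
            return 503, "Service Unavailable"
        return 200, fetch(path)


pops: Dict[str, EdgePop] = {
    "europe": EdgePop("europe", healthy=False),  # assumed faulty for the example
    "us-west": EdgePop("us-west"),
}
statuses = {region: pop.handle("/", origin)[0] for region, pop in pops.items()}
print(statuses)
```

Visitors routed through the faulty location see a 503 while those in the other region are served normally.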
What is a 503 error?
When you see a website displaying a 503 error rather than showing you the page you were expecting, it means the server hosting the website isn’t ready to handle the request. It also indicates that the problem is temporary and that it will likely be resolved soon.
Commonly, it is caused when a server is down for maintenance, or when a website has been overloaded — for example, if too many people are trying to access it at once.
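Because a 503 signals a temporary condition, software that talks to a website will often wait and try again, and servers can suggest how long to wait through the standard Retry-After HTTP header. The Python sketch below shows one way this could look. The `fetch_with_retry` helper, the simulated server and the delay values are assumptions for illustration, and no real network call is made.

```python
import time
from typing import Callable, List, Optional, Tuple

# (status code, Retry-After in seconds if the server supplied one)
Response = Tuple[int, Optional[float]]


def fetch_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on 503, honoring Retry-After when present, else backing off."""
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status != 503:
            return status                  # success or a non-temporary error
        if attempt == max_attempts - 1:
            break                          # out of attempts, give up
        if retry_after is not None:
            delay = retry_after            # server told us how long to wait
        else:
            delay = base_delay * (2 ** attempt)  # assumed exponential backoff
        sleep(delay)
    return 503


# Simulated server: unavailable twice, then back to normal.
replies: List[Response] = [(503, None), (503, 2.0), (200, None)]
waits: List[float] = []
result = fetch_with_retry(lambda: replies.pop(0), sleep=waits.append)
print(result, waits)
```

Passing `waits.append` in place of `time.sleep` records the delays instead of actually pausing, which keeps the example quick to run.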
Why did Fastly fail on Tuesday and will it happen again?
We now know that Tuesday’s internet outage was caused by a service configuration change by one of Fastly’s customers that triggered a bug hidden in Fastly’s network. The bug had been lying dormant since Fastly deployed a software update in mid-May.
Many people speculated on Twitter that the outage was caused by a cyberattack, but we now know for sure that this wasn’t the case. There are many technical reasons a CDN can fail, and cyberattacks are just one of them.
To make sure the problem doesn’t repeat itself, Fastly has said it’s taking a number of actions. It is deploying a bug fix across its network, while also conducting a complete postmortem of the processes and practices it followed during the incident. It will also work out why it didn’t catch the bug during its own testing processes and evaluate ways to improve remediation time.
“Even though there were specific conditions that triggered this outage, we should have anticipated it,” said Rockwell. “We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority.”
Why were so many websites affected by the Fastly outage?
Fastly is widely used by web publishers, and exactly how widely became apparent on Tuesday when vast swaths of the internet became unavailable.
The reason it’s so popular is that the services it provides are considered essential by many online web properties, but not many companies provide these services. As such, a vast number of websites are reliant on a very small group of companies to keep running. Similar problems were seen during outages at other providers last July and last November.
As Corinne Cath-Speth, a Ph.D. candidate at the Oxford Internet Institute and the Alan Turing Institute, pointed out on Twitter, this means “a technical hiccup in a single company can have huge ramifications.”
“This in turn — raises major questions about the dangers of (power) consolidation in the cloud market and the unquestioned influence these often invisible actors have over access to information,” she added.