Building Resilience after the June 12 Cloud Failures

Imagine having your morning coffee, opening Gmail, and... nothing. No new emails, no calendar reminders, and your smart home devices are eerily silent. That was the jolt millions felt on June 12, 2025, when a ripple in Google Cloud’s core systems briefly paused much of the internet.
What Happened Exactly
At around 10:49 AM EDT, an automated quota update in Google Cloud’s API management system went sideways. That single, seemingly routine operation caused API requests everywhere to start failing with 503 errors. Within minutes, dozens of Google Cloud products, from Identity and Access Management to BigQuery, were rejecting requests worldwide. Engineers pinpointed the faulty quota policy, bypassed the bad check, and began the long process of regional recovery.
Note: The us-central1 region took noticeably longer to recover than most other regions.
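On the client side, the outage mostly looked like a wall of 503s. Here is a minimal sketch, in TypeScript with the standard fetch API, of the retry-with-backoff-and-jitter pattern that helps requests ride out a short spike of 503s without hammering an already struggling API. The URL, attempt count, and delays are all illustrative, not taken from any incident report.

```typescript
// Retry a fetch on 503 with exponential backoff and full jitter.
// The endpoint, attempt count, and delays are illustrative only.
async function fetchWithBackoff(
  url: string,
  maxAttempts = 5,
  baseDelayMs = 200,
): Promise<Response> {
  let lastResponse: Response | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url);
    if (response.status !== 503) {
      return response; // success, or an error that retrying won't fix
    }
    lastResponse = response;
    // Exponential backoff with full jitter, capped at 10 seconds.
    const delayMs = Math.random() * Math.min(10_000, baseDelayMs * 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return lastResponse!; // still 503 after every attempt
}

// Usage (hypothetical endpoint):
// fetchWithBackoff("https://example.googleapis.com/v1/things")
//   .then((res) => console.log(res.status));
```

Retries only buy you minutes, though. For an outage measured in hours, the real answer is the kind of redundancy discussed further down.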
Which Services Went Dark
The fallout was massive and surprisingly diverse:
- Google’s Stuff: Gmail, Calendar, Drive, Meet, Nest cameras, YouTube, even emerging services like Vertex AI Online Prediction all saw downtime.
- Third-Party Apps: Spotify alone racked up more than 46,000 outage reports at the peak; Discord, Snapchat, DoorDash, Etsy, Shopify, Twitch... the list kept growing as every service that leaned on Google Cloud hit snags.
- Cloudflare’s Chain Reaction: Cloudflare’s Workers KV store, the backbone for its authentication and configuration data, failed over 90% of requests. That knocked out Access logins, WARP client sign-ups, Stream uploads, and more, though core services like DNS and Magic Transit stayed up.
What Is Cloudflare Workers KV?
Think of Workers KV as Cloudflare’s go-to spot for storing the small but vital pieces of data that make its services tick. It’s not a flashy database you log into every day. Instead, it quietly holds the “source of truth” for configuration settings, user identities, and other bits of state that need to be fast and globally available.
Here’s how it keeps the lights on across Cloudflare’s suite:
- Access leans on Workers KV to fetch up-to-date app policies and user identity info whenever someone tries to sign in.
- Gateway checks Workers KV for the latest device posture and security rules before allowing traffic through.
- WARP uses it to register devices and check credentials behind the scenes, so your VPN connection just works.
- Workers AI pulls configuration and routing details from Workers KV to steer your AI workloads.
- Zaraz, Pages, and Workers Assets all tap into Workers KV to load website optimizations or serve static files quickly.
When Workers KV went down, it caused a chain reaction affecting everything built on top of this common data hub. That’s why its health is mission-critical for Cloudflare and for anyone using their platform.
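To make that dependency concrete, here is a minimal Cloudflare Worker sketch (TypeScript, assuming the standard Workers KV bindings from @cloudflare/workers-types) that reads its configuration from a KV namespace. The CONFIG binding and the "feature-flags" key are invented for illustration; the point is that when the KV read fails, everything downstream fails with it, which is exactly the shape of the June 12 cascade.

```typescript
// A minimal Worker that treats Workers KV as its configuration source.
// The CONFIG binding and "feature-flags" key are hypothetical.
export interface Env {
  CONFIG: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    let flags: Record<string, boolean> | null = null;
    try {
      // Read JSON config from Workers KV; null means the key is missing.
      flags = await env.CONFIG.get<Record<string, boolean>>("feature-flags", {
        type: "json",
      });
    } catch {
      // KV itself is unreachable: the outage propagates straight into this Worker.
    }

    if (flags === null) {
      // No config means this service degrades right along with KV.
      return new Response("configuration unavailable", { status: 503 });
    }

    return Response.json({ flags });
  },
};
```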
The Incident Timeline
Here’s how the hours unfolded (all times EDT):
- 10:49 AM – Users begin reporting failures in Google Cloud and Cloudflare WARP.
- 11:05 AM – Cloudflare’s Access team sees SLOs tank and declares a P1 incident.
- 11:30 AM – Downdetector logs over 13,000 Google Cloud issue reports.
- 12:30 PM – Google confirms all regions except us-central1 have recovered.
- 1:49 PM – Google declares the primary incident over for most services.
- 3:28 PM – Cloudflare’s dependent services fully bounce back after a 2 hr 28 min outage.
- 6:18 PM – Google posts a mini incident report and marks full recovery.
Official Response and What’s Next
Google apologized “deeply” to users and customers, attributed the failure to the invalid quota update, and pledged to bolster testing, error handling, and protections around metadata before global rollouts. Cloudflare, while noting that Google was the trigger, owned its own architecture decisions: it is now shoring up redundancy in Workers KV, migrating critical namespaces off a single provider, and adding fail-safe tooling.
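Cloudflare hasn’t published the exact design, but the shape of the fix is familiar: stop treating any single backing store as the only source of truth. Here is a hedged sketch of that idea; the ConfigStore interface, the fallback order, and the stale-read cache are all assumptions, standing in for a second provider, a regional replica, or a local copy of the last good value.

```typescript
// Hypothetical interface for anything that can serve configuration reads.
interface ConfigStore {
  name: string;
  get(key: string): Promise<string | null>;
}

// Read from the primary store, fall back to an independent replica,
// and finally serve the last value we successfully saw (possibly stale).
class ResilientConfig {
  private lastKnown = new Map<string, string>();

  constructor(private primary: ConfigStore, private fallback: ConfigStore) {}

  async get(key: string): Promise<string | null> {
    for (const store of [this.primary, this.fallback]) {
      try {
        const value = await store.get(key);
        if (value !== null) {
          this.lastKnown.set(key, value); // refresh the stale-read cache
          return value;
        }
      } catch {
        console.warn(`${store.name} unavailable, trying next option`);
      }
    }
    // Both stores failed: serve the last known value rather than nothing.
    return this.lastKnown.get(key) ?? null;
  }
}
```

Serving a slightly stale value is often a far better failure mode than rejecting every request, though that trade-off has to be made explicitly for each service.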
How This Incident Impacts Internet Architecture
There are many reasons why this incident matters for internet architecture, but here are the top three as I see them:
Too Much Trust in One Provider
When a single quota update in Google’s API layer can stall a quarter of global internet traffic, we see the risk in putting all our eggs in one cloud basket.
Hidden Single Points of Failure
Centralized identity or storage services save work, but they can become choke points that cascade failures across unrelated apps.
The Domino Effect of Dependencies
Even giants like Cloudflare felt the shockwave because their systems leaned on Google Cloud under the hood.
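One classic way to keep a failing dependency from dragging down everything above it is a circuit breaker: after enough consecutive failures, stop calling the dependency for a cooling-off period and fail fast (or serve a fallback) instead. A minimal sketch, with thresholds chosen purely for illustration:

```typescript
// A minimal circuit breaker: after `threshold` consecutive failures, stop
// calling the dependency for `cooldownMs` and fail fast instead.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private call: () => Promise<T>,
    private threshold = 5,
    private cooldownMs = 30_000,
  ) {}

  async invoke(): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("circuit open: skipping call to failing dependency");
    }
    try {
      const result = await this.call();
      this.failures = 0; // a healthy response resets the breaker
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
      }
      throw err;
    }
  }
}
```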
This outage is a wake-up call: our digital world’s backbone is powerful and, at times, fragile.
Call to Action
As SREs, architects, and tech leaders, we must “design for disaster” by diversifying dependencies, building multi-cloud or hybrid fail-overs, and stress-testing every link in our chains.
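Stress-testing every link can start embarrassingly small. Here is a hypothetical fault-injection wrapper that makes a configurable fraction of calls to a dependency fail, so you can watch how your service degrades before a real provider does it for you; the rates and error text are illustrative.

```typescript
// Wrap any async dependency call and make it fail a configurable fraction
// of the time. Failure rate and error text are illustrative only.
function withChaos<T>(
  call: () => Promise<T>,
  failureRate = 0.3,
): () => Promise<T> {
  return async () => {
    if (Math.random() < failureRate) {
      // Simulate the dependency returning a server error.
      throw new Error("injected failure: dependency returned 503");
    }
    return call();
  };
}

// Example: exercise a (hypothetical) lookup against a flaky dependency.
const flakyLookup = withChaos(async () => "value-from-dependency", 0.5);

flakyLookup()
  .then((value) => console.log("got:", value))
  .catch((err) => console.log("handled:", err.message));
```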
Share your own outage survival hacks below. Let’s have a conversation about turning this global hiccup into lasting resilience.
Because when the next outage hits (and it will), we’ll want more than hope to keep our services alive.