Friday, March 1, 2024

4 Instructive Postmortems on Data Downtime and Loss

More than a decade ago, the concept of the 'blameless' postmortem changed how tech companies recognize failures at scale.

John Allspaw, who coined the term during his tenure at Etsy, argued postmortems were all about controlling our natural reaction to an incident, which is to point fingers: "One option is to assume the single cause is incompetence and scream at engineers to make them 'pay attention!' or 'be more careful!' Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event."

What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?

GitLab: 300GB of user data gone in seconds

What happened: Back in 2017, GitLab experienced a painful 18-hour outage. That story, and GitLab's subsequent honesty and transparency, has significantly impacted how organizations handle data security today.

The incident began when GitLab's secondary database, which replicated the primary and acted as a failover, could no longer sync changes fast enough due to increased load. Assuming a temporary spam attack created said load, GitLab engineers decided to manually re-sync the secondary database by deleting its contents and running the associated script.

When the re-sync process failed, another engineer tried the process again, only to realize they had run it against the primary.

What was lost: Even though the engineer stopped their command in two seconds, it had already deleted 300GB of recent user data, affecting, by GitLab's estimates, 5,000 projects, 5,000 comments, and 700 new user accounts.

How they recovered: Because engineers had just deleted the secondary database's contents, they couldn't use it for its intended purpose as a failover. Even worse, their daily database backups, which were supposed to be uploaded to S3 every 24 hours, had failed. Due to an email misconfiguration, no one received the notification emails informing them as much.

In any other circumstance, their only choice would have been to restore from their previous snapshot, which was nearly 24 hours old. Enter a very fortunate happenstance: Just 6 hours before the data loss, an engineer had taken a snapshot of the primary database for testing, inadvertently saving the company from 18 additional hours of lost data.

After an excruciating 18 hours of copying data across slow network disks, GitLab engineers fully restored service.

What we learned

  1. Analyze your root causes with the "Five whys." GitLab engineers did an admirable job in their postmortem explaining the incident's root cause. It wasn't that an engineer accidentally deleted production data, but rather that an automated system mistakenly reported a GitLab employee for spam—the subsequent removal caused the increased load and primary<->secondary desync.

The deeper you diagnose what went wrong, the better you can build data security and business continuity systems that address the long chain of unfortunate events that might cause failure again.

  2. Share your roadmap of improvements. GitLab has continuously operated with extreme transparency, which applies to this outage and data loss. In the aftermath, engineers created dozens of public issues discussing their plans, like testing disaster recovery scenarios for all data not in their database. Making those fixes public gave their customers precise assurances and shared learnings with other tech companies and open-source startups.
  3. Backups need ownership. Before this incident, no single GitLab engineer was responsible for validating the backup system or testing the restoration process, which meant no one did. GitLab quickly made one team member the owner, with the authority to "stop the line" if data was at risk.
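
The ownership lesson lends itself to an automated check that a named person is responsible for. The sketch below is a minimal, hypothetical illustration in Python, not GitLab's tooling: `BackupRecord`, `check_backups`, and both thresholds are assumed names and values. It reports a problem when the newest backup is too old or suspiciously small, so a silently failed upload surfaces even if the notification emails never arrive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical metadata for one uploaded backup object."""
    name: str
    finished_at: datetime
    size_bytes: int


def check_backups(records, now, max_age=timedelta(hours=24), min_size_bytes=1_000_000):
    """Return a list of human-readable problems; an empty list means healthy."""
    if not records:
        return ["no backups found at all"]

    # Only the most recent backup matters for freshness.
    newest = max(records, key=lambda r: r.finished_at)
    problems = []
    age = now - newest.finished_at
    if age > max_age:
        problems.append(f"newest backup {newest.name} is {age} old (limit {max_age})")
    if newest.size_bytes < min_size_bytes:
        problems.append(f"newest backup {newest.name} is only {newest.size_bytes} bytes")
    return problems


if __name__ == "__main__":
    # Illustrative data only: the newest backup finished 30 hours ago.
    now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
    records = [
        BackupRecord("db-a.dump", now - timedelta(hours=30), 5_000_000_000),
        BackupRecord("db-b.dump", now - timedelta(hours=54), 4_900_000_000),
    ]
    for problem in check_backups(records, now):
        print("BACKUP ALERT:", problem)
```

A check like this only proves that a recent file exists. It does not replace a scheduled restore test, which is the only way to know the backup is usable.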

Read the rest: Postmortem of database outage of January 31.

Tarsnap: Deciding between safe data vs. availability

What happened: One morning in the summer of 2023, this one-person backup service went completely offline.

Tarsnap is run by Colin Percival, who's been working on FreeBSD for over 20 years and is largely responsible for bringing that OS to Amazon's EC2 cloud computing service. In other words, few people better understood how FreeBSD, EC2, and Amazon S3, which stored Tarsnap's customer data, could work together… or fail.

Colin's monitoring service notified him the central Tarsnap EC2 server had gone offline. When he checked on the instance's health, he immediately found catastrophic filesystem damage—he knew right away he'd have to rebuild the service from scratch.

What was lost: No user backups, thanks to two smart decisions on Colin's part.

First, Colin had built Tarsnap on a log-structured filesystem. While he cached logs on the EC2 instance, he stored all data in S3 object storage, which has its own data resilience and recovery strategies. He knew Tarsnap user backups were safe—the challenge was making them easily accessible again.

Second, when Colin built the system, he'd written automation scripts but had not configured them to run unattended. Instead of letting the infrastructure rebuild and restart services automatically, he wanted to double-check the state himself before letting scripts take over. He wrote, "'Preventing data loss if something breaks' is far more important than 'maximize service availability.'"

How they recovered: Colin fired up a new EC2 instance to read the logs stored in S3, which took about 12 hours. After fixing a few bugs in his data restoration script, he could "replay" each log entry in the correct order, which took another 12 hours. With logs and S3 block data once again properly associated, Tarsnap was up and running again.
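
To make the replay step concrete, here is a minimal sketch in Python of rebuilding an index from log entries. It illustrates the general log-structured idea only and is not Tarsnap's code: `LogEntry`, its fields, and `replay` are hypothetical names, and the real log format is not described in the postmortem summary above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record: a sequence number plus one operation."""
    seq: int
    op: str                          # "put" or "delete"
    key: str
    block_id: Optional[str] = None   # where the data block lives, for "put"


def replay(entries):
    """Rebuild a key -> block_id index by applying entries in sequence order."""
    index = {}
    last_seq = None
    for entry in sorted(entries, key=lambda e: e.seq):
        # Two entries with the same sequence number make the order ambiguous.
        if last_seq is not None and entry.seq == last_seq:
            raise ValueError(f"duplicate sequence number {entry.seq}")
        last_seq = entry.seq
        if entry.op == "put":
            index[entry.key] = entry.block_id
        elif entry.op == "delete":
            index.pop(entry.key, None)
        else:
            raise ValueError(f"unknown operation {entry.op!r} at seq {entry.seq}")
    return index


if __name__ == "__main__":
    # Entries may be fetched out of order; replay sorts them first.
    fetched = [
        LogEntry(3, "delete", "a"),
        LogEntry(1, "put", "a", "block-17"),
        LogEntry(2, "put", "b", "block-18"),
    ]
    print(replay(fetched))
```

Order matters here: applying the delete before the put would leave a stale entry for `a` in the rebuilt index, which is why the entries are sorted before anything is applied.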

What we learned

  1. Regularly test your disaster recovery playbook. In the public discourse around the outage and postmortem, Tarsnap users expressed their surprise that Colin had never tested his recovery scripts. A test run would have revealed the multiple bugs that significantly delayed his recovery.
  2. Update your processes and configurations to match changing technology. Colin admitted to never updating his recovery scripts based on new capabilities from the services Tarsnap relied on, like S3 and EBS. He could have read the S3 log data using more than 250 simultaneous connections or provisioned an EBS volume with higher throughput to shorten the timeline to full recovery.
  3. Layer in human checks to gather details about your state before letting automation do the grunt work. There's no saying exactly what would have happened had Colin not included some "seatbelts" in his recovery process, but they helped prevent a mistake like the one GitLab's engineers made.
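
One way to picture such a "seatbelt" is a gate that shows the operator the current state and refuses to proceed unless they retype the exact target name. This is a hypothetical sketch in Python, not Colin's scripts: `run_guarded`, its parameters, and the host name in the demo are all assumptions made for illustration.

```python
def run_guarded(target, summary, action, ask=input):
    """Run `action` only after an operator retypes the exact target name.

    `ask` is injectable so the gate can be exercised without a terminal.
    """
    print(f"About to act on: {target}")
    print(f"Current state:   {summary}")
    typed = ask("Type the target name to continue: ").strip()
    if typed != target:
        print("Confirmation did not match; nothing was changed.")
        return False
    action()
    return True


if __name__ == "__main__":
    def rebuild():
        print("rebuilding index from logs...")

    # Simulated operator input for the demo; keep the default ask=input in real use.
    run_guarded(
        "db-secondary-01",
        "replication stopped",
        rebuild,
        ask=lambda prompt: "db-secondary-01",
    )
```

Retyping the name is a deliberate design choice over a yes/no prompt: it forces the operator to read which machine the command will touch, the exact detail that went wrong in the GitLab incident.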

Read the rest: 2023-07-02 -- 2023-07-03 Tarsnap outage post-mortem

Roblox: 73 hours of 'contention'

What happened: Around Halloween 2021, a game played by millions every day on an infrastructure of 18,000 servers and 170,000 containers experienced a full-blown outage.

The service didn't go down all at once—a few hours after Roblox engineers detected a single cluster with high CPU load, the number of online players had dropped to 50% below normal. This cluster hosted Consul, which operated like middleware between many distributed Roblox services, and when Consul could no longer handle even the diminished player count, it became a single point of failure for the entire online experience.

What was lost: Only system configuration data. Most Roblox services used other storage systems within their on-premises data centers. For those that did use Consul's key-value store, data was either stored after engineers solved the load and contention issues or safely cached elsewhere.

How they recovered: Roblox engineers first attempted to redeploy the Consul cluster on much faster hardware and then very slowly let new requests enter the system, but neither worked.

With assistance from HashiCorp engineers and many long hours, the teams finally narrowed down two root causes:

  • Contention: After discovering how long Consul KV writes were blocked, the teams realized that Consul's new streaming architecture was under heavy load. Incoming data fought over Go channels designed for concurrency, creating a vicious cycle that only tightened the bottleneck.
  • A bug far downstream: Consul uses an open-source database, BoltDB, for storing logs. It was supposed to clean up old log entries regularly but never truly freed the disk space, creating a heavy compute workload for Consul.

After fixing these two bugs, the Roblox team restored service—a stressful 73 hours after that first high CPU alert.

What we learned

  1. Avoid circular telemetry systems. Roblox's telemetry systems, which monitored the Consul cluster, also depended on it. In their postmortem, they admitted they could have acted faster with more accurate data.
  2. Look two, three, or four steps beyond what you've built for root causes. Modern infrastructure is based on a massive supply chain of third-party services and open-source software. Your next outage might not be caused by an engineer's honest mistake but rather by exposing a years-old bug in a dependency, three steps removed from your code, that no one else had just the right environment to trigger.
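
The circular telemetry problem can be checked mechanically if you keep even a rough map of service dependencies. The following Python sketch is an assumption-laden illustration, not Roblox's tooling: the graph, the service names, and the functions `depends_on` and `circular_monitors` are all hypothetical. It flags any monitor that transitively depends on the system it is supposed to watch.

```python
def depends_on(graph, start, target):
    """Return True if `start` transitively depends on `target`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


def circular_monitors(graph, monitors):
    """List (monitor, subject) pairs where the monitor needs the subject to run."""
    return [(m, s) for m, s in monitors if depends_on(graph, m, s)]


if __name__ == "__main__":
    # Hypothetical service graph: each key depends on the listed services.
    graph = {
        "telemetry": ["service-discovery"],
        "service-discovery": ["consul"],
        "external-probe": [],
    }
    monitors = [("telemetry", "consul"), ("external-probe", "consul")]
    print(circular_monitors(graph, monitors))
```

In this made-up graph the dependency is indirect, one hop removed, which is how these loops usually hide. A probe with no dependencies on the monitored system keeps reporting when that system fails.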

Read the rest: Roblox Return to Service 10/28-10/31, 2021

Cloudflare: A long (state-backed) weekend

What happened: A few days before Thanksgiving Day 2023, an attacker used stolen credentials to access Cloudflare's on-premises Atlassian server, which ran Confluence and Jira. Not long after, they used those credentials to create a persistent connection to this piece of Cloudflare's global infrastructure.

The attacker attempted to move laterally through the network but was denied access at every turn. The day after Thanksgiving, Cloudflare engineers permanently removed the attacker and took down the affected Atlassian server.

In their postmortem, Cloudflare states their belief the attacker was backed by a nation-state eager for widespread access to Cloudflare's network. The attacker had opened hundreds of internal documents in Confluence related to their network's architecture and security management practices.

What was lost: No user data. Cloudflare's Zero Trust architecture prevented the attacker from jumping from the Atlassian server to other services or accessing customer data.

Atlassian has been in the news for another reason lately—their Server offering has reached its end-of-life, forcing organizations to migrate to Cloud or Data Center alternatives. During or after that drawn-out process, engineers realize their new platform doesn't come with the same data security and backup capabilities they were used to, forcing them to rethink their data security practices.

How they recovered: After booting the attacker, Cloudflare engineers rotated over 5,000 production credentials, triaged 4,893 systems, and reimaged and rebooted every machine. Because the attacker had attempted to access a new data center in Brazil, Cloudflare replaced all the hardware out of extreme precaution.

What we learned

  1. Zero Trust architectures work. When you build authorization/authentication right, you prevent one compromised system from deleting data or operating as a stepping-stone for lateral movement in the network.
  2. Despite the exposure, documentation is still your friend. Your engineers will always need to know how to reboot, restore, or rebuild your services. Your goal is that even if an attacker learns everything about your infrastructure through your internal documentation, they still shouldn't be able to create or steal the credentials necessary to intrude even deeper.
  3. SaaS security is easier to overlook. This intrusion was only possible because Cloudflare engineers had failed to rotate credentials for SaaS apps with administrative access to their Atlassian products. The root cause? They believed no one still used said credentials, so there was no point in rotating them.
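
The rotation lesson comes down to one rule: "believed unused" is not a reason to skip a credential. Here is a minimal Python sketch of an audit that encodes that rule. It is hypothetical throughout and not Cloudflare's process: `Credential`, `needs_rotation`, the 90-day limit, and the sample dates are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    """Hypothetical inventory row for one token or service account."""
    name: str
    last_rotated: date
    believed_unused: bool = False


def needs_rotation(creds, exposure_date, today, max_age=timedelta(days=90)):
    """Return the names of credentials that should be rotated.

    A credential is flagged if it was last rotated on or before the suspected
    exposure date, or if it is older than max_age. The believed_unused flag is
    deliberately ignored: unused is not the same as revoked.
    """
    flagged = []
    for cred in creds:
        too_old = today - cred.last_rotated > max_age
        predates_exposure = cred.last_rotated <= exposure_date
        if too_old or predates_exposure:
            flagged.append(cred.name)
    return flagged


if __name__ == "__main__":
    # Illustrative inventory and dates only.
    inventory = [
        Credential("svc-a", date(2023, 11, 1)),
        Credential("svc-b", date(2023, 9, 1), believed_unused=True),
        Credential("svc-c", date(2023, 5, 1)),
    ]
    print(needs_rotation(inventory, exposure_date=date(2023, 10, 1), today=date(2023, 11, 20)))
```

If a credential really is unused, the safer outcome of an audit like this is to revoke it outright rather than rotate it.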

Read the rest: Thanksgiving 2023 security incident

What's next for your data security and continuity planning?

These postmortems, detailing exactly what went wrong and elaborating on how engineers are preventing another occurrence, are more than just good role models for how an organization can act with honesty, transparency, and empathy for customers during a crisis.

If you take a single lesson from all these situations, let it be this: someone in your organization, whether an ambitious engineer or an entire team, must own the data security lifecycle. Test and document everything, because only practice makes perfect.

But also recognize that all these incidents occurred on owned cloud or on-premises infrastructure. Engineers had full access to systems and data to diagnose, protect, and restore them. You can't say the same about the many cloud-based SaaS platforms your peers use daily, whether for versioning code and managing projects on GitHub or deploying lucrative email campaigns via Mailchimp. If something happens to those services, you can't just SSH to check logs or rsync your data.

As shadow IT grows exponentially (a 1,525% increase in just seven years), the best continuity strategies will cover not only the infrastructure you own but also the SaaS data your peers depend on. You could wait for a new postmortem to give you solid recommendations about the SaaS data frontier… or take the necessary steps to ensure you're not the one writing it.

This article is a contributed piece from one of our valued partners.



from The Hacker News https://bit.ly/49TWpe7