Monday, February 29, 2016

Chef Supermarket Outage Post Mortem
// Chef Blog

On Thursday, February 25, we had an outage for downloading cookbooks from Supermarket via Berkshelf. The next day, February 26, we held a public post mortem.

If you'd like to see the video of the post mortem, you can view it on YouTube.

Description

A deploy to production Supermarket that was intended to allow http access for downloads switched the download links served to Berkshelf and ChefDK to http, but did not fix the broken http downloads. As a result, cookbook downloads through Berkshelf and ChefDK failed.

Timeline

Time to Detect – 47 minutes
Time to Resolution – 103 minutes

All times are in UTC on February 25, 2016

  • 20:55: Deploy of supermarket 2.4.0, which causes the issue, is performed by Robb Kidd (robb); at this time https is still functional
  • 21:31: First user report of issue comes in via #chef on Freenode (irc)
  • 21:35: Issue is reported in Hangops Slack
  • 21:37: Noah Kantrowitz (coderanger) notifies Paul Mooring (pwm) via Chef Success Slack
  • 21:42: Nell Shamrell-Harrington (nell), pwm and robb begin investigating the issue in Chef's internal Slack
  • 21:46: Incorrect protocol in universe endpoint is discovered by robb
  • 21:53: Config option to disable ssl is pointed out by robb
  • 21:55: Config option to set ssl to true is set by nell
  • 22:03: All nodes have ssl set to true
  • 22:03: Due to a self-signed cert, all download URLs are unreachable
  • 22:04: All instances get removed from service by ELB (due to cert issues)
  • 22:05: Eric Alwais (eric) updates Chef status page (status.chef.io)
  • 22:10: pwm, robb and nell meet to discuss problem
  • 22:22: robb begins reverting and pinning package version to 2.3.3
  • 22:28: nell directs robb to revert config changes
  • 22:34: Changes complete, nell verifies problem is clear
  • 22:37: Josh Glass posts all clear to status page
  • 22:37: pwm calls incident resolved
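
The reported durations can be cross-checked against the timeline. The sketch below is only an illustration: it assumes that detection is measured from the 20:55 deploy to the 21:42 start of the investigation, and resolution from the deploy to the 22:37 all clear. The post does not define these intervals, so both choices are assumptions.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline; which events bound each
# interval is an assumption, not something the post states.
DEPLOY = "20:55"
INVESTIGATION_STARTED = "21:42"
RESOLVED = "22:37"

time_to_detect = minutes_between(DEPLOY, INVESTIGATION_STARTED)
time_to_resolve = minutes_between(DEPLOY, RESOLVED)

print(f"Time to detect:  {time_to_detect} minutes")
print(f"Time to resolve: {time_to_resolve} minutes")
```

With whole-minute timestamps the first interval matches the reported 47 minutes and the second comes out at 102, one minute short of the reported 103. The difference may come from rounding of the underlying timestamps.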

Impact

Users were unable to download cookbooks using Berkshelf or ChefDK for approximately 2 hours.
  • Direct downloads (via web interface, curl, etc.) were functional using https
  • Automated systems (berkshelf, chefdk, etc.) were returning http links based on universe endpoint
  • After the ssl setting was enabled, a total outage occurred (30 minutes)

Contributing Factor(s)

  • Insufficient monitoring on supermarket (api including /universe and web app)
  • Lack of comprehensive testing on deploys
  • Overly complicated code in omnibus package
  • Lack of production system understanding

Stabilization Step

Changes made to the initial deploy were reverted:
  • Production supermarket was dropped back to version 2.3.3
  • Supermarket version 2.3.3 was locked on frontends
  • Config changes were reverted to the pre-deploy state and supermarket-ctl reconfigure was run
  • Unsecured (http over port 80) access to cookbook downloads was turned back off (backed out code change)

Corrective Actions

Long Term
  • Document various ssl deployments for supermarket
  • Get Supermarket deployed through automatic provisioning with tests
Immediate
  • Package a 2.4.1 without code changes for http downloads – robb
  • Add an attribute for supermarket version to deploy cookbook – nell
  • Monitor /universe including protocol version returned – nell and pwm
  • Update deployment checklist for explicit test steps – robb
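
The /universe monitoring item could look roughly like the check below. This is a minimal sketch, not the monitoring Chef put in place. It assumes the universe document is JSON that maps cookbook name to version to a record with a download_url field. The function names, the sample data and the endpoint URL in fetch_universe are illustrative assumptions.

```python
import json
import urllib.request
from urllib.parse import urlparse


def fetch_universe(url="https://supermarket.chef.io/universe"):
    """Fetch and parse the universe document (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def insecure_download_urls(universe, required_scheme="https"):
    """Return (cookbook, version, url) for every download_url whose
    scheme is not the required one. An empty list means the check passes."""
    bad = []
    for cookbook, versions in universe.items():
        for version, info in versions.items():
            url = info.get("download_url", "")
            if urlparse(url).scheme != required_scheme:
                bad.append((cookbook, version, url))
    return bad


# Hypothetical sample in the assumed universe shape: one good entry
# and one entry that reproduces the http links seen during the outage.
sample = {
    "apt": {
        "2.9.2": {
            "download_url": "http://supermarket.example/api/v1/cookbooks/apt/versions/2.9.2/download",
        },
    },
    "nginx": {
        "2.7.6": {
            "download_url": "https://supermarket.example/api/v1/cookbooks/nginx/versions/2.7.6/download",
        },
    },
}

for cookbook, version, url in insecure_download_urls(sample):
    print(f"ALERT: {cookbook} {version} served over the wrong protocol: {url}")
```

A monitor or a post-deploy checklist step could call insecure_download_urls(fetch_universe()) and alert on any non-empty result, which would have flagged the protocol change shortly after the 2.4.0 deploy rather than through user reports.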
