Monday, August 17, 2015

Supermarket & Berkshelf Outage – Incident Report
https://www.chef.io/blog/2015/08/17/supermarket-berkshelf-outage-incident-report/


On Thursday, August 13, 2015 the Supermarket (https://supermarket.chef.io) had a partial outage. This outage prevented Berkshelf from downloading cookbook dependencies, prevented users from logging into the Supermarket, and caused a number of AWS OpsWorks lifecycle events to fail.

The outage began at 5:22AM UTC and was resolved at 11:49AM UTC.

These systems are critical to the operations of many in our community. I apologize for this incident and would like to share the things we are doing to prevent similar outages in the future.

The time to detect this issue was four hours and eight minutes. Here are the specific things we are doing to decrease that time:

  • Add an external monitor that validates Berkshelf functions properly.
  • Add a functional test to the Supermarket deployment process that validates Berkshelf is functioning properly.

The time to resolve this issue was approximately six hours and twenty-seven minutes. Here are the specific things we are doing to decrease that time:

  • Work with internal Chef teams to ensure incident response is part of onboarding and ongoing education.
  • Plan and coordinate incident response drills.

Additional details of this outage can be found below. On Friday, August 14, a post mortem meeting was held. This meeting was recorded and is now available on YouTube.

At approximately 5:22AM UTC on Thursday, August 13, 2015, a new production cluster of the Supermarket was deployed. This new cluster utilized a new deployment methodology based on an omnibus build of the Supermarket application. 

Berkshelf, a cookbook dependency resolution tool, uses the Supermarket to resolve cookbook dependencies. It does so by making an HTTP GET request to the /universe API endpoint on the Supermarket. This API returns a JSON document that includes all cookbooks on the Supermarket, their dependencies, and the location from which the cookbooks are downloadable. 
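To illustrate what Berkshelf consumes, here is a minimal sketch of the shape of a /universe response and how a client would look up a cookbook's download location. The field names follow the description above; the cookbook entry itself is made up for illustration and is not real Supermarket data.

```python
import json

# Hypothetical excerpt of a /universe response. The field names
# (location_path, download_url, dependencies) follow the API description;
# the nginx entry shown here is illustrative only.
universe_json = """
{
  "nginx": {
    "2.7.6": {
      "location_type": "opscode",
      "location_path": "https://supermarket.chef.io/api/v1",
      "download_url": "https://supermarket.chef.io/api/v1/cookbooks/nginx/versions/2.7.6/download",
      "dependencies": {"ohai": ">= 1.1.4"}
    }
  }
}
"""

universe = json.loads(universe_json)

def download_url(name, version):
    """Look up where a given cookbook version can be downloaded."""
    return universe[name][version]["download_url"]

print(download_url("nginx", "2.7.6"))
```

A dependency resolver walks the `dependencies` map of each entry to build the full graph, then fetches each resolved version from its `download_url` — which is why a bad `download_url` value breaks every install.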

The newly deployed production servers were missing a key configuration setting that would properly set the location_path and download_url values in the /universe response. As a result, these values used the internal host name of the server (e.g., app-supermarket-prod-i-f2qtmmfq.opscode.us:443), which is inaccessible from outside of the same network.
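An external check of the kind proposed in the corrective actions above could catch this misconfiguration by scanning the /universe response for internal-only hostnames. This is a minimal sketch under stated assumptions, not Chef's actual monitoring code; the internal-domain suffix is inferred from the example hostname above.

```python
import json
from urllib.parse import urlparse

# Assumed internal-only domain suffix, taken from the example hostname above.
INTERNAL_SUFFIX = ".opscode.us"

def internal_hosts(universe):
    """Return (cookbook, version, host) triples whose download_url
    points at a host unreachable from outside the network."""
    bad = []
    for name, versions in universe.items():
        for version, meta in versions.items():
            host = urlparse(meta["download_url"]).hostname or ""
            if host.endswith(INTERNAL_SUFFIX):
                bad.append((name, version, host))
    return bad

# Sample data mirroring the broken response described in this report.
sample = {
    "nginx": {
        "2.7.6": {
            "download_url": "https://app-supermarket-prod-i-f2qtmmfq.opscode.us:443/download"
        }
    }
}
print(internal_hosts(sample))
```

In practice a monitor would fetch the live /universe document over HTTP on a schedule and alert whenever this function returns a non-empty list; a stricter variant could also attempt a HEAD request against each download_url to confirm it is reachable from outside the network.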

Berkshelf attempted to connect to this internal-only URL and failed. This caused issues for anyone running a berks install or berks update command. These commands would fail and return an error code to the user.

Amazon's AWS OpsWorks has an optional feature where customers can enable a Berkshelf run before OpsWorks executes the actual Chef run. If enabled, the Berkshelf run would happen before each Chef run, e.g., setup or application deployment. Since Berkshelf was unable to complete successfully, these lifecycle events on OpsWorks would fail.

This issue also impacted anyone trying to log in to the Supermarket. Login attempts would fail because the authentication and authorization system used with the Supermarket was also trying to utilize the internal-only URL.

The issue was first reported at 6:10AM UTC in the #chef channel on Freenode IRC.

At 9:30AM UTC, a few Chef employees began looking into the issue but were unsure of the proper escalation procedures.

At 10:47 UTC, we declared this outage an official "INCIDENT" and began our incident management procedures.

The @opscode_status Twitter account sent the first notice about this incident at 10:52 UTC. A follow-up message was posted at 11:23 UTC.

At 11:45 UTC, the previous cluster was put back in service and the new cluster was removed from service. This did not fully resolve the issue, however, because the response to a GET request to the /universe endpoint is cached. The /universe cache was cleared and the issue was resolved at 11:46 UTC.

At 11:49 UTC, @opscode_status posted that the issue was RESOLVED.

As a result of this outage, we will be taking the corrective actions listed above. We believe that these actions will improve the time it takes us to detect and resolve similar incidents in the future. Thank you for your patience as we worked through this issue.

Please be sure that you are following @opscode_status on Twitter, and open a request with our Support team for any issues you encounter.