Sunday, June 5, 2016

Supermarket Berkshelf Incident Post Mortem

// Chef Blog

We at Chef believe it is important to conduct public post mortems whenever possible. We recently conducted one around a Supermarket/Berkshelf incident that occurred on May 16, 2016. I was the incident commander for this incident and would like to share both the video and write up.

Video Recording

Write Up


On May 16 we experienced a brief SSL issue between Supermarket and Berkshelf.


This incident began at 21:56 UTC on Monday, May 16, 2016. It was resolved at 22:49 UTC that same day.

Time to detect: 13 minutes (21:56 UTC - 22:09 UTC on Monday, May 16, 2016)
Time to resolve: 40 minutes (21:56 UTC - 22:36 UTC on Monday, May 16, 2016)

All times UTC

  • 21:56 - Nell Shamrell-Harrington upgraded 2 of the 4 Supermarket Prod nodes from Supermarket 2.5.2 to Supermarket 2.6.0. She also upgraded the cookbook versions of oc-omnibus-supermarket and supermarket-omnibus-cookbook
  • 22:09 - Nell Shamrell-Harrington ran berks install to pull cookbooks from the public Supermarket and received this error:
        OpenSSL::SSL::SSLError: hostname "" does not match the server certificate
    She asked in the internal Chef Slack if someone else would run berks install to confirm what she was seeing
  • 22:24 - Lamont Grandquist confirmed that he was seeing the same error in Travis builds
  • 22:32 - Nell Shamrell-Harrington declared an incident
  • 22:36 - Nell Shamrell-Harrington moved the two upgraded Supermarket prod nodes out of the Supermarket prod ELB and confirmed that she no longer saw the error when running berks install
  • 22:38 - SaintAardvark in the #chef IRC channel reported SSL issues when running berks install; Noah Kantrowitz mentioned that kisoku (#chef IRC handle) was reporting the same thing
  • 22:39 - Noah Kantrowitz DM'd Nell Shamrell-Harrington to let her know that users in the Chef IRC channel were reporting issues with berks and Supermarket
  • 22:43 - Lamont Grandquist reported that Travis runs were working again
  • 22:46 - Nell Shamrell-Harrington entered #chef IRC
  • 22:47 - kisoku reported that his CI jobs were working again in #chef IRC
  • 22:49 - Nell Shamrell-Harrington declared the incident closed
  • 22:50 - SaintAardvark reported that his Jenkins jobs were working again in #chef IRC
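
The intervals in this timeline can be checked directly from the timestamps. The following is a minimal Python sketch of that arithmetic; the function and variable names are illustrative only and are not part of any Chef tooling.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


INCIDENT_START = "21:56"  # first two prod nodes upgraded

# First SSL error observed at 22:09.
time_to_detect = minutes_between(INCIDENT_START, "22:09")  # 13

# Upgraded nodes moved out of the ELB at 22:36.
time_to_stabilize = minutes_between(INCIDENT_START, "22:36")

# Incident declared closed at 22:49.
time_to_close = minutes_between(INCIDENT_START, "22:49")  # 53

print(time_to_detect, time_to_stabilize, time_to_close)
```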

Contributing Factor(s)

The 2.6.0 release of Supermarket included a commit which changed the AWS S3 URLs used to access cookbook artifacts in S3 storage. Prior to this change, Supermarket (through the Paperclip plugin) used a hosted-style S3 URL. The one for public Supermarket looked like this:

The problem was that this URL style only worked if the S3 bucket was in N. Virginia. To fix this, we changed our config to use a path-style URL like this:

When this change was merged and deployed, this error appeared when someone attempted to do a berks install using public Supermarket as the cookbook source: OpenSSL::SSL::SSLError: hostname "" does not match the server certificate

This was due to the "." characters in the bucket name "". Although the previous S3 URL style worked with dots in the bucket name, the path-style URL did not.
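
As general background on why dots in a bucket name can trigger this kind of error: whenever a bucket name ends up as part of the hostname a client verifies, a wildcard certificate only covers a single DNS label, so a dotted name adds extra labels that the wildcard cannot match. The Python sketch below illustrates that rule in simplified form. It does not reproduce Supermarket's, Paperclip's, or OpenSSL's actual behaviour; the certificate pattern and both bucket names are assumptions chosen for illustration.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified hostname check: a leading "*" matches exactly one DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard label must match a non-empty label; the rest must be equal.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# Assumed wildcard pattern and hypothetical bucket names, for illustration only.
CERT_PATTERN = "*.s3.amazonaws.com"

plain = wildcard_matches(CERT_PATTERN, "examplebucket.s3.amazonaws.com")
dotted = wildcard_matches(CERT_PATTERN, "example.bucket.s3.amazonaws.com")

print(plain)   # True: the bucket name is a single label
print(dotted)  # False: the dot adds an extra label the wildcard cannot cover
```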

Stabilization Steps

Fortunately, we had upgraded only 2 of the 4 prod nodes, so we removed the 2 upgraded nodes from the ELB and then downgraded them to Supermarket 2.5.2.


For approximately 53 minutes, anyone using berks install saw the SSL error.

Corrective Actions

  • Make S3 URL style configurable in Supermarket
  • Make sure the staging bucket has a name formatted similarly to the production bucket
  • Ensure that berks install is part of smoke tests in both staging and production
  • Add documentation around considerations when naming an S3 bucket
  • Investigate adding a monitor that does a simple berks install and executes on a regular basis
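
The smoke-test and monitoring items above could start from something like the following Python sketch. It is a hypothetical outline, not the check Chef actually deployed: the function names, the result labels, and the timeout are assumptions. It relies only on the `berks install` command and the `OpenSSL::SSL::SSLError` text seen during this incident, and it assumes `berks` is on the PATH and that the working directory contains a Berksfile pointing at the Supermarket under test.

```python
import subprocess

# Error text observed during this incident.
SSL_ERROR_MARKER = "OpenSSL::SSL::SSLError"


def classify_berks_run(returncode: int, output: str) -> str:
    """Map a finished `berks install` run to a coarse status label."""
    if SSL_ERROR_MARKER in output:
        return "ssl_error"
    if returncode != 0:
        return "failed"
    return "ok"


def run_berks_install(workdir: str, timeout_s: int = 300) -> str:
    """Run `berks install` in workdir and classify the result.

    Raises FileNotFoundError if `berks` is not installed and
    subprocess.TimeoutExpired if the run exceeds timeout_s.
    """
    proc = subprocess.run(
        ["berks", "install"],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return classify_berks_run(proc.returncode, proc.stdout + proc.stderr)
```

A scheduler such as cron or a CI job could call `run_berks_install` against staging and production on a regular basis and alert on anything other than "ok". Keeping the classification separate from the subprocess call makes the logic testable without a working Berkshelf install.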

