Saturday, January 30, 2016

Hosted Chef Service Degradation Incident



----
Hosted Chef Service Degradation Incident
// Chef Blog

Hosted Chef Reporting API Increased Error Rates

On January 29th from 07:17 to 15:00 UTC, users may have seen 404 errors logged at the end of successful chef-client runs as the client attempted to send a run report to Hosted Chef's reporting service. Additionally, users may have seen empty responses when using knife runs as a client on their workstations. This was the result of 2 of our 16 frontend nodes being left in an incorrect state following a routine deployment.

What happened?

Chef operations performed a deploy overnight to upgrade chef-server to 12.6 to mitigate several recently released security vulnerabilities. During the course of this deploy, our deploy tooling failed to fully configure the opscode-reporting service on 2 hosts, leaving the hosts in a functional but degraded state. At this point, the affected nodes were still passing health checks and were in service behind the load balancer, and no alerts were raised. The engineers finished their deployment, and their testing of the services afterwards showed no indication of the failure because the issue affected only a limited share (~12%) of requests.
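The ~12% figure is consistent with 2 of 16 frontends being affected (2/16 = 12.5%). As a rough illustration of why a handful of post-deploy test requests could all succeed, here is a minimal sketch. It assumes each request is routed independently and uniformly across the 16 nodes, which is a simplification and not a description of how the Hosted Chef load balancer actually distributes traffic. The function name `miss_probability` is hypothetical.

```python
def miss_probability(total_nodes: int, bad_nodes: int, requests: int) -> float:
    """Chance that every one of `requests` test requests lands on a healthy node.

    Assumption: requests are routed independently and uniformly across nodes.
    """
    healthy_fraction = (total_nodes - bad_nodes) / total_nodes
    return healthy_fraction ** requests


if __name__ == "__main__":
    # 16 frontends with 2 degraded, as described in the incident report.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} test requests: {miss_probability(16, 2, n):.3f} chance of seeing no failure")
```

Under that assumption, five test requests would miss both degraded nodes roughly half the time, so a quick manual smoke test is a weak detector for a fault confined to a small fraction of hosts.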

Stabilization steps

Once we discovered the problem, engineers manually reconfigured the two incorrectly configured nodes to restore service quickly, which immediately resolved the increased error rates.

What we're doing to improve

Chef's engineering staff is deeply committed to continuously improving our products. We are taking several steps as a result of this incident to improve Hosted Chef. Additional host-level monitoring is being put in place to catch this type of issue more rapidly in the future. ELB health checks are also being updated to more thoroughly test all components before a node is placed into service behind the load balancer.
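To illustrate the idea of a health check that tests all components, here is a minimal sketch of an aggregate check. It is not Chef's implementation: the function name `deep_health_check`, the component names, and the probe functions are all hypothetical. It assumes the load balancer treats an HTTP 200 as healthy and anything else as unhealthy.

```python
from typing import Callable, Dict, Tuple


def deep_health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every component probe; report 200 only if all of them pass.

    A probe that raises is treated as a failure, so a half-configured
    service cannot make the node look healthy by crashing the check.
    """
    results: Dict[str, str] = {}
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        results[name] = "ok" if ok else "fail"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


if __name__ == "__main__":
    # Hypothetical probes: the core API is up, but reporting is misconfigured.
    status, detail = deep_health_check({
        "core_api": lambda: True,
        "reporting": lambda: False,
    })
    print(status, detail)
```

With a check of this shape, a node whose reporting component is not fully configured would return a non-200 status and stay out of rotation, instead of passing a check that only exercises the core service.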

The trust and confidence of our users is of the utmost importance to us. We apologize for any inconvenience caused by this incident and will continue to learn from mistakes and improve our systems to give users a better experience.


----
