Habitat Service Discovery: A Deep Dive
// Chef Blog
Habitat can seem like magic – allowing you to deploy and manage any app, anywhere, in the same way. It's not magic, though; it's deliberate, well-thought-out technology.
In Habitat, the application itself is the unit of automation.
Widget World – Our Application
Let's look at a sample application – let's say we work for a company called "Widget World" and we want to create an online catalog so people can browse our widgets.
We decide we want to run this catalog as three microservices – the front end of the application will be in React, the back end will be in Rails, and the persistent data will be in a PostgreSQL database.
When a user visits the site, they will first see the front end rendered in React. When they request a list of widgets, the React front end will make an API call to the Rails backend, then the Rails backend will request the raw data from the Postgres Database, then format that raw data and pass it back to the React front end, which will render it for the user.
Let's run the React and Rails services in containers – they don't need to be stateful, so we can take advantage of the speed and flexibility of running them in containers.
Let's run the PostgreSQL database in a virtual machine. Unless we keep the persistent data in a separate volume, the PostgreSQL database is going to be stateful – we want the data to persist from deploy to deploy – so a virtual machine is a better fit than a container.
Now one instance of each of these services may work for a while, but as traffic to our website climbs, we will need replicas of the services to handle it. Let's increase our services to three React services, three Rails services, and three PostgreSQL services (the PostgreSQL services will likely run in a leader/follower cluster). If we are running each of these services with Habitat (one Habitat package for the React service, one for the Rails service, and one for the PostgreSQL service), we now have three Service Groups.
Each of these groups of services – where a service is one Habitat package running under a Supervisor – is known as a Service Group. The members of a Service Group each run the same package.
And, collectively, all of these services make up one Supervisor Ring.
All services in a Supervisor Ring – even those in different Service Groups – can communicate with all other services. Every time a new service is added, each existing service will become aware of its presence. And every time a service fails, that failure will be communicated to all the other services. How do they do this? Through service discovery.
Service discovery is how services find other services, keep track of those services, and communicate changes in the services available. Service discovery has been around for a while; it has always been needed when different parts of an infrastructure must be made aware of each other's presence.
In the days of physical and static infrastructure, this service discovery happened manually. Any time a new service was added, a sysadmin would have to make all services aware of the new one – usually by manually changing a config file on each service. While this can work decently for static infrastructure, it does not work in the age of dynamic services that come and go quickly.
Until very recently, most service discovery was handled by a central service registry. Services would communicate their presence to the central registry with a heartbeat message. If a service needed to talk to another service, it would first talk to the central service registry, who would direct them to the appropriate place. The problem with this approach is that it puts a large amount of network load on that central service registry, and that network load can become unmanageable as the number of services scales up. In order to solve this problem, some researchers at Cornell University developed a new system called the SWIM protocol.
SWIM stands for Scalable Weakly-consistent Infection-style Process Group Membership protocol – that's quite a mouthful and, honestly, you don't need to remember it word for word.
What you do need to remember is that it consists of two main components.
- A Failure Detector Component – this detects when a service fails and can no longer accept requests.
- A Dissemination Component – this communicates information about services to other services in the group.
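The two components can be pictured as the state each member carries. The sketch below is a minimal, hypothetical model – the names and fields are illustrative, not Habitat's actual types – showing a local membership list (failure detector state) and a rumor queue (dissemination state):

```python
from dataclasses import dataclass, field

@dataclass
class Member:
    member_id: str
    status: str = "alive"   # "alive", "suspect", or "dead"

@dataclass
class SwimNode:
    member_id: str
    # Failure detector state: a local list of every other member.
    membership: dict = field(default_factory=dict)
    # Dissemination state: rumors waiting to be broadcast to the group.
    rumor_queue: list = field(default_factory=list)

    def add_member(self, other_id: str):
        self.membership[other_id] = Member(other_id)
        self.rumor_queue.append(("alive", other_id))

node = SwimNode("M1")
node.add_member("M2")
```

The key design point is that every member holds this state itself – there is no central registry to consult.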
Let's explore this:
Let's say we have a service. Each service in a group that uses the SWIM protocol will keep a local list of every other service in the group. Rather than depending on a central service registry, the services themselves keep track of the members of the group.
And they keep those membership lists updated by randomly pinging other members of the group. Let's look a little deeper at this workflow.
Let's say we have a service group with four members – M1, M2, M3, and M4. Each of these members keeps its own membership list. Periodically, each member – let's use M1 in this case – will select a random member from its membership list – in this case, M2 – and send it a ping to see if it's there.
If M2 receives the ping, it will send back an ACK message, acknowledging the ping and letting M1 know that it is there and receiving messages.
If, however, M2 does not send an ACK message back within a certain amount of time:
Then M1 will initiate an investigation into M2. It will select other members at random – in this case, let's say it selects M3 and M4 – and will send them a message requesting that they try to ping M2.
M3 and M4 will then send ping messages to M2.
If M2 responds – let's say it sends an ACK message back to M3 – then M3 will forward that ACK to M1.
If neither M3 nor M4 hears a response to forward, and M1 still cannot get a response from M2, then M1 will delete M2 from its membership list.
It will then hand a delete request to the dissemination component. The dissemination component is what broadcasts the delete message to other members.
The dissemination component will then send a message to delete M2 to both M3 and M4.
And, once they remove M2 from their membership lists, M2 will no longer be a part of the ring.
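The whole failure-detection workflow above can be sketched in a few lines. This is a toy simulation under stated assumptions – `reachable` stands in for the network, and the function names are mine, not SWIM's – but it follows the same steps: direct ping, investigation via random helpers, then deletion and dissemination.

```python
import random

def probe(prober, target, reachable, membership, k=2):
    """One SWIM probe round, mirroring the M1..M4 walkthrough above.
    `reachable` models which members the network can currently deliver to."""
    if target in reachable:                      # direct ping answered with an ACK
        return True
    # Investigation: ask k random other members to ping the target for us.
    others = [m for m in membership if m not in (prober, target)]
    helpers = random.sample(others, min(k, len(others)))
    # A forwarded ACK from any helper saves the target.
    return any(target in reachable for _ in helpers)

# M2 has failed, so only M1, M3, and M4 are reachable.
member_lists = {m: ["M1", "M2", "M3", "M4"] for m in ["M1", "M3", "M4"]}
reachable = {"M1", "M3", "M4"}

if not probe("M1", "M2", reachable, member_lists["M1"]):
    # M1 deletes M2, then the dissemination component broadcasts the
    # delete message to the surviving members.
    for members in member_lists.values():
        members.remove("M2")
```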
Now, let's say we want to add a new member to this group – M5. All we need to do is peer M5 with one current member of the group.
The member it peers with will tell the dissemination component that it has a new member – M5.
And then the dissemination component will pass the message that there is a new member to the other members of the group, who will add it to their member list.
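Joining is the same machinery in reverse. In this hypothetical sketch (the function and rumor names are illustrative), M5 peers with M1, the "alive" rumor reaches every member, and M5 bootstraps its own list from its peer:

```python
member_lists = {m: {"M1", "M2", "M3", "M4"} for m in ["M1", "M2", "M3", "M4"]}

def join(new_id, peer_id, member_lists):
    rumor = ("alive", new_id)                  # the peer raises the rumor
    for members in member_lists.values():      # dissemination component at work
        members.add(rumor[1])
    # The new member bootstraps its own list from its peer.
    member_lists[new_id] = set(member_lists[peer_id])

join("M5", "M1", member_lists)
```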
Because this workflow resembles the way gossip travels through human societies, protocols like this are known as gossip-style protocols. In keeping with that theme, the pieces of information broadcast among members are called rumors.
Habitat has its own gossip protocol, which we call Butterfly.
Butterfly is actually a combination of two protocols:
- SWIM (membership and failure detection)
- Newscast (dissemination)
To see Butterfly in action check out this video demo https://youtu.be/_vHZLUn5Dik
Although Butterfly does implement SWIM, there are a few key things Butterfly adds to the classic SWIM protocol.
- When a member is confirmed failed, it is not deleted from membership lists. Instead, it stays in the membership list, but is designated as confirmed dead.
- Butterfly allows you to designate some members as persistent members
- Butterfly includes the concept of departures – which allow you to forcefully kick a service out of the ring
Confirmed Dead Members
When a member does not get an ACK message back after pinging another member, it initiates an investigation. If there is still no ACK message received after the investigation, the member marks the unresponsive member as confirmed dead in its membership list. It will then send a message to the dissemination component:
Which will then pass the message on to all the other members.
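Here is a small sketch of that difference from classic SWIM: the failed member stays in every membership list but carries a "confirmed dead" status rather than being deleted. The status strings are illustrative, not Butterfly's own wire format.

```python
# M1, M3, and M4 each keep a status for every member, including failed M2.
member_lists = {m: {"M1": "alive", "M2": "alive", "M3": "alive", "M4": "alive"}
                for m in ["M1", "M3", "M4"]}

def confirm_dead(dead_id, member_lists):
    for statuses in member_lists.values():   # the rumor reaches every member
        statuses[dead_id] = "confirmed dead" # marked, not removed
```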
For a demo of a member being confirmed dead, check out this link https://youtu.be/9N7tP00OzB4.
Persistent Members
Let's say we have four members in a group. If we designate one member – M1 – as persistent, all other members will be able to ping it normally.
But let's say our network is partitioned for some reason.
Suddenly, M3 and M4 will not be able to contact M1.
However, rather than marking M1 dead and no longer looking for it, M3 and M4 will continue trying to contact M1. Then, if the network partition heals:
M3 and M4 will immediately be able to ping M1 again.
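The behavior of a persistent member can be sketched as follows – a hypothetical model where a persistent member stays on the probe schedule instead of being written off, so it is found again the moment the partition heals:

```python
def probe_round(target, reachable, persistent):
    """Return the target's new status after one probe round."""
    if target in reachable:
        return "alive"
    # A non-persistent member would eventually be confirmed dead and no
    # longer probed; a persistent member is retried indefinitely.
    return "suspect" if target in persistent else "confirmed dead"

persistent = {"M1"}
during_partition = probe_round("M1", reachable=set(), persistent=persistent)
after_heal = probe_round("M1", reachable={"M1"}, persistent=persistent)
```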
Departures
Butterfly also allows for departures – which let a human operator forcefully kick a member out of the ring so that it can never return. We like to say we would do this in case of behavior unbecoming of a Supervisor.
Let's look at how this happens. Let's say we have a Supervisor ring with three members:
And one of those members starts doing something that it's just not supposed to do – it's not necessarily failing its health check, but something is just not right.
From my workstation – or by SSHing into another member of the ring – I can issue a Habitat command to kick that member out of the ring.
And it's gone – marked as departed in the membership lists of the surviving members. That means even if it comes back up and tries to rejoin the ring, it will not be permitted to.
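The permanence of departure is the key property. In this illustrative sketch (function names are hypothetical), a departed member is recorded for good, so any later rejoin attempt is refused:

```python
departed = set()
member_list = {"M1": "alive", "M2": "alive", "M3": "alive"}

def depart(member_id):
    member_list[member_id] = "departed"   # marked, not merely removed
    departed.add(member_id)

def try_rejoin(member_id):
    if member_id in departed:
        return False                      # never allowed back into the ring
    member_list[member_id] = "alive"
    return True

depart("M3")
```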
For a demo of a member being departed, check out this link https://youtu.be/CZzrrHHnKoE.
What I hope you take away from this blog post is that even though Habitat can seem like magic, it's not. It's very deliberate and well constructed technology. I hope this has given you a glimpse behind the curtain to how service discovery in Habitat works.