During one of our first internal hackathons at Hulu, we set out to redo a portion of our service monitoring tools. Within two days we had completed most of the functionality for a solution we call “Canary”, named after the miner’s canary. Canaries were once used during mining operations to alert miners (by the absence of singing) to the presence of invisible yet deadly gases. Similarly, Canary sniffs out potential problems and alerts the appropriate parties.
While Hulu has a multitude of other monitoring systems for machine, application, and overall system health, Canary provides a last-resort notification if a service instance dies for any reason — ranging from application crashes, to machine lockups, to network issues.
Design and Implementation:
The design goals for Canary were as follows:
- Make it dead simple to integrate
- Zero configuration for service owners
- Consume as few host resources as possible
- Centralize all notifications so they can be routed and filtered to the most appropriate owner
Canary is implemented in two parts. The first is a few lines of code in each participating service. Here is the Python version in its entirety:
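    import socket

    # PORT (the UDP port the Canary server listens on) is defined elsewhere.
    def heartbeat(identifier, hostname=socket.gethostname()):
        message = "HEARTBEAT_" + hostname + "_" + identifier
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        # "<broadcast>" addresses every host on the local network segment.
        sock.sendto(message.encode(), ("<broadcast>", PORT))
        sock.close()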
Participating services proactively send periodic, undirected “heartbeats” onto the network using a UDP broadcast. Each heartbeat contains host and service-identifying information. Heartbeats can be sent at arbitrary points within critical event loops, similar to a watchdog timer, or from a separate thread, as long as they are sent frequently enough.
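
As a rough illustration of the separate-thread option, the sketch below sends a heartbeat on a fixed interval from a daemon thread. It is not Hulu's code: the port number, the `start_heartbeat_thread` helper, and the 10-second default interval are all assumptions made for the example, and the heartbeat function is repeated so the block runs on its own.

```python
import socket
import threading

PORT = 9999  # hypothetical port; the post does not state the real value


def heartbeat(identifier, hostname=None, port=PORT):
    """Broadcast one heartbeat datagram on the local network segment."""
    hostname = hostname or socket.gethostname()
    message = "HEARTBEAT_" + hostname + "_" + identifier
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(message.encode(), ("<broadcast>", port))
    finally:
        sock.close()


def start_heartbeat_thread(identifier, interval=10.0, send=heartbeat):
    """Call send(identifier) every `interval` seconds until the returned event is set."""
    stop = threading.Event()

    def run():
        while not stop.is_set():
            try:
                send(identifier)
            except OSError:
                pass  # a failed send should never take down the service itself
            stop.wait(interval)  # sleeps, but wakes immediately once stop is set

    threading.Thread(target=run, daemon=True).start()
    return stop
```

One trade-off to keep in mind: a heartbeat sent from a background thread shows that the process is alive, but not that its main loop is still making progress. Calling `heartbeat()` from inside the critical loop, watchdog style, covers that case too.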
The second part is the Canary server. This server listens for heartbeat packets on the network and notifies someone over email when they stop arriving. Once a heartbeat has been seen for the first time, the server expects periodic updates carrying the same identifying string; if they stop, it fires an alert.
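
The server-side logic described above might look something like the following sketch. It is an assumption-laden illustration rather than the actual Canary server: the port, the single 60-second timeout, the `HeartbeatTracker` class, and the `alert` placeholder (which prints instead of sending email) are all hypothetical. The sketch also assumes that a silent heartbeat is forgotten after one alert, so a service that comes back simply registers again.

```python
import socket
import time

PORT = 9999      # hypothetical port
TIMEOUT = 60.0   # hypothetical: seconds of silence before alerting


class HeartbeatTracker:
    """Remembers when each heartbeat string was last seen."""

    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}

    def record(self, message, now):
        self.last_seen[message] = now

    def expired(self, now):
        """Return and forget every heartbeat that has been silent too long."""
        dead = [m for m, seen in self.last_seen.items() if now - seen > self.timeout]
        for m in dead:
            del self.last_seen[m]
        return dead


def alert(message):
    # Placeholder: the real server sends email; printing keeps the sketch simple.
    print("ALERT: no heartbeat from " + message)


def serve(port=PORT, timeout=TIMEOUT):
    tracker = HeartbeatTracker(timeout)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))   # "" binds to all local interfaces
    sock.settimeout(1.0)    # wake up regularly to check for silence
    while True:
        try:
            data, _addr = sock.recvfrom(1024)
            text = data.decode(errors="replace")
            if text.startswith("HEARTBEAT_"):
                tracker.record(text, time.monotonic())
        except socket.timeout:
            pass
        for message in tracker.expired(time.monotonic()):
            alert(message)
```

Because the tracker is keyed on the full heartbeat string, a new service instance needs no registration step: its first heartbeat is its registration, which fits the zero-configuration goal.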
Because broadcasts are used, only machines on the same network segment can monitor for heartbeats. However, routing tricks can be employed to ensure that all heartbeats are collected in a central location. We use two methods of relaying heartbeats between networks, both of which convert broadcast packets to unicast packets. The first is Cisco's "ip helper-address" option applied to a routing interface. The second is an active relay that listens within each network of a zone.
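
To make the active relay idea concrete, here is a minimal sketch of one: it listens for datagrams on its own segment and forwards each payload, unchanged, as a unicast packet to a central collector. The port, the `canary.example.com` address, and the function names are hypothetical, and the real relay may well differ.

```python
import socket

PORT = 9999                             # hypothetical port
CENTRAL = ("canary.example.com", PORT)  # hypothetical central server address


def relay_once(listen_sock, send_sock, destination):
    """Receive one datagram and forward its payload unchanged as unicast."""
    data, _addr = listen_sock.recvfrom(1024)
    send_sock.sendto(data, destination)
    return data


def relay_forever(port=PORT, destination=CENTRAL):
    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listen_sock.bind(("", port))  # receives broadcasts arriving on this segment
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        relay_once(listen_sock, send_sock, destination)
```

Forwarding replaces the packet's source address with the relay's own, but that costs nothing here, since each heartbeat already carries the originating hostname in its payload.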
Canary doesn't have to be limited to services; it can monitor anything that should be running constantly or on a regular schedule: cron jobs, backups, etc. A simple future enhancement would be to support arbitrary per-instance expected arrival times. More important services might want a shorter timeout period, while others (a backup job, for example) may only be expected to heartbeat once a day.
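
One possible shape for that enhancement is a lookup table of timeouts, sketched below. Everything here is hypothetical: the table, the service names, and the timeout values are invented for illustration, and for simplicity the table is keyed on the service identifier rather than on each host and service pair. The parsing also assumes hostnames contain no underscores, so the identifier is whatever follows the second underscore.

```python
DEFAULT_TIMEOUT = 60.0  # hypothetical default, in seconds

# Hypothetical per-identifier overrides; anything not listed uses the default.
TIMEOUTS = {
    "nightly-backup": 25 * 60 * 60,  # daily job, with an hour of slack
    "critical-service": 15.0,        # important service, short timeout
}


def timeout_for(message, timeouts=TIMEOUTS, default=DEFAULT_TIMEOUT):
    """Look up the timeout for a "HEARTBEAT_<hostname>_<identifier>" string."""
    # Assumes the hostname has no underscores, so splitting twice isolates the identifier.
    parts = message.split("_", 2)
    identifier = parts[2] if len(parts) == 3 else ""
    return timeouts.get(identifier, default)


def is_overdue(message, last_seen, now):
    """True if the heartbeat has been silent longer than its own timeout."""
    return now - last_seen > timeout_for(message)
```

A table like this lives on the server, which gives up a little of the zero-configuration goal. An alternative would be for each heartbeat to announce its own expected interval, at the cost of changing the message format.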
Eric Buehl is a software developer in the DevOps team at Hulu.