Hulu Tech Blog

Simple Service Monitoring with Canary

August 2nd, 2012 by Eric Buehl

Photo credit to Michael Sonnabend

During one of our first internal hackathons at Hulu, we set out to redo a portion of our service monitoring tools. Within two days we had completed most of the functionality for a solution we call “Canary” — named after the miner’s canary. Canaries were once used during mining operations to alert miners (by the absence of singing) to the presence of invisible yet deadly gases. Similarly, the canary service sniffs out potential problems and alerts the appropriate parties.

While Hulu has a multitude of other monitoring systems for machine, application, and overall system health, Canary provides a last-resort notification if a service instance dies for any reason — ranging from application crashes, to machine lockups, to network issues.

Design and Implementation:

The design goals for Canary were as follows:

  • Make it dead simple to integrate
  • Zero configuration for service owners
  • Consume as little host resources as possible
  • Centralize all notifications so they can be routed and filtered to the most appropriate owner

Canary is implemented in two parts. The first is a few lines of code embedded in each participating service. Here is the Python version in its entirety:

import socket

PORT = 22550  # hypothetical port number; pick one for your network

def heartbeat(identifier, hostname=socket.gethostname()):
    # Broadcast "HEARTBEAT_<hostname>_<identifier>" as a single UDP datagram.
    message = "HEARTBEAT_" + hostname + "_" + identifier
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    # "<broadcast>" is the socket module's special hostname for the
    # local broadcast address (255.255.255.255).
    sock.sendto(message.encode(), ("<broadcast>", PORT))

Participating services proactively send periodic, undirected “heartbeats” onto the network using a UDP broadcast. Each heartbeat contains host- and service-identifying information. Heartbeats can be sent at arbitrary points within critical event loops (much like servicing a watchdog timer) or from a separate thread, as long as they are sent frequently enough.
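The separate-thread approach can be sketched as follows. This is a minimal illustration, not Hulu's code; the interval value is an assumption, and the `send` parameter is a hypothetical hook that defaults to a heartbeat function like the one shown above:

```python
import threading
import time

HEARTBEAT_INTERVAL = 30  # seconds; assumed value, keep it well under the server's timeout

def start_heartbeats(identifier, send, interval=HEARTBEAT_INTERVAL):
    """Call `send(identifier)` periodically from a daemon thread.

    `send` would normally be the heartbeat() function above; it is a
    parameter here so the schedule can be exercised without a network.
    """
    def loop():
        while True:
            send(identifier)
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Because the thread is a daemon, it never blocks process shutdown; a service that hangs or dies simply stops heartbeating, which is exactly the signal Canary watches for.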

The second part is the canary server. This server listens for heartbeat packets on the network and notifies someone over email when they stop arriving. Once a new heartbeat is seen for the first time, the server expects to receive periodic updates matching the same identifying string; otherwise, an alert is fired.
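The post doesn't include the server code, but its described behavior can be sketched like this. The port, timeout, and wire format are assumptions (the format mirrors the client snippet above), and `alert` stands in for whatever email notification hook is used:

```python
import socket
import time

PORT = 22550   # hypothetical; must match the client side
TIMEOUT = 90   # assumed seconds of silence before an alert fires

def parse_heartbeat(data):
    """Return the host/service identifying string from a datagram, or None."""
    text = data.decode(errors="replace")
    prefix = "HEARTBEAT_"
    if text.startswith(prefix):
        return text[len(prefix):]
    return None

def run_server(alert=print):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(1.0)           # wake up regularly to check for silence
    last_seen = {}                 # identifier -> time of last heartbeat
    while True:
        try:
            data, _addr = sock.recvfrom(4096)
            ident = parse_heartbeat(data)
            if ident:
                last_seen[ident] = time.time()
        except socket.timeout:
            pass
        now = time.time()
        for ident, seen in list(last_seen.items()):
            if now - seen > TIMEOUT:
                alert("canary: no heartbeat from %s" % ident)
                del last_seen[ident]   # alert once, then wait for it to return
```

Note the zero-configuration property from the design goals falls out naturally: the server learns about instances purely from the heartbeats themselves.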

Because broadcasts are used, only machines on the same network segment can monitor for heartbeats; however, routing tricks can be employed to ensure that all heartbeats are collected in a central location. We relay heartbeats between networks by converting broadcast packets to unicast packets, in one of two ways: with Cisco’s “ip helper-address” option applied to a routing interface, or with an active relay that listens within each network of a zone.
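The active-relay method is simple enough to sketch. This is an illustrative version, not the actual relay; the port and the central collector address are hypothetical:

```python
import socket

PORT = 22550                        # hypothetical heartbeat port
CENTRAL = ("canary.example.com", PORT)  # hypothetical central collector

def relay_one(recv_sock, send_sock, central):
    """Forward a single received heartbeat datagram as unicast."""
    data, _addr = recv_sock.recvfrom(4096)
    send_sock.sendto(data, central)

def run_relay():
    """Listen for broadcast heartbeats on this segment and re-send each one
    as a unicast packet to the central canary server."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    recv_sock.bind(("", PORT))
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        relay_one(recv_sock, send_sock, CENTRAL)
```

The payload is forwarded untouched, so the central server parses relayed heartbeats exactly as it parses local ones.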

Canary doesn’t have to be limited to services; it can monitor anything that should be constantly running: cron jobs, backups, etc. A simple future enhancement would be to support arbitrary per-instance expected arrival times. More important services might want a shorter timeout period, while others — like a backup job — may only be expected to heartbeat once a day.
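One way that enhancement could look on the server side is a per-identifier timeout table with a fallback default. The identifiers and intervals below are purely illustrative:

```python
import time

# Hypothetical per-instance expected intervals, in seconds.
EXPECTED_INTERVAL = {
    "db01_checkout-service": 60,        # critical service: alert quickly
    "backup01_nightly-backup": 86400,   # backup job: one heartbeat per day
}
DEFAULT_INTERVAL = 300                  # assumed fallback for unlisted instances

def overdue(identifier, last_seen, now=None):
    """True when an instance has missed its expected heartbeat window."""
    now = time.time() if now is None else now
    return now - last_seen > EXPECTED_INTERVAL.get(identifier, DEFAULT_INTERVAL)
```

The server's silence check would simply call `overdue()` instead of comparing against a single global timeout.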

Eric Buehl is a software developer in the DevOps team at Hulu.

3 Comments
  • Jesse says:

    Does Hulu have a service health page? E.g. AWS for Amazon cloud has a public facing services health page.

    I need to be able to see if Hulu is down when I see errors on the site. Thank you!

  • Can I ever expect to see canary on your github page? I’m actually interested in taking a look at the server code. Also, why broadcast and not multicast? Just curious.

  • @Jeff, good idea re open sourcing it. Let me look into this. I think the choice for using broadcast vs multicast was made as at the time our switches weren’t configured to route multicast. I think the same may be true for other networks too.
