At Hulu we often use Statsd and Metricsd. Metrics of our high-through-traffic services are continuously sent to Statsd/Metricsd clusters, and are eventually surfaced via Graphite. Human or machine can then take the advantage of the Graphite UI or API to set up customized monitoring and alerting systems.
While we enjoyed the connectionless packet transmission enabled by the UDP ports of Statsd and Metricsd, we found the metric data carried inside such UDP packets proportionally small. For example, to tell that the API "hulu_api" has processed one request with 30 milliseconds, we send this message to Statsd: "hulu_api:30|ms". Consider, for a moment, how this looks at the level of a UDP packet on top of IP. With a UDP header of 8 bytes and an IP header of 20 bytes, the 14-byte metric data corresponds to only 33% of the bytes transferred. Aggregating many such packets into a single packet can increase this percentage. In the previous example, if we combine many hulu_api metrics and send 1000 bytes of metrics data within one packet, the data portion of the payload will correspond to 97% of the bytes transferred. This is a great improvement. To what extent can such aggregation be helpful depends on the Maximum Transmission Unit (MTU) of a specific network path. Sending a packet of larger length will cause fragmentation of data frames, which again introduces header overhead.
Using packet aggregation, the advantage of reducing transmission overhead comes with a few disadvantages as well. First, it reduces the temporal resolution of time-series data, because packets within a time window will be sent together. Second, packet loss rate may increase due to checksum failures, because longer packets have a higher probability of data corruption, and any failure in a checksum will result in the receiver discarding that entire packet. Third, sending larger packets takes more time and hence increases latency. To overcome the first disadvantage, we can use a time threshold — when this threshold is reached, we send the aggregated packet even if aggregation has not reached its max capacity. The second and third disadvantages cannot be compensated easily, but it is usually reasonable to assume a network to be still accurate and fast when packet sizes are close to but below the MTU.
When we have a target temporal resolution, we can perform some simple math-based data aggregations within the time window:
- Many packets of "Count" data type can be aggregated into one packet with their values summed: "hulu_api_success:10|c" and "hulu_api_success:90|c" can be aggregated to "hulu_api_success:100|c"
- Many packets of "Gauge" data type can be aggregated into one packet with the one latest value:
"hulu_api_queue_size:30|g" and "hulu_api_queue_size:25|g" can be aggregated to "hulu_api_queue_size:25|g"
- Other data types are difficult for math aggregation without re-implementing functionalities of Statsd or Metricsd.
Implementing math-based aggregation is actually not trivial. It requires parsing incoming UDP data, organizing them by metric names and data types, and performing the math operation. The benefit is that it reduces the actual amount of metrics information you send. In the scenario that a high frequency of count or gauge data points are sampled, hundreds of messages can be reduced to one!
To implement above aggregation logic, one could implement it within each sender. Instead of spreading this complexity throughout our applications, we decided instead to build a independent daemon. A standalone daemon has all the advantages of modularization: no matter with what language a service is implemented, and no matter how many different services one machine has, the network admin just needs to bring up one daemon process to aggregate all the packets sent from that host.
We decided to call the project "Bank" because we save penny packets and hope the "withdrawals" to be large. Here are some interesting details of Bank:
- Bank sends an aggregated packet when one of the three criteria is met:
- a configurable max packet length is reached
- a configurable number of packets have been aggregated
- a configurable time interval has elapsed
- Bank can be used as the frontend of both Statsd and Metricsd because it respects both protocols. For example, it supports the Metricsd data type "meter", which allows clients to omit "|m". Also it does not send the Statsd style multi-metric packet, which is not understood by Metricsd.
- Bank respects "delete". A packet with data "hulu_api:delete|h" will cause bank to discard current histogram data of hulu_api and send "hulu_api:delete|h" to downstream.
- Bank can parse packets that are already aggregated. Sometimes certain smart upstreams do certain level of aggregation by themselves, and sometimes math aggregation can be carried out at multiple levels, so it is nice to be able to parse the aggregated packets.
When our Seattle team tried Bank against their Statsd cluster, they decided to add downstream consistent hashing logic to it. To work with a cluster of Statsd machines, the same metric from different frontend should be sent to the same Statsd instance, because that instance will need all the data to do statistics. Consistent hashing is a perfect choice in this scenario. Each Bank uses the TCP health check of Statsd to determine which downstream Statsd's are up, and distribute packets based on metric names.
We implemented the initial version of Bank using Python in a weekend — indeed most of the time was spent on parsing and assembling strings to comply with the Statsd and Metricsd protocols. That version proved to not be as highly performant as we'd wanted it to be, and so we ended up rewriting Bank into C. This C version of Bank only uses standard libraries, so it should be portable across most platforms.
You can find the code at https://github.com/hulu/bank -- please feel free to file issues or suggest patches if you end up using bank.
Feng is a software developer on the Core Services team.