For companies running their own datacenter, setting up internal DNS infrastructure is essential for performance and ease of maintenance. Setting up a single DNS server for occasional requests is pretty straightforward, but scaling and distributing requests across multiple data centers is challenging. In this post, we'll describe the evolution of our DNS infrastructure from a simple setup to a more distributed configuration that is capable of reliably handling a significantly higher request volume.
DNS primer
In order to talk to each other over a network, computers are assigned IP addresses (http://en.wikipedia.org/wiki/IP_address). An IP address can be either IPv4 (e.g. 192.168.1.100) or IPv6 (e.g. fe80::3e97:eff:fe3a:ef7c). Humans are not that good at memorizing IP addresses, so the Domain Name System (DNS) was created (http://en.wikipedia.org/wiki/Domain_Name_System). DNS translates human-readable domain names to IP addresses. This allows you to type www.hulu.com in your browser to watch your favorite show instead of typing http://165.254.237.137. DNS can be used to resolve different types of records. The most important examples are:
  • A: name corresponding to an IPv4 address
  • AAAA: name corresponding to an IPv6 address
  • CNAME: alias to another name
  • PTR: IP address corresponding to a name (used for reverse lookups)

The DNS response contains a response code. A few important examples are:

  • NOERROR: the name was resolved successfully
  • NXDOMAIN: there is no record for the given name
  • NODATA: there is no record for the given name and request type, but the name has records for other request types. Strictly speaking this is not a real response code, but a NOERROR response with an empty answer section.
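To make the record types and response codes concrete, here is a toy lookup over an in-memory zone. This is an illustration, not a real resolver, and the names and addresses in it are hypothetical:

```python
# Toy in-memory "zone": maps (name, record type) -> value. Records are made up.
ZONE = {
    ("www.hulu.com.", "CNAME"): "origin.prd.hulu.com.",
    ("origin.prd.hulu.com.", "A"): "10.1.2.3",
}

def lookup(zone, name, rtype, max_chain=8):
    """Resolve name/rtype against the toy zone, following CNAME chains."""
    for _ in range(max_chain):
        if (name, rtype) in zone:
            return "NOERROR", zone[(name, rtype)]
        if rtype != "CNAME" and (name, "CNAME") in zone:
            name = zone[(name, "CNAME")]  # follow the alias
            continue
        # Name exists with some other record type -> NODATA; otherwise NXDOMAIN.
        if any(n == name for n, _ in zone):
            return "NODATA", None
        return "NXDOMAIN", None
    return "SERVFAIL", None  # CNAME chain too long
```

For example, an A query for www.hulu.com. follows the CNAME and returns NOERROR, an AAAA query for the same name returns NODATA, and a query for a name with no records at all returns NXDOMAIN.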
A DNS zone is any distinct, contiguous portion of the domain name space in the Domain Name System (DNS) for which administrative responsibility has been delegated to a single manager. The domain name space of the Internet is organized into a hierarchical layout of subdomains below the DNS root domain.
DNS servers will be referred to as nameservers through the rest of this document. Nameservers can be recursive (forwarding a request further when they can't answer it themselves) or authoritative (answering queries for their own zones without additional requests). Recursive DNS servers are called recursors.
For this document, we'll call servers that send DNS requests to nameservers clients. Linux uses a resolv.conf file located at /etc/resolv.conf that contains DNS configuration, including the list of nameservers to use for DNS requests.
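As a rough illustration of what lives in resolv.conf, here is a minimal parser sketch. It handles only the directives used in this post, not the full resolv.conf syntax:

```python
def parse_resolv_conf(text):
    """Extract nameservers and the search/domain list from resolv.conf text."""
    nameservers, search = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if parts[0] == "nameserver" and len(parts) > 1:
            nameservers.append(parts[1])
        elif parts[0] in ("search", "domain"):
            search = parts[1:]
    return nameservers, search
```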
Initial configuration
Initially we had a very straightforward DNS infrastructure configuration:
  • 4 nameservers in each datacenter behind a load balancer
  • Each nameserver was running a PowerDNS (https://www.powerdns.com/) authoritative service and recursor service
  • The recursor service was configured to serve internal zones using the PowerDNS authoritative server on the same host
  • Every client had a resolv.conf configured with the load balancer's DNS virtual IP address and the IP addresses of two backend nameservers. If a timeout occurred for a DNS request sent through the load-balanced endpoint, the request would be retried directly against the nameservers.
This worked, but DNS uses UDP, which does not guarantee delivery. While the overall DNS infrastructure worked well, names sometimes failed to resolve due to network issues or excessive traffic. After seeing these occasional failures, service owners tended to replace names with IP addresses in their configurations. This was especially true for services with extremely high request rates.
On the nameservers.

/etc/powerdns/pdns.conf:

default-soa-name=a.prd.hulu.com
launch=gsqlite3
gsqlite3-database=/opt/dns/sqlite_db
local-port=5300
log-failed-updates=yes

/etc/powerdns/recursor.conf:

allow-from=127.0.0.0/8, 192.168.1.0/24, 10.0.0.0/8, 1.2.3.0/24
forward-zones=10.in-addr.arpa=127.0.0.1:5300,prd.hulu.com=127.0.0.1:5300,stg.hulu.com=127.0.0.1:5300,tst.hulu.com=127.0.0.1:5300
# /etc/powerdns/empty_zone is an empty file
auth-zones=hulu.com.prd.hulu.com=/etc/powerdns/empty_zone,hulu.com.stg.hulu.com=/etc/powerdns/empty_zone,hulu.com.tst.hulu.com=/etc/powerdns/empty_zone
local-address=0.0.0.0
log-common-errors=yes
max-negative-ttl=60
threads=2

On the clients.

/etc/resolv.conf:

nameserver 10.1.1.53 # dns ip on load balancer
nameserver 10.1.1.120 # nameserver behind load balancer
nameserver 10.1.1.121 # another nameserver behind load balancer
search prd.hulu.com stg.hulu.com tst.hulu.com

Adding local DNS caching
In order to minimize the chance of DNS resolution failure because of network issues, we decided to set up a local DNS cache on each client. We compared several popular DNS caches, such as unbound, pdnsd, PowerDNS recursor, and nscd. Unbound demonstrated the best performance, excellent stability, and ease of configuration, so we gradually rolled it out to all clients.
Initially, resolv.conf was configured to point to the load-balanced DNS nameservers. We updated it to list unbound first, followed by the load-balanced nameserver in case unbound failed. It is important to note that if you have multiple nameserver entries in resolv.conf, by default they are used in the order they are listed, with 2 retries and a 5-second timeout (see man resolv.conf for more details). This means that each failing nameserver in resolv.conf adds about 10 seconds of delay to resolving a name. However, this is not the case for a local nameserver running on 127.0.0.1: the resolver detects almost instantaneously that UDP port 53 on localhost is not listening and tries the next nameserver in the list, so the delay is negligible. It's also noteworthy that Python, Ruby, Java, and Node.js cache the nameserver IPs from resolv.conf, so services written in those languages can see resolution delays when the local unbound nameserver is restarted.
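The delay arithmetic above can be sketched as a rough model, assuming the default timeout of 5 seconds and 2 attempts per nameserver (real glibc behavior varies with resolv.conf options such as timeout, attempts, and rotate):

```python
def worst_case_extra_delay(failing_nameservers, timeout=5, attempts=2):
    """Approximate seconds wasted on dead remote nameservers before one answers.

    A dead 127.0.0.1 entry costs roughly nothing: the kernel reports
    connection-refused immediately, so only non-local dead entries count here.
    """
    return failing_nameservers * timeout * attempts
```

With the defaults, one failing remote nameserver costs about 10 seconds per lookup, which is why falling back through a dead load balancer entry is so painful.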
The results were very good: traffic to the nameservers decreased, and we stopped seeing failures while resolving DNS names.
On the clients.

/etc/resolv.conf:

nameserver 127.0.0.1 # unbound
nameserver 10.1.1.53 # dns ip on load balancer
search prd.hulu.com stg.hulu.com tst.hulu.com

/etc/unbound/unbound.conf:

server:
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    local-zone: "10.in-addr.arpa." transparent
remote-control:
    control-enable: no
forward-zone:
    name: "."
    forward-addr: 10.1.1.53

Blocking AAAA queries
We noticed that a significant portion of responses were NXDOMAIN for names like host.prd.hulu.com.tst.hulu.com. This had two causes: the search list in resolv.conf and AAAA queries from Python. We don't use IPv6 on our internal network, so AAAA queries only added unnecessary traffic and increased the chance that a valid request would go unanswered, since DNS uses UDP and UDP does not guarantee delivery. After a close look at unbound's features, we blocked most of the bad queries by creating zones that unbound serves locally.
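Names like host.prd.hulu.com.tst.hulu.com come from the resolver's search-list expansion. Here is a sketch of the candidate list glibc tries, in order, when earlier lookups keep failing (assuming the default ndots:1; this is a simplification of the real algorithm):

```python
def candidate_fqdns(name, search_list, ndots=1):
    """Absolute names tried, in order, while earlier candidates fail to resolve."""
    if name.endswith("."):
        return [name]  # already absolute, no expansion
    as_is = name + "."
    expanded = ["%s.%s." % (name, d) for d in search_list]
    if name.count(".") >= ndots:
        return [as_is] + expanded  # enough dots: try the literal name first
    return expanded + [as_is]      # too few dots: try the search list first
```

An AAAA query for host.prd.hulu.com that gets no answer therefore walks through host.prd.hulu.com.prd.hulu.com., host.prd.hulu.com.stg.hulu.com., and host.prd.hulu.com.tst.hulu.com., producing exactly the NXDOMAIN noise we observed.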
On the clients.

/etc/unbound/unbound.conf:

server:
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    local-zone: "com.prd.hulu.com" static
    local-zone: "com.stg.hulu.com" static
    local-zone: "com.tst.hulu.com" static
    local-zone: "10.in-addr.arpa." transparent
remote-control:
    control-enable: no
forward-zone:
    name: "."
    forward-addr: 10.1.1.53
Making unbound daemons talk directly to nameservers
Everything was working OK until we had a failure in the load balancer we were using for DNS. Services were not able to resolve DNS names and started to fail. Given how important DNS is, we decided that DNS should not be tied to the reliability of a load balancer, so we decided to remove the load balancer from the DNS infrastructure. This was easy to accomplish: instead of pointing to the load balancer VIP, we reconfigured unbound to talk directly to the nameservers.
On the clients.

/etc/unbound/unbound.conf:

server:
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    local-zone: "com.prd.hulu.com" static
    local-zone: "com.stg.hulu.com" static
    local-zone: "com.tst.hulu.com" static
    local-zone: "10.in-addr.arpa." transparent
remote-control:
    control-enable: no
forward-zone:
    name: "."
    forward-addr: 10.1.1.120
    forward-addr: 10.1.1.121
    forward-addr: 10.1.1.122
    forward-addr: 10.1.1.123

Making DNS resolution immune to internet connection outages
Some time after that, we had an internet connection issue involving outbound traffic in one of our datacenters. It is important to note that authoritative DNS for our externally facing DNS names is hosted by our CDN, and resolution of these names inside our datacenters also uses the CDN. We do this in order to have unified DNS resolution from the internal Hulu network and from the internet (from the perspective of Hulu's customers). This unified approach allows us to easily reproduce any customer-reported issues related to DNS. But it also means that without an internet connection we can't resolve our external DNS names, even when the IPs they resolve to physically reside in our datacenters. Because names couldn't be resolved, many services in this datacenter stopped working even though they had no dependency on the outside internet. We decided to modify DNS so that a failure of the internet connection in a single datacenter wouldn't affect DNS resolution. This was a significant change to the nameservers. What we did:
  • The original PowerDNS recursor service was retired, and we switched to unbound
  • Incoming DNS requests to the nameservers were served by unbound, configured with appropriate zones.
  • Requests for internal zones were forwarded to the PowerDNS authoritative services. Each unbound instance talked to all PowerDNS servers, so the failure of a single PowerDNS authoritative server incurs only a small performance penalty
  • External requests were forwarded to another unbound layer. While the internet connection worked, this layer consisted of the unbound instances in the local datacenter. If the outbound internet connection fails, the unbound instances from the other datacenter are added, so that external names can continue to be resolved. It is important to note that using unbound recursors from another datacenter can result in wrong geo-location information for external names: instead of an address close to the datacenter where the request originated, an address close to the datacenter with the working internet connection is returned. This can increase round-trip times and service latencies.
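The failover decision that the dns-monitor.sh script below implements can be sketched as a pure function (the IP lists are illustrative, and the helper name is ours):

```python
def choose_forwarders(local_ok, remote_ok, cross_dc_active, local_ips, remote_ips):
    """Pick recursor IPs for the '.' forward-zone; None means leave it unchanged.

    local_ok / remote_ok:  can each datacenter's recursors reach the internet?
    cross_dc_active:       are we currently forwarding to both datacenters?
    """
    if local_ok and cross_dc_active:
        return local_ips                  # local internet is back: go local-only
    if not local_ok and not cross_dc_active and remote_ok:
        return local_ips + remote_ips     # local uplink is down: add the remote DC
    return None                           # steady state, no change needed
```

Note the hysteresis: the forwarder set only changes on a state transition, so a flapping uplink doesn't rewrite the configuration on every check.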
On the nameservers.

/etc/unbound/unbound.conf:

server:
    interface: 0.0.0.0
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    rrset-roundrobin: yes
    do-ip6: no
    do-not-query-localhost: no
    extended-statistics: yes
    local-zone: "com.prd.hulu.com" static
    local-zone: "com.stg.hulu.com" static
    local-zone: "com.tst.hulu.com" static
    local-zone: "10.in-addr.arpa." transparent
forward-zone:
    name: "10.in-addr.arpa"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "prd.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "stg.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "tst.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "."
    forward-addr: 10.1.1.120@5301
    forward-addr: 10.1.1.121@5301
    forward-addr: 10.1.1.122@5301
    forward-addr: 10.1.1.123@5301

/etc/unbound/unbound-5301.conf:

server:
    interface: 0.0.0.0
    port: 5301
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    extended-statistics: yes
    pidfile: /var/run/unbound-5301.pid
remote-control:
    control-port: 8954

/opt/dns/bin/dns-monitor.sh:

#!/bin/bash

set -u

dc1_ips="10.1.1.120 10.1.1.121 10.1.1.122 10.1.1.123"
dc2_ips="10.2.1.120 10.2.1.121 10.2.1.122 10.2.1.123"

case "$(hostname -s|cut -f1 -d-)" in
dc1) here_ips=$dc1_ips; there_ips=$dc2_ips ;;
dc2) here_ips=$dc2_ips; there_ips=$dc1_ips ;;
* ) exit 1 ;;
esac

PID_FILE=/var/run/$(basename $(readlink -f $0)).pid

all_ips=$(echo $dc1_ips $dc2_ips|sed -e 's/ /\n/g'|sort)

check_upstream() {
    for ip; do
        [ "$(dig @$ip -p 5301 +tries=1 +time=1 +short \
            google-public-dns-a.google.com)" == "8.8.8.8" ] && return 0
    done
    return 1
}

set_zone() {
    forwarders=()
    for ip; do
        forwarders+=(${ip}@5301)
    done
    /usr/sbin/unbound-control forward_add . ${forwarders[@]}
}

run_check() {
    current_zone=$(/usr/sbin/unbound-control list_forwards|grep '^\. '| \
        sed -e 's/.*forward: //' -e 's/ /\n/g'|sort)
    here_status=down
    there_status=down
    cross_dc_status=down
    check_upstream $here_ips && here_status=up
    [ "$current_zone" == "$all_ips" ] && cross_dc_status=up
    [ "${here_status}${cross_dc_status}" == "upup" ] && {
        set_zone $here_ips
        return
    }
    check_upstream $there_ips && there_status=up
    [ "${here_status}${cross_dc_status}${there_status}" == "downdownup" ] && {
        set_zone $all_ips
        return
    }
}

get_lock() {
    touch ${PID_FILE}.test || {
        echo "Can't create ${PID_FILE}"
        exit 1
    }
    rm -f ${PID_FILE}.test
    while true; do
        set -- $(LC_ALL=C ls -il ${PID_FILE} 2>/dev/null)
        if [ -z "${1:-}" ] ; then
            ln -s $$ $PID_FILE && return 0
        else
            ps ${12} >/dev/null 2>&1 && return 1
            find $PID_FILE -inum ${1} -exec rm -f {} \;
        fi
    done
}

get_lock || exit 1

exec &>/dev/null

while sleep 5; do
run_check
done

/etc/cron.d/dns-monitor:

SHELL=/bin/bash
* * * * * root /opt/dns/bin/dns-monitor.sh

Serving programmatically generated names
In order to test certain services, we needed a testing host to have a name in the *.hulu.com domain. When testing is done from a workstation, it is not always possible to use a dedicated name. We decided to use a special domain, ip.hulu.com: names of the form A.B.C.D.ip.hulu.com resolve to the IP address A.B.C.D. This can be done using unbound's python extension. Another useful feature that can be implemented in the python extension is datacenter-aware names. For example, if we have service.dc1.prd.hulu.com in datacenter dc1 and service.dc2.prd.hulu.com in dc2, we can have a virtual domain dc.prd.hulu.com, so that service.dc.prd.hulu.com resolves to the proper name local to that datacenter.
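The two naming schemes can be sketched as plain functions, independent of the unbound extension machinery shown below (the helper names are ours, not part of unbound):

```python
def ip_name_target(qname):
    """'10.20.30.40.ip.hulu.com.' -> '10.20.30.40', or None if not applicable."""
    labels = qname.rstrip(".").split(".")
    if len(labels) == 7 and labels[4:] == ["ip", "hulu", "com"]:
        # validate the four octets so a malformed query can't blow up
        if all(l.isdigit() and 0 <= int(l) <= 255 for l in labels[:4]):
            return ".".join(labels[:4])
    return None

def dc_name_target(qname, local_dc="dc1"):
    """'service.dc.prd.hulu.com.' -> CNAME target in the local datacenter."""
    labels = qname.split(".")
    if len(labels) > 5 and labels[-5] == "dc":
        labels[-5] = local_dc
        return ".".join(labels)
    return None
```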
On the nameservers.

/etc/unbound/unbound.conf:

server:
    interface: 0.0.0.0
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    rrset-roundrobin: yes
    do-ip6: no
    do-not-query-localhost: no
    extended-statistics: yes
    local-zone: "com.prd.hulu.com" static
    local-zone: "com.stg.hulu.com" static
    local-zone: "com.tst.hulu.com" static
    local-zone: "10.in-addr.arpa." transparent
forward-zone:
    name: "10.in-addr.arpa"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "prd.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "stg.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "tst.hulu.com"
    forward-addr: 10.1.1.120@5300
    forward-addr: 10.1.1.121@5300
    forward-addr: 10.1.1.122@5300
    forward-addr: 10.1.1.123@5300
forward-zone:
    name: "dc.prd.hulu.com"
    forward-addr: 10.1.1.120@5301
    forward-addr: 10.1.1.121@5301
    forward-addr: 10.1.1.122@5301
    forward-addr: 10.1.1.123@5301
forward-zone:
    name: "dc.stg.hulu.com"
    forward-addr: 10.1.1.120@5301
    forward-addr: 10.1.1.121@5301
    forward-addr: 10.1.1.122@5301
    forward-addr: 10.1.1.123@5301
forward-zone:
    name: "dc.tst.hulu.com"
    forward-addr: 10.1.1.120@5301
    forward-addr: 10.1.1.121@5301
    forward-addr: 10.1.1.122@5301
    forward-addr: 10.1.1.123@5301
forward-zone:
    name: "."
    forward-addr: 10.1.1.120@5301
    forward-addr: 10.1.1.121@5301
    forward-addr: 10.1.1.122@5301
    forward-addr: 10.1.1.123@5301

/etc/unbound/unbound-5301.conf:

server:
    interface: 0.0.0.0
    port: 5301
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    extended-statistics: yes
    pidfile: /var/run/unbound-5301.pid
    module-config: "validator python iterator"
python:
    python-script: "/etc/unbound/unbound-5301.py"
remote-control:
    control-port: 8954

/etc/unbound/unbound-5301.py:

DC = "dc1" # or dc2 depending which datacenter we are in

def init(id, cfg): return True

def deinit(id): return True

def inform_super(id, qstate, superqstate, qdata): return True

def create_response(id, qstate, in_rr_types, out_rr_type, pkt_flags, msg_answer_append):
    # create an instance of a DNS message (packet) with the given parameters
    msg = DNSMessage(qstate.qinfo.qname_str, out_rr_type, RR_CLASS_IN, pkt_flags)
    # append the RR
    if qstate.qinfo.qtype in in_rr_types:
        msg.answer.append(msg_answer_append)
    # set qstate.return_msg
    if not msg.set_return_msg(qstate):
        qstate.ext_state[id] = MODULE_ERROR
        return True
    # we don't need validation, the result is valid
    qstate.return_msg.rep.security = 2
    qstate.return_rcode = RCODE_NOERROR
    qstate.ext_state[id] = MODULE_FINISHED
    return True

def operate(id, event, qstate, qdata):
    if (event == MODULE_EVENT_NEW) or (event == MODULE_EVENT_PASS):
        a = qstate.qinfo.qname_str.split('.')
        if len(a) > 5 and a[-1] == '' and a[-2] == 'com' and a[-3] == 'hulu':
            # A.B.C.D.ip.hulu.com. -> A record for A.B.C.D
            # (check that the labels are numeric so a bad query can't raise ValueError)
            if len(a) == 8 and a[-4] == 'ip' \
                    and all(x.isdigit() and 0 <= int(x) <= 255 for x in a[-8:-4]):
                msg_answer_append = "{0} 300 IN A {1}.{2}.{3}.{4}".format(
                    qstate.qinfo.qname_str, a[-8], a[-7], a[-6], a[-5])
                create_response(id, qstate, [RR_TYPE_A, RR_TYPE_ANY], RR_TYPE_A,
                                PKT_QR | PKT_RA | PKT_AA, msg_answer_append)
                return True
            # service.dc.X.hulu.com. -> CNAME to service.<local dc>.X.hulu.com.
            if a[-5] == 'dc':
                a[-5] = DC
                msg_answer_append = "{0} 300 IN CNAME {1}".format(
                    qstate.qinfo.qname_str, '.'.join(a))
                create_response(id, qstate, [RR_TYPE_CNAME, RR_TYPE_A, RR_TYPE_ANY],
                                RR_TYPE_CNAME, PKT_QR | PKT_RA, msg_answer_append)
                return True
        # pass the query to the validator
        qstate.ext_state[id] = MODULE_WAIT_MODULE
        return True

    if event == MODULE_EVENT_MODDONE:
        log_info("pythonmod: iterator module done")
        qstate.ext_state[id] = MODULE_FINISHED
        return True

    log_err("pythonmod: bad event")
    qstate.ext_state[id] = MODULE_ERROR
    return True
Early blocking of unwanted queries
After more research, we figured out that blocking unwanted queries can be done even earlier with an unbound extension. Specifically, we wanted to block resolution of all IPv6 names in the Hulu domain (since we don't use IPv6 on our internal network). Here is how this can be done:
On the clients.

/etc/unbound/unbound.conf:

server:
    access-control: 0.0.0.0/0 allow
    prefetch: yes
    do-ip6: no
    rrset-roundrobin: yes
    chroot: ""
    local-zone: "com.prd.hulu.com" static
    local-zone: "com.stg.hulu.com" static
    local-zone: "com.tst.hulu.com" static
    local-zone: "10.in-addr.arpa." transparent
    module-config: "validator python iterator"
python:
    python-script: "/etc/unbound/unbound.py"
remote-control:
    control-enable: no
forward-zone:
    name: "."
    forward-addr: 10.1.1.120
    forward-addr: 10.1.1.121
    forward-addr: 10.1.1.122
    forward-addr: 10.1.1.123

/etc/unbound/unbound.py:

def init(id, cfg): return True

def deinit(id): return True

def inform_super(id, qstate, superqstate, qdata): return True

def operate(id, event, qstate, qdata):
    if (event == MODULE_EVENT_NEW) or (event == MODULE_EVENT_PASS):
        qtype = qstate.qinfo.qtype
        qname_str = qstate.qinfo.qname_str
        if qtype == RR_TYPE_AAAA and qname_str.endswith(".hulu.com."):
            # create an instance of a DNS message (packet) with the given parameters
            msg = DNSMessage(qname_str, qtype, RR_CLASS_IN, PKT_QR | PKT_RA | PKT_AA)
            # set qstate.return_msg
            if not msg.set_return_msg(qstate):
                qstate.ext_state[id] = MODULE_ERROR
                return True
            # we don't need validation, the result is valid
            qstate.return_msg.rep.security = 2
            qstate.return_rcode = RCODE_NXDOMAIN
            qstate.ext_state[id] = MODULE_FINISHED
            return True
        else:
            # pass the query to the validator
            qstate.ext_state[id] = MODULE_WAIT_MODULE
            return True

    if event == MODULE_EVENT_MODDONE:
        # log_info("pythonmod: iterator module done")
        qstate.ext_state[id] = MODULE_FINISHED
        return True

    log_err("pythonmod: bad event")
    qstate.ext_state[id] = MODULE_ERROR
    return True
Cleanup of search list in resolv.conf
Finally, we removed the search list from /etc/resolv.conf on the clients. A search list is useful when you want to use short names instead of fully qualified domain names (e.g. myservice instead of myservice.prd.hulu.com). But it is definitely a bad practice: if you have both myservice.prd.hulu.com and myservice.stg.hulu.com and you use the short name, you get whichever one comes first in the search list. Instead of a search list, we now use the domain directive in resolv.conf, which still allows the use of short names but is limited to a single domain, so there is no ambiguity in name resolution.
On the clients.

/etc/resolv.conf:

nameserver 127.0.0.1 # unbound
nameserver 10.1.1.53 # dns ip on load balancer
domain prd.hulu.com
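The ambiguity a multi-entry search list creates (and that a single domain directive avoids) can be sketched as follows; the service names are hypothetical:

```python
def resolve_short_name(short, search_domains, existing_fqdns):
    """The first search domain that yields an existing name wins."""
    for domain in search_domains:
        fqdn = "%s.%s" % (short, domain)
        if fqdn in existing_fqdns:
            return fqdn
    return None
```

With both myservice.prd.hulu.com and myservice.stg.hulu.com deployed, the answer for the short name "myservice" depends entirely on the search-list order, which is exactly the surprise we wanted to eliminate.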
DNS naming conventions
We found that once a DNS name starts being used, it is extremely hard to deprecate. Consequently, it's best to think carefully about naming schemas and follow them rigorously. We found that it's generally a good idea to have domains specific to production, testing, etc:
  • prd.company.com
  • tst.company.com
  • dev.company.com
If you have multiple datacenters, it may be useful to have datacenter-specific domains:
  • dc0.company.com
  • dc1.company.com
Using top-level names like git.company.com, help.company.com, etc. for internal services seems like a good idea to keep names short and save on typing, but supporting them can become quite complicated over time. It's often better to use domains specific to production, testing, and dev (e.g. git.prd.company.com).
Closing thoughts
In the first portion of our upgrade, we added local caches that talked directly to nameservers and proxied traffic between datacenters. This immediately minimized the effect of internal or external network outages on name resolution. Once we had a more stable setup, we improved the performance of the overall system by eliminating a significant number of irrelevant queries. We're continuing to improve our DNS infrastructure by adding programmatically generated names to allow services in a datacenter to automatically find their counterpart services and databases. We've been very happy with our new setup and hope that the details we've shared here prove useful to others looking to scale up their own DNS infrastructure.