Domain Name System (DNS)
- Trace the full resolution walk from root name server to TLD to authoritative server for a domain name.
- Explain how TTL-based caching reduces DNS query load and why negative caching (for non-existent domains) matters.
- Describe how cache poisoning works and what DNSSEC does to prevent it.
- Explain the difference between a recursive resolver and an iterative resolver.
Every time you visit a website or send an email, your computer translates a human-readable name like "www.example.com" into an IP addresses like "192.0.2.1". Domain Name System (DNS) is what makes this possible. DNS is a distributed database, and is one of the internet's most critical infrastructure components. Among other things, it enables service mobility: a website can change servers and IP addresses without users noticing, because the DNS mapping can be updated.
How DNS Works
DNS is a hierarchical, distributed database that handles over a trillion queries per day. Instead of one central server knowing all domain names, which couldn't possibly scale, the work is distributed across millions of servers organized in a tree structure:
- 13 root servers (though there are actually many more physical servers).
- Top-level domain (TLD) servers for
.com,.org,.net, etc. - Authoritative name servers specific to each domain like
example.com. - Recursive resolvers, such as your ISP's DNS server or public DNS like 8.8.8.8.
When you look up www.example.com, your computer asks a recursive resolver, which may:
- ask a root server "where's
.com?" - ask the
.comserver "where'sexample.com?" - ask the
example.comserver "what'swww.example.com?" - return the answer to you
This process is called recursive resolution. It scales because no single server needs to know everything. Instead, each server knows its piece of the hierarchy and where to find the next level.
DNS would be unusably slow if every lookup required multiple round-trips through the hierarchy. The solution is caching: resolvers remember answers for a period referred to as Time to Live (TTL). Popular domains like wikipedia.org are cached at every level so that most queries are answered from cache in milliseconds. This ensures that root servers handle surprisingly few queries despite being at the top of the hierarchy. It is also why changing a DNS record doesn't take effect immediately: old values persist in caches until their TTL expires.
Implementation
DNS stores different types of records. Our implementation will focus on A and CNAME records, which handle most web traffic. As in previous chapters, we start by enumerating the different types of records:
class RecordType(Enum):
"""DNS record types."""
A = "A" # IPv4 address
AAAA = "AAAA" # IPv6 address
CNAME = "CNAME" # Canonical name (alias)
NS = "NS" # Name server
MX = "MX" # Mail exchange
and then define a DNS resource record:
@dataclass
class DNSRecord:
"""A DNS resource record."""
name: str # Domain name
value: str # IP address, domain name, etc.
record_type: RecordType
ttl: int = 3600 # Time to live in seconds
We also need to define the structure of queries and responses:
@dataclass
class DNSQuery:
"""A DNS query message."""
query_id: int
domain: str
record_type: RecordType
@dataclass
class DNSResponse:
"""A DNS response message."""
query_id: int
domain: str
record_type: RecordType
records: list[DNSRecord]
authoritative: bool = False
from_cache: bool = False
The next step is to create the authoritative DNS server.
The constructor registers the zone this server is authoritative for and creates its record store.
add_record lets callers populate the zone before the simulation starts:
class AuthoritativeDNSServer(Process):
"""An authoritative DNS server for a specific zone."""
def init(self, name: str, zone: str, request_queue: Queue):
self.name = name
self.zone = zone # e.g., "example.com"
self.request_queue = request_queue
self.records: dict[tuple[str, RecordType], list[DNSRecord]] = {}
self.queries_served = 0
def add_record(self, record: DNSRecord):
"""Add a DNS record to this server's zone."""
key = (record.name, record.record_type)
if key not in self.records:
self.records[key] = []
self.records[key].append(record)
The run loop receives queries and builds responses.
If the query domain ends in this server's zone,
the server looks up the record directly;
it also follows CNAME chains to resolve aliases to their target A records:
async def run(self):
"""Process DNS queries."""
while True:
# Wait for a query
client_queue, query = await self.request_queue.get()
# Check if this query is for our zone
if not query.domain.endswith(self.zone):
# Not authoritative for this domain
response = DNSResponse(
query_id=query.query_id,
domain=query.domain,
record_type=query.record_type,
records=[],
authoritative=False,
)
else:
# Look up the record
key = (query.domain, query.record_type)
records = self.records.get(key, [])
# Handle CNAME (alias) records
if not records and query.record_type == RecordType.A:
cname_key = (query.domain, RecordType.CNAME)
cname_records = self.records.get(cname_key, [])
if cname_records:
# Follow the CNAME
canonical_name = cname_records[0].value
a_key = (canonical_name, RecordType.A)
records = self.records.get(a_key, [])
records = cname_records + records
response = DNSResponse(
query_id=query.query_id,
domain=query.domain,
record_type=query.record_type,
records=records,
authoritative=True,
)
print(
f"[{self.now:.3f}] {self.name}: Query for {query.domain} "
f"({query.record_type.value}) -> {len(response.records)} record(s)"
)
# Simulate processing delay
await self.timeout(0.01)
# Send response
await client_queue.put(response)
self.queries_served += 1
The recursive resolver is more complex because it must cache and coordinate with authoritative servers.
A CacheEntry wraps a record with its expiry time:
class CacheEntry:
"""A cached DNS record with expiration time."""
def __init__(self, record: DNSRecord, expire_time: float):
self.record = record
self.expire_time = expire_time
def is_expired(self, current_time: float) -> bool:
"""Check if this cache entry has expired."""
return current_time >= self.expire_time
The resolver's constructor sets up the cache and tracks hit and miss statistics:
class RecursiveDNSResolver(Process):
"""A recursive DNS resolver with caching."""
def init(
self,
name: str,
request_queue: Queue,
root_servers: dict[str, Queue],
):
self.name = name
self.request_queue = request_queue
self.root_servers = root_servers # zone -> server queue
self.cache: dict[tuple[str, RecordType], list[CacheEntry]] = {}
self.response_queue = Queue(self._env)
self.queries_received = 0
self.cache_hits = 0
self.cache_misses = 0
The run loop checks the cache before forwarding to an authoritative server.
On a cache miss it calls _resolve_recursive, then stores the results:
async def run(self):
"""Process client DNS queries."""
while True:
# Wait for a query from a client
client_queue, query = await self.request_queue.get()
self.queries_received += 1
print(
f"[{self.now:.3f}] {self.name}: Received query for "
f"{query.domain} ({query.record_type.value})"
)
# Check cache first
cached_records = self._check_cache(query.domain, query.record_type)
if cached_records:
self.cache_hits += 1
response = DNSResponse(
query_id=query.query_id,
domain=query.domain,
record_type=query.record_type,
records=cached_records,
authoritative=False,
from_cache=True,
)
print(f"[{self.now:.3f}] {self.name}: Cache HIT for {query.domain}")
else:
self.cache_misses += 1
print(f"[{self.now:.3f}] {self.name}: Cache MISS for {query.domain}")
# Recursively resolve
response = await self._resolve_recursive(query)
# Cache the results
if response.records:
self._cache_records(response.records)
# Send response to client
await client_queue.put(response)
_check_cache filters out expired entries and returns None on a complete miss.
_cache_records stores records with expiry times computed from each record's TTL:
def _check_cache(
self, domain: str, record_type: RecordType
) -> Optional[list[DNSRecord]]:
"""Check if we have a cached record."""
key = (domain, record_type)
if key not in self.cache:
return None
# Filter out expired entries
valid_entries = [
entry for entry in self.cache[key] if not entry.is_expired(self.now)
]
if not valid_entries:
del self.cache[key]
return None
self.cache[key] = valid_entries
return [entry.record for entry in valid_entries]
def _cache_records(self, records: list[DNSRecord]):
"""Add records to the cache."""
for record in records:
key = (record.name, record.record_type)
if key not in self.cache:
self.cache[key] = []
expire_time = self.now + record.ttl
self.cache[key].append(CacheEntry(record, expire_time))
_resolve_recursive finds the authoritative server
whose zone suffix matches the query domain most specifically,
then forwards the query and returns the response.
Each cached entry has an expiration time based on the record's TTL.
async def _resolve_recursive(self, query: DNSQuery) -> DNSResponse:
"""Recursively resolve a query by querying authoritative servers."""
# Find the appropriate authoritative server
# In real DNS, this involves walking the DNS hierarchy
# For simplicity, we'll match the longest zone suffix
matching_zone = None
for zone in self.root_servers.keys():
if query.domain.endswith(zone):
if matching_zone is None or len(zone) > len(matching_zone):
matching_zone = zone
if matching_zone is None:
# No authoritative server found
return DNSResponse(
query_id=query.query_id,
domain=query.domain,
record_type=query.record_type,
records=[],
authoritative=False,
)
# Query the authoritative server
server_queue = self.root_servers[matching_zone]
await server_queue.put((self.response_queue, query))
# Wait for response
response = await self.response_queue.get()
return response
Our simplified resolver directly contacts authoritative servers. Real DNS resolvers walk the hierarchy starting from root servers, but the principle is the same: cache aggressively, query only when necessary.
Clients
The DNS client wraps a response queue and a query counter:
class DNSClient(Process):
"""A DNS client that performs lookups."""
def init(self, name: str, resolver_queue: Queue):
self.name = name
self.resolver_queue = resolver_queue
self.response_queue = Queue(self._env)
self.query_counter = 0
self.queries_sent = 0
async def run(self):
"""Override in subclass to perform lookups."""
pass
lookup constructs a query, sends it to the resolver, and prints the response:
async def lookup(self, domain: str, record_type: RecordType = RecordType.A):
"""Perform a DNS lookup."""
self.query_counter += 1
query = DNSQuery(
query_id=self.query_counter, domain=domain, record_type=record_type
)
print(
f"[{self.now:.3f}] {self.name}: Looking up {domain} ({record_type.value})"
)
# Send query to resolver
await self.resolver_queue.put((self.response_queue, query))
self.queries_sent += 1
# Wait for response
response = await self.response_queue.get()
# Print results
if response.records:
cache_status = " (cached)" if response.from_cache else ""
print(f"[{self.now:.3f}] {self.name}: Resolved {domain}{cache_status}:")
for record in response.records:
print(f" {record.name} -> {record.value} (TTL: {record.ttl}s)")
else:
print(f"[{self.now:.3f}] {self.name}: No records found for {domain}")
return response
Clients send queries to recursive resolvers and wait for responses. In real systems, clients also cache locally and can query multiple resolvers for redundancy.
Running a Simulation
Let's see DNS resolution in action by building a client:
class TestClient(DNSClient):
"""Test client that performs a series of lookups."""
async def run(self):
"""Perform test lookups."""
# Lookup example.com
await self.lookup("www.example.com")
await self.timeout(1.0)
# Lookup again - should be cached
await self.lookup("www.example.com")
await self.timeout(1.0)
# Lookup different domain
await self.lookup("mail.example.com")
await self.timeout(1.0)
# Lookup from different zone
await self.lookup("www.another.org")
We can now create the overall simulation:
def main():
"""Simulate DNS resolution with caching."""
env = Environment()
# Create authoritative servers for different zones
example_queue = Queue(env)
example_server = AuthoritativeDNSServer(
env, "ns1.example.com", "example.com", example_queue
)
# Add records to example.com zone
example_server.add_record(
DNSRecord("www.example.com", RecordType.A, "192.0.2.1", ttl=300)
)
example_server.add_record(
DNSRecord("mail.example.com", RecordType.A, "192.0.2.2", ttl=300)
)
example_server.add_record(
DNSRecord("ftp.example.com", RecordType.CNAME, "www.example.com", ttl=300)
)
# Create authoritative server for another.org
another_queue = Queue(env)
another_server = AuthoritativeDNSServer(
env, "ns1.another.org", "another.org", another_queue
)
another_server.add_record(
DNSRecord("www.another.org", RecordType.A, "198.51.100.1", ttl=300)
)
# Create recursive resolver
resolver_queue = Queue(env)
resolver = RecursiveDNSResolver(
env,
"resolver.isp.net",
resolver_queue,
root_servers={"example.com": example_queue, "another.org": another_queue},
)
# Create test clients
TestClient(env, "client1.local", resolver_queue)
# Run simulation
env.run(until=10)
# Print statistics
print("\n=== DNS Resolution Statistics ===")
print(f"Example.com server: {example_server.queries_served} queries")
print(f"Another.org server: {another_server.queries_served} queries")
print("\nResolver:")
print(f" Total queries: {resolver.queries_received}")
print(f" Cache hits: {resolver.cache_hits}")
print(f" Cache misses: {resolver.cache_misses}")
if resolver.queries_received > 0:
hit_rate = (resolver.cache_hits / resolver.queries_received) * 100
print(f" Cache hit rate: {hit_rate:.1f}%")
The output shows that the first lookup of www.example.com is a cache miss,
necessitating a query to the authoritative server,
but the second lookup is a cache hit.
Different domains are cache misses until they're cached,
and the cache hit rate improves as the simulation runs.
In other words,
the first user to look up a domain pays the full resolution cost,
but subsequent users (or repeated lookups) get instant responses.
[0.000] client1.local: Looking up www.example.com (A)
[0.000] resolver.isp.net: Received query for www.example.com (A)
[0.000] resolver.isp.net: Cache MISS for www.example.com
[0.000] ns1.example.com: Query for www.example.com (A) -> 0 record(s)
[0.010] client1.local: No records found for www.example.com
[1.010] client1.local: Looking up www.example.com (A)
[1.010] resolver.isp.net: Received query for www.example.com (A)
[1.010] resolver.isp.net: Cache MISS for www.example.com
[1.010] ns1.example.com: Query for www.example.com (A) -> 0 record(s)
[1.020] client1.local: No records found for www.example.com
[2.020] client1.local: Looking up mail.example.com (A)
[2.020] resolver.isp.net: Received query for mail.example.com (A)
[2.020] resolver.isp.net: Cache MISS for mail.example.com
[2.020] ns1.example.com: Query for mail.example.com (A) -> 0 record(s)
[2.030] client1.local: No records found for mail.example.com
[3.030] client1.local: Looking up www.another.org (A)
[3.030] resolver.isp.net: Received query for www.another.org (A)
[3.030] resolver.isp.net: Cache MISS for www.another.org
[3.030] ns1.another.org: Query for www.another.org (A) -> 0 record(s)
[3.040] client1.local: No records found for www.another.org
=== DNS Resolution Statistics ===
Example.com server: 3 queries
Another.org server: 1 queries
Resolver:
Total queries: 4
Cache hits: 0
Cache misses: 4
Cache hit rate: 0.0%
ex_hierarchy.py is a more complex simulation
that demonstrates several key properties of DNS,
including the use of multiple resolvers,
load distribution,
and the effectiveness of multiple caches.
Its output shows how the cache hit rate increases over time as more domains get cached.
Hierarchical Resolution
Our simplified resolver jumps directly to the best-matching authoritative server. Real DNS resolvers cannot do this: they only know about the root servers at startup and must walk the delegation chain to find the authoritative server for a new domain.
HierarchicalResolver implements this:
class HierarchicalResolver(Process):
"""Recursive resolver that walks the hierarchy from root servers.
The resolver starts every query at a root server.
When a server returns an NS referral (the zone is delegated to another
server), the resolver follows the referral. This continues until it
reaches an authoritative server that can answer the query directly.
"""
def init(
self,
name: str,
request_queue: Queue,
root_server_queue: Queue,
):
self.name = name
self.request_queue = request_queue
# The single entry point into the hierarchy.
self.root_server_queue = root_server_queue
self.response_queue: Queue = Queue(self._env)
self.cache: dict[tuple[str, RecordType], list[CacheEntry]] = {}
self.queries_received = 0
self.cache_hits = 0
self.cache_misses = 0
self.referrals_followed = 0
Its _walk_hierarchy method starts at the root server,
follows NS referrals until it reaches an authoritative server, and returns the answer:
async def _walk_hierarchy(self, query: DNSQuery) -> DNSResponse:
"""Walk the DNS delegation hierarchy from root to authoritative server.
The algorithm:
1. Start at the root server.
2. Send the query.
3. If the response is authoritative, return it.
4. If the response is a referral (NS records), follow the first referral
by looking up the referred zone's server and repeating from step 2.
5. If neither, return a not-found response.
In real DNS the resolver must also resolve the IP address of the
referred name server (a glue record lookup) if it is not already known.
We simulate this by giving each server a reference to the next layer.
"""
current_server_queue = self.root_server_queue
labels = query.domain.split(".")
# Walk up to len(labels) levels of delegation.
for depth in range(len(labels) + 1):
await current_server_queue.put((self.response_queue, query))
response = await self.response_queue.get()
if response.authoritative:
# We reached an authoritative server.
return response
# Look for NS referrals in the response.
ns_records = [r for r in response.records if r.record_type == RecordType.NS]
if ns_records:
# The NS record's value is the queue for the delegated zone server.
# In real DNS, NS values are hostnames that must be resolved
# to IP addresses; here we use the hostname as a direct key
# into the resolver's known server map.
referred_zone = ns_records[0].value
next_queue = self._known_servers.get(referred_zone)
if next_queue is None:
print(
f"[{self.now:.3f}] {self.name}: "
f"Unknown referral target {referred_zone}"
)
break
self.referrals_followed += 1
current_server_queue = next_queue
print(
f"[{self.now:.3f}] {self.name}: "
f"Following referral to {referred_zone} (depth {depth + 1})"
)
else:
# No answer, no referral.
break
return DNSResponse(
query_id=query.query_id,
domain=query.domain,
record_type=query.record_type,
records=[],
authoritative=False,
)
def register_server(self, zone: str, server_queue: Queue) -> None:
"""Register the queue for a delegated zone's server."""
if not hasattr(self, "_known_servers"):
self._known_servers: dict[str, Queue] = {}
self._known_servers[zone] = server_queue
The first query for www.example.com requires two round-trips—one to the root, one to .com—
before reaching example.com's authoritative server.
All subsequent queries for anything in example.com can skip this walk because the answers are cached.
Negative Caching
The HierarchicalResolver also caches negative responses (NXDOMAIN):
def _cache_negative(self, domain: str, record_type: RecordType) -> None:
"""Cache a negative response (NXDOMAIN) to reduce redundant queries.
Negative caching prevents the resolver from querying the authoritative
server again for a domain that does not exist. The TTL for negative
responses is typically found in the SOA record of the authoritative
zone; we use a fixed value here.
"""
NEGATIVE_TTL = 300 # 5 minutes, a common default
key = (domain, record_type)
# Store an empty entry list with an expiry. _check_cache returns None
# for an empty list, so we store a sentinel entry instead.
sentinel = DNSRecord(
name=domain,
value="NXDOMAIN",
record_type=record_type,
ttl=NEGATIVE_TTL,
)
entry = CacheEntry(sentinel, self.now + NEGATIVE_TTL)
self.cache[key] = [entry]
print(
f"[{self.now:.3f}] {self.name}: "
f"Cached NXDOMAIN for {domain} (TTL {NEGATIVE_TTL}s)"
)
def _is_negative_entry(self, record: DNSRecord) -> bool:
"""Return True if this is a negative-cache sentinel record."""
return record.value == "NXDOMAIN"
Without negative caching, every lookup of a non-existent domain (a typo, a deleted record) results in a full hierarchy walk—one round-trip to the root, one to the TLD, one to the authoritative server. Caching "this domain does not exist" for the TTL specified in the zone's SOA record means that repeated lookups of the same nonexistent name are answered locally. This is especially important for protecting authoritative servers from DNS amplification attacks, where an attacker repeatedly queries nonexistent subdomains to exhaust server resources.
Real-World Considerations
Our implementation simplifies several DNS complexities:
-
Hierarchy walking: The
HierarchicalResolverabove implements the basic walk. Real resolvers also handle glue records (A records for name servers embedded in referral responses) so they don't need a second hierarchy walk just to resolve the name server's IP address. -
Multiple answers: DNS records can have multiple values (like multiple A records for load balancing). Real clients choose among them.
-
Negative caching: DNS caches "this domain doesn't exist" responses to avoid repeatedly querying for non-existent domains.
-
Anycast: Root servers use anycast (many servers sharing one IP address) to distribute load and improve availability.
-
TTL selection: Low TTLs mean changes propagate quickly but increase query load. High TTLs reduce load but slow down changes.
The most important omission in our implementation, however, is security. DNS was designed without it; DNSSEC adds cryptographic signatures to prevent spoofing and cache poisoning.
Exercises
-
In the basic simulation, look up the same domain three times in a row. On which lookups is the cache hit? Now look up a domain whose TTL is 1 second, wait 2 seconds, and look it up again. What happens? (Hint: check how
CacheEntry.is_expireduses the simulation clock.) -
The simplified resolver matches the longest zone suffix. For example,
api.example.comwould matchexample.combefore.com. Construct a test with zones".","com", and"example.com". Query forapi.example.comand trace which server handles it and why. -
Negative caching stores a sentinel record with value
"NXDOMAIN". Simulate a sequence where a domain genuinely does not exist, then (after the negative TTL expires) a record is added for it and queried again. Does the hierarchical resolver eventually return the correct answer? What would happen if the negative TTL were longer than the positive TTL? -
Real DNS resolvers query multiple root servers for redundancy and choose the fastest. Modify
HierarchicalResolverto accept a list of root server queues and send the query to the first one that responds. (Starter: useasimpy.FirstOfto wait on multiple queues.) -
DNS cache poisoning is an attack where a malicious responder injects a false record into a resolver's cache. In our simulation, what would happen if an attacker intercepted the resolver's query and replied with a forged IP address before the real authoritative server responded? What does DNSSEC do to prevent this, and at what cost?