Domain Name System (DNS)

Trace the full resolution walk from root name server to TLD to authoritative server for a domain name.
Explain how TTL-based caching reduces DNS query load and why negative caching (for non-existent domains) matters.
Describe how cache poisoning works and what DNSSEC does to prevent it.
Explain the difference between a recursive resolver and an iterative resolver.

Every time you visit a website or send an email, your computer translates a human-readable name like "www.example.com" into an IP addresses like "192.0.2.1". Domain Name System (DNS) is what makes this possible. DNS is a distributed database, and is one of the internet's most critical infrastructure components. Among other things, it enables service mobility: a website can change servers and IP addresses without users noticing, because the DNS mapping can be updated.

How DNS Works

DNS is a hierarchical, distributed database that handles over a trillion queries per day. Instead of one central server knowing all domain names, which couldn't possibly scale, the work is distributed across millions of servers organized in a tree structure:

13 root servers (though there are actually many more physical servers).
Top-level domain (TLD) servers for .com, .org, .net, etc.
Authoritative name servers specific to each domain like example.com.
Recursive resolvers, such as your ISP's DNS server or public DNS like 8.8.8.8.

When you look up www.example.com, your computer asks a recursive resolver, which may:

ask a root server "where's .com?"
ask the .com server "where's example.com?"
ask the example.com server "what's www.example.com?"
return the answer to you

This process is called recursive resolution. It scales because no single server needs to know everything. Instead, each server knows its piece of the hierarchy and where to find the next level.

DNS would be unusably slow if every lookup required multiple round-trips through the hierarchy. The solution is caching: resolvers remember answers for a period referred to as Time to Live (TTL). Popular domains like wikipedia.org are cached at every level so that most queries are answered from cache in milliseconds. This ensures that root servers handle surprisingly few queries despite being at the top of the hierarchy. It is also why changing a DNS record doesn't take effect immediately: old values persist in caches until their TTL expires.

Implementation

DNS stores different types of records. Our implementation will focus on A and CNAME records, which handle most web traffic. As in previous chapters, we start by enumerating the different types of records:

class RecordType(Enum):
    """DNS record types."""

    A = "A"  # IPv4 address
    AAAA = "AAAA"  # IPv6 address
    CNAME = "CNAME"  # Canonical name (alias)
    NS = "NS"  # Name server
    MX = "MX"  # Mail exchange

and then define a DNS resource record:

@dataclass
class DNSRecord:
    """A DNS resource record."""

    name: str  # Domain name
    value: str  # IP address, domain name, etc.
    record_type: RecordType
    ttl: int = 3600  # Time to live in seconds

We also need to define the structure of queries and responses:

@dataclass
class DNSQuery:
    """A DNS query message."""

    query_id: int
    domain: str
    record_type: RecordType

@dataclass
class DNSResponse:
    """A DNS response message."""

    query_id: int
    domain: str
    record_type: RecordType
    records: list[DNSRecord]
    authoritative: bool = False
    from_cache: bool = False

The next step is to create the authoritative DNS server. The constructor registers the zone this server is authoritative for and creates its record store. add_record lets callers populate the zone before the simulation starts:

class AuthoritativeDNSServer(Process):
    """An authoritative DNS server for a specific zone."""

    def init(self, name: str, zone: str, request_queue: Queue):
        self.name = name
        self.zone = zone  # e.g., "example.com"
        self.request_queue = request_queue
        self.records: dict[tuple[str, RecordType], list[DNSRecord]] = {}
        self.queries_served = 0

    def add_record(self, record: DNSRecord):
        """Add a DNS record to this server's zone."""
        key = (record.name, record.record_type)
        if key not in self.records:
            self.records[key] = []
        self.records[key].append(record)

The run loop receives queries and builds responses. If the query domain ends in this server's zone, the server looks up the record directly; it also follows CNAME chains to resolve aliases to their target A records:

    async def run(self):
        """Process DNS queries."""
        while True:
            # Wait for a query
            client_queue, query = await self.request_queue.get()

            # Check if this query is for our zone
            if not query.domain.endswith(self.zone):
                # Not authoritative for this domain
                response = DNSResponse(
                    query_id=query.query_id,
                    domain=query.domain,
                    record_type=query.record_type,
                    records=[],
                    authoritative=False,
                )
            else:
                # Look up the record
                key = (query.domain, query.record_type)
                records = self.records.get(key, [])

                # Handle CNAME (alias) records
                if not records and query.record_type == RecordType.A:
                    cname_key = (query.domain, RecordType.CNAME)
                    cname_records = self.records.get(cname_key, [])
                    if cname_records:
                        # Follow the CNAME
                        canonical_name = cname_records[0].value
                        a_key = (canonical_name, RecordType.A)
                        records = self.records.get(a_key, [])
                        records = cname_records + records

                response = DNSResponse(
                    query_id=query.query_id,
                    domain=query.domain,
                    record_type=query.record_type,
                    records=records,
                    authoritative=True,
                )

            print(
                f"[{self.now:.3f}] {self.name}: Query for {query.domain} "
                f"({query.record_type.value}) -> {len(response.records)} record(s)"
            )

            # Simulate processing delay
            await self.timeout(0.01)

            # Send response
            await client_queue.put(response)
            self.queries_served += 1

The recursive resolver is more complex because it must cache and coordinate with authoritative servers. A CacheEntry wraps a record with its expiry time:

class CacheEntry:
    """A cached DNS record with expiration time."""

    def __init__(self, record: DNSRecord, expire_time: float):
        self.record = record
        self.expire_time = expire_time

    def is_expired(self, current_time: float) -> bool:
        """Check if this cache entry has expired."""
        return current_time >= self.expire_time

The resolver's constructor sets up the cache and tracks hit and miss statistics:

class RecursiveDNSResolver(Process):
    """A recursive DNS resolver with caching."""

    def init(
        self,
        name: str,
        request_queue: Queue,
        root_servers: dict[str, Queue],
    ):
        self.name = name
        self.request_queue = request_queue
        self.root_servers = root_servers  # zone -> server queue
        self.cache: dict[tuple[str, RecordType], list[CacheEntry]] = {}
        self.response_queue = Queue(self._env)
        self.queries_received = 0
        self.cache_hits = 0
        self.cache_misses = 0

The run loop checks the cache before forwarding to an authoritative server. On a cache miss it calls _resolve_recursive, then stores the results:

    async def run(self):
        """Process client DNS queries."""
        while True:
            # Wait for a query from a client
            client_queue, query = await self.request_queue.get()
            self.queries_received += 1

            print(
                f"[{self.now:.3f}] {self.name}: Received query for "
                f"{query.domain} ({query.record_type.value})"
            )

            # Check cache first
            cached_records = self._check_cache(query.domain, query.record_type)

            if cached_records:
                self.cache_hits += 1
                response = DNSResponse(
                    query_id=query.query_id,
                    domain=query.domain,
                    record_type=query.record_type,
                    records=cached_records,
                    authoritative=False,
                    from_cache=True,
                )
                print(f"[{self.now:.3f}] {self.name}: Cache HIT for {query.domain}")
            else:
                self.cache_misses += 1
                print(f"[{self.now:.3f}] {self.name}: Cache MISS for {query.domain}")
                # Recursively resolve
                response = await self._resolve_recursive(query)

                # Cache the results
                if response.records:
                    self._cache_records(response.records)

            # Send response to client
            await client_queue.put(response)

_check_cache filters out expired entries and returns None on a complete miss. _cache_records stores records with expiry times computed from each record's TTL:

    def _check_cache(
        self, domain: str, record_type: RecordType
    ) -> Optional[list[DNSRecord]]:
        """Check if we have a cached record."""
        key = (domain, record_type)
        if key not in self.cache:
            return None

        # Filter out expired entries
        valid_entries = [
            entry for entry in self.cache[key] if not entry.is_expired(self.now)
        ]

        if not valid_entries:
            del self.cache[key]
            return None

        self.cache[key] = valid_entries
        return [entry.record for entry in valid_entries]

    def _cache_records(self, records: list[DNSRecord]):
        """Add records to the cache."""
        for record in records:
            key = (record.name, record.record_type)
            if key not in self.cache:
                self.cache[key] = []

            expire_time = self.now + record.ttl
            self.cache[key].append(CacheEntry(record, expire_time))

_resolve_recursive finds the authoritative server whose zone suffix matches the query domain most specifically, then forwards the query and returns the response. Each cached entry has an expiration time based on the record's TTL.

    async def _resolve_recursive(self, query: DNSQuery) -> DNSResponse:
        """Recursively resolve a query by querying authoritative servers."""
        # Find the appropriate authoritative server
        # In real DNS, this involves walking the DNS hierarchy
        # For simplicity, we'll match the longest zone suffix

        matching_zone = None
        for zone in self.root_servers.keys():
            if query.domain.endswith(zone):
                if matching_zone is None or len(zone) > len(matching_zone):
                    matching_zone = zone

        if matching_zone is None:
            # No authoritative server found
            return DNSResponse(
                query_id=query.query_id,
                domain=query.domain,
                record_type=query.record_type,
                records=[],
                authoritative=False,
            )

        # Query the authoritative server
        server_queue = self.root_servers[matching_zone]
        await server_queue.put((self.response_queue, query))

        # Wait for response
        response = await self.response_queue.get()
        return response

Our simplified resolver directly contacts authoritative servers. Real DNS resolvers walk the hierarchy starting from root servers, but the principle is the same: cache aggressively, query only when necessary.

Clients

The DNS client wraps a response queue and a query counter:

class DNSClient(Process):
    """A DNS client that performs lookups."""

    def init(self, name: str, resolver_queue: Queue):
        self.name = name
        self.resolver_queue = resolver_queue
        self.response_queue = Queue(self._env)
        self.query_counter = 0
        self.queries_sent = 0

    async def run(self):
        """Override in subclass to perform lookups."""
        pass

lookup constructs a query, sends it to the resolver, and prints the response:

    async def lookup(self, domain: str, record_type: RecordType = RecordType.A):
        """Perform a DNS lookup."""
        self.query_counter += 1
        query = DNSQuery(
            query_id=self.query_counter, domain=domain, record_type=record_type
        )

        print(
            f"[{self.now:.3f}] {self.name}: Looking up {domain} ({record_type.value})"
        )

        # Send query to resolver
        await self.resolver_queue.put((self.response_queue, query))
        self.queries_sent += 1

        # Wait for response
        response = await self.response_queue.get()

        # Print results
        if response.records:
            cache_status = " (cached)" if response.from_cache else ""
            print(f"[{self.now:.3f}] {self.name}: Resolved {domain}{cache_status}:")
            for record in response.records:
                print(f"  {record.name} -> {record.value} (TTL: {record.ttl}s)")
        else:
            print(f"[{self.now:.3f}] {self.name}: No records found for {domain}")

        return response

Clients send queries to recursive resolvers and wait for responses. In real systems, clients also cache locally and can query multiple resolvers for redundancy.

Running a Simulation

Let's see DNS resolution in action by building a client:

class TestClient(DNSClient):
    """Test client that performs a series of lookups."""

    async def run(self):
        """Perform test lookups."""
        # Lookup example.com
        await self.lookup("www.example.com")
        await self.timeout(1.0)

        # Lookup again - should be cached
        await self.lookup("www.example.com")
        await self.timeout(1.0)

        # Lookup different domain
        await self.lookup("mail.example.com")
        await self.timeout(1.0)

        # Lookup from different zone
        await self.lookup("www.another.org")

We can now create the overall simulation:

def main():
    """Simulate DNS resolution with caching."""
    env = Environment()

    # Create authoritative servers for different zones
    example_queue = Queue(env)
    example_server = AuthoritativeDNSServer(
        env, "ns1.example.com", "example.com", example_queue
    )

    # Add records to example.com zone
    example_server.add_record(
        DNSRecord("www.example.com", RecordType.A, "192.0.2.1", ttl=300)
    )
    example_server.add_record(
        DNSRecord("mail.example.com", RecordType.A, "192.0.2.2", ttl=300)
    )
    example_server.add_record(
        DNSRecord("ftp.example.com", RecordType.CNAME, "www.example.com", ttl=300)
    )

    # Create authoritative server for another.org
    another_queue = Queue(env)
    another_server = AuthoritativeDNSServer(
        env, "ns1.another.org", "another.org", another_queue
    )
    another_server.add_record(
        DNSRecord("www.another.org", RecordType.A, "198.51.100.1", ttl=300)
    )

    # Create recursive resolver
    resolver_queue = Queue(env)
    resolver = RecursiveDNSResolver(
        env,
        "resolver.isp.net",
        resolver_queue,
        root_servers={"example.com": example_queue, "another.org": another_queue},
    )

    # Create test clients
    TestClient(env, "client1.local", resolver_queue)

    # Run simulation
    env.run(until=10)

    # Print statistics
    print("\n=== DNS Resolution Statistics ===")
    print(f"Example.com server: {example_server.queries_served} queries")
    print(f"Another.org server: {another_server.queries_served} queries")
    print("\nResolver:")
    print(f"  Total queries: {resolver.queries_received}")
    print(f"  Cache hits: {resolver.cache_hits}")
    print(f"  Cache misses: {resolver.cache_misses}")
    if resolver.queries_received > 0:
        hit_rate = (resolver.cache_hits / resolver.queries_received) * 100
        print(f"  Cache hit rate: {hit_rate:.1f}%")

The output shows that the first lookup of www.example.com is a cache miss, necessitating a query to the authoritative server, but the second lookup is a cache hit. Different domains are cache misses until they're cached, and the cache hit rate improves as the simulation runs. In other words, the first user to look up a domain pays the full resolution cost, but subsequent users (or repeated lookups) get instant responses.

[0.000] client1.local: Looking up www.example.com (A)
[0.000] resolver.isp.net: Received query for www.example.com (A)
[0.000] resolver.isp.net: Cache MISS for www.example.com
[0.000] ns1.example.com: Query for www.example.com (A) -> 0 record(s)
[0.010] client1.local: No records found for www.example.com
[1.010] client1.local: Looking up www.example.com (A)
[1.010] resolver.isp.net: Received query for www.example.com (A)
[1.010] resolver.isp.net: Cache MISS for www.example.com
[1.010] ns1.example.com: Query for www.example.com (A) -> 0 record(s)
[1.020] client1.local: No records found for www.example.com
[2.020] client1.local: Looking up mail.example.com (A)
[2.020] resolver.isp.net: Received query for mail.example.com (A)
[2.020] resolver.isp.net: Cache MISS for mail.example.com
[2.020] ns1.example.com: Query for mail.example.com (A) -> 0 record(s)
[2.030] client1.local: No records found for mail.example.com
[3.030] client1.local: Looking up www.another.org (A)
[3.030] resolver.isp.net: Received query for www.another.org (A)
[3.030] resolver.isp.net: Cache MISS for www.another.org
[3.030] ns1.another.org: Query for www.another.org (A) -> 0 record(s)
[3.040] client1.local: No records found for www.another.org

=== DNS Resolution Statistics ===
Example.com server: 3 queries
Another.org server: 1 queries

Resolver:
  Total queries: 4
  Cache hits: 0
  Cache misses: 4
  Cache hit rate: 0.0%

ex_hierarchy.py is a more complex simulation that demonstrates several key properties of DNS, including the use of multiple resolvers, load distribution, and the effectiveness of multiple caches. Its output shows how the cache hit rate increases over time as more domains get cached.

Hierarchical Resolution

Our simplified resolver jumps directly to the best-matching authoritative server. Real DNS resolvers cannot do this: they only know about the root servers at startup and must walk the delegation chain to find the authoritative server for a new domain.

HierarchicalResolver implements this:

class HierarchicalResolver(Process):
    """Recursive resolver that walks the hierarchy from root servers.

    The resolver starts every query at a root server.
    When a server returns an NS referral (the zone is delegated to another
    server), the resolver follows the referral.  This continues until it
    reaches an authoritative server that can answer the query directly.
    """

    def init(
        self,
        name: str,
        request_queue: Queue,
        root_server_queue: Queue,
    ):
        self.name = name
        self.request_queue = request_queue
        # The single entry point into the hierarchy.
        self.root_server_queue = root_server_queue
        self.response_queue: Queue = Queue(self._env)

        self.cache: dict[tuple[str, RecordType], list[CacheEntry]] = {}
        self.queries_received = 0
        self.cache_hits = 0
        self.cache_misses = 0
        self.referrals_followed = 0

Its _walk_hierarchy method starts at the root server, follows NS referrals until it reaches an authoritative server, and returns the answer:

    async def _walk_hierarchy(self, query: DNSQuery) -> DNSResponse:
        """Walk the DNS delegation hierarchy from root to authoritative server.

        The algorithm:
        1. Start at the root server.
        2. Send the query.
        3. If the response is authoritative, return it.
        4. If the response is a referral (NS records), follow the first referral
           by looking up the referred zone's server and repeating from step 2.
        5. If neither, return a not-found response.

        In real DNS the resolver must also resolve the IP address of the
        referred name server (a glue record lookup) if it is not already known.
        We simulate this by giving each server a reference to the next layer.
        """
        current_server_queue = self.root_server_queue
        labels = query.domain.split(".")

        # Walk up to len(labels) levels of delegation.
        for depth in range(len(labels) + 1):
            await current_server_queue.put((self.response_queue, query))
            response = await self.response_queue.get()

            if response.authoritative:
                # We reached an authoritative server.
                return response

            # Look for NS referrals in the response.
            ns_records = [r for r in response.records if r.record_type == RecordType.NS]
            if ns_records:
                # The NS record's value is the queue for the delegated zone server.
                # In real DNS, NS values are hostnames that must be resolved
                # to IP addresses; here we use the hostname as a direct key
                # into the resolver's known server map.
                referred_zone = ns_records[0].value
                next_queue = self._known_servers.get(referred_zone)
                if next_queue is None:
                    print(
                        f"[{self.now:.3f}] {self.name}: "
                        f"Unknown referral target {referred_zone}"
                    )
                    break
                self.referrals_followed += 1
                current_server_queue = next_queue
                print(
                    f"[{self.now:.3f}] {self.name}: "
                    f"Following referral to {referred_zone} (depth {depth + 1})"
                )
            else:
                # No answer, no referral.
                break

        return DNSResponse(
            query_id=query.query_id,
            domain=query.domain,
            record_type=query.record_type,
            records=[],
            authoritative=False,
        )

    def register_server(self, zone: str, server_queue: Queue) -> None:
        """Register the queue for a delegated zone's server."""
        if not hasattr(self, "_known_servers"):
            self._known_servers: dict[str, Queue] = {}
        self._known_servers[zone] = server_queue

The first query for www.example.com requires two round-trips—one to the root, one to .com— before reaching example.com's authoritative server. All subsequent queries for anything in example.com can skip this walk because the answers are cached.

Negative Caching

The HierarchicalResolver also caches negative responses (NXDOMAIN):

    def _cache_negative(self, domain: str, record_type: RecordType) -> None:
        """Cache a negative response (NXDOMAIN) to reduce redundant queries.

        Negative caching prevents the resolver from querying the authoritative
        server again for a domain that does not exist.  The TTL for negative
        responses is typically found in the SOA record of the authoritative
        zone; we use a fixed value here.
        """
        NEGATIVE_TTL = 300  # 5 minutes, a common default
        key = (domain, record_type)
        # Store an empty entry list with an expiry.  _check_cache returns None
        # for an empty list, so we store a sentinel entry instead.
        sentinel = DNSRecord(
            name=domain,
            value="NXDOMAIN",
            record_type=record_type,
            ttl=NEGATIVE_TTL,
        )
        entry = CacheEntry(sentinel, self.now + NEGATIVE_TTL)
        self.cache[key] = [entry]
        print(
            f"[{self.now:.3f}] {self.name}: "
            f"Cached NXDOMAIN for {domain} (TTL {NEGATIVE_TTL}s)"
        )

    def _is_negative_entry(self, record: DNSRecord) -> bool:
        """Return True if this is a negative-cache sentinel record."""
        return record.value == "NXDOMAIN"

Without negative caching, every lookup of a non-existent domain (a typo, a deleted record) results in a full hierarchy walk—one round-trip to the root, one to the TLD, one to the authoritative server. Caching "this domain does not exist" for the TTL specified in the zone's SOA record means that repeated lookups of the same nonexistent name are answered locally. This is especially important for protecting authoritative servers from DNS amplification attacks, where an attacker repeatedly queries nonexistent subdomains to exhaust server resources.

Real-World Considerations

Our implementation simplifies several DNS complexities:

Hierarchy walking: The HierarchicalResolver above implements the basic walk. Real resolvers also handle glue records (A records for name servers embedded in referral responses) so they don't need a second hierarchy walk just to resolve the name server's IP address.
Multiple answers: DNS records can have multiple values (like multiple A records for load balancing). Real clients choose among them.
Negative caching: DNS caches "this domain doesn't exist" responses to avoid repeatedly querying for non-existent domains.
Anycast: Root servers use anycast (many servers sharing one IP address) to distribute load and improve availability.
TTL selection: Low TTLs mean changes propagate quickly but increase query load. High TTLs reduce load but slow down changes.

The most important omission in our implementation, however, is security. DNS was designed without it; DNSSEC adds cryptographic signatures to prevent spoofing and cache poisoning.

Exercises

In the basic simulation, look up the same domain three times in a row. On which lookups is the cache hit? Now look up a domain whose TTL is 1 second, wait 2 seconds, and look it up again. What happens? (Hint: check how CacheEntry.is_expired uses the simulation clock.)
The simplified resolver matches the longest zone suffix. For example, api.example.com would match example.com before .com. Construct a test with zones ".", "com", and "example.com". Query for api.example.com and trace which server handles it and why.
Negative caching stores a sentinel record with value "NXDOMAIN". Simulate a sequence where a domain genuinely does not exist, then (after the negative TTL expires) a record is added for it and queried again. Does the hierarchical resolver eventually return the correct answer? What would happen if the negative TTL were longer than the positive TTL?
Real DNS resolvers query multiple root servers for redundancy and choose the fastest. Modify HierarchicalResolver to accept a list of root server queues and send the query to the first one that responds. (Starter: use asimpy.FirstOf to wait on multiple queues.)
DNS cache poisoning is an attack where a malicious responder injects a false record into a resolver's cache. In our simulation, what would happen if an attacker intercepted the resolver's query and replied with a forged IP address before the real authoritative server responded? What does DNSSEC do to prevent this, and at what cost?