DNS resolver internals: what happens between the stub and the recursor

When somebody asks “how does DNS work,” the canonical answer is the four-step diagram with the stub, recursor, root, and authoritative servers. That diagram is fine for a first introduction and leaves out most of what matters once you actually operate DNS.

Here’s what actually happens when your laptop resolves example.com.

The stub resolver

The stub on your laptop is not a recursor. It does almost no work. It sends a query — typically over UDP/53 — to whatever resolver is configured in /etc/resolv.conf or via DHCP. It sets the RD bit (recursion desired) and waits.
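
To make the stub’s job concrete, here is a minimal sketch of the query it builds: a 12-byte header with the RD bit set, followed by a single question. The function name build_query and the fixed default query ID are assumptions for illustration only; a real stub picks a random ID per query and also handles timeouts and retries.

```python
import struct

def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query with the RD (recursion desired) bit set.

    qtype 1 is an A record. The fixed default query_id is for reproducible
    output in this sketch; real stubs randomize it.
    """
    flags = 0x0100  # RD is the 0x0100 bit of the 16-bit flags field
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question

packet = build_query("example.com")
print(len(packet), packet.hex())
```

Sending it would be a single sock.sendto(packet, (resolver_ip, 53)) on a UDP socket, where resolver_ip stands for whatever resolver address your system has configured.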

If the response doesn’t fit in a UDP datagram, the server sets the TC (truncated) bit and the stub retries over TCP/53. This used to be rare. With DNSSEC signatures and large TXT records (SPF, DKIM, verification tokens), it’s not rare anymore. If your firewall blocks DNS over TCP, every lookup with an oversized response fails, and you won’t notice until one of those names matters.
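
The fallback decision hinges on one bit in the response header. The sketch below shows how a client might check the TC bit and how a message is framed for TCP, where DNS prefixes each message with a 2-byte length. The helper names is_truncated and frame_for_tcp are made up for this example.

```python
import struct

TC_FLAG = 0x0200  # truncation bit in the DNS header flags field

def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set in a DNS response header."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_FLAG)

def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message
```

A client following this logic sends the query over UDP first, and only if is_truncated(response) is true does it open a TCP connection and resend the same query wrapped with frame_for_tcp.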

The recursor

The recursor (your ISP’s, Cloudflare’s 1.1.1.1, a local Unbound instance) is where the actual work happens. Unlike a plain forwarder, which is what a Pi-hole is by default, it does not just relay the query upstream. It walks the delegation chain.

For an uncached example.com:

1. Query a root server. It answers with a referral to the .com servers: a.gtld-servers.net and friends.
2. Query a .com server. It answers with a referral to the authoritative servers for example.com.
3. Query an authoritative server for the example.com A record. This one returns the actual answer.

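The recursor caches the result of each step, referrals included, for that record’s TTL. Its job is essentially “minimize the work for next time”: the next lookup under .com skips the root, and the next lookup under example.com goes straight to the authoritative server.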
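
The walk and the caching can be sketched as a toy simulation. Everything here is hypothetical: the SERVERS table stands in for real root, TLD, and authoritative servers, the server names "root", "gtld", and "auth" are labels rather than hostnames, and the addresses come from the documentation range 192.0.2.0/24. No packets are sent. The point is only to show how cached referrals shorten later lookups.

```python
import time

# Hypothetical toy data standing in for real servers. Each "server" holds
# delegations (zone -> next server, TTL) and address records (name -> IP, TTL).
SERVERS = {
    "root": {"delegations": {"com.": ("gtld", 172800)}, "records": {}},
    "gtld": {"delegations": {"example.com.": ("auth", 172800)}, "records": {}},
    "auth": {
        "delegations": {},
        "records": {
            "example.com.": ("192.0.2.10", 300),
            "www.example.com.": ("192.0.2.11", 300),
        },
    },
}

def suffixes(qname):
    """Yield qname and each parent domain, longest first."""
    labels = qname.rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:]) + "."

class TTLCache:
    """Cache whose entries expire after their TTL (in seconds)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, key, value, ttl):
        self._entries[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

def resolve(qname, cache, trace):
    """Resolve qname to an address, appending each server contacted to trace."""
    cached = cache.get(("A", qname))
    if cached is not None:
        return cached
    # Start at the closest delegation already cached, otherwise at the root.
    server = "root"
    for zone in suffixes(qname):
        known = cache.get(("NS", zone))
        if known is not None:
            server = known
            break
    while True:
        trace.append(server)
        data = SERVERS[server]
        if qname in data["records"]:
            address, ttl = data["records"][qname]
            cache.put(("A", qname), address, ttl)
            return address
        # No answer here: follow the most specific referral and cache it.
        for zone in suffixes(qname):
            if zone in data["delegations"]:
                server, ttl = data["delegations"][zone]
                cache.put(("NS", zone), server, ttl)
                break
        else:
            raise LookupError(f"{qname}: no answer or referral from {trace[-1]}")

cache = TTLCache()
trace = []
print(resolve("example.com.", cache, trace), trace)
trace = []
print(resolve("www.example.com.", cache, trace), trace)
```

The first lookup contacts all three toy servers. The second, for a different name in the same zone, contacts only the authoritative one, because the referral for example.com is still cached.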

What the diagram skips

Several things that matter operationally:

Glue records. When .com returns ns1.example.com as the authoritative server for example.com, that’s circular — you can’t resolve ns1.example.com without already having example.com’s nameservers. The TLD server includes a glue record (the IP of ns1.example.com) in the additional section to break the loop.

Out-of-bailiwick nameservers. If example.com uses ns1.example.org as a nameserver, that name is out-of-bailiwick. The .com server won’t include glue for it, so the recursor has to pause, resolve ns1.example.org from scratch through .org, and only then continue with example.com. This is slower and more failure-prone.
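
Whether a nameserver name can carry glue comes down to a suffix check on labels. The sketch below is a simplification: it tests the name against the delegated zone only, and the function name in_bailiwick is invented for this example. Comparing whole labels rather than raw strings matters, otherwise ns1.badexample.com would look like it belongs to example.com.

```python
def in_bailiwick(ns_name: str, zone: str) -> bool:
    """Return True if ns_name is at or below zone, compared label by label."""
    zone_stripped = zone.lower().rstrip(".")
    if zone_stripped == "":
        return True  # every name is below the root zone
    ns_labels = ns_name.lower().rstrip(".").split(".")
    zone_labels = zone_stripped.split(".")
    # Slicing a too-short list returns the whole list, which then fails the
    # comparison, so names shorter than the zone are correctly rejected.
    return ns_labels[-len(zone_labels):] == zone_labels

print(in_bailiwick("ns1.example.com", "example.com"))
print(in_bailiwick("ns1.example.org", "example.com"))
```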

Negative caching. When you query nonexistent.example.com and the authoritative says NXDOMAIN, the recursor caches that negative answer too. Per RFC 2308, the cache time is the lesser of the SOA record’s own TTL and the SOA MINIMUM field. Set those too high and a name that was queried before you created it stays “nonexistent” in caches long after you add the record.
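
The negative TTL rule from RFC 2308 is a one-line calculation: take the lesser of the SOA record’s TTL and its MINIMUM field. The sketch below adds a resolver-side cap as well, since resolvers commonly limit how long they keep negative answers. The function name and the 3600-second default cap are assumptions for illustration, not the setting of any particular resolver.

```python
def negative_cache_ttl(soa_record_ttl: int, soa_minimum: int,
                       resolver_cap: int = 3600) -> int:
    """Seconds a resolver may cache an NXDOMAIN or no-data answer.

    RFC 2308: the lesser of the SOA record's TTL and the SOA MINIMUM field.
    resolver_cap is a hypothetical local limit; real resolvers vary.
    """
    return min(soa_record_ttl, soa_minimum, resolver_cap)

# A zone with a one-day SOA TTL but a 60-second MINIMUM field
print(negative_cache_ttl(soa_record_ttl=86400, soa_minimum=60))
```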

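EDNS. The original DNS protocol limited UDP responses to 512 bytes. EDNS0 (RFC 6891) lets the requester advertise a larger UDP payload size in an OPT record. For years 4096 bytes was the common default; since DNS Flag Day 2020 the recommended value is 1232 bytes, which avoids IP fragmentation. DNSSEC depends on EDNS entirely: the DO bit that asks for signatures lives in the OPT record. Most middleboxes handle EDNS correctly now. Some don’t.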
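
For reference, the OPT record is small and oddly shaped: it reuses the CLASS field for the advertised UDP payload size and the TTL field for flags. Below is a minimal sketch of building one with no options attached. The function name build_opt_record is invented here, and the 1232-byte default is a choice made for this example, following the DNS Flag Day 2020 recommendation.

```python
import struct

OPT_TYPE = 41          # RR type code for the OPT pseudo-record
DO_FLAG = 0x00008000   # "DNSSEC OK" bit inside the repurposed TTL field

def build_opt_record(udp_payload_size: int = 1232,
                     dnssec_ok: bool = False) -> bytes:
    """Build an EDNS0 OPT pseudo-record for a query's additional section."""
    # RFC 6891 says sizes under 512 are treated as 512; this sketch simply
    # rejects them instead.
    if not 512 <= udp_payload_size <= 65535:
        raise ValueError("UDP payload size must be between 512 and 65535")
    name = b"\x00"  # owner name is always the root
    # TTL field layout: extended RCODE (8 bits), version (8 bits), flags (16)
    ttl_field = DO_FLAG if dnssec_ok else 0
    # TYPE, CLASS (= payload size), TTL (= flags), RDLENGTH (= 0, no options)
    return name + struct.pack("!HHIH", OPT_TYPE, udp_payload_size, ttl_field, 0)

record = build_opt_record(dnssec_ok=True)
print(len(record), record.hex())
```

To use it, append the record to a query and set the header’s ARCOUNT to 1 so the server knows an additional record is present.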

Practical consequences

If you’re operating DNS, the model that matters is:

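TTL is the only cache-invalidation mechanism you control. There is no global purge button. Lower the TTL at least one full old-TTL period before a planned change, so every cache has picked up the short value by the time you cut over.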
Negative caching is real. Lower the SOA minimum TTL during active development.
Don’t put DNS on the same path as the service it serves. If example.com’s nameservers sit on the same infrastructure as example.com itself, one outage takes out both the service and the records you would change to route around it.
TCP fallback works only if your network allows it. Test with dig +tcp.
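
The TTL planning rule is simple arithmetic, sketched below. It assumes resolvers honor the TTLs you publish, which is not guaranteed, since some clamp TTLs to their own minimum or maximum. The function name plan_ttl_change and the example dates are made up.

```python
from datetime import datetime, timedelta, timezone

def plan_ttl_change(lowered_at: datetime, old_ttl: int, new_ttl: int):
    """Return (earliest safe cutover, time all caches hold the new record).

    A cache filled just before the TTL was lowered keeps the old record for
    old_ttl seconds. After cutover, stale copies live at most new_ttl seconds.
    """
    safe_cutover = lowered_at + timedelta(seconds=old_ttl)
    fully_converged = safe_cutover + timedelta(seconds=new_ttl)
    return safe_cutover, fully_converged

lowered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
cutover, converged = plan_ttl_change(lowered, old_ttl=86400, new_ttl=300)
print(cutover.isoformat(), converged.isoformat())
```

With a one-day old TTL, lowering it “hours in advance” is not enough: the earliest safe cutover is a full day after the TTL was lowered.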

The diagram is fine. The footnotes are where the bugs live.