Troubleshooting: Domain Controller Is Kind of Up

Symptoms

Out of nowhere, domain members start throwing errors about a broken trust relationship with the domain (such as "The trust relationship between this workstation and the primary domain failed"). You may also see general logon errors from interactive users and service accounts.

The behavior will likely be limited to one AD site, but will occur on seemingly random servers within that site.

Resolution

Check free space on the drive that holds the NTDS database (ntds.dit) on every domain controller in the domain. If any drive has filled up, expand it or free up space on it, then reboot the affected server.
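The free-space check on the NTDS drive could be scripted along these lines. This is only a minimal sketch, meant to be run locally on each DC: the thresholds (2 GiB or 10% free) and the function names are my own assumptions, not values from the incident, and the path you pass in should be whichever volume holds ntds.dit in your environment (by default that is under C:\Windows\NTDS).

```python
import shutil

# Hypothetical thresholds; tune them to your own drive-size standards.
MIN_FREE_BYTES = 2 * 1024**3   # 2 GiB
MIN_FREE_FRACTION = 0.10       # 10% of the volume


def is_low_on_space(total_bytes, free_bytes,
                    min_free_bytes=MIN_FREE_BYTES,
                    min_free_fraction=MIN_FREE_FRACTION):
    """Return True if the volume is below either free-space threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    below_absolute = free_bytes < min_free_bytes
    below_fraction = (free_bytes / total_bytes) < min_free_fraction
    return below_absolute or below_fraction


def check_volume(path):
    """Read usage for the volume containing `path` and flag it if low."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total": usage.total,
        "free": usage.free,
        "low": is_low_on_space(usage.total, usage.free),
    }
```

Calling check_volume(r"C:\Windows\NTDS") on a DC returns a small dictionary whose "low" entry tells you whether that volume needs attention. The same function could feed whatever alerting you already have.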

If you cannot log on to the servers via console or RDP, try to force a shutdown through the hypervisor or chassis (if applicable). As a last resort, do a manual power-down via the power button or power cable, then boot back up. See the Notes section for details of the incident that prompted this post.

Cause

Normally, when a Windows machine does a domain authentication/authorization check, quite a few things happen in the background. One of those things is finding a working domain controller (DC). During that process, queries are sent to DCs to determine functionality, and the client chooses one that responds (you can control some of those preferences in Active Directory Sites and Services, but that's another discussion). In this situation, the client is fooled into thinking the DC is up and working when it really isn't.

When a DC really does go down (such as a reboot or hard system error), it stops responding to those functionality requests from clients, so they simply use DNS to find a DC that is working and go on their way. If you have good infrastructure and your domain is configured well, it happens so quickly that users never notice. However, when a drive fills up, the services that respond to the functionality requests keep replying, giving clients the impression the DC is still up. In reality it won't service any requests, because the full drive prevents it from modifying the NTDS database.
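The difference between "down" and "kind of up" can be shown with a toy model. This is not the real DC locator algorithm; the class, the function names, and the DC names (DC1, DC2) are all made up for illustration. The only point is that the client picks the first DC that answers, and answering is not the same as being able to do the work.

```python
from dataclasses import dataclass


@dataclass
class DomainController:
    name: str
    powered_on: bool
    disk_full: bool

    def answers_ping(self) -> bool:
        # In this model, replying to the "are you there?" query only
        # requires the machine and its services to be running.
        return self.powered_on

    def can_authenticate(self) -> bool:
        # Real requests need the NTDS database to be writable,
        # which is not the case when the drive is full.
        return self.powered_on and not self.disk_full


def pick_dc(candidates):
    """Return the first DC that answers, or None if none do."""
    for dc in candidates:
        if dc.answers_ping():
            return dc
    return None


def logon(candidates):
    """Simulate one logon attempt against the chosen DC."""
    dc = pick_dc(candidates)
    if dc is None:
        return "no DC available"
    if dc.can_authenticate():
        return f"ok via {dc.name}"
    return f"failed via {dc.name}"


healthy = DomainController("DC2", powered_on=True, disk_full=False)

# DC1 is truly down: the client skips it and succeeds on DC2.
print(logon([DomainController("DC1", False, False), healthy]))

# DC1 has a full drive: it still answers, so the client picks it and fails.
print(logon([DomainController("DC1", True, True), healthy]))
```

In the second case, a client that happened to try DC2 first would have been fine, which matches the seemingly random pattern of affected servers.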

Notes

  • The behavior will probably only occur in one site because most domains are set up to have members "prefer" local site DCs for requests. If you don't have any custom sites set up in AD or only have a single site, that will not be a useful piece of troubleshooting info.
     
  • The behavior will affect random servers because many will still be using the other DC in the site that is still functioning (you have at least two DCs per site in your production datacenter domains, right?).
     
  • In the incident that prompted this post, the DC did not allow anyone to log on (either via console or RDP) while its drive was full. When it was forcefully rebooted in the hypervisor, it blue-screened on the first try. On the second attempt, it took 5+ minutes to come back up, but then did so normally. My assumption is the BSOD triggered the cleanup of some system logs/files, giving us about 450 MB of free space. We were then able to log in and expand the drive to our standard size, then reboot again to ensure all services came up in a normal state.
     
  • If the appropriate teams aren't already being notified of low disk events on DCs, configure your monitoring solution to send them.
