Issue
Ouch! In my test lab I had a serious issue, caused by myself. The lab runs 16 VMs, two of them DCs: DC01 and DC02. DC01 is the owner of all FSMO roles and is also the Enterprise Root CA.
The disk of DC01 dated back to Hyper-V 2008 R2, so it wasn't a VHDX file. Time to fix that, I thought, so I converted the disk to VHDX. But for some reason I decided to roll it back to the old VHD (that file was still present). And that's when the trouble started, since the DC treated this action as an unsupported rollback, also known as an update sequence number rollback, or USN rollback. And to be frank, I can't blame the DC, only myself.
But now my whole AD infrastructure was broken. The Netlogon service was paused because the DC itself had added an extra registry value (Dsa Not Writable) in order to prevent replication with DC02.
On top of that, DC01 had also disabled its inbound and outbound replication, as shown in the Directory Service event log on DC01:
- Event ID: 1115 > Outbound replication has been disabled by the user.
- Event ID: 1113 > Inbound replication has been disabled by the user.
So I had brought a serious issue upon myself, even if it was only my own test lab!
Cause
My own STUPID actions!
Case solved!
Solution
Since this happened late in the evening, I decided to leave it as it was and take a fresh look at it another day, when I was rested again. So this evening I finally cracked it, by going the reverse of the way it is normally done.
- I removed the earlier mentioned registry value, so the Netlogon service was no longer paused after a reboot (a sketch of this step follows the list below).
- First I tried the normal way, which is transferring the FSMO roles from the defective DC (DC01) to DC02 (see the FSMO sketch after this list). That didn't work out well, even though the transfer itself succeeded. DC02 became the owner of all FSMO roles, BUT since replication was broken, DC01 still thought it was the owner as well. And when I switched off DC01 everything came to a halt, so DC01 was still in charge, even though it was broken.
- So now I had TWO defective DCs! DC01 was totally isolated because of the replication blockage, but DC02 couldn't function WITHOUT DC01. So I feared that removing DC01 completely from AD would make things even worse.
- But re-enabling replication on DC01 would cause trouble as well, since BOTH DCs believed they owned all FSMO roles. On top of it all, DC01 is the Enterprise Root CA, so breaking that server would wreck my CA too. Ouch!
- Finally I concluded that at least ONE DC had to go, no matter what. And DC02 was ‘only’ a DC and nothing more, so DC02 had to be the one.
- So I ran a forced DC demotion on DC02 (see the demotion sketch after this list), which worked great. Afterwards I switched it off and marked it in Hyper-V as a demoted DC.
- Now I had to clean up the metadata referring to DC02 on DC01. To do that I used this article from www.Petri.co.il, which worked great as well (see the metadata cleanup sketch after this list). I also cleaned up DNS (the forward and reverse lookup zones) and removed DC02 from the Sites.
- So far so good. After a reboot of DC01 far fewer errors were shown in the Directory Service event log. But two events still worried me: Event ID 1115 and Event ID 1113.
- Soon I learned about a tool, repadmin. However, the article mentioning it was outdated and referred to Windows Server 2000. After some searching I found an updated version that works up to Windows Server 2008 R2; that tool is part of the Windows Server 2003 SP1 Support Tools. I downloaded it, ran the installer and YES, the tool was installed. Time for the next step.
- After searching the internet I found a TechNet Library page all about the repadmin commands, also covering Windows Server 2012! I know I run Windows Server 2012 R2, but it gave me hope that not all was lost. An outdated (Windows Server 2003 based!) article showed me the commands to force replication. Now it was time to put it all together.
- So I started an elevated cmd-prompt and ran these commands:
- repadmin /showreps to see the current replication status. This is what I got back:
Default-First-Site-Name\DC01
DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL
Site Options: (none)
DC object GUID: 541ca80e-cc21-4cd9-98cf-94fd2e0a73c5
DC invocationID: def47bba-aeeb-4ce4-9d0b-1ce8b9c71606
- Time to kick some ass! So I ran this command: repadmin /syncall
- And now it was time to remove the restrictions by running these two commands, one after the other: repadmin /options dc01 -DISABLE_OUTBOUND_REPL and repadmin /options dc01 -DISABLE_INBOUND_REPL. Both commands returned feedback telling me the restrictions were removed!
- Time to check it by running repadmin /showreps again. This is what I got:
Default-First-Site-Name\DC01
DC Options: IS_GC
Site Options: (none)
DC object GUID: 541ca80e-cc21-4cd9-98cf-94fd2e0a73c5
DC invocationID: def47bba-aeeb-4ce4-9d0b-1ce8b9c71606
- AWESOME! The restrictions were really gone. I cleared the Directory Service event log and rebooted the DC. When it came back up: NO MORE ERRORS!
- Soon after, I rolled out a new server which I promoted to a DC (DC03), as sketched below, and all is just fine now.
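For reference, here are a few sketches of the steps above; they are illustrations of the individual actions rather than exact transcripts of what I typed. First the registry value: when a DC detects a USN rollback, it normally sets the value Dsa Not Writable under the NTDS parameters key, and I assume that is the value my DC had added. A minimal PowerShell sketch of the removal step (run elevated on the affected DC, reboot afterwards):

# Show the flag the DC set when it detected the USN rollback (if present)
$ntds = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters'
Get-ItemProperty -Path $ntds -Name 'Dsa Not Writable' -ErrorAction SilentlyContinue
# Remove it so the Netlogon service is no longer paused after the next reboot
Remove-ItemProperty -Path $ntds -Name 'Dsa Not Writable'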
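The FSMO transfer itself can be scripted with the ActiveDirectory PowerShell module; a sketch of that route, moving all five roles to DC02:

# Requires the ActiveDirectory module (installed with the AD DS tools)
Import-Module ActiveDirectory
# Transfer all five FSMO roles to DC02; adding -Force would seize them instead of transferring
Move-ADDirectoryServerOperationMasterRole -Identity 'DC02' -OperationMasterRole SchemaMaster, DomainNamingMaster, PDCEmulator, RIDMaster, InfrastructureMaster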
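The forced demotion of DC02: on Windows Server 2012 R2 this can be done from Server Manager or with the ADDSDeployment PowerShell module. A sketch of the PowerShell route, run on DC02 itself:

# Forcefully demote this DC, even though it can no longer reach a healthy replication partner
Import-Module ADDSDeployment
Uninstall-ADDSDomainController -ForceRemoval -DemoteOperationMasterRole `
    -LocalAdministratorPassword (Read-Host 'New local admin password' -AsSecureString)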
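The metadata cleanup on DC01 is classically done with ntdsutil; a sketch of such a run, where the distinguished name of the DC02 server object is a placeholder you would adjust to your own domain:

# Remove the leftover NTDS Settings (DSA) object of the demoted DC02; the DN is a placeholder
ntdsutil "metadata cleanup" "remove selected server CN=DC02,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=mydomain,DC=local" quit quit

On Windows Server 2008 and later, deleting the DC's computer account in Active Directory Users and Computers triggers the same cleanup; the DNS records and the server object under Sites still have to be removed by hand.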
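Rolling out the replacement DC (DC03) comes down to installing the AD DS role and promoting the server into the existing domain; a PowerShell sketch, with the domain name as a placeholder:

# Install the AD DS role and promote this server to an additional DC in the existing domain
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
Import-Module ADDSDeployment
Install-ADDSDomainController -DomainName 'mydomain.local' -InstallDns -Credential (Get-Credential)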
Recap
As you can see, I didn't remove the problematic DC but removed the healthy one instead, and it worked out. It took me some time to figure out, but I am glad I solved it.
I learned my lessons these days, starting with not fiddling around with DCs, since you can wreck them no matter how rock solid Microsoft has made them. Microsoft simply can't protect your environment against stupid actions like the one I performed.
However, I also learned how to troubleshoot deep AD issues, so that's good, and now I am actually happy about everything that happened since I've learned so many new things.
As an old manager once said to me: ‘I don’t worry when my people make mistakes. I start worrying when they don’t make mistakes anymore, because those are the moments they aren’t working and, more importantly, aren’t learning!’