Thursday, March 20, 2014

Solving Domain Controller Issues: The Reverse Way To Recover From A USN Rollback

Issue
Ouch! In my test lab I had a serious issue, caused by myself. The lab runs 16 VMs, two of them DCs: DC01 and DC02. DC01 is the owner of all FSMO roles and is also the Enterprise Root CA.

The disk of DC01 was based on Hyper-V 2008 R2, so it wasn’t a VHDX file. Time to fix that, I thought. So I converted the disk to VHDX. But for some reason I decided to roll it back to the VHD (that file was still present). And that’s when the troubles started, since the DC looked upon this action as an unsupported rollback, also known as an update sequence number rollback, or USN rollback. And to be frank, I can’t blame the DC, only myself.

But now my whole AD infra was broken. The Netlogon service was paused because the DC itself had added an additional regkey (DSA not writable) in order to prevent replication with DC02.

On top of that, DC01 had also disabled its inbound and outbound replication, as displayed in the Directory Service event log on DC01:

  • Event ID: 1115 > Outbound replication has been disabled by the user.
  • Event ID: 1113 > Inbound replication has been disabled by the user.

So I had brought a serious issue upon myself, even though it was only my own test lab!

Cause
My own STUPID actions!

Case solved!
Since this happened late in the evening I decided to leave it like that and take a fresh look at it another day. So this evening I finally cracked it by doing things in the reverse order of how it’s normally done.

  1. I removed the earlier mentioned regkey so Netlogon service wasn’t paused anymore after a reboot.
  2. First I tried the normal way, which is transferring the FSMO roles from the defective DC (DC01) to DC02. But that didn’t work well, even though the transfer succeeded. DC02 was the owner of all FSMO roles BUT since replication was broken, DC01 still thought it was the owner as well. And when I switched off DC01 everything came to a halt, so DC01 was still in charge, even though it was broken.
  3. So now I had TWO defective DCs! DC01 was totally isolated because of the replication blockage, but DC02 couldn’t function WITHOUT DC01. So I feared that removing DC01 completely from AD would make things even worse.
  4. But enabling replication on DC01 would be bad as well, since BOTH DCs believed they were the owner of all FSMO roles. On top of it all, DC01 is the enterprise root CA, so breaking that server would wreck my CA as well. Ouch!
  5. Finally I concluded at least ONE DC had to go, no matter what. And DC02 was ‘only’ a DC and nothing more. So DC02 had to go, no matter what.
  6. So I ran a forced DC demotion on DC02 which worked great. Afterwards I switched it off and marked it in Hyper-V as a demoted DC.
  7. Now I had to clean up the metadata on DC01 that referred to DC02. In order to do that I used this article from www.Petri.co.il, which worked great as well. I also cleaned up DNS (forward and reverse lookup zones) and cleaned out the Sites.
  8. So far so good. After a reboot of DC01 far fewer errors were shown in the Directory Service event log. But two events still worried me: Event ID 1115 and Event ID 1113.
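  9. Soon I learned about a tool, repadmin. However, the article I found was outdated, referring to Windows Server 2000. After some searching I found out about an updated version working up to Windows Server 2008 R2, which is part of the Windows Server 2003 SP1 Support Tools. I downloaded it, ran the installer and YES, the tool was installed. (As hydeiman points out in the comments, the download should not be necessary: repadmin is included in the "AD DS Snap-ins and command line tools" feature of Windows Server 2008 R2 and above.) Time for the next step.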
  10. After searching on the internet I found this TechNet Library article all about the Repadmin commands, also for Windows Server 2012! I know I run Windows Server 2012 R2, but it gave me hope all wasn’t lost. This outdated (based on Windows Server 2003!) article showed me the commands to force replication. Now it was time to put it all together.
  11. So I started an elevated cmd-prompt and ran these commands:
    1. repadmin /showreps in order to see the current status of the replication. This is what I got back:

      Default-First-Site-Name\DC01
      DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL
      Site Options: (none)
      DC object GUID: 541ca80e-cc21-4cd9-98cf-94fd2e0a73c5
      DC invocationID: def47bba-aeeb-4ce4-9d0b-1ce8b9c71606

    2. Time to kick some ass! So I ran this command: repadmin /syncall
    3. And now it was time to remove the constraints by running these two commands, one after the other: repadmin /options dc01 -DISABLE_OUTBOUND_REPL and repadmin /options dc01 -DISABLE_INBOUND_REPL. Both commands returned feedback telling me the constraints were removed!
    4. Time to check it by running this command again: repadmin /showreps. This is what I got:

      Default-First-Site-Name\DC01
      DC Options: IS_GC
      Site Options: (none)
      DC object GUID: 541ca80e-cc21-4cd9-98cf-94fd2e0a73c5
      DC invocationID: def47bba-aeeb-4ce4-9d0b-1ce8b9c71606

  12. AWESOME! The restraints were really gone. I cleared the Directory Service event log and rebooted the DC. When it was back again: NO MORE ERRORS!
  13. Soon I rolled out a new server which I promoted to DC (DC03) and all is just fine now.
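
The before-and-after check from step 11 can also be done by a small script. What follows is only a minimal Python sketch and not part of what I actually ran: it reads the text that repadmin /showreps returned above, picks out the "DC Options" line and builds the repadmin /options commands needed to lift the replication blocks. The function names are made up for illustration, and the sketch assumes the output looks exactly like the samples quoted in this post.

```python
# Flags that block replication, in the order they were removed in step 11.3.
REPL_BLOCK_FLAGS = ("DISABLE_OUTBOUND_REPL", "DISABLE_INBOUND_REPL")


def parse_dc_options(showreps_output):
    """Return the flags on the first 'DC Options:' line as a list.

    Returns an empty list when the line is missing or reads '(none)'.
    """
    prefix = "DC Options:"
    for line in showreps_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            value = stripped[len(prefix):].strip()
            if not value or value == "(none)":
                return []
            return value.split()
    return []


def build_unblock_commands(dc_name, flags):
    """Build one 'repadmin /options' command per replication block that is set."""
    return [
        "repadmin /options {} -{}".format(dc_name, flag)
        for flag in REPL_BLOCK_FLAGS
        if flag in flags
    ]


if __name__ == "__main__":
    # Sample text copied from the first repadmin /showreps output in this post.
    sample = (
        "Default-First-Site-Name\\DC01\n"
        "DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL\n"
        "Site Options: (none)\n"
    )
    for command in build_unblock_commands("dc01", parse_dc_options(sample)):
        print(command)
```

The sketch only prints the commands; it deliberately does not run them, since lifting the blocks on a DC that really suffered a USN rollback is a decision to make by hand.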

Recap
As you can see, I didn’t remove the problematic DC but removed the other one instead. And it worked out. It took me some time to figure out, but I am glad I solved it.

I learned my lessons these days, starting with not fiddling around with DCs, since you can wreck them no matter how rock solid Microsoft has made them. Simply because Microsoft can’t protect your environment against stupid actions like mine.

However, I also learned how to troubleshoot deep AD issues, so that’s good, and now I am happy about everything that happened since I’ve learned a lot of new things.

Like an old manager once said to me: ‘I don’t worry when my people make mistakes. I start worrying when they don’t make mistakes anymore, because those are the moments they don’t work and, more importantly, don’t learn!’

5 comments:

hydeiman said...

You should not have to download repadmin. This tool is embedded in Windows Server 2008 R2 and above

hydeiman said...

You should not have to download repadmin - this tool is embedded in Windows Server 2008 R2 and above

Marnix Wolf said...

Hi Hydeiman.

Thanks for your comment. I have looked for it but couldn't find it. Will take a second look and let you know the outcome.

Cheers,
Marnix

hydeiman said...

repadmin is included in the "AD DS Snap-ins and command line tools" feature of Windows Server, and this feature should be installed automatically during setup of the AD DS role.

Marnix Wolf said...

Hi Hydeiman

Thanks for your feedback, much appreciated. Tomorrow I will update this posting based on your feedback. Of course I will mention your name. Power to the community! :)

Cheers,
Marnix