Never too old to learn…
Okay. Some time ago I had a serious issue because of the PS cmdlet Remove-SCOMDisabledClassInstance. To such an extent, even, that a TOTAL restore of BOTH OM12x SQL databases was required in order to get things back on track again.
The cause? Me. Sort of.
Before Moment Zero
In a rather large SCOM 2012x production environment there was an issue with a certain script running on the SCOM 2012x Agents. Thorough investigation showed it was a Discovery script from the Discovery Discover Agent Relationship Settings object, which discovers an additional attribute for the Class Health Service: Managed Through Active Directory (Boolean).
So the instances of the Class Health Service are already Discovered (and managed by SCOM). The earlier mentioned Discovery Discover Agent Relationship Settings object only discovers the additional attribute Managed Through Active Directory (Boolean). Basically a yes or a no whether the SCOM Agent is Active Directory Integrated or not. That’s all.
Making Moment Zero possible…
Because for this customer not a single SCOM Agent is Active Directory Integrated (ADI), it was decided to DISABLE this particular discovery. Not needed, so why run it while the related script causes issues?
Another reason: the Class instance Discovery isn’t done by this particular Discovery. It only adds an ATTRIBUTE (a yes or no for ADI) to existing Class instances. No harm in that?!
Also in place, for some time already, is the OpsMgr 2012x Self Maintenance MP, made by Tao Yang. This MP is fully configured, INCLUDING the workflow OpsMgr 2012 Self Maintenance Remove Disabled Discovery Objects Rule, which runs the PS cmdlet Remove-SCOMDisabledClassInstance on a schedule (at 20:30, every 24 hours).
Please be reminded that even WITHOUT this MP, Moment Zero was already waiting to happen: MANUALLY running the PS cmdlet Remove-SCOMDisabledClassInstance would have triggered it just the same. So this issue didn’t happen because of the OpsMgr 2012x Self Maintenance MP.
With this, all was set for Moment Zero to happen…
Moment Zero strikes!
The next day, the SCOM 2012x environment was dead. None of the SCOM Agents could communicate with the SCOM 2012x Management Group anymore. Simply because the SCOM Management Servers didn’t recognize ANY of the SCOM Agents as a trusted entity, so communication was cut off immediately by the very same SCOM MS servers…
Also, in the SCOM Console under Administration > Agent Managed, not a SINGLE SCOM Agent was listed. Totally empty. The SCOM SQL database revealed the same information, so apparently ALL SCOM Agents were removed!
So this explained why all SCOM MS servers refused to communicate with all SCOM Agents. Simply because they weren’t present in the SCOM SQL database anymore. So the WHY question was answered, but the answer to the HOW question eluded me for some minutes…
HOW Moment Zero came to be
So I backtraced all the steps I had taken in the days before. When working in different SCOM environments for different customers one quickly learns to log all steps. I use OneNote for this, which is an excellent tool for the job.
When going through all the actions of the days before, I noticed the action in which I disabled the Discovery Discover Agent Relationship Settings object.
Could it be that disabling this particular Discovery (which adds only a boolean attribute (yes/no) to an already discovered Class instance), combined with the scheduled workflow which executes the PS cmdlet Remove-SCOMDisabledClassInstance, REMOVED all Health Service instances?
As the PS cmdlet states, it removes CLASS instances for which the related Discovery is DISABLED. And even though I disabled a Discovery for an additional boolean attribute only, the PS cmdlet doesn’t work on that granular level. Nor does it differentiate between Discoveries targeted at the same Class!
As a result, the PS cmdlet REMOVED ALL SCOM Health Service instances! As CONFIGURED by me!!!
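To make that reasoning concrete, here is a minimal toy model in Python of the behaviour as I understand it from this incident. It is NOT the cmdlet's actual implementation and it doesn't touch SCOM at all. The names Discovery, Instance and remove_disabled_class_instances, the host names, and the discoveries marked "hypothetical" are all made up for illustration. The only assumption modelled is the one described above: removal is decided per Class, not per attribute or per Discovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discovery:
    name: str
    target_class: str  # class whose instances this discovery contributes to
    enabled: bool = True


@dataclass(frozen=True)
class Instance:
    class_name: str
    host: str


def remove_disabled_class_instances(instances, discoveries):
    """Toy model: split instances into (kept, removed).

    An instance is removed when ANY discovery for its class is disabled,
    regardless of whether that discovery created the instance or only
    filled in a single attribute.
    """
    affected = {d.target_class for d in discoveries if not d.enabled}
    kept = [i for i in instances if i.class_name not in affected]
    removed = [i for i in instances if i.class_name in affected]
    return kept, removed


discoveries = [
    # Hypothetical stand-in for whatever discovery creates the instances.
    Discovery("Health Service instance discovery (hypothetical)", "Health Service"),
    # The attribute-only discovery from this post, disabled.
    Discovery("Discover Agent Relationship Settings", "Health Service", enabled=False),
    Discovery("Some other discovery (hypothetical)", "Other Class"),
]

instances = [
    Instance("Health Service", "agent01"),
    Instance("Health Service", "agent02"),
    Instance("Health Service", "agent03"),
    Instance("Other Class", "agent01"),
]

kept, removed = remove_disabled_class_instances(instances, discoveries)
print(f"removed: {len(removed)}, kept: {len(kept)}")  # removed: 3, kept: 1
```

One disabled attribute-only Discovery is enough to take out every Health Service instance in this model, even though the Discovery that creates those instances is still enabled.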
Time to fix Moment Zero
After a session with Microsoft Customer Support Services, it was decided to restore BOTH SCOM SQL databases. Simply because it was the fastest way to fix this issue.
First both SCOM SQL databases were backed up in their current state. Then both were restored from a previous backup (taken when all was still okay), which completed successfully. Now SCOM ‘recognized’ all SCOM Agents again and resumed communications…
Lessons learned
After this we had a meeting about this issue. We talked about the cause, the fix and what we learned from it. A small recap:
- PS cmdlet Remove-SCOMDisabledClassInstance runs on Class instance level, NOT on attribute level. Meaning, only a Class instance as a whole can be undiscovered, not a particular attribute for a Class instance, even when there is a specific Discovery for it.
- The OpsMgr 2012x Self Maintenance MP ‘saved the day’. Simply because it runs the PS cmdlet Remove-SCOMDisabledClassInstance on a daily basis. Had this not been the case, and had the PS cmdlet been run manually weeks later, it would have been far more difficult to pinpoint the root cause of this situation.
- Disabling a Discovery isn’t to be taken lightly. It can have huge consequences for your SCOM environment. So check, double check and think over what might happen when the PS cmdlet Remove-SCOMDisabledClassInstance is executed.
- DOCUMENT all disabled Discoveries and inform the SCOM administrators about them. Keep the document in a central place, like DFS, SharePoint or OneDrive for Business.
- Running the PS cmdlet Remove-SCOMDisabledClassInstance must be done with GREAT care and consideration. Enabling this workflow in the OpsMgr 2012x Self Maintenance MP can be a time saver, but the same caution applies.
It made me feel humble again and I learned a lot from it.