Even though SCOM monitors itself, the Operations Manager event log on the SCOM Management Servers still tells a lot more. So periodically I go through those event logs on the SCOM Management Servers in order to check whether everything is okay.
Hello EventID 33333!
This way I bumped into a SCOM 2012 R2 Management Server logging EventID 33333 way too many times:
I live by the credo: ‘…ONE event isn’t an event.’ In other words, when a single event happens and the rest is okay, it was just a blip and nothing more.
But this here tells me a different story, something isn’t going as planned. Time for some investigation.
Loving the event log
Seriously, I do! Why? Because the events contain so much information. I have done a lot of troubleshooting, and most of the time the Operations Manager event logs were the starting point of my investigations and also the clue to the solutions.
And in this case the event log helped me a lot, since 99% of the logged events with EventID 33333 had the same sources in their description:
As you can see the BaseManagedEntityId and MonitorId are logged, both with their GUIDs. Awesome! And yes, 99% of the events with EventID 33333 had the same GUIDs. So the cause was already pinned down to only ONE source and ONE monitor not functioning well. Awesome!
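Counting which GUID pair dominates can also be scripted instead of eyeballed. The following is a minimal sketch, not the method used above: the function names Get-EventGuidPair and Group-EventGuidPair are hypothetical, and it assumes the event description contains the text "BaseManagedEntityId=<GUID>" and "MonitorId=<GUID>", so adjust the patterns to what your events actually show.

```powershell
# Sketch: pull the two GUIDs out of one event description.
# Assumption: the description contains "BaseManagedEntityId=<GUID>"
# and "MonitorId=<GUID>"; both function names are hypothetical.
function Get-EventGuidPair {
    param([string]$Message)

    $guid = '[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}'
    $entity  = if ($Message -match "BaseManagedEntityId=($guid)") { $Matches[1] }
    $monitor = if ($Message -match "MonitorId=($guid)") { $Matches[1] }

    [pscustomobject]@{
        BaseManagedEntityId = $entity
        MonitorId           = $monitor
    }
}

# Count how often each GUID pair occurs, most frequent first.
function Group-EventGuidPair {
    param([string[]]$Message)

    $Message |
        ForEach-Object { Get-EventGuidPair -Message $_ } |
        Group-Object BaseManagedEntityId, MonitorId |
        Sort-Object Count -Descending |
        Select-Object Count, Name
}

# Usage on a Management Server (reads the real event log):
# $events = Get-WinEvent -FilterHashtable @{ LogName = 'Operations Manager'; Id = 33333 }
# Group-EventGuidPair -Message $events.Message
```

If one pair accounts for nearly all events, as it did here, the top row of the output is the source and monitor to investigate.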
Sherlock Holmes or PowerShell?
Now it was time for some plain PowerShell cmdlets in order to translate the GUIDs into understandable human language.
- In order to get a proper name for the GUID attached to BaseManagedEntityId I ran this PS cmdlet:
Get-SCOMClassInstance -Id 'GUID' | Format-Table DisplayName
(Replace GUID with the GUID for the BaseManagedEntityId shown in the event description.)
This gave me the FQDN of the BaseManagedEntityId. It turned out to be a monitored Windows Server.
- In order to get a proper name for the GUID attached to MonitorId I ran this PS cmdlet:
Get-SCOMMonitor -Id 'GUID' | Format-Table DisplayName
(Replace GUID with the GUID for the MonitorId shown in the event description.)
This gave me the name of the Monitor involved. In this case it was the Monitor System Center Management Health Service Memory Utilization.
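Both lookups can be wrapped into one small helper. This is only a sketch: the function name Resolve-ScomEventGuid is hypothetical, and it assumes it is run in the Operations Manager Shell (or after Import-Module OperationsManager) with a connection to the Management Group already in place.

```powershell
# Sketch: translate both GUIDs from an EventID 33333 description in one call.
# Assumption: the OperationsManager module is loaded and connected.
# Resolve-ScomEventGuid is a hypothetical helper name.
function Resolve-ScomEventGuid {
    param(
        [Parameter(Mandatory)][guid]$BaseManagedEntityId,
        [Parameter(Mandatory)][guid]$MonitorId
    )

    # The same two cmdlets as above, with typed GUID input.
    $instance = Get-SCOMClassInstance -Id $BaseManagedEntityId
    $monitor  = Get-SCOMMonitor -Id $MonitorId

    [pscustomobject]@{
        Entity  = $instance.DisplayName
        Monitor = $monitor.DisplayName
    }
}

# Usage (replace both placeholders with the GUIDs from the event description):
# Resolve-ScomEventGuid -BaseManagedEntityId 'GUID' -MonitorId 'GUID'
```

Typing the parameters as [guid] means a mistyped or truncated GUID fails immediately with a conversion error instead of returning an empty result.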
Health Explorer
Time to open Health Explorer for that particular Windows server. Since it’s a Monitor targeted against the Agent, I opened the Health Explorer of the Agent of that Windows Server. And this is what I saw:
Yikes! Flip flopping! This Monitor is not doing well on this particular Windows Server. On all other monitored Windows Servers this Monitor runs just fine. I checked about 20 other servers in order to be sure, but on none of those servers did this Monitor have issues. And the counter kept on growing….
So the culprit wasn’t the Monitor itself but the Windows Server.
The culprit
Time to start an RDP session with the Windows Server having issues with this Monitor. On this server I also opened the Operations Manager event log. But all I got was this:
That’s not okay. But it could be the very reason for the flip flopping. Time to run a repair of the SCOM Agent running on this server:
- Go to Programs and Features > right click Microsoft Monitoring Agent > Change;
- Next > select the Repair option > Next;
- Now the Agent will be repaired > Finish.
After this repair job I could open the Operations Manager event log. And besides a few events it was empty and contained no errors.
On the Management Server side of things, the EventID 33333 stopped coming in from the moment the Agent on the Windows Server was repaired!
And in Health Explorer? The counter stopped. No more flip flopping!
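To confirm from the Management Server that the events really stopped, the occurrences can be counted per hour around the time of the repair. A minimal sketch, with a hypothetical function name Get-EventCountPerHour; it works on any objects that carry a TimeCreated property, which the records returned by Get-WinEvent do.

```powershell
# Sketch: count events per hour to see when EventID 33333 stopped.
# Get-EventCountPerHour is a hypothetical helper name.
function Get-EventCountPerHour {
    param([object[]]$Record)

    $Record |
        Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH') } |
        Sort-Object Name |
        Select-Object @{ Name = 'Hour'; Expression = { $_.Name } }, Count
}

# Usage on the Management Server, looking at the last 24 hours.
# Get-WinEvent raises an error when nothing matches, hence SilentlyContinue:
# $filter = @{ LogName = 'Operations Manager'; Id = 33333; StartTime = (Get-Date).AddDays(-1) }
# $events = Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue
# Get-EventCountPerHour -Record $events
```

Hours without any events simply do not appear in the output, so after a successful repair the list should end at the hour in which the Agent was repaired.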
Recap
Whenever you see an event (warning/critical) coming back in the Operations Manager event log on the Management Servers, chances are something is not okay.
Use those very same events as a starting point for your investigation and use PowerShell in order to get the understandable names of those GUIDs.
This way you obtain a lot of information within just a few minutes, aiding you in good old troubleshooting without ending up on a wild goose chase.