Friday, July 12, 2013

The Unstable Resource Pools & Greyed Out Network Devices

Update 07-12-2013
This posting isn’t correct. Please check out this posting about the REAL issue. Apologies for the misunderstanding here.

An OM12 SP1 Management Group monitoring a couple of hundred Windows Servers, almost a hundred UX computers and almost 500 network devices became highly unstable.

All the Resource Pools were failing and many network devices entered an unmonitored status. No matter what I did (flushing the cache, restarting the Health Service on all OM12 SP1 Management Servers and so on), the situation remained highly unstable. The whole OM12 SP1 Management Group turned grey as well.

And, almost by ‘default’, many of the network devices turned grey as well WITHOUT generating any Alert. This happened BEFORE the Management Group as a whole died…

And restarting the Health Service and clearing the cache on the OM12 SP1 Management Servers only helped for a few minutes, three at most…

Investigation & cause
Time for a deep dive, since this is a bad situation. Because this Management Group contains many customizations, like custom-made Management Packs, we suspected that a certain MP was causing all these issues.

After going through the change logs (we keep track of what we do) we soon found a potential culprit. A new Monitor had recently been added, using VBScript to convert some collected SNMP data into values that make sense to a human being. The issues started in the same time frame…

When taking a deeper dive into this particular Monitor we found that it was disabled by default and had no override against it. So even though it was targeted against the Class Node, it was inactive. It didn’t have a status whatsoever.

However, as it turned out, this Monitor had an error in its VBScript, and the OpsMgr event logs showed this error. So this Monitor wasn’t good at all, even though it was disabled by default and no overrides were in place to enable it for a particular custom-made Class.

Problem solved
So we removed this faulty Monitor and restarted the Health Services on all OM12 SP1 Management Servers without clearing the cache. Within 5 minutes everything came back to life. All Resource Pools reported a Healthy status and the greyed out network devices were monitored again. Soon the Management Group as a whole reported a healthy status again and became rock solid.

Lesson learned
Even when a Monitor is disabled by default, if it isn’t made right it can wreak havoc in your Management Group. So be careful here. Somehow this particular Monitor slipped through our regular checks, resulting in an unstable Management Group.


alex said...

Hi Marnix, can you share the XML of this monitor? Strange situation.

Marnix Wolf said...

Hi Alex.

The code of this version of this MP has been deleted and we only have the previous versions without this Monitor. It's back to the labs now.

What the Monitor was trying to do was pull two values out of the MIB tree and subtract one from the other. When the outcome falls below a certain value, an Alert had to be raised.

No rocket science actually. But somehow this Monitor caused a lot of issues.
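
For readers following along, the logic described above can be sketched roughly like this. This is not the original VBScript (that code was deleted); it is a minimal Python illustration, and the function name, parameter names and sample numbers are all hypothetical.

```python
def should_alert(first_value, second_value, threshold):
    """Return True when the difference between two collected values
    falls below the threshold (hypothetical names, not the original MP code)."""
    # Subtract the second collected SNMP value from the first.
    difference = first_value - second_value
    # An Alert is wanted only when the outcome drops below the threshold.
    return difference < threshold


# Example with made-up numbers: a difference of 5 against a threshold of 10.
print(should_alert(100, 95, 10))
```

Which value is subtracted from which, and the threshold itself, are assumptions here; the real Monitor's values are not known.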


Marnix Wolf said...

Hi Alex.

As it turns out, something totally different was the REAL cause. Apologies for this posting.

Please check out this new posting about the REAL cause: