Monday, November 4, 2013

HP MP For Blade Systems, Virtual Connect & Linux Systems: Where Are The Alerts?!

First of all, I want to compliment HP on the quality of their MPs. Seriously. Over the last few years HP has put a lot of effort into the overall quality of their MPs, their requirements and how they operate, and every new iteration has shown progress and improvements.

In the last few weeks I have worked with the latest versions of the HP MPs for SCOM, and I must say they have improved significantly. That's an awesome feat, since we all know that the overall quality of MPs delivered by some other vendors isn't that good at all, which is a shame.

So this posting isn’t meant to bash HP in any way. Instead I want to point out some challenges with the latest version of their MP targeted at monitoring ESX servers, Linux servers, Blade Systems, Virtual Connect and Agentless servers.

Challenges
HP Insight Control 7.1, the latest version, was installed, imported and properly configured. The related Blade Systems and Linux servers were added as well, and soon enough these devices showed up in SCOM and got a status. Sweet!

So it was time for some tests. The system engineers went to the computer room and took out some hardware from the monitored Blade Systems and Linux Servers. And now something strange happened…

State Changes? Yes. Alerts? NO!
It took a while, but SCOM started to show the related state changes. The delay was far too long but nothing alarming: a properly configured override would take care of that issue. What worried me was that no Alert whatsoever showed up. Nothing. Zip. Nada! Time for some investigation.

No Noise please…
And this one really puzzled me. The related Monitor was set to generate an Alert, as this screen dump shows:
image

So why wasn’t the Alert being shown? SCOM itself was in a healthy state, and Alerts for other monitored components, covered by other MPs, still came in. So the cause had to be related to the HP MP itself.

Time to check the overrides. And this one was a bit surprising: it turned out that ALL Monitors in the HP MP ship with an Override that is set NOT to generate an Alert by default:
image

I don’t like noise for sure, but this kind of tuning is a bit too much if you ask me. And no, none of the related guides for this MP tell you anything about this configuration…

Split brain scenario & Enforcing an Override
But this isn’t a nice situation at all, since it leaves the MP with some configuration issues. They can be addressed, but they need some serious attention. Why? Well…

  1. The MP contains Monitors which by default generate Alerts;
  2. Out of the box these Monitors come with Overrides which suppress this setting (Generates Alert: FALSE). And these Overrides are boxed in a Sealed MP, so they can’t be removed or edited directly;
  3. So an EXTRA Override is required (Generates Alert: TRUE).

However, the option described in Step 3 creates a new situation, equivalent to the split brain scenario we had back in the days with the old failover clusters. There can only be one owner of the quorum at any given time. But during disasters and their recoveries a situation can arise where two or even more nodes each think they’re the quorum owner, which makes matters even worse for your failover cluster.

With two Overrides set on the same Parameter (Generates Alert), one FALSE and the other TRUE, SCOM doesn’t know what to do, so its behavior becomes erratic. One time it will generate an Alert and the other time it won’t.

LUCKILY, Microsoft had a very bright moment when they engineered SCOM 2007 RTM, and from the beginning they added an extra option for setting Overrides: the ENFORCED option. Basically it means that SCOM has to enforce that particular Override, no matter what other Overrides are in place for the same Parameter of that very same Rule/Monitor.
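
The interplay between the sealed Override, the extra Override and the ENFORCED option can be sketched as a small toy model. This is only an illustration of the situation described above, not SCOM's real resolution logic: the `Override` class, the `effective_value` function and the MP names are all made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Override:
    """One override on a monitor parameter (hypothetical, simplified model)."""
    source_mp: str          # management pack holding the override (made-up names)
    parameter: str          # e.g. "GeneratesAlert"
    value: bool
    enforced: bool = False  # models the ENFORCED checkbox


def effective_value(default: bool, overrides: List[Override], parameter: str) -> Optional[bool]:
    """Return the effective value, or None when the outcome is ambiguous.

    Toy rules, NOT SCOM's actual algorithm:
      1. Without any override, the monitor default applies.
      2. An enforced override beats every non-enforced override.
      3. Conflicting overrides of equal rank are reported as ambiguous.
    """
    relevant = [o for o in overrides if o.parameter == parameter]
    if not relevant:
        return default
    enforced = [o for o in relevant if o.enforced]
    candidates = enforced or relevant
    values = {o.value for o in candidates}
    return values.pop() if len(values) == 1 else None


sealed = Override("HP.Sealed.MP", "GeneratesAlert", False)
extra = Override("Custom.Overrides", "GeneratesAlert", True)
extra_enforced = Override("Custom.Overrides", "GeneratesAlert", True, enforced=True)

print(effective_value(True, [sealed], "GeneratesAlert"))                  # False
print(effective_value(True, [sealed, extra], "GeneratesAlert"))           # None
print(effective_value(True, [sealed, extra_enforced], "GeneratesAlert"))  # True
```

The middle case is the split brain: two Overrides of equal standing that disagree. Only the enforced one gives a predictable outcome.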

So when setting this Override I used the ENFORCED option like this:
image

While I was at it, I also changed the PeriodSeconds Parameter from 900 seconds (15 minutes), which is way too long, to 60 seconds, so the Monitor would trigger an Alert far sooner. After these modifications the related Monitors looked like this:
image
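
To put some numbers on the PeriodSeconds change, here is a minimal sketch. It assumes the Monitor simply polls once every PeriodSeconds and ignores any other latency in the chain; the `polling_stats` function is made up for this example and is not part of the MP.

```python
def polling_stats(period_seconds: int) -> dict:
    """Detection delay and polling load for a fixed polling interval.

    Assumption: a fault that occurs right after a poll is only seen at the
    next poll, so the worst-case delay equals the interval itself.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return {
        "worst_case_delay_s": period_seconds,
        "average_delay_s": period_seconds / 2,
        "polls_per_day": 86400 // period_seconds,
    }


for period in (900, 60):
    print(period, polling_stats(period))
```

With 900 seconds the Monitor polls 96 times a day and can take up to 15 minutes to notice a fault. With 60 seconds it polls 1440 times a day and notices within a minute, at the price of 15 times as many polls.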

And now the second test went far better: when the system engineers pulled out some disks and other hardware, SCOM showed a State Change within a minute AND the related Alert was also shown!

So for anyone having this MP in place: open the related hardware in the Health Explorer in SCOM and check those Monitors one by one. I’ll bet they have that Override in place, suppressing the Alerts. Now you know how to fix them and, when required, how to make those very same Monitors run a bit more often…
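
If you keep a list of the Overrides somewhere, the one-by-one check can be sketched like this. Everything here is an assumption for illustration: the CSV layout, the column names and the Monitor names are made up, and the code does not talk to SCOM at all.

```python
import csv
import io

# Hypothetical export: one row per override. Column and monitor names are illustrative only.
SAMPLE_EXPORT = """monitor,parameter,value,enforced
Blade Power Supply,GeneratesAlert,false,false
Blade Fan,GeneratesAlert,false,false
Blade Fan,GeneratesAlert,true,true
Enclosure Temperature,PeriodSeconds,900,false
"""


def monitors_still_suppressed(export_text: str) -> list:
    """Return monitors whose alerts are suppressed and not re-enabled by an enforced override."""
    suppressed, re_enabled = set(), set()
    for row in csv.DictReader(io.StringIO(export_text)):
        if row["parameter"] != "GeneratesAlert":
            continue  # only the alert setting matters for this check
        if row["value"].lower() == "false":
            suppressed.add(row["monitor"])
        elif row["enforced"].lower() == "true":
            re_enabled.add(row["monitor"])
    return sorted(suppressed - re_enabled)


print(monitors_still_suppressed(SAMPLE_EXPORT))  # ['Blade Power Supply']
```

In this sample only the first Monitor still needs an enforced Override; the second one already has it.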

Recap
Like I said before, HP has done a great job and delivers good MPs now. Some additional tuning is still required, but once that’s done you have a good monitoring solution in place. And to be frank, I’d rather have MPs like this one (no noise) and the ability to tune them.

Nonetheless, HP could do these two things:

  1. Document these Overrides so their customers know about them;
  2. Put these Overrides in an additional Unsealed MP, so people can decide whether or not to import it.

And for the rest: RESPECT to HP!

Additional resources
There are some additional resources about this MP, how to import, configure and tune it:

  1. My respected fellow MVP Stanislav Zhelyazkov: https://cloudadministrator.wordpress.com/2012/08/12/configure-hp-bladesystem-management-pack-for-scom/
  2. And my own blog: http://thoughtsonopsmgr.blogspot.nl/2013/06/high-level-overview-hp-blade-monitoring.html 
