Thoughts on Azure, OMS & SCOM: Intelligent Service Monitoring

Sunday, September 18, 2011

Intelligent Service Monitoring – Part I: The Deal

----------------------------------------------------------------------------------
Postings in the same series:
Part II – By Example
----------------------------------------------------------------------------------

In this series of postings (two parts, as I see it now) I will cover in detail how to create Service Monitors in the regular SCOM Console. The first posting covers the theory behind it and the second posting contains a step-by-step guide on how to do it. Let’s start.

Now hold your breath since I know what you’re about to say: ‘Duh! Creating a Service Monitor is a straightforward process, so why write TWO postings about a topic that simple?’.

True, when monitoring one or more Windows Services which don’t have any relationship with each other. But the whole story changes when you want to monitor one or more Windows Services which run on multiple servers (two, for instance) in an active/passive configuration. Now it becomes a totally different ballgame, since you REQUIRE certain intelligence in those Service Monitors. But how to achieve that? It almost sounds like the Gordian Knot.

Situation
Suppose you run two servers. Those servers run the same application, one in passive mode (related Windows Service isn’t running), the other in active mode (related Windows Service is running). But these servers aren’t HA clusters as Microsoft knows them, so the Cluster MP won’t treat them as cluster nodes. And yet for this particular application they behave like cluster nodes.

I want to monitor…
Now you want to monitor the Windows Services related to that application running on both servers and have them displayed in a Dashboard. These services are set to run manually and are administered by the application. When one server dies, the other server takes over and the services are started by the application itself.

Challenges
When one wants to monitor Windows Services configured like these, there are some bumps in the road because:

One of the two servers doesn’t run the related Windows Service by design, so that Monitor will enter a critical state which rolls up to the health of the top level entity of that server. Now the server has a Critical state while all is well for that server.
An Alert is raised. That is good under other circumstances but not now, since one Windows Service isn’t supposed to run at all by design.

Workarounds
Both issues can be dealt with:

The Parent Monitor for that particular Service Monitor is changed from Availability to Entity. Now the Monitor will still have a critical state, but the health won’t roll up any more to the top level entity of that server, so that server stays green.
The Alert for this Monitor is disabled (either directly by editing the related Monitor or through an override. I myself prefer the latter).
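
To make the effect of both workarounds concrete, here is a minimal Python sketch of the rollup behaviour described above. It is a toy model, not SCOM code: the names Health, worst_of, server_health and raises_alert are made up for illustration, and it assumes the server only sees whatever rolls up through the Availability parent monitor of the service.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states ordered so that a higher value is a worse state."""
    UNMONITORED = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def worst_of(states):
    """Worst-state rollup that ignores members without a state."""
    monitored = [s for s in states if s is not Health.UNMONITORED]
    return max(monitored, default=Health.UNMONITORED)


def server_health(service_state, parent_monitor):
    """Toy assumption: the server only sees the service's Availability rollup."""
    if parent_monitor == "Availability":
        seen_by_server = service_state
    else:  # the Service Monitor sits directly under Entity
        seen_by_server = Health.UNMONITORED
    # Assume every other monitor on the server is healthy.
    return worst_of([Health.HEALTHY, seen_by_server])


def raises_alert(service_state, alert_enabled):
    """An alert fires only when the monitor is critical and alerting is on."""
    return alert_enabled and service_state is Health.CRITICAL


passive = Health.CRITICAL  # service is stopped by design on the passive node
print(server_health(passive, "Availability").name)  # CRITICAL
print(server_health(passive, "Entity").name)        # HEALTHY
print(raises_alert(passive, alert_enabled=False))   # False
```

The Service Monitor itself stays critical in both cases. Only what the server sees, and whether an Alert goes out, changes.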

OK, we’re almost there now, but there are still some issues to reckon with
But now another issue arises.

You need to group BOTH monitors targeted against the same Windows Service as well since they relate to each other. One Windows Service is running (green) and the other isn’t (red). By design the health state for that group is calculated by the worst state of any member. Which results in a Critical State. This can be easily adjusted so the health state is calculated by the best state of any member. Now the Monitor has a healthy state and will enter a critical condition when BOTH Windows Services aren’t running. So far so good.
But suppose you use a DA for this, and put those two Windows Services into a single DA Component (which groups those two Windows Services together, thus expressing their relationship). This DA Component is by default targeted against the Availability parent monitor. But the Monitors related to those two Windows Services were moved to Entity (remember?). So health won’t roll up for that Monitor, which results in a DA Component in an unmonitored state.

In order to obtain a Health state for that DA Component, the Monitor Dependency for the Monitor targeted against that DA Component has to be changed as well, from Availability to Entity Health. And now the DA Component gets into a monitored state WITH all the required intelligence as well.
Now we also want an Alert when both Windows Services aren’t running. By design a DA Component doesn’t raise an Alert. So the Monitor targeted against the DA component must be changed as well in order to raise an Alert.
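
The steps above can be sketched in a few lines of Python. Again this is a toy model under stated assumptions, not anything SCOM exposes: rollup, component_health and component_alert are hypothetical names, and the model simply treats the DA Component as unmonitored whenever its dependency points at a different parent monitor than the one the Service Monitors live under.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states ordered so that a higher value is a worse state."""
    UNMONITORED = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def rollup(states, policy):
    """Roll up member states using the 'worst' or 'best' state of any member."""
    monitored = [s for s in states if s is not Health.UNMONITORED]
    if not monitored:
        return Health.UNMONITORED
    return max(monitored) if policy == "worst" else min(monitored)


def component_health(service_states, monitors_parent, dependency_target,
                     policy="best"):
    """Toy assumption: the dependency only sees monitors under its own target."""
    if monitors_parent != dependency_target:
        return Health.UNMONITORED
    return rollup(service_states, policy)


def component_alert(state, alert_enabled=True):
    """The DA Component alerts only when it is critical and alerting is on."""
    return alert_enabled and state is Health.CRITICAL


active_passive = [Health.HEALTHY, Health.CRITICAL]  # normal situation
both_down = [Health.CRITICAL, Health.CRITICAL]      # real outage

# Default dependency on Availability, monitors moved to Entity: no state at all.
print(component_health(active_passive, "Entity", "Availability").name)  # UNMONITORED
# Worst-of rollup would flag the normal situation as critical.
print(component_health(active_passive, "Entity", "Entity", "worst").name)  # CRITICAL
# Best-of rollup: green while one node runs, red only when both are down.
print(component_health(active_passive, "Entity", "Entity").name)  # HEALTHY
print(component_alert(component_health(both_down, "Entity", "Entity")))  # True
```

The best-state policy is what carries the intelligence here: a single running service is enough to keep the component green, so the only Alert you get is the one for a real outage.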

Does it work as intended?
Yes, it does! Now everything is in place and SCOM is properly monitoring Windows Services which are configured in an active/passive configuration. The services are monitored on a per-server basis: one is green and the other red, but no Alert is raised, nor is the Health of that server impacted by it. In addition, BOTH services are monitored together, and an Alert is raised when BOTH Windows Services have stopped running.

Hopefully I haven’t lost you halfway! In the next posting of this series I will show you how to go about it by using an example. See you all next time.

3 comments:

Jeff Smithling said...: Brilliant. Really.; February 28, 2014 at 9:16 PM
apollo said...: Great stuff, thanks for sharing!; January 22, 2015 at 4:18 PM
JimtheSkinsFan said...: Thanks very much for these postings. My client is implementing SCOM presently. I need to determine what SCOM information to request during the performance tests I design and manage to augment the counters and data collected by HP SiteScope. Having a general understanding about SCOM and SCOM practices from an expert blog like yours helps.; April 10, 2015 at 4:15 PM