Monday, September 26, 2011

Intelligent Service Monitoring – Part II: By Example

Postings in the same series:
Part I – The Deal

In the second and last posting in this series I will demonstrate how to monitor Windows Services configured in active/passive mode, as described in the first posting of this series. If you haven’t read it yet, please go back and read it first, since otherwise you won’t fully understand this posting.

The Example
In this example I have taken a Windows Service which is present on two of my sandbox servers: Volume Shadow Copy. On server SV01 this service is set to start automatically and is running:

On server SV02 this service is set to start manually and is in a stopped state:

In this example these Windows Services are configured in Active/Passive mode: when the Windows Service on server SV01 stops, the underlying application will start the same Windows Service on server SV02.

So these Windows Services relate to each other and must be monitored as such. SCOM mustn’t raise an Alert when only ONE of the two Windows Services isn’t running, because that is as it should be. When both Windows Services don’t run however, an Alert must be raised. Also, when the Windows Service isn’t running on one of the two servers, the Health mustn’t be critical.
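The behavior we are after can be written down as a small truth table. The sketch below is plain Python and not SCOM code; the function names `pair_health` and `should_alert` are made up for this illustration. It also assumes that both services running at the same time counts as Healthy, which the requirements above don’t state explicitly.

```python
def pair_health(sv01_running: bool, sv02_running: bool) -> str:
    """Health of the active/passive pair: fine as long as one node runs the service."""
    # Assumption: both nodes running is treated as Healthy as well.
    return "Healthy" if (sv01_running or sv02_running) else "Critical"


def should_alert(sv01_running: bool, sv02_running: bool) -> bool:
    """An Alert is wanted only when the service runs on neither server."""
    return not (sv01_running or sv02_running)


if __name__ == "__main__":
    # Print the full truth table for the two servers.
    for sv01 in (True, False):
        for sv02 in (True, False):
            print(sv01, sv02, pair_health(sv01, sv02), should_alert(sv01, sv02))
```

The rest of this posting is about getting SCOM to behave like these two functions.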

Let’s build

  1. Create a Group
    First we need to create a Group containing both Windows Servers. This Group will be used for targeting the Monitors. The Group and the Monitors both go into an unsealed MP, and it has to be the same unsealed MP. Why? Unsealed MPs can only reference sealed MPs, not other unsealed MPs.

    Go to Authoring > Authoring > Actions > Create a new Group. Here I have populated the Group explicitly with the Windows Computer objects (in bigger environments it’s better to populate the Group dynamically):
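The reference rule for Management Packs mentioned in this step can be sketched in a few lines of Python. This is only a model of the rule, not anything SCOM exposes; `can_reference` and the two unsealed MP names are hypothetical, and `Microsoft.Windows.Library` merely serves as an example of a sealed MP.

```python
from typing import AbstractSet


def can_reference(source_mp: str, target_mp: str, sealed_mps: AbstractSet[str]) -> bool:
    """True if an object in source_mp may use an object stored in target_mp."""
    # Objects inside the same MP can always see each other;
    # across MPs, only sealed MPs can be referenced.
    return source_mp == target_mp or target_mp in sealed_mps


if __name__ == "__main__":
    sealed = {"Microsoft.Windows.Library"}
    # Hypothetical unsealed MPs: one holding the Monitors, one holding the Group.
    print(can_reference("Test.VSS.Monitoring", "Test.VSS.Monitoring", sealed))
    print(can_reference("Test.VSS.Monitoring", "Test.VSS.Groups", sealed))
    print(can_reference("Test.VSS.Monitoring", "Microsoft.Windows.Library", sealed))
```

This is why the Group, the Monitors and later on the Distributed Application all end up in one and the same unsealed MP.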

  2. Run the Windows Service Wizard
    Now it’s time to use the Windows Service wizard, located here: Authoring > Authoring > Management Pack Templates > Windows Service. Right click it and select Add Monitoring Wizard > Windows Service > Next.

    Give the Monitor a proper Name so these Monitors can be differentiated easily from the others.

    Next > click on the radio button (1) for Service Name and select one of the related servers > select the proper Windows Service (2) > OK (3) > back in the screen select the proper Group under Targeted Group by clicking the radio button (4) > back in the screen deselect the option Monitor only automatic service (5) > Next.

    In this example we don’t want to collect any performance data, so leave this screen untouched.
    Next > a summary is shown.
    Check it; when all is well > Create.

  3. Let’s change some stuff…
    Now the Monitors are created, along with the related Discoveries and Rules. The latter will be disabled since we don’t collect any performance data in this example. But the Monitors are in place and will become functional.

    But the default behavior isn’t right for this kind of Windows Services configured in Active/Passive mode.

    First we don’t want an Alert.
    Like this one. Time to get rid of it…

    Secondly we don’t want the health state affected negatively when only one Windows Service isn’t running, since that is how it should be in a Healthy state:
    Duh! That’s as it should be. So this situation is Healthy. How to make this happen in SCOM R2?

    So let’s start modifying SCOM R2 in order to make it work as we want it. Go to Authoring > Authoring > Management Pack Objects > Monitors and scope the View to Test – Volume Shadow Copy. Now the View will look like this when you expand Parent Monitor Availability:
    Now here is a tricky part: there are TWO service Monitors: both are enabled by default but one is also DISABLED through an override. All done by the Wizard. Rule of thumb in order to select the proper Monitor (the one which isn’t disabled through an override) is looking at the MP: the correct Monitor which requires adjustment must reside in the MP you created yourself. So in this case I select the second Monitor by double clicking on it.

    First we don’t want an Alert any more. Later on we will modify SCOM in such a way that an Alert will only be raised when BOTH Windows Services fail. Disabling an Alert can be done in two ways. The first is deselecting the option Generates an Alert for this Monitor, found under the tab Alerting. The second is setting it through an override. I myself prefer the latter since it doesn’t change the original configuration in any way, so there is always a way back.

    Save the override > Apply > OK. So now this Monitor won’t raise an Alert any more. Time for step two. Now we don’t want the Monitor to roll up to the Availability health of the server. Stay in the properties screen of this Monitor and go to the tab General.
    Change the Parent Monitor from Availability to Entity Health > Apply > OK.

    As you can see, the Monitor is moved from Parent Monitor Availability to Entity Health:

    Let’s check the Health Explorer for SV02 which was in a Critical condition first:
    As you can see the related Windows Computer is Healthy now since the Windows Service Monitor isn’t shown any more. Don’t let the other Monitor fool you since that’s the one which is disabled through an override by the Service Monitoring Wizard.

    So we’re getting close now: a Windows Service is being monitored, no Alert is raised when the service doesn’t run AND the Health State isn’t affected either. Almost feels like creating something and then killing it, doesn’t it?

    But… how to bring in some intelligence? We want to monitor these Windows Services in an active/passive configuration and get an Alert when BOTH Windows Services don’t function anymore. This is where the Distributed Application comes in.

    I have noticed this behavior in SCOM R2 CU#5: when the Parent Monitor is changed, the Windows Service Monitor is set back to check only automatic Windows Services:
    Whoops! We REMOVED that, didn’t we? And when we remove it again, the Parent Monitor will be changed back to Availability. So don’t change it. Instead, create an override against the same Monitor for which we set the Alerting to none and changed the Parent Monitor: override the Parameter Name Alert only if service startup type is automatic by typing (yeah, no typo here…) false in the column Override Value > Apply > OK.
    Now the bug is worked around…
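Why overrides are the safer route in this step can be shown with a small Python sketch. It only models the idea of layering an override on top of a Monitor’s own configuration; the dictionary keys and the helper `effective_settings` are hypothetical labels for this illustration and not real SCOM parameter IDs.

```python
from collections import ChainMap


def effective_settings(defaults: dict, overrides: dict) -> ChainMap:
    """Overrides are looked up first; anything not overridden falls back to the defaults."""
    return ChainMap(overrides, defaults)


if __name__ == "__main__":
    # Hypothetical labels for the Monitor's own configuration (left untouched).
    monitor_defaults = {
        "GeneratesAlert": True,
        "AlertOnlyIfStartupTypeIsAutomatic": True,
    }
    # The two overrides created in this step.
    overrides = {
        "GeneratesAlert": False,
        "AlertOnlyIfStartupTypeIsAutomatic": False,
    }
    settings = effective_settings(monitor_defaults, overrides)
    print(dict(settings))

    # The way back: remove an override and the default applies again.
    del overrides["GeneratesAlert"]
    print(settings["GeneratesAlert"])
```

The Monitor’s own configuration is never modified in this model, which is the reason for preferring overrides here.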

  4. Let’s add some brains
    Go to Authoring > Authoring > Distributed Applications > Actions > Create a new Distributed Application. Remember to put the DA into the same unsealed MP and select the blank template.

    In DAD, search for both Windows Services > when found select them both > right click > Add to > New Component Group and give this DA Component a good name like Volume Shadow Copy Windows Services in Active-Passive Config:
    OK > now you have this DA component:

    Add other DA components if required. In this example I don’t add additional components. In real life however, sometimes I end up with DAs containing over 30 components. But that’s another story :)

    > Save. The MP is saved now and the related Monitors and Objects in SCOM created. Close DAD.

    Let’s change the default behavior now. This part will add the brains and make the DA work. Because when you open the DA in Diagram View this is what you’ll see, Health isn’t rolling up to the DA:
    But that’s because we changed the Parent Monitor for those Windows Service Monitors from Parent Monitor Availability to Entity Health remember? So the Monitor related to the DA Component Volume Shadow Copy Windows Services in Active-Passive Config has to be changed as well…

    Go to Authoring > Authoring > Management Pack Objects > Monitors and scope the View to Volume Shadow Copy Windows Services in Active-Passive Config. Now the View will look like this when you expand Parent Monitor Availability:

    Double click on Monitor Component Group Health Roll-up for type Test – Volume Shadow Copy. The properties screen for this Monitor will open now. Go to the tab Monitor Dependency and select the Parent Monitor Entity Health > Apply.

    Let’s check the Diagram View again: Aha! Health is rolling up. Only not good since ONE Windows Service is running and one isn’t. So still we’re lacking the required intelligence:

    So let’s go back to the properties screen of the Monitor related to the DA component (Component Group Health Roll-up for type Test – Volume Shadow Copy) and add some more changes as well. Now we go to the tab Health Rollup Policy and change it from Worst state of any member to Best state of any member > Apply:

    Let’s check the Diagram View again (it might take some minutes, so be patient):
    Tada!!!! So FINALLY some intelligence is coming in! Nice!

    Now we want an Alert to be raised when BOTH Windows Services stop functioning. Normally, Monitors targeted against a DA Component don’t raise an Alert. So let’s change that as well. Go back to the properties screen of the related Monitor (Component Group Health Roll-up for type Test – Volume Shadow Copy).

    Go to the tab Alerting and select the option Generate alerts for this Monitor. Add a proper Alert Description (you can choose to add some parameters as well) and change the Priority and Severity as required.
    > Apply > OK.
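The effect of the two Health Rollup Policies used in this step can be sketched in Python. This is a simplified model, not SCOM’s implementation: the `Health` enum and the functions `worst_state`, `best_state` and `component_alert` are names invented for this illustration, and the member list is assumed to be non-empty.

```python
from enum import IntEnum
from typing import Sequence


class Health(IntEnum):
    """Health states ordered from best (lowest) to worst (highest)."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def worst_state(members: Sequence[Health]) -> Health:
    """'Worst state of any member': one unhealthy member drags the component down."""
    return max(members)


def best_state(members: Sequence[Health]) -> Health:
    """'Best state of any member': one healthy member keeps the component healthy."""
    return min(members)


def component_alert(members: Sequence[Health]) -> bool:
    """With the best-state policy, the component only alerts when every member is Critical."""
    return best_state(members) == Health.CRITICAL


if __name__ == "__main__":
    # Normal active/passive situation: running on SV01, stopped on SV02.
    normal = [Health.HEALTHY, Health.CRITICAL]
    print(worst_state(normal).name, best_state(normal).name, component_alert(normal))

    # Both services stopped.
    both_down = [Health.CRITICAL, Health.CRITICAL]
    print(worst_state(both_down).name, best_state(both_down).name, component_alert(both_down))
```

With the default worst-state policy the normal active/passive situation already turns the component Critical, which is why the policy had to be changed to best state before switching on the Alert.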

So now all is in place and the intelligence is added. Let’s test it and stop the service on server SV01:

And the Diagram View:

Nice! All is working as intended! Really sweet it is.

SCOM can add intelligence to Service Monitoring, even though it might seem overwhelming. As a matter of fact, it isn’t, since the approach described in this posting is ALWAYS the same. So familiarize yourself with it and before you know it you’ll create intelligent Monitors like these in a matter of minutes! Happy SCOMming!


Ian C. said...

About your Bug Alert

This problem also happens when you change other properties of the Monitor.

Renaming the service monitor resets the template to only monitor automatic services.
Changing the monitoring of the startup type resets the name of the service back to the default 'Windows Service Running'. It also resets the severity of the alert to 'Critical'. It also resets the Alert subject to 'Windows Service running'.

Ian C. said...

Here is the official response from MS about your discovered bug.

From Case #112052801654995

+ This is not a bug
+ This is by design. We do not support making any changes to the properties of the monitors which are automatically created as a result of configuring synthetic transactions

+ The logic is as follows:
When you configure a service synthetic transaction, in the background, a new class is created. This can be verified from discovered inventory view if required. Once the class's objects have been discovered, monitors from the windows service library are applied to it.

This is done via a default template which you can see if you open authoring -> monitors and scope the view to the newly created class. The monitor from the windows service library is the default template and the one saved in your MP is the one which was automatically configured for your synthetic transaction. Every time you make a change to the original service synthetic transaction you created, the default monitor template will be applied to recreate the monitor again, hence removing any customization made. This happens as making changes to the original transaction, essentially recreates the class and hence everything is re-initialized. We only support the customizations provided via overrides that you can set on the monitor.

A possible workaround is to customize the monitor properties and then NEVER change any properties of the original synthetic transaction. The transaction will go back to monitoring only automatic start services after you customize the monitor properties, but this can be taken care of by setting an override on the monitor. Please remember, this is NOT supported and we do not recommend making any changes to the properties of the monitor created automatically. The properties which can be customized have been exposed via overrides and that remains the only supported method of customization.

PS. this behavior is not based on a specific CU

Marnix Wolf said...

Hi lcl or IcI :)

Thanks for sharing this information, really appreciated.

However, sometimes it's needed to think outside the box and to work like that in order to get things going :).