Friday, August 20, 2010

The case of the monitored file cluster and the missing collection counters

This is a nice one. A respected friend of mine from Australia mailed me with a good question. It took me some time to get to the bottom of it and to figure out a good approach.

The Case – The Monitored Cluster
Suppose you run a file server based on a Failover Cluster configuration, consisting of Cluster Node A and Cluster Node B. Cluster Node B is idle and Cluster Node A is the owner of all resources, among them Disks P1 and P2.

This configuration is being monitored by SCOM (R2 ideally). For this, the Server OS MP and the Cluster MP (among others) have been imported and configured. The Agent Proxy setting is also enabled on the SCOM R2 Agents running on both Cluster Nodes. So far so good. The Cluster is being monitored and performance collection runs as well.

The Case – Disaster Strikes
Cluster Node A runs well from Monday till Wednesday morning but dies on Wednesday afternoon. Cluster Node B kicks in and becomes the new owner of all the resources, among them Disks P1 and P2.

The Case – The Report and the missing data 
After a week, someone runs a Report to find out more about the % of disk space being used on disks P1 and P2. The Report is targeted at server level. At first glance the Report seems to be just fine. But wait! From Monday until Wednesday the data is neatly shown, but after that the graph drops to zero! Huh?

The Question
What? Where has the data gone? The disks are still in place and available. So why does the graph suddenly drop to zero or, better, to nothingness? Has the Cluster MP turned sour?

The Explanation – Part 1
First of all, the Cluster MP does not collect any performance metrics at all. This is done by the Server OS MP. The Cluster MP covers many health and configuration aspects of the Cluster itself and Alerts when something is not OK.

Time to move on.

The catch here is that Cluster Node B has become the new owner of the disks. So that server will run the collection rules from the moment (*) it became the owner. So when you run a new Report targeted against that server, the graph will start from Wednesday. (* There is a pitfall to reckon with!)

So you end up with two Graphs? One for Cluster Node A and another for Cluster Node B? Yes, you could…

The graph for Cluster Node A displays a normal graph from Monday till Wednesday and after that a flat line. The same goes for the Report when targeted against Cluster Node B: a flat line from Monday till Wednesday and a valid graph from early Thursday till Friday.

What about the pitfall?

Good question! As we all know, monitoring and/or performance collection can only start AFTER the discovery has run and ended successfully. Ending successfully is no issue here, but when the discovery runs is. Why? Well, the discovery of the Logical Disks runs only once every 24 hours.

So in a ‘worst-case’ scenario you miss out on monitoring and performance collection for a maximum of 24 hours! Of course, an override could be used here, targeted against the Group ‘Cluster Roles’, in order to reduce that time. But use it wisely. Discoveries that run too often can cause other issues…
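To make that worst case a bit more tangible, here is a small Python sketch. It is not SCOM code: it simply assumes the discovery fires on a fixed interval counted from a known last run, and the function name and the example times are made up for illustration.

```python
from datetime import datetime, timedelta

def collection_gap(failover, last_discovery, interval=timedelta(hours=24)):
    """Return (next_discovery, gap) for a failover at the given time.

    Assumption: discovery on the new owner runs at
    last_discovery + k * interval, and failover >= last_discovery.
    """
    elapsed = failover - last_discovery
    # Number of whole intervals to add to reach the first run after failover.
    cycles = elapsed // interval + 1
    next_discovery = last_discovery + cycles * interval
    return next_discovery, next_discovery - failover

# Hypothetical times: discovery last ran Wednesday 09:00, failover at 14:00.
failover = datetime(2010, 8, 18, 14, 0)
last_run = datetime(2010, 8, 18, 9, 0)

# Default 24-hour interval versus a 4-hour override.
default_next, default_gap = collection_gap(failover, last_run)
override_next, override_gap = collection_gap(failover, last_run, timedelta(hours=4))

print("24h interval: collection resumes", default_next, "gap", default_gap)
print(" 4h override: collection resumes", override_next, "gap", override_gap)
```

With these made-up times the default interval leaves 19 hours without data, while the 4-hour override shrinks that to 3 hours. The price is six times as many discovery runs, which is exactly the trade-off to weigh.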

The Explanation – Part II, the Smart Approach
When you are running a two Node Cluster, the above mentioned approach should do. But suppose you are running a Cluster with more than two Nodes? When a failover occurs, there are multiple possible new owners available. So when a Report is to be created, one must know exactly which Cluster Node was the owner of the Resource the Report is about. And not just that, but also when.

This is not viable at all. It would take way too much time. So another approach is required.

The idea here is that you do not target the Cluster Node owning the Resource, but the Resource itself instead. When you select the disk instead of the Cluster Node, you will find two or more paths related to this object, which is logical when a failover has occurred. Referring to the above mentioned example, you could see something like this in the Add Group screen for the Performance Report when adding a new Series to a Chart:

Name | Class | Path
-----|-------|-----
P1 | Windows Server 200x Logical Disk | FQDN of Cluster Node A
P1 | Windows Server 200x Logical Disk | FQDN of Cluster Node B
P2 | Windows Server 200x Logical Disk | FQDN of Cluster Node A
P2 | Windows Server 200x Logical Disk | FQDN of Cluster Node B

Add one series per path into the same Graph. This way you will get a graph which shows all the collected performance data, across the different Nodes, without the need to take a deep dive into which Cluster Node owned which Resource, and when.
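The idea of one series per path can be sketched outside of SCOM as well. The Python snippet below is only an illustration under made-up assumptions: the node names, the sample values and the `merge_paths` helper are all hypothetical and do not come from the Report or the Data Warehouse. It groups the samples by disk rather than by owning Node, so each disk gets a single timeline.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical samples: (disk, path = FQDN of owning node, timestamp, % used).
SAMPLES = [
    ("P1", "nodeA.example.local", datetime(2010, 8, 16, 12), 61.0),
    ("P1", "nodeB.example.local", datetime(2010, 8, 19, 12), 64.0),
    ("P1", "nodeA.example.local", datetime(2010, 8, 17, 12), 62.5),
    ("P1", "nodeB.example.local", datetime(2010, 8, 20, 12), 64.8),
    ("P2", "nodeA.example.local", datetime(2010, 8, 16, 12), 40.0),
    ("P2", "nodeB.example.local", datetime(2010, 8, 19, 12), 41.2),
]

def merge_paths(samples):
    """Group samples by disk (not by node) and sort each group by time."""
    merged = defaultdict(list)
    for disk, path, when, pct_used in samples:
        merged[disk].append((when, path, pct_used))
    return {disk: sorted(points) for disk, points in merged.items()}

timeline = merge_paths(SAMPLES)
for when, path, pct_used in timeline["P1"]:
    print(when.date(), path, pct_used)
```

The owner of each sample is kept alongside the value, so the failover moment is still visible in the merged timeline, but nobody has to look it up beforehand.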

Of course, this graph can have a gap of a maximum of 24 hours…
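If you want to know how large that gap actually was, a quick check on the sample timestamps will tell you. Again a Python sketch with assumptions: the hourly sample interval, the example times and the `find_gaps` helper are invented for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive samples are further
    apart than the expected sample interval."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected]

# Hypothetical hourly samples: Node A until Wednesday 14:00,
# Node B from Thursday 09:00 (after its discovery has run).
node_a = [datetime(2010, 8, 18, 12) + timedelta(hours=h) for h in range(3)]
node_b = [datetime(2010, 8, 19, 9) + timedelta(hours=h) for h in range(3)]

gaps = find_gaps(node_a + node_b)
for start, end in gaps:
    print("no data from", start, "to", end, "=", end - start)
```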

4 comments:

John Bradshaw said...

This approach works well. thx Marnix
John Bradshaw

Marnix Wolf said...

Hi John.

You're welcome mate!

Cheers,
Marnix

leiram4 said...

Hi,

I have a question regarding overrides on a single drive of a cluster owned logical disk, if you might be able to help as it's in similar line with the reporting issue?

I have a cluster that requires an override for ONE non-system shared drive to be a different threshold to all other non-system drives on the cluster.

To do the override on that single drive, only the single node that owns that drive shows up for G: drive as the path. The passive node is not shown, so not sure how this could be possible?

I thought a group might work, but cannot add G: for both nodes as it only shows the drive with the path for the current owner....

Any ideas?

Marnix Wolf said...

Hi leiram4.

My guess here is that when you target the override against the drive directly and select its multiple paths (if any) it should work as intended.

Thing is that with Cluster Resources it is always a bit tricky. Another approach (but less attractive because it affects production) is to fail the disk over to the other Node (outside production hours if possible) and target another override against that drive, now owned by the new Cluster Node.

Cheers,
Marnix