Wednesday, September 7, 2011

Meet the wacky network module of SCOM…

As many of us know, the network module (MOMNetworkModules.dll) of SCOM isn’t very robust, and it doesn’t scale well (actually, not at all). Even in SCOM R2 CU#5 environments this module is something to reckon with; otherwise you might end up with strange behavior in your SCOM environment.

The good news is that Microsoft has totally rewritten this module for OM12. So network monitoring, or monitoring any SNMP enabled device, is much better in OM12 than in SCOM today.

Today I bumped into an issue at a customer’s site (SCOM R2 CU#5 based) which really puzzled me for a while. Soon it turned out the wacky network module was the culprit, so the problem was quickly solved.

The SCOM R2 environment was monitoring some SNMP enabled devices. But when some Monitors were added, others were changed and some Rules were added to get deeper monitoring of those same SNMP devices, the changes didn’t seem to land. The Monitors were listed in Health Explorer but they never initialized. I double-checked the Monitors, their targeting (disabled by default and enabled through an override targeted against a certain group of network devices) and everything else. All was well. But still not a single Monitor started. Ever.

The rules acted in the same strange way. Everything in place and properly configured. But no performance data came in.

So it was time for a deeper dive. Soon it turned out this Management Group monitors 300+ network devices directly with SCOM. Many more network devices are monitored as well, but not directly by SCOM. Those devices are managed by third party vendor software, which is far more robust than the SCOM network module, before the information is routed to SCOM. So no hassles there.

And all these network devices, monitored directly by SCOM itself, were reporting to the same SCOM R2 Management Server. So apparently the network module was causing the strange behavior of the newly created and/or modified Monitors and Rules targeted against the network devices.

Testing 1,2
Time for a test. It was decided to move the SNMP enabled devices with the strangely behaving Monitors and Rules over to another Management Server (acting as their Proxy Agent). This was done within 5 minutes. And another 5 minutes later, the Monitors and Rules started to work like they were supposed to.

Conclusion and Cause
So the wacky network module was hampering proper monitoring of those SNMP enabled devices. By moving them over to another Management Server this issue was circumvented, resulting in proper monitoring.

The Day After
Still there is a Management Server which monitors 300+ network devices, directly out of SCOM. This is way TOO much. As a rule of thumb I never ever monitor more than 150 network devices per Management Server. Even that number might cause issues. So many of those network devices (up to 75%) will be removed and monitored by the third party software and then piped to SCOM. This will take the burden off that Management Server.

Whenever you monitor many network devices directly with SCOM (so no third party software is used, like Jalasoft or the OpsLogix Network MP), distribute the load of those network devices evenly among your Management Servers. When those Management Servers are properly dimensioned, 100 to 150 network devices per server might do. Of course, this will differ per environment. Always monitor your Management Servers. Watch for strange behavior of Monitors/Rules and a Health Service which is consuming too much CPU and RAM. Many times these are the signs of the wacky network module causing issues…
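
To make the "distribute evenly, stay under 150" rule of thumb a bit more tangible, here is a minimal planning sketch in Python. It does not talk to SCOM at all; it only works out which device could go to which Management Server. The server names (MS01 to MS03), the device names and the `distribute` function are made up for this illustration, and the cap of 150 is just my rule of thumb, not a product limit.

```python
from itertools import cycle

# Rule of thumb from this post, not a documented SCOM limit.
MAX_DEVICES_PER_SERVER = 150


def distribute(devices, servers, cap=MAX_DEVICES_PER_SERVER):
    """Spread devices round-robin over servers, refusing plans above the cap."""
    if not servers:
        raise ValueError("at least one Management Server is required")
    if len(devices) > cap * len(servers):
        raise ValueError(
            f"{len(devices)} devices do not fit on {len(servers)} "
            f"server(s) with a cap of {cap} each"
        )
    plan = {server: [] for server in servers}
    # Round-robin keeps the per-server counts within one device of each other.
    for device, server in zip(devices, cycle(servers)):
        plan[server].append(device)
    return plan


# Hypothetical inventory: 300 devices, three Management Servers.
devices = [f"switch-{i:03d}" for i in range(300)]
plan = distribute(devices, ["MS01", "MS02", "MS03"])
print({server: len(assigned) for server, assigned in plan.items()})
# prints {'MS01': 100, 'MS02': 100, 'MS03': 100}
```

Feeding the same 300 devices to a single server raises an error, which is exactly the situation described above. The actual move of a device to another Management Server is still done in SCOM itself.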
