Thoughts on Azure, OMS & SCOM: Troubleshooting Flow For Slow SCOM 2012x Consoles

When the SCOM 2012x Consoles (both the UI and the web-based one) are slow, it’s a challenge to find the cause. Many times it isn’t a single isolated cause but many things at play, working in concert to produce a slow SCOM 2012x Console experience.

This is a bad thing because end users will stop using SCOM when ‘it’s slow’ or, even worse, ‘unresponsive’. People start to think SCOM is a bad product and turn away from it. When this happens it’s quite a challenge to address, and the technical aspect is the easiest part. Convincing the end users to start using SCOM again is a whole different story altogether.

Therefore I’ve written this posting in order to help you to get SCOM back on track again, and even better to prevent this from happening. There is much to tell so let’s start.

The foundation: Compute, Storage & Networking
Many times (dead) slow SCOM Consoles are the result of a whole long chain of issues. Therefore it’s better to start at the beginning of it all, the three pillars of your data center/cloud based solution: Compute, Networking and Storage.

Compute
Yes, virtualization is everywhere and has become the norm. 99% of the SCOM installations I bump into are virtualized, which is totally understandable and even the default way of operating. That is not an issue in itself NOR a cause for slow SCOM Consoles.

And yet, overcommitted hosts, badly configured hosts, or hosts running old and outdated technologies are many times the culprit for bad performance. Sure, SCOM isn’t a production critical system. But that doesn’t mean it should be put on second-rate hosts or even worse.

At least you want your monitoring/management solution to be on par with your production environment. How else is it going to be used to measure and manage it? It simply won’t.

So make sure the SCOM environment is running on good virtualization hosts, the kind that would be used for the production environment as well. Also make sure these hosts aren’t overcommitted and use the same configuration settings as the hosts for the production environment. A VM is a VM, and whether it’s production or not, the same configuration rules are at play here.

When running physical SCOM servers, make sure those servers use modern technologies and not technologies which were modern 5 or more years ago. Again, SCOM isn’t production, and yet it needs to be taken seriously and thus installed on serious, current hardware, not on the leftovers which would otherwise be ditched or put into digital playgrounds. Hardware like that isn’t meant to run SCOM.

Storage
This is a nice one. Storage configured for maximized capacity measured in TBs IS NOT the type of storage you want to put your monitoring/management solution on. Ever! It will simply bring down the best applications no matter how well you tune them. Simply stay away from it.

Many times I see iSCSI based SANs being used. And many times these types of SANs perform well, as long as there are no shortcuts and the disks running the SCOM databases are configured properly. So always make sure these disks use dedicated LANs for iSCSI traffic, so they can get all the resources they need. This requires additional configuration on the virtualization side of things, but believe me, it’s worth the effort!

Networking
Network connectivity between the SCOM Management Servers and all related SQL servers has to be spot on. So placing one or more SCOM Management Servers on other LAN segments, or even on the other side of a WAN connection, is a no-go. Of course there are exceptions, like remote locations connected by dark fiber, but even in those situations you must be 100% sure the latency is really low.

Otherwise the availability of your SCOM 2012x MG functions and Resource Groups will take a serious beating, resulting in an unstable SCOM environment.

Same goes for the related SQL servers. Make sure they’re connected properly together with the SCOM 2012x Management Servers. And when using a separate dedicated SQL Server Reporting Services (SSRS) instance, make sure to apply the same connectivity rules as well. Otherwise SCOM Reporting won’t deliver the expected performance either.
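As a rough first check on the connectivity described above, the time it takes to open a TCP connection from a Management Server to the SQL server can be sampled. This is only a minimal sketch under assumptions: the function names are made up for illustration, port 1433 is the SQL Server default and may differ in your environment, and a TCP connect time is just an indicator of network latency, not a full measurement of SQL responsiveness.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int = 1433, timeout: float = 3.0) -> float:
    """Time a single TCP connect to host:port, in milliseconds.

    1433 is the default SQL Server port; adjust it for named instances
    or custom ports (assumption: the port is reachable from this machine).
    """
    start = time.perf_counter()
    # Open the connection and close it again right away.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host: str, port: int = 1433, samples: int = 5) -> float:
    """Median of several connect timings, to smooth out a single outlier."""
    return statistics.median(tcp_connect_ms(host, port) for _ in range(samples))
```

Run from each Management Server against each SQL server (for example `median_connect_ms("sql01.example.local")`, a hypothetical host name), this gives a quick feel for which servers sit too far away from their databases.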

The SQL servers
Paul Keely has written excellent documents all about using SQL for System Center technologies. These documents contain tons of good information, so use them. At least read them and take note. They will help you design proper SQL servers for your SCOM 2012x environment.

Some good tips and tricks
SQL has the nasty habit of taking ALL available RAM. For SQL Server this seems to be okay, but the underlying Server OS might get starved, which is bad for SQL as well since it runs on top of the OS…

Therefore when provisioning SQL servers, reserve at least 4 GB of RAM for the Server OS itself. This will prevent the Server OS from starving, enabling SQL to run better.
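The arithmetic behind that reservation is simple, and the result can be used as a starting value for SQL Server’s ‘max server memory (MB)’ setting. The sketch below is illustrative only: the function name is made up, and the 4 GB default is the minimum mentioned above, not a sizing rule for every host.

```python
def sql_max_server_memory_mb(total_ram_gb: float, os_reserve_gb: float = 4.0) -> int:
    """Return a candidate 'max server memory (MB)' value for SQL Server.

    total_ram_gb  -- RAM assigned to the SQL server (VM or physical)
    os_reserve_gb -- RAM left for the Server OS (4 GB is the minimum
                     suggested in the text; larger hosts may need more)
    """
    if total_ram_gb <= os_reserve_gb:
        raise ValueError("no RAM left for SQL after reserving memory for the OS")
    return int((total_ram_gb - os_reserve_gb) * 1024)


# Print candidate values for a few common host sizes.
for ram_gb in (16, 32, 64):
    print(f"{ram_gb} GB host: cap SQL at {sql_max_server_memory_mb(ram_gb)} MB")
```

Treat the outcome as a starting point and keep an eye on the OS afterwards; other services on the same server (SSRS, backup agents and so on) need their share too.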

Split the databases and logs! Put the temp DB on a disk of its own. The same goes for the log files and for the SCOM databases as well. Even better, use a dedicated SQL server for the OpsMgr database and another for the OpsMgrDW database. Put the system databases on a disk of their own as well.

For bigger environments, consider even a dedicated SQL server for the SSRS instance used for SCOM. When using SQL Server Standard edition AND using all those SQL servers and instances SOLELY for System Center 2012x technologies, no additional money for SQL Server licenses is required.

I won’t pretend I am a SQL DBA. But from my personal experience I know this approach works well and results in a well-performing SCOM 2012x environment. And when issues do arise, because all SQL servers are split, it’s far easier to pinpoint the issues at hand and remedy them.

The SCOM 2012x MS Servers
Like I stated in a previous posting of mine, nowadays it’s better to roll out an additional SCOM 2012x MS server. This makes your environment more scalable and robust without having to go back to the drawing board when the monitoring requirements change.

I have seen this happen many times before, and it makes a BIG difference when an additional MS server is already available, compared to provisioning a new VM, rolling out SCOM and configuring it. The latter takes far more time, even when the installation of the SCOM server itself is automated with PowerShell.

Resource Pools
Out of the box, a newly installed MG contains three Resource Pools. Use them wisely and add Resource Pools even more wisely. Don’t add them like you’re adding subscriptions, for instance.

Every single Resource Pool requires attention and maintenance performed by the SCOM 2012x MS servers and MG as such. There are many different use cases and scenarios out there, some of them almost demanding a dedicated Resource Pool whereas others might fold in just fine with one of the already present Resource Pools.

The All Management Servers Resource Pool is a special one, requiring additional care. Sometimes there are good reasons to ‘break’ the automatic population rule of the Resource Pool and to remove one or more MS servers from it.

But be very careful here and only do this when you have really solid reasons for it (like isolating a SCOM 2012x MS server for a dedicated role) AND know what adverse side effects it might have on the overall health and availability of your MG as a whole. If you don’t, please don’t touch that Resource Pool.

UNIX/Linux monitoring
SCOM 2012x MS servers participating in the Resource Pool used for monitoring UNIX/Linux systems might require additional RAM and CPU, simply because the UNIX/Linux SCOM Agents are nothing but web services, fully managed by the SCOM 2012x MS servers.

So additional power for those servers might come in handy. Just monitor them more closely as more UNIX/Linux systems are added to the mix. This will prevent performance issues as well.

Okay. I’ve followed all your advice and yet, the SCOM Consoles are dead slow! Now what?
Time to investigate! Start at the very bottom of things:

Compute
Storage
Networking.

Use the available tools for it and don’t forget about SCOM itself. There are many good reports out there that give you deeper insight into the performance of your SCOM infrastructure, not limited to CPU, RAM and disk queue length, but also covering:

Number of deadlocks on the related SQL DB engines;
RAM consumption of the related SQL DB engines;
Number of transactions per second for the related SQL DBs;
Active connections count for the related SQL DBs.

On top of it all, also take a look at what’s hitting the SCOM environment, like too many Alerts, State Changes, Performance Counters, Event collections and so on. The Report Data Volume By Management Pack (found under System Center Core Monitoring Reports) quickly shows you which MP is generating the most volume.

When you click on one of those values (like the one highlighted in yellow), a second report is rendered, the Report Data Volume By Workflow and Instance. It shows which Monitors/Rules in particular cause the bulk of the total data volume in your SCOM MG:

By ‘simply’ tuning the first three Monitors you’ll address almost 35% of the data volume created by the identified MP in the first Report!
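To decide how many Monitors/Rules are worth tuning first, the share of the top few workflows in the total volume can be worked out from the numbers in the report. The sketch below assumes the per-workflow counts have been copied out of the report by hand or from an export; the function name and the sample figures are made up for illustration and are not taken from a real Management Group.

```python
def top_n_share(volumes: dict[str, int], top_n: int = 3) -> float:
    """Percentage of total data volume caused by the top_n largest workflows."""
    total = sum(volumes.values())
    if total == 0:
        return 0.0
    largest = sorted(volumes.values(), reverse=True)[:top_n]
    return 100.0 * sum(largest) / total


# Hypothetical per-workflow counts for one Management Pack.
sample = {
    "Monitor A": 1500,
    "Monitor B": 1200,
    "Monitor C": 800,
    "Rule D": 700,
    "Rule E": 650,
    "Rule F": 600,
}
print(f"Top 3 workflows: {top_n_share(sample):.1f}% of total volume")
```

When a handful of workflows accounts for a large share, tuning those few is the quickest win; when the volume is spread evenly, the MP as a whole deserves a closer look.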

When you run reports like these on a weekly basis for the first few months and try to tune the identified Monitors and Rules, your SCOM environment will take a much smaller performance hit. When more in control, run these Reports on a monthly basis and go from there.

Another two reports which come in handy are both from the SCC Health Check Reports, Alerts - Top 20 Alerts By Alert Count (OM) and Alerts - Top 20 Alerts By Repeat Count (OM).

These Reports quickly show you which Alerts are triggered most often. When you solve the causes of those Alerts and tune the related Rules/Monitors, your SCOM environment will suffer fewer performance issues since far fewer Alerts come in.

Again, run these Reports on a weekly basis for starters. When more in control, run these reports on a monthly basis.

Recap
When using SCOM 2012x, design and implement it properly. Even so, like any other technology, it requires maintenance and a watchful eye on it all, using the tips and tricks I provided. Soon you’ll see you’re in control of your environment and know how to check it when performance issues do arise.

Tuesday, April 15, 2014