Tuesday, April 15, 2014

Troubleshooting Flow For Slow SCOM 2012x Consoles

When the SCOM 2012x Consoles (both the UI and the web-based one) are slow, it’s a challenge to find the cause. Many times it isn’t a single isolated cause but many things at play, working in concert to produce a slow SCOM 2012x Console experience.

This is a bad thing, because end users will stop using SCOM since ‘it’s slow’ or, even worse, ‘unresponsive’. People start to think SCOM is a bad product and turn away from it. When this happens it’s quite a challenge to turn things around, and the technical aspect is the easiest part. Convincing the end users to start using SCOM again is a different story altogether.

Therefore I’ve written this posting to help you get SCOM back on track and, even better, to prevent this from happening in the first place. There is much to tell, so let’s start.

The foundation: Compute, Storage & Networking
Many times (dead) slow SCOM Consoles are the result of a long chain of issues. Therefore it’s better to start at the beginning of it all, the three pillars of your data center/cloud based solution: Compute, Storage and Networking.

Yes, virtualization is everywhere and has become the norm. 99% of the SCOM installations I bump into are virtualized, which is totally understandable and even the default way of operating. That is not an issue in itself, NOR a cause for slow SCOM Consoles.

And yet overcommitted hosts, badly configured hosts, or hosts running old and outdated technologies are many times the culprit for bad performance. Sure, SCOM isn’t a production critical system. But that doesn’t mean it should be put on second-rate hosts or even worse.

At least you want your monitoring/management solution to be on par with your production environment. How else is it going to be used to measure and manage it? It simply won’t.

So make sure the SCOM environment runs on good virtualization hosts, of the kind you would use for the production environment as well. Also make sure these hosts aren’t overcommitted and use the same configuration settings as the hosts for the production environment. A VM is a VM, and whether it’s production or not, the same configuration rules are at play.

When running physical SCOM servers, make sure those servers use modern technologies and not technologies which were modern 5 or more years ago. Again, SCOM isn’t production and yet it needs to be taken seriously, thus installed on serious, current hardware and not on the leftovers which would otherwise be ditched or put into digital playgrounds. Hardware like that isn’t meant to run SCOM.

This is a nice one. Storage configured for maximum capacity, measured in TBs, IS NOT the type of storage you want to put your monitoring/management solution on. Ever! It will bring down the best applications no matter how well you tune them. Simply stay away from it.

Many times I see iSCSI based SANs being used, and many times these SANs perform well, as long as no shortcuts are taken and the disks running the SCOM databases are configured properly. So always make sure these disks use dedicated LANs for iSCSI traffic, so they get all the resources they need. This requires additional configuration on the virtualization side of things, but believe me, it’s worth the effort!

Network connectivity between the SCOM Management Servers and all related SQL servers has to be spot on. So placing one or more SCOM Management Servers on other LAN segments, or even behind WAN connections, is a no-go. Of course there are exceptions, like remote locations connected by dark fiber, but even in those situations you must be 100% sure the latency is really low.

Otherwise the availability of your SCOM 2012x MG functions and Resource Groups will take a serious beating, resulting in an unstable SCOM environment.
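The latency requirement described above can be spot-checked from a Management Server. What follows is only a minimal sketch, not an official SCOM tool: it times a TCP connect to the SQL server. The function name `tcp_connect_ms` and the host name in the usage comment are hypothetical, and port 1433 is assumed because it is the SQL Server default; a named instance or custom port needs a different value.

```python
import socket
import time


def tcp_connect_ms(host, port=1433, timeout=3.0):
    """Return the time in milliseconds needed to open a TCP connection.

    This measures connection setup only (roughly one network round trip),
    not query performance. Raises OSError when the connect fails.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it again right away
    return (time.perf_counter() - start) * 1000.0


# Usage (hypothetical host name), run on a SCOM Management Server:
#   samples = [tcp_connect_ms("sql01.contoso.local") for _ in range(10)]
#   print(max(samples))
```

Taking several samples and looking at the worst one gives a better picture than a single measurement, since short latency spikes are what tend to upset the Resource Pools.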

The same goes for the related SQL servers. Make sure they’re properly connected to each other and to the SCOM 2012x Management Servers. And when using a separate, dedicated SQL Server Reporting Services (SSRS) instance, apply the same connectivity rules as well. Otherwise SCOM Reporting won’t deliver the expected performance either.

The SQL servers
Paul Keely has written excellent documents all about using SQL for System Center technologies. These documents contain tons of good information, so use them. At the very least read them and take note. They will help you design proper SQL servers for your SCOM 2012x environment.

Some good tips and tricks
SQL has the nasty habit of taking ALL available RAM. For SQL Server this seems to be okay, but the underlying Server OS might get starved, which is bad for SQL as well since it runs on top of the OS…

Therefore when provisioning SQL servers, reserve at least 4 GB of RAM for the Server OS itself. This will prevent the Server OS from starving, enabling SQL to run better.
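The reservation rule above translates into a simple calculation for SQL Server’s ‘max server memory (MB)’ setting. This is a minimal sketch, assuming the server is dedicated to SQL and that the 4 GB reserve mentioned above is the only deduction; the function name `max_server_memory_mb` is made up for illustration.

```python
def max_server_memory_mb(total_ram_gb, os_reserve_gb=4):
    """Return a candidate 'max server memory (MB)' value.

    total_ram_gb  -- RAM assigned to the SQL server (VM or physical)
    os_reserve_gb -- RAM left to the Server OS (4 GB, per the rule above)
    """
    if total_ram_gb <= os_reserve_gb:
        raise ValueError("server has no RAM left for SQL after the OS reserve")
    return int((total_ram_gb - os_reserve_gb) * 1024)


print(max_server_memory_mb(32))  # 28672
```

On servers that also host SSRS or other services, a bigger reserve is probably needed, so treat the outcome as a starting point.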

Split the databases and logs! Put the tempdb on a disk of its own. The same goes for the log files and for the SCOM databases. Even better, use a dedicated SQL server for the OpsMgr database and another for the OpsMgrDW database. Put the system databases on a disk of their own as well.

And for bigger environments, consider even a dedicated SQL server for the SSRS instance being used for SCOM. When using SQL Server Standard edition AND using all those SQL servers and instances SOLELY for System Center 2012x technologies, no additional SQL Server licenses need to be bought.

I won’t pretend I am a SQL DBA. But from personal experience I know this approach works well and results in a well performing SCOM 2012x environment. And when issues do arise, because all SQL servers are split, it’s far easier to pinpoint the issues at hand and remedy them.

The SCOM 2012x MS Servers
Like I stated in a previous posting of mine, nowadays it’s better to roll out an additional SCOM 2012x MS server. This makes your environment more scalable and robust without having to go back to the drawing board when the monitoring requirements change.

I have seen this happen many times before, and it makes a BIG difference when an additional MS server is already available, compared to provisioning a new VM, rolling out SCOM and configuring it. The latter takes far more time, even when the installation of the SCOM server itself is automated by using PS.

Resource Pools
Out of the box, a newly installed MG contains three Resource Pools. Use them wisely and add Resource Pools even more wisely. Don’t add them like you’re adding subscriptions, for instance.

Every single Resource Pool requires attention and maintenance performed by the SCOM 2012x MS servers and MG as such. There are many different use cases and scenarios out there, some of them almost demanding a dedicated Resource Pool whereas others might fold in just fine with one of the already present Resource Pools.

The All Management Servers Resource Pool is a special one, requiring additional care. Sometimes there are good reasons to ‘break’ the automatic population rule of the Resource Pool and to remove one or more MS servers from it.

But be very careful here and only do this when you have really solid reasons for it (like isolating a SCOM 2012x MS server for a dedicated role) AND know what adverse side effects it might have on the overall health and availability of your MG as a whole. When you don’t, please don’t touch that Resource Pool.

UNIX/Linux monitoring
SCOM 2012x MS servers participating in the Resource Pool used for monitoring UNIX/Linux systems might require additional RAM and CPU. Simply because the UNIX/Linux SCOM Agents are nothing but web services, fully managed by the SCOM 2012x MS servers.

So additional power for those servers might come in handy. Just monitor them more closely as more UNIX/Linux systems are added to the mix. This will prevent performance issues as well.

Okay. I’ve followed all your advice and yet, the SCOM Consoles are dead slow! Now what?
Time to investigate! Start at the very bottom of things:

  1. Compute
  2. Storage
  3. Networking.

Use the available tools for it, and don’t forget about SCOM itself. There are many good reports out there which give you a deeper insight into the performance of your SCOM infrastructure, not limited to CPU, RAM and disk queue length, but also covering:

  1. Number of deadlocks on the related SQL DB engines;
  2. RAM consumption of the related SQL DB engines;
  3. Number of transactions per second for the related SQL DBs;
  4. Active connections count for the related SQL DBs.

On top of it all, also take a look at what’s hitting the SCOM environment, like too many Alerts, State Changes, Performance Counters, Event collections and so on. The Report Data Volume By Management Pack (found under System Center Core Monitoring Reports) quickly shows you which MP is generating the most volume.

When you click on one of those values, another report is rendered, Report Data Volume By Workflow and Instance, which shows which Monitors/Rules in particular cause the bulk of the total data volume in your SCOM MG:

By ‘simply’ tuning the first three Monitors you’ll address almost 35% of the data volume created by the identified MP in the first Report!
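The kind of share mentioned above is easy to work out yourself from the numbers in the report. A minimal sketch, with completely made-up workflow names and volumes (they don’t come from a real report) and a hypothetical helper named `top_n_share`:

```python
def top_n_share(volumes, n=3):
    """Return the percentage of total data volume caused by the n largest items.

    volumes -- dict of workflow name -> data volume (any unit, e.g. row count)
    """
    total = sum(volumes.values())
    if total == 0:
        return 0.0
    top = sorted(volumes.values(), reverse=True)[:n]
    return 100.0 * sum(top) / total


# Made-up example data: three noisy Monitors and thirteen quieter workflows.
example = {"Monitor A": 1500, "Monitor B": 1200, "Monitor C": 800}
example.update({"Workflow %d" % i: 500 for i in range(13)})

print(top_n_share(example))  # 35.0
```

The higher the share of the top few workflows, the more you gain by tuning only those before looking at anything else.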

When you run reports like these on a weekly basis for the first few months and try to tune the identified Monitors and Rules, your SCOM environment will take a much smaller performance hit. When more in control, run these Reports on a monthly basis and go from there.

Another two reports which come in handy are both from the SCC Health Check Reports, Alerts - Top 20 Alerts By Alert Count (OM) and Alerts - Top 20 Alerts By Repeat Count (OM).

These Reports quickly show you which Alerts are triggered most often. When you solve the causes of those Alerts and tune the related Rules/Monitors, your SCOM environment will suffer fewer performance issues since far fewer Alerts come in.

Again, run these Reports on a weekly basis for starters. When more in control, run these reports on a monthly basis.

When using SCOM 2012x, design and implement it properly. Even so, like any other technology, it requires maintenance and a watchful eye, using the tips and tricks I provided. Soon you’ll see you’re in control of your environment and know where to look when performance issues do arise.

Thursday, April 10, 2014

!!!Don’t Use The Resource Pool Fix Anymore!!!

The past
When the Release Candidate of SCOM 2012 was available there were some issues with it. One of those issues was related to the Resource Pools. They were rather sensitive resulting in an unstable MG. Microsoft got a lot of feedback for it and released a quick fix KB article for it, KB2714482.

In this KB article one was told – when experiencing the Resource Pools issue with the Release Candidate version of SCOM 2012 – how to add a registry key (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager) with certain values, modifying the release request time and resource pool latency.

The present
So far so good. But now the BAD news.

KB2714482 was pulled by Microsoft a long time ago, almost at the same time SCOM 2012 went RTM and became generally available. And there is a reason for it. The quick fix isn’t required any more since the SCOM 2012x versions after the Release Candidate contain a fix for it.

So now this quick fix has a negative effect, resulting in unexpected and unstable behavior of the availability and recovery time of the Resource Pools.

At every customer location I check the SCOM 2012x Management Servers for the presence of this key, make an export and, after having a talk with the customer, remove the registry key. I also reboot the servers so they start clean and with the normal configuration.

So far I have seen SCOM 2012x environments become much more stable and reliable afterwards.

My two cents
Check your SCOM 2012x Management Servers for the presence of this key. When found, make an export of it, remove the key and reboot your servers. Yes, the Resource Pools will turn up grey for some time (sometimes up to 20 minutes, depending on the size and monitoring load). However afterwards your SCOM 2012x Management Servers will be okay and the overall availability and stability of your MG will improve.

Wednesday, April 9, 2014

OM12 R2 UNIX/Linux Agent Update Failing? Try Ping & Reverse Lookup

At a customer location I had to update a whole bunch of SCOM UNIX/Linux Agents to SCOM 2012 R2 UR#1. However, most of them failed with many different error messages. This was a bit strange since these systems were being monitored as intended, so no errors or issues there.

For SCOM it’s crucial to be able to resolve the FQDNs of the UNIX/Linux systems. The reverse lookup has to be okay as well. Otherwise the management of these systems will fail.

So for every UNIX/Linux system which failed to update to the latest SCOM Agent version, I first ran a ping and then used that IP address for a reverse lookup with the NSLOOKUP utility. I ran both commands from the SCOM Management Server being used to update the SCOM Agent on those UNIX/Linux systems.

When everything matched (IP address <> FQDN) I reran the upgrade of the related UNIX/Linux Agent. And guess what? 98% ran just fine, leaving only a small number of servers not accepting this upgrade.

Whenever upgrading the SCOM 2012x Agent on a UNIX/Linux system fails, first run a ping and a reverse lookup on the SCOM Management Server being used to run those upgrades. Many times it will solve the upgrade issues, making it much easier to single out the real problematic servers.
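The check described above can be scripted when many systems are involved. This is a minimal sketch and not the procedure I ran by hand: instead of ping and NSLOOKUP it uses the resolver of the machine it runs on, so it doesn’t test ICMP reachability. The function name `forward_reverse_match` and the host name in the usage comment are hypothetical.

```python
import socket


def forward_reverse_match(fqdn,
                          resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve fqdn to an IP address and look that address up again.

    Returns (ok, ip, reverse_name); ok is True only when the reverse
    lookup returns the same name (case-insensitive, trailing dot ignored).
    """
    try:
        ip = resolve(fqdn)
    except OSError:
        return (False, None, None)  # forward lookup failed
    try:
        reverse_name = reverse(ip)[0]  # first element is the primary host name
    except OSError:
        return (False, ip, None)  # no reverse (PTR) record found
    ok = reverse_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
    return (ok, ip, reverse_name)


# Usage (hypothetical host name), run on the SCOM Management Server:
#   print(forward_reverse_match("linux01.contoso.local"))
```

The two lookup functions are parameters so the comparison logic can be exercised without touching DNS. Run it on the Management Server that performs the upgrades, since that is the resolver which matters.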

Monday, April 7, 2014

Free MP Authoring Tool For IT Pros Gets Update

In January 2014 Silect, in a joint effort with Microsoft, released a new MP authoring tool for IT Pros. This new tool also made the Visio MP Authoring tool obsolete, and the old SCOM 2007 MP Authoring tool as well.

Based on customer feedback, Silect has followed up this first release with an updated version, properly titled MP Author SP1.

Customers already registered and using MP Author will be emailed with instructions how to obtain SP1. When you’re an IT Pro and new to MP Authoring, this is the tool to have. Go here.

New MP: Lync Room System MP

A few days ago Microsoft released the LRS SCOM management pack, which allows video conferencing room administrators to monitor all the LRS deployed on campus.

This MP runs on SCOM 2012 SP1 and later versions. The MP can be downloaded from here.

Cross Post: Multi-homed Migrations In Operations Manager: Lessons From The Field

Much respected friend and fellow MVP Cameron Fuller has written an excellent posting all about multi-homed migrations.

This posting contains tons of good stuff and refers to other postings as well. For anyone moving to SCOM 2012 R2 from an older version, this is a MUST read.

Want to know more? Go here.

OM12x Version Overview

In the past I already blogged about the different SCOM 2007x versions. Since OM12x has been out for some time now, I have updated that old posting with the different OM12x versions.
  • SCOM 2012 RTM: 7.0.8560.0
  • SCOM 2012 SP1: 7.0.9538.0
  • SCOM 2012 R2: 7.1.10226.0

With this query from Kevin’s blog (found under the header Operational DB Version) you can quickly check the version of your SCOM 2012x DB:

select DBVersion from __MOMManagementGroupInfo__
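
The version overview in this posting can be turned into a small lookup, handy when checking many Management Groups. A minimal sketch, assuming the value you feed it uses the same dotted format as the overview; builds with Update Rollups or Cumulative Updates applied aren’t covered, and the names `KNOWN_BUILDS` and `release_name` are made up for illustration.

```python
# Build numbers as listed in the version overview of this posting.
KNOWN_BUILDS = {
    "7.0.8560.0": "SCOM 2012 RTM",
    "7.0.9538.0": "SCOM 2012 SP1",
    "7.1.10226.0": "SCOM 2012 R2",
}


def release_name(build):
    """Map a dotted build number to its release name, or 'unknown build'."""
    return KNOWN_BUILDS.get(build.strip(), "unknown build")


print(release_name("7.1.10226.0"))  # SCOM 2012 R2
```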