Monday, July 20, 2009

OpsMgr SP1: Is the HealthService process of certain Agents consuming all CPU cycles?

The core process of the OpsMgr Agent is the HealthService executable. Sometimes it can become a ‘runaway train’, consuming as many CPU cycles as are available.

This issue can be hard to troubleshoot, since most of the time there isn’t one single cause but a combination of causes, none of which would be an issue on its own.

Here I will describe the approach I took at a customer’s site, which did the trick. Feel free to comment if you know some tricks as well. Keep me sharp!

First of all, some history on OpsMgr. When it went RTM there were certain issues which needed attention. SP1 addressed these. And after applying certain hotfixes (check out this blog posting of Kevin Holman on this topic), OpsMgr became much better.

So whenever there are issues with an OpsMgr SP1 environment, check its patch level and double-check it by looking at the file versions on the RMS, Management Servers and Gateways. For Agents this process can be automated. Check this posting of mine on how to go about it.

Other items worth looking at are:

  1. How is the server itself performing?
    Stop the HealthService (OpsMgr Health Service) on the server experiencing this issue and check the CPU load. When it stays constantly above 80-90%, there are other issues at hand which aren’t OpsMgr related.

  2. Customized MPs
    Have there been modifications to MPs? Or have new MPs been created with custom-made scripts? Especially the latter, when not done properly, can really cause havoc on monitored objects. When newly developed MPs have been loaded or existing MPs have been modified, override them in such a manner that they no longer run.

    When the HealthService becomes stable again, you know these MPs are the cause.

  3. Old MPs
    In the past there were issues with the DNS and DHCP MPs which caused high CPU loads on the monitored servers. Check whether these MPs are up to date.

    Be aware, though, that updating MPs takes more than a few simple mouse clicks: Change Management & RTFM are aspects to reckon with.

  4. Health Service State folder of OpsMgr Agent
    When an OpsMgr Agent is installed, it gets its own directory (C:\Program Files\System Center Operations Manager 2007). In this folder there is a subfolder named ‘Health Service State’. Here the OpsMgr Agent stores all the information it needs, like which Management Group(s) to connect to, the downloaded MPs and so on.

    Whenever the process HealthService.exe becomes ‘CPU-hungry’ it can help to have this folder recreated by the Agent. How?
    - Stop the HealthService (OpsMgr Health Service)
    - Rename the folder ‘Health Service State’ to ‘Health Service State_OLD’
    - Start the HealthService

    In the beginning the HealthService process will consume more CPU, since the OpsMgr Agent needs to download all its information again. However, this should show up as CPU spikes, not as a CPU flatline. Depending on the server hardware and the available resources it can take between 5 and 15 minutes. Afterwards the process should consume few CPU cycles.

  5. WMI-I
    On Windows 2003 servers only: WMI may need some patching to make it more stable. But this is only the case when the OpsMgr Console shows multiple Alerts like ‘WMI Probe Failed Execution’. Kevin Holman has blogged about it.

  6. WMI-II
    Recompiling WMI can also do the trick. Check this posting of mine on how to go about it (Resetting WMI).

  7. Windows Scripting Host
    On Windows 2003 servers only. WSH versions older than 5.7 can cause scripts to run wild. The earlier mentioned blog posting of Kevin Holman also contains a URL for the update of WSH to version 5.7.
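
The three manual steps from item 4 (stop the service, rename the folder, start the service) can also be scripted. Below is a minimal Python sketch of that sequence, not a tested production tool. It assumes the Windows service is registered under the name HealthService and uses the standard `net stop` / `net start` commands; the function name `reset_health_service_state` and the `run` parameter are my own illustration.

```python
import os
import subprocess
from typing import Callable, Sequence

# Assumption: the OpsMgr Health Service is registered under this service name.
SERVICE_NAME = "HealthService"
STATE_FOLDER = "Health Service State"


def run_command(args: Sequence[str]) -> None:
    """Run a command and raise if it returns a non-zero exit code."""
    subprocess.run(list(args), check=True)


def reset_health_service_state(
    agent_dir: str,
    run: Callable[[Sequence[str]], None] = run_command,
) -> str:
    """Stop the service, rename the state folder to *_OLD, start the service.

    Returns the path of the renamed folder. The Agent is expected to
    recreate 'Health Service State' itself once the service starts again.
    """
    state_dir = os.path.join(agent_dir, STATE_FOLDER)
    backup_dir = state_dir + "_OLD"

    # Check first, so the service is not stopped for nothing.
    if not os.path.isdir(state_dir):
        raise FileNotFoundError(state_dir)
    if os.path.exists(backup_dir):
        raise FileExistsError(backup_dir)

    run(["net", "stop", SERVICE_NAME])
    try:
        os.rename(state_dir, backup_dir)
    finally:
        # Start the service again even when the rename failed.
        run(["net", "start", SERVICE_NAME])
    return backup_dir


# Example call, from an elevated prompt on the affected server:
# reset_health_service_state(
#     r"C:\Program Files\System Center Operations Manager 2007")
```

The `run` parameter is only there so the sequence can be tried out against a test folder without touching a real service. Once the Agent is healthy again, the ‘_OLD’ folder can be removed by hand.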

A nice View to build in the OpsMgr Console is a Performance View which shows the CPU usage of the process HealthService.exe. Alexandre Verkinderen (OpsMgr MVP) has written a good blog posting about it.

So whenever complaints about slow server performance come in, this View makes it easy to check whether the OpsMgr Agent is the culprit.
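
When you have a series of CPU samples for HealthService.exe (or for the whole server), the difference between harmless spikes and a flatline can be put into a small rule of thumb. The Python sketch below is only an illustration: the function name `classify_cpu_load` is hypothetical, the 80% threshold comes from the 80-90% mentioned earlier, and the 90% share of samples that counts as ‘constantly’ is my own assumption.

```python
from typing import Sequence


def classify_cpu_load(
    samples: Sequence[float],
    threshold: float = 80.0,          # percent CPU regarded as 'high'
    sustained_fraction: float = 0.9,  # assumption: share of high samples that counts as constant
) -> str:
    """Classify CPU samples (in percent) as 'normal', 'spikes' or 'flatline'."""
    if not samples:
        raise ValueError("need at least one CPU sample")

    high = sum(1 for value in samples if value >= threshold)
    if high == 0:
        return "normal"
    if high / len(samples) >= sustained_fraction:
        return "flatline"
    return "spikes"


# A few short peaks, as expected right after the state folder is recreated
print(classify_cpu_load([5, 90, 4, 6, 85, 3, 5, 4, 6, 5]))
# Load pinned high for the whole sampling window
print(classify_cpu_load([95] * 10))
```

A ‘flatline’ result with the HealthService stopped points at a problem outside OpsMgr; a ‘flatline’ for HealthService.exe itself is the runaway behaviour this posting is about.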


Sonia Lopes said...

Great post! We are getting "Description: AgentMinRequiredVersionCheck.vbs : An error occurred while reading the registry" on our SCOM 2012 SP1 RU2 management servers. any idea on causes?

Marnix Wolf said...

Hi Sonia.

Does this happen on the Management Servers or the monitored servers?