Monday, January 3, 2011

How To: Troubleshoot MPs which do not seem to land

Sometimes one bumps into an issue where a particular MP doesn’t seem to land on a monitored server. All other MPs are neatly in place and functional on the same server, except for that particular MP. So now what? How to deal with it?

This posting will show you some tips, tricks and advice on how to go about it. Some might seem too obvious, but they are still worth mentioning. So bear with me.

  1. RTFM (Read The Friendly Manual)
    Yeah, I know. The most obvious one indeed, and yet an important one. Every MP comes with its own guide. RTFM is key to getting the most out of the MP, or even to getting it running at all. Take the SharePoint 2010 MP (love that one): a file needs to be modified, in conjunction with an account, before the MP will run.

  2. Agent Proxy
    Some MPs need the Agent Proxy setting enabled. Examples include (but are not limited to) the AD, Exchange and Cluster MPs. Agents installed on servers running services like those must have Agent Proxy enabled.
    Again, RTFM is key here, since the related MP guide will tell you when this setting must be enabled.

  3. HealthService store is corrupt
    There is an issue with the Jet DB Engine, which is present on every Windows Server OS. SCOM uses this engine as well, for the HealthService store (~:\Program Files\System Center Operations Manager 2007\Health Service State\Health Service Store\HealthServiceStore.edb). Microsoft has released a hotfix to solve this: http://support.microsoft.com/default.aspx?scid=kb;en-us;981263.

    However, when this is the cause, the problem would not be limited to a single MP not landing on a particular server. The same server would show other issues as well, like being grayed out in the SCOM Console.

  4. Security
    Some MPs need additional permissions in order to function properly, like the SQL MP. Another example: Exchange 2010 locks down the servers it is installed on, so some additional work is required. Normally SCOM will raise an Alert when the permissions are insufficient. When the Console stays clean, another way to go about it is to check things out locally on the problematic server:
    - Stop the Agent on the problematic server;
    - Clear the OpsMgr event log on the problematic server;
    - Start the Agent on the problematic server;
    - Check the OpsMgr event log for any warnings or errors;
    - Open the errors/warnings (if any) one by one and read them thoroughly. When it’s a permissions-related issue, detailed information will be given;
    - Solve the permissions issues.
    Example: Event ID 7026 tells you the Action account has been validated successfully.
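
The stop / clear / start / check cycle above can be scripted. What follows is only a minimal sketch, under assumptions that are not in the original post: that the agent's Windows service is named `HealthService`, that the event log is named `Operations Manager`, and that the standard `net` and `wevtutil` tools are available on the server. The function names and the `settle_seconds` wait are illustrative. By default the script only prints the commands it would run.

```python
import subprocess
import sys
import time

# Assumptions (not from the original post): the agent's Windows service is
# named "HealthService" and its event log is named "Operations Manager".
AGENT_SERVICE = "HealthService"
EVENT_LOG = "Operations Manager"


def build_commands(service=AGENT_SERVICE, log=EVENT_LOG):
    """Return the command lines for the stop / clear / start / query cycle."""
    # In the Windows event schema, Level 2 is Error and Level 3 is Warning.
    query = "*[System[(Level=2 or Level=3)]]"
    return [
        ["net", "stop", service],
        ["wevtutil", "cl", log],
        ["net", "start", service],
        ["wevtutil", "qe", log, "/q:" + query, "/f:text"],
    ]


def run_cycle(dry_run=True, settle_seconds=300):
    """Print each command; execute it only when dry_run is False."""
    commands = build_commands()
    for index, cmd in enumerate(commands):
        is_query = index == len(commands) - 1
        if is_query and not dry_run:
            # Give the restarted agent some time to log events before querying.
            time.sleep(settle_seconds)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Pass --run from an elevated prompt on the server to really execute.
    run_cycle(dry_run="--run" not in sys.argv)
```

Running it without arguments is a dry run. The last command lists only warnings and errors, which are the events that need to be read one by one.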

  5. Non forest/domain residing servers
    Any monitored server residing outside the security boundary of SCOM needs certificates in order to communicate with the SCOM Management Group. On top of that, and people sometimes forget this, additional accounts are needed in SCOM. Servers like these are often not part of the forest where the SCOM Management Group resides, so the accounts set in SCOM for the SQL MP (for instance) cannot be validated or used on them. That is why additional accounts are needed.
    The steps under item 4 will also help here to identify what’s happening and why.

  6. Has the MP landed?
    An MP can only do its work when it has landed on the server. So it is good to know whether the MP is actually in place on the server. There are many ways to go about it, but this is the way I prefer:
    - Stop the Agent on the problematic server;
    - Clear the OpsMgr event log on the problematic server;
    - Rename the folder ~:\Program Files\System Center Operations Manager 2007\Health Service State;
    - Start the Agent on the problematic server;
    - Check the OpsMgr event log for any Event with ID 1201;
    - Run through every Event with this ID to see whether the problematic MP has been received on the server;
    - When the MP isn’t mentioned in any of the Events with this ID (1201), check whether the other servers do get this MP, AND check the SCOM Console for any Alerts reported about this particular MP.
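
Running through every 1201 Event by hand gets tedious on a server with many MPs. The sketch below shows one way to automate the check once the events have been exported as (event ID, message) pairs. It assumes, and this is not confirmed by the post, that the description of a 1201 event carries the MP name in the form `id:"<name>"`. The function names and the sample MP names are made up for illustration.

```python
import re

# Assumption: a 1201 event description contains the pack name as id:"<name>",
# e.g. 'New Management Pack with id:"...", version:"..." received.'
MP_ID_PATTERN = re.compile(r'id:\s*"([^"]+)"')


def received_management_packs(events):
    """Collect MP names from (event_id, message) pairs with event ID 1201."""
    received = set()
    for event_id, message in events:
        if event_id == 1201:
            received.update(MP_ID_PATTERN.findall(message))
    return received


def has_landed(mp_id, events):
    """Return True when mp_id is named in at least one 1201 event."""
    names = {name.lower() for name in received_management_packs(events)}
    return mp_id.lower() in names


if __name__ == "__main__":
    # Made-up sample events; real ones come from the OpsMgr event log.
    sample = [
        (1201, 'New Management Pack with id:"Example.App.Monitoring", '
               'version:"1.0.0.0" received.'),
        (1201, 'New Management Pack with id:"Example.Other.Library", '
               'version:"2.0.0.0" received.'),
    ]
    for name in ("Example.App.Monitoring", "Example.Missing.Pack"):
        print(name, "landed:", has_landed(name, sample))
```

If the real event text uses a different wording, only `MP_ID_PATTERN` needs to change.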

  7. Hotfixes, WMI and cscript.exe
    Other important things to reckon with are hotfixes, updates and the like, not only for SCOM itself (CUs for SCOM R2) but also for the servers being monitored. As we all know, the SCOM Agent relies heavily on basic Windows components like WMI, cscript.exe and so on. When WMI is not OK, many MPs will not function properly. When an old version of cscript.exe is in place, some or many scripts will not run properly either.
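
A quick way to see which cscript.exe version a server has is to read the banner it prints. The sketch below is only an illustration: it assumes the banner contains the text "Windows Script Host Version <major>.<minor>", and the minimum version it compares against is a placeholder, since the real requirement should come from the MP guide. The function names are made up.

```python
import re
import subprocess

# Assumption: running cscript.exe without a script prints a banner that
# contains "Windows Script Host Version <major>.<minor>".
BANNER_PATTERN = re.compile(r"Windows Script Host Version (\d+)\.(\d+)")

# Placeholder only; take the real minimum from the MP guide.
ILLUSTRATIVE_MINIMUM = (5, 7)


def parse_wsh_version(banner):
    """Return (major, minor) from the banner text, or None if not found."""
    match = BANNER_PATTERN.search(banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_recent_enough(banner, minimum=ILLUSTRATIVE_MINIMUM):
    """Return True when the banner reports a version of at least `minimum`."""
    version = parse_wsh_version(banner)
    return version is not None and version >= minimum


def read_banner():
    """Run cscript.exe on the monitored server and return what it prints."""
    result = subprocess.run(["cscript"], capture_output=True, text=True)
    return result.stdout + result.stderr
```

On the monitored server, `is_recent_enough(read_banner())` gives a yes/no answer. The parsing functions can also be fed banner text collected from many servers by other means.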

    W2K08-based servers need some additional attention as well. Check out these postings to learn more:
    http://thoughtsonopsmgr.blogspot.com/2009/07/opsmgr-and-windows-2008-what-hotfixes.html
    http://thoughtsonopsmgr.blogspot.com/2009/03/script-or-executable-failed-to-run-part.html
    http://thoughtsonopsmgr.blogspot.com/2008/12/wmi-and-windows-2003-server.html

  8. MP itself is corrupt
    I haven’t seen this many times, but sometimes an MP itself goes wrong. With SCOM R2 and the latest CU this is not very likely to happen. When it does, the best way to go about it is to remove the MP, wait an hour or two and import it again.

    Also, when the issue is related to an MP gone bad, the problem of a not-landing MP wouldn’t be limited to a single server, unless only one server runs the application/service covered by that MP.

Hope this walkthrough helps in troubleshooting MPs which do not seem to land properly on one or more servers.
