Monday, September 27, 2010

Generic Troubleshooting Guide for SCOM

Installing SCOM R2 can be a challenge. However, Microsoft has provided many good guides on how to go about it and, even more important, on what to reckon with when designing a SCOM R2 environment. When your design is right and the preparations are done properly, the installation should be straightforward, without any surprises.

On top of that, a SCOM R2 environment is installed only once or twice (for instance, when you need a test environment as well). After that it is time to start using the SCOM R2 environment, which starts with these Steps:

  1. Configure the Core MP of SCOM R2 (people often forget this, but it is really important, so RTFM is the magic word here);
  2. Configure SCOM R2 (Resolution States, DB Grooming and the lot);
  3. Deploy the SCOM R2 Agents to the servers which need to be monitored;
  4. Import the MPs as needed (RTFM before, during and after!), starting with the Server OS MP;
  5. Tune the MP as required by the business, guided by the Server OS MP guide (RTFM again);
  6. When all is well and the Alerts coming in are relevant (no noise), import the next MP;
  7. Repeat Steps 4 to 6 for each MP to be imported.

At a certain point in time your IT organization has a fully operational SCOM R2 environment. All goes well. Tuning and tweaking take place while SCOM R2 is in use, and its fit with the organization is being tuned as well. The latter is always work in progress: every organization is dynamic, so changes are more than likely to occur, and SCOM R2 needs to adapt in order to reflect the current situation.

But then something happens and the SCOM R2 environment turns sour. A set of disks might stop functioning, a bad MP is imported, someone erases the SCOM R2 service accounts (had this issue once!), the RMS stops running, or the related SQL server suffers a hardware failure. Or everything seems to be just fine but ‘only’ the HealthService on the RMS stalls every hour or so…

So now the SCOM R2 environment becomes very silent and, instead of being the looking glass for the IT shop on all IT servers and services, SCOM R2 itself needs some serious attention. But where to start? What to do and, more important, what NOT to do?

With this posting I hope to help you troubleshoot a (partially) failed SCOM R2 environment in order to get things working again or, when you think it is way over your head, to log a call with Microsoft CSS and provide them with some good information.

But before I start I want to emphasize two very important things here:

  • Know what you know, know what you don’t know, and NEVER mix the two
    So whenever you bump into something which you do NOT totally understand, leave it. Do not alter anything without having a full understanding of the consequences. And even when you do, back up the OpsMgr R2 DBs first in order to have a way back. And check the validity of those backups. Otherwise SCOM R2 could end up in an unsupported state, or Microsoft CSS has to troubleshoot an extra complex issue: the original one which caused an error state in SCOM R2, plus your ‘repairs’ afterwards. Microsoft knows much and has a lot of experience but they can’t perform magic…

  • Backup, backup and backup AND VALIDATE
    Be sure to have a valid backup mechanism in place which runs on a regular schedule. Besides that, some validation is required in order to know for sure that the disks/tapes containing the backup are really valid and do not contain some blob of data without any real value. Check it when all is well and not after SCOM R2 has become (partially) dysfunctional. It will save you and your colleagues a lot of frustration and perhaps even your job…

    A backup of only the SCOM R2 servers will NOT suffice. Back up the DBs as well (use a ‘connector’ for it), along with the Unsealed MPs. A backup of the EncryptionKey (with a VALID password) is a requirement too. This way you have covered the SCOM R2 environment from end to end.

Having said that, it’s time to move on. This is what I do when SCOM R2 is experiencing some issues which need attention:

  1. What is exactly happening and since when?
    Find out exactly what is happening and since when. Try to describe it as briefly as possible and attach a date and time to it. This is important not only when you want to call in Microsoft CSS but also for yourself. This way you do not start a wild goose chase. Also be aware that there is a huge difference between when the problem was detected and when the problem started. Try to get to the bottom of it all.

  2. Can it be reproduced?
    Sometimes network errors occur which can have an impact on SCOM R2. When the network is fine again, SCOM R2 should be fine as well. So try to see whether you can reproduce the error. If not, it is back to business. When it comes back, it is time to take a deeper dive.

  3. Ask questions
    Did anything take place before the SCOM R2 environment started to fail? Like importing a MP for instance (poorly written MPs can often wreak havoc…), AD changes, migrations, failovers or network changes? Did anyone perform any task on the SQL server(s) hosting the SCOM R2 DBs and SSRS instance? So inform yourself thoroughly and communicate with your colleagues and team members.

  4. Differentiate between main issues and secondary ones
    When a SCOM R2 environment is experiencing issues, many things can happen. Try to differentiate between the main issue(s) and the less important ones. Target your troubleshooting efforts at the main issues. The less important ones are mostly caused by the main issue(s).

  5. Check out the SCOM R2 services on the RMS
    Are the three SCOM R2 services still operational on the SCOM R2 RMS? Nothing stopped? Nothing stalled?

  6. Check out the SCOM R2 DBs
    Are the DBs still OK? Can you access them from SQL Server Management Studio? Are the DBs still healthy? Can you query these DBs? Can you connect to the SQL server from all SCOM R2 Management Servers? (The Telnet client is required for this test.)

  7. Is the SCOM R2 Console still operational?
    When you’re able to access the SCOM R2 Console and navigate through it and the Views are refreshed, you know the SDK service is still running and the OpsMgr DB is still accessible and operational. So a simple check already reveals a lot.

  8. Check out the OpsMgr event logs on the RMS and Management Servers
    These logs are really great and tell you so much. They are the first place to go in order to get a better understanding of what is happening and why. Of course, the SCOM R2 Core MP picks up a lot of these events and raises one or more Alerts in the SCOM R2 Console, but it is still wise to check out the logs as well since not all Events are covered by the Core MP.

  9. OpsMgr event log on the RMS
    Look on the RMS first, since that server is the top level server of the SCOM R2 hierarchy. Stop the HealthService (System Center Management) on the RMS and start it again. This forces the RMS to reprocess its configuration, including the SCOM R2 service accounts. When anything is wrong there, many errors will be shown in the event log. Keep a watchful eye on it and refresh it many times. When something serious is at hand, it will mostly show up in the event log within ten minutes.

  10. OpsMgr event log on the MS servers
    The monitored servers (aka Managed servers) report to these servers. The SCOM R2 Management Servers write directly to the SCOM R2 DBs. So when anything goes wrong, these servers should report on it in the OpsMgr event log. Same procedure here as well: restart the HealthService and check out the logs. Keep a watchful eye on them for the first ten minutes after the HealthService has been restarted. When something goes wrong it should show up in that timeframe.

  11. Bounce the RMS related SCOM R2 Services
    When nothing strange turns up in the previous steps it is time to restart the SCOM R2 services on the RMS (NOT ON THE SCOM R2 MANAGEMENT SERVERS!): restart the Config Service (System Center Management Configuration) first and check out the OpsMgr event log in order to see what comes out. Do the same for the SDK Service (System Center Data Access). Keep a watchful eye on the OpsMgr event log.
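
The Telnet test in step 6 comes down to checking whether a TCP connection to the SQL Server port can be opened from each Management Server. A minimal Python sketch of that check follows. It assumes Python is available on the server you test from and that SQL Server listens on its default port 1433 (a named instance may use another port); `can_connect` is a name made up for this illustration.

```python
import socket


def can_connect(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True when a TCP connection to host:port can be opened.

    Port 1433 is only the SQL Server default and is an assumption here.
    """
    try:
        # create_connection resolves the host name and opens the socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors
        return False
```

Call it from every Management Server with the name of your own SQL server, for example `can_connect("your-sql-server")`. A `False` only tells you the port is not reachable, not why.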

Wow! Stop! I have found some or many serious errors! What’s next? Good question.

Start at the bottom of it
Even when many errors/warnings are shown in the OpsMgr event log, the first one, or the first series of three to five events, is mostly the real cause. The other ones are often failing workflows BECAUSE some required basic processes are failing. So take a good look at the first errors and warnings.
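
To illustrate the idea of looking at the first events only, here is a small Python sketch. It assumes the events have already been exported from the OpsMgr event log into Python objects; the `Event` class and the function name are hypothetical and only serve this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class Event:
    """Hypothetical shape of one exported OpsMgr event log entry."""
    time: datetime
    level: str       # "Error", "Warning" or "Information"
    event_id: int
    message: str


def first_problem_events(events: Iterable[Event], since: datetime,
                         limit: int = 5) -> List[Event]:
    """Return the earliest errors/warnings logged at or after `since`."""
    problems = [e for e in events
                if e.time >= since and e.level in ("Error", "Warning")]
    problems.sort(key=lambda e: e.time)  # oldest first
    return problems[:limit]
```

Pass the moment you restarted the service as `since`, so the list starts with what went wrong first rather than with the latest noise.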

A good internet connection is important now. Use your favorite search engine and search in this format: SCOM <eventid> followed by the most important piece of information displayed in the general part of the Event, for instance ‘Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired.’ Leave out the specific details like server names, GUIDs and SCOM R2 Management Group names.
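
The search format can be sketched in Python as well. The function below is an illustration only: it strips anything that looks like a GUID with a regular expression and removes the terms you pass in yourself (server names, Management Group names), since those cannot be recognized automatically. All names in it are made up for this example.

```python
import re
from typing import Iterable

# Matches GUIDs in the common 8-4-4-4-12 form, with or without braces
GUID_RE = re.compile(
    r"\{?[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}?"
)


def build_search_query(event_id: int, message: str,
                       private_terms: Iterable[str] = ()) -> str:
    """Build 'SCOM <eventid> <message>' without environment specifics."""
    text = GUID_RE.sub(" ", message)
    for term in private_terms:
        # Remove server names, Management Group names and the like
        text = re.sub(re.escape(term), " ", text, flags=re.IGNORECASE)
    text = " ".join(text.split())  # collapse the gaps left behind
    return f"SCOM {event_id} {text}"
```

Keeping the query free of your own server and Management Group names also means the results are not narrowed down to pages that happen to contain those names.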

Details displayed in things like Workflow names can also give a good clue as to what is causing the issue. So always read the full event and not only the headers.

Some tricks to get things going again:

  • Remove the latest imported MP
    Only when it’s relevant of course. When your SCOM R2 environment starts having issues on the 1st of October 2010 and the last MP you imported/changed was two months ago, chances are that the cause of this issue is to be found somewhere else.

  • Clear the HealthService State on the (R)MS server experiencing the issues
    On the SCOM R2 (R)MS server which is experiencing the issues, stop the HealthService, rename the folder ‘~:\Program Files\System Center Operations Manager 2007\Health Service State’ to ‘~:\Program Files\System Center Operations Manager 2007\Health Service State_OLD’ and start the HealthService again.

  • Clear the HealthService State on the Agent(s) causing the issues
    When you have pinpointed the issues and suspect one or more SCOM R2 Agents to be the culprit(s), stop the HealthService, rename the folder ‘~:\Program Files\System Center Operations Manager 2007\Health Service State’ to ‘~:\Program Files\System Center Operations Manager 2007\Health Service State_OLD’ and start the HealthService again.
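
The stop, rename, start sequence can be scripted. Below is a minimal Python sketch, not a supported tool. It assumes the Windows service name is `HealthService` and that `net stop` / `net start` are run from an elevated prompt; the function name and the `run` parameter (which makes it possible to try the logic without touching a real service) are invented for this example.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise when it exits with a non-zero code."""
    subprocess.run(list(cmd), check=True)


def clear_health_service_state(state_dir: Path,
                               run: Runner = _run_checked,
                               service: str = "HealthService") -> Path:
    """Stop the service, rename the state folder to *_OLD, start again."""
    old_dir = state_dir.with_name(state_dir.name + "_OLD")
    if old_dir.exists():
        # Do not overwrite the leftovers of an earlier attempt
        raise FileExistsError(f"{old_dir} already exists; move it away first")
    run(["net", "stop", service])
    try:
        state_dir.rename(old_dir)
    finally:
        # Start the service again even when the rename failed
        run(["net", "start", service])
    return old_dir
```

Pass the full path of the ‘Health Service State’ folder as `state_dir`. The service recreates the folder when it starts, so the `_OLD` copy can be deleted once everything runs fine again.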

Here is a list of EventIDs which I have come across from time to time and which need some attention. Some are very serious and some are easily fixed:

  • EventID 33333
    Data Access Layer rejected retry on SqlError. This is a serious one and needs real attention. Sometimes it is an easy one: an Agent has been partially reinstalled and its ID ('BaseManagedEntityId') no longer matches the one present in the SCOM R2 DB. Recycling its HealthService State makes all well again.

  • EventID 33333
    Health service <GUID> should not generate data about this managed object. Easy one. Proxying needs to be enabled on the SCOM R2 Agent generating this event.

  • EventID 10850
    A performance signature couldn't be inserted to the database. A tricky one. It often happens when a MP has recently been deleted. More serious is an issue where the OpsMgr DB is running out of space. But when this event also contains the message ‘The INSERT statement conflicted with the FOREIGN KEY constraint’ there is a real challenge to be met. When you are lucky it is caused by a corrupt Agent. If so, a HealthServiceId is displayed in the same Event. Run this PS command in order to obtain its friendly name (Get-MonitoringObject -id: 'HealthServiceId' | ft DisplayName). Recycle the HealthService State of that Agent and most of the time all is well again.

    Otherwise check this out. If that isn’t the case either, contact Microsoft PS.

  • EventID 5300 and/or 5304
    On an RMS it means the Health Service is stalled. Serious attention required. Check this out.
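
The list above can be kept as a small lookup table for quick triage. The Python sketch below only restates the hints from this posting. The table layout and the `triage` function are hypothetical, and the message fragments are the ones quoted above.

```python
from typing import List, Tuple

# (EventID, fragment of the event message, hint); specific entries first
HINTS: List[Tuple[int, str, str]] = [
    (33333, "should not generate data about this managed object",
     "Easy one: enable proxying on the Agent generating this event."),
    (33333, "Data Access Layer rejected retry on SqlError",
     "Serious: needs real attention. With a partially reinstalled Agent, "
     "recycle its HealthService State."),
    (10850, "FOREIGN KEY constraint",
     "Possibly a corrupt Agent: look up the HealthServiceId from the "
     "event and recycle the HealthService State of that Agent."),
    (10850, "",
     "Check for a recently deleted MP and for an OpsMgr DB running "
     "out of space."),
    (5300, "", "On an RMS the Health Service is stalled: serious attention required."),
    (5304, "", "On an RMS the Health Service is stalled: serious attention required."),
]


def triage(event_id: int, message: str) -> str:
    """Return the first matching hint for an event, if there is one."""
    for known_id, fragment, hint in HINTS:
        # An empty fragment matches every message for that EventID
        if known_id == event_id and fragment.lower() in message.lower():
            return hint
    return "No hint in this list: read the full event and search for it."
```

The order of the entries matters: for an EventID with several meanings, the entry with the most specific message fragment has to come first.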

Of course there are many more EventIDs which need attention. A good resource here is the Excel sheet made by Daniele Muscetta containing all SCOM R2 EventIDs. I hope this posting helps with some targeted troubleshooting.

Of course I know about the tracing tools which are available by default in SCOM R2 (~:\Program Files\System Center Operations Manager 2007\Tools). However, be careful when using them since you really must have a thorough understanding of what you are doing. Taken directly from the file ‘TracingReadMe.txt’ residing in the same folder: ‘…The files in this folder are for use in diagnostic tracing in conjunction with Microsoft Customer Support Services (CSS) only. Do not enable Operations Manager tracing without prior consultation with CSS through a support engagement. Doing so could have an adverse effect on system performance. Operations Manager diagnostic tracing is not customer consumable…’.

When you’re not sure, note down all your findings and contact Microsoft CSS. Do not do anything which you might regret afterwards.

1 comment:

snajgel said...

Great post on troubleshooting!!!