Thursday, March 29, 2012

Erratic behavior of SCOM: EventIDs 20070, 21016 and 20022

Bumped into a very puzzling issue at a customer's location: a newly installed SCOM R2 CU#5 environment with an RMS in place, a dedicated SQL Server (2008 R2 SP1 CU#4) and an MS. With the RMS and SQL Server all went well. SCOM R2 Reporting was installed without any issues as well. However, when the MS was added to the mix, the troubles started.

The Challenge
Somehow the MS didn’t seem to start. The related Health Service was running all right, but the MS stayed in an unmonitored status in the SCOM R2 Console. So it was time to check the OpsMgr event log of the MS server. There, two events repeated themselves many times: EventID 20070 and EventID 21016.

Of course, events like these may occur when a new MS or Agent is added and hasn’t received its configuration yet. Normally they soon disappear and everything is fine. But here these events kept coming back.

So there was something else wrong.

And it got even wackier: after 15 to 25 minutes the connection with the RMS was made and all seemed to be fine, while NOTHING was changed in SCOM R2. But then after 5 minutes the connection was lost again and the RMS showed only EventID 20022, telling me the Health Service on the MS wasn’t heartbeating.

And yet, the Health Service on the MS was running all right, residing in the same LAN segment as the RMS. And all the while both servers could connect to each other using the Telnet client on port 5723?! Also Ping worked just fine… Aaaaaaaaaaaaaarrrrrgggghhhh!!!!!

Restarting the Health Service on the MS didn’t change a thing, nor did recycling the cache on that server or on the RMS. And when I pushed out a SCOM R2 Agent to the SQL server hosting the SCOM R2 databases, the same erratic behavior occurred, whether the Agent reported to the RMS or to the MS.

This told me two things:

  1. Something is NOT OK (duh!);
  2. Neither the RMS nor the MS server is the culprit.

So it was time for a deep dive in SCOM R2 in order to look for possible causes.

The Quest
This was a tough one. The SCOM R2 environment was brand new, without anything exotic. Nothing special nor fancy about it, just a regular SCOM R2 environment under construction showing some erratic behavior. As a test I reinstalled the MS, but without any result. What worried me most was the pattern: first no communication between the SCOM servers / SCOM Agent, then suddenly everything being fine for some minutes, and then it started all over again, without changing ANYTHING at all in SCOM.

Time to run some checks:

  1. SCOM issue? Ran an SP against the SCOM DB and nothing wrong came out.
  2. SCOM issue? RMS/MS not OK? These servers were fine except for the communication issue.
  3. SCOM issue? SCOM service accounts locked out? No, all the accounts were just fine.
  4. SCOM issue? Untrusted servers so certificates are required? No, Kerberos should do fine.
  5. Kerberos Time Skew? Nope. All servers were running at the same time settings and synchronized perfectly.
  6. Kerberos issue? Nope, all the accounts were fine and no Kerberos issues at all.
  7. GPO issue? Nope, just some basic GPOs, nothing fancy and no hardening.
  8. Network issue? Hmm, I installed the Telnet client on the RMS, MS and SQL. And I could connect to the servers on TCP port 5723.
  9. Network issue? Tracert showed the first hop was the destination so no routers at all. All SCOM servers reside in the same LAN.
  10. Network issue? A continuous Ping ran just fine with response times of less than 1 millisecond.
  11. Network issue? NIC removed and reinstalled and reconfigured. Nope. Same issues still occurring.
  12. DNS issue? Nope, NSLOOKUP worked like a charm. Also NETBIOS names were resolved without a glitch.
  13. -

Ouch! So at least SCOM in itself was OK. There was something else causing these issues. The customer was also looking for possible causes and tested many things outside SCOM as well, but also without any result. However, after all these tests I knew for sure SCOM itself wasn’t the culprit. But what was?
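The TCP check from step 8 can also be run from a script instead of the Telnet client. What follows is only a minimal Python sketch, assuming nothing more than the Health Service listening on TCP port 5723 as described above. The function name is made up for this example.

```python
import socket


def tcp_port_open(host: str, port: int = 5723, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, like the Telnet test.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling tcp_port_open("rms01.example.local") from the MS (the host name is hypothetical) gives the same answer as the Telnet test. Keep in mind what this case showed: an open port only proves that the TCP handshake works, not that the SCOM channel itself is healthy.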

All Systems are a go go!
However, all these SCOM servers are virtualized. On a dedicated host. And as a last resort the customer decided to move the RMS and SQL server to another host in order to make sure the host itself or its virtual switch wasn’t causing the issues.

Guess what? The RMS and SQL were just fine now. They connected right away without any glitch. Bouncing the Health Service didn’t generate issues any more. Time to move the MS to the other host as well. And again, all previous issues vanished!

So somewhere somehow the previous host was causing all this erratic behavior, apparently at the network layer. Phew! Case solved and time to move on!

Whenever you run into similar issues of a SCOM environment showing erratic behavior, do not only test SCOM but also look outside SCOM. When virtualization is involved, test that aspect as well. And when nothing else seems to help, move the VMs to another host with its own virtual switch in order to see whether the problems are still there or perhaps – as in my case – GONE!

A BIG THANK YOU to Bob Cornelissen. I contacted him through MSN and asked him for some additional advice. Even though we didn’t nail it, it’s good to have such good friends at hand. Thanks Bob!

Tuesday, March 27, 2012

New series of blog postings: Migrating from SCOM R2 to OM12 in ‘real life’

Soon a new series of blog postings will be launched all about migrating from SCOM R2 CU#5 to OM12.

In a test lab of mine I am building a full-blown SCOM R2 environment consisting of a DC, a dedicated SQL server, an RMS, an MS and a Gateway server. Also some Agents will be pushed out to some servers. This environment will be upgraded, step-by-step, to OM12.

In a series of blog postings this upgrade process will be described, along with the potential pitfalls, do’s and don'ts. Also all available documentation about this upgrade process will be discussed. Some useful links to community-based postings about the upgrade process will be shared as well.

So stay tuned and see you soon again!

System Center Virtual User Group Meeting #18

Taken directly from the website: ‘…I am pleased to announce registration is ready for the System Center Virtual User Group meeting #18 scheduled for 03/30/2012 from 1 PM to 4 PM Eastern. We have lined up a great mixture System Center 2012 topic as well a demo of the new System Center Management Pack authoring tools!…’

These meetings are really good and packed with solid information, and the agenda of this meeting is awesome.

Since this is an online meeting, anyone from anywhere can attend. Go here in order to register.

New MP: Monitoring SQL Azure–CTP

Some weeks ago Microsoft released the CTP version of the MP for monitoring SQL Azure.

The MP can be downloaded from here.

New MP: Monitoring SCCM 2012 RC

Some time ago Microsoft released a new MP for monitoring SCCM 2012 RC.

The MP can be downloaded from here.

New KB article: How to change the credentials for the SDK/Config account in SCOM/OM12

A few days ago Microsoft released a comprehensive KB article all about changing the credentials for the SDK/Config account in SCOM/OM12.

Want to know more? KB936220 tells you all about it.

Thursday, March 22, 2012

The Heinz Story II

About a year ago I wrote a posting about the nworks MP Simulator: an awesome tool for demonstrating the real power of the nworks Veeam MP without requiring a real VMware environment, since all monitoring data is simulated. Even the Reports are available and fully operational!

Some days ago this tool got an update, so it now works with the latest version of the nWorks MP (version 5.7, fully compatible and certified for VMware vSphere 5.x) AND with OM12 (RC).

Spoiler Alert
The nworks MP Simulator doesn’t contain ALL the features of the real nWorks Veeam MP. Therefore it’s NOT a substitute for a real demo, evaluation deployment, or POC. Taken directly from the guide (included with the download of the msi file containing the nWorks MP Simulator): ‘…The key feature of nworksDEMO is that it can show nworks MP functionality when there are no VMware servers to use as a monitoring data source. If a real VMware deployment exists, then a demo of the full nworks product is recommended…’

The file (containing the nworksDEMO with the two MP files and the related guide) can be downloaded from here. One needs to provide some basic information and the download can be started.

For OM12 RC users: take note of the last paragraph on page 6. A certain View requires modification because of a bug present in OM12 RC.

Some screenshots, taken from my OM12 environment:

The nWorks Veeam MP in the Monitoring pane:
As one can see, there is a LOT of information to be found all about the monitored VMware environment.

vCenter Topology Diagram View:
A good Diagram View, showing the VMware topology and its health state.

vCenter Dashboard:
Some (faked!) Alerts are coming in. Nice!

And the Reports are present as well:
I counted 27(!) Reports. For the nWorks Veeam MP version 5.7 many of those Reports have been rewritten from the ground up in order to deliver better performance and more relevant information.

Veeam is a company with a rock solid reputation when it comes to delivering high quality software with added value. With this FREE demo they use the same approach and deliver a tool which should be present in every SCOM/OM12 lab environment. So whenever you have such an environment but aren’t running this software, go get it, install it and be amazed. Of course, the real stuff is even better :).

Thursday, March 15, 2012

The SCOM Console on steroids

Wow! Got this information from an unknown person who left a comment on my blog. He told me he’s working on speeding up the start of the SCOM Console. In order to achieve that he investigated the underlying .NET Framework since the SCOM Console is a .NET application.

Soon he found some interesting stuff. Translated it to the SCOM Console and BINGO! I tried it in a couple of my SCOM test environments and the SCOM Console starts very fast now! Gone are the long delays between starting the Console and actually seeing it.

Of course, it hasn’t been officially confirmed by Microsoft (yet) so be careful. But until now it hasn’t done any harm to my SCOM test environments.

What to do:
It’s an easy, one-step process.

  1. Put this file (Microsoft.MOM.UI.Console.exe.config) into the same folder where the executable of the SCOM Console (Microsoft.MOM.UI.Console.exe) is located. All the file does is disable the Authenticode check.

There is nothing more to it. Just start the SCOM Console on that computer and be amazed like me :).
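The linked file isn’t reproduced in this posting. As a sketch only, assuming the file relies on the standard .NET Framework runtime setting generatePublisherEvidence (the documented switch for skipping the Authenticode publisher evidence check at startup), the Python below builds and writes such a file. The function names are made up for this example, and whether the original file contains exactly this content is an assumption.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

CONFIG_NAME = "Microsoft.MOM.UI.Console.exe.config"


def build_console_config() -> str:
    """Build the XML text of a minimal application configuration file."""
    configuration = ET.Element("configuration")
    runtime = ET.SubElement(configuration, "runtime")
    # Assumption: generatePublisherEvidence is the setting the linked file uses.
    ET.SubElement(runtime, "generatePublisherEvidence", {"enabled": "false"})
    body = ET.tostring(configuration, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body + "\n"


def write_console_config(console_folder: str) -> Path:
    """Write the config file next to the Console executable and return its path."""
    target = Path(console_folder) / CONFIG_NAME
    target.write_text(build_console_config(), encoding="utf-8")
    return target
```

Pass the folder that holds Microsoft.MOM.UI.Console.exe to write_console_config. Try it in a test environment first, since this tweak is not officially confirmed.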

All credits go to S-E B, as this guy calls himself on the OpsMgr TechNet Forums.

Tuesday, March 13, 2012

SCOM R2 Console fails to start with error ‘Application has failed to start because its side-by-side configuration is incorrect’

Bumped into this error at a customer’s location. The SCOM R2 Console refused to start on a SCOM R2 Management Server and threw the error quoted in the title of this posting.

Time for some investigation. This is what the Application Log of this server told me: ‘…Activation context generation failed for "D:\Program Files\System Center Operations Manager 2007\Microsoft.MOM.UI.Console.exe".Error in manifest or policy file "~:\Program Files\System Center Operations Manager 2007\Microsoft.MOM.UI.Console.exe.Config" on line 0. Invalid Xml syntax…’

Hmm. Strange. Never heard of that file (Microsoft.MOM.UI.Console.exe.Config) before. But the file was really present on that server. Checked another SCOM R2 Management Server, and there the file wasn’t present at all.

So I renamed that file to Microsoft.MOM.UI.Console.exe.Config.OLD and tried to start the SCOM R2 Console again. And now the SCOM R2 Console started just fine without any error at all. No one knew how that file came to be. Checked its contents and it referred to Putty?!

Since I was too busy for further investigation I left it at that. But it still puzzles me. After all, files don’t get created out of thin air…
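Before renaming such a file, a quick check can confirm that it really is broken XML, which is what the ‘Invalid Xml syntax’ message points at. A minimal Python sketch, assuming nothing more than that a valid .exe.config file must be well-formed XML. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_config_xml(path: str) -> str:
    """Return "OK" if the file is well-formed XML, else a short error description."""
    try:
        ET.parse(path)
    except ET.ParseError as error:
        # error.position is a (line, column) tuple pointing at the first problem.
        line, column = error.position
        return f"invalid XML at line {line}, column {column}: {error}"
    except OSError as error:
        return f"cannot read file: {error}"
    return "OK"
```

Run against the suspect Microsoft.MOM.UI.Console.exe.Config, this should report where the parser gives up, which may hint at what wrote the file.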

Monday, March 12, 2012

Extension MP for the Hyper-V Management Pack available. For FREE!

On CodePlex, the Open Source Community, there is an extension MP for the Hyper-V MP available.

This MP extends Hyper-V monitoring (which is way too basic in the default MP for Hyper-V) to these items:

  • Live Migration failures monitoring
  • Per Logical Processor Monitoring
  • Per VM Dynamic Memory Monitoring
  • Per VHD Monitoring
  • Per Physical and Logical Disk Monitoring

MP can be downloaded from here.

All credits go to the people who developed and authored this MP. Great work guys!

Wednesday, March 7, 2012

Windows Logical Drives Report and some things to reckon with

Bumped into this strange issue. A customer of mine had imported the Windows Logical Space Report MP. So far so good.

However, the customer wanted to differentiate between certain sets of Windows Computers to report upon. And now something strange happened. When he created the appropriate Groups containing the relevant Windows Computers, these Groups never showed up in the Report. Of course, when a Group is created it takes a while (sometimes some hours) before it ends up in the Data Warehouse as well and can be used in the Reports.

But here these Groups never became available in the Reports related to the Windows Logical Drives. That puzzled me.

Luckily this customer has a keen eye and told me that all the Groups listed in the dropdown box for these Reports contained the word ‘Computers’. So that was something to go on. I took a deeper dive and tried to find a pattern in this behavior. First I checked whether the Report was limited to only a certain type of Groups. But that’s not the case: the Reports listed both Instance Groups and Computer Groups.

However, the Report also listed some custom made Groups, made for customized Views in the SCOM Console. And after a lot of trial and error we noticed that the Group you make has to be of a certain format. Otherwise it won’t end up in these Reports. These are the requirements for this kind of Group:

  1. The Group Name has to contain the word Computers;
  2. Since we only use dynamically populated Groups, the desired Class to filter on has to be Windows Computer;
  3. The query you run works best when using DNS Name as Property and Contains as Operator. For the value, enter the required AD name, for instance.

Save this Group and check whether it gets properly populated. And now – after a while – this Group will become available in your Free Space Report.
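The three requirements can be written down as a small checklist. This is only a Python sketch of what we observed, not of how the Report actually selects its Groups. The GroupDefinition class and its fields are hypothetical, and the case-sensitive match on ‘Computers’ is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupDefinition:
    """Hypothetical, simplified description of a dynamically populated Group."""
    name: str
    target_class: str
    rule_property: str
    rule_operator: str
    rule_value: str


def report_visibility_issues(group: GroupDefinition) -> List[str]:
    """List which of the three observed requirements a Group does not meet."""
    issues = []
    # Assumption: the match on the word "Computers" is case-sensitive.
    if "Computers" not in group.name:
        issues.append("Group Name does not contain the word Computers")
    if group.target_class != "Windows Computer":
        issues.append("Class to filter on is not Windows Computer")
    if (group.rule_property, group.rule_operator) != ("DNS Name", "Contains"):
        issues.append("query does not use DNS Name with the Contains operator")
    return issues
```

An empty list means the Group matches the pattern we found. The third item is a ‘works best’ observation rather than a hard rule.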

Friday, March 2, 2012

OM12 APM Testing and Demoing: Latest version of nopCommerce doesn’t work. So now what?

Some days ago I posted an article all about installing the nopCommerce web shop in order to monitor it with OM12 APM. Tried it some months ago and back then it worked. Like a charm. But now there is a new version of nopCommerce out there which doesn’t work any more in conjunction with OM12 APM, as described here.

Ouch! Luckily the community came to the rescue and pointed out two alternatives to me:

  • Stefan Stranger
    .NET application: Buggy Bits, to be found here for download and installation instructions.

  • Mats Wigle
    .NET application: Talking Heads, to be found here for download and installation instructions.

Thanks guys for sharing. Awesome! Another thanks goes to VIAcode for building AND sharing the .NET application Talking Heads.

Wrap up
Whenever you want to test OM12 APM, DON’T install nopCommerce but use Buggy Bits and/or Talking Heads instead.

Thursday, March 1, 2012

OM12 APM and nopCommerce website

Bummer! Seems there is an issue with OM12 APM and the latest version of nopCommerce. Daniele Muscetta pointed this one out to me on the TechNet Forum for APM.

This explains why an older version of nopCommerce was working perfectly with APM and the latest version isn’t (not even a Discovery). Yes, I can tweak the discovery as Daniele describes, but that’s only one part of the deal. Perhaps I have to check for alternatives to nopCommerce.

I’ll be back with more information.

New KB article: Configuration may not update in SCOM (R2)

Yesterday Microsoft published a new KB article all about the configuration which may not update in SCOM (R2).

KB2635742 describes the symptoms, cause and resolution.