Thursday, March 29, 2012

Erratic behavior of SCOM: EventIDs 20070, 21016 and 20022

I bumped into a very puzzling issue at a customer’s location: a newly installed SCOM R2 CU#5 environment with an RMS in place, a dedicated SQL Server (2008 R2 SP1 CU#4) and an MS. With the RMS and SQL Server all went well, and SCOM R2 Reporting was installed without any issues as well. However, when the MS was added to the mix, the troubles started.

The Challenge
The MS didn’t seem to start. The related Health Service was running all right, but the MS stayed in an unmonitored state in the SCOM R2 Console. So it was time to check the OpsMgr event log on the MS server, where these two events repeated themselves many times:

EventID 20070:
[screenshot: EventID 20070]

EventID 21016:
[screenshot: EventID 21016]

Of course, events like these may occur when a new MS or Agent has been added and hasn’t received its configuration yet. Normally they soon disappear and everything is fine. But here these events kept on coming back:
[screenshot: the same events recurring]

So there was something else wrong.

And it got even wackier: after 15 to 25 minutes the connection with the RMS was made and all seemed to be fine, although NOTHING had been changed in SCOM R2. But then after 5 minutes the connection was lost again and the RMS showed only EventID 20022, telling me the Health Service on the MS wasn’t heartbeating:
[screenshot: EventID 20022]
And yet, the Health Service on the MS was running all right, and the MS resides in the same LAN segment as the RMS. And all the while both servers could connect to each other on port 5723 using the Telnet client?! Ping worked just fine too… Aaaaaaaaaaaaaarrrrrgggghhhh!!!!!

Restarting the Health Service on the MS didn’t change a thing, and neither did clearing the cache on that server or on the RMS. And when I pushed out a SCOM R2 Agent to the SQL server hosting the SCOM R2 databases, the same erratic behavior occurred, whether the Agent reported to the RMS or to the MS.

This told me two things:

  1. Something is NOT OK (duh!);
  2. Neither the RMS nor the MS server is the culprit.

So it was time for a deep dive in SCOM R2 in order to look for possible causes.

The Quest
This was a tough one. The SCOM R2 environment was brand new, with nothing exotic, special or fancy about it: just a regular SCOM R2 environment under construction that showed some erratic behavior. As a test I reinstalled the MS, but without any result. What worried me most was the pattern itself: first no communication between the SCOM servers and the SCOM Agent, then suddenly everything fine for some minutes, and then the whole cycle starting over again, without ANYTHING at all being changed in SCOM.

Time to run some checks:

  1. SCOM issue? Ran an SP against the SCOM DB and it turned up nothing wrong.
  2. SCOM issue? RMS/MS not OK? These servers were fine except for the communication issue.
  3. SCOM issue? SCOM service accounts locked out? No, all the accounts were just fine.
  4. SCOM issue? Untrusted servers so certificates are required? No, Kerberos should do fine.
  5. Kerberos Time Skew? Nope. All servers had the same time settings and were synchronized perfectly.
  6. Kerberos issue? Nope, all the accounts were fine and no Kerberos issues at all.
  7. GPO issue? Nope, just some basic GPOs, nothing fancy and no hardening.
  8. Network issue? Hmm, I installed the Telnet client on the RMS, MS and SQL. And I could connect to the servers on TCP port 5723.
  9. Network issue? Tracert showed the first hop was the destination so no routers at all. All SCOM servers reside in the same LAN.
  10. Network issue? A continuous Ping ran just fine with response times of less than 1 millisecond.
  11. Network issue? NIC removed and reinstalled and reconfigured. Nope. Same issues still occurring.
  12. DNS issue? Nope, NSLOOKUP worked like a charm. NetBIOS names were also resolved without a glitch.
  13. -
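
A single Telnet session only shows that the port answered at that one moment, while the trouble described here came and went over a cycle of 15 to 25 minutes. The sketch below is my own illustration, not something that was run at the customer: it repeats the TCP connect to port 5723 at a fixed interval and logs every attempt with a timestamp, so the results can be lined up against the events in the OpsMgr event log. The host name, the interval and the number of attempts are assumptions. Keep in mind that a plain TCP connect does not exercise SCOM's own authentication, so the probe may well come back clean, just like Telnet did in this case; that in itself tells you to look at another layer.

```python
import socket
import time
from datetime import datetime


def probe_tcp(host, port=5723, timeout=3.0):
    """Attempt one TCP connect; return (succeeded, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False, time.monotonic() - start


def summarize(results):
    """Return (attempts, failures, longest run of consecutive failures)."""
    failures = 0
    longest = 0
    current = 0
    for ok in results:
        if ok:
            current = 0
        else:
            failures += 1
            current += 1
            longest = max(longest, current)
    return len(results), failures, longest


def monitor(host, port=5723, interval=10.0, attempts=180):
    """Probe repeatedly and print one timestamped line per attempt.

    The defaults (every 10 seconds, 180 attempts) are an assumption chosen
    to span roughly 30 minutes, longer than the cycle seen in this case.
    """
    results = []
    for _ in range(attempts):
        ok, elapsed = probe_tcp(host, port)
        status = "OK" if ok else "FAIL"
        print(f"{datetime.now():%H:%M:%S} {host}:{port} {status} {elapsed * 1000:.1f} ms")
        results.append(ok)
        time.sleep(interval)
    return summarize(results)


if __name__ == "__main__":
    # "ms01.example.local" is a hypothetical name; use your own MS or RMS.
    print(monitor("ms01.example.local"))
```

Running it from the RMS towards the MS and from the MS towards the RMS at the same time gives both directions. The longest run of consecutive failures is the number to compare with the gaps between the heartbeat events.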

Ouch! So at least SCOM by itself was OK, and something else was causing these issues. The customer was also looking for possible causes and tested many things outside SCOM as well, but also without any result. After all these tests I knew for sure SCOM itself wasn’t the culprit. But what was?

All Systems are a go go!
However, all these SCOM servers are virtualized, on a dedicated host. As a last resort the customer decided to move the RMS and SQL server to another host, to make sure that the host itself or its virtual switch wasn’t causing the issues.

Guess what? The RMS and SQL were just fine now. They connected right away without any glitch. Bouncing the Health Service didn’t generate issues any more. Time to move the MS to the other host as well. And again, all previous issues vanished!

So somewhere somehow the previous host was causing all this erratic behavior, apparently at the network layer. Phew! Case solved and time to move on!

Advice
Whenever you run into similar issues with a SCOM environment showing erratic behavior, do not only test SCOM but also look outside SCOM. When virtualization is involved, test that aspect as well. And when nothing else seems to help, move the VMs to another host with its own virtual switch in order to see whether the problems are still there or perhaps, as in my case, GONE!

A BIG THANK YOU to Bob Cornelissen. I contacted him through MSN and asked him for some additional advice. Even though we didn’t nail it, it’s good to have such good friends at hand. Thanks Bob!

1 comment:

Geert Baeten said...

I know what was going on ;)
Check out

https://geertbaeten.wordpress.com/2013/07/08/scom-agent-or-gateway-certificate-issue/

Best regards,
Geert