Monday, September 1, 2014

Xplat Monitoring With Chained Gateway Servers - The NOC Approach

2014-09-01 Update
As it turned out, even though the MS servers aren’t capable of directly contacting the monitored UX servers as described in the NOC approach with chained Gateway Servers, they still require name resolution, and on top of that, reverse name resolution as well.

When running the MP Templates for UX servers, the MS servers (ALL OF THEM, since they’re part of the All Management Servers Resource Pool) must be able to resolve the names of the UX machines involved. So make sure the MS servers can resolve those names, reverse as well.

Otherwise the Tasks related to the UX server specific MP Templates AND the normal Tasks for the UX servers won’t run at all, or will run only when they happen to be executed by an MS server which is capable of resolving the names.
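A quick way to verify this on each MS server is a small PowerShell check like the one below. It is only a sketch: the function name Test-UxNameResolution and the host name in the usage comment are made up for illustration, and it assumes PowerShell 3.0 or later.

```powershell
# Sketch: forward and reverse DNS check for one UX server.
# Run it on EVERY Management Server in the All Management Servers Resource Pool.
function Test-UxNameResolution {
    param([Parameter(Mandatory)][string]$Fqdn)

    $result = [pscustomobject]@{
        Fqdn         = $Fqdn
        Addresses    = @()
        ReverseNames = @()
        ForwardOk    = $false
        ReverseOk    = $false
    }

    # Forward lookup: name -> IP address(es)
    try {
        $result.Addresses = @([System.Net.Dns]::GetHostAddresses($Fqdn) |
            ForEach-Object { $_.IPAddressToString })
        $result.ForwardOk = $result.Addresses.Count -gt 0
    }
    catch {
        return $result   # the name does not resolve at all
    }

    # Reverse lookup: every IP address -> name
    foreach ($ip in $result.Addresses) {
        try { $result.ReverseNames += [System.Net.Dns]::GetHostEntry($ip).HostName }
        catch { }        # no reverse record for this address
    }

    # -contains compares case-insensitively
    $result.ReverseOk = $result.ReverseNames -contains $Fqdn
    $result
}

# Usage (hypothetical host name):
# Test-UxNameResolution -Fqdn 'ux01.customer-a.example'
```

When ForwardOk or ReverseOk comes back False on any MS server, fix DNS first before looking any further.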

The challenge
Even though I’ve helped customers before with monitoring their UX systems with SCOM 2012x, I had a new challenge. In this particular scenario the customer has a NOC in place (Network Operations Center) where only the SCOM 2012x Management Group resides.

All the monitoring happens somewhere else, at many customer locations. Per customer location at least one Gateway Server is in place. Behind that Gateway Server the real monitoring workloads reside. And to make it even more challenging, the Gateway Server(s) residing at the customer locations don’t report directly to the Management Servers residing in the NOC.

Chained Gateway Servers
Instead these customer Gateway Servers report to special NOC Gateway Servers residing in a DMZ. And those Gateway Servers report – finally – to the SCOM 2012x Management Servers. This kind of setup (Gateway Servers communicating to other Gateway Servers and not directly with the Management Servers) is also known as Chained Gateway Servers.

So when monitoring UX based workloads, the chain of communication looks like this: UX based workloads > monitored by the customer Gateway Server(s) > NOC DMZ Gateway Servers > NOC Management Servers.

Additional load AND not for the Management Servers…
And like we all know, SCOM Agents on UX systems aren’t like SCOM Agents on Windows Servers. Where the latter manages itself (it decides when to run which scripts and manages its own workload and agenda, based on the imported MPs), the SCOM UX Agent is totally managed by the Management or Gateway Server it reports to.

Also good to know: in this setup not a single UX system is managed by a Management Server, only by SCOM Gateway Server(s). And when it comes to load, SCOM Gateway Servers are just like SCOM Agents, nothing like the robust HealthService running on a Management Server.

So this means the customer Gateway Server(s) will take an additional hit on their performance. Something to reckon with.

Not clustered but everywhere
The UX systems to be monitored don’t reside at a single location but at many different customer locations. This impacts the setup of the monitoring of the UX systems as well.

Security!!!
Different customers mean different security policies. So even though ONE UX account for monitoring would be nice from an administration point of view, it wouldn’t fit the bill at all. So per customer location at least ONE UX account had to be created and distributed to the correct Gateway Servers.

Resource Pools please!
Yes. Monitoring UX based workloads (or SNMP based for that matter) REQUIRES the usage of Resource Pools. And yes, you can put Gateway Servers in Resource Pools, no problem. But in this case multiple Resource Pools are required since the UX systems reside at different locations behind different Gateway Servers.
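Creating such a pool per customer location can be scripted. Below is a minimal sketch using the OperationsManager module. The function name New-UxResourcePool, the pool name and the Gateway Server names are assumptions for illustration only. It also assumes the Name property of the Gateway Servers holds the names you pass in.

```powershell
# Sketch: build one Resource Pool containing only the customer Gateway Servers.
function New-UxResourcePool {
    param(
        [Parameter(Mandatory)][string]$DisplayName,
        [Parameter(Mandatory)][string[]]$GatewayName
    )

    Import-Module OperationsManager -ErrorAction Stop

    # Gateway Servers are returned by Get-SCOMManagementServer with IsGateway = True
    $members = @(Get-SCOMManagementServer |
        Where-Object { $_.IsGateway -and ($GatewayName -contains $_.Name) })

    # Stop when a name is mistyped, instead of creating a half-filled pool
    if ($members.Count -ne $GatewayName.Count) {
        throw "Expected $($GatewayName.Count) Gateway Servers, found $($members.Count)."
    }

    New-SCOMResourcePool -DisplayName $DisplayName -Member $members
}

# Usage (hypothetical names):
# New-UxResourcePool -DisplayName 'UX Pool - Customer A' `
#     -GatewayName 'gw01.customer-a.example', 'gw02.customer-a.example'
```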

Certificates?!
Yes. When SCOM installs a UX Agent, it gets a certificate which is created automatically. But that certificate also needs to be signed. For that the SCOM Management or Gateway Server automatically creates a root certificate, used for signing the UX Agent client certificates.

And when the Resource Pool used for UX monitoring contains more than one Management or Gateway Server, those root certificates must be exported and imported on the other Management or Gateway Server(s) residing in the same Resource Pool. When that doesn’t happen, and another server takes over the monitoring of one or more UX systems, monitoring will come to a grinding halt because of the missing root certificate…
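The export and import is done with scxcertconfig.exe, which lives in the SCOM installation folder of the Management or Gateway Server. The wrapper below is just a sketch: the function name Invoke-ScxCertConfig and the paths in the usage comments are assumptions, so check the installation folder on your own servers. The -export and -import switches are written from memory of the TechNet documentation, so verify them there before relying on this.

```powershell
# Sketch: export or import the SCOM UX root (signing) certificate.
function Invoke-ScxCertConfig {
    param(
        [Parameter(Mandatory)][ValidateSet('export', 'import')][string]$Action,
        [Parameter(Mandatory)][string]$CertificateFile,
        [Parameter(Mandatory)][string]$InstallDir   # folder that holds scxcertconfig.exe
    )

    $tool = Join-Path $InstallDir 'scxcertconfig.exe'
    if (-not (Test-Path $tool)) {
        throw "scxcertconfig.exe not found in $InstallDir"
    }

    # Runs: scxcertconfig.exe -export <file>   or   scxcertconfig.exe -import <file>
    & $tool "-$Action" $CertificateFile
    if ($LASTEXITCODE -ne 0) {
        throw "scxcertconfig.exe -$Action failed with exit code $LASTEXITCODE"
    }
}

# Usage (hypothetical paths), on the first pool member:
# Invoke-ScxCertConfig -Action export -CertificateFile 'C:\Temp\gw01.cer' -InstallDir $dir
# ...and on every OTHER member of the same Resource Pool:
# Invoke-ScxCertConfig -Action import -CertificateFile 'C:\Temp\gw01.cer' -InstallDir $dir
```

Repeat this for every server in the pool, so each member ends up trusting the root certificates of all the others.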

Network AND DNS!!!
Yes, network connectivity is crucial. Also a fully functional DNS is very important. In the ‘Chained Gateway’ scenario it looks a bit more complicated at first, but it’s actually not that hard at all. It makes sense:

  • Customer Gateway Server managing/monitoring the UX system > UX system:
    • Port 22 (SSH), only during installation and updating the UX Agent;
    • Port 1270 (WSMAN): All the time
    • Must be able to resolve the FQDN of the UX systems, reverse as well.
  • NOC DMZ Gateway Server > Customer Gateway Server:
    • Port 5723
    • Must be able to resolve the FQDN of the customer Gateway Server
  • SCOM Management Server > NOC DMZ Gateway Server:
    • Port 5723
    • Must be able to resolve the FQDN of the NOC Gateway Server
    • Must be able to resolve the FQDN of the UX systems to be monitored, reverse as well.
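The port requirements above can be tested without telnet. Here is a minimal sketch using the .NET TcpClient class. The function name Test-TcpPort and the host names in the usage comments are made up for illustration. A successful connect only proves the port is reachable, not that the service behind it is healthy.

```powershell
# Sketch: test whether a TCP port can be reached within a timeout.
function Test-TcpPort {
    param(
        [Parameter(Mandatory)][string]$ComputerName,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 3000
    )

    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) {
            return $false            # timed out, most likely filtered by a firewall
        }
        $client.EndConnect($async)   # throws when the connection was refused
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# Usage (hypothetical host names):
# On the customer Gateway Server, towards the UX system:
#   22, 1270 | ForEach-Object { Test-TcpPort -ComputerName 'ux01.customer-a.example' -Port $_ }
# Between the Gateway Servers and Management Servers:
#   Test-TcpPort -ComputerName 'nocgw01.noc.example' -Port 5723
```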

Failover
Failover was already configured before the UX monitoring question came up: for the customer Gateway Servers towards the NOC DMZ Gateway Servers and, in some situations, for the Agents behind the Gateway Servers towards another Gateway Server. And yes, the NOC DMZ Gateway Servers are also configured to fail over to another SCOM Management Server when their primary goes down.

Even though this doesn’t directly influence the monitoring of UX systems, it’s important to know what the Primary and Secondary are for the Gateway Server managing/monitoring the UX systems. More about that later.

Management Packs
MPs in SCOM are crucial, even more so for monitoring UX systems, because the UX MPs also contain the UX Agents! So always make sure you’ve got the latest version of the relevant UX MPs imported. In this case the latest version of the UX MPs is based on UR#3 for SCOM 2012 R2. Even when your SCOM 2012 R2 MG isn’t on UR#3 level (but on UR#2, for instance), these UX MPs can be imported.

Credits
Before I continue I want to point out that all this information I’ve described so far is based on input from some people working at Microsoft USA. Thanks to their effort (one person in particular, thanks Steve!) I got it all up and running, along with some deep troubleshooting where finally, the culprit was something very simple…

Overview of what to do
In this section I describe how I went about it and got it running. In the end some troubleshooting was required, but my guess is that what happened here was just bad luck.

  1. Updated the UX MPs and imported those missing. Please read the included MP Guides in order to know what MPs you require. Only import the ones required!!!
  2. Tested network connectivity. Especially the last ‘mile’, from the customer Gateway Server managing/monitoring the UX systems. Checked out DNS and ports 22 and 1270.
  3. Built the required Resource Pools and put only the Gateway Servers in their respective Resource Pools which would finally manage/monitor the UX systems. So I left out the NOC DMZ Gateway Servers and NOC Management Servers!
  4. Created the required UX accounts in SCOM and assigned them to the proper Resource Pools;
  5. Added these Run As Accounts to the three UX Run As Profiles in SCOM;
  6. Ended up by having the UX administrator installing the UX Agent and Certificate on the UX systems using the command line;
  7. The UX system was successfully discovered by SCOM and added to the OperationsManager database (Agent installation and certificate signing weren’t needed anymore, see Step 6).
  8. So far so good. But… the UX system never got to a monitored status. Instead it went from unmonitored to greyed out. This is the expected flow at first, but after some time (minutes) the system should get a live status. Not in this case. After hours it was still greyed out. Time for some deep troubleshooting. See next section.

Troubleshooting
Yes. I learned a LOT. Also how to troubleshoot UX systems which don’t want to get a monitored status in SCOM. Nice!

The regular tests

  1. On the customer Gateway Server managing/monitoring the problematic UX system, I ran telnet and tested whether I could connect to ports 22 (less important, only crucial during installation/updating of the UX Agent through the Console) and 1270 (crucial, used by WSMAN). Telnet command: Open <UX system FQDN> <port number>.
  2. The UX admin checked whether a firewall was running on the UX system;
  3. On the customer Gateway Server the Windows firewall was checked for the presence of the correct rules allowing traffic on ports 22 and 1270;
  4. FQDN was checked on the customer Gateway Server, including reverse lookup.

These tests turned out to be okay. So no firewall and DNS issues. Check. On to the next series of checks.

The deeper tests
Now it’s time to take a deeper dive into the functionality of SCOM itself. Starting at the customer Gateway Server and working from there.

  1. With two WINRM commands you can check whether WINRM on the customer Gateway Server can connect to the SCOM UX Agent on the problematic UX system:

    winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<USERNAME> -password:<PASSWORD> -r:https://<FQDN UX SERVER>:1270/wsman -auth:basic -skipCACheck -skipCNCheck -skiprevocationcheck -encoding:utf-8 

    You should get a whole list of information returned. This shows WSMAN is working. Now it’s time for the same command without the -skip parts:

    winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<USERNAME> -password:<PASSWORD> -r:https://<FQDN UX SERVER>:1270/wsman -auth:basic -encoding:utf-8 

    When you get the same list back, WSMAN is working and the permissions are okay as well. When issues arise (the first command with the -skip parts worked, but the second one doesn’t), there might be a certificate issue.

    Source: http://social.technet.microsoft.com/Forums/en-US/18047ddf-bcef-4021-a6eb-6cf644e060ad/scom-2012-sp1-not-able-to-discover-linux-workgroup-servers?forum=operationsmanagerunixandlinux

  2. Start logging and debugging, as stated here: http://technet.microsoft.com/en-us/library/hh212862. Especially the section Enable EnableOpsmgrModuleLogging (no, this is NOT a typo) was new to me. There is also a section about how to enable logging on the UX Agent installed on the UX system. Very helpful and interesting. Verbose logging for SCOM I already knew.
  3. Read the TechNet article about troubleshooting UX system monitoring: http://technet.microsoft.com/en-us/library/hh212885. Pay special attention to the sections Certificate Issues, Management Pack Issues and Operating System Issues.

These tests told me the monitoring itself was properly functioning. WSMAN could connect to the UX Agent on the problematic UX system and get information from it. WINRM also worked like a charm without the -skip parts, so no certificate issues either.
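When you have to repeat the two WINRM checks for several UX systems, a small helper that builds the argument list saves typing. This is only a sketch: the function name Get-ScxWinrmArguments and the values in the usage comments are made up, and the arguments are exactly the ones from the two commands shown above. Keep in mind that the password ends up on the command line, and that passwords with special characters may need extra quoting.

```powershell
# Sketch: build the winrm arguments for the SCX_Agent enumeration,
# with or without the three -skip switches.
function Get-ScxWinrmArguments {
    param(
        [Parameter(Mandatory)][string]$UxFqdn,
        [Parameter(Mandatory)][string]$UserName,
        [Parameter(Mandatory)][string]$Password,
        [switch]$SkipCertificateChecks
    )

    $arguments = @(
        'enumerate'
        'http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx'
        "-username:$UserName"
        "-password:$Password"
        "-r:https://${UxFqdn}:1270/wsman"
        '-auth:basic'
    )
    if ($SkipCertificateChecks) {
        $arguments += '-skipCACheck', '-skipCNCheck', '-skiprevocationcheck'
    }
    $arguments += '-encoding:utf-8'
    return $arguments
}

# Usage (hypothetical values), on the customer Gateway Server:
# $skip   = Get-ScxWinrmArguments -UxFqdn 'ux01.customer-a.example' -UserName 'scomux' -Password 'secret' -SkipCertificateChecks
# $strict = Get-ScxWinrmArguments -UxFqdn 'ux01.customer-a.example' -UserName 'scomux' -Password 'secret'
# & winrm $skip      # test 1: is WSMAN reachable at all?
# & winrm $strict    # test 2: is the certificate trusted as well?
```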

The logs on the customer Gateway Servers told me nothing special. So the issue was deeper down the line… Time for some more tests, but now on the SCOM side located at the NOC level (NOC DMZ Gateway Servers and Management Servers).

And YES! I had already restarted the SCOM Console with the /ClearCache switch, long before this point.

The even deeper tests
First I ran a script made by Kris Bash, also working at Microsoft. This PS script tells you exactly what the status of the monitored UX server in SCOM itself is.

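# Requires the OperationsManager module (or run this in the Operations Manager Shell)
Import-Module OperationsManager

$hostname = "FQDN UX SERVER"   # replace with the FQDN of the UX server

$blDiscovered = $false

$class = Get-SCOMClass -Name "Microsoft.Unix.OperatingSystem"
$ComputerClass = Get-SCOMClass -Name "Microsoft.Unix.Computer"

# List all discovered UX operating system instances (handy for spotting name mismatches)
$class | Get-SCOMClassInstance

# Is the operating system of this UX server discovered and managed?
$instance = Get-SCOMClassInstance -Class $class | Where-Object { $_.Path -eq $hostname }
if ( ($instance -ne $null) -and ($instance.IsManaged -eq $true) )
{
    $blDiscovered = $true
}

# The computer object is missing when the UX server was never discovered
$Computer = Get-SCOMClassInstance -Class $ComputerClass | Where-Object { $_.Name -eq $hostname }
if ($Computer -ne $null)
{
    $ComputerHealth = $Computer.HealthState.ToString()
}
else
{
    $ComputerHealth = "not found"
}

Write-Host "Host: $hostname"
Write-Host "OS is discovered: $blDiscovered"
Write-Host "Computer health state: $ComputerHealth"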

I ran this script first on the Management Server and later on the customer Gateway Server managing/monitoring the problematic UX system. This output told me there was indeed an issue with this particular UX server. Somehow no status was created.

Time to check out the whole chain of communication, starting at the customer Gateway Server managing/monitoring the problematic UX system, all the way to the SCOM Management Server, with the NOC DMZ Gateway Server in between.

For this I needed to find out what the primaries were for the Gateway Servers involved. See the posting by Jimmy Harper, section Commands to verify Gateway Server Failover.
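For reference, such an overview can be produced along these lines. This is a sketch in the spirit of that posting, not a copy of it: the function name Get-GatewayFailoverOverview is made up, and it assumes the OperationsManager module is available on the server where you run it.

```powershell
# Sketch: list the primary and failover servers for every Gateway Server.
function Get-GatewayFailoverOverview {
    Import-Module OperationsManager -ErrorAction Stop

    Get-SCOMManagementServer |
        Where-Object { $_.IsGateway } |
        Sort-Object Name |
        ForEach-Object {
            [pscustomobject]@{
                Gateway  = $_.Name
                Primary  = $_.GetPrimaryManagementServer().ComputerName
                Failover = ($_.GetFailoverManagementServers() |
                    ForEach-Object { $_.ComputerName }) -join ', '
            }
        }
}

# Usage:
# Get-GatewayFailoverOverview | Format-Table -AutoSize
```

In a chained setup the Primary of a customer Gateway Server is a NOC DMZ Gateway Server, so follow the chain one hop at a time until you end up at a Management Server.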

When I checked the OpsMgr event log on the NOC DMZ Gateway Server which is the primary for the Gateway Server managing/monitoring the problematic UX system, I noticed that data for this server was being dropped, since the NOC DMZ Gateway Server didn’t think it belonged to the environment.
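Scrolling through the event log by hand works, but a quick filter on recent warnings and errors is faster. A minimal sketch, assuming the default log name Operations Manager. The function name Get-RecentOpsMgrEvent and the server name in the usage comment are made up, and no specific event IDs are filtered since those depend on the issue at hand.

```powershell
# Sketch: show recent errors and warnings from the Operations Manager event log.
function Get-RecentOpsMgrEvent {
    param(
        [string]$ComputerName = $env:COMPUTERNAME,
        [int]$Hours = 4,
        [int]$MaxEvents = 50
    )

    $filter = @{
        LogName   = 'Operations Manager'
        Level     = 2, 3                          # 2 = Error, 3 = Warning
        StartTime = (Get-Date).AddHours(-$Hours)
    }

    # Get-WinEvent reports an error when nothing matches the filter
    Get-WinEvent -ComputerName $ComputerName -FilterHashtable $filter -MaxEvents $MaxEvents |
        Select-Object TimeCreated, Id, LevelDisplayName, Message
}

# Usage (hypothetical server name):
# Get-RecentOpsMgrEvent -ComputerName 'nocgw01.noc.example' | Format-List
```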

And yes, a simple PS-cmdlet for restarting the HealthService (Restart-Service HealthService) fixed this issue. Within a few minutes the UX system got a HEALTHY status in SCOM.

Recap
Monitoring UX systems residing at locations behind SCOM Gateway Servers is a valid scenario. Just make sure you have all the requirements in place and go for it. And when something is amiss, use this posting to help you out.
