Monday, September 1, 2014

Xplat Monitoring With Chained Gateway Servers - The NOC Approach

2014-09-01 Update
As it turned out, even though the MS servers aren’t capable of directly contacting the monitored UX servers as described in the NOC approach with chained Gateway Servers, they still require name resolution, and on top of that, reverse name resolution as well.

When running the MP Templates for UX servers, the MS servers (ALL OF THEM, since they’re part of the All Management Servers Resource Pool) must be able to resolve the names of the UX machines involved. So make sure the MS servers can resolve those names, reverse as well.

Otherwise the Tasks related to the UX server specific MP Templates AND the normal Tasks for the UX servers won’t run at all, or will run only occasionally, namely when they happen to be executed by an MS server that is able to resolve the names.
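A quick way to verify this on each MS server is a small forward and reverse lookup check. The following is only a minimal sketch: the function name Test-NameResolution and the host names in the usage comment are made up for illustration, and it relies on nothing but the .NET System.Net.Dns class.

```powershell
# Sketch: forward and reverse lookup for one host name, using only .NET classes.
function Test-NameResolution {
    param([Parameter(Mandatory)][string]$HostName)

    $result = [pscustomobject]@{
        Name         = $HostName
        Addresses    = @()
        ReverseNames = @()
        ForwardOk    = $false
        ReverseOk    = $false
    }

    try {
        # Forward lookup: name -> IP address(es)
        $result.Addresses = @([System.Net.Dns]::GetHostAddresses($HostName) |
            ForEach-Object { $_.IPAddressToString })
        $result.ForwardOk = $result.Addresses.Count -gt 0
    } catch {
        return $result   # forward lookup failed, nothing to reverse
    }

    foreach ($ip in $result.Addresses) {
        try {
            # Reverse lookup: IP address -> name
            $result.ReverseNames += [System.Net.Dns]::GetHostEntry($ip).HostName
        } catch { }      # no usable reverse record for this address
    }

    # Reverse only counts as OK when an address maps back to the name we started with
    $result.ReverseOk = $result.ReverseNames -contains $HostName
    return $result
}

# Usage (host names are placeholders):
# 'ux01.customer.example', 'ux02.customer.example' |
#     ForEach-Object { Test-NameResolution -HostName $_ } |
#     Format-Table Name, ForwardOk, ReverseOk -AutoSize
```

Run it on every member of the All Management Servers Resource Pool. A single server failing the reverse check is enough to explain Tasks that only work some of the time.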

The challenge
Even though I’ve helped customers before with monitoring their UX systems with SCOM 2012x, I had a new challenge. In this particular scenario the customer has a NOC in place (Network Operations Center) where only the SCOM 2012x Management Group resides.

All the monitoring happens somewhere else, at many customer locations. Per customer location at least one Gateway Server is in place. Behind that Gateway Server the real monitoring workloads reside. And to make it even more challenging, the Gateway Server(s) residing at the customer locations don’t report directly to the Management Servers residing in the NOC.

Chained Gateway Servers
Instead these customer Gateway Servers report to special NOC Gateway Servers residing in a DMZ. And those Gateway Servers report – finally – to the SCOM 2012x Management Servers. This kind of setup (Gateway Servers communicating to other Gateway Servers and not directly with the Management Servers) is also known as Chained Gateway Servers.

So when monitoring UX based workloads, the chain of communication looks like this: UX based workloads > monitored by the customer Gateway Server(s) > NOC DMZ Gateway Servers > NOC Management Servers.

Additional load AND not for the Management Servers…
And as we all know, SCOM Agents on UX systems aren’t like SCOM Agents on Windows Servers. Where the latter manages itself (it decides when to run which scripts and manages its own workload and agenda, based on the imported MPs), the SCOM UX Agent is totally managed by the Management Server or Gateway Server it reports to.

Also good to know is that in this setup not a single UX system would be managed by a Management Server, but only by SCOM Gateway Server(s). And when it comes down to load, SCOM Gateway Servers are just like SCOM Agents, nothing like the robust HealthService running on a Management Server.

So this means the customer Gateway Server(s) will take an additional hit on their performance. Something to reckon with.

Not clustered but everywhere
The UX systems to be monitored don’t reside at a single location but at many different customer locations. This impacts the setup of the monitoring of the UX systems as well.

Security!!!
Different customers mean different security policies. So even though ONE UX account for monitoring would be nice from an administration point of view, it wouldn’t fit the bill at all. So per customer location at least ONE UX account had to be created and distributed to the correct Gateway Servers.

Resource Pools please!
Yes. Monitoring UX based workloads (or SNMP based ones, for that matter) REQUIRES the usage of Resource Pools. And yes, you can put Gateway Servers in Resource Pools, no problem. But in this case multiple Resource Pools are required since the UX systems reside at different locations behind different Gateway Servers.

Certificates?!
Yes. When SCOM installs a UX Agent, the agent gets a certificate which is created automatically. But that certificate also needs to be signed. For that, the SCOM Management Server or Gateway Server automatically creates a root certificate, which is used for signing the UX Agent client certificates.

And when the Resource Pool used for UX monitoring contains more than one Management Server or Gateway Server, those root certificates must be exported and then imported on the other Management Server(s) or Gateway Server(s) residing in the same Resource Pool. When that doesn’t happen, and another server takes over the monitoring of one or more UX systems, monitoring will come to a grinding halt because of the missing root certificate…
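The export and import can be done with the scxcertconfig.exe tool that ships with SCOM. The lines below are a sketch only: the file path and name are placeholders, and the commands are assumed to be run from an elevated prompt in the Operations Manager installation folder of the server in question.

```powershell
# On the pool member whose root certificate has to be shared
# (C:\Temp\GW01-scx.cer is a placeholder file name):
.\scxcertconfig.exe -export C:\Temp\GW01-scx.cer

# Copy the file to every OTHER member of the same Resource Pool and run there:
.\scxcertconfig.exe -import C:\Temp\GW01-scx.cer
```

Repeat this for every member of the pool, so that each server trusts the certificates signed by all the others.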

Network AND DNS!!!
Yes, network connectivity is crucial. A fully functional DNS is very important as well. In the ‘Chained Gateway’ scenario it looks a bit more complicated at first sight, but it’s actually not that hard at all. It makes sense:

  • Customer Gateway Server managing/monitoring the UX system > UX system:
    • Port 22 (SSH), only during installation and updating the UX Agent;
    • Port 1270 (WSMAN): All the time
    • Must be able to resolve the FQDN of the UX systems, reverse as well.
  • NOC DMZ Gateway Server > Customer Gateway Server:
    • Port 5723
    • Must be able to resolve the FQDN of the customer Gateway Server
  • SCOM Management Server > NOC DMZ Gateway Server:
    • Port 5723
    • Must be able to resolve the FQDN of the NOC Gateway Server
    • Must be able to resolve the FQDN of the UX systems to be monitored, reverse as well.

Failover
Already configured before the UX system monitoring question came up was the failover of the customer Gateway Servers between the NOC DMZ Gateway Servers and, in some situations, the failover of the Agents residing behind the Gateway Servers to another Gateway Server. And yes, the NOC DMZ Gateway Servers are also configured to fail over to another SCOM Management Server when their primary goes down.

Even though this doesn’t directly influence the monitoring of UX systems, it’s important to know what the Primary and Secondary are for the Gateway Server managing/monitoring the UX systems. More about that later.

Management Packs
MPs in SCOM are crucial. Even more so for monitoring UX systems, because the UX MPs also contain the UX Agents! So always make sure you’ve got the latest version of the relevant UX MPs imported. In this case the latest version of the UX MPs is based on UR#3 for SCOM 2012 R2. Even when your SCOM 2012 R2 MG isn’t at UR#3 level but at UR#2 for instance, these UX MPs can be imported.

Credits
Before I continue I want to point out that all this information I’ve described so far is based on input from some people working at Microsoft USA. Thanks to their effort (one person in particular, thanks Steve!) I got it all up and running, along with some deep troubleshooting where finally, the culprit was something very simple…

Overview of what to do
In this section I describe how I went about it and got it running. In the end some troubleshooting was required, but my guess is that what happened here was just some bad luck.

  1. Updated the UX MPs and imported those missing. Please read the included MP Guides in order to know what MPs you require. Only import the ones required!!!
  2. Tested network connectivity. Especially the last ‘mile’, from the customer Gateway Server managing/monitoring the UX systems. Checked out DNS and ports 22 and 1270.
  3. Built the required Resource Pools and put only the Gateway Servers in their respective Resource Pools which would finally manage/monitor the UX systems. So I left out the NOC DMZ Gateway Servers and NOC Management Servers!
  4. Created the required UX accounts in SCOM and assigned them to the proper Resource Pools;
  5. Added these Run As Accounts to the three UX Run As Profiles in SCOM;
  6. Ended up by having the UX administrator installing the UX Agent and Certificate on the UX systems using the command line;
  7. The UX systems were successfully Discovered by SCOM and added to the OperationsManager database (Agent installation and certificate signing weren’t needed anymore, see Step 6).
  8. So far so good. But… the UX system never got to a monitored status. Instead it went from unmonitored to greyed out. Up to that point this is the normal flow, but after some time (minutes) it should get a live status. Not in this case. After hours it was still greyed out. Time for some deep troubleshooting. See next section.

Troubleshooting
Yes. I learned a LOT. Also how to troubleshoot UX systems which don’t want to get a monitored status in SCOM. Nice!

The regular tests

  1. On the customer Gateway Server managing/monitoring the problematic UX system, I ran telnet and tested whether I could connect to port 22 (less important, only crucial while installing/updating the UX Agent through the Console) and port 1270 (crucial, used by WSMAN). Telnet command: Open <UX system FQDN> <port number>.
  2. The UX admin checked whether a firewall was running on the UX system;
  3. On the customer Gateway Server the Windows firewall was checked for the presence of the correct rules allowing traffic on ports 22 and 1270;
  4. FQDN was checked on the customer Gateway Server, including reverse lookup.

These tests turned out to be okay. So no firewall and DNS issues. Check. On to the next series of checks.
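When the telnet client isn’t installed on the Gateway Server, the same port test can be done from PowerShell. This is a minimal sketch using the .NET TcpClient class; the function name Test-TcpPort and the host name in the usage comment are my own placeholders.

```powershell
# Sketch: returns $true when a TCP connection to the given port succeeds in time.
function Test-TcpPort {
    param(
        [Parameter(Mandatory)][string]$ComputerName,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 3000
    )

    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        # Give up when the connection attempt does not finish within the timeout
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) { return $false }
        $client.EndConnect($async)   # throws when the connection was refused
        return $true
    } catch {
        return $false
    } finally {
        $client.Close()
    }
}

# Usage (placeholder FQDN): 22 = SSH, 1270 = WSMAN
# 22, 1270 | ForEach-Object {
#     "Port {0}: {1}" -f $_, (Test-TcpPort -ComputerName 'ux01.customer.example' -Port $_)
# }
```

A result of False tells you only that no connection could be made, not why. A firewall on either side or a stopped agent looks the same from here.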

The deeper tests
Now it’s time to take a deeper dive into the functionality of SCOM itself. Starting at the customer Gateway Server and working from there.

  1. With two WINRM commands you can check whether WINRM on the customer Gateway Server can connect to the SCOM UX Agent on the problematic UX system:

    winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<USERNAME> -password:<PASSWORD> -r:https://<FQDN UX SERVER>:1270/wsman -auth:basic -skipCACheck -skipCNCheck -skiprevocationcheck -encoding:utf-8 

    You should get a whole list of information returned. This shows WSMAN is working. Now it’s time for the same command without the -skip parts:

    winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<USERNAME> -password:<PASSWORD> -r:https://<FQDN UX SERVER>:1270/wsman -auth:basic -encoding:utf-8 

    When you get the same list back, WSMAN is working and the certificate is trusted as well. When the first command (with the -skip parts) works but the second one doesn’t, there is most likely a certificate issue.

    Source: http://social.technet.microsoft.com/Forums/en-US/18047ddf-bcef-4021-a6eb-6cf644e060ad/scom-2012-sp1-not-able-to-discover-linux-workgroup-servers?forum=operationsmanagerunixandlinux

  2. Start logging and debugging, as stated here: http://technet.microsoft.com/en-us/library/hh212862. Especially the section Enable EnableOpsmgrModuleLogging (no, this is NOT a typo) was new to me. There is also a section about how to enable logging on the UX Agent installed on the UX system. Very helpful and interesting. Verbose logging for SCOM I already knew.
  3. The whole TechNet article all about troubleshooting UX system monitoring: http://technet.microsoft.com/en-us/library/hh212885. Read the sections Certificate Issues, Management Pack Issues and Operating System Issues.

These tests told me the monitoring itself was properly functioning. WSMAN could connect to the UX Agent on the problematic UX system and get information from it. Also without the -skip part WINRM worked like a charm, so no certificate issues either.
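Since the two WINRM commands differ only in the three -skip switches, a small helper can generate both, so they are guaranteed to be otherwise identical. This is just a sketch: the function name Get-ScxWinrmCommand is hypothetical, and the password is deliberately left as a placeholder to be filled in at the prompt.

```powershell
# Sketch: builds the winrm test command line, with or without the -skip switches.
function Get-ScxWinrmCommand {
    param(
        [Parameter(Mandatory)][string]$UxFqdn,
        [Parameter(Mandatory)][string]$UserName,
        [switch]$SkipCertChecks
    )

    $resource = 'http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx'
    $parts = @(
        'winrm', 'enumerate', $resource,
        "-username:$UserName",
        '-password:<PASSWORD>',               # placeholder, replace before running
        "-r:https://${UxFqdn}:1270/wsman",
        '-auth:basic'
    )
    if ($SkipCertChecks) {
        $parts += '-skipCACheck', '-skipCNCheck', '-skiprevocationcheck'
    }
    $parts += '-encoding:utf-8'
    return ($parts -join ' ')
}

# Usage (placeholder values): print both variants and paste them into a command prompt
# Get-ScxWinrmCommand -UxFqdn 'ux01.customer.example' -UserName 'scomuser' -SkipCertChecks
# Get-ScxWinrmCommand -UxFqdn 'ux01.customer.example' -UserName 'scomuser'
```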

The logs on the customer Gateway Servers told me nothing special. So the issue was further down the line… Time for some more tests, but now more on the SCOM side located at the NOC level (NOC DMZ Gateway Servers and Management Servers).

And YES! Long before that I had already restarted the SCOM Console using the /ClearCache switch as well.

The even deeper tests
First I ran a script made by Kris Bash, also working at Microsoft. This PS script tells you exactly what the status of the monitored UX server in SCOM itself is.

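# FQDN of the UX server to check
$hostname = "FQDN UX SERVER"

$blDiscovered = $false
$blHealthy = $false

# The OperationsManager module provides the Get-SCOM* cmdlets
Import-Module OperationsManager

$class = Get-SCOMClass | Where-Object {$_.Name -eq "Microsoft.Unix.OperatingSystem"}
$ComputerClass = Get-SCOMClass | Where-Object {$_.Name -eq "Microsoft.Unix.Computer"}

# List all discovered UX operating system instances (handy for spotting a name mismatch)
$class | Get-SCOMClassInstance

# The operating system instance is hosted by the computer, so its Path is the host name
$instance = Get-SCOMClassInstance -Class $class | Where-Object {$_.Path -eq $hostname}
if ( ($instance -ne $null) -and ($instance.IsManaged -eq $true) )
{
    $blDiscovered = $true
}

$Computer = Get-SCOMClassInstance -Class $ComputerClass | Where-Object {$_.Name -eq $hostname}
# Avoid a null-reference error when the computer object doesn't exist at all
if ($Computer -ne $null) { $ComputerHealth = $Computer.HealthState.ToString() }
else { $ComputerHealth = "Not found" }

Write-Host "Host: $hostname"
Write-Host "OS is discovered: $blDiscovered"
Write-Host "Computer health state: $ComputerHealth"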

I ran this script first on the Management Server and later on the customer Gateway Server managing/monitoring the problematic UX system. This output told me there was indeed an issue with this particular UX server. Somehow no status was created.

Time to check out the whole chain of communication, starting at the customer Gateway Server managing/monitoring the problematic UX system, up to the SCOM Management Server, with the NOC DMZ Gateway Server in between.

For this I needed to find out what the primaries were for the Gateway Servers involved. See the posting by Jimmy Harper on this topic, section Commands to verify Gateway Server Failover.
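A check along those lines could look like the sketch below. It is not necessarily identical to the commands in that posting. It assumes the OperationsManager module is available and that the objects returned by Get-SCOMManagementServer expose the SDK methods GetPrimaryManagementServer() and GetFailoverManagementServers().

```powershell
# Sketch: list the primary and failover servers for every Gateway Server.
Import-Module OperationsManager

$gateways = Get-SCOMManagementServer | Where-Object { $_.IsGateway -eq $true }

foreach ($gw in ($gateways | Sort-Object Name)) {
    "Gateway     :: " + $gw.Name
    "  Primary   :: " + $gw.GetPrimaryManagementServer().ComputerName
    foreach ($failover in $gw.GetFailoverManagementServers()) {
        "  Failover  :: " + $failover.ComputerName
    }
    ""
}
```

In a chained setup the primary of a customer Gateway Server is expected to be a NOC DMZ Gateway Server, and only the primary of the latter is a real Management Server.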

When I checked the OpsMgr event log on the NOC DMZ Gateway Server which is the primary for the Gateway Server managing/monitoring the problematic UX system, I noticed that data for this UX server was being dropped, since the NOC DMZ Gateway Server didn’t think it belonged to the environment.

And yes, a simple PS-cmdlet for restarting the HealthService (Restart-Service HealthService) fixed this issue. Within a few minutes the UX system got a HEALTHY status in SCOM.

Recap
Monitoring UX systems residing on locations behind SCOM Gateway Servers is a valid scenario. Just make sure you’ve all the requirements in place and go for it. And when something is amiss, use this posting to help you out.

Tuesday, August 26, 2014

Largest Collection FREE Microsoft eBooks Ever…

No matter how digital we might have become, reading is still important. Okay, looking at my own situation, I own the equivalent of a small library just counting my ebooks. All on my iPad. Awesome!

Microsoft has released many FREE ebooks in recent years, all about the technologies they’ve launched. And yes, many of those books don’t go to the core, yet even those books help you to get started, so you know what to look for when you want to follow up on that free book with another (e)book covering the topics you deem interesting.


Some weeks ago, Tim Bush, a UK Microsoft employee (education marketing manager), wrote a very interesting posting all about the FREE ebooks Microsoft released in the past few years. The same posting lists the most relevant FREE ebooks available today.

Want to know more? Want to learn? Want to read? Go here and be amazed. Yes, it’s the era of information for sure!!!

Monday, August 25, 2014

SCCM 2012x Multicast: Another Bites The Dust With Error 0x80091007

Phew! Had my ‘special’ taste of SCCM last week! All about OSD using SCCM 2012 R2 CU#2 and multicasting.

And yes, the dreaded error 0x80091007 came along way too many times. I searched on the internet but NO MATTER WHAT I DID, TRIED, TESTED, MODIFIED, DELETED, UPDATED and REDISTRIBUTED, NOTHING HELPED!!!

Even worse, the behavior of the OSD became erratic. We had 3 OSDs in total using multicast. Right from the start 1 out of 3 worked and the other two failed with the 0x80091007 error. Within 15 seconds at most, right when multicasting should have started, this error popped up.

Of course, the related log file (smsts.log) was taken from the failed client and checked. And yes, the well known errors popped up, like:

  1. Encountered error transfering file (0x80070003).
  2. Sending status message: SMS_OSDeployment_PackageDownloadMulticastStatusFail;
  3. Hash could not be matched for the downloded content. Original ContentHash = 905E6AEDF8BD29DC83A41552CC248CFEAB9A434E97DE699CE8FF5647371D6367, Downloaded ContentHash =;
  4. DownloadContentAndVerifyHash() failed. 80091007;
  5. Installation of image 2 in package HQ1000F5 failed to complete. The hash value is not correct. (Error: 80091007; Source: Windows);
  6. The user tries to release a source directory C:\_SMSTaskSequence\Packages\HQ1000F5 that is either already released or we have not connected to it;
  7. Failed to run the action: Apply Operating System.
    The hash value is not correct. (Error: 80091007; Source: Windows).

Yes, error 0x80091007 is related to content mismatch errors. But take a close look at the end of error 3: the Downloaded ContentHash is empty. Since NOTHING is downloaded to the client, there is NO hash either. So in this case the 0x80091007 error turned out to be misleading…
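To tell a real hash mismatch apart from this ‘nothing was downloaded’ case, you can scan smsts.log for hash lines whose Downloaded ContentHash is empty. A minimal sketch follows; the function name Get-EmptyDownloadedHashLine and the log path in the usage comment are placeholders of my own.

```powershell
# Sketch: return the log lines where "Downloaded ContentHash =" has no hash value.
function Get-EmptyDownloadedHashLine {
    param([string[]]$Line)

    # After the equals sign (and optional whitespace) no hex digit may follow
    $pattern = 'Downloaded ContentHash\s*=\s*(?![0-9A-Fa-f\s])'
    $Line | Where-Object { $_ -match $pattern }
}

# Usage (path is a placeholder for a copy of the log taken from the failed client):
# Get-EmptyDownloadedHashLine -Line (Get-Content 'C:\Temp\smsts.log')
```

When this returns lines, the client never received any content, so redistributing the package to get a new hash is unlikely to help.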

Things we tried:

  • Error 80091007 is most of the time related to a hash issue. Updating the related DPs, or even redistributing the content, creates a new hash. But this didn’t help.
  • Antivirus software. We disabled it COMPLETELY on all SCCM systems involved. And then we ran the first step (redistributing the OS images) again. But no luck this time.
  • Permissions? We checked and double checked, but the OS Image which kept on working like clockwork had the exact same permissions as the ones which kept on failing;
  • DP issue? Package not available? Duh! The DP involved neatly logged an error as well in its own Windows Event Logs for the Deployment Server. This error neatly contained the URL to the package. And guess what? When I clicked it, the OS Image (WIM file) was neatly downloaded!
  • ESX issue? When the SCCM server is a VM running on VMware there might be some issues with the NIC. So we removed the old NIC (first from Windows, then from VMware), rebooted the server, gave it a new NIC, configured it properly (both in VMware and Windows) and rebooted the server again. And guess what? NO LUCK! The OS Image which ran without issues from the first moment kept on running and the others kept on FAILING!!!
  • DP issue? So I REMOVED both problematic OS Images. I even removed the Task Sequences, although that shouldn’t be necessary since the Task Sequences only contain metadata. But nonetheless, let’s start CLEAN. I waited a WHOLE DAY so SCCM had ALL the time to sort things out. After that day I added the two OS images, set the distribution options to multicast, distributed them to the related DPs and ONLY after that was done, I rebuilt the required deployment Task Sequences. Again the SAME ERROR!!!
  • SCCM issue? I checked about 100 SCCM logs and nowhere was a glitch to be found. SCCM seemed to be in a healthy state. All lights green and all systems go! But NO WAY the two problematic OS images would deploy using multicast.
  • Oh, did I already tell you that UNICAST just runs like clockwork? ALL THE TIME? For ALL OS Images, including the two that fail when using multicast?
  • I checked out tons of blogs from the BIG names, like Kenneth van Surksum, Peter Daalmans, Henk Hoogendoorn and so on. All those big names bumped into similar issues, but all the things they proposed simply didn’t work here. And many times they kept on having issues with multicasting themselves.

The BIG surprise AND the erratic behavior…
All of a sudden, just like that, one of the two problematic OS Images started to work with multicast deployment! Just out of the blue! We tried to reproduce the situation, but no matter what, the last image wouldn’t budge. So now we had two OS images running okay using multicast. But only in an isolated test lab.

And the big REAL world – using vLANs (uh oh) was waiting out there…

The verdict
Based on the experience gained while trying to solve the multicast issues, we came to know multicast as used by SCCM 2012 R2 as a mechanism requiring a lot of maintenance combined with black magic and a lot of bad mojo. Not really a toolset for the real world, which requires rock solid and trustworthy end results that can be reproduced over and over…

So in this case multicast is removed and unicast rules.

PS…
Based on the blogs I bumped into, it seems like multicast is a pain in the well known B#@ side. Just wondering how many people got it up and running AND trustworthy in networks which are segmented using many vLANs.

Tuesday, August 12, 2014

New MP: Office 365

Some weeks ago Microsoft released a MP for monitoring your Office 365 subscription.

The MP can be downloaded from here. Soon I’ll import and tune this MP for a customer of mine. When allowed I’ll blog about my experiences with this new MP.

SCOM 2012 SP1: UR#7 Is Out!

UR#7 for SCOM 2012 SP1 has been out for a few weeks now. KB2965420 describes this UR in more detail.

A word of advice:
I know I repeat myself: Be careful with rolling out this UR since it wouldn’t be the first time a UR for SCOM breaks something… So TEST it first before rolling it out in production…

SCOM 2012 R2: UR#3 Is Out!

UR#3 for SCOM 2012 R2 has been out for a few weeks now. KB2965445 describes this UR in more detail.

Kevin Holman has written a posting all about installing this update, to be found here.

A word of advice:
I know I repeat myself: Be careful with rolling out this UR since it wouldn’t be the first time a UR for SCOM breaks something… So TEST it first before rolling it out in production…