Friday, May 24, 2013

Getting Rid Of Nagging EventIDs 7000, 7015 & 7021 When Monitoring Untrusted Windows Servers

Some time ago I bumped into a situation where a customer had some untrusted Windows Servers which required monitoring by their OM12 SP1 Management Group. All of these Windows Servers had the proper client/server certificates in place in order to communicate with the OM12 SP1 Management Group. But monitoring didn’t work as expected. EventIDs 7000, 7015 and 7021 were thrown on all of these servers.

It began with EventID 7000: ‘…The Health Service could not log on the RunAs account <DOMAIN>\<SCOM ACTION ACCOUNT> for management group <NAME>.  The error is Logon failure: unknown user name or bad password.(1326L).  This will prevent the health service from monitoring or performing actions using this RunAs account…’

Soon followed by EventID 7015: ‘…The Health Service cannot verify the future validity of the RunAs account <DOMAIN>\<SCOM ACTION ACCOUNT> for management group <NAME>.  The error is Logon failure: unknown user name or bad password.(1326L)…’

But the real killer is EventID 7021 which came last: ‘…The Health Service was unable to validate any user accounts in management group <NAME>…’

And here all monitoring stopped, simply because none of the required accounts for the OM12 SP1 Agent could be validated, which broke monitoring of that Windows Server as a whole. In SCOM two Alerts were raised per server, related to all this:

  1. Run As Account Could Not Log On
    One or more Run As Accounts failed to log on. The account may be disabled or has an expired password.
  2. Unable to Verify Run As Account
    The System Center Management Health Service is unable to verify the Run As account.

Since these Alerts kept coming back and the underlying issues couldn’t be solved by the customer, these Monitors were disabled through an override. But that didn’t solve the issue at all…

So it was time for some good old troubleshooting. And guess what? Got it working within an hour! And believe me, this isn’t rocket science at all. One simply needs to know some basic stuff and how SCOM operates. And when one does, issues like these won’t happen (any more). There is much to tell so let’s start.

The basics
There are two things to know in order to get this solved properly.

01: Meet the SCOM Agent
Typically a SCOM Agent (the Health Service, aka the System Center Management service) runs under the Local System account. When required, it can spawn multiple MonitoringHost.exe processes under other credentials.

Taken from this posting from my blog: ‘…The HealthService initiates the process MonitoringHost.exe. The HealthService can spawn multiple MonitoringHost.exe processes… as needed.  Typically – you will see a couple MonitoringHost processes executing under the Default Agent Action Account.  In addition, HealthService will launch MonitoringHost processes under any preconfigured Run-As accounts that are executing workflows on the agents, using those credentials. Thus ‘giving’ the HealthService the credential management capability to support the execution of modules running as different users…’

So this is where the Agent Action account comes in. And – based on best practices – this should be an AD account. But what happens on Windows servers residing outside that very same AD infrastructure, like Windows servers participating in a workgroup for instance?

02: Account distribution
In order to define the proper accounts in OM12 for the corresponding workflows and distribute them to the Windows servers involved, there is an advanced mechanism in place within OM12. It’s based on these pillars:

  1. Run As Account
    Here you can define the accounts required to discover/monitor certain classes and workloads. Many different account types can be created here.
    In this particular case we need the Action Account.

  2. Run As Profile
    Here the Run As Accounts can be ‘attached’ to a certain Run As Profile in order to discover/monitor certain workflows. Run As Profiles are to be found in many MPs but you can create your own as well when required. The Default Action Account Run As Profile plays an important role in this posting and is present by default in any MG.

  3. Distribution
    With items 1 and 2 we have the mechanism in place to define the required accounts (Run As Accounts) and to logically group them (Run As Profile). So now we need a mechanism to distribute them to the proper managed computers.

    For this distribution there are two types:
    1. Less Secure.
      For lazy admins :). Here the Run As Accounts (and their related Run As Profiles) are distributed to ALL managed computers, no matter what. That includes computers which can’t resolve those credentials, like Windows Servers residing in a workgroup. Avoid this distribution type since it introduces many unwanted issues.
    2. More Secure.
      You manually select the managed computers to which the Run As Account will be distributed. Even though this approach costs more time and planning, it works best since it only brings the Run As Accounts to the servers they’re intended for.
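
To make the difference between the two distribution types concrete, here is a minimal Python sketch. It is only a model of the rule described above, not SCOM code: the RunAsAccount class, the receives_credentials function and all server and account names (DC01, APP01, EDGE01, CONTOSO\scomaction) are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunAsAccount:
    """Hypothetical stand-in for a Run As Account and its distribution setting."""
    name: str
    username: str
    more_secure: bool = True
    # Only consulted when more_secure is True.
    approved_computers: set[str] = field(default_factory=set)


def receives_credentials(account: RunAsAccount, computer: str) -> bool:
    """Return True if the account's credentials end up on the given computer."""
    if not account.more_secure:
        # Less Secure: every managed computer gets the credentials.
        return True
    # More Secure: only the explicitly selected computers get them.
    return computer in account.approved_computers


# EDGE01 plays the role of the untrusted workgroup server.
managed = ["DC01", "APP01", "EDGE01"]

less = RunAsAccount("Domain Action", r"CONTOSO\scomaction", more_secure=False)
more = RunAsAccount("Domain Action", r"CONTOSO\scomaction",
                    more_secure=True, approved_computers={"DC01", "APP01"})

less_targets = [c for c in managed if receives_credentials(less, c)]
more_targets = [c for c in managed if receives_credentials(more, c)]

print(less_targets)  # ['DC01', 'APP01', 'EDGE01']
print(more_targets)  # ['DC01', 'APP01']
```

With Less Secure the domain credentials also land on EDGE01, which has no way to validate them. With More Secure they only reach the servers that were selected.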

There is much more to tell about this topic. Kevin Holman has written an excellent posting about it, go here.

How the problems were solved
Now with this knowledge we have enough ammo to fight the earlier mentioned problems on the untrusted Windows Servers.

  1. Goodbye EventID 7021
    This one prevented the OM12 SP1 Agent from functioning at all, so it was time to deal with this one first. A new Agent Action account was defined in SCOM, using the credentials of a LOCAL SCOM Action account, freshly created on the untrusted servers. Since it’s a local account, it has to be created on every untrusted Windows server. Afterwards these accounts have to be created in SCOM as Run As Accounts.

    For more information about the permissions the Action Account requires, go here.

    How the Run As Accounts were created:
    1. Go to Administration > Run As Configuration > Accounts.
    2. In this case per untrusted Windows server a new Action account was created, named Action Account <NAME UNTRUSTED SERVER>
      > Next

      For the username type <SERVERNAME>\<ACTION ACCOUNT NAME>, then enter the password
      > Next

      The More Secure distribution option is automatically selected
      > Create.
    3. Now the Run As Account (Action account) has to be ‘attached’ to the proper managed Windows Server. This is done through the Run As Profile Default Action Account.
    4. Go to Administration > Run As Configuration > Profiles.
    5. Open the Default Action Account profile and go to the third option, Run As Accounts.
    6. Select the untrusted server for which the Action Account in Step 2 is made and modify it accordingly (*).
    7. Save the modifications.
    8. Repeat Steps 2 to 7 for all untrusted Windows servers which require monitoring by OM12.

      ( * : I have seen situations where the related OM12 Agent fell back to the original Agent Action account (Local System). This is easily solved. Start an RDP session to the related server. Open Control Panel > Operations Manager Agent applet > select the Management Group connection involved > Edit > and set the option Agent Action Account to Use the following domain or local account to perform agent actions. Enter the correct credentials, but wait before hitting OK. First adjust the Run As Profile in the SCOM Console and save it. Then hit the OK button in the Operations Manager Agent applet and you’re just fine.)

      After this EventID 7021 was gone and the SCOM Agent started to function normally. Only the nagging EventIDs 7000 and 7015 remained.
  2. Goodbye EventIDs 7000 and 7015
    Still EventID 7000 and EventID 7015 kept coming back, all about the SCOM Agent Action account not being validated. In itself that’s normal, since it’s an AD based account which the untrusted servers can’t validate, but I didn’t understand where this account came from. Not from the Default Action Account profile, that’s for sure.

    But one of the Run As Accounts was pushing this account to the untrusted Windows servers. Time to investigate.
    1. Go to Administration > Run As Configuration > Accounts.
    2. Look for any Run As Account which uses the SCOM Action account as well and uses the Less Secure distribution method. Also check for Run As Accounts with the More Secure distribution option where the untrusted Windows servers are selected. Chances are, however, that the Less Secure distribution option is the culprit here.
    3. Indeed, in this case there was a Run As Account defined which also used the SCOM Action Account and used the Less Secure distribution option.
    4. Change this to More Secure and select only the servers involved.
    5. Now EventID 7000 and EventID 7015 on the untrusted Windows servers are gone as well!
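
The check in Steps 2 and 3 can be written down as a small Python sketch. It assumes the Run As Accounts have been copied by hand from the console into a list. It does not query SCOM, and the RunAsAccount class, the find_suspects function and every account and server name in it are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunAsAccount:
    """Hypothetical record of one Run As Account as shown in the console."""
    name: str
    username: str
    more_secure: bool
    approved_computers: set[str] = field(default_factory=set)


def find_suspects(accounts, action_username, untrusted):
    """Return (account name, reason) pairs for accounts that push the
    domain action credentials to untrusted servers."""
    suspects = []
    for acc in accounts:
        # Only accounts that reuse the domain action credentials matter here.
        if acc.username.lower() != action_username.lower():
            continue
        if not acc.more_secure:
            suspects.append((acc.name, "Less Secure: goes to all managed computers"))
            continue
        hit = sorted(acc.approved_computers & set(untrusted))
        if hit:
            suspects.append((acc.name, "More Secure, but targets: " + ", ".join(hit)))
    return suspects


accounts = [
    RunAsAccount("Action Account EDGE01", r"EDGE01\scomlocal", True, {"EDGE01"}),
    RunAsAccount("SQL Discovery", r"CONTOSO\scomaction", False),
    RunAsAccount("Domain Action Account", r"CONTOSO\scomaction", True, {"DC01", "APP01"}),
    RunAsAccount("Legacy Action", r"contoso\SCOMACTION", True, {"APP01", "EDGE01"}),
]

suspects = find_suspects(accounts, r"CONTOSO\scomaction", ["EDGE01"])
for name, reason in suspects:
    print(name, "->", reason)
```

In this made-up data set two accounts are flagged: one with Less Secure distribution and one with More Secure distribution that has the untrusted server selected. Those are the two cases Step 2 asks you to look for.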

As you can see, monitoring untrusted Windows servers can be done without too much effort. But you need to know how certain things work in OM12 and you’ll be just fine. Troubleshooting consumes way more time…

Issues like these happen when not enough time is taken to understand the product itself. And many times the product is blamed while that isn’t the true story at all :).

Happy SCOMming!

4 comments:

Junaid Ahmed said...

Thanks brother.
Stay blessed!!

rob said...

Thanks. Really good post on the esoterics of runas. I am now enlightened!

Anders Jensen said...

Hi.
Thanks for this guide. It's helped me get a better understanding on action accounts.
I've been able to create a new account and assign it to my Lync 2013 Edge server (Workgroup computer), and I'm getting The Health Service successfully logged on the RunAs account \ for management group
However, I get this error as well:
The Health Service cannot verify the future validity of the RunAs account \omaa for management group . The error is The user name or password is incorrect.(1326L).

challenge logic said...

Great doc, thank you :)