Thoughts on Azure, OMS & SCOM: One Small Footprint For a Server, One Giant Leap For OMS

Wednesday, September 9, 2015

Welcome to the new world
Microsoft is reinventing itself. It’s in a huge transition from a company previously focused on ‘devices & services’ to an enterprise geared to the ‘mobile-first, cloud-first’ mantra. Even though Microsoft has brought marketing to a whole new level, in this particular case there is little marketing mumbo jumbo, if any at all.

The investments and speed of development in Microsoft’s cloud offering are unprecedented, all across the ‘Azure board’. New features are added on an almost weekly basis to the whole Azure portfolio. Some are kept low key (like the Clutter feature in Office 365), whereas others get more exposure.

The fact is that Azure is an ever-evolving cloud environment gaining more traction by the day. Microsoft’s whole workforce has shifted direction and is working in unison on the development of the cloud.

OMS has the same speed of development
OMS is no exception here. Quite recently Microsoft introduced a new feature in OMS: near real-time performance data collection. At first glance it might seem like a minor step, but after having tested it thoroughly I can say it’s a giant leap for OMS.

I’ll tell you why.

NRT & supposed impact
The sample interval for near real-time (NRT) performance data collection by OMS is set to 10 seconds by default, which makes sense since the name of the new feature implies ‘near real-time’.

Being someone with a SCOM background, I wondered about the footprint of it all. How about memory and CPU load? How about network load? In other words, what kind of footprint does OMS with NRT performance data collection have on any given server?

Time to put it to the test.

The test environment
Any test is only as good as the environment used for it, together with the applied test scenario. So I decided to deploy two brand new VMs in my own test lab, identical to each other. I also deployed a new OMS Workspace to make sure the test wasn’t ‘contaminated’ with old settings I had tested in my other OMS workspaces.

Items:

2 identical Windows 2012 R2 VMs (3 GB RAM, 1 vCPU, 1 logical drive C:\, workgroup member), NRT01 and NRT02;
Both VMs placed on the same Hyper-V host, using the same storage, compute and network resources;
One new OMS workspace, named NRTLab.

Item configuration:

Server NRT01 got the Windows Agent, downloadable from the OMS workspace NRTLab (the Windows Agent is the Microsoft Monitoring Agent (MMA) with OMS Workspace connection capabilities);
The Windows Agent on NRT01 connects ONLY to the NRTLab OMS Workspace;
NRTLab isn’t connected to any SCOM 2012 Management Group or any Azure Storage Account;
NRTLab Solutions configuration: Log Search and System Update Assessment;
NRTLab Logs configuration. Log Name: Operations Manager (Error & Warning);
NRTLab NRT Performance Data Collection settings: OMS default with the default sample interval;
NRTLab is happy and reports a 100% complete configuration;
And yes, NRT01 is connected properly to NRTLab and data is coming in.

Now I’ve got enough resources to run a good test. How about a valid test scenario?

Test scenario
Say what? NRT02 has NO Windows Agent? Yes, that’s correct! This server has only ONE purpose: it’s a reference server!

Now I can see what kind of CPU, RAM and network load this server has compared to NRT01 running the Windows Agent reporting to NRTLab while collecting NRT performance data, OpsMgr event log entries (errors & warnings) & checking whether the server is missing out on any crucial updates (performed by the System Update Assessment Solution).

On both servers I defined a new Data Collector Set in Performance Monitor, in order to collect specific performance data:

NRT01

Logical Disk > Current Disk Queue Length (C:)
Memory > Available MBytes
Network Adapter > Bytes Total/Sec
Network Adapter > Current Bandwidth
Process > % Processor Time (HealthService.exe & MonitoringHost.exe)
Process > IO Data Operations/sec (HealthService.exe & MonitoringHost.exe)
Process > Working Set – Private (HealthService.exe & MonitoringHost.exe)
Processor Information > % Processor Time

NRT02

Logical Disk > Current Disk Queue Length (C:)
Memory > Available MBytes
Network Adapter > Bytes Total/Sec
Network Adapter > Current Bandwidth
Process > % Processor Time (_Total)
Process > IO Data Operations/sec (_Total)
Process > Working Set – Private (_Total)
Processor Information > % Processor Time
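For anyone who wants to rebuild these Data Collector Sets, the selections above can be written out as Performance Monitor counter paths. This is only a sketch: the exact object spellings (`LogicalDisk`, `Working Set - Private`), the `*` wildcard for the network adapter, the `_Total` instance for Processor Information and the process instance names without `.exe` are my assumptions, so check them against what Performance Monitor shows on your own server.

```python
# Counter selections from the two lists above, written as counter paths.
# Object spellings and instance names are assumptions; verify them locally.
AGENT_PROCESSES = ["HealthService", "MonitoringHost"]  # instance names omit ".exe"

COMMON_PATHS = [
    "\\LogicalDisk(C:)\\Current Disk Queue Length",
    "\\Memory\\Available MBytes",
    "\\Network Adapter(*)\\Bytes Total/sec",
    "\\Network Adapter(*)\\Current Bandwidth",
    "\\Processor Information(_Total)\\% Processor Time",
]

PROCESS_COUNTERS = [
    "% Processor Time",
    "IO Data Operations/sec",
    "Working Set - Private",
]


def counter_paths(process_instances):
    """Return the counter paths for one server: shared counters plus per-process ones."""
    paths = list(COMMON_PATHS)
    for counter in PROCESS_COUNTERS:
        for instance in process_instances:
            paths.append("\\Process({})\\{}".format(instance, counter))
    return paths


nrt01_paths = counter_paths(AGENT_PROCESSES)  # server running the Windows Agent
nrt02_paths = counter_paths(["_Total"])       # reference server without the agent

for path in nrt01_paths:
    print(path)
```

The only difference between the two servers is the Process instance: the agent processes on NRT01, `_Total` on NRT02.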

I had these Data Collector Sets running for about 24 hours. No programs were opened and all MMCs were closed (Performance Monitor included!), so these servers were simply running without being used, except for their own running processes and services.

I ran these Data Collector Sets multiple times in order to establish a baseline. The results in this posting are based on the last run, from 20:43 9/7/2015 until 21:21 9/8/2015.

The results
And I must say the results are the very reason I ran the Data Collector Sets multiple times: they are so impressive that I wanted to be sure.

Seeing is believing, so let’s take a look at the Report View of both Data Collector Sets:

NRT01

NRT02

As you can see, the memory footprint of the Windows Agent is really small. The counter Process / Working Set – Private shows the amount of memory in use by both components of the Windows Agent: HealthService.exe (5.2 MB) and MonitoringHost.exe (11.8 MB).

This means that together they (the Windows Agent, in effect) use 17 MB of RAM! I don’t know about you, but to me that’s really small.

Looking at the CPU footprint you can see it’s small as well. The Windows Agent consumes about 0.151 % Processor Time (% Processor Time NRT01 – % Processor Time NRT02).

When looking at process level, we see that HealthService.exe consumes 0.014 % Processor Time and MonitoringHost.exe 0.034. Together even less than 0.05 (0.048)!

And the load on the network (Bytes Total/sec) is also very low: 413.469 Bytes Total/sec (0.00039 Megabyte!) for the Windows Agent (Bytes Total/sec NRT01 – Bytes Total/sec NRT02).
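The arithmetic behind these numbers is easy to redo. The sketch below uses only the figures quoted in this post and treats 1 MB as 1024 × 1024 bytes, which is what the 0.00039 figure implies. The per-day network figure is my own extrapolation from the measured average and assumes the rate stays constant over the day.

```python
# Figures quoted in this post (averages over the roughly 24-hour run).
working_set_mb = {"HealthService.exe": 5.2, "MonitoringHost.exe": 11.8}
cpu_percent = {"HealthService.exe": 0.014, "MonitoringHost.exe": 0.034}
network_delta_bytes_per_sec = 413.469  # Bytes Total/sec NRT01 minus NRT02

BYTES_PER_MB = 1024 * 1024
SECONDS_PER_DAY = 24 * 60 * 60

total_ram_mb = sum(working_set_mb.values())
total_cpu_percent = sum(cpu_percent.values())
network_mb_per_sec = network_delta_bytes_per_sec / BYTES_PER_MB
# Extrapolation (assumption): the average rate holds for a full day.
network_mb_per_day = network_mb_per_sec * SECONDS_PER_DAY

print(f"Agent RAM (both processes): {total_ram_mb:.1f} MB")
print(f"Agent CPU (both processes): {total_cpu_percent:.3f} % Processor Time")
print(f"Network delta: {network_mb_per_sec:.5f} MB/sec")
print(f"Network delta per day: {network_mb_per_day:.1f} MB")
```

Note that the network delta covers everything the agent sends during the test (NRT performance data, event log entries and update assessment data), not NRT data alone.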

But how about the network load for NRT Performance data collection only? The OpsMgr Engineering Team states: ‘… for a particular computer, a given counter instance (e.g., Processor(_Total)\% Processor Time) with 10 second sample interval will send ~1MB per day (~1MB/day/counter instance)…’.

I contacted Microsoft about this and they told me this is UNCOMPRESSED data! Since it gets compressed, these values are even lower! And they assured me this has been thoroughly tested and triple-checked.
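To put the quoted figure in perspective, here is a small sketch that turns ‘~1MB/day/counter instance’ into bytes per 10-second sample and scales it to several counter instances on one server. The count of 8 counter instances is purely hypothetical and not taken from my lab. The sketch again treats 1 MB as 1024 × 1024 bytes, and everything here is uncompressed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_INTERVAL_S = 10        # OMS default NRT sample interval
QUOTED_MB_PER_DAY = 1.0       # ~1 MB/day/counter instance, uncompressed (quoted above)
BYTES_PER_MB = 1024 * 1024


def samples_per_day(interval_s=SAMPLE_INTERVAL_S):
    """Number of samples one counter instance produces per day."""
    return SECONDS_PER_DAY // interval_s


def uncompressed_bytes_per_sample():
    """Average uncompressed payload of one sample, derived from the quoted daily figure."""
    return QUOTED_MB_PER_DAY * BYTES_PER_MB / samples_per_day()


def uncompressed_mb_per_day(counter_instances):
    """Daily uncompressed volume for one server with the given number of counter instances."""
    return counter_instances * QUOTED_MB_PER_DAY


hypothetical_counter_instances = 8  # illustration only, not a measured value

print(f"Samples per day: {samples_per_day()}")
print(f"Bytes per sample (uncompressed): {uncompressed_bytes_per_sample():.1f}")
print(f"MB per day for {hypothetical_counter_instances} counter instances: "
      f"{uncompressed_mb_per_day(hypothetical_counter_instances):.1f}")
```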

Recap
I am amazed! Never did I expect to see such a SMALL footprint of the OMS Agent (AKA Windows Agent) on any given monitored server.

Since OMS uses a cloud-based, state-of-the-art back end for data processing, it doesn’t have the potential bottlenecks we may see with on-prem SCOM installations. So data comes in, is processed very fast and is shown in your OMS workspace in a matter of seconds. Now that’s NEAR REAL-TIME!!!

Since the footprint of OMS is so small I see no reason NOT to use OMS on any important server. Connect the Windows Agent with an on-prem SCOM environment and you’ve got the best of both worlds: on-prem SCOM and state of the art (and ever evolving) OMS in the Cloud!

Check it yourself
Both Performance Monitor Reports used for this posting can be downloaded from my OneDrive and opened in Performance Monitor, so you can see it for yourself: NRT01 and NRT02.

But even better, start using OMS today and see what it can do for your environment.

2 comments:

Wilson328 said...: What about the amount of traffic that is generated by this near real-time monitoring? I am concerned that in large SCOM environments with thousands of agents sending data to OMS you could conceivably saturate your network pipe to Azure/OMS.; September 10, 2015 at 7:38 PM
Marnix Wolf said...: Hello Wilson328.

Thanks for your comment and good question. Let me do my best to answer it. In large SCOM environments you basically have two scenarios: either all those systems are concentrated in one huge data center (with multiple high speed WAN/internet connections), or they are dispersed over multiple geographical locations, each location having multiple high speed WAN/internet connections.

On that account, I don't see any issue since there will be plenty of bandwidth and WAN/internet connections available for the agents to upload the collected NRT performance data.

About the amount of uploaded data when using NRT performance data collection I dare to say it won't be much at all. Yes, there is 1 MB of data (uncompressed!) per day per given NRT performance counter instance.

But... this is UNCOMPRESSED. Since it's data like numbers and some characters, this data is highly compressible. So let's assume compression reduces the size by 60% (still on the very safe side of things, I tell you). Of this 1 MB, about 0.4 MB or 410 KB then remains after compression.

Now another thing comes into play. This is 410 KB per server per given NRT performance counter PER DAY. But NRT data collection happens once per 10 SECONDS.

Let's break down that 410 KB per day to one increment of 10 seconds. One day (24 hrs) consists of 86,400 seconds. Since NRT data is collected once per 10 seconds, NRT data is collected AND uploaded to OMS 86,400/10 = 8,640 times per day.

Since we've got 410 KB of compressed data per server per given NRT performance counter PER DAY, we must divide that by the number of times an NRT performance data collection AND upload takes place: 410/8,640 = 0.04745 KB, or 48.6 BYTES.

So we're talking here about 48.6 BYTES of compressed data per server per given NRT performance counter PER 10 seconds.
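The breakdown above can be sketched in a few lines of Python, which also makes it easy to scale the result to a whole fleet. The 60% size reduction is my assumption from above, not a measured value. The fleet size and counter count in the example call are hypothetical, and the result covers NRT performance data payload only, without any protocol overhead.

```python
SECONDS_PER_DAY = 86_400
SAMPLE_INTERVAL_S = 10
COMPRESSED_KB_PER_DAY = 410   # 1 MB reduced by an ASSUMED 60%, rounded as above

uploads_per_day = SECONDS_PER_DAY // SAMPLE_INTERVAL_S
bytes_per_upload = COMPRESSED_KB_PER_DAY * 1024 / uploads_per_day


def fleet_mbit_per_second(servers, counters_per_server):
    """Average upstream rate in Mbit/s (1 Mbit = 1,000,000 bits) for NRT data only."""
    bytes_per_second = bytes_per_upload / SAMPLE_INTERVAL_S
    total_bits_per_second = servers * counters_per_server * bytes_per_second * 8
    return total_bits_per_second / 1_000_000


print(f"Uploads per day: {uploads_per_day}")
print(f"Bytes per upload per counter: {bytes_per_upload:.1f}")
# Hypothetical fleet: 5,000 servers with 10 NRT counter instances each.
print(f"Fleet average: {fleet_mbit_per_second(5000, 10):.2f} Mbit/s")
```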

And that's a negligible amount of data. Since NRT performance data collection on many servers won't happen at the exact same moment, I seriously doubt it will have a negative impact on the available bandwidth of the internet/WAN connections.

Cheers,
Marnix; September 13, 2015 at 10:47 AM