Thoughts on Azure, OMS & SCOM: What’s Hammering My SCOM 2012x Database?!

Issue
The health of the operational SCOM database is crucial for a smoothly running SCOM 2012x Management Group. For instance, when too much data comes in and this database grows out of control, your SCOM 2012x MG will soon come to a standstill.

So how do you recognize situations like these? Of course, you’ll notice slower performance of the SCOM Console, for instance. You’ll also see a Warning Alert raised by the Monitor Operational Database Space Free (%) when the percentage of free space falls below 40%, and a Critical Alert when it falls below 20%.
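The two thresholds mentioned above can be sketched in a few lines of Python. This is only an illustration of the 40% / 20% logic as described in this posting, not the Monitor's actual implementation; the function name and its parameters are made up for this example.

```python
def classify_free_space(percent_free, warning_below=40.0, critical_below=20.0):
    """Map a free-space percentage to a health state.

    Assumption: the thresholds are strict ("falls below"), so exactly
    40% is still Healthy and exactly 20% is still Warning.
    """
    if percent_free < critical_below:
        return "Critical"
    if percent_free < warning_below:
        return "Warning"
    return "Healthy"


if __name__ == "__main__":
    # Hypothetical sample values, purely for illustration.
    for sample in (55.0, 35.0, 12.5):
        print(sample, classify_free_space(sample))
```

The point of the sketch: by the time the state is no longer Healthy, the database has already lost more than 60% of its free space.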

But still, when this happens it’s already a bit too late. Therefore it makes sense to run certain reports once per week, just to stay in control of your SCOM 2012x environment. This posting describes the reports which have helped me many times before.

SCOM 2012x Reports & the Community
Yes, out of the box SCOM 2012x delivers some good reports for seeing what’s happening under the hood of your SCOM MG. Nonetheless, some additional help is welcome, and the Community delivers it.

In the days of SCOM 2007x some people built the SCC Health Check Reports MP, containing 25+ Reports which deliver a good and deep insight into the state of the nuts and bolts of your SCOM environment. And even though this MP isn’t updated for SCOM 2012x, it runs just fine up to SCOM 2012 R2 UR#4 and delivers tons of good information. So my advice: download this MP, create the additional Data Source (as described in the related MP Guide), import the MP and enjoy the magic.

SCOM 2012x Reports
These are the reports I use to gain a good insight into the health state of the SCOM 2012x environment, all found under the reporting node System Center Core Monitoring Reports:

Data Volume by Management Pack
Shows you exactly which MP has the biggest data volume impact on your SCOM environment.

This Report allows you to select different Data Types, like Discovery, Alerts, Performance, Events & State Change. Also you can select one or more MPs. By default ALL MPs are selected. Making a small selection can be tedious since the selection box is way too small to be really functional. You can also choose the aggregation (daily, weekly and so on) and whether you want to see a top 10 or less or more.

Usage: When you’re not investigating anything specific and just running your weekly check OR you think something isn’t okay, simply leave all the selections as they are. As a time period choose at least 7 days so you get a good average slice of it all.

When drilling deeper, for example when you suspect an issue with performance data, select only the Data Type Performance and make the top 10 bigger, like 15 or even 20. This way you can compare numbers 1 to 5 more easily with numbers 6 to 15/20 and gain a better understanding of the ratio. One or more single numbers don’t say that much, but when compared to a longer list it’s far easier to find a ratio.
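To make the idea of 'finding a ratio' a bit more concrete, here is a small Python sketch. It assumes you have copied the counts per MP out of the report into a dictionary; the MP names and numbers below are invented, and the function name is just something I picked for this example.

```python
def rank_with_share(counts, top_n=20):
    """Return the top_n entries as (name, count, share of total).

    The share is relative to the total of ALL supplied counts,
    not only the ones that make it into the top_n.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, count, (count / total) if total else 0.0)
        for name, count in ranked[:top_n]
    ]


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from a real report.
    sample = {"MP A": 600, "MP B": 300, "MP C": 100}
    for name, count, share in rank_with_share(sample):
        print(f"{name:<6} {count:>6} {share:6.1%}")
```

With a longer list, an MP that is responsible for a disproportionate share of the total stands out immediately, which is exactly why I make the top 10 bigger.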

Once the report has run, the ‘fun’ isn’t over. The table Counts by Management Pack doesn’t only contain useful information but also LINKS to other reports, allowing you to drill down into specific information. For instance, in this screenshot I have highlighted all clickable values in yellow. The value for the Veeam MP has a red circle on it since I want to gain a deeper insight into the performance counter data collected by this MP:

Clicking on that value gives me this detailed information about the performance data collected by this MP:

When I go back to the first report, using the blue arrow at the top left of the last report, I can also click on the MP itself in order to get more detailed information about the whole MP and its impact on the SCOM MG:

As you see, this Report delivers tons of good information about the impact of the MPs on your environment. And please, don’t forget to READ the Report Details before running this report, since they contain good information about what this report does, how to run it and how to interpret its output.

Data Volume by Workflow and Instance
Even though it sounds the same as the previous report, it drills deeper into the workflows themselves. Actually this is the SAME report you’ll see when you run the Data Volume by Management Pack report and click on one of the numbers in the table cells, as demonstrated before.

So this report comes in handy when you already suspect something fishy and want to get a deeper understanding of it.

[Screenshots of the report output. Labels shown: Discovery, Alerts, Performance, Events, State Change, Usage; Counts by Discovered Type, Rule or Monitor; Trend (arrow up, down or sideways); Class Instance.]

SCC Health Check Reports
With the two previous reports you already gained a deeper insight into your SCOM environment. However, additional information is welcome, so now it’s time to run a set of reports from the SCC Health Check Reports MP. With them you have a complete picture of the state of the operational database of your SCOM environment.

These are ‘single-click’ reports: no parameters are required, so when double clicked they just run and show you the information. In certain circumstances, however, I want to be a bit more in control, for example of the time frame selection. In cases like these I run the relevant queries directly against the operational database. In most cases, though, these reports deliver all the information I need and are awesome because they’re so easy to use.

Misc - Operational and Datawarehouse Usage Report (OM) - (DW)
This report shows you the status of BOTH SCOM databases. On a single page you get to see all the information you need, including the top 20 largest tables of both SCOM databases.
Performance - Performance Inserts Per Day (OM)
This report shows you how much performance data is inserted into the operational database per day. It enables you to pinpoint potential culprits which are hammering your SCOM environment. Suppose you imported a new MP and from then on SCOM has serious performance issues. Perhaps the new MP collects way too much performance data, causing these issues. This report will help you to find a possible cause.
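
The 'new MP' scenario above boils down to comparing the daily insert counts before and after the import. A minimal Python sketch of that comparison, assuming you have noted the per-day numbers from the report in a list (oldest first); the function name and the sample numbers are hypothetical.

```python
from statistics import mean


def insert_growth(daily_inserts, import_day_index):
    """Ratio of the average daily inserts after an MP import to the
    average before it. A value of 1.0 means no change.

    daily_inserts    -- per-day counts, oldest first
    import_day_index -- index of the first day on which the MP was present
    """
    before = daily_inserts[:import_day_index]
    after = daily_inserts[import_day_index:]
    if not before or not after:
        raise ValueError("need at least one day before and after the import")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("average before the import is zero, ratio undefined")
    return mean(after) / baseline


if __name__ == "__main__":
    # Hypothetical counts: three days before the import, two days after.
    print(insert_growth([100, 100, 100, 300, 300], import_day_index=3))
```

A ratio well above 1.0 right after an import date is a strong hint to look at the performance collection Rules of that MP first.
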
Performance - Top 20 Performance Insertions By Perf (OM)
This Report shows which Objects and related counters collect the most performance data and put it into the operational database. This helps you to pinpoint problematic systems/objects.
Performance - Top Performance Baseline Generating Rules (OM)
Yes, STTs (Self Tuning Thresholds) are still alive and kicking. But I don’t like them at ALL! Simply because they don’t work as intended. The idea was good: STTs would establish a baseline themselves, with a lower and an upper threshold. When the measured performance stays between those thresholds, all is well. When it falls outside them (below the lower or above the upper one), an Alert is raised.
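
The intended behaviour of such a baseline band can be sketched like this in Python. Mind you, this only illustrates the idea described above. It says nothing about how SCOM actually learns the baseline, and the function name and parameters are invented for this example.

```python
def check_against_baseline(value, lower, upper):
    """Compare a measured value with a learned lower/upper baseline.

    Assumption: values exactly on a threshold still count as
    'within baseline'; only values outside the band raise an Alert.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if value < lower:
        return "below baseline"
    if value > upper:
        return "above baseline"
    return "within baseline"


if __name__ == "__main__":
    # Hypothetical band of 20 to 80 for some performance counter.
    for sample in (10, 50, 95):
        print(sample, check_against_baseline(sample, lower=20, upper=80))
```

The check itself is trivial. The hard part is learning a sensible band, and that is exactly where STTs let me down.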

The first version of the Exchange Server 2003 MP was full of those STTs. The last version of that MP contained far fewer STTs AND the remaining ones were set to wholly different (fixed) values.

So this Report will show the Rules using that STT technology. My advice when you’re experiencing issues with too much performance data being collected: kill those STT Rules! For now I suspect that only the SQL MP uses STT Rules:
- MSSQL 2005: Collect Learning Data for SQL User Connections Monitor;
- MSSQL 2008: Collect Learning Data for SQL User Connections Monitor;
- MSSQL 2012: Collect Learning Data for SQL User Connections Monitor.

As you may have noticed, most of these Reports are aimed at getting an insight into all the performance data being collected. Simply because many times the main reason for your operational database being hammered is that too much performance data is coming in.

When you’ve found the culprit, additional tuning is required. Not by simply turning those Rules off, but by selecting other intervals and so on.

No performance data collection issue?
What if performance data collection isn’t an issue? Check out the status of the other SCOM nuts and bolts with these SCC Health Check Reports:

Alerts
Alerts - Top 20 Alerts By Alert Count (OM)
Alerts - Top 20 Alerts By Repeat Count (OM)

Config Churn
Config Churn - Discoveries Last 24 Hours (DW)
Config Churn - Modified Properties Details Last 24 Hours (DW)

Too many events coming in
Events - All Events Count By Last 7 Days (OM)
Events - Most Common Events by Number and Publisher (OM)
Events - Top 20 Computers Generating the Most Events (OM)

State data
State - Noisiest Monitors (OM)
State - Old State Changes Not Groomed (OM)
State - State Changes Per Day (OM)

As you can see, SCOM can be a challenge to master. But these Reports will help you to get on track. And don’t forget the community either.


Monday, November 24, 2014
