Monday, May 18, 2015

NiCE Free Log File MP & Regex & PowerShell: Enabling SCOM 2 Count LOB Crashes

Issue
Suppose you’ve got a line-of-business (LOB) application which logs everything into a log file (verbose logging). Among those log entries are also the crashes of the LOB itself.

And now some people in your organization who are responsible for the overall performance and availability of that LOB have heard about SCOM. Even though SCOM already monitors that LOB on many levels (Windows Server OS, SQL databases, SQL reporting, IIS-related web sites and services, and specific services and processes), these same people would LOVE to have a View AND a Report in SCOM showing the total number of LOB crashes for the previous day.

Why? SCOM is sold to them as ‘the single pane of glass’. That’s why!

So now SCOM has to collect that data, plot it in a performance graph in the SCOM Console and pipe that same data into a Report. In itself nothing strange, since SCOM can plot anything as long as it’s a performance counter.

The challenges…
Okay. Guess by now you already spotted the first challenge? But there are more…

  1. So here is the first and most obvious challenge: ‘…as long as it’s a performance counter…’. But those LOB crashes aren’t performance counters at all. Each app crash is written as a separate entry to the log file.
  2. Somehow ALL the app crashes of the previous day for the LOB have to be collected and counted in order to get a total.
  3. Sometimes (not always…) the default log file is broken off, saved under another name and then a new log file is started. So the LOB app crash information of the previous day can be found in either one of those log files…
  4. And last but not least, it would be TOTALLY AWESOME if this solution were portable to other LOBs as well. So not too many customizations please.

The different components required to address the challenges
There are multiple ways to address these challenges. One would be to author a MP which creates a data source based on that log file, gets the proper information, puts it into a property bag and sends it to SCOM, which processes it as a performance counter.

However, I am anything but a MP author. So that’s a no go area for me. I know the theory but lack the serious skills to get it working in a proper manner. Therefore I required solutions already available and after some discussions with some peers this is the list of ‘ingredients’ I got in order to address the various challenges:

  1. NiCE Free Log File MP
    With this MP one can map entries in log files to performance counters, used by SCOM. Some good regex is required in order to get the job done.

  2. PowerShell
    With PowerShell one can examine files (log files as well!) and check AND count certain strings, even certain combinations. This collected data can be piped into another file when required.

  3. PowerShell (again, this isn’t a typo)
    When there are multiple files to be checked, as described in challenge 3, some good PS scripting can solve that as well.

  4. Using freely available software & solutions
    When you use Items 1 to 3 AND document it all properly, you’ve got a solution which is portable to other LOBs as well. Awesome!

Now we’re in business. It’s time to build the solution. But before I start I want to introduce my test environment.

Meet my test environment
For this posting I built myself a new LOB environment, or better a simulation of it. There is no LOB at all in my test environment, only the stuff which is required in order to make this test work.

  • LOB verbose log folder on my SCOM test server: C:\Program Files\Business Critical App\Logs;
  • REGULAR LOB verbose log file: LOB_Log_Verbose.log;
  • Discontinued LOB verbose log file: LOB_Log_Verbose.log.YYYYMMDD (e.g.: LOB_Log_Verbose.log.20150517).

    This is what it looks like:
    image

    Time to start rocking!
    So now we know the challenges and how to address them. It’s time to dive into the specifics.

    1. Download the FREE NiCE Log File MP and import it into your SCOM environment;

    2. As stated before the default log file (LOB_Log_Verbose.log) can be broken off, and saved into a special format (LOB_Log_Verbose.log.YYYYMMDD). And the required app crashes can be found in one of those files. These issues must be solved with PowerShell.

    3. When using the NiCE Free Log File MP one can also use regex for the name of the log file. However, the log file can become pretty big, requiring a lot of time to go through it all. Also, having to use regex to count ALL app crashes of the previous day can be pretty daunting when you’re not that familiar with regex.

      So why not use PowerShell here as well? Meaning PowerShell looks for BOTH log files mentioned before, counts the total app crashes of the LOB for the previous day AND pipes that information into a NEW log file with a far easier format, like this:
      image

      Let’s name this new log file AppCrashesCountPerDay.log and use that log file as a target for the NiCE Log File MP. Now it’s far easier to manage things on the SCOM side since we’ve got a simpler log file to monitor, which is kept outside the LOB verbose log itself and won’t be renamed or removed. Based on that we can use absolute names and paths, making it even easier on the SCOM side of things.

      This PS script will take care of it all. Schedule it with Task Scheduler on the LOB server to run once a day. The very same PS script can be downloaded from my OneDrive.
      #############################################################
      # Script to count the total amount of LOB crashes of the previous day
      # This number is written to the log file 'AppCrashesCountPerDay.log'
      # Written by Marnix Wolf
      #############################################################

      # Set variables. Log lines are matched on the date as YYYY-MM-DD; the rotated file is named 'LOB_Log_Verbose.log.YYYYMMDD'
      $Yesterday = (Get-Date).AddDays(-1)
      $PreviousDay = "{0:yyyy-MM-dd}" -f $Yesterday
      $LOBLog = "LOB_Log_Verbose.log.{0:yyyyMMdd}" -f $Yesterday
      $LOBLogCheck = "C:\Program Files\Business Critical App\Logs\$LOBLog"

      # Test whether the rotated LOB_Log_Verbose.log file of the previous day exists. If so, count that day's LOB crashes in it; otherwise count them in the regular log file
      # @() keeps .Count reliable when only one line (or none) matches
      If (Test-Path $LOBLogCheck){
      $TotalLOBErrorCount = @(Select-String -Path "C:\Program Files\Business Critical App\Logs\$LOBLog" -Pattern $PreviousDay | Select-String -CaseSensitive -Pattern AppCrash).Count
      }Else{
      $TotalLOBErrorCount = @(Select-String -Path "C:\Program Files\Business Critical App\Logs\LOB_Log_Verbose.log" -Pattern $PreviousDay | Select-String -CaseSensitive -Pattern AppCrash).Count
      }

      # Write output of correct log file to AppCrashesCountPerDay.log in specified format
      Out-File -filepath "C:\Program Files\Business Critical App\Logs\AppCrashesCountPerDay.log" -InputObject "$PreviousDay TotalAppCrashes = $TotalLOBErrorCount TimesTotal" -Encoding ASCII -Width 50 -Append

      # End of script
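
      To make the script easier to reuse for other LOBs (challenge 4), the same logic could be wrapped in a function with the folder, log name and crash text as parameters. This is only a minimal sketch: the function name Get-PreviousDayCrashCount, its parameters and the assumed date formats (YYYYMMDD in the rotated file name, YYYY-MM-DD inside the log lines) are illustrative and not part of the original script.

      ```powershell
      # Hypothetical helper: counts the lines that contain both the given day's date
      # and a crash marker, preferring the rotated log file when it exists.
      function Get-PreviousDayCrashCount {
          param(
              [Parameter(Mandatory = $true)][string]$LogFolder,   # folder holding the verbose logs
              [Parameter(Mandatory = $true)][string]$LogName,     # e.g. 'LOB_Log_Verbose.log'
              [string]$CrashMarker = 'AppCrash',                  # case-sensitive text marking a crash
              [string]$SuffixFormat = 'yyyyMMdd',                 # assumed date suffix of the rotated file
              [datetime]$Day = (Get-Date).AddDays(-1)             # day to count, defaults to yesterday
          )

          $dayText = $Day.ToString('yyyy-MM-dd')   # assumed date format inside the log lines
          $rotated = Join-Path $LogFolder ("{0}.{1}" -f $LogName, $Day.ToString($SuffixFormat))
          $regular = Join-Path $LogFolder $LogName

          # Prefer the rotated file; fall back to the regular log file.
          $source = if (Test-Path -LiteralPath $rotated) { $rotated } else { $regular }
          if (-not (Test-Path -LiteralPath $source)) { return 0 }   # no log file at all

          # Keep only the lines holding both the date and the crash marker.
          $hits = @(Get-Content -LiteralPath $source |
              Where-Object { $_.Contains($dayText) -and $_.Contains($CrashMarker) })
          return $hits.Count
      }
      ```

      For the test environment in this posting it would be called with -LogFolder 'C:\Program Files\Business Critical App\Logs' and -LogName 'LOB_Log_Verbose.log', after which the returned number can be written to AppCrashesCountPerDay.log in the same way as in the script above.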

    4. Let’s build the special Rule, based on the NiCE Log File MP: SCOM Console > Authoring > Management Pack Objects > Rules >  right click > context menu > Create a new Rule;


    5. NiCE Log Files > Performance Rule > Advanced > Performance Rule (Advanced)
      image
      > Next


    6. Don’t enable the Rule! We’ll enable it later through an Override targeted against a Group which is explicitly populated. Target the Rule against Windows Server or a less generic Class:
      image
      > Next


    7. Skip the Preprocessing Settings screen:
      image
      > Next


    8. As you can see, thanks to PowerShell I can use absolute names and paths. Awesome! Makes it easier to troubleshoot:
      image
      > Next


    9. Save yourself a lot of pain and effort. Just hit the Regex testing tool button. This tool makes life much easier. A BIG thanks to NiCE for this tool.
      image


    10. In the box Logfile Line paste a log file entry the NiCE Log File Rule must look for. In the box Filter Regex Pattern enter the regex required to extract the information. Use the key with the > sign for more help to build your regex, or go to https://regex101.com/ for some online help/testing.
      image
      The Sample Output (Xml) screen is KEY here for your success!!! As you can see there is an entry which starts with <TotalAppCrashes>. The next entry is <Capture> </Capture>. As you can see in the example, the total app crashes of the previous day (2 in total) is captured!

      This tells you the regex is okay and the output is working! Please note the TotalAppCrashes entry since in the next screen it will be used as the performance counter we’re looking for.

      > OK.
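
      The regex itself is only visible in the screenshot, so here is a rough way to try out a candidate pattern in PowerShell before pasting it into the tool. The pattern below is an assumption, not the one from the screenshot: it uses a named group called TotalAppCrashes to capture the number, matching the entry name in the Sample Output. PowerShell uses .NET regex, which may differ in details from what the NiCE tool accepts, so the NiCE Regex testing tool has the final say.

      ```powershell
      # Sample line in the format written to AppCrashesCountPerDay.log.
      $line = '2015-05-17 TotalAppCrashes = 2 TimesTotal'

      # Assumed pattern: the named group 'TotalAppCrashes' captures the number.
      $pattern = '^\d{4}-\d{2}-\d{2} TotalAppCrashes = (?<TotalAppCrashes>\d+) TimesTotal$'

      if ($line -match $pattern) {
          # Cast to an integer, since a performance counter value must be numeric.
          $value = [int]$Matches['TotalAppCrashes']
          "Captured value: $value"
      } else {
          'No match: check the pattern against the log line format.'
      }
      ```

      For the sample line this reports a captured value of 2, the same number the Sample Output (Xml) screen should show.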


    11. Now you’re back in the previous screen. This time the regex is there, tested for its proper functionality and output.
      image
      > Next


    12. Here you can go nuts and enter any name you like for the first THREE fields. But keep it professional please. The MOST CRUCIAL field is the last one, VALUE. Use this context: $Data/RegexMatch/[confirmed output in Step 10].

      So in this case it becomes: $Data/RegexMatch/TotalAppCrashes$:
      image
      > Create.


    13. Now the Rule is built. Create a Group and explicitly add the server where this Rule must run. Set an Override on the newly created Rule for this new Group and wait some time (a few minutes).


    14. Create a new Performance View in SCOM using this new Rule:
      image


    15. MANUALLY add a new line in the correct syntax to the log file AppCrashesCountPerDay.log and save the modification. Repeat this a few times, leaving some time between the new entries. Don’t forget to save the file after every modification! You do this in order to test whether the new Rule you just made works correctly, like this:
      image

      When all is okay, this will be shown in the SCOM Console:
      image
      Of course, normally you’ll get ONE data point per day. But this is the test, remember?

      And when things don’t work, please check the OpsMgr event log of the related server. Chances are it will contain events logged by the NiCE Log File MP explaining why certain things don’t work (wrong path, wrong log file name and so on).
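
      The manual test entries from this step can also be appended from PowerShell, which avoids typos in the line format. A minimal sketch: the function name Add-TestCrashEntry, its parameters and the test values are made up for illustration, and the line format simply mirrors the one the daily script writes.

      ```powershell
      # Hypothetical helper: appends test lines in the same format the daily script
      # writes, pausing between entries so each one is picked up separately.
      function Add-TestCrashEntry {
          param(
              [Parameter(Mandatory = $true)][string]$LogPath,   # full path of AppCrashesCountPerDay.log
              [Parameter(Mandatory = $true)][int[]]$Counts,     # made-up crash totals, one per test line
              [int]$PauseSeconds = 60                           # wait time between the entries
          )

          # Start far enough back that the last entry is dated yesterday.
          $day = (Get-Date).AddDays(-$Counts.Count)
          foreach ($count in $Counts) {
              $line = "{0:yyyy-MM-dd} TotalAppCrashes = {1} TimesTotal" -f $day, $count
              # Same encoding as the daily script, so the file stays consistent.
              Add-Content -LiteralPath $LogPath -Value $line -Encoding ASCII
              $day = $day.AddDays(1)
              if ($PauseSeconds -gt 0) { Start-Sleep -Seconds $PauseSeconds }
          }
      }
      ```

      Calling it with -LogPath pointing at AppCrashesCountPerDay.log and, for example, -Counts 2,5,3 appends three test lines with a minute between them. Remove the test lines again afterwards so they don’t end up in the Report.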


    16. When the performance graph works, it’s simple to create a Report: Reporting > Microsoft Generic Report Library > Performance. Use the newly made Rule as a performance Rule, targeted against the correct server. Use this posting for more details about how to make such a report.
      image

      And:
      image
      Please keep in mind that the data in the Report comes from the DW, where it takes a few hours to get aggregated. So you’ll see the data up to two hours ago.

    Recap
    As you can see, the NiCE Free Log File MP rocks! And with some easy PS you can make life even easier for yourself and SCOM. Personally, the manual provided by NiCE for this MP helped me a lot to get things up & running. And please note that the SCREENSHOTS in that same document can be of great help as well.

