Tuesday, March 8, 2011

EventID 31552: Failed to store data in the Data Warehouse. Exception 'SqlException': A network-related or instance-specific error occurred while establishing a connection to SQL Server

Bumped into this issue in a SCOM R2 environment of a customer of mine. As a result, the perfmon Reports were missing days of data.

Have had a similar issue before and after having cracked it, the data came back.

Some background information first. This is ‘normal’ behavior because when data is inserted into the Data Warehouse it goes through processes like these:

  1. The data is written to the staging tables;
  2. The data is processed;
  3. The data is inserted into raw tables;
  4. The data is moved to aggregated tables.

Basically, when the Perfmon Reports are not showing all data, most of the time the data is present in the Data Warehouse but has not been processed completely. It is stuck somewhere. Once the cause is removed, the Reports will function again. Of course, it takes some time for the data to be processed (depending on the total amount of data in the DW), but in the end the Reports turn out just fine again.
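To make the "stuck but not lost" idea concrete, here is a toy sketch of the four steps above. It is only a model: the class `ToyWarehouse`, its attribute names, and the sample values are made up for illustration and have nothing to do with the real Data Warehouse schema or its stored procedures.

```python
from collections import deque


class ToyWarehouse:
    """Toy model of the staging -> raw -> aggregated flow (NOT the real DW schema)."""

    def __init__(self):
        self.staging = deque()  # step 1: incoming samples wait here
        self.raw = []           # step 3: processed samples
        self.aggregated = {}    # step 4: per-day totals a report would read
        self.blocked = False    # hypothetical flag: something is jamming processing

    def write(self, day, value):
        """Step 1: data always lands in staging, even while processing is blocked."""
        self.staging.append((day, value))

    def process(self):
        """Steps 2-4: drain staging into raw and aggregated, unless blocked."""
        if self.blocked:
            return 0
        moved = 0
        while self.staging:
            day, value = self.staging.popleft()
            self.raw.append((day, value))
            self.aggregated[day] = self.aggregated.get(day, 0) + value
            moved += 1
        return moved


dw = ToyWarehouse()
dw.blocked = True            # the cause of the problem is present
dw.write("day 1", 10)        # made-up sample values
dw.write("day 2", 20)
dw.process()
print(dw.aggregated)         # the report sees nothing, yet staging holds the data

dw.blocked = False           # the cause is removed
dw.process()
print(dw.aggregated)         # the backlog is processed and the report fills in
```

The point of the sketch is only that removing the blockage lets the backlog flow through; in a real DW the catch-up takes time in proportion to the amount of waiting data.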

Let’s get back to this posting now.

Besides the EventID mentioned in the title of this posting, the RMS showed many of these EventIDs as well:

  • 31557
    • Failed to obtain synchronization process state information from Data Warehouse database,
  • 31561
    • Failed to enumerate (discover) Data Warehouse objects and relationships among them,
  • 31569
    • Report deployment process failed to request management pack list from Data Warehouse,
  • 31552
    • Failed to store data in the Data Warehouse.

Especially the last EventID contained good additional information: ‘Exception 'SqlException': A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections’.
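A quick way to test the "server was not found or was not accessible" part from the RMS side is a plain TCP connection attempt. The sketch below is a minimal example, not something SCOM itself does. It assumes the SQL instance listens on TCP 1433, the default for a default instance (named instances often use another port), and the helper name `can_reach` and the host name in the comment are placeholders.

```python
import socket


def can_reach(host, port=1433, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Port 1433 is an assumption (default SQL Server instance); adjust as needed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (placeholder host name):
# can_reach("sql01.example.local")
```

A `True` result only says the port accepts connections. It says nothing about logins or about the health of the instance behind it, so the event logs on the SQL server itself still need a look.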

This MG had been running for a long time without any issues. Only an update of the DNS MP had taken place, nothing else. Since the last EventID pointed the finger at the SQL server, it was time to check this server out. This is what the System event log of the SQL server showed:

After talking with the engineers running these servers, it turned out that the VMs weren’t configured correctly. The VM for the SQL server ran an iSCSI initiator, which is not OK. So they changed it in VMware itself to a raw LUN-mapping and assigned it to the VM hosting the SQL Server instance.

After that most of the errors in the OpsMgr event log were gone. Only EventID 31552 remained, coming from the Exchange 2010 MP. No more errors about the network:

Danielle Grandini, a fellow MVP, already blogged about this issue, to be found here.

However, I am not the type of person to modify SPs by hand. So I skipped this option and used another approach: I removed the culprit MP (the Exchange 2010 MP) and afterwards deleted the whole Dataset causing all these issues (Microsoft.Exchange.2010.Reports.Dataset.Availability) from the Data Warehouse, using a certain SP. Got this one from CSS a while back. I will NOT share it.

Afterwards the event log of the RMS turned back to normal again. No more issues. No warnings, no errors. And believe it or not, the Reports came back to life again! So all perf data was to be found again. How nice and neat!

I re-imported only the Exchange 2010 MP and left out the Report MP, hoping that the Data Set and rule causing this issue live in the Reporting MP. Will see that within a few days I guess. Can’t wait until a new Exchange 2010 MP ships, WITHOUT this issue…

1 comment:

Unknown said...

Would you know if Scom 2007 runs scripts or jobs during the night ?
We have about 60 event 31551 and 31552 precisely from 00:05 to 07:78 every night. No backup, no antivirus conflict. SQL is 2008 r2 sp1. We had this also with SQL2k5. started around jan 14. we are Scom 2k7 cu7. During daytime all is fine.