Monday, June 4, 2018

"There's nothing wrong with our SQL Servers or network", says the IT department confidently

By Steve Endow

Let me share a story about an all-too-common situation in the life of a Dynamics GP consultant or developer.

You develop an integration, or customization, or implement some software that talks to a SQL Server.  We do this stuff every day.  Normal, routine, common projects.  Just software talkin' to a SQL Server.  It's usually so reliable you don't think twice about it.

Then, after your integration or software has been running just fine for months, without a single issue, you encounter this error on Monday at 1am:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified)
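When this particular error appears, one quick sanity check--a minimal sketch, not part of the actual integration--is to test whether the SQL Server's TCP endpoint is reachable at all at the moment of failure. This assumes a default instance on port 1433; named instances typically use a dynamic port resolved through the SQL Browser service, so the port would differ.

```python
import socket
from datetime import datetime

def sql_port_reachable(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as ex:
        # Log the timestamp and the exact failure so it can later be
        # correlated with maintenance windows or network events.
        print(f"{datetime.now():%Y-%m-%d %H:%M:%S} unreachable {host}:{port} ({ex})")
        return False
```

Run on a schedule, a probe like this produces a timeline of exactly when the server was and wasn't reachable--which, as you'll see below, would have told the whole story here.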


Hmmm, that's odd--the integration has literally never had an issue connecting to the SQL Server.  We've had two dozen other errors related to bad data or import problems, but never before have we seen a connection failure.


"Hey IT manager, we received this SQL Server connection error from our weekly integration on Monday at 1am. Can you look into what might be causing this?"

IT Manager, 5 minutes later:  "Nothing wrong on our end! No errors, no issues, all of our jobs are running fine. If there was a problem, I would have known about it."


Oooookay then.


One week later.  Monday.  1am.  Same error, second week in a row.  And this is occurring for two different integrations--one pointing to the production GP company, and the other pointing to the Test GP company. 


When both integrations are run manually at 10am, they both run fine.  So this is not a fluke, and it is occurring consistently now.


"Hey IT manager, for the second week in a row, we received the SQL Server connection error at 1am. It's occurring with two different integrations, so there is definitely something causing this. Can you please look into this again?"


IT Manager, 1 minute later:  "I was able to figure out the cause of the problem.  We are shutting down our GP SQL Server every night from 10pm to 5am."


Face.  Palm.


I actually encounter this on a regular basis.  Fortunately in this case, it turned out to be a blatantly obvious cause--the IT manager had just forgotten about the change to the server maintenance schedule.  But once he remembered, we had our explanation.

But usually, the problem is less obvious and much more difficult to track down.

It could be a bad switch or network card causing intermittent SQL connection errors.  It could be custom code that only fires at certain times, locking or blocking SQL resources and causing seemingly random SQL command timeouts that get blamed on your software.  I've even had a situation where a Veeam backup of a completely different VM caused the host machine to drop network connections for the other VMs (a Veeam bug that I believe has since been fixed).

I've seen all of my custom SQL objects literally disappear every week because a very security-conscious corporate customer had a routine that deleted any unapproved SQL objects from the database.  And then there's the common case of anti-virus software blocking or deleting an EXE or DLL.

Modern networks are complex, and when you have dozens or hundreds of things going on, it's usually not easy to identify what might be causing an intermittent or infrequent problem.  And because hardware and software are normally incredibly reliable, people seem resistant to considering the possibility that something other than your integration or software is causing the problem.

Because your software happens to be the one logging the error, there seems to be a strange bias: people blame your software and deny that something else could be preventing it from communicating with the SQL Server.

I currently work with hundreds of customers, and as a result, I probably see this issue weekly.  Unfortunately, if the IT department claims that their systems are working perfectly, in many cases there isn't much I can do except add more error handling, logging, and diagnostic information to my log files and present it to the IT department repeatedly.  Sometimes that's enough to help a motivated tech research the problem and find the cause.  But many times an intermittent, non-critical error just gets ignored.
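One concrete form that extra error handling and logging can take--a hypothetical sketch, not the actual integration code--is a retry wrapper that records the timestamp, attempt number, and full exception text for every failed attempt. The log then builds the evidence timeline on its own; in a real integration, the wrapped operation would be the actual database connect/query call (e.g. via pyodbc), which is assumed here.

```python
import logging
import time

def with_retries(operation, attempts=3, delay=2.0,
                 log=logging.getLogger("integration")):
    """Run operation(); on failure, log the details and retry after a delay.

    Each log entry (timestamp, attempt number, exception text) is evidence
    to correlate against maintenance windows and network events.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as ex:
            log.error("attempt %d/%d failed: %s", attempt, attempts, ex)
            if attempt == attempts:
                raise  # exhausted: surface the error to the caller
            time.sleep(delay)
```

Retries also quietly ride out short outages; more importantly for this story, a week of "attempt failed" entries clustered between 10pm and 5am is hard for anyone to argue with.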

If you encounter this issue with IT departments, do you have any suggestions?  Do you have a technique for making the IT department curious about researching the problem instead of getting defensive?  If so, I'm all ears.







