Wednesday, July 14, 2010

SQL Server Backups: A Public Service Announcement

I totally understand if the idea of yet another blog post about Dynamics GP SQL Server backups causes you to roll your eyes and move on to something more exciting, like watching live video of oil gushing into the ocean.

But let me start with an interesting story, and then share why I'm yet again addressing this very tired subject of SQL Server backups.

A few years ago, I was working with a client located in downtown Los Angeles, a retailer with over $100m in revenue.  After implementing their Dynamics GP system, I noticed that we were running low on disk space for the SQL Server backups, so I inquired about getting more space and about making sure that the backups were being copied from the server to a secondary archive location.

I learned that the SQL Server backups on the local server hard drive were the only backups we had.  The files were not being copied to tape.  Not to a SAN.  Not to removable storage.  When I asked about getting the GP backups included in the enterprise backup schedule and the off-site backups, things got more interesting.  I was told that the company effectively had no backup solution in place.  At all.  I heard something about interviewing new enterprise backup vendors, and how it was taking a long time, and how they hadn't yet made a decision and weren't sure when they might have a solution, etc., etc.

In disbelief (or so I thought), I spoke with a senior executive, whom I'll call Ed, who headed up their internal application development.  He told me a great story, one so ironic that I had to recalibrate my sense of disbelief.

Rewind to 1992.  The company was much smaller then, and Ed was apparently the main IT guy, handling everything from hardware to application development to backups.  Each night, he would grab the backup tape, stick it in his pocket, and go home.  On the historic day of April 29, 1992, Ed was doing some coding and his typical work, just like any other day.

But then some coworkers asked him to come look at the television.  The staff huddled around the TV, watching the start of the Los Angeles riots that followed the verdict in the Rodney King trial.  Wow, that's pretty amazing and scary, they thought, but they eventually went back to their work.  A while later, someone noticed that the TV news was showing looting and rioting just a few blocks away from their offices.  They could see live video of buildings that they drove by every day on their way to work.

Seeing how close the activity was to their office, they realized that the situation was pretty serious, and they decided to send everybody home for the day.  Ed was planning on going home, but he wanted to finish up a few last things.  As his coworkers nagged him to leave, he grabbed his keys and some papers and started to leave his office.  He paused, walked back in, and grabbed the nightly backup tape, just as he did every night.  He casually dropped the tape in the pocket of his Hawaiian shirt and then drove home.

By the next day, he learned that the company's entire office building had been burned to the ground in the riots.  After getting over the shock and the implications of losing their entire building, he realized that the little tape that he took home that night in his shirt pocket was the only backup of the company's data.  He eventually had to rebuild the company's IT infrastructure and data based on that one backup tape.

Ed then explained that he definitely knew the importance of having backups.  And yet, I pointed out, 15 years later, his company didn't have any backups.  He had tried to lobby for attention to their backup situation, but it was no longer his responsibility.  Several months and many pleas later, there were still no backups.

Fast forward to this year: I worked on a project with a client that had a pretty massive GP company database, over 130GB.  As the database grew, the server started to run out of local disk space and the backups began failing, since a week of backups was consuming nearly a terabyte of disk space.  After a few weeks of this, the client finally added disk space and eventually had to move the SQL backups to a new storage device to accommodate the massive backup files.
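As an aside, when backup size is the bottleneck, native backup compression can buy a lot of breathing room.  Here is a minimal sketch, assuming SQL Server 2008 or later (where WITH COMPRESSION is available); the database name TWO and the file path are placeholders, not this client's actual setup:

-- A minimal sketch of a compressed full backup; TWO and the
-- path below are placeholder names, not the client's real job.
BACKUP DATABASE [TWO]
TO DISK = N'D:\SQLBackups\TWO_Full.bak'
WITH COMPRESSION,   -- native compression, SQL Server 2008 and later
     CHECKSUM,      -- validate page checksums while writing the backup
     STATS = 10;    -- report progress every 10 percent

CHECKSUM costs a little CPU, but it catches corrupt pages at backup time rather than at restore time, when it's too late to do anything about them.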

All is well, right?  Well, when I recently went to verify something on the server, I happened to check the Event Viewer to look for errors.  What I found were dozens of critical SQL Server errors that had been occurring for days.  I/O errors.  Operating system write failures.  Some pretty scary-looking stuff.
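Incidentally, you don't have to stumble onto these in the Event Viewer; the same messages land in the SQL Server error log, which you can search from a query window.  A minimal sketch, using the long-standing but undocumented sp_readerrorlog procedure (first parameter is the log number, second is the log type, third is a search string):

-- 0 = the current error log, 1 = the SQL Server log (2 would be
-- the SQL Agent log); the third parameter filters by a search string.
EXEC sp_readerrorlog 0, 1, N'I/O';
EXEC sp_readerrorlog 0, 1, N'operating system error';

Any hits here deserve the same urgency as the Event Viewer entries.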

I discussed the errors with the GP partner, and they indicated that the backups were the responsibility of the client's IT team, who was supposed to be monitoring them. 

If you were to look at the drive that stored the backup files, you would see nice, neat-looking .bak files, as if the SQL backups were working properly.  But you also need to check the server logs to confirm that no problems occurred.  The backup routine may have written 129GB of a 130GB database to a backup file and then failed, leaving you with a potentially useless backup.
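Two quick checks go a long way here.  A minimal sketch, assuming the standard msdb history tables and a placeholder backup path:

-- 1) When was each database last successfully backed up?  msdb only
--    records backup_finish_date for backups that completed, which is
--    exactly the point.  A NULL or stale date is a red flag.
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name
      AND b.type = 'D'             -- 'D' = full database backup
GROUP BY d.name
ORDER BY last_full_backup;

-- 2) Verify that a backup file is complete and readable without
--    actually restoring it.  The path is a placeholder.
RESTORE VERIFYONLY
FROM DISK = N'D:\SQLBackups\TWO_Full.bak'
WITH CHECKSUM;  -- only meaningful if the backup was taken WITH CHECKSUM

RESTORE VERIFYONLY doesn't guarantee a clean restore on another server, but it would catch the 129GB-out-of-130GB scenario described above.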

Yes, it's a bit of a hassle; yes, it's monotonous to monitor something that usually works properly; and yes, the chances of needing to restore from a backup are usually remote.  But would you rather have the excitement of not having a backup when you needed one?  Or would you rather be the person who made sure the backup tape was in your shirt pocket, just in case?

1 comment:

Steve Chapman said...

Regardless of how often the message is delivered, we routinely see a handful of incidents every year in which SQL Server backups are not performed properly, and the clients have to be told the bad news.

It's so obvious and easy to do, but some just will not do it.