Wednesday, September 20, 2017

The 10th and 11th ways you can lose your SQL data...

By Steve Endow

Brent Ozar has an excellent post where he shares 9 stories about how customers lost some or all of their SQL Server data.

What's great about his stories is that as I read each one, I thought "Yep, I can totally see that happening."  A simple oversight, a small mistake, one person making a change without realizing it affected other systems, or simply forgetting to change back a single setting in SQL Server.  The one about invalid SQL SMTP settings preventing error emails from going out reminded me of my recent Synology drive failures, as I also had invalid SMTP settings and hadn't received the hundreds of error emails telling me I had a problem--so I am certain that is a common symptom.

While stories about hurricanes, floods, tornadoes, or fires may provide great drama for discussion about disaster recovery, I suspect that there are far more disasters that are caused by a few clicks of a mouse, followed by "Ooops." (or "OH MY GOD WHAT HAVE I DONE???")

I have two data loss stories of my own to add to the SQL data loss lore.

Pulling the Wrong Drive

Many years ago, I was a "business systems consultant" for a Big 6 (at the time) consulting firm and somehow ended up helping a customer with their Solomon IV implementation after their sole IT employee quit.  I knew Solomon IV, knew VB, knew SQL, and knew hardware, so I was juggling everything and helping them finish their implementation.

Their Hewlett Packard server that hosted the Solomon IV databases was having some issues with its RAID array.  The server had mirrored drives that hosted the database files, and occasionally that mirror would 'break' for no good reason.  Windows would mark one drive as inactive, and the server would run on one of the drives until we removed the inactivated drive, reinserted it, and repaired the array.  This had happened once or twice before, and I was on site at the customer when it happened again.  I checked Windows, checked the array, confirmed the mirror had broken.  I then pulled the drive, reinserted the drive, and then started the array rebuild.  No problem.

Shortly after that, a user noticed that a transaction they entered that morning was no longer available in Solomon.  Then another user.  Then another.  We eventually discovered that all of the transactions and data that had been entered that day were gone.  What happened?

After pondering for a while, I realized what I had done.  When the RAID mirror broke, Windows would say that one drive had been inactivated, but it wasn't always clear which drive had been inactivated.  You had to poke around to figure out if it was the drive on the left or the drive on the right--I don't remember the process, and it might have even been as high tech as watching to see which blinky light on one of the drives wasn't blinking.

I had either mis-read the drive info or not looked carefully enough, and I had pulled out the wrong drive.  The active drive.  The one that was working and had been saving the transactions and data that day.  After I reinserted the drive, I then chose the 'bad' drive, the one that hadn't been active at all that day, marked it as the primary, and then rebuilt the mirror with the old data from that drive.  Thereby losing the data that had been entered that day.

This was pre-SQL Server, so we didn't have transaction log backups, so even if we had a full back up from the prior evening, it wouldn't have helped, as it was only that day's data that was lost.  Fortunately, I think it was only mid-day, so the users only lost the data from that morning and were able to reconstruct the transactions from paper, email, and memory.

Ever since I made that mistake, I am extremely paranoid about which physical drive is mapped to RAID arrays or Windows drive letters.  If you've built a PC or server in the last several years, you may know that Windows will assign drive letters semi-randomly to SATA drives.  And when I had two bad drives in my Synology, I double and triple checked that the drive numbers provided by the Synology did in fact map to the physical drives in the unit, from left to right.

I'm hoping that I never pull the wrong drive again.

Test vs. Production

In Brent's blog post, he shared a story about someone logging into the wrong server--they thought they had logged into a test environment, but were actually dropping databases in production.

I have a similar story, but it was much more subtle, and fortunately it had a happier ending.

I was testing a Dynamics GP Accounts Payable integration script.  I must have been testing importing AP invoices, and I had a script to delete all AP transactions from the test database and reload sample data.  So I'm running my scripts and doing my integration testing, and a user calls me to tell me that they can't find an AP transaction.  We then start looking, and the user tells me that transactions are disappearing.  What?

As we were talking, all of the posted AP transactions disappeared.  All AP history was gone.

Well, that's weird, I thought.

And then it hit me.  My script.  That deletes AP transactions.  That I ran on the Test database.

But how?

Somehow, I apparently ran that script against the production company database.  I was probably flipping between windows in SQL Management Studio and ended up with the wrong database selected in the UI.  And the customer had so much AP data that it took several minutes to delete it all, as I was talking to the user, and as we watched the data disappear.

You know that gut wrenching feeling of terror when your stomach feels like it's dropped out of your body?  Followed by sweat beading on your brow?  That's pretty much how I felt once I guessed that I had probably accidentally run my Test Delete script on the production database.  Terror.

In a mad scramble that amazes me to this day, I somehow kept my sanity, figured out what happened, and came up with an insane plan to restore the AP data.  Fortunately, the customer had good SQL backups and had SQL transaction logs.  For some reason, I didn't consider a full database restore--I don't recall why--perhaps it was because it would require all users to stop their work and we would have lost some sales data.  So I instead came up with the crazy idea of reading the activity in the SQL log files.  Like I said, insane.

So I found an application called SQL Log Rescue by RedGate Software that allowed me to view the raw activity in SQL Server log files.  I was able to open the latest log file, read all of the activity, see my fateful script that deleted all of the data.  I was also able to view the full data of the records that were deleted and generate SQL scripts that would re-insert the deleted data.  Miraculously, that crazy plan worked, and SQL Log Rescue saved me.  I was able to insert all of the data back into the Accounts Payables tables, and then restart my heart.

Thinking back on it, I suspect that the more proper approach would have been do to a SQL transaction log backup and then perform a proper point in time recovery of the entire database.  Or I could have restored to a new database and then copied the data from the restore into production.  But as Brent's stories also demonstrate, we don't always think clearly when working through a problem.

So when you're planning your backup routines and disaster recovery scenarios, review the stores that Brent shares and see if your backup plans would handle each of them.  And then revisit them again occasionally to make sure the backups are working and you are still able to handle those scenarios.

Steve Endow is a Microsoft MVP in Los Angeles.  He is the owner of Precipio Services, which provides Dynamics GP integrations, customizations, and automation solutions.

You can also find him on Twitter, YouTube, and Google+

No comments: