Thursday, November 3, 2011

A Tale of Two BSODs: Diagnosing Windows Blue Screen of Death

Ever had a Windows machine display the Blue Screen of Death?  Through amazing coincidence, blue screens showed up on both my desktop machine and a client's production Dynamics GP Terminal Server in the same week!  I got fed up with the cryptic errors and finally decided to learn how to diagnose the infamous BSOD.

A few months ago I built a new desktop machine. Although there were a few quirks with 64-bit Windows 7, it seemed to work well.  Until the day when I started to get the dreaded Blue Screen error.

Having dealt with blue screens occasionally over the years, my general interpretation is that once you start getting them on a machine, they don't tend to go away on their own.  Sure enough, my desktop started to blue screen a few times a week.

While at my desk, I saw the blue screen flash on my monitors, but my computer instantly rebooted, preventing me from reading the message.  By default, Windows 7 and Server 2008 are set to automatically restart when a "System failure" occurs.  This option is set under System Properties -> Startup and Recovery Settings.  The first change I made was to disable the automatic restart option so that I could know when the blue screen occurred and see the error messages.
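For anyone who would rather check these settings from a script, here is a minimal sketch.  It assumes the Startup and Recovery options are stored under the CrashControl registry key, with AutoReboot holding the restart checkbox and CrashDumpEnabled holding the dump type.  The function names are made up for illustration, and the registry read only works on Windows.

```python
import sys

# Assumed registry location of the Startup and Recovery settings (under HKLM).
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# CrashDumpEnabled values as used by Windows 7 / Server 2008.
DUMP_TYPES = {
    0: "(none)",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump",
}


def describe_crash_settings(values):
    """Turn raw CrashControl values (a plain dict) into readable settings."""
    return {
        # A missing AutoReboot value is reported as "not restarting".
        "automatically_restart": values.get("AutoReboot") == 1,
        "write_debugging_information": DUMP_TYPES.get(
            values.get("CrashDumpEnabled"), "unknown"
        ),
    }


def read_crash_control():
    """Read the live values from the registry; Windows only."""
    import winreg  # Windows-only standard library module

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in ("AutoReboot", "CrashDumpEnabled"):
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; it is simply left out of the dict
    return values


if __name__ == "__main__" and sys.platform == "win32":
    print(describe_crash_settings(read_crash_control()))
```

If automatically_restart comes back True, the machine will reboot before you can read the blue screen, which is exactly what happened to me.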

Sure enough, the blue screens showed up again a few days later, but unfortunately, the message displayed wasn't very helpful.

Sometimes you will get lucky and see a specific driver listed, like "ETRON_USB_3", which can tell you immediately that a third-party USB 3 driver is causing the problem.

But in my case, since a specific driver wasn't listed on the blue screen, just a cryptic "STOP" error, I didn't have any clues as to a possible cause.  I figured that my only option would be to try and reinstall Windows, which isn't on my favorite-things-to-do list.  So I put it off and just ignored the occasional crash, knowing I would eventually have to deal with it.

Then, the other evening, while connected remotely to a client's server, I was suddenly disconnected.  When I was able to reconnect, I saw a message indicating that the server had experienced a blue screen and had restarted automatically.

Figuring that two systems with blue screens in the same week was too much of a coincidence, I took it as a challenge to learn how to diagnose the cause of the dreaded BSOD.

To my surprise, it turns out that it is shockingly simple to get diagnostic information about the BSOD error--if you know what tools to use and once you know how to use them. 

In the Startup and Recovery options in Windows, there is an option to "Write debugging information".  In the latest versions of Windows, the default setting is to write a "Small memory dump", also known as a "minidump".

When Windows encounters a "system error", it writes certain diagnostic information to this memory dump file explaining the specific area of the operating system that caused the crash and possible causes of the problem.
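Before you can read a minidump you have to find it.  The sketch below assumes the usual default location of %SystemRoot%\Minidump (the folder can be changed in the Startup and Recovery dialog) and lists the most recent .dmp files first.  The helper names are my own.

```python
import os
from pathlib import Path


def default_minidump_dir():
    """Return the usual minidump folder; an assumption, since it is configurable."""
    root = os.environ.get("SystemRoot", r"C:\Windows")
    return Path(root) / "Minidump"


def newest_dumps(folder, limit=5):
    """Return up to `limit` .dmp files from `folder`, newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    dumps = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".dmp"
    ]
    dumps.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return dumps[:limit]


if __name__ == "__main__":
    for dump in newest_dumps(default_minidump_dir()):
        print(dump.name, dump.stat().st_size, "bytes")
```

On most systems you will need to run this from an elevated prompt, since the Minidump folder is normally only readable by administrators.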

I naively thought that reading memory dumps was some type of complex process that only Wizards at MS Support could perform, but to my surprise, there are several tools available to make the diagnostic process extremely easy.

I found this blog post by the famous Mark Russinovich, which got me started on the "old school" method of reading the debug files using the Microsoft WinDbg utility.

There are a few challenges with this approach.  First, you have to figure out which version of WinDbg you need for your OS, and Microsoft seems to want to make it as difficult as possible to get just that one tool.  You either need to download the Windows Driver Development Kit (MSDN subscription and login required), or you have to download the Windows SDK just to get one little EXE file.  It's absurd.  You then have to try and figure out the extremely arcane tool, since it is obviously not designed to be a polished consumer-friendly product.

Anyway, I jumped through all of these hoops, installed WinDbg, and read my minidump files.  Immediately, the tool showed me the cause of the blue screens: 

Probably caused by : memory_corruption

Wow.  With a few clicks, I was able to determine the cause of a blue screen!  It seemed like magic.

So I ran MemTest on my workstation, and sure enough, it instantly showed memory errors.  Since I know just enough to be dangerous when it comes to these things, I figured that it was possible that my memory was physically fine, but that there was something else that was causing the issue.

I booted into the BIOS settings and disabled the XMP memory profile, which has the memory operate at a faster speed.  Sure enough, once I saved that setting and ran MemTest again, no errors.  I tried changing various settings with XMP enabled to see if I could get XMP working, but I couldn't get rid of the errors, so for now I'm running sans-XMP, which is fine for the relatively simple tasks that I perform on my desktop.

Feeling very confident after conquering my first BSOD in a matter of minutes, I then decided to diagnose the BSOD on the client's production Dynamics GP Terminal Server.

I launched WinDbg, set the debug symbol path, and loaded the minidump, and well, unfortunately the results weren't quite as simple as on my workstation.

Probably caused by : Ntfs.sys ( Ntfs!NtfsDeleteFile+8d3 )

So what does this mean?  My interpretation is that something about a file delete operation caused the server to crash.  Later on, there is a reference to iexplore.exe, which is Internet Explorer:

PROCESS_NAME:  iexplore.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

This is where an expert would be required, since this just doesn't seem to be enough information to determine a specific cause.  My only interpretation is that some aspect of Internet Explorer is somehow causing the crash. 
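One thing that can be decoded without an expert is the error code itself.  0xc0000005 is the NTSTATUS value for an access violation, and the odd-looking %08lx and %s in the message are just printf-style placeholders that WinDbg left unfilled.  The following is a rough sketch that splits an NTSTATUS value into its standard bit fields and fills in the template.  The addresses in the example are made up purely to show the formatting, and the function name is my own.

```python
def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS value into its standard bit fields."""
    severity_names = ["success", "informational", "warning", "error"]
    return {
        "severity": severity_names[(status >> 30) & 0x3],  # top two bits
        "customer_defined": bool((status >> 29) & 0x1),    # bit 29
        "facility": (status >> 16) & 0xFFF,                # bits 16-27
        "code": status & 0xFFFF,                           # low 16 bits
    }


# Message template copied from the WinDbg output above.
TEMPLATE = (
    "The instruction at 0x%08lx referenced memory at 0x%08lx. "
    "The memory could not be %s."
)

if __name__ == "__main__":
    print(decode_ntstatus(0xC0000005))
    # Hypothetical addresses, only to show how the placeholders get filled in.
    print(TEMPLATE % (0x12345678, 0x00000000, "read"))
```

In other words, the ERROR_CODE line says that code running in the context of iexplore.exe touched memory it was not allowed to touch.  That still doesn't say which component was at fault.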

So for now, no magic solutions like what I found with my workstation, but we at least have some information that we can use to monitor the server.

Having gone through the process, here's a summary of what I learned:

1. It seems that WinDbg comes in two flavors, listed on this MSDN Dev Center web page.

The newer version (6.2.8102 8/23/2011) is included with the Windows Developer Preview WDK (MSDN subscriber only), which seems to work fine for Windows 7 minidump files (and perhaps Server 2008 R2).  But when I tried to use this version on the client's Windows Server 2008 (not R2), it spewed a bunch of complaints about being unable to load ntoskrnl.exe. 

So, for Windows Server 2008 and versions prior to Windows 7, there is a second version (2/1/2010), available in the Windows SDK.

When you install the Windows SDK, you only need the Debugging Tools, and can uncheck all of the other options.

2. In order to properly read the dump files, you need to first set the symbols path.  The Russinovich blog post mentions one (a), but I found a different one on a forum thread that seemed to work better for me (b).

a)  srv*c:\symbols*

b)  SRV*C:\WebSymb*

Version (a) worked on my Windows 7 machine with the newer WinDbg, but I had to use version (b) with the older WinDbg.

3. To perform the debugging, launch WinDbg, select File -> Symbol File Path and paste in one of the symbol paths from above.  Then select File -> Open Crash Dump and select your minidump file.

4. Wait a few seconds for WinDbg to analyze the file and display results.  Hopefully you see something like the following, including the helpful "Probably caused by" note:

*  Bugcheck Analysis  *
Use !analyze -v to get detailed debugging information.
BugCheck 24, {1904aa, c941b6a0, c941b39c, 9247f5fc}
Probably caused by : Ntfs.sys ( Ntfs!NtfsDeleteFile+8d3 )

5. You should see a link on the text "!analyze -v". If you click on that link, it will display more information that may help you further diagnose the problem.  It's all pretty cryptic looking, but a technical person or developer should be able to pick out a few clues.

6. There are apparently other tools that are much easier to use than WinDbg, but may not be as comprehensive.  I quickly tried one called BlueScreenView that is amazingly simple and easy to use.  The only downside is that it doesn't appear to offer the "Probably caused by" note provided by WinDbg.  Once you get familiar with the typical errors, you may not need that helpful message, but I still need that pointer, so for now I'll stick with WinDbg.
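If you end up with several crash dumps to compare, the WinDbg summary shown in step 4 is regular enough to pick apart with a short script.  This is only a sketch: it assumes the bug check number and its parameters are printed in hexadecimal, as in my output, and the small table of bug check names covers just a handful of common codes.  The function and variable names are hypothetical.

```python
import re

# A few well-known bug check codes; far from a complete list.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1A: "MEMORY_MANAGEMENT",
    0x24: "NTFS_FILE_SYSTEM",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
}

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
CAUSE_RE = re.compile(
    r"Probably caused by\s*:\s*(\S+)(?:\s*\(\s*([^)\s]+)\s*\)?)?"
)


def summarize(text):
    """Extract the bug check code, parameters, and suspected module."""
    result = {
        "bugcheck": None, "name": None, "parameters": [],
        "module": None, "symbol": None,
    }
    match = BUGCHECK_RE.search(text)
    if match:
        code = int(match.group(1), 16)
        result["bugcheck"] = code
        result["name"] = BUGCHECK_NAMES.get(code)
        result["parameters"] = [
            int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()
        ]
    match = CAUSE_RE.search(text)
    if match:
        result["module"] = match.group(1)
        result["symbol"] = match.group(2)  # None when no symbol is listed
    return result


if __name__ == "__main__":
    sample = (
        "BugCheck 24, {1904aa, c941b6a0, c941b39c, 9247f5fc}\n"
        "Probably caused by : Ntfs.sys ( Ntfs!NtfsDeleteFile+8d3 )\n"
    )
    print(summarize(sample))
```

For my workstation's dump, which only said "memory_corruption", the module would be reported as memory_corruption and the symbol would be empty.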

My experience has been that blue screen errors are pretty rare these days, but obviously they do still occur.  If you are feeling adventurous, hopefully this information helps you navigate the relatively simple process of doing some initial diagnostics on your own before rebuilding a server or paying for a support case.

Steve Endow is a Dynamics GP Certified Trainer and Dynamics GP Certified IT Professional in Los Angeles.  He is also the owner of Precipio Services, which provides Dynamics GP integrations, customizations, and automation solutions.