Enron and the all-night bug

Reading Len's request to discuss your worst bug, I remembered this gem from my days at Enron Europe.

The occasion was the rollout of a new version of my team’s in-house trading application to all of the commodity traders, and the time was 6:00 PM on a Friday evening. The estimate for three of us to complete the rollout was 2–3 hours, although the traders wouldn’t be back until 9:00 AM the following day. We’d already performed a test rollout to a couple of production PCs, and everything had worked perfectly. Full of naive optimism, we set to work.

At about 8:00 PM, we realised that we had a problem. On just three of the ten PCs, our application refused to start properly, producing a strange error that suggested a registry problem. By midnight, after increasingly frustrated efforts to locate the problem, we were reduced to adding copious trace statements and then recompiling the application in order to find the exact line where the error was occurring. This line turned out to be a VB6 Dim statement, a non-executable line that would never normally give an error.

By 3:00 AM, after numerous experiments and diagnostic attempts, we were still baffled. As far as we could tell, the problem was some sort of registry issue that occurred because the Windows user profile was corrupted: re-creating the user profile from scratch cured the problem. Unfortunately, it appeared that running certain other programs would re-corrupt the profile, and our program would then stop working. Why this problem happened on some PCs but not others was also a mystery. It was no coincidence that Enron's installation scripting process was a real witch’s brew.

By 6:00 AM, 12 hours after we had started, we still hadn’t found a satisfactory solution or workaround, and desperation was starting to set in. By 9:00 AM, as the traders came in to work, we decided that they could cope with the seven PCs that were working, and we would fix the other three PCs on the following Monday. Exhausted and frustrated at being beaten, we retired to our respective homes to lick our wounds.

Several months later, we still weren’t able to diagnose the exact problem, even with the help of other teams. In fact, the cause of the bug was never found. It finally disappeared when we moved from Windows NT to Windows 2000 as the base operating system, but this remains one of the most exasperating bugs that I’ve ever personally encountered.