How Linux handles hardware problems

By apexwm, 12 August, 2011 19:01

I usually write about issues in Windows, because there are so many and they are so frequent. I rarely run across major issues with Linux, but the one I describe here is an exception: a kernel lockup. Kernel lockups in Linux are very rare; over 14 years I have seen fewer than 10. They usually happen because of hardware problems that leave the kernel unable to run. Recently I saw a 10-year-old server running Red Hat Linux 7.1 lock up completely. And yes, the OS was installed on the Dell PowerEdge 2400 10 years ago, back in 2001, and has been running just fine ever since, with none of the file corruption, slowdowns, or other issues we see with old Windows installations. Recently, the server was shut off abruptly due to an extended power outage. After that, it would run for roughly a week at a time and then lock up. One time the console screen was black; another time it showed a kernel dump.

After booting the server back up, I noticed this entry in /var/log/messages:

kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
kernel: You probably have a hardware problem with your RAM chips

Those are some humorous log entries, but they contain useful information. With a lockup like this, the kernel appears to be having some sort of issue with the system memory in the server. Fortunately, this very same problem happened 3 years ago, so we knew what the fix was: reseat the power supplies. The first time it happened I ran memory diagnostics and they passed. I then found a forum post that connected the NMI errors to power issues with the system. Since the issue appeared to be exactly the same, this time we skipped the memory diagnostics and reseated the two hot-swap power supplies. That fixed the problem before, and should fix it again this time.

A few points to this post. First, diagnosing problems in Linux is not as hard as it is rumored to be. /var/log/messages is usually where the kernel logs its information, and it logs very thoroughly. The kernel's entries are tagged "kernel", as in the example above. The logs are plain text, so they can be opened with any program that can read text, unlike Windows, which stores its logs in a proprietary format that needs Microsoft tools to view.
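Because the log is plain text, a few lines of script are enough to pull out the kernel's entries. The following is only a minimal sketch in Python, assuming the traditional syslog line layout ("Mon DD HH:MM:SS hostname kernel: message"); the function names kernel_entries and find_nmi_warnings are made up for this illustration, and the exact layout on a given system may differ.

```python
import re

# Assumed traditional syslog layout:
#   "Aug 12 19:01:02 myhost kernel: message text"
KERNEL_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\skernel:\s(?P<msg>.*)$"
)


def kernel_entries(lines):
    """Yield (stamp, host, msg) for each line tagged "kernel"."""
    for line in lines:
        match = KERNEL_LINE.match(line.rstrip("\n"))
        if match:
            yield match.group("stamp"), match.group("host"), match.group("msg")


def find_nmi_warnings(lines):
    """Return the kernel entries whose message mentions an NMI."""
    return [entry for entry in kernel_entries(lines) if "NMI" in entry[2]]
```

Both functions accept any iterable of lines, so an open file handle for /var/log/messages (reading it usually requires root) can be passed in directly.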

Second, hardware problems cannot be prevented, and they tend to happen when you least expect them. It's not the software's fault when a hardware problem prevents it from running. Hardware is behind most of the Linux issues I see. On the flip side, think about how common Windows blue screens of death (BSOD) are. Countless jokes about them circulate all the time, and even Linux screensavers contain Windows crash screens (the xscreensaver packages contain these!).

Third, bad RAM is probably the most common cause of a Linux system crash, other than a bad motherboard. The Linux kernel can continue to run as long as it can access the memory. When bad memory is suspected, I always run a copy of the free utility Memtest86, which is an excellent memory tester.

In conclusion, hardware problems are sometimes difficult to diagnose and fix. I believe we got lucky with the example here, but logs should always be examined, and any errors in them addressed. I've seen a lot of Windows administrators who do not view the error logs or act on them proactively, maybe because the logs are often difficult to decipher. There are services like MOM/SCOM available to make this easier. With Linux, I tend to prefer Logwatch to email the errors. Either way, a well-tuned system will run for many years and should provide reliable service.

Talkback

Linux servers can crash for a variety of reasons - power failures (redundant power supplies and UPSes help here), faulty RAM (sometimes reseating helps, but better to get it replaced if under warranty), bugs in RAID controller firmware (yes, I'm looking at you Dell) and disk faults that even RAID doesn't solve (like 2 or 3 disks dying simultaneously - rare, but it does happen).

What I'm most disappointed about, though, is that /var/log/messages never logs kernel crashes, despite a kernel trace being puked out to the console (where most of it is lost via scrolling). Installing monitoring software (OMSA for Dells [useful for watchdog facilities if nothing else], Nagios etc.) can help, but I still get annoyed that kernel crashes leave us scratching our heads sometimes.

Had a PowerEdge recently that just kept kernel crashing - ended up having to patch every single piece of firmware on the box (and many required reboots - luckily it was a clustered Proxmox server, so we just live-migrated away the VMs) and eventually it stayed up for more than a few days. Dell were as perplexed as us - their diagnostic boot CD told them nothing useful other than that the firmware was out of date (BIOS, iDRAC and the RAID controller). We suspect it was the RAID controller firmware, but never really worked out whether that was the culprit.
rkl 13 August, 2011 23:02

@rkl,

I would recommend using either a serial connection directly to the server (Linux will send panics to a connected serial port) or following http://www.tocpcs.com/howto-log-a-kernel-panic-it-can-be-done/

Thanks

Brendan
Brendan Edmonds via Facebook 14 August, 2011 03:11

rkl & Brendan Edmonds :

Thanks for the comments and feedback. That is a great suggestion for troubleshooting kernel panics. You definitely made a good point about the panic error code not being written to disk. That is a tough one; however, as you also mentioned, in many cases the cause is a hardware problem. In fact, I've only once seen a kernel crash due to a software problem: we had to compile a custom kernel for a DEC Alpha box and were missing a compilation parameter. Once the correct parameters were specified, the box ran for years without a single issue.
apexwm 24 August, 2011 15:11