[Cialug] Crashing with errors in mcelog
Daniel A. Ramaley
daniel.ramaley at drake.edu
Wed Mar 4 14:54:37 CST 2009
On 2009-03-03 at 13:55:17, Daniel A. Ramaley wrote:
>On 2009-03-03 at 13:25:56, Aaron Porter wrote:
>>When running memtest on "server grade" hardware it's important to
>>disable ECC in the system BIOS. I've had vendors swear up and down
>>that there are no memory issues as they were testing the error
>>correction and not the memory itself.
>
>Thanks for the hint.
>
>My desktop machine at home is "server grade" hardware, to the limit of
>what budget i was able to rationalize at the time. It does have ECC
>RAM. I'll be sure to disable the ECC when running memtest86.

I'm going to call this thread's issue most likely resolved. Today i
installed the memtest86 Debian package (which then shows up in the grub
menu on boot). I rebooted and went into the BIOS. After disabling ECC,
something interesting happened. The BIOS made a large number of beeps
on boot and printed a message about memory failure. Interesting. I
tried rebooting, same behavior. The memory problems didn't prevent the
machine from booting, however, and it was still able to boot up to a
Linux desktop. But, recalling that some of the mcelog errors mentioned
"DIMM3", i figured i'd remove 2 of the 4 DIMMs (i don't know what "3"
means since i don't know if mcelog counts from 0 or from 1, but since
DIMMs should be installed in pairs in this machine, it doesn't matter
anyway). Upon removing the 2 higher DIMMs, the machine started working
perfectly, albeit with half the RAM it had before. No BIOS beeps or
complaints about bad memory. It is running now and presumably will not
have more problems.
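
The counting argument above can be spelled out with a minimal sketch. It
assumes the four slots pair up as (first, second) and (third, fourth),
which is a guess about this board's layout and not something mcelog
reports; the function name pair_for_label is made up for illustration.

```python
def pair_for_label(label, zero_based, slots_per_pair=2):
    """Return the 0-based pair index for a DIMM label such as the 3 in "DIMM3".

    Assumption: slots are paired in order, (first, second) then (third, fourth).
    """
    # Convert the label to a 0-based slot position under either convention.
    slot = label if zero_based else label - 1
    return slot // slots_per_pair


# Try both counting conventions for "DIMM3" and collect the resulting pairs.
pairs = {pair_for_label(3, zero_based) for zero_based in (True, False)}
print(pairs)  # both conventions land in the same (upper) pair: {1}
```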

After checking how cheap that RAM has become (even the large 2 GB DIMMs
that my machine has), i just ordered a couple replacements. Probably
only one of the 2 DIMMs i pulled is bad, but RAM is so cheap that it
isn't worth my time trying to determine which is which.

My previous statement about memtest86 just flat out not working still
stands, however. Both before and after removing the bad RAM, when i
select memtest86 in grub, the screen goes black and the machine reboots
in under a second. So, memtest86 is functionally similar to the computer's
reset button. I was expecting a memory test, not another way to kick
the machine. I might try booting memtest86 from a CD and see if that
works any better, but since i seem to have discovered the memory
problems without it, i probably won't bother. For now i'm running with
ECC turned off, which i think should cause a crash if there are also
problems with the lower pair of DIMMs.
------------------------------------------------------------------------
Dan Ramaley Dial Center 118, Drake University
Network Programmer/Analyst 2407 Carpenter Ave
+1 515 271-4540 Des Moines IA 50311 USA