A few dozen times each day, the Xeon E3-1275 v3 CPU on my SuperMicro X10SLM-F board generates a Machine Check Event (MCE). The Linux kernel logs all MCEs in /var/log/syslog:
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: Machine check events logged
CMCI storm detected: switching to poll mode
CMCI storm subsided: switching to interrupt mode
mce_notify_irq: 14 callbacks suppressed
mce: [Hardware Error]: Machine check events logged
mce: CPU supports 9 MCE banks
mce: [Hardware Error]: Machine check events logged
After installing mcelog, I was able to pull more detailed information about the machine check events:
Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1415087019 Tue Nov 4 08:43:39 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
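To see how the raw STATUS and MCGCAP values in the output above relate to the decoded text, here is a minimal Python sketch that splits them into bit fields. The bit positions are my reading of the IA32_MCi_STATUS and IA32_MCG_CAP layouts in Intel's Software Developer's Manual. The function and key names are made up for illustration and are not part of mcelog.

```python
# Illustrative decoder for the two raw registers printed by mcelog.
# Bit positions are assumed from the Intel SDM's machine-check
# architecture chapter; names below are hypothetical, not mcelog's.

def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)

def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its main fields."""
    return {
        "valid": bit(status, 63),            # VAL: register holds an error
        "overflow": bit(status, 62),         # OVER: an earlier error was lost
        "uncorrected": bit(status, 61),      # UC: clear means corrected
        "enabled": bit(status, 60),          # EN: error reporting enabled
        "misc_valid": bit(status, 59),       # MISCV
        "addr_valid": bit(status, 58),       # ADDRV
        "context_corrupt": bit(status, 57),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }

def decode_mcg_cap(cap):
    """Split an IA32_MCG_CAP value into bank count and CMCI support."""
    return {
        "bank_count": cap & 0xFF,        # bits 7:0
        "cmci_supported": bit(cap, 10),  # MCG_CMCI_P
    }

status = decode_mci_status(0x90000040000F0005)
cap = decode_mcg_cap(0xC09)

for name, value in status.items():
    print(f"{name}: {value}")
for name, value in cap.items():
    print(f"{name}: {value}")
```

For the values logged here, the valid and enabled bits are set and the uncorrected bit is clear, which matches "Corrected error" and "Error enabled". The MCA error code is 0x0005, the code the Intel manual lists for an internal parity error, and the bank count of 9 agrees with the kernel's "CPU supports 9 MCE banks" line.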
The MCEs all look the same: the affected bank is always BANK 0, and only the CPU and the APICID differ. I updated the BIOS, replaced the ECC RAM, and replaced the mainboard, but the errors kept showing up.