A few dozen times each day, the Xeon E3-1275 v3 CPU on my SuperMicro X10SLM-F board generates a Machine Check Event (MCE). The Linux kernel logs all MCEs in
    mce: [Hardware Error]: Machine check events logged
    mce: [Hardware Error]: Machine check events logged
    CMCI storm detected: switching to poll mode
    CMCI storm subsided: switching to interrupt mode
    mce_notify_irq: 14 callbacks suppressed
    mce: [Hardware Error]: Machine check events logged
    mce: CPU supports 9 MCE banks
    mce: [Hardware Error]: Machine check events logged
With mcelog I was able to pull some more detailed information about the check events:
    Hardware event. This is not a software error.
    MCE 0
    CPU 3 BANK 0
    TIME 1415087019 Tue Nov 4 08:43:39 2014
    MCG status:
    MCi status:
    Corrected error
    Error enabled
    MCA: Internal parity error
    STATUS 90000040000f0005 MCGSTATUS 0
    MCGCAP c09 APICID 6 SOCKETID 0
    CPUID Vendor Intel Family 6 Model 60
The MCEs all look the same (the affected bank is always BANK 0); only the CPU and the APICID differ. I updated the BIOS, replaced the ECC RAM, and replaced the mainboard, but the errors kept showing up. After some deep digging I discovered that the behaviour is triggered by a bug (which Intel sophisticatedly calls an “erratum”) in the CPU code:
    HSW131 Spurious Corrected Errors May be Reported

    Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal interrupts.

    Implication: When this erratum occurs, software may see corrected errors that are benign. These corrected errors may be safely ignored.

    Workaround: None identified.
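The bit pattern described in the erratum can be checked directly against the STATUS value that mcelog reported. The following is only a minimal sketch: the helper name `is_spurious_haswell_ce` and the macro names are made up for illustration, and the field positions are taken solely from the erratum text quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions as given in the erratum text (names are illustrative). */
#define MCI_STATUS_VALID        (1ULL << 63)  /* valid field, bit 63 */
#define MCI_STATUS_UNCORRECTED  (1ULL << 61)  /* uncorrected error field, bit 61 */

/*
 * Hypothetical helper: returns true if an IA32_MC0_STATUS value matches
 * the spurious corrected error pattern described in the erratum.
 */
static bool is_spurious_haswell_ce(uint64_t status)
{
    uint16_t model_specific_code = (uint16_t)((status >> 16) & 0xFFFF); /* bits 31:16 */
    uint16_t mca_error_code      = (uint16_t)(status & 0xFFFF);         /* bits 15:0  */

    return (status & MCI_STATUS_VALID) != 0 &&
           (status & MCI_STATUS_UNCORRECTED) == 0 &&
           model_specific_code == 0x000F &&
           mca_error_code == 0x0005;
}
```

Applied to the value from my logs, `is_spurious_haswell_ce(0x90000040000f0005ULL)` is true: bit 63 is set, bit 61 is clear, and the two 16-bit codes are 0x000F and 0x0005. Note that the erratum only talks about bank 0, so a real filter would have to check the bank number as well.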
Using the archived system logs, I was able to trace back when the errors first appeared: they started showing up after I installed libvirt/QEMU and began running KVM guests on the server. For some reason, the MCEs are triggered at an increased rate when using QEMU. My current solution is to disable the handling of corrected check events in the kernel. I turned it off using the
mce=ignore_ce kernel boot option.
I think that letting the MCEs generate interrupt storms is more harmful to the server’s stability than turning the check events off in the kernel (I’m still able to see them in SuperMicro’s IPMI).
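After a reboot it is worth confirming that the option really reached the kernel. The kernel documentation spells it `mce=ignore_ce`, and the running kernel's command line can be read from /proc/cmdline. Below is a rough sketch of such a check; both function names are hypothetical, and the 4096-byte buffer is simply an assumption that the command line fits into it.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: true if `opt` appears in `cmdline` as a complete
 * whitespace-separated token (so "mce=ignore_ce_x" does not match).
 */
static bool cmdline_has_option(const char *cmdline, const char *opt)
{
    size_t optlen = strlen(opt);
    const char *p = cmdline;

    if (optlen == 0)
        return false;

    while (*p != '\0') {
        p += strspn(p, " \t\n");                /* skip separators */
        size_t toklen = strcspn(p, " \t\n");    /* length of next token */
        if (toklen == optlen && strncmp(p, opt, optlen) == 0)
            return true;
        p += toklen;
    }
    return false;
}

/* Hypothetical helper: checks the command line of the running kernel. */
static bool running_kernel_has_option(const char *opt)
{
    char buf[4096];  /* assumed to be large enough for the command line */
    FILE *f = fopen("/proc/cmdline", "r");
    bool found = false;

    if (f == NULL)
        return false;
    if (fgets(buf, sizeof buf, f) != NULL)
        found = cmdline_has_option(buf, opt);
    fclose(f);
    return found;
}
```

Calling `running_kernel_has_option("mce=ignore_ce")` from a small test program should return true once the boot loader configuration has been updated and the machine rebooted.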
Ideally, the benign check events would be filtered in the kernel. At least in FreeBSD, the problem has already been addressed:
    	/*
    	 * Skip spurious corrected parity errors generated by desktop Haswell
    	 * (see HSD131 erratum) unless reporting is enabled.
    	 * Note that these errors also have been observed with D0-stepping,
    	 * while the revision 014 desktop Haswell specification update only
    	 * talks about C0-stepping.
    	 */
    	if (rec->mr_cpu_vendor_id == CPU_VENDOR_INTEL &&
    	    rec->mr_cpu_id == 0x306c3 && rec->mr_bank == 0 &&
    	    rec->mr_status == 0x90000040000f0005 && !intel6h_HSD131)
    		return (1);

    	return (0);
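The FreeBSD check compares against the raw CPUID signature 0x306c3, while mcelog prints "Family 6 Model 60". Both describe the same CPU model. The sketch below decodes a signature using the standard layout of CPUID leaf 1 EAX to show the relation; the struct and function names are invented for this example.

```c
#include <stdint.h>

/* Illustrative container for the decoded signature fields. */
struct cpu_signature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/*
 * Hypothetical helper: decodes the CPUID leaf 1 EAX value into
 * display family, display model and stepping.
 */
static struct cpu_signature decode_cpuid_signature(uint32_t eax)
{
    struct cpu_signature sig;
    unsigned base_family = (eax >> 8) & 0xF;    /* bits 11:8  */
    unsigned base_model  = (eax >> 4) & 0xF;    /* bits 7:4   */
    unsigned ext_family  = (eax >> 20) & 0xFF;  /* bits 27:20 */
    unsigned ext_model   = (eax >> 16) & 0xF;   /* bits 19:16 */

    sig.stepping = eax & 0xF;                   /* bits 3:0   */

    /* The extended family is only added when the base family is 0xF. */
    sig.family = base_family;
    if (base_family == 0xF)
        sig.family += ext_family;

    /* The extended model forms the high nibble for families 6 and 0xF. */
    sig.model = base_model;
    if (base_family == 0x6 || base_family == 0xF)
        sig.model |= ext_model << 4;

    return sig;
}
```

For 0x306c3 this yields family 6, model 0x3C (decimal 60) and stepping 3, which matches the mcelog record above. A filter on Linux could therefore combine the same three conditions FreeBSD uses: CPU signature, bank 0, and the exact STATUS value.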
If you’ve been affected by the Haswell Check Event Bug, how did you tackle it? What’s your system configuration?