QEMU on Haswell causes spurious MCE events

A few dozen times each day, the Xeon E3-1275 v3 CPU on my SuperMicro X10SLM-F board generates a Machine Check Event (MCE). The Linux kernel logs all MCEs in /var/log/syslog:

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: Machine check events logged
CMCI storm detected: switching to poll mode
CMCI storm subsided: switching to interrupt mode
mce_notify_irq: 14 callbacks suppressed
mce: [Hardware Error]: Machine check events logged
mce: CPU supports 9 MCE banks
mce: [Hardware Error]: Machine check events logged

After installing mcelog I was able to pull some more detailed information about the check events:

Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1415087019 Tue Nov  4 08:43:39 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
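
To compare many such records at once, the mcelog text output can be split into one entry per event. The following is a minimal sketch that only assumes the record layout shown above; the names `parse_mcelog` and `tally` are made up for this illustration and are not part of mcelog.

```python
import re
from collections import Counter

# One record copied from the mcelog output above.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1415087019 Tue Nov  4 08:43:39 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
"""


def parse_mcelog(text):
    """Split mcelog text output into one dict per event (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Hardware event."):
            # Every record in the output above starts with this line.
            current = {}
            records.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            current["cpu"] = int(m.group(1))
            current["bank"] = int(m.group(2))
            continue
        m = re.match(r"STATUS ([0-9a-fA-F]+)", line)
        if m:
            current["status"] = int(m.group(1), 16)
            continue
        m = re.match(r"MCGCAP \S+ APICID (\S+)", line)
        if m:
            # Kept as a string; this sketch does not interpret the value.
            current["apicid"] = m.group(1)
    return records


def tally(records):
    """Count events per (bank, status) pair."""
    return Counter((r.get("bank"), r.get("status")) for r in records)
```

If every event really is the same, `tally` returns a single key with bank 0 and status 0x90000040000f0005.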

The MCEs all look the same: BANK 0 is always the one affected, and only the CPU and the APICID differ. I updated the BIOS, replaced the ECC RAM, and replaced the mainboard, but the errors kept showing up. After some deep digging I discovered that the behaviour is triggered by a bug (which Intel sophisticatedly calls an “erratum”) in the CPU code:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf

HSW131 Spurious Corrected Errors May be Reported

Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal interrupts.

Implication: When this erratum occurs, software may see corrected errors that are benign. These corrected errors may be safely ignored.

Workaround: None identified.
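
The bit fields named in the erratum can be checked against the STATUS value that mcelog reported. This is a minimal sketch that uses only the bit positions quoted above; `decode_mci_status` and `matches_hsw131` are hypothetical names chosen for this example.

```python
def decode_mci_status(status):
    """Extract the fields that the erratum text refers to."""
    return {
        "valid": bool((status >> 63) & 1),        # bit 63
        "uncorrected": bool((status >> 61) & 1),  # bit 61
        "mscod": (status >> 16) & 0xFFFF,         # bits [31:16]
        "mcacod": status & 0xFFFF,                # bits [15:0]
    }


def matches_hsw131(bank, status):
    """True if a record has the signature described in the erratum.

    The erratum names IA32_MC0_STATUS, so only bank 0 qualifies.
    """
    f = decode_mci_status(status)
    return (
        bank == 0
        and f["valid"]
        and not f["uncorrected"]
        and f["mscod"] == 0x000F
        and f["mcacod"] == 0x0005
    )
```

For the event shown earlier, `matches_hsw131(0, 0x90000040000f0005)` evaluates to `True`: the valid bit is set, the uncorrected bit is clear, and both error codes match.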

Using the archived system logs I was able to trace back when the errors first appeared: they started showing up after I installed libvirt/QEMU and began running KVM guests on the server. For some reason, the MCEs are triggered at an increased rate when using QEMU. My current solution is to have the kernel ignore the corrected check events. I turned them off using the mce=ignore_ce kernel boot option:

GRUB_CMDLINE_LINUX_DEFAULT="mce=ignore_ce"
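
On Debian-style systems this line normally lives in /etc/default/grub and takes effect after running update-grub and rebooting. Whether the option actually reached the running kernel can be checked from /proc/cmdline. The helper below is a small sketch for that check; the function names are invented for this example.

```python
def ignore_ce_enabled(cmdline):
    """Return True if the kernel command line contains mce=ignore_ce."""
    return any(token == "mce=ignore_ce" for token in cmdline.split())


def running_kernel_ignores_ce(path="/proc/cmdline"):
    """Check the command line of the running kernel (Linux only)."""
    with open(path) as fh:
        return ignore_ce_enabled(fh.read())
```

Calling `running_kernel_ignores_ce()` after the reboot should return `True` if the GRUB change was picked up.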

I think that letting the MCEs generate interrupt storms is more harmful to the server’s stability than turning the check events off in the kernel (I’m still able to see them in SuperMicro’s IPMI).

Ideally, the benign check events would be filtered in the kernel. At least in FreeBSD, the problem has already been addressed:
http://svnweb.freebsd.org/base?view=revision&revision=269052

/*
 * Skip spurious corrected parity errors generated by desktop Haswell
 * (see HSD131 erratum) unless reporting is enabled.
 * Note that these errors also have been observed with D0-stepping,
 * while the revision 014 desktop Haswell specification update only
 * talks about C0-stepping.
 */
if (rec->mr_cpu_vendor_id == CPU_VENDOR_INTEL &&
    rec->mr_cpu_id == 0x306c3 && rec->mr_bank == 0 &&
    rec->mr_status == 0x90000040000f0005 && !intel6h_HSD131)
        return (1);

return (0);
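
The same test could be applied in user space, for example to filter already collected records before alerting on them. The sketch below mirrors the condition from the FreeBSD snippet; the record layout (a dict with `vendor`, `cpu_id`, `bank` and `status` keys) and all function names are assumptions made for this illustration. It also shows that the CPUID signature 0x306c3 decodes to the "Family 6 Model 60" that mcelog printed.

```python
HASWELL_CPU_ID = 0x306C3               # value compared in the FreeBSD snippet
SPURIOUS_STATUS = 0x90000040000F0005   # status compared in the FreeBSD snippet


def family_model(cpu_id):
    """Decode display family and model from a CPUID signature."""
    family = (cpu_id >> 8) & 0xF
    model = (cpu_id >> 4) & 0xF
    if family in (6, 15):
        # The extended model bits extend the model for these families.
        model |= ((cpu_id >> 16) & 0xF) << 4
    if family == 15:
        # The extended family bits are added for family 15.
        family += (cpu_id >> 20) & 0xFF
    return family, model


def should_skip(record, report_spurious=False):
    """Mirror of the quoted check: True means the record is dropped."""
    return (
        record["vendor"] == "Intel"
        and record["cpu_id"] == HASWELL_CPU_ID
        and record["bank"] == 0
        and record["status"] == SPURIOUS_STATUS
        and not report_spurious
    )
```

Like the quoted code, this compares the whole status word, so a record that differs in any other status bit would not be skipped.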

If you’ve been affected by the Haswell check event bug, how did you tackle it? What’s your system configuration?

One thought on “QEMU on Haswell causes spurious MCE events”

  1. The “mce: [Hardware Error]: Machine check events logged” error went away after shutting down the last remaining 32-bit KVM the other day.
