From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Egger, Christoph"
Subject: Re: [PATCH] x86/MCE: allow overriding the CMCI threshold
Date: Mon, 12 Jan 2015 11:18:21 +0100
Message-ID: <54B39F6D.2080805@amazon.de>
References: <54B397600200007800053780@mail.emea.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Received: from mail6.bemta4.messagelabs.com ([85.158.143.247]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1YAcIe-0004gv-H7 for xen-devel@lists.xenproject.org; Mon, 12 Jan 2015 10:32:44 +0000
In-Reply-To: <54B397600200007800053780@mail.emea.novell.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich, xen-devel
Cc: Jinsong Liu
List-Id: xen-devel@lists.xenproject.org

On 2015/01/12 9:44, Jan Beulich wrote:
> We've had reports of systems where CMCIs would surface at a relatively
> high rate during certain periods of time, without them apparently
> causing subsequent more severe problems (see Xeon E7-8800/4800/2800
> specification clarification SC1). Give the admin a knob to lower the
> impact on the system logs.
>
> Signed-off-by: Jan Beulich

One small comment at the bottom; apart from that:
Acked-by: Christoph Egger

>
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -242,6 +242,14 @@ the NMI watchdog is also enabled.
>
>  If set, override Xen's default choice for the platform timer.
>
> +### cmci-threshold
> +> `= `
> +
> +> Default: `2`
> +
> +Specify the event count threshold for raising Corrected Machine Check
> +Interrupts. Specifying zero disables CMCI handling.
> +
>  ### cmos-rtc-probe
>  > `= `
>
> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c
> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
> @@ -492,6 +492,9 @@ static int do_cmci_discover(int i)
>  {
>      unsigned msr = MSR_IA32_MCx_CTL2(i);
>      u64 val;
> +    unsigned int threshold, max_threshold;
> +    static unsigned int cmci_threshold = 2;
> +    integer_param("cmci-threshold", cmci_threshold);
>
>      rdmsrl(msr, val);
>      /* Some other CPU already owns this bank. */
> @@ -500,15 +503,28 @@ static int do_cmci_discover(int i)
>          goto out;
>      }
>
> -    val &= ~CMCI_THRESHOLD_MASK;
> -    wrmsrl(msr, val | CMCI_EN | CMCI_THRESHOLD);
> -    rdmsrl(msr, val);
> +    if ( cmci_threshold )
> +    {
> +        wrmsrl(msr, val | CMCI_EN | CMCI_THRESHOLD_MASK);
> +        rdmsrl(msr, val);
> +    }
>
>      if (!(val & CMCI_EN)) {
>          /* This bank does not support CMCI. Polling timer has to handle it. */
>          mcabanks_set(i, __get_cpu_var(no_cmci_banks));
> +        wrmsrl(msr, val & ~CMCI_THRESHOLD_MASK);
>          return 0;
>      }
> +    max_threshold = MASK_EXTR(val, CMCI_THRESHOLD_MASK);
> +    threshold = cmci_threshold;
> +    if ( threshold > max_threshold )
> +    {
> +        mce_printk(MCE_QUIET,
> +                   "CMCI: threshold %#x too large for CPU%u bank %u, using %#x\n",
> +                   threshold, smp_processor_id(), i, max_threshold);
> +        threshold = max_threshold;
> +    }
> +    wrmsrl(msr, (val & ~CMCI_THRESHOLD_MASK) | CMCI_EN | threshold);
>      mcabanks_set(i, __get_cpu_var(mce_banks_owned));
>  out:
>      mcabanks_clear(i, __get_cpu_var(no_cmci_banks));
> --- a/xen/arch/x86/cpu/mcheck/x86_mca.h
> +++ b/xen/arch/x86/cpu/mcheck/x86_mca.h
> @@ -86,9 +86,6 @@
>  /* Bitfield of MSR_K8_HWCR register */
>  #define K8_HWCR_MCi_STATUS_WREN (1ULL << 18)
>
> -/*Intel Specific bitfield*/
> -#define CMCI_THRESHOLD 0x2
> -
>  #define MCi_MISC_ADDRMOD_MASK (0x7UL << 6)
>  #define MCi_MISC_PHYSMOD (0x2UL << 6)

I think these two are also Intel-specific bitfields.
Please leave the comment in place for those.

Christoph
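
For anyone following the threshold handling in the mce_intel.c hunk, here is a minimal standalone sketch of the probe-and-clamp step: write all ones to the threshold field, read back what the bank accepted, and clamp the requested value to that maximum. It is not the Xen code. The field layout (threshold in bits 14:0, enable in bit 30), the mask_extr() helper standing in for MASK_EXTR, and the fake_msr type with its fake_wrmsr()/fake_rdmsr() accessors are all assumptions made for illustration; no real MSR is touched.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the per-bank CTL2 register: threshold in bits 14:0,
 * CMCI enable in bit 30. Check the real headers before relying on this. */
#define CMCI_THRESHOLD_MASK 0x7fffULL
#define CMCI_EN             (1ULL << 30)

/* Hypothetical stand-in for MASK_EXTR: extract the field selected by
 * mask and shift it down to bit 0. */
static uint64_t mask_extr(uint64_t val, uint64_t mask)
{
    assert(mask != 0);
    return (val & mask) / (mask & -mask);
}

/* Hypothetical simulated MSR: only bits set in 'writable' stick on a
 * write, mimicking unimplemented bits that read back as zero. */
struct fake_msr {
    uint64_t value;
    uint64_t writable;
};

static void fake_wrmsr(struct fake_msr *m, uint64_t v)
{
    m->value = v & m->writable;
}

static uint64_t fake_rdmsr(const struct fake_msr *m)
{
    return m->value;
}

/* Returns the threshold actually programmed, or 0 when the bank does
 * not end up with CMCI enabled. */
static unsigned int program_threshold(struct fake_msr *m,
                                      unsigned int requested)
{
    uint64_t val = fake_rdmsr(m);
    unsigned int max_threshold, threshold = requested;

    if (requested) {
        /* Probe: set every threshold bit and see which ones stick. */
        fake_wrmsr(m, val | CMCI_EN | CMCI_THRESHOLD_MASK);
        val = fake_rdmsr(m);
    }

    if (!(val & CMCI_EN)) {
        /* No CMCI on this bank: clear the probe bits and give up. */
        fake_wrmsr(m, val & ~CMCI_THRESHOLD_MASK);
        return 0;
    }

    max_threshold = (unsigned int)mask_extr(val, CMCI_THRESHOLD_MASK);
    if (threshold > max_threshold)
        threshold = max_threshold;   /* clamp to what the bank supports */

    fake_wrmsr(m, (val & ~CMCI_THRESHOLD_MASK) | CMCI_EN | threshold);
    return threshold;
}
```

With a simulated bank whose writable bits are CMCI_EN | 0x1, a request of 5 is clamped to 1; a bank with no writable bits reports 0, which corresponds to the polling fallback in the quoted code.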