From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754435AbbFRKZc (ORCPT ); Thu, 18 Jun 2015 06:25:32 -0400
Received: from cantor2.suse.de ([195.135.220.15]:58180 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752998AbbFRKZY (ORCPT ); Thu, 18 Jun 2015 06:25:24 -0400
Date: Thu, 18 Jun 2015 12:25:20 +0200
From: Borislav Petkov
To: "Luck, Tony"
Cc: "Wang, Rui Y", "Chen, Gong", "linux-kernel@vger.kernel.org"
Subject: Re: MCE Bug?
Message-ID: <20150618102520.GC1670@pd.tnic>
References: <20150617094155.GB26661@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F32A9E177@ORSMSX114.amr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F32A9E177@ORSMSX114.amr.corp.intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 17, 2015 at 11:53:53PM +0000, Luck, Tony wrote:
> > if you want to give those changes a run, I've uploaded them here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras
>
> Latest experiments show that sometimes checking kventd_up() before calling schedule_work()
> helps ... but mostly only when I fake some early logs from low numbered cpus. I added some
> traces to the real case of a left-over fatal error and got this splat:

Hmm, and calling mce_log from __mcheck_cpu_init_generic() as you
suggested yesterday seems to work on this box here:

[ 1.588713] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (fam: 06, model: 2d, stepping: 07)
[ 1.592727] Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, full-w Broken BIOS detected, complain to your hardware vendor.
[ 1.997344] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
[ 2.000146] Intel PMU driver.
[ 2.001376] ... version: 3
[ 2.002919] ... bit width: 48
[ 2.004626] ... generic registers: 4
[ 2.006137] ... value mask: 0000ffffffffffff
[ 2.008064] ... max period: 0000ffffffffffff
[ 2.010010] ... fixed-purpose events: 3
[ 2.011528] ... event mask: 000000070000000f
[ 2.017257] x86: Booting SMP configuration:
[ 2.019232] .... node #0, CPUs: #1
[ 2.033848] microcode: CPU1 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.038730] mce: [Hardware Error]: Machine check events logged
[ 2.050735] #2
[ 2.050735] microcode: CPU2 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.056163] mce: [Hardware Error]: Machine check events logged
[ 2.068133] #3
[ 2.068140] microcode: CPU3 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.07412.324641] microcode: CPU4 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.479404] #5

Stuff gets logged just fine, no splats later.

Hmmm, more staring...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--