From mboxrd@z Thu Jan 1 00:00:00 1970
From: Corey Minyard
Subject: Re: [PATCH][RT] x86: Fix an RT MCE crash
Date: Tue, 5 Jul 2016 19:59:59 -0500
Message-ID: <577C580F.8010004@acm.org>
References: <3908561D78D1C84285E8C5FCA982C28F3A14CDB9@ORSMSX114.amr.corp.intel.com>
 <57754B71.2000108@acm.org> <20160630170134.GA3932@pd.tnic>
 <57755449.7070302@acm.org> <20160630172611.GC3932@pd.tnic>
 <57755CC6.60506@acm.org> <20160630182257.GD3932@pd.tnic>
 <577576AA.8040004@mvista.com> <20160630203457.GF3932@pd.tnic>
 <5775A181.2050404@acm.org> <20160701072050.GA4593@pd.tnic>
Reply-To: minyard@acm.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Corey Minyard, "Luck, Tony", Steven Rostedt, "linux-rt-users@vger.kernel.org"
To: Borislav Petkov
Return-path:
Received: from mail-pf0-f172.google.com ([209.85.192.172]:35511 "EHLO
 mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751076AbcGFBAD (ORCPT ); Tue, 5 Jul 2016 21:00:03 -0400
Received: by mail-pf0-f172.google.com with SMTP id c2so74954031pfa.2 for ;
 Tue, 05 Jul 2016 18:00:02 -0700 (PDT)
In-Reply-To: <20160701072050.GA4593@pd.tnic>
Sender: linux-rt-users-owner@vger.kernel.org
List-ID:

On 07/01/2016 02:20 AM, Borislav Petkov wrote:
>> That sounds like a bit much.
> Actually, you probably would need only a couple:
>
> 1. 648ed94038c0 ("x86/mce: Provide a lockless memory pool to save error records")
>
> 2. 061120aed708 ("x86/mce: Don't use percpu workqueues")
> - that one is unrelated but should be nice for RT as it gets rid of percpu
> workqueues and I know RT hates them :)
>
> 3. fd4cf79fcc4b ("x86/mce: Remove the MCE ring for Action Optional errors")
> - this one connects the genpool to MCE
>
> 4. f29a7aff4bd6 ("x86/mce: Avoid potential deadlock due to printk() in MCE context")
> - and this is the last one which I meant earlier.
>
> So that's 4 patches, more or less.
>
> Now, you're in the perfect position to test those because you *actually*
> have a real-life system which generates those errors so it is the
> perfect candidate for testing the backports. And you should test them
> with the failing DIMM still in place, of course.

I'm having our hardware people keep the system as-is until we can track
this down.

I applied the above four patches, plus a few more support patches that
were needed, but no love. Exact same issue. Well, almost the same;
here's the traceback:

[ 0.455575] [] try_to_wake_up+0x34/0x300
[ 0.455590] [] ? __hrtimer_start_range_ns+0x226/0x3a0
[ 0.455593] [] wake_up_process+0x10/0x20
[ 0.455615] [] mce_notify_irq+0x28/0x30
[ 0.455621] [] mce_irq_work_cb+0x9/0x10
[ 0.455646] [] irq_work_run_list+0x3c/0x60
[ 0.455649] [] irq_work_tick_soft+0x27/0x30
[ 0.455673] [] run_timer_softirq+0x24/0x250
[ 0.455681] [] do_current_softirqs+0x1ae/0x250
[ 0.455684] [] run_ksoftirqd+0x2e/0x50
[ 0.455697] [] smpboot_thread_fn+0x206/0x320
[ 0.455700] [] ? lg_global_unlock+0x60/0x60
[ 0.455720] [] kthread+0xad/0xc0
[ 0.455740] [] ? _dbgp_external_startup+0x236/0x392
[ 0.455744] [] ? kthread_create_on_node+0x130/0x130
[ 0.455752] [] ret_from_fork+0x4e/0x80
[ 0.455756] [] ? kthread_create_on_node+0x130/0x130

So it crashed in the kthread instead of the irq, but it's exactly the
same issue: that particular field is not initialized.

Not that these aren't patches that look like good ideas.

-corey