Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode

From: Borislav Petkov <bp@amd64.org>
To: Chen Gong <gong.chen@linux.intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@elte.hu>, Borislav Petkov <bp@amd64.org>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Subject: Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
Date: Wed, 7 Sep 2011 15:25:00 +0200
Message-ID: <20110907132500.GA8928@aftab>
In-Reply-To: <4E6709B2.7020401@linux.intel.com>

On Wed, Sep 07, 2011 at 02:05:38AM -0400, Chen Gong wrote:

[..]

> > +	/* known AR MCACODs: */
> > +	MCESEV(
> > +		KEEP, "HT thread notices Action required: data load error",
> > +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> > +		MCGMASK(MCG_STATUS_EIPV, 0)
> > +		),
> > +	MCESEV(
> > +		AR, "Action required: data load error",
> > +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> > +		USER
> > +		),
> 
> I don't think *AR* makes sense here, because the code below assumes
> that it means a *user space* condition. If a new *AR* severity for
> kernel usage is created in the future, we can't distinguish which one
> may call "memory_failure" as below. At least, it should have a suffix
> such as AR_USER/AR_KERN:
> 
> enum severity_level {
> 	MCE_NO_SEVERITY,
> 	MCE_KEEP_SEVERITY,
> 	MCE_SOME_SEVERITY,
> 	MCE_AO_SEVERITY,
> 	MCE_UC_SEVERITY,
> 	MCE_AR_USER_SEVERITY,
> 	MCE_AR_KERN_SEVERITY,
> 	MCE_PANIC_SEVERITY,
> };

Are you saying you need action required handling for when the data load
error happens in kernel space? If so, I don't see how you can replay the
data load (assuming this is a data load from DRAM). In that case, we're
fatal and need to panic. If it is a different type of data load coming
from a lower cache level, then we might be able to recover...?
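
For illustration only, here is a userspace sketch of the policy I have in
mind. All sketch_* names are hypothetical and not from the patch, and it
assumes the only recoverable case is a data load consumed in user mode
with a valid address:

```c
#include <stdbool.h>

/* Hypothetical severities for this sketch, not the patch's enum. */
enum sketch_severity {
	SKETCH_AR_USER,	/* recoverable: poison the page, signal the task */
	SKETCH_PANIC,	/* the load cannot be replayed: fatal */
};

/* Hypothetical, reduced view of a machine check record. */
struct sketch_mce {
	bool in_user;		/* error was consumed while in user mode */
	bool addr_valid;	/* the reported address can be trusted */
};

/* Assumption: anything other than a user-mode load with an address is fatal. */
static enum sketch_severity sketch_classify(const struct sketch_mce *m)
{
	if (m->in_user && m->addr_valid)
		return SKETCH_AR_USER;
	return SKETCH_PANIC;
}
```

With that, a kernel-space data load always falls through to the panic
case, which is why a separate AR_KERN level looks unnecessary to me unless
there is a recoverable kernel case I'm not seeing.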

[..]

> > +	if (worst == MCE_AR_SEVERITY) {
> > +		unsigned long pfn = m.addr >> PAGE_SHIFT;
> > +
> > +		pr_err("Uncorrected hardware memory error in user-access at %llx",
> > +			m.addr);
> 
> Printing in the MCE handler may cause a deadlock? Say other CPUs are
> printing something when they suddenly receive the MCE broadcast from
> the Monarch CPU; when the Monarch CPU then runs the code above, a
> deadlock happens? Please correct me if I'm missing something :-)

Sounds like it can happen if the other CPUs have grabbed some console
semaphore/mutex (I don't know what exactly we're using there) and the
monarch tries to grab it.
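
One way around it, as a rough userspace sketch: never wait for the lock
from the handler, only try it, and stash the message if it is taken. This
assumes a single console lock, which I haven't verified, and every sketch_*
name below is made up:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for whatever lock really protects the console (assumption). */
static atomic_flag sketch_console_lock = ATOMIC_FLAG_INIT;

/* One-slot stash for a message that could not be printed right away. */
static char sketch_pending[128];
static bool sketch_has_pending;

/* Returns true if the message was printed, false if it was deferred. */
static bool sketch_mc_log(const char *msg)
{
	/* test_and_set returns the old value: true means someone holds it. */
	if (atomic_flag_test_and_set(&sketch_console_lock)) {
		/* Never spin here; the holder may be stuck in the same #MC. */
		snprintf(sketch_pending, sizeof(sketch_pending), "%s", msg);
		sketch_has_pending = true;
		return false;
	}
	fputs(msg, stderr);
	atomic_flag_clear(&sketch_console_lock);
	return true;
}
```

Whoever drains sketch_pending would do so later, from a context where
blocking on the lock is fine.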

> > +		if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
> > +			pr_err("Memory error not recovered");
> > +			force_sig(SIGBUS, current);
> > +		} else
> > +			pr_err("Memory error recovered");
> > +	}
>
> As you mentioned in the comment, the biggest concern is that when
> __memory_failure runs too long and another MCE happens at the same
> time (assuming this MCE happens on its sibling CPU, which has the
> same banks), the 2nd MCE will crash the system. Why not defer the
> processing to a safer context, such as by using user_return_notifier?

The user_return_notifier won't work, as we concluded in the last
discussion round: http://marc.info/?l=linux-kernel&m=130765542330349

AFAIR, we want to have a realtime thread dealing with that recovery
so that we exit #MC context as fast as possible. The code then should
be able to deal with a follow-up #MC. Tony, whatever happened to that
approach?
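
To make the idea concrete, a minimal userspace sketch of the handoff: the
handler only queues the pfn and returns, and the thread pops it later and
does the slow recovery. It assumes exactly one producer and one consumer,
and the sketch_ring names and the size are invented for this example:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_RING_SIZE 16u	/* hypothetical; must be a power of two */

struct sketch_ring {
	unsigned long pfn[SKETCH_RING_SIZE];
	atomic_uint head;	/* advanced only by the producer (handler) */
	atomic_uint tail;	/* advanced only by the consumer (thread) */
};

/* Handler side: no locks, no waiting. Returns false if the ring is full. */
static bool sketch_ring_push(struct sketch_ring *r, unsigned long pfn)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == SKETCH_RING_SIZE)
		return false;
	r->pfn[head % SKETCH_RING_SIZE] = pfn;
	/* Release so the consumer sees the slot before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Thread side: returns false if there is nothing queued. */
static bool sketch_ring_pop(struct sketch_ring *r, unsigned long *pfn)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*pfn = r->pfn[tail % SKETCH_RING_SIZE];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

A follow-up #MC then just adds another entry instead of racing with a
recovery that is still running in #MC context. What to do when the ring is
full is left open here.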

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
