public inbox for linux-kernel@vger.kernel.org
From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>, "hpa@zytor.com" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mingo@elte.hu" <mingo@elte.hu>,
	"tglx@linutronix.de" <tglx@linutronix.de>
Subject: Re: [PATCH] [4/4] x86: MCE: Fix EIPV behaviour with !PCC
Date: Fri, 24 Apr 2009 09:27:38 +0900
Message-ID: <49F1077A.5030801@jp.fujitsu.com>
In-Reply-To: <1240479838.6842.555.camel@yhuang-dev.sh.intel.com>

Huang Ying wrote:
> I have added a description to the patch; I hope that makes it clearer.
> 
> Best Regards,
> Huang Ying
> -------------------------------------------------->
> Impact: Spec compliance
> 
> Tolerant level 0 means: always panic on uncorrected errors, that is,
> panic even for recoverable uncorrected errors. This is a useful option
> for those who consider panic a better hardware error containment
> mechanism than trying to recover.
> 
> The current implementation does not comply with the tolerant == 0 spec:
> it tries to recover (by killing the related processes) from
> recoverable uncorrected errors (errors triggered in userspace) when
> tolerant == 0. This patch fixes this by panicking in that case.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  arch/x86/kernel/cpu/mcheck/mce_64.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/cpu/mcheck/mce_64.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_64.c
> @@ -400,7 +400,7 @@ void do_machine_check(struct pt_regs * r
>  		 * force_sig() takes an awful lot of locks and has a slight
>  		 * risk of deadlocking.
>  		 */
> -		if (user_space) {
> +		if (user_space && tolerant > 0) {
>  			force_sig(SIGBUS, current);
>  		} else if (panic_on_oops || tolerant < 2) {
>  			mce_panic("Uncorrected machine check",
> 

Wait, I'd like to confirm something first.

Given:
 * Tolerant levels:
 *   0: always panic on uncorrected errors, log corrected errors

Let's walk do_machine_check():

    void do_machine_check(struct pt_regs * regs, long error_code)
    {
	:
            for (i = 0; i < banks; i++) {
	:
                    rdmsrl(MSR_IA32_MC0_STATUS + i*4, m.status);
                    if ((m.status & MCI_STATUS_VAL) == 0)
                            continue;
	:
                    if ((m.status & MCI_STATUS_UC) == 0)
                            continue;
	:
# From here on, the status is known to have both VAL and UC set.
	:
                    if (m.status & MCI_STATUS_EN) {
                            /* if PCC was set, there's no way out */
                            no_way_out |= !!(m.status & MCI_STATUS_PCC);
                            /*
                             * If this error was uncorrectable and there was
                             * an overflow, we're in trouble.  If no overflow,
                             * we might get away with just killing a task.
                             */
                            if (m.status & MCI_STATUS_UC) {
                                    if (tolerant < 1 || m.status & MCI_STATUS_OVER)
                                            no_way_out = 1;
                                    kill_it = 1;
                            }
                    } else {
                            /*
                             * Machine check event was not enabled. Clear, but
                             * ignore.
                             */
                            continue;
                    }
	:
# Hmm, this second UC check is redundant and should be removed...
# Anyway, with tolerant == 0, no_way_out == 1 whenever the event is enabled,
# and kill_it == 1 unless no event is enabled at all.
# Therefore, with tolerant == 0, we always have "no_way_out == kill_it".
	:
                    }
            }
	:
            if (no_way_out && tolerant < 3)
                    mce_panic("Machine check", &panicm, mcestart);
	:
# With tolerant == 0, we normally panic here.
	:
            if (kill_it && tolerant < 3) {
                    int user_space = 0;

                    /*
                     * If the EIPV bit is set, it means the saved IP is the
                     * instruction which caused the MCE.
                     */
                    if (m.mcgstatus & MCG_STATUS_EIPV)
                            user_space = panicm.ip && (panicm.cs & 3);

                    /*
                     * If we know that the error was in user space, send a
                     * SIGBUS.  Otherwise, panic if tolerance is low.
                     *
                     * force_sig() takes an awful lot of locks and has a slight
                     * risk of deadlocking.
                     */
                    if (user_space) {
                            force_sig(SIGBUS, current);
                    } else if (panic_on_oops || tolerant < 2) {
                            mce_panic("Uncorrected machine check",
                                    &panicm, mcestart);
                    }
            }
	:
# So when would we ever get here with tolerant == 0?
	:
    }

Or is this patch meant to be applied on top of some of Andi's patches?
That would mean it targets a bug in Andi's patch set, and that the bug is
not in 2.6.30-rc* yet.
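
The walk through do_machine_check() above can be condensed into a small
standalone model, which makes the question easier to check. This is only
a sketch under stated assumptions: model_mce(), enum mce_outcome,
eipv_user and count_sigbus_at_tolerant0() are names made up for
illustration, the model looks at a single bank, and it ignores everything
elided with ":" in the walk. The MCI_STATUS_* bit positions are the
architectural ones.

```c
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS. */
#define MCI_STATUS_VAL  (1ULL << 63)
#define MCI_STATUS_OVER (1ULL << 62)
#define MCI_STATUS_UC   (1ULL << 61)
#define MCI_STATUS_EN   (1ULL << 60)
#define MCI_STATUS_PCC  (1ULL << 57)

/* Hypothetical outcome labels for this model; these are not kernel names. */
enum mce_outcome {
	MCE_NOTHING,		/* bank skipped: not VAL, not UC, or not EN */
	MCE_PANIC_EARLY,	/* mce_panic("Machine check", ...) */
	MCE_SIGBUS,		/* force_sig(SIGBUS, current) */
	MCE_PANIC_LATE,		/* mce_panic("Uncorrected machine check", ...) */
	MCE_CONTINUE,		/* kill_it set, but neither signal nor panic */
};

/*
 * Single-bank model of the decision logic. eipv_user stands in for
 * "MCG_STATUS_EIPV is set and the saved CS is ring 3" (an assumption of
 * this sketch, not a kernel variable).
 */
enum mce_outcome model_mce(uint64_t status, int tolerant,
			   int panic_on_oops, int eipv_user)
{
	int no_way_out = 0, kill_it = 0;

	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
		return MCE_NOTHING;
	if (!(status & MCI_STATUS_EN))
		return MCE_NOTHING;

	no_way_out |= !!(status & MCI_STATUS_PCC);
	if (tolerant < 1 || (status & MCI_STATUS_OVER))
		no_way_out = 1;
	kill_it = 1;

	if (no_way_out && tolerant < 3)
		return MCE_PANIC_EARLY;

	if (kill_it && tolerant < 3) {
		if (eipv_user)
			return MCE_SIGBUS;
		if (panic_on_oops || tolerant < 2)
			return MCE_PANIC_LATE;
	}
	return MCE_CONTINUE;
}

/*
 * Try every combination of the five status bits, panic_on_oops and
 * eipv_user with tolerant == 0, and count how often SIGBUS is reached.
 */
int count_sigbus_at_tolerant0(void)
{
	static const uint64_t bits[] = {
		MCI_STATUS_VAL, MCI_STATUS_OVER, MCI_STATUS_UC,
		MCI_STATUS_EN, MCI_STATUS_PCC,
	};
	int hits = 0;

	for (unsigned int mask = 0; mask < 32; mask++) {
		uint64_t status = 0;

		for (int b = 0; b < 5; b++)
			if (mask & (1u << b))
				status |= bits[b];

		for (int oops = 0; oops < 2; oops++)
			for (int user = 0; user < 2; user++)
				if (model_mce(status, 0, oops, user) == MCE_SIGBUS)
					hits++;
	}
	return hits;
}
```

In this model count_sigbus_at_tolerant0() returns 0: with tolerant == 0,
every enabled uncorrected error already ends in the first mce_panic(), so
the added "tolerant > 0" test never changes the outcome. Whether that also
holds for the real code with Andi's patches applied is exactly what I am
asking.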


Thanks,
H.Seto

