Re: [RFC][PATCH v2] Controlling kexec behaviour when hardware error happened.


From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Borislav Petkov <bp@alien8.de>,
	Seiji Aguchi <seiji.aguchi@hds.com>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"andi@firstfloor.org" <andi@firstfloor.org>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"gregkh@suse.de" <gregkh@suse.de>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"dle-develop@lists.sourceforge.net" 
	<dle-develop@lists.sourceforge.net>,
	"amwang@redhat.com" <amwang@redhat.com>,
	Satoru Moriya <satoru.moriya@hds.com>
Subject: Re: [RFC][PATCH v2] Controlling kexec behaviour when hardware error happened.
Date: Mon, 14 Feb 2011 10:20:57 +0900
Message-ID: <4D588379.4050209@jp.fujitsu.com>
In-Reply-To: <20110210091408.GA10553@liondog.tnic>

(2011/02/10 18:14), Borislav Petkov wrote:
> On Thu, Feb 10, 2011 at 05:36:58PM +0900, Hidetoshi Seto wrote:
>> (2011/02/10 1:35), Seiji Aguchi wrote:
> 
> [..]
> 
>>> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
>>> index d916183..e76b47b 100644
>>> --- a/arch/x86/kernel/cpu/mcheck/mce.c
>>> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
>>> @@ -944,6 +944,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>>  
>>>  	percpu_inc(mce_exception_count);
>>>  
>>> +	hwerr_flag = 1;
>>> +
>>>  	if (notify_die(DIE_NMI, "machine check", regs, error_code,
>>>  			   18, SIGKILL) == NOTIFY_STOP)
>>>  		goto out;
>>
>> Now x86 supports some recoverable machine check, so setting
>> flag here will prevent running kexec on systems that have
>> encountered such recoverable machine check and recovered.
>>
>> I think mce_panic() is proper place to set this flag "hwerr_flag".
> 
> I agree, in that case it is unsafe to run kexec only after the error
> cannot be recovered by software.
> 
> Also, hwerr_flag is really a bad naming choice, how about
> "hwerr_unrecoverable" or "hw_compromised" or "recovery_futile" or
> "hw_incurable" or simply say what happened: "pcc" = processor context
> corrupt (and a reliable restarting might not be possible). This could be
> used by others too, besides kexec.

Or how about something like hwerr_panic(), to make it clear that the
panic was requested because of a hardware error?

Anyway, Aguchi-san, please note that we should not turn off kexec before
encountering a fatal hardware error, nor before printing/transmitting
enough of the hardware error log out of this system.
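
To illustrate the ordering, here is a minimal userspace sketch, not
kernel code. Every name in it (hwerr_fatal, flush_hwerr_log(),
model_panic(), model_machine_check(), kexec_allowed()) is hypothetical,
and hwerr_panic() is only the helper suggested above, not an existing
function. The point is that the flag is set on the fatal path only, and
only after the error log has been pushed out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: true only after a fatal hardware error. */
static bool hwerr_fatal;

/* Hypothetical marker that the error log has left the system. */
static bool hwerr_log_flushed;

/* Stand-in for panic(): records the message instead of halting. */
static const char *last_panic_msg = NULL;

static void model_panic(const char *msg)
{
	last_panic_msg = msg;
}

/* Stand-in for printing/transmitting the hardware error log. */
static void flush_hwerr_log(void)
{
	hwerr_log_flushed = true;
}

/*
 * Hypothetical hwerr_panic(): get the log out first, then mark the
 * hardware as compromised, then panic.
 */
static void hwerr_panic(const char *msg)
{
	flush_hwerr_log();
	hwerr_fatal = true;
	model_panic(msg);
}

/* Model of a machine check handler: recovered errors leave the flag alone. */
static void model_machine_check(bool recoverable)
{
	if (recoverable)
		return;
	hwerr_panic("Fatal machine check");
}

/* kexec stays available until a fatal hardware error has been seen. */
static bool kexec_allowed(void)
{
	return !hwerr_fatal;
}
```

With this shape, a call to model_machine_check(true) leaves
kexec_allowed() returning true, and only model_machine_check(false)
turns it off.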

> 
> [..]
> 
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 0207c2f..0178f47 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -994,6 +994,8 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
>>>  	int res;
>>>  	unsigned int nr_pages;
>>>  
>>> +	hwerr_flag = 1;
>>> +
>>>  	if (!sysctl_memory_failure_recovery)
>>>  		panic("Memory failure from trap %d on page %lx", trapno, pfn);
>>>  
>>
>> For similar reason, setting flag here is not good for
>> systems working after isolating some poisoned memory page.
>>
>> Why not:
>>  if (!sysctl_memory_failure_recovery) {
>>  	hwerr_flag = 1;
>>  	panic("Memory failure from trap %d on page %lx", trapno, pfn);
>>  }
> 
> Why do we need that in memory-failure.c at all? I mean, when we consume
> the UC, we'll end up in mce_panic() anyway.

One possible answer is that memory-failure.c is not x86-specific, so it
can be reached on architectures that have no mce_panic().
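
A rough userspace model of that arch-independent path, assuming the
"Why not" variant quoted above. All names here (mf_hwerr_fatal,
mf_panicked, mf_recovery_enabled, mf_model_panic(),
model_memory_failure()) are made up for illustration and only mirror
the shape of the quoted code; the real function does much more.

```c
#include <stdbool.h>

/* Hypothetical flag: set only when the failure is treated as fatal. */
static bool mf_hwerr_fatal;

/* Records that the model "panicked" (a real panic() never returns). */
static bool mf_panicked;

/* Models sysctl_memory_failure_recovery; recovery is on in this sketch. */
static bool mf_recovery_enabled = true;

static void mf_model_panic(void)
{
	mf_panicked = true;
}

/*
 * Model of the memory failure entry point. It contains no x86 code, so
 * the flag is set here without relying on an arch-specific panic helper.
 * Returns 0 when the page is handled by recovery, -1 on the fatal path.
 */
static int model_memory_failure(unsigned long pfn)
{
	(void)pfn;	/* the model does not track individual pages */

	if (!mf_recovery_enabled) {
		mf_hwerr_fatal = true;
		mf_model_panic();
		return -1;
	}

	/* Recovery path: the poisoned page is isolated, flag untouched. */
	return 0;
}
```

A system that keeps running after isolating a poisoned page never sets
the flag in this model; only the recovery-disabled path does.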


Thanks,
H.Seto

