All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Tony Luck <tony.luck@intel.com>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	Borislav Petkov <bp@amd64.org>,
	Chen Gong <gong.chen@linux.intel.com>,
	"Huang, Ying" <ying.huang@intel.com>
Subject: Re: [PATCH 5/6] x86, mce: handle "action required" errors
Date: Fri, 16 Dec 2011 09:14:53 +0900	[thread overview]
Message-ID: <4EEA8D7D.2030409@jp.fujitsu.com> (raw)
In-Reply-To: <f2d05667872c463ead35141ce0ad97818113e69d.1323979146.git.tony.luck@intel.com>

(2011/12/16 4:02), Tony Luck wrote:
> All non-urgent actions (reporting low severity errors and handling
> "action-optional" errors) are now handled by a work queue. This
> means that TIF_MCE_NOTIFY can be used to block execution for a
> thread experiencing an "action-required" fault until we get all
> cpus out of the machine check handler (and the thread that hit
> the fault into mce_notify_process().
> 
> We use the new mce_{save,find,clear}_info() API to get information
> from do_machine_check() to mce_notify_process(), and then use the
> newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
> the error (possibly signalling the process).
> 
> Update some comments to make the new code flows clearer.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   70 ++++++++++++++++++++++++--------------
>  1 files changed, 44 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 6dfab72..2e10b73 100644

(snip)

... 2 patches in a mail?

> --
> 1.7.3.1
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   71 ++++++++++++++++++++++++--------------
>  1 files changed, 45 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7d7303a..dfd20a6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -982,7 +982,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	barrier();
>  
>  	/*
> -	 * When no restart IP must always kill or panic.
> +	 * When no restart IP might need to kill or panic.
> +	 * Assume the worst for now, but if we find the
> +	 * severity is MCE_AR_SEVERITY we have other options.
>  	 */
>  	if (!(m.mcgstatus & MCG_STATUS_RIPV))
>  		kill_it = 1;
> @@ -1036,12 +1038,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  			continue;
>  		}
>  
> -		/*
> -		 * Kill on action required.
> -		 */
> -		if (severity == MCE_AR_SEVERITY)
> -			kill_it = 1;
> -
>  		mce_read_aux(&m, i);
>  
>  		/*
> @@ -1062,6 +1058,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		}
>  	}
>  
> +	m = *final;
> +
>  	if (!no_way_out)
>  		mce_clear_state(toclear);
>  

Small change, but again, you should describe reason why...

> @@ -1080,7 +1078,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	 * support MCE broadcasting or it has been disabled.
>  	 */
>  	if (no_way_out && tolerant < 3)
> -		mce_panic("Fatal machine check on current CPU", final, msg);
> +		mce_panic("Fatal machine check on current CPU", &m, msg);
>  
>  	/*
>  	 * If the error seems to be unrecoverable, something should be
> @@ -1089,11 +1087,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	 * high, don't try to do anything at all.
>  	 */
>  
> -	if (kill_it && tolerant < 3)
> +	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
>  		force_sig(SIGBUS, current);
>  
> -	/* notify userspace ASAP */
> -	set_thread_flag(TIF_MCE_NOTIFY);
> +	if (worst == MCE_AR_SEVERITY) {
> +		mce_save_info(m.addr);
> +		set_thread_flag(TIF_MCE_NOTIFY);
> +	}

I know tolerant==3 is an insane option, but it is better to care about
it here too (or it would be happy if we can remove tolerant completely).

e.g.
	if (tolerant < 3) {
		if (no_way_out)
			mce_panic(...);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

>  
>  	if (worst > 0)
>  		mce_report_event(regs);
> @@ -1107,6 +1107,8 @@ EXPORT_SYMBOL_GPL(do_machine_check);
>  #ifndef CONFIG_MEMORY_FAILURE
>  int memory_failure(unsigned long pfn, int vector, int flags)
>  {
> +	if (flags & MF_ACTION_REQUIRED)
> +		return -ENXIO; /* panic? */
>  	printk(KERN_ERR "Uncorrected memory error in page 0x%lx ignored\n"
>  		"Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling\n", pfn);
>  
> @@ -1115,27 +1117,46 @@ int memory_failure(unsigned long pfn, int vector, int flags)
>  #endif
>  
>  /*
> - * Called after mce notification in process context. This code
> - * is allowed to sleep. Call the high level VM handler to process
> - * any corrupted pages.
> - * Assume that the work queue code only calls this one at a time
> - * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> + * Called in process context that interrupted by MCE and marked with
> + * TIF_MCE_NOTFY, just before returning to errorneous userland.

Spell checker suggests:	                      erroneous

> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.
> + * Action required errors are handled here.
>   */
>  void mce_notify_process(void)
>  {
>  	unsigned long pfn;
> -	mce_notify_irq();
> -	while (mce_ring_get(&pfn))
> -		memory_failure(pfn, MCE_VECTOR, 0);
> +	struct mce_info *mi = mce_find_info();
> +
> +	if (!mi)
> +		mce_panic("Lost address", NULL, NULL);

The message is too short, isn't it?

And if this case is an another version of "Memory error not recovered"
located below then why not force_sig() but mce_panic()?

> +	pfn = mi->paddr >> PAGE_SHIFT;
> +
> +	clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +	pr_err("Uncorrected hardware memory error in user-access at %llx",
> +		 mi->paddr);
> +	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
> +		pr_err("Memory error not recovered");
> +		force_sig(SIGBUS, current);
> +	} else {
> +		pr_err("Memory error recovered");
> +	}
> +	mce_clear_info(mi);
>  }
>  
> +/*
> + * Action optional processing happens here (picking up
> + * from the list of faulting pages that do_machine_check()
> + * placed into the "ring").
> + */
>  static void mce_process_work(struct work_struct *dummy)
>  {
> -	mce_notify_process();
> +	unsigned long pfn;
> +
> +	while (mce_ring_get(&pfn))
> +		memory_failure(pfn, MCE_VECTOR, 0);
>  }
>  
>  #ifdef CONFIG_X86_MCE_INTEL
> @@ -1225,8 +1246,6 @@ int mce_notify_irq(void)
>  	/* Not more than two messages every minute */
>  	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>  
> -	clear_thread_flag(TIF_MCE_NOTIFY);
> -
>  	if (test_and_clear_bit(0, &mce_need_notify)) {
>  		/* wake processes polling /dev/mcelog */
>  		wake_up_interruptible(&mce_chrdev_wait);

Thanks,
H.Seto


  reply	other threads:[~2011-12-16  0:15 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-15 19:59 [PATCH 0/6] x86, mce: machine check recovery for applications [updated] Tony Luck
2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
2011-12-13 17:27 ` [PATCH 2/6] HWPOISON: Add code to handle "action required" errors Tony Luck
2011-12-13 17:48 ` [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed Tony Luck
2011-12-14 23:55 ` [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler Tony Luck
2011-12-16  0:13   ` Hidetoshi Seto
2011-12-16  8:14   ` Borislav Petkov
2011-12-15 18:48 ` [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure() Tony Luck
2011-12-16  8:17   ` Borislav Petkov
2011-12-15 19:02 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
2011-12-16  0:14   ` Hidetoshi Seto [this message]
2011-12-16  0:29     ` Tony Luck
2011-12-16  0:22       ` [PATCH 5/6] x86, mce: handle "action required" errors (unjumbled version) Tony Luck
2011-12-16 16:35         ` Borislav Petkov
2011-12-17 19:25           ` Tony Luck
2011-12-16  0:51     ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
2011-12-16 23:36       ` [PATCH 5/6] x86, mce: handle "action required" errors (new version) Tony Luck
  -- strict thread matches above, loose matches on Subject: below --
2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
2011-12-14  9:28   ` Chen Gong
2011-12-14 21:30     ` Tony Luck
2011-12-15  2:56       ` Chen Gong
2011-12-14 16:04   ` Borislav Petkov
2011-12-14 19:05     ` Luck, Tony

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4EEA8D7D.2030409@jp.fujitsu.com \
    --to=seto.hidetoshi@jp.fujitsu.com \
    --cc=bp@amd64.org \
    --cc=gong.chen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=tony.luck@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.