All of lore.kernel.org
 help / color / mirror / Atom feed
From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
To: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Youquan Song <youquan.song@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: Re: [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered"
Date: Wed, 12 Jan 2022 21:11:45 +0900	[thread overview]
Message-ID: <20220112121145.GA889650@u2004> (raw)
In-Reply-To: <20220107194450.1687264-1-tony.luck@intel.com>

On Fri, Jan 07, 2022 at 11:44:50AM -0800, Tony Luck wrote:
> From: Youquan Song <youquan.song@intel.com>
> 
> When an uncorrected memory error is consumed there is a race between
> the CMCI from the memory controller reporting an uncorrected error
> with a UCNA signature, and the core reporting and SRAR signature
> machine check when the data is about to be consumed.
> 
> If the CMCI wins that race, the page is marked poisoned when
> uc_decode_notifier() calls memory_failure() and the machine
> check processing code finds the page already poisoned. It calls
> kill_accessing_process() to make sure a SIGBUS is sent. But
> returns the wrong error code.
> 
> Console log looks like this:
> 
> [34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
> [34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
> [34775.690310] Memory failure: 0x3710b3: already hardware poisoned
> [34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
> [34775.706072] mce: Memory error not recovered
> 
> Fix kill_accessing_process() to return -EHWPOISON to avoid the noise
> message "Memory error not recovered" and skip duplicate SIGBUS.
> 
> [Tony: Reworded some parts of commit message]
> 
> Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> This code is very subtle ... the fix makes the "not recovered" message
> go away ... but I'm not more than 75% confident that this is the right
> fix. Please check carefully :-)
> 
>  mm/memory-failure.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3a274468f193..a67f558b08ea 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -707,7 +707,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>  	if (ret == 1 && priv.tk.addr)
>  		kill_proc(&priv.tk, pfn, flags);
>  	mmap_read_unlock(p->mm);
> -	return ret ? -EFAULT : -EHWPOISON;

Thank you for reporting, the original code was wrong and the trinary operation
returns opposite code.  -EHWPOISON here is to notify kill_me_maybe() that it
does not have to send SIGBUS in its own because kill_accessing_process() already
sent SIGBUS with proper virtual address info.

> +
> +	return (ret < 0) ? -EFAULT : -EHWPOISON;

It seems to me that this returns -EHWPOISON whether any hwpoison entry is
found or not, so this fix can cause another issue.
We want to return -EHWPOISON only when kill_accessing_process() sent SIGBUS,
so I think that the following diff should be what we want.
Could you check this fix works for you?

Thanks,
Naoya Horiguchi
---
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 14ae5c18e776..4c9bd1d37301 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,8 +707,10 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 			      (void *)&priv);
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
+	else
+		ret = 0;
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+	return ret > 0 ? -EHWPOISON : -EFAULT;
 }
 
 static const char *action_name[] = {


  reply	other threads:[~2022-01-12 12:11 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-07 19:44 [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered" Tony Luck
2022-01-12 12:11 ` Naoya Horiguchi [this message]
2022-01-13  2:00   ` Luck, Tony
2022-01-13  6:52     ` HORIGUCHI NAOYA(堀口 直也)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220112121145.GA889650@u2004 \
    --to=naoya.horiguchi@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=tony.luck@intel.com \
    --cc=youquan.song@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.