Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

From: Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
Subject: Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
Date: Mon, 23 Apr 2018 16:03:56 +0530
Message-ID: <cb831682-8cc6-4cd7-9778-52e5f4ad5af0@linux.vnet.ibm.com>
In-Reply-To: <CAKTCnznTiS2kQThYrCALGGKb8DfAu-diYzrCbOzwO2vd7Cfw9Q@mail.gmail.com>

On 04/23/2018 12:21 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988]     Effective address: 00007fffd2755940
>> [ 1476.541989]     Physical address:  000020181a080000
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006]     Effective address: 00007fffd2755940
>> [ 1476.542007]     Physical address:  000020181a080000
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015]     Effective address: 00007fffd2755940
>> [ 1476.542016]     Physical address:  000020181a080000
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
> 
> How did you test for this? 

By injecting a cache SUE through the L2 FIR register (0x1001080c).

> If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?

We hook it up to memory_failure() through a work queue. Until the work
queue runs, the application keeps restarting at the same NIP and hitting
the same error. Each MCE queues the same address to the memory failure
work queue again and logs another "Recovered" MCE message for that
address. Once memory_failure() hwpoisons the page, the application gets
SIGBUS and then we are fine.
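
To make the ordering concrete, here is a toy user-space model of the
sequence described above. It is not kernel code: simulate_user_ue() and
its variables are hypothetical names chosen for illustration, and the
page frame number is simply the one from the log in the patch
description. It only shows why N MCEs before the worker runs produce one
"Recovered" memory-failure line followed by N-1 "already hardware
poisoned" lines.

```python
from collections import deque


def simulate_user_ue(mces_before_worker):
    """Toy model: a task re-hits the same UE before the work queue runs."""
    work_queue = deque()
    poisoned = set()
    log = []
    pfn = 0x20181A08  # page frame number taken from the quoted log

    # Each MCE queues the same page and reports handled = 1, so the task
    # restarts at the same NIP and takes the MCE again.
    for _ in range(mces_before_worker):
        work_queue.append(pfn)
        log.append("Severe Machine check interrupt [Recovered]")

    # The worker finally runs and drains everything that was queued.
    sigbus_sent = False
    while work_queue:
        page = work_queue.popleft()
        if page in poisoned:
            log.append(f"Memory failure: {page:#x}: already hardware poisoned")
        else:
            poisoned.add(page)
            sigbus_sent = True  # only now does the task stop re-faulting
            log.append(
                f"Memory failure: {page:#x}: "
                "recovery action for dirty LRU page: Recovered"
            )
    return log, sigbus_sent
```

Calling simulate_user_ue(3) gives three MCE lines, then one recovery
line and two "already hardware poisoned" lines, which has the same shape
as the log without the patch.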

But for a UE in kernel space, if the early machine check handler
machine_check_early() returns as recovered, then
machine_check_handle_early() queues up the MCE event and resumes from the
NIP assuming it is safe, which causes an MCE loop. So, for a UE in the
kernel we end up in a hard lockup.
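
The kernel-space case can be sketched the same way. Again this is only a
toy model under stated assumptions: run_kernel_access() is a
hypothetical name, and watchdog_limit is an invented stand-in for the
hard lockup watchdog, not a real kernel parameter. It shows that the
only thing deciding between one MCE and an endless loop is the
"handled" value returned by the early handler.

```python
def run_kernel_access(handler_says_handled, watchdog_limit=1000):
    """Return (mce_count, outcome) for a kernel load that hits a UE."""
    mce_count = 0
    while True:
        mce_count += 1  # the access at NIP hits the UE and raises an MCE
        if not handler_says_handled:
            # handled = 0: the error is reported as not recovered and the
            # faulting access is not retried.
            return mce_count, "not recovered"
        # handled = 1: execution resumes at the same NIP and faults again.
        if mce_count >= watchdog_limit:
            return mce_count, "hard lockup"
```

With handler_says_handled=True the model only stops when the stand-in
watchdog fires; with False it stops after the first MCE.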

> Why shouldn't we treat it as handled if we isolate the page?

Yes we should, but I think not until the page is actually hwpoisoned or
until we send SIGBUS to the process.

> 
> Thanks,
> Balbir Singh.
>

Thread overview: 10+ messages
2018-04-23  4:59 [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE Mahesh J Salgaonkar
2018-04-23  6:51 ` Balbir Singh
2018-04-23  9:23   ` Balbir Singh
2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar [this message]
2018-04-23 11:14     ` Balbir Singh
2018-04-23 13:01       ` Nicholas Piggin
2018-04-23 23:00         ` Balbir Singh
2018-04-23 13:01       ` Mahesh Jagannath Salgaonkar
2018-04-23 23:41 ` Balbir Singh
2018-04-25  2:55 ` Michael Ellerman
