Re: [PATCH v5 1/3] arm64/ras: support sea error recovery

From: James Morse <james.morse@arm.com>
To: Xie XiuQi <xiexiuqi@huawei.com>
Cc: catalin.marinas@arm.com, will.deacon@arm.com, mingo@redhat.com,
	mark.rutland@arm.com, ard.biesheuvel@linaro.org,
	Dave.Martin@arm.com, takahiro.akashi@linaro.org,
	tbaicar@codeaurora.org, stephen.boyd@linaro.org, bp@suse.de,
	julien.thierry@arm.com, shiju.jose@huawei.com,
	zjzhang@codeaurora.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	wangxiongfeng2@huawei.com, zhengqiang10@huawei.com,
	gengdongjiu@huawei.com, huawei.libin@huawei.com,
	wangkefeng.wang@huawei.com, lijinyue@huawei.com,
	guohanjun@huawei.com, hanjun.guo@linaro.org,
	cj.chengjian@huawei.com
Subject: Re: [PATCH v5 1/3] arm64/ras: support sea error recovery
Date: Wed, 07 Feb 2018 19:03:35 +0000
Message-ID: <5A7B4D87.9020207@arm.com>
In-Reply-To: <5A70C536.7040208@arm.com>

Hi Xie XiuQi,

On 30/01/18 19:19, James Morse wrote:
> On 26/01/18 12:31, Xie XiuQi wrote:
>> With the ARMv8.2 RAS Extension, SEAs are usually triggered when memory
>> errors are consumed. Under the existing handling, an error consumed in the
>> kernel leads directly to a panic; if it is consumed in user-space, we
>> just kill the process.
>>
>> But there is a class of errors for which it is not necessary to kill
>> the process: you can recover and let the process continue to run. For
>> example, when instruction data is corrupted, the memory page might be
>> read-only and unmodified, so the disk might still have the
>> correct data. You can then simply drop the page and reload it when
>> necessary.
> 
> With firmware-first support, we do all this...
> 
> 
>> So this patchset tries to solve this problem: if the error is
>> consumed in user-space and occurs on a clean page, you can
>> directly drop the memory page without killing the process.
>>
>> If the corrupted page is clean, just drop it and return to user-space
>> without side effects. If the corrupted page is dirty, memory_failure()
>> will send SIGBUS with code=BUS_MCEERR_AR. Without this patchset,
>> do_sea() just sends SIGBUS, so the process is killed in the same place.
> 
> ... but this happens too. I agree it's something we should fix, but I don't think
> this is the best way to do it.
> 
> This series is pulling the memory-failure-queue details back into the arch-code
> to build a second list, that gets processed as extra work when we return to
> user-space.
> 
> 
> The root of the issue is ghes_notify_sea() claims the notification as something
> APEI has dealt with, ... but it hasn't done it yet. The signals will be
> generated by something currently stuck in a queue. (Evidently x86 doesn't handle
> synchronous errors like this using firmware-first).
> 
> I think a smaller fix is to give the queues that may be holding the
> memory_failure() work a kick as part of the code that calls ghes_notify_sea().
> This means that by the time we return to do_sea() ghes_notify_sea()'s claim that
> APEI has dealt with it is true as any generated signals are pending. We can then
> skip the existing SIGBUS generation code.
> 
> 
>> Because memory_failure() may sleep, we can not call it directly in SEA
> 
> (this one is more serious, I've attempted to fix it by moving all NMI-like
> GHES-notifications to use the estatus queue).
> 
> 
>> exception context. So we save the faulting physical address associated with
>> a process in the ghes handler and set __TIF_SEA_NOTIFY. When we return
>> from SEA exception context and get into do_notify_resume() before the
>> process runs, we can check it and call memory_failure() to do
>> recovery.
> 
>> It's safe, because we are in process context.
> 
> I think this is the trick. When we take a Synchronous-external-abort out of
> userspace, we're in process context too. We can add helpers to drain the
> memory_failure_queue which can be called from do_sea() when we know we're
> preemptible and interrupts-et-al are unmasked.

Something like... based on [0], in arch/arm64/kernel/acpi.c:
-----------------%<-----------------
int apei_claim_sea(struct pt_regs *regs)
{
        int cpu;
        int err = -ENOENT;
        unsigned long current_flags = arch_local_save_flags();
        unsigned long interrupted_flags = current_flags;

        if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA))
                return err;

        if (regs)
                interrupted_flags = regs->pstate;

        /*
         * APEI expects an NMI-like notification to always be called
         * in NMI context.
         */
        local_daif_restore(DAIF_ERRCTX);
        nmi_enter();
        err = ghes_notify_sea();
        cpu = smp_processor_id();
        nmi_exit();

        /*
         * APEI NMI-like notifications are deferred to irq_work. Unless
         * we interrupted irqs-masked code, we can do that now.
         */
        if (!err) {
                if (!arch_irqs_disabled_flags(interrupted_flags)) {
                        local_daif_restore(DAIF_PROCCTX_NOIRQ);
                        irq_work_run();
                } else {
                        err = -EINPROGRESS;
                }
        }

        local_daif_restore(current_flags);

        if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE) && !err) {
                /*
                 * Memory failure work is scheduled on the local CPU.
                 * If we interrupted userspace, or are in process context
                 * we can do that now.
                 */
                if ((regs && !user_mode(regs)) || !preemptible())
                        err = -EINPROGRESS;
                else
                        memory_failure_queue_kick(cpu);
        }

        return err;
}
-----------------%<-----------------


and to mm/memory-failure.c:
-----------------%<-----------------
@@ -1355,7 +1355,7 @@ static void memory_failure_work_func(struct work_struct *work)
        unsigned long proc_flags;
        int gotten;

-       mf_cpu = this_cpu_ptr(&memory_failure_cpu);
+       mf_cpu = container_of(work, struct memory_failure_cpu, work);
        for (;;) {
                spin_lock_irqsave(&mf_cpu->lock, proc_flags);
                gotten = kfifo_get(&mf_cpu->fifo, &entry);

@@ -1369,6 +1369,22 @@ static void memory_failure_work_func(struct work_struct *work)
        }
 }

+/*
+ * Process memory_failure work queued on the specified CPU.
+ * Used to avoid return-to-userspace racing with the memory_failure workqueue.
+ */
+void memory_failure_queue_kick(int cpu)
+{
+       unsigned long flags;
+       struct memory_failure_cpu *mf_cpu;
+
+       might_sleep();
+
+       mf_cpu = &per_cpu(memory_failure_cpu, cpu);
+       cancel_work_sync(&mf_cpu->work);
+       memory_failure_work_func(&mf_cpu->work);
+}
+
 static int __init memory_failure_init(void)
 {
        struct memory_failure_cpu *mf_cpu;
-----------------%<-----------------
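
As a rough userspace illustration of the diff above (a sketch, not kernel
code): once the work function locates its queue with container_of() instead
of this_cpu_ptr(), it can drain a named CPU's queue from whichever context
calls it. Everything below (struct mf_cpu, mf_queue(), mf_work_func(),
mf_queue_kick(), NR_CPUS, FIFO_LEN) is a hypothetical stand-in, and clearing
the 'pending' flag only models what cancel_work_sync() would do.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

#define NR_CPUS  4      /* hypothetical, for this model only */
#define FIFO_LEN 8      /* hypothetical queue depth */

struct work {                   /* stand-in for struct work_struct */
        int pending;
};

struct mf_cpu {                 /* stand-in for struct memory_failure_cpu */
        unsigned long fifo[FIFO_LEN];
        unsigned int head, tail;
        struct work work;
};

static struct mf_cpu mf_cpus[NR_CPUS];
static int handled;             /* number of entries processed so far */

/* Queue a pfn on @cpu's fifo and mark that CPU's work as pending. */
static int mf_queue(int cpu, unsigned long pfn)
{
        struct mf_cpu *q = &mf_cpus[cpu];

        if (q->tail - q->head == FIFO_LEN)
                return -1;      /* fifo full */
        q->fifo[q->tail++ % FIFO_LEN] = pfn;
        q->work.pending = 1;
        return 0;
}

/* Drain the queue that embeds @w, regardless of which CPU runs this. */
static void mf_work_func(struct work *w)
{
        struct mf_cpu *q = container_of(w, struct mf_cpu, work);

        while (q->head != q->tail) {
                (void)q->fifo[q->head++ % FIFO_LEN]; /* would call memory_failure() */
                handled++;
        }
}

/* Synchronously process whatever is queued for @cpu. */
static void mf_queue_kick(int cpu)
{
        struct mf_cpu *q;

        assert(cpu >= 0 && cpu < NR_CPUS);
        q = &mf_cpus[cpu];
        q->work.pending = 0;    /* models cancel_work_sync() */
        mf_work_func(&q->work);
}
```

With this_cpu_ptr() the drain would always act on the caller's own CPU; with
container_of() the cpu recorded inside the nmi_enter()/nmi_exit() window
still selects the right queue even if the task has since migrated.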

I've cooked up some NOTIFY_SEA-ing APEI firmware using kvmtool to test this. I
haven't yet managed to hit irq-masked code with NOTIFY_SEA. I'll try and tidy
this up and post a branch to make it easier to test...

I prefer this as it doesn't duplicate the state then come back on a TIF flag.
I'd like to move the kicking logic into ghes.c, as that is where the queueing
happened, but the 'do-this, restore these flags, do-that' is somewhat tasteless,
and it looks like only arm64 has synchronous nmi-like notifications that must be
handled before returning to user-space...
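
The return-value logic of the apei_claim_sea() sketch earlier in this mail
can be modelled as a pure function, which may make the 'do-this, restore
these flags, do-that' sequence easier to follow. This is only a userspace
model under stated assumptions: struct sea_ctx and claim_sea_result() are
hypothetical names, and each field stands in for a check the kernel code
performs (noted in the comments).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the conditions apei_claim_sea() tests. */
struct sea_ctx {
        int  ghes_err;          /* return value of ghes_notify_sea() */
        bool have_regs;         /* regs != NULL */
        bool irqs_were_masked;  /* arch_irqs_disabled_flags(interrupted_flags) */
        bool from_user;         /* user_mode(regs) */
        bool preemptible;       /* preemptible() */
        bool mf_enabled;        /* CONFIG_ACPI_APEI_MEMORY_FAILURE */
};

/*
 * Returns 0 when all deferred work could be completed before returning,
 * -EINPROGRESS when APEI claimed the error but work is still queued,
 * or the error from ghes_notify_sea() when APEI did not claim it.
 */
static int claim_sea_result(const struct sea_ctx *c)
{
        assert(c != NULL);

        if (c->ghes_err)
                return c->ghes_err;     /* not claimed by APEI */

        if (c->irqs_were_masked)
                return -EINPROGRESS;    /* cannot run irq_work now */

        if (!c->mf_enabled)
                return 0;               /* no memory_failure queue to kick */

        if ((c->have_regs && !c->from_user) || !c->preemptible)
                return -EINPROGRESS;    /* cannot sleep, leave it queued */

        return 0;                       /* memory_failure queue was kicked */
}
```

In this model only the 0 case means any signals are already pending, which
is the case where the caller could skip its own SIGBUS generation.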



Thanks,

James

[0] https://www.spinics.net/lists/linux-acpi/msg80149.html
