From: James Morse <james.morse@arm.com>
To: Xie XiuQi <xiexiuqi@huawei.com>
Cc: catalin.marinas@arm.com, will.deacon@arm.com, mingo@redhat.com,
mark.rutland@arm.com, ard.biesheuvel@linaro.org,
Dave.Martin@arm.com, takahiro.akashi@linaro.org,
tbaicar@codeaurora.org, stephen.boyd@linaro.org, bp@suse.de,
julien.thierry@arm.com, shiju.jose@huawei.com,
zjzhang@codeaurora.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
wangxiongfeng2@huawei.com, zhengqiang10@huawei.com,
gengdongjiu@huawei.com, huawei.libin@huawei.com,
wangkefeng.wang@huawei.com, lijinyue@huawei.com,
guohanjun@huawei.com, hanjun.guo@linaro.org,
cj.chengjian@huawei.com
Subject: Re: [PATCH v5 1/3] arm64/ras: support sea error recovery
Date: Tue, 30 Jan 2018 19:19:18 +0000 [thread overview]
Message-ID: <5A70C536.7040208@arm.com> (raw)
In-Reply-To: <1516969885-150532-2-git-send-email-xiexiuqi@huawei.com>
Hi Xie XiuQi,
On 26/01/18 12:31, Xie XiuQi wrote:
> With ARM v8.2 RAS Extension, SEA are usually triggered when memory errors
> are consumed. According to the existing process, errors occurred in the
> kernel, leading to direct panic, if it occurred the user-space, we should
> just kill process.
>
> But there is a class of error, in fact, is not necessary to kill
> process, you can recover and continue to run the process. Such as
> the instruction data corrupted, where the memory page might be
> read-only, which is has not been modified, the disk might have the
> correct data, so you can directly drop the page, ant reload it when
> necessary.
With firmware-first support, we do all this...
> So this patchset is just try to solve such problem: if the error is
> consumed in user-space and the error occurs on a clean page, you can
> directly drop the memory page without killing process.
>
> If the corrupted page is clean, just dropped it and return to user-space
> without side effects. And if corrupted page is dirty, memory_failure()
> will send SIGBUS with code=BUS_MCEERR_AR. While without this patchset,
> do_sea() will just send SIGBUS, so the process was killed in the same place.
... but this happens too. I agree its something we should fix, but I don't think
this is the best way to do it.
This series is pulling the memory-failure-queue details back into the arch-code
to build a second list, that gets processed as extra work when we return to
user-space.
The root of the issue is ghes_notify_sea() claims the notification as something
APEI has dealt with, ... but it hasn't done it yet. The signals will be
generated by something currently stuck in a queue. (Evidently x86 doesn't handle
synchronous errors like this using firmware-first).
I think a smaller fix is to give the queues that may be holding the
memory_failure() work a kick as part of the code that calls ghes_notify_sea().
This means that by the time we return to do_sea() ghes_notify_sea()'s claim that
APEI has dealt with it is true as any generated signals are pending. We can then
skip the existing SIGBUS generation code.
> Because memory_failure() may sleep, we can not call it directly in SEA
(this one is more serious, I've attempted to fix it by moving all NMI-like
GHES-notifications to use the estatus queue).
> exception context. So we saved faulting physical address associated with
> a process in the ghes handler and set __TIF_SEA_NOTIFY. When we return
> from SEA exception context and get into do_notify_resume() before the
> process running, we could check it and call memory_failure() to do
> recovery.
> It's safe, because we are in process context.
I think this is the trick. When we take a Synchronous-external-abort out of
userspace, we're in process context too. We can add helpers to drain the
memory_failure_queue which can be called when do_sea() when we know we're
preemptible and interrupts-et-al are unmasked.
Thanks,
James
[0] https://www.spinics.net/lists/linux-acpi/msg80149.html
> ---
> arch/arm64/Kconfig | 11 +++
> arch/arm64/include/asm/ras.h | 23 ++++++
> arch/arm64/include/asm/thread_info.h | 4 +-
> arch/arm64/kernel/Makefile | 1 +
> arch/arm64/kernel/ras.c | 142 +++++++++++++++++++++++++++++++++++
> arch/arm64/kernel/signal.c | 7 ++
> arch/arm64/mm/fault.c | 27 +++++--
> drivers/acpi/apei/ghes.c | 8 +-
> include/acpi/ghes.h | 3 +
> 9 files changed, 216 insertions(+), 10 deletions(-)
> create mode 100644 arch/arm64/include/asm/ras.h
> create mode 100644 arch/arm64/kernel/ras.c
next prev parent reply other threads:[~2018-01-30 19:19 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-26 12:31 [PATCH v5 0/3] arm64/ras: support sea error recovery Xie XiuQi
2018-01-26 12:31 ` [PATCH v5 1/3] " Xie XiuQi
2018-01-30 19:19 ` James Morse [this message]
2018-02-07 19:03 ` James Morse
2018-02-08 8:35 ` Xie XiuQi
2018-02-15 17:56 ` James Morse
2018-02-09 5:04 ` gengdongjiu
2018-01-26 12:31 ` [PATCH v5 2/3] GHES: add a notify chain for process memory section Xie XiuQi
2018-02-07 10:31 ` Borislav Petkov
2018-02-08 8:41 ` Xie XiuQi
2018-01-26 12:31 ` [PATCH v5 3/3] arm64/ras: save error address from memory section for recovery Xie XiuQi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5A70C536.7040208@arm.com \
--to=james.morse@arm.com \
--cc=Dave.Martin@arm.com \
--cc=ard.biesheuvel@linaro.org \
--cc=bp@suse.de \
--cc=catalin.marinas@arm.com \
--cc=cj.chengjian@huawei.com \
--cc=gengdongjiu@huawei.com \
--cc=guohanjun@huawei.com \
--cc=hanjun.guo@linaro.org \
--cc=huawei.libin@huawei.com \
--cc=julien.thierry@arm.com \
--cc=lijinyue@huawei.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=mingo@redhat.com \
--cc=shiju.jose@huawei.com \
--cc=stephen.boyd@linaro.org \
--cc=takahiro.akashi@linaro.org \
--cc=tbaicar@codeaurora.org \
--cc=wangkefeng.wang@huawei.com \
--cc=wangxiongfeng2@huawei.com \
--cc=will.deacon@arm.com \
--cc=xiexiuqi@huawei.com \
--cc=zhengqiang10@huawei.com \
--cc=zjzhang@codeaurora.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).