From: James Morse <james.morse@arm.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: info@metux.net, catalin.marinas@arm.com,
linux-kernel@vger.kernel.org, linuxarm@huawei.com,
allison@lohutok.net, gregkh@linuxfoundation.org,
Tian Tao <tiantao6@hisilicon.com>,
tglx@linutronix.de, Will Deacon <will@kernel.org>,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] arm32: fix flushcache syscall with device address
Date: Tue, 21 Apr 2020 17:50:22 +0100 [thread overview]
Message-ID: <d0814ab2-03fc-42c1-e447-bfee2df038da@arm.com> (raw)
In-Reply-To: <20200421121651.000009f0@Huawei.com>
Hi guys,
(Subject Nit: arm64, as that is what your patch modifies)
On 21/04/2020 12:16, Jonathan Cameron wrote:
> On Tue, 21 Apr 2020 09:12:39 +0100
> Will Deacon <will@kernel.org> wrote:
>
>> On Tue, Apr 21, 2020 at 04:08:34PM +0800, Tian Tao wrote:
>>> An issue has been observed on our Kungpeng916 systems when using a PCI
>>> express GPU. This occurs when a 32 bit application running on a 64 bit
>>> kernel issues a cache flush operation to a memory address that is in
>>> a PCI BAR of the GPU.The results in an illegal operation and
>>> subsequent crash.
>>
>> A kernel crash? If so, please can you include the log here?
>
> Deploying my finest copy typing from the image Tian Tao sent out
>
> KERNEL: /root/vmlinux-4.19.36-3patch-00228-debuginfo
> DUMPFILE: vmcore [PARTIAL DUMP]
> CPUS: 64
> DATE: Fri Mar 20 06:59:56 2020
> UPTIME: 07:01:01
> LOAD AVERAGE: 33.76, 35.45, 35.79
> TASKS: 59447
> NODENAME: cpus-new-ondemand-0509
> RELEASE: 4.19.36-3patch-0228
> VERSION: #4 SMP Fri Feb 28 15:18:51 UTC 2020
> MACHINE: aarch64 (unknown MHz)
> MEMORY: 255.7 GB
> PANIC: "kernel panic - not syncing: Asynchronous SError Interrupt"
> PID: 175108
> COMMAND: "UnityMain"
> TASK: ffff80a96999dd00 [THREAD_INFO: ffff80a96999dd00]
> CPU: 62
> STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 175108 TASK: ffff80a96999dd00 CPU: 62 COMMAND: "UnityMain"
> #0 [ffff000194e1b920] machine_kexec at ffff0000080a265c
> #1 [ffff000194e1b980] __crash_kexec at ffff0000081b3ba8
> #2 [ffff000194e1bb10] panic at ffff0000080ecc98
> #3 [ffff000194e1bbf0] nmi_panic at ffff0000080ec7f4
> #4 [ffff000194e1bc10] arm64_serror_panic at fff00000809019c
> #5 [ffff000194e1bc30] do_serror at ffff00000809039c
> #6 [ffff000194e1bd90] el1_error at ffff000008083e50
> #7 [ffff000194e1bda0] __flush_icache_range at ffff0000080a9ec4
> #8 [ffff000194e1be60] el0_svc_common at fff0000080977d8
> #9 [ffff000194e1bea0] el0_svc_compat_handler at ffff0000080979b4
> #10 [ffff000194e1bff0] el0_svc_compat at ffff0000008083874
>
> PC: c90fe7f8 LR: c90ff09c SP: d2afa8e0 PSTATE: 800b0010
> X12: c56e96e4 X11: d2afaa48 X10: d0ff1000 X9: d2afab68
> x8: 000000d6 X7: 000f0002 X6: d3c61840 X5: d3c61001
> X4: d3c03000 X3: 0004d54a x2: 00000000 x1: d3c61040
> X0: d3c61000
>
>
> New advanced test for Mavis Beacon teaches typing.
Thanks for doing that!
> In summary this is all we have to hand...
Jumping back to the patch:
On 21/04/2020 09:08, Tian Tao wrote:> diff --git a/arch/arm64/kernel/sys_compat.c
b/arch/arm64/kernel/sys_compat.c
> index 3c18c24..6c07944 100644
> --- a/arch/arm64/kernel/sys_compat.c
> +++ b/arch/arm64/kernel/sys_compat.c
> @@ -32,6 +94,13 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
> if (fatal_signal_pending(current))
> return 0;
>
> + /* do not flush page table is non-cacheable */
> + if (!__check_pt_cacheable(start)) {
> + cond_resched();
> + start += chunk;
> + continue;
> + }
The Arm-arm expects this to work, so we can't just skip it!
D4.4.8's "Effects of instructions that operate by VA to the PoC" has:
| For Device memory and Normal memory that is Inner Non-cacheable, Outer Non-cacheable,
| these instructions must affect the caches of all PEs in the Outer Shareable shareability
| domain of the PE on which the instruction is operating.
What does aarch64 user-space do in the same situation? Surely that also goes down in
flames too!?
I think the real problem here is you've given this kind of mapping to user-space. If the
device behind the mapping can respond like this, you must trust user-space not to poke it
inappropriately.
If its not got all the properties of memory: please don't pretend its memory.
If its a device, it probably needs to be carefully managed by a dedicated driver.
Do you know where the abort comes from and why? (The interconnect, PCIE-root-complex, or
from the GPU itself?)
Is it the wrong attributes? Too large an eviction from the cache? Wrong alignment for this
BAR? It can't handle cache-maintenance?
Thanks,
James
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: James Morse <james.morse@arm.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Will Deacon <will@kernel.org>, Tian Tao <tiantao6@hisilicon.com>,
catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
linuxarm@huawei.com, linux-arm-kernel@lists.infradead.org,
gregkh@linuxfoundation.org, tglx@linutronix.de, info@metux.net,
allison@lohutok.net
Subject: Re: [PATCH] arm32: fix flushcache syscall with device address
Date: Tue, 21 Apr 2020 17:50:22 +0100 [thread overview]
Message-ID: <d0814ab2-03fc-42c1-e447-bfee2df038da@arm.com> (raw)
In-Reply-To: <20200421121651.000009f0@Huawei.com>
Hi guys,
(Subject Nit: arm64, as that is what your patch modifies)
On 21/04/2020 12:16, Jonathan Cameron wrote:
> On Tue, 21 Apr 2020 09:12:39 +0100
> Will Deacon <will@kernel.org> wrote:
>
>> On Tue, Apr 21, 2020 at 04:08:34PM +0800, Tian Tao wrote:
>>> An issue has been observed on our Kungpeng916 systems when using a PCI
>>> express GPU. This occurs when a 32 bit application running on a 64 bit
>>> kernel issues a cache flush operation to a memory address that is in
>>> a PCI BAR of the GPU.The results in an illegal operation and
>>> subsequent crash.
>>
>> A kernel crash? If so, please can you include the log here?
>
> Deploying my finest copy typing from the image Tian Tao sent out
>
> KERNEL: /root/vmlinux-4.19.36-3patch-00228-debuginfo
> DUMPFILE: vmcore [PARTIAL DUMP]
> CPUS: 64
> DATE: Fri Mar 20 06:59:56 2020
> UPTIME: 07:01:01
> LOAD AVERAGE: 33.76, 35.45, 35.79
> TASKS: 59447
> NODENAME: cpus-new-ondemand-0509
> RELEASE: 4.19.36-3patch-0228
> VERSION: #4 SMP Fri Feb 28 15:18:51 UTC 2020
> MACHINE: aarch64 (unknown MHz)
> MEMORY: 255.7 GB
> PANIC: "kernel panic - not syncing: Asynchronous SError Interrupt"
> PID: 175108
> COMMAND: "UnityMain"
> TASK: ffff80a96999dd00 [THREAD_INFO: ffff80a96999dd00]
> CPU: 62
> STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 175108 TASK: ffff80a96999dd00 CPU: 62 COMMAND: "UnityMain"
> #0 [ffff000194e1b920] machine_kexec at ffff0000080a265c
> #1 [ffff000194e1b980] __crash_kexec at ffff0000081b3ba8
> #2 [ffff000194e1bb10] panic at ffff0000080ecc98
> #3 [ffff000194e1bbf0] nmi_panic at ffff0000080ec7f4
> #4 [ffff000194e1bc10] arm64_serror_panic at fff00000809019c
> #5 [ffff000194e1bc30] do_serror at ffff00000809039c
> #6 [ffff000194e1bd90] el1_error at ffff000008083e50
> #7 [ffff000194e1bda0] __flush_icache_range at ffff0000080a9ec4
> #8 [ffff000194e1be60] el0_svc_common at fff0000080977d8
> #9 [ffff000194e1bea0] el0_svc_compat_handler at ffff0000080979b4
> #10 [ffff000194e1bff0] el0_svc_compat at ffff0000008083874
>
> PC: c90fe7f8 LR: c90ff09c SP: d2afa8e0 PSTATE: 800b0010
> X12: c56e96e4 X11: d2afaa48 X10: d0ff1000 X9: d2afab68
> x8: 000000d6 X7: 000f0002 X6: d3c61840 X5: d3c61001
> X4: d3c03000 X3: 0004d54a x2: 00000000 x1: d3c61040
> X0: d3c61000
>
>
> New advanced test for Mavis Beacon teaches typing.
Thanks for doing that!
> In summary this is all we have to hand...
Jumping back to the patch:
On 21/04/2020 09:08, Tian Tao wrote:> diff --git a/arch/arm64/kernel/sys_compat.c
b/arch/arm64/kernel/sys_compat.c
> index 3c18c24..6c07944 100644
> --- a/arch/arm64/kernel/sys_compat.c
> +++ b/arch/arm64/kernel/sys_compat.c
> @@ -32,6 +94,13 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
> if (fatal_signal_pending(current))
> return 0;
>
> + /* do not flush page table is non-cacheable */
> + if (!__check_pt_cacheable(start)) {
> + cond_resched();
> + start += chunk;
> + continue;
> + }
The Arm-arm expects this to work, so we can't just skip it!
D4.4.8's "Effects of instructions that operate by VA to the PoC" has:
| For Device memory and Normal memory that is Inner Non-cacheable, Outer Non-cacheable,
| these instructions must affect the caches of all PEs in the Outer Shareable shareability
| domain of the PE on which the instruction is operating.
What does aarch64 user-space do in the same situation? Surely that also goes down in
flames too!?
I think the real problem here is you've given this kind of mapping to user-space. If the
device behind the mapping can respond like this, you must trust user-space not to poke it
inappropriately.
If its not got all the properties of memory: please don't pretend its memory.
If its a device, it probably needs to be carefully managed by a dedicated driver.
Do you know where the abort comes from and why? (The interconnect, PCIE-root-complex, or
from the GPU itself?)
Is it the wrong attributes? Too large an eviction from the cache? Wrong alignment for this
BAR? It can't handle cache-maintenance?
Thanks,
James
next prev parent reply other threads:[~2020-04-21 16:51 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-21 8:08 [PATCH] arm32: fix flushcache syscall with device address Tian Tao
2020-04-21 8:08 ` Tian Tao
2020-04-21 8:12 ` Will Deacon
2020-04-21 8:12 ` Will Deacon
2020-04-21 11:16 ` Jonathan Cameron
2020-04-21 11:16 ` Jonathan Cameron
2020-04-21 16:50 ` James Morse [this message]
2020-04-21 16:50 ` James Morse
2020-04-21 16:55 ` Russell King - ARM Linux admin
2020-04-21 16:55 ` Russell King - ARM Linux admin
2020-04-22 13:56 ` Catalin Marinas
2020-04-22 13:56 ` Catalin Marinas
2020-04-21 12:22 ` Russell King - ARM Linux admin
2020-04-21 12:22 ` Russell King - ARM Linux admin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d0814ab2-03fc-42c1-e447-bfee2df038da@arm.com \
--to=james.morse@arm.com \
--cc=Jonathan.Cameron@Huawei.com \
--cc=allison@lohutok.net \
--cc=catalin.marinas@arm.com \
--cc=gregkh@linuxfoundation.org \
--cc=info@metux.net \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=tglx@linutronix.de \
--cc=tiantao6@hisilicon.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.