* [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-04 9:47 UTC
To: qemu-arm, kvmarm
Cc: peter.maydell, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hi,
I am investigating a data abort affecting the barebox bootloader built for aarch64
that only manifests with qemu-system-aarch64 -enable-kvm.
The issue happens when using the post-indexed form of LDR on an MMIO address:
ldr x0, =0x9000fe0 // MMIO address
ldr w1, [x0], #4 // data abort, but only with -enable-kvm
Here's a full example to reproduce this:
start:
mov w10, 'o'
bl putch
bl amba_device_read_pid
mov w10, 'k'
bl putch
ret
amba_device_read_pid:
ldr x0, =0x9000fe0
ldr w1, [x0], #4 // data abort unless #4 is removed or KVM disabled
ret
putch:
ldr x8, =0x9000018
1:
/* Wait until there is space in the FIFO */
ldr w9, [x8]
tbnz w9, #5, 1b
/* Send the character */
mov x9, #0x9000000
str w10, [x9]
2:
/* Wait to make sure it hits the line, in case we die too soon. */
ldr w9, [x8]
tbnz w9, #5, 2b
ret
It assumes 0x9000000-0x9000fff is a PL011 UART as is the case on the Virt platform.
It will print an 'o', try to access 0x9000fe0, which contains the first byte
of the AMBA Product ID, and then print a 'k' using the same PL011.
To build:
aarch64-linux-gnu-as reproducer.S -o reproducer.o
aarch64-linux-gnu-objcopy -O binary reproducer.o reproducer.bin
To run (whether -bios or -kernel doesn't seem to matter):
taskset -a --cpu-list 2-5 qemu-system-aarch64 -bios reproducer.bin -machine virt,secure=off \
-cpu max -m 1024M -nographic -serial mon:stdio -trace file=/dev/null
When run _without_ kvm, this will output:
ok
When run with -enable-kvm, this will trigger a data abort in amba_device_read_pid:
o
The data abort can also be avoided by removing the post-index increment:
-ldr w1, [x0], #4
+ldr w1, [x0]
This doesn't introduce a functional difference, because x0 isn't used again anyway,
but it makes the code work under KVM.
I am using debian-arm64 Bookworm on an Amlogic A311D (4x CA72, 2x CA53) with:
QEMU emulator version 9.1.50 (v9.1.0-704-g423be09ab9)
Linux 6.11-arm64 #1 SMP Debian 6.11-1~exp1 (2024-09-19) aarch64 GNU/Linux
This issue was first noticed by openembedded-core CI while testing this patch series
adding support for testing the barebox and U-Boot bootloaders:
https://lore.kernel.org/all/ee536d88a5b6468b20e37f3daabe4aa63047d1ad.camel@pengutronix.de/
AFAICS, the U-Boot breakage has the same root cause as barebox's, except that
there it's a str instruction that does the post-increment, and the PCI MMIO
region is what's being accessed.
I haven't been successful in getting QEMU/GDB to trap on data aborts, so here's an
excerpt from my original barebox stack trace instead, which sent me down this
rabbit hole:
DABT (current EL) exception (ESR 0x96000010) at 0x0000000009030fe0
elr: 000000007fb2221c lr : 000000007fb22250
[...]
Call trace:
[<7fb2221c>] (amba_device_get_pid.constprop.0.isra.0+0x10/0x34) from [<7fb01e3c>] (start_barebox+0x88/0xb4)
[...]
This looks pretty much like a bug to me, but then I would have expected more
software to be affected by this. Maybe it's memory-mapping related?
I only tested with either the MMU disabled or the MMIO region mapped strongly-ordered.
Please let me know if there's further information I can provide.
Thanks,
Ahmad
--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Peter Maydell @ 2024-10-04 10:40 UTC
To: Ahmad Fatoum
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> I am investigating a data abort affecting the barebox bootloader built for aarch64
> that only manifests with qemu-system-aarch64 -enable-kvm.
>
> The issue happens when using the post-indexed form of LDR on an MMIO address:
>
> ldr x0, =0x9000fe0 // MMIO address
> ldr w1, [x0], #4   // data abort, but only with -enable-kvm

Don't do this -- KVM doesn't support it. For access to MMIO,
stick to instructions which will set the ISV bit in ESR_EL1.

That is:

* AArch64 loads and stores of a single general-purpose register
  (including the register specified with 0b11111, including those
  with Acquire/Release semantics, but excluding Load Exclusive
  or Store Exclusive and excluding those with writeback).
* AArch32 instructions where the instruction:
  - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
    LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
    STLH, STRHT, STRB, STLB, or STRBT instruction.
  - Is not performing register writeback.
  - Is not using R15 as a source or destination register.

Your instruction is doing writeback. Do the address update
as a separate instruction.

Strictly speaking this is a missing feature in KVM (in an
ideal world it would let you do MMIO with any instruction
that you could use on real hardware). In practice it is not
a major issue because you don't typically want to do odd
things when you're doing MMIO, you just want to load or
store a single data item. If you're running into this then
your guest software is usually doing something a bit strange.

thanks
-- PMM
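
To make Peter's suggestion concrete, here is a minimal sketch of
writeback-free MMIO accessors in C with inline assembly; the names are
hypothetical, not barebox's or Linux's actual API. Because the load and
store live inside the asm statement, the compiler cannot fuse the
pointer update into a post-indexed form:

#include <stdint.h>

/* Writeback-free 32-bit MMIO accessors (hypothetical sketch).
 * A single "ldr"/"str" without writeback sets ISV=1 in the
 * syndrome register on a stage-2 fault, so KVM can emulate it. */
static inline uint32_t mmio_read32(const volatile void *addr)
{
	uint32_t val;

	asm volatile("ldr %w0, [%1]" : "=r" (val) : "r" (addr));
	return val;
}

static inline void mmio_write32(volatile void *addr, uint32_t val)
{
	asm volatile("str %w0, [%1]" : : "r" (val), "r" (addr));
}

/* The address update happens as a separate "add" instruction;
 * the compiler cannot fold it into the load inside the asm. */
static inline uint32_t mmio_read32_next(const volatile uint32_t **addr)
{
	uint32_t val = mmio_read32(*addr);

	(*addr)++;
	return val;
}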

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-04 11:51 UTC
To: Peter Maydell
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Peter,

Thanks for your quick response.

On 04.10.24 12:40, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> I am investigating a data abort affecting the barebox bootloader built for aarch64
>> that only manifests with qemu-system-aarch64 -enable-kvm.
>>
>> The issue happens when using the post-indexed form of LDR on an MMIO address:
>>
>> ldr x0, =0x9000fe0 // MMIO address
>> ldr w1, [x0], #4   // data abort, but only with -enable-kvm
>
> Don't do this -- KVM doesn't support it. For access to MMIO,
> stick to instructions which will set the ISV bit in ESR_EL1.
>
> That is:
>
> * AArch64 loads and stores of a single general-purpose register
>   (including the register specified with 0b11111, including those
>   with Acquire/Release semantics, but excluding Load Exclusive
>   or Store Exclusive and excluding those with writeback).
> * AArch32 instructions where the instruction:
>   - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
>     LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
>     STLH, STRHT, STRB, STLB, or STRBT instruction.
>   - Is not performing register writeback.
>   - Is not using R15 as a source or destination register.
>
> Your instruction is doing writeback. Do the address update
> as a separate instruction.

This was enlightening, thanks. I will prepare patches to implement
readl/writel in barebox in ARM assembly, like Linux does, instead of
the current fallback to the generic C version that just does volatile
accesses.

> Strictly speaking this is a missing feature in KVM (in an
> ideal world it would let you do MMIO with any instruction
> that you could use on real hardware).

I assume that's because KVM doesn't want to handle interruptions
in the middle of such "composite" instructions?

> In practice it is not
> a major issue because you don't typically want to do odd
> things when you're doing MMIO, you just want to load or
> store a single data item. If you're running into this then
> your guest software is usually doing something a bit strange.

The AMBA Peripheral ID consists of 4 bytes with some padding
in-between. The barebox code to read it out looks like this:

static inline u32 amba_device_get_pid(void *base, u32 size)
{
	int i;
	u32 pid;

	for (pid = 0, i = 0; i < 4; i++)
		pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
			(i * 8);

	return pid;
}

static inline u32 __raw_readl(const volatile void __iomem *addr)
{
	return *(const volatile u32 __force *)addr;
}

I wouldn't necessarily characterize this as strange; we just erroneously
assumed that with strongly ordered memory for MMIO regions and volatile
accesses we had our bases covered, and indeed we did until the bases
shifted to include hardware-assisted virtualization. :-)

Thanks,
Ahmad

>
> thanks
> -- PMM
>

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Peter Maydell @ 2024-10-04 12:10 UTC
To: Ahmad Fatoum
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Peter,
>
> Thanks for your quick response.
>
> On 04.10.24 12:40, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> I am investigating a data abort affecting the barebox bootloader built for aarch64
> >> that only manifests with qemu-system-aarch64 -enable-kvm.
> >>
> >> The issue happens when using the post-indexed form of LDR on an MMIO address:
> >>
> >> ldr x0, =0x9000fe0 // MMIO address
> >> ldr w1, [x0], #4   // data abort, but only with -enable-kvm
> >
> > Don't do this -- KVM doesn't support it. For access to MMIO,
> > stick to instructions which will set the ISV bit in ESR_EL1.
> >
> > That is:
> >
> > * AArch64 loads and stores of a single general-purpose register
> >   (including the register specified with 0b11111, including those
> >   with Acquire/Release semantics, but excluding Load Exclusive
> >   or Store Exclusive and excluding those with writeback).
> > * AArch32 instructions where the instruction:
> >   - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> >     LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> >     STLH, STRHT, STRB, STLB, or STRBT instruction.
> >   - Is not performing register writeback.
> >   - Is not using R15 as a source or destination register.
> >
> > Your instruction is doing writeback. Do the address update
> > as a separate instruction.
>
> This was enlightening, thanks. I will prepare patches to implement
> readl/writel in barebox in ARM assembly, like Linux does, instead of
> the current fallback to the generic C version that just does volatile
> accesses.
>
> > Strictly speaking this is a missing feature in KVM (in an
> > ideal world it would let you do MMIO with any instruction
> > that you could use on real hardware).
>
> I assume that's because KVM doesn't want to handle interruptions
> in the middle of such "composite" instructions?

It's because with the ISV=1 information in the ESR_EL2,
KVM has everything it needs to emulate the load/store:
it has the affected register number, the data width, etc. When
ISV is 0, simulating the load/store would require KVM
to load the actual instruction word, decode it to figure
out what kind of load/store it was, and then emulate
its behaviour. The instruction decode would be complicated
and if done in the kernel would increase the attack surface
exposed to the guest.

(In practice KVM will these days bounce the ISV=0 failure
out to the userspace VMM, but no VMM that I know of implements
the decode of load/store instructions in userspace either, so
that just changes which part prints the failure message.)

> > In practice it is not
> > a major issue because you don't typically want to do odd
> > things when you're doing MMIO, you just want to load or
> > store a single data item. If you're running into this then
> > your guest software is usually doing something a bit strange.
>
> The AMBA Peripheral ID consists of 4 bytes with some padding
> in-between. The barebox code to read it out looks like this:
>
> static inline u32 amba_device_get_pid(void *base, u32 size)
> {
> 	int i;
> 	u32 pid;
>
> 	for (pid = 0, i = 0; i < 4; i++)
> 		pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
> 			(i * 8);
>
> 	return pid;
> }
>
> static inline u32 __raw_readl(const volatile void __iomem *addr)
> {
> 	return *(const volatile u32 __force *)addr;
> }
>
> I wouldn't necessarily characterize this as strange; we just erroneously
> assumed that with strongly ordered memory for MMIO regions and volatile
> accesses we had our bases covered, and indeed we did until the bases
> shifted to include hardware-assisted virtualization. :-)

I'm not a fan of doing MMIO access via 'volatile' in C code,
personally -- I think the compiler has a tendency to do more
clever recombination than you might actually want, because
it doesn't know that the thing you're accessing isn't RAM-like.

thanks
-- PMM
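
For illustration, a sketch of the syndrome decode Peter alludes to. The
bit positions follow the ESR_ELx ISS encoding for Data Aborts in the
ARM ARM; the struct and function here are made up, not KVM's actual
code:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative decode of an ESR_EL2 Data Abort syndrome. */
struct mmio_access {
	bool     valid;   /* ISV, bit 24: syndrome describes the access */
	unsigned size;    /* from SAS, bits 23:22: size = 1 << SAS bytes */
	bool     sext;    /* SSE, bit 21: sign-extend the loaded value */
	unsigned reg;     /* SRT, bits 20:16: transfer register number */
	bool     is64bit; /* SF,  bit 15: 64-bit register width */
	bool     write;   /* WnR, bit 6: write (1) or read (0) */
};

static struct mmio_access decode_dabt_iss(uint64_t esr)
{
	struct mmio_access m = {
		.valid   = esr & (1ULL << 24),
		.size    = 1u << ((esr >> 22) & 3),
		.sext    = esr & (1ULL << 21),
		.reg     = (unsigned)((esr >> 16) & 0x1f),
		.is64bit = esr & (1ULL << 15),
		.write   = esr & (1ULL << 6),
	};
	return m;
}

With ISV=1 this is all the hypervisor needs; with ISV=0 none of these
fields are meaningful, hence the need to fetch and decode the guest
instruction itself.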

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Oliver Upton @ 2024-10-04 15:52 UTC
To: Peter Maydell
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > Strictly speaking this is a missing feature in KVM (in an
> > > ideal world it would let you do MMIO with any instruction
> > > that you could use on real hardware).
> >
> > I assume that's because KVM doesn't want to handle interruptions
> > in the middle of such "composite" instructions?
>
> It's because with the ISV=1 information in the ESR_EL2,
> KVM has everything it needs to emulate the load/store:
> it has the affected register number, the data width, etc. When
> ISV is 0, simulating the load/store would require KVM
> to load the actual instruction word, decode it to figure
> out what kind of load/store it was, and then emulate
> its behaviour. The instruction decode would be complicated
> and if done in the kernel would increase the attack surface
> exposed to the guest.

On top of that, the only way to 'safely' fetch the instruction would be
to pause all vCPUs in the VM to prevent the guest from remapping the
address space behind either KVM or the VMM's back.

> > static inline u32 __raw_readl(const volatile void __iomem *addr)
> > {
> > 	return *(const volatile u32 __force *)addr;
> > }
> >
> > I wouldn't necessarily characterize this as strange; we just erroneously
> > assumed that with strongly ordered memory for MMIO regions and volatile
> > accesses we had our bases covered, and indeed we did until the bases
> > shifted to include hardware-assisted virtualization. :-)

This has nothing to do with ordering and everything to do with the
instruction set && what your compiler decided to emit.

> I'm not a fan of doing MMIO access via 'volatile' in C code,
> personally -- I think the compiler has a tendency to do more
> clever recombination than you might actually want, because
> it doesn't know that the thing you're accessing isn't RAM-like.

+1

--
Thanks,
Oliver

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Peter Maydell @ 2024-10-04 15:57 UTC
To: Oliver Upton
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > > Strictly speaking this is a missing feature in KVM (in an
> > > > ideal world it would let you do MMIO with any instruction
> > > > that you could use on real hardware).
> > >
> > > I assume that's because KVM doesn't want to handle interruptions
> > > in the middle of such "composite" instructions?
> >
> > It's because with the ISV=1 information in the ESR_EL2,
> > KVM has everything it needs to emulate the load/store:
> > it has the affected register number, the data width, etc. When
> > ISV is 0, simulating the load/store would require KVM
> > to load the actual instruction word, decode it to figure
> > out what kind of load/store it was, and then emulate
> > its behaviour. The instruction decode would be complicated
> > and if done in the kernel would increase the attack surface
> > exposed to the guest.
>
> On top of that, the only way to 'safely' fetch the instruction would be
> to pause all vCPUs in the VM to prevent the guest from remapping the
> address space behind either KVM or the VMM's back.

Do we actually care about that, though? If the guest does
that isn't it equivalent to a hardware CPU happening to
fetch the insn just-after a remapping rather than just-before?
If you decode the insn and it's not a store you could just
restart the guest...

thanks
-- PMM

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Oliver Upton @ 2024-10-04 16:21 UTC
To: Peter Maydell
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, Oct 04, 2024 at 04:57:56PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
> > On top of that, the only way to 'safely' fetch the instruction would be
> > to pause all vCPUs in the VM to prevent the guest from remapping the
> > address space behind either KVM or the VMM's back.
>
> Do we actually care about that, though?

Judging from the fact that our existing MMIO flows have a similar
"bug", I'd say no. I was just being pedantic about how annoying it'd be
to do this faithfully, including the VA -> IPA translation.

> If the guest does
> that isn't it equivalent to a hardware CPU happening to
> fetch the insn just-after a remapping rather than just-before?
> If you decode the insn and it's not a store you could just
> restart the guest...

Definitely, you'd need to restart any time the instruction doesn't line
up with the ESR. The pedantic thing I was thinking about was if the
instruction bytes remain the same but marked as non-executable:

  T1                            T2
  ==                            ==
  readl(addr);
  < MMIO data abort >
  insn = fetch(readl);
                                set_nx(readl);
                                tlbi(readl);
                                dsb(ish);
  emulate(insn);

--
Thanks,
Oliver

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-04 19:50 UTC
To: Peter Maydell
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hi,

On 04.10.24 14:10, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 04.10.24 12:40, Peter Maydell wrote:
>>> Don't do this -- KVM doesn't support it. For access to MMIO,
>>> stick to instructions which will set the ISV bit in ESR_EL1.
>>>
>>> That is:
>>>
>>> * AArch64 loads and stores of a single general-purpose register
>>>   (including the register specified with 0b11111, including those
>>>   with Acquire/Release semantics, but excluding Load Exclusive
>>>   or Store Exclusive and excluding those with writeback).
>>> * AArch32 instructions where the instruction:
>>>   - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
>>>     LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
>>>     STLH, STRHT, STRB, STLB, or STRBT instruction.
>>>   - Is not performing register writeback.
>>>   - Is not using R15 as a source or destination register.
>>>
>>> Your instruction is doing writeback. Do the address update
>>> as a separate instruction.

With readl/writel implemented in assembly, I get beyond that point, but
now I get a data abort running a DC IVAC instruction on address 0x1000,
where the cfi-flash is located. This instruction is part of a routine
to remap the cfi-flash to start a page later, so the zero page can be
mapped faulting.

Simple reproducer:

start:
ldr x0, =0x1000
ldr x1, =0x1040
bl v8_inv_dcache_range

mov w10, '!'
bl putch

ret

v8_inv_dcache_range:
mrs x3, ctr_el0
lsr x3, x3, #16
and x3, x3, #0xf
mov x2, #0x4
lsl x2, x2, x3
sub x3, x2, #0x1
bic x0, x0, x3
1:
dc ivac, x0
add x0, x0, x2
cmp x0, x1
b.cc 1b
dsb sy
ret

This prints ! without KVM, but triggers a data abort before that with -enable-kvm:

DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
elr: 000000007fbe0550 lr : 000000007fbe01ac
[snip]

Call trace:
[<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
[<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
[<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
[snip]

Any idea what this is about?

Thanks,
Ahmad

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
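
As an aside, the line-size computation at the top of v8_inv_dcache_range
corresponds to the following C, reading CTR_EL0.DminLine (bits 19:16,
the log2 of the number of 4-byte words in the smallest data cache
line); a sketch, not taken from the barebox sources:

#include <stdint.h>

/* Smallest data cache line size in bytes, from CTR_EL0.DminLine. */
static inline uint64_t dcache_line_size(void)
{
	uint64_t ctr;

	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
	return 4ULL << ((ctr >> 16) & 0xf); /* matches the asm above */
}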

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Marc Zyngier @ 2024-10-05 10:31 UTC
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 04 Oct 2024 20:50:18 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hi,
>
> On 04.10.24 14:10, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 04.10.24 12:40, Peter Maydell wrote:
> >>> Don't do this -- KVM doesn't support it. For access to MMIO,
> >>> stick to instructions which will set the ISV bit in ESR_EL1.
> >>>
> >>> That is:
> >>>
> >>> * AArch64 loads and stores of a single general-purpose register
> >>>   (including the register specified with 0b11111, including those
> >>>   with Acquire/Release semantics, but excluding Load Exclusive
> >>>   or Store Exclusive and excluding those with writeback).
> >>> * AArch32 instructions where the instruction:
> >>>   - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> >>>     LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> >>>     STLH, STRHT, STRB, STLB, or STRBT instruction.
> >>>   - Is not performing register writeback.
> >>>   - Is not using R15 as a source or destination register.
> >>>
> >>> Your instruction is doing writeback. Do the address update
> >>> as a separate instruction.
>
> With readl/writel implemented in assembly, I get beyond that point, but
> now I get a data abort running a DC IVAC instruction on address 0x1000,
> where the cfi-flash is located. This instruction is part of a routine
> to remap the cfi-flash to start a page later, so the zero page can be
> mapped faulting.
>
> Simple reproducer:
>
> start:
> ldr x0, =0x1000
> ldr x1, =0x1040
> bl v8_inv_dcache_range
>
> mov w10, '!'
> bl putch
>
> ret
>
> v8_inv_dcache_range:
> mrs x3, ctr_el0
> lsr x3, x3, #16
> and x3, x3, #0xf
> mov x2, #0x4
> lsl x2, x2, x3
> sub x3, x2, #0x1
> bic x0, x0, x3
> 1:
> dc ivac, x0
> add x0, x0, x2
> cmp x0, x1
> b.cc 1b
> dsb sy
> ret
>
> This prints ! without KVM, but triggers a data abort before that with -enable-kvm:
>
> DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
> elr: 000000007fbe0550 lr : 000000007fbe01ac
> [snip]
>
> Call trace:
> [<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
> [<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
> [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
> [snip]
>
> Any idea what this is about?

IIRC, the QEMU flash is implemented as a read-only memslot. A data
cache invalidation is a *write*, as it can be (and is) upgraded to a
clean+invalidate by the HW.

KVM cannot satisfy the write, for obvious reasons, and tells the guest
to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
not).

In the end, you get an exception. We could relax this by
special-casing CMOs to RO memslots, but this doesn't look great.

The real question is: what are you trying to achieve with this?

M.

--
Without deviation from the norm, progress is not possible.

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-05 18:38 UTC
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 05.10.24 12:31, Marc Zyngier wrote:
> On Fri, 04 Oct 2024 20:50:18 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> With readl/writel implemented in assembly, I get beyond that point, but
>> now I get a data abort running a DC IVAC instruction on address 0x1000,
>> where the cfi-flash is located. This instruction is part of a routine
>> to remap the cfi-flash to start a page later, so the zero page can be
>> mapped faulting.

[snip]

>> Any idea what this is about?
>
> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> cache invalidation is a *write*, as it can be (and is) upgraded to a
> clean+invalidate by the HW.

So it's a write, even if there are no dirty cache lines?

> KVM cannot satisfy the write, for obvious reasons, and tells the guest
> to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> not).
>
> In the end, you get an exception. We could relax this by
> special-casing CMOs to RO memslots, but this doesn't look great.
>
> The real question is: what are you trying to achieve with this?

barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
want to map the zero page faulting, but still have access to the first
block of the cfi-flash.

Therefore, barebox will map the cfi-flash one page later
(virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
page can be mapped faulting.

The routine[1] that does this remapping invalidates the virtual address range,
because the attributes may change[2]. This invalidate also happens for cfi-flash,
but we should never encounter dirty cache lines there as the remap is done
before driver probe.

Can you advise what should be done differently?

[1]: https://elixir.bootlin.com/barebox/v2024.09.0/source/arch/arm/cpu/mmu_64.c#L193
[2]: https://lore.kernel.org/barebox/20230526063354.1145474-4-a.fatoum@pengutronix.de/

Thanks,
Ahmad

>
> M.
>

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
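
To make the shifted mapping concrete, a hypothetical sketch; helper
names, attributes and sizes are illustrative, not barebox's actual API
(the real code is the linked arch_remap_range[1]):

#define PAGE_SIZE 0x1000UL

/* VA 0x0000-0x0fff stays unmapped so null-pointer dereferences fault;
 * the flash contents remain reachable one page up:
 * VA 0x1000 -> PA 0x0000, VA 0x2000 -> PA 0x1000, ...
 */
static void map_flash_shifted(unsigned long phys, unsigned long size)
{
	unsigned long off;

	for (off = 0; off < size; off += PAGE_SIZE)
		map_page(phys + PAGE_SIZE + off, /* virtual          */
			 phys + off,             /* physical         */
			 MAP_DEVICE_nGnRnE);     /* attrs, made up   */
}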

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Marc Zyngier @ 2024-10-05 21:35 UTC
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Sat, 05 Oct 2024 19:38:23 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 05.10.24 12:31, Marc Zyngier wrote:
> > On Fri, 04 Oct 2024 20:50:18 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> With readl/writel implemented in assembly, I get beyond that point, but
> >> now I get a data abort running a DC IVAC instruction on address 0x1000,
> >> where the cfi-flash is located. This instruction is part of a routine
> >> to remap the cfi-flash to start a page later, so the zero page can be
> >> mapped faulting.
>
> [snip]
>
> >> Any idea what this is about?
> >
> > IIRC, the QEMU flash is implemented as a read-only memslot. A data
> > cache invalidation is a *write*, as it can be (and is) upgraded to a
> > clean+invalidate by the HW.
>
> So it's a write, even if there are no dirty cache lines?

Yes.

At the point where this is handled, the CPU has no clue about the
dirty state of an arbitrary cache line, at an arbitrary cache level.
The CPU simply forwards the CMO downstream, and the permission check
happens way before that.

> > KVM cannot satisfy the write, for obvious reasons, and tells the guest
> > to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> > which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> > not).
> >
> > In the end, you get an exception. We could relax this by
> > special-casing CMOs to RO memslots, but this doesn't look great.
> >
> > The real question is: what are you trying to achieve with this?
>
> barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
> want to map the zero page faulting, but still have access to the first
> block of the cfi-flash.
>
> Therefore, barebox will map the cfi-flash one page later
> (virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
> page can be mapped faulting.
>
> The routine[1] that does this remapping invalidates the virtual address range,
> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> but we should never encounter dirty cache lines there as the remap is done
> before driver probe.
>
> Can you advise what should be done differently?

If you always map the flash as Device memory, there is no need for
CMOs. Same thing if you map it as NC. And even if you did map it as
Cacheable, it wouldn't matter. KVM already handles coherency when the
flash is switching between memory-mapped and programming mode, as the
physical address space changes (the flash literally drops from the
memory map).

In general, issuing CMOs to a device is a bizarre concept, because it
is pretty rare that a device can handle a full cache-line as
write-back. Devices tend to handle smaller, register-sized accesses,
not a full 64-byte eviction.

Now, I'm still wondering whether we should simply forward the CMO to
userspace as we do for other writes, and let the VMM deal with it. The
main issue is that there is no current way to describe this.

The alternative would be to silently handle the trap and pretend it
never occurred, as we do for other bizarre behaviours. But that'd be
something only new kernels would support, and I guess you'd like your
guest to work today, not tomorrow.

M.

--
Without deviation from the norm, progress is not possible.

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-06 7:59 UTC
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 05.10.24 23:35, Marc Zyngier wrote:
> On Sat, 05 Oct 2024 19:38:23 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
>>> cache invalidation is a *write*, as it can be (and is) upgraded to a
>>> clean+invalidate by the HW.
>>
>> So it's a write, even if there are no dirty cache lines?
>
> Yes.
>
> At the point where this is handled, the CPU has no clue about the
> dirty state of an arbitrary cache line, at an arbitrary cache level.
> The CPU simply forwards the CMO downstream, and the permission check
> happens way before that.

I see. When I added that invalidation, it was suggested[1] that instead
of invalidation after the remapping, I would clean the range before
remapping. Cleaning the zero page triggered a data abort, which I didn't
understand as there should be no dirty lines, but now I get it.

One more question: This upgrading of DC IVAC to DC CIVAC is because
the code is run under virtualization, right?

[1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/

>> The routine[1] that does this remapping invalidates the virtual address range,
>> because the attributes may change[2]. This invalidate also happens for cfi-flash,
>> but we should never encounter dirty cache lines there as the remap is done
>> before driver probe.
>>
>> Can you advise what should be done differently?
>
> If you always map the flash as Device memory, there is no need for
> CMOs. Same thing if you map it as NC.

Everything, except for RAM, is mapped Device-nGnRnE.

> And even if you did map it as
> Cacheable, it wouldn't matter. KVM already handles coherency when the
> flash is switching between memory-mapped and programming mode, as the
> physical address space changes (the flash literally drops from the
> memory map).
>
> In general, issuing CMOs to a device is a bizarre concept, because it
> is pretty rare that a device can handle a full cache-line as
> write-back. Devices tend to handle smaller, register-sized accesses,
> not a full 64-byte eviction.

The same remap_range function is used to:

- remap normal memory:
  - Mapping memory regions uncached for memory test
  - Mapping memory buffers coherent or write-combine
- remap ROMs: BootROMs at address zero may be hidden behind the NULL
  page and need to be mapped differently when calling/peeking into it.
- remap device memory, e.g. in the case of the cfi-flash here

The invalidation is done unconditionally for all of them, although it
only makes sense in the first case.

> Now, I'm still wondering whether we should simply forward the CMO to
> userspace as we do for other writes, and let the VMM deal with it. The
> main issue is that there is no current way to describe this.
>
> The alternative would be to silently handle the trap and pretend it
> never occurred, as we do for other bizarre behaviours. But that'd be
> something only new kernels would support, and I guess you'd like your
> guest to work today, not tomorrow.

I think the following fix on the barebox side may work:

- Walk all pages about to be remapped
- Execute the AT instruction on the page's base address
- Only if the page was previously mapped cacheable, clean + invalidate
  the cache
- Remove the current cache invalidation after remap

Does that sound sensible?

Thanks,
Ahmad

>
> M.
>

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Marc Zyngier @ 2024-10-06 10:28 UTC
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Sun, 06 Oct 2024 08:59:56 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 05.10.24 23:35, Marc Zyngier wrote:
> > On Sat, 05 Oct 2024 19:38:23 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> >>> cache invalidation is a *write*, as it can be (and is) upgraded to a
> >>> clean+invalidate by the HW.
> >>
> >> So it's a write, even if there are no dirty cache lines?
> >
> > Yes.
> >
> > At the point where this is handled, the CPU has no clue about the
> > dirty state of an arbitrary cache line, at an arbitrary cache level.
> > The CPU simply forwards the CMO downstream, and the permission check
> > happens way before that.
>
> I see. When I added that invalidation, it was suggested[1] that instead
> of invalidation after the remapping, I would clean the range before
> remapping. Cleaning the zero page triggered a data abort, which I didn't
> understand as there should be no dirty lines, but now I get it.
>
> One more question: This upgrading of DC IVAC to DC CIVAC is because
> the code is run under virtualization, right?

Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
also a perfectly valid implementation of both IVAC and CVAC. And it
isn't uncommon that CPUs implement everything the same way.

This is why, if you dare looking at the definition of ESR_ELx for a
data abort, you will notice that ESR_ELx.WnR (which indicates whether
the abort was triggered by a read or a write) is always 1 (denoting a
write) on a CMO, irrespective of what CMO triggered it.

Additionally, FEAT_CMOW requires both read and write permissions to
not trigger faults on CMOs, but that's a bit orthogonal to your
problem.

> [1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/
>
> >> The routine[1] that does this remapping invalidates the virtual address range,
> >> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> >> but we should never encounter dirty cache lines there as the remap is done
> >> before driver probe.
> >>
> >> Can you advise what should be done differently?
> >
> > If you always map the flash as Device memory, there is no need for
> > CMOs. Same thing if you map it as NC.
>
> Everything, except for RAM, is mapped Device-nGnRnE.

Right, no problem then.

> > And even if you did map it as
> > Cacheable, it wouldn't matter. KVM already handles coherency when the
> > flash is switching between memory-mapped and programming mode, as the
> > physical address space changes (the flash literally drops from the
> > memory map).
> >
> > In general, issuing CMOs to a device is a bizarre concept, because it
> > is pretty rare that a device can handle a full cache-line as
> > write-back. Devices tend to handle smaller, register-sized accesses,
> > not a full 64-byte eviction.
>
> The same remap_range function is used to:
>
> - remap normal memory:
>   - Mapping memory regions uncached for memory test
>   - Mapping memory buffers coherent or write-combine
> - remap ROMs: BootROMs at address zero may be hidden behind the NULL
>   page and need to be mapped differently when calling/peeking into it.
> - remap device memory, e.g. in the case of the cfi-flash here
>
> The invalidation is done unconditionally for all of them, although it
> only makes sense in the first case.

Indeed.

> > Now, I'm still wondering whether we should simply forward the CMO to
> > userspace as we do for other writes, and let the VMM deal with it. The
> > main issue is that there is no current way to describe this.
> >
> > The alternative would be to silently handle the trap and pretend it
> > never occurred, as we do for other bizarre behaviours. But that'd be
> > something only new kernels would support, and I guess you'd like your
> > guest to work today, not tomorrow.
>
> I think the following fix on the barebox side may work:
>
> - Walk all pages about to be remapped
> - Execute the AT instruction on the page's base address

Why do you need AT if you are walking the PTs? If you walk, you
already have access to the memory attributes. In general, AT can be
slower than an actual walk.

Or did you actually mean iterating over the VA range? Even in that
case, AT can be a bad idea, as you are likely to iterate in page-size
increments even if you have a block mapping. Walking the PTs tells you
immediately how much a leaf is mapping (assuming you don't have any
other tracking).

> - Only if the page was previously mapped cacheable, clean + invalidate
>   the cache
> - Remove the current cache invalidation after remap
>
> Does that sound sensible?

This looks reasonable (apart from the AT thingy).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.
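
For reference, an AT-based lookup of the kind being discussed might
look roughly like this; note the ISB Marc mentions between the AT and
the PAR_EL1 readout (sketch only, not barebox's mmuinfo code):

#include <stdbool.h>
#include <stdint.h>

/* Translate a VA via AT S1E1R and report whether it succeeded.
 * PAR_EL1.F (bit 0) flags a failed translation; on success,
 * PAR_EL1.ATTR (bits 63:56) holds the resulting memory attributes. */
static bool va_to_par(uint64_t va, uint64_t *par)
{
	asm volatile("at s1e1r, %0" : : "r" (va));
	asm volatile("isb");
	asm volatile("mrs %0, par_el1" : "=r" (*par));

	return !(*par & 1); /* F bit clear => translation succeeded */
}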

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Ahmad Fatoum @ 2024-10-09 6:11 UTC
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 06.10.24 12:28, Marc Zyngier wrote:
> On Sun, 06 Oct 2024 08:59:56 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 05.10.24 23:35, Marc Zyngier wrote:
>>> On Sat, 05 Oct 2024 19:38:23 +0100,
>>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> One more question: This upgrading of DC IVAC to DC CIVAC is because
>> the code is run under virtualization, right?
>
> Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> also a perfectly valid implementation of both IVAC and CVAC. And it
> isn't uncommon that CPUs implement everything the same way.

Makes sense. After all, software should expect cache lines to
be evicted at any time due to capacity misses anyway.

>> I think the following fix on the barebox side may work:
>>
>> - Walk all pages about to be remapped
>> - Execute the AT instruction on the page's base address
>
> Why do you need AT if you are walking the PTs? If you walk, you
> already have access to the memory attributes. In general, AT can be
> slower than an actual walk.
>
> Or did you actually mean iterating over the VA range? Even in that
> case, AT can be a bad idea, as you are likely to iterate in page-size
> increments even if you have a block mapping. Walking the PTs tells you
> immediately how much a leaf is mapping (assuming you don't have any
> other tracking).

There's no other tracking and I hoped that using AT (which is already
being used for the mmuinfo shell command) would be easier.

I see now that it would be too suboptimal to do it this way and have
implemented a revised arch_remap_range[1] for barebox, which I just
Cc'd you on.

[1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u

>> - Only if the page was previously mapped cacheable, clean + invalidate
>>   the cache
>> - Remove the current cache invalidation after remap
>>
>> Does that sound sensible?
>
> This looks reasonable (apart from the AT thingy).

I have two (hopefully the last!) questions about remaining differing
behavior with KVM and without:

1) Unaligned stack accesses crash in KVM:

start: /* This will be mapped at 0x40080000 */
ldr x0, =0x4007fff0
mov sp, x0
stp x0, x1, [sp] // This is ok

ldr x0, =0x4007fff8
mov sp, x0
stp x0, x1, [sp] // This crashes

I know that the stack should be 16-byte aligned, but why does it crash
only under KVM?

Context: The barebox Image used for Qemu has a Linux ARM64 "Image" header,
so it's loaded at an offset and grows the stack down into this memory region
until the FDT's /memory could be decoded and a proper stack is set up.

A regression introduced earlier this year caused the stack to grow down
from a non-16-byte address, which is fixed in [2].

[2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182

2) Using uncached memory for Virt I/O queues with KVM enabled is considerably
slower. My guess is that these accesses keep getting trapped, but what I wonder
about is the performance discrepancy between the big.LITTLE cores
(measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):

KVM && !CACHED && 1x Cortex-A53: 0.137s
KVM && !CACHED && 1x Cortex-A72: 54.030s
KVM && CACHED  && 1x Cortex-A53: 0.120s
KVM && CACHED  && 1x Cortex-A72: 0.035s

The A53s are CPUs 0-1 and the A72 are 2-5.

Any idea why accessing uncached memory from the big core is so much worse?

Thank you!
Ahmad

>
> Thanks,
>
> M.
>

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
From: Marc Zyngier @ 2024-10-09 8:05 UTC
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
    linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Wed, 09 Oct 2024 07:11:52 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 06.10.24 12:28, Marc Zyngier wrote:
> > On Sun, 06 Oct 2024 08:59:56 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 05.10.24 23:35, Marc Zyngier wrote:
> >>> On Sat, 05 Oct 2024 19:38:23 +0100,
> >>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> One more question: This upgrading of DC IVAC to DC CIVAC is because
> >> the code is run under virtualization, right?
> >
> > Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> > also a perfectly valid implementation of both IVAC and CVAC. And it
> > isn't uncommon that CPUs implement everything the same way.
>
> Makes sense. After all, software should expect cache lines to
> be evicted at any time due to capacity misses anyway.
>
> >> I think the following fix on the barebox side may work:
> >>
> >> - Walk all pages about to be remapped
> >> - Execute the AT instruction on the page's base address
> >
> > Why do you need AT if you are walking the PTs? If you walk, you
> > already have access to the memory attributes. In general, AT can be
> > slower than an actual walk.
> >
> > Or did you actually mean iterating over the VA range? Even in that
> > case, AT can be a bad idea, as you are likely to iterate in page-size
> > increments even if you have a block mapping. Walking the PTs tells you
> > immediately how much a leaf is mapping (assuming you don't have any
> > other tracking).
>
> There's no other tracking and I hoped that using AT (which is already
> being used for the mmuinfo shell command) would be easier.

Heavy use of AT is debatable. It requires heavy synchronisation (the
ISB between AT and the PAR_EL1 readout), traps in some circumstances,
and is not guaranteed to report the exact content of the page tables
when it comes to memory attributes (it can instead report what the
implementation actually does). In turn, it *may* hit in the TLBs. Or
thrash them.

But the real use of AT is when you do not have a virtual mapping for
your page tables, making them difficult to walk in SW. For example,
KVM uses AT to walk the guest's stage-1 PTs on stage-2 fault, because
we don't have the guest's memory mapped at EL2 in general.

Maybe this doesn't matter in your case, as this doesn't happen often
enough to justify a dedicated SW walker.

> I see now that it would be too suboptimal to do it this way and have
> implemented a revised arch_remap_range[1] for barebox, which I just
> Cc'd you on.

I'll try to have a look.

> [1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u
>
> >> - Only if the page was previously mapped cacheable, clean + invalidate
> >>   the cache
> >> - Remove the current cache invalidation after remap
> >>
> >> Does that sound sensible?
> >
> > This looks reasonable (apart from the AT thingy).
>
> I have two (hopefully the last!) questions about remaining differing
> behavior with KVM and without:
>
> 1) Unaligned stack accesses crash in KVM:
>
> start: /* This will be mapped at 0x40080000 */
> ldr x0, =0x4007fff0
> mov sp, x0
> stp x0, x1, [sp] // This is ok
>
> ldr x0, =0x4007fff8
> mov sp, x0
> stp x0, x1, [sp] // This crashes
>
> I know that the stack should be 16-byte aligned, but why does it crash
> only under KVM?

KVM shouldn't be involved in any of this.

What is SCTLR_EL1.SA set to? KVM resets it to 1, which is legal ("On a
Warm reset, this field resets to an architecturally UNKNOWN value.").
You shouldn't rely on it being 0 if you are doing unaligned accesses
to the stack pointer.

> Context: The barebox Image used for Qemu has a Linux ARM64 "Image" header,
> so it's loaded at an offset and grows the stack down into this memory region
> until the FDT's /memory could be decoded and a proper stack is set up.
>
> A regression introduced earlier this year caused the stack to grow down
> from a non-16-byte address, which is fixed in [2].
>
> [2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182
>
> 2) Using uncached memory for Virt I/O queues with KVM enabled is considerably
> slower. My guess is that these accesses keep getting trapped, but what I wonder
> about is the performance discrepancy between the big.LITTLE cores
> (measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):
>
> KVM && !CACHED && 1x Cortex-A53: 0.137s
> KVM && !CACHED && 1x Cortex-A72: 54.030s
> KVM && CACHED  && 1x Cortex-A53: 0.120s
> KVM && CACHED  && 1x Cortex-A72: 0.035s
>
> The A53s are CPUs 0-1 and the A72 are 2-5.
>
> Any idea why accessing uncached memory from the big core is so much worse?

KVM shouldn't trap these accesses at all, except for the initial
faulting-in of the memory (once per page). But to see such a
difference, the problem is likely somewhere else.

As you seem to be using QEMU as the VMM, you are likely to fall into
the "Mismatched attributes" sink-hole (see B2.16 "Mismatched memory
attributes" in the ARM ARM).

The gist of the problem is that QEMU is using a cacheable mapping for
all of the guest memory, while you are using an NC mapping. Are these
two guaranteed to be coherent? Not at all.

The abysmal level of performance is likely to be caused by the rate of
miss in the cache on the QEMU side. It simply doesn't observe the
guest's writes until it misses (for fun, giggles and self-promotion,
see [1]).

As an experiment, you could run something that actively puts a lot of
cache pressure in parallel to your guest. You should see the
performance increase significantly.

Now, this is very much a case of "don't do that". Virtio is, by
definition, a cache coherent device. So either you use it with
attributes that maintain coherency (and everything works), or you add
cache maintenance to your virtio implementation (good luck getting the
ordering right).

HTH,

M.

[1] http://events17.linuxfoundation.org/sites/events/files/slides/slides_10.pdf

--
Without deviation from the norm, progress is not possible.
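
For completeness, a sketch of inspecting and clearing the bit Marc
points at; SCTLR_EL1.SA is bit 3 per the ARM ARM. Keeping SP 16-byte
aligned (as the barebox fix in [2] does) is of course the better
solution than disabling the check:

#include <stdint.h>

#define SCTLR_EL1_SA (1ULL << 3) /* SP alignment check enable */

static void disable_sp_alignment_check(void)
{
	uint64_t sctlr;

	asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
	sctlr &= ~SCTLR_EL1_SA;
	asm volatile("msr sctlr_el1, %0" : : "r" (sctlr));
	asm volatile("isb"); /* make the new setting visible */
}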