* [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
@ 2024-10-04 9:47 Ahmad Fatoum
2024-10-04 10:40 ` Peter Maydell
From: Ahmad Fatoum @ 2024-10-04 9:47 UTC (permalink / raw)
To: qemu-arm, kvmarm
Cc: peter.maydell, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hi,
I am investigating a data abort affecting the barebox bootloader built for aarch64
that only manifests with qemu-system-aarch64 -enable-kvm.
The issue happens when using the post-indexed form of LDR on an MMIO address:

    ldr x0, =0x9000fe0  // MMIO address
    ldr w1, [x0], #4    // data abort, but only with -enable-kvm
Here's a full example to reproduce this:
start:
    mov  w10, 'o'
    bl   putch

    bl   amba_device_read_pid

    mov  w10, 'k'
    bl   putch

    ret

amba_device_read_pid:
    ldr  x0, =0x9000fe0
    ldr  w1, [x0], #4   // data abort unless #4 is removed or KVM disabled
    ret

putch:
    ldr  x8, =0x9000018
1:
    /* Wait until there is space in the FIFO */
    ldr  w9, [x8]
    tbnz w9, #5, 1b

    /* Send the character */
    mov  x9, #0x9000000
    str  w10, [x9]
2:
    /* Wait to make sure it hits the line, in case we die too soon. */
    ldr  w9, [x8]
    tbnz w9, #5, 2b

    ret
It assumes 0x9000000-0x9000fff is a PL011 UART as is the case on the Virt platform.
It will print an 'o', try to access 0x9000fe0 (which contains the first byte
of the AMBA Peripheral ID), and then print a 'k' using the same PL011.
To build:
aarch64-linux-gnu-as reproducer.S -o reproducer.o
aarch64-linux-gnu-objcopy -O binary reproducer.o reproducer.bin
To run (whether -bios or -kernel doesn't seem to matter):
taskset -a --cpu-list 2-5 qemu-system-aarch64 -bios reproducer.bin -machine virt,secure=off \
-cpu max -m 1024M -nographic -serial mon:stdio -trace file=/dev/null
When run _without_ kvm, this will output:
ok
When run with -enable-kvm, this will trigger a data abort in amba_device_read_pid:
o
The data abort can also be avoided by removing the post-index increment:
-ldr w1, [x0], #4
+ldr w1, [x0]
This doesn't introduce a functional difference, because x0 isn't used again anyway,
but it makes the code work under KVM.
I am using debian-arm64 Bookworm on an Amlogic A311D (4x CA72, 2x CA53) with:
QEMU emulator version 9.1.50 (v9.1.0-704-g423be09ab9)
Linux 6.11-arm64 #1 SMP Debian 6.11-1~exp1 (2024-09-19) aarch64 GNU/Linux
This issue was first noticed by openembedded-core CI while testing this patch series
adding support for testing the barebox and U-Boot bootloaders:
https://lore.kernel.org/all/ee536d88a5b6468b20e37f3daabe4aa63047d1ad.camel@pengutronix.de/
AFAICS, the U-Boot breakage has the same root cause as barebox's, except
that there it's a str instruction doing the post-increment, and the PCI
MMIO region is what's being accessed.
I haven't been successful in getting QEMU/GDB to trap on data aborts, so
here's an excerpt from my original barebox stack trace instead, which sent
me down the rabbit hole:
DABT (current EL) exception (ESR 0x96000010) at 0x0000000009030fe0
elr: 000000007fb2221c lr : 000000007fb22250
[...]
Call trace:
[<7fb2221c>] (amba_device_get_pid.constprop.0.isra.0+0x10/0x34) from [<7fb01e3c>] (start_barebox+0x88/0xb4)
[...]
This looks pretty much like a bug to me, but then I would have expected
more software to be affected by this... Maybe it's memory-mapping related?
I only tested with either the MMU disabled or the MMIO region mapped
strongly-ordered.
Please let me know if there's further information I can provide.
Thanks,
Ahmad
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 9:47 [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address Ahmad Fatoum
@ 2024-10-04 10:40 ` Peter Maydell
2024-10-04 11:51 ` Ahmad Fatoum
From: Peter Maydell @ 2024-10-04 10:40 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> I am investigating a data abort affecting the barebox bootloader built for aarch64
> that only manifests with qemu-system-aarch64 -enable-kvm.
>
> The issue happens when using the post-indexed form of LDR on a MMIO address:
>
> ldr x0, =0x9000fe0 // MMIO address
> ldr w1, [x0], #4 // data abort, but only with -enable-kvm
Don't do this -- KVM doesn't support it. For access to MMIO,
stick to instructions which will set the ISV bit in ESR_EL1.
That is:
* AArch64 loads and stores of a single general-purpose register
(including the register specified with 0b11111, including those
with Acquire/Release semantics, but excluding Load Exclusive
or Store Exclusive and excluding those with writeback).
* AArch32 instructions where the instruction:
- Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
STLH, STRHT, STRB, STLB, or STRBT instruction.
- Is not performing register writeback.
- Is not using R15 as a source or destination register.
Your instruction is doing writeback. Do the address update
as a separate instruction.
Strictly speaking this is a missing feature in KVM (in an
ideal world it would let you do MMIO with any instruction
that you could use on real hardware). In practice it is not
a major issue because you don't typically want to do odd
things when you're doing MMIO, you just want to load or
store a single data item. If you're running into this then
your guest software is usually doing something a bit strange.
thanks
-- PMM
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 10:40 ` Peter Maydell
@ 2024-10-04 11:51 ` Ahmad Fatoum
2024-10-04 12:10 ` Peter Maydell
From: Ahmad Fatoum @ 2024-10-04 11:51 UTC (permalink / raw)
To: Peter Maydell
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hello Peter,
Thanks for your quick response.
On 04.10.24 12:40, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> I am investigating a data abort affecting the barebox bootloader built for aarch64
>> that only manifests with qemu-system-aarch64 -enable-kvm.
>>
>> The issue happens when using the post-indexed form of LDR on a MMIO address:
>>
>> ldr x0, =0x9000fe0 // MMIO address
>> ldr w1, [x0], #4 // data abort, but only with -enable-kvm
>
> Don't do this -- KVM doesn't support it. For access to MMIO,
> stick to instructions which will set the ISV bit in ESR_EL1.
>
> That is:
>
> * AArch64 loads and stores of a single general-purpose register
> (including the register specified with 0b11111, including those
> with Acquire/Release semantics, but excluding Load Exclusive
> or Store Exclusive and excluding those with writeback).
> * AArch32 instructions where the instruction:
> - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> STLH, STRHT, STRB, STLB, or STRBT instruction.
> - Is not performing register writeback.
> - Is not using R15 as a source or destination register.
>
> Your instruction is doing writeback. Do the address update
> as a separate instruction.
This was enlightening, thanks. I will prepare patches to implement
readl/writel in barebox in ARM assembly, like Linux does, instead of
falling back to the generic C version that just does volatile accesses.
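As a first sketch, mirroring how Linux's arm64 <asm/io.h> does it (the
exact barebox integration is still to be worked out):

    /*
     * Sketch: MMIO accessors emitting a single load/store without
     * writeback, so an abort taken to the hypervisor has ISV=1 and
     * KVM can emulate the access.
     */
    static inline u32 __raw_readl(const volatile void __iomem *addr)
    {
            u32 val;

            asm volatile("ldr %w0, [%1]" : "=r" (val) : "r" (addr));
            return val;
    }

    static inline void __raw_writel(u32 val, volatile void __iomem *addr)
    {
            asm volatile("str %w0, [%1]" : : "r" (val), "r" (addr));
    }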
> Strictly speaking this is a missing feature in KVM (in an
> ideal world it would let you do MMIO with any instruction
> that you could use on real hardware).
I assume that's because KVM doesn't want to handle interruptions
in the middle of such "composite" instructions?
> In practice it is not
> a major issue because you don't typically want to do odd
> things when you're doing MMIO, you just want to load or
> store a single data item. If you're running into this then
> your guest software is usually doing something a bit strange.
The AMBA Peripheral ID consists of 4 bytes with some padding
in-between. The barebox code to read it out looks like this:

    static inline u32 amba_device_get_pid(void *base, u32 size)
    {
            int i;
            u32 pid;

            for (pid = 0, i = 0; i < 4; i++)
                    pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
                            (i * 8);

            return pid;
    }

    static inline u32 __raw_readl(const volatile void __iomem *addr)
    {
            return *(const volatile u32 __force *)addr;
    }
I wouldn't necessarily characterize this as strange; we just erroneously
assumed that with strongly ordered memory for MMIO regions and volatile
accesses we had our bases covered. And indeed we did, until the bases
shifted to include hardware-assisted virtualization. :-)
Thanks,
Ahmad
>
> thanks
> -- PMM
>
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 11:51 ` Ahmad Fatoum
@ 2024-10-04 12:10 ` Peter Maydell
2024-10-04 15:52 ` Oliver Upton
2024-10-04 19:50 ` Ahmad Fatoum
From: Peter Maydell @ 2024-10-04 12:10 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Peter,
>
> Thanks for your quick response.
>
> On 04.10.24 12:40, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> I am investigating a data abort affecting the barebox bootloader built for aarch64
> >> that only manifests with qemu-system-aarch64 -enable-kvm.
> >>
> >> The issue happens when using the post-indexed form of LDR on a MMIO address:
> >>
> >> ldr x0, =0x9000fe0 // MMIO address
> >> ldr w1, [x0], #4 // data abort, but only with -enable-kvm
> >
> > Don't do this -- KVM doesn't support it. For access to MMIO,
> > stick to instructions which will set the ISV bit in ESR_EL1.
> >
> > That is:
> >
> > * AArch64 loads and stores of a single general-purpose register
> > (including the register specified with 0b11111, including those
> > with Acquire/Release semantics, but excluding Load Exclusive
> > or Store Exclusive and excluding those with writeback).
> > * AArch32 instructions where the instruction:
> > - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> > LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> > STLH, STRHT, STRB, STLB, or STRBT instruction.
> > - Is not performing register writeback.
> > - Is not using R15 as a source or destination register.
> >
> > Your instruction is doing writeback. Do the address update
> > as a separate instruction.
>
> This was enlightening, thanks. I will prepare patches to implement
> readl/writel in barebox in ARM assembly, like Linux does
> instead of the current fallback to the generic C version
> that just does volatile accesses.
>
> > Strictly speaking this is a missing feature in KVM (in an
> > ideal world it would let you do MMIO with any instruction
> > that you could use on real hardware).
>
> I assume that's because KVM doesn't want to handle interruptions
> in the middle of such "composite" instructions?
It's because with the ISV=1 information in the ESR_EL2,
KVM has everything it needs to emulate the load/store:
it has the affected register number, the data width, etc. When
ISV is 0, simulating the load/store would require KVM
to load the actual instruction word, decode it to figure
out what kind of load/store it was, and then emulate
its behaviour. The instruction decode would be complicated
and if done in the kernel would increase the attack surface
exposed to the guest. (In practice KVM will these days
bounce the ISV=0 failure out to the userspace VMM, but
no VMM that I know of implements the decode of load/store
instructions in userspace either, so that just changes which
part prints the failure message.)
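For the curious, when ISV is 1 the decoded access arrives in userspace
as a KVM_EXIT_MMIO. The kvm_run fields below are the real UAPI; the
handle_mmio_*() helpers are hypothetical VMM code:

    #include <linux/kvm.h>

    /* hypothetical VMM device-emulation entry points */
    static void handle_mmio_write(__u64 addr, const __u8 *data, __u32 len);
    static void handle_mmio_read(__u64 addr, __u8 *data, __u32 len);

    /* Sketch of the VMM side: by the time userspace sees the exit,
     * KVM has already decoded address, width and direction. */
    static void handle_exit(struct kvm_run *run)
    {
            if (run->exit_reason == KVM_EXIT_MMIO) {
                    if (run->mmio.is_write)
                            handle_mmio_write(run->mmio.phys_addr,
                                              run->mmio.data, run->mmio.len);
                    else
                            /* fill run->mmio.data; KVM copies it into the
                             * guest register on the next KVM_RUN */
                            handle_mmio_read(run->mmio.phys_addr,
                                             run->mmio.data, run->mmio.len);
            }
    }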
> > In practice it is not
> > a major issue because you don't typically want to do odd
> > things when you're doing MMIO, you just want to load or
> > store a single data item. If you're running into this then
> > your guest software is usually doing something a bit strange.
>
> The AMBA Peripheral ID consists of 4 bytes with some padding
> in-between. The barebox code to read it out looks is this:
>
> static inline u32 amba_device_get_pid(void *base, u32 size)
> {
> int i;
> u32 pid;
>
> for (pid = 0, i = 0; i < 4; i++)
> pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
> (i * 8);
>
> return pid;
> }
>
> static inline u32 __raw_readl(const volatile void __iomem *addr)
> {
> return *(const volatile u32 __force *)addr;
> }
>
> I wouldn't necessarily characterize this as strange, we just erroneously
> assumed that with strongly ordered memory for MMIO regions and volatile
> accesses we had our bases covered and indeed we did until the bases
> shifted to include hardware-assisted virtualization. :-)
I'm not a fan of doing MMIO access via 'volatile' in C code,
personally -- I think the compiler has a tendency to do more
clever recombination than you might actually want, because
it doesn't know that the thing you're accessing isn't RAM-like.
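For instance, nothing in the language stops the compiler from lowering a
volatile-access loop like yours into exactly the post-indexed form that
started this thread (illustrative, not guaranteed codegen):

    #include <stdint.h>

    /* The volatile qualifier forbids merging or eliding the loads,
     * but says nothing about the addressing mode chosen. */
    uint32_t read_pid(const volatile uint32_t *p)
    {
            uint32_t pid = 0;

            for (int i = 0; i < 4; i++)
                    pid |= (*p++ & 255) << (i * 8); /* may become: ldr wN, [xM], #4 */

            return pid;
    }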
thanks
-- PMM
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 12:10 ` Peter Maydell
@ 2024-10-04 15:52 ` Oliver Upton
2024-10-04 15:57 ` Peter Maydell
2024-10-04 19:50 ` Ahmad Fatoum
From: Oliver Upton @ 2024-10-04 15:52 UTC (permalink / raw)
To: Peter Maydell
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > Strictly speaking this is a missing feature in KVM (in an
> > > ideal world it would let you do MMIO with any instruction
> > > that you could use on real hardware).
> >
> > I assume that's because KVM doesn't want to handle interruptions
> > in the middle of such "composite" instructions?
>
> It's because with the ISV=1 information in the ESR_EL2,
> KVM has everything it needs to emulate the load/store:
> it has the affected register number, the data width, etc. When
> ISV is 0, simulating the load/store would require KVM
> to load the actual instruction word, decode it to figure
> out what kind of load/store it was, and then emulate
> its behaviour. The instruction decode would be complicated
> and if done in the kernel would increase the attack surface
> exposed to the guest.
On top of that, the only way to 'safely' fetch the instruction would be
to pause all vCPUs in the VM to prevent the guest from remapping the
address space behind either KVM or the VMM's back.
> > static inline u32 __raw_readl(const volatile void __iomem *addr)
> > {
> > return *(const volatile u32 __force *)addr;
> > }
> >
> > I wouldn't necessarily characterize this as strange, we just erroneously
> > assumed that with strongly ordered memory for MMIO regions and volatile
> > accesses we had our bases covered and indeed we did until the bases
> > shifted to include hardware-assisted virtualization. :-)
This has nothing to do with ordering and everything to do with the
instruction set && what your compiler decided to emit.
> I'm not a fan of doing MMIO access via 'volatile' in C code,
> personally -- I think the compiler has a tendency to do more
> clever recombination than you might actually want, because
> it doesn't know that the thing you're accessing isn't RAM-like.
+1
--
Thanks,
Oliver
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 15:52 ` Oliver Upton
@ 2024-10-04 15:57 ` Peter Maydell
2024-10-04 16:21 ` Oliver Upton
From: Peter Maydell @ 2024-10-04 15:57 UTC (permalink / raw)
To: Oliver Upton
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > > Strictly speaking this is a missing feature in KVM (in an
> > > > ideal world it would let you do MMIO with any instruction
> > > > that you could use on real hardware).
> > >
> > > I assume that's because KVM doesn't want to handle interruptions
> > > in the middle of such "composite" instructions?
> >
> > It's because with the ISV=1 information in the ESR_EL2,
> > KVM has everything it needs to emulate the load/store:
> > it has the affected register number, the data width, etc. When
> > ISV is 0, simulating the load/store would require KVM
> > to load the actual instruction word, decode it to figure
> > out what kind of load/store it was, and then emulate
> > its behaviour. The instruction decode would be complicated
> > and if done in the kernel would increase the attack surface
> > exposed to the guest.
>
> On top of that, the only way to 'safely' fetch the instruction would be
> to pause all vCPUs in the VM to prevent the guest from remapping the
> address space behind either KVM or the VMM's back.
Do we actually care about that, though? If the guest does that, isn't
it equivalent to a hardware CPU happening to fetch the insn just after
a remapping rather than just before?
If you decode the insn and it's not a store you could just
restart the guest...
thanks
-- PMM
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 15:57 ` Peter Maydell
@ 2024-10-04 16:21 ` Oliver Upton
From: Oliver Upton @ 2024-10-04 16:21 UTC (permalink / raw)
To: Peter Maydell
Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, Oct 04, 2024 at 04:57:56PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> > > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > > > Strictly speaking this is a missing feature in KVM (in an
> > > > > ideal world it would let you do MMIO with any instruction
> > > > > that you could use on real hardware).
> > > >
> > > > I assume that's because KVM doesn't want to handle interruptions
> > > > in the middle of such "composite" instructions?
> > >
> > > It's because with the ISV=1 information in the ESR_EL2,
> > > KVM has everything it needs to emulate the load/store:
> > > it has the affected register number, the data width, etc. When
> > > ISV is 0, simulating the load/store would require KVM
> > > to load the actual instruction word, decode it to figure
> > > out what kind of load/store it was, and then emulate
> > > its behaviour. The instruction decode would be complicated
> > > and if done in the kernel would increase the attack surface
> > > exposed to the guest.
> >
> > On top of that, the only way to 'safely' fetch the instruction would be
> > to pause all vCPUs in the VM to prevent the guest from remapping the
> > address space behind either KVM or the VMM's back.
>
> Do we actually care about that, though?
Judging from the fact that our existing MMIO flows have a similar "bug",
I'd say no. I was just being pedantic about how annoying it'd be to do
this faithfully, including the VA -> IPA translation.
> If the guest does
> that isn't it equivalent to a hardware CPU happening to
> fetch the insn just-after a remapping rather than just-before?
> If you decode the insn and it's not a store you could just
> restart the guest...
Definitely, you'd need to restart any time the instruction doesn't line
up with the ESR. The pedantic case I was thinking about is when the
instruction bytes remain the same but the page is marked non-executable:

    T1                              T2
    ==                              ==
    readl(addr);
    < MMIO data abort >
    insn = fetch(readl);
                                    set_nx(readl);
                                    tlbi(readl);
                                    dsb(ish);
    emulate(insn);
--
Thanks,
Oliver
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 12:10 ` Peter Maydell
2024-10-04 15:52 ` Oliver Upton
@ 2024-10-04 19:50 ` Ahmad Fatoum
2024-10-05 10:31 ` Marc Zyngier
From: Ahmad Fatoum @ 2024-10-04 19:50 UTC (permalink / raw)
To: Peter Maydell
Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hi,
On 04.10.24 14:10, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 04.10.24 12:40, Peter Maydell wrote:
>>> Don't do this -- KVM doesn't support it. For access to MMIO,
>>> stick to instructions which will set the ISV bit in ESR_EL1.
>>>
>>> That is:
>>>
>>> * AArch64 loads and stores of a single general-purpose register
>>> (including the register specified with 0b11111, including those
>>> with Acquire/Release semantics, but excluding Load Exclusive
>>> or Store Exclusive and excluding those with writeback).
>>> * AArch32 instructions where the instruction:
>>> - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
>>> LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
>>> STLH, STRHT, STRB, STLB, or STRBT instruction.
>>> - Is not performing register writeback.
>>> - Is not using R15 as a source or destination register.
>>>
>>> Your instruction is doing writeback. Do the address update
>>> as a separate instruction.
With readl/writel implemented in assembly, I get beyond that point, but
now I get a data abort running a DC IVAC instruction on address 0x1000,
where the cfi-flash is located. This instruction is part of a routine
to remap the cfi-flash to start a page later, so the zero page can be
mapped faulting.
Simple reproducer:
start:
    ldr  x0, =0x1000
    ldr  x1, =0x1040
    bl   v8_inv_dcache_range

    mov  w10, '!'
    bl   putch

    ret

v8_inv_dcache_range:
    mrs  x3, ctr_el0
    lsr  x3, x3, #16
    and  x3, x3, #0xf       /* CTR_EL0.DminLine */
    mov  x2, #0x4
    lsl  x2, x2, x3         /* D-cache line size in bytes: 4 << DminLine */
    sub  x3, x2, #0x1
    bic  x0, x0, x3         /* align start down to the line size */
1:
    dc   ivac, x0           /* invalidate line to Point of Coherency */
    add  x0, x0, x2
    cmp  x0, x1
    b.cc 1b
    dsb  sy
    ret
This prints ! without KVM, but triggers a data abort before that with -enable-kvm:
DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
elr: 000000007fbe0550 lr : 000000007fbe01ac
[snip]
Call trace:
[<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
[<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
[<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
[snip]
Any idea what this is about?
Thanks,
Ahmad
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-04 19:50 ` Ahmad Fatoum
@ 2024-10-05 10:31 ` Marc Zyngier
2024-10-05 18:38 ` Ahmad Fatoum
From: Marc Zyngier @ 2024-10-05 10:31 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Fri, 04 Oct 2024 20:50:18 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hi,
>
> On 04.10.24 14:10, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 04.10.24 12:40, Peter Maydell wrote:
> >>> Don't do this -- KVM doesn't support it. For access to MMIO,
> >>> stick to instructions which will set the ISV bit in ESR_EL1.
> >>>
> >>> That is:
> >>>
> >>> * AArch64 loads and stores of a single general-purpose register
> >>> (including the register specified with 0b11111, including those
> >>> with Acquire/Release semantics, but excluding Load Exclusive
> >>> or Store Exclusive and excluding those with writeback).
> >>> * AArch32 instructions where the instruction:
> >>> - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> >>> LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> >>> STLH, STRHT, STRB, STLB, or STRBT instruction.
> >>> - Is not performing register writeback.
> >>> - Is not using R15 as a source or destination register.
> >>>
> >>> Your instruction is doing writeback. Do the address update
> >>> as a separate instruction.
>
> With readl/writel implemented in assembly, I get beyond that point, but
> now I get a data abort running an DC IVAC instruction on address 0x1000,
> where the cfi-flash is located. This instruction is part of a routine
> to remap the cfi-flash to start a page later, so the zero page can be
> mapped faulting.
>
> Simple reproducer:
>
> start:
> ldr x0, =0x1000
> ldr x1, =0x1040
> bl v8_inv_dcache_range
>
> mov w10, '!'
> bl putch
>
> ret
>
> v8_inv_dcache_range:
> mrs x3, ctr_el0
> lsr x3, x3, #16
> and x3, x3, #0xf
> mov x2, #0x4
> lsl x2, x2, x3
> sub x3, x2, #0x1
> bic x0, x0, x3
> 1:
> dc ivac, x0
> add x0, x0, x2
> cmp x0, x1
> b.cc 1b
> dsb sy
> ret
>
> This prints ! without KVM, but triggers a data abort before that with -enable-kvm:
>
> DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
> elr: 000000007fbe0550 lr : 000000007fbe01ac
> [snip]
>
> Call trace:
> [<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
> [<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
> [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
> [snip]
>
> Any idea what this is about?
IIRC, the QEMU flash is implemented as a read-only memslot. A data
cache invalidation is a *write*, as it can be (and is) upgraded to a
clean+invalidate by the HW.
KVM cannot satisfy the write, for obvious reasons, and tells the guest
to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
not).
In the end, you get an exception. We could relax this by
special-casing CMOs to RO memslots, but this doesn't look great.
The real question is: what are you trying to achieve with this?
M.
--
Without deviation from the norm, progress is not possible.
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-05 10:31 ` Marc Zyngier
@ 2024-10-05 18:38 ` Ahmad Fatoum
2024-10-05 21:35 ` Marc Zyngier
From: Ahmad Fatoum @ 2024-10-05 18:38 UTC (permalink / raw)
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hello Marc,
On 05.10.24 12:31, Marc Zyngier wrote:
> On Fri, 04 Oct 2024 20:50:18 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> With readl/writel implemented in assembly, I get beyond that point, but
>> now I get a data abort running an DC IVAC instruction on address 0x1000,
>> where the cfi-flash is located. This instruction is part of a routine
>> to remap the cfi-flash to start a page later, so the zero page can be
>> mapped faulting.
[snip]
>> Any idea what this is about?
>
> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> cache invalidation is a *write*, as it can be (and is) upgraded to a
> clean+invalidate by the HW.
So it's a write, even if there are no dirty cache lines?
> KVM cannot satisfy the write, for obvious reasons, and tells the guest
> to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> not).
>
> In the end, you get an exception. We could relax this by
> special-casing CMOs to RO memslots, but this doesn't look great.
>
> The real question is: what are you trying to achieve with this?
barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
want to map the zero page faulting, but still have access to the first
block of the cfi-flash.
Therefore, barebox will map the cfi-flash one page later
(virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
page can be mapped faulting.
The routine[1] that does this remapping invalidates the virtual address range,
because the attributes may change[2]. This invalidate also happens for cfi-flash,
but we should never encounter dirty cache lines there as the remap is done
before driver probe.
Can you advise what should be done differently?
[1]: https://elixir.bootlin.com/barebox/v2024.09.0/source/arch/arm/cpu/mmu_64.c#L193
[2]: https://lore.kernel.org/barebox/20230526063354.1145474-4-a.fatoum@pengutronix.de/
Thanks,
Ahmad
>
> M.
>
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-05 18:38 ` Ahmad Fatoum
@ 2024-10-05 21:35 ` Marc Zyngier
2024-10-06 7:59 ` Ahmad Fatoum
From: Marc Zyngier @ 2024-10-05 21:35 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Sat, 05 Oct 2024 19:38:23 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 05.10.24 12:31, Marc Zyngier wrote:
> > On Fri, 04 Oct 2024 20:50:18 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> With readl/writel implemented in assembly, I get beyond that point, but
> >> now I get a data abort running an DC IVAC instruction on address 0x1000,
> >> where the cfi-flash is located. This instruction is part of a routine
> >> to remap the cfi-flash to start a page later, so the zero page can be
> >> mapped faulting.
>
> [snip]
>
> >> Any idea what this is about?
> >
> > IIRC, the QEMU flash is implemented as a read-only memslot. A data
> > cache invalidation is a *write*, as it can be (and is) upgraded to a
> > clean+invalidate by the HW.
>
> So it's a write, even if there are no dirty cache lines?
Yes.
At the point where this is handled, the CPU has no clue about the
dirty state of an arbitrary cache line, at an arbitrary cache level.
The CPU simply forwards the CMO downstream, and the permission check
happens way before that.
> > KVM cannot satisfy the write, for obvious reasons, and tells the guest
> > to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> > which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> > not).
> >
> > In the end, you get an exception. We could relax this by
> > special-casing CMOs to RO memslots, but this doesn't look great.
> >
> > The real question is: what are you trying to achieve with this?
>
> barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
> want to map the zero page faulting, but still have access to the first
> block of the cfi-flash.
>
> Therefore, barebox will map the cfi-flash one page later
> (virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
> page can be mapped faulting.
>
> The routine[1] that does this remapping invalidates the virtual address range,
> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> but we should never encounter dirty cache lines there as the remap is done
> before driver probe.
>
> Can you advise what should be done differently?
If you always map the flash as Device memory, there is no need for
CMOs. Same thing if you map it as NC. And even if you did map it as
Cacheable, it wouldn't matter. KVM already handles coherency when the
flash is switching between memory-mapped and programming mode, as the
physical address space changes (the flash literally drops from the
memory map).
In general, issuing CMOs to a device is a bizarre concept, because it
is pretty rare that a device can handle a full cache-line as
write-back. Devices tend to handle smaller, register-sized accesses,
not a full 64-byte eviction.
Now, I'm still wondering whether we should simply forward the CMO to
userspace as we do for other writes, and let the VMM deal with it. The
main issue is that there is no current way to describe this.
The alternative would be to silently handle the trap and pretend it
never occurred, as we do for other bizarre behaviours. But that'd be
something only new kernels would support, and I guess you'd like your
guest to work today, not tomorrow.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-05 21:35 ` Marc Zyngier
@ 2024-10-06 7:59 ` Ahmad Fatoum
2024-10-06 10:28 ` Marc Zyngier
From: Ahmad Fatoum @ 2024-10-06 7:59 UTC (permalink / raw)
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hello Marc,
On 05.10.24 23:35, Marc Zyngier wrote:
> On Sat, 05 Oct 2024 19:38:23 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
>>> cache invalidation is a *write*, as it can be (and is) upgraded to a
>>> clean+invalidate by the HW.
>>
>> So it's a write, even if there are no dirty cache lines?
>
> Yes.
>
> At the point where this is handled, the CPU has no clue about the
> dirty state of an arbitrary cache line, at an arbitrary cache level.
> The CPU simply forward the CMO downstream, and the permission check
> happens way before that.
I see. When I added that invalidation, it was suggested[1] that instead
of invalidating after the remapping, I clean the range before remapping.
Cleaning the zero page triggered a data abort, which I didn't understand,
as there should be no dirty lines, but now I get it.
One more question: This upgrading of DC IVAC to DC CIVAC is because
the code is run under virtualization, right?
[1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/
>> The routine[1] that does this remapping invalidates the virtual address range,
>> because the attributes may change[2]. This invalidate also happens for cfi-flash,
>> but we should never encounter dirty cache lines there as the remap is done
>> before driver probe.
>>
>> Can you advise what should be done differently?
>
> If you always map the flash as Device memory, there is no need for
> CMOs. Same thing if you map it as NC.
Everything, except for RAM, is mapped Device-nGnRnE.
> And even if you did map it as
> Cacheable, it wouldn't matter. KVM already handles coherency when the
> flash is switching between memory-mapped and programming mode, as the
> physical address space changes (the flash literally drops from the
> memory map).
>
> In general, issuing CMOs to a device is a bizarre concept, because it
> is pretty rare that a device can handle a full cache-line as
> write-back. Devices tend to handle smaller, register-sized accesses,
> not a full 64-byte eviction.
The same remap_range function is used to:
- remap normal memory:
- Mapping memory regions uncached for memory test
- Mapping memory buffers coherent or write-combine
- remap ROMs: BootROMs at address zero may be hidden behind the NULL
page and need to be mapped differently when calling/peeking into it.
- remap device memory, e.g. in the case of the cfi-flash here
The invalidation is done unconditionally for all of them, although it
only makes sense in the first case.
> Now, I'm still wondering whether we should simply forward the CMO to
> userspace as we do for other writes, and let the VMM deal with it. The
> main issue is that there is no current way to describe this.
>
> The alternative would be to silently handle the trap and pretend it
> never occurred, as we do for other bizarre behaviours. But that'd be
> something only new kernels would support, and I guess you'd like your
> guest to work today, not tomorrow.
I think the following fix on the barebox side may work (sketched below):
- Walk all pages about to be remapped
- Execute the AT instruction on the page's base address
- Only if the page was previously mapped cacheable, clean + invalidate
the cache
- Remove the current cache invalidation after remap
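Roughly, as a C sketch (the helper names are made up, not existing
barebox API; only remap_range exists today):

    /* Sketch of the reworked remap: clean+invalidate only ranges that
     * were previously mapped cacheable, instead of a blanket invalidate
     * of the whole range after the remap. */
    static void remap_range_checked(unsigned long start, unsigned long size,
                                    unsigned long new_attrs)
    {
            unsigned long va, granule, attrs;

            for (va = start; va < start + size; va += granule) {
                    /* attribute lookup, via AT or a page-table walk */
                    attrs = mapping_attrs(va, &granule);
                    if (attrs_are_cacheable(attrs))
                            v8_flush_dcache_range(va, va + granule);
            }

            /* no unconditional invalidate after the remap anymore */
            remap_range((void *)start, size, new_attrs);
    }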
Does that sound sensible?
Thanks,
Ahmad
>
> M.
>
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-06 7:59 ` Ahmad Fatoum
@ 2024-10-06 10:28 ` Marc Zyngier
2024-10-09 6:11 ` Ahmad Fatoum
From: Marc Zyngier @ 2024-10-06 10:28 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Sun, 06 Oct 2024 08:59:56 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 05.10.24 23:35, Marc Zyngier wrote:
> > On Sat, 05 Oct 2024 19:38:23 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> >>> cache invalidation is a *write*, as it can be (and is) upgraded to a
> >>> clean+invalidate by the HW.
> >>
> >> So it's a write, even if there are no dirty cache lines?
> >
> > Yes.
> >
> > At the point where this is handled, the CPU has no clue about the
> > dirty state of an arbitrary cache line, at an arbitrary cache level.
> > The CPU simply forward the CMO downstream, and the permission check
> > happens way before that.
>
> I see. When I added that invalidation, it was suggested[1] that instead
> of invalidation after the remapping, I would clean the range before
> remapping. Cleaning the zero page triggered a data abort, which I didn't
> understand as there should be no dirty lines, but now I get it.
>
> One more question: This upgrading of DC IVAC to DC CIVAC is because
> the code is run under virtualization, right?
Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
also a perfectly valid implementation of both IVAC and CVAC. And it
isn't uncommon that CPUs implement everything the same way.
This is why, if you dare look at the definition of ESR_ELx for a
data abort, you will notice that ESR_ELx.WnR (which indicates whether
the abort was triggered by a read or a write) is always 1 (denoting a
write) on a CMO, irrespective of which CMO triggered it.
Additionally, FEAT_CMOW requires both read and write permissions to
not trigger faults on CMOs, but that's a bit orthogonal to your
problem.
>
> [1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/
>
> >> The routine[1] that does this remapping invalidates the virtual address range,
> >> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> >> but we should never encounter dirty cache lines there as the remap is done
> >> before driver probe.
> >>
> >> Can you advise what should be done differently?
> >
> > If you always map the flash as Device memory, there is no need for
> > CMOs. Same thing if you map it as NC.
>
> Everything, except for RAM, is mapped Device-nGnRnE.
Right, no problem then.
>
> > And even if you did map it as
> > Cacheable, it wouldn't matter. KVM already handles coherency when the
> > flash is switching between memory-mapped and programming mode, as the
> > physical address space changes (the flash literally drops from the
> > memory map).
> >
> > In general, issuing CMOs to a device is a bizarre concept, because it
> > is pretty rare that a device can handle a full cache-line as
> > write-back. Devices tend to handle smaller, register-sized accesses,
> > not a full 64-byte eviction.
>
> The same remap_range function is used to:
>
> - remap normal memory:
> - Mapping memory regions uncached for memory test
> - Mapping memory buffers coherent or write-combine
> - remap ROMs: BootROMs at address zero may be hidden behind the NULL
> page and need to be mapped differently when calling/peeking into it.
> - remap device memory, e.g. in the case of the cfi-flash here
>
> The invalidation is done unconditionally for all of them, although it
> makes only the sense in the first case.
Indeed.
>
> > Now, I'm still wondering whether we should simply forward the CMO to
> > userspace as we do for other writes, and let the VMM deal with it. The
> > main issue is that there is no current way to describe this.
> >
> > The alternative would be to silently handle the trap and pretend it
> > never occurred, as we do for other bizarre behaviours. But that'd be
> > something only new kernels would support, and I guess you'd like your
> > guest to work today, not tomorrow.
>
> I think following fix on the barebox side may work:
>
> - Walk all pages about to be remapped
> - Execute the AT instruction on the page's base address
Why do you need AT if you are walking the PTs? If you walk, you
already have access to the memory attributes. In general, AT can be
slower than an actual walk.
Or did you actually mean iterating over the VA range? Even in that
case, AT can be a bad idea, as you are likely to iterate in page-size
increments even if you have a block mapping. Walking the PTs tells you
immediately how much a leaf is mapping (assuming you don't have any
other tracking).
> - Only if the page was previously mapped cacheable, clean + invalidate
> the cache
> - Remove the current cache invalidation after remap
>
> Does that sound sensible?
This looks reasonable (apart from the AT thingy).
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-06 10:28 ` Marc Zyngier
@ 2024-10-09 6:11 ` Ahmad Fatoum
2024-10-09 8:05 ` Marc Zyngier
From: Ahmad Fatoum @ 2024-10-09 6:11 UTC (permalink / raw)
To: Marc Zyngier
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
Hello Marc,
On 06.10.24 12:28, Marc Zyngier wrote:
> On Sun, 06 Oct 2024 08:59:56 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 05.10.24 23:35, Marc Zyngier wrote:
>>> On Sat, 05 Oct 2024 19:38:23 +0100,
>>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> One more question: This upgrading of DC IVAC to DC CIVAC is because
>> the code is run under virtualization, right?
>
> Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> also a perfectly valid implementation of both IVAC and CVAC. And it
> isn't uncommon that CPUs implement everything the same way.
Makes sense. After all, software should expect cache lines to
be evicted at any time due to capacity misses anyway.
>> I think following fix on the barebox side may work:
>>
>> - Walk all pages about to be remapped
>> - Execute the AT instruction on the page's base address
>
> Why do you need AT if you are walking the PTs? If you walk, you
> already have access to the memory attributes. In general, AT can be
> slower than an actual walk.
>
> Or did you actually mean iterating over the VA range? Even in that
> case, AT can be a bad idea, as you are likely to iterate in page-size
> increments even if you have a block mapping. Walking the PTs tells you
> immediately how much a leaf is mapping (assuming you don't have any
> other tracking).
There's no other tracking, and I had hoped that using AT (which is
already being used for the mmuinfo shell command) would be easier.
I see now that doing it this way would be too inefficient, and I have
implemented a revised arch_remap_range[1] for barebox, which I just
Cc'd you on.
[1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u
>> - Only if the page was previously mapped cacheable, clean + invalidate
>> the cache
>> - Remove the current cache invalidation after remap
>>
>> Does that sound sensible?
>
> This looks reasonable (apart from the AT thingy).
I have two (hopefully the last!) questions about remaining differences
in behavior with KVM and without:
1) Unaligned stack accesses crash under KVM:
start:  /* This will be mapped at 0x40080000 */
    ldr  x0, =0x4007fff0
    mov  sp, x0
    stp  x0, x1, [sp]   // This is ok

    ldr  x0, =0x4007fff8
    mov  sp, x0
    stp  x0, x1, [sp]   // This crashes
I know that the stack should be 16-byte aligned, but why does it crash
only under KVM?
Context: The barebox Image used for QEMU has a Linux ARM64 "Image" header,
so it's loaded at an offset and grows the stack down into this memory
region until the FDT's /memory can be decoded and a proper stack is set up.
A regression introduced earlier this year caused the stack to grow down
from a non-16-byte-aligned address, which is fixed in [2].
[2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182
2) Using uncached memory for virtio queues with KVM enabled is considerably
slower. My guess is that these accesses keep getting trapped, but what I
wonder about is the performance discrepancy between the big.LITTLE cores
(measured by barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):
    KVM && !CACHED && 1x Cortex-A53:  0.137s
    KVM && !CACHED && 1x Cortex-A72: 54.030s
    KVM &&  CACHED && 1x Cortex-A53:  0.120s
    KVM &&  CACHED && 1x Cortex-A72:  0.035s
The A53s are CPUs 0-1 and the A72s are CPUs 2-5.
Any idea why accessing uncached memory from the big core is so much worse?
Thank you!
Ahmad
>
> Thanks,
>
> M.
>
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
2024-10-09 6:11 ` Ahmad Fatoum
@ 2024-10-09 8:05 ` Marc Zyngier
From: Marc Zyngier @ 2024-10-09 8:05 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
linux-arm-kernel@lists.infradead.org, Enrico Joerns
On Wed, 09 Oct 2024 07:11:52 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Marc,
>
> On 06.10.24 12:28, Marc Zyngier wrote:
> > On Sun, 06 Oct 2024 08:59:56 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 05.10.24 23:35, Marc Zyngier wrote:
> >>> On Sat, 05 Oct 2024 19:38:23 +0100,
> >>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> One more question: This upgrading of DC IVAC to DC CIVAC is because
> >> the code is run under virtualization, right?
> >
> > Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> > also a perfectly valid implementation of both IVAC and CVAC. And it
> > isn't uncommon that CPUs implement everything the same way.
>
> Makes sense. After all, software should expect cache lines to
> be evicted at any time due to capacity misses anyway.
>
> >> I think following fix on the barebox side may work:
> >>
> >> - Walk all pages about to be remapped
> >> - Execute the AT instruction on the page's base address
> >
> > Why do you need AT if you are walking the PTs? If you walk, you
> > already have access to the memory attributes. In general, AT can be
> > slower than an actual walk.
> >
> > Or did you actually mean iterating over the VA range? Even in that
> > case, AT can be a bad idea, as you are likely to iterate in page-size
> > increments even if you have a block mapping. Walking the PTs tells you
> > immediately how much a leaf is mapping (assuming you don't have any
> > other tracking).
>
> There's no other tracking and I hoped that using AT (which is already
> being used for the mmuinfo shell command) would be easier.
Heavy use of AT is debatable.
It requires heavy synchronisation (the ISB between AT and the PAR_EL1
readout), traps in some circumstances, and is not guaranteed to report
the exact content of the page tables when it comes to memory
attributes (it can instead report what the implementation actually
does). In turn, it *may* hit in the TLBs. Or thrash them.
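Concretely, a per-VA lookup via AT has to look something like this
(sketch):

    /* AT S1E1R + ISB + PAR_EL1 readout; the ISB is what makes the
     * result of the translation visible in PAR_EL1. */
    static inline unsigned long va_to_par(unsigned long va)
    {
            unsigned long par;

            asm volatile("at s1e1r, %0" : : "r" (va));
            asm volatile("isb");
            asm volatile("mrs %0, par_el1" : "=r" (par));
            return par;
    }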
But the real use of AT is when you do not have a virtual mapping for
your page tables, making them difficult to walk in SW. For example,
KVM uses AT to walk the guest's stage-1 PTs on stage-2 fault, because
we don't have the guest's memory mapped at EL2 in general.
Maybe this doesn't matter in your case, as this doesn't happen often
enough to justify a dedicated SW walker.
>
> I see now that it would be too suboptimal to do it this way and have
> implemented a revised arch_remap_range[1] for barebox, which I just
> Cc'd you on.
I'll try to have a look.
>
> [1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u
>
> >> - Only if the page was previously mapped cacheable, clean + invalidate
> >> the cache
> >> - Remove the current cache invalidation after remap
> >>
> >> Does that sound sensible?
> >
> > This looks reasonable (apart from the AT thingy).
>
> I have two (hopefully the last!) questions about remaining differing
> behavior with KVM and without:
>
> 1) Unaligned stack accesses crash in KVM:
>
> start: /* This will be mapped at 0x40080000 */
> ldr x0, =0x4007fff0
> mov sp, x0
> stp x0, x1, [sp] // This is ok
>
> ldr x0, =0x4007fff8
> mov sp, x0
> stp x0, x1, [sp] // This crashes
>
> I know that the stack should be 16 byte aligned, but why does it crash
> only under KVM?
KVM shouldn't be involved in any of this. What is SCTLR_EL1.SA set to?
KVM resets it to 1, which is legal ("On a Warm reset, this field
resets to an architecturally UNKNOWN value."). You shouldn't rely on
it being 0 if you are doing unaligned accesses to the stack pointer.
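If you really want unaligned SP-relative accesses to work regardless of
the reset value, clear the bit explicitly early on (sketch; the real fix
is of course keeping SP 16-byte aligned):

    /* Clear SCTLR_EL1.SA (bit 3), the SP alignment check enable. */
    static inline void clear_sctlr_sa(void)
    {
            unsigned long sctlr;

            asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
            sctlr &= ~(1UL << 3);
            asm volatile("msr sctlr_el1, %0" : : "r" (sctlr));
            asm volatile("isb");
    }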
>
> Context: The barebox Image used for Qemu has a Linux ARM64 "Image" header,
> so it's loaded at an offset and grows the stack down into this memory region
> until the FDT's /memory could be decoded and a proper stack is set up.
>
> A regression introduced earlier this year, caused the stack to grow down
> from a non-16-byte address, which is fixed in [2].
>
> [2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182
>
> 2) Using uncached memory for Virt I/O queues with KVM enabled is considerably
> slower. My guess is that these accesses keep getting trapped, but what I wonder
> about is the performance discrepancy between the big.LITTLE cores
> (measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):
>
> KVM && !CACHED && 1x Cortex-A53: 0.137s
> KVM && !CACHED && 1x Cortex-A72: 54.030s
> KVM && CACHED && 1x Cortex-A53: 0.120s
> KVM && CACHED && 1x Cortex-A72: 0.035s
>
> The A53s are CPUs 0-1 and the A72 are 2-5.
>
> Any idea why accessing uncached memory from the big core is so much worse?
KVM shouldn't trap these accesses at all, except for the initial
faulting-in of the memory (once per page).
But to see such a difference, the problem is likely somewhere else. As
you seem to be using QEMU as the VMM, you are likely to fall into the
"Mismatched attributes" sink-hole (see B2.16 "Mismatched memory
attributes" in the ARM ARM).
The gist of the problem is that QEMU is using a cacheable mapping for
all of the guest memory, while you are using a NC mapping. Are these
two guaranteed to be coherent? Not at all. The abysmal level of
performance is likely to be caused by the rate of miss in the cache on
the QEMU side. It simply doesn't observe the guest's writes until it
misses (for fun, giggles and self-promotion, see [1]).
As an experiment, you could run something that actively puts a lot of
cache pressure in parallel to your guest. You should see the
performance increase significantly.
Now, this is very much a case of "don't do that". Virtio is, by
definition, a cache coherent device. So either you use it with
attributes that maintain coherency (and everything works), or you add
cache maintenance to your virtio implementation (good luck getting the
ordering right).
HTH,
M.
[1] http://events17.linuxfoundation.org/sites/events/files/slides/slides_10.pdf
--
Without deviation from the norm, progress is not possible.