* [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
@ 2024-10-04  9:47 Ahmad Fatoum
  2024-10-04 10:40 ` Peter Maydell
  0 siblings, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-04  9:47 UTC (permalink / raw)
  To: qemu-arm, kvmarm
  Cc: peter.maydell, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hi,

I am investigating a data abort affecting the barebox bootloader built for aarch64
that only manifests with qemu-system-aarch64 -enable-kvm.

The issue happens when using the post-indexed form of LDR on an MMIO address:

        ldr     x0, =0x9000fe0           // MMIO address
        ldr     w1, [x0], #4             // data abort, but only with -enable-kvm

Here's a full example to reproduce this:

start:
        mov     w10, 'o'
        bl      putch

        bl      amba_device_read_pid

        mov     w10, 'k'
        bl      putch

        ret

amba_device_read_pid:
        ldr     x0, =0x9000fe0
        ldr     w1, [x0], #4             // data abort unless #4 is removed or KVM disabled
        ret

putch:
        ldr     x8, =0x9000018
1:
        /* Wait until there is space in the FIFO */
        ldr     w9, [x8]
        tbnz    w9, #5, 1b

        /* Send the character */
        mov     x9, #0x9000000
        str     w10, [x9]
2:
        /* Wait to make sure it hits the line, in case we die too soon. */
        ldr     w9, [x8]
        tbnz    w9, #5, 2b
        ret

It assumes 0x9000000-0x9000fff is a PL011 UART, as is the case on the Virt platform.
It will print an 'o', try to access 0x9000fe0, which contains the first byte
of the AMBA Product ID, and then print a 'k' using the same PL011.

To build:

  aarch64-linux-gnu-as reproducer.S -o reproducer.o
  aarch64-linux-gnu-objcopy -O binary reproducer.o reproducer.bin

To run (whether -bios or -kernel doesn't seem to matter):

  taskset -a --cpu-list 2-5 qemu-system-aarch64 -bios reproducer.bin -machine virt,secure=off \
  -cpu max -m 1024M -nographic -serial mon:stdio -trace file=/dev/null

When run _without_ kvm, this will output:

  ok

When run with -enable-kvm, this will trigger a data abort in amba_device_read_pid:

  o

The data abort can also be avoided by removing the post-index increment:

  -ldr     w1, [x0], #4
  +ldr     w1, [x0]

This doesn't introduce a functional difference, because x0 isn't used again anyway,
but it makes the code work under KVM.

I am using debian-arm64 Bookworm on an Amlogic A311D (4x CA72, 2x CA53) with:

  QEMU emulator version 9.1.50 (v9.1.0-704-g423be09ab9)
  Linux 6.11-arm64 #1 SMP Debian 6.11-1~exp1 (2024-09-19) aarch64 GNU/Linux

This issue was first noticed by openembedded-core CI while testing this patch series
adding support for testing the barebox and U-Boot bootloaders:

https://lore.kernel.org/all/ee536d88a5b6468b20e37f3daabe4aa63047d1ad.camel@pengutronix.de/

AFAICS, the U-Boot breakage has the same root cause as barebox's, except that there it's
a str instruction that does the post-increment and the PCI MMIO region is what's
being accessed.

I haven't been successful in getting QEMU/GDB to trap on data aborts, so instead
here's an excerpt from my original barebox stack trace that sent me down the rabbit
hole:

  DABT (current EL) exception (ESR 0x96000010) at 0x0000000009030fe0
  elr: 000000007fb2221c lr : 000000007fb22250
  [...]
  
  Call trace:
  [<7fb2221c>] (amba_device_get_pid.constprop.0.isra.0+0x10/0x34) from [<7fb01e3c>] (start_barebox+0x88/0xb4)
  [...]

This looks pretty much like a bug to me, but then I would have expected more
software to be affected by this... Maybe it's memory-mapping related?
I only tested with either the MMU disabled or the MMIO region mapped strongly-ordered.

Please let me know if there's further information I can provide.

Thanks,
Ahmad

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |




* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04  9:47 [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address Ahmad Fatoum
@ 2024-10-04 10:40 ` Peter Maydell
  2024-10-04 11:51   ` Ahmad Fatoum
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Maydell @ 2024-10-04 10:40 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> I am investigating a data abort affecting the barebox bootloader built for aarch64
> that only manifests with qemu-system-aarch64 -enable-kvm.
>
> The issue happens when using the post-indexed form of LDR on a MMIO address:
>
>         ldr     x0, =0x9000fe0           // MMIO address
>         ldr     w1, [x0], #4             // data abort, but only with -enable-kvm

Don't do this -- KVM doesn't support it. For access to MMIO,
stick to instructions which will set the ISV bit in ESR_EL1.
That is:

 * AArch64 loads and stores of a single general-purpose register
   (including the register specified with 0b11111, including those
   with Acquire/Release semantics, but excluding Load Exclusive
   or Store Exclusive and excluding those with writeback).
 * AArch32 instructions where the instruction:
    - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
      LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
      STLH, STRHT, STRB, STLB, or STRBT instruction.
    - Is not performing register writeback.
    - Is not using R15 as a source or destination register.

Your instruction is doing writeback. Do the address update
as a separate instruction.

Strictly speaking this is a missing feature in KVM (in an
ideal world it would let you do MMIO with any instruction
that you could use on real hardware). In practice it is not
a major issue because you don't typically want to do odd
things when you're doing MMIO, you just want to load or
store a single data item. If you're running into this then
your guest software is usually doing something a bit strange.

thanks
-- PMM



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 10:40 ` Peter Maydell
@ 2024-10-04 11:51   ` Ahmad Fatoum
  2024-10-04 12:10     ` Peter Maydell
  0 siblings, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-04 11:51 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Peter,

Thanks for your quick response.

On 04.10.24 12:40, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> I am investigating a data abort affecting the barebox bootloader built for aarch64
>> that only manifests with qemu-system-aarch64 -enable-kvm.
>>
>> The issue happens when using the post-indexed form of LDR on a MMIO address:
>>
>>         ldr     x0, =0x9000fe0           // MMIO address
>>         ldr     w1, [x0], #4             // data abort, but only with -enable-kvm
> 
> Don't do this -- KVM doesn't support it. For access to MMIO,
> stick to instructions which will set the ISV bit in ESR_EL1.
>
> That is:
> 
>  * AArch64 loads and stores of a single general-purpose register
>    (including the register specified with 0b11111, including those
>    with Acquire/Release semantics, but excluding Load Exclusive
>    or Store Exclusive and excluding those with writeback).
>  * AArch32 instructions where the instruction:
>     - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
>       LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
>       STLH, STRHT, STRB, STLB, or STRBT instruction.
>     - Is not performing register writeback.
>     - Is not using R15 as a source or destination register.
> 
> Your instruction is doing writeback. Do the address update
> as a separate instruction.

This was enlightening, thanks. I will prepare patches to implement
readl/writel in barebox in ARM assembly, like Linux does
instead of the current fallback to the generic C version
that just does volatile accesses.
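
For the record, a rough sketch of what I have in mind (leaving out the barriers
that the real readl/writel add on top): force every MMIO access to be a single
load/store without writeback, so ESR_EL2.ISV is set if it traps:

static inline u32 readl(const volatile void __iomem *addr)
{
        u32 val;

        /* single LDR, no writeback */
        asm volatile("ldr %w0, [%1]" : "=r" (val) : "r" (addr));
        return val;
}

static inline void writel(u32 val, volatile void __iomem *addr)
{
        /* single STR, no writeback */
        asm volatile("str %w0, [%1]" : : "rZ" (val), "r" (addr));
}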

> Strictly speaking this is a missing feature in KVM (in an
> ideal world it would let you do MMIO with any instruction
> that you could use on real hardware).

I assume that's because KVM doesn't want to handle interruptions
in the middle of such "composite" instructions?

> In practice it is not
> a major issue because you don't typically want to do odd
> things when you're doing MMIO, you just want to load or
> store a single data item. If you're running into this then
> your guest software is usually doing something a bit strange.

The AMBA Peripheral ID consists of 4 bytes with some padding
in-between. The barebox code to read it out looks like this:

static inline u32 amba_device_get_pid(void *base, u32 size)
{
        int i;
        u32 pid; 

        for (pid = 0, i = 0; i < 4; i++)
                pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
                        (i * 8);

        return pid;
}

static inline u32 __raw_readl(const volatile void __iomem *addr)
{
        return *(const volatile u32 __force *)addr;
}

I wouldn't necessarily characterize this as strange; we just erroneously
assumed that with strongly-ordered memory for MMIO regions and volatile
accesses we had our bases covered -- and indeed we did, until the bases
shifted to include hardware-assisted virtualization. :-)

Thanks,
Ahmad

> 
> thanks
> -- PMM
> 


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 11:51   ` Ahmad Fatoum
@ 2024-10-04 12:10     ` Peter Maydell
  2024-10-04 15:52       ` Oliver Upton
  2024-10-04 19:50       ` Ahmad Fatoum
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Maydell @ 2024-10-04 12:10 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>
> Hello Peter,
>
> Thanks for your quick response.
>
> On 04.10.24 12:40, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 10:47, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> I am investigating a data abort affecting the barebox bootloader built for aarch64
> >> that only manifests with qemu-system-aarch64 -enable-kvm.
> >>
> >> The issue happens when using the post-indexed form of LDR on a MMIO address:
> >>
> >>         ldr     x0, =0x9000fe0           // MMIO address
> >>         ldr     w1, [x0], #4             // data abort, but only with -enable-kvm
> >
> > Don't do this -- KVM doesn't support it. For access to MMIO,
> > stick to instructions which will set the ISV bit in ESR_EL1.
> >
> > That is:
> >
> >  * AArch64 loads and stores of a single general-purpose register
> >    (including the register specified with 0b11111, including those
> >    with Acquire/Release semantics, but excluding Load Exclusive
> >    or Store Exclusive and excluding those with writeback).
> >  * AArch32 instructions where the instruction:
> >     - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> >       LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> >       STLH, STRHT, STRB, STLB, or STRBT instruction.
> >     - Is not performing register writeback.
> >     - Is not using R15 as a source or destination register.
> >
> > Your instruction is doing writeback. Do the address update
> > as a separate instruction.
>
> This was enlightening, thanks. I will prepare patches to implement
> readl/writel in barebox in ARM assembly, like Linux does
> instead of the current fallback to the generic C version
> that just does volatile accesses.
>
> > Strictly speaking this is a missing feature in KVM (in an
> > ideal world it would let you do MMIO with any instruction
> > that you could use on real hardware).
>
> I assume that's because KVM doesn't want to handle interruptions
> in the middle of such "composite" instructions?

It's because with the ISV=1 information in the ESR_EL2,
KVM has everything it needs to emulate the load/store:
it has the affected register number, the data width, etc. When
ISV is 0, simulating the load/store would require KVM
to load the actual instruction word, decode it to figure
out what kind of load/store it was, and then emulate
its behaviour. The instruction decode would be complicated
and if done in the kernel would increase the attack surface
exposed to the guest. (In practice KVM will these days
bounce the ISV=0 failure out to the userspace VMM, but
no VMM that I know of implements the decode of load/store
instructions in userspace either, so that just changes which
part prints the failure message.)
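
As an illustration of why ISV=1 is enough: KVM hands the access to the VMM
as a KVM_EXIT_MMIO that is already fully described, so the userspace side
is roughly this (a sketch only, with placeholder device-model hooks):

#include <linux/kvm.h>

/* placeholder device-model hooks, not any real VMM's API */
void device_read(__u64 addr, void *data, __u32 len);
void device_write(__u64 addr, const void *data, __u32 len);

static void handle_mmio_exit(struct kvm_run *run)
{
        /* register, width and direction were decoded from the ISV=1 syndrome */
        if (run->mmio.is_write)
                device_write(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
        else
                /* for a read, the VMM fills run->mmio.data and the next
                 * KVM_RUN completes the load into the guest register */
                device_read(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
}

With ISV=0 there is simply nothing KVM could put into those fields.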

> > In practice it is not
> > a major issue because you don't typically want to do odd
> > things when you're doing MMIO, you just want to load or
> > store a single data item. If you're running into this then
> > your guest software is usually doing something a bit strange.
>
> The AMBA Peripheral ID consists of 4 bytes with some padding
> in-between. The barebox code to read it out looks is this:
>
> static inline u32 amba_device_get_pid(void *base, u32 size)
> {
>         int i;
>         u32 pid;
>
>         for (pid = 0, i = 0; i < 4; i++)
>                 pid |= (readl(base + size - 0x20 + 4 * i) & 255) <<
>                         (i * 8);
>
>         return pid;
> }
>
> static inline u32 __raw_readl(const volatile void __iomem *addr)
> {
>         return *(const volatile u32 __force *)addr;
> }
>
> I wouldn't necessarily characterize this as strange, we just erroneously
> assumed that with strongly ordered memory for MMIO regions and volatile
> accesses we had our bases covered and indeed we did until the bases
> shifted to include hardware-assisted virtualization. :-)

I'm not a fan of doing MMIO access via 'volatile' in C code,
personally -- I think the compiler has a tendency to do more
clever recombination than you might actually want, because
it doesn't know that the thing you're accessing isn't RAM-like.
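
Which is presumably exactly what bit you here: given a loop of volatile
32-bit loads (a simplified sketch of your amba_device_get_pid):

        const volatile u32 *p = (const volatile u32 *)0x9000fe0;
        u32 pid = 0;
        int i;

        for (i = 0; i < 4; i++)
                pid |= (p[i] & 0xff) << (i * 8);

the compiler is entirely within its rights to strength-reduce the address
arithmetic into a post-indexed "ldr wN, [xM], #4". volatile only guarantees
that each access happens; it says nothing about the addressing mode used.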

thanks
-- PMM



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 12:10     ` Peter Maydell
@ 2024-10-04 15:52       ` Oliver Upton
  2024-10-04 15:57         ` Peter Maydell
  2024-10-04 19:50       ` Ahmad Fatoum
  1 sibling, 1 reply; 15+ messages in thread
From: Oliver Upton @ 2024-10-04 15:52 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > Strictly speaking this is a missing feature in KVM (in an
> > > ideal world it would let you do MMIO with any instruction
> > > that you could use on real hardware).
> >
> > I assume that's because KVM doesn't want to handle interruptions
> > in the middle of such "composite" instructions?
> 
> It's because with the ISV=1 information in the ESR_EL2,
> KVM has everything it needs to emulate the load/store:
> it has the affected register number, the data width, etc. When
> ISV is 0, simulating the load/store would require KVM
> to load the actual instruction word, decode it to figure
> out what kind of load/store it was, and then emulate
> its behaviour. The instruction decode would be complicated
> and if done in the kernel would increase the attack surface
> exposed to the guest.

On top of that, the only way to 'safely' fetch the instruction would be
to pause all vCPUs in the VM to prevent the guest from remapping the
address space behind either KVM or the VMM's back.

> > static inline u32 __raw_readl(const volatile void __iomem *addr)
> > {
> >         return *(const volatile u32 __force *)addr;
> > }
> >
> > I wouldn't necessarily characterize this as strange, we just erroneously
> > assumed that with strongly ordered memory for MMIO regions and volatile
> > accesses we had our bases covered and indeed we did until the bases
> > shifted to include hardware-assisted virtualization. :-)

This has nothing to do with ordering and everything to do with the
instruction set && what your compiler decided to emit.

> I'm not a fan of doing MMIO access via 'volatile' in C code,
> personally -- I think the compiler has a tendency to do more
> clever recombination than you might actually want, because
> it doesn't know that the thing you're accessing isn't RAM-like.

+1

-- 
Thanks,
Oliver



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 15:52       ` Oliver Upton
@ 2024-10-04 15:57         ` Peter Maydell
  2024-10-04 16:21           ` Oliver Upton
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Maydell @ 2024-10-04 15:57 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > > Strictly speaking this is a missing feature in KVM (in an
> > > > ideal world it would let you do MMIO with any instruction
> > > > that you could use on real hardware).
> > >
> > > I assume that's because KVM doesn't want to handle interruptions
> > > in the middle of such "composite" instructions?
> >
> > It's because with the ISV=1 information in the ESR_EL2,
> > KVM has everything it needs to emulate the load/store:
> > it has the affected register number, the data width, etc. When
> > ISV is 0, simulating the load/store would require KVM
> > to load the actual instruction word, decode it to figure
> > out what kind of load/store it was, and then emulate
> > its behaviour. The instruction decode would be complicated
> > and if done in the kernel would increase the attack surface
> > exposed to the guest.
>
> On top of that, the only way to 'safely' fetch the instruction would be
> to pause all vCPUs in the VM to prevent the guest from remapping the
> address space behind either KVM or the VMM's back.

Do we actually care about that, though? If the guest does
that, isn't it equivalent to a hardware CPU happening to
fetch the insn just-after a remapping rather than just-before?
If you decode the insn and it's not a store you could just
restart the guest...

thanks
-- PMM



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 15:57         ` Peter Maydell
@ 2024-10-04 16:21           ` Oliver Upton
  0 siblings, 0 replies; 15+ messages in thread
From: Oliver Upton @ 2024-10-04 16:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ahmad Fatoum, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, Oct 04, 2024 at 04:57:56PM +0100, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 16:53, Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Fri, Oct 04, 2024 at 01:10:48PM +0100, Peter Maydell wrote:
> > > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> > > > > Strictly speaking this is a missing feature in KVM (in an
> > > > > ideal world it would let you do MMIO with any instruction
> > > > > that you could use on real hardware).
> > > >
> > > > I assume that's because KVM doesn't want to handle interruptions
> > > > in the middle of such "composite" instructions?
> > >
> > > It's because with the ISV=1 information in the ESR_EL2,
> > > KVM has everything it needs to emulate the load/store:
> > > it has the affected register number, the data width, etc. When
> > > ISV is 0, simulating the load/store would require KVM
> > > to load the actual instruction word, decode it to figure
> > > out what kind of load/store it was, and then emulate
> > > its behaviour. The instruction decode would be complicated
> > > and if done in the kernel would increase the attack surface
> > > exposed to the guest.
> >
> > On top of that, the only way to 'safely' fetch the instruction would be
> > to pause all vCPUs in the VM to prevent the guest from remapping the
> > address space behind either KVM or the VMM's back.
> 
> Do we actually care about that, though?

Judging from the fact that our existing MMIO flows have a similar "bug",
I'd say no. I was just being pedantic about how annoying it'd be to do
this faithfully, including the VA -> IPA translation.

> If the guest does
> that isn't it equivalent to a hardware CPU happening to
> fetch the insn just-after a remapping rather than just-before?
> If you decode the insn and it's not a store you could just
> restart the guest...

Definitely, you'd need to restart any time the instruction doesn't line
up with the ESR. The pedantic thing I was thinking about was the case where
the instruction bytes remain the same but the page gets marked non-executable:

	T1				T2
	==				==
					readl(addr);
					< MMIO data abort >
					insn = fetch(readl);
	set_nx(readl);
	tlbi(readl);
	dsb(ish);
					emulate(insn);

-- 
Thanks,
Oliver



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 12:10     ` Peter Maydell
  2024-10-04 15:52       ` Oliver Upton
@ 2024-10-04 19:50       ` Ahmad Fatoum
  2024-10-05 10:31         ` Marc Zyngier
  1 sibling, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-04 19:50 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hi,

On 04.10.24 14:10, Peter Maydell wrote:
> On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 04.10.24 12:40, Peter Maydell wrote:
>>> Don't do this -- KVM doesn't support it. For access to MMIO,
>>> stick to instructions which will set the ISV bit in ESR_EL1.
>>>
>>> That is:
>>>
>>>  * AArch64 loads and stores of a single general-purpose register
>>>    (including the register specified with 0b11111, including those
>>>    with Acquire/Release semantics, but excluding Load Exclusive
>>>    or Store Exclusive and excluding those with writeback).
>>>  * AArch32 instructions where the instruction:
>>>     - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
>>>       LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
>>>       STLH, STRHT, STRB, STLB, or STRBT instruction.
>>>     - Is not performing register writeback.
>>>     - Is not using R15 as a source or destination register.
>>>
>>> Your instruction is doing writeback. Do the address update
>>> as a separate instruction.

With readl/writel implemented in assembly, I get beyond that point, but
now I get a data abort running a DC IVAC instruction on address 0x1000,
where the cfi-flash is located. This instruction is part of a routine
to remap the cfi-flash to start a page later, so the zero page can be
mapped faulting.

Simple reproducer:

start:
        ldr     x0, =0x1000
        ldr     x1, =0x1040
        bl      v8_inv_dcache_range

        mov     w10, '!'
        bl      putch

        ret

v8_inv_dcache_range:
        mrs     x3, ctr_el0
        lsr     x3, x3, #16
        and     x3, x3, #0xf              // CTR_EL0.DminLine
        mov     x2, #0x4
        lsl     x2, x2, x3                // minimum D-cache line size in bytes
        sub     x3, x2, #0x1
        bic     x0, x0, x3                // align start address down to line size
1:
        dc      ivac, x0                  // invalidate line by VA to PoC
        add     x0, x0, x2
        cmp     x0, x1
        b.cc    1b
        dsb     sy
        ret

This prints ! without KVM, but triggers a data abort before that with -enable-kvm:

  DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
  elr: 000000007fbe0550 lr : 000000007fbe01ac
  [snip]

  Call trace:
  [<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
  [<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
  [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
  [snip]

Any idea what this is about?

Thanks,
Ahmad


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-04 19:50       ` Ahmad Fatoum
@ 2024-10-05 10:31         ` Marc Zyngier
  2024-10-05 18:38           ` Ahmad Fatoum
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2024-10-05 10:31 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Fri, 04 Oct 2024 20:50:18 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> 
> Hi,
> 
> On 04.10.24 14:10, Peter Maydell wrote:
> > On Fri, 4 Oct 2024 at 12:51, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 04.10.24 12:40, Peter Maydell wrote:
> >>> Don't do this -- KVM doesn't support it. For access to MMIO,
> >>> stick to instructions which will set the ISV bit in ESR_EL1.
> >>>
> >>> That is:
> >>>
> >>>  * AArch64 loads and stores of a single general-purpose register
> >>>    (including the register specified with 0b11111, including those
> >>>    with Acquire/Release semantics, but excluding Load Exclusive
> >>>    or Store Exclusive and excluding those with writeback).
> >>>  * AArch32 instructions where the instruction:
> >>>     - Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT,
> >>>       LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH,
> >>>       STLH, STRHT, STRB, STLB, or STRBT instruction.
> >>>     - Is not performing register writeback.
> >>>     - Is not using R15 as a source or destination register.
> >>>
> >>> Your instruction is doing writeback. Do the address update
> >>> as a separate instruction.
> 
> With readl/writel implemented in assembly, I get beyond that point, but
> now I get a data abort running an DC IVAC instruction on address 0x1000,
> where the cfi-flash is located. This instruction is part of a routine
> to remap the cfi-flash to start a page later, so the zero page can be
> mapped faulting.
> 
> Simple reproducer:
> 
> start:
>         ldr     x0, =0x1000
>         ldr     x1, =0x1040
>         bl      v8_inv_dcache_range
> 
>         mov     w10, '!'
>         bl      putch
> 
>         ret
> 
> v8_inv_dcache_range:
>         mrs     x3, ctr_el0
>         lsr     x3, x3, #16
>         and     x3, x3, #0xf
>         mov     x2, #0x4
>         lsl     x2, x2, x3
>         sub     x3, x2, #0x1
>         bic     x0, x0, x3
> 1:
>         dc      ivac, x0
>         add     x0, x0, x2
>         cmp     x0, x1
>         b.cc    1b
>         dsb     sy
>         ret
> 
> This prints ! without KVM, but triggers a data abort before that with -enable-kvm:
> 
>   DABT (current EL) exception (ESR 0x96000010) at 0x0000000000001000
>   elr: 000000007fbe0550 lr : 000000007fbe01ac
>   [snip]
> 
>   Call trace:
>   [<7fbe0550>] (v8_inv_dcache_range+0x1c/0x34) from [<7fbe0218>] (arch_remap_range+0x64/0x70)
>   [<7fbe0218>] (arch_remap_range+0x64/0x70) from [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c)
>   [<7fb8795c>] (of_platform_device_create+0x1e8/0x22c) from [<7fb87a04>] (of_platform_bus_create+0x64/0xbc)
>   [snip]
> 
> Any idea what this is about?

IIRC, the QEMU flash is implemented as a read-only memslot. A data
cache invalidation is a *write*, as it can be (and is) upgraded to a
clean+invalidate by the HW.

KVM cannot satisfy the write, for obvious reasons, and tells the guest
to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
not).

In the end, you get an exception. We could relax this by
special-casing CMOs to RO memslots, but this doesn't look great.

The real question is: what are you trying to achieve with this?

	M.

-- 
Without deviation from the norm, progress is not possible.



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-05 10:31         ` Marc Zyngier
@ 2024-10-05 18:38           ` Ahmad Fatoum
  2024-10-05 21:35             ` Marc Zyngier
  0 siblings, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-05 18:38 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 05.10.24 12:31, Marc Zyngier wrote:
> On Fri, 04 Oct 2024 20:50:18 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> With readl/writel implemented in assembly, I get beyond that point, but
>> now I get a data abort running an DC IVAC instruction on address 0x1000,
>> where the cfi-flash is located. This instruction is part of a routine
>> to remap the cfi-flash to start a page later, so the zero page can be
>> mapped faulting.

[snip]

>> Any idea what this is about?
> 
> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> cache invalidation is a *write*, as it can be (and is) upgraded to a
> clean+invalidate by the HW.

So it's a write, even if there are no dirty cache lines?

> KVM cannot satisfy the write, for obvious reasons, and tells the guest
> to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> not).
> 
> In the end, you get an exception. We could relax this by
> special-casing CMOs to RO memslots, but this doesn't look great.
> 
> The real question is: what are you trying to achieve with this?

barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
want to map the zero page faulting, but still have access to the first
block of the cfi-flash.

Therefore, barebox will map the cfi-flash one page later
(virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
page can be mapped faulting.

The routine[1] that does this remapping invalidates the virtual address range,
because the attributes may change[2]. This invalidate also happens for cfi-flash,
but we should never encounter dirty cache lines there as the remap is done
before driver probe.

Can you advise what should be done differently?

[1]: https://elixir.bootlin.com/barebox/v2024.09.0/source/arch/arm/cpu/mmu_64.c#L193
[2]: https://lore.kernel.org/barebox/20230526063354.1145474-4-a.fatoum@pengutronix.de/

Thanks,
Ahmad

> 
> 	M.
> 


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-05 18:38           ` Ahmad Fatoum
@ 2024-10-05 21:35             ` Marc Zyngier
  2024-10-06  7:59               ` Ahmad Fatoum
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2024-10-05 21:35 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Sat, 05 Oct 2024 19:38:23 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> 
> Hello Marc,
> 
> On 05.10.24 12:31, Marc Zyngier wrote:
> > On Fri, 04 Oct 2024 20:50:18 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> With readl/writel implemented in assembly, I get beyond that point, but
> >> now I get a data abort running an DC IVAC instruction on address 0x1000,
> >> where the cfi-flash is located. This instruction is part of a routine
> >> to remap the cfi-flash to start a page later, so the zero page can be
> >> mapped faulting.
> 
> [snip]
> 
> >> Any idea what this is about?
> > 
> > IIRC, the QEMU flash is implemented as a read-only memslot. A data
> > cache invalidation is a *write*, as it can be (and is) upgraded to a
> > clean+invalidate by the HW.
> 
> So it's a write, even if there are no dirty cache lines?

Yes.

At the point where this is handled, the CPU has no clue about the
dirty state of an arbitrary cache line, at an arbitrary cache level.
The CPU simply forwards the CMO downstream, and the permission check
happens way before that.

> > KVM cannot satisfy the write, for obvious reasons, and tells the guest
> > to bugger off (__gfn_to_pfn_memslot() returns KVM_PFN_ERR_RO_FAULT,
> > which satisfies is_error_noslot_pfn() -- a slight oddity, but hey, why
> > not).
> > 
> > In the end, you get an exception. We could relax this by
> > special-casing CMOs to RO memslots, but this doesn't look great.
> > 
> > The real question is: what are you trying to achieve with this?
> 
> barebox sets up the MMU, but tries to keep a 1:1 mapping. On Virt, we
> want to map the zero page faulting, but still have access to the first
> block of the cfi-flash.
> 
> Therefore, barebox will map the cfi-flash one page later
> (virt 0x1000,0x2000,... -> phys 0x0000,0x1000,...) and so on, so the first
> page can be mapped faulting.
> 
> The routine[1] that does this remapping invalidates the virtual address range,
> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> but we should never encounter dirty cache lines there as the remap is done
> before driver probe.
>
> Can you advise what should be done differently?

If you always map the flash as Device memory, there is no need for
CMOs. Same thing if you map it as NC. And even if you did map it as
Cacheable, it wouldn't matter. KVM already handles coherency when the
flash is switching between memory-mapped and programming mode, as the
physical address space changes (the flash literally drops from the
memory map).

In general, issuing CMOs to a device is a bizarre concept, because it
is pretty rare that a device can handle a full cache-line as
write-back. Devices tend to handle smaller, register-sized accesses,
not a full 64-byte eviction.

Now, I'm still wondering whether we should simply forward the CMO to
userspace as we do for other writes, and let the VMM deal with it. The
main issue is that there is no current way to describe this.

The alternative would be to silently handle the trap and pretend it
never occurred, as we do for other bizarre behaviours. But that'd be
something only new kernels would support, and I guess you'd like your
guest to work today, not tomorrow.

	M.

-- 
Without deviation from the norm, progress is not possible.



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-05 21:35             ` Marc Zyngier
@ 2024-10-06  7:59               ` Ahmad Fatoum
  2024-10-06 10:28                 ` Marc Zyngier
  0 siblings, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-06  7:59 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 05.10.24 23:35, Marc Zyngier wrote:
> On Sat, 05 Oct 2024 19:38:23 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
>>> cache invalidation is a *write*, as it can be (and is) upgraded to a
>>> clean+invalidate by the HW.
>>
>> So it's a write, even if there are no dirty cache lines?
> 
> Yes.
> 
> At the point where this is handled, the CPU has no clue about the
> dirty state of an arbitrary cache line, at an arbitrary cache level.
> The CPU simply forward the CMO downstream, and the permission check
> happens way before that.

I see. When I added that invalidation, it was suggested[1] that instead
of invalidating after the remapping, I clean the range before
remapping. Cleaning the zero page triggered a data abort, which I didn't
understand as there should be no dirty lines, but now I get it.

One more question: This upgrading of DC IVAC to DC CIVAC is because
the code is run under virtualization, right?

[1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/

>> The routine[1] that does this remapping invalidates the virtual address range,
>> because the attributes may change[2]. This invalidate also happens for cfi-flash,
>> but we should never encounter dirty cache lines there as the remap is done
>> before driver probe.
>>
>> Can you advise what should be done differently?
> 
> If you always map the flash as Device memory, there is no need for
> CMOs. Same thing if you map it as NC.

Everything, except for RAM, is mapped Device-nGnRnE.

> And even if you did map it as
> Cacheable, it wouldn't matter. KVM already handles coherency when the
> flash is switching between memory-mapped and programming mode, as the
> physical address space changes (the flash literally drops from the
> memory map).
> 
> In general, issuing CMOs to a device is a bizarre concept, because it
> is pretty rare that a device can handle a full cache-line as
> write-back. Devices tend to handle smaller, register-sized accesses,
> not a full 64-byte eviction.

The same remap_range function is used to:

  - remap normal memory:
    - Mapping memory regions uncached for memory test
    - Mapping memory buffers coherent or write-combine
  - remap ROMs: BootROMs at address zero may be hidden behind the NULL
    page and need to be mapped differently when calling/peeking into them.
  - remap device memory, e.g. in the case of the cfi-flash here

The invalidation is done unconditionally for all of them, although it
only makes sense in the first case.

> Now, I'm still wondering whether we should simply forward the CMO to
> userspace as we do for other writes, and let the VMM deal with it. The
> main issue is that there is no current way to describe this.
> 
> The alternative would be to silently handle the trap and pretend it
> never occurred, as we do for other bizarre behaviours. But that'd be
> something only new kernels would support, and I guess you'd like your
> guest to work today, not tomorrow.

I think the following fix on the barebox side may work:

  - Walk all pages about to be remapped
  - Execute the AT instruction on the page's base address
  - Only if the page was previously mapped cacheable, clean + invalidate
    the cache
  - Remove the current cache invalidation after remap
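
In rough pseudo-C (all helper names made up), the remap loop would become
something like:

        for_each_page(start, end, page) {
                attrs = read_attrs_via_at(page);      /* AT S1E1R; read PAR_EL1 */
                if (attrs_are_cacheable(attrs))
                        v8_flush_dcache_range(page, page + PAGE_SIZE);
        }
        /* ...then install the new attributes, with no unconditional
         * invalidate afterwards */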

Does that sound sensible?

Thanks,
Ahmad

> 
> 	M.
> 


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-06  7:59               ` Ahmad Fatoum
@ 2024-10-06 10:28                 ` Marc Zyngier
  2024-10-09  6:11                   ` Ahmad Fatoum
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2024-10-06 10:28 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Sun, 06 Oct 2024 08:59:56 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> 
> Hello Marc,
> 
> On 05.10.24 23:35, Marc Zyngier wrote:
> > On Sat, 05 Oct 2024 19:38:23 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >>> IIRC, the QEMU flash is implemented as a read-only memslot. A data
> >>> cache invalidation is a *write*, as it can be (and is) upgraded to a
> >>> clean+invalidate by the HW.
> >>
> >> So it's a write, even if there are no dirty cache lines?
> > 
> > Yes.
> > 
> > At the point where this is handled, the CPU has no clue about the
> > dirty state of an arbitrary cache line, at an arbitrary cache level.
> > The CPU simply forward the CMO downstream, and the permission check
> > happens way before that.
> 
> I see. When I added that invalidation, it was suggested[1] that instead
> of invalidation after the remapping, I would clean the range before
> remapping. Cleaning the zero page triggered a data abort, which I didn't
> understand as there should be no dirty lines, but now I get it.
> 
> One more question: This upgrading of DC IVAC to DC CIVAC is because
> the code is run under virtualization, right?

Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
also a perfectly valid implementation of both IVAC and CVAC.  And it
isn't uncommon that CPUs implement everything the same way.

This is why, if you dare look at the definition of ESR_ELx for a
data abort, you will notice that ESR_ELx.WnR (which indicates whether
the abort was triggered by a read or a write) is always 1 (denoting a
write) on a CMO, irrespective of what CMO triggered it.

Additionally, FEAT_CMOW requires both read and write permissions to
not trigger faults on CMOs, but that's a bit orthogonal to your
problem.

> 
> [1]: https://lore.barebox.org/barebox/9809c04c-58c5-6888-2b14-cf17fa7c4b1a@pengutronix.de/
> 
> >> The routine[1] that does this remapping invalidates the virtual address range,
> >> because the attributes may change[2]. This invalidate also happens for cfi-flash,
> >> but we should never encounter dirty cache lines there as the remap is done
> >> before driver probe.
> >>
> >> Can you advise what should be done differently?
> > 
> > If you always map the flash as Device memory, there is no need for
> > CMOs. Same thing if you map it as NC.
> 
> Everything, except for RAM, is mapped Device-nGnRnE.

Right, no problem then.

> 
> > And even if you did map it as
> > Cacheable, it wouldn't matter. KVM already handles coherency when the
> > flash is switching between memory-mapped and programming mode, as the
> > physical address space changes (the flash literally drops from the
> > memory map).
> > 
> > In general, issuing CMOs to a device is a bizarre concept, because it
> > is pretty rare that a device can handle a full cache-line as
> > write-back. Devices tend to handle smaller, register-sized accesses,
> > not a full 64-byte eviction.
> 
> The same remap_range function is used to:
> 
>   - remap normal memory:
>     - Mapping memory regions uncached for memory test
>     - Mapping memory buffers coherent or write-combine
>   - remap ROMs: BootROMs at address zero may be hidden behind the NULL
>     page and need to be mapped differently when calling/peeking into it.
>   - remap device memory, e.g. in the case of the cfi-flash here
> 
> The invalidation is done unconditionally for all of them, although it
> makes only the sense in the first case.

Indeed.

> 
> > Now, I'm still wondering whether we should simply forward the CMO to
> > userspace as we do for other writes, and let the VMM deal with it. The
> > main issue is that there is no current way to describe this.
> > 
> > The alternative would be to silently handle the trap and pretend it
> > never occurred, as we do for other bizarre behaviours. But that'd be
> > something only new kernels would support, and I guess you'd like your
> > guest to work today, not tomorrow.
> 
> I think following fix on the barebox side may work:
> 
>   - Walk all pages about to be remapped
>   - Execute the AT instruction on the page's base address

Why do you need AT if you are walking the PTs? If you walk, you
already have access to the memory attributes. In general, AT can be
slower than an actual walk.

Or did you actually mean iterating over the VA range? Even in that
case, AT can be a bad idea, as you are likely to iterate in page-size
increments even if you have a block mapping. Walking the PTs tells you
immediately how much a leaf is mapping (assuming you don't have any
other tracking).

>   - Only if the page was previously mapped cacheable, clean + invalidate
>     the cache
>   - Remove the current cache invalidation after remap
> 
> Does that sound sensible?

This looks reasonable (apart from the AT thingy).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-06 10:28                 ` Marc Zyngier
@ 2024-10-09  6:11                   ` Ahmad Fatoum
  2024-10-09  8:05                     ` Marc Zyngier
  0 siblings, 1 reply; 15+ messages in thread
From: Ahmad Fatoum @ 2024-10-09  6:11 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

Hello Marc,

On 06.10.24 12:28, Marc Zyngier wrote:
> On Sun, 06 Oct 2024 08:59:56 +0100,
> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> On 05.10.24 23:35, Marc Zyngier wrote:
>>> On Sat, 05 Oct 2024 19:38:23 +0100,
>>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
>> One more question: This upgrading of DC IVAC to DC CIVAC is because
>> the code is run under virtualization, right?
> 
> Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> also a perfectly valid implementation of both IVAC and CVAC.  And it
> isn't uncommon that CPUs implement everything the same way.

Makes sense. After all, software should expect cache lines to
be evicted at any time due to capacity misses anyway.

>> I think following fix on the barebox side may work:
>>
>>   - Walk all pages about to be remapped
>>   - Execute the AT instruction on the page's base address
> 
> Why do you need AT if you are walking the PTs? If you walk, you
> already have access to the memory attributes. In general, AT can be
> slower than an actual walk.
>
> Or did you actually mean iterating over the VA range? Even in that
> case, AT can be a bad idea, as you are likely to iterate in page-size
> increments even if you have a block mapping. Walking the PTs tells you
> immediately how much a leaf is mapping (assuming you don't have any
> other tracking).

There's no other tracking and I hoped that using AT (which is already
being used for the mmuinfo shell command) would be easier.

I see now that it would be suboptimal to do it this way and have
implemented a revised arch_remap_range[1] for barebox, which I just
Cc'd you on.

[1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u

>>   - Only if the page was previously mapped cacheable, clean + invalidate
>>     the cache
>>   - Remove the current cache invalidation after remap
>>
>> Does that sound sensible?
> 
> This looks reasonable (apart from the AT thingy).

I have two (hopefully the last!) questions about remaining differing
behavior with KVM and without:

1) Unaligned stack accesses crash in KVM:

start: /* This will be mapped at 0x40080000 */
        ldr     x0, =0x4007fff0
        mov     sp, x0
        stp     x0, x1, [sp] // This is ok

        ldr     x0, =0x4007fff8
        mov     sp, x0
        stp     x0, x1, [sp] // This crashes

I know that the stack should be 16-byte aligned, but why does it crash
only under KVM?

Context: The barebox Image used for QEMU has a Linux ARM64 "Image" header,
so it's loaded at an offset and grows the stack down into this memory region
until the FDT's /memory can be decoded and a proper stack is set up.

A regression introduced earlier this year caused the stack to grow down
from a non-16-byte-aligned address, which is fixed in [2].

[2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182

2) Using uncached memory for Virt I/O queues with KVM enabled is considerably
   slower. My guess is that these accesses keep getting trapped, but what I wonder
   about is the performance discrepancy between the big.LITTLE cores
   (measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):

    KVM && !CACHED && 1x Cortex-A53:  0.137s
    KVM && !CACHED && 1x Cortex-A72: 54.030s
    KVM &&  CACHED && 1x Cortex-A53:  0.120s
    KVM &&  CACHED && 1x Cortex-A72:  0.035s

The A53s are CPUs 0-1 and the A72 are 2-5.

Any idea why accessing uncached memory from the big core is so much worse?

Thank you!
Ahmad

> 
> Thanks,
> 
> 	M.
> 


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



* Re: [BUG] ARM64 KVM: Data abort executing post-indexed LDR on MMIO address
  2024-10-09  6:11                   ` Ahmad Fatoum
@ 2024-10-09  8:05                     ` Marc Zyngier
  0 siblings, 0 replies; 15+ messages in thread
From: Marc Zyngier @ 2024-10-09  8:05 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Peter Maydell, qemu-arm, kvmarm, kvm, Pengutronix Kernel Team,
	linux-arm-kernel@lists.infradead.org, Enrico Joerns

On Wed, 09 Oct 2024 07:11:52 +0100,
Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> 
> Hello Marc,
> 
> On 06.10.24 12:28, Marc Zyngier wrote:
> > On Sun, 06 Oct 2024 08:59:56 +0100,
> > Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> On 05.10.24 23:35, Marc Zyngier wrote:
> >>> On Sat, 05 Oct 2024 19:38:23 +0100,
> >>> Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> >> One more question: This upgrading of DC IVAC to DC CIVAC is because
> >> the code is run under virtualization, right?
> > 
> > Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> > also a perfectly valid implementation of both IVAC and CVAC.  And it
> > isn't uncommon that CPUs implement everything the same way.
> 
> Makes sense. After all, software should expect cache lines to
> be evicted at any time due to capacity misses anyway.
> 
> >> I think following fix on the barebox side may work:
> >>
> >>   - Walk all pages about to be remapped
> >>   - Execute the AT instruction on the page's base address
> > 
> > Why do you need AT if you are walking the PTs? If you walk, you
> > already have access to the memory attributes. In general, AT can be
> > slower than an actual walk.
> >
> > Or did you actually mean iterating over the VA range? Even in that
> > case, AT can be a bad idea, as you are likely to iterate in page-size
> > increments even if you have a block mapping. Walking the PTs tells you
> > immediately how much a leaf is mapping (assuming you don't have any
> > other tracking).
> 
> There's no other tracking and I hoped that using AT (which is already
> being used for the mmuinfo shell command) would be easier.

Heavy use of AT is debatable.

It requires heavy synchronisation (the ISB between AT and the PAR_EL1
readout), traps in some circumstances, and is not guaranteed to report
the exact content of the page tables when it comes to memory
attributes (it can instead report what the implementation actually
does). In turn, it *may* hit in the TLBs. Or thrash them.

But the real use of AT is when you do not have a virtual mapping for
your page tables, making them difficult to walk in SW. For example,
KVM uses AT to walk the guest's stage-1 PTs on stage-2 fault, because
we don't have the guest's memory mapped at EL2 in general.

Maybe this doesn't matter in your case, as this doesn't happen often
enough to justify a dedicated SW walker.

> 
> I see now that it would be too suboptimal to do it this way and have
> implemented a revised arch_remap_range[1] for barebox, which I just
> Cc'd you on.

I'll try to have a look.

> 
> [1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#u
> 
> >>   - Only if the page was previously mapped cacheable, clean + invalidate
> >>     the cache
> >>   - Remove the current cache invalidation after remap
> >>
> >> Does that sound sensible?
> > 
> > This looks reasonable (apart from the AT thingy).
> 
> I have two (hopefully the last!) questions about remaining differing
> behavior with KVM and without:
> 
> 1) Unaligned stack accesses crash in KVM:
> 
> start: /* This will be mapped at 0x40080000 */
>         ldr     x0, =0x4007fff0
>         mov     sp, x0
>         stp     x0, x1, [sp] // This is ok
> 
>         ldr     x0, =0x4007fff8
>         mov     sp, x0
>         stp     x0, x1, [sp] // This crashes
> 
> I know that the stack should be 16 byte aligned, but why does it crash
> only under KVM?

KVM shouldn't be involved in any of this. What is SCTLR_EL1.SA set to?
KVM resets it to 1, which is legal ("On a Warm reset, this field
resets to an architecturally UNKNOWN value."). You shouldn't rely on
it being 0 if you are doing unaligned accesses to the stack pointer.
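
Something like this early on (just a sketch) would tell you what you are
running with:

        u64 sctlr;

        asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
        if (sctlr & (1UL << 3))    /* SCTLR_EL1.SA: SP alignment checking */
                puts("unaligned SP-relative accesses will fault\n");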

> 
> Context: The barebox Image used for Qemu has a Linux ARM64 "Image" header,
> so it's loaded at an offset and grows the stack down into this memory region
> until the FDT's /memory could be decoded and a proper stack is set up.
> 
> A regression introduced earlier this year, caused the stack to grow down
> from a non-16-byte address, which is fixed in [2].
> 
> [2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@pengutronix.de/T/#ma381512862d22530382aff1662caadad2c8bc182
> 
> 2) Using uncached memory for Virt I/O queues with KVM enabled is considerably
>    slower. My guess is that these accesses keep getting trapped, but what I wonder
>    about is the performance discrepancy between the big.LITTLE cores
>    (measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):
> 
>     KVM && !CACHED && 1x Cortex-A53:  0.137s
>     KVM && !CACHED && 1x Cortex-A72: 54.030s
>     KVM &&  CACHED && 1x Cortex-A53:  0.120s
>     KVM &&  CACHED && 1x Cortex-A72:  0.035s
> 
> The A53s are CPUs 0-1 and the A72 are 2-5.
> 
> Any idea why accessing uncached memory from the big core is so much worse?

KVM shouldn't trap these accesses at all, except for the initial
faulting-in of the memory (once per page).

But to see such a difference, the problem is likely somewhere else. As
you seem to be using QEMU as the VMM, you are likely to fall into the
"Mismatched attributes" sink-hole (see B2.16 "Mismatched memory
attributes" in the ARM ARM).

The gist of the problem is that QEMU is using a cacheable mapping for
all of the guest memory, while you are using an NC mapping. Are these
two guaranteed to be coherent? Not at all. The abysmal level of
performance is likely to be caused by the rate of miss in the cache on
the QEMU side. It simply doesn't observe the guest's writes until it
misses (for fun, giggles and self-promotion, see [1]).

As an experiment, you could run something that actively puts a lot of
cache pressure in parallel to your guest. You should see the
performance increase significantly.

Now, this is very much a case of "don't do that". Virtio is, by
definition, a cache coherent device. So either you use it with
attributes that maintain coherency (and everything works), or you add
cache maintenance to your virtio implementation (good luck getting the
ordering right).

HTH,

	M.

[1] http://events17.linuxfoundation.org/sites/events/files/slides/slides_10.pdf

-- 
Without deviation from the norm, progress is not possible.


