From: Marc Zyngier <maz@kernel.org>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: salil.mehta@opnsrc.net, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, salil.mehta@huawei.com,
jonathan.cameron@huawei.com, will@kernel.org,
catalin.marinas@arm.com, mark.rutland@arm.com,
james.morse@arm.com, sudeep.holla@arm.com, lpieralisi@kernel.org,
jean-philippe@linaro.org, tglx@linutronix.de,
oliver.upton@linux.dev, richard.henderson@linaro.org,
andrew.jones@linux.dev, mst@redhat.com, david@redhat.com,
philmd@linaro.org, ardb@kernel.org, borntraeger@linux.ibm.com,
alex.bennee@linaro.org, gustavo.romero@linaro.org,
npiggin@gmail.com, linux@armlinux.org.uk,
karl.heubaum@oracle.com, miguel.luis@oracle.com,
darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
vishnu@os.amperecomputing.com,
gankulkarni@os.amperecomputing.com, wangyanan55@huawei.com,
wangzhou1@hisilicon.com, linuxarm@huawei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready
Date: Tue, 14 Oct 2025 08:44:42 +0100 [thread overview]
Message-ID: <864is2x6z9.wl-maz@kernel.org> (raw)
In-Reply-To: <CAFEAcA8FhgcaM_OsHKB3+3Z7B_oZJqU4LHX_j9p-ZQrHfWGX7g@mail.gmail.com>
On Mon, 13 Oct 2025 17:48:44 +0100,
Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@opnsrc.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > when we do a CPU reset that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
>
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
>
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.
The reset value is indeed cast in stone once the GIC has been created.
> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))
>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
>
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)
To me, EBUSY has a clear meaning: you're otherwise using the resource,
and you need to relinquish it first, while EINVAL indicates that the
kernel doesn't understand what you want.
As I said, I'm happy to look at reducing the locking to only the
target vcpu in the case of a sysreg being accessed, but EBUSY will
stay.
>
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
>
> I haven't yet got as far as thinking about the KVM interface
> for GICv5 yet...
I guess that for the time being, we'll assume that GICv3 is the
reference.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
next prev parent reply other threads:[~2025-10-14 7:45 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-08 20:19 [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready salil.mehta
2025-10-09 13:48 ` Marc Zyngier
2025-10-13 8:42 ` Peter Maydell
2025-10-13 10:54 ` Marc Zyngier
2025-10-13 16:48 ` Peter Maydell
2025-10-14 3:02 ` Salil Mehta
2025-10-14 9:31 ` Peter Maydell
2025-10-14 9:50 ` Salil Mehta
2025-10-14 7:44 ` Marc Zyngier [this message]
2025-10-14 9:33 ` Peter Maydell
2025-10-14 10:24 ` Salil Mehta
2025-10-13 15:48 ` Salil Mehta
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=864is2x6z9.wl-maz@kernel.org \
--to=maz@kernel.org \
--cc=alex.bennee@linaro.org \
--cc=andrew.jones@linux.dev \
--cc=ardb@kernel.org \
--cc=borntraeger@linux.ibm.com \
--cc=catalin.marinas@arm.com \
--cc=darren@os.amperecomputing.com \
--cc=david@redhat.com \
--cc=gankulkarni@os.amperecomputing.com \
--cc=gustavo.romero@linaro.org \
--cc=ilkka@os.amperecomputing.com \
--cc=james.morse@arm.com \
--cc=jean-philippe@linaro.org \
--cc=jonathan.cameron@huawei.com \
--cc=karl.heubaum@oracle.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=linuxarm@huawei.com \
--cc=lpieralisi@kernel.org \
--cc=mark.rutland@arm.com \
--cc=miguel.luis@oracle.com \
--cc=mst@redhat.com \
--cc=npiggin@gmail.com \
--cc=oliver.upton@linux.dev \
--cc=peter.maydell@linaro.org \
--cc=philmd@linaro.org \
--cc=richard.henderson@linaro.org \
--cc=salil.mehta@huawei.com \
--cc=salil.mehta@opnsrc.net \
--cc=sudeep.holla@arm.com \
--cc=tglx@linutronix.de \
--cc=vishnu@os.amperecomputing.com \
--cc=wangyanan55@huawei.com \
--cc=wangzhou1@hisilicon.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.