From: Sean Christopherson <seanjc@google.com>
To: Yi Wang <up2wing@gmail.com>
Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, wanpengli@tencent.com,
Yi Wang <foxywang@tencent.com>,
Oliver Upton <oliver.upton@linux.dev>,
Marc Zyngier <maz@kernel.org>, Anup Patel <anup@brainfault.org>,
Atish Patra <atishp@atishpatra.org>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Janosch Frank <frankja@linux.ibm.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: Re: [PATCH] KVM: irqchip: synchronize srcu only if needed
Date: Fri, 12 Jan 2024 08:28:31 -0800 [thread overview]
Message-ID: <ZaFor2Lvdm4O2NWa@google.com> (raw)
In-Reply-To: <20240112091128.3868059-1-foxywang@tencent.com>
+other KVM maintainers
On Fri, Jan 12, 2024, Yi Wang wrote:
> From: Yi Wang <foxywang@tencent.com>
>
> We found that it may cost more than 20 milliseconds very accidentally
> to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
> already.
>
> The reason is that when vmm(qemu/CloudHypervisor) invokes
> KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
> might_sleep and kworker of srcu may cost some delay during this period.
might_sleep() yielding is not justification for changing KVM. That's more or
less saying "my task got preempted and took longer to run". Well, yeah.
> Since this happens during creating vm, it's no need to synchronize srcu
> now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.
>
> Signed-off-by: Yi Wang <foxywang@tencent.com>
> ---
> arch/x86/kvm/irq_comm.c | 2 +-
> include/linux/kvm_host.h | 2 ++
> virt/kvm/irqchip.c | 3 ++-
> 3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 16d076a1b91a..37c92b7486c7 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -394,7 +394,7 @@ static const struct kvm_irq_routing_entry empty_routing[] = {};
>
> int kvm_setup_empty_irq_routing(struct kvm *kvm)
> {
> - return kvm_set_irq_routing(kvm, empty_routing, 0, 0);
> + return kvm_set_irq_routing(kvm, empty_routing, 0, NONEED_SYNC_SRCU);
> }
>
> void kvm_arch_post_irq_routing_update(struct kvm *kvm)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4944136efaa2..a46370cca355 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1995,6 +1995,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>
> #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
>
> +#define NONEED_SYNC_SRCU (1U << 0)
> +
> bool kvm_arch_can_set_irq_routing(struct kvm *kvm);
> int kvm_set_irq_routing(struct kvm *kvm,
> const struct kvm_irq_routing_entry *entries,
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 1e567d1f6d3d..cea5c43c1a49 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -224,7 +224,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
>
> kvm_arch_post_irq_routing_update(kvm);
>
> - synchronize_srcu_expedited(&kvm->irq_srcu);
> + if (!(flags & NONEED_SYNC_SRCU))
> + synchronize_srcu_expedited(&kvm->irq_srcu);
I'm not a fan of x86 passing in a magic flag. It's not immediately clear why
skipping synchronization is safe. Piecing things together, _on x86_, I believe
the answer is that vCPU can't have yet been created, kvm->lock is held, _and_
kvm_arch_irqfd_allowed() will subtly reject attempts to assign irqfds if the local
APIC isn't in-kernel.
But AFAICT, s390's implementation of KVM_CREATE_IRQCHIP, which sets up identical
dummy/empty routing, doesn't provide the same guarantees.
case KVM_CREATE_IRQCHIP: {
struct kvm_irq_routing_entry routing;
r = -EINVAL;
if (kvm->arch.use_irqchip) {
/* Set up dummy routing. */
memset(&routing, 0, sizeof(routing));
r = kvm_set_irq_routing(kvm, &routing, 0, 0);
}
break;
}
It's entirely possible that someday, kvm_setup_empty_irq_routing() is moved to
common KVM and used for s390, at which point we have a mess on our hands because
it's not at all obvious whether or not it's safe for s390 to also skip
synchronization.
Rather than hack in a workaround for x86, I would rather we try and clean up this
mess.
Except for kvm_irq_map_gsi(), it looks like all flows assume irq_routing is
non-NULL. But I'm not remotely confident that that holds true on all architectures,
e.g. the only reason kvm_irq_map_gsi() checks for a NULL irq_routing is because
syzkaller generated a splat (commit c622a3c21ede ("KVM: irqfd: fix NULL pointer
dereference in kvm_irq_map_gsi")).
And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.
But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
yet doesn't implement kvm_arch_intc_initialized().
So instead of special casing x86, what if we instead have KVM setup an empty
IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
That way x86 and s390 no longer need to set empty/dummy routing when creating
an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
longer a NULL pointer deref.
next prev parent reply other threads:[~2024-01-12 16:28 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-12 9:11 [PATCH] KVM: irqchip: synchronize srcu only if needed Yi Wang
2024-01-12 16:28 ` Sean Christopherson [this message]
2024-01-15 16:01 ` Yi Wang
2024-01-16 16:50 ` Christian Borntraeger
2024-01-16 16:58 ` Sean Christopherson
2024-01-17 0:56 ` Yi Wang
2024-01-17 0:46 ` Yi Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZaFor2Lvdm4O2NWa@google.com \
--to=seanjc@google.com \
--cc=anup@brainfault.org \
--cc=atishp@atishpatra.org \
--cc=borntraeger@linux.ibm.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=foxywang@tencent.com \
--cc=frankja@linux.ibm.com \
--cc=hpa@zytor.com \
--cc=imbrenda@linux.ibm.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maz@kernel.org \
--cc=mingo@redhat.com \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=tglx@linutronix.de \
--cc=up2wing@gmail.com \
--cc=wanpengli@tencent.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.