All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Yi Wang <up2wing@gmail.com>,
	pbonzini@redhat.com, tglx@linutronix.de,  mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	 hpa@zytor.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,  wanpengli@tencent.com,
	Yi Wang <foxywang@tencent.com>,
	 Oliver Upton <oliver.upton@linux.dev>,
	Marc Zyngier <maz@kernel.org>,  Anup Patel <anup@brainfault.org>,
	Atish Patra <atishp@atishpatra.org>,
	 Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: Re: [PATCH] KVM: irqchip: synchronize srcu only if needed
Date: Tue, 16 Jan 2024 08:58:10 -0800	[thread overview]
Message-ID: <Zaa1omCaDQOxxy2j@google.com> (raw)
In-Reply-To: <72edaedc-50d7-415e-9c45-f17ffe0c1c23@linux.ibm.com>

On Tue, Jan 16, 2024, Christian Borntraeger wrote:
> 
> 
> Am 15.01.24 um 17:01 schrieb Yi Wang:
> > Many thanks for your such kind and detailed reply, Sean!
> > 
> > On Sat, Jan 13, 2024 at 12:28 AM Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > +other KVM maintainers
> > > 
> > > On Fri, Jan 12, 2024, Yi Wang wrote:
> > > > From: Yi Wang <foxywang@tencent.com>
> > > > 
> > > > We found that it may cost more than 20 milliseconds very accidentally
> > > > to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
> > > > already.
> > > > 
> > > > The reason is that when vmm(qemu/CloudHypervisor) invokes
> > > > KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
> > > > might_sleep and kworker of srcu may cost some delay during this period.
> > > 
> > > might_sleep() yielding is not justification for changing KVM.  That's more or
> > > less saying "my task got preempted and took longer to run".  Well, yeah.
> > 
> > Agree. But I suppose it may be one of the reasons that makes  time of
> > KVM_CAP_SPLIT_IRQCHIP delayed, of course, the kworker has the biggest
> > suspicion :)
> > 
> > > 
> > > > Since this happens during creating vm, it's no need to synchronize srcu
> > > > now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.
> > 
> > ....
> > 
> > > And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
> > > before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
> > > kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.
> > > 
> > > But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
> > > NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
> > > yet doesn't implement kvm_arch_intc_initialized().
> > > 
> > > So instead of special casing x86, what if we instead have KVM setup an empty
> > > IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
> > > That way x86 and s390 no longer need to set empty/dummy routing when creating
> > > an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
> > > longer a NULL pointer deref.
> 
> Sounds like a good idea. This should also speedup guest creation on s390 since
> it would avoid one syncronize_irq.
> > 
> > To setup an empty IRQ routing table during kvm_create_vm() sounds a good idea,
> > at this time vCPU have not been created and kvm->lock is held so skipping
> > synchronization is safe here.
> > 
> > However, there is one drawback, if vmm wants to emulate irqchip itself,
> > e.g. qemu with command line '-machine kernel-irqchip=off' may not need
> > irqchip in kernel. How do we handle this issue?
> 
> I would be fine with wasted memory.

+1.  If we really, really want to avoid the negligible memory overhead, we could
pre-configure a static global table and directly use that as the dummy table (and
exempt it from being freed by free_irq_routing_table()).

> The only question is does it have a functional impact or can we simply ignore
> the dummy routing.

Given the lack of sanity checks on kvm->irq_routing, I'm pretty sure the only way
for there to be functional impact is if there's a latent NULL pointer deref hiding
somewhere.

  reply	other threads:[~2024-01-16 16:58 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-12  9:11 [PATCH] KVM: irqchip: synchronize srcu only if needed Yi Wang
2024-01-12 16:28 ` Sean Christopherson
2024-01-15 16:01   ` Yi Wang
2024-01-16 16:50     ` Christian Borntraeger
2024-01-16 16:58       ` Sean Christopherson [this message]
2024-01-17  0:56         ` Yi Wang
2024-01-17  0:46       ` Yi Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zaa1omCaDQOxxy2j@google.com \
    --to=seanjc@google.com \
    --cc=anup@brainfault.org \
    --cc=atishp@atishpatra.org \
    --cc=borntraeger@linux.ibm.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=foxywang@tencent.com \
    --cc=frankja@linux.ibm.com \
    --cc=hpa@zytor.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=mingo@redhat.com \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=up2wing@gmail.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.