Date: Thu, 22 Aug 2024 09:26:37 +0100
Message-ID: <867cc9x8si.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Kunkun Jiang <jiangkunkun@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Oliver Upton <oliver.upton@linux.dev>,
	James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	"open list:IRQ SUBSYSTEM" <linux-kernel@vger.kernel.org>,
	"moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@lists.infradead.org>,
	kvmarm@lists.linux.dev,
	wanghaibin.wang@huawei.com,
	nizhiqiang1@huawei.com,
	tangnianyao@huawei.com,
	wangzhou1@hisilicon.com
Subject: Re: [bug report] GICv4.1: multiple vcpus execute vgic_v4_load at the same time will greatly increase the time consumption
References: <86msl6xhu2.wl-maz@kernel.org>

On Wed, 21 Aug 2024 19:23:30 +0100,
Kunkun Jiang wrote:
> 
> Hi Marc,
> 
> On 2024/8/21 18:59, Marc Zyngier wrote:
> > On Wed, 21 Aug 2024 10:51:27 +0100,
> > Kunkun Jiang wrote:
> >> 
> >> Hi all,
> >> 
> >> Recently I discovered a problem with GICv4.1. The scenario is as follows:
> >> 1. Enable GICv4.1
> >> 2. Create multiple VMs. For example, 50 VMs (4U8G)
> 
> s/4U8G/8U16G/, sorry..
> 
> > I don't know what 4U8G means. On how many physical CPUs are you
> > running 50 VMs? Direct injection of interrupts and over-subscription
> > are fundamentally incompatible.
> 
> Each VM is configured with 8 vcpus and 16G memory. The number of
> physical CPUs is 320.

So you spawn 200 vcpus in one go. Fun.

> 
> >> 3. The workload running in the VMs performs frequent MMIO accesses
> >>    and needs to exit to qemu for processing.
> >> 4. Or modify the kvm code so that wfi always traps to kvm
> >> 5. Then the utilization of the pcpu where the vcpu is located reaches
> >>    100%, basically all of it in sys.
> > 
> > What did you expect? If you trap all the time, your performance will
> > suck. Don't do that.
> > 
> >> 6. This problem does not exist in GICv3.
> > 
> > Because GICv3 doesn't have the same constraints.
> > 
> >> According to the analysis, this problem is due to the execution of
> >> vgic_v4_load:
> >> 
> >> vcpu_load or kvm_sched_in
> >>   kvm_arch_vcpu_load
> >>   ...
> >>     vgic_v4_load
> >>       irq_set_affinity
> >>       ...
> >>         irq_do_set_affinity
> >>           raw_spin_lock(&tmp_mask_lock)
> >>           chip->irq_set_affinity
> >>           ...
> >>             its_vpe_set_affinity
> >> 
> >> The tmp_mask_lock is the key. This is a global lock. I don't quite
> >> understand why tmp_mask_lock is needed here. I think there are two
> >> possible solutions:
> >> 1. Remove this tmp_mask_lock
> > 
> > Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> > would allow you to understand the nature of the problem.
> > 
> > This can probably be replaced with a per-CPU cpumask, which would
> > avoid the locking, but potentially result in a larger memory usage.
> 
> Thanks, I will try it.
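
For reference, the call that funnels every single vcpu load into that
global lock is the one at the tail of vgic_v4_load(). A trimmed sketch,
quoted from memory rather than from a tree, so the exact shape may
differ from what you are running:

	int vgic_v4_load(struct kvm_vcpu *vcpu)
	{
		struct its_vpe *vpe = &vcpu->arch.vgic_cpu.vgic_v3.its_vpe;
		int err;

		if (!vgic_supports_direct_msis(vcpu->kvm) || vpe->resident)
			return 0;

		/*
		 * Sticking the VPE on the current redistributor is what
		 * ends up as a VMOVP at the ITS level, via
		 * its_vpe_set_affinity() -- and what takes tmp_mask_lock
		 * on the way there.
		 */
		err = irq_set_affinity(vpe->irq, cpumask_of(smp_processor_id()));

		/* The VPE is then actually made resident (not shown). */
		return err;
	}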

A simple alternative to the per-CPU mask would be this:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..0d11b74af38c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask *prog_mask;
+	struct cpumask tmp_mask = {};
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
-
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:

but that will eat a significant portion of your stack if your kernel
is configured for a large number of CPUs.
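
The per-CPU flavour would instead trade that stack usage for one
cpumask worth of memory per CPU. Something along these lines --
equally untested, the variable name is made up, and it assumes
irq_do_set_affinity() is only ever called with the descriptor lock
held and interrupts disabled, which is my reading of the callers:

	static DEFINE_PER_CPU(struct cpumask, irq_tmp_mask);

	int irq_do_set_affinity(struct irq_data *data,
				const struct cpumask *mask, bool force)
	{
		struct irq_desc *desc = irq_data_to_desc(data);
		struct irq_chip *chip = irq_data_get_irq_chip(data);
		const struct cpumask *prog_mask;
		struct cpumask *tmp_mask;
		int ret;

		if (!chip || !chip->irq_set_affinity)
			return -EINVAL;

		/*
		 * With the descriptor lock held and interrupts disabled,
		 * the per-CPU mask cannot be reused under our feet, so no
		 * extra locking is required.
		 */
		tmp_mask = this_cpu_ptr(&irq_tmp_mask);

		/*
		 * ... the rest of the function is unchanged, modulo
		 * &tmp_mask becoming tmp_mask ...
		 */
	}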

> 
> >> 2. Modify the gicv4 driver, do not perform VMOVP via
> >> irq_set_affinity.
> > 
> > Sure. You could also not use KVM at all if you don't care about
> > interrupts being delivered to your VM. We do not send a VMOVP just
> > for fun. We send it because your vcpu has moved to a different CPU,
> > and the ITS needs to know about that.
> 
> When a vcpu is moved to a different CPU, of course a VMOVP has to be
> sent. What I mean is: is it possible to call its_vpe_set_affinity()
> to send the VMOVP by some other means (instead of by calling the
> irq_set_affinity() API)? That way we could bypass this tmp_mask_lock.

The whole point of this infrastructure is that the VPE doorbell is
the control point for the VPE. If the VPE moves, then the change of
affinity *must* be done using irq_set_affinity(). All the locking is
constructed around that. Please read the abundant documentation that
exists in both the GIC code and KVM describing why this is done like
that.

> 
> > You seem to be misunderstanding the use case for GICv4: a partitioned
> > system, without any over-subscription, no vcpu migration between CPUs.
> > If that's not your setup, then GICv4 will always be a net loss
> > compared to SW injection with GICv3 (additional HW interaction,
> > doorbell interrupts).
> 
> Thanks for the explanation. The key to the problem is not vcpu
> migration between CPUs. The key point is that many vcpus execute
> vgic_v4_load() at the same time. Even if no vcpu is migrated to
> another CPU, a large number of vcpus may execute vgic_v4_load()
> concurrently. For example, the service running in the VMs performs a
> large number of MMIO accesses and needs to return to userspace for
> emulation. Due to contention on tmp_mask_lock, performance
> deteriorates.

That's only a symptom. And that doesn't affect only pathological VM
workloads, but all interrupts being moved around for any reason.

> 
> When the target CPU is the same as the one the vcpu last ran on,
> there seems to be no need to call irq_set_affinity() at all. I did a
> test, and it was indeed able to alleviate the problem described
> above.

The premise is that irq_set_affinity() should be cheap when there
isn't much to do, and you are papering over the problem.

> 
> I feel it might be better to remove tmp_mask_lock or call
> its_vpe_set_affinity() in another way. So I mentioned these two
> ideas above.

The removal of this global lock is the only option in my opinion.
Either the cpumask becomes a stack variable, or it becomes a static
per-CPU variable. Both have drawbacks, but neither is a bottleneck
anymore.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.