From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yw1-f175.google.com (mail-yw1-f175.google.com [209.85.128.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 963543B27EC for ; Mon, 8 Jun 2026 22:28:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.175 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780957687; cv=none; b=FkkjPSjNCrN39pTxajwFFhdl+t7UUA8whMMWJFH9ExAaITFfg4TC/WKK1uWVMI9IkR/U4kszgfBOhVO5v2GFRLOnu00JK3xmZOY4WBzwwgwg3KGLqNxIfWyFUUHtUbNfvNK7pT86XCnciMUuMMzwFrHWoKl2ORDBREIymSO+FG8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780957687; c=relaxed/simple; bh=9SAh3sGkuqDMMnaEBl9TqptHGF2EyXs3AZnH8hyku1E=; h=From:Date:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=luSp6NaCigLJG+g92uhEpsFYJhwuW8FGkKbNFGa7uJqDyrtPEaUsyKwsjYH3poqvA5Z8KLf6aPqIv6YYASgbOCFQHs9ij3xTsSb338FqE9JNJ5Kh17lMNdIe65Hip6XDEcxji8sYReJai9ouH0SWtmsEF7s6sA9LUeFQ5KqZ2Ic= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Uj/lzta8; arc=none smtp.client-ip=209.85.128.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Uj/lzta8" Received: by mail-yw1-f175.google.com with SMTP id 00721157ae682-7e8ec43e5f9so65169357b3.0 for ; Mon, 08 Jun 2026 15:28:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1780957684; x=1781562484; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:date:from:from:to:cc:subject:date:message-id:reply-to; bh=UZL8W7BE2iY515e29P29VDPBVZJwc1IA4Fj5Frdt1uY=; b=Uj/lzta8mgObQAWImwmCveVMVHfN7te8EQX/0aUhNXRemMXXaIc/zjH7pwrpwbRlaQ grA/N+4YS3P2D6fQ/ITgT76SL8e6gQJrGavqJe7ys9sfMCYeVyog/hbcPncTQY8t47Ip xXeFx2SRgMPJwgza/wDF+Hrx4aeuJ73Cf0uySYpOCBfe7PPvcPmEdpyWsI0OJPEL5jbd Aqgrq/vPXml69t343Gbhnea9uVZdp9bxrdmMOQ3h26Qne7fxcO+EtzrDQsonO4sgbgS2 Yn5P3tgsUdVLNG3rLghdYY5Ofgghwxvv9K01AeTzaAmQ58Sym0Mtv4EPZd9yBesBlDyF xCUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780957684; x=1781562484; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:date:from:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UZL8W7BE2iY515e29P29VDPBVZJwc1IA4Fj5Frdt1uY=; b=BMU+DaFSWqWyZVffanKJnuX8+UzztV0MFK8H5Y7186u6ADhSs8OJJevHUYw0uO+ATf M/cR6sbNIHXIl/P+pu4f8wC/8k8O60aVJ2ssGnK1SnhH4P5ssouKiVAhF/nrgZMsu9rB 43EQm0Ys25VlKn7RR+Pv6WAbdTTOWFUAkfC0RIh8GlxmaD++xBlY7Efyh1cgO5d28OEg /WcnY+GczJRWlN6pvJXVRWX0rCXij+tj7cfRQFcKzxX5AeBGU798sPYDGxgaN00oXdyI ylwJB3poWRntYOXFBkDt22WJUzcA2fK18T0xvS8QxmvDbClXabpxerkXPvDGsjx+nzWz ojXw== X-Forwarded-Encrypted: i=1; AFNElJ+Tq21S6P7RCEa0BPj1uoO/iZwWPIfkklrkfJ3b1Mtz2437MsnL5S9SHXo2FWk3qq1CunoYEFcuyJlv+qQ=@vger.kernel.org X-Gm-Message-State: AOJu0YymloyK0E3mfpFEvf/wEp5tik7BtTSDFJCPQdvzn0z033fpfP7s 3oXFGo8YB/BEQkx7WtIz/TVe99yjr+4okNPwkhiS+7UByyEKwXYC/zxp X-Gm-Gg: Acq92OEpo+OUtAPFlZdDD8qnVg1/PpFs4Yznx5xCbEqmvIRYffYtesLS8PjMHsadtgz KaONxc4xXHPDWaMJIvoqnmykH7UhYUH/PqaSqSwttzBPO1HrNyyrDc7FVxmo/PxotjHxqEiPkDi zAKREWVbwZeDSDfPIBCG2wqUZIyLpSTBhyTcqMIZGcUgTOKl5gZ4LicXH78P+QqSScwIQregrbD Fn5LMleyKfzf1aTGtiVyibemItmqHbqN3XZ+euNH4AaZ9P4S7DxD0IOAVuzOmY0RsjSpuQ8Cm25 +vXoBoZQzyyX2YDz9wapR184MWMQ6ldkGWfp93LDddbOMlTlljFTIzzupSGLnfXfii/pcDMtKZ0 GnFGZhDE2tR55hG17L+bVsAaPn0/6x3St+3DQi23CA8AaxkK3Qi/r/ZutfTdZdTrvrQ0WQyV/wn Fdbj1+2VRYxkmA8V9p5c2KrPDMCEWw X-Received: by 2002:a05:690c:e1cd:10b0:7ef:d039:a32e with SMTP id 00721157ae682-7efd039acfcmr68170177b3.34.1780957684465; Mon, 08 Jun 2026 15:28:04 -0700 (PDT) Received: from localhost ([50.221.107.122]) by smtp.gmail.com with ESMTPSA id 00721157ae682-7ea23b91689sm88209077b3.40.2026.06.08.15.28.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 08 Jun 2026 15:28:03 -0700 (PDT) From: Yury Norov X-Google-Original-From: Yury Norov Date: Mon, 8 Jun 2026 18:28:03 -0400 To: Shradha Gupta Cc: Jacob Keller , Dexuan Cui , Wei Liu , Haiyang Zhang , "K. Y. Srinivasan" , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Konstantin Taranov , Simon Horman , Erni Sri Satya Vennela , Dipayaan Roy , Shiraz Saleem , Michael Kelley , Long Li , Yury Norov , linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, Paul Rosswurm , Shradha Gupta , Saurabh Singh Sengar , stable@vger.kernel.org Subject: Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs Message-ID: References: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com> <10988db9-8a8d-4ad0-917e-317dd4b20253@intel.com> Precedence: bulk X-Mailing-List: linux-hyperv@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Wed, Jun 03, 2026 at 09:39:18PM -0700, Shradha Gupta wrote: > On Wed, Jun 03, 2026 at 02:49:24PM -0700, Jacob Keller wrote: > > On 6/1/2026 3:27 AM, Shradha Gupta wrote: > > > In mana driver, the number of IRQs allocated is capped by the > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of > > > their NUMA/core bindings. > > > > > > This is important, especially in the envs where number of vCPUs are so > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is > > > much more than their overheads if they were spread across sibling vCPUs. > > > > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA > > > IRQs are assigned at a later stage compared to static allocation, other > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ > > > weights become imbalanced, causing multiple MANA IRQs to land on the > > > same vCPU, while some vCPUs have none. > > > > > > In such cases when many parallel TCP connections are tested, the > > > throughput drops significantly. > > > > > > Test envs: > > > ======================================================= > > > Case 1: without this patch > > > ======================================================= > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue) > > > > > > TYPE effective vCPU aff > > > ======================================================= > > > IRQ0: HWC 0 > > > IRQ1: mana_q1 0 > > > IRQ2: mana_q2 2 > > > IRQ3: mana_q3 0 > > > IRQ4: mana_q4 3 > > > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver > > > vCPU 0 1 2 3 > > > ======================================================= > > > pass 1: 38.85 0.03 24.89 24.65 > > > pass 2: 39.15 0.03 24.57 25.28 > > > pass 3: 40.36 0.03 23.20 23.17 > > > > > > ======================================================= > > > Case 2: with this patch > > > ======================================================= > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue) > > > > > > TYPE effective vCPU aff > > > ======================================================= > > > IRQ0: HWC 0 > > > IRQ1: mana_q1 0 > > > IRQ2: mana_q2 1 > > > IRQ3: mana_q3 2 > > > IRQ4: mana_q4 3 > > > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver > > > vCPU 0 1 2 3 > > > ======================================================= > > > pass 1: 15.42 15.85 14.99 14.51 > > > pass 2: 15.53 15.94 15.81 15.93 > > > pass 3: 16.41 16.35 16.40 16.36 > > > > > > ======================================================= > > > Throughput Impact(in Gbps, same env) > > > ======================================================= > > > TCP conn with patch w/o patch > > > 20480 15.65 7.73 > > > 10240 15.63 8.93 > > > 8192 15.64 9.69 > > > 6144 15.64 13.16 > > > 4096 15.69 15.75 > > > 2048 15.69 15.83 > > > 1024 15.71 15.28 > > > > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") > > > Cc: stable@vger.kernel.org > > > Co-developed-by: Erni Sri Satya Vennela > > > Signed-off-by: Erni Sri Satya Vennela > > > Signed-off-by: Shradha Gupta > > > Reviewed-by: Haiyang Zhang > > > Reviewed-by: Simon Horman > > > --- > > > Changes in v3 > > > * Optimize the comments in mana_gd_setup_dyn_irqs() > > > * add more details in the dev_dbg for extra IRQs > > > --- > > > Changes in v2 > > > * Removed the unused skip_first_cpu variable > > > * fixed exit condition in irq_setup_linear() with len == 0 > > > * changed return type of irq_setup_linear() as it will always be 0 > > > * removed the unnecessary rcu_read_lock() in irq_setup_linear() > > > * added appropriate comments to indicate expected behaviour when > > > IRQs are more than or equal to num_online_cpus() > > > --- > > > .../net/ethernet/microsoft/mana/gdma_main.c | 60 ++++++++++++++++--- > > > 1 file changed, 53 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > > index 712a0881d720..00a28b3ca0a6 100644 > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > > @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev) > > > } else { > > > /* If dynamic allocation is enabled we have already allocated > > > * hwc msi > > > + * Also, we make sure in this case the following is always true > > > + * (num_msix_usable - 1 HWC) <= num_online_cpus() > > > */ > > > gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1); > > > } > > > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node, > > > return 0; > > > } > > > > > > +/* should be called with cpus_read_lock() held */ > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len) > > > +{ > > > + int cpu; > > > + > > > + for_each_online_cpu(cpu) { > > > + if (len == 0) > > > + break; > > > + > > > + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu)); > > > + len--; > > > + } > > > +} > > > > I would find all of this a bit easier to follow if irq_setup_linear() > > and irq_setup() had a mana prefix so it was more obvious these are > > specific to the driver. Of course irq_setup is pre-existing, and its not > > my driver so do as you will :) > > > > > + > > > static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec) > > > { > > > struct gdma_context *gc = pci_get_drvdata(pdev); > > > struct gdma_irq_context *gic; > > > - bool skip_first_cpu = false; > > > int *irqs, irq, err, i; > > > > > > irqs = kmalloc_objs(int, nvec); > > > @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec) > > > return -ENOMEM; > > > > > > /* > > > + * In this function, num_msix_usable = HWC IRQ + Queue IRQ. > > > + * nvec is only Queue IRQ (HWC already setup). > > > * While processing the next pci irq vector, we start with index 1, > > > * as IRQ vector at index 0 is already processed for HWC. > > > * However, the population of irqs array starts with index 0, to be > > > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec) > > > * first CPU sibling group since they are already affinitized to HWC IRQ > > > */ > > > cpus_read_lock(); > > > - if (gc->num_msix_usable <= num_online_cpus()) > > > - skip_first_cpu = true; > > > + if (gc->num_msix_usable <= num_online_cpus()) { > > > + err = irq_setup(irqs, nvec, gc->numa_node, true); > > > + if (err) { > > > + cpus_read_unlock(); > > > + goto free_irq; > > > + } > > > + } else { > > > + /* > > > + * When num_msix_usable are more than num_online_cpus, our > > > + * queue IRQs should be equal to num of online vCPUs. > > > + * We try to make sure queue IRQs spread across all vCPUs. > > > + * In such a case NUMA or CPU core affinity does not matter. > > > + * Note: in this case the total mana IRQ should always be > > > + * num_online_cpus + 1. The first HWC IRQ is already handled > > > + * in HWC setup calls > > > + * However, if CPUs went offline since num_msix_usable was > > > + * computed, queue IRQs will be more than num_online_cpus(). > > > + * In such cases remaining extra IRQs will retain their default > > > + * affinity. > > > + */ > > > + int first_unassigned = num_online_cpus(); > > > + if (nvec > first_unassigned) { > > > + char buf[32]; > > > + > > > + if (first_unassigned == nvec - 1) > > > + snprintf(buf, sizeof(buf), "%d", > > > + first_unassigned); > > > + else > > > + snprintf(buf, sizeof(buf), "%d-%d", > > > + first_unassigned, nvec - 1); > > > + > > > + dev_dbg(&pdev->dev, > > > + "MANA IRQ indices #%s will retain the default CPU affinity\n", buf); > > > + } > > > > > > - err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu); > > > - if (err) { > > > - cpus_read_unlock(); > > > - goto free_irq; > > > + irq_setup_linear(irqs, nvec); > > > > irq_setup() doesn't have a driver prefix, but is actually a static > > function in gdma_main.c, so its implementation is specific to this > > driver despite its name. > > > > So if I understand this change correctly, if the number of usable MSI-X > > vectors is smaller than the number of CPUs, you contineu to use the > > current irq_setup logic.. otherwise you switch to the simpler "linear" > > logic. > > > > I guess this means the logic and heuristic used in irq_setup() breaks > > down when the number of vectors is large and number of vCPU is small? > > > > Makes sense. > > > > Hi Jacob, > > Yes, that's the right understanding. > Regarding the function names, let me take that up in a seperate patch to > add prefixes to all such functions. I agree. Now that you've got more than one setup method, short 'irq_setup' looks confusing, if not misleading. You need some name that distinguished numa-based and plain linear method. Thanks, Yury