From: Yury Norov
Date: Fri, 19 Jun 2026 09:55:37 -0400
To: Shradha Gupta
Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, "K. Y. Srinivasan", Andrew Lunn,
    "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela, Dipayaan Roy,
    Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
    linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
    netdev@vger.kernel.org, Paul Rosswurm, Shradha Gupta,
    Saurabh Singh Sengar, stable@vger.kernel.org
Subject: Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>
X-Mailing-List: linux-hyperv@vger.kernel.org

On Fri, Jun 19, 2026 at 12:33:35AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
>
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
>
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
>
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
>
> We also studied the results of setting the affinity and hint to
> NULL in these cases, and observed that, with this logic if there are
> pre existing IRQs allocated on the VM(apart from MANA), during MANA
> IRQs allocation, it leads to clustering of the MANA queue IRQs again.
> These results can be seen through case 3 in the following data.
>
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         2
> IRQ3:   mana_q3         0
> IRQ4:   mana_q4         3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         38.85   0.03    24.89   24.65
> pass 2:         39.15   0.03    24.57   25.28
> pass 3:         40.36   0.03    23.20   23.17
>
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42   15.85   14.99   14.51
> pass 2:         15.53   15.94   15.81   15.93
> pass 3:         16.41   16.35   16.40   16.36
>
> =======================================================
> Case 3: with affinity set to NULL
> =======================================================
> 4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         2
> IRQ2:   mana_q2         3
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
>
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn        with patch      w/o patch       aff NULL
> 20480           15.65           7.73            5.25
> 10240           15.63           8.93            5.77
> 8192            15.64           9.69            7.16
> 6144            15.64           13.16           9.33
> 4096            15.69           15.75           13.50
> 2048            15.69           15.83           13.61
> 1024            15.71           15.28           13.60
>
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela
> Signed-off-by: Erni Sri Satya Vennela
> Signed-off-by: Shradha Gupta
> Reviewed-by: Haiyang Zhang
> Reviewed-by: Simon Horman

Reviewed-by: Yury Norov

> ---
> Changes in v4
> * Add mana prefix on irq_affinity_*() in mana driver
> * Corrected grammar, comment for mana_irq_setup_linear()
> * added new line as per guidelines
> * added case 3 in commit message for when affinity is NULL
> ---
> Changes in v3
> * Optimize the comments in mana_gd_setup_dyn_irqs()
> * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
> * Removed the unused skip_first_cpu variable
> * fixed exit condition in irq_setup_linear() with len == 0
> * changed return type of irq_setup_linear() as it will always be 0
> * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> * added appropriate comments to indicate expected behaviour when
>   IRQs are more than or equal to num_online_cpus()
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
>  1 file changed, 64 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index a0fdd052d7f1..e8b7ffb47eb9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	} else {
>  		/* If dynamic allocation is enabled we have already allocated
>  		 * hwc msi
> +		 * Also, we make sure in this case the following is always true
> +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
>  		 */
>  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
>  	}
> @@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>   * do the same thing.
>   */
>
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> -		     bool skip_first_cpu)
> +static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
> +				     int node, bool skip_first_cpu)
>  {
>  	const struct cpumask *next, *prev = cpu_none_mask;
>  	cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>
> +/* must be called with cpus_read_lock() held */
> +static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	bool skip_first_cpu = false;
>  	int *irqs, err, i, msi;
>
>  	irqs = kmalloc_objs(int, nvec);
> @@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  		return -ENOMEM;
>
>  	/*
> +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> +	 * nvec is only Queue IRQ (HWC already setup).
>  	 * While processing the next pci irq vector, we start with index 1,
>  	 * as IRQ vector at index 0 is already processed for HWC.
>  	 * However, the population of irqs array starts with index 0, to be
> -	 * further used in irq_setup()
> +	 * further used in mana_irq_setup_numa_aware()
>  	 */
>  	for (i = 1; i <= nvec; i++) {
>  		msi = i;
> @@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	}
>
>  	/*
> -	 * When calling irq_setup() for dynamically added IRQs, if number of
> -	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
> -	 * first CPU sibling group since they are already affinitized to HWC IRQ
> +	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
> +	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
> +	 * skip the first CPU sibling group since they are already affinitized to
> +	 * HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
> +						true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;
> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, our
> +		 * queue IRQs should be equal to num of online vCPUs.
> +		 * We try to make sure queue IRQs spread across all vCPUs.
> +		 * In such a case NUMA or CPU core affinity does not matter.
> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, queue IRQs will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		int first_unassigned = num_online_cpus();
>
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> -	if (err) {
> -		cpus_read_unlock();
> -		goto free_irq;
> +		if (nvec > first_unassigned) {
> +			char buf[32];
> +
> +			if (first_unassigned == nvec - 1)
> +				snprintf(buf, sizeof(buf), "%d",
> +					 first_unassigned);
> +			else
> +				snprintf(buf, sizeof(buf), "%d-%d",
> +					 first_unassigned, nvec - 1);
> +
> +			dev_dbg(&pdev->dev,
> +				"MANA IRQ indices #%s will retain the default CPU affinity\n",
> +				buf);
> +		}
> +
> +		mana_irq_setup_linear(irqs, nvec);
>  	}
>
>  	cpus_read_unlock();
> @@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
>  		nvec -= 1;
>  	}
>
> -	err = irq_setup(irqs, nvec, gc->numa_node, false);
> +	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
>
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> --
> 2.34.1
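
For anyone following the thread, the branch selection and the linear
mapping in the quoted patch can be modeled in plain userspace C. This is
only a sketch under stated assumptions, not the driver code:
model_use_linear() and model_irq_setup_linear() are hypothetical names,
the online CPUs are passed in as an array instead of being walked with
for_each_online_cpu(), and -1 stands in for "retains the default
affinity".

```c
#include <assert.h>

/*
 * Hypothetical model of the branch in mana_gd_setup_dyn_irqs():
 * num_msix_usable counts the HWC IRQ plus the queue IRQs. The linear
 * path is taken only when that total exceeds the online CPU count.
 */
static int model_use_linear(int num_msix_usable, int n_online)
{
	return num_msix_usable > n_online;
}

/*
 * Hypothetical model of mana_irq_setup_linear(): queue IRQ i is pinned
 * to the i-th online CPU. Any IRQ beyond the online CPU count is marked
 * -1, meaning it keeps whatever affinity it already had.
 */
static void model_irq_setup_linear(const int *online_cpus, int n_online,
				   int *affinity, int nvec)
{
	int i;

	assert(n_online >= 0 && nvec >= 0);

	for (i = 0; i < nvec; i++)
		affinity[i] = i < n_online ? online_cpus[i] : -1;
}
```

With the 4 vCPU, 5 MANA IRQ setup from the commit message,
model_use_linear(5, 4) selects the linear path and the four queue IRQs
map to vCPUs 0, 1, 2 and 3, which matches the Case 2 table. If one vCPU
went offline after num_msix_usable was computed, the last queue IRQ
would be left at -1 in this model.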