From: Yury Norov <ynorov@nvidia.com>
To: Shradha Gupta <shradhagupta@linux.microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>, Wei Liu <wei.liu@kernel.org>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Konstantin Taranov <kotaranov@microsoft.com>,
	Simon Horman <horms@kernel.org>,
	Erni Sri Satya Vennela <ernis@linux.microsoft.com>,
	Dipayaan Roy <dipayanroy@linux.microsoft.com>,
	Shiraz Saleem <shirazsaleem@microsoft.com>,
	Michael Kelley <mhklinux@outlook.com>,
	Long Li <longli@microsoft.com>, Yury Norov <yury.norov@gmail.com>,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, Paul Rosswurm <paulros@microsoft.com>,
	Shradha Gupta <shradhagupta@microsoft.com>,
	Saurabh Singh Sengar <ssengar@microsoft.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
Date: Fri, 24 Apr 2026 17:25:45 -0400
Message-ID: <aevf2bPLBiAzX7UC@yury>
In-Reply-To: <20260424061702.1442618-1-shradhagupta@linux.microsoft.com>

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In the mana driver, the number of IRQs allocated is capped at
> min(num_cpu + 1, queue count). In cases where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
>
> This is important, especially in envs where the number of vcpus is so
> small that the softIRQ handling overhead of two IRQs on the same vcpu is
> much higher than the overhead if they were spread across sibling vcpus.
>
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
>
> In such cases, when many parallel TCP connections are tested, the
> throughput drops significantly.
>
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE        effective vCPU aff
> =======================================================
> IRQ0:   HWC         0
> IRQ1:   mana_q1     0
> IRQ2:   mana_q2     2
> IRQ3:   mana_q3     0
> IRQ4:   mana_q4     3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU        0       1       2       3
> =======================================================
> pass 1:     38.85   0.03    24.89   24.65
> pass 2:     39.15   0.03    24.57   25.28
> pass 3:     40.36   0.03    23.20   23.17
>
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE        effective vCPU aff
> =======================================================
> IRQ0:   HWC         0
> IRQ1:   mana_q1     0
> IRQ2:   mana_q2     1
> IRQ3:   mana_q3     2
> IRQ4:   mana_q4     3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU        0       1       2       3
> =======================================================
> pass 1:     15.42   15.85   14.99   14.51
> pass 2:     15.53   15.94   15.81   15.93
> pass 3:     16.41   16.35   16.40   16.36
>
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn    with patch    w/o patch
> 20480       15.65         7.73
> 10240       15.63         8.93
> 8192        15.64         9.69
> 6144        15.64         13.16
> 4096        15.69         15.75
> 2048        15.69         15.83
> 1024        15.71         15.28
>
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> .../net/ethernet/microsoft/mana/gdma_main.c | 35 +++++++++++++++++--
> 1 file changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);

Then you don't need the 'skip_first_cpu' variable.

> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
> +		 */
> +		WARN_ON(nvec > num_online_cpus());

That sounds weird. If you don't support more IRQs than CPUs and want to
warn about it, you'd do that earlier in the function and align the rest of
the logic accordingly. For example:

	if (WARN_ON(nvec > num_online_cpus()))
		nvec = num_online_cpus();
	irqs = kmalloc_objs(int, nvec);
	if (!irqs)
		return -ENOMEM;
	...

That would also decrease pressure on the allocator.

What would happen with those IRQs beyond num_online_cpus()? Can you explain
it in the comment? I'm not an expert in your driver, but usually if you pass
a vector to a function, and the function is able to handle only part of it,
it returns the number of processed elements.
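
To illustrate that last point, here is a minimal userspace sketch, not MANA
code. The names struct irq_slot and assign_linear() are hypothetical and
made up for this example, and the CPU count is passed in as a plain number
instead of being read from the online mask. It only shows the convention of
handling part of the input and reporting how many elements were processed:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a real IRQ: it only records which CPU the
 * slot was pinned to. A cpu value of -1 means "not assigned".
 */
struct irq_slot {
	unsigned int irq;
	int cpu;
};

/*
 * Pin one slot per CPU, in order, and return how many slots were handled.
 * A return value smaller than len tells the caller that the tail
 * slots[ret..len-1] was left untouched.
 */
static size_t assign_linear(struct irq_slot *slots, size_t len, size_t ncpus)
{
	size_t done = 0;

	assert(slots != NULL || len == 0);

	while (done < len && done < ncpus) {
		slots[done].cpu = (int)done;
		done++;
	}

	return done;
}
```

With a return value like this, the caller can compare it against the length
it passed in and decide what to do with the remainder, rather than relying
on a WARN_ON() inside the branch.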

Thanks,
Yury

> +		err = irq_setup_linear(irqs, nvec);
> +	}
>
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
>
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> --
> 2.34.1