[PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs

Linux kernel -stable discussions
 help / color / mirror / Atom feed

* [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
@ 2026-04-29  9:06 Shradha Gupta
  2026-05-01  9:12 ` Simon Horman
  2026-05-01 16:22 ` Yury Norov
  0 siblings, 2 replies; 11+ messages in thread
From: Shradha Gupta @ 2026-04-29  9:06 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In mana driver, the number of IRQs allocated is capped by the
min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
than the vcpu count, we want to utilize all the vCPUs, irrespective of
their NUMA/core bindings.

This is important, especially in the envs where number of vCPUs are so
few that the softIRQ handling overhead on two IRQs on the same vCPU is
much more than their overheads if they were spread across sibling vCPUs.

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU, while some vCPUs have none.

In such cases when many parallel TCP connections are tested, the
throughput drops significantly.

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch
20480		15.65		7.73
10240		15.63		8.93
8192		15.64		9.69
6144		15.64		13.16
4096		15.69		15.75
2048		15.69		15.83
1024		15.71		15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..d740d1dc43da 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* should be called with cpus_read_lock() held */
+static void irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, irq, err, i;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * first CPU sibling group since they are already affinitized to HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = irq_setup(irqs, nvec, gc->numa_node, true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, we try to
+		 * make sure we are using all vcpus. In such a case NUMA or
+		 * CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, nvec count will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		if (nvec > num_online_cpus())
+			dev_dbg(&pdev->dev,
+				"IRQ count %d exceeds online CPU count %d. Some IRQs will share CPU\n",
+				nvec, num_online_cpus());
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-29  9:06 [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
@ 2026-05-01  9:12 ` Simon Horman
  2026-05-01 16:22 ` Yury Norov
  1 sibling, 0 replies; 11+ messages in thread
From: Simon Horman @ 2026-05-01  9:12 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Erni Sri Satya Vennela, Dipayaan Roy,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.

...

> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> Changes in v2
>  * Removed the unused skip_first_cpu variable
>  * fixed exit condition in irq_setup_linear() with len == 0
>  * changed return type of irq_setup_linear() as it will always be 0
>  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
>  * added appropriate comments to indicate expected behaviour when
>    IRQs are more than or equal to num_online_cpus()

Reviewed-by: Simon Horman <horms@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-29  9:06 [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
  2026-05-01  9:12 ` Simon Horman
@ 2026-05-01 16:22 ` Yury Norov
  2026-05-02 14:37   ` Shradha Gupta
  1 sibling, 1 reply; 11+ messages in thread
From: Yury Norov @ 2026-05-01 16:22 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> Changes in v2
>  * Removed the unused skip_first_cpu variable
>  * fixed exit condition in irq_setup_linear() with len == 0
>  * changed return type of irq_setup_linear() as it will always be 0
>  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
>  * added appropriate comments to indicate expected behaviour when
>    IRQs are more than or equal to num_online_cpus()
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
>  1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..d740d1dc43da 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	} else {
>  		/* If dynamic allocation is enabled we have already allocated
>  		 * hwc msi
> +		 * Also, we make sure in this case the following is always true
> +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
>  		 */
>  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
>  	}
> @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +/* should be called with cpus_read_lock() held */
> +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	bool skip_first_cpu = false;
>  	int *irqs, irq, err, i;
>  
>  	irqs = kmalloc_objs(int, nvec);

So what about WARN_ON() and nvec adjustment before kmalloc?

> @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;

One thing puzzles me: if you skip first CPU with this 'true', and the
gc->num_msix_usable == num_online_cpus(), it's one more than you can
distribute. What do I miss?

> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.

If it doesn't matter, why don't you assign each IRQ to all CPUs then?
In theory, the system would have most of flexibility to balance them.

> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, nvec count will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		if (nvec > num_online_cpus())
> +			dev_dbg(&pdev->dev,
> +				"IRQ count %d exceeds online CPU count %d. Some IRQs will share CPU\n",

I'd better say 'some IRQs will share the default CPU', and in the
perfect world, I'd like to see:

        'The IRQs #4-12 will share the default CPU #0'

type of message.

> +				nvec, num_online_cpus());
                
It's not that straightforward as it should be. In one case 

        nvec > num_online_cpus()

is a problem, while in another - not. It looks already suspicious. So
when you throw a warning, you should mention it, I believe.

In the

        gc->num_msix_usable <= num_online_cpus()

case, when nvec is too big, would'n 'some IRQs share some CPU' just
as well? If so, you again should throw a message.

> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> -	if (err) {
> -		cpus_read_unlock();
> -		goto free_irq;
> +		irq_setup_linear(irqs, nvec);
>  	}
>  
>  	cpus_read_unlock();
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-01 16:22 ` Yury Norov
@ 2026-05-02 14:37   ` Shradha Gupta
  2026-05-02 17:15     ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Shradha Gupta @ 2026-05-02 14:37 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated is capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vCPUs are so
> > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > much more than their overheads if they were spread across sibling vCPUs.
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU, while some vCPUs have none.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly.
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> > Changes in v2
> >  * Removed the unused skip_first_cpu variable
> >  * fixed exit condition in irq_setup_linear() with len == 0
> >  * changed return type of irq_setup_linear() as it will always be 0
> >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> >  * added appropriate comments to indicate expected behaviour when
> >    IRQs are more than or equal to num_online_cpus()
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> >  1 file changed, 40 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..d740d1dc43da 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> >  	} else {
> >  		/* If dynamic allocation is enabled we have already allocated
> >  		 * hwc msi
> > +		 * Also, we make sure in this case the following is always true
> > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> >  		 */
> >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> >  	}
> > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +/* should be called with cpus_read_lock() held */
> > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (len == 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> >  	struct gdma_irq_context *gic;
> > -	bool skip_first_cpu = false;
> >  	int *irqs, irq, err, i;
> >  
> >  	irqs = kmalloc_objs(int, nvec);
> 
> So what about WARN_ON() and nvec adjustment before kmalloc?
Hey Yury,

I am still a bit unsure about the WARN_ON() before kmalloc, as after
that also, in the same function till we take the cpus_read_lock() the
num_online_cpus() can change(or reduce). That's why I introduced the
dev_dbg() to capture hot-remove edge case.

Do you still think it adds more value?
> 
> > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > -		skip_first_cpu = true;
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > +		if (err) {
> > +			cpus_read_unlock();
> > +			goto free_irq;
> 
> One thing puzzles me: if you skip first CPU with this 'true', and the
> gc->num_msix_usable == num_online_cpus(), it's one more than you can
> distribute. What do I miss?
> 

Let me explain this case a bit better then,

- num_msix_usable = HWC IRQ + Queue IRQ
- nvec in this functions is only Queue IRQ (HWC already setup)

When num_online_cpus == num_msix_usable:
- nvec = num_online_cpus - 1
- first CPU is already assigned to HWC IRQ, so skip it
- Queue IRQs fit in the remaining CPUs

please let me know if I did not get your question right

> > +		}
> > +	} else {
> > +		/*
> > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > +		 * make sure we are using all vcpus. In such a case NUMA or
> > +		 * CPU core affinity does not matter.
> 
> If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> In theory, the system would have most of flexibility to balance them.
> 

Okay, let me fix the comment and elaborate on this. It doesn't matter
because in such a case we want to anyway exhaust and distribute the
Queue IRQs to all vCPUs.
We don't want to rely on the system's balancer in this case as it could
be skewed by other devices' IRQ weights

> > +		 * Note: in this case the total mana IRQ should always be
> > +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * However, if CPUs went offline since num_msix_usable was
> > +		 * computed, nvec count will be more than num_online_cpus().
> > +		 * In such cases remaining extra IRQs will retain their default
> > +		 * affinity.
> > +		 */
> > +		if (nvec > num_online_cpus())
> > +			dev_dbg(&pdev->dev,
> > +				"IRQ count %d exceeds online CPU count %d. Some IRQs will share CPU\n",
> 
> I'd better say 'some IRQs will share the default CPU', and in the
> perfect world, I'd like to see:
> 
>         'The IRQs #4-12 will share the default CPU #0'
> 
> type of message.

Sure, that makes more sense. Will make the change in next version
> 
> > +				nvec, num_online_cpus());
>                 
> It's not that straightforward as it should be. In one case 
> 
>         nvec > num_online_cpus()
> 
> is a problem, while in another - not. It looks already suspicious. So
> when you throw a warning, you should mention it, I believe.
> 
> In the
> 
>         gc->num_msix_usable <= num_online_cpus()
> 
> case, when nvec is too big, would'n 'some IRQs share some CPU' just
> as well? If so, you again should throw a message.

let me try to cleanly address this too, in the next version. May be even
a seperate patch.

Thanks Yury.
> 
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > -	if (err) {
> > -		cpus_read_unlock();
> > -		goto free_irq;
> > +		irq_setup_linear(irqs, nvec);
> >  	}
> >  
> >  	cpus_read_unlock();
> > 
> > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > -- 
> > 2.34.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-02 14:37   ` Shradha Gupta
@ 2026-05-02 17:15     ` Yury Norov
  2026-05-05  6:15       ` Shradha Gupta
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2026-05-02 17:15 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > In mana driver, the number of IRQs allocated is capped by the
> > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > their NUMA/core bindings.
> > > 
> > > This is important, especially in the envs where number of vCPUs are so
> > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > much more than their overheads if they were spread across sibling vCPUs.
> > > 
> > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > IRQs are assigned at a later stage compared to static allocation, other
> > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > same vCPU, while some vCPUs have none.
> > > 
> > > In such cases when many parallel TCP connections are tested, the
> > > throughput drops significantly.
> > > 
> > > Test envs:
> > > =======================================================
> > > Case 1: without this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > > 	TYPE		effective vCPU aff
> > > =======================================================
> > > IRQ0:	HWC		0
> > > IRQ1:	mana_q1		0
> > > IRQ2:	mana_q2		2
> > > IRQ3:	mana_q3		0
> > > IRQ4:	mana_q4		3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU		0	1	2	3
> > > =======================================================
> > > pass 1:		38.85	0.03	24.89	24.65
> > > pass 2:		39.15	0.03	24.57	25.28
> > > pass 3:		40.36	0.03	23.20	23.17
> > > 
> > > =======================================================
> > > Case 2: with this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > >         TYPE            effective vCPU aff
> > > =======================================================
> > > IRQ0:   HWC             0
> > > IRQ1:   mana_q1         0
> > > IRQ2:   mana_q2         1
> > > IRQ3:   mana_q3         2
> > > IRQ4:   mana_q4         3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU            0       1       2       3
> > > =======================================================
> > > pass 1:         15.42	15.85	14.99	14.51
> > > pass 2:         15.53	15.94	15.81	15.93
> > > pass 3:         16.41	16.35	16.40	16.36
> > > 
> > > =======================================================
> > > Throughput Impact(in Gbps, same env)
> > > =======================================================
> > > TCP conn	with patch	w/o patch
> > > 20480		15.65		7.73
> > > 10240		15.63		8.93
> > > 8192		15.64		9.69
> > > 6144		15.64		13.16
> > > 4096		15.69		15.75
> > > 2048		15.69		15.83
> > > 1024		15.71		15.28
> > > 
> > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > Cc: stable@vger.kernel.org
> > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > > Changes in v2
> > >  * Removed the unused skip_first_cpu variable
> > >  * fixed exit condition in irq_setup_linear() with len == 0
> > >  * changed return type of irq_setup_linear() as it will always be 0
> > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > >  * added appropriate comments to indicate expected behaviour when
> > >    IRQs are more than or equal to num_online_cpus()
> > > ---
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 098fbda0d128..d740d1dc43da 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > >  	} else {
> > >  		/* If dynamic allocation is enabled we have already allocated
> > >  		 * hwc msi
> > > +		 * Also, we make sure in this case the following is always true
> > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > >  		 */
> > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > >  	}
> > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > >  	return 0;
> > >  }
> > >  
> > > +/* should be called with cpus_read_lock() held */
> > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > +{
> > > +	int cpu;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		if (len == 0)
> > > +			break;
> > > +
> > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > +		len--;
> > > +	}
> > > +}
> > > +
> > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  {
> > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > >  	struct gdma_irq_context *gic;
> > > -	bool skip_first_cpu = false;
> > >  	int *irqs, irq, err, i;
> > >  
> > >  	irqs = kmalloc_objs(int, nvec);
> > 
> > So what about WARN_ON() and nvec adjustment before kmalloc?
> Hey Yury,
> 
> I am still a bit unsure about the WARN_ON() before kmalloc, as after
> that also, in the same function till we take the cpus_read_lock() the
> num_online_cpus() can change(or reduce). That's why I introduced the
> dev_dbg() to capture hot-remove edge case.

OK.
 
> Do you still think it adds more value?

It's your driver, so you know better. I just wonder because you said
it's good to add WARN_ON(), and then didn't do that.

> > 
> > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > >  	 */
> > >  	cpus_read_lock();
> > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > -		skip_first_cpu = true;
> > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > +		if (err) {
> > > +			cpus_read_unlock();
> > > +			goto free_irq;
> > 
> > One thing puzzles me: if you skip first CPU with this 'true', and the
> > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > distribute. What do I miss?
> > 
> 
> Let me explain this case a bit better then,
> 
> - num_msix_usable = HWC IRQ + Queue IRQ
> - nvec in this functions is only Queue IRQ (HWC already setup)
> 
> When num_online_cpus == num_msix_usable:
> - nvec = num_online_cpus - 1
> - first CPU is already assigned to HWC IRQ, so skip it
> - Queue IRQs fit in the remaining CPUs
> 
> please let me know if I did not get your question right

Can you put that in a comment?

> > > +		}
> > > +	} else {
> > > +		/*
> > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > +		 * CPU core affinity does not matter.
> > 
> > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > In theory, the system would have most of flexibility to balance them.
> > 
> 
> Okay, let me fix the comment and elaborate on this. It doesn't matter
> because in such a case we want to anyway exhaust and distribute the
> Queue IRQs to all vCPUs.
> We don't want to rely on the system's balancer in this case as it could
> be skewed by other devices' IRQ weights

I don't understand this. If I want to reserve some CPUs to solely
handle IRQs from my high-priority hardware, then I configure my system
accordingly. For example, assign all non-networking IRQs on CPU0, and
all networking IRQs to all CPUs.

In your case, you distribute IRQs evenly, which means you've no
preferred CPUs. So, assuming the system is only running your IRQ
driver, it's at max is as good as all-CPU distribution. In case of
heavy loading some particular CPU, your scheme could cause
corresponding IRQs to starve.

I recall, when we was working on irq_setup(), the original idea was to
distribute IRQs one-to-one, but than I suggested the 

        irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));

and after experiments, you agreed on that.

Can you please run your throughput test for my suggested distribution
too? Would be also nice to see how each distribution works when some
CPUs are under stress.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-02 17:15     ` Yury Norov
@ 2026-05-05  6:15       ` Shradha Gupta
  2026-05-05 15:43         ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Shradha Gupta @ 2026-05-05  6:15 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > In mana driver, the number of IRQs allocated is capped by the
> > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > their NUMA/core bindings.
> > > > 
> > > > This is important, especially in the envs where number of vCPUs are so
> > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > 
> > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > same vCPU, while some vCPUs have none.
> > > > 
> > > > In such cases when many parallel TCP connections are tested, the
> > > > throughput drops significantly.
> > > > 
> > > > Test envs:
> > > > =======================================================
> > > > Case 1: without this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > > 	TYPE		effective vCPU aff
> > > > =======================================================
> > > > IRQ0:	HWC		0
> > > > IRQ1:	mana_q1		0
> > > > IRQ2:	mana_q2		2
> > > > IRQ3:	mana_q3		0
> > > > IRQ4:	mana_q4		3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU		0	1	2	3
> > > > =======================================================
> > > > pass 1:		38.85	0.03	24.89	24.65
> > > > pass 2:		39.15	0.03	24.57	25.28
> > > > pass 3:		40.36	0.03	23.20	23.17
> > > > 
> > > > =======================================================
> > > > Case 2: with this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > >         TYPE            effective vCPU aff
> > > > =======================================================
> > > > IRQ0:   HWC             0
> > > > IRQ1:   mana_q1         0
> > > > IRQ2:   mana_q2         1
> > > > IRQ3:   mana_q3         2
> > > > IRQ4:   mana_q4         3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU            0       1       2       3
> > > > =======================================================
> > > > pass 1:         15.42	15.85	14.99	14.51
> > > > pass 2:         15.53	15.94	15.81	15.93
> > > > pass 3:         16.41	16.35	16.40	16.36
> > > > 
> > > > =======================================================
> > > > Throughput Impact(in Gbps, same env)
> > > > =======================================================
> > > > TCP conn	with patch	w/o patch
> > > > 20480		15.65		7.73
> > > > 10240		15.63		8.93
> > > > 8192		15.64		9.69
> > > > 6144		15.64		13.16
> > > > 4096		15.69		15.75
> > > > 2048		15.69		15.83
> > > > 1024		15.71		15.28
> > > > 
> > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > Cc: stable@vger.kernel.org
> > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > ---
> > > > Changes in v2
> > > >  * Removed the unused skip_first_cpu variable
> > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > >  * added appropriate comments to indicate expected behaviour when
> > > >    IRQs are more than or equal to num_online_cpus()
> > > > ---
> > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 098fbda0d128..d740d1dc43da 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > >  	} else {
> > > >  		/* If dynamic allocation is enabled we have already allocated
> > > >  		 * hwc msi
> > > > +		 * Also, we make sure in this case the following is always true
> > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > >  		 */
> > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > >  	}
> > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/* should be called with cpus_read_lock() held */
> > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	for_each_online_cpu(cpu) {
> > > > +		if (len == 0)
> > > > +			break;
> > > > +
> > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > +		len--;
> > > > +	}
> > > > +}
> > > > +
> > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  {
> > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > >  	struct gdma_irq_context *gic;
> > > > -	bool skip_first_cpu = false;
> > > >  	int *irqs, irq, err, i;
> > > >  
> > > >  	irqs = kmalloc_objs(int, nvec);
> > > 
> > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > Hey Yury,
> > 
> > I am still a bit unsure about the WARN_ON() before kmalloc, as after
> > that also, in the same function till we take the cpus_read_lock() the
> > num_online_cpus() can change(or reduce). That's why I introduced the
> > dev_dbg() to capture hot-remove edge case.
> 
> OK.
>  
> > Do you still think it adds more value?
> 
> It's your driver, so you know better. I just wonder because you said
> it's good to add WARN_ON(), and then didn't do that.
> 
> > > 
> > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > >  	 */
> > > >  	cpus_read_lock();
> > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > -		skip_first_cpu = true;
> > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > +		if (err) {
> > > > +			cpus_read_unlock();
> > > > +			goto free_irq;
> > > 
> > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > distribute. What do I miss?
> > > 
> > 
> > Let me explain this case a bit better then,
> > 
> > - num_msix_usable = HWC IRQ + Queue IRQ
> > - nvec in this functions is only Queue IRQ (HWC already setup)
> > 
> > When num_online_cpus == num_msix_usable:
> > - nvec = num_online_cpus - 1
> > - first CPU is already assigned to HWC IRQ, so skip it
> > - Queue IRQs fit in the remaining CPUs
> > 
> > please let me know if I did not get your question right
> 
> Can you put that in a comment?

Sure I will. thanks

> 
> > > > +		}
> > > > +	} else {
> > > > +		/*
> > > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > +		 * CPU core affinity does not matter.
> > > 
> > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > In theory, the system would have most of flexibility to balance them.
> > > 
> > 
> > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > because in such a case we want to anyway exhaust and distribute the
> > Queue IRQs to all vCPUs.
> > We don't want to rely on the system's balancer in this case as it could
> > be skewed by other devices' IRQ weights
> 
> I don't understand this. If I want to reserve some CPUs to solely
> handle IRQs from my high-priority hardware, then I configure my system
> accordingly. For example, assign all non-networking IRQs on CPU0, and
> all networking IRQs to all CPUs.
> 
> In your case, you distribute IRQs evenly, which means you've no
> preferred CPUs. So, assuming the system is only running your IRQ
> driver, it's at max is as good as all-CPU distribution. In case of
> heavy loading some particular CPU, your scheme could cause
> corresponding IRQs to starve.
> 
> I recall, when we was working on irq_setup(), the original idea was to
> distribute IRQs one-to-one, but than I suggested the 
> 
>         irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> 
> and after experiments, you agreed on that.
> 
> Can you please run your throughput test for my suggested distribution
> too? Would be also nice to see how each distribution works when some
> CPUs are under stress.
> 
> Thanks,
> Yury

The design of irq_setup() works exactly how we want it for our IRQs for
almost all of our usecases, so we want to keep that as is. The only
scenarios where this is an issue in terms of significant throughput drop
is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
connection counts) and where there are additional NVMe devices attached
to the VM.

The current patch about utilizing all the vCPUs helps in that case and
doesn't cause any regression for other cases.

This linear path is only taken when num_msix_usable > num_online_cpus(),
which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
as before.

We can definately get our throughput run results on other suggestions
you have. And about that, I just needed a bit more clarity on what to
test against. Are you suggesting, with irq_setup() intact and in use, we
configure the non-mana IRQs to say CPU0 and capture the numbers?

Thanks,
Shradha.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-05  6:15       ` Shradha Gupta
@ 2026-05-05 15:43         ` Yury Norov
  2026-05-08  5:51           ` Shradha Gupta
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2026-05-05 15:43 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Mon, May 04, 2026 at 11:15:03PM -0700, Shradha Gupta wrote:
> On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> > On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > > In mana driver, the number of IRQs allocated is capped by the
> > > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > > their NUMA/core bindings.
> > > > > 
> > > > > This is important, especially in the envs where number of vCPUs are so
> > > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > > 
> > > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > > same vCPU, while some vCPUs have none.
> > > > > 
> > > > > In such cases when many parallel TCP connections are tested, the
> > > > > throughput drops significantly.
> > > > > 
> > > > > Test envs:
> > > > > =======================================================
> > > > > Case 1: without this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > > 	TYPE		effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:	HWC		0
> > > > > IRQ1:	mana_q1		0
> > > > > IRQ2:	mana_q2		2
> > > > > IRQ3:	mana_q3		0
> > > > > IRQ4:	mana_q4		3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU		0	1	2	3
> > > > > =======================================================
> > > > > pass 1:		38.85	0.03	24.89	24.65
> > > > > pass 2:		39.15	0.03	24.57	25.28
> > > > > pass 3:		40.36	0.03	23.20	23.17
> > > > > 
> > > > > =======================================================
> > > > > Case 2: with this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > >         TYPE            effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:   HWC             0
> > > > > IRQ1:   mana_q1         0
> > > > > IRQ2:   mana_q2         1
> > > > > IRQ3:   mana_q3         2
> > > > > IRQ4:   mana_q4         3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU            0       1       2       3
> > > > > =======================================================
> > > > > pass 1:         15.42	15.85	14.99	14.51
> > > > > pass 2:         15.53	15.94	15.81	15.93
> > > > > pass 3:         16.41	16.35	16.40	16.36
> > > > > 
> > > > > =======================================================
> > > > > Throughput Impact(in Gbps, same env)
> > > > > =======================================================
> > > > > TCP conn	with patch	w/o patch
> > > > > 20480		15.65		7.73
> > > > > 10240		15.63		8.93
> > > > > 8192		15.64		9.69
> > > > > 6144		15.64		13.16
> > > > > 4096		15.69		15.75
> > > > > 2048		15.69		15.83
> > > > > 1024		15.71		15.28
> > > > > 
> > > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > > Cc: stable@vger.kernel.org
> > > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > > ---
> > > > > Changes in v2
> > > > >  * Removed the unused skip_first_cpu variable
> > > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > > >  * added appropriate comments to indicate expected behaviour when
> > > > >    IRQs are more than or equal to num_online_cpus()
> > > > > ---
> > > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > > > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > index 098fbda0d128..d740d1dc43da 100644
> > > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > > >  	} else {
> > > > >  		/* If dynamic allocation is enabled we have already allocated
> > > > >  		 * hwc msi
> > > > > +		 * Also, we make sure in this case the following is always true
> > > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > > >  		 */
> > > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > > >  	}
> > > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +/* should be called with cpus_read_lock() held */
> > > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	for_each_online_cpu(cpu) {
> > > > > +		if (len == 0)
> > > > > +			break;
> > > > > +
> > > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > > +		len--;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  {
> > > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > >  	struct gdma_irq_context *gic;
> > > > > -	bool skip_first_cpu = false;
> > > > >  	int *irqs, irq, err, i;
> > > > >  
> > > > >  	irqs = kmalloc_objs(int, nvec);
> > > > 
> > > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > > Hey Yury,
> > > 
> > > I am still a bit unsure about the WARN_ON() before kmalloc, as after
> > > that also, in the same function till we take the cpus_read_lock() the
> > > num_online_cpus() can change(or reduce). That's why I introduced the
> > > dev_dbg() to capture hot-remove edge case.
> > 
> > OK.
> >  
> > > Do you still think it adds more value?
> > 
> > It's your driver, so you know better. I just wonder because you said
> > it's good to add WARN_ON(), and then didn't do that.
> > 
> > > > 
> > > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > > >  	 */
> > > > >  	cpus_read_lock();
> > > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > > -		skip_first_cpu = true;
> > > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > > +		if (err) {
> > > > > +			cpus_read_unlock();
> > > > > +			goto free_irq;
> > > > 
> > > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > > distribute. What do I miss?
> > > > 
> > > 
> > > Let me explain this case a bit better then,
> > > 
> > > - num_msix_usable = HWC IRQ + Queue IRQ
> > > - nvec in this functions is only Queue IRQ (HWC already setup)
> > > 
> > > When num_online_cpus == num_msix_usable:
> > > - nvec = num_online_cpus - 1
> > > - first CPU is already assigned to HWC IRQ, so skip it
> > > - Queue IRQs fit in the remaining CPUs
> > > 
> > > please let me know if I did not get your question right
> > 
> > Can you put that in a comment?
> 
> Sure I will. thanks
> 
> > 
> > > > > +		}
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > > +		 * CPU core affinity does not matter.
> > > > 
> > > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > > In theory, the system would have most of flexibility to balance them.
> > > > 
> > > 
> > > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > > because in such a case we want to anyway exhaust and distribute the
> > > Queue IRQs to all vCPUs.
> > > We don't want to rely on the system's balancer in this case as it could
> > > be skewed by other devices' IRQ weights
> > 
> > I don't understand this. If I want to reserve some CPUs to solely
> > handle IRQs from my high-priority hardware, then I configure my system
> > accordingly. For example, assign all non-networking IRQs on CPU0, and
> > all networking IRQs to all CPUs.
> > 
> > In your case, you distribute IRQs evenly, which means you've no
> > preferred CPUs. So, assuming the system is only running your IRQ
> > driver, it's at max is as good as all-CPU distribution. In case of
> > heavy loading some particular CPU, your scheme could cause
> > corresponding IRQs to starve.
> > 
> > I recall, when we was working on irq_setup(), the original idea was to
> > distribute IRQs one-to-one, but than I suggested the 
> > 
> >         irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > 
> > and after experiments, you agreed on that.
> > 
> > Can you please run your throughput test for my suggested distribution
> > too? Would be also nice to see how each distribution works when some
> > CPUs are under stress.
> > 
> > Thanks,
> > Yury
> 
> The design of irq_setup() works exactly how we want it for our IRQs for
> almost all of our usecases, so we want to keep that as is. The only
> scenarios where this is an issue in terms of significant throughput drop
> is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
> connection counts) and where there are additional NVMe devices attached
> to the VM.
> 
> The current patch about utilizing all the vCPUs helps in that case and
> doesn't cause any regression for other cases.
> 
> This linear path is only taken when num_msix_usable > num_online_cpus(),
> which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
> as before.
> 
> We can definately get our throughput run results on other suggestions
> you have. And about that, I just needed a bit more clarity on what to
> test against. Are you suggesting, with irq_setup() intact and in use, we
> configure the non-mana IRQs to say CPU0 and capture the numbers?

Can you try this:

       while(len--)
               // Or cpu_online_mask or cpu_all_mask?
               irq_set_affinity_and_hint(*irqs++, NULL);

And compare it to the linear version under your vCPU scenario?

Can you run your throughput test alone and on parallel with some
IRQ torture test?

        stress-ng --timer 4 --timeout 60s

And maybe pin the stress test to the default CPU. Assuming it's 0:

        taskset -c 0 stress-ng --timer 4 --timeout 60s

Unless the 'linear' version is significantly faster, I'd stick to the
above.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-05 15:43         ` Yury Norov
@ 2026-05-08  5:51           ` Shradha Gupta
  2026-05-10 20:00             ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Shradha Gupta @ 2026-05-08  5:51 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Tue, May 05, 2026 at 11:43:26AM -0400, Yury Norov wrote:
> On Mon, May 04, 2026 at 11:15:03PM -0700, Shradha Gupta wrote:
> > On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> > > On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > > > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > > > In mana driver, the number of IRQs allocated is capped by the
> > > > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > > > their NUMA/core bindings.
> > > > > > 
> > > > > > This is important, especially in the envs where number of vCPUs are so
> > > > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > > > 
> > > > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > > > same vCPU, while some vCPUs have none.
> > > > > > 
> > > > > > In such cases when many parallel TCP connections are tested, the
> > > > > > throughput drops significantly.
> > > > > > 
> > > > > > Test envs:
> > > > > > =======================================================
> > > > > > Case 1: without this patch
> > > > > > =======================================================
> > > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > > 
> > > > > > 	TYPE		effective vCPU aff
> > > > > > =======================================================
> > > > > > IRQ0:	HWC		0
> > > > > > IRQ1:	mana_q1		0
> > > > > > IRQ2:	mana_q2		2
> > > > > > IRQ3:	mana_q3		0
> > > > > > IRQ4:	mana_q4		3
> > > > > > 
> > > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > > vCPU		0	1	2	3
> > > > > > =======================================================
> > > > > > pass 1:		38.85	0.03	24.89	24.65
> > > > > > pass 2:		39.15	0.03	24.57	25.28
> > > > > > pass 3:		40.36	0.03	23.20	23.17
> > > > > > 
> > > > > > =======================================================
> > > > > > Case 2: with this patch
> > > > > > =======================================================
> > > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > > 
> > > > > >         TYPE            effective vCPU aff
> > > > > > =======================================================
> > > > > > IRQ0:   HWC             0
> > > > > > IRQ1:   mana_q1         0
> > > > > > IRQ2:   mana_q2         1
> > > > > > IRQ3:   mana_q3         2
> > > > > > IRQ4:   mana_q4         3
> > > > > > 
> > > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > > vCPU            0       1       2       3
> > > > > > =======================================================
> > > > > > pass 1:         15.42	15.85	14.99	14.51
> > > > > > pass 2:         15.53	15.94	15.81	15.93
> > > > > > pass 3:         16.41	16.35	16.40	16.36
> > > > > > 
> > > > > > =======================================================
> > > > > > Throughput Impact(in Gbps, same env)
> > > > > > =======================================================
> > > > > > TCP conn	with patch	w/o patch
> > > > > > 20480		15.65		7.73
> > > > > > 10240		15.63		8.93
> > > > > > 8192		15.64		9.69
> > > > > > 6144		15.64		13.16
> > > > > > 4096		15.69		15.75
> > > > > > 2048		15.69		15.83
> > > > > > 1024		15.71		15.28
> > > > > > 
> > > > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > > > Cc: stable@vger.kernel.org
> > > > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > > > ---
> > > > > > Changes in v2
> > > > > >  * Removed the unused skip_first_cpu variable
> > > > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > > > >  * added appropriate comments to indicate expected behaviour when
> > > > > >    IRQs are more than or equal to num_online_cpus()
> > > > > > ---
> > > > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > > > > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > > index 098fbda0d128..d740d1dc43da 100644
> > > > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > > > >  	} else {
> > > > > >  		/* If dynamic allocation is enabled we have already allocated
> > > > > >  		 * hwc msi
> > > > > > +		 * Also, we make sure in this case the following is always true
> > > > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > > > >  		 */
> > > > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > > > >  	}
> > > > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > > +/* should be called with cpus_read_lock() held */
> > > > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > > > +{
> > > > > > +	int cpu;
> > > > > > +
> > > > > > +	for_each_online_cpu(cpu) {
> > > > > > +		if (len == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > > > +		len--;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > > >  {
> > > > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > > >  	struct gdma_irq_context *gic;
> > > > > > -	bool skip_first_cpu = false;
> > > > > >  	int *irqs, irq, err, i;
> > > > > >  
> > > > > >  	irqs = kmalloc_objs(int, nvec);
> > > > > 
> > > > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > > > Hey Yury,
> > > > 
> > > > I am still a bit unsure about the WARN_ON() before kmalloc, as after
> > > > that also, in the same function till we take the cpus_read_lock() the
> > > > num_online_cpus() can change(or reduce). That's why I introduced the
> > > > dev_dbg() to capture hot-remove edge case.
> > > 
> > > OK.
> > >  
> > > > Do you still think it adds more value?
> > > 
> > > It's your driver, so you know better. I just wonder because you said
> > > it's good to add WARN_ON(), and then didn't do that.
> > > 
> > > > > 
> > > > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > > > >  	 */
> > > > > >  	cpus_read_lock();
> > > > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > > > -		skip_first_cpu = true;
> > > > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > > > +		if (err) {
> > > > > > +			cpus_read_unlock();
> > > > > > +			goto free_irq;
> > > > > 
> > > > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > > > distribute. What do I miss?
> > > > > 
> > > > 
> > > > Let me explain this case a bit better then,
> > > > 
> > > > - num_msix_usable = HWC IRQ + Queue IRQ
> > > > - nvec in this functions is only Queue IRQ (HWC already setup)
> > > > 
> > > > When num_online_cpus == num_msix_usable:
> > > > - nvec = num_online_cpus - 1
> > > > - first CPU is already assigned to HWC IRQ, so skip it
> > > > - Queue IRQs fit in the remaining CPUs
> > > > 
> > > > please let me know if I did not get your question right
> > > 
> > > Can you put that in a comment?
> > 
> > Sure I will. thanks
> > 
> > > 
> > > > > > +		}
> > > > > > +	} else {
> > > > > > +		/*
> > > > > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > > > +		 * CPU core affinity does not matter.
> > > > > 
> > > > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > > > In theory, the system would have most of flexibility to balance them.
> > > > > 
> > > > 
> > > > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > > > because in such a case we want to anyway exhaust and distribute the
> > > > Queue IRQs to all vCPUs.
> > > > We don't want to rely on the system's balancer in this case as it could
> > > > be skewed by other devices' IRQ weights
> > > 
> > > I don't understand this. If I want to reserve some CPUs to solely
> > > handle IRQs from my high-priority hardware, then I configure my system
> > > accordingly. For example, assign all non-networking IRQs on CPU0, and
> > > all networking IRQs to all CPUs.
> > > 
> > > In your case, you distribute IRQs evenly, which means you've no
> > > preferred CPUs. So, assuming the system is only running your IRQ
> > > driver, it's at max is as good as all-CPU distribution. In case of
> > > heavy loading some particular CPU, your scheme could cause
> > > corresponding IRQs to starve.
> > > 
> > > I recall, when we was working on irq_setup(), the original idea was to
> > > distribute IRQs one-to-one, but than I suggested the 
> > > 
> > >         irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > 
> > > and after experiments, you agreed on that.
> > > 
> > > Can you please run your throughput test for my suggested distribution
> > > too? Would be also nice to see how each distribution works when some
> > > CPUs are under stress.
> > > 
> > > Thanks,
> > > Yury
> > 
> > The design of irq_setup() works exactly how we want it for our IRQs for
> > almost all of our usecases, so we want to keep that as is. The only
> > scenarios where this is an issue in terms of significant throughput drop
> > is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
> > connection counts) and where there are additional NVMe devices attached
> > to the VM.
> > 
> > The current patch about utilizing all the vCPUs helps in that case and
> > doesn't cause any regression for other cases.
> > 
> > This linear path is only taken when num_msix_usable > num_online_cpus(),
> > which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
> > as before.
> > 
> > We can definately get our throughput run results on other suggestions
> > you have. And about that, I just needed a bit more clarity on what to
> > test against. Are you suggesting, with irq_setup() intact and in use, we
> > configure the non-mana IRQs to say CPU0 and capture the numbers?
> 
> Can you try this:
> 
>        while(len--)
>                // Or cpu_online_mask or cpu_all_mask?
>                irq_set_affinity_and_hint(*irqs++, NULL);
> 
> And compare it to the linear version under your vCPU scenario?
> 
> Can you run your throughput test alone and on parallel with some
> IRQ torture test?
> 
>         stress-ng --timer 4 --timeout 60s
> 
> And maybe pin the stress test to the default CPU. Assuming it's 0:
> 
>         taskset -c 0 stress-ng --timer 4 --timeout 60s
> 
> Unless the 'linear' version is significantly faster, I'd stick to the
> above.
> 
> Thanks,
> Yury

Hey Yury,

We tried a few tests with your suggestion, and throughput seems to be
the same compared to the linear distribution approach. We stressed out
CPU0 in both the cases and the results were similar. No IRQ migration
was observed in either case and no throughput drop.
 
But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
NULL);" is essentially a no-op and we end up relying on the initial
placement from pci_alloc_irq_vectors(). Even though in these tests we
were not able to reproduce it, but with this distribution there is a
chance we end up clustering the mana queue IRQs, while other vCPUs are
not running any network load. It's because the placement depends on
system-wide IRQ state at allocation time.
 
The linear approach however gaurantees each queue IRQ lands on a
distinct vCPU regardless of system state. Even after stressing the cpus
using stress-ng, we did not observe any significant throughput drop.


regards,
Shradha.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-08  5:51           ` Shradha Gupta
@ 2026-05-10 20:00             ` Yury Norov
  2026-05-18  7:04               ` Erni Sri Satya Vennela
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2026-05-10 20:00 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Thu, May 07, 2026 at 10:51:05PM -0700, Shradha Gupta wrote:

...

> > > We can definately get our throughput run results on other suggestions
> > > you have. And about that, I just needed a bit more clarity on what to
> > > test against. Are you suggesting, with irq_setup() intact and in use, we
> > > configure the non-mana IRQs to say CPU0 and capture the numbers?
> > 
> > Can you try this:
> > 
> >        while(len--)
> >                // Or cpu_online_mask or cpu_all_mask?
> >                irq_set_affinity_and_hint(*irqs++, NULL);
> > 
> > And compare it to the linear version under your vCPU scenario?
> > 
> > Can you run your throughput test alone and on parallel with some
> > IRQ torture test?
> > 
> >         stress-ng --timer 4 --timeout 60s
> > 
> > And maybe pin the stress test to the default CPU. Assuming it's 0:
> > 
> >         taskset -c 0 stress-ng --timer 4 --timeout 60s
> > 
> > Unless the 'linear' version is significantly faster, I'd stick to the
> > above.
> > 
> > Thanks,
> > Yury
> 
> Hey Yury,
> 
> We tried a few tests with your suggestion, and throughput seems to be
> the same compared to the linear distribution approach. We stressed out
> CPU0 in both the cases and the results were similar. No IRQ migration
> was observed in either case and no throughput drop.
>  
> But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
> NULL);" is essentially a no-op and we end up relying on the initial
> placement from pci_alloc_irq_vectors().

Yes you are, assuming you're not binding them before in your call chain.

> Even though in these tests we
> were not able to reproduce it, but with this distribution there is a
> chance we end up clustering the mana queue IRQs, while other vCPUs are
> not running any network load.

That sounds like an IRQ balancer bug which you're unable to reproduce. 

> It's because the placement depends on
> system-wide IRQ state at allocation time.

I don't understand this point. The 

        irq_set_affinity_and_hint(*irqs++, NULL);

simply means: I trust system IRQ balancer to pick the best CPU for my
IRQ at runtime. It doesn't refer any "IRQ state at allocation time".
  
> The linear approach however gaurantees each queue IRQ lands on a
> distinct vCPU regardless of system state. Even after stressing the cpus
> using stress-ng, we did not observe any significant throughput drop.

If you just do nothing, it would lead to the same numbers, right? What
does that "non-significant throughput drop" mean? It sounds like the
linear approach is slightly worse.

--

So, as you can't demonstrate solid benefit for the 'linear' IRQ placement,
I would just stick to the no-affinity logic.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-10 20:00             ` Yury Norov
@ 2026-05-18  7:04               ` Erni Sri Satya Vennela
  2026-05-26 13:05                 ` Shradha Gupta
  0 siblings, 1 reply; 11+ messages in thread
From: Erni Sri Satya Vennela @ 2026-05-18  7:04 UTC (permalink / raw)
  To: Yury Norov
  Cc: Shradha Gupta, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

> > But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
> > NULL);" is essentially a no-op and we end up relying on the initial
> > placement from pci_alloc_irq_vectors().
> 
> Yes you are, assuming you're not binding them before in your call chain.
> 
> > Even though in these tests we
> > were not able to reproduce it, but with this distribution there is a
> > chance we end up clustering the mana queue IRQs, while other vCPUs are
> > not running any network load.
> 
> That sounds like an IRQ balancer bug which you're unable to reproduce. 
> 
> > It's because the placement depends on
> > system-wide IRQ state at allocation time.
> 
> I don't understand this point. The 
> 
>         irq_set_affinity_and_hint(*irqs++, NULL);
> 
> simply means: I trust system IRQ balancer to pick the best CPU for my
> IRQ at runtime. It doesn't refer any "IRQ state at allocation time".
>   
> > The linear approach however gaurantees each queue IRQ lands on a
> > distinct vCPU regardless of system state. Even after stressing the cpus
> > using stress-ng, we did not observe any significant throughput drop.
> 
> If you just do nothing, it would lead to the same numbers, right? What
> does that "non-significant throughput drop" mean? It sounds like the
> linear approach is slightly worse.

The numbers are not worse, they almost same in both the cases.
> 
> --
> 
> So, as you can't demonstrate solid benefit for the 'linear' IRQ placement,
> I would just stick to the no-affinity logic.

Thankyou Yury,
We are investigating on more test scenarios and trying to
capture numbers with both, your proposed change and the one from this
patch. We will keep you updated about the results.


- Vennela

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
  2026-05-18  7:04               ` Erni Sri Satya Vennela
@ 2026-05-26 13:05                 ` Shradha Gupta
  0 siblings, 0 replies; 11+ messages in thread
From: Shradha Gupta @ 2026-05-26 13:05 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: Yury Norov, Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Konstantin Taranov, Simon Horman, Dipayaan Roy,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Mon, May 18, 2026 at 12:04:31AM -0700, Erni Sri Satya Vennela wrote:
> > > But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
> > > NULL);" is essentially a no-op and we end up relying on the initial
> > > placement from pci_alloc_irq_vectors().
> > 
> > Yes you are, assuming you're not binding them before in your call chain.
> > 
> > > Even though in these tests we
> > > were not able to reproduce it, but with this distribution there is a
> > > chance we end up clustering the mana queue IRQs, while other vCPUs are
> > > not running any network load.
> > 
> > That sounds like an IRQ balancer bug which you're unable to reproduce. 
> > 
> > > It's because the placement depends on
> > > system-wide IRQ state at allocation time.
> > 
> > I don't understand this point. The 
> > 
> >         irq_set_affinity_and_hint(*irqs++, NULL);
> > 
> > simply means: I trust system IRQ balancer to pick the best CPU for my
> > IRQ at runtime. It doesn't refer any "IRQ state at allocation time".
> >   
> > > The linear approach however gaurantees each queue IRQ lands on a
> > > distinct vCPU regardless of system state. Even after stressing the cpus
> > > using stress-ng, we did not observe any significant throughput drop.
> > 
> > If you just do nothing, it would lead to the same numbers, right? What
> > does that "non-significant throughput drop" mean? It sounds like the
> > linear approach is slightly worse.
> 
> The numbers are not worse, they almost same in both the cases.
> > 
> > --
> > 
> > So, as you can't demonstrate solid benefit for the 'linear' IRQ placement,
> > I would just stick to the no-affinity logic.
> 
> Thankyou Yury,
> We are investigating on more test scenarios and trying to
> capture numbers with both, your proposed change and the one from this
> patch. We will keep you updated about the results.
> 
> 
> - Vennela

Hi Yury,

Vennela and I ran a bunch of more tests and were able to reproduce the
clustering of mana IRQs issue we discussed earlier with the suggested
approach(setting the affinity and hint to NULL).
In these tests there were additional IRQs allocated(apart from MANA),
that disturbed the MANA IRQ distribution

ENV details
azure SKU(Standard_L4als_v5) 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4
Queue)

"Affinity set to NULL" approach
========================================
MANA IRQ distribution	vCPU
========================================
IRQ0	HWC		0
IRQ1	mana_q1		2
IRQ3	mana_q2		3
IRQ4	mana_q3		2
IRQ5	mana_q4		3


"Affinity set linearly" approach
========================================
MANA IRQ distribution	vCPU
========================================
IRQ0	HWC		0
IRQ1	mana_q1		1
IRQ3	mana_q2		2
IRQ4	mana_q3		3
IRQ5	mana_q4		0


Throughput(Gbps) with high TCP connection
========================================
connection	affinity NULL	Linear
20480		5.25		13.49
10240		5.77		13.48
8192		7.16		13.48
6144		9.33		13.53
4096		13.50		13.50


Considering these results, we would like to proceed with the linear
approach that was proposed by this patch.


Regards,
Shradha

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-26 13:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-29  9:06 [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
2026-05-01  9:12 ` Simon Horman
2026-05-01 16:22 ` Yury Norov
2026-05-02 14:37   ` Shradha Gupta
2026-05-02 17:15     ` Yury Norov
2026-05-05  6:15       ` Shradha Gupta
2026-05-05 15:43         ` Yury Norov
2026-05-08  5:51           ` Shradha Gupta
2026-05-10 20:00             ` Yury Norov
2026-05-18  7:04               ` Erni Sri Satya Vennela
2026-05-26 13:05                 ` Shradha Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox