[PATCH net] net: mana: Optimize irq affinity for low vcpu configs

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
@ 2026-04-24  6:17 Shradha Gupta
  2026-04-24 12:21 ` Dipayaan Roy
  2026-04-24 21:25 ` Yury Norov
  0 siblings, 2 replies; 7+ messages in thread
From: Shradha Gupta @ 2026-04-24  6:17 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In mana driver, the number of IRQs allocated are capped by the
min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
than the vcpu count, we want to utilize all the vcpus, irrespective of
their NUMA/core bindings.

This is important, especially in the envs where number of vcpus are so
few that the softIRQ handling overhead on two IRQs on the same vcpu is
much more than their overheads if they were spread across sibling vcpus

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU.

In such cases when many parallel TCP connections are tested, the
throughput drops significantly

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch
20480		15.65		7.73
10240		15.63		8.93
8192		15.64		9.69
6144		15.64		13.16
4096		15.69		15.75
2048		15.69		15.83
1024		15.71		15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..433c044d53c6 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+static int irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		if (len <= 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * first CPU sibling group since they are already affinitized to HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
+	if (gc->num_msix_usable <= num_online_cpus()) {
 		skip_first_cpu = true;
+		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
+	} else {
+		/*
+		 * In case our IRQs are more than num_online_cpus, we try to
+		 * make sure we are using all vcpus. In such a case NUMA or
+		 * CPU core affinity does not matter.
+		 * Note that in this case the total mana IRQ should always be
+		 * num_online_cpu + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * So, the nvec value in this path should always be equal to
+		 * num_online_cpu
+		 */
+		WARN_ON(nvec > num_online_cpus());
+		err = irq_setup_linear(irqs, nvec);
+	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-24  6:17 [PATCH net] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
@ 2026-04-24 12:21 ` Dipayaan Roy
  2026-04-25  6:15   ` Shradha Gupta
  2026-04-24 21:25 ` Yury Norov
  1 sibling, 1 reply; 7+ messages in thread
From: Dipayaan Roy @ 2026-04-24 12:21 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated are capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vcpus are so
> few that the softIRQ handling overhead on two IRQs on the same vcpu is
> much more than their overheads if they were spread across sibling vcpus
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
We do not need to call rcu_read_lock here, as the caller of this
function has already acquired cpus_read_lock.
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
len is unsigned here so <= doesnot makes sense. PLease change it to int
or better use if(!len)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
nit: typo: should be num_online_cpus
> +		 */
> +		WARN_ON(nvec > num_online_cpus());
> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1
> 
Regards
Dipayaan Roy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-24 12:21 ` Dipayaan Roy
@ 2026-04-25  6:15   ` Shradha Gupta
  2026-04-25  9:43     ` Dipayaan Roy
  0 siblings, 1 reply; 7+ messages in thread
From: Shradha Gupta @ 2026-04-25  6:15 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated are capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vcpus are so
> > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > much more than their overheads if they were spread across sibling vcpus
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> >  1 file changed, 33 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..433c044d53c6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	rcu_read_lock();
> We do not need to call rcu_read_lock here, as the caller of this
> function has already acquired cpus_read_lock.

Thanks for your comments Dipayaan, I think this is still needed for the
irq_set_affinity_and_hint(), to protect the pointer returned by
irq_to_desc(). You can also see the same in the original function
irq_setup() for the same reason.

> > +	for_each_online_cpu(cpu) {
> > +		if (len <= 0)
> len is unsigned here so <= doesnot makes sense. PLease change it to int
> or better use if(!len)

sure, I think I will change it to explicitly exit when len == 0
Thanks.

> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return 0;
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> >  		skip_first_cpu = true;
> > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > +	} else {
> > +		/*
> > +		 * In case our IRQs are more than num_online_cpus, we try to
> > +		 * make sure we are using all vcpus. In such a case NUMA or
> > +		 * CPU core affinity does not matter.
> > +		 * Note that in this case the total mana IRQ should always be
> > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * So, the nvec value in this path should always be equal to
> > +		 * num_online_cpu
> nit: typo: should be num_online_cpus

noted

> > +		 */
> > +		WARN_ON(nvec > num_online_cpus());
> > +		err = irq_setup_linear(irqs, nvec);
> > +	}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> >  	if (err) {
> >  		cpus_read_unlock();
> >  		goto free_irq;
> > 
> > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > -- 
> > 2.34.1
> > 
> Regards
> Dipayaan Roy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-25  6:15   ` Shradha Gupta
@ 2026-04-25  9:43     ` Dipayaan Roy
  2026-04-26  5:07       ` Shradha Gupta
  0 siblings, 1 reply; 7+ messages in thread
From: Dipayaan Roy @ 2026-04-25  9:43 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Fri, Apr 24, 2026 at 11:15:42PM -0700, Shradha Gupta wrote:
> On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> > On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > > In mana driver, the number of IRQs allocated are capped by the
> > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > > their NUMA/core bindings.
> > > 
> > > This is important, especially in the envs where number of vcpus are so
> > > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > > much more than their overheads if they were spread across sibling vcpus
> > > 
> > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > IRQs are assigned at a later stage compared to static allocation, other
> > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > same vCPU.
> > > 
> > > In such cases when many parallel TCP connections are tested, the
> > > throughput drops significantly
> > > 
> > > Test envs:
> > > =======================================================
> > > Case 1: without this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > > 	TYPE		effective vCPU aff
> > > =======================================================
> > > IRQ0:	HWC		0
> > > IRQ1:	mana_q1		0
> > > IRQ2:	mana_q2		2
> > > IRQ3:	mana_q3		0
> > > IRQ4:	mana_q4		3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU		0	1	2	3
> > > =======================================================
> > > pass 1:		38.85	0.03	24.89	24.65
> > > pass 2:		39.15	0.03	24.57	25.28
> > > pass 3:		40.36	0.03	23.20	23.17
> > > 
> > > =======================================================
> > > Case 2: with this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > >         TYPE            effective vCPU aff
> > > =======================================================
> > > IRQ0:   HWC             0
> > > IRQ1:   mana_q1         0
> > > IRQ2:   mana_q2         1
> > > IRQ3:   mana_q3         2
> > > IRQ4:   mana_q4         3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU            0       1       2       3
> > > =======================================================
> > > pass 1:         15.42	15.85	14.99	14.51
> > > pass 2:         15.53	15.94	15.81	15.93
> > > pass 3:         16.41	16.35	16.40	16.36
> > > 
> > > =======================================================
> > > Throughput Impact(in Gbps, same env)
> > > =======================================================
> > > TCP conn	with patch	w/o patch
> > > 20480		15.65		7.73
> > > 10240		15.63		8.93
> > > 8192		15.64		9.69
> > > 6144		15.64		13.16
> > > 4096		15.69		15.75
> > > 2048		15.69		15.83
> > > 1024		15.71		15.28
> > > 
> > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> > >  1 file changed, 33 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 098fbda0d128..433c044d53c6 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > >  	return 0;
> > >  }
> > >  
> > > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > +{
> > > +	int cpu;
> > > +
> > > +	rcu_read_lock();
> > We do not need to call rcu_read_lock here, as the caller of this
> > function has already acquired cpus_read_lock.
> 
> Thanks for your comments Dipayaan, I think this is still needed for the
> irq_set_affinity_and_hint(), to protect the pointer returned by
> irq_to_desc(). You can also see the same in the original function
> irq_setup() for the same reason.
Hi Shradha,

The original irq_setup() function uses rcu_read_lock() because it relies
on for_each_numa_hop_mask(), which explicitly mandates that RCU be held.
You have not used it in irq_setup_linear(), hence the requirement does not apply
here.
https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/topology.h#L314
/**
 * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA
 * distance
 *                          from a given node.
 * @mask: the iteration variable.
 * @node: the NUMA node to start the search from.
 *
 * Requires rcu_lock to be held.
 *
.....
Regarding irq_set_affinity_and_hint it also garbs rcu locks internally:
irq_set_affinity_and_hint ->__irq_apply_affinity_hint() -> irq_to_desc() -> mtree_load().
Also see how irq_set_affinity_and_hint is called in mana_gd_setup_irqs.
IMO we should drop the nesting when not needed, even though it appears
harmless.

Thanks and Regards
Dipayaan Roy

> 
> > > +	for_each_online_cpu(cpu) {
> > > +		if (len <= 0)
> > len is unsigned here so <= doesnot makes sense. PLease change it to int
> > or better use if(!len)
> 
> sure, I think I will change it to explicitly exit when len == 0
> Thanks.
> 
> > > +			break;
> > > +
> > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > +		len--;
> > > +	}
> > > +	rcu_read_unlock();
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  {
> > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > >  	 */
> > >  	cpus_read_lock();
> > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > >  		skip_first_cpu = true;
> > > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > +	} else {
> > > +		/*
> > > +		 * In case our IRQs are more than num_online_cpus, we try to
> > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > +		 * CPU core affinity does not matter.
> > > +		 * Note that in this case the total mana IRQ should always be
> > > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > > +		 * in HWC setup calls
> > > +		 * So, the nvec value in this path should always be equal to
> > > +		 * num_online_cpu
> > nit: typo: should be num_online_cpus
> 
> noted
> 
> > > +		 */
> > > +		WARN_ON(nvec > num_online_cpus());
> > > +		err = irq_setup_linear(irqs, nvec);
> > > +	}
> > >  
> > > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > >  	if (err) {
> > >  		cpus_read_unlock();
> > >  		goto free_irq;
> > > 
> > > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > > -- 
> > > 2.34.1
> > > 
> > Regards
> > Dipayaan Roy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-25  9:43     ` Dipayaan Roy
@ 2026-04-26  5:07       ` Shradha Gupta
  0 siblings, 0 replies; 7+ messages in thread
From: Shradha Gupta @ 2026-04-26  5:07 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Sat, Apr 25, 2026 at 02:43:17AM -0700, Dipayaan Roy wrote:
> On Fri, Apr 24, 2026 at 11:15:42PM -0700, Shradha Gupta wrote:
> > On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> > > On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > > > In mana driver, the number of IRQs allocated are capped by the
> > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > > > their NUMA/core bindings.
> > > > 
> > > > This is important, especially in the envs where number of vcpus are so
> > > > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > > > much more than their overheads if they were spread across sibling vcpus
> > > > 
> > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > same vCPU.
> > > > 
> > > > In such cases when many parallel TCP connections are tested, the
> > > > throughput drops significantly
> > > > 
> > > > Test envs:
> > > > =======================================================
> > > > Case 1: without this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > > 	TYPE		effective vCPU aff
> > > > =======================================================
> > > > IRQ0:	HWC		0
> > > > IRQ1:	mana_q1		0
> > > > IRQ2:	mana_q2		2
> > > > IRQ3:	mana_q3		0
> > > > IRQ4:	mana_q4		3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU		0	1	2	3
> > > > =======================================================
> > > > pass 1:		38.85	0.03	24.89	24.65
> > > > pass 2:		39.15	0.03	24.57	25.28
> > > > pass 3:		40.36	0.03	23.20	23.17
> > > > 
> > > > =======================================================
> > > > Case 2: with this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > >         TYPE            effective vCPU aff
> > > > =======================================================
> > > > IRQ0:   HWC             0
> > > > IRQ1:   mana_q1         0
> > > > IRQ2:   mana_q2         1
> > > > IRQ3:   mana_q3         2
> > > > IRQ4:   mana_q4         3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU            0       1       2       3
> > > > =======================================================
> > > > pass 1:         15.42	15.85	14.99	14.51
> > > > pass 2:         15.53	15.94	15.81	15.93
> > > > pass 3:         16.41	16.35	16.40	16.36
> > > > 
> > > > =======================================================
> > > > Throughput Impact(in Gbps, same env)
> > > > =======================================================
> > > > TCP conn	with patch	w/o patch
> > > > 20480		15.65		7.73
> > > > 10240		15.63		8.93
> > > > 8192		15.64		9.69
> > > > 6144		15.64		13.16
> > > > 4096		15.69		15.75
> > > > 2048		15.69		15.83
> > > > 1024		15.71		15.28
> > > > 
> > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > ---
> > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> > > >  1 file changed, 33 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 098fbda0d128..433c044d53c6 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	rcu_read_lock();
> > > We do not need to call rcu_read_lock here, as the caller of this
> > > function has already acquired cpus_read_lock.
> > 
> > Thanks for your comments Dipayaan, I think this is still needed for the
> > irq_set_affinity_and_hint(), to protect the pointer returned by
> > irq_to_desc(). You can also see the same in the original function
> > irq_setup() for the same reason.
> Hi Shradha,
> 
> The original irq_setup() function uses rcu_read_lock() because it relies
> on for_each_numa_hop_mask(), which explicitly mandates that RCU be held.
> You have not used it in irq_setup_linear(), hence the requirement does not apply
> here.
> https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/topology.h#L314
> /**
>  * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA
>  * distance
>  *                          from a given node.
>  * @mask: the iteration variable.
>  * @node: the NUMA node to start the search from.
>  *
>  * Requires rcu_lock to be held.
>  *
> .....
> Regarding irq_set_affinity_and_hint it also garbs rcu locks internally:
> irq_set_affinity_and_hint ->__irq_apply_affinity_hint() -> irq_to_desc() -> mtree_load().
> Also see how irq_set_affinity_and_hint is called in mana_gd_setup_irqs.
> IMO we should drop the nesting when not needed, even though it appears
> harmless.
> 
> Thanks and Regards
> Dipayaan Roy
>

There was a small window after mtree_load() returns irq_desc* and
raw_spin_lock() being taken, where I thought the rcu_read_lock() was
needed. But on futher investigation, I agree it is not needed there as
well. Let me drop this in the v2. Thanks

> > 
> > > > +	for_each_online_cpu(cpu) {
> > > > +		if (len <= 0)
> > > len is unsigned here so <= doesnot makes sense. PLease change it to int
> > > or better use if(!len)
> > 
> > sure, I think I will change it to explicitly exit when len == 0
> > Thanks.
> > 
> > > > +			break;
> > > > +
> > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > +		len--;
> > > > +	}
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  {
> > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > >  	 */
> > > >  	cpus_read_lock();
> > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > >  		skip_first_cpu = true;
> > > > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > > +	} else {
> > > > +		/*
> > > > +		 * In case our IRQs are more than num_online_cpus, we try to
> > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > +		 * CPU core affinity does not matter.
> > > > +		 * Note that in this case the total mana IRQ should always be
> > > > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > > > +		 * in HWC setup calls
> > > > +		 * So, the nvec value in this path should always be equal to
> > > > +		 * num_online_cpu
> > > nit: typo: should be num_online_cpus
> > 
> > noted
> > 
> > > > +		 */
> > > > +		WARN_ON(nvec > num_online_cpus());
> > > > +		err = irq_setup_linear(irqs, nvec);
> > > > +	}
> > > >  
> > > > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > >  	if (err) {
> > > >  		cpus_read_unlock();
> > > >  		goto free_irq;
> > > > 
> > > > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > > > -- 
> > > > 2.34.1
> > > > 
> > > Regards
> > > Dipayaan Roy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-24  6:17 [PATCH net] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
  2026-04-24 12:21 ` Dipayaan Roy
@ 2026-04-24 21:25 ` Yury Norov
  2026-04-25  6:42   ` Shradha Gupta
  1 sibling, 1 reply; 7+ messages in thread
From: Yury Norov @ 2026-04-24 21:25 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated are capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vcpus are so
> few that the softIRQ handling overhead on two IRQs on the same vcpu is
> much more than their overheads if they were spread across sibling vcpus
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);

Then you don't need the 'skip_first_cpu' variable.

> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
> +		 */
> +		WARN_ON(nvec > num_online_cpus());

That sounds weird. If you don't support IRQs more than CPUs , and want to
warn about it, you'd do that earlier in the function, and align the other
logic accordingly. For example:

        if (WARN_ON(nvec > num_online_cpus()))
                nvec = num_online_cpus();

        irqs = kmalloc_objs(int, nvec);
        if (!irqs)
                return -ENOMEM;

        ...

So you'll decrease pressure on allocator.

What would happen with those IRQs beyond num_online_cpus()? Can you explain
it in the comment? I'm not an expert in your driver, but usually if you pass
a vector to function, and the function is able to handle only a part of it,
it returns the number of processed elements.

Thanks,
Yury

> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
  2026-04-24 21:25 ` Yury Norov
@ 2026-04-25  6:42   ` Shradha Gupta
  0 siblings, 0 replies; 7+ messages in thread
From: Shradha Gupta @ 2026-04-25  6:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable

On Fri, Apr 24, 2026 at 05:25:45PM -0400, Yury Norov wrote:
> On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated are capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vcpus are so
> > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > much more than their overheads if they were spread across sibling vcpus
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> >  1 file changed, 33 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..433c044d53c6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	rcu_read_lock();
> > +	for_each_online_cpu(cpu) {
> > +		if (len <= 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return 0;
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> >  		skip_first_cpu = true;
> > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> 
> Then you don't need the 'skip_first_cpu' variable.

That's right, let me change that.

> 
> > +	} else {
> > +		/*
> > +		 * In case our IRQs are more than num_online_cpus, we try to
> > +		 * make sure we are using all vcpus. In such a case NUMA or
> > +		 * CPU core affinity does not matter.
> > +		 * Note that in this case the total mana IRQ should always be
> > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * So, the nvec value in this path should always be equal to
> > +		 * num_online_cpu
> > +		 */
> > +		WARN_ON(nvec > num_online_cpus());
> 
> That sounds weird. If you don't support IRQs more than CPUs , and want to
> warn about it, you'd do that earlier in the function, and align the other
> logic accordingly. For example:
> 
>         if (WARN_ON(nvec > num_online_cpus()))
>                 nvec = num_online_cpus();
> 
>         irqs = kmalloc_objs(int, nvec);
>         if (!irqs)
>                 return -ENOMEM;
> 
>         ...
> 
> So you'll decrease pressure on allocator.
> 
> What would happen with those IRQs beyond num_online_cpus()? Can you explain
> it in the comment? I'm not an expert in your driver, but usually if you pass
> a vector to function, and the function is able to handle only a part of it,
> it returns the number of processed elements.
> 
> Thanks,
> Yury
> 

so, by design the nvec should never exceed num_online_cpus(). I only
added the WARN_ON as a defensive safegaurd. But I agree with your
suggestion to move this earlier before the allocations.

Thanks Yury!

> > +		err = irq_setup_linear(irqs, nvec);
> > +	}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> >  	if (err) {
> >  		cpus_read_unlock();
> >  		goto free_irq;
> > 
> > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > -- 
> > 2.34.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-04-26  5:07 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24  6:17 [PATCH net] net: mana: Optimize irq affinity for low vcpu configs Shradha Gupta
2026-04-24 12:21 ` Dipayaan Roy
2026-04-25  6:15   ` Shradha Gupta
2026-04-25  9:43     ` Dipayaan Roy
2026-04-26  5:07       ` Shradha Gupta
2026-04-24 21:25 ` Yury Norov
2026-04-25  6:42   ` Shradha Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox