Linux-HyperV List
* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-04-24 21:25 UTC
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260424061702.1442618-1-shradhagupta@linux.microsoft.com>

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated are capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vcpus are so
> few that the softIRQ handling overhead on two IRQs on the same vcpu is
> much more than their overheads if they were spread across sibling vcpus
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);

Then you don't need the 'skip_first_cpu' variable.

> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
> +		 */
> +		WARN_ON(nvec > num_online_cpus());

That sounds weird. If you don't support more IRQs than CPUs and want to
warn about it, you'd do that earlier in the function and align the other
logic accordingly. For example:

        if (WARN_ON(nvec > num_online_cpus()))
                nvec = num_online_cpus();

        irqs = kmalloc_objs(int, nvec);
        if (!irqs)
                return -ENOMEM;

        ...

That way you'll also reduce pressure on the allocator.

What would happen with those IRQs beyond num_online_cpus()? Can you explain
it in the comment? I'm not an expert in your driver, but usually if you pass
a vector to a function, and the function is able to handle only part of it,
it returns the number of processed elements.
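
As a rough userspace illustration of the clamp-before-allocate ordering
suggested above (this is not driver code): model_online_cpus() and
alloc_irq_array() are made-up names, the CPU count is fixed at 4 for the
example, and calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for num_online_cpus(); fixed at 4 for the example. */
static unsigned int model_online_cpus(void)
{
	return 4;
}

/*
 * Clamp *nvec to the CPU count first, then allocate exactly that many
 * slots. calloc() stands in for the kernel allocator here.
 */
static unsigned int *alloc_irq_array(unsigned int *nvec)
{
	unsigned int ncpus = model_online_cpus();

	assert(nvec != NULL);

	if (*nvec > ncpus) {
		/* Userspace substitute for WARN_ON(): report, then clamp. */
		fprintf(stderr, "nvec %u > %u online CPUs, clamping\n",
			*nvec, ncpus);
		*nvec = ncpus;
	}

	return calloc(*nvec, sizeof(unsigned int));
}
```

Because the clamp happens before the allocation, the array is never
larger than what the later affinity loop can consume.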

Thanks,
Yury

> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1


* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-25  6:15 UTC
  To: Dipayaan Roy
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <aetgQ1gCYlGJjiKk@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated are capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vcpus are so
> > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > much more than their overheads if they were spread across sibling vcpus
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> >  1 file changed, 33 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..433c044d53c6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	rcu_read_lock();
> We do not need to call rcu_read_lock here, as the caller of this
> function has already acquired cpus_read_lock.

Thanks for your comments, Dipayaan. I think this is still needed for
irq_set_affinity_and_hint(), to protect the pointer returned by
irq_to_desc(). The original function irq_setup() does the same, for the
same reason.

> > +	for_each_online_cpu(cpu) {
> > +		if (len <= 0)
> len is unsigned here, so <= does not make sense. Please change it to int
> or, better, use if (!len)

Sure, I will change it to exit explicitly when len == 0.
Thanks.
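
For illustration only, a small userspace model of the loop with the
check written as !len. Here online_cpus[] stands in for
for_each_online_cpu() and the store into cpu_of_irq[] stands in for
irq_set_affinity_and_hint(); assign_linear() and both array names are
hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model: hand out one CPU per IRQ, in order, until either the
 * CPUs or the IRQs run out. Returns how many IRQs received a CPU.
 */
static size_t assign_linear(const int *online_cpus, size_t ncpus,
			    int *cpu_of_irq, size_t len)
{
	size_t done = 0;

	assert(online_cpus != NULL && cpu_of_irq != NULL);

	for (size_t i = 0; i < ncpus; i++) {
		/* len is unsigned, so "nothing left" is simply len == 0. */
		if (!len)
			break;

		cpu_of_irq[done++] = online_cpus[i];
		len--;
	}

	return done;
}
```

With 4 CPUs and len = 4 this reproduces the queue IRQ layout from Case 2
(CPUs 0, 1, 2, 3). If len is larger than the CPU count, the extra
entries are left untouched and the return value shows how many were set.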

> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return 0;
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> >  		skip_first_cpu = true;
> > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > +	} else {
> > +		/*
> > +		 * In case our IRQs are more than num_online_cpus, we try to
> > +		 * make sure we are using all vcpus. In such a case NUMA or
> > +		 * CPU core affinity does not matter.
> > +		 * Note that in this case the total mana IRQ should always be
> > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * So, the nvec value in this path should always be equal to
> > +		 * num_online_cpu
> nit: typo: should be num_online_cpus

noted

> > +		 */
> > +		WARN_ON(nvec > num_online_cpus());
> > +		err = irq_setup_linear(irqs, nvec);
> > +	}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> >  	if (err) {
> >  		cpus_read_unlock();
> >  		goto free_irq;
> > 
> > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > -- 
> > 2.34.1
> > 
> Regards
> Dipayaan Roy


* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-25  6:42 UTC
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <aevf2bPLBiAzX7UC@yury>

On Fri, Apr 24, 2026 at 05:25:45PM -0400, Yury Norov wrote:
> On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated are capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vcpus are so
> > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > much more than their overheads if they were spread across sibling vcpus
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> >  1 file changed, 33 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..433c044d53c6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	rcu_read_lock();
> > +	for_each_online_cpu(cpu) {
> > +		if (len <= 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return 0;
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> >  		skip_first_cpu = true;
> > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> 
> Then you don't need the 'skip_first_cpu' variable.

That's right, let me change that.

> 
> > +	} else {
> > +		/*
> > +		 * In case our IRQs are more than num_online_cpus, we try to
> > +		 * make sure we are using all vcpus. In such a case NUMA or
> > +		 * CPU core affinity does not matter.
> > +		 * Note that in this case the total mana IRQ should always be
> > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * So, the nvec value in this path should always be equal to
> > +		 * num_online_cpu
> > +		 */
> > +		WARN_ON(nvec > num_online_cpus());
> 
> That sounds weird. If you don't support IRQs more than CPUs , and want to
> warn about it, you'd do that earlier in the function, and align the other
> logic accordingly. For example:
> 
>         if (WARN_ON(nvec > num_online_cpus()))
>                 nvec = num_online_cpus();
> 
>         irqs = kmalloc_objs(int, nvec);
>         if (!irqs)
>                 return -ENOMEM;
> 
>         ...
> 
> So you'll decrease pressure on allocator.
> 
> What would happen with those IRQs beyond num_online_cpus()? Can you explain
> it in the comment? I'm not an expert in your driver, but usually if you pass
> a vector to function, and the function is able to handle only a part of it,
> it returns the number of processed elements.
> 
> Thanks,
> Yury
> 

So, by design, nvec should never exceed num_online_cpus(). I only
added the WARN_ON as a defensive safeguard. But I agree with your
suggestion to move this earlier, before the allocations.

Thanks Yury!

> > +		err = irq_setup_linear(irqs, nvec);
> > +	}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> >  	if (err) {
> >  		cpus_read_unlock();
> >  		goto free_irq;
> > 
> > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > -- 
> > 2.34.1


* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-25  8:05 UTC
  To: David Wei, kuba
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <685d7bf9-062d-4bd2-8448-f7714bb05302@davidwei.uk>

On Fri, Apr 24, 2026 at 01:05:24PM -0700, David Wei wrote:
> On 2026-04-23 05:48, Dipayaan Roy wrote:
> > On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
> > > On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> > > > I still see roughly a 5% overhead from the atomic refcount operation
> > > > itself, but on that platform there is no throughput drop when using
> > > > page fragments versus full-page mode.
> > > 
> > > That seems to contradict your claim that it's a problem with a specific
> > > platform.. Since we're in the merge window I asked David Wei to try to
> > > experiment with disabling page fragmentation on the ARM64 platforms we
> > > have at Meta. If it repros we should use the generic rx-buf-len
> > > ringparam because more NICs may want to implement this strategy.
> > 
> > Hi Jakub,
> > 
> > Thanks. I think I was not precise enough in my previous reply.
> > 
> > What I meant is that the atomic refcount cost itself does not appear to
> > be unique to the affected platform. I see a similar ~5% overhead on
> > > another ARM64 platform (different vendor) as well. However, on that platform
> > there is no throughput delta between fragment mode and full-page mode; both reach
> > line rate.
> > 
> > On the affected platform, fragment mode shows an additional ~15%
> > throughput drop versus full-page mode. So the current data suggests that
> > the atomic overhead is common, but the throughput regression is not
> > explained by that overhead alone and likely depends on an additional
> > platform-specific factor.
> > 
> > Separately, the hardware team collected PCIe traces on the affected
> > platform and reported stalls in the fragment-mode case that are not seen
> > in full-page mode. They are still investigating the root cause, but
> > their current hypothesis is that this is related to that platform’s
> > PCIe/root-port microarchitecture rather than to page_pool refcounting
> > alone.
> > 
> > That said, I agree the right direction depends on whether this
> > reproduces on other ARM64 platforms. If David is able to reproduce the
> > same behavior, then using the generic rx-buf-len ringparam sounds like
> > the better direction.
> > 
> > Please let me know what David finds, and I can rework the patch
> > accordingly.
> 
> I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.
> 
> Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
> give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.
> 
> Use 1 combined queue only for the server. Affinitized its net rx softirq
> to run on core 4.
> 
> Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
> running on a host w/ same hw in the same region. Using 32 queues, no
> softirq affinities. The idea is to hammer page->pp_ref_count from
> different cores.
> 
> * 1 frag/page  -> 32.3 Gbps
> * 2 frags/page -> 36.0 Gbps
> 
> Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
> pp_ref_count goes up, as expected. Is this what you see? When you say
> there's a +5% overhead, what function?
> 
> Overall tput is higher with multiple frags. That's to be expected w/
> page pool.

Hi David,

Thanks for running this. Your results are consistent with mine.

I tested this on two ARM64 platforms from different vendors,
running ntttcp and iperf3 with a 4K base page size.
Both platforms show roughly 5% overhead in fragment mode, in
napi_pp_put_page (~3.9%) and page_pool_alloc_frag_netmem (~1.9%),
with both functions stalling on the LSE ldaddal atomic that maintains
pp_ref_count.
This appears to match your observation. However, one of the
platforms shows a 15% drop in throughput in fragment mode versus
full-page mode. The other platform I tested in fact performs
slightly better in fragment mode than in full-page mode (similar
to your result).

So the atomic refcount overhead appears to be common across ARM64
platforms, but on its own it does not cause a throughput regression.
The throughput regression seems specific to one platform, which is the
one we want the full-page workaround for. The HW team has also
identified PCIe stalls in fragment mode that are absent in full-page mode.
Their investigation points to a suspected microarchitectural
issue in the PCIe root port. IMO, there seems to be no issue with
page_pool itself.

Given that:
 - Grace shows fragments are faster (your data)
 - A second ARM64 platform shows no regression (my data)
 - Only the affected platform shows a throughput drop
 - The HW team suspects a platform-specific PCIe issue, and our
   experimental data also shows the throughput drop only on that
   platform.

I believe this remains a platform-specific workaround rather than
a generic issue. Would a private flag still be acceptable for this
case?


> 
> There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
> driver hack. Are you going to re-implement this change with rx-buf-len
> instead of a private flag? If so, I won't spend more time running this
> test.
> 
I can go either way depending on what Jakub prefers.

Hi Jakub,
With this new data from David, is the case convincing enough for a
mana-driver-specific private flag, which user space can set from a udev
rule that detects the underlying platform? If not, I will send the next
version using the rx-buf-len approach.
> > 
> > 
> > Regards
> > Dipayaan Roy


Thanks and Regards
Dipayaan Roy


* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Dipayaan Roy @ 2026-04-25  9:43 UTC
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <aexcDgNJw4Nr/uMU@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Fri, Apr 24, 2026 at 11:15:42PM -0700, Shradha Gupta wrote:
> On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> > On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > > In mana driver, the number of IRQs allocated are capped by the
> > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > > their NUMA/core bindings.
> > > 
> > > This is important, especially in the envs where number of vcpus are so
> > > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > > much more than their overheads if they were spread across sibling vcpus
> > > 
> > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > IRQs are assigned at a later stage compared to static allocation, other
> > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > same vCPU.
> > > 
> > > In such cases when many parallel TCP connections are tested, the
> > > throughput drops significantly
> > > 
> > > Test envs:
> > > =======================================================
> > > Case 1: without this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > > 	TYPE		effective vCPU aff
> > > =======================================================
> > > IRQ0:	HWC		0
> > > IRQ1:	mana_q1		0
> > > IRQ2:	mana_q2		2
> > > IRQ3:	mana_q3		0
> > > IRQ4:	mana_q4		3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU		0	1	2	3
> > > =======================================================
> > > pass 1:		38.85	0.03	24.89	24.65
> > > pass 2:		39.15	0.03	24.57	25.28
> > > pass 3:		40.36	0.03	23.20	23.17
> > > 
> > > =======================================================
> > > Case 2: with this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > >         TYPE            effective vCPU aff
> > > =======================================================
> > > IRQ0:   HWC             0
> > > IRQ1:   mana_q1         0
> > > IRQ2:   mana_q2         1
> > > IRQ3:   mana_q3         2
> > > IRQ4:   mana_q4         3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU            0       1       2       3
> > > =======================================================
> > > pass 1:         15.42	15.85	14.99	14.51
> > > pass 2:         15.53	15.94	15.81	15.93
> > > pass 3:         16.41	16.35	16.40	16.36
> > > 
> > > =======================================================
> > > Throughput Impact(in Gbps, same env)
> > > =======================================================
> > > TCP conn	with patch	w/o patch
> > > 20480		15.65		7.73
> > > 10240		15.63		8.93
> > > 8192		15.64		9.69
> > > 6144		15.64		13.16
> > > 4096		15.69		15.75
> > > 2048		15.69		15.83
> > > 1024		15.71		15.28
> > > 
> > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> > >  1 file changed, 33 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 098fbda0d128..433c044d53c6 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > >  	return 0;
> > >  }
> > >  
> > > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > +{
> > > +	int cpu;
> > > +
> > > +	rcu_read_lock();
> > We do not need to call rcu_read_lock here, as the caller of this
> > function has already acquired cpus_read_lock.
> 
> Thanks for your comments Dipayaan, I think this is still needed for the
> irq_set_affinity_and_hint(), to protect the pointer returned by
> irq_to_desc(). You can also see the same in the original function
> irq_setup() for the same reason.
Hi Shradha,

The original irq_setup() function uses rcu_read_lock() because it relies
on for_each_numa_hop_mask(), whose documentation explicitly requires RCU
to be held. irq_setup_linear() does not use that iterator, so the
requirement does not apply here.
https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/topology.h#L314
/**
 * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
 *                          from a given node.
 * @mask: the iteration variable.
 * @node: the NUMA node to start the search from.
 *
 * Requires rcu_lock to be held.
 *
.....
Regarding irq_set_affinity_and_hint(), it also takes the RCU read lock internally:
irq_set_affinity_and_hint() -> __irq_apply_affinity_hint() -> irq_to_desc() -> mtree_load().
Also see how irq_set_affinity_and_hint() is called in mana_gd_setup_irqs().
IMO we should drop the nesting when it is not needed, even though it appears
harmless.

Thanks and Regards
Dipayaan Roy

> 
> > > +	for_each_online_cpu(cpu) {
> > > +		if (len <= 0)
> > len is unsigned here, so <= does not make sense. Please change it to int
> > or, better, use if (!len)
> 
> sure, I think I will change it to explicitly exit when len == 0
> Thanks.
> 
> > > +			break;
> > > +
> > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > +		len--;
> > > +	}
> > > +	rcu_read_unlock();
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  {
> > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > >  	 */
> > >  	cpus_read_lock();
> > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > >  		skip_first_cpu = true;
> > > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > +	} else {
> > > +		/*
> > > +		 * In case our IRQs are more than num_online_cpus, we try to
> > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > +		 * CPU core affinity does not matter.
> > > +		 * Note that in this case the total mana IRQ should always be
> > > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > > +		 * in HWC setup calls
> > > +		 * So, the nvec value in this path should always be equal to
> > > +		 * num_online_cpu
> > nit: typo: should be num_online_cpus
> 
> noted
> 
> > > +		 */
> > > +		WARN_ON(nvec > num_online_cpus());
> > > +		err = irq_setup_linear(irqs, nvec);
> > > +	}
> > >  
> > > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > >  	if (err) {
> > >  		cpus_read_unlock();
> > >  		goto free_irq;
> > > 
> > > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > > -- 
> > > 2.34.1
> > > 
> > Regards
> > Dipayaan Roy

^ permalink raw reply

* [PATCH v2] tools/hv: fix parse_ip_val_buffer out-of-bounds write
From: Ali Ahmet MEMIS @ 2026-04-25 11:35 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel, Ali Ahmet MEMIS


parse_ip_val_buffer() validates the parsed token length against out_len, but several callers passed MAX_IP_ADDR_SIZE * 2 while the destination buffers are much smaller stack arrays (e.g. INET6_ADDRSTRLEN).

This can lead to out-of-bounds writes via strcpy() when a long token is parsed from host-provided IP/subnet strings.

Use size_t for out_len, switch to bounded copy with memcpy() + explicit NUL termination, and pass the actual destination buffer sizes at all call sites.

Signed-off-by: Ali Ahmet MEMIS <dev@unknownbbqr.xyz>
---
 tools/hv/hv_kvp_daemon.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 1f64c680b..bb31ba9e9 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -1162,10 +1162,11 @@ static int is_ipv4(char *addr)
 }
 
 static int parse_ip_val_buffer(char *in_buf, int *offset,
-				char *out_buf, int out_len)
+				char *out_buf, size_t out_len)
 {
 	char *x;
 	char *start;
+	size_t copy_len;
 
 	/*
 	 * in_buf has sequence of characters that are separated by
@@ -1188,8 +1189,10 @@ static int parse_ip_val_buffer(char *in_buf, int *offset,
 		while (start[i] == ' ')
 			i++;
 
-		if ((x - start) <= out_len) {
-			strcpy(out_buf, (start + i));
+		copy_len = x - (start + i);
+		if (copy_len < out_len) {
+			memcpy(out_buf, start + i, copy_len);
+			out_buf[copy_len] = '\0';
 			*offset += (x - start) + 1;
 			return 1;
 		}
@@ -1223,7 +1226,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type)
 	memset(addr, 0, sizeof(addr));
 
 	while (parse_ip_val_buffer(ip_string, &offset, addr,
-					(MAX_IP_ADDR_SIZE * 2))) {
+					sizeof(addr))) {
 
 		sub_str[0] = 0;
 		if (is_ipv4(addr)) {
@@ -1348,7 +1351,7 @@ static int process_dns_gateway_nm(FILE *f, char *ip_string, int type,
 		memset(addr, 0, sizeof(addr));
 
 		if (!parse_ip_val_buffer(ip_string, &ip_offset, addr,
-					 (MAX_IP_ADDR_SIZE * 2)))
+					 sizeof(addr)))
 			break;
 
 		ip_ver = ip_version_check(addr);
@@ -1400,12 +1403,11 @@ static int process_ip_string_nm(FILE *f, char *ip_string, char *subnet,
 	memset(subnet_addr, 0, sizeof(subnet_addr));
 
 	while (parse_ip_val_buffer(ip_string, &ip_offset, addr,
-				   (MAX_IP_ADDR_SIZE * 2)) &&
+				   sizeof(addr)) &&
 				   parse_ip_val_buffer(subnet,
-						       &subnet_offset,
-						       subnet_addr,
-						       (MAX_IP_ADDR_SIZE *
-							2))) {
+					       &subnet_offset,
+					       subnet_addr,
+					       sizeof(subnet_addr))) {
 		ip_ver = ip_version_check(addr);
 		if (ip_ver < 0)
 			continue;

base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
-- 
2.53.0
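
For readers following the fix, here is a minimal userspace sketch of the bounded-copy pattern the changelog describes (length check against the real destination size, memcpy(), explicit NUL termination). It is not the daemon's code: copy_token_bounded() is a hypothetical helper, and the ';' delimiter is an assumption about the input format.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the function from hv_kvp_daemon.c.
 * Copies the token starting at in + *offset, ending at the next ';'
 * (assumed delimiter) or at the end of the string, into out, which
 * holds out_len bytes. Returns 1 on success, 0 if no token is left
 * or the token does not fit together with its terminating NUL.
 */
static int copy_token_bounded(const char *in, size_t *offset,
			      char *out, size_t out_len)
{
	const char *tok = in + *offset;
	const char *end, *p;
	size_t copy_len;

	if (out_len == 0 || *tok == '\0')
		return 0;

	end = strchr(tok, ';');
	if (!end)
		end = tok + strlen(tok);

	/* Skip leading spaces, but never walk past the delimiter. */
	p = tok;
	while (p < end && *p == ' ')
		p++;

	/* Strictly less than out_len, so the NUL always has room. */
	copy_len = (size_t)(end - p);
	if (copy_len >= out_len)
		return 0;

	memcpy(out, p, copy_len);
	out[copy_len] = '\0';

	/* Step past the token and, if present, the delimiter. */
	*offset += (size_t)(end - tok) + (*end == ';' ? 1 : 0);
	return 1;
}
```

The design point is that the caller passes sizeof() of the actual destination array, so the check and the buffer can never disagree; a token that is too long is rejected and the offset is left unchanged.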




^ permalink raw reply related

* [PATCH 1/2] hv_sock: fix ARM64 support
From: Hamza Mahfooz @ 2026-04-25 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Himadri Pandya, Michael Kelley,
	linux-hyperv, virtualization, netdev, Saurabh Sengar,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Deepak Rawat, dri-devel, Hamza Mahfooz, stable

VMBUS ring buffers must be page aligned. Therefore, the current value of
24K presents a challenge on ARM64 kernels (with 64K pages). So, use
VMBUS_RING_SIZE() to ensure they are always aligned and large enough to
hold all of the relevant data.

Cc: stable@kernel.vger.org
Fixes: 77ffe33363c0 ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication")
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
 net/vmw_vsock/hyperv_transport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 069386a74557..40f09b23efa3 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -375,10 +375,10 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 	} else {
 		sndbuf = max_t(int, sk->sk_sndbuf, RINGBUFFER_HVS_SND_SIZE);
 		sndbuf = min_t(int, sndbuf, RINGBUFFER_HVS_MAX_SIZE);
-		sndbuf = ALIGN(sndbuf, HV_HYP_PAGE_SIZE);
+		sndbuf = VMBUS_RING_SIZE(sndbuf);
 		rcvbuf = max_t(int, sk->sk_rcvbuf, RINGBUFFER_HVS_RCV_SIZE);
 		rcvbuf = min_t(int, rcvbuf, RINGBUFFER_HVS_MAX_SIZE);
-		rcvbuf = ALIGN(rcvbuf, HV_HYP_PAGE_SIZE);
+		rcvbuf = VMBUS_RING_SIZE(rcvbuf);
 	}
 
 	chan->max_pkt_size = HVS_MAX_PKT_SIZE;
-- 
2.54.0
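
To make the alignment problem concrete, below is a small userspace sketch of the arithmetic. It only models the idea of "header plus payload, rounded up to the guest page size"; ring_size_sketch(), align_up() and the header size passed in are assumptions for illustration, and the real VMBUS_RING_SIZE() definition (not shown in this thread) may differ.

```c
#include <stddef.h>

/* Round x up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical model of a page-aligned ring size: ring header plus
 * payload, rounded up to the guest page size. The names and the header
 * size are assumptions, not the kernel macro's definition.
 */
static size_t ring_size_sketch(size_t payload, size_t header, size_t page_size)
{
	return align_up(header + payload, page_size);
}
```

With a 4K alignment, 24K stays 24K, which is not a multiple of a 64K page; rounding to the guest page size instead yields 64K in this model, while a 4K-page guest would get 28K for the same inputs (assuming a 4K header).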


^ permalink raw reply related

* [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Hamza Mahfooz @ 2026-04-25 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Himadri Pandya, Michael Kelley,
	linux-hyperv, virtualization, netdev, Saurabh Sengar,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Deepak Rawat, dri-devel, Hamza Mahfooz, stable
In-Reply-To: <20260425181719.1538483-1-hamzamahfooz@linux.microsoft.com>

VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
ensure they are always aligned and large enough to hold all of the
relevant data.

Cc: stable@kernel.vger.org
Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526832..753d97bff76f 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -10,7 +10,7 @@
 
 #include "hyperv_drm.h"
 
-#define VMBUS_RING_BUFSIZE (256 * 1024)
+#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
 #define VMBUS_VSP_TIMEOUT (10 * HZ)
 
 #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
-- 
2.54.0


^ permalink raw reply related

* RE: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Saurabh Singh Sengar @ 2026-04-26  5:00 UTC (permalink / raw)
  To: Hamza Mahfooz, linux-kernel@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Himadri Pandya, Michael Kelley,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Deepak Rawat, dri-devel@lists.freedesktop.org,
	stable@kernel.vger.org
In-Reply-To: <20260425181719.1538483-2-hamzamahfooz@linux.microsoft.com>

> Subject: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
> 
> VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
> ensure they are always aligned and large enough to hold all of the relevant
> data.
> 
> Cc: stable@kernel.vger.org
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video
> device")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc526832..753d97bff76f 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -10,7 +10,7 @@
> 
>  #include "hyperv_drm.h"
> 
> -#define VMBUS_RING_BUFSIZE (256 * 1024)
> +#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
>  #define VMBUS_VSP_TIMEOUT (10 * HZ)
> 
>  #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
> --
> 2.54.0

This LGTM, but it may change the behaviour on ARM64 systems with a page size > 4K.
Has it been tested there?

Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>



^ permalink raw reply

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-26  5:07 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <aeyMtcAi6B7DfAx+@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Sat, Apr 25, 2026 at 02:43:17AM -0700, Dipayaan Roy wrote:
> On Fri, Apr 24, 2026 at 11:15:42PM -0700, Shradha Gupta wrote:
> > On Fri, Apr 24, 2026 at 05:21:23AM -0700, Dipayaan Roy wrote:
> > > On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> > > > In mana driver, the number of IRQs allocated are capped by the
> > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > than the vcpu count, we want to utilize all the vcpus, irrespective of
> > > > their NUMA/core bindings.
> > > > 
> > > > This is important, especially in the envs where number of vcpus are so
> > > > few that the softIRQ handling overhead on two IRQs on the same vcpu is
> > > > much more than their overheads if they were spread across sibling vcpus
> > > > 
> > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > same vCPU.
> > > > 
> > > > In such cases when many parallel TCP connections are tested, the
> > > > throughput drops significantly
> > > > 
> > > > Test envs:
> > > > =======================================================
> > > > Case 1: without this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > > 	TYPE		effective vCPU aff
> > > > =======================================================
> > > > IRQ0:	HWC		0
> > > > IRQ1:	mana_q1		0
> > > > IRQ2:	mana_q2		2
> > > > IRQ3:	mana_q3		0
> > > > IRQ4:	mana_q4		3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU		0	1	2	3
> > > > =======================================================
> > > > pass 1:		38.85	0.03	24.89	24.65
> > > > pass 2:		39.15	0.03	24.57	25.28
> > > > pass 3:		40.36	0.03	23.20	23.17
> > > > 
> > > > =======================================================
> > > > Case 2: with this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > >         TYPE            effective vCPU aff
> > > > =======================================================
> > > > IRQ0:   HWC             0
> > > > IRQ1:   mana_q1         0
> > > > IRQ2:   mana_q2         1
> > > > IRQ3:   mana_q3         2
> > > > IRQ4:   mana_q4         3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU            0       1       2       3
> > > > =======================================================
> > > > pass 1:         15.42	15.85	14.99	14.51
> > > > pass 2:         15.53	15.94	15.81	15.93
> > > > pass 3:         16.41	16.35	16.40	16.36
> > > > 
> > > > =======================================================
> > > > Throughput Impact(in Gbps, same env)
> > > > =======================================================
> > > > TCP conn	with patch	w/o patch
> > > > 20480		15.65		7.73
> > > > 10240		15.63		8.93
> > > > 8192		15.64		9.69
> > > > 6144		15.64		13.16
> > > > 4096		15.69		15.75
> > > > 2048		15.69		15.83
> > > > 1024		15.71		15.28
> > > > 
> > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > ---
> > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
> > > >  1 file changed, 33 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 098fbda0d128..433c044d53c6 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	rcu_read_lock();
> > > We do not need to call rcu_read_lock here, as the caller of this
> > > function has already acquired cpus_read_lock.
> > 
> > Thanks for your comments Dipayaan, I think this is still needed for the
> > irq_set_affinity_and_hint(), to protect the pointer returned by
> > irq_to_desc(). You can also see the same in the original function
> > irq_setup() for the same reason.
> Hi Shradha,
> 
> The original irq_setup() function uses rcu_read_lock() because it relies
> on for_each_numa_hop_mask(), which explicitly mandates that RCU be held.
> You have not used it in irq_setup_linear(), hence the requirement does not apply
> here.
> https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/topology.h#L314
> /**
>  * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA
>  * distance
>  *                          from a given node.
>  * @mask: the iteration variable.
>  * @node: the NUMA node to start the search from.
>  *
>  * Requires rcu_lock to be held.
>  *
> .....
> Regarding irq_set_affinity_and_hint(): it also grabs the RCU lock internally:
> irq_set_affinity_and_hint ->__irq_apply_affinity_hint() -> irq_to_desc() -> mtree_load().
> Also see how irq_set_affinity_and_hint is called in mana_gd_setup_irqs.
> IMO we should drop the nesting when not needed, even though it appears
> harmless.
> 
> Thanks and Regards
> Dipayaan Roy
>

There is a small window between mtree_load() returning the irq_desc
pointer and raw_spin_lock() being taken, where I thought rcu_read_lock()
was needed. But on further investigation, I agree it is not needed there
either. Let me drop this in v2. Thanks
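
For illustration only, here is a userspace model of the linear spreading loop with the two review points from this thread applied: no RCU section, and an explicit exit when len reaches zero. It is not the v2 patch; spread_linear() and its parameters are hypothetical, and the array of online CPU ids stands in for for_each_online_cpu().

```c
#include <stddef.h>

/*
 * Userspace model only: cpu_out[i] receives the CPU chosen for IRQ i.
 * online[] lists the online CPU ids; all names here are hypothetical.
 * Returns the number of IRQs that were given a CPU.
 */
static size_t spread_linear(const int *online, size_t ncpus,
			    int *cpu_out, size_t len)
{
	size_t assigned = 0;
	size_t i;

	for (i = 0; i < ncpus; i++) {
		/* len is unsigned, so test for zero explicitly. */
		if (len == 0)
			break;

		cpu_out[assigned++] = online[i];
		len--;
	}

	return assigned;
}
```

In this model, four IRQs on four online CPUs land on CPUs 0, 1, 2 and 3, one each. If len exceeds the number of online CPUs, the surplus IRQs are left unassigned, which is the situation the WARN_ON() in the patch is meant to flag.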

> > 
> > > > +	for_each_online_cpu(cpu) {
> > > > +		if (len <= 0)
> > > len is unsigned here, so <= does not make sense. Please change it to int
> > > or, better, use if (!len)
> > 
> > sure, I think I will change it to explicitly exit when len == 0
> > Thanks.
> > 
> > > > +			break;
> > > > +
> > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > +		len--;
> > > > +	}
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  {
> > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > >  	 */
> > > >  	cpus_read_lock();
> > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > >  		skip_first_cpu = true;
> > > > +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > > +	} else {
> > > > +		/*
> > > > +		 * In case our IRQs are more than num_online_cpus, we try to
> > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > +		 * CPU core affinity does not matter.
> > > > +		 * Note that in this case the total mana IRQ should always be
> > > > +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> > > > +		 * in HWC setup calls
> > > > +		 * So, the nvec value in this path should always be equal to
> > > > +		 * num_online_cpu
> > > nit: typo: should be num_online_cpus
> > 
> > noted
> > 
> > > > +		 */
> > > > +		WARN_ON(nvec > num_online_cpus());
> > > > +		err = irq_setup_linear(irqs, nvec);
> > > > +	}
> > > >  
> > > > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > >  	if (err) {
> > > >  		cpus_read_unlock();
> > > >  		goto free_irq;
> > > > 
> > > > base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> > > > -- 
> > > > 2.34.1
> > > > 
> > > Regards
> > > Dipayaan Roy

^ permalink raw reply

* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Jason Gunthorpe @ 2026-04-26 13:15 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <aepF3NwyANeklkfD@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Thu, Apr 23, 2026 at 09:16:28AM -0700, Dipayaan Roy wrote:
> During Function Level Reset recovery, the MANA driver reads
> hardware BAR0 registers that may temporarily contain garbage values.
> The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> to compute gc->shm_base, which is later dereferenced via readl() in
> mana_smc_poll_register(). If the hardware returns an unaligned or
> out-of-range value, the driver must not blindly use it, as this would
> propagate the hardware error into a kernel crash.

It is not what we are calling "hardening" if you are hitting actual
crashes in actual real systems.

"hardening" is the driver defending against actively malicious
hardware, operating in ways that will never be seen in real systems,
attempting to compromise the kernel.

Drivers working around real world broken/buggy/malfunctioning HW is
just entirely normal stuff.

> @@ -73,10 +74,25 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
>  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
>  
>  	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> +	if (sriov_base_off >= gc->bar0_size ||
> +	    !IS_ALIGNED(sriov_base_off, sizeof(u32))) {
> +		dev_err(gc->dev,
> +			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> +			sriov_base_off, (u64)gc->bar0_size);
> +		return -EPROTO;
> +	}

... and if it is entirely normal and something that happens, is EPROTO
really the right way to deal with this race, or should the driver be
looping somehow until the device stabilizes?

Jason
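
As a sketch of the alternative raised above (re-reading until the value looks sane instead of failing on the first bad read), here is a small userspace model. Everything in it is hypothetical: the callback type, the helper names and the fake register are illustration only, and whether a bounded retry is the right policy for MANA is exactly the open question in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register-read callback; stands in for an MMIO read. */
typedef uint64_t (*read_reg_fn)(void *ctx);

/* An offset is usable if it lies inside the BAR and is 4-byte aligned. */
static bool offset_is_sane(uint64_t off, uint64_t bar_size)
{
	return off < bar_size && (off % sizeof(uint32_t)) == 0;
}

/*
 * Re-read the register up to max_tries times until it holds a sane
 * value. Returns 0 and stores the offset on success, -1 otherwise.
 * A real driver would sleep between attempts; that is omitted here.
 */
static int read_offset_retry(read_reg_fn read_reg, void *ctx,
			     uint64_t bar_size, unsigned int max_tries,
			     uint64_t *out)
{
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		uint64_t off = read_reg(ctx);

		if (offset_is_sane(off, bar_size)) {
			*out = off;
			return 0;
		}
	}

	return -1;
}

/* Test double: returns garbage for the first bad_reads calls. */
struct fake_reg {
	unsigned int bad_reads;
	uint64_t good_value;
};

static uint64_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	if (r->bad_reads) {
		r->bad_reads--;
		return 0xffffffffffffffffULL;
	}

	return r->good_value;
}
```

With a fake register that returns garbage twice and then 0x1000, a five-try budget succeeds; if the garbage outlasts the budget, the caller still gets an error and can fail the probe as the patch does today.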

^ permalink raw reply

* RE: [PATCH v2 02/15] Drivers: hv: Move hv_vp_assist_page to common files
From: Michael Kelley @ 2026-04-27  5:37 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-3-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> Move the logic to initialize and export hv_vp_assist_page from x86
> architecture code to Hyper-V common code to allow it to be used for
> upcoming arm64 support in MSHV_VTL driver.
> Note: This change also improves error handling - if VP assist page
> allocation fails, hyperv_init() now returns early instead of
> continuing with partial initialization.
> 
> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> Reviewed-by: Roman Kisel <vdso@mailbox.org>
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c       | 88 +-----------------------------
>  arch/x86/include/asm/mshyperv.h | 14 -----
>  drivers/hv/hv_common.c          | 94 ++++++++++++++++++++++++++++++++-
>  include/asm-generic/mshyperv.h  | 16 ++++++
>  include/hyperv/hvgdk_mini.h     |  6 ++-
>  5 files changed, 115 insertions(+), 103 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 323adc93f2dc..75a98b5e451b 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -81,9 +81,6 @@ union hv_ghcb * __percpu *hv_ghcb_pg;
>  /* Storage to save the hypercall page temporarily for hibernation */
>  static void *hv_hypercall_pg_saved;
> 
> -struct hv_vp_assist_page **hv_vp_assist_page;
> -EXPORT_SYMBOL_GPL(hv_vp_assist_page);
> -
>  static int hyperv_init_ghcb(void)
>  {
>  	u64 ghcb_gpa;
> @@ -117,59 +114,12 @@ static int hyperv_init_ghcb(void)
> 
>  static int hv_cpu_init(unsigned int cpu)
>  {
> -	union hv_vp_assist_msr_contents msr = { 0 };
> -	struct hv_vp_assist_page **hvp;
>  	int ret;
> 
>  	ret = hv_common_cpu_init(cpu);
>  	if (ret)
>  		return ret;
> 
> -	if (!hv_vp_assist_page)
> -		return 0;
> -
> -	hvp = &hv_vp_assist_page[cpu];
> -	if (hv_root_partition()) {
> -		/*
> -		 * For root partition we get the hypervisor provided VP assist
> -		 * page, instead of allocating a new page.
> -		 */
> -		rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> -		*hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
> -				PAGE_SIZE, MEMREMAP_WB);
> -	} else {
> -		/*
> -		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
> -		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
> -		 * out to make sure we always write the EOI MSR in
> -		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
> -		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
> -		 * case of CPU offlining and the VM will hang.
> -		 */
> -		if (!*hvp) {
> -			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> -
> -			/*
> -			 * Hyper-V should never specify a VM that is a Confidential
> -			 * VM and also running in the root partition. Root partition
> -			 * is blocked to run in Confidential VM. So only decrypt assist
> -			 * page in non-root partition here.
> -			 */
> -			if (*hvp && !ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
> -				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
> -				memset(*hvp, 0, PAGE_SIZE);
> -			}
> -		}
> -
> -		if (*hvp)
> -			msr.pfn = vmalloc_to_pfn(*hvp);
> -
> -	}
> -	if (!WARN_ON(!(*hvp))) {
> -		msr.enable = 1;
> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> -	}
> -
>  	/* Allow Hyper-V stimer vector to be injected from Hypervisor. */
>  	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
>  		apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, true);
> @@ -286,23 +236,6 @@ static int hv_cpu_die(unsigned int cpu)
> 
>  	hv_common_cpu_die(cpu);
> 
> -	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> -		union hv_vp_assist_msr_contents msr = { 0 };
> -		if (hv_root_partition()) {
> -			/*
> -			 * For root partition the VP assist page is mapped to
> -			 * hypervisor provided page, and thus we unmap the
> -			 * page here and nullify it, so that in future we have
> -			 * correct page address mapped in hv_cpu_init.
> -			 */
> -			memunmap(hv_vp_assist_page[cpu]);
> -			hv_vp_assist_page[cpu] = NULL;
> -			rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> -			msr.enable = 0;
> -		}
> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> -	}
> -
>  	if (hv_reenlightenment_cb == NULL)
>  		return 0;
> 
> @@ -460,21 +393,6 @@ void __init hyperv_init(void)
>  	if (hv_common_init())
>  		return;
> 
> -	/*
> -	 * The VP assist page is useless to a TDX guest: the only use we
> -	 * would have for it is lazy EOI, which can not be used with TDX.
> -	 */
> -	if (hv_isolation_type_tdx())
> -		hv_vp_assist_page = NULL;
> -	else
> -		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
> -	if (!hv_vp_assist_page) {
> -		ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> -
> -		if (!hv_isolation_type_tdx())
> -			goto common_free;
> -	}
> -
>  	if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>  		/* Negotiate GHCB Version. */
>  		if (!hv_ghcb_negotiate_protocol())
> @@ -483,7 +401,7 @@ void __init hyperv_init(void)
> 
>  		hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>  		if (!hv_ghcb_pg)
> -			goto free_vp_assist_page;
> +			goto free_ghcb_page;

Seems like this should be "goto common_free". The allocation of
hv_ghcb_pg has failed, so going to a label where hv_ghcb_pg is
freed seems redundant. It works since free_percpu() checks for
a NULL argument, but it's a bit unexpected since the common_free
label is already there.

>  	}
> 
>  	cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
> @@ -613,10 +531,6 @@ void __init hyperv_init(void)
>  	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
>  free_ghcb_page:
>  	free_percpu(hv_ghcb_pg);
> -free_vp_assist_page:
> -	kfree(hv_vp_assist_page);
> -	hv_vp_assist_page = NULL;
> -common_free:
>  	hv_common_free();
>  }
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee..95b452387969 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -155,16 +155,6 @@ static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>  	return _hv_do_fast_hypercall16(control, input1, input2);
>  }
> 
> -extern struct hv_vp_assist_page **hv_vp_assist_page;
> -
> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
> -{
> -	if (!hv_vp_assist_page)
> -		return NULL;
> -
> -	return hv_vp_assist_page[cpu];
> -}
> -
>  void __init hyperv_init(void);
>  void hyperv_setup_mmu_ops(void);
>  void set_hv_tscchange_cb(void (*cb)(void));
> @@ -254,10 +244,6 @@ static inline void hyperv_setup_mmu_ops(void) {}
>  static inline void set_hv_tscchange_cb(void (*cb)(void)) {}
>  static inline void clear_hv_tscchange_cb(void) {}
>  static inline void hyperv_stop_tsc_emulation(void) {};
> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
> -{
> -	return NULL;
> -}
>  static inline int hyperv_flush_guest_mapping(u64 as) { return -1; }
>  static inline int hyperv_flush_guest_mapping_range(u64 as,
>  		hyperv_fill_flush_list_func fill_func, void *data)
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 6b67ac616789..e8633bc51d56 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -28,7 +28,11 @@
>  #include <linux/slab.h>
>  #include <linux/dma-map-ops.h>
>  #include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +#include <linux/io.h>
> +#include <linux/hyperv.h>
>  #include <hyperv/hvhdk.h>
> +#include <hyperv/hvgdk.h>
>  #include <asm/mshyperv.h>
> 
>  u64 hv_current_partition_id = HV_PARTITION_ID_SELF;
> @@ -78,6 +82,8 @@ static struct ctl_table_header *hv_ctl_table_hdr;
>  u8 * __percpu *hv_synic_eventring_tail;
>  EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
> 
> +struct hv_vp_assist_page **hv_vp_assist_page;
> +EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>  /*
>   * Hyper-V specific initialization and shutdown code that is
>   * common across all architectures.  Called from architecture
> @@ -92,6 +98,9 @@ void __init hv_common_free(void)
>  	if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE)
>  		hv_kmsg_dump_unregister();
> 
> +	kfree(hv_vp_assist_page);
> +	hv_vp_assist_page = NULL;
> +
>  	kfree(hv_vp_index);
>  	hv_vp_index = NULL;
> 
> @@ -394,6 +403,23 @@ int __init hv_common_init(void)
>  	for (i = 0; i < nr_cpu_ids; i++)
>  		hv_vp_index[i] = VP_INVAL;
> 
> +	/*
> +	 * The VP assist page is useless to a TDX guest: the only use we
> +	 * would have for it is lazy EOI, which can not be used with TDX.
> +	 */
> +	if (hv_isolation_type_tdx()) {
> +		hv_vp_assist_page = NULL;
> +#ifdef CONFIG_X86_64
> +		ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> +#endif

I realize that this #ifdef went away for the reason I flagged in v1 of
this patch set, but it's back again for a different reason.

Let me suggest another approach. hv_common_init() is called from
both the x86/64 and arm64 hyperv_init() functions. Immediately after
the call to hv_common_init() in the x86/64 hyperv_init(), test
hv_vp_assist_page for NULL and clear
HV_X64_ENLIGHTENED_VMCS_RECOMMENDED if it is. No #ifdef is
needed, and x86/64 specific hackery stays under arch/x86 instead of
being in common code.
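
A rough userspace model of that split, for illustration: the common init knows nothing about the x86 hint, and the x86-side caller clears it after the fact when the page array is absent. All names here (sketch_state, sketch_common_init(), sketch_x86_init(), SKETCH_HINT_EVMCS) are placeholders, and the bit value is not the real one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_HINT_EVMCS (1u << 0) /* placeholder bit, not the real value */

struct sketch_state {
	unsigned int hints;
	void **vp_assist_page; /* NULL when the page is not used */
};

/* Arch-neutral init: no x86 hint handling and no #ifdef. */
static int sketch_common_init(struct sketch_state *s, bool is_tdx,
			      size_t ncpus)
{
	if (is_tdx) {
		s->vp_assist_page = NULL;
		return 0;
	}

	s->vp_assist_page = calloc(ncpus, sizeof(*s->vp_assist_page));
	return s->vp_assist_page ? 0 : -1;
}

/* x86-side caller: clears the hint right after the common init. */
static int sketch_x86_init(struct sketch_state *s, bool is_tdx, size_t ncpus)
{
	int ret = sketch_common_init(s, is_tdx, ncpus);

	if (ret)
		return ret;

	if (!s->vp_assist_page)
		s->hints &= ~SKETCH_HINT_EVMCS;

	return 0;
}
```

In this model a TDX guest ends up with no page array and the hint cleared, while a non-TDX guest keeps the hint; the arm64 caller would simply not have the extra test.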

> +	} else {
> +		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
> +		if (!hv_vp_assist_page) {
> +			hv_common_free();
> +			return -ENOMEM;
> +		}
> +	}
> +
>  	return 0;
>  }
> 
> @@ -471,6 +497,8 @@ void __init ms_hyperv_late_init(void)
> 
>  int hv_common_cpu_init(unsigned int cpu)
>  {
> +	union hv_vp_assist_msr_contents msr = { 0 };
> +	struct hv_vp_assist_page **hvp;
>  	void **inputarg, **outputarg;
>  	u8 **synic_eventring_tail;
>  	u64 msr_vp_index;
> @@ -539,7 +567,53 @@ int hv_common_cpu_init(unsigned int cpu)
>  						sizeof(u8), flags);
>  		/* No need to unwind any of the above on failure here */
>  		if (unlikely(!*synic_eventring_tail))
> -			ret = -ENOMEM;
> +			return -ENOMEM;
> +	}
> +
> +	if (!hv_vp_assist_page)
> +		return ret;
> +
> +	hvp = &hv_vp_assist_page[cpu];
> +	if (hv_root_partition()) {
> +		/*
> +		 * For root partition we get the hypervisor provided VP assist
> +		 * page, instead of allocating a new page.
> +		 */
> +		msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
> +		*hvp = memremap(msr.pfn << HV_VP_ASSIST_PAGE_ADDRESS_SHIFT,
> +				HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> +	} else {
> +		/*
> +		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
> +		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
> +		 * out to make sure that on x86/x64, we always write the EOI MSR in
> +		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
> +		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
> +		 * case of CPU offlining and the VM will hang.
> +		 */
> +		if (!*hvp) {
> +			*hvp = __vmalloc(HV_HYP_PAGE_SIZE, flags | __GFP_ZERO);
> +
> +			/*
> +			 * Hyper-V should never specify a VM that is a Confidential
> +			 * VM and also running in the root partition. Root partition
> +			 * is blocked to run in Confidential VM. So only decrypt assist
> +			 * page in non-root partition here.
> +			 */
> +			if (*hvp &&
> +			    !ms_hyperv.paravisor_present &&
> +			    hv_isolation_type_snp()) {
> +				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
> +				memset(*hvp, 0, HV_HYP_PAGE_SIZE);
> +			}
> +		}
> +
> +		if (*hvp)
> +			msr.pfn = page_to_hvpfn(vmalloc_to_page(*hvp));

Your Patch 0 changelog mentions adding a comment about vmalloc_to_pfn(), which
I didn't see anywhere. I'm not sure what that comment would say, so maybe it
became unnecessary.

> +	}
> +	if (!WARN_ON(!(*hvp))) {
> +		msr.enable = 1;
> +		hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>  	}
> 
>  	return ret;
> @@ -566,6 +640,24 @@ int hv_common_cpu_die(unsigned int cpu)
>  		*synic_eventring_tail = NULL;
>  	}
> 
> +	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> +		union hv_vp_assist_msr_contents msr = { 0 };
> +
> +		if (hv_root_partition()) {
> +			/*
> +			 * For root partition the VP assist page is mapped to
> +			 * hypervisor provided page, and thus we unmap the
> +			 * page here and nullify it, so that in future we have
> +			 * correct page address mapped in hv_cpu_init.
> +			 */
> +			memunmap(hv_vp_assist_page[cpu]);
> +			hv_vp_assist_page[cpu] = NULL;
> +			msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
> +			msr.enable = 0;
> +		}
> +		hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> +	}
> +
>  	return 0;
>  }
> 
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index d37b68238c97..2810aa05dc73 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -25,6 +25,7 @@
>  #include <linux/nmi.h>
>  #include <asm/ptrace.h>
>  #include <hyperv/hvhdk.h>
> +#include <hyperv/hvgdk.h>
> 
>  #define VTPM_BASE_ADDRESS 0xfed40000
> 
> @@ -299,6 +300,16 @@ do { \
>  #define hv_status_debug(status, fmt, ...) \
>  	hv_status_printk(debug, status, fmt, ##__VA_ARGS__)
> 
> +extern struct hv_vp_assist_page **hv_vp_assist_page;
> +
> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
> +{
> +	if (!hv_vp_assist_page)
> +		return NULL;
> +
> +	return hv_vp_assist_page[cpu];
> +}
> +
>  const char *hv_result_to_string(u64 hv_status);
>  int hv_result_to_errno(u64 status);
>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
> @@ -327,6 +338,11 @@ static inline enum hv_isolation_type hv_get_isolation_type(void)
>  {
>  	return HV_ISOLATION_TYPE_NONE;
>  }
> +
> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
> +{
> +	return NULL;
> +}
>  #endif /* CONFIG_HYPERV */
> 
>  #if IS_ENABLED(CONFIG_MSHV_ROOT)
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 056ef7b6b360..c72d04cd5ae4 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -149,6 +149,7 @@ struct hv_u128 {
>  #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT	12

Can this X64 specific definition of the shift be eliminated entirely,
and a single common definition for x86/64 and arm64 be used?
As I understand it, the MSR layout is the same on both architectures.
The one gotcha is that kvm_hv_set_msr() would need to be updated.

HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK defined below isn't
used anywhere, so it could go away too.  (The KVM selftest usage has
its own definition.)

I realize these are changes to a source code file that is derived from
Windows, and I'm not sure of the guidelines for such changes. So maybe
these suggestions have to be ignored ....
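
For illustration, a minimal sketch of what a single shared definition
could look like, assuming the MSR layout really is identical on both
architectures. It is a userspace mock-up: model_assist_page_gpa() is a
hypothetical helper, and the bit-field layout assumes GCC or Clang on
a little-endian target:

```c
#include <stdint.h>

/* One definition shared by both architectures (name taken from the patch). */
#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT 12

/*
 * Mirrors the MSR layout used in the patch: enable bit, reserved bits,
 * then the page frame number. Bit-field ordering is compiler-defined;
 * this model assumes GCC/Clang on a little-endian target.
 */
union hv_vp_assist_msr_contents {
	uint64_t as_uint64;
	struct {
		uint64_t enable:1;
		uint64_t reserved:11;
		uint64_t pfn:52;
	} __attribute__((packed));
};

/* Guest physical address of the assist page encoded in an MSR value. */
static uint64_t model_assist_page_gpa(uint64_t msr_value)
{
	union hv_vp_assist_msr_contents msr = { .as_uint64 = msr_value };

	/* Widen before shifting so the result is not limited to 52 bits. */
	return (uint64_t)msr.pfn << HV_VP_ASSIST_PAGE_ADDRESS_SHIFT;
}
```

Going through the pfn field makes the separate ADDRESS_MASK definition
unnecessary, since the low 12 bits are dropped by construction.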

>  #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK	\
>  		(~((1ull << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
> +#define HV_MSR_VP_ASSIST_PAGE              (HV_X64_MSR_VP_ASSIST_PAGE)

This is the correct file for this #define, but it should be placed down around
line 1148 or so, with the other HV_MSR_* definitions that are expressed in
terms of HV_X64_MSR_*.

> 
>  /* Hyper-V Enlightened VMCS version mask in nested features CPUID */
>  #define HV_X64_ENLIGHTENED_VMCS_VERSION		0xff
> @@ -410,6 +411,7 @@ union hv_x64_msr_hypercall_contents {
>  #if defined(CONFIG_ARM64)
>  #define HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE	BIT(8)
>  #define HV_STIMER_DIRECT_MODE_AVAILABLE		BIT(13)
> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
>  #endif /* CONFIG_ARM64 */
> 
>  #if defined(CONFIG_X86)
> @@ -1163,6 +1165,8 @@ enum hv_register_name {
>  #define HV_MSR_STIMER0_CONFIG	(HV_X64_MSR_STIMER0_CONFIG)
>  #define HV_MSR_STIMER0_COUNT	(HV_X64_MSR_STIMER0_COUNT)
> 
> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT
> +
>  #elif defined(CONFIG_ARM64) /* CONFIG_X86 */
> 
>  #define HV_MSR_CRASH_P0		(HV_REGISTER_GUEST_CRASH_P0)
> @@ -1185,7 +1189,7 @@ enum hv_register_name {
> 
>  #define HV_MSR_STIMER0_CONFIG	(HV_REGISTER_STIMER0_CONFIG)
>  #define HV_MSR_STIMER0_COUNT	(HV_REGISTER_STIMER0_COUNT)
> -
> +#define HV_MSR_VP_ASSIST_PAGE    (HV_REGISTER_VP_ASSIST_PAGE)

Nit: This definition is slightly misaligned. It has spaces where there
should be a tab to match the similar definitions above it.

>  #endif /* CONFIG_ARM64 */
> 
>  union hv_explicit_suspend_register {
> --
> 2.43.0
> 



* RE: [PATCH v2 03/15] Drivers: hv: Move vmbus_handler to common code
From: Michael Kelley @ 2026-04-27  5:38 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-4-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> Move the vmbus_handler global variable and hv_setup_vmbus_handler()/
> hv_remove_vmbus_handler() from arch/x86 to drivers/hv/hv_common.c.
> 
> hv_setup_vmbus_handler() is called unconditionally in vmbus_bus_init()
> and works for both x86 (sysvec handler) and arm64 (vmbus_percpu_isr).
> 
> This eliminates the need for separate percpu vmbus handler setup
> functions and __weak stubs, that are needed for adding ARM64 support
> in MSHV_VTL driver where we need to set a custom per-cpu vmbus handler.
> 
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 12 ------------
>  drivers/hv/hv_common.c         |  9 +++++++--
>  drivers/hv/vmbus_drv.c         | 17 +++++++++--------
>  include/asm-generic/mshyperv.h |  1 +
>  4 files changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 89a2eb8a0722..68706ff5880e 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -145,7 +145,6 @@ void hv_set_msr(unsigned int reg, u64 value)
>  EXPORT_SYMBOL_GPL(hv_set_msr);
> 
>  static void (*mshv_handler)(void);
> -static void (*vmbus_handler)(void);
>  static void (*hv_stimer0_handler)(void);
>  static void (*hv_kexec_handler)(void);
>  static void (*hv_crash_handler)(struct pt_regs *regs);
> @@ -172,17 +171,6 @@ void hv_setup_mshv_handler(void (*handler)(void))
>  	mshv_handler = handler;
>  }
> 
> -void hv_setup_vmbus_handler(void (*handler)(void))
> -{
> -	vmbus_handler = handler;
> -}
> -
> -void hv_remove_vmbus_handler(void)
> -{
> -	/* We have no way to deallocate the interrupt gate */
> -	vmbus_handler = NULL;
> -}
> -
>  /*
>   * Routines to do per-architecture handling of stimer0
>   * interrupts when in Direct Mode
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index e8633bc51d56..eb7b0028b45d 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -758,13 +758,18 @@ bool __weak hv_isolation_type_tdx(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
> 
> -void __weak hv_setup_vmbus_handler(void (*handler)(void))
> +void (*vmbus_handler)(void);
> +EXPORT_SYMBOL_GPL(vmbus_handler);
> +
> +void hv_setup_vmbus_handler(void (*handler)(void))
>  {
> +	vmbus_handler = handler;
>  }
>  EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
> 
> -void __weak hv_remove_vmbus_handler(void)
> +void hv_remove_vmbus_handler(void)
>  {
> +	vmbus_handler = NULL;
>  }
>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);

I'd suggest moving hv_setup_vmbus_handler() and
hv_remove_vmbus_handler() above or below the group
of __weak stubs in this source code file. There's a comment
describing the purpose of these __weak functions, and
intermixing these two functions that are no longer __weak
produces something of a jumble.

> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index bc4fc1951ae1..052ca8b11cee 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1415,7 +1415,8 @@ EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> 
>  static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>  {
> -	vmbus_isr();
> +	if (vmbus_handler)
> +		vmbus_handler();

Is it necessary to test vmbus_handler first? From what I can
see, it is always set before the per-cpu interrupt is set up.

>  	return IRQ_HANDLED;
>  }
> 
> @@ -1517,8 +1518,10 @@ static int vmbus_bus_init(void)
>  		vmbus_irq_initialized = true;
>  	}
> 
> +	hv_setup_vmbus_handler(vmbus_isr);
> +
>  	if (vmbus_irq == -1) {
> -		hv_setup_vmbus_handler(vmbus_isr);
> +		/* x86: sysvec handler uses vmbus_handler directly */
>  	} else {
>  		ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr,
>  				"Hyper-V VMbus", &vmbus_evt);
> @@ -1553,9 +1556,8 @@ static int vmbus_bus_init(void)
>  	return 0;
> 
>  err_connect:
> -	if (vmbus_irq == -1)
> -		hv_remove_vmbus_handler();
> -	else
> +	hv_remove_vmbus_handler();
> +	if (vmbus_irq != -1)
>  		free_percpu_irq(vmbus_irq, &vmbus_evt);

These operations should be reordered so they are the inverse
of how they are set up.  I.e., free_percpu_irq() first, then remove
the VMBus handler. That's just good standard practice unless
there's a specific reason to do the cleanup ordering differently. In
fact, hv_remove_vmbus_handler() needs to be moved down
to the err_setup label so it's done if request_percpu_irq()
fails.
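
Something along these lines, as an untested userspace mock-up of the
unwind order only. The model_* names, the trace buffer, and the -1
error value are placeholders for illustration; the comments mark
which real call each step stands in for:

```c
#include <stdbool.h>
#include <string.h>

/* Trace of teardown steps so the ordering can be inspected. */
static char model_trace[64];

static void model_log(const char *step)
{
	strcat(model_trace, step);
	strcat(model_trace, ";");
}

/* Knobs to inject a failure at either setup step. */
static bool model_fail_irq, model_fail_connect;

/* -1 models the x86 sysvec case, anything else a per-cpu IRQ. */
static int model_vmbus_irq = 5;

static int model_vmbus_bus_init(void)
{
	int ret;

	model_trace[0] = '\0';

	/* Step 1: hv_setup_vmbus_handler(vmbus_isr), unconditionally. */

	/* Step 2: request_percpu_irq(), only when a per-cpu IRQ is used. */
	if (model_vmbus_irq != -1) {
		ret = model_fail_irq ? -1 : 0;
		if (ret)
			goto err_setup;
	}

	/* Step 3: vmbus_connect(). */
	ret = model_fail_connect ? -1 : 0;
	if (ret)
		goto err_connect;

	return 0;

err_connect:
	/* Undo step 2 first ... */
	if (model_vmbus_irq != -1)
		model_log("free_percpu_irq");
err_setup:
	/* ... then step 1, which also covers a failed step 2. */
	model_log("hv_remove_vmbus_handler");
	return ret;
}
```

The labels fall through in reverse setup order, so a failure at any
step undoes exactly the steps that completed before it.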

>  err_setup:
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
> @@ -3026,9 +3028,8 @@ static void __exit vmbus_exit(void)
>  	vmbus_connection.conn_state = DISCONNECTED;
>  	hv_stimer_global_cleanup();
>  	vmbus_disconnect();
> -	if (vmbus_irq == -1)
> -		hv_remove_vmbus_handler();
> -	else
> +	hv_remove_vmbus_handler();
> +	if (vmbus_irq != -1)
>  		free_percpu_irq(vmbus_irq, &vmbus_evt);

Ordering should be changed here as well so it is the inverse
of how things are set up.

>  	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
>  		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 2810aa05dc73..db183c8cfb95 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -179,6 +179,7 @@ static inline u64 hv_generate_guest_id(u64 kernel_version)
> 
>  int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);
> 
> +extern void (*vmbus_handler)(void);
>  void hv_setup_vmbus_handler(void (*handler)(void));
>  void hv_remove_vmbus_handler(void);
>  void hv_setup_stimer0_handler(void (*handler)(void));
> --
> 2.43.0
> 



* RE: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Michael Kelley @ 2026-04-27  5:38 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-8-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
> driver on arm64. This function enables the transition between Virtual
> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
> 
> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> Reviewed-by: Roman Kisel <vdso@mailbox.org>
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/Makefile        |   1 +
>  arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>  arch/arm64/include/asm/mshyperv.h |  13 +++
>  arch/x86/include/asm/mshyperv.h   |   2 -
>  drivers/hv/mshv_vtl.h             |   3 +
>  include/asm-generic/mshyperv.h    |   2 +
>  6 files changed, 177 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/hyperv/hv_vtl.c
>

[snip]

> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index 585b23a26f1b..9eb0e5999f29 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -60,6 +60,18 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>  				ARM_SMCCC_SMC_64,		\
>  				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>  				HV_SMCCC_FUNC_NUMBER)
> +
> +struct mshv_vtl_cpu_context {
> +/*
> + * x18 is managed by the hypervisor. It won't be reloaded from this array.
> + * It is included here for convenience in array indexing.
> + * 'rsvd' field serves as alignment padding so q[] starts at offset 32*8=256.
> + */
> +	__u64 x[31];
> +	__u64 rsvd;
> +	__uint128_t q[32];
> +};
> +
>  #ifdef CONFIG_HYPERV_VTL_MODE
>  /*
>   * Get/Set the register. If the function returns `1`, that must be done via
> @@ -69,6 +81,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, b
>  {
>  	return 1;
>  }
> +

This appears to be a spurious blank line being added since there
are no other changes in the vicinity.

>  #endif
> 
>  #include <asm-generic/mshyperv.h>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 08278547b84c..b4d80c9a673a 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
>  #ifdef CONFIG_HYPERV_VTL_MODE
>  void __init hv_vtl_init_platform(void);
>  int __init hv_vtl_early_init(void);
> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>  void mshv_vtl_return_call_init(u64 vtl_return_offset);
>  void mshv_vtl_return_hypercall(void);
>  void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>  #else
>  static inline void __init hv_vtl_init_platform(void) {}
>  static inline int __init hv_vtl_early_init(void) { return 0; }
> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>  static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>  static inline void mshv_vtl_return_hypercall(void) {}
>  static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
> index a6eea52f7aa2..103f07371f3f 100644
> --- a/drivers/hv/mshv_vtl.h
> +++ b/drivers/hv/mshv_vtl.h
> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
>  	char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
>  };
> 
> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
> +	      "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
> +
>  #endif /* _MSHV_VTL_H */
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index db183c8cfb95..8cdf2a9fbdfb 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
> 
>  #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>  u8 __init get_vtl(void);
> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>  #else
>  static inline u8 get_vtl(void) { return 0; }
> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}

Is this stub needed? Maybe I missed something, but it looks to me like none
of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
See further comments about stubs in Patch 8 of this series.

>  #endif
> 
>  #endif
> --
> 2.43.0
> 



* RE: [PATCH v2 08/15] Drivers: hv: Move hv_call_(get|set)_vp_registers() declarations
From: Michael Kelley @ 2026-04-27  5:39 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-9-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
> declarations from drivers/hv/mshv.h to include/asm-generic/mshyperv.h.
> 
> These functions are defined in mshv_common.c and are going to be called
> from both drivers/hv/ and arch/x86/hyperv/hv_vtl.c. The latter never
> included mshv.h, relying on implicit declaration visibility. Moving the
> declarations to the arch-generic Hyper-V header makes them properly
> visible to all architecture-specific callers.
> 
> Provide static inline stubs returning -EOPNOTSUPP when neither
> CONFIG_MSHV_ROOT nor CONFIG_MSHV_VTL is enabled.

Looking at drivers/hv/Kconfig, it's possible to build with
CONFIG_HYPERV_VTL_MODE=y, but without CONFIG_MSHV_VTL. In such a
case, mshv_common.o doesn't get built, which is why the stubs are
needed. Is such a configuration desirable for some scenarios?

I wonder if having CONFIG_HYPERV_VTL_MODE force the building of
mshv_common.o would be a better approach. Then the stubs wouldn't
be needed. The "ifneq" statement in drivers/hv/Makefile could use
CONFIG_HYPERV_VTL_MODE instead of CONFIG_MSHV_VTL, and
everything would be good since CONFIG_MSHV_VTL depends on
CONFIG_HYPERV_VTL_MODE.

> 
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  drivers/hv/mshv.h              |  8 --------
>  include/asm-generic/mshyperv.h | 26 ++++++++++++++++++++++++++
>  2 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> index d4813df92b9c..0fcb7f9ba6a9 100644
> --- a/drivers/hv/mshv.h
> +++ b/drivers/hv/mshv.h
> @@ -14,14 +14,6 @@
>  	memchr_inv(&((STRUCT).MEMBER), \
>  		   0, sizeof_field(typeof(STRUCT), MEMBER))
> 
> -int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> -			     union hv_input_vtl input_vtl,
> -			     struct hv_register_assoc *registers);
> -
> -int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> -			     union hv_input_vtl input_vtl,
> -			     struct hv_register_assoc *registers);
> -
>  int hv_call_get_partition_property(u64 partition_id, u64 property_code,
>  				   u64 *property_value);
> 
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 8cdf2a9fbdfb..ef0b9466808c 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -394,6 +394,32 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>  	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>  }
> 
> +#if IS_ENABLED(CONFIG_MSHV_ROOT) || IS_ENABLED(CONFIG_MSHV_VTL)
> +int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers);
> +
> +int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers);
> +#else
> +static inline int hv_call_get_vp_registers(u32 vp_index, u64 partition_id,
> +					   u16 count,
> +					   union hv_input_vtl input_vtl,
> +					   struct hv_register_assoc *registers)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int hv_call_set_vp_registers(u32 vp_index, u64 partition_id,
> +					   u16 count,
> +					   union hv_input_vtl input_vtl,
> +					   struct hv_register_assoc *registers)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_MSHV_ROOT || CONFIG_MSHV_VTL */
> +
>  #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>  u8 __init get_vtl(void);
>  void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> --
> 2.43.0
> 



* RE: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Michael Kelley @ 2026-04-27  5:40 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-10-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
> in architecture-specific code.
> 
> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
> include/asm-generic/mshyperv.h so they are visible to both arch and
> driver code.
> 
> Change the return type from void to bool so the caller can determine
> whether the register page was successfully configured and set
> mshv_has_reg_page accordingly.
> 
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
>  drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
>  include/asm-generic/mshyperv.h | 17 ++++++++++++
>  3 files changed, 53 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> index 09d81f9b853c..f3ffb6a7cb2d 100644
> --- a/arch/x86/hyperv/hv_vtl.c
> +++ b/arch/x86/hyperv/hv_vtl.c
> @@ -20,6 +20,7 @@
>  #include <uapi/asm/mtrr.h>
>  #include <asm/debugreg.h>
>  #include <linux/export.h>
> +#include <linux/hyperv.h>
>  #include <../kernel/smpboot.h>
>  #include "../../kernel/fpu/legacy.h"
> 
> @@ -259,6 +260,37 @@ int __init hv_vtl_early_init(void)
>  	return 0;
>  }
> 
> +static const union hv_input_vtl input_vtl_zero;
> +
> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
> +{
> +	struct hv_register_assoc reg_assoc = {};
> +	union hv_synic_overlay_page_msr overlay = {};
> +	struct page *reg_page;
> +
> +	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
> +	if (!reg_page) {
> +		WARN(1, "failed to allocate register page\n");
> +		return false;
> +	}
> +
> +	overlay.enabled = 1;
> +	overlay.pfn = page_to_hvpfn(reg_page);
> +	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
> +	reg_assoc.value.reg64 = overlay.as_uint64;
> +
> +	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> +				     1, input_vtl_zero, &reg_assoc)) {
> +		WARN(1, "failed to setup register page\n");
> +		__free_page(reg_page);
> +		return false;
> +	}
> +
> +	per_cpu->reg_page = reg_page;
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
> +
>  DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
> 
>  void mshv_vtl_return_call_init(u64 vtl_return_offset)
> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
> index 91517b45d526..c79d24317b8e 100644
> --- a/drivers/hv/mshv_vtl_main.c
> +++ b/drivers/hv/mshv_vtl_main.c
> @@ -78,21 +78,6 @@ struct mshv_vtl {
>  	u64 id;
>  };
> 
> -struct mshv_vtl_per_cpu {
> -	struct mshv_vtl_run *run;
> -	struct page *reg_page;
> -};
> -
> -/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> -union hv_synic_overlay_page_msr {
> -	u64 as_uint64;
> -	struct {
> -		u64 enabled: 1;
> -		u64 reserved: 11;
> -		u64 pfn: 52;
> -	} __packed;
> -};
> -
>  static struct mutex mshv_vtl_poll_file_lock;
>  static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>  static union hv_register_vsm_capabilities mshv_vsm_capabilities;
> @@ -201,34 +186,6 @@ static struct page *mshv_vtl_cpu_reg_page(int cpu)
>  	return *per_cpu_ptr(&mshv_vtl_per_cpu.reg_page, cpu);
>  }
> 
> -static void mshv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
> -{
> -	struct hv_register_assoc reg_assoc = {};
> -	union hv_synic_overlay_page_msr overlay = {};
> -	struct page *reg_page;
> -
> -	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
> -	if (!reg_page) {
> -		WARN(1, "failed to allocate register page\n");
> -		return;
> -	}
> -
> -	overlay.enabled = 1;
> -	overlay.pfn = page_to_hvpfn(reg_page);
> -	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
> -	reg_assoc.value.reg64 = overlay.as_uint64;
> -
> -	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> -				     1, input_vtl_zero, &reg_assoc)) {
> -		WARN(1, "failed to setup register page\n");
> -		__free_page(reg_page);
> -		return;
> -	}
> -
> -	per_cpu->reg_page = reg_page;
> -	mshv_has_reg_page = true;
> -}
> -
>  static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
> @@ -329,8 +286,10 @@ static int mshv_vtl_alloc_context(unsigned int cpu)
>  	if (!per_cpu->run)
>  		return -ENOMEM;
> 
> -	if (mshv_vsm_capabilities.intercept_page_available)
> -		mshv_vtl_configure_reg_page(per_cpu);
> +	if (mshv_vsm_capabilities.intercept_page_available) {
> +		if (hv_vtl_configure_reg_page(per_cpu))
> +			mshv_has_reg_page = true;
> +	}
> 
>  	mshv_vtl_synic_enable_regs(cpu);
> 
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index ef0b9466808c..9e86178c182e 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -420,12 +420,29 @@ static inline int hv_call_set_vp_registers(u32 vp_index, u64 partition_id,
>  }
>  #endif /* CONFIG_MSHV_ROOT || CONFIG_MSHV_VTL */
> 
> +struct mshv_vtl_per_cpu {
> +	struct mshv_vtl_run *run;
> +	struct page *reg_page;
> +};
> +
>  #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */

This comment pre-dates your patch, but I don't understand the point
it is trying to make. The comment is factually true, but I don't know
why calling that out is relevant. The REG_PAGE MSR seems to be
conceptually separate and distinct from the SIMP MSR, so the fact
that the layouts are the same is just a coincidence. Or is there some
relationship between the two MSRs that I'm not aware of, and the
comment is trying (and failing?) to point out?

> +union hv_synic_overlay_page_msr {
> +	u64 as_uint64;
> +	struct {
> +		u64 enabled: 1;
> +		u64 reserved: 11;
> +		u64 pfn: 52;
> +	} __packed;
> +};
> +
>  u8 __init get_vtl(void);
>  void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>  #else
>  static inline u8 get_vtl(void) { return 0; }
>  static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> +static inline bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu) { return false; }

As with Patch 8, if CONFIG_HYPERV_VTL_MODE caused mshv_common.o
to be built, this stub wouldn't be needed.

>  #endif
> 
>  #endif
> --
> 2.43.0
> 



* RE: [PATCH v2 12/15] mshv_vtl: Move VSM code page offset logic to x86 files
From: Michael Kelley @ 2026-04-27  5:40 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <20260423124206.2410879-13-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> 
> The VSM code page offset register (HV_REGISTER_VSM_CODE_PAGE_OFFSETS)
> is x86 specific, its value configures the static call used to return
> to VTL0 via the hypercall page. Move the register read from the common
> mshv_vtl_get_vsm_regs() into the x86 mshv_vtl_return_call_init(),
> which is the sole consumer of the offset.
> 
> Change mshv_vtl_return_call_init() from taking a u64 parameter
> to taking no arguments, and rename mshv_vtl_get_vsm_regs() to
> mshv_vtl_get_vsm_cap_reg() since it now only fetches
> HV_REGISTER_VSM_CAPABILITIES.
> 
> No functional change on x86. This prepares the common driver code for
> ARM64 where VSM code page offsets do not apply.
> 
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_vtl.c        | 19 +++++++++++++++++--
>  arch/x86/include/asm/mshyperv.h |  4 ++--
>  drivers/hv/mshv_vtl_main.c      | 24 +++++++++++++-----------
>  3 files changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> index f3ffb6a7cb2d..7c10b34cf8a4 100644
> --- a/arch/x86/hyperv/hv_vtl.c
> +++ b/arch/x86/hyperv/hv_vtl.c
> @@ -293,10 +293,25 @@ EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
> 
>  DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
> 
> -void mshv_vtl_return_call_init(u64 vtl_return_offset)
> +int mshv_vtl_return_call_init(void)
>  {
> +	struct hv_register_assoc vsm_pg_offset_reg;
> +	union hv_register_vsm_page_offsets offsets;
> +	int ret;
> +
> +	vsm_pg_offset_reg.name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
> +
> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> +				       1, input_vtl_zero, &vsm_pg_offset_reg);
> +	if (ret)
> +		return ret;
> +
> +	offsets.as_uint64 = vsm_pg_offset_reg.value.reg64;
> +
>  	static_call_update(__mshv_vtl_return_hypercall,
> -			   (void *)((u8 *)hv_hypercall_pg + vtl_return_offset));
> +			   (void *)((u8 *)hv_hypercall_pg + offsets.vtl_return_offset));
> +
> +	return 0;
>  }
>  EXPORT_SYMBOL(mshv_vtl_return_call_init);
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index b4d80c9a673a..b48f115c1292 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -286,14 +286,14 @@ struct mshv_vtl_cpu_context {
>  #ifdef CONFIG_HYPERV_VTL_MODE
>  void __init hv_vtl_init_platform(void);
>  int __init hv_vtl_early_init(void);
> -void mshv_vtl_return_call_init(u64 vtl_return_offset);
> +int mshv_vtl_return_call_init(void);
>  void mshv_vtl_return_hypercall(void);
>  void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>  int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>  #else
>  static inline void __init hv_vtl_init_platform(void) {}
>  static inline int __init hv_vtl_early_init(void) { return 0; }
> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
> +static inline int mshv_vtl_return_call_init(void) { return 0; }
>  static inline void mshv_vtl_return_hypercall(void) {}
>  static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>  #endif
> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
> index 4c9ae65ad3e8..be498c9234fd 100644
> --- a/drivers/hv/mshv_vtl_main.c
> +++ b/drivers/hv/mshv_vtl_main.c
> @@ -79,7 +79,6 @@ struct mshv_vtl {
>  };
> 
>  static struct mutex mshv_vtl_poll_file_lock;
> -static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>  static union hv_register_vsm_capabilities mshv_vsm_capabilities;
> 
>  static DEFINE_PER_CPU(struct mshv_vtl_poll_file, mshv_vtl_poll_file);
> @@ -203,21 +202,19 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>  	/* VTL2 Host VSP SINT is (un)masked when the user mode requests that */
>  }
> 
> -static int mshv_vtl_get_vsm_regs(void)
> +static int mshv_vtl_get_vsm_cap_reg(void)
>  {
> -	struct hv_register_assoc registers[2];
> -	int ret, count = 2;
> +	struct hv_register_assoc vsm_capability_reg;
> +	int ret;
> 
> -	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
> -	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
> +	vsm_capability_reg.name = HV_REGISTER_VSM_CAPABILITIES;
> 
>  	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> -				       count, input_vtl_zero, registers);
> +				       1, input_vtl_zero, &vsm_capability_reg);
>  	if (ret)
>  		return ret;
> 
> -	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
> -	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
> +	mshv_vsm_capabilities.as_uint64 = vsm_capability_reg.value.reg64;
> 
>  	return ret;

Nit: This could be just "return 0".

>  }
> @@ -1139,13 +1136,18 @@ static int __init mshv_vtl_init(void)
>  	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
>  	init_waitqueue_head(&fd_wait_queue);
> 
> -	if (mshv_vtl_get_vsm_regs()) {
> +	if (mshv_vtl_get_vsm_cap_reg()) {
>  		dev_emerg(dev, "Unable to get VSM capabilities !!\n");

Why is this failure an emergency message, while the other failures
here in mshv_vtl_init() are just error messages? When there's a lack
of consistency, I always wonder if there is a reason. :-)

>  		ret = -ENODEV;
>  		goto free_dev;
>  	}
> 
> -	mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset);
> +	ret = mshv_vtl_return_call_init();
> +	if (ret) {
> +		dev_err(dev, "mshv_vtl_return_call_init failed: %d\n", ret);
> +		goto free_dev;
> +	}
> +
>  	ret = hv_vtl_setup_synic();
>  	if (ret)
>  		goto free_dev;
> --
> 2.43.0
> 


^ permalink raw reply

* Re: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Hamza Mahfooz @ 2026-04-27 11:51 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Himadri Pandya, Michael Kelley, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel@lists.freedesktop.org, stable@kernel.vger.org
In-Reply-To: <KUZP153MB14445757C6A5DA5DEDA9A09CBE292@KUZP153MB1444.APCP153.PROD.OUTLOOK.COM>

On Sun, Apr 26, 2026 at 05:00:24AM +0000, Saurabh Singh Sengar wrote:
> > Subject: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
> > 
> > VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
> > ensure they are always aligned and large enough to hold all of the relevant
> > data.
> > 
> > Cc: stable@kernel.vger.org
> > Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video
> > device")
> > Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > ---
> >  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > index 051ecc526832..753d97bff76f 100644
> > --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > @@ -10,7 +10,7 @@
> > 
> >  #include "hyperv_drm.h"
> > 
> > -#define VMBUS_RING_BUFSIZE (256 * 1024)
> > +#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
> >  #define VMBUS_VSP_TIMEOUT (10 * HZ)
> > 
> >  #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
> > --
> > 2.54.0
> 
> This LGTM, but it may change the behaviour on ARM64 systems with page size > 4K.
> Have we tested it?

Yup, I tested it on an ARM64 Windows machine with a 64K page size guest kernel.

> 
> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>

Pushed to drm-misc.

> 

^ permalink raw reply

* [PATCH net-next v2 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya

The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.

Allocation sizes that are problematic:

  mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
    16 queues (default): 35528 * 16 =  ~555 KB contiguous
    64 queues (max):     35528 * 64 = ~2220 KB contiguous

  mana_create_rxq -> rxq struct with flex array
  (sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
    depth 1024 (default): 35712 + 296 * 1024 =  ~331 KB per queue
    depth 8192 (max):     35712 + 296 * 8192 = ~2403 KB per queue

  mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
    16 queues, depth 1024 (default): 16 * 1024 * 8 =  128 KB each
    64 queues, depth 8192 (max):     64 * 8192 * 8 = 4096 KB each
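
The arithmetic above can be reproduced with a small userspace sketch.
It is only an illustration: the sketch_* names are hypothetical, the
4 KiB page size is an assumption, and the struct sizes are simply the
figures quoted in this cover letter.

```c
#include <stddef.h>

/* Assumed 4 KiB base page; the real value depends on the kernel config. */
#define SKETCH_PAGE_SIZE 4096UL

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes. */
static unsigned int sketch_alloc_order(size_t bytes)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

/* Size of a flat tx_qp array: one struct per queue, back to back. */
static size_t sketch_txq_array_bytes(size_t qp_size, size_t num_queues)
{
	return qp_size * num_queues;
}

/* Size of one rxq: fixed part plus a flexible array of rx_oobs entries. */
static size_t sketch_rxq_bytes(size_t fixed, size_t oob, size_t depth)
{
	return fixed + oob * depth;
}
```

Under these assumptions, sketch_txq_array_bytes(35528, 16) gives 568448
bytes, which rounds up to an order-8 (1 MiB) contiguous block, while a
single 35528-byte mana_tx_qp fits in an order-4 (64 KiB) block.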

This series addresses the issue by:
  1. Converting the tx_qp flat array into an array of pointers with
     per-queue kvzalloc (~35 KB each), replacing a single contiguous
     allocation that can reach ~2.2 MB at 64 queues.
  2. Switching rxbufs_pre, das_pre, and rxq allocations to
     kvmalloc/kvzalloc so the allocator can fall back to vmalloc
     when contiguous memory is unavailable.
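
The first item can be pictured with a minimal userspace sketch. It is
not the driver code: calloc()/free() stand in for kvzalloc()/kvfree(),
and struct sketch_tx_qp and the sketch_* functions are hypothetical
names that only mimic the size quoted above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct mana_tx_qp; only the size matters. */
struct sketch_tx_qp {
	unsigned char payload[35528];
};

/* Free every per-queue object, then the pointer array itself.
 * free(NULL) is a no-op, so this is safe after a partial allocation.
 */
static void sketch_destroy_txq(struct sketch_tx_qp **tx_qp, size_t num_queues)
{
	size_t i;

	if (!tx_qp)
		return;

	for (i = 0; i < num_queues; i++)
		free(tx_qp[i]);

	free(tx_qp);
}

/* Allocate a small array of pointers, then one object per queue. */
static struct sketch_tx_qp **sketch_create_txq(size_t num_queues)
{
	struct sketch_tx_qp **tx_qp;
	size_t i;

	/* Zeroed pointer array: unallocated slots stay NULL. */
	tx_qp = calloc(num_queues, sizeof(*tx_qp));
	if (!tx_qp)
		return NULL;

	for (i = 0; i < num_queues; i++) {
		tx_qp[i] = calloc(1, sizeof(*tx_qp[i]));
		if (!tx_qp[i]) {
			/* Unwind whatever was allocated so far. */
			sketch_destroy_txq(tx_qp, num_queues);
			return NULL;
		}
	}

	return tx_qp;
}
```

The largest single request in this sketch is one queue's worth of
memory rather than num_queues times that size. The trade-off is one
extra pointer dereference per access (tx_qp[i]->field instead of
tx_qp[i].field).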

Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, GBits/sec):

                 Physically contiguous         vmalloc region
  Connections      TX          RX              TX          RX
  --------------------------------------------------------------
  1                47.2        46.9            46.8        46.6
  16               181         181             181         181
  32               181         181             181         181
  64               181         181             181         181

---
Changes in v2:
  - Rebased onto v7.1-rc1 (was v7.0-rc7)

Aditya Garg (2):
  net: mana: Use per-queue allocation for tx_qp to reduce allocation
    size
  net: mana: Use kvmalloc for large RX queue and buffer allocations

 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 39 insertions(+), 28 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH net-next v2 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260427132807.1642290-1-gargaditya@linux.microsoft.com>

Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.

Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
 		count++;
 	}
 
-	tx_stats = &apc->tx_qp[q_idx].txq.stats;
+	tx_stats = &apc->tx_qp[q_idx]->txq.stats;
 
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a654b3699c4c..8adf72b96145 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (skb_cow_head(skb, MANA_HEADROOM))
 		goto tx_drop_count;
 
-	txq = &apc->tx_qp[txq_idx].txq;
+	txq = &apc->tx_qp[txq_idx]->txq;
 	gdma_sq = txq->gdma_sq;
-	cq = &apc->tx_qp[txq_idx].tx_cq;
+	cq = &apc->tx_qp[txq_idx]->tx_cq;
 	tx_stats = &txq->stats;
 
 	BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2321,21 +2321,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		return;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
-		apc->tx_qp[i].mana_tx_debugfs = NULL;
+		if (!apc->tx_qp[i])
+			continue;
+
+		debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+		apc->tx_qp[i]->mana_tx_debugfs = NULL;
 
-		napi = &apc->tx_qp[i].tx_cq.napi;
-		if (apc->tx_qp[i].txq.napi_initialized) {
+		napi = &apc->tx_qp[i]->tx_cq.napi;
+		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
 			netif_napi_del_locked(napi);
-			apc->tx_qp[i].txq.napi_initialized = false;
+			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
-		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
 
-		mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
 
-		mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+		mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+		kvfree(apc->tx_qp[i]);
 	}
 
 	kfree(apc->tx_qp);
@@ -2344,7 +2349,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 
 static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
 {
-	struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+	struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
 	char qnum[32];
 
 	sprintf(qnum, "TX-%d", idx);
@@ -2383,7 +2388,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 	int err;
 	int i;
 
-	apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+	apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
 	if (!apc->tx_qp)
 		return -ENOMEM;
 
@@ -2403,10 +2408,16 @@ static int mana_create_txq(struct mana_port_context *apc,
 	gc = gd->gdma_context;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+		apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+		if (!apc->tx_qp[i]) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
 
 		/* Create SQ */
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 
 		u64_stats_init(&txq->stats.syncp);
 		txq->ndev = net;
@@ -2424,7 +2435,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 			goto out;
 
 		/* Create SQ's CQ */
-		cq = &apc->tx_qp[i].tx_cq;
+		cq = &apc->tx_qp[i]->tx_cq;
 		cq->type = MANA_CQ_TYPE_TX;
 
 		cq->txq = txq;
@@ -2453,7 +2464,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
-					 &apc->tx_qp[i].tx_object);
+					 &apc->tx_qp[i]->tx_object);
 
 		if (err)
 			goto out;
@@ -3288,7 +3299,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 */
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		tsleep = 1000;
 		while (atomic_read(&txq->pending_sends) > 0 &&
 		       time_before(jiffies, timeout)) {
@@ -3307,7 +3318,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	}
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		while ((skb = skb_dequeue(&txq->pending_skbs))) {
 			mana_unmap_skb(skb, apc);
 			dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..04350973e19e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -260,7 +260,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..aa90a858c8e3 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -507,7 +507,7 @@ struct mana_port_context {
 	bool tx_shortform_allowed;
 	u16 tx_vp_offset;
 
-	struct mana_tx_qp *tx_qp;
+	struct mana_tx_qp **tx_qp;
 
 	/* Indirection Table for RX & TX. The values are queue indexes */
 	u32 *indir_table;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260427132807.1642290-1-gargaditya@linux.microsoft.com>

The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.

Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 8adf72b96145..e1d8ac3417e8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
 		put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
 	}
 
-	kfree(mpc->das_pre);
+	kvfree(mpc->das_pre);
 	mpc->das_pre = NULL;
 
 out2:
-	kfree(mpc->rxbufs_pre);
+	kvfree(mpc->rxbufs_pre);
 	mpc->rxbufs_pre = NULL;
 
 out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
 	num_rxb = num_queues * mpc->rx_queue_size;
 
 	WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
-	mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+	mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
 	if (!mpc->rxbufs_pre)
 		goto error;
 
-	mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+	mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
 	if (!mpc->das_pre)
 		goto error;
 
@@ -2564,7 +2564,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
-	kfree(rxq);
+	kvfree(rxq);
 }
 
 static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2704,7 +2704,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	gc = gd->gdma_context;
 
-	rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+	rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
 	if (!rxq)
 		return NULL;
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next] net: mana: Force single RX buffer per page for CVM/encrypted guest memory
From: Dipayaan Roy @ 2026-04-27 13:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

On Confidential VMs (CVMs) such as AMD SEV-SNP or Intel TDX, the guest
operating system's memory is encrypted. Current hardware lacks support
for TDISP (TEE Device Interface Security Protocol), so the NIC cannot
directly access this encrypted VM memory. Consequently, all DMA
operations must use SWIOTLB bounce buffers.

In the MANA driver currently, there are two distinct paths for DMA
mapping:

1. Without PP_FLAG_DMA_MAP: The driver manually maps full pages for each
packet. This creates standalone, page-aligned mappings where the offset
is always zero.

2. With PP_FLAG_DMA_MAP: Optimizes memory by using page_pool with
sub-page fragments (e.g., multiple RX buffers sharing a single page).
When PP_FLAG_DMA_MAP is enabled, page_pool maps the entire page once.
Subsequent RX buffer allocations use offsets into this pre-mapped area.

When page_pool allocates sub-page RX buffer fragments, the bounce buffer
granularity may not align with these smaller fragment sizes, leading to
failures in the MANA driver's RX path.

Refactor the RX buffer decision into mana_use_single_rxbuf_per_page().
When cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) is true, the driver is
forced to use a single RX buffer per page. This ensures:
- Each RX buffer is exactly one PAGE_SIZE.
- The DMA offset is always 0.
- SWIOTLB maps full, page-aligned blocks.
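
The decision order described above can be sketched in plain C as
follows. This is only an illustration: the boolean parameters replace
cc_platform_has() and mana_xdp_get(), and SKETCH_PAGE_SIZE and
SKETCH_RXBUF_PAD are hypothetical placeholder values, not the driver's
constants.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4 KiB page */
#define SKETCH_RXBUF_PAD 512u	/* placeholder, not the real MANA_RXBUF_PAD */

static bool sketch_single_rxbuf_per_page(bool guest_mem_encrypted,
					  bool xdp_enabled, uint32_t mtu)
{
	/* Encrypted guest memory: always use one full page per buffer. */
	if (guest_mem_encrypted)
		return true;

	/* XDP or large MTU: only one packet fits per page anyway. */
	if (mtu + SKETCH_RXBUF_PAD > SKETCH_PAGE_SIZE / 2 || xdp_enabled)
		return true;

	/* Otherwise sub-page fragments are allowed. */
	return false;
}
```

With these placeholder values, a 1500-byte MTU on a non-encrypted guest
without XDP still uses sub-page fragments, while the same MTU on an
encrypted guest is forced to one buffer per page.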

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 21 +++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a654b3699c4c..2d44eaf932a8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -12,6 +12,7 @@
 #include <linux/pci.h>
 #include <linux/export.h>
 #include <linux/skbuff.h>
+#include <linux/cc_platform.h>
 
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
@@ -744,6 +745,23 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On confidential VMs with guest memory encryption, all DMA goes
+	 * through SWIOTLB bounce buffers. Sub-page RX fragments may not
+	 * be properly bounce-buffered, so use fullpage buffers.
+	 */
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +772,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH net-next] net: mana: Force single RX buffer per page under SWIOTLB bounce modes
From: Dipayaan Roy @ 2026-04-27 14:41 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

The MANA driver has two distinct DMA paths for RX buffers:

1. Without PP_FLAG_DMA_MAP: The driver maps full pages manually,
   creating page-aligned mappings where the DMA offset is always zero.

2. With PP_FLAG_DMA_MAP: page_pool uses sub-page fragments, where
   multiple RX buffers share a single page. The pool maps the whole
   page once, and subsequent allocations use offsets into that region.

Path 2 is problematic in two scenarios where DMA must go through
SWIOTLB bounce buffers:

- Confidential VMs (AMD SEV-SNP, Intel TDX): guest memory is encrypted
  and the NIC cannot access it directly due to lack of TDISP support.
  All DMA must use SWIOTLB bounce buffers.

- Force-bounce mode (swiotlb=force): all DMA is routed through bounce
  buffers regardless of whether the system is a CVM.

In both cases, sub-page RX buffer fragments allocated via page_pool may
not be compatible with bounce buffering, leading to failures in the RX
path.

Add a check using is_swiotlb_force_bounce() in
mana_use_single_rxbuf_per_page() to detect when force-bounce is active
for the device and force single RX buffer per page allocation.

Note: This issue likely affects any NIC driver using page_pool with
sub-page fragment allocation under SWIOTLB. A more general fix at
the page_pool level may be desirable. Seeking feedback on the
preferred approach.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 2d44eaf932a8..841421baf0de 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -12,6 +12,7 @@
 #include <linux/pci.h>
 #include <linux/export.h>
 #include <linux/skbuff.h>
+#include <linux/swiotlb.h>
 #include <linux/cc_platform.h>
 
 #include <net/checksum.h>
@@ -748,10 +749,15 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 static bool
 mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
 {
+	struct gdma_context *gc = apc->ac->gdma_dev->gdma_context;
+
 	/* On confidential VMs with guest memory encryption, all DMA goes
 	 * through SWIOTLB bounce buffers. Sub-page RX fragments may not
 	 * be properly bounce-buffered, so use fullpage buffers.
 	 */
+	if (is_swiotlb_force_bounce(gc->dev))
+		return true;
+
 	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
 		return true;
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2] mshv: Fix large page unmap count in error path
From: Stanislav Kinsburskii @ 2026-04-27 14:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

When hv_do_map_pfns() fails after partially mapping large pages, the
unmap count passed to hv_call_unmap_pfns() is incorrect. The 'done'
variable tracks the number of large pages mapped, but the unmap
function expects the count in 4KB page units.

This causes incomplete cleanup on error, potentially leaving stale
mappings in the partition. Shift the count by large_shift to convert
from large page count to 4KB page count before calling the unmap
function.
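
The unit conversion can be illustrated with a short sketch. It assumes
2 MiB large pages over 4 KiB base pages, hence a shift of 9. The name
SKETCH_LARGE_SHIFT and the helper are hypothetical, and the actual
value of large_shift is not shown in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB large page / 4 KiB base page = 512 = 1 << 9. */
#define SKETCH_LARGE_SHIFT 9

/* Convert the number of pages already mapped into 4 KiB page units. */
static uint64_t sketch_unmap_count(uint64_t done, bool large_page)
{
	if (large_page)
		return done << SKETCH_LARGE_SHIFT;

	return done;
}
```

Under that assumption, three mapped large pages correspond to 1536
4 KiB pages. Passing 3 instead would unmap only a small part of the
first large page's range.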

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 7f91096f95a8e..ab210a7fcb8c3 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -255,8 +255,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
-		if (flags & HV_MAP_GPA_LARGE_PAGE)
+		if (flags & HV_MAP_GPA_LARGE_PAGE) {
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+			done <<= large_shift;
+		}
 		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
 	}
 



^ permalink raw reply related

* [PATCH v2] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Stanislav Kinsburskii @ 2026-04-27 14:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Restore interrupt state before breaking out of the loop on error.

The irq_flags are saved before entering the loop, but the early exit
path on error fails to restore them. This leaves interrupts in an
inconsistent state and can lead to lockdep warnings or other
interrupt-related issues.
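
The pairing rule behind this fix can be modelled in userspace. The
sketch below is not the kernel code: a counter stands in for
local_irq_save()/local_irq_restore(), all sketch_* names are
hypothetical, and it assumes the save is taken once per batch, as the
placement of the restore in the diff suggests.

```c
/* Counts outstanding "saves"; must be 0 whenever the function returns. */
static int sketch_irq_depth;

static void sketch_irq_save(void)
{
	sketch_irq_depth++;
}

static void sketch_irq_restore(void)
{
	sketch_irq_depth--;
}

/* Process nbatches batches; batch number fail_at (if any) hits an error. */
static int sketch_map_batches(int nbatches, int fail_at)
{
	int i, ret = 0;

	for (i = 0; i < nbatches; i++) {
		sketch_irq_save();

		if (i == fail_at)
			ret = -1;

		if (ret) {
			/* Early exit must undo the save taken above. */
			sketch_irq_restore();
			break;
		}

		/* The hypercall would be issued here. */
		sketch_irq_restore();
	}

	return ret;
}
```

Without the restore in the error branch, sketch_irq_depth would remain
at 1 after a failed batch, which is the userspace analogue of returning
with interrupts still disabled.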

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index ab210a7fcb8c3..61291ec6f3468 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -229,8 +229,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 			} else {
 				pfnlist[i] = mmio_spa + done + i;
 			}
-		if (ret)
+		if (ret) {
+			local_irq_restore(irq_flags);
 			break;
+		}
 
 		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
 					     input_page, NULL);



^ permalink raw reply related

