Linux-HyperV List
* [PATCH net v3 4/5] net: mana: Don't overwrite port probe error with add_adev result
From: Erni Sri Satya Vennela @ 2026-04-15  8:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260415080944.732901-1-ernis@linux.microsoft.com>

In mana_probe(), if mana_probe_port() fails for any port, the error
is stored in 'err' and the loop breaks. However, the subsequent
unconditional 'err = add_adev(gd, "eth")' overwrites this error.
If add_adev() succeeds, mana_probe() returns success despite ports
being left in a partially initialized state (ac->ports[i] == NULL).

Only call add_adev() when there is no prior error, so the probe
correctly fails and triggers mana_remove() cleanup.

Fixes: ced82fce77e9 ("net: mana: Probe rdma device in mana driver")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
*  Fix inaccurate comments.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ce1b7ec46a27..39b18577fb51 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3680,10 +3680,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (!resuming) {
 		for (i = 0; i < ac->num_ports; i++) {
 			err = mana_probe_port(ac, i, &ac->ports[i]);
-			/* we log the port for which the probe failed and stop
-			 * probes for subsequent ports.
-			 * Note that we keep running ports, for which the probes
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the probe failed, stop probing
+			 * subsequent ports, and skip add_adev.
+			 * mana_remove() will clean up already-probed ports.
 			 */
 			if (err) {
 				dev_err(dev, "Probe Failed for port %d\n", i);
@@ -3697,10 +3696,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 			enable_work(&apc->queue_reset_work);
 			err = mana_attach(ac->ports[i]);
 			rtnl_unlock();
-			/* we log the port for which the attach failed and stop
-			 * attach for subsequent ports
-			 * Note that we keep running ports, for which the attach
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the attach failed, stop
+			 * attaching subsequent ports, and skip add_adev.
+			 * mana_remove() will clean up already-attached ports.
 			 */
 			if (err) {
 				dev_err(dev, "Attach Failed for port %d\n", i);
@@ -3709,7 +3707,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		}
 	}
 
-	err = add_adev(gd, "eth");
+	if (!err)
+		err = add_adev(gd, "eth");
 
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
-- 
2.34.1
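
For readers following the control flow, the effect of this change can be
modeled outside the kernel. The sketch below is a user-space illustration
only: probe_port(), add_adev_stub(), NUM_PORTS and the fail_port parameter
are hypothetical stand-ins and not the driver's own functions. It only
shows why an unconditional assignment after the loop loses the earlier
error, and how the "if (!err)" guard preserves it.

```c
#include <assert.h>
#include <errno.h>

#define NUM_PORTS 4

/* Hypothetical stand-in for mana_probe_port(): fails for one chosen port. */
static int probe_port(int i, int fail_port)
{
	return i == fail_port ? -ENOMEM : 0;
}

/* Hypothetical stand-in for add_adev(): always succeeds in this model. */
static int add_adev_stub(void)
{
	return 0;
}

/* Old shape: the add_adev result replaces any earlier port error. */
int probe_overwriting(int fail_port)
{
	int err = 0;
	int i;

	for (i = 0; i < NUM_PORTS; i++) {
		err = probe_port(i, fail_port);
		if (err)
			break;
	}

	err = add_adev_stub();	/* the port error is lost here */
	return err;
}

/* Patched shape: add_adev is attempted only when no port failed. */
int probe_preserving(int fail_port)
{
	int err = 0;
	int i;

	for (i = 0; i < NUM_PORTS; i++) {
		err = probe_port(i, fail_port);
		if (err)
			break;
	}

	if (!err)
		err = add_adev_stub();
	return err;
}
```

With port 1 failing, probe_overwriting(1) reports success while
probe_preserving(1) returns the original error, which is what lets the
caller run its cleanup path.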



* [PATCH net v3 3/5] net: mana: Guard mana_remove against double invocation
From: Erni Sri Satya Vennela @ 2026-04-15  8:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260415080944.732901-1-ernis@linux.microsoft.com>

If PM resume fails (e.g., mana_attach() returns an error), mana_probe()
calls mana_remove(), which tears down the device and sets
gd->gdma_context = NULL and gd->driver_data = NULL.

However, a failed resume callback does not automatically unbind the
driver. When the device is eventually unbound, mana_remove() is invoked
a second time. Without a NULL check, it dereferences gc->dev with
gc == NULL, causing a kernel panic.

Add an early return if gdma_context or driver_data is NULL so the second
invocation is harmless. Move the dev = gc->dev assignment after the
guard so it cannot dereference NULL.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
* Add this patch to the patchset
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 468ed60a8a00..ce1b7ec46a27 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3731,11 +3731,16 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	struct gdma_context *gc = gd->gdma_context;
 	struct mana_context *ac = gd->driver_data;
 	struct mana_port_context *apc;
-	struct device *dev = gc->dev;
+	struct device *dev;
 	struct net_device *ndev;
 	int err;
 	int i;
 
+	if (!gc || !ac)
+		return;
+
+	dev = gc->dev;
+
 	disable_work_sync(&ac->link_change_work);
 	cancel_delayed_work_sync(&ac->gf_stats_work);
 
-- 
2.34.1
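
The guard described above can be illustrated with a small user-space
model. Everything here is hypothetical (struct gdev, struct ctx,
model_remove()); the real mana_remove() returns void, and the model
returns an int only so that a caller can observe whether teardown ran.
It is a sketch of the idempotent-teardown pattern, not the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct gdma_context. */
struct ctx {
	int dev_id;
};

/* Hypothetical stand-in for struct gdma_dev. */
struct gdev {
	struct ctx *gdma_context;	/* models gd->gdma_context */
	void *driver_data;		/* models gd->driver_data */
};

/* Returns 1 if teardown ran, 0 if the call was a harmless no-op. */
int model_remove(struct gdev *gd)
{
	struct ctx *gc = gd->gdma_context;
	void *ac = gd->driver_data;
	int dev;

	/* Second invocation: both pointers were cleared by the first one. */
	if (!gc || !ac)
		return 0;

	dev = gc->dev_id;	/* safe: gc was checked above */
	(void)dev;

	free(ac);
	gd->driver_data = NULL;
	gd->gdma_context = NULL;
	return 1;
}
```

Calling model_remove() twice on the same object runs the teardown once;
the second call returns early instead of dereferencing a NULL context.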



* [PATCH net v3 2/5] net: mana: Init gf_stats_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-15  8:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260415080944.732901-1-ernis@linux.microsoft.com>

Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(),
while keeping schedule_delayed_work() at its original location.

Previously, if any function between mana_create_eq() and the
INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove()
which unconditionally calls cancel_delayed_work_sync(gf_stats_work)
on the uninitialized work struct. This can trigger WARN_ON
in __flush_work() or debug object warnings with
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: be4f1d67ec56 ("net: mana: Add standard counter rx_missed_errors")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
* No change
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index e3e4b6de6668..468ed60a8a00 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3635,6 +3635,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
+	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
+
 	err = mana_create_eq(ac);
 	if (err) {
 		dev_err(dev, "Failed to create EQs: %d\n", err);
@@ -3709,7 +3711,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	err = add_adev(gd, "eth");
 
-	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
 out:
-- 
2.34.1



* [PATCH net v3 1/5] net: mana: Init link_change_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-15  8:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260415080944.732901-1-ernis@linux.microsoft.com>

Move INIT_WORK(link_change_work) to right after the mana_context
allocation, before any error path that could reach mana_remove().

Previously, if mana_create_eq() or mana_query_device_cfg() failed,
mana_probe() would jump to the error path which calls mana_remove().
mana_remove() unconditionally calls disable_work_sync(link_change_work),
but the work struct had not been initialized yet. This can trigger
WARN_ON in __flush_work() or debug object warnings with
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: 54133f9b4b53 ("net: mana: Support HW link state events")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
* No change.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 6302432b9bf6..e3e4b6de6668 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3631,6 +3631,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 		ac->gdma_dev = gd;
 		gd->driver_data = ac;
+
+		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
 	err = mana_create_eq(ac);
@@ -3648,8 +3650,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	if (!resuming) {
 		ac->num_ports = num_ports;
-
-		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	} else {
 		if (ac->num_ports != num_ports) {
 			dev_err(dev, "The number of vPorts changed: %d->%d\n",
-- 
2.34.1
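
The ordering problem fixed by patches 1 and 2 can be modeled in user
space. The sketch below does not use the kernel workqueue API: struct
model_work, WORK_MAGIC, model_init_work(), model_cancel_sync() and
model_probe() are all hypothetical names, and the magic value merely
stands in for "this work struct has been initialized". The assumption
is that the error path always runs a synchronous cancel, as the commit
message describes for mana_remove().

```c
#include <assert.h>
#include <errno.h>

#define WORK_MAGIC 0x574f524bU	/* arbitrary marker for "initialized" */

/* Hypothetical stand-in for a work struct. */
struct model_work {
	unsigned int magic;
	int pending;
};

void model_init_work(struct model_work *w)
{
	w->magic = WORK_MAGIC;
	w->pending = 0;
}

/* Returns 0 on success, -EINVAL when the work was never initialized. */
int model_cancel_sync(struct model_work *w)
{
	if (w->magic != WORK_MAGIC)
		return -EINVAL;	/* models the warning in the kernel */
	w->pending = 0;
	return 0;
}

/*
 * init_early selects the patched ordering; fail_early simulates a failure
 * in an early probe step. On that failure the error path cancels the work
 * unconditionally and the function returns the cancel status.
 */
int model_probe(struct model_work *w, int init_early, int fail_early)
{
	w->magic = 0;		/* models a freshly zeroed allocation */
	w->pending = 0;

	if (init_early)
		model_init_work(w);

	if (fail_early)
		return model_cancel_sync(w);

	if (!init_early)
		model_init_work(w);	/* old ordering: initialized late */
	return 0;
}
```

With the old ordering and an early failure the cancel sees an
uninitialized object; with the initialization moved up it is always safe.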



* [PATCH net v3 0/5] net: mana: Fix probe/remove error path bugs
From: Erni Sri Satya Vennela @ 2026-04-15  8:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel

Fix five bugs in mana_probe()/mana_remove() error handling that can
cause warnings on uninitialized work structs, NULL pointer dereferences,
masked errors, and resource leaks when early probe steps fail.

Patches 1-2 move work struct initialization (link_change_work and
gf_stats_work) to before any error path that could trigger
mana_remove(), preventing WARN_ON in __flush_work() or debug object
warnings when sync cancellation runs on uninitialized work structs.

Patch 3 guards mana_remove() against double invocation. If PM resume
fails, mana_probe() calls mana_remove() which sets gdma_context and
driver_data to NULL. A failed resume does not unbind the driver, so
when the device is eventually unbound, mana_remove() is called again
and dereferences NULL, causing a kernel panic. An early return on
NULL gdma_context or driver_data makes the second call harmless.

Patch 4 prevents add_adev() from overwriting a port probe error,
which could leave the driver in a broken state with NULL ports while
reporting success.

Patch 5 changes 'goto out' to 'break' in mana_remove()'s port loop
so that mana_destroy_eq() is always reached, preventing EQ leaks when
a NULL port is encountered.
---
Changes in v3:
* Add patch 3: net: mana: Guard mana_remove against double invocation.
* Fix inaccurate comments.
* Correct Fixes tag from ca9c54d2d6a5 to 1e2d0824a9c3.
Changes in v2:
* Apply the patchset in net instead of net-next.
---
Erni Sri Satya Vennela (5):
  net: mana: Init link_change_work before potential error paths in probe
  net: mana: Init gf_stats_work before potential error paths in probe
  net: mana: Guard mana_remove against double invocation
  net: mana: Don't overwrite port probe error with add_adev result
  net: mana: Fix EQ leak in mana_remove on NULL port

 drivers/net/ethernet/microsoft/mana/mana_en.c | 35 +++++++++++--------
 1 file changed, 20 insertions(+), 15 deletions(-)

-- 
2.34.1
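
The difference between 'goto out' and 'break' that patch 5 relies on can
be shown with a user-space model. The names below (struct model_ctx,
remove_with_goto(), remove_with_break(), eq_allocated) are hypothetical,
and the layout of the real mana_remove() is only assumed from the patch
description: a per-port loop followed by the EQ teardown.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PORTS 3

/* Hypothetical stand-in for struct mana_context. */
struct model_ctx {
	const char *ports[NUM_PORTS];	/* NULL marks a port that never probed */
	int eq_allocated;		/* 1 while the event queues exist */
};

/* Old shape: a NULL port jumps past the EQ teardown, leaking the EQs. */
void remove_with_goto(struct model_ctx *ac)
{
	int i;

	for (i = 0; i < NUM_PORTS; i++) {
		if (!ac->ports[i])
			goto out;
		ac->ports[i] = NULL;	/* per-port cleanup */
	}

	ac->eq_allocated = 0;		/* models mana_destroy_eq() */
out:
	return;
}

/* Patched shape: the loop exits normally, so the teardown always runs. */
void remove_with_break(struct model_ctx *ac)
{
	int i;

	for (i = 0; i < NUM_PORTS; i++) {
		if (!ac->ports[i])
			break;
		ac->ports[i] = NULL;	/* per-port cleanup */
	}

	ac->eq_allocated = 0;		/* models mana_destroy_eq() */
}
```

When ports[0] is NULL, remove_with_goto() leaves eq_allocated set, while
remove_with_break() clears it.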



* Re: [PATCH net v2 3/4] net: mana: Don't overwrite port probe error with add_adev result
From: Erni Sri Satya Vennela @ 2026-04-15  7:04 UTC (permalink / raw)
  To: Simon Horman
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260414153503.769476-1-horms@kernel.org>

On Tue, Apr 14, 2026 at 04:35:03PM +0100, Simon Horman wrote:
> From: 'Simon Horman' <horms@kernel.org>
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> net: mana: Don't overwrite port probe error with add_adev result
> 
> This commit ensures that if mana_probe_port() or mana_attach() fails, the
> error is not overwritten by the result of add_adev(). By preserving the
> error, it triggers mana_remove() cleanup instead of leaving ports partially
> initialized.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9d095a6fb56c7..4d42cbfb3f15b 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -3634,10 +3634,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> >  	if (!resuming) {
> >  		for (i = 0; i < ac->num_ports; i++) {
> >  			err = mana_probe_port(ac, i, &ac->ports[i]);
> > -			/* we log the port for which the probe failed and stop
> > -			 * probes for subsequent ports.
> > -			 * Note that we keep running ports, for which the probes
> > -			 * were successful, unless add_adev fails too
> > +			/* Log the port for which the probe failed, stop probing
> > +			 * subsequent ports, and skip add_adev.
> > +			 * Already-probed ports remain functional.
> >  			 */
> 
> Are these comments accurate regarding ports remaining functional?
> 
> Since this patch preserves the error code, execution falls through to the end
> of mana_probe() where it checks if err is set and calls mana_remove().
> mana_remove() will iterate over the successfully allocated ports, completely
> detach them, unregister their net_devices, and destroy event queues.
> 
> It appears no ports remain functional because the device is completely torn
> down.
>
I will correct the comments in the next version of the patch.
> [ ... ]
> 
> > @@ -3651,10 +3650,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> >  			enable_work(&apc->queue_reset_work);
> >  			err = mana_attach(ac->ports[i]);
> >  			rtnl_unlock();
> > -			/* we log the port for which the attach failed and stop
> > -			 * attach for subsequent ports
> > -			 * Note that we keep running ports, for which the attach
> > -			 * were successful, unless add_adev fails too
> > +			/* Log the port for which the attach failed, stop
> > +			 * attaching subsequent ports, and skip add_adev.
> > +			 * Already-attached ports remain functional.
> >  			 */
> >  			if (err) {
> >  				dev_err(dev, "Attach Failed for port %d\n", i);
> > @@ -3663,7 +3661,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> 
> Does propagating this error during PM resume introduce a NULL pointer
> dereference?
> 
> If PM resume fails here (resuming == true), the preserved error code causes
> mana_probe() to call mana_remove(), which tears down the device and sets
> gd->gdma_context = NULL.
> 
> However, a failed resume callback does not automatically unbind the driver.
> When the device is eventually unbound later, mana_remove() is invoked again:
> 
> drivers/net/ethernet/microsoft/mana/mana_en.c:mana_remove() {
>     struct gdma_context *gc = gd->gdma_context;
>     struct device *dev = gc->dev;
>     ...
> }
> 
> In the second invocation, gd->gdma_context is NULL, meaning gc is NULL,
> which could lead to a kernel panic when dereferencing gc->dev.

Thank you for pointing it out, Simon.
Since this is a pre-existing bug, I will fix it in a separate patch and
make that patch part of this patchset.

- Vennela


* Re: [PATCH net v2 4/4] net: mana: Fix EQ leak in mana_remove on NULL port
From: Erni Sri Satya Vennela @ 2026-04-15  7:01 UTC (permalink / raw)
  To: Simon Horman
  Cc: pabeni, edumazet, ssengar, dipayanroy, gargaditya, decui,
	shacharr, linux-hyperv, kotaranov, kys, shirazsaleem, longli,
	kuba, andrew+netdev, wei.liu, linux-kernel, netdev, haiyangz,
	kees, stephen, leon, davem
In-Reply-To: <20260414154058.771241-1-horms@kernel.org>

On Tue, Apr 14, 2026 at 04:40:58PM +0100, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Fix EQ leak in mana_remove on NULL port
> 
> In mana_remove(), when a NULL port is encountered in the port iteration
> loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
> queues allocated earlier by mana_create_eq(). This can happen when
> mana_probe_port() fails for port 0, leaving ac->ports[0] as NULL. On
> driver unload or error cleanup, mana_remove() hits the NULL entry and
> jumps past mana_destroy_eq(). Change 'goto out' to 'break' so the
> for-loop exits normally and mana_destroy_eq() is always reached.
> 
> > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
> 
> Does this Fixes tag point to the correct commit? Looking at the commit
> history, the original driver commit ca9c54d2d6a5 did not have
> mana_destroy_eq() in mana_remove() at all, so the 'goto out' pattern could
> not have caused an EQ leak at that point.
> 
> The bug appears to have been introduced by commit 1e2d0824a9c3 ("net: mana:
> Add support for EQ sharing"), which added mana_destroy_eq(ac) to
> mana_remove() and placed it after the 'out:' label, making it skippable by
> the 'goto out' on NULL port.
> 
> Should the Fixes tag be:
>   Fixes: 1e2d0824a9c3 ("net: mana: Add support for EQ sharing")
Thank you for the correction, Simon.
I will make this change in the next version of the patchset.

- Vennela


* Re: [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Wei Liu @ 2026-04-15  5:54 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Jork Loeser, linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260414-successful-loyal-magpie-9b7faf@anirudhrb>

On Tue, Apr 14, 2026 at 03:49:53PM +0000, Anirudh Rayabharam wrote:
> On Tue, Apr 07, 2026 at 02:27:52PM -0700, Jork Loeser wrote:
> > On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:
> > 
> > > On Fri, Apr 03, 2026 at 12:06:10PM -0700, Jork Loeser wrote:
> > > > The SynIC is shared between VMBus and MSHV. VMBus owns the message
> > > > page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> > > > and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> > > > 
> > > > Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
> > > 
> > > The redundant enable is probably a no-op from the hypervisor side so it
> > > probably doesn't hurt us. The main problem is with the tear down.
> > 
> > It's an MSR intercept. If we can replace this by an "if()" we shave a few
> > cycles.
> > 
> > > An alternative approach could be: check if SIMP/SIEFP/SCONTROL is
> > > already enabled. If so, don't enable it again. If not enabled, enable it
> > > and keep track of everything we have enabled. Then disable all of
> > > them during cleanup. This approach makes fewer assumptions about the
> > > behavior of the VMBUS driver and what stuff it does or doesn't use.
> > 
> > It would, yes. Then again, we drag yet more state and make debugging more
> > complicated / less clear to reason what happens dynamically. I had been
> > debating this briefly myself, and ultimately decided against it for that
> > very reason.
> 
> Ultimately, both approaches are fragile in their own ways because the
> contract that "VMBus owns SIMP, SIEFP, SCONTROL, SINT2 and MSHV owns
> SIRBP and SINT0 and SINT5" is not enforced anywhere in code and is
> just an assumption that everyone will play nice. To enforce it, we'll need to
> refactor the code such that there is a common component that sort of
> facilitates access to SynIC for both VMBus and MSHV.
> 
> I would say that checking the state dynamically and then deciding
> whether or not to enable SIMP/SIEFP/SCONTROL would be less fragile
> because we make fewer assumptions about what VMBus does or doesn't do.
> 

I think it is important to keep the changes as small as possible for
ease of backporting.

> Also, do you know of any cases where the VMBus stuff can get initialized
> after MSHV? Maybe if VMBus is a module (if that is even possible)? That
> would really mess up our logic here.
> 

It is possible to configure VMBus as a module today. We should think of
a way to address the problem you describe. Designing a new component is
one way. The other way is to find a working build configuration and
enforce it via Kconfig.

Wei

> Thanks,
> Anirudh.
> 
> > 
> > Best,
> > Jork


* [PATCH net] hv_sock: Report EOF instead of -EIO for FIN
From: Dexuan Cui @ 2026-04-14 23:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, sgarzare, davem, edumazet,
	kuba, pabeni, horms, niuxuewei.nxw, linux-hyperv, virtualization,
	netdev, linux-kernel
  Cc: stable, Ben Hillis, Mitchell Levy

Commit f0c5827d07cb unluckily causes a regression for the FIN packet,
and the final read syscall gets an error rather than 0.

Ideally, we would want to fix hvs_channel_readable_payload() so that it
could return 0 in the FIN scenario, but it's not good for the hv_sock
driver to use the VMBus ringbuffer's cached priv_read_index, which is
internal data in the VMBus driver.

Fix the regression in hv_sock by returning 0 rather than -EIO.

Fixes: f0c5827d07cb ("hv_sock: Return the readable bytes in hvs_stream_has_data()")
Cc: stable@vger.kernel.org
Reported-by: Ben Hillis <Ben.Hillis@microsoft.com>
Reported-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 net/vmw_vsock/hyperv_transport.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 069386a74557..63d3549125be 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -703,8 +703,22 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
 	switch (hvs_channel_readable_payload(hvs->chan)) {
 	case 1:
 		need_refill = !hvs->recv_desc;
-		if (!need_refill)
-			return -EIO;
+		if (!need_refill) {
+			/* Here hvs->recv_data_len is 0, so hvs->recv_desc must
+			 * be NULL unless it points to the 0-byte-payload FIN
+			 * packet: see hvs_update_recv_data().
+			 *
+			 * Here all the payload has been dequeued, but
+			 * hvs_channel_readable_payload() still returns 1,
+			 * because the VMBus ringbuffer's read_index is not
+			 * updated for the FIN packet: hvs_stream_dequeue() ->
+			 * hv_pkt_iter_next() updates the cached priv_read_index
+			 * but has no opportunity to update the read_index in
+			 * hv_pkt_iter_close() as hvs_stream_has_data() returns
+			 * 0 for the FIN packet, so it won't get dequeued.
+			 */
+			return 0;
+		}
 
 		hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
 		if (!hvs->recv_desc)
-- 
2.49.0
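
The decision this patch changes can be reduced to a small user-space
model. This is a simplified sketch and not the hv_sock code: struct
model_sock and both functions are hypothetical, and the model assumes
(based on the comment in the patch) that the problematic state is
"channel still reports readable, a descriptor is held, and no payload
bytes remain", which is the state left behind by a 0-byte FIN packet.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the relevant hv_sock state. */
struct model_sock {
	int channel_readable;	/* models hvs_channel_readable_payload() == 1 */
	int have_desc;		/* models hvs->recv_desc != NULL */
	long recv_data_len;	/* models hvs->recv_data_len */
};

/* Old shape: the FIN state is reported as an I/O error. */
long has_data_old(const struct model_sock *s)
{
	if (s->recv_data_len > 0)
		return s->recv_data_len;
	if (s->channel_readable && s->have_desc)
		return -EIO;
	return 0;
}

/* Patched shape: the FIN state reports "no data", so the reader sees EOF. */
long has_data_new(const struct model_sock *s)
{
	if (s->recv_data_len > 0)
		return s->recv_data_len;
	return 0;
}
```

In the FIN state the old function returns -EIO and the new one returns 0;
when payload bytes remain, both return the remaining length.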



* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-04-14 19:43 UTC (permalink / raw)
  To: Michael Kelley
  Cc: easwar.hariharan, Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB415741EA3C27630199BCEAA3D4252@SN6PR02MB4157.namprd02.prod.outlook.com>

On 4/14/2026 11:06 AM, Michael Kelley wrote:
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Tuesday, April 14, 2026 10:42 AM
>>
> 
> [snip]
>  
>>>> Thanks for that explanation, that makes sense. I didn't see any serialization
>>>> that would ensure that the VMBus path to communicate the child devices on the bus
>>>> would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's
>>>
>>> FWIW, hv_pci_query_relations() should be ensuring that the communication
>>> has completed before it returns. It does a wait_for_response(), which ensures
>>> that the Hyper-V host has sent the PCI_BUS_RELATIONS[2] response. However,
>>> that message spins off work to the hbus->wq workqueue, so
>>> hv_pci_query_relations() has a flush_workqueue() to ensure everything that
>>> was queued has completed.
>>
>> Hm, I read the comment for the flush_workqueue() as addressing the "PCI_BUS_RELATIONS[2]
>> message arrived before we sent the QUERY_BUS_RELATIONS message" race case, not as an
>> "all child devices have definitely been received and processed in response to our
>> QUERY_BUS_RELATIONS message". Also, knowing very little about the VMBus contract, I
>> discounted the 100 ms timeout in wait_for_response() as a serialization guarantee.
> 
> Yeah, that timeout is so that the code can wake up every 100 ms to check
> if the device has been rescinded (i.e., removed). If the device isn't
> rescinded, wait_for_response() waits forever until a response comes in.

I don't know how I missed that. :(

>>
>> Chalk it up to previous experience dealing with hardware that's *supposed* to be
>> spec-compliant and complete initialization within specified timings. :)
>>
>> I see now that the flush is sufficient though.
>>
>>>
>>> Thinking more about the "hv_pcibus_installed" case, if that path is ever
>>> triggered, I don't think anything needs to be done with the logical device ID.
>>> The vPCI device has already been fully initialized on the Linux side, and its
>>> logical device ID would not change.
>>>
>>> So I think you could construct the full logical device ID once
>>> hv_pci_query_relations() returns to hv_pci_probe().
>>
>> Let me think about this more and decide between the logical ID and full bus GUID
>> options.
>>
>>>
>>>> safest to take the approach to communicate the GUID, and find the function number from
>>>> the pci_dev. This does mean that there will be an essentially identical copy of
>>>> hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.
>>>
>>> With this alternative approach, is there a need to communicate the full
>>> GUID to the pvIOMMU driver? Couldn't you just communicate bytes 4 through
>>> 7, which would be logical device ID minus the function number?
>>
>> Yes, we could just communicate bytes 4 through 7 but the pvIOMMU version of the build logical
>> ID function would diverge from the pci-hyperv version. I figured if we say (in a comment)
>> that this is the same ID as generated in pci-hyperv, it's better for future readers to see it
>> to be clearly identical at first glance.
>>
>> It's also possible to change the pci-hyperv function to only take bytes 4 through 7 instead of the
>> full GUID, but I rather think we don't need that impedance mismatch of bytes 4 through 7 of the
>> GUID becoming bytes 0 through 3 of a u32.
>>
>>>

<snip>

>>
>>>>>>>
>>>>>>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
>>>>>>> remove the information if a PCI pass-thru device is unbound or removed, as
>>>>>>> the logical device ID will be the same if the device ever comes back. At worst,
>>>>>>> the IOMMU driver can simply replace an existing logical device ID if a new one
>>>>>>> is provided for the same PCI domain ID.
>>>>>>
>>>>>> As above, replacing a unique GUID when a result is found for a non-unique
>>>>>> key value may be prone to failure if it happens that the device that came "back"
>>>>>> is not in fact the same device (or class of device) that went away and just happens
>>>>>> to, either due to bytes 4 and 5 being identical, or due to collision in the
>>>>>> pci_domain_nr_dynamic_ida, have the same domain number.
>>>>
>>>> Given the vPCI team's statements (above), I think we will need to handle unbind or
>>>> removal and ensure the pvIOMMU driver's data structure is invalidated when either
>>>> happens.
>>>
>>> The generic PCI code should handle detaching from the pvIOMMU. So I'm assuming
>>> your statement is specifically about the mapping from domain ID to logical device ID.
>>
>> Yes, apologies for the vagueness (again).
>>
>>> I still think removing it may be unnecessary since adding a mapping for a new vPCI
>>> device with the same domain ID but different logical device ID could just overwrite
>>> any existing mapping. And leaving a dead mapping in the pvIOMMU data structures
>>> doesn’t actually hurt anything. On the other hand, removing/invalidating it is
>>> certainly more tidy and might prevent some confusion down the road.
>>>
>>
>> Yes, if the data structure maps domain -> logical ID, we can do the overwrite as you say.
>> With my approach of informing the pvIOMMU driver of the entire (bus) GUID, we would want
>> to be careful that we don't assume the 1:1 bus<->device case and overwrite an existing
>> device entry with a new device that's on the same bus.
> 
> Yes, that's a valid point.  I was assuming that the pvIOMMU would use the
> domain ID as the lookup key, since the domain ID is directly available from the
> struct pci_dev that is an input parameter to the IOMMU functions. But in the
> not 1:1 case, that domain ID might refer to a bus with multiple functions. The
> logical device IDs for those devices will be the same except for the low order
> 3 bits that encode the function number. So maybe the domain ID maps
> to a partial logical device ID, and the pvIOMMU driver must always add in the
> function number so the not 1:1 case works.

Agreed.

> 
> Would the pvIOMMU driver do anything with the full GUID, except extract
> bytes 4 through 7? There's no way I see to use the full GUID as the lookup
> key.
> 

No, the hypercalls only use the logical ID, so the rest of the GUID is unused.

Thanks,
Easwar (he/him)
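
The "partial logical device ID plus function number" idea discussed above
can be sketched as follows. Only two facts come from the thread: the ID is
derived from bytes 4 through 7 of the GUID, and the low-order 3 bits carry
the function number. The exact byte ordering and the 0xf8 mask below are
assumptions made for illustration, and partial_logical_id() and
full_logical_id() are hypothetical names, not the exported
hv_build_logical_dev_id().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build the function-independent part of the ID from GUID bytes 4..7.
 * The byte ordering here is an assumption for illustration; the low
 * 3 bits are cleared so the function number can be added later.
 */
uint32_t partial_logical_id(const uint8_t guid[16])
{
	return ((uint32_t)guid[5] << 24) |
	       ((uint32_t)guid[4] << 16) |
	       ((uint32_t)guid[7] << 8) |
	       (uint32_t)(guid[6] & 0xf8);
}

/* Add the PCI function number (0..7) into the low-order 3 bits. */
uint32_t full_logical_id(uint32_t partial, unsigned int func)
{
	return partial | (func & 0x7);
}
```

Under this scheme a lookup keyed by domain ID would store only the partial
value, and each function on the same bus would derive its own ID by
calling full_logical_id() with its function number.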


* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Michael Kelley @ 2026-04-14 18:06 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <e7b3d040-b504-4665-a3ff-8d20261400ca@linux.microsoft.com>

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Tuesday, April 14, 2026 10:42 AM
>

[snip]
 
> >> Thanks for that explanation, that makes sense. I didn't see any serialization
> >> that would ensure that the VMBus path to communicate the child devices on the bus
> >> would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's
> >
> > FWIW, hv_pci_query_relations() should be ensuring that the communication
> > has completed before it returns. It does a wait_for_response(), which ensures
> > that the Hyper-V host has sent the PCI_BUS_RELATIONS[2] response. However,
> > that message spins off work to the hbus->wq workqueue, so
> > hv_pci_query_relations() has a flush_workqueue() to ensure everything that
> > was queued has completed.
> 
> Hm, I read the comment for the flush_workqueue() as addressing the "PCI_BUS_RELATIONS[2]
> message arrived before we sent the QUERY_BUS_RELATIONS message" race case, not as an
> "all child devices have definitely been received and processed in response to our
> QUERY_BUS_RELATIONS message". Also, knowing very little about the VMBus contract, I
> discounted the 100 ms timeout in wait_for_response() as a serialization guarantee.

Yeah, that timeout is so that the code can wake up every 100 ms to check
if the device has been rescinded (i.e., removed). If the device isn't
rescinded, wait_for_response() waits forever until a response comes in.
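
The loop described above can be sketched in plain C as follows. This is only an illustration of the "wake every 100 ms, check for rescind, otherwise keep waiting" pattern; it is not the pci-hyperv code, which uses kernel completion primitives. All names here (wait_for_response_sketch, struct sim_dev, sim_wait, sim_rescinded) are hypothetical, and the timed wait is simulated with a counter rather than a real sleep.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for the kernel primitives. */
typedef bool (*wait_fn)(void *ctx, unsigned int timeout_ms); /* true = response arrived */
typedef bool (*rescinded_fn)(void *ctx);                     /* true = device removed */

/*
 * Wait for a response with no overall deadline. The timeout passed to
 * wait() only bounds how long we go between rescind checks.
 */
static int wait_for_response_sketch(void *ctx, wait_fn wait,
				    rescinded_fn rescinded)
{
	for (;;) {
		if (rescinded(ctx))
			return -ENODEV;
		if (wait(ctx, 100))
			return 0;
	}
}

/* Simulated device so the loop can be exercised without real timing. */
struct sim_dev {
	int waits_until_response; /* timed waits that expire before the response */
	int waits_until_rescind;  /* timed waits before rescind; -1 means never */
	int waits_done;           /* timed waits that have expired so far */
};

static bool sim_rescinded(void *ctx)
{
	struct sim_dev *d = ctx;

	return d->waits_until_rescind >= 0 &&
	       d->waits_done >= d->waits_until_rescind;
}

static bool sim_wait(void *ctx, unsigned int timeout_ms)
{
	struct sim_dev *d = ctx;

	(void)timeout_ms; /* the simulation counts waits instead of sleeping */
	if (d->waits_done >= d->waits_until_response)
		return true;
	d->waits_done++;
	return false;
}
```

The point of the shape is that a timeout expiring is never treated as an error: only a rescind ends the wait early.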

> 
> Chalk it up to previous experience dealing with hardware that's *supposed* to be
> spec-compliant and complete initialization within specified timings. :)
> 
> I see now that the flush is sufficient though.
> 
> >
> > Thinking more about the "hv_pcibus_installed" case, if that path is ever
> > triggered, I don't think anything needs to be done with the logical device ID.
> > The vPCI device has already been fully initialized on the Linux side, and its
> > logical device ID would not change.
> >
> > So I think you could construct the full logical device ID once
> > hv_pci_query_relations() returns to hv_pci_probe().
> 
> Let me think about this more and decide between the logical ID and full bus GUID
> options.
> 
> >
> >> safest to take the approach to communicate the GUID, and find the function number from
> >> the pci_dev. This does mean that there will be an essentially identical copy of
> >> hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.
> >
> > With this alternative approach, is there a need to communicate the full
> > GUID to the pvIOMMU driver? Couldn't you just communicate bytes 4 thru
> > 7, which would be logical device ID minus the function number?
> 
> Yes, we could just communicate bytes 4 through 7 but the pvIOMMU version of the build logical
> ID function would diverge from the pci-hyperv version. I figured if we say (in a comment)
> that this is the same ID as generated in pci-hyperv, it's better for future readers to see it
> to be clearly identical at first glance.
> 
> It's also possible to change the pci-hyperv function to only take bytes 4 through 7 instead of the
> full GUID, but I rather think we don't need that impedance mismatch of bytes 4 through 7 of the
> GUID becoming bytes 0 through 3 of a u32.
> 
> >
> >>
> >>>>
> >>>>>
> >>>>> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
> >>>>> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
> >>>>> responsible for storing this data in a form that it can later search. hv_pci_probe()
> >>>>> calls this new function when it instantiates a new PCI pass-thru device. Then when
> >>>>> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
> >>>>> from the struct pci_dev (or struct pci_bus), search for the related logical device
> >>>>> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
> >>>>> on the IOMMU driver, but that's a dependency in the desired direction. The
> >>>>> PCI domain ID and logical device ID are just integers, so no data structures are
> >>>>> shared.
> >>>>
> >>>> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
> >>>> of the GUID being used to create the domain number. I thought this approach could
> >>>> help with that too, but as I coded it up, I realized that using the domain number
> >>>> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
> >>>> is the wrong way around. It is unfortunately the only available key in the pci_dev
> >>>> handed to the pvIOMMU driver in this approach though...
> >>>>
> >>>> Do you think that's a fatal flaw?
> >>>
> >>> There are two uniqueness problems, which I didn't fully separate conceptually
> >>> until writing this. One problem is constructing a PCI domain ID that Linux can use
> >>> to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
> >>> device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
> >>> that they might not be unique. So there's code in hv_pci_probe() to pick another
> >>> number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
> >>> domain ID for the virtual PCI bus as it's purely a Linux construct.
> >>
> >> This part matters for the IOMMU driver as it is the key we will use to search the data
> >> structure to find the right GUID to construct the logical dev ID that Hyper-V recognizes.
> >
> > Right. But the Hyper-V vPCI driver in Linux ensures that the domain ID is unique
> > in the sense that two active vPCI devices will not have the same domain ID. So
> > the pvIOMMU driver should not encounter any ambiguity when looking up the
> > logical device ID.
> 
> Agreed, that was a fragment of a thought that I neglected to delete before sending.
> Apologies.
> 
> > As you noted below, it's possible that a vPCI device could go
> > away, and another vPCI device could be added that ends up with a domain ID
> > that was previously used. When that added vPCI device is set up by the Hyper-V
> > vPCI driver, it will inform the pvIOMMU driver about the domain ID -> logical
> > device ID mapping, and it might overwrite an existing mapping if the newly
> > added vPCI device ended up with a domain ID that had previously been used.
> > And that's fine.
> 
> Yes.
> 
> >>
> >>>
> >>> The second problem is the logical device ID that Hyper-V interprets to
> >>> identify a vPCI device in hypercalls such as HVCALL_RETARGET_INTERRUPT
> >>> and the new pvIOMMU related hypercalls. This logical device ID uses
> >>> GUID bytes 4 thru 7 (minus 1 bit).  I don’t think Linux uses the
> >>> logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
> >>> must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
> >>> That's something to confirm with the Hyper-V team. If they are just hoping
> >>> for the best, I don't know how Linux can solve the problem.
> >>
> >> I checked with the Hyper-V vPCI team on this aspect and the only guarantee that
> >> they provide is that, at any given time, there will only be 1 device with a given
> >> logical ID attached to a VM.
> >
> > OK, so Hyper-V is guaranteeing the uniqueness of vPCI device GUID bytes 4
> > thru 7 across all vPCI devices that are attached to a VM at a given point in time.
> > That's good!
> 
> Technically, they're guaranteeing only that the *combination* of GUID bytes 4 through 7 AND
> the slot number will be unique across all vPCI devices that are attached to a VM at a given
> point in time. As you say below, while we have in practice not seen multiple devices on a
> vPCI bus, the vPCI team asserts that there is no restriction in the stack on doing so.
> 

Agreed.

> >
> >> Once a device has been removed, everything about it is
> >> forgotten from the Hyper-V stack's perspective, and nothing in the Hyper-V stack would
> >> prevent a scenario where, for example, a data movement accelerator is attached with
> >> logical ID X, then revoked, and let's say a NIC is attached with the same logical ID X.
> >
> > And the "forgetting" behavior is the same in Linux. Once the device is removed,
> > Linux forgets everything about it. If a new vPCI device shows up and happens
> > to have the same GUID as a previous device, that should not cause any problems
> > in Linux.
> >
> >>
> >> Also, FWIW, they also stated that the GUID is not unique and cannot be
> >> guaranteed to be unique because it's the GUID for the bus, not the individual
> >> devices.
> >
> > I'm not sure I understand this statement. Is this referring to the possibility
> > that a vPCI "device" that Hyper-V offers to the guest might have multiple
> > functions?
> 
> Yes, apologies for the vagueness.
> 
> > The vPCI device driver in Linux has code to recognize this case,
> > but I'm not aware of any current cases where it happens. In such a case,
> > Linux should create a single PCI bus abstraction with multiple devices
> > attached to it, with each device being a different function. If Hyper-V
> > did ever offer a multiple-function configuration, there might be some
> > debugging to do in the Hyper-V vPCI driver in Linux!
> >
> > We shortcut the terminology by referring to a vPCI "device", and assuming
> > that devices and busses are 1-to-1. But the design allows for multiple devices
> > as different functions on the same bus.
> >
> >>
> 
> <snip>
> 
> >>>>>
> >>>>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
> >>>>> remove the information if a PCI pass-thru device is unbound or removed, as
> >>>>> the logical device ID will be the same if the device ever comes back. At worst,
> >>>>> the IOMMU driver can simply replace an existing logical device ID if a new one
> >>>>> is provided for the same PCI domain ID.
> >>>>
> >>>> As above, replacing a unique GUID when a result is found for a non-unique
> >>>> key value may be prone to failure if it happens that the device that came "back"
> >>>> is not in fact the same device (or class of device) that went away and just happens
> >>>> to, either due to bytes 4 and 5 being identical, or due to collision in the
> >>>> pci_domain_nr_dynamic_ida, have the same domain number.
> >>
> >> Given the vPCI team's statements (above), I think we will need to handle unbind or
> >> removal and ensure the pvIOMMU driver's data structure is invalidated when either
> >> happens.
> >
> > The generic PCI code should handle detaching from the pvIOMMU. So I'm assuming
> > your statement is specifically about the mapping from domain ID to logical device ID.
> 
> Yes, apologies for the vagueness (again).
> 
> > I still think removing it may be unnecessary since adding a mapping for a new vPCI
> > device with the same domain ID but different logical device ID could just overwrite
> > any existing mapping. And leaving a dead mapping in the pvIOMMU data structures
> > doesn’t actually hurt anything. On the other hand, removing/invalidating it is
> > certainly more tidy and might prevent some confusion down the road.
> >
> 
> Yes, if the data structure maps domain -> logical ID, we can do the overwrite as you say.
> With my approach of informing the pvIOMMU driver of the entire (bus) GUID, we would want
> to be careful that we don't assume the 1:1 bus<->device case and overwrite an existing
> device entry with a new device that's on the same bus.

Yes, that's a valid point. I was assuming that the pvIOMMU would use the
domain ID as the lookup key, since the domain ID is directly available from the
struct pci_dev that is an input parameter to the IOMMU functions. But in the
not 1:1 case, that domain ID might refer to a bus with multiple functions. The
logical device IDs for those devices will be the same except for the low order
3 bits that encode the function number. So maybe the domain ID maps
to a partial logical device ID, and the pvIOMMU driver must always add in the
function number so the not 1:1 case works.
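
As a rough illustration of that idea, here is a minimal user-space C sketch of a domain ID -> partial logical device ID table, where the lookup always ORs in the function number. Everything in it is hypothetical (the names hv_domain_map_set and hv_domain_map_lookup, the fixed-size array, the absence of locking); it only shows the shape of the mapping, not how the pvIOMMU driver would actually store it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HV_MAP_SLOTS 16 /* arbitrary size for this sketch */

struct hv_domain_map_entry {
	bool     used;
	uint32_t domain;     /* Linux PCI domain ID, the lookup key */
	uint32_t partial_id; /* logical device ID with the function bits clear */
};

static struct hv_domain_map_entry hv_domain_map[HV_MAP_SLOTS];

/*
 * Record the mapping for a domain, overwriting any stale entry left by a
 * previous device that used the same domain ID. Returns false if full.
 */
static bool hv_domain_map_set(uint32_t domain, uint32_t partial_id)
{
	struct hv_domain_map_entry *free_slot = NULL;

	for (size_t i = 0; i < HV_MAP_SLOTS; i++) {
		struct hv_domain_map_entry *e = &hv_domain_map[i];

		if (e->used && e->domain == domain) {
			e->partial_id = partial_id & ~0x7u;
			return true;
		}
		if (!e->used && !free_slot)
			free_slot = e;
	}
	if (!free_slot)
		return false;

	free_slot->used = true;
	free_slot->domain = domain;
	free_slot->partial_id = partial_id & ~0x7u;
	return true;
}

/* Look up by domain and add the function number in the low 3 bits. */
static bool hv_domain_map_lookup(uint32_t domain, unsigned int func,
				 uint64_t *logical_id)
{
	for (size_t i = 0; i < HV_MAP_SLOTS; i++) {
		const struct hv_domain_map_entry *e = &hv_domain_map[i];

		if (e->used && e->domain == domain) {
			*logical_id = (uint64_t)(e->partial_id | (func & 0x7u));
			return true;
		}
	}
	return false;
}
```

Because the stored value never contains function bits, two functions on the same bus share one entry and cannot overwrite each other.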

Would the pvIOMMU driver do anything with the full GUID, except extract
bytes 4 through 7? There's no way I see to use the full GUID as the lookup
key.

Michael


* RE: [EXTERNAL] [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Long Li @ 2026-04-14 17:48 UTC (permalink / raw)
  To: Thorsten Blum, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Greg Kroah-Hartman
  Cc: stable@vger.kernel.org, Ky Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260414111008.307220-2-thorsten.blum@linux.dev>

> Make kvp_register() return an error code instead of silently ignoring failures, and
> propagate the error from kvp_handle_handshake() instead of returning success.
> 
> This propagates both kzalloc_obj() and hvutil_transport_send() failures to
> kvp_handle_handshake() and thus to kvp_on_msg().
> 
> Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
> Cc: stable@vger.kernel.org
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>

Reviewed-by:  Long Li <longli@microsoft.com>


> ---
>  drivers/hv/hv_kvp.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
> index 0d73daf745a7..6180ebe040ff 100644
> --- a/drivers/hv/hv_kvp.c
> +++ b/drivers/hv/hv_kvp.c
> @@ -93,7 +93,7 @@ static void kvp_send_key(struct work_struct *dummy);
>  static void kvp_respond_to_host(struct hv_kvp_msg *msg, int error);
>  static void kvp_timeout_func(struct work_struct *dummy);
>  static void kvp_host_handshake_func(struct work_struct *dummy);
> -static void kvp_register(int);
> +static int kvp_register(int);
> 
>  static DECLARE_DELAYED_WORK(kvp_timeout_work, kvp_timeout_func);
>  static DECLARE_DELAYED_WORK(kvp_host_handshake_work, kvp_host_handshake_func);
> @@ -127,24 +127,26 @@ static void kvp_register_done(void)
>  	hv_poll_channel(kvp_transaction.recv_channel, kvp_poll_wrapper);
>  }
> 
> -static void
> +static int
>  kvp_register(int reg_value)
>  {
> 
>  	struct hv_kvp_msg *kvp_msg;
>  	char *version;
> +	int ret;
> 
>  	kvp_msg = kzalloc_obj(*kvp_msg);
> +	if (!kvp_msg)
> +		return -ENOMEM;
> 
> -	if (kvp_msg) {
> -		version = kvp_msg->body.kvp_register.version;
> -		kvp_msg->kvp_hdr.operation = reg_value;
> -		strcpy(version, HV_DRV_VERSION);
> +	version = kvp_msg->body.kvp_register.version;
> +	kvp_msg->kvp_hdr.operation = reg_value;
> +	strcpy(version, HV_DRV_VERSION);
> 
> -		hvutil_transport_send(hvt, kvp_msg, sizeof(*kvp_msg),
> -				      kvp_register_done);
> -		kfree(kvp_msg);
> -	}
> +	ret = hvutil_transport_send(hvt, kvp_msg, sizeof(*kvp_msg),
> +				    kvp_register_done);
> +	kfree(kvp_msg);
> +	return ret;
>  }
> 
>  static void kvp_timeout_func(struct work_struct *dummy)
> @@ -186,9 +188,8 @@ static int kvp_handle_handshake(struct hv_kvp_msg *msg)
>  	 */
>  	pr_debug("KVP: userspace daemon ver. %d connected\n",
>  		 msg->kvp_hdr.operation);
> -	kvp_register(dm_reg_value);
> 
> -	return 0;
> +	return kvp_register(dm_reg_value);
>  }
> 
> 


* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-04-14 17:42 UTC (permalink / raw)
  To: Michael Kelley
  Cc: easwar.hariharan, Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157BA7B07D328969EEBF2ADD4252@SN6PR02MB4157.namprd02.prod.outlook.com>

On 4/14/2026 9:24 AM, Michael Kelley wrote:
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Monday, April 13, 2026 2:30 PM
>>
>> On 4/9/2026 12:01 PM, Michael Kelley wrote:
>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Wednesday, April 8, 2026 1:21 PM
>>>>
>>>> On 1/11/2026 9:36 AM, Michael Kelley wrote:
>>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
>>>>>>
>>>>>> On 1/8/2026 10:46 AM, Michael Kelley wrote:
>>>>>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>>>>>>>
>>>>>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>>>>
>>>>>>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
>>>>>>>> child partitions. This ID will also be required for future hypercalls
>>>>>>>> used by the Hyper-V IOMMU driver.
>>>>>>>>
>>>>>>>> Refactor the logic for building this logical device ID into a standalone
>>>>>>>> helper function and export the interface for wider use.
>>>>>>>>
>>>>>>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
>>>>>>>> ---
>>>>>>>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>>>>>>>>  include/asm-generic/mshyperv.h      |  2 ++
>>>>>>>>  2 files changed, 22 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>>>>>>>> index 146b43981b27..4b82e06b5d93 100644
>>>>>>>> --- a/drivers/pci/controller/pci-hyperv.c
>>>>>>>> +++ b/drivers/pci/controller/pci-hyperv.c
>>>>>>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>>>>>>>
>>>>>>>>  #define hv_msi_prepare		pci_msi_prepare
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
>>>>>>>> + * function number of the device.
>>>>>>>> + */
>>>>>>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
>>>>>>>> +{
>>>>>>>> +	struct pci_bus *pbus = pdev->bus;
>>>>>>>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
>>>>>>>> +						struct hv_pcibus_device, sysdata);
>>>>>>>> +
>>>>>>>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
>>>>>>>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
>>>>>>>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
>>>>>>>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
>>>>>>>> +		     PCI_FUNC(pdev->devfn));
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
>>>>>>>
>>>>>>> This change is fine for hv_irq_retarget_interrupt(), but it doesn't help for the
>>>>>>> new IOMMU driver because pci-hyperv.c can be (and often is) built as a module.
>>>>>>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
>>>>>>> use this symbol in that case -- you'll get a link error on vmlinux when building
>>>>>>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
>>>>>>> require that the VMBus driver not be built as a module, so I don't think that's
>>>>>>> the right solution.
>>>>>>>
>>>>>>> This is a messy problem. The new IOMMU driver needs to start with a generic
>>>>>>> "struct device" for the PCI device, and somehow find the corresponding VMBus
>>>>>>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
>>>>>>> about ways to do this that don't depend on code and data structures that are
>>>>>>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
>>>>>>
>>>>>> Thank you, Michael. FWIW, I did try to pull out the device ID components out of
>>>>>> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
>>>>>> but it was just too messy as you say.
>>>>>
>>>>> Yes, the current approach for getting the device ID wanders through struct
>>>>> hv_pcibus_device (which is private to the pci-hyperv driver), and through
>>>>> struct hv_device (which is a VMBus data structure). That makes the linkage
>>>>> between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
>>>>> substantial, which is not good.
>>>>
>>>> Hi Michael,
>>>>
>>>> I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
>>>> me about this email chain and I started looking at it this week.
>>>>
>>>>>
>>>>> But here's an idea for an alternate approach. The PV IOMMU driver doesn't
>>>>> have to generate the logical device ID on-the-fly by going to the dev_instance
>>>>> field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
>>>>> device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
>>>>> driver to access. The logical device ID doesn't change while Linux is running, so
>>>>> stashing another copy somewhere isn't a problem.
>>>>
>>>> In my exploration and consulting with Dexuan, I realized that one of the components of
>>>> the logical device ID, the PCI function number is set only in pci_scan_device(), well into
>>>> pci_scan_root_bus_bridge() that you call out as the point by which the communication
>>>> must have occurred.
>>>>
>>>> But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn() and I'm
>>>> honestly confused how these two interact. With the current approach, it looks like whatever
>>>> devfn pci_scan_device() set is the correct function number to use for the logical device
>>>> ID, in which case, the best I can do with your suggested approach below is to inform the
>>>> pvIOMMU driver of the GUID, rather than the logical device ID itself.
>>>>
>>>> Perhaps with your history, you can clarify the interaction, and/or share your thoughts
>>>> on the above?
>>>
>>> During hv_pci_probe(), hv_pci_query_relations() is called to ask the Hyper-V
>>> host about what PCI devices are present. hv_pci_query_relations() sends a
>>> PCI_QUERY_BUS_RELATIONS message to the host, and the host sends back a
>>> PCI_BUS_RELATIONS or PCI_BUS_RELATIONS2 message. The response message
>>> is handled in hv_pci_onchannelcallback(), which calls hv_pci_devices_present()
>>> or hv_pci_devices_present2().  The latter two functions both call
>>> hv_pci_start_relations_work() to add a request to a workqueue that runs
>>> pci_devices_present_work().  Finally, pci_devices_present_work() calls
>>> pci_scan_child_bus(), followed by hv_pci_assign_slots().
>>>
>>> In hv_pci_assign_slots, you can see that the PCI_BUS_RELATIONS[2]
>>> info from the Hyper-V host contains a function number encoded in the
>>> win_slot field. So the Hyper-V host *does* tell the guest the function number.
>>> However, the generic Linux PCI subsystem doesn't use this function number.
>>> It still scans the PCI device, trying successive function numbers to see which
>>> ones work. The scan should find the same function number that the Hyper-V
>>> host originally reported.
>>>
>>> As you noted, there's a sequencing problem in waiting for
>>> pci_scan_single_device() to find the function number. In the hv_pci_probe()
>>> path, after hv_pci_query_relations() runs and before create_root_hv_pci_bus()
>>> is called, it seems feasible to use the function number provided by the
>>> Hyper-V host to construct the logical device ID. That should work. But there's
>>> another path, in that the Hyper-V host can generate a PCI_BUS_RELATIONS[2]
>>> message without a request from Linux when something on the host side changes
>>> the PCI device setup. There's a code path where pci_devices_present_work()
>>> finds the state is "hv_pcibus_installed", and directly calls pci_scan_child_bus().
>>> This path would presumably also need to construct (or re-construct) the
>>> logical device ID using the information from the Hyper-V host before calling
>>> pci_scan_child_bus(). I'm vague on the scenario for this latter case, but the
>>> code is obviously there to handle it.
>>>
>>> The other approach is as you suggest. The Hyper-V PCI driver can tell
>>> the IOMMU driver the almost complete logical device ID, using just the
>>> GUID bits. The IOMMU driver can then construct the full logical
>>> device ID by adding the function number from the struct pci_dev. I don't
>>> see a problem with this approach -- other IOMMU drivers are referencing
>>> the struct pci_dev, and pulling out the function number doesn't seem like
>>> a violation of layering.
>>>
>>
>> Thanks for that explanation, that makes sense. I didn't see any serialization
>> that would ensure that the VMBus path to communicate the child devices on the bus
>> would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's
> 
> FWIW, hv_pci_query_relations() should be ensuring that the communication
> has completed before it returns. It does a wait_for_response(), which ensures
> that the Hyper-V host has sent the PCI_BUS_RELATIONS[2] response. However,
> that message spins off work to the hbus->wq workqueue, so
> hv_pci_query_relations() has a flush_workqueue() to ensure everything that
> was queued has completed.

Hm, I read the comment for the flush_workqueue() as addressing the "PCI_BUS_RELATIONS[2]
message arrived before we sent the QUERY_BUS_RELATIONS message" race case, not as an
"all child devices have definitely been received and processed in response to our
QUERY_BUS_RELATIONS message". Also, knowing very little about the VMBus contract, I
discounted the 100 ms timeout in wait_for_response() as a serialization guarantee.

Chalk it up to previous experience dealing with hardware that's *supposed* to be
spec-compliant and complete initialization within specified timings. :)

I see now that the flush is sufficient though.

> 
> Thinking more about the "hv_pcibus_installed" case, if that path is ever
> triggered, I don't think anything needs to be done with the logical device ID.
> The vPCI device has already been fully initialized on the Linux side, and its
> logical device ID would not change.
> 
> So I think you could construct the full logical device ID once
> hv_pci_query_relations() returns to hv_pci_probe().

Let me think about this more and decide between the logical ID and full bus GUID options.

> 
>> safest to take the approach to communicate the GUID, and find the function number from
>> the pci_dev. This does mean that there will be an essentially identical copy of
>> hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.
> 
> With this alternative approach, is there a need to communicate the full
> GUID to the pvIOMMU driver? Couldn't you just communicate bytes 4 thru
> 7, which would be logical device ID minus the function number?

Yes, we could just communicate bytes 4 through 7 but the pvIOMMU version of the build logical
ID function would diverge from the pci-hyperv version. I figured if we say (in a comment)
that this is the same ID as generated in pci-hyperv, it's better for future readers to see it
to be clearly identical at first glance.

It's also possible to change the pci-hyperv function to only take bytes 4 through 7 instead of the
full GUID, but I rather think we don't need that impedance mismatch of bytes 4 through 7 of the
GUID becoming bytes 0 through 3 of a u32.
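
To make the two options concrete, here is a minimal user-space C sketch of the byte packing, split into a "partial" step (GUID bytes 4 through 7 only) and a "full" step (adding the function number). The byte order and the 0xf8 mask follow the hv_build_logical_dev_id() patch quoted earlier in this thread; the function names and the split itself are hypothetical, and this is not the pci-hyperv or pvIOMMU code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack GUID bytes 4 through 7 into a u32, leaving the low 3 bits clear
 * for the function number. Each byte is widened to uint32_t before the
 * shift so a byte >= 0x80 in position 5 cannot be sign-extended when the
 * result is later widened to 64 bits.
 */
static uint32_t hv_partial_logical_dev_id(const uint8_t guid[16])
{
	return ((uint32_t)guid[5] << 24) |
	       ((uint32_t)guid[4] << 16) |
	       ((uint32_t)guid[7] << 8)  |
	       ((uint32_t)guid[6] & 0xf8u);
}

/* OR in the PCI function number (0..7), the value PCI_FUNC(devfn) yields. */
static uint64_t hv_full_logical_dev_id(uint32_t partial, unsigned int func)
{
	assert(func < 8);
	return (uint64_t)(partial | func);
}
```

With this split, passing the full GUID or just the packed u32 across the driver boundary gives the same final ID; the choice is only about where the packing code lives.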

> 
>>
>>>>
>>>>>
>>>>> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
>>>>> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
>>>>> responsible for storing this data in a form that it can later search. hv_pci_probe()
>>>>> calls this new function when it instantiates a new PCI pass-thru device. Then when
>>>>> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
>>>>> from the struct pci_dev (or struct pci_bus), search for the related logical device
>>>>> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
>>>>> on the IOMMU driver, but that's a dependency in the desired direction. The
>>>>> PCI domain ID and logical device ID are just integers, so no data structures are
>>>>> shared.
>>>>
>>>> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
>>>> of the GUID being used to create the domain number. I thought this approach could
>>>> help with that too, but as I coded it up, I realized that using the domain number
>>>> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
>>>> is the wrong way around. It is unfortunately the only available key in the pci_dev
>>>> handed to the pvIOMMU driver in this approach though...
>>>>
>>>> Do you think that's a fatal flaw?
>>>
>>> There are two uniqueness problems, which I didn't fully separate conceptually
>>> until writing this. One problem is constructing a PCI domain ID that Linux can use
>>> to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
>>> device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
>>> that they might not be unique. So there's code in hv_pci_probe() to pick another
>>> number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
>>> domain ID for the virtual PCI bus as it's purely a Linux construct.
>>
>> This part matters for the IOMMU driver as it is the key we will use to search the data
>> structure to find the right GUID to construct the logical dev ID that Hyper-V recognizes.
> 
> Right. But the Hyper-V vPCI driver in Linux ensures that the domain ID is unique
> in the sense that two active vPCI devices will not have the same domain ID. So
> the pvIOMMU driver should not encounter any ambiguity when looking up the
> logical device ID.

Agreed, that was a fragment of a thought that I neglected to delete before sending. Apologies.

> As you noted below, it's possible that a vPCI device could go
> away, and another vPCI device could be added that ends up with a domain ID
> that was previously used. When that added vPCI device is set up by the Hyper-V
> vPCI driver, it will inform the pvIOMMU driver about the domain ID -> logical
> device ID mapping, and it might overwrite an existing mapping if the newly
> added vPCI device ended up with a domain ID that had previously been used.
> And that's fine.

Yes.

>>
>>>
>>> The second problem is the logical device ID that Hyper-V interprets to
>>> identify a vPCI device in hypercalls such as HVCALL_RETARGET_INTERRUPT
>>> and the new pvIOMMU related hypercalls. This logical device ID uses
>>> GUID bytes 4 thru 7 (minus 1 bit).  I don’t think Linux uses the
>>> logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
>>> must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
>>> That's something to confirm with the Hyper-V team. If they are just hoping
>>> for the best, I don't know how Linux can solve the problem.
>>
>> I checked with the Hyper-V vPCI team on this aspect and the only guarantee that
>> they provide is that, at any given time, there will only be 1 device with a given
>> logical ID attached to a VM.
> 
> OK, so Hyper-V is guaranteeing the uniqueness of vPCI device GUID bytes 4
> thru 7 across all vPCI devices that are attached to a VM at a given point in time.
> That's good!

Technically, they're guaranteeing only that the *combination* of GUID bytes 4 through 7 AND
the slot number will be unique across all vPCI devices that are attached to a VM at a given
point in time. As you say below, while we have in practice not seen multiple devices on a
vPCI bus, the vPCI team asserts that there is no restriction in the stack on doing so.

> 
>> Once a device has been removed, everything about it is
>> forgotten from the Hyper-V stack's perspective, and nothing in the Hyper-V stack would
>> prevent a scenario where, for example, a data movement accelerator is attached with
>> logical ID X, then revoked, and let's say a NIC is attached with the same logical ID X.
> 
> And the "forgetting" behavior is the same in Linux. Once the device is removed,
> Linux forgets everything about it. If a new vPCI device shows up and happens
> to have the same GUID as a previous device, that should not cause any problems
> in Linux.
> 
>>
>> Also, FWIW, they also stated that the GUID is not unique and cannot be
>> guaranteed to be unique because it's the GUID for the bus, not the individual
>> devices.
> 
> I'm not sure I understand this statement. Is this referring to the possibility
> that a vPCI "device" that Hyper-V offers to the guest might have multiple
> functions?

Yes, apologies for the vagueness.

> The vPCI device driver in Linux has code to recognize this case,
> but I'm not aware of any current cases where it happens. In such a case,
> Linux should create a single PCI bus abstraction with multiple devices
> attached to it, with each device being a different function. If Hyper-V
> did ever offer a multiple-function configuration, there might be some
> debugging to do in the Hyper-V vPCI driver in Linux!
> 
> We shortcut the terminology by referring to a vPCI "device", and assuming
> that devices and busses are 1-to-1. But the design allows for multiple devices
> as different functions on the same bus.
> 
>>

<snip>

>>>>>
>>>>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
>>>>> remove the information if a PCI pass-thru device is unbound or removed, as
>>>>> the logical device ID will be the same if the device ever comes back. At worst,
>>>>> the IOMMU driver can simply replace an existing logical device ID if a new one
>>>>> is provided for the same PCI domain ID.
>>>>
>>>> As above, replacing a unique GUID when a result is found for a non-unique
>>>> key value may be prone to failure if it happens that the device that came "back"
>>>> is not in fact the same device (or class of device) that went away and just happens
>>>> to, either due to bytes 4 and 5 being identical, or due to collision in the
>>>> pci_domain_nr_dynamic_ida, have the same domain number.
>>
>> Given the vPCI team's statements (above), I think we will need to handle unbind or
>> removal and ensure the pvIOMMU driver's data structure is invalidated when either
>> happens.
> 
> The generic PCI code should handle detaching from the pvIOMMU. So I'm assuming
> your statement is specifically about the mapping from domain ID to logical device ID.

Yes, apologies for the vagueness (again).

> I still think removing it may be unnecessary since adding a mapping for a new vPCI
> device with the same domain ID but different logical device ID could just overwrite
> any existing mapping. And leaving a dead mapping in the pvIOMMU data structures
> doesn’t actually hurt anything. On the other hand, removing/invalidating it is
> certainly more tidy and might prevent some confusion down the road.
> 

Yes, if the data structure maps domain -> logical ID, we can do the overwrite as you say.
With my approach of informing the pvIOMMU driver of the entire (bus) GUID, we would want
to be careful that we don't assume the 1:1 bus<->device case and overwrite an existing
device entry with a new device that's on the same bus.

> I'm not the person writing the code, so it's easy for me to make hand-wavy
> statements. ;-)  You've got to actually make it work, so you get to make
> the final decisions.
> 
> Michael

Thank you for the history and engagement as I worked through the options!

- Easwar (he/him)

^ permalink raw reply

* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Michael Kelley @ 2026-04-14 16:24 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <a05fa5b8-3b82-4c5f-8fff-fe10b3f71e87@linux.microsoft.com>

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Monday, April 13, 2026 2:30 PM
> 
> On 4/9/2026 12:01 PM, Michael Kelley wrote:
> > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Wednesday, April 8, 2026 1:21 PM
> >>
> >> On 1/11/2026 9:36 AM, Michael Kelley wrote:
> >>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
> >>>>
> >>>> On 1/8/2026 10:46 AM, Michael Kelley wrote:
> >>>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> >>>>>>
> >>>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >>>>>>
> >>>>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
> >>>>>> child partitions. This ID will also be required for future hypercalls
> >>>>>> used by the Hyper-V IOMMU driver.
> >>>>>>
> >>>>>> Refactor the logic for building this logical device ID into a standalone
> >>>>>> helper function and export the interface for wider use.
> >>>>>>
> >>>>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >>>>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> >>>>>> ---
> >>>>>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
> >>>>>>  include/asm-generic/mshyperv.h      |  2 ++
> >>>>>>  2 files changed, 22 insertions(+), 8 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> >>>>>> index 146b43981b27..4b82e06b5d93 100644
> >>>>>> --- a/drivers/pci/controller/pci-hyperv.c
> >>>>>> +++ b/drivers/pci/controller/pci-hyperv.c
> >>>>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> >>>>>>
> >>>>>>  #define hv_msi_prepare		pci_msi_prepare
> >>>>>>
> >>>>>> +/**
> >>>>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
> >>>>>> + * function number of the device.
> >>>>>> + */
> >>>>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
> >>>>>> +{
> >>>>>> +	struct pci_bus *pbus = pdev->bus;
> >>>>>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
> >>>>>> +						struct hv_pcibus_device, sysdata);
> >>>>>> +
> >>>>>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
> >>>>>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
> >>>>>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
> >>>>>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
> >>>>>> +		     PCI_FUNC(pdev->devfn));
> >>>>>> +}
> >>>>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
> >>>>>
> >>>>> This change is fine for hv_irq_retarget_interrupt(), but it doesn't help for the
> >>>>> new IOMMU driver because pci-hyperv.c can be (and often is) built as a module.
> >>>>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
> >>>>> use this symbol in that case -- you'll get a link error on vmlinux when building
> >>>>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
> >>>>> require that the VMBus driver not be built as a module, so I don't think that's
> >>>>> the right solution.
> >>>>>
> >>>>> This is a messy problem. The new IOMMU driver needs to start with a generic
> >>>>> "struct device" for the PCI device, and somehow find the corresponding VMBus
> >>>>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
> >>>>> about ways to do this that don't depend on code and data structures that are
> >>>>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
> >>>>
> >>>> Thank you, Michael. FWIW, I did try to pull the device ID components out of
> >>>> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
> >>>> but it was just too messy as you say.
> >>>
> >>> Yes, the current approach for getting the device ID wanders through struct
> >>> hv_pcibus_device (which is private to the pci-hyperv driver), and through
> >>> struct hv_device (which is a VMBus data structure). That makes the linkage
> >>> between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
> >>> substantial, which is not good.
> >>
> >> Hi Michael,
> >>
> >> I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
> >> me about this email chain and I started looking at it this week.
> >>
> >>>
> >>> But here's an idea for an alternate approach. The PV IOMMU driver doesn't
> >>> have to generate the logical device ID on-the-fly by going to the dev_instance
> >>> field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
> >>> device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
> >>> driver to access. The logical device ID doesn't change while Linux is running, so
> >>> stashing another copy somewhere isn't a problem.
> >>
> >> In my exploration and consulting with Dexuan, I realized that one of the components of
> >> the logical device ID, the PCI function number is set only in pci_scan_device(), well into
> >> pci_scan_root_bus_bridge() that you call out as the point by which the communication
> >> must have occurred.
> >>
> >> But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn() and I'm
> >> honestly confused how these two interact. With the current approach, it looks like whatever
> >> devfn pci_scan_device() set is the correct function number to use for the logical device
> >> ID, in which case, the best I can do with your suggested approach below is to inform the
> >> pvIOMMU driver of the GUID, rather than the logical device ID itself.
> >>
> >> Perhaps with your history, you can clarify the interaction, and/or share your thoughts
> >> on the above?
> >
> > During hv_pci_probe(), hv_pci_query_relations() is called to ask the Hyper-V
> > host about what PCI devices are present. hv_pci_query_relations() sends a
> > PCI_QUERY_BUS_RELATIONS message to the host, and the host sends back a
> > PCI_BUS_RELATIONS or PCI_BUS_RELATIONS2 message. The response message
> > is handled in hv_pci_onchannelcallback(), which calls hv_pci_devices_present()
> > or hv_pci_devices_present2().  The latter two functions both call
> > hv_pci_start_relations_work() to add a request to a workqueue that runs
> > pci_devices_present_work().  Finally, pci_devices_present_work() calls
> > pci_scan_child_bus(), followed by hv_pci_assign_slots().
> >
> > In hv_pci_assign_slots, you can see that the PCI_BUS_RELATIONS[2]
> > info from the Hyper-V host contains a function number encoded in the
> > win_slot field. So the Hyper-V host *does* tell the guest the function number.
> > However, the generic Linux PCI subsystem doesn't use this function number.
> > It still scans the PCI device, trying successive function numbers to see which
> > ones work. The scan should find the same function number that the Hyper-V
> > host originally reported.
> >
> > As you noted, there's a sequencing problem in waiting for
> > pci_scan_single_device() to find the function number. In the hv_pci_probe()
> > path, after hv_pci_query_relations() runs and before create_root_hv_pci_bus()
> > is called, it seems feasible to use the function number provided by the
> > Hyper-V host to construct the logical device ID. That should work. But there's
> > another path, in that the Hyper-V host can generate a PCI_BUS_RELATIONS[2]
> > message without a request from Linux when something on the host side changes
> > the PCI device setup. There's a code path where pci_devices_present_work()
> > finds the state is "hv_pcibus_installed", and directly calls pci_scan_child_bus().
> > This path would presumably also need to construct (or re-construct) the
> > logical device ID using the information from the Hyper-V host before calling
> > pci_scan_child_bus(). I'm vague on the scenario for this latter case, but the
> > code is obviously there to handle it.
> >
> > The other approach is as you suggest. The Hyper-V PCI driver can tell
> > the IOMMU driver the almost complete logical device ID, using just the
> > GUID bits. The IOMMU driver can then construct the full logical
> > device ID by adding the function number from the struct pci_dev. I don't
> > see a problem with this approach -- other IOMMU drivers are referencing
> > the struct pci_dev, and pulling out the function number doesn't seem like
> > a violation of layering.
> >
> 
> Thanks for that explanation, that makes sense. I didn't see any serialization
> that would ensure that the VMBus path to communicate the child devices on the bus
> would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's

FWIW, hv_pci_query_relations() should be ensuring that the communication
has completed before it returns. It does a wait_for_response(), which ensures
that the Hyper-V host has sent the PCI_BUS_RELATIONS[2] response. However,
that message spins off work to the hbus->wq workqueue, so
hv_pci_query_relations() has a flush_workqueue() to ensure everything that
was queued has completed.

Thinking more about the "hv_pcibus_installed" case, if that path is ever
triggered, I don't think anything needs to be done with the logical device ID.
The vPCI device has already been fully initialized on the Linux side, and its
logical device ID would not change.

So I think you could construct the full logical device ID once
hv_pci_query_relations() returns to hv_pci_probe().

> safest to take the approach to communicate the GUID, and find the function number from
> the pci_dev. This does mean that there will be an essentially identical copy of
> hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.

With this alternative approach, is there a need to communicate the full
GUID to the pvIOMMU driver? Couldn't you just communicate bytes 4 thru
7, which would be the logical device ID minus the function number?
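
A minimal sketch of that split, assuming the same bit layout as the
hv_build_logical_dev_id() quoted earlier in this thread. The helper names
hv_logical_dev_id_base() and hv_logical_dev_id_full() are hypothetical and
not part of any posted patch; the GUID is modelled as a plain byte array so
the snippet builds outside the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the portion of the logical device ID derived from
 * bytes 4 through 7 of the bus instance GUID. The low three bits are left
 * clear for the function number. Each byte is widened before shifting so
 * the arithmetic is done in unsigned 32-bit.
 */
static uint32_t hv_logical_dev_id_base(const uint8_t guid[16])
{
	return ((uint32_t)guid[5] << 24) |
	       ((uint32_t)guid[4] << 16) |
	       ((uint32_t)guid[7] << 8)  |
	       ((uint32_t)guid[6] & 0xf8);
}

/*
 * Hypothetical helper: combine the base with the function number, taken
 * as the low three bits of devfn (what PCI_FUNC() extracts in the kernel).
 */
static uint64_t hv_logical_dev_id_full(uint32_t base, uint8_t devfn)
{
	return (uint64_t)(base | (devfn & 0x07));
}
```

With this split, the vPCI driver would only hand over the 32-bit base, and
the IOMMU side would OR in the function number from the struct pci_dev at
attach time.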

> 
> >>
> >>>
> >>> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
> >>> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
> >>> responsible for storing this data in a form that it can later search. hv_pci_probe()
> >>> calls this new function when it instantiates a new PCI pass-thru device. Then when
> >>> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
> >>> from the struct pci_dev (or struct pci_bus), search for the related logical device
> >>> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
> >>> on the IOMMU driver, but that's a dependency in the desired direction. The
> >>> PCI domain ID and logical device ID are just integers, so no data structures are
> >>> shared.
> >>
> >> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
> >> of the GUID being used to create the domain number. I thought this approach could
> >> help with that too, but as I coded it up, I realized that using the domain number
> >> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
> >> is the wrong way around. It is unfortunately the only available key in the pci_dev
> >> handed to the pvIOMMU driver in this approach though...
> >>
> >> Do you think that's a fatal flaw?
> >
> > There are two uniqueness problems, which I didn't fully separate conceptually
> > until writing this. One problem is constructing a PCI domain ID that Linux can use
> > to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
> > device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
> > that they might not be unique. So there's code in hv_pci_probe() to pick another
> > number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
> > domain ID for the virtual PCI bus as it's purely a Linux construct.
> 
> This part matters for the IOMMU driver as it is the key we will use to search the data
> structure to find the right GUID to construct the logical dev ID that Hyper-V recognizes.

Right. But the Hyper-V vPCI driver in Linux ensures that the domain ID is unique
in the sense that two active vPCI devices will not have the same domain ID. So
the pvIOMMU driver should not encounter any ambiguity when looking up the
logical device ID. As you noted below, it's possible that a vPCI device could go
away, and another vPCI device could be added that ends up with a domain ID
that was previously used. When that added vPCI device is set up by the Hyper-V
vPCI driver, it will inform the pvIOMMU driver about the domain ID -> logical
device ID mapping, and it might overwrite an existing mapping if the newly
added vPCI device ended up with a domain ID that had previously been used.
And that's fine.
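
To make the overwrite behavior concrete, here is a rough userspace sketch
of a domain ID -> logical device ID table. Everything in it is an
assumption for illustration: the names (hv_iommu_map_set(),
hv_iommu_map_get()), the fixed-size array, and the absence of locking,
which a real driver would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HV_IOMMU_MAX_MAP 64	/* assumed capacity, for illustration only */

struct hv_iommu_map_entry {
	bool     used;
	uint32_t domain;	/* Linux PCI domain ID (the lookup key) */
	uint32_t logical_id;	/* logical device ID minus function number */
};

static struct hv_iommu_map_entry hv_iommu_map[HV_IOMMU_MAX_MAP];

/*
 * Record domain -> logical ID. If the domain is already present (for
 * example, left over from a removed device), the old value is overwritten.
 * Returns 0 on success, -1 if the table is full.
 */
static int hv_iommu_map_set(uint32_t domain, uint32_t logical_id)
{
	struct hv_iommu_map_entry *free_slot = NULL;
	size_t i;

	for (i = 0; i < HV_IOMMU_MAX_MAP; i++) {
		struct hv_iommu_map_entry *e = &hv_iommu_map[i];

		if (e->used && e->domain == domain) {
			e->logical_id = logical_id;
			return 0;
		}
		if (!e->used && !free_slot)
			free_slot = e;
	}
	if (!free_slot)
		return -1;

	free_slot->used = true;
	free_slot->domain = domain;
	free_slot->logical_id = logical_id;
	return 0;
}

/* Look up the logical ID for a domain; returns false if none is recorded. */
static bool hv_iommu_map_get(uint32_t domain, uint32_t *logical_id)
{
	size_t i;

	for (i = 0; i < HV_IOMMU_MAX_MAP; i++) {
		if (hv_iommu_map[i].used && hv_iommu_map[i].domain == domain) {
			*logical_id = hv_iommu_map[i].logical_id;
			return true;
		}
	}
	return false;
}
```

Because the vPCI driver keeps domain IDs unique among active devices, a
lookup by domain can only ever hit the most recently registered device.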

> 
> >
> > The second problem is the logical device ID that Hyper-V interprets to
> > identify a vPCI device in hypercalls such as HVCALL_RETARGET_INTERRUPT
> > and the new pvIOMMU related hypercalls. This logical device ID uses
> > GUID bytes 4 thru 7 (minus 1 bit).  I don’t think Linux uses the
> > logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
> > must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
> > That's something to confirm with the Hyper-V team. If they are just hoping
> > for the best, I don't know how Linux can solve the problem.
> 
> I checked with the Hyper-V vPCI team on this aspect and the only guarantee that
> they provide is that, at any given time, there will only be 1 device with a given
> logical ID attached to a VM.

OK, so Hyper-V is guaranteeing the uniqueness of vPCI device GUID bytes 4
thru 7 across all vPCI devices that are attached to a VM at a given point in time.
That's good!

> Once a device has been removed, everything about it is
> forgotten from the Hyper-V stack's perspective, and nothing in the Hyper-V stack would
> prevent a scenario where, for example, a data movement accelerator is attached with
> logical ID X, then revoked, and let's say a NIC is attached with the same logical ID X.

And the "forgetting" behavior is the same in Linux. Once the device is removed,
Linux forgets everything about it. If a new vPCI device shows up and happens
to have the same GUID as a previous device, that should not cause any problems
in Linux.

> 
> Also, FWIW, they also stated that the GUID is not unique and cannot be
> guaranteed to be unique because it's the GUID for the bus, not the individual
> devices.

I'm not sure I understand this statement. Is this referring to the possibility
that a vPCI "device" that Hyper-V offers to the guest might have multiple
functions? The vPCI device driver in Linux has code to recognize this case,
but I'm not aware of any current cases where it happens. In such a case,
Linux should create a single PCI bus abstraction with multiple devices
attached to it, with each device being a different function. If Hyper-V
did ever offer a multiple-function configuration, there might be some
debugging to do in the Hyper-V vPCI driver in Linux!

We shortcut the terminology by referring to a vPCI "device", and assuming
that devices and busses are 1-to-1. But the design allows for multiple devices
as different functions on the same bus.

> 
> > My original comment about uniqueness somewhat conflated the two problems,
> > and that's misleading. The use of the logical device ID has been around for years
> > in hv_irq_retarget_interrupt(). Extending its use to the new pvIOMMU
> > hypercalls doesn't make things any worse. But I'm still curious about
> > what the Hyper-V team says about the uniqueness of bytes 4 thru 7.
> >
> > Michael
> > 
> >>
> >>>
> >>> Note that the pci-hyperv must inform the PV IOMMU driver of the logical
> >>> device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
> >>> The latter function eventually invokes hv_iommu_attach_dev(), which will
> >>> need the logical device ID. See example stack trace. [1]
> >>>
> >>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
> >>> remove the information if a PCI pass-thru device is unbound or removed, as
> >>> the logical device ID will be the same if the device ever comes back. At worst,
> >>> the IOMMU driver can simply replace an existing logical device ID if a new one
> >>> is provided for the same PCI domain ID.
> >>
> >> As above, replacing a unique GUID when a result is found for a non-unique
> >> key value may be prone to failure if it happens that the device that came "back"
> >> is not in fact the same device (or class of device) that went away and just happens
> >> to, either due to bytes 4 and 5 being identical, or due to collision in the
> >> pci_domain_nr_dynamic_ida, have the same domain number.
> 
> Given the vPCI team's statements (above), I think we will need to handle unbind or
> removal and ensure the pvIOMMU driver's data structure is invalidated when either
> happens.

The generic PCI code should handle detaching from the pvIOMMU. So I'm assuming
your statement is specifically about the mapping from domain ID to logical device ID.
I still think removing it may be unnecessary since adding a mapping for a new vPCI
device with the same domain ID but different logical device ID could just overwrite
any existing mapping. And leaving a dead mapping in the pvIOMMU data structures
doesn’t actually hurt anything. On the other hand, removing/invalidating it is
certainly more tidy and might prevent some confusion down the road.

I'm not the person writing the code, so it's easy for me to make hand-wavy
statements. ;-)  You've got to actually make it work, so you get to make
the final decisions.

Michael

^ permalink raw reply

* Re: [PATCH v3 0/7] mshv: Reduce memory consumption for unpinned regions
From: Anirudh Rayabharam @ 2026-04-14 16:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, Apr 09, 2026 at 03:23:59PM +0000, Stanislav Kinsburskii wrote:
> This series reduces memory consumption for unpinned regions by avoiding
> PFN array allocation. A 1GB unpinned region currently wastes 2MB for an
> unused PFN array that HMM-managed regions don't need.

This series has a dependency on "mshv: Refactor memory region management
and map pages at creation" right?

Anirudh.


^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-14 16:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260412125917.4fa8fc8d@kernel.org>

On Sun, Apr 12, 2026 at 12:59:17PM -0700, Jakub Kicinski wrote:
> On Thu, 9 Apr 2026 18:35:09 -0700 Jakub Kicinski wrote:
> > On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> > > This behavior is observed on a single platform; other platforms
> > > perform better with page_pool fragments, indicating this is not a
> > > page_pool issue but platform-specific.  
> > 
> > Well, someone has to run some experiments and confirm other ARM
> > platforms are not impacted, with data. I was hoping to do it myself
> > but doesn't look like that will happen in time for the merge window :(
> 
> Please repost with the perf analysis on other commercially available
> ARM platform. Something like:
> 
>   This is a workaround applicable to only some platforms. Modifying
>   driver X to use a similar workaround on [Ampere Max|nVidia
>   Grace|Amazon Graviton 3|..] the performance for split pages is
>   y% higher than when using single pages.
> -- 
> pw-bot: cr

Hi Jakub,

I ran the same experiment on an alternate ARM64 platform from a
different vendor, which I was able to access only recently. I still see
roughly a 5% overhead from the atomic refcount operation itself, but on
that platform there is no throughput drop when using page fragments
versus full-page mode. In both cases, the setup reaches line rate. That
suggests the atomic overhead alone does not explain the throughput loss
on the specific hardware we are discussing.

I also received an update from the hardware team. They collected PCIe
traces and observed stalls on this particular ARM64 processor
when running with page fragments, while those stalls are not seen in
full-page mode. The exact root cause is still under investigation, but
their current assessment is that this is likely a microarchitectural
issue in the PCIe root port. Based on that, they are asking for a
software workaround that uses full pages until the issue is fully
understood.

For that reason, I am asking whether this could be accepted as an
ethtool private flag rather than as a generic driver change,
since the problem is still specific to one CPU/platform.
Please let me know whether you think this patch with the private flag
would be acceptable here.

Regards
Dipayaan Roy

^ permalink raw reply

* Re: [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-14 15:49 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <134ce833-544-24eb-883-b190a888b31c@linux.microsoft.com>

On Tue, Apr 07, 2026 at 02:27:52PM -0700, Jork Loeser wrote:
> On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:
> 
> > On Fri, Apr 03, 2026 at 12:06:10PM -0700, Jork Loeser wrote:
> > > The SynIC is shared between VMBus and MSHV. VMBus owns the message
> > > page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> > > and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> > > 
> > > Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
> > 
> > The redundant enable is probably a no-op from the hypervisor side so it
> > probably doesn't hurt us. The main problem is with the tear down.
> 
> It's an MSR intercept. If we can replace this by an "if()" we shave a few
> cycles.
> 
> > An alternative approach could be: check if SIMP/SIEFP/SCONTROL is
> > already enabled. If so, don't enable it again. If not enabled, enable it
> > and keep track of what all stuff we have enabled. Then disable all of
> > them during cleanup. This approach makes fewer assumptions about the
> > behavior of the VMBUS driver and what stuff it does or doesn't use.
> 
> It would, yes. Then again, we drag yet more state and make debugging more
> complicated / less clear to reason what happens dynamically. I had been
> debating this briefly myself, and ultimately decided against it for that
> very reason.

Ultimately, both approaches are fragile in their own ways because the
contract that "VMBus owns SIMP, SIEFP, SCONTROL, SINT2 and MSHV owns
SIRBP and SINT0 and SINT5" is not enforced anywhere in code and is
just an assumption that everyone will play nice. To enforce it, we'll need to
refactor the code such that there is a common component that
facilitates access to SynIC for both VMBus and MSHV.

I would say that checking the state dynamically and then deciding
whether or not to enable SIMP/SIEFP/SCONTROL would be less fragile
because we make fewer assumptions about what VMBus does or doesn't do.
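
Roughly, the dynamic approach I have in mind looks like the sketch below.
This is a userspace model only: struct synic_regs stands in for the real
per-CPU SynIC enable bits (it is not the MSR layout), and the function and
struct names are made up for illustration.

```c
#include <stdbool.h>

/* Stand-in for the SIMP/SIEFP/SCONTROL enable bits; not the real MSRs. */
struct synic_regs {
	bool simp;
	bool siefp;
	bool scontrol;
};

/* What this (hypothetical) MSHV init enabled itself and must undo later. */
struct mshv_synic_owned {
	bool simp;
	bool siefp;
	bool scontrol;
};

/* Enable only what is not already enabled, and remember what we did. */
static void mshv_synic_enable_sketch(struct synic_regs *hw,
				     struct mshv_synic_owned *own)
{
	own->simp = !hw->simp;
	own->siefp = !hw->siefp;
	own->scontrol = !hw->scontrol;

	if (own->simp)
		hw->simp = true;
	if (own->siefp)
		hw->siefp = true;
	if (own->scontrol)
		hw->scontrol = true;
}

/* Disable only what we enabled; leave everything else untouched. */
static void mshv_synic_disable_sketch(struct synic_regs *hw,
				      struct mshv_synic_owned *own)
{
	if (own->simp)
		hw->simp = false;
	if (own->siefp)
		hw->siefp = false;
	if (own->scontrol)
		hw->scontrol = false;

	own->simp = false;
	own->siefp = false;
	own->scontrol = false;
}
```

The cost is the extra per-CPU "owned" state, which is the debugging
concern you raised.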

Also, do you know of any cases where the VMBus stuff can get initialized
after MSHV? Maybe if VMBus is a module (if that is even possible)? That
would really mess up our logic here.

Thanks,
Anirudh.

> 
> Best,
> Jork

^ permalink raw reply

* Re: [PATCH net v2 2/4] net: mana: Init gf_stats_work before potential error paths in probe
From: Simon Horman @ 2026-04-14 15:41 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-3-ernis@linux.microsoft.com>

On Sun, Apr 12, 2026 at 10:08:38PM -0700, Erni Sri Satya Vennela wrote:
> Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(),
> while keeping schedule_delayed_work() at its original location.
> 
> Previously, if any function between mana_create_eq() and the
> INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove()
> which unconditionally calls cancel_delayed_work_sync(gf_stats_work)
> in __flush_work() or debug object warnings with
> CONFIG_DEBUG_OBJECTS_WORK enabled.
> 
> Fixes: be4f1d67ec56 ("net: mana: Add standard counter rx_missed_errors")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH net v2 1/4] net: mana: Init link_change_work before potential error paths in probe
From: Simon Horman @ 2026-04-14 15:41 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-2-ernis@linux.microsoft.com>

On Sun, Apr 12, 2026 at 10:08:37PM -0700, Erni Sri Satya Vennela wrote:
> Move INIT_WORK(link_change_work) to right after the mana_context
> allocation, before any error path that could reach mana_remove().
> 
> Previously, if mana_create_eq() or mana_query_device_cfg() failed,
> mana_probe() would jump to the error path which calls mana_remove().
> mana_remove() unconditionally calls disable_work_sync(link_change_work),
> but the work struct had not been initialized yet. This can trigger
> CONFIG_DEBUG_OBJECTS_WORK enabled.
> 
> Fixes: 54133f9b4b53 ("net: mana: Support HW link state events")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> Changes in v2:
> * Apply the patch in net instead of net-next.

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH net v2 4/4] net: mana: Fix EQ leak in mana_remove on NULL port
From: Simon Horman @ 2026-04-14 15:40 UTC (permalink / raw)
  To: ernis
  Cc: Simon Horman, pabeni, edumazet, ssengar, dipayanroy, gargaditya,
	decui, shacharr, linux-hyperv, kotaranov, kys, shirazsaleem,
	longli, kuba, andrew+netdev, wei.liu, linux-kernel, netdev,
	haiyangz, kees, stephen, leon, davem
In-Reply-To: <20260413050843.605789-5-ernis@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: mana: Fix EQ leak in mana_remove on NULL port

In mana_remove(), when a NULL port is encountered in the port iteration
loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
queues allocated earlier by mana_create_eq(). This can happen when
mana_probe_port() fails for port 0, leaving ac->ports[0] as NULL. On
driver unload or error cleanup, mana_remove() hits the NULL entry and
jumps past mana_destroy_eq(). Change 'goto out' to 'break' so the
for-loop exits normally and mana_destroy_eq() is always reached.
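
The control flow described above can be modelled with a small standalone
sketch. This is a simplified illustration, not the driver code: struct
remove_ctx, remove_sketch() and the use_break switch are hypothetical, and
setting eq_destroyed merely stands in for the mana_destroy_eq(ac) call.

```c
#include <stdbool.h>

#define SKETCH_MAX_PORTS 4

struct remove_ctx {
	void *ports[SKETCH_MAX_PORTS];
	int   num_ports;
	int   ports_removed;
	bool  eq_destroyed;
};

/*
 * use_break == false models the old 'goto out' on a NULL port;
 * use_break == true models the patched 'break'.
 */
static void remove_sketch(struct remove_ctx *ac, bool use_break)
{
	int i;

	for (i = 0; i < ac->num_ports; i++) {
		if (!ac->ports[i]) {
			if (use_break)
				break;	/* loop ends, EQ teardown still runs */
			goto out;	/* jumps past the EQ teardown */
		}
		ac->ports_removed++;
	}

	ac->eq_destroyed = true;	/* stands in for mana_destroy_eq(ac) */
out:
	return;
}
```

With ports[0] == NULL, the 'goto out' variant returns with eq_destroyed
still false, while the 'break' variant sets it.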

> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")

Does this Fixes tag point to the correct commit? Looking at the commit
history, the original driver commit ca9c54d2d6a5 did not have
mana_destroy_eq() in mana_remove() at all, so the 'goto out' pattern could
not have caused an EQ leak at that point.

The bug appears to have been introduced by commit 1e2d0824a9c3 ("net: mana:
Add support for EQ sharing"), which added mana_destroy_eq(ac) to
mana_remove() and placed it after the 'out:' label, making it skippable by
the 'goto out' on NULL port.

Should the Fixes tag be:
  Fixes: 1e2d0824a9c3 ("net: mana: Add support for EQ sharing")

^ permalink raw reply

* Re: [PATCH net v2 3/4] net: mana: Don't overwrite port probe error with add_adev result
From: Simon Horman @ 2026-04-14 15:35 UTC (permalink / raw)
  To: ernis
  Cc: 'Simon Horman', kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, davem, edumazet, kuba, pabeni, ssengar, dipayanroy,
	gargaditya, shirazsaleem, kees, kotaranov, leon, shacharr,
	stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-4-ernis@linux.microsoft.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: mana: Don't overwrite port probe error with add_adev result

This commit ensures that if mana_probe_port() or mana_attach() fails, the
error is not overwritten by the result of add_adev(). By preserving the
error, it triggers mana_remove() cleanup instead of leaving ports partially
initialized.
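
A minimal userspace sketch of the before/after behaviour, assuming stub
functions in place of the driver calls. fake_probe_port(),
fake_add_adev() and the -5 error value are illustrative only and are not
taken from the patch.

```c
/* Hypothetical stub: fails with -5 for one chosen port, else succeeds. */
static int fake_probe_port(int i, int failing_port)
{
	return i == failing_port ? -5 : 0;
}

/* Hypothetical stub: add_adev always succeeds in this model. */
static int fake_add_adev(void)
{
	return 0;
}

/* Before: the unconditional assignment hides an earlier port failure. */
static int probe_overwriting(int num_ports, int failing_port)
{
	int err = 0;

	for (int i = 0; i < num_ports; i++) {
		err = fake_probe_port(i, failing_port);
		if (err)
			break;
	}
	err = fake_add_adev();	/* overwrites the port error */
	return err;
}

/* After: add_adev is attempted only when no port failed. */
static int probe_preserving(int num_ports, int failing_port)
{
	int err = 0;

	for (int i = 0; i < num_ports; i++) {
		err = fake_probe_port(i, failing_port);
		if (err)
			break;
	}
	if (!err)
		err = fake_add_adev();
	return err;
}
```

With port 0 failing, probe_overwriting() reports success and
probe_preserving() returns the port error, which is what lets the caller
reach its cleanup path.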

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9d095a6fb56c7..4d42cbfb3f15b 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -3634,10 +3634,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
>  	if (!resuming) {
>  		for (i = 0; i < ac->num_ports; i++) {
>  			err = mana_probe_port(ac, i, &ac->ports[i]);
> -			/* we log the port for which the probe failed and stop
> -			 * probes for subsequent ports.
> -			 * Note that we keep running ports, for which the probes
> -			 * were successful, unless add_adev fails too
> +			/* Log the port for which the probe failed, stop probing
> +			 * subsequent ports, and skip add_adev.
> +			 * Already-probed ports remain functional.
>  			 */

Are these comments accurate regarding ports remaining functional?

Since this patch preserves the error code, execution falls through to the end
of mana_probe() where it checks if err is set and calls mana_remove().
mana_remove() will iterate over the successfully allocated ports, completely
detach them, unregister their net_devices, and destroy event queues.

It appears no ports remain functional because the device is completely torn
down.

[ ... ]

> @@ -3651,10 +3650,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
>  			enable_work(&apc->queue_reset_work);
>  			err = mana_attach(ac->ports[i]);
>  			rtnl_unlock();
> -			/* we log the port for which the attach failed and stop
> -			 * attach for subsequent ports
> -			 * Note that we keep running ports, for which the attach
> -			 * were successful, unless add_adev fails too
> +			/* Log the port for which the attach failed, stop
> +			 * attaching subsequent ports, and skip add_adev.
> +			 * Already-attached ports remain functional.
>  			 */
>  			if (err) {
>  				dev_err(dev, "Attach Failed for port %d\n", i);
> @@ -3663,7 +3661,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)

Does propagating this error during PM resume introduce a NULL pointer
dereference?

If PM resume fails here (resuming == true), the preserved error code causes
mana_probe() to call mana_remove(), which tears down the device and sets
gd->gdma_context = NULL.

However, a failed resume callback does not automatically unbind the driver.
When the device is eventually unbound later, mana_remove() is invoked again:

drivers/net/ethernet/microsoft/mana/mana_en.c:mana_remove() {
    struct gdma_context *gc = gd->gdma_context;
    struct device *dev = gc->dev;
    ...
}

In the second invocation, gd->gdma_context is NULL, meaning gc is NULL,
which could lead to a kernel panic when dereferencing gc->dev.
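
For illustration only, one way such a double invocation could be made
harmless is an early return when the context is already gone. This is an
assumption, not something proposed in the patch, and struct fake_gc,
struct fake_gd and remove_guarded() are hypothetical userspace stand-ins.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's context structures. */
struct fake_gc {
	int dev;
};

struct fake_gd {
	struct fake_gc *gdma_context;
};

/* Returns 1 if teardown ran, 0 if the device was already torn down. */
static int remove_guarded(struct fake_gd *gd)
{
	struct fake_gc *gc = gd->gdma_context;

	if (!gc)
		return 0;	/* second call: nothing left to dereference */

	(void)gc->dev;		/* safe here: gc is known to be non-NULL */
	gd->gdma_context = NULL;
	return 1;
}
```

Calling remove_guarded() twice on the same device runs the teardown once
and returns early the second time.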


* RE: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Michael Kelley @ 2026-04-14 15:34 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <3264b2e6-f6fb-41b1-97da-22b5249c1839@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 9:52 AM
> >>>
> >>> Sashiko AI raised an interesting question about the startup timing --
> >>> whether the vmbus_platform_driver_probe() is guaranteed to have
> >>> set vmbus_interrupt before the VTL functions below run and use it.
> >>> What causes the mshv_vtl.ko module to be loaded, and hence run
> >>> mshv_vtl_init()?
> >>
> >> There is no race condition here. The init ordering guarantees that
> >> vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
> >> reads it.
> >>
> >> The call chain for setting vmbus_interrupt:
> >>
> >>     subsys_initcall(hv_acpi_init)                          [level 4]
> >>       -> platform_driver_register(&vmbus_platform_driver) and so on.
> >>
> >>
> >> The call chain for reading vmbus_interrupt:
> >>
> >>     module_init(mshv_vtl_init)                             [level 6]
> >>       -> hv_vtl_setup_synic()
> >>         -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., mshv_vtl_alloc_context, ...)
> >>           -> mshv_vtl_alloc_context()
> >>             -> mshv_vtl_synic_enable_regs()
> >>               -> sint.vector = vmbus_interrupt
> >>
> >> do_initcalls() processes sections in order 0 through 7, so
> >> hv_acpi_init() (level 4) is guaranteed to complete before
> >> mshv_vtl_init() (level 6) runs.
> >>
> >
> > I think the situation is more complex than what you describe, depending
> > on whether the VMBus driver and/or MSHV_VTL are built as modules vs.
> > being built-in to the kernel image. In include/linux/module.h, see the
> > comment for module_init() and how subsys_initcall() is mapped
> > to module_init() when built as a module.
> >
> > If both are built-in, then what you describe is correct. But if either or
> > both are modules, then the respective init functions (hv_acpi_init
> > and mshv_vtl_init) get called at the time the module is loaded, and
> > not by do_initcalls(). I think hv_vmbus.ko gets loaded when an attempt
> > is first made to access a disk, but I would need to look more closely to
> > be sure. I don't have any understanding of what causes mshv_vtl.ko
> > to be loaded. And what is the ordering if MSHV_VTL is built-in while
> > VMBus is built as a module, or vice versa?
> >
> > Michael
> >
> 
> Based on this, I still feel that this race is not possible.
> 
> hv_vmbus   mshv_vtl
>     y          y  -> different initcall levels, no issues
>     y          m  -> use without initialization is not possible
>     m          y  -> config dependencies take care of this, and
>                      mshv_vtl is forced to compile as a module in
>                      this case.
>     m          m  -> config and symbol dependencies should take care
>                      of it. mshv_vtl has symbol and config
>                      dependencies on hv_vmbus, and it won't allow
>                      loading mshv_vtl if hv_vmbus module is not
>                      loaded.
> 
> Relevant code here: kernel/module/main.c
> 

Makes sense.  I'm convinced!  :-)
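
For the both-built-in (y/y) row, the level ordering argument can be
modelled in userspace roughly as follows. Everything here is a toy:
the fake_* names, the table of calls, and the vector value 18 are
assumptions for illustration and do not reflect how the kernel stores
initcalls.

```c
#include <stddef.h>

static int vmbus_interrupt = -1;	/* -1 models "not yet set" */
static int sint_vector = -1;

/* Toy level-4 init: sets the interrupt (18 is an arbitrary value). */
static void fake_hv_acpi_init(void)
{
	vmbus_interrupt = 18;
}

/* Toy level-6 init: reads the interrupt set by the earlier level. */
static void fake_mshv_vtl_init(void)
{
	sint_vector = vmbus_interrupt;
}

struct fake_initcall {
	int level;
	void (*fn)(void);
};

/* Listed in reverse on purpose: only the level decides the run order. */
static const struct fake_initcall calls[] = {
	{ 6, fake_mshv_vtl_init },
	{ 4, fake_hv_acpi_init },
};

/* Walk levels 0 through 7 and run every call registered at each level. */
static void fake_do_initcalls(void)
{
	for (int level = 0; level <= 7; level++)
		for (size_t i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if (calls[i].level == level)
				calls[i].fn();
}
```

The module rows of the table are not covered by this model, since they
rely on config and symbol dependencies rather than initcall levels.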

Michael


* [PATCH net-next 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya

The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.

Allocation sizes that are problematic:

  mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
    16 queues (default): 35528 * 16 =  ~555 KB contiguous
    64 queues (max):     35528 * 64 = ~2220 KB contiguous

  mana_create_rxq -> rxq struct with flex array
  (sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
    depth 1024 (default): 35712 + 296 * 1024 =  ~331 KB per queue
    depth 8192 (max):     35712 + 296 * 8192 = ~2403 KB per queue

  mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
    16 queues, depth 1024 (default): 16 * 1024 * 8 =  128 KB each
    64 queues, depth 8192 (max):     64 * 8192 * 8 = 4096 KB each
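
The figures above can be reproduced with a few lines of C. The struct
sizes are taken from this cover letter as constants rather than derived
from the headers, and the helper names are hypothetical. The results are
in bytes; dividing by 1024 gives the KB values listed.

```c
#include <stddef.h>

/* Sizes as quoted in the cover letter, in bytes. */
#define TX_QP_SIZE	35528u	/* sizeof(mana_tx_qp) */
#define RXQ_SIZE	35712u	/* sizeof(mana_rxq) without the flex array */
#define RX_OOB_SIZE	296u	/* one rx_oobs entry */
#define SLOT_SIZE	8u	/* one rxbufs_pre or das_pre entry */

/* Flat tx_qp array: one contiguous block covering all queues. */
static size_t tx_qp_array_bytes(size_t queues)
{
	return (size_t)TX_QP_SIZE * queues;
}

/* One rxq struct plus its flexible array of rx_oobs entries. */
static size_t rxq_bytes(size_t depth)
{
	return (size_t)RXQ_SIZE + (size_t)RX_OOB_SIZE * depth;
}

/* One pre-allocation array (rxbufs_pre or das_pre). */
static size_t prealloc_array_bytes(size_t queues, size_t depth)
{
	return queues * depth * (size_t)SLOT_SIZE;
}
```

For example, tx_qp_array_bytes(16) is 568448 bytes, which is about
555 KB.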

This series addresses the issue by:
  1. Converting the tx_qp flat array into an array of pointers with
     per-queue kvzalloc (~35 KB each), replacing a single contiguous
     allocation that can reach ~2.2 MB at 64 queues.
  2. Switching rxbufs_pre, das_pre, and rxq allocations to
     kvmalloc/kvzalloc so the allocator can fall back to vmalloc
     when contiguous memory is unavailable.
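
The shape of item 1 can be sketched in userspace C, with calloc() and
free() standing in for kvzalloc() and kvfree(). struct fake_tx_qp,
fake_create_txq() and fake_destroy_txq() are hypothetical names, and
this is a sketch of the pattern rather than the driver code.

```c
#include <stdlib.h>

/* Hypothetical stand-in sized like the struct quoted in the cover letter. */
struct fake_tx_qp {
	char payload[35528];
};

/* Free every populated slot, then the pointer array itself. */
static void fake_destroy_txq(struct fake_tx_qp **qps, size_t n)
{
	if (!qps)
		return;
	for (size_t i = 0; i < n; i++)
		free(qps[i]);	/* free(NULL) is a no-op for unfilled slots */
	free(qps);
}

/* Allocate a small array of pointers, then each queue separately. */
static struct fake_tx_qp **fake_create_txq(size_t n)
{
	struct fake_tx_qp **qps = calloc(n, sizeof(*qps));

	if (!qps)
		return NULL;

	for (size_t i = 0; i < n; i++) {
		qps[i] = calloc(1, sizeof(*qps[i]));
		if (!qps[i]) {
			/* Partial failure: release what was allocated. */
			fake_destroy_txq(qps, n);
			return NULL;
		}
	}
	return qps;
}
```

The largest single request drops from n * 35528 bytes to 35528 bytes
plus a pointer array of n entries, at the cost of one extra pointer
dereference per queue access.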

Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, Gbits/sec):

                 Physically contiguous         vmalloc region
  Connections      TX          RX              TX          RX
  --------------------------------------------------------------
  1                47.2        46.9            46.8        46.6
  16               181         181             181         181
  32               181         181             181         181
  64               181         181             181         181

Aditya Garg (2):
  net: mana: Use per-queue allocation for tx_qp to reduce allocation
    size
  net: mana: Use kvmalloc for large RX queue and buffer allocations

 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 39 insertions(+), 28 deletions(-)

-- 
2.43.0



* [PATCH net-next 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>

The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.

Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49ee77b0939a..585d891bbbac 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
 		put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
 	}
 
-	kfree(mpc->das_pre);
+	kvfree(mpc->das_pre);
 	mpc->das_pre = NULL;
 
 out2:
-	kfree(mpc->rxbufs_pre);
+	kvfree(mpc->rxbufs_pre);
 	mpc->rxbufs_pre = NULL;
 
 out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
 	num_rxb = num_queues * mpc->rx_queue_size;
 
 	WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
-	mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+	mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
 	if (!mpc->rxbufs_pre)
 		goto error;
 
-	mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+	mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
 	if (!mpc->das_pre)
 		goto error;
 
@@ -2527,7 +2527,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
-	kfree(rxq);
+	kvfree(rxq);
 }
 
 static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2667,7 +2667,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	gc = gd->gdma_context;
 
-	rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+	rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
 	if (!rxq)
 		return NULL;
 
-- 
2.43.0



* [PATCH net-next 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>

Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.

Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
 		count++;
 	}
 
-	tx_stats = &apc->tx_qp[q_idx].txq.stats;
+	tx_stats = &apc->tx_qp[q_idx]->txq.stats;
 
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 09a53c977545..49ee77b0939a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (skb_cow_head(skb, MANA_HEADROOM))
 		goto tx_drop_count;
 
-	txq = &apc->tx_qp[txq_idx].txq;
+	txq = &apc->tx_qp[txq_idx]->txq;
 	gdma_sq = txq->gdma_sq;
-	cq = &apc->tx_qp[txq_idx].tx_cq;
+	cq = &apc->tx_qp[txq_idx]->tx_cq;
 	tx_stats = &txq->stats;
 
 	BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2284,21 +2284,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		return;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
-		apc->tx_qp[i].mana_tx_debugfs = NULL;
+		if (!apc->tx_qp[i])
+			continue;
+
+		debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+		apc->tx_qp[i]->mana_tx_debugfs = NULL;
 
-		napi = &apc->tx_qp[i].tx_cq.napi;
-		if (apc->tx_qp[i].txq.napi_initialized) {
+		napi = &apc->tx_qp[i]->tx_cq.napi;
+		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
 			netif_napi_del_locked(napi);
-			apc->tx_qp[i].txq.napi_initialized = false;
+			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
-		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
 
-		mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
 
-		mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+		mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+		kvfree(apc->tx_qp[i]);
 	}
 
 	kfree(apc->tx_qp);
@@ -2307,7 +2312,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 
 static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
 {
-	struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+	struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
 	char qnum[32];
 
 	sprintf(qnum, "TX-%d", idx);
@@ -2346,7 +2351,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 	int err;
 	int i;
 
-	apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+	apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
 	if (!apc->tx_qp)
 		return -ENOMEM;
 
@@ -2366,10 +2371,16 @@ static int mana_create_txq(struct mana_port_context *apc,
 	gc = gd->gdma_context;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+		apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+		if (!apc->tx_qp[i]) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
 
 		/* Create SQ */
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 
 		u64_stats_init(&txq->stats.syncp);
 		txq->ndev = net;
@@ -2387,7 +2398,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 			goto out;
 
 		/* Create SQ's CQ */
-		cq = &apc->tx_qp[i].tx_cq;
+		cq = &apc->tx_qp[i]->tx_cq;
 		cq->type = MANA_CQ_TYPE_TX;
 
 		cq->txq = txq;
@@ -2416,7 +2427,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
-					 &apc->tx_qp[i].tx_object);
+					 &apc->tx_qp[i]->tx_object);
 
 		if (err)
 			goto out;
@@ -3242,7 +3253,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 */
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		tsleep = 1000;
 		while (atomic_read(&txq->pending_sends) > 0 &&
 		       time_before(jiffies, timeout)) {
@@ -3261,7 +3272,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	}
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		while ((skb = skb_dequeue(&txq->pending_skbs))) {
 			mana_unmap_skb(skb, apc);
 			dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..f5901e4c9816 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -251,7 +251,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..60b4a4146ea2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -505,7 +505,7 @@ struct mana_port_context {
 	bool tx_shortform_allowed;
 	u16 tx_vp_offset;
 
-	struct mana_tx_qp *tx_qp;
+	struct mana_tx_qp **tx_qp;
 
 	/* Indirection Table for RX & TX. The values are queue indexes */
 	u32 *indir_table;
-- 
2.43.0

