Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* RE: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Michael Kelley @ 2026-04-07 20:40 UTC (permalink / raw)
  To: Thomas Lefebvre
  Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <CAKdXbaXdSWq-NYk8z6_OtSgb6xsp+GJxrnF2iBMRdk0nfYne=A@mail.gmail.com>

From: Thomas Lefebvre <thomas.lefebvre3@gmail.com> Sent: Tuesday, April 7, 2026 12:13 PM
> 
> Hi everyone, thank you for your attention to this bug report.
> 
> Michael,
> 
> 1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
> and "constant_tsc".
> $ lscpu | grep tsc_reliable
> $ lscpu | grep constant_tsc
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> hyperv_clocksource_tsc_page
> 
> 2. Windows 10
> Version 22H2 (OS Build 19045.6466)
> 
> 3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
> 0x24e24, misc 0xbed7b2
> 
> 4. Yes, the laptop hibernates and then resumes.
> When the problem occurred, the laptop had gone through multiple
> hibernate and resume cycles.
> I haven't seen it happen after a full reboot before a hibernate/resume cycle.
> 
> Thomas
> 

How easy is it for you to reproduce the problem? Would it be feasible
to get a definitive answer on whether the problem repros after a
full reboot, but before a hibernate/resume cycle?

There's a known bug Windows 10 Hyper-V where the hardware TSC
scaling gets messed up after a hibernate/resume cycle, causing the TSC
values read in the guest to drift from what the Hyper-V host thinks
the guest's TSC value is. A summary of the problem is here:
https://github.com/microsoft/WSL/issues/6982#issuecomment-2294892954

Of course, this doesn't sound like your symptom. And Hyper-V is not
telling your guest that it supports hardware TSC scaling, because the
HV_ACCESS_TSC_INVARIANT flag is *not* set and the clocksource
is hyperv_clocksource_tsc_page. But my understanding is that the code
changes to fix the Hyper-V problem weren't trivial, and I'm speculating
that maybe you are seeing some other symptom of whatever the
underlying Hyper-V issue was.

Of course, this is just speculation. If the problem can occur before
any hibernate/resume cycles are done, then my speculation is
wrong. But if the problem only happens after a hibernate/resume
cycle, then this known problem, or something related to it, becomes
a pretty good candidate. Unfortunately, I'm pretty sure there's no
fix for Windows 10 Hyper-V. You would need to upgrade to 
Windows 11 22H2 or later.

Michael

^ permalink raw reply

* [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 89 +++++++++++++++++++
 include/net/mana/mana.h                       |  8 ++
 3 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.

Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropping the smbios string parsing and add ethtool priv flag
    to reconfigure the queues with full page rx buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
 include/net/mana/mana.h                       |   8 +
 3 files changed, 163 insertions(+), 31 deletions(-)

-- 
2.43.0

^ permalink raw reply

* Re: [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-07 19:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy
In-Reply-To: <20260406105136.5e02420e@kernel.org>

On Mon, Apr 06, 2026 at 10:51:36AM -0700, Jakub Kicinski wrote:
> On Sat, 4 Apr 2026 20:14:35 -0700 Dipayaan Roy wrote:
> >   Function                        Fragment   Full-page   Delta
> >   ─----------------------------   ─-------   ---------   -----
> >   napi_pp_put_page                  3.93%      0.85%    +3.08%
> >   page_pool_alloc_frag_netmem       1.93%         —     +1.93%
> >   Total page_pool overhead          5.86%      0.85%    +5.01%
> 
> 
> Thanks for the analysis, and presumably recycling the full page is
> cheaper because page_pool_put_unrefed_netmem() hits the fastpath
> because page_pool_napi_local() returns true?
yes right, thus avoiding atomics.

Regards


^ permalink raw reply

* Re: [GIT PULL] Hyper-V fixes for v7.0-rc8
From: pr-tracker-bot @ 2026-04-07 19:26 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux Kernel List, Linux on Hyper-V List,
	kys, haiyangz, decui, longli
In-Reply-To: <20260407042912.GA1012143@liuwe-devbox-debian-v2.local>

The pull request you sent on Tue, 7 Apr 2026 04:29:12 +0000:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260406

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/86782c16a81f8232c13c1509fd3295bd97d185b0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Thomas Lefebvre @ 2026-04-07 19:13 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157F82E4907C1B3305E86D5D45AA@SN6PR02MB4157.namprd02.prod.outlook.com>

Hi everyone, thank you for your attention to this bug report.

Michael,

1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
and "constant_tsc".
$ lscpu | grep tsc_reliable
$ lscpu | grep constant_tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hyperv_clocksource_tsc_page

2. Windows 10
Version 22H2 (OS Build 19045.6466)

3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
0x24e24, misc 0xbed7b2

4. Yes, the laptop hibernates and then resumes.
When the problem occurred, the laptop had gone through multiple
hibernate and resume cycles.
I haven't seen it happen after a full reboot before a hibernate/resume cycle.

Thomas

On Tue, Apr 7, 2026 at 11:37 AM Michael Kelley <mhklinux@outlook.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com> Sent: Tuesday, April 7, 2026 9:43 AM
> >
> > +Michael
> >
> > On Tue, Apr 07, 2026, Vitaly Kuznetsov wrote:
> > > Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:
> > > > Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> > > > The hypervisor corrects them only through the TSC page scale/offset.
> > > > If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> > > > later runs on CPU 1 where the raw TSC is lower, the unsigned
> > > > subtraction wraps.
> > > >
> > >
> > > According to the TLFS, reference TSC page is partition wide:
> > >
> > > "The hypervisor provides a partition-wide virtual reference TSC page
> > > which is overlaid on the partition’s GPA space. A partition’s reference
> > > time stamp counter page is accessed through the Reference TSC MSR."
> > >
> > > so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
> > > hardly see how we can use this time source at all, even without
> > > KVM. scale/offset are the same for all vCPUs.
> > >
> > > I think the fix here is to avoid setting up Hyper-V TSC page clocksource
> > > in L1. Unfortunately, with unsynchronized TSCs this will leave us the
> > > only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
> > > reads.
> >
> > This feels like either a Hyper-V bug or a Linux-as-a-guest bug.  For "Reference
> > Counter"[1]:
> >
> >   The hypervisor maintains a per-partition reference time counter. It has the
> >   characteristic that successive accesses to it return strictly monotonically
> >   increasing (time) values as seen by any and all virtual processors of a
> >   partition. Furthermore, the reference counter is rate constant and unaffected
> >   by processor or bus speed transitions or deep processor power savings states. A
> >   partition’s reference time counter is initialized to zero when the partition is
> >   created. The reference counter for all partitions count at the same rate, but
> >   at any time, their absolute values will typically differ because partitions
> >   will have different creation times.
> >
> >   The reference counter continues to count up as long as at least one virtual
> >   processor is not explicitly suspended.
> >
> >
> > And then "Partition Reference Time Enlightenment"[2]:
> >
> >   The partition reference time enlightenment presents a reference time source to
> >   a partition which does not require an intercept into the hypervisor. This
> >   enlightenment is available only when the underlying platform provides support
> >   of an invariant processor Time Stamp Counter (TSC), or iTSC. In such platforms,
> >   the processor TSC frequency remains constant irrespective of changes in the
> >   processor’s clock frequency due to the use of power management states such as
> >   ACPI processor performance states, processor idle sleep states (ACPI C-states),
> >   etc.
> >
> >   The partition reference time enlightenment uses a virtual TSC value, an offset
> >   and a multiplier to enable a guest partition to compute the normalized
> >   reference time since partition creation, in 100nS units. The mechanism also
> >   allows a guest partition to atomically compute the reference time when the
> >   guest partition is migrated to a platform with a different TSC rate, and
> >   provides a fallback mechanism to support migration to platforms without the
> >   constant rate TSC feature.
> >
> > My read of "Partition Reference Time Enlightenment" is that it should only be
> > advertised if the TSC is synchronized and constant.  I can't figure out where
> > that feature is actually advertised though, because IIUC it's not the same as
> > HV_ACCESS_TSC_INVARIANT, which says that the virtual TSC is guaranteed to be
> > invariant even across live migration.  And it's not HV_MSR_REFERENCE_TSC_AVAILABLE,
> > because I'm pretty sure that just says HV_MSR_REFERENCE_TSC is available.
> >
> > Michael, help?
> >
> > [1] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#reference-counter
> > [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#partition-reference-time-enlightenment
>
> Yes, TSC page enlightenment is per VM, so it does not compensate for
> discrepancies in raw TSC values across physical CPUs. RDTSC in a
> Hyper-V VM is executed directly by the hardware (i.e., does not trap to
> the hypervisor), so there's no opportunity for the hypervisor to compensate
> for discrepancies. The hypervisor is expected to present a VM with TSCs
> that are already synchronized. I'll need to double-check, but I don't think
> Linux guests on Hyper-V run their own TSC synchronization.
>
> The relevant Hyper-V flags are:
> * HV_MSR_TIME_REF_COUNT_AVAILABLE:  The synthetic MSR for reading
>    the partition reference time is available.
> * HV_MSR_REFERENCE_TSC_AVAILABLE: The partition reference time
>    enlightenment (i.e., "the TSC page") is available as a faster way to read
>    the reference counter.
> * HV_ACCESS_TSC_INVARIANT: As Sean said, this says the hardware and
>    Hyper-V support TSC scaling, so live migration can be done across hosts
>    without the guest seeing a change in TSC frequency.
>
> Yes, this does feel like an issue where Hyper-V is not presenting the guest
> with TSCs that are already synchronized. But I'm not aware of having seen
> such a problem before. I'll try to imagine a scenario where a problem like
> this could happen via some other path.
>
> @Thomas Lefebvre:  Let me double-check a few things via these follow-up
> questions/actions:
>
> 1. You said the clocksource is hyperv_clocksource_tsc_page. Just to
> confirm, that's for the L1 guest, right? Does the output of the "lscpu"
> command in the L1 guest show the flags "tsc_reliable" and "constant_tsc"?
> I'm assume "no", since if these flags were set, the clocksource (i.e.,
> /sys/devices/system/clocksource/clocksource0/current_clocksource)
> should be the standard "tsc". I've got a laptop with a i7-13700H processor,
> and my L1 VMs show "tsc" as the clocksource, but I haven't been running
> KVM with L2 nested VMs.
>
> 2. What is the version of Windows/Hyper-V you are running? Get the
> output of the "winver.exe" command. It should be something like this:
>
> Windows 11 [as the top banner]
> Version 25H2 (OS Build 26200.8037)
>
> 3. In the dmesg output of your L1 VM, find the line like this one and reply
> with what you have:
>
>     Hyper-V: privilege flags low 0xae7f, high 0x3b8030, hints 0x9a4e24, misc 0xe0bed7b2
>
> From there, I can decode the Hyper-V settings and see if anything jumps out
> as anomalous.
>
> 4. Does the laptop where you are seeing this problem ever hibernate and
> then resume? If so, do you recall if the problem occurs after a full reboot but
> before it ever does a hibernate/resume cycle?
>
> Michael

^ permalink raw reply

* RE: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Michael Kelley @ 2026-04-07 18:37 UTC (permalink / raw)
  To: Sean Christopherson, Vitaly Kuznetsov, Thomas Lefebvre
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <adU0LAW1h8q9HsGu@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Tuesday, April 7, 2026 9:43 AM
> 
> +Michael
> 
> On Tue, Apr 07, 2026, Vitaly Kuznetsov wrote:
> > Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:
> > > Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> > > The hypervisor corrects them only through the TSC page scale/offset.
> > > If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> > > later runs on CPU 1 where the raw TSC is lower, the unsigned
> > > subtraction wraps.
> > >
> >
> > According to the TLFS, reference TSC page is partition wide:
> >
> > "The hypervisor provides a partition-wide virtual reference TSC page
> > which is overlaid on the partition’s GPA space. A partition’s reference
> > time stamp counter page is accessed through the Reference TSC MSR."
> >
> > so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
> > hardly see how we can use this time source at all, even without
> > KVM. scale/offset are the same for all vCPUs.
> >
> > I think the fix here is to avoid setting up Hyper-V TSC page clocksource
> > in L1. Unfortunately, with unsynchronized TSCs this will leave us the
> > only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
> > reads.
> 
> This feels like either a Hyper-V bug or a Linux-as-a-guest bug.  For "Reference
> Counter"[1]:
> 
>   The hypervisor maintains a per-partition reference time counter. It has the
>   characteristic that successive accesses to it return strictly monotonically
>   increasing (time) values as seen by any and all virtual processors of a
>   partition. Furthermore, the reference counter is rate constant and unaffected
>   by processor or bus speed transitions or deep processor power savings states. A
>   partition’s reference time counter is initialized to zero when the partition is
>   created. The reference counter for all partitions count at the same rate, but
>   at any time, their absolute values will typically differ because partitions
>   will have different creation times.
> 
>   The reference counter continues to count up as long as at least one virtual
>   processor is not explicitly suspended.
> 
> 
> And then "Partition Reference Time Enlightenment"[2]:
> 
>   The partition reference time enlightenment presents a reference time source to
>   a partition which does not require an intercept into the hypervisor. This
>   enlightenment is available only when the underlying platform provides support
>   of an invariant processor Time Stamp Counter (TSC), or iTSC. In such platforms,
>   the processor TSC frequency remains constant irrespective of changes in the
>   processor’s clock frequency due to the use of power management states such as
>   ACPI processor performance states, processor idle sleep states (ACPI C-states),
>   etc.
> 
>   The partition reference time enlightenment uses a virtual TSC value, an offset
>   and a multiplier to enable a guest partition to compute the normalized
>   reference time since partition creation, in 100nS units. The mechanism also
>   allows a guest partition to atomically compute the reference time when the
>   guest partition is migrated to a platform with a different TSC rate, and
>   provides a fallback mechanism to support migration to platforms without the
>   constant rate TSC feature.
> 
> My read of "Partition Reference Time Enlightenment" is that it should only be
> advertised if the TSC is synchronized and constant.  I can't figure out where
> that feature is actually advertised though, because IIUC it's not the same as
> HV_ACCESS_TSC_INVARIANT, which says that the virtual TSC is guaranteed to be
> invariant even across live migration.  And it's not HV_MSR_REFERENCE_TSC_AVAILABLE,
> because I'm pretty sure that just says HV_MSR_REFERENCE_TSC is available.
> 
> Michael, help?
> 
> [1] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#reference-counter
> [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#partition-reference-time-enlightenment

Yes, TSC page enlightenment is per VM, so it does not compensate for
discrepancies in raw TSC values across physical CPUs. RDTSC in a
Hyper-V VM is executed directly by the hardware (i.e., does not trap to
the hypervisor), so there's no opportunity for the hypervisor to compensate
for discrepancies. The hypervisor is expected to present a VM with TSCs
that are already synchronized. I'll need to double-check, but I don't think
Linux guests on Hyper-V run their own TSC synchronization.

The relevant Hyper-V flags are:
* HV_MSR_TIME_REF_COUNT_AVAILABLE:  The synthetic MSR for reading
   the partition reference time is available.
* HV_MSR_REFERENCE_TSC_AVAILABLE: The partition reference time
   enlightenment (i.e., "the TSC page") is available as a faster way to read
   the reference counter.
* HV_ACCESS_TSC_INVARIANT: As Sean said, this says the hardware and
   Hyper-V support TSC scaling, so live migration can be done across hosts
   without the guest seeing a change in TSC frequency.

Yes, this does feel like an issue where Hyper-V is not presenting the guest
with TSCs that are already synchronized. But I'm not aware of having seen
such a problem before. I'll try to imagine a scenario where a problem like
this could happen via some other path.

@Thomas Lefebvre:  Let me double-check a few things via these follow-up
questions/actions:

1. You said the clocksource is hyperv_clocksource_tsc_page. Just to
confirm, that's for the L1 guest, right? Does the output of the "lscpu"
command in the L1 guest show the flags "tsc_reliable" and "constant_tsc"?
I'm assume "no", since if these flags were set, the clocksource (i.e.,
/sys/devices/system/clocksource/clocksource0/current_clocksource)
should be the standard "tsc". I've got a laptop with a i7-13700H processor,
and my L1 VMs show "tsc" as the clocksource, but I haven't been running
KVM with L2 nested VMs.

2. What is the version of Windows/Hyper-V you are running? Get the
output of the "winver.exe" command. It should be something like this:

Windows 11 [as the top banner]
Version 25H2 (OS Build 26200.8037)

3. In the dmesg output of your L1 VM, find the line like this one and reply
with what you have:

    Hyper-V: privilege flags low 0xae7f, high 0x3b8030, hints 0x9a4e24, misc 0xe0bed7b2

From there, I can decode the Hyper-V settings and see if anything jumps out
as anomalous. 

4. Does the laptop where you are seeing this problem ever hibernate and
then resume? If so, do you recall if the problem occurs after a full reboot but
before it ever does a hibernate/resume cycle?

Michael

^ permalink raw reply

* Re: [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
From: Mukesh R @ 2026-04-07 17:41 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <aW-oniY3VpagQMPb@skinsburskii.localdomain>

On 1/20/26 08:09, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:21PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add a new file to implement VFIO-MSHV bridge pseudo device. These
>> functions are called in the VFIO framework, and credits to kvm/vfio.c
>> as this file was adapted from it.
>>
>> Original author: Wei Liu <wei.liu@kernel.org>
>> (Slightly modified from the original version).
>>
> 
> There is a Linux standard for giving credits when code is adapted from.
> This doesn't follow that standard. Please fix.
> 
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>   drivers/hv/Makefile    |   3 +-
>>   drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 212 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/hv/mshv_vfio.c
>>
>> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
>> index a49f93c2d245..eae003c4cb8f 100644
>> --- a/drivers/hv/Makefile
>> +++ b/drivers/hv/Makefile
>> @@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
>>   hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
>>   hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
>>   mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
>> -	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
>> +	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
>> +               mshv_vfio.o
>>   mshv_vtl-y := mshv_vtl_main.o
>>   
>>   # Code that must be built-in
>> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
>> new file mode 100644
>> index 000000000000..6ea4d99a3bd2
>> --- /dev/null
>> +++ b/drivers/hv/mshv_vfio.c
>> @@ -0,0 +1,210 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * VFIO-MSHV bridge pseudo device
>> + *
>> + * Heavily inspired by the VFIO-KVM bridge pseudo device.
>> + */
>> +#include <linux/errno.h>
>> +#include <linux/file.h>
>> +#include <linux/list.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/slab.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "mshv.h"
>> +#include "mshv_root.h"
>> +
>> +struct mshv_vfio_file {
>> +	struct list_head node;
>> +	struct file *file;	/* list of struct mshv_vfio_file */
>> +};
>> +
>> +struct mshv_vfio {
>> +	struct list_head file_list;
>> +	struct mutex lock;
>> +};
>> +
>> +static bool mshv_vfio_file_is_valid(struct file *file)
>> +{
>> +	bool (*fn)(struct file *file);
>> +	bool ret;
>> +
>> +	fn = symbol_get(vfio_file_is_valid);
>> +	if (!fn)
>> +		return false;
>> +
>> +	ret = fn(file);
>> +
>> +	symbol_put(vfio_file_is_valid);
>> +
>> +	return ret;
>> +}
>> +
>> +static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
>> +{
>> +	struct mshv_vfio *mshv_vfio = mshvdev->device_private;
>> +	struct mshv_vfio_file *mvf;
>> +	struct file *filp;
>> +	long ret = 0;
>> +
>> +	filp = fget(fd);
>> +	if (!filp)
>> +		return -EBADF;
>> +
>> +	/* Ensure the FD is a vfio FD. */
>> +	if (!mshv_vfio_file_is_valid(filp)) {
>> +		ret = -EINVAL;
>> +		goto out_fput;
>> +	}
>> +
>> +	mutex_lock(&mshv_vfio->lock);
>> +
>> +	list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
>> +		if (mvf->file == filp) {
>> +			ret = -EEXIST;
>> +			goto out_unlock;
>> +		}
>> +	}
>> +
>> +	mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
>> +	if (!mvf) {
>> +		ret = -ENOMEM;
>> +		goto out_unlock;
>> +	}
>> +
>> +	mvf->file = get_file(filp);
>> +	list_add_tail(&mvf->node, &mshv_vfio->file_list);
>> +
>> +out_unlock:
>> +	mutex_unlock(&mshv_vfio->lock);
>> +out_fput:
>> +	fput(filp);
>> +	return ret;
>> +}
>> +
>> +static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
>> +{
>> +	struct mshv_vfio *mshv_vfio = mshvdev->device_private;
>> +	struct mshv_vfio_file *mvf;
>> +	long ret;
>> +
>> +	CLASS(fd, f)(fd);
>> +
>> +	if (fd_empty(f))
>> +		return -EBADF;
>> +
>> +	ret = -ENOENT;
>> +	mutex_lock(&mshv_vfio->lock);
>> +
>> +	list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
>> +		if (mvf->file != fd_file(f))
>> +			continue;
>> +
>> +		list_del(&mvf->node);
>> +		fput(mvf->file);
>> +		kfree(mvf);
>> +		ret = 0;
>> +		break;
>> +	}
>> +
>> +	mutex_unlock(&mshv_vfio->lock);
>> +	return ret;
>> +}
>> +
>> +static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
>> +			      void __user *arg)
>> +{
>> +	int32_t __user *argp = arg;
>> +	int32_t fd;
>> +
>> +	switch (attr) {
>> +	case MSHV_DEV_VFIO_FILE_ADD:
>> +		if (get_user(fd, argp))
>> +			return -EFAULT;
>> +		return mshv_vfio_file_add(mshvdev, fd);
>> +
>> +	case MSHV_DEV_VFIO_FILE_DEL:
>> +		if (get_user(fd, argp))
>> +			return -EFAULT;
>> +		return mshv_vfio_file_del(mshvdev, fd);
>> +	}
>> +
>> +	return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
>> +			      struct mshv_device_attr *attr)
>> +{
>> +	switch (attr->group) {
>> +	case MSHV_DEV_VFIO_FILE:
>> +		return mshv_vfio_set_file(mshvdev, attr->attr,
>> +					  u64_to_user_ptr(attr->addr));
>> +	}
>> +
>> +	return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
>> +			      struct mshv_device_attr *attr)
>> +{
>> +	switch (attr->group) {
>> +	case MSHV_DEV_VFIO_FILE:
>> +		switch (attr->attr) {
>> +		case MSHV_DEV_VFIO_FILE_ADD:
>> +		case MSHV_DEV_VFIO_FILE_DEL:
>> +			return 0;
>> +		}
>> +
>> +		break;
>> +	}
>> +
>> +	return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
>> +{
>> +	struct mshv_device *tmp;
>> +	struct mshv_vfio *mshv_vfio;
>> +
>> +	/* Only one VFIO "device" per VM */
>> +	hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
>> +			     device_ptnode)
>> +		if (tmp->device_ops == &mshv_vfio_device_ops)
>> +			return -EBUSY;
>> +
>> +	mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
>> +	if (mshv_vfio == NULL)
>> +		return -ENOMEM;
>> +
>> +	INIT_LIST_HEAD(&mshv_vfio->file_list);
>> +	mutex_init(&mshv_vfio->lock);
>> +
>> +	mshvdev->device_private = mshv_vfio;
>> +
>> +	return 0;
>> +}
>> +
>> +/* This is called from mshv_device_fop_release() */
>> +static void mshv_vfio_release_device(struct mshv_device *mshvdev)
>> +{
>> +	struct mshv_vfio *mv = mshvdev->device_private;
>> +	struct mshv_vfio_file *mvf, *tmp;
>> +
>> +	list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
>> +		fput(mvf->file);
> 
> This put must be sync as device must be detached from domain before
> attempting partition destruction.

Like I said in 6.6 PR, this does not attach or detach devices.

> This was explicitly mentioned in the patch originated this code.
> Please fix, add a comment and credits to the commit message.

That was ".detstroy" hook which is gone.

Thanks,
-Mukesh


> Thanks,
> Stanislav
> 
> 
>> +		list_del(&mvf->node);
>> +		kfree(mvf);
>> +	}
>> +
>> +	kfree(mv);
>> +	kfree(mshvdev);
>> +}
>> +
>> +struct mshv_device_ops mshv_vfio_device_ops = {
>> +	.device_name = "mshv-vfio",
>> +	.device_create = mshv_vfio_create_device,
>> +	.device_release = mshv_vfio_release_device,
>> +	.device_set_attr = mshv_vfio_set_attr,
>> +	.device_has_attr = mshv_vfio_has_attr,
>> +};
>> -- 
>> 2.51.2.vfs.0.1
>>


^ permalink raw reply

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Sean Christopherson @ 2026-04-07 16:44 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Thomas Lefebvre, pbonzini, kvm, linux-kernel, linux-hyperv,
	Michael Kelley
In-Reply-To: <adU0LAW1h8q9HsGu@google.com>

On Tue, Apr 07, 2026, Sean Christopherson wrote:
> +Michael

Let's try that again.  Email address #1 bounced.

> On Tue, Apr 07, 2026, Vitaly Kuznetsov wrote:
> > Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:
> > > Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> > > The hypervisor corrects them only through the TSC page scale/offset.
> > > If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> > > later runs on CPU 1 where the raw TSC is lower, the unsigned
> > > subtraction wraps.
> > >
> > 
> > According to the TLFS, reference TSC page is partition wide:
> > 
> > "The hypervisor provides a partition-wide virtual reference TSC page
> > which is overlaid on the partition’s GPA space. A partition’s reference
> > time stamp counter page is accessed through the Reference TSC MSR."
> > 
> > so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
> > hardly see how we can use this time source at all, even without
> > KVM. scale/offset are the same for all vCPUs.
> > 
> > I think the fix here is to avoid setting up Hyper-V TSC page clocksource
> > in L1. Unfortunately, with unsynchronized TSCs this will leave us the
> > only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
> > reads.
> 
> This feels like either a Hyper-V bug or a Linux-as-a-guest bug.  For "Reference
> Counter"[1]:
> 
>   The hypervisor maintains a per-partition reference time counter. It has the
>   characteristic that successive accesses to it return strictly monotonically
>   increasing (time) values as seen by any and all virtual processors of a
>   partition. Furthermore, the reference counter is rate constant and unaffected
>   by processor or bus speed transitions or deep processor power savings states. A
>   partition’s reference time counter is initialized to zero when the partition is
>   created. The reference counter for all partitions count at the same rate, but
>   at any time, their absolute values will typically differ because partitions
>   will have different creation times.
>   
>   The reference counter continues to count up as long as at least one virtual
>   processor is not explicitly suspended.
> 
> 
> And then "Partition Reference Time Enlightenment"[2]:
> 
>   The partition reference time enlightenment presents a reference time source to
>   a partition which does not require an intercept into the hypervisor. This
>   enlightenment is available only when the underlying platform provides support
>   of an invariant processor Time Stamp Counter (TSC), or iTSC. In such platforms,
>   the processor TSC frequency remains constant irrespective of changes in the
>   processor’s clock frequency due to the use of power management states such as
>   ACPI processor performance states, processor idle sleep states (ACPI C-states),
>   etc.
> 
>   The partition reference time enlightenment uses a virtual TSC value, an offset
>   and a multiplier to enable a guest partition to compute the normalized
>   reference time since partition creation, in 100nS units. The mechanism also
>   allows a guest partition to atomically compute the reference time when the
>   guest partition is migrated to a platform with a different TSC rate, and
>   provides a fallback mechanism to support migration to platforms without the
>   constant rate TSC feature.
> 
> My read of "Partition Reference Time Enlightenment" is that it should only be
> advertised if the TSC is synchronized and constant.  I can't figure out where
> that feature is actually advertised though, because IIUC it's not the same as
> HV_ACCESS_TSC_INVARIANT, which says that the virtual TSC is guaranteed to be
> invariant even across live migration.  And it's not HV_MSR_REFERENCE_TSC_AVAILABLE,
> because I'm pretty sure that just says HV_MSR_REFERENCE_TSC is available.
> 
> Michael, help?
> 
> [1] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#reference-counter
> [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#partition-reference-time-enlightenment

^ permalink raw reply

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Sean Christopherson @ 2026-04-07 16:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Thomas Lefebvre, pbonzini, kvm, linux-kernel, linux-hyperv,
	Michael Kelley
In-Reply-To: <87v7e3mbgj.fsf@redhat.com>

+Michael

On Tue, Apr 07, 2026, Vitaly Kuznetsov wrote:
> Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:
> > Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> > The hypervisor corrects them only through the TSC page scale/offset.
> > If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> > later runs on CPU 1 where the raw TSC is lower, the unsigned
> > subtraction wraps.
> >
> 
> According to the TLFS, reference TSC page is partition wide:
> 
> "The hypervisor provides a partition-wide virtual reference TSC page
> which is overlaid on the partition’s GPA space. A partition’s reference
> time stamp counter page is accessed through the Reference TSC MSR."
> 
> so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
> hardly see how we can use this time source at all, even without
> KVM. scale/offset are the same for all vCPUs.
> 
> I think the fix here is to avoid setting up Hyper-V TSC page clocksource
> in L1. Unfortunately, with unsynchronized TSCs this will leave us the
> only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
> reads.

This feels like either a Hyper-V bug or a Linux-as-a-guest bug.  For "Reference
Counter"[1]:

  The hypervisor maintains a per-partition reference time counter. It has the
  characteristic that successive accesses to it return strictly monotonically
  increasing (time) values as seen by any and all virtual processors of a
  partition. Furthermore, the reference counter is rate constant and unaffected
  by processor or bus speed transitions or deep processor power savings states. A
  partition’s reference time counter is initialized to zero when the partition is
  created. The reference counter for all partitions count at the same rate, but
  at any time, their absolute values will typically differ because partitions
  will have different creation times.

  The reference counter continues to count up as long as at least one virtual
  processor is not explicitly suspended.

And then "Partition Reference Time Enlightenment"[2]:

  The partition reference time enlightenment presents a reference time source to
  a partition which does not require an intercept into the hypervisor. This
  enlightenment is available only when the underlying platform provides support
  of an invariant processor Time Stamp Counter (TSC), or iTSC. In such platforms,
  the processor TSC frequency remains constant irrespective of changes in the
  processor’s clock frequency due to the use of power management states such as
  ACPI processor performance states, processor idle sleep states (ACPI C-states),
  etc.

  The partition reference time enlightenment uses a virtual TSC value, an offset
  and a multiplier to enable a guest partition to compute the normalized
  reference time since partition creation, in 100nS units. The mechanism also
  allows a guest partition to atomically compute the reference time when the
  guest partition is migrated to a platform with a different TSC rate, and
  provides a fallback mechanism to support migration to platforms without the
  constant rate TSC feature.

My read of "Partition Reference Time Enlightenment" is that it should only be
advertised if the TSC is synchronized and constant.  I can't figure out where
that feature is actually advertised though, because IIUC it's not the same as
HV_ACCESS_TSC_INVARIANT, which says that the virtual TSC is guaranteed to be
invariant even across live migration.  And it's not HV_MSR_REFERENCE_TSC_AVAILABLE,
because I'm pretty sure that just says HV_MSR_REFERENCE_TSC is available.

Michael, help?

[1] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#reference-counter
[2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#partition-reference-time-enlightenment

^ permalink raw reply

* Re: [PATCH 4/7] mshv: Optimize memory region mapping operations
From: Stanislav Kinsburskii @ 2026-04-07 15:07 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260406-expert-pelican-of-weather-75fda2@anirudhrb>

On Mon, Apr 06, 2026 at 04:57:12PM +0000, Anirudh Rayabharam wrote:
> On Wed, Apr 01, 2026 at 10:12:46PM +0000, Stanislav Kinsburskii wrote:
> > Two specific operations don't require PFN iteration: region unmapping
> > and region remapping with no access. For unmapping, all frames in MSHV
> > memory regions are guaranteed to be mapped with page access, so we can
> > unmap them all without checking individual PFNs. For remapping with no
> > access, all frames are already mapped with page access, allowing us to
> > unmap them all in one pass.
> > 
> > Since neither operation needs PFN validation, iterating over PFNs is
> > redundant. Batch operations into large page-aligned chunks followed by
> > remaining pages. This eliminates PFN traversal for these operations,
> > requires no additional hypercalls compared to the PFN-checking approach,
> > and provides the simplest possible sequential execution path.
> > 
> > The optimization utilizes HV_MAP_GPA_LARGE_PAGE and
> > HV_UNMAP_GPA_LARGE_PAGE flags for aligned portions, processing only the
> > remainder with base page granularity. This removes mshv_region_chunk_unmap()
> > and mshv_region_process_range() helper functions, reducing code complexity.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c |   65 ++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 46 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index 2c4215381e0b..a92381219758 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -449,27 +449,27 @@ static int mshv_region_pin(struct mshv_region *region)
> >  	return ret < 0 ? ret : -ENOMEM;
> >  }
> >  
> > -static int mshv_region_chunk_unmap(struct mshv_region *region,
> > -				   u32 flags,
> > -				   u64 pfn_offset, u64 pfn_count,
> > -				   bool huge_page)
> > +static int mshv_region_unmap(struct mshv_region *region)
> >  {
> > -	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > -		return 0;
> > +	u64 aligned_pages, remaining_pages;
> > +	int ret = 0;
> >  
> > -	if (huge_page)
> > -		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > +	aligned_pages = ALIGN_DOWN(region->nr_pfns, PTRS_PER_PMD);
> 
> Why is it sufficient to check just the number of pages to determine
> whether we need to use the HV_UNMAP_GPA_LARGE_PAGE? Don't we need to
> check the alignment of the start address as well?
> 

Yeah, I think it's not sufficient.
I'll address this gap.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v2 06/16] RDMA/hns: Fix xarray race in hns_roce_create_srq()
From: Jason Gunthorpe @ 2026-04-07 14:03 UTC (permalink / raw)
  To: Junxian Huang
  Cc: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Michael Margolin, Nelson Escobar,
	Satish Kharat, Selvin Xavier, Yossi Leybovich, Chengchang Tang,
	Tatyana Nikolova, Vishnu Dasa, Yishai Hadas, Adit Ranadive,
	Aditya Sarwade, Bryan Tan, Dexuan Cui, Doug Ledford, George Zhang,
	Jorgen Hansen, Leon Romanovsky, Parav Pandit, patches,
	Roland Dreier, Roland Dreier, Ajay Sharma, stable
In-Reply-To: <f1fb94fe-c86b-7866-d606-088343a56fab@hisilicon.com>

On Tue, Apr 07, 2026 at 09:39:52PM +0800, Junxian Huang wrote:
> 
> 
> On 2026/4/7 1:40, Jason Gunthorpe wrote:
> > Sashiko points out that once the srq memory is stored into the xarray by
> > alloc_srqc() it can immediately be looked up by:
> > 
> > 	xa_lock(&srq_table->xa);
> > 	srq = xa_load(&srq_table->xa, srqn & (hr_dev->caps.num_srqs - 1));
> > 	if (srq)
> > 		refcount_inc(&srq->refcount);
> > 	xa_unlock(&srq_table->xa);
> > 
> > Which will fail refcount debug because the refcount is 0 and then crash:
> > 
> > 	srq->event(srq, event_type);
> > 
> > Because event is NULL.
> 
> I don't think this will actually happen because HW won't report an SRQ
> event before the SRQ is fully ready and actually used.

Probably, but also maybe there is some crazy race where EQ event can
be generated and the SRQ cycled before it is collected..

There is also a second bug here that Shashiko noticed on this patch
that the order is wrong, the goto unwind in create will call
free_srqc() but it hasn't yet setup the completion. I will fix that in
a v3..

> From the perspective of coding, I'm fine with this change, but since
> there is similar logic for QP event, could you also apply this change
> to QP?

Sure

Jason

^ permalink raw reply

* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-07 13:54 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees
In-Reply-To: <e80b603d-8be0-4aee-8a31-c9cbb4a8ab00@intel.com>

On Tue, Apr 07, 2026 at 03:10:45PM +0200, Alexander Lobakin wrote:
> From: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> Date: Sat, 4 Apr 2026 20:42:15 -0700
> 
> > On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
> > fragments for allocation in the RX refill path (~2kB buffer per fragment)
> > causes 15-20% throughput regression under high connection counts
> > (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> > platforms shows no regression and restores line-rate performance.
> > 
> > This behavior is observed on a single platform; other platforms
> > perform better with page_pool fragments, indicating this is not a
> > page_pool issue but platform-specific.
> > 
> > This series adds an ethtool private flag "full-page-rx" to let the
> > user opt in to one RX buffer per page:
> > 
> >   ethtool --set-priv-flags eth0 full-page-rx on
> 
> Sorry I may've missed the previous threads.
> 
> Has this approach been discussed here? Private flags are generally
> discouraged.
> 
> Alternatively, you can provide Ethtool ops to change the Rx buffer size,
> so that you'd be able to set it to PAGE_SIZE on affected platforms and
> the result would be the same.
>
Hi Alex, 
This was discussed here:
https://lore.kernel.org/all/adHTm2SvjDrezEdv@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/ 
> > 
> > There is no behavioral change by default. The flag can be persisted
> > via udev rule for affected platforms.
> > 
> > Changes in v5:
> >   - Split prep refactor into separate patch (patch 1/2)
> > Changes in v4:
> >   - Dropping the smbios string parsing and add ethtool priv flag
> >     to reconfigure the queues with full page rx buffers.
> > Changes in v3:
> >   - changed u8* to char*
> > Changes in v2:
> >   - separate reading string index and the string, remove inline.
> > 
> > Dipayaan Roy (2):
> >   net: mana: refactor mana_get_strings() and mana_get_sset_count() to
> >     use switch
> >   net: mana: force full-page RX buffers via ethtool private flag
> > 
> >  drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
> >  .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
> >  include/net/mana/mana.h                       |   8 +
> >  3 files changed, 163 insertions(+), 31 deletions(-)
> 
> Thanks,
> Olek

^ permalink raw reply

* Re: [PATCH v2 06/16] RDMA/hns: Fix xarray race in hns_roce_create_srq()
From: Junxian Huang @ 2026-04-07 13:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Michael Margolin, Nelson Escobar,
	Satish Kharat, Selvin Xavier, Yossi Leybovich, Chengchang Tang,
	Tatyana Nikolova, Vishnu Dasa, Yishai Hadas
  Cc: Adit Ranadive, Aditya Sarwade, Bryan Tan, Dexuan Cui,
	Doug Ledford, George Zhang, Jorgen Hansen, Leon Romanovsky,
	Parav Pandit, patches, Roland Dreier, Roland Dreier, Ajay Sharma,
	stable
In-Reply-To: <6-v2-1c49eeb88c48+91-rdma_udata_rep_jgg@nvidia.com>



On 2026/4/7 1:40, Jason Gunthorpe wrote:
> Sashiko points out that once the srq memory is stored into the xarray by
> alloc_srqc() it can immediately be looked up by:
> 
> 	xa_lock(&srq_table->xa);
> 	srq = xa_load(&srq_table->xa, srqn & (hr_dev->caps.num_srqs - 1));
> 	if (srq)
> 		refcount_inc(&srq->refcount);
> 	xa_unlock(&srq_table->xa);
> 
> Which will fail refcount debug because the refcount is 0 and then crash:
> 
> 	srq->event(srq, event_type);
> 
> Because event is NULL.

I don't think this will actually happen because HW won't report an SRQ
event before the SRQ is fully ready and actually used.

From the perspective of coding, I'm fine with this change, but since
there is similar logic for QP event, could you also apply this change
to QP?

Junxian

> 
> Use refcount_inc_not_zero() instead to ensure a partially prepared srq is
> never retrieved from the event handler and fix the ordering of the
> initialization so refcount becomes 1 only after it is fully ready.
> 
> Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=3
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/hw/hns/hns_roce_srq.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
> index cb848e8e6bbd76..d6201ddde0292a 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_srq.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
> @@ -16,8 +16,8 @@ void hns_roce_srq_event(struct hns_roce_dev *hr_dev, u32 srqn, int event_type)
>  
>  	xa_lock(&srq_table->xa);
>  	srq = xa_load(&srq_table->xa, srqn & (hr_dev->caps.num_srqs - 1));
> -	if (srq)
> -		refcount_inc(&srq->refcount);
> +	if (srq && !refcount_inc_not_zero(&srq->refcount))
> +		srq = NULL;
>  	xa_unlock(&srq_table->xa);
>  
>  	if (!srq) {
> @@ -481,8 +481,8 @@ int hns_roce_create_srq(struct ib_srq *ib_srq,
>  	}
>  
>  	srq->event = hns_roce_ib_srq_event;
> -	refcount_set(&srq->refcount, 1);
>  	init_completion(&srq->free);
> +	refcount_set(&srq->refcount, 1);
>  
>  	return 0;
>  

^ permalink raw reply

* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Alexander Lobakin @ 2026-04-07 13:10 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees
In-Reply-To: <adHaF6DloRthctRb@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

From: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Date: Sat, 4 Apr 2026 20:42:15 -0700

> On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
> fragments for allocation in the RX refill path (~2kB buffer per fragment)
> causes 15-20% throughput regression under high connection counts
> (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> platforms shows no regression and restores line-rate performance.
> 
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.
> 
> This series adds an ethtool private flag "full-page-rx" to let the
> user opt in to one RX buffer per page:
> 
>   ethtool --set-priv-flags eth0 full-page-rx on

Sorry I may've missed the previous threads.

Has this approach been discussed here? Private flags are generally
discouraged.

Alternatively, you can provide Ethtool ops to change the Rx buffer size,
so that you'd be able to set it to PAGE_SIZE on affected platforms and
the result would be the same.

> 
> There is no behavioral change by default. The flag can be persisted
> via udev rule for affected platforms.
> 
> Changes in v5:
>   - Split prep refactor into separate patch (patch 1/2)
> Changes in v4:
>   - Dropping the smbios string parsing and add ethtool priv flag
>     to reconfigure the queues with full page rx buffers.
> Changes in v3:
>   - changed u8* to char*
> Changes in v2:
>   - separate reading string index and the string, remove inline.
> 
> Dipayaan Roy (2):
>   net: mana: refactor mana_get_strings() and mana_get_sset_count() to
>     use switch
>   net: mana: force full-page RX buffers via ethtool private flag
> 
>  drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
>  .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
>  include/net/mana/mana.h                       |   8 +
>  3 files changed, 163 insertions(+), 31 deletions(-)

Thanks,
Olek

^ permalink raw reply

* [PATCH v2] tools: hv: Fix cross-compilation
From: Aditya Garg @ 2026-04-07 12:20 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, gregkh, ssengar,
	linux-hyperv, linux-kernel, romank, avladu, vdso, gargaditya,
	gargaditya

Use the native ARCH only in case it is not set, this will allow the
cross-compilation where ARCH is explicitly set.

Additionally, simplify the check for ARCH so that fcopy daemon is built
only for x86_64.

Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
Reported-by: Adrian Vladu <avladu@cloudbasesolutions.com>
Closes: https://lore.kernel.org/linux-hyperv/PR3PR09MB54119DB2FD76977C62D8DD6AB04D2@PR3PR09MB5411.eurprd09.prod.outlook.com/
Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
---
Changes since v1:
    - Dropped the info target printing CC, LD and ARCH

v1: https://lore.kernel.org/all/1733992114-7305-1-git-send-email-ssengar@linux.microsoft.com/
---
 tools/hv/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/Makefile b/tools/hv/Makefile
index 34ffcec264ab..e377caf89fb6 100644
--- a/tools/hv/Makefile
+++ b/tools/hv/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Hyper-V tools
 include ../scripts/Makefile.include
 
-ARCH := $(shell uname -m 2>/dev/null)
+ARCH ?= $(shell uname -m 2>/dev/null)
 sbindir ?= /usr/sbin
 libexecdir ?= /usr/libexec
 sharedstatedir ?= /var/lib
@@ -20,7 +20,7 @@ override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
 override CFLAGS += -Wno-address-of-packed-member
 
 ALL_TARGETS := hv_kvp_daemon hv_vss_daemon
-ifneq ($(ARCH), aarch64)
+ifeq ($(ARCH), x86_64)
 ALL_TARGETS += hv_fcopy_uio_daemon
 endif
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
-- 
2.43.0


^ permalink raw reply related

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Vitaly Kuznetsov @ 2026-04-07  8:23 UTC (permalink / raw)
  To: Sean Christopherson, Thomas Lefebvre
  Cc: pbonzini, kvm, linux-kernel, linux-hyperv
In-Reply-To: <adO_EYdKtl_TXooI@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
>> Hi,
>> 
>> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
>> running KVM inside a Hyper-V VM (nested virtualization).  I tracked
>> it down to an unsigned wraparound in __get_kvmclock() and have
>> bpftrace data showing the exact failure.
>> 
>> Setup:
>>   - Intel i7-11800H laptop running Windows with Hyper-V
>>   - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
>>   - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
>>   - KVM running inside L1, hosting L2 guests
>> 
>> Root cause:
>> 
>> __get_kvmclock() does:
>> 
>>     hv_clock.tsc_timestamp = ka->master_cycle_now;
>>     hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>>     ...
>>     data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>> 
>> and __pvclock_read_cycles() does:
>> 
>>     delta = tsc - src->tsc_timestamp;    /* unsigned */
>> 
>> master_cycle_now is a raw RDTSC captured by
>> pvclock_update_vm_gtod_copy().  host_tsc is a raw RDTSC read by
>> __get_kvmclock() on the current CPU.  Both go through the vgettsc()
>> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
>> cross-CPU-consistent reference counter via scale/offset, but stores
>> the *raw* RDTSC in tsc_timestamp as a side effect.
>> 
>> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
>> The hypervisor corrects them only through the TSC page scale/offset.
>> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
>> later runs on CPU 1 where the raw TSC is lower, the unsigned
>> subtraction wraps.
>> 
>> I wrote a bpftrace tracer (included below) to instrument both
>> functions and captured two corruption events:
>> 
>>   Event 1:
>> 
>>     [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
>>                 mcn=598992030530137 mkn=259977082393200
>> 
>>     [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
>>       clock=8006399342167092479 host_tsc=598991848289183
>>       master_cycle_now=598992030530137
>>       system_time(mkn+off)=5175860260
>>       TSC DEFICIT: 182240954 cycles
>> 
>>     master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
>>     CPU 1's raw RDTSC was 182M cycles lower.
>> 
>>       598991848289183 - 598992030530137 = 18446744073527310662 (u64)
>> 
>>     Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
>>     Correct system_time: 5,175,860,260 ns (~5.2 seconds)
>> 
>>   Event 2:
>> 
>>     [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
>>                 mcn=599040238416510
>> 
>>     [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
>>       clock=8006399342464295526 host_tsc=599040211994220
>>       master_cycle_now=599040238416510
>>       TSC DEFICIT: 26422290 cycles
>> 
>>     Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
>> 
>> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
>> vs stale master_cycle_now passed to __pvclock_read_cycles().
>> 
>> The simplest fix I can think of is guarding the __pvclock_read_cycles
>> call in __get_kvmclock():
>> 
>>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>     else
>>         data->clock = hv_clock.system_time;
>
> That might kinda sorta work for one KVM-as-the-host path, but it's not a proper
> fix.  The actual guest-side (L2) reads in __pvclock_clocksource_read() will also
> be broken, because PVCLOCK_TSC_STABLE_BIT will be set.
>
> I don't see how this scenario can possibly work, KVM is effectively mixing two
> time domains.  The stable timestamp from the TSC page is (obviously) *derived*
> from the raw, *unstable* TSC, but they are two distinct domains.
>
> What really confuses me is why we thought this would work for Hyper-V but not for
> kvmclock (i.e. KVM-on-KVM).  Hyper-V's TSC page and kvmclock are the exact same
> concept, but vgettsc() only special cases VDSO_CLOCKMODE_HVCLOCK, not
> VDSO_CLOCKMODE_PVCLOCK.
>
> Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests
> when running nested on Hyper-V")?
>
> Vitaly, what am I missing?
>

It's probably me who's missing somethings :-) but my understanding is
that we can't be using TSC page clocksource with unsyncronized TSCs in
L1 at all as TSC page (unlike kvmclock) is always partition-wide and
thus can't lead to a sane result in case raw TSC readings diverge. The
idea of b0c39dc68e3b was that in Hyper-V guests *with stable,
syncronized TSC* we may still be using Hyper-V TSC page clocksource and
thus we can pass it to L2.

-- 
Vitaly


^ permalink raw reply

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Vitaly Kuznetsov @ 2026-04-07  8:17 UTC (permalink / raw)
  To: Thomas Lefebvre, seanjc, pbonzini; +Cc: kvm, linux-kernel, linux-hyperv
In-Reply-To: <CAKdXbaV1PTwetd4zs6+6Rp7h0dwHU1ygMoof5eAcfL6XYZF1xA@mail.gmail.com>

Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:

...

>
> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> The hypervisor corrects them only through the TSC page scale/offset.
> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> later runs on CPU 1 where the raw TSC is lower, the unsigned
> subtraction wraps.
>

According to the TLFS, reference TSC page is partition wide:

"The hypervisor provides a partition-wide virtual reference TSC page
which is overlaid on the partition’s GPA space. A partition’s reference
time stamp counter page is accessed through the Reference TSC MSR."

so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
hardly see how we can use this time source at all, even without
KVM. scale/offset are the same for all vCPUs.

I think the fix here is to avoid setting up Hyper-V TSC page clocksource
in L1. Unfortunately, with unsynchronized TSCs this will leave us the
only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
reads.

-- 
Vitaly

^ permalink raw reply

* Re: [PATCH 3/8] firmware: sysfb: Make CONFIG_SYSFB a user-selectable option
From: Thomas Zimmermann @ 2026-04-07  7:39 UTC (permalink / raw)
  To: Arnd Bergmann, Javier Martinez Canillas, Ard Biesheuvel,
	Ilias Apalodimas, Huacai Chen, WANG Xuerui, Maarten Lankhorst,
	Maxime Ripard, Dave Airlie, Simona Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, longli, Helge Deller
  Cc: linux-arm-kernel, loongarch, linux-efi, linux-riscv, dri-devel,
	linux-hyperv, linux-fbdev
In-Reply-To: <295a43ce-92fb-435d-a82f-d1cfa8f4f28d@app.fastmail.com>

Hi

Am 02.04.26 um 18:47 schrieb Arnd Bergmann:
> On Thu, Apr 2, 2026, at 17:27, Thomas Zimmermann wrote:
>> Am 02.04.26 um 16:59 schrieb Arnd Bergmann:
>>> On Thu, Apr 2, 2026, at 16:10, Thomas Zimmermann wrote:
>>>> Am 02.04.26 um 15:08 schrieb Arnd Bergmann:
>>>>> On Thu, Apr 2, 2026, at 11:09, Thomas Zimmermann wrote:
>>>>> I don't really like this part of the series and would prefer
>>>>> to keep CONFIG_SYSFB hidden as much as possible as an x86
>>>>> (and EFI) specific implementation detail, with the hope
>>>>> of eventually seperating out the x86 bits from the EFI ones.
>>>> You mean, you want to use the EFI-provided framebuffers without the
>>>> intermediate step of going through sysfb_primary_display?
>>>>
>>>> In that case, CONFIG_SYSFB would become an x86-internal thing, right?
>>> The part that is still needed from sysfb is the arbitration
>>> between DRM_EFI and the PCI device driver for the same hardware,
>>> so I think some part of sysfb is clearly needed, in particular
>>> the sysfb_disable() function that removes the EFI framebuffer
>>> when there is a conflicting simpledrm or hardware specific
>>> driver.
>> We do most of that in the aperture-helper module. (see
>> <linux/aperture.h>). Calling sysfb_disable() from there is a workaround
>> for some corner cases. We can have an EFI-specific function that does
>> the same.
> That sounds good, yes. The same change would need to go into
> of_platform_default_populate_init() then.

The call to sysfb_disable() is a workaround for certain cases that 
aperture helpers don't handle well. In the longer term, I'd want 
aperture helpers to be more clever about aperture ownership. But as an 
intermediate step, adding other _disable() function would be an option. 
But there's no hurry.

>
>> BTW, simpledrm-on-EFI/VESA is considered obsolete and should preferably
>> be removed from that driver. Simpledrm should become a driver for
>> Devicetree nodes of type simple-framebuffer (as it originally has been
>> intended).
> Sure, I was only thinking of the case where there are both
> sysfb (from Arm/riscv UEFI) and simpledrm (from devicetree)
> objects referring to the same one, not the simpledrm
> device created by sysfb_simplefb.
>
>>> The parts that I want to keep out of that is anything
>>> related to the x86 boot protocol, non-EFI framebuffers,
>>> text console, and kexec handoff, which we don't need on
>>> non-x86 UEFI systems.
>>>
>>> I don't mind the idea of having a sysfb_primary_display
>>> in the EFI code if that helps keep EFI sane on x86,
>>> but it would be good to make that local to
>>> drivers/firmware/efi and (eventually) detached from
>>> include/uapi/linux/screen_info.h.
>> Efidrm retrieves the framebuffer settings from the contained struct
>> screen_info. Disconnecting from screen_info would require separate
>> graphics drivers for x86 and non-x86. If we split off EFI from sysfb,
>> we'd likely need a sysfbdrm driver of some sort. Just saying.
> Yes, I saw that as well and don't have an immediate idea for how
> to best do it. I saw that you already abstracted the access to
> the screen_info members in drm_sysfb_screen_info.c, which I think
> is a step in that direction.
>
> I also noticed that efidrm is mostly a subset of vesadrm, so
> in theory they could be merged back into an x86 drm driver
> along with the drm_sysfb_screen_info helpers, and have a non-x86
> driver that constructs a drm_sysfb_device directly from the
> EFI structures.

I would not want to have a unifed driver for all-things-screen_info. The 
code that can easily be shared is already in the sysfb helpers. But I 
don't mind adding a separate driver for EFI's Graphics Output Protocol. 
Then there would be current efidrm for EFI-from-screen_info and 
efigopdrm for EFI via the GOP interfaces. If EFI ever specifies another 
graphics interface, we could add another driver. The maintenance 
overhead is small on the DRM side.

What is the future of x86's boot_param and the related screen_info on 
x86-64?  Is it obsolete?  Will boot loaders run the EFI stub instead?

>
>> I think we'd also have to duplicate the framebuffer-relocation code that
>> currently works on anything using struct screen_info (see patch 5).
> You mean the code from include/linux/screen_info.h? I think
> it would make sense to have an x86 specific version of that
> to operate on the x86 screen_info, and a simpler version
> that just updates the resource for the efirdrm driver, but
> that could also be done one level higher or lower.

Makes me wonder if the relocation code could be integrated into the 
aperture helpers. It would have to track the relocation of all PCI 
graphics devices and probing DRM drivers would query the relocation 
offset for their given framebuffer.

>
>>>>> In general, I am always in favor of properly using Kconfig
>>>>> dependencies over 'select' statements, for the same reasons
>>>>> you describe, but I don't want the the x86 logic for
>>>>> the legacy VESA and VGA console handling to leak into more
>>>>> architectures than necessary.
>>>>>
>>>>> Do you think we could instead move the sysfb_init()
>>>>> function into the same two places that contain the
>>>>> sysfb_primary_display definition (arch/x86/kernel/setup.c,
>>>>> drivers/firmware/efi/efi-init.c) and simplify the efi version
>>>>> to take out the x86 bits? That would reduce the rest
>>>>> of sysfb-primary.c to the logic to unregister the device,
>>>>> and that could then be selected by both x86 and EFI.
>>>> No, I'm more than happy that sysfb finally consolidates all the
>>>> init-framebuffer setup and detection that floated around in the kernel.
>>>> I would not want it to be duplicated again.
>>>>
>>>> For now, we could certainly keep CONFIG_SYSFB hidden and autoselected.
>>>> Although I think this will require soem sort of solution at a later point.
>>> Can you clarify which problem you are trying to solve
>>> with that?
>> One thing is that some users simply what control over their kernel build.
>>
>> I also think that there might be systems that want to use
>> sysfb_primary_display (plus the relocation feature), but not create the
>> framebuffer device. Say for efi-earlycon. It needs user-control over the
>> SYSFB option to do that.
> I'm still not following, sorry. efi-earlycon doesn't require
> CONFIG_SYSFB today, and I don't see why that would need to change,
> or why it couldn't just 'select SYSFB' if it it does change.
>
>> As a side-effect, user-configurable SYSFB gives us a nice place to put
>> SYSFB_SIMPLEFB and FIRMWARE_EDID; two options that currently float
>> around in the config somewhat arbitrarily.
> You said that SYSFB_SIMPLEFB should get phased out in the future,
> right?

Yes. Better sooner than later.

>
> I'm also missing your plan for CONFIG_FIRMWARE_EDID. I only
> see three legacy drivers using the old fb_firmware_edid()
> interface, so I assume this is not what you are interested in.
>
> For the global copy that is filled by x86 and efi, and
> consumed by vesadrm and efidrm, does that even need to
> be a configuration option rather than get always enabled?

There is code in x86's old 16-bit boot/init code that reads the EDID via 
VESA. The help text on CONFIG_FIRMWARE_EDID sounds like it needs to be 
configurable because some systems can't do the VESA calls. Hence, the 
logical step seems to be to make this consistent for all systems by 
adopting the option for all EDID retrieval.

If we can remove CONFIG_FIRMWARE_EDID and make EDID support 
unconditional, I'm all for it.

Best regards
Thomas

>
>         Arnd

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* [GIT PULL] Hyper-V fixes for v7.0-rc8
From: Wei Liu @ 2026-04-07  4:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wei Liu, Linux Kernel List, Linux on Hyper-V List, kys, haiyangz,
	decui, longli


Hi Linus,

The following changes since commit c0e296f257671ba10249630fe58026f29e4804d9:

  mshv: Fix error handling in mshv_region_pin (2026-03-18 16:18:49 +0000)

are available in the Git repository at:

  ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260406

for you to fetch changes up to 16cbec24897624051b324aa3a85859c38ca65fde:

  mshv: Fix infinite fault loop on permission-denied GPA intercepts (2026-04-04 05:25:53 +0000)

----------------------------------------------------------------
hyperv-fixes for v7.0-rc8
  - Two fixes for Hyper-V PCI driver (Long Li, Sahil Chandna)
  - Fix an infinity loop issue in MSHV driver (Stanislav Kinsburskii)
----------------------------------------------------------------
Long Li (1):
      PCI: hv: Set default NUMA node to 0 for devices without affinity info

Sahil Chandna (1):
      PCI: hv: Fix double ida_free in hv_pci_probe error path

Stanislav Kinsburskii (1):
      mshv: Fix infinite fault loop on permission-denied GPA intercepts

 drivers/hv/mshv_root_main.c         | 15 ++++++++++++---
 drivers/pci/controller/pci-hyperv.c | 12 +++++++++---
 include/hyperv/hvgdk_mini.h         |  6 ++++++
 include/hyperv/hvhdk.h              |  4 ++--
 4 files changed, 29 insertions(+), 8 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Jakub Kicinski @ 2026-04-06 17:51 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy
In-Reply-To: <adHTm2SvjDrezEdv@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Sat, 4 Apr 2026 20:14:35 -0700 Dipayaan Roy wrote:
>   Function                        Fragment   Full-page   Delta
>   ─----------------------------   ─-------   ---------   -----
>   napi_pp_put_page                  3.93%      0.85%    +3.08%
>   page_pool_alloc_frag_netmem       1.93%         —     +1.93%
>   Total page_pool overhead          5.86%      0.85%    +5.01%


Thanks for the analysis, and presumably recycling the full page is
cheaper because page_pool_put_unrefed_netmem() hits the fastpath
because page_pool_napi_local() returns true?

^ permalink raw reply

* [PATCH v2 09/16] RDMA: Convert drivers using min to ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 17:40 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: Adit Ranadive, Aditya Sarwade, Bryan Tan, Dexuan Cui,
	Doug Ledford, George Zhang, Jorgen Hansen, Leon Romanovsky,
	Parav Pandit, patches, Roland Dreier, Roland Dreier, Ajay Sharma,
	stable
In-Reply-To: <0-v2-1c49eeb88c48+91-rdma_udata_rep_jgg@nvidia.com>

Convert the pattern:

   ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));

Using Coccinelle:

@@
identifier resp;
expression udata;
@@

- ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen))
+ ib_respond_udata(udata, resp)

@@
identifier resp;
expression udata;
@@

- ib_copy_to_udata(udata, &resp, min(udata->outlen, sizeof(resp)))
+ ib_respond_udata(udata, resp)

Run another pass with AI to propagate the return code correctly and
remove redundant prints.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/efa/efa_verbs.c        | 44 +++++-------------
 drivers/infiniband/hw/erdma/erdma_verbs.c    |  3 +-
 drivers/infiniband/hw/hns/hns_roce_ah.c      |  4 +-
 drivers/infiniband/hw/hns/hns_roce_cq.c      |  3 +-
 drivers/infiniband/hw/hns/hns_roce_main.c    |  3 +-
 drivers/infiniband/hw/hns/hns_roce_pd.c      |  8 ++--
 drivers/infiniband/hw/hns/hns_roce_qp.c      | 13 ++----
 drivers/infiniband/hw/hns/hns_roce_srq.c     |  6 +--
 drivers/infiniband/hw/irdma/verbs.c          | 48 +++++++-------------
 drivers/infiniband/hw/mana/cq.c              |  6 +--
 drivers/infiniband/hw/mana/qp.c              |  6 +--
 drivers/infiniband/hw/mlx5/srq.c             |  7 +--
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c |  8 ++--
 13 files changed, 49 insertions(+), 110 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 3ad5d6e27b1590..395290ab05847a 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -270,13 +270,9 @@ int efa_query_device(struct ib_device *ibdev,
 		if (dev->neqs)
 			resp.device_caps |= EFA_QUERY_DEVICE_CAPS_CQ_NOTIFICATIONS;
 
-		err = ib_copy_to_udata(udata, &resp,
-				       min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(ibdev,
-				  "Failed to copy udata for query_device\n");
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			return err;
-		}
 	}
 
 	return 0;
@@ -442,13 +438,9 @@ int efa_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	resp.pdn = result.pdn;
 
 	if (udata->outlen) {
-		err = ib_copy_to_udata(udata, &resp,
-				       min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(&dev->ibdev,
-				  "Failed to copy udata for alloc_pd\n");
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_dealloc_pd;
-		}
 	}
 
 	ibdev_dbg(&dev->ibdev, "Allocated pd[%d]\n", pd->pdn);
@@ -782,14 +774,9 @@ int efa_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr,
 	qp->max_inline_data = init_attr->cap.max_inline_data;
 
 	if (udata->outlen) {
-		err = ib_copy_to_udata(udata, &resp,
-				       min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(&dev->ibdev,
-				  "Failed to copy udata for qp[%u]\n",
-				  create_qp_resp.qp_num);
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_remove_mmap_entries;
-		}
 	}
 
 	ibdev_dbg(&dev->ibdev, "Created qp[%d]\n", qp->ibqp.qp_num);
@@ -1226,13 +1213,9 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	}
 
 	if (udata->outlen) {
-		err = ib_copy_to_udata(udata, &resp,
-				       min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(ibdev,
-				  "Failed to copy udata for create_cq\n");
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_xa_erase;
-		}
 	}
 
 	ibdev_dbg(ibdev, "Created cq[%d], cq depth[%u]. dma[%pad] virt[0x%p]\n",
@@ -1935,8 +1918,7 @@ int efa_alloc_ucontext(struct ib_ucontext *ibucontext, struct ib_udata *udata)
 	resp.max_tx_batch = dev->dev_attr.max_tx_batch;
 	resp.min_sq_wr = dev->dev_attr.min_sq_depth;
 
-	err = ib_copy_to_udata(udata, &resp,
-			       min(sizeof(resp), udata->outlen));
+	err = ib_respond_udata(udata, resp);
 	if (err)
 		goto err_dealloc_uar;
 
@@ -2087,13 +2069,9 @@ int efa_create_ah(struct ib_ah *ibah,
 	resp.efa_address_handle = result.ah;
 
 	if (udata->outlen) {
-		err = ib_copy_to_udata(udata, &resp,
-				       min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(&dev->ibdev,
-				  "Failed to copy udata for create_ah response\n");
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_destroy_ah;
-		}
 	}
 	ibdev_dbg(&dev->ibdev, "Created ah[%d]\n", ah->ah);
 
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 5523b4e151e1ff..9bba470c6e3257 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1990,8 +1990,7 @@ int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 		uresp.cq_id = cq->cqn;
 		uresp.num_cqe = depth;
 
-		ret = ib_copy_to_udata(udata, &uresp,
-				       min(sizeof(uresp), udata->outlen));
+		ret = ib_respond_udata(udata, uresp);
 		if (ret)
 			goto err_free_res;
 	} else {
diff --git a/drivers/infiniband/hw/hns/hns_roce_ah.c b/drivers/infiniband/hw/hns/hns_roce_ah.c
index 8a605da8a93c97..925ddf15b68102 100644
--- a/drivers/infiniband/hw/hns/hns_roce_ah.c
+++ b/drivers/infiniband/hw/hns/hns_roce_ah.c
@@ -32,6 +32,7 @@
 
 #include <rdma/ib_addr.h>
 #include <rdma/ib_cache.h>
+#include <rdma/uverbs_ioctl.h>
 #include "hns_roce_device.h"
 #include "hns_roce_hw_v2.h"
 
@@ -112,8 +113,7 @@ int hns_roce_create_ah(struct ib_ah *ibah, struct rdma_ah_init_attr *init_attr,
 		resp.priority = ah->av.sl;
 		resp.tc_mode = tc_mode;
 		memcpy(resp.dmac, ah_attr->roce.dmac, ETH_ALEN);
-		ret = ib_copy_to_udata(udata, &resp,
-				       min(udata->outlen, sizeof(resp)));
+		ret = ib_respond_udata(udata, resp);
 	}
 
 err_out:
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 621568e114054b..24de651f735e03 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -452,8 +452,7 @@ int hns_roce_create_cq(struct ib_cq *ib_cq, const struct ib_cq_init_attr *attr,
 
 	if (udata) {
 		resp.cqn = hr_cq->cqn;
-		ret = ib_copy_to_udata(udata, &resp,
-				       min(udata->outlen, sizeof(resp)));
+		ret = ib_respond_udata(udata, resp);
 		if (ret)
 			goto err_cqc;
 	}
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 0dbe99aab6ad21..c17ff5347a0147 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -477,8 +477,7 @@ static int hns_roce_alloc_ucontext(struct ib_ucontext *uctx,
 
 	resp.cqe_size = hr_dev->caps.cqe_sz;
 
-	ret = ib_copy_to_udata(udata, &resp,
-			       min(udata->outlen, sizeof(resp)));
+	ret = ib_respond_udata(udata, resp);
 	if (ret)
 		goto error_fail_copy_to_udata;
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_pd.c b/drivers/infiniband/hw/hns/hns_roce_pd.c
index 225c3e328e0e08..73bb000574c50d 100644
--- a/drivers/infiniband/hw/hns/hns_roce_pd.c
+++ b/drivers/infiniband/hw/hns/hns_roce_pd.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 
+#include <rdma/uverbs_ioctl.h>
 #include "hns_roce_device.h"
 
 void hns_roce_init_pd_table(struct hns_roce_dev *hr_dev)
@@ -61,12 +62,9 @@ int hns_roce_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	if (udata) {
 		struct hns_roce_ib_alloc_pd_resp resp = {.pdn = pd->pdn};
 
-		ret = ib_copy_to_udata(udata, &resp,
-				       min(udata->outlen, sizeof(resp)));
-		if (ret) {
+		ret = ib_respond_udata(udata, resp);
+		if (ret)
 			ida_free(&pd_ida->ida, id);
-			ibdev_err(ib_dev, "failed to copy to udata, ret = %d\n", ret);
-		}
 	}
 
 	return ret;
diff --git a/drivers/infiniband/hw/hns/hns_roce_qp.c b/drivers/infiniband/hw/hns/hns_roce_qp.c
index a27ea85bb06323..6d63613dcd5a9a 100644
--- a/drivers/infiniband/hw/hns/hns_roce_qp.c
+++ b/drivers/infiniband/hw/hns/hns_roce_qp.c
@@ -1235,12 +1235,9 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
 
 	if (udata) {
 		resp.cap_flags = hr_qp->en_flags;
-		ret = ib_copy_to_udata(udata, &resp,
-				       min(udata->outlen, sizeof(resp)));
-		if (ret) {
-			ibdev_err(ibdev, "copy qp resp failed!\n");
+		ret = ib_respond_udata(udata, resp);
+		if (ret)
 			goto err_flow_ctrl;
-		}
 	}
 
 	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_QP_FLOW_CTRL) {
@@ -1487,11 +1484,7 @@ int hns_roce_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	if (udata && udata->outlen) {
 		resp.tc_mode = hr_qp->tc_mode;
 		resp.priority = hr_qp->sl;
-		ret = ib_copy_to_udata(udata, &resp,
-				       min(udata->outlen, sizeof(resp)));
-		if (ret)
-			ibdev_err_ratelimited(&hr_dev->ib_dev,
-					      "failed to copy modify qp resp.\n");
+		ret = ib_respond_udata(udata, resp);
 	}
 
 out:
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index d6201ddde0292a..113037c83a0376 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -473,11 +473,9 @@ int hns_roce_create_srq(struct ib_srq *ib_srq,
 	if (udata) {
 		resp.cap_flags = srq->cap_flags;
 		resp.srqn = srq->srqn;
-		if (ib_copy_to_udata(udata, &resp,
-				     min(udata->outlen, sizeof(resp)))) {
-			ret = -EFAULT;
+		ret = ib_respond_udata(udata, resp);
+		if (ret)
 			goto err_srqc;
-		}
 	}
 
 	srq->event = hns_roce_ib_srq_event;
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 17086048d2d7fc..79e72a457e7983 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -325,9 +325,9 @@ static int irdma_alloc_ucontext(struct ib_ucontext *uctx,
 		uresp.max_pds = iwdev->rf->sc_dev.hw_attrs.max_hw_pds;
 		uresp.wq_size = iwdev->rf->sc_dev.hw_attrs.max_qp_wr * 2;
 		uresp.kernel_ver = req.userspace_ver;
-		if (ib_copy_to_udata(udata, &uresp,
-				     min(sizeof(uresp), udata->outlen)))
-			return -EFAULT;
+		ret = ib_respond_udata(udata, uresp);
+		if (ret)
+			return ret;
 	} else {
 		u64 bar_off = (uintptr_t)iwdev->rf->sc_dev.hw_regs[IRDMA_DB_ADDR_OFFSET];
 
@@ -354,10 +354,10 @@ static int irdma_alloc_ucontext(struct ib_ucontext *uctx,
 		uresp.comp_mask |= IRDMA_ALLOC_UCTX_MIN_HW_WQ_SIZE;
 		uresp.max_hw_srq_quanta = uk_attrs->max_hw_srq_quanta;
 		uresp.comp_mask |= IRDMA_ALLOC_UCTX_MAX_HW_SRQ_QUANTA;
-		if (ib_copy_to_udata(udata, &uresp,
-				     min(sizeof(uresp), udata->outlen))) {
+		ret = ib_respond_udata(udata, uresp);
+		if (ret) {
 			rdma_user_mmap_entry_remove(ucontext->db_mmap_entry);
-			return -EFAULT;
+			return ret;
 		}
 	}
 
@@ -420,11 +420,9 @@ static int irdma_alloc_pd(struct ib_pd *pd, struct ib_udata *udata)
 						  ibucontext);
 		irdma_sc_pd_init(dev, sc_pd, pd_id, ucontext->abi_ver);
 		uresp.pd_id = pd_id;
-		if (ib_copy_to_udata(udata, &uresp,
-				     min(sizeof(uresp), udata->outlen))) {
-			err = -EFAULT;
+		err = ib_respond_udata(udata, uresp);
+		if (err)
 			goto error;
-		}
 	} else {
 		irdma_sc_pd_init(dev, sc_pd, pd_id, IRDMA_ABI_VER);
 	}
@@ -1124,10 +1122,8 @@ static int irdma_create_qp(struct ib_qp *ibqp,
 		uresp.qp_id = qp_num;
 		uresp.qp_caps = qp->qp_uk.qp_caps;
 
-		err_code = ib_copy_to_udata(udata, &uresp,
-					    min(sizeof(uresp), udata->outlen));
+		err_code = ib_respond_udata(udata, uresp);
 		if (err_code) {
-			ibdev_dbg(&iwdev->ibdev, "VERBS: copy_to_udata failed\n");
 			irdma_destroy_qp(&iwqp->ibqp, udata);
 			return err_code;
 		}
@@ -1612,12 +1608,9 @@ int irdma_modify_qp_roce(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 				uresp.push_valid = 1;
 				uresp.push_offset = iwqp->sc_qp.push_offset;
 			}
-			ret = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp),
-					       udata->outlen));
+			ret = ib_respond_udata(udata, uresp);
 			if (ret) {
 				irdma_remove_push_mmap_entries(iwqp);
-				ibdev_dbg(&iwdev->ibdev,
-					  "VERBS: copy_to_udata failed\n");
 				return ret;
 			}
 		}
@@ -1860,12 +1853,9 @@ int irdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
 			uresp.push_offset = iwqp->sc_qp.push_offset;
 		}
 
-		err = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp),
-				       udata->outlen));
+		err = ib_respond_udata(udata, uresp);
 		if (err) {
 			irdma_remove_push_mmap_entries(iwqp);
-			ibdev_dbg(&iwdev->ibdev,
-				  "VERBS: copy_to_udata failed\n");
 			return err;
 		}
 	}
@@ -2418,11 +2408,9 @@ static int irdma_create_srq(struct ib_srq *ibsrq,
 
 		resp.srq_id = iwsrq->srq_num;
 		resp.srq_size = ukinfo->srq_size;
-		if (ib_copy_to_udata(udata, &resp,
-				     min(sizeof(resp), udata->outlen))) {
-			err_code = -EPROTO;
+		err_code = ib_respond_udata(udata, resp);
+		if (err_code)
 			goto srq_destroy;
-		}
 	}
 
 	return 0;
@@ -2664,13 +2652,9 @@ static int irdma_create_cq(struct ib_cq *ibcq,
 
 		resp.cq_id = info.cq_uk_init_info.cq_id;
 		resp.cq_size = info.cq_uk_init_info.cq_size;
-		if (ib_copy_to_udata(udata, &resp,
-				     min(sizeof(resp), udata->outlen))) {
-			ibdev_dbg(&iwdev->ibdev,
-				  "VERBS: copy to user data\n");
-			err_code = -EPROTO;
+		err_code = ib_respond_udata(udata, resp);
+		if (err_code)
 			goto cq_destroy;
-		}
 	}
 
 	init_completion(&iwcq->free_cq);
@@ -5330,7 +5314,7 @@ static int irdma_create_user_ah(struct ib_ah *ibah,
 	mutex_unlock(&iwdev->rf->ah_tbl_lock);
 
 	uresp.ah_id = ah->sc_ah.ah_info.ah_idx;
-	err = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp), udata->outlen));
+	err = ib_respond_udata(udata, uresp);
 	if (err)
 		irdma_destroy_ah(ibah, attr->flags);
 
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index f4cbe21763bf11..43b3ef65d3fc6d 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -79,11 +79,9 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 
 	if (udata) {
 		resp.cqid = cq->queue.id;
-		err = ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(&mdev->ib_dev, "Failed to copy to udata, %d\n", err);
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_remove_cq_cb;
-		}
 	}
 
 	spin_lock_init(&cq->cq_lock);
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index f503445a38f2d8..afc1d0e299aaf4 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -555,11 +555,9 @@ static int mana_ib_create_rc_qp(struct ib_qp *ibqp, struct ib_pd *ibpd,
 			resp.queue_id[j] = qp->rc_qp.queues[i].id;
 			j++;
 		}
-		err = ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));
-		if (err) {
-			ibdev_dbg(&mdev->ib_dev, "Failed to copy to udata, %d\n", err);
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto destroy_qp;
-		}
 	}
 
 	err = mana_table_store_qp(mdev, qp);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 852f6f502d14d0..3fb8519a4ce0d7 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -292,12 +292,9 @@ int mlx5_ib_create_srq(struct ib_srq *ib_srq,
 			.srqn = srq->msrq.srqn,
 		};
 
-		if (ib_copy_to_udata(udata, &resp, min(udata->outlen,
-				     sizeof(resp)))) {
-			mlx5_ib_dbg(dev, "copy to user failed\n");
-			err = -EFAULT;
+		err = ib_respond_udata(udata, resp);
+		if (err)
 			goto err_core;
-		}
 	}
 
 	init_attr->attr.max_wr = srq->msrq.max - 1;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
index 16aab967a20308..cefcb243c3a6f2 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
@@ -406,12 +406,10 @@ int pvrdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr,
 		qp_resp.qpn = qp->ibqp.qp_num;
 		qp_resp.qp_handle = qp->qp_handle;
 
-		if (ib_copy_to_udata(udata, &qp_resp,
-				     min(udata->outlen, sizeof(qp_resp)))) {
-			dev_warn(&dev->pdev->dev,
-				 "failed to copy back udata\n");
+		ret = ib_respond_udata(udata, qp_resp);
+		if (ret) {
 			__pvrdma_destroy_qp(dev, qp);
-			return -EINVAL;
+			return ret;
 		}
 	}
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 16/16] RDMA: Replace memset with = {} pattern for ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 17:40 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: Adit Ranadive, Aditya Sarwade, Bryan Tan, Dexuan Cui,
	Doug Ledford, George Zhang, Jorgen Hansen, Leon Romanovsky,
	Parav Pandit, patches, Roland Dreier, Roland Dreier, Ajay Sharma,
	stable
In-Reply-To: <0-v2-1c49eeb88c48+91-rdma_udata_rep_jgg@nvidia.com>

Most drivers do this already, but some open-code a memset. Switch
all instances found. qedr_copy_qp_uresp() is already called with
zeroed memory so that memset is redundant.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/cq.c             |  3 +--
 drivers/infiniband/hw/cxgb4/qp.c             |  6 ++----
 drivers/infiniband/hw/erdma/erdma_verbs.c    |  4 +---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  | 12 ++++--------
 drivers/infiniband/hw/qedr/verbs.c           |  6 +-----
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c |  4 +---
 6 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 47508df4cec023..d1517f2560b981 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1004,7 +1004,7 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct c4iw_dev *rhp = to_c4iw_dev(ibcq->device);
 	struct c4iw_cq *chp = to_c4iw_cq(ibcq);
 	struct c4iw_create_cq ucmd;
-	struct c4iw_create_cq_resp uresp;
+	struct c4iw_create_cq_resp uresp = {};
 	int ret, wr_len;
 	size_t memsize, hwentries;
 	struct c4iw_mm_entry *mm, *mm2;
@@ -1102,7 +1102,6 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 		if (!mm2)
 			goto err_free_mm;
 
-		memset(&uresp, 0, sizeof(uresp));
 		uresp.qid_mask = rhp->rdev.cqmask;
 		uresp.cqid = chp->cq.cqid;
 		uresp.size = chp->cq.size;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index f9c7030ac6bfd0..e295f79e0cd3e5 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -2120,7 +2120,7 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
 	struct c4iw_pd *php;
 	struct c4iw_cq *schp;
 	struct c4iw_cq *rchp;
-	struct c4iw_create_qp_resp uresp;
+	struct c4iw_create_qp_resp uresp = {};
 	unsigned int sqsize, rqsize = 0;
 	struct c4iw_ucontext *ucontext = rdma_udata_to_drv_context(
 		udata, struct c4iw_ucontext, ibucontext);
@@ -2242,7 +2242,6 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
 				goto err_free_sq_db_key;
 			}
 		}
-		memset(&uresp, 0, sizeof(uresp));
 		if (t4_sq_onchip(&qhp->wq.sq)) {
 			ma_sync_key_mm = kmalloc_obj(*ma_sync_key_mm);
 			if (!ma_sync_key_mm) {
@@ -2686,7 +2685,7 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
 	struct c4iw_dev *rhp;
 	struct c4iw_srq *srq = to_c4iw_srq(ib_srq);
 	struct c4iw_pd *php;
-	struct c4iw_create_srq_resp uresp;
+	struct c4iw_create_srq_resp uresp = {};
 	struct c4iw_ucontext *ucontext;
 	struct c4iw_mm_entry *srq_key_mm, *srq_db_key_mm;
 	int rqsize;
@@ -2764,7 +2763,6 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
 			ret = -ENOMEM;
 			goto err_free_srq_key_mm;
 		}
-		memset(&uresp, 0, sizeof(uresp));
 		uresp.flags = srq->flags;
 		uresp.qid_mask = rhp->rdev.qpmask;
 		uresp.srqid = srq->wq.qid;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index c8a35337ba51e8..b59c2e3a5306d1 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -996,7 +996,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
 	struct erdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct erdma_ucontext, ibucontext);
 	struct erdma_ureq_create_qp ureq;
-	struct erdma_uresp_create_qp uresp;
+	struct erdma_uresp_create_qp uresp = {};
 	void *old_entry;
 	int ret = 0;
 
@@ -1048,8 +1048,6 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
 		if (ret)
 			goto err_out_xa;
 
-		memset(&uresp, 0, sizeof(uresp));
-
 		uresp.num_sqe = qp->attrs.sq_size;
 		uresp.num_rqe = qp->attrs.rq_size;
 		uresp.qp_id = QP_ID(qp);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 2a174d0fe6ca1e..383f1d9c15d151 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -586,11 +586,10 @@ static int ocrdma_copy_pd_uresp(struct ocrdma_dev *dev, struct ocrdma_pd *pd,
 	u64 db_page_addr;
 	u64 dpp_page_addr = 0;
 	u32 db_page_size;
-	struct ocrdma_alloc_pd_uresp rsp;
+	struct ocrdma_alloc_pd_uresp rsp = {};
 	struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct ocrdma_ucontext, ibucontext);
 
-	memset(&rsp, 0, sizeof(rsp));
 	rsp.id = pd->id;
 	rsp.dpp_enabled = pd->dpp_enabled;
 	db_page_addr = ocrdma_get_db_addr(dev, pd->id);
@@ -930,13 +929,12 @@ static int ocrdma_copy_cq_uresp(struct ocrdma_dev *dev, struct ocrdma_cq *cq,
 	int status;
 	struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct ocrdma_ucontext, ibucontext);
-	struct ocrdma_create_cq_uresp uresp;
+	struct ocrdma_create_cq_uresp uresp = {};
 
 	/* this must be user flow! */
 	if (!udata)
 		return -EINVAL;
 
-	memset(&uresp, 0, sizeof(uresp));
 	uresp.cq_id = cq->id;
 	uresp.page_size = PAGE_ALIGN(cq->len);
 	uresp.num_pages = 1;
@@ -1173,11 +1171,10 @@ static int ocrdma_copy_qp_uresp(struct ocrdma_qp *qp,
 {
 	int status;
 	u64 usr_db;
-	struct ocrdma_create_qp_uresp uresp;
+	struct ocrdma_create_qp_uresp uresp = {};
 	struct ocrdma_pd *pd = qp->pd;
 	struct ocrdma_dev *dev = get_ocrdma_dev(pd->ibpd.device);
 
-	memset(&uresp, 0, sizeof(uresp));
 	usr_db = dev->nic_info.unmapped_db +
 			(pd->id * dev->nic_info.db_page_size);
 	uresp.qp_id = qp->id;
@@ -1730,9 +1727,8 @@ static int ocrdma_copy_srq_uresp(struct ocrdma_dev *dev, struct ocrdma_srq *srq,
 				struct ib_udata *udata)
 {
 	int status;
-	struct ocrdma_create_srq_uresp uresp;
+	struct ocrdma_create_srq_uresp uresp = {};
 
-	memset(&uresp, 0, sizeof(uresp));
 	uresp.rq_dbid = srq->rq.dbid;
 	uresp.num_rq_pages = 1;
 	uresp.rq_page_addr[0] = virt_to_phys(srq->rq.va);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 79190c5b8b50b0..1af908275ca729 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -690,9 +690,7 @@ static void qedr_db_recovery_del(struct qedr_dev *dev,
 static int qedr_copy_cq_uresp(struct qedr_cq *cq, struct ib_udata *udata,
 			      u32 db_offset)
 {
-	struct qedr_create_cq_uresp uresp;
-
-	memset(&uresp, 0, sizeof(uresp));
+	struct qedr_create_cq_uresp uresp = {};
 
 	uresp.db_offset = db_offset;
 	uresp.icid = cq->icid;
@@ -1283,8 +1281,6 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
 			      struct qedr_qp *qp, struct ib_udata *udata,
 			      struct qedr_create_qp_uresp *uresp)
 {
-	memset(uresp, 0, sizeof(*uresp));
-
 	if (qedr_qp_has_sq(qp))
 		qedr_copy_sq_uresp(dev, uresp, qp);
 
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index e887f03a84d063..261f18a8368543 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -82,15 +82,13 @@ static void usnic_ib_fw_string_to_u64(char *fw_ver_str, u64 *fw_ver)
 static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
 					struct ib_udata *udata)
 {
-	struct usnic_ib_create_qp_resp resp;
+	struct usnic_ib_create_qp_resp resp = {};
 	struct pci_dev *pdev;
 	struct vnic_dev_bar *bar;
 	struct usnic_vnic_res_chunk *chunk;
 	struct usnic_ib_qp_grp_flow *default_flow;
 	int i, err;
 
-	memset(&resp, 0, sizeof(resp));
-
 	pdev = usnic_vnic_get_pdev(qp_grp->vf->vnic);
 	if (!pdev) {
 		usnic_err("Failed to get pdev of qp_grp %d\n",
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 11/16] RDMA/cxgb4: Convert to ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 17:40 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: Adit Ranadive, Aditya Sarwade, Bryan Tan, Dexuan Cui,
	Doug Ledford, George Zhang, Jorgen Hansen, Leon Romanovsky,
	Parav Pandit, patches, Roland Dreier, Roland Dreier, Ajay Sharma,
	stable
In-Reply-To: <0-v2-1c49eeb88c48+91-rdma_udata_rep_jgg@nvidia.com>

These cases carefully work around 32-bit unpadded structures, but
the min integrated into ib_respond_udata() handles this
automatically. Zero-initialize data that would not have been copied.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/cq.c       | 8 +++-----
 drivers/infiniband/hw/cxgb4/provider.c | 5 ++---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index e31fb9134aa818..47508df4cec023 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1115,13 +1115,11 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 		/* communicate to the userspace that
 		 * kernel driver supports 64B CQE
 		 */
-		uresp.flags |= C4IW_64B_CQE;
+		if (!ucontext->is_32b_cqe)
+			uresp.flags |= C4IW_64B_CQE;
 
 		spin_unlock(&ucontext->mmap_lock);
-		ret = ib_copy_to_udata(udata, &uresp,
-				       ucontext->is_32b_cqe ?
-				       sizeof(uresp) - sizeof(uresp.flags) :
-				       sizeof(uresp));
+		ret = ib_respond_udata(udata, uresp);
 		if (ret)
 			goto err_free_mm2;
 
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index a119e8793aef40..0e3827022c63da 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -80,7 +80,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
 	struct ib_device *ibdev = ucontext->device;
 	struct c4iw_ucontext *context = to_c4iw_ucontext(ucontext);
 	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
-	struct c4iw_alloc_ucontext_resp uresp;
+	struct c4iw_alloc_ucontext_resp uresp = {};
 	int ret = 0;
 	struct c4iw_mm_entry *mm = NULL;
 
@@ -106,8 +106,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
 		context->key += PAGE_SIZE;
 		spin_unlock(&context->mmap_lock);
 
-		ret = ib_copy_to_udata(udata, &uresp,
-				       sizeof(uresp) - sizeof(uresp.reserved));
+		ret = ib_respond_udata(udata, uresp);
 		if (ret)
 			goto err_mm;
 
-- 
2.43.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox