Linux-HyperV List

* RE: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Michael Kelley @ 2026-04-07 18:37 UTC
  To: Sean Christopherson, Vitaly Kuznetsov, Thomas Lefebvre
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <adU0LAW1h8q9HsGu@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Tuesday, April 7, 2026 9:43 AM
> 
> +Michael
> 
> On Tue, Apr 07, 2026, Vitaly Kuznetsov wrote:
> > Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:
> > > Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> > > The hypervisor corrects them only through the TSC page scale/offset.
> > > If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> > > later runs on CPU 1 where the raw TSC is lower, the unsigned
> > > subtraction wraps.
> > >
> >
> > According to the TLFS, reference TSC page is partition wide:
> >
> > "The hypervisor provides a partition-wide virtual reference TSC page
> > which is overlaid on the partition’s GPA space. A partition’s reference
> > time stamp counter page is accessed through the Reference TSC MSR."
> >
> > so if as you say RAW rdtsc value is inconsistent across vCPUs, I can
> > hardly see how we can use this time source at all, even without
> > KVM. scale/offset are the same for all vCPUs.
> >
> > I think the fix here is to avoid setting up Hyper-V TSC page clocksource
> > in L1. Unfortunately, with unsynchronized TSCs this will leave us the
> > only choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
> > reads.
> 
> This feels like either a Hyper-V bug or a Linux-as-a-guest bug.  For "Reference
> Counter"[1]:
> 
>   The hypervisor maintains a per-partition reference time counter. It has the
>   characteristic that successive accesses to it return strictly monotonically
>   increasing (time) values as seen by any and all virtual processors of a
>   partition. Furthermore, the reference counter is rate constant and unaffected
>   by processor or bus speed transitions or deep processor power savings states. A
>   partition’s reference time counter is initialized to zero when the partition is
>   created. The reference counter for all partitions count at the same rate, but
>   at any time, their absolute values will typically differ because partitions
>   will have different creation times.
> 
>   The reference counter continues to count up as long as at least one virtual
>   processor is not explicitly suspended.
> 
> 
> And then "Partition Reference Time Enlightenment"[2]:
> 
>   The partition reference time enlightenment presents a reference time source to
>   a partition which does not require an intercept into the hypervisor. This
>   enlightenment is available only when the underlying platform provides support
>   of an invariant processor Time Stamp Counter (TSC), or iTSC. In such platforms,
>   the processor TSC frequency remains constant irrespective of changes in the
>   processor’s clock frequency due to the use of power management states such as
>   ACPI processor performance states, processor idle sleep states (ACPI C-states),
>   etc.
> 
>   The partition reference time enlightenment uses a virtual TSC value, an offset
>   and a multiplier to enable a guest partition to compute the normalized
>   reference time since partition creation, in 100nS units. The mechanism also
>   allows a guest partition to atomically compute the reference time when the
>   guest partition is migrated to a platform with a different TSC rate, and
>   provides a fallback mechanism to support migration to platforms without the
>   constant rate TSC feature.
> 
> My read of "Partition Reference Time Enlightenment" is that it should only be
> advertised if the TSC is synchronized and constant.  I can't figure out where
> that feature is actually advertised though, because IIUC it's not the same as
> HV_ACCESS_TSC_INVARIANT, which says that the virtual TSC is guaranteed to be
> invariant even across live migration.  And it's not HV_MSR_REFERENCE_TSC_AVAILABLE,
> because I'm pretty sure that just says HV_MSR_REFERENCE_TSC is available.
> 
> Michael, help?
> 
> [1] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#reference-counter
> [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers#partition-reference-time-enlightenment

Yes, the TSC page enlightenment is per VM, so it does not compensate for
discrepancies in raw TSC values across physical CPUs. RDTSC in a
Hyper-V VM is executed directly by the hardware (i.e., does not trap to
the hypervisor), so there's no opportunity for the hypervisor to compensate
for discrepancies. The hypervisor is expected to present a VM with TSCs
that are already synchronized. I'll need to double-check, but I don't think
Linux guests on Hyper-V run their own TSC synchronization.
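
To make the point above concrete, here is a minimal userspace sketch
(not kernel code). It applies the TLFS reference-time formula,
((tsc * scale) >> 64) + offset, with one partition-wide scale/offset
to two different raw TSC readings, and then shows what an unsigned
"now - base" subtraction does when the later reading is the smaller
one. The function names and the numbers are hypothetical, chosen only
to keep the arithmetic easy to follow, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reference time in 100 ns units, following the TLFS formula
 *   ((tsc * scale) >> 64) + offset
 * The 128-bit intermediate relies on the GCC/Clang __int128 extension.
 */
uint64_t hv_ref_time(uint64_t tsc, uint64_t scale, int64_t offset)
{
	uint64_t hi = (uint64_t)(((unsigned __int128)tsc * scale) >> 64);

	return hi + (uint64_t)offset;
}

/* Elapsed cycles as an unsigned subtraction computes them. */
uint64_t tsc_delta(uint64_t tsc_now, uint64_t tsc_base)
{
	/* Wraps modulo 2^64 when tsc_now < tsc_base. */
	return tsc_now - tsc_base;
}

/* Hypothetical values: CPU 1's raw TSC is behind CPU 0's. */
void tsc_skew_demo(void)
{
	const uint64_t scale = 1ULL << 63;	/* multiplier of exactly 0.5 */
	const int64_t offset = 0;
	const uint64_t tsc_cpu0 = 2000;		/* base snapshot on CPU 0 */
	const uint64_t tsc_cpu1 = 1000;		/* later read on CPU 1 */

	/* Same scale/offset on both CPUs: the raw skew carries through. */
	assert(hv_ref_time(tsc_cpu0, scale, offset) == 1000);
	assert(hv_ref_time(tsc_cpu1, scale, offset) == 500);

	/* The cross-CPU delta wraps to just under 2^64. */
	assert(tsc_delta(tsc_cpu1, tsc_cpu0) == UINT64_MAX - 999);
}
```

Because scale and offset are shared by every vCPU, nothing in the
formula can hide a per-CPU difference in the raw TSC, which is why the
reported subtraction can wrap.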

The relevant Hyper-V flags are:
* HV_MSR_TIME_REF_COUNT_AVAILABLE:  The synthetic MSR for reading
   the partition reference time is available.
* HV_MSR_REFERENCE_TSC_AVAILABLE: The partition reference time
   enlightenment (i.e., "the TSC page") is available as a faster way to read
   the reference counter.
* HV_ACCESS_TSC_INVARIANT: As Sean said, this says the hardware and
   Hyper-V support TSC scaling, so live migration can be done across hosts
   without the guest seeing a change in TSC frequency.

Yes, this does feel like an issue where Hyper-V is not presenting the guest
with TSCs that are already synchronized. But I'm not aware of having seen
such a problem before. I'll try to imagine a scenario where a problem like
this could happen via some other path.

@Thomas Lefebvre:  Let me double-check a few things via these follow-up
questions/actions:

1. You said the clocksource is hyperv_clocksource_tsc_page. Just to
confirm, that's for the L1 guest, right? Does the output of the "lscpu"
command in the L1 guest show the flags "tsc_reliable" and "constant_tsc"?
I assume "no", since if these flags were set, the clocksource (i.e.,
/sys/devices/system/clocksource/clocksource0/current_clocksource)
should be the standard "tsc". I've got a laptop with an i7-13700H processor,
and my L1 VMs show "tsc" as the clocksource, but I haven't been running
KVM with L2 nested VMs.

2. What is the version of Windows/Hyper-V you are running? Get the
output of the "winver.exe" command. It should be something like this:

Windows 11 [as the top banner]
Version 25H2 (OS Build 26200.8037)

3. In the dmesg output of your L1 VM, find the line like this one and reply
with what you have:

    Hyper-V: privilege flags low 0xae7f, high 0x3b8030, hints 0x9a4e24, misc 0xe0bed7b2

From there, I can decode the Hyper-V settings and see if anything jumps out
as anomalous. 

4. Does the laptop where you are seeing this problem ever hibernate and
then resume? If so, do you recall if the problem occurs after a full reboot but
before it ever does a hibernate/resume cycle?
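
As a rough illustration of the decoding mentioned in item 3, the sketch
below tests the three time-related bits in the "privilege flags low"
word. The bit positions (1, 9 and 15) are an assumption based on the
feature-identification bits as I understand them from the TLFS and the
kernel's Hyper-V header, so check them against the header before
relying on this. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in "privilege flags low"; verify before use. */
#define HV_MSR_TIME_REF_COUNT_AVAILABLE	(1u << 1)
#define HV_MSR_REFERENCE_TSC_AVAILABLE	(1u << 9)
#define HV_ACCESS_TSC_INVARIANT		(1u << 15)

/* Hypothetical container for the three time-related capabilities. */
struct hv_tsc_caps {
	bool time_ref_count;	/* reference counter MSR is available */
	bool reference_tsc;	/* TSC page is available */
	bool tsc_invariant;	/* invariant TSC control is available */
};

struct hv_tsc_caps hv_decode_tsc_caps(uint32_t priv_low)
{
	struct hv_tsc_caps caps = {
		.time_ref_count = priv_low & HV_MSR_TIME_REF_COUNT_AVAILABLE,
		.reference_tsc  = priv_low & HV_MSR_REFERENCE_TSC_AVAILABLE,
		.tsc_invariant  = priv_low & HV_ACCESS_TSC_INVARIANT,
	};

	return caps;
}

/* Decode the sample "privilege flags low" value from item 3 (0xae7f). */
void hv_decode_demo(void)
{
	struct hv_tsc_caps caps = hv_decode_tsc_caps(0xae7f);

	assert(caps.time_ref_count);
	assert(caps.reference_tsc);
	assert(caps.tsc_invariant);
}
```

A value with bit 15 clear would decode with tsc_invariant false, which
is the case where the guest is expected to fall back to
hyperv_clocksource_tsc_page instead of "tsc".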

Michael

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Thomas Lefebvre @ 2026-04-07 19:13 UTC
  To: Michael Kelley
  Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157F82E4907C1B3305E86D5D45AA@SN6PR02MB4157.namprd02.prod.outlook.com>

Hi everyone, thank you for your attention to this bug report.

Michael,

1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
and "constant_tsc".
$ lscpu | grep tsc_reliable
$ lscpu | grep constant_tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hyperv_clocksource_tsc_page

2. Windows 10
Version 22H2 (OS Build 19045.6466)

3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
0x24e24, misc 0xbed7b2

4. Yes, the laptop hibernates and then resumes.
When the problem occurred, the laptop had gone through multiple
hibernate and resume cycles.
I haven't seen it happen after a full reboot before a hibernate/resume cycle.

Thomas

* Re: [GIT PULL] Hyper-V fixes for v7.0-rc8
From: pr-tracker-bot @ 2026-04-07 19:26 UTC
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux Kernel List, Linux on Hyper-V List,
	kys, haiyangz, decui, longli
In-Reply-To: <20260407042912.GA1012143@liuwe-devbox-debian-v2.local>

The pull request you sent on Tue, 7 Apr 2026 04:29:12 +0000:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260406

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/86782c16a81f8232c13c1509fd3295bd97d185b0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

* Re: [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-07 19:29 UTC
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy
In-Reply-To: <20260406105136.5e02420e@kernel.org>

On Mon, Apr 06, 2026 at 10:51:36AM -0700, Jakub Kicinski wrote:
> On Sat, 4 Apr 2026 20:14:35 -0700 Dipayaan Roy wrote:
> >   Function                        Fragment   Full-page   Delta
> >   -----------------------------   --------   ---------   -----
> >   napi_pp_put_page                  3.93%      0.85%    +3.08%
> >   page_pool_alloc_frag_netmem       1.93%         —     +1.93%
> >   Total page_pool overhead          5.86%      0.85%    +5.01%
> 
> 
> Thanks for the analysis, and presumably recycling the full page is
> cheaper because page_pool_put_unrefed_netmem() hits the fastpath
> because page_pool_napi_local() returns true?
Yes, right, thus avoiding atomics.

Regards


* [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-07 19:59 UTC
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On some ARM64 platforms with 4K PAGE_SIZE, using page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes a 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via a udev rule for affected platforms.

Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropped the SMBIOS string parsing and added an ethtool priv flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
 include/net/mana/mana.h                       |   8 +
 3 files changed, 163 insertions(+), 31 deletions(-)

-- 
2.43.0

* [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-04-07 19:59 UTC
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0


* [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-07 19:59 UTC
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause a 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or by a udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 89 +++++++++++++++++++
 include/net/mana/mana.h                       |  8 ++
 3 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0
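
For readers following the flag handling in mana_set_priv_flags(), here
is a minimal userspace sketch of the same two checks: XOR the current
and requested words to find what changed, and reject any request that
sets a bit at or above the "max" enumerator. All demo_* names are
hypothetical, and the sketch deliberately leaves out the buffer
pre-allocation and the detach/attach sequence; it is not the driver
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bit positions, mirroring the enum-plus-MAX pattern. */
enum demo_priv_flag_bits {
	DEMO_PRIV_FLAG_FULL_PAGE_RX = 0,
	DEMO_PRIV_FLAG_MAX,
};

/* Mask with bits [0, nbits) set; nbits may be 0..32. */
uint32_t demo_low_mask(unsigned int nbits)
{
	return nbits >= 32 ? UINT32_MAX : (1u << nbits) - 1u;
}

/*
 * Returns 0 and stores the new flags when the request is valid (or is a
 * no-op), and -EINVAL when it sets a bit at or above DEMO_PRIV_FLAG_MAX.
 * *needs_reconfig reports whether the full-page bit was flipped.
 */
int demo_set_priv_flags(uint32_t *cur, uint32_t req, int *needs_reconfig)
{
	uint32_t changed = *cur ^ req;

	*needs_reconfig = 0;
	if (!changed)
		return 0;

	/* Reject unknown bits before touching any state. */
	if (req & ~demo_low_mask(DEMO_PRIV_FLAG_MAX))
		return -EINVAL;

	*needs_reconfig = !!(changed & (1u << DEMO_PRIV_FLAG_FULL_PAGE_RX));
	*cur = req;
	return 0;
}

/* Turn the flag on, then try an unknown bit. */
void demo_priv_flags(void)
{
	uint32_t flags = 0;
	int reconfig;

	assert(demo_set_priv_flags(&flags, 1u, &reconfig) == 0);
	assert(flags == 1u && reconfig == 1);

	assert(demo_set_priv_flags(&flags, 3u, &reconfig) == -EINVAL);
	assert(flags == 1u);	/* state is left untouched on error */
}
```

Checking "changed" first is safe here because an unchanged request
equals the stored value, which has already passed the unknown-bit
check.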


* RE: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Michael Kelley @ 2026-04-07 20:40 UTC (permalink / raw)
  To: Thomas Lefebvre
  Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <CAKdXbaXdSWq-NYk8z6_OtSgb6xsp+GJxrnF2iBMRdk0nfYne=A@mail.gmail.com>

From: Thomas Lefebvre <thomas.lefebvre3@gmail.com> Sent: Tuesday, April 7, 2026 12:13 PM
> 
> Hi everyone, thank you for your attention to this bug report.
> 
> Michael,
> 
> 1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
> and "constant_tsc".
> $ lscpu | grep tsc_reliable
> $ lscpu | grep constant_tsc
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> hyperv_clocksource_tsc_page
> 
> 2. Windows 10
> Version 22H2 (OS Build 19045.6466)
> 
> 3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
> 0x24e24, misc 0xbed7b2
> 
> 4. Yes, the laptop hibernates and then resumes.
> When the problem occurred, the laptop had gone through multiple
> hibernate and resume cycles.
> I haven't seen it happen after a full reboot before a hibernate/resume cycle.
> 
> Thomas
> 

How easy is it for you to reproduce the problem? Would it be feasible
to get a definitive answer on whether the problem repros after a
full reboot, but before a hibernate/resume cycle?

There's a known bug in Windows 10 Hyper-V where the hardware TSC
scaling gets messed up after a hibernate/resume cycle, causing the TSC
values read in the guest to drift from what the Hyper-V host thinks
the guest's TSC value is. A summary of the problem is here:
https://github.com/microsoft/WSL/issues/6982#issuecomment-2294892954

Of course, this doesn't sound like your symptom. And Hyper-V is not
telling your guest that it supports hardware TSC scaling, because the
HV_ACCESS_TSC_INVARIANT flag is *not* set and the clocksource
is hyperv_clocksource_tsc_page. But my understanding is that the code
changes to fix the Hyper-V problem weren't trivial, and I'm speculating
that maybe you are seeing some other symptom of whatever the
underlying Hyper-V issue was.
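
As a sanity check on the flags quoted above, here is a minimal sketch of
how the "privilege flags low" value can be decoded. It assumes that value
is the EAX output of the Hyper-V features CPUID leaf, and that the bit
positions match the Linux Hyper-V headers (bit 9 for the reference TSC
page, bit 15 for HV_ACCESS_TSC_INVARIANT). The helper names are
hypothetical, not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, as defined in the Linux Hyper-V headers. */
#define HV_MSR_REFERENCE_TSC_AVAILABLE	(1u << 9)
#define HV_ACCESS_TSC_INVARIANT		(1u << 15)

/* Hypothetical helper: true if the guest is offered the reference TSC page. */
static bool hv_has_tsc_page(uint32_t features_low)
{
	return (features_low & HV_MSR_REFERENCE_TSC_AVAILABLE) != 0;
}

/* Hypothetical helper: true if the guest is offered the invariant TSC. */
static bool hv_has_tsc_invariant(uint32_t features_low)
{
	return (features_low & HV_ACCESS_TSC_INVARIANT) != 0;
}
```

With the reported value 0x2e7f, hv_has_tsc_page() is true and
hv_has_tsc_invariant() is false, which matches the clocksource being
hyperv_clocksource_tsc_page.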

Of course, this is just speculation. If the problem can occur before
any hibernate/resume cycles are done, then my speculation is
wrong. But if the problem only happens after a hibernate/resume
cycle, then this known problem, or something related to it, becomes
a pretty good candidate. Unfortunately, I'm pretty sure there's no
fix for Windows 10 Hyper-V. You would need to upgrade to 
Windows 11 22H2 or later.

Michael

^ permalink raw reply

* Re: [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-07 21:27 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260406-ninja-civet-of-tornado-67ff54@anirudhrb>

On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:

> On Fri, Apr 03, 2026 at 12:06:10PM -0700, Jork Loeser wrote:
>> The SynIC is shared between VMBus and MSHV. VMBus owns the message
>> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
>> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
>>
>> Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
>
> The redundant enable is probably a no-op from the hypervisor side so it
> probably doesn't hurt us. The main problem is with the tear down.

It's an MSR intercept. If we can replace this by an "if()" we shave a few 
cycles.

> An alternative approach could be: check if SIMP/SIEFP/SCONTROL is
> already enabled. If so, don't enable it again. If not enabled, enable it
> and keep track of everything we have enabled. Then disable all of
> them during cleanup. This approach makes fewer assumptions about the
> behavior of the VMBUS driver and what stuff it does or doesn't use.

It would, yes. Then again, we would carry yet more state, which makes
debugging more complicated and the dynamic behavior harder to reason
about. I debated this briefly myself and ultimately decided against it
for that very reason.

Best,
Jork

^ permalink raw reply

* RE: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Long Li @ 2026-04-07 22:30 UTC (permalink / raw)
  To: Li Tian, linux-scsi@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260406015344.12566-1-litian@redhat.com>



> -----Original Message-----
> From: Li Tian <litian@redhat.com>
> Sent: Sunday, April 5, 2026 6:54 PM
> To: linux-scsi@vger.kernel.org
> Cc: Li Tian <litian@redhat.com>; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; James E.J. Bottomley
> <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> <martin.petersen@oracle.com>; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN
> truncation for Hyper-V vFC
> 
> The storvsc driver has become stricter in handling SRB status codes returned by
> the Hyper-V host. When using Virtual Fibre Channel (vFC) passthrough, the host
> may return SRB_STATUS_DATA_OVERRUN for PERSISTENT_RESERVE_IN
> commands if the allocation length in the CDB does not match the host's expected
> response size.
> 
> Currently, this status is treated as a fatal error, propagating
> Host_status=0x07 [DID_ERROR] to the SCSI mid-layer. This causes userspace
> storage utilities (such as sg_persist) to fail with transport errors, even when the
> host has actually returned the requested reservation data in the buffer.
> 
> Refactor the existing command-specific workarounds into a new helper function,
> storvsc_host_mishandles_cmd(), and add PERSISTENT_RESERVE_IN to the list of
> commands where SRB status errors should be suppressed for vFC devices. This
> ensures that the SCSI mid-layer processes the returned data buffer instead of
> terminating the command.
> 
> Signed-off-by: Li Tian <litian@redhat.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
>  drivers/scsi/storvsc_drv.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index ae1abab97835..6977ca8a0658 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1131,6 +1131,26 @@ static void storvsc_command_completion(struct storvsc_cmd_request *cmd_request,
>  		kfree(payload);
>  }
> 
> +/*
> + * The current SCSI handling on the host side does not correctly handle:
> + * INQUIRY with page code 0x80, MODE_SENSE / MODE_SENSE_10 with cmd[2] == 0x1c,
> + * and (for FC) MAINTENANCE_IN / PERSISTENT_RESERVE_IN passthrough.
> + */
> +static bool storvsc_host_mishandles_cmd(u8 opcode, struct hv_device *device)
> +{
> +	switch (opcode) {
> +	case INQUIRY:
> +	case MODE_SENSE:
> +	case MODE_SENSE_10:
> +		return true;
> +	case MAINTENANCE_IN:
> +	case PERSISTENT_RESERVE_IN:
> +		return hv_dev_is_fc(device);
> +	default:
> +		return false;
> +	}
> +}
> +
>  static void storvsc_on_io_completion(struct storvsc_device *stor_device,
>  				  struct vstor_packet *vstor_packet,
>  				  struct storvsc_cmd_request *request)
> @@ -1141,22 +1161,12 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
>  	stor_pkt = &request->vstor_packet;
> 
>  	/*
> -	 * The current SCSI handling on the host side does
> -	 * not correctly handle:
> -	 * INQUIRY command with page code parameter set to 0x80
> -	 * MODE_SENSE and MODE_SENSE_10 command with cmd[2] == 0x1c
> -	 * MAINTENANCE_IN is not supported by HyperV FC passthrough
> -	 *
>  	 * Setup srb and scsi status so this won't be fatal.
>  	 * We do this so we can distinguish truly fatal failues
>  	 * (srb status == 0x4) and off-line the device in that case.
>  	 */
> 
> -	if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
> -	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
> -	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
> -	   (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
> -	   hv_dev_is_fc(device))) {
> +	if (storvsc_host_mishandles_cmd(stor_pkt->vm_srb.cdb[0], device)) {
>  		vstor_packet->vm_srb.scsi_status = 0;
>  		vstor_packet->vm_srb.srb_status = SRB_STATUS_SUCCESS;
>  	}
> --
> 2.53.0


^ permalink raw reply

* Re: [PATCH v2 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-08  1:25 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260406-hasty-academic-hippo-a4cfb8@anirudhrb>

On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:

> I believe this code has moved to mshv_synic.c in hyperv-fixes. So, this
> probably won't apply.

Right, this needs a fix.

Best,
Jork

^ permalink raw reply

* [PATCH v3 0/6] Hyper-V: kexec fixes for L1VH (mshv)
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser

This series fixes kexec support when Linux runs as an L1 Virtual Host
(L1VH) under Hyper-V, using the MSHV driver to manage child VMs.

1. A variable shadowing bug in vmbus that hides the cpuhp state used
   for teardown.

2. Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
   hv_machine_shutdown(). This ensures stimer cleanup happens before
   the vmbus unload.

3. LP/VP re-creation: after kexec, logical processors and virtual
   processors already exist in the hypervisor. Detect this and skip
   re-adding them.

4-5. SynIC cleanup: the MSHV driver manages its own SynIC resources
     separately from vmbus. Add proper teardown of MSHV-owned SINTs,
     SIMP, and SIEFP on kexec, scoped to only the resources MSHV
     owns.

6. Debugfs stats pages: unmap the VP statistics overlay pages before
   kexec to avoid stale mappings in the new kernel.

Changes since v2:
- Rebased onto linux-next/master to adapt to the upstream SynIC
  refactor (commit 5a674ef871fe, "mshv: refactor synic init and
  cleanup").

Changes since v1:
- Patch 4: account for nested root partitions where VMBus is also
  active (not just L1VH); use a vmbus_active local variable; add an
  SIRBP allocation path on L1VH for when the hypervisor doesn't
  pre-provision the page

Jork Loeser (6):
  Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
  x86/hyperv: move stimer cleanup to hv_machine_shutdown()
  x86/hyperv: Skip LP/VP creation on kexec
  mshv: limit SynIC management to MSHV-owned resources
  mshv: clean up SynIC state on kexec for L1VH
  mshv: unmap debugfs stats pages on kexec

 arch/x86/kernel/cpu/mshyperv.c |  15 +++-
 drivers/hv/hv_proc.c           |  47 +++++++++++
 drivers/hv/mshv_debugfs.c      |   7 +-
 drivers/hv/mshv_synic.c        | 146 +++++++++++++++++++++------------
 drivers/hv/vmbus_drv.c         |   2 -
 include/asm-generic/mshyperv.h |  10 +++
 include/hyperv/hvgdk_mini.h    |   1 +
 include/hyperv/hvhdk_mini.h    |  12 +++
 8 files changed, 184 insertions(+), 56 deletions(-)

--
2.43.0


^ permalink raw reply

* [PATCH v3 1/6] Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later calls
cpuhp_remove_state(hyperv_cpuhp_online), it passes 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().

Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.
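
The shadowing pattern can be reproduced in isolation. The following is a
minimal user-space sketch, not the vmbus code: fake_setup_state() is a
hypothetical stand-in for cpuhp_setup_state(), and the value 42 is
arbitrary.

```c
/* File-scope state that a later teardown path reads. */
static int cpuhp_online_state;

/* Hypothetical stand-in for cpuhp_setup_state(); returns a nonzero id. */
static int fake_setup_state(void)
{
	return 42;
}

/* Buggy variant: the local declaration shadows the file-scope variable. */
static void connect_shadowed(void)
{
	int cpuhp_online_state;

	cpuhp_online_state = fake_setup_state();
	(void)cpuhp_online_state;	/* the value is lost on return */
}

/* Fixed variant: no local declaration, so the global is assigned. */
static void connect_fixed(void)
{
	cpuhp_online_state = fake_setup_state();
}

/* What a teardown path would read from the file-scope variable. */
static int teardown_sees(void)
{
	return cpuhp_online_state;
}
```

After connect_shadowed() the teardown path still reads 0, while after
connect_fixed() it reads the stored state. Building with GCC's -Wshadow
should flag the buggy variant.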

Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/vmbus_drv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 3faa74e49a6b..5e7a6839c933 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1422,7 +1422,6 @@ static int vmbus_alloc_synic_and_connect(void)
 {
 	int ret, cpu;
 	struct work_struct __percpu *works;
-	int hyperv_cpuhp_online;

 	ret = hv_synic_alloc();
 	if (ret < 0)
-- 
2.43.0

^ permalink raw reply related

* [PATCH v3 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
 drivers/hv/vmbus_drv.c         | 1 -
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index a7dfc29d3470..e498b6b2ef19 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -237,8 +237,12 @@ void hv_remove_crash_handler(void)
 #ifdef CONFIG_KEXEC_CORE
 static void hv_machine_shutdown(void)
 {
-	if (kexec_in_progress && hv_kexec_handler)
-		hv_kexec_handler();
+	if (kexec_in_progress) {
+		hv_stimer_global_cleanup();
+
+		if (hv_kexec_handler)
+			hv_kexec_handler();
+	}
 
 	/*
 	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 5e7a6839c933..c5dfe9f3b206 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2891,7 +2891,6 @@ static struct platform_driver vmbus_platform_driver = {
 
 static void hv_kexec_handler(void)
 {
-	hv_stimer_global_cleanup();
 	vmbus_initiate_unload(false);
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
 	mb();
-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam,
	Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.

Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.

Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.
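
The decision logic described above can be sketched as a small user-space
model. The enum values are hypothetical placeholders for the hypercall
status (they are not the real HV_STATUS_* codes), and abort() stands in
for BUG().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical status values, for illustration only. */
enum probe_status {
	PROBE_SUCCESS = 0,
	PROBE_INVALID_LP_INDEX,
	PROBE_OTHER_ERROR,
};

/*
 * Models hv_lp_exists(): success means the LP exists, an invalid-index
 * status means it does not, and any other status is treated as fatal.
 */
static bool lp_exists_from_status(enum probe_status status)
{
	if (status != PROBE_SUCCESS && status != PROBE_INVALID_LP_INDEX)
		abort();

	return status == PROBE_SUCCESS;
}

/*
 * Models the early return in hv_smp_prepare_cpus(): skip LP/VP creation
 * on a single-CPU system or when LP 1 is already present (kexec).
 */
static bool should_skip_lp_vp_creation(unsigned int present_cpus,
				       enum probe_status lp1_status)
{
	return present_cpus == 1 || lp_exists_from_status(lp1_status);
}
```

On a fresh boot with several CPUs the probe reports an invalid index, so
creation proceeds; after kexec the probe succeeds and both loops are
skipped.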

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c |  7 +++++
 drivers/hv/hv_proc.c           | 47 ++++++++++++++++++++++++++++++++++
 include/asm-generic/mshyperv.h | 10 ++++++++
 include/hyperv/hvgdk_mini.h    |  1 +
 include/hyperv/hvhdk_mini.h    | 12 +++++++++
 5 files changed, 77 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e498b6b2ef19..b5b6a58b67b0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -431,6 +431,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
 	}
 
 #ifdef CONFIG_X86_64
+	/* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
+	if (num_present_cpus() == 1 || hv_lp_exists(1))
+		return;
+
 	for_each_present_cpu(i) {
 		if (i == 0)
 			continue;
@@ -438,6 +442,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
 		BUG_ON(ret);
 	}
 
+	ret = hv_call_notify_all_processors_started();
+	WARN_ON(ret);
+
 	for_each_present_cpu(i) {
 		if (i == 0)
 			continue;
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 3cb4b2a3035c..57b2c64197cb 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(hv_call_create_vp);
+
+int hv_call_notify_all_processors_started(void)
+{
+	struct hv_input_notify_partition_event *input;
+	u64 status;
+	unsigned long irq_flags;
+	int ret = 0;
+
+	local_irq_save(irq_flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
+	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
+				 input, NULL);
+	local_irq_restore(irq_flags);
+
+	if (!hv_result_success(status)) {
+		hv_status_err(status, "\n");
+		ret = hv_result_to_errno(status);
+	}
+	return ret;
+}
+
+bool hv_lp_exists(u32 lp_index)
+{
+	struct hv_input_get_logical_processor_run_time *input;
+	struct hv_output_get_logical_processor_run_time *output;
+	unsigned long flags;
+	u64 status;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	input->lp_index = lp_index;
+	status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
+				 input, output);
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status) &&
+	    hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
+		hv_status_err(status, "\n");
+		BUG();
+	}
+
+	return hv_result_success(status);
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index d37b68238c97..bf601d67cecb 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
 int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
+int hv_call_notify_all_processors_started(void);
+bool hv_lp_exists(u32 lp_index);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 
 #else /* CONFIG_MSHV_ROOT */
@@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
 {
 	return -EOPNOTSUPP;
 }
+static inline int hv_call_notify_all_processors_started(void)
+{
+	return -EOPNOTSUPP;
+}
+static inline bool hv_lp_exists(u32 lp_index)
+{
+	return false;
+}
 static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 {
 	return -EOPNOTSUPP;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index f9600f87186a..6a4e8b9d570f 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 /* HV_CALL_CODE */
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE		0x0002
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST		0x0003
+#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME		0x0004
 #define HVCALL_NOTIFY_LONG_SPIN_WAIT			0x0008
 #define HVCALL_SEND_IPI					0x000b
 #define HVCALL_ENABLE_VP_VTL				0x000f
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 091c03e26046..b4cb2fa26e9b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -362,6 +362,7 @@ union hv_partition_event_input {
 
 enum hv_partition_event {
 	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+	HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
 };
 
 struct hv_input_notify_partition_event {
@@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
 	union hv_partition_event_input input;
 } __packed;
 
+struct hv_input_get_logical_processor_run_time {
+	u32 lp_index;
+} __packed;
+
+struct hv_output_get_logical_processor_run_time {
+	u64 global_time;
+	u64 local_run_time;
+	u64 rsvdz0;
+	u64 hypervisor_time;
+} __packed;
+
 struct hv_lp_startup_status {
 	u64 hv_status;
 	u64 substatus1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).

Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
from under VMBus causes its later cleanup to write SynIC MSRs while
SynIC is disabled, which the hypervisor does not tolerate.

Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
  and nested root partition); on a non-nested root partition VMBus
  doesn't run, so MSHV must enable/disable them
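
The ownership split above can be summarized with a small user-space
model. It mirrors the vmbus_active expression used in the patch
(!hv_root_partition() || hv_nested); the struct and function names are
hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Hypothetical summary of what MSHV does in a given configuration. */
struct synic_plan {
	bool mshv_manages_shared;	/* SIMP, SIEFP, SCONTROL */
	bool mshv_allocates_sirbp;	/* allocate page vs. map provided one */
};

/* VMBus runs on L1VH (not root) and on a nested root partition. */
static bool vmbus_is_active(bool root_partition, bool nested)
{
	return !root_partition || nested;
}

static struct synic_plan synic_plan_for(bool root_partition, bool nested)
{
	struct synic_plan plan;

	/* Shared resources are touched only when VMBus is not around. */
	plan.mshv_manages_shared = !vmbus_is_active(root_partition, nested);
	/* Only a non-root parent (L1VH) allocates its own SIRBP page. */
	plan.mshv_allocates_sirbp = !root_partition;

	return plan;
}
```

For L1VH this yields (false, true), for a nested root partition
(false, false), and for a non-nested root partition (true, false).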

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c | 142 ++++++++++++++++++++++++++--------------
 1 file changed, 94 insertions(+), 48 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index e2288a726fec..f71d5dfce1c1 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -456,46 +456,72 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
 	union hv_synic_sint sint;
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 			&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 			&spages->synic_event_ring_page;
+	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+	bool vmbus_active = !hv_root_partition() || hv_nested;
 
-	/* Setup the Synic's message page */
+	/*
+	 * Map the SYNIC message page. When VMBus is not active the
+	 * hypervisor pre-provisions the SIMP GPA but may not set
+	 * simp_enabled — enable it here.
+	 */
 	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = true;
+	if (!vmbus_active) {
+		simp.simp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
 			     HV_HYP_PAGE_SIZE,
 			     MEMREMAP_WB);
 
 	if (!(*msg_page))
-		return -EFAULT;
+		goto cleanup_simp;
 
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-
-	/* Setup the Synic's event flags page */
+	/*
+	 * Map the event flags page. Same as SIMP: enable when
+	 * VMBus is not active, already enabled by VMBus otherwise.
+	 */
 	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = true;
+	if (!vmbus_active) {
+		siefp.siefp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	}
 	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
 				     PAGE_SIZE, MEMREMAP_WB);
 
 	if (!(*event_flags_page))
-		goto cleanup;
-
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+		goto cleanup_siefp;
 
 	/* Setup the Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
-	sirbp.sirbp_enabled = true;
-	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
-				    PAGE_SIZE, MEMREMAP_WB);
 
-	if (!(*event_ring_page))
-		goto cleanup;
+	if (hv_root_partition()) {
+		*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+					    PAGE_SIZE, MEMREMAP_WB);
 
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+	} else {
+		/*
+		 * On L1VH the hypervisor does not provide a SIRBP page.
+		 * Allocate one and program its GPA into the MSR.
+		 */
+		*event_ring_page = (struct hv_synic_event_ring_page *)
+			get_zeroed_page(GFP_KERNEL);
+
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+
+		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+				>> PAGE_SHIFT;
+	}
+
+	sirbp.sirbp_enabled = true;
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
 	if (mshv_sint_irq != -1)
@@ -518,28 +544,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
-	/* Enable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 1;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it already enabled SCONTROL. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 1;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 
-cleanup:
-	if (*event_ring_page) {
-		sirbp.sirbp_enabled = false;
-		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-		memunmap(*event_ring_page);
-	}
-	if (*event_flags_page) {
+cleanup_siefp:
+	if (*event_flags_page)
+		memunmap(*event_flags_page);
+	if (!vmbus_active) {
 		siefp.siefp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
-		memunmap(*event_flags_page);
 	}
-	if (*msg_page) {
+cleanup_simp:
+	if (*msg_page)
+		memunmap(*msg_page);
+	if (!vmbus_active) {
 		simp.simp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-		memunmap(*msg_page);
 	}
 
 	return -EFAULT;
@@ -548,16 +576,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 static int mshv_synic_cpu_exit(unsigned int cpu)
 {
 	union hv_synic_sint sint;
-	union hv_synic_simp simp;
-	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 		&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 		&spages->synic_event_ring_page;
+	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+	bool vmbus_active = !hv_root_partition() || hv_nested;
 
 	/* Disable the interrupt */
 	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
@@ -574,28 +601,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 	if (mshv_sint_irq != -1)
 		disable_percpu_irq(mshv_sint_irq);
 
-	/* Disable Synic's event ring page */
+	/* Disable SYNIC event ring page owned by MSHV */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-	memunmap(*event_ring_page);
 
-	/* Disable Synic's event flags page */
-	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	if (hv_root_partition()) {
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		memunmap(*event_ring_page);
+	} else {
+		sirbp.base_sirbp_gpa = 0;
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		free_page((unsigned long)*event_ring_page);
+	}
+
+	/*
+	 * Release our mappings of the message and event flags pages.
+	 * When VMBus is not active, we enabled SIMP/SIEFP — disable
+	 * them. Otherwise VMBus owns the MSRs — leave them.
+	 */
 	memunmap(*event_flags_page);
+	if (!vmbus_active) {
+		union hv_synic_simp simp;
+		union hv_synic_siefp siefp;
 
-	/* Disable Synic's message page */
-	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+		siefp.siefp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+		simp.simp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	memunmap(*msg_page);
 
-	/* Disable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 0;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it owns SCONTROL — leave it. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 0;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

The reboot notifier that tears down the SynIC cpuhp state guards the
cleanup with hv_root_partition(), so on L1VH (where
hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
cleaned up before kexec. The kexec'd kernel then inherits stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.

Remove the hv_root_partition() guard so the cleanup runs for all
parent partitions.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f71d5dfce1c1..8fe673c876fd 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -719,9 +719,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 static int mshv_synic_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
-	if (!hv_root_partition())
-		return 0;
-
 	cpuhp_remove_state(synic_cpuhp_online);
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 6/6] mshv: unmap debugfs stats pages on kexec
From: Jork Loeser @ 2026-04-08  1:36 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>

On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.

Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.
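
For illustration only (not part of this patch): a minimal userspace
sketch of the guard pattern described above. All names here
(stats_init, stats_exit, stats_dir, unmap_calls) are hypothetical
stand-ins, and the counter merely simulates the unmap hypercalls. Unlike
the patch, the sketch also clears the handle in the exit path, so a
second call is a no-op as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the debugfs directory handle. */
struct stats_dir {
	int mapped_pages;
};

static struct stats_dir dir_storage;
static struct stats_dir *stats_dir;	/* NULL means init never succeeded */
int unmap_calls;			/* counts simulated unmap operations */

int stats_init(bool fail)
{
	if (fail) {
		/* Never leave an error value behind in the handle. */
		stats_dir = NULL;
		return -1;
	}
	dir_storage.mapped_pages = 3;
	stats_dir = &dir_storage;
	return 0;
}

void stats_exit(void)
{
	/* Safe to call from a reboot notifier even if init failed. */
	if (!stats_dir)
		return;

	assert(stats_dir->mapped_pages >= 0);
	unmap_calls += stats_dir->mapped_pages;	/* one unmap per mapped page */
	stats_dir->mapped_pages = 0;
	stats_dir = NULL;			/* a repeated call does nothing */
}
```

With this shape, calling stats_exit() after a failed stats_init() does
no unmap work, and calling it twice after a successful init unmaps each
page only once.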

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs.c | 7 ++++++-
 drivers/hv/mshv_synic.c   | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index 418b6dc8f3c2..3c3e02237ae9 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
 
 	mshv_debugfs = debugfs_create_dir("mshv", NULL);
 	if (IS_ERR(mshv_debugfs)) {
+		err = PTR_ERR(mshv_debugfs);
+		mshv_debugfs = NULL;
 		pr_err("%s: failed to create debugfs directory\n", __func__);
-		return PTR_ERR(mshv_debugfs);
+		return err;
 	}
 
 	if (hv_root_partition()) {
@@ -710,6 +712,9 @@ int __init mshv_debugfs_init(void)
 
 void mshv_debugfs_exit(void)
 {
+	if (!mshv_debugfs)
+		return;
+
 	mshv_debugfs_parent_partition_remove();
 
 	if (hv_root_partition()) {
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 8fe673c876fd..ed025f90003f 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -719,6 +719,7 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 static int mshv_synic_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
+	mshv_debugfs_exit();
 	cpuhp_remove_state(synic_cpuhp_online);
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-08  1:51 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Dipayaan Roy, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
	edumazet, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees
In-Reply-To: <e80b603d-8be0-4aee-8a31-c9cbb4a8ab00@intel.com>

On Tue, 7 Apr 2026 15:10:45 +0200 Alexander Lobakin wrote:
> > On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
> > fragments for allocation in the RX refill path (~2kB buffer per fragment)
> > causes 15-20% throughput regression under high connection counts  
> > (>16 TCP streams at 180+ Gbps). Using full-page buffers on these  
> > platforms shows no regression and restores line-rate performance.
> > 
> > This behavior is observed on a single platform; other platforms
> > perform better with page_pool fragments, indicating this is not a
> > page_pool issue but platform-specific.
> > 
> > This series adds an ethtool private flag "full-page-rx" to let the
> > user opt in to one RX buffer per page:
> > 
> >   ethtool --set-priv-flags eth0 full-page-rx on  
> 
> Sorry I may've missed the previous threads.
> 
> Has this approach been discussed here? Private flags are generally
> discouraged.
> 
> Alternatively, you can provide Ethtool ops to change the Rx buffer size,
> so that you'd be able to set it to PAGE_SIZE on affected platforms and
> the result would be the same.

Actually, hm. Now that you spoke up I wonder how much this is
an inherent ARM problem vs problem in whatever ARM Microsoft's
management empire-built themselves into.

Do you have access to any ARM servers? Google says GCP offers ARM
instances with idpf NICs. So if idpf benefits from the same
"tuning" we should totally push for a proper API not priv flags.

^ permalink raw reply

* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-08  2:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
	leitao, kees
In-Reply-To: <20260407185128.605dcacf@kernel.org>

On Tue, Apr 07, 2026 at 06:51:28PM -0700, Jakub Kicinski wrote:
> On Tue, 7 Apr 2026 15:10:45 +0200 Alexander Lobakin wrote:
> > > On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
> > > fragments for allocation in the RX refill path (~2kB buffer per fragment)
> > > causes 15-20% throughput regression under high connection counts  
> > > (>16 TCP streams at 180+ Gbps). Using full-page buffers on these  
> > > platforms shows no regression and restores line-rate performance.
> > > 
> > > This behavior is observed on a single platform; other platforms
> > > perform better with page_pool fragments, indicating this is not a
> > > page_pool issue but platform-specific.
> > > 
> > > This series adds an ethtool private flag "full-page-rx" to let the
> > > user opt in to one RX buffer per page:
> > > 
> > >   ethtool --set-priv-flags eth0 full-page-rx on  
> > 
> > Sorry I may've missed the previous threads.
> > 
> > Has this approach been discussed here? Private flags are generally
> > discouraged.
> > 
> > Alternatively, you can provide Ethtool ops to change the Rx buffer size,
> > so that you'd be able to set it to PAGE_SIZE on affected platforms and
> > the result would be the same.
> 
> Actually, hm. Now that you spoke up I wonder how much this is
> an inherent ARM problem vs problem in whatever ARM Microsoft's
> management empire-built themselves into.
> 
> Do you have access to any ARM servers? Google says GCP offers ARM
> instances with idpf NICs. So if idpf benefits from the same
> "tuning" we should totally push for a proper API not priv flags.

Hi,

Sharing an observation from earlier: a different ARM64 fabric/platform,
configured with a 4K base page size and the same MANA NIC, did not show
this behaviour. In fact, it showed better performance with page fragments
for both single and multiple connections. That's why, in the initial version
of this patch, we wanted to apply the workaround only to the specific chip
where the issue is seen with page fragments.



Regards

^ permalink raw reply

* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Thomas Lefebvre @ 2026-04-08  4:13 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157A58DA24233B77829B59FD45AA@SN6PR02MB4157.namprd02.prod.outlook.com>

Thanks, that's very useful information.

Unfortunately, I can't easily reproduce the issue; I can't seem to get
pvclock_update_vm_gtod_copy() and __get_kvmclock() to run on different vCPUs,
which was one of the two conditions required to trigger the unsigned
subtraction wraparound (the other being inconsistent raw TSC values
between L1 vCPUs).
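
To make the wraparound concrete, here is a small userspace sketch (not
the KVM code): tsc_delta_naive() shows what an unsigned 64-bit
subtraction does when the "now" reading is lower than the saved base,
and tsc_delta_clamped() is one possible defensive variant. Both function
names are made up for illustration, and clamping is only an assumption
about how one might guard the subtraction, not a proposed fix.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "sketch expects a 64-bit counter");

/* Naive delta: wraps to a huge value if 'now' is behind 'base'. */
uint64_t tsc_delta_naive(uint64_t now, uint64_t base)
{
	return now - base;
}

/* Clamped delta: a counter that appears to go backwards yields 0. */
uint64_t tsc_delta_clamped(uint64_t now, uint64_t base)
{
	return now >= base ? now - base : 0;
}
```

If the second vCPU reads a raw value only 10 ticks behind the saved
base, the naive delta is 2^64 - 10 ticks rather than a small negative
number, which is how a tiny cross-CPU skew can turn into an enormous
forward jump once the delta is scaled to nanoseconds.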

I just upgraded to Windows 11 25H2 and moved my Hyper-V VM config from v9
to v12. I now see tsc_reliable and constant_tsc in lscpu in the L1 Linux VM,
and /sys/devices/system/clocksource/clocksource0/current_clocksource is
tsc.
I'll report back if I still encounter the problem when spinning up L2
Linux VMs with KVM.

On Tue, Apr 7, 2026 at 1:40 PM Michael Kelley <mhklinux@outlook.com> wrote:
>
> From: Thomas Lefebvre <thomas.lefebvre3@gmail.com> Sent: Tuesday, April 7, 2026 12:13 PM
> >
> > Hi everyone, thank you for your attention to this bug report.
> >
> > Michael,
> >
> > 1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
> > and "constant_tsc".
> > $ lscpu | grep tsc_reliable
> > $ lscpu | grep constant_tsc
> > $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> > hyperv_clocksource_tsc_page
> >
> > 2. Windows 10
> > Version 22H2 (OS Build 19045.6466)
> >
> > 3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
> > 0x24e24, misc 0xbed7b2
> >
> > 4. Yes, the laptop hibernates and then resumes.
> > When the problem occurred, the laptop had gone through multiple
> > hibernate and resume cycles.
> > I haven't seen it happen after a full reboot before a hibernate/resume cycle.
> >
> > Thomas
> >
>
> How easy is it for you to reproduce the problem? Would it be feasible
> to get a definitive answer on whether the problem repros after a
> full reboot, but before a hibernate/resume cycle?
>
> There's a known bug in Windows 10 Hyper-V where the hardware TSC
> scaling gets messed up after a hibernate/resume cycle, causing the TSC
> values read in the guest to drift from what the Hyper-V host thinks
> the guest's TSC value is. A summary of the problem is here:
> https://github.com/microsoft/WSL/issues/6982#issuecomment-2294892954
>
> Of course, this doesn't sound like your symptom. And Hyper-V is not
> telling your guest that it supports hardware TSC scaling, because the
> HV_ACCESS_TSC_INVARIANT flag is *not* set and the clocksource
> is hyperv_clocksource_tsc_page. But my understanding is that the code
> changes to fix the Hyper-V problem weren't trivial, and I'm speculating
> that maybe you are seeing some other symptom of whatever the
> underlying Hyper-V issue was.
>
> Of course, this is just speculation. If the problem can occur before
> any hibernate/resume cycles are done, then my speculation is
> wrong. But if the problem only happens after a hibernate/resume
> cycle, then this known problem, or something related to it, becomes
> a pretty good candidate. Unfortunately, I'm pretty sure there's no
> fix for Windows 10 Hyper-V. You would need to upgrade to
> Windows 11 22H2 or later.
>
> Michael

^ permalink raw reply

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08  6:15 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
	mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
	Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: stable@vger.kernel.org, Matthew Ruffell, Krister Johansen
In-Reply-To: <SN6PR02MB415726C7758C985528ACB912D45CA@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:11 PM
> > ...
> > Unluckily, setup_linux_vesafb() only recognizes the vesafb
> > driver in Linux kernel ("VESA VGA") and the efifb driver ("EFI VGA").
> > It looks like normally arch_options.reuse_video_type is always 0.
> >
> > This means the kdump kernel's screen_info.lfb_base is 0, if
> > hyperv_fb or hyperv_drm loads. In the past,  for a Ubuntu kernel
> > with CONFIG_FB_EFI=y, our workaround is blacklisting
> > hyperv_fb or hyperv_drm, so /dev/fb0 is backed by efifb, and
> > the screen_info.lfb_base is correctly set for kdump.
> 
> Hmmm. This is worse than I thought for x86/x64. In fact, it means
> a part of my commit message for 304386373007 is now wrong. I had
> described everything as working when using the kexec_load() system
> call because the FBIOGET_FSCREENINFO ioctl was returning a good
> value for smem_start (at least with the hyperv_fb driver). But as you
> point out further down, newer versions of the kexec user space program
> are ignoring that smem_start value unless the driver is vesafb or efifb.
> 
> Was blacklisting hyperv_fb or hyperv_drm in the kdump kernel
> a workaround we had promulgated in the past? My recollection
> is vague. But no matter.

Blacklisting hyperv_fb or hyperv_drm in the *first* kernel was an
internal workaround, which no longer works since CONFIG_FB_EFI
is not set in the linux-azure kernels.

Thanks,
Dexuan


^ permalink raw reply

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08  6:37 UTC (permalink / raw)
  To: Michael Kelley, Matthew Ruffell
  Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
	kwilczynski@kernel.org, KY Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
	mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
	wei.liu@kernel.org
In-Reply-To: <SN6PR02MB4157F48F3B37D74E3A9F1AFBD45CA@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:13 PM
> > ...
> > 96959283a58d adds "select SYSFB if !HYPERV_VTL_MODE", but we can
> > still manually unset CONFIG_SYSFB (I happened to do this when debugging
> > the kdump issue), and hv_pci won't work.
> 
> Just curious -- how would you manually unset CONFIG_SYSFB? The kernel
> makefile always resync's .config against the Kconfig rules, which would add
> CONFIG_SYSFB back again. The Kconfig files essentially say that removing
> CONFIG_SYSFB is an invalid configuration.

Sorry, my description above is wrong: on the mainline kernel that has
96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests"),
I'm unable to unset CONFIG_SYSFB.

When I was able to unset CONFIG_SYSFB, I was actually on Ubuntu 22.04
(Ubuntu-azure-6.8-6.8.0-1049.55_22.04.1, released in Feb 2026). I thought the
kernel has 96959283a58d, but actually it doesn't...

> > IMO vmbus_reserve_fb() should unconditionally reserve the frame buffer
> > MMIO range. I'll post a patch like this:
> >
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -2395,10 +2398,8 @@ static void __maybe_unused
> vmbus_reserve_fb(void)
> >
> >         if (efi_enabled(EFI_BOOT)) {
> >                 /* Gen2 VM: get FB base from EFI framebuffer */
> > -               if (IS_ENABLED(CONFIG_SYSFB)) {
> > -                       start = sysfb_primary_display.screen.lfb_base;
> > -                       size = max_t(__u32, sysfb_primary_display.screen.lfb_size,
> 0x800000);
> > -               }
> > +               start = sysfb_primary_display.screen.lfb_base;
> > +               size = max_t(__u32, sysfb_primary_display.screen.lfb_size,
> 0x800000);

Please ignore the change above.

> On arm64 the existence of sysfb_primary_display is conditional on
> several config variables, including CONFIG_SYSFB and CONFIG_EFI_EARLYCON.
> (see drivers/firmware/efi/efi-init.c) If you can take away CONFIG_SYSFB, you
> could also take away CONFIG_EFI_EARLYCON and end up with build error on
> arm64. So I'm not clear how this approach would be more robust against
> invalid .config changes.

Agreed. Then let's keep vmbus_reserve_fb() as is.

Thanks,
Dexuan


^ permalink raw reply

* [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-04-08  7:31 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, James.Bottomley,
	martin.petersen, apais
  Cc: Tianyu Lan, linux-hyperv, linux-kernel, linux-scsi, vdso,
	mhklinux

Hyper-V provides Confidential VMBus so that the device model and
the guest device driver can communicate via encrypted/private
memory in a Confidential VM. The device model runs in OpenHCL
(https://openvmm.dev/guide/user_guide/openhcl.html), which
plays the paravisor role.

A VMBus device has two ways to communicate with the
host/hypervisor: 1) the VMBus ring buffer and 2) dynamic
DMA transfers.

Confidential VMBus ring buffer support has been upstreamed by
Roman Kisel (commit 6802d8af47d1).

Dynamic DMA transfers of a VMBus device normally go through
the DMA core, which uses the SWIOTLB as a bounce buffer in
a CoCo VM.

A Confidential VMBus device can do DMA directly to
private/encrypted memory. Because the swiotlb is decrypted
memory, the DMA transfer must not be bounced through the
swiotlb, so as to preserve confidentiality. This differs from
the default for Linux CoCo VMs, so do not use the DMA (SWIOTLB)
API in the VMBus driver when the confidential dynamic DMA
transfer capability is present.
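
For illustration only (not part of this patch): a userspace sketch of
the page-frame arithmetic needed when the DMA API is bypassed and the
4K frame numbers are derived from a scatterlist entry's page, offset
and length rather than from its DMA address. The macro, struct and
function names below are hypothetical. The sketch assumes 4K hypervisor
pages and that page_pfn is already expressed in 4K units.

```c
#include <assert.h>
#include <stdint.h>

#define HV_PAGE_SHIFT	12
#define HV_PAGE_SIZE	(UINT64_C(1) << HV_PAGE_SHIFT)
/* Round a byte count down / up to a count of 4K pages. */
#define PFN_DOWN_4K(x)	((x) >> HV_PAGE_SHIFT)
#define PFN_UP_4K(x)	(((x) + HV_PAGE_SIZE - 1) >> HV_PAGE_SHIFT)

struct pfn_span {
	uint64_t first_pfn;	/* first 4K frame touched by the buffer */
	uint64_t count;		/* number of 4K frames touched */
};

/* Derive the 4K frames covered by 'length' bytes at 'offset' into a page. */
struct pfn_span span_from_page(uint64_t page_pfn, uint64_t offset,
			       uint64_t length)
{
	uint64_t pgoff = PFN_DOWN_4K(offset);	/* whole 4K pages skipped */
	struct pfn_span s;

	assert(length > 0);
	s.first_pfn = page_pfn + pgoff;
	s.count = PFN_UP_4K(offset + length) - pgoff;
	return s;
}
```

For example, a 4096-byte buffer starting 512 bytes into a page straddles
two 4K frames, so the span has a count of 2 even though the length is
exactly one page.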

Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
 include/linux/hyperv.h     |  1 +
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index ae1abab97835..79b7611518b7 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
 					continue;
 				}
 				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
-				scsi_dma_unmap(scmnd);
+				if (!device->co_external_memory)
+					scsi_dma_unmap(scmnd);
 			}
 
 			storvsc_on_receive(stor_device, packet, request);
@@ -1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
 
 	device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
 	device->channel->next_request_id_callback = storvsc_next_request_id;
+	if (device->channel->co_external_memory)
+		device->co_external_memory = true;
 
 	ret = vmbus_open(device->channel,
 			 ring_size,
@@ -1805,7 +1808,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
 		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
 		struct scatterlist *sg;
-		unsigned long hvpfn, hvpfns_to_add;
+		unsigned long hvpfn, hvpfns_to_add, hvpgoff;
 		int j, i = 0, sg_count;
 
 		payload_sz = (hvpg_count * sizeof(u64) +
@@ -1821,7 +1824,11 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 		payload->range.len = length;
 		payload->range.offset = offset_in_hvpg;
 
-		sg_count = scsi_dma_map(scmnd);
+		if (dev->co_external_memory)
+			sg_count = scsi_sg_count(scmnd);
+		else
+			sg_count = scsi_dma_map(scmnd);
+
 		if (sg_count < 0) {
 			ret = SCSI_MLQUEUE_DEVICE_BUSY;
 			goto err_free_payload;
@@ -1836,9 +1843,16 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 			 * Such offsets are handled even on other than the first
 			 * sgl entry, provided they are a multiple of PAGE_SIZE.
 			 */
-			hvpfn = HVPFN_DOWN(sg_dma_address(sg));
-			hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
-						 sg_dma_len(sg)) - hvpfn;
+			if (dev->co_external_memory) {
+				hvpgoff = HVPFN_DOWN(sg->offset);
+				hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
+				hvpfns_to_add =	HVPFN_UP(sg->offset + sg->length) -
+							hvpgoff;
+			} else {
+				hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+				hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+							 sg_dma_len(sg)) - hvpfn;
+			}
 
 			/*
 			 * Fill the next portion of the PFN array with
@@ -1860,7 +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
 	migrate_enable();
 
-	if (ret)
+	if (ret && (!dev->co_external_memory))
 		scsi_dma_unmap(scmnd);
 
 	if (ret == -EAGAIN) {
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..bcb143766d6e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1285,6 +1285,7 @@ struct hv_device {
 
 	/* place holder to keep track of the dir for hv device in debugfs */
 	struct dentry *debug_dir;
+	bool co_external_memory;
 
 };
 
-- 
2.50.1


^ permalink raw reply related

* Re: [PATCH net-next v5 1/3] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-08  8:12 UTC (permalink / raw)
  To: Simon Horman
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, shradhagupta, shirazsaleem,
	yury.norov, kees, ssengar, dipayanroy, gargaditya, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260404090514.GS113102@horms.kernel.org>

On Sat, Apr 04, 2026 at 10:05:14AM +0100, Simon Horman wrote:
> On Thu, Apr 02, 2026 at 11:26:55AM -0700, Erni Sri Satya Vennela wrote:
> > Use pci_name(pdev) for the per-device debugfs directory instead of
> > hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
> > previous approach had two issues:
> > 
> > 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
> >    in environments like generic VFIO passthrough or nested KVM,
> >    causing a NULL pointer dereference.
> > 
> > 2. Multiple PFs would all use "0", and VFs across different PCI
> >    domains or buses could share the same slot name, leading to
> >    -EEXIST errors from debugfs_create_dir().
> > 
> > pci_name(pdev) returns the unique BDF address, is always valid, and
> > is unique across the system.
> > 
> > Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> 
> Hi Erni,
> 
> Possibly the code differs between net and net-next.
> But if this is fixing a bug in code present in net - as per the cited
> commit - then I think it should be a patch that targets net.
> With some strategy for merging that change into net-next
> if conflicts are expected.

Thank you for the clarity, Simon.
I will send a separate patchset for the net tree with the fixes.

- Vennela

^ permalink raw reply
