* RE: [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests
From: Michael Kelley @ 2026-01-08 18:45 UTC (permalink / raw)
To: Yu Zhang, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
linux-pci@vger.kernel.org
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
will@kernel.org, robin.murphy@arm.com,
easwar.hariharan@linux.microsoft.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-1-zhangyu1@linux.microsoft.com>
From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>
> This patch series introduces a para-virtualized IOMMU driver for
> Linux guests running on Microsoft Hyper-V. The primary objective
> is to enable hardware-assisted DMA isolation and scalable device
Is there any particular meaning for the qualifier "scalable" vs. just
"device assignment"? I just want to understand what you are getting
at.
> assignment for Hyper-V child partitions, bypassing the performance
> overhead and complexity associated with emulated IOMMU hardware.
>
> The driver implements the following core functionality:
> * Hypercall-based Enumeration
> Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS),
> this driver enumerates the Hyper-V IOMMU capabilities directly
> via hypercalls. This approach allows the guest to discover
> IOMMU presence and features without requiring specific virtual
> firmware extensions or modifications.
>
> * Domain Management
> The driver manages IOMMU domains through a new set of Hyper-V
> hypercall interfaces, handling domain allocation, attachment,
> and detachment for endpoint devices.
>
> * IOTLB Invalidation
> IOTLB invalidation requests are marshaled and issued to the
> hypervisor through the same hypercall mechanism.
>
> * Nested Translation Support
> This implementation leverages guest-managed stage-1 I/O page
> tables nested with host stage-2 translations. It is built
> upon the consolidated IOMMU page table framework designed by
> Jason Gunthorpe [1]. This design eliminates the need for complex
> emulation during map operations and ensures scalability across
> different architectures.
>
> Implementation Notes:
> * Architecture Independence
> While the current implementation only supports x86 platforms (Intel
> VT-d and AMD IOMMU), the driver design aims to be as architecture-
> agnostic as possible. To achieve this, initialization occurs via
> `device_initcall` rather than `x86_init.iommu.iommu_init`, and shutdown
> is handled via `syscore_ops` instead of `x86_platform.iommu_shutdown`.
>
> * MSI Region Handling
> In this RFC, the hardware MSI region is hard-coded to the standard
> x86 interrupt range (0xfee00000 - 0xfeefffff). Future updates may
> allow this configuration to be queried via hypercalls if new hardware
> platforms are to be supported.
>
> * Reserved Regions (RMRR)
> There is currently no requirement to support assigned devices with
> ACPI RMRR limitations. Consequently, this patch series does not specify
> or query reserved memory regions.
>
> Testing:
> This series has been validated using dmatest with Intel DSA devices
> assigned to the child partition. The tests confirmed successful DMA
> transactions under the para-virtualized IOMMU.
>
> Future Work:
> * Page-selective IOTLB Invalidation
> The current implementation relies on full-domain flushes. Support
> for page-selective invalidation is planned for a future series.
>
> * Advanced Features
> Support for vSVA and virtual PRI will be addressed in subsequent
> updates.
>
> * Root Partition Co-existence
> Ensure compatibility with the distinct para-virtualized IOMMU driver
> used by Hyper-V's Linux root partition, in which the DMA remapping
> is not achieved by stage-1 IO page tables and another set of iommu
> ops is provided.
>
> [1] https://github.com/jgunthorpe/linux/tree/iommu_pt_all
>
> Easwar Hariharan (2):
> PCI: hv: Create and export hv_build_logical_dev_id()
> iommu: Move Hyper-V IOMMU driver to its own subdirectory
>
> Wei Liu (1):
> hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
>
> Yu Zhang (2):
> hyperv: allow hypercall output pages to be allocated for child
> partitions
> iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
>
> drivers/hv/hv_common.c | 21 +-
> drivers/iommu/Kconfig | 10 +-
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/hyperv/Kconfig | 24 +
> drivers/iommu/hyperv/Makefile | 3 +
> drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++
> drivers/iommu/hyperv/iommu.h | 53 ++
> .../irq_remapping.c} | 2 +-
> drivers/pci/controller/pci-hyperv.c | 28 +-
> include/asm-generic/mshyperv.h | 2 +
> include/hyperv/hvgdk_mini.h | 8 +
> include/hyperv/hvhdk_mini.h | 123 ++++
> 12 files changed, 850 insertions(+), 34 deletions(-)
> create mode 100644 drivers/iommu/hyperv/Kconfig
> create mode 100644 drivers/iommu/hyperv/Makefile
> create mode 100644 drivers/iommu/hyperv/iommu.c
> create mode 100644 drivers/iommu/hyperv/iommu.h
> rename drivers/iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} (99%)
>
> --
> 2.49.0
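
For illustration only (not part of the series): the "MSI Region Handling"
note above hard-codes the window 0xfee00000 - 0xfeefffff. A minimal
user-space sketch of how an IOVA range could be tested against that window
follows. The helper name iova_overlaps_msi() and the macro names are
hypothetical; only the two bounds come from the cover letter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds quoted in the cover letter; both ends are inclusive. */
#define MSI_REGION_START 0xfee00000ULL
#define MSI_REGION_END   0xfeefffffULL

/*
 * Hypothetical helper: return true if [iova, iova + size) touches the
 * hard-coded MSI window. An empty range never overlaps.
 */
static bool iova_overlaps_msi(uint64_t iova, uint64_t size)
{
	uint64_t last;

	if (size == 0)
		return false;

	last = iova + size - 1;
	if (last < iova)	/* the addition wrapped; clamp to the top */
		last = UINT64_MAX;

	return iova <= MSI_REGION_END && last >= MSI_REGION_START;
}
```

A mapping path could use such a check to refuse, or carve out, any IOVA
range that would collide with the interrupt window.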
* Re: [PATCH v8 03/10] dt-bindings: reserved-memory: Wakeup Mailbox for Intel processors
From: Ricardo Neri @ 2026-01-08 4:13 UTC (permalink / raw)
To: Rob Herring (Arm)
Cc: linux-acpi, Chris Oo, Dexuan Cui, K. Y. Srinivasan, Ricardo Neri,
Wei Liu, Kirill A. Shutemov, Saurabh Sengar, Conor Dooley,
devicetree, Yunhong Jiang, Michael Kelley, Rafael J. Wysocki,
Rafael J. Wysocki (Intel), Haiyang Zhang, Krzysztof Kozlowski,
linux-hyperv, x86, linux-kernel
In-Reply-To: <176782823140.2300431.3081932954431387872.robh@kernel.org>
On Wed, Jan 07, 2026 at 05:23:51PM -0600, Rob Herring (Arm) wrote:
>
> On Wed, 07 Jan 2026 13:44:39 -0800, Ricardo Neri wrote:
> > Add DeviceTree bindings to enumerate the wakeup mailbox used in platform
> > firmware for Intel processors.
> >
> > x86 platforms commonly boot secondary CPUs using an INIT assert, de-assert
> > followed by Start-Up IPI messages. The wakeup mailbox can be used when this
> > mechanism is unavailable.
> >
> > The wakeup mailbox offers more control to the operating system to boot
> > secondary CPUs than a spin-table. It allows the reuse of the same wakeup
> > vector for all CPUs while maintaining control over which CPUs to boot and
> > when. While it is possible to achieve the same level of control using a
> > spin-table, it would require specifying a separate `cpu-release-addr` for
> > each secondary CPU.
> >
> > The operation and structure of the mailbox are described in the
> > Multiprocessor Wakeup Structure defined in the ACPI specification. Note
> > that this structure does not specify how to publish the mailbox to the
> > operating system (ACPI-based platform firmware uses a separate table). No
> > ACPI table is needed in DeviceTree-based firmware to enumerate the mailbox.
> >
> > Nodes that want to refer to the reserved memory usually define
> > a `memory-region` property. /cpus/cpu* nodes would want to refer to the
> > mailbox, but they do not have such property defined in the DeviceTree
> > specification. Moreover, it would imply that there is a memory region per
> > CPU. Instead, add a `compatible` property that the operating system can use
> > to discover the mailbox.
> >
> > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
> > Acked-by: Rafael J. Wysocki (Intel) <rafael.j.wysocki@intel.com>
> > Co-developed-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes in v8:
> > - None
> >
> > Changes in v7:
> > - Fixed Acked-by tag from Rafael to include the "(Intel)" suffix.
> >
> > Changes in v6:
> > - Reworded the changelog for clarity.
> > - Added Acked-by tag from Rafael. Thanks!
> > - Added Reviewed-by tag from Rob. Thanks!
> > - Added Reviewed-by tag from Dexuan. Thanks!
> >
> > Changes in v5:
> > - Specified the version and section of the ACPI spec in which the
> > wakeup mailbox is defined. (Rafael)
> > - Fixed a warning from yamllint about line lengths of URLs.
> >
> > Changes in v4:
> > - Removed redefinitions of the mailbox and instead referred to ACPI
> > specification as per discussion on LKML.
> > - Clarified that DeviceTree-based firmware does not require the use of
> > ACPI tables to enumerate the mailbox. (Rob)
> > - Described the need of using a `compatible` property.
> > - Dropped the `alignment` property. (Krzysztof, Rafael)
> > - Used a real address for the mailbox node. (Krzysztof)
> >
> > Changes in v3:
> > - Implemented the mailbox as a reserved-memory node. Add to it a
> > `compatible` property. (Krzysztof)
> > - Explained the relationship between the mailbox and the `enable-method`
> > property of the CPU nodes.
> > - Expanded the documentation of the binding.
> >
> > Changes in v2:
> > - Added more details to the description of the binding.
> > - Added a new requirement for cpu@N nodes to add an
> > `enable-method`.
> > ---
> > .../reserved-memory/intel,wakeup-mailbox.yaml | 50 ++++++++++++++++++++++
> > 1 file changed, 50 insertions(+)
> >
>
> My bot found errors running 'make dt_binding_check' on your patch:
>
> yamllint warnings/errors:
> ./Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml:23:1: [warning] too many blank lines (2 > 1) (empty-lines)
This was triggered by an empty line in a patch already reviewed by the DT
binding maintainers. Previous versions did not trigger this warning because
the check that allows at most one empty line is rather recent.
Would it be possible for the review to proceed? I am happy to post a new
version fixing this after the rest of the patches have been reviewed.
Thanks and BR,
Ricardo
* Re: [PATCH net-next v12 04/12] vsock: add netns support to virtio transports
From: Bobby Eshleman @ 2026-01-08 0:41 UTC (permalink / raw)
To: Paolo Abeni
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, linux-kernel,
virtualization, netdev, kvm, linux-hyperv, linux-kselftest,
berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <99b6f3f7-4130-436a-bfef-3ef35832e02c@redhat.com>
On Wed, Jan 07, 2026 at 10:47:56AM +0100, Paolo Abeni wrote:
> Hi,
>
> On 12/2/25 11:01 PM, Bobby Eshleman wrote:
> > On Tue, Dec 02, 2025 at 09:47:19PM +0100, Paolo Abeni wrote:
> >> I still have some concern WRT the dynamic mode change after netns
> >> creation. I fear some 'unsolvable' (or very hard to solve) race I can't
> >> see now. A tcp_child_ehash_entries-like model will avoid completely the
> >> issue, but I understand it would be a significant change over the
> >> current status.
> >>
> >> "Luckily" the merge window is on us and we have some time to discuss. Do
> >> you have a specific use-case for the ability to change the netns mode
> >> after creation?
> >>
> >> /P
> >
> > I don't think there is a hard requirement that the mode be change-able
> > after creation. Though I'd love to avoid such a big change... or at
> > least leave unchanged as much of what we've already reviewed as
> > possible.
> >
> > In the scheme of defining the mode at creation and following the
> > tcp_child_ehash_entries-ish model, what I'm imagining is:
> > - /proc/sys/net/vsock/child_ns_mode can be set to "local" or "global"
> > - /proc/sys/net/vsock/child_ns_mode is not immutable, can change any
> > number of times
> >
> > - when a netns is created, the new netns mode is inherited from
> > child_ns_mode, being assigned using something like:
> >
> > net->vsock.ns_mode =
> > get_net_ns_by_pid(current->pid)->child_ns_mode
> >
> > - /proc/sys/net/vsock/ns_mode queries the current mode, returning
> > "local" or "global", returning value of net->vsock.ns_mode
> > - /proc/sys/net/vsock/ns_mode and net->vsock.ns_mode are immutable and
> > reject writes
> >
> > Does that align with what you have in mind?
> Sorry for the latency. This fell off my radar while I still processed PW
> before EoY and afterwards I had some break.
>
> Yes, the above aligns with what I suggested, and I think it should solve
> possible race-related concerns (but I haven't looked at the RFC).
>
> /P
>
>
No worries, understandable! Thanks for the confirmation.
Best,
Bobby
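
For illustration only: the model agreed on above (a writable child_ns_mode
that a new netns snapshots into an immutable ns_mode) can be sketched in
user-space C as below. All names here (struct ns_sketch, ns_create(),
ns_write_mode(), ns_write_child_mode()) are hypothetical and are not taken
from the series. Two further assumptions: a new netns starts with a
child_ns_mode equal to its own mode, and a rejected write returns -EPERM.

```c
#include <errno.h>

enum vsock_ns_mode { VSOCK_NS_GLOBAL, VSOCK_NS_LOCAL };

/* Hypothetical stand-in for the per-netns vsock state. */
struct ns_sketch {
	enum vsock_ns_mode ns_mode;		/* fixed at creation */
	enum vsock_ns_mode child_ns_mode;	/* may change any number of times */
};

/* A new netns snapshots the creator's child_ns_mode as its own mode. */
static void ns_create(struct ns_sketch *child, const struct ns_sketch *creator)
{
	child->ns_mode = creator->child_ns_mode;
	/* Assumption: the child's own default for future children. */
	child->child_ns_mode = child->ns_mode;
}

/* Writes to child_ns_mode always succeed. */
static int ns_write_child_mode(struct ns_sketch *ns, enum vsock_ns_mode mode)
{
	ns->child_ns_mode = mode;
	return 0;
}

/* ns_mode is immutable: every write is rejected (error code assumed). */
static int ns_write_mode(struct ns_sketch *ns, enum vsock_ns_mode mode)
{
	(void)ns;
	(void)mode;
	return -EPERM;
}
```

Because ns_mode is only ever assigned in ns_create(), a later change to the
creator's child_ns_mode cannot affect namespaces that already exist, which
is the property that sidesteps the mode-change races discussed above.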
* Re: [PATCH v8 03/10] dt-bindings: reserved-memory: Wakeup Mailbox for Intel processors
From: Rob Herring (Arm) @ 2026-01-07 23:23 UTC (permalink / raw)
To: Ricardo Neri
Cc: linux-acpi, Chris Oo, Dexuan Cui, K. Y. Srinivasan, Ricardo Neri,
Wei Liu, Kirill A. Shutemov, Saurabh Sengar, Conor Dooley,
devicetree, Yunhong Jiang, Michael Kelley, Rafael J. Wysocki,
Rafael J. Wysocki (Intel), Haiyang Zhang, Krzysztof Kozlowski,
linux-hyperv, x86, linux-kernel
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-3-2f5b6785f2f5@linux.intel.com>
On Wed, 07 Jan 2026 13:44:39 -0800, Ricardo Neri wrote:
> Add DeviceTree bindings to enumerate the wakeup mailbox used in platform
> firmware for Intel processors.
>
> x86 platforms commonly boot secondary CPUs using an INIT assert, de-assert
> followed by Start-Up IPI messages. The wakeup mailbox can be used when this
> mechanism is unavailable.
>
> The wakeup mailbox offers more control to the operating system to boot
> secondary CPUs than a spin-table. It allows the reuse of the same wakeup
> vector for all CPUs while maintaining control over which CPUs to boot and
> when. While it is possible to achieve the same level of control using a
> spin-table, it would require specifying a separate `cpu-release-addr` for
> each secondary CPU.
>
> The operation and structure of the mailbox are described in the
> Multiprocessor Wakeup Structure defined in the ACPI specification. Note
> that this structure does not specify how to publish the mailbox to the
> operating system (ACPI-based platform firmware uses a separate table). No
> ACPI table is needed in DeviceTree-based firmware to enumerate the mailbox.
>
> Nodes that want to refer to the reserved memory usually define
> a `memory-region` property. /cpus/cpu* nodes would want to refer to the
> mailbox, but they do not have such property defined in the DeviceTree
> specification. Moreover, it would imply that there is a memory region per
> CPU. Instead, add a `compatible` property that the operating system can use
> to discover the mailbox.
>
> Reviewed-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
> Acked-by: Rafael J. Wysocki (Intel) <rafael.j.wysocki@intel.com>
> Co-developed-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v8:
> - None
>
> Changes in v7:
> - Fixed Acked-by tag from Rafael to include the "(Intel)" suffix.
>
> Changes in v6:
> - Reworded the changelog for clarity.
> - Added Acked-by tag from Rafael. Thanks!
> - Added Reviewed-by tag from Rob. Thanks!
> - Added Reviewed-by tag from Dexuan. Thanks!
>
> Changes in v5:
> - Specified the version and section of the ACPI spec in which the
> wakeup mailbox is defined. (Rafael)
> - Fixed a warning from yamllint about line lengths of URLs.
>
> Changes in v4:
> - Removed redefinitions of the mailbox and instead referred to ACPI
> specification as per discussion on LKML.
> - Clarified that DeviceTree-based firmware does not require the use of
> ACPI tables to enumerate the mailbox. (Rob)
> - Described the need of using a `compatible` property.
> - Dropped the `alignment` property. (Krzysztof, Rafael)
> - Used a real address for the mailbox node. (Krzysztof)
>
> Changes in v3:
> - Implemented the mailbox as a reserved-memory node. Add to it a
> `compatible` property. (Krzysztof)
> - Explained the relationship between the mailbox and the `enable-method`
> property of the CPU nodes.
> - Expanded the documentation of the binding.
>
> Changes in v2:
> - Added more details to the description of the binding.
> - Added a new requirement for cpu@N nodes to add an
> `enable-method`.
> ---
> .../reserved-memory/intel,wakeup-mailbox.yaml | 50 ++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
My bot found errors running 'make dt_binding_check' on your patch:
yamllint warnings/errors:
./Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml:23:1: [warning] too many blank lines (2 > 1) (empty-lines)
dtschema/dtc warnings/errors:
doc reference errors (make refcheckdocs):
See https://patchwork.kernel.org/project/devicetree/patch/20260107-rneri-wakeup-mailbox-v8-3-2f5b6785f2f5@linux.intel.com
The base for the series is generally the latest rc1. A different dependency
should be noted in *this* patch.
If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:
pip3 install dtschema --upgrade
Please check and re-submit after running the above command yourself. Note
that DT_SCHEMA_FILES can be set to your schema file to speed up checking
your schema. However, it must be unset to test all examples with your schema.
* [PATCH v8 10/10] x86/hyperv/vtl: Use the wakeup mailbox to boot secondary CPUs
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
The hypervisor is an untrusted entity for TDX guests, so it cannot be used
to boot secondary CPUs. Consequently, the function
hv_vtl_wakeup_secondary_cpu() cannot be used.
Instead, the virtual firmware boots the secondary CPUs and places them in
a state to transfer control to the kernel using the wakeup mailbox. The
firmware enumerates the mailbox via either an ACPI table or a DeviceTree
node.
If the wakeup mailbox is present, the kernel updates the APIC callback
wakeup_secondary_cpu_64() to use it.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- None
Changes in v6:
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Unconditionally use the wakeup mailbox in a TDX confidential VM.
(Michael).
- Edited the commit message for clarity.
Changes in v2:
- None
---
arch/x86/hyperv/hv_vtl.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 2af825f7a447..fa4e7fda2868 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -272,7 +272,15 @@ int __init hv_vtl_early_init(void)
panic("XSAVE has to be disabled as it is not supported by this module.\n"
"Please add 'noxsave' to the kernel command line.\n");
- apic_update_callback(wakeup_secondary_cpu_64, hv_vtl_wakeup_secondary_cpu);
+ /*
+ * TDX confidential VMs do not trust the hypervisor and cannot use it to
+ * boot secondary CPUs. Instead, they will be booted using the wakeup
+ * mailbox if detected during boot. See setup_arch().
+ *
+ * There is no paravisor present if we are here.
+ */
+ if (!hv_isolation_type_tdx())
+ apic_update_callback(wakeup_secondary_cpu_64, hv_vtl_wakeup_secondary_cpu);
return 0;
}
--
2.43.0
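
For illustration only: the selection logic described in this patch can be
modelled in user-space C as below. The struct, the two stub functions and
select_wakeup() are hypothetical stand-ins; in the kernel the mailbox
callback is installed elsewhere during boot and the hypervisor path is
installed through apic_update_callback(). The stub return values are
arbitrary labels.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*wakeup_fn)(unsigned int apicid, unsigned long start_ip);

/* Stand-in for the hypervisor-assisted boot path. */
static int wakeup_via_hypercall(unsigned int apicid, unsigned long start_ip)
{
	(void)apicid;
	(void)start_ip;
	return 1;
}

/* Stand-in for the firmware wakeup mailbox path. */
static int wakeup_via_mailbox(unsigned int apicid, unsigned long start_ip)
{
	(void)apicid;
	(void)start_ip;
	return 2;
}

struct apic_sketch {
	wakeup_fn wakeup_secondary_cpu_64;
};

static void select_wakeup(struct apic_sketch *apic, bool mailbox_present,
			  bool is_tdx)
{
	/* Earlier boot code installs the mailbox callback if one was found. */
	if (mailbox_present)
		apic->wakeup_secondary_cpu_64 = wakeup_via_mailbox;

	/* Mirrors the patch: only non-TDX guests take the hypervisor path. */
	if (!is_tdx)
		apic->wakeup_secondary_cpu_64 = wakeup_via_hypercall;
}
```

In this model a TDX guest keeps whatever the mailbox detection installed,
while any other VTL guest ends up on the hypervisor path, as before the
patch.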
* [PATCH v8 09/10] x86/hyperv/vtl: Mark the wakeup mailbox page as private
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Yunhong Jiang,
Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
From: Yunhong Jiang <yunhong.jiang@linux.intel.com>
The current code maps MMIO devices as shared (decrypted) by default in a
confidential computing VM.
In a TDX environment, secondary CPUs are booted using the Multiprocessor
Wakeup Structure defined in the ACPI specification. The virtual firmware
and the operating system function in the guest context, without
intervention from the VMM. Map the physical memory of the mailbox as
private. Use the is_private_mmio() callback.
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- Included linux/acpi.h to add missing definitions that caused build
breaks (kernel test robot)
Changes in v7:
- Dropped check for !CONFIG_X86_MAILBOX_WAKEUP. The symbol is no longer
valid and now we have a stub for !CONFIG_ACPI.
- Dropped Reviewed-by tags from Dexuan and Michael as this patch
changed.
Changes in v6:
- Fixed a compile error with !CONFIG_X86_MAILBOX_WAKEUP.
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Updated to use the renamed function acpi_get_mp_wakeup_mailbox_paddr().
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Use the new helper function get_mp_wakeup_mailbox_paddr().
- Edited the commit message for clarity.
Changes in v2:
- Added the helper function within_page() to improve readability
- Override the is_private_mmio() callback when detecting a TDX
environment. The address of the mailbox is checked in
hv_is_private_mmio_tdx().
---
arch/x86/hyperv/hv_vtl.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 752101544663..2af825f7a447 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -6,6 +6,9 @@
* Saurabh Sengar <ssengar@microsoft.com>
*/
+#include <linux/acpi.h>
+
+#include <asm/acpi.h>
#include <asm/apic.h>
#include <asm/boot.h>
#include <asm/desc.h>
@@ -59,6 +62,18 @@ static void __noreturn hv_vtl_restart(char __maybe_unused *cmd)
hv_vtl_emergency_restart();
}
+static inline bool within_page(u64 addr, u64 start)
+{
+ return addr >= start && addr < (start + PAGE_SIZE);
+}
+
+static bool hv_vtl_is_private_mmio_tdx(u64 addr)
+{
+ u64 mb_addr = acpi_get_mp_wakeup_mailbox_paddr();
+
+ return mb_addr && within_page(addr, mb_addr);
+}
+
void __init hv_vtl_init_platform(void)
{
/*
@@ -71,6 +86,8 @@ void __init hv_vtl_init_platform(void)
/* There is no paravisor present if we are here. */
if (hv_isolation_type_tdx()) {
x86_init.resources.realmode_limit = SZ_4G;
+ x86_platform.hyper.is_private_mmio = hv_vtl_is_private_mmio_tdx;
+
} else {
x86_platform.realmode_reserve = x86_init_noop;
x86_platform.realmode_init = x86_init_noop;
--
2.43.0
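
For illustration only: the address check added by this patch can be
exercised in user space as below. PAGE_SIZE_SKETCH is an assumed 4 KiB page
size, and the mailbox address is passed as a parameter instead of being
read through acpi_get_mp_wakeup_mailbox_paddr(); the comparison itself
mirrors within_page() and hv_vtl_is_private_mmio_tdx() from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096ULL	/* assumed page size */

/* True if addr lies in the single page that begins at start. */
static bool within_page(uint64_t addr, uint64_t start)
{
	return addr >= start && addr < (start + PAGE_SIZE_SKETCH);
}

/*
 * Hypothetical variant of hv_vtl_is_private_mmio_tdx() that takes the
 * mailbox address as an argument. A mailbox address of 0 means "no
 * mailbox", so nothing is treated as private.
 */
static bool is_private_mmio_sketch(uint64_t addr, uint64_t mb_addr)
{
	return mb_addr && within_page(addr, mb_addr);
}
```

Only addresses inside the one page that starts at the mailbox are reported
as private; every other MMIO address keeps the default shared mapping.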
* [PATCH v8 08/10] x86/acpi: Add a helper to get the address of the wakeup mailbox
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
A Hyper-V VTL level 2 guest in a TDX environment needs to map the physical
page of the ACPI Multiprocessor Wakeup Structure as private (encrypted). It
needs to know the physical address of this structure. Add a helper function
to retrieve the address.
Suggested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- Added Acked-by tag from Rafael. Thanks!
Changes in v7:
- Moved the added function to arch/x86/kernel/acpi/madt_wakeup.c
- Dropped Reviewed-by tags from Dexuan and Michael as this patch
changed.
Changes in v6:
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Renamed function to acpi_get_mp_wakeup_mailbox_paddr().
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Introduced this patch
Changes in v2:
- N/A
---
arch/x86/include/asm/acpi.h | 6 ++++++
arch/x86/kernel/acpi/madt_wakeup.c | 5 +++++
2 files changed, 11 insertions(+)
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 820df375df79..c4e6459bd56b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -184,6 +184,7 @@ void __iomem *x86_acpi_os_ioremap(acpi_physical_address phys, acpi_size size);
void acpi_setup_mp_wakeup_mailbox(u64 addr);
struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mailbox(void);
+u64 acpi_get_mp_wakeup_mailbox_paddr(void);
#else /* !CONFIG_ACPI */
@@ -210,6 +211,11 @@ static inline struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mail
return NULL;
}
+static inline u64 acpi_get_mp_wakeup_mailbox_paddr(void)
+{
+ return 0;
+}
+
#endif /* !CONFIG_ACPI */
#define ARCH_HAS_POWER_INIT 1
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 82caf44b45e3..48734e4a6e8f 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -258,3 +258,8 @@ struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mailbox(void)
{
return acpi_mp_wake_mailbox;
}
+
+u64 acpi_get_mp_wakeup_mailbox_paddr(void)
+{
+ return acpi_mp_wake_mailbox_paddr;
+}
--
2.43.0
* [PATCH v8 07/10] x86/hyperv/vtl: Setup the 64-bit trampoline for TDX guests
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Yunhong Jiang,
Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
From: Yunhong Jiang <yunhong.jiang@linux.intel.com>
The hypervisor is an untrusted entity for TDX guests. It cannot be used
to boot secondary CPUs - neither via hypercalls nor the INIT assert,
de-assert, plus Start-Up IPI messages.
Instead, the platform virtual firmware boots the secondary CPUs and
puts them in a state to transfer control to the kernel. This mechanism uses
the wakeup mailbox described in the Multiprocessor Wakeup Structure of the
ACPI specification. The entry point to the kernel is trampoline_start64.
Allocate and set up the trampoline using the default x86_platform callbacks.
The platform firmware configures the secondary CPUs in long mode. It is no
longer necessary to locate the trampoline under 1MB memory. After handoff
from firmware, the trampoline code switches briefly to 32-bit addressing
mode, which has an addressing limit of 4GB. Set the upper bound of the
trampoline memory accordingly.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- None
Changes in v6:
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Added a note that there is no need to check for a present
paravisor.
- Edited commit message for clarity.
Changes in v2:
- Dropped the function hv_reserve_real_mode(). Instead, used the new
members realmode_limit and reserve_bios of x86_init to
set the upper bound of the trampoline memory. (Thomas)
---
arch/x86/hyperv/hv_vtl.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index f74199e77133..752101544663 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -68,9 +68,14 @@ void __init hv_vtl_init_platform(void)
*/
pr_info("Linux runs in Hyper-V Virtual Trust Level %d\n", ms_hyperv.vtl);
- x86_platform.realmode_reserve = x86_init_noop;
- x86_platform.realmode_init = x86_init_noop;
- real_mode_header = &hv_vtl_real_mode_header;
+ /* There is no paravisor present if we are here. */
+ if (hv_isolation_type_tdx()) {
+ x86_init.resources.realmode_limit = SZ_4G;
+ } else {
+ x86_platform.realmode_reserve = x86_init_noop;
+ x86_platform.realmode_init = x86_init_noop;
+ real_mode_header = &hv_vtl_real_mode_header;
+ }
x86_init.irqs.pre_vector_init = x86_init_noop;
x86_init.timers.timer_init = x86_init_noop;
x86_init.resources.probe_roms = x86_init_noop;
--
2.43.0
* [PATCH v8 06/10] x86/realmode: Make the location of the trampoline configurable
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Yunhong Jiang,
Thomas Gleixner, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
From: Yunhong Jiang <yunhong.jiang@linux.intel.com>
x86 CPUs boot in real mode. This mode uses a 1MB address space. The
trampoline must reside below this 1MB memory boundary.
There are platforms in which the firmware boots the secondary CPUs,
switches them to long mode and transfers control to the kernel. An example
of such a mechanism is the ACPI Multiprocessor Wakeup Structure.
In this scenario there is no restriction on locating the trampoline under
1MB memory. Moreover, certain platforms (for example, Hyper-V VTL guests)
may not have memory available for allocation below 1MB.
Add a new member to struct x86_init_resources to specify the upper bound
for the location of the trampoline memory. Keep the default upper bound
of 1MB to preserve the current behavior.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- None
Changes in v6:
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Edited the commit message for clarity.
- Minor tweaks to comments.
- Removed the option to not reserve the first 1MB of memory as it is
not needed.
Changes in v2:
- Added this patch using code that Thomas suggested:
https://lore.kernel.org/lkml/87a5ho2q6x.ffs@tglx/
---
arch/x86/include/asm/x86_init.h | 3 +++
arch/x86/kernel/x86_init.c | 3 +++
arch/x86/realmode/init.c | 7 +++----
3 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 6c8a6ead84f6..953d3199408a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -31,12 +31,15 @@ struct x86_init_mpparse {
* platform
* @memory_setup: platform specific memory setup
* @dmi_setup: platform specific DMI setup
+ * @realmode_limit: platform specific address limit for the real mode trampoline
+ * (default 1M)
*/
struct x86_init_resources {
void (*probe_roms)(void);
void (*reserve_resources)(void);
char *(*memory_setup)(void);
void (*dmi_setup)(void);
+ unsigned long realmode_limit;
};
/**
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 0a2bbd674a6d..a25fd7282811 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -9,6 +9,7 @@
#include <linux/export.h>
#include <linux/pci.h>
#include <linux/acpi.h>
+#include <linux/sizes.h>
#include <asm/acpi.h>
#include <asm/bios_ebda.h>
@@ -69,6 +70,8 @@ struct x86_init_ops x86_init __initdata = {
.reserve_resources = reserve_standard_io_resources,
.memory_setup = e820__memory_setup_default,
.dmi_setup = dmi_setup,
+ /* Has to be under 1M so we can execute real-mode AP code. */
+ .realmode_limit = SZ_1M,
},
.mpparse = {
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 88be32026768..694d80a5c68e 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -46,7 +46,7 @@ void load_trampoline_pgtable(void)
void __init reserve_real_mode(void)
{
- phys_addr_t mem;
+ phys_addr_t mem, limit = x86_init.resources.realmode_limit;
size_t size = real_mode_size_needed();
if (!size)
@@ -54,10 +54,9 @@ void __init reserve_real_mode(void)
WARN_ON(slab_is_available());
- /* Has to be under 1M so we can execute real-mode AP code. */
- mem = memblock_phys_alloc_range(size, PAGE_SIZE, 0, 1<<20);
+ mem = memblock_phys_alloc_range(size, PAGE_SIZE, 0, limit);
if (!mem)
- pr_info("No sub-1M memory is available for the trampoline\n");
+ pr_info("No memory below %pa for the real-mode trampoline\n", &limit);
else
set_real_mode_mem(mem);
--
2.43.0
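A minimal userspace sketch of the idea in the patch above, not the kernel
code: an allocation ceiling that defaults to 1MB and that a platform hook
can raise. The names resources_sketch, alloc_below() and
hypothetical_platform_init() are assumptions made for illustration only;
in the kernel the allocation is done by memblock_phys_alloc_range().

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M 0x100000ULL
#define SZ_4G 0x100000000ULL

/* Illustrative stand-in for the new member of struct x86_init_resources. */
struct resources_sketch {
	uint64_t realmode_limit;
};

/* Default: the trampoline must sit below 1MB for real-mode AP startup. */
static struct resources_sketch resources = { .realmode_limit = SZ_1M };

/*
 * Hypothetical allocator: return the highest 4KB-aligned address such that
 * [addr, addr + size) fits inside [free_start, free_end) and ends at or
 * below limit. Returns 0 when nothing fits.
 */
static uint64_t alloc_below(uint64_t free_start, uint64_t free_end,
			    uint64_t size, uint64_t limit)
{
	uint64_t top = free_end < limit ? free_end : limit;
	uint64_t addr;

	assert(free_start <= free_end);

	if (size == 0 || top < size)
		return 0;

	addr = (top - size) & ~0xfffULL;	/* align down to 4KB */
	if (addr < free_start)
		return 0;

	return addr;
}

/* Hypothetical platform hook: firmware hands over the APs in long mode. */
static void hypothetical_platform_init(void)
{
	resources.realmode_limit = SZ_4G;
}
```

With free memory only above 16MB, the default limit makes the allocation
fail, while the raised limit lets it succeed. That is the situation the
commit message describes for platforms without memory below 1MB.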
* [PATCH v8 05/10] x86/hyperv/vtl: Set real_mode_header in hv_vtl_init_platform()
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Yunhong Jiang,
Thomas Gleixner, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
From: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Hyper-V VTL clears x86_platform.realmode_{init(), reserve()} in
hv_vtl_init_platform() whereas it sets real_mode_header later in
hv_vtl_early_init(). There is no need to deal with the settings of real
mode memory in two places. Also, both functions are called much earlier
than x86_platform.realmode_init() (via an early_initcall), where the
real_mode_header is needed.
Set real_mode_header in hv_vtl_init_platform() to keep all code dealing
with memory for the real mode trampoline in one place. Besides making the
code more readable, it prepares it for a subsequent changeset in which the
behavior needs to change to support Hyper-V VTL guests in a TDX
environment.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- None
Changes in v6:
- Corrected reference to hv_vtl_init_platform() in the changelog.
(Dexuan)
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Added Reviewed-by tag from Michael. Thanks!
Changes in v3:
- Edited the commit message for clarity.
Changes in v2:
- Introduced this patch.
---
arch/x86/hyperv/hv_vtl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index c0edaed0efb3..f74199e77133 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -70,6 +70,7 @@ void __init hv_vtl_init_platform(void)
x86_platform.realmode_reserve = x86_init_noop;
x86_platform.realmode_init = x86_init_noop;
+ real_mode_header = &hv_vtl_real_mode_header;
x86_init.irqs.pre_vector_init = x86_init_noop;
x86_init.timers.timer_init = x86_init_noop;
x86_init.resources.probe_roms = x86_init_noop;
@@ -249,7 +250,6 @@ int __init hv_vtl_early_init(void)
panic("XSAVE has to be disabled as it is not supported by this module.\n"
"Please add 'noxsave' to the kernel command line.\n");
- real_mode_header = &hv_vtl_real_mode_header;
apic_update_callback(wakeup_secondary_cpu_64, hv_vtl_wakeup_secondary_cpu);
return 0;
--
2.43.0
* [PATCH v8 04/10] x86/dt: Parse the Wakeup Mailbox for Intel processors
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Yunhong Jiang,
Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
The Wakeup Mailbox is a mechanism to boot secondary CPUs on systems that do
not want or cannot use the INIT + StartUp IPI messages.
The platform firmware is expected to implement the mailbox as described in
the Multiprocessor Wakeup Structure of the ACPI specification. It is also
expected to publish the mailbox to the operating system as described in the
corresponding DeviceTree schema that accompanies the documentation of the
Linux kernel.
Reuse the existing functionality to set the memory location of the mailbox
and update the wakeup_secondary_cpu_64() APIC callback. Make this
functionality available to DeviceTree-based systems by making
CONFIG_X86_MAILBOX_WAKEUP depend on either CONFIG_OF or
CONFIG_ACPI_MADT_WAKEUP.
do_boot_cpu() uses wakeup_secondary_cpu_64() when set. It will be set if a
wakeup mailbox is enumerated via an ACPI table or a DeviceTree node. For
cases in which this behavior is not desired, this APIC callback can be
updated later during boot using platform-specific hooks.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Co-developed-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- #included asm/acpi.h to reflect the updated declaration of the
needed functions.
- (Kept Reviewed-by tag from Dexuan, as this single change is trivial.)
Changes in v6:
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- Made CONFIG_X86_MAILBOX_WAKEUP depend on CONFIG_OF or
CONFIG_ACPI_MADT_WAKEUP.
Changes in v4:
- Look for the wakeup mailbox unconditionally, regardless of whether
cpu@N nodes have an `enable-method` property.
- Add a reference to the ACPI specification. (Rafael)
Changes in v3:
- Added extra sanity checks when parsing the mailbox node.
- Probe the mailbox using its `compatible` property
- Setup the Wakeup Mailbox if the `enable-method` is found in the CPU
nodes.
- Cleaned up unneeded ifdeffery.
- Clarified the mechanisms used to override the wakeup_secondary_cpu_64()
callback to not use the mailbox when not desired. (Michael)
- Edited the commit message for clarity.
Changes in v2:
- Disabled CPU offlining.
- Modified dtb_parse_mp_wake() to return the address of the mailbox.
---
arch/x86/kernel/devicetree.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index dd8748c45529..318acaecb5ca 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -18,6 +18,7 @@
#include <linux/of_pci.h>
#include <linux/initrd.h>
+#include <asm/acpi.h>
#include <asm/irqdomain.h>
#include <asm/hpet.h>
#include <asm/apic.h>
@@ -125,6 +126,51 @@ static void __init dtb_setup_hpet(void)
#endif
}
+#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
+
+#define WAKEUP_MAILBOX_SIZE 0x1000
+#define WAKEUP_MAILBOX_ALIGN 0x1000
+
+/** dtb_wakeup_mailbox_setup() - Parse the wakeup mailbox from the device tree
+ *
+ * Look for the presence of a wakeup mailbox in the DeviceTree. The mailbox is
+ * expected to follow the structure and operation described in the Multiprocessor
+ * Wakeup Structure of the ACPI specification.
+ */
+static void __init dtb_wakeup_mailbox_setup(void)
+{
+ struct device_node *node;
+ struct resource res;
+
+ node = of_find_compatible_node(NULL, NULL, "intel,wakeup-mailbox");
+ if (!node)
+ return;
+
+ if (of_address_to_resource(node, 0, &res))
+ goto done;
+
+ /* The mailbox is a 4KB-aligned region.*/
+ if (res.start & (WAKEUP_MAILBOX_ALIGN - 1))
+ goto done;
+
+ /* The mailbox has a size of 4KB. */
+ if (res.end - res.start + 1 != WAKEUP_MAILBOX_SIZE)
+ goto done;
+
+ /* Not supported when the mailbox is used. */
+ cpu_hotplug_disable_offlining();
+
+ acpi_setup_mp_wakeup_mailbox(res.start);
+done:
+ of_node_put(node);
+}
+#else /* !CONFIG_X86_64 || !CONFIG_SMP */
+static inline int dtb_wakeup_mailbox_setup(void)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
+
#ifdef CONFIG_X86_LOCAL_APIC
static void __init dtb_cpu_setup(void)
@@ -287,6 +333,7 @@ static void __init x86_dtb_parse_smp_config(void)
dtb_setup_hpet();
dtb_apic_setup();
+ dtb_wakeup_mailbox_setup();
}
void __init x86_flattree_get_config(void)
--
2.43.0
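The two sanity checks in dtb_wakeup_mailbox_setup() can be exercised in
isolation. The sketch below is a userspace approximation: struct
range_sketch and mailbox_range_ok() are hypothetical names, and only the
alignment and size tests mirror the patch. The inverted-range check is an
extra guard added for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAKEUP_MAILBOX_SIZE	0x1000
#define WAKEUP_MAILBOX_ALIGN	0x1000

/* Hypothetical stand-in for struct resource: an inclusive [start, end]. */
struct range_sketch {
	uint64_t start;
	uint64_t end;
};

static bool mailbox_range_ok(const struct range_sketch *res)
{
	assert(res);

	/* Extra guard for this sketch: reject inverted ranges. */
	if (res->end < res->start)
		return false;

	/* The mailbox is a 4KB-aligned region. */
	if (res->start & (WAKEUP_MAILBOX_ALIGN - 1))
		return false;

	/* The mailbox has a size of 4KB (end is inclusive). */
	return res->end - res->start + 1 == WAKEUP_MAILBOX_SIZE;
}
```

A node with reg = <0x0 0xffff0000 0x1000>, as in the binding example,
corresponds to the range [0xffff0000, 0xffff0fff] and passes both checks.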
* [PATCH v8 03/10] dt-bindings: reserved-memory: Wakeup Mailbox for Intel processors
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri,
Rafael J. Wysocki (Intel), Yunhong Jiang, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
Add DeviceTree bindings to enumerate the wakeup mailbox used in platform
firmware for Intel processors.
x86 platforms commonly boot secondary CPUs using an INIT assert, de-assert
followed by Start-Up IPI messages. The wakeup mailbox can be used when this
mechanism is unavailable.
The wakeup mailbox offers more control to the operating system to boot
secondary CPUs than a spin-table. It allows the reuse of the same wakeup
vector for all CPUs while maintaining control over which CPUs to boot and
when. While it is possible to achieve the same level of control using a
spin-table, it would require specifying a separate `cpu-release-addr` for
each secondary CPU.
The operation and structure of the mailbox are described in the
Multiprocessor Wakeup Structure defined in the ACPI specification. Note
that this structure does not specify how to publish the mailbox to the
operating system (ACPI-based platform firmware uses a separate table). No
ACPI table is needed in DeviceTree-based firmware to enumerate the mailbox.
Nodes that want to refer to the reserved memory usually define
a `memory-region` property. /cpus/cpu* nodes would want to refer to the
mailbox, but they do not have such property defined in the DeviceTree
specification. Moreover, it would imply that there is a memory region per
CPU. Instead, add a `compatible` property that the operating system can use
to discover the mailbox.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Acked-by: Rafael J. Wysocki (Intel) <rafael.j.wysocki@intel.com>
Co-developed-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- None
Changes in v7:
- Fixed Acked-by tag from Rafael to include the "(Intel)" suffix.
Changes in v6:
- Reworded the changelog for clarity.
- Added Acked-by tag from Rafael. Thanks!
- Added Reviewed-by tag from Rob. Thanks!
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- Specified the version and section of the ACPI spec in which the
wakeup mailbox is defined. (Rafael)
- Fixed a warning from yamllint about line lengths of URLs.
Changes in v4:
- Removed redefinitions of the mailbox and instead referred to ACPI
specification as per discussion on LKML.
- Clarified that DeviceTree-based firmware does not require the use of
ACPI tables to enumerate the mailbox. (Rob)
- Described the need of using a `compatible` property.
- Dropped the `alignment` property. (Krzysztof, Rafael)
- Used a real address for the mailbox node. (Krzysztof)
Changes in v3:
- Implemented the mailbox as a reserved-memory node. Add to it a
`compatible` property. (Krzysztof)
- Explained the relationship between the mailbox and the `enable-method`
property of the CPU nodes.
- Expanded the documentation of the binding.
Changes in v2:
- Added more details to the description of the binding.
- Added a new requirement for cpu@N nodes to add an
`enable-method`.
---
.../reserved-memory/intel,wakeup-mailbox.yaml | 50 ++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml b/Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml
new file mode 100644
index 000000000000..a80d3bac44c2
--- /dev/null
+++ b/Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/reserved-memory/intel,wakeup-mailbox.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Wakeup Mailbox for Intel processors
+
+description: |
+ The Wakeup Mailbox provides a mechanism for the operating system to wake up
+ secondary CPUs on Intel processors. It is an alternative to the INIT-!INIT-
+ SIPI sequence used on most x86 systems.
+
+ The structure and operation of the mailbox is described in the Multiprocessor
+ Wakeup Structure of the ACPI specification version 6.6 section 5.2.12.19 [1].
+
+ The implementation of the mailbox in platform firmware is described in the
+ Intel TDX Virtual Firmware Design Guide section 4.3.5 [2].
+
+ 1: https://uefi.org/specs/ACPI/6.6/05_ACPI_Software_Programming_Model.html#multiprocessor-wakeup-structure
+ 2: https://www.intel.com/content/www/us/en/content-details/733585/intel-tdx-virtual-firmware-design-guide.html
+
+
+maintainers:
+ - Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
+
+allOf:
+ - $ref: reserved-memory.yaml
+
+properties:
+ compatible:
+ const: intel,wakeup-mailbox
+
+required:
+ - compatible
+ - reg
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ reserved-memory {
+ #address-cells = <2>;
+ #size-cells = <1>;
+
+ wakeup-mailbox@ffff0000 {
+ compatible = "intel,wakeup-mailbox";
+ reg = <0x0 0xffff0000 0x1000>;
+ };
+ };
--
2.43.0
* [PATCH v8 02/10] x86/acpi: Add functions to setup and access the wakeup mailbox
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
Systems that describe hardware using DeviceTree graphs may enumerate and
implement the wakeup mailbox as defined in the ACPI specification but do
not otherwise depend on ACPI. Expose functions to set up and access the
location of the wakeup mailbox from outside ACPI code.
The function acpi_setup_mp_wakeup_mailbox() stores the physical address of
the mailbox and updates the wakeup_secondary_cpu_64() APIC callback.
The function acpi_get_mp_wakeup_mailbox() returns a pointer to the
mailbox.
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v8:
- Added Acked-by tag from Rafael. Thanks!
Changes in v7:
- Moved function declarations to arch/x86/include/asm/acpi.h
- Added stubs for !CONFIG_ACPI.
- Do not use these new functions in madt_wakeup.c.
- Dropped Acked-by and Reviewed-by tags from Rafael and Dexuan as this
patch changed.
Changes in v6:
- Fixed grammar error in the subject of the patch. (Rafael)
- Added Acked-by tag from Rafael. Thanks!
- Added Reviewed-by tag from Dexuan. Thanks!
Changes in v5:
- None
Changes in v4:
- Squashed the two first patches of the series into one, both introduce
helper functions. (Rafael)
- Renamed setup_mp_wakeup_mailbox() as acpi_setup_mp_wakeup_mailbox().
(Rafael)
- Dropped the function prototype for !CONFIG_X86_64. (Rafael)
Changes in v3:
- Introduced this patch.
Changes in v2:
- N/A
---
arch/x86/include/asm/acpi.h | 10 ++++++++++
arch/x86/kernel/acpi/madt_wakeup.c | 11 +++++++++++
2 files changed, 21 insertions(+)
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index a03aa6f999d1..820df375df79 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -182,6 +182,9 @@ void __iomem *x86_acpi_os_ioremap(acpi_physical_address phys, acpi_size size);
#define acpi_os_ioremap acpi_os_ioremap
#endif
+void acpi_setup_mp_wakeup_mailbox(u64 addr);
+struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mailbox(void);
+
#else /* !CONFIG_ACPI */
#define acpi_lapic 0
@@ -200,6 +203,13 @@ static inline u64 x86_default_get_root_pointer(void)
return 0;
}
+static inline void acpi_setup_mp_wakeup_mailbox(u64 addr) { }
+
+static inline struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mailbox(void)
+{
+ return NULL;
+}
+
#endif /* !CONFIG_ACPI */
#define ARCH_HAS_POWER_INIT 1
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 6d7603511f52..82caf44b45e3 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -247,3 +247,14 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
return 0;
}
+
+void __init acpi_setup_mp_wakeup_mailbox(u64 mailbox_paddr)
+{
+ acpi_mp_wake_mailbox_paddr = mailbox_paddr;
+ apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+}
+
+struct acpi_madt_multiproc_wakeup_mailbox *acpi_get_mp_wakeup_mailbox(void)
+{
+ return acpi_mp_wake_mailbox;
+}
--
2.43.0
* [PATCH v8 01/10] x86/topology: Add missing struct declaration and attribute dependency
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri,
kernel test robot, Ricardo Neri
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-0-2f5b6785f2f5@linux.intel.com>
The prototypes for get_topology_cpu_type_name() and
get_topology_cpu_type() take a pointer to struct cpuinfo_x86, but
asm/topology.h neither includes nor forward-declares the structure.
Including asm/topology.h, directly or indirectly, without including
asm/processor.h triggers a warning:
./arch/x86/include/asm/topology.h:159:47: error: ‘struct cpuinfo_x86’
declared inside parameter list will not be visible outside of this
definition or declaration [-Werror]
159 | const char *get_topology_cpu_type_name(struct cpuinfo_x86 *c);
| ^~~~~~~~~~~
Since only a pointer is needed, add a forward declaration of struct
cpuinfo_x86.
Additionally, sysctl_sched_itmt_enabled is declared in asm/topology.h with
the __read_mostly attribute, but the header does not include linux/cache.h.
This causes a build failure when including asm/topology.h but not
linux/cache.h:
./arch/x86/include/asm/topology.h:264:27: error: expected ‘=’, ‘,’,
‘;’, ‘asm’ or ‘__attribute__’ before ‘sysctl_sched_itmt_enabled’
264 | extern bool __read_mostly sysctl_sched_itmt_enabled;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Include the required header.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511181954.UMxCeTV1-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511190008.AA0NTn3G-lkp@intel.com/
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
I independently found this issue when including asm/acpi.h in
arch/x86/hyperv/hv_vtl.c, which implicitly includes asm/topology.h but
neither asm/processor.h nor linux/cache.h.
---
Changes in v8:
- Added this patch.
Changes in v7:
- N/A
Changes in v6:
- N/A
Changes in v5:
- N/A
Changes in v4:
- N/A
Changes in v3:
- N/A
Changes in v2:
- N/A
---
arch/x86/include/asm/topology.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 1fadf0cf520c..630521a03982 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -156,6 +156,8 @@ extern unsigned int __max_threads_per_core;
extern unsigned int __num_threads_per_package;
extern unsigned int __num_cores_per_package;
+struct cpuinfo_x86;
+
const char *get_topology_cpu_type_name(struct cpuinfo_x86 *c);
enum x86_topology_cpu_type get_topology_cpu_type(struct cpuinfo_x86 *c);
@@ -259,6 +261,7 @@ extern bool x86_topology_update;
#ifdef CONFIG_SCHED_MC_PRIO
#include <asm/percpu.h>
+#include <linux/cache.h>
DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
extern bool __read_mostly sysctl_sched_itmt_enabled;
--
2.43.0
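The forward-declaration part of the fix can be reproduced outside the
kernel. In the sketch below, struct cpu_sketch and cpu_type_name() are
made-up names; the point is only that a prototype taking a pointer needs
the struct tag declared at file scope, not the full definition.

```c
#include <assert.h>

/* Forward declaration: enough for prototypes that only take a pointer. */
struct cpu_sketch;

/*
 * Without the line above, the struct tag would be scoped to this parameter
 * list, and the compiler would warn that it is not visible outside of the
 * declaration.
 */
const char *cpu_type_name(const struct cpu_sketch *c);

/* The full definition can come later, for example from another header. */
struct cpu_sketch {
	int is_efficiency_core;
};

const char *cpu_type_name(const struct cpu_sketch *c)
{
	assert(c);
	return c->is_efficiency_core ? "efficiency" : "performance";
}
```

Only code that dereferences the pointer, such as the function body here,
needs to see the complete type.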
* [PATCH v8 00/10] x86/hyperv/hv_vtl: Use a wakeup mailbox to boot secondary CPUs
From: Ricardo Neri @ 2026-01-07 21:44 UTC (permalink / raw)
To: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki
Cc: Saurabh Sengar, Chris Oo, Kirill A. Shutemov, linux-hyperv,
devicetree, linux-acpi, linux-kernel, Ricardo Neri,
kernel test robot, Ricardo Neri, Rafael J. Wysocki (Intel),
Yunhong Jiang, Thomas Gleixner
Hi,
Happy New Year! I have a new version of this patchset to start the year.
I incorporated a patch that I had posted separately to fix a build break
that I found while working on this series [1]. I also added the Acked-by
tags from Rafael. No other changes.
I include the cover letter from the previous version for convenience.
Thanks a lot to all those who have reviewed the series!
...
This patchset adds functionality to use the ACPI wakeup mailbox to boot
secondary CPUs in Hyper-V VTL level 2 TDX guests with DeviceTree-based
virtual firmware. Although this is the target use case, the use of the
mailbox depends solely on it being enumerated in the DeviceTree graph.
On x86 platforms, secondary CPUs are typically booted using INIT assert,
de-assert followed by Start-Up IPI messages. Virtual machines can also use
hypercalls to bring up secondary CPUs to a desired execution state. These
two mechanisms require support from the hypervisor. Confidential computing
VMs in a TDX environment cannot use these mechanisms because the hypervisor
is considered an untrusted entity.
Linux already supports the ACPI Multiprocessor Wakeup Structure in which
the guest platform firmware boots the secondary CPUs and transfers control
to the kernel using a mailbox. This mechanism does not need involvement
of the VMM. It can be used in a Hyper-V VTL level 2 TDX guest.
Currently, this mechanism can only be used on x86 platforms with firmware
that supports ACPI. There are platforms that use DeviceTree (e.g., OpenHCL
[2]) instead of ACPI to describe the hardware.
Provided that the wakeup mailbox enumerated in a DeviceTree-based platform
firmware is implemented as described in the ACPI specification, the kernel
can use the existing ACPI code for both DeviceTree and ACPI systems. The
DeviceTree firmware does not need to use any ACPI table to enumerate the
mailbox.
This patchset is structured as follows:
* Add missing dependencies to arch/x86/include/asm/topology.h. (patch 1)
* Expose functions to reuse the code handling the ACPI Multiprocessor
Wakeup Structure outside of ACPI code. (patch 2)
* Define DeviceTree bindings to enumerate a mailbox as described in
the ACPI specification. (patch 3)
* Find and set up the wakeup mailbox if enumerated in the DeviceTree
graph. (patch 4)
* Prepare Hyper-V VTL2 TDX guests to use the Wakeup Mailbox to boot
secondary CPUs when available. (patches 5-10)
I have tested this patchset on a Hyper-V host with VTL2 OpenHCL, QEMU, and
physical hardware.
Changes in v8:
- Fixed a build break. Same patch as [1].
- Added two Acked-by tags from Rafael. Thanks!
- Link to v7: https://lore.kernel.org/r/20251117-rneri-wakeup-mailbox-v7-0-4a8b82ab7c2c@linux.intel.com
Changes in v7:
- Dropped the patch that relocated the ACPI wakeup mailbox to a generic
location. (Boris)
- Instead, added function declarations to use the wakeup mailbox from
outside ACPI code. Also added stubs for !CONFIG_ACPI.
- Link to v6: https://lore.kernel.org/r/20251016-rneri-wakeup-mailbox-v6-0-40435fb9305e@linux.intel.com
Changes in v6:
- Fixed a build error with !CONFIG_X86_MAILBOX_WAKEUP and
CONFIG_HYPERV_VTL_MODE.
- Added Acked-by tags from Rafael. Thanks!
- Added Reviewed-by tags from Dexuan and Rob. Thanks!
- Corrected typos and function names in the changelog.
- Link to v5: https://lore.kernel.org/r/20250627-rneri-wakeup-mailbox-v5-0-df547b1d196e@linux.intel.com
Changes in v5:
- Referred in the DeviceTree binding documentation to the version and
section of the ACPI specification that defines the wakeup mailbox.
- Moved the dependency on CONFIG_OF to patch 4, where the flattened
DeviceTree is parsed for the mailbox.
- Fixed a warning from yamllint regarding line lengths.
- Link to v4: https://lore.kernel.org/r/20250603-rneri-wakeup-mailbox-v4-0-d533272b7232@linux.intel.com
Changes in v4:
- Added Reviewed-by: tags from Michael Kelley. Thanks!
- Relocated the common wakeup code from acpi/madt_wakeup.c to a new
smpwakeup.c to be used in DeviceTree- and ACPI-based systems.
- Dropped the x86 CPU bindings as they are not a good fit to document
firmware features.
- Dropped the code that parsed and validated the `enable-method`
property for cpu@N nodes in x86. Instead, unconditionally parse and use
the wakeup mailbox when found.
- Updated the wakeup mailbox schema to avoid redefining the structure and
operation of the mailbox. Instead, refer to the ACPI specification.
Also clarified that the enumeration of the mailbox is done separately.
- Prefixed helper functions of wakeup code with acpi_.
- Link to v3: https://lore.kernel.org/r/20250503191515.24041-1-ricardo.neri-calderon@linux.intel.com
Changes in v3:
- Only move out of the acpi directory acpi_wakeup_cpu() and its
accessory variables. Use helper functions to access the mailbox as
needed. This also fixed the warnings about unused code with
CONFIG_ACPI=n that Michael reported.
- Major rework of the DeviceTree bindings and schema. Now there is a
reserved-memory binding for the mailbox as well as new x86 CPU
bindings. Both have `compatible` properties.
- Rework of the code parsing the DeviceTree bindings for the mailbox.
Now configuring the mailbox depends solely on its enumeration in the
DeviceTree and not on Hyper-V VTL2 TDX guest.
- Do not make reserving the first 1MB of memory optional. It is not
needed and may introduce bugs.
- Prepare Hyper-V VTL2 guests to unconditionally use the mailbox in TDX
environments. If the mailbox is not available, booting secondary CPUs
will fail gracefully.
- Link to v2: https://lore.kernel.org/r/20240823232327.2408869-1-yunhong.jiang@linux.intel.com
Changes in v2:
- Fix the cover letter's summary phrase.
- Fix the DT binding document to pass validation.
- Change the DT binding document to be ACPI independent.
- Move ACPI-only functions into the #ifdef CONFIG_ACPI.
- Change dtb_parse_mp_wake() to return mailbox physical address.
- Rework the hv_is_private_mmio_tdx().
- Remove unrelated real mode change from the patch that marks mailbox
page private.
- Check hv_isolation_type_tdx() instead of wakeup_mailbox_addr in
hv_vtl_init_platform() because wakeup_mailbox_addr is not parsed yet.
- Add memory range support to reserve_real_mode.
- Remove realmode_reserve callback and use the memory range.
- Move setting the real_mode_header to hv_vtl_init_platform.
- Update comments and commit messages.
- Minor style changes.
- Link to v1: https://lore.kernel.org/r/20240806221237.1634126-1-yunhong.jiang@linux.intel.com
[1]. https://lore.kernel.org/all/20251117-rneri-topology-cpuinfo-bug-v1-1-a905bb5f91e2@linux.intel.com/
[2]. https://openvmm.dev/guide/user_guide/openhcl.html
--
2.43.0
---
Ricardo Neri (6):
x86/topology: Add missing struct declaration and attribute dependency
x86/acpi: Add functions to setup and access the wakeup mailbox
dt-bindings: reserved-memory: Wakeup Mailbox for Intel processors
x86/dt: Parse the Wakeup Mailbox for Intel processors
x86/acpi: Add a helper get the address of the wakeup mailbox
x86/hyperv/vtl: Use the wakeup mailbox to boot secondary CPUs
Yunhong Jiang (4):
x86/hyperv/vtl: Set real_mode_header in hv_vtl_init_platform()
x86/realmode: Make the location of the trampoline configurable
x86/hyperv/vtl: Setup the 64-bit trampoline for TDX guests
x86/hyperv/vtl: Mark the wakeup mailbox page as private
.../reserved-memory/intel,wakeup-mailbox.yaml | 50 ++++++++++++++++++++++
arch/x86/hyperv/hv_vtl.c | 38 ++++++++++++++--
arch/x86/include/asm/acpi.h | 16 +++++++
arch/x86/include/asm/topology.h | 3 ++
arch/x86/include/asm/x86_init.h | 3 ++
arch/x86/kernel/acpi/madt_wakeup.c | 16 +++++++
arch/x86/kernel/devicetree.c | 47 ++++++++++++++++++++
arch/x86/kernel/x86_init.c | 3 ++
arch/x86/realmode/init.c | 7 ++-
9 files changed, 175 insertions(+), 8 deletions(-)
---
base-commit: 39127143100254ceb5f31469ac9a10a2a5b71285
change-id: 20250602-rneri-wakeup-mailbox-328efe72803f
Best regards,
--
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
* [PATCH] scsi: storvsc: Process unsupported MODE_SENSE_10
From: longli @ 2026-01-07 19:56 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
James E.J. Bottomley, Martin K. Petersen, James Bottomley,
linux-hyperv, linux-scsi, linux-kernel
Cc: Long Li, stable
From: Long Li <longli@microsoft.com>
The Hyper-V host does not support MODE_SENSE_10 and MODE_SENSE.
The driver handles MODE_SENSE as an unsupported command, but not
MODE_SENSE_10. Add MODE_SENSE_10 to the same handling logic and
return the correct code to the SCSI layer.
Fixes: 89ae7d709357 ("Staging: hv: storvsc: Move the storage driver out of the staging area")
Cc: stable@kernel.org
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/scsi/storvsc_drv.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6e4112143c76..9b15784e2d64 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1154,6 +1154,7 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
(stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
+ (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
(stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
hv_dev_is_fc(device))) {
vstor_packet->vm_srb.scsi_status = 0;
--
2.34.1
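The condition touched by the patch above can be read as a predicate on the
first CDB byte. The following is a standalone sketch, assuming a
hypothetical helper name host_error_is_expected(); the opcode values are
the standard SCSI ones, and the is_fc flag stands in for
hv_dev_is_fc(device).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI operation codes. */
#define INQUIRY		0x12
#define MODE_SENSE	0x1a
#define MODE_SENSE_10	0x5a
#define MAINTENANCE_IN	0xa3

/*
 * Hypothetical helper: true when a host-reported failure for this opcode
 * should have its SCSI status cleared, as in the hunk above.
 */
static bool host_error_is_expected(uint8_t opcode, bool is_fc)
{
	switch (opcode) {
	case INQUIRY:
	case MODE_SENSE:
	case MODE_SENSE_10:	/* the case added by this patch */
		return true;
	case MAINTENANCE_IN:
		/* Only on Fibre Channel devices. */
		return is_fc;
	default:
		return false;
	}
}
```

Before the patch, opcode 0x5a would have fallen through to the default
case in this model.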
* [PATCH v2] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2026-01-07 18:45 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
Ensure that a stride larger than 1 (huge page) is only used when the page
points to the head of a huge page and both the guest frame number (gfn) and
the operation size (page_count) are aligned to the huge page size
(PTRS_PER_PMD). This matches the hypervisor requirement that map/unmap
operations for huge pages must be guest-aligned and cover a full huge page.
Add mshv_chunk_stride() to encapsulate this alignment and page-order
validation, and plumb a huge_page flag into the region chunk handlers.
This prevents issuing large-page map/unmap/share operations that the
hypervisor would reject due to misaligned guest mappings.
Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 93 ++++++++++++++++++++++++++++++---------------
1 file changed, 62 insertions(+), 31 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 30bacba6aec3..adba3564d9f1 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -19,6 +19,41 @@
#define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
+/**
+ * mshv_chunk_stride - Compute stride for mapping guest memory
+ * @page : The page to check for huge page backing
+ * @gfn : Guest frame number for the mapping
+ * @page_count: Total number of pages in the mapping
+ *
+ * Determines the appropriate stride (in pages) for mapping guest memory.
+ * Uses huge page stride if the backing page is huge and the guest mapping
+ * is properly aligned; otherwise falls back to single page stride.
+ *
+ * Return: Stride in pages, or -EINVAL if page order is unsupported.
+ */
+static int mshv_chunk_stride(struct page *page,
+ u64 gfn, u64 page_count)
+{
+ unsigned int page_order;
+
+ /*
+ * Use single page stride by default. For huge page stride, the
+ * page must be compound and point to the head of the compound
+ * page, and both gfn and page_count must be huge-page aligned.
+ */
+ if (!PageCompound(page) || !PageHead(page) ||
+ !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
+ !IS_ALIGNED(page_count, PTRS_PER_PMD))
+ return 1;
+
+ page_order = folio_order(page_folio(page));
+ /* The hypervisor only supports 2M huge page */
+ if (page_order != PMD_ORDER)
+ return -EINVAL;
+
+ return 1 << page_order;
+}
+
/**
* mshv_region_process_chunk - Processes a contiguous chunk of memory pages
* in a region.
@@ -45,25 +80,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
int (*handler)(struct mshv_mem_region *region,
u32 flags,
u64 page_offset,
- u64 page_count))
+ u64 page_count,
+ bool huge_page))
{
- u64 count, stride;
- unsigned int page_order;
+ u64 gfn = region->start_gfn + page_offset;
+ u64 count;
struct page *page;
- int ret;
+ int stride, ret;
page = region->pages[page_offset];
if (!page)
return -EINVAL;
- page_order = folio_order(page_folio(page));
- /* The hypervisor only supports 4K and 2M page sizes */
- if (page_order && page_order != PMD_ORDER)
- return -EINVAL;
+ stride = mshv_chunk_stride(page, gfn, page_count);
+ if (stride < 0)
+ return stride;
- stride = 1 << page_order;
-
- /* Start at stride since the first page is validated */
+ /* Start at stride since the first stride is validated */
for (count = stride; count < page_count; count += stride) {
page = region->pages[page_offset + count];
@@ -71,12 +104,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
if (!page)
break;
- /* Break if page size changes */
- if (page_order != folio_order(page_folio(page)))
+ /* Break if stride size changes */
+ if (stride != mshv_chunk_stride(page, gfn + count,
+ page_count - count))
break;
}
- ret = handler(region, flags, page_offset, count);
+ ret = handler(region, flags, page_offset, count, stride > 1);
if (ret)
return ret;
@@ -108,7 +142,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
int (*handler)(struct mshv_mem_region *region,
u32 flags,
u64 page_offset,
- u64 page_count))
+ u64 page_count,
+ bool huge_page))
{
long ret;
@@ -162,11 +197,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
static int mshv_region_chunk_share(struct mshv_mem_region *region,
u32 flags,
- u64 page_offset, u64 page_count)
+ u64 page_offset, u64 page_count,
+ bool huge_page)
{
- struct page *page = region->pages[page_offset];
-
- if (PageHuge(page) || PageTransCompound(page))
+ if (huge_page)
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
@@ -188,11 +222,10 @@ int mshv_region_share(struct mshv_mem_region *region)
static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
u32 flags,
- u64 page_offset, u64 page_count)
+ u64 page_offset, u64 page_count,
+ bool huge_page)
{
- struct page *page = region->pages[page_offset];
-
- if (PageHuge(page) || PageTransCompound(page))
+ if (huge_page)
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
@@ -212,11 +245,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
static int mshv_region_chunk_remap(struct mshv_mem_region *region,
u32 flags,
- u64 page_offset, u64 page_count)
+ u64 page_offset, u64 page_count,
+ bool huge_page)
{
- struct page *page = region->pages[page_offset];
-
- if (PageHuge(page) || PageTransCompound(page))
+ if (huge_page)
flags |= HV_MAP_GPA_LARGE_PAGE;
return hv_call_map_gpa_pages(region->partition->pt_id,
@@ -295,11 +327,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
u32 flags,
- u64 page_offset, u64 page_count)
+ u64 page_offset, u64 page_count,
+ bool huge_page)
{
- struct page *page = region->pages[page_offset];
-
- if (PageHuge(page) || PageTransCompound(page))
+ if (huge_page)
flags |= HV_UNMAP_GPA_LARGE_PAGE;
return hv_call_unmap_gpa_pages(region->partition->pt_id,
^ permalink raw reply related
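The stride rule that mshv_chunk_stride() introduces in the patch above can be modeled outside the kernel. The sketch below is an assumption-laden illustration, not the driver code: struct model_page and model_chunk_stride() are hypothetical stand-ins for struct page and the folio helpers, and PTRS_PER_PMD = 512 with PMD_ORDER = 9 assumes 4K base pages and 2M huge pages.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PMD 512u	/* assumption: 4K base pages, 2M huge pages */
#define PMD_ORDER    9u

/* Hypothetical stand-in for struct page plus its folio information. */
struct model_page {
	bool is_head;		/* models PageHead() */
	unsigned int order;	/* folio order; 0 models a non-compound page */
};

/*
 * Model of the stride decision: a huge page stride is used only for
 * the head of a compound page when both gfn and page_count are
 * huge-page aligned; any other compound order is rejected.
 */
static int model_chunk_stride(const struct model_page *page,
			      uint64_t gfn, uint64_t page_count)
{
	assert(page != NULL);

	if (page->order == 0 || !page->is_head ||
	    gfn % PTRS_PER_PMD || page_count % PTRS_PER_PMD)
		return 1;

	if (page->order != PMD_ORDER)
		return -EINVAL;

	return 1 << page->order;
}
```

In this model a tail page of a 2M folio always yields a stride of 1, which is the case the original folio-order-only check did not distinguish.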
* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2026-01-07 18:39 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aVwVQoTWSNN9Fw3v@skinsburskii.localdomain>
On Mon, Jan 05, 2026 at 11:47:14AM -0800, Stanislav Kinsburskii wrote:
> On Mon, Jan 05, 2026 at 06:07:00PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, January 5, 2026 9:25 AM
> > >
> > > On Sat, Jan 03, 2026 at 01:16:51AM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 3:35 PM
> > > > >
> > > > > On Fri, Jan 02, 2026 at 09:13:31PM +0000, Michael Kelley wrote:
> > > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 12:03 PM
> > > > > > >
> >
> > [snip]
> >
> > > > > > >
> > > > > > > > I think I see your point, but I also think this issue doesn't exist,
> > > > > > > > because map_chunk_stride() returns a huge page stride iff:
> > > > > > > > 1. the folio order is PMD_ORDER and
> > > > > > > > 2. GFN is huge page aligned and
> > > > > > > > 3. number of 4K pages is huge page aligned.
> > > > > > > >
> > > > > > > > In other words, a host huge page won't be mapped as huge if the page
> > > > > > > can't be mapped as huge in the guest.
> > > > > >
> > > > > > OK, I'm missing how what you say is true. For pinned regions,
> > > > > > the memory is allocated and mapped into the host userspace address
> > > > > > first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
> > > > > > which calls pin_user_pages_fast(). This is all done without considering
> > > > > > the GFN or GFN alignment. So one or more 2M pages might be allocated
> > > > > > and mapped in the host before any guest mapping is looked at. Agreed?
> > > > > >
> > > > >
> > > > > Agreed.
> > > > >
> > > > > > Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
> > > > > > guest mapping. This eventually gets down to mshv_chunk_stride(). In
> > > > > > mshv_chunk_stride() when your conditions #2 and #3 are met, the
> > > > > > corresponding struct page argument to mshv_chunk_stride() may be a
> > > > > > 4K page that is in the middle of a 2M page instead of at the beginning
> > > > > > (if the region is mis-aligned). But the key point is that the 4K page in
> > > > > > the middle is part of a folio that will return a folio order of PMD_ORDER.
> > > > > > I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
> > > > > > struct page arg is at the *start* of a 2M-aligned physical memory range
> > > > > > that can be mapped into the guest as a 2M page.
> > > > > >
> > > > >
> > > > > > I'm trying to understand how this can even happen, so please bear with
> > > > > > me.
> > > > > > In other words (and AFAIU), what you are saying is the following:
> > > > > >
> > > > > > 1. VMM creates a mapping with a huge page(s) (this implies that virtual
> > > > > > address is huge page aligned, size is huge page aligned and physical
> > > > > > pages are consecutive).
> > > > > > 2. VMM tries to create a region via ioctl, but instead of passing the
> > > > > > start of the region, it passes an offset into one of the region's
> > > > > > huge pages, and at the same time with the base GFN and the size huge
> > > > > > page aligned (to meet the #2 and #3 conditions).
> > > > > 3. mshv_chunk_stride() sees a folio order of PMD_ORDER, and tries to map
> > > > > the corresponding pages as huge, which will be rejected by the
> > > > > hypervisor.
> > > > >
> > > > > Is this accurate?
> > > >
> > > > Yes, pretty much. In Step 1, the VMM may just allocate some virtual
> > > > address space, and not do anything to populate it with physical pages.
> > > > So populating with any 2M pages may not happen until Step 2 when
> > > > the ioctl calls pin_user_pages_fast(). But *when* the virtual address
> > > > space gets populated with physical pages doesn't really matter. We
> > > > just know that it happens before the ioctl tries to map the memory
> > > > into the guest -- i.e., mshv_prepare_pinned_region() calls
> > > > mshv_region_map().
> > > >
> > > > And yes, the problem is what you call out in Step 2: as input to the
> > > > ioctl, the fields "userspace_addr" and "guest_pfn" in struct
> > > > mshv_user_mem_region could have different alignments modulo 2M
> > > > boundaries. When they are different, that's what I'm calling a "mis-aligned
> > > > region", (referring to a struct mshv_mem_region that is created and
> > > > setup by the ioctl).
> > > >
> > > > > A subsequent question: if it is accurate, why does the driver need to
> > > > > support this case? It looks like a VMM bug to me.
> > > >
> > > > I don't know if the driver needs to support this case. That's a question
> > > > for the VMM people to answer. I wouldn't necessarily assume that the
> > > > VMM always allocates virtual address space with exactly the size and
> > > > alignment that matches the regions it creates with the ioctl. The
> > > > kernel ioctl doesn't care how the VMM allocates and manages its
> > > > virtual address space, so the VMM is free to do whatever it wants
> > > > in that regard, as long as it meets the requirements of the ioctl. So
> > > > the requirements of the ioctl in this case are something to be
> > > > negotiated with the VMM.
> > > >
> > > > > Also, how should it support it? By rejecting such requests in the ioctl?
> > > >
> > > > Rejecting requests to create a mis-aligned region is certainly one option
> > > > if the VMM agrees that's OK. The ioctl currently requires only that
> > > > "userspace_addr" and "size" be page aligned, so those requirements
> > > > could be tightened.
> > > >
> > > > The other approach is to fix mshv_chunk_stride() to handle the
> > > > mis-aligned case. Doing so is even easier than I first envisioned.
> > > > I think this works:
> > > >
> > > > @@ -49,7 +49,8 @@ static int mshv_chunk_stride(struct page *page,
> > > > */
> > > > if (page_order &&
> > > > IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> > > > - IS_ALIGNED(page_count, PTRS_PER_PMD))
> > > > + IS_ALIGNED(page_count, PTRS_PER_PMD) &&
> > > > + IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD))
> > > > return 1 << page_order;
> > > >
> > > > return 1;
> > > >
> > > > But as we discussed earlier, this fix means never getting 2M mappings
> > > > in the guest for a region that is mis-aligned.
> > > >
> > >
> > > Although I understand the logic behind this fix, I’m hesitant to add it
> > > because it looks like a workaround for a VMM bug that could bite back.
> > > The approach you propose will silently map a huge page as a collection
> > > of 4K pages, impacting guest performance (this will be especially
> > > visible for a region containing a single huge page).
> > >
> > > This fix silently allows such behavior instead of reporting it as an
> > > error to user space. It’s worth noting that pinned-region population and
> > > mapping happen upon ioctl invocation, so the VMM will either get an
> > > error from the hypervisor (current behavior) or get a region mapped with
> > > 4K pages (proposed behavior).
> > >
> > > The first case is an explicit error; the second — although it allows
> > > adding a region — will be less performant, significantly increase region
> > > mapping time and thus potentially guest spin-up (creation) time, and be
> > > less noticeable to customers, especially those who don’t really
> > > understand what’s happening under the hood and simply stumbled upon some
> > > VMM bug.
> > >
> > > What’s your take?
> > >
> >
> > Yes, I agree with everything you say. Silently dropping into a mode where
> > guest performance might be noticeably affected is usually not a good
> > thing. So if the VMM code is OK with the restriction, then I'm fine with
> > adding an explicit alignment check in the ioctl path code to disallow the
> > mis-aligned case.
> >
>
> But the explicit alignment check in the ioctl is already there. The only
> difference is that it's done in the hypervisor and not in the kernel.
>
> > An explicit check is needed because the code "as is" is somewhat flakey
> > as I pointed out earlier. Mis-aligned pinned regions will succeed if the
> > host doesn't allocate any 2M pages, but will fail if it does. And mis-aligned
> > movable regions silently go into the mode of doing all 4K mappings. An
> > explicit check in the ioctl path avoids the flakiness and makes pinned
> > and movable regions have consistent requirements.
> >
> > On the flip side: The ioctl that creates a region is only used by the VMM,
> > not by random end-user provided code like the system call API or general
> > ioctls. As such, I could see the VMM wanting mis-aligned regions to work,
> > with the understanding that there is potential perf impact. The VMM is
> > sophisticated system software, and it may want to take the responsibility
> > for making that tradeoff rather than have the kernel enforce a requirement.
> > There may be cases where it makes sense to create small regions that are
> > mis-aligned. I just don't know what the VMM needs or wants to do in
> > creating regions.
> >
>
> That's a fair point. Let me loop back with the VMM folks and see what
> they think.
>
After discussion, we decided to proceed with the implicit approach.
I'll send an update soon.
Thanks,
Stanislav
> Thanks,
> Stanislav
>
> > So it's hard for me to lean either way. I think the question must go
> > to the VMM folks.
> >
> > Michael
> >
> >
> >
> >
> >
> >
> >
> >
^ permalink raw reply
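The "mis-aligned region" at the center of the thread above is one whose userspace_addr and guest_pfn sit at different offsets within a 2M boundary. A minimal sketch of that test follows, assuming 4K base pages and 2M huge pages; region_is_misaligned() is a hypothetical helper, not something taken from the driver or from either proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12			/* assumption: 4K base pages */
#define MODEL_HUGE_SIZE  (1ULL << 21)		/* assumption: 2M huge pages */

/*
 * Hypothetical check: a region is "mis-aligned" in the sense used in
 * the discussion when the host virtual address and the guest physical
 * address have different offsets modulo 2M.
 */
static bool region_is_misaligned(uint64_t userspace_addr, uint64_t guest_pfn)
{
	uint64_t host_off = userspace_addr & (MODEL_HUGE_SIZE - 1);
	uint64_t guest_off = (guest_pfn << MODEL_PAGE_SHIFT) &
			     (MODEL_HUGE_SIZE - 1);

	return host_off != guest_off;
}
```

Note that a region whose two addresses share the same non-zero offset is not mis-aligned by this definition: host 2M pages and guest 2M frames still line up with each other.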
* Re: [PATCH net-next v12 04/12] vsock: add netns support to virtio transports
From: Paolo Abeni @ 2026-01-07 9:47 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, linux-kernel,
virtualization, netdev, kvm, linux-hyperv, linux-kselftest,
berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <aS9hoOKb7yA5Qgod@devvm11784.nha0.facebook.com>
Hi,
On 12/2/25 11:01 PM, Bobby Eshleman wrote:
> On Tue, Dec 02, 2025 at 09:47:19PM +0100, Paolo Abeni wrote:
>> I still have some concern WRT the dynamic mode change after netns
>> creation. I fear some 'unsolvable' (or very hard to solve) race I can't
>> see now. A tcp_child_ehash_entries-like model will avoid completely the
>> issue, but I understand it would be a significant change over the
>> current status.
>>
>> "Luckily" the merge window is on us and we have some time to discuss. Do
>> you have a specific use-case for the ability to change the netns mode
>> after creation?
>>
>> /P
>
> I don't think there is a hard requirement that the mode be change-able
> after creation. Though I'd love to avoid such a big change... or at
> least leave unchanged as much of what we've already reviewed as
> possible.
>
> In the scheme of defining the mode at creation and following the
> tcp_child_ehash_entries-ish model, what I'm imagining is:
> - /proc/sys/net/vsock/child_ns_mode can be set to "local" or "global"
> - /proc/sys/net/vsock/child_ns_mode is not immutable, can change any
> number of times
>
> - when a netns is created, the new netns mode is inherited from
> child_ns_mode, being assigned using something like:
>
> net->vsock.ns_mode =
> get_net_ns_by_pid(current->pid)->child_ns_mode
>
> - /proc/sys/net/vsock/ns_mode queries the current mode, returning
> "local" or "global", returning value of net->vsock.ns_mode
> - /proc/sys/net/vsock/ns_mode and net->vsock.ns_mode are immutable and
> reject writes
>
> Does that align with what you have in mind?
Sorry for the latency. This fell off my radar while I still processed PW
before EoY and afterwards I had some break.
Yes, the above aligns with what I suggested, and I think it should solve
possible race-related concerns (but I haven't looked at the RFC).
/P
^ permalink raw reply
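The inheritance scheme agreed on above (a mutable child_ns_mode, and an ns_mode fixed at netns creation) can be sketched as a small userspace model. Everything here is hypothetical: struct model_net, the function names and the -EPERM return value are illustrative assumptions, and the choice to have a new namespace copy its parent's child_ns_mode is also an assumption the thread does not settle.

```c
#include <assert.h>
#include <errno.h>

enum vsock_ns_mode { VSOCK_NS_GLOBAL, VSOCK_NS_LOCAL };

/* Hypothetical model of the per-netns vsock state. */
struct model_net {
	enum vsock_ns_mode ns_mode;		/* fixed at creation */
	enum vsock_ns_mode child_ns_mode;	/* may change any time */
};

/* A new namespace takes its mode from the creator's child_ns_mode. */
static void model_net_create(struct model_net *child,
			     const struct model_net *parent)
{
	assert(child && parent);
	child->ns_mode = parent->child_ns_mode;
	/* Assumption: the child's own default follows the parent's. */
	child->child_ns_mode = parent->child_ns_mode;
}

/* Writes to child_ns_mode always succeed. */
static int model_set_child_ns_mode(struct model_net *net,
				   enum vsock_ns_mode mode)
{
	net->child_ns_mode = mode;
	return 0;
}

/* Writes to ns_mode are rejected; the error code is an assumption. */
static int model_set_ns_mode(struct model_net *net, enum vsock_ns_mode mode)
{
	(void)net;
	(void)mode;
	return -EPERM;
}
```

Because ns_mode is never written after creation, later changes to the parent's child_ns_mode cannot affect existing namespaces, which is the property that removes the race concern.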
* Re: [PATCH v2 2/2] mshv: handle gpa intercepts for arm64
From: Anirudh Rayabharam @ 2026-01-07 7:35 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aV05_2Lw6x8Qr_Je@skinsburskii.localdomain>
On Tue, Jan 06, 2026 at 08:36:15AM -0800, Stanislav Kinsburskii wrote:
> On Tue, Jan 06, 2026 at 07:21:41AM +0000, Anirudh Rayabharam wrote:
> > On Mon, Jan 05, 2026 at 09:04:02AM -0800, Stanislav Kinsburskii wrote:
> > > On Mon, Jan 05, 2026 at 12:28:37PM +0000, Anirudh Rayabharam wrote:
> > > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > > >
> > > > The mshv driver now uses movable pages for guests. For arm64 guests
> > > > to be functional, handle gpa intercepts for arm64 too (the current
> > > > code implements handling only for x86).
> > > >
> > > > Move some arch-agnostic functions out of #ifdefs so that they can be
> > > > re-used.
> > > >
> > > > Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions")
> > >
> > > I'm not sure that this patch needs "Fixes" tag as it introduced new
> > > functionality rather than fixing a bug.
> >
> > This does fix a bug. The commit mentioned here regressed arm64 guests because
> > it didn't have GPA intercept handling for arm64.
> >
>
> Were ARM guests functional before this commit? If yes, then I agree that
Yes.
> this patch fixes a bug. If no, then this is just adding new
> functionality.
> I had an impression ARM is not yet supported in MSHV, so please clarify.
No, ARM is very much supported in MSHV. Going forward all new
code/features should be written for arm64 too. Missing arm64
implementation in a way that leaves guests broken is a bug.
Thanks,
Anirudh.
>
> Thanks,
> Stanislav
>
> > Thanks,
> > Anirudh.
> >
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > > > ---
> > > > drivers/hv/mshv_root_main.c | 15 ++++++++-------
> > > > 1 file changed, 8 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 9cf28a3f12fe..f8c4c2ae2cc9 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -608,7 +608,6 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > return NULL;
> > > > }
> > > >
> > > > -#ifdef CONFIG_X86_64
> > > > static struct mshv_mem_region *
> > > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > {
> > > > @@ -640,12 +639,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > {
> > > > struct mshv_partition *p = vp->vp_partition;
> > > > struct mshv_mem_region *region;
> > > > - struct hv_x64_memory_intercept_message *msg;
> > > > bool ret;
> > > > u64 gfn;
> > > > -
> > > > - msg = (struct hv_x64_memory_intercept_message *)
> > > > +#if defined(CONFIG_X86_64)
> > > > + struct hv_x64_memory_intercept_message *msg =
> > > > + (struct hv_x64_memory_intercept_message *)
> > > > + vp->vp_intercept_msg_page->u.payload;
> > > > +#elif defined(CONFIG_ARM64)
> > > > + struct hv_arm64_memory_intercept_message *msg =
> > > > + (struct hv_arm64_memory_intercept_message *)
> > > > vp->vp_intercept_msg_page->u.payload;
> > > > +#endif
> > > >
> > > > gfn = HVPFN_DOWN(msg->guest_physical_address);
> > > >
> > > > @@ -663,9 +667,6 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > >
> > > > return ret;
> > > > }
> > > > -#else /* CONFIG_X86_64 */
> > > > -static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > -#endif /* CONFIG_X86_64 */
> > > >
> > > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > {
> > > > --
> > > > 2.34.1
> > > >
^ permalink raw reply
* [PATCH net-next, v7] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Dipayaan Roy @ 2026-01-06 23:04 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, dipayanroy
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled to an
ordered workqueue. The reset for all queues on stall detection is
recommended by the hardware team.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
Changes in v7:
- Add enable_work in resume path.
Changes in v6:
- Rebased.
Changes in v5:
-Fixed commit message, used 'create_singlethread_workqueue' and fixed
cleanup part.
Changes in v4:
-Fixed commit message, work initialization before registering netdev,
fixed potential null pointer de-reference bug.
Changes in v3:
-Fixed commit message, removed rtnl_trylock and added
disable_work_sync, fixed mana_queue_reset_work, and few
cosmetics.
Changes in v2:
-Fixed cosmetic changes.
---
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 80 ++++++++++++++++++-
include/net/mana/gdma.h | 7 +-
include/net/mana/mana.h | 8 +-
3 files changed, 92 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1ad154f9db1a..d3e73a0bb442 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -299,6 +299,42 @@ static int mana_get_gso_hs(struct sk_buff *skb)
return gso_hs;
}
+static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
+{
+ struct mana_queue_reset_work *reset_queue_work =
+ container_of(work, struct mana_queue_reset_work, work);
+
+ struct mana_port_context *apc = container_of(reset_queue_work,
+ struct mana_port_context,
+ queue_reset_work);
+ struct net_device *ndev = apc->ndev;
+ int err;
+
+ rtnl_lock();
+
+ /* Pre-allocate buffers to prevent failure in mana_attach later */
+ err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+ if (err) {
+ netdev_err(ndev, "Insufficient memory for reset post tx stall detection\n");
+ goto out;
+ }
+
+ err = mana_detach(ndev, false);
+ if (err) {
+ netdev_err(ndev, "mana_detach failed: %d\n", err);
+ goto dealloc_pre_rxbufs;
+ }
+
+ err = mana_attach(ndev);
+ if (err)
+ netdev_err(ndev, "mana_attach failed: %d\n", err);
+
+dealloc_pre_rxbufs:
+ mana_pre_dealloc_rxbufs(apc);
+out:
+ rtnl_unlock();
+}
+
netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
enum mana_tx_pkt_format pkt_fmt = MANA_SHORT_PKT_FMT;
@@ -839,6 +875,23 @@ static int mana_change_mtu(struct net_device *ndev, int new_mtu)
return err;
}
+static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
+{
+ struct mana_port_context *apc = netdev_priv(netdev);
+ struct mana_context *ac = apc->ac;
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ /* Already in service, hence tx queue reset is not required.*/
+ if (gc->in_service)
+ return;
+
+ /* Note: If there are pending queue reset work for this port(apc),
+ * subsequent request queued up from here are ignored. This is because
+ * we are using the same work instance per port(apc).
+ */
+ queue_work(ac->per_port_queue_reset_wq, &apc->queue_reset_work.work);
+}
+
static int mana_shaper_set(struct net_shaper_binding *binding,
const struct net_shaper *shaper,
struct netlink_ext_ack *extack)
@@ -924,6 +977,7 @@ static const struct net_device_ops mana_devops = {
.ndo_bpf = mana_bpf,
.ndo_xdp_xmit = mana_xdp_xmit,
.ndo_change_mtu = mana_change_mtu,
+ .ndo_tx_timeout = mana_tx_timeout,
.net_shaper_ops = &mana_shaper_ops,
};
@@ -3287,6 +3341,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
ndev->min_mtu = ETH_MIN_MTU;
ndev->needed_headroom = MANA_HEADROOM;
ndev->dev_port = port_idx;
+ /* Recommended timeout based on HW FPGA re-config scenario. */
+ ndev->watchdog_timeo = 15 * HZ;
SET_NETDEV_DEV(ndev, gc->dev);
netif_set_tso_max_size(ndev, GSO_MAX_SIZE);
@@ -3303,6 +3359,10 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
if (err)
goto reset_apc;
+ /* Initialize the per port queue reset work.*/
+ INIT_WORK(&apc->queue_reset_work.work,
+ mana_per_port_queue_reset_work_handler);
+
netdev_lockdep_set_classes(ndev);
ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
@@ -3492,6 +3552,7 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
{
struct gdma_context *gc = gd->gdma_context;
struct mana_context *ac = gd->driver_data;
+ struct mana_port_context *apc = NULL;
struct device *dev = gc->dev;
u8 bm_hostmode = 0;
u16 num_ports = 0;
@@ -3549,6 +3610,14 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
ac->num_ports = MAX_PORTS_IN_MANA_DEV;
+ ac->per_port_queue_reset_wq =
+ create_singlethread_workqueue("mana_per_port_queue_reset_wq");
+ if (!ac->per_port_queue_reset_wq) {
+ dev_err(dev, "Failed to allocate per port queue reset workqueue\n");
+ err = -ENOMEM;
+ goto out;
+ }
+
if (!resuming) {
for (i = 0; i < ac->num_ports; i++) {
err = mana_probe_port(ac, i, &ac->ports[i]);
@@ -3565,6 +3634,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
} else {
for (i = 0; i < ac->num_ports; i++) {
rtnl_lock();
+ apc = netdev_priv(ac->ports[i]);
+ enable_work(&apc->queue_reset_work.work);
err = mana_attach(ac->ports[i]);
rtnl_unlock();
/* we log the port for which the attach failed and stop
@@ -3616,13 +3687,15 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
for (i = 0; i < ac->num_ports; i++) {
ndev = ac->ports[i];
- apc = netdev_priv(ndev);
if (!ndev) {
if (i == 0)
dev_err(dev, "No net device to remove\n");
goto out;
}
+ apc = netdev_priv(ndev);
+ disable_work_sync(&apc->queue_reset_work.work);
+
/* All cleanup actions should stay after rtnl_lock(), otherwise
* other functions may access partially cleaned up data.
*/
@@ -3649,6 +3722,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
mana_destroy_eq(ac);
out:
+ if (ac->per_port_queue_reset_wq) {
+ destroy_workqueue(ac->per_port_queue_reset_wq);
+ ac->per_port_queue_reset_wq = NULL;
+ }
+
mana_gd_deregister_device(gd);
if (suspending)
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index eaa27483f99b..a59bd4035a99 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -598,6 +598,10 @@ enum {
/* Driver can self reset on FPGA Reconfig EQE notification */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_RECONFIG_EQE BIT(17)
+
+/* Driver detects stalled send queues and recovers them */
+#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
+
#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
@@ -621,7 +625,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE | \
GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
- GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
#define GDMA_DRV_CAP_FLAGS2 0
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d7e089c6b694..cef78a871c7c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,7 +480,7 @@ struct mana_context {
struct mana_ethtool_hc_stats hc_stats;
struct mana_eq *eqs;
struct dentry *mana_eqs_debugfs;
-
+ struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
bool hwc_timeout_occurred;
@@ -492,9 +492,15 @@ struct mana_context {
u32 link_event;
};
+struct mana_queue_reset_work {
+ /* Work structure */
+ struct work_struct work;
+};
+
struct mana_port_context {
struct mana_context *ac;
struct net_device *ndev;
+ struct mana_queue_reset_work queue_reset_work;
u8 mac_addr[ETH_ALEN];
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next, v6] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Dipayaan Roy @ 2026-01-06 22:59 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, longli, kotaranov, horms, shradhagupta, ssengar, ernis,
shirazsaleem, linux-hyperv, netdev, linux-kernel, linux-rdma,
dipayanroy
In-Reply-To: <20260105173056.7c2c9d0a@kernel.org>
On Mon, Jan 05, 2026 at 05:30:56PM -0800, Jakub Kicinski wrote:
> On Fri, 2 Jan 2026 20:57:05 -0800 Dipayaan Roy wrote:
> > + apc = netdev_priv(ndev);
> > + disable_work_sync(&apc->queue_reset_work.work);
>
> AI code review points out:
>
> In mana_remove(), disable_work_sync() is called for each port's
> queue_reset_work. However, when resuming=true, mana_probe() creates a new
> workqueue but does not call mana_probe_port() (which contains INIT_WORK),
> and there is no enable_work() call for queue_reset_work in the resume path.
>
> The existing link_change_work handles this correctly: it is disabled in
> mana_remove() and re-enabled with enable_work(&ac->link_change_work) in
> mana_probe() when resuming=true.
>
> Should enable_work(&apc->queue_reset_work.work) be called for each port in
> the resuming path of mana_probe(), similar to how link_change_work is
> handled? Otherwise TX timeout recovery appears to remain disabled after a
> suspend/resume cycle.
> --
> pw-bot: cr
Thanks Jakub for pointing this out. I will send out a new version.
Regards
Dipayaan Roy
^ permalink raw reply
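The review comment above concerns the pairing of disable_work_sync() in mana_remove() with enable_work() in the resume path. The sketch below is a simplified userspace model of that pairing, not the workqueue implementation: struct model_work and the model_* functions are hypothetical, and the model assumes only that a disabled work item refuses to be queued until every disable has been matched by an enable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a work item. */
struct model_work {
	unsigned int disable_depth;	/* outstanding disable calls */
	bool pending;			/* queued but not yet run */
};

/* Queueing fails while the item is disabled or already pending. */
static bool model_queue_work(struct model_work *w)
{
	if (w->disable_depth || w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Models disable_work_sync(): cancel anything pending, block queueing. */
static void model_disable_work(struct model_work *w)
{
	w->disable_depth++;
	w->pending = false;
}

/* Models enable_work(): true once queueing is possible again. */
static bool model_enable_work(struct model_work *w)
{
	assert(w->disable_depth > 0);
	return --w->disable_depth == 0;
}
```

In this model, skipping the enable call after a suspend leaves every later queue attempt failing, which corresponds to the "TX timeout recovery appears to remain disabled" concern. The pending check also mirrors the comment in mana_tx_timeout() that repeated requests for the same port are ignored while one is queued.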
* RE: [PATCH V2,net-next, 2/2] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Long Li @ 2026-01-06 22:10 UTC (permalink / raw)
To: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
Aditya Garg, Dipayaan Roy, Shiraz Saleem,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: Paul Rosswurm
In-Reply-To: <1767732407-12389-3-git-send-email-haiyangz@linux.microsoft.com>
> Subject: [PATCH V2,net-next, 2/2] net: mana: Add ethtool counters for RX
> CQEs in coalesced type
>
> From: Haiyang Zhang <haiyangz@microsoft.com>
>
> For RX CQEs with type CQE_RX_COALESCED_4, to measure the coalescing
> efficiency, add counters to count how many contain 2, 3, 4 packets
> respectively.
> Also, add a counter for the error case of first packet with length == 0.
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 25
> +++++++++++++++++--
> .../ethernet/microsoft/mana/mana_ethtool.c | 17 ++++++++++---
> include/net/mana/mana.h | 10 +++++---
> 3 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a46a1adf83bc..78824567d80b 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2083,8 +2083,22 @@ static void mana_process_rx_cqe(struct
> mana_rxq *rxq, struct mana_cq *cq,
>
> nextpkt:
> pktlen = oob->ppi[i].pkt_len;
> - if (pktlen == 0)
> + if (pktlen == 0) {
> + /* Collect coalesced CQE count based on packets processed.
> + * Coalesced CQEs have at least 2 packets, so index is i - 2.
> + */
> + if (i > 1) {
> + u64_stats_update_begin(&rxq->stats.syncp);
> + rxq->stats.coalesced_cqe[i - 2]++;
> + u64_stats_update_end(&rxq->stats.syncp);
> + } else if (i == 0) {
> + /* Error case stat */
> + u64_stats_update_begin(&rxq->stats.syncp);
> + rxq->stats.pkt_len0_err++;
> + u64_stats_update_end(&rxq->stats.syncp);
> + }
> return;
> + }
>
> curr = rxq->buf_index;
> rxbuf_oob = &rxq->rx_oobs[curr];
> @@ -2102,8 +2116,15 @@ static void mana_process_rx_cqe(struct
> mana_rxq *rxq, struct mana_cq *cq,
>
> mana_post_pkt_rxq(rxq);
>
> - if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
> + if (!coalesced)
> + return;
> +
> + if (++i < MANA_RXCOMP_OOB_NUM_PPI)
> goto nextpkt;
> +
> + u64_stats_update_begin(&rxq->stats.syncp);
> + rxq->stats.coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 2]++;
> + u64_stats_update_end(&rxq->stats.syncp);
> }
>
> static void mana_poll_rx_cq(struct mana_cq *cq)
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index b2b9bfb50396..635796bfdaf1 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] =
> {
> tx_cqe_unknown_type)},
> {"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
> tx_linear_pkt_cnt)},
> - {"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
> - rx_coalesced_err)},
> {"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
> rx_cqe_unknown_type)},
> };
> @@ -151,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> {
> struct mana_port_context *apc = netdev_priv(ndev);
> unsigned int num_queues = apc->num_queues;
> - int i;
> + int i, j;
>
> if (stringset != ETH_SS_STATS)
> return;
> @@ -170,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev,
> u32 stringset, u8 *data)
> ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
> ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
> ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
> + ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
> + for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
> + ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
> }
>
> for (i = 0; i < num_queues; i++) {
> @@ -203,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device
> *ndev,
> u64 xdp_xmit;
> u64 xdp_drop;
> u64 xdp_tx;
> + u64 pkt_len0_err;
> + u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
> u64 tso_packets;
> u64 tso_bytes;
> u64 tso_inner_packets;
> @@ -211,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device
> *ndev,
> u64 short_pkt_fmt;
> u64 csum_partial;
> u64 mana_map_err;
> - int q, i = 0;
> + int q, i = 0, j;
>
> if (!apc->port_is_up)
> return;
> @@ -241,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device
> *ndev,
> xdp_drop = rx_stats->xdp_drop;
> xdp_tx = rx_stats->xdp_tx;
> xdp_redirect = rx_stats->xdp_redirect;
> + pkt_len0_err = rx_stats->pkt_len0_err;
> + for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
> + coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
> } while (u64_stats_fetch_retry(&rx_stats->syncp, start));
>
> data[i++] = packets;
> @@ -248,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device
> *ndev,
> data[i++] = xdp_drop;
> data[i++] = xdp_tx;
> data[i++] = xdp_redirect;
> + data[i++] = pkt_len0_err;
> + for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
> + data[i++] = coalesced_cqe[j];
> }
>
> for (q = 0; q < num_queues; q++) {
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index 51d26ebeff6c..f8dd19860103 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -61,8 +61,11 @@ enum TRI_STATE {
>
> #define MAX_PORTS_IN_MANA_DEV 256
>
> +/* Maximum number of packets per coalesced CQE */
> +#define MANA_RXCOMP_OOB_NUM_PPI 4
> +
> /* Update this count whenever the respective structures are changed */
> -#define MANA_STATS_RX_COUNT 5
> +#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
> #define MANA_STATS_TX_COUNT 11
>
> #define MANA_RX_FRAG_ALIGNMENT 64
> @@ -73,6 +76,8 @@ struct mana_stats_rx {
> u64 xdp_drop;
> u64 xdp_tx;
> u64 xdp_redirect;
> + u64 pkt_len0_err;
> + u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
> struct u64_stats_sync syncp;
> };
>
> @@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
> u32 pkt_hash;
> }; /* HW DATA */
>
> -#define MANA_RXCOMP_OOB_NUM_PPI 4
> -
> /* Receive completion OOB */
> struct mana_rxcomp_oob {
> struct mana_cqe_header cqe_hdr;
> @@ -378,7 +381,6 @@ struct mana_ethtool_stats {
> u64 tx_cqe_err;
> u64 tx_cqe_unknown_type;
> u64 tx_linear_pkt_cnt;
> - u64 rx_coalesced_err;
> u64 rx_cqe_unknown_type;
> };
>
> --
> 2.34.1
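
As a cross-check of the new MANA_STATS_RX_COUNT value: the old count was 5, and the patch adds one pkt_len0_err entry plus MANA_RXCOMP_OOB_NUM_PPI - 1 = 3 coalesced_cqe entries per RX queue, giving 5 + 4 = 9 = 6 + 4 - 1. The userspace sketch below generates the per-queue names the patch adds, using the same format strings. The macro and function names here are hypothetical stand-ins, and snprintf() is used in place of ethtool_sprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_PPI 4			/* mirrors MANA_RXCOMP_OOB_NUM_PPI */
#define NEW_RX_STATS (1 + NUM_PPI - 1)	/* pkt_len0_err + coalesced_cqe[] */
#define NAME_LEN 32			/* same size as ETH_GSTRING_LEN */

/*
 * Fill names[] with the stat names the patch adds for RX queue q and
 * return how many were written.
 */
static int new_rx_stat_names(int q, char names[][NAME_LEN])
{
	int n = 0, j;

	snprintf(names[n++], NAME_LEN, "rx_%d_pkt_len0_err", q);

	/* Index j counts CQEs that carried j + 2 packets. */
	for (j = 0; j < NUM_PPI - 1; j++)
		snprintf(names[n++], NAME_LEN, "rx_%d_coalesced_cqe_%d",
			 q, j + 2);

	assert(n == NEW_RX_STATS);
	return n;
}
```

For queue 3 this yields rx_3_pkt_len0_err followed by rx_3_coalesced_cqe_2 through rx_3_coalesced_cqe_4, in the same order in which mana_get_ethtool_stats() fills data[].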
* RE: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Long Li @ 2026-01-06 21:50 UTC (permalink / raw)
To: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
Aditya Garg, Dipayaan Roy, Shiraz Saleem,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: Paul Rosswurm
In-Reply-To: <1767732407-12389-2-git-send-email-haiyangz@linux.microsoft.com>
> Subject: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX
> packets on CQE
>
> From: Haiyang Zhang <haiyangz@microsoft.com>
>
> Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
> check for and process the CQE type CQE_RX_COALESCED_4. Coalescing is
> disabled by default to avoid a possible latency regression.
>
> Also add an ethtool handler to toggle this feature. To turn it on, run:
> ethtool -C <nic> rx-frames 4
> To turn it off:
> ethtool -C <nic> rx-frames 1
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> V2:
> Updated extack msg, as recommended by Jakub Kicinski, and Simon Horman.
>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 32 ++++++-----
> .../ethernet/microsoft/mana/mana_ethtool.c | 55 +++++++++++++++++++
> include/net/mana/mana.h | 2 +
> 3 files changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 1ad154f9db1a..a46a1adf83bc 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1330,7 +1330,7 @@ static int mana_cfg_vport_steering(struct
> mana_port_context *apc,
> req->update_hashkey = update_key;
> req->update_indir_tab = update_tab;
> req->default_rxobj = apc->default_rxobj;
> - req->cqe_coalescing_enable = 0;
> + req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
>
> if (update_key)
> memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
> @@ -1864,11 +1864,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
> }
>
> static void mana_rx_skb(void *buf_va, bool from_pool,
> - struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
> + struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
> + int i)
> {
> struct mana_stats_rx *rx_stats = &rxq->stats;
> struct net_device *ndev = rxq->ndev;
> - uint pkt_len = cqe->ppi[0].pkt_len;
> + uint pkt_len = cqe->ppi[i].pkt_len;
> u16 rxq_idx = rxq->rxq_idx;
> struct napi_struct *napi;
> struct xdp_buff xdp = {};
> @@ -1912,7 +1913,7 @@ static void mana_rx_skb(void *buf_va, bool
> from_pool,
> }
>
> if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
> - hash_value = cqe->ppi[0].pkt_hash;
> + hash_value = cqe->ppi[i].pkt_hash;
>
> if (cqe->rx_hashtype & MANA_HASH_L4)
> skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
> @@ -2047,9 +2048,11 @@ static void mana_process_rx_cqe(struct
> mana_rxq *rxq, struct mana_cq *cq,
> struct mana_recv_buf_oob *rxbuf_oob;
> struct mana_port_context *apc;
> struct device *dev = gc->dev;
> + bool coalesced = false;
> void *old_buf = NULL;
> u32 curr, pktlen;
> bool old_fp;
> + int i = 0;
>
> apc = netdev_priv(ndev);
>
> @@ -2064,9 +2067,8 @@ static void mana_process_rx_cqe(struct mana_rxq
> *rxq, struct mana_cq *cq,
> goto drop;
>
> case CQE_RX_COALESCED_4:
> - netdev_err(ndev, "RX coalescing is unsupported\n");
> - apc->eth_stats.rx_coalesced_err++;
> - return;
> + coalesced = true;
> + break;
>
> case CQE_RX_OBJECT_FENCE:
> complete(&rxq->fence_event);
> @@ -2079,14 +2081,10 @@ static void mana_process_rx_cqe(struct
> mana_rxq *rxq, struct mana_cq *cq,
> return;
> }
>
> - pktlen = oob->ppi[0].pkt_len;
> -
> - if (pktlen == 0) {
> - /* data packets should never have packetlength of zero */
> - netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
> - rxq->gdma_id, cq->gdma_id, rxq->rxobj);
> +nextpkt:
> + pktlen = oob->ppi[i].pkt_len;
> + if (pktlen == 0)
> return;
> - }
>
> curr = rxq->buf_index;
> rxbuf_oob = &rxq->rx_oobs[curr];
> @@ -2097,12 +2095,15 @@ static void mana_process_rx_cqe(struct
> mana_rxq *rxq, struct mana_cq *cq,
> /* Unsuccessful refill will have old_buf == NULL.
> * In this case, mana_rx_skb() will drop the packet.
> */
> - mana_rx_skb(old_buf, old_fp, oob, rxq);
> + mana_rx_skb(old_buf, old_fp, oob, rxq, i);
>
> drop:
> mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
>
> mana_post_pkt_rxq(rxq);
> +
> + if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
> + goto nextpkt;
> }
>
> static void mana_poll_rx_cq(struct mana_cq *cq)
> @@ -3276,6 +3277,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> apc->port_handle = INVALID_MANA_HANDLE;
> apc->pf_filter_handle = INVALID_MANA_HANDLE;
> apc->port_idx = port_idx;
> + apc->cqe_coalescing_enable = 0;
>
> mutex_init(&apc->vport_mutex);
> apc->vport_use_count = 0;
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 0e2f4343ac67..b2b9bfb50396 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -397,6 +397,58 @@ static void mana_get_channels(struct net_device *ndev,
> channel->combined_count = apc->num_queues;
> }
>
> +static int mana_get_coalesce(struct net_device *ndev,
> + struct ethtool_coalesce *ec,
> + struct kernel_ethtool_coalesce *kernel_coal,
> + struct netlink_ext_ack *extack)
> +{
> + struct mana_port_context *apc = netdev_priv(ndev);
> +
> + ec->rx_max_coalesced_frames =
> + apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
> +
> + return 0;
> +}
> +
> +static int mana_set_coalesce(struct net_device *ndev,
> + struct ethtool_coalesce *ec,
> + struct kernel_ethtool_coalesce *kernel_coal,
> + struct netlink_ext_ack *extack)
> +{
> + struct mana_port_context *apc = netdev_priv(ndev);
> + u8 saved_cqe_coalescing_enable;
> + int err;
> +
> + if (ec->rx_max_coalesced_frames != 1 &&
> + ec->rx_max_coalesced_frames != MANA_RXCOMP_OOB_NUM_PPI) {
> + NL_SET_ERR_MSG_FMT(extack,
> + "rx-frames must be 1 or %u, got %u",
> + MANA_RXCOMP_OOB_NUM_PPI,
> + ec->rx_max_coalesced_frames);
> + return -EINVAL;
> + }
> +
> + saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> + apc->cqe_coalescing_enable =
> + ec->rx_max_coalesced_frames == MANA_RXCOMP_OOB_NUM_PPI;
> +
> + if (!apc->port_is_up)
> + return 0;
> +
> + err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> +
> + if (err) {
> + netdev_err(ndev, "Set rx-frames to %u failed:%d\n",
> + ec->rx_max_coalesced_frames, err);
> + NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed",
> + ec->rx_max_coalesced_frames);
> +
> + apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> + }
> +
> + return err;
> +}
> +
> static int mana_set_channels(struct net_device *ndev,
> struct ethtool_channels *channels)
> {
> @@ -517,6 +569,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
> }
>
> const struct ethtool_ops mana_ethtool_ops = {
> + .supported_coalesce_params = ETHTOOL_COALESCE_RX_MAX_FRAMES,
> .get_ethtool_stats = mana_get_ethtool_stats,
> .get_sset_count = mana_get_sset_count,
> .get_strings = mana_get_strings,
> @@ -527,6 +580,8 @@ const struct ethtool_ops mana_ethtool_ops = {
> .set_rxfh = mana_set_rxfh,
> .get_channels = mana_get_channels,
> .set_channels = mana_set_channels,
> + .get_coalesce = mana_get_coalesce,
> + .set_coalesce = mana_set_coalesce,
> .get_ringparam = mana_get_ringparam,
> .set_ringparam = mana_set_ringparam,
> .get_link_ksettings = mana_get_link_ksettings,
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index d7e089c6b694..51d26ebeff6c 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -556,6 +556,8 @@ struct mana_port_context {
> bool port_is_up;
> bool port_st_save; /* Saved port state */
>
> + u8 cqe_coalescing_enable;
> +
> struct mana_ethtool_stats eth_stats;
>
> struct mana_ethtool_phy_stats phy_stats;
> --
> 2.34.1
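
The control flow of the quoted mana_set_coalesce() (validate, save the old value, store the new one, apply only if the port is up, roll back on failure) can be tried out in userspace with the sketch below. It is a simplified model under assumptions, not the driver code: struct toy_port, toy_apply_fn and the toy_* functions are hypothetical, and the callback stands in for mana_config_rss().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NUM_PPI 4	/* mirrors MANA_RXCOMP_OOB_NUM_PPI */

struct toy_port {
	bool port_is_up;
	unsigned char cqe_coalescing_enable;
};

/* Hypothetical stand-in for mana_config_rss(); 0 or a negative errno. */
typedef int (*toy_apply_fn)(struct toy_port *port);

static int toy_apply_ok(struct toy_port *port) { (void)port; return 0; }
static int toy_apply_fail(struct toy_port *port) { (void)port; return -EIO; }

static int toy_set_rx_frames(struct toy_port *port, unsigned int rx_frames,
			     toy_apply_fn apply)
{
	unsigned char saved;
	int err;

	assert(port && apply);

	/* Only "off" (1) and "on" (NUM_PPI) are accepted. */
	if (rx_frames != 1 && rx_frames != NUM_PPI)
		return -EINVAL;

	saved = port->cqe_coalescing_enable;
	port->cqe_coalescing_enable = (rx_frames == NUM_PPI);

	if (!port->port_is_up)
		return 0;	/* nothing to reprogram now; value is only stored */

	err = apply(port);
	if (err)
		port->cqe_coalescing_enable = saved;	/* roll back */

	return err;
}

static unsigned int toy_get_rx_frames(const struct toy_port *port)
{
	return port->cqe_coalescing_enable ? NUM_PPI : 1;
}
```

With a port that is up, toy_set_rx_frames(&p, 2, toy_apply_ok) returns -EINVAL and leaves the setting untouched, while a failing apply callback leaves the previously stored value in place.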
* [PATCH V2,net-next, 2/2] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-01-06 20:46 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
Aditya Garg, Dipayaan Roy, Shiraz Saleem, linux-kernel,
linux-rdma
Cc: paulros
In-Reply-To: <1767732407-12389-1-git-send-email-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
For RX CQEs of type CQE_RX_COALESCED_4, add counters for how many CQEs
contain 2, 3, or 4 packets, to measure the coalescing efficiency.
Also add a counter for the error case where the first packet has length == 0.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 25 +++++++++++++++++--
.../ethernet/microsoft/mana/mana_ethtool.c | 17 ++++++++++---
include/net/mana/mana.h | 10 +++++---
3 files changed, 42 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a46a1adf83bc..78824567d80b 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2083,8 +2083,22 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
nextpkt:
pktlen = oob->ppi[i].pkt_len;
- if (pktlen == 0)
+ if (pktlen == 0) {
+ /* Collect coalesced CQE count based on packets processed.
+ * Coalesced CQEs have at least 2 packets, so index is i - 2.
+ */
+ if (i > 1) {
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.coalesced_cqe[i - 2]++;
+ u64_stats_update_end(&rxq->stats.syncp);
+ } else if (i == 0) {
+ /* Error case stat */
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.pkt_len0_err++;
+ u64_stats_update_end(&rxq->stats.syncp);
+ }
return;
+ }
curr = rxq->buf_index;
rxbuf_oob = &rxq->rx_oobs[curr];
@@ -2102,8 +2116,15 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
mana_post_pkt_rxq(rxq);
- if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
+ if (!coalesced)
+ return;
+
+ if (++i < MANA_RXCOMP_OOB_NUM_PPI)
goto nextpkt;
+
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 2]++;
+ u64_stats_update_end(&rxq->stats.syncp);
}
static void mana_poll_rx_cq(struct mana_cq *cq)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index b2b9bfb50396..635796bfdaf1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
tx_cqe_unknown_type)},
{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
tx_linear_pkt_cnt)},
- {"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
- rx_coalesced_err)},
{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
rx_cqe_unknown_type)},
};
@@ -151,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
{
struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
- int i;
+ int i, j;
if (stringset != ETH_SS_STATS)
return;
@@ -170,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
+ ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
}
for (i = 0; i < num_queues; i++) {
@@ -203,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
u64 xdp_xmit;
u64 xdp_drop;
u64 xdp_tx;
+ u64 pkt_len0_err;
+ u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
u64 tso_packets;
u64 tso_bytes;
u64 tso_inner_packets;
@@ -211,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
u64 short_pkt_fmt;
u64 csum_partial;
u64 mana_map_err;
- int q, i = 0;
+ int q, i = 0, j;
if (!apc->port_is_up)
return;
@@ -241,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
xdp_drop = rx_stats->xdp_drop;
xdp_tx = rx_stats->xdp_tx;
xdp_redirect = rx_stats->xdp_redirect;
+ pkt_len0_err = rx_stats->pkt_len0_err;
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
} while (u64_stats_fetch_retry(&rx_stats->syncp, start));
data[i++] = packets;
@@ -248,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
data[i++] = xdp_drop;
data[i++] = xdp_tx;
data[i++] = xdp_redirect;
+ data[i++] = pkt_len0_err;
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ data[i++] = coalesced_cqe[j];
}
for (q = 0; q < num_queues; q++) {
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 51d26ebeff6c..f8dd19860103 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -61,8 +61,11 @@ enum TRI_STATE {
#define MAX_PORTS_IN_MANA_DEV 256
+/* Maximum number of packets per coalesced CQE */
+#define MANA_RXCOMP_OOB_NUM_PPI 4
+
/* Update this count whenever the respective structures are changed */
-#define MANA_STATS_RX_COUNT 5
+#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
#define MANA_STATS_TX_COUNT 11
#define MANA_RX_FRAG_ALIGNMENT 64
@@ -73,6 +76,8 @@ struct mana_stats_rx {
u64 xdp_drop;
u64 xdp_tx;
u64 xdp_redirect;
+ u64 pkt_len0_err;
+ u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
struct u64_stats_sync syncp;
};
@@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
u32 pkt_hash;
}; /* HW DATA */
-#define MANA_RXCOMP_OOB_NUM_PPI 4
-
/* Receive completion OOB */
struct mana_rxcomp_oob {
struct mana_cqe_header cqe_hdr;
@@ -378,7 +381,6 @@ struct mana_ethtool_stats {
u64 tx_cqe_err;
u64 tx_cqe_unknown_type;
u64 tx_linear_pkt_cnt;
- u64 rx_coalesced_err;
u64 rx_cqe_unknown_type;
};
--
2.34.1
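
To make the counter bucketing in mana_process_rx_cqe() easier to check, here is a userspace sketch of the same decision logic. It is a simplified model: struct toy_rx_stats and toy_process_cqe() are hypothetical names, the per-packet lengths are passed as a plain array instead of oob->ppi[], and the u64_stats synchronization and actual packet delivery are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PPI 4	/* mirrors MANA_RXCOMP_OOB_NUM_PPI */

struct toy_rx_stats {
	uint64_t pkt_len0_err;
	uint64_t coalesced_cqe[NUM_PPI - 1];	/* [0] = 2 pkts ... [2] = 4 pkts */
};

/*
 * Walk the per-packet lengths of one CQE the way the patched loop does
 * and update the counters. Returns the number of packets consumed.
 */
static int toy_process_cqe(const uint32_t pkt_len[NUM_PPI], bool coalesced,
			   struct toy_rx_stats *stats)
{
	int i = 0;

	for (;;) {
		assert(i < NUM_PPI);

		if (pkt_len[i] == 0) {
			/* A zero length ends the CQE; i packets were seen. */
			if (i > 1)
				stats->coalesced_cqe[i - 2]++;
			else if (i == 0)
				stats->pkt_len0_err++;
			return i;
		}

		/* ... the packet at index i would be delivered here ... */

		if (!coalesced)
			return 1;

		if (++i == NUM_PPI)
			break;
	}

	/* All NUM_PPI slots were used. */
	stats->coalesced_cqe[NUM_PPI - 2]++;
	return NUM_PPI;
}
```

Note that in this logic a coalesced CQE carrying exactly one packet (i == 1 when the zero length is found) increments no counter, which matches the "at least 2 packets" comment in the patch.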