Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: sashiko-bot @ 2026-07-02 16:36 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260702160518.311234-2-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] `struct hv_input_create_device_domain` is missing a 4-byte padding field, resulting in a non-standard 28-byte size.
--

commit 267ecd2e71d0d7f097fc4d970780fef12801efc5
Author: Wei Liu <wei.liu@kernel.org>

hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU

This commit introduces new hypercall definitions to allow a child partition
to interact with the para-virtualized Hyper-V guest IOMMU. It includes
interfaces for capability detection, device/domain management, and IOTLB
flushing.

[ ... ]

> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index b4cb2fa26e9bc..493608e791b45 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h

[ ... ]

> +union hv_create_device_domain_flags {
> +	u32 as_uint32;
> +	struct {
> +		u32 forward_progress_required: 1;
> +		u32 inherit_owning_vtl: 1;
> +		u32 reserved: 30;
> +	} __packed;
> +};
> +
> +struct hv_input_create_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	union hv_create_device_domain_flags create_device_domain_flags;
> +} __packed;

[Severity: Medium]
Does this structure need an explicit 4-byte padding field at the end to
ensure its size is a multiple of 8 bytes?

Because the struct is packed and lacks tail padding, its size is exactly
28 bytes. These structures are often initialized on a reused per-CPU page
like this:

hv_create_device_domain() {
        ...
        memset(input, 0, sizeof(*input));
        ...
}

If the hypervisor expects an 8-byte aligned 32-byte input for this hypercall,
the memset will only clear the first 28 bytes. This leaves 4 bytes of
uninitialized garbage from previous hypercalls in the tail end of the buffer,
which might cause domain creation to fail if the hypervisor strictly validates
padding fields to be zero.

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=1

^ permalink raw reply

* RE: [PATCH v1 1/4] drm/hyperv: Unregister pci driver in error path before module unload
From: Michael Kelley @ 2026-07-02 16:37 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4b7dbf00ce4ff664b7d5dd74b2f39d8d87c1ade9.1782925276.git.u.kleine-koenig@baylibre.com>

From: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com> Sent: Wednesday, July 1, 2026 10:05 AM
> 
> The pci driver must not kept registered if the module is unloaded after
> vmbus_driver_register() fails. So check the return value of
> vmbus_driver_register() and unregister the pci driver on failure.
> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index 20f35c48c0b8..2e75fb793495 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -249,7 +249,11 @@ static int __init hv_drm_init(void)
>  	if (ret != 0)
>  		return ret;
> 
> -	return vmbus_driver_register(&hv_drm_hv_driver);
> +	ret = vmbus_driver_register(&hv_drm_hv_driver);
> +	if (ret)
> +		pci_unregister_driver(&hv_drm_pci_driver);
> +
> +	return ret;
>  }
> 
>  static void __exit hv_drm_exit(void)
> --
> 2.55.0.11.g153666a7d9bb
> 

Reviewed-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* RE: [PATCH v1 2/4] drm/hyperv: Explicitly set subvendor and subdevice for pci match array
From: Michael Kelley @ 2026-07-02 16:37 UTC (permalink / raw)
  To: Thomas Zimmermann, Uwe Kleine-König (The Capable Hub)
  Cc: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, David Airlie, Simona Vetter,
	linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4e421c5a-1ecc-4155-8262-69c163af9624@suse.de>

From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Thursday, July 2, 2026 2:15 AM

> 
> Hi
> 
> Am 02.07.26 um 10:52 schrieb Uwe Kleine-König (The Capable Hub):
> > Hallo Thomas,
> >
> > On Thu, Jul 02, 2026 at 08:43:32AM +0200, Thomas Zimmermann wrote:
> >> Am 01.07.26 um 19:05 schrieb Uwe Kleine-König (The Capable Hub):
> >>> .subvendor and .subdevice were set to 0 implicitly, so only devices with
> >>> these two values set to 0 in hardware can probe automatically. Make this
> >>> requirement explicit.
> >>>
> >>> While touching this array item, also make use of the pci macro designed
> >>> for that case.
> >>>
> >>> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> >>> ---
> >>>    drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
> >>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> >>> index 2e75fb793495..e766d87b7a9d 100644
> >>> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> >>> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> >>> @@ -51,8 +51,8 @@ static void hv_drm_pci_remove(struct pci_dev *pdev)
> >>>    static const struct pci_device_id hv_drm_pci_tbl[] = {
> >>>    	{
> >>> -		.vendor = PCI_VENDOR_ID_MICROSOFT,
> >>> -		.device = PCI_DEVICE_ID_HYPERV_VIDEO,
> >>> +		PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
> >>> +				0, 0),
> >> IDK, but it looks like an oversight to me.  Setting the sub-fields to ANY
> >> seems like the better fix.
> > That was my initial reflex, too. However while writing the commit log
> > for that change I noticed that since commit d750785f305e ("Staging: hv:
> > fix hv_utils module to properly autoload") from 2010 (applied to
> > v2.6.35-rc4) the driver never worked for hardware with .subvendor != 0
> > or .subdevice != 0. I cannot believe that something like that is
> > discovered 16 years later by chance during a rework by someone who
> > didn't try to run that hardware. And if I understand correctly, this is
> > emulated hardware and so I guess used quite a lot.
> 
> I wouldn't be surprised. To my knowledge, there's just one
> implementation of this device, which is Windows. If they clear their
> host-side structures to 0 and pass them to the guest, no one would ever
> notice the issue. But let's see what the driver maintainers can comment
> on the issue.
> 

Yes, the device is a synthetic device provided by a host that implements
VMBus. Historically that's Windows Hyper-V, but alternate VMBus
implementations are arising, such as in the paravisor used in some
Azure VMs. That paravisor is also provided by Microsoft, though
conceptually a customer could provide their own paravisor.

Since current code is defaulting to requiring .subvendor and .subdevice
to be zero, I think that requirement should continue even with this
cleanup. That maintains strict consistency with current code. If
there's ever a case where a different implementation of VMBus
needs to identify the synthetic frame buffer device differently, that's
likely because of other changes in how the device presents itself.
Any tweaks to the device id matching can be included then as part
of whatever other changes are needed to the driver.

Just my $.02 ....

As such,

Reviewed-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* RE: [PATCH v1 4/4] drm/hyperv: Move MODULE_DEVICE_TABLE to the device_id arrays
From: Michael Kelley @ 2026-07-02 16:38 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <7f9d4a239c76b6bb384048ea5591a21ed87d9b0e.1782925276.git.u.kleine-koenig@baylibre.com>

From: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com> Sent: Wednesday, July 1, 2026 10:05 AM
> 
> It matches the usual coding style to have the MODULE_DEVICE_TABLE macro
> directly after the respective arrays.
> 
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index e3f41336a831..6a28048f687b 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -52,6 +52,7 @@ static const struct pci_device_id hv_drm_pci_tbl[] = {
>  	},
>  	{ /* end of list */ }
>  };
> +MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
> 
>  /*
>   * PCI stub to support gen1 VM.
> @@ -219,6 +220,7 @@ static const struct hv_vmbus_device_id hv_drm_vmbus_tbl[] = {
>  	{HV_SYNTHVID_GUID},
>  	{}
>  };
> +MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
> 
>  static struct hv_driver hv_drm_hv_driver = {
>  	.name = KBUILD_MODNAME,
> @@ -260,8 +262,6 @@ static void __exit hv_drm_exit(void)
>  module_init(hv_drm_init);
>  module_exit(hv_drm_exit);
> 
> -MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
> -MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Deepak Rawat <drawat.floss@gmail.com>");
>  MODULE_DESCRIPTION("DRM driver for Hyper-V synthetic video device");
> --
> 2.55.0.11.g153666a7d9bb
> 

Reviewed-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* Re: [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices
From: sashiko-bot @ 2026-07-02 16:42 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260702160518.311234-3-zhangyu1@linux.microsoft.com>

> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> 
> Hyper-V identifies each PCI pass-thru device by a logical device ID in
> its hypercall interface. This ID consists of a per-bus prefix, derived
> from the VMBus device instance GUID, combined with the PCI function
> number of the endpoint device.
> 
> Add a small registry in hv_common.c that maps a PCI domain number to its
> logical device ID prefix. The vPCI bus driver (pci-hyperv) registers the
> prefix when a bus is probed and unregisters it when the bus is removed.
> Consumers such as the para-virtualized IOMMU driver look up the prefix
> by PCI domain number and combine it with the function number to form the
> complete logical device ID for hypercalls.
> 
> The prefix construction is shared via hv_build_logical_dev_id_prefix() so
> that pci-hyperv's interrupt retargeting path and the registry use exactly
> the same byte layout. It is derived on demand from the constant hv_device
> instance GUID rather than cached in struct hv_pcibus_device, which is
> private to the pci-hyperv module; this keeps the interface narrow and
> avoids depending on pci-hyperv internals.
> 
> Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=2


^ permalink raw reply

* Re: [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: sashiko-bot @ 2026-07-02 17:08 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260702160518.311234-4-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Undefined behavior when calculating `aperture_end` if the hypervisor returns `max_iova_width >= 64`.
- [High] Unconditional overwrite of `x86_init.iommu.iommu_init` causes hardware IOMMUs (VT-d/AMD) to fail initialization.
--

commit fdcb0264dfd6bd1cb7c3c4676b8a796ad8a85e4a
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest

Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
This driver implements stage-1 IO translation within the guest OS and
integrates with the Linux IOMMU core using Hyper-V hypercalls.

> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 55a8b6de2865a..094f9f7ddb728 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -578,6 +578,10 @@ void __init hyperv_init(void)
>  	old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
>  	x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +	x86_init.iommu.iommu_init = hv_iommu_init;

[Severity: High]
The assignment to x86_init.iommu.iommu_init must be conditionally guarded to
prevent hardware IOMMU initialization failures.

Will this unconditional overwrite break hardware IOMMUs?

Initial boot sequence logic and IOMMU detection parameters are validated.

If pci_iommu_alloc() previously detected a hardware IOMMU (like VT-d/AMD) and
set iommu_detected=1, overwriting the pointer with hv_iommu_init() ensures
failure since hv_iommu_init() immediately aborts when iommu_detected==1.

> +#endif
> +
>  	hv_apic_init();

[ ... ]

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> new file mode 100644
> index 0000000000000..2541369464040
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c

[ ... ]

> @@ -509,6 +509,16 @@ static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,

[Severity: High]
The max_iova_width value must be bounds-checked against the maximum 64-bit
integer width before performing the bitwise shift.

Does this code overflow the 64-bit shift limit?

Standard processing applied to the hypercall response.

If an untrusted hypervisor responds to HVCALL_GET_IOMMU_CAPABILITIES with
max_iova_width >= 64, shifting ((u64)1) directly by that amount results in
undefined behavior and unpredictable IOVA aperture sizing.

> +		.force_aperture = true,
> +	};

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=3

^ permalink raw reply

* Re: [PATCH v2 1/2] RDMA/erdma: initialize ret for empty receive WR lists
From: Jason Gunthorpe @ 2026-07-02 17:17 UTC (permalink / raw)
  To: Ruoyu Wang
  Cc: Leon Romanovsky, Cheng Xu, Kai Shen, Long Li, Konstantin Taranov,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260618041752.481193-1-ruoyuw560@gmail.com>

On Thu, Jun 18, 2026 at 12:17:51PM +0800, Ruoyu Wang wrote:
> erdma_post_recv() returns ret after walking the receive work request list.
> If the caller passes an empty list, the loop is skipped and ret is not
> assigned.
> 
> Initialize ret to 0 so an empty receive work request list returns success
> instead of stack data.
> 
> Fixes: 155055771704 ("RDMA/erdma: Add verbs implementation")
> Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
> Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
> ---
> v2:
> - Split the erdma and mana_ib changes into separate patches.
> - Add a driver-specific Fixes tag.
> 
>  drivers/infiniband/hw/erdma/erdma_qp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Both patches applied to for-rc

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: sashiko-bot @ 2026-07-02 17:20 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260702160518.311234-5-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Bitfield assignment on an uninitialized hypercall buffer element causes a read of uninitialized memory, triggering a KMSAN uninit-value warning.
- [High] Missing check for an empty `iotlb_gather` causes spurious flushes of the entire IOTLB domain.
- [Medium] The calculation of `last_pfn` uses an unnecessarily complex formula that theoretically overflows for IOVAs near the end of the 64-bit address space.
--

commit 306d408fd1d04144672964e71dd7149937537189
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add page-selective IOTLB flush support

Adds page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
Calculates the minimal power-of-two aligned range for the target IOVA region,
falling back to a full domain flush if the page-selective flush fails.

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> index 2541369464040..e9b104a322fd8 100644
> --- a/drivers/iommu/hyperv/iommu.c
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -401,10 +402,74 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
>  	hv_flush_device_domain(to_hv_iommu_domain(domain));
>  }
>  
> +/*
> + * Calculate the minimal power-of-two aligned range that covers [start, end]
> + * (end is inclusive). Returns a single (page_number, page_mask_shift)
> + * descriptor that may over-flush when the range is not naturally aligned.
> + */
> +static void hv_iommu_calc_flush_range(unsigned long start, unsigned long end,
> +				       union hv_iommu_flush_va *va)
> +{
> +	unsigned long start_pfn = HVPFN_DOWN(start);
> +	unsigned long last_pfn = HVPFN_UP(end + 1) - 1;

[Severity: Medium]
Can this calculation overflow?

If end is within the last 4095 bytes of the 64-bit address space, end + 1
would wrap around. HVPFN_UP() expands to (((x) + 4095) >> 12), which would
evaluate to 0 in that case.

Subtracting 1 would then result in ULONG_MAX, causing __fls() to compute a
mask_shift of 64 and triggering an undefined left shift by 64 in the later
ALIGN_DOWN macro.

Would it be safer to use HVPFN_DOWN(end) instead, which avoids the overflow
entirely?

> +	unsigned long mask_shift, aligned_pfn;

[ ... ]

> +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> +					struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain_list *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = hv_domain->device_domain;
> +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> +	hv_iommu_calc_flush_range(iotlb_gather->start, iotlb_gather->end,
> +				  &input->iova_list[0]);

[Severity: High]
Does this code trigger an uninitialized memory read KMSAN warning?

Since iova_list is a flexible array member, sizeof(*input) evaluates to just
the header size (24 bytes). This means memset() clears the header, but leaves
input->iova_list[0] uninitialized.

When hv_iommu_calc_flush_range() later assigns the page_number and
page_mask_shift bitfields of the union, the compiler typically generates a
read-modify-write operation on the underlying 64-bit value. This will read
the uninitialized memory from the reused per-CPU page.

> +
> +	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,

[ ... ]

>  static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
>  				struct iommu_iotlb_gather *iotlb_gather)
>  {
> -	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +	hv_flush_device_domain_list(to_hv_iommu_domain(domain), iotlb_gather);

[Severity: High]
Does this unconditionally flush the entire IOTLB domain if the gather is
empty?

If iommu_unmap() returns without unmapping anything, iotlb_gather retains its
initialized values of start = ULONG_MAX and end = 0.

Because there is no check like if (iotlb_gather->start > iotlb_gather->end)
before issuing the flush, hv_iommu_calc_flush_range() will calculate a
mask_shift of 52. This causes a spurious flush of the full 52-bit address
space instead of returning early.

>  
>  	iommu_put_pages_list(&iotlb_gather->freelist);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=4

^ permalink raw reply

* RE: [PATCH v5 01/51] x86/apic: Provide helpers to set local APIC timer period in hz and khz
From: Michael Kelley @ 2026-07-02 17:47 UTC (permalink / raw)
  To: Sean Christopherson, Jonathan Corbet, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz
  Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania,
	David Woodhouse, David Woodhouse, Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-2-seanjc@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, July 1, 2026 12:31 PM
> 
> Add and use APIs to set the local APIC timer period instead of open coding
> the subtle HZ math in a all external callers, and make lapic_timer_period

Spurious word "a".

> local to apic.c.  Provide APIs to specify the frequency in both hertz and
> kilohertz so that Hyper-V and VMware code aren't forced to lose precision.
> 
> Opportunistically use mul_u64_u32_div() to harden against the possibility
> that the period in Khz is greater than 4294967, i.e. if the APIC timer runs
> at ~4.29 GHz.  As pointed out by Sashiko, 4294968 * 1000 == 0x1_000002c0,
> and thus a Khz period of 4294968 would silently overflow the 32-bit
> unsigned integer used by most callers.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/apic.h    |  3 ++-
>  arch/x86/kernel/apic/apic.c    | 12 +++++++++++-
>  arch/x86/kernel/cpu/mshyperv.c |  5 +----
>  arch/x86/kernel/cpu/vmware.c   |  4 +---
>  arch/x86/kernel/jailhouse.c    |  2 +-
>  arch/x86/kernel/tsc.c          |  2 +-
>  arch/x86/kernel/tsc_msr.c      |  2 +-
>  7 files changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index 9cd493d467d4..cd84a94688a2 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -63,7 +63,6 @@ extern int apic_verbosity;
>  extern int local_apic_timer_c2_ok;
> 
>  extern bool apic_is_disabled;
> -extern unsigned int lapic_timer_period;
> 
>  extern enum apic_intr_mode_id apic_intr_mode;
>  enum apic_intr_mode_id {
> @@ -138,6 +137,8 @@ void register_lapic_address(unsigned long address);
>  extern void setup_boot_APIC_clock(void);
>  extern void setup_secondary_APIC_clock(void);
>  extern void lapic_update_tsc_freq(void);
> +extern void apic_set_timer_period_hz(u64 period_hz, const char *source);
> +extern void apic_set_timer_period_khz(u64 period_khz, const char *source);
> 
>  #ifdef CONFIG_X86_64
>  static inline bool apic_force_enable(unsigned long addr)
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index aa1e19979aa8..8d3d930576fd 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -176,7 +176,7 @@ static struct resource lapic_resource = {
>  };
> 
>  /* Measured in ticks per HZ. */
> -unsigned int lapic_timer_period = 0;
> +static unsigned int lapic_timer_period;
> 
>  static void apic_pm_activate(void);
> 
> @@ -796,6 +796,16 @@ bool __init apic_needs_pit(void)
>  	return lapic_timer_period == 0;
>  }
> 
> +void apic_set_timer_period_khz(u64 period_khz, const char *source)
> +{
> +	lapic_timer_period = mul_u64_u32_div(period_khz, 1000, HZ);
> +}
> +
> +void apic_set_timer_period_hz(u64 period_hz, const char *source)
> +{
> +	lapic_timer_period = div_u64(period_hz, HZ);
> +}

A string "source" argument is passed in, but not used. Is there an
envisioned future use? Also, this function doesn't output a pr_info()
message like the existing Hyper-V and VMware code does. I don't
know that the message is all that useful, though I do remember
one case where I was debugging some clock/timer issue when I
looked at it. 

Michael

> +
>  static int __init calibrate_APIC_clock(void)
>  {
>  	struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 185d4f677ec0..87beecec76f0 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -646,10 +646,7 @@ static void __init ms_hyperv_init_platform(void)
>  		u64	hv_lapic_frequency;
> 
>  		rdmsrq(HV_X64_MSR_APIC_FREQUENCY, hv_lapic_frequency);
> -		hv_lapic_frequency = div_u64(hv_lapic_frequency, HZ);
> -		lapic_timer_period = hv_lapic_frequency;
> -		pr_info("Hyper-V: LAPIC Timer Frequency: %#x\n",
> -			lapic_timer_period);
> +		apic_set_timer_period_hz(hv_lapic_frequency, "Hyper-V hypervisor");
>  	}
> 
>  	register_nmi_handler(NMI_UNKNOWN, hv_nmi_unknown, NMI_FLAG_FIRST,
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 34b73573b108..36f779dd311d 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -424,9 +424,7 @@ static void __init vmware_platform_setup(void)
> 
>  #ifdef CONFIG_X86_LOCAL_APIC
>  		/* Skip lapic calibration since we know the bus frequency. */
> -		lapic_timer_period = ecx / HZ;
> -		pr_info("Host bus clock speed read from hypervisor : %u Hz\n",
> -			ecx);
> +		apic_set_timer_period_hz(ecx, "VMware hypervisor");
>  #endif
>  	} else {
>  		pr_warn("Failed to get TSC freq from the hypervisor\n");
> diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
> index f58ce9220e0f..f2d4ef89c085 100644
> --- a/arch/x86/kernel/jailhouse.c
> +++ b/arch/x86/kernel/jailhouse.c
> @@ -65,7 +65,7 @@ static void jailhouse_get_wallclock(struct timespec64 *now)
> 
>  static void __init jailhouse_timer_init(void)
>  {
> -	lapic_timer_period = setup_data.v1.apic_khz * (1000 / HZ);
> +	apic_set_timer_period_khz(setup_data.v1.apic_khz, "Jailhouse hypervisor");
>  }
> 
>  static unsigned long jailhouse_get_tsc(void)
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index ce10ae4b298b..f9ecc9256863 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -717,7 +717,7 @@ unsigned long native_calibrate_tsc(void)
>  	 * lapic_timer_period here to avoid having to calibrate the APIC
>  	 * timer later.
>  	 */
> -	lapic_timer_period = crystal_khz * 1000 / HZ;
> +	apic_set_timer_period_khz(crystal_khz, "CPUID 0x15/0x16");
>  #endif
> 
>  	return crystal_khz * ebx_numerator / eax_denominator;
> diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
> index 48e6cc1cb017..7e990871e041 100644
> --- a/arch/x86/kernel/tsc_msr.c
> +++ b/arch/x86/kernel/tsc_msr.c
> @@ -211,7 +211,7 @@ unsigned long cpu_khz_from_msr(void)
>  		pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index);
> 
>  #ifdef CONFIG_X86_LOCAL_APIC
> -	lapic_timer_period = (freq * 1000) / HZ;
> +	apic_set_timer_period_khz(freq, "MSR_FSB_FREQ");
>  #endif
> 
>  	/*
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
> 


^ permalink raw reply

* RE: [PATCH v5 11/51] x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
From: Michael Kelley @ 2026-07-02 17:47 UTC (permalink / raw)
  To: Sean Christopherson, Jonathan Corbet, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz
  Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania,
	David Woodhouse, David Woodhouse, Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-12-seanjc@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, July 1, 2026 12:32 PM
> 
> Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
> instead of overriding seemingly generic platform hooks, and explicitly
> priotize hypervisor-provided frequencies over native methods, but do NOT

s/priotize/prioritize/

> clobber the frequency obtained from trusted firmware.  While shuffling the
> hooks around is arguably "six of one, half dozen of the other", scoping
> them to x86_hyper_init makes their purpose more obvious, and allows for
> explicitly defining the priority of sources (as is done here).
> 
> As is already done when trusted firmware provides the TSC frequency, ignore

Word "ignore" is duplicated.

> ignore tsc_early_khz if the exact TSC frequency was obtained from the
> hypervisor, as attempting to refine the TSC frequency when running in a VM
> is all but guaranteed to cause problems sooner or later due to the
> calibration sources being emulated devices in the vast majority of setups.
> 
> Cc: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

For the Hyper-V changes,

Reviewed-by: Michael Kelley <mhklinux@outlook.com> 

> ---
>  .../admin-guide/kernel-parameters.txt         |  3 +-
>  arch/x86/include/asm/acrn.h                   |  5 ----
>  arch/x86/include/asm/x86_init.h               |  4 +++
>  arch/x86/kernel/cpu/acrn.c                    | 10 +++++--
>  arch/x86/kernel/cpu/mshyperv.c                |  6 ++--
>  arch/x86/kernel/cpu/vmware.c                  |  8 ++---
>  arch/x86/kernel/jailhouse.c                   |  6 ++--
>  arch/x86/kernel/kvmclock.c                    |  6 ++--
>  arch/x86/kernel/tsc.c                         | 29 ++++++++++++++-----
>  arch/x86/xen/time.c                           |  4 +--
>  10 files changed, 50 insertions(+), 31 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-
> guide/kernel-parameters.txt
> index 490e6aa72fc2..a387bb2c47e2 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -7948,7 +7948,8 @@ Kernel parameters
> 
>  			Note, tsc_early_khz is ignored if the TSC frequency is
>  			provided by trusted firmware when running as an SNP or
> -			TDX guest.
> +			TDX guest, or when the hypervisor provides the exact
> +			frequency via a paravirtual interface.
> 
>  	tsx=		[X86] Control Transactional Synchronization
>  			Extensions (TSX) feature in Intel processors that
> diff --git a/arch/x86/include/asm/acrn.h b/arch/x86/include/asm/acrn.h
> index db42b477c41d..a892179c61c6 100644
> --- a/arch/x86/include/asm/acrn.h
> +++ b/arch/x86/include/asm/acrn.h
> @@ -32,11 +32,6 @@ static inline u32 acrn_cpuid_base(void)
>  	return 0;
>  }
> 
> -static inline unsigned long acrn_get_tsc_khz(void)
> -{
> -	return cpuid_eax(ACRN_CPUID_TIMING_INFO);
> -}
> -
>  /*
>   * Hypercalls for ACRN
>   *
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index 953d3199408a..0c89bf40f507 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -123,6 +123,8 @@ struct x86_init_pci {
>   * @msi_ext_dest_id:		MSI supports 15-bit APIC IDs
>   * @init_mem_mapping:		setup early mappings during init_mem_mapping()
>   * @init_after_bootmem:		guest init after boot allocator is finished
> + * @get_tsc_khz:		get the TSC frequency (returns 0 if frequency is unknown)
> + * @get_cpu_khz:		get the CPU frequency (returns 0 if frequency is unknown)
>   */
>  struct x86_hyper_init {
>  	void (*init_platform)(void);
> @@ -131,6 +133,8 @@ struct x86_hyper_init {
>  	bool (*msi_ext_dest_id)(void);
>  	void (*init_mem_mapping)(void);
>  	void (*init_after_bootmem)(void);
> +	unsigned int (*get_tsc_khz)(void);
> +	unsigned int (*get_cpu_khz)(void);
>  };
> 
>  /**
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index dc119af83524..ad8f2da8003b 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -24,13 +24,15 @@ static u32 __init acrn_detect(void)
>  	return acrn_cpuid_base();
>  }
> 
> +static unsigned int __init acrn_get_tsc_khz(void)
> +{
> +	return cpuid_eax(ACRN_CPUID_TIMING_INFO);
> +}
> +
>  static void __init acrn_init_platform(void)
>  {
>  	/* Install system interrupt handler for ACRN hypervisor callback */
>  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
> -
> -	x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> -	x86_platform.calibrate_cpu = acrn_get_tsc_khz;
>  }
> 
>  static bool acrn_x2apic_available(void)
> @@ -78,4 +80,6 @@ const __initconst struct hypervisor_x86 x86_hyper_acrn = {
>  	.type			= X86_HYPER_ACRN,
>  	.init.init_platform     = acrn_init_platform,
>  	.init.x2apic_available  = acrn_x2apic_available,
> +	.init.get_tsc_khz	= acrn_get_tsc_khz,
> +	.init.get_cpu_khz	= acrn_get_tsc_khz,
>  };
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 87beecec76f0..f9bc1c2d8c93 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -395,7 +395,7 @@ static int hv_nmi_unknown(unsigned int val, struct pt_regs *regs)
>  }
>  #endif
> 
> -static unsigned long hv_get_tsc_khz(void)
> +static unsigned int __init hv_get_tsc_khz(void)
>  {
>  	unsigned long freq;
> 
> @@ -573,8 +573,8 @@ static void __init ms_hyperv_init_platform(void)
> 
>  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
>  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> -		x86_platform.calibrate_tsc = hv_get_tsc_khz;
> -		x86_platform.calibrate_cpu = hv_get_tsc_khz;
> +		x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
> +		x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
>  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	}
> 
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 13b97265c535..3cb473cae462 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -64,7 +64,7 @@ struct vmware_steal_time {
>  	u64 reserved[7];
>  };
> 
> -static unsigned long vmware_tsc_khz __ro_after_init;
> +static unsigned long vmware_tsc_khz __initdata;
>  static u8 vmware_hypercall_mode     __ro_after_init;
> 
>  unsigned long vmware_hypercall_slow(unsigned long cmd,
> @@ -137,7 +137,7 @@ static inline int __vmware_platform(void)
>  	return eax != UINT_MAX && ebx == VMWARE_HYPERVISOR_MAGIC;
>  }
> 
> -static unsigned long vmware_get_tsc_khz(void)
> +static unsigned int __init vmware_get_tsc_khz(void)
>  {
>  	return vmware_tsc_khz;
>  }
> @@ -419,8 +419,8 @@ static void __init vmware_platform_setup(void)
>  		}
> 
>  		vmware_tsc_khz = tsc_khz;
> -		x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> -		x86_platform.calibrate_cpu = vmware_get_tsc_khz;
> +		x86_init.hyper.get_tsc_khz = vmware_get_tsc_khz;
> +		x86_init.hyper.get_cpu_khz = vmware_get_tsc_khz;
> 
>  		/* Skip lapic calibration since we know the bus frequency. */
>  		apic_set_timer_period_hz(ecx, "VMware hypervisor");
> diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
> index f2d4ef89c085..e24c05ab4fae 100644
> --- a/arch/x86/kernel/jailhouse.c
> +++ b/arch/x86/kernel/jailhouse.c
> @@ -68,7 +68,7 @@ static void __init jailhouse_timer_init(void)
>  	apic_set_timer_period_khz(setup_data.v1.apic_khz, "Jailhouse hypervisor");
>  }
> 
> -static unsigned long jailhouse_get_tsc(void)
> +static unsigned int __init jailhouse_get_tsc(void)
>  {
>  	return precalibrated_tsc_khz;
>  }
> @@ -210,8 +210,6 @@ static void __init jailhouse_init_platform(void)
>  	x86_init.mpparse.parse_smp_cfg		= jailhouse_parse_smp_config;
>  	x86_init.pci.arch_init			= jailhouse_pci_arch_init;
> 
> -	x86_platform.calibrate_cpu		= jailhouse_get_tsc;
> -	x86_platform.calibrate_tsc		= jailhouse_get_tsc;
>  	x86_platform.get_wallclock		= jailhouse_get_wallclock;
>  	x86_platform.legacy.rtc			= 0;
>  	x86_platform.legacy.warm_reset		= 0;
> @@ -293,5 +291,7 @@ const struct hypervisor_x86 x86_hyper_jailhouse __refconst = {
>  	.detect			= jailhouse_detect,
>  	.init.init_platform	= jailhouse_init_platform,
>  	.init.x2apic_available	= jailhouse_x2apic_available,
> +	.init.get_tsc_khz	= jailhouse_get_tsc,
> +	.init.get_cpu_khz	= jailhouse_get_tsc,
>  	.ignore_nopv		= true,
>  };
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index cb3d0ca1fa22..4f8299303a19 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -136,7 +136,7 @@ static inline void kvm_sched_clock_init(bool stable)
>   * poll of guests can be running and trouble each other. So we preset
>   * lpj here
>   */
> -static unsigned long kvm_get_tsc_khz(void)
> +static unsigned int __init kvm_get_tsc_khz(void)
>  {
>  	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	return pvclock_tsc_khz(this_cpu_pvti());
> @@ -343,8 +343,8 @@ void __init kvmclock_init(void)
>  	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>  	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> 
> -	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> -	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> +	x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
> +	x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
>  	x86_platform.get_wallclock = kvm_get_wallclock;
>  	x86_platform.set_wallclock = kvm_set_wallclock;
>  #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 86384a83a5f6..1dca9464b41c 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1451,13 +1451,17 @@ static int __init init_tsc_clocksource(void)
>  device_initcall(init_tsc_clocksource);
> 
>  static bool __init determine_cpu_tsc_frequencies(bool early,
> +						 unsigned int known_cpu_khz,
>  						 unsigned int known_tsc_khz)
>  {
>  	/* Make sure that cpu and tsc are not already calibrated */
>  	WARN_ON(cpu_khz || tsc_khz);
> 
>  	if (early) {
> -		cpu_khz = x86_platform.calibrate_cpu();
> +		if (known_cpu_khz)
> +			cpu_khz = known_cpu_khz;
> +		else
> +			cpu_khz = x86_platform.calibrate_cpu();
>  		if (known_tsc_khz)
>  			tsc_khz = known_tsc_khz;
>  		else
> @@ -1514,7 +1518,7 @@ static void __init tsc_enable_sched_clock(void)
> 
>  void __init tsc_early_init(void)
>  {
> -	unsigned int known_tsc_khz = 0;
> +	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;
> 
>  	if (!boot_cpu_has(X86_FEATURE_TSC))
>  		return;
> @@ -1522,22 +1526,33 @@ void __init tsc_early_init(void)
>  	if (is_early_uv_system())
>  		return;
> 
> +	if (x86_init.hyper.get_cpu_khz)
> +		known_cpu_khz = x86_init.hyper.get_cpu_khz();
> +
>  	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
>  		known_tsc_khz = snp_secure_tsc_init();
>  	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
>  		known_tsc_khz = tdx_tsc_init();
> 
> +	/*
> +	 * If the TSC frequency wasn't provided by trusted firmware, try to get
> +	 * it from the hypervisor (which is untrusted when running as a CoCo guest).
> +	 */
> +	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
> +		known_tsc_khz = x86_init.hyper.get_tsc_khz();
> +
>  	/*
>  	 * Ignore the user-provided TSC frequency if the exact frequency was
> -	 * obtained from trusted firmware, as the user-provided frequency is
> -	 * intended as a "starting point", not a known, guaranteed frequency.
> +	 * obtained from trusted firmware or the hypervisor, as the user-
> +	 * provided frequency is intended as a "starting point", not a known,
> +	 * guaranteed frequency.
>  	 */
>  	if (!known_tsc_khz)
>  		known_tsc_khz = tsc_early_khz;
>  	else if (tsc_early_khz)
> -		pr_err("Ignoring 'tsc_early_khz' in favor of trusted firmware.\n");
> +		pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");
> 
> -	if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
> +	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
>  		return;
>  	tsc_enable_sched_clock();
>  }
> @@ -1558,7 +1573,7 @@ void __init tsc_init(void)
> 
>  	if (!tsc_khz) {
>  		/* We failed to determine frequencies earlier, try again */
> -		if (!determine_cpu_tsc_frequencies(false, 0)) {
> +		if (!determine_cpu_tsc_frequencies(false, 0, 0)) {
>  			mark_tsc_unstable("could not calculate TSC khz");
>  			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
>  			return;
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index d62c14334b35..1adb44fdddb2 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -38,7 +38,7 @@
>  static u64 xen_sched_clock_offset __read_mostly;
> 
>  /* Get the TSC speed from Xen */
> -static unsigned long xen_tsc_khz(void)
> +static unsigned int __init xen_tsc_khz(void)
>  {
>  	struct pvclock_vcpu_time_info *info =
>  		&HYPERVISOR_shared_info->vcpu_info[0].time;
> @@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
>  	static_call_update(pv_steal_clock, xen_steal_clock);
>  	paravirt_set_sched_clock(xen_sched_clock);
> 
> -	x86_platform.calibrate_tsc = xen_tsc_khz;
> +	x86_init.hyper.get_tsc_khz = xen_tsc_khz;
>  	x86_platform.get_wallclock = xen_get_wallclock;
>  }
> 
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
> 


^ permalink raw reply

* RE: [PATCH v5 14/51] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Michael Kelley @ 2026-07-02 17:47 UTC (permalink / raw)
  To: Sean Christopherson, Jonathan Corbet, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz
  Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania,
	David Woodhouse, David Woodhouse, Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-15-seanjc@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, July 1, 2026 12:32 PM
> 
> Now that all paravirt code that explicitly specifies the TSC frequency
> also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> 
> Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> line parameter"), one of the goals of the param is to allow the refined
> calibration work "to do meaningful error checking".
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

For the Hyper-V changes,

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
>  arch/x86/coco/sev/core.c       |  1 -
>  arch/x86/coco/tdx/tdx.c        |  1 -
>  arch/x86/kernel/cpu/acrn.c     |  1 -
>  arch/x86/kernel/cpu/mshyperv.c |  1 -
>  arch/x86/kernel/cpu/vmware.c   |  2 --
>  arch/x86/kernel/jailhouse.c    |  1 -
>  arch/x86/kernel/kvmclock.c     |  1 -
>  arch/x86/kernel/tsc.c          | 13 ++++++++++---
>  arch/x86/xen/time.c            |  1 -
>  9 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index bc5ae9ef74da..72313b36b6f5 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -2027,7 +2027,6 @@ unsigned int __init snp_secure_tsc_init(void)
> 
>  	secrets = (__force struct snp_secrets_page *)mem;
> 
> -	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> 
>  	rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index ae2d35f2ef33..94682aca188b 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -1205,7 +1205,6 @@ unsigned int __init tdx_tsc_init(void)
> 
>  	/* TSC is the only reliable clock in TDX guest */
>  	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> -	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> 
>  	return info.crystal_khz * info.numerator / info.denominator;
>  }
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index 3818f6ae0629..dc71a6fdd461 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -40,7 +40,6 @@ static void __init acrn_init_platform(void)
>  	if (acrn_tsc_khz_cpuid) {
>  		x86_init.hyper.get_tsc_khz = acrn_get_tsc_khz;
>  		x86_init.hyper.get_cpu_khz = acrn_get_tsc_khz;
> -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	}
>  }
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index f9bc1c2d8c93..e03c69a4db33 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -575,7 +575,6 @@ static void __init ms_hyperv_init_platform(void)
>  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
>  		x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
>  		x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
> -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	}
> 
>  	if (ms_hyperv.priv_high & HV_ISOLATION) {
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 3cb473cae462..0a3bd90576d4 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -390,8 +390,6 @@ static void __init vmware_set_capabilities(void)
>  {
>  	setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
>  	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> -	if (vmware_tsc_khz)
> -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
>  		setup_force_cpu_cap(X86_FEATURE_VMCALL);
>  	else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
> diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
> index e24c05ab4fae..ff173052cdce 100644
> --- a/arch/x86/kernel/jailhouse.c
> +++ b/arch/x86/kernel/jailhouse.c
> @@ -255,7 +255,6 @@ static void __init jailhouse_init_platform(void)
>  	pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
> 
>  	precalibrated_tsc_khz = setup_data.v1.tsc_khz;
> -	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> 
>  	pci_probe = 0;
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 4f8299303a19..35a879d33e9e 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -138,7 +138,6 @@ static inline void kvm_sched_clock_init(bool stable)
>   */
>  static unsigned int __init kvm_get_tsc_khz(void)
>  {
> -	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	return pvclock_tsc_khz(this_cpu_pvti());
>  }
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 1dca9464b41c..676910292af7 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1541,11 +1541,18 @@ void __init tsc_early_init(void)
>  	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
>  		known_tsc_khz = x86_init.hyper.get_tsc_khz();
> 
> +	/*
> +	 * Mark the TSC frequency as known if it was obtained from a hypervisor
> +	 * or trusted firmware.
> +	 */
> +	if (known_tsc_khz)
> +		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
>  	/*
>  	 * Ignore the user-provided TSC frequency if the exact frequency was
> -	 * obtained from trusted firmware or the hypervisor, as the user-
> -	 * provided frequency is intended as a "starting point", not a known,
> -	 * guaranteed frequency.
> +	 * obtained from trusted firmware or the hypervisor, and don't mark the
> +	 * frequency as known, as the user-provided frequency is intended as a
> +	 * "starting point", not a known, guaranteed frequency
>  	 */
>  	if (!known_tsc_khz)
>  		known_tsc_khz = tsc_early_khz;
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 1adb44fdddb2..487ad838c441 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -43,7 +43,6 @@ static unsigned int __init xen_tsc_khz(void)
>  	struct pvclock_vcpu_time_info *info =
>  		&HYPERVISOR_shared_info->vcpu_info[0].time;
> 
> -	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	return pvclock_tsc_khz(info);
>  }
> 
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
> 


^ permalink raw reply

* RE: [PATCH v5 36/51] x86/paravirt: Pass sched_clock save/restore helpers during registration
From: Michael Kelley @ 2026-07-02 17:48 UTC (permalink / raw)
  To: Sean Christopherson, Jonathan Corbet, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz
  Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania,
	David Woodhouse, David Woodhouse, Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-37-seanjc@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, July 1, 2026 12:32 PM
> 
> Pass in a PV clock's save/restore helpers when configuring sched_clock
> instead of relying on each PV clock to manually set the save/restore hooks.
> In addition to bringing sanity to the code, this will allow gracefully
> "rejecting" a PV sched_clock, e.g. when running as a CoCo guest that has
> access to a "secure" TSC.
> 
> No functional change intended.
> 
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

For the Hyper-V changes,

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
>  arch/x86/include/asm/timer.h       | 9 ++++++---
>  arch/x86/kernel/cpu/vmware.c       | 8 +++-----
>  arch/x86/kernel/kvmclock.c         | 6 +++---
>  arch/x86/kernel/tsc.c              | 5 ++++-
>  arch/x86/xen/time.c                | 5 ++---
>  drivers/clocksource/hyperv_timer.c | 6 ++----
>  6 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
> index fe41d40a9ae6..e97cd1ae03d1 100644
> --- a/arch/x86/include/asm/timer.h
> +++ b/arch/x86/include/asm/timer.h
> @@ -14,11 +14,14 @@ extern int no_timer_check;
>  extern bool using_native_sched_clock(void);
> 
>  #ifdef CONFIG_PARAVIRT
> -void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
> +				void (*save)(void), void (*restore)(void));
> 
> -static inline void paravirt_set_sched_clock(u64 (*func)(void))
> +static inline void paravirt_set_sched_clock(u64 (*func)(void),
> +					    void (*save)(void),
> +					    void (*restore)(void))
>  {
> -	__paravirt_set_sched_clock(func, true);
> +	__paravirt_set_sched_clock(func, true, save, restore);
>  }
>  #endif
> 
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 5c1ccaf4a25e..232255279a6e 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -347,11 +347,9 @@ static void __init vmware_paravirt_ops_setup(void)
> 
>  	vmware_cyc2ns_setup();
> 
> -	if (vmw_sched_clock) {
> -		paravirt_set_sched_clock(vmware_sched_clock);
> -		x86_platform.save_sched_clock_state = x86_init_noop;
> -		x86_platform.restore_sched_clock_state = x86_init_noop;
> -	}
> +	if (vmw_sched_clock)
> +		paravirt_set_sched_clock(vmware_sched_clock,
> +					 x86_init_noop, x86_init_noop);
> 
>  	if (vmware_is_stealclock_available()) {
>  		has_steal_clock = true;
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 07e875738c39..5b9955343199 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -158,7 +158,9 @@ static void kvm_restore_sched_clock_state(void)
>  static inline void kvm_sched_clock_init(bool stable)
>  {
>  	kvm_sched_clock_offset = kvm_clock_read();
> -	__paravirt_set_sched_clock(kvm_sched_clock_read, stable);
> +	__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
> +				   kvm_save_sched_clock_state,
> +				   kvm_restore_sched_clock_state);
> 
>  	pr_info("kvm-clock: using sched offset of %llu cycles",
>  		kvm_sched_clock_offset);
> @@ -367,8 +369,6 @@ void __init kvmclock_init(bool prefer_tsc)
>  #ifdef CONFIG_SMP
>  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
>  #endif
> -	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> -	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
>  	kvm_get_preset_lpj();
> 
>  	/*
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 7473dcab4775..83353d643150 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -280,12 +280,15 @@ bool using_native_sched_clock(void)
>  	return static_call_query(pv_sched_clock) == native_sched_clock;
>  }
> 
> -void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
> +				void (*save)(void), void (*restore)(void))
>  {
>  	if (!stable)
>  		clear_sched_clock_stable();
> 
>  	static_call_update(pv_sched_clock, func);
> +	x86_platform.save_sched_clock_state = save;
> +	x86_platform.restore_sched_clock_state = restore;
>  }
>  #else
>  u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 477441752f40..8cd8bfaf1320 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -566,13 +566,12 @@ static void __init xen_init_time_common(void)
>  {
>  	xen_sched_clock_offset = xen_clocksource_read();
>  	static_call_update(pv_steal_clock, xen_steal_clock);
> -	paravirt_set_sched_clock(xen_sched_clock);
> +
>  	/*
>  	 * Xen has paravirtualized suspend/resume and so doesn't use the common
>  	 * x86 sched_clock save/restore hooks.
>  	 */
> -	x86_platform.save_sched_clock_state = x86_init_noop;
> -	x86_platform.restore_sched_clock_state = x86_init_noop;
> +	paravirt_set_sched_clock(xen_sched_clock, x86_init_noop, x86_init_noop);
> 
>  	x86_init.hyper.get_tsc_khz = xen_tsc_khz;
>  	x86_platform.get_wallclock = xen_get_wallclock;
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index 220668207d19..8ee7a9de0f4f 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -570,10 +570,8 @@ static void hv_restore_sched_clock_state(void)
>  static __always_inline void hv_setup_sched_clock(void *sched_clock)
>  {
>  	/* We're on x86/x64 *and* using PV ops */
> -	paravirt_set_sched_clock(sched_clock);
> -
> -	x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
> -	x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
> +	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
> +				 hv_restore_sched_clock_state);
>  }
>  #else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
>  static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
> 


^ permalink raw reply

* RE: [PATCH v5 47/51] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC
From: Michael Kelley @ 2026-07-02 17:48 UTC (permalink / raw)
  To: Sean Christopherson, Jonathan Corbet, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz
  Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania,
	David Woodhouse, David Woodhouse, Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-48-seanjc@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, July 1, 2026 12:32 PM
> 
> Silently ignore attempts to switch to a paravirt sched_clock when running
> as a CoCo guest with trusted TSC.  In hand-wavy theory, a misbehaving
> hypervisor could attack the guest by manipulating the PV clock to affect
> guest scheduling in some weird and/or predictable way.  More importantly,
> reading TSC on such platforms is faster than any PV clock, and sched_clock
> is all about speed.
> 
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/tsc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 012321fed5e5..a146fc7b5e74 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -283,6 +283,15 @@ bool using_native_sched_clock(void)
>  int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
>  				      void (*save)(void), void (*restore)(void))
>  {
> +	/*
> +	 * Don't replace TSC with a PV clock when running as a CoCo guest and
> +	 * the TSC is secure/trusted; PV clocks are emulated by the hypervisor,
> +	 * which isn't in the guest's TCB.
> +	 */
> +	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
> +	    boot_cpu_has(X86_FEATURE_TDX_GUEST))
> +		return -EPERM;

Do a pr_warn() in the error case? Your commit message says to
do the ignore silently, but I wonder if that's a good idea. At least
for Hyper-V, the error case shouldn't happen.

Michael

> +
>  	if (!stable)
>  		clear_sched_clock_stable();
> 
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
> 


^ permalink raw reply

* Re: [PATCH net-next v2] net: mana: Add handler for sriov configure
From: sashiko-bot @ 2026-07-02 18:02 UTC (permalink / raw)
  To: Haiyang Zhang; +Cc: linux-hyperv
In-Reply-To: <20260701180116.507690-1-haiyangz@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Disabling SR-IOV without checking if Virtual Functions (VFs) are assigned to a guest VM.
- [High] Missing SR-IOV cleanup during PF driver removal and shutdown.
--

commit 3df823cc57367f69930deb6dc48172ce7d77e785
Author: Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Add handler for sriov configure

Add callback function for the pci_driver / sriov_configure.

It asks the NIC to provide certain number of VFs, or disable
VFs if the request is zero.

> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index e8b7ffb47eb98..f7a79874fd0eb 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -2496,6 +2496,20 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
>  	pci_disable_device(pdev);
>  }

[Severity: High]
With the addition of SR-IOV support in this patch, do mana_gd_shutdown() and
mana_gd_remove() need to call pci_disable_sriov()?

If a user enables VFs via sysfs and then unbinds the driver or shuts down the
system, could this omission introduce a regression where orphaned, dead VF PCI
devices are left in the kernel?

Accessing these orphaned VFs might trigger use-after-free crashes or hardware
faults, as the underlying PF hardware and driver context have been destroyed.

> +static int mana_sriov_configure(struct pci_dev *pdev, int numvfs)
> +{
> +	int err = 0;
> +
> +	dev_info(&pdev->dev, "Requested num VFs: %d\n", numvfs);
> +
> +	if (numvfs > 0)
> +		err = pci_enable_sriov(pdev, numvfs);
> +	else
> +		pci_disable_sriov(pdev);

[Severity: High]
Should this path check if the VFs are currently assigned to a guest VM before
disabling SR-IOV?

If a privileged user writes 0 to sriov_numvfs while VFs are actively assigned
to a VM (e.g., via vfio-pci), could forcefully removing the PCI devices cause
a regression resulting in host IOMMU faults and guest kernel panics?

Would it be better to check pci_vfs_assigned(pdev) here, or use the
pci_sriov_configure_simple() helper to safely enforce this?

> +
> +	return err ? err : numvfs;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260701180116.507690-1-haiyangz@linux.microsoft.com?part=1

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-07-02 19:02 UTC (permalink / raw)
  To: Paolo Abeni, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
	Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Breno Leitao,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
  Cc: Paul Rosswurm
In-Reply-To: <8906f758-27fe-4ea8-8558-6d15089372d1@redhat.com>



> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Thursday, July 2, 2026 4:57 AM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Konstantin Taranov <kotaranov@microsoft.com>;
> Simon Horman <horms@kernel.org>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Breno Leitao <leitao@debian.org>; linux-
> kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> Cc: Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next v4] net: mana: Add Interrupt
> Moderation support
> 
> On 6/29/26 11:36 PM, Haiyang Zhang wrote:
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 7438ea6b3f26..9391e9564605 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1591,6 +1591,9 @@ int mana_create_wq_obj(struct mana_port_context
> *apc,
> >
> >  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> >  			     sizeof(req), sizeof(resp));
> > +
> > +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> > +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> 
> Double checking the above is intentional; it feels strange to me that
> request and reply use different versions. Possibly a comment for future
> memory would make sense.

Yes, it's intentional. The request and reply versions can be different.
I will add comments.

> > +
> > +/* The caller must update apc->rx/tx_dim_enabled before disabling and
> > + * after enabling. And synchronize_net() before draining the DIM work,
> > + * so that NAPI cannot observe a stale flag.
> > + */
> > +int mana_dim_change(struct mana_cq *cq, bool enable)
> 
> This always return 0, and the return value is not checked by the
> callers; return type should likelly changed to void
Will update.

Thanks,
- Haiyang


^ permalink raw reply

* [PATCH net-next v5] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-07-02 22:01 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Shradha Gupta, Aditya Garg,
	Stanislav Fomichev, Breno Leitao, linux-kernel, linux-rdma
  Cc: paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Add Static and Dynamic Interrupt Moderation (DIM) support for
Rx and Tx.
Update queue creation procedure with new data struct with the related
settings.
Add functions to collect stat for DIM, and workers to update DIM data
and settings.
Update ethtool handler to get/set the moderation settings from a user.
To avoid detach/re-attach ops, ring DIM doorbell to change settings
at run time.
By default, adaptive-rx/tx (DIM) are enabled if supported by HW.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---
v5:
  Updated function return type and comments suggested by Paolo Abeni.

v4:
  Fixed tx stat, concurrency, and mb issues from Simon's review.

v3:
  Updated to avoid detach/re-attach ops as suggested by Paolo.

v2:
  Updated with comments from Jedrzej.

---
 drivers/net/ethernet/microsoft/Kconfig        |   1 +
 .../net/ethernet/microsoft/mana/gdma_main.c   |  29 +++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 175 ++++++++++++++++++
 .../ethernet/microsoft/mana/mana_ethtool.c    | 167 ++++++++++++++++-
 include/net/mana/gdma.h                       |  24 ++-
 include/net/mana/mana.h                       |  54 ++++++
 6 files changed, 441 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 3f36ee6a8ece..e9be18c92ca5 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -21,6 +21,7 @@ config MICROSOFT_MANA
 	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
 	depends on PCI_HYPERV
 	select AUXILIARY_BUS
+	select DIMLIB
 	select PAGE_POOL
 	select NET_SHAPER
 	help
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index e8b7ffb47eb9..aef3b77229c1 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /* Copyright (c) 2021, Microsoft Corporation. */
 
+#include <linux/bitfield.h>
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -466,6 +467,7 @@ static int mana_gd_disable_queue(struct gdma_queue *queue)
 #define DOORBELL_OFFSET_RQ	0x400
 #define DOORBELL_OFFSET_CQ	0x800
 #define DOORBELL_OFFSET_EQ	0xFF8
+#define DOORBELL_OFFSET_DIM	0x820
 
 static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 				  enum gdma_queue_type q_type, u32 qid,
@@ -506,6 +508,16 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 		addr += DOORBELL_OFFSET_SQ;
 		break;
 
+	case GDMA_DIM:
+		e.dim.id = qid;
+		e.dim.mod_usec = FIELD_GET(MANA_INTR_MODR_USEC_MAX, tail_ptr);
+		e.dim.mod_usec_vld = !!(tail_ptr & MANA_INTR_MODR_USEC_VLD);
+		e.dim.mod_comps = FIELD_GET(MANA_INTR_MODR_COMP_MASK, tail_ptr);
+		e.dim.mod_comps_vld = num_req;
+
+		addr += DOORBELL_OFFSET_DIM;
+		break;
+
 	default:
 		WARN_ON(1);
 		return;
@@ -540,6 +552,23 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
 }
 EXPORT_SYMBOL_NS(mana_gd_ring_cq, "NET_MANA");
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld)
+{
+	struct gdma_context *gc = cq->gdma_dev->gdma_context;
+	u32 dim_val;
+
+	/* Convert the DIM values to doorbell parameters */
+	dim_val = FIELD_PREP(MANA_INTR_MODR_USEC_MAX, mod_usec) |
+		  FIELD_PREP(MANA_INTR_MODR_COMP_MASK, mod_comps);
+	if (mod_usec_vld)
+		dim_val |= MANA_INTR_MODR_USEC_VLD;
+
+	mana_gd_ring_doorbell(gc, cq->gdma_dev->doorbell, GDMA_DIM, cq->id,
+			      dim_val, mod_comps_vld);
+}
+EXPORT_SYMBOL_NS(mana_gd_ring_dim, "NET_MANA");
+
 #define MANA_SERVICE_PERIOD 10
 
 static void mana_serv_rescan(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 7438ea6b3f26..5ce0b96c50f6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1591,6 +1591,15 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 
 	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
 			     sizeof(req), sizeof(resp));
+
+	/* Our driver uses different message versions for request and
+	 * response in this case.
+	 * Our firmware is forward compatible with newer message versions, so
+	 * the old firmware still properly handles this message, just the new
+	 * feature fields are ignored, and queue creation will be successful.
+	 */
+	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
+	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
 	req.vport = vport;
 	req.wq_type = wq_type;
 	req.wq_gdma_region = wq_spec->gdma_region;
@@ -1599,6 +1608,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 	req.cq_size = cq_spec->queue_size;
 	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
 	req.cq_parent_qid = cq_spec->attached_eq;
+	req.req_cq_moderation = cq_spec->req_cq_moderation;
+	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
+	req.cq_moderation_usec = cq_spec->cq_moderation_usec;
 
 	err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
 				sizeof(resp));
@@ -1856,6 +1868,7 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	struct gdma_posted_wqe_info *wqe_info;
 	unsigned int pkt_transmitted = 0;
 	unsigned int wqe_unit_cnt = 0;
+	unsigned int tx_bytes = 0;
 	struct mana_txq *txq = cq->txq;
 	struct mana_port_context *apc;
 	struct netdev_queue *net_txq;
@@ -1937,6 +1950,8 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 
 		mana_unmap_skb(skb, apc);
 
+		tx_bytes += skb->len;
+
 		napi_consume_skb(skb, cq->budget);
 
 		pkt_transmitted++;
@@ -1967,6 +1982,10 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
 		WARN_ON_ONCE(1);
 
+	/* Feed DIM with the completion rate observed here, in NAPI context. */
+	cq->tx_dim_pkts += pkt_transmitted;
+	cq->tx_dim_bytes += tx_bytes;
+
 	cq->work_done = pkt_transmitted;
 }
 
@@ -2318,6 +2337,117 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 		xdp_do_flush();
 }
 
+static void mana_rx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+static void mana_tx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+/* The caller must update apc->rx/tx_dim_enabled before disabling and
+ * after enabling. And synchronize_net() before draining the DIM work,
+ * so that NAPI cannot observe a stale flag.
+ */
+void mana_dim_change(struct mana_cq *cq, bool enable)
+{
+	bool is_rx = cq->type == MANA_CQ_TYPE_RX;
+	struct mana_port_context *apc;
+	work_func_t work_func;
+	u32 usec, comp;
+
+	if (is_rx) {
+		apc = netdev_priv(cq->rxq->ndev);
+		usec = apc->intr_modr_rx_usec;
+		comp = apc->intr_modr_rx_comp;
+		work_func = mana_rx_dim_work;
+	} else {
+		apc = netdev_priv(cq->txq->ndev);
+		usec = apc->intr_modr_tx_usec;
+		comp = apc->intr_modr_tx_comp;
+		work_func = mana_tx_dim_work;
+	}
+
+	/* On enable, zero the DIM state so net_dim() starts measuring from
+	 * scratch.
+	 * On disable, drain any pending DIM work and restore the static
+	 * moderation values.
+	 */
+	if (enable) {
+		memset(&cq->dim, 0, sizeof(cq->dim));
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+		INIT_WORK(&cq->dim.work, work_func);
+	} else {
+		cancel_work_sync(&cq->dim.work);
+		mana_gd_ring_dim(cq->gdma_cq, usec, true, comp, true);
+	}
+}
+
+static void mana_update_rx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
+	struct dim_sample dim_sample = {};
+	struct mana_rxq *rxq = cq->rxq;
+
+	/* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+	 * enable flag set guarantees the DIM (re)initialization is visible.
+	 */
+	if (!smp_load_acquire(&apc->rx_dim_enabled))
+		return;
+
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), rxq->stats.packets,
+			  rxq->stats.bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
+static void mana_update_tx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
+	struct dim_sample dim_sample = {};
+
+	/* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+	 * enable flag set guarantees the DIM (re)initialization is visible.
+	 */
+	if (!smp_load_acquire(&apc->tx_dim_enabled))
+		return;
+
+	/* cq->tx_dim_pkts/bytes are accumulated in mana_poll_tx_cq(), in the
+	 * same NAPI context as this read, so they track the hardware
+	 * completion rate and need no u64_stats_sync protection.
+	 */
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), cq->tx_dim_pkts,
+			  cq->tx_dim_bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
@@ -2336,6 +2466,15 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 	if (w < cq->budget) {
 		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
 		cq->work_done_since_doorbell = 0;
+
+		/* Update DIM before napi_complete_done() to prevent running
+		 * net_dim() concurrently.
+		 */
+		if (cq->type == MANA_CQ_TYPE_RX)
+			mana_update_rx_dim(cq);
+		else
+			mana_update_tx_dim(cq);
+
 		napi_complete_done(&cq->napi, w);
 	} else if (cq->work_done_since_doorbell >=
 		   (cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
@@ -2368,6 +2507,7 @@ static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
 
+	WRITE_ONCE(cq->dim_event_ctr, cq->dim_event_ctr + 1);
 	napi_schedule_irqoff(&cq->napi);
 }
 
@@ -2410,6 +2550,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
+			cancel_work_sync(&apc->tx_qp[i]->tx_cq.dim.work);
 			netif_napi_del_locked(napi);
 			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
@@ -2543,6 +2684,11 @@ static int mana_create_txq(struct mana_port_context *apc,
 		cq_spec.modr_ctx_id = 0;
 		cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+		/* DIM setting can be changed at runtime */
+		cq_spec.req_cq_moderation = true;
+		cq_spec.cq_moderation_usec = apc->intr_modr_tx_usec;
+		cq_spec.cq_moderation_comp = apc->intr_modr_tx_comp;
+
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
 					 &apc->tx_qp[i]->tx_object);
@@ -2573,6 +2719,13 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		set_bit(NAPI_STATE_NO_BUSY_POLL, &cq->napi.state);
 		netif_napi_add_locked(net, &cq->napi, mana_poll);
+
+		/* Initialize the DIM work before enabling NAPI, so that a poll
+		 * cannot reach net_dim() with an uninitialized cq->dim.work.
+		 */
+		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 		napi_enable_locked(&cq->napi);
 		txq->napi_initialized = true;
 
@@ -2610,6 +2763,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		napi_synchronize(napi);
 
 		napi_disable_locked(napi);
+		cancel_work_sync(&rxq->rx_cq.dim.work);
 		netif_napi_del_locked(napi);
 	}
 
@@ -2848,6 +3002,11 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	cq_spec.modr_ctx_id = 0;
 	cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+	/* DIM setting can be changed at runtime */
+	cq_spec.req_cq_moderation = true;
+	cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
+	cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;
+
 	err = mana_create_wq_obj(apc, apc->port_handle, GDMA_RQ,
 				 &wq_spec, &cq_spec, &rxq->rxobj);
 	if (err)
@@ -2880,6 +3039,12 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
 					   rxq->page_pool));
 
+	/* Initialize the DIM work before enabling NAPI, so that a poll
+	 * cannot reach net_dim() with an uninitialized cq->dim.work.
+	 */
+	INIT_WORK(&cq->dim.work, mana_rx_dim_work);
+	cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 	napi_enable_locked(&cq->napi);
 
 	mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
@@ -3546,6 +3711,16 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->link_cfg_error = 1;
 	apc->cqe_coalescing_enable = 0;
 
+	/* Initialize interrupt moderation settings if supported by HW */
+	if (gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION) {
+		apc->intr_modr_rx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_rx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->intr_modr_tx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_tx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->rx_dim_enabled = MANA_ADAPTIVE_RX_DEF;
+		apc->tx_dim_enabled = MANA_ADAPTIVE_TX_DEF;
+	}
+
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 881df597d7f9..9e31e2595ae3 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -419,6 +419,15 @@ static int mana_get_coalesce(struct net_device *ndev,
 	    !kernel_coal->rx_cqe_nsecs)
 		kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
 
+	ec->rx_coalesce_usecs = apc->intr_modr_rx_usec;
+	ec->rx_max_coalesced_frames = apc->intr_modr_rx_comp;
+
+	ec->tx_coalesce_usecs = apc->intr_modr_tx_usec;
+	ec->tx_max_coalesced_frames = apc->intr_modr_tx_comp;
+
+	ec->use_adaptive_rx_coalesce = apc->rx_dim_enabled;
+	ec->use_adaptive_tx_coalesce = apc->tx_dim_enabled;
+
 	return 0;
 }
 
@@ -428,9 +437,34 @@ static int mana_set_coalesce(struct net_device *ndev,
 			     struct netlink_ext_ack *extack)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
-	u8 saved_cqe_coalescing_enable;
+	struct {
+		u16 intr_modr_rx_usec;
+		u16 intr_modr_rx_comp;
+		u16 intr_modr_tx_usec;
+		u16 intr_modr_tx_comp;
+		u8 cqe_coalescing_enable;
+		bool rx_dim_enabled;
+		bool tx_dim_enabled;
+	} saved;
+	bool modr_changed = false;
+	bool dim_changed = false;
+	struct gdma_context *gc;
 	int err;
 
+	gc = apc->ac->gdma_dev->gdma_context;
+
+	/* Both static and dynamic interrupt moderation (DIM) rely on the
+	 * same HW capability advertised by the PF.
+	 */
+	if ((ec->use_adaptive_rx_coalesce || ec->use_adaptive_tx_coalesce ||
+	     ec->rx_coalesce_usecs || ec->tx_coalesce_usecs ||
+	     ec->rx_max_coalesced_frames || ec->tx_max_coalesced_frames) &&
+	    !(gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)) {
+		NL_SET_ERR_MSG(extack,
+			       "Interrupt Moderation is not supported by HW");
+		return -EOPNOTSUPP;
+	}
+
 	if (kernel_coal->rx_cqe_frames != 1 &&
 	    kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
 		NL_SET_ERR_MSG_FMT(extack,
@@ -440,18 +474,129 @@ static int mana_set_coalesce(struct net_device *ndev,
 		return -EINVAL;
 	}
 
-	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
+	    ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce usecs must be <= %lu",
+				   MANA_INTR_MODR_USEC_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
+	    ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce frames must be <= %lu",
+				   MANA_INTR_MODR_COMP_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
+	    ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
+	    ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
+	    ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
+		modr_changed = true;
+
+	saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
+	saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
+	saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
+	saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
+
+	apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
+	apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
+	apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
+	apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
+
+	if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
+	    !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
+		dim_changed = true;
+
+	saved.rx_dim_enabled = apc->rx_dim_enabled;
+	saved.tx_dim_enabled = apc->tx_dim_enabled;
+
+	saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
 	apc->cqe_coalescing_enable =
 		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
 
-	if (!apc->port_is_up)
+	if (!apc->port_is_up) {
+		WRITE_ONCE(apc->rx_dim_enabled, !!ec->use_adaptive_rx_coalesce);
+		WRITE_ONCE(apc->tx_dim_enabled, !!ec->use_adaptive_tx_coalesce);
 		return 0;
+	}
 
-	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
-	if (err)
-		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+	if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable) {
+		/* CQE coalescing setting is applied via RSS configuration. */
+		err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+		if (err) {
+			netdev_err(ndev, "Change CQE coalescing failed: %d\n",
+				   err);
+			apc->cqe_coalescing_enable =
+				saved.cqe_coalescing_enable;
+			apc->intr_modr_rx_usec = saved.intr_modr_rx_usec;
+			apc->intr_modr_rx_comp = saved.intr_modr_rx_comp;
+			apc->intr_modr_tx_usec = saved.intr_modr_tx_usec;
+			apc->intr_modr_tx_comp = saved.intr_modr_tx_comp;
+			return err;
+		}
+	}
 
-	return err;
+	if (modr_changed || dim_changed) {
+		bool new_rx_dim = !!ec->use_adaptive_rx_coalesce;
+		bool new_tx_dim = !!ec->use_adaptive_tx_coalesce;
+		bool disable_rx_dim = saved.rx_dim_enabled && !new_rx_dim;
+		bool disable_tx_dim = saved.tx_dim_enabled && !new_tx_dim;
+		bool enable_rx_dim = !saved.rx_dim_enabled && new_rx_dim;
+		bool enable_tx_dim = !saved.tx_dim_enabled && new_tx_dim;
+		int q;
+
+		/* On disable: clear the per-port flag first and
+		 * synchronize_net() so any in-flight NAPI poll observes
+		 * the new value and will not schedule further DIM work;
+		 * then drain pending work and restore the static
+		 * moderation values.
+		 */
+		if (disable_rx_dim)
+			WRITE_ONCE(apc->rx_dim_enabled, false);
+		if (disable_tx_dim)
+			WRITE_ONCE(apc->tx_dim_enabled, false);
+		if (disable_rx_dim || disable_tx_dim)
+			synchronize_net();
+
+		for (q = 0; q < apc->num_queues; q++) {
+			struct mana_cq *rx_cq = &apc->rxqs[q]->rx_cq;
+			struct mana_cq *tx_cq = &apc->tx_qp[q]->tx_cq;
+
+			if (disable_rx_dim)
+				mana_dim_change(rx_cq, false);
+			else if (enable_rx_dim)
+				mana_dim_change(rx_cq, true);
+			else if (!new_rx_dim && modr_changed)
+				mana_gd_ring_dim(rx_cq->gdma_cq,
+						 apc->intr_modr_rx_usec, true,
+						 apc->intr_modr_rx_comp, true);
+
+			if (disable_tx_dim)
+				mana_dim_change(tx_cq, false);
+			else if (enable_tx_dim)
+				mana_dim_change(tx_cq, true);
+			else if (!new_tx_dim && modr_changed)
+				mana_gd_ring_dim(tx_cq->gdma_cq,
+						 apc->intr_modr_tx_usec, true,
+						 apc->intr_modr_tx_comp, true);
+		}
+
+		/* Publish the enable flag with release semantics so a
+		 * concurrent NAPI poll that observes it set also sees the DIM
+		 * (re)init done by mana_dim_change() above.
+		 */
+		if (enable_rx_dim)
+			/* pairs with smp_load_acquire() in mana_update_rx_dim() */
+			smp_store_release(&apc->rx_dim_enabled, true);
+		if (enable_tx_dim)
+			/* pairs with smp_load_acquire() in mana_update_tx_dim() */
+			smp_store_release(&apc->tx_dim_enabled, true);
+	}
+
+	return 0;
 }
 
 /* mana_set_channels - change the number of queues on a port
@@ -595,7 +740,13 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
-	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES |
+				     ETHTOOL_COALESCE_RX_USECS |
+				     ETHTOOL_COALESCE_RX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_TX_USECS |
+				     ETHTOOL_COALESCE_TX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_TX,
 	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
 				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
 				  ETHTOOL_OP_NEEDS_RTNL_GLINK,
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 0c395917b214..8529cef0d7c4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -47,6 +47,7 @@ enum gdma_queue_type {
 	GDMA_RQ,
 	GDMA_CQ,
 	GDMA_EQ,
+	GDMA_DIM,
 };
 
 enum gdma_work_request_flags {
@@ -126,6 +127,17 @@ union gdma_doorbell_entry {
 		u64 tail_ptr	: 31;
 		u64 arm		: 1;
 	} eq;
+
+	struct {
+		u64 id           : 24;
+		u64 reserved     : 8;
+		u64 mod_usec     : 10;
+		u64 reserve1     : 5;
+		u64 mod_usec_vld : 1;
+		u64 mod_comps    : 8;
+		u64 reserve2     : 7;
+		u64 mod_comps_vld: 1;
+	} dim;
 }; /* HW DATA */
 
 struct gdma_msg_hdr {
@@ -502,6 +514,9 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
 
 int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld);
+
 struct gdma_wqe {
 	u32 reserved	:24;
 	u32 last_vbytes	:8;
@@ -650,6 +665,9 @@ enum {
 /* Driver supports self recovery on Hardware Channel timeouts */
 #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
 
+/* Driver supports dynamic interrupt moderation - DIM */
+#define GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(28)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -665,7 +683,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
+	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT | \
+	 GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
@@ -701,6 +720,9 @@ struct gdma_verify_ver_req {
 	u8 os_ver_str4[128];
 }; /* HW DATA */
 
+/* HW supports dynamic interrupt moderation - DIM */
+#define GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(15)
+
 struct gdma_verify_ver_resp {
 	struct gdma_resp_hdr hdr;
 	u64 gdma_protocol_ver;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 13c87baf018e..48f4445aa87a 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -4,6 +4,7 @@
 #ifndef _MANA_H
 #define _MANA_H
 
+#include <linux/dim.h>
 #include <net/xdp.h>
 #include <net/net_shaper.h>
 
@@ -64,6 +65,19 @@ enum TRI_STATE {
 /* Maximum number of packets per coalesced CQE */
 #define MANA_RXCOMP_OOB_NUM_PPI 4
 
+/* Default/max interrupt moderation settings */
+#define MANA_INTR_MODR_USEC_DEF 0
+#define MANA_INTR_MODR_COMP_DEF 0
+
+#define MANA_ADAPTIVE_RX_DEF true
+#define MANA_ADAPTIVE_TX_DEF true
+
+/* DIM doorbell value field layout */
+#define MANA_INTR_MODR_USEC_MAX    GENMASK(9, 0)
+#define MANA_INTR_MODR_USEC_VLD    BIT(15)
+#define MANA_INTR_MODR_COMP_MAX    GENMASK(7, 0)
+#define MANA_INTR_MODR_COMP_MASK   GENMASK(23, 16)
+
 /* Update this count whenever the respective structures are changed */
 #define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
@@ -297,6 +311,17 @@ struct mana_cq {
 	int work_done;
 	int work_done_since_doorbell;
 	int budget;
+
+	/* DIM - Dynamic Interrupt Moderation */
+	struct dim dim;
+	u16 dim_event_ctr;
+
+	/* Cumulative TX completions fed to DIM. Updated and read only in
+	 * NAPI context (mana_poll_tx_cq() / mana_update_tx_dim()), so they
+	 * measure the hardware completion rate and need no u64_stats_sync.
+	 */
+	u64 tx_dim_pkts;
+	u64 tx_dim_bytes;
 };
 
 struct mana_recv_buf_oob {
@@ -573,6 +598,15 @@ struct mana_port_context {
 	u8 cqe_coalescing_enable;
 	u32 cqe_coalescing_timeout_ns;
 
+	/* Interrupt moderation settings */
+	u16 intr_modr_rx_usec;
+	u16 intr_modr_rx_comp;
+	u16 intr_modr_tx_usec;
+	u16 intr_modr_tx_comp;
+
+	bool rx_dim_enabled;
+	bool tx_dim_enabled;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
@@ -598,6 +632,8 @@ int mana_alloc_queues(struct net_device *ndev);
 int mana_attach(struct net_device *ndev);
 int mana_detach(struct net_device *ndev, bool from_close);
 
+void mana_dim_change(struct mana_cq *cq, bool enable);
+
 int mana_probe(struct gdma_dev *gd, bool resuming);
 void mana_remove(struct gdma_dev *gd, bool suspending);
 
@@ -633,6 +669,9 @@ struct mana_obj_spec {
 	u32 queue_size;
 	u32 attached_eq;
 	u32 modr_ctx_id;
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
 };
 
 enum mana_command_code {
@@ -764,6 +803,15 @@ struct mana_create_wqobj_req {
 	u32 cq_size;
 	u32 cq_moderation_ctx_id;
 	u32 cq_parent_qid;
+
+	/* V2 */
+	u8 allow_rqwqe_chain;
+
+	/* V3 */
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 reserved2[2];
 }; /* HW DATA */
 
 struct mana_create_wqobj_resp {
@@ -771,6 +819,12 @@ struct mana_create_wqobj_resp {
 	u32 wq_id;
 	u32 cq_id;
 	mana_handle_t wq_obj;
+
+	/* V2 */
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 cq_moderation_enabled;
+	u8 reserved1[3];
 }; /* HW DATA */
 
 /* Destroy WQ Object */
-- 
2.34.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox