Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH] mshv: add vmbus dependency
From: Jork Loeser @ 2026-05-20 17:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Stanislav Kinsburskii,
	Arnd Bergmann, linux-hyperv, linux-kernel
In-Reply-To: <20260520074044.923728-1-arnd@kernel.org>

On Wed, 20 May 2026, Arnd Bergmann wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> When the vmbus driver is not part of the kernel, the mvhv_root
> driver now fails to link:
>
> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>
> Avoid this by adding an explicit Kconfig dependency. Note that
> stubbing out the hv_vmbus_exists() based on configuration would
> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/hv/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 52af086fdeb2..21193b571a80 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
> 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> 	# no particular order, making it impossible to reassemble larger pages
> 	depends on PAGE_SIZE_4KB
> +	depends on HYPERV_VMBUS
> 	select EVENTFD
> 	select VIRT_XFER_TO_GUEST_WORK
> 	select HMM_MIRROR
> -- 
> 2.39.5
>

Yes, this is the right short-term fix. We will need to solve the root case 
(no VMBUS required) with a separate SYNIC driver abstraction.

Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>

Thank you!
Jork


^ permalink raw reply

* Re: [PATCH] hyperv: move vmbus_root_device outside of the VMBus driver
From: Jork Loeser @ 2026-05-20 17:28 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-hyperv, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Stanislav Kinsburskii,
	Anirudh Rayabharam (Microsoft), Michael Kelley, Arnd Bergmann
In-Reply-To: <20260520154943.898386-1-hamzamahfooz@linux.microsoft.com>

On Wed, 20 May 2026, Hamza Mahfooz wrote:

> Conceptually, we shouldn't have to build the VMBus driver to answer the
> question "does VMBus exist?" So, move vmbus_root_device and the
> functions that access it directly to hyperv.h. Also, introduce
> hv_set_vmbus_root_device() to allow vmbus_drv to set the root device.
>
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Closes: https://lore.kernel.org/r/20260520074044.923728-1-arnd@kernel.org/
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
> drivers/hv/vmbus_drv.c | 27 ++++++---------------------
> include/linux/hyperv.h | 18 ++++++++++++++++--
> 2 files changed, 22 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 71f1b4d52f7f..b26e6b42fa49 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -47,9 +47,6 @@ struct vmbus_dynid {
> 	struct hv_vmbus_device_id id;
> };
>
> -/* VMBus Root Device */
> -static struct device  *vmbus_root_device;
> -
> static int hyperv_cpuhp_online;
>
> static DEFINE_PER_CPU(long, vmbus_evt);
> @@ -95,18 +92,6 @@ static struct resource *fb_mmio;
> static struct resource *hyperv_mmio;
> static DEFINE_MUTEX(hyperv_mmio_lock);
>
> -struct device *hv_get_vmbus_root_device(void)
> -{
> -	return vmbus_root_device;
> -}
> -EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device);
> -
> -bool hv_vmbus_exists(void)
> -{
> -	return vmbus_root_device != NULL;
> -}
> -EXPORT_SYMBOL_GPL(hv_vmbus_exists);
> -
> static u8 channel_monitor_group(const struct vmbus_channel *channel)
> {
> 	return (u8)channel->offermsg.monitorid / 32;
> @@ -903,7 +888,7 @@ static int vmbus_dma_configure(struct device *child_device)
> 	 * On x86/x64 coherence is assumed and these calls have no effect.
> 	 */
> 	hv_setup_dma_ops(child_device,
> -		device_get_dma_attr(vmbus_root_device) == DEV_DMA_COHERENT);
> +		device_get_dma_attr(hv_get_vmbus_root_device()) == DEV_DMA_COHERENT);
> 	return 0;
> }
>
> @@ -2172,7 +2157,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> 		     &child_device_obj->channel->offermsg.offer.if_instance);
>
> 	child_device_obj->device.bus = &hv_bus;
> -	child_device_obj->device.parent = vmbus_root_device;
> +	child_device_obj->device.parent = hv_get_vmbus_root_device();
> 	child_device_obj->device.release = vmbus_device_release;
>
> 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
> @@ -2561,7 +2546,7 @@ static int vmbus_acpi_add(struct platform_device *pdev)
> 	struct acpi_device *ancestor;
> 	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
>
> -	vmbus_root_device = &device->dev;
> +	hv_set_vmbus_root_device(&device->dev);
>
> 	/*
> 	 * Older versions of Hyper-V for ARM64 fail to include the _CCA
> @@ -2648,7 +2633,7 @@ static int vmbus_device_add(struct platform_device *pdev)
> 	struct device_node *np = pdev->dev.of_node;
> 	int ret;
>
> -	vmbus_root_device = &pdev->dev;
> +	hv_set_vmbus_root_device(&pdev->dev);
>
> 	ret = of_range_parser_init(&parser, np);
> 	if (ret)
> @@ -2969,7 +2954,7 @@ static int __init hv_acpi_init(void)
> 	if (ret)
> 		return ret;
>
> -	if (!vmbus_root_device) {
> +	if (!hv_vmbus_exists()) {
> 		ret = -ENODEV;
> 		goto cleanup;
> 	}
> @@ -3000,7 +2985,7 @@ static int __init hv_acpi_init(void)
>
> cleanup:
> 	platform_driver_unregister(&vmbus_platform_driver);
> -	vmbus_root_device = NULL;
> +	hv_set_vmbus_root_device(NULL);
> 	return ret;
> }
>
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 5459e776ec17..3e8ea7d9653c 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1302,9 +1302,23 @@ static inline void *hv_get_drvdata(struct hv_device *dev)
> 	return dev_get_drvdata(&dev->device);
> }
>
> -struct device *hv_get_vmbus_root_device(void);
> +/* Do *not* use this outside of hyperv.h */
> +static struct device *__vmbus_root_device __read_mostly;
>
> -bool hv_vmbus_exists(void);
> +static inline void hv_set_vmbus_root_device(struct device *dev)
> +{
> +	__vmbus_root_device = dev;
> +}
> +
> +static inline struct device *hv_get_vmbus_root_device(void)
> +{
> +	return __vmbus_root_device;
> +}
> +
> +static inline bool hv_vmbus_exists(void)
> +{
> +	return hv_get_vmbus_root_device();
> +}
>
> struct hv_ring_buffer_debug_info {
> 	u32 current_interrupt_mask;
> -- 
> 2.54.0
>

Thank you Hamza. Yes, one could do this. It still does not solve the 
problem of loading MSHV-root first, then VMBUS. The entire 
hv_vmbus_exists() test is a workaround that mostly works, but shouldn't be 
there in the first place. What really needs to happen is a SYNIC driver so 
components can load in arbitrary orders if compiled as modules. Which in 
turn is outside of my original goal to fix L1VH's kexec. The SYNIC 
abstraction needs to come no later than the root/nested capability.

Best,
Jork

^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Jason Gunthorpe @ 2026-05-20 17:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Michael Kelley, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	jacob.pan, tgopinath, easwar.hariharan
In-Reply-To: <lxmfd2ml5dafkxquuf5i45uqgh6wxl44hlqphu77kvximqrnmi@b3htoyjc6d5z>

On Thu, May 21, 2026 at 01:15:24AM +0800, Yu Zhang wrote:

> 2) Follow what Intel/AMD drivers do: find a single minimal
>    2^N-aligned block that covers the entire range, but may
>    over-flush.

If it is plugged straight into a HW invalidation I would do this..

Jason

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 17:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	pbonzini@redhat.com, kys@microsoft.com, decui@microsoft.com,
	daniel.lezcano@kernel.org, wei.liu@kernel.org,
	peterz@infradead.org, jgross@suse.com, boris.ostrovsky@oracle.com,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	mhklinux@outlook.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <949e39aec749f019b18fa41c2a42bcc9231288b9.camel@amazon.co.uk>

On Mon, May 18, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > 
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
> >  	static_call_update(pv_steal_clock, xen_steal_clock);
> >  	paravirt_set_sched_clock(xen_sched_clock);
> >  
> > -	x86_platform.calibrate_tsc = xen_tsc_khz;
> > +	tsc_register_calibration_routines(xen_tsc_khz, NULL);
> >  	x86_platform.get_wallclock = xen_get_wallclock;
> >  }
> >  
> 
> xen_tsc_khz() doesn't use CPUID but really *should*.
> 
> Care to pull in
> https://lore.kernel.org/all/20260509224824.3264567-31-dwmw2@infradead.org/
> to your next round please?
> 
> (Without the misplaced changes in kvm/x86.c that should have been in
> two different prior commits, and are now folded into those correctly in
> my kvmclock5 branch ready for the next posting of that).

Ya, will do.  What's one more patch...

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-20 17:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <7260682b21c28d1299e58400b9a2f4b8d23bd434.camel@infradead.org>

On Tue, May 19, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> > like this through the tip tree?
> > 
> > The bulk of the changes are in kvmclock and TSC, but pretty much every
> > hypervisor's guest-side code gets touched at some point.  I am reaonsably
> > confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> > v2, and while there were conflicts when rebasing, they were largely
> > superficial (and I've just jinxed myself).  For all other hypervisors, assume
> > the code is compile-tested only, but those changes are all quite small and
> > straightforward.
> > 
> > The only changes that are questionable/contentious are the last two patches,
> > which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> > AMD (that's the dubious part).  I very deliberately put them last, so that
> > they can be dropped at will (I don't care terribly if those patches land).
> > To merge them, I would want explicit Acks from Paolo and David W.
> > 
> > So, except for the last two patches, to get the stuff I really care about
> > landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> > reviews/acks?
> > 
> > The primary goal of this series is (or at least was, when I started) to
> > fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> > hypervisor is used instead of the secure/trusted TSC that is controlled by
> > trusted firmware.
> > 
> > The secondary goal is to draft off of the SNP and TDX changes to slightly
> > modernize running under KVM.  Currently, KVM guests will use TSC for
> > clocksource, but not sched_clock.  And they ignore Intel's CPUID-based TSC
> > and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> > And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> > guests can use that for the APIC timer period instead of manually calibrating
> > the frequency.
> > 
> > The tertiary goal is to clean up all of the PV clock code to deduplicate logic
> > across hypervisors, and to hopefully make it all easier to maintain going
> > forward.
> 
> I booted this in qemu with -cpu host,+invtsc,+vmware-cpuid-freq
> 
> I was expecting to see it eschew the kvmclock and use *only* the TSC.
> Is there even any need for 'tsc-early' given that it's *told* the TSC
> frequency in CPUID? Shouldn't it have detected that the TSC is known
> before init_tsc_clocksource() runs?
>
> And then it even spent some time at boot actually using the kvmclock as
> clocksource... when ideally I don't think it would even have *enabled*
> it at all?

Yeah, that's definitely the ideal state.  And I had all the same expectations and
observations as you when digging in and testing this.  But unless this series
makes things worse, I want punt on achieving the ideal state for the moment, as
it's proving to be a big lift just to get to a not-awful state.

> [    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.000000] tsc: Detected 2400.000 MHz processor
> [    0.008205] TSC deadline timer available
> [    0.008270] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [    0.159085] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> [    0.164074] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
> [    0.229087] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [    0.337095] clocksource: Switched to clocksource kvm-clock
> [    0.345246] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [    0.356201] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
> [    0.360560] clocksource: Switched to clocksource tsc
> 



^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Dipayaan Roy @ 2026-05-20 18:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260519225556.3939204-1-kuba@kernel.org>

On Tue, May 19, 2026 at 03:55:56PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Add NULL guards in teardown path to prevent panic on attach failure
>     
> When queue allocation fails partway through, the error cleanup frees
> and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
> pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
> and mana_dealloc_queues() to prevent NULL pointer dereferences.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9afc786b297a8..9e7e4bf526bf6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> >  	struct mana_rxq *rxq;
> >  	u32 rxq_idx;
> >  
> > -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > -		rxq = apc->rxqs[rxq_idx];
> > -		if (!rxq)
> > -			continue;
> > +	if (apc->rxqs) {
> >  
> > -		mana_destroy_rxq(apc, rxq, true);
> > -		apc->rxqs[rxq_idx] = NULL;
> > +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > +			rxq = apc->rxqs[rxq_idx];
> > +			if (!rxq)
> > +				continue;
> > +
> > +			mana_destroy_rxq(apc, rxq, true);
> > +			apc->rxqs[rxq_idx] = NULL;
> > +		}
> >  	}
> 
> If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
> how does the interface recover if the user subsequently brings it back up?
> 
> When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
> mana_alloc_queues() -> mana_add_rx_queues(), the code does:
> 
>     apc->rxqs[i] = rxq;
> 
> If apc->rxqs was left as NULL during the teardown, will this cause a NULL
> pointer dereference on the next interface bring-up?
> 
> [ ... ]
Hi Jakub,

The only path that recovers from this state is mana_attach(), which
calls mana_init_port() -> mana_init_port_context() to re-allocate
apc->rxqs before calling mana_alloc_queues().

When mana_open is invoked prior to it the rxqs would be already setup by
mana_probe_port -> mana_init_port -> mana_init_port_context.

> > @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> >  	 * number of queues.
> >  	 */
> >  
> > -	for (i = 0; i < apc->num_queues; i++) {
> > -		txq = &apc->tx_qp[i].txq;
> > -		tsleep = 1000;
> > -		while (atomic_read(&txq->pending_sends) > 0 &&
> > -		       time_before(jiffies, timeout)) {
> > -			usleep_range(tsleep, tsleep + 1000);
> > -			tsleep <<= 1;
> > -		}
> > -		if (atomic_read(&txq->pending_sends)) {
> > -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> > -			if (err) {
> > -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > -					   err, atomic_read(&txq->pending_sends),
> > -					   txq->gdma_txq_id);
> > +	if (apc->tx_qp) {
> 
> In the error path where queue creation fails partway through, will
> mana_destroy_txq() be called?
> 
> If mana_destroy_txq() unconditionally iterates over all apc->num_queues:
> 
>     for (i = 0; i < apc->num_queues; i++) {
>         ...
>         mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
>         ...
> 
> Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
> that were never fully initialized?
> 
> Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
> with these uninitialized handles, and could handle 0 corrupt the firmware
> state?
>
This is a known  exisiting issue in the cleaup path for partial tx
setups and was also mentioned in the a recent patch where the rx partial
init clean-up path was fixed. My colleague Aditya is already working on a
patch to fix this and all other issues in the tx cleanup path.
 
> > +		for (i = 0; i < apc->num_queues; i++) {
> > +			txq = &apc->tx_qp[i].txq;
> > +			tsleep = 1000;
> > +			while (atomic_read(&txq->pending_sends) > 0 &&
> > +			       time_before(jiffies, timeout)) {
> > +				usleep_range(tsleep, tsleep + 1000);
> > +				tsleep <<= 1;
> > +			}
> > +			if (atomic_read(&txq->pending_sends)) {
> > +				err =
> > +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));
> 
> If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
> internal GDMA state.
> 
> Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
> and mana_destroy_vport() are called. Will these functions attempt to send
> commands to the now-destroyed GDMA admin queue via mana_send_request() and
> wait for completions?
> 
> Could this lead to cascading timeouts since the hardware was just reset and
> cannot process admin commands?
> 
This is pre-existing behavior/issue the pcie_flr() path and the
subsequent calls to mana_disable_vport_rx()/mana_fence_rqs()/mana_destroy_vport()
are unchanged by this patch. My patch only adds a NULL guard for
apc->tx_qp to skip this entire TX drain block when tx_qp is already
freed. But this point seems to be valid issue, I will investigate this
further and I will send a different patch set for this as follow up work.


> > +				if (err) {
> > +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > +						   err,
> > +					    atomic_read(&txq->pending_sends),
> > +					    txq->gdma_txq_id);
> > +				}
> > +				break;
> >  			}
> > -			break;
> >  		}
> -- 
> pw-bot: cr

Regards
Dipayaan Roy


^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Dipayaan Roy @ 2026-05-20 18:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260519225559.3939235-1-kuba@kernel.org>

On Tue, May 19, 2026 at 03:55:59PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Skip redundant detach in queue reset handler if already detached
> 
> This commit adds a check for netif_device_present() in the queue reset
> handler. If the device is already detached, it skips the redundant
> mana_detach() call and proceeds directly to mana_attach().
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9e7e4bf526bf6..740f3c34db451 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
> [ ... ]
> > +	/* If already detached (indicates detach succeeded but attach failed
> > +	 * previously). Now skip mana detach and just retry mana_attach.
> > +	 */
> > +	if (!netif_device_present(ndev))
> > +		goto attach;
> > +
> >  	err = mana_detach(ndev, false);
> 
> The fix here correctly prevents mana_detach() from running on an already
> detached device. However, do identical issues remain in mana_change_mtu(),
> mana_set_channels(), mana_set_ringparam(), and mana_bpf()?
> 
> If mana_attach() previously failed (for example, due to out of memory),
> the port is left with apc->port_is_up = false, but it preserves
> apc->port_st_save = true, and calls netif_device_detach(ndev).
> 
> If an admin tries to recover the port by retrying a configuration command
> like changing the MTU or attaching XDP, these functions appear to
> unconditionally call mana_detach(ndev, false) without the
> !netif_device_present(ndev) check added here.
> 
> Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
> port_is_up is already false, port_st_save is overwritten to false,
> permanently erasing the intent that the port should be up.
> 
> The subsequent call to mana_attach() would then see port_st_save == false,
> skip mana_alloc_queues(), set port_is_up = false, and successfully call
> netif_device_attach(ndev).
> 
> Would this result in the configuration command returning 0 to userspace
> while the port queues were never allocated, leaving the interface silently
> broken until manually toggled down and up?

Hi Jakub,

I agree that mana_change_mtu() and mana_xdp_set() would also need the same
netif_device_present() guard (or alternatively, an early return
when the port is already detached). I'll investigate and address those in a
follow-up patch to keep this fix minimal, as this patch
specifically fixes the reset handler that runs without user intervention,
unlike the ohers cases mentioned above.

Regards
Dipayaan Roy




^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Jacob Pan @ 2026-05-20 18:27 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Jason Gunthorpe, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	mhklinux, tgopinath, easwar.hariharan, jacob.pan
In-Reply-To: <yqhb6vxoovscvfafgv6i6zn7uydpfxeff7hqmvbn6z7c2tjqp6@jn2vvtxqgfef>

Hi Yu,

On Wed, 20 May 2026 23:25:43 +0800
Yu Zhang <zhangyu1@linux.microsoft.com> wrote:

> > > +static const struct iommu_domain_ops
> > > hv_iommu_identity_domain_ops = {
> > > +	.attach_dev	= hv_iommu_attach_dev,
> > > +};
> > > +
> > > +static const struct iommu_domain_ops
> > > hv_iommu_blocking_domain_ops = {
> > > +	.attach_dev	= hv_iommu_attach_dev,
> > > +};  
> > 
> > Usually I would expect these to have their own attach
> > functions. blocking in particular must have an attach op that cannot
> > fail. It is used to recover the device back to a known translation
> > in case of cascading other errors.
> >   
> 
> For blocking domain, the hypercall handler of such attach essentially
> disable the translation and IOPF for the device.
I think this should disable all faults, including unrecoverable fault
reporting. right?

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: David Woodhouse @ 2026-05-20 18:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag32mwLvHpWn2vCt@google.com>

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On Wed, 2026-05-20 at 10:59 -0700, Sean Christopherson wrote:
> 
> > And then it even spent some time at boot actually using the kvmclock as
> > clocksource... when ideally I don't think it would even have *enabled*
> > it at all?
> 
> Yeah, that's definitely the ideal state.  And I had all the same expectations and
> observations as you when digging in and testing this.  But unless this series
> makes things worse, I want punt on achieving the ideal state for the moment, as
> it's proving to be a big lift just to get to a not-awful state.

Ack. Just checking my understanding was correct. Baby steps... 40 patches at a time :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: David Woodhouse @ 2026-05-20 18:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Kiryl Shutsemau,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <3cb683b4-973a-4b3e-a5d5-a8baa8a70eb0@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 907 bytes --]

On Sat, 2026-05-16 at 09:42 +0200, Paolo Bonzini wrote:
> On 5/15/26 21:19, Sean Christopherson wrote:
> > Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
> > doesn't restrict the usage to Intel CPUs.  This will allow sharing the
> > core logic with kvmclock, as (a) CPUID.0x16 may be enumerated alongside
> > kvmclock, and (b) KVM generally doesn't restrict CPUID based on vendor.
> 
> Even for native there's no real reason to restrict to Intel, I think.
> native_calibrate_tsc() only limits itself because historically (prior to 
> commit 604dc9170f24, "x86/tsc: Use CPUID.0x16 to calculate missing 
> crystal frequency", 2019-05-09) it used a hardcoded table of crystal 
> frequencies.
> 
> Of course paranoia applies, but for virtualization, if the leaf exists 
> there is no reason not to trust it.

Agreed.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: David Woodhouse @ 2026-05-20 18:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-42-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> If CPUID.0x16 is present and valid, use the CPU frequency provided by
> CPUID instead of assuming that the virtual CPU runs at the same
> frequency as TSC and/or kvmclock.  Back before constant TSCs were a
> thing, treating the TSC and CPU frequencies as one and the same was
> somewhat reasonable, but now it's nonsensical, especially if the
> hypervisor explicitly enumerates the CPU frequency.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 62c8ea2e6769..7607920ae386 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>  	}
>  }
>  
> +static unsigned long kvm_get_cpu_khz(void)
> +{
> +	unsigned int cpu_khz;
> +
> +	/*
> +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> +	 */
> +	if (!cpuid_get_cpu_freq(&cpu_khz))
> +		return cpu_khz;
> +
> +	return pvclock_tsc_khz(this_cpu_pvti());

I'm fine with this in principle but shouldn't the fallback be calling
kvm_get_tsc_khz() instead of directly calling pvclock_tsc_khz()?



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 01/41] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: David Woodhouse @ 2026-05-20 18:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-2-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 755 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Extract retrieval of TSC frequency information from CPUID into standalone
> helpers so that TDX guest support can reuse the logic.  Provide a version
> that includes the multiplier math as TDX does NOT want to use
> native_calibrate_tsc()'s fallback logic that derives the TSC frequency
> based on CPUID.0x16, when the core crystal frequency isn't known.
> 
> Opportunsitically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
> in favor of the kernel's preferred style.

"Opportunistically" ? Now that looks wrong too... 

> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* RE: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Michael Kelley @ 2026-05-20 19:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: sashiko-reviews@lists.linux.dev, linux-hyperv@vger.kernel.org
In-Reply-To: <ag3kFYLrypsBnlkY@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, May 20, 2026 9:41 AM
> 
> On Tue, May 19, 2026, Michael Kelley wrote:
> > From: Sean Christopherson <seanjc@google.com> Sent: Monday, May 18, 2026 3:18 PM
> > > > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > > > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > > > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> > > > >
> > > > >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> > > > >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > > > > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > > > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > > > > +
> > > > > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > > > > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > > > > +
> > > > > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > > > > +						  tsc_properties);
> > > > >  	}
> > > >
> > > > [ ... ]
> > > >
> > > > > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> > > > >  		 * is called.
> > > > >  		 */
> > > > >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > > > >  	}
> > > >
> > > > If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> > > > does it bypass the tsc_register_calibration_routines() block entirely?
> > >
> > > Yes.
> > >
> > > > Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> > > > here, it looks like these VMs will lose the reliable flag.
> > > >
> > > > Will this inadvertently enable the TSC watchdog, potentially causing a
> > > > performance regression if the system falsely marks the TSC as unstable due
> > > > to virtualization scheduling delays?
> > >
> > > Hmm, I was going to say that the change was intentional and desriable, but looking
> > > at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
> > > just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
> > > X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
> > > X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
> > > as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
> > > info, the host is still clearly telling the guest that the TSC is reliable.
> >
> > Yes, I agree. But I'm doubtful that such a combination ever occurs in practice.
> > I've never seen an occurrence of a Hyper-V (even really old versions) guest
> > where the frequency MSRs are not available or are not accessible. The
> > Hyper-V spec allows for either condition, so we have the code to test the
> > flags. I've thought of the flags as always set, though I suppose one never
> > knows what the future holds.
> >
> > >
> > > Michael, does keeping the
> > >
> > > 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > >
> > > but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
> > > sense?
> >
> > I don't see that it would break anything. But it seems a bit disjointed in
> > that HV_ACCESS_TSC_INVARIANT is tested in two places in
> > ms_hyperv_init_platform(). Does TSC_RELIABLE *need* to be passed to
> > tsc_register_calibration_routines() if later code in ms_hyperv_init_platform()
> > does setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)?
> 
> Sort of?  It's not strictly necessary, but passing in TSC_RELIABLE allows
> tsc_register_calibration_routines() to ensure it doesn't clobber a more robust
> calibration routine with a "lesser" routine.  I.e. not passing TSC_RELIABLE in
> this case would trigger a false positive (and break Hyper-V).
> 
> In other words, invoking setup_force_cpu_cap() is a (happy, desirable) side effect,
> not the primary goal.
> 
> > In other words, I'm suggesting let tsc_register_calibration_routines() handle
> > the TSC_FREQ_KNOWN case since that's what the calibration routines are all
> > about. Leave the setting of X86_FEATURE_TSC_RELIABLE to the later code that
> > tests HV_ACCESS_TSC_INVARIANT, instead of duplicating the
> > setup_force_cpu_cap() operation.
> >
> > While combining FREQUENCY_KNOWN and RELIABLE into
> > tsc_register_calibration_routines() is convenient, the two
> > concepts turn out to be independent when looking strictly at
> > the Hyper-V spec and code written to follow that spec.
> > Combining them into the same function ends up being clumsy
> 
> Yeah, it's a bit awkward for Hyper-V, but Hyper-V is definitely the odd one out
> here, in that it has an "out-of-band" feature that marks the TSC as reliable.
> All other PV features that trigger overrides of the calibration routines bundle
> the RELIABLE aspect with the feature itself.

Indeed, Hyper-V is definitely the odd one here. Keeping the
"setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)" in the later code
block where HV_ACCESS_TSC_INVARIANT is tested works well enough.
In the real world, the frequency MSRs are available and accessible,
so the additional "setup_force_cpu_cap()" is duplicative, but doesn't
hurt anything.

Michael

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 19:04 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-3-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 2333 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
> frequency calibration routines.  This will allow consolidating handling
> of common TSC properties that are forced by hypervisor (PV routines),
> and will also allow adding sanity checks to guard against overriding a
> TSC calibration routine with a routine that is less robust/trusted.
> 
> Make the CPU calibration routine optional, as Xen (very sanely) doesn't
> assume the CPU runs as the same frequency as the TSC.
> 
> Wrap the helper in an #ifdef to document that the kernel overrides
> the native routines when running as a VM, and to guard against unwanted
> usage.  Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
> depend on HYPERVISOR_GUEST because it gates both guest and host code.
> 
> No functional change intended.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Mildly concerned that we might want to support multiple options — does
it have CPUID 0x15? Does it have 0x40000x10? Does it have a pvclock?
There are various permutations of those which are perhaps best handled
by *trying* each one, in some order, and populating a struct with the
answers?

But on the basis that perfect is the enemy of good,

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index b5991d53fc0e..e9e7394140dd 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -321,8 +321,8 @@ void __init kvmclock_init(void)
>  	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>  	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>  
> -	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> -	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> +	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> +
>  	x86_platform.get_wallclock = kvm_get_wallclock;
>  	x86_platform.set_wallclock = kvm_set_wallclock;
>  #ifdef CONFIG_X86_LOCAL_APIC

Can we move those (and maybe everything in the context there too) up
*before* the check for no-kvmclock at the top of the function? Probably
in a separate patch.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-20 19:06 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <0a3aa07a1d3c4bec2b89f8026093969155b73caa.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > If CPUID.0x16 is present and valid, use the CPU frequency provided by
> > CPUID instead of assuming that the virtual CPU runs at the same
> > frequency as TSC and/or kvmclock.  Back before constant TSCs were a
> > thing, treating the TSC and CPU frequencies as one and the same was
> > somewhat reasonable, but now it's nonsensical, especially if the
> > hypervisor explicitly enumerates the CPU frequency.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index 62c8ea2e6769..7607920ae386 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >  	}
> >  }
> >  
> > +static unsigned long kvm_get_cpu_khz(void)
> > +{
> > +	unsigned int cpu_khz;
> > +
> > +	/*
> > +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> > +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> > +	 */
> > +	if (!cpuid_get_cpu_freq(&cpu_khz))
> > +		return cpu_khz;
> > +
> > +	return pvclock_tsc_khz(this_cpu_pvti());
> 
> I'm fine with this in principle but shouldn't the fallback be calling
> kvm_get_tsc_khz() instead of directly calling pvclock_tsc_khz()?

Oh, yeah, for this patch, definitely yes, so that there's no side effects.  The
question really should be answered in the context of "x86/kvmclock: Obtain TSC
frequency from CPUID if present", which subtly impacts the CPU frequency, but I
think the answer is "yes" there as well.

^ permalink raw reply

* Re: [PATCH v3 03/41] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: David Woodhouse @ 2026-05-20 19:06 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-4-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 653 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the code to mark the TSC as reliable from sme_early_init() to
> snp_secure_tsc_init().  The only reader of TSC_RELIABLE is the aptly
> named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
> after snp_secure_tsc_init().
> 
> This will allow consolidating the handling of TSC_KNOWN_FREQ and
> TSC_RELIABLE when overriding the TSC calibration routine.
> 
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>


Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 04/41] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
From: David Woodhouse @ 2026-05-20 19:16 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-5-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the check on having a Secure TSC to the common tsc_early_init() so
> that it's obvious that having a Secure TSC is conditional, and to prepare
> for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
> logic looks especially weird).
> 
> No functional change intended.
> 
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-20 19:26 UTC (permalink / raw)
  To: Yu Zhang, Jason Gunthorpe
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <lxmfd2ml5dafkxquuf5i45uqgh6wxl44hlqphu77kvximqrnmi@b3htoyjc6d5z>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> 
> On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > +					  unsigned long start,
> > > +					  unsigned long end)
> > > +{
> > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > +	u16 count = 0;
> > > +
> > > +	while (nr_pages > 0) {
> > > +		unsigned long flush_pages;
> > > +		int order;
> > > +		unsigned long pfn_align;
> > > +		unsigned long size_align;
> > > +
> > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > +			break;
> > > +		}
> > > +
> > > +		if (start_pfn)
> > > +			pfn_align = __ffs(start_pfn);
> > > +		else
> > > +			pfn_align = BITS_PER_LONG - 1;
> > > +
> > > +		size_align = __fls(nr_pages);
> > > +		order = min(pfn_align, size_align);
> > > +		iova_list[count].page_mask_shift = order;
> > > +		iova_list[count].page_number = start_pfn;
> > > +
> > > +		flush_pages = 1UL << order;
> > > +		start_pfn += flush_pages;
> > > +		nr_pages -= flush_pages;
> > > +		count++;
> > > +	}
> >
> > This seems like a really silly hypervisor interface. Why doesn't it
> > just accept a normal range? Splitting it into power of two aligned
> > ranges is very inefficient.
> 
> Fair point. I'm not sure how much flexibility we have to change
> this hypercall interface at the moment - it predates the pvIOMMU
> work and may have other consumers beyond Linux guest. On the other
> hand, having the guest specify 2^N-aligned blocks does save the
> hypervisor from having to decompose ranges itself before issuing
> hardware invalidation commands - the guest-provided entries can be
> fed to the HW more or less directly.
> 
> That said, the way I'm currently using this interface may be
> more precise than necessary. Maybe we have 2 options:
> 
> 1) Current approach: decompose the range into multiple exact
>    2^N-aligned blocks with no over-flush, but at the cost of
>    more complex calculations and more entries.
> 
> 2) Follow what Intel/AMD drivers do: find a single minimal
>    2^N-aligned block that covers the entire range, but may
>    over-flush.
> 
> Any preference?
> 
> @Michael, since you've also been reviewing this patch, I'd
> appreciate your thoughts on the above as well. :)
> 

I'm just guessing, but perhaps flushing an aligned power-of-2
range can be processed by the hypervisor at a relatively fixed
cost, regardless of the size. Having the guest do the decomposing
of an arbitrary range allows the hypervisor to make use of the
existing "rep" hypercall mechanism if the hypercall is taking
"too long". The hypervisor can pause its processing, return to
the guest temporarily, and then continue the hypercall. If the
arbitrary range were passed into the hypercall for the hypervisor
to do the decomposing, that pause-and-restart mechanism
wouldn't be available.

Of course, Linux doesn't really take advantage of the pause to
reduce guest interrupt latency because the Hyper-V code in
Linux typically disable interrupts around a hypercall due to the
way the hypercall input page is allocated. But other guest
operating systems might benefit from such a pause. And we could
probably fix the Hyper-V code in Linux to allow interrupts during a
hypercall pause/restart if long-running hypercalls turn out to be
a problem.

Regarding proposal (1) vs. (2), perhaps you could confirm with
the Hyper-V team that flushing an aligned power-of-2 range
has relatively fixed cost, regardless of the size. And what do the
flush requests coming from the generic IOMMU subsystem look
like? Do they match dma_unmap() ranges, which are probably
dominated by relatively small ranges of a few pages at most,
with a few outliers for disk I/O requests of 1 MiB or some such?
If the dominant flush request is for a few pages at most, then
doing (2) seems reasonable.

Michael

^ permalink raw reply

* Re: [PATCH v3 05/41] x86/tdx: Override PV calibration routines with CPUID-based calibration
From: David Woodhouse @ 2026-05-20 19:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-6-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly override the TSC frequency
> calibration routine with CPUID-based calibration instead of potentially
> relying on a hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15
> is always emulated by the TDX-Module, i.e. the information from CPUID is
> more trustworthy than the information provided by the hypervisor.
> 
> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency.  While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when that latter is enumerated via
> CPUID:
> 
>   The APIC timer frequency will be the processor’s bus clock or core
>   crystal clock frequency (when TSC/core crystal clock ratio is enumerated
>   in CPUID leaf 0x15).
> 
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.
> 
> Deliberately leave the CPU frequency calibration routine as is, since the
> TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
> 
> Opportunistically add a comment explaining that CoCo TSC initialization
> needs to come after hypervisor specific initialization.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I don't much like stuffing the lapic_timer_period... but I'll give you
'least awful option'. For now.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: David Woodhouse @ 2026-05-20 20:01 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-7-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Mark the TSC frequency as known when using ACRN's PV CPUID information.
> Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> the TSC freq is explicitly provided by the hypervisor.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/acrn.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index c1506cb87d8c..2da3de4d470e 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
>  	/* Install system interrupt handler for ACRN hypervisor callback */
>  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
>  
> +	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	tsc_register_calibration_routines(acrn_get_tsc_khz,
>  					  acrn_get_tsc_khz);

I'd feel slightly happier doing that from within acrn_get_tsc_khz()
itself.... which I note is 'static inline'. I'm vaguely surprised that
even builds (although it does). Probably nicer to move it explicitly
out of line in acrn.c though.

Most of that should be a comment on patch  2 of the series, I guess?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Jacob Pan @ 2026-05-20 20:40 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Yu Zhang, Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com, jacob.pan
In-Reply-To: <SN6PR02MB4157C1EC7F5F69C5ABDA9C7FD4012@SN6PR02MB4157.namprd02.prod.outlook.com>

Hi Michael,

On Wed, 20 May 2026 19:26:24 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Michael Kelley <mhklinux@outlook.com>
> To: Yu Zhang <zhangyu1@linux.microsoft.com>, Jason Gunthorpe
> <jgg@ziepe.ca> CC: "linux-kernel@vger.kernel.org"
> <linux-kernel@vger.kernel.org>,  "linux-hyperv@vger.kernel.org"
> <linux-hyperv@vger.kernel.org>,  "iommu@lists.linux.dev"
> <iommu@lists.linux.dev>, "linux-pci@vger.kernel.org"
> <linux-pci@vger.kernel.org>, "linux-arch@vger.kernel.org"
> <linux-arch@vger.kernel.org>, "wei.liu@kernel.org"
> <wei.liu@kernel.org>,  "kys@microsoft.com" <kys@microsoft.com>,
> "haiyangz@microsoft.com"  <haiyangz@microsoft.com>,
> "decui@microsoft.com" <decui@microsoft.com>,  "longli@microsoft.com"
> <longli@microsoft.com>, "joro@8bytes.org"  <joro@8bytes.org>,
> "will@kernel.org" <will@kernel.org>,  "robin.murphy@arm.com"
> <robin.murphy@arm.com>, "bhelgaas@google.com"  <bhelgaas@google.com>,
> "kwilczynski@kernel.org" <kwilczynski@kernel.org>,
> "lpieralisi@kernel.org" <lpieralisi@kernel.org>, "mani@kernel.org"
> <mani@kernel.org>, "robh@kernel.org" <robh@kernel.org>,
> "arnd@arndb.de"  <arnd@arndb.de>, "jacob.pan@linux.microsoft.com"
> <jacob.pan@linux.microsoft.com>, "tgopinath@linux.microsoft.com"
> <tgopinath@linux.microsoft.com>,
> "easwar.hariharan@linux.microsoft.com"
> <easwar.hariharan@linux.microsoft.com> Subject: RE: [PATCH v1 4/4]
> iommu/hyperv: Add page-selective IOTLB flush  support Date: Wed, 20
> May 2026 19:26:24 +0000
> 
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May
> 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:  
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:  
> > > > +static inline u16 hv_iommu_fill_iova_list(union
> > > > hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}  
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't
> > > it just accept a normal range? Splitting it into power of two
> > > aligned ranges is very inefficient.  
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> >   
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disable interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
I am not sure if this pause feature is suitable for IOTLB flush at all
since it is inherently synchronous — the caller must block until all
invalidations complete. Pausing mid-flush to return to the guest
doesn't help if the guest can't make forward progress anyway.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 20:44 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <44e0d60548d317fd59895f18bd17220dfb2f834b.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index b5991d53fc0e..e9e7394140dd 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -321,8 +321,8 @@ void __init kvmclock_init(void)
> >  	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> >  	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> >  
> > -	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> > -	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> > +	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> > +
> >  	x86_platform.get_wallclock = kvm_get_wallclock;
> >  	x86_platform.set_wallclock = kvm_set_wallclock;
> >  #ifdef CONFIG_X86_LOCAL_APIC
> 
> Can we move those (and maybe everything in the context there too) up
> *before* the check for no-kvmclock at the top of the function?

Oof, I was going to say "no", but disabling kvmclock is exactly the workaround
I've told people to use to get the kernel to use the TSC instead of kvmclock.

> Probably in a separate patch.

Ya.  I think this?

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 08ee4bc304c8..92a1ebf31e4d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,6 +27,7 @@ static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init;
 static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
+static unsigned int kvm_tsc_khz_cpuid __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
 {
@@ -207,7 +208,7 @@ static unsigned long kvm_get_tsc_khz(void)
                lapic_timer_period = apic_khz * 1000 / HZ;
 #endif
 
-       return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
+       return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());
 }
 
 static unsigned long kvm_get_cpu_khz(void)
@@ -387,9 +388,39 @@ void __init kvmclock_init(void)
        enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
        bool stable = false;
 
-       if (!kvm_para_available() || !kvmclock)
+       if (!kvm_para_available())
                return;
 
+       /*
+        * If the TSC counts at a constant frequency across P/T states, counts
+        * in deep C-states, and the TSC hasn't been marked unstable, treat the
+        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
+        * exists purely to honor the TSC being marked unstable via command
+        * line, any runtime detection of an unstable will happen after this.
+        */
+       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+           !check_tsc_unstable())
+               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+
+       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
+
+       /*
+        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
+        * PV CPUID leaf even if kvmclock itself is disabled via command line.
+        * The PV CPUID information isn't dependent on kvmclock in any way, and
+        * in fact using the precise information is *more* important when the
+        * user has explicitly disabled kvmclock to force the kernel to use the
+        * TSC as its clocksource.
+        */
+       if (!kvmclock) {
+               if (kvm_tsc_khz_cpuid)
+                       tsc_register_calibration_routines(kvm_get_tsc_khz,
+                                                         kvm_get_cpu_khz,
+                                                         tsc_properties);
+               return;
+       }
+
        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
                msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
                msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
@@ -424,21 +455,14 @@ void __init kvmclock_init(void)
        }
 
        /*
-        * If the TSC counts at a constant frequency across P/T states, counts
-        * in deep C-states, and the TSC hasn't been marked unstable, prefer
-        * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
-        * that TSC is chosen as the clocksource.  Note, the TSC unstable check
-        * exists purely to honor the TSC being marked unstable via command
-        * line, any runtime detection of an unstable will happen after this.
+        * If the TSC is reliable (see above), prefer the TSC over kvmclock for
+        * sched_clock and drop kvmclock's rating so that TSC is chosen as the
+        * clocksource.
         */
-       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-           !check_tsc_unstable()) {
+       if (tsc_properties & TSC_RELIABLE)
                kvm_clock.rating = 299;
-               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
-       } else {
+       else
                kvm_sched_clock_init(stable);
-       }
 
        tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
                                          tsc_properties);

^ permalink raw reply related

* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: Sean Christopherson @ 2026-05-20 21:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <c347d65f555ad1e10a0e87ee57c5879c7046d0e7.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Mark the TSC frequency as known when using ACRN's PV CPUID information.
> > Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> > the TSC freq is explicitly provided by the hypervisor.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/acrn.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> > index c1506cb87d8c..2da3de4d470e 100644
> > --- a/arch/x86/kernel/cpu/acrn.c
> > +++ b/arch/x86/kernel/cpu/acrn.c
> > @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
> >  	/* Install system interrupt handler for ACRN hypervisor callback */
> >  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
> >  
> > +	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> >  	tsc_register_calibration_routines(acrn_get_tsc_khz,
> >  					  acrn_get_tsc_khz);
> 
> I'd feel slightly happier doing that from within acrn_get_tsc_khz()
> itself....

Ya, though as you note below, this is really a comment on the overall series.

Waiting to set the cap until the calibration routine is actually run does prevent
the case where the something registers a calibration routine, but its routine is
never run.

However, because the cap is sticky, it doesn't handle the scenario where its
routine _does_ run, but the kernel ultimately throws away its calibration in favor
of something else.

Further complicating things is that ~half of the paravirt flows already force set
caps before their routines are invoked:

  snp_secure_tsc_init(), jailhouse_init_platform(), ms_hyperv_init_platform(),
  vmware_set_capabilities()

Rather than trying to convince everyone that waiting is better, despite that
approach still being flawed, I chose to handle this by ensuring that once the TSC
is marked known and/or reliable, the kernel won't replace the calibration routine
with a "lesser" source:

  x86/tsc: Rejects attempts to override TSC calibration with lesser routine

That doesn't completely prevent the kernel from being stupid, but it should prevent
both the case where X's routine is registered but never run, as well as the case
where it's run but ultimately ignored.

> which I note is 'static inline'. I'm vaguely surprised that even builds
> (although it does).

Heh, KVM x86 (in the host) does stupid things like this too.  E.g. kvm_pdptr_read()
is a static inline, but then wired up as a function pointer in three different
places.

> Probably nicer to move it explicitly out of line in acrn.c though.
> 
> Most of that should be a comment on patch  2 of the series, I guess?

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-20 21:06 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515195934.C790BC2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index 62c8ea2e67691..7607920ae386e 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >  	}
> >  }
> >  
> > +static unsigned long kvm_get_cpu_khz(void)
> > +{
> > +	unsigned int cpu_khz;
> > +
> > +	/*
> > +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> > +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> > +	 */
> > +	if (!cpuid_get_cpu_freq(&cpu_khz))
> > +		return cpu_khz;
> > +
> > +	return pvclock_tsc_khz(this_cpu_pvti());
> 
> Before this patch, both calibrate_cpu and calibrate_tsc used
> kvm_get_tsc_khz(), which checks kvm_para_tsc_khz() to get the exact TSC
> frequency from KVM's PV CPUID timing leaf, and only falls back to
> pvclock_tsc_khz() if it is unavailable.
> 
> Does the fallback path in kvm_get_cpu_khz() bypass kvm_para_tsc_khz()
> entirely? For VMs that expose KVM's PV CPUID timing leaf but not
> CPUID.0x16, it looks like cpu_khz will now be reverse-calculated from
> pvclock multipliers instead of using the exact hypervisor-provided value.
> 
> Could this introduce a precision regression due to truncation loss, causing
> cpu_khz and tsc_khz to needlessly diverge on the fallback path?

David pointed this out too, I'll fix in the next version.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4dMc2B3JQi4vxU@google.com>

[-- Attachment #1: Type: text/plain, Size: 2384 bytes --]

On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> 
> +       /*
> +        * If the TSC counts at a constant frequency across P/T states, counts
> +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> +        * exists purely to honor the TSC being marked unstable via command
> +        * line, any runtime detection of an unstable will happen after this.
> +        */
> +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +           !check_tsc_unstable())
    { 
> +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;

    kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
                     *purely* to work around a TSC which *doesn't*
                     have those properties checked above. */
    }

I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
that *too* for paranoia? But hopefully the checks you have above are
equivalent?

> +
> +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> +
> +       /*
> +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> +        * in fact using the precise information is *more* important when the
> +        * user has explicitly disabled kvmclock to force the kernel to use the
> +        * TSC as its clocksource.
> +        */
> +       if (!kvmclock) {
> +               if (kvm_tsc_khz_cpuid)
> +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> +                                                         kvm_get_cpu_khz,
> +                                                         tsc_properties);
> +               return;
> +       }
> +


Regardless of the above, why not just register these here
unconditionally, and remove the later call that does the same?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox