* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-07-01 14:29 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <20260615173314.677c33a8@kernel.org>
On Mon, Jun 15, 2026 at 05:33:14PM -0700, Jakub Kicinski wrote:
> On Mon, 15 Jun 2026 17:21:54 -0700 Dipayaan Roy wrote:
> > On Mon, Jun 15, 2026 at 01:42:47PM -0700, Jakub Kicinski wrote:
> > > On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> > > > Just a gentle ping on this series. The approach was agreed upon, and it
> > > > has picked up a few Reviewed-by tags as well.
> > > >
> > > > Please let me know if you need anything else from me, or if I should
> > > > resend it to collect the tags.
> > >
> > > Don't recall now what the exact sequence was but pretty sure this
> > > no longer applied after some other mana series was merged.
> >
> > I see, the net-next is closed now, I will rebase and resend this
> > once it opens on June 29th.
>
> Sorry for not flagging this sooner, IDK how it escaped the reply.
> Maybe some mix of Jake's comments plus it not being applicable
> later.
>
> Not to deflect blame but y'all should coordinate better, the "no longer
> applies" situation happens in mana a lot more often than with other
> drivers :(
Hi Jakub,
I have rebased and sent a v11:
https://lore.kernel.org/all/20260701141808.461554-1-dipayanroy@linux.microsoft.com/
Thank you for all the support.
Regards
Dipayaan Roy
^ permalink raw reply
* Re: [PATCH net-next v11 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Maciej Fijalkowski @ 2026-07-01 14:45 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <20260701141808.461554-1-dipayanroy@linux.microsoft.com>
On Wed, Jul 01, 2026 at 07:15:44AM -0700, Dipayaan Roy wrote:
> On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
> fragments for allocation in the RX refill path (~2kB buffer per fragment)
> causes 15-20% throughput regression under high connection counts
> (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> platforms shows no regression and restores line-rate performance.
>
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.
>
> This series adds an ethtool private flag "full-page-rx" to let the
> user opt in to one RX buffer per page:
>
> ethtool --set-priv-flags eth0 full-page-rx on
>
> There is no behavioral change by default. The flag can be persisted
> via udev rule for affected platforms.
Were you able to track down the actual bottleneck on the 'broken'
platform? How does the full-page approach perform on healthy
platforms? In the changelog below you mention that the frag approach
'outperforms' the full-page one.
>
> This series depends on the following fixes now merged in net-next:
> commit 17bfe0a8c014 ("net: mana: Add NULL guards in teardown path to prevent panic on attach failure")
> commit 5b05aa36ee24 ("net: mana: Skip redundant detach on already-detached port")
>
> Changes in v11:
> - Rebased on net-next
> Changes in v10:
> - Rebased on net-next which now includes the prerequisite fixes.
> - Recovery logic in mana_set_priv_flags() leverages the idempotent
> mana_detach() from the merged fixes.
> Changes in v9:
> - Added correct tree.
> Changes in v8:
> - Fixed queue_reset_work recovery by restoring port_is_up before
> scheduling reset so the handler can properly re-attach.
> - Simplified "err && schedule_port_reset" to "schedule_port_reset".
> Changes in v7:
> - Rebased onto net-next.
> - Retained private flag approach after David Wei's testing on
> Grace (ARM64) confirmed that fragment mode outperforms
> full-page mode on other platforms, validating this is a
> single-platform workaround rather than a generic issue.
> Changes in v6:
> - Added missed maintainers.
> Changes in v5:
> - Split prep refactor into separate patch (patch 1/2)
> Changes in v4:
> - Dropping the smbios string parsing and add ethtool priv flag
> to reconfigure the queues with full page rx buffers.
> Changes in v3:
> - changed u8* to char*
> Changes in v2:
> - separate reading string index and the string, remove inline.
>
> Dipayaan Roy (2):
> net: mana: refactor mana_get_strings() and mana_get_sset_count() to
> use switch
> net: mana: force full-page RX buffers via ethtool private flag
>
> drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++-
> .../ethernet/microsoft/mana/mana_ethtool.c | 178 +++++++++++++++---
> include/net/mana/mana.h | 8 +
> 3 files changed, 177 insertions(+), 31 deletions(-)
>
> --
> 2.43.0
>
>
^ permalink raw reply
* Re: [PATCH] mshv: fix hv_input_get_system_property struct
From: Wei Liu @ 2026-07-01 16:11 UTC (permalink / raw)
To: Hamza Mahfooz
Cc: wei.liu, Linux on Hyper-V List, stable, K. Y. Srinivasan,
Haiyang Zhang, Dexuan Cui, Long Li, open list
In-Reply-To: <akRIKl4wTuZVw/t0@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Tue, Jun 30, 2026 at 06:50:18PM -0400, Hamza Mahfooz wrote:
> On Tue, Jun 30, 2026 at 02:57:54PM -0700, wei.liu@kernel.org wrote:
> > From: Wei Liu <wei.liu@kernel.org>
> >
> > Keep it in sync with the correct definition.
> >
> > The old code worked by chance.
> >
> > Cc: stable@kernel.org
>
> Any idea how far back this goes? Also, does it require a check on the
> hypervisor version, or was it always wrong?
This should go as far as possible. The upstream version has always been
wrong.
We cannot check versions. Version numbers are not reliable indicators.
Wei
>
> > Signed-off-by: Wei Liu <wei.liu@kernel.org>
> > ---
> > include/hyperv/hvhdk_mini.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > index b4cb2fa26e9b..035ba20870f7 100644
> > --- a/include/hyperv/hvhdk_mini.h
> > +++ b/include/hyperv/hvhdk_mini.h
> > @@ -184,8 +184,9 @@ enum hv_dynamic_processor_feature_property {
> >
> > struct hv_input_get_system_property {
> > u32 property_id; /* enum hv_system_property */
> > + u32 reserved;
> > union {
> > - u32 as_uint32;
> > + u64 as_uint64;
> > #if IS_ENABLED(CONFIG_X86)
> > /* enum hv_dynamic_processor_feature_property */
> > u32 hv_processor_feature;
> > --
> > 2.53.0
> >
^ permalink raw reply
* [PATCH v1 0/4] drm/hyperv: A fix and a few cleanups
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-01 17:05 UTC (permalink / raw)
To: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Deepak Rawat
Cc: linux-hyperv, dri-devel, linux-kernel
Hello,
while working on a tree-wide cleanup I found a few issues in the
drm/hyperv driver that are addressed here.
While the first patch is a fix, the issue is so old (from 2021, included
in v5.14-rc1) that applying the series during the next merge window is
probably the right choice.
Best regards
Uwe
Uwe Kleine-König (The Capable Hub) (4):
drm/hyperv: Unregister pci driver in error path before module unload
drm/hyperv: Explicitly set subvendor and subdevice for pci match array
drm/hyperv: Drop useless empty remove callback
drm/hyperv: Move MODULE_DEVICE_TABLE to the device_id arrays
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
base-commit: be5c93fa674f0fc3c8f359c2143abce6bbb422e6
--
2.55.0.11.g153666a7d9bb
^ permalink raw reply
* [PATCH v1 1/4] drm/hyperv: Unregister pci driver in error path before module unload
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-01 17:05 UTC (permalink / raw)
To: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Deepak Rawat
Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <cover.1782925276.git.u.kleine-koenig@baylibre.com>
The pci driver must not be kept registered if the module is unloaded after
vmbus_driver_register() fails. So check the return value of
vmbus_driver_register() and unregister the pci driver on failure.
Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index 20f35c48c0b8..2e75fb793495 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -249,7 +249,11 @@ static int __init hv_drm_init(void)
if (ret != 0)
return ret;
- return vmbus_driver_register(&hv_drm_hv_driver);
+ ret = vmbus_driver_register(&hv_drm_hv_driver);
+ if (ret)
+ pci_unregister_driver(&hv_drm_pci_driver);
+
+ return ret;
}
static void __exit hv_drm_exit(void)
--
2.55.0.11.g153666a7d9bb
^ permalink raw reply related
* [PATCH v1 2/4] drm/hyperv: Explicitly set subvendor and subdevice for pci match array
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-01 17:05 UTC (permalink / raw)
To: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <cover.1782925276.git.u.kleine-koenig@baylibre.com>
.subvendor and .subdevice were set to 0 implicitly, so only devices with
these two values set to 0 in hardware can probe automatically. Make this
requirement explicit.
While touching this array item, also make use of the pci macro designed
for that case.
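The matching rule behind this can be sketched in userspace. This is a simplified model, not the PCI core's code: a field of a match entry matches if it is the wildcard or equals the device's value. The struct and function names are hypothetical, `ANY_ID` stands in for the kernel's `PCI_ANY_ID`, and the class/class_mask comparison is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu	/* stand-in for the kernel's PCI_ANY_ID (~0) */

/* Hypothetical, trimmed-down stand-in for struct pci_device_id. */
struct id_entry {
	uint32_t vendor, device, subvendor, subdevice;
};

/* Hypothetical, trimmed-down stand-in for the IDs held in struct pci_dev. */
struct dev_ids {
	uint16_t vendor, device, subsystem_vendor, subsystem_device;
};

/* A table field matches if it is the wildcard or equals the device's value. */
static bool field_matches(uint32_t want, uint16_t have)
{
	return want == ANY_ID || want == have;
}

/* All four ID fields must match; a 0 in the table is a literal 0, not a wildcard. */
static bool id_matches(const struct id_entry *id, const struct dev_ids *dev)
{
	assert(id && dev);
	return field_matches(id->vendor, dev->vendor) &&
	       field_matches(id->device, dev->device) &&
	       field_matches(id->subvendor, dev->subsystem_vendor) &&
	       field_matches(id->subdevice, dev->subsystem_device);
}
```

In this model an entry that leaves subvendor and subdevice at 0 matches only a device whose subsystem IDs are both 0, while an entry using the wildcard for both matches any subsystem IDs.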
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index 2e75fb793495..e766d87b7a9d 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -51,8 +51,8 @@ static void hv_drm_pci_remove(struct pci_dev *pdev)
static const struct pci_device_id hv_drm_pci_tbl[] = {
{
- .vendor = PCI_VENDOR_ID_MICROSOFT,
- .device = PCI_DEVICE_ID_HYPERV_VIDEO,
+ PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
+ 0, 0),
},
{ /* end of list */ }
};
--
2.55.0.11.g153666a7d9bb
^ permalink raw reply related
* [PATCH v1 3/4] drm/hyperv: Drop useless empty remove callback
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-01 17:05 UTC (permalink / raw)
To: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <cover.1782925276.git.u.kleine-koenig@baylibre.com>
Having an empty remove callback is equivalent to no remove callback.
(The only minor difference is that with an empty remove callback
pm_runtime_get_sync() and pm_runtime_put_noidle() are called.)
Drop this useless function.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index e766d87b7a9d..e3f41336a831 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -45,10 +45,6 @@ static int hv_drm_pci_probe(struct pci_dev *pdev,
return 0;
}
-static void hv_drm_pci_remove(struct pci_dev *pdev)
-{
-}
-
static const struct pci_device_id hv_drm_pci_tbl[] = {
{
PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
@@ -64,7 +60,6 @@ static struct pci_driver hv_drm_pci_driver = {
.name = KBUILD_MODNAME,
.id_table = hv_drm_pci_tbl,
.probe = hv_drm_pci_probe,
- .remove = hv_drm_pci_remove,
};
static int hv_drm_setup_vram(struct hv_drm_device *hv,
--
2.55.0.11.g153666a7d9bb
^ permalink raw reply related
* [PATCH v1 4/4] drm/hyperv: Move MODULE_DEVICE_TABLE to the device_id arrays
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-01 17:05 UTC (permalink / raw)
To: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <cover.1782925276.git.u.kleine-koenig@baylibre.com>
It matches the usual coding style to have the MODULE_DEVICE_TABLE macro
directly after the respective arrays.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index e3f41336a831..6a28048f687b 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -52,6 +52,7 @@ static const struct pci_device_id hv_drm_pci_tbl[] = {
},
{ /* end of list */ }
};
+MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
/*
* PCI stub to support gen1 VM.
@@ -219,6 +220,7 @@ static const struct hv_vmbus_device_id hv_drm_vmbus_tbl[] = {
{HV_SYNTHVID_GUID},
{}
};
+MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
static struct hv_driver hv_drm_hv_driver = {
.name = KBUILD_MODNAME,
@@ -260,8 +262,6 @@ static void __exit hv_drm_exit(void)
module_init(hv_drm_init);
module_exit(hv_drm_exit);
-MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
-MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Deepak Rawat <drawat.floss@gmail.com>");
MODULE_DESCRIPTION("DRM driver for Hyper-V synthetic video device");
--
2.55.0.11.g153666a7d9bb
^ permalink raw reply related
* RE: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for sriov configure
From: Haiyang Zhang @ 2026-07-01 17:54 UTC (permalink / raw)
To: Bjorn Helgaas, Leon Romanovsky
Cc: Haiyang Zhang, Paul Rosswurm, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Bjorn Helgaas, Simon Horman,
Shradha Gupta, Dipayaan Roy, Erni Sri Satya Vennela,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <20260513190509.GA328362@bhelgaas>
> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Wednesday, May 13, 2026 3:05 PM
> To: Leon Romanovsky <leon@kernel.org>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>; Haiyang Zhang
> <haiyangz@linux.microsoft.com>; Paul Rosswurm <paulros@microsoft.com>;
> linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; Simon
> Horman <horms@kernel.org>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; linux-kernel@vger.kernel.org; linux-
> pci@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for
> sriov configure
>
> On Wed, May 13, 2026 at 09:47:49PM +0300, Leon Romanovsky wrote:
> > On Fri, May 08, 2026 at 06:10:29PM -0500, Bjorn Helgaas wrote:
> > > On Fri, May 08, 2026 at 10:47:14PM +0000, Haiyang Zhang wrote:
> > > > > On Fri, May 08, 2026 at 03:04:06PM -0700, Haiyang Zhang wrote:
> > > > > > From: Haiyang Zhang <haiyangz@microsoft.com>
> > > > > >
> > > > > > Add callback function for the pci_driver, sriov_configure.
> > > > > >
> > > > > > Also disable VF autoprobe when it runs as PF driver on bare
> metal,
> > > > > > since the hardware side may not have the VF ready immediately.
> > > > > >
> > > > > > Export pci_vf_drivers_autoprobe() so the driver can toggle the
> VF
> > > > > > autoprobe flag.
> > > > >
> > > > > Technically pci_vf_drivers_autoprobe() doesn't *toggle* the
> autoprobe
> > > > > flag. That would mean setting it to the opposite of its current
> > > > > value.
> > > > >
> > > > > Here I would say "so the driver can prevent autoprobing of the
> VFs",
> > > > > which is the intent.
> > > > Thanks, I will change the wording.
> > > >
> > > > >
> > > > > Out of curiosity, how do the VFs eventually get probed? I guess
> > > > > there's some other mechanism that tells you when they're ready,
> and
> > > > > you manually use sysfs 'sriov_drivers_autoprobe' to enable
> probing,
> > > > > then bind drivers to them via sysfs?
> > > > We have a user program talking to the Azure backplane to get that
> information.
> > > > @Paul Rosswurm, do you have more details?
> > > >
> > > >
> > > > > The prevention of autoprobing sounds like a critical part of this
> > > > > change; might be worth saying something in the subject, because
> "add
> > > > > sriov configure" doesn't include much information.
> > > > How about "Add handler for sriov configure with VF autoprobe off"?
> > >
> > > OK by me :)
> >
> > I believe it is the wrong decision to allow toggling a user‑visible knob
> > without the user’s awareness. In this case, they can either disable
> > autoprobe on the PF or rely on EPROBE_DEFER. In all cases, the same
> > functionality can be achieved without changing PCI autoprobe code.
>
> OK, Haiyang, can you drop my ack please? If Leon's solutions don't
> work for you, continue this conversation and we can explore
> alternatives.
Sure, I will submit an updated patch without changing VF autoprobe.
Thanks,
- Haiyang
^ permalink raw reply
* [PATCH net-next v2] net: mana: Add handler for sriov configure
From: Haiyang Zhang @ 2026-07-01 18:01 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Aditya Garg, Shradha Gupta, linux-kernel
Cc: paulros
From: Haiyang Zhang <haiyangz@microsoft.com>
Add the sriov_configure callback function to the pci_driver.
It asks the NIC to provide a certain number of VFs, or disables
VFs if the requested number is zero.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v2:
No longer change VF autoprobe as discussed with Leon Romanovsky and Bjorn Helgaas.
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..0b7380fd1da8 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -2446,6 +2446,20 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
pci_disable_device(pdev);
}
+static int mana_sriov_configure(struct pci_dev *pdev, int numvfs)
+{
+ int err = 0;
+
+ dev_info(&pdev->dev, "Requested num VFs: %d\n", numvfs);
+
+ if (numvfs > 0)
+ err = pci_enable_sriov(pdev, numvfs);
+ else
+ pci_disable_sriov(pdev);
+
+ return err ? err : numvfs;
+}
+
static const struct pci_device_id mana_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_MICROSOFT, MANA_PF_DEVICE_ID) },
{ PCI_DEVICE(PCI_VENDOR_ID_MICROSOFT, MANA_PF2_DEVICE_ID) },
@@ -2461,6 +2475,7 @@ static struct pci_driver mana_driver = {
.suspend = mana_gd_suspend,
.resume = mana_gd_resume,
.shutdown = mana_gd_shutdown,
+ .sriov_configure = mana_sriov_configure,
};
static int __init mana_driver_init(void)
--
2.34.1
^ permalink raw reply related
* [PATCH v5 00/51] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
The primary goal of this series is to fix flaws with SNP and TDX guests where a
PV clock provided by the untrusted hypervisor is used instead of the secure
TSC that is controlled by trusted firmware.
The secondary goal is to modernize running under KVM. Currently, KVM guests will
use TSC for clocksource, but not sched_clock. And Linux-as-a-KVM-guest doesn't
support paravirt enumeration of the TSC/APIC frequencies, even though QEMU
provides that information by default.
The tertiary goal is to clean up the PV clock code to deduplicate logic across
hypervisors, and to hopefully make it all easier to maintain going forward.
The quaternary goal is to clean up the TSC calibration code, which was made
stupidly hard to follow by hypervisor code mixing in with the native
calibration routines, instead of being implemented as a pure alternative.
Note, the VMware and Xen changes still probably should get acks from those
maintainers, as my understanding of what they're trying to do may be flawed.
Lots more background on the SNP/TDX motivation:
https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com
As before, I deliberately omitted jailhouse-dev@googlegroups.com from the To/Cc,
as those emails bounced on v1, AFAICT nothing has changed.
v5:
- Use cpu_feature_enabled() instead of boot_cpu_has(). [Boris]
- WARN if recalibrate_cpu_khz() runs on a system with TSC_KNOWN_FREQ. [Thomas]
- Opportunistically drop a line break in native_calibrate_tsc(). [Thomas]
- Rely on callers of cpuid_get_tsc_info() to check the result instead of
unnecessarily zeroing the structure. [Boris]
- Ignore tsc_early_khz if the TSC frequency is provided by trusted firmware
or by the hypervisor. [Thomas, Sashiko]
- Cache CPUID output in acrn_init_platform() to avoid introducing a transient
bug where TSC_KNOWN_FREQ could be set even if the ACRN hypervisor didn't
actually provide the frequency. [Sashiko]
- Drop kvmclock's useless/dead check_tsc_unstable() call (it occurs before the
command line parameter is parsed). [Sashiko]
- Add helpers to set lapic_timer_period, to fix not-so-theoretical overflow
in the various "khz * 1000 / HZ" patterns. [Sashiko]
- Drop the "x86/xen: Obtain TSC frequency from CPUID if present" patch as it
doesn't have any dependencies/conflicts on/with this series, and Sashiko had
concerns about the assumptions it was making. [Sashiko]
- Collect reviews. [David] (Kirill's got dropped because the patch he reviewed
got completely rewritten).
v4:
- Use x86_init_noop() to skip save/restore on VMware and Xen instead of
nullifying x86_platform.{save,restore}_sched_clock_state. [Sashiko]
- Use '0' to indicate "failure" when getting the CPU frequency from CPUID, to
avoid using an out-param and thus make it all but impossible to
unintentionally clobber the global cpu_khz (which v3 did). [Sashiko]
- Rename cpuid_get_cpu_freq() => __cpu_khz_from_cpuid() to capture its
relationship with cpu_khz_from_cpuid().
- Compute lapic_timer_period in units of ticks, not Khz. [Sashiko]
- Kill off x86_platform_ops.calibrate_{cpu,tsc}(), and instead use dedicated
hooks for hypervisor code, and direct calls for TDX and SNP. [David, loosely]
- Drop SNP's secure TSC override of _CPU_ calibration, as there's zero
evidence it's justified or a net positive.
- Collect reviews/acks. [David, Wei]
- Decouple getting TSC/APIC frequencies from KVM PV CPUID from kvmclock. [David]
- Fix an amusing number of Opportunistically misspellings. [David]
- Set kvm_sched_clock_offset _before_ registering kvmclock as sched_clock,
and add a comment to guard against future goofs. [Sashiko]
- Keep "setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)" in Hyper-V's handling
of HV_ACCESS_TSC_INVARIANT, as it's technically possible to have a VM
with HV_ACCESS_TSC_INVARIANT but not HV_ACCESS_FREQUENCY_MSRS. Though as
a _very_ nice side effect of using dedicated sequencing for selecting the
TSC frequency source, this would have naturally happened anyways. [Sashiko]
v3:
- https://lore.kernel.org/all/20260515191942.1892718-1-seanjc@google.com
- Collect reviews. [Michael, Thomas]
- Use Hyper-V reference counter / refcounter instead of Hyper-V timer. [Michael]
- Use the paravirt CPUID interface first proposed by VMware for KVM's
"official" mechanism for communicating frequency to KVM-aware guests,
instead of abusing Intel's CPUID leafs. [David]
- Deal with paravirt code being moved into asm/timers.h and
arch/x86/kernel/tsc.c.
v2:
- https://lore.kernel.org/all/Z8YWttWDtvkyCtdJ@google.com
- Add struct to hold the TSC CPUID output. [Boris]
- Don't pointlessly inline the TSC CPUID helpers. [Boris]
- Fix a variable goof in a helper, hopefully for real this time. [Dan]
- Collect reviews. [Nikunj]
- Override the sched_clock save/restore hooks if and only if a PV clock
is successfully registered.
- During resume, restore clocksources before reading persistent time.
- Clean up more warts created by kvmclock.
- Fix more bugs in kvmclock's suspend/resume handling.
- Try to harden kvmclock against future bugs.
v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com
David Woodhouse (2):
KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC
and Bus)
x86/kvm: Obtain TSC frequency from PV CPUID if present
Sean Christopherson (49):
x86/apic: Provide helpers to set local APIC timer period in hz and khz
x86/apic: Add CONFIG_X86_LOCAL_APIC=n stubs for
apic_set_timer_period_{,k}hz()
x86/tsc: Ensure that TSC recalibration doesn't run if TSC frequency is
known
x86/tsc: Restrict recalibrate_cpu_khz() export to p4-clockmod and
powernow-k7
x86/sev: Mark TSC as reliable when configuring Secure TSC
x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
x86/sev: Shove SNP's secure/trusted TSC frequency directly into
"calibration"
x86/tsc: Add a standalone helper for getting TSC info from CPUID.0x15
x86/tdx: Force TSC frequency with CPUID-based info provided by the
TDX-Module
x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU
frequencies
x86/acrn: Register TSC/CPU frequency callbacks iff frequency is
actually in CPUID
x86/acrn: Mark TSC frequency as known when using ACRN for calibration
x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() =>
native_calibrate_cpu_late()
x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
x86/kvmclock: Drop dead check on TSC being unstable during
kvmclock_init()
x86/kvm: Mark TSC as reliable when it's constant and nonstop
x86/tsc: Add standalone helper for getting CPU frequency from CPUID
x86/kvm: Get CPU base frequency from CPUID when it's available
clocksource: hyper-v: Register sched_clock save/restore iff it's
necessary
clocksource: hyper-v: Drop wrappers to sched_clock save/restore
helpers
clocksource: hyper-v: Don't save/restore TSC offset when using HV
sched_clock
x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
x86/paravirt: Remove unnecessary PARAVIRT=n stub for
paravirt_set_sched_clock()
x86/paravirt: Move handling of unstable PV clocks into
paravirt_set_sched_clock()
x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
x86/paravirt: Pass sched_clock save/restore helpers during
registration
x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
x86/pvclock: Mark setup helpers and related various as
__init/__ro_after_init
x86/pvclock: WARN if pvclock's valid_flags are overwritten
x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
kvmclock_init()
timekeeping: Resume clocksources before reading persistent clock
x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
sched_clock
x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
x86/paravirt: Mark __paravirt_set_sched_clock() as __init
x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
TSC
x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for
documentation
x86/paravirt: Move using_native_sched_clock() stub into timer.h
x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
.../admin-guide/kernel-parameters.txt | 5 +
Documentation/virt/kvm/x86/cpuid.rst | 12 +
arch/x86/coco/sev/core.c | 21 +-
arch/x86/coco/tdx/tdx.c | 19 +-
arch/x86/include/asm/acrn.h | 5 -
arch/x86/include/asm/apic.h | 5 +-
arch/x86/include/asm/kvm_para.h | 12 +-
arch/x86/include/asm/sev.h | 4 +-
arch/x86/include/asm/tdx.h | 2 +
arch/x86/include/asm/timer.h | 15 +-
arch/x86/include/asm/tsc.h | 10 +-
arch/x86/include/asm/x86_init.h | 8 +-
arch/x86/include/uapi/asm/kvm_para.h | 11 +
arch/x86/kernel/apic/apic.c | 12 +-
arch/x86/kernel/cpu/acrn.c | 14 +-
arch/x86/kernel/cpu/mshyperv.c | 70 +-----
arch/x86/kernel/cpu/vmware.c | 19 +-
arch/x86/kernel/jailhouse.c | 9 +-
arch/x86/kernel/kvm.c | 101 ++++++--
arch/x86/kernel/kvmclock.c | 208 +++++++++++------
arch/x86/kernel/pvclock.c | 9 +-
arch/x86/kernel/tsc.c | 218 +++++++++++-------
arch/x86/kernel/tsc_msr.c | 4 +-
arch/x86/kernel/x86_init.c | 2 -
arch/x86/mm/mem_encrypt_amd.c | 3 -
arch/x86/xen/time.c | 14 +-
drivers/clocksource/hyperv_timer.c | 38 ++-
include/clocksource/hyperv_timer.h | 2 -
kernel/time/timekeeping.c | 9 +-
29 files changed, 540 insertions(+), 321 deletions(-)
base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply
* [PATCH v5 01/51] x86/apic: Provide helpers to set local APIC timer period in hz and khz
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Add and use APIs to set the local APIC timer period instead of open coding
the subtle HZ math in all external callers, and make lapic_timer_period
local to apic.c. Provide APIs to specify the frequency in both hertz and
kilohertz so that Hyper-V and VMware code aren't forced to lose precision.
Opportunistically use mul_u64_u32_div() to harden against the possibility
that the period in kHz is greater than 4294967, i.e. if the APIC timer runs
at ~4.29 GHz. As pointed out by Sashiko, 4294968 * 1000 == 0x1_000002c0,
and thus a kHz period of 4294968 would silently overflow the 32-bit
unsigned integer used by most callers.
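To make the overflow concrete, here is a minimal userspace sketch. It is not
kernel code: period_open_coded() and period_widened() are hypothetical names,
and the widened variant only stands in for mul_u64_u32_div() by doing the
multiply in 64 bits (the kernel helper, as I understand it, uses a wider
intermediate product). The tick rate is passed in as a parameter instead of
using the kernel's HZ.

```c
#include <assert.h>
#include <stdint.h>

/* Old open-coded pattern: the multiply is done in 32 bits and can wrap. */
uint32_t period_open_coded(uint32_t khz, uint32_t hz)
{
	assert(hz != 0);
	return khz * 1000u / hz;
}

/* Widened multiply; only a userspace stand-in for mul_u64_u32_div(). */
uint64_t period_widened(uint64_t khz, uint32_t hz)
{
	assert(hz != 0);
	return khz * 1000u / hz;
}

/* Usage: 4294967 kHz still fits in 32 bits, 4294968 kHz wraps to 0x2c0. */
void period_selfcheck(void)
{
	assert(period_open_coded(4294967u, 1000u) == 4294967u);
	assert(period_open_coded(4294968u, 1000u) == 0x2c0u / 1000u);
	assert(period_widened(4294968u, 1000u) == 4294968u);
}
```

With an assumed tick rate of 1000, the wrapped product 0x2c0 divides down to
zero, which callers would treat as "not calibrated" instead of failing loudly.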
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/apic.h | 3 ++-
arch/x86/kernel/apic/apic.c | 12 +++++++++++-
arch/x86/kernel/cpu/mshyperv.c | 5 +----
arch/x86/kernel/cpu/vmware.c | 4 +---
arch/x86/kernel/jailhouse.c | 2 +-
arch/x86/kernel/tsc.c | 2 +-
arch/x86/kernel/tsc_msr.c | 2 +-
7 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9cd493d467d4..cd84a94688a2 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -63,7 +63,6 @@ extern int apic_verbosity;
extern int local_apic_timer_c2_ok;
extern bool apic_is_disabled;
-extern unsigned int lapic_timer_period;
extern enum apic_intr_mode_id apic_intr_mode;
enum apic_intr_mode_id {
@@ -138,6 +137,8 @@ void register_lapic_address(unsigned long address);
extern void setup_boot_APIC_clock(void);
extern void setup_secondary_APIC_clock(void);
extern void lapic_update_tsc_freq(void);
+extern void apic_set_timer_period_hz(u64 period_hz, const char *source);
+extern void apic_set_timer_period_khz(u64 period_khz, const char *source);
#ifdef CONFIG_X86_64
static inline bool apic_force_enable(unsigned long addr)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index aa1e19979aa8..8d3d930576fd 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -176,7 +176,7 @@ static struct resource lapic_resource = {
};
/* Measured in ticks per HZ. */
-unsigned int lapic_timer_period = 0;
+static unsigned int lapic_timer_period;
static void apic_pm_activate(void);
@@ -796,6 +796,16 @@ bool __init apic_needs_pit(void)
return lapic_timer_period == 0;
}
+void apic_set_timer_period_khz(u64 period_khz, const char *source)
+{
+ lapic_timer_period = mul_u64_u32_div(period_khz, 1000, HZ);
+}
+
+void apic_set_timer_period_hz(u64 period_hz, const char *source)
+{
+ lapic_timer_period = div_u64(period_hz, HZ);
+}
+
static int __init calibrate_APIC_clock(void)
{
struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 185d4f677ec0..87beecec76f0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -646,10 +646,7 @@ static void __init ms_hyperv_init_platform(void)
u64 hv_lapic_frequency;
rdmsrq(HV_X64_MSR_APIC_FREQUENCY, hv_lapic_frequency);
- hv_lapic_frequency = div_u64(hv_lapic_frequency, HZ);
- lapic_timer_period = hv_lapic_frequency;
- pr_info("Hyper-V: LAPIC Timer Frequency: %#x\n",
- lapic_timer_period);
+ apic_set_timer_period_hz(hv_lapic_frequency, "Hyper-V hypervisor");
}
register_nmi_handler(NMI_UNKNOWN, hv_nmi_unknown, NMI_FLAG_FIRST,
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 34b73573b108..36f779dd311d 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -424,9 +424,7 @@ static void __init vmware_platform_setup(void)
#ifdef CONFIG_X86_LOCAL_APIC
/* Skip lapic calibration since we know the bus frequency. */
- lapic_timer_period = ecx / HZ;
- pr_info("Host bus clock speed read from hypervisor : %u Hz\n",
- ecx);
+ apic_set_timer_period_hz(ecx, "VMware hypervisor");
#endif
} else {
pr_warn("Failed to get TSC freq from the hypervisor\n");
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index f58ce9220e0f..f2d4ef89c085 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -65,7 +65,7 @@ static void jailhouse_get_wallclock(struct timespec64 *now)
static void __init jailhouse_timer_init(void)
{
- lapic_timer_period = setup_data.v1.apic_khz * (1000 / HZ);
+ apic_set_timer_period_khz(setup_data.v1.apic_khz, "Jailhouse hypervisor");
}
static unsigned long jailhouse_get_tsc(void)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce10ae4b298b..f9ecc9256863 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -717,7 +717,7 @@ unsigned long native_calibrate_tsc(void)
* lapic_timer_period here to avoid having to calibrate the APIC
* timer later.
*/
- lapic_timer_period = crystal_khz * 1000 / HZ;
+ apic_set_timer_period_khz(crystal_khz, "CPUID 0x15/0x16");
#endif
return crystal_khz * ebx_numerator / eax_denominator;
diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 48e6cc1cb017..7e990871e041 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -211,7 +211,7 @@ unsigned long cpu_khz_from_msr(void)
pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index);
#ifdef CONFIG_X86_LOCAL_APIC
- lapic_timer_period = (freq * 1000) / HZ;
+ apic_set_timer_period_khz(freq, "MSR_FSB_FREQ");
#endif
/*
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 02/51] x86/apic: Add CONFIG_X86_LOCAL_APIC=n stubs for apic_set_timer_period_{,k}hz()
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Add stubs for the apic_set_timer_period_{,k}hz() APIs when the kernel is
built without support for a local APIC, and drop #ifdefs in callers that
don't need to check CONFIG_X86_LOCAL_APIC for other reasons.
No functional change intended.
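The stub pattern can be sketched in isolation as follows. This is an
illustration only: DEMO_HAVE_LOCAL_APIC is a hypothetical stand-in for
CONFIG_X86_LOCAL_APIC, and the demo_* names and the fixed divisor are
assumptions, not the kernel's definitions. The point is that the caller
compiles unchanged whether or not the feature is built in.

```c
#include <assert.h>
#include <stdint.h>

#ifdef DEMO_HAVE_LOCAL_APIC
/* "Real" implementation, built only when the feature is enabled. */
static unsigned int demo_timer_period;

static inline void demo_set_timer_period_hz(uint64_t hz, const char *source)
{
	(void)source;
	demo_timer_period = (unsigned int)(hz / 1000u);
}
#else
/* Empty stub: same signature, no effect, so callers need no #ifdef. */
static inline void demo_set_timer_period_hz(uint64_t hz, const char *source)
{
	(void)hz;
	(void)source;
}
#endif

/* Caller is identical in both configurations. */
unsigned long demo_platform_setup(uint64_t bus_hz)
{
	demo_set_timer_period_hz(bus_hz, "demo hypervisor");
	return (unsigned long)(bus_hz / 1000u);
}

void demo_stub_selfcheck(void)
{
	assert(demo_platform_setup(1000000u) == 1000u);
}
```

Because the stub is a static inline with an empty body, the compiler can drop
the call entirely when the feature is disabled.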
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/apic.h | 2 ++
arch/x86/kernel/cpu/vmware.c | 2 --
arch/x86/kernel/tsc.c | 2 --
arch/x86/kernel/tsc_msr.c | 2 --
4 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index cd84a94688a2..035998555e99 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -189,6 +189,8 @@ static inline void disable_local_APIC(void) { }
# define setup_boot_APIC_clock x86_init_noop
# define setup_secondary_APIC_clock x86_init_noop
static inline void lapic_update_tsc_freq(void) { }
+static inline void apic_set_timer_period_hz(u64 period_hz, const char *source) { }
+static inline void apic_set_timer_period_khz(u64 period_khz, const char *source) { }
static inline void init_bsp_APIC(void) { }
static inline void apic_intr_mode_select(void) { }
static inline void apic_intr_mode_init(void) { }
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 36f779dd311d..13b97265c535 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -422,10 +422,8 @@ static void __init vmware_platform_setup(void)
x86_platform.calibrate_tsc = vmware_get_tsc_khz;
x86_platform.calibrate_cpu = vmware_get_tsc_khz;
-#ifdef CONFIG_X86_LOCAL_APIC
/* Skip lapic calibration since we know the bus frequency. */
apic_set_timer_period_hz(ecx, "VMware hypervisor");
-#endif
} else {
pr_warn("Failed to get TSC freq from the hypervisor\n");
}
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f9ecc9256863..4d6a446645c0 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -710,7 +710,6 @@ unsigned long native_calibrate_tsc(void)
if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-#ifdef CONFIG_X86_LOCAL_APIC
/*
* The local APIC appears to be fed by the core crystal clock
* (which sounds entirely sensible). We can set the global
@@ -718,7 +717,6 @@ unsigned long native_calibrate_tsc(void)
* timer later.
*/
apic_set_timer_period_khz(crystal_khz, "CPUID 0x15/0x16");
-#endif
return crystal_khz * ebx_numerator / eax_denominator;
}
diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 7e990871e041..aece062aee7e 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -210,9 +210,7 @@ unsigned long cpu_khz_from_msr(void)
if (freq == 0)
pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index);
-#ifdef CONFIG_X86_LOCAL_APIC
apic_set_timer_period_khz(freq, "MSR_FSB_FREQ");
-#endif
/*
* TSC frequency determined by MSR is always considered "known"
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 03/51] x86/tsc: Ensure that TSC recalibration doesn't run if TSC frequency is known
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
When attempting TSC recalibration post-boot, which is only done for ancient
CPUs (P4 and K7) on SMP=n kernels, assert that the TSC frequency isn't
known (explicitly provided by hardware) by way of MSR or CPUID, and bail if
the impossible happens. In practice, recalibration and TSC_KNOWN_FREQ are
mutually exclusive, as TSC_KNOWN_FREQ will only be set when running on
hardware that was released decades after recalibration was obsoleted, but
it's hard to see that, especially when looking at just the TSC code.
Note, the WARN can likely be tripped by running in a virtual machine and
concocting an impossible CPU model, e.g. by combining a P4 signature with
CPUID 0x15. This is working as intended, as such a virtual CPU model is
wildly out-of-spec and is not supported.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4d6a446645c0..4393902c0ddd 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -930,6 +930,9 @@ void recalibrate_cpu_khz(void)
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
+ if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_TSC_KNOWN_FREQ)))
+ return;
+
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
if (tsc_khz == 0)
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 04/51] x86/tsc: Restrict recalibrate_cpu_khz() export to p4-clockmod and powernow-k7
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Export recalibrate_cpu_khz() only for its two users, p4-clockmod.ko and
powernow-k7.ko, to help document that recalibration is relevant only to
ancient CPUs.
For all intents and purposes, no functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4393902c0ddd..482cc3a8999a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -943,7 +943,7 @@ void recalibrate_cpu_khz(void)
cpu_khz_old, cpu_khz);
#endif
}
-EXPORT_SYMBOL_GPL(recalibrate_cpu_khz);
+EXPORT_SYMBOL_FOR_MODULES(recalibrate_cpu_khz, "p4-clockmod,powernow-k7");
static unsigned long long cyc2ns_suspend;
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 05/51] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Move the code to mark the TSC as reliable from sme_early_init() to
snp_secure_tsc_init(). The only reader of TSC_RELIABLE is the aptly
named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
after snp_secure_tsc_init().
This will allow consolidating the handling of TSC_KNOWN_FREQ and
TSC_RELIABLE when overriding the TSC calibration routine.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 2 ++
arch/x86/mm/mem_encrypt_amd.c | 3 ---
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ecd77d3217f3..ed0ac52a765e 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2037,6 +2037,8 @@ void __init snp_secure_tsc_init(void)
secrets = (__force struct snp_secrets_page *)mem;
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
/* Extract the GUEST TSC MHZ from BIT[17:0], rest is reserved space */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 2f8c32173972..6c3af974c7c2 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -535,9 +535,6 @@ void __init sme_early_init(void)
*/
x86_init.resources.dmi_setup = snp_dmi_setup;
}
-
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}
void __init mem_encrypt_free_decrypted_mem(void)
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 06/51] x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Don't override the kernel's CPU frequency calibration routine when
registering SNP's Secure TSC calibration routine. SNP (the architecture)
provides zero guarantees that the CPU runs at the same frequency as the
TSC. The justification for clobbering the CPU routine was:
Since the difference between CPU base and TSC frequency does not apply
in this case, the same callback is being used.
but that's simply not true. E.g. if APERF/MPERF is exposed to the VM, then
the CPU frequency absolutely does matter.
While relying on heuristics and/or the untrusted hypervisor to provide the
CPU frequency isn't ideal, it's at least not outright wrong.
Fixes: 73bbf3b0fbba ("x86/tsc: Init the TSC for Secure TSC guests")
Cc: Nikunj A Dadhania <nikunj@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ed0ac52a765e..665de1aea0ee 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2046,7 +2046,6 @@ void __init snp_secure_tsc_init(void)
snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
- x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
early_memunmap(mem, PAGE_SIZE);
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 07/51] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Move the check on having a Secure TSC to the common tsc_early_init() so
that it's obvious that having a Secure TSC is conditional, and to prepare
for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
logic looks especially weird).
No functional change intended.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 3 ---
arch/x86/kernel/tsc.c | 3 ++-
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 665de1aea0ee..403dcea86452 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2025,9 +2025,6 @@ void __init snp_secure_tsc_init(void)
unsigned long tsc_freq_mhz;
void *mem;
- if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
- return;
-
mem = early_memremap_encrypted(sev_secrets_pa, PAGE_SIZE);
if (!mem) {
pr_err("Unable to get TSC_FACTOR: failed to map the SNP secrets page.\n");
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 482cc3a8999a..8f1604ffe986 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1509,7 +1509,8 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
- snp_secure_tsc_init();
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ snp_secure_tsc_init();
if (!determine_cpu_tsc_frequencies(true))
return;
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 08/51] x86/sev: Shove SNP's secure/trusted TSC frequency directly into "calibration"
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
As a first step towards dropping .calibrate_{cpu,tsc}() and explicitly
defining precedence/priority for "calibration" routines, pass the secure
TSC frequency obtained from SNP firmware directly to
determine_cpu_tsc_frequencies() instead of overriding the .calibrate_tsc()
hook.
Unlike the native calibration routines, all of the paravirtual overrides,
including SNP and TDX, are constant in the sense that the frequency
provided by the hypervisor or trusted firmware is fixed, known, and always
available during early boot. More importantly, for CoCo (SNP and TDX) VMs,
it's imperative that the kernel uses the frequency provided by the trusted
firmware, not by the untrusted hypervisor. Enforcing the priority between
sources by carefully ordering seemingly unrelated init calls, so that the
trusted override "wins", is brittle and all but impossible to follow.
Explicitly ignore tsc_early_khz if the exact TSC frequency was obtained
from trusted firmware. Per commit bd35c77e32e4 ("x86/tsc: Add
tsc_early_khz command line parameter"), the goal of the param is to play
nice with setups that provide partial frequency information in CPUID, i.e.
it is NOT intended to be a hard override. Neither SNP's secure TSC nor TDX
was supported when commit bd35c77e32e4 landed back in 2020, i.e. the lack of
consideration for the interaction was purely due to oversight when SNP and
TDX support came along.
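The resulting precedence can be sketched as a pair of pure functions. This is
a simplified userspace illustration, not the patch itself: the patch open
codes the logic in tsc_early_init(), and pick_known_tsc_khz() and
early_param_ignored() are hypothetical names. A return value of 0 means "no
known frequency, fall back to the calibration routine".

```c
#include <assert.h>
#include <stdbool.h>

/* Trusted firmware wins; otherwise use the command line value, if any. */
unsigned int pick_known_tsc_khz(unsigned int trusted_khz,
				unsigned int early_param_khz)
{
	return trusted_khz ? trusted_khz : early_param_khz;
}

/* True when a user-provided value was supplied but had to be discarded. */
bool early_param_ignored(unsigned int trusted_khz,
			 unsigned int early_param_khz)
{
	return trusted_khz && early_param_khz;
}

void precedence_selfcheck(void)
{
	assert(pick_known_tsc_khz(2000000u, 1500000u) == 2000000u);
	assert(early_param_ignored(2000000u, 1500000u));
	assert(pick_known_tsc_khz(0u, 1500000u) == 1500000u);
	assert(!early_param_ignored(0u, 1500000u));
	assert(pick_known_tsc_khz(0u, 0u) == 0u);
}
```

Keeping the choice in one place is what makes the priority explicit, as
opposed to depending on which init call happened to override a hook last.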
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../admin-guide/kernel-parameters.txt | 4 +++
arch/x86/coco/sev/core.c | 14 +++--------
arch/x86/include/asm/sev.h | 4 +--
arch/x86/kernel/tsc.c | 25 ++++++++++++++-----
4 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f22..181149f633c3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7946,6 +7946,10 @@ Kernel parameters
with CPUID.16h support and partial CPUID.15h support.
Format: <unsigned int>
+ Note, tsc_early_khz is ignored if the TSC frequency is
+ provided by trusted firmware when running as an SNP
+ guest.
+
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
support TSX control.
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 403dcea86452..bc5ae9ef74da 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -99,7 +99,6 @@ static const char * const sev_status_feat_names[] = {
*/
static u64 snp_tsc_scale __ro_after_init;
static u64 snp_tsc_offset __ro_after_init;
-static unsigned long snp_tsc_freq_khz __ro_after_init;
DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
@@ -2014,15 +2013,10 @@ void __init snp_secure_tsc_prepare(void)
pr_debug("SecureTSC enabled");
}
-static unsigned long securetsc_get_tsc_khz(void)
-{
- return snp_tsc_freq_khz;
-}
-
-void __init snp_secure_tsc_init(void)
+unsigned int __init snp_secure_tsc_init(void)
{
+ unsigned long snp_tsc_freq_khz, tsc_freq_mhz;
struct snp_secrets_page *secrets;
- unsigned long tsc_freq_mhz;
void *mem;
mem = early_memremap_encrypted(sev_secrets_pa, PAGE_SIZE);
@@ -2043,7 +2037,7 @@ void __init snp_secure_tsc_init(void)
snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
- x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
-
early_memunmap(mem, PAGE_SIZE);
+
+ return snp_tsc_freq_khz;
}
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..05ebf0b73ef4 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -530,7 +530,7 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
int snp_svsm_vtpm_send_command(u8 *buffer);
void __init snp_secure_tsc_prepare(void);
-void __init snp_secure_tsc_init(void);
+unsigned int snp_secure_tsc_init(void);
enum es_result savic_register_gpa(u64 gpa);
enum es_result savic_unregister_gpa(u64 *gpa);
u64 savic_ghcb_msr_read(u32 reg);
@@ -637,7 +637,7 @@ static inline int snp_send_guest_request(struct snp_msg_desc *mdesc,
struct snp_guest_req *req) { return -ENODEV; }
static inline int snp_svsm_vtpm_send_command(u8 *buffer) { return -ENODEV; }
static inline void __init snp_secure_tsc_prepare(void) { }
-static inline void __init snp_secure_tsc_init(void) { }
+static inline unsigned int __init snp_secure_tsc_init(void) { return 0; }
static inline void sev_evict_cache(void *va, int npages) {}
static inline enum es_result savic_register_gpa(u64 gpa) { return ES_UNSUPPORTED; }
static inline enum es_result savic_unregister_gpa(u64 *gpa) { return ES_UNSUPPORTED; }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8f1604ffe986..f049c126e47c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1440,15 +1440,16 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);
-static bool __init determine_cpu_tsc_frequencies(bool early)
+static bool __init determine_cpu_tsc_frequencies(bool early,
+ unsigned int known_tsc_khz)
{
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);
if (early) {
cpu_khz = x86_platform.calibrate_cpu();
- if (tsc_early_khz)
- tsc_khz = tsc_early_khz;
+ if (known_tsc_khz)
+ tsc_khz = known_tsc_khz;
else
tsc_khz = x86_platform.calibrate_tsc();
} else {
@@ -1503,6 +1504,8 @@ static void __init tsc_enable_sched_clock(void)
void __init tsc_early_init(void)
{
+ unsigned int known_tsc_khz = 0;
+
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
/* Don't change UV TSC multi-chassis synchronization */
@@ -1510,9 +1513,19 @@ void __init tsc_early_init(void)
return;
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
- snp_secure_tsc_init();
+ known_tsc_khz = snp_secure_tsc_init();
- if (!determine_cpu_tsc_frequencies(true))
+ /*
+ * Ignore the user-provided TSC frequency if the exact frequency was
+ * obtained from trusted firmware, as the user-provided frequency is
+ * intended as a "starting point", not a known, guaranteed frequency.
+ */
+ if (!known_tsc_khz)
+ known_tsc_khz = tsc_early_khz;
+ else if (tsc_early_khz)
+ pr_err("Ignoring 'tsc_early_khz' in favor of trusted firmware.\n");
+
+ if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
return;
tsc_enable_sched_clock();
}
@@ -1533,7 +1546,7 @@ void __init tsc_init(void)
if (!tsc_khz) {
/* We failed to determine frequencies earlier, try again */
- if (!determine_cpu_tsc_frequencies(false)) {
+ if (!determine_cpu_tsc_frequencies(false, 0)) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 09/51] x86/tsc: Add a standalone helper for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Extract retrieval of TSC frequency information from CPUID into a standalone
helper so that TDX guest support can reuse the logic.
Opportunistically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
in favor of the kernel's preferred style.
No functional change intended.
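For readers unfamiliar with the CPUID.0x15 math, the arithmetic can be
sketched in userspace as below. This is an assumption-laden illustration:
demo_tsc_khz() is a hypothetical name, the register values are passed in as
plain arguments instead of being read via CPUID, and 64-bit intermediates are
used for safety where the kernel code uses narrower types. The vendor check,
the Denverton special case and the capability flags are omitted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * denominator/numerator: CPUID.0x15 EAX/EBX (TSC-to-crystal ratio).
 * crystal_khz: CPUID.0x15 ECX converted to kHz, 0 if not enumerated.
 * base_mhz: CPUID.0x16 EAX, 0 if that leaf is unavailable.
 * Returns the TSC frequency in kHz, or 0 if it cannot be derived.
 */
uint64_t demo_tsc_khz(uint32_t denominator, uint32_t numerator,
		      uint32_t crystal_khz, uint32_t base_mhz)
{
	uint64_t crystal = crystal_khz;

	if (!denominator || !numerator)
		return 0;

	/* No crystal frequency: back-derive it from the base frequency. */
	if (!crystal && base_mhz)
		crystal = (uint64_t)base_mhz * 1000u * denominator / numerator;

	if (!crystal)
		return 0;

	return crystal * numerator / denominator;
}

void demo_tsc_selfcheck(void)
{
	/* 24 MHz crystal with a 176/2 ratio. */
	assert(demo_tsc_khz(2, 176, 24000, 0) == 2112000u);
	/* Missing ratio or missing frequency information yields 0. */
	assert(demo_tsc_khz(0, 176, 24000, 0) == 0u);
	assert(demo_tsc_khz(2, 176, 0, 0) == 0u);
}
```

Note that the fallback path rounds twice (once when deriving the crystal
frequency, once when scaling back up), so its result is an approximation.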
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 61 +++++++++++++++++++++++++++----------------
1 file changed, 38 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f049c126e47c..12043812c8f5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -645,46 +645,62 @@ static unsigned long quick_pit_calibrate(void)
return delta;
}
+struct cpuid_tsc_info {
+ unsigned int denominator;
+ unsigned int numerator;
+ unsigned int crystal_khz;
+};
+
+static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+{
+ unsigned int ecx_hz, edx;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ return -ENOENT;
+
+ /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+ cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
+
+ if (!info->denominator || !info->numerator)
+ return -ENOENT;
+
+ /*
+ * Note: some CPUs provide the multiplier information, but not the core
+ * crystal frequency. The multiplier information is still useful for
+ * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
+ */
+ info->crystal_khz = ecx_hz / 1000;
+ return 0;
+}
+
/**
* native_calibrate_tsc - determine TSC frequency
* Determine TSC frequency via CPUID, else return 0.
*/
unsigned long native_calibrate_tsc(void)
{
- unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
- unsigned int crystal_khz;
+ struct cpuid_tsc_info info;
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
- if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ if (cpuid_get_tsc_info(&info))
return 0;
- eax_denominator = ebx_numerator = ecx_hz = edx = 0;
-
- /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
- cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
-
- if (ebx_numerator == 0 || eax_denominator == 0)
- return 0;
-
- crystal_khz = ecx_hz / 1000;
-
/*
* Denverton SoCs don't report crystal clock, and also don't support
* CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
* crystal clock.
*/
- if (crystal_khz == 0 &&
- boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
- crystal_khz = 25000;
+ if (!info.crystal_khz && boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+ info.crystal_khz = 25000;
/*
* TSC frequency reported directly by CPUID is a "hardware reported"
* frequency and is the most accurate one so far we have. This
* is considered a known frequency.
*/
- if (crystal_khz != 0)
+ if (info.crystal_khz)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
/*
@@ -692,15 +708,14 @@ unsigned long native_calibrate_tsc(void)
* clock, but we can easily calculate it to a high degree of accuracy
* by considering the crystal ratio and the CPU speed.
*/
- if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+ if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
unsigned int eax_base_mhz, ebx, ecx, edx;
cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
- crystal_khz = eax_base_mhz * 1000 *
- eax_denominator / ebx_numerator;
+ info.crystal_khz = eax_base_mhz * 1000 * info.denominator / info.numerator;
}
- if (crystal_khz == 0)
+ if (!info.crystal_khz)
return 0;
/*
@@ -716,9 +731,9 @@ unsigned long native_calibrate_tsc(void)
* lapic_timer_period here to avoid having to calibrate the APIC
* timer later.
*/
- apic_set_timer_period_khz(crystal_khz, "CPUID 0x15/0x16");
+ apic_set_timer_period_khz(info.crystal_khz, "CPUID 0x15/0x16");
- return crystal_khz * ebx_numerator / eax_denominator;
+ return info.crystal_khz * info.numerator / info.denominator;
}
static unsigned long cpu_khz_from_cpuid(void)
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 10/51] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
When running as a TDX guest, explicitly set the TSC frequency to a known
value, using CPUID-based information, instead of potentially relying on a
hypervisor-controlled PV routine. For TDX guests, CPUID.0x15 is always
emulated by the TDX-Module, i.e. the information from CPUID is more
trustworthy than the information provided by the hypervisor.
To maintain backwards compatibility with TDX guest kernels that use native
calibration, and because it's the least awful option, retain
native_calibrate_tsc()'s stuffing of the local APIC bus period using the
core crystal frequency. While it's entirely possible for the hypervisor
to emulate the APIC timer at a different frequency than the core crystal
frequency, the commonly accepted interpretation of Intel's SDM is that the
APIC timer runs at the core crystal frequency when the latter is enumerated
via CPUID:
The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is enumerated
in CPUID leaf 0x15).
If the hypervisor is malicious and deliberately runs the APIC timer at the
wrong frequency, nothing would stop the hypervisor from modifying the
frequency at any time, i.e. attempting to manually calibrate the frequency
out of paranoia would be futile.
Deliberately leave CPU frequency calibration as is, since the TDX-Module
doesn't provide any guarantees with respect to CPUID.0x16.
Expose and use cpuid_get_tsc_info() instead of providing a wrapper to
get the TSC and core crystal frequency, as TDX is the only anticipated
user outside of the TSC code, i.e. adding a helper to dedup the math won't
actually dedup anything. Having TDX use "struct cpuid_tsc_info" also
avoids the temptation of declaring a local "tsc_khz" variable and thus
unintentionally creating a shadow of the global "tsc_khz".
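For illustration only, the frequency math that tdx_tsc_init() performs on
the CPUID.0x15 data can be sketched in userspace as below. The struct and
function names are hypothetical stand-ins (not kernel symbols), and the
64-bit intermediate is an assumption of the sketch, not something the patch
does:

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the kernel's struct cpuid_tsc_info. */
struct tsc_info_sketch {
	unsigned int denominator;  /* CPUID.0x15 EAX */
	unsigned int numerator;    /* CPUID.0x15 EBX */
	unsigned int crystal_khz;  /* CPUID.0x15 ECX, converted from Hz to kHz */
};

/*
 * Returns the TSC frequency in kHz, or 0 if any input is missing.
 * Uses a 64-bit intermediate (an assumption of this sketch; the patch
 * uses plain unsigned int arithmetic).
 */
static unsigned int tsc_khz_from_info(const struct tsc_info_sketch *info)
{
	if (!info->denominator || !info->numerator || !info->crystal_khz)
		return 0;

	return (unsigned int)((uint64_t)info->crystal_khz * info->numerator /
			      info->denominator);
}
```

With a 24 MHz crystal and a 176/2 TSC-to-crystal ratio, the sketch yields
2112000 kHz. A zero crystal frequency yields 0, which corresponds to the
case the patch treats as a WARN and falls back from.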
Cc: Kiryl Shutsemau (Meta) <kas@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../admin-guide/kernel-parameters.txt | 4 ++--
arch/x86/coco/tdx/tdx.c | 20 ++++++++++++++++---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/include/asm/tsc.h | 7 +++++++
arch/x86/kernel/tsc.c | 11 ++++------
5 files changed, 32 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 181149f633c3..490e6aa72fc2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7947,8 +7947,8 @@ Kernel parameters
Format: <unsigned int>
Note, tsc_early_khz is ignored if the TSC frequency is
- provided by trusted firmware when running as an SNP
- guest.
+ provided by trusted firmware when running as an SNP or
+ TDX guest.
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 29b6f1ed59ec..ae2d35f2ef33 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/kexec.h>
+#include <asm/apic.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -1123,9 +1124,6 @@ void __init tdx_early_init(void)
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
- /* TSC is the only reliable clock in TDX guest */
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-
cc_vendor = CC_VENDOR_INTEL;
/* Configure the TD */
@@ -1195,3 +1193,19 @@ void __init tdx_early_init(void)
tdx_announce();
}
+
+unsigned int __init tdx_tsc_init(void)
+{
+ struct cpuid_tsc_info info;
+
+ if (WARN_ON_ONCE(cpuid_get_tsc_info(&info) || !info.crystal_khz))
+ return 0;
+
+ apic_set_timer_period_khz(info.crystal_khz, "TDX-Module via CPUID");
+
+ /* TSC is the only reliable clock in TDX guest */
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ return info.crystal_khz * info.numerator / info.denominator;
+}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 89e97d5761d8..d23ff06db41a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -68,6 +68,7 @@ struct ve_info {
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
+unsigned int __init tdx_tsc_init(void);
void tdx_get_ve_info(struct ve_info *ve);
@@ -89,6 +90,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
#else
static inline void tdx_early_init(void) { };
+static inline unsigned int tdx_tsc_init(void) { return 0; }
static inline void tdx_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 4d2d2f21ff06..b6b86e24e1bf 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -82,6 +82,13 @@ static inline cycles_t get_cycles(void)
}
#define get_cycles get_cycles
+struct cpuid_tsc_info {
+ unsigned int denominator;
+ unsigned int numerator;
+ unsigned int crystal_khz;
+};
+extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
+
extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 12043812c8f5..86384a83a5f6 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -34,6 +34,7 @@
#include <asm/topology.h>
#include <asm/uv/uv.h>
#include <asm/sev.h>
+#include <asm/tdx.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -645,13 +646,7 @@ static unsigned long quick_pit_calibrate(void)
return delta;
}
-struct cpuid_tsc_info {
- unsigned int denominator;
- unsigned int numerator;
- unsigned int crystal_khz;
-};
-
-static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
{
unsigned int ecx_hz, edx;
@@ -1529,6 +1524,8 @@ void __init tsc_early_init(void)
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
known_tsc_khz = snp_secure_tsc_init();
+ else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ known_tsc_khz = tdx_tsc_init();
/*
* Ignore the user-provided TSC frequency if the exact frequency was
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 11/51] x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
instead of overriding seemingly generic platform hooks, and explicitly
prioritize hypervisor-provided frequencies over native methods, but do NOT
clobber the frequency obtained from trusted firmware. While shuffling the
hooks around is arguably "six of one, half dozen of the other", scoping
them to x86_hyper_init makes their purpose more obvious, and allows for
explicitly defining the priority of sources (as is done here).
As is already done when trusted firmware provides the TSC frequency, ignore
tsc_early_khz if the exact TSC frequency was obtained from the
hypervisor, as attempting to refine the TSC frequency when running in a VM
is all but guaranteed to cause problems sooner or later due to the
calibration sources being emulated devices in the vast majority of setups.
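The resulting priority of sources (trusted firmware, then hypervisor, then
the user's tsc_early_khz) can be sketched roughly as follows. This is a
standalone approximation of the ordering in tsc_early_init(), and every
name in it is hypothetical, not a kernel symbol:

```c
#include <stddef.h>

typedef unsigned int (*khz_fn)(void);

/* Hypothetical bundle of the three possible frequency sources. */
struct tsc_sources {
	unsigned int firmware_khz;    /* from trusted firmware, 0 if none */
	khz_fn hyper_get_tsc_khz;     /* hypervisor hook, may be NULL */
	unsigned int user_early_khz;  /* tsc_early_khz, 0 if not given */
};

struct tsc_choice {
	unsigned int khz;   /* selected frequency, 0 if unknown */
	int user_ignored;   /* 1 if tsc_early_khz was given but overridden */
};

/* Stand-in hypervisor hook used only to exercise the sketch. */
static unsigned int demo_hyper_khz(void)
{
	return 2400000;
}

static struct tsc_choice pick_tsc_khz(const struct tsc_sources *s)
{
	struct tsc_choice c = { 0, 0 };

	/* Trusted firmware wins and is never clobbered. */
	c.khz = s->firmware_khz;

	/* Otherwise ask the hypervisor, if it registered a hook. */
	if (!c.khz && s->hyper_get_tsc_khz)
		c.khz = s->hyper_get_tsc_khz();

	/* The user value is only a starting point of last resort. */
	if (!c.khz)
		c.khz = s->user_early_khz;
	else if (s->user_early_khz)
		c.user_ignored = 1;

	return c;
}
```

In the sketch, user_ignored corresponds to the case where the patch prints
the "Ignoring 'tsc_early_khz'" message.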
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../admin-guide/kernel-parameters.txt | 3 +-
arch/x86/include/asm/acrn.h | 5 ----
arch/x86/include/asm/x86_init.h | 4 +++
arch/x86/kernel/cpu/acrn.c | 10 +++++--
arch/x86/kernel/cpu/mshyperv.c | 6 ++--
arch/x86/kernel/cpu/vmware.c | 8 ++---
arch/x86/kernel/jailhouse.c | 6 ++--
arch/x86/kernel/kvmclock.c | 6 ++--
arch/x86/kernel/tsc.c | 29 ++++++++++++++-----
arch/x86/xen/time.c | 4 +--
10 files changed, 50 insertions(+), 31 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 490e6aa72fc2..a387bb2c47e2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7948,7 +7948,8 @@ Kernel parameters
Note, tsc_early_khz is ignored if the TSC frequency is
provided by trusted firmware when running as an SNP or
- TDX guest.
+ TDX guest, or when the hypervisor provides the exact
+ frequency via a paravirtual interface.
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
diff --git a/arch/x86/include/asm/acrn.h b/arch/x86/include/asm/acrn.h
index db42b477c41d..a892179c61c6 100644
--- a/arch/x86/include/asm/acrn.h
+++ b/arch/x86/include/asm/acrn.h
@@ -32,11 +32,6 @@ static inline u32 acrn_cpuid_base(void)
return 0;
}
-static inline unsigned long acrn_get_tsc_khz(void)
-{
- return cpuid_eax(ACRN_CPUID_TIMING_INFO);
-}
-
/*
* Hypercalls for ACRN
*
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 953d3199408a..0c89bf40f507 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -123,6 +123,8 @@ struct x86_init_pci {
* @msi_ext_dest_id: MSI supports 15-bit APIC IDs
* @init_mem_mapping: setup early mappings during init_mem_mapping()
* @init_after_bootmem: guest init after boot allocator is finished
+ * @get_tsc_khz: get the TSC frequency (returns 0 if frequency is unknown)
+ * @get_cpu_khz: get the CPU frequency (returns 0 if frequency is unknown)
*/
struct x86_hyper_init {
void (*init_platform)(void);
@@ -131,6 +133,8 @@ struct x86_hyper_init {
bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
+ unsigned int (*get_tsc_khz)(void);
+ unsigned int (*get_cpu_khz)(void);
};
/**
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index dc119af83524..ad8f2da8003b 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -24,13 +24,15 @@ static u32 __init acrn_detect(void)
return acrn_cpuid_base();
}
+static unsigned int __init acrn_get_tsc_khz(void)
+{
+ return cpuid_eax(ACRN_CPUID_TIMING_INFO);
+}
+
static void __init acrn_init_platform(void)
{
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
-
- x86_platform.calibrate_tsc = acrn_get_tsc_khz;
- x86_platform.calibrate_cpu = acrn_get_tsc_khz;
}
static bool acrn_x2apic_available(void)
@@ -78,4 +80,6 @@ const __initconst struct hypervisor_x86 x86_hyper_acrn = {
.type = X86_HYPER_ACRN,
.init.init_platform = acrn_init_platform,
.init.x2apic_available = acrn_x2apic_available,
+ .init.get_tsc_khz = acrn_get_tsc_khz,
+ .init.get_cpu_khz = acrn_get_tsc_khz,
};
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 87beecec76f0..f9bc1c2d8c93 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -395,7 +395,7 @@ static int hv_nmi_unknown(unsigned int val, struct pt_regs *regs)
}
#endif
-static unsigned long hv_get_tsc_khz(void)
+static unsigned int __init hv_get_tsc_khz(void)
{
unsigned long freq;
@@ -573,8 +573,8 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
- x86_platform.calibrate_tsc = hv_get_tsc_khz;
- x86_platform.calibrate_cpu = hv_get_tsc_khz;
+ x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 13b97265c535..3cb473cae462 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -64,7 +64,7 @@ struct vmware_steal_time {
u64 reserved[7];
};
-static unsigned long vmware_tsc_khz __ro_after_init;
+static unsigned long vmware_tsc_khz __initdata;
static u8 vmware_hypercall_mode __ro_after_init;
unsigned long vmware_hypercall_slow(unsigned long cmd,
@@ -137,7 +137,7 @@ static inline int __vmware_platform(void)
return eax != UINT_MAX && ebx == VMWARE_HYPERVISOR_MAGIC;
}
-static unsigned long vmware_get_tsc_khz(void)
+static unsigned int __init vmware_get_tsc_khz(void)
{
return vmware_tsc_khz;
}
@@ -419,8 +419,8 @@ static void __init vmware_platform_setup(void)
}
vmware_tsc_khz = tsc_khz;
- x86_platform.calibrate_tsc = vmware_get_tsc_khz;
- x86_platform.calibrate_cpu = vmware_get_tsc_khz;
+ x86_init.hyper.get_tsc_khz = vmware_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = vmware_get_tsc_khz;
/* Skip lapic calibration since we know the bus frequency. */
apic_set_timer_period_hz(ecx, "VMware hypervisor");
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index f2d4ef89c085..e24c05ab4fae 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -68,7 +68,7 @@ static void __init jailhouse_timer_init(void)
apic_set_timer_period_khz(setup_data.v1.apic_khz, "Jailhouse hypervisor");
}
-static unsigned long jailhouse_get_tsc(void)
+static unsigned int __init jailhouse_get_tsc(void)
{
return precalibrated_tsc_khz;
}
@@ -210,8 +210,6 @@ static void __init jailhouse_init_platform(void)
x86_init.mpparse.parse_smp_cfg = jailhouse_parse_smp_config;
x86_init.pci.arch_init = jailhouse_pci_arch_init;
- x86_platform.calibrate_cpu = jailhouse_get_tsc;
- x86_platform.calibrate_tsc = jailhouse_get_tsc;
x86_platform.get_wallclock = jailhouse_get_wallclock;
x86_platform.legacy.rtc = 0;
x86_platform.legacy.warm_reset = 0;
@@ -293,5 +291,7 @@ const struct hypervisor_x86 x86_hyper_jailhouse __refconst = {
.detect = jailhouse_detect,
.init.init_platform = jailhouse_init_platform,
.init.x2apic_available = jailhouse_x2apic_available,
+ .init.get_tsc_khz = jailhouse_get_tsc,
+ .init.get_cpu_khz = jailhouse_get_tsc,
.ignore_nopv = true,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index cb3d0ca1fa22..4f8299303a19 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -136,7 +136,7 @@ static inline void kvm_sched_clock_init(bool stable)
* poll of guests can be running and trouble each other. So we preset
* lpj here
*/
-static unsigned long kvm_get_tsc_khz(void)
+static unsigned int __init kvm_get_tsc_khz(void)
{
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return pvclock_tsc_khz(this_cpu_pvti());
@@ -343,8 +343,8 @@ void __init kvmclock_init(void)
flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+ x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 86384a83a5f6..1dca9464b41c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1451,13 +1451,17 @@ static int __init init_tsc_clocksource(void)
device_initcall(init_tsc_clocksource);
static bool __init determine_cpu_tsc_frequencies(bool early,
+ unsigned int known_cpu_khz,
unsigned int known_tsc_khz)
{
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);
if (early) {
- cpu_khz = x86_platform.calibrate_cpu();
+ if (known_cpu_khz)
+ cpu_khz = known_cpu_khz;
+ else
+ cpu_khz = x86_platform.calibrate_cpu();
if (known_tsc_khz)
tsc_khz = known_tsc_khz;
else
@@ -1514,7 +1518,7 @@ static void __init tsc_enable_sched_clock(void)
void __init tsc_early_init(void)
{
- unsigned int known_tsc_khz = 0;
+ unsigned int known_cpu_khz = 0, known_tsc_khz = 0;
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
@@ -1522,22 +1526,33 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
+ if (x86_init.hyper.get_cpu_khz)
+ known_cpu_khz = x86_init.hyper.get_cpu_khz();
+
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
known_tsc_khz = snp_secure_tsc_init();
else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
known_tsc_khz = tdx_tsc_init();
+ /*
+ * If the TSC frequency wasn't provided by trusted firmware, try to get
+ * it from the hypervisor (which is untrusted when running as a CoCo guest).
+ */
+ if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
+ known_tsc_khz = x86_init.hyper.get_tsc_khz();
+
/*
* Ignore the user-provided TSC frequency if the exact frequency was
- * obtained from trusted firmware, as the user-provided frequency is
- * intended as a "starting point", not a known, guaranteed frequency.
+ * obtained from trusted firmware or the hypervisor, as the user-
+ * provided frequency is intended as a "starting point", not a known,
+ * guaranteed frequency.
*/
if (!known_tsc_khz)
known_tsc_khz = tsc_early_khz;
else if (tsc_early_khz)
- pr_err("Ignoring 'tsc_early_khz' in favor of trusted firmware.\n");
+ pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");
- if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
+ if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
return;
tsc_enable_sched_clock();
}
@@ -1558,7 +1573,7 @@ void __init tsc_init(void)
if (!tsc_khz) {
/* We failed to determine frequencies earlier, try again */
- if (!determine_cpu_tsc_frequencies(false, 0)) {
+ if (!determine_cpu_tsc_frequencies(false, 0, 0)) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d62c14334b35..1adb44fdddb2 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -38,7 +38,7 @@
static u64 xen_sched_clock_offset __read_mostly;
/* Get the TSC speed from Xen */
-static unsigned long xen_tsc_khz(void)
+static unsigned int __init xen_tsc_khz(void)
{
struct pvclock_vcpu_time_info *info =
&HYPERVISOR_shared_info->vcpu_info[0].time;
@@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);
- x86_platform.calibrate_tsc = xen_tsc_khz;
+ x86_init.hyper.get_tsc_khz = xen_tsc_khz;
x86_platform.get_wallclock = xen_get_wallclock;
}
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 12/51] x86/acrn: Register TSC/CPU frequency callbacks iff frequency is actually in CPUID
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Register ACRN's TSC/CPU frequency overrides if and only if the exact TSC
frequency is actually provided in CPUID. This will allow marking the TSC
as reliable as appropriate, and avoids relying on the caller to handle
"failure".
For all intents and purposes, no functional change intended.
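As a rough standalone sketch of the pattern (read the value once, cache it,
and install the callbacks only when it is non-zero), assuming hypothetical
names throughout rather than the kernel's x86_init structures:

```c
#include <stddef.h>

/* Hypothetical stand-in for the two hypervisor frequency hooks. */
struct hyper_hooks_sketch {
	unsigned int (*get_tsc_khz)(void);
	unsigned int (*get_cpu_khz)(void);
};

/* Cached copy of the value read from the (simulated) CPUID leaf. */
static unsigned int cached_khz;

static unsigned int cached_get_khz(void)
{
	return cached_khz;
}

/*
 * Install the hooks only if the hypervisor actually reported a
 * frequency, so callers never see a hook that returns 0.
 */
static void register_if_known(struct hyper_hooks_sketch *h,
			      unsigned int cpuid_khz)
{
	cached_khz = cpuid_khz;
	if (!cached_khz)
		return;

	h->get_tsc_khz = cached_get_khz;
	h->get_cpu_khz = cached_get_khz;
}
```

Leaving the hooks NULL when the leaf reads as zero lets the generic code
fall through to its other sources without special-casing a failed callback.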
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/acrn.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index ad8f2da8003b..dc71a6fdd461 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -19,6 +19,8 @@
#include <asm/idtentry.h>
#include <asm/irq_regs.h>
+static unsigned int acrn_tsc_khz_cpuid __initdata;
+
static u32 __init acrn_detect(void)
{
return acrn_cpuid_base();
@@ -26,13 +28,19 @@ static u32 __init acrn_detect(void)
static unsigned int __init acrn_get_tsc_khz(void)
{
- return cpuid_eax(ACRN_CPUID_TIMING_INFO);
+ return acrn_tsc_khz_cpuid;
}
static void __init acrn_init_platform(void)
{
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
+
+ acrn_tsc_khz_cpuid = cpuid_eax(ACRN_CPUID_TIMING_INFO);
+ if (acrn_tsc_khz_cpuid) {
+ x86_init.hyper.get_tsc_khz = acrn_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = acrn_get_tsc_khz;
+ }
}
static bool acrn_x2apic_available(void)
@@ -80,6 +88,4 @@ const __initconst struct hypervisor_x86 x86_hyper_acrn = {
.type = X86_HYPER_ACRN,
.init.init_platform = acrn_init_platform,
.init.x2apic_available = acrn_x2apic_available,
- .init.get_tsc_khz = acrn_get_tsc_khz,
- .init.get_cpu_khz = acrn_get_tsc_khz,
};
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 13/51] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Mark the TSC frequency as known when using ACRN's PV CPUID information.
Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
the TSC freq is explicitly provided by the hypervisor.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/acrn.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index dc71a6fdd461..3818f6ae0629 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -40,6 +40,7 @@ static void __init acrn_init_platform(void)
if (acrn_tsc_khz_cpuid) {
x86_init.hyper.get_tsc_khz = acrn_get_tsc_khz;
x86_init.hyper.get_cpu_khz = acrn_get_tsc_khz;
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
}
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related
* [PATCH v5 14/51] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-07-01 19:31 UTC (permalink / raw)
To: Jonathan Corbet, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
Rick Edgecombe, Sean Christopherson, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz
Cc: Shuah Khan, H. Peter Anvin, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, linux-doc, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, David Woodhouse,
Michael Kelley, Thomas Gleixner
In-Reply-To: <20260701193212.749551-1-seanjc@google.com>
Now that all paravirt code that explicitly specifies the TSC frequency
also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
line parameter"), one of the goals of the param is to allow the refined
calibration work "to do meaningful error checking".
No functional change intended.
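A minimal sketch of the consolidated rule, assuming hypothetical names and
a plain boolean in place of X86_FEATURE_TSC_KNOWN_FREQ: the flag follows the
exact (firmware or hypervisor) frequency and is decided before the
user-provided value is considered.

```c
#include <stdbool.h>

struct tsc_early_result {
	unsigned int khz;  /* frequency handed to calibration */
	bool known_freq;   /* stands in for X86_FEATURE_TSC_KNOWN_FREQ */
};

/*
 * exact_khz: frequency from trusted firmware or the hypervisor (0 if none)
 * user_khz:  tsc_early_khz from the command line (0 if not given)
 */
static struct tsc_early_result resolve_early_tsc(unsigned int exact_khz,
						 unsigned int user_khz)
{
	struct tsc_early_result r;

	/* Decide the flag first, so a user value can never set it. */
	r.known_freq = exact_khz != 0;

	/* The user value is only used when nothing exact is available. */
	r.khz = exact_khz ? exact_khz : user_khz;

	return r;
}
```

In this sketch, resolve_early_tsc(0, 3000000) returns the user value with
known_freq false, so refined calibration can still check it, while
resolve_early_tsc(2400000, 3000000) returns 2400000 with known_freq true.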
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 1 -
arch/x86/coco/tdx/tdx.c | 1 -
arch/x86/kernel/cpu/acrn.c | 1 -
arch/x86/kernel/cpu/mshyperv.c | 1 -
arch/x86/kernel/cpu/vmware.c | 2 --
arch/x86/kernel/jailhouse.c | 1 -
arch/x86/kernel/kvmclock.c | 1 -
arch/x86/kernel/tsc.c | 13 ++++++++++---
arch/x86/xen/time.c | 1 -
9 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index bc5ae9ef74da..72313b36b6f5 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2027,7 +2027,6 @@ unsigned int __init snp_secure_tsc_init(void)
secrets = (__force struct snp_secrets_page *)mem;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index ae2d35f2ef33..94682aca188b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -1205,7 +1205,6 @@ unsigned int __init tdx_tsc_init(void)
/* TSC is the only reliable clock in TDX guest */
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return info.crystal_khz * info.numerator / info.denominator;
}
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 3818f6ae0629..dc71a6fdd461 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -40,7 +40,6 @@ static void __init acrn_init_platform(void)
if (acrn_tsc_khz_cpuid) {
x86_init.hyper.get_tsc_khz = acrn_get_tsc_khz;
x86_init.hyper.get_cpu_khz = acrn_get_tsc_khz;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
}
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f9bc1c2d8c93..e03c69a4db33 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -575,7 +575,6 @@ static void __init ms_hyperv_init_platform(void)
ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
if (ms_hyperv.priv_high & HV_ISOLATION) {
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 3cb473cae462..0a3bd90576d4 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -390,8 +390,6 @@ static void __init vmware_set_capabilities(void)
{
setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
- if (vmware_tsc_khz)
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
setup_force_cpu_cap(X86_FEATURE_VMCALL);
else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index e24c05ab4fae..ff173052cdce 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -255,7 +255,6 @@ static void __init jailhouse_init_platform(void)
pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
precalibrated_tsc_khz = setup_data.v1.tsc_khz;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
pci_probe = 0;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4f8299303a19..35a879d33e9e 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -138,7 +138,6 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned int __init kvm_get_tsc_khz(void)
{
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return pvclock_tsc_khz(this_cpu_pvti());
}
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 1dca9464b41c..676910292af7 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1541,11 +1541,18 @@ void __init tsc_early_init(void)
if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
known_tsc_khz = x86_init.hyper.get_tsc_khz();
+ /*
+ * Mark the TSC frequency as known if it was obtained from a hypervisor
+ * or trusted firmware.
+ */
+ if (known_tsc_khz)
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
/*
* Ignore the user-provided TSC frequency if the exact frequency was
- * obtained from trusted firmware or the hypervisor, as the user-
- * provided frequency is intended as a "starting point", not a known,
- * guaranteed frequency.
+ * obtained from trusted firmware or the hypervisor, and don't mark the
+ * frequency as known, as the user-provided frequency is intended as a
+ * "starting point", not a known, guaranteed frequency
*/
if (!known_tsc_khz)
known_tsc_khz = tsc_early_khz;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1adb44fdddb2..487ad838c441 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -43,7 +43,6 @@ static unsigned int __init xen_tsc_khz(void)
struct pvclock_vcpu_time_info *info =
&HYPERVISOR_shared_info->vcpu_info[0].time;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return pvclock_tsc_khz(info);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
^ permalink raw reply related