From: Sean Christopherson <seanjc@google.com>
To: Michael Kelley <mhklinux@outlook.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Juergen Gross <jgross@suse.com>,
"K. Y. Srinivasan" <kys@microsoft.com>,
Haiyang Zhang <haiyangz@microsoft.com>,
Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
Ajay Kaher <ajay.kaher@broadcom.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-coco@lists.linux.dev" <linux-coco@lists.linux.dev>,
"virtualization@lists.linux.dev"
<virtualization@lists.linux.dev>,
"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"xen-devel@lists.xenproject.org"
<xen-devel@lists.xenproject.org>,
Nikunj A Dadhania <nikunj@amd.com>,
Tom Lendacky <thomas.lendacky@amd.com>
Subject: Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Date: Mon, 10 Feb 2025 08:21:49 -0800
Message-ID: <Z6onnUthSBUVAklf@google.com>
In-Reply-To: <SN6PR02MB4157A85EC0B1B2D45CB611FAD4F02@SN6PR02MB4157.namprd02.prod.outlook.com>

On Sat, Feb 08, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 7, 2025 9:23 AM
> >
> > Dropping a few people/lists whose emails are bouncing.
> >
> > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > > #ifdef CONFIG_X86_LOCAL_APIC
> > > x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > > #endif
> > > + /*
> > > + * Save/restore "sched" clock state even if kvmclock isn't being used
> > > + * for sched_clock, as kvmclock is still used for wallclock and relies
> > > + * on these hooks to re-enable kvmclock after suspend+resume.
> >
> > This is wrong, wallclock is a different MSR entirely.
> >
> > > + */
> > > x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> > > x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> >
> > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> > used as sched_clock, because when TSC is reset on suspend/hibernation, not doing
> > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> >
> > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> > under kvmclock leads to the same problem.
> >
> > The whole PV sched_clock scheme is a disaster.
> >
> > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> > sched_clock. And the code is all kinds of funky, because it tries to keep the
> > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> > means that Hyper-V overrides the sched_clock save/restore hooks even when
> > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
>
> Regarding (a), the one occurrence of x86 PV-specific code in hyperv_timer.c is
> the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
> it's not built if targeting some other architecture. Or do you see something else
> that is x86-specific?
>
> Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
> have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
> PARAVIRT is forced to "n". So I don't think there's a current problem with the
> sched_clock save/restore hooks.

Oh, there are no build issues, and all of the x86 bits are nicely cordoned off.
My complaint is essentially that they're _too_ isolated; putting the sched_clock
save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but IMO
it does more harm than good because the split makes it difficult to connect the
dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.

> But I would be good with some restructuring so that setting the sched clock
> save/restore hooks is more closely tied to the sched clock choice,

Yeah, this is the intent of my ranting. After the dust settles, the code can
look like this.

---
#ifdef CONFIG_GENERIC_SCHED_CLOCK
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/*
	 * We're on an architecture with generic sched clock (not x86/x64).
	 * The Hyper-V sched clock read function returns nanoseconds, not
	 * the normal 100ns units of the Hyper-V synthetic clock.
	 */
	sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
}
#elif defined CONFIG_PARAVIRT
static u64 hv_ref_counter_at_suspend;

/*
 * Hyper-V clock counter resets during hibernation. Save and restore clock
 * offset during suspend/resume, while also considering the time passed
 * before suspend. This is to make sure that sched_clock using hv tsc page
 * based clocksource, proceeds from where it left off during suspend and
 * it shows correct time for the timestamps of kernel messages after resume.
 */
static void hv_save_sched_clock_state(void)
{
	hv_ref_counter_at_suspend = hv_read_reference_counter();
}

static void hv_restore_sched_clock_state(void)
{
	/*
	 * Adjust the offsets used by hv tsc clocksource to
	 * account for the time spent before hibernation.
	 * adjusted value = reference counter (time) at suspend
	 *                - reference counter (time) now.
	 */
	hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
}

static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/* We're on x86/x64 *and* using PV ops */
	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
				 hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
#endif /* CONFIG_GENERIC_SCHED_CLOCK */
---
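
For anyone following along, the offset arithmetic in the save/restore pair above can be checked in user space. This is only a sketch under stated assumptions: every sim_* name is a hypothetical stand-in (not a kernel symbol), the reference counter is a plain variable that the caller resets to model hibernation, and the 100ns-to-ns scaling of the real sched_clock read is left out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state; none of these are real symbols. */
static uint64_t sim_ref_counter;            /* models the Hyper-V reference counter */
static uint64_t sim_sched_clock_offset;     /* models hv_sched_clock_offset */
static uint64_t sim_ref_counter_at_suspend; /* models hv_ref_counter_at_suspend */

static uint64_t sim_read_reference_counter(void)
{
	return sim_ref_counter;
}

/* sched_clock as "counter minus offset"; unit scaling intentionally omitted. */
static uint64_t sim_sched_clock(void)
{
	return sim_read_reference_counter() - sim_sched_clock_offset;
}

static void sim_save_sched_clock_state(void)
{
	sim_ref_counter_at_suspend = sim_read_reference_counter();
}

static void sim_restore_sched_clock_state(void)
{
	/*
	 * Same arithmetic as the snippet above.  The offset may wrap below
	 * zero; unsigned modular arithmetic keeps the subtraction in
	 * sim_sched_clock() correct regardless.
	 */
	sim_sched_clock_offset -= (sim_ref_counter_at_suspend -
				   sim_read_reference_counter());
}
```

If the counter is reset to a small value between the save and the restore, sim_sched_clock() returns the same value right after the restore as it did at the save, and keeps counting from there.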

> as long as the architecture independence of hyperv_timer.c is preserved.

LOL, ah yes, the architecture independence of MSRs and TSC :-D

Teasing aside, the code is firmly x86-only at the moment. It's selectable only
by x86:

  config HYPERV_TIMER
	def_bool HYPERV && X86

and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
sched_clock()") there are references to symbols/functions that are provided only
by x86.

I assume arm64 support is a WIP, but keeping the upstream code arch independent
isn't very realistic if the code can't be at least compile-tested. To help
drive-by contributors like myself, maybe select HYPERV_TIMER on arm64 for
COMPILE_TEST=y builds?

  config HYPERV_TIMER
	def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))

I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
that is explicitly x86-only, but something along those lines would help people
like me understand the goal/intent, and in theory would also help y'all maintain
the code by detecting breakage.
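
To make the "tie the save/restore hooks to the sched_clock choice" idea concrete, here is a rough user-space model. It is a sketch, not the kernel implementation: struct sched_clock_ops, set_pv_sched_clock(), and the tsc_*/pv_* helpers are all hypothetical names invented for illustration, and the "trusted TSC" decision is reduced to a boolean parameter. The point it shows is that the read function and its save/restore hooks are installed together or not at all, so a rejected PV clock can never usurp the TSC hooks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*clock_read_fn)(void);
typedef void (*clock_hook_fn)(void);

/* Hypothetical bundle: a clock and the hooks that belong to it. */
struct sched_clock_ops {
	clock_read_fn read;
	clock_hook_fn save;
	clock_hook_fn restore;
};

/* Counters so a caller can observe which hooks actually ran. */
static int tsc_saves, tsc_restores, pv_saves, pv_restores;

static uint64_t tsc_read(void) { return 1; }	/* dummy value identifying TSC */
static void tsc_save(void) { tsc_saves++; }
static void tsc_restore(void) { tsc_restores++; }

static uint64_t pv_read(void) { return 2; }	/* dummy value identifying PV */
static void pv_save(void) { pv_saves++; }
static void pv_restore(void) { pv_restores++; }

/* Default: TSC is sched_clock and owns the save/restore hooks. */
static struct sched_clock_ops active_ops = { tsc_read, tsc_save, tsc_restore };

/*
 * Install the PV clock and its hooks as one unit.  If the PV clock is
 * rejected (e.g. the TSC is trusted), the TSC hooks stay in place.
 */
static bool set_pv_sched_clock(clock_read_fn read, clock_hook_fn save,
			       clock_hook_fn restore, bool tsc_is_trusted)
{
	if (tsc_is_trusted || !read || !save || !restore)
		return false;

	active_ops = (struct sched_clock_ops){ read, save, restore };
	return true;
}

static uint64_t current_sched_clock(void)
{
	return active_ops.read();
}

static void suspend_resume_cycle(void)
{
	active_ops.save();
	active_ops.restore();
}
```

With tsc_is_trusted set, a suspend/resume cycle runs only the TSC hooks; once the PV clock is accepted, only the PV hooks run.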