All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Michael Kelley <mhklinux@outlook.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Juergen Gross <jgross@suse.com>,
	 "K. Y. Srinivasan" <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>,  Dexuan Cui <decui@microsoft.com>,
	Ajay Kaher <ajay.kaher@broadcom.com>,
	 Jan Kiszka <jan.kiszka@siemens.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	 Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "linux-coco@lists.linux.dev" <linux-coco@lists.linux.dev>,
	 "virtualization@lists.linux.dev"
	<virtualization@lists.linux.dev>,
	 "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	 "xen-devel@lists.xenproject.org"
	<xen-devel@lists.xenproject.org>,
	Nikunj A Dadhania <nikunj@amd.com>,
	 Tom Lendacky <thomas.lendacky@amd.com>
Subject: Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Date: Mon, 10 Feb 2025 08:21:49 -0800	[thread overview]
Message-ID: <Z6onnUthSBUVAklf@google.com> (raw)
In-Reply-To: <SN6PR02MB4157A85EC0B1B2D45CB611FAD4F02@SN6PR02MB4157.namprd02.prod.outlook.com>

On Sat, Feb 08, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 7, 2025 9:23 AM
> > 
> > Dropping a few people/lists whose emails are bouncing.
> > 
> > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > >  #ifdef CONFIG_X86_LOCAL_APIC
> > >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > >  #endif
> > > +	/*
> > > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> > 
> > This is wrong, wallclock is a different MSR entirely.
> > 
> > > +	 */
> > >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> > >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> > 
> > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> > used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> > 
> > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> > under kvmclock leads to the same problem.
> > 
> > The whole PV sched_clock scheme is a disaster.
> > 
> > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> > sched_clock.  And the code is all kinds of funky, because it tries to keep the
> > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> > means that Hyper-V overides the sched_clock save/restore hooks even when
> > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
> 
> Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
> the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
> it's not built if targeting some other architecture. Or do you see something else
> that is x86-specific?
> 
> Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
> have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
> PARAVIRT is forced to "n". So I don't think there's a current problem with the
> sched_clock save/restore hooks. i

Oh, there are no build issues, and all of the x86 bits are nicely cordoned off.
My complaint is essentially that they're _too_ isolated; putting the sched_clock
save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but IMO
it does more harm than good because the split makes it difficult to connect the
dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.

> But I would be good with some restructuring so that setting the sched clock
> save/restore hooks is more closely tied to the sched clock choice,

Yeah, this is the intent of my ranting.  After the dust settles, the code can
look like this.

---
#ifdef CONFIG_GENERIC_SCHED_CLOCK
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/*
	 * We're on an architecture with generic sched clock (not x86/x64).
	 * The Hyper-V sched clock read function returns nanoseconds, not
	 * the normal 100ns units of the Hyper-V synthetic clock.
	 */
	sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
}
#elif defined CONFIG_PARAVIRT
static u64 hv_ref_counter_at_suspend;
/*
 * Hyper-V clock counter resets during hibernation. Save and restore clock
 * offset during suspend/resume, while also considering the time passed
 * before suspend. This is to make sure that sched_clock using hv tsc page
 * based clocksource, proceeds from where it left off during suspend and
 * it shows correct time for the timestamps of kernel messages after resume.
 */
static void hv_save_sched_clock_state(void)
{
	hv_ref_counter_at_suspend = hv_read_reference_counter();
}

static void hv_restore_sched_clock_state(void)
{
	/*
	 * Adjust the offsets used by hv tsc clocksource to
	 * account for the time spent before hibernation.
	 * adjusted value = reference counter (time) at suspend
	 *                - reference counter (time) now.
	 */
	hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
}

static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/* We're on x86/x64 *and* using PV ops */
	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
				 hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
#endif /* CONFIG_GENERIC_SCHED_CLOCK */
---

> as long as the architecture independence of hyperv_timer.c is preserved.

LOL, ah yes, the architecture independence of MSRs and TSC :-D

Teasing aside, the code is firmly x86-only at the moment.  It's selectable only
by x86:

  config HYPERV_TIMER
	def_bool HYPERV && X86
 
and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
sched_clock()") there are references to symbols/functions that are provided only
by x86.

I assume arm64 support is a WIP, but keeping the upstream code arch independent
isn't very realistic if the code can't be at least compile-tested.  To help
drive-by contributors like myself, maybe select HYPER_TIMER on arm64 for
COMPILE_TEST=y builds?

  config HYPERV_TIMER
	def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))

I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
that is explicitly x86-only, but something along those lines would help people
like me understand the goal/intent, and in theory would also help y'all maintain
the code by detecting breakage.

  reply	other threads:[~2025-02-10 16:21 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-01  2:17 [PATCH 00/16] x86/tsc: Try to wrangle PV clocks vs. TSC Sean Christopherson
2025-02-01  2:17 ` [PATCH 01/16] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15 Sean Christopherson
2025-02-03  5:55   ` Nikunj A Dadhania
2025-02-03 22:03     ` Sean Christopherson
2025-02-05 22:13   ` Sean Christopherson
2025-02-11 15:01   ` Borislav Petkov
2025-02-11 17:25     ` Sean Christopherson
2025-02-11 18:40       ` Borislav Petkov
2025-02-11 19:03         ` Sean Christopherson
2025-02-01  2:17 ` [PATCH 02/16] x86/tsc: Add standalone helper for getting CPU frequency from CPUID Sean Christopherson
2025-02-01  2:17 ` [PATCH 03/16] x86/tsc: Add helper to register CPU and TSC freq calibration routines Sean Christopherson
2025-02-11 17:32   ` Borislav Petkov
2025-02-11 17:43     ` Sean Christopherson
2025-02-11 20:32       ` Borislav Petkov
2025-02-12 16:49         ` Tom Lendacky
2025-02-01  2:17 ` [PATCH 04/16] x86/sev: Mark TSC as reliable when configuring Secure TSC Sean Christopherson
2025-02-04  8:02   ` Nikunj A Dadhania
2025-02-01  2:17 ` [PATCH 05/16] x86/sev: Move check for SNP Secure TSC support to tsc_early_init() Sean Christopherson
2025-02-04  8:27   ` Nikunj A Dadhania
2025-02-01  2:17 ` [PATCH 06/16] x86/tdx: Override PV calibration routines with CPUID-based calibration Sean Christopherson
2025-02-04 10:16   ` Nikunj A Dadhania
2025-02-04 19:29     ` Sean Christopherson
2025-02-05  3:56       ` Nikunj A Dadhania
2025-02-01  2:17 ` [PATCH 07/16] x86/acrn: Mark TSC frequency as known when using ACRN for calibration Sean Christopherson
2025-02-01  2:17 ` [PATCH 08/16] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration Sean Christopherson
2025-02-03 14:48   ` Tom Lendacky
2025-02-03 19:52     ` Sean Christopherson
2025-02-01  2:17 ` [PATCH 09/16] x86/tsc: Rejects attempts to override TSC calibration with lesser routine Sean Christopherson
2025-02-01  2:17 ` [PATCH 10/16] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock() Sean Christopherson
2025-02-01  2:17 ` [PATCH 11/16] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC Sean Christopherson
2025-02-01  2:17 ` [PATCH 12/16] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop Sean Christopherson
2025-02-01  2:17 ` [PATCH 13/16] x86/kvmclock: Get CPU base frequency from CPUID when it's available Sean Christopherson
2025-02-01  2:17 ` [PATCH 14/16] x86/kvmclock: Get TSC frequency from CPUID when its available Sean Christopherson
2025-02-01  2:17 ` [PATCH 15/16] x86/kvmclock: Stuff local APIC bus period when core crystal freq comes from CPUID Sean Christopherson
2025-02-04  5:35   ` Dan Carpenter
2025-02-05 22:11     ` Sean Christopherson
2025-02-01  2:17 ` [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop Sean Christopherson
2025-02-07 17:23   ` Sean Christopherson
2025-02-08 18:03     ` Michael Kelley
2025-02-10 16:21       ` Sean Christopherson [this message]
2025-02-12 16:44         ` Michael Kelley
2025-02-12 22:55           ` Sean Christopherson
2025-02-11 14:39 ` [PATCH 00/16] x86/tsc: Try to wrangle PV clocks vs. TSC Borislav Petkov
2025-02-11 16:28   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6onnUthSBUVAklf@google.com \
    --to=seanjc@google.com \
    --cc=ajay.kaher@broadcom.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=decui@microsoft.com \
    --cc=haiyangz@microsoft.com \
    --cc=jan.kiszka@siemens.com \
    --cc=jgross@suse.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kys@microsoft.com \
    --cc=linux-coco@lists.linux.dev \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mhklinux@outlook.com \
    --cc=mingo@redhat.com \
    --cc=nikunj@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=virtualization@lists.linux.dev \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.