[RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

* [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock
@ 2026-05-17 21:25 David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 1/8] timekeeping: Remove xtime_remainder from ntp_error accumulation David Woodhouse
                   ` (8 more replies)
  0 siblings, 9 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel

This is v2 of the series to add feed-forward clock discipline, allowing
a guest kernel to lock its system clock directly to a hypervisor-provided
vmclock reference with sub-10ns precision and no drift.

The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit. Unlike guest-driven NTP, it allows for accurate time
to be preserved across live migration.
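
As a rough illustration of the linear function above, here is a minimal
userspace sketch. The struct is a hypothetical, simplified stand-in for
the shared page (not the real vmclock ABI layout), the function name is
made up for this example, and it relies on the GCC/Clang
unsigned __int128 extension. It assumes the fractional fields are in
units of 2^-64 seconds and that the period carries an extra right shift,
as in the driver arithmetic later in this series.

```c
#include <stdint.h>

/* Hypothetical, simplified view of the fields the linear function needs. */
struct vmclock_sample {
	uint64_t counter_value;   /* counter reading at the reference point */
	uint64_t time_sec;        /* seconds at the reference point */
	uint64_t time_frac_sec;   /* fraction of a second, units of 2^-64 s */
	uint64_t period_frac_sec; /* counter period, units of 2^-(64 + shift) s */
	uint8_t  period_shift;
};

struct ref_time {
	uint64_t sec;
	uint64_t nsec;
};

/* time = base + (counter - counter_value) * period, in 64.64 fixed point */
struct ref_time vmclock_time_at(const struct vmclock_sample *s, uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;
	unsigned __int128 t = (unsigned __int128)delta * s->period_frac_sec;
	struct ref_time out;

	t >>= s->period_shift;       /* now in units of 2^-64 s */
	t += s->time_frac_sec;       /* add the fractional base */
	out.sec = s->time_sec + (uint64_t)(t >> 64);
	/* Convert the remaining 2^-64 s fraction to nanoseconds. */
	out.nsec = (uint64_t)(((unsigned __int128)(uint64_t)t * 1000000000ULL) >> 64);
	return out;
}
```

For a counter running at 2^30 Hz (period_frac_sec = 1 << 34, shift 0),
advancing the counter by 2^30 cycles moves the result forward by exactly
one second.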

The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.

The previous series introduced an external oracle to drive the per-tick 
dithering mechanism towards the reference clock. By fixing all the 
inaccuracies and systematic drift in the kernel's own tracking, we can 
dispense with the external oracle and just configure the timekeeping 
using the existing frequency/tick_length and time_offset/ntp_error 
mechanisms.

Changes since v1 (RFC):
 • Fixed three additional issues in the timekeeping code that were
   discovered during nanosecond-precision testing with the vmclock
   reference:
   - The clawback adjustment in timekeeping_apply_adjustment() moved
     xtime without updating ntp_error (patch 2).
   - The exponential tail of ntp_offset_chunk() asymptotically approached
     zero, preventing convergence to the final nanosecond (patch 3).
   - A divide-by-zero in timekeeping_adjust() when cycle_interval is
     momentarily zero during TSC recalibration on KVM guests (patch 6).
 • Replaced the per-tick absolute reference clamping with a cleaner
   mechanism: the skew from time_offset is now driven by per-tick
   transfer into ntp_error with a matching mult adjustment, rather than
   by inflating tick_length (patch 7). This gives exact per-tick
   accounting of the time_offset drain with no rounding loss.
 • The timekeeping_set_reference() API (patch 4) sets time_offset and
   the frequency, letting the standard skew mechanism handle convergence.

The series:

Patches 1-3: Timekeeping bugfixes (suitable for stable/independent review)
  1. Remove stale xtime_remainder from ntp_error accumulation.
  2. Account for clawback adjustment in ntp_error.
  3. Clamp time_offset delta to prevent infinite exponential tail.

Patches 4-5: Feed-forward reference clock infrastructure
  4. Add timekeeping_set_reference() API for external clock references.
  5. Wire ptp_vmclock to call timekeeping_set_reference() on probe and
     on each hypervisor notification.

Patch 6: Timekeeping bugfix (also suitable for stable/independent review)
  6. Guard against divide-by-zero during clocksource recalibration.

Patch 7: Improved time_offset skew mechanism
  7. Drive time_offset skew via per-tick ntp_error transfer instead of
     tick_length inflation, with mult adjustment for dithering bandwidth.
     (we can't *yet* kill tick_length_base; I have to frown at adjtime()
     some more first).

Patch 8: Host-side vmclock page export (WIP)
  8. Add /dev/vmclock_host miscdev for VMM consumption.

Tested with QEMU passing through a vmclock device to a guest¹. The guest
clock converges to the reference within seconds and remains within
single-digit nanoseconds of it indefinitely, with no further external
correction. Injecting a ±10µs offset via ntp_set_time_offset() converges
to the target via the same exponential decay as before over about 70
seconds, and retains the same single-digit nanosecond jitter around
precisely ±10000ns once converged. In real usage the reference will
also change periodically, but the feed-forward setup relies on the
kernel being able to converge to, and remain on, the precise line it
is given.

¹ https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough

David Woodhouse (8):
      timekeeping: Remove xtime_remainder from ntp_error accumulation
      timekeeping: Account for clawback adjustment in ntp_error
      timekeeping: Clamp time_offset delta to prevent infinite tail
      timekeeping: Add absolute reference for feed-forward clock discipline
      ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
      timekeeping: Guard against divide-by-zero in timekeeping_adjust
      timekeeping: Drive time_offset skew via per-tick ntp_error transfer
      WIP: kernel/time: Add /dev/vmclock_host miscdev

 drivers/ptp/ptp_vmclock.c                          |  79 +++++
 include/linux/timekeeper_internal.h                |   3 +-
 include/linux/timekeeping_reference.h              |  19 ++
 include/linux/vmclock_host.h                       |  17 ++
 kernel/time/Kconfig                                |   8 +
 kernel/time/Makefile                               |   1 +
 kernel/time/ntp.c                                  |  72 ++++-
 kernel/time/ntp_internal.h                         |   6 +
 kernel/time/timekeeping.c                          |  83 +++++-
 kernel/time/vmclock_host.c                         | 319 +++++++++++++++++++++
 tools/testing/selftests/timers/vmclock_host_test.c | 171 +++++++++++
 11 files changed, 766 insertions(+), 12 deletions(-)

* [RFC PATCH v2 1/8] timekeeping: Remove xtime_remainder from ntp_error accumulation
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error David Woodhouse
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

The ntp_error accumulator tracks the difference between intended and
actual clock advance. Each tick it adds ntp_tick (the intended advance)
and subtracts what the clock actually advanced.

The subtraction was (xtime_interval + xtime_remainder), but only
xtime_interval is actually added to xtime_nsec each tick.
xtime_remainder was a boot-time constant representing the rounding error
from converting the tick period to an integer number of counter cycles.
It was never added to xtime_nsec, so subtracting it from ntp_error
created a phantom credit that biased the dithering ratio.

The effect is a systematic drift whose magnitude depends on the value of
xtime_remainder and the NTP frequency correction. NTP masks this by
continuously adjusting the frequency to compensate, but with a fixed
frequency (or an external reference clock like vmclock), the drift is
exposed.

Also remove xtime_remainder from the mult computation in
timekeeping_adjust(), which used it to offset the division for the same
(incorrect) reason.
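
To make the phantom credit concrete, here is a toy model of the
accumulator. It is only a sketch: the struct and function names are
invented for this example, all values share one unit, and the kernel's
shift factors are ignored. It compares the tracked error with and
without the remainder subtraction against the true error (intended
position minus actual position).

```c
#include <stdint.h>

/* Toy model; names are illustrative, not kernel structures. */
struct tick_model {
	int64_t ntp_tick;        /* intended advance per tick */
	int64_t xtime_interval;  /* what is actually added to the clock per tick */
	int64_t xtime_remainder; /* rounding residue, never added to the clock */
};

/* Tracked error after n ticks, with the old or the fixed subtraction. */
int64_t tracked_error(const struct tick_model *m, int n, int subtract_remainder)
{
	int64_t err = 0;

	for (int i = 0; i < n; i++) {
		err += m->ntp_tick;
		err -= m->xtime_interval;
		if (subtract_remainder)
			err -= m->xtime_remainder; /* old behaviour: phantom credit */
	}
	return err;
}

/* True error: intended position minus actual clock position. */
int64_t true_error(const struct tick_model *m, int n)
{
	return (int64_t)n * (m->ntp_tick - m->xtime_interval);
}
```

With ntp_tick = 1000000, xtime_interval = 999990 and xtime_remainder =
10, the old subtraction reports zero error after any number of ticks,
while the clock has really fallen behind by 10 units per tick. The fixed
subtraction tracks that shortfall, so the dithering can correct it.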

Fixes: a386b5af8edd ("time: Compensate for rounding on odd-frequency clocksources")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/timekeeper_internal.h | 2 --
 kernel/time/timekeeping.c           | 7 +++----
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index e36d11e33e0c..2f4cfcfcaac0 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -84,7 +84,6 @@ struct tk_read_base {
  * @cycle_interval:		Number of clock cycles in one NTP interval
  * @xtime_interval:		Number of clock shifted nano seconds in one NTP
  *				interval.
- * @xtime_remainder:		Shifted nano seconds left over when rounding
  *				@cycle_interval
  * @raw_interval:		Shifted raw nano seconds accumulated per NTP interval.
  * @next_leap_ktime:		CLOCK_MONOTONIC time value of a pending leap-second
@@ -178,7 +177,6 @@ struct timekeeper {
 
 	u64			cycle_interval;
 	u64			xtime_interval;
-	s64			xtime_remainder;
 	u64			raw_interval;
 
 	ktime_t			next_leap_ktime;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c493a4010305..3da7167ceb0d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -360,7 +360,6 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 
 	/* Go back from cycles -> shifted ns */
 	tk->xtime_interval = interval * clock->mult;
-	tk->xtime_remainder = ntpinterval - tk->xtime_interval;
 	tk->raw_interval = interval * clock->mult;
 
 	 /* if changing clocks, convert xtime_nsec shift units */
@@ -2337,8 +2336,8 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		mult = tk->tkr_mono.mult - tk->ntp_err_mult;
 	} else {
 		tk->ntp_tick = ntp_tl;
-		mult = div64_u64((tk->ntp_tick >> tk->ntp_error_shift) -
-				 tk->xtime_remainder, tk->cycle_interval);
+		mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
+				 tk->cycle_interval);
 	}
 
 	/*
@@ -2463,7 +2462,7 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 
 	/* Accumulate error between NTP and clock interval */
 	tk->ntp_error += tk->ntp_tick << shift;
-	tk->ntp_error -= (tk->xtime_interval + tk->xtime_remainder) <<
+	tk->ntp_error -= tk->xtime_interval <<
 						(tk->ntp_error_shift + shift);
 
 	return offset;
-- 
2.51.0

* [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 1/8] timekeeping: Remove xtime_remainder from ntp_error accumulation David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-19  1:59   ` John Stultz
  2026-05-17 21:25 ` [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail David Woodhouse
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

timekeeping_apply_adjustment() modifies xtime_nsec to maintain vDSO
monotonicity when mult changes:

    xtime_nsec -= offset

This ensures that the time reported to userspace doesn't jump when the
multiplier is adjusted. However, ntp_error — which tracks the difference
between intended and actual clock position — was not updated to reflect
this change.

After a mult change, xtime_nsec has moved but ntp_error still reflects
the old position. For the normal ±1 dithering this is negligible (the
adjustments cancel over time), but for larger mult changes — such as
when an external reference clock sets a new frequency — the one-time
uncompensated offset is significant (~38ns for a 700-count mult change).

Fix by adjusting ntp_error by the same amount:

    ntp_error += offset << ntp_error_shift

This keeps ntp_error consistent with the actual xtime_nsec position
after the clawback.
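
A small userspace model of the adjustment may help show the invariant.
This is a sketch under assumptions: the struct and function names are
invented here, and the clawback is taken to be mult_adj multiplied by
the number of cycles not yet accumulated, which is what keeps the
reported time continuous across the mult change.

```c
#include <stdint.h>

/* Toy timekeeper state; field names echo the kernel's but this is a model. */
struct tk_model {
	uint32_t mult;
	uint64_t xtime_nsec;      /* shifted nanoseconds */
	int64_t  ntp_error;       /* shifted ns, scaled up by ntp_error_shift */
	uint32_t ntp_error_shift;
};

/*
 * Change mult by mult_adj while pending_cycles have not yet been
 * accumulated. The clawback keeps xtime_nsec + pending_cycles * mult
 * unchanged, and the same amount is credited to ntp_error.
 */
void apply_adjustment(struct tk_model *tk, int32_t mult_adj, uint64_t pending_cycles)
{
	int64_t offset = (int64_t)mult_adj * (int64_t)pending_cycles;

	tk->mult += (uint32_t)mult_adj;
	tk->xtime_nsec -= (uint64_t)offset;
	/* Multiply rather than shift so a negative offset is well defined. */
	tk->ntp_error += offset * ((int64_t)1 << tk->ntp_error_shift);
}
```

After the call, the time a reader would compute is the same as before,
and ntp_error has grown by exactly the amount xtime_nsec was pulled
back, so the two stay consistent.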

Fixes: 1b1b3e2a3671 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 kernel/time/timekeeping.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3da7167ceb0d..050123fc179b 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2317,6 +2317,7 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
 	tk->tkr_mono.mult += mult_adj;
 	tk->xtime_interval += interval;
 	tk->tkr_mono.xtime_nsec -= offset;
+	tk->ntp_error += offset << tk->ntp_error_shift;
 }
 
 /*
-- 
2.51.0

* [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 1/8] timekeeping: Remove xtime_remainder from ntp_error accumulation David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-19 13:25   ` Miroslav Lichvar
  2026-05-17 21:25 ` [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline David Woodhouse
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

ntp_offset_chunk() computes delta as time_offset >> (SHIFT_PLL +
time_constant), which exponentially decays toward zero but never
reaches it. This means time_offset asymptotically approaches zero
without ever completing — the clock never fully converges.

Fix by clamping delta:
 - Minimum: 20ns/sec (NTP_OFFSET_DELTA_MIN), ensuring the tail
   converges in finite time
 - Maximum: time_offset itself, preventing overshoot on the final
   second

This preserves the exponential decay behavior for large offsets
while ensuring precise convergence for the tail.
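
The effect of the clamp can be sketched in userspace as follows. The
function names are made up for this example, the values are plain
integers rather than the kernel's scaled tick_length units, and the
decay is modelled as a truncating division by 2^shift, which is an
assumption about how the sign-symmetric right shift behaves.

```c
#include <stdint.h>

/* Apply the minimum and maximum clamp described above. */
int64_t clamp_offset_delta(int64_t time_offset, int64_t delta, int64_t delta_min)
{
	if (time_offset > 0) {
		if (delta < delta_min)
			delta = delta_min;
		if (delta > time_offset)
			delta = time_offset;   /* do not overshoot zero */
	} else if (time_offset < 0) {
		if (delta > -delta_min)
			delta = -delta_min;
		if (delta < time_offset)
			delta = time_offset;
	}
	return delta;
}

/*
 * Count once-per-second steps until time_offset reaches zero, giving up
 * after max_seconds. delta_min = 0 models the unclamped behaviour.
 */
int seconds_to_converge(int64_t time_offset, unsigned int shift,
			int64_t delta_min, int max_seconds)
{
	int n = 0;

	while (time_offset != 0 && n < max_seconds) {
		int64_t delta = time_offset / ((int64_t)1 << shift);

		delta = clamp_offset_delta(time_offset, delta, delta_min);
		time_offset -= delta;
		n++;
	}
	return n;
}
```

With delta_min = 0 the loop stalls once the offset drops below 2^shift,
because the computed delta becomes zero. With a non-zero minimum the
offset reaches exactly zero in a finite number of steps, for either
sign.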

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 kernel/time/ntp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 97fa99b96dd0..2d6d00ae5bf7 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -104,6 +104,8 @@ static struct ntp_data tk_ntp_data[TIMEKEEPERS_MAX] = {
 #define MAX_TICKADJ_SCALED \
 	(((MAX_TICKADJ * NSEC_PER_USEC) << NTP_SCALE_SHIFT) / NTP_INTERVAL_FREQ)
 #define MAX_TAI_OFFSET		100000
+/* Minimum skew rate for exponential tail: 20ns/s in tick_length units */
+#define NTP_OFFSET_DELTA_MIN	(((s64)20 << NTP_SCALE_SHIFT) / NTP_INTERVAL_FREQ)
 
 #ifdef CONFIG_NTP_PPS
 
@@ -461,6 +463,17 @@ int second_overflow(unsigned int tkid, time64_t secs)
 	ntpdata->tick_length	 = ntpdata->tick_length_base;
 
 	delta			 = ntp_offset_chunk(ntpdata, ntpdata->time_offset);
+	if (ntpdata->time_offset > 0) {
+		if (delta < NTP_OFFSET_DELTA_MIN)
+			delta = NTP_OFFSET_DELTA_MIN;
+		if (delta > ntpdata->time_offset)
+			delta = ntpdata->time_offset;
+	} else if (ntpdata->time_offset < 0) {
+		if (delta > -NTP_OFFSET_DELTA_MIN)
+			delta = -NTP_OFFSET_DELTA_MIN;
+		if (delta < ntpdata->time_offset)
+			delta = ntpdata->time_offset;
+	}
 	ntpdata->time_offset	-= delta;
 	ntpdata->tick_length	+= delta;
 
-- 
2.51.0

* [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (2 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-19  2:09   ` John Stultz
  2026-05-17 21:25 ` [RFC PATCH v2 5/8] ptp_vmclock: Feed reference to timekeeping for feed-forward discipline David Woodhouse
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

Add timekeeping_set_reference() which allows an external clock source
(such as a hypervisor vmclock) to provide an absolute time reference.
The reference defines a linear counter-to-time mapping that the kernel
uses to set both the frequency and phase of the system clock.

When timekeeping_set_reference() is called:
 - tick_length is computed from the reference period and set via
   ntp_set_tick_length(), keeping all NTP state consistent
 - A pending flag is set so that on the next tick (under the
   timekeeping lock), the phase error is set via ntp_set_time_offset()

The existing time_offset slew mechanism then converges the clock to
the reference, with the clawback fix ensuring ntp_error stays accurate
across mult changes.
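
The tick_length computation in the first step can be sketched in
userspace like this. The function name is invented for the example and
unsigned __int128 (a GCC/Clang extension) stands in for the kernel's
mul_u64_u64_shr(). The sketch assumes the period is expressed in units
of 2^-(64 + period_shift) seconds and that the result is nanoseconds
per tick scaled up by 2^32.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Nanoseconds per tick, scaled by 2^32, for a tick that lasts
 * cycle_interval counter cycles of the given period.
 *
 * period [s] = period_frac_sec / 2^(64 + period_shift)
 * tick   [ns << 32] = period * cycle_interval * 1e9 * 2^32
 *                   = (period_frac_sec * cycle_interval * 1e9)
 *                     >> (32 + period_shift)
 */
uint64_t ref_tick_length(uint64_t period_frac_sec, uint8_t period_shift,
			 uint64_t cycle_interval)
{
	unsigned __int128 p;

	/* cycle_interval * NSEC_PER_SEC is assumed to fit in 64 bits. */
	p = (unsigned __int128)period_frac_sec * (cycle_interval * NSEC_PER_SEC);
	return (uint64_t)(p >> (32 + period_shift));
}
```

For a 2^30 Hz counter and a tick of 2^20 cycles, the tick lasts
976562.5 ns, and the function returns that value scaled by 2^32.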

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/timekeeping_reference.h | 19 ++++++++++++
 kernel/time/ntp.c                     | 27 ++++++++++++++++
 kernel/time/ntp_internal.h            |  3 ++
 kernel/time/timekeeping.c             | 44 +++++++++++++++++++++++++++
 4 files changed, 93 insertions(+)
 create mode 100644 include/linux/timekeeping_reference.h

diff --git a/include/linux/timekeeping_reference.h b/include/linux/timekeeping_reference.h
new file mode 100644
index 000000000000..4c1d8a6c02f1
--- /dev/null
+++ b/include/linux/timekeeping_reference.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIMEKEEPING_REFERENCE_H
+#define _LINUX_TIMEKEEPING_REFERENCE_H
+
+#include <linux/clocksource_ids.h>
+#include <linux/types.h>
+
+struct tk_reference {
+	enum clocksource_ids	cs_id;
+	u64			counter_value;
+	u64			time_sec;
+	u64			time_frac_sec;
+	u64			period_frac_sec;
+	u8			period_shift;
+};
+
+int timekeeping_set_reference(const struct tk_reference *ref);
+
+#endif
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 2d6d00ae5bf7..79e76bb6942b 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -366,6 +366,33 @@ u64 ntp_tick_length(unsigned int tkid)
 	return tk_ntp_data[tkid].tick_length;
 }
 
+u64 ntp_tick_length_base(unsigned int tkid)
+{
+	return tk_ntp_data[tkid].tick_length_base;
+}
+
+void ntp_set_time_offset(unsigned int tkid, s64 offset_ns)
+{
+	struct ntp_data *ntpdata = &tk_ntp_data[tkid];
+
+	ntpdata->time_offset = div_s64((s64)offset_ns << NTP_SCALE_SHIFT,
+				       NTP_INTERVAL_FREQ);
+	ntpdata->time_adjust = 0;
+}
+
+void ntp_set_tick_length(unsigned int tkid, u64 tick_length)
+{
+	struct ntp_data *ntpdata = &tk_ntp_data[tkid];
+	u64 base;
+
+	base = (u64)(ntpdata->tick_usec * NSEC_PER_USEC * USER_HZ)
+		<< NTP_SCALE_SHIFT;
+	base += ntpdata->ntp_tick_adj;
+
+	ntpdata->time_freq = (s64)(tick_length * NTP_INTERVAL_FREQ - base);
+	ntp_update_frequency(ntpdata);
+}
+
 /**
  * ntp_get_next_leap - Returns the next leapsecond in CLOCK_REALTIME ktime_t
  * @tkid:	Timekeeper ID
diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h
index 7084d839c207..44306ffe25ff 100644
--- a/kernel/time/ntp_internal.h
+++ b/kernel/time/ntp_internal.h
@@ -6,6 +6,9 @@ extern void ntp_init(void);
 extern void ntp_clear(unsigned int tkid);
 /* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */
 extern u64 ntp_tick_length(unsigned int tkid);
+extern u64 ntp_tick_length_base(unsigned int tkid);
+extern void ntp_set_time_offset(unsigned int tkid, s64 offset_ns);
+extern void ntp_set_tick_length(unsigned int tkid, u64 tick_length);
 extern ktime_t ntp_get_next_leap(unsigned int tkid);
 extern int second_overflow(unsigned int tkid, time64_t secs);
 extern int ntp_adjtimex(unsigned int tkid, struct __kernel_timex *txc, const struct timespec64 *ts,
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 050123fc179b..89fed9473c38 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2324,6 +2324,33 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
  * Adjust the timekeeper's multiplier to the correct frequency
  * and also to reduce the accumulated error value.
  */
+
+#include <linux/timekeeping_reference.h>
+
+static struct tk_reference tk_ref;
+static bool tk_ref_valid;
+static bool tk_ref_pending;
+
+int timekeeping_set_reference(const struct tk_reference *ref)
+{
+	u64 ci = tk_core.timekeeper.cycle_interval;
+	u64 new_tl;
+
+	tk_ref = *ref;
+
+	new_tl = mul_u64_u64_shr(ref->period_frac_sec,
+			(u64)ci * NSEC_PER_SEC,
+			32 + ref->period_shift);
+	ntp_set_tick_length(TIMEKEEPER_CORE, new_tl);
+
+	/* Ensure tk_ref fields are visible before tk_ref_valid/pending */
+	smp_wmb();
+	tk_ref_valid = true;
+	tk_ref_pending = true;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(timekeeping_set_reference);
+
 static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
 	u64 ntp_tl = ntp_tick_length(tk->id);
@@ -2339,6 +2366,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		tk->ntp_tick = ntp_tl;
 		mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
 				 tk->cycle_interval);
+		if (tk_ref_pending && tk->cs_id == tk_ref.cs_id) {
+			u64 d = tk->tkr_mono.cycle_last - tk_ref.counter_value;
+			__uint128_t p = (__uint128_t)d * tk_ref.period_frac_sec;
+			u64 rf;
+			s64 ref_err;
+
+			p >>= tk_ref.period_shift;
+			p += tk_ref.time_frac_sec;
+			rf = (u64)p;
+			ref_err = (s64)mul_u64_u64_shr(rf,
+				(u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
+				(s64)tk->tkr_mono.xtime_nsec;
+			ntp_set_time_offset(tk->id,
+				ref_err >> tk->tkr_mono.shift);
+			tk->ntp_error = 0;
+			tk_ref_pending = false;
+		}
 	}
 
 	/*
-- 
2.51.0

* [RFC PATCH v2 5/8] ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (3 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 6/8] timekeeping: Guard against divide-by-zero in timekeeping_adjust David Woodhouse
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

When a vmclock device provides valid time, call timekeeping_set_reference()
to enable feed-forward clock discipline. This eliminates drift between the
system clock and the vmclock reference.

The reference is set at probe time (after PTP registration) and updated
on each notification from the hypervisor (ACPI or DT interrupt).

The driver does not supply cycle_interval; timekeeping_set_reference()
takes it from the current timekeeper.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/ptp/ptp_vmclock.c | 79 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index 8b630eb916b5..3699ee4465ac 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -27,6 +27,7 @@
 #include <uapi/linux/vmclock-abi.h>
 
 #include <linux/ptp_clock_kernel.h>
+#include <linux/timekeeping_reference.h>
 
 #ifdef CONFIG_X86
 #include <asm/pvclock.h>
@@ -48,6 +49,7 @@ struct vmclock_state {
 	wait_queue_head_t disrupt_wait;
 	struct ptp_clock_info ptp_clock_info;
 	struct ptp_clock *ptp_clock;
+	struct timer_list cmp_timer;
 	enum clocksource_ids cs_id, sys_cs_id;
 	int index;
 	char *name;
@@ -334,6 +336,76 @@ static const struct ptp_clock_info ptp_vmclock_info = {
 	.getcrosststamp = ptp_vmclock_getcrosststamp,
 };
 
+static void vmclock_cmp_timer_fn(struct timer_list *t)
+{
+	struct vmclock_state *st = container_of(t, struct vmclock_state, cmp_timer);
+	volatile struct vmclock_abi *clk = st->clk;
+	struct system_time_snapshot snap;
+	unsigned __int128 product;
+	u64 delta, ref_frac, ref_ns, sys_ns;
+	s64 diff;
+
+	ktime_get_snapshot(&snap);
+	if (snap.cs_id != st->cs_id)
+		goto rearm;
+
+	delta = snap.cycles - le64_to_cpu(clk->counter_value);
+	product = (unsigned __int128)delta * le64_to_cpu(clk->counter_period_frac_sec);
+	product >>= clk->counter_period_shift;
+	product += le64_to_cpu(clk->time_frac_sec);
+	ref_frac = (u64)product;
+	ref_ns = mul_u64_u64_shr(ref_frac, NSEC_PER_SEC, 64);
+	ref_ns += (le64_to_cpu(clk->time_sec) + (u64)(product >> 64)) * NSEC_PER_SEC;
+
+	sys_ns = ktime_to_ns(snap.real) - (s64)(int16_t)le16_to_cpu(clk->tai_offset_sec) * NSEC_PER_SEC;
+	diff = (s64)(ref_ns - sys_ns);
+	pr_info("vmclock_cmp: diff=%lldns tsc=%llx\n", diff, snap.cycles);
+
+rearm:
+	mod_timer(&st->cmp_timer, jiffies + msecs_to_jiffies(500));
+}
+
+static void vmclock_set_tk_reference(struct vmclock_state *st)
+{
+	struct vmclock_abi *clk = st->clk;
+	struct tk_reference ref = {
+		.cs_id = st->cs_id,
+		.counter_value = le64_to_cpu(clk->counter_value),
+		.time_sec = le64_to_cpu(clk->time_sec),
+		.time_frac_sec = le64_to_cpu(clk->time_frac_sec),
+		.period_frac_sec = le64_to_cpu(clk->counter_period_frac_sec),
+		.period_shift = clk->counter_period_shift,
+	};
+
+	/* Convert TAI to UTC for comparison with xtime_sec */
+	if (clk->time_type == VMCLOCK_TIME_TAI &&
+	    (le64_to_cpu(clk->flags) & VMCLOCK_FLAG_TAI_OFFSET_VALID))
+		ref.time_sec += (int16_t)le16_to_cpu(clk->tai_offset_sec);
+
+	if (clk->clock_status != VMCLOCK_STATUS_UNRELIABLE) {
+		/* Step clock if far from reference */
+		struct timespec64 now, vmtime;
+		unsigned __int128 product;
+		u64 cycles = get_cycles();
+		u64 delta_cycles = cycles - ref.counter_value;
+		s64 delta_ns;
+
+		product = (unsigned __int128)delta_cycles * ref.period_frac_sec;
+		product >>= ref.period_shift;
+		product += ref.time_frac_sec;
+		vmtime.tv_sec = ref.time_sec + (u64)(product >> 64);
+		vmtime.tv_nsec = mul_u64_u64_shr((u64)product,
+						  NSEC_PER_SEC, 64);
+
+		ktime_get_real_ts64(&now);
+		delta_ns = timespec64_to_ns(&vmtime) - timespec64_to_ns(&now);
+		if (delta_ns > 100000000 || delta_ns < -100000000)
+			do_settimeofday64(&vmtime);
+
+		timekeeping_set_reference(&ref);
+	}
+}
+
 static struct ptp_clock *vmclock_ptp_register(struct device *dev,
 					      struct vmclock_state *st)
 {
@@ -525,6 +597,7 @@ vmclock_acpi_notification_handler(acpi_handle __always_unused handle,
 	struct device *device = dev;
 	struct vmclock_state *st = device->driver_data;
 
+	vmclock_set_tk_reference(st);
 	wake_up_interruptible(&st->disrupt_wait);
 }
 
@@ -580,6 +653,7 @@ static irqreturn_t vmclock_of_irq_handler(int __always_unused irq, void *_st)
 {
 	struct vmclock_state *st = _st;
 
+	vmclock_set_tk_reference(st);
 	wake_up_interruptible(&st->disrupt_wait);
 	return IRQ_HANDLED;
 }
@@ -751,8 +825,13 @@ static int vmclock_probe(struct platform_device *pdev)
 			st->ptp_clock = NULL;
 			return ret;
 		}
+		if (st->ptp_clock)
+			vmclock_set_tk_reference(st);
 	}
 
+	timer_setup(&st->cmp_timer, vmclock_cmp_timer_fn, 0);
+	mod_timer(&st->cmp_timer, jiffies + msecs_to_jiffies(500));
+
 	if (!st->miscdev.minor && !st->ptp_clock) {
 		/* Neither miscdev nor PTP registered */
 		dev_info(dev, "vmclock: Neither miscdev nor PTP available; not registering\n");
-- 
2.51.0

* [RFC PATCH v2 6/8] timekeeping: Guard against divide-by-zero in timekeeping_adjust
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (4 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 5/8] ptp_vmclock: Feed reference to timekeeping for feed-forward discipline David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 7/8] timekeeping: Drive time_offset skew via per-tick ntp_error transfer David Woodhouse
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

When the TSC clocksource is recalibrated (e.g. on KVM guests with
clocksource=tsc), cycle_interval can momentarily be zero during the
transition. Guard the div64_u64 in timekeeping_adjust() to prevent a
divide-by-zero oops.

This can be triggered on KVM guests that force clocksource=tsc when
the host TSC frequency doesn't match what KVM initially reports,
causing a recalibration during boot.

Signed-off-by: David Woodhouse (Kiro) <dwmw@amazon.co.uk>
---
 kernel/time/timekeeping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 89fed9473c38..1cc98fdda4f8 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2363,6 +2363,8 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 	if (likely(tk->ntp_tick == ntp_tl)) {
 		mult = tk->tkr_mono.mult - tk->ntp_err_mult;
 	} else {
+		if (unlikely(!tk->cycle_interval))
+			return;
 		tk->ntp_tick = ntp_tl;
 		mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
 				 tk->cycle_interval);
-- 
2.51.0

* [RFC PATCH v2 7/8] timekeeping: Drive time_offset skew via per-tick ntp_error transfer
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (5 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 6/8] timekeeping: Guard against divide-by-zero in timekeeping_adjust David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-17 21:25 ` [RFC PATCH v2 8/8] WIP: kernel/time: Add /dev/vmclock_host miscdev David Woodhouse
  2026-05-19 13:16 ` [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock Miroslav Lichvar
  8 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

Instead of inflating tick_length to effect the time_offset slew,
transfer the skew to ntp_error per-tick and drain time_offset at the
equivalent per-tick rate:

 - ntp_error += skew_delta << shift (biases dithering to deliver skew)
 - time_offset -= skew_delta / NTP_INTERVAL_FREQ (per-tick drain)

This simplifies the accounting and allows the skew towards time_offset
to be very nearly cycle-accurate.

Compute mult from (ntp_tick + skew_delta) so the dithering has enough
bandwidth to deliver the skew rate by selecting between mult and mult+1.
This applies a skew equivalent to the old tick_length += delta approach
but without modifying tick_length.

To eliminate remainder error in the per-tick division, skew_delta is
rounded to a multiple of NTP_INTERVAL_FREQ in second_overflow().

second_overflow() computes skew_delta (the exponential decay rate)
but no longer drains time_offset or inflates tick_length directly.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
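The accounting described above can be modelled in userspace. This is a
sketch under stated assumptions, not the kernel code: INTERVAL_FREQ is an
assumed stand-in for NTP_INTERVAL_FREQ, the accumulation shift is taken to
be zero, and round_skew(), drain_time_offset() and run_one_second() are
hypothetical names. Only the clamping logic is copied from
ntp_drain_time_offset() in the diff below.

```c
#include <assert.h>
#include <stdint.h>

#define INTERVAL_FREQ 1000	/* assumed ticks per second */

/* Round towards zero to a multiple of INTERVAL_FREQ, as second_overflow() does. */
static int64_t round_skew(int64_t delta)
{
	return delta - delta % INTERVAL_FREQ;
}

/*
 * Same clamping logic as ntp_drain_time_offset() in the patch: remove
 * 'amount' from *time_offset without crossing zero, and return the part
 * that could not be drained.
 */
static int64_t drain_time_offset(int64_t *time_offset, int64_t amount)
{
	int64_t undrained = 0;

	if ((amount > 0 && *time_offset < amount) ||
	    (amount < 0 && *time_offset > amount)) {
		undrained = amount - *time_offset;
		*time_offset = 0;
	} else {
		*time_offset -= amount;
	}
	return undrained;
}

/* Run one second's worth of ticks; returns the total moved into ntp_error. */
static int64_t run_one_second(int64_t *time_offset, int64_t skew_delta)
{
	int64_t ntp_error = 0;
	int tick;

	/* The rounding is what makes the per-tick division exact. */
	assert(skew_delta % INTERVAL_FREQ == 0);

	for (tick = 0; tick < INTERVAL_FREQ; tick++) {
		ntp_error += skew_delta;
		drain_time_offset(time_offset, skew_delta / INTERVAL_FREQ);
	}
	return ntp_error;
}
```

In this model, one second of ticks drains exactly skew_delta from
time_offset while ntp_error receives skew_delta * INTERVAL_FREQ, which
matches what the old once-per-second "time_offset -= delta;
tick_length += delta" delivered over the same interval.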
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/ntp.c                   | 29 +++++++++++++++++++++++++++--
 kernel/time/ntp_internal.h          |  2 ++
 kernel/time/timekeeping.c           | 27 ++++++++++++++++++++++-----
 4 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 2f4cfcfcaac0..006437761262 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -185,6 +185,7 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s64			skew_delta;
 	s32			tai_offset;
 };
 
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 79e76bb6942b..f4bf7e78c230 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -63,6 +63,7 @@ struct ntp_data {
 	int			time_state;
 	int			time_status;
 	s64			time_offset;
+	s64			skew_delta;
 	long			time_constant;
 	long			time_maxerror;
 	long			time_esterror;
@@ -371,6 +372,25 @@ u64 ntp_tick_length_base(unsigned int tkid)
 	return tk_ntp_data[tkid].tick_length_base;
 }
 
+s64 ntp_get_skew_delta(unsigned int tkid)
+{
+	return tk_ntp_data[tkid].skew_delta;
+}
+s64 ntp_drain_time_offset(unsigned int tkid, s64 amount)
+{
+	struct ntp_data *ntpdata = &tk_ntp_data[tkid];
+	s64 undrained = 0;
+
+	if ((amount > 0 && ntpdata->time_offset < amount) ||
+	    (amount < 0 && ntpdata->time_offset > amount)) {
+		undrained = amount - ntpdata->time_offset;
+		ntpdata->time_offset = 0;
+	} else {
+		ntpdata->time_offset -= amount;
+	}
+	return undrained;
+}
+
 void ntp_set_time_offset(unsigned int tkid, s64 offset_ns)
 {
 	struct ntp_data *ntpdata = &tk_ntp_data[tkid];
@@ -501,8 +521,13 @@ int second_overflow(unsigned int tkid, time64_t secs)
 		if (delta < ntpdata->time_offset)
 			delta = ntpdata->time_offset;
 	}
-	ntpdata->time_offset	-= delta;
-	ntpdata->tick_length	+= delta;
+	/*
+	 * Set the per-tick skew rate for the tick code. This is in the
+	 * same units as tick_length (ns << NTP_SCALE_SHIFT), and is
+	 * rounded to a multiple of NTP_INTERVAL_FREQ so that the per-tick
+	 * division in the tick code is exact.
+	 */
+	ntpdata->skew_delta = delta - delta % NTP_INTERVAL_FREQ;
 
 	/* Check PPS signal */
 	pps_dec_valid(ntpdata);
diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h
index 44306ffe25ff..d0460449eb50 100644
--- a/kernel/time/ntp_internal.h
+++ b/kernel/time/ntp_internal.h
@@ -7,6 +7,8 @@ extern void ntp_clear(unsigned int tkid);
 /* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */
 extern u64 ntp_tick_length(unsigned int tkid);
 extern u64 ntp_tick_length_base(unsigned int tkid);
+extern s64 ntp_get_skew_delta(unsigned int tkid);
+extern s64 ntp_drain_time_offset(unsigned int tkid, s64 amount);
 extern void ntp_set_time_offset(unsigned int tkid, s64 offset_ns);
 extern void ntp_set_tick_length(unsigned int tkid, u64 tick_length);
 extern ktime_t ntp_get_next_leap(unsigned int tkid);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1cc98fdda4f8..f20bc76f43ca 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2354,20 +2354,23 @@ EXPORT_SYMBOL_GPL(timekeeping_set_reference);
 static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
 	u64 ntp_tl = ntp_tick_length(tk->id);
+	s64 skew = ntp_get_skew_delta(tk->id);
 	u32 mult;
 
 	/*
-	 * Determine the multiplier from the current NTP tick length.
-	 * Avoid expensive division when the tick length doesn't change.
+	 * Determine the multiplier from the current NTP tick length plus
+	 * skew_delta. The skew biases mult so that ±1 dithering can deliver
+	 * the time_offset slew rate. Recompute when either changes.
 	 */
-	if (likely(tk->ntp_tick == ntp_tl)) {
+	if (likely(tk->ntp_tick == ntp_tl && tk->skew_delta == skew)) {
 		mult = tk->tkr_mono.mult - tk->ntp_err_mult;
 	} else {
 		if (unlikely(!tk->cycle_interval))
 			return;
 		tk->ntp_tick = ntp_tl;
-		mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
-				 tk->cycle_interval);
+		tk->skew_delta = skew;
+		mult = div64_u64((tk->ntp_tick + skew) >> tk->ntp_error_shift,
+				  tk->cycle_interval);
 		if (tk_ref_pending && tk->cs_id == tk_ref.cs_id) {
 			u64 d = tk->tkr_mono.cycle_last - tk_ref.counter_value;
 			__uint128_t p = (__uint128_t)d * tk_ref.period_frac_sec;
@@ -2512,6 +2515,20 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	tk->ntp_error -= tk->xtime_interval <<
 						(tk->ntp_error_shift + shift);
 
+	/*
+	 * During clock skew driven by ntpdata->time_offset, transfer a
+	 * *portion* of the requested total delta into ntp_error from
+	 * time_offset each tick. The second_overflow() function sets
+	 * the rate of skew, and the value of 'mult' has been selected
+	 * in order to allow the dithering to keep ntp_error around zero
+	 * even while this adjustment is being applied.
+	 */
+	if (tk->skew_delta) {
+		s64 drain = div_s64(tk->skew_delta << shift, NTP_INTERVAL_FREQ);
+		tk->ntp_error += tk->skew_delta << shift;
+		ntp_drain_time_offset(tk->id, drain);
+	}
+
 	return offset;
 }
 
-- 
2.51.0



* [RFC PATCH v2 8/8] WIP: kernel/time: Add /dev/vmclock_host miscdev
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (6 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 7/8] timekeeping: Drive time_offset skew via per-tick ntp_error transfer David Woodhouse
@ 2026-05-17 21:25 ` David Woodhouse
  2026-05-19 13:16 ` [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock Miroslav Lichvar
  8 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-17 21:25 UTC (permalink / raw)
  To: Richard Cochran, Wen Gu, David Woodhouse, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Miroslav Lichvar,
	Julien Ridoux, Ryan Luu, linux-kernel
  Cc: David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

Expose the host's NTP-disciplined clock as a vmclock_abi page via
/dev/vmclock_host. A VMM can mmap or poll() this device to obtain
precision time parameters for relaying to guests.

The page is updated only when ntp_tick changes (i.e., when NTP
actually adjusts the frequency), not on every timekeeping tick.
This avoids the per-tick overhead of the existing pvclock_gtod
notifier while providing the same information.

Fields populated:
- counter_id: X86_TSC
- time_type: TAI
- counter_value: TSC at reference point
- time_sec/time_frac_sec: TAI at reference point
- counter_period_frac_sec: NTP-disciplined TSC period
- tai_offset_sec: current UTC-TAI offset

NOT YET DONE:
- Error bounds (esterror/maxerror)
- Leap second indicator
- Disruption marker (needs clocksource change hook)
- Selftest
---
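As a cross-check of the counter_period_frac_sec conversion, the same
quantity can be computed with 128-bit arithmetic in userspace. This is a
hedged sketch, not the patch's iterative division: period_from_tick() is a
hypothetical name, __uint128_t is a GCC/Clang extension, and the sketch
only shifts down to make the quotient fit (it does not normalise upwards,
and does not handle a negative shift, which the u8 shift field used in the
patch could not carry).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute frac_sec and shift such that
 *   period_seconds = frac_sec / 2^(64 + shift)
 * using
 *   frac_sec = ntp_tick * 2^(32 + shift) / (cycle_interval * 10^9)
 * where ntp_tick is nanoseconds per tick in <<32 fixed point and
 * cycle_interval is counter cycles per tick.
 */
static void period_from_tick(uint64_t ntp_tick, uint64_t cycle_interval,
			     uint64_t *frac_sec, int *shift)
{
	__uint128_t divisor, q;
	int s = 32;	/* (ntp_tick << 64) / divisor corresponds to shift == 32 */

	assert(cycle_interval != 0);
	divisor = (__uint128_t)cycle_interval * 1000000000ULL;
	q = ((__uint128_t)ntp_tick << 64) / divisor;

	/* Drop low bits until the quotient fits in 64 bits. */
	while (q >> 64) {
		q >>= 1;
		s--;
	}
	*frac_sec = (uint64_t)q;
	*shift = s;
}
```

For example, a 1 GHz counter with a 1 ms tick (ntp_tick = 1000000 << 32,
cycle_interval = 1000000) has a period of 10^-9 seconds, so frac_sec comes
out as 2^93 / 10^9 with shift 29.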
 include/linux/vmclock_host.h                  |  17 +
 kernel/time/Kconfig                           |   8 +
 kernel/time/Makefile                          |   1 +
 kernel/time/ntp.c                             |   3 +-
 kernel/time/ntp_internal.h                    |   1 +
 kernel/time/timekeeping.c                     |   6 +
 kernel/time/vmclock_host.c                    | 319 ++++++++++++++++++
 .../selftests/timers/vmclock_host_test.c      | 171 ++++++++++
 8 files changed, 525 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vmclock_host.h
 create mode 100644 kernel/time/vmclock_host.c
 create mode 100644 tools/testing/selftests/timers/vmclock_host_test.c

diff --git a/include/linux/vmclock_host.h b/include/linux/vmclock_host.h
new file mode 100644
index 000000000000..388a5a1b470c
--- /dev/null
+++ b/include/linux/vmclock_host.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VMCLOCK_HOST_H
+#define _LINUX_VMCLOCK_HOST_H
+
+struct timekeeper;
+
+extern void (*vmclock_host_update_fn)(struct timekeeper *tk);
+
+static inline void vmclock_host_update(struct timekeeper *tk)
+{
+	typeof(vmclock_host_update_fn) fn = READ_ONCE(vmclock_host_update_fn);
+
+	if (fn)
+		fn(tk);
+}
+
+#endif /* _LINUX_VMCLOCK_HOST_H */
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 02aac7c5aa76..493ffda434a8 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -221,4 +221,12 @@ config POSIX_AUX_CLOCKS
 	  and other clock domains, which are not correlated to the TAI/NTP
 	  notion of time.
 
+config VMCLOCK_HOST
+	tristate "VMClock host time provider (/dev/vmclock_host)"
+	depends on X86_TSC || ARM64
+	help
+	  Expose the host NTP-disciplined clock as a vmclock page via
+	  /dev/vmclock_host for VMMs to relay precision time to guests.
+
 endmenu
+
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index eaf290c972f9..549070254e3a 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_TIME_NS)				+= namespace.o
 obj-$(CONFIG_TIME_NS_VDSO)			+= namespace_vdso.o
 obj-$(CONFIG_TEST_CLOCKSOURCE_WATCHDOG)		+= clocksource-wdtest.o
 obj-$(CONFIG_TIME_KUNIT_TEST)			+= time_test.o
+obj-$(CONFIG_VMCLOCK_HOST)	+= vmclock_host.o
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index f4bf7e78c230..e60d9f7da9e3 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -676,10 +676,11 @@ static inline int update_rtc(struct timespec64 *to_set, unsigned long *offset_ns
  * ntp_synced - Tells whether the NTP status is not UNSYNC
  * Returns:	true if not UNSYNC, false otherwise
  */
-static inline bool ntp_synced(void)
+bool ntp_synced(void)
 {
 	return !(tk_ntp_data[TIMEKEEPER_CORE].time_status & STA_UNSYNC);
 }
+EXPORT_SYMBOL_GPL(ntp_synced);
 
 /*
  * If we have an externally synchronized Linux clock, then update RTC clock
diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h
index d0460449eb50..0a5d26b22d6a 100644
--- a/kernel/time/ntp_internal.h
+++ b/kernel/time/ntp_internal.h
@@ -3,6 +3,7 @@
 #define _LINUX_NTP_INTERNAL_H
 
 extern void ntp_init(void);
+extern bool ntp_synced(void);
 extern void ntp_clear(unsigned int tkid);
 /* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */
 extern u64 ntp_tick_length(unsigned int tkid);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f20bc76f43ca..37d30283ad60 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -27,6 +27,10 @@
 #include "tick-internal.h"
 #include "timekeeping_internal.h"
 #include "ntp_internal.h"
+#include <linux/vmclock_host.h>
+
+void (*vmclock_host_update_fn)(struct timekeeper *tk);
+EXPORT_SYMBOL_GPL(vmclock_host_update_fn);
 
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_CLOCK_WAS_SET	(1 << 1)
@@ -2390,6 +2394,8 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		}
 	}
 
+	vmclock_host_update(tk);
+
 	/*
 	 * If the clock is behind the NTP time, increase the multiplier by 1
 	 * to catch up with it. If it's ahead and there was a remainder in the
diff --git a/kernel/time/vmclock_host.c b/kernel/time/vmclock_host.c
new file mode 100644
index 000000000000..f4baf9069e70
--- /dev/null
+++ b/kernel/time/vmclock_host.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * /dev/vmclock_host - Expose host NTP-disciplined time as a vmclock page.
+ *
+ * This provides a vmclock_abi structure populated from the host's
+ * CLOCK_TAI (CLOCK_REALTIME plus the TAI offset), allowing a VMM to
+ * efficiently relay precision time to guests without per-tick overhead.
+ *
+ * The page is updated only when the NTP frequency (ntp_tick) changes
+ * or the clocksource changes — not on every timekeeping tick.
+ * Userspace can poll() for changes.
+ *
+ * Copyright © 2026 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/clocksource_ids.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/timekeeper_internal.h>
+#include <linux/wait.h>
+
+#include <uapi/linux/vmclock-abi.h>
+
+extern void (*vmclock_host_update_fn)(struct timekeeper *tk);
+extern bool ntp_synced(void);
+
+static struct vmclock_abi *vmclock_page;
+static DECLARE_WAIT_QUEUE_HEAD(vmclock_wait);
+static u64 vmclock_last_ntp_tick = 1; /* Sentinel: force first update */
+static enum clocksource_ids vmclock_last_cs_id;
+
+/*
+ * Compute counter_period_frac_sec from ntp_tick and cycle_interval.
+ *
+ * ntp_tick is ns_per_tick << 32.
+ * cycle_interval is counter cycles per tick.
+ *
+ * vmclock wants: period = frac_sec / 2^(64 + shift) in seconds.
+ *
+ * ns_per_cycle = ntp_tick / cycle_interval (in <<32 fixed point)
+ *
+ * Converting nanoseconds to seconds and dropping the <<32 scaling:
+ *
+ * period = ntp_tick / (cycle_interval * 10^9 * 2^32) seconds/cycle
+ * frac_sec = ntp_tick * 2^(32+shift) / (cycle_interval * 10^9)
+ *
+ * Use div64_u64 with maximum pre-shift for precision.
+ * The key: do TWO divisions to get 64 bits of quotient.
+ */
+static void vmclock_compute_period(struct timekeeper *tk,
+				   u64 *period_frac, u8 *period_shift)
+{
+	u64 ntp_tick = tk->ntp_tick;
+	u64 cycle_interval = tk->cycle_interval;
+	u64 divisor = cycle_interval * 1000000000ULL;
+	int headroom = __builtin_clzll(ntp_tick);
+	u64 rem, result;
+	int bits_so_far, need;
+
+	/*
+	 * Compute ntp_tick * 2^(headroom + N) / divisor with 64 bits
+	 * of precision, using iterative 32-bit chunk divisions.
+	 *
+	 * First division: ntp_tick << headroom / divisor
+	 */
+	result = div64_u64_rem(ntp_tick << headroom, divisor, &rem);
+	bits_so_far = 64 - __builtin_clzll(result ?: 1);
+
+	/* Fill remaining bits 32 at a time from the remainder */
+	while (bits_so_far < 64 && rem) {
+		int chunk = min(32, 64 - bits_so_far);
+		int rem_headroom = __builtin_clzll(rem);
+		u64 extra;
+
+		if (rem_headroom < chunk)
+			chunk = rem_headroom;
+
+		extra = div64_u64_rem(rem << chunk, divisor, &rem);
+		result = (result << chunk) | extra;
+		bits_so_far += chunk;
+		headroom += chunk;
+	}
+
+	/* Pad with zeros if we ran out of remainder */
+	if (bits_so_far < 64) {
+		result <<= (64 - bits_so_far);
+		headroom += (64 - bits_so_far);
+	}
+
+	/*
+	 * result = ntp_tick * 2^headroom / divisor
+	 *        = (ntp_tick / (cycle_interval * 10^9)) * 2^headroom
+	 *        = period_seconds * 2^32 * 2^headroom
+	 *        = period_seconds * 2^(32 + headroom)
+	 *
+	 * vmclock: frac_sec / 2^(64 + shift) = period_seconds
+	 * So: shift = 32 + headroom - 64 = headroom - 32
+	 */
+	*period_frac = result;
+	*period_shift = (u8)(headroom - 32);
+}
+
+
+static u8 vmclock_counter_id(struct timekeeper *tk)
+{
+	enum clocksource_ids id = tk->cs_id;
+
+	if (IS_ENABLED(CONFIG_X86) && id == CSID_X86_TSC)
+		return VMCLOCK_COUNTER_X86_TSC;
+	if (IS_ENABLED(CONFIG_ARM64) && id == CSID_ARM_ARCH_COUNTER)
+		return VMCLOCK_COUNTER_ARM_VCNT;
+	return VMCLOCK_COUNTER_INVALID;
+}
+
+/*
+ * Called from timekeeping_adjust() when ntp_tick changes.
+ * Also needs to be called on clocksource change.
+ */
+static void vmclock_host_do_update(struct timekeeper *tk)
+{
+	struct vmclock_abi *clk = vmclock_page;
+	u64 period_frac;
+	u8 period_shift, counter_id;
+
+	if (!clk)
+		return;
+
+	counter_id = vmclock_counter_id(tk);
+
+	/* Only do a full update when something meaningful changes */
+	if (tk->ntp_tick == vmclock_last_ntp_tick &&
+	    tk->cs_id == vmclock_last_cs_id)
+		return;
+
+	vmclock_last_ntp_tick = tk->ntp_tick;
+	vmclock_last_cs_id = tk->cs_id;
+
+	/* Increment seq_count to odd (update in progress) */
+	WRITE_ONCE(clk->seq_count, cpu_to_le32(le32_to_cpu(clk->seq_count) + 1));
+	smp_wmb();
+
+	clk->counter_id = counter_id;
+
+	if (counter_id != VMCLOCK_COUNTER_INVALID) {
+		u64 ns = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		u64 hi, rem;
+
+		/* Adjust for ntp_error: represent where the clock is
+		 * converging TO, not where it is right now. */
+		ns += tk->ntp_error >> (tk->tkr_mono.shift + tk->ntp_error_shift);
+
+		clk->counter_value = cpu_to_le64(tk->tkr_mono.cycle_last);
+		clk->time_sec = cpu_to_le64(tk->xtime_sec + tk->tai_offset);
+
+		hi = div64_u64_rem(ns << 32, 1000000000ULL, &rem);
+		clk->time_frac_sec = cpu_to_le64(
+			(hi << 32) | div64_u64(rem << 32, 1000000000ULL));
+
+		vmclock_compute_period(tk,
+				       &period_frac, &period_shift);
+		clk->counter_period_frac_sec = cpu_to_le64(period_frac);
+		clk->counter_period_shift = period_shift;
+
+		clk->clock_status = ntp_synced() ?
+			VMCLOCK_STATUS_SYNCHRONIZED :
+			VMCLOCK_STATUS_FREERUNNING;
+	} else {
+		clk->clock_status = VMCLOCK_STATUS_UNKNOWN;
+	}
+
+	clk->tai_offset_sec = cpu_to_le16((s16)tk->tai_offset);
+	clk->flags = cpu_to_le64(VMCLOCK_FLAG_TAI_OFFSET_VALID |
+				 VMCLOCK_FLAG_TIME_MONOTONIC |
+				 VMCLOCK_FLAG_NOTIFICATION_PRESENT);
+
+	smp_wmb();
+	WRITE_ONCE(clk->seq_count, cpu_to_le32(le32_to_cpu(clk->seq_count) + 1));
+
+	wake_up_interruptible(&vmclock_wait);
+}
+
+/* File operations */
+
+struct vmclock_host_file {
+	u32 last_seq;
+};
+
+static int vmclock_host_open(struct inode *inode, struct file *fp)
+{
+	struct vmclock_host_file *fst;
+
+	fst = kzalloc(sizeof(*fst), GFP_KERNEL);
+	if (!fst)
+		return -ENOMEM;
+
+	fp->private_data = fst;
+	return 0;
+}
+
+static int vmclock_host_release(struct inode *inode, struct file *fp)
+{
+	kfree(fp->private_data);
+	return 0;
+}
+
+static int vmclock_host_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	if ((vma->vm_flags & (VM_READ | VM_WRITE)) != VM_READ)
+		return -EROFS;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+		return -EINVAL;
+
+	return remap_pfn_range(vma, vma->vm_start,
+			       virt_to_phys(vmclock_page) >> PAGE_SHIFT,
+			       PAGE_SIZE, vma->vm_page_prot);
+}
+
+static ssize_t vmclock_host_read(struct file *fp, char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct vmclock_host_file *fst = fp->private_data;
+	u32 seq;
+
+	if (*ppos >= PAGE_SIZE)
+		return 0;
+	if (count > PAGE_SIZE - *ppos)
+		count = PAGE_SIZE - *ppos;
+
+	do {
+		seq = le32_to_cpu(READ_ONCE(vmclock_page->seq_count));
+		if (seq & 1) {
+			cpu_relax();
+			continue;
+		}
+		smp_rmb();
+		if (copy_to_user(buf, (char *)vmclock_page + *ppos, count))
+			return -EFAULT;
+		smp_rmb();
+	} while (le32_to_cpu(READ_ONCE(vmclock_page->seq_count)) != seq);
+
+	fst->last_seq = seq;
+	*ppos += count;
+	return count;
+}
+
+static __poll_t vmclock_host_poll(struct file *fp, poll_table *wait)
+{
+	struct vmclock_host_file *fst = fp->private_data;
+	u32 seq;
+
+	poll_wait(fp, &vmclock_wait, wait);
+
+	seq = le32_to_cpu(READ_ONCE(vmclock_page->seq_count));
+	if (fst->last_seq != seq)
+		return EPOLLIN | EPOLLRDNORM;
+
+	return 0;
+}
+
+static const struct file_operations vmclock_host_fops = {
+	.owner = THIS_MODULE,
+	.open = vmclock_host_open,
+	.release = vmclock_host_release,
+	.mmap = vmclock_host_mmap,
+	.read = vmclock_host_read,
+	.poll = vmclock_host_poll,
+};
+
+static struct miscdevice vmclock_host_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vmclock_host",
+	.fops = &vmclock_host_fops,
+};
+
+static int __init vmclock_host_init(void)
+{
+	int ret;
+
+	vmclock_page = (struct vmclock_abi *)get_zeroed_page(GFP_KERNEL);
+	if (!vmclock_page)
+		return -ENOMEM;
+
+	/* Set constant fields */
+	vmclock_page->magic = cpu_to_le32(VMCLOCK_MAGIC);
+	vmclock_page->size = cpu_to_le32(PAGE_SIZE);
+	vmclock_page->version = cpu_to_le16(1);
+	vmclock_page->time_type = VMCLOCK_TIME_TAI;
+
+	ret = misc_register(&vmclock_host_miscdev);
+	if (ret) {
+		free_page((unsigned long)vmclock_page);
+		vmclock_page = NULL;
+		return ret;
+	}
+
+	WRITE_ONCE(vmclock_host_update_fn, vmclock_host_do_update);
+	pr_info("vmclock_host: registered /dev/vmclock_host\n");
+	return 0;
+}
+
+static void __exit vmclock_host_exit(void)
+{
+	WRITE_ONCE(vmclock_host_update_fn, NULL);
+	synchronize_rcu();
+	misc_deregister(&vmclock_host_miscdev);
+	free_page((unsigned long)vmclock_page);
+	vmclock_page = NULL;
+}
+
+module_init(vmclock_host_init);
+module_exit(vmclock_host_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("David Woodhouse <dwmw@amazon.co.uk>");
+MODULE_DESCRIPTION("VMClock host time provider");
diff --git a/tools/testing/selftests/timers/vmclock_host_test.c b/tools/testing/selftests/timers/vmclock_host_test.c
new file mode 100644
index 000000000000..c83cc7e6d404
--- /dev/null
+++ b/tools/testing/selftests/timers/vmclock_host_test.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test /dev/vmclock_host by comparing its time against CLOCK_TAI.
+ *
+ * Maps the vmclock page, reads time from it using the ABI formula,
+ * and compares with clock_gettime(CLOCK_TAI) using ABA timestamps
+ * to bound the uncertainty.
+ */
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <linux/vmclock-abi.h>
+
+#ifdef __x86_64__
+static inline uint64_t read_counter(void)
+{
+	unsigned int lo, hi;
+	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
+	return ((uint64_t)hi << 32) | lo;
+}
+#elif defined(__aarch64__)
+static inline uint64_t read_counter(void)
+{
+	uint64_t val;
+	asm volatile("mrs %0, cntvct_el0" : "=r"(val));
+	return val;
+}
+#else
+#error "Unsupported architecture"
+#endif
+
+/*
+ * Compute time from vmclock: T = time_sec + time_frac_sec/2^64 +
+ *   (counter_now - counter_value) * counter_period_frac_sec >> (64 + shift)
+ *
+ * Returns nanoseconds since epoch.
+ */
+static int64_t vmclock_read_ns(const volatile struct vmclock_abi *clk,
+			       uint64_t counter_now)
+{
+	uint64_t delta = counter_now - clk->counter_value;
+	uint64_t period = clk->counter_period_frac_sec;
+	uint8_t shift = clk->counter_period_shift;
+	__uint128_t ns128;
+
+	/* delta * period gives seconds in 0.(64+shift) fixed point */
+	ns128 = (__uint128_t)delta * period;
+	ns128 >>= shift;
+	/* Now ns128 is seconds in 0.64 fixed point. Add time_frac_sec */
+	ns128 += clk->time_frac_sec;
+	/* Top 64 bits are whole seconds carried out of the fractional
+	 * sum; time_sec is added below for the full result */
+	uint64_t frac_sec = (uint64_t)(ns128 >> 64);
+	uint64_t sub_sec_ns = (uint64_t)(((ns128 & 0xFFFFFFFFFFFFFFFFULL) *
+					   1000000000ULL) >> 64);
+
+	return (int64_t)(clk->time_sec + frac_sec) * 1000000000LL + sub_sec_ns;
+}
+
+static int64_t clock_tai_ns(void)
+{
+	struct timespec ts;
+	clock_gettime(CLOCK_TAI, &ts);
+	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
+}
+
+int main(void)
+{
+	int fd, ret = 0;
+	volatile struct vmclock_abi *clk;
+	int i, failures = 0;
+
+	fd = open("/dev/vmclock_host", O_RDONLY);
+	if (fd < 0) {
+		if (errno == ENOENT) {
+			printf("SKIP: /dev/vmclock_host not available\n");
+			return 4;
+		}
+		perror("open /dev/vmclock_host");
+		return 1;
+	}
+
+	clk = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
+	if (clk == MAP_FAILED) {
+		perror("mmap");
+		close(fd);
+		return 1;
+	}
+
+	if (clk->magic != VMCLOCK_MAGIC) {
+		fprintf(stderr, "Bad magic: 0x%x\n", clk->magic);
+		ret = 1;
+		goto out;
+	}
+
+	if (clk->counter_id == VMCLOCK_COUNTER_INVALID) {
+		printf("SKIP: counter_id is INVALID (clocksource not TSC?)\n");
+		ret = 4;
+		goto out;
+	}
+
+	printf("vmclock_host: version=%u counter_id=%u time_type=%u status=%u\n",
+	       clk->version, clk->counter_id, clk->time_type, clk->clock_status);
+	printf("  tai_offset=%d\n", (int16_t)clk->tai_offset_sec);
+	printf("  counter_period_frac_sec=0x%" PRIx64 " shift=%u\n",
+	       (uint64_t)clk->counter_period_frac_sec, clk->counter_period_shift);
+
+	/* ABA comparison: read CLOCK_TAI, vmclock, CLOCK_TAI */
+	printf("\nABA comparison (vmclock vs CLOCK_TAI):\n");
+	for (i = 0; i < 10; i++) {
+		uint32_t seq;
+		int64_t tai_before, tai_after, vmclock_ns;
+		int64_t delta, window;
+
+		/* Read with seqcount retry */
+		do {
+			seq = clk->seq_count;
+			if (seq & 1) {
+				__asm__ volatile("pause" ::: "memory");
+				continue;
+			}
+			__asm__ volatile("" ::: "memory");
+
+			tai_before = clock_tai_ns();
+			uint64_t ctr = read_counter();
+			tai_after = clock_tai_ns();
+
+			__asm__ volatile("" ::: "memory");
+			if (clk->seq_count != seq)
+				continue;
+
+			vmclock_ns = vmclock_read_ns(clk, ctr);
+			break;
+		} while (1);
+
+		window = tai_after - tai_before;
+		/* vmclock should be between tai_before and tai_after */
+		delta = vmclock_ns - tai_before;
+
+		printf("  [%d] vmclock-tai_before=%+" PRId64 "ns window=%"
+		       PRId64 "ns", i, delta, window);
+
+		if (delta < -2000 || delta > window + 2000) {
+			printf(" FAIL (out of range)\n");
+			failures++;
+		} else {
+			printf(" OK\n");
+		}
+
+		usleep(100000); /* 100ms between samples */
+	}
+
+	if (failures) {
+		printf("\nFAIL: %d/%d samples out of range\n", failures, 10);
+		ret = 1;
+	} else {
+		printf("\nPASS: all samples within ABA window\n");
+	}
+
+out:
+	munmap((void *)clk, 4096);
+	close(fd);
+	return ret;
+}
-- 
2.51.0



* Re: [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-17 21:25 ` [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error David Woodhouse
@ 2026-05-19  1:59   ` John Stultz
  2026-05-19 10:04     ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: John Stultz @ 2026-05-19  1:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Miroslav Lichvar, Julien Ridoux, Ryan Luu, linux-kernel,
	David Woodhouse

On Sun, May 17, 2026 at 3:03 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> timekeeping_apply_adjustment() modifies xtime_nsec to maintain vDSO
> monotonicity when mult changes:
>
>     xtime_nsec -= offset
>
> This ensures that the time reported to userspace doesn't jump when the
> multiplier is adjusted. However, ntp_error — which tracks the difference
> between intended and actual clock position — was not updated to reflect
> this change.
>
> After a mult change, xtime_nsec has moved but ntp_error still reflects
> the old position. For the normal ±1 dithering this is negligible (the
> adjustments cancel over time), but for larger mult changes — such as
> when an external reference clock sets a new frequency — the one-time
> uncompensated offset is significant (~38ns for a 700-count mult change).
>
> Fix by adjusting ntp_error by the same amount:
>
>     ntp_error += offset << ntp_error_shift
>
> This keeps ntp_error consistent with the actual xtime_nsec position
> after the clawback.
>
> Fixes: 1b1b3e2a3671 ("timekeeping: Rework frequency adjustments to work better w/ nohz")

That doesn't seem to be the right commit. Do you mean dc491596f639?

But really, we used to do something like this, but it was removed in
commit c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency
adjustments to ticks").  From the difference in the math it looks like
the previous implementation was maybe adjusting for the next tick
instead of the previous?

Also, since you're re-adding it, could you add a detailed rationale to
the comment in timekeeping_apply_adjustment()? (It had long been on my
todo, but by the time I started adding the comments the details had
faded and I never got the time to re-derive the math.)

Miroslav's review and input here would also be good.

thanks
-john


* Re: [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline
  2026-05-17 21:25 ` [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline David Woodhouse
@ 2026-05-19  2:09   ` John Stultz
  2026-05-19 11:07     ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: John Stultz @ 2026-05-19  2:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Miroslav Lichvar, Julien Ridoux, Ryan Luu, linux-kernel,
	David Woodhouse

On Sun, May 17, 2026 at 3:03 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Add timekeeping_set_reference() which allows an external clock source
> (such as a hypervisor vmclock) to provide an absolute time reference.
> The reference defines a linear counter-to-time mapping that the kernel
> uses to set both the frequency and phase of the system clock.
>
> When timekeeping_set_reference() is called:
>  - tick_length is computed from the reference period and set via
>    ntp_set_tick_length(), keeping all NTP state consistent
>  - A pending flag is set so that on the next tick (under the
>    timekeeping lock), the phase error is set via ntp_set_time_offset()
>
> The existing time_offset slew mechanism then converges the clock to
> the reference, with the clawback fix ensuring ntp_error stays accurate
> across mult changes.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/timekeeping_reference.h | 19 ++++++++++++
>  kernel/time/ntp.c                     | 27 ++++++++++++++++
>  kernel/time/ntp_internal.h            |  3 ++
>  kernel/time/timekeeping.c             | 44 +++++++++++++++++++++++++++
>  4 files changed, 93 insertions(+)
>  create mode 100644 include/linux/timekeeping_reference.h
>
> diff --git a/include/linux/timekeeping_reference.h b/include/linux/timekeeping_reference.h
> new file mode 100644
> index 000000000000..4c1d8a6c02f1
> --- /dev/null
> +++ b/include/linux/timekeeping_reference.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_TIMEKEEPING_REFERENCE_H
> +#define _LINUX_TIMEKEEPING_REFERENCE_H
> +
> +#include <linux/clocksource_ids.h>
> +#include <linux/types.h>
> +
> +struct tk_reference {
> +       enum clocksource_ids    cs_id;
> +       u64                     counter_value;
> +       u64                     time_sec;
> +       u64                     time_frac_sec;
> +       u64                     period_frac_sec;
> +       u8                      period_shift;
> +};

Can you add comments documenting each of these values?


> +
> +int timekeeping_set_reference(const struct tk_reference *ref);
> +
> +#endif
> diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
> index 2d6d00ae5bf7..79e76bb6942b 100644
> --- a/kernel/time/ntp.c
> +++ b/kernel/time/ntp.c
> @@ -366,6 +366,33 @@ u64 ntp_tick_length(unsigned int tkid)
>         return tk_ntp_data[tkid].tick_length;
>  }
>
> +u64 ntp_tick_length_base(unsigned int tkid)
> +{
> +       return tk_ntp_data[tkid].tick_length_base;
> +}
> +
> +void ntp_set_time_offset(unsigned int tkid, s64 offset_ns)
> +{
> +       struct ntp_data *ntpdata = &tk_ntp_data[tkid];
> +
> +       ntpdata->time_offset = div_s64((s64)offset_ns << NTP_SCALE_SHIFT,
> +                                      NTP_INTERVAL_FREQ);
> +       ntpdata->time_adjust = 0;
> +}
> +
> +void ntp_set_tick_length(unsigned int tkid, u64 tick_length)
> +{
> +       struct ntp_data *ntpdata = &tk_ntp_data[tkid];
> +       u64 base;
> +
> +       base = (u64)(ntpdata->tick_usec * NSEC_PER_USEC * USER_HZ)
> +               << NTP_SCALE_SHIFT;
> +       base += ntpdata->ntp_tick_adj;
> +
> +       ntpdata->time_freq = (s64)(tick_length * NTP_INTERVAL_FREQ - base);
> +       ntp_update_frequency(ntpdata);
> +}


All the math here could use some comments, just to be explicit about
what is intended.

> +
>  /**
>   * ntp_get_next_leap - Returns the next leapsecond in CLOCK_REALTIME ktime_t
>   * @tkid:      Timekeeper ID
> diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h
> index 7084d839c207..44306ffe25ff 100644
> --- a/kernel/time/ntp_internal.h
> +++ b/kernel/time/ntp_internal.h
> @@ -6,6 +6,9 @@ extern void ntp_init(void);
>  extern void ntp_clear(unsigned int tkid);
>  /* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */
>  extern u64 ntp_tick_length(unsigned int tkid);
> +extern u64 ntp_tick_length_base(unsigned int tkid);
> +extern void ntp_set_time_offset(unsigned int tkid, s64 offset_ns);
> +extern void ntp_set_tick_length(unsigned int tkid, u64 tick_length);
>  extern ktime_t ntp_get_next_leap(unsigned int tkid);
>  extern int second_overflow(unsigned int tkid, time64_t secs);
>  extern int ntp_adjtimex(unsigned int tkid, struct __kernel_timex *txc, const struct timespec64 *ts,
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 050123fc179b..89fed9473c38 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -2324,6 +2324,33 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
>   * Adjust the timekeeper's multiplier to the correct frequency
>   * and also to reduce the accumulated error value.
>   */
> +
> +#include <linux/timekeeping_reference.h>
> +
> +static struct tk_reference tk_ref;
> +static bool tk_ref_valid;
> +static bool tk_ref_pending;
> +
> +int timekeeping_set_reference(const struct tk_reference *ref)
> +{
> +       u64 ci = tk_core.timekeeper.cycle_interval;
> +       u64 new_tl;
> +
> +       tk_ref = *ref;
> +
> +       new_tl = mul_u64_u64_shr(ref->period_frac_sec,
> +                       (u64)ci * NSEC_PER_SEC,
> +                       32 + ref->period_shift);
> +       ntp_set_tick_length(TIMEKEEPER_CORE, new_tl);
> +
> +       /* Ensure tk_ref fields are visible before tk_ref_valid/pending */
> +       smp_wmb();
> +       tk_ref_valid = true;
> +       tk_ref_pending = true;
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(timekeeping_set_reference);
> +
>  static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
>  {
>         u64 ntp_tl = ntp_tick_length(tk->id);
> @@ -2339,6 +2366,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
>                 tk->ntp_tick = ntp_tl;
>                 mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
>                                  tk->cycle_interval);
> +               if (tk_ref_pending && tk->cs_id == tk_ref.cs_id) {
> +                       u64 d = tk->tkr_mono.cycle_last - tk_ref.counter_value;
> +                       __uint128_t p = (__uint128_t)d * tk_ref.period_frac_sec;
> +                       u64 rf;
> +                       s64 ref_err;
> +
> +                       p >>= tk_ref.period_shift;
> +                       p += tk_ref.time_frac_sec;
> +                       rf = (u64)p;
> +                       ref_err = (s64)mul_u64_u64_shr(rf,
> +                               (u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
> +                               (s64)tk->tkr_mono.xtime_nsec;
> +                       ntp_set_time_offset(tk->id,
> +                               ref_err >> tk->tkr_mono.shift);
> +                       tk->ntp_error = 0;
> +                       tk_ref_pending = false;
> +               }

Just a quick skim here, but I don't see anything using tk_ref_valid.
Am I missing it? Or can that value be dropped?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-19  1:59   ` John Stultz
@ 2026-05-19 10:04     ` David Woodhouse
  2026-05-19 19:28       ` John Stultz
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-19 10:04 UTC (permalink / raw)
  To: John Stultz
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Miroslav Lichvar, Julien Ridoux, Ryan Luu, linux-kernel

On Mon, 2026-05-18 at 18:59 -0700, John Stultz wrote:
> On Sun, May 17, 2026 at 3:03 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > timekeeping_apply_adjustment() modifies xtime_nsec to maintain vDSO
> > monotonicity when mult changes:
> > 
> >     xtime_nsec -= offset
> > 
> > This ensures that the time reported to userspace doesn't jump when the
> > multiplier is adjusted. However, ntp_error — which tracks the difference
> > between intended and actual clock position — was not updated to reflect
> > this change.
> > 
> > After a mult change, xtime_nsec has moved but ntp_error still reflects
> > the old position. For the normal ±1 dithering this is negligible (the
> > adjustments cancel over time), but for larger mult changes — such as
> > when an external reference clock sets a new frequency — the one-time
> > uncompensated offset is significant (~38ns for a 700-count mult change).
> > 
> > Fix by adjusting ntp_error by the same amount:
> > 
> >     ntp_error += offset << ntp_error_shift
> > 
> > This keeps ntp_error consistent with the actual xtime_nsec position
> > after the clawback.
> > 
> > Fixes: 1b1b3e2a3671 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
> 
> That doesn't seem to be the right commit. Do you mean dc491596f639 ?

Er yes, that 1b1b commit doesn't even exist. I've been keeping the AI
on a *very* tight rein as I navigate all this, but that one escaped.

> But really, we used to do something like this, but it was removed in
> commit c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency
> adjustments to ticks").  From the difference in the math it looks like
> the previous implementation was maybe adjusting for the next tick
> instead of the previous?

The original subtraction of (interval - offset) actually goes all the
way back to commit 19923c190e093 in 2006 when the error field was first
introduced. I don't know if it was right then, but it certainly looks
like it was wrong in 2018 when commit c2cda2a5bda9 ripped it out.

What I'm adding back is 'ntp_error += offset'. I don't know why
'interval' was ever involved. As you suggest, I do think it does have
the effect of prematurely accounting for the changed xtime_interval of
the *next* tick... which is going to be correctly accounted when it
happens anyway, so it's a double addition.

I've reworked the commit message for the next round:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=a1ea3c1bfd

I find my definitions (A) (B) (C) of the absolute time values
relatively simple to understand. We know *exactly* how much each of
them advances per tick. The *deltas* between them, represented by
ntp_error and time_offset, are somewhat harder to track. Tracking
ntp_error, for example, is always "Add what got added to (B), subtract
what got added to (A)". Hence the +ntp_interval, -xtime_interval we
discussed in the other commit which removed xtime_remainder.

I've added some local debugging which tracks those *absolute* values,
with associated sanity checks on the deltas which should precisely
match the difference between them on each tick. That's why I have a
reasonable amount of confidence that these fixes are correct.
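
For illustration only, a minimal user-space sketch of that kind of
bookkeeping follows. It is not the debugging patch itself: the struct
and function names are hypothetical, and everything is kept in a single
fixed-point unit rather than the kernel's separate shifts. It shows the
rule "add what got added to (B), subtract what got added to (A)", and
why a clawback that moves (A) has to move ntp_error by the same amount.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; all values share one fixed-point unit. */
struct tk_model {
	int64_t a_actual;   /* (A): where xtime actually is */
	int64_t b_intended; /* (B): where NTP says the clock should be */
	int64_t ntp_error;  /* tracked delta, expected to equal B - A */
};

/* One tick: (B) advances by ntp_interval, (A) by xtime_interval. */
static void tk_model_tick(struct tk_model *m, int64_t ntp_interval,
			  int64_t xtime_interval)
{
	m->b_intended += ntp_interval;
	m->a_actual += xtime_interval;
	m->ntp_error += ntp_interval - xtime_interval;
}

/*
 * A mult change claws 'offset' back out of (A). With 'fix' set, the
 * tracked delta is adjusted by the same amount; without it, the delta
 * goes stale.
 */
static void tk_model_clawback(struct tk_model *m, int64_t offset, bool fix)
{
	m->a_actual -= offset;
	if (fix)
		m->ntp_error += offset;
}

/* Sanity check: the tracked delta must match the absolute values. */
static bool tk_model_consistent(const struct tk_model *m)
{
	return m->ntp_error == m->b_intended - m->a_actual;
}
```

Calling tk_model_clawback() with fix set to false leaves
tk_model_consistent() returning false, which corresponds to the
uncompensated offset described in the commit message.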

> Also, since you're re-adding it, could you add a detailed rationale to
> the comment in timekeeping_apply_adjustment()? (It had long been on my
> todo, but by the time I started adding the commits the details had
> faded and I never got the time to re-derive the math.)

I was hoping not to have to think about that part. The fact here is
that it *does* apply an offset to 'xtime' and thus of course the delta
from xtime(A) to where we ought to be right now (B) has changed by the
same amount.

Calculating *what* that offset should be, is... above my pay grade :)

And I do think it's mostly working, so is there a particular reason you
want me to take a closer look?

Because the moment I start looking at the comment, I see the part which
says 
	 * So given the same offset value, we need the time to be the same
	 * both before and after the freq adjustment.

... and I come to believe that 'before' the freq adjustment is actually
some point in the *future*; the last counter reading at which a vDSO
*currently* running on another CPU might possibly apply the faster
formula from the previous tick? And then my brain falls out and I have
to sit under the desk rocking back and forth for a while... ?

> Miroslav's review and input here would also be good.

Ack. Thomas had nudged me to add Miroslav to Cc, which
get_maintainer.pl had not.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline
  2026-05-19  2:09   ` John Stultz
@ 2026-05-19 11:07     ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-19 11:07 UTC (permalink / raw)
  To: John Stultz
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Miroslav Lichvar, Julien Ridoux, Ryan Luu, linux-kernel

On Mon, 2026-05-18 at 19:09 -0700, John Stultz wrote:
> On Sun, May 17, 2026 at 3:03 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Add timekeeping_set_reference() which allows an external clock source
> > (such as a hypervisor vmclock) to provide an absolute time reference.
> > The reference defines a linear counter-to-time mapping that the kernel
> > uses to set both the frequency and phase of the system clock.
> > 
> > When timekeeping_set_reference() is called:
> >  - tick_length is computed from the reference period and set via
> >    ntp_set_tick_length(), keeping all NTP state consistent
> >  - A pending flag is set so that on the next tick (under the
> >    timekeeping lock), the phase error is set via ntp_set_time_offset()
> > 
> > The existing time_offset slew mechanism then converges the clock to
> > the reference, with the clawback fix ensuring ntp_error stays accurate
> > across mult changes.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  include/linux/timekeeping_reference.h | 19 ++++++++++++
> >  kernel/time/ntp.c                     | 27 ++++++++++++++++
> >  kernel/time/ntp_internal.h            |  3 ++
> >  kernel/time/timekeeping.c             | 44 +++++++++++++++++++++++++++
> >  4 files changed, 93 insertions(+)
> >  create mode 100644 include/linux/timekeeping_reference.h
> > 
> > diff --git a/include/linux/timekeeping_reference.h b/include/linux/timekeeping_reference.h
> > new file mode 100644
> > index 000000000000..4c1d8a6c02f1
> > --- /dev/null
> > +++ b/include/linux/timekeeping_reference.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_TIMEKEEPING_REFERENCE_H
> > +#define _LINUX_TIMEKEEPING_REFERENCE_H
> > +
> > +#include <linux/clocksource_ids.h>
> > +#include <linux/types.h>
> > +
> > +struct tk_reference {
> > +       enum clocksource_ids    cs_id;
> > +       u64                     counter_value;
> > +       u64                     time_sec;
> > +       u64                     time_frac_sec;
> > +       u64                     period_frac_sec;
> > +       u8                      period_shift;
> > +};
> 
> Can you add comments documenting each of these values?

Ack.

> All the math here could use some comments, just to be explicit about
> what is intended.

Ack.
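
As a rough illustration of the scaling in the
timekeeping_set_reference() hunk (a user-space sketch, not the
commented version acked above): it assumes period_frac_sec is in units
of 1/2^(64 + period_shift) of a second, as in the vmclock ABI, and that
the tick length is wanted in ns << 32 (NTP_SCALE_SHIFT). The helper
mul_shr() is a stand-in for the kernel's mul_u64_u64_shr() and relies
on the GCC/Clang unsigned __int128 extension.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* (a * b) >> shift via a 128-bit intermediate; shift must be < 128. */
static uint64_t mul_shr(uint64_t a, uint64_t b, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/*
 * Length of one tick of 'cycle_interval' counter cycles, in ns << 32.
 *
 *   period_frac_sec * cycle_interval   tick in 2^-(64+shift) seconds
 *   * NSEC_PER_SEC                     same, in 2^-(64+shift) ns
 *   >> (32 + period_shift)             ns scaled up by 2^32
 *
 * The caller must keep cycle_interval * NSEC_PER_SEC within 64 bits.
 */
static uint64_t ref_tick_length(uint64_t period_frac_sec,
				uint8_t period_shift,
				uint64_t cycle_interval)
{
	return mul_shr(period_frac_sec, cycle_interval * NSEC_PER_SEC,
		       32 + period_shift);
}
```

For a counter running at 2^30 Hz (period_frac_sec = 2^34, no shift) and
a tick of 2^20 cycles, this gives 976562.5 ns scaled by 2^32.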

> 
> > @@ -2339,6 +2366,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
> >                 tk->ntp_tick = ntp_tl;
> >                 mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
> >                                  tk->cycle_interval);
> > +               if (tk_ref_pending && tk->cs_id == tk_ref.cs_id) {
> > +                       u64 d = tk->tkr_mono.cycle_last - tk_ref.counter_value;
> > +                       __uint128_t p = (__uint128_t)d * tk_ref.period_frac_sec;
> > +                       u64 rf;
> > +                       s64 ref_err;
> > +
> > +                       p >>= tk_ref.period_shift;
> > +                       p += tk_ref.time_frac_sec;
> > +                       rf = (u64)p;
> > +                       ref_err = (s64)mul_u64_u64_shr(rf,
> > +                               (u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
> > +                               (s64)tk->tkr_mono.xtime_nsec;
> > +                       ntp_set_time_offset(tk->id,
> > +                               ref_err >> tk->tkr_mono.shift);
> > +                       tk->ntp_error = 0;
> > +                       tk_ref_pending = false;
> > +               }
> 
> Just a quick skim here, but I don't see anything using tk_ref_valid.
> Am I missing it? Or can that value be dropped?

Yeah, I think it can be dropped; it's a hangover from the time when we
needed to consult the reference at *every* tick to drive the mult±1
dithering.

But now the core timekeeping can actually follow a line for itself,
that external oracle is no longer needed — because time definition (C)
*is* the reference and doesn't keep accumulating errors.

I'm also going to take another look at whether I need this hunk in
timekeeping_adjust() at all, or whether the tracking is now
sufficiently predictable that timekeeping_set_reference() could just set
time_offset directly.

I did it this way because timekeeping_set_reference() runs mid-tick,
while timekeeping_adjust() can adjust time_offset to clamp (C) to the
reference at the start of the new tick.

But if the accounting is all fixed now, there's no reason why
timekeeping_set_reference() couldn't just apply the appropriate
correction for the moment at the existing cycle_last, effectively
setting time (C) for *that* moment, and then trust that it tracks
correctly from there. I'll play with it...
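
To make "setting time (C) for *that* moment" concrete, here is a hedged
user-space sketch of evaluating the reference line at a given counter
value, mirroring the arithmetic of the quoted hunk. The function names
are hypothetical. Like the hunk, it keeps only the sub-second part, so
a carry into whole seconds is discarded, and it relies on the GCC/Clang
unsigned __int128 extension.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Sub-second part of the reference time at 'counter', in units of
 * 2^-64 seconds:
 *   time_frac_sec + ((counter - counter_value) * period) >> shift
 * The cast to 64 bits drops any carry into whole seconds.
 */
static uint64_t ref_frac_at(uint64_t counter, uint64_t counter_value,
			    uint64_t time_frac_sec,
			    uint64_t period_frac_sec,
			    uint8_t period_shift)
{
	unsigned __int128 p;

	p = (unsigned __int128)(counter - counter_value) * period_frac_sec;
	p >>= period_shift;
	p += time_frac_sec;
	return (uint64_t)p;
}

/* Convert 2^-64 second units to whole nanoseconds (rounding down). */
static uint64_t ref_frac_to_ns(uint64_t frac)
{
	return (uint64_t)(((unsigned __int128)frac * NSEC_PER_SEC) >> 64);
}
```

The phase error would then be this value minus the timekeeper's own
sub-second position at the same counter reading.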

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock David Woodhouse
                   ` (7 preceding siblings ...)
  2026-05-17 21:25 ` [RFC PATCH v2 8/8] WIP: kernel/time: Add /dev/vmclock_host miscdev David Woodhouse
@ 2026-05-19 13:16 ` Miroslav Lichvar
  2026-05-19 15:50   ` David Woodhouse
  8 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-19 13:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Sun, May 17, 2026 at 10:25:37PM +0100, David Woodhouse wrote:
> The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
> provides a shared memory page containing a linear time function:
> time = base + (counter - counter_value) × period. The guest can read
> this at any time to determine the hypervisor's view of the current time,
> without a VM exit.

That sounds nice.

> The existing ptp_vmclock driver already exposes this as a PTP clock for
> userspace consumers (phc2sys, chrony). This series adds kernel-internal
> consumption: the tick mechanism can clamp directly to the vmclock
> reference, eliminating the need for NTP to discipline the guest clock.

I'm not very familiar with the VM timekeeping and other code. If I
understand this idea correctly, by loading the ptp_vmclock module the
guest kernel is giving the host control of its clock. Changes in the
host's REALTIME/MONOTONIC clock frequency are mirrored to the guest's
clock. Differences larger than 100 milliseconds are corrected by step,
whether the guest applications like it or not. Smaller steps and
errors accumulated due to a delay in the frequency update (is there a
limit to this delay?) are corrected by the kernel NTP PLL (with the
default time constant?). When the guest is migrated to a different
host, the frequency offset between the two hosts is injected to the
NTP frequency (assuming REALTIME clocks of the hosts have zero
frequency error at that moment?).

Have you considered a different approach that would address the
problem with frequency step by adjusting the guest's clocksource
frequency to match the original host? That would correct all system
clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
AUX clocks.

The guest would still be in control of its clock and follow its own
preferences to stepping, maximum frequency errors, etc. It could still
compare the stability and accuracy of the host's clock and use it for
synchronization only when it's actually better than other available
time sources (some VPS providers are known to have poorly synchronized
host clocks). An AUX clock could be used to more accurately compare
frequencies of the two hosts, ignoring phase corrections.

There is a work in progress for chrony to support MONOTONIC_RAW as the
main clock. It would be nice if that could be corrected in migrations.
That seems to be a common cause of disruptions of public NTP servers.
Polling for notifications about clock changes caused by migrations and
system suspend+resume would be useful in any case.

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail
  2026-05-17 21:25 ` [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail David Woodhouse
@ 2026-05-19 13:25   ` Miroslav Lichvar
  2026-05-19 13:31     ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-19 13:25 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, David Woodhouse

On Sun, May 17, 2026 at 10:25:40PM +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> ntp_offset_chunk() computes delta as time_offset >> (SHIFT_PLL +
> time_constant), which exponentially decays toward zero but never
> reaches it. This means time_offset asymptotically approaches zero
> without ever completing — the clock never fully converges.

That's how the NTP PLL was designed to work. It is an infinite impulse
response filter.

> Fix by clamping delta:
>  - Minimum: 20ns/sec (NTP_OFFSET_DELTA_MIN), ensuring the tail
>    converges in finite time

I don't think that is an acceptable change of the filter. The impact
could be measured on a sufficiently stable clock.

To me that looks like using a wrong tool for the job.

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail
  2026-05-19 13:25   ` Miroslav Lichvar
@ 2026-05-19 13:31     ` David Woodhouse
  2026-05-19 14:17       ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-19 13:31 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Tue, 2026-05-19 at 15:25 +0200, Miroslav Lichvar wrote:
> On Sun, May 17, 2026 at 10:25:40PM +0100, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > ntp_offset_chunk() computes delta as time_offset >> (SHIFT_PLL +
> > time_constant), which exponentially decays toward zero but never
> > reaches it. This means time_offset asymptotically approaches zero
> > without ever completing — the clock never fully converges.
> 
> That's how the NTP PLL was designed to work. It is an infinite impulse
> response filter.
> 
> > Fix by clamping delta:
> >   - Minimum: 20ns/sec (NTP_OFFSET_DELTA_MIN), ensuring the tail
> >     converges in finite time
> 
> I don't think that is an acceptable change of the filter. The impact
> could be measured on a sufficiently stable clock.
> 
> To me that looks like using a wrong tool for the job.

I chose 20ns/s because it's fairly much in the noise of the existing
jitter. The idea here is that there's no change in the initial part of
the exponential delivery of time_offset, but the long asymptotic tail
ends up applying a skew per second which is *far* smaller than the
inter-tick jitter of the output anyway, which seems pointless?

Without it, the output basically *never* converges to the desired line.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail
  2026-05-19 13:31     ` David Woodhouse
@ 2026-05-19 14:17       ` Miroslav Lichvar
  2026-05-19 15:06         ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-19 14:17 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Tue, May 19, 2026 at 02:31:41PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:25 +0200, Miroslav Lichvar wrote:
> > I don't think that is an acceptable change of the filter. The impact
> > could be measured on a sufficiently stable clock.
> > 
> > To me that looks like using a wrong tool for the job.
> 
> I chose 20ns/s because it's fairly much in the noise of the existing
> jitter. The idea here is that there's no change in the initial part of
> the exponential delivery of time_offset, but the long asymptotic tail
> ends up applying a skew per second which is *far* smaller than the
> inter-tick jitter of the output anyway, which seems pointless?

It changes the initial part too. Consider a case where the PLL time
constant is set to 0 and the application is updating the PLL once per
second. ntp_offset_chunk() returns 1/4th of time_offset. If the
NTP/PTP measurements are stable to about 20 nanoseconds, the clock
corrections will be 4 times larger than expected.

By inter-tick jitter you mean the +1/0 multiplier changes? That
can be below 1 nanosecond if the clock is updated frequently enough
and the multiplier is sufficiently large.

> Without it, the output basically *never* converges to the desired line.

I think it's not supposed to get to zero. It is expected to be updated
regularly with new measurements.

If the cancellation threshold was based on the time constant and the
time since last update (e.g. 32 seconds for time constant 0), that
would probably make more sense to me.
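
The time-constant-0 example above can be sketched in user space as
follows. This is only an illustration: offsets are in plain
nanoseconds rather than the kernel's scaled units, the function names
are hypothetical, and the clamped variant is an assumption about how a
20ns/s minimum would behave, not the code from the patch. SHIFT_PLL is
2, as in the kernel.

```c
#include <stdint.h>

#define SHIFT_PLL		2	/* same value as the kernel */
#define OFFSET_DELTA_MIN_NS	20	/* minimum proposed in the patch */

/* Per-second slew: offset / 2^(SHIFT_PLL + time_constant), sign kept. */
static int64_t offset_chunk(int64_t offset_ns, int time_constant)
{
	int shift = SHIFT_PLL + time_constant;
	int64_t mag = offset_ns < 0 ? -offset_ns : offset_ns;

	mag >>= shift;
	return offset_ns < 0 ? -mag : mag;
}

/*
 * Assumed clamp: never slew by less than OFFSET_DELTA_MIN_NS per
 * second, but never by more than the offset that is left.
 */
static int64_t offset_chunk_clamped(int64_t offset_ns, int time_constant)
{
	int64_t delta = offset_chunk(offset_ns, time_constant);
	int64_t mag = delta < 0 ? -delta : delta;
	int64_t rem = offset_ns < 0 ? -offset_ns : offset_ns;

	if (mag < OFFSET_DELTA_MIN_NS)
		mag = OFFSET_DELTA_MIN_NS;
	if (mag > rem)
		mag = rem;
	return offset_ns < 0 ? -mag : mag;
}
```

With a 20ns offset and time constant 0, offset_chunk() yields 5ns
while the clamped variant applies the full 20ns, which is the factor
of four mentioned above. For large offsets the two agree.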

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail
  2026-05-19 14:17       ` Miroslav Lichvar
@ 2026-05-19 15:06         ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-19 15:06 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Tue, 2026-05-19 at 16:17 +0200, Miroslav Lichvar wrote:
> On Tue, May 19, 2026 at 02:31:41PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:25 +0200, Miroslav Lichvar wrote:
> > > I don't think that is an acceptable change of the filter. The impact
> > > could be measured on a sufficiently stable clock.
> > > 
> > > To me that looks like using a wrong tool for the job.
> > 
> > I chose 20ns/s because it's fairly much in the noise of the existing
> > jitter. The idea here is that there's no change in the initial part of
> > the exponential delivery of time_offset, but the long asymptotic tail
> > ends up applying a skew per second which is *far* smaller than the
> > inter-tick jitter of the output anyway, which seems pointless?
> 
> It changes the initial part too. Consider a case where the PLL time
> constant is set to 0 and the application is updating the PLL once per
> second. ntp_offset_chunk() returns 1/4th of time_offset. If the
> NTP/PTP measurements are stable to about 20 nanoseconds, the clock
> corrections will be 4 times larger than expected.
> 
> By inter-tick jitter you mean the +1/0 multiplier changes? That
> can be below 1 nanosecond if the clock is updated frequently enough
> and the multiplier is sufficiently large.

> > Without it, the output basically *never* converges to the desired line.
> 
> I think it's not supposed to get to zero. It is expected to be updated
> regularly with new measurements.

Fair enough. I think I'm happy to drop this. Much of my testing for the
ntp_error and time_offset fixes has been in a completely artificial
environment where I *stop* chrony on the host, advertise a single
(stale) rate through vmclock, and make sure the core timekeeping *can*
converge to that without constantly drifting due to the tracking errors
that I've fixed. The infinite convergence was messing with that, but I
guess it won't matter much in the real world.

My test is calling ktime_get_snapshot() and comparing the resulting
CLOCK_REALTIME with the vmclock time calculated from the *same* TSC
value, and printing that difference every 500ms:

(This is from a test case where I deliberately introduced 2µs offset
after the initial convergence, to test that it injects precisely
2000ns, no more and no less).

[   50.900372] vmclock_cmp: diff=-2003ns tsc=1ca1714991
[   51.404369] vmclock_cmp: diff=-1999ns tsc=1ce98a34e9
[   51.908369] vmclock_cmp: diff=-2001ns tsc=1d31a33821
[   52.412365] vmclock_cmp: diff=-2003ns tsc=1d79bc1c45
[   52.916364] vmclock_cmp: diff=-2005ns tsc=1dc1d5189d
[   53.420360] vmclock_cmp: diff=-2003ns tsc=1e09edfcc9
[   53.924361] vmclock_cmp: diff=-2001ns tsc=1e52070cd1
[   54.428370] vmclock_cmp: diff=-2007ns tsc=1e9a206b9d
[   54.932360] vmclock_cmp: diff=-2002ns tsc=1ee2391235
[   55.436372] vmclock_cmp: diff=-2003ns tsc=1f2a528a9d
[   55.940368] vmclock_cmp: diff=-1999ns tsc=1f726b6e91
[   56.444363] vmclock_cmp: diff=-2001ns tsc=1fba844d09
[   56.948363] vmclock_cmp: diff=-2001ns tsc=20029d5251
[   57.452384] vmclock_cmp: diff=-1997ns tsc=204ab72295
[   57.956363] vmclock_cmp: diff=-2002ns tsc=2092cf5f5d
[   58.460367] vmclock_cmp: diff=-2002ns tsc=20dae89265
[   58.964374] vmclock_cmp: diff=-2001ns tsc=212301d9cd
[   59.468366] vmclock_cmp: diff=-2002ns tsc=216b1a91c1
[   59.972370] vmclock_cmp: diff=-2001ns tsc=21b333c671
[   60.476364] vmclock_cmp: diff=-1998ns tsc=21fb4c9295

So there's still a jitter of single-digit nanoseconds, which is why I
figured a minimum for the *deliberate* skew of 20ns/s was negligibly
into the noise. But I'm happy to drop it.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-19 13:16 ` [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock Miroslav Lichvar
@ 2026-05-19 15:50   ` David Woodhouse
  2026-05-20 10:39     ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-19 15:50 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Tue, 2026-05-19 at 15:16 +0200, Miroslav Lichvar wrote:
> On Sun, May 17, 2026 at 10:25:37PM +0100, David Woodhouse wrote:
> > The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
> > provides a shared memory page containing a linear time function:
> > time = base + (counter - counter_value) × period. The guest can read
> > this at any time to determine the hypervisor's view of the current time,
> > without a VM exit.
> 
> That sounds nice.

The design has two major purposes:

 • Atomically letting the guest know that live migration has perturbed
   its clock. Without this, some distributed databases which rely on
   precision timestamps on transactions for eventual coherency were
   getting corrupted when guests were live migrated.

 • Avoiding the redundant work of having *hundreds* of guests on the
   same host *all* calibrating the same underlying oscillator, while
   enjoying the added fun of steal time as they're trying to do so.

Right now, the implementations in both QEMU and the EC2 Nitro
Hypervisor only implement part 1, the disruption signal.

I plan for QEMU to use the vmclock_host driver from this series, along
with the QEMU patch I linked, to expose the host's real time clock for
guests to follow.

For dedicated hosting environments like EC2, we don't care very much
about the host's timekeeping; that host kernel exists *only* to host
KVM guests. The host userspace can ignore the host's timekeeping
completely and manage the relationship of the counter to real time
directly — and in some cases will have hardware which will latch the
actual CPU's counter at the moment of a 1PPS signal. We'll feed that
counter-to-realtime information *directly* to guests.

(And will probably export timekeeping_set_reference() via a syscall of
some kind so that we *can* set the host's clock from it too, if I can't
find a way to precisely do so through adjtimex.) 

> > The existing ptp_vmclock driver already exposes this as a PTP clock for
> > userspace consumers (phc2sys, chrony). This series adds kernel-internal
> > consumption: the tick mechanism can clamp directly to the vmclock
> > reference, eliminating the need for NTP to discipline the guest clock.
> 
> I'm not very familiar with the VM timekeeping and other code. If I
> understand this idea correctly, by loading the ptp_vmclock module the
> guest kernel is giving the host control of its clock. 

Right *now*, the ptp_vmclock module is only providing a PTP clock for
userspace to discipline the kernel against, as noted above. But yes,
the intent of what I'm doing here is to bypass all that complexity and
manage the explicit counter-to-time relationship *directly* within the
guest kernel.

I did briefly play with simulating 1PPS, and injecting PPS events at
the precise time that a PPS signal *would* have triggered, to the
cycle: 
https://lore.kernel.org/all/87cb97d5a26d0f4909d2ba2545c4b43281109470.camel@infradead.org/

> Changes in the host's REALTIME/MONOTONIC clock frequency are mirrored
> to the guest's clock. 

Strictly, "changes in the realtime clock frequency advertised by the
vmclock device", but basically yes.

> Differences larger than 100 milliseconds are corrected by step,
> whether the guest applications like it or not. Smaller steps and
> errors accumulated due to a delay in the frequency update (is there a
> limit to this delay?) are corrected by the kernel NTP PLL (with the
> default time constant?). 

That behaviour isn't set in stone for vmclock; I'm still only
experimenting with the part where it *can* set the frequency, and an
offset that the kernel will converge to and *stay* on.

Right now it just calls my ntp_set_time_offset() which doesn't step at
all, and always just injects via ->time_offset (the NTP PLL). Much the
same as legacy adjtime() AIUI.

> When the guest is migrated to a different host, the frequency offset
> between the two hosts is injected to the NTP frequency (assuming
> REALTIME clocks of the hosts have zero frequency error at that
> moment?).

When the advertised frequency changes (either due to the ongoing clock
discipline on the host, or because of migration to a new host), the new
frequency is injected directly into tick_length.

> Have you considered a different approach that would address the
> problem with frequency step by adjusting the guest's clocksource
> frequency to match the original host? That would correct all system
> clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> AUX clocks.

You mean TSC scaling to change the frequency of the actual counter? 

When stepping between non-identical hosts, that might be helpful. But
we still have to deal with the variance of the counter over time even
without migration in the picture.

> The guest would still be in control of its clock and follow its own
> preferences to stepping, maximum frequency errors, etc. It could still
> compare the stability and accuracy of the host's clock and use it for
> synchronization only when it's actually better than other available
> time sources (some VPS providers are known to have poorly synchronized
> host clocks).

I think that mode is already available as a PTP clock, isn't it?

While of course it should be optional for the guest, I'm deliberately
optimising for the case here where the hosting provider *does* get it
right and *can* be trusted.

>  An AUX clock could be used to more accurately compare
> frequencies of the two hosts, ignoring phase corrections.
> 
> There is a work in progress for chrony to support MONOTONIC_RAW as the
> main clock. It would be nice if that could be corrected in migrations.

Not sure I understand this. I thought the whole point of MONOTONIC_RAW
is that it *isn't* skewed by NTP?

> That seems to be a common cause of disruptions of public NTP servers.
> Polling for notifications about clock changes caused by migrations and
> system suspend+resume would be useful in any case.

That much you can do today with /dev/vmclock even when it isn't
exposing the actual time information.

Timekeeping in migration is fairly hosed in KVM. I don't think there
are many implementations that actually set the TSC correctly on the
destination host. But that's a different story...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-19 10:04     ` David Woodhouse
@ 2026-05-19 19:28       ` John Stultz
  2026-05-20 10:47         ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: John Stultz @ 2026-05-19 19:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Miroslav Lichvar, Julien Ridoux, Ryan Luu, linux-kernel

On Tue, May 19, 2026 at 3:04 AM David Woodhouse <dwmw2@infradead.org> wrote:
> On Mon, 2026-05-18 at 18:59 -0700, John Stultz wrote:
> > On Sun, May 17, 2026 at 3:03 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > >
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > timekeeping_apply_adjustment() modifies xtime_nsec to maintain vDSO
> > > monotonicity when mult changes:
> > >
> > >     xtime_nsec -= offset
> > >
> > > This ensures that the time reported to userspace doesn't jump when the
> > > multiplier is adjusted. However, ntp_error — which tracks the difference
> > > between intended and actual clock position — was not updated to reflect
> > > this change.
> > >
> > > After a mult change, xtime_nsec has moved but ntp_error still reflects
> > > the old position. For the normal ±1 dithering this is negligible (the
> > > adjustments cancel over time), but for larger mult changes — such as
> > > when an external reference clock sets a new frequency — the one-time
> > > uncompensated offset is significant (~38ns for a 700-count mult change).
> > >
> > > Fix by adjusting ntp_error by the same amount:
> > >
> > >     ntp_error += offset << ntp_error_shift
> > >
> > > This keeps ntp_error consistent with the actual xtime_nsec position
> > > after the clawback.
> > >
> > > Fixes: 1b1b3e2a3671 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
> >
> > That doesn't seem to be the right commit. Do you mean dc491596f639 ?
>
> Er yes, that 1b1b commit doesn't even exist. I've been keeping the AI
> on a *very* tight rein as I navigate all this, but that one escaped.

You probably should be including the Assisted-by tags then.
https://docs.kernel.org/process/coding-assistants.html

> > Also, since you're re-adding it, could you add a detailed rationale to
> > the comment in timekeeping_apply_adjustment()? (It had long been on my
> > todo, but by the time I started adding the commits the details had
> > faded and I never got the time to re-derive the math.)
>
> I was hoping not to have to think about that part. The fact here is
> that it *does* apply an offset to 'xtime' and thus of course the delta
> from xtime(A) to where we ought to be right now (B) has changed by the
> same amount.
>
> Calculating *what* that offset should be, is... above my pay grade :)
>
> And I do think it's mostly working, so is there a particular reason you
> want me to take a closer look?

Well, mostly just to document the context you have and the justification for
adding the ntp_error adjustment.  I think you've done a reasonable job
articulating it in commit message and discussion here, but having it
in the comment helps others understand how it's derived in the future.
Part of why the ntp_err adjustment was previously dropped is because I
never got around to documenting the math for why it was necessary, so
I didn't have a case to keep it.
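
As a rough illustration of the bookkeeping being discussed (not the kernel
code itself), the clawback can be sketched with a simplified stand-in for
the timekeeper. The struct, field and function names below are
hypothetical, and the sign and scaling of the ntp_error update simply
follow the commit message quoted above; they have not been checked against
the tree.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for the relevant timekeeper fields. */
struct tk_sketch {
	uint32_t mult;            /* cycles -> shifted-ns multiplier */
	uint32_t shift;           /* shifted ns >> shift == ns */
	int64_t xtime_nsec;       /* accumulated time in shifted ns */
	int64_t ntp_error;        /* error tracking, in finer units */
	uint32_t ntp_error_shift; /* scale between xtime_nsec and ntp_error */
};

/* Time in ns at 'offset' cycles past the last accumulation point. */
static int64_t tk_read_ns(const struct tk_sketch *tk, int64_t offset)
{
	return (tk->xtime_nsec + offset * (int64_t)tk->mult) >> tk->shift;
}

/*
 * Change mult by mult_adj while keeping the time reported at 'offset'
 * unchanged, and mirror the same clawback into ntp_error as the commit
 * message describes.
 */
static void tk_apply_adjustment(struct tk_sketch *tk, int64_t offset,
				int32_t mult_adj)
{
	int64_t clawback = offset * mult_adj; /* shifted ns */

	tk->mult = (uint32_t)((int64_t)tk->mult + mult_adj);
	tk->xtime_nsec -= clawback;
	/* Multiply rather than shift, so a negative clawback is well defined. */
	tk->ntp_error += clawback * ((int64_t)1 << tk->ntp_error_shift);
}
```

The point of the sketch is only the invariant: tk_read_ns() at the same
offset returns the same value before and after the adjustment, because
xtime_nsec moved by exactly offset * mult_adj, and ntp_error moves by the
same amount in its own units.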


> Because the moment I start looking at the comment, I see the part which
> says
>          * So given the same offset value, we need the time to be the same
>          * both before and after the freq adjustment.
>
> ... and I come to believe that 'before' the freq adjustment is actually
> some point in the *future*; the last counter reading at which a vDSO
> *currently* running on another CPU might possibly apply the faster
> formula from the previous tick? And then my brain falls out and I have
> to sit under the desk rocking back and forth for a while... ?

So the timekeeping seq locking should avoid vDSOs from using stale
values across mult updates.
However, the "fast" (a name I hate, really it's just lockfree)
ktime_get_real_fast_ns() and friends do not have that protection.

But yea, it does get subtle and complicated, which is all the more
reason to make sure we have things well documented.

> > Miroslav's review and input here would also be good.
>
> Ack. Thomas had nudged me to add Miroslav to Cc, which
get_maintainer.pl had not.

Yeah. Maybe it would be good to get Miroslav added as a reviewer in
the MAINTAINERS file.

Miroslav: Any objection to that?

thanks
-john

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-19 15:50   ` David Woodhouse
@ 2026-05-20 10:39     ` Miroslav Lichvar
  2026-05-20 12:21       ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-20 10:39 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> The design has two major purposes:

>  • Avoiding the redundant work of having *hundreds* of guests on the
>    same host *all* calibrating the same underlying oscillator, while
>    enjoying the added fun of steal time as they're trying to do so.

But isn't that work still duplicated, only moved to the kernel? The
userspace part could be a simple loop waiting for vmclock
notifications and following the changes of the host. The only
difference would be a longer delay, but still insignificant for the
intended purpose, right?

I don't like the idea of adding more clock control loops to the kernel
much. It's a complexity that will likely grow as different
requirements come and the code will be even more difficult to
understand. IMHO the NTP PLL and hard PPS loops shouldn't have been
included in the kernel. The kernel time control API should have been
just setting/stepping the time and changing the
frequency, both possibly at a specified time instead of the time of
the call.

> > Have you considered a different approach that would address the
> > problem with frequency step by adjusting the guest's clocksource
> > frequency to match the original host? That would correct all system
> > clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> > AUX clocks.
> 
> You mean TSC scaling to change the frequency of the actual counter? 

Yes, in hardware if available, or in software if not. An additional
32-bit multiplier applied like this:

 cycles += (cycles * mult) >> shift

Larger adjustments can be done in the normal multiplier for all clocks.
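
For illustration only, the software fallback could be sketched roughly as
follows. The helper names, the signed mult (so the counter can also be
slowed down) and the shift <= 32 restriction are assumptions made for this
sketch, not something taken from an existing implementation.

```c
#include <stdint.h>

/*
 * (a * mul) >> shift without 128-bit arithmetic.
 * Exact for shift <= 32; larger shifts are not supported here.
 */
static uint64_t mul_shr_sketch(uint64_t a, uint32_t mul, unsigned int shift)
{
	uint32_t al = (uint32_t)a;
	uint32_t ah = (uint32_t)(a >> 32);
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}

/*
 * cycles += (cycles * mult) >> shift, with a signed mult.
 * With shift == 32, mult is a fraction of 2^32, so about 4295 per ppm.
 */
static uint64_t scale_cycles(uint64_t cycles, int32_t mult, unsigned int shift)
{
	if (mult >= 0)
		return cycles + mul_shr_sketch(cycles, (uint32_t)mult, shift);
	/* Negate in 64 bits so that INT32_MIN does not overflow. */
	return cycles - mul_shr_sketch(cycles, (uint32_t)(-(int64_t)mult), shift);
}
```

For example, scale_cycles(1000000000, 4295, 32) adds 1000 cycles, i.e.
roughly +1 ppm.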

> When stepping between non-identical hosts, that might be helpful. But
> we still have to deal with the variance of the counter over time even
> without migration in the picture.

Whatever is synchronizing the guest clock to the host (using the PHC
or vmclock page) will take care of that? The point is to avoid
migrations causing a frequency step.

I'm not sure what identical and non-identical hosts mean in this
context, same nominal CPU frequency, or a CPU tied to the same crystal
oscillator?

> > The guest would still be in control of its clock and follow its own
> > preferences to stepping, maximum frequency errors, etc. It could still
> > compare the stability and accuracy of the host's clock and use it for
> > synchronization only when it's actually better than other available
> > time sources (some VPS providers are known to have poorly synchronized
> > host clocks).
> 
> I think that mode is already available as a PTP clock, isn't it?

Yes, but it's slow due to missing frequency transfer, not feed-forward
as you call it. The host's frequency offset could be exposed in the
PHC's timex.

> > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > main clock. It would be nice if that could be corrected in migrations.
> 
> Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> is that it *isn't* skewed by NTP?

It isn't adjusted, but it can be used as a stable reference avoiding
the multiplier-induced jitter, interference from other processes, and
synchronization loops, e.g. when an NTP client is synchronizing to an
NTP server running on the same system (in different containers). 

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-19 19:28       ` John Stultz
@ 2026-05-20 10:47         ` Miroslav Lichvar
  2026-05-20 12:37           ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-20 10:47 UTC (permalink / raw)
  To: John Stultz
  Cc: David Woodhouse, Richard Cochran, Wen Gu, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel

On Tue, May 19, 2026 at 12:28:06PM -0700, John Stultz wrote:
> On Tue, May 19, 2026 at 3:04 AM David Woodhouse <dwmw2@infradead.org> wrote:
> But yea, it does get subtle and complicated, which is all the more
> reason to make sure we have things well documented.
> 
> > > Miroslav's review and input here would also be good.
> >
> > Ack. Thomas had nudged me to add Miroslav to Cc, which
> > get_maintainer.pl had not.
> 
> Yeah. Maybe it would be good to get Miroslav added as a reviewer in
> the MAINTAINERS file.
> 
> Miroslav: Any objection to that?

I'm ok with that, but I can't promise to provide any actual reviews.
Loading all that context needed to understand the code is painful. It
would be so nice if it could be simplified by switching to a 64-bit
multiplier.

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-20 10:39     ` Miroslav Lichvar
@ 2026-05-20 12:21       ` David Woodhouse
  2026-05-21  6:35         ` Miroslav Lichvar
  2026-05-21 18:30         ` Thomas Gleixner
  0 siblings, 2 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-20 12:21 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > The design has two major purposes:
> 
> >  • Avoiding the redundant work of having *hundreds* of guests on the
> >    same host *all* calibrating the same underlying oscillator, while
> >    enjoying the added fun of steal time as they're trying to do so.
> 
> But isn't that work still duplicated, only moved to the kernel? 

Not the actual calibration of the TSC against real time, no. It is the
*host* which gets the 1PPS signal and does all the work of tracking and
smoothing the frequency drift over time. The guest basically gets the
same as a vDSO, *telling* it a relationship from TSC to real time.

Many guests in trustworthy hosting environments will just use that and
want to feed it directly to the guest kernel timekeeping. Others might
want to take a more opinionated stance, as you describe below. Those
probably *would* duplicate some of the effort, in order to form their
opinion.

> The userspace part could be a simple loop waiting for vmclock
> notifications and following the changes of the host. The only
> difference would be a longer delay, but still insignificant for the
> intended purpose, right?
> 
> I don't like the idea of adding more clock control loops to the kernel
> much.

I completely agree. I am absolutely not planning to add any more clock
control to the kernel than we already have. As you say, we probably
have too many already.

>  It's a complexity that will likely grow as different
> requirements come and the code will be even more difficult to
> understand. IMHO the NTP PLL and hard PPS loops shouldn't have been
> included in the kernel. The kernel time control API should have been
> just setting/stepping the time and changing the frequency, both possibly
>  at a specified time instead of the time of the call.

There is merit in that argument.

The kernel already has a separation between the core timekeeping code
in timekeeping.c and the rest of the NTP code in ntp.c which does the
higher level control.

The timekeeping_set_reference() added in my patch *only* uses the
existing basic timekeeping code, taking the vDSO-like information that
I mentioned above, and using it to set the frequency and offset for the
kernel's core timekeeping to follow.

There's a cleaner version in my tree now, because having fixed all the
errors in the core timekeeping which were introducing drift, the
implementation of timekeeping_set_reference() can be a *whole* lot
simpler than it was in my initial proof of concept — it now really can
just set the tick length and time_offset, and let it run:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=c62bf50eca

> > > Have you considered a different approach that would address the
> > > problem with frequency step by adjusting the guest's clocksource
> > > frequency to match the original host? That would correct all system
> > > clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> > > AUX clocks.
> > 
> > You mean TSC scaling to change the frequency of the actual counter?
> 
> Yes, in hardware if available, or in software if not. An additional
> 32-bit multiplier applied like this:
> 
>  cycles += (cycles * mult) >> shift
> 
> Larger adjustments can be done in the normal multiplier for all clocks.
> 
> > When stepping between non-identical hosts, that might be helpful. But
> > we still have to deal with the variance of the counter over time even
> > without migration in the picture.
> 
> Whatever is synchronizing the guest clock to the host (using the PHC
> or vmclock page) will take care of that? The point is to avoid
> migrations causing a frequency step.
> 
> I'm not sure what identical and non-identical hosts mean in this
> context, same nominal CPU frequency, or a CPU tied to the same crystal
> oscillator?

I meant same nominal frequency.

I'm not sure what scaling the guest TSC would buy us. Sure, it would
minimise the frequency step at the moment of migration, but a naïve
guest which isn't using vmclock's disruption signal is screwed on live
migration *anyway*, because there's *also* a step change in the actual
TSC value which is bounded by the real time synchronization of the
source and destination host. 

Anything the guest has done for itself to calibrate the source host's
TSC must be entirely thrown away on migration. The vmclock allows the
destination host to immediately say "here, use this instead".

AFAICT scaling the TSC would just add complexity and wouldn't help
much.

And TSC scaling is pretty much x86-specific; other architectures have a
*defined* counter frequency and don't need to support scaling.

I'm not a fan :)

> > > The guest would still be in control of its clock and follow its own
> > > preferences to stepping, maximum frequency errors, etc. It could still
> > > compare the stability and accuracy of the host's clock and use it for
> > > synchronization only when it's actually better than other available
> > > time sources (some VPS providers are known to have poorly synchronized
> > > host clocks).
> > 
> > I think that mode is already available as a PTP clock, isn't it?
> 
> Yes, but it's slow due to missing frequency transfer, not feed-forward
> as you call it. The host's frequency offset could be exposed in the
> PHC's timex.

Yes, that makes a lot of sense.

You can literally open /dev/vmclock and consume it *however* you like
from userspace. You can even poll() and get woken when there's an
update. I think that would be a great thing for chrony to learn to do
(and that's how you get the disruption signal too).
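
A minimal userspace sketch of that wait loop, under some assumptions: the
function names are made up for illustration, the exact device node name may
differ from "/dev/vmclock", and treating readiness as POLLIN is a guess at
how the driver signals an update. Mapping and parsing the shared page is
deliberately left out.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to signal readiness.
 * Returns 1 on an event, 0 on timeout, -1 on error.
 */
static int wait_for_update(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Open the given device node and wait once for an update. */
static int open_and_wait(const char *path, int timeout_ms)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = wait_for_update(fd, timeout_ms);
	close(fd);
	return ret;
}
```

A consumer would call something like open_and_wait("/dev/vmclock", -1) in
a loop and re-read the clock parameters each time it returns 1.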

> > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > main clock. It would be nice if that could be corrected in migrations.
> > 
> > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > is that it *isn't* skewed by NTP?
> 
> It isn't adjusted, but it can be used as a stable reference avoiding
> the multiplier-induced jitter, interference from other processes, and
> synchronization loops, e.g. when an NTP client is synchronizing to an
> NTP server running on the same system (in different containers). 

We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
we? Do all our clock discipline of the *TSC* against the external
sources, and then use the same timekeeping_set_reference() to ask the
kernel's core timekeeping to track the TSC-to-realtime relationship
that we desire?

That's exactly what I'm planning to do for a dedicated hosting
environment. I think the patches which allow PTP to return paired
timestamps with reference to TSC instead of CLOCK_MONOTONIC landed in
the net-next tree today?

(for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
specific but 'TSC' is quicker to type...)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error
  2026-05-20 10:47         ` Miroslav Lichvar
@ 2026-05-20 12:37           ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-20 12:37 UTC (permalink / raw)
  To: Miroslav Lichvar, John Stultz
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel

On Wed, 2026-05-20 at 12:47 +0200, Miroslav Lichvar wrote:
> On Tue, May 19, 2026 at 12:28:06PM -0700, John Stultz wrote:
> > On Tue, May 19, 2026 at 3:04 AM David Woodhouse <dwmw2@infradead.org> wrote:
> > But yea, it does get subtle and complicated, which is all the more
> > reason to make sure we have things well documented.
> > 
> > > > Miroslav's review and input here would also be good.
> > > 
> > > Ack. Thomas had nudged me to add Miroslav to Cc, which
> > > get_maintainer.pl had not.
> > 
> > Yeah. Maybe it would be good to get Miroslav added as a reviewer in
> > the MAINTAINERS file.
> > 
> > Miroslav: Any objection to that?
> 
> I'm ok with that, but I can't promise to provide any actual reviews.
> Loading all that context needed to understand the code is painful. It
> would be so nice if it could be simplified by switching to a 64-bit
> multiplier.

I'll put that into a separate patch in my series, in with the bug fixes
before it gets to the fun proof-of-concept stuff at the end :)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-20 12:21       ` David Woodhouse
@ 2026-05-21  6:35         ` Miroslav Lichvar
  2026-05-21  9:54           ` David Woodhouse
  2026-05-21 18:30         ` Thomas Gleixner
  1 sibling, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-21  6:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Wed, May 20, 2026 at 01:21:46PM +0100, David Woodhouse wrote:
> On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> > On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > > The design has two major purposes:
> > 
> > >  • Avoiding the redundant work of having *hundreds* of guests on the
> > >    same host *all* calibrating the same underlying oscillator, while
> > >    enjoying the added fun of steal time as they're trying to do so.
> > 
> > But isn't that work still duplicated, only moved to the kernel? 
> 
> Not the actual calibration of the TSC against real time, no. It is the
> *host* which gets the 1PPS signal and does all the work of tracking and
> smoothing the frequency drift over time. The guest basically gets the
> same as a vDSO, *telling* it a relationship from TSC to real time.

Ok, but I don't see why the phase corrections of the guest need to be
in the kernel.

> > I don't like the idea of adding more clock control loops to the kernel
> > much.
> 
> I completely agree. I am absolutely not planning to add any more clock
> control to the kernel than we already have. As you say, we probably
> have too many already.

If the vmclock driver is feeding the PLL with the offset between the
host and guest clocks, I think that would count as a loop.

> I'm not sure what scaling the guest TSC would buy us. Sure, it would
> minimise the frequency step at the moment of migration, but a naïve
> guest which isn't using vmclock's disruption signal is screwed on live
> migration *anyway*, because there's *also* a step change in the actual
> TSC value which is bounded by the real time synchronization of the
> source and destination host. 

The TSC offset can be corrected too. I thought that was already
happening.

> AFAICT scaling the TSC would just add complexity and wouldn't help
> much.

I think it's a better place to be solving this kind of problem. It's
compensating for a hardware change. It doesn't need to happen only at
migration. You could adjust the frequency continuously if you really
wanted, kind of like synchronous ethernet is doing for clocks over
network, improving the stability of the physical clock and phase
corrections are done on top of it at a higher level.

> And TSC scaling is pretty much x86-specific; other architectures have a
> *defined* counter frequency and don't need to support scaling.

There can be a software fallback if hardware scaling and/or offset is
not supported.

> > > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > > main clock. It would be nice if that could be corrected in migrations.
> > > 
> > > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > > is that it *isn't* skewed by NTP?
> > 
> > It isn't adjusted, but it can be used as a stable reference avoiding
> > the multiplier-induced jitter, interference from other processes, and
> > synchronization loops, e.g. when an NTP client is synchronizing to an
> > NTP server running on the same system (in different containers). 
> 
> We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> we?

> (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> specific but 'TSC' is quicker to type...)

Meaning userspace would have to duplicate the kernel's handling of
the counter (wrapping and scaling) just to avoid a single
multiplication in the vDSO?

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-21  6:35         ` Miroslav Lichvar
@ 2026-05-21  9:54           ` David Woodhouse
  2026-05-25  8:08             ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-21  9:54 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> On Wed, May 20, 2026 at 01:21:46PM +0100, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> > > On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > > > The design has two major purposes:
> > > 
> > > >  • Avoiding the redundant work of having *hundreds* of guests on the
> > > >    same host *all* calibrating the same underlying oscillator, while
> > > >    enjoying the added fun of steal time as they're trying to do so.
> > > 
> > > But isn't that work still duplicated, only moved to the kernel? 
> > 
> > Not the actual calibration of the TSC against real time, no. It is the
> > *host* which gets the 1PPS signal and does all the work of tracking and
> > smoothing the frequency drift over time. The guest basically gets the
> > same as a vDSO, *telling* it a relationship from TSC to real time.
> 
> Ok, but I don't see why the phase corrections of the guest need to be
> in the kernel.

I'm not sure I understand. 

There are no 'phase corrections' as such, except of course that the
phase of the guest kernel's clock does get corrected, and naturally
that does have to take effect inside the guest kernel.

I think the key here is that this is not a feedback loop based on
corrections to the existing clock output; this is a feedforward design
as described in https://dl.acm.org/doi/pdf/10.1109/TNET.2011.2158443

It seems that when Julien et al lamented that, "Until now, however,
there has been a serious practical issue inhibiting feed-forward
approaches: a lack of kernel support", the basics were actually there
in the kernel's core timekeeping all along.

We didn't have to *do* anything to the core timekeeping other than fix
a few bugs that the NTP feedback mechanism always masked — who *cares*
if there's a systematic +5PPM drift due to accounting errors, as NTP
can just interpret that as the counter running 5PPM fast and adjust for
it?

Although I don't think the errors are quite that consistent, as they
vary with tick length and even from tick to tick with the mult±1
dithering and interrupt latency — so I wouldn't be surprised if these
fixes made a detectable improvement even in the normal NTP case.

> > > I don't like the idea of adding more clock control loops to the kernel
> > > much.
> > 
> > I completely agree. I am absolutely not planning to add any more clock
> > control to the kernel than we already have. As you say, we probably
> > have too many already.
> 
> If the vmclock driver is feeding the PLL with the offset between the
> host and guest clocks, I think that would count as a loop.

It's not an offset; it's a direct feed-forward "when the TSC is <this>
the time is <this>" relationship, like a vDSO does.

https://uapi-group.org/specifications/specs/vmclock/
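
As a rough illustration of that feed-forward relationship, here is a
minimal sketch of evaluating time = base + (counter - counter_value) × period.
The struct layout, field names and the ff_counter_to_time() helper are
hypothetical simplifications for this example, not the vmclock ABI; the
fixed-point units (2^-64 s fractions plus a period shift) are likewise an
assumption chosen to keep the arithmetic exact.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified reference record; NOT the real vmclock layout. */
struct ff_reference {
	uint64_t counter_value;  /* counter reading at the reference point */
	uint64_t base_sec;       /* whole seconds at the reference point */
	uint64_t base_frac;      /* fraction of a second, units of 2^-64 s */
	uint64_t period_frac;    /* counter period, units of 2^-(64+shift) s */
	uint8_t  period_shift;   /* extra precision bits in period_frac */
};

struct ff_time {
	uint64_t sec;
	uint64_t frac;           /* units of 2^-64 s */
};

/*
 * Pure feed-forward: no feedback from the clock's own output, just a
 * linear function of the counter. Assumes 'counter' is at or after
 * counter_value and that delta * period_frac fits in 128 bits.
 */
static struct ff_time ff_counter_to_time(const struct ff_reference *ref,
					 uint64_t counter)
{
	assert(ref->period_shift < 64);

	uint64_t delta = counter - ref->counter_value;
	unsigned __int128 elapsed = (unsigned __int128)delta * ref->period_frac;

	elapsed >>= ref->period_shift;          /* now in units of 2^-64 s */

	unsigned __int128 total = elapsed + ref->base_frac;
	struct ff_time t = {
		.sec  = ref->base_sec + (uint64_t)(total >> 64),
		.frac = (uint64_t)total,
	};
	return t;
}
```

With a 2^30 Hz counter (period_frac = 1 << 34, period_shift = 0), one
second's worth of ticks past the reference point advances the result by
exactly one second.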

The core motivation is for virtual machines (and especially for
consistent time across live migration), but hardware implementations
should be possible using PCIe PTM. I keep meaning to get my hands on a
TimeCard and play, but there are only so many hours in the day...

> > I'm not sure what scaling the guest TSC would buy us. Sure, it would
> > minimise the frequency step at the moment of migration, but a naïve
> > guest which isn't using vmclock's disruption signal is screwed on live
> > migration *anyway*, because there's *also* a step change in the actual
> > TSC value which is bounded by the real time synchronization of the
> > source and destination host. 
> 
> The TSC offset can be corrected too. I thought that was already
> happening.

Yes, it is. The TSC offset (and the guest's KVM clock, which is a whole
different sad story) can be corrected a bit — but the *accuracy* with
which they can be corrected is limited to the accuracy of the source
vs. destination hosts' time synchronization.

If the guest has been using NTP or a PHC to discipline the counter of
the source host that it just came from, carefully tracking not only the
perceived time, but also error bounds in order to ensure coherency of,
say, a distributed database... there is no way that we can migrate it
to a new host and 'fake' the frequency/offset on the new host to
sufficiently match. Database corruption ensues.

The best thing to do is to advertise a disruption signal ("throw away
anything you know about the existing counter"), and provide information
on the new host in that {cycle_count, reference time, counter period,
error bounds} form to allow the guest to return to service as soon as
possible.

Which is precisely what vmclock does.

> > AFAICT scaling the TSC would just add complexity and wouldn't help
> > much.
> 
> I think it's a better place to be solving this kind of problem. It's
> compensating for a hardware change. It doesn't need to happen only at
> migration. You could adjust the frequency continuously if you really
> wanted, kind of like synchronous ethernet is doing for clocks over
> network, improving the stability of the physical clock and phase
> corrections are done on top of it at a higher level.

On the *host* side I might accept a PLL on the actual hardware
oscillator and the 1PPS signal... :)

> > And TSC scaling is pretty much x86-specific; other architectures have a
> > *defined* counter frequency and don't need to support scaling.
> 
> There can be a software fallback if hardware scaling and/or offset is
> not supported.

Right. This *is* the software fallback, because the hardware scaling
and offset aren't sufficient even if we only care about x86 where the
former is supported.

> > > > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > > > main clock. It would be nice if that could be corrected in migrations.
> > > > 
> > > > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > > > is that it *isn't* skewed by NTP?
> > > 
> > > It isn't adjusted, but it can be used as a stable reference avoiding
> > > the multiplier-induced jitter, interference from other processes, and
> > > synchronization loops, e.g. when an NTP client is synchronizing to an
> > > NTP server running on the same system (in different containers). 
> > 
> > We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> > we?
> 
> > (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> > specific but 'TSC' is quicker to type...)
> 
> Meaning userspace would have to duplicate the kernel's handling of
> the counter (wrapping and scaling) just to avoid a single
> multiplication in the vDSO?

Hm yeah, I guess that makes sense.

The way I've done it in these proof of concept patches is counter-
based, because the interface between host and guest (and from that
theoretical hardware implementation) *is* necessarily in terms of the
hardware — we get told the relationship of the actual *counter* to
realtime.

But as long as the conversions in both directions are quick and
accurate there's no fundamental reason why it *couldn't* be expressed
in terms of MONOTONIC_RAW as it's being passed around.

In my RFC, it's just a call to timekeeping_set_reference() which uses
the *existing* mechanisms to just set tick_length and time_offset
accordingly. Which naturally takes counter-based units too.

But I certainly don't think that doing so *unconditionally* from the
vmclock driver in my proof of concept is the right thing to do.
Userspace needs to set policy like that.

And I wasn't stunningly happy with timekeeping_set_reference() passing
fractional seconds in the vmclock (seconds<<64) units instead of the
native (nanoseconds<<32) of the timekeeping code.
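
To make the unit mismatch concrete, here is a small sketch of converting
between the two fixed-point forms, assuming "seconds<<64" means a 64-bit
fraction of a second in units of 2^-64 s and "nanoseconds<<32" means
nanoseconds with 32 fractional bits. The helper names are invented for
this example. Since frac / 2^64 s × 10^9 ns/s × 2^32 = frac × 10^9 / 2^32,
the forward direction is one widening multiply and a shift.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/* Fraction of a second (2^-64 s units) -> nanoseconds << 32, truncating. */
static uint64_t frac_sec_to_ns_shift32(uint64_t frac)
{
	return (uint64_t)(((unsigned __int128)frac * NSEC_PER_SEC_SKETCH) >> 32);
}

/* Nanoseconds << 32 (must be below one second) -> 2^-64 s units, truncating. */
static uint64_t ns_shift32_to_frac_sec(uint64_t ns_shift32)
{
	/* Anything of one second or more would not fit in a pure fraction. */
	assert(ns_shift32 < (NSEC_PER_SEC_SKETCH << 32));

	return (uint64_t)((((unsigned __int128)ns_shift32) << 32) /
			  NSEC_PER_SEC_SKETCH);
}
```

Both directions truncate, so a round trip can lose the lowest bits; half
a second happens to convert exactly in both directions.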

So maybe timekeeping_set_reference() should take its input in
MONOTONIC_RAW terms, and the raw information from vmclock should be
converted accordingly? I can try that...

On the *host* side, I anticipate two modes of operation.

A dedicated hosting environment only really cares about disciplining
the host kernel's TSC, and absolutely doesn't *care* about the host
kernel's timekeeping. That's just for logs.

For migrating KVM guests as accurately as possible, we set the guest
*TSC* (scaling and) offset based on our understanding of the host TSC
on both source and destination. The KVM APIs for doing this based on
the kernel's own CLOCK_REALTIME are... a source of sadness. There's a
whole 30-patch series in flight to deal with that, which you can look
at if you like pain, but the tl;dr is that we get the host kernel's
timekeeping out of the picture as *much* as possible and operate in
terms of the TSC. Migrate the guest kernel's TSC as accurately as
possible, and everything *else* in the guest is derived from that.

So in that dedicated environment, userspace will take our hardware
devices which literally latch the *counter* value on a 1PPS signal, or
use NTP if they really have to fall back to that, and discipline the
*counter*, then use that information to both provide the vmclock for
guests, and migrate guests as accurately as possible. All in userspace,
*necessarily* in raw counter terms.

But hey, it's nice for logs to have good timestamps too, so we can feed
it to the kernel's CLOCK_REALTIME as an afterthought. Probably by using
a userspace hook for timekeeping_set_reference(). I haven't yet looked
at whether the existing adjtimex() can be used/abused/extended to allow
for precisely setting tick_length/time_offset like that.

And then there's the 'normal' host side, with a host kernel running
chrony and a few guests in QEMU. Obviously this mode needs to be
properly taken into account as a first class citizen, which is why I've
built the support that's already *in* QEMU (disruption signal only) and
now the vmclock_host and additional QEMU patch to expose that.

Again it needs to be in terms of the guest TSC by the time the VMM
actually puts it in the shared page, but I'm entirely open to input on
how we get it *out* of the kernel's timekeeping. I do tend to have the
opinion that what we should expose to guests is the "intended" clock,
with ntpdata->time_offset built in and *not* including the constant ±1
changes to 'mult' from the dithering, but using the *actual* intended
frequency from tick_length / cycle_interval.

But other than that, I'm prepared to consider the whole of the
vmclock_host export part as a straw man, and entirely happy to
completely reimplement it however you like, if you have strong
opinions. I just needed to get *something* implemented and working, as
a starting point.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-20 12:21       ` David Woodhouse
  2026-05-21  6:35         ` Miroslav Lichvar
@ 2026-05-21 18:30         ` Thomas Gleixner
  2026-05-21 21:06           ` David Woodhouse
  1 sibling, 1 reply; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-21 18:30 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Wed, May 20 2026 at 13:21, David Woodhouse wrote:
> On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
>> It isn't adjusted, but it can be used as a stable reference avoiding
>> the multiplier-induced jitter, interference from other processes, and
>> synchronization loops, e.g. when an NTP client is synchronizing to an
>> NTP server running on the same system (in different containers). 
>
> We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> we? Do all our clock discipline of the *TSC* against the external
> sources, and then use the same timekeeping_set_reference() to ask the
> kernel's core timekeeping to track the TSC-to-realtime relationship
> that we desire?
>
> That's exactly what I'm planning to do for a dedicated hosting
> environment. I think the patches which allow PTP to return paired
> timestamps with reference to TSC instead of CLOCK_MONOTONIC landed in
> the net-next tree today?

Bah.

> (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> specific but 'TSC' is quicker to type...)

As I said in the other thread, that's just creating yet another private
mechanism instead of collecting the counter value together with e.g.
CLOCK_REALTIME or utilizing the PTM-correlated one which is available in
get_device_system_crosststamp().

Can we please stop creating specialized interfaces and instead make them
generic, so they can be used for everything?

Then you can go and extend the posix-timer interface with
clock_set_time_reference() (or whatever name we come up with) and
provide the functionality for all steerable clocks. That'd allow chronyd
to completely ignore the kernel side NTP PLL and do everything in user
space. That obviously needs some thought and input from the chrony
folks, but that's a long term useful solution and not some 'scratch my
itch' side channel.

Thanks,

        tglx


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-21 18:30         ` Thomas Gleixner
@ 2026-05-21 21:06           ` David Woodhouse
  2026-05-22  8:02             ` Thomas Gleixner
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-21 21:06 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
> 
> As I said in the other thread, that's just creating yet another private
> mechanism instead of collecting the counter value together with e.g.
> CLOCK_REALTIME 

On the plus side, at least he wasn't providing a counter value at *all*
for the system timestamps, which is better than using a bogus one :)

Can we have a signed-off-by for your ktime_get_snapshot_id() please?

> or utilizing the PTM-correlated one which is available in
> get_device_system_crosststamp().

AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
driver literally did expose the cycle count used to create the device
timestamp, which is equivalent to PTM and looked correct for that part?

> Can we please stop creating specialized interfaces and instead make them
> generic, so they can be used for everything?

Of course.

> Then you can go and extend the posix-timer interface with
> clock_set_time_reference() (or whatever name we come up with) and
> provide the functionality for all steerable clocks. That'd allow chronyd
> to completely ignore the kernel side NTP PLL and do everything in user
> space. That obviously needs some thought and input from the chrony
> folks, but that's a long term useful solution and not some 'scratch my
> itch' side channel.

Yeah, that's a neat idea. I deliberately hadn't even *proposed* a
userspace API for that at all yet; for the timekeeping part I'm just
working on the basic *concepts* and the accounting fixes that make it
all actually work, with a hack to unconditionally call it directly from
vmclock for now.

In order to solicit exactly that feedback and design a long term
solution that works for everyone, before going too far down any
particular implementation path.

I like clock_set_time_reference(). I'll have a play and see what I can
come up with. It would want to carry error bounds information too.

Having a clock_get_time_reference() would be nice too for QEMU to use,
but that would just be a snapshot and wouldn't get updated when the
clock is adjusted. While the /dev/vmclock_host thing I have in my tree
right now can at least use a gtod notifier and the userspace device is
pollable. And it can export everything we need in one go. More thought
required on that one... but I'm very keen *not* to let that one get
forgotten, because I want this to work optimally for the general case
of QEMU running on a standard general purpose host, not *only* the
dedicated hosting setup where userspace is prepared to do all the work.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-21 21:06           ` David Woodhouse
@ 2026-05-22  8:02             ` Thomas Gleixner
  2026-05-22 10:01               ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-22  8:02 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Thu, May 21 2026 at 22:06, David Woodhouse wrote:
> On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
>> 
>> As I said in the other thread, that's just creating yet another private
>> mechanism instead of collecting the counter value together with e.g.
>> CLOCK_REALTIME 
>
> On the plus side, at least he wasn't providing a counter value at *all*
> for the system timestamps, which is better than using a bogus one :)

At least ... for now :)

> Can we have a signed-off-by for your ktime_get_snapshot_id() please?

Are you kidding? That's a PoC to demonstrate how it should be done and
it needs some thought to implement it correctly along with the
get_device_system_crosststamp() one, which is actually not entirely
correct as I noticed a few minutes ago.

>> or utilizing the PTM-correlated one which is available in
>> get_device_system_crosststamp().
>
> AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
> driver literally did expose the cycle count used to create the device
> timestamp, which is equivalent to PTM and looked correct for that
> part?

The vmclock driver lives in its own made-up world, so yes, this looks
consistent at first glance.

Thanks,

        tglx

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22  8:02             ` Thomas Gleixner
@ 2026-05-22 10:01               ` David Woodhouse
  2026-05-22 15:28                 ` Thomas Gleixner
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-22 10:01 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 22:06, David Woodhouse wrote:
> > On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
> > > 
> > > As I said in the other thread, that's just creating yet another private
> > > mechanism instead of collecting the counter value together with e.g.
> > > CLOCK_REALTIME 
> > 
> > On the plus side, at least he wasn't providing a counter value at *all*
> > for the system timestamps, which is better than using a bogus one :)
> 
> At least ... for now :)
> 
> > Can we have a signed-off-by for your ktime_get_snapshot_id() please?
> 
> Are you kidding? That's a PoC to demonstrate how it should be done and
> it needs some thought to implement it correctly along with the
> get_device_system_crosststamp() one, which is actually not entirely
> correct as I noticed a few minutes ago.

Obviously. But to take a PoC and then do that thought and turn it into
something we can use, it still needs a Co-developed-by: and thus a
Signed-off-by: if you would be so kind.

> > > or utilizing the PTM-correlated one which is available in
> > > get_device_system_crosststamp().
> > 
> > AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
> > driver literally did expose the cycle count used to create the device
> > timestamp, which is equivalent to PTM and looked correct for that
> > part?
> 
> The vmclock driver lives in its own made-up world, so yes, this looks
> consistent at first glance.

Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.

And they *use* it... and wait, get_device_system_crosststamp() already
*does* require the device to generate a system_counterval_t, so your
nightmare world where driver authors might pull it out of their
posterior *already* exists, doesn't it?

And we have things like stmmac which already populate it using
CSID_X86_ART.

So at least for PTP_SYS_OFFSET_PRECISE, isn't Arthur's patch literally
only exporting the same counter values that the driver *already*
creates? I'm not quite sure why we have all these histrionics about
drivers not being able to create those reliably?

Yes, there's plenty to improve as discussed, and we should probably
have get_device_system_crosststamp() copy the values from the
system_counterval on its local stack into the system_device_crosststamp
rather than asking the driver to pass it back through separate fields
in the attributes.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22 10:01               ` David Woodhouse
@ 2026-05-22 15:28                 ` Thomas Gleixner
  2026-05-22 16:23                   ` David Woodhouse
  2026-05-22 16:50                   ` David Woodhouse
  0 siblings, 2 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-22 15:28 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, May 22 2026 at 11:01, David Woodhouse wrote:
> On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> Obviously. But to take a PoC and then do that thought and turn it into
> something we can use, it still needs a Co-developed-by: and thus a
> Signed-off-by: if you would be so kind.

  git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping

is the work-in-progress state as of now. I'm not going to touch it in
the next few days and it's still in a rough, uncompiled state. It lacks
quite a few change logs and the last patch needs to be split up.

I'll go and have a look next week so that I can rethink the approach
with a clear mind.

> Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
> drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.
>
> And they *use* it... and wait, get_device_system_crosststamp() already
> *does* require the device to generate a system_counterval_t, so your
> nightmare world where driver authors might pull it out of their
> posterior *already* exists, doesn't it?

It exists. Because get_device_system_crosststamp() does _NOT_ propagate
the counter values after converting them to actual TSC values. The half
baked snippet I provided you earlier does exactly that (but wrong). The
version in the git branch should be halfway functional.

> And we have things like stmmac which already populate it using
> CSID_X86_ART.

It does not populate back into PTP land. That's a
get_device_system_crosststamp() internal handshake where the driver
callback provides the PTM time stamp and tells the core which clock
source it is based on. The core converts it to the system clocksource
cycles, e.g. ART to TSC, and then calculates MONO_RAW and REALTIME from
it, optionally with an extra snapshot that allows historical
interpolation for devices where the timestamp retrieval takes ages.

> So at least for PTP_SYS_OFFSET_PRECISE, isn't Arthur's patch literally
> only exporting the same counter values that the driver *already*
> creates? I'm not quite sure why we have all these histrionics about
> drivers not being able to create those reliably?

The driver reads it from the hardware but it does not know how to
convert them back to TSC or anything else. For the driver it's an opaque
piece of data which it read out of a register or retrieved through a
firmware query.

> Yes, there's plenty to improve as discussed, and we should probably
> have get_device_system_crosststamp() copy the values from the
> system_counterval on its local stack into the system_device_crosststamp
> rather than asking the driver to pass it back through separate fields
> in the attributes.

See the original snippet and the git tree for how that is done by extending
the cross time stamp structure and storing all the information there,
which is what PTP hands in:

      system_cross_timestamp sct;

      ptp->info->getcrosstimestamp(..., &sct)
         driver_getcrosstimestamp(...., *sct) {
           get_device_system_crosststamp(callback, context, ..., sct) {
              system_counterval_t scv;
              ktime_t device_time;

              do {
              	...
                callback(&device_time, &scv, context) {
                   read_snapshot(&pch_time, &ptm_time);

                   *device_time = munge(pch_time);
                   scv->cycles = ptm_time;
                   scv->cs_id = ART;
                }
                ....
                cs_cycles = convert_ptm_to_cs(scv.cycles, scv.cs_id);

                real = timekeeping_convert_to_real(cs_cycles);
                raw = timekeeping_convert_to_raw(cs_cycles);

              } while (seq_retry());

              sct->device = device_time;
              sct->real = real;
              sct->raw = raw;
           }

So the new parts are that system_cross_timestamp gains a
system_counter_val and get_device_system_crosststamp() fills that in:

              sct->counter.cycles = cs_cycles;
              sct->counter.cs_id = csid;

On X86 you get the TSC cycles (derived from ART) and CSID_X86_TSC.

That goes all the way back to the PTP layer. Which means magically _all_
existing users of get_device_system_crosststamp() will provide that data
out of the box.

The existing PRECISE usecase will just ignore sct.counter. Your new
stuff can use it and fill in the related attributes in the user space
attr struct.
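
As a self-contained illustration of that flow (not the code in the git
tree), the sketch below models a cross timestamp that carries the
converted counter value back to its caller. Every type and function name
here is a stand-in invented for this example, and the base-to-clocksource
conversion is simplified to cycles × numerator / denominator + offset,
which is an assumption about the shape of an ART-to-TSC style conversion
rather than a copy of it.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the clocksource IDs; names are hypothetical. */
enum cs_id_sketch { CSID_SKETCH_ART, CSID_SKETCH_TSC };

struct counterval_sketch {
	uint64_t          cycles;
	enum cs_id_sketch cs_id;
};

/* Assumed linear relationship between the base clock and the clocksource. */
struct base_ratio_sketch {
	uint64_t numerator;
	uint64_t denominator;
	uint64_t offset;
};

/* Cross timestamp extended with the counter value, as described above. */
struct crosststamp_sketch {
	int64_t                  device_ns;
	struct counterval_sketch counter;
};

/* Convert in place, updating cycles and cs_id together so they stay consistent. */
static void convert_base_to_cs_sketch(struct counterval_sketch *scv,
				      const struct base_ratio_sketch *r)
{
	assert(r->denominator != 0);

	if (scv->cs_id == CSID_SKETCH_ART) {
		unsigned __int128 scaled =
			(unsigned __int128)scv->cycles * r->numerator;

		scv->cycles = (uint64_t)(scaled / r->denominator) + r->offset;
		scv->cs_id = CSID_SKETCH_TSC;
	}
}

/* The core, not the driver, fills in the counter that reaches the PTP layer. */
static void fill_crosststamp_counter(struct crosststamp_sketch *sct,
				     struct counterval_sketch *scv,
				     const struct base_ratio_sketch *r)
{
	convert_base_to_cs_sketch(scv, r);
	sct->counter = *scv;
}
```

The point of the sketch is only that the driver callback hands over an
opaque base-clock reading and the core hands back clocksource cycles
with a matching ID.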

This raises an interesting question. Must any of the existing PTM-using
drivers implement that new extended getcrosstimestampattr() callback in
order to expose the cycles/csid in attr, or can you fall back to the
existing callback and have the rest of the fields 0?

Same question arises if you change the pre/post timestamp helpers to
utilize ktime_get_snapshot_id(). All existing drivers which use them
will then automatically retrieve cs_cycles/cs_id.

The other change I did to get_device_system_crosststamp() is to let the
PTP core hand in the clock ID, so it can retrieve either REALTIME or AUX
clocks, which enables the whole AUX world to utilize PTM too once the
PTP IOCTL is updated accordingly.

Can you please make the new PTP_SYS_OFFSET_PRECISE_ATTRS and
PTP_SYS_OFFSET_EXTENDED_ATTRS so that user space can convey the CLOCK
ID, like it does today with PTP_SYS_OFFSET_EXTENDED?

Thanks,

        tglx

* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22 15:28                 ` Thomas Gleixner
@ 2026-05-22 16:23                   ` David Woodhouse
  2026-05-24 12:36                     ` Thomas Gleixner
  2026-05-22 16:50                   ` David Woodhouse
  1 sibling, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-22 16:23 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 11:01, David Woodhouse wrote:
> > On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> > Obviously. But to take a PoC and then do that thought and turn it into
> > something we can use, it still needs a Co-developed-by: and thus a
> > Signed-off-by: if you would be so kind.
> 
>   git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
> 
> is the work-in-progress state as of now. I'm not going to touch it in
> the next few days and it's still in a rough, uncompiled state. It lacks
> quite a few change logs and the last patch needs to be split up.
> 
> I'll go and have a look next week so that I can rethink the approach
> with a clear mind.

Thanks. I'll have a play with it. With ptp_read_system_p{re,ost}ts()
also populating pre/post system_counterval_t fields in the struct
ptp_system_timestamp, I can do a bit more cleanup of vmclock than you
have there by using them; I'll work that in.

> > Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
> > drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.
> > 
> > And they *use* it... and wait, get_device_system_crosststamp() already
> > *does* require the device to generate a system_counterval_t, so your
> > nightmare world where driver authors might pull it out of their
> > posterior *already* exists, doesn't it?
> 
> It exists. Because get_device_system_crosststamp() does _NOT_ propagate
> the counter values after converting them to actual TSC values. The half
> baked snippet I provided you earlier does exactly that (but wrong). The
> version in the git branch should be halfway functional.

Ah, I see it. convert_base_to_cs().

> The existing PRECISE usecase will just ignore sct.counter. Your new
> stuff can use it and fill in the related attributes in the user space
> attr struct.

Perfect.

> This raises an interesting question. Must any of the existing PTM-using
> drivers implement that new extended getcrosstimestampattr() callback in
> order to expose the cycles/csid in attr, or can you fall back to the
> existing callback and have the rest of the fields 0?
>
> Same question arises if you change the pre/post timestamp helpers to
> utilize ktime_get_snapshot_id(). All existing drivers which use them
> will then automatically retrieve cs_cycles/cs_id.

Taking those in reverse order... yes, this means that with a new
variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
values even for the system parts of those ABA timestamps, even for non-
PTM clocks, and discipline the TSC/archcounter against the external
clock.

Currently I have userspace which literally does rdtsc() either side of
calling the ioctl :)
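
For what it's worth, that userspace sandwich amounts to something like
the sketch below: read the counter, read the device clock, read the
counter again, then take the midpoint and treat half the window as the
uncertainty. The callbacks and the fake clock are placeholders made up
for this example so that it runs without rdtsc or a PTP device; real
code would put the counter read and the ioctl in their place.

```c
#include <assert.h>
#include <stdint.h>

struct aba_sample {
	uint64_t counter_mid;  /* estimated counter value at the device read */
	uint64_t counter_unc;  /* half-width of the pre/post window, in ticks */
	int64_t  device_ns;    /* device time returned between the two reads */
};

typedef uint64_t (*read_counter_fn)(void *ctx);
typedef int64_t (*read_device_fn)(void *ctx);

/* One A-B-A measurement: counter, device, counter. */
static struct aba_sample aba_sample_once(read_counter_fn read_counter,
					 read_device_fn read_device,
					 void *ctx)
{
	uint64_t pre = read_counter(ctx);
	int64_t dev = read_device(ctx);
	uint64_t post = read_counter(ctx);
	uint64_t width = post - pre;

	/* A huge width would mean the counter appeared to go backwards. */
	assert(width < (1ULL << 63));

	struct aba_sample s = {
		.counter_mid = pre + width / 2,
		.counter_unc = (width + 1) / 2,
		.device_ns   = dev,
	};
	return s;
}

/* Fake clock for trying the helper: each counter read advances by 'step'. */
struct fake_clock {
	uint64_t counter;
	uint64_t step;
	int64_t  device_ns;
};

static uint64_t fake_read_counter(void *ctx)
{
	struct fake_clock *fc = ctx;
	uint64_t v = fc->counter;

	fc->counter += fc->step;
	return v;
}

static int64_t fake_read_device(void *ctx)
{
	struct fake_clock *fc = ctx;

	return fc->device_ns;
}
```

Having the kernel return the counter values for the system parts of the
sandwich removes the extra latency of the syscall entry and exit from
that window.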

And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
through get_device_system_crosststamp() as described, and all just
works? It's just that we now allow userspace to *see* the counter value
that the driver was already generating.

So to your questions: although there's new userspace ioctl support, the
*drivers* don't need any modification for that (as long as they use the
standard prets/postts helpers).

The remaining question is the device timestamp part (the 'B' in the ABA
sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
that get a counterval?

I don't have a strong opinion. On one hand we'd have to find a way to
convert it from PTM for devices where it actually *is* PTM, and that's
what PTP_SYS_OFFSET_PRECISE is *for*.

But on the other hand, can't the conversion be a whole lot simpler than
get_device_system_crosststamp() because it's not actually dealing with
any timekeepers; it's basically only invoking convert_base_to_cs()?

And the ioctl should support it *all* but just have a clear way of
indicating that any of the optional fields including the attrs are
*not* populated (or use 0/max values maybe?).

So no, I don't think any driver *has* to add any attr support in order
to expose counter values to userspace. The only reason I asked Arthur
to mix those things up was for the *userspace* API, to avoid adding yet
another ioctl over and over again. And now I feel bad for doing so :)

> The other change I did to get_device_system_crosststamp() is to let the
> PTP core hand in the clock ID, so it can retrieve either REALTIME or AUX
> clocks, which enables the whole AUX world to utilize PTM too once the
> PTP IOCTL is updated accordingly.
> 
> Can you please make the new PTP_SYS_OFFSET_PRECISE_ATTRS and
> PTP_SYS_OFFSET_EXTENDED_ATTRS so that user space can convey the CLOCK
> ID, like it does today with PTP_SYS_OFFSET_EXTENDED?

Ack (on Arthur's behalf).


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22 15:28                 ` Thomas Gleixner
  2026-05-22 16:23                   ` David Woodhouse
@ 2026-05-22 16:50                   ` David Woodhouse
  2026-05-24 15:15                     ` Thomas Gleixner
  1 sibling, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-22 16:50 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> 
>   git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping

In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
system_device_crosststamp") my version ditched the system_counterval_t
on the stack and just used the one in xtstamp directly.

The convert_base_to_cs() function probably wants to set
scv->cs_id = cs->id itself anyway; otherwise it leaves behind an
inconsistent system_counterval_t object, which will lead to exactly the
bug that my first version had, and that I see you avoided :)




* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22 16:23                   ` David Woodhouse
@ 2026-05-24 12:36                     ` Thomas Gleixner
  2026-05-24 13:13                       ` David Woodhouse
  2026-05-25  8:06                       ` Arthur Kiyanovski
  0 siblings, 2 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 12:36 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>> This raises an interesting question. Must any of the existing PTM using
>> drivers implement that new extended getcrosstimestampattr() callback, in
>> order to expose the cycles/csid in attr or can you fall back to the
>> existing callback and have the rest of the fields 0?
>>
>> Same question arises if you change the pre/post timestamp helpers to
>> utilize ktime_get_snapshot_id(). All existing drivers which use them
>> will then automatically retrieve cs_cycles/cs_id.
>
> Taking those in reverse order... yes, this means that with a new
> variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
> values even for the system parts of those ABA timestamps, even for non-
> PTM clocks, and discipline the TSC/archcounter against the external
> clock.

Correct.

> Currently I have userspace which literally does rdtsc() either side of
> calling the ioctl :)

Why am I not surprised? :)

> And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
> through get_device_system_crosststamp() as described, and all just
> works? It's just that we now allow userspace to *see* the counter value
> that the driver was already generating.

A new variant of PRECISE

> So to your questions: although there's new userspace ioctl support, the
> *drivers* don't need any modification for that (as long as they use the
> standard prets/postts helpers).

Yes.

> The remaining question is the device timestamp part (the 'B' in the ABA
> sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
> that get a counterval?

PTM-capable drivers support cross timestamps, which will expose the
system counterval with a new version of PTP_SYS_OFFSET_PRECISE. No ABA
for that as it's hardware latched AB.
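
For readers less familiar with the terminology: the ABA sandwich
brackets one device reading (B) between two system clock readings (A).
A minimal sketch of how user space might reduce one such sample to an
offset and a bracketing window follows. struct aba_sample, struct
aba_result and aba_reduce() are hypothetical names made up for
illustration; they are not part of any kernel or PTP API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one ABA sample, all values in nanoseconds. */
struct aba_sample {
	int64_t pre_ns;   /* system time read before the device (first A) */
	int64_t dev_ns;   /* device time (B) */
	int64_t post_ns;  /* system time read after the device (second A) */
};

struct aba_result {
	int64_t offset_ns; /* device time minus estimated system time */
	int64_t window_ns; /* width of the bracketing window */
};

static struct aba_result aba_reduce(const struct aba_sample *s)
{
	struct aba_result r;

	/* The second system reading must not precede the first one */
	assert(s->post_ns >= s->pre_ns);

	r.window_ns = s->post_ns - s->pre_ns;
	/* Assume the device was read in the middle of the window */
	r.offset_ns = s->dev_ns - (s->pre_ns + r.window_ns / 2);
	return r;
}
```

With a hardware latched AB pair the window collapses to zero, which is
why the cross timestamp case needs no post sample.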

> I don't have a strong opinion. On one hand we'd have to find a way to
> convert it from PTM for devices where it actually *is* PTM, and that's
> what PTP_SYS_OFFSET_PRECISE is *for*.

Correct.

> But on the other hand, can't the conversion be a whole lot simpler than
> get_device_system_crosststamp() because it's not actually dealing with
> any timekeepers; it's basically only invoking convert_base_to_cs()?

But what for? If you have PTM, use PRECISE. There is _zero_ value in
having pre/post timestamps when the hardware already does the correlated
precise sampling, no?

> And the ioctl should support it *all* but just have a clear way of
> indicating that any of the optional fields including the attrs are
> *not* populated (or use 0/max values maybe?).

Yes.

> So no, I don't think any driver *has* to add any attr support in order
> to expose counter values to userspace. The only reason I asked Arthur
> to mix those things up was for the *userspace* API, to avoid adding yet
> another ioctl over and over again. And now I feel bad for doing so :)

I think you can create _one_ data structure variant, which fits both
EXTENDED_ATTRS and PRECISE_ATTRS:

struct attrs {
	u32	valid;
	u32	error_bound;
	....
	u32	reserved[N];
};

@valid tells user space which of the attributes have been filled in by
the driver. That avoids bounds based validity checks, which are a pain
as you might end up with different bounds for every attribute. Having a
valid flags field avoids that completely.

struct devtime {
	ptp_clock_time	device_time;
	struct attrs	attrs;
};

struct systime {
	u64	sys_systime;
	u64	sys_rawtime;
	u64	sys_counter;
	u32	sys_counter_id;
	u32	reserved;
};

Exposing @sys_counter_id requires exposing CSID_* in the user space ABI
reliably, as otherwise a kernel internal CSID enum change would blow up
the user space guess work. Your ptp_counter_id approach is error prone.

struct timestamp {
	union {
		struct systime		systime;
		struct systime		pre_systime;
	};
	struct devtime			devtime;
	struct systime			post_systime;
};

struct request {
	u32		valid;
	clockid_t	clock_id;
	unsigned int	num_samples;
	u32		reserved[N];
};

I'd rather have @valid here too. The 'zero the reserved members' approach
is a pain as new kernels have to map 0 to default behavior instead of
being free to make 0 mean what they intend. @valid allows you to use
other sizes than u32 for future fields. All you have to take care of is
to keep the existing fields at the same place as before.

struct ioctl_data {
	struct request		request;
	struct timestamp	timestamps[];
};

So for both PTP_SYS_OFFSET_EXTENDED_ATTRS and
PTP_SYS_OFFSET_PRECISE_ATTRS user space allocates enough space to
accommodate data::request::num_samples.

For PTP_SYS_OFFSET_PRECISE_ATTRS num_samples has to be 1 and
data::timestamps[0].post_systime is zeroed by the kernel because it has
no meaning.
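
From the user space side, the allocation implied above might look like
the following sketch. All names mirror the placeholder structs in this
mail and are assumptions, not an existing UAPI: N is arbitrarily set to
4, the elided attributes are left out, struct clock_time is a stand-in
for ptp_clock_time, clockid_t is replaced by int32_t, and
ioctl_data_alloc() is a hypothetical helper.

```c
#include <stdint.h>
#include <stdlib.h>

#define RESERVED_WORDS 4 /* stands in for the unspecified N */

/* Stand-in for ptp_clock_time */
struct clock_time { int64_t sec; uint32_t nsec; uint32_t reserved; };

struct attrs {
	uint32_t valid;
	uint32_t error_bound;
	uint32_t reserved[RESERVED_WORDS];
};

struct devtime { struct clock_time device_time; struct attrs attrs; };

struct systime {
	uint64_t sys_systime;
	uint64_t sys_rawtime;
	uint64_t sys_counter;
	uint32_t sys_counter_id;
	uint32_t reserved;
};

struct timestamp {
	union { /* PRECISE uses systime, EXTENDED uses pre_systime */
		struct systime systime;
		struct systime pre_systime;
	};
	struct devtime devtime;
	struct systime post_systime;
};

struct request {
	uint32_t valid;
	int32_t  clock_id;
	uint32_t num_samples;
	uint32_t reserved[RESERVED_WORDS];
};

struct ioctl_data {
	struct request   request;
	struct timestamp timestamps[]; /* num_samples entries follow */
};

/* Allocate a zeroed request with room for num_samples timestamps */
static struct ioctl_data *ioctl_data_alloc(uint32_t num_samples)
{
	size_t size = sizeof(struct ioctl_data) +
		      (size_t)num_samples * sizeof(struct timestamp);
	struct ioctl_data *data = calloc(1, size);

	if (data)
		data->request.num_samples = num_samples;
	return data;
}
```

For the PRECISE variant the same helper would be called with
num_samples set to 1.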

So now in the kernel you do:

static int ptp_sys_offset_extended_attrs(struct ptp_clock *ptp, void __user *argptr)
{
	struct ioctl_data __user *data = argptr;
	struct request request;
	int ret;

	if (copy_from_user(&request, &data->request, sizeof(request)))
		return -EFAULT;

	if (!extattr_request_valid(&request))
		return -EINVAL;

	for (unsigned int i = 0; i < request.num_samples; i++) {
		struct ptp_system_timestamp sts = { .clock_id = request.clock_id, };
		struct timestamp uts = { };
		struct timespec64 dev_ts;

		/* Prefer the _attr variant, fall back to the existing callback */
		if (ptp->info->gettimex64_attr)
			ret = ptp->info->gettimex64_attr(ptp->info, &dev_ts, &sts, &uts.devtime.attrs);
		else if (ptp->info->gettimex64)
			ret = ptp->info->gettimex64(ptp->info, &dev_ts, &sts);
		else
			return -EOPNOTSUPP;

		if (ret)
			return ret;

		uts.pre_systime = mangle(sts.pre_systime);
		uts.devtime.device_time = mangle(dev_ts);
		uts.post_systime = mangle(sts.post_systime);
		/* copy_to_user() returns the number of bytes it could not copy */
		if (copy_to_user(&data->timestamps[i], &uts, sizeof(uts)))
			return -EFAULT;
	}
	return 0;
}

static int ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *argptr)
{
	struct ioctl_data __user *data = argptr;
	struct system_device_crosststamp xtstamp = { };
	struct timestamp uts = { };
	struct request request;
	int ret;

	if (copy_from_user(&request, &data->request, sizeof(request)))
		return -EFAULT;

	if (!preciseattr_request_valid(&request))
		return -EINVAL;

	/* Select the clock which user space asked for */
	xtstamp.clock_id = request.clock_id;

	/* Prefer the _attr variant, fall back to the existing callback */
	if (ptp->info->getcrosststamp_attr)
		ret = ptp->info->getcrosststamp_attr(ptp->info, &xtstamp, &uts.devtime.attrs);
	else if (ptp->info->getcrosststamp)
		ret = ptp->info->getcrosststamp(ptp->info, &xtstamp);
	else
		return -EOPNOTSUPP;

	if (ret)
		return ret;

	/* uts.post_systime stays zeroed as it has no meaning here */
	uts.systime = mangle(xtstamp.systime);
	uts.devtime.device_time = mangle(xtstamp.device);
	/* copy_to_user() returns the number of bytes it could not copy */
	if (copy_to_user(&data->timestamps[0], &uts, sizeof(uts)))
		return -EFAULT;
	return 0;
}

Or something like this, which immediately enables the functionality for
all drivers which implement the getcrosststamp() or the gettimex64()
callbacks with a unified user space data structure.

The attrs.valid bits are all zero at first, and once drivers implement
the _attr callback variants, those attributes supported by the driver
will magically appear with the corresponding valid bits set.
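
A small sketch of what that means for a consumer, assuming a
hypothetical bit assignment ATTR_VALID_ERROR_BOUND and two stand-in
"drivers" (neither exists in the kernel). The point it illustrates is
that an error bound of 0 is a legitimate value, which only the valid
bit can tell apart from "not populated".

```c
#include <stdbool.h>
#include <stdint.h>

#define ATTR_VALID_ERROR_BOUND (1u << 0) /* hypothetical bit assignment */

struct attrs {
	uint32_t valid;
	uint32_t error_bound;
};

/* Stand-in for a driver with only the existing callback: fills nothing */
static void legacy_driver_fill(struct attrs *a)
{
	(void)a;
}

/* Stand-in for a driver with the _attr variant: reports a bound of 0 */
static void attr_driver_fill(struct attrs *a)
{
	a->error_bound = 0;
	a->valid |= ATTR_VALID_ERROR_BOUND;
}

/* Returns true and stores the bound only if the driver populated it */
static bool attrs_get_error_bound(const struct attrs *a, uint32_t *out)
{
	if (!(a->valid & ATTR_VALID_ERROR_BOUND))
		return false;
	*out = a->error_bound;
	return true;
}
```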

Hmm?

Thanks,

        tglx


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 12:36                     ` Thomas Gleixner
@ 2026-05-24 13:13                       ` David Woodhouse
  2026-05-24 15:05                         ` Thomas Gleixner
  2026-05-25  8:06                       ` Arthur Kiyanovski
  1 sibling, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-24 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Sun, 2026-05-24 at 14:36 +0200, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> > On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> > > This raises an interesting question. Must any of the existing PTM using
> > > drivers implement that new extended getcrosstimestampattr() callback, in
> > > order to expose the cycles/csid in attr or can you fall back to the
> > > existing callback and have the rest of the fields 0?
> > > 
> > > Same question arises if you change the pre/post timestamp helpers to
> > > utilize ktime_get_snapshot_id(). All existing drivers which use them
> > > will then automatically retrieve cs_cycles/cs_id.
> > 
> > Taking those in reverse order... yes, this means that with a new
> > variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
> > values even for the system parts of those ABA timestamps, even for non-
> > PTM clocks, and discipline the TSC/archcounter against the external
> > clock.
> 
> Correct.
> 
> > Currently I have userspace which literally does rdtsc() either side of
> > calling the ioctl :)
> 
> Why am I not surprised? :)

To be fair, I *told* them to do it like that in the short term, knowing
it would annoy me enough to chase up the cycles-in-PTP thing.

And hey, it worked :)

> > And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
> > through get_device_system_crosststamp() as described, and all just
> > works? It's just that we now allow userspace to *see* the counter value
> > that the driver was already generating.
> 
> A new variant of PRECISE

Right.

> > So to your questions: although there's new userspace ioctl support, the
> > *drivers* don't need any modification for that (as long as they use the
> > standard prets/postts helpers).
> 
> Yes.
> 
> > The remaining question is the device timestamp part (the 'B' in the ABA
> > sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
> > that get a counterval?
> 
> PTM-capable driver support cross timestamps, which will with a new
> version of PTP_SYS_OFFSET_PRECISE expose the system counterval. No ABA
> for that as it's hardware latched AB.
> 
> > I don't have a strong opinion. On one hand we'd have to find a way to
> > convert it from PTM for devices where it actually *is* PTM, and that's
> > what PTP_SYS_OFFSET_PRECISE is *for*.
> 
> Correct.
> 
> > But on the other hand, can't the conversion be a whole lot simpler than
> > get_device_system_crosststamp() because it's not actually dealing with
> > any timekeepers; it's basically only invoking convert_base_to_cs()?
> 
> But what for? If you have PTM, use PRECISE. There is _zero_ value of
> having pre/post timestamps when the hardware already does the correlated
> precise sampling, no?

The PTM mode and support of PRECISE (or the variant) is currently
fairly esoteric: very few devices support it. So I'm not sure we should
expect generic userspace to always even try.

So there may be some merit in having EXTENDED use the precise hardware
paired timestamp. Maybe we don't necessarily care about returning
*cycles* but if we *do* use a PTM-capable device (and I'm including the
virt TSC-based ones here too), then we kind of want the ABA *all* to be
at the same clock cycle. Which is what I've already done for vmclock.

> > And the ioctl should support it *all* but just have a clear way of
> > indicating that any of the optional fields including the attrs are
> > *not* populated (or use 0/max values maybe?).
> 
> Yes.
> 
> > So no, I don't think any driver *has* to add any attr support in order
> > to expose counter values to userspace. The only reason I asked Arthur
> > to mix those things up was for the *userspace* API, to avoid adding yet
> > another ioctl over and over again. And now I feel bad for doing so :)
> 
> I think you can create _one_ data structure variant, which fits both
> EXTENDED_ATTR and PRECISE_ATTR:

 <...>

Yeah, that looks eminently sensible. I've been feeding Arthur
suggestions along those lines but only nudges; you've fleshed it out in
*far* more detail; thanks!



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 13:13                       ` David Woodhouse
@ 2026-05-24 15:05                         ` Thomas Gleixner
  0 siblings, 0 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 15:05 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Sun, May 24 2026 at 14:13, David Woodhouse wrote:
> On Sun, 2026-05-24 at 14:36 +0200, Thomas Gleixner wrote:
>> > But on the other hand, can't the conversion be a whole lot simpler than
>> > get_device_system_crosststamp() because it's not actually dealing with
>> > any timekeepers; it's basically only invoking convert_base_to_cs()?
>> 
>> But what for? If you have PTM, use PRECISE. There is _zero_ value of
>> having pre/post timestamps when the hardware already does the correlated
>> precise sampling, no?
>
> The PTM mode and support of PRECISE (or the variant) is currently
> fairly esoteric: very few devices support it. So I'm not sure we should
> expect generic userspace to always even try.

There are not so many PTM capable devices to begin with. And yes, user
space which cares about time and accuracy _should_ try it.

> So there may be some merit in having EXTENDED use the precise hardware
> paired timestamp. Maybe we don't necessarily care about returning
> *cycles* but if we *do* use a PTM-capable device (and I'm including the
> virt TSC-based ones here too), then we kind of want the ABA *all* to be
> at the same clock cycle. Which is what I've already done for vmclock.

If you can do ABA at the same clock cycle, then just implement the cross
timestamp callback and use that.

For PTM capable devices which lack cross timestamp support in the
driver, adding the magic PTM value field in the ABA timestamp won't make
it magically be filled in. So someone has to touch the driver anyway and
then adding the actual cross time support is not much more effort than
adding support for the new field in the extended callback.

Also user space which wants to use the cycles stuff needs to implement
the new IOCTLs anyway. The cycles won't show up magically in the
existing IOCTLs either. So if you make the data struct identical, then
it's really not rocket science to try precise first and then fall back
to extended.
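
A sketch of that fallback in user space, under stated assumptions: the
two IOCTLs are hidden behind hypothetical wrappers of type sample_fn
which return 0 or a negative errno, and a device without cross
timestamp support is assumed to fail with -EOPNOTSUPP. The stubs only
exist to make the sketch self-contained.

```c
#include <errno.h>
#include <stddef.h>

enum sample_kind { SAMPLE_NONE, SAMPLE_PRECISE, SAMPLE_EXTENDED };

/* Hypothetical wrapper type around one of the two new IOCTLs */
typedef int (*sample_fn)(void *data);

/*
 * Try the precise (cross timestamp) variant first and fall back to the
 * extended variant only when precise is not supported. Any other error
 * is reported to the caller unchanged via @err.
 */
static enum sample_kind sample_clock(sample_fn precise, sample_fn extended,
				     void *data, int *err)
{
	*err = precise(data);
	if (*err == 0)
		return SAMPLE_PRECISE;
	if (*err != -EOPNOTSUPP)
		return SAMPLE_NONE;

	*err = extended(data);
	return *err == 0 ? SAMPLE_EXTENDED : SAMPLE_NONE;
}

/* Stubs standing in for real IOCTL wrappers */
static int stub_unsupported(void *data) { (void)data; return -EOPNOTSUPP; }
static int stub_ok(void *data) { (void)data; return 0; }
```

Because the data struct is shared, the caller only needs the returned
kind to know whether post_systime carries meaning.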

Actually, with the identical data struct you could make that _ONE_ new
IOCTL, where the kernel uses cross timestamps if the device supports
them or extended if not. All it has to do is to report the choice and
therefore the number and nature of the samples back to user space. Not
rocket science either.

But in both variants (separate or unified IOCTL) user space has to
handle the data sets correctly.

No strong opinion on that as I have no clue about user space :)

Thanks,

        tglx


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-22 16:50                   ` David Woodhouse
@ 2026-05-24 15:15                     ` Thomas Gleixner
  2026-05-24 15:37                       ` Thomas Gleixner
  0 siblings, 1 reply; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 15:15 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Fri, May 22 2026 at 17:50, David Woodhouse wrote:
> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>> 
>>   git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
>
> In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
> system_device_crosststamp") my version ditched the system_counterval_t
> on the stack and just used the one in xtstamp directly.

Which is wrong. I did it the way I did for a very good reason.

> The convert_base_to_cs() function probably wants to scv->id=cs->id for
> itself anyway; otherwise it's leaving behind an inconsistent
> system_counterval_t object which... will lead to exactly the bug my
> first version of that had, that I see you avoided :)

No. It can't because that would corrupt the object for the retry case,
which would then hand back the wrong value.

The object _IS_ consistent because the csid in there is related to the
PTM value and not to the clocksource. The function updates the @cycles
value and leaves everything else untouched. The clock ID for the @cycles
value is guaranteed to be the clock ID of the system clocksource, so
using this is the right thing to do.

Just because it looks tempting or your AI buddy told you so doesn't make
it correct.

Thanks,

        tglx


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 15:15                     ` Thomas Gleixner
@ 2026-05-24 15:37                       ` Thomas Gleixner
  2026-05-24 15:48                         ` Thomas Gleixner
  2026-05-24 16:36                         ` Thomas Gleixner
  0 siblings, 2 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 15:37 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 17:50, David Woodhouse wrote:
>> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>>> 
>>>   git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
>>
>> In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
>> system_device_crosststamp") my version ditched the system_counterval_t
>> on the stack and just used the one in xtstamp directly.
>
> Which is wrong. I did it the way I did for a very good reason.
>
>> The convert_base_to_cs() function probably wants to scv->id=cs->id for
>> itself anyway; otherwise it's leaving behind an inconsistent
>> system_counterval_t object which... will lead to exactly the bug my
>> first version of that had, that I see you avoided :)
>
> No. It can't because that would corrupt the object for the retry case,
> which would then hand back the wrong value.
>
> The object _IS_ consistent because the csid in there is related to the
> PTM value and not to the clocksource. The function updates the @cycles
> value and leaves everything else untouched. The clock ID for the @cycles
> value is guaranteed to be the clock ID of the system clocksource, so
> using this is the right thing to do.
>
> Just because it looks tempting or your AI buddy told you so doesn't make
> it correct.

And it's worse. We both are wrong :)

There is an existing bug in that code for the retry case. Fix below.

Thanks,

        tglx
---
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1343,12 +1343,14 @@ static bool convert_clock(u64 *val, u32
 	return true;
 }
 
-static bool convert_base_to_cs(struct system_counterval_t *scv)
+static bool convert_base_to_cs(struct system_counterval_t *scv, u64 *cycles)
 {
 	struct clocksource *cs = tk_core.timekeeper.tkr_mono.clock;
 	struct clocksource_base *base;
 	u32 num, den;
 
+	*cycles = scv->cycles;
+
 	/* The timestamp was taken from the time keeper clock source */
 	if (cs->id == scv->cs_id)
 		return true;
@@ -1364,10 +1366,10 @@ static bool convert_base_to_cs(struct sy
 	num = scv->use_nsecs ? cs->freq_khz : base->numerator;
 	den = scv->use_nsecs ? USEC_PER_SEC : base->denominator;
 
-	if (!convert_clock(&scv->cycles, num, den))
+	if (!convert_clock(cycles, num, den))
 		return false;
 
-	scv->cycles += base->offset;
+	*cycles += base->offset;
 	return true;
 }
 
@@ -1479,9 +1481,8 @@ int get_device_system_crosststamp(int (*
 		 * installed timekeeper clocksource
 		 */
 		if (system_counterval.cs_id == CSID_GENERIC ||
-		    !convert_base_to_cs(&system_counterval))
+		    !convert_base_to_cs(&system_counterval, &cycles))
 			return -ENODEV;
-		cycles = system_counterval.cycles;
 
 		/*
 		 * Check whether the system counter value provided by the


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 15:37                       ` Thomas Gleixner
@ 2026-05-24 15:48                         ` Thomas Gleixner
  2026-05-24 16:36                         ` Thomas Gleixner
  1 sibling, 0 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 15:48 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
> And it's worse. We both are wrong :)
>
> There is an existing bug in that code for the retry case. Fix below.

I've updated the git branch accordingly.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 15:37                       ` Thomas Gleixner
  2026-05-24 15:48                         ` Thomas Gleixner
@ 2026-05-24 16:36                         ` Thomas Gleixner
  2026-05-24 16:42                           ` David Woodhouse
  1 sibling, 1 reply; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-24 16:36 UTC (permalink / raw)
  To: David Woodhouse, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
>
> There is an existing bug in that code for the retry case. Fix below.

There is none. It's just too hot to think straight. The counterval is
updated once per retry ....
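
A toy model of why the in-place conversion survives the retry loop,
assuming, as stated above, that the callback refills the counter value
on every pass. Nothing here is kernel code: struct counterval,
get_time_fn(), convert_in_place() and the scaling constants are made up
for illustration, and the retries are forced rather than driven by a
seqcount.

```c
#include <stdint.h>

struct counterval { uint64_t cycles; };

/* Stand-in for the driver callback: delivers a fresh base clock reading */
static void get_time_fn(struct counterval *scv)
{
	scv->cycles = 1000;
}

/* In-place conversion, loosely modelled as scaling plus offset */
static void convert_in_place(struct counterval *scv)
{
	scv->cycles = scv->cycles * 3 / 2 + 10;
}

/* Runs the loop body 1 + forced_retries times */
static uint64_t crosststamp(unsigned int forced_retries)
{
	struct counterval scv;
	unsigned int pass = 0;

	do {
		get_time_fn(&scv);      /* refreshed on every pass */
		convert_in_place(&scv); /* so this never converts twice */
	} while (pass++ < forced_retries);

	return scv.cycles;
}
```

The result is the same whether the loop runs once or several times,
because each pass starts from a fresh reading.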





* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 16:36                         ` Thomas Gleixner
@ 2026-05-24 16:42                           ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-24 16:42 UTC (permalink / raw)
  To: Thomas Gleixner, Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	Peter Zijlstra, Thomas Weißschuh, Arnd Bergmann,
	Julien Ridoux, Ryan Luu, linux-kernel, Marcelo Tosatti

On 24 May 2026 17:36:04 BST, Thomas Gleixner <tglx@kernel.org> wrote:
>On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
>> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
>>
>> There is an existing bug in that code for the retry case. Fix below.
>
>There is none. It's just too hot to think straight. The counterval is
>updated once per retry ....
>
>
>

Yeah, and setting the csid in it at the same time as changing the actual cycle count seemed to make a lot of sense to me.

I didn't even ask the AI friend about that; it's entirely crap at anything where you have to take the blinkers off. But it *can* type fast and do test iterations, so it has its place as long as you know you can't trust anything it says :)


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-24 12:36                     ` Thomas Gleixner
  2026-05-24 13:13                       ` David Woodhouse
@ 2026-05-25  8:06                       ` Arthur Kiyanovski
  2026-05-25  8:41                         ` David Woodhouse
  2026-05-26 14:12                         ` Thomas Gleixner
  1 sibling, 2 replies; 50+ messages in thread
From: Arthur Kiyanovski @ 2026-05-25  8:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Miroslav Lichvar, Richard Cochran, Wen Gu,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On 2026-05-24 14:36:35+02:00, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> 
> > On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> >
> > Taking those in reverse order... yes, this means that with a new
> > variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
> > values even for the system parts of those ABA timestamps, even for non-
> > PTM clocks, and discipline the TSC/archcounter against the external
> > clock.
> 
> Correct.
> 
> > Currently I have userspace which literally does rdtsc() either side of
> > calling the ioctl :)
> 
> Why am I not surprised? :)
> 
> > And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
> > through get_device_system_crosststamp() as described, and all just
> > works? It's just that we now allow userspace to *see* the counter value
> > that the driver was already generating.
> 
> A new variant of PRECISE
> 
> > So to your questions: although there's new userspace ioctl support, the
> > *drivers* don't need any modification for that (as long as they use the
> > standard prets/postts helpers).
> 
> Yes.
> 
> > The remaining question is the device timestamp part (the 'B' in the ABA
> > sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
> > that get a counterval?
> 
> PTM-capable driver support cross timestamps, which will with a new
> version of PTP_SYS_OFFSET_PRECISE expose the system counterval. No ABA
> for that as it's hardware latched AB.
> 
> > I don't have a strong opinion. On one hand we'd have to find a way to
> > convert it from PTM for devices where it actually *is* PTM, and that's
> > what PTP_SYS_OFFSET_PRECISE is *for*.
> 
> Correct.
> 
> > But on the other hand, can't the conversion be a whole lot simpler than
> > get_device_system_crosststamp() because it's not actually dealing with
> > any timekeepers; it's basically only invoking convert_base_to_cs()?
> 
> But what for? If you have PTM, use PRECISE. There is _zero_ value of
> having pre/post timestamps when the hardware already does the correlated
> precise sampling, no?
> 
> > And the ioctl should support it *all* but just have a clear way of
> > indicating that any of the optional fields including the attrs are
> > *not* populated (or use 0/max values maybe?).
> 
> Yes.
> 
> > So no, I don't think any driver *has* to add any attr support in order
> > to expose counter values to userspace. The only reason I asked Arthur
> > to mix those things up was for the *userspace* API, to avoid adding yet
> > another ioctl over and over again. And now I feel bad for doing so :)
> 
> I think you can create _one_ data structure variant, which fits both
> EXTENDED_ATTR and PRECISE_ATTR:
> 
> struct attrs {
>        u32	valid;
>        u32	error_bound;
>        ....
>        u32	reserved[N];
> };
> 
> @valid tells user space, which of the attributes has been filled in by
> the driver. That avoids bounds based validity checks, which are a pain
> as you might end up with different bounds for every attribute. Having a
> valid flags field avoids that completely.
> 
> struct devtime {
>         ptp_clock_time	device_time;
>         struct attrs	attrs;
> };
> 
> struct systime {
> 	u64	sys_systime;
>         u64     sys_rawtime;
>         u64     sys_counter;
>         u32	sys_counter_id;
>         u32	reserved;
> };
> 
> Exposing @sys_counter_id requires to expose CSID_* in the user space ABI
> reliably, as otherwise a kernel internal CSID enum change would blow up
> the user space guess work. Your ptp_counter_id approach is error prone.
> 
> struct timestamp {
> 	union {
> 		struct systime		systime;
> 		struct systime		pre_systime;
> 	};
> 	struct devtime			devtime;
> 	struct systime			post_systime;
> };
> 
> struct request {
> 	u32		valid;
> 	clockid_t	clock_id;
> 	unsigned int	num_samples;
>         u32		reserved[N];
> };
> 
> I rather have @valid here too. The 'zero the reserved' members approach
> is a pain as new kernels have to map 0 to default behavior instead of
> being free to make 0 mean what they intend. @valid allows you to use
> other sizes than u32 for future fields. All you have to take care of is
> to keep the existing fields at the same place as before.
> 
> struct ioctl_data {
> 	struct request		request;
>         struct timestamp	timestamps[];
> };
> 
> So for both PTP_SYS_OFFSET_EXTENDED_ATTRS and
> PTP_SYS_OFFSET_PRECISE_ATTRS user space allocates enough space to
> accommodate data::request::num_samples.
> 
> For PTP_SYS_OFFSET_PRECISE_ATTRS num_samples has to be 1 and
> data::timestamps[0].post_systime is zeroed by the kernel because it has
> no meaning.
> 
> So now in the kernel you do:
> 
> int ptp_sys_offset_extended_attrs(struct ptp_clock *ptp, void __user *argptr)
> {
>         struct ioctl_data __user *data = argptr;
>         struct request request;
>         int ret;
> 
>         if (copy_from_user(&request, &data->request, sizeof(request)))
>                 return -EFAULT;
> 
>         if (!extattr_request_valid(request))
>                 return -EINVAL;
> 
>         for (unsigned int i = 0; i < request.num_samples; i++) {
>                 struct ptp_system_timestamp sts = { .clock_id = request.clock_id, };
>                 struct timestamp uts = { };
>                 struct timespec64 dev_ts;
> 
>                 if (ptp->info->gettimex64_attr)
>                         ret = ptp->info->gettimex64_attr(ptp->info, &dev_ts, &sts, &uts.devtime.attrs);
>                 else if (ptp->info->gettimex64)
>                         ret = ptp->info->gettimex64(ptp->info, &dev_ts, &sts);
>                 else
>                         return -EOPNOTSUPP;
> 
>                 if (ret)
>                         return ret;
> 
>                 uts.pre_systime = mangle(sts.pre_systime);
>                 uts.devtime.device_time = mangle(dev_ts);
>                 uts.post_systime = mangle(sts.post_systime);
>                 /* copy_to_user() returns the number of bytes NOT copied */
>                 if (copy_to_user(&data->timestamps[i], &uts, sizeof(uts)))
>                         return -EFAULT;
>         }
>         return 0;
> }
> 
> int ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *argptr)
> {
>         struct ioctl_data __user *data = argptr;
>         struct system_device_crosststamp xtstamp = { };
>         struct timestamp uts = { };
>         struct request request;
>         int ret;
> 
>         if (copy_from_user(&request, &data->request, sizeof(request)))
>                 return -EFAULT;
> 
>         if (!preciseattr_request_valid(request))
>                 return -EINVAL;
> 
>         xtstamp.clock_id = request.clock_id;
> 
>         if (ptp->info->getcrosststamp_attr)
>                 ret = ptp->info->getcrosststamp_attr(ptp->info, &xtstamp, &uts.devtime.attrs);
>         else if (ptp->info->getcrosststamp)
>                 ret = ptp->info->getcrosststamp(ptp->info, &xtstamp);
>         else
>                 return -EOPNOTSUPP;
> 
>         if (ret)
>                 return ret;
> 
>         uts.systime = mangle(xtstamp.systime);
>         uts.devtime.device_time = mangle(xtstamp.device);
>         /* copy_to_user() returns the number of bytes NOT copied */
>         if (copy_to_user(&data->timestamps[0], &uts, sizeof(uts)))
>                 return -EFAULT;
>         return 0;
> }
> 
> Or something like this, which immediately enables the functionality for
> all drivers which implement the getcrosststamp() or the gettimex64()
> callbacks with a unified user space data structure.
> 
> The attributes.valid bits are all zero, and once drivers implement
> the _attr callback variants, those attributes supported by the driver
> will magically appear with the corresponding valid bits set.
> 
> Hmm?
> 
> Thanks,
> 
> tglx

Hi Thomas, David,

Thanks for the layout proposal, Thomas. The unified structure with 
explicit valid flags is a much cleaner approach than bounds-based
validation.
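
A minimal userspace sketch of the valid-flags idea, assuming hypothetical
flag names and a hypothetical second attribute (the proposal only specifies
a u32 valid field and error_bound). It shows why a flag avoids bounds-based
checks: an error_bound of 0 stays a legitimate value and is still
distinguishable from "not populated".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the thread does not define any names. */
#define ATTR_VALID_ERROR_BOUND	(1u << 0)
#define ATTR_VALID_CLOCK_STATUS	(1u << 1)

struct attrs {
	uint32_t valid;		/* bitmask of fields the driver filled in */
	uint32_t error_bound;	/* meaningful only if ATTR_VALID_ERROR_BOUND */
	uint32_t clock_status;	/* hypothetical second attribute */
	uint32_t reserved[5];
};

/* Return true and store the bound only when the driver marked it valid. */
static bool attrs_get_error_bound(const struct attrs *a, uint32_t *bound)
{
	if (!(a->valid & ATTR_VALID_ERROR_BOUND))
		return false;
	*bound = a->error_bound;
	return true;
}
```

A caller that gets false simply treats the attribute as absent, with no
sentinel value to agree on.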

I'm the author of the PHC timestamp attributes series [1] that this
applies to. Before I spin v4 based on this design, I want to confirm
three implementation details:

1. Counter IDs: No stable UAPI clocksource numbering exists today 
(enum clocksource_ids is kernel-internal). I'll define stable constants
in include/uapi/linux/ptp_clock.h (e.g., PTP_CSID_X86_TSC,
PTP_CSID_ARM_ARCH) and map internal IDs in the chardev layer.

2. Array sizing: The timestamps array will be fixed at PTP_MAX_SAMPLES (25)
in the ioctl struct, not a flexible array, to keep
copy_from_user/copy_to_user bounded.

3. Ioctl numbers: Two separate ioctls (PTP_SYS_OFFSET_EXTENDED_ATTRS and
PTP_SYS_OFFSET_PRECISE_ATTRS) sharing the same payload struct, matching
existing PTP convention.
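
For item 1, the mapping could look roughly like the sketch below. The
internal enum is a stand-in rather than the real enum clocksource_ids, the
numeric values of the PTP_CSID_* constants are invented for illustration,
and folding the early TSC ID into the TSC constant is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel-internal enum; its values may change freely. */
enum internal_csid {
	INT_CSID_GENERIC = 0,
	INT_CSID_ARM_ARCH_COUNTER,
	INT_CSID_X86_TSC_EARLY,
	INT_CSID_X86_TSC,
};

/* Hypothetical stable UAPI values; once published they never change. */
#define PTP_CSID_UNKNOWN	0u
#define PTP_CSID_X86_TSC	1u
#define PTP_CSID_ARM_ARCH	2u

/* Translate an internal ID to the stable value reported to userspace. */
static uint32_t csid_to_uapi(enum internal_csid id)
{
	switch (id) {
	case INT_CSID_X86_TSC_EARLY:	/* assumed: same hardware counter */
	case INT_CSID_X86_TSC:
		return PTP_CSID_X86_TSC;
	case INT_CSID_ARM_ARCH_COUNTER:
		return PTP_CSID_ARM_ARCH;
	default:
		return PTP_CSID_UNKNOWN;	/* not exposed to userspace */
	}
}
```

The explicit switch means reordering the internal enum cannot change what
userspace sees.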

Arthur

[1] https://lore.kernel.org/netdev/20260515164033.6403-1-akiyano@amazon.com/



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-21  9:54           ` David Woodhouse
@ 2026-05-25  8:08             ` Miroslav Lichvar
  2026-05-25  9:14               ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-25  8:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Thu, May 21, 2026 at 10:54:41AM +0100, David Woodhouse wrote:
> On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> > Ok, but I don't see why the phase corrections of the guest need to be
> > in the kernel.
> 
> I'm not sure I understand. 
> 
> There are no 'phase corrections' as such, except of course that the
> phase of the guest kernel's clock does get corrected, and naturally
> that does have to take effect inside the guest kernel.

I'm referring to these parts of the patches:

	delta_ns = timespec64_to_ns(&vmtime) - timespec64_to_ns(&now);
	if (delta_ns > 100000000 || delta_ns < -100000000)
		do_settimeofday64(&vmtime);

	...

	/* Compute phase offset at cycle_last and set time_offset to slew */
	delta = tk->tkr_mono.cycle_last - ref->counter_value;
	ref_frac = mul_u64_u64_shr(delta, ref->period_frac_sec,
				   ref->period_shift) + ref->time_frac_sec;
	ref_err = (s64)mul_u64_u64_shr(ref_frac,
			(u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
		  (s64)tk->tkr_mono.xtime_nsec;
	ntp_set_time_offset(tk->id, ref_err >> tk->tkr_mono.shift);

> I think the key here is that this is not a feedback loop based on
> corrections to the existing clock output; this is a feedforward design
> as described in https://dl.acm.org/doi/pdf/10.1109/TNET.2011.2158443

There might be a disagreement on terminology. As the guest clock
cannot be updated synchronously with the host, the tracking cannot be
perfect and there has to be some way to correct for the errors due to
the delay. That's what the code shown above seems to be doing. It's a
feedback loop. It doesn't matter if the offset is calculated directly
or measured.
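
For reference, the calculation being discussed can be sketched in userspace
as below. This is a simplified illustration, not the code from the patches:
the struct is hypothetical, the field units (2^-64 s fractions, with the
period scaled by a further 2^-period_shift) are assumed from the quoted
snippet, and unsigned __int128 is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified reference: only the fields used here. */
struct ref {
	uint64_t counter_value;		/* counter reading at the reference point */
	uint64_t time_sec;		/* seconds at counter_value */
	uint64_t time_frac_sec;		/* fraction at counter_value, 2^-64 s units */
	uint64_t period_frac_sec;	/* period, 2^-(64 + period_shift) s units */
	unsigned int period_shift;
};

/* Evaluate time = base + (counter - counter_value) * period. */
static void ref_time_at(const struct ref *r, uint64_t counter,
			uint64_t *sec, uint32_t *nsec)
{
	uint64_t delta = counter - r->counter_value;
	unsigned __int128 frac;

	/* Elapsed time in 2^-64 s units, plus the base fraction. */
	frac = ((unsigned __int128)delta * r->period_frac_sec) >> r->period_shift;
	frac += r->time_frac_sec;

	*sec = r->time_sec + (uint64_t)(frac >> 64);
	/* Scale the remaining 2^-64 s fraction to nanoseconds. */
	*nsec = (uint32_t)(((unsigned __int128)(uint64_t)frac * 1000000000u) >> 64);
}

/* Signed phase error (reference minus local clock) in nanoseconds. */
static int64_t phase_error_ns(uint64_t ref_sec, uint32_t ref_nsec,
			      uint64_t loc_sec, uint32_t loc_nsec)
{
	return (int64_t)(ref_sec - loc_sec) * 1000000000LL +
	       ((int64_t)ref_nsec - (int64_t)loc_nsec);
}
```

The result of phase_error_ns() is the kind of value that ends up as the
offset to slew out.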

> It seems that when Julien et al lamented that, "Until now, however,
> there has been a serious practical issue inhibiting feed-forward
> approaches: a lack of kernel support", the basics were actually there
> in the kernel's core timekeeping all along.

From my point of view, the only missing piece is software timestamping
of packets using other clocks than CLOCK_REALTIME.

> > > And TSC scaling is pretty much x86-specific; other architectures have a
> > > *defined* counter frequency and don't need to support scaling.
> > 
> > There can be a software fallback if hardware scaling and/or offset is
> > not supported.
> 
> Right. This *is* the software fallback, because the hardware scaling
> and offset aren't sufficient even if we only care about x86 where the
> former is supported.

IMHO it's a solution done at a wrong layer. 

-- 
Miroslav Lichvar



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-25  8:06                       ` Arthur Kiyanovski
@ 2026-05-25  8:41                         ` David Woodhouse
  2026-05-26 14:12                         ` Thomas Gleixner
  1 sibling, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-25  8:41 UTC (permalink / raw)
  To: Arthur Kiyanovski, Thomas Gleixner
  Cc: Miroslav Lichvar, Richard Cochran, Wen Gu, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Mon, 2026-05-25 at 08:06 +0000, Arthur Kiyanovski wrote:
> Hi Thomas, David,

Thanks, Arthur.

> Thanks for the layout proposal, Thomas. The unified structure with 
> explicit valid flags is a much cleaner approach than bounds-based
> validation.
> 
> I'm the author of the PHC timestamp attributes series [1] that this
> applies to. Before I spin v4 based on this design, I want to confirm
> three implementation details:
> 
> 1. Counter IDs: No stable UAPI clocksource numbering exists today 
> (enum clocksource_ids is kernel-internal). I'll define stable constants
> in include/uapi/linux/ptp_clock.h (e.g., PTP_CSID_X86_TSC,
> PTP_CSID_ARM_ARCH) and map internal IDs in the chardev layer.

I think you'd already done a mapping like this to PTP_COUNTER_xxx,
hadn't you? Although at the time I thought you'd mapped it to the
VIRTIO_RTC_COUNTER_xxx and VMCLOCK_COUNTER_xxx values and now I see
they don't quite match up.

If I understood Thomas correctly, I think he meant that you should make
the kernel's actual CSID values into uapi (which would involve moving
the clocksource_ids.h file to include/uapi/). 

I think I did have a slight preference for keeping an explicit mapping,
and exposing only those CSIDs that we *intend* to expose to userspace
(maybe not CSID_X86_TSC_EARLY, or we might map that to
PTP_COUNTER_X86_TSC). But only if the number space does actually match
VMCLOCK and VIRTIO_RTC. 

I'm also *entirely* prepared to concede if Thomas really does want to
expose CSID values directly; that isn't a hill to die on.



> 2. Array sizing: The timestamps array will be fixed at PTP_MAX_SAMPLES (25)
> in the ioctl struct, not a flexible array, to keep
> copy_from_user/copy_to_user bounded.
> 
> 3. Ioctl numbers: Two separate ioctls (PTP_SYS_OFFSET_EXTENDED_ATTRS and
> PTP_SYS_OFFSET_PRECISE_ATTRS) sharing the same payload struct, matching
> existing PTP convention.
> 
> Arthur
> 
> [1] https://lore.kernel.org/netdev/20260515164033.6403-1-akiyano@amazon.com/
> 



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-25  8:08             ` Miroslav Lichvar
@ 2026-05-25  9:14               ` David Woodhouse
  2026-05-26  7:10                 ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-25  9:14 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Mon, 2026-05-25 at 10:08 +0200, Miroslav Lichvar wrote:
> On Thu, May 21, 2026 at 10:54:41AM +0100, David Woodhouse wrote:
> > On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> > > Ok, but I don't see why the phase corrections of the guest need to be
> > > in the kernel.
> > 
> > I'm not sure I understand. 
>
>   <..clarification...>
>
> 	/* Compute phase offset at cycle_last and set time_offset to slew */
>        ...
> 	ntp_set_time_offset(tk->id, ref_err >> tk->tkr_mono.shift);
> 

Ah, I see. Thanks.

But that's just using ->time_offset which has *always* been in the
kernel.

It's the same mechanism to apply phase offset that everything else
(adjtime(), adjtimex(ADJ_SETOFFSET)) already uses. 

The only thing that's different here is the calculation I elided
between the comment and ntp_set_time_offset() call shown there, which
is calculating *precisely* the offset to set in order to match the
desired reference.

There's nothing fundamental in the actual *timekeeping* here that
hasn't already been in the guest kernel for decades; I'm just fixing a
few arithmetic errors in the core code, and then *driving* it more
precisely using its existing parameters (tick_length, time_offset).

> There might be a disagreement on terminology.

Those will be entirely my fault.

> > It seems that when Julien et al lamented that, "Until now, however,
> > there has been a serious practical issue inhibiting feed-forward
> > approaches: a lack of kernel support", the basics were actually there
> > in the kernel's core timekeeping all along.
> 
> From my point of view, the only missing piece is software timestamping
> of packets using other clocks than CLOCK_REALTIME.

For literal NTP, you mean? Yes, that makes sense. And having the NIC
timestamp the packets using PTM would be great too.

> > > > And TSC scaling is pretty much x86-specific; other architectures have a
> > > > *defined* counter frequency and don't need to support scaling.
> > > 
> > > There can be a software fallback if hardware scaling and/or offset is
> > > not supported.
> > 
> > Right. This *is* the software fallback, because the hardware scaling
> > and offset aren't sufficient even if we only care about x86 where the
> > former is supported.
> 
> IMHO it's a solution done at a wrong layer. 

Understood. What do you believe is the better solution?

Aside from the case of actually using NTP or a PHC to discipline the
kernel's CLOCK_REALTIME, the use cases I'm trying to enable are:

 • (Micro)VM guest is *given* the TSC→realtime relationship in a virt
   enlightenment, gets an interrupt whenever it changes. Can react to
   that interrupt and steer the kernel's timekeeping as quickly as any
   userspace dæmon could do anything.

 • Dedicated virtual hosting environment needs to discipline the *TSC*
   directly against external references (PHC, 1PPS) in order to provide
   said virt enlightenment directly to guests and allow for accurate
   migration. This environment does not care about the host's actual
   CLOCK_REALTIME; that's basically cosmetic for logging purposes.

 • Multi-purpose environment has a standard ntpd/chrony setup, wants
   QEMU to be able to provide the same virt enlightenment based on
   the kernel's own timekeeping.

Thomas and I seemed to be agreeing on a clock_[sg]et_time_reference()
API which would allow for all of the above, with basically no change to
the kernel's actual timekeeping: again, it's just exposing the
*existing* parameters allowing for more precise control and visibility.

Especially as userspace currently has no way to see what the kernel
*thinks* the time should be at any given moment. It can only see the
actual output of CLOCK_REALTIME, which is sawtoothing *around* the
'intended' value tracked by ntp_error, tick by tick.

I was about to knock up a prototype of that (probably based on ioctls
or read/write on a miscdev for now, just for the proof of concept. All
the boilerplate of actual system call stuff can come later, if we like
it).

If you have a better suggestion, I'm more than happy to entertain it.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-25  9:14               ` David Woodhouse
@ 2026-05-26  7:10                 ` Miroslav Lichvar
  2026-05-26 10:00                   ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-26  7:10 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Mon, May 25, 2026 at 10:14:10AM +0100, David Woodhouse wrote:
> On Mon, 2026-05-25 at 10:08 +0200, Miroslav Lichvar wrote:
> > On Thu, May 21, 2026 at 10:54:41AM +0100, David Woodhouse wrote:
> > > On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> > > > Ok, but I don't see why the phase corrections of the guest need to be
> > > > in the kernel.
> > > 
> > > I'm not sure I understand. 
> >
> >   <..clarification...>
> >
> > 	/* Compute phase offset at cycle_last and set time_offset to slew */
> >        ...
> > 	ntp_set_time_offset(tk->id, ref_err >> tk->tkr_mono.shift);
> > 
> 
> Ah, I see. Thanks.
> 
> But that's just using ->time_offset which has *always* been in the
> kernel.

time_offset is an input of the kernel PLL. My concern is that the PLL
is fed directly by ptp_vmclock, ignoring everything else. There is no
setting of the PLL time constant or the flags, no configuration of the
step threshold, or any other options that a more advanced
implementation might have. To me it feels like a bad shortcut. I think
this part of the loop should be in userspace, properly using the
adjtimex() API. The feed-forward part (copying frequency settings of
the host) is still possible.

> There's nothing fundamental in the actual *timekeeping* here that
> hasn't already been in the guest kernel for decades; I'm just fixing a
> few arithmetic errors in the core code, and then *driving* it more
> precisely using its existing parameters (tick_length, time_offset).

Fixing arithmetic errors is great. The driving part is what I'm
concerned about, like where it is and what it is driving.

> > > Right. This *is* the software fallback, because the hardware scaling
> > > and offset aren't sufficient even if we only care about x86 where the
> > > former is supported.
> > 
> > IMHO it's a solution done at a wrong layer. 
> 
> Understood. What do you believe is the better solution?

I think a better solution is scaling of the clocksource, i.e. a layer
below the realtime clock. An additional multiplier applied in HW or
SW. That would address the problem for all system clocks, not just the
realtime clock. adjtimex() changes are applied on top of that, they
are not in conflict.

> Aside from the case of actually using NTP or a PHC to discipline the
> kernel's CLOCK_REALTIME, the use cases I'm trying to enable are:
> 
>  • (Micro)VM guest is *given* the TSC→realtime relationship in a virt
>    enlightenment, gets an interrupt whenever it changes. Can react to
>    that interrupt and steer the kernel's timekeeping as quickly as any
>    userspace dæmon could do anything.
> 
>  • Dedicated virtual hosting environment needs to discipline the *TSC*
>    directly against external references (PHC, 1PPS) in order to provide
>    said virt enlightenment directly to guests and allow for accurate
>    migration. This environment does not care about the host's actual
>    CLOCK_REALTIME; that's basically cosmetic for logging purposes.
> 
>  • Multi-purpose environment has a standard ntpd/chrony setup, wants
>    QEMU to be able to provide the same virt enlightenment based on
>    the kernel's own timekeeping.

Which of those couldn't be done with the clocksource scaling and/or
adjtimex() if all the necessary information was available to userspace?

-- 
Miroslav Lichvar



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-26  7:10                 ` Miroslav Lichvar
@ 2026-05-26 10:00                   ` David Woodhouse
  2026-05-27  7:46                     ` Miroslav Lichvar
  0 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2026-05-26 10:00 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Tue, 2026-05-26 at 09:10 +0200, Miroslav Lichvar wrote:
> On Mon, May 25, 2026 at 10:14:10AM +0100, David Woodhouse wrote:
> > On Mon, 2026-05-25 at 10:08 +0200, Miroslav Lichvar wrote:
> > > On Thu, May 21, 2026 at 10:54:41AM +0100, David Woodhouse wrote:
> > > > On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> > > > > Ok, but I don't see why the phase corrections of the guest need to be
> > > > > in the kernel.

...

> 
> > But that's just using ->time_offset which has *always* been in the
> > kernel.
> 
> time_offset is an input of the kernel PLL. My concern is that the PLL
> is fed directly by ptp_vmclock, ignoring everything else. There is no
> setting of the PLL time constant or the flags, no configuration of the
> step threshold, or any other options that a more advanced
> implementation might have. To me it feels like a bad shortcut.

Oh, undoubtedly yes. The hack I put in vmclock for the RFC is very much
a shortcut to prove the concept and enable discussion of what it
*should* look like.

We absolutely do want userspace to be in control of the policy.

Although I do think we want the kernel to be able to seed its
timekeeping at boot from the vmclock — not just the time, but the
precise tick_length. I haven't looked hard at that part yet; only
observed that with the do_settimeofday64() in the existing hack, the
guest often starts up about 100ns from the reference.

> I think this part of the loop should be in userspace, properly using
> the adjtimex() API. The feed-forward part (copying frequency settings
> of the host) is still possible.

I've no fundamental objection to using adjtimex(); I just couldn't see
how to do so with the required precision, otherwise I'd have done so.
Although I do quite like Thomas's clock_[gs]et_time_reference()
suggestion which allows it to discipline AUX clocks too.

Let us assume that userspace, either from vmclock or direct discipline
of the arch counter against external sources, has: 
  • Reference time T.
  • Arch counter value at time T.
  • Period of a single arch counter tick.

This translates fairly directly into the kernel's tick_length and
time_offset. But *only* if you know cycle_interval, ntp_error and other
details. Which is why my timekeeping_set_reference() takes the
information in that form, and then translates it within the core
timekeeping.
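
As a rough illustration of the frequency half of that translation: assuming
the period is expressed in units of 2^-(64 + shift) seconds, and assuming
the kernel's convention of holding tick_length in nanoseconds scaled by
2^32, the per-tick length follows from the period and the number of cycles
per tick. The function name and parameters are hypothetical, and this
ignores ntp_error and the phase (time_offset) side entirely.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 32	/* assumed: tick length kept as ns << 32 */

/*
 * Length of one tick in (ns << SCALE_SHIFT), for a tick that spans
 * cycle_interval counter cycles of the given period.
 * Uses unsigned __int128 (GCC/Clang extension) for the intermediate.
 */
static uint64_t ref_tick_length(uint64_t cycle_interval,
				uint64_t period_frac_sec,
				unsigned int period_shift)
{
	/* cycles * period, in units of 2^-(64 + period_shift) seconds */
	unsigned __int128 t = (unsigned __int128)cycle_interval * period_frac_sec;

	/* seconds -> nanoseconds, then rescale to 2^-SCALE_SHIFT ns units */
	t *= 1000000000u;
	return (uint64_t)(t >> (64 - SCALE_SHIFT + period_shift));
}
```

With a period of 2^-30 s and 2^20 cycles per tick this gives 976562.5 ns
per tick, with the half nanosecond preserved in the fractional bits.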

If you can show me how to do that with adjtimex(), that would be great.

Here's a sample of the output from my test setup. The host is running
chronyd, with the QEMU patch I linked. The guest test is now entirely
in userspace, using PTP to get paired readings of the guest's
CLOCK_REALTIME vs. vmclock for precisely the same counter value.

As chrony introduces a change on the host, QEMU propagates that to the
guest (the vmclock: line is from QEMU), and the guest adjusts
accordingly. And then converges *really* slowly, as even setting the
time constant to 0 gives a half-life for time_offset of about 11
seconds.

EXT[140130] diff=+0ns counter=995f301fc2b1
EXT[140131] diff=+0ns counter=995f77da47f9
EXT[140132] diff=+0ns counter=995fbf92f419
EXT[140133] diff=+0ns counter=9960074d5bfd
EXT[140134] diff=+1ns counter=99604f088779
vmclock: host_cv=0x44e7bb427115f offset=0xfffc4ae4cee46d30 guest_cv=0x9960830b7e8f tsc_khz=2400000
EXT[140135] diff=+1ns counter=996096c408b5
EXT[140136] diff=-9ns counter=9960deab00f5
EXT[140137] diff=-9ns counter=99612660729d
EXT[140138] diff=-9ns counter=99616e184279
EXT[140139] diff=-9ns counter=9961b5d3aca9
EXT[140140] diff=-9ns counter=9961fd8e78dd
EXT[140141] diff=-9ns counter=99624549d909
EXT[140142] diff=-9ns counter=99628d053e61
EXT[140143] diff=-9ns counter=9962d4be6411
EXT[140144] diff=-9ns counter=99631c76dda9
EXT[140145] diff=-8ns counter=99636431ac05
EXT[140146] diff=-9ns counter=9963abed1f91
EXT[140147] diff=-8ns counter=9963f3a82e91
EXT[140148] diff=-8ns counter=99643b639a31
EXT[140149] diff=-8ns counter=9964831d8385
EXT[140150] diff=-8ns counter=9964cad89fe9

Given the simplicity of the 'bad shortcut', and the fact that we do
want the kernel to follow the reference at *boot* time, I do think I'd
like to have a mode for microvms which optionally *allows* the kernel
to continue to track the reference for itself rather than having an
extra userspace tool that is literally just polling on /dev/vmclock in
order to feed precisely that same information back into the kernel
directly.

> > There's nothing fundamental in the actual *timekeeping* here that
> > hasn't already been in the guest kernel for decades; I'm just fixing a
> > few arithmetic errors in the core code, and then *driving* it more
> > precisely using its existing parameters (tick_length, time_offset).
> 
> Fixing arithmetic errors is great. The driving part is what I'm
> concerned about, like where it is and what it is driving.
> 
> > > > Right. This *is* the software fallback, because the hardware scaling
> > > > and offset aren't sufficient even if we only care about x86 where the
> > > > former is supported.
> > > 
> > > IMHO it's a solution done at a wrong layer. 
> > 
> > Understood. What do you believe is the better solution?
> 
> I think a better solution is scaling of the clocksource, i.e. a layer
> below the realtime clock. An additional multiplier applied in HW or
> SW. That would address the problem for all system clocks, not just the
> realtime clock. adjtimex() changes are applied on top of that, they
> are not in conflict.

But we literally already have a way to 'scale' the counter in order to
derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
Currently driven *only* by NTP/adjtimex().

And we have CLOCK_MONOTONIC_RAW which is explicitly *not* skewed
according to any external idea of time, but tracks raw counter ticks as
if they happen at some nominal frequency — and remains precisely in
sync with what userspace might see by reading the counter directly.

Are you suggesting that the actual clocksource driver in the kernel for
e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
instead of giving raw counter reads? So we have some NTP-like process
to adjust each clocksource, in *addition* to the core kernel
timekeeping? And then those skewed clocksource values are only
meaningful under a seqlock like the existing kernel timekeeper values
are valid under the tk_data.seq seqlock?

And would we have a separate way to get the real value, to use for
CLOCK_MONOTONIC_RAW?

If I'm understanding your proposal correctly, I am... not keen.

> > Aside from the case of actually using NTP or a PHC to discipline the
> > kernel's CLOCK_REALTIME, the use cases I'm trying to enable are:
> > 
> >  • (Micro)VM guest is *given* the TSC→realtime relationship in a virt
> >    enlightenment, gets an interrupt whenever it changes. Can react to
> >    that interrupt and steer the kernel's timekeeping as quickly as any
> >    userspace dæmon could do anything.
> > 
> >  • Dedicated virtual hosting environment needs to discipline the *TSC*
> >    directly against external references (PHC, 1PPS) in order to provide
> >    said virt enlightenment directly to guests and allow for accurate
> >    migration. This environment does not care about the host's actual
> >    CLOCK_REALTIME; that's basically cosmetic for logging purposes.
> > 
> >  • Multi-purpose environment has a standard ntpd/chrony setup, wants
> >    QEMU to be able to provide the same virt enlightenment based on
> >    the kernel's own timekeeping.
> 
> Which of those couldn't be done with the clocksource scaling and/or
> adjtimex() if all the necessary information was available to userspace?

Let us assume that (1) can be done using adjtimex() although as noted
above, I couldn't see how.

(2) is resolved by the patches that Arthur, Thomas and I have worked on
over the last few days to enable PTP to return actual counter values,
and then that 'afterthought' about feeding it into the host kernel is
the same as (1). Although if the counter values themselves end up being
*skewed* then that introduces a whole new set of issues.

(3) would still need the clock_get_time_reference() (which I've hacked
up in my proof of concept as exposing a pollable /dev/vmclock_host
directly from the kernel). And again, if the actual *counter* can't be
trusted any more, that introduces a whole new set of issues with
relating the skewed clocksource cycle count, to what guests actually
*see* and what the kernel reports from its timekeeping. 

I think I like the clock_[gs]et_time_reference() model. I *really* have
to context switch back to other things this week, but at some point in
the near future I'm planning to knock up a proof of concept of that;
probably via read/write or ioctls on a miscdev for now to play with it,
and the whole boilerplate of wiring up system calls can come later,
*if* it passes muster.


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-25  8:06                       ` Arthur Kiyanovski
  2026-05-25  8:41                         ` David Woodhouse
@ 2026-05-26 14:12                         ` Thomas Gleixner
  1 sibling, 0 replies; 50+ messages in thread
From: Thomas Gleixner @ 2026-05-26 14:12 UTC (permalink / raw)
  To: Arthur Kiyanovski
  Cc: David Woodhouse, Miroslav Lichvar, Richard Cochran, Wen Gu,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

Arthur!

On Mon, May 25 2026 at 08:06, Arthur Kiyanovski wrote:
> On 2026-05-24 14:36:35+02:00, Thomas Gleixner wrote:
>> On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> I'm the author of the PHC timestamp attributes series [1] that this
> applies to. Before I spin v4 based on this design, I want to confirm
> three implementation details:
>
> 1. Counter IDs: No stable UAPI clocksource numbering exists today 
> (enum clocksource_ids is kernel-internal). I'll define stable constants
> in include/uapi/linux/ptp_clock.h (e.g., PTP_CSID_X86_TSC,
> PTP_CSID_ARM_ARCH) and map internal IDs in the chardev layer.

Either that or we make the clocksource IDs part of UABI, which avoids
back and forth mapping.

> 2. Array sizing: The timestamps array will be fixed at PTP_MAX_SAMPLES (25)
> in the ioctl struct, not a flexible array, to keep
> copy_from_user/copy_to_user bounded.

Why? If userspace allocates an array size of 10k then the kernel will
still only copy out PTP_MAX_SAMPLES entries.

If it allocates two and asks for 10, that's not a kernel problem when
adjacent data is overwritten. That's not any different from read(2) or
other syscalls which do what they are asked to.
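
A small sketch of that division of responsibility, using simplified
stand-ins for the proposed structs (field names are illustrative only).
Userspace sizes the flexible array itself; the kernel side bounds its own
work by the fixed maximum, here by clamping, regardless of what was
allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_SAMPLES 25	/* mirrors PTP_MAX_SAMPLES as quoted in the thread */

/* Simplified stand-ins for the proposed layout. */
struct timestamp { uint64_t pre, dev, post; };
struct request { uint32_t valid; uint32_t num_samples; };
struct ioctl_data {
	struct request request;
	struct timestamp timestamps[];	/* flexible array member */
};

/* Userspace: allocate room for exactly num_samples entries. */
static struct ioctl_data *alloc_request(uint32_t num_samples)
{
	struct ioctl_data *d;

	d = calloc(1, sizeof(*d) + (size_t)num_samples * sizeof(d->timestamps[0]));
	if (d)
		d->request.num_samples = num_samples;
	return d;
}

/* Kernel-side view: never produce more entries than the fixed maximum. */
static uint32_t samples_to_copy(const struct request *req)
{
	return req->num_samples < MAX_SAMPLES ? req->num_samples : MAX_SAMPLES;
}
```

Whether an oversized num_samples should be clamped like this or rejected
outright is a policy choice; either keeps the copy bounded.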

> 3. Ioctl numbers: Two separate ioctls (PTP_SYS_OFFSET_EXTENDED_ATTRS and
> PTP_SYS_OFFSET_PRECISE_ATTRS) sharing the same payload struct, matching
> existing PTP convention.

As I said, I have no strong opinion on that and that's a question to be
answered by user space people. I personally would prefer one just
because I'm lazy :)

Thanks,

        tglx


* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-26 10:00                   ` David Woodhouse
@ 2026-05-27  7:46                     ` Miroslav Lichvar
  2026-05-27 12:28                       ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Miroslav Lichvar @ 2026-05-27  7:46 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Tue, May 26, 2026 at 11:00:28AM +0100, David Woodhouse wrote:
> Let us assume that userspace, either from vmclock or direct discipline
> of the arch counter against external sources, has: 
>   • Reference time T.
>   • Arch counter value at time T.
>   • Period of a single arch counter tick.
> 
> This translates fairly directly into the kernel's tick_length and
> time_offset. But *only* if you know cycle_interval, ntp_error and other
> details. Which is why my timekeeping_set_reference() takes the
> information in that form, and then translates it within the core
> timekeeping.
> 
> If you can show me how to do that with adjtimex(), that would be great.

tick_length can be set by the adjtimex() modes ADJ_FREQUENCY (in
scaled units of 1/65536 ppm up to 500 ppm) and ADJ_TICK (in
microseconds per 1/USER_HZ tick).

time_offset can be set by the ADJ_OFFSET mode. The PLL needs to be
enabled first by setting the STA_PLL status (ADJ_STATUS mode) and also
the STA_FREQHOLD flag needs to be set to avoid changing the PLL
frequency.
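
As a sketch of the above, the request could be prepared like this. The helper names are hypothetical; struct timex and the ADJ_*/STA_* constants are the ones from <sys/timex.h>. The offset is taken to be in microseconds, which assumes STA_NANO is not set. Nothing here calls adjtimex() itself, which would need CAP_SYS_TIME.

```c
#include <string.h>
#include <sys/timex.h>

/* adjtimex() frequency units: ppm with a 16-bit binary fraction. */
static long ppm_to_scaled(double ppm)
{
	return (long)(ppm * 65536.0);
}

/*
 * Fill a request that enables the PLL with the frequency held, sets the
 * frequency and hands over a phase offset to be slewed out.
 */
static void fill_freq_and_offset(struct timex *tx, double freq_ppm,
				 long offset_us)
{
	memset(tx, 0, sizeof(*tx));
	tx->modes = ADJ_STATUS | ADJ_FREQUENCY | ADJ_OFFSET;
	tx->status = STA_PLL | STA_FREQHOLD;
	tx->freq = ppm_to_scaled(freq_ppm);
	tx->offset = offset_us;
}
```

A caller would pass the filled struct to adjtimex(&tx). If the status has to be in effect before the offset is accepted, the ADJ_STATUS part can be issued as a separate call first.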

The ntp_error and other details need to be exposed to userspace. Maybe
in the same API that will be used for reporting the time and frequency
offsets between system clocks.

> As chrony introduces a change on the host, QEMU propagates that to the
> guest (the vmclock: line is from QEMU), and the guest adjusts
> accordingly. And then converges *really* slowly, as even setting the
> time constant to 0 gives a half-life for time_offset of about 11
> seconds.

A simple linear slew would be better for this. The offset is accurate,
there is no need for filtering.

> Given the simplicity of the 'bad shortcut', and the fact that we do
> want the kernel to follow the reference at *boot* time, I do think I'd
> like to have a mode for microvms which optionally *allows* the kernel
> to continue to track the reference for itself rather than having an
> extra userspace tool that is literally just polling on /dev/vmclock in
> order to feed precisely that same information back into the kernel
> directly.

Setting the values on boot in the kernel makes sense to me. There is
no loop involved. It follows the setting of the system clock from the
RTC.

> > I think a better solution is scaling of the clocksource, i.e. a layer
> > below the realtime clock. An additional multiplier applied in HW or
> > SW. That would address the problem for all system clocks, not just the
> > realtime clock. adjtimex() changes are applied on top of that, they
> > are not in conflict.
> 
> But we literally already have a way to 'scale' the counter in order to
> derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
> Currently driven *only* by NTP/adjtimex().

I see that as a different purpose than guest migrations. A migrated
guest should have its clocksource frequency corrected while the clock
is controlled by NTP/PTP. If this mechanism was shared, that would not
be possible.

> Are you suggesting that the actual clocksource driver in the kernel for
> e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
> instead of giving raw counter reads? So we have some NTP-like process
> to adjust each clocksource, in *addition* to the core kernel
> timekeeping?

Not so much NTP-like. There would be no mult dithering or phase
adjustments, only frequency.

> And then those skewed clocksource values are only
> meaningful under a seqlock like the existing kernel timekeeper values
> are valid under the tk_data.seq seqlock?

I guess you are implying here this SW-fallback scaling would have a
significant impact on the performance. Could it not be applied at the
same time as the normal multiplier in the conversion to nanoseconds?
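
A rough sketch of what folding the correction into the existing multiplication could mean, assuming the usual ns = (cycles * mult) >> shift conversion. The helper names and the ppb parameter are hypothetical, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/* Usual cycles-to-nanoseconds conversion: ns = (cycles * mult) >> shift. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

/*
 * Fold a frequency correction, given in parts per billion, into the
 * multiplier so that the per-read cost stays one multiply and one shift.
 */
static uint32_t scale_mult(uint32_t mult, int32_t ppb)
{
	int64_t adj = (int64_t)mult * ppb / 1000000000;

	return (uint32_t)((int64_t)mult + adj);
}
```

The resolution of the correction is limited by the width of mult; with a 24-bit shift one step of mult is about 60 ppb.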

> And would we have a separate way to get real value, to use for
> CLOCK_MONOTONIC_RAW?

All system clocks should be scaled, that's my point.

-- 
Miroslav Lichvar



* Re: [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock
  2026-05-27  7:46                     ` Miroslav Lichvar
@ 2026-05-27 12:28                       ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2026-05-27 12:28 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Richard Cochran, Wen Gu, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Stultz,
	Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Peter Zijlstra,
	Thomas Weißschuh, Arnd Bergmann, Julien Ridoux, Ryan Luu,
	linux-kernel, Marcelo Tosatti

On Wed, 2026-05-27 at 09:46 +0200, Miroslav Lichvar wrote:
> On Tue, May 26, 2026 at 11:00:28AM +0100, David Woodhouse wrote:
> > Let us assume that userspace, either from vmclock or direct discipline
> > of the arch counter against external sources, has: 
> >   • Reference time T.
> >   • Arch counter value at time T.
> >   • Period of a single arch counter tick.
> > 
> > This translates fairly directly into the kernel's tick_length and
> > time_offset. But *only* if you know cycle_interval, ntp_error and other
> > details. Which is why my timekeeping_set_reference() takes the
> > information in that form, and then translates it within the core
> > timekeeping.
> > 
> > If you can show me how to do that with adjtimex(), that would be great.
> 
> tick_length can be set by the adjtimex() modes ADJ_FREQUENCY (in
> scaled units of 1/65536 ppm up to 500 ppm) and ADJ_TICK (in
> microseconds per 1/USER_HZ tick).
> 
> time_offset can be set by the ADJ_OFFSET mode. The PLL needs to be
> enabled first by setting the STA_PLL status (ADJ_STATUS mode) and also
> the STA_FREQHOLD flag needs to be set to avoid changing the PLL
> frequency.
> 
> The ntp_error and other details need to be exposed to userspace. Maybe
> in the same API that will be used for reporting the time and frequency
> offsets between system clocks.

I don't think that's enough. Consider the fact that I've just had to
apply a correction to my existing timekeeping_set_reference() proof of
concept to make it calculate and set time_offset for the moment of the
*next* tick, instead of at the *prior* tick:

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2443,13 +2443,17 @@ int timekeeping_set_reference(const struct tk_reference *ref)
                        32 + ref->period_shift);
        ntp_set_tick_length(tk->id, new_tl);
 
-       /* Compute phase offset at cycle_last and set time_offset to slew */
-       delta = tk->tkr_mono.cycle_last - ref->counter_value;
+       /*
+        * Compute phase offset at the *next* tick boundary, where the new
+        * tick_length will first take effect. Using cycle_last would leave
+        * a gap where the old mult accumulates additional phase error.
+        */
+       delta = tk->tkr_mono.cycle_last + tk->cycle_interval - ref->counter_value;
        ref_frac = mul_u64_u64_shr(delta, ref->period_frac_sec,
                                   ref->period_shift) + ref->time_frac_sec;
        ref_err = (s64)mul_u64_u64_shr(ref_frac,
                        (u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
-                 (s64)tk->tkr_mono.xtime_nsec;
+                 (s64)(tk->tkr_mono.xtime_nsec + tk->xtime_interval);
        ntp_set_time_offset(tk->id, ref_err >> tk->tkr_mono.shift);
        tk->ntp_error = 0;
 

I just don't think we can do this from userspace, and I don't really
see the *need* to.
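
To make the arithmetic concrete, here is a simplified standalone model of the same calculation. It is a sketch, not the kernel code: the struct and helper names are hypothetical, times are plain nanoseconds rather than shifted fractional seconds, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical, simplified reference: time = time_ns + delta * period. */
struct ref_sketch {
	uint64_t counter_value;		/* counter reading at the reference point */
	uint64_t time_ns;		/* reference time at counter_value */
	uint64_t period_frac_ns;	/* ns per counter tick, scaled by 2^period_shift */
	unsigned int period_shift;
};

/* Evaluate the linear reference function at an arbitrary counter value. */
static uint64_t ref_time_at(const struct ref_sketch *ref, uint64_t counter)
{
	uint64_t delta = counter - ref->counter_value;

	return ref->time_ns +
	       (uint64_t)(((unsigned __int128)delta * ref->period_frac_ns) >>
			  ref->period_shift);
}

/*
 * Phase error at the next tick boundary: reference time at
 * cycle_last + cycle_interval, minus where the kernel clock will be
 * after accumulating one more interval.
 */
static int64_t phase_error_at_next_tick(const struct ref_sketch *ref,
					uint64_t cycle_last,
					uint64_t cycle_interval,
					uint64_t clock_ns_at_last,
					uint64_t interval_ns)
{
	uint64_t next = cycle_last + cycle_interval;

	return (int64_t)(ref_time_at(ref, next) -
			 (clock_ns_at_last + interval_ns));
}
```

Both sides are evaluated at the same instant, and both depend on cycle_last, cycle_interval and the accumulated xtime, which userspace cannot sample atomically.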

It seems cleaner just to have clock_set_time_reference() which matches
what clock_get_time_reference() exports, instead of trying to shoe-horn
it into the adjtimex API and force userspace to jump through hoops to
reverse engineer things and apply racy adjustments.

> > As chrony introduces a change on the host, QEMU propagates that to the
> > guest (the vmclock: line is from QEMU), and the guest adjusts
> > accordingly. And then converges *really* slowly, as even setting the
> > time constant to 0 gives a half-life for time_offset of about 11
> > seconds.
> 
> A simple linear slew would be better for this. The offset is accurate,
> there is no need for filtering.

Perhaps so, although I was trying to avoid making any real changes to
the core timekeeping other than fixing its accounting. In fact if I set
the time constant to zero *and* set STA_NANO, that gives a half-life of
about 2.4 seconds which should be fine.
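
As a rough check on those two figures, assume a simple model in which, once per second, a fraction 2^-s of the remaining offset is taken out. The model and the symbol s are assumptions for this estimate; how the time constant and STA_NANO map to s is not spelled out here.

```latex
x_{n+1} = x_n \left(1 - 2^{-s}\right)
\quad\Longrightarrow\quad
t_{1/2} = \frac{\ln 2}{-\ln\left(1 - 2^{-s}\right)}\ \text{seconds}

s = 2:\quad t_{1/2} = \frac{\ln 2}{\ln(4/3)} \approx 2.41\ \text{s}
\qquad
s = 4:\quad t_{1/2} = \frac{\ln 2}{\ln(16/15)} \approx 10.74\ \text{s}
```

Under that model, s = 2 and s = 4 land on roughly 2.4 and 11 seconds respectively.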

> > > I think a better solution is scaling of the clocksource, i.e. a layer
> > > below the realtime clock. An additional multiplier applied in HW or
> > > SW. That would address the problem for all system clocks, not just the
> > > realtime clock. adjtimex() changes are applied on top of that, they
> > > are not in conflict.
> > 
> > But we literally already have a way to 'scale' the counter in order to
> > derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
> > Currently driven *only* by NTP/adjtimex().
> 
> I see that as a different purpose than guest migrations. A migrated
> guest should have its clocksource frequency corrected while the clock
> is controlled by NTP/PTP. If this mechanism was shared, that would not
> be possible.

If the *host* wants to use hardware frequency scaling to try to mask
the effects of live migration by making the effective frequency of the
TSC on the destination match the effective frequency of the TSC on the
source at the moment of migration, then that's a choice for the host.

I don't think it's likely to happen, as it brings a bunch of complexity
on the host side for relatively little benefit.

I don't think there's *any* chance of Linux ever doing the scaling of
the clocksources on the software side.
 
> > Are you suggesting that the actual clocksource driver in the kernel for
> > e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
> > instead of giving raw counter reads? So we have some NTP-like process
> > to adjust each clocksource, in *addition* to the core kernel
> > timekeeping?
> 
> Not so much NTP-like. There would be no mult dithering or phase
> adjustments, only frequency.

So clocksources would no longer be monotonic?

> > And then those skewed clocksource values are only
> > meaningful under a seqlock like the existing kernel timekeeper values
> > are valid under the tk_data.seq seqlock?
> 
> I guess you are implying here this SW-fallback scaling would have a
> significant impact on the performance. Could it not be applied at the
> same time as the normal multiplier in the conversion to nanoseconds?
> 
> > And would we have a separate way to get real value, to use for
> > CLOCK_MONOTONIC_RAW?
> 
> All system clocks should be scaled, that's my point.

I'm not sure you'll achieve universal consensus on the concept that
CLOCK_MONOTONIC_RAW should be skewed.

I suspect it's best to ignore the special case of live migration for
the moment. Treat it like any other update from the host which adjusts
the frequency and phase_offset. It's up to the host to make it appear
as if the guest TSC continued to tick at the source frequency while the
guest was in the ether, and provide a vmclock update the moment it
starts on the new host, letting the guest know the new frequency. The
frequency adjustment is applied almost immediately (via an interrupt,
directly *within* the kernel in my proof of concept case), and the
resulting phase delta should be tiny.




end of thread, other threads:[~2026-05-27 12:29 UTC | newest]

Thread overview: 50+ messages
2026-05-17 21:25 [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 1/8] timekeeping: Remove xtime_remainder from ntp_error accumulation David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 2/8] timekeeping: Account for clawback adjustment in ntp_error David Woodhouse
2026-05-19  1:59   ` John Stultz
2026-05-19 10:04     ` David Woodhouse
2026-05-19 19:28       ` John Stultz
2026-05-20 10:47         ` Miroslav Lichvar
2026-05-20 12:37           ` David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 3/8] timekeeping: Clamp time_offset delta to prevent infinite tail David Woodhouse
2026-05-19 13:25   ` Miroslav Lichvar
2026-05-19 13:31     ` David Woodhouse
2026-05-19 14:17       ` Miroslav Lichvar
2026-05-19 15:06         ` David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 4/8] timekeeping: Add absolute reference for feed-forward clock discipline David Woodhouse
2026-05-19  2:09   ` John Stultz
2026-05-19 11:07     ` David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 5/8] ptp_vmclock: Feed reference to timekeeping for feed-forward discipline David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 6/8] timekeeping: Guard against divide-by-zero in timekeeping_adjust David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 7/8] timekeeping: Drive time_offset skew via per-tick ntp_error transfer David Woodhouse
2026-05-17 21:25 ` [RFC PATCH v2 8/8] WIP: kernel/time: Add /dev/vmclock_host miscdev David Woodhouse
2026-05-19 13:16 ` [RFC PATCH v2 0/8] timekeeping: Fix draft tracking precision and add feed-forward discipline via vmclock Miroslav Lichvar
2026-05-19 15:50   ` David Woodhouse
2026-05-20 10:39     ` Miroslav Lichvar
2026-05-20 12:21       ` David Woodhouse
2026-05-21  6:35         ` Miroslav Lichvar
2026-05-21  9:54           ` David Woodhouse
2026-05-25  8:08             ` Miroslav Lichvar
2026-05-25  9:14               ` David Woodhouse
2026-05-26  7:10                 ` Miroslav Lichvar
2026-05-26 10:00                   ` David Woodhouse
2026-05-27  7:46                     ` Miroslav Lichvar
2026-05-27 12:28                       ` David Woodhouse
2026-05-21 18:30         ` Thomas Gleixner
2026-05-21 21:06           ` David Woodhouse
2026-05-22  8:02             ` Thomas Gleixner
2026-05-22 10:01               ` David Woodhouse
2026-05-22 15:28                 ` Thomas Gleixner
2026-05-22 16:23                   ` David Woodhouse
2026-05-24 12:36                     ` Thomas Gleixner
2026-05-24 13:13                       ` David Woodhouse
2026-05-24 15:05                         ` Thomas Gleixner
2026-05-25  8:06                       ` Arthur Kiyanovski
2026-05-25  8:41                         ` David Woodhouse
2026-05-26 14:12                         ` Thomas Gleixner
2026-05-22 16:50                   ` David Woodhouse
2026-05-24 15:15                     ` Thomas Gleixner
2026-05-24 15:37                       ` Thomas Gleixner
2026-05-24 15:48                         ` Thomas Gleixner
2026-05-24 16:36                         ` Thomas Gleixner
2026-05-24 16:42                           ` David Woodhouse
