[RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot

Netdev List

* [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
@ 2026-06-19  0:33 David Woodhouse
  2026-06-19 13:34 ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: David Woodhouse @ 2026-06-19  0:33 UTC
  To: John Stultz, Thomas Gleixner, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev

As far as I can tell, the only (remaining?) reason that CONFIG_NTP_PPS
doesn't work with NO_HZ_COMMON is because the real time snapshots that
pps_get_ts() uses are not sufficiently accurate, so the phase
correction wouldn't work very well.

The inaccuracy happens because of the way the kernel's timekeeping
sawtooths around the 'ideal' time line, by choosing between adjacent
values of 'mult' and 'mult+1' from one tick to the next. But with a
tickless kernel, of course the correction *doesn't* happen each tick,
and the time reported as CLOCK_REALTIME diverges further from the
correct time.
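
A toy userspace model of that divergence, using made-up numbers and not taken from the kernel: worst_error() and its parameters are hypothetical names, and the dither rule (run at mult + 1 only while the accumulated error is positive) is a simplification of what the timekeeping code does. It only illustrates that the peak deviation grows in proportion to the number of cycles between updates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy model, not kernel code. The ideal conversion factor is
 * (mult + frac_x32 / 2^32) clock-shifted ns per cycle, but each update
 * interval runs at integer mult or mult + 1.
 *
 * Returns the largest |ideal - actual| seen at an update boundary, in
 * clock-shifted ns (ns << shift).
 */
static int64_t worst_error(uint64_t cycles_per_update, uint32_t frac_x32,
			   unsigned int updates)
{
	int64_t err = 0, worst = 0;

	/* Model-only guard: keeps the 64-bit multiply below from overflowing. */
	assert(cycles_per_update < (1ULL << 32));

	for (unsigned int i = 0; i < updates; i++) {
		/* Assumed dither rule: use mult + 1 only while behind the ideal. */
		int64_t dither = err > 0 ? 1 : 0;
		/* Drift gained because the integer mult drops the fraction. */
		int64_t gain = (int64_t)((cycles_per_update * frac_x32) >> 32);

		/* Running at mult + 1 takes back one unit per cycle. */
		err += gain - dither * (int64_t)cycles_per_update;
		if (llabs(err) > worst)
			worst = llabs(err);
	}
	return worst;
}
```

With a truncated fraction of one half (frac_x32 = 0x80000000), worst_error(1000000, 0x80000000u, 10) returns 500000 and worst_error(100000000, 0x80000000u, 10) returns 50000000. On a hypothetical 1 GHz clocksource with shift 22 that is roughly 0.12 ns when updating every 1 ms tick, against roughly 12 ns after idling for 100 ms.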

The thing is... since 
https://lore.kernel.org/all/20260614144032.534706-1-dwmw2@infradead.org/
we know *precisely* how far from the truth our CLOCK_REALTIME value is,
and we can just put that information into the system_time_snapshot for
the caller to use as it sees fit. If the caller doesn't care about
monotonicity, it can just add the known 'error' to the snapshot.systime
value, and have a completely accurate snapshot even under nohz.

If I run my vmclock reference test on a tickless kernel, I see the
kernel's timekeeping vary by ±15ns around the ideal. The correction
below clamps it back to the ±1ns that I see with a periodic tick.

I think that's enough to enable CONFIG_NTP_PPS too, right? I'll have to
revive the hack at
https://lore.kernel.org/all/87cb97d5a26d0f4909d2ba2545c4b43281109470.camel@infradead.org/
to test it...

Am I missing some other reason for the dependency? Aside from the phase
error, it *does* seem to work. The dependency on !NO_HZ goes all the
way back to the original introduction of hardpps support in commit
025b40abe7, which doesn't explain *why* it didn't work on tickless
kernels.

From: David Woodhouse <dwmw@amazon.co.uk>
Date: Fri, 19 Jun 2026 00:00:29 +0100
Subject: [PATCH] timekeeping: Extrapolate ntp_error into snapshots

ktime_get_snapshot_id() is a lockless reader: it interpolates the
clocksource forward from cycle_last at a fixed mult but never runs the
timekeeping accumulation, so tk->ntp_error is only current as of the
last update. Between updates the read accrues the per-cycle deviation
from the NTP-ideal rate; on a NO_HZ kernel that span can be many ticks,
widening the sawtooth between the snapshot's disciplined CLOCK_REALTIME
and the ideal NTP line. This is the obstacle to accurate in-kernel PPS,
which today depends on !NO_HZ_COMMON.

Carry that deviation in the snapshot as a signed nanosecond offset that
a consumer adds directly to ::systime to land on the ideal line. It sums
four terms in ns << NTP_SCALE_SHIFT before converting:

  - tk->ntp_error, the deviation as of the last update;
  - (cycle_delta * ntp_err_frac), the fractional-mult drift accrued
    since then (cycle_delta is at most a tick on a tickful kernel, but
    many ticks' worth under NO_HZ);
  - (cycle_delta * ntp_err_mult), subtracting the applied +1 mult dither
    over the same span;
  - the sub-nanosecond fraction dropped when ::systime was truncated to
    whole ns (low shift bits of the read, exact despite overflow).

Only the mono-based clocks (REALTIME/MONOTONIC/BOOTTIME) carry this; RAW
is undisciplined and AUX has its own discipline. The residual is then a
single clocksource cycle, the same bound as a tickful kernel.
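
The four terms above can be modelled in userspace roughly as follows. This is a sketch for experimenting with the arithmetic, not the patch itself: struct tk_model and model_ntp_error() are hypothetical names, unsigned __int128 (a GCC/Clang extension) stands in for the kernel's mul_u64_u64_shr(), and the final right shift of a negative value relies on the arithmetic-shift behaviour those compilers provide.

```c
#include <assert.h>
#include <stdint.h>

#define NTP_SCALE_SHIFT 32	/* same value the kernel uses */

/* Hypothetical stand-in for the timekeeper fields the snapshot reads. */
struct tk_model {
	int64_t  ntp_error;		/* ns << NTP_SCALE_SHIFT, as of last update */
	uint32_t ntp_error_shift;	/* NTP_SCALE_SHIFT - shift */
	uint32_t ntp_err_mult;		/* 0 or 1: the applied +1 dither */
	uint64_t ntp_err_frac;		/* truncated mult fraction, scaled by 2^32 */
	uint32_t mult;
	uint32_t shift;
	uint64_t xtime_nsec;		/* ns << shift */
};

/* Returns the signed ns offset to add to the truncated systime. */
static int64_t model_ntp_error(const struct tk_model *tk, uint64_t cycle_delta)
{
	uint32_t nes = tk->ntp_error_shift;
	int64_t err = tk->ntp_error;	/* term 1: error at the last update */
	int64_t drift, frac;

	assert(tk->shift + nes == NTP_SCALE_SHIFT);

	/* term 2: fractional-mult drift accrued since the last update */
	drift = (int64_t)(((unsigned __int128)cycle_delta * tk->ntp_err_frac) >> 32);
	/* term 3: minus what the +1 dither already applied */
	drift -= (int64_t)(cycle_delta * tk->ntp_err_mult);
	/* multiply rather than shift so a negative drift is well defined */
	err += drift * ((int64_t)1 << nes);

	/* term 4: sub-ns fraction dropped when systime was truncated */
	frac = (int64_t)((cycle_delta * tk->mult + tk->xtime_nsec) &
			 ((1ULL << tk->shift) - 1));
	err += frac << nes;

	/* round to nearest whole nanosecond */
	return (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
```

For example, with shift = 22, ntp_error_shift = 10, mult = 1 << 22, ntp_err_frac = 0x80000000 and everything else zero, a cycle_delta of 100000000 yields 12 (ns); setting ntp_err_mult to 1 yields -12.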

NOT-FOR-UPSTREAM: also includes a temporary ptp_vmclock debug hack that
prints the offset and applies it to the returned timestamp, for
validating the field against the host vmclock reference under QEMU.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.8
---
 drivers/ptp/ptp_vmclock.c           |  2 ++
 include/linux/timekeeper_internal.h |  6 ++++
 include/linux/timekeeping.h         |  9 +++++
 kernel/time/timekeeping.c           | 56 +++++++++++++++++++++++++++--
 4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index c09ae06d7f68..37a9c8390055 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -140,7 +140,9 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 			ptp_read_system_prets(sts);
 			if (sts->pre_sts.cs_id == st->cs_id) {
 				cycle = sts->pre_sts.cycles;
+				sts->pre_sts.systime += sts->pre_sts.ntp_error;
 				sts->post_sts = sts->pre_sts;
+				pr_info("vmclock pre error %lld\n", sts->pre_sts.ntp_error);
 			} else if (sts->pre_sts.hw_csid == st->cs_id &&
 				   sts->pre_sts.hw_cycles) {
 				cycle = sts->pre_sts.hw_cycles;
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 5dc7f8bf2740..b487e7d925fe 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -97,6 +97,11 @@ struct tk_read_base {
  * @ntp_error_shift:		Shift conversion between clock shifted nano seconds and
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
+ * @ntp_err_frac:		Fractional part of the per-cycle NTP-ideal mult that the
+ *				integer @mult truncates, as a fraction of 2^32 in
+ *				clock-shifted nanoseconds per cycle. Used to
+ *				extrapolate @ntp_error to an arbitrary cycle count in
+ *				the lockless snapshot readers (ktime_get_snapshot_id).
  * @cs_tick_adj:		Per-second adjustment handed to NTP via ntp_clear()
  *				accounting for the difference between the nominal
  *				NTP interval and the real time taken by the
@@ -187,6 +192,7 @@ struct timekeeper {
 	s64			ntp_error;
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
+	u64			ntp_err_frac;
 	s64			cs_tick_adj;
 	u32			skip_second_overflow;
 	s64			skew_delta;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 984a866d293b..e53be1952021 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -283,6 +283,14 @@ static inline bool ktime_get_aux_ts64(clockid_t id, struct timespec64 *kt) { ret
  *			which @cycles was derived
  * @systime:		The system time of the selected CLOCK ID
  * @monoraw:		Monotonic raw system time
+ * @ntp_error:		Signed nanosecond offset of @systime from the ideal
+ *			NTP-disciplined time at @cycles. Extrapolated to @cycles
+ *			(so it is exact even when many cycles have elapsed since the
+ *			last timekeeping update, e.g. on a NO_HZ kernel) and includes
+ *			the sub-nanosecond fraction dropped when @systime was
+ *			truncated to whole ns. A consumer lands on the ideal line by
+ *			adding @ntp_error directly to @systime. Only meaningful for
+ *			CLOCK_REALTIME/CLOCK_MONOTONIC/CLOCK_BOOTTIME.
  * @cs_id:		Clocksource ID
  * @hw_csid:		Clocksource ID of the underlying hardware counter for derived
  *			clocksources which implement the read_snapshot() callback.
@@ -295,6 +303,7 @@ struct system_time_snapshot {
 	u64			hw_cycles;
 	ktime_t			systime;
 	ktime_t			monoraw;
+	s64			ntp_error;
 	enum clocksource_ids	cs_id;
 	enum clocksource_ids	hw_csid;
 	unsigned int		clock_was_set_seq;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index a67d2f27c73e..e319eca307ee 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -407,6 +407,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr_mono.mult = clock->mult;
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
+	tk->ntp_err_frac = 0;
 	tk->skip_second_overflow = 0;
 	tk->skew_delta = 0;
 
@@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
 
 		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
 		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+
+		/*
+		 * For the NTP-disciplined mono-based clocks, report how far
+		 * @systime is from the ideal NTP time at @now, in signed ns,
+		 * so a caller can land on the ideal line by adding it. Four
+		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
+		 *
+		 *  - tk->ntp_error, the deviation as of the last update;
+		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
+		 *    accrued since then (cycle_delta is at most a tick on a
+		 *    tickful kernel, but many ticks' worth under NO_HZ);
+		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
+		 *    mult dither over the same span;
+		 *  - the sub-ns fraction @systime dropped when the read was
+		 *    truncated to whole ns (low @shift bits, exact despite the
+		 *    multiply overflowing).
+		 *
+		 * RAW is undisciplined and AUX has its own discipline, so they
+		 * carry no ntp_error.
+		 */
+		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
+		    clock_id == CLOCK_BOOTTIME) {
+			u32 nes = tk->ntp_error_shift;
+			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
+					  tk->tkr_mono.mask;
+			s64 err = tk->ntp_error +
+				(((s64)mul_u64_u64_shr(cycle_delta,
+						       tk->ntp_err_frac, 32) -
+				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
+
+			err += (s64)((cycle_delta * tk->tkr_mono.mult +
+				      tk->tkr_mono.xtime_nsec) &
+				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
+			systime_snapshot->ntp_error =
+				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
+				NTP_SCALE_SHIFT;
+		} else {
+			systime_snapshot->ntp_error = 0;
+		}
 	} while (read_seqcount_retry(&tkd->seq, seq));
 
 	systime_snapshot->cycles = now;
@@ -2432,6 +2472,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
 	u64 ntp_tl = ntp_tick_length(tk->id);
 	s64 skew = ntp_get_skew_delta(tk->id);
+	u64 dividend;
 	u32 mult;
 
 	/*
@@ -2452,8 +2493,19 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		 * scale it back up to the full per-tick rate for the mult bias.
 		 */
 		skew *= NTP_INTERVAL_FREQ;
-		mult = div64_u64((tk->ntp_tick + skew) >> tk->ntp_error_shift,
-				 tk->cycle_interval);
+		dividend = (tk->ntp_tick + skew) >> tk->ntp_error_shift;
+		mult = div64_u64(dividend, tk->cycle_interval);
+		/*
+		 * Stash the fractional part of the per-cycle ideal mult that
+		 * the integer @mult discards, scaled by 2^32, in clock-shifted
+		 * ns per cycle. The lockless snapshot readers use it to
+		 * extrapolate @ntp_error forward over the cycles accumulated
+		 * since the last tick (which on a NO_HZ kernel may be many
+		 * ticks' worth).
+		 */
+		tk->ntp_err_frac = div64_u64((dividend - (u64)mult *
+					      tk->cycle_interval) << 32,
+					     tk->cycle_interval);
 	}
 
 	/*
-- 
2.43.0

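The ntp_err_frac computation in the timekeeping_adjust() hunk above amounts to a fixed-point long division: take the integer quotient as mult, then scale the remainder by 2^32 and divide again. A standalone sketch, assuming the hypothetical names split_mult() and struct mult_split, with plain 64-bit division in place of div64_u64():

```c
#include <assert.h>
#include <stdint.h>

struct mult_split {
	uint32_t mult;		/* integer part, what ends up in mult */
	uint64_t frac_x32;	/* discarded fraction, scaled by 2^32 */
};

static struct mult_split split_mult(uint64_t dividend, uint64_t cycle_interval)
{
	struct mult_split s;
	uint64_t rem;

	/* Model-only guard: rem << 32 below must not overflow 64 bits. */
	assert(cycle_interval > 0 && cycle_interval < (1ULL << 32));

	s.mult = (uint32_t)(dividend / cycle_interval);
	/* What the integer quotient left behind; always < cycle_interval. */
	rem = dividend - (uint64_t)s.mult * cycle_interval;
	s.frac_x32 = (rem << 32) / cycle_interval;
	return s;
}
```

split_mult(4194304500000ULL, 1000000) gives mult = 4194304 and frac_x32 = 2147483648, i.e. an ideal factor of 4194304.5 per cycle.
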

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
  2026-06-19  0:33 [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot David Woodhouse
@ 2026-06-19 13:34 ` Thomas Gleixner
  2026-06-19 15:34   ` David Woodhouse
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2026-06-19 13:34 UTC
  To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev

On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
>  
>  		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +
> +		/*
> +		 * For the NTP-disciplined mono-based clocks, report how far
> +		 * @systime is from the ideal NTP time at @now, in signed ns,
> +		 * so a caller can land on the ideal line by adding it. Four
> +		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> +		 *
> +		 *  - tk->ntp_error, the deviation as of the last update;
> +		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
> +		 *    accrued since then (cycle_delta is at most a tick on a
> +		 *    tickful kernel, but many ticks' worth under NO_HZ);
> +		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
> +		 *    mult dither over the same span;
> +		 *  - the sub-ns fraction @systime dropped when the read was
> +		 *    truncated to whole ns (low @shift bits, exact despite the
> +		 *    multiply overflowing).
> +		 *
> +		 * RAW is undisciplined and AUX has its own discipline, so they
> +		 * carry no ntp_error.

AUX has ntp_error too. AUX clocks have a per clock NTP instance, which
works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
needs to be excluded.

> +		 */
> +		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> +		    clock_id == CLOCK_BOOTTIME) {
> +			u32 nes = tk->ntp_error_shift;
> +			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> +					  tk->tkr_mono.mask;
> +			s64 err = tk->ntp_error +
> +				(((s64)mul_u64_u64_shr(cycle_delta,
> +						       tk->ntp_err_frac, 32) -
> +				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> +
> +			err += (s64)((cycle_delta * tk->tkr_mono.mult +
> +				      tk->tkr_mono.xtime_nsec) &
> +				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> +			systime_snapshot->ntp_error =
> +				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> +				NTP_SCALE_SHIFT;

This formatting makes my brain hurt. Can you please split that out into
a separate function?


/*
 * Big fat comment....
 */
static void snapshot_ntp_error(clockid_t clock_id, struct system_time_snapshot *snap,
			       struct timekeeper *tk, u64 now)
{
	if (clock_id == CLOCK_MONOTONIC_RAW) {
		snap->ntp_error = 0;
		return;
	}

	u64 cycle_delta = (now - tk->tkr_mono.cycle_last) & tk->tkr_mono.mask;
	u32 nes = tk->ntp_error_shift;
	s64 tmp, err = tk->ntp_error;

	err += ((s64)mul_u64_u64_shr(cycle_delta, tk->ntp_err_frac, 32) -
		(s64)(cycle_delta * tk->ntp_err_mult)) << nes;

	tmp = (s64)(cycle_delta * tk->tkr_mono.mult + tk->tkr_mono.xtime_nsec);
	tmp &= (1ULL << tk->tkr_mono.shift) - 1;
	err += tmp << nes;
	snap->ntp_error = (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}

or something readable like that.
                      

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
  2026-06-19 13:34 ` Thomas Gleixner
@ 2026-06-19 15:34   ` David Woodhouse
  2026-06-19 20:21     ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: David Woodhouse @ 2026-06-19 15:34 UTC
  To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev

On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> > @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
> >  
> >  		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> >  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +
> > +		/*
> > +		 * For the NTP-disciplined mono-based clocks, report how far
> > +		 * @systime is from the ideal NTP time at @now, in signed ns,
> > +		 * so a caller can land on the ideal line by adding it. Four
> > +		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> > +		 *
> > +		 *  - tk->ntp_error, the deviation as of the last update;
> > +		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
> > +		 *    accrued since then (cycle_delta is at most a tick on a
> > +		 *    tickful kernel, but many ticks' worth under NO_HZ);
> > +		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
> > +		 *    mult dither over the same span;
> > +		 *  - the sub-ns fraction @systime dropped when the read was
> > +		 *    truncated to whole ns (low @shift bits, exact despite the
> > +		 *    multiply overflowing).
> > +		 *
> > +		 * RAW is undisciplined and AUX has its own discipline, so they
> > +		 * carry no ntp_error.
> 
> AUX has ntp_error too. AUX clocks have a per clock NTP instance, which
> works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
> needs to be excluded.

Ack.

> > +		 */
> > +		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> > +		    clock_id == CLOCK_BOOTTIME) {
> > +			u32 nes = tk->ntp_error_shift;
> > +			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> > +					  tk->tkr_mono.mask;
> > +			s64 err = tk->ntp_error +
> > +				(((s64)mul_u64_u64_shr(cycle_delta,
> > +						       tk->ntp_err_frac, 32) -
> > +				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> > +
> > +			err += (s64)((cycle_delta * tk->tkr_mono.mult +
> > +				      tk->tkr_mono.xtime_nsec) &
> > +				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> > +			systime_snapshot->ntp_error =
> > +				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> > +				NTP_SCALE_SHIFT;
> 
> This formatting makes my brain hurt. Can you please split that out into
> a separate function?

Yep. There's also a potential error there — an *additional* discrepancy
comes from the enforced monotonicity that timekeeping_cycles_to_ns()
applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).

I couldn't work out if I cared about the clocksource-is-non-monotonic
case, and even if I did, what I should do about it.
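
For reference, the clamp in question can be modelled like this. It is a simplified userspace sketch written from memory of timekeeping_cycles_to_ns(), not a copy of it: struct tkr_model and model_cycles_to_ns() are hypothetical names, and a 128-bit multiply (GCC/Clang extension) replaces the kernel's overflow-safe helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the tk_read_base fields involved. */
struct tkr_model {
	uint64_t cycle_last;	/* counter value at the last update */
	uint64_t mask;		/* clocksource counter mask */
	uint64_t max_cycles;	/* largest delta safe for the 64-bit multiply */
	uint64_t xtime_nsec;	/* ns << shift at cycle_last */
	uint32_t mult;
	uint32_t shift;
};

static uint64_t model_cycles_to_ns(const struct tkr_model *tkr, uint64_t cycles)
{
	uint64_t delta = (cycles - tkr->cycle_last) & tkr->mask;

	assert(tkr->shift < 64);

	if (delta > tkr->max_cycles) {
		/*
		 * Top bit of the masked delta set: the counter read is
		 * behind cycle_last. Clamp to the time of the last update
		 * so the result never goes backwards.
		 */
		if (delta & ~(tkr->mask >> 1))
			return tkr->xtime_nsec >> tkr->shift;

		/* Genuinely large delta: widen the multiply instead. */
		return (uint64_t)((((unsigned __int128)delta * tkr->mult) +
				   tkr->xtime_nsec) >> tkr->shift);
	}

	return (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}
```

With cycle_last = 1000 and xtime_nsec holding 5 ns, a read of 997 returns 5 rather than a value in the past. As far as this model goes, the patch's cycle_delta for the same read would be (997 - 1000) & mask, a very large number, so the extrapolated ntp_error would not be meaningful in that window.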

I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
or something like that, such that e.g. PTP clients could *ask* for it.

It's all very well hard-coding it in pps_get_ts() and unconditionally
changing the behaviour... I *think* we could justify that. But the
example I actually used in the patch was PTP, and that's slightly
harder to justify the behavioural change.

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
  2026-06-19 15:34   ` David Woodhouse
@ 2026-06-19 20:21     ` Thomas Gleixner
  2026-06-19 20:57       ` David Woodhouse
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2026-06-19 20:21 UTC
  To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev

On Fri, Jun 19 2026 at 16:34, David Woodhouse wrote:
> On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
>> 
>> This formatting makes my brain hurt. Can you please split that out into
>> a separate function?
>
> Yep. There's also a potential error there — an *additional* discrepancy
> comes from the enforced monotonicity that timekeeping_cycles_to_ns()
> applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).
>
> I couldn't work out if I cared about the clocksource-is-non-monotonic
> case, and even if I did, what I should do about it.

I think the right thing is just to ignore it.

The problem is very narrow and mostly related to the historically badly
synchronized TSC between sockets. The TSC_ADJUST fixup is obviously
error prone as it adjusts only to the point where the error is no
longer observable. But in the update transition phase it can result in
time going backwards because the readout on the other CPU is slightly
behind tk::tkr_mono::cycle_last. That happens only once in a while and
we talk about a very low single digit number of TSC cycles.

> I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
> or something like that, such that e.g. PTP clients could *ask* for it.

Hell no!

> It's all very well hard-coding it in pps_get_ts() and unconditionally
> changing the behaviour... I *think* we could justify that. But the
> example I actually used in the patch was PTP, and that's slightly
> harder to justify the behavioural change.

Just leave it alone.

If the TSCs between sockets are slightly out of [mostly unobservable]
sync then if you don't hit this corner case at the edge of the update
then you have to live with that discrepancy anyway as you don't know
about it at all. So making a magic extra case for this unlikely event is
overkill. Due to speculation, caches etc. pp the snapshot is anyway in
that low single digit TSC cycles margin of inaccuracy.

Don't try to defeat reality and the underlying physics. Perfect is the
enemy of good.

Thanks,

        tglx

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
  2026-06-19 20:21     ` Thomas Gleixner
@ 2026-06-19 20:57       ` David Woodhouse
  0 siblings, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2026-06-19 20:57 UTC
  To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev

On Fri, 2026-06-19 at 22:21 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 16:34, David Woodhouse wrote:
> > On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> > > 
> > > This formatting makes my brain hurt. Can you please split that out into
> > > a separate function?
> > 
> > Yep. There's also a potential error there — an *additional* discrepancy
> > comes from the enforced monotonicity that timekeeping_cycles_to_ns()
> > > applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).
> > 
> > I couldn't work out if I cared about the clocksource-is-non-monotonic
> > > case, and even if I did, what I should do about it.
> 
> I think the right thing is just to ignore it.

Yeah, that was basically my conclusion; I had just meant to *mention*
it when posting the RFC.

> The problem is very narrow and mostly related to the historically badly
> synchronized TSC between sockets. The TSC_ADJUST fixup is obviously
> error prone as it adjusts only to the point where the error is no
> longer observable. But in the update transition phase it can result in
> time going backwards because the readout on the other CPU is slightly
> behind tk::tkr_mono::cycle_last. That happens only once in a while and
> we talk about a very low single digit number of TSC cycles.
> 
> > I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
> > or something like that, such that e.g. PTP clients could *ask* for it.
> 
> Hell no!

That was not about the above clocksource nonsense; that was the
question of what the caller (in my example case, the vmclock PTP
snapshot) should *do* with the reported error value.

If I just unconditionally "correct" the CLOCK_REALTIME values then
that's arguably an ABI change. We're silently reporting something
*different* to what we did before.

Maybe that's OK... as I said, in the PPS case we can justify it and
just call it a bug fix?

Or maybe we want a way for callers (not of ktime_get_snapshot_id()
itself, but *their* callers) to *ask* for the "corrected" value
instead. I happened to call that CLOCK_REALTIME_NONMONOTONIC as a straw
man, just because monotonicity is *one* of the reasons why we present
the xtime values that we do, not always the raw "corrected" values.
