linux-kselftest.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
@ 2025-03-20 20:03 John Stultz
  2025-03-20 20:03 ` [PATCH v2 2/2] selftests/timers: Improve skew_consistency by testing with other clockids John Stultz
  2025-03-25 11:32 ` [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids Miroslav Lichvar
  0 siblings, 2 replies; 23+ messages in thread
From: John Stultz @ 2025-03-20 20:03 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Miroslav Lichvar,
	linux-kselftest, kernel-team, Lei Chen

Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing
time inconsistencies.

Lei tracked down that this was being caused by the adjustment
  tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in
offset when the mult value is adjusted forward, so that
the non-_COARSE clockids don't see inconsistencies.

However, the _COARSE clockids don't use the mult*offset value
in their calculations, so this subtraction can cause the
_COARSE clock ids to jump back a bit.

Now, by design, this negative adjustment should be fine, because
the logic run from timekeeping_adjust() is done after we
accumulate approx mult*interval_cycles into xtime_nsec.
The accumulated (mult*interval_cycles) will be larger then the
(mult_adj*offset) value subtracted from xtime_nsec, and both
operations are done together under the tk_core.lock, so the net
change to xtime_nsec should always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well,
since we want to apply the ntp freq adjustment immediately.
In this case, we don't return early when the offset is smaller
then interval_cycles, so we don't end up accumulating any time
into xtime_nsec. But we do go on to call timekeeping_adjust(),
which modifies the mult value, and subtracts from xtime_nsec
to correct for the new mult value.

Here because we did not accumulate anything, we have a window
where the _COARSE clockids that don't utilize the mult*offset
value, can see an inconsistency.

So to fix this, rework the timekeeping_advance() logic a bit
so that when we are called from do_adjtimex(), we call
timekeeping_forward(), to first accumulate the sub-interval
time into xtime_nsec. Then with no unaccumulated cycles in
offset, we can do the mult adjustment without worry of the
subtraction having an impact.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Cc: Lei Chen <lei.chen@smartx.com>
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
Additional-fixes-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
---
v2: Include fixes from Thomas, dropping the unnecessary clock_set
    setting, and instead clearing ntp_error, along with some other
    minor tweaks.
---
 kernel/time/timekeeping.c | 94 ++++++++++++++++++++++++++++-----------
 1 file changed, 69 insertions(+), 25 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f1955..929846b8b45ab 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -682,20 +682,19 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
 }
 
 /**
- * timekeeping_forward_now - update clock to the current time
+ * timekeeping_forward - update clock to given cycle now value
  * @tk:		Pointer to the timekeeper to update
+ * @cycle_now:  Current clocksource read value
  *
  * Forward the current clock to update its state since the last call to
  * update_wall_time(). This is useful before significant clock changes,
  * as it avoids having to deal with this time offset explicitly.
  */
-static void timekeeping_forward_now(struct timekeeper *tk)
+static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now)
 {
-	u64 cycle_now, delta;
+	u64 delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+				      tk->tkr_mono.clock->max_raw_delta);
 
-	cycle_now = tk_clock_read(&tk->tkr_mono);
-	delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
-				  tk->tkr_mono.clock->max_raw_delta);
 	tk->tkr_mono.cycle_last = cycle_now;
 	tk->tkr_raw.cycle_last  = cycle_now;
 
@@ -710,6 +709,21 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	}
 }
 
+/**
+ * timekeeping_forward_now - update clock to the current time
+ * @tk:		Pointer to the timekeeper to update
+ *
+ * Forward the current clock to update its state since the last call to
+ * update_wall_time(). This is useful before significant clock changes,
+ * as it avoids having to deal with this time offset explicitly.
+ */
+static void timekeeping_forward_now(struct timekeeper *tk)
+{
+	u64 cycle_now = tk_clock_read(&tk->tkr_mono);
+
+	timekeeping_forward(tk, cycle_now);
+}
+
 /**
  * ktime_get_real_ts64 - Returns the time of day in a timespec64.
  * @ts:		pointer to the timespec to be set
@@ -2151,6 +2165,54 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	return offset;
 }
 
+static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset,
+				  enum timekeeping_adv_mode mode,
+				  unsigned int *clock_set)
+{
+	int shift = 0, maxshift;
+
+	/*
+	 * TK_ADV_FREQ indicates that adjtimex(2) directly set the
+	 * frequency or the tick length.
+	 *
+	 * Accumulate the offset, so that the new multiplier starts from
+	 * now. This is required as otherwise for offsets, which are
+	 * smaller than tk::cycle_interval, timekeeping_adjust() could set
+	 * xtime_nsec backwards, which subsequently causes time going
+	 * backwards in the coarse time getters. But even for the case
+	 * where offset is greater than tk::cycle_interval the periodic
+	 * accumulation does not have much value.
+	 *
+	 * Also reset tk::ntp_error as it does not make sense to keep the
+	 * old accumulated error around in this case.
+	 */
+	if (mode == TK_ADV_FREQ) {
+		timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset);
+		tk->ntp_error = 0;
+		return 0;
+	}
+
+	/*
+	 * With NO_HZ we may have to accumulate many cycle_intervals
+	 * (think "ticks") worth of time at once. To do this efficiently,
+	 * we calculate the largest doubling multiple of cycle_intervals
+	 * that is smaller than the offset.  We then accumulate that
+	 * chunk in one go, and then try to consume the next smaller
+	 * doubled multiple.
+	 */
+	shift = ilog2(offset) - ilog2(tk->cycle_interval);
+	shift = max(0, shift);
+	/* Bound shift to one less than what overflows tick_length */
+	maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1;
+	shift = min(shift, maxshift);
+	while (offset >= tk->cycle_interval) {
+		offset = logarithmic_accumulation(tk, offset, shift, clock_set);
+		if (offset < tk->cycle_interval << shift)
+			shift--;
+	}
+	return offset;
+}
+
 /*
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
@@ -2160,7 +2222,6 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	unsigned int clock_set = 0;
-	int shift = 0, maxshift;
 	u64 offset;
 
 	guard(raw_spinlock_irqsave)(&tk_core.lock);
@@ -2177,24 +2238,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
 		return false;
 
-	/*
-	 * With NO_HZ we may have to accumulate many cycle_intervals
-	 * (think "ticks") worth of time at once. To do this efficiently,
-	 * we calculate the largest doubling multiple of cycle_intervals
-	 * that is smaller than the offset.  We then accumulate that
-	 * chunk in one go, and then try to consume the next smaller
-	 * doubled multiple.
-	 */
-	shift = ilog2(offset) - ilog2(tk->cycle_interval);
-	shift = max(0, shift);
-	/* Bound shift to one less than what overflows tick_length */
-	maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
-	shift = min(shift, maxshift);
-	while (offset >= tk->cycle_interval) {
-		offset = logarithmic_accumulation(tk, offset, shift, &clock_set);
-		if (offset < tk->cycle_interval<<shift)
-			shift--;
-	}
+	offset = timekeeping_accumulate(tk, offset, mode, &clock_set);
 
 	/* Adjust the multiplier to correct NTP error */
 	timekeeping_adjust(tk, offset);
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/2] selftests/timers: Improve skew_consistency by testing with other clockids
  2025-03-20 20:03 [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids John Stultz
@ 2025-03-20 20:03 ` John Stultz
  2025-03-25 11:32 ` [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids Miroslav Lichvar
  1 sibling, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-03-20 20:03 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, Miroslav Lichvar,
	linux-kselftest, kernel-team, Lei Chen

Lei Chen reported a bug with CLOCK_MONOTONIC_COARSE having
inconsistencies when NTP is adjusting the clock frequency.

This has gone seemingly undetected for ~15 years, illustrating a
clear gap in our testing.

The skew_consistency test is intended to catch this sort of
problem, but was focused on only evaluating CLOCK_MONOTONIC, and
thus missed the problem on CLOCK_MONOTONIC_COARSE.

So adjust the test to run with all clockids for 60 seconds each
instead of 10 minutes with just CLOCK_MONOTONIC.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Cc: Lei Chen <lei.chen@smartx.com>
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Signed-off-by: John Stultz <jstultz@google.com>
---
 tools/testing/selftests/timers/skew_consistency.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/timers/skew_consistency.c b/tools/testing/selftests/timers/skew_consistency.c
index 83450145fe657..46c391d7f45dc 100644
--- a/tools/testing/selftests/timers/skew_consistency.c
+++ b/tools/testing/selftests/timers/skew_consistency.c
@@ -47,7 +47,7 @@ int main(int argc, char **argv)
 
 	pid = fork();
 	if (!pid)
-		return system("./inconsistency-check -c 1 -t 600");
+		return system("./inconsistency-check -t 60");
 
 	ppm = 500;
 	ret = 0;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-20 20:03 [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids John Stultz
  2025-03-20 20:03 ` [PATCH v2 2/2] selftests/timers: Improve skew_consistency by testing with other clockids John Stultz
@ 2025-03-25 11:32 ` Miroslav Lichvar
  2025-03-27  9:22   ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: Miroslav Lichvar @ 2025-03-25 11:32 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Thomas Gleixner, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Mar 20, 2025 at 01:03:00PM -0700, John Stultz wrote:
> +static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset,
> +				  enum timekeeping_adv_mode mode,
> +				  unsigned int *clock_set)

> +	 * Also reset tk::ntp_error as it does not make sense to keep the
> +	 * old accumulated error around in this case.
> +	 */

I'm not sure if I still understand the timekeeping code correctly, but
that doesn't seem right to me. At least the comment should explain why
it does not make sense to keep the NTP error.

Resetting the NTP error causes a small time step. An NTP/PTP client
can be setting the frequency very frequently, e.g. up to 128 times per
second and the interval between updates can be random. If the timing
was right, I suspect it could cause a measurable drift. The client
should be able to compensate for it, but why make its job harder by
making the clock less predictable. My expectation for the clock is
that its frequency will not change if the same (or only slightly
different) frequency is set repeatedly by adjtimex().

> +	if (mode == TK_ADV_FREQ) {
> +		timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset);
> +		tk->ntp_error = 0;
> +		return 0;
> +	}

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-25 11:32 ` [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids Miroslav Lichvar
@ 2025-03-27  9:22   ` Thomas Gleixner
  2025-03-27 15:42     ` Miroslav Lichvar
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-03-27  9:22 UTC (permalink / raw)
  To: Miroslav Lichvar, John Stultz
  Cc: LKML, Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Shuah Khan, linux-kselftest, kernel-team, Lei Chen

On Tue, Mar 25 2025 at 12:32, Miroslav Lichvar wrote:
> On Thu, Mar 20, 2025 at 01:03:00PM -0700, John Stultz wrote:
>> +static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset,
>> +				  enum timekeeping_adv_mode mode,
>> +				  unsigned int *clock_set)
>
>> +	 * Also reset tk::ntp_error as it does not make sense to keep the
>> +	 * old accumulated error around in this case.
>> +	 */
>
> I'm not sure if I still understand the timekeeping code correctly, but
> that doesn't seem right to me. At least the comment should explain why
> it does not make sense to keep the NTP error.
>
> Resetting the NTP error causes a small time step. An NTP/PTP client
> can be setting the frequency very frequently, e.g. up to 128 times per
> second and the interval between updates can be random. If the timing

I never observed that behaviour, but I'm not a NTP/PTP wizard/power user.

> was right, I suspect it could cause a measurable drift. The client
> should be able to compensate for it, but why make its job harder by
> making the clock less predictable. My expectation for the clock is
> that its frequency will not change if the same (or only slightly
> different) frequency is set repeatedly by adjtimex().

The point is that timekeeper::ntp_error accumulates the error between
NTP and the clock interval. With John's change to forward the clock in
case of adjtimex() setting the tick length or frequency, the previously
accumulated information is out of sync because the forwarding resets the
period asynchronously.

The fundamental property of the timekeeper adjustment is that it
advances everything in multiples of the clock interval. The clock
interval is the number of hardware clock increments per tick, which has
been determined from the initial multiplier/shift pair of the clock
source at the point where the clock source is installed as the
timekeeper source. It never changes throughout the life time of the
clocksource.

The original implementation respected this base period, but John's
approach of forwarding, which cures the coarse time getter issue,
violates it. As a consequence the previous error accumulation is not
longer based on the base period because the period has been reset to the
random point in time when adjtimex() was invoked, which makes the error
accumulation a random number.

There are two ways to deal with that. Both require to revert this
change completely.

   1) Handle the coarse time getter problem seperately and leave the
      existing adjtimex logic alone. That was my initial suggestion:

      https://lore.kernel.org/all/87cyej5rid.ffs@tglx

   2) Handle adjtimex(ADJ_TICK/ADJ_FREQUENCY) at the next tick boundary
      instead of doing it immediately at the random point in time when
      adjtimex() is invoked.

      That cures the coarse time getter problem as well, but obviously
      delays the multiplier update to the next tick, which means that
      only the last adjtimex(ADJ_TICK/ADJ_FREQUENCY) invocation between
      two ticks becomes effective.

      From a pure mathematical point of view, this is keeping everything
      consistent. A quick test shows that it works. Though again, I'm
      not the NTP wizard here and don't know which dragons are lurking
      in the NTP/PTP clients.

      Patch on top of the revert below. That requires some thought
      vs. NOHZ delaying the next tick, but that's a solvable problem.

Thanks,

        tglx
---
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -34,14 +34,6 @@
 
 #define TK_UPDATE_ALL		(TK_CLEAR_NTP | TK_CLOCK_WAS_SET)
 
-enum timekeeping_adv_mode {
-	/* Update timekeeper when a tick has passed */
-	TK_ADV_TICK,
-
-	/* Update timekeeper on a direct frequency change */
-	TK_ADV_FREQ
-};
-
 /*
  * The most important data for readout fits into a single 64 byte
  * cache line.
@@ -2155,7 +2147,7 @@ static u64 logarithmic_accumulation(stru
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
  */
-static bool timekeeping_advance(enum timekeeping_adv_mode mode)
+static bool timekeeping_advance(void)
 {
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
 	struct timekeeper *real_tk = &tk_core.timekeeper;
@@ -2173,8 +2165,8 @@ static bool timekeeping_advance(enum tim
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				   tk->tkr_mono.clock->max_raw_delta);
 
-	/* Check if there's really nothing to do */
-	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
+	/* Check if there's really something to do */
+	if (offset < real_tk->cycle_interval)
 		return false;
 
 	/*
@@ -2216,7 +2208,7 @@ static bool timekeeping_advance(enum tim
  */
 void update_wall_time(void)
 {
-	if (timekeeping_advance(TK_ADV_TICK))
+	if (timekeeping_advance())
 		clock_was_set_delayed();
 }
 
@@ -2548,10 +2540,6 @@ int do_adjtimex(struct __kernel_timex *t
 
 	audit_ntp_log(&ad);
 
-	/* Update the multiplier immediately if frequency was set directly */
-	if (txc->modes & (ADJ_FREQUENCY | ADJ_TICK))
-		clock_set |= timekeeping_advance(TK_ADV_FREQ);
-
 	if (clock_set)
 		clock_was_set(CLOCK_SET_WALL);
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-27  9:22   ` Thomas Gleixner
@ 2025-03-27 15:42     ` Miroslav Lichvar
  2025-03-27 17:32       ` Thomas Gleixner
  2025-03-31 14:53       ` Miroslav Lichvar
  0 siblings, 2 replies; 23+ messages in thread
From: Miroslav Lichvar @ 2025-03-27 15:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote:
> On Tue, Mar 25 2025 at 12:32, Miroslav Lichvar wrote:
> > Resetting the NTP error causes a small time step. An NTP/PTP client
> > can be setting the frequency very frequently, e.g. up to 128 times per
> > second and the interval between updates can be random. If the timing
> 
> I never observed that behaviour, but I'm not a NTP/PTP wizard/power user.

On a machine that has a /dev/ptp device available, a simple test to
observe such a high update rate is to run:

1) phc_ctl /dev/ptp0 set
2) phc2sys -m -q -O 0 -s /dev/ptp0 -R 128
or alternatively
2) chronyd -d 'refclock PHC /dev/ptp0 poll -7'

> The original implementation respected this base period, but John's
> approach of forwarding, which cures the coarse time getter issue,
> violates it. As a consequence the previous error accumulation is not
> longer based on the base period because the period has been reset to the
> random point in time when adjtimex() was invoked, which makes the error
> accumulation a random number.

I see, so that value of the NTP error is already wrong at that point
where it's reset to 0.

To clearly see the difference with the new code, I made an attempt
to update the old linux-tktest simulation that was used back when the
multiplier adjustment was reworked, but there are too many missing
things now and I gave up.

Maybe I could simply patch the kernel to force a small clock
multiplier to increase the rate at which the error accumulates.

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-27 15:42     ` Miroslav Lichvar
@ 2025-03-27 17:32       ` Thomas Gleixner
  2025-03-31  7:53         ` Miroslav Lichvar
  2025-03-31 14:53       ` Miroslav Lichvar
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-03-27 17:32 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Mar 27 2025 at 16:42, Miroslav Lichvar wrote:
> On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote:
>> The original implementation respected this base period, but John's
>> approach of forwarding, which cures the coarse time getter issue,
>> violates it. As a consequence the previous error accumulation is not
>> longer based on the base period because the period has been reset to the
>> random point in time when adjtimex() was invoked, which makes the error
>> accumulation a random number.
>
> I see, so that value of the NTP error is already wrong at that point
> where it's reset to 0.
>
> To clearly see the difference with the new code, I made an attempt
> to update the old linux-tktest simulation that was used back when the
> multiplier adjustment was reworked, but there are too many missing
> things now and I gave up.

Can you point me to that code?

It would be probably useful to create a test mechanism which allows to
exercise all of this in a simulated way so we actually don't have to
wonder every time we change a bit what the consequences are.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-27 17:32       ` Thomas Gleixner
@ 2025-03-31  7:53         ` Miroslav Lichvar
  2025-04-17  2:55           ` John Stultz
  0 siblings, 1 reply; 23+ messages in thread
From: Miroslav Lichvar @ 2025-03-31  7:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Mar 27, 2025 at 06:32:27PM +0100, Thomas Gleixner wrote:
> On Thu, Mar 27 2025 at 16:42, Miroslav Lichvar wrote:
> > On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote:
> > To clearly see the difference with the new code, I made an attempt
> > to update the old linux-tktest simulation that was used back when the
> > multiplier adjustment was reworked, but there are too many missing
> > things now and I gave up.
> 
> Can you point me to that code?

It's this thing: https://github.com/mlichvar/linux-tktest

> It would be probably useful to create a test mechanism which allows to
> exercise all of this in a simulated way so we actually don't have to
> wonder every time we change a bit what the consequences are.

Yes, that would be very nice if we could run the timekeeping code in a
deterministic simulated environment with a configurable clocksource,
timing of kernel updates, timing and values of injected adjtimex()
calls, etc. The question is how to isolate it.

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-27 15:42     ` Miroslav Lichvar
  2025-03-27 17:32       ` Thomas Gleixner
@ 2025-03-31 14:53       ` Miroslav Lichvar
  2025-04-01  6:34         ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: Miroslav Lichvar @ 2025-03-31 14:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Mar 27, 2025 at 04:42:49PM +0100, Miroslav Lichvar wrote:
> Maybe I could simply patch the kernel to force a small clock
> multiplier to increase the rate at which the error accumulates.

I tried that and it indeed makes the issue clearly visible. The COARSE
fix makes the clock less stable. It's barely visible with the normal
multiplier, at least for the clocksource I tested, but a reduced
multiplier forces a larger NTP error and raises it above the precision
and instability of the system and reference clocks.

The test was done on a machine with a TSC clocksource (3GHz CPU with
disabled frequency scaling - normal multplier is 5592407) and tried a
multiplier reduced by 4, 16, 64 with this COARSE-fixing patch not
applied and applied. Each test ran for 1 minute and produced an
average value of skew - stability of the clock frequency as reported
by chronyd in the tracking log when synchronizing to a free-running
PTP clock at 64, 16, and 4 updates per second. It's in parts per
million (resolution in the chrony log is limited to 0.001 ppm).

Mult reduction	Updates/sec	Skew before	Skew after
1		4		0.000		0.000
1		16		0.001		0.002
1		64		0.002		0.006
4		4		0.001		0.001
4		16		0.003		0.005
4		64		0.005		0.015
16		4		0.004		0.009
16		16		0.011		0.069
16		64		0.020		0.117
64		4		0.013		0.012
64		16		0.030		0.107
64		64		0.058		0.879

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-31 14:53       ` Miroslav Lichvar
@ 2025-04-01  6:34         ` Thomas Gleixner
  2025-04-01 11:19           ` Miroslav Lichvar
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-01  6:34 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote:
> On Thu, Mar 27, 2025 at 04:42:49PM +0100, Miroslav Lichvar wrote:
>> Maybe I could simply patch the kernel to force a small clock
>> multiplier to increase the rate at which the error accumulates.
>
> I tried that and it indeed makes the issue clearly visible. The COARSE
> fix makes the clock less stable. It's barely visible with the normal
> multiplier, at least for the clocksource I tested, but a reduced
> multiplier forces a larger NTP error and raises it above the precision
> and instability of the system and reference clocks.
>
> The test was done on a machine with a TSC clocksource (3GHz CPU with
> disabled frequency scaling - normal multplier is 5592407) and tried a
> multiplier reduced by 4, 16, 64 with this COARSE-fixing patch not
> applied and applied. Each test ran for 1 minute and produced an
> average value of skew - stability of the clock frequency as reported
> by chronyd in the tracking log when synchronizing to a free-running
> PTP clock at 64, 16, and 4 updates per second. It's in parts per
> million (resolution in the chrony log is limited to 0.001 ppm).
>
> Mult reduction	Updates/sec	Skew before	Skew after
> 1		4		0.000		0.000
> 1		16		0.001		0.002
> 1		64		0.002		0.006
> 4		4		0.001		0.001
> 4		16		0.003		0.005
> 4		64		0.005		0.015
> 16		4		0.004		0.009
> 16		16		0.011		0.069
> 16		64		0.020		0.117
> 64		4		0.013		0.012
> 64		16		0.030		0.107
> 64		64		0.058		0.879

Hrm.

Can you try the delta patch below?

Thanks,

        tglx
---
 kernel/time/timekeeping.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2234,8 +2234,8 @@ static bool timekeeping_advance(enum tim
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				   tk->tkr_mono.clock->max_raw_delta);
 
-	/* Check if there's really nothing to do */
-	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
+	/* Check if there's really something to do */
+	if (offset < real_tk->cycle_interval)
 		return false;
 
 	offset = timekeeping_accumulate(tk, offset, mode, &clock_set);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-04-01  6:34         ` Thomas Gleixner
@ 2025-04-01 11:19           ` Miroslav Lichvar
  2025-04-01 18:29             ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Miroslav Lichvar @ 2025-04-01 11:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Tue, Apr 01, 2025 at 08:34:23AM +0200, Thomas Gleixner wrote:
> On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote:
> > Mult reduction	Updates/sec	Skew before	Skew after
> > 16		4		0.004		0.009
> > 16		16		0.011		0.069
> > 16		64		0.020		0.117
> > 64		4		0.013		0.012
> > 64		16		0.030		0.107
> > 64		64		0.058		0.879
> 
> Hrm.
> 
> Can you try the delta patch below?

It seems to improve the worst cases, but overall it's still
a regression.

Mult reduction	Updates/sec	Skew
16		4		0.012
16		16		0.014
16		64	        0.033
64		4		0.012
64		16		0.105
64		64		0.138

Thanks,

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-04-01 11:19           ` Miroslav Lichvar
@ 2025-04-01 18:29             ` Thomas Gleixner
  2025-04-03  8:32               ` Miroslav Lichvar
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-01 18:29 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Tue, Apr 01 2025 at 13:19, Miroslav Lichvar wrote:
> On Tue, Apr 01, 2025 at 08:34:23AM +0200, Thomas Gleixner wrote:
>> On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote:
>> > Mult reduction	Updates/sec	Skew before	Skew after
>> > 16		4		0.004		0.009
>> > 16		16		0.011		0.069
>> > 16		64		0.020		0.117
>> > 64		4		0.013		0.012
>> > 64		16		0.030		0.107
>> > 64		64		0.058		0.879
>> 
>> Hrm.
>> 
>> Can you try the delta patch below?
>
> It seems to improve the worst cases, but overall it's still
> a regression.
>
> Mult reduction	Updates/sec	Skew
> 16		4		0.012
> 16		16		0.014
> 16		64	        0.033
> 64		4		0.012
> 64		16		0.105
> 64		64		0.138

That's weird as it only delays the update to the next tick. My original
approach of maintaining seperate state for the coarse time keeper is
preserving the existing NTP behaviour.

Patch applies after reverting 757b000f7b93 ("timekeeping: Fix possible
inconsistencies in _COARSE clockids").

Thanks,

        tglx
---
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @offs_real:			Offset clock monotonic -> clock realtime
  * @offs_boot:			Offset clock monotonic -> clock boottime
  * @offs_tai:			Offset clock monotonic -> clock tai
- * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_nsec:		The nanoseconds part for coarse time getters
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
  * @clock_was_set_seq:		The sequence number of clock was set events
@@ -76,6 +76,7 @@ struct tk_read_base {
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
  * @skip_second_overflow:	Flag used to avoid updating NTP twice with same second
+ * @tai_offset:			The current UTC to TAI offset in seconds
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
@@ -100,7 +101,7 @@ struct tk_read_base {
  * which results in the following cacheline layout:
  *
  * 0:	seqcount, tkr_mono
- * 1:	xtime_sec ... tai_offset
+ * 1:	xtime_sec ... coarse_nsec
  * 2:	tkr_raw, raw_sec
  * 3,4: Internal variables
  *
@@ -121,7 +122,7 @@ struct timekeeper {
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
 	ktime_t			offs_tai;
-	s32			tai_offset;
+	u32			coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base	tkr_raw;
@@ -144,6 +145,7 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s32			tai_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -31,6 +31,7 @@
 
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_CLOCK_WAS_SET	(1 << 1)
+#define TK_RETAIN_COARSE	(1 << 2)
 
 #define TK_UPDATE_ALL		(TK_CLEAR_NTP | TK_CLOCK_WAS_SET)
 
@@ -164,6 +165,15 @@ static inline struct timespec64 tk_xtime
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
@@ -636,7 +646,34 @@ static void timekeeping_restore_shadow(s
 	memcpy(&tkd->shadow_timekeeper, &tkd->timekeeper, sizeof(tkd->timekeeper));
 }
 
-static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int action)
+/*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec is adjusted when the multiplication
+ * factor of the clock is adjusted. See timekeeping_apply_adjustment().
+ *
+ * This is required because tk_read::cycle_last must be advanced by
+ * timekeeper::cycle_interval so that the accumulation happens with a
+ * periodic reference.
+ *
+ * But that adjustment of xtime_nsec can make it go backward to compensate
+ * for a larger multiplicator.
+ *
+ * @offset contains the leftover cycles which were not accumulated.
+ * Therefore the nanoseconds portion of the time when the clocksource was
+ * read in timekeeping_advance() is:
+ *
+ *	nsec = (xtime_nsec + offset * mult) >> shift;
+ *
+ * Calculate that value and store it in timekeeper::coarse_nsec, from where
+ * the coarse time getters consume it.
+ */
+static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
+{
+	offset *= tk->tkr_mono.mult;
+	tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
+}
+
+static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int action, u64 offset)
 {
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
 
@@ -659,6 +696,9 @@ static void timekeeping_update_from_shad
 	tk_update_leap_state(tk);
 	tk_update_ktime_data(tk);
 
+	if (!(action & TK_RETAIN_COARSE))
+		tk_update_coarse_nsecs(tk, offset);
+
 	update_vsyscall(tk);
 	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
@@ -804,8 +844,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset)
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -813,7 +853,7 @@ ktime_t ktime_get_coarse_with_offset(enu
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -1374,7 +1414,7 @@ int do_settimeofday64(const struct times
 
 		tk_set_wall_to_mono(tks, timespec64_sub(tks->wall_to_monotonic, ts_delta));
 		tk_set_xtime(tks, ts);
-		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL);
+		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0);
 	}
 
 	/* Signal hrtimers about time change */
@@ -1413,7 +1453,7 @@ static int timekeeping_inject_offset(con
 
 		tk_xtime_add(tks, ts);
 		tk_set_wall_to_mono(tks, timespec64_sub(tks->wall_to_monotonic, *ts));
-		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL);
+		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0);
 	}
 
 	/* Signal hrtimers about time change */
@@ -1493,7 +1533,7 @@ static int change_clocksource(void *data
 		timekeeping_forward_now(tks);
 		old = tks->tkr_mono.clock;
 		tk_setup_internals(tks, new);
-		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL);
+		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0);
 	}
 
 	if (old) {
@@ -1690,7 +1730,7 @@ void __init timekeeping_init(void)
 
 	tk_set_wall_to_mono(tks, wall_to_mono);
 
-	timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET);
+	timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET, 0);
 }
 
 /* time in seconds when suspend began for persistent clock */
@@ -1774,7 +1814,7 @@ void timekeeping_inject_sleeptime64(cons
 		suspend_timing_needed = false;
 		timekeeping_forward_now(tks);
 		__timekeeping_inject_sleeptime(tks, delta);
-		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL);
+		timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0);
 	}
 
 	/* Signal hrtimers about time change */
@@ -1834,7 +1874,7 @@ void timekeeping_resume(void)
 
 	tks->ntp_error = 0;
 	timekeeping_suspended = 0;
-	timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET);
+	timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET, 0);
 	raw_spin_unlock_irqrestore(&tk_core.lock, flags);
 
 	touch_softlockup_watchdog();
@@ -1901,7 +1941,7 @@ int timekeeping_suspend(void)
 		}
 	}
 
-	timekeeping_update_from_shadow(&tk_core, 0);
+	timekeeping_update_from_shadow(&tk_core, 0, 0);
 	halt_fast_timekeeper(tks);
 	raw_spin_unlock_irqrestore(&tk_core.lock, flags);
 
@@ -2205,7 +2245,7 @@ static bool timekeeping_advance(enum tim
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
-	timekeeping_update_from_shadow(&tk_core, clock_set);
+	timekeeping_update_from_shadow(&tk_core, clock_set, offset);
 
 	return !!clock_set;
 }
@@ -2248,7 +2288,7 @@ void ktime_get_coarse_real_ts64(struct t
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
@@ -2271,7 +2311,7 @@ void ktime_get_coarse_real_ts64_mg(struc
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2350,12 +2390,12 @@ void ktime_get_coarse_ts64(struct timesp
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
@@ -2539,7 +2579,8 @@ int do_adjtimex(struct __kernel_timex *t
 
 		if (tai != orig_tai) {
 			__timekeeping_set_tai_offset(tks, tai);
-			timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET);
+			timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET |
+						       TK_RETAIN_COARSE, 0);
 			clock_set = true;
 		} else {
 			tk_update_leap_state_all(&tk_core);
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -98,12 +98,12 @@ void update_vsyscall(struct timekeeper *
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec	= tk->xtime_sec;
-	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec	= tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec		= tk->coarse_nsec;
 	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
 




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-04-01 18:29             ` Thomas Gleixner
@ 2025-04-03  8:32               ` Miroslav Lichvar
  2025-04-03 11:29                 ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Miroslav Lichvar @ 2025-04-03  8:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Tue, Apr 01, 2025 at 08:29:23PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 01 2025 at 13:19, Miroslav Lichvar wrote:
> > It seems to improve the worst cases, but overall it's still
> > a regression.
> >
> > Mult reduction	Updates/sec	Skew
> > 16		4		0.012
> > 16		16		0.014
> > 16		64	        0.033
> > 64		4		0.012
> > 64		16		0.105
> > 64		64		0.138
> 
> That's weird as it only delays the update to the next tick.

Ok, so it's not an instability of the clock, but rather an instability
of the chronyd synchronization loop, which since kernel 4.19 expects
the frequency to be applied as soon as the adjtimex call is finished.
To confirm that, I tried a different test with chronyd configured to
only monitor the clock without making any adjustments (-x option) and
another process repeatedly setting the same frequency. The one-line
patch does fix that test.

> My original
> approach of maintaining seperate state for the coarse time keeper is
> preserving the existing NTP behaviour.
> 
> Patch applies after reverting 757b000f7b93 ("timekeeping: Fix possible
> inconsistencies in _COARSE clockids").

I ran multiple longer tests to be able to compare the small values in
the noise and yes, from the adjtimex point of view this seems to work
as well as before the first COARSE fix. I didn't run any tests of the
COARSE clock.

Thanks,

-- 
Miroslav Lichvar


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-04-03  8:32               ` Miroslav Lichvar
@ 2025-04-03 11:29                 ` Thomas Gleixner
  2025-04-05 21:40                   ` [PATCH] timekeeping: Prevent coarse clocks going backwards Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-03 11:29 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Apr 03 2025 at 10:32, Miroslav Lichvar wrote:
> On Tue, Apr 01, 2025 at 08:29:23PM +0200, Thomas Gleixner wrote:
>> > 64		64		0.138
>> 
>> That's weird as it only delays the update to the next tick.
>
> Ok, so it's not an instability of the clock, but rather an instability
> of the chronyd synchronization loop, which since kernel 4.19 expects
> the frequency to be applied as soon as the adjtimex call is finished.

Interesting.

>> Patch applies after reverting 757b000f7b93 ("timekeeping: Fix possible
>> inconsistencies in _COARSE clockids").
>
> I ran multiple longer tests to be able to compare the small values in
> the noise and yes, from the adjtimex point of view this seems to work
> as well as before the first COARSE fix. I didn't run any tests of the
> COARSE clock.

I did run those, but did not run the adjtimex() muck :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-03 11:29                 ` Thomas Gleixner
@ 2025-04-05 21:40                   ` Thomas Gleixner
  2025-04-17  5:29                     ` John Stultz
  2025-04-18  0:46                     ` John Stultz
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-05 21:40 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment

    tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

     multiplicator * interval_cycles

into xtime_nsec.  The accumulated value is always larger then the

     mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
then interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.

This has been tried to be fixed by forwarding the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:

  757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")

That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:
    
   1) The forwarding of the base time moves the update out of the original
      period and establishes a new one.
    
   2) The clearing of the accumulated NTP error is changing the behaviour as
      well.
    
Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.
    
Commit 757b000f7b93 was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.

One of the initial approaches to fix this was to establish a seperate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.

The coarse time keeper uses xtime_nsec for calculating the nanoseconds part
of the coarse time stamp. After timekeeping_advance() adjusted the
multiplier in timekeeping_adjust(), the current time's nanosecond part is:

  nsec = (xtime_nsec + offset * mult) >> shift;

Introduce timekeeper::coarse_nsec and store that nanoseconds part in it,
switch the time getter functions and the VDSO update to use that value.
coarse_nsec is cleared on all operations which forward or initialize the
timekeeper because those operations do not have a remaining offset.

This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.

Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
---
 include/linux/timekeeper_internal.h |    8 +++-
 kernel/time/timekeeping.c           |   60 ++++++++++++++++++++++++++++++++----
 kernel/time/vsyscall.c              |    4 +-
 3 files changed, 61 insertions(+), 11 deletions(-)

--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @offs_real:			Offset clock monotonic -> clock realtime
  * @offs_boot:			Offset clock monotonic -> clock boottime
  * @offs_tai:			Offset clock monotonic -> clock tai
- * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_nsec:		The nanoseconds part for coarse time getters
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
  * @clock_was_set_seq:		The sequence number of clock was set events
@@ -76,6 +76,7 @@ struct tk_read_base {
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
  * @skip_second_overflow:	Flag used to avoid updating NTP twice with same second
+ * @tai_offset:			The current UTC to TAI offset in seconds
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
@@ -100,7 +101,7 @@ struct tk_read_base {
  * which results in the following cacheline layout:
  *
  * 0:	seqcount, tkr_mono
- * 1:	xtime_sec ... tai_offset
+ * 1:	xtime_sec ... coarse_nsec
  * 2:	tkr_raw, raw_sec
  * 3,4: Internal variables
  *
@@ -121,7 +122,7 @@ struct timekeeper {
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
 	ktime_t			offs_tai;
-	s32			tai_offset;
+	u32			coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base	tkr_raw;
@@ -144,6 +145,7 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s32			tai_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -164,6 +164,15 @@ static inline struct timespec64 tk_xtime
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
@@ -252,6 +261,7 @@ static void tk_setup_internals(struct ti
 	tk->tkr_raw.clock = clock;
 	tk->tkr_raw.mask = clock->mask;
 	tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last;
+	tk->coarse_nsec = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -708,6 +718,12 @@ static void timekeeping_forward_now(stru
 		tk_normalize_xtime(tk);
 		delta -= incr;
 	}
+
+	/*
+	 * Clear the offset for the coarse time as the above forward
+	 * brought the offset down to zero.
+	 */
+	tk->coarse_nsec = 0;
 }
 
 /**
@@ -804,8 +820,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset)
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -813,7 +829,7 @@ ktime_t ktime_get_coarse_with_offset(enu
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -1831,6 +1847,8 @@ void timekeeping_resume(void)
 	/* Re-base the last cycle value */
 	tks->tkr_mono.cycle_last = cycle_now;
 	tks->tkr_raw.cycle_last  = cycle_now;
+	/* Reset the offset for the coarse time getters */
+	tks->coarse_nsec = 0;
 
 	tks->ntp_error = 0;
 	timekeeping_suspended = 0;
@@ -2152,6 +2170,33 @@ static u64 logarithmic_accumulation(stru
 }
 
 /*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec is adjusted when the multiplication
+ * factor of the clock is adjusted. See timekeeping_apply_adjustment().
+ *
+ * This is required because tk_read::cycle_last must be advanced by
+ * timekeeper::cycle_interval so that the accumulation happens with a
+ * periodic reference.
+ *
+ * But that adjustment of xtime_nsec can make it go backward to compensate
+ * for a larger multiplicator.
+ *
+ * timekeeper::offset contains the leftover cycles which were not accumulated.
+ * Therefore the nanoseconds portion of the time when the clocksource was
+ * read in timekeeping_advance() is:
+ *
+ *	nsec = (xtime_nsec + offset * mult) >> shift;
+ *
+ * Calculate that value and store it in timekeeper::coarse_nsec, from where
+ * the coarse time getters consume it.
+ */
+static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
+{
+	offset *= tk->tkr_mono.mult;
+	tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
+}
+
+/*
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
  */
@@ -2205,6 +2250,9 @@ static bool timekeeping_advance(enum tim
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
+	/* Compensate the coarse time getters xtime_nsec offset */
+	tk_update_coarse_nsecs(tk, offset);
+
 	timekeeping_update_from_shadow(&tk_core, clock_set);
 
 	return !!clock_set;
@@ -2248,7 +2296,7 @@ void ktime_get_coarse_real_ts64(struct t
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
@@ -2271,7 +2319,7 @@ void ktime_get_coarse_real_ts64_mg(struc
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2350,12 +2398,12 @@ void ktime_get_coarse_ts64(struct timesp
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -98,12 +98,12 @@ void update_vsyscall(struct timekeeper *
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec	= tk->xtime_sec;
-	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec	= tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec		= tk->coarse_nsec;
 	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
  2025-03-31  7:53         ` Miroslav Lichvar
@ 2025-04-17  2:55           ` John Stultz
  0 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-04-17  2:55 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Thomas Gleixner, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Mon, Mar 31, 2025 at 12:54 AM Miroslav Lichvar <mlichvar@redhat.com> wrote:
> On Thu, Mar 27, 2025 at 06:32:27PM +0100, Thomas Gleixner wrote:
> > On Thu, Mar 27 2025 at 16:42, Miroslav Lichvar wrote:
> > > On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote:
> > > To clearly see the difference with the new code, I made an attempt
> > > to update the old linux-tktest simulation that was used back when the
> > > multiplier adjustment was reworked, but there are too many missing
> > > things now and I gave up.
> >
> > Can you point me to that code?
>
> It's this thing: https://github.com/mlichvar/linux-tktest
>
> > It would be probably useful to create a test mechanism which allows to
> > exercise all of this in a simulated way so we actually don't have to
> > wonder every time we change a bit what the consequences are.
>
> Yes, that would be very nice if we could run the timekeeping code in a
> deterministic simulated environment with a configurable clocksource,
> timing of kernel updates, timing and values of injected adjtimex()
> calls, etc. The question is how to isolate it.

Miroslav, Have you looked at KUNIT?
https://docs.kernel.org/dev-tools/kunit/index.html

I've not yet done much with it, but it seems like it might be a good
match for moving some of this simulation logic (which has always
impressed me) into the kernel tree.

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-05 21:40                   ` [PATCH] timekeeping: Prevent coarse clocks going backwards Thomas Gleixner
@ 2025-04-17  5:29                     ` John Stultz
  2025-04-17 12:36                       ` Thomas Gleixner
  2025-04-18  0:46                     ` John Stultz
  1 sibling, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-04-17  5:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Sat, Apr 5, 2025 at 2:40 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
> inconsistencies. Lei tracked down that this was being caused by the
> adjustment
>
>     tk->tkr_mono.xtime_nsec -= offset;
>
> which is made to compensate for the unaccumulated cycles in offset when the
> multiplicator is adjusted forward, so that the non-_COARSE clockids don't
> see inconsistencies.
>
> However, the _COARSE clockid getter functions use the adjusted xtime_nsec
> value directly and do not compensate the negative offset via the
> clocksource delta multiplied with the new multiplicator. In that case the
> caller can observe time going backwards in consecutive calls.
>

Hey Thomas!
  Thanks so much for reviving this version of the fix and my apologies
for apparently taking us down the wrong path with the earlier
solution.

Looking over the patch, it seems ok to me, but in a test run with it,
I've seen an error with CLOCK_REALTIME_COARSE during the
clocksource-switch test  (as well as some seemingly unrelated test
errors, which I need to investigate) so I'm looking at the zeroing
done in timekeeping_forward_now and will try to look more into this
tomorrow.

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-17  5:29                     ` John Stultz
@ 2025-04-17 12:36                       ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-17 12:36 UTC (permalink / raw)
  To: John Stultz
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Wed, Apr 16 2025 at 22:29, John Stultz wrote:
> Looking over the patch, it seems ok to me, but in a test run with it,
> I've seen an error with CLOCK_REALTIME_COARSE during the
> clocksource-switch test  (as well as some seemingly unrelated test
> errors, which I need to investigate) so I'm looking at the zeroing
> done in timekeeping_forward_now and will try to look more into this
> tomorrow.

Odd. I ran all the test nonsense here and did not run into any issues.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-05 21:40                   ` [PATCH] timekeeping: Prevent coarse clocks going backwards Thomas Gleixner
  2025-04-17  5:29                     ` John Stultz
@ 2025-04-18  0:46                     ` John Stultz
  2025-04-18  6:37                       ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-04-18  0:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Sat, Apr 5, 2025 at 2:40 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
> inconsistencies. Lei tracked down that this was being caused by the
> adjustment
>
>     tk->tkr_mono.xtime_nsec -= offset;
>
> which is made to compensate for the unaccumulated cycles in offset when the
> multiplicator is adjusted forward, so that the non-_COARSE clockids don't
> see inconsistencies.
>
> However, the _COARSE clockid getter functions use the adjusted xtime_nsec
> value directly and do not compensate the negative offset via the
> clocksource delta multiplied with the new multiplicator. In that case the
> caller can observe time going backwards in consecutive calls.
>
> By design, this negative adjustment should be fine, because the logic run
> from timekeeping_adjust() is done after it accumulated approximately
>
>      multiplicator * interval_cycles
>
> into xtime_nsec.  The accumulated value is always larger then the
>
>      mult_adj * offset
>
> value, which is subtracted from xtime_nsec. Both operations are done
> together under the tk_core.lock, so the net change to xtime_nsec is always
> always be positive.
>
> However, do_adjtimex() calls into timekeeping_advance() as well, to
> apply the NTP frequency adjustment immediately. In this case,
> timekeeping_advance() does not return early when the offset is smaller
> then interval_cycles. In that case there is no time accumulated into
> xtime_nsec. But the subsequent call into timekeeping_adjust(), which
> modifies the multiplicator, subtracts from xtime_nsec to correct for the
> new multiplicator.
>
> Here because there was no accumulation, xtime_nsec becomes smaller than
> before, which opens a window up to the next accumulation, where the
> _COARSE clockid getters, which don't compensate for the offset, can
> observe the inconsistency.
>
> This has been tried to be fixed by forwarding the timekeeper in the case
> that adjtimex() adjusts the multiplier, which resets the offset to zero:
>
>   757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")
>
> That works correctly, but unfortunately causes a regression on the
> adjtimex() side. There are two issues:
>
>    1) The forwarding of the base time moves the update out of the original
>       period and establishes a new one.
>
>    2) The clearing of the accumulated NTP error is changing the behaviour as
>       well.
>
> Userspace expects that multiplier/frequency updates are in effect, when the
> syscall returns, so delaying the update to the next tick is not solving the
> problem either.
>
> Commit 757b000f7b93 was reverted so that the established expectations of
> user space implementations (ntpd, chronyd) are restored, but that obviously
> brought the inconsistencies back.
>
> One of the initial approaches to fix this was to establish a seperate
> storage for the coarse time getter nanoseconds part by calculating it from
> the offset. That was dropped on the floor because not having yet another
> state to maintain was simpler. But given the result of the above exercise,
> this solution turns out to be the right one. Bring it back in a slightly
> modified form.
>
> The coarse time keeper uses xtime_nsec for calculating the nanoseconds part
> of the coarse time stamp. After timekeeping_advance() adjusted the
> multiplier in timekeeping_adjust(), the current time's nanosecond part is:
>
>   nsec = (xtime_nsec + offset * mult) >> shift;
>
> Introduce timekeeper::coarse_nsec and store that nanoseconds part in it,
> switch the time getter functions and the VDSO update to use that value.
> coarse_nsec is cleared on all operations which forward or initialize the
> timekeeper because those operations do not have a remaining offset.
>
> This leaves the adjtimex() behaviour unmodified and prevents coarse time
> from going backwards.
>
> Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
> Reported-by: Lei Chen <lei.chen@smartx.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/


> @@ -252,6 +261,7 @@ static void tk_setup_internals(struct ti
>         tk->tkr_raw.clock = clock;
>         tk->tkr_raw.mask = clock->mask;
>         tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last;
> +       tk->coarse_nsec = 0;
>
>         /* Do the ns -> cycle conversion first, using original mult */
>         tmp = NTP_INTERVAL_LENGTH;
> @@ -708,6 +718,12 @@ static void timekeeping_forward_now(stru
>                 tk_normalize_xtime(tk);
>                 delta -= incr;
>         }
> +
> +       /*
> +        * Clear the offset for the coarse time as the above forward
> +        * brought the offset down to zero.
> +        */
> +       tk->coarse_nsec = 0;
>  }
...
> @@ -1831,6 +1847,8 @@ void timekeeping_resume(void)
>         /* Re-base the last cycle value */
>         tks->tkr_mono.cycle_last = cycle_now;
>         tks->tkr_raw.cycle_last  = cycle_now;
> +       /* Reset the offset for the coarse time getters */
> +       tks->coarse_nsec = 0;
>
>         tks->ntp_error = 0;
>         timekeeping_suspended = 0;


So using the clocksource-switch test in kselftest, I can pretty easily
hit inconsistencies with this.

The reason is since we use the coarse_nsec as the nanosecond portion
of the coarse clockids, I don't think we ever want to set it to zero,
as whenever we do so, we lose the previous contents and cause the
coarse time to jump back.

It seems more likely that we'd want to do something similar to
tk_update_coarse_nsecs() filling it in with the shifted down
tk->tkr_mono.xtime_nsec.


> @@ -2152,6 +2170,33 @@ static u64 logarithmic_accumulation(stru
>  }
>
>  /*
> + * Update the nanoseconds part for the coarse time keepers. They can't rely
> + * on xtime_nsec because xtime_nsec is adjusted when the multiplication
> + * factor of the clock is adjusted. See timekeeping_apply_adjustment().
> + *
> + * This is required because tk_read::cycle_last must be advanced by
> + * timekeeper::cycle_interval so that the accumulation happens with a
> + * periodic reference.
> + *
> + * But that adjustment of xtime_nsec can make it go backward to compensate
> + * for a larger multiplicator.
> + *
> + * timekeeper::offset contains the leftover cycles which were not accumulated.
> + * Therefore the nanoseconds portion of the time when the clocksource was
> + * read in timekeeping_advance() is:
> + *
> + *     nsec = (xtime_nsec + offset * mult) >> shift;
> + *
> + * Calculate that value and store it in timekeeper::coarse_nsec, from where
> + * the coarse time getters consume it.
> + */
> +static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
> +{
> +       offset *= tk->tkr_mono.mult;
> +       tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
> +}

Thinking more on this, I get that you're providing the offset to save
the "at the point" time into the coarse value, but I think this ends
up complicating things.

Instead it seems like we should just do:
  tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;

However, we would need to skip doing the update if we didn't
accumulate anything. This would perserve the coarse clockid only
updating on tick interval boundaries and avoid the potential negative
correction to the xtime_nsec when we do the frequency adjustment.

I've got a patch in testing that is avoiding the inconsistency so far.
I'll try to send it out tomorrow.

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-18  0:46                     ` John Stultz
@ 2025-04-18  6:37                       ` Thomas Gleixner
  2025-04-18  7:00                         ` Thomas Gleixner
  2025-04-18 18:40                         ` John Stultz
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-18  6:37 UTC (permalink / raw)
  To: John Stultz
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Apr 17 2025 at 17:46, John Stultz wrote:
> On Sat, Apr 5, 2025 at 2:40 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> @@ -1831,6 +1847,8 @@ void timekeeping_resume(void)
>>         /* Re-base the last cycle value */
>>         tks->tkr_mono.cycle_last = cycle_now;
>>         tks->tkr_raw.cycle_last  = cycle_now;
>> +       /* Reset the offset for the coarse time getters */
>> +       tks->coarse_nsec = 0;
>>
>>         tks->ntp_error = 0;
>>         timekeeping_suspended = 0;
>
>
> So using the clocksource-switch test in kselftest, I can pretty easily
> hit inconsistencies with this.
>
> The reason is since we use the coarse_nsec as the nanosecond portion
> of the coarse clockids, I don't think we ever want to set it to zero,
> as whenever we do so, we lose the previous contents and cause the
> coarse time to jump back.

Bah. Obviously. What was I thinking?

> It seems more likely that we'd want to do something similar to
> tk_update_coarse_nsecs() filling it in with the shifted down
> tk->tkr_mono.xtime_nsec.

Indeed. The earlier approach of handing the offset to
timekeeping_update_from_shadow() was exactly doing that. I dropped that
because of the uglyness vs. the TAI update case in adjtimex().

>> +static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
>> +{
>> +       offset *= tk->tkr_mono.mult;
>> +       tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
>> +}
>
> Thinking more on this, I get that you're providing the offset to save
> the "at the point" time into the coarse value, but I think this ends
> up complicating things.
>
> Instead it seems like we should just do:
>   tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;

You end up with the same problem again because xtime_nsec can move
backwards when the multiplier is updated, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-18  6:37                       ` Thomas Gleixner
@ 2025-04-18  7:00                         ` Thomas Gleixner
  2025-04-19  5:55                           ` John Stultz
  2025-04-18 18:40                         ` John Stultz
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2025-04-18  7:00 UTC (permalink / raw)
  To: John Stultz
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Fri, Apr 18 2025 at 08:37, Thomas Gleixner wrote:
> On Thu, Apr 17 2025 at 17:46, John Stultz wrote:
>> Instead it seems like we should just do:
>>   tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
>
> You end up with the same problem again because xtime_nsec can move
> backwards when the multiplier is updated, no?

Something like the below should work.

Thanks,

        tglx
---
 include/linux/timekeeper_internal.h |   10 ++++-
 kernel/time/timekeeping.c           |   62 ++++++++++++++++++++++++++++++++----
 kernel/time/vsyscall.c              |    4 +-
 3 files changed, 65 insertions(+), 11 deletions(-)

--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @offs_real:			Offset clock monotonic -> clock realtime
  * @offs_boot:			Offset clock monotonic -> clock boottime
  * @offs_tai:			Offset clock monotonic -> clock tai
- * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_nsec:		The nanoseconds part for coarse time getters
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
  * @clock_was_set_seq:		The sequence number of clock was set events
@@ -76,6 +76,8 @@ struct tk_read_base {
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
  * @skip_second_overflow:	Flag used to avoid updating NTP twice with same second
+ * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_offset:		The offset of the coarse timekeeper in clock cycles
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
@@ -100,7 +102,7 @@ struct tk_read_base {
  * which results in the following cacheline layout:
  *
  * 0:	seqcount, tkr_mono
- * 1:	xtime_sec ... tai_offset
+ * 1:	xtime_sec ... coarse_nsec
  * 2:	tkr_raw, raw_sec
  * 3,4: Internal variables
  *
@@ -121,7 +123,7 @@ struct timekeeper {
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
 	ktime_t			offs_tai;
-	s32			tai_offset;
+	u32			coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base	tkr_raw;
@@ -144,6 +146,8 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s32			tai_offset;
+	u32			coarse_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -164,6 +164,15 @@ static inline struct timespec64 tk_xtime
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
@@ -252,6 +261,7 @@ static void tk_setup_internals(struct ti
 	tk->tkr_raw.clock = clock;
 	tk->tkr_raw.mask = clock->mask;
 	tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last;
+	tk->coarse_offset = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -636,6 +646,34 @@ static void timekeeping_restore_shadow(s
 	memcpy(&tkd->shadow_timekeeper, &tkd->timekeeper, sizeof(tkd->timekeeper));
 }
 
+/*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec is adjusted when the multiplication
+ * factor of the clock is adjusted. See timekeeping_apply_adjustment().
+ *
+ * This is required because tk_read::cycle_last must be advanced by
+ * timekeeper::cycle_interval so that the accumulation happens with a
+ * periodic reference.
+ *
+ * But that adjustment of xtime_nsec can make it go backward to compensate
+ * for a larger multiplicator.
+ *
+ * timekeeper::offset contains the leftover cycles which were not accumulated.
+ * Therefore the nanoseconds portion of the time when the clocksource was
+ * read in timekeeping_advance() is:
+ *
+ *	nsec = (xtime_nsec + offset * mult) >> shift;
+ *
+ * Calculate that value and store it in timekeeper::coarse_nsec, from where
+ * the coarse time getters consume it.
+ */
+static inline void tk_update_coarse_nsecs(struct timekeeper *tk)
+{
+	u64 offset = (u64)tk->coarse_offset * tk->tkr_mono.mult;
+
+	tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
+}
+
 static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int action)
 {
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
@@ -658,6 +696,7 @@ static void timekeeping_update_from_shad
 
 	tk_update_leap_state(tk);
 	tk_update_ktime_data(tk);
+	tk_update_coarse_nsecs(tk);
 
 	update_vsyscall(tk);
 	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
@@ -708,6 +747,12 @@ static void timekeeping_forward_now(stru
 		tk_normalize_xtime(tk);
 		delta -= incr;
 	}
+
+	/*
+	 * Clear the offset for the coarse time as the above forward
+	 * brought the offset down to zero.
+	 */
+	tk->coarse_offset = 0;
 }
 
 /**
@@ -804,8 +849,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset)
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -813,7 +858,7 @@ ktime_t ktime_get_coarse_with_offset(enu
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -1831,6 +1876,8 @@ void timekeeping_resume(void)
 	/* Re-base the last cycle value */
 	tks->tkr_mono.cycle_last = cycle_now;
 	tks->tkr_raw.cycle_last  = cycle_now;
+	/* Reset the offset for the coarse time getters */
+	tks->coarse_offset = 0;
 
 	tks->ntp_error = 0;
 	timekeeping_suspended = 0;
@@ -2205,6 +2252,9 @@ static bool timekeeping_advance(enum tim
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
+	/* Compensate the coarse time getters xtime_nsec offset */
+	tk->coarse_offset = (u32)offset;
+
 	timekeeping_update_from_shadow(&tk_core, clock_set);
 
 	return !!clock_set;
@@ -2248,7 +2298,7 @@ void ktime_get_coarse_real_ts64(struct t
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
@@ -2271,7 +2321,7 @@ void ktime_get_coarse_real_ts64_mg(struc
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2350,12 +2400,12 @@ void ktime_get_coarse_ts64(struct timesp
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -98,12 +98,12 @@ void update_vsyscall(struct timekeeper *
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec	= tk->xtime_sec;
-	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec	= tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec		= tk->coarse_nsec;
 	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-18  6:37                       ` Thomas Gleixner
  2025-04-18  7:00                         ` Thomas Gleixner
@ 2025-04-18 18:40                         ` John Stultz
  2025-04-19  5:46                           ` [PATCH v3] " John Stultz
  1 sibling, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-04-18 18:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Thu, Apr 17, 2025 at 11:37 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, Apr 17 2025 at 17:46, John Stultz wrote:
> > On Sat, Apr 5, 2025 at 2:40 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> +static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
> >> +{
> >> +       offset *= tk->tkr_mono.mult;
> >> +       tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
> >> +}
> >
> > Thinking more on this, I get that you're providing the offset to save
> > the "at the point" time into the coarse value, but I think this ends
> > up complicating things.
> >
> > Instead it seems like we should just do:
> >   tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
>
> You end up with the same problem again because xtime_nsec can move
> backwards when the multiplier is updated, no?

That's assuming you update coarse_nsec on every call to do_adjtimex,
which I don't think is necessary (or wanted - as it would sort of be a
behavior change to the COARSE clockids).

The root issue here has been that there was a mistaken assumption that
the small negative adjustment done to the xtime_nsec base to match the
mult adjustment would only happen after a larger accumulation, but the
timekeeping_advance() call from do_adjtimex() doesn't actually intend
to advance the clock (just change the frequency), so those small
negative adjustments made via do_adjtimex() between accumulations
become visible to the coarse clockids.

Since the coarse clockids are expected to provide ~tick-granular time,
if we are keeping separate state, we should only be
incrementing/setting that state when we accumulate (with each
cycle_interval). We don't need to be making updates to the coarse
clock between ticks on every do_adjtime call. So saving off the
tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift value after we actually
accumulated something should be fine. Any inter-tick frequency
adjustments to xtime_nsec can be ignored by the coarse clockid state.

I'll test with your updated patch here, as I suspect it will avoid the
problem I'm seeing, but I do think things can be further simplified.

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3] timekeeping: Prevent coarse clocks going backwards
  2025-04-18 18:40                         ` John Stultz
@ 2025-04-19  5:46                           ` John Stultz
  0 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-04-19  5:46 UTC (permalink / raw)
  To: LKML
  Cc: Thomas Gleixner, Miroslav Lichvar, Lei Chen, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, Shuah Khan,
	linux-kselftest, kernel-team, John Stultz

From: Thomas Gleixner <tglx@linutronix.de>

Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment

    tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

     multiplicator * interval_cycles

into xtime_nsec.  The accumulated value is always larger then the

     mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
then interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.

This has been tried to be fixed by forwarding the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:

  757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")

That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:

   1) The forwarding of the base time moves the update out of the original
      period and establishes a new one.

   2) The clearing of the accumulated NTP error is changing the behaviour as
      well.

Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.

Commit 757b000f7b93 was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.

One of the initial approaches to fix this was to establish a separate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.

Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in
it, switch the time getter functions and the VDSO update to use that value.
coarse_nsec is set on operations which forward or initialize the timekeeper
or after we have accumulated time during a tick

This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Lei Chen <lei.chen@smartx.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[jstultz: Simplified the coarse_nsec calculation and kept behavior so
coarse clockids aren't adjusted on each inter-tick adjtimex call,
slightly reworking the comments and commit message]
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* John's Rework of Thomas' version here:
  - https://lore.kernel.org/lkml/87r01qq7hp.ffs@tglx/
---
 include/linux/timekeeper_internal.h |  8 +++--
 kernel/time/timekeeping.c           | 50 ++++++++++++++++++++++++-----
 kernel/time/vsyscall.c              |  4 +--
 3 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index e39d4d563b197..785048a3b3e60 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @offs_real:			Offset clock monotonic -> clock realtime
  * @offs_boot:			Offset clock monotonic -> clock boottime
  * @offs_tai:			Offset clock monotonic -> clock tai
- * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_nsec:		The nanoseconds part for coarse time getters
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
  * @clock_was_set_seq:		The sequence number of clock was set events
@@ -76,6 +76,7 @@ struct tk_read_base {
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
  * @skip_second_overflow:	Flag used to avoid updating NTP twice with same second
+ * @tai_offset:			The current UTC to TAI offset in seconds
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
@@ -100,7 +101,7 @@ struct tk_read_base {
  * which results in the following cacheline layout:
  *
  * 0:	seqcount, tkr_mono
- * 1:	xtime_sec ... tai_offset
+ * 1:	xtime_sec ... coarse_nsec
  * 2:	tkr_raw, raw_sec
  * 3,4: Internal variables
  *
@@ -121,7 +122,7 @@ struct timekeeper {
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
 	ktime_t			offs_tai;
-	s32			tai_offset;
+	u32			coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base	tkr_raw;
@@ -144,6 +145,7 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s32			tai_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f1955..a009c91f7b05f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -164,10 +164,34 @@ static inline struct timespec64 tk_xtime(const struct timekeeper *tk)
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
+/*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec could be adjusted by a small negative
+ * amount when the multiplication factor of the clock is adjusted, which
+ * could cause the coarse clocks to go slightly backwards. See
+ * timekeeping_apply_adjustment(). Thus we keep a separate copy for the coarse
+ * clockids which only is updated when the clock has been set or  we have
+ * accumulated time.
+ */
+static inline void tk_update_coarse_nsecs(struct timekeeper *tk)
+{
+	tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
 	tk->tkr_mono.xtime_nsec = (u64)ts->tv_nsec << tk->tkr_mono.shift;
+	tk_update_coarse_nsecs(tk);
 }
 
 static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
@@ -175,6 +199,7 @@ static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
 	tk->xtime_sec += ts->tv_sec;
 	tk->tkr_mono.xtime_nsec += (u64)ts->tv_nsec << tk->tkr_mono.shift;
 	tk_normalize_xtime(tk);
+	tk_update_coarse_nsecs(tk);
 }
 
 static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
@@ -708,6 +733,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 		tk_normalize_xtime(tk);
 		delta -= incr;
 	}
+	tk_update_coarse_nsecs(tk);
 }
 
 /**
@@ -804,8 +830,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset);
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -813,7 +839,7 @@ ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2161,7 +2187,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	unsigned int clock_set = 0;
 	int shift = 0, maxshift;
-	u64 offset;
+	u64 offset, orig_offset;
 
 	guard(raw_spinlock_irqsave)(&tk_core.lock);
 
@@ -2172,7 +2198,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				   tk->tkr_mono.clock->max_raw_delta);
-
+	orig_offset = offset;
 	/* Check if there's really nothing to do */
 	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
 		return false;
@@ -2205,6 +2231,14 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
+	/*
+	 * To avoid inconsistencies caused adjtimex TK_ADV_FREQ calls
+	 * making small negative adjustments to the base xtime_nsec
+	 * value, only update the coarse clocks if we accumulated time
+	 */
+	if (orig_offset != offset)
+		tk_update_coarse_nsecs(tk);
+
 	timekeeping_update_from_shadow(&tk_core, clock_set);
 
 	return !!clock_set;
@@ -2248,7 +2282,7 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts)
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
@@ -2271,7 +2305,7 @@ void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2350,12 +2384,12 @@ void ktime_get_coarse_ts64(struct timespec64 *ts)
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index 01c2ab1e89719..32ef27c71b57a 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -98,12 +98,12 @@ void update_vsyscall(struct timekeeper *tk)
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec	= tk->xtime_sec;
-	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec	= tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts		= &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec		= tk->coarse_nsec;
 	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
 
-- 
2.49.0.805.g082f7c87e0-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] timekeeping: Prevent coarse clocks going backwards
  2025-04-18  7:00                         ` Thomas Gleixner
@ 2025-04-19  5:55                           ` John Stultz
  0 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-04-19  5:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Miroslav Lichvar, LKML, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, Shuah Khan, linux-kselftest, kernel-team,
	Lei Chen

On Fri, Apr 18, 2025 at 12:00 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, Apr 18 2025 at 08:37, Thomas Gleixner wrote:
> > On Thu, Apr 17 2025 at 17:46, John Stultz wrote:
> >> Instead it seems like we should just do:
> >>   tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
> >
> > You end up with the same problem again because xtime_nsec can move
> > backwards when the multiplier is updated, no?
>
> Something like the below should work.

Hey Thomas,
  So I did test this version (I'll call it v2) of the patch and it
does avoid the problem I was seeing earlier with your first patch.

Though I also just sent my own slight rework of your patch
(https://lore.kernel.org/lkml/20250419054706.2319105-1-jstultz@google.com/).
The main difference with my version is just the avoidance of mid-tick
updates to the coarse clock by adjtime calls (and the slight benefit
of avoiding the mult in the update path, but this is probably minor).

It's done ok in my testing so far, but obviously the effects of these
changes can be subtle.

I'm ok with either approach, so let me know what you'd prefer.
For your version:
  Acked-by: John Stultz <jstultz@google.com>

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-04-19  5:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 20:03 [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids John Stultz
2025-03-20 20:03 ` [PATCH v2 2/2] selftests/timers: Improve skew_consistency by testing with other clockids John Stultz
2025-03-25 11:32 ` [PATCH v2 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids Miroslav Lichvar
2025-03-27  9:22   ` Thomas Gleixner
2025-03-27 15:42     ` Miroslav Lichvar
2025-03-27 17:32       ` Thomas Gleixner
2025-03-31  7:53         ` Miroslav Lichvar
2025-04-17  2:55           ` John Stultz
2025-03-31 14:53       ` Miroslav Lichvar
2025-04-01  6:34         ` Thomas Gleixner
2025-04-01 11:19           ` Miroslav Lichvar
2025-04-01 18:29             ` Thomas Gleixner
2025-04-03  8:32               ` Miroslav Lichvar
2025-04-03 11:29                 ` Thomas Gleixner
2025-04-05 21:40                   ` [PATCH] timekeeping: Prevent coarse clocks going backwards Thomas Gleixner
2025-04-17  5:29                     ` John Stultz
2025-04-17 12:36                       ` Thomas Gleixner
2025-04-18  0:46                     ` John Stultz
2025-04-18  6:37                       ` Thomas Gleixner
2025-04-18  7:00                         ` Thomas Gleixner
2025-04-19  5:55                           ` John Stultz
2025-04-18 18:40                         ` John Stultz
2025-04-19  5:46                           ` [PATCH v3] " John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).