Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Thomas Gleixner <tglx@linutronix.de>
To: John Stultz <jstultz@google.com>, Lei Chen <lei.chen@smartx.com>
Cc: Stephen Boyd <sboyd@kernel.org>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Date: Sat, 15 Mar 2025 10:06:28 +0100	[thread overview]
Message-ID: <87a59m64tn.ffs@tglx> (raw)
In-Reply-To: <CANDhNCpSB5HzMHne94rnGEi+=yd1Q2j+pJX8pdX5RbEojwpZcA@mail.gmail.com>

On Fri, Mar 14 2025 at 12:21, John Stultz wrote:
> On Thu, Mar 13, 2025 at 11:32 PM Lei Chen <lei.chen@smartx.com> wrote:
> My suspicion is that if we are coming into timekeeping_advance() more
> frequently then cycle_interval cycles, than its possible we didn't
> actually accumulate anything, but had some left over ntp error that
> triggered a mult adjustment, causing a xtime_nsec to be decremented
> without the normal accumulation before that. Opening up a window for
> the inconsistency.
>
> The "if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)"
> check there is supposed to catch that, but with
> timekeeping_advance(TK_ADV_FREQ) it looks like during an ntp
> adjustment we can try to do the mult adjustment without accumulation.

Even if the delta is > cycle_interval in the normal TK_ADV_TICK case,
then the accumulation will always have a leftover offset.

After accumulation timekeeping_advance() invokes timekeeping_adjust(offset).

timekeeping_adjust() does multiplier adjustment based on [NTP] tick
length and ntp error. That's completely independent of TV_ADV_FREQ.

So _ANY_ adjustment of the multiplier has to adjust xtime_nsec so that

     T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
		   (xtime_nsec[OLD] + offset * mult[OLD]) >> shift;

     T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
		   (xtime_nsec[NEW] + offset * mult[NEW]) >> shift;

results in:

     T1nsec[OLD] == T1nsec[NEW]

So while Lei's change fulfils that requirement by advancing xtime_nsec
so that this becomes:

     T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
                   xtime_nsec[NEW] >> shift;

it changes a fundamental property of the timekeeping code:

  cycles_last has to be incremented by multiples of cycles_interval to
  ensure that the accumulation, on which the NTP correction relies, has
  a periodic base. That allows to steer the clock in a coordinated way
  against ntp_tick_length and guarantees smooth error accumulation.

With Lei's changes this property is lost because every multiplicator
adjustment moves cycles_last around by the random offset which happened
to be left over in the accumulation. So the period of cycles_last
advancement gets lost and becomes a self modifying random number
generator depening on the size of the reminder (offset).

Now coming back to your suspicion that this is related to TK_ADV_FREQ.
That's one way to trigger it, but even the regular steering algorithm,
which smoothes the multplicator adjustment with

           mult [+-] = 1

exposes this because there is no guarantee that the remaining offset is
small enough to be not visible. All it takes is:

      (offset >> shift) >= 1

which means it's greater than or equal one nanosecond. Assume a clock
frequency of 1GHz. Then

      mult ~= shift

So all it takes is to hit timekeeping_advance() at a point in time where
the reminder is becomes greater than shift, which is pretty much
guaranteed to happen with NOHZ on a fully idle system.

In the 1GHz clock frequency case and HZ=100 and the in use

   shift = 1 << 23 = 8388608

this opens a window of ~1.62 milliseconds, which is easy enough to
hit. It does not even require NOHZ because in the NTP steering case the
period of the tick is not longer the same as the cycle_interval. Which
means that the reminder slightly increases per tick and hits the window
where the reminder becomes large enough roughly once per second.

I've instrumented timekeeping_advance() accordingly and the
trace_printk() nicely fills the trace buffer over time with xtime_nsec
going backwards. Except for the TK_ADV_FREQ case, this is always just
one nanosecond, but it's one nanosecond too much.

As I said in the previous mail, the fundamental flaw of the coarse time
getters is the assumption that xtime_sec + xtime_nsec does not go
backwards (except when the clock was set).

That's not the case and we can't change that without rewriting the
accumulation algorithm. I looked at the math required and it becomes a
total nightmare to compensate for that. I really don't have any
interrest in chasing the resulting NTP regressions for the next year
just for that.

Providing a seperate and accurate coarse_nsec value is easy enough and
does not touch any of the well established and correct accumulation and
steering code at all.

Thanks,

        tglx

next prev parent reply	other threads:[~2025-03-15  9:06 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10  3:00 [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE Lei Chen
2025-03-13 17:20 ` John Stultz
2025-03-14  6:32   ` Lei Chen
2025-03-14 19:21     ` John Stultz
2025-03-15  9:06       ` Thomas Gleixner [this message]
2025-03-14 19:41 ` Thomas Gleixner
2025-03-14 22:50   ` John Stultz
2025-03-15  0:37     ` [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids John Stultz
2025-03-15  0:37       ` [RFC PATCH 2/2] selftests/timers: Improve skew_consistency by testing with other clockids John Stultz
2025-03-15 19:23       ` [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids Thomas Gleixner
2025-03-15 23:22         ` John Stultz
2025-03-16 20:56           ` Thomas Gleixner
2025-03-20 18:01             ` John Stultz
2025-03-21 16:17               ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a59m64tn.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=jstultz@google.com \
    --cc=lei.chen@smartx.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sboyd@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox