* Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-11-30 17:29 UTC
To: Peter Zijlstra
Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

[Subject changed because this isn't relevant to the patches in question any more.]

On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Use the 'latch' data structure for cyc2ns.
>> >
>> > This is a data structure first proposed by me and later named by
>> > Mathieu. If anybody's got a better name; do holler.
>>
>> That structure must exist in the literature, but I have no idea what
>> it's called. It's a multi-word lock-free atomic (I think -- maybe
>> it's just regular) register. I even published a considerably fancier
>> version of much the same thing a few years ago. :)
>
> Yeah, it's a fairly straightforward thing; it has to be named someplace ;-)

The trouble is that this data structure (like seqlocks, refcounts, and lots of real-world synchronization things) fails if the reader falls asleep for a multiple of 2^32 (or 2^64) writes. The literature generally doesn't like such things.

(As a thoroughly irrelevant aside, TSX, and maybe even LL/SC, can be used to work around that issue in case anyone cares.)

>> I've occasionally wondered whether it would be possible to make a
>> monotonicity-preserving version of this and use it for clock_gettime.
>> One approach: have the writer set the time for the update to be a bit
>> in the future and have the reader compare the current raw time to the
>> cutoff to see which set of frequency/offset to use. (This requires
>> having some kind of bound on how long it takes to update the data
>> structures.)
>>
>> The advantage: clock_gettime would never block.
>> The disadvantage: complicated, potentially nasty to implement, and it
>> would get complicated if anyone tried to allow multiple updates in
>> rapid succession.
>
> Yes, that way you can chain a number of linear segments in various
> slots, but you're indeed right in that it will limit the update
> frequency. More slots will give you more room, but eventually you're
> limited.
>
> I suppose NTP is the primary updater in that case; does that have a
> limit on the updates? All the other updates we can artificially limit,
> that shouldn't really matter.
>
> But yeah, in my case we pretty much assume the TSC is complete crap and a
> little more crap simply doesn't matter.
>
> For the 'stable' TSC on modern machines we never set the frequency and
> it doesn't matter anyway.

I assume that NTP works by fiddling with the frequency and offset on a regular basis to preserve monotonicity while still controlling the clock.

TBH, I've never understood why the NTP code is so integrated into the kernel's timing infrastructure or, for that matter, lives in the kernel at all. Shouldn't the core interface be something more like "starting at time t_1, change the frequency to f_1, then at time t_2, change the frequency to f_2"? That would give the ability to manage a control loop in userspace (or some kernel thread) and to reliably slew the clock by a small, fixed amount. I suppose this could be elaborated to allow more than two adjustments to be scheduled in advance, but I really don't see the need for anything much fancier.

Benefits:
- Comprehensible without reading the entire NTP spec and all the various addenda.
- No need for any timing code at all in the tick handler -- the whole thing could presumably be done with hrtimers and a bit fancier data structure that lets clock_gettime figure out when to update*.
- Things like PTP don't need to pretend to be NTP.

Disadvantages: No clue, since I don't know why NTP works the way it does right now.

* vclock_gettime on x86_64 already has a branch that depends on the time. I think that a good implementation could do all of this fancy stuff with exactly one branch, resulting in the fast path being just as fast.

--Andy
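For reference, the 'latch' structure under discussion can be sketched in userspace roughly as follows. This is an illustrative toy, not the kernel's actual implementation; all names are invented, and the memory ordering is simplified:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy sketch of the 'latch': two copies of the conversion data plus an
 * ever-increasing sequence count.  The writer updates the copy readers
 * are not currently using, then bumps the count; readers pick a copy by
 * the low bit of the count and retry if it moved underneath them. */
struct cyc2ns_data {
	uint32_t mult;
	uint32_t shift;
	uint64_t offset;
};

struct cyc2ns_latch {
	_Atomic uint32_t seq;
	struct cyc2ns_data data[2];
};

static void latch_write(struct cyc2ns_latch *l, struct cyc2ns_data d)
{
	uint32_t seq = atomic_load_explicit(&l->seq, memory_order_relaxed);

	/* (seq + 1) & 1 selects the copy readers are NOT using */
	l->data[(seq + 1) & 1] = d;
	atomic_store_explicit(&l->seq, seq + 1, memory_order_release);
}

static struct cyc2ns_data latch_read(struct cyc2ns_latch *l)
{
	struct cyc2ns_data d;
	uint32_t seq;

	do {
		seq = atomic_load_explicit(&l->seq, memory_order_acquire);
		d = l->data[seq & 1];
	} while (seq != atomic_load_explicit(&l->seq, memory_order_acquire));
	return d;
}
```

Note the failure mode Andy describes: a reader that sleeps across exactly a multiple of 2^32 writes sees an unchanged seq on the recheck and cannot tell that anything happened.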
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: H. Peter Anvin @ 2013-11-30 17:34 UTC
To: Andy Lutomirski, Peter Zijlstra
Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck

There is a huge difference between something that breaks after 2^32 and 2^64 events. Very few computers will ever be able to have 2^64 events of any kind in their lifetime, never mind in a single boot.

Andy Lutomirski <luto@amacapital.net> wrote:
> [Subject changed because this isn't relevant to the patches in
> question any more.]
>
> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>> On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > Use the 'latch' data structure for cyc2ns.
>>> >
>>> > This is a data structure first proposed by me and later named by
>>> > Mathieu. If anybody's got a better name; do holler.
>>>
>>> That structure must exist in the literature, but I have no idea what
>>> it's called. It's a multi-word lock-free atomic (I think -- maybe
>>> it's just regular) register. I even published a considerably fancier
>>> version of much the same thing a few years ago. :)
>>
>> Yeah, it's a fairly straightforward thing; it has to be named someplace ;-)
>
> The trouble is that this data structure (like seqlocks, refcounts, and
> lots of real-world synchronization things) fails if the reader falls
> asleep for a multiple of 2^32 (or 2^64) writes. The literature
> generally doesn't like such things.
>
> (As a thoroughly irrelevant aside, TSX, and maybe even LL/SC, can be
> used to work around that issue in case anyone cares.)
>
>>
>>> I've occasionally wondered whether it would be possible to make a
>>> monotonicity-preserving version of this and use it for clock_gettime.
>>> One approach: have the writer set the time for the update to be a bit
>>> in the future and have the reader compare the current raw time to the
>>> cutoff to see which set of frequency/offset to use. (This requires
>>> having some kind of bound on how long it takes to update the data
>>> structures.)
>>>
>>> The advantage: clock_gettime would never block.
>>> The disadvantage: complicated, potentially nasty to implement, and it
>>> would get complicated if anyone tried to allow multiple updates in
>>> rapid succession.
>>
>> Yes, that way you can chain a number of linear segments in various
>> slots, but you're indeed right in that it will limit the update
>> frequency. More slots will give you more room, but eventually you're
>> limited.
>>
>> I suppose NTP is the primary updater in that case; does that have a
>> limit on the updates? All the other updates we can artificially limit,
>> that shouldn't really matter.
>>
>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>> little more crap simply doesn't matter.
>>
>> For the 'stable' TSC on modern machines we never set the frequency and
>> it doesn't matter anyway.
>
> I assume that NTP works by fiddling with the frequency and offset on
> a regular basis to preserve monotonicity while still controlling the
> clock.
>
> TBH, I've never understood why the NTP code is so integrated into the
> kernel's timing infrastructure or, for that matter, lives in the kernel
> at all. Shouldn't the core interface be something more like "starting
> at time t_1, change the frequency to f_1, then at time t_2, change the
> frequency to f_2"?
> That would give the ability to manage a control
> loop in userspace (or some kernel thread) and to reliably slew the
> clock by a small, fixed amount. I suppose this could be elaborated to
> allow more than two adjustments to be scheduled in advance, but I
> really don't see the need for anything much fancier.
>
> Benefits:
> - Comprehensible without reading the entire NTP spec and all the
> various addenda.
> - No need for any timing code at all in the tick handler -- the whole
> thing could presumably be done with hrtimers and a bit fancier data
> structure that lets clock_gettime figure out when to update*.
> - Things like PTP don't need to pretend to be NTP.
>
> Disadvantages: No clue, since I don't know why NTP works the way it
> does right now.
>
> * vclock_gettime on x86_64 already has a branch that depends on the
> time. I think that a good implementation could do all of this fancy
> stuff with exactly one branch, resulting in the fast path being just
> as fast.
>
> --Andy

-- 
Sent from my mobile phone. Please pardon brevity and lack of formatting.
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-12-12 20:14 UTC
To: H. Peter Anvin
Cc: Peter Zijlstra, Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck

On Sat, Nov 30, 2013 at 9:34 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> There is a huge difference between something that breaks after 2^32 and
> 2^64 events. Very few computers will ever be able to have 2^64 events
> of any kind in their lifetime, never mind a single boot.

Given that struct seqcount contains an "unsigned" counter, the 32-bit wraparound thing could be a problem in practice. I hope there aren't security vulnerabilities in which userspace overflows a kernel refcount, seqcount, or other similar structure.

--Andy
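For the record, the 32-bit wraparound worry is easy to state concretely. A toy model (not the kernel's seqcount API; names invented) of a reader-side recheck defeated by a sleep spanning exactly 2^32 writes:

```c
#include <stdint.h>

/* A seqcount-style reader samples 'seq' before reading and rechecks it
 * afterwards.  If the writer incremented a 32-bit counter an exact
 * multiple of 2^32 times while the reader slept, the recheck passes
 * even though the data changed underneath. */
static uint32_t seq_after(uint32_t seq, uint64_t writer_increments)
{
	return seq + (uint32_t)writer_increments;	/* wraps mod 2^32 */
}

static int reader_detects_update(uint32_t seq_before, uint64_t writer_increments)
{
	return seq_after(seq_before, writer_increments) != seq_before;
}
```

This is the sense in which a 64-bit counter is safe in practice (per hpa, 2^64 events won't happen in a machine's lifetime) while a 32-bit one might not be.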
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: John Stultz @ 2013-12-02 19:12 UTC
To: Andy Lutomirski, Peter Zijlstra
Cc: Eliezer Tamir, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

On 11/30/2013 09:29 AM, Andy Lutomirski wrote:
> [Subject changed because this isn't relevant to the patches in
> question any more.]
>
> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>> I've occasionally wondered whether it would be possible to make a
>>> monotonicity-preserving version of this and use it for clock_gettime.
>>> One approach: have the writer set the time for the update to be a bit
>>> in the future and have the reader compare the current raw time to the
>>> cutoff to see which set of frequency/offset to use. (This requires
>>> having some kind of bound on how long it takes to update the data
>>> structures.)
>>>
>>> The advantage: clock_gettime would never block.
>>> The disadvantage: complicated, potentially nasty to implement, and it
>>> would get complicated if anyone tried to allow multiple updates in
>>> rapid succession.

So yea, I talked a bit with Mathieu Desnoyers during Plumbers about this, since he was working with Peter and proposing a very similar idea.
Unfortunately, in the timekeeping case, since we are changing the clock's frequency, there are a number of corner cases where, if the clock updating logic is delayed for some reason (a long NMI or, more realistically, quirky scheduling on the host of a VM), we could observe time inconsistencies in the readers (Peter's comment in the patch mentions this).

If you care for the details, the case in question is when the update decides to slow the clock frequency down. So we go from (F), originally set at time T_0, to (F-adj) at a given time T_1. But should this update be delayed mid-operation, readers on other cpus will continue to use frequency F. Then at some time T_2, the update completes, and thus we have a potential time inconsistency:

T_2(old) = cycles(T_2 - T_0)*F + base_time(T_0)

T_2(new) = cycles(T_2 - T_1)*(F-adj) + base_time(T_1)

where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0)*F

Thus if adj is large enough, and T_2 - T_1 is long enough, we would see time go backwards.

We can bound adj, which makes it so we'd need longer cycle intervals to observe a ns inconsistency, but with things like VM scheduling the delay could be essentially unbounded. I've discussed ideas for what I call "valid intervals", which I think is similar to what Peter is calling the "chained linear segments", where each update has a cycle interval it would be valid for, and readers would have to wait if that interval has expired. But that basically re-introduces the seqlock-style waiting, and complicates things further: if the update cpu is delayed beyond the interval, then when the hrtimer fires and we check the time, it could deadlock waiting for itself to do the update. :(

>> Yes, that way you can chain a number of linear segments in various
>> slots, but you're indeed right in that it will limit the update
>> frequency. More slots will give you more room, but eventually you're
>> limited.
>>
>> I suppose NTP is the primary updater in that case; does that have a
>> limit on the updates? All the other updates we can artificially limit,
>> that shouldn't really matter.
>>
>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>> little more crap simply doesn't matter.
>>
>> For the 'stable' TSC on modern machines we never set the frequency and
>> it doesn't matter anyway.
> I assume that NTP works by fiddling with the frequency and offset on
> a regular basis to preserve monotonicity while still controlling the
> clock.
>
> TBH, I've never understood why the NTP code is so integrated into the
> kernel's timing infrastructure or, for that matter, lives in the kernel
> at all.

It is a bit historical, though with the exception of the offset adjustments, the adjtimex interface isn't particularly NTP-exclusive.

> Shouldn't the core interface be something more like "starting
> at time t_1, change the frequency to f_1, then at time t_2, change the
> frequency to f_2"? That would give the ability to manage a control
> loop in userspace (or some kernel thread) and to reliably slew the
> clock by a small, fixed amount. I suppose this could be elaborated to
> allow more than two adjustments to be scheduled in advance, but I
> really don't see the need for anything much fancier.

The difficult part with that is that times t_1 and t_2 in your case may not be aligned with timer interrupts. So we'd have to do reader-side clock manipulation, which would add overhead, or accept the tick-granular error. It's a very similar problem to the leap-second edge issue, where we apply leap seconds at the timer tick instead of at the actual leap-second edge.

> Benefits:
> - Comprehensible without reading the entire NTP spec and all the
> various addenda.
> - No need for any timing code at all in the tick handler -- the whole
> thing could presumably be done with hrtimers and a bit fancier data
> structure that lets clock_gettime figure out when to update*.
> - Things like PTP don't need to pretend to be NTP.
>
> Disadvantages: No clue, since I don't know why NTP works the way it
> does right now.

Well, we'd still have to preserve the existing adjtimex behavior, so we have to live with that.

But I think something like this could be added on as an extended mode to adjtimex, and I have heard some folks wish for similar before, so it might be worth investigating.

> * vclock_gettime on x86_64 already has a branch that depends on the
> time. I think that a good implementation could do all of this fancy
> stuff with exactly one branch, resulting in the fast path being just
> as fast.

Does it? I don't know if I see that... maybe I'm not looking closely enough? Again, this would be important if we want to fix the leap-second edge issue as well.

thanks
-john
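John's backwards-time formulas can be sanity-checked with concrete numbers. A sketch, using base_time(T_0) = 0 and invented, arbitrary units (F in ns per cycle), with c01 = cycles(T_1 - T_0) and c12 = cycles(T_2 - T_1):

```c
#include <stdint.h>

/* T_2 as computed by a reader still using the old parameters:
 *   cycles(T_2 - T_0) * F + base_time(T_0)                      */
static uint64_t t2_old(uint64_t c01, uint64_t c12, uint64_t f)
{
	return (c01 + c12) * f;
}

/* T_2 as computed after the delayed update finally lands:
 *   cycles(T_2 - T_1) * (F - adj) + base_time(T_1)
 * where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0) * F */
static uint64_t t2_new(uint64_t c01, uint64_t c12, uint64_t f, uint64_t adj)
{
	uint64_t base_t1 = c01 * f;

	return c12 * (f - adj) + base_t1;
}
```

Subtracting the two shows the discrepancy is exactly cycles(T_2 - T_1) * adj, which is why bounding adj only stretches, and never eliminates, the interval needed to observe time going backwards.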
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-12-02 20:28 UTC
To: John Stultz
Cc: Peter Zijlstra, Eliezer Tamir, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

On Mon, Dec 2, 2013 at 11:12 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 11/30/2013 09:29 AM, Andy Lutomirski wrote:
>> [Subject changed because this isn't relevant to the patches in
>> question any more.]
>>
>> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>>> I've occasionally wondered whether it would be possible to make a
>>>> monotonicity-preserving version of this and use it for clock_gettime.
>>>> One approach: have the writer set the time for the update to be a bit
>>>> in the future and have the reader compare the current raw time to the
>>>> cutoff to see which set of frequency/offset to use. (This requires
>>>> having some kind of bound on how long it takes to update the data
>>>> structures.)
>>>>
>>>> The advantage: clock_gettime would never block.
>>>> The disadvantage: complicated, potentially nasty to implement, and it
>>>> would get complicated if anyone tried to allow multiple updates in
>>>> rapid succession.
>
> So yea, I talked a bit with Mathieu Desnoyers during Plumbers about
> this, since he was working with Peter and proposing a very similar idea.
>
> Unfortunately, in the timekeeping case, since we are changing the
> clock's frequency, there are a number of corner cases where, if the
> clock updating logic is delayed for some reason (a long NMI or, more
> realistically, quirky scheduling on the host of a VM), we could observe
> time inconsistencies in the readers (Peter's comment in the patch
> mentions this).
>
> If you care for the details, the case in question is when the update
> decides to slow the clock frequency down. So we go from (F), originally
> set at time T_0, to (F-adj) at a given time T_1. But should this update
> be delayed mid-operation, readers on other cpus will continue to use
> frequency F. Then at some time T_2, the update completes, and thus we
> have a potential time inconsistency:
>
> T_2(old) = cycles(T_2 - T_0)*F + base_time(T_0)
>
> T_2(new) = cycles(T_2 - T_1)*(F-adj) + base_time(T_1)
>
> where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0)*F
>
> Thus if adj is large enough, and T_2 - T_1 is long enough, we would see
> time go backwards.
>
> We can bound adj, which makes it so we'd need longer cycle intervals to
> observe a ns inconsistency, but with things like VM scheduling the
> delay could be essentially unbounded. I've discussed ideas for what I
> call "valid intervals", which I think is similar to what Peter is
> calling the "chained linear segments", where each update has a cycle
> interval it would be valid for, and readers would have to wait if that
> interval has expired. But that basically re-introduces the seqlock-style
> waiting, and complicates things further: if the update cpu is delayed
> beyond the interval, then when the hrtimer fires and we check the time,
> it could deadlock waiting for itself to do the update. :(

Maybe the vdso variant could fall back to a real syscall if the update gets delayed, and that syscall could do something intelligent (like blocking).
This would be more complicated than the current setup, but I doubt that any new deadlocks could be introduced, since the current timing code already blocks for an unbounded amount of time if the updater lags out.

>>> Yes, that way you can chain a number of linear segments in various
>>> slots, but you're indeed right in that it will limit the update
>>> frequency. More slots will give you more room, but eventually you're
>>> limited.
>>>
>>> I suppose NTP is the primary updater in that case; does that have a
>>> limit on the updates? All the other updates we can artificially limit,
>>> that shouldn't really matter.
>>>
>>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>>> little more crap simply doesn't matter.
>>>
>>> For the 'stable' TSC on modern machines we never set the frequency and
>>> it doesn't matter anyway.
>> I assume that NTP works by fiddling with the frequency and offset on
>> a regular basis to preserve monotonicity while still controlling the
>> clock.
>>
>> TBH, I've never understood why the NTP code is so integrated into the
>> kernel's timing infrastructure or, for that matter, lives in the kernel
>> at all.
>
> It is a bit historical, though with the exception of the offset
> adjustments, the adjtimex interface isn't particularly NTP-exclusive.
>
>> Shouldn't the core interface be something more like "starting
>> at time t_1, change the frequency to f_1, then at time t_2, change the
>> frequency to f_2"? That would give the ability to manage a control
>> loop in userspace (or some kernel thread) and to reliably slew the
>> clock by a small, fixed amount. I suppose this could be elaborated to
>> allow more than two adjustments to be scheduled in advance, but I
>> really don't see the need for anything much fancier.
>
> The difficult part with that is that times t_1 and t_2 in your case may
> not be aligned with timer interrupts.
> So we'd have to do reader-side clock manipulation, which would add
> overhead, or accept the tick-granular error. It's a very similar
> problem to the leap-second edge issue, where we apply leap seconds at
> the timer tick instead of at the actual leap-second edge.

I still think that (the VM / NMI latency issue aside) the reader-side manipulation would be essentially free, in that it could replace the current check for negative times (see below).

>> Benefits:
>> - Comprehensible without reading the entire NTP spec and all the
>> various addenda.
>> - No need for any timing code at all in the tick handler -- the whole
>> thing could presumably be done with hrtimers and a bit fancier data
>> structure that lets clock_gettime figure out when to update*.
>> - Things like PTP don't need to pretend to be NTP.
>>
>> Disadvantages: No clue, since I don't know why NTP works the way it
>> does right now.
>
> Well, we'd still have to preserve the existing adjtimex behavior, so we
> have to live with that.
>
> But I think something like this could be added on as an extended mode to
> adjtimex, and I have heard some folks wish for similar before, so it
> might be worth investigating.

Is it currently guaranteed that CLOCK_REALTIME and CLOCK_MONOTONIC advance at exactly the same rate except when the time is stepped? If so, and if extended adjtimex things are being considered, I think that a lot of userspace code could benefit from the ability to directly read the realtime-monotonic offset (especially combined with the timerfd settimeofday detection stuff).

>> * vclock_gettime on x86_64 already has a branch that depends on the
>> time. I think that a good implementation could do all of this fancy
>> stuff with exactly one branch, resulting in the fast path being just
>> as fast.
>
> Does it? I don't know if I see that... maybe I'm not looking closely
> enough? Again, this would be important if we want to fix the leap-second
> edge issue as well.
It's this thing in vread_tsc:

	if (likely(ret >= last))
		return ret;

As long as the frequencies and offsets are chosen such that a little bit of TSC skew won't overflow, then I think this can go away and get replaced by a similar branch to do reader-side frequency switching at predefined times.

Some day I'd like to see the vdso clock-reading code be unified with the in-kernel code. It does essentially the same thing.

--Andy
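The one-branch reader-side switch being described could look roughly like the sketch below. This is not actual vDSO code; all names, the toy linear scale, and the two-segment layout are invented for illustration:

```c
#include <stdint.h>

/* Two parameter sets published in advance by the updater; the reader
 * compares the raw cycle count against a precomputed cutoff to decide
 * which set applies.  The idea is that this single time-dependent
 * branch replaces the existing ret >= last check rather than adding
 * a second one. */
struct clock_params {
	uint64_t base_cycles;	/* cycle count at segment start */
	uint64_t base_ns;	/* clock reading at segment start */
	uint64_t mult;		/* toy linear scale, ns per cycle */
};

struct clock_page {
	uint64_t cutoff_cycles;	/* 'cur' is valid below this count */
	struct clock_params cur;
	struct clock_params next;
};

static uint64_t read_clock_ns(const struct clock_page *p, uint64_t cycles)
{
	const struct clock_params *c = &p->cur;

	if (cycles >= p->cutoff_cycles)	/* the single time-dependent branch */
		c = &p->next;
	return c->base_ns + (cycles - c->base_cycles) * c->mult;
}
```

Monotonicity requires the updater to choose 'next' so that both segments agree at the cutoff, which is exactly the writer-side bookkeeping John's valid-interval concern is about.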