* Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-11-30 17:29 UTC
To: Peter Zijlstra
Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

[Subject changed because this isn't relevant to the patches in question any more.]

On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Use the 'latch' data structure for cyc2ns.
>> >
>> > This is a data structure first proposed by me and later named by
>> > Mathieu. If anybody's got a better name; do holler.
>>
>> That structure must exist in the literature, but I have no idea what
>> it's called. It's a multi-word lock-free atomic (I think -- maybe
>> it's just regular) register. I even published a considerably fancier
>> version of much the same thing a few years ago. :)
>
> Yeah, it's a fairly straightforward thing; it has to be named someplace ;-)

The trouble is that this data structure (like seqlocks, refcounts, and lots of real-world synchronization things) fails if the reader falls asleep for a multiple of 2^32 (or 2^64) writes. The literature generally doesn't like such things.

(As a thoroughly irrelevant aside, TSX, and maybe even LL/SC, can be used to work around that issue in case anyone cares.)

>> I've occasionally wondered whether it would be possible to make a
>> monotonicity-preserving version of this and use it for clock_gettime.
>> One approach: have the writer set the time for the update to be a bit
>> in the future and have the reader compare the current raw time to the
>> cutoff to see which set of frequency/offset to use. (This requires
>> having some kind of bound on how long it takes to update the data
>> structures.)
>>
>> The advantage: clock_gettime would never block.
>> The disadvantage: complicated, potentially nasty to implement, and it
>> would get complicated if anyone tried to allow multiple updates in
>> rapid succession.
>
> Yes, that way you can chain a number of linear segments in various
> slots, but you're indeed right in that it will limit the update
> frequency. More slots will give you more room, but eventually you're
> limited.
>
> I suppose NTP is the primary updater in that case; does that have a
> limit on the updates? All the other updates we can artificially limit,
> that shouldn't really matter.
>
> But yeah, in my case we pretty much assume the TSC is complete crap and a
> little more crap simply doesn't matter.
>
> For the 'stable' TSC on modern machines we never set the frequency and
> it doesn't matter anyway.

I assume that NTP works by fiddling with the frequency and offset on a regular basis to preserve monotonicity while still controlling the clock.

TBH, I've never understood why the NTP code is so integrated into the kernel's timing infrastructure or, for that matter, lives in the kernel at all. Shouldn't the core interface be something more like "starting at time t_1, change the frequency to f_1, then at time t_2, change the frequency to f_2"? That would give the ability to manage a control loop in userspace (or some kernel thread) and to reliably slew the clock by a small, fixed amount. I suppose this could be elaborated to allow more than two adjustments to be scheduled in advance, but I really don't see the need for anything much fancier.

Benefits:
- Comprehensible without reading the entire NTP spec and all the various addenda.
- No need for any timing code at all in the tick handler -- the whole thing could presumably be done with hrtimers and a bit fancier data structure that lets clock_gettime figure out when to update*.
- Things like PTP don't need to pretend to be NTP.

Disadvantages: No clue, since I don't know why NTP works the way it does right now.

* vclock_gettime on x86_64 already has a branch that depends on the time. I think that a good implementation could do all of this fancy stuff with exactly one branch, resulting in the fast path being just as fast.

--Andy
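For reference, the 'latch' structure under discussion can be sketched in userspace roughly as follows. This is an illustrative toy, not the kernel's actual implementation; all names are invented, and the memory ordering is simplified:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy sketch of the 'latch': two copies of the conversion data plus an
 * ever-increasing sequence count.  The writer updates the copy readers
 * are not currently using, then bumps the count; readers pick a copy by
 * the low bit of the count and retry if it moved underneath them. */
struct cyc2ns_data {
	uint32_t mult;
	uint32_t shift;
	uint64_t offset;
};

struct cyc2ns_latch {
	_Atomic uint32_t seq;
	struct cyc2ns_data data[2];
};

static void latch_write(struct cyc2ns_latch *l, struct cyc2ns_data d)
{
	uint32_t seq = atomic_load_explicit(&l->seq, memory_order_relaxed);

	/* (seq + 1) & 1 selects the copy readers are NOT using */
	l->data[(seq + 1) & 1] = d;
	atomic_store_explicit(&l->seq, seq + 1, memory_order_release);
}

static struct cyc2ns_data latch_read(struct cyc2ns_latch *l)
{
	struct cyc2ns_data d;
	uint32_t seq;

	do {
		seq = atomic_load_explicit(&l->seq, memory_order_acquire);
		d = l->data[seq & 1];
	} while (seq != atomic_load_explicit(&l->seq, memory_order_acquire));
	return d;
}
```

Note the failure mode Andy describes: a reader that sleeps across exactly a multiple of 2^32 writes sees an unchanged seq on the recheck and cannot tell that anything happened.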
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: H. Peter Anvin @ 2013-11-30 17:34 UTC
To: Andy Lutomirski, Peter Zijlstra
Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck

There is a huge difference between something that breaks after 2^32 and 2^64 events. Very few computers will ever be able to have 2^64 events of any kind in their lifetime, never mind in a single boot.

Andy Lutomirski <luto@amacapital.net> wrote:
> [Subject changed because this isn't relevant to the patches in
> question any more.]
>
> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>> On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > Use the 'latch' data structure for cyc2ns.
>>> >
>>> > This is a data structure first proposed by me and later named by
>>> > Mathieu. If anybody's got a better name; do holler.
>>>
>>> That structure must exist in the literature, but I have no idea what
>>> it's called. It's a multi-word lock-free atomic (I think -- maybe
>>> it's just regular) register. I even published a considerably fancier
>>> version of much the same thing a few years ago. :)
>>
>> Yeah, it's a fairly straightforward thing; it has to be named someplace ;-)
>
> The trouble is that this data structure (like seqlocks, refcounts, and
> lots of real-world synchronization things) fails if the reader falls
> asleep for a multiple of 2^32 (or 2^64) writes. The literature
> generally doesn't like such things.
>
> (As a thoroughly irrelevant aside, TSX, and maybe even LL/SC, can be
> used to work around that issue in case anyone cares.)
>
>>
>>> I've occasionally wondered whether it would be possible to make a
>>> monotonicity-preserving version of this and use it for clock_gettime.
>>> One approach: have the writer set the time for the update to be a bit
>>> in the future and have the reader compare the current raw time to the
>>> cutoff to see which set of frequency/offset to use. (This requires
>>> having some kind of bound on how long it takes to update the data
>>> structures.)
>>>
>>> The advantage: clock_gettime would never block.
>>> The disadvantage: complicated, potentially nasty to implement, and it
>>> would get complicated if anyone tried to allow multiple updates in
>>> rapid succession.
>>
>> Yes, that way you can chain a number of linear segments in various
>> slots, but you're indeed right in that it will limit the update
>> frequency. More slots will give you more room, but eventually you're
>> limited.
>>
>> I suppose NTP is the primary updater in that case; does that have a
>> limit on the updates? All the other updates we can artificially limit,
>> that shouldn't really matter.
>>
>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>> little more crap simply doesn't matter.
>>
>> For the 'stable' TSC on modern machines we never set the frequency and
>> it doesn't matter anyway.
>
> I assume that NTP works by fiddling with the frequency and offset on
> a regular basis to preserve monotonicity while still controlling the
> clock.
>
> TBH, I've never understood why the NTP code is so integrated into the
> kernel's timing infrastructure or, for that matter, lives in the kernel
> at all. Shouldn't the core interface be something more like "starting
> at time t_1, change the frequency to f_1, then at time t_2, change the
> frequency to f_2"?
> That would give the ability to manage a control
> loop in userspace (or some kernel thread) and to reliably slew the
> clock by a small, fixed amount. I suppose this could be elaborated to
> allow more than two adjustments to be scheduled in advance, but I
> really don't see the need for anything much fancier.
>
> Benefits:
> - Comprehensible without reading the entire NTP spec and all the
> various addenda.
> - No need for any timing code at all in the tick handler -- the whole
> thing could presumably be done with hrtimers and a bit fancier data
> structure that lets clock_gettime figure out when to update*.
> - Things like PTP don't need to pretend to be NTP.
>
> Disadvantages: No clue, since I don't know why NTP works the way it
> does right now.
>
> * vclock_gettime on x86_64 already has a branch that depends on the
> time. I think that a good implementation could do all of this fancy
> stuff with exactly one branch, resulting in the fast path being just
> as fast.
>
> --Andy

-- 
Sent from my mobile phone. Please pardon brevity and lack of formatting.
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-12-12 20:14 UTC
To: H. Peter Anvin
Cc: Peter Zijlstra, Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck

On Sat, Nov 30, 2013 at 9:34 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> There is a huge difference between something that breaks after 2^32 and
> 2^64 events. Very few computers will ever be able to have 2^64 events
> of any kind in their lifetime, never mind a single boot.

Given that struct seqcount contains an "unsigned" counter, the 32-bit wraparound thing could be a problem in practice. I hope there aren't security vulnerabilities in which userspace overflows a kernel refcount, seqcount, or other similar structure.

--Andy
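For the record, the 32-bit wraparound worry is easy to state concretely. A toy model (not the kernel's seqcount API; names invented) of a reader-side recheck defeated by a sleep spanning exactly 2^32 writes:

```c
#include <stdint.h>

/* A seqcount-style reader samples 'seq' before reading and rechecks it
 * afterwards.  If the writer incremented a 32-bit counter an exact
 * multiple of 2^32 times while the reader slept, the recheck passes
 * even though the data changed underneath. */
static uint32_t seq_after(uint32_t seq, uint64_t writer_increments)
{
	return seq + (uint32_t)writer_increments;	/* wraps mod 2^32 */
}

static int reader_detects_update(uint32_t seq_before, uint64_t writer_increments)
{
	return seq_after(seq_before, writer_increments) != seq_before;
}
```

This is the sense in which a 64-bit counter is safe in practice (per hpa, 2^64 events won't happen in a machine's lifetime) while a 32-bit one might not be.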
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: John Stultz @ 2013-12-02 19:12 UTC
To: Andy Lutomirski, Peter Zijlstra
Cc: Eliezer Tamir, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

On 11/30/2013 09:29 AM, Andy Lutomirski wrote:
> [Subject changed because this isn't relevant to the patches in
> question any more.]
>
> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>> I've occasionally wondered whether it would be possible to make a
>>> monotonicity-preserving version of this and use it for clock_gettime.
>>> One approach: have the writer set the time for the update to be a bit
>>> in the future and have the reader compare the current raw time to the
>>> cutoff to see which set of frequency/offset to use. (This requires
>>> having some kind of bound on how long it takes to update the data
>>> structures.)
>>>
>>> The advantage: clock_gettime would never block.
>>> The disadvantage: complicated, potentially nasty to implement, and it
>>> would get complicated if anyone tried to allow multiple updates in
>>> rapid succession.

So yea, I talked a bit with Mathieu Desnoyers during Plumbers about this, since he was working with Peter and proposing a very similar idea.
Unfortunately, in the timekeeping case, since we are changing the clock's frequency, there are a number of corner cases where, if the clock updating logic is delayed for some reason (a long NMI or, more realistically, quirky scheduling on the host of a VM), we could observe time inconsistencies in the readers (Peter's comment in the patch mentions this).

If you care for the details, the case in question is when the update decides to slow the clock frequency down. So we go from (F), originally set at time T_0, to (F-adj) at a given time T_1. But should this update be delayed mid-operation, readers on other cpus will continue to use frequency F. Then at some time T_2, the update completes, and thus we have a potential time inconsistency:

T_2(old) = cycles(T_2 - T_0)*F + base_time(T_0)

T_2(new) = cycles(T_2 - T_1)*(F-adj) + base_time(T_1)

where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0)*F

Thus if adj is large enough, and T_2 - T_1 is long enough, we would see time go backwards.

We can bound adj, which makes it so we'd need longer cycle intervals to observe a ns inconsistency, but with things like VM scheduling the delay could be essentially unbounded. I've discussed ideas for what I call "valid intervals", which I think is similar to what Peter is calling the "chained linear segments", where each update has a cycle interval it would be valid for, and readers would have to wait if that interval has expired. But that basically re-introduces the seqlock-style waiting, and complicates things further: if the update cpu is delayed beyond the interval, then when the hrtimer fires and we check the time, it could deadlock waiting for itself to do the update. :(

>> Yes, that way you can chain a number of linear segments in various
>> slots, but you're indeed right in that it will limit the update
>> frequency. More slots will give you more room, but eventually you're
>> limited.
>>
>> I suppose NTP is the primary updater in that case; does that have a
>> limit on the updates? All the other updates we can artificially limit,
>> that shouldn't really matter.
>>
>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>> little more crap simply doesn't matter.
>>
>> For the 'stable' TSC on modern machines we never set the frequency and
>> it doesn't matter anyway.
> I assume that NTP works by fiddling with the frequency and offset on
> a regular basis to preserve monotonicity while still controlling the
> clock.
>
> TBH, I've never understood why the NTP code is so integrated into the
> kernel's timing infrastructure or, for that matter, lives in the kernel
> at all.

It is a bit historical, though with the exception of the offset adjustments, the adjtimex interface isn't particularly NTP-exclusive.

> Shouldn't the core interface be something more like "starting
> at time t_1, change the frequency to f_1, then at time t_2, change the
> frequency to f_2"? That would give the ability to manage a control
> loop in userspace (or some kernel thread) and to reliably slew the
> clock by a small, fixed amount. I suppose this could be elaborated to
> allow more than two adjustments to be scheduled in advance, but I
> really don't see the need for anything much fancier.

The difficult part with that is that times t_1 and t_2 in your case may not be aligned with timer interrupts. So we'd have to do reader-side clock manipulation, which would add overhead, or accept the tick-granular error. It's a very similar problem to the leap-second edge issue, where we apply leap seconds at the timer tick instead of at the actual leap-second edge.

> Benefits:
> - Comprehensible without reading the entire NTP spec and all the
> various addenda.
> - No need for any timing code at all in the tick handler -- the whole
> thing could presumably be done with hrtimers and a bit fancier data
> structure that lets clock_gettime figure out when to update*.
> - Things like PTP don't need to pretend to be NTP.
>
> Disadvantages: No clue, since I don't know why NTP works the way it
> does right now.

Well, we'd still have to preserve the existing adjtimex behavior, so we have to live with that.

But I think something like this could be added on as an extended mode to adjtimex, and I have heard some folks wish for similar before, so it might be worth investigating.

> * vclock_gettime on x86_64 already has a branch that depends on the
> time. I think that a good implementation could do all of this fancy
> stuff with exactly one branch, resulting in the fast path being just
> as fast.

Does it? I don't know if I see that... maybe I'm not looking closely enough? Again, this would be important if we want to fix the leap-second edge issue as well.

thanks
-john
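John's backwards-time formulas can be sanity-checked with concrete numbers. A sketch, using base_time(T_0) = 0 and invented, arbitrary units (F in ns per cycle), with c01 = cycles(T_1 - T_0) and c12 = cycles(T_2 - T_1):

```c
#include <stdint.h>

/* T_2 as computed by a reader still using the old parameters:
 *   cycles(T_2 - T_0) * F + base_time(T_0)                      */
static uint64_t t2_old(uint64_t c01, uint64_t c12, uint64_t f)
{
	return (c01 + c12) * f;
}

/* T_2 as computed after the delayed update finally lands:
 *   cycles(T_2 - T_1) * (F - adj) + base_time(T_1)
 * where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0) * F */
static uint64_t t2_new(uint64_t c01, uint64_t c12, uint64_t f, uint64_t adj)
{
	uint64_t base_t1 = c01 * f;

	return c12 * (f - adj) + base_t1;
}
```

Subtracting the two shows the discrepancy is exactly cycles(T_2 - T_1) * adj, which is why bounding adj only stretches, and never eliminates, the interval needed to observe time going backwards.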
* Re: Clock control algorithms (Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns)

From: Andy Lutomirski @ 2013-12-02 20:28 UTC
To: John Stultz
Cc: Peter Zijlstra, Eliezer Tamir, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org, Tony Luck, H. Peter Anvin

On Mon, Dec 2, 2013 at 11:12 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 11/30/2013 09:29 AM, Andy Lutomirski wrote:
>> [Subject changed because this isn't relevant to the patches in
>> question any more.]
>>
>> On Sat, Nov 30, 2013 at 1:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
>>>> I've occasionally wondered whether it would be possible to make a
>>>> monotonicity-preserving version of this and use it for clock_gettime.
>>>> One approach: have the writer set the time for the update to be a bit
>>>> in the future and have the reader compare the current raw time to the
>>>> cutoff to see which set of frequency/offset to use. (This requires
>>>> having some kind of bound on how long it takes to update the data
>>>> structures.)
>>>>
>>>> The advantage: clock_gettime would never block.
>>>> The disadvantage: complicated, potentially nasty to implement, and it
>>>> would get complicated if anyone tried to allow multiple updates in
>>>> rapid succession.
>
> So yea, I talked a bit with Mathieu Desnoyers during Plumbers about
> this, since he was working with Peter and proposing a very similar idea.
>
> Unfortunately, in the timekeeping case, since we are changing the
> clock's frequency, there are a number of corner cases where, if the
> clock updating logic is delayed for some reason (a long NMI or, more
> realistically, quirky scheduling on the host of a VM), we could observe
> time inconsistencies in the readers (Peter's comment in the patch
> mentions this).
>
> If you care for the details, the case in question is when the update
> decides to slow the clock frequency down. So we go from (F), originally
> set at time T_0, to (F-adj) at a given time T_1. But should this update
> be delayed mid-operation, readers on other cpus will continue to use
> frequency F. Then at some time T_2, the update completes, and thus we
> have a potential time inconsistency:
>
> T_2(old) = cycles(T_2 - T_0)*F + base_time(T_0)
>
> T_2(new) = cycles(T_2 - T_1)*(F-adj) + base_time(T_1)
>
> where base_time(T_1) = base_time(T_0) + cycles(T_1 - T_0)*F
>
> Thus if adj is large enough, and T_2 - T_1 is long enough, we would see
> time go backwards.
>
> We can bound adj, which makes it so we'd need longer cycle intervals to
> observe a ns inconsistency, but with things like VM scheduling the
> delay could be essentially unbounded. I've discussed ideas for what I
> call "valid intervals", which I think is similar to what Peter is
> calling the "chained linear segments", where each update has a cycle
> interval it would be valid for, and readers would have to wait if that
> interval has expired. But that basically re-introduces the seqlock-style
> waiting, and complicates things further: if the update cpu is delayed
> beyond the interval, then when the hrtimer fires and we check the time,
> it could deadlock waiting for itself to do the update. :(

Maybe the vdso variant could fall back to a real syscall if the update gets delayed, and that syscall could do something intelligent (like blocking).
This would be more complicated than the current setup, but I doubt that any new deadlocks could be introduced, since the current timing code already blocks for an unbounded amount of time if the updater lags out.

>>> Yes, that way you can chain a number of linear segments in various
>>> slots, but you're indeed right in that it will limit the update
>>> frequency. More slots will give you more room, but eventually you're
>>> limited.
>>>
>>> I suppose NTP is the primary updater in that case; does that have a
>>> limit on the updates? All the other updates we can artificially limit,
>>> that shouldn't really matter.
>>>
>>> But yeah, in my case we pretty much assume the TSC is complete crap and a
>>> little more crap simply doesn't matter.
>>>
>>> For the 'stable' TSC on modern machines we never set the frequency and
>>> it doesn't matter anyway.
>> I assume that NTP works by fiddling with the frequency and offset on
>> a regular basis to preserve monotonicity while still controlling the
>> clock.
>>
>> TBH, I've never understood why the NTP code is so integrated into the
>> kernel's timing infrastructure or, for that matter, lives in the kernel
>> at all.
>
> It is a bit historical, though with the exception of the offset
> adjustments, the adjtimex interface isn't particularly NTP-exclusive.
>
>> Shouldn't the core interface be something more like "starting
>> at time t_1, change the frequency to f_1, then at time t_2, change the
>> frequency to f_2"? That would give the ability to manage a control
>> loop in userspace (or some kernel thread) and to reliably slew the
>> clock by a small, fixed amount. I suppose this could be elaborated to
>> allow more than two adjustments to be scheduled in advance, but I
>> really don't see the need for anything much fancier.
>
> The difficult part with that is that times t_1 and t_2 in your case may
> not be aligned with timer interrupts.
> So we'd have to do reader-side clock manipulation, which would add
> overhead, or accept the tick-granular error. It's a very similar
> problem to the leap-second edge issue, where we apply leap seconds at
> the timer tick instead of at the actual leap-second edge.

I still think that (the VM / NMI latency issue aside) the reader-side manipulation would be essentially free, in that it could replace the current check for negative times (see below).

>> Benefits:
>> - Comprehensible without reading the entire NTP spec and all the
>> various addenda.
>> - No need for any timing code at all in the tick handler -- the whole
>> thing could presumably be done with hrtimers and a bit fancier data
>> structure that lets clock_gettime figure out when to update*.
>> - Things like PTP don't need to pretend to be NTP.
>>
>> Disadvantages: No clue, since I don't know why NTP works the way it
>> does right now.
>
> Well, we'd still have to preserve the existing adjtimex behavior, so we
> have to live with that.
>
> But I think something like this could be added on as an extended mode to
> adjtimex, and I have heard some folks wish for similar before, so it
> might be worth investigating.

Is it currently guaranteed that CLOCK_REALTIME and CLOCK_MONOTONIC advance at exactly the same rate except when the time is stepped? If so, and if extended adjtimex things are being considered, I think that a lot of userspace code could benefit from the ability to directly read the realtime-monotonic offset (especially combined with the timerfd settimeofday detection stuff).

>> * vclock_gettime on x86_64 already has a branch that depends on the
>> time. I think that a good implementation could do all of this fancy
>> stuff with exactly one branch, resulting in the fast path being just
>> as fast.
>
> Does it? I don't know if I see that... maybe I'm not looking closely
> enough? Again, this would be important if we want to fix the leap-second
> edge issue as well.
It's this thing in vread_tsc:

	if (likely(ret >= last))
		return ret;

As long as the frequencies and offsets are chosen such that a little bit of TSC skew won't overflow, then I think this can go away and get replaced by a similar branch to do reader-side frequency switching at predefined times.

Some day I'd like to see the vdso clock-reading code be unified with the in-kernel code. It does essentially the same thing.

--Andy
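The one-branch reader-side switch being described could look roughly like the sketch below. This is not actual vDSO code; all names, the toy linear scale, and the two-segment layout are invented for illustration:

```c
#include <stdint.h>

/* Two parameter sets published in advance by the updater; the reader
 * compares the raw cycle count against a precomputed cutoff to decide
 * which set applies.  The idea is that this single time-dependent
 * branch replaces the existing ret >= last check rather than adding
 * a second one. */
struct clock_params {
	uint64_t base_cycles;	/* cycle count at segment start */
	uint64_t base_ns;	/* clock reading at segment start */
	uint64_t mult;		/* toy linear scale, ns per cycle */
};

struct clock_page {
	uint64_t cutoff_cycles;	/* 'cur' is valid below this count */
	struct clock_params cur;
	struct clock_params next;
};

static uint64_t read_clock_ns(const struct clock_page *p, uint64_t cycles)
{
	const struct clock_params *c = &p->cur;

	if (cycles >= p->cutoff_cycles)	/* the single time-dependent branch */
		c = &p->next;
	return c->base_ns + (cycles - c->base_cycles) * c->mult;
}
```

Monotonicity requires the updater to choose 'next' so that both segments agree at the cutoff, which is exactly the writer-side bookkeeping John's valid-interval concern is about.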