* [BUG, bisect] hrtimer: severe lag after suspend & resume @ 2015-06-04 0:56 Jeremiah Mahler 2015-06-04 11:22 ` Thomas Gleixner 2015-06-04 22:54 ` John Stultz 0 siblings, 2 replies; 13+ messages in thread From: Jeremiah Mahler @ 2015-06-04 0:56 UTC (permalink / raw) To: Thomas Gleixner Cc: Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, John Stultz, linux-kernel all, After a fresh boot, the Chrome web browser behaves normally. Pages load quickly and scroll fast. Even image heavy sites such as images.google.com work fine. However, after a suspend and resume cycle, Chrome becomes very slow. Pages take ten seconds or more to load. The scroll bars and buttons are almost completely unresponsive. Interestingly, I can run Firefox on the same sites and it has no issue whatsoever. I have bisected the kernel and found that the following commit introduced the bug. It is present in the latest linux-next (20150602). From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner <tglx@linutronix.de> Date: Tue, 14 Apr 2015 21:08:37 +0000 Subject: [PATCH] hrtimer: Make offset update smarter On every tick/hrtimer interrupt we update the offset variables of the clock bases. That's silly because these offsets change very seldom. Add a sequence counter to the time keeping code which keeps track of the offset updates (clock_was_set()). Have a sequence cache in the hrtimer cpu bases to evaluate whether the offsets must be updated or not. This allows us later to avoid pointless cacheline pollution. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/20150414203501.132820245@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> --- include/linux/hrtimer.h | 4 ++-- include/linux/timekeeper_internal.h | 2 ++ kernel/time/hrtimer.c | 3 ++- kernel/time/timekeeping.c | 23 ++++++++++++++++------- kernel/time/timekeeping.h | 7 ++++--- 5 files changed, 26 insertions(+), 13 deletions(-) -- - Jeremiah Mahler ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 0:56 [BUG, bisect] hrtimer: severe lag after suspend & resume Jeremiah Mahler @ 2015-06-04 11:22 ` Thomas Gleixner 2015-06-04 20:13 ` Jeremiah Mahler 2015-06-04 22:54 ` John Stultz 1 sibling, 1 reply; 13+ messages in thread From: Thomas Gleixner @ 2015-06-04 11:22 UTC (permalink / raw) To: Jeremiah Mahler Cc: Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, John Stultz, linux-kernel On Wed, 3 Jun 2015, Jeremiah Mahler wrote: > After a fresh boot, the Chrome web browser behaves normally. Pages > load quickly and scroll fast. Even image heavy sites such as > images.google.com work fine. However, after a suspend and resume > cycle, Chrome becomes very slow. Pages take ten seconds or more to > load. The scroll bars and buttons are almost completely > unresponsive. Interestingly, I can run Firefox on the same sites > and it has no issue whatsoever. Weird. > I have bisected the kernel and found that the following commit > introduced the bug. It is present in the latest linux-next (20150602). > > From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001 > From: Thomas Gleixner <tglx@linutronix.de> > Date: Tue, 14 Apr 2015 21:08:37 +0000 > Subject: [PATCH] hrtimer: Make offset update smarter > > On every tick/hrtimer interrupt we update the offset variables of the > clock bases. That's silly because these offsets change very seldom. > > Add a sequence counter to the time keeping code which keeps track of > the offset updates (clock_was_set()). Have a sequence cache in the > hrtimer cpu bases to evaluate whether the offsets must be updated or > not. This allows us later to avoid pointless cacheline pollution. I had to wrap my head around that for quite a while, but I think I have decoded the issue. Can you please test the patch below whether it solves your problem? Thanks, tglx ------------------------> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 278d4b36fd94..e9dfcd0b8c41 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1492,6 +1492,12 @@ static void init_hrtimers_cpu(int cpu) cpu_base->cpu = cpu; hrtimer_init_hres(cpu_base); + /* + * Force an update by setting the clock was set sequence to an + * odd value. + */ + cpu_base->clock_was_set_seq = 1; + hrtimer_update_base(cpu_base); } #ifdef CONFIG_HOTPLUG_CPU diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 90ed5db67c1d..c97710137a9e 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -593,7 +593,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw); if (action & TK_CLOCK_WAS_SET) - tk->clock_was_set_seq++; + tk->clock_was_set_seq += 2; } /** ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 11:22 ` Thomas Gleixner @ 2015-06-04 20:13 ` Jeremiah Mahler 0 siblings, 0 replies; 13+ messages in thread From: Jeremiah Mahler @ 2015-06-04 20:13 UTC (permalink / raw) To: Thomas Gleixner Cc: Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, John Stultz, linux-kernel Thomas, On Thu, Jun 04, 2015 at 01:22:45PM +0200, Thomas Gleixner wrote: > On Wed, 3 Jun 2015, Jeremiah Mahler wrote: [...] > > I had to wrap my head around that for quite a while, but I think I > have decoded the issue. Can you please test the patch below whether it > solves your problem? > > Thanks, > > tglx > > ------------------------> > > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c > index 278d4b36fd94..e9dfcd0b8c41 100644 > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1492,6 +1492,12 @@ static void init_hrtimers_cpu(int cpu) > > cpu_base->cpu = cpu; > hrtimer_init_hres(cpu_base); > + /* > + * Force an update by setting the clock was set sequence to an > + * odd value. > + */ > + cpu_base->clock_was_set_seq = 1; > + hrtimer_update_base(cpu_base); > } > > #ifdef CONFIG_HOTPLUG_CPU > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index 90ed5db67c1d..c97710137a9e 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -593,7 +593,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) > update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw); > > if (action & TK_CLOCK_WAS_SET) > - tk->clock_was_set_seq++; > + tk->clock_was_set_seq += 2; > } > > /** With this change it won't boot. It just hangs at "Booting the kernel". -- - Jeremiah Mahler ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 0:56 [BUG, bisect] hrtimer: severe lag after suspend & resume Jeremiah Mahler 2015-06-04 11:22 ` Thomas Gleixner @ 2015-06-04 22:54 ` John Stultz 2015-06-05 0:01 ` Jeremiah Mahler ` (2 more replies) 1 sibling, 3 replies; 13+ messages in thread From: John Stultz @ 2015-06-04 22:54 UTC (permalink / raw) To: Jeremiah Mahler, Thomas Gleixner, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, John Stultz, lkml, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 2923 bytes --] On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: > all, > > After a fresh boot, the Chrome web browser behaves normally. Pages > load quickly and scroll fast. Even image heavy sites such as > images.google.com work fine. However, after a suspend and resume > cycle, Chrome becomes very slow. Pages take ten seconds or more to > load. The scroll bars and buttons are almost completely > unresponsive. Interestingly, I can run Firefox on the same sites > and it has no issue whatsoever. > > I have bisected the kernel and found that the following commit > introduced the bug. It is present in the latest linux-next (20150602). > > From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001 > From: Thomas Gleixner <tglx@linutronix.de> > Date: Tue, 14 Apr 2015 21:08:37 +0000 > Subject: [PATCH] hrtimer: Make offset update smarter > > On every tick/hrtimer interrupt we update the offset variables of the > clock bases. That's silly because these offsets change very seldom. > > Add a sequence counter to the time keeping code which keeps track of > the offset updates (clock_was_set()). Have a sequence cache in the > hrtimer cpu bases to evaluate whether the offsets must be updated or > not. This allows us later to avoid pointless cacheline pollution. > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> > Acked-by: Peter Zijlstra <peterz@infradead.org> > Cc: Viresh Kumar <viresh.kumar@linaro.org> > Cc: Marcelo Tosatti <mtosatti@redhat.com> > Cc: Frederic Weisbecker <fweisbec@gmail.com> > Cc: John Stultz <john.stultz@linaro.org> > Link: http://lkml.kernel.org/r/20150414203501.132820245@linutronix.de > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > Cc: John Stultz <john.stultz@linaro.org> > --- > include/linux/hrtimer.h | 4 ++-- > include/linux/timekeeper_internal.h | 2 ++ > kernel/time/hrtimer.c | 3 ++- > kernel/time/timekeeping.c | 23 ++++++++++++++++------- > kernel/time/timekeeping.h | 7 ++++--- > 5 files changed, 26 insertions(+), 13 deletions(-) So I suspect the problem is the change to clock_was_set_seq in timekeeping_update is done prior to mirroring the time state to the shadow-timekeeper. Thus the next time we do update_wall_time() the updated sequence is overwritten by whats in the shadow copy. The attached patch moving the modification up seems to avoid the issue for me. Thomas: Looking at the problematic change, I'm not a big fan of it. Caching timekeeping state here in the hrtimer code has been a source of bugs in the past, and I'm not sure I see how avoiding copying 24bytes is that big of a win. Especially since it adds more state to the timekeeper and hrtimer base that we have to read and mange. Personally I'd prefer a revert to my fix. thanks -john [-- Attachment #2: fixup.patch --] [-- Type: text/x-patch, Size: 675 bytes --] diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 90ed5db..53be796 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -580,6 +580,9 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) ntp_clear(); } + if (action & TK_CLOCK_WAS_SET) + tk->clock_was_set_seq++; + tk_update_ktime_data(tk); update_vsyscall(tk); @@ -591,9 +594,6 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono); update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw); - - if (action & TK_CLOCK_WAS_SET) - tk->clock_was_set_seq++; } /** ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 22:54 ` John Stultz @ 2015-06-05 0:01 ` Jeremiah Mahler 2015-06-05 7:57 ` Ingo Molnar 2015-06-05 9:14 ` Thomas Gleixner 2 siblings, 0 replies; 13+ messages in thread From: Jeremiah Mahler @ 2015-06-05 0:01 UTC (permalink / raw) To: John Stultz Cc: Thomas Gleixner, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml, Ingo Molnar John, On Thu, Jun 04, 2015 at 03:54:35PM -0700, John Stultz wrote: > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: [...] > > > So I suspect the problem is the change to clock_was_set_seq in > timekeeping_update is done prior to mirroring the time state to the > shadow-timekeeper. Thus the next time we do update_wall_time() the > updated sequence is overwritten by whats in the shadow copy. The > attached patch moving the modification up seems to avoid the issue for > me. > > Thomas: Looking at the problematic change, I'm not a big fan of it. > Caching timekeeping state here in the hrtimer code has been a source > of bugs in the past, and I'm not sure I see how avoiding copying > 24bytes is that big of a win. Especially since it adds more state to > the timekeeper and hrtimer base that we have to read and mange. > Personally I'd prefer a revert to my fix. > > thanks > -john > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index 90ed5db..53be796 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -580,6 +580,9 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) > ntp_clear(); > } > > + if (action & TK_CLOCK_WAS_SET) > + tk->clock_was_set_seq++; > + > tk_update_ktime_data(tk); > > update_vsyscall(tk); > @@ -591,9 +594,6 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) > > update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono); > update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw); > - > - if (action & TK_CLOCK_WAS_SET) > - tk->clock_was_set_seq++; > } > > /** That patch fixes the problem for me. Thanks John. -- - Jeremiah Mahler ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 22:54 ` John Stultz 2015-06-05 0:01 ` Jeremiah Mahler @ 2015-06-05 7:57 ` Ingo Molnar 2015-06-05 9:14 ` Thomas Gleixner 2 siblings, 0 replies; 13+ messages in thread From: Ingo Molnar @ 2015-06-05 7:57 UTC (permalink / raw) To: John Stultz Cc: Jeremiah Mahler, Thomas Gleixner, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml * John Stultz <john.stultz@linaro.org> wrote: > So I suspect the problem is the change to clock_was_set_seq in > timekeeping_update is done prior to mirroring the time state to the > shadow-timekeeper. Thus the next time we do update_wall_time() the updated > sequence is overwritten by whats in the shadow copy. The attached patch moving > the modification up seems to avoid the issue for me. > > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching > timekeeping state here in the hrtimer code has been a source of bugs in the > past, and I'm not sure I see how avoiding copying 24bytes is that big of a win. > Especially since it adds more state to the timekeeper and hrtimer base that we > have to read and mange. Personally I'd prefer a revert to my fix. So it's not really the copying of the 24 bytes that is the problem with that pattern. It's the constant dirtying of a cacheline that would otherwise be read-mostly, and then the reading of it from the percpu hrtimer interrupts. That can have negative scalability effects similar to a global lock. ( Now I have not checked whether this cacheline truly becomes read-mostly after the original commit - I suspect Thomas did. ) > + if (action & TK_CLOCK_WAS_SET) > + tk->clock_was_set_seq++; > + > tk_update_ktime_data(tk); So I'd also add a comment that this update should be done before: if (action & TK_MIRROR) memcpy(&shadow_timekeeper, &tk_core.timekeeper, sizeof(tk_core.timekeeper)); Also, there appears to be a layering violation here (unless I mis-read the tk_real/tk logic): this should copy 'tk', not access tk_core directly, correct? Right now this does not matter, because 'tk == &tk_core' is always supposed to be true, or at least it should point to an identical shadow copy, right? But nevertheless tk_core should only ever be directly referenced when 'tk' is initialized from it. Also, there's a few more of these apparent layering violations: git grep 'tk_core' kernel/time/timekeeping.c | grep -v 'struct timekeeper' I realize that we are transitioning from former global variables based timekeeping to a more parametric model, so this isn't a complaint, just an observation. Right now it's a code cleanliness detail, but if/when we introduce clocks that are updated independently from each other then it will also matter functionally. Thanks, Ingo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-04 22:54 ` John Stultz 2015-06-05 0:01 ` Jeremiah Mahler 2015-06-05 7:57 ` Ingo Molnar @ 2015-06-05 9:14 ` Thomas Gleixner 2015-06-05 10:07 ` Ingo Molnar 2015-06-05 14:02 ` Thomas Gleixner 2 siblings, 2 replies; 13+ messages in thread From: Thomas Gleixner @ 2015-06-05 9:14 UTC (permalink / raw) To: John Stultz Cc: Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml, Ingo Molnar On Thu, 4 Jun 2015, John Stultz wrote: > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: > So I suspect the problem is the change to clock_was_set_seq in > timekeeping_update is done prior to mirroring the time state to the > shadow-timekeeper. Thus the next time we do update_wall_time() the > updated sequence is overwritten by whats in the shadow copy. The > attached patch moving the modification up seems to avoid the issue for > me. Duh, yes. > Thomas: Looking at the problematic change, I'm not a big fan of it. > Caching timekeeping state here in the hrtimer code has been a source > of bugs in the past, and I'm not sure I see how avoiding copying > 24bytes is that big of a win. Especially since it adds more state to > the timekeeper and hrtimer base that we have to read and mange. It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. In situations where we run high frequency periodic timers on clock monotonic and nothing is going on in the other clock domains, which is a pretty common situation, this is measurable in terms of cache utilization. I went great length to optimize the cache footprint and access patterns and that unconditional update really makes a measurable difference. Thanks, tglx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-05 9:14 ` Thomas Gleixner @ 2015-06-05 10:07 ` Ingo Molnar 2015-06-05 18:52 ` John Stultz 2015-06-05 14:02 ` Thomas Gleixner 1 sibling, 1 reply; 13+ messages in thread From: Ingo Molnar @ 2015-06-05 10:07 UTC (permalink / raw) To: Thomas Gleixner Cc: John Stultz, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml * Thomas Gleixner <tglx@linutronix.de> wrote: > On Thu, 4 Jun 2015, John Stultz wrote: > > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: > > So I suspect the problem is the change to clock_was_set_seq in > > timekeeping_update is done prior to mirroring the time state to the > > shadow-timekeeper. Thus the next time we do update_wall_time() the > > updated sequence is overwritten by whats in the shadow copy. The > > attached patch moving the modification up seems to avoid the issue for > > me. > > Duh, yes. > > > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching > > timekeeping state here in the hrtimer code has been a source of bugs in the > > past, and I'm not sure I see how avoiding copying 24bytes is that big of a > > win. Especially since it adds more state to the timekeeper and hrtimer base > > that we have to read and mange. > > It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. > In situations where we run high frequency periodic timers on clock monotonic and > nothing is going on in the other clock domains, which is a pretty common > situation, this is measurable in terms of cache utilization. [...] It's not just about 'touching': it's about _dirtying_ cachelines from a globally executed function (timekeeping), which is then accessed by per-CPU functionality (hrtimers). That makes it far more expensive, it has similar scalability limiting effects as a global lock - while if we do it smart it can perform as essentially lockless code in most cases. Thanks, Ingo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-05 10:07 ` Ingo Molnar @ 2015-06-05 18:52 ` John Stultz 2015-06-08 7:44 ` Thomas Gleixner 0 siblings, 1 reply; 13+ messages in thread From: John Stultz @ 2015-06-05 18:52 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Gleixner <tglx@linutronix.de> wrote: > >> On Thu, 4 Jun 2015, John Stultz wrote: >> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: >> > So I suspect the problem is the change to clock_was_set_seq in >> > timekeeping_update is done prior to mirroring the time state to the >> > shadow-timekeeper. Thus the next time we do update_wall_time() the >> > updated sequence is overwritten by whats in the shadow copy. The >> > attached patch moving the modification up seems to avoid the issue for >> > me. >> >> Duh, yes. >> >> > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching >> > timekeeping state here in the hrtimer code has been a source of bugs in the >> > past, and I'm not sure I see how avoiding copying 24bytes is that big of a >> > win. Especially since it adds more state to the timekeeper and hrtimer base >> > that we have to read and mange. >> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. >> In situations where we run high frequency periodic timers on clock monotonic and >> nothing is going on in the other clock domains, which is a pretty common >> situation, this is measurable in terms of cache utilization. [...] > > It's not just about 'touching': it's about _dirtying_ cachelines from a globally > executed function (timekeeping), which is then accessed by per-CPU functionality > (hrtimers). Right, but part of that issue is that we're caching in the hrtimer cpu bases data that *should not be cached*. That was the core issue that caused the 2012 leapsecond issue, and I'd prefer to not reintroduce it. The offset data is only valid for the monotonic time its read for. So dirtying three cache lines really is just due to the fact that the data is stored in the cpu_base structure where I'd argue it doesn't provide real value (other then convenience of indexing it cleanly). Reading the offset data into three values from the stack would be fine too, and (I think) would avoid dirtying much extra (we have to store the now value anyway). BTW: Thomas, what are you using to do measurements here? I hesitate to argue in these sorts of performance discussions, since I really only have a embarrassing theoretical understanding of the issues and suspect myself a bit naive here. Additionally these sorts of constraints aren't always clearly documented, so being able to measure and compare would be helpful to ensure future changes don't impact things here. > That makes it far more expensive, it has similar scalability limiting effects as a > global lock - while if we do it smart it can perform as essentially lockless code > in most cases. Another reason why I don't like this approach of caching the data is that it also prevents fixing the leap-second adjustment to happen on the second edge, because we have to have an async update to the seqcounter in order to refresh the cached real_offset. Or we have to also export more ntp state data so we can duplicate the adjustment to the cached data in the hrtimer code, which is more of the complexity you've objected to. thanks -john ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-05 18:52 ` John Stultz @ 2015-06-08 7:44 ` Thomas Gleixner 2015-06-08 17:37 ` John Stultz 0 siblings, 1 reply; 13+ messages in thread From: Thomas Gleixner @ 2015-06-08 7:44 UTC (permalink / raw) To: John Stultz Cc: Ingo Molnar, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml On Fri, 5 Jun 2015, John Stultz wrote: > On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Gleixner <tglx@linutronix.de> wrote: > >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. > >> In situations where we run high frequency periodic timers on clock monotonic and > >> nothing is going on in the other clock domains, which is a pretty common > >> situation, this is measurable in terms of cache utilization. [...] > > > > It's not just about 'touching': it's about _dirtying_ cachelines from a globally > > executed function (timekeeping), which is then accessed by per-CPU functionality > > (hrtimers). > > Right, but part of that issue is that we're caching in the hrtimer cpu > bases data that *should not be cached*. That was the core issue that > caused the 2012 leapsecond issue, and I'd prefer to not reintroduce > it. > > The offset data is only valid for the monotonic time its read for. So > dirtying three cache lines really is just due to the fact that the > data is stored in the cpu_base structure where I'd argue it doesn't > provide real value (other then convenience of indexing it cleanly). > > Reading the offset data into three values from the stack would be fine > too, and (I think) would avoid dirtying much extra (we have to store > the now value anyway). Well, the problem is that we need to fetch that data on several occasions: - hrtimer_start (if it is the first expiring timer of a clock) - hrtimer_reprogram (after canceling the first timer) - hrtimer_interrupt So I really prefer to have cached values instead of a function call. And even if we do not cache stuff, there is no guarantee that we wont expire a timer too early: CPU0 CPU1 hrtimer_interrupt() ------------------------------------- leap second edge get_time_and_offsets_uncached() do_leap_second_adjustment() expire_timers() So, if the do_leap_second_adjustment() happens a bit too late, then clock monotonic + offset_realtime will have advanced over the leap second and expire a timer and the sleeper will then observe that it is expired too early because the leap second adjustment finished before it returned to user space. You do not even need two cpus for this. You can observe the same issue on a UP machine. Assume we have two hrtimers programmed to go off at the leap seconds edge: 1) a user space timer 2) the tick/leap second one If the user space timer has been enqueued before the leap one, then it will be expired first and if the timer interrupt got delayed a bit it again will see that its over the programmed time and happily expire to early. So what ever we do vs. the hrtimer offsets, cached or not will not prevent that we expire timers early. > BTW: Thomas, what are you using to do measurements here? I hesitate > to argue in these sorts of performance discussions, since I really > only have a embarrassing theoretical understanding of the issues and > suspect myself a bit naive here. Additionally these sorts of > constraints aren't always clearly documented, so being able to > measure and compare would be helpful to ensure future changes don't > impact things here. performance counters and tests which stress the particular subsystems. > > That makes it far more expensive, it has similar scalability limiting effects as a > > global lock - while if we do it smart it can perform as essentially lockless code > > in most cases. > > Another reason why I don't like this approach of caching the data is > that it also prevents fixing the leap-second adjustment to happen on > the second edge, because we have to have an async update to the > seqcounter in order to refresh the cached real_offset. Or we have to > also export more ntp state data so we can duplicate the adjustment to > the cached data in the hrtimer code, which is more of the complexity > you've objected to. There is no guarantee that it happens at the seconds edge. Timer might be delayed, vcpu scheduled out .... All you will be able to do is to narrow the window, but as I explained above it wont prevent early expiry and it wont prevent VDSO seing the time go over the leap second and then jump back. We just have to accept that timekeeping, time readout and hrtimers have asynchronous behaviour. And there is no way around that unless you want to kill performance completely for the sake of this leap second nonsense. Thanks, tglx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-08 7:44 ` Thomas Gleixner @ 2015-06-08 17:37 ` John Stultz 2015-06-08 19:31 ` Thomas Gleixner 0 siblings, 1 reply; 13+ messages in thread From: John Stultz @ 2015-06-08 17:37 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 5 Jun 2015, John Stultz wrote: >> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote: >> > * Thomas Gleixner <tglx@linutronix.de> wrote: >> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. >> >> In situations where we run high frequency periodic timers on clock monotonic and >> >> nothing is going on in the other clock domains, which is a pretty common >> >> situation, this is measurable in terms of cache utilization. [...] >> > >> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally >> > executed function (timekeeping), which is then accessed by per-CPU functionality >> > (hrtimers). >> >> Right, but part of that issue is that we're caching in the hrtimer cpu >> bases data that *should not be cached*. That was the core issue that >> caused the 2012 leapsecond issue, and I'd prefer to not reintroduce >> it. >> >> The offset data is only valid for the monotonic time its read for. So >> dirtying three cache lines really is just due to the fact that the >> data is stored in the cpu_base structure where I'd argue it doesn't >> provide real value (other then convenience of indexing it cleanly). >> >> Reading the offset data into three values from the stack would be fine >> too, and (I think) would avoid dirtying much extra (we have to store >> the now value anyway). > > Well, the problem is that we need to fetch that data on several > occasions: > > - hrtimer_start (if it is the first expiring timer of a clock) > - hrtimer_reprogram (after canceling the first timer) > - hrtimer_interrupt hrtimer_interrupt is really the one I'm most concerned with, because that's what does timer expiration. For that usage, having duplicated time state has been repeatedly problematic. For start and reprogram, I guess I'm not as concerned because bad cache values there can only result in errors in programming the clockevent hardware, which can only cause timers to firing late. > So I really prefer to have cached values instead of a function > call. And even if we do not cache stuff, there is no guarantee that we > wont expire a timer too early: > > CPU0 CPU1 > hrtimer_interrupt() > ------------------------------------- leap second edge > get_time_and_offsets_uncached() > do_leap_second_adjustment() > expire_timers() > > So, if the do_leap_second_adjustment() happens a bit too late, then > clock monotonic + offset_realtime will have advanced over the leap > second and expire a timer and the sleeper will then observe that it is > expired too early because the leap second adjustment finished before > it returned to user space. Well, the patch I proposed (which does leapsecond adjustments in the read paths) avoids this because update_base_and_offsets_now() checks to see if the monotonic base + offset would cross the leapsecond edge, and corrects the offset appropriately. > > You do not even need two cpus for this. You can observe the same issue > on a UP machine. Assume we have two hrtimers programmed to go off at > the leap seconds edge: > > 1) a user space timer > 2) the tick/leap second one > > If the user space timer has been enqueued before the leap one, then it > will be expired first and if the timer interrupt got delayed a bit it > again will see that its over the programmed time and happily expire to > early. > > So what ever we do vs. the hrtimer offsets, cached or not will not > prevent that we expire timers early. Right, so without my proposed patch, this is an issue, but my proposed patch requires that the hrtimer_interrupt not use cached offsets in order to ensure the read-state is adjusted properly for the leapsecond. (Or it requries the hrtimer_interrupt path to also cache the leapsecond state so it can do the same adjustment to the cached data, but this seems terribly duplicative). >> BTW: Thomas, what are you using to do measurements here? I hesitate >> to argue in these sorts of performance discussions, since I really >> only have a embarrassing theoretical understanding of the issues and >> suspect myself a bit naive here. Additionally these sorts of >> constraints aren't always clearly documented, so being able to >> measure and compare would be helpful to ensure future changes don't >> impact things here. > > performance counters and tests which stress the particular subsystems. Sweet. Vague hand-waving results are simple to reproduce. [Gestures, points assertively a few times] Results validated! Science! :) But seriously, I'm earnestly looking for specifics (like which stress tests are you caring about) here, so I can try to also watch that patches I write or take don't undo your performance optimizations, so in the future you'll have to yell at fewer people. >> > That makes it far more expensive, it has similar scalability limiting effects as a >> > global lock - while if we do it smart it can perform as essentially lockless code >> > in most cases. >> >> Another reason why I don't like this approach of caching the data is >> that it also prevents fixing the leap-second adjustment to happen on >> the second edge, because we have to have an async update to the >> seqcounter in order to refresh the cached real_offset. Or we have to >> also export more ntp state data so we can duplicate the adjustment to >> the cached data in the hrtimer code, which is more of the complexity >> you've objected to. > > There is no guarantee that it happens at the seconds edge. Timer might > be delayed, vcpu scheduled out .... Again, I worry you've not looked closely at my patch. The only way to assure the change happens at second edge is to do the adjustment in the read path. > All you will be able to do is to narrow the window, but as I explained > above it wont prevent early expiry and it wont prevent VDSO seing the > time go over the leap second and then jump back. The vdso is fixable as well. I just avoided it because fixing it has less benefit then fixing the timer expiration issue, and expected the performance overhead (which is an extra comparison and subtraction, not huge, but I respect that the vdsos are very optimized) would raise more objections. (If we want to get fancy, there's even the possibility of switching the vdso execution page so we use the fixed adjustment function only near the leap-second and go back to the normal function otherwise - but regardless, if its to be done, have to fix the core first) > We just have to accept that timekeeping, time readout and hrtimers > have asynchronous behaviour. And there is no way around that unless > you want to kill performance completely for the sake of this leap > second nonsense. Even outside of my leapsecond correctness concern, I think caching time state like you're doing here is a maintenance issue. It has bit us a number of times already, and things are complicated enough that even fixing the issues that stem from it are non-trivial. thanks -john ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-08 17:37 ` John Stultz @ 2015-06-08 19:31 ` Thomas Gleixner 0 siblings, 0 replies; 13+ messages in thread From: Thomas Gleixner @ 2015-06-08 19:31 UTC (permalink / raw) To: John Stultz Cc: Ingo Molnar, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml On Mon, 8 Jun 2015, John Stultz wrote: > On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > Well, the problem is that we need to fetch that data on several > > occasions: > > > > - hrtimer_start (if it is the first expiring timer of a clock) > > - hrtimer_reprogram (after canceling the first timer) > > - hrtimer_interrupt > > hrtimer_interrupt is really the one I'm most concerned with, because > that's what does timer expiration. For that usage, having duplicated > time state has been repeatedly problematic. And hurting performance just to solve problematic issues is definitely not a solution either. > Right, so without my proposed patch, this is an issue, but my proposed > patch requires that the hrtimer_interrupt not use cached offsets in > order to ensure the read-state is adjusted properly for the > leapsecond. (Or it requries the hrtimer_interrupt path to also cache > the leapsecond state so it can do the same adjustment to the cached > data, but this seems terribly duplicative). No, we can be smarter than that, really. > But seriously, I'm earnestly looking for specifics (like which stress > tests are you caring about) here, so I can try to also watch that > patches I write or take don't undo your performance optimizations, so > in the future you'll have to yell at fewer people. I'm often using cyclictest or a variant of it which can arm more timers. I play with various configurations for this, so I can't give a receipe right away. Will try to remember when I play with it next time. > Even outside of my leapsecond correctness concern, I think caching > time state like you're doing here is a maintenance issue. It has bit > us a number of times already, and things are complicated enough that > even fixing the issues that stem from it are non-trivial. The caching is solely a problem for this silly leap second corner case. So sacrifying performance for the general case just to solve this particular issue is really the wrong answer. We know that it is problematic, so we can have tests which verify that it did not break. Sure it requires a bit more brain cycles to get it right, but that's well worth spent cycles. Thanks, tglx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG, bisect] hrtimer: severe lag after suspend & resume 2015-06-05 9:14 ` Thomas Gleixner 2015-06-05 10:07 ` Ingo Molnar @ 2015-06-05 14:02 ` Thomas Gleixner 1 sibling, 0 replies; 13+ messages in thread From: Thomas Gleixner @ 2015-06-05 14:02 UTC (permalink / raw) To: John Stultz Cc: Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra, Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml, Ingo Molnar On Fri, 5 Jun 2015, Thomas Gleixner wrote: > On Thu, 4 Jun 2015, John Stultz wrote: > > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote: > > So I suspect the problem is the change to clock_was_set_seq in > > timekeeping_update is done prior to mirroring the time state to the > > shadow-timekeeper. Thus the next time we do update_wall_time() the > > updated sequence is overwritten by whats in the shadow copy. The > > attached patch moving the modification up seems to avoid the issue for > > me. > > Duh, yes. > > > Thomas: Looking at the problematic change, I'm not a big fan of it. > > Caching timekeeping state here in the hrtimer code has been a source > > of bugs in the past, and I'm not sure I see how avoiding copying > > 24bytes is that big of a win. Especially since it adds more state to > > the timekeeper and hrtimer base that we have to read and mange. > > It's not about copying 24 bytes. It's about touching 3 cache lines for > nothing. In situations where we run high frequency periodic timers on > clock monotonic and nothing is going on in the other clock domains, > which is a pretty common situation, this is measurable in terms of > cache utilization. I went great length to optimize the cache footprint > and access patterns and that unconditional update really makes a > measurable difference. That said. Can you please send a patch with a proper changelog? Thanks, tglx ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2015-06-08 19:32 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-06-04 0:56 [BUG, bisect] hrtimer: severe lag after suspend & resume Jeremiah Mahler 2015-06-04 11:22 ` Thomas Gleixner 2015-06-04 20:13 ` Jeremiah Mahler 2015-06-04 22:54 ` John Stultz 2015-06-05 0:01 ` Jeremiah Mahler 2015-06-05 7:57 ` Ingo Molnar 2015-06-05 9:14 ` Thomas Gleixner 2015-06-05 10:07 ` Ingo Molnar 2015-06-05 18:52 ` John Stultz 2015-06-08 7:44 ` Thomas Gleixner 2015-06-08 17:37 ` John Stultz 2015-06-08 19:31 ` Thomas Gleixner 2015-06-05 14:02 ` Thomas Gleixner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox