* [PATCH] Improve clocksource unstable warning @ 2010-11-10 22:16 Andy Lutomirski 2010-11-10 22:28 ` Thomas Gleixner 2010-11-12 21:31 ` [PATCH] " john stultz 0 siblings, 2 replies; 18+ messages in thread From: Andy Lutomirski @ 2010-11-10 22:16 UTC (permalink / raw) To: Thomas Gleixner; +Cc: linux-kernel, Andy Lutomirski When the system goes out to lunch for a long time, the clocksource watchdog might get false positives. Clarify the warning so that people stop blaming their system freezes on the timing code. This change was Thomas Gleixner's suggestion. Signed-off-by: Andy Lutomirski <luto@mit.edu> --- I've only compile-tested on 2.6.36, but it applies cleanly to Linus' tree and it's rather trivial. kernel/time/clocksource.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c18d7ef..5b30aa2 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs) static void clocksource_unstable(struct clocksource *cs, int64_t delta) { - printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n", - cs->name, delta); + printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n", + cs->name, delta, + delta < -5000000000LL ? + " or your system lagged for other reasons" : ""); __clocksource_unstable(cs); } -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-10 22:16 [PATCH] Improve clocksource unstable warning Andy Lutomirski @ 2010-11-10 22:28 ` Thomas Gleixner 2010-11-10 22:42 ` [PATCH v2] " Andy Lutomirski 2010-11-12 21:31 ` [PATCH] " john stultz 1 sibling, 1 reply; 18+ messages in thread From: Thomas Gleixner @ 2010-11-10 22:28 UTC (permalink / raw) To: Andy Lutomirski; +Cc: linux-kernel On Wed, 10 Nov 2010, Andy Lutomirski wrote: > When the system goes out to lunch for a long time, the clocksource > watchdog might get false positives. Clarify the warning so that > people stop blaming their system freezes on the timing code. > > This change was Thomas Gleixner's suggestion. > > Signed-off-by: Andy Lutomirski <luto@mit.edu> > --- > I've only compile-tested on 2.6.36, but it applies cleanly to Linus' tree > and it's rather trivial. > > kernel/time/clocksource.c | 6 ++++-- > 1 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index c18d7ef..5b30aa2 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs) > > static void clocksource_unstable(struct clocksource *cs, int64_t delta) > { > - printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n", > - cs->name, delta); > + printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n", > + cs->name, delta, > + delta < -5000000000LL ? I'm ok with that change, but we should not hard code the delta. Instead we should look at the wrap around time of the clocksource which we use for reference. > + " or your system lagged for other reasons" : ""); > __clocksource_unstable(cs); > } Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2] Improve clocksource unstable warning 2010-11-10 22:28 ` Thomas Gleixner @ 2010-11-10 22:42 ` Andy Lutomirski 0 siblings, 0 replies; 18+ messages in thread From: Andy Lutomirski @ 2010-11-10 22:42 UTC (permalink / raw) To: Thomas Gleixner; +Cc: linux-kernel, Andy Lutomirski When the system goes out to lunch for a long time, the clocksource watchdog might get false positives. Clarify the warning so that people stop blaming their system freezes on the timing code. This change was Thomas Gleixner's suggestion. Signed-off-by: Andy Lutomirski <luto@mit.edu> --- This version compares directly (and tediously to avoid overflow) with max_idle_ns. kernel/time/clocksource.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c18d7ef..f8cef8b 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs) static void clocksource_unstable(struct clocksource *cs, int64_t delta) { - printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n", - cs->name, delta); + printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n", + cs->name, delta, + (delta < 0 && (u64)-delta > watchdog->max_idle_ns) ? + " or your system lagged for other reasons" : ""); __clocksource_unstable(cs); } -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
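For context, the overflow-avoiding comparison in the v2 patch can be exercised on its own. A minimal user-space sketch (the function name is hypothetical; the kernel does this inline in the printk arguments):

```c
#include <stdint.h>

/* Illustrative restatement of the v2 patch's condition: the delta is
 * negative and its magnitude exceeds the watchdog's maximum idle
 * interval, i.e. the system plausibly just lagged for other reasons.
 * The cast to unsigned before the comparison mirrors the patch's
 * (u64)-delta, avoiding a signed/unsigned comparison mismatch. */
int lagged_for_other_reasons(int64_t delta, uint64_t max_idle_ns)
{
    return delta < 0 && (uint64_t)-delta > max_idle_ns;
}
```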
* Re: [PATCH] Improve clocksource unstable warning 2010-11-10 22:16 [PATCH] Improve clocksource unstable warning Andy Lutomirski 2010-11-10 22:28 ` Thomas Gleixner @ 2010-11-12 21:31 ` john stultz 2010-11-12 21:51 ` john stultz 2010-11-12 21:52 ` Andrew Lutomirski 1 sibling, 2 replies; 18+ messages in thread From: john stultz @ 2010-11-12 21:31 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <luto@mit.edu> wrote: > When the system goes out to lunch for a long time, the clocksource > watchdog might get false positives. Clarify the warning so that > people stop blaming their system freezes on the timing code. So this issue has also crept up at times in the -rt kernel, where some high-priority rt task hogs the cpu long enough for the watchdog clocksource to wrap, and then the TSC seems to have sped up relative to the watchdog when the system returns to a normal state and causes a false positive, so the TSC gets kicked out. So the watchdog heuristic is rough, since it depends on three things: 1) The watchdog clocksource is actually trustworthy 2) The watchdog actually runs near to when it expects to run. 3) When reading the different clocksources, any interruptions when the watchdog runs are small enough to be handled by the allowed margin of error. And in these cases we're breaking #2. So I had a few ideas to improve the heuristic somewhat (and heuristic refinement is always ugly). So first of all, the whole reason for the watchdog code is the TSC. Other platforms may want to make use of it, but none do at this moment. The majority of TSC issues are caused when the TSC halts in idle (probably most common these days) or slows down due to cpufreq scaling. When we get these false positives, it's because the watchdog check interval is too long and the watchdog hardware has wrapped. This makes it appear that the TSC is running too *fast*. 
So we could try to be more skeptical of watchdog failures where the TSC appears to have sped up, rather than the more common real issue of it slowing down. Now an actual true positive where this could occur is if you were on an old APM-based laptop, and booted on battery power with the cpus starting at their slow freq. Then you plugged in the AC and the cpus jumped to the faster rate. So while I suspect this is a rare case these days (on ACPI-based hardware the kernel has more say in cpufreq transitions, so I believe we are unlikely to calculate tsc_khz while the cpus are in low-power mode), we still need to address this. Ideas: 1) Maybe we should check that we get two sequential failures where the cpu seems fast before we throw out the TSC? This will still fall over in some stall cases (ie: a poor rt task hogging the cpu for 10 minutes, pausing for a 10th of a second and then continuing to hog the cpu). 2) We could look at the TSC delta, and if it appears outside the order of 2-10x faster (I don't think any cpus scale up even close to 10x in freq, but please correct me if so), then assume we just have been blocked from running and don't throw out the TSC. 3) Similar to #2, we could look at the max interval that the watchdog clocksource provides, and if the TSC delta is greater than that, avoid throwing things out. This combined with #2 might narrow out the false positives fairly well. Any additional thoughts here? thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
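Idea #2 above could be sketched roughly as follows (a user-space illustration; the function name and the 10x bound are assumptions taken from the text, not kernel code):

```c
#include <stdint.h>

/* Sketch of idea #2: a real cpufreq change scales the TSC rate by a
 * small factor (assumed to be at most ~10x here). If the TSC interval
 * is more than 10x the watchdog interval, it is more likely the
 * watchdog was delayed or wrapped than that the TSC actually sped up
 * that much, so the caller should not mark the TSC unstable. */
int tsc_delta_plausible(int64_t cs_nsec, int64_t wd_nsec)
{
    if (cs_nsec > 10 * wd_nsec)
        return 0;  /* implausible speedup: suspect a delayed watchdog */
    return 1;      /* within range: the comparison can be trusted */
}
```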
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 21:31 ` [PATCH] " john stultz @ 2010-11-12 21:51 ` john stultz 2010-11-12 21:52 ` Andrew Lutomirski 1 sibling, 0 replies; 18+ messages in thread From: john stultz @ 2010-11-12 21:51 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, 2010-11-12 at 13:31 -0800, john stultz wrote: > On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <luto@mit.edu> wrote: > > When the system goes out to lunch for a long time, the clocksource > > watchdog might get false positives. Clarify the warning so that > > people stop blaming their system freezes on the timing code. > > So this issue has also crept up at times in the -rt kernel, where some > high priority rt tasks hogs the cpu long enough for the watchdog > clocksource to wrap, and then the TSC seems to have sped up relative > to the watchdog when the system returns to a normal state and causes a > false positive, so the TSC gets kicked out. > > So the watchdog heuristic is rough, since it depends on three things: > 1) The watchdog clocksource is actually trustworthy > 2) The watchdog actually runs near to when it expects to run. > 3) When reading the different clocksources, any interruptions when the > watchdog runs are small enough to to be handled by the allowed margin > of error. > > And in these cases we're breaking #2 > > So I had a few ideas to improve the huristic somewhat (and huristic > refinement is always ugly). So first of all, the whole reason for the > watchdog code is the TSC. Other platforms may want to make use of it, > but none do at this moment. > > The majority of TSC issues are caused when the TSC halts in idle > (probably most common these days) or slows down due to cpufreq > scaling. > > When we get these false positives, its because the watchdog check > interval is too long and the watchdog hardware has wrapped. This makes > it appear that the TSC is running too *fast*. 
So we could try to be > more skeptical of watchdog failures where the TSC appears to have sped > up, rather then the more common real issue of it slowing down. > > Now the actual positive where this could occur is if you were on an > old APM based laptop, and booted on battery power and the cpus started > at their slow freq. Then you plugged in the AC and the cpus jumped to > the faster rate. So while I suspect this is a rare case these days (on > ACPI based hardware the kernel has more say in cpufreq transitions, so > I believe we are unlikely to calculate tsc_khz while the cpus are in > low power mode), we still need to address this. > > Ideas: > 1) Maybe should we check that we get two sequential failures where the > cpu seems fast before we throw out the TSC? This will still fall over > in some stall cases (ie: a poor rt task hogging the cpu for 10 > minutes, pausing for a 10th of a second and then continuing to hog the > cpu). > > 2) We could look at the TSC delta, and if it appears outside the order > of 2-10x faster (i don't think any cpus scale up even close to 10x in > freq, but please correct me if so), then assume we just have been > blocked from running and don't throw out the TSC. > > 3) Similar to #2 we could look at the max interval that the watchdog > clocksource provides, and if the TSC delta is greater then that, avoid > throwing things out. This combined with #2 might narrow out the false > positives fairly well. > > Any additional thoughts here? Here's a rough and untested attempt at adding just #3. thanks -john Improve watchdog heuristics to avoid false positives caused by the watchdog being delayed. 
Signed-off-by: John Stultz <johnstul@us.ibm.com> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c18d7ef..d63f6f8 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -247,6 +247,7 @@ static void clocksource_watchdog(unsigned long data) struct clocksource *cs; cycle_t csnow, wdnow; int64_t wd_nsec, cs_nsec; + int64_t wd_max_nsec; int next_cpu; spin_lock(&watchdog_lock); @@ -257,6 +258,8 @@ static void clocksource_watchdog(unsigned long data) wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask, watchdog->mult, watchdog->shift); watchdog_last = wdnow; + wd_max_nsec = clocksource_cyc2ns(watchdog->mask, watchdog->mult, + watchdog->shift); list_for_each_entry(cs, &watchdog_list, wd_list) { @@ -280,6 +283,14 @@ static void clocksource_watchdog(unsigned long data) cs_nsec = clocksource_cyc2ns((csnow - cs->wd_last) & cs->mask, cs->mult, cs->shift); cs->wd_last = csnow; + + /* Check to make sure the watchdog wasn't delayed */ + if (cs_nsec > wd_max_nsec) { + WARN_ONCE(1, + "clocksource_watchdog: watchdog delayed?\n"); + continue; + } + if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) { clocksource_unstable(cs, cs_nsec - wd_nsec); continue; ^ permalink raw reply related [flat|nested] 18+ messages in thread
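The `wd_max_nsec` value in this patch is the kernel's cycles-to-nanoseconds conversion applied to the watchdog's full counter mask, i.e. the longest interval the watchdog can represent before wrapping. A standalone sketch of that arithmetic (the mult/shift values used with it would come from the clocksource; any concrete numbers are invented for illustration):

```c
#include <stdint.h>

/* Same shape as the kernel's clocksource_cyc2ns():
 *     ns = (cycles * mult) >> shift */
int64_t cyc2ns(int64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}

/* The wrap interval of a counter is its full mask converted to ns,
 * which is what wd_max_nsec computes in the patch above. */
int64_t max_interval_ns(uint64_t mask, uint32_t mult, uint32_t shift)
{
    return cyc2ns((int64_t)mask, mult, shift);
}
```

With an invented 1 MHz, 32-bit counter (mult = 1024000, shift = 10, so one cycle is 1000 ns), the wrap interval comes out to roughly 4295 seconds, which is the scale of delay needed to trigger the false positive being discussed.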
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 21:31 ` [PATCH] " john stultz 2010-11-12 21:51 ` john stultz @ 2010-11-12 21:52 ` Andrew Lutomirski 2010-11-12 23:40 ` john stultz 1 sibling, 1 reply; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-12 21:52 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, Nov 12, 2010 at 4:31 PM, john stultz <johnstul@us.ibm.com> wrote: > On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <luto@mit.edu> wrote: >> When the system goes out to lunch for a long time, the clocksource >> watchdog might get false positives. Clarify the warning so that >> people stop blaming their system freezes on the timing code. > > So this issue has also crept up at times in the -rt kernel, where some > high priority rt tasks hogs the cpu long enough for the watchdog > clocksource to wrap, and then the TSC seems to have sped up relative > to the watchdog when the system returns to a normal state and causes a > false positive, so the TSC gets kicked out. > > So the watchdog heuristic is rough, since it depends on three things: > 1) The watchdog clocksource is actually trustworthy > 2) The watchdog actually runs near to when it expects to run. > 3) When reading the different clocksources, any interruptions when the > watchdog runs are small enough to to be handled by the allowed margin > of error. > > And in these cases we're breaking #2 > > So I had a few ideas to improve the huristic somewhat (and huristic > refinement is always ugly). So first of all, the whole reason for the > watchdog code is the TSC. Other platforms may want to make use of it, > but none do at this moment. > > The majority of TSC issues are caused when the TSC halts in idle > (probably most common these days) or slows down due to cpufreq > scaling. > > When we get these false positives, its because the watchdog check > interval is too long and the watchdog hardware has wrapped. 
This makes > it appear that the TSC is running too *fast*. So we could try to be > more skeptical of watchdog failures where the TSC appears to have sped > up, rather then the more common real issue of it slowing down. > > Now the actual positive where this could occur is if you were on an > old APM based laptop, and booted on battery power and the cpus started > at their slow freq. Then you plugged in the AC and the cpus jumped to > the faster rate. So while I suspect this is a rare case these days (on > ACPI based hardware the kernel has more say in cpufreq transitions, so > I believe we are unlikely to calculate tsc_khz while the cpus are in > low power mode), we still need to address this. > > Ideas: > 1) Maybe should we check that we get two sequential failures where the > cpu seems fast before we throw out the TSC? This will still fall over > in some stall cases (ie: a poor rt task hogging the cpu for 10 > minutes, pausing for a 10th of a second and then continuing to hog the > cpu). > > 2) We could look at the TSC delta, and if it appears outside the order > of 2-10x faster (i don't think any cpus scale up even close to 10x in > freq, but please correct me if so), then assume we just have been > blocked from running and don't throw out the TSC. > > 3) Similar to #2 we could look at the max interval that the watchdog > clocksource provides, and if the TSC delta is greater then that, avoid > throwing things out. This combined with #2 might narrow out the false > positives fairly well. > > Any additional thoughts here? Yes. As far as I know, the watchdog doesn't give arbitrary values when it wraps; it just wraps. Here's a possible heuristic, in pseudocode: wd_now_1 = (read watchdog) cs_now = (read clocksource) cs_elapsed = cs_now - cs_last; wd_elapsed = wd_now_1 - wd_last; if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA) return; // We're OK. 
wd_now_2 = (read watchdog again) if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2) bail; // The clocksource might be unstable, but we either just lagged or the watchdog is unstable, and in either case we don't gain anything by marking the clocksource unstable. if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % wd_wrapping_time ) < (something fairly small) ) bail; // The watchdog most likely wrapped. mark_unstable; The idea is that there are three obvious causes for false positives. 1. Something caused the watchdog itself to lag. In this case, the second watchdog read will differ from the first one by a lot and we'll bail. 2. The TSC wrapped in the time between two watchdog checks. I don't think that this is very likely on x86 CPUs. (Isn't the TSC always 64 bits? By the time that a 20GHz CPU wraps a 64-bit TSC, you've probably already replaced the computer.) 3. The watchdog wrapped in the time between two watchdog checks. In this case, unless something else went wrong simultaneously, we'll detect it. On the other hand, the chance that the TSC was unstable by almost exactly a multiple of the watchdog wrapping time is pretty unlikely. If we're using a non-TSC clocksource, then you could get false positives if that clocksource wraps, but maybe the watchdog is unnecessary in that case. --Andy ^ permalink raw reply [flat|nested] 18+ messages in thread
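Andy's pseudocode above, minus the double watchdog read (which needs live hardware), could be condensed into a user-space sketch like the following. `MAX_DELTA` and the verdict names are illustrative assumptions, and this reproduces the heuristic as first proposed, including the modulo corner cases discussed later in the thread:

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_DELTA 62500000LL  /* illustrative allowed margin, in ns */

enum verdict { CS_OK, CS_BAIL, CS_UNSTABLE };

/* Given elapsed nanoseconds measured on the watchdog and on the
 * clocksource under test, plus the watchdog's wrap interval, decide
 * whether a large discrepancy looks like a wrapped watchdog (bail
 * rather than condemn the clocksource) or a genuine instability. */
enum verdict check(int64_t wd_elapsed, int64_t cs_elapsed,
                   int64_t wd_wrap_ns)
{
    if (llabs(wd_elapsed - cs_elapsed) < MAX_DELTA)
        return CS_OK;

    /* If the deltas differ by (approximately) a whole number of wrap
     * intervals, the watchdog most likely wrapped while the system
     * was delayed. */
    if (wd_elapsed < cs_elapsed &&
        (cs_elapsed - wd_elapsed) % wd_wrap_ns < MAX_DELTA)
        return CS_BAIL;

    return CS_UNSTABLE;
}
```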
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 21:52 ` Andrew Lutomirski @ 2010-11-12 23:40 ` john stultz 2010-11-12 23:48 ` Andrew Lutomirski 0 siblings, 1 reply; 18+ messages in thread From: john stultz @ 2010-11-12 23:40 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote: > On Fri, Nov 12, 2010 at 4:31 PM, john stultz <johnstul@us.ibm.com> wrote: > > Ideas: > > 1) Maybe should we check that we get two sequential failures where the > > cpu seems fast before we throw out the TSC? This will still fall over > > in some stall cases (ie: a poor rt task hogging the cpu for 10 > > minutes, pausing for a 10th of a second and then continuing to hog the > > cpu). > > > > 2) We could look at the TSC delta, and if it appears outside the order > > of 2-10x faster (i don't think any cpus scale up even close to 10x in > > freq, but please correct me if so), then assume we just have been > > blocked from running and don't throw out the TSC. > > > > 3) Similar to #2 we could look at the max interval that the watchdog > > clocksource provides, and if the TSC delta is greater then that, avoid > > throwing things out. This combined with #2 might narrow out the false > > positives fairly well. > > > > Any additional thoughts here? > > Yes. As far as I know, the watchdog doesn't give arbitrary values > when it wraps; it just wraps. Here's a possible heuristic, in > pseudocode: > > wd_now_1 = (read watchdog) > cs_now = (read clocksource) > > cs_elapsed = cs_now - cs_last; > wd_elapsed = wd_now_1 - wd_last; > > if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA) > return; // We're OK. > > wd_now_2 = (read watchdog again) > if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2) > bail; // The clocksource might be unstable, but we either just > lagged or the watchdog is unstable, and in either case we don't gain > anything by marking the clocksource unstable. 
This is more easily done by just bounding the clocksource read: wd_now_1 = watchdog->read() cs_now = clocksource->read() wd_now_2 = watchdog->read() if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL) bail; // hit an SMI or some sort of long preemption > if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % > wd_wrapping_time ) < (something fairly small) ) > bail; // The watchdog most likely wrapped. Huh. The modulo bit may need tweaking as it's not immediately clear it's right. Maybe the following is clearer?: if ((cs_elapsed > wd_wrapping_time) && (abs((cs_elapsed % wd_wrapping_time) - wd_elapsed) < MAX_DELTA)) // should be ok. > The idea is that there are three obvious causes for false positives. > > 1. Something caused the watchdog itself to lag. In this case, the > second watchdog read will differ from the first one by a lot and we'll > bail. > > 2. The TSC wrapped in the time between two watchdog checks. I don't > think that this is very likely on x86 CPUs. (Tsn't the TSC always 64 > bits? By the time that a 20GHz CPU wraps a 64-bit TSC, you're > probably already replaced the computer.) Yea, current TSCs won't wrap in our lifetime. And even if it ever does get close, it will have to wrap between watchdog checks to be an issue (since two's-complement math makes the subtraction work even if it does roll over). > 3. The watchdog wrapped in the time between two watchdog checks. In > this case, unless something else went wrong simultaneously, we'll > detect it. > > On the other hand, the chance that the TSC was unstable by almost > exactly a multiple of the watchdog wrapping time is pretty unlikely. > > > If we're using a non-TSC clocksource, then you could get false > positives if that clocksource wraps, but maybe the watchdog is > unnecessary in that case. Yea. Some similar extra modulo conditions could probably handle that too. But again, the only clocksource that needs the watchdog is the TSC. 
thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
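The bounded-read suggestion above (sample the watchdog on both sides of the clocksource read, and bail if the two watchdog samples are far apart) reduces to a masked-subtraction check. A user-space sketch, with illustrative names; the gap bound is an assumption standing in for SOMETHING_SMALL:

```c
#include <stdint.h>

/* The caller does:
 *     wd1 = watchdog->read();
 *     cs  = clocksource->read();
 *     wd2 = watchdog->read();
 * and only trusts the comparison if wd1 and wd2 are close together;
 * a large gap means an SMI or long preemption hit mid-read. The
 * masked unsigned subtraction stays correct even if the watchdog
 * counter wrapped between the two samples. */
int read_window_ok(uint64_t wd1, uint64_t wd2,
                   uint64_t wd_mask, uint64_t max_gap)
{
    return ((wd2 - wd1) & wd_mask) <= max_gap;
}
```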
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 23:40 ` john stultz @ 2010-11-12 23:48 ` Andrew Lutomirski 2010-11-12 23:51 ` Andrew Lutomirski 2010-11-12 23:52 ` john stultz 0 siblings, 2 replies; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-12 23:48 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, Nov 12, 2010 at 6:40 PM, john stultz <johnstul@us.ibm.com> wrote: > On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote: >> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <johnstul@us.ibm.com> wrote: >> > Ideas: >> > 1) Maybe should we check that we get two sequential failures where the >> > cpu seems fast before we throw out the TSC? This will still fall over >> > in some stall cases (ie: a poor rt task hogging the cpu for 10 >> > minutes, pausing for a 10th of a second and then continuing to hog the >> > cpu). >> > >> > 2) We could look at the TSC delta, and if it appears outside the order >> > of 2-10x faster (i don't think any cpus scale up even close to 10x in >> > freq, but please correct me if so), then assume we just have been >> > blocked from running and don't throw out the TSC. >> > >> > 3) Similar to #2 we could look at the max interval that the watchdog >> > clocksource provides, and if the TSC delta is greater then that, avoid >> > throwing things out. This combined with #2 might narrow out the false >> > positives fairly well. >> > >> > Any additional thoughts here? >> >> Yes. As far as I know, the watchdog doesn't give arbitrary values >> when it wraps; it just wraps. Here's a possible heuristic, in >> pseudocode: >> >> wd_now_1 = (read watchdog) >> cs_now = (read clocksource) >> >> cs_elapsed = cs_now - cs_last; >> wd_elapsed = wd_now_1 - wd_last; >> >> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA) >> return; // We're OK. 
>> >> wd_now_2 = (read watchdog again) >> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2) >> bail; // The clocksource might be unstable, but we either just >> lagged or the watchdog is unstable, and in either case we don't gain >> anything by marking the clocksource unstable. > > This is more easily done by just bounding the clocksource read: > wd_now_1 = watchdog->read() > cs_now = clocksource->read() > wd_now_2 = watchdog->read() > > if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL) > bail; // hit an SMI or some sort of long preemption > >> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % >> wd_wrapping_time ) < (something fairly small) ) >> bail; // The watchdog most likely wrapped. > > Huh. The modulo bit may need tweaking as its not immediately clear its > right. Maybe the following is clearer?: > > if ((cs_elapsed > wd_wrapping_time) > && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA) > // should be ok. I think this is wrong if wd_elapsed is large (which could happen if the real wd time is something like (2 * wd_wrapping_time - MAX_DELTA/4)). --Andy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 23:48 ` Andrew Lutomirski @ 2010-11-12 23:51 ` Andrew Lutomirski 2010-11-13 0:22 ` john stultz 2010-11-12 23:52 ` john stultz 1 sibling, 1 reply; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-12 23:51 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, Nov 12, 2010 at 6:48 PM, Andrew Lutomirski <luto@mit.edu> wrote: > On Fri, Nov 12, 2010 at 6:40 PM, john stultz <johnstul@us.ibm.com> wrote: >> On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote: >>> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <johnstul@us.ibm.com> wrote: >>> > Ideas: >>> > 1) Maybe should we check that we get two sequential failures where the >>> > cpu seems fast before we throw out the TSC? This will still fall over >>> > in some stall cases (ie: a poor rt task hogging the cpu for 10 >>> > minutes, pausing for a 10th of a second and then continuing to hog the >>> > cpu). >>> > >>> > 2) We could look at the TSC delta, and if it appears outside the order >>> > of 2-10x faster (i don't think any cpus scale up even close to 10x in >>> > freq, but please correct me if so), then assume we just have been >>> > blocked from running and don't throw out the TSC. >>> > >>> > 3) Similar to #2 we could look at the max interval that the watchdog >>> > clocksource provides, and if the TSC delta is greater then that, avoid >>> > throwing things out. This combined with #2 might narrow out the false >>> > positives fairly well. >>> > >>> > Any additional thoughts here? >>> >>> Yes. As far as I know, the watchdog doesn't give arbitrary values >>> when it wraps; it just wraps. Here's a possible heuristic, in >>> pseudocode: >>> >>> wd_now_1 = (read watchdog) >>> cs_now = (read clocksource) >>> >>> cs_elapsed = cs_now - cs_last; >>> wd_elapsed = wd_now_1 - wd_last; >>> >>> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA) >>> return; // We're OK. 
>>> >>> wd_now_2 = (read watchdog again) >>> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2) >>> bail; // The clocksource might be unstable, but we either just >>> lagged or the watchdog is unstable, and in either case we don't gain >>> anything by marking the clocksource unstable. >> >> This is more easily done by just bounding the clocksource read: >> wd_now_1 = watchdog->read() >> cs_now = clocksource->read() >> wd_now_2 = watchdog->read() >> >> if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL) >> bail; // hit an SMI or some sort of long preemption >> >>> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % >>> wd_wrapping_time ) < (something fairly small) ) >>> bail; // The watchdog most likely wrapped. >> >> Huh. The modulo bit may need tweaking as its not immediately clear its >> right. Maybe the following is clearer?: >> >> if ((cs_elapsed > wd_wrapping_time) >> && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA) >> // should be ok. > > I think this is wrong if wd_elapsed is large (which could happen if > the real wd time is something like (2 * wd_wrapping_time - > MAX_DELTA/4)). Also wrong if cs_elapsed is just slightly less than wd_wrapping_time but the wd clocksource runs enough faster that it wrapped. --Andy > > --Andy > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 23:51 ` Andrew Lutomirski @ 2010-11-13 0:22 ` john stultz 2010-11-13 0:58 ` john stultz 0 siblings, 1 reply; 18+ messages in thread From: john stultz @ 2010-11-13 0:22 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote: > On Fri, Nov 12, 2010 at 6:48 PM, Andrew Lutomirski <luto@mit.edu> wrote: > > On Fri, Nov 12, 2010 at 6:40 PM, john stultz <johnstul@us.ibm.com> wrote: > >> On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote: > >>> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <johnstul@us.ibm.com> wrote: > >>> > Ideas: > >>> > 1) Maybe should we check that we get two sequential failures where the > >>> > cpu seems fast before we throw out the TSC? This will still fall over > >>> > in some stall cases (ie: a poor rt task hogging the cpu for 10 > >>> > minutes, pausing for a 10th of a second and then continuing to hog the > >>> > cpu). > >>> > > >>> > 2) We could look at the TSC delta, and if it appears outside the order > >>> > of 2-10x faster (i don't think any cpus scale up even close to 10x in > >>> > freq, but please correct me if so), then assume we just have been > >>> > blocked from running and don't throw out the TSC. > >>> > > >>> > 3) Similar to #2 we could look at the max interval that the watchdog > >>> > clocksource provides, and if the TSC delta is greater then that, avoid > >>> > throwing things out. This combined with #2 might narrow out the false > >>> > positives fairly well. > >>> > > >>> > Any additional thoughts here? > >>> > >>> Yes. As far as I know, the watchdog doesn't give arbitrary values > >>> when it wraps; it just wraps. 
Here's a possible heuristic, in > >>> pseudocode: > >>> > >>> wd_now_1 = (read watchdog) > >>> cs_now = (read clocksource) > >>> > >>> cs_elapsed = cs_now - cs_last; > >>> wd_elapsed = wd_now_1 - wd_last; > >>> > >>> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA) > >>> return; // We're OK. > >>> > >>> wd_now_2 = (read watchdog again) > >>> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2) > >>> bail; // The clocksource might be unstable, but we either just > >>> lagged or the watchdog is unstable, and in either case we don't gain > >>> anything by marking the clocksource unstable. > >> > >> This is more easily done by just bounding the clocksource read: > >> wd_now_1 = watchdog->read() > >> cs_now = clocksource->read() > >> wd_now_2 = watchdog->read() > >> > >> if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL) > >> bail; // hit an SMI or some sort of long preemption > >> > >>> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % > >>> wd_wrapping_time ) < (something fairly small) ) > >>> bail; // The watchdog most likely wrapped. > >> > >> Huh. The modulo bit may need tweaking as its not immediately clear its > >> right. Maybe the following is clearer?: > >> > >> if ((cs_elapsed > wd_wrapping_time) > >> && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA) > >> // should be ok. > > > > I think this is wrong if wd_elapsed is large (which could happen if > > the real wd time is something like (2 * wd_wrapping_time - > > MAX_DELTA/4)). > > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time > but the wd clocksource runs enough faster that it wrapped. Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :) And your implementation above hits the same issue.. hrm. 
I want to say we can get away with it using the same twos-complement masking trick we use for subtracting around an overflow to calculate cycle deltas, but I'm not confident it will work when we've moved from clean bit field masks to nanosecond intervals. I think I'll have to revisit this on Monday. Thanks again for pointing out this think-o. -john ^ permalink raw reply [flat|nested] 18+ messages in thread
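The masking trick John refers to is the standard way the kernel forms cycle deltas across a counter wrap: subtract in unsigned arithmetic, then mask down to the counter's width. A minimal user-space sketch (illustrative names, not kernel symbols):

```c
#include <stdint.h>

typedef uint64_t cycle_t;

/*
 * Delta between two reads of a free-running counter that wraps at
 * `mask` (mask = 2^width - 1).  Unsigned subtraction plus the mask
 * yields the right answer across a single wrap.  Note it cannot tell
 * one wrap from two -- and no such clean bit mask exists once the
 * values are nanosecond intervals, which is exactly the difficulty
 * raised above.
 */
static cycle_t cyc_delta(cycle_t now, cycle_t last, cycle_t mask)
{
	return (now - last) & mask;
}
```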
* Re: [PATCH] Improve clocksource unstable warning 2010-11-13 0:22 ` john stultz @ 2010-11-13 0:58 ` john stultz 2010-11-17 0:05 ` Andrew Lutomirski 0 siblings, 1 reply; 18+ messages in thread From: john stultz @ 2010-11-13 0:58 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc

On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote:
> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
> > but the wd clocksource runs enough faster that it wrapped.
>
> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)

I have a hard time leaving things alone. :)

So this still has the issue that the u64 % u64 won't work on 32-bit systems, but I think once I rework the modulo bit the following should be what you were describing.

It is ugly, so let me know if you have a cleaner way.

Signed-off-by: John Stultz <johnstul@us.ibm.com>

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..0afbcb5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -247,6 +247,7 @@ static void clocksource_watchdog(unsigned long data)
 	struct clocksource *cs;
 	cycle_t csnow, wdnow;
 	int64_t wd_nsec, cs_nsec;
+	int64_t wd_max_nsec;
 	int next_cpu;
 
 	spin_lock(&watchdog_lock);
@@ -257,6 +258,8 @@ static void clocksource_watchdog(unsigned long data)
 	wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask,
 				     watchdog->mult, watchdog->shift);
 	watchdog_last = wdnow;
+	wd_max_nsec = clocksource_cyc2ns(watchdog->mask, watchdog->mult,
+					 watchdog->shift);
 
 	list_for_each_entry(cs, &watchdog_list, wd_list) {
@@ -280,6 +283,35 @@ static void clocksource_watchdog(unsigned long data)
 		cs_nsec = clocksource_cyc2ns((csnow - cs->wd_last) & cs->mask,
 					     cs->mult, cs->shift);
 		cs->wd_last = csnow;
+
+		/*
+		 * If the cs interval is greater than the wd wrap interval,
+		 * we were somehow delayed.  So check that, modulo the wrap
+		 * interval, the cs interval is close to the watchdog
+		 * interval value.
+		 */
+		if (cs_nsec > wd_max_nsec) {
+			int64_t cs_modulo = cs_nsec % wd_max_nsec;
+
+			/* first check the easy case */
+			if (abs(cs_modulo - wd_nsec) < WATCHDOG_THRESHOLD) {
+				WARN_ONCE(1, "clocksource_watchdog: "
+					     "watchdog delayed?\n");
+				continue;
+			}
+			/* could be split along the wd_max_nsec boundary */
+			if (wd_nsec < cs_modulo)
+				wd_nsec += wd_max_nsec;
+			else
+				cs_modulo += wd_max_nsec;
+			/* check again */
+			if (abs(cs_modulo - wd_nsec) < WATCHDOG_THRESHOLD) {
+				WARN_ONCE(1, "clocksource_watchdog: "
+					     "watchdog delayed?\n");
+				continue;
+			}
+		}
+
 		if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
 			clocksource_unstable(cs, cs_nsec - wd_nsec);
 			continue;

^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-13 0:58 ` john stultz @ 2010-11-17 0:05 ` Andrew Lutomirski 2010-11-17 0:26 ` john stultz 0 siblings, 1 reply; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-17 0:05 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, Nov 12, 2010 at 7:58 PM, john stultz <johnstul@us.ibm.com> wrote: > On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote: >> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote: >> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time >> > but the wd clocksource runs enough faster that it wrapped. >> >> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :) > > I have a hard time leaving things alone. :) > > So this still has the issue of the u64%u64 won't work on 32bit systems, > but I think once I rework the modulo bit the following should be what > you were describing. > > It is ugly, so let me know if you have a cleaner way. > I'm playing with this stuff now, and it looks like my (invariant, constant, single-package i7) TSC has a max_idle_ns of just over 3 seconds. I'm confused. --Andy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-17 0:05 ` Andrew Lutomirski @ 2010-11-17 0:26 ` john stultz 2010-11-17 0:54 ` Andrew Lutomirski 0 siblings, 1 reply; 18+ messages in thread From: john stultz @ 2010-11-17 0:26 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Tue, 2010-11-16 at 19:05 -0500, Andrew Lutomirski wrote: > On Fri, Nov 12, 2010 at 7:58 PM, john stultz <johnstul@us.ibm.com> wrote: > > On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote: > >> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote: > >> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time > >> > but the wd clocksource runs enough faster that it wrapped. > >> > >> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :) > > > > I have a hard time leaving things alone. :) > > > > So this still has the issue of the u64%u64 won't work on 32bit systems, > > but I think once I rework the modulo bit the following should be what > > you were describing. > > > > It is ugly, so let me know if you have a cleaner way. > > > > I'm playing with this stuff now, and it looks like my (invariant, > constant, single-package i7) TSC has a max_idle_ns of just over 3 > seconds. I'm confused. Yea. I hit this wall the other day as well. So my patch is invalid because its assuming the TSC deltas will be large, but for any unreasonable delay, we'll actually end up with multiply overflows, causing the tsc ns interval to be invalid as well. I'm starting to think we should be pushing the watchdog check into the timekeeping accumulation loop (or have it hang off of the accumulation loop). 1) The clocksource cyc2ns conversion code is built with assumptions linked to how frequently we accumulate time via update_wall_time(). 2) update_wall_time() happens in timer irq context, so we don't have to worry about being delayed. If an irq storm or something does actually cause the timer irq to be delayed, we have bigger issues. 
The only trouble with this, is that if we actually push the max_idle_ns out to something like 10 seconds on the TSC, we could end up having the watchdog clocksource wrapping while we're in nohz idle. So that could be ugly. Maybe if the current clocksource needs the watchdog observations, we should cap the max_idle_ns to the smaller of the current clocksource and the watchdog clocksource. Oof. Its just ugly. If I can get some time this week I'll try to take a rough swing at refactoring that code. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-17 0:26 ` john stultz @ 2010-11-17 0:54 ` Andrew Lutomirski 2010-11-17 1:19 ` john stultz 0 siblings, 1 reply; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-17 0:54 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Tue, Nov 16, 2010 at 7:26 PM, john stultz <johnstul@us.ibm.com> wrote: > On Tue, 2010-11-16 at 19:05 -0500, Andrew Lutomirski wrote: >> On Fri, Nov 12, 2010 at 7:58 PM, john stultz <johnstul@us.ibm.com> wrote: >> > On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote: >> >> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote: >> >> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time >> >> > but the wd clocksource runs enough faster that it wrapped. >> >> >> >> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :) >> > >> > I have a hard time leaving things alone. :) >> > >> > So this still has the issue of the u64%u64 won't work on 32bit systems, >> > but I think once I rework the modulo bit the following should be what >> > you were describing. >> > >> > It is ugly, so let me know if you have a cleaner way. >> > >> >> I'm playing with this stuff now, and it looks like my (invariant, >> constant, single-package i7) TSC has a max_idle_ns of just over 3 >> seconds. I'm confused. > > Yea. I hit this wall the other day as well. So my patch is invalid > because its assuming the TSC deltas will be large, but for any > unreasonable delay, we'll actually end up with multiply overflows, > causing the tsc ns interval to be invalid as well. > > I'm starting to think we should be pushing the watchdog check into the > timekeeping accumulation loop (or have it hang off of the accumulation > loop). > > 1) The clocksource cyc2ns conversion code is built with assumptions > linked to how frequently we accumulate time via update_wall_time(). 
> > 2) update_wall_time() happens in timer irq context, so we don't have to > worry about being delayed. If an irq storm or something does actually > cause the timer irq to be delayed, we have bigger issues. That's why I hit this. It would be nice if we didn't respond to irq storms by calling stop_machine. > > The only trouble with this, is that if we actually push the max_idle_ns > out to something like 10 seconds on the TSC, we could end up having the > watchdog clocksource wrapping while we're in nohz idle. So that could > be ugly. Maybe if the current clocksource needs the watchdog > observations, we should cap the max_idle_ns to the smaller of the > current clocksource and the watchdog clocksource. > What would you think about implementing non-overflowing clocksource_cyc2ns on architectures that can do it efficiently? You'd have to artificially limit the mask to 2^64 / (rate in GHz), rounded down to a power of 2, but that shouldn't be a problem for any sensible clocksource. x86_64 can do it with one multiply, two shifts, an or, and a subtract (to figure out the shifts). It should take just a couple cycles longer than the current code (or maybe the same amount of time, depending on how good the CPU is at running the whole thing in parallel). x86_32 and similar architectures would need two multiplies and one add. Architectures with only 32x32->32 multiply would need three multiplies. (They're already presumably doing two multiplies with the current code, though.) The benefit would be that sensible clocksources (TSC and 64-bit HPET) would essentially never overflow and multicore systems could keep most cores asleep for as long as they liked. (There's yet another approach: keep the current clocksource_cyc2ns, but add an exact version and only use it when waking up from a long sleep.) --Andy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Improve clocksource unstable warning 2010-11-17 0:54 ` Andrew Lutomirski @ 2010-11-17 1:19 ` john stultz 2010-11-17 1:24 ` Andrew Lutomirski 0 siblings, 1 reply; 18+ messages in thread From: john stultz @ 2010-11-17 1:19 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote: > On Tue, Nov 16, 2010 at 7:26 PM, john stultz <johnstul@us.ibm.com> wrote: > > I'm starting to think we should be pushing the watchdog check into the > > timekeeping accumulation loop (or have it hang off of the accumulation > > loop). > > > > 1) The clocksource cyc2ns conversion code is built with assumptions > > linked to how frequently we accumulate time via update_wall_time(). > > > > 2) update_wall_time() happens in timer irq context, so we don't have to > > worry about being delayed. If an irq storm or something does actually > > cause the timer irq to be delayed, we have bigger issues. > > That's why I hit this. It would be nice if we didn't respond to irq > storms by calling stop_machine. So even if we don't change clocksources, if you have a long enough interrupt storm that delays the hard timer irq, such that the clocksources wrap (or hit the mult overflow), your system time will be lagging behind anyway. So that would be broken regardless of if the watchdog kicked in or not. I suspect that even with such an irq storm, the timer irq will hopefully be high enough priority to be serviced first, avoiding the accumulation loss. > > The only trouble with this, is that if we actually push the max_idle_ns > > out to something like 10 seconds on the TSC, we could end up having the > > watchdog clocksource wrapping while we're in nohz idle. So that could > > be ugly. Maybe if the current clocksource needs the watchdog > > observations, we should cap the max_idle_ns to the smaller of the > > current clocksource and the watchdog clocksource. 
> > > > What would you think about implementing non-overflowing > clocksource_cyc2ns on architectures that can do it efficiently? You'd > have to artificially limit the mask to 2^64 / (rate in GHz), rounded > down to a power of 2, but that shouldn't be a problem for any sensible > clocksource. You would run into accuracy issues. The reason why we use large mult/shift pairs for timekeeping is because we need to make very fine grained adjustments to steer the clock (also just the freq accuracy can be poor if you use too low a shift value in the cyc2ns conversions). > The benefit would be that sensible clocksources (TSC and 64-bit HPET) > would essentially never overflow and multicore systems could keep most > cores asleep for as long as they liked. > > (There's yet another approach: keep the current clocksource_cyc2ns, > but add an exact version and only use it when waking up from a long > sleep.) We're actually running into this very issue in the sched_clock threads that have been discussed. Namely that the clocksource/timekeeping code is very intertwined around the assumptions and limits of the freq that update_wall_time is called. I'm currently working to consolidate those assumptions into manageable tunables via clocksource_register_khz/hz patches so we don't end up with odd cases where some idle limit gets pushed out and stuff breaks on half of the clocksources out there, but not the other half. So yea, there are a couple of approaches here: It may just be that we should drop the watchdog infrastructure and either embed that functionality into the TSC code itself (since it is the only user), allowing more flexible and rough conversion factors with other watchdog hardware similar to what you suggest, instead of using the clocksource conversion functions. Or we can embed the watchdog timer into a function of the timekeeping accumulation code, the constraints of of which the clocksource are written to. Neither is really a pretty solution. 
I suspect second would end up being a little cleaner, and would allow other troublesome clocksources to make use of it in the future. But the first does isolate the issue close to the problematic hardware. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
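John's accuracy point above — large mult/shift pairs allow fine-grained clock steering — can be made concrete. The conversion is ns = (cycles * mult) >> shift, with mult derived from the clock frequency at a chosen shift; a bigger shift gives mult more fractional precision. A simplified user-space sketch (the kernel's real helpers in this era are clocksource_cyc2ns() and clocksource_khz2mult(); the rounding below is illustrative):

```c
#include <stdint.h>

/*
 * Simplified model of the kernel's cycle -> nanosecond conversion:
 * ns = (cycles * mult) >> shift.  A larger shift lets mult carry more
 * fractional bits of the true ns-per-cycle ratio, at the cost of
 * overflowing the 64-bit product for smaller cycle deltas.
 */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/*
 * mult for a clock of `khz` kilohertz at a given shift:
 * mult = round((10^6 << shift) / khz), so cycles * mult >> shift
 * yields nanoseconds.
 */
static uint32_t khz2mult(uint32_t khz, uint32_t shift)
{
	uint64_t tmp = (((uint64_t)1000000) << shift) + khz / 2;

	return (uint32_t)(tmp / khz);
}
```

For a 1 MHz clock at shift = 10, mult comes out to 1024000 and 1000 cycles convert to exactly 1000000 ns; drop the shift to 1 or 2 and the representable ns-per-cycle values become far too coarse to steer the clock accurately.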
* Re: [PATCH] Improve clocksource unstable warning 2010-11-17 1:19 ` john stultz @ 2010-11-17 1:24 ` Andrew Lutomirski 2010-11-17 1:54 ` john stultz 0 siblings, 1 reply; 18+ messages in thread From: Andrew Lutomirski @ 2010-11-17 1:24 UTC (permalink / raw) To: john stultz; +Cc: Thomas Gleixner, linux-kernel, pc On Tue, Nov 16, 2010 at 8:19 PM, john stultz <johnstul@us.ibm.com> wrote: > On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote: >> On Tue, Nov 16, 2010 at 7:26 PM, john stultz <johnstul@us.ibm.com> wrote: >> > I'm starting to think we should be pushing the watchdog check into the >> > timekeeping accumulation loop (or have it hang off of the accumulation >> > loop). >> > >> > 1) The clocksource cyc2ns conversion code is built with assumptions >> > linked to how frequently we accumulate time via update_wall_time(). >> > >> > 2) update_wall_time() happens in timer irq context, so we don't have to >> > worry about being delayed. If an irq storm or something does actually >> > cause the timer irq to be delayed, we have bigger issues. >> >> That's why I hit this. It would be nice if we didn't respond to irq >> storms by calling stop_machine. > > So even if we don't change clocksources, if you have a long enough > interrupt storm that delays the hard timer irq, such that the > clocksources wrap (or hit the mult overflow), your system time will be > lagging behind anyway. So that would be broken regardless of if the > watchdog kicked in or not. > > I suspect that even with such an irq storm, the timer irq will hopefully > be high enough priority to be serviced first, avoiding the accumulation > loss. > > >> > The only trouble with this, is that if we actually push the max_idle_ns >> > out to something like 10 seconds on the TSC, we could end up having the >> > watchdog clocksource wrapping while we're in nohz idle. So that could >> > be ugly. 
Maybe if the current clocksource needs the watchdog
>> > observations, we should cap the max_idle_ns to the smaller of the
>> > current clocksource and the watchdog clocksource.
>> >
>>
>> What would you think about implementing non-overflowing
>> clocksource_cyc2ns on architectures that can do it efficiently? You'd
>> have to artificially limit the mask to 2^64 / (rate in GHz), rounded
>> down to a power of 2, but that shouldn't be a problem for any sensible
>> clocksource.
>
> You would run into accuracy issues. The reason why we use large
> mult/shift pairs for timekeeping is because we need to make very fine
> grained adjustments to steer the clock (also just the freq accuracy can
> be poor if you use too low a shift value in the cyc2ns conversions).

Why would it be any worse than right now? We could keep shift as high as 32 (or even higher) and use the exact same logic as we use now.

gcc compiles this code:

uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
{
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
	if (shift >= 32)
		__builtin_unreachable();
#endif
	return (uint64_t)( ((__uint128_t)a * (__uint128_t)mult) >> shift );
}

To:

   0:	89 f0                	mov    %esi,%eax
   2:	89 d1                	mov    %edx,%ecx
   4:	48 f7 e7             	mul    %rdi
   7:	48 0f ad d0          	shrd   %cl,%rdx,%rax
   b:	48 d3 ea             	shr    %cl,%rdx
   e:	f6 c1 40             	test   $0x40,%cl
  11:	48 0f 45 c2          	cmovne %rdx,%rax
  15:	c3                   	retq

And if the compiler were a little smarter, it would generate:

	mov    %esi,%eax
	mov    %edx,%ecx
	mul    %rdi
	shrd   %cl,%rdx,%rax
	retq

So it would be essentially free.

--Andy
^ permalink raw reply [flat|nested] 18+ messages in thread
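As a sanity check of the approach, here is a user-space sketch pairing a 128-bit-product conversion against the current-style 64-bit one (illustrative function names; __uint128_t needs a 64-bit gcc/clang target). The 128-bit product cannot overflow for any 64-bit delta, while the 64-bit product is only safe while delta < 2^64 / mult; the example mult below is roughly 2^32 / 3, i.e. a ~3 GHz clock at shift = 32:

```c
#include <stdint.h>

/* 128-bit intermediate: never overflows for any 64-bit delta. */
static uint64_t cyc2ns_128(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((__uint128_t)delta * mult) >> shift);
}

/* Current-style 64-bit intermediate: wraps once delta * mult > 2^64. */
static uint64_t cyc2ns_64(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;
}
```

For a 1 ms delta the two agree exactly; for a 10 s delta (30e9 cycles at 3 GHz) the 64-bit product wraps and the results diverge, which is the silent-overflow failure mode discussed above.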
* Re: [PATCH] Improve clocksource unstable warning 2010-11-17 1:24 ` Andrew Lutomirski @ 2010-11-17 1:54 ` john stultz 0 siblings, 0 replies; 18+ messages in thread From: john stultz @ 2010-11-17 1:54 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Tue, 2010-11-16 at 20:24 -0500, Andrew Lutomirski wrote: > On Tue, Nov 16, 2010 at 8:19 PM, john stultz <johnstul@us.ibm.com> wrote: > > On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote: > >> On Tue, Nov 16, 2010 at 7:26 PM, john stultz <johnstul@us.ibm.com> wrote: > >> > I'm starting to think we should be pushing the watchdog check into the > >> > timekeeping accumulation loop (or have it hang off of the accumulation > >> > loop). > >> > > >> > 1) The clocksource cyc2ns conversion code is built with assumptions > >> > linked to how frequently we accumulate time via update_wall_time(). > >> > > >> > 2) update_wall_time() happens in timer irq context, so we don't have to > >> > worry about being delayed. If an irq storm or something does actually > >> > cause the timer irq to be delayed, we have bigger issues. > >> > >> That's why I hit this. It would be nice if we didn't respond to irq > >> storms by calling stop_machine. > > > > So even if we don't change clocksources, if you have a long enough > > interrupt storm that delays the hard timer irq, such that the > > clocksources wrap (or hit the mult overflow), your system time will be > > lagging behind anyway. So that would be broken regardless of if the > > watchdog kicked in or not. > > > > I suspect that even with such an irq storm, the timer irq will hopefully > > be high enough priority to be serviced first, avoiding the accumulation > > loss. > > > > > >> > The only trouble with this, is that if we actually push the max_idle_ns > >> > out to something like 10 seconds on the TSC, we could end up having the > >> > watchdog clocksource wrapping while we're in nohz idle. So that could > >> > be ugly. 
Maybe if the current clocksource needs the watchdog > >> > observations, we should cap the max_idle_ns to the smaller of the > >> > current clocksource and the watchdog clocksource. > >> > > >> > >> What would you think about implementing non-overflowing > >> clocksource_cyc2ns on architectures that can do it efficiently? You'd > >> have to artificially limit the mask to 2^64 / (rate in GHz), rounded > >> down to a power of 2, but that shouldn't be a problem for any sensible > >> clocksource. > > > > You would run into accuracy issues. The reason why we use large > > mult/shift pairs for timekeeping is because we need to make very fine > > grained adjustments to steer the clock (also just the freq accuracy can > > be poor if you use too low a shift value in the cyc2ns conversions). > > > > Why would it be any worse than right now? We could keep shift as high > as 32 (or even higher) and use the exact same logic as we use now. Oh. My apologies, I thought you were suggesting to drop shift down, so the 64bit mult doesn't overflow, not using a 128 bit mult to just avoid that issue. > gcc compiles this code: > > uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift) > { > #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5) > if (shift >= 32) > __builtin_unreachable(); > #endif > return (uint64_t)( ((__uint128_t)a * (__uint128_t)mult) >> shift ); > } > > To: > > 0: 89 f0 mov %esi,%eax > 2: 89 d1 mov %edx,%ecx > 4: 48 f7 e7 mul %rdi > 7: 48 0f ad d0 shrd %cl,%rdx,%rax > b: 48 d3 ea shr %cl,%rdx > e: f6 c1 40 test $0x40,%cl > 11: 48 0f 45 c2 cmovne %rdx,%rax > 15: c3 retq > > And if the compiler were a little smarter, it would generate: > > mov %esi,%eax > mov %edx,%ecx > mul %rdi > shrd %cl,%rdx,%rax > retq > > So it would be essentially free. So yes, on 64bit systems it won't be so bad, but again, I'm worried a bit about overhead on 32bit systems, as clocksource_cyc2ns is in the gettimeofday hot path for a quite a lot of applications. 
But it is an interesting thought. And something like the following could avoid the overhead most of the time:

	if (unlikely(delta > cs->max_mult64_cycles))
		return cyc2ns128(delta, cs->mult, cs->shift);
	return cyc2ns64(delta, cs->mult, cs->shift);

Where we optimize the mult/shift pair for the likely max nohz time interval, but allow deeper sleeps without problems.

-john
^ permalink raw reply [flat|nested] 18+ messages in thread
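To put numbers on the overflow behind a threshold like max_mult64_cycles: with ns = (delta * mult) >> shift computed in 64 bits, the product wraps once delta exceeds 2^64 / mult. A sketch (illustrative; the kernel's max_idle_ns additionally subtracts a safety margin below this raw cap, which is consistent with the ~3 s TSC figure Andy measured):

```c
#include <stdint.h>

/* Largest cycle delta whose 64-bit product delta * mult cannot wrap. */
static uint64_t max_cycles(uint32_t mult)
{
	return UINT64_MAX / mult;
}

/*
 * The same limit expressed in nanoseconds.  By construction the
 * multiplication here stays within 64 bits, so this value is exact.
 */
static uint64_t max_ns(uint32_t mult, uint32_t shift)
{
	return (max_cycles(mult) * mult) >> shift;
}
```

With shift = 32 the nanosecond cap lands just under 2^32 ns no matter the frequency — about 4.29 s — so after margins a low-single-digit-seconds max_idle_ns is exactly what one would expect.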
* Re: [PATCH] Improve clocksource unstable warning 2010-11-12 23:48 ` Andrew Lutomirski 2010-11-12 23:51 ` Andrew Lutomirski @ 2010-11-12 23:52 ` john stultz 1 sibling, 0 replies; 18+ messages in thread From: john stultz @ 2010-11-12 23:52 UTC (permalink / raw) To: Andrew Lutomirski; +Cc: Thomas Gleixner, linux-kernel, pc On Fri, 2010-11-12 at 18:48 -0500, Andrew Lutomirski wrote: > On Fri, Nov 12, 2010 at 6:40 PM, john stultz <johnstul@us.ibm.com> wrote: > > On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote: > > > >> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) % > >> wd_wrapping_time ) < (something fairly small) ) > >> bail; // The watchdog most likely wrapped. > > > > Huh. The modulo bit may need tweaking as its not immediately clear its > > right. Maybe the following is clearer?: > > > > if ((cs_elapsed > wd_wrapping_time) > > && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA) > > // should be ok. > > I think this is wrong if wd_elapsed is large (which could happen if > the real wd time is something like (2 * wd_wrapping_time - > MAX_DELTA/4)). But wd_elapsed can never be larger then wd_wrapping_time, right? Or am I still missing something? thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread