* [PATCH] clocksource, prevent overflow in clocksource_cyc2ns @ 2012-04-04 15:11 Prarit Bhargava 2012-04-04 18:00 ` John Stultz 0 siblings, 1 reply; 22+ messages in thread From: Prarit Bhargava @ 2012-04-04 15:11 UTC (permalink / raw) To: linux-kernel Cc: Prarit Bhargava, John Stultz, Thomas Gleixner, Salman Qazi, stable The clocksource code has a watchdog which runs once a minute. The clocksource watchdog calculates the number of nanoseconds since the last time the watchdog ran and compares that value to the number of nanoseconds that have passed on another clocksource. If these values do not match (to within .0625s) then the watchdog marks the current clocksource as unstable and switches to another clocksource. This works so long as the delta between calls of the watchdog is small (maximum delta is ~18 seconds with the tsc). After that, the clocksource_cyc2ns() calculation will overflow and the calculated number of ns returned is incorrect. This causes the watchdog to erroneously mark the tsc as unstable and switch to the hpet. A long delay on the system is not usual, however, it can be reproduced simply by doing echo 1 > /proc/sys/kernel/sysrq for i in `seq 10000`; do sleep 1000 & done echo t > /proc/sysrq-trigger on a 32-cpu system (which is not an unusual number of processes for this size of system). This floods the printk buffer and results in a pause of approximately 600 seconds which prevents the clocksource watchdog from running during that time. On the next call, the watchdog erroneously marks the tsc as unstable and switches to the hpet because the checked values for the tsc overflow. Fixing this is simple -- use mult_frac() in clocksource_cyc2ns(). Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Salman Qazi <sqazi@google.com> Cc: stable@kernel.org --- include/linux/clocksource.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index fbe89e1..1625ddb 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -273,7 +273,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) */ static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) { - return ((u64) cycles * mult) >> shift; + return mult_frac(cycles, mult, (1UL << shift)); } -- 1.7.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-04 15:11 [PATCH] clocksource, prevent overflow in clocksource_cyc2ns Prarit Bhargava @ 2012-04-04 18:00 ` John Stultz 2012-04-04 18:33 ` Prarit Bhargava 0 siblings, 1 reply; 22+ messages in thread From: John Stultz @ 2012-04-04 18:00 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/04/2012 08:11 AM, Prarit Bhargava wrote: > The clocksource code has a watchdog which runs once a minute. The > clocksource watchdog calculates the number of nanoseconds since the last > time the watchdog ran and compares that value to the number of nanoseconds > that have passed on another clocksource. If these values do not match (to > within .0625s) then the watchdog marks the current clocksource as unstable > and switches to another clocksource. > > This works so long as the delta between calls of the watchdog is small > (maximum delta is ~18 seconds with the tsc). After that, the > clocksource_cyc2ns() calculation will overflow and the calculated number > of ns returned is incorrect. > > This causes the watchdog to erroneously mark the tsc as unstable and > switch to the hpet. > > A long delay on the system is not usual, however, it can be reproduced > simply by doing > > echo 1> /proc/sys/kernel/sysrq > for i in `seq 10000`; do sleep 1000& done > echo t> /proc/sysrq-trigger > > on a 32-cpu system (which is not an unusual number of processes for this > size of system). This floods the printk buffer and results in a > pause of approximately 600 seconds which prevents the clocksource watchdog > from running during that time. On the next call, the watchdog erroneously > marks the tsc as unstable and switches to the hpet because the checked > values for the tsc overflow. Hmm. This is a pretty good mix of existing problems. :) We've seen clocksource watchdog false-positives w/ virtualization migrations before, and I think that recently got resolved . Also the huge dumps to printk's which runs with irqs off (usually over slow serial ports) causing lost time also is an outstanding issue. > Fixing this is simple -- use mult_frac() in clocksource_cyc2ns(). > > Signed-off-by: Prarit Bhargava<prarit@redhat.com> > Cc: John Stultz<johnstul@us.ibm.com> > Cc: Thomas Gleixner<tglx@linutronix.de> > Cc: Salman Qazi<sqazi@google.com> > Cc: stable@kernel.org > --- > include/linux/clocksource.h | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h > index fbe89e1..1625ddb 100644 > --- a/include/linux/clocksource.h > +++ b/include/linux/clocksource.h > @@ -273,7 +273,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) > */ > static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) > { > - return ((u64) cycles * mult)>> shift; > + return mult_frac(cycles, mult, (1UL<< shift)); > } > > So I'm not sure this is actually the right fix. The first issue is that the overflow is expected at a specific frequency, and the clocksource code and clockevent code work together to make sure the timekeeping core can accumulate time within that frequency. This is important not just for multiplication overflows, but also because some clocksource hardware wraps after just a few seconds, requiring we accumulate frequently. The rest of the timekeeping code is designed with the requirement that the overflow you're solving with the above doesn't happen. It doesn't mean it can't happen (as you've experience), but that you're running outside the expected bounds of the timekeeping code. So just fixing the multiply in clocksource_cyc2ns doesn't also change the assumptions elsewhere in the timekeeping code. Additionally, gettimeofday/clock_gettime is a hotpath, and this adds extra overhead to the calculation. So that's not great either. So as I said above, I think there's really two issues at play here: 1) If you starve the timekeeping code from being able to periodically accumulate time bad things happen. 2) The clocksource watchdog in particular cannot suffer much delay at all, and falsely triggers on occasion. Item #1 is really an issue of degree. We can try to be more flexible by stretching the expected length of the period out further. But some hardware that wraps quickly just cannot accommodate long intervals. On the TSC we have quite a bit of time till it wraps, but by keeping the periods shorter, we allow for finer grained clock adjustments for NTP. So its somewhat of a balancing act between allowing for really precise timekeeping and robustness in the face of interrupt starvation (and we have only been getting better here, the periodic accumulation was only a tick back in the day, so SMIs or any long period with irqs off would cause lost time). Currently the TSC period should be 10 minutes before we overflow (which aligns to your 600 seconds above). I believe this to be a reasonable length of time where we should expect interrupts to be able to occur. For example, other common clock hardware, like the acpi_pm can really only last ~5 seconds before it wraps. Item #2 is where I think the real fix is needed. The watchdog has been useful in being able to detect flakey TSCs, but because running with a bad TSC can be very problematic for the kernel, its is quick on the trigger to disqualify it. Also, over the years, the expected use cases have grown for systems, and that has run into problems as it hits the bounds of the watchdogs' expectations. In fact, I suspect its not that the TSC wrapping but the HPET or acpi_pm that is being used as the "stable" watchdog clocksource. One likely way to trigger a false positive is if the watchdog is the acpi_pm, but the clocksoruce is the TSC, we can delay/starve interrupts for quite some time, since the TSC won't wrap. But if the watchdog arrives late, and the acpi_pm clock has wrapped, then it assumes the TSC is broken. The problem with these issues is from the kernel's perspective you have to trust one of the two clocks. A similar issue can happen w/ the HPET but not as quickly (since it won't wrap as fast, although maybe you're hitting the multiplication overflow there... honestly your 18 second interval time in the above isn't lining up for me right now, have to think more on that). The hard problem is that the TSC can go wrong in *many* ways. Its a terrible architecture feature, but there's nothing close to it performance wise, so we do what we can to try to make use of it when its safe. Its variation of problems makes it hard to detect false positives when the "trusted" watchdog fails. One idea I've had is to replace the watchdog clocksource with the read_persistent_clock() RTC interface. This would allow us to avoid wrapping issues with watchdog clocksources, however it would also mean we couldn't really check the TSC accuracy for a number of seconds to minutes (due to the RTC granularity being just seconds itself). Or maybe it would be good to still use the watchdog, but double check against the RTC if we get a false positive long after boot? Another idea I had tried to work on a few years ago was to try to catch watchdog overflows was checking if the TSC delta == watchdog delta + some multiple of watchdog intervals, but I didn't manage to get the math right there, and with small enough wrapping intervals and a long enough delay, a % error bound becomes impossible to trigger. One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code. I need to think on that some more (and maybe have you provide some debug output) to really understand how that's solving the issue for you, but it would be able to be done w/o affecting the other assumptions of the timekeeping core. Thomas: Any other ideas? Looking at the watchdog code, since its not an hrtimer, could it also possibly be delayed past its .0625s threshold even just due to long NOHZ deep idle? I suspect quite a few assumptions have changed since that code was written. :) thanks -john ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-04 18:00 ` John Stultz @ 2012-04-04 18:33 ` Prarit Bhargava 2012-04-05 1:08 ` John Stultz 0 siblings, 1 reply; 22+ messages in thread From: Prarit Bhargava @ 2012-04-04 18:33 UTC (permalink / raw) To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable > One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code. > I need to think on that some more (and maybe have you provide some debug output) > to really understand how that's solving the issue for you, but it would be able > to be done w/o affecting the other assumptions of the timekeeping core. > Hey John, After reading the initial part of your reply I was thinking about calling mult_frac() directly from the watchdog code as well. Here's some debug output I cobbled together to get an idea of how quickly the overflow was happening. [ 5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask 0xFFFFFFFFFFFFFFFF mult 797281036 shift 31 [ 5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask 0xFFFFFFFF mult 292935555 shift 22 These, of course, are just the basic data from the clocksources tsc and hpet. This next output shows the overflow of the clocksource_cys2ns() function. The output is clocksource_watchdog: {cpu_number} cs cs->read() clocksource_cyc2ns(csnow - csfirst) clocksource_cyc2ns_calling_mult_frac() and clocksource_watchdog: {cpu_number} wd wd->read() clocksource_cyc2ns(wdnow - wdfirst) [ 19.429080] clocksource_watchdog: {28} cs 37709489529 5410200359 14000134951 [ 19.436964] clocksource_watchdog: {28} wd 200456437 14000133902 [ 19.928803] clocksource_watchdog: {29} cs 39056111109 5910151011 14500085603 [ 19.936678] clocksource_watchdog: {29} wd 207614817 14500084315 [ 20.428600] clocksource_watchdog: {30} cs 40402933983 6410176395 15000110987 [ 20.436477] clocksource_watchdog: {30} wd 214774270 15000109668 [ 20.928376] clocksource_watchdog: {31} cs 41749696692 6910179442 15500114034 [ 20.936260] clocksource_watchdog: {31} wd 221933402 15500112602 [ 21.428091] clocksource_watchdog: {0} cs 43096297329 7410122318 16000056910 [ 21.435878] clocksource_watchdog: {0} wd 229091680 16000055891 [ 21.927855] clocksource_watchdog: {1} cs 44443031499 7910114770 16500049362 [ 21.935642] clocksource_watchdog: {1} wd 236250654 16500047790 [ 22.427640] clocksource_watchdog: {2} cs 45789820092 8410127427 17000062019 [ 22.435426] clocksource_watchdog: {2} wd 243409925 17000060432 [ 22.927479] clocksource_watchdog: {3} cs 47136754215 320259522 17500128706 ^^^^ Right here. The output of clocksource_cyc2ns() overflows, so in theory anything with a delay of ~18 seconds or greater would cause an overflow in the watchdog calculation. [ 22.935168] clocksource_watchdog: {3} wd 250569969 17500127061 [ 23.427221] clocksource_watchdog: {4} cs 48483425124 820228487 18000097671 [ 23.434916] clocksource_watchdog: {4} wd 257728618 18000096262 [ 23.926970] clocksource_watchdog: {5} cs 49830120549 1320206554 18500075738 [ 23.934762] clocksource_watchdog: {5} wd 264887389 18500073983 [ 24.426772] clocksource_watchdog: {6} cs 51176954790 1820236158 19000105342 [ 24.434564] clocksource_watchdog: {6} wd 272046903 19000103596 [ 24.926565] clocksource_watchdog: {7} cs 52523765154 2320256898 19500126082 [ 24.934343] clocksource_watchdog: {7} wd 279206289 19500124270 [Aside: Eventually, the hpet overflows too!] So on to the reproducer ... in which I did echo 1 > /proc/sys/kernel/sysrq for i in `seq 10000`; do sleep 1000 & done echo t > /proc/sysrq-trigger Apr 2 20:05:17 intel-canoepass-02 kernel: [ 104.429864] [<ffffffff814fb7db>] do_nanosleep+0x8b/0xc0 Apr 2 20:05:17 intel-canoepass-02 kernel: [ 104.429866] [<ffffffff81097324>] hrtimer_nanosleep+0xc4/0x180 <snip lots of backtraces for 14000 tasks> [ 621.639952] Clocksource tsc unstable (delta = -4589931578 ns) And then obviously the system switches to hpet. If, however, I switch to using the mult_func() the problem goes away so I'm sure my theory is right. Let me know if you'd like to see any other debug from this. I can certainly dump anything you want. P. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-04 18:33 ` Prarit Bhargava @ 2012-04-05 1:08 ` John Stultz 2012-04-05 11:00 ` Prarit Bhargava 2012-04-05 12:27 ` Prarit Bhargava 0 siblings, 2 replies; 22+ messages in thread From: John Stultz @ 2012-04-05 1:08 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/04/2012 11:33 AM, Prarit Bhargava wrote: >> One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code. >> I need to think on that some more (and maybe have you provide some debug output) >> to really understand how that's solving the issue for you, but it would be able >> to be done w/o affecting the other assumptions of the timekeeping core. >> > Hey John, > > After reading the initial part of your reply I was thinking about calling > mult_frac() directly from the watchdog code as well. > > Here's some debug output I cobbled together to get an idea of how quickly the > overflow was happening. > > [ 5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask > 0xFFFFFFFFFFFFFFFF mult 797281036 shift 31 > [ 5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask > 0xFFFFFFFF mult 292935555 shift 22 > > These, of course, are just the basic data from the clocksources tsc and hpet. If I'm doing the math right, these are ~2.7 Ghz cpus? So what kernel version are you using? In trying to reproduce this locally against Linus' HEAD on a much smaller system (single core + HT 1.6Ghz), I got: [ 6.611366] clocksource_watchdog: {0} cs tsc csfirst 36177888648 mask ffffffffffffffff mult 10485747 shift 24 [ 6.611596] clocksource_watchdog: {0} wd hpet wdfirst 169168400 mask ffffffff mult 2684354560 shift 26 Note the smaller shift values. Not too long ago the shift calculation was adjusted to allow for longer periods between interrupts, so I suspect you're on an older kernel. Further, using your debug patch on my system, it was well beyond 10 minutes before the debug overflow occurred. And similarly I couldn't trip the watchdog trigger using sysrq-t (but again, only two threads here, so not nearly as much data to print as you have). Could you verify that the issue you're seeing is still is present w/ current mainline? Please don't take this as me dismissing your problem! As I mentioned earlier there are some known issues w/ the clocksource watchdog code. But I want to narrow down if you're problem is currently present in mainline or only in older kernels, as that will help us find the proper fix. thanks -john ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 1:08 ` John Stultz @ 2012-04-05 11:00 ` Prarit Bhargava 2012-04-05 16:23 ` John Stultz 2012-04-05 12:27 ` Prarit Bhargava 1 sibling, 1 reply; 22+ messages in thread From: Prarit Bhargava @ 2012-04-05 11:00 UTC (permalink / raw) To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/04/2012 09:08 PM, John Stultz wrote: > On 04/04/2012 11:33 AM, Prarit Bhargava wrote: >>> One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code. >>> I need to think on that some more (and maybe have you provide some debug output) >>> to really understand how that's solving the issue for you, but it would be able >>> to be done w/o affecting the other assumptions of the timekeeping core. >>> >> Hey John, >> >> After reading the initial part of your reply I was thinking about calling >> mult_frac() directly from the watchdog code as well. >> >> Here's some debug output I cobbled together to get an idea of how quickly the >> overflow was happening. >> >> [ 5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask >> 0xFFFFFFFFFFFFFFFF mult 797281036 shift 31 >> [ 5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask >> 0xFFFFFFFF mult 292935555 shift 22 >> >> These, of course, are just the basic data from the clocksources tsc and hpet. > > If I'm doing the math right, these are ~2.7 Ghz cpus? Yes. > > So what kernel version are you using? I was on an earlier version of Fedora (F16) ... but I'll jump forward and see if I can still hit it. > > In trying to reproduce this locally against Linus' HEAD on a much smaller system > (single core + HT 1.6Ghz), I got: > [ 6.611366] clocksource_watchdog: {0} cs tsc csfirst 36177888648 mask > ffffffffffffffff mult 10485747 shift 24 > [ 6.611596] clocksource_watchdog: {0} wd hpet wdfirst 169168400 mask ffffffff > mult 2684354560 shift 26 > > Note the smaller shift values. Not too long ago the shift calculation was > adjusted to allow for longer periods between interrupts, so I suspect you're on > an older kernel. > > Further, using your debug patch on my system, it was well beyond 10 minutes > before the debug overflow occurred. And similarly I couldn't trip the watchdog > trigger using sysrq-t (but again, only two threads here, so not nearly as much > data to print as you have). I'm going to try this on a 32-cpu system (running the previously mentioned test) with linux.git HEAD. > > Could you verify that the issue you're seeing is still is present w/ current > mainline? Please don't take this as me dismissing your problem! As I mentioned Absolutely :) I didn't take it that way at all. .... when I get in this AM I'll bang out a test and see if I can cause this to happen with sysrq-t. Keep in mind that 10000 threads is the *minimum* I was able to cause this with, which is only ~315 threads/cpu, which isn't a lot :/. At that number of threads the dump takes about 6 mins. Doubling it, IIRC, exceeded 10 mins. > earlier there are some known issues w/ the clocksource watchdog code. But I want > to narrow down if you're problem is currently present in mainline or only in > older kernels, as that will help us find the proper fix. Thanks John, P. > > thanks > -john > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 11:00 ` Prarit Bhargava @ 2012-04-05 16:23 ` John Stultz 0 siblings, 0 replies; 22+ messages in thread From: John Stultz @ 2012-04-05 16:23 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/05/2012 04:00 AM, Prarit Bhargava wrote: > > On 04/04/2012 09:08 PM, John Stultz wrote: >> So what kernel version are you using? > I was on an earlier version of Fedora (F16) ... but I'll jump forward and see if > I can still hit it. Ok, but if you have the specific kernel version, it would help, since even if the issue is resolved upstream, we can see if a backport to stable is appropriate. > Keep in mind that 10000 threads is the *minimum* I was able to cause > this with, which is only ~315 threads/cpu, which isn't a lot :/. At > that number of threads the dump takes about 6 mins. Doubling it, IIRC, > exceeded 10 mins. At that point, if we're starving the system of interrupts for over 10 minutes, a number of other non-timekeeping issues can crop up. I've seen scsci controllers fall over if their heart-beat checks don't come in, etc. I suspect if printk dumps taking 10 minutes or more is going to be considered "acceptable" behavior, we may need to see about adding some breaks in the printk code, so that the system can take a few necessary irqs. thanks -john ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 1:08 ` John Stultz 2012-04-05 11:00 ` Prarit Bhargava @ 2012-04-05 12:27 ` Prarit Bhargava 2012-04-05 16:45 ` John Stultz ` (2 more replies) 1 sibling, 3 replies; 22+ messages in thread From: Prarit Bhargava @ 2012-04-05 12:27 UTC (permalink / raw) To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable > > So what kernel version are you using? I retested using top of the linux.git tree, running echo 1 > /proc/sys/kernel/sysrq for i in `seq 10000`; do sleep 1000 & done echo t > /proc/sysrq-trigger and I no longer see a problem. However, if I increase the number of threads to 1000/cpu I get Clocksource %s unstable (delta = -429565427) Clocksource switching to hpet > to narrow down if you're problem is currently present in mainline or only in > older kernels, as that will help us find the proper fix. If I hack in (sorry for the cut-and-paste) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c958338..f38b8d0 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data) continue; } - wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last) & watchdog->m - watchdog->mult, watchdog->shift); + /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last) & watchdog- + watchdog->mult, watchdog->shift);*/ + wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult, + 1UL << watchdog->shift); + + /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) & + cs->mask, cs->mult, cs->shift);*/ + cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult, + 1UL << cs->shift); - cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) & - cs->mask, cs->mult, cs->shift); cs->cs_last = csnow; cs->wd_last = wdnow; then I don't see unstable messages. I think the problem is still here but it only happens in extreme cases. P. ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 12:27 ` Prarit Bhargava @ 2012-04-05 16:45 ` John Stultz 2012-04-06 23:29 ` Thomas Gleixner 2012-04-18 23:20 ` John Stultz 2 siblings, 0 replies; 22+ messages in thread From: John Stultz @ 2012-04-05 16:45 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/05/2012 05:27 AM, Prarit Bhargava wrote: >> So what kernel version are you using? > I retested using top of the linux.git tree, running > > echo 1> /proc/sys/kernel/sysrq > for i in `seq 10000`; do sleep 1000& done > echo t> /proc/sysrq-trigger > > and I no longer see a problem. However, if I increase the number of threads to > 1000/cpu I get > > Clocksource %s unstable (delta = -429565427) > Clocksource switching to hpet Yea, so this is with the 12minute + stall starving the system from irqs? Again, its all a matter of degree. > >> to narrow down if you're problem is currently present in mainline or only in >> older kernels, as that will help us find the proper fix. > If I hack in (sorry for the cut-and-paste) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index c958338..f38b8d0 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data) > continue; > } > > - wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)& watchdog->m > - watchdog->mult, watchdog->shift); > + /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)& watchdog- > + watchdog->mult, watchdog->shift);*/ > + wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult, > + 1UL<< watchdog->shift); > + > + /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)& > + cs->mask, cs->mult, cs->shift);*/ > + cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult, > + 1UL<< cs->shift); > > - cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)& > - cs->mask, cs->mult, cs->shift); > cs->cs_last = csnow; > cs->wd_last = wdnow; > > > then I don't see unstable messages. > > I think the problem is still here but it only happens in extreme cases. I suspect if you were closely watching the clock you'd probably see some odd time inconsistencies as well. Its all due to running outside the expected bounds of the system. If you were to use the acpi_pm clocksource and stall the system for 10+ minutes, when you returned the system clock would be 10 minutes behind. I think if 10 minutes printk dumps is going to be expected behavior (which given the size of these machines, isn't unreasonable), we may need to find a way to allow printk to take a break every so often and allow irqs to happen. Disabling irqs for 10+ minutes can cause all sorts of system trouble. I've seen scsi adapters hang if their heart beat code didn't get an irq every 10 minutes or so. I had proposed something like this awhile back, but the issue cropping up there was lost time due to long printks holding off irqs while we're using the jiffies clocks in early boot, so it was at a much smaller scale. Could you give the following a try and let me know if it helps? thanks -john Only flush 1k chunks max from console_unlock to allow IRQ starvation from happening. NOT FOR INCLUSION Signed-off-by: John Stultz <john.stultz@linaro.org> diff --git a/kernel/printk.c b/kernel/printk.c index b663c2c..b699cc7 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1272,6 +1272,7 @@ void console_unlock(void) unsigned long flags; unsigned _con_start, _log_end; unsigned wake_klogd = 0, retry = 0; + unsigned chunk_size, length; if (console_suspended) { up(&console_sem); @@ -1280,15 +1281,18 @@ void console_unlock(void) console_may_schedule = 0; + chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */ again: for ( ; ; ) { raw_spin_lock_irqsave(&logbuf_lock, flags); wake_klogd |= log_start - log_end; if (con_start == log_end) break; /* Nothing to print */ + length = (log_end - con_start)& LOG_BUF_MASK; + length = min(length , chunk_size); _con_start = con_start; - _log_end = log_end; - con_start = log_end; /* Flush */ + _log_end = (con_start + length)& LOG_BUF_MASK; + con_start = _log_end; /* Flush */ raw_spin_unlock(&logbuf_lock); stop_critical_timings(); /* don't trace print latency */ call_console_drivers(_con_start, _log_end); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 12:27 ` Prarit Bhargava 2012-04-05 16:45 ` John Stultz @ 2012-04-06 23:29 ` Thomas Gleixner 2012-04-07 13:47 ` Prarit Bhargava 2012-04-18 23:20 ` John Stultz 2 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2012-04-06 23:29 UTC (permalink / raw) To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On Thu, 5 Apr 2012, Prarit Bhargava wrote: > > > > So what kernel version are you using? > > I retested using top of the linux.git tree, running > > echo 1 > /proc/sys/kernel/sysrq > for i in `seq 10000`; do sleep 1000 & done > echo t > /proc/sysrq-trigger > > and I no longer see a problem. However, if I increase the number of threads to > 1000/cpu I get > > Clocksource %s unstable (delta = -429565427) > Clocksource switching to hpet You are issuing a command which puts the kernel into a state where is dumps data for several seconds with interrupts disabled. And you expect that everything can cope with that? > If I hack in (sorry for the cut-and-paste) > .... > + cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult, > + 1UL << cs->shift); > > - cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) & > - cs->mask, cs->mult, cs->shift); > then I don't see unstable messages. That does not make your approach more correct. The HPET wraparound time is ~3 seconds, so you screwed everything already, when your dump lasts longer than that. And there are clocksources which wrap way faster. No, you can't fix that by hacking the timer code. A wraparound CANNOT be fixed by hacks. So instead of fiddling in the victims, please fix the root cause, i.e. that stupid sysrq-t code which should not need to have interrupts disabled just to dump all that state. If that's not possible, send a patch to the sysrq documentation and warn about the consequences. But stay away from code which is correct already. You CANNOT fix a problem which is caused by abnormal system state by hacking the code which is exposing the problem. All you do is making hot pathes more expensive with a very dubious value. The time related calls are hotpath functions and optimized. Aside of that you are breaking all architectures which do not have a native 64/32 instruction. This mult_frac stuff is not going to happen, period. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-06 23:29 ` Thomas Gleixner @ 2012-04-07 13:47 ` Prarit Bhargava 0 siblings, 0 replies; 22+ messages in thread From: Prarit Bhargava @ 2012-04-07 13:47 UTC (permalink / raw) To: Thomas Gleixner; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On 04/06/2012 07:29 PM, Thomas Gleixner wrote: > On Thu, 5 Apr 2012, Prarit Bhargava wrote: > >>> >>> So what kernel version are you using? >> >> I retested using top of the linux.git tree, running >> >> echo 1 > /proc/sys/kernel/sysrq >> for i in `seq 10000`; do sleep 1000 & done >> echo t > /proc/sysrq-trigger >> >> and I no longer see a problem. However, if I increase the number of threads to >> 1000/cpu I get >> >> Clocksource %s unstable (delta = -429565427) >> Clocksource switching to hpet > > You are issuing a command which puts the kernel into a state where is > dumps data for several seconds with interrupts disabled. And you expect that > everything can cope with that? Yes actually. I do expect that everything "copes" with it. I don't find it unreasonable with system sizes increasing that functionality that has been around for years works. However, I also understand that no one expected or saw this problem -- I'm not blaming anyone or screaming "Hey! This is broken!!!". > >> If I hack in (sorry for the cut-and-paste) >> .... >> + cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult, >> + 1UL << cs->shift); >> >> - cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) & >> - cs->mask, cs->mult, cs->shift); >> then I don't see unstable messages. > > That does not make your approach more correct. The HPET wraparound > time is ~3 seconds, so you screwed everything already, when your dump > lasts longer than that. And there are clocksources which wrap way > faster. > > No, you can't fix that by hacking the timer code. A wraparound CANNOT > be fixed by hacks. > > So instead of fiddling in the victims, please fix the root cause, > i.e. that stupid sysrq-t code which should not need to have interrupts > disabled just to dump all that state. If that's not possible, send a > patch to the sysrq documentation and warn about the consequences. > > But stay away from code which is correct already. You CANNOT fix a > problem which is caused by abnormal system state by hacking the code > which is exposing the problem. > > All you do is making hot pathes more expensive with a very dubious > value. The time related calls are hotpath functions and optimized. > > Aside of that you are breaking all architectures which do not have a > native 64/32 instruction. > > This mult_frac stuff is not going to happen, period. Okay -- thanks for the info. Much appreciated. P. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-05 12:27 ` Prarit Bhargava 2012-04-05 16:45 ` John Stultz 2012-04-06 23:29 ` Thomas Gleixner @ 2012-04-18 23:20 ` John Stultz 2012-04-18 23:59 ` Prarit Bhargava 2 siblings, 1 reply; 22+ messages in thread From: John Stultz @ 2012-04-18 23:20 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/05/2012 05:27 AM, Prarit Bhargava wrote: >> So what kernel version are you using? > I retested using top of the linux.git tree, running > > echo 1> /proc/sys/kernel/sysrq > for i in `seq 10000`; do sleep 1000& done > echo t> /proc/sysrq-trigger > > and I no longer see a problem. However, if I increase the number of threads to > 1000/cpu I get > > Clocksource %s unstable (delta = -429565427) > Clocksource switching to hpet > >> to narrow down if you're problem is currently present in mainline or only in >> older kernels, as that will help us find the proper fix. > If I hack in (sorry for the cut-and-paste) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index c958338..f38b8d0 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data) > continue; > } > > - wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)& watchdog->m > - watchdog->mult, watchdog->shift); > + /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)& watchdog- > + watchdog->mult, watchdog->shift);*/ > + wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult, > + 1UL<< watchdog->shift); > + > + /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)& > + cs->mask, cs->mult, cs->shift);*/ > + cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult, > + 1UL<< cs->shift); > > - cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)& > - cs->mask, cs->mult, cs->shift); > cs->cs_last = csnow; > cs->wd_last = wdnow; > > > then I don't see unstable messages. > > I think the problem is still here but it only happens in extreme cases. > Hey Prarit, So at tglx's prodding I took a look at the sysrq code, and the problem is the entire sysrq path runs with irqs disabled. As you note,with many cores and many processes, it can take a while to spit all that data out. Instead of the earlier hack I suggested, would you try the following simpler one? I suspect we just need to touch the clocksource watchdog before returning. This should avoid the TSC disqualification you're seeing. On systems using clocksources that wrap, we'll still lose time, since no time accumulation occurred during the long irq off period, but I think that's acceptable given this is not normal operation. Let me know if this helps. thanks -john As irqs may be disabled for quite some time in the sysrq path, touch clocksource watchdog before re-enabling interrupts. Signed-off-by: John Stultz <john.stultz@linaro.org> diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 05728894..28fe2cb 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -41,6 +41,7 @@ #include<linux/slab.h> #include<linux/input.h> #include<linux/uaccess.h> +#include<linux/clocksource.h> #include<asm/ptrace.h> #include<asm/irq_regs.h> @@ -544,6 +545,7 @@ void __handle_sysrq(int key, bool check_mask) printk("\n"); console_loglevel = orig_log_level; } + clocksource_touch_watchdog(); spin_unlock_irqrestore(&sysrq_key_table_lock, flags); } ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-18 23:20 ` John Stultz @ 2012-04-18 23:59 ` Prarit Bhargava 2012-04-19 0:18 ` John Stultz 2012-04-19 12:37 ` Thomas Gleixner 0 siblings, 2 replies; 22+ messages in thread From: Prarit Bhargava @ 2012-04-18 23:59 UTC (permalink / raw) To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable > > Hey Prarit, > So at tglx's prodding I took a look at the sysrq code, and the problem is > the entire sysrq path runs with irqs disabled. As you note,with many cores and > many processes, it can take a while to spit all that data out. > > Instead of the earlier hack I suggested, would you try the following simpler > one? I suspect we just need to touch the clocksource watchdog before returning. > This should avoid the TSC disqualification you're seeing. On systems using > clocksources that wrap, we'll still lose time, since no time accumulation > occurred during the long irq off period, but I think that's acceptable given > this is not normal operation. > > Let me know if this helps. Hey John, Thanks for continuing to work on this. Coincidentally that exact patch was my first attempt at resolving the problem as well. The problem is that even after touching the clocksource watchdog and restoring irqs the printk buffer can take a LONG time to flush -- and that still will cause an overflow comparison. So fixing it with just a touch_clocksource_watchdog() isn't the right thing to do IMO. Maybe a combination of the printk() patch you suggested earlier and the touch_clocksource_watchdog() is the right way to go but I'll leave that up to tglx and yourself to decide on a correct fix. There's also some additional information that I've been gathering on this issue; I have seen *idle* systems switch to the hpet because the clocksource watchdog hits the overflow comparison. As expected it happens much less frequently on newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based) due to the difference in shift values but it is happening in both cases. The odd thing about this behaviour is that I would expect it to occur with the same frequency on small systems as it does on large systems with linux.git as the watchdog fires once/second. AFAICT I do not see this on small systems but see it only on systems with greater than 24 cpus (both Intel and AMD). Using debug code similar to the dump code I previously provided, I can see that every so often these large systems can hit a case where the tsc wraps and the hpet is still monotonically increasing. When the unstable calculation is performed the result is obviously affected by the overflow. Sometimes this comparison overflow happens within 18 minutes, other times it can take hours or days. The other part of this puzzle is that if switch between the tsc and hpet every 10 seconds, and run a gettimeofday() comparison program, the gettimeofday() program will return a backwards time[1] event usually within half-an-hour. [I'm just including this info here to point out that switching between clocksources seems to cause some momentary instability. Before anyone points this out I will say that this not a "real world" bug. I'm trying to find out if anyone actually does switch from the tsc to hpet (and back) on multi-purposed systems. I'm hoping the answer to that is "no" :) ]. P. [1] is not really a backwards time event. The program does a comparison of gettimeofday() values and returns an error if the new value is less than the older value. The gettimeofday() value jumps *forward* slightly for one returned value when the clocksource switches from the tsc to the hpet (I have not see a failure the other way around), and that causes the comparison to fail on the next comparison of gettimeofday(). If I remove the code that returns an error gettimeofday() returns to returning incrementally increasing values. Typical jumps are in the tens of microseconds range. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-18 23:59 ` Prarit Bhargava @ 2012-04-19 0:18 ` John Stultz 2012-04-19 11:56 ` Prarit Bhargava 2012-04-19 12:50 ` Thomas Gleixner 2012-04-19 12:37 ` Thomas Gleixner 1 sibling, 2 replies; 22+ messages in thread From: John Stultz @ 2012-04-19 0:18 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/18/2012 04:59 PM, Prarit Bhargava wrote: > > Hey John, > > Thanks for continuing to work on this. Coincidentally that exact patch was my > first attempt at resolving the problem as well. The problem is that even after > touching the clocksource watchdog and restoring irqs the printk buffer can take > a LONG time to flush -- and that still will cause an overflow comparison. So > fixing it with just a touch_clocksource_watchdog() isn't the right thing to do > IMO. Maybe a combination of the printk() patch you suggested earlier and the > touch_clocksource_watchdog() is the right way to go but I'll leave that up to > tglx and yourself to decide on a correct fix. :( That's a bummer. Something similar may be useful on the printk side as well. > There's also some additional information that I've been gathering on this issue; > I have seen *idle* systems switch to the hpet because the clocksource watchdog > hits the overflow comparison. As expected it happens much less frequently on > newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based) > due to the difference in shift values but it is happening in both cases. Some of the recent adjustments for more robust shift calculations may partially be responsible for the improvement. Although I'm not sure why idle systems (that don't halt the TSC in idle) would trip this. Do let me know if you find any particular way of reproducing this. > The odd thing about this behaviour is that I would expect it to occur with the > same frequency on small systems as it does on large systems with linux.git as > the watchdog fires once/second. AFAICT I do not see this on small systems but > see it only on systems with greater than 24 cpus (both Intel and AMD). Hrm. > Using debug code similar to the dump code I previously provided, I can see that > every so often these large systems can hit a case where the tsc wraps and the > hpet is still monotonically increasing. When the unstable calculation is > performed the result is obviously affected by the overflow. Sometimes this > comparison overflow happens within 18 minutes, other times it can take hours or > days. TSC wraps? Are you sure that's what you see? Or do you have that switched? With the HPET wrapping? > The other part of this puzzle is that if switch between the tsc and hpet every > 10 seconds, and run a gettimeofday() comparison program, the gettimeofday() > program will return a backwards time[1] event usually within half-an-hour. [I'm > just including this info here to point out that switching between clocksources > seems to cause some momentary instability. Before anyone points this out I will > say that this not a "real world" bug. I'm trying to find out if anyone actually > does switch from the tsc to hpet (and back) on multi-purposed systems. I'm > hoping the answer to that is "no" :) ]. So, there were some recent fixes for 3.4 to address an issue specifically around inconsistencies at clocksource switch time: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a939e817aa7e199d2fff05a67cb745be32dd5c2d http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f695cf94837de53864180400cbac42cfa370426f I definitely want to make sure any sort of inconsistencies like that are resolved. So let me know if you can still trigger anything like that with the latest 3.4 kernel. thanks -john ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 0:18 ` John Stultz @ 2012-04-19 11:56 ` Prarit Bhargava 2012-04-19 12:50 ` Thomas Gleixner 1 sibling, 0 replies; 22+ messages in thread From: Prarit Bhargava @ 2012-04-19 11:56 UTC (permalink / raw) To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable On 04/18/2012 08:18 PM, John Stultz wrote: > On 04/18/2012 04:59 PM, Prarit Bhargava wrote: >> >> Hey John, >> >> Thanks for continuing to work on this. Coincidentally that exact patch was my >> first attempt at resolving the problem as well. The problem is that even after >> touching the clocksource watchdog and restoring irqs the printk buffer can take >> a LONG time to flush -- and that still will cause an overflow comparison. So >> fixing it with just a touch_clocksource_watchdog() isn't the right thing to do >> IMO. Maybe a combination of the printk() patch you suggested earlier and the >> touch_clocksource_watchdog() is the right way to go but I'll leave that up to >> tglx and yourself to decide on a correct fix. > :( That's a bummer. Something similar may be useful on the printk side as well. Hmm ... I'll give that a shot. > > >> There's also some additional information that I've been gathering on this issue; >> I have seen *idle* systems switch to the hpet because the clocksource watchdog >> hits the overflow comparison. As expected it happens much less frequently on >> newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based) >> due to the difference in shift values but it is happening in both cases. > > Some of the recent adjustments for more robust shift calculations may partially > be responsible for the improvement. Although I'm not sure why idle systems (that > don't halt the TSC in idle) would trip this. Do let me know if you find any > particular way of reproducing this. > >> The odd thing about this behaviour is that I would expect it to occur with the >> same frequency on small systems as it does on large systems with linux.git as >> the watchdog fires once/second. AFAICT I do not see this on small systems but >> see it only on systems with greater than 24 cpus (both Intel and AMD). > Hrm. Yeah, it's odd. I have no idea why more cpus makes any difference :/ > >> Using debug code similar to the dump code I previously provided, I can see that >> every so often these large systems can hit a case where the tsc wraps and the >> hpet is still monotonically increasing. When the unstable calculation is >> performed the result is obviously affected by the overflow. Sometimes this >> comparison overflow happens within 18 minutes, other times it can take hours or >> days. > TSC wraps? Are you sure that's what you see? Or do you have that switched? With > the HPET wrapping? Sorry, you're right -- the HPET wraps. I mistyped that. > > >> The other part of this puzzle is that if switch between the tsc and hpet every >> 10 seconds, and run a gettimeofday() comparison program, the gettimeofday() >> program will return a backwards time[1] event usually within half-an-hour. [I'm >> just including this info here to point out that switching between clocksources >> seems to cause some momentary instability. Before anyone points this out I will >> say that this not a "real world" bug. I'm trying to find out if anyone actually >> does switch from the tsc to hpet (and back) on multi-purposed systems. I'm >> hoping the answer to that is "no" :) ]. > So, there were some recent fixes for 3.4 to address an issue specifically around > inconsistencies at clocksource switch time: > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a939e817aa7e199d2fff05a67cb745be32dd5c2d > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f695cf94837de53864180400cbac42cfa370426f > AFAICT I have both of these in my tree. It is linux-2.6.git as of 592fe8980688e7cba46897685d014c7fb3018a67. I am doing while (true) do val=`ps aux | egrep $1 | wc -l` if [ $val -ne 2 ]; then exit 1 fi echo "switching to tsc" echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource sleep 10 val=`ps aux | egrep $1 | wc -l` if [ $val -ne 2 ]; then exit 1 fi echo "switching to hpet" echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource sleep 10 done where $1 is the pid of my gettimeofday() comparison test. As I said, the test exists when a "backwards" time event occurs so the script above also bails. > > I definitely want to make sure any sort of inconsistencies like that are > resolved. So let me know if you can still trigger anything like that with the > latest 3.4 kernel. I'll dig into this a bit more then -- I have a few things I want to investigate. I'll also try the touch_clocksource_watchdog() in the printk() code and get back to in a few days. P. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 0:18 ` John Stultz 2012-04-19 11:56 ` Prarit Bhargava @ 2012-04-19 12:50 ` Thomas Gleixner 2012-04-19 12:52 ` Thomas Gleixner 1 sibling, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2012-04-19 12:50 UTC (permalink / raw) To: John Stultz; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable On Wed, 18 Apr 2012, John Stultz wrote: > On 04/18/2012 04:59 PM, Prarit Bhargava wrote: > > > > Hey John, > > > > Thanks for continuing to work on this. Coincidentally that exact patch was > > my > > first attempt at resolving the problem as well. The problem is that even > > after > > touching the clocksource watchdog and restoring irqs the printk buffer can > > take > > a LONG time to flush -- and that still will cause an overflow comparison. > > So > > fixing it with just a touch_clocksource_watchdog() isn't the right thing to > > do > > IMO. Maybe a combination of the printk() patch you suggested earlier and > > the > > touch_clocksource_watchdog() is the right way to go but I'll leave that up > > to > > tglx and yourself to decide on a correct fix. > :( That's a bummer. Something similar may be useful on the printk side as > well. No. The show_state() part prints into the buffer. But it's not guaranteed that the buffer is flushed right away. It could be flushed later as well in a different context. And of course the flush code runs with interrupts disabled and dumping out a gazillion of lines over serial will cause the same hickup. Just planting random touch_watchdog() calls into the code is not the right approach, really. We should think about the reasons why we have interrupts disabled for so much time. Is that really, really necessary ? Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 12:50 ` Thomas Gleixner @ 2012-04-19 12:52 ` Thomas Gleixner 2012-04-19 13:06 ` Prarit Bhargava 2012-04-19 18:12 ` John Stultz 0 siblings, 2 replies; 22+ messages in thread From: Thomas Gleixner @ 2012-04-19 12:52 UTC (permalink / raw) To: John Stultz; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable On Thu, 19 Apr 2012, Thomas Gleixner wrote: > On Wed, 18 Apr 2012, John Stultz wrote: > > On 04/18/2012 04:59 PM, Prarit Bhargava wrote: > > > > > > Hey John, > > > > > > Thanks for continuing to work on this. Coincidentally that exact patch was > > > my > > > first attempt at resolving the problem as well. The problem is that even > > > after > > > touching the clocksource watchdog and restoring irqs the printk buffer can > > > take > > > a LONG time to flush -- and that still will cause an overflow comparison. > > > So > > > fixing it with just a touch_clocksource_watchdog() isn't the right thing to > > > do > > > IMO. Maybe a combination of the printk() patch you suggested earlier and > > > the > > > touch_clocksource_watchdog() is the right way to go but I'll leave that up > > > to > > > tglx and yourself to decide on a correct fix. > > :( That's a bummer. Something similar may be useful on the printk side as > > well. > > No. The show_state() part prints into the buffer. But it's not > guaranteed that the buffer is flushed right away. It could be flushed > later as well in a different context. And of course the flush code > runs with interrupts disabled and dumping out a gazillion of lines > over serial will cause the same hickup. Just planting random > touch_watchdog() calls into the code is not the right approach, > really. > > We should think about the reasons why we have interrupts disabled for > so much time. Is that really, really necessary ? I'm not against making the clocksource code more robust, but I don't want to add crap there just to cope with complete madness elsewhere. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 12:52 ` Thomas Gleixner @ 2012-04-19 13:06 ` Prarit Bhargava 2012-04-19 13:18 ` Thomas Gleixner 2012-04-19 18:12 ` John Stultz 1 sibling, 1 reply; 22+ messages in thread From: Prarit Bhargava @ 2012-04-19 13:06 UTC (permalink / raw) To: Thomas Gleixner; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On 04/19/2012 08:52 AM, Thomas Gleixner wrote: > On Thu, 19 Apr 2012, Thomas Gleixner wrote: > >> On Wed, 18 Apr 2012, John Stultz wrote: >>> On 04/18/2012 04:59 PM, Prarit Bhargava wrote: >>>> >> No. The show_state() part prints into the buffer. But it's not >> guaranteed that the buffer is flushed right away. It could be flushed >> later as well in a different context. And of course the flush code >> runs with interrupts disabled and dumping out a gazillion of lines >> over serial will cause the same hickup. Just planting random >> touch_watchdog() calls into the code is not the right approach, >> really. >> >> We should think about the reasons why we have interrupts disabled for >> so much time. Is that really, really necessary ? In the case of the sysrq-t, I would argue that it is. The whole point behind the sysrq-t is that we're capturing the *current* state of the system. Having that output effected by interrupts seems like a bad idea. > > I'm not against making the clocksource code more robust, but I don't > want to add crap there just to cope with complete madness elsewhere. Maybe I came off the wrong way but I completely agree with that sentiment. Like yourself, I'm looking for a correct fix rather than a fast fix. Sorry that I haven't provided any debug info but I'm still in the gathering data stage atm. It was just John's ping that made me "brain dump" the current info I had. P. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 13:06 ` Prarit Bhargava @ 2012-04-19 13:18 ` Thomas Gleixner 0 siblings, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2012-04-19 13:18 UTC (permalink / raw) To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On Thu, 19 Apr 2012, Prarit Bhargava wrote: > On 04/19/2012 08:52 AM, Thomas Gleixner wrote: > > On Thu, 19 Apr 2012, Thomas Gleixner wrote: > > > >> On Wed, 18 Apr 2012, John Stultz wrote: > >>> On 04/18/2012 04:59 PM, Prarit Bhargava wrote: > >>>> > > >> No. The show_state() part prints into the buffer. But it's not > >> guaranteed that the buffer is flushed right away. It could be flushed > >> later as well in a different context. And of course the flush code > >> runs with interrupts disabled and dumping out a gazillion of lines > >> over serial will cause the same hickup. Just planting random > >> touch_watchdog() calls into the code is not the right approach, > >> really. > >> > >> We should think about the reasons why we have interrupts disabled for > >> so much time. Is that really, really necessary ? > > In the case of the sysrq-t, I would argue that it is. The whole point behind > the sysrq-t is that we're capturing the *current* state of the system. Having > that output effected by interrupts seems like a bad idea. Nonsense. The system state can change, while we are dumping it unless you run on an UP system. irq disable only affects the current CPU nothing else, and nothing can prevent the other cpus to wake up, run, exit, fork ..... Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 12:52 ` Thomas Gleixner 2012-04-19 13:06 ` Prarit Bhargava @ 2012-04-19 18:12 ` John Stultz 2012-04-25 12:29 ` Prarit Bhargava 1 sibling, 1 reply; 22+ messages in thread From: John Stultz @ 2012-04-19 18:12 UTC (permalink / raw) To: Thomas Gleixner; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable On 04/19/2012 05:52 AM, Thomas Gleixner wrote: > On Thu, 19 Apr 2012, Thomas Gleixner wrote: > >> We should think about the reasons why we have interrupts disabled for >> so much time. Is that really, really necessary ? > I'm not against making the clocksource code more robust, but I don't > want to add crap there just to cope with complete madness elsewhere. > Very much agreed. -john ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 18:12 ` John Stultz @ 2012-04-25 12:29 ` Prarit Bhargava 0 siblings, 0 replies; 22+ messages in thread From: Prarit Bhargava @ 2012-04-25 12:29 UTC (permalink / raw) To: John Stultz; +Cc: Thomas Gleixner, linux-kernel, Salman Qazi, stable On 04/19/2012 02:12 PM, John Stultz wrote: > On 04/19/2012 05:52 AM, Thomas Gleixner wrote: >> On Thu, 19 Apr 2012, Thomas Gleixner wrote: >> >>> We should think about the reasons why we have interrupts disabled for >>> so much time. Is that really, really necessary ? >> I'm not against making the clocksource code more robust, but I don't >> want to add crap there just to cope with complete madness elsewhere. >> > > Very much agreed. Hi John and Thomas, After much analysis I have good news to report. The good news is that the problem with the random tsc failures was chased down to a script left running in which sysrq-t's were executed over ping packets. This, as I've previously pointed out, can cause the tsc to be erroneously marked unstable. [Aside: I hit myself with a big cluebat when I realized that all the failures were occurring at the same wall-clock time, 3:00AM. That couldn't be a coincidence.] I'm working with lwoodman to figure out a way to get rid of the locking (as suggested by you Thomas) around the sysrq code. P. > -john > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-18 23:59 ` Prarit Bhargava 2012-04-19 0:18 ` John Stultz @ 2012-04-19 12:37 ` Thomas Gleixner 2012-04-19 12:51 ` Thomas Gleixner 1 sibling, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2012-04-19 12:37 UTC (permalink / raw) To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On Wed, 18 Apr 2012, Prarit Bhargava wrote: > There's also some additional information that I've been gathering on this issue; > I have seen *idle* systems switch to the hpet because the clocksource watchdog > hits the overflow comparison. As expected it happens much less frequently on > newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based) > due to the difference in shift values but it is happening in both cases. > > The odd thing about this behaviour is that I would expect it to occur with the > same frequency on small systems as it does on large systems with linux.git as > the watchdog fires once/second. AFAICT I do not see this on small systems but > see it only on systems with greater than 24 cpus (both Intel and AMD). > > Using debug code similar to the dump code I previously provided, I can see that > every so often these large systems can hit a case where the tsc wraps and the > hpet is still monotonically increasing. When the unstable calculation is > performed the result is obviously affected by the overflow. Sometimes this > comparison overflow happens within 18 minutes, other times it can take hours or > days. You are describing symptoms, but the root cause is obviously that the watchdog does not get invoked in time. The question is why. Can you please add the patch below and enable scheduler, timer and irq events in the tracer. Tracing will stop once the watchdog triggers. Please provide the traces. We need to understand the root cause of this idle wreckage. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns 2012-04-19 12:37 ` Thomas Gleixner @ 2012-04-19 12:51 ` Thomas Gleixner 0 siblings, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2012-04-19 12:51 UTC (permalink / raw) To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable On Thu, 19 Apr 2012, Thomas Gleixner wrote: > On Wed, 18 Apr 2012, Prarit Bhargava wrote: > > There's also some additional information that I've been gathering on this issue; > > I have seen *idle* systems switch to the hpet because the clocksource watchdog > > hits the overflow comparison. As expected it happens much less frequently on > > newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based) > > due to the difference in shift values but it is happening in both cases. > > > > The odd thing about this behaviour is that I would expect it to occur with the > > same frequency on small systems as it does on large systems with linux.git as > > the watchdog fires once/second. AFAICT I do not see this on small systems but > > see it only on systems with greater than 24 cpus (both Intel and AMD). > > > > Using debug code similar to the dump code I previously provided, I can see that > > every so often these large systems can hit a case where the tsc wraps and the > > hpet is still monotonically increasing. When the unstable calculation is > > performed the result is obviously affected by the overflow. Sometimes this > > comparison overflow happens within 18 minutes, other times it can take hours or > > days. > > You are describing symptoms, but the root cause is obviously that the > watchdog does not get invoked in time. The question is why. > > Can you please add the patch below and enable scheduler, timer and irq > events in the tracer. Tracing will stop once the watchdog triggers. > > Please provide the traces. We need to understand the root cause of > this idle wreckage. > > Thanks, > > tglx -ENOPATCH :) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c958338..2214323 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -287,11 +287,15 @@ static void clocksource_watchdog(unsigned long data) cs->cs_last = csnow; cs->wd_last = wdnow; + trace_printk("wd %lld %lld cs %lld %lld\n" , wdnow, wd_nsec, + csnow, cs_nsec); + if (atomic_read(&watchdog_reset_pending)) continue; /* Check the deviation from the watchdog clocksource. */ if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) { + tracing_off(); clocksource_unstable(cs, cs_nsec - wd_nsec); continue; } ^ permalink raw reply related [flat|nested] 22+ messages in thread
end of thread, other threads:[~2012-04-25 12:30 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-04-04 15:11 [PATCH] clocksource, prevent overflow in clocksource_cyc2ns Prarit Bhargava 2012-04-04 18:00 ` John Stultz 2012-04-04 18:33 ` Prarit Bhargava 2012-04-05 1:08 ` John Stultz 2012-04-05 11:00 ` Prarit Bhargava 2012-04-05 16:23 ` John Stultz 2012-04-05 12:27 ` Prarit Bhargava 2012-04-05 16:45 ` John Stultz 2012-04-06 23:29 ` Thomas Gleixner 2012-04-07 13:47 ` Prarit Bhargava 2012-04-18 23:20 ` John Stultz 2012-04-18 23:59 ` Prarit Bhargava 2012-04-19 0:18 ` John Stultz 2012-04-19 11:56 ` Prarit Bhargava 2012-04-19 12:50 ` Thomas Gleixner 2012-04-19 12:52 ` Thomas Gleixner 2012-04-19 13:06 ` Prarit Bhargava 2012-04-19 13:18 ` Thomas Gleixner 2012-04-19 18:12 ` John Stultz 2012-04-25 12:29 ` Prarit Bhargava 2012-04-19 12:37 ` Thomas Gleixner 2012-04-19 12:51 ` Thomas Gleixner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox