* [PATCH 1/2] watchdog: Remove touch_all_softlockup_watchdogs
@ 2011-11-24 3:53 Anton Blanchard
2011-11-24 3:54 ` [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed Anton Blanchard
0 siblings, 1 reply; 5+ messages in thread
From: Anton Blanchard @ 2011-11-24 3:53 UTC (permalink / raw)
To: Don Zickus, Jeremy Fitzhardinge, Thomas Gleixner,
Frederic Weisbecker, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel
commit 04c9167f91e3 (add touch_all_softlockup_watchdogs()) added a
way to touch the watchdog on all cpus because show_state_filter
could hold the tasklist_lock for extended periods of time.
commit 510f5acc4f4f (sched: Don't use tasklist_lock for debug prints)
has since removed the tasklist_lock in show_state_filter so we can
use touch_softlockup_watchdog instead.
Signed-off-by: Anton Blanchard <anton@samba.org>
---
Index: linux-build/include/linux/sched.h
===================================================================
--- linux-build.orig/include/linux/sched.h 2011-11-16 07:57:25.054353865 +1100
+++ linux-build/include/linux/sched.h 2011-11-16 08:04:56.270478443 +1100
@@ -310,7 +310,6 @@ extern void sched_show_task(struct task_
#ifdef CONFIG_LOCKUP_DETECTOR
extern void touch_softlockup_watchdog(void);
extern void touch_softlockup_watchdog_sync(void);
-extern void touch_all_softlockup_watchdogs(void);
extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
void __user *buffer,
size_t *lenp, loff_t *ppos);
@@ -323,9 +322,6 @@ static inline void touch_softlockup_watc
static inline void touch_softlockup_watchdog_sync(void)
{
}
-static inline void touch_all_softlockup_watchdogs(void)
-{
-}
static inline void lockup_detector_init(void)
{
}
Index: linux-build/kernel/sched.c
===================================================================
--- linux-build.orig/kernel/sched.c 2011-11-16 07:57:25.066354081 +1100
+++ linux-build/kernel/sched.c 2011-11-16 08:04:56.274478516 +1100
@@ -6033,7 +6033,7 @@ void show_state_filter(unsigned long sta
sched_show_task(p);
} while_each_thread(g, p);
- touch_all_softlockup_watchdogs();
+ touch_softlockup_watchdog();
#ifdef CONFIG_SCHED_DEBUG
sysrq_sched_debug_show();
Index: linux-build/kernel/watchdog.c
===================================================================
--- linux-build.orig/kernel/watchdog.c 2011-11-16 07:57:25.626364139 +1100
+++ linux-build/kernel/watchdog.c 2011-11-16 08:04:56.274478516 +1100
@@ -138,19 +138,6 @@ void touch_softlockup_watchdog(void)
}
EXPORT_SYMBOL(touch_softlockup_watchdog);
-void touch_all_softlockup_watchdogs(void)
-{
- int cpu;
-
- /*
- * this is done lockless
- * do we care if a 0 races with a timestamp?
- * all it means is the softlock check starts one cycle later
- */
- for_each_online_cpu(cpu)
- per_cpu(watchdog_touch_ts, cpu) = 0;
-}
-
#ifdef CONFIG_HARDLOCKUP_DETECTOR
void touch_nmi_watchdog(void)
{
^ permalink raw reply [flat|nested] 5+ messages in thread* [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed 2011-11-24 3:53 [PATCH 1/2] watchdog: Remove touch_all_softlockup_watchdogs Anton Blanchard @ 2011-11-24 3:54 ` Anton Blanchard 2011-11-28 21:47 ` Don Zickus 0 siblings, 1 reply; 5+ messages in thread From: Anton Blanchard @ 2011-11-24 3:54 UTC (permalink / raw) To: Don Zickus, Jeremy Fitzhardinge, Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra Cc: linux-kernel The softlockup watchdog has a two stage sync - touch_softlockup_watchdog simply sets the timestamp to 0 and later on the timer routine notices this and sets the timestamp. The problem is this timer goes off every 4 seconds by default, so each time we call touch_softlockup_watchdog there is a period of up to 4 seconds where the softlockup watchdog is not armed. We call touch_softlockup_watchdog very often in the NO_HZ code and end up hitting this issue every time we go in and out of idle. I wrote a simple test case: http://ozlabs.org/~anton/junkcode/badguy.tar.gz That disables interrupts on selected CPUs for a period of time. Don't run it on a machine you care about. When I disable interrupts for 30 seconds on a previously idle CPU I get no warning: insmod ./badguy.ko timeout=30 cpus=4 However if I keep the CPU busy so we don't switch in and out of NO_HZ mode I get a warning as expected: taskset -c 4 yes > /dev/null & insmod ./badguy.ko timeout=30 cpus=4 With the following patch I get a warning even on a previously idle CPU. Signed-off-by: Anton Blanchard <anton@samba.org> --- There might be a reason for this two stage sync but I haven't been able to find it yet. Perhaps the unsynced versions of cpu_clock() and sched_clock_tick() are not safe to call from all contexts? Index: linux-build/kernel/watchdog.c =================================================================== --- linux-build.orig/kernel/watchdog.c 2011-11-16 08:04:56.274478516 +1100 +++ linux-build/kernel/watchdog.c 2011-11-16 08:04:59.278533261 +1100 @@ -33,7 +33,6 @@ int __read_mostly watchdog_thresh = 10; static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); static DEFINE_PER_CPU(struct task_struct *, softlockup_watchdog); static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer); -static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); #ifdef CONFIG_HARDLOCKUP_DETECTOR static DEFINE_PER_CPU(bool, hard_watchdog_warn); @@ -134,7 +133,7 @@ static void __touch_watchdog(void) void touch_softlockup_watchdog(void) { - __this_cpu_write(watchdog_touch_ts, 0); + __touch_watchdog(); } EXPORT_SYMBOL(touch_softlockup_watchdog); @@ -157,8 +156,8 @@ EXPORT_SYMBOL(touch_nmi_watchdog); void touch_softlockup_watchdog_sync(void) { - __raw_get_cpu_var(softlockup_touch_sync) = true; - __raw_get_cpu_var(watchdog_touch_ts) = 0; + sched_clock_tick(); + __touch_watchdog(); } #ifdef CONFIG_HARDLOCKUP_DETECTOR @@ -258,19 +257,6 @@ static enum hrtimer_restart watchdog_tim /* .. and repeat */ hrtimer_forward_now(hrtimer, ns_to_ktime(get_sample_period())); - if (touch_ts == 0) { - if (unlikely(__this_cpu_read(softlockup_touch_sync))) { - /* - * If the time stamp was touched atomically - * make sure the scheduler tick is up to date. - */ - __this_cpu_write(softlockup_touch_sync, false); - sched_clock_tick(); - } - __touch_watchdog(); - return HRTIMER_RESTART; - } - /* check for a softlockup * This is done by making sure a high priority task is * being scheduled. The task touches the watchdog to @@ -438,7 +424,7 @@ static int watchdog_enable(int cpu) goto out; } kthread_bind(p, cpu); - per_cpu(watchdog_touch_ts, cpu) = 0; + __touch_watchdog(); per_cpu(softlockup_watchdog, cpu) = p; wake_up_process(p); } ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed 2011-11-24 3:54 ` [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed Anton Blanchard @ 2011-11-28 21:47 ` Don Zickus 2011-12-05 10:28 ` Anton Blanchard 0 siblings, 1 reply; 5+ messages in thread From: Don Zickus @ 2011-11-28 21:47 UTC (permalink / raw) To: Anton Blanchard Cc: Jeremy Fitzhardinge, Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel, jason.wessel On Thu, Nov 24, 2011 at 02:54:41PM +1100, Anton Blanchard wrote: > > The softlockup watchdog has a two stage sync - touch_softlockup_watchdog > simply sets the timestamp to 0 and later on the timer routine notices > this and sets the timestamp. > > The problem is this timer goes off every 4 seconds by default, so > each time we call touch_softlockup_watchdog there is a period > of up to 4 seconds where the softlockup watchdog is not armed. > > We call touch_softlockup_watchdog very often in the NO_HZ code and > end up hitting this issue every time we go in and out of idle. > > I wrote a simple test case: > > http://ozlabs.org/~anton/junkcode/badguy.tar.gz > > That disables interrupts on selected CPUs for a period of time. Don't > run it on a machine you care about. When I disable interrupts for 30 > seconds on a previously idle CPU I get no warning: > > insmod ./badguy.ko timeout=30 cpus=4 > > However if I keep the CPU busy so we don't switch in and out of NO_HZ > mode I get a warning as expected: > > taskset -c 4 yes > /dev/null & > insmod ./badguy.ko timeout=30 cpus=4 > > With the following patch I get a warning even on a previously idle > CPU. > > Signed-off-by: Anton Blanchard <anton@samba.org> > --- > > There might be a reason for this two stage sync but I haven't been > able to find it yet. Perhaps the unsynced versions of cpu_clock() and > sched_clock_tick() are not safe to call from all contexts? According to commit 8c2238eaaf0f774ca0f8d9daad7a616429bbb7f1 that was the case, cpu_clock wasn't NMI-safe. Now it is, thanks to Peter. I have a couple of concerns about the patch. I am wondering about the overhead of getting the timestamp more often now as opposed to just setting a boolean for later. It makes sense to stamp it at the time of the call, don't know what the cost is. I am also concern about how this affects suspend/resume and kgdb. I cc'd Jason above for kgdb. I'll have to run some tests locally to see what long periods of delay look like. Oh and virt guests too. You don't have any test results from that setup do you? Cheers, Don > > Index: linux-build/kernel/watchdog.c > =================================================================== > --- linux-build.orig/kernel/watchdog.c 2011-11-16 08:04:56.274478516 +1100 > +++ linux-build/kernel/watchdog.c 2011-11-16 08:04:59.278533261 +1100 > @@ -33,7 +33,6 @@ int __read_mostly watchdog_thresh = 10; > static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); > static DEFINE_PER_CPU(struct task_struct *, softlockup_watchdog); > static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer); > -static DEFINE_PER_CPU(bool, softlockup_touch_sync); > static DEFINE_PER_CPU(bool, soft_watchdog_warn); > #ifdef CONFIG_HARDLOCKUP_DETECTOR > static DEFINE_PER_CPU(bool, hard_watchdog_warn); > @@ -134,7 +133,7 @@ static void __touch_watchdog(void) > > void touch_softlockup_watchdog(void) > { > - __this_cpu_write(watchdog_touch_ts, 0); > + __touch_watchdog(); > } > EXPORT_SYMBOL(touch_softlockup_watchdog); > > @@ -157,8 +156,8 @@ EXPORT_SYMBOL(touch_nmi_watchdog); > > void touch_softlockup_watchdog_sync(void) > { > - __raw_get_cpu_var(softlockup_touch_sync) = true; > - __raw_get_cpu_var(watchdog_touch_ts) = 0; > + sched_clock_tick(); > + __touch_watchdog(); > } > > #ifdef CONFIG_HARDLOCKUP_DETECTOR > @@ -258,19 +257,6 @@ static enum hrtimer_restart watchdog_tim > /* .. and repeat */ > hrtimer_forward_now(hrtimer, ns_to_ktime(get_sample_period())); > > - if (touch_ts == 0) { > - if (unlikely(__this_cpu_read(softlockup_touch_sync))) { > - /* > - * If the time stamp was touched atomically > - * make sure the scheduler tick is up to date. > - */ > - __this_cpu_write(softlockup_touch_sync, false); > - sched_clock_tick(); > - } > - __touch_watchdog(); > - return HRTIMER_RESTART; > - } > - > /* check for a softlockup > * This is done by making sure a high priority task is > * being scheduled. The task touches the watchdog to > @@ -438,7 +424,7 @@ static int watchdog_enable(int cpu) > goto out; > } > kthread_bind(p, cpu); > - per_cpu(watchdog_touch_ts, cpu) = 0; > + __touch_watchdog(); > per_cpu(softlockup_watchdog, cpu) = p; > wake_up_process(p); > } ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed 2011-11-28 21:47 ` Don Zickus @ 2011-12-05 10:28 ` Anton Blanchard 2011-12-12 19:53 ` Don Zickus 0 siblings, 1 reply; 5+ messages in thread From: Anton Blanchard @ 2011-12-05 10:28 UTC (permalink / raw) To: Don Zickus Cc: Jeremy Fitzhardinge, Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel, jason.wessel Hi Don, > > There might be a reason for this two stage sync but I haven't been > > able to find it yet. Perhaps the unsynced versions of cpu_clock() > > and sched_clock_tick() are not safe to call from all contexts? > > According to commit 8c2238eaaf0f774ca0f8d9daad7a616429bbb7f1 that was > the case, cpu_clock wasn't NMI-safe. Now it is, thanks to Peter. Thanks, that makes sense now. > I have a couple of concerns about the patch. I am wondering about the > overhead of getting the timestamp more often now as opposed to just > setting a boolean for later. It makes sense to stamp it at the time > of the call, don't know what the cost is. I had a similar concern since we do execute this quite a lot. The overhead of cpu_clock is quite low on powerpc, but not sure about the other architectures. > I am also concern about how this affects suspend/resume and kgdb. I > cc'd Jason above for kgdb. I'll have to run some tests locally to > see what long periods of delay look like. Oh and virt guests too. > You don't have any test results from that setup do you? I haven't tested suspend resume, kgdb or virtual guests yet. Anton ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed 2011-12-05 10:28 ` Anton Blanchard @ 2011-12-12 19:53 ` Don Zickus 0 siblings, 0 replies; 5+ messages in thread From: Don Zickus @ 2011-12-12 19:53 UTC (permalink / raw) To: Anton Blanchard Cc: Jeremy Fitzhardinge, Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel, jason.wessel On Mon, Dec 05, 2011 at 09:28:22PM +1100, Anton Blanchard wrote: > > Hi Don, > > > > There might be a reason for this two stage sync but I haven't been > > > able to find it yet. Perhaps the unsynced versions of cpu_clock() > > > and sched_clock_tick() are not safe to call from all contexts? > > > > According to commit 8c2238eaaf0f774ca0f8d9daad7a616429bbb7f1 that was > > the case, cpu_clock wasn't NMI-safe. Now it is, thanks to Peter. > > Thanks, that makes sense now. > > > I have a couple of concerns about the patch. I am wondering about the > > overhead of getting the timestamp more often now as opposed to just > > setting a boolean for later. It makes sense to stamp it at the time > > of the call, don't know what the cost is. > > I had a similar concern since we do execute this quite a lot. The > overhead of cpu_clock is quite low on powerpc, but not sure about the > other architectures. It seems like half of the users of touch_softlockup_watchdog is a slow path (ie they are purposely spinning a long time). The cpu_clock overhead for those paths, we probably don't need to care about. The other half seems to deal with long idle/suspend/kgdb paths, which may not be that interesting in their own right, except for the fact they are called all the time for short delays and long delays. :-/ Perhaps I can move the touch_softlockup_watchdog() calls closer to the long path conditionals, minimize the calls a little bit. > > > I am also concern about how this affects suspend/resume and kgdb. I > > cc'd Jason above for kgdb. I'll have to run some tests locally to > > see what long periods of delay look like. Oh and virt guests too. > > You don't have any test results from that setup do you? > > I haven't tested suspend resume, kgdb or virtual guests yet. I'll try to setup a box and play with these paths to see what they look like. Cheers, Don ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2011-12-12 19:54 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-11-24 3:53 [PATCH 1/2] watchdog: Remove touch_all_softlockup_watchdogs Anton Blanchard 2011-11-24 3:54 ` [PATCH 2/2] watchdog: Softlockup has regular windows where it is not armed Anton Blanchard 2011-11-28 21:47 ` Don Zickus 2011-12-05 10:28 ` Anton Blanchard 2011-12-12 19:53 ` Don Zickus
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox