* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Andrew Morton @ 2008-02-06 0:46 UTC
To: Ingo Molnar; +Cc: Linux Kernel Mailing List

On Fri, 25 Jan 2008 22:59:17 GMT
Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:

> Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> Author:     Ingo Molnar <mingo@elte.hu>
> AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Fri Jan 25 21:08:02 2008 +0100
>
>     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks

One of my test boxes (an 8-way x86_64 software-development thing from Intel
- I'm not sure what's inside it) no longer powers itself off when I run
`halt -pfn'.

During bisection I found two different problems. Sometimes the machine
wouldn't power off at all. Other times it would power off after a pause of
around twenty seconds.

Bisection indicates that this commit is what caused the 20-second pause.
It could be that some later commit caused the infinity-second pause.

* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Peter Zijlstra @ 2008-02-06 14:50 UTC
To: Andrew Morton; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> On Fri, 25 Jan 2008 22:59:17 GMT
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
>
> > Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > Author:     Ingo Molnar <mingo@elte.hu>
> > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> >
> >     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
>
> One of my test boxes (an 8-way x86_64 software-development thing from Intel
> - I'm not sure what's inside it) no longer powers itself off when I run
> `halt -pfn'.
>
> During bisection I found two different problems. Sometimes the machine
> wouldn't power off at all. Other times it would power off after a pause of
> around twenty seconds.
>
> Bisection indicates that this commit is what caused the 20-second pause.
> It could be that some later commit caused the infinity-second pause.

Does that kernel have:

commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Sat Feb 2 00:23:08 2008 +0100

    debug: softlockup looping fix

* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Andrew Morton @ 2008-02-06 18:05 UTC
To: Peter Zijlstra; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Wed, 06 Feb 2008 15:50:02 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

>
> On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> > On Fri, 25 Jan 2008 22:59:17 GMT
> > Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
> >
> > > Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > > Author:     Ingo Molnar <mingo@elte.hu>
> > > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > > Committer:  Ingo Molnar <mingo@elte.hu>
> > > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> > >
> > >     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
> >
> > One of my test boxes (an 8-way x86_64 software-development thing from Intel
> > - I'm not sure what's inside it) no longer powers itself off when I run
> > `halt -pfn'.
> >
> > During bisection I found two different problems. Sometimes the machine
> > wouldn't power off at all. Other times it would power off after a pause of
> > around twenty seconds.
> >
> > Bisection indicates that this commit is what caused the 20-second pause.
> > It could be that some later commit caused the infinity-second pause.
>
>
> Does that kernel have:
>
> commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Sat Feb 2 00:23:08 2008 +0100
>
>     debug: softlockup looping fix
>

yup. It was fetched less than 24 hours ago.

* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Ingo Molnar @ 2008-02-07 0:04 UTC
To: Andrew Morton; +Cc: Peter Zijlstra, Linux Kernel Mailing List

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > Does that kernel have:
> >
> > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date:   Sat Feb 2 00:23:08 2008 +0100
> >
> >     debug: softlockup looping fix
>
> yup. It was fetched less than 24 hours ago.

does the patch below improve the situation?

	Ingo

---
 arch/x86/kernel/reboot.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Index: linux-x86.q/arch/x86/kernel/reboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/reboot.c
+++ linux-x86.q/arch/x86/kernel/reboot.c
@@ -396,8 +396,20 @@ void machine_shutdown(void)
 	if (!cpu_isset(reboot_cpu_id, cpu_online_map))
 		reboot_cpu_id = smp_processor_id();
 
-	/* Make certain I only run on the appropriate processor */
-	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	/*
+	 * Make certain we only run on the appropriate processor,
+	 * and with sufficient priority:
+	 */
+	{
+		struct sched_param schedparm;
+		schedparm.sched_priority = 99;
+		int ret;
+
+		ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+		WARN_ON_ONCE(1);
+
+		set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	}
 
 	/* O.K Now that I'm on the appropriate processor,
 	 * stop all of the others.

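For anyone who wants to poke at the priority half of the patch above from userspace, here is a minimal sketch. It is an analogue, not the kernel code: it calls the sched_setscheduler(2) system call on the current process rather than the in-kernel sched_setscheduler() helper, and the function name boost_to_rr() is made up for illustration. The request normally needs root or CAP_SYS_NICE, so an ordinary user should expect it to fail (typically with EPERM). The CPU-pinning half of the patch is not shown.

```c
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: ask for SCHED_RR at the given priority for the
 * calling process (pid 0 means "self"). The patch uses priority 99.
 * Returns 0 on success, or a negative errno value on failure.
 */
int boost_to_rr(int priority)
{
	struct sched_param param = { .sched_priority = priority };

	if (sched_setscheduler(0, SCHED_RR, &param) == -1)
		return -errno;
	return 0;
}
```

Checking the return value, as here, is the userspace counterpart of testing `ret` after the call in the patch.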
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Andrew Morton @ 2008-02-07 0:31 UTC
To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel

On Thu, 7 Feb 2008 01:04:25 +0100
Ingo Molnar <mingo@elte.hu> wrote:

>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > > Does that kernel have:
> > >
> > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Date:   Sat Feb 2 00:23:08 2008 +0100
> > >
> > >     debug: softlockup looping fix
> >
> > yup. It was fetched less than 24 hours ago.
>
> does the patch below improve the situation?
>

Nope.

But I tested it on mainline, and mainline exhibits the never-powers-off
symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
powers-off-after-20-seconds symptom.

So we _may_ be dealing with two bugs here, and your patch might have fixed
the first, but that success is obscured by the second. I guess I need to
prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
tip. (Wonders how to do that).

btw, mainline (plus this patch, not that it changed anything) prints

<stopping disk stuff>
Disabling non-boot CPUs
CPU 1 is now offline

and that's it. This machine has eight cpus. Might be a hint?

* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Andrew Morton @ 2008-02-07 0:47 UTC
To: mingo, a.p.zijlstra, linux-kernel

On Wed, 6 Feb 2008 16:31:11 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 7 Feb 2008 01:04:25 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
>
> >
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > > Does that kernel have:
> > > >
> > > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > > Date:   Sat Feb 2 00:23:08 2008 +0100
> > > >
> > > >     debug: softlockup looping fix
> > >
> > > yup. It was fetched less than 24 hours ago.
> >
> > does the patch below improve the situation?
> >
>
> Nope.
>
> But I tested it on mainline, and mainline exhibits the never-powers-off
> symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> powers-off-after-20-seconds symptom.
>
> So we _may_ be dealing with two bugs here, and your patch might have fixed
> the first, but that success is obscured by the second. I guess I need to
> prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
> tip. (Wonders how to do that).

OK, I did this (tested on a ed50d6cbc394cd0966469d3e249353c9dd1d38b9-tipped
tree) and again, the patch made no difference: the machine still pauses
20-odd seconds before (correctly) powering off.

* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Ingo Molnar @ 2008-02-07 0:51 UTC
To: Andrew Morton; +Cc: a.p.zijlstra, linux-kernel, Gautham R Shenoy

* Andrew Morton <akpm@linux-foundation.org> wrote:

> Nope.
>
> But I tested it on mainline, and mainline exhibits the
> never-powers-off symptom, whereas
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> powers-off-after-20-seconds symptom.
>
> So we _may_ be dealing with two bugs here, and your patch might have
> fixed the first, but that success is obscured by the second. I guess
> I need to prepare a tree which has
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip. (Wonders how to
> do that).

the way i do it in bisection is to do:

  mkdir patches
  git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
  echo fix.patch > patches/series

and then before testing a bisection point, i do a 'quilt push'. Before
telling git-bisect about the quality of that bisection point (good/bad)
i pop it off via 'quilt pop'.

this way the 'required fix' can be kept during the bisection, to find
the secondary bug.

> btw, mainline (plus this patch, not that it changed anything) prints
>
> <stopping disk stuff>
> Disabling non-boot CPUs
> CPU 1 is now offline
>
> and that's it. This machine has eight cpus. Might be a hint?

what should be the proper message?

my suspects, besides there being something wrong in the hung-tasks code
of the softlockup watchdog, would be the cpu-hotplug commits, or some
arch/x86 commit. (although we didn't really have anything specifically
touching the reboot path)

does a stupid patch like the one below tell you more about what the
other CPUs are doing during this hang? [32-bit only patch]

	Ingo

---
 arch/i386/kernel/nmi.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86/kernel/nmi_64.c
+++ linux/arch/x86/kernel/nmi_64.c
@@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
 	int touched = 0;
 	int cpu = smp_processor_id();
 	int rc=0;
+	static int count[NR_CPUS];
+
+	if (!count[cpu]) {
+		count[cpu] = nmi_hz;
+		printk("CPU#%d, tick\n", cpu);
+		show_regs(regs);
+	}
+	count[cpu]--;
 
 	/* check for other users first */
 	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)

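The debugging patch above throttles its output with a per-CPU countdown: a CPU reports when its counter is zero, reloads the counter, and then decrements it on every tick. The following is a userspace sketch of just that counting logic, so it can be stepped through outside the kernel. The function name should_report() and the `period` parameter are made up for illustration (`period` stands in for nmi_hz), NR_CPUS is fixed at 8 only to match the machine in the report, and the sketch assumes `period` is at least 1.

```c
#include <stdbool.h>

#define NR_CPUS 8	/* assumption: matches the eight-CPU box in the report */

/* Ticks left before each CPU reports again; starts at zero like the
 * static array in the patch, so every CPU reports on its first tick. */
static int count[NR_CPUS];

/*
 * Hypothetical helper: call once per tick on the given CPU.
 * Returns true on the ticks where the patch would print and dump
 * registers, i.e. once every 'period' ticks per CPU.
 */
bool should_report(int cpu, int period)
{
	bool report = false;

	if (!count[cpu]) {
		count[cpu] = period;
		report = true;
	}
	count[cpu]--;
	return report;
}
```

With a period of 3, a given CPU reports on its ticks 1, 4, 7 and so on, and each CPU keeps its own counter, so one CPU's reports do not delay another's.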
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
From: Andrew Morton @ 2008-02-07 1:12 UTC
To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel, ego

On Thu, 7 Feb 2008 01:51:10 +0100
Ingo Molnar <mingo@elte.hu> wrote:

>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > Nope.
> >
> > But I tested it on mainline, and mainline exhibits the
> > never-powers-off symptom, whereas
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> > powers-off-after-20-seconds symptom.
> >
> > So we _may_ be dealing with two bugs here, and your patch might have
> > fixed the first, but that success is obscured by the second. I guess
> > I need to prepare a tree which has
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip. (Wonders how to
> > do that).
>
> the way i do it in bisection is to do:
>
>   mkdir patches
>   git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
>   echo fix.patch > patches/series
>
> and then before testing a bisection point, i do a 'quilt push'. Before
> telling git-bisect about the quality of that bisection point (good/bad)
> i pop it off via 'quilt pop'.
>
> this way the 'required fix' can be kept during the bisection, to find
> the secondary bug.
>
> > btw, mainline (plus this patch, not that it changed anything) prints
> >
> > <stopping disk stuff>
> > Disabling non-boot CPUs
> > CPU 1 is now offline
> >
> > and that's it. This machine has eight cpus. Might be a hint?
>
> what should be the proper message?

Seems that it should be a stream of eight

CPU n is now offline
CPU n down

> my suspects, besides there being something wrong in the hung-tasks code
> of the softlockup watchdog, would be the cpu-hotplug commits, or some
> arch/x86 commit. (although we didn't really have anything specifically
> touching the reboot path)
>
> does a stupid patch like the one below tell you more about what the
> other CPUs are doing during this hang? [32-bit only patch]
>
> 	Ingo
>
> ---
>  arch/i386/kernel/nmi.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> Index: linux/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/nmi_64.c
> +++ linux/arch/x86/kernel/nmi_64.c
> @@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
>  	int touched = 0;
>  	int cpu = smp_processor_id();
>  	int rc=0;
> +	static int count[NR_CPUS];
> +
> +	if (!count[cpu]) {
> +		count[cpu] = nmi_hz;
> +		printk("CPU#%d, tick\n", cpu);
> +		show_regs(regs);
> +	}
> +	count[cpu]--;
>
>  	/* check for other users first */
>  	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)

I reworked that on top of ed50d6cbc394cd0966469d3e249353c9dd1d38b9: no
change.

However, I watched the vga console this time (nothing is coming over
netconsole at this stage) and saw this:

CPU 1 is now offline
<10 second pause>
CPU 1 is down
CPU 2 is now offline
CPU 2 is down
CPU 3 is now offline
CPU 3 is down
CPU 4 is now offline
<10 second pause>

followed by a quick spew of the remaining CPUs going down and offline,
then poweroff.