* 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() @ 2006-08-04 19:04 Michal Piotrowski 2006-08-04 19:24 ` Linus Torvalds 0 siblings, 1 reply; 25+ messages in thread From: Michal Piotrowski @ 2006-08-04 19:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: Dave Jones, LKML Hi, This was reported http://www.ussg.iu.edu/hypermail/linux/kernel/0607.3/1867.html http://www.ussg.iu.edu/hypermail/linux/kernel/0608.0/0675.html It's time to use git-bisect. BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() [<c0132a04>] unlock_cpu_hotplug+0x2c/0x54 [<c013a2ec>] stop_machine_run+0x2e/0x34 [<c0135686>] sys_init_module+0x15a0/0x178a [<c015b7b7>] do_sync_read+0xb6/0xf1 [<c0102d51>] sysenter_past_esp+0x56/0x79 Here is a config file http://www.stardust.webpages.pl/files/2.6-git/18-rc3/config Here is dmesg http://www.stardust.webpages.pl/files/2.6-git/18-rc3/dmesg Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/wiki/) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-04 19:04 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() Michal Piotrowski @ 2006-08-04 19:24 ` Linus Torvalds 2006-08-04 22:24 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Linus Torvalds @ 2006-08-04 19:24 UTC (permalink / raw) To: Michal Piotrowski; +Cc: Dave Jones, LKML On Fri, 4 Aug 2006, Michal Piotrowski wrote: > > This was reported > http://www.ussg.iu.edu/hypermail/linux/kernel/0607.3/1867.html > http://www.ussg.iu.edu/hypermail/linux/kernel/0608.0/0675.html > > It's time to use git-bisect. > > BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() Ok, that really is a pretty scary thing. The warning happens if the lock is released by somebody else than the person who took it. I don't _think_ that's supposed to happen, but with cpu hotplug locking being as "creative" as it is, maybe it's valid. I was planning on shutting the cpu hotplug warnings up for v2.6.18 - at least the "lukewarm IQ" one for _taking_ the lock recursively. But I had planned on leaving the unlock_cpu_hotplug() in place, since up until now it hadn't triggered, and it really smells like a bug if it does. > [<c0132a04>] unlock_cpu_hotplug+0x2c/0x54 > [<c013a2ec>] stop_machine_run+0x2e/0x34 > [<c0135686>] sys_init_module+0x15a0/0x178a > [<c015b7b7>] do_sync_read+0xb6/0xf1 > [<c0102d51>] sysenter_past_esp+0x56/0x79 DaveJ, any ideas? Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-04 19:24 ` Linus Torvalds @ 2006-08-04 22:24 ` Dave Jones 2006-08-05 0:31 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-04 22:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: Michal Piotrowski, Dave Jones, LKML On Fri, Aug 04, 2006 at 12:24:57PM -0700, Linus Torvalds wrote: > On Fri, 4 Aug 2006, Michal Piotrowski wrote: > > This was reported > > http://www.ussg.iu.edu/hypermail/linux/kernel/0607.3/1867.html > > http://www.ussg.iu.edu/hypermail/linux/kernel/0608.0/0675.html > > > > It's time to use git-bisect. > > > > BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() > > Ok, that really is a pretty scary thing. The warning happens if the lock > is released by somebody else than the person who took it. I don't _think_ > that's supposed to happen, but with cpu hotplug locking being as > "creative" as it is, maybe it's valid. > > I was planning on shutting the cpu hotplug warnings up for v2.6.18 - at > least the "lukewarm IQ" one for _taking_ the lock recursively. But I had > planned on leaving the unlock_cpu_hotplug() in place, since up until now > it hadn't triggered, and it really smells like a bug if it does. > > > [<c0132a04>] unlock_cpu_hotplug+0x2c/0x54 > > [<c013a2ec>] stop_machine_run+0x2e/0x34 > > [<c0135686>] sys_init_module+0x15a0/0x178a > > [<c015b7b7>] do_sync_read+0xb6/0xf1 > > [<c0102d51>] sysenter_past_esp+0x56/0x79 > > DaveJ, any ideas? Not really :-/ Whilst it's mentioned that the git-cpufreq.patch that was in -mm made this happen, I really don't see anything in what you merged yesterday that could explain any of this. 99% of the changes were to longhaul, so we can ignore those, leaving just: [CPUFREQ] return error when failing to set minfreq Looks benign. [CPUFREQ] Propagate acpi_processor_preregister_performance return value. Shouldn't affect anything wrt hotplug [CPUFREQ] [1/2] add __find_governor helper and clean up some error handling. [CPUFREQ] [2/2] demand load governor modules. Looks harmless to me, and those new functions don't show up in any of those functions. So I'm at a loss to explain why this made this bug suddenly appear, but I don't think this is anything new. Fixing it however... I'm looking at it, but I don't see anything obvious yet. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-04 22:24 ` Dave Jones @ 2006-08-05 0:31 ` Dave Jones 2006-08-05 2:10 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 0:31 UTC (permalink / raw) To: Linus Torvalds, Michal Piotrowski, Dave Jones, LKML On Fri, Aug 04, 2006 at 06:24:00PM -0400, Dave Jones wrote: > > DaveJ, any ideas? > > So I'm at a loss to explain why this made this bug suddenly appear, but > I don't think this is anything new. Fixing it however... > I'm looking at it, but I don't see anything obvious yet. Ok, debugging this is getting ridiculous. I dug out a core duo, built a kernel and watched it spontaneously reboot before it even got to init. Digging out a serial cable yielded this gem: Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) Memory: 1020104k/1047108k available (2082k kernel code, 26324k reserved, 1107k data, 240k init, 129604k highmem) Checking if this processor honours the WP bit even in supervisor mode... Ok. ============================================= [ INFO: possible recursive locking detected ] --------------------------------------------- swapper/0 is trying to acquire lock: (cpu_bitmask_lock){--..}, at: [<c0304e9c>] mutex_lock+0x21/0x24 but task is already holding lock: (cpu_bitmask_lock){--..}, at: [<c0304e9c>] mutex_lock+0x21/0x24 other info that might help us debug this: 1 lock held by swapper/0: #0: (cpu_bitmask_lock){--..}, at: [<c0304e9c>] mutex_lock+0x21/0x24 stack backtrace: [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013ae05>] __lock_acquire+0x74f/0x96d [<c013b566>] lock_acquire+0x4b/0x6d [<c0304d35>] __mutex_lock_slowpath+0xb0/0x1f6 [<c0304e9c>] mutex_lock+0x21/0x24 [<c013e8d4>] lock_cpu_hotplug+0x4a/0x4c [<c016bed3>] kmem_cache_create+0x61/0x577 [<c04af5a1>] kmem_cache_init+0x177/0x38c [<c049c5f3>] start_kernel+0x221/0x3b3 [<c0100210>] 0xc0100210 DWARF2 unwinder stuck at 0xc0100210 Leftover inexact backtrace: ======================= ======================= ======================= bad: scheduling from the idle thread! [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c03035a6>] schedule+0xa6/0x9d3 [<c0304d95>] __mutex_lock_slowpath+0x110/0x1f6 [<c0304e9c>] mutex_lock+0x21/0x24 [<c013e8d4>] lock_cpu_hotplug+0x4a/0x4c [<c016bed3>] kmem_cache_create+0x61/0x577 [<c04af5a1>] kmem_cache_init+0x177/0x38c [<c049c5f3>] start_kernel+0x221/0x3b3 [<c0100210>] 0xc0100210 DWARF2 unwinder stuck at 0xc0100210 Leftover inexact backtrace: ======================= ======================= ======================= <Insert reboot here> Waaaay before cpufreq even enters the picture. Sigh. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 0:31 ` Dave Jones @ 2006-08-05 2:10 ` Dave Jones 2006-08-05 2:23 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 2:10 UTC (permalink / raw) To: Linus Torvalds, Michal Piotrowski, Dave Jones, LKML On Fri, Aug 04, 2006 at 08:31:42PM -0400, Dave Jones wrote: > On Fri, Aug 04, 2006 at 06:24:00PM -0400, Dave Jones wrote: > > > DaveJ, any ideas? > > > > So I'm at a loss to explain why this made this bug suddenly appear, but > > I don't think this is anything new. Fixing it however... > > I'm looking at it, but I don't see anything obvious yet. > > Ok, debugging this is getting ridiculous. I dug out a core duo, > built a kernel and watched it spontaneously reboot before it even got to init. Hmm, this went away after a ccache flush & a rebuild. Old crud left behind from earlier debugging I guess. So I added this diff.. diff --git a/kernel/cpu.c b/kernel/cpu.c index f230f9a..31210a0 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -31,6 +31,14 @@ void lock_cpu_hotplug(void) { struct task_struct *tsk = current; + if (jiffies > HZ * 60) { + printk ("CPU%d called lock_cpu_hotplug() for app %s. recursive_depth=%d\n", + smp_processor_id(), + current->comm, + recursive_depth); + dump_stack(); + } + if (tsk == recursive) { static int warnings = 10; if (warnings) { @@ -42,18 +50,30 @@ void lock_cpu_hotplug(void) return; } mutex_lock(&cpu_bitmask_lock); + if (jiffies > HZ * 60) + printk ("%s acquired cpu_bitmask_lock\n", current->comm); recursive = tsk; } EXPORT_SYMBOL_GPL(lock_cpu_hotplug); void unlock_cpu_hotplug(void) { + if (jiffies > HZ * 120) { + printk ("CPU%d called unlock_cpu_hotplug() for app %s. recursive_depth=%d\n", + smp_processor_id(), + current->comm, + recursive_depth); + dump_stack(); + } + WARN_ON(recursive != current); if (recursive_depth) { recursive_depth--; return; } mutex_unlock(&cpu_bitmask_lock); + if (jiffies > HZ * 120) + printk ("%s released cpu_bitmask_lock\n", current->comm); recursive = NULL; } EXPORT_SYMBOL_GPL(unlock_cpu_hotplug); In the hope to better understand what the hell is going on. Here's what happened when I started the cpuspeed service (That DWARF unwinder stuck thing is really darned annoying btw) CPU1 called lock_cpu_hotplug() for app modprobe. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c01499c8>] stop_machine_run+0x11/0x38 [<c0141943>] sys_init_module+0x166d/0x1847 [<c0103db9>] sysenter_past_esp+0x56/0x8d DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d Leftover inexact backtrace: [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c01499c8>] stop_machine_run+0x11/0x38 [<c0141943>] sys_init_module+0x166d/0x1847 [<c0103db9>] sysenter_past_esp+0x56/0x8d modprobe acquired cpu_bitmask_lock CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c029fbae>] store_scaling_governor+0x142/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d Leftover inexact backtrace: [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c029fbae>] store_scaling_governor+0x142/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d cpuspeed acquired cpu_bitmask_lock I have questions. 1. How come the printk at the end of unlock_cpu_hotplug() never got hit when modprobe got to the end of stop_machine_run() ? 2. How come another process acquired this mutex when the previous one aparently never released it ? Moving on ... CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c0132f3c>] __create_workqueue+0x52/0x122 [<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d Leftover inexact backtrace: [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c0132f3c>] __create_workqueue+0x52/0x122 [<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d Lukewarm IQ detected in hotplug locking BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8fc>] lock_cpu_hotplug+0x72/0xbf [<c0132f3c>] __create_workqueue+0x52/0x122 [<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d Leftover inexact backtrace: [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8fc>] lock_cpu_hotplug+0x72/0xbf [<c0132f3c>] __create_workqueue+0x52/0x122 [<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d Ok, now that it's done insulting my IQ, recursive_depth should be 1. However ... CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c029fbae>] store_scaling_governor+0x142/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d WTF is going on here ? In the whole log, recursive_depth is always 0. At this stage, it gets into a death spiral printing out the same crap over and over again. I put the full log at http://people.redhat.com/davej/hotplug-wtf.txt Note how the acquire/releases don't match up at all. Scary. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 2:10 ` Dave Jones @ 2006-08-05 2:23 ` Dave Jones 2006-08-05 2:49 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 2:23 UTC (permalink / raw) To: Linus Torvalds, Michal Piotrowski, LKML On Fri, Aug 04, 2006 at 10:10:51PM -0400, Dave Jones wrote: > So I added this diff.. > > diff --git a/kernel/cpu.c b/kernel/cpu.c > index f230f9a..31210a0 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -31,6 +31,14 @@ void lock_cpu_hotplug(void) > { > struct task_struct *tsk = current; > > + if (jiffies > HZ * 60) { .... > void unlock_cpu_hotplug(void) > { > + if (jiffies > HZ * 120) { ^^^ Duh. Everything becomes clearer the moment you post a diff to lkml. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 2:23 ` Dave Jones @ 2006-08-05 2:49 ` Dave Jones 2006-08-05 6:47 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 2:49 UTC (permalink / raw) To: Linus Torvalds, Michal Piotrowski, LKML On Fri, Aug 04, 2006 at 10:23:56PM -0400, Dave Jones wrote: > Duh. Everything becomes clearer the moment you post a diff to lkml. Right, with that silly thinko out of the way, things make _slightly_ more sense, but I'm still puzzled. Here's the trace (with the DWARF noise stripped out). CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c029fbae>] store_scaling_governor+0x142/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d cpuspeed acquired cpu_bitmask_lock CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf [<c0132f3c>] __create_workqueue+0x52/0x122 [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d Lukewarm IQ detected in hotplug locking BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() CPU1 called unlock_cpu_hotplug() for app cpuspeed. recursive_depth=1 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e980>] unlock_cpu_hotplug+0x37/0xb7 [<c0132fea>] __create_workqueue+0x100/0x122 [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] [<c029f7b6>] __cpufreq_governor+0x57/0xd8 [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc [<c029fbc5>] store_scaling_governor+0x159/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d CPU1 called unlock_cpu_hotplug() for app cpuspeed. recursive_depth=0 [<c0104edc>] show_trace_log_lvl+0x58/0x152 [<c01054c2>] show_trace+0xd/0x10 [<c01055db>] dump_stack+0x19/0x1b [<c013e980>] unlock_cpu_hotplug+0x37/0xb7 [<c029fbe5>] store_scaling_governor+0x179/0x1a3 [<c029f1a5>] store+0x37/0x48 [<c01a6561>] sysfs_write_file+0xab/0xd1 [<c016f99f>] vfs_write+0xab/0x157 [<c016ffe4>] sys_write+0x3b/0x60 [<c0103db9>] sysenter_past_esp+0x56/0x8d So in these traces we're seeing lock lock unlock unlock But what I really don't understand is the ordering here. Immediately after that first trace we should see an unlock. store_scaling_governor does a lock/unlock pair, with no chance of returning with the hotplug lock still held. In the second trace however, cpuspeed is off doing something completely different. How can this happen? My head hurts. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 2:49 ` Dave Jones @ 2006-08-05 6:47 ` Dave Jones 2006-08-05 7:46 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Dave Jones @ 2006-08-05 6:47 UTC (permalink / raw) To: Linus Torvalds, Michal Piotrowski, LKML On Fri, Aug 04, 2006 at 10:49:47PM -0400, Dave Jones wrote: This trace now makes a lot more sense to me. > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > [<c01054c2>] show_trace+0xd/0x10 > [<c01055db>] dump_stack+0x19/0x1b > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > [<c029fbae>] store_scaling_governor+0x142/0x1a3 > [<c029f1a5>] store+0x37/0x48 > [<c01a6561>] sysfs_write_file+0xab/0xd1 > [<c016f99f>] vfs_write+0xab/0x157 > [<c016ffe4>] sys_write+0x3b/0x60 > [<c0103db9>] sysenter_past_esp+0x56/0x8d > cpuspeed acquired cpu_bitmask_lock > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > [<c01054c2>] show_trace+0xd/0x10 > [<c01055db>] dump_stack+0x19/0x1b > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > [<c0132f3c>] __create_workqueue+0x52/0x122 > [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] > [<c029f7b6>] __cpufreq_governor+0x57/0xd8 > [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc > [<c029fbc5>] store_scaling_governor+0x159/0x1a3 > [<c029f1a5>] store+0x37/0x48 > [<c01a6561>] sysfs_write_file+0xab/0xd1 > [<c016f99f>] vfs_write+0xab/0x157 > [<c016ffe4>] sys_write+0x3b/0x60 > [<c0103db9>] sysenter_past_esp+0x56/0x8d > Lukewarm IQ detected in hotplug locking > BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() So when we write to sysfs to set the governor, we end up in store_scaling_governor() which takes the hotplug lock, and then calls off into the governor to let it do its thing. Part of ondemand's "thing" is to create a workqueue. unfortunatly, __create_workqueue also takes the hotplug lock. Creating a variant of __create_workqueue that doesn't take the lock seems really nasty. We could remove the locking from store_scaling_governor() and make the governors themselves have to do the locking, but I'm not sure that's entirely safe. We could do something really disgusting like ... unlock_cpu_hotplug() ... create_workqueue() ... lock_cpu_hotplug() in ondemand, which opens up a tiny race window, but as ugly as it is, looks to be the best solution of the bunch right now. Comments? The really sad part is this is completely unrelated to the original bug reported in this thread, which shows just how widespread this braindamage is. Michal's traces really don't really scream anything obvious to me. (Though given it took me 4 hours to decode my own traces above, this is no real sign of how big a problem this might be). Michal, could you apply this diff.. http://lkml.org/lkml/diff/2006/8/4/381/1 (change the '120' to '60' first), and send me the debug spew that you get ? You'll have to wait until a minute of uptime has passed. Oh, and edit include/linux/jiffies.h to change INITIAL_JIFFIES to '0'. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 6:47 ` Dave Jones @ 2006-08-05 7:46 ` Andrew Morton 2006-08-05 19:47 ` Dave Jones 2006-08-06 1:12 ` Andrew Morton 2006-08-05 10:54 ` Michal Piotrowski 2006-08-06 16:59 ` Michal Piotrowski 2 siblings, 2 replies; 25+ messages in thread From: Andrew Morton @ 2006-08-05 7:46 UTC (permalink / raw) To: Dave Jones; +Cc: torvalds, michal.k.k.piotrowski, linux-kernel On Sat, 5 Aug 2006 02:47:28 -0400 Dave Jones <davej@redhat.com> wrote: > Creating a variant of __create_workqueue that doesn't take the lock > seems really nasty. Honestly, the default fix for lock_cpu_hotplug() problems should be to delete the lock_cpu_hotplug() calls and to implement local locking. Here's a not-runtime-tested workqueue.c example. From: Andrew Morton <akpm@osdl.org> Use a private lock instead. It protects all per-cpu data structures in workqueue.c, including the workqueues list. Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the per-cpu resources. Unfixed long-standing bug: if someone unplugs the CPU identified by `singlethread_cpu' the kernel will get very sick. Signed-off-by: Andrew Morton <akpm@osdl.org> --- kernel/workqueue.c | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff -puN kernel/workqueue.c~a kernel/workqueue.c --- a/kernel/workqueue.c~a +++ a/kernel/workqueue.c @@ -68,7 +68,7 @@ struct workqueue_struct { /* All the per-cpu workqueues on the system, for hotplug cpu to add/remove threads to each one as cpus come/go. */ -static DEFINE_SPINLOCK(workqueue_lock); +static DEFINE_MUTEX(workqueue_mutex); static LIST_HEAD(workqueues); static int singlethread_cpu; @@ -320,10 +320,10 @@ void fastcall flush_workqueue(struct wor } else { int cpu; - lock_cpu_hotplug(); + mutex_lock(&workqueue_mutex); for_each_online_cpu(cpu) flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu)); - unlock_cpu_hotplug(); + mutex_unlock(&workqueue_mutex); } } EXPORT_SYMBOL_GPL(flush_workqueue); @@ -371,8 +371,7 @@ struct workqueue_struct *__create_workqu } wq->name = name; - /* We don't need the distraction of CPUs appearing and vanishing. */ - lock_cpu_hotplug(); + mutex_lock(&workqueue_mutex); if (singlethread) { INIT_LIST_HEAD(&wq->list); p = create_workqueue_thread(wq, singlethread_cpu); @@ -381,9 +380,7 @@ struct workqueue_struct *__create_workqu else wake_up_process(p); } else { - spin_lock(&workqueue_lock); list_add(&wq->list, &workqueues); - spin_unlock(&workqueue_lock); for_each_online_cpu(cpu) { p = create_workqueue_thread(wq, cpu); if (p) { @@ -393,7 +390,7 @@ struct workqueue_struct *__create_workqu destroy = 1; } } - unlock_cpu_hotplug(); + mutex_unlock(&workqueue_mutex); /* * Was there any error during startup? If yes then clean up: @@ -434,17 +431,15 @@ void destroy_workqueue(struct workqueue_ flush_workqueue(wq); /* We don't need the distraction of CPUs appearing and vanishing. */ - lock_cpu_hotplug(); + mutex_lock(&workqueue_mutex); if (is_single_threaded(wq)) cleanup_workqueue_thread(wq, singlethread_cpu); else { for_each_online_cpu(cpu) cleanup_workqueue_thread(wq, cpu); - spin_lock(&workqueue_lock); list_del(&wq->list); - spin_unlock(&workqueue_lock); } - unlock_cpu_hotplug(); + mutex_unlock(&workqueue_mutex); free_percpu(wq->cpu_wq); kfree(wq); } @@ -515,11 +510,13 @@ int schedule_on_each_cpu(void (*func)(vo if (!works) return -ENOMEM; + mutex_lock(&workqueue_mutex); for_each_online_cpu(cpu) { INIT_WORK(per_cpu_ptr(works, cpu), func, info); __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), per_cpu_ptr(works, cpu)); } + mutex_unlock(&workqueue_mutex); flush_workqueue(keventd_wq); free_percpu(works); return 0; @@ -635,6 +632,7 @@ static int __devinit workqueue_cpu_callb switch (action) { case CPU_UP_PREPARE: + mutex_lock(&workqueue_mutex); /* Create a new workqueue thread for it. */ list_for_each_entry(wq, &workqueues, list) { if (!create_workqueue_thread(wq, hotcpu)) { @@ -653,6 +651,7 @@ static int __devinit workqueue_cpu_callb kthread_bind(cwq->thread, hotcpu); wake_up_process(cwq->thread); } + mutex_unlock(&workqueue_mutex); break; case CPU_UP_CANCELED: @@ -664,6 +663,15 @@ static int __devinit workqueue_cpu_callb any_online_cpu(cpu_online_map)); cleanup_workqueue_thread(wq, hotcpu); } + mutex_unlock(&workqueue_mutex); + break; + + case CPU_DOWN_PREPARE: + mutex_lock(&workqueue_mutex); + break; + + case CPU_DOWN_FAILED: + mutex_unlock(&workqueue_mutex); break; case CPU_DEAD: @@ -671,6 +679,7 @@ static int __devinit workqueue_cpu_callb cleanup_workqueue_thread(wq, hotcpu); list_for_each_entry(wq, &workqueues, list) take_over_work(wq, hotcpu); + mutex_unlock(&workqueue_mutex); break; } _ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 7:46 ` Andrew Morton @ 2006-08-05 19:47 ` Dave Jones 2006-08-05 20:02 ` Andrew Morton 2006-08-06 1:12 ` Andrew Morton 1 sibling, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 19:47 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, michal.k.k.piotrowski, linux-kernel On Sat, Aug 05, 2006 at 12:46:00AM -0700, Andrew Morton wrote: > @@ -320,10 +320,10 @@ void fastcall flush_workqueue(struct wor > } else { > int cpu; > > - lock_cpu_hotplug(); > + mutex_lock(&workqueue_mutex); > for_each_online_cpu(cpu) > flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu)); > - unlock_cpu_hotplug(); > + mutex_unlock(&workqueue_mutex); > } > } Is this enough though? I mean, what stops the hotplug cpu code from modifying cpu_online_map underneath us here, making for_each_online_cpu do the wrong thing ? Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 19:47 ` Dave Jones @ 2006-08-05 20:02 ` Andrew Morton 0 siblings, 0 replies; 25+ messages in thread From: Andrew Morton @ 2006-08-05 20:02 UTC (permalink / raw) To: Dave Jones; +Cc: torvalds, michal.k.k.piotrowski, linux-kernel On Sat, 5 Aug 2006 15:47:30 -0400 Dave Jones <davej@redhat.com> wrote: > On Sat, Aug 05, 2006 at 12:46:00AM -0700, Andrew Morton wrote: > > @@ -320,10 +320,10 @@ void fastcall flush_workqueue(struct wor > > } else { > > int cpu; > > > > - lock_cpu_hotplug(); > > + mutex_lock(&workqueue_mutex); > > for_each_online_cpu(cpu) > > flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu)); > > - unlock_cpu_hotplug(); > > + mutex_unlock(&workqueue_mutex); > > } > > } > > Is this enough though? I mean, what stops the hotplug cpu code > from modifying cpu_online_map underneath us here, making for_each_online_cpu > do the wrong thing ? > The patch takes that mutex in workqueue_cpu_callback() in such a way as to prevent that. What it _effectively_ does is to change cpu_up() thusly: --- a/kernel/cpu.c~x +++ a/kernel/cpu.c @@ -197,6 +197,7 @@ int __devinit cpu_up(unsigned int cpu) } ret = blocking_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu); + mutex_lock(&workqueue_mutex); if (ret == NOTIFY_BAD) { printk("%s: attempt to bring up CPU %u failed\n", __FUNCTION__, cpu); @@ -213,12 +214,15 @@ int __devinit cpu_up(unsigned int cpu) BUG_ON(!cpu_online(cpu)); /* Now call notifier in preparation. */ + mutex_unlock(&workqueue_mutex); blocking_notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu); out_notify: - if (ret != 0) + if (ret != 0) { blocking_notifier_call_chain(&cpu_chain, CPU_UP_CANCELED, hcpu); + mutex_unlock(&workqueue_mutex); + } out: mutex_unlock(&cpu_add_remove_lock); return ret; _ and similarly in cpu_down(). So the workqueue code itself is protecting its per-cpu data from the hot plug/unplug code across its critical regions. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 7:46 ` Andrew Morton 2006-08-05 19:47 ` Dave Jones @ 2006-08-06 1:12 ` Andrew Morton 1 sibling, 0 replies; 25+ messages in thread From: Andrew Morton @ 2006-08-06 1:12 UTC (permalink / raw) To: davej, torvalds, michal.k.k.piotrowski, linux-kernel On Sat, 5 Aug 2006 00:46:00 -0700 Andrew Morton <akpm@osdl.org> wrote: > On Sat, 5 Aug 2006 02:47:28 -0400 > Dave Jones <davej@redhat.com> wrote: > > > Creating a variant of __create_workqueue that doesn't take the lock > > seems really nasty. > > Honestly, the default fix for lock_cpu_hotplug() problems should be to > delete the lock_cpu_hotplug() calls and to implement local locking. > > Here's a not-runtime-tested workqueue.c example. It's now slightly-runtime-tested. I'll include it in rc3-mm1. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 6:47 ` Dave Jones 2006-08-05 7:46 ` Andrew Morton @ 2006-08-05 10:54 ` Michal Piotrowski 2006-08-05 11:11 ` Michal Piotrowski 2006-08-06 16:59 ` Michal Piotrowski 2 siblings, 1 reply; 25+ messages in thread From: Michal Piotrowski @ 2006-08-05 10:54 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Michal Piotrowski, LKML On 05/08/06, Dave Jones <davej@redhat.com> wrote: > On Fri, Aug 04, 2006 at 10:49:47PM -0400, Dave Jones wrote: > > This trace now makes a lot more sense to me. > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > [<c01054c2>] show_trace+0xd/0x10 > > [<c01055db>] dump_stack+0x19/0x1b > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > [<c029fbae>] store_scaling_governor+0x142/0x1a3 > > [<c029f1a5>] store+0x37/0x48 > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > [<c016f99f>] vfs_write+0xab/0x157 > > [<c016ffe4>] sys_write+0x3b/0x60 > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > cpuspeed acquired cpu_bitmask_lock > > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > [<c01054c2>] show_trace+0xd/0x10 > > [<c01055db>] dump_stack+0x19/0x1b > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > [<c0132f3c>] __create_workqueue+0x52/0x122 > > [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] > > [<c029f7b6>] __cpufreq_governor+0x57/0xd8 > > [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc > > [<c029fbc5>] store_scaling_governor+0x159/0x1a3 > > [<c029f1a5>] store+0x37/0x48 > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > [<c016f99f>] vfs_write+0xab/0x157 > > [<c016ffe4>] sys_write+0x3b/0x60 > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > Lukewarm IQ detected in hotplug locking > > BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() > > So when we write to sysfs to set the governor, we end up in store_scaling_governor() > which takes the hotplug lock, and then calls off into the governor to let it > do its thing. Part of ondemand's "thing" is to create a workqueue. > unfortunatly, __create_workqueue also takes the hotplug lock. > > Creating a variant of __create_workqueue that doesn't take the lock > seems really nasty. > > We could remove the locking from store_scaling_governor() and make the governors > themselves have to do the locking, but I'm not sure that's entirely safe. > > We could do something really disgusting like ... > > unlock_cpu_hotplug() > ... > create_workqueue() > ... > lock_cpu_hotplug() > > in ondemand, which opens up a tiny race window, but as ugly as it is, > looks to be the best solution of the bunch right now. > > Comments? > > The really sad part is this is completely unrelated to the original bug reported > in this thread, which shows just how widespread this braindamage is. > Michal's traces really don't really scream anything obvious to me. > (Though given it took me 4 hours to decode my own traces above, this is no > real sign of how big a problem this might be). > > Michal, could you apply this diff.. http://lkml.org/lkml/diff/2006/8/4/381/1 > (change the '120' to '60' first), and send me the debug spew that you get ? > You'll have to wait until a minute of uptime has passed. Oh, and edit > include/linux/jiffies.h to change INITIAL_JIFFIES to '0'. p4-clockmod: P4/Xeon(TM) CPU On-Demand Clock Modulation available ip_tables: (C) 2000-2006 Netfilter Core Team Netfilter messages via NETLINK v0.30. BUG: warning at /usr/src/linux-git/kernel/cpu.c:69/unlock_cpu_hotplug() [<c0132a91>] unlock_cpu_hotplug+0x63/0xb2 [<c013a3a0>] stop_machine_run+0x2e/0x34 [<c013573a>] sys_init_module+0x15a0/0x178a [<c015b86b>] do_sync_read+0xb6/0xf1 [<c0102d51>] sysenter_past_esp+0x56/0x79 ip_conntrack version 2.4 (8192 buckets, 65536 max) - 224 bytes per conntrack NET: Registered protocol family 17 skge eth0: enabling interface skge eth0: Link is up at 100 Mbps, full duplex, flow control tx and rx w83627hf 9191-0290: Reading VID from GPIO5 NET: Registered protocol family 10 IPv6 over IPv4 tunneling driver audit(1154774854.770:5): avc: denied { write } for pid=1360 comm="cpuspeed" name="cpufreq" dev=sysfs ino=4863 scontext=system_u:system_r:cpuspeed_t:s0 tcontext=system_u:object_r:sysfs_t:s0 tclass=dir eth0: no IPv6 routers present CPU0 called lock_cpu_hotplug() for app amarokapp. recursive_depth=0 [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 [<c01182ce>] sched_getaffinity+0xf/0x83 [<c0118361>] sys_sched_getaffinity+0x1f/0x41 [<c0102d51>] sysenter_past_esp+0x56/0x79 amarokapp acquired cpu_bitmask_lock CPU0 called unlock_cpu_hotplug() for app amarokapp. recursive_depth=0 [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 [<c0118323>] sched_getaffinity+0x64/0x83 [<c0118361>] sys_sched_getaffinity+0x1f/0x41 [<c0102d51>] sysenter_past_esp+0x56/0x79 amarokapp released cpu_bitmask_lock CPU0 called lock_cpu_hotplug() for app amarokapp. recursive_depth=0 [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 [<c01195ca>] sched_setaffinity+0xf/0xd5 [<c01196cb>] sys_sched_setaffinity+0x3b/0x41 [<c0102d51>] sysenter_past_esp+0x56/0x79 amarokapp acquired cpu_bitmask_lock CPU0 called unlock_cpu_hotplug() for app amarokapp. recursive_depth=0 [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 [<c0119689>] sched_setaffinity+0xce/0xd5 [<c01196cb>] sys_sched_setaffinity+0x3b/0x41 [<c0102d51>] sysenter_past_esp+0x56/0x79 amarokapp released cpu_bitmask_lock > > Dave > > -- > http://www.codemonkey.org.uk > Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/wiki/) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 10:54 ` Michal Piotrowski @ 2006-08-05 11:11 ` Michal Piotrowski 2006-08-05 11:26 ` Michal Piotrowski 0 siblings, 1 reply; 25+ messages in thread From: Michal Piotrowski @ 2006-08-05 11:11 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Michal Piotrowski, LKML On 05/08/06, Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote: > On 05/08/06, Dave Jones <davej@redhat.com> wrote: > > On Fri, Aug 04, 2006 at 10:49:47PM -0400, Dave Jones wrote: > > > > This trace now makes a lot more sense to me. > > > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > > [<c01054c2>] show_trace+0xd/0x10 > > > [<c01055db>] dump_stack+0x19/0x1b > > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > > [<c029fbae>] store_scaling_governor+0x142/0x1a3 > > > [<c029f1a5>] store+0x37/0x48 > > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > > [<c016f99f>] vfs_write+0xab/0x157 > > > [<c016ffe4>] sys_write+0x3b/0x60 > > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > > cpuspeed acquired cpu_bitmask_lock > > > > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > > [<c01054c2>] show_trace+0xd/0x10 > > > [<c01055db>] dump_stack+0x19/0x1b > > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > > [<c0132f3c>] __create_workqueue+0x52/0x122 > > > [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] > > > [<c029f7b6>] __cpufreq_governor+0x57/0xd8 > > > [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc > > > [<c029fbc5>] store_scaling_governor+0x159/0x1a3 > > > [<c029f1a5>] store+0x37/0x48 > > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > > [<c016f99f>] vfs_write+0xab/0x157 > > > [<c016ffe4>] sys_write+0x3b/0x60 > > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > > Lukewarm IQ detected in hotplug locking > > > BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() > > > > So when we write to sysfs to set the governor, we end up in store_scaling_governor() > > which takes the hotplug lock, and then calls off into the governor to let it > > do its thing. Part of ondemand's "thing" is to create a workqueue. > > unfortunatly, __create_workqueue also takes the hotplug lock. > > > > Creating a variant of __create_workqueue that doesn't take the lock > > seems really nasty. > > > > We could remove the locking from store_scaling_governor() and make the governors > > themselves have to do the locking, but I'm not sure that's entirely safe. > > > > We could do something really disgusting like ... > > > > unlock_cpu_hotplug() > > ... > > create_workqueue() > > ... > > lock_cpu_hotplug() > > > > in ondemand, which opens up a tiny race window, but as ugly as it is, > > looks to be the best solution of the bunch right now. > > > > Comments? > > > > The really sad part is this is completely unrelated to the original bug reported > > in this thread, which shows just how widespread this braindamage is. > > Michal's traces really don't really scream anything obvious to me. > > (Though given it took me 4 hours to decode my own traces above, this is no > > real sign of how big a problem this might be). > > > > Michal, could you apply this diff.. http://lkml.org/lkml/diff/2006/8/4/381/1 > > (change the '120' to '60' first), and send me the debug spew that you get ? > > You'll have to wait until a minute of uptime has passed. Oh, and edit > > include/linux/jiffies.h to change INITIAL_JIFFIES to '0'. > > p4-clockmod: P4/Xeon(TM) CPU On-Demand Clock Modulation available > ip_tables: (C) 2000-2006 Netfilter Core Team > Netfilter messages via NETLINK v0.30. > BUG: warning at /usr/src/linux-git/kernel/cpu.c:69/unlock_cpu_hotplug() > [<c0132a91>] unlock_cpu_hotplug+0x63/0xb2 > [<c013a3a0>] stop_machine_run+0x2e/0x34 > [<c013573a>] sys_init_module+0x15a0/0x178a > [<c015b86b>] do_sync_read+0xb6/0xf1 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > ip_conntrack version 2.4 (8192 buckets, 65536 max) - 224 bytes per conntrack > NET: Registered protocol family 17 > skge eth0: enabling interface > skge eth0: Link is up at 100 Mbps, full duplex, flow control tx and rx > w83627hf 9191-0290: Reading VID from GPIO5 > NET: Registered protocol family 10 > IPv6 over IPv4 tunneling driver > audit(1154774854.770:5): avc: denied { write } for pid=1360 > comm="cpuspeed" name="cpufreq" dev=sysfs ino=4863 > scontext=system_u:system_r:cpuspeed_t:s0 > tcontext=system_u:object_r:sysfs_t:s0 tclass=dir > eth0: no IPv6 routers present > CPU0 called lock_cpu_hotplug() for app amarokapp. recursive_depth=0 > [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 > [<c01182ce>] sched_getaffinity+0xf/0x83 > [<c0118361>] sys_sched_getaffinity+0x1f/0x41 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > amarokapp acquired cpu_bitmask_lock > CPU0 called unlock_cpu_hotplug() for app amarokapp. recursive_depth=0 > [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 > [<c0118323>] sched_getaffinity+0x64/0x83 > [<c0118361>] sys_sched_getaffinity+0x1f/0x41 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > amarokapp released cpu_bitmask_lock > CPU0 called lock_cpu_hotplug() for app amarokapp. recursive_depth=0 > [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 > [<c01195ca>] sched_setaffinity+0xf/0xd5 > [<c01196cb>] sys_sched_setaffinity+0x3b/0x41 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > amarokapp acquired cpu_bitmask_lock > CPU0 called unlock_cpu_hotplug() for app amarokapp. recursive_depth=0 > [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 > [<c0119689>] sched_setaffinity+0xce/0xd5 > [<c01196cb>] sys_sched_setaffinity+0x3b/0x41 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > amarokapp released cpu_bitmask_lock audit(1154775731.471:6): avc: denied { write } for pid=1361 comm="cpuspeed" name="cpufreq" dev=sysfs ino=4863 scontext=system_u:system_r:cpuspeed_t:s0 tcontext=system_u:object_r:sysfs_t:s0 tclass=dir CPU0 called lock_cpu_hotplug() for app konsole. recursive_depth=0 [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 [<c012aadf>] flush_workqueue+0x25/0x57 [<c020e1b7>] release_dev+0x485/0x5f2 [<c016f7f8>] d_lookup+0x1b/0x3b [<c012d8d6>] remove_wait_queue+0xc/0x34 [<c020e333>] tty_release+0xf/0x18 [<c015c943>] __fput+0x90/0x15e [<c015a361>] filp_close+0x4e/0x54 [<c0102d51>] sysenter_past_esp+0x56/0x79 konsole acquired cpu_bitmask_lock CPU0 called unlock_cpu_hotplug() for app konsole. recursive_depth=0 [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 [<c020e1b7>] release_dev+0x485/0x5f2 [<c016f7f8>] d_lookup+0x1b/0x3b [<c012d8d6>] remove_wait_queue+0xc/0x34 [<c020e333>] tty_release+0xf/0x18 [<c015c943>] __fput+0x90/0x15e [<c015a361>] filp_close+0x4e/0x54 [<c0102d51>] sysenter_past_esp+0x56/0x79 konsole released cpu_bitmask_lock Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/wiki/) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 11:11 ` Michal Piotrowski @ 2006-08-05 11:26 ` Michal Piotrowski 2006-08-05 18:47 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Michal Piotrowski @ 2006-08-05 11:26 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Michal Piotrowski, LKML On 05/08/06, Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote: > audit(1154775731.471:6): avc: denied { write } for pid=1361 > comm="cpuspeed" name="cpufreq" dev=sysfs ino=4863 > scontext=system_u:system_r:cpuspeed_t:s0 > tcontext=system_u:object_r:sysfs_t:s0 tclass=dir > CPU0 called lock_cpu_hotplug() for app konsole. recursive_depth=0 > [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 > [<c012aadf>] flush_workqueue+0x25/0x57 > [<c020e1b7>] release_dev+0x485/0x5f2 > [<c016f7f8>] d_lookup+0x1b/0x3b > [<c012d8d6>] remove_wait_queue+0xc/0x34 > [<c020e333>] tty_release+0xf/0x18 > [<c015c943>] __fput+0x90/0x15e > [<c015a361>] filp_close+0x4e/0x54 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > konsole acquired cpu_bitmask_lock > CPU0 called unlock_cpu_hotplug() for app konsole. recursive_depth=0 > [<c0132a62>] unlock_cpu_hotplug+0x34/0xb2 > [<c020e1b7>] release_dev+0x485/0x5f2 > [<c016f7f8>] d_lookup+0x1b/0x3b > [<c012d8d6>] remove_wait_queue+0xc/0x34 > [<c020e333>] tty_release+0xf/0x18 > [<c015c943>] __fput+0x90/0x15e > [<c015a361>] filp_close+0x4e/0x54 > [<c0102d51>] sysenter_past_esp+0x56/0x79 > konsole released cpu_bitmask_lock Aug 5 13:18:00 ltg01-fedora kernel: CPU0 called lock_cpu_hotplug() for app kded. recursive_depth=0 Aug 5 13:18:00 ltg01-fedora kernel: [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 Aug 5 13:18:00 ltg01-fedora kernel: [<c012aadf>] flush_workqueue+0x25/0x57 Aug 5 13:18:00 ltg01-fedora kernel: [<c020e1b7>] release_dev+0x485/0x5f2 Aug 5 13:18:00 ltg01-fedora kernel: [<c014c28f>] do_wp_page+0x318/0x348 Aug 5 13:18:00 ltg01-fedora kernel: [<c01425d4>] find_get_page+0x36/0x3b Aug 5 13:18:00 ltg01-fedora kernel: [<c014d46b>] __handle_mm_fault+0x6bf/0x727 Aug 5 13:18:00 ltg01-fedora kernel: [<c020e333>] tty_release+0xf/0x18 Aug 5 13:18:00 ltg01-fedora kernel: [<c015c943>] __fput+0x90/0x15e Aug 5 13:18:00 ltg01-fedora kernel: [<c015a361>] filp_close+0x4e/0x54 Aug 5 13:18:00 ltg01-fedora kernel: [<c0102d51>] sysenter_past_esp+0x56/0x79 Aug 5 13:18:00 ltg01-fedora kernel: kded acquired cpu_bitmask_lock [..] Aug 5 13:18:03 ltg01-fedora kernel: CPU1 called lock_cpu_hotplug() for app mingetty. recursive_depth=0 Aug 5 13:18:03 ltg01-fedora kernel: [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 Aug 5 13:18:03 ltg01-fedora kernel: [<c012aadf>] flush_workqueue+0x25/0x57 Aug 5 13:18:03 ltg01-fedora kernel: [<c020e1b7>] release_dev+0x485/0x5f2 Aug 5 13:18:03 ltg01-fedora kernel: [<c01463a2>] __pagevec_free+0x18/0x22 Aug 5 13:18:03 ltg01-fedora kernel: [<c0148251>] release_pages+0x13b/0x143 Aug 5 13:18:03 ltg01-fedora kernel: [<c020e333>] tty_release+0xf/0x18 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c943>] __fput+0x90/0x15e Aug 5 13:18:03 ltg01-fedora kernel: [<c015a361>] filp_close+0x4e/0x54 Aug 5 13:18:03 ltg01-fedora kernel: [<c011e477>] put_files_struct+0x65/0xa7 Aug 5 13:18:03 ltg01-fedora kernel: [<c011f44f>] do_exit+0x1dd/0x771 Aug 5 13:18:03 ltg01-fedora kernel: [<c011fa59>] sys_exit_group+0x0/0xd Aug 5 13:18:03 ltg01-fedora kernel: [<c012766a>] get_signal_to_deliver+0x387/0x3cc Aug 5 13:18:03 ltg01-fedora kernel: [<c01024ba>] do_notify_resume+0x71/0x5df Aug 5 13:18:03 ltg01-fedora kernel: [<c012d8d6>] remove_wait_queue+0xc/0x34 Aug 5 13:18:03 ltg01-fedora kernel: [<c0211a89>] read_chan+0x0/0x567 Aug 5 13:18:03 ltg01-fedora kernel: [<c020cc00>] tty_ldisc_deref+0x50/0x5f Aug 5 13:18:03 ltg01-fedora kernel: [<c020ede8>] tty_read+0x6d/0xb4 Aug 5 13:18:03 ltg01-fedora kernel: [<c020ed7b>] tty_read+0x0/0xb4 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c174>] vfs_read+0x9f/0x141 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c5c0>] sys_read+0x3c/0x63 Aug 5 13:18:03 ltg01-fedora kernel: [<c0102e4a>] work_notifysig+0x13/0x19 Aug 5 13:18:03 ltg01-fedora kernel: mingetty acquired cpu_bitmask_lock [..] Aug 5 13:18:03 ltg01-fedora kernel: CPU1 called lock_cpu_hotplug() for app mingetty. recursive_depth=0 Aug 5 13:18:03 ltg01-fedora kernel: [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 Aug 5 13:18:03 ltg01-fedora kernel: [<c012aadf>] flush_workqueue+0x25/0x57 Aug 5 13:18:03 ltg01-fedora kernel: [<c020e1b7>] release_dev+0x485/0x5f2 Aug 5 13:18:03 ltg01-fedora kernel: [<c0115c9e>] __wake_up_common+0x2f/0x53 Aug 5 13:18:03 ltg01-fedora kernel: [<c0115cd3>] __wake_up_locked+0x11/0x15 Aug 5 13:18:03 ltg01-fedora kernel: [<c02b5ef5>] __down+0xa1/0xd7 Aug 5 13:18:03 ltg01-fedora kernel: [<c0117186>] default_wake_function+0x0/0xc Aug 5 13:18:03 ltg01-fedora kernel: [<c020e333>] tty_release+0xf/0x18 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c943>] __fput+0x90/0x15e Aug 5 13:18:03 ltg01-fedora kernel: [<c015a361>] filp_close+0x4e/0x54 Aug 5 13:18:03 ltg01-fedora kernel: [<c011e477>] put_files_struct+0x65/0xa7 Aug 5 13:18:03 ltg01-fedora kernel: [<c011f44f>] do_exit+0x1dd/0x771 Aug 5 13:18:03 ltg01-fedora kernel: [<c011fa59>] sys_exit_group+0x0/0xd Aug 5 13:18:03 ltg01-fedora kernel: [<c012766a>] get_signal_to_deliver+0x387/0x3cc Aug 5 13:18:03 ltg01-fedora kernel: [<c01024ba>] do_notify_resume+0x71/0x5df Aug 5 13:18:03 ltg01-fedora kernel: [<c012d8d6>] remove_wait_queue+0xc/0x34 Aug 5 13:18:03 ltg01-fedora kernel: [<c0211a89>] read_chan+0x0/0x567 Aug 5 13:18:03 ltg01-fedora kernel: [<c020cc00>] tty_ldisc_deref+0x50/0x5f Aug 5 13:18:03 ltg01-fedora kernel: [<c020ede8>] tty_read+0x6d/0xb4 Aug 5 13:18:03 ltg01-fedora kernel: [<c020ed7b>] tty_read+0x0/0xb4 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c174>] vfs_read+0x9f/0x141 Aug 5 13:18:03 ltg01-fedora kernel: [<c015c5c0>] sys_read+0x3c/0x63 Aug 5 13:18:03 ltg01-fedora kernel: [<c0102e4a>] work_notifysig+0x13/0x19 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 11:26 ` Michal Piotrowski @ 2006-08-05 18:47 ` Dave Jones 2006-08-05 21:15 ` Michal Piotrowski 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-08-05 18:47 UTC (permalink / raw) To: Michal Piotrowski; +Cc: Linus Torvalds, LKML On Sat, Aug 05, 2006 at 01:26:49PM +0200, Michal Piotrowski wrote: > Aug 5 13:18:00 ltg01-fedora kernel: CPU0 called lock_cpu_hotplug() > for app kded. recursive_depth=0 > *more snipped traces* The interesting ones will be the ones before & after you hit that BUG: warning at /usr/src/linux-work1/kernel/cpu.c:51/unlock_cpu_hotplug() if you can make that happen again. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 18:47 ` Dave Jones @ 2006-08-05 21:15 ` Michal Piotrowski 0 siblings, 0 replies; 25+ messages in thread From: Michal Piotrowski @ 2006-08-05 21:15 UTC (permalink / raw) To: Dave Jones, Michal Piotrowski, Linus Torvalds, LKML On 05/08/06, Dave Jones <davej@redhat.com> wrote: > On Sat, Aug 05, 2006 at 01:26:49PM +0200, Michal Piotrowski wrote: > > > Aug 5 13:18:00 ltg01-fedora kernel: CPU0 called lock_cpu_hotplug() > > for app kded. recursive_depth=0 > > *more snipped traces* > > The interesting ones will be the ones before & after you hit that > BUG: warning at /usr/src/linux-work1/kernel/cpu.c:51/unlock_cpu_hotplug() > if you can make that happen again. I don't see nothing interesting before BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() Only CPU0 called lock_cpu_hotplug() for app amarokapp. recursive_depth=0 [<c01329ab>] lock_cpu_hotplug+0x36/0xb9 [<c01182ce>] sched_getaffinity+0xf/0x83 [<c0118361>] sys_sched_getaffinity+0x1f/0x41 [<c0102d51>] sysenter_past_esp+0x56/0x79 amarokapp acquired cpu_bitmask_lock appears after this warning. dmesg -> http://www.stardust.webpages.pl/files/2.6-git/18-rc3/dmesg2 > > Dave > > -- > http://www.codemonkey.org.uk > Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/wiki/) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 6:47 ` Dave Jones 2006-08-05 7:46 ` Andrew Morton 2006-08-05 10:54 ` Michal Piotrowski @ 2006-08-06 16:59 ` Michal Piotrowski 2006-08-06 18:05 ` Dave Jones 2006-08-06 19:32 ` Andrew Morton 2 siblings, 2 replies; 25+ messages in thread From: Michal Piotrowski @ 2006-08-06 16:59 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Michal Piotrowski, LKML Hi Dave, On 05/08/06, Dave Jones <davej@redhat.com> wrote: > On Fri, Aug 04, 2006 at 10:49:47PM -0400, Dave Jones wrote: > > This trace now makes a lot more sense to me. > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > [<c01054c2>] show_trace+0xd/0x10 > > [<c01055db>] dump_stack+0x19/0x1b > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > [<c029fbae>] store_scaling_governor+0x142/0x1a3 > > [<c029f1a5>] store+0x37/0x48 > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > [<c016f99f>] vfs_write+0xab/0x157 > > [<c016ffe4>] sys_write+0x3b/0x60 > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > cpuspeed acquired cpu_bitmask_lock > > > > CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0 > > [<c0104edc>] show_trace_log_lvl+0x58/0x152 > > [<c01054c2>] show_trace+0xd/0x10 > > [<c01055db>] dump_stack+0x19/0x1b > > [<c013e8c3>] lock_cpu_hotplug+0x39/0xbf > > [<c0132f3c>] __create_workqueue+0x52/0x122 > > [<f901234b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand] > > [<c029f7b6>] __cpufreq_governor+0x57/0xd8 > > [<c029f985>] __cpufreq_set_policy+0x14e/0x1bc > > [<c029fbc5>] store_scaling_governor+0x159/0x1a3 > > [<c029f1a5>] store+0x37/0x48 > > [<c01a6561>] sysfs_write_file+0xab/0xd1 > > [<c016f99f>] vfs_write+0xab/0x157 > > [<c016ffe4>] sys_write+0x3b/0x60 > > [<c0103db9>] sysenter_past_esp+0x56/0x8d > > Lukewarm IQ detected in hotplug locking > > BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug() > > So when we write to sysfs to set the governor, we end up in store_scaling_governor() > which takes the hotplug lock, and then calls off into the governor to let it > do its thing. Part of ondemand's "thing" is to create a workqueue. > unfortunatly, __create_workqueue also takes the hotplug lock. > > Creating a variant of __create_workqueue that doesn't take the lock > seems really nasty. > > We could remove the locking from store_scaling_governor() and make the governors > themselves have to do the locking, but I'm not sure that's entirely safe. > > We could do something really disgusting like ... > > unlock_cpu_hotplug() > ... > create_workqueue() > ... > lock_cpu_hotplug() > > in ondemand, which opens up a tiny race window, but as ugly as it is, > looks to be the best solution of the bunch right now. > > Comments? > > The really sad part is this is completely unrelated to the original bug reported > in this thread, which shows just how widespread this braindamage is. > Michal's traces really don't really scream anything obvious to me. > (Though given it took me 4 hours to decode my own traces above, this is no > real sign of how big a problem this might be). > > Michal, could you apply this diff.. http://lkml.org/lkml/diff/2006/8/4/381/1 > (change the '120' to '60' first), and send me the debug spew that you get ? > You'll have to wait until a minute of uptime has passed. Oh, and edit > include/linux/jiffies.h to change INITIAL_JIFFIES to '0'. I hope that this one will help BUG: using smp_processor_id() in preemptible [00000001] code: cpuspeed/1433 caller is lock_cpu_hotplug+0x25/0xc5 [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] [<c029fa95>] store+0x37/0x48 [<c0104007>] show_trace_log_lvl+0x58/0x159 [<c0104765>] show_trace+0xd/0x10 [<c010482d>] dump_stack+0x19/0x1b [<c01fc842>] debug_smp_processor_id+0x96/0xac [<c01a8197>] sysfs_write_file+0xa6/0xcc [<c013e0db>] lock_cpu_hotplug+0x25/0xc5 [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] [<c029fa95>] store+0x37/0x48 [<c01a8197>] sysfs_write_file+0xa6/0xcc [<c0171577>] vfs_write+0xcd/0x179 [<c0171c20>] sys_write+0x3b/0x71 [<c010318d>] sysenter_past_esp+0x56/0x8d DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d Leftover inexact backtrace: [<c0104765>] show_trace+0xd/0x10 [<c010482d>] dump_stack+0x19/0x1b [<c01fc842>] debug_smp_processor_id+0x96/0xac [<c013e0db>] lock_cpu_hotplug+0x25/0xc5 [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] [<c029fa95>] store+0x37/0x48 [<c01a8197>] [<c0171577>] vfs_write+0xcd/0x179 It's from 2.6.18-rc3-mm2. Config and dmesg log -> http://www.stardust.webpages.pl/files/mm/2.6.18-rc3-mm2/ > > Dave > > -- > http://www.codemonkey.org.uk > Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (http://www.stardust.webpages.pl/ltg/wiki/) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-06 16:59 ` Michal Piotrowski @ 2006-08-06 18:05 ` Dave Jones 2006-08-06 19:32 ` Andrew Morton 1 sibling, 0 replies; 25+ messages in thread From: Dave Jones @ 2006-08-06 18:05 UTC (permalink / raw) To: Michal Piotrowski; +Cc: Linus Torvalds, LKML On Sun, Aug 06, 2006 at 06:59:54PM +0200, Michal Piotrowski wrote: > > Michal, could you apply this diff.. http://lkml.org/lkml/diff/2006/8/4/381/1 > > (change the '120' to '60' first), and send me the debug spew that you get ? > > You'll have to wait until a minute of uptime has passed. Oh, and edit > > include/linux/jiffies.h to change INITIAL_JIFFIES to '0'. > > I hope that this one will help > > BUG: using smp_processor_id() in preemptible [00000001] code: cpuspeed/1433 > caller is lock_cpu_hotplug+0x25/0xc5 Not really, that one seems to be a bug in my debugging patch :-/ Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-06 16:59 ` Michal Piotrowski 2006-08-06 18:05 ` Dave Jones @ 2006-08-06 19:32 ` Andrew Morton 2006-08-07 1:26 ` Andi Kleen 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-08-06 19:32 UTC (permalink / raw) To: Michal Piotrowski Cc: davej, torvalds, michal.k.k.piotrowski, linux-kernel, Jan Beulich, Andi Kleen On Sun, 6 Aug 2006 18:59:54 +0200 "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com> wrote: > caller is lock_cpu_hotplug+0x25/0xc5 > [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] > [<c029fa95>] store+0x37/0x48 > [<c0104007>] show_trace_log_lvl+0x58/0x159 > [<c0104765>] show_trace+0xd/0x10 > [<c010482d>] dump_stack+0x19/0x1b > [<c01fc842>] debug_smp_processor_id+0x96/0xac > [<c01a8197>] sysfs_write_file+0xa6/0xcc > [<c013e0db>] lock_cpu_hotplug+0x25/0xc5 > [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] > [<c029fa95>] store+0x37/0x48 > [<c01a8197>] sysfs_write_file+0xa6/0xcc > [<c0171577>] vfs_write+0xcd/0x179 > [<c0171c20>] sys_write+0x3b/0x71 > [<c010318d>] sysenter_past_esp+0x56/0x8d > DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d > Leftover inexact backtrace: > [<c0104765>] show_trace+0xd/0x10 > [<c010482d>] dump_stack+0x19/0x1b > [<c01fc842>] debug_smp_processor_id+0x96/0xac > [<c013e0db>] lock_cpu_hotplug+0x25/0xc5 > [<fd95918f>] store_speed+0x36/0x9b [cpufreq_userspace] > [<c029fa95>] store+0x37/0x48 > [<c01a8197>] [<c0171577>] vfs_write+0xcd/0x179 > > It's from 2.6.18-rc3-mm2. This "unwinder stuck" thing seems to be very common. It's a false-positive in this case - the backtrace was complete. It would be good if we could make the did-we-get-stuck detector a bit smarter. Even special-casing "sysenter_past_esp" would stop a lot of this.. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-06 19:32 ` Andrew Morton @ 2006-08-07 1:26 ` Andi Kleen 2006-08-07 8:18 ` Jan Beulich 2006-08-15 12:23 ` Jan Beulich 0 siblings, 2 replies; 25+ messages in thread From: Andi Kleen @ 2006-08-07 1:26 UTC (permalink / raw) To: Andrew Morton Cc: Michal Piotrowski, davej, torvalds, linux-kernel, Jan Beulich > > [<c0171577>] vfs_write+0xcd/0x179 > > [<c0171c20>] sys_write+0x3b/0x71 > > [<c010318d>] sysenter_past_esp+0x56/0x8d > This "unwinder stuck" thing seems to be very common. Yes, there are still a lot of bugs in the unwind annotation unfortunately. We're also slowly discovering that some things we do cannot even be expressed in CFI, so some code has to change. > > It's a false-positive in this case - the backtrace was complete. It would > be good if we could make the did-we-get-stuck detector a bit smarter. Even > special-casing "sysenter_past_esp" would stop a lot of this.. Actually it's not completely false in this case -- it should have reached user mode and stopped there, but for some reason I didn't and already stopped still in the kernel. Most likely the CFI annotation for that sysenter path is not complete. It's on my todo list to investigate but I still hope Jan does it first ;-) -Andi ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-07 1:26 ` Andi Kleen @ 2006-08-07 8:18 ` Jan Beulich 2006-08-15 12:23 ` Jan Beulich 1 sibling, 0 replies; 25+ messages in thread From: Jan Beulich @ 2006-08-07 8:18 UTC (permalink / raw) To: Andi Kleen, Andrew Morton Cc: Michal Piotrowski, torvalds, davej, linux-kernel >> It's a false-positive in this case - the backtrace was complete. It would >> be good if we could make the did-we-get-stuck detector a bit smarter. Even >> special-casing "sysenter_past_esp" would stop a lot of this.. > >Actually it's not completely false in this case -- it should >have reached user mode and stopped there, but for some reason >I didn't and already stopped still in the kernel. > >Most likely the CFI annotation for that sysenter path is not complete. Correct, the return point of sysexit (SYSENTER_RETURN) is still in kernel space, but its annotations are invisible to the unwinder. We should make the VDSO be treated as user-mode code despite living above PAGE_OFFSET. >It's on my todo list to investigate but I still hope Jan does it first ;-) I'll try to, once I've got through moving the Xen code from 2.6.16 to 2.6.18-rc3 Jan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-07 1:26 ` Andi Kleen 2006-08-07 8:18 ` Jan Beulich @ 2006-08-15 12:23 ` Jan Beulich 1 sibling, 0 replies; 25+ messages in thread From: Jan Beulich @ 2006-08-15 12:23 UTC (permalink / raw) To: Andi Kleen, Andrew Morton Cc: Michal Piotrowski, torvalds, davej, linux-kernel >> > [<c0171577>] vfs_write+0xcd/0x179 >> > [<c0171c20>] sys_write+0x3b/0x71 >> > [<c010318d>] sysenter_past_esp+0x56/0x8d >> This "unwinder stuck" thing seems to be very common. > >Yes, there are still a lot of bugs in the unwind annotation >unfortunately. > >We're also slowly discovering that some things we do cannot >even be expressed in CFI, so some code has to change. > >> >> It's a false-positive in this case - the backtrace was complete. It would >> be good if we could make the did-we-get-stuck detector a bit smarter. Even >> special-casing "sysenter_past_esp" would stop a lot of this.. > >Actually it's not completely false in this case -- it should >have reached user mode and stopped there, but for some reason >I didn't and already stopped still in the kernel. > >Most likely the CFI annotation for that sysenter path is not complete. They seem to be correct, at least on the default path. I have no problem with it finding and stopping at the first user frame; I'd like to note, however, that this (a) is *with* the patches I sent earlier today (they didn't change anything annotation-wise) and (b) the size reported for sysenter_past_esp is slightly different on my system (above trace says 0x8d, mine is 0x79), while the offsets are 0x56 in both cases. I'd assume the size delta is due to CONFIG_TRACE_IRQFLAGS, which then I can't see how it would make a difference other than in the case where trace_hardirqs_{on,off} themselves would fault (as their call sites are un-annotated), which doesn't appear to be the case here. Again, if "unwinder stuck" messages appear, I'll need a raw stack dump to accompany the stack trace. >It's on my todo list to investigate but I still hope Jan does it first ;-) Jan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() @ 2006-08-05 22:14 art 2006-08-06 1:23 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: art @ 2006-08-05 22:14 UTC (permalink / raw) To: linux-kernel; +Cc: davej, michal.k.k.piotrowski same for 2.6.18.rc3-git7 on smp amd64 x2 (dualcore) after # echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor dmesg .... SELinux: initialized (dev sdb1, type vfat), uses genfs_contexts Lukewarm IQ detected in hotplug locking BUG: warning at kernel/cpu.c:38/lock_cpu_hotplug() Call Trace: [<ffffffff802a0e72>] lock_cpu_hotplug+0x50/0x73 [<ffffffff80299a36>] __create_workqueue+0x5d/0x146 [<ffffffff803f437d>] cpufreq_governor_dbs+0xa8/0x2f4 [<ffffffff803f22ac>] __cpufreq_governor+0xa4/0x194 [<ffffffff803f2516>] __cpufreq_set_policy+0x17a/0x1f4 [<ffffffff803f2792>] store_scaling_governor+0x17c/0x1d8 [<ffffffff803f35f8>] handle_update+0x0/0x28 [<ffffffff8026650b>] _spin_unlock_irqrestore+0x15/0x30 [<ffffffff803f1a00>] cpufreq_cpu_get+0xb5/0x143 [<ffffffff80272dbf>] store+0x44/0x5b [<ffffffff802f1ddc>] sysfs_write_file+0xc5/0xf4 [<ffffffff80215e1b>] vfs_write+0xce/0x174 [<ffffffff802166a9>] sys_write+0x45/0x6e [<ffffffff8025fa8e>] system_call+0x7e/0x83 ------------------------------------------------------ my old message from 2006/08/03 2.6.18-rc3-git1-smp-amd64 - Lukewarm IQ detected in hotplug what i have in dmesg Linux version 2.6.18-rc3-git1 (xxx@xxx.xxx) (gcc version 4.1.1 20060525)) #1 SMP PREEMPT Tue Aug 1 09:46:44 CDT 2006 ... AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ stepping 01 ... SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts ip_tables: (C) 2000-2006 Netfilter Core Team BUG: warning at kernel/cpu.c:51/unlock_cpu_hotplug() Call Trace: [<ffffffff802a0e51>] unlock_cpu_hotplug+0x3f/0x6c [<ffffffff803f3937>] store_speed+0xc2/0xd1 [<ffffffff80272d3f>] store+0x44/0x5b [<ffffffff802f1c8a>] sysfs_write_file+0xc5/0xf4 [<ffffffff80215e14>] vfs_write+0xce/0x174 [<ffffffff802166a2>] sys_write+0x45/0x6e [<ffffffff8025fa0e>] system_call+0x7e/0x83 Netfilter messages via NETLINK v0.30. ... ... SELinux: initialized (dev sdb1, type vfat), uses genfs_contexts Lukewarm IQ detected in hotplug locking BUG: warning at kernel/cpu.c:38/lock_cpu_hotplug() Call Trace: [<ffffffff802a0def>] lock_cpu_hotplug+0x50/0x73 [<ffffffff802999ab>] __create_workqueue+0x5d/0x146 [<ffffffff803f4108>] cpufreq_governor_dbs+0xa8/0x2f4 [<ffffffff803f2006>] __cpufreq_governor+0xa4/0x194 [<ffffffff803f225f>] __cpufreq_set_policy+0x169/0x1d9 [<ffffffff803f24a8>] store_scaling_governor+0x153/0x18d [<ffffffff803f32ed>] handle_update+0x0/0x28 [<ffffffff8026648b>] _spin_unlock_irqrestore+0x15/0x30 [<ffffffff803f1800>] cpufreq_cpu_get+0xb4/0x143 [<ffffffff80272d3f>] store+0x44/0x5b [<ffffffff802f1c8a>] sysfs_write_file+0xc5/0xf4 [<ffffffff80215e14>] vfs_write+0xce/0x174 [<ffffffff802166a2>] sys_write+0x45/0x6e [<ffffffff8025fa0e>] system_call+0x7e/0x83 hdc: CHECK for good STATUS ... the same in 2.6.18-rc3-git3 any clue ? ------------------------------------------------- xboom art@usfltd.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() 2006-08-05 22:14 art @ 2006-08-06 1:23 ` Dave Jones 0 siblings, 0 replies; 25+ messages in thread From: Dave Jones @ 2006-08-06 1:23 UTC (permalink / raw) To: art; +Cc: linux-kernel, michal.k.k.piotrowski On Sat, Aug 05, 2006 at 05:14:10PM -0500, art wrote: > same for 2.6.18.rc3-git7 on smp amd64 x2 (dualcore) > after > # echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > > dmesg > > .... > SELinux: initialized (dev sdb1, type vfat), uses genfs_contexts > Lukewarm IQ detected in hotplug locking > BUG: warning at kernel/cpu.c:38/lock_cpu_hotplug() > > Call Trace: > [<ffffffff802a0e72>] lock_cpu_hotplug+0x50/0x73 > [<ffffffff80299a36>] __create_workqueue+0x5d/0x146 > [<ffffffff803f437d>] cpufreq_governor_dbs+0xa8/0x2f4 > [<ffffffff803f22ac>] __cpufreq_governor+0xa4/0x194 > [<ffffffff803f2516>] __cpufreq_set_policy+0x17a/0x1f4 > [<ffffffff803f2792>] store_scaling_governor+0x17c/0x1d8 > [<ffffffff803f35f8>] handle_update+0x0/0x28 > [<ffffffff8026650b>] _spin_unlock_irqrestore+0x15/0x30 > [<ffffffff803f1a00>] cpufreq_cpu_get+0xb5/0x143 > [<ffffffff80272dbf>] store+0x44/0x5b > [<ffffffff802f1ddc>] sysfs_write_file+0xc5/0xf4 > [<ffffffff80215e1b>] vfs_write+0xce/0x174 > [<ffffffff802166a9>] sys_write+0x45/0x6e > [<ffffffff8025fa8e>] system_call+0x7e/0x83 Ok, these should be fixed with Andrews voodoo patches to the workqueue code. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2006-08-15 12:23 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-08-04 19:04 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug() Michal Piotrowski 2006-08-04 19:24 ` Linus Torvalds 2006-08-04 22:24 ` Dave Jones 2006-08-05 0:31 ` Dave Jones 2006-08-05 2:10 ` Dave Jones 2006-08-05 2:23 ` Dave Jones 2006-08-05 2:49 ` Dave Jones 2006-08-05 6:47 ` Dave Jones 2006-08-05 7:46 ` Andrew Morton 2006-08-05 19:47 ` Dave Jones 2006-08-05 20:02 ` Andrew Morton 2006-08-06 1:12 ` Andrew Morton 2006-08-05 10:54 ` Michal Piotrowski 2006-08-05 11:11 ` Michal Piotrowski 2006-08-05 11:26 ` Michal Piotrowski 2006-08-05 18:47 ` Dave Jones 2006-08-05 21:15 ` Michal Piotrowski 2006-08-06 16:59 ` Michal Piotrowski 2006-08-06 18:05 ` Dave Jones 2006-08-06 19:32 ` Andrew Morton 2006-08-07 1:26 ` Andi Kleen 2006-08-07 8:18 ` Jan Beulich 2006-08-15 12:23 ` Jan Beulich -- strict thread matches above, loose matches on Subject: below -- 2006-08-05 22:14 art 2006-08-06 1:23 ` Dave Jones
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox