public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [BUG] hotplug_cpu vs no_hz
@ 2008-07-03  6:35 Lai Jiangshan
  2008-07-03  9:07 ` Vegard Nossum
  0 siblings, 1 reply; 3+ messages in thread
From: Lai Jiangshan @ 2008-07-03  6:35 UTC (permalink / raw)
  To: tglx, Ingo Molnar, Andrew Morton; +Cc: Linux Kernel Mailing List

config:
CONFIG_DETECT_SOFTLOCKUP=y  # just for call trace
CONFIG_HOTPLUG_CPU=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_NO_HZ=y		# if =n, this 2 bugs can't occur


this 2 bugs occur on 2kernel_vr * 2platform
platform :	i386 2cpus
		x86_64 2core*2cpus

kernel_vr:	2.6.25
		2.6.26-rc8

test 1:(cpu dies)
offline the other cpus, just left cpu#0 cpu#1, and:

i=0
while ((++i))
do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	sleep 1
	echo 1 > /sys/devices/system/cpu/cpu1/online
	sleep 1
	echo $i
done

after several seconds ~ several hours, "echo 1 > /sys/devices/system/cpu/cpu1/online"
was blocked, cpu#1 can not be used and the output of dmesg:

BUG: soft lockup - CPU#1 stuck for 61s! [events/1:9898]
CPU 1:
Modules linked in:
Pid: 9898, comm: events/1 Not tainted 2.6.26-rc8-official-LAI-00089-ge1441b9 #5
RIP: 0010:[<ffffffff80237612>]  [<ffffffff80237612>] __do_softirq+0x4b/0xc7
RSP: 0018:ffff81006b42ff20  EFLAGS: 00000206
RAX: ffff81006a9b9fd8 RBX: ffff81006b42ff40 RCX: 0000000000000006
RDX: 0000000000000042 RSI: ffffffff8022da16 RDI: ffffffff8022da16
RBP: ffff81006b42fea0 R08: ffff81007f2c9178 R09: ffff81007f2c9140
R10: ffff8100807cc000 R11: 0000000000000000 R12: ffffffff8020be36
R13: ffff81006b42fea0 R14: ffffffff807a5100 R15: 0000000000000042
FS:  0000000000000000(0000) GS:ffff81007fb3ccc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f4cc97b6000 CR3: 0000000000201000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 <IRQ>  [<ffffffff8020c38c>] ? call_softirq+0x1c/0x28
 [<ffffffff8020dad6>] ? do_softirq+0x34/0x72
 [<ffffffff80237586>] ? irq_exit+0x3f/0x80
 [<ffffffff8021b128>] ? smp_apic_timer_interrupt+0x8b/0xa7
 [<ffffffff8020be36>] ? apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff8022da16>] ? finish_task_switch+0x31/0x82
 [<ffffffff80590e7d>] ? thread_return+0x3d/0x9c
 [<ffffffff80241ca1>] ? worker_thread+0xa3/0xe5
 [<ffffffff80244780>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff80241bfe>] ? worker_thread+0x0/0xe5
 [<ffffffff80244645>] ? kthread+0x49/0x78
 [<ffffffff8020c018>] ? child_rip+0xa/0x12
 [<ffffffff802445fc>] ? kthread+0x0/0x78
 [<ffffffff8020c00e>] ? child_rip+0x0/0x12

INFO: task syslogd:3835 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffffffff805af600     0  3835      1
 ffff81007b80fd28 0000000000000082 0000000000000000 ffff81007dd83ec0
 ffff81007f01a820 ffff81007fbb1560 ffff81007f01ab78 0000000100000001
 00000000ffffffff 0000000000000292 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8030f58f>] log_wait_commit+0xa4/0xf4
 [<ffffffff80244780>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff8030b1e7>] journal_stop+0x17c/0x1a9
 [<ffffffff8030ba16>] journal_force_commit+0x23/0x25
 [<ffffffff803049d0>] ext3_force_commit+0x26/0x28
 [<ffffffff802fed64>] ext3_write_inode+0x39/0x3f
 [<ffffffff802ad9ad>] __writeback_single_inode+0x180/0x29f
 [<ffffffff802ae353>] sync_inode+0x24/0x31
 [<ffffffff802fb2ff>] ext3_sync_file+0xa3/0xb4
 [<ffffffff802b08f7>] do_fsync+0x54/0xaa
 [<ffffffff802b097b>] __do_fsync+0x2e/0x44
 [<ffffffff802b09ac>] sys_fsync+0xb/0xd
 [<ffffffff8020b1fb>] system_call_after_swapgs+0x7b/0x80



test 2:(time-subsystem was broken)
offline the other cpus, just left cpu0#0 cpu#1, and:

try several times: {
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
	cat /dev/zero > /dev/null &
	taskset -p 2 $!		# set affinity to cpu#1

	top	# get cpu usage of "cat /dev/zero"
		# if cpu usage=0%, bug of test 1 have occurred
		# stop test

		# if cpu usage>150%, time-subsystem was broken
		# offline/online again, "top" shows huger cpu usage

		# nothing happen, kill "cat /dev/zero" and try again
}


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] hotplug_cpu vs no_hz
  2008-07-03  6:35 [BUG] hotplug_cpu vs no_hz Lai Jiangshan
@ 2008-07-03  9:07 ` Vegard Nossum
  2008-07-05  5:23   ` Lai Jiangshan
  0 siblings, 1 reply; 3+ messages in thread
From: Vegard Nossum @ 2008-07-03  9:07 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: tglx, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List,
	Peter Zijlstra

Hi!

On Thu, Jul 3, 2008 at 8:35 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> after several seconds ~ several hours, "echo 1 > /sys/devices/system/cpu/cpu1/online"
> was blocked, cpu#1 can not be used and the output of dmesg:
>
> BUG: soft lockup - CPU#1 stuck for 61s! [events/1:9898]
> CPU 1:
> Modules linked in:
> Pid: 9898, comm: events/1 Not tainted 2.6.26-rc8-official-LAI-00089-ge1441b9 #5
> RIP: 0010:[<ffffffff80237612>]  [<ffffffff80237612>] __do_softirq+0x4b/0xc7
> RSP: 0018:ffff81006b42ff20  EFLAGS: 00000206
> RAX: ffff81006a9b9fd8 RBX: ffff81006b42ff40 RCX: 0000000000000006
> RDX: 0000000000000042 RSI: ffffffff8022da16 RDI: ffffffff8022da16
> RBP: ffff81006b42fea0 R08: ffff81007f2c9178 R09: ffff81007f2c9140
> R10: ffff8100807cc000 R11: 0000000000000000 R12: ffffffff8020be36
> R13: ffff81006b42fea0 R14: ffffffff807a5100 R15: 0000000000000042
> FS:  0000000000000000(0000) GS:ffff81007fb3ccc0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00007f4cc97b6000 CR3: 0000000000201000 CR4: 00000000000006a0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
>  <IRQ>  [<ffffffff8020c38c>] ? call_softirq+0x1c/0x28
>  [<ffffffff8020dad6>] ? do_softirq+0x34/0x72
>  [<ffffffff80237586>] ? irq_exit+0x3f/0x80
>  [<ffffffff8021b128>] ? smp_apic_timer_interrupt+0x8b/0xa7
>  [<ffffffff8020be36>] ? apic_timer_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8022da16>] ? finish_task_switch+0x31/0x82
>  [<ffffffff80590e7d>] ? thread_return+0x3d/0x9c
>  [<ffffffff80241ca1>] ? worker_thread+0xa3/0xe5
>  [<ffffffff80244780>] ? autoremove_wake_function+0x0/0x38
>  [<ffffffff80241bfe>] ? worker_thread+0x0/0xe5
>  [<ffffffff80244645>] ? kthread+0x49/0x78
>  [<ffffffff8020c018>] ? child_rip+0xa/0x12
>  [<ffffffff802445fc>] ? kthread+0x0/0x78
>  [<ffffffff8020c00e>] ? child_rip+0x0/0x12

I believe I have experienced this as well, and I tried to debug it
with Peter Zijlstra (Cc added).

I only managed to get this message once. But I made a patch that could
help us debug this, see

commit 8d5be7f4e8515af461cbc8f07687ccc81507d508
Date:   Wed Jun 25 08:50:10 2008 +0200

    softlockup: show irqtrace

from the core/softlockup branch in the -tip tree. (This commit is also
present in tip/master.)

In order to see the effect of this, you need CONFIG_TRACE_IRQFLAGS=y.
This will tell us when/where irqs were last disabled and will
hopefully give a hint of where the block is really occurring. (Unless
it is already obvious to you or others; it isn't to me :-))

Thanks,


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] hotplug_cpu vs no_hz
  2008-07-03  9:07 ` Vegard Nossum
@ 2008-07-05  5:23   ` Lai Jiangshan
  0 siblings, 0 replies; 3+ messages in thread
From: Lai Jiangshan @ 2008-07-05  5:23 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: tglx, Ingo Molnar, Andrew Morton, Linux Kernel Mailing List,
	Peter Zijlstra

Vegard Nossum wrote:
> Hi!
> 
> On Thu, Jul 3, 2008 at 8:35 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> after several seconds ~ several hours, "echo 1 > /sys/devices/system/cpu/cpu1/online"
>> was blocked, cpu#1 can not be used and the output of dmesg:
>>
>> BUG: soft lockup - CPU#1 stuck for 61s! [events/1:9898]
>> CPU 1:
>> Modules linked in:
>> Pid: 9898, comm: events/1 Not tainted 2.6.26-rc8-official-LAI-00089-ge1441b9 #5
>> RIP: 0010:[<ffffffff80237612>]  [<ffffffff80237612>] __do_softirq+0x4b/0xc7
>> RSP: 0018:ffff81006b42ff20  EFLAGS: 00000206
>> RAX: ffff81006a9b9fd8 RBX: ffff81006b42ff40 RCX: 0000000000000006
>> RDX: 0000000000000042 RSI: ffffffff8022da16 RDI: ffffffff8022da16
>> RBP: ffff81006b42fea0 R08: ffff81007f2c9178 R09: ffff81007f2c9140
>> R10: ffff8100807cc000 R11: 0000000000000000 R12: ffffffff8020be36
>> R13: ffff81006b42fea0 R14: ffffffff807a5100 R15: 0000000000000042
>> FS:  0000000000000000(0000) GS:ffff81007fb3ccc0(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 00007f4cc97b6000 CR3: 0000000000201000 CR4: 00000000000006a0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>
>> Call Trace:
>>  <IRQ>  [<ffffffff8020c38c>] ? call_softirq+0x1c/0x28
>>  [<ffffffff8020dad6>] ? do_softirq+0x34/0x72
>>  [<ffffffff80237586>] ? irq_exit+0x3f/0x80
>>  [<ffffffff8021b128>] ? smp_apic_timer_interrupt+0x8b/0xa7
>>  [<ffffffff8020be36>] ? apic_timer_interrupt+0x66/0x70
>>  <EOI>  [<ffffffff8022da16>] ? finish_task_switch+0x31/0x82
>>  [<ffffffff80590e7d>] ? thread_return+0x3d/0x9c
>>  [<ffffffff80241ca1>] ? worker_thread+0xa3/0xe5
>>  [<ffffffff80244780>] ? autoremove_wake_function+0x0/0x38
>>  [<ffffffff80241bfe>] ? worker_thread+0x0/0xe5
>>  [<ffffffff80244645>] ? kthread+0x49/0x78
>>  [<ffffffff8020c018>] ? child_rip+0xa/0x12
>>  [<ffffffff802445fc>] ? kthread+0x0/0x78
>>  [<ffffffff8020c00e>] ? child_rip+0x0/0x12
> 
> I believe I have experienced this as well, and I tried to debug it
> with Peter Zijlstra (Cc added).
> 
> I only managed to get this message once. But I made a patch that could
> help us debug this, see
> 
> commit 8d5be7f4e8515af461cbc8f07687ccc81507d508
> Date:   Wed Jun 25 08:50:10 2008 +0200
> 
>     softlockup: show irqtrace
> 
> from the core/softlockup branch in the -tip tree. (This commit is also
> present in tip/master.)

I have applied it in new test.

> 
> In order to see the effect of this, you need CONFIG_TRACE_IRQFLAGS=y.
> This will tell us when/where irqs were last disabled and will
> hopefully give a hint of where the block is really occurring. (Unless
> it is already obvious to you or others; it isn't to me :-))
> 
> Thanks,
> 
> 
> Vegard
> 

I pulled the most recent tree, but messages like
"BUG: soft lockup - CPU#1 stuck for 61s! [events/1:9898]"
can not be triggered.

This is the new dmesg:
(But i hasn't tested that whether these messages occurred when CONFIG_NO_HZ=n)

 =======================
INFO: task bash:18198 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash          D 4e69057c     0 18198  18191
       f5b4ce00 00000086 f7332248 4e69057c 0000321b f7341d40 f7341fac c20126c0
       00000000 f7341d40 c04586c0 f7341d40 00000000 f5b4ce00 c02f67be f73430c0
       4e690578 7fffffff 7fffffff f5be5ea8 00000002 c02f6b11 f771f390 00000000
Call Trace:
 [<c02f67be>] schedule+0x775/0x7c5
 [<c02f6b11>] schedule_timeout+0x13/0x86
 [<c02f6924>] wait_for_common+0xb2/0xf4
 [<c0119017>] default_wake_function+0x0/0x8
 [<c012f068>] synchronize_rcu+0x2a/0x32
 [<c012f036>] wakeme_after_rcu+0x0/0x8
 [<c011d240>] update_sched_domains+0x29/0x44
 [<c02f976a>] notifier_call_chain+0x2a/0x47
 [<c02e9e2d>] _cpu_down+0x6f/0x205
 [<c0151dff>] __alloc_pages_internal+0xd0/0x365
 [<c02e9fe3>] cpu_down+0x20/0x2c
 [<c02eaef8>] store_online+0x24/0x56
 [<c02eaed4>] store_online+0x0/0x56
 [<c023f48e>] sysdev_store+0x1e/0x22
 [<c01a3546>] sysfs_write_file+0xa4/0xd8
 [<c01a34a2>] sysfs_write_file+0x0/0xd8
 [<c016c514>] vfs_write+0x83/0x120
 [<c016cad4>] sys_write+0x3c/0x63
 [<c0103882>] syscall_call+0x7/0xb
 =======================
INFO: task crond:18687 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond         D 3ef0cf27     0 18687  14998
       f59d7e00 00000086 00000000 3ef0cf27 00003245 f7831d40 f7831fac c20126c0
       00000000 00000000 00000000 00000000 00000000 00000004 f5b9def6 f6377a84
       f6377a84 7fffffff 7fffffff f5b9df4c 00000002 c02f6b11 00000000 00000000
Call Trace:
 [<c02f6b11>] schedule_timeout+0x13/0x86
 [<c01b0120>] key_alloc+0x1f9/0x2c4
 [<c02f6924>] wait_for_common+0xb2/0xf4
 [<c0119017>] default_wake_function+0x0/0x8
 [<c012f068>] synchronize_rcu+0x2a/0x32
 [<c012f036>] wakeme_after_rcu+0x0/0x8
 [<c01b216d>] install_session_keyring+0xb3/0xc3
 [<c01b2804>] join_session_keyring+0x1a/0xa6
 [<c0103882>] syscall_call+0x7/0xb
 =======================




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-07-05  5:25 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-07-03  6:35 [BUG] hotplug_cpu vs no_hz Lai Jiangshan
2008-07-03  9:07 ` Vegard Nossum
2008-07-05  5:23   ` Lai Jiangshan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox