* [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5
@ 2006-09-07 4:35 Shu Qing Yang
2006-09-08 2:14 ` Andrew Morton
0 siblings, 1 reply; 5+ messages in thread
From: Shu Qing Yang @ 2006-09-07 4:35 UTC (permalink / raw)
To: linux-kernel
Problem description:
I run pounder, scsi_debug on a machine. Then start 200 random syscall
test
simultaneously. Tens of minutes later, the system hang.
Hardware Environment
Cpu type :power5+
Software Env:
kernel: 2.6.18-rc5
Base system: opensuse10
Is the system (not just the application) hung?
Yes
Did the system produce an OOPS message on the console?
No.
Is the system sitting in a debugger right now?
Yes, xmon and sysrq are on.
Additional information:
I use 'sysrq + t' then force system into xmon. And get following
message:
<4>Call Trace:.
<4>[C000000046977160] [C000000046977200] 0xc000000046977200 (unreliable).
<4>[C000000046977330] [C00000000000FF24] .__switch_to+0x12c/0x150.
<4>[C0000000469773C0] [C00000000052D94C].schedule+0xa38/0xb84.
<4>[C0000000469774C0] [C000000000179E0C].start_this_handle+0x32c/0x5c4.
<4>[C0000000469775E0] [C00000000017A180] .journal_start+0xdc/0x130.
<4>[C000000046977680] [C000000000170490] .ext3_journal_start_sb+0x58/0x78.
<4>[C000000046977700] [C000000000169D14] .ext3_dirty_inode+0x38/0xec.
<4>[C000000046977790] [C0000000000F7174] .__mark_inode_dirty+0x64/0x1d8.
<4>[C000000046977830] [C0000000000EB674] .touch_atime+0xc8/0xe0.
<4>[C0000000469778C0] [C00000000009B8EC]
.do_generic_mapping_read+0x470/0x4fc.
<4>[C000000046977A10]
[C00000000009C4A4].__generic_file_aio_read+0x184/0x22c.
<4>[C000000046977AE0] [C00000000009C640] .generic_file_aio_read+0x44/0x54.
<4>[C000000046977B70] [C0000000000C874C] .do_sync_read+0xd4/0x130.
<4>[C000000046977CF0] [C0000000000C9410] .vfs_read+0xd0/0x1b4.
<4>[C000000046977D90] [C0000000000C98F0] .sys_read+0x4c/0x8c.
<4>[C000000046977E30] [C00000000000871C] syscall_exit+0x0/0x40.
<3>BUG: soft lockup detected on CPU#0!.
<4>Call Trace:.
<4>[C0000000228B3B90] [C00000000000F7F0]
.show_stack+0x68/0x1b0(unreliable).
<4>[C0000000228B3C30] [C000000000094834] .softlockup_tick+0xec/0x124.
<4>[C0000000228B3CD0] [C0000000000686DC] .run_local_timers+0x1c/0x30.
<4>[C0000000228B3D50] [C000000000021AD4].timer_interrupt+0xa8/0x47c.
<4>[C0000000228B3E30] [C0000000000034EC] decrementer_common+0xec/0x100.
<3>BUG: softlockup detectedon CPU#3!.
<4>Call Trace:.
<4>[C000000033BEE620][C00000000000F7F0] .show_stack+0x68/0x1b0
(unreliable).
<4>[C000000033BEE6C0] [C000000000094834].softlockup_tick+0xec/0x124.
<4>[C000000033BEE760] [C0000000000686DC] .run_local_timers+0x1c/0x30.
<4>[C000000033BEE7E0] [C000000000021AD4] .timer_interrupt+0xa8/0x47c.
<4>[C000000033BEE8C0] [C0000000000034EC] decrementer_common+0xec/0x100.
<4>--- Exception: 901 at .hpte_update+0x158/0x1d0.
<4> LR = .page_referenced_one+0xd8/0x188.
<4>[C000000033BEEBB0][C0000000000B4244] .page_check_address+0xcc/0x16c
(unreliable).
<4>[C000000033BEEC50] [C0000000000B4418] .page_referenced_one+0xd8/0x188.
<4>[C000000033BEED00] [C0000000000B5480] .page_referenced+0x90/0x180.
<4>[C000000033BEEDB0] [C0000000000A6908]
.shrink_inactive_list+0x1d8/0xa0c.
<4>[C000000033BEF020] [C0000000000A7248] .shrink_zone+0x10c/0x168.
<4>[C000000033BEF0C0][C0000000000A7FE8] .try_to_free_pages+0x1c8/0x320.
<4>[C000000033BEF1D0] [C0000000000A1954] .__alloc_pages+0x1ec/0x344.
<4>[C000000033BEF2C0] [C00000000009DE34].find_or_create_page+0x8c/0x10c.
<4>[C000000033BEF370] [C0000000000CBA78] .__getblk+0x130/0x2d0.
<4>[C000000033BEF420] [C0000000001672F0] .ext3_getblk+0xd8/0x2b0.
<4>[C000000033BEF520] [C00000000016CC54] .ext3_find_entry+0x344/0x608.
<4>[C000000033BEF6C0] [C00000000016EB54] .ext3_lookup+0x44/0x178.
<4>[C000000033BEF760] [C0000000000DB620].do_lookup+0xfc/0x22c.
<4>[C000000033BEF820] [C0000000000DDD9C] .__link_path_walk+0xb60/0x121c.
<4>[C000000033BEF8F0] [C0000000000DE4F4] .link_path_walk+0x9c/0x184.
<4>[C000000033BEFA30] [C0000000000DEAB8] .do_path_lookup+0x304/0x398.
<4>[C000000033BEFAE0]
[C0000000000DF728].__path_lookup_intent_open+0x70/0xd0.
<4>[C000000033BEFB90] [C0000000000DF974] .open_namei+0x94/0x820.
<4>[C000000033BEFC60] [C0000000000C6D20] .do_filp_open+0x38/0x70.
<4>[C000000033BEFD80] [C0000000000C6DCC] .do_sys_open+0x74/0x130.
<4>[C000000033BEFE30] [C00000000000871C]syscall_exit+0x0/0x40.
<3>BUG: soft lockup detected on CPU#5!.
<4>Call Trace:.
<4>[C00000005AB73B90] [C00000000000F7F0] .show_stack+0x68/0x1b0
(unreliable).
<4>[C00000005AB73C30] [C000000000094834] .softlockup_tick+0xec/0x124.
<4>[C00000005AB73CD0] [C0000000000686DC] .run_local_timers+0x1c/0x30.
<4>[C00000005AB73D50] [C000000000021AD4] .timer_interrupt+0xa8/0x47c.
<4>[C00000005AB73E30] [C0000000000034EC]decrementer_common+0xec/0x100.
<3>BUG: soft lockup detected on CPU#2!.
<4>Call Trace:.
<4>[C0000000228B7B90] [C00000000000F7F0] .show_stack+0x68/0x1b0
(unreliable).
<4>[C0000000228B7C30] [C000000000094834] .softlockup_tick+0xec/0x124.
<4>[C0000000228B7CD0] [C0000000000686DC].run_local_timers+0x1c/0x30.
<4>[C0000000228B7D50] [C000000000021AD4] .timer_interrupt+0xa8/0x47c.
<4>[C0000000228B7E30] [C0000000000034EC] decrementer_common+0xec/0x100.
<3>BUG:soft lockup detected on CPU#4!.
<4>Call Trace:.
<4>[C0000000531FBB90] [C00000000000F7F0] .show_stack+0x68/0x1b0
(unreliable).
<4>[C0000000531FBC30][C000000000094834] .softlockup_tick+0xec/0x124.
<4>[C0000000531FBCD0] [C0000000000686DC] .run_local_timers+0x1c/0x30.
<4>[C0000000531FBD50] [C000000000021AD4] .timer_interrupt+0xa8/0x47c.
<4>[C0000000531FBE30][C0000000000034EC] decrementer_common+0xec/0x100.
---------------------------
3:mon> t
[c00000000ffe3c30] c0000000002f122c .__handle_sysrq+0xf0/0x1cc
[c00000000ffe3ce0] c0000000002f3658 .hvc_poll+0x198/0x2cc
[c00000000ffe3dc0] c0000000002f37a0 .hvc_handle_interrupt+0x14/0x34
[c00000000ffe3e40] c000000000094c04 .handle_IRQ_event+0x7c/0xf8
[c00000000ffe3ef0] c000000000096b74 .handle_fasteoi_irq+0xe4/0x188
[c00000000ffe3f90] c000000000025130 .call_handle_irq+0x1c/0x2c
[c00000005b47bda0] c00000000000c78c .do_IRQ+0xf4/0x1a4
[c00000005b47be30] c0000000000041ec hardware_interrupt_entry+0xc/0x10
--- Exception: 501 (Hardware Interrupt) at 0000000010002524
SP (ffff99cea70) is in userspace
3:mon> e
cpu 0x3: Vector: 0 at [c00000000ffe3a30]
pc: c00000000004b134: .sysrq_handle_xmon+0x48/0x60
lr: c00000000004b134: .sysrq_handle_xmon+0x48/0x60
sp: c00000000ffe3ba0
msr: 8000000000001032
current = 0xc000000002e235f0
paca = 0xc0000000006b4900
pid = 16858, comm = waitpid13
3:mon> r
R00 = 0000000000000000 R16 = 0000000000000000
R01 = c00000000ffe3ba0 R17 = 0000000000000000
R02 = c000000000909c00 R18 = 0000000000000000
R03 = c00000000ffe3a30 R19 = 0000000000000000
R04 = c0000000009a4eb0 R20 = 0000000000000000
R05 = c0000000009a4ee0 R21 = 0000000000000000
R06 = c0000000008abec0 R22 = 0000000000000000
R07 = c0000000008ac148 R23 = 8000000000001032
R08 = c0000000008ac130 R24 = 8000000000001032
R09 = c0000000008ac178 R25 = c0000000038dee70
R10 = c0000000008ac160 R26 = 0000000000000000
R11 = 0000000000000000 R27 = 0000000000000078
R12 = c0000000009a4eb8 R28 = 0000000000000007
R13 = c0000000006b4900 R29 = 0000000000000000
R14 = 0000000000000000 R30 = c000000000780668
R15 = 0000000000000000 R31 = c00000000ffe3a30
pc = c00000000004b134 .sysrq_handle_xmon+0x48/0x60
lr = c00000000004b134 .sysrq_handle_xmon+0x48/0x60
msr = 8000000000001032 cr = 28000428
ctr = c00000000004f0e4 xer = 000000000000000f trap = 0
3:mon> c
cpus stopped: 0-5
3:mon> c0
0:mon> e
cpu 0x0: Vector: 501 (Hardware Interrupt) at [c0000000228b3ea0]
pc: 0000000010002524
lr: 0000000010002570
sp: ffff99cea70
msr: 800000000000d032
current = 0xc000000002e25770
paca = 0xc0000000006b4300
pid = 16866, comm = waitpid13
0:mon> t
SP (ffff99cea70) is in userspace
0:mon> c1
1:mon> e
cpu 0x1: Vector: 501 (Hardware Interrupt) at [c000000059f89ed0]
pc: c0000000000a4780: .release_pages+0xac/0x260
lr: c0000000000a5138: .__pagevec_release+0x28/0x48
sp: c000000059f8a150
msr: 8000000000009032
current = 0xc00000005f2b66b0
paca = 0xc0000000006b4500
pid = 16704, comm = shmctl01
1:mon> t
[c000000059f8a280] c0000000000a5138 .__pagevec_release+0x28/0x48
[c000000059f8a310] c0000000000a7074 .shrink_inactive_list+0x944/0xa0c
[c000000059f8a580] c0000000000a7248 .shrink_zone+0x10c/0x168
[c000000059f8a620] c0000000000a7fe8 .try_to_free_pages+0x1c8/0x320
[c000000059f8a730] c0000000000a1954 .__alloc_pages+0x1ec/0x344
[c000000059f8a820] c00000000009de34 .find_or_create_page+0x8c/0x10c
[c000000059f8a8d0] c0000000000cba78 .__getblk+0x130/0x2d0
[c000000059f8a980] c0000000000ce1e0 .__bread+0x20/0x124
[c000000059f8aa10] c000000000166280 .ext3_get_branch+0xa4/0x158
[c000000059f8aac0] c000000000166620 .ext3_get_blocks_handle+0xf8/0xcf0
[c000000059f8aca0] c0000000001675cc .ext3_get_block+0x104/0x14c
[c000000059f8ad50] c0000000000cef64 .block_read_full_page+0x12c/0x390
[c000000059f8b220] c0000000000f81bc .do_mpage_readpage+0x5cc/0x63c
[c000000059f8b720] c0000000000f882c .mpage_readpages+0xf0/0x1b4
[c000000059f8b8c0] c000000000166450 .ext3_readpages+0x28/0x40
[c000000059f8b940] c0000000000a3c10 .__do_page_cache_readahead+0x194/0x2f0
[c000000059f8ba90] c00000000009e01c .filemap_nopage+0x168/0x460
[c000000059f8bb60] c0000000000ace18 .__handle_mm_fault+0x544/0xee4
[c000000059f8bc50] c00000000002db24 .do_page_fault+0x408/0x5e8
[c000000059f8be30] c0000000000048e0 .handle_page_fault+0x20/0x54
--- Exception: 301 (Data Access) at 000004000000f4e0
SP (ffffd36e400) is in userspace
1:mon> c2
2:mon> t
SP (ffff99cea70) is in userspace
2:mon> c4
4:mon> t
SP (ffff99cea70) is in userspace
4:mon> e
cpu 0x4: Vector: 501 (Hardware Interrupt) at [c0000000531fbea0]
pc: 0000000010002524
lr: 0000000010002570
sp: ffff99cea70
msr: 800000000000d032
current = 0xc000000022f7b9f0
paca = 0xc0000000006b4b00
pid = 16863, comm = waitpid13
4:mon> c5
5:mon> e
cpu 0x5: Vector: 501 (Hardware Interrupt) at [c00000005ab73ea0]
pc: 0000000010002534
lr: 0000000010002570
sp: ffff99cea70
msr: 800000000000d032
current = 0xc00000001265b390
paca = 0xc0000000006b4d00
pid = 16862, comm = waitpid13
5:mon> t
SP (ffff99cea70) is in userspace
5:mon> c2
2:mon> e
cpu 0x2: Vector: 501 (Hardware Interrupt) at [c0000000228b7ea0]
pc: 0000000010002524
lr: 0000000010002570
sp: ffff99cea70
msr: 800000000000d032
current = 0xc00000001b7daab0
paca = 0xc0000000006b4700
pid = 16867, comm = waitpid13
2:mon> t
SP (ffff99cea70) is in userspace
2:mon>
Best Regards,
Shu Qing Yang
---------------------------
LTC Test, Linux Technology Center, China Systems & Technology Lab
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 2006-09-07 4:35 [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 Shu Qing Yang @ 2006-09-08 2:14 ` Andrew Morton 2006-09-08 11:36 ` Shu Qing Yang 0 siblings, 1 reply; 5+ messages in thread From: Andrew Morton @ 2006-09-08 2:14 UTC (permalink / raw) To: Shu Qing Yang; +Cc: linux-kernel On Thu, 7 Sep 2006 12:35:09 +0800 Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > Problem description: > I run pounder, scsi_debug on a machine. Then start 200 random syscall > test > simultaneously. Tens of minutes later, the system hang. What is "pounder" and from where can it be obtained? Running two tests at the same time complicates things. The next step should be to determine whether it is reproducible. If it is, then see if it is reproducible with just one test running (presumably pounder?) It would be helpful to provide sufficient information to give others a chance of reproducing it: amount of memory, method for configuring the scsi-debug "disks", method for invoking pounder, etc. > Hardware Environment > Cpu type :power5+ > Software Env: > kernel: 2.6.18-rc5 > Base system: opensuse10 > > Is the system (not just the application) hung? > Yes > > Did the system produce an OOPS message on the console? > No. > > Is the system sitting in a debugger right now? > Yes, xmon and sysrq are on. > > Additional information: > I use 'sysrq + t' then force system into xmon. And get following > message: Trace is a bit confusing. Ben (who is being shy) thinks it's this: > 0:mon> c1 > 1:mon> e > cpu 0x1: Vector: 501 (Hardware Interrupt) at [c000000059f89ed0] > pc: c0000000000a4780: .release_pages+0xac/0x260 > lr: c0000000000a5138: .__pagevec_release+0x28/0x48 > sp: c000000059f8a150 > msr: 8000000000009032 > current = 0xc00000005f2b66b0 > paca = 0xc0000000006b4500 > pid = 16704, comm = shmctl01 > 1:mon> t > [c000000059f8a280] c0000000000a5138 .__pagevec_release+0x28/0x48 > [c000000059f8a310] c0000000000a7074 .shrink_inactive_list+0x944/0xa0c > [c000000059f8a580] c0000000000a7248 .shrink_zone+0x10c/0x168 > [c000000059f8a620] c0000000000a7fe8 .try_to_free_pages+0x1c8/0x320 > [c000000059f8a730] c0000000000a1954 .__alloc_pages+0x1ec/0x344 > [c000000059f8a820] c00000000009de34 .find_or_create_page+0x8c/0x10c > [c000000059f8a8d0] c0000000000cba78 .__getblk+0x130/0x2d0 > [c000000059f8a980] c0000000000ce1e0 .__bread+0x20/0x124 > [c000000059f8aa10] c000000000166280 .ext3_get_branch+0xa4/0x158 > [c000000059f8aac0] c000000000166620 .ext3_get_blocks_handle+0xf8/0xcf0 > [c000000059f8aca0] c0000000001675cc .ext3_get_block+0x104/0x14c > [c000000059f8ad50] c0000000000cef64 .block_read_full_page+0x12c/0x390 > [c000000059f8b220] c0000000000f81bc .do_mpage_readpage+0x5cc/0x63c > [c000000059f8b720] c0000000000f882c .mpage_readpages+0xf0/0x1b4 > [c000000059f8b8c0] c000000000166450 .ext3_readpages+0x28/0x40 > [c000000059f8b940] c0000000000a3c10 .__do_page_cache_readahead+0x194/0x2f0 > [c000000059f8ba90] c00000000009e01c .filemap_nopage+0x168/0x460 > [c000000059f8bb60] c0000000000ace18 .__handle_mm_fault+0x544/0xee4 > [c000000059f8bc50] c00000000002db24 .do_page_fault+0x408/0x5e8 > [c000000059f8be30] c0000000000048e0 .handle_page_fault+0x20/0x54 Which indicates that a CPU is stuck in page reclaim. As a memory management/VM problem is suspected, a sysrq-M trace would be useful. That'll tell us whether the machine has exhausted physical memory and/or swapspace. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 2006-09-08 2:14 ` Andrew Morton @ 2006-09-08 11:36 ` Shu Qing Yang 2006-09-08 16:52 ` Andrew Morton 0 siblings, 1 reply; 5+ messages in thread From: Shu Qing Yang @ 2006-09-08 11:36 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@osdl.org> wrote on 2006-09-08 10:14:34: > On Thu, 7 Sep 2006 12:35:09 +0800 > Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > > > Problem description: > > I run pounder, scsi_debug on a machine. Then start 200 random syscall > > test > > simultaneously. Tens of minutes later, the system hang. > > What is "pounder" and from where can it be obtained? > Thanks for your reply. Pounder is part of ltp and locate in LTPROOT/testcases/pounder21. It is a suit of test cases including mem_alloc, random_syscall, bonnie++, etc. > Running two tests at the same time complicates things. The next step > should be to determine whether it is reproducible. If it is, then see if > it is reproducible with just one test running (presumably pounder?) > Running multiple cases simultaneously is to stress kernel more. And because of lack of machine resource I have no chance to reproduce it. > It would be helpful to provide sufficient information to give others a > chance of reproducing it: amount of memory, method for configuring the > scsi-debug "disks", method for invoking pounder, etc. > The machine belongs to IBM p-Series with power5+ cpu and 2GB memory. Run LTPROOT/testscript/ltp-scsi_debug.sh and LTPROOT/testscript/pounder21/pounder directly. No extra parameters. The command to load scsi_debug module is: modprobe scsi_debug max_luns=2 num_tgts=2 add_host=2 dev_size_mb=20 > > Hardware Environment > > Cpu type :power5+ > > Software Env: > > kernel: 2.6.18-rc5 > > Base system: opensuse10 > > > > Is the system (not just the application) hung? > > Yes > > > > Did the system produce an OOPS message on the console? > > No. > > > > Is the system sitting in a debugger right now? > > Yes, xmon and sysrq are on. > > > > Additional information: > > I use 'sysrq + t' then force system into xmon. And get following > > message: > > Trace is a bit confusing. Ben (who is being shy) thinks it's this: > > > 0:mon> c1 > > 1:mon> e > > cpu 0x1: Vector: 501 (Hardware Interrupt) at [c000000059f89ed0] > > pc: c0000000000a4780: .release_pages+0xac/0x260 > > lr: c0000000000a5138: .__pagevec_release+0x28/0x48 > > sp: c000000059f8a150 > > msr: 8000000000009032 > > current = 0xc00000005f2b66b0 > > paca = 0xc0000000006b4500 > > pid = 16704, comm = shmctl01 > > 1:mon> t > > [c000000059f8a280] c0000000000a5138 .__pagevec_release+0x28/0x48 > > [c000000059f8a310] c0000000000a7074 .shrink_inactive_list+0x944/0xa0c > > [c000000059f8a580] c0000000000a7248 .shrink_zone+0x10c/0x168 > > [c000000059f8a620] c0000000000a7fe8 .try_to_free_pages+0x1c8/0x320 > > [c000000059f8a730] c0000000000a1954 .__alloc_pages+0x1ec/0x344 > > [c000000059f8a820] c00000000009de34 .find_or_create_page+0x8c/0x10c > > [c000000059f8a8d0] c0000000000cba78 .__getblk+0x130/0x2d0 > > [c000000059f8a980] c0000000000ce1e0 .__bread+0x20/0x124 > > [c000000059f8aa10] c000000000166280 .ext3_get_branch+0xa4/0x158 > > [c000000059f8aac0] c000000000166620 .ext3_get_blocks_handle+0xf8/0xcf0 > > [c000000059f8aca0] c0000000001675cc .ext3_get_block+0x104/0x14c > > [c000000059f8ad50] c0000000000cef64 .block_read_full_page+0x12c/0x390 > > [c000000059f8b220] c0000000000f81bc .do_mpage_readpage+0x5cc/0x63c > > [c000000059f8b720] c0000000000f882c .mpage_readpages+0xf0/0x1b4 > > [c000000059f8b8c0] c000000000166450 .ext3_readpages+0x28/0x40 > > [c000000059f8b940] c0000000000a3c10 .__do_page_cache_readahead+0x194/0x2f0 > > [c000000059f8ba90] c00000000009e01c .filemap_nopage+0x168/0x460 > > [c000000059f8bb60] c0000000000ace18 .__handle_mm_fault+0x544/0xee4 > > [c000000059f8bc50] c00000000002db24 .do_page_fault+0x408/0x5e8 > > [c000000059f8be30] c0000000000048e0 .handle_page_fault+0x20/0x54 > > Which indicates that a CPU is stuck in page reclaim. > > As a memory management/VM problem is suspected, a sysrq-M trace would be > useful. That'll tell us whether the machine has exhausted physical memory > and/or swapspace. > I can not excute sysrq command now. But I can get memory allocation information from xmon, which indicates your guess may be right. 1:mon> mi Mem-info: DMA per-cpu: cpu 0 hot: high 6, batch 1 used:5 cpu 0 cold: high 2, batch 1 used:1 cpu 1 hot: high 6, batch 1 used:5 cpu 1 cold: high 2, batch 1 used:1 cpu 2 hot: high 6, batch 1 used:5 cpu 2 cold: high 2, batch 1 used:1 cpu 3 hot: high 6, batch 1 used:3 cpu 3 cold: high 2, batch 1 used:1 cpu 4 hot: high 6, batch 1 used:5 cpu 4 cold: high 2, batch 1 used:1 cpu 5 hot: high 6, batch 1 used:4 cpu 5 cold: high 2, batch 1 used:0 DMA32 per-cpu: empty Normal per-cpu: empty HighMem per-cpu: empty Free pages: 6976kB (0kB HighMem) Active:6141 inactive:11012 dirty:4742 writeback:0 unstable:0 free:109 slab:11925 mapped:7 pagetables:7061 DMA free:6976kB min:5760kB low:7168kB high:8640kB active:393024kB inactive:704768kB present:2097152kB pages_scanned:5172 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 HighMem free:0kB min:2048kB low:2048kB high:2048kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 DMA: 19*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB 0*8192kB 0*16384kB = 6976kB DMA32: empty Normal: empty HighMem: empty Swap cache: add 439156, delete 439156, find 50391/101032, race 26+79 Free swap = 0kB Total swap = 855552kB Free swap: 0kB 32768 pages of RAM 408 reserved pages 6834 pages shared 0 pages swap cached ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 2006-09-08 11:36 ` Shu Qing Yang @ 2006-09-08 16:52 ` Andrew Morton 2006-09-11 2:09 ` Shu Qing Yang 0 siblings, 1 reply; 5+ messages in thread From: Andrew Morton @ 2006-09-08 16:52 UTC (permalink / raw) To: Shu Qing Yang; +Cc: linux-kernel On Fri, 8 Sep 2006 19:36:40 +0800 Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > Andrew Morton <akpm@osdl.org> wrote on 2006-09-08 10:14:34: > > > On Thu, 7 Sep 2006 12:35:09 +0800 > > Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > > > > > Problem description: > > > I run pounder, scsi_debug on a machine. Then start 200 random > syscall > > > test > > > simultaneously. Tens of minutes later, the system hang. > > > > What is "pounder" and from where can it be obtained? > > > Thanks for your reply. > > Pounder is part of ltp and locate in LTPROOT/testcases/pounder21. > It is a suit of test cases including mem_alloc, random_syscall, bonnie++, > etc. OK, thanks. > > Running two tests at the same time complicates things. The next step > > should be to determine whether it is reproducible. If it is, then see > if > > it is reproducible with just one test running (presumably pounder?) > > > Running multiple cases simultaneously is to stress kernel more. And > because of > lack of machine resource I have no chance to reproduce it. > > > It would be helpful to provide sufficient information to give others a > > chance of reproducing it: amount of memory, method for configuring the > > scsi-debug "disks", method for invoking pounder, etc. > > > The machine belongs to IBM p-Series with power5+ cpu and 2GB memory. > Run LTPROOT/testscript/ltp-scsi_debug.sh and > LTPROOT/testscript/pounder21/pounder directly. > No extra parameters. The command to load scsi_debug module is: > modprobe scsi_debug max_luns=2 num_tgts=2 add_host=2 dev_size_mb=20 > > ... > > I can not excute sysrq command now. But I can get memory allocation > information from xmon, > which indicates your guess may be right. > > 1:mon> mi > Mem-info: > DMA per-cpu: > cpu 0 hot: high 6, batch 1 used:5 > cpu 0 cold: high 2, batch 1 used:1 > cpu 1 hot: high 6, batch 1 used:5 > cpu 1 cold: high 2, batch 1 used:1 > cpu 2 hot: high 6, batch 1 used:5 > cpu 2 cold: high 2, batch 1 used:1 > cpu 3 hot: high 6, batch 1 used:3 > cpu 3 cold: high 2, batch 1 used:1 > cpu 4 hot: high 6, batch 1 used:5 > cpu 4 cold: high 2, batch 1 used:1 > cpu 5 hot: high 6, batch 1 used:4 > cpu 5 cold: high 2, batch 1 used:0 > DMA32 per-cpu: empty > Normal per-cpu: empty > HighMem per-cpu: empty > Free pages: 6976kB (0kB HighMem) > Active:6141 inactive:11012 dirty:4742 writeback:0 unstable:0 free:109 > slab:11925 mapped:7 pagetables:7061 > DMA free:6976kB min:5760kB low:7168kB high:8640kB active:393024kB > inactive:704768kB present:2097152kB pages_scanned:5172 all_unreclaimable? > no > lowmem_reserve[]: 0 0 0 0 > DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB > present:0kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 0 0 0 > Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB > present:0kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 0 0 0 > HighMem free:0kB min:2048kB low:2048kB high:2048kB active:0kB inactive:0kB > present:0kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 0 0 0 > DMA: 19*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB 0*8192kB > 0*16384kB = 6976kB > DMA32: empty > Normal: empty > HighMem: empty > Swap cache: add 439156, delete 439156, find 50391/101032, race 26+79 > Free swap = 0kB > Total swap = 855552kB > Free swap: 0kB > 32768 pages of RAM > 408 reserved pages > 6834 pages shared > 0 pages swap cached So we ran out of memory and we ran out of swap. Possibly what has happened here is that the machine is doing a huge amount of work scanning pages and pretty soon it will enter the oom-killer to kill some userspace process. But before that happened, the softlockup detector triggered. But the machine _should_ have recovered. If it hung for more than a few seconds then that's bad behaviour. If it hung for more than a few minutes then that should be considered a bug. If it hung for ever then that's definitely a bug. Do you recall approximately how long the machine spent in this state? ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 2006-09-08 16:52 ` Andrew Morton @ 2006-09-11 2:09 ` Shu Qing Yang 0 siblings, 0 replies; 5+ messages in thread From: Shu Qing Yang @ 2006-09-11 2:09 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@osdl.org> wrote on 2006-09-09 00:52:41: > On Fri, 8 Sep 2006 19:36:40 +0800 > Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > > > Andrew Morton <akpm@osdl.org> wrote on 2006-09-08 10:14:34: > > > > > On Thu, 7 Sep 2006 12:35:09 +0800 > > > Shu Qing Yang <yangshuq@cn.ibm.com> wrote: > > > > > > > Problem description: > > > > I run pounder, scsi_debug on a machine. Then start 200 random > > syscall > > > > test > > > > simultaneously. Tens of minutes later, the system hang. > > > > > > What is "pounder" and from where can it be obtained? > > > > > Thanks for your reply. > > > > Pounder is part of ltp and locate in LTPROOT/testcases/pounder21. > > It is a suit of test cases including mem_alloc, random_syscall, bonnie++, > > etc. > > OK, thanks. > > > > Running two tests at the same time complicates things. The next step > > > should be to determine whether it is reproducible. If it is, then see > > if > > > it is reproducible with just one test running (presumably pounder?) > > > > > Running multiple cases simultaneously is to stress kernel more. And > > because of > > lack of machine resource I have no chance to reproduce it. > > > > > It would be helpful to provide sufficient information to give others a > > > chance of reproducing it: amount of memory, method for configuring the > > > scsi-debug "disks", method for invoking pounder, etc. > > > > > The machine belongs to IBM p-Series with power5+ cpu and 2GB memory. > > Run LTPROOT/testscript/ltp-scsi_debug.sh and > > LTPROOT/testscript/pounder21/pounder directly. > > No extra parameters. The command to load scsi_debug module is: > > modprobe scsi_debug max_luns=2 num_tgts=2 add_host=2 dev_size_mb=20 > > > > ... > > > > I can not excute sysrq command now. But I can get memory allocation > > information from xmon, > > which indicates your guess may be right. > > > > 1:mon> mi > > Mem-info: > > DMA per-cpu: > > cpu 0 hot: high 6, batch 1 used:5 > > cpu 0 cold: high 2, batch 1 used:1 > > cpu 1 hot: high 6, batch 1 used:5 > > cpu 1 cold: high 2, batch 1 used:1 > > cpu 2 hot: high 6, batch 1 used:5 > > cpu 2 cold: high 2, batch 1 used:1 > > cpu 3 hot: high 6, batch 1 used:3 > > cpu 3 cold: high 2, batch 1 used:1 > > cpu 4 hot: high 6, batch 1 used:5 > > cpu 4 cold: high 2, batch 1 used:1 > > cpu 5 hot: high 6, batch 1 used:4 > > cpu 5 cold: high 2, batch 1 used:0 > > DMA32 per-cpu: empty > > Normal per-cpu: empty > > HighMem per-cpu: empty > > Free pages: 6976kB (0kB HighMem) > > Active:6141 inactive:11012 dirty:4742 writeback:0 unstable:0 free:109 > > slab:11925 mapped:7 pagetables:7061 > > DMA free:6976kB min:5760kB low:7168kB high:8640kB active:393024kB > > inactive:704768kB present:2097152kB pages_scanned:5172 all_unreclaimable? > > no > > lowmem_reserve[]: 0 0 0 0 > > DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB > > present:0kB pages_scanned:0 all_unreclaimable? no > > lowmem_reserve[]: 0 0 0 0 > > Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB > > present:0kB pages_scanned:0 all_unreclaimable? no > > lowmem_reserve[]: 0 0 0 0 > > HighMem free:0kB min:2048kB low:2048kB high:2048kB active:0kB inactive:0kB > > present:0kB pages_scanned:0 all_unreclaimable? no > > lowmem_reserve[]: 0 0 0 0 > > DMA: 19*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB 0*8192kB > > 0*16384kB = 6976kB > > DMA32: empty > > Normal: empty > > HighMem: empty > > Swap cache: add 439156, delete 439156, find 50391/101032, race 26+79 > > Free swap = 0kB > > Total swap = 855552kB > > Free swap: 0kB > > 32768 pages of RAM > > 408 reserved pages > > 6834 pages shared > > 0 pages swap cached > > So we ran out of memory and we ran out of swap. > > Possibly what has happened here is that the machine is doing a huge amount > of work scanning pages and pretty soon it will enter the oom-killer to kill > some userspace process. But before that happened, the softlockup detector > triggered. > > But the machine _should_ have recovered. If it hung for more than a few > seconds then that's bad behaviour. If it hung for more than a few minutes > then that should be considered a bug. If it hung for ever then that's > definitely a bug. > > Do you recall approximately how long the machine spent in this state? System stayed in hang ten minutes at least. Then I forced it into xmon via sysrq. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2006-09-11 2:07 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-09-07 4:35 [Problem] System hang when I run pounder and syscall test on kernel 2.6.18-rc5 Shu Qing Yang 2006-09-08 2:14 ` Andrew Morton 2006-09-08 11:36 ` Shu Qing Yang 2006-09-08 16:52 ` Andrew Morton 2006-09-11 2:09 ` Shu Qing Yang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox