From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hans Kraus Subject: Re: Weird problem: mdadm blocks Date: Sun, 22 Dec 2013 17:11:36 +0100 Message-ID: <52B70F38.5040807@hanswkraus.com> References: <52B6AA66.5050502@hanswkraus.com> <20131222221914.78630829@notabene.brown> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20131222221914.78630829@notabene.brown> Sender: linux-raid-owner@vger.kernel.org To: NeilBrown Cc: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi Neil, thanks for your answer. The only process in 'D' state is the "ls -al /stor1" I wrote about. Dmesg has the following complaints: [40800.776543] INFO: task md127_raid5:18837 blocked for more than 120 seconds. [40800.776546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.776549] md127_raid5 D ffff880077c13780 0 18837 2 0x00000000 [40800.776554] ffff880054d1f1a0 0000000000000046 ffff88003efa3488 ffff88006b6717d0 [40800.776559] 0000000000013780 ffff880043f47fd8 ffff880043f47fd8 ffff880054d1f1a0 [40800.776564] 0000000000000246 ffffffff8134f209 ffff88003734a680 ffff88003734a400 [40800.776569] Call Trace: [40800.776574] [] ? _raw_spin_lock_irqsave+0x9/0x25 [40800.776585] [] ? md_super_wait+0x6a/0x80 [md_mod] [40800.776590] [] ? add_wait_queue+0x3c/0x3c [40800.776600] [] ? md_update_sb+0x382/0x474 [md_mod] [40800.776606] [] ? load_TLS+0x7/0xa [40800.776611] [] ? __switch_to+0x133/0x258 [40800.776621] [] ? md_check_recovery+0x218/0x514 [md_mod] [40800.776629] [] ? raid5d+0x1c/0x483 [raid456] [40800.776634] [] ? schedule_timeout+0x2c/0xdb [40800.776638] [] ? arch_local_irq_save+0x11/0x17 [40800.776642] [] ? arch_local_irq_save+0x11/0x17 [40800.776652] [] ? md_thread+0x114/0x132 [md_mod] [40800.776657] [] ? add_wait_queue+0x3c/0x3c [40800.776666] [] ? md_rdev_init+0xea/0xea [md_mod] [40800.776671] [] ? kthread+0x76/0x7e [40800.776676] [] ? kernel_thread_helper+0x4/0x10 [40800.776681] [] ? kthread_worker_fn+0x139/0x139 [40800.776686] [] ? gs_change+0x13/0x13 [40800.776689] INFO: task md127_resync:18866 blocked for more than 120 seconds. [40800.776692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.776695] md127_resync D ffff8800727441c0 0 18866 2 0x00000000 [40800.776701] ffff8800727441c0 0000000000000046 0000000000000000 ffff88001b3848b0 [40800.776705] 0000000000013780 ffff8800716d5fd8 ffff8800716d5fd8 ffff8800727441c0 [40800.776710] 0000000000000000 ffffffff81070fc1 0000000000000046 ffff880054ed9570 [40800.776715] Call Trace: [40800.776719] [] ? arch_local_irq_save+0x11/0x17 [40800.776726] [] ? get_active_stripe+0x24c/0x505 [raid456] [40800.776730] [] ? try_to_wake_up+0x197/0x197 [40800.776737] [] ? sync_request+0x26a/0x2de [raid456] [40800.776748] [] ? md_do_sync+0x76b/0xb6f [md_mod] [40800.776754] [] ? add_wait_queue+0x3c/0x3c [40800.776763] [] ? md_thread+0x114/0x132 [md_mod] [40800.776773] [] ? md_rdev_init+0xea/0xea [md_mod] [40800.776778] [] ? kthread+0x76/0x7e [40800.776782] [] ? kernel_thread_helper+0x4/0x10 [40800.776788] [] ? kthread_worker_fn+0x139/0x139 [40800.776792] [] ? gs_change+0x13/0x13 [40800.776797] INFO: task xfsbufd/dm-0:20797 blocked for more than 120 seconds. [40800.776801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.776804] xfsbufd/dm-0 D ffff880077c13780 0 20797 2 0x00000000 [40800.776809] ffff8800757b3550 0000000000000046 ffff880071fedd40 ffff88006b6717d0 [40800.776814] 0000000000013780 ffff88001b12ffd8 ffff88001b12ffd8 ffff8800757b3550 [40800.776819] 0000000000000246 ffffffff8134f209 ffff88003734a680 ffff88003734a400 [40800.776824] Call Trace: [40800.776828] [] ? _raw_spin_lock_irqsave+0x9/0x25 [40800.776838] [] ? md_write_start+0x133/0x149 [md_mod] [40800.776844] [] ? add_wait_queue+0x3c/0x3c [40800.776850] [] ? make_request+0x36/0x37a [raid456] [40800.776860] [] ? __split_and_process_bio+0x4f4/0x506 [dm_mod] [40800.776866] [] ? add_wait_queue+0x3c/0x3c [40800.776875] [] ? md_make_request+0xee/0x1db [md_mod] [40800.776881] [] ? generic_make_request+0x90/0xcf [40800.776885] [] ? submit_bio+0xd3/0xf1 [40800.776890] [] ? bio_alloc_bioset+0x43/0xb6 [40800.776910] [] ? _xfs_buf_ioapply+0x17a/0x1bb [xfs] [40800.776915] [] ? try_to_wake_up+0x197/0x197 [40800.776932] [] ? xfs_bdstrat_cb+0x4d/0x51 [xfs] [40800.776950] [] ? xfs_buf_iorequest+0x62/0x7b [xfs] [40800.776967] [] ? xfs_bdstrat_cb+0x4d/0x51 [xfs] [40800.776985] [] ? xfsbufd+0xe2/0x114 [xfs] [40800.776989] [] ? __schedule+0x5f9/0x610 [40800.777007] [] ? xfs_bdstrat_cb+0x51/0x51 [xfs] [40800.777012] [] ? kthread+0x76/0x7e [40800.777017] [] ? kernel_thread_helper+0x4/0x10 [40800.777023] [] ? kthread_worker_fn+0x139/0x139 [40800.777027] [] ? gs_change+0x13/0x13 [40800.777030] INFO: task xfsaild/dm-0:20798 blocked for more than 120 seconds. [40800.777034] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.777037] xfsaild/dm-0 D ffff88003754c300 0 20798 2 0x00000000 [40800.777042] ffff88003754c300 0000000000000046 0000000000000000 ffff88005d7c51a0 [40800.777047] 0000000000013780 ffff88001b01ffd8 ffff88001b01ffd8 ffff88003754c300 [40800.777052] ffff880071fede00 ffffffff81070fc1 0000000000000046 ffff88003734a400 [40800.777057] Call Trace: [40800.777061] [] ? arch_local_irq_save+0x11/0x17 [40800.777071] [] ? md_flush_request+0x96/0x111 [md_mod] [40800.777076] [] ? try_to_wake_up+0x197/0x197 [40800.777082] [] ? make_request+0x25/0x37a [raid456] [40800.777091] [] ? __split_and_process_bio+0x4f4/0x506 [dm_mod] [40800.777096] [] ? test_tsk_need_resched+0xa/0x13 [40800.777101] [] ? check_preempt_curr+0x52/0x5f [40800.777106] [] ? ttwu_do_wakeup+0x50/0xc4 [40800.777116] [] ? md_make_request+0xee/0x1db [md_mod] [40800.777121] [] ? generic_make_request+0x90/0xcf [40800.777126] [] ? submit_bio+0xd3/0xf1 [40800.777131] [] ? bio_alloc_bioset+0x69/0xb6 [40800.777149] [] ? _xfs_buf_ioapply+0x17a/0x1bb [xfs] [40800.777154] [] ? try_to_wake_up+0x197/0x197 [40800.777179] [] ? xlog_bdstrat+0x34/0x38 [xfs] [40800.777196] [] ? xfs_buf_iorequest+0x62/0x7b [xfs] [40800.777221] [] ? xlog_bdstrat+0x34/0x38 [xfs] [40800.777245] [] ? xlog_sync+0x1dd/0x2d4 [xfs] [40800.777269] [] ? xfs_ail_min_lsn+0xd/0x2b [xfs] [40800.777294] [] ? xlog_write+0x348/0x545 [xfs] [40800.777316] [] ? kmem_zone_zalloc+0x1b/0x2d [xfs] [40800.777341] [] ? xlog_cil_push+0x1e5/0x2fb [xfs] [40800.777366] [] ? xlog_cil_force_lsn+0x1d/0x86 [xfs] [40800.777391] [] ? _xfs_log_force+0x4e/0x1ae [xfs] [40800.777416] [] ? xfs_log_force+0xb/0x2c [xfs] [40800.777440] [] ? xfsaild+0xf4/0x46b [xfs] [40800.777465] [] ? xfs_trans_ail_cursor_first+0x79/0x79 [xfs] [40800.777470] [] ? kthread+0x76/0x7e [40800.777475] [] ? kernel_thread_helper+0x4/0x10 [40800.777481] [] ? kthread_worker_fn+0x139/0x139 [40800.777485] [] ? gs_change+0x13/0x13 [40800.777489] INFO: task flush-253:0:20958 blocked for more than 120 seconds. [40800.777492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.777495] flush-253:0 D ffff880077c13780 0 20958 2 0x00000000 [40800.777500] ffff88001b7ecf20 0000000000000046 0000000000011200 ffff8800727441c0 [40800.777505] 0000000000013780 ffff88001b0b3fd8 ffff88001b0b3fd8 ffff88001b7ecf20 [40800.777510] 0000000000000246 ffffffff8134f209 ffff88003734a680 ffff88003734a400 [40800.777515] Call Trace: [40800.777519] [] ? _raw_spin_lock_irqsave+0x9/0x25 [40800.777531] [] ? md_write_start+0x133/0x149 [md_mod] [40800.777536] [] ? add_wait_queue+0x3c/0x3c [40800.777542] [] ? make_request+0x36/0x37a [raid456] [40800.777552] [] ? __split_and_process_bio+0x4f4/0x506 [dm_mod] [40800.777562] [] ? md_make_request+0xee/0x1db [md_mod] [40800.777567] [] ? generic_make_request+0x90/0xcf [40800.777572] [] ? submit_bio+0xd3/0xf1 [40800.777577] [] ? __mark_inode_dirty+0x58/0x17a [40800.777595] [] ? xfs_submit_ioend+0x99/0xd9 [xfs] [40800.777612] [] ? xfs_vm_writepage+0x368/0x3e1 [xfs] [40800.777618] [] ? __writepage+0xa/0x21 [40800.777622] [] ? write_cache_pages+0x1f8/0x2e9 [40800.777628] [] ? set_page_dirty_lock+0x2b/0x2b [40800.777633] [] ? generic_writepages+0x3a/0x52 [40800.777639] [] ? writeback_single_inode+0x11d/0x2cc [40800.777644] [] ? writeback_sb_inodes+0x16b/0x204 [40800.777650] [] ? __writeback_inodes_wb+0x6d/0xab [40800.777655] [] ? wb_writeback+0x128/0x21f [40800.777660] [] ? determine_dirtyable_memory+0x10/0x17 [40800.777665] [] ? wb_do_writeback+0x189/0x1a8 [40800.777671] [] ? bdi_writeback_thread+0x85/0x1e6 [40800.777676] [] ? wb_do_writeback+0x1a8/0x1a8 [40800.777681] [] ? kthread+0x76/0x7e [40800.777686] [] ? kernel_thread_helper+0x4/0x10 [40800.777691] [] ? kthread_worker_fn+0x139/0x139 [40800.777695] [] ? gs_change+0x13/0x13 [40800.777703] INFO: task BackupPC_dump:8965 blocked for more than 120 seconds. [40800.777706] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.777710] BackupPC_dump D ffff88005d7c51a0 0 8965 8846 0x00000000 [40800.777715] ffff88005d7c51a0 0000000000000086 ffff88005e635b40 ffff88006b2a82c0 [40800.777720] 0000000000013780 ffff8800596c9fd8 ffff8800596c9fd8 ffff88005d7c51a0 [40800.777725] 0000000000000000 0000000000000000 ffff880050f9bd40 7fffffffffffffff [40800.777730] Call Trace: [40800.777734] [] ? schedule_timeout+0x2c/0xdb [40800.777739] [] ? _atomic_dec_and_lock+0x1/0x48 [40800.777744] [] ? wait_for_common+0xa0/0x119 [40800.777748] [] ? try_to_wake_up+0x197/0x197 [40800.777766] [] ? xfs_buf_read+0x88/0xbe [xfs] [40800.777791] [] ? xfs_trans_read_buf+0x4a/0x310 [xfs] [40800.777808] [] ? xfs_buf_iowait+0x44/0x81 [xfs] [40800.777826] [] ? xfs_buf_read+0x88/0xbe [xfs] [40800.777850] [] ? xfs_trans_read_buf+0x4a/0x310 [xfs] [40800.777875] [] ? xfs_imap_to_bp+0x40/0x100 [xfs] [40800.777899] [] ? xfs_iread+0x54/0x177 [xfs] [40800.777917] [] ? xfs_inode_alloc+0x73/0xe9 [xfs] [40800.777936] [] ? xfs_iget+0x37c/0x56c [xfs] [40800.777958] [] ? xfs_lookup+0xa4/0xd3 [xfs] [40800.777977] [] ? xfs_vn_lookup+0x3f/0x7e [xfs] [40800.777983] [] ? d_alloc_and_lookup+0x3a/0x60 [40800.777988] [] ? walk_component+0x219/0x406 [40800.777993] [] ? link_path_walk+0x174/0x421 [40800.777998] [] ? path_lookupat+0x53/0x2bd [40800.778002] [] ? should_resched+0x5/0x23 [40800.778006] [] ? should_resched+0x5/0x23 [40800.778010] [] ? _cond_resched+0x7/0x1c [40800.778014] [] ? do_path_lookup+0x1c/0x87 [40800.778019] [] ? user_path_at_empty+0x47/0x7b [40800.778024] [] ? timerqueue_add+0x80/0xa0 [40800.778029] [] ? set_next_entity+0x32/0x55 [40800.778034] [] ? __switch_to+0x1e5/0x258 [40800.778039] [] ? vfs_fstatat+0x32/0x60 [40800.778043] [] ? sys_newstat+0x12/0x2b [40800.778048] [] ? system_call_fastpath+0x16/0x1b [40800.778053] INFO: task kworker/0:0:12935 blocked for more than 120 seconds. [40800.778056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.778059] kworker/0:0 D ffff880077c13780 0 12935 2 0x00000000 [40800.778064] ffff88001b0c4f60 0000000000000046 ffffffff81788740 ffff88006b7e6340 [40800.778069] 0000000000013780 ffff880000cabfd8 ffff880000cabfd8 ffff88001b0c4f60 [40800.778074] 0000000000000246 ffffffff8134f209 ffff88003734a680 ffff88003734a400 [40800.778079] Call Trace: [40800.778083] [] ? _raw_spin_lock_irqsave+0x9/0x25 [40800.778095] [] ? md_write_start+0x133/0x149 [md_mod] [40800.778100] [] ? add_wait_queue+0x3c/0x3c [40800.778106] [] ? make_request+0x36/0x37a [raid456] [40800.778111] [] ? kmem_cache_alloc+0x86/0xea [40800.778121] [] ? md_make_request+0xee/0x1db [md_mod] [40800.778126] [] ? generic_make_request+0x90/0xcf [40800.778135] [] ? __split_and_process_bio+0x265/0x506 [dm_mod] [40800.778140] [] ? _raw_spin_unlock_irqrestore+0xe/0xf [40800.778144] [] ? try_to_wake_up+0x187/0x197 [40800.778154] [] ? dm_wq_work+0x8c/0xab [dm_mod] [40800.778158] [] ? process_one_work+0x161/0x269 [40800.778163] [] ? worker_thread+0xc2/0x145 [40800.778167] [] ? manage_workers.isra.25+0x15b/0x15b [40800.778172] [] ? kthread+0x76/0x7e [40800.778177] [] ? kernel_thread_helper+0x4/0x10 [40800.778182] [] ? kthread_worker_fn+0x139/0x139 [40800.778186] [] ? gs_change+0x13/0x13 [40800.778190] INFO: task kworker/0:2:19557 blocked for more than 120 seconds. [40800.778193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [40800.778196] kworker/0:2 D ffff880077c13780 0 19557 2 0x00000000 [40800.778202] ffff88006b7e6340 0000000000000046 0000000000000000 ffff880054d1f1a0 [40800.778206] 0000000000013780 ffff88000005ffd8 ffff88000005ffd8 ffff88006b7e6340 [40800.778211] ffff880071fede00 ffffffff81070fc1 0000000000000046 ffff88003734a400 [40800.778216] Call Trace: [40800.778221] [] ? arch_local_irq_save+0x11/0x17 [40800.778230] [] ? md_flush_request+0x96/0x111 [md_mod] [40800.778235] [] ? try_to_wake_up+0x197/0x197 [40800.778241] [] ? make_request+0x25/0x37a [raid456] [40800.778250] [] ? __split_and_process_bio+0x4f4/0x506 [dm_mod] [40800.778255] [] ? test_tsk_need_resched+0xa/0x13 [40800.778260] [] ? check_preempt_curr+0x52/0x5f [40800.778264] [] ? ttwu_do_wakeup+0x50/0xc4 [40800.778274] [] ? md_make_request+0xee/0x1db [md_mod] [40800.778279] [] ? generic_make_request+0x90/0xcf [40800.778284] [] ? submit_bio+0xd3/0xf1 [40800.778289] [] ? bio_alloc_bioset+0x69/0xb6 [40800.778308] [] ? _xfs_buf_ioapply+0x17a/0x1bb [xfs] [40800.778312] [] ? try_to_wake_up+0x197/0x197 [40800.778337] [] ? xlog_bdstrat+0x34/0x38 [xfs] [40800.778354] [] ? xfs_buf_iorequest+0x62/0x7b [xfs] [40800.778379] [] ? xlog_bdstrat+0x34/0x38 [xfs] [40800.778403] [] ? xlog_sync+0x1dd/0x2d4 [xfs] [40800.778428] [] ? xfs_ail_min_lsn+0xd/0x2b [xfs] [40800.778452] [] ? xlog_write+0x348/0x545 [xfs] [40800.778477] [] ? xlog_cil_push+0x1e5/0x2fb [xfs] [40800.778482] [] ? arch_local_irq_save+0x11/0x17 [40800.778507] [] ? xlog_cil_force_lsn+0x1d/0x86 [xfs] [40800.778531] [] ? _xfs_log_force_lsn+0x63/0x205 [xfs] [40800.778556] [] ? xfs_trans_commit+0x10a/0x205 [xfs] [40800.778577] [] ? xfs_sync_worker+0x3a/0x6a [xfs] [40800.778581] [] ? process_one_work+0x161/0x269 [40800.778586] [] ? worker_thread+0xc2/0x145 [40800.778590] [] ? manage_workers.isra.25+0x15b/0x15b [40800.778595] [] ? kthread+0x76/0x7e [40800.778600] [] ? kernel_thread_helper+0x4/0x10 [40800.778605] [] ? kthread_worker_fn+0x139/0x139 [40800.778609] [] ? gs_change+0x13/0x13 Regards, Hans Am 22.12.2013 12:19, schrieb NeilBrown: > On Sun, 22 Dec 2013 10:01:26 +0100 Hans Kraus wrote: > >> Hi, >> >> my backup system (running backuppc) has developed a weird problem: >> calls from command line "mdadm --detail /dev/mdX" block, for every >> existing raid on the system, and can be only terminated with ^C. This >> is true even for the newest mdadm built from git. >> >> "cat /proc/mdstat" blocks too. All mounted raids are working (at least >> ls is), exept for one, md127 (the storage of backuppc). >> There ls is blocking and is not terminable by ^C. >> >> The raid structure is the following: >> md2, md3, m4 raid1 for swap, /boot, / >> md30 raid0 for short term storage >> md10, md11, md12, md13 raid0, built from 2x 2TB or 1TB + 3TB drives >> md127 raid5 built from md10, md11, md12, md13 >> >> I recently (some 12 hours ago) added md13 again and the system was >> rebuilding from a degraded state. The file system on md127 is xfs. All >> the physical rives are OK, at least according to smartmontools. >> Webmin 1.660 reports: >> CPU load averages 16.96 (1 min) 15.04 (5 mins) 12.67 (15 mins) >> CPU usage 0% user, 1% kernel, 99% IO, 0% idle >> >> Is there any way to diagnose the problem further? I'm reluctant to >> do a reboot. > > Either some process has crashed leaving an 'oops' or 'bug' message in the > kernel logs, or some process is stuck in 'D' state in 'ps'. > > So: > 1/ look through kernel logs since boot (e.g. output of 'dmesg', though that > might not be complete) for anything unusual - there should be a stack > trace. > 2/ if there is a process in 'D' state, find how which and get a stack trace > of it. Possibly by > echo w > /proc/sysrq-trigger > or > cat /proc/$PID/stack > or event > echo t > /proc/sysrq-trigger > (though that might create lots of output that might be hard to capture). > > NeilBrown >