From: NeilBrown
Subject: Re: Raid5 hang in 3.14.19
Date: Wed, 1 Oct 2014 08:54:09 +1000
Message-ID: <20141001085409.41399b13@notabene.brown>
References: <5425E9D6.1050102@sbcglobal.net> <20140929122533.3b91a543@notabene.brown> <5428D863.7090409@sbcglobal.net> <20140929140818.1086972e@notabene.brown> <5428DFE1.9080600@sbcglobal.net> <20140930075950.1d1e3865@notabene.brown> <542B1EDC.7060803@sbcglobal.net>
In-Reply-To: <542B1EDC.7060803@sbcglobal.net>
To: BillStuff
Cc: linux-raid

On Tue, 30 Sep 2014 16:21:32 -0500 BillStuff wrote:

> On 09/29/2014 04:59 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff wrote:
> >
> >> On 09/28/2014 11:08 PM, NeilBrown wrote:
> >>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff wrote:
> >>>
> >>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff wrote:
> >>>>>
> >>>>>> Hi Neil,
> >>>>>>
> >>>>>> I found something that looks similar to the problem described in
> >>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>>>
> >>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>>>
> >>>>>> on this array:
> >>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>>>       104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>>>       bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>>>
> >>>>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>>>> disk add / remove / check loops, on both this array and a raid1 array.
> >>>>>>
> >>>>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>>>> but then after 5 days the raid5 array hung up with this in dmesg:
> >>>>> I think this is different to the workqueue problem you mentioned, though as I
> >>>>> don't know exactly what caused either I cannot be certain.
> >>>>>
> >>>>> From the data you provided it looks like everything is waiting on
> >>>>> get_active_stripe(), or on a process that is waiting on that.
> >>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>>>
> >>>>> The md3_raid5 task is listed as blocked, but no stack trace is given.
> >>>>> If the machine is still in that state, then
> >>>>>
> >>>>>    cat /proc/1698/stack
> >>>>>
> >>>>> might be useful.
> >>>>> (echo t > /proc/sysrq-trigger is always a good idea)
> >>>> Might this help? I believe the array was doing a "check" when things
> >>>> hung up.
> >>> It looks like it was trying to start doing a 'check'.
> >>> The 'resync' thread hadn't been started yet.
> >>> What is 'kthreadd' doing?
> >>> My guess is that it is in try_to_free_pages() waiting for writeout
> >>> of some xfs file page onto the md array ... which won't progress until
> >>> the thread gets started.
> >>>
> >>> That would suggest that we need an async way to start threads...
> >>>
> >>> Thanks,
> >>> NeilBrown
> >>>
> >> I suspect your guess is correct:
> > Thanks for the confirmation.
> >
> > I'm thinking of something like this. It is only lightly tested - it didn't
> > instantly crash.
> >
> > If you were to apply this patch and run your test for a week or two, that
> > would increase my confidence (though of course testing doesn't prove the
> > absence of bugs....)
> >
> > Thanks,
> > NeilBrown
> >
>
> Neil,
>
> It locked up already, but in a different way.
> In this case it was triggered by a re-add in a fail / remove / add loop.
>
> It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync
> before it got fully started.
>
>
> INFO: task md3_raid5:1698 blocked for more than 120 seconds.
> Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
> c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
> 00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
> Call Trace:
> [] ? __enqueue_entity+0x6d/0x80
> [] ? enqueue_task_fair+0x2d3/0x660
> [] schedule+0x23/0x60
> [] schedule_timeout+0x145/0x1c0
> [] ? default_wake_function+0x10/0x20
> [] ? update_rq_clock.part.92+0x18/0x50
> [] ? check_preempt_curr+0x65/0x90
> [] ? ttwu_do_wakeup+0x18/0x120
> [] wait_for_common+0x9b/0x110
> [] ? wake_up_process+0x40/0x40
> [] wait_for_completion+0x17/0x20
> [] kthread_stop+0x41/0xb0
> [] md_unregister_thread+0x31/0x40
> [] md_reap_sync_thread+0x19/0x140
> [] md_check_recovery+0xe8/0x480
> [] raid5d+0x20/0x4c0 [raid456]
> [] ? try_to_del_timer_sync+0x42/0x60
> [] ? del_timer_sync+0x41/0x50
> [] ? schedule_timeout+0xfd/0x1c0
> [] ? detach_if_pending+0xa0/0xa0
> [] ? finish_wait+0x44/0x60
> [] md_thread+0xe8/0x100
> [] ? __wake_up_sync+0x20/0x20
> [] ? md_start_sync+0xb0/0xb0
> [] kthread+0xa1/0xc0
> [] ret_from_kernel_thread+0x1b/0x28
> [] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
> Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
> c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
> c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
> Call Trace:
> [] ? __wake_up+0x3f/0x50
> [] ? md_wakeup_thread+0x2d/0x30
> [] ? raid5_unplug+0xbb/0x120 [raid456]
> [] ? raid5_unplug+0xbb/0x120 [raid456]
> [] schedule+0x23/0x60
> [] schedule_timeout+0x145/0x1c0
> [] ? blk_finish_plug+0x12/0x40
> [] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
> [] wait_for_common+0x9b/0x110
> [] ? wake_up_process+0x40/0x40
> [] wait_for_completion+0x17/0x20
> [] xfs_buf_iowait+0x50/0xb0 [xfs]
> [] ? _xfs_buf_read+0x37/0x40 [xfs]
> [] _xfs_buf_read+0x37/0x40 [xfs]
> [] xfs_buf_read_map+0x85/0xe0 [xfs]
> [] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
> [] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
> [] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
> [] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
> [] xfs_free_extent+0x8e/0x100 [xfs]
> [] ? kmem_cache_alloc+0xae/0xf0
> [] xfs_bmap_finish+0x13c/0x190 [xfs]
> [] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
> [] xfs_free_eofblocks+0x243/0x2e0 [xfs]
> [] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
> [] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
> [] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [] ? radix_tree_gang_lookup_tag+0x71/0xb0
> [] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
> [] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
> [] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
> [] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
> [] xfs_eofblocks_worker+0x18/0x30 [xfs]
> [] process_one_work+0x10c/0x340
> [] worker_thread+0x101/0x330
> [] ? manage_workers.isra.27+0x250/0x250
> [] kthread+0xa1/0xc0
> [] ret_from_kernel_thread+0x1b/0x28
> [] ? kthread_create_on_node+0x110/0x110
> INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
> Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
> c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
> c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
> Call Trace:
> [] ? __wake_up+0x3f/0x50
> [] ? md_wakeup_thread+0x2d/0x30
> [] ? raid5_unplug+0xbb/0x120 [raid456]
> [] ? raid5_unplug+0xbb/0x120 [raid456]
> [] schedule+0x23/0x60
> [] schedule_timeout+0x145/0x1c0
> [] ? blk_finish_plug+0x12/0x40
> [] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
> [] wait_for_common+0x9b/0x110
> [] ? wake_up_process+0x40/0x40
> [] wait_for_completion+0x17/0x20
> [] xfs_buf_iowait+0x50/0xb0 [xfs]
> [] ? _xfs_buf_read+0x37/0x40 [xfs]
> [] _xfs_buf_read+0x37/0x40 [xfs]
> [] xfs_buf_read_map+0x85/0xe0 [xfs]
> [] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
> [] xfs_imap_to_bp+0x67/0xe0 [xfs]
> [] xfs_iread+0x7d/0x3e0 [xfs]
> [] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
> [] xfs_iget+0x192/0x580 [xfs]
> [] ? kmem_free+0x19/0x50 [xfs]
> [] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
> [] xfs_bulkstat_one+0x34/0x40 [xfs]
> [] ? xfs_internal_inum+0xa0/0xa0 [xfs]
> [] xfs_bulkstat+0x400/0x870 [xfs]
> [] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
> [] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
> [] ? xfs_ioc_swapext+0x160/0x160 [xfs]
> [] xfs_file_ioctl+0x783/0xa60 [xfs]
> [] ? __update_cpu_load+0xab/0xd0
> [] ? hrtimer_forward+0xa8/0x1b0
> [] ? timerqueue_add+0x50/0xb0
> [] ? ktime_get+0x53/0xe0
> [] ? clockevents_program_event+0x95/0x130
> [] ? xfs_ioc_swapext+0x160/0x160 [xfs]
> [] do_vfs_ioctl+0x2e2/0x4c0
> [] ? tick_program_event+0x29/0x30
> [] ? hrtimer_interrupt+0x13c/0x2a0
> [] ? lockref_put_or_lock+0xb/0x30
> [] SyS_ioctl+0x78/0x80
> [] syscall_call+0x7/0x7
> INFO: task md3_resync:24817 blocked for more than 120 seconds.
> Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
> c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
> 00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
> Call Trace:
> [] ? queue_unplugged+0x45/0x90
> [] ? prepare_to_wait_event+0x6b/0xd0
> [] schedule+0x23/0x60
> [] md_do_sync+0x8dd/0x1010
> [] ? __wake_up_sync+0x20/0x20
> [] ? md_start_sync+0xb0/0xb0
> [] md_thread+0xe8/0x100
> [] ? __wake_up_locked+0x1f/0x30
> [] ? md_start_sync+0xb0/0xb0
> [] kthread+0xa1/0xc0
> [] ret_from_kernel_thread+0x1b/0x28
> [] ? kthread_create_on_node+0x110/0x110
>
> Again, the blocked task info is edited for space; hopefully these are
> the important ones, but I've got more if it helps.

Thanks for the testing!  You have included enough information.

I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed to do the right thing.  It doesn't.

If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find that
->sync_thread is still NULL, and so will clear MD_RECOVERY_RUNNING.
The next time it runs, that flag is still clear and ->sync_thread is not
NULL, so it will try to stop the thread, which deadlocks.

This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 211a615ae415..3df888b80e76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7850,7 +7850,6 @@ void md_check_recovery(struct mddev *mddev)
 
 	if (mddev_trylock(mddev)) {
 		int spares = 0;
-		bool sync_starting = false;
 
 		if (mddev->ro) {
 			/* On a read-only array we can:
@@ -7914,7 +7913,7 @@ void md_check_recovery(struct mddev *mddev)
 
 		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
-			goto unlock;
+			goto not_running;
 		/* no recovery is running.
 		 * remove any failed drives, then
 		 * add spares if possible.
@@ -7926,7 +7925,7 @@ void md_check_recovery(struct mddev *mddev)
 			if (mddev->pers->check_reshape == NULL ||
 			    mddev->pers->check_reshape(mddev) != 0)
 				/* Cannot proceed */
-				goto unlock;
+				goto not_running;
 			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if ((spares = remove_and_add_spares(mddev, NULL))) {
@@ -7939,7 +7938,7 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 			/* nothing to be done ... */
-			goto unlock;
+			goto not_running;
 
 		if (mddev->pers->sync_request) {
 			if (spares) {
@@ -7951,18 +7950,18 @@ void md_check_recovery(struct mddev *mddev)
 			}
 			INIT_WORK(&mddev->del_work, md_start_sync);
 			queue_work(md_misc_wq, &mddev->del_work);
-			sync_starting = true;
+			goto unlock;
 		}
-	unlock:
-		wake_up(&mddev->sb_wait);
-
-		if (!mddev->sync_thread && !sync_starting) {
+	not_running:
+		if (!mddev->sync_thread) {
 			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
 					       &mddev->recovery))
 				if (mddev->sysfs_action)
 					sysfs_notify_dirent_safe(mddev->sysfs_action);
 		}
+	unlock:
+		wake_up(&mddev->sb_wait);
 		mddev_unlock(mddev);
 	}
 }
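
To make the ordering described above easier to follow, here is a minimal
userspace sketch.  It is not kernel code and not part of the patch: the
variables recovery_running, sync_thread and work_pending, and the functions
check_recovery() and md_start_sync_work(), are hypothetical stand-ins that
only mirror their md.c counterparts, and the program replays the problematic
call sequence deterministically rather than racing real threads.

/* Minimal userspace model (not kernel code) of the 'sync_starting' flaw. */
#include <stdbool.h>
#include <stdio.h>

static bool recovery_running;	/* models MD_RECOVERY_RUNNING           */
static bool sync_thread;	/* models mddev->sync_thread != NULL    */
static bool work_pending;	/* models the queued md_start_sync work */

/* Deferred worker: this is what finally creates the sync thread. */
static void md_start_sync_work(void)
{
	work_pending = false;
	sync_thread = true;
}

/* Simplified stand-in for md_check_recovery() with the buggy local flag. */
static void check_recovery(bool want_sync)
{
	bool sync_starting = false;

	if (want_sync && !sync_thread && !recovery_running) {
		recovery_running = true;
		work_pending = true;	/* models queue_work(md_misc_wq, ...) */
		sync_starting = true;
	}

	/*
	 * The flaw: on a later call sync_starting is false again, and
	 * sync_thread is still NULL because the work has not run yet,
	 * so MD_RECOVERY_RUNNING is cleared while a start is pending.
	 */
	if (!sync_thread && !sync_starting)
		recovery_running = false;
}

int main(void)
{
	check_recovery(true);	/* queues the deferred start, sets RUNNING   */
	check_recovery(false);	/* runs again before the work: clears RUNNING */
	md_start_sync_work();	/* the sync thread finally appears           */

	/* RUNNING clear but sync_thread set: the state that would send
	 * md_check_recovery() into md_reap_sync_thread()/kthread_stop(). */
	printf("MD_RECOVERY_RUNNING=%d sync_thread=%d\n",
	       recovery_running, sync_thread);
	return 0;
}

Built with any C compiler and run, this prints MD_RECOVERY_RUNNING=0
sync_thread=1: the recovery flag has been cleared even though a sync thread
now exists, which matches the reap-before-fully-started state seen in the
md3_raid5 trace above.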