From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simon Guinot <simon.guinot@sequanux.org>
Subject: [PATCH 2/2] md: fix deadlock while suspending RAID array
Date: Tue, 11 Mar 2014 20:12:10 +0100
Message-ID: <1394565130-24233-3-git-send-email-simon.guinot@sequanux.org>
References: <1394565130-24233-1-git-send-email-simon.guinot@sequanux.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <1394565130-24233-1-git-send-email-simon.guinot@sequanux.org>
Sender: linux-kernel-owner@vger.kernel.org
To: Neil Brown
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Rémi Rérolle, Simon Guinot
List-Id: linux-raid.ids

Sometimes a deadlock happens while migrating a RAID array level using
the mdadm --grow command. In the following example, an ext4 filesystem
is installed over a RAID1 array and mdadm is used to transform this
array into a RAID5 one.

Here are the observed backtraces for the locked tasks:

jbd2/dm-0-8     D c0478384     0  9100      2 0x00000000
[] (__schedule+0x154/0x320) from [] (jbd2_journal_commit_transaction+0x1b0/0x132c)
[] (jbd2_journal_commit_transaction+0x1b0/0x132c) from [] (kjournald2+0x9c/0x200)
[] (kjournald2+0x9c/0x200) from [] (kthread+0xa4/0xb0)
[] (kthread+0xa4/0xb0) from [] (ret_from_fork+0x14/0x3c)

ext4lazyinit    D c0478384     0  9113      2 0x00000000
[] (__schedule+0x154/0x320) from [] (md_write_start+0xd8/0x194)
[] (md_write_start+0xd8/0x194) from [] (make_request+0x3c/0xc5c [raid1])
[] (make_request+0x3c/0xc5c [raid1]) from [] (md_make_request+0xe4/0x1f8)
[] (md_make_request+0xe4/0x1f8) from [] (generic_make_request+0xa8/0xc8)
[] (generic_make_request+0xa8/0xc8) from [] (submit_bio+0x80/0x12c)
[] (submit_bio+0x80/0x12c) from [] (__blkdev_issue_zeroout+0x134/0x1a0)
[] (__blkdev_issue_zeroout+0x134/0x1a0) from [] (blkdev_issue_zeroout+0x94/0xa0)
[] (blkdev_issue_zeroout+0x94/0xa0) from [] (ext4_init_inode_table+0x178/0x2cc)
[] (ext4_init_inode_table+0x178/0x2cc) from [] (ext4_lazyinit_thread+0xe8/0x288)
[] (ext4_lazyinit_thread+0xe8/0x288) from [] (kthread+0xa4/0xb0)
[] (kthread+0xa4/0xb0) from [] (ret_from_fork+0x14/0x3c)

mdadm           D c0478384     0 10163   9465 0x00000000
[] (__schedule+0x154/0x320) from [] (mddev_suspend+0x68/0xc0)
[] (mddev_suspend+0x68/0xc0) from [] (level_store+0x14c/0x59c)
[] (level_store+0x14c/0x59c) from [] (md_attr_store+0xac/0xdc)
[] (md_attr_store+0xac/0xdc) from [] (sysfs_write_file+0x100/0x168)
[] (sysfs_write_file+0x100/0x168) from [] (vfs_write+0xb8/0x184)
[] (vfs_write+0xb8/0x184) from [] (SyS_write+0x40/0x6c)
[] (SyS_write+0x40/0x6c) from [] (ret_fast_syscall+0x0/0x30)

This deadlock can be reproduced on different architectures (ARM and
x86) and with different Linux kernel versions: 3.14-rc and 3.10
stable.

The problem comes from the mddev_suspend() function, which doesn't
allow mddev->thread to complete the pending I/Os (mddev->active_io),
if any (a simplified sketch of the two blocking paths follows the
list):

1. mdadm holds mddev->reconfig_mutex before running mddev_suspend().
   If a write I/O is submitted while mdadm holds the mutex and the
   RAID array is not yet suspended, then mddev->thread is not able
   to complete the I/O: the superblock can't be updated because
   mddev->reconfig_mutex is not available.

   Note that having a write I/O over a "not suspended yet" RAID
   array is not a marginal scenario: to load a new RAID personality,
   level_store() calls request_module(), which is allowed to
   schedule. Moreover, on an SMP or a preemptible kernel, the odds
   are probably even greater.

2. In the same way, mddev_suspend() sets the mddev->suspended flag.
   Again, this may prevent mddev->thread from completing some
   pending I/Os when a superblock update is needed:
   md_check_recovery(), used by the RAID threads, does nothing but
   exit when the mddev->suspended flag is set. As a consequence,
   the superblock is never updated.
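For reference, here is a condensed sketch of the two code paths
described above, approximated from drivers/md/md.c of that era.
Locking, counters and unrelated details are elided; this is
illustrative, not the verbatim upstream code:

/*
 * Condensed sketch, not verbatim kernel code.
 *
 * Writer path (ext4lazyinit above): a write over a "not suspended
 * yet" array may request a superblock update from mddev->thread,
 * then sleep until that update is done.
 */
void md_write_start(struct mddev *mddev, struct bio *bi)
{
	if (mddev->in_sync) {
		/* The array must go dirty before the write is issued. */
		mddev->in_sync = 0;
		set_bit(MD_CHANGE_PENDING, &mddev->flags);
		md_wakeup_thread(mddev->thread);
	}
	/*
	 * Sleeps here forever in the scenario above: the update it
	 * waits for can only be performed by mddev->thread.
	 */
	wait_event(mddev->sb_wait,
		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
}

/*
 * mddev->thread side, called from the RAID personality threads.
 * Both deadlock causes are visible here.
 */
void md_check_recovery(struct mddev *mddev)
{
	/* Cause 2: once suspended is set, the thread just bails out. */
	if (mddev->suspended)
		return;
	/*
	 * Cause 1: the superblock update needs the mutex that mdadm
	 * already holds across mddev_suspend(), so the trylock below
	 * always fails.
	 */
	if (mddev_trylock(mddev)) {
		if (mddev->flags)
			md_update_sb(mddev, 0);
		mddev_unlock(mddev);
	}
}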
This patch solves these issues by ensuring there is no pending active
I/O before effectively suspending a RAID array.

Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Rémi Rérolle
---
 drivers/md/md.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fb4296adae80..ea3e95d1972b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -375,9 +375,22 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 void mddev_suspend(struct mddev *mddev)
 {
 	BUG_ON(mddev->suspended);
-	mddev->suspended = 1;
-	synchronize_rcu();
-	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+
+	for (;;) {
+		mddev->suspended = 1;
+		synchronize_rcu();
+		if (atomic_read(&mddev->active_io) == 0)
+			break;
+		mddev->suspended = 0;
+		synchronize_rcu();
+		/*
+		 * Note that mddev_unlock is also used to wake up mddev->thread.
+		 * This allows to complete the pending mddev->active_io.
+		 */
+		mddev_unlock(mddev);
+		wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+		mddev_lock_nointr(mddev);
+	}
 	mddev->pers->quiesce(mddev, 1);
 
 	del_timer_sync(&mddev->safemode_timer);
-- 
1.8.5.3