From mboxrd@z Thu Jan 1 00:00:00 1970 From: NeilBrown Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared Date: Mon, 11 Sep 2017 13:36:25 +1000 Message-ID: <871snddhza.fsf@notabene.neil.brown.name> References: <546311999.4473128.1504339295016.JavaMail.zimbra@redhat.com> <87y3pvc9ha.fsf@notabene.neil.brown.name> <34fedde7-cef9-34ff-1403-9d097267eb55@redhat.com> <87k21ec4fn.fsf@notabene.neil.brown.name> <3a5955de-e6a1-de83-b00b-1984f7125799@redhat.com> <624049285.8379021.1504748277805.JavaMail.zimbra@redhat.com> <87bmmnax1y.fsf@notabene.neil.brown.name> <310834117.9518865.1505088886429.JavaMail.zimbra@redhat.com> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Return-path: In-Reply-To: <310834117.9518865.1505088886429.JavaMail.zimbra@redhat.com> Sender: linux-raid-owner@vger.kernel.org To: Xiao Ni Cc: linux-raid , shli@kernel.org List-Id: linux-raid.ids --=-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable On Sun, Sep 10 2017, Xiao Ni wrote: > ----- Original Message ----- >> From: "NeilBrown" >> To: "Xiao Ni" , "linux-raid" >> Cc: shli@kernel.org >> Sent: Thursday, September 7, 2017 1:37:45 PM >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't = be cleared >>=20 >> On Wed, Sep 06 2017, Xiao Ni wrote: >>=20 >> > ----- Original Message ----- >> >> From: "Xiao Ni" >> >> To: "NeilBrown" , "linux-raid" >> >> >> >> Cc: shli@kernel.org >> >> Sent: Tuesday, September 5, 2017 10:15:00 AM >> >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can= 't be >> >> cleared >> >>=20 >> >>=20 >> >>=20 >> >> On 09/05/2017 09:36 AM, NeilBrown wrote: >> >> > On Mon, Sep 04 2017, Xiao Ni wrote: >> >> > >> >> >> >> >> >> In function handle_stripe: >> >> >> 4697 if (s.handle_bad_blocks || >> >> >> 4698 test_bit(MD_SB_CHANGE_PENDING, >> >> >> &conf->mddev->sb_flags)) { >> >> >> 4699 set_bit(STRIPE_HANDLE, &sh->state); >> >> >> 4700 goto finish; >> >> >> 4701 } >> >> >> >> >> >> Because MD_SB_CHANGE_PENDING is set, so the stripes can't be handl= ed. >> >> >> >> >> > Right, of course. I see what is happening now. >> >> > >> >> > - raid5d cannot complete stripes until the metadata is written >> >> > - the metadata cannot be written until raid5d gets the mddev_lock >> >> > - mddev_lock is held by the write to suspend_hi >> >> > - the write to suspend_hi is waiting for raid5_quiesce >> >> > - raid5_quiesce is waiting for some stripes to complete. >> >> > >> >> > We could declare that ->quiesce(, 1) cannot be called while holding= the >> >> > lock. >> >> > We could possible allow it but only if md_update_sb() is called fir= st, >> >> > though that might still be racy. >> >> > >> >> > ->quiesce(, 1) is currently called from: >> >> > mddev_suspend >> >> > suspend_lo_store >> >> > suspend_hi_store >> >> > __md_stop_writes >> >> > mddev_detach >> >> > set_bitmap_file >> >> > update_array_info (when setting/removing internal bitmap) >> >> > md_do_sync >> >> > >> >> > and most of those are call with the lock held, or take the lock. >> >> > >> >> > Maybe we should *require* that mddev_lock is held when calling >> >> > ->quiesce() and have ->quiesce() do the metadata update. >> >> > >> >> > Something like the following maybe. Can you test it? >> >>=20 >> >> Hi Neil >> >>=20 >> >> Thanks for the analysis. I need to thing for a while :) >> >> I already added the patch and the test is running now. It usually nee= ds >> >> more than 5 >> >> hours to reproduce this problem. I'll let it run more than 24 hours. >> >> I'll update the test >> >> result later. >> > >> > Hi Neil >> > >> > The problem still exists. But it doesn't show calltrace this time. It >> > was stuck yesterday. I didn't notice that because there has no calltra= ce. >> > >> > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control >> > >> > It shows that raid5d is still spinning. >>=20 >> Thanks for testing. I've thought some more and I think there is a better >> approach. >> The fact that we need to take the mutex to write the super block has >> caused problems several times before and is a key part of the problem >> now. >> Maybe we should relax that. Obviously we don't want two threads >> updating the metadata at the same time, but it should be safe to >> update it in parallel with other uses of reconfix_mutex. >>=20 >> Holding mddev->lock while copying data from the struct mddev to the >> superblock (which we do) should ensure that the superblock is internally >> consistent, and that should be enough. >>=20 >> So I propose the following patch. It certainly needs review and >> testing, but I think it should make a big improvement. >>=20 >> Thanks, >> NeilBrown >>=20 >>=20 >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index b01e458d31e9..414a4c1a052d 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -2388,6 +2388,15 @@ void md_update_sb(struct mddev *mddev, int >> force_change) >> return; >> } >>=20=20 >> + if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE))) >> + return; >> + >> + wait_event(mddev->sb_wait, >> + !test_and_set_bit(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags)); >> + >> + if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE))) >> + goto out; >> + >> repeat: >> if (mddev_is_clustered(mddev)) { >> if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)) >> @@ -2402,7 +2411,7 @@ void md_update_sb(struct mddev *mddev, int >> force_change) >> bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING), >> BIT(MD_SB_CHANGE_DEVS) | >> BIT(MD_SB_CHANGE_CLEAN)); >> - return; >> + goto out; >> } >> } >>=20=20 >> @@ -2432,8 +2441,7 @@ void md_update_sb(struct mddev *mddev, int >> force_change) >> wake_up(&rdev->blocked_wait); >> } >> } >> - wake_up(&mddev->sb_wait); >> - return; >> + goto out; >> } >>=20=20 >> spin_lock(&mddev->lock); >> @@ -2544,6 +2552,9 @@ void md_update_sb(struct mddev *mddev, int >> force_change) >> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN))) >> /* have to write it out again */ >> goto repeat; >> + >> +out: >> + clear_bit_unlock(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags); >> wake_up(&mddev->sb_wait); >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >> sysfs_notify(&mddev->kobj, NULL, "sync_completed"); >> @@ -5606,8 +5617,7 @@ int md_run(struct mddev *mddev) >> set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); >> set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); >>=20=20 >> - if (mddev->sb_flags) >> - md_update_sb(mddev, 0); >> + md_update_sb(mddev, 0); >>=20=20 >> md_new_event(mddev); >> sysfs_notify_dirent_safe(mddev->sysfs_state); >> @@ -8643,17 +8653,14 @@ void md_check_recovery(struct mddev *mddev) >>=20=20 >> if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) >> return; >> - if ( ! ( >> - (mddev->sb_flags & ~ (1<> + if (( >> test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || >> test_bit(MD_RECOVERY_DONE, &mddev->recovery) || >> (mddev->external =3D=3D 0 && mddev->safemode =3D=3D 1) || >> (mddev->safemode =3D=3D 2 >> && !mddev->in_sync && mddev->recovery_cp =3D=3D MaxSector) >> - )) >> - return; >> - >> - if (mddev_trylock(mddev)) { >> + ) && >> + mddev_trylock(mddev)) { >> int spares =3D 0; >>=20=20 >> if (!mddev->external && mddev->safemode =3D=3D 1) >> @@ -8706,9 +8713,6 @@ void md_check_recovery(struct mddev *mddev) >> spin_unlock(&mddev->lock); >> } >>=20=20 >> - if (mddev->sb_flags) >> - md_update_sb(mddev, 0); >> - >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >> !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { >> /* resync/recovery still happening */ >> @@ -8786,6 +8790,7 @@ void md_check_recovery(struct mddev *mddev) >> wake_up(&mddev->sb_wait); >> mddev_unlock(mddev); >> } >> + md_update_sb(mddev, 0); >> } >> EXPORT_SYMBOL(md_check_recovery); >>=20=20 >> diff --git a/drivers/md/md.h b/drivers/md/md.h >> index 09db03455801..bc8633cf7c40 100644 >> --- a/drivers/md/md.h >> +++ b/drivers/md/md.h >> @@ -243,6 +243,7 @@ enum mddev_sb_flags { >> MD_SB_CHANGE_CLEAN, /* transition to or from 'clean' */ >> MD_SB_CHANGE_PENDING, /* switch from 'clean' to 'active' in progress */ >> MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */ >> + MD_SB_UPDATE_ACTIVE, /* A thread is running in md_update_sb */ >> }; >>=20=20 >> struct mddev { >> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c >> index 2dcbafa8e66c..76169dd8ff7c 100644 >> --- a/drivers/md/raid5-cache.c >> +++ b/drivers/md/raid5-cache.c >> @@ -1334,21 +1334,10 @@ static void r5l_write_super_and_discard_space(st= ruct >> r5l_log *log, >> mddev =3D log->rdev->mddev; >> /* >> * Discard could zero data, so before discard we must make sure >> - * superblock is updated to new log tail. Updating superblock (either >> - * directly call md_update_sb() or depend on md thread) must hold >> - * reconfig mutex. On the other hand, raid5_quiesce is called with >> - * reconfig_mutex hold. The first step of raid5_quiesce() is waitting >> - * for all IO finish, hence waitting for reclaim thread, while reclaim >> - * thread is calling this function and waitting for reconfig mutex. So >> - * there is a deadlock. We workaround this issue with a trylock. >> - * FIXME: we could miss discard if we can't take reconfig mutex >> + * superblock is updated to new log tail. >> */ >> - set_mask_bits(&mddev->sb_flags, 0, >> - BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING)); >> - if (!mddev_trylock(mddev)) >> - return; >> + >> md_update_sb(mddev, 1); >> - mddev_unlock(mddev); >>=20=20 >> /* discard IO error really doesn't matter, ignore it */ >> if (log->last_checkpoint < end) { >>=20 > > Hi Neil > > The test have run for three days and the problem is fixed by this patch.= =20 > Thanks for the help. Thanks for testing. I'll look over the patch again and see if there is any chance that the locking change could introduce other problems. > > Could you help to look at https://www.spinics.net/lists/raid/msg58918.htm= l. > The bug which is fixed by your patch was found when I try to reproduce th= at > bug. I did a simply analysis, but I'm not sure whether I'm right or not. It might be the same bug, but if it is there should be other processes stuck in a D wait, one of them holding the reconfig_mutex and waiting for the array to quiesce. Where there any other processes in D wait? NeilBrown --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEG8Yp69OQ2HB7X0l6Oeye3VZigbkFAlm2BLkACgkQOeye3VZi gbkg2g//TtVy2RdnNarBawfuVmcMvA847lSwAPYxJCP6rZcKu43XUJapZ9tauU8l Ezapcj3N1J7mWPwJsNZlH3pSka9iKmNsPcyiTfwH595q1plE0PhlIUrHkq+UyGYQ 7WOQewI1h2S5H4PLfmiPXHOr8hQhq+QXZuPGjJcQsnf9FqheTvqE+PZZOt6ERde8 y1RmKe4itzDxfP/SKGMtWn4BI07CnTxwbxFGNORlTqxSXrjFIZb++VhpDTVtNufF E9/B4QDvJiM9QdkJgI3Ep/VuJFeWYSq8sZmQleHRI9oTiTqF8vPKrLJDK5tIbjGi c1FwnwSB+QiUxF8/boG4w1Wu29HEykiCwQoA34mpumkajeT1nNOVXYPAK8ex9C4m J5PR6f6Q48ZurrFMcjN/aH0mZh2/c04arGtcr15cs7SeoIBHAC4D2DRN5lXN9NQ6 FoeRRE6GM7x5KG91AashOxSRivsLjSF2qS+SYwwshr68ErO4bJDx7zmXBZqZKQxw XRys7N8sDdqj2LrKcN7jwdTvbvdZqY1KrhzL4OotTd3AkSND9kL42RAElimdKUGr xIv+SP5fPa9YLm3LwFhiHqvF5pvqSqPiNNmpT9HA5+p5ibq83SSjR/Xkxb4A9gNw p0yZ5cd7ZEs4DO2+mLUSHlYSGHf81v4Hd9tQ3gdj8N4bMaxdbLM= =cz0p -----END PGP SIGNATURE----- --=-=-=--