From: NeilBrown
Subject: Re: RAID1 removing failed disk returns EBUSY
Date: Wed, 29 Oct 2014 08:41:13 +1100
Message-ID: <20141029084113.57f6ae6a@notabene.brown>
References: <20141027162748.593451be@jlaw-desktop.mno.stratus.com>
In-Reply-To: <20141027162748.593451be@jlaw-desktop.mno.stratus.com>
To: Joe Lawrence
Cc: linux-raid@vger.kernel.org, Bill Kuzeja

On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence wrote:

> Hi Neil,
>
> We've encountered changes in MD and mdadm that have broken our automated
> disk removal script.  In the past, we've been able to run the following
> after a RAID1 disk component removal:
>
>   % echo fail > /sys/block/md3/md/dev-sdr5/state
>   % echo remove > /sys/block/md3/md/dev-sdr5/state
>
> However, the latest RHEL6.6 code drop has rebased onto sufficiently recent
> MD kernel and mdadm code that the previous commands now occasionally fail
> like so:
>
>   * The MD array is usually resyncing or checking
>   * The component disk /dev/sdr is removed via HBA sysfs PCI removal
>   * The following UDEV rule fires:
>
>     SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
>       RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
>
>   % mdadm --detail /dev/md3
>   /dev/md3:
>           Version : 1.1
>     Creation Time : Tue Oct 14 17:31:59 2014
>        Raid Level : raid1
>        Array Size : 25149440 (23.98 GiB 25.75 GB)
>     Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
>      Raid Devices : 2
>     Total Devices : 2
>       Persistence : Superblock is persistent
>
>     Intent Bitmap : Internal
>
>       Update Time : Wed Oct 15 14:22:34 2014
>             State : active, degraded
>    Active Devices : 1
>   Working Devices : 1
>    Failed Devices : 1
>     Spare Devices : 0
>
>              Name : localhost.localdomain:3
>              UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
>            Events : 142
>
>       Number   Major   Minor   RaidDevice State
>          0      65       21        0      faulty
>          1      65        5        1      active sync   /dev/sdj5
>
> All attempts to remove this device fail:
>
>   % echo remove > /sys/block/md3/md/dev-sdr5/state
>   -bash: echo: write error: Device or resource busy
>
> This can be traced to state_store():
>
>   } else if (cmd_match(buf, "remove")) {
>           if (rdev->raid_disk >= 0)
>                   err = -EBUSY;
>
> After much debugging and systemtapping, I think I've figured out that the
> sysfs scripting may fail after the following combination of changes:
>
>   mdadm  8af530b07fce "Enhance incremental removal."
>   kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
>                        badblocks during md_stop_writes"
>
> With these two changes:
>
> 1 - On the user side, mdadm tries to set the array_state to read-auto
>     on incremental removal (as invoked by the UDEV rule).
>
> 2 - On the kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN
>     flag, wake up the mddev->thread and, if there is a sync_thread, set
>     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
>
> When md_check_recovery() gets a chance to run as part of the raid1d()
> mddev->thread, it may or may not ever get to an invocation of
> remove_and_add_spares(), as there are *many* conditional early exits
> along the way -- for example, if MD_RECOVERY_FROZEN is set, the
> following condition will bounce out of the routine:
>
>   if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>       test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>           goto unlock;
>
> The next time around, MD_RECOVERY_NEEDED will have been cleared, so all
> future tests will return 0 and the negation will always take the early
> exit path.
>
> Back in md_set_readonly(), it may notice that the MD is still in use, so
> it clears MD_RECOVERY_FROZEN and then returns -EBUSY, without setting
> mddev->ro.  But the damage has been done: the flags have been left in a
> state in which md_check_recovery() will never call
> remove_and_add_spares().
>
> This would also explain why writing "idle" to sync_action clears the
> wedge: it sets MD_RECOVERY_NEEDED, allowing md_check_recovery() to
> continue on to remove_and_add_spares().
>
> As far as I can tell, this is what is happening to prevent the "remove"
> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> certainly a lot of little bit-states between disk removal, UDEV mdadm,
> and various MD kernel threads, so apologies if I missed an important
> transition.
>
> Would you consider writing "idle" to the MD array sync_action file a
> safe and reasonable intermediate workaround step for our script?
>
> And of course, any suggestions as to whether this is intended behavior
> (i.e. the removed component disk is failed, but stuck in the array)?
>
> This is fairly easy for us to reproduce with multiple MD arrays per disk
> (one per partition) and interrupting a raid check on all of them
> (especially when they are delayed waiting for the first to finish) by
> removing the component disk via sysfs PCI removal.  We can provide
> additional debug or testing if required.
>

Hi Joe,
 thanks for the detailed analysis!!

I think the correct fix is that MD_RECOVERY_NEEDED should be set after
clearing MD_RECOVERY_FROZEN, as in the patch below.  Can you confirm that
it works for you?

Writing 'idle' should in general be safe, so that could be used as an
interim.
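For the script, the interim could be as simple as the following sketch
(using the md3/dev-sdr5 names from your example); the "idle" write sets
MD_RECOVERY_NEEDED, so md_check_recovery() can reach
remove_and_add_spares() before the "remove" is retried:

  % echo idle > /sys/block/md3/md/sync_action
  % echo remove > /sys/block/md3/md/dev-sdr5/state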
Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c03d87b6890a..2c73fcb82593 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		printk("md: %s still in use.\n",mdname(mddev));
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			md_wakeup_thread(mddev->thread);
 		}
 		err = -EBUSY;
@@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		mddev->ro = 1;
 		set_disk_ro(mddev->gendisk, 1);
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 		err = 0;
 	}
@@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	mutex_unlock(&mddev->open_mutex);
 	if (did_freeze) {
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 		md_wakeup_thread(mddev->thread);
 	}
 	return -EBUSY;
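If you want to double-check before wiring it back into the automation,
re-running your reproduction on a patched kernel should be enough of a
test (a sketch, again with the md3/dev-sdr5 names from your example):

  % echo check > /sys/block/md3/md/sync_action    # ensure a sync_thread is running
  (pull /dev/sdr via HBA sysfs PCI removal so the UDEV rule runs "mdadm -If")
  % echo fail > /sys/block/md3/md/dev-sdr5/state
  % echo remove > /sys/block/md3/md/dev-sdr5/state   # should no longer return EBUSY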