From: NeilBrown
Subject: Re: RAID1 removing failed disk returns EBUSY
Date: Mon, 17 Nov 2014 10:03:49 +1100
Message-ID: <20141117100349.1d1ae1fa@notabene.brown>
References: <20141027162748.593451be@jlaw-desktop.mno.stratus.com>
	<20141029084113.57f6ae6a@notabene.brown>
	<20141029133604.59f9549a@jlaw-desktop.mno.stratus.com>
	<20141113090549.296a13ee@jlaw-desktop.mno.stratus.com>
In-Reply-To: <20141113090549.296a13ee@jlaw-desktop.mno.stratus.com>
To: Joe Lawrence
Cc: linux-raid@vger.kernel.org, Bill Kuzeja
List-Id: linux-raid.ids

On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence wrote:

> On Wed, 29 Oct 2014 13:36:04 -0400
> Joe Lawrence wrote:
> 
> > On Wed, 29 Oct 2014 08:41:13 +1100
> > NeilBrown wrote:
> > 
> > > On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence
> > > wrote:
> > > 
> > > > Hi Neil,
> > > > 
> > > > We've encountered changes in MD and mdadm that have broken our automated
> > > > disk removal script.
> > > > In the past, we've been able to run the following
> > > > after a RAID1 disk component removal:
> > > > 
> > > > % echo fail > /sys/block/md3/md/dev-sdr5/state
> > > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > > 
> > > > However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> > > > MD kernel and mdadm changes, in which the previous commands occasionally
> > > > fail like so:
> > > > 
> > > > * MD array is usually resyncing or checking
> > > > * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > > > * Following UDEV rule fires:
> > > > 
> > > >   SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > > >     RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > > > 
> > > > % mdadm --detail /dev/md3
> > > > /dev/md3:
> > > >         Version : 1.1
> > > >   Creation Time : Tue Oct 14 17:31:59 2014
> > > >      Raid Level : raid1
> > > >      Array Size : 25149440 (23.98 GiB 25.75 GB)
> > > >   Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > > >    Raid Devices : 2
> > > >   Total Devices : 2
> > > >     Persistence : Superblock is persistent
> > > > 
> > > >   Intent Bitmap : Internal
> > > > 
> > > >     Update Time : Wed Oct 15 14:22:34 2014
> > > >           State : active, degraded
> > > >  Active Devices : 1
> > > > Working Devices : 1
> > > >  Failed Devices : 1
> > > >   Spare Devices : 0
> > > > 
> > > >            Name : localhost.localdomain:3
> > > >            UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > > >          Events : 142
> > > > 
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0      65       21        0      faulty
> > > >        1      65        5        1      active sync   /dev/sdj5
> > > > 
> > > > All attempts to remove this device fail:
> > > > 
> > > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > > -bash: echo: write error: Device or resource busy
> > > > 
> > > > This can be traced to state_store():
> > > > 
> > > >         } else if (cmd_match(buf, "remove")) {
> > > >                 if (rdev->raid_disk >= 0)
> > > >                         err = -EBUSY;
> > > > 
> > > > After much debugging and
> > > > systemtapping, I think I've figured out that the
> > > > sysfs scripting may fail after the following combination of changes:
> > > > 
> > > >   mdadm  8af530b07fce "Enhance incremental removal."
> > > >   kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
> > > >                        badblocks during md_stop_writes"
> > > > 
> > > > With these two changes:
> > > > 
> > > > 1 - On the user side, mdadm is trying to set the array_state to read-auto
> > > >     on incremental removal (as invoked by UDEV rule).
> > > > 
> > > > 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> > > >     wake up the mddev->thread and, if there is a sync_thread, it will set
> > > >     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> > > > 
> > > >     When md_check_recovery() gets a chance to run as part of the
> > > >     raid1d() mddev->thread, it may or may not ever get to
> > > >     an invocation of remove_and_add_spares(), for there are *many*
> > > >     conditional early exits along the way -- for example, if
> > > >     MD_RECOVERY_FROZEN is set, the following condition will bounce out of
> > > >     the routine:
> > > > 
> > > >         if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > > >             test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
> > > >                 goto unlock;
> > > > 
> > > >     The next time around, MD_RECOVERY_NEEDED will have been cleared, so
> > > >     all future tests will return 0 and the negation will always take the
> > > >     early exit path.
> > > > 
> > > >     Back in md_set_readonly(), it may notice that the MD is still in use,
> > > >     so it clears MD_RECOVERY_FROZEN and then returns -EBUSY, without
> > > >     setting mddev->ro.  But the damage has been done, as conditions have
> > > >     been set such that md_check_recovery() will never call
> > > >     remove_and_add_spares().
> > > > 
> > > > This would also explain why an "idle" sync_action clears the wedge: it
> > > > sets MD_RECOVERY_NEEDED, allowing md_check_recovery() to continue executing
> > > > to remove_and_add_spares().
> > > > 
> > > > As far as I can tell, this is what is happening to prevent the "remove"
> > > > write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > > > certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> > > > various MD kernel threads, so apologies if I missed an important
> > > > transition.
> > > > 
> > > > Would you consider writing "idle" to the MD array sync_action file as a
> > > > safe and reasonable intermediate workaround step for our script?
> > > > 
> > > > And of course, any suggestions as to whether this is intended behavior (i.e.,
> > > > the removed component disk is failed, but stuck in the array)?
> > > > 
> > > > This is fairly easy for us to reproduce with multiple MD arrays per disk
> > > > (one per partition) and interrupting a raid check on all of them
> > > > (especially when they are delayed waiting for the first to finish) by
> > > > removing the component disk via sysfs PCI removal.  We can provide
> > > > additional debug or testing if required.
> > > 
> > > Hi Joe,
> > > thanks for the detailed analysis!!
> > > 
> > > I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
> > > clearing MD_RECOVERY_FROZEN, like the patch below.
> > > Can you confirm that it works for you?
> > > 
> > > Writing 'idle' should in general be safe, so that could be used as an interim.
> > > 
> > > Thanks,
> > > NeilBrown
> > > 
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index c03d87b6890a..2c73fcb82593 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >  			printk("md: %s still in use.\n",mdname(mddev));
> > >  			if (did_freeze) {
> > >  				clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +				set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >  				md_wakeup_thread(mddev->thread);
> > >  			}
> > >  			err = -EBUSY;
> > > @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >  		mddev->ro = 1;
> > >  		set_disk_ro(mddev->gendisk, 1);
> > >  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > > +		md_wakeup_thread(mddev->thread);
> > >  		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >  		err = 0;
> > >  	}
> > > @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
> > >  	mutex_unlock(&mddev->open_mutex);
> > >  	if (did_freeze) {
> > >  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >  		md_wakeup_thread(mddev->thread);
> > >  	}
> > >  	return -EBUSY;
> > 
> > Hi Neil,
> > 
> > In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > pulled disk from the MD array.  This is okay for our intentions, but I
> > wanted to make sure that it's okay to skip any failed-but-not-removed
> > state.
> > 
> > Tested-by: Joe Lawrence
> > 
> > and should this have a
> > 
> > Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
> > 
> > tag to mark for stable?
> 
> 
> Hi Neil,
> 
> Would you like me to write up a proper patch, or is this one in the queue?
> 

Several times over the last week I've thought that I should probably push
that patch along ...
but each time something else seemed more interesting.  But it's a new week
now.  I've just posted a pull request.

Thanks for the prompt (and the report and testing of course).

NeilBrown