From: NeilBrown
Subject: Re: Failed, but "md: cannot remove active disk..."
Date: Mon, 14 May 2012 21:36:23 +1000
Message-ID: <20120514213623.3bc1bfa5@notabene.brown>
References: <1336933308.2831.4.camel@localhost> <20120514202220.5a164eb0@notabene.brown> <1336992780.6722.18.camel@localhost>
In-Reply-To: <1336992780.6722.18.camel@localhost>
To: Michał Sawicz
Cc: linux-raid
List-Id: linux-raid.ids

On Mon, 14 May 2012 12:53:00 +0200 Michał Sawicz wrote:

> On Mon, 2012-05-14 at 20:22 +1000, NeilBrown wrote:
> > On Sun, 13 May 2012 20:21:48 +0200 Michał Sawicz wrote:
> >
> > > Hey,
> > >
> > > I've a weird issue with a RAID6 setup, /proc/mdstat says:
> > >
> > > > md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
> > > >       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> > >
> > > So sdg1 is (F)ailed, yet `mdadm --remove` yields:
> > >
> > > > md: cannot remove active disk sdg1 from md126 ...
> >
> > There is a period of time between when a device fails and when the raid456
> > module finally lets go of it so it can be removed.  You seem to be in this
> > period of time.
> > Normally it is very short.  It needs to wait for any requests that have
> > already been sent to the device to complete (probably with failure) and
> > very shortly after that it should be released.  So this is normally much less
> > than one second but could be several seconds if some excessive retry is
> > happening.
> >
> > But I'm guessing you have waited more than a few seconds.
>
> Yup :)
>
> > I vaguely recall a bug in the not too distant past whereby RAID456 wouldn't
> > let go of a device quite as soon as it should.  Unfortunately I don't
> > remember the details.  You might be able to trigger it to release the drive
> > by adding a spare - if you have one - or maybe by just
> >    echo sync > /sys/block/md126/md/sync_action
> > it won't actually do a sync, but it might check things enough to make
> > progress.
>
> # echo sync > /sys/block/md126/md/sync_action
> -bash: echo: write error: Device or resource busy

Hmmm....
Looks like MD_RECOVERY_NEEDED is already set.  But remove_and_add_spares()
isn't removing the failed device from the array.

I cannot find anything since 2.6.38 that looks like your symptoms.

Is the array still functioning?
Are there any interesting messages appearing in the kernel logs?
What does
   grep . /sys/block/md126/md/dev*/*
show?

NeilBrown

>
> eh?
>
> > What kernel are you using?
>
> # uname -a
> Linux media 2.6.38-gentoo-r6 #2 SMP Tue Sep 13 19:13:42 CEST 2011 x86_64
> AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux
>
> Thanks,
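
For anyone following along who wants the per-device sysfs dump requested above in a reusable form, here is a minimal sketch of roughly what `grep . /sys/block/md126/md/dev*/*` prints: one "path:value" line for every non-empty line of every readable attribute file under each member device directory. The function name md_dump_members is made up for illustration, and taking the md directory as an argument is an assumption added so the function can be pointed at a saved copy of the tree. It is not something mdadm or the kernel provides.

```shell
#!/bin/sh
# md_dump_members (hypothetical helper name): print "path:value" for every
# non-empty line of every readable regular file under <md_dir>/dev*/,
# approximating `grep . <md_dir>/dev*/*`.
# The default directory matches the array discussed in this thread.
md_dump_members() {
    md_dir=${1:-/sys/block/md126/md}
    for attr in "$md_dir"/dev*/*; do
        # Skip directories, symlinks to directories (such as "block"),
        # unmatched globs, and files we are not allowed to read.
        [ -f "$attr" ] && [ -r "$attr" ] || continue
        # Read the attribute line by line; the "|| [ -n ... ]" part keeps a
        # final line that has no trailing newline.
        while IFS= read -r line || [ -n "$line" ]; do
            [ -n "$line" ] && printf '%s:%s\n' "$attr" "$line"
        done 2>/dev/null < "$attr"
    done
    return 0
}
```

Running `md_dump_members` with no argument reads /sys/block/md126/md; passing another directory, for example `md_dump_members /sys/block/md0/md`, inspects a different array. Attributes that fail on read are silently skipped rather than reported, which differs from grep, so treat the output as a convenience and not as a complete record.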