From: John Robinson
Subject: Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
Date: Fri, 14 Aug 2009 18:07:44 +0100
Message-ID: <4A8599E0.5000604@anonymous.org.uk>
References: <64960.78.86.108.203.1250180799.squirrel@www.yuiop.co.uk>
To: Paweł Brodacki
Cc: "linux-raid@vger.kernel.org"

On 14/08/2009 14:09, Paweł Brodacki wrote:
> 2009/8/13 John Robinson:
>
>> Can or could md be made or configured to try re-adding a device if this
>> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
>> shouldn't make one lose redundancy if the drive's actually OK?
>
> I think that from the coding point of view md probably could. The more
> important thing is if it should. The only hard fact is that there was
> an error while accessing the device. md has no way of telling if it
> was just a freak accident, or the drive is unreliable from now on.

Ah well, perhaps we need to give md a way of knowing the difference
between a transient error (that has been recovered from) and a more
serious error.

> Therefore it does the one safe thing and says "I won't trust you
> anymore.". If a human being knows better, the said being is free to
> re-add the drive.
>
> Personally I'd hate having a suspicious drive being auto-added in hope
> it will rebuild and function properly.

I wouldn't want it to be the default behaviour, but I'd like the option
to configure things that way. I'd want the number of auto-re-adds
configurable too.

> Because such an option could seem tempting but could and would cause
> loss of reliability I'd expect bad publicity if it was actually added.

But it could cause improvements in reliability too.
If the cable on drive A is hit by cosmic rays, the drive is taken out
of the array, but the drive's actually still fine, then drive B fails
before the operator has re-added drive A, the array goes down when it
didn't need to.

What is the operator's most likely response to seeing the SATA bus
reset? She's going to re-add the drive assuming it was a transient
error. If we could make this happen automatically, we could close a
window when the array's more vulnerable. I wouldn't suggest we do it
silently; it gets logged, notified etc. just like the drive being taken
out of the array would be.

Cheers,

John.