From: Giovanni Tessore
Subject: Re: md road-map: 2011
Date: Thu, 17 Feb 2011 12:45:42 +0100
Message-ID: <4D5D0A66.80608@texsoft.it>
References: <20110216212751.51a294aa@notabene.brown>
 <20110217083531.3090a348@notabene.brown>
 <20110217100139.7520893d@notabene.brown>
 <20110217010455.GA16324@www2.open-std.org>
 <20110217105815.GA24580@www2.open-std.org>
In-Reply-To: <20110217105815.GA24580@www2.open-std.org>
To: linux-raid@vger.kernel.org

On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown wrote:
>>>>>
>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>> blocks back to the file system, and that file systems tracked bad
>>>>>> block lists. Modern drives automatically relocate bad blocks (at
>>>>>> least, they do if they can), but there was a time when they did not
>>>>>> and it was up to the file system to track these. Whether that still
>>>>>> applies to modern file systems, I do not know - the only file system
>>>>>> I have studied in low-level detail is FAT16.
>>>>>
>>>>> When the block device reports an error the filesystem can certainly
>>>>> record that information in a bad-block list, and possibly does.
>>>>>
>>>>> However I thought you were suggesting a situation where the block
>>>>> device could succeed with the request, but knew that area of the
>>>>> device was of low quality.
>>>>
>>>> I guess that is what I was trying to suggest, though not very clearly.
>>>>
>>>>> e.g. IO to a block on a stripe which had one 'bad block'. The IO
>>>>> should succeed, but the data isn't as safe as elsewhere. It would be
>>>>> nice if we could tell the filesystem that fact, and if it could make
>>>>> use of it. But we currently cannot. We can say "success" or
>>>>> "failure", but we cannot say "success, but you might not be so lucky
>>>>> next time".
>>>>
>>>> Do filesystems re-try reads when there is a failure? Could you return
>>>> fail on one read, then success on a re-read, which could be
>>>> interpreted as "dying, but not yet dead" by the file system?
>>>
>>> This should not be a file system feature. The file system is built upon
>>> the raid, and in mirrored raid types like raid1 and raid10, and also
>>> other raid types, you cannot be sure which specific drive and sector
>>> the data was read from - it could be one out of many (typically two)
>>> places. So the bad blocks of a raid are a feature of the raid and its
>>> individual drives, not the file system. If it were a property of the
>>> file system, then the fs would have to be aware of the underlying raid
>>> topology, and know whether this was a parity block or a data block of
>>> raid5 or raid6, or which mirror instance of a raid1/10 type was
>>> involved.
>>
>> Thanks for the explanation.
>>
>> I guess my worry is that if the md layer has tracked a bad block on a
>> disk, then that stripe will be in a degraded mode.
>> It's great that it will still work, and it's great that the bad block
>> list means that it is /only/ that stripe that is degraded - not the
>> whole raid.
>
> I am proposing that the stripe not be degraded, using a recovery area
> for bad blocks on the disk that goes together with the metadata area.
>
>> But I'm hoping there can be some sort of relocation somewhere
>> (ultimately it doesn't matter if it is handled by the file system, or
>> by md for the whole stripe, or by md for just that disk block, or by
>> the disk itself), so that you can get raid protection again for that
>> stripe.
>
> I think we agree in hoping :-)

IMHO the point is that this feature (the Bad Block Log) is a GREAT
feature, as it helps keep track of the health status of the underlying
disks, and helps A LOT in recovering data from the array when an
unrecoverable read error occurs (currently the whole array goes
offline). Then something must be done proactively to repair the
situation, as it means that a disk of the array has problems and should
be replaced. So, first it's worth making a backup of the still-alive
array (expecting some read errors when the bad blocks/stripes are
encountered [maybe using ddrescue or similar]), then replacing the disk
and reconstructing the array; after that an fsck on the filesystem may
repair the situation.

You may argue that the unrecoverable read errors come from just a very
few sectors of the disk, and that it's not worth replacing it
(personally I would replace it even for very few), as there are still
many reserved sectors on the disk available for relocation. Then a
simple solution would just be to zero-write the bad blocks listed in
the Bad Block Log (the data is gone already): if the write succeeds
(the disk uses reserved sectors for relocation), the blocks are removed
from the log (now they are OK); then fsck (hopefully) may repair the
filesystem. At this point there are no more md read errors, maybe just
filesystem errors (the array is clean, the filesystem may not be; but
notice that nothing can be done to avoid filesystem problems, as there
has been a data loss; only fsck may help).

Regards

-- 
Cordiali saluti. Yours faithfully.

Giovanni Tessore
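P.S. Just to make the steps above concrete, a minimal sketch of one
possible command sequence. The array /dev/md0, the failing member
/dev/sdb1, its replacement /dev/sdc1, the backup paths and the sector
number are all hypothetical placeholders, and how the affected sectors
would be read out of the bad block log depends on how the feature ends
up being exposed; adapt everything to the actual setup.

    # back up whatever is still readable from the running array,
    # logging the unreadable areas (GNU ddrescue: infile outfile mapfile)
    ddrescue /dev/md0 /mnt/backup/md0.img /mnt/backup/md0.map

    # option A: replace the suspect disk and let md rebuild the array
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    mdadm /dev/md0 --add /dev/sdc1

    # option B: only a handful of sectors are bad - overwrite them with
    # zeros so the drive can remap them to reserved sectors (that data
    # is lost anyway); 123456 is a placeholder LBA, assuming 512-byte
    # logical sectors
    dd if=/dev/zero of=/dev/sdb1 bs=512 seek=123456 count=1 conv=notrunc

    # finally let the filesystem repair itself around the lost data
    # (with the filesystem unmounted; assuming ext2/3/4)
    e2fsck -f /dev/md0

Option B only makes sense when the number of bad sectors is very small
and the drive is not otherwise showing signs of imminent failure;
otherwise replacing the disk (option A) is the safer route.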