From: Giovanni Tessore
Subject: Re: md road-map: 2011
Date: Fri, 18 Feb 2011 01:13:32 +0100
To: Keld Jørn Simonsen
Cc: linux-raid@vger.kernel.org

On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>>>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown wrote:
>>>>>>>
>>>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>>>> blocks back to the file system, and that file systems tracked bad
>>>>>>>> block lists. Modern drives automatically relocate bad blocks (at
>>>>>>>> least, they do if they can), but there was a time when they did
>>>>>>>> not and it was up to the file system to track these. Whether that
>>>>>>>> still applies to modern file systems, I do not know - the only
>>>>>>>> file system I have studied in low-level detail is FAT16.
>>>>>>> When the block device reports an error the filesystem can certainly
>>>>>>> record that information in a bad-block list, and possibly does.
>>>>>>>
>>>>>>> However I thought you were suggesting a situation where the block
>>>>>>> device could succeed with the request, but knew that area of the
>>>>>>> device was of low quality.
>>>>>> I guess that is what I was trying to suggest, though not very
>>>>>> clearly.
>>>>>>
>>>>>>> e.g. IO to a block on a stripe which had one 'bad block'. The IO
>>>>>>> should succeed, but the data isn't as safe as elsewhere. It would
>>>>>>> be nice if we could tell the filesystem that fact, and if it could
>>>>>>> make use of it. But we currently cannot. We can say "success" or
>>>>>>> "failure", but we cannot say "success, but you might not be so
>>>>>>> lucky next time".
>>>>>>>
>>>>>> Do filesystems re-try reads when there is a failure? Could you
>>>>>> return fail on one read, then success on a re-read, which could be
>>>>>> interpreted as "dying, but not yet dead" by the file system?
>>>>> This should not be a file system feature. The file system is built
>>>>> upon the raid, and in mirrored raid types like raid1 and raid10, and
>>>>> also other raid types, you cannot be sure which specific drive and
>>>>> sector the data was read from - it could be one out of many
>>>>> (typically two) places. So the bad blocks of a raid are a feature of
>>>>> the raid and its individual drives, not the file system.
>>>>> If it was a property of the file system, then the fs would have to
>>>>> be aware of the underlying raid topology, and know whether this was
>>>>> a parity block or a data block of raid5 or raid6, or which mirror
>>>>> instance of a raid1/10 type was involved.
>>>>>
>>>> Thanks for the explanation.
>>>>
>>>> I guess my worry is that if the md layer has tracked a bad block on a
>>>> disk, then that stripe will be in a degraded mode. It's great that it
>>>> will still work, and it's great that the bad block list means that it
>>>> is /only/ that stripe that is degraded - not the whole raid.
>>> I am proposing that the stripe not be degraded, using a recovery area
>>> for bad blocks on the disk that goes together with the metadata area.
>>>
>>>> But I'm hoping there can be some sort of relocation somewhere
>>>> (ultimately it doesn't matter if it is handled by the file system, or
>>>> by md for the whole stripe, or by md for just that disk block, or by
>>>> the disk itself), so that you can get raid protection again for that
>>>> stripe.
>>> I think we agree in hoping :-)
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature,
>> as it helps in keeping track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when an
>> unrecoverable read error occurs (now the full array goes offline). Then
>> something must be done proactively to repair the situation, as it means
>> that a disk of the array has problems and should be replaced. So, first
>> it's worth making a backup of the still-alive array (getting some read
>> errors when the bad blocks/stripes are encountered [maybe using
>> ddrescue or similar]), then replacing the disk and reconstructing the
>> array; after that, an fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read errors come from just a very
>> few sectors of the disk, and that it's not worth replacing it
>> (personally I would replace it even for a very few), as there are still
>> many reserved sectors for relocation on the disk. Then a simple
>> solution would just be to zero-write the bad blocks in the Bad Block
>> Log (the data is gone already): if the write succeeds (the disk uses
>> reserved sectors for relocation), the blocks are removed from the log
>> (now they are ok); then fsck (hopefully) may repair the filesystem. At
>> this point there are no more md read errors, maybe just filesystem
>> errors (the array is clean, the filesystem may not be; but notice that
>> nothing can be done to avoid filesystem problems, as there has been a
>> data loss; only fsck may help).
> Another way around, if the bad blocks recovery area does not fly with
> Neil or other implementors:
>
> It should be possible to run a periodic check of whether any bad
> sectors have occurred in an array. Then the half-damaged file should be
> moved away from the area with the bad block by copying it and relinking
> it, and before relinking it to the proper place the good block
> corresponding to the bad block should be marked as a corresponding good
> block on the healthy disk drive, so that it is not allocated again.
> This action could even be triggered by the event of the detection of
> the bad block. This would probably mean that there needs to be a system
> call to mark a corresponding good block. The whole thing should be able
> to run in userland and be somewhat independent of the file system type,
> except for the lookup of the corresponding file from a damaged block.
I don't follow this.. if a file has some damaged blocks, they are gone;
moving it elsewhere does not help. And in any case, this is a task for
the filesystem.

md is just a block device (more reliable than a single disk due to some
level of redundancy), and it should be independent of the kind of file
system on it (just as the file system should be independent of the kind
of block device it resides on [md, hd, flash, iscsi, ...]).

Also, what you suggest would have to be done for every block device that
can have bad blocks (that is, every block device). Again, this is a
filesystem issue. And for which file system type, as there are many?

The Bad Block Log allows md to behave 'like' a real hard disk does with
its SMART data:
- unreadable blocks/stripes are recorded into the log, just as
unreadable sectors are recorded into SMART data
- unrecoverable read errors are reported to the caller, for both
- the device keeps working even if it has unrecoverable read errors, for
both (right now the whole md device fails; this is the problem)
- if a block/stripe is rewritten with success, the block/stripe is
removed from the Bad Block Log (and the counter of relocated
blocks/stripes is incremented); just as, if a sector is rewritten with
success on a disk, the sector is removed from the list of unreadable
sectors and the counter of relocated sectors is incremented (SMART data)

A filesystem on a disk does not know what the firmware of the disk does
about sector relocation.
The same applies to the firmware of a hardware (not fake) raid
controller.
The same should apply to md. It is transparent to the filesystem.

IMHO a more interesting issue would be: a write error occurs on a disk
participating in an already degraded array; failing the disk would fail
the whole array. What to do? Put the array into read-only mode, still
allowing read access to the data on it for easy backup? In such a
situation, what would a hardware raid controller do?

Hm, yes.... how do hardware raid controllers behave with uncorrectable
read errors?
And how do they behave with a write error on a disk of an already
degraded array?
I guess md should replicate these behaviours.
... Neil?

Regards.
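P.S. Just to make the zero-write idea above concrete, here is a very
rough userland sketch in Python. The path and format of the exported
bad block list are pure invention on my part (no such interface exists
yet; how, or whether, the log gets exported is up to Neil), and real
code would want O_DIRECT and proper buffer alignment. It only
illustrates the procedure: overwrite each bad range with zeros so the
disk relocates the sectors from its reserved pool, then run fsck.

#!/usr/bin/env python
# Sketch only, NOT a real tool: zero-write the ranges listed in a
# HYPOTHETICAL per-device bad block list so the drive relocates the
# sectors from its reserved pool. Run as root; the data in those
# ranges is gone already. Afterwards run fsck on the filesystem.

import os
import sys

SECTOR = 512  # assuming the log records 512-byte sectors

def read_bad_blocks(path):
    # assumed format: one "first_sector count" pair per line
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                entries.append((int(fields[0]), int(fields[1])))
    return entries

def zero_write(dev, sector, count):
    # overwrite the bad range with zeros; fsync so a failing
    # write actually reports EIO back to us
    fd = os.open(dev, os.O_WRONLY)
    try:
        os.lseek(fd, sector * SECTOR, os.SEEK_SET)
        os.write(fd, b'\0' * (count * SECTOR))
        os.fsync(fd)
    finally:
        os.close(fd)

if __name__ == '__main__':
    dev, log = sys.argv[1], sys.argv[2]  # e.g. /dev/sdb <bad-block-list>
    for sector, count in read_bad_blocks(log):
        try:
            zero_write(dev, sector, count)
            print('rewrote %d+%d on %s' % (sector, count, dev))
        except OSError as e:
            print('write still fails at %d: %s' % (sector, e))

If the write goes through, the disk has used a reserved sector and the
entry can be dropped from the Bad Block Log; if it still fails, the
disk really is out of spare sectors and must be replaced.

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore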