From mboxrd@z Thu Jan  1 00:00:00 1970
From: Giovanni Tessore
Subject: Re: md road-map: 2011
Date: Fri, 18 Feb 2011 10:47:28 +0100
Message-ID: <4D5E4030.8020805@texsoft.it>
References: <20110217083531.3090a348@notabene.brown>
 <20110217100139.7520893d@notabene.brown>
 <20110217010455.GA16324@www2.open-std.org>
 <20110217105815.GA24580@www2.open-std.org>
 <4D5D0A66.80608@texsoft.it>
 <20110217154440.GA24982@www2.open-std.org>
 <4D5DB9AC.10106@texsoft.it>
 <20110218025623.GA26387@www2.open-std.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <20110218025623.GA26387@www2.open-std.org>
Sender: linux-raid-owner@vger.kernel.org
To: Keld Jørn Simonsen
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
> On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>>> It should be possible to run a periodic check of whether any bad
>>> sectors have occurred in an array. Then the half-damaged file should
>>> be moved away from the area with the bad block by copying it and
>>> relinking it, and before relinking it to the proper place the good
>>> block corresponding to the bad block should be marked as a
>>> corresponding good block on the healthy disk drive, so that it is
>>> not allocated again. This action could even be triggered by the
>>> detection of the bad block. This would probably mean that there
>>> needs to be a system call to mark a corresponding good block. The
>>> whole thing should be able to run in userland and somewhat
>>> independent of the file system type, except for the lookup of the
>>> corresponding file from a damaged block.
>> I don't follow this... if a file has some damaged blocks, they are
>> gone; moving it elsewhere does not help.
> Remember the file is in a RAID. So you can lose one disk drive and
> your data is still intact.
>
>> And anyway, this is a task for the filesystem.
> No, it is the task of the raid, as it is the raid that gives the
> functionality that you can lose a drive and still have your data
> intact. The raid level knows what is lost, what is still good, and
> where this stuff is.
>
> If we are then operating on the file level, then doing something
> clever could be a cooperation between the raid level and the
> filesystem level, as described above.

Raid of course has this functionality, but at block level; it is
agnostic of the filesystem on it (there may actually be no filesystem
at all, as with raid over raid); it does not know the word 'file'.

Raid adds SOME level of redundancy, not infinite redundancy. If the
underlying hardware has more damaged sectors than the redundancy level
of the raid configuration can absorb, data in the stripe is lost, and
the hardware probably should be replaced.

Unrecoverable read errors FROM MD (those addressed by the Bad Block Log
feature) only appear when this redundancy level is not enough; for
example:
- raid 1 in degraded mode with only 1 disk active, read error on the
  remaining disk
- raid 5 in degraded mode, read error on one of the active disks
- raid 6 in degraded mode missing 2 disks, read error on one of the
  active disks
- raid 5, read error on the same sector on more than 1 disk
- raid 6, read error on the same sector on more than 2 disks
- etc.

In these situations nothing can be done, neither at md level nor at
filesystem level: the data on the block/stripe is lost.
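To make the arithmetic concrete, here is a minimal illustrative sketch
(plain C, not actual md code): a stripe read error is recoverable only
while the number of devices lost to failures or read errors stays
within the redundancy of the raid level.

/* Illustrative only. 'redundancy' is the number of device failures
 * the level tolerates (1 for a two-disk raid1 or for raid5, 2 for
 * raid6); 'lost' counts missing/failed devices plus devices that
 * returned a read error for this stripe. */
static int stripe_recoverable(int redundancy, int lost)
{
    return lost <= redundancy;
}

/* Matching the list above:
 *   raid5 degraded (1 lost) + 1 read error -> stripe_recoverable(1, 2) == 0
 *   raid6 missing 2 disks + 1 read error   -> stripe_recoverable(2, 3) == 0
 * Both yield an unrecoverable read error from md. */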
Remember that the Bad Block Log keeps track of the blocks/stripes which
gave this unrecoverable read error at md level. It has nothing to do
with the unreadable sector list of the underlying disks: if raid gets a
read error from a disk, it tries to reconstruct the data from the other
disks and to rewrite the sector; if it succeeds, all is ok for md (it
just increments the counter of corrected read errors, which is
persistent for 1.x superblocks); otherwise there is a write error, and
the disk is marked as failed.

>> md is just a block device (more reliable than a single disk due to
>> some level of redundancy), and it should be independent of the kind
>> of file system on it (as the file system should be independent of
>> the kind of block device it resides on [md, hd, flash, iscsi, ...]).
> true
>
>> Then what you suggest should be done for every block device that can
>> have bad blocks (that is, every block device). Again, this is a
>> filesystem issue. And for which file system type, as there are many?
> yes, it is a cooperation between the file system layer and the raid
> layer. I propose this be done in userland.
>
>> The Bad Block Log allows md to behave 'like' a real hard disk would
>> do with smart data:
>> - unreadable blocks/stripes are recorded into the log, as unreadable
>> sectors are recorded into smart data
>> - unrecoverable read errors are reported to the caller for both
>> - the device still works even if it has unrecoverable read errors,
>> for both (right now the whole md device fails, and this is the
>> problem)
>> - if a block/stripe is rewritten with success, the block/stripe is
>> removed from the Bad Block Log (and the counter of relocated
>> blocks/stripes is incremented); just as, when a sector is rewritten
>> with success on a disk, the sector is removed from the list of
>> unreadable sectors and the counter of relocated sectors is
>> incremented (smart data)
> Smart drives also reallocate bad blocks, hiding the errors from the
> SW level.

And that is the only natural place where this operation should be done.
Suppose you get an unrecoverable read error from md on a block. It
means that some sector on one (or more) of the underlying disks gave a
read error. If you try to rewrite the md block, the sectors are
rewritten to the underlying disks, so either:
- all disks write correctly because they could solve the problem (it's
  a matter of their firmware, maybe relocating the sector to a reserved
  area): block relocated, all OK;
- some disks give an error on write (no more room to relocate bad
  sectors, or other hw problems): then the disk(s) is (are) marked
  failed and must be replaced.

There is no need for reserved blocks anywhere else than those of the
underlying disks.

Having reserved relocatable blocks at raid level would be useful to
address another situation: uncorrectable errors on write. But that is
another story.
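To illustrate the rewrite step just described, here is a minimal
userspace sketch; the device path and block offset are hypothetical,
and the zero-fill stands in for whatever replacement data you have,
since the original contents are gone:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/md0";   /* hypothetical array */
    off_t bad = 123456 * 4096LL;    /* hypothetical bad block offset */
    char buf[4096];

    int fd = open(dev, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    if (pread(fd, buf, sizeof buf, bad) < 0) {
        /* unrecoverable read error: the data is lost; rewriting the
         * block makes md rewrite the sectors on the underlying disks */
        memset(buf, 0, sizeof buf);
        if (pwrite(fd, buf, sizeof buf, bad) < 0)
            perror("pwrite");  /* write error: the disk gets failed */
        else
            puts("rewritten: the disk firmware relocated the sectors");
    }
    close(fd);
    return 0;
}

In practice you would open with O_DIRECT and an aligned buffer so the
page cache does not get in the way; buffered I/O is used here only to
keep the sketch short.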
>> A filesystem on a disk does not know what the firmware of the disk
>> does about sector relocation.
>> The same applies to the firmware of a hardware (not fake) raid
>> controller.
>> The same should apply to md. It is transparent to the filesystem.
> Yes, normally the raid layer and the fs layer are independent.
>
> But you can add better recovery with what I suggest.
>
>> IMHO a more interesting issue would be: a write error occurs on a
>> disk participating in an already degraded array; failing the disk
>> would fail the whole array. What to do? Put the array into read-only
>> mode, still allowing read access to the data on it for easy backup?
>> In such a situation, what would a hardware raid controller do?
>>
>> Hmm, yes... how do hardware raid controllers behave with
>> uncorrectable read errors?
>> And how do they behave with a write error on a disk of an already
>> degraded array?
>> I guess md should replicate these behaviours.
> I think we should be more intelligent than ordinary HW RAID :-)

I think it would be a good thing if the software raid had the same
features and reliability as those mission-critical hw controllers ;-)

Regards

-- 
Cordiali saluti. Yours faithfully.

Giovanni Tessore

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html