linux-raid.vger.kernel.org archive mirror
From: Giovanni Tessore <giotex@texsoft.it>
To: "Keld Jørn Simonsen" <keld@keldix.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Fri, 18 Feb 2011 10:47:28 +0100
Message-ID: <4D5E4030.8020805@texsoft.it>
In-Reply-To: <20110218025623.GA26387@www2.open-std.org>

On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
> On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>>> It should be possible to run a periodic check of whether any bad sectors
>>> have occurred in an array. Then the half-damaged file should be moved away
>>> from this area with the bad block by copying it and relinking it, and before
>>> relinking it to the proper place the good block corresponding to the bad
>>> block should be marked as a corresponding good block on the healthy disk
>>> drive, so that it is not allocated again. This action could even be
>>> triggered by the event of the detection of the bad block. This would
>>> probably mean that there needs to be a system call to mark a
>>> corresponding good block. The whole thing should be able to run in
>>> userland and somewhat independent of the file system type, except for
>>> the lookup of the corresponding file from a damaged block.
>> I don't follow this.. if a file has some damaged blocks, they are gone,
>> moving it elsewhere does not help.
> Remember the file is in a RAID. So you can lose one disk drive and your
> data is still intact.
>
>> And however, this is a task of the filesystem.
> No, it is the task of the raid, as it is the raid that gives the
> functionality that you can lose a drive and still have your data intact.
> The raid level knows what is lost, and what is still good, and where
> this stuff is.
>
> If we are then operating on the file level, then doing something clever could
> be a cooperation between the raid level and the filesystem level, as
> described above.

Raid of course has this functionality, but at block level; it's agnostic 
of the filesystem on it (there may be no filesystem at all actually, as 
for raid over raid); it does not know the word 'file'.

Raid adds SOME level of redundancy, not infinite. If the underlying
hardware has more damaged sectors than the redundancy of the raid
configuration can cover, the data in the stripe is lost; and the hardware
probably should be replaced.

Unrecoverable read errors FROM MD (those addressed by the Bad Block Log
feature) only appear when this redundancy level is not enough; for example:
- raid 1 in degraded mode with only 1 disk active, read error on the
remaining disk
- raid 5 in degraded mode, read error on one of the active disks
- raid 6 in degraded mode missing 2 disks, read error on one of the
active disks
- raid 5, read error on the same sector on more than 1 disk
- raid 6, read error on the same sector on more than 2 disks
- etc ...

In this situation nothing can be done, either at md level or at
filesystem level: the data on the block/stripe is lost.
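
The cases listed above can be summed up in a few lines of Python. This is
only a sketch for illustration, not md code: the function name and its
parameters are made up here, and it simply counts how many member chunks
of one stripe are unavailable (disk missing, or read error on that stripe)
against what the raid level can tolerate.

```python
def stripe_is_recoverable(level, total_disks, missing_disks, read_errors):
    """Return True if the stripe can still be reconstructed.

    level         -- "raid1", "raid5" or "raid6" (only these are modelled)
    total_disks   -- number of member disks in the array
    missing_disks -- members already failed or removed (degraded array)
    read_errors   -- active members giving a read error on this stripe
    """
    lost = missing_disks + read_errors
    if level == "raid1":
        # A mirror survives as long as at least one copy is readable.
        tolerated = total_disks - 1
    elif level == "raid5":
        tolerated = 1   # single parity
    elif level == "raid6":
        tolerated = 2   # double parity
    else:
        raise ValueError("level not modelled in this sketch: %s" % level)
    return lost <= tolerated


# raid 5 in degraded mode, read error on one of the active disks: lost
print(stripe_is_recoverable("raid5", 4, missing_disks=1, read_errors=1))
# healthy raid 6, read error on two disks: still recoverable
print(stripe_is_recoverable("raid6", 6, missing_disks=0, read_errors=2))
```

When the function returns False, we are in the situation described above:
the read error has to be reported by md itself.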

Remember that the Bad Block Log keeps track of the blocks/stripes that gave
this unrecoverable read error at md level. It has nothing to do with the
unreadable sector list of the underlying disks: if raid gets a read
error from a disk, it tries to reconstruct the data from the other disks
and to rewrite the sector; if it succeeds, all is ok for md (it just
increments the counter of corrected read errors, which is persistent for
1.x superblocks); otherwise there is a write error, and the disk is
marked as failed.
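
As a rough model of the flow just described (a sketch only: the class and
function names are invented for illustration and are not the kernel's, and
the Bad Block Log is shown as a plain set because it is still a proposed
feature):

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    write_ok: bool = True   # assumption: can the drive rewrite the sector?
    failed: bool = False


@dataclass
class Array:
    corrected_read_errors: int = 0
    bad_block_log: set = field(default_factory=set)


def handle_read_error(array, member, stripe, can_reconstruct):
    """Model what happens when `member` gives a read error on `stripe`."""
    if not can_reconstruct:
        # Redundancy exhausted: the error is unrecoverable at md level,
        # which is the only case the Bad Block Log is meant to record.
        array.bad_block_log.add(stripe)
        return "unrecoverable"
    if member.write_ok:
        # Data rebuilt from the other disks and rewritten in place.
        array.corrected_read_errors += 1
        return "corrected"
    # The rewrite itself failed: write error, the disk is kicked out.
    member.failed = True
    return "member-failed"


array = Array()
print(handle_read_error(array, Member("sdb"), stripe=7, can_reconstruct=True))
print(array.corrected_read_errors, array.bad_block_log)
```

Note that in the first two outcomes the array keeps working; only the
third one costs a disk.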


>
>> md is just a block device (more reliable than a single disk due to some
>> level of redundancy), and it should be independent of the kind of file
>> system on it (as the file system should be independent of the kind of
>> block device it resides on [md, hd, flash, iscsi, ...]).
> true
>
>> Then what you suggest should be done for every block device that can
>> have bad blocks (that is, every block device). Again, this is a
>> filesystem issue. And of which file system type, as there are many?
> yes, it is a cooperation between the file system layer, and the raid
> layer, I propose this be done in userland.
>
>> The Bad Block Log allows md to behave 'like' a real hard disk would do
>> with smart data:
>> - unreadable blocks/stripes are recorded into the log, as unreadable
>> sectors are recorded into smart data
>> - unrecoverable read errors are reported to the caller for both
>> - the device still works if it has unrecoverable read errors for both
>> (now the whole md device fails, this is the problem)
>> - if a block/stripe is rewritten with success, the block/stripe is
>> removed from the Bad Block Log (and the counter of relocated blocks/stripes
>> is incremented); as if a sector is rewritten with success on a disk, the
>> sector is removed from the list of unreadable sectors, and the counter of
>> relocated sectors is incremented (smart data)
> Smart drives also reallocate bad blocks, hiding the errors from the SW
> level.

And that is the only natural place where this operation should be done.
Suppose you got an unrecoverable read error from md on a block. It means
that some sector on one (or more) of the underlying disks gave a read
error. If you try to rewrite the md block, the sectors are rewritten to
the underlying disks, so either:
- all disks write correctly because they could solve the problem (it's a
matter of their firmware, maybe relocating the sector to a reserved area):
block relocated, all OK.
- some disks give an error on write (no more space for relocatable
errors, or other hw problems): then the disk(s) is(are) marked failed,
and must be replaced.
There is no need for reserved blocks anywhere else than those of the
underlying disks.
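
The two outcomes can be written down as a small Python sketch. The names
are hypothetical and the per-disk write results are passed in as plain
booleans; keeping the entry in the log when a write fails is an assumption
of this sketch, not something decided in this thread.

```python
def rewrite_block(bad_block_log, block, disk_write_results):
    """Rewrite an md block that is listed in the Bad Block Log.

    bad_block_log      -- set of md blocks with unrecoverable read errors
    disk_write_results -- dict: disk name -> True if its write succeeded
    Returns the sorted list of disks to be marked failed.
    """
    failed = sorted(name for name, ok in disk_write_results.items() if not ok)
    if not failed:
        # Every drive accepted the write (possibly relocating the sector
        # internally), so the block is good again.
        bad_block_log.discard(block)
    # Assumption: on a write failure the entry stays in the log.
    return failed


log = {42}
print(rewrite_block(log, 42, {"sda": True, "sdb": True}), log)
log = {42}
print(rewrite_block(log, 42, {"sda": True, "sdb": False}), log)
```

In both cases the relocation itself is left to the drive firmware; md only
looks at whether the write succeeded.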

Having reserved relocatable blocks at raid level would be useful to
address another situation: uncorrectable errors on write. But this is
another story.

>> A filesystem on a disk does not know what the firmware of the disk does
>> about sectors relocation.
>> The same applies for a hardware (not fake) raid controller firmware.
>> The same should apply for md. It is transparent to the filesystem.
> Yes, normally the raid layer and the fs layer are independent.
>
> But you can add better recovery with what I suggest.
>
>> IMHO a more interesting issue would be: a write error occurs on a disk
>> participating in an already degraded array; failing the disk would fail
>> the whole array. What to do? Put the array into read only mode, still
>> allowing read access to data on it for easy backup? In such a situation,
>> what would a hardware raid controller do?
>>
>> Hm, yes.... how do hardware raid controllers behave with uncorrectable
>> read errors?
>> And how do they behave with a write error on a disk of an already degraded
>> array?
>> I guess md should replicate these behaviours.
> I think we should be more intelligent than ordinary HW RAID:-)

I think it would already be a good result if software raid had the same
features and reliability as those mission-critical hw controllers ;-)

Regards

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
