linux-raid.vger.kernel.org archive mirror
From: Giovanni Tessore <giotex@texsoft.it>
To: "Keld Jørn Simonsen" <keld@keldix.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Fri, 18 Feb 2011 01:13:32 +0100	[thread overview]
Message-ID: <4D5DB9AC.10106@texsoft.it> (raw)
In-Reply-To: <20110217154440.GA24982@www2.open-std.org>

On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>>>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David
>>>>>>> Brown<david.brown@hesbynett.no>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>>>> blocks back to the file system, and that file systems tracked bad
>>>>>>>> block
>>>>>>>> lists.  Modern drives automatically relocate bad blocks (at least,
>>>>>>>> they
>>>>>>>> do if they can), but there was a time when they did not and it was up
>>>>>>>> to
>>>>>>>> the file system to track these.  Whether that still applies to modern
>>>>>>>> file systems, I do not know - the only file system I have studied in
>>>>>>>> low-level detail is FAT16.
>>>>>>> When the block device reports an error the filesystem can certainly
>>>>>>> record
>>>>>>> that information in a bad-block list, and possibly does.
>>>>>>>
>>>>>>> However I thought you were suggesting a situation where the block
>>>>>>> device
>>>>>>> could succeed with the request, but knew that area of the device was of
>>>>>>> low
>>>>>>> quality.
>>>>>> I guess that is what I was trying to suggest, though not very clearly.
>>>>>>
>>>>>>> e.g. IO to a block on a stripe which had one 'bad block'.  The IO
>>>>>>> should
>>>>>>> succeed, but the data isn't as safe as elsewhere.  It would be nice if
>>>>>>> we
>>>>>>> could tell the filesystem that fact, and if it could make use of it.
>>>>>>> But
>>>>>>> we
>>>>>>> currently cannot.   We can say "success" or "failure", but we cannot
>>>>>>> say
>>>>>>> "success, but you might not be so lucky next time".
>>>>>>>
>>>>>> Do filesystems re-try reads when there is a failure?  Could you return
>>>>>> fail on one read, then success on a re-read, which could be interpreted
>>>>>> as "dying, but not yet dead" by the file system?
>>>>> This should not be a file system feature. The file system is built upon
>>>>> the raid, and in mirrorred raid types like raid1 and raid10, and also
>>>>> other raid types, you cannot be sure which specific drive and sector the
>>>>> data was read from - it could be one out of many (typically two) places.
>>>>> So the bad blocks of a raid is a feature of the raid and its individual
>>>>> drives, not the file system. If it was a property of the file system,
>>>>> then the fs should be aware of the underlying raid topology, and know if
>>>>> this was a parity block or data block of raid5 or raid6, or which
>>>>> mirror instance of a raid1/10 type was involved.
>>>>>
>>>> Thanks for the explanation.
>>>>
>>>> I guess my worry is that if md layer has tracked a bad block on a disk,
>>>> then that stripe will be in a degraded mode.  It's great that it will
>>>> still work, and it's great that the bad block list means that it is
>>>> /only/ that stripe that is degraded - not the whole raid.
>>> I am proposing that the stripe not be degraded, using a recovery area for
>>> bad
>>> blocks on the disk, that goes together with the metadata area.
>>>
>>>> But I'm hoping there can be some sort of relocation somewhere
>>>> (ultimately it doesn't matter if it is handled by the file system, or by
>>>> md for the whole stripe, or by md for just that disk block, or by the
>>>> disk itself), so that you can get raid protection again for that stripe.
>>> I think we agree in hoping:-)
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature,
>> as it helps keep track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when an
>> unrecoverable read error occurs (now the full array goes offline). Then
>> something must be done proactively to repair the situation, as it means
>> that a disk of the array has problems and should be replaced. So, first
>> it's worth making a backup of the still-alive array (getting some read
>> errors when the bad blocks/stripes are encountered [maybe using ddrescue
>> or similar]), then replace the disk and reconstruct the array; after
>> that a fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read errors come from just a very
>> few sectors of the disk, and that it's not worth replacing it (personally
>> I would replace it even for very few), as there are still many reserved
>> sectors for relocation on the disk. Then a simple solution would just be
>> to zero-write the bad blocks in the Bad Block Log (the data is gone
>> already): if the write succeeds (disk uses reserved sectors for
>> relocation), the blocks are removed from the log (now they are ok); then
>> fsck (hopefully) may repair the filesystem. At this point there are no
>> more md read errors, maybe just filesystem errors (the array is clean,
>> the filesystem may not be, but notice that nothing can be done to avoid
>> filesystem problems, as there has been data loss; only fsck may help).
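
[ The zero-write repair idea quoted above could be sketched as a small
simulation. This is my own hypothetical model of the procedure, not md
code: for each entry in the bad-block log, write zeros; if the device
accepts the write (the disk remaps the sector internally), the entry
leaves the log. ]

```python
# Hypothetical model of the zero-write repair idea from the quoted text.
# The data in a bad block is already lost; the goal is only to make the
# block writable (and thus readable) again via the disk's spare sectors.

def repair_bad_blocks(bad_block_log, write_block):
    """Try to clear a bad-block log by zero-writing each entry.

    bad_block_log: set of logical block numbers currently marked bad.
    write_block:   callable(block, data) -> bool, True if the write
                   succeeded (disk relocated the sector).
    Returns the set of blocks that still could not be written.
    """
    still_bad = set()
    for block in sorted(bad_block_log):
        if write_block(block, b"\x00" * 4096):
            # Disk used a spare sector; the block is readable again,
            # so it is dropped from the log.
            pass
        else:
            still_bad.add(block)
    return still_bad

# Example: blocks 7 and 42 are bad; the (simulated) disk can remap
# block 7 but has run out of spares for block 42.
remaining = repair_bad_blocks({7, 42}, lambda blk, data: blk != 42)
```

After this, only the blocks the disk could not relocate stay in the
log, and fsck can deal with the filesystem-level damage.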
> another way around, if the badblocks recovery area does not fly with
> Neil or other implementors.
>
> It should be possible to run a periodic check for whether any bad sectors
> have occurred in an array. Then the half-damaged file should be moved away
> from the area with the bad block by copying it and relinking it, and before
> relinking it to the proper place the good block corresponding to the bad
> block should be marked as such on the healthy disk
> drive, so that it is not allocated again. This action could even be
> triggered by the event of the detection of the bad block. This would
> probably mean that there needs to be a system call to mark a
> corresponding good block. The whole thing should be able to run in
> userland and be somewhat independent of the file system type, except for
> the lookup of the corresponding file from a damaged block.

I don't follow this... if a file has damaged blocks, the data is already 
gone; moving the file elsewhere does not help.

In any case, this is a task for the filesystem.

md is just a block device (more reliable than a single disk due to some 
level of redundancy), and it should be independent of the kind of file 
system on it (just as the file system should be independent of the kind of 
block device it resides on [md, hd, flash, iscsi, ...]).

Otherwise, what you suggest would have to be done for every block device 
that can have bad blocks (that is, every block device). Again, this is a 
filesystem issue. And for which filesystem type, as there are many?

The Bad Block Log allows md to behave 'like' a real hard disk does 
with SMART data:
- unreadable blocks/stripes are recorded into the log, just as unreadable 
sectors are recorded into SMART data
- unrecoverable read errors are reported to the caller in both cases
- the device keeps working even with unrecoverable read errors, in both 
cases (currently the whole md device fails; this is the problem)
- if a block/stripe is rewritten with success, the block/stripe is 
removed from the Bad Block Log (and the counter of relocated blocks/stripes 
is incremented); just as when a sector is rewritten with success on a disk, 
the sector is removed from the list of unreadable sectors, and the counter 
of relocated sectors is incremented (SMART data)
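
To make the analogy concrete, here is a sketch of those log semantics.
This is my own illustration, not md's actual data structures: unreadable
blocks are recorded, reads of them fail, and a successful rewrite
removes the entry and bumps a "relocated" counter, mirroring a disk's
pending-sector and reallocated-sector SMART attributes.

```python
# Toy bad-block log with SMART-like semantics (illustration only).

class BadBlockLog:
    def __init__(self):
        self.bad = set()        # blocks currently known unreadable
        self.relocated = 0      # analogue of a reallocated-sector count

    def record_read_error(self, block):
        # An unreadable block/stripe enters the log, like a pending
        # sector entering SMART data.
        self.bad.add(block)

    def read(self, block):
        # Unrecoverable read errors are reported to the caller, but the
        # device as a whole keeps working.
        if block in self.bad:
            raise IOError(f"unrecoverable read error at block {block}")
        return b"\x00" * 4096   # placeholder data

    def write(self, block):
        # A successful rewrite clears the entry and counts a relocation,
        # just as a disk clears a pending sector on rewrite.
        if block in self.bad:
            self.bad.discard(block)
            self.relocated += 1

log = BadBlockLog()
log.record_read_error(5)   # read error recorded; device still usable
log.write(5)               # rewrite succeeds; entry cleared, counter bumped
```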

A filesystem on a disk does not know what the firmware of the disk does 
about sector relocation.
The same applies to the firmware of a hardware (not fake) raid controller.
The same should apply to md. It is transparent to the filesystem.

IMHO a more interesting issue would be: a write error occurs on a disk 
participating in an already degraded array; failing the disk would fail 
the whole array. What to do? Put the array into read-only mode, still 
allowing read access to the data on it for easy backup? In such a 
situation, what would a hardware raid controller do?
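
The policy being suggested could be sketched like this (a hypothetical
model, not current md behaviour; the class and threshold are my own
invention for illustration):

```python
# Sketch of the suggested policy: a write error on a disk of an
# already-degraded array switches the array to read-only instead of
# failing it outright, so the surviving data stays readable for backup.

class Array:
    def __init__(self, disks, redundancy=1):
        self.disks = disks
        self.redundancy = redundancy   # disks that may fail before data loss
        self.failed = 0
        self.read_only = False

    def on_write_error(self):
        if self.failed < self.redundancy:
            # Still redundant: fail the disk and keep running degraded.
            self.failed += 1
        else:
            # Already degraded: failing another disk would kill the
            # array, so freeze writes but stay readable for backup.
            self.read_only = True

a = Array(disks=3)      # e.g. a 3-disk raid5, one disk of redundancy
a.on_write_error()      # first error: array degraded but still read-write
a.on_write_error()      # second error while degraded: go read-only
```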

Hm, yes... how do hardware raid controllers behave with uncorrectable 
read errors?
And how do they behave with a write error on a disk of an already degraded 
array?
I guess md should replicate these behaviours.

... Neil?

Regards.

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore



