Re: limits on raid - David Greaves

linux-raid.vger.kernel.org archive mirror

From: David Greaves <david@dgreaves.com>
To: Neil Brown <neilb@suse.de>
Cc: Wakko Warner <wakko@animx.eu.org>,
	david@lang.hm, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org
Subject: Re: limits on raid
Date: Sat, 16 Jun 2007 14:33:14 +0100
Message-ID: <4673E69A.4020309@dgreaves.com>
In-Reply-To: <18035.23867.576212.859440@notabene.brown>

Neil Brown wrote:
> On Friday June 15, wakko@animx.eu.org wrote:
>  
>>                                                   As I understand the way
>> raid works, when you write a block to the array, it will have to read all
>> the other blocks in the stripe and recalculate the parity and write it out.
> 
> Your understanding is incomplete.

Does this help?
[for future reference so you can paste a url and save the typing for code :) ]

http://linux-raid.osdl.org/index.php/Initial_Array_Creation

David

Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity 
is what's called the "initial resync".

The kernel takes one disk (or two for raid6) and marks it as 'spare'; it then 
creates the array in degraded mode. It then marks the spare disks as 
'rebuilding', reads from the 'good' disks, calculates the parity, determines 
what should be on each spare disk and writes it. Once all this is done the 
array is clean and all disks are active.

This can take quite some time and the array is not fully resilient whilst this 
is happening (it is, however, fully usable).

--assume-clean

Some people have noticed the --assume-clean option in mdadm and speculated that 
this can be used to skip the initial resync. It can, but this is a bad idea in 
some cases and a *very* bad idea in others.

raid5

For raid5 especially it is NOT safe to skip the initial sync. The raid5 
implementation optimises use of the component disks and it is possible for all 
updates to be "read-modify-write" updates which assume the parity is correct. If 
it is wrong, it stays wrong. Then when you lose a drive, the parity blocks are 
wrong so the data you recover using them is wrong. In other words - you will get 
data corruption.

For raid5 on an array with more than 3 drives, if you attempt to write a single 
block, it will:

     * read the current value of the block, and the parity block.
     * "subtract" the old value of the block from the parity, and "add" the new 
       value.
     * write out the new data and the new parity.

If the parity was wrong before, it will still be wrong. If you then lose a 
drive, you lose your data.
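
The steps above can be illustrated with a small model. This is only a sketch, 
not the kernel's code: it assumes plain XOR parity over one stripe held in 
memory, and the names xor_blocks, read_modify_write and rebuild_lost_block are 
made up for the illustration. It compares a stripe whose parity was synced with 
one whose parity was never initialised.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_modify_write(stripe, parity, index, new_block):
    """Update one data block using only the old block and the old parity."""
    old_block = stripe[index]
    # "Subtract" the old value and "add" the new one: with XOR both are ^.
    new_parity = xor_blocks(parity, old_block, new_block)
    updated = stripe[:index] + [new_block] + stripe[index + 1:]
    return updated, new_parity


def rebuild_lost_block(stripe, parity, lost):
    """Recover the block at position `lost` from the parity and the survivors."""
    survivors = [block for i, block in enumerate(stripe) if i != lost]
    return xor_blocks(parity, *survivors)


# One stripe: four data blocks (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
synced_parity = xor_blocks(*data)      # what the initial resync would write
stale_parity = b"\x00\x00\x00\x00"     # whatever happened to be on the disk

# The same single-block write, applied to both cases.
new_data, parity_ok = read_modify_write(data, synced_parity, 1, b"bbbb")
_, parity_bad = read_modify_write(data, stale_parity, 1, b"bbbb")

# Now lose the drive holding block 2 and rebuild it.
recovered_ok = rebuild_lost_block(new_data, parity_ok, 2)
recovered_bad = rebuild_lost_block(new_data, parity_bad, 2)

print(recovered_ok == b"CCCC")   # synced parity: block 2 comes back intact
print(recovered_bad == b"CCCC")  # stale parity: block 2 is silently wrong
```

Note that the write itself succeeds in both cases; the damage only shows up 
when a block has to be rebuilt from the parity.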

linear, raid0,1,10

These raid levels do not need an initial sync.

linear and raid0 have no redundancy.

raid1 always writes all data to all disks.

raid10 always writes all data to all relevant disks.

Other raid levels

Probably the most noticeable effect for the other raid levels is that if you 
don't sync first, then every check will find lots of errors. (Of course you 
could 'repair' instead of 'check'. Or do that once. Or something.)
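
For reference, 'check' and 'repair' are requested by writing to the array's 
sync_action file in sysfs, and the number of mismatches found is reported in 
mismatch_cnt. A minimal sketch of driving that from a script follows. The helper 
names and the sysfs_root parameter are made up for the illustration, and the 
array name md0 in the usage note is an assumption.

```python
from pathlib import Path


def md_sysfs_dir(array, sysfs_root="/sys"):
    """Return the md attribute directory for an array such as 'md0'."""
    return Path(sysfs_root) / "block" / array / "md"


def request_sync_action(array, action, sysfs_root="/sys"):
    """Ask md to start a 'check' or 'repair' pass, or to go 'idle'."""
    if action not in ("check", "repair", "idle"):
        raise ValueError("unsupported sync action: %r" % (action,))
    (md_sysfs_dir(array, sysfs_root) / "sync_action").write_text(action + "\n")


def mismatch_count(array, sysfs_root="/sys"):
    """Read the mismatch counter left behind by the last check or repair."""
    text = (md_sysfs_dir(array, sysfs_root) / "mismatch_cnt").read_text()
    return int(text.strip())
```

Run as root, request_sync_action("md0", "repair") would start a repair pass and 
mismatch_count("md0") would read the counter once it has finished. The 
sysfs_root parameter exists only so the helpers can be tried against a scratch 
directory instead of a live array.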

For raid6 it is also safe to not sync first, though with the same caveat. Raid6 
always updates parity by reading all blocks in the stripe that aren't known and 
calculating P and Q. So the first write to a stripe will make P and Q correct 
for that stripe. This is current behaviour. There is no guarantee it will never 
change (so theoretically one day you may upgrade your kernel and suffer data 
corruption on an old raid6 array).
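
To see why the first write fixes a stripe, here is a sketch of such a 
"reconstruct-write". It is a simplified model rather than the kernel code: it 
assumes the usual raid6 arithmetic (P is the XOR of the data blocks, Q weights 
block i by 2^i in GF(2^8) with polynomial 0x11d), and the function names are 
made up for the illustration. The point is that the old P and Q are never read, 
so whatever garbage they held cannot survive the write.

```python
def gf_mul2(x):
    """Multiply a byte by 2 in GF(2^8), reducing by the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x


def compute_pq(data_blocks):
    """Compute P and Q from every data block in the stripe (Horner's rule)."""
    size = len(data_blocks[0])
    p = bytearray(size)
    q = bytearray(size)
    # Walk from the highest-numbered block down so block i ends up times 2^i.
    for block in reversed(data_blocks):
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


def reconstruct_write(data_blocks, index, new_block):
    """Write one block and recompute P and Q from scratch, ignoring old parity."""
    updated = list(data_blocks)
    updated[index] = new_block
    p, q = compute_pq(updated)
    return updated, p, q


# A never-synced stripe (hypothetical contents): the old P and Q on disk could
# be anything, which does not matter because they are not an input here.
data = [b"AAAA", b"BBBB", b"CCCC"]
updated, p, q = reconstruct_write(data, 0, b"aaaa")

# Lose block 1 and rebuild it from P and the surviving data blocks.
recovered = bytes(x ^ y ^ z for x, y, z in zip(p, updated[0], updated[2]))
print(recovered == b"BBBB")
```

Contrast this with a read-modify-write update, which takes the old parity as an 
input and therefore carries any existing error forward.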

Summary

In summary, it is safe to use --assume-clean on a raid1 or raid10, though a 
"repair" is recommended before too long. For other raid levels it is best avoided.

Potential 'Solutions'

There have been 'solutions' suggested including the use of bitmaps to 
efficiently store 'not yet synced' information about the array. It would be 
possible to have a 'this is not initialised' flag on the array, and if that is 
not set, always do a reconstruct-write rather than a read-modify-write. But the 
first time you have an unclean shutdown you are going to resync all the parity 
anyway (unless you have a bitmap....) so you may as well resync at the start. So 
essentially, at the moment, there is no interest in implementing this since the 
added complexity is not justified.

What's the problem anyway?

First of all, RAID is all about being safe with your data.

And why is it such a big deal anyway? The initial resync doesn't stop you from 
using the array. If you wanted to put an array into production instantly and 
couldn't afford any slowdown due to resync, then you might want to skip the 
initial resync.... but is that really likely?

So what is --assume-clean for then?

Disaster recovery. If you want to build an array from components that used to be 
in a raid then this stops the kernel from scribbling on them. As the man page says:

"Use this only if you really know what you are doing."
