* component growing in raid5
@ 2008-03-23  6:59 Nagy Zoltan
  2008-03-23 11:24 ` Peter Grandi
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Nagy Zoltan @ 2008-03-23  6:59 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]

hi all,

I've set up a two-dimensional array:
    * the leaf nodes compose raid5 arrays from their disks and export
      them as iSCSI targets
    * the root node creates a raid5 on top of the exported targets

in this setup I have to face the fact that an array component can
(and will) grow, so I created a test case to see what comes out ;)
    * after growing the components, mdadm no longer recognizes them
      as array members (because there is no superblock at the end of
      the device - the last 64k?)
      I've tried to inform mdadm about the size of the components,
      but it said no ;)
    * I've added an ad-hoc superblock copy operation after the
      expansion, to make it possible for mdadm to recognize and
      assemble the array - it works, and passes my test.

is there a less 'funky' solution for this? ;)
can I run into any trouble when doing this on the real system?

one more thing: when I first assembled the array with 4096KB chunks, I
ran into the '2.6.24-rc6 reproducible raid5 hang' bug; it wouldn't
resume after changing 'stripe_cache_size', even after I applied the
patch manually to 2.6.24.2. I've since upgraded to 2.6.25-rc6 and it
runs smoothly - thank you all for hunting that bug down.
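
(for reference, the md raid5/6 stripe cache can also be resized at
runtime through sysfs - a one-line sketch, assuming the array is
/dev/md0:

    echo 8192 > /sys/block/md0/md/stripe_cache_size

the value is the number of cache entries, not bytes.)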

kirk


[-- Attachment #2: raid-test.bash --]
[-- Type: text/plain, Size: 2334 bytes --]

#!/bin/bash

t=32768	# growth per step, in 512-byte dd blocks (= 16 MiB)

xall()
{
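	# run the shell function passed as $1 once for each of the 8 components
	# (the current component index is available to it as $i)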
	for i in `seq 0 7`;do
		$*
	done
}

mkdir -p part data orig

echo "*** $1"
case "$1" in
	test)
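			# end-to-end test: build the array and filesystem, grow every
			# component, copy superblocks, re-assemble, grow the array and
			# the filesystem, then verify the data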
			export		FS_DEV=/dev/md0
			$0 create					&&
			$0 createfs				&&
			$0 stop					&&
			$0 expand					&&
			$0 copy-sb				&&
			$0 lo_up					&&
			$0 assemble				&&
			$0 raid-expand				&&
			$0 fs-expand				&&
			$0 fs-check
			echo "---closing---"
			$0 stop
		;;
	test2)
			export		FS_DEV=/dev/mapper/rt
			$0 create					&&
			$0 crypt /dev/md0			&&
			$0 crypt-open				&&
			$0 createfs				&&
			$0 crypt-close				&&
			$0 stop					&&
			$0 expand					&&
			$0 copy-sb				&&
			$0 lo_up					&&
			$0 assemble				&&
			$0 raid-expand				&&
			$0 crypt-open				&&
			$0 fs-expand				&&
			$0 fs-check
			echo "---closing---"
			$0 stop
		;;
	zero)
			q(){	
				rm -f part/$i;	touch part/$i;	}
			xall q
		;;
	expand)
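			# keep a pristine copy of each component, then append $t zero
			# blocks to every backing file to simulate the leaf arrays growing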
			cp part/{0..7} orig/
			q(){	
				dd if=/dev/zero count=$t >> part/$i;		}
			xall q
		;;
	raid-expand)
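			# new per-component size for mdadm -z: backing file size in KiB
			# minus the 64 KiB reserved at the end for the v0.90 superblock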
			i=0
			s=$(stat --printf '%s' part/$i)
			x=$(( (s / 1024) - 64 ))
			mdadm --grow -z $x /dev/md0
		;;
	create)
			$0 zero
			$0 expand
			$0 lo_up
			mdadm -Cv -n8 -l5 /dev/md0 /dev/loop{0..7}
		;;
	crypt)
			dd if=/dev/urandom count=1 of=key
			cryptsetup luksFormat /dev/md0 key				||	exit 1
		;;
	crypt-open)
			cryptsetup luksOpen /dev/md0 rt --key-file key	||	exit 1
		;;
	crypt-close)
			cryptsetup luksClose rt						||	exit 1
		;;
	fs-expand)
			[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
			mount $FS_DEV data
			xfs_growfs data
			umount data
		;;
	fs-check)
			[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
			mount $FS_DEV data
			md5sum -c hash && echo ok || echo error
			umount data
		;;
	createfs)
			[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
			mkfs.xfs $FS_DEV
			mount $FS_DEV data
			dd if=/dev/urandom of=data/junk
			md5sum data/junk | tee hash
			umount data
		;;
	lo_up)
			q(){
				losetup /dev/loop$i part/$i;	}
			xall q
		;;
	lo_down)
			q(){
				losetup -d /dev/loop$i;	}
			xall q
		;;
	assemble)
			mdadm -Av /dev/md0 /dev/loop{0..7}
		;;
	stop)
			umount data
			$0 crypt-close
			mdadm -S /dev/md0
			$0 lo_down
		;;
	copy-sb)
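			# the v0.90 superblock lives in the last 64 KiB of each component;
			# copy the old tail to the new end so mdadm can still find it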
			q(){
				s=$(stat --printf '%s' part/$i)
				tail -c 65536 orig/$i | dd bs=1 seek=$(( s - 65536 )) of=part/$i;	}
			xall q
		;;
	*)	echo "usage: $0 {test|test2|zero|expand|raid-expand|create|crypt|crypt-open|crypt-close|createfs|fs-expand|fs-check|lo_up|lo_down|assemble|stop|copy-sb}"
		;;
esac

* Re: component growing in raid5
  2008-03-23  6:59 component growing in raid5 Nagy Zoltan
@ 2008-03-23 11:24 ` Peter Grandi
  2008-03-24  7:09 ` Peter Rabbitson
  2 siblings, 0 replies; 13+ messages in thread
From: Peter Grandi @ 2008-03-23 11:24 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>     * the leaf nodes compose raid5 arrays from their disks and
>       export them as iSCSI targets
>     * the root node creates a raid5 on top of the exported
>       targets

That's "amazing" to say the least. A way to prove that
syntactically valid setups can and do work. I am particularly
"amazed" by the idea of using a whole RAID5 subarray just for
parity on the top level RAID5.

> in this setup I have to face the fact that an array component
> can (and will) grow, [ ... ]

This idea seems to me beyond "amazing", and I think that even
"stunning" is an understatement.

kirk> one more thing: when I first assembled the array with
kirk> 4096KB chunks,

Indeed, 4MiB chunk sizes are syntactically possible, and it is a
rather "exciting" choice, especially for a RAID55.

[ ... ]
>			cryptsetup luksFormat /dev/md0 key				||	exit 1

Even more "amazing", using 'dm-crypt' over an already "stunning"
setup.

[ ... ]
>			mkfs.xfs $FS_DEV

Entirely consistently, the syntactically valid setup above is
used for a single (presumably very large) filesystem. Very
"courageous".

Surely there must be extraordinarily good reasons for building a
RAID55 with 4MiB chunks (with "amazing" performance for writes,
and "stunning" resilience) and with the expectation that one
will extend it by growing each subarray, triggering a full
two-level reshape every time, and putting 'dm-crypt' and a
single filesystem on top of it all.

It would be interesting to learn those reasons for people like
me whose imagination is limited by stolid pragmatics.

* Re: component growing in raid5
  2008-03-23  6:59 component growing in raid5 Nagy Zoltan
  2008-03-23 11:24 ` Peter Grandi
  2008-03-24  7:09 ` Peter Rabbitson
@ 2008-03-24  7:09 ` Peter Rabbitson
  2008-03-24 15:17   ` Nagy Zoltan
  2 siblings, 1 reply; 13+ messages in thread
From: Peter Rabbitson @ 2008-03-24  7:09 UTC (permalink / raw)
  To: Nagy Zoltan; +Cc: linux-raid

Nagy Zoltan wrote:
> hi all,
> 
> I've set up a two-dimensional array:
>    * the leaf nodes compose raid5 arrays from their disks and export
>      them as iSCSI targets
>    * the root node creates a raid5 on top of the exported targets
> 
> in this setup I have to face the fact that an array component can
> (and will) grow, so I created a test case to see what comes out ;)
>    * after growing the components, mdadm no longer recognizes them
>      as array members (because there is no superblock at the end of
>      the device - the last 64k?)
>      I've tried to inform mdadm about the size of the components,
>      but it said no ;)
>    * I've added an ad-hoc superblock copy operation after the
>      expansion, to make it possible for mdadm to recognize and
>      assemble the array - it works, and passes my test.
> 
> is there a less 'funky' solution for this? ;)
> can I run into any trouble when doing this on the real system?
> 

I would simply use a v1.1 superblock, which is situated at the start
of each component device. Then you will face another problem - once
you grow a leaf device, mdadm will not see the new size, as it will
find the superblock at sector 0 and be done there. You will need to
issue mdadm -A ... --update devicesize. The rest of the operations are
identical.

As a side note, I am also curious why you went down the raid55 path (I
am not very impressed, however :)

Peter


* Re: component growing in raid5
  2008-03-24  7:09 ` Peter Rabbitson
@ 2008-03-24 15:17   ` Nagy Zoltan
  2008-03-24 15:42     ` Peter Rabbitson
  2008-03-25 13:06     ` Peter Grandi
  0 siblings, 2 replies; 13+ messages in thread
From: Nagy Zoltan @ 2008-03-24 15:17 UTC (permalink / raw)
  To: Peter Rabbitson, linux-raid

hi

> I would simply use a v1.1 superblock, which is situated at the start
> of each component device. Then you will face another problem - once
> you grow a leaf device, mdadm will not see the new size, as it will
> find the superblock at sector 0 and be done there. You will need to
> issue mdadm -A ... --update devicesize. The rest of the operations
> are identical.
I felt that there was another solution I had missed - thank you, next
time I will do it this way -- since the system is already up and
running, I don't want to recreate the array (about the chunk size:
I've gone back to 64KB chunks because of that bug - I was happy to see
it running ;)
>
> As a side note, I am also curious why you went down the raid55 path
> (I am not very impressed, however :)
okay - I've run through the whole scenario a few times and always come
back to raid55; what would you do in my place? :)

I chose this approach because:
    * hardware raid controllers are expensive - because of this I
      prefer having a cluster of machines (average cost per MB shows
      that this is the 'cheapest' solution); this solution's impact
      on average cost is about 20-25% compared to a single
      stand-alone disk - 40-50% if I count only usable storage
    * as far as I know, other raid configurations take a bigger
      piece of the cake
        - raid10 and raid01 both halve the usable space
        - simply creating a raid0 array at the top level could
          suffer complete destruction if a node fails (in some rare
          cases the power supply can take everything along with it)
        - raid05 could be a reasonable choice, providing n*(m-1)
          space; but the failure of a single disk would trigger a
          full-scale rebuild
    * raid55 - considering an array of n*m disks, it gives
      (n-1)*(m-1) usable space with the ability to detect failing
      disks and repair them while the cluster is still online - I
      can even grow it without taking it offline! ;) (see the worked
      example after this list)
      and at the leaves the processing power required for the raid
      is already there... why not use it? ;)
    * with iscsi I can detach a node, and when I reattach it, its
      size is redetected
    * after replacing a leaf's failing drive, the node itself can
      rebuild its local array, preventing a whole system-scale
      rebuild from being triggered
    * an alternative solution could be: drop the top-level raid5 and
      replace it with unionfs by creating individual filesystems -
      there is an interesting thing about raiding filesystems (raif)
    * the leaf nodes boot from the network, exporting their local
      array through dm_crypt over iscsi - this is something I would
      do differently next time... I don't know how much parallelism
      dm_crypt can achieve, but doing it on a per-device basis would
      provide 'enough' parallelism for the kernel to better utilize
      the processing power
    * the root's role is to manage the filesystem, monitor the
      leaves, and provide network boot for them
    * effectively the root node is nothing more than an HBA ;)
    * the construction of the system is not complete - I'm waiting
      for some gbit interfaces; after they arrive, the root will have
      a 4Gbit link to the leaves, and by customizing the routing
      table a bit it will see only a portion of the leaves through
      each of them - I could possibly trunk the interfaces, but I
      think it's not necessary
    * this cluster can scale up at any time by assimilating new
      nodes ;)
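
For a concrete sense of the space efficiency, a quick sketch of the
arithmetic with the node and disk counts given further down the thread
(8 nodes of 5 disks each):

    n=8; m=5
    echo "usable: $(( (n-1)*(m-1) )) of $(( n*m )) disks"    # 28 of 40 = 0.7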

kirk


* Re: component growing in raid5
  2008-03-24 15:17   ` Nagy Zoltan
@ 2008-03-24 15:42     ` Peter Rabbitson
  2008-03-24 16:52       ` Nagy Zoltan
  2008-03-25 13:06     ` Peter Grandi
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Rabbitson @ 2008-03-24 15:42 UTC (permalink / raw)
  To: Nagy Zoltan; +Cc: linux-raid

Nagy Zoltan wrote:
> hi
> 
>> I would simply use a v1.1 superblock, which is situated at the
>> start of each component device. Then you will face another problem
>> - once you grow a leaf device, mdadm will not see the new size, as
>> it will find the superblock at sector 0 and be done there. You will
>> need to issue mdadm -A ... --update devicesize. The rest of the
>> operations are identical.
> I felt that there was another solution I had missed - thank you,
> next time I will do it this way -- since the system is already up
> and running, I don't want to recreate the array (about the chunk
> size: I've gone back to 64KB chunks because of that bug - I was
> happy to see it running ;)
>>
>> As a side note, I am also curious why you went down the raid55 path
>> (I am not very impressed, however :)
> okay - I've run through the whole scenario a few times and always
> come back to raid55; what would you do in my place? :)

The validity of the snipped arguments depends on how many devices you have at 
every level:

*) how many nodes are there?
*) how many disks per node? do all nodes have an equal number of disks?

Without additional info I would say this: the problem with using raid5
on the top node is that you are stressing your network additionally
for every r-m-w cycle. Also, a rebuild of this array, especially if
you add more leaves, will be more and more resource intensive.

In contrast, if the top array is RAID10 with 2 chunk copies, you will
sacrifice half the space, but a rebuild will utilize only 2 drives
(one reader, one writer).

HTH

Peter

* Re: component growing in raid5
  2008-03-24 15:42     ` Peter Rabbitson
@ 2008-03-24 16:52       ` Nagy Zoltan
  0 siblings, 0 replies; 13+ messages in thread
From: Nagy Zoltan @ 2008-03-24 16:52 UTC (permalink / raw)
  To: Peter Rabbitson, linux-raid

hi

> The validity of the snipped arguments depends on how many devices you 
> have at every level:
>
> *) how many nodes are there?
8 nodes
> *) how many disks per node? do all nodes have an equal number of disks?
currently there are 5 disks at every node, and yes, all nodes have an
equal number - but the only thing that matters is that the exported
sizes are the same
>
> Without additional info I would say this: the problem with using
> raid5 on the top node is that you are stressing your network
> additionally for every r-m-w cycle. Also, a rebuild of this array,
> especially if you add more leaves, will be more and more resource
> intensive.
after the new NICs are installed, I expect the rebuild to take about
8 hours to complete.
yes, the r-m-w cycle would be a pain - but I expect many more reads
than writes. this is truly resource intensive, but I think the
bottleneck will not be at the network level; it will be at the root
node's south-west connection.
I must note that this is my first raid setup ;) and I've not faced
r-m-w-cycle problems before; because the usage conditions don't imply
continuous random writes, we will be happy with this. next week we
will try it out under target conditions, and if something does not go
as expected, I will reconsider applying raid10 to it.

> In contrast, if the top array is RAID10 with 2 chunk copies, you
> will sacrifice half the space, but a rebuild will utilize only 2
> drives (one reader, one writer).
yes, that could clearly give better performance, with a 0.5
usable-space ratio - but I'm working with a relatively low budget;
the current ratio is 4*7/(5*8) = 0.7

kirk

* Re: component growing in raid5
  2008-03-24 15:17   ` Nagy Zoltan
  2008-03-24 15:42     ` Peter Rabbitson
@ 2008-03-25 13:06     ` Peter Grandi
  2008-03-25 13:38       ` Mattias Wadenstein
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Grandi @ 2008-03-25 13:06 UTC (permalink / raw)
  To: Linux RAID

>>> On Mon, 24 Mar 2008 16:17:29 +0100, Nagy Zoltan
>>> <kirk@bteam.hu> said:

> [ ... ] because the system is already up and running, I
> don't want to recreate the array [ ... ]

It looks like you feel lucky :-).

>> As a side note I am also curious why do you go the raid55 path
>> (I am not very impressed however :)

> okay - I've run through the whole scenario a few times and always
> come back to raid55; what would you do in my place? :)

Well, it all depends on what the array is for. However, from some of
the previous messages it looks like it is pretty big, probably with
several dozen drives in it across more than a dozen hosts.

But it is not clear what it is being used for, except that IIRC it is
mostly for reading. Things like access patterns (multithreaded or
single client?), file size profile, availability required, etc.,
matter too.

Anyhow as to very broadly applicable if not optimal guidelines, I
would first apply Peter's (me :->) Law of RAID Level Choice:

  * If you don't know what you are doing, use RAID10.
  * If you know what you are doing, you are already using RAID10
    (except in a very few special cases).

To this I would add some general principles based on calls of
judgement on my part, but that several people seem to judge
differently (good luck to them!):

  * Single volume filesystems larger than 1-2TB require something
    like JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger
    than 5-10TB is not entirely feasible with any filesystem
    currently known (just think 'fsck' times) even if the ZFS
    people glibly say otherwise (no 'fsck' ever!).

  * Single RAID volumes up to say 10-20TB are currently feasible,
    say as 24x(1+1)x1TB (for example with Thumpers). Beyond that
    I would not even try, and even that is a bit crazy. I don't
    think that one should put more than 10-15 drives at most in
    a single RAID volume, even RAID10 ones.

  * Large storage pools can only be reasonably built by using
    multiple volumes across networks and on top of those some
    network/cluster file system, and it matters a bit whether
    single filesystem image is essential or not.

So my suggestions are:

  * For larger filesystems I would use multiple Thumpers (or
    equivalent) divided in multiple 2TB volumes with a network
    filesystem like OpenAFS for home directories or a parallel
    network filesystem like Lustre for data directories.

  * Multiple 2-4TB RAID10 volumes each with a JFS/XFS filesystem
    exported via NFSv4 might be acceptable if single filesystem
    image semantics are not required.

  * Consider the case for doing RAID10 over the network, by having
    for example two Thumpers with 48 drives and creating 48 RAID1
    pairs across the network using DRBD, and then creating 2-4TB
    RAID0 volumes with half a dozen of those pairs each.

  * RAID5 (but not RAID6 or other mad arrangements) may be used if
    almost all accesses are reads, the data carries end-to-end
    checksums, and there are backups-of-record for restoring the
    data quickly, and then each array is not larger than say 4+1.
    In other words if RAID5 is used as a mostly RO frontend, for
    example to a large slow tape archive (thanks to R. Petkus for
    persuading me that there is this exception).

A couple of relevant papers for inspiration on best practices by
those that have to deal with this stuff:

  https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257
  http://indico.fnal.gov/contributionDisplay.py?contribId=43&sessionId=30&confId=805

> I chose this approach because:

>     * hardware raid controllers are expensive - because of this I
>       prefer having a cluster of machines (average cost per MB shows
>       that this is the 'cheapest' solution); this solution's impact
>       on average cost is about 20-25% compared to a single
>       stand-alone disk - 40-50% if I count only usable storage

That's strange. Especially as iSCSI host adapters are not exactly
cheaper than SAS/SATA ones.

>     * as far as I know, other raid configurations take a bigger
>       piece of the cake
>         - raid10 and raid01 both halve the usable space
>         - simply creating a raid0 array at the top level could
>           suffer complete destruction if a node fails (in some rare
>           cases the power supply can take everything along with it)

Check out http://WWW.BAARF.com/ for the "but RAID10 is not cost
effective" argument :-).

>         - raid05 could be a reasonable choice, providing n*(m-1)
>           space; but the failure of a single disk would trigger a
>           full-scale rebuild

Try to imagine what happens when 2 disks fail, either in two
different leaves or in the same leaf. Oops.

>     * raid55 - considering an array of n*m disks, it gives
>       (n-1)*(m-1) usable space with the ability to detect failing
>       disks and repair them while the cluster is still online - I
>       can even grow it without taking it offline! ;)

Assuming that there are no downsides :-) this makes perfect sense.

>       and at the leaves the processing power required for the raid
>       is already there... why not use it? ;)

What processing power is required? RAID on current CPUs is almost
trivial, and with multilane PCIe and multibank fast DDR2 even
bandwidth is not a big deal.

>     * with iscsi I can detach a node, and when I reattach it, its
>       size is redetected

Sure, and much good it does you to have nodes of different sizes
in a RAID5 :-). Anyhow SAS/SATA is usually plug and play too.

[ ... ]

>     * an alternative solution could be: drop the top-level raid5 and
>       replace it with unionfs by creating individual filesystems -
>       there is an interesting thing about raiding filesystems (raif)

This would be a bit better, see above. Though I wonder why one would
need 'unionfs', as one could mount all the lower level volumes'
filesystems into subdirectories. That may not be acceptable, but then
'unionfs' would not allow things like cross-filesystem hard links
either, so it is not a big deal.

[ ... ]

>     * this cluster can scale up at any time by assimilating new
>       nodes ;)

Assuming that there are no downsides to that, fine ;-).

* Re: component growing in raid5
  2008-03-25 13:06     ` Peter Grandi
@ 2008-03-25 13:38       ` Mattias Wadenstein
  2008-03-25 20:02         ` Peter Grandi
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Wadenstein @ 2008-03-25 13:38 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tue, 25 Mar 2008, Peter Grandi wrote:

>  * Single volume filesystems larger than 1-2TB require something
>    like JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger
>    than 5-10TB is not entirely feasible with any filesystem
>    currently known (just think 'fsck' times) even if the ZFS
>    people glibly say otherwise (no 'fsck' ever!).

The ZFS people provide an fsck, it's called "resilver", which checks
parity and checksums and updates them accordingly.

>  * Single RAID volumes up to say 10-20TB are currently feasible,
>    say as 24x(1+1)x1TB (for example with Thumpers). Beyond that
>    I would not even try, and even that is a bit crazy. I don't
>    think that one should put more than 10-15 drives at most in
>    a single RAID volume, even RAID10 ones.

I'd agree, a 12-14 disk raid6 is as high as I'd like to go. This is
mostly limited by rebuild times though; you'd preferably stay within a
day or two of single-parity "risk".

>  * Large storage pools can only be reasonably built by using
>    multiple volumes across networks and on top of those some
>    network/cluster file system, and it matters a bit whether
>    single filesystem image is essential or not.

Or, for that matter, an application that can handle multiple storage
pools; much of the software that needs really large-scale storage can
itself split its data store between multiple locations. That way you
can have reasonably small filesystems and stay sane.

>  * RAID5 (but not RAID6 or other mad arrangements) may be used if
>    almost all accesses are reads, the data carries end-to-end
>    checksums, and there are backups-of-record for restoring the
>    data quickly, and then each array is not larger than say 4+1.
>    In other words if RAID5 is used as a mostly RO frontend, for
>    example to a large slow tape archive (thanks to R. Petkus for
>    persuading me that there is this exception).

Funny, my suggestion would definitely be raid6 for anything except
database(-like) loads, that is, anything that doesn't end up as lots
of small updates. My normal use case is to store large files, and
having 60% more disks really costs a lot in both purchase and power
for the same usable space.

Of course, we'd be more likely to go for a good hardware raid6 controller 
that utilises the extra parity to make a good guess on what data is wrong 
in the case of silent data corruption on a single disk (unlike Linux 
software raid). Unless, of course, you can run ZFS which has proper 
checksumming so you can know which (if any) data is still good.

> A couple of relevant papers for inspiration on best practices by
> those that have to deal with this stuff:
>
>  https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257
>  http://indico.fnal.gov/contributionDisplay.py?contribId=43&amp;sessionId=30&amp;confId=805

And this is my use case. It might be quite different from, say, database
storage or home directories.

/Mattias Wadenstein

* Re: component growing in raid5
  2008-03-25 13:38       ` Mattias Wadenstein
@ 2008-03-25 20:02         ` Peter Grandi
  2008-03-27 20:44           ` Mattias Wadenstein
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Grandi @ 2008-03-25 20:02 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> even if the ZFS people glibly say otherwise (no 'fsck'
>> ever!).

> The ZFS people provide an fsck, it's called "resilver", which checks
> parity and checksums and updates them accordingly.

That's why I said "glibly": they have been clever enough to call it
by a different name :-).

[ ... ]

> I'd agree, a 12-14 disk raid6 is as high as I'd like to go. This is
> mostly limited by rebuild times though; you'd preferably stay within
> a day or two of single-parity "risk".

A day or two? That's quite risky. Never mind that you get awful
performance for that day or two and/or a risk of data corruption.
Neil Brown expressed a very cautionary thought on this mailing list
some weeks ago:

 «It is really best to avoid degraded raid4/5/6 arrays when at all
  possible. NeilBrown»

>> * Large storage pools can only be reasonably built by using
>> multiple volumes across networks and on top of those some
>> network/cluster file system, [ ... ]

> Or, for that matter, an application that can handle multiple storage
> pools; much of the software that needs really large-scale storage
> can itself split its data store between multiple locations.
> [ ... ]

I imagine that you are thinking here also of more "systematic"
(library-based) ways of doing it, like SRM/SRB and CASTOR, dCache,
XrootD, and other grid style stuff.

[ ... ]

> Funny, my suggestion would definitely be raid6 for anything except
> database(-like) loads, that is, anything that doesn't end up as
> lots of small updates.

But RAID6 is "teh Evil"! Consider the arguments in the usual
http://WWW.BAARF.com/ or just what happens when you update one
block in a RAID6 stripe and the RMW and parity recalculation
required.

> My normal use case is to store large files

If the requirement is to store them as in a mostly-read-only cache,
RAID5 is perfectly adequate; if it is to store them as in writing
them out, parity RAID is not going to be good unless they are
written as a whole (full stripe writes) or write rates don't matter.

> and having 60% more disks really costs a lot in both purchase and
> power for the same usable space.

Well, that's the usual argument... A friend of mine who used to be
a RAID sw developer for a medium vendor calls RAID5 "salesperson's
RAID" because of this argument.

But look at the alternatives, say for a 12-14 disk storage array of
750GB disks, which are currently the best price/capacity (not
necessarily the best price/power though), resulting in this comparison:

One RAID10 7x(1+1): 5.25TB usable.

  Well, it has 40% less capacity than the others, but one gets
  awesome resilience (especially if one has two arrays of 7 drives
  and the mirror pairs are built across the two) including
  surviving almost all 2-drive losses and most 3-drive losses, very
  good read performance (up to 10-14 drives in parallel with '-p
  f2') and very good write performance (7 drives with '-p n2'), all
  exploitable in parallel, and very fast rebuild times impacting
  only one of the drives, so the others have less chance of failing
  during the rebuild.  Also, there is no real requirement for the
  file system code to carefully split IO into aligned stripes.

One RAID6 12+2: 9.00TB usable.

  Any 3-drive loss is catastrophic, a 2 or 1 drive loss causes a
  massive rebuild involving the whole array with the potential not
  just for terrible performance but extra stress on the other
  drives, and further drive loss. Write performance is going to be
  terrible as for every N-blocks written we have to read N+2 blocks
  and write N+2 blocks, especially bad news if N is small, and we
  can avoid reading only if N happens to be 12 and aligned, but
  read performance (if we don't check parity) is going to be pretty
  good. Rebuilding after a loss is not just going to be slow and to
  risk further losses, but also carries the risk of corruption.

To me these summaries mean that for rather less than double the
cost in raw storage (storage is cheap, admittedly cooling/power is
less cheap) one gets a much better general purpose storage pool
with RAID10, and one that is almost only suited to large file
read-only caching in the other case.

But wait, one could claim that 12+2 is an unfairly wide RAID6, and
it is a straw man. Then consider these two where the array is split
into multiple RAID volumes, each containing an independent
filesystem:

Two RAID6 4+2: 6TB usable.

  This is a less crazily risky setup than 12+2, but we have lost
  the single volume property, and we have lost quite a bit of
  space, and peak read performance is not going to be awesome
  (4-drive wide per filesystem), but at least there is a better
  chance of putting together an aligned write of just 4 chunks.
  However if the aligned write cannot be materialized, every
  concurrent write to both filesystems will involve reading and
  then writing 2xN+4 blocks. If 3 disks fail, most of the time
  there is no global loss, unless all 3 are in the same half, and
  then only half of the files are lost, and no 2 disk failure
  causes data loss, only large performance loss. We save 2 drives
  too.

Three RAID5 3+1: 6.75TB usable

  This brings down the RAID5 to a more reasonable narrowness (the
  widest I would consider in normal use). Lost single volume
  property, the read speed on each third is not great, but 3 reads
  can process in paralle, 3 disk loss brings at most a third of the
  store, 2 disk loss is only fatal if it happens in the same
  third. Any 1 drive loss causes a performance drop only in one
  filesystem, unaligned writes involve only one parity block, the
  narrow width of the stripe means RMW cycles are going to be less
  frequent. We save 2 drives too.

Now if one wants what parity RAID is good for (mostly read-only
caching of already backed up data), and one has these 3 choices:

  One RAID6 12+2:  14 drives, 9.00TB usable.
  Two RAID6 4+2:   12 drives, 6.00TB usable.
  Three RAID5 3+1: 12 drives, 6.75TB usable.

The three RAID5 one wins for me in terms of simplicity and speed,
unless the single volume is a requirement or there are really very
very few writes other than bulk reloading from backup and the
filesystem is very careful with write alignment etc.

The single volume property only matters if one wants to build a
really large (above 5TB) single physical filesystem, and that's not
something very recommendable, so it does not matter a lot for me.

Overall I still in most cases would prefer 14 drives in one (or
more) RAID10 volumes to 12 drives as 3x(3+1), but the latter
admittedly for mostly read only etc. may well make sense.

> Of course, we'd be more likely to go for a good hardware raid6
> controller that utilises the extra parity to make a good guess on
> what data is wrong in the case of silent data corruption on a
> single disk (unlike Linux software raid).

That's crazy talk. As already argued in another thread, RAID as
normally understood relies totally on errors being notified.

Otherwise one needs to use proper ECC codes, on reading too (and
background scrubbing is not good or cheap enough) and that's a
different type of design.

> Unless, of course, you can run ZFS which has proper checksumming
> so you can know which (if any) data is still good.

ZFS indeed only does checksumming, not full ECC with rebuild.

There are filesystem designs that use things like Reed-Solomon codes
for that. IIRC even Microsoft Research has done one for
distributing data across a swarm of desktops.

There is no substitute for end-to-end checksumming (and ECC if
necessary) though... Most people reading this list will have
encountered this:

  http://en.Wikipedia.org/wiki/Parchive

and perhaps some will have done a web search bringing up papers
like these for the file system case:

  http://WWW.CS.UTK.edu/~plank/plank/papers/CS-96-332.pdf
  http://WWW.Inf.U-Szeged.HU/~bilickiv/research/LanStore/uos-final.pdf

[ ... ]

* Re: component growing in raid5
  2008-03-25 20:02         ` Peter Grandi
@ 2008-03-27 20:44           ` Mattias Wadenstein
  2008-03-27 22:09             ` Richard Scobie
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Wadenstein @ 2008-03-27 20:44 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tue, 25 Mar 2008, Peter Grandi wrote:

> [ ... ]
>
>> I'd agree, a 12-14 disk raid6 is as high as I'd like to go. This is
>> mostly limited by rebuild times though; you'd preferably stay
>> within a day or two of single-parity "risk".
>
> A day or two? That's quite risky. Never mind that you get awful
> performance for that day or two and/or a risk of data corruption.
> Neil Brown expressed a very cautionary thought on this mailing list
> some weeks ago:
>
> «It is really best to avoid degraded raid4/5/6 arrays when at all
>  possible. NeilBrown»

Yes, I read that mail. I've been meaning to do some real-world testing of 
restarting degraded/rebuilding raid6es from various vendors, including MD, 
but haven't gotten around to it.

Luckily computer crashes are another order of magnitude rarer than disk 
crashes on storage servers in our experience, so raid6 isn't a net loss 
even assuming worst case if you have checksummed data.

Also, this might be the reason that for instance HP's raid cards require 
battery-backed cache for raid6, so that there won't be partially updated 
stripes without a "better version" in the cache.

>>> * Large storage pools can only be reasonably built by using
>>> multiple volumes across networks and on top of those some
>>> network/cluster file system, [ ... ]
>
>> Or, for that matter, an application that can handle multiple
>> storage pools; much of the software that needs really large-scale
>> storage can itself split its data store between multiple
>> locations. [ ... ]
>
> I imagine that you are thinking here also of more "systematic"
> (library-based) ways of doing it, like SRM/SRB and CASTOR, dCache,
> XrootD, and other grid style stuff.

Yes that's what I work with, and similar solutions in other industries (I 
know the TV folks have systems where the storage is just spread out over a 
bunch of servers and filesystems instead of trying to do a cluster 
filesystem, using a database of some kind to keep track of locations).

>> Funny, my suggestion would definitely be raid6 for anything except
>> database(-like) loads, that is, anything that doesn't end up as
>> lots of small updates.
>
> But RAID6 is "teh Evil"! Consider the arguments in the usual
> http://WWW.BAARF.com/ or just what happens when you update one
> block in a RAID6 stripe and the RMW and parity recalculation
> required.

I've read through most of baarf.com and I have a hard time seeing how
it applies. An update to a block is a rare special case and can very
well take some time.

>> My normal use case is to store large files
>
> If the requirement is to store them as in a mostly-read-only cache,
> RAID5 is perfectly adequate; if it is to store them as in writing
> them out, parity RAID is not going to be good unless they are
> written as a whole (full stripe writes) or write rates don't matter.

Well, yes, full stripe writes are the normal case. Either a program is
writing a file out or you get a file from the network. It is written
to disk, and there is almost always sufficient write-back done to make
it (at least) full stripes (unless you have an insanely large stripe
size, so that a few hundred megs won't be enough for a few files).

You'll get a partial update at the file start, unless the filesystem is 
stripe aligned, and one partial update at the end. The vast majority of 
stripes in between (remember, I'm talking about large files, at least a 
couple of hundred megs) will be full stripe writes.

>> and having 60% more disks really costs a lot in both purchase and
>> power for the same usable space.
>
> Well, that's the usual argument... A friend of mine who used to be
> a RAID sw developer for a medium vendor calls RAID5 "salesperson's
> RAID" because of this argument.
>
> But look at the alternatives, say for a 12-14 disk storage array of
> 750GB disks, which are currently the best price/capacity (not
> necessarily the best price/power though), resulting in this comparison:

Sounds like reasonable hardware to look at as a building block.

> One RAID10 7x(1+1): 5.25TB usable.
>
>  Well, it has 40% less capacity than the others, but one gets
>  awesome resilience (especially if one has two arrays of 7 drives
>  and the mirror pairs are built across the two) including
>  surviving almost all 2-drive losses and most 3-drive losses, very
>  good read performance (up to 10-14 drives in parallel with '-p
>  f2') and very good write performance (7 drives with '-p n2'), all
>  exploitable in parallel, and very fast rebuild times impacting
>  only one of the drives, so the others have less chance of failing
>  during the rebuild.  Also, there is no real requirement for the
>  file system code to carefully split IO into aligned stripes.
>
> One RAID6 12+2: 9.00TB usable.
>
>  Any 3-drive loss is catastrophic, a 2 or 1 drive loss causes a
>  massive rebuild involving the whole array with the potential not
>  just for terrible performance but extra stress on the other
>  drives, and further drive loss. Write performance is going to be
>  terrible as for every N-blocks written we have to read N+2 blocks
>  and write N+2 blocks, especially bad news if N is small, and we
>  can avoid reading only if N happens to be 12 and aligned, but
>  read performance (if we don't check parity) is going to be pretty
>  good.

Funny, when I do this, the write performance is typically 20-40% slower 
than the 14-disk RAID0 on the same disks. Not quite as terrible as you 
make it out to be.

The performance during rebuilds usually depends on the priority given
to the rebuild; some do it slowly enough that performance isn't really
affected, but then the rebuild usually takes much longer.

>  Rebuilding after a loss is not just going to be slow and to risk
>  further losses, but also carries the risk of corruption.

And that's the really big issue to me.

> To me these summaries mean that for rather less than double the
> cost in raw storage (storage is cheap, admittedly cooling/power is
> less cheap) one gets a much better general purpose storage pool
> with RAID10, and one that is almost only suited to large file
> read-only caching in the other case.

So for a bit less than double the cost, I can get the same capability. 
Perhaps with a little bit of extra performance that I can't make much use 
of, because I only have a 1-2Gbit/s network connection to the host anyway.

> The single volume property only matters if one wants to build a
> really large (above 5TB) single physical filesystem, and that's not
> something very recommendable, so it does not matter a lot for me.

I wouldn't count "above 5TB" as "really large"; if you are aiming for
a few PBs of aggregated storage, the management overhead of sub-5TB
storage pools is rather significant. If you substitute "over 15TB" I'd
agree (today, probably not next year :) ).

>> Of course, we'd be more likely to go for a good hardware raid6
>> controller that utilises the extra parity to make a good guess on
>> what data is wrong in the case of silent data corruption on a
>> single disk (unlike Linux software raid).
>
> That's crazy talk. As already argued in another thread, RAID as
> normally understood relies totally on errors being notified.
>
> Otherwise one needs to use proper ECC codes, on reading too (and
> background scrubbing is not good or cheap enough) and that's a
> different type of design.

Oh, but it does work, as a background check, on some raid controllers.
And it does identify a parity mismatch, conclude that one of the disks
is [likely] wrong, and then update the data appropriately.

Of course, being storage hardware, you'd be lucky to see any mention of 
this in any logs, when it should be loudly shouted that there were 
mismatches.

But this works, in practice, today.

>> Unless, of course, you can run ZFS which has proper checksumming
>> so you can know which (if any) data is still good.
>
> ZFS indeed only does checksumming, not full ECC with rebuild.

Oh, but with a dual-parity raidz2 (~raid6) you have sufficient parity
to rebuild the correct data, assuming not more than 2 disks have
failed (as in silently returning bad data). You just try to stick the
data together in all the different ways until you get one with a
matching checksum. Then you can update parity/data as appropriate.

The same goes for n-disk mirrors: you just check until you find at
least one copy with a matching checksum, then update the rest from it.

/Mattias Wadenstein

* Re: component growing in raid5
  2008-03-27 20:44           ` Mattias Wadenstein
@ 2008-03-27 22:09             ` Richard Scobie
  2008-03-28  8:07               ` Mattias Wadenstein
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Scobie @ 2008-03-27 22:09 UTC (permalink / raw)
  To: Linux RAID Mailing List

Mattias Wadenstein wrote:

>> A day or two? That's quite risky. Never mind that you get awful
>> performance for that day or two and/or a risk of data corruption.
>> Neil Brown expressed a very cautionary thought on this mailing list
>> some weeks ago:
>>
>> «It is really best to avoid degraded raid4/5/6 arrays when at all
>>  possible. NeilBrown»
> 
> 
> Yes, I read that mail. I've been meaning to do some real-world testing 
> of restarting degraded/rebuilding raid6es from various vendors, 
> including MD, but haven't gotten around to it.

You may be interested in these results - throughput results on an 8 SATA 
drive RAID6 showed average write speed went from 348MB/s to 354MB/s and 
read speed 349MB/s to 196MB/s, while rebuilding with 2 failed drives. 
This was with an Areca 1680x RAID controller.

http://www.amug.org/amug-web/html/amug/reviews/articles/areca/1680x/

Regards,

Richard

* Re: component growing in raid5
  2008-03-27 22:09             ` Richard Scobie
@ 2008-03-28  8:07               ` Mattias Wadenstein
  0 siblings, 0 replies; 13+ messages in thread
From: Mattias Wadenstein @ 2008-03-28  8:07 UTC (permalink / raw)
  To: Richard Scobie; +Cc: Linux RAID Mailing List

On Fri, 28 Mar 2008, Richard Scobie wrote:

> Mattias Wadenstein wrote:
>
>>> A day or two? That's quite risky. Never mind that you get awful
>>> performance for that day or two and/or a risk of data corruption.
>>> Neil Brown expressed a very cautionary thought on this mailing
>>> list some weeks ago:
>>> 
>>> «It is really best to avoid degraded raid4/5/6 arrays when at all
>>>  possible. NeilBrown»
>> 
>> 
>> Yes, I read that mail. I've been meaning to do some real-world testing of 
>> restarting degraded/rebuilding raid6es from various vendors, including MD, 
>> but haven't gotten around to it.
>
> You may be interested in these results - throughput results on an 8 SATA 
> drive RAID6 showed average write speed went from 348MB/s to 354MB/s and read 
> speed 349MB/s to 196MB/s, while rebuilding with 2 failed drives. This was 
> with an Areca 1680x RAID controller.
>
> http://www.amug.org/amug-web/html/amug/reviews/articles/areca/1680x/

That's only performance though. I was interested in seeing if you could 
provoke actual data corruption by doing unkind resets while doing various 
things like writing, flipping bits/bytes inside files, or some other 
access/update pattern.

That performance is good enough in the "easy" cases I'm well aware of, 
I've done some poking at earlier areca controllers. They had really weird 
performance issues in (for me) way too common corner cases though. I just 
got a couple of newer ones delivered though, so I'll start pushing those 
soon. I'm not sure I'll get the time for trying to provoke data corruption 
in degraded raidsets though.

/Mattias Wadenstein