* component growing in raid5
@ 2008-03-23 6:59 Nagy Zoltan
2008-03-23 11:24 ` Peter Grandi
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Nagy Zoltan @ 2008-03-23 6:59 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]
hi all,

I've set up a two-dimensional array:
 * each leaf node composes a raid5 array from its local disks and exports it
   as an iSCSI target
 * the root node creates a raid5 on top of the exported targets
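Roughly, the setup looks like this (an illustrative sketch only - device names,
disk counts and the iSCSI export/login steps are placeholders for whatever
target/initiator stack is in use):

  # on each leaf node: build the local raid5 and export it over iSCSI
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[bcde]
  # ... export /dev/md0 as an iSCSI target with the target software in use ...

  # on the root node: log in to the exported targets (shown here as
  # /dev/sd[b-i]) and build the top-level raid5 across them
  mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[b-i]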
In this setup I have to face the fact that an array component can (and will)
grow, so I created a test case to see what comes out of it ;)
 * after growing the components, mdadm no longer recognizes them as array
   members, because the superblock is no longer at the end of the device
   (the last 64k? - see the offset sketch below).
   I tried to tell mdadm the new size of the components, but it said no ;)
 * I added an ad-hoc superblock copy operation after the expansion, to make
   it possible for mdadm to recognize and assemble the array - it works,
   and passes my test.

Is there a less 'funky' solution for this ;)
Can I run into any trouble when doing this on the real system?
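For reference, a rough sketch of where a v0.90 superblock should sit on a
component (the 64 KiB-aligned-slot detail is an assumption based on the 0.90
metadata layout, not something I verified in the md source):

  dev=/dev/loop0                                  # hypothetical component
  size=$(blockdev --getsize64 $dev)               # component size in bytes
  sb_offset=$(( size / 65536 * 65536 - 65536 ))   # last full 64 KiB-aligned slot
  echo "v0.90 superblock expected at byte offset $sb_offset"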
One more thing: when I first assembled the array with 4096KB chunks, I ran
into the '2.6.24-rc6 reproducible raid5 hang' bug, and it would not resume
after changing 'stripe_cache_size', even after I applied the patch manually
to 2.6.24.2. I've since upgraded to 2.6.25-rc6 and it runs smoothly
- thank you all for hunting that bug down.
kirk
[-- Attachment #2: raid-test.bash --]
[-- Type: text/plain, Size: 2334 bytes --]
#!/bin/bash
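#
# Test harness for growing the components of a raid5 built on loop devices.
# Typical entry points (see the case statement below):
#   ./raid-test.bash test    # plain raid5: create, grow, copy superblocks, reassemble, check
#   ./raid-test.bash test2   # the same, but with dm-crypt (LUKS) between md and the fs
# The individual steps (create, expand, copy-sb, assemble, raid-expand, ...)
# can also be invoked directly.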
t=32768 # size increment in 512-byte dd blocks (32768 * 512 B = 16 MiB per component)
xall()
{
for i in $(seq 0 7); do
"$@"
done
}
mkdir -p part data orig
echo "*** $1"
case "$1" in
test)
export FS_DEV=/dev/md0
$0 create &&
$0 createfs &&
$0 stop &&
$0 expand &&
$0 copy-sb &&
$0 lo_up &&
$0 assemble &&
$0 raid-expand &&
$0 fs-expand &&
$0 fs-check
echo "---closing---"
$0 stop
;;
test2)
export FS_DEV=/dev/mapper/rt
$0 create &&
$0 crypt /dev/md0 &&
$0 crypt-open &&
$0 createfs &&
$0 crypt-close &&
$0 stop &&
$0 expand &&
$0 copy-sb &&
$0 lo_up &&
$0 assemble &&
$0 raid-expand &&
$0 crypt-open &&
$0 fs-expand &&
$0 fs-check
echo "---closing---"
$0 stop
;;
zero)
q(){
rm -f part/$i; touch part/$i; }
xall q
;;
expand)
cp part/{0..7} orig/
q(){
dd if=/dev/zero count=$t >> part/$i; }
xall q
;;
raid-expand)
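# grow the array into the newly added space: set the per-component size (-z,
# in KiB) to the new device size minus 64 KiB, leaving room at the end of each
# component for the v0.90 superblock; every component grew by the same amount,
# so looking at part/0 is enough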
i=0
s=`stat --printf '%s' part/$i`;
x=$[ ($s / 1024)-64 ]
mdadm --grow -z $x /dev/md0
;;
create)
$0 zero
$0 expand
$0 lo_up
mdadm -Cv -n8 -l5 /dev/md0 /dev/loop{0..7}
;;
crypt)
dd if=/dev/urandom count=1 of=key # 512-byte key file for LUKS
cryptsetup luksFormat /dev/md0 key || exit 1
;;
crypt-open)
cryptsetup luksOpen /dev/md0 rt --key-file key || exit 1
;;
crypt-close)
cryptsetup luksClose rt || exit 1
;;
fs-expand)
[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
mount $FS_DEV data
xfs_growfs data
umount data
;;
fs-check)
[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
mount $FS_DEV data
md5sum -c hash && echo ok || echo error
umount data
;;
createfs)
[ "$FS_DEV" == "" ] && echo "!!! FS_DEV not set" && exit 1
mkfs.xfs $FS_DEV
mount $FS_DEV data
dd if=/dev/urandom of=data/junk # fill the filesystem with random data; dd stops at ENOSPC
md5sum data/junk | tee hash
umount data
;;
lo_up)
q(){
losetup /dev/loop$i part/$i; }
xall q
;;
lo_down)
q(){
losetup -d /dev/loop$i; }
xall q
;;
assemble)
mdadm -Av /dev/md0 /dev/loop{0..7}
;;
stop)
umount data
$0 crypt-close
mdadm -S /dev/md0
$0 lo_down
;;
copy-sb)
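# the v0.90 superblock sits in the last 64 KiB of each component; after the
# component has grown, copy the saved original's last 64 KiB to the new
# end-of-device position (this works here because all sizes are multiples
# of 64 KiB, so the superblock slot is exactly the last 64 KiB)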
q(){
s=`stat --printf '%s' part/$i`;
tail -c 65536 orig/$i | dd bs=1 seek=$[ $s - 65536 ] of=part/$i; }
xall q
;;
*) echo "asd"
;;
esac
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: component growing in raid5 2008-03-23 6:59 component growing in raid5 Nagy Zoltan @ 2008-03-23 11:24 ` Peter Grandi 2008-03-24 7:09 ` Peter Rabbitson 2008-03-24 7:09 ` Peter Rabbitson 2 siblings, 0 replies; 13+ messages in thread From: Peter Grandi @ 2008-03-23 11:24 UTC (permalink / raw) To: Linux RAID [ ... ] > * leaf nodes composes raid5 arrays from their disks, and > export it as a iSCSI target > * the root node creates a raid5 on top of the exported > targets That's "amazing" to say the least. A way to prove that syntactically valid setups can and do work. I am particularly "amazed" by the idea of using a whole RAID5 subarray just for parity on the top level RAID5. > in this setup i will have to face that an array component > can(and would) grow, [ ... ] This idea seems to me beyond "amazing", and I think that even "stunning" is an understatement. kirk> one more thing: when i first assembled the array with kirk> 4096KB chunks, Indeed, 4MiB chunk sizes are syntactically possible, and it is a rather "exciting" choice, especially for a RAID55. [ ... ] > cryptsetup luksFormat /dev/md0 key || exit 1 Even more "amazing", using 'dm-crypt' over an already "stunning" setup. [ ... ] > mkfs.xfs $FS_DEV Entirely consistently, the syntactically valid setup above is used for a single (presumably very large) filesystem. Very "courageous". Surely there must be extraordinarily good reasons for building a RAID55 with 4MiB chunk (with "amazing" performance for writes, and "stunning" resilience) and with the expectation that one will extend it by growing each subarray, triggering a full two-level reshape every time, and putting 'dm-crypt' and a single filesystem on top of it all. It would be interesting to learn those reasons for people like me whose imagination is limited by stolid pragmatics. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-23 6:59 component growing in raid5 Nagy Zoltan 2008-03-23 11:24 ` Peter Grandi 2008-03-24 7:09 ` Peter Rabbitson @ 2008-03-24 7:09 ` Peter Rabbitson 2008-03-24 15:17 ` Nagy Zoltan 2 siblings, 1 reply; 13+ messages in thread From: Peter Rabbitson @ 2008-03-24 7:09 UTC (permalink / raw) To: Nagy Zoltan; +Cc: linux-raid Nagy Zoltan wrote: > hi all, > > i've set up a two dimensional array: > * leaf nodes composes raid5 arrays from their disks, and export it as > a iSCSI target > * the root node creates a raid5 on top of the exported targets > > in this setup i will have to face that an array component can(and would) > grow, so i > created a test case for this to see what comes out ;) > * after growing the components mdadm won't recognized them anymore as > an array member > (because there are no superblock at the end of the device - last > 64k?) > i've tried to inform mdadm about the size of the components, but > it sad no ;) > * i've added an arbitary superblock copy operation after the > expansion, to make possible for > mdadm to recognize and assemble the array - it's working, and passes > my test. > > is there a less 'funky' solution for this ;) > can i run into any trouble when doing this on the real system? > I would simply use a v1.1 superblock which will be situated at the start of the array. Then you will face another problem - once you grow a leaf device, mdadm will not see the new size as it will find the superblock at sect 0 and will be done there. You will need to issue mdadm -A ... --update devicesize. The rest of the operations are identical. As a side note I am also curious why do you go the raid55 path (I am not very impressed however :) Peter ^ permalink raw reply [flat|nested] 13+ messages in thread
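A minimal sketch of the v1.1 approach described above (illustrative only, not
taken from the thread; /dev/loop{0..7} stand in for the real leaf devices, and
--size=max assumes an mdadm recent enough to support it):

  mdadm --create /dev/md0 --metadata=1.1 --level=5 --raid-devices=8 /dev/loop{0..7}
  # ... grow the underlying components ...
  mdadm --assemble /dev/md0 --update=devicesize /dev/loop{0..7}
  mdadm --grow /dev/md0 --size=max    # or compute the size by hand, as in raid-expand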
* Re: component growing in raid5 2008-03-24 7:09 ` Peter Rabbitson @ 2008-03-24 15:17 ` Nagy Zoltan 2008-03-24 15:42 ` Peter Rabbitson 2008-03-25 13:06 ` Peter Grandi 0 siblings, 2 replies; 13+ messages in thread From: Nagy Zoltan @ 2008-03-24 15:17 UTC (permalink / raw) To: Peter Rabbitson, linux-raid hi > I would simply use a v1.1 superblock which will be situated at the > start of > the array. Then you will face another problem - once you grow a leaf > device, > mdadm will not see the new size as it will find the superblock at sect > 0 and > will be done there. You will need to issue mdadm -A ... --update > devicesize. > The rest of the operations are identical. i feeled that there is another solution that i missed - thank you, next time i will do it this way -- because the system is already up and running, i don't wan't to recreate the array (about the chunksize: i've got back to 64Kb chunks because of that bug - i was happy to see it running ;) > > As a side note I am also curious why do you go the raid55 path (I am > not very > impressed however :) okay - i've run thru the whole scenario a few times - and always come get back to raid55, what would you do in myplace? :) i choosed this way because: * hardware raid controllers are expensive - because of this i prefer rather having a cluster of machines (average cost per MB shows that this is the 'cheapest' solution) this solution's impact on avg cost is about 20-25% compared to a single stand-alone disk - 40-50% if i count only usable storage * as far as i know other raid configurations take a bigger piece from the cake - raid10, raid01 both halves the usable space, simply creating a - raid0 array at the top level could suffer complete destruction if a node fails (in some rare cases the power-supply can take everything along with it) - raid05 could be reasonable choice providing n*(m-1) space: but in case of failure a single disk would trigger a full scale rebuild * raid55 - considering an array of n*m disks, gives (n-1)*(m-1) usable space with the ability to detect failing disks and repair them, while the cluster is still online - i can even grow it without taking it offline! ;) and at the leafs the processing power required for the raid is already there... why not use it? ;) * this is because with iscsi i can detach the node, and when i reattach the node it's size is redetected * after replacing a leaf's failing drive, the node itself could rebuild it's local array, and prevent the triggering of a whole system-scale rebuild * an alternate solution could be: drop the top level raid5 away, and replace it with unionfs - by creating individual filesystems, there is an intresting thing about raiding filesystems(raif) * the leaf nodes are running with network boot, exporting their local array run thru a dm_crypt on iscsi - this is something i would do differently next time.. 
i don't know how much parralelism dm_crypt could achive, but doing it on a per device basis - would provide 'enough' parralelism for the kernel to better utilize processing power * the root's role is to manage the filesystem, monitor the leafs - and provide network boot for them * effectively the root node is nothing more than a HBA ;) * the construction of the system is not complete - i'm waiting for some gbit interfaces, after they arrive the root will have 4Gbit link to the leafs, and by customizing the routing table a bit, it will see only a portion of the leaf thru each of them - i can possibly trunk the interfaces, but i think it's not neccessary * this cluster could scale up at any time by assimilating new nodes ;) kirk ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-24 15:17 ` Nagy Zoltan @ 2008-03-24 15:42 ` Peter Rabbitson 2008-03-24 16:52 ` Nagy Zoltan 2008-03-25 13:06 ` Peter Grandi 1 sibling, 1 reply; 13+ messages in thread From: Peter Rabbitson @ 2008-03-24 15:42 UTC (permalink / raw) To: Nagy Zoltan; +Cc: linux-raid Nagy Zoltan wrote: > hi > >> I would simply use a v1.1 superblock which will be situated at the >> start of >> the array. Then you will face another problem - once you grow a leaf >> device, >> mdadm will not see the new size as it will find the superblock at sect >> 0 and >> will be done there. You will need to issue mdadm -A ... --update >> devicesize. >> The rest of the operations are identical. > i feeled that there is another solution that i missed - thank you, next > time > i will do it this way -- because the system is already up and running, i > don't wan't > to recreate the array (about the chunksize: i've got back to 64Kb chunks > because > of that bug - i was happy to see it running ;) >> >> As a side note I am also curious why do you go the raid55 path (I am >> not very >> impressed however :) > okay - i've run thru the whole scenario a few times - and always come > get back > to raid55, what would you do in myplace? :) The validity of the snipped arguments depends on how many devices you have at every level: *) how many nodes there are? *) how many disks per node? do all nodes have an equal amount of disks? Without additional info I would say this: The problem with using raid5 on the top node is that you are stressing your network additionally for every r-m-w-cycle. Also rebuild of this array, especially if you add more leaves will be more and more resource intensive. In contrast if the top array is RAID10 with 2 chunk copies, you will sacrifice half the space, however your rebuild will utilize only 2 drives (one reader one writer). HTH Peter ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-24 15:42 ` Peter Rabbitson @ 2008-03-24 16:52 ` Nagy Zoltan 0 siblings, 0 replies; 13+ messages in thread From: Nagy Zoltan @ 2008-03-24 16:52 UTC (permalink / raw) To: Peter Rabbitson, linux-raid hi > The validity of the snipped arguments depends on how many devices you > have at every level: > > *) how many nodes there are? 8 nodes > *) how many disks per node? do all nodes have an equal amount of disks? currently there are 5 disks at every node, and yes: all nodes have equal amount, but the only thing that matters is that the exported size should be the same > > Without additional info I would say this: The problem with using raid5 > on the top node is that you are stressing your network additionally > for every r-m-w-cycle. Also rebuild of this array, especially if you > add more leaves will be more and more resource intensive. after the new nics installed, i except that the rebuild would take about 8 hours to complete yes, the r-m-w-cycle would be a pain - but i expect to have much more reads than writes this is truly resource intensive - but the bottleneck would be not at the network level, i think it will be at the root node's south-west connection i must note that this is my first raid setup ;) and i've not faced with rmw-cycle problems before, and because the use conditions doesn't imply continous random writes - we will be happy with this, next week we will try it out in target conditions - and if something is not going as expect, i will reconsider applying raid10 to it > In contrast if the top array is RAID10 with 2 chunk copies, you will > sacrifice half the space, however your rebuild will utilize only 2 > drives (one reader one writer). yes, that clearly could reach better performance, with .5 usable space-ratio i'm working with relatively low-budget current ratio is: 4*7/(5*8) = .7 kirk ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-24 15:17 ` Nagy Zoltan 2008-03-24 15:42 ` Peter Rabbitson @ 2008-03-25 13:06 ` Peter Grandi 2008-03-25 13:38 ` Mattias Wadenstein 1 sibling, 1 reply; 13+ messages in thread From: Peter Grandi @ 2008-03-25 13:06 UTC (permalink / raw) To: Linux RAID >>> On Mon, 24 Mar 2008 16:17:29 +0100, Nagy Zoltan >>> <kirk@bteam.hu> said: > [ ... ] because the system is already up and running, i > don't wan't to recreate the array [ ... ] It looks like that you feel lucky :-). >> As a side note I am also curious why do you go the raid55 path >> (I am not very impressed however :) > okay - i've run thru the whole scenario a few times - and always > come get back to raid55, what would you do in myplace? :) Well, it all depends what the array is for. However from some of the previous messages it looks like it is pretty big, probably with several dozen drives in it across more than a dozen hosts. But it is not clear what it is being used for, except that IIRC it is mostly for reading. Things like access patterns (multithreaded or single client?) file size profile, availability required, etc., matter too. Anyhow as to very broadly applicable if not optimal guidelines, I would first apply Peter's (me :->) Law of RAID Level Choice: * If you don't know what you are doing, use RAID10. * If you know what you are doing, you are already using RAID10 (except in a very few special cases). To this I would add some general principles based on calls of judgement on my part, but that several people seem to judge differently (good luck to them!): * Single volume filesystems larger than 1-2TB require something like JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger than 5-10TB is not entirely feasible with any filesystem currently known (just think 'fsck' times) even if the ZFS people glibly say otherwise (no 'fsck' ever!). * Single RAID volumes up to say 10-20TB are currently feasible, say as 24x(1+1)x1TB (for example with Thumpers). Beyond that I would not even try, and even that is a bit crazy. I don't think that one should put more than 10-15 drives at most in a single RAID volume, even a RAID10 ones. * Large storage pools can only be reasonably built by using multiple volumes across networks and on top of those some network/cluster file system, and it matters a bit whether single filesystem image is essential or not. So my suggestions are: * For larger filesystems I would use multiple Thumpers (or equivalent) divided in multiple 2TB volumes with a network filesystem like OpenAFS for home directories or a parallel network filesystem like Lustre for data directories. * Multiple 2-4TB RAID10 volumes each with a JFS/XFS filesystem exported via NFSv4 might be acceptable if single filesystem image semantics are not required. * Consider the case for doing RAID10 over the network, by having for example two Thumpers with 48 drives and creating 48 RAID1 pairs across the network using DBRD, and then creating 2-4TB RAID0 volumes with a half a dozen of those pairs each. * RAID5 (but not RAID6 or other mad arrangements) may be used if almost all accesses are reads, the data carries end-to-end checksums, and there are backups-of-record for restoring the data quickly, and then each array is not larger than say 4+1. In other words if RAID5 is used as a mostly RO frontend, for example to a large slow tape archive (thanks to R. Petkus for persuading me that there is this exception). 
A couple of relevant papers for inspiration on best practices by those that have to deal with this stuff: https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257 http://indico.fnal.gov/contributionDisplay.py?contribId=43&sessionId=30&confId=805 > i choosed this way because: > * hardware raid controllers are expensive - because of > this i prefer rather having a cluster of machines > (average cost per MB shows that this is the 'cheapest' > solution) this solution's impact on avg cost is about > 20-25% compared to a single stand-alone disk - 40-50% if > i count only usable storage That's strange. Especially as iSCSI host adapters are not exactly cheaper than SAS/SATA ones. > * as far as i know other raid configurations take a bigger > piece from the cake > - raid10, raid01 both halves the usable space, simply > creating a raid0 array at the top level could suffer > complete destruction if a node fails (in some rare > cases the power-supply can take everything along with > it) Check out http://WWW.BAARF.com/ for the "but RAID10 is not cost effective" argument :-). > - raid05 could be reasonable choice providing n*(m-1) > space: but in case of failure a single disk would > trigger a full scale rebuild Try to imagine what happens when 2 disks fail, either in two different leaves or in the same leaf. Oops. > * raid55 - considering an array of n*m disks, gives > (n-1)*(m-1) usable space with the ability to detect > failing disks and repair them, while the cluster is still > online - i can even grow it without taking it offline! ;) Assuming that there are no downsides :-) this makes perfect sense. > and at the leafs the processing power required for the > raid is already there... why not use it? ;) Which processing power required? RAID on current CPUs is almost trivial, and with multilane PCIe and multibank fast DDR2 even bandwidth is not a big deal. > * this is because with iscsi i can detach the node, and when > i reattach the node it's size is redetected Sure, and much good it does you to have nodes of different sizes in a RAID5 :-). Anyhow SAS/SATA is usually plug and play too. [ ... ] > * an alternate solution could be: drop the top level raid5 > away, and replace it with unionfs - by creating individual > filesystems, there is an intresting thing about raiding > filesystems(raif) This would be a bit better, see above. Though I wonder why one would need 'unionfs', as one could mount all lower level volumes' filesystems into subdirectories. That may not be acceptable, but 'unionfs' would not allow things like cross-filesystem hard links for example, so not a big deal. [ ... ] > * this cluster could scale up at any time by assimilating > new nodes ;) Assuming that there are now downsides to that, fine ;-). ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-25 13:06 ` Peter Grandi @ 2008-03-25 13:38 ` Mattias Wadenstein 2008-03-25 20:02 ` Peter Grandi 0 siblings, 1 reply; 13+ messages in thread From: Mattias Wadenstein @ 2008-03-25 13:38 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID On Tue, 25 Mar 2008, Peter Grandi wrote: > * Single volume filesystems larger than 1-2TB require something > like JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger > than 5-10TB is not entirely feasible with any filesystem > currently known (just think 'fsck' times) even if the ZFS > people glibly say otherwise (no 'fsck' ever!). The ZFS people provide an fsck, it's called "resilver", which checks parity and checksums and update accordingly. > * Single RAID volumes up to say 10-20TB are currently feasible, > say as 24x(1+1)x1TB (for example with Thumpers). Beyond that > I would not even try, and even that is a bit crazy. I don't > think that one should put more than 10-15 drives at most in > a single RAID volume, even a RAID10 ones. I'd agree, a 12-14 disk raid6 is as high that I'd like to go. This is mostly limited by rebuild-times though, you'd preferably stay within a day or two of single-parity "risk". > * Large storage pools can only be reasonably built by using > multiple volumes across networks and on top of those some > network/cluster file system, and it matters a bit whether > single filesystem image is essential or not. Or for that matter an application that can handle multiple storage pools, many of the software that needs really large-scale storage can itself split data store between multiple locations. That way you can have resonably small filesystems and stay sane. > * RAID5 (but not RAID6 or other mad arrangements) may be used if > almost all accesses are reads, the data carries end-to-end > checksums, and there are backups-of-record for restoring the > data quickly, and then each array is not larger than say 4+1. > In other words if RAID5 is used as a mostly RO frontend, for > example to a large slow tape archive (thanks to R. Petkus for > persuading me that there is this exception). Funny, my suggestion would definately be raid6 for anything except database(-like) load, that is anything that doesn't ends up as lots of small updates. My normal usecase is to store large files and having 60% more disks really costs alot in both purchase and power for the same usable space. Of course, we'd be more likely to go for a good hardware raid6 controller that utilises the extra parity to make a good guess on what data is wrong in the case of silent data corruption on a single disk (unlike Linux software raid). Unless, of course, you can run ZFS which has proper checksumming so you can know which (if any) data is still good. > A couple of relevant papers for inspiration on best practices by > those that have to deal with this stuff: > > https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257 > http://indico.fnal.gov/contributionDisplay.py?contribId=43&sessionId=30&confId=805 And this is my usecase. It might be quite different from, say, database storage or home directories. /Mattias Wadenstein ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-25 13:38 ` Mattias Wadenstein @ 2008-03-25 20:02 ` Peter Grandi 2008-03-27 20:44 ` Mattias Wadenstein 0 siblings, 1 reply; 13+ messages in thread From: Peter Grandi @ 2008-03-25 20:02 UTC (permalink / raw) To: Linux RAID [ ... ] >> even if the ZFS people glibly say otherwise (no 'fsck' >> ever!). > The ZFS people provide an fsck, it's called "resilver", which > checks parity and checksums and update accordingly. That's why I said "glibly": they have been clever enough to call it with a different name :-). [ ... ] > I'd agree, a 12-14 disk raid6 is as high that I'd like to > go. This is mostly limited by rebuild-times though, you'd > preferably stay within a day or two of single-parity "risk". A day or two? That's quite risky. Never mind that you get awful performance for that day or two and/or a risk of data corruption. Neil Brown some weeks on this mailing list expressed a very cautionary thought: «It is really best to avoid degraded raid4/5/6 arrays when at all possible. NeilBrown» >> * Large storage pools can only be reasonably built by using >> multiple volumes across networks and on top of those some >> network/cluster file system, [ ... ] > Or for that matter an application that can handle multiple > storage pools, many of the software that needs really large-scale > storage can itself split data store between multiple locations. [ > ... ] I imagine that you are thinking here also of more "systematic" (library-based) ways of doing it, like SRM/SRB and CASTOR, dCache, XrootD, and other grid style stuff. [ ... ] > Funny, my suggestion would definately be raid6 for anything > except database(-like) load, that is anything that doesn't ends > up as lots of small updates. But RAID6 is "teh Evil"! Consider the arguments in the usual http://WWW.BAARF.com/ or just what happens when you update one block in a RAID6 stripe and the RMW and parity recalculation required. > My normal usecase is to store large files If the requirement is to store them as in a mostly-read-only cache, RAID5 is perfectaly adequate; if it is to store them as in writing them out, parity RAID is not going to be good unless they are written as a whole (full stripe writes) or write rates don't matter. > and having 60% more disks really costs alot in both purchase and > power for the same usable space. Well, that's the usual argument... A friend of mine who used to be a RAID sw developer for a medium vendor calls RAID5 "salesperson's RAID" because of this argument. But look at the alternative, say for a 12-14 disk storage array of say 750GB disks, which are currently best price/capacity (not necessarily best price/power though), to result in this comparison: One RAID10 7x(1+1): 5.25TB usable. Well, it has 40% less capacity than the others, but one gets awesome resilience (especially if one has two arrays of 7 drives and the mirror pairs are built across the two) including surviving almost all 2-drive losses and most 3-drive losses, very good read performance (up to 10-14 drives in parallel with '-p f2') and very good write performance (7 drives with '-p n2'), all exploitable in parallel, and very fast rebuild times impacting only one of the drives, so the others have less chance of failing during the rebuild. Also, there is no real requirement for the file system code to carefully split IO into aligned stripes. One RAID6 12+2: 9.00TB usable. 
Any 3-drive loss is catastrophic, a 2 or 1 drive loss causes a massive rebuild involving the whole array with the potential not just for terrible performance but extra stress on the other drives, and further drive loss. Write performance is going to be terrible as for every N-blocks written we have to read N+2 blocks and write N+2 blocks, especially bad news if N is small, and we can avoid reading only if N happens to be 12 and aligned, but read performance (if we don't check parity) is going to be pretty good. Rebuilding after a loss is not not just goint to be slow and risk further losses, but also carries the risk of corruption. To me these summaries mean that for rather less than double the cost in raw storage (storage is cheap, admittedly cooling/power is less cheap) one gets a much better general purpose storage pool with RAID10, and one that is almost only suited to large file read-only caching in the other case. But wait, one could claim that 12+2 is an unfairly wide RAID6, and it is a straw man. Then consider these two where the array is split into multiple RAID volumes, each containing an independent filesystem: Two RAID6 4+2: 6TB usable. This is a less crazily risky setup than 12+2, but we have lost the single volume property, and we have lost quite a bit of space, and peak read performance is not going to be awesome (4-drive wide per filesystem), but at least there is a better chance of putting together an aligned write of just 4 chunks. However if the aligned write cannot be materialized, every concurrent write to both filesystems will involve reading and then writing 2xN+4 blocks. If 3 disks fail, most of the time there is no global loss, unless all 3 are in the same half, and then only half of the files are lost, and no 2 disk failure causes data loss, only large performance loss. We save 2 drives too. Three RAID5 3+1: 6.75TB usable This brings down the RAID5 to a more reasonable narrowness (the widest I would consider in normal use). Lost single volume property, the read speed on each third is not great, but 3 reads can process in paralle, 3 disk loss brings at most a third of the store, 2 disk loss is only fatal if it happens in the same third. Any 1 drive loss causes a performance drop only in one filesystem, unaligned writes involve only one parity block, the narrow width of the stripe means RMW cycles are going to be less frequent. We save 2 drives too. Now if one wants what parity RAID is good for (mostly read-only caching of already backed up data), and one has these 3 choices: One RAID6 12+2: 14 drives, 9.00TB usable. Two RAID6 4+2: 12 drives, 6.00TB usable. Three RAID5 3+1: 12 drives, 6.75TB usable. The three RAID5 one wins for me in terms of simplicity and speed, unless the single volume is a requirement or there are really very very few writes other than bulk reloading from backup and the filesystem is very careful with write alignment etc. The single volume property only matters if one wants to build a really large (above 5TB) single physical filesystem, and that's not something very recommendable, so it does not matter a lot for me. Overall I still in most cases would prefer 14 drives in one (or more) RAID10 volumes to 12 drives as 3x(3+1), but the latter admittedly for mostly read only etc. may well make sense. > Of course, we'd be more likely to go for a good hardware raid6 > controller that utilises the extra parity to make a good guess on > what data is wrong in the case of silent data corruption on a > single disk (unlike Linux software raid). 
That's crazy talk. As already argued in another thread, RAID as normally understood relies totally on errors being notified. Otherwise one needs to use proper ECC codes, on reading too (and background scrubbing is not good or cheap enough) and that's a different type of design. > Unless, of course, you can run ZFS which has proper checksumming > so you can know which (if any) data is still good. ZFS indeed only does checksumming, not full ECC with rebuild. There are filesystem design that use things like Reed-Solomon codes for that. IIRC even Microsoft Research has done one for distributing data across a swarm of desktops. There is no substitute for end-to-end checksumming (and ECC if necessary) though... Most people reading list will have encountered this: http://en.Wikipedia.org/wiki/Parchive and perhaps some will have done a web search bringing up papers like these for the file system case: http://WWW.CS.UTK.edu/~plank/plank/papers/CS-96-332.pdf http://WWW.Inf.U-Szeged.HU/~bilickiv/research/LanStore/uos-final.pdf [ ... ] -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
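A quick arithmetic check of the usable-capacity figures quoted above (a sketch;
the 750GB drive size comes from the message, the decimal-TB rounding is mine):

  d=0.75   # TB per drive
  echo "raid10 7x(1+1): $(echo "7*$d"   | bc) TB usable from 14 drives"   # 5.25
  echo "raid6 12+2:     $(echo "12*$d"  | bc) TB usable from 14 drives"   # 9.00
  echo "2x raid6 4+2:   $(echo "2*4*$d" | bc) TB usable from 12 drives"   # 6.00
  echo "3x raid5 3+1:   $(echo "3*3*$d" | bc) TB usable from 12 drives"   # 6.75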
* Re: component growing in raid5 2008-03-25 20:02 ` Peter Grandi @ 2008-03-27 20:44 ` Mattias Wadenstein 2008-03-27 22:09 ` Richard Scobie 0 siblings, 1 reply; 13+ messages in thread From: Mattias Wadenstein @ 2008-03-27 20:44 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID On Tue, 25 Mar 2008, Peter Grandi wrote: > [ ... ] > >> I'd agree, a 12-14 disk raid6 is as high that I'd like to >> go. This is mostly limited by rebuild-times though, you'd >> preferably stay within a day or two of single-parity "risk". > > A day or two? That's quite risky. Never mind that you get awful > performance for that day or two and/or a risk of data corruption. > Neil Brown some weeks on this mailing list expressed a very > cautionary thought: > > «It is really best to avoid degraded raid4/5/6 arrays when at all > possible. NeilBrown» Yes, I read that mail. I've been meaning to do some real-world testing of restarting degraded/rebuilding raid6es from various vendors, including MD, but haven't gotten around to it. Luckily computer crashes are another order of magnitude rarer than disk crashes on storage servers in our experience, so raid6 isn't a net loss even assuming worst case if you have checksummed data. Also, this might be the reason that for instance HP's raid cards require battery-backed cache for raid6, so that there won't be partially updated stripes without a "better version" in the cache. >>> * Large storage pools can only be reasonably built by using >>> multiple volumes across networks and on top of those some >>> network/cluster file system, [ ... ] > >> Or for that matter an application that can handle multiple >> storage pools, many of the software that needs really large-scale >> storage can itself split data store between multiple locations. [ >> ... ] > > I imagine that you are thinking here also of more "systematic" > (library-based) ways of doing it, like SRM/SRB and CASTOR, dCache, > XrootD, and other grid style stuff. Yes that's what I work with, and similar solutions in other industries (I know the TV folks have systems where the storage is just spread out over a bunch of servers and filesystems instead of trying to do a cluster filesystem, using a database of some kind to keep track of locations). >> Funny, my suggestion would definately be raid6 for anything >> except database(-like) load, that is anything that doesn't ends >> up as lots of small updates. > > But RAID6 is "teh Evil"! Consider the arguments in the usual > http://WWW.BAARF.com/ or just what happens when you update one > block in a RAID6 stripe and the RMW and parity recalculation > required. I've read through most of baarf.com and I have a hard time seeing it applying. An update to a block is a rare special case and can very well take some time. >> My normal usecase is to store large files > > If the requirement is to store them as in a mostly-read-only cache, > RAID5 is perfectaly adequate; if it is to store them as in writing > them out, parity RAID is not going to be good unless they are > written as a whole (full stripe writes) or write rates don't matter. Well, yes, full stripe writes is the normal case. Either a program is writing a file out or you get a file from the network. It is written to disk, and there is almost always sufficient write-back done to make it (at least) full stripes (unless you have insanely large stripe size so that a few hundred megs won't be enough for a few files). 
You'll get a partial update at the file start, unless the filesystem is stripe aligned, and one partial update at the end. The vast majority of stripes in between (remember, I'm talking about large files, at least a couple of hundred megs) will be full stripe writes. >> and having 60% more disks really costs alot in both purchase and >> power for the same usable space. > > Well, that's the usual argument... A friend of mine who used to be > a RAID sw developer for a medium vendor calls RAID5 "salesperson's > RAID" because of this argument. > > But look at the alternative, say for a 12-14 disk storage array > of say 750GB disks, which are currently best price/capacity (not > necessarily best price/power though), to result in this comparison: Sounds like a resonable hardware to look at for a building block. > One RAID10 7x(1+1): 5.25TB usable. > > Well, it has 40% less capacity than the others, but one gets > awesome resilience (especially if one has two arrays of 7 drives > and the mirror pairs are built across the two) including > surviving almost all 2-drive losses and most 3-drive losses, very > good read performance (up to 10-14 drives in parallel with '-p > f2') and very good write performance (7 drives with '-p n2'), all > exploitable in parallel, and very fast rebuild times impacting > only one of the drives, so the others have less chance of failing > during the rebuild. Also, there is no real requirement for the > file system code to carefully split IO into aligned stripes. > > One RAID6 12+2: 9.00TB usable. > > Any 3-drive loss is catastrophic, a 2 or 1 drive loss causes a > massive rebuild involving the whole array with the potential not > just for terrible performance but extra stress on the other > drives, and further drive loss. Write performance is going to be > terrible as for every N-blocks written we have to read N+2 blocks > and write N+2 blocks, especially bad news if N is small, and we > can avoid reading only if N happens to be 12 and aligned, but > read performance (if we don't check parity) is going to be pretty > good. Funny, when I do this, the write performance is typically 20-40% slower than the 14-disk RAID0 on the same disks. Not quite as terrible as you make it out to be. The performance during rebuilds usually depends on the priority given to the rebuild, some do it slow enough that it isn't really affected, but then it usually takes much longer. > Rebuilding after a loss is not not just goint to be slow > and risk further losses, but also carries the risk of corruption. And that's the really big issue to me. > To me these summaries mean that for rather less than double the > cost in raw storage (storage is cheap, admittedly cooling/power is > less cheap) one gets a much better general purpose storage pool > with RAID10, and one that is almost only suited to large file > read-only caching in the other case. So for a bit less than double the cost, I can get the same capability. Perhaps with a little bit of extra performance that I can't make much use of, because I only have 1-2Gbit/s network connection to the host anyway. > The single volume property only matters if one wants to build a > really large (above 5TB) single physical filesystem, and that's not > something very recommendable, so it does not matter a lot for me. I wouldn't count "above 5TB" as "really large", if you are aiming for a few PBs of aggregated storage the management overhead of sub-5TB storage pools is rather significant. 
If you substitute that for "over 15TB" I'd agree (today, probably not next year :) ). >> Of course, we'd be more likely to go for a good hardware raid6 >> controller that utilises the extra parity to make a good guess on >> what data is wrong in the case of silent data corruption on a >> single disk (unlike Linux software raid). > > That's crazy talk. As already argued in another thread, RAID as > normally understood relies totally on errors being notified. > > Otherwise one needs to use proper ECC codes, on reading too (and > background scrubbing is not good or cheap enough) and that's a > different type of design. Oh, but it does work on a background check, on some raid controllers. And it does identify a parity mismatch, conclude that one of the disks is [likely] wrong and then update data appropriately. Of course, being storage hardware, you'd be lucky to see any mention of this in any logs, when it should be loudly shouted that there were mismatches. But this works, in practice, today. >> Unless, of course, you can run ZFS which has proper checksumming >> so you can know which (if any) data is still good. > > ZFS indeed only does checksumming, not full ECC with rebuild. Oh, but a dual-parity raidz2 (~raid6) you have sufficient parity to rebuild the correct data assuming not more than 2 disks have failed (as in silently returning bad data). You just try to stick the data together in all the different ways until you get one with matching checksum. Then you can update parity/data as appropriate. Same goes with n-disk mirrors, you just check until you find at least one copy with a matching checksum, then update the rest to this data. /Mattias Wadenstein -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-27 20:44 ` Mattias Wadenstein @ 2008-03-27 22:09 ` Richard Scobie 2008-03-28 8:07 ` Mattias Wadenstein 0 siblings, 1 reply; 13+ messages in thread From: Richard Scobie @ 2008-03-27 22:09 UTC (permalink / raw) To: Linux RAID Mailing List Mattias Wadenstein wrote: >> A day or two? That's quite risky. Never mind that you get awful >> performance for that day or two and/or a risk of data corruption. >> Neil Brown some weeks on this mailing list expressed a very >> cautionary thought: >> >> «It is really best to avoid degraded raid4/5/6 arrays when at all >> possible. NeilBrown» > > > Yes, I read that mail. I've been meaning to do some real-world testing > of restarting degraded/rebuilding raid6es from various vendors, > including MD, but haven't gotten around to it. You may be interested in these results - throughput results on an 8 SATA drive RAID6 showed average write speed went from 348MB/s to 354MB/s and read speed 349MB/s to 196MB/s, while rebuilding with 2 failed drives. This was with an Areca 1680x RAID controller. http://www.amug.org/amug-web/html/amug/reviews/articles/areca/1680x/ Regards, Richard -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: component growing in raid5 2008-03-27 22:09 ` Richard Scobie @ 2008-03-28 8:07 ` Mattias Wadenstein 0 siblings, 0 replies; 13+ messages in thread From: Mattias Wadenstein @ 2008-03-28 8:07 UTC (permalink / raw) To: Richard Scobie; +Cc: Linux RAID Mailing List On Fri, 28 Mar 2008, Richard Scobie wrote: > Mattias Wadenstein wrote: > >>> A day or two? That's quite risky. Never mind that you get awful >>> performance for that day or two and/or a risk of data corruption. >>> Neil Brown some weeks on this mailing list expressed a very >>> cautionary thought: >>> >>> «It is really best to avoid degraded raid4/5/6 arrays when at all >>> possible. NeilBrown» >> >> >> Yes, I read that mail. I've been meaning to do some real-world testing of >> restarting degraded/rebuilding raid6es from various vendors, including MD, >> but haven't gotten around to it. > > You may be interested in these results - throughput results on an 8 SATA > drive RAID6 showed average write speed went from 348MB/s to 354MB/s and read > speed 349MB/s to 196MB/s, while rebuilding with 2 failed drives. This was > with an Areca 1680x RAID controller. > > http://www.amug.org/amug-web/html/amug/reviews/articles/areca/1680x/ That's only performance though. I was interested in seeing if you could provoke actual data corruption by doing unkind resets while doing various things like writing, flipping bits/bytes inside files, or some other access/update pattern. That performance is good enough in the "easy" cases I'm well aware of, I've done some poking at earlier areca controllers. They had really weird performance issues in (for me) way too common corner cases though. I just got a couple of newer ones delivered though, so I'll start pushing those soon. I'm not sure I'll get the time for trying to provoke data corruption in degraded raidsets though. /Mattias Wadenstein -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread