linux-raid.vger.kernel.org archive mirror
* Raid over 48 disks
@ 2007-12-18 17:29 Norman Elton
  2007-12-18 18:27 ` Justin Piszcz
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Norman Elton @ 2007-12-18 17:29 UTC (permalink / raw)
  To: linux-raid

We're investigating the possibility of running Linux (RHEL) on top of  
Sun's X4500 Thumper box:

http://www.sun.com/servers/x64/x4500/

Basically, it's a server with 48 SATA hard drives. No hardware RAID.  
It's designed for Sun's ZFS filesystem.

So... we're curious how Linux will handle such a beast. Has anyone run  
MD software RAID over so many disks? Then piled LVM/ext3 on top of  
that? Any suggestions?

Are we crazy to think this is even possible?

Thanks!

Norman Elton


* Re: Raid over 48 disks
  2007-12-18 17:29 Raid over 48 disks Norman Elton
@ 2007-12-18 18:27 ` Justin Piszcz
  2007-12-18 19:34   ` Thiemo Nagel
  2007-12-18 18:45 ` Robin Hill
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 18:27 UTC (permalink / raw)
  To: Norman Elton; +Cc: linux-raid



On Tue, 18 Dec 2007, Norman Elton wrote:

> We're investigating the possibility of running Linux (RHEL) on top of Sun's 
> X4500 Thumper box:
>
> http://www.sun.com/servers/x64/x4500/
>
> Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's 
> designed for Sun's ZFS filesystem.
>
> So... we're curious how Linux will handle such a beast. Has anyone run MD 
> software RAID over so many disks? Then piled LVM/ext3 on top of that? Any 
> suggestions?
>
> Are we crazy to think this is even possible?
>
> Thanks!
>
> Norman Elton

It sounds VERY fun and exciting if you ask me!  The most disks I've used 
when testing SW RAID was 10, with various raid settings.  With that many 
drives you'd want RAID6 or RAID10 for sure in case more than one fails at 
the same time, and definitely XFS/JFS/EXT4(?), as EXT3 is capped at 8TB.
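
For example, an XFS filesystem aligned to the array geometry might be 
created along these lines (only a sketch -- the 256k chunk and the 46 
data disks of a 48-drive RAID6 are assumptions, not tested figures):

# mkfs.xfs -d su=256k,sw=46 /dev/md0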

I'd be curious what kind of aggregate bandwidth you can get off of it with 
that many drives.

Justin.


* Re: Raid over 48 disks
  2007-12-18 17:29 Raid over 48 disks Norman Elton
  2007-12-18 18:27 ` Justin Piszcz
@ 2007-12-18 18:45 ` Robin Hill
  2007-12-18 20:28 ` Neil Brown
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Robin Hill @ 2007-12-18 18:45 UTC (permalink / raw)
  To: linux-raid


On Tue Dec 18, 2007 at 12:29:27PM -0500, Norman Elton wrote:

> We're investigating the possibility of running Linux (RHEL) on top of Sun's 
> X4500 Thumper box:
>
> http://www.sun.com/servers/x64/x4500/
>
> Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's 
> designed for Sun's ZFS filesystem.
>
> So... we're curious how Linux will handle such a beast. Has anyone run MD 
> software RAID over so many disks? Then piled LVM/ext3 on top of that? Any 
> suggestions?
>
> Are we crazy to think this is even possible?
>
The most I've done is 28 drives in RAID-10 (SCSI drives, with the array
formatted as XFS).  It keeps failing the same drive, but I've not had time
to give that drive a full test yet to confirm it's a drive issue.  It's
been running quite happily (under pretty heavy database load) on 27
disks for a couple of months now, though.

Cheers,
        Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |



* Re: Raid over 48 disks
  2007-12-18 18:27 ` Justin Piszcz
@ 2007-12-18 19:34   ` Thiemo Nagel
  2007-12-18 19:52     ` Norman Elton
  2007-12-18 20:25     ` Justin Piszcz
  0 siblings, 2 replies; 28+ messages in thread
From: Thiemo Nagel @ 2007-12-18 19:34 UTC (permalink / raw)
  To: Norman Elton; +Cc: linux-raid

Dear Norman,

>> So... we're curious how Linux will handle such a beast. Has anyone run 
>> MD software RAID over so many disks? Then piled LVM/ext3 on top of 
>> that? Any suggestions?
>>
>> Are we crazy to think this is even possible?

I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE-MCP55
onboard controllers on an Athlon DC 5000+ with 1GB RAM:

9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22]

Performance of the raw device is fair:
# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):
# dd if=largetestfile of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

There have been no problems so far.  (mkfs.ext3 wants -F to create a 
filesystem larger than 8TB.  The hard maximum is 16TB, so you will need 
to create partitions if your drives are larger than 350GB...)
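
For reference, the stride value is simply the RAID chunk size divided by 
the filesystem block size; a sketch assuming 4k ext3 blocks with the 256k 
chunk (which is where the 64 above comes from; -F as noted for >8TB):

# mkfs.ext3 -b 4096 -E stride=64 -F /dev/md2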

Kind regards,

Thiemo Nagel




* Re: Raid over 48 disks
  2007-12-18 19:34   ` Thiemo Nagel
@ 2007-12-18 19:52     ` Norman Elton
  2007-12-18 20:19       ` Thiemo Nagel
  2007-12-18 20:25     ` Justin Piszcz
  1 sibling, 1 reply; 28+ messages in thread
From: Norman Elton @ 2007-12-18 19:52 UTC (permalink / raw)
  To: thiemo.nagel; +Cc: linux-raid

Thiemo --

I'm not familiar with RocketRaid. Is it handling the RAID for you, or  
are you using MD?

Thanks, all, for your feedback! I'm still surprised nobody has tried  
this on one of these Sun boxes yet. I've signed up for some demo  
hardware. I'll post what I find.

Norman


On Dec 18, 2007, at 2:34 PM, Thiemo Nagel wrote:

> Dear Norman,
>
>>> So... we're curious how Linux will handle such a beast. Has anyone  
>>> run MD software RAID over so many disks? Then piled LVM/ext3 on  
>>> top of that? Any suggestions?
>>>
>>> Are we crazy to think this is even possible?
>
> I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE- 
> MCP55
> onboard controllers on an Athlon DC 5000+ with 1GB RAM:
>
> 9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22]
>
> Performance of the raw device is fair:
> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> 65536+0 records in
> 65536+0 records out
> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>
> Somewhat less through ext3 (created with -E stride=64):
> # dd if=largetestfile of=/dev/zero bs=128k count=64k
> 65536+0 records in
> 65536+0 records out
> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>
> There were no problems up to now.  (mkfs.ext3 wants -F to create a  
> filesystem larger than 8TB.  The hard maximum is 16TB, so you will  
> need to create partitions, if your drives are larger than 350GB...)
>
> Kind regards,
>
> Thiemo Nagel
>
>



* Re: Raid over 48 disks
  2007-12-18 19:52     ` Norman Elton
@ 2007-12-18 20:19       ` Thiemo Nagel
  0 siblings, 0 replies; 28+ messages in thread
From: Thiemo Nagel @ 2007-12-18 20:19 UTC (permalink / raw)
  To: Norman Elton; +Cc: linux-raid

Dear Norman,

> I'm not familiar with RocketRaid. Is it handling the RAID for you, or 
> are you using MD?

I'm using md.  The controller is in a mode that exports all drives 
individually.

Kind regards,

Thiemo


* Re: Raid over 48 disks
  2007-12-18 19:34   ` Thiemo Nagel
  2007-12-18 19:52     ` Norman Elton
@ 2007-12-18 20:25     ` Justin Piszcz
  2007-12-18 21:13       ` Thiemo Nagel
  1 sibling, 1 reply; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 20:25 UTC (permalink / raw)
  To: Thiemo Nagel; +Cc: Norman Elton, linux-raid



On Tue, 18 Dec 2007, Thiemo Nagel wrote:

> Dear Norman,
>
>>> So... we're curious how Linux will handle such a beast. Has anyone run MD 
>>> software RAID over so many disks? Then piled LVM/ext3 on top of that? Any 
>>> suggestions?
>>> 
>>> Are we crazy to think this is even possible?
>
> I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE-MCP55
> onboard controllers on an Athlon DC 5000+ with 1GB RAM:
>
> 9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22]
>
> Performance of the raw device is fair:
> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> 65536+0 records in
> 65536+0 records out
> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>
> Somewhat less through ext3 (created with -E stride=64):
> # dd if=largetestfile of=/dev/zero bs=128k count=64k
> 65536+0 records in
> 65536+0 records out
> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>
> There were no problems up to now.  (mkfs.ext3 wants -F to create a filesystem 
> larger than 8TB.  The hard maximum is 16TB, so you will need to create 
> partitions, if your drives are larger than 350GB...)
>
> Kind regards,
>
> Thiemo Nagel
>
>

Quite slow?

10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s

# dd if=bigfile of=/dev/zero bs=128k count=64k
27773+1 records in
27773+1 records out
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s




* Re: Raid over 48 disks
  2007-12-18 17:29 Raid over 48 disks Norman Elton
  2007-12-18 18:27 ` Justin Piszcz
  2007-12-18 18:45 ` Robin Hill
@ 2007-12-18 20:28 ` Neil Brown
  2007-12-19  8:27   ` Mattias Wadenstein
  2007-12-25 17:31   ` pg_mh, Peter Grandi
  2007-12-18 20:36 ` Brendan Conoboy
  2007-12-21 10:57 ` Leif Nixon
  4 siblings, 2 replies; 28+ messages in thread
From: Neil Brown @ 2007-12-18 20:28 UTC (permalink / raw)
  To: Norman Elton; +Cc: linux-raid

On Tuesday December 18, normelton@gmail.com wrote:
> We're investigating the possibility of running Linux (RHEL) on top of  
> Sun's X4500 Thumper box:
> 
> http://www.sun.com/servers/x64/x4500/
> 
> Basically, it's a server with 48 SATA hard drives. No hardware RAID.  
> It's designed for Sun's ZFS filesystem.
> 
> So... we're curious how Linux will handle such a beast. Has anyone run  
> MD software RAID over so many disks? Then piled LVM/ext3 on top of  
> that? Any suggestions?
> 
> Are we crazy to think this is even possible?

Certainly possible.
The default metadata is limited to 28 devices, but with
    --metadata=1

you can easily use all 48 drives or more in the one array.  I'm not
sure if you would want to though.

If you just wanted an enormous scratch space and were happy to lose
all your data on a drive failure, then you could make a raid0 across
all the drives which should work perfectly and give you lots of
space.  But that probably isn't what you want.

I wouldn't create a raid5 or raid6 on all 48 devices.
RAID5 only survives a single device failure and with that many
devices, the chance of a second failure before you recover becomes
appreciable.

RAID6 would be much more reliable, but probably much slower.  RAID6
always needs to read or write every block in a stripe (i.e. it always
uses reconstruct-write to generate the P and Q blocks; it never does
a read-modify-write like raid5 does).  This means that every write
touches every device, so you have less opportunity for parallelism
among your many drives.
It might be instructive to try it out though.

RAID10 would be a good option if you are happy with 24 drives' worth of
space.  I would probably choose a largish chunk size (256K) and use
the 'offset' layout.
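
As a rough sketch of that (device names are placeholders, and
--metadata=1 is needed because of the 28-device limit mentioned above):

# mdadm --create /dev/md0 --metadata=1 --level=10 --layout=o2 \
        --chunk=256 --raid-devices=48 /dev/sd[a-z] /dev/sda[a-v]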

Alternately, eight 6-drive RAID5s or six 8-drive RAID6s, and use RAID0 to
combine them.  This would give you adequate reliability and
performance and still a large amount of storage space.
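
For the layered variant, something along these lines (again only a
sketch, with placeholder device names) would build six 8-drive RAID6s
and stripe across them:

# mdadm --create /dev/md1 --level=6 --chunk=256 --raid-devices=8 /dev/sd[a-h]
  (and similarly for /dev/md2 through /dev/md6 with the remaining drives)
# mdadm --create /dev/md10 --level=0 --chunk=256 --raid-devices=6 /dev/md[1-6]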

Have fun!!!

NeilBrown



* Re: Raid over 48 disks
  2007-12-18 17:29 Raid over 48 disks Norman Elton
                   ` (2 preceding siblings ...)
  2007-12-18 20:28 ` Neil Brown
@ 2007-12-18 20:36 ` Brendan Conoboy
  2007-12-18 23:50   ` Guy Watkins
  2007-12-21 10:57 ` Leif Nixon
  4 siblings, 1 reply; 28+ messages in thread
From: Brendan Conoboy @ 2007-12-18 20:36 UTC (permalink / raw)
  To: Norman Elton; +Cc: linux-raid

Norman Elton wrote:
> We're investigating the possibility of running Linux (RHEL) on top of 
> Sun's X4500 Thumper box:
> 
> http://www.sun.com/servers/x64/x4500/

Neat - six 8-port SATA controllers!  It'll be worth checking to be sure 
each controller has equal bandwidth.  If some controllers are on slower 
buses than others, you may want to take that into account and balance the 
md device layout.
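
A quick way to check is to see which PCI device each disk sits behind
and how those controllers hang off the buses (paths and output vary by
kernel; this is only a sketch):

# ls -l /sys/block/sd*/device   (maps each disk to its controller's PCI address)
# lspci -tv                     (shows the PCI bus tree those controllers sit on)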

> So... we're curious how Linux will handle such a beast. Has anyone run 
> MD software RAID over so many disks? Then piled LVM/ext3 on top of that? 
> Any suggestions?

There used to be a maximum number of devices allowed in a single md 
device.  Not sure if that is still the case.

With this many drives you would be well advised to make smaller raid 
devices and then combine them into a larger md device (or via lvm, etc). 
Consider a write with a 48-device raid5: the system may need to read 
blocks from all those drives before completing a single write!

If it were my system and all ports were equally well connected, I'd create 
three 16-drive RAID5s with one hot spare, then combine them via raid 0 or 
lvm.  That's just my usage scenario, though (modest reliability, 
excellent read speed, modest write speed).

If you put ext3 on top, remember to use the stride option when making 
the filesystem.

> Are we crazy to think this is even possible?

Crazy, possible, and fun!

-- 
Brendan Conoboy / Red Hat, Inc. / blc@redhat.com


* Re: Raid over 48 disks
  2007-12-18 20:25     ` Justin Piszcz
@ 2007-12-18 21:13       ` Thiemo Nagel
  2007-12-18 21:20         ` Jon Nelson
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Thiemo Nagel @ 2007-12-18 21:13 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Norman Elton, linux-raid

>> Performance of the raw device is fair:
>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>
>> Somewhat less through ext3 (created with -E stride=64):
>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
> 
> Quite slow?
> 
> 10 disks (raptors) raid 5 on regular sata controllers:
> 
> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
> 
> # dd if=bigfile of=/dev/zero bs=128k count=64k
> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting.  Any ideas what could be the reason?  How much do you get 
from a single drive?  -- The Samsung HD501LJ that I'm using gives 
~84MB/s when reading from the beginning of the disk.

With RAID 5 I'm getting slightly better results (though I really wonder 
why, since naively I would expect identical read performance), but that 
only accounts for a small part of the difference:

chunk        16k read (MB/s)      64k write (MB/s)
size        RAID 5    RAID 6     RAID 5    RAID 6
128k           492       497        268       270
256k           615       530        288       270
512k           625       607        230       174
1024k          650       620        170        75

Kind regards,

Thiemo


* Re: Raid over 48 disks
  2007-12-18 21:13       ` Thiemo Nagel
@ 2007-12-18 21:20         ` Jon Nelson
  2007-12-18 21:40           ` Thiemo Nagel
  2007-12-18 21:43           ` Justin Piszcz
  2007-12-18 21:21         ` Justin Piszcz
  2007-12-19 15:21         ` Bill Davidsen
  2 siblings, 2 replies; 28+ messages in thread
From: Jon Nelson @ 2007-12-18 21:20 UTC (permalink / raw)
  Cc: linux-raid

On 12/18/07, Thiemo Nagel <thiemo.nagel@ph.tum.de> wrote:
> >> Performance of the raw device is fair:
> >> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> >> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
> >>
> >> Somewhat less through ext3 (created with -E stride=64):
> >> # dd if=largetestfile of=/dev/zero bs=128k count=64k
> >> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
> >
> > Quite slow?
> >
> > 10 disks (raptors) raid 5 on regular sata controllers:
> >
> > # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
> > 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
> >
> > # dd if=bigfile of=/dev/zero bs=128k count=64k
> > 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>
> Interesting.  Any ideas what could be the reason?  How much do you get
> from a single drive?  -- The Samsung HD501LJ that I'm using gives
> ~84MB/s when reading from the beginning of the disk.
>
> With RAID 5 I'm getting slightly better results (though I really wonder
> why, since naively I would expect identical read performance) but that
> does only account for a small part of the difference:
>
>         16k read        64k write
> chunk
> size    RAID 5  RAID 6  RAID 5  RAID 6
> 128k    492     497     268     270
> 256k    615     530     288     270
> 512k    625     607     230     174
> 1024k   650     620     170     75

It strikes me that these numbers are meaningless without knowing whether
that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
Later versions of 'dd' offer 'conv=fdatasync', which is really handy
(it calls fdatasync on the output file, syncing JUST the one file, right
before close). Otherwise, oflag=direct will (try to) bypass the
page/block cache.

I can get really impressive numbers, too (over 200MB/s on a single
disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al.

The variation in reported performance can be really huge if you don't
realise that you aren't actually testing the DISK I/O but *some*
disk I/O and *some* memory caching.
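
For instance, the cached and uncached variants of the same write test
(the file name is arbitrary) would be:

# dd if=/dev/zero of=testfile bs=128k count=64k conv=fdatasync
# dd if=/dev/zero of=testfile bs=128k count=64k oflag=direct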




-- 
Jon


* Re: Raid over 48 disks
  2007-12-18 21:13       ` Thiemo Nagel
  2007-12-18 21:20         ` Jon Nelson
@ 2007-12-18 21:21         ` Justin Piszcz
  2007-12-19 15:21         ` Bill Davidsen
  2 siblings, 0 replies; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 21:21 UTC (permalink / raw)
  To: Thiemo Nagel; +Cc: Norman Elton, linux-raid



On Tue, 18 Dec 2007, Thiemo Nagel wrote:

>>> Performance of the raw device is fair:
>>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>> 
>>> Somewhat less through ext3 (created with -E stride=64):
>>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>> 
>> Quite slow?
>> 
>> 10 disks (raptors) raid 5 on regular sata controllers:
>> 
>> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
>> 
>> # dd if=bigfile of=/dev/zero bs=128k count=64k
>> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>
> Interesting.  Any ideas what could be the reason?  How much do you get from a 
> single drive?  -- The Samsung HD501LJ that I'm using gives ~84MB/s when 
> reading from the beginning of the disk.
>
> With RAID 5 I'm getting slightly better results (though I really wonder why, 
> since naively I would expect identical read performance) but that does only 
> account for a small part of the difference:
>
> 	16k read	64k write
> chunk
> size	RAID 5	RAID 6	RAID 5	RAID 6
> 128k	492	497	268	270
> 256k	615	530	288	270
> 512k	625	607	230	174
> 1024k	650	620	170	75
>
> Kind regards,
>
> Thiemo
>

# dd if=/dev/sdc of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.8108 seconds, 77.7 MB/s

With more than 2x the drives I'd think you'd see faster speeds; perhaps 
the controller is the problem?

I am using ICH8R (but the raid within linux) and 2-port SATA cards; each 
has its own dedicated bandwidth via the PCI-e bus.

I have also tried sw RAID5 with 10 disks on 3ware controllers exporting 
the drives as JBOD; I saw similar read performance, but not write.

Justin.


* Re: Raid over 48 disks
  2007-12-18 21:20         ` Jon Nelson
@ 2007-12-18 21:40           ` Thiemo Nagel
  2007-12-18 21:43           ` Justin Piszcz
  1 sibling, 0 replies; 28+ messages in thread
From: Thiemo Nagel @ 2007-12-18 21:40 UTC (permalink / raw)
  To: Jon Nelson; +Cc: linux-raid

>>         16k read        64k write
>> chunk
>> size    RAID 5  RAID 6  RAID 5  RAID 6
>> 128k    492     497     268     270
>> 256k    615     530     288     270
>> 512k    625     607     230     174
>> 1024k   650     620     170     75
> 
> It strikes me that these numbers are meaningless without knowing if
> that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
> Later versions of 'dd' offer 'conv=fdatasync' which is really handy
> (call fdatasync on the output file, syncing JUST the one file, right
> before close). Otherwise, oflags=direct will (try) to bypass the
> page/block cache.
> 
> I can get really impressive numbers, too (over 200MB/s on a single
> disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al.
> 
> The variation in reported performance can be really huge without
> understanding that you aren't actually testing the DISK I/O but *some*
> disk I/O and *some* memory caching.

I did these benchmarks with 32GB of data on a machine with 1GB of RAM, 
so the memory cache contribution should be small.

Kind regards,

Thiemo


* Re: Raid over 48 disks
  2007-12-18 21:20         ` Jon Nelson
  2007-12-18 21:40           ` Thiemo Nagel
@ 2007-12-18 21:43           ` Justin Piszcz
  1 sibling, 0 replies; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 21:43 UTC (permalink / raw)
  To: Jon Nelson; +Cc: linux-raid



On Tue, 18 Dec 2007, Jon Nelson wrote:

> On 12/18/07, Thiemo Nagel <thiemo.nagel@ph.tum.de> wrote:
>>>> Performance of the raw device is fair:
>>>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>>>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>>>
>>>> Somewhat less through ext3 (created with -E stride=64):
>>>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>>>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>>>
>>> Quite slow?
>>>
>>> 10 disks (raptors) raid 5 on regular sata controllers:
>>>
>>> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
>>>
>>> # dd if=bigfile of=/dev/zero bs=128k count=64k
>>> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>>
>> Interesting.  Any ideas what could be the reason?  How much do you get
>> from a single drive?  -- The Samsung HD501LJ that I'm using gives
>> ~84MB/s when reading from the beginning of the disk.
>>
>> With RAID 5 I'm getting slightly better results (though I really wonder
>> why, since naively I would expect identical read performance) but that
>> does only account for a small part of the difference:
>>
>>         16k read        64k write
>> chunk
>> size    RAID 5  RAID 6  RAID 5  RAID 6
>> 128k    492     497     268     270
>> 256k    615     530     288     270
>> 512k    625     607     230     174
>> 1024k   650     620     170     75
>
> It strikes me that these numbers are meaningless without knowing if
> that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
> Later versions of 'dd' offer 'conv=fdatasync' which is really handy
> (call fdatasync on the output file, syncing JUST the one file, right
> before close). Otherwise, oflags=direct will (try) to bypass the
> page/block cache.
>
> I can get really impressive numbers, too (over 200MB/s on a single
> disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al.
>
> The variation in reported performance can be really huge without
> understanding that you aren't actually testing the DISK I/O but *some*
> disk I/O and *some* memory caching.

Ok-- How's this for caching, a DD over the entire RAID device:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M
dd: writing `file': No space left on device
1070704+0 records in
1070703+0 records out
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s



* RE: Raid over 48 disks
  2007-12-18 20:36 ` Brendan Conoboy
@ 2007-12-18 23:50   ` Guy Watkins
  2007-12-18 23:58     ` Justin Piszcz
  2007-12-19 12:08     ` Russell Smith
  0 siblings, 2 replies; 28+ messages in thread
From: Guy Watkins @ 2007-12-18 23:50 UTC (permalink / raw)
  To: 'Brendan Conoboy', 'Norman Elton'; +Cc: linux-raid

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Brendan Conoboy
} Sent: Tuesday, December 18, 2007 3:36 PM
} To: Norman Elton
} Cc: linux-raid@vger.kernel.org
} Subject: Re: Raid over 48 disks
} 
} Norman Elton wrote:
} > We're investigating the possibility of running Linux (RHEL) on top of
} > Sun's X4500 Thumper box:
} >
} > http://www.sun.com/servers/x64/x4500/
} 
} Neat- 6 8 port SATA controllers!  It'll be worth checking to be sure
} each controller has equal bandwidth.  If some controllers are on slower
} buses than others you may want to consider that and balance the md
} device layout.

Assuming the 6 controllers are equal, I would make 3 16-disk RAID6 arrays
using 2 disks from each controller.  That way any 1 controller can fail and
your system will still be running.  6 disks will be used for redundancy.

Or 6 8-disk RAID6 arrays using 1 disk from each controller.  That way any 2
controllers can fail and your system will still be running.  12 disks will
be used for redundancy.  Might be too excessive!

Combine them into a RAID0 array.

Guy



* RE: Raid over 48 disks
  2007-12-18 23:50   ` Guy Watkins
@ 2007-12-18 23:58     ` Justin Piszcz
  2007-12-18 23:59       ` Justin Piszcz
  2007-12-19 12:08     ` Russell Smith
  1 sibling, 1 reply; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 23:58 UTC (permalink / raw)
  To: Guy Watkins; +Cc: 'Brendan Conoboy', 'Norman Elton', linux-raid



On Tue, 18 Dec 2007, Guy Watkins wrote:

> } -----Original Message-----
> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> } owner@vger.kernel.org] On Behalf Of Brendan Conoboy
> } Sent: Tuesday, December 18, 2007 3:36 PM
> } To: Norman Elton
> } Cc: linux-raid@vger.kernel.org
> } Subject: Re: Raid over 48 disks
> }
> } Norman Elton wrote:
> } > We're investigating the possibility of running Linux (RHEL) on top of
> } > Sun's X4500 Thumper box:
> } >
> } > http://www.sun.com/servers/x64/x4500/
> }
> } Neat- 6 8 port SATA controllers!  It'll be worth checking to be sure
> } each controller has equal bandwidth.  If some controllers are on slower
> } buses than others you may want to consider that and balance the md
> } device layout.
>
> Assuming the 6 controllers are equal, I would make 3 16 disk RAID6 arrays
> using 2 disks from each controller.  That way any 1 controller can fail and
> your system will still be running.  6 disks will be used for redundancy.
>
> Or 6 8 disk RAID6 arrays using 1 disk from each controller).  That way any 2
> controllers can fail and your system will still be running.  12 disks will
> be used for redundancy.  Might be too excessive!
>
> Combine them into a RAID0 array.
>
> Guy
>

I'd be curious what the maximum aggregate bandwidth would be with RAID 0 
of 48 disks on that controller..


* RE: Raid over 48 disks
  2007-12-18 23:58     ` Justin Piszcz
@ 2007-12-18 23:59       ` Justin Piszcz
  0 siblings, 0 replies; 28+ messages in thread
From: Justin Piszcz @ 2007-12-18 23:59 UTC (permalink / raw)
  To: Guy Watkins; +Cc: 'Brendan Conoboy', 'Norman Elton', linux-raid



On Tue, 18 Dec 2007, Justin Piszcz wrote:

>
>
> On Tue, 18 Dec 2007, Guy Watkins wrote:
>
>> } -----Original Message-----
>> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> } owner@vger.kernel.org] On Behalf Of Brendan Conoboy
>> } Sent: Tuesday, December 18, 2007 3:36 PM
>> } To: Norman Elton
>> } Cc: linux-raid@vger.kernel.org
>> } Subject: Re: Raid over 48 disks
>> }
>> } Norman Elton wrote:
>> } > We're investigating the possibility of running Linux (RHEL) on top of
>> } > Sun's X4500 Thumper box:
>> } >
>> } > http://www.sun.com/servers/x64/x4500/
>> }
>> } Neat- 6 8 port SATA controllers!  It'll be worth checking to be sure
>> } each controller has equal bandwidth.  If some controllers are on slower
>> } buses than others you may want to consider that and balance the md
>> } device layout.
>> 
>> Assuming the 6 controllers are equal, I would make 3 16 disk RAID6 arrays
>> using 2 disks from each controller.  That way any 1 controller can fail and
>> your system will still be running.  6 disks will be used for redundancy.
>> 
>> Or 6 8 disk RAID6 arrays using 1 disk from each controller).  That way any 
>> 2
>> controllers can fail and your system will still be running.  12 disks will
>> be used for redundancy.  Might be too excessive!
>> 
>> Combine them into a RAID0 array.
>> 
>> Guy
>> 
>
> I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 
> 48 disks on that controller..
>

A RAID 0 over all of the controllers rather, if possible..




* Re: Raid over 48 disks
  2007-12-18 20:28 ` Neil Brown
@ 2007-12-19  8:27   ` Mattias Wadenstein
  2007-12-19 15:26     ` Bill Davidsen
  2007-12-21 11:03     ` Leif Nixon
  2007-12-25 17:31   ` pg_mh, Peter Grandi
  1 sibling, 2 replies; 28+ messages in thread
From: Mattias Wadenstein @ 2007-12-19  8:27 UTC (permalink / raw)
  To: Neil Brown; +Cc: Norman Elton, linux-raid

On Wed, 19 Dec 2007, Neil Brown wrote:

> On Tuesday December 18, normelton@gmail.com wrote:
>> We're investigating the possibility of running Linux (RHEL) on top of
>> Sun's X4500 Thumper box:
>>
>> http://www.sun.com/servers/x64/x4500/
>>
>> Basically, it's a server with 48 SATA hard drives. No hardware RAID.
>> It's designed for Sun's ZFS filesystem.
>>
>> So... we're curious how Linux will handle such a beast. Has anyone run
>> MD software RAID over so many disks? Then piled LVM/ext3 on top of
>> that? Any suggestions?

There are those who have run Linux MD RAID on thumpers before. I vaguely 
recall some driver issues (unrelated to MD) that made it less suitable 
than solaris, but that might have been fixed in recent kernels.

> Alternately, 8 6drive RAID5s or 6 8raid RAID6s, and use RAID0 to
> combine them together.  This would give you adequate reliability and
> performance and still a large amount of storage space.

My personal suggestion would be 5 9-disk raid6s, one raid1 root mirror and 
one hot spare. Then raid0, lvm, or separate filesystems on those 5 raidsets 
for data, depending on your needs.

You get almost as much data space as with the 6 8-disk raid6s, and have a 
separate pair of disks for all the small updates (logging, metadata, etc), 
so this makes a lot of sense if most of the data is bulk file access.

/Mattias Wadenstein


* Re: Raid over 48 disks
  2007-12-18 23:50   ` Guy Watkins
  2007-12-18 23:58     ` Justin Piszcz
@ 2007-12-19 12:08     ` Russell Smith
  1 sibling, 0 replies; 28+ messages in thread
From: Russell Smith @ 2007-12-19 12:08 UTC (permalink / raw)
  To: Guy Watkins; +Cc: 'Brendan Conoboy', 'Norman Elton', linux-raid

Guy Watkins wrote:
> } -----Original Message-----
> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> } owner@vger.kernel.org] On Behalf Of Brendan Conoboy
> } Sent: Tuesday, December 18, 2007 3:36 PM
> } To: Norman Elton
> } Cc: linux-raid@vger.kernel.org
> } Subject: Re: Raid over 48 disks
> } 
> } Norman Elton wrote:
> } > We're investigating the possibility of running Linux (RHEL) on top of
> } > Sun's X4500 Thumper box:
> } >
> } > http://www.sun.com/servers/x64/x4500/
> } 
> } Neat- 6 8 port SATA controllers!  It'll be worth checking to be sure
> } each controller has equal bandwidth.  If some controllers are on slower
> } buses than others you may want to consider that and balance the md
> } device layout.
>
> Assuming the 6 controllers are equal, I would make 3 16 disk RAID6 arrays
> using 2 disks from each controller.  That way any 1 controller can fail and
> your system will still be running.  6 disks will be used for redundancy.
>
> Or 6 8 disk RAID6 arrays using 1 disk from each controller).  That way any 2
> controllers can fail and your system will still be running.  12 disks will
> be used for redundancy.  Might be too excessive!
>
> Combine them into a RAID0 array.
>
> Guy
Sounds interesting!

Just out of interest, what's stopping you from using Solaris?

Though I'm curious how md will compare to ZFS performance-wise. There 
is some interesting configuration info / advice for Solaris here, 
especially for the X4500: 
http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide


Russell


* Re: Raid over 48 disks
  2007-12-19 15:21         ` Bill Davidsen
@ 2007-12-19 15:02           ` Justin Piszcz
  2007-12-20 16:48           ` Thiemo Nagel
  1 sibling, 0 replies; 28+ messages in thread
From: Justin Piszcz @ 2007-12-19 15:02 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Thiemo Nagel, Norman Elton, linux-raid



On Wed, 19 Dec 2007, Bill Davidsen wrote:

> Thiemo Nagel wrote:
>>>> Performance of the raw device is fair:
>>>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>>>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>>> 
>>>> Somewhat less through ext3 (created with -E stride=64):
>>>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>>>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>>> 
>>> Quite slow?
>>> 
>>> 10 disks (raptors) raid 5 on regular sata controllers:
>>> 
>>> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
>>> 
>>> # dd if=bigfile of=/dev/zero bs=128k count=64k
>>> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>> 
>> Interesting.  Any ideas what could be the reason?  How much do you get from 
>> a single drive?  -- The Samsung HD501LJ that I'm using gives ~84MB/s when 
>> reading from the beginning of the disk.
>> 
>> With RAID 5 I'm getting slightly better results (though I really wonder 
>> why, since naively I would expect identical read performance) but that does 
>> only account for a small part of the difference:
>>
>>     16k read    64k write
>>   chunk
>>   size    RAID 5    RAID 6    RAID 5    RAID 6
>>   128k    492    497    268    270
>>   256k    615    530    288    270
>>   512k    625    607    230    174
>>   1024k   650    620    170    75
>> 
>
> What is your stripe cache size?

# Set stripe-cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

Justin.


* Re: Raid over 48 disks
  2007-12-18 21:13       ` Thiemo Nagel
  2007-12-18 21:20         ` Jon Nelson
  2007-12-18 21:21         ` Justin Piszcz
@ 2007-12-19 15:21         ` Bill Davidsen
  2007-12-19 15:02           ` Justin Piszcz
  2007-12-20 16:48           ` Thiemo Nagel
  2 siblings, 2 replies; 28+ messages in thread
From: Bill Davidsen @ 2007-12-19 15:21 UTC (permalink / raw)
  To: Thiemo Nagel; +Cc: Justin Piszcz, Norman Elton, linux-raid

Thiemo Nagel wrote:
>>> Performance of the raw device is fair:
>>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>>
>>> Somewhat less through ext3 (created with -E stride=64):
>>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>>
>> Quite slow?
>>
>> 10 disks (raptors) raid 5 on regular sata controllers:
>>
>> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
>>
>> # dd if=bigfile of=/dev/zero bs=128k count=64k
>> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>
> Interesting.  Any ideas what could be the reason?  How much do you get 
> from a single drive?  -- The Samsung HD501LJ that I'm using gives 
> ~84MB/s when reading from the beginning of the disk.
>
> With RAID 5 I'm getting slightly better results (though I really 
> wonder why, since naively I would expect identical read performance) 
> but that does only account for a small part of the difference:
>
>     16k read    64k write
> chunk
> size    RAID 5    RAID 6    RAID 5    RAID 6
> 128k    492    497    268    270
> 256k    615    530    288    270
> 512k    625    607    230    174
> 1024k   650    620    170    75
>

What is your stripe cache size?

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: Raid over 48 disks
  2007-12-19  8:27   ` Mattias Wadenstein
@ 2007-12-19 15:26     ` Bill Davidsen
  2007-12-21 11:03     ` Leif Nixon
  1 sibling, 0 replies; 28+ messages in thread
From: Bill Davidsen @ 2007-12-19 15:26 UTC (permalink / raw)
  To: Mattias Wadenstein; +Cc: Neil Brown, Norman Elton, linux-raid

Mattias Wadenstein wrote:
> On Wed, 19 Dec 2007, Neil Brown wrote:
>
>> On Tuesday December 18, normelton@gmail.com wrote:
>>> We're investigating the possibility of running Linux (RHEL) on top of
>>> Sun's X4500 Thumper box:
>>>
>>> http://www.sun.com/servers/x64/x4500/
>>>
>>> Basically, it's a server with 48 SATA hard drives. No hardware RAID.
>>> It's designed for Sun's ZFS filesystem.
>>>
>>> So... we're curious how Linux will handle such a beast. Has anyone run
>>> MD software RAID over so many disks? Then piled LVM/ext3 on top of
>>> that? Any suggestions?
>
> There are those that have run Linux MD RAID on thumpers before. I 
> vaguely recall some driver issues (unrelated to MD) that made it less 
> suitable than solaris, but that might be fixed in recent kernels.
>
>> Alternately, 8 6drive RAID5s or 6 8raid RAID6s, and use RAID0 to
>> combine them together.  This would give you adequate reliability and
>> performance and still a large amount of storage space.
>
> My personal suggestion would be 5 9-disk raid6s, one raid1 root mirror 
> and one hot spare. Then raid0, lvm, or separate filesystem on those 5 
> raidsets for data, depending on your needs.

Other than thinking raid-10 is better than raid-1 for performance, I like it.
>
> You get almost as much data space as with the 6 8-disk raid6s, and 
> have a separate pair of disks for all the small updates (logging, 
> metadata, etc), so this makes alot of sense if most of the data is 
> bulk file access.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: Raid over 48 disks
  2007-12-19 15:21         ` Bill Davidsen
  2007-12-19 15:02           ` Justin Piszcz
@ 2007-12-20 16:48           ` Thiemo Nagel
  2007-12-21  1:53             ` Bill Davidsen
  1 sibling, 1 reply; 28+ messages in thread
From: Thiemo Nagel @ 2007-12-20 16:48 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Justin Piszcz, Norman Elton, linux-raid

Bill Davidsen wrote:
>>     16k read    64k write
>>   chunk
>>   size    RAID 5    RAID 6    RAID 5    RAID 6
>>   128k    492    497    268    270
>>   256k    615    530    288    270
>>   512k    625    607    230    174
>>   1024k   650    620    170    75
>>   
> 
> What is your stripe cache size?

I didn't fiddle with the default when I did these tests.

Now (with 256k chunk size) I had

# cat stripe_cache_size
256

but increasing that to 1024 didn't show a noticeable improvement for 
reading.  Still around 550MB/s.

Kind regards,

Thiemo


* Re: Raid over 48 disks
  2007-12-20 16:48           ` Thiemo Nagel
@ 2007-12-21  1:53             ` Bill Davidsen
  0 siblings, 0 replies; 28+ messages in thread
From: Bill Davidsen @ 2007-12-21  1:53 UTC (permalink / raw)
  To: Thiemo Nagel; +Cc: Justin Piszcz, Norman Elton, linux-raid

Thiemo Nagel wrote:
> Bill Davidsen wrote:
>>>     16k read    64k write
>>>   chunk
>>>   size    RAID 5    RAID 6    RAID 5    RAID 6
>>>   128k    492    497    268    270
>>>   256k    615    530    288    270
>>>   512k    625    607    230    174
>>>   1024k   650    620    170    75
>>>   
>>
>> What is your stripe cache size?
>
> I didn't fiddle with the default when I did these tests.
>
> Now (with 256k chunk size) I had
>
> # cat stripe_cache_size
> 256
>
> but increasing that to 1024 didn't show a noticeable improvement for 
> reading.  Still around 550MB/s.

You can use blockdev to raise the readahead, either on the drives or the 
array. That may make a difference; I use 4-8MB on the drives, and more on 
the array depending on how I use it.
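
For example (blockdev takes the value in 512-byte sectors, so 8192 is 
4MB; the 32MB array figure and the device names are just placeholders):

# blockdev --setra 8192 /dev/sda
# blockdev --setra 65536 /dev/md2
# blockdev --getra /dev/md2
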
>
> Kind regards,
>
> Thiemo
>


-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: Raid over 48 disks
  2007-12-18 17:29 Raid over 48 disks Norman Elton
                   ` (3 preceding siblings ...)
  2007-12-18 20:36 ` Brendan Conoboy
@ 2007-12-21 10:57 ` Leif Nixon
  4 siblings, 0 replies; 28+ messages in thread
From: Leif Nixon @ 2007-12-21 10:57 UTC (permalink / raw)
  To: linux-raid

Norman Elton <normelton@gmail.com> writes:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box:
>
> http://www.sun.com/servers/x64/x4500/

I think BNL's evaluation of Solaris/ZFS vs. Linux/MD on a thumper
might be of interest:

  http://hepix.caspur.it/storage/hep_pdf/2007/Spring/Petkus_HEPiX_Spring06.storageeval.pdf

-- 
Leif Nixon                       -            Systems expert
------------------------------------------------------------
National Supercomputer Centre    -      Linkoping University
------------------------------------------------------------


* Re: Raid over 48 disks
  2007-12-19  8:27   ` Mattias Wadenstein
  2007-12-19 15:26     ` Bill Davidsen
@ 2007-12-21 11:03     ` Leif Nixon
  1 sibling, 0 replies; 28+ messages in thread
From: Leif Nixon @ 2007-12-21 11:03 UTC (permalink / raw)
  To: linux-raid

Mattias Wadenstein <maswan@acc.umu.se> writes:

> There are those that have run Linux MD RAID on thumpers before. I
> vaguely recall some driver issues (unrelated to MD) that made it less
> suitable than solaris, but that might be fixed in recent kernels.

I think that was mainly an issue for people trying to squeeze
Scientific Linux 3 onto their thumpers.

-- 
Leif Nixon                       -            Systems expert
------------------------------------------------------------
National Supercomputer Centre    -      Linkoping University
------------------------------------------------------------


* Re: Raid over 48 disks
  2007-12-18 20:28 ` Neil Brown
  2007-12-19  8:27   ` Mattias Wadenstein
@ 2007-12-25 17:31   ` pg_mh, Peter Grandi
  2007-12-25 21:08     ` Bill Davidsen
  1 sibling, 1 reply; 28+ messages in thread
From: pg_mh, Peter Grandi @ 2007-12-25 17:31 UTC (permalink / raw)
  To: Linux RAID

>>> On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown
>>> <neilb@suse.de> said:

[ ... what to do with 48 drive Sun Thumpers ... ]

neilb> I wouldn't create a raid5 or raid6 on all 48 devices.
neilb> RAID5 only survives a single device failure and with that
neilb> many devices, the chance of a second failure before you
neilb> recover becomes appreciable.

That's just one of the many problems, other are:

* If a drive fails, rebuild traffic is going to hit hard, with
  47 blocks being read in parallel to compute a new 48th.

* With a parity strip length of 48 it will be that much harder
  to avoid a read-modify-write, as it can be avoided
  only for writes of at least 48 blocks aligned on 48-block
  boundaries.  And reading 47 blocks to write one is going to be
  quite painful.
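
(With a 256k chunk, for instance, a full stripe across 47 data drives
is about 12MB, so anything smaller than that incurs the read-modify-write
penalty.  The 256k figure is only an example taken from earlier in the
thread.)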

[ ... ]

neilb> RAID10 would be a good option if you are happy wit 24
neilb> drives worth of space. [ ... ]

That sounds like the only feasible option (except for the 3
drive case in most cases). Parity RAID does not scale much
beyond 3-4 drives.

neilb> Alternately, 8 6drive RAID5s or 6 8raid RAID6s, and use
neilb> RAID0 to combine them together. This would give you
neilb> adequate reliability and performance and still a large
neilb> amount of storage space.

That sounds optimistic to me: the reason to do a RAID50 of
8x(5+1) can only be to have a single filesystem, else one could
have 8 distinct filesystems each with a subtree of the whole.
With a single filesystem the failure of any one of the 8 RAID5
components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case a loss of any two drives would lead to
complete loss; in the 8x(5+1) case only a loss of two drives in
the same RAID5 will.

It does not sound like a great improvement to me (especially
considering the thoroughly inane practice of building arrays out
of disks of the same make and model taken out of the same box).

There are also modest improvements in the RMW strip size and in
the cost of a rebuild after a single drive loss. Probably the
reduction in the RMW strip size is the best improvement.

Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single
23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem.
With current filesystem technology either size is worrying, for
example as to time needed for an 'fsck'.

In practice RAID5 beyond 3-4 drives seems only useful for almost
read-only filesystems where restoring from backups is quick and
easy, never mind the 47+1 case or the 8x(5+1) one, and I think
that giving some credit even to the latter arrangement is not
quite right...


* Re: Raid over 48 disks
  2007-12-25 17:31   ` pg_mh, Peter Grandi
@ 2007-12-25 21:08     ` Bill Davidsen
  0 siblings, 0 replies; 28+ messages in thread
From: Bill Davidsen @ 2007-12-25 21:08 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Peter Grandi wrote:
>>>> On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown
>>>> <neilb@suse.de> said:
>>>>         
>
> [ ... what to do with 48 drive Sun Thumpers ... ]
>
> neilb> I wouldn't create a raid5 or raid6 on all 48 devices.
> neilb> RAID5 only survives a single device failure and with that
> neilb> many devices, the chance of a second failure before you
> neilb> recover becomes appreciable.
>
> That's just one of the many problems, other are:
>
> * If a drive fails, rebuild traffic is going to hit hard, with
>   reading in parallel 47 blocks to compute a new 48th.
>
> * With a parity strip length of 48 it will be that much harder
>   to avoid read-modify before write, as it will be avoidable
>   only for writes of at least 48 blocks aligned on 48 block
>   boundaries. And reading 47 blocks to write one is going to be
>   quite painful.
>
> [ ... ]
>
> neilb> RAID10 would be a good option if you are happy wit 24
> neilb> drives worth of space. [ ... ]
>
> That sounds like the only feasible option (except for the 3
> drive case in most cases). Parity RAID does not scale much
> beyond 3-4 drives.
>
> neilb> Alternately, 8 6drive RAID5s or 6 8raid RAID6s, and use
> neilb> RAID0 to combine them together. This would give you
> neilb> adequate reliability and performance and still a large
> neilb> amount of storage space.
>
> That sounds optimistic to me: the reason to do a RAID50 of
> 8x(5+1) can only be to have a single filesystem, else one could
> have 8 distinct filesystems each with a subtree of the whole.
> With a single filesystem the failure of any one of the 8 RAID5
> components of the RAID0 will cause the loss of the whole lot.
>
> So in the 47+1 case a loss of any two drives would lead to
> complete loss; in the 8x(5+1) case only a loss of two drives in
> the same RAID5 will.
>
> It does not sound like a great improvement to me (especially
> considering the thoroughly inane practice of building arrays out
> of disks of the same make and model taken out of the same box).
>   

Quality control just isn't that good that "same box" makes a big 
difference, assuming that you have an appropriate number of hot spares 
online. Note that I said "big difference" -- is there some clustering of 
failures? Some, but damn little. A few years ago I was working with 
multiple 6TB machines and 20+ 1TB machines, all using small, fast 
drives in RAID5E. I can't remember a case where a drive failed before 
rebuild was complete, and only one or two where there was a failure to 
degraded mode before the hot spare was replaced.

That said, RAID5E can typically rebuild a lot faster than a typical hot 
spare kept as a whole drive, at least for any given impact on performance. 
This undoubtedly reduced our exposure time.
> There are also modest improvements in the RMW strip size and in
> the cost of a rebuild after a single drive loss. Probably the
> reduction in the RMW strip size is the best improvement.
>
> Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single
> 23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem.
> With current filesystem technology either size is worrying, for
> example as to time needed for an 'fsck'.
>   

Given that someone is putting a typical filesystem full of small files 
on a big raid, I agree. But fsck with large files is pretty fast on a 
given filesystem (200GB files on a 6TB ext3, for instance), due to the 
small number of inodes in play. While the bitmap resolution is a factor, 
it's pretty linear; fsck with lots of files gets really slow. And let's 
face it, the objective of raid is to avoid doing that fsck in the first 
place ;-)

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



