* mkfs options for a 16x hw raid5 and xfs (mostly large files)
@ 2007-09-23 9:38 Ralf Gross
2007-09-23 12:56 ` Peter Grandi
2007-09-24 17:31 ` Ralf Gross
0 siblings, 2 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-23 9:38 UTC (permalink / raw)
To: linux-xfs
Hi,
we have a new large raid array. The shelf has 48 disks; the maximum
number of disks in a single raid 5 set is 16. There will be one global
spare disk, thus we have two raid 5 sets with 15 data disks and one
with 14 data disks.
The data on these raid sets will be video data + some metadata.
Typically each set of data consists of a 2 GB + 500 MB + 100 MB + 20 KB
+ 2 KB file. There will be a few dozen of these sets in a single
directory - but not many hundreds or thousands.
Often the data will be transferred from the Windows clients to the
server in several parallel copy jobs at night (e.g. 5-10, one for each
new data directory). The clients will later access the data (mostly)
read-only; the data will not be changed after it is stored on the file
server. Each client then needs a data stream of about 17 MB/s (at most
5 clients are expected to access the data in parallel). I expect the
filesystems, each 10-11 TB in size, to be filled > 90%. I know this
is not ideal, but we need every GB we can get.
I already played with different mkfs.xfs options (sw, su) but didn't
see much of a difference.
The volume sets of the hw raid have the following parameters:
11,xx TB (15 data disks):
Chunk Size : 64 KB
(values of 64/128/256 KB are possible, I'll try 256 KB next week)
Stripe Size : 960 KB (15 x 64 KB)
or 10,xx TB (14 data disks):
Chunk Size : 64 KB
Stripe Size : 896 KB (14 x 64 KB)
The created logical volumes have a block size of 512 bytes (the only
possible value).
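For reference, the mapping from this geometry to mkfs.xfs alignment
options can be sketched like this (chunk size and disk count from the
15-data-disk set above; the device name is only an example):
```shell
# Map the RAID geometry onto mkfs.xfs alignment options:
# su = controller chunk size, sw = number of data (non-parity) disks.
chunk_kb=64        # controller chunk size
data_disks=15      # data disks in the RAID 5 set
stripe_kb=$((chunk_kb * data_disks))
echo "su=${chunk_kb}k sw=${data_disks} -> full stripe ${stripe_kb} KB"
# Destructive -- run only against the intended LUN (example device):
# mkfs.xfs -f -d su=${chunk_kb}k -d sw=${data_disks} /dev/sdd1
```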
Any ideas what options I should use for mkfs.xfs? At the moment I get
about 150 MB/s in sequential writing (tiobench) and 160 MB/s in
sequential reading. This is OK, but I'm curious what I could get with
tuned XFS parameters. The system is running Debian Etch (amd64) with
16 GB of RAM. The raid array is connected to the server by fibre channel.
Ralf
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-23 9:38 mkfs options for a 16x hw raid5 and xfs (mostly large files) Ralf Gross
@ 2007-09-23 12:56 ` Peter Grandi
2007-09-26 14:54 ` Ralf Gross
2007-09-24 17:31 ` Ralf Gross
1 sibling, 1 reply; 48+ messages in thread
From: Peter Grandi @ 2007-09-23 12:56 UTC (permalink / raw)
To: Linux XFS
>>> On Sun, 23 Sep 2007 11:38:41 +0200, Ralf Gross
>>> <Ralf-Lists@ralfgross.de> said:
Ralf> Hi, we have a new large raid array, the shelf has 48 disks,
Ralf> the maximum number of disks in a single raid 5 set is 16.
Too bad about that petty limitation ;-).
Ralf> There will be one global spare disk, thus we have two raid 5
Ralf> with 15 data disks and one with 14 data disk.
Ahhh a positive-thinking, can-do, brave design ;-).
[ ... ]
Ralf> Often the data will be transferred from the windows clients
Ralf> to the server in several parallel copy jobs at night (e.g. 5-10,
Ralf> for each new data directory). The clients will access the
Ralf> data later (mostly) read only, the data will not be changed
Ralf> after it was stored on the file server.
This is good, and perhaps one of the few cases in which even RAID5
naysayers might not object too much.
Ralf> Each client then needs a data stream of about 17 MB/s
Ralf> (max. 5 clients are expected to access the data in parallel).
Do the requirements include as features some (possibly several)
hours of ''challenging'' read performance if any disk fails or
total loss of data if another disks fails during that time? ;-)
IIRC Google have reported 5% per year disk failure rates across a
very wide, mostly uncorrelated population; you have 48 disks, so
perhaps 2-3 disks per year will fail. Perhaps more, and more often,
because they will likely all be from the same manufacturer, model
and batch, spinning in the same environment.
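A back-of-envelope check of that estimate (integer arithmetic only;
the 5% figure is the one cited above):
```shell
# Expected disk failures per year: 48 disks at a 5% annual failure rate.
disks=48
afr_pct=5
# work in tenths of a failure to stay in integer arithmetic
expected_tenths=$((disks * afr_pct * 10 / 100))
echo "expected failures/year: $((expected_tenths / 10)).$((expected_tenths % 10))"
```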
Ralf> [ ... ] I expect the fs, each will have a size of 10-11 TB,
Ralf> to be filled > 90%. I know this is not ideal, but we need
Ralf> every GB we can get.
That "every GB we can get" is often the key in ''wide RAID5''
stories. Cheap as well as fast and safe, you can have it all with
wide RAID5 setups, so the salesmen would say ;-).
Ralf> [ ... ] Stripe Size : 960 KB (15 x 64 KB)
Ralf> [ ... ] Stripe Size : 896 KB (14 x 64 KB)
Pretty long stripes, I wonder what happens when a whole stripe
cannot be written at once or it can but is not naturally aligned
;-).
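The concern can be made concrete with a quick sketch (stripe size from
the 15-disk set above; the 1 MiB write size is just an illustration):
```shell
# A write smaller than (or not aligned to) one full 960 KB stripe forces
# RAID5 to read old data/parity before writing -- the read-modify-write hit.
stripe_kb=960       # 15 x 64 KB
write_kb=1024       # an illustrative 1 MiB write
full_stripes=$((write_kb / stripe_kb))
tail_kb=$((write_kb % stripe_kb))
echo "full stripes: ${full_stripes}, partial tail needing RMW: ${tail_kb} KB"
```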
Ralf> [ ... ] about 150 MB/s in seq. writing
Surprise surprise ;-).
Ralf> (tiobench) and 160 MB/s in seq. reading.
This is sort of low. If there is something that RAID5 can do sort of
OK, it is reads (if there are no faults). I'd look at the underlying
storage system and the maximum performance that you can get out of
a single disk.
I have seen a 45-drive 500GB storage subsystem where each drive
can deliver at most 7-10MB/s (even if the same disk standalone in
an ordinary PC can do 60-70MB/s), and the supplier actually claims
so in their published literature (that RAID product is meant to
compete *only* with tape backup subsystems). Your later comment
that "The raid array is connected to the server by fibre channel"
makes me suspect that it may be the same brand.
Ralf> This is ok,
As the total aggregate requirement is 5x17 MB/s = 85 MB/s, this is
probably the case [as long as there are no drive failures ;-)].
Ralf> but I'm curious what I could get with tuned xfs parameters.
Looking at the archives of this mailing list the topic ''good mkfs
parameters'' reappears frequently, even if usually for smaller
arrays, as many have yet to discover the benefits of 15-wide RAID5
setups ;-). Threads like these may help:
http://OSS.SGI.com/archives/xfs/2007-01/msg00079.html
http://OSS.SGI.com/archives/xfs/2007-05/msg00051.html
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-23 9:38 mkfs options for a 16x hw raid5 and xfs (mostly large files) Ralf Gross
2007-09-23 12:56 ` Peter Grandi
@ 2007-09-24 17:31 ` Ralf Gross
2007-09-24 18:01 ` Justin Piszcz
1 sibling, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-24 17:31 UTC (permalink / raw)
To: linux-xfs
Ralf Gross schrieb:
>
> we have a new large raid array, the shelf has 48 disks, the maximum
> number of disks in a single raid 5 set is 16. There will be one global
> spare disk, thus we have two raid 5 sets with 15 data disks and one
> with 14 data disks.
>
> The data on these raid sets will be video data + some metadata.
> Typically each set of data consists of a 2 GB + 500 MB + 100 MB + 20 KB
> + 2 KB file. There will be a few dozen of these sets in a single
> directory - but not many hundreds or thousands.
> ...
> I already played with different mkfs.xfs options (sw, su) but didn't
> see much of a difference.
>
> The volume sets of the hw raid have the following parameters:
>
> 11,xx TB (15 data disks):
> Chunk Size : 64 KB
> (values of 64/128/256 KB are possible, I'll try 256 KB next week)
> Stripe Size : 960 KB (15 x 64 KB)
> ...
I did some more benchmarks with the 64 KB and 256 KB chunk size options
of the RAID array and the su=64k/su=256k option for mkfs.xfs.
4 tests:
two RAID 5 volumes (sdd + sdh, both in the same 48 disk shelf), each
with 15 data disks + 1 parity, 750 GB SATA disks
1. 256KB chunk size (HW RAID, sdd) + su=256K + sw=15
2. 256KB chunk size (HW RAID, sdd) + su=64K + sw=15
3. 64KB chunk size (HW RAID, sdh) + su=256K + sw=15
4. 64KB chunk size (HW RAID, sdh) + su=64K + sw=15
Although the manual of the HW RAID says that a 64 KB chunk size is
better with more drives, the results for the 256 KB chunk size look
better to me, and the chunk size seems to matter more than the mkfs
options. The same manual claims that RAID 5 is best for databases...
A bit OT: will I waste space on the RAID device with a 256 KB chunk size
and small files? Or does this only depend on the block size of the fs
(4 KB at the moment)?
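As far as I know the answer is the latter: XFS allocates in filesystem
blocks, so per-file slack is bounded by the 4 KB block size, not the
RAID chunk. A sketch with the 2 KB file from the data sets above:
```shell
# Allocation granularity is the fs block size (4 KB), not the 256 KB chunk;
# the chunk size only affects striping, not how much space a file occupies.
fs_block=4096
file_size=2048                       # the 2 KB file from a data set
blocks=$(( (file_size + fs_block - 1) / fs_block ))
slack=$(( blocks * fs_block - file_size ))
echo "allocated $((blocks * fs_block)) bytes, slack ${slack} bytes"
```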
1.)
Chunk Size: 256 KB
Stripe Size: 3840 KB
Array size: 11135 GB
Logical Drive Block Size: 512 bytes (only possible value)
mkfs.xfs -d su=256k -d sw=15 /dev/sdd1
/mnt# tiobench --numruns 3 --threads 1 --threads 2 --block 4096 --size 20000
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
------ ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 207.80 23.88% 0.055 50.43 0.00000 0.00000 870
20000 4096 2 197.86 44.29% 0.117 373.10 0.00000 0.00000 447
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
------- ---- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.90 0.569% 4.035 42.83 0.00000 0.00000 510
20000 4096 2 4.47 1.679% 5.201 69.75 0.00000 0.00000 266
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
------- ---- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 167.84 36.31% 0.055 9151.42 0.00053 0.00000 462
20000 4096 2 170.77 84.39% 0.099 8471.22 0.00066 0.00000 202
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
------- ---- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 1.97 0.990% 0.016 0.05 0.00000 0.00000 199
20000 4096 2 1.68 1.739% 0.019 3.04 0.00000 0.00000 97
2.)
Chunk Size: 256 KB
Stripe Size: 3840 KB
Array size: 11135 GB
Logical Drive Block Size: 512 bytes (only possible value)
mkfs.xfs -d su=64k -d sw=15 /dev/sdd1
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 203.15 25.13% 0.056 47.58 0.00000 0.00000 808
20000 4096 2 190.85 44.67% 0.121 370.55 0.00000 0.00000 427
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 1.98 0.592% 5.908 41.81 0.00000 0.00000 335
20000 4096 2 3.55 1.665% 6.417 69.23 0.00000 0.00000 213
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 168.97 35.47% 0.054 8338.06 0.00056 0.00000 476
20000 4096 2 159.21 73.18% 0.109 8133.66 0.00103 0.00000 218
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.01 1.046% 0.018 2.46 0.00000 0.00000 192
20000 4096 2 1.78 1.668% 0.020 2.98 0.00000 0.00000 107
3.)
Chunk Size: 64 KB
Stripe Size: 960 KB
Array size: 11135 GB
Logical Drive Block Size: 512 bytes (only possible value)
mkfs.xfs -d su=256k -d sw=15 /dev/sdh1
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 189.84 23.00% 0.061 43.77 0.00000 0.00000 825
20000 4096 2 173.20 40.87% 0.134 365.86 0.00000 0.00000 424
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.16 0.461% 5.415 38.47 0.00000 0.00000 469
20000 4096 2 2.94 1.379% 7.772 69.02 0.00000 0.00000 213
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 130.48 26.59% 0.076 10970.30 0.00097 0.00000 491
20000 4096 2 124.93 59.08% 0.134 10370.07 0.00173 0.00000 211
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 1.73 0.827% 0.018 2.32 0.00000 0.00000 209
20000 4096 2 1.83 1.609% 0.019 2.88 0.00000 0.00000 114
4.)
Chunk Size: 64 KB
Stripe Size: 960 KB
Array size: 11135 GB
Logical Drive Block Size: 512 bytes (only possible value)
mkfs.xfs -d su=64k -d sw=15 /dev/sdh1
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 193.87 21.96% 0.059 59.45 0.00000 0.00000 883
20000 4096 2 185.08 40.73% 0.125 369.16 0.00000 0.00000 454
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.88 0.565% 4.061 39.23 0.00000 0.00000 510
20000 4096 2 4.37 1.640% 5.199 75.55 0.00000 0.00000 266
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 143.80 31.12% 0.068 10424.88 0.00072 0.00000 462
20000 4096 2 115.01 53.56% 0.147 11421.10 0.00209 0.00000 215
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.05 0.753% 0.016 0.09 0.00000 0.00000 273
20000 4096 2 1.86 1.539% 0.018 0.09 0.00000 0.00000 121
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 17:31 ` Ralf Gross
@ 2007-09-24 18:01 ` Justin Piszcz
2007-09-24 20:39 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-24 18:01 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
> A bit OT: will I waste space on the RAID device with a 256 KB chunk size
> and small files? Or does this only depend on the block size of the fs
> (4 KB at the moment)?
That's a good question. I believe it depends only on the filesystem
block size, but I'll wait for someone to confirm - nice benchmarks!
I use a 1 MiB stripe myself, as I found that to give the best performance.
Justin.
On Mon, 24 Sep 2007, Ralf Gross wrote:
> [ ... ]
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 18:01 ` Justin Piszcz
@ 2007-09-24 20:39 ` Ralf Gross
2007-09-24 20:43 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-24 20:39 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz schrieb:
>> A bit ot: will I waste space on the RAID device with a 256K chunk size
>> and small files? Or does this only depend on the block size of the fs
>> (4KB at the moment).
>
> That's a good question, I believe its only respective of the filesystem
> size, but will wait for someone to confirm, nice benchmarks!
>
> I use a 1 MiB stripe myself as I found that to give the best performance.
256 KB is the largest chunk size I can choose for a raid set. BTW: the
HW RAID is an Overland Ultamus 4800.
The funny thing is that performance (256 KB chunks) is even better
without adding any sw/su options to the mkfs command.
mkfs.xfs /dev/sdd1 -f
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 208.33 23.81% 0.055 49.55 0.00000 0.00000 875
20000 4096 2 199.48 43.72% 0.116 376.85 0.00000 0.00000 456
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.83 0.604% 4.131 38.81 0.00000 0.00000 469
20000 4096 2 4.53 1.700% 4.995 67.15 0.00000 0.00000 266
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 188.15 42.98% 0.047 7547.93 0.00027 0.00000 438
20000 4096 2 167.76 76.89% 0.100 7521.34 0.00078 0.00000 218
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.08 0.869% 0.016 0.13 0.00000 0.00000 239
20000 4096 2 1.80 1.501% 0.020 6.28 0.00000 0.00000 12
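One read-only way to compare the tuned and untuned filesystems is to
look at the geometry each mkfs actually recorded; `xfs_info` reports it
as sunit/swidth in filesystem blocks (the mount point below is only an
example). A plain `mkfs.xfs` on a hardware-RAID LUN typically records
none (sunit=0, swidth=0):
```shell
# Convert su=256k, sw=15 into the sunit/swidth block counts that
# xfs_info should report for the tuned filesystem (4 KB fs blocks).
fs_block_kb=4
su_kb=256; sw=15
sunit=$((su_kb / fs_block_kb))
swidth=$((sunit * sw))
echo "tuned fs should report: sunit=${sunit} blks, swidth=${swidth} blks"
# Compare on the mounted filesystem (example mount point):
# xfs_info /mnt | grep sunit
```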
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 20:39 ` Ralf Gross
@ 2007-09-24 20:43 ` Justin Piszcz
2007-09-24 21:33 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-24 20:43 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Mon, 24 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>>> A bit ot: will I waste space on the RAID device with a 256K chunk size
>>> and small files? Or does this only depend on the block size of the fs
>>> (4KB at the moment).
>>
>> That's a good question, I believe its only respective of the filesystem
>> size, but will wait for someone to confirm, nice benchmarks!
>>
>> I use a 1 MiB stripe myself as I found that to give the best performance.
>
> 256KB is the largest chunk size I can choose for a raid set. BTW: the HW-RAID
> is an Overland Ultamus 4800.
>
> The funny thing is, that performance (256KB chunks) is even better without
> adding any sw/su option to the mkfs command.
>
> mkfs.xfs /dev/sdd1 -f
>
> Sequential Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 208.33 23.81% 0.055 49.55 0.00000 0.00000 875
> 20000 4096 2 199.48 43.72% 0.116 376.85 0.00000 0.00000 456
>
> Random Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 2.83 0.604% 4.131 38.81 0.00000 0.00000 469
> 20000 4096 2 4.53 1.700% 4.995 67.15 0.00000 0.00000 266
>
> Sequential Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 188.15 42.98% 0.047 7547.93 0.00027 0.00000 438
> 20000 4096 2 167.76 76.89% 0.100 7521.34 0.00078 0.00000 218
>
> Random Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 2.08 0.869% 0.016 0.13 0.00000 0.00000 239
> 20000 4096 2 1.80 1.501% 0.020 6.28 0.00000 0.00000 12
>
>
> Ralf
>
>
I find that to be the case with SW RAID as well (defaults are best).
Although with 16 drives(?) that is awfully slow.
With 6 SATA disks I get 160-180 MiB/s with raid5 and 250-280 MiB/s with
raid 0 (SW raid).
With 10 Raptors I get ~450 MiB/s write and ~550-600 MiB/s read, again
XFS + SW raid.
Justin.
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 20:43 ` Justin Piszcz
@ 2007-09-24 21:33 ` Ralf Gross
2007-09-24 21:36 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-24 21:33 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz schrieb:
>
>
> On Mon, 24 Sep 2007, Ralf Gross wrote:
>
> >Justin Piszcz schrieb:
> >>>A bit ot: will I waste space on the RAID device with a 256K chunk size
> >>>and small files? Or does this only depend on the block size of the fs
> >>>(4KB at the moment).
> >>
> >>That's a good question, I believe its only respective of the filesystem
> >>size, but will wait for someone to confirm, nice benchmarks!
> >>
> >>I use a 1 MiB stripe myself as I found that to give the best performance.
> >
> >256KB is the largest chunk size I can choose for a raid set. BTW: the
> >HW-RAID
> >is an Overland Ultamus 4800.
> >
> >The funny thing is, that performance (256KB chunks) is even better without
> >adding any sw/su option to the mkfs command.
> >
> >mkfs.xfs /dev/sdd1 -f
> >
> >[ ... ]
>
> I find that to be the case with SW RAID (defaults are best)
>
> Although with 16 drives(?) that is awfully slow.
>
> I have 6 SATA's I get 160-180 MiB/s raid5 and 250-280 MiB/s raid 0 (sw
> raid).
>
> With 10 raptors I get ~450 MiB/s write and ~550-600 MiB/s read, again
> XFS+SW raid.
Hm, with the different HW RAIDs I've used so far (easyRAID,
Infortrend, internal Areca controller), I've always got 160-200 MiB/s
read/write with 7-15 disks. That's one reason why I asked if there are
some XFS options I could use for better performance. But I guess fs
options won't boost performance that much.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 21:33 ` Ralf Gross
@ 2007-09-24 21:36 ` Justin Piszcz
2007-09-24 21:52 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-24 21:36 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Mon, 24 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>>
>>
>> On Mon, 24 Sep 2007, Ralf Gross wrote:
>>
>>> Justin Piszcz schrieb:
>>>>> A bit ot: will I waste space on the RAID device with a 256K chunk size
>>>>> and small files? Or does this only depend on the block size of the fs
>>>>> (4KB at the moment).
>>>>
>>>> That's a good question, I believe its only respective of the filesystem
>>>> size, but will wait for someone to confirm, nice benchmarks!
>>>>
>>>> I use a 1 MiB stripe myself as I found that to give the best performance.
>>>
>>> 256KB is the largest chunk size I can choose for a raid set. BTW: the
>>> HW-RAID
>>> is an Overland Ultamus 4800.
>>>
>>> The funny thing is, that performance (256KB chunks) is even better without
>>> adding any sw/su option to the mkfs command.
>>>
>>> mkfs.xfs /dev/sdd1 -f
>>>
>>> Sequential Reads
>>> File  Blk   Num  Avg     (CPU%)  Avg      Maximum  Lat%     Lat%     CPU
>>> Size  Size  Thr  Rate            Latency  Latency  >2s      >10s     Eff
>>> 20000 4096  1    208.33  23.81%  0.055    49.55    0.00000  0.00000  875
>>> 20000 4096  2    199.48  43.72%  0.116    376.85   0.00000  0.00000  456
>>>
>>> Random Reads
>>> File  Blk   Num  Avg     (CPU%)  Avg      Maximum  Lat%     Lat%     CPU
>>> Size  Size  Thr  Rate            Latency  Latency  >2s      >10s     Eff
>>> 20000 4096  1    2.83    0.604%  4.131    38.81    0.00000  0.00000  469
>>> 20000 4096  2    4.53    1.700%  4.995    67.15    0.00000  0.00000  266
>>>
>>> Sequential Writes
>>> File  Blk   Num  Avg     (CPU%)  Avg      Maximum  Lat%     Lat%     CPU
>>> Size  Size  Thr  Rate            Latency  Latency  >2s      >10s     Eff
>>> 20000 4096  1    188.15  42.98%  0.047    7547.93  0.00027  0.00000  438
>>> 20000 4096  2    167.76  76.89%  0.100    7521.34  0.00078  0.00000  218
>>>
>>> Random Writes
>>> File  Blk   Num  Avg     (CPU%)  Avg      Maximum  Lat%     Lat%     CPU
>>> Size  Size  Thr  Rate            Latency  Latency  >2s      >10s     Eff
>>> 20000 4096  1    2.08    0.869%  0.016    0.13     0.00000  0.00000  239
>>> 20000 4096  2    1.80    1.501%  0.020    6.28     0.00000  0.00000  12
>>>
>>
>> I find that to be the case with SW RAID (defaults are best)
>>
>> Although with 16 drives(?) that is awfully slow.
>>
>> With 6 SATA disks I get 160-180 MiB/s RAID 5 and 250-280 MiB/s
>> RAID 0 (SW RAID).
>>
>> With 10 raptors I get ~450 MiB/s write and ~550-600 MiB/s read, again
>> XFS+SW raid.
>
> Hm, with the different HW-RAIDs I've used so far (easyRAID,
> Infortrend, internal Areca controller), I always got 160-200 MiB/s
> read/write with 7-15 disks. That's one reason why I asked if there are
> some xfs options I could use for better performance. But I guess fs
> options won't boost performance that much.
>
> Ralf
>
>
What do you get when (reading) from the raw device?
dd if=/dev/sda bs=1M count=10240
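One caveat with such a dd benchmark (a sketch, assuming GNU coreutils dd): a read smaller than RAM mostly measures the page cache, so either read more data than RAM or pass iflag=direct. The MB/s figure dd prints is simply decimal bytes divided by elapsed seconds:

```shell
# Sketch: how dd's MB/s figure is derived (decimal megabytes).
# To bypass the page cache entirely, GNU dd also accepts iflag=direct:
#   dd if=/dev/sdd of=/dev/null bs=1M count=20480 iflag=direct
BYTES=21474836480   # 20480 MiB, as used later in this thread
SECS=95             # elapsed seconds, rounded down
MBPS=$(( BYTES / 1000000 / SECS ))
echo "${MBPS} MB/s"  # prints "226 MB/s"
```

This roughly matches the ~225 MB/s that the 16x array reports later in the thread.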
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 21:36 ` Justin Piszcz
@ 2007-09-24 21:52 ` Ralf Gross
2007-09-25 12:35 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-24 21:52 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz wrote:
> >>I find that to be the case with SW RAID (defaults are best)
> >>
> >>Although with 16 drives(?) that is awfully slow.
> >>
> >>With 6 SATA disks I get 160-180 MiB/s RAID 5 and 250-280 MiB/s
> >>RAID 0 (SW RAID).
> >>
> >>With 10 raptors I get ~450 MiB/s write and ~550-600 MiB/s read, again
> >>XFS+SW raid.
> >
> >Hm, with the different HW-RAIDs I've used so far (easyRAID,
> >Infortrend, internal Areca controller), I always got 160-200 MiB/s
> >read/write with 7-15 disks. That's one reason why I asked if there are
> >some xfs options I could use for better performance. But I guess fs
> >options won't boost performance that much.
>
> What do you get when (reading) from the raw device?
>
> dd if=/dev/sda bs=1M count=10240
The server has 16 GB RAM, so I tried it with 20 GB of data.
dd if=/dev/sdd of=/dev/null bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB) copied, 95.3738 seconds, 225 MB/s
and a second try:
dd if=/dev/sdd of=/dev/null bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB) copied, 123.78 seconds, 173 MB/s
I'm too tired to interpret these numbers at the moment; I'll do some
more testing tomorrow.
Good night,
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-24 21:52 ` Ralf Gross
@ 2007-09-25 12:35 ` Ralf Gross
2007-09-25 12:50 ` Justin Piszcz
2007-09-25 12:57 ` KELEMEN Peter
0 siblings, 2 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 12:35 UTC (permalink / raw)
To: linux-xfs
Ralf Gross wrote:
> > What do you get when (reading) from the raw device?
> >
> > dd if=/dev/sda bs=1M count=10240
>
> The server has 16 GB RAM, so I tried it with 20 GB of data.
>
> dd if=/dev/sdd of=/dev/null bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes (21 GB) copied, 95.3738 seconds, 225 MB/s
>
> and a second try:
>
> dd if=/dev/sdd of=/dev/null bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes (21 GB) copied, 123.78 seconds, 173 MB/s
>
> I'm too tired to interpret these numbers at the moment; I'll do some
> more testing tomorrow.
There is a second RAID device attached to the server (24x RAID5). The
numbers I get from this device are a bit worse than the 16x RAID 5
numbers (150MB/s read with dd).
I'm really wondering how people can achieve transfer rates of
400MB/s and more. I know that I'm limited by the FC controller, but
I don't even get >200MB/s.
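The FC ceiling can be put in numbers (a back-of-envelope sketch; assumes 4GFC's 4.25 Gbaud line rate and 8b/10b encoding, ignoring frame overhead):

```shell
# 4GFC line rate is 4.25 Gbaud; 8b/10b encoding carries 8 data bits
# per 10 line bits, so the usable payload rate is roughly:
LINE_MBAUD=4250
DATA_MBITS=$(( LINE_MBAUD * 8 / 10 ))   # strip 8b/10b overhead
PAYLOAD_MB=$(( DATA_MBITS / 8 ))        # bits -> bytes, per direction
echo "~${PAYLOAD_MB} MB/s"
```

So ~400 MB/s is the practical best case through a single 4Gb FC HBA; single-array numbers above that usually involve multiple links or local (non-FC) disks.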
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 12:35 ` Ralf Gross
@ 2007-09-25 12:50 ` Justin Piszcz
2007-09-25 13:44 ` Bryan J Smith
2007-09-25 12:57 ` KELEMEN Peter
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-25 12:50 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Tue, 25 Sep 2007, Ralf Gross wrote:
> Ralf Gross wrote:
>>> What do you get when (reading) from the raw device?
>>>
>>> dd if=/dev/sda bs=1M count=10240
>>
>> The server has 16 GB RAM, so I tried it with 20 GB of data.
>>
>> dd if=/dev/sdd of=/dev/null bs=1M count=20480
>> 20480+0 records in
>> 20480+0 records out
>> 21474836480 bytes (21 GB) copied, 95.3738 seconds, 225 MB/s
>>
>> and a second try:
>>
>> dd if=/dev/sdd of=/dev/null bs=1M count=20480
>> 20480+0 records in
>> 20480+0 records out
>> 21474836480 bytes (21 GB) copied, 123.78 seconds, 173 MB/s
>>
>> I'm too tired to interpret these numbers at the moment; I'll do some
>> more testing tomorrow.
>
> There is a second RAID device attached to the server (24x RAID5). The
> numbers I get from this device are a bit worse than the 16x RAID 5
> numbers (150MB/s read with dd).
>
> I'm really wondering how people can achieve transfer rates of
> 400MB/s and more. I know that I'm limited by the FC controller, but
> I don't even get >200MB/s.
>
> Ralf
>
>
Perhaps something is wrong with your setup?
Here are my 10 raptors in RAID5 using Software RAID (no hw raid
controller):
p34:~# dd if=/dev/md3 of=/dev/null bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 29.8193 seconds, 576 MB/s
p34:~#
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 12:35 ` Ralf Gross
2007-09-25 12:50 ` Justin Piszcz
@ 2007-09-25 12:57 ` KELEMEN Peter
2007-09-25 13:49 ` Ralf Gross
1 sibling, 1 reply; 48+ messages in thread
From: KELEMEN Peter @ 2007-09-25 12:57 UTC (permalink / raw)
To: linux-xfs
* Ralf Gross (ralf-lists@ralfgross.de) [20070925 14:35]:
> There is a second RAID device attached to the server (24x
> RAID5). The numbers I get from this device are a bit worse than
> the 16x RAID 5 numbers (150MB/s read with dd).
You are expecting 24 spindles to line up whenever you have a write
request, which has to be 23*chunksize bytes in order to avoid RMW.
Additionally, your array is so big that you're very likely to hit
another error while rebuilding. Chop up your monster RAID5 array
into smaller arrays and stripe across them. Even better, consider
RAID10.
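The full-stripe sizes behind this point can be made concrete for the arrays in this thread: with a 64 KB chunk, a write must cover (data disks × chunk size) bytes to avoid RMW. A sketch (the mkfs.xfs device name in the comment is hypothetical):

```shell
# Full-stripe write size = data disks * chunk size.
CHUNK_KB=64
STRIPE_15=$(( 15 * CHUNK_KB ))   # 16-disk RAID 5 -> 960 KB
STRIPE_23=$(( 23 * CHUNK_KB ))   # 24-disk RAID 5 -> 1472 KB
echo "15 data disks: ${STRIPE_15} KB; 23 data disks: ${STRIPE_23} KB"
# Telling XFS about the geometry (hypothetical device):
#   mkfs.xfs -d su=64k,sw=15 /dev/sdd1
```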
Peter
--
.+'''+. .+'''+. .+'''+. .+'''+. .+''
Kelemen Péter / \ / \ Peter.Kelemen@cern.ch
.+' `+...+' `+...+' `+...+' `+...+'
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 12:50 ` Justin Piszcz
@ 2007-09-25 13:44 ` Bryan J Smith
0 siblings, 0 replies; 48+ messages in thread
From: Bryan J Smith @ 2007-09-25 13:44 UTC (permalink / raw)
To: Justin Piszcz, xfs-bounce, Ralf Gross; +Cc: linux-xfs
There is not a week that goes by without this on some list.
Benchmarks not under load are useless, and hardware RAID
shows no advantage at all there; it can actually be hurt, since all
data is committed to the I/O controller synchronously at the driver.
Furthermore, there is a huge difference between software
RAID-5 reads and writes: read benchmarks are basically RAID-0
(minus one disc), which is always faster with software RAID-0.
Again, testing under actual, production load is how you gauge performance.
If your application is CPU bound, like most web servers, then software
RAID-5 is fine, because A) little I/O is required, so there is plenty of
system interconnect throughput available for LOAD-XOR-STOR,
and B) web servers are heavily read-oriented rather than write-oriented.
But if your server is a file server, then the amount of interconnect
required for the LOAD-XOR-STOR of software RAID-5 detracts from
what is available for the I/O-intensive operations of the file service.
You can't measure that at the kernel at all, much less when not under load.
Benchmark multiple clients hitting the server to see what they get.
Furthermore, when you're concerned about I/O, you don't stop
at your storage controller; also consider RX TOE on your HBA GbE NIC(s),
the latency vs. throughput of your discs, etc...
--
Bryan J Smith - mailto:b.j.smith@ieee.org
http://thebs413.blogspot.com
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
Date: Tue, 25 Sep 2007 08:50:15
To:Ralf Gross <Ralf-Lists@ralfgross.de>
Cc:linux-xfs@oss.sgi.com
Subject: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
On Tue, 25 Sep 2007, Ralf Gross wrote:
> Ralf Gross wrote:
>>> What do you get when (reading) from the raw device?
>>>
>>> dd if=/dev/sda bs=1M count=10240
>>
>> The server has 16 GB RAM, so I tried it with 20 GB of data.
>>
>> dd if=/dev/sdd of=/dev/null bs=1M count=20480
>> 20480+0 records in
>> 20480+0 records out
>> 21474836480 bytes (21 GB) copied, 95.3738 seconds, 225 MB/s
>>
>> and a second try:
>>
>> dd if=/dev/sdd of=/dev/null bs=1M count=20480
>> 20480+0 records in
>> 20480+0 records out
>> 21474836480 bytes (21 GB) copied, 123.78 seconds, 173 MB/s
>>
>> I'm too tired to interpret these numbers at the moment; I'll do some
>> more testing tomorrow.
>
> There is a second RAID device attached to the server (24x RAID5). The
> numbers I get from this device are a bit worse than the 16x RAID 5
> numbers (150MB/s read with dd).
>
> I'm really wondering how people can achieve transfer rates of
> 400MB/s and more. I know that I'm limited by the FC controller, but
> I don't even get >200MB/s.
>
> Ralf
>
>
Perhaps something is wrong with your setup?
Here are my 10 raptors in RAID5 using Software RAID (no hw raid
controller):
p34:~# dd if=/dev/md3 of=/dev/null bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 29.8193 seconds, 576 MB/s
p34:~#
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 12:57 ` KELEMEN Peter
@ 2007-09-25 13:49 ` Ralf Gross
2007-09-25 14:08 ` Bryan J Smith
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 13:49 UTC (permalink / raw)
To: linux-xfs
KELEMEN Peter wrote:
> * Ralf Gross (ralf-lists@ralfgross.de) [20070925 14:35]:
>
> > There is a second RAID device attached to the server (24x
> > RAID5). The numbers I get from this device are a bit worse than
> > the 16x RAID 5 numbers (150MB/s read with dd).
>
> You are expecting 24 spindles to line up whenever you have a write
> request, which has to be 23*chunksize bytes in order to avoid RMW.
> Additionally, your array is so big that you're very likely to hit
> another error while rebuilding. Chop up your monster RAID5 array
> into smaller arrays and stripe across them. Even better, consider
> RAID10.
RAID10 is not an option; we need 60+ TB at the moment, mostly large video
files. Basically the read/write performance we get with the 16x RAID 5
is sufficient for our needs. The 24x RAID 5 is only a test device. The
volumes that will be used in the future are the 16/15x RAIDs (48 disk
shelf with 3 volumes).
I'm just wondering how people get 400+ MB/s with HW-RAID 5.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 13:49 ` Ralf Gross
@ 2007-09-25 14:08 ` Bryan J Smith
2007-09-25 16:07 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Bryan J Smith @ 2007-09-25 14:08 UTC (permalink / raw)
To: Ralf Gross, linux-xfs
Use multiple cards on multiple PCI-X/PCIe channels, each with
their own RAID-5 (or 6) volume, and then stripe (OS LVM) RAID-0
across the volumes.
Depending on your network service and application, you can use
either hardware or software for the RAID-5 (or 6).
If it's heavily read-only servicing, then software RAID works great,
because it's essentially RAID-0 (minus 1 disc).
But always use the OS RAID (e.g., LVM stripe) to stripe RAID-0
across all volumes, assuming there is not an OS volume limit
(of course ;).
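The striping described above can be sketched with LVM (hypothetical device names and a hypothetical 256 KB stripe size; a command sketch, not tested against this hardware):

```shell
# Hypothetical: stripe RAID-0 across three hardware RAID-5 volumes.
pvcreate /dev/sda1 /dev/sdb1 /dev/sdc1
vgcreate bigvg /dev/sda1 /dev/sdb1 /dev/sdc1
lvcreate -i 3 -I 256 -l 100%FREE -n bigvol bigvg  # -i stripes, -I stripe size (KB)
mkfs.xfs /dev/bigvg/bigvol
```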
Software RAID is extremely fast at XORs; that's not the problem.
The problem is how the data streams through the PC's inefficient
I/O interconnect. PCs have gotten much better, but the load still
detracts from other I/O that services may contend for.
Software RAID-5 writes are, essentially, "programmed I/O."
Every single commit has to have its parity block programmed
by the CPU, which is difficult to benchmark because the
bottleneck is not the CPU, but the LOAD-XOR-STOR over the interconnect.
An IOP is designed with ASIC peripherals to do that in-line, in real time.
In fact, by the very nature of the IOP driver, the operation is synchronous
from the OS' standpoint, unlike software RAID optimizations by the OS.
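The LOAD-XOR-STOR at issue is just parity arithmetic; a toy sketch with single-byte "chunks" (real arrays do this per block) shows why a small write forces a read-modify-write on parity:

```shell
# RAID-5 parity is the XOR of the data chunks.
D0=170; D1=204; D2=15
P=$(( D0 ^ D1 ^ D2 ))        # initial parity
# Updating one chunk forces a read-modify-write on parity:
# read old D1 and old P, XOR both with the new data, write back.
NEW_D1=60
P=$(( P ^ D1 ^ NEW_D1 ))
D1=$NEW_D1
echo "parity=${P}"           # equals D0 ^ D1 ^ D2 again
```

Each such update is two reads plus two writes, plus the XOR traffic over the host interconnect when done in software.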
--
Bryan J Smith - mailto:b.j.smith@ieee.org
http://thebs413.blogspot.com
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Ralf Gross <Ralf-Lists@ralfgross.de>
Date: Tue, 25 Sep 2007 15:49:56
To:linux-xfs@oss.sgi.com
Subject: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
KELEMEN Peter wrote:
> * Ralf Gross (ralf-lists@ralfgross.de) [20070925 14:35]:
>
> > There is a second RAID device attached to the server (24x
> > RAID5). The numbers I get from this device are a bit worse than
> > the 16x RAID 5 numbers (150MB/s read with dd).
>
> You are expecting 24 spindles to line up whenever you have a write
> request, which has to be 23*chunksize bytes in order to avoid RMW.
> Additionally, your array is so big that you're very likely to hit
> another error while rebuilding. Chop up your monster RAID5 array
> into smaller arrays and stripe across them. Even better, consider
> RAID10.
RAID10 is not an option; we need 60+ TB at the moment, mostly large video
files. Basically the read/write performance we get with the 16x RAID 5
is sufficient for our needs. The 24x RAID 5 is only a test device. The
volumes that will be used in the future are the 16/15x RAIDs (48 disk
shelf with 3 volumes).
I'm just wondering how people get 400+ MB/s with HW-RAID 5.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 14:08 ` Bryan J Smith
@ 2007-09-25 16:07 ` Ralf Gross
2007-09-25 16:28 ` Bryan J. Smith
2007-09-25 16:48 ` Justin Piszcz
0 siblings, 2 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 16:07 UTC (permalink / raw)
To: linux-xfs
Bryan J Smith wrote:
> Use multiple cards on multiple PCI-X/PCIe channels, each with
> their own RAID-5 (or 6) volume, and then stripe (OS LVM) RAID-0
> across the volumes.
The hardware is fixed to one PCI-X FC HBA (4Gb) and two 48x shelves.
The performance I get with this setup is ok for us. The data will be
stored in bunches of multiple TB. Only a few clients will access the
data, maybe 5-10 clients at the same time.
> Depending on your network service and application, you can use
> either hardware or software for the RAID-5 (or 6).
> If it's heavily read-only servicing, then software RAID works great,
> because it's essentially RAID-0 (minus 1 disc).
> But always use the OS RAID (e.g., LVM stripe) to stripe RAID-0
> across all volumes, assuming there is not an OS volume limit
> (of course ;).
> [...]
I always use SW-RAID for RAID0 and RAID1. But for RAID 5/6 I choose
either external arrays or internal controllers (Areca).
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 16:07 ` Ralf Gross
@ 2007-09-25 16:28 ` Bryan J. Smith
2007-09-25 17:25 ` Ralf Gross
2007-09-25 16:48 ` Justin Piszcz
1 sibling, 1 reply; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-25 16:28 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> The hardware is fixed to one PCI-X FC HBA (4Gb) and two 48x shelves.
> The performance I get with this setup is ok for us. The data will
> be stored in bunches of multiple TB. Only a few clients will access
> the data, maybe 5-10 clients at the same time.
If raw performance is your ultimate goal, the closer you are to the
hardware, and the less overhead in the protocol, the better.
Direct SATA channels (software RAID-10), or taking advantage of the
3Ware ASIC+SRAM (hardware RAID-10), is ideal. I've put in a
setup myself that used three (3) 3Ware Escalade 9550SX cards on three
(3) different PCI-X channels, and then striped RAID-0 across all
three (3) volumes (found little difference between using the OS LVM
or the 3Ware manager for the RAID-0 stripe across volumes).
Using a buffered RAID-5 hardware solution is not going to get you the
best latency or direct DTR, if that is what matters. In most cases,
it does not, depending on your application.
> I always use SW-RAID for RAID0 and RAID1. But for RAID 5/6 I choose
> either external arrays or internal controllers (Areca).
Areca is the Intel IOP + firmware. Intel's X-Scale storage
processing engines (SPE) seem to best 3Ware's AMCC PowerPC engine.
The off-load is massive when I/O is an issue. Unfortunately, I still
find I prefer 3Ware's firmware and software support in Linux over
Areca, and Intel clearly does not have the dedication to addressing
issues that 3Ware does (just like back in the IOP30x/i960 days,
sigh).
To me, support is key. I've yet to drop a 3Ware volume myself. The
only people who seem to drop a volume are typically using 3Ware in
JBOD mode, or are "early adopters" of new products. I don't care if
it's hardware or software, "early adoption" of anything is just not
worth it. I'd rather have reduced performance for "peace of mind."
3Ware has a solid history on Linux, and my experiences over 7 years
are the ultimate proof.**
[ **NOTE: Don't get me started. The common "proprietary" or
"hardware reliance" argument doesn't hold, because 3Ware's volume
upward compatibility is proven (I've moved volumes of ATA 6000 to
7000 series, SATA 8000 to 9000, etc...), and they have shared the
data organization so you can read them with dmraid as well. I.e.,
you can always fall back to reading your data off a 3Ware volume with
dmraid these days. I've also _never_ had an "ATA timeout" issue with
3Ware cards, because 3Ware updates its firmware regularly to "deal"
with troublesome [S]ATA drives. That has bitten me far too many
times in Linux with direct [S]ATA -- not Linux's fault, just the
fault of hardware [S]ATA PHY chips and their on-drive IDE firmware,
something 3Ware has mitigated for me time and time again. ]
I'm completely biased though, I assemble file and database servers,
not web or other CPU-bound systems. Turning my system interconnect
(not the CPU, a PC CPU crunches XOR very fast) into a bottlenecked
PIO operation is not ideal for NFS writes or large record SQL commits
in my experience. Heck, one look at NetApp's volume w/NVRAM and
SPE-accelerated RAID-4 designs will quickly change your opinion as
well (and make you wonder if they aren't worth the cost at times as
well ;).
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 16:07 ` Ralf Gross
2007-09-25 16:28 ` Bryan J. Smith
@ 2007-09-25 16:48 ` Justin Piszcz
2007-09-25 18:00 ` Bryan J. Smith
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-25 16:48 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Tue, 25 Sep 2007, Ralf Gross wrote:
> Bryan J Smith wrote:
>> Use multiple cards on multiple PCI-X/PCIe channels, each with
>> their own RAID-5 (or 6) volume, and then stripe (OS LVM) RAID-0
>> across the volumes.
>
> The hardware is fixed to one PCI-X FC HBA (4Gb) and two 48x shelves.
> The performance I get with this setup is ok for us. The data will be
> stored in bunches of multiple TB. Only a few clients will access the
> data, maybe 5-10 clients at the same time.
>
>> Depending on your network service and application, you can use
>> either hardware or software for the RAID-5 (or 6).
>> If it's heavily read-only servicing, then software RAID works great,
>> because it's essentially RAID-0 (minus 1 disc).
>> But always use the OS RAID (e.g., LVM stripe) to stripe RAID-0
>> across all volumes, assuming there is not an OS volume limit
>> (of course ;).
>> [...]
>
> I always use SW-RAID for RAID0 and RAID1. But for RAID 5/6 I choose
> either external arrays or internal controllers (Areca).
>
> Ralf
>
>
Just out of curiosity, have you tried SW RAID 5 on this array?
Also what do you get if you use RAID0 (hw or sw)?
Justin.
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 16:28 ` Bryan J. Smith
@ 2007-09-25 17:25 ` Ralf Gross
2007-09-25 17:41 ` Bryan J. Smith
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 17:25 UTC (permalink / raw)
To: linux-xfs
Bryan J. Smith wrote:
> Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> ...
> I'm completely biased though, I assemble file and database servers,
> not web or other CPU-bound systems. Turning my system interconnect
> (not the CPU, a PC CPU crunches XOR very fast) into a bottlenecked
> PIO operation is not ideal for NFS writes or large record SQL commits
> in my experience. Heck, one look at NetApp's volume w/NVRAM and
> SPE-accelerated RAID-4 designs will quickly change your opinion as
> well (and make you wonder if they aren't worth the cost at times as
> well ;).
Thanks for all the details. Before I leave the office (it's getting
dark here): I think the Overland RAID we have (48x Disk) is from the
same manufacturer (Xyratex) that builds some devices for NetApp.
Our profile is not that performance driven, thus the ~200MB/s
read/write performance is ok. We just need cheap storage ;)
Still I'm wondering how other people saturate a 4 Gb FC controller
with one single RAID 5. At least that's what I've seen in some
benchmarks and here on the list.
If dd doesn't give me more than 200MB/s, the problem could only be the
array, the controller or the FC connection, given that other setups are
similar and not using different controllers or stripe layouts.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 17:25 ` Ralf Gross
@ 2007-09-25 17:41 ` Bryan J. Smith
2007-09-25 19:13 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-25 17:41 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> Thanks for all the details. Before I leave the office (it's getting
> dark here): I think the Overland RAID we have (48x Disk) is from
> the same manufacturer (Xyratex) that builds some devices for
> NetApp.
There's a lot of cross-fabbing these days. I was referring more to
NetApp's combined hardware-OS-volume approach, although that was
clearly a poor tangent by myself.
> Our profile is not that performance driven, thus the ~200MB/s
> read/write performance is ok. We just need cheap storage ;)
For what application? That is the question. I mean, sustained
software RAID-5 writes can be a PITA. E.g., the dd example prior
doesn't even do XOR recalculation, it merely copies the existing
parity block with data. Doing sustained software RAID-5 writes can
easily drop under 50MBps, as the PC interconnect was not designed to
stream data (programmed I/O), only direct it (Direct Memory Access).
> Still I'm wondering how other people saturate a 4 Gb FC controller
> with one single RAID 5. At least that's what I've seen in some
> benchmarks and here on the list.
Depends on the solution, the benchmark, etc...
> If dd doesn't give me more than 200MB/s, the problem could only be
> the array, the controller or the FC connection.
I think you're getting confused.
There are many factors in how dd performs. Using an OS-managed
volume will result in non-blocking I/O, of which dd will scream.
Especially when the OS knows it's merely just copying one block to
another, unlike the FC array, and doesn't need to recalculate the
parity block. I know software RAID proponents like to show those
numbers, but they are beyond removed from "real world," they
literally leverage the fact that parity doesn't need to be
recalculated for the blocks moved.
You need to benchmark from your application -- e.g., clients. If you
want "raw" disk access benchmarks, then build a software RAID volume
with a massive number of SATA channels using "dumb" SATA ASICs.
Don't even use an intelligent hardware RAID card in JBOD mode, that
will only slow the DTR.
> Given that other setups are similar and not using different
> controllers or stripe layouts.
Again, benchmark from your application -- e.g., clients. Everything
else means squat.
I cannot stress this enough. The only way I can show otherwise, is
with hardware taps (e.g., PCI-X, PCIe). I literally couldn't explain
"well enough" to one client, who was only getting 60MBps and seeing only
10% CPU utilization, why their software RAID was the bottleneck, until
I put in a PCI-X card and showed the amount of traffic on the bus.
And even that wasn't the system interconnect (although it should be
possible with a HTX card on an AMD solution, although the card would
probably cost 5 figures and have some limits).
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 16:48 ` Justin Piszcz
@ 2007-09-25 18:00 ` Bryan J. Smith
2007-09-25 18:33 ` Ralf Gross
2007-09-25 23:38 ` Justin Piszcz
0 siblings, 2 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-25 18:00 UTC (permalink / raw)
To: Justin Piszcz, Ralf Gross; +Cc: linux-xfs
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Just out of curiosity, have you tried SW RAID 5 on this array?
> Also what do you get if you use RAID0 (hw or sw)?
According to him, if I read it correctly, it is an external FC RAID-5
chassis. I.e., all of the logic is in the chassis. So your question
is N/A.
Although I'm more than ready to be proven incorrect.
Furthermore, what benchmark do you use? If dd on the volume itself,
software RAID wins, hands down. Doesn't matter what size you give
it, it literally copies (and doesn't recalculate) the parity. It's
the rawest form of non-blocking I/O, and uses virtually no system
interconnect to the CPU (just pushes disk-mem-disk).
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 18:00 ` Bryan J. Smith
@ 2007-09-25 18:33 ` Ralf Gross
2007-09-25 23:38 ` Justin Piszcz
1 sibling, 0 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 18:33 UTC (permalink / raw)
To: linux-xfs
Bryan J. Smith wrote:
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > Just out of curiosity, have you tried SW RAID 5 on this array?
> > Also what do you get if you use RAID0 (hw or sw)?
>
> According to him, if I read it correctly, it is an external FC RAID-5
> chassis. I.e., all of the logic is in the chassis. So your question
> is N/A.
>
> Although I'm more than ready to be proven incorrect.
No, you're right. It's an external chassis with an FC connection to the
server.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 17:41 ` Bryan J. Smith
@ 2007-09-25 19:13 ` Ralf Gross
2007-09-25 20:23 ` Bryan J. Smith
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-25 19:13 UTC (permalink / raw)
To: linux-xfs
Bryan J. Smith wrote:
> > Our profile is not that performance driven, thus the ~200MB/s
> > read/write performace is ok. We just need cheap storage ;)
>
> For what application? That is the question. I mean, sustained
> software RAID-5 writes can be a PITA. E.g., the dd example prior
> doesn't even do XOR recalculation, it merely copies the existing
> parity block with data. Doing sustained software RAID-5 writes can
> easily drop under 50MBps, as the PC interconnect was not designed to
> stream data (programmed I/O), only direct it (Direct Memory Access).
The server should be able to provide five 17MB/s streams (5 Windows
clients). Each file is ~2GB large. The clients will access the
data via smb/cifs; I think the main bottleneck will be Samba.
There will not be much write access; the data that will later be
streamed to the clients is transferred to the server from the Windows
clients first. The files will not be changed afterwards. So there will
be weeks where no data is written, and some days where several TB will be
transferred to the server within 48 hours.
Furthermore, the Windows clients read the data from external USB/PCIe SATA
drives. Sometimes a client transfers the data from an external
enclosure with 5 drives (no RAID) to the server. That will also be a
limiting factor.
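Put in numbers, the stated requirement is modest next to the measured array throughput (a quick sketch):

```shell
# Aggregate read demand vs. measured sequential read rate.
CLIENTS=5
STREAM_MBPS=17
DEMAND=$(( CLIENTS * STREAM_MBPS ))   # aggregate client demand
MEASURED=200                          # ~200 MB/s from dd/tiobench earlier
echo "demand ${DEMAND} MB/s of ~${MEASURED} MB/s available"
```

At ~85 MB/s aggregate demand, the 16x array has more than twice the needed headroom before Samba or the network even enter the picture.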
> > Still I'm wondering how other people saturate a 4 Gb FC controller
> > with one single RAID 5. At least that's what I've seen in some
> > benchmarks and here on the list.
>
> Depends on the solution, the benchmark, etc...
I've seen benchmark results from 3ware, areca and other hw raid 5
vendors (bonnie++, tiobench).
> > If dd doesn't give me more than 200MB/s, the problem could only be
> > the array, the controller or the FC connection.
>
> I think you're getting confused.
>
> There are many factors in how dd performs. Using an OS-managed
> volume will result in non-blocking I/O, of which dd will scream.
> Especially when the OS knows it's merely just copying one block to
> another, unlike the FC array, and doesn't need to recalculate the
> parity block. I know software RAID proponents like to show those
> numbers, but they are beyond removed from "real world," they
> literally leverage the fact that parity doesn't need to be
> recalculated for the blocks moved.
>
> You need to benchmark from your application -- e.g., clients. If you
> want "raw" disk access benchmarks, then build a software RAID volume
> with a massive number of SATA channels using "dumb" SATA ASICs.
> Don't even use an intelligent hardware RAID card in JBOD mode, that
> will only slow the DTR.
>
> Given that other setups are similar and not using different
> controllers or stripe layouts.
>
> Again, benchmark from your application -- e.g., clients. Everything
> else means squat.
>
> I cannot stress this enough. The only way I can show otherwise, is
> with hardware taps (e.g., PCI-X, PCIe). I literally couldn't explain
> "well enough" to one client, who was only getting 60MBps and seeing only
> 10% CPU utilization, why their software RAID was the bottleneck, until
> I put in a PCI-X card and showed the amount of traffic on the bus.
> And even that wasn't the system interconnect (although it should be
> possible with a HTX card on an AMD solution, although the card would
> probably cost 5 figures and have some limits).
Maybe I'm just confused by the benchmarks I found on the net, and my
200MB/s seq. read/write with tiobench is perfectly ok.
@Justin Piszcz: could you provide some tiobench numbers for your sw
raid 5?
Ralf
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 19:13 ` Ralf Gross
@ 2007-09-25 20:23 ` Bryan J. Smith
0 siblings, 0 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-25 20:23 UTC (permalink / raw)
To: Ralf Gross, linux-xfs
Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> The server should be able to provide five 17MB/s streams (5 win
> clients). Each file is ~2GB large. The clients will access the
> data with smb/cifs, I think the main bottleneck will be samba.
So it's largely read-only SMB access?
So we're talking ...
- Largely read-only disk access
- Largely server TX-only TCP/IP serving
Read-only is cheap, and software RAID-5 is essentially RAID-0 (sans
one disc). So software RAID-5 is just fine there (assuming there are
no volume addressing limitations).
Server TX is also cheap, most commodity server NICs (i.e., even those
built into mainboards, or typical dual-MAC 96-128KiB SRAM unified
buffer) have a TX TCP Off-load Engine (TOE), some even with Linux
driver support.
You don't need any hardware accelerated RAID or RX TOE (which is far,
far more expensive than TX TOE, largely for receive buffer and
processing).
> Furthermore, the win clients read the data from external USB/PCIe
> SATA drives.
Ouch. But I won't go there. ;)
> Sometimes the clients transfer the data from an external
> enclosure with 5 drives (no raid) to the server. They will also be a
> limiting factor.
Ouch. But I won't go there. ;)
> I've seen benchmark results from 3ware, areca and other hw raid 5
> vendors (bonnie++, tiobench).
Bonnie++ is really only good for NFS mounts from multiple clients to
a server, and then it will vary. Aggregate, median, etc... studies
are required.
> Maybe I'm just confused by the benchmarks I found on the
> net and my 200MB/s seq. read/write with tiobench are
> perfectly ok.
I've striped RAID-0 over two, RAID-10 volumes on old 3Ware Escalade
8500-8LP series products over two PCI-X (66MHz) busses and reached
close to 400MBps reads, and over 200MBps writes. And that was old
ASIC+SRAM (only 4MB) technology in the Escalade 8500 series, not even
native SATA (PATA with SATA PHY).
But I wouldn't get even close to that over the network, especially
not for SMB, unless I used a 4xGbE with a RX TOE and a layer-3
switch.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 18:00 ` Bryan J. Smith
2007-09-25 18:33 ` Ralf Gross
@ 2007-09-25 23:38 ` Justin Piszcz
2007-09-26 8:23 ` Ralf Gross
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-25 23:38 UTC (permalink / raw)
To: b.j.smith; +Cc: Ralf Gross, linux-xfs
On Tue, 25 Sep 2007, Bryan J. Smith wrote:
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>> Just out of curiosity have you tried SW RAID5 on this array?
>> Also what do you get if you use RAID0 (hw or sw)?
>
> According to him, if I read it correctly, it is an external FC RAID-5
> chassis. I.e., all of the logic is in the chassis. So your question
> is N/A.
>
> Although I'm more than ready to be proven incorrect.
>
> Furthermore, what benchmark do you use? If dd on the volume itself,
> software RAID wins, hands down. Doesn't matter what size you give
> it, it literally copies (and doesn't recalculate) the parity. It's
> the rawest form of non-blocking I/O, and uses virtually no system
> interconnect to the CPU (just pushes disk-mem-disk).
>
> --
> Bryan J. Smith Professional, Technical Annoyance
> b.j.smith@ieee.org http://thebs413.blogspot.com
> --------------------------------------------------
> Fission Power: An Inconvenient Solution
>
bonnie++, iozone, etc..
all show ~430-460 MiB/s write and ~550 MiB/s read
Justin.
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-25 23:38 ` Justin Piszcz
@ 2007-09-26 8:23 ` Ralf Gross
2007-09-26 8:42 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-26 8:23 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz schrieb:
>
>
> On Tue, 25 Sep 2007, Bryan J. Smith wrote:
>
> >Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> >>Just out of curiosity have you tried SW RAID5 on this array?
> >>Also what do you get if you use RAID0 (hw or sw)?
> >
> >According to him, if I read it correctly, it is an external FC RAID-5
> >chassis. I.e., all of the logic is in the chassis. So your question
> >is N/A.
> >
> >Although I'm more than ready to be proven incorrect.
> >
> >Furthermore, what benchmark do you use? If dd on the volume itself,
> >software RAID wins, hands down. Doesn't matter what size you give
> >it, it literally copies (and doesn't recalculate) the parity. It's
> >the rawest form of non-blocking I/O, and uses virtually no system
> >interconnect to the CPU (just pushes disk-mem-disk).
>
> bonnie++, iozone, etc..
>
> all show ~430-460 MiB/s write and ~550 MiB/s read
I'm happy :) I was able to boost the read performance by setting
blockdev --setra 16384 /dev/sdc
I knew this parameter was necessary for 3ware controllers, but I
hadn't noticed any difference with areca controllers or ext. raids yet.
The write performance may not be ideal, but these read numbers make much
more sense now because the FC controller is the limiting factor.
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 391.20 50.24% 0.019 43.01 0.00000 0.00000 779
20000 4096 2 387.79 92.22% 0.040 278.71 0.00000 0.00000 420
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 2.87 0.698% 2.720 27.47 0.00000 0.00000 411
20000 4096 2 4.37 2.013% 3.473 47.35 0.00000 0.00000 217
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 189.23 42.73% 0.029 5670.66 0.00014 0.00000 443
20000 4096 2 173.92 84.93% 0.064 4590.56 0.00029 0.00000 205
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
----- ----- --- ------ ------ --------- ----------- -------- -------- -----
20000 4096 1 1.85 0.662% 0.011 0.05 0.00000 0.00000 279
20000 4096 2 1.68 0.772% 0.012 0.05 0.00000 0.00000 217
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 8:23 ` Ralf Gross
@ 2007-09-26 8:42 ` Justin Piszcz
2007-09-26 8:49 ` Ralf Gross
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 8:42 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Wed, 26 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>>
>>
>> On Tue, 25 Sep 2007, Bryan J. Smith wrote:
>>
>>> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>>> Just out of curiosity have you tried SW RAID5 on this array?
>>>> Also what do you get if you use RAID0 (hw or sw)?
>>>
>>> According to him, if I read it correctly, it is an external FC RAID-5
>>> chassis. I.e., all of the logic is in the chassis. So your question
>>> is N/A.
>>>
>>> Although I'm more than ready to be proven incorrect.
>>>
>>> Furthermore, what benchmark do you use? If dd on the volume itself,
>>> software RAID wins, hands down. Doesn't matter what size you give
>>> it, it literally copies (and doesn't recalculate) the parity. It's
>>> the rawest form of non-blocking I/O, and uses virtually no system
>>> interconnect to the CPU (just pushes disk-mem-disk).
>>
>> bonnie++, iozone, etc..
>>
>> all show ~430-460 MiB/s write and ~550 MiB/s read
>
> I'm happy :) I was able to boost the read performance by setting
>
> blockdev --setra 16384 /dev/sdc
>
> I knew this parameter was necessary for 3ware controllers, but I
> hadn't noticed any difference with areca controllers or ext. raids yet.
>
> The write performance may not be ideal, but these read numbers make much
> more sense now because the FC controller is the limiting factor.
>
> Sequential Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 391.20 50.24% 0.019 43.01 0.00000 0.00000 779
> 20000 4096 2 387.79 92.22% 0.040 278.71 0.00000 0.00000 420
>
> Random Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 2.87 0.698% 2.720 27.47 0.00000 0.00000 411
> 20000 4096 2 4.37 2.013% 3.473 47.35 0.00000 0.00000 217
>
> Sequential Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 189.23 42.73% 0.029 5670.66 0.00014 0.00000 443
> 20000 4096 2 173.92 84.93% 0.064 4590.56 0.00029 0.00000 205
>
> Random Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ----- ----- --- ------ ------ --------- ----------- -------- -------- -----
> 20000 4096 1 1.85 0.662% 0.011 0.05 0.00000 0.00000 279
> 20000 4096 2 1.68 0.772% 0.012 0.05 0.00000 0.00000 217
>
>
>
>
> Ralf
>
>
What was the command line you used for that output?
tiobench.. ?
Justin.
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 8:42 ` Justin Piszcz
@ 2007-09-26 8:49 ` Ralf Gross
2007-09-26 9:52 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-26 8:49 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz schrieb:
> What was the command line you used for that output?
> tiobench.. ?
tiobench --numruns 3 --threads 1 --threads 2 --block 4096 --size 20000
--size 20000 because the server has 16 GB RAM.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 8:49 ` Ralf Gross
@ 2007-09-26 9:52 ` Justin Piszcz
2007-09-26 15:03 ` Bryan J Smith
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 9:52 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs, linux-raid
On Wed, 26 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>> What was the command line you used for that output?
>> tiobench.. ?
>
> tiobench --numruns 3 --threads 1 --threads 2 --block 4096 --size 20000
>
> --size 20000 because the server has 16 GB RAM.
>
> Ralf
>
>
Here is my output on my SW RAID5; keep in mind it is currently in use, so the numbers are a little slower than they probably should be:
My machine only has 8 GiB of memory but I used the same command you did:
This is with the 2.6.22.6 kernel; 2.6.23-rcX (and the final release, when out) is supposed to have the SW RAID5 accelerator code, correct?
Unit information
================
File size = megabytes
Blk Size = bytes
Rate = megabytes per second
CPU% = percentage of CPU used during the test
Latency = milliseconds
Lat% = percent of requests that took longer than X seconds
CPU Eff = Rate divided by CPU% - throughput per cpu load
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 523.01 45.79% 0.022 510.77 0.00000 0.00000 1142
2.6.22.6 20000 4096 2 501.29 85.84% 0.046 855.59 0.00000 0.00000 584
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 0.90 0.276% 13.003 74.41 0.00000 0.00000 326
2.6.22.6 20000 4096 2 1.61 1.167% 14.443 126.43 0.00000 0.00000 137
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 363.46 75.72% 0.030 2757.45 0.00000 0.00000 480
2.6.22.6 20000 4096 2 394.45 287.9% 0.056 2798.92 0.00000 0.00000 137
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 3.16 1.752% 0.011 1.02 0.00000 0.00000 180
2.6.22.6 20000 4096 2 3.07 3.769% 0.013 0.10 0.00000 0.00000 82
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-23 12:56 ` Peter Grandi
@ 2007-09-26 14:54 ` Ralf Gross
2007-09-26 16:27 ` [UNSURE] " Justin Piszcz
2007-09-27 15:22 ` Ralf Gross
0 siblings, 2 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-26 14:54 UTC (permalink / raw)
To: linux-xfs
Peter Grandi schrieb:
> Ralf> Hi, we have a new large raid array, the shelf has 48 disks,
> Ralf> the max. amount of disks in a single raid 5 set is 16.
>
> Too bad about that petty limitation ;-).
Yeah, I prefer 24x RAID 5 without spare. Why waste so much space ;)
After talking to the people that own the data and wanted to use as
much of the device's space as possible, we'll start with four 12/11
disk RAID 6 volumes (47 disks + 1 spare). That's ~12% less space than
before with five RAID 5 volumes. I think this is a good compromise
between safety and max. usable disk space.
There's only one point left: will the RAID 6 be able to deliver 2-3
streams of 17MB/s during a rebuild? Write performance is not the issue
then, but the clients will be running simulations for up to 5 days and
need this (more or less) constant data rate. Now that I'm getting
~400 MB/s (which is limited by the FC controller) this should be
possible.
> Ralf> There will be one global spare disk, thus we have two raid 5
> Ralf> with 15 data disks and one with 14 data disk.
>
> Ahhh a positive-thinking, can-do, brave design ;-).
We have a 60 slot tape lib too (well, we'll have next week...I hope).
I know that raid != backup.
> [ ... ]
> Ralf> Each client then needs a data stream of about 17 MB/s
> Ralf> (max. 5 clients are expected to access the data in parallel).
>
> Do the requirements include as features some (possibly several)
> hours of ''challenging'' read performance if any disk fails or
> total loss of data if another disks fails during that time? ;-)
The data then is still on USB disks and on tape. Maybe I'll pull out
a disk of one of the new RAID 6 volumes and see how much the read
performance drops. At the moment only one test bed is active, thus 17
MB/s would be ok. Later with 5 test beds 5 x 17 MB/s are needed (if
they are online at the same time).
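A rough sketch of that degraded-read test (the device name /dev/sdc is assumed from earlier in the thread, and the 10 GiB count is illustrative): a plain sequential read while the array rebuilds, compared against the required aggregate of 5 x 17 MB/s = 85 MB/s:

```shell
# Sequential-read check while the array is degraded/rebuilding.
# Reading 10 GiB with O_DIRECT keeps the page cache out of the measurement.
dd if=/dev/sdc of=/dev/null bs=1M count=10240 iflag=direct

# Required aggregate rate for 5 clients at 17 MB/s each:
echo $((5 * 17))   # prints 85 (MB/s)
```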
> IIRC Google have reported 5% per year disk failure rates across a
> very wide mostly uncorrelated population, you have 48 disks,
> perhaps 2-3 disks per year will fail. Perhaps more, and more often,
> because they will likely be all from the same manufacturer, model,
> batch and spinning in the same environment.
Hey, these are ENTERPRISE disks ;) As far as I know, we couldn't even
use other disks than the ones that the manufacturer provides (modified
firmware?).
> Ralf> [ ... ] I expect the fs, each will have a size of 10-11 TB,
> Ralf> to be filled > 90%. I know this is not ideal, but we need
> Ralf> every GB we can get.
>
> That "every GB we can get" is often the key in ''wide RAID5''
> stories. Cheap as well as fast and safe, you can have it all with
> wide RAID5 setups, so the salesmen would say ;-).
I think we now have found a reasonable solution.
> Ralf> [ ... ] Stripe Size : 960 KB (15 x 64 KB)
> Ralf> [ ... ] Stripe Size : 896 KB (14 x 64 KB)
>
> Pretty long stripes, I wonder what happens when a whole stripe
> cannot be written at once or it can but is not naturally aligned
> ;-).
I'm still confused by the chunk/stripe and block size values. The
block size of the HW RAID is fixed to 512 bytes, which I think is a bit
small.
Also, I first thought that larger chunk/stripe sizes (HW RAID) would
waste disk space, but since the OS/FS doesn't necessarily know about
those values, that can't be true - unlike the FS block size, which
defines the smallest possible allocation.
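For the arrays described in this thread (64 KB chunks, 14 or 15 data disks), the XFS stripe hints would look roughly like this; a sketch only, with /dev/sdc and the disk count assumed from earlier messages:

```shell
# 14-data-disk volume: stripe unit (su) = RAID chunk size, stripe width
# (sw) = number of data disks, so the full stripe is 14 * 64 KB = 896 KB.
mkfs.xfs -d su=64k,sw=14 /dev/sdc

# Sanity check: full stripe size in KB.
echo $((14 * 64))   # prints 896
```

For the 15-data-disk RAID 5 sets the same sketch would use sw=15 (960 KB full stripe).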
> Ralf> [ ... ] about 150 MB/s in seq. writing
>
> Surprise surprise ;-).
>
> Ralf> (tiobench) and 160 MB/s in seq. reading.
>
> This is sort of low. If there's one thing that RAID5 can do sort of
> OK, it's reads (if there are no faults). I'd look at the underlying
> storage system and the maximum performance that you can get out of
> a single disk.
/sbin/blockdev --setra 16384 /dev/sdc
was the key to ~400 MB/s read performance.
> I have seen a 45-drive 500GB storage subsystem where each drive
> can deliver at most 7-10MB/s (even if the same disk standalone in
> an ordinary PC can do 60-70MB/s), and the supplier actually claims
> so in their published literature (that RAID product is meant to
> compete *only* with tape backup subsystems). Your later comment
> that "The raid array is connect to the server by fibre channel"
> makes me suspect that it may be the same brand.
>
> Ralf> This is ok,
>
> As the total aggregate requirement is 5x17MB/s this is probably
> the case [as long as there are no drive failures ;-)].
>
> Ralf> but I'm curious what I could get with tuned xfs parameters.
>
> Looking at the archives of this mailing list the topic ''good mkfs
> parameters'' reappears frequently, even if usually for smaller
> arrays, as many have yet to discover the benefits of 15-wide RAID5
> setups ;-). Threads like these may help:
>
> http://OSS.SGI.com/archives/xfs/2007-01/msg00079.html
> http://OSS.SGI.com/archives/xfs/2007-05/msg00051.html
I've seen some of JP's postings before. I couldn't get much more
performance with the sw/su options; I got the best results with the
default values. But I haven't tried external logs yet.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 9:52 ` Justin Piszcz
@ 2007-09-26 15:03 ` Bryan J Smith
2007-09-26 15:15 ` Ralf Gross
2007-09-26 16:24 ` Justin Piszcz
0 siblings, 2 replies; 48+ messages in thread
From: Bryan J Smith @ 2007-09-26 15:03 UTC (permalink / raw)
To: Justin Piszcz, xfs-bounce, Ralf Gross; +Cc: linux-xfs, linux-raid
Everyone can play local benchmarking games all they want,
and software RAID will almost always be faster, significantly at times.
What matters is actual, multiple-client performance under full load.
Anything less is completely irrelevant.
--
Bryan J Smith - mailto:b.j.smith@ieee.org
http://thebs413.blogspot.com
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
Date: Wed, 26 Sep 2007 05:52:39
To:Ralf Gross <Ralf-Lists@ralfgross.de>
Cc:linux-xfs@oss.sgi.com, linux-raid@vger.kernel.org
Subject: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
On Wed, 26 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>> What was the command line you used for that output?
>> tiobench.. ?
>
> tiobench --numruns 3 --threads 1 --threads 2 --block 4096 --size 20000
>
> --size 20000 because the server has 16 GB RAM.
>
> Ralf
>
>
Here is my output on my SW RAID5; keep in mind it is currently in use, so the numbers are a little slower than they probably should be:
My machine only has 8 GiB of memory but I used the same command you did:
This is with the 2.6.22.6 kernel; 2.6.23-rcX (and the final release, when out) is supposed to have the SW RAID5 accelerator code, correct?
Unit information
================
File size = megabytes
Blk Size = bytes
Rate = megabytes per second
CPU% = percentage of CPU used during the test
Latency = milliseconds
Lat% = percent of requests that took longer than X seconds
CPU Eff = Rate divided by CPU% - throughput per cpu load
Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 523.01 45.79% 0.022 510.77 0.00000 0.00000 1142
2.6.22.6 20000 4096 2 501.29 85.84% 0.046 855.59 0.00000 0.00000 584
Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 0.90 0.276% 13.003 74.41 0.00000 0.00000 326
2.6.22.6 20000 4096 2 1.61 1.167% 14.443 126.43 0.00000 0.00000 137
Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 363.46 75.72% 0.030 2757.45 0.00000 0.00000 480
2.6.22.6 20000 4096 2 394.45 287.9% 0.056 2798.92 0.00000 0.00000 137
Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.22.6 20000 4096 1 3.16 1.752% 0.011 1.02 0.00000 0.00000 180
2.6.22.6 20000 4096 2 3.07 3.769% 0.013 0.10 0.00000 0.00000 82
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 15:03 ` Bryan J Smith
@ 2007-09-26 15:15 ` Ralf Gross
2007-09-26 17:08 ` Bryan J. Smith
2007-09-26 16:24 ` Justin Piszcz
1 sibling, 1 reply; 48+ messages in thread
From: Ralf Gross @ 2007-09-26 15:15 UTC (permalink / raw)
To: linux-xfs
Bryan J Smith schrieb:
> Everyone can play local benchmarking games all they want,
> and software RAID will almost always be faster, significantly at times.
>
> What matters is actual, multiple-client performance under full load.
> Anything less is completely irrelevant.
You're right, but these benchmarks help to find simple failures or
misconfigurations at an earlier stage of the process.
Ralf
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 15:03 ` Bryan J Smith
2007-09-26 15:15 ` Ralf Gross
@ 2007-09-26 16:24 ` Justin Piszcz
2007-09-26 17:11 ` Bryan J. Smith
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 16:24 UTC (permalink / raw)
To: Bryan J Smith; +Cc: xfs-bounce, Ralf Gross, linux-xfs, linux-raid
I have a question: when I use multiple writer threads (2 or 3) I see
550-600 MiB/s write speed (vmstat), but when using only 1 thread, ~420-430
MiB/s... Also, without tweaking, SW RAID is very slow (180-200 MiB/s) using
the same disks.
Justin.
On Wed, 26 Sep 2007, Bryan J Smith wrote:
> Everyone can play local benchmarking games all they want,
> and software RAID will almost always be faster, significantly at times.
>
> What matters is actual, multiple-client performance under full load.
> Anything less is completely irrelevant.
> --
> Bryan J Smith - mailto:b.j.smith@ieee.org
> http://thebs413.blogspot.com
> Sent via BlackBerry from T-Mobile
>
>
> -----Original Message-----
> From: Justin Piszcz <jpiszcz@lucidpixels.com>
>
> Date: Wed, 26 Sep 2007 05:52:39
> To:Ralf Gross <Ralf-Lists@ralfgross.de>
> Cc:linux-xfs@oss.sgi.com, linux-raid@vger.kernel.org
> Subject: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
>
>
>
>
> On Wed, 26 Sep 2007, Ralf Gross wrote:
>
>> Justin Piszcz schrieb:
>>> What was the command line you used for that output?
>>> tiobench.. ?
>>
>> tiobench --numruns 3 --threads 1 --threads 2 --block 4096 --size 20000
>>
>> --size 20000 because the server has 16 GB RAM.
>>
>> Ralf
>>
>>
>
> Here is my output on my SW RAID5; keep in mind it is currently in use, so the numbers are a little slower than they probably should be:
>
> My machine only has 8 GiB of memory but I used the same command you did:
>
> This is with the 2.6.22.6 kernel; 2.6.23-rcX (and the final release, when out) is supposed to have the SW RAID5 accelerator code, correct?
>
> Unit information
> ================
> File size = megabytes
> Blk Size = bytes
> Rate = megabytes per second
> CPU% = percentage of CPU used during the test
> Latency = milliseconds
> Lat% = percent of requests that took longer than X seconds
> CPU Eff = Rate divided by CPU% - throughput per cpu load
>
> Sequential Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
> 2.6.22.6 20000 4096 1 523.01 45.79% 0.022 510.77 0.00000 0.00000 1142
> 2.6.22.6 20000 4096 2 501.29 85.84% 0.046 855.59 0.00000 0.00000 584
>
> Random Reads
> File Blk Num Avg Maximum Lat% Lat% CPU
> Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
> 2.6.22.6 20000 4096 1 0.90 0.276% 13.003 74.41 0.00000 0.00000 326
> 2.6.22.6 20000 4096 2 1.61 1.167% 14.443 126.43 0.00000 0.00000 137
>
> Sequential Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
> 2.6.22.6 20000 4096 1 363.46 75.72% 0.030 2757.45 0.00000 0.00000 480
> 2.6.22.6 20000 4096 2 394.45 287.9% 0.056 2798.92 0.00000 0.00000 137
>
> Random Writes
> File Blk Num Avg Maximum Lat% Lat% CPU
> Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
> ---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
> 2.6.22.6 20000 4096 1 3.16 1.752% 0.011 1.02 0.00000 0.00000 180
> 2.6.22.6 20000 4096 2 3.07 3.769% 0.013 0.10 0.00000 0.00000 82
>
>
>
>
>
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 14:54 ` Ralf Gross
@ 2007-09-26 16:27 ` Justin Piszcz
2007-09-26 16:54 ` Ralf Gross
2007-09-27 15:22 ` Ralf Gross
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 16:27 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Wed, 26 Sep 2007, Ralf Gross wrote:
> Peter Grandi schrieb:
>> Ralf> Hi, we have a new large raid array, the shelf has 48 disks,
>> Ralf> the max. amount of disks in a single raid 5 set is 16.
>>
>> Too bad about that petty limitation ;-).
>
> Yeah, I prefer 24x RAID 5 without spare. Why waste so much space ;)
>
> After talking to the people that own the data and wanted to use as
> much of the device's space as possible, we'll start with four 12/11
> disk RAID 6 volumes (47 disks + 1 spare). That's ~12% less space than
> before with five RAID 5 volumes. I think this is a good compromise
> between safety and max. usable disk space.
>
> There's only one point left: will the RAID 6 be able to deliver 2-3
> streams of 17MB/s during a rebuild? Write performance is not the issue
> then, but the clients will be running simulations for up to 5 days and
> need this (more or less) constant data rate. Now that I'm getting
> ~400 MB/s (which is limited by the FC controller) this should be
> possible.
>
>> Ralf> There will be one global spare disk, thus we have two raid 5
>> Ralf> with 15 data disks and one with 14 data disk.
>>
>> Ahhh a positive-thinking, can-do, brave design ;-).
>
> We have a 60 slot tape lib too (well, we'll have next week...I hope).
> I know that raid != backup.
>
>> [ ... ]
>> Ralf> Each client then needs a data stream of about 17 MB/s
>> Ralf> (max. 5 clients are expected to access the data in parallel).
>>
>> Do the requirements include as features some (possibly several)
>> hours of ''challenging'' read performance if any disk fails or
>> total loss of data if another disks fails during that time? ;-)
>
> The data then is still on USB disks and on tape. Maybe I'll pull out
> a disk of one of the new RAID 6 volumes and see how much the read
> performance drops. At the moment only one test bed is active, thus 17
> MB/s would be ok. Later with 5 test beds 5 x 17 MB/s are needed (if
> they are online at the same time).
>
>> IIRC Google have reported 5% per year disk failure rates across a
>> very wide mostly uncorrelated population, you have 48 disks,
>> perhaps 2-3 disks per year will fail. Perhaps more, and more often,
>> because they will likely be all from the same manufacturer, model,
>> batch and spinning in the same environment.
>
> Hey, these are ENTERPRISE disks ;) As far as I know, we couldn't even
> use other disks than the ones that the manufacturer provides (modified
> firmware?).
>
>> Ralf> [ ... ] I expect the fs, each will have a size of 10-11 TB,
>> Ralf> to be filled > 90%. I know this is not ideal, but we need
>> Ralf> every GB we can get.
>>
>> That "every GB we can get" is often the key in ''wide RAID5''
>> stories. Cheap as well as fast and safe, you can have it all with
>> wide RAID5 setups, so the salesmen would say ;-).
>
> I think we now have found a reasonable solution.
>
>> Ralf> [ ... ] Stripe Size : 960 KB (15 x 64 KB)
>> Ralf> [ ... ] Stripe Size : 896 KB (14 x 64 KB)
>>
>> Pretty long stripes, I wonder what happens when a whole stripe
>> cannot be written at once or it can but is not naturally aligned
>> ;-).
>
> I'm still confused by the chunk/stripe and block size values. The
> block size of the HW RAID is fixed to 512 bytes, which I think is a bit
> small.
>
> Also, I first thought that larger chunk/stripe sizes (HW RAID) would
> waste disk space, but since the OS/FS doesn't necessarily know about
> those values, that can't be true - unlike the FS block size, which
> defines the smallest possible allocation.
>
>> Ralf> [ ... ] about 150 MB/s in seq. writing
>>
>> Surprise surprise ;-).
>>
>> Ralf> (tiobench) and 160 MB/s in seq. reading.
>>
>> This is sort of low. If there's one thing that RAID5 can do sort of
>> OK, it's reads (if there are no faults). I'd look at the underlying
>> storage system and the maximum performance that you can get out of
>> a single disk.
>
> /sbin/blockdev --setra 16384 /dev/sdc
>
> was the key to ~400 MB/s read performance.
>
>> I have seen a 45-drive 500GB storage subsystem where each drive
>> can deliver at most 7-10MB/s (even if the same disk standalone in
>> an ordinary PC can do 60-70MB/s), and the supplier actually claims
>> so in their published literature (that RAID product is meant to
>> compete *only* with tape backup subsystems). Your later comment
>> that "The raid array is connect to the server by fibre channel"
>> makes me suspect that it may be the same brand.
>>
>> Ralf> This is ok,
>>
>> As the total aggregate requirement is 5x17MB/s this is probably
>> the case [as long as there are no drive failures ;-)].
>>
>> Ralf> but I'm curious what I could get with tuned xfs parameters.
>>
>> Looking at the archives of this mailing list the topic ''good mkfs
>> parameters'' reappears frequently, even if usually for smaller
>> arrays, as many have yet to discover the benefits of 15-wide RAID5
>> setups ;-). Threads like these may help:
>>
>> http://OSS.SGI.com/archives/xfs/2007-01/msg00079.html
>> http://OSS.SGI.com/archives/xfs/2007-05/msg00051.html
>
> I've seen some of JP's postings before. I couldn't get much more
> performance with the sw/su options; I got the best results with the
> default values. But I haven't tried external logs yet.
>
> Ralf
>
>
/sbin/blockdev --setra 16384 /dev/sdc
was the key to ~400 MB/s read performance.
Nice, what do you get for write speed?
Justin.
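For reference, the readahead value above works out like this (a
sketch; `blockdev --setra` takes 512-byte sectors, so 16384 sectors is
8 MiB, and the blockdev lines are only echoed here - run them as root
on the real device, /dev/sdc in this thread):

```shell
# Compute an 8 MiB readahead in 512-byte sectors, as used in this thread.
ra_mib=8
ra_sectors=$((ra_mib * 1024 * 1024 / 512))     # 16384 sectors = 8 MiB
echo "blockdev --setra $ra_sectors /dev/sdc"   # set readahead
echo "blockdev --getra /dev/sdc"               # verify the value
```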
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 16:27 ` [UNSURE] " Justin Piszcz
@ 2007-09-26 16:54 ` Ralf Gross
2007-09-26 16:59 ` Justin Piszcz
2007-09-26 17:13 ` [UNSURE] " Bryan J. Smith
0 siblings, 2 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-26 16:54 UTC (permalink / raw)
To: linux-xfs
Justin Piszcz schrieb:
> >
> > /sbin/blockdev --setra 16384 /dev/sdc
> >
> > was the key to ~400 MB/s read performance.
>
> Nice, what do you get for write speed?
Still 170-200 MB/s. The command above just tunes the read ahead value.
Ralf
* Re: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 16:54 ` Ralf Gross
@ 2007-09-26 16:59 ` Justin Piszcz
2007-09-26 17:38 ` Bryan J. Smith
2007-09-26 17:13 ` [UNSURE] " Bryan J. Smith
1 sibling, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 16:59 UTC (permalink / raw)
To: Ralf Gross; +Cc: linux-xfs
On Wed, 26 Sep 2007, Ralf Gross wrote:
> Justin Piszcz schrieb:
>>>
>>> /sbin/blockdev --setra 16384 /dev/sdc
>>>
>>> was the key to ~400 MB/s read performance.
>>
>> Nice, what do you get for write speed?
>
> Still 170-200 MB/s. The command above just tunes the read ahead value.
>
> Ralf
>
>
Yes, I understand; what is the equivalent tweak for HW RAID? I have
tried to tweak some HW RAIDs (3ware 9550SX's) with ~10 drives: one can
set the read ahead for better reads, but writes are still slow. That
3ware 'tuning' doc always gets passed around, but it never helps much,
at least in my testing.
I wonder where the bottleneck lies.
Justin.
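For completeness, the Linux-side knobs people usually reach for on the
write path of a HW RAID (beyond the controller's own write-cache
setting) are the request queue depth and the I/O elevator. A sketch
only: the values are starting points, not recommendations, /dev/sdc is
the device name from this thread, and the commands are echoed rather
than executed:

```shell
dev=sdc   # device name from this thread; adjust to your system
# Commands are echoed only; run them as root on the real device.
# A deeper request queue gives the controller more writes to coalesce:
echo "echo 512 > /sys/block/$dev/queue/nr_requests"
# The deadline elevator often behaves better than cfq in front of a
# HW RAID controller that does its own reordering:
echo "echo deadline > /sys/block/$dev/queue/scheduler"
# Readahead (as discussed above) only helps the read side:
echo "blockdev --setra 16384 /dev/$dev"
```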
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 15:15 ` Ralf Gross
@ 2007-09-26 17:08 ` Bryan J. Smith
0 siblings, 0 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:08 UTC (permalink / raw)
To: Ralf Gross, linux-xfs
Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> You're right, but these benchmarks help to find simple failures or
> misconfigurations at an earlier stage of the process.
Yes, as long as you are comparing to a benchmark of a known, similar
quantity.
It's not uncommon for Linux's RAID-5 to be 2-3x faster at dd and
single-file operations, especially if there are no actual parity
operations.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 16:24 ` Justin Piszcz
@ 2007-09-26 17:11 ` Bryan J. Smith
0 siblings, 0 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:11 UTC (permalink / raw)
To: Justin Piszcz, Bryan J Smith
Cc: xfs-bounce, Ralf Gross, linux-xfs, linux-raid
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I have a question, when I use multiple writer threads (2 or 3) I
> see 550-600 MiB/s write speed (vmstat) but when using only 1
thread,
> ~420-430 MiB/s...
It's called scheduling buffer flushes, as well as the buffering
itself.
> Also without tweaking, SW RAID is very slow (180-200
> MiB/s) using the same disks.
But how much of that tweaking is actually just buffering?
That's a continued theme (and issue).
Unless you can force completely synchronous writes, you honestly
don't know. Using a test size larger than memory is not anywhere near
the same.
Plus it makes software RAID utterly n/a in comparison to hardware
RAID, where the driver is waiting until the commit to actual NVRAM or
disc is complete.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 16:54 ` Ralf Gross
2007-09-26 16:59 ` Justin Piszcz
@ 2007-09-26 17:13 ` Bryan J. Smith
2007-09-26 17:27 ` Justin Piszcz
1 sibling, 1 reply; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:13 UTC (permalink / raw)
To: Ralf Gross, linux-xfs
Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
> Still 170-200 MB/s. The command above just tunes the read ahead
> value.
Don't expect your commits to an external subsystem to be anywhere
near as fast as software RAID in simple disk benchmarks.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:13 ` [UNSURE] " Bryan J. Smith
@ 2007-09-26 17:27 ` Justin Piszcz
2007-09-26 17:35 ` Bryan J. Smith
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 17:27 UTC (permalink / raw)
To: b.j.smith; +Cc: Ralf Gross, linux-xfs
On Wed, 26 Sep 2007, Bryan J. Smith wrote:
> Ralf Gross <Ralf-Lists@ralfgross.de> wrote:
>> Still 170-200 MB/s. The command above just tunes the read ahead
>> value.
>
> Don't expect your commits to an external subsystem to be anywhere
> near as fast as software RAID in simple disk benchmarks.
>
>
> --
> Bryan J. Smith Professional, Technical Annoyance
> b.j.smith@ieee.org http://thebs413.blogspot.com
> --------------------------------------------------
> Fission Power: An Inconvenient Solution
>
>
So what tunables do the 9550/9650SE users utilize to achieve > 500 MiB/s
write on HW RAID5/6?
Justin.
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:27 ` Justin Piszcz
@ 2007-09-26 17:35 ` Bryan J. Smith
2007-09-26 17:37 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:35 UTC (permalink / raw)
To: Justin Piszcz, b.j.smith; +Cc: Ralf Gross, linux-xfs
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> So what tunables do the 9550/9650SE users utilize to achieve > 500
> MiB/s write on HW RAID5/6?
Don't know. But I've never claimed it was capable of it either.
At the same time, I've seen software RAID do over 500MBps, only to
drop to under 50MBps aggregate client DTR under load.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:35 ` Bryan J. Smith
@ 2007-09-26 17:37 ` Justin Piszcz
2007-09-26 17:38 ` Justin Piszcz
2007-09-26 17:49 ` Bryan J. Smith
0 siblings, 2 replies; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 17:37 UTC (permalink / raw)
To: Bryan J. Smith; +Cc: Ralf Gross, linux-xfs
On Wed, 26 Sep 2007, Bryan J. Smith wrote:
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>> So what tunables do the 9550/9650SE users utilize to achieve > 500
>> MiB/s write on HW RAID5/6?
>
> Don't know. But I've never claimed it was capable of it either.
>
> At the same time, I've seen software RAID do over 500MBps, only to
> drop to under 50MBps aggregate client DTR under load.
Do you have any type of benchmarks to similate the load you are
mentioning? What did HW RAID drop to when the same test was run with SW
RAID / 50 MBps under load? Did it achieve better performance due to an
on-board / raid-card controller cache, or?
Justin.
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:37 ` Justin Piszcz
@ 2007-09-26 17:38 ` Justin Piszcz
2007-09-26 17:49 ` Bryan J. Smith
1 sibling, 0 replies; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 17:38 UTC (permalink / raw)
To: Bryan J. Smith; +Cc: Ralf Gross, linux-xfs
On Wed, 26 Sep 2007, Justin Piszcz wrote:
>
>
> On Wed, 26 Sep 2007, Bryan J. Smith wrote:
>
>> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>> So what tunables do the 9550/9650SE users utilize to achieve > 500
>>> MiB/s write on HW RAID5/6?
>>
>> Don't know. But I've never claimed it was capable of it either.
>>
>> At the same time, I've seen software RAID do over 500MBps, only to
>> drop to under 50MBps aggregate client DTR under load.
>
> Do you have any type of benchmarks to similate the load you are mentioning?
> What did HW RAID drop to when the same test was run with SW RAID / 50 MBps
> under load? Did it achieve better performance due to an on-board / raid-card
> controller cache, or?
>
> Justin.
>
>
simulate* rather.
* Re: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 16:59 ` Justin Piszcz
@ 2007-09-26 17:38 ` Bryan J. Smith
2007-09-26 17:41 ` Justin Piszcz
0 siblings, 1 reply; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:38 UTC (permalink / raw)
To: Justin Piszcz, Ralf Gross; +Cc: linux-xfs
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I wonder where the bottleneck lies.
The microcontroller.
Listen, for the last time, hardware RAID is _not_ about raw
non-blocking I/O. Hardware RAID is about in-line XOR streaming
off-load, so the parity work doesn't tie up the system interconnect
(which is not an ideal use for it).
A hardware RAID card is for when you have other things going on in
your interconnect that you don't want the parity LOAD-XOR-STOR to
take away from what it could be using for the service.
It will _never_ have the "raw performance" of OS optimized software
RAID. At the same time, OS optimized software RAID's impact on the
system interconnect is one of those "unmeasurable" details _unless_
you actually benchmark your application.
I have repeatedly had issues with elementary UDP/IP NFS performance
when the PIO of software RAID is hogging the system interconnect.
Same deal for large numbers of large database record commits.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:38 ` Bryan J. Smith
@ 2007-09-26 17:41 ` Justin Piszcz
2007-09-26 17:55 ` Bryan J. Smith
0 siblings, 1 reply; 48+ messages in thread
From: Justin Piszcz @ 2007-09-26 17:41 UTC (permalink / raw)
To: b.j.smith; +Cc: Ralf Gross, linux-xfs
On Wed, 26 Sep 2007, Bryan J. Smith wrote:
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>> I wonder where the bottleneck lies.
>
> The microcontroller.
>
> Listen, for the last time, hardware RAID is _not_ for non-blocking
> I/O. Hardware RAID is for in-line XOR streaming off-load, so it
> doesn't tie up a system interconnect (which isn't an ideal use for
> it).
I agree, and this makes sense, but real-world loads make me wonder, at
least with the 2.4 kernel. I see hosts where no pure streaming takes
place; instead lots of little files are copied on and off the host, and
with the 2.4 kernel (RHEL3) the system 'feels' as if it were buried even
though the load is not that high (~9-15). This is ext3 on a 9- or
10-disk RAID5 with default RAID parameters on a 3ware 9550SX card.
Justin.
>
> A hardware RAID card is when you have other things going on in your
> interconnect that you don't want the parity LOAD-XOR-STOR to take
> away from what it could be using for the service.
>
> It will _never_ have the "raw performance" of OS optimized software
> RAID. At the same time, OS optimized software RAID's impact on the
> system interconnect is one of those "unmeasurable" details _unless_
> you actually benchmark your application.
>
> I have repeatedly had issues with elementary UDP/IP NFS performance
> when the PIO of software RAID is hogging the system interconnect.
> Same deal for large numbers of large database record commits.
Understood.
>
>
> --
> Bryan J. Smith Professional, Technical Annoyance
> b.j.smith@ieee.org http://thebs413.blogspot.com
> --------------------------------------------------
> Fission Power: An Inconvenient Solution
>
* Re: [UNSURE] Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:37 ` Justin Piszcz
2007-09-26 17:38 ` Justin Piszcz
@ 2007-09-26 17:49 ` Bryan J. Smith
1 sibling, 0 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:49 UTC (permalink / raw)
To: Justin Piszcz, Bryan J. Smith; +Cc: Ralf Gross, linux-xfs
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Do you have any type of benchmarks to similate the load you are
> mentioning?
Yes, write different, non-zero, 100GB data files from 30 NFSv3 sync
clients at the same time. You can easily script firing that off and
get the number of seconds it takes to commit.
Use NFS with UDP to avoid the overhead of TCP.
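The test described above can be sketched, heavily scaled down, like
this: each "client" writes its own non-zero data file with a
synchronous completion (conv=fsync), and the whole batch is timed. The
real test would run from 30 NFSv3 (sync, UDP) clients with 100 GB
files; here it is 4 local writers of 1 MiB each so the sketch runs
anywhere:

```shell
# Scaled-down local sketch of the parallel sync-write benchmark.
dir=$(mktemp -d)
start=$(date +%s)
for i in 1 2 3 4; do
  # Non-zero data, flushed to disk before dd exits (conv=fsync)
  dd if=/dev/urandom of="$dir/client$i.dat" bs=64k count=16 conv=fsync 2>/dev/null &
done
wait   # all "clients" finished committing
end=$(date +%s)
echo "4 writers finished in $((end - start)) seconds"
```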
> What did HW RAID drop to when the same test was run with SW
> RAID / 50 MBps under load?
I saw an aggregate commit average of around 150MBps using a pair of
8-channel 3Ware Escalade 9550SX cards (each on their own PCI-X bus),
with a LVM stripe between them. Understand the test literally took 5
hours to run!
The software RAID-50, with two "dumb" 8-channel Marvell SATA cards
(each on their own PCI-X bus) and a LVM stripe between them, had not
completed after 15 hours (overnight). So I finally terminated it.
Each system had a 4x GbE trunk to a layer-3 switch. I would have run
the same test with SMB TCP/IP, possibly with a LeWiz 4x GbE RX TOE
HBA, except I honestly didn't have the time to wait on it.
> Did it achieve better performance due to an on-board /
> raid-card controller cache, or?
Has nothing to do with cache. The OS is far better at scheduling and
buffering in the system RAM, in addition to the fact that it does an
async buffer, whereas many HW RAID drivers are sync to the NVRAM of
the HW RAID card (that's part of the problem with comparisons).
It has to do with the fact that in software RAID-5 you are streaming
100% of the data through the general system interconnect for the
LOAD-XOR-STO operation. XORs are extremely fast. LOAD/STO through a
general purpose CPU is not.
It's the same reason why we don't use general purpose CPUs for
layer-3 switches either, but a "core" CPU with NPE (network processor
engine) ASICs. Same deal with most HW RAID cards, a "core" CPU with
SPE ASICs -- for off-load from the general CPU system interconnect.
XORs are done "in-line" with the transfer, instead of hogging up the
system interconnect. It's the direct difference between PIO and DMA.
An in-line NPE/SPE ASIC basically acts like a DMA transfer,
real-time. A general purpose CPU and its interconnect cannot do
that, so it has all the issues of PIO.
PIO in a general purpose CPU is to be avoided at all costs when you
have other needs for the system interconnect, like I/O. If you don't
have much else bothering the I/O, like in a web server or read-only
system (where you're not doing the writes), then it doesn't matter,
and software RAID-5 is great.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 17:41 ` Justin Piszcz
@ 2007-09-26 17:55 ` Bryan J. Smith
0 siblings, 0 replies; 48+ messages in thread
From: Bryan J. Smith @ 2007-09-26 17:55 UTC (permalink / raw)
To: Justin Piszcz, b.j.smith; +Cc: Ralf Gross, linux-xfs
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I agree and this makes sense but in real-world loads it makes me
> wonder, at least with the 2.4 kernel.
It took 3Ware (actually AMCC) a good 18 months to get the firmware
and driver tuned well. I purposely avoided any new 3Ware solution
until a good 12-18 months after release for exactly that reason.
The 9550SX series was the first microcontroller approach (PowerPC 400
series) done by 3Ware. All of their prior designs were an older
64-bit ASIC design with SRAM (and only DRAM slapped on, poorly, in
the 9500S), which only worked well for RAID-0/1/10, not 5.
That was well into the 2.6 era. I'd say you're well out of date with
what 3Ware, let alone the Intel X-Scale-based Areca, are actually
capable of with RAID-5/6 now.
-- Bryan
P.S. Both AMD and Intel are currently putting serious R&D into the
first embedded x86 designs with added ASICs for Network, Storage,
etc... I.e., this is going to be mainstream shortly, as AMD got out
of 29000 long ago, and Intel is putting less and less focus on
IOP33x/34x X-Scale. NPE, SPE and other units can literally handle
DTRs that are 10x what a general CPU/interconnect LOAD-op-STOR can
do.
I.e., Don't be surprised when your 2009+ server mainboard ICH is
actually an embedded x86 processor with NPE and SPE units. That will
finally remove the whole "separate card" in general.
--
Bryan J. Smith Professional, Technical Annoyance
b.j.smith@ieee.org http://thebs413.blogspot.com
--------------------------------------------------
Fission Power: An Inconvenient Solution
* Re: mkfs options for a 16x hw raid5 and xfs (mostly large files)
2007-09-26 14:54 ` Ralf Gross
2007-09-26 16:27 ` [UNSURE] " Justin Piszcz
@ 2007-09-27 15:22 ` Ralf Gross
1 sibling, 0 replies; 48+ messages in thread
From: Ralf Gross @ 2007-09-27 15:22 UTC (permalink / raw)
To: linux-xfs
Ralf Gross schrieb:
> Peter Grandi schrieb:
> > Ralf> Hi, we have a new large raid array, the shelf has 48 disks,
> > Ralf> the max. amount of disks in a single raid 5 set is 16.
> >
> > Too bad about that petty limitation ;-).
>
> Yeah, I prefer 24x RAID 5 without spare. Why waste so much space ;)
>
> After talking to the people that own the data and wanted to use as
> much as possible space of the device, we'll start with four 12/11
> disk RAID 6 volumes (47 disk + 1 spare). That's ~12% less space than
> before with five RAID 5 volumes. I think this is a good compromise
> between safety and max. usable disk space.
Ok, the init of the new 12-disk RAID 6 volume is complete. The
numbers I get now are a bit disappointing: ~210 MB/s for read and ~110
MB/s for write.
I know that RAID 6 is slower than RAID 5, and that fewer data disks (10
instead of 15) also slow things down. But 390 MB/s read performance
dropping to 220 MB/s is a bit surprising, particularly because the RAID
5 read performance was limited by the FC (I think). I expected to lose
only about 1/3 of the RAID 5 read throughput, corresponding to the 5
fewer disks of the RAID 6. I have to test this again with a larger
chunk size (256k); we'll see how much this affects read/write
performance.
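For reference, the su/sw values for that retest would be derived like
this (a sketch: su is the controller chunk size, sw the number of data
disks, and the device name is only an example - the mkfs.xfs line is
echoed, not executed):

```shell
# RAID 6 geometry from this thread: 12 disks total, 2 of them parity.
chunk_kb=256                       # planned chunk size
total_disks=12
data_disks=$((total_disks - 2))    # RAID 6 -> 10 data disks
# Run mkfs.xfs only on the real, empty volume:
echo "mkfs.xfs -d su=${chunk_kb}k,sw=${data_disks} /dev/sdc"
```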
Ralf
end of thread, other threads: [~2007-09-27 15:23 UTC | newest]
Thread overview: 48+ messages
-- links below jump to the message on this page --
2007-09-23 9:38 mkfs options for a 16x hw raid5 and xfs (mostly large files) Ralf Gross
2007-09-23 12:56 ` Peter Grandi
2007-09-26 14:54 ` Ralf Gross
2007-09-26 16:27 ` [UNSURE] " Justin Piszcz
2007-09-26 16:54 ` Ralf Gross
2007-09-26 16:59 ` Justin Piszcz
2007-09-26 17:38 ` Bryan J. Smith
2007-09-26 17:41 ` Justin Piszcz
2007-09-26 17:55 ` Bryan J. Smith
2007-09-26 17:13 ` [UNSURE] " Bryan J. Smith
2007-09-26 17:27 ` Justin Piszcz
2007-09-26 17:35 ` Bryan J. Smith
2007-09-26 17:37 ` Justin Piszcz
2007-09-26 17:38 ` Justin Piszcz
2007-09-26 17:49 ` Bryan J. Smith
2007-09-27 15:22 ` Ralf Gross
2007-09-24 17:31 ` Ralf Gross
2007-09-24 18:01 ` Justin Piszcz
2007-09-24 20:39 ` Ralf Gross
2007-09-24 20:43 ` Justin Piszcz
2007-09-24 21:33 ` Ralf Gross
2007-09-24 21:36 ` Justin Piszcz
2007-09-24 21:52 ` Ralf Gross
2007-09-25 12:35 ` Ralf Gross
2007-09-25 12:50 ` Justin Piszcz
2007-09-25 13:44 ` Bryan J Smith
2007-09-25 12:57 ` KELEMEN Peter
2007-09-25 13:49 ` Ralf Gross
2007-09-25 14:08 ` Bryan J Smith
2007-09-25 16:07 ` Ralf Gross
2007-09-25 16:28 ` Bryan J. Smith
2007-09-25 17:25 ` Ralf Gross
2007-09-25 17:41 ` Bryan J. Smith
2007-09-25 19:13 ` Ralf Gross
2007-09-25 20:23 ` Bryan J. Smith
2007-09-25 16:48 ` Justin Piszcz
2007-09-25 18:00 ` Bryan J. Smith
2007-09-25 18:33 ` Ralf Gross
2007-09-25 23:38 ` Justin Piszcz
2007-09-26 8:23 ` Ralf Gross
2007-09-26 8:42 ` Justin Piszcz
2007-09-26 8:49 ` Ralf Gross
2007-09-26 9:52 ` Justin Piszcz
2007-09-26 15:03 ` Bryan J Smith
2007-09-26 15:15 ` Ralf Gross
2007-09-26 17:08 ` Bryan J. Smith
2007-09-26 16:24 ` Justin Piszcz
2007-09-26 17:11 ` Bryan J. Smith