linux-raid.vger.kernel.org archive mirror
* Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
@ 2008-06-07 14:22 Justin Piszcz
  2008-06-07 15:54 ` David Lethe
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Justin Piszcz @ 2008-06-07 14:22 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

First, the original six-SATA-drive benchmarks, with fixed formatting: right-justified
and with the same decimal precision throughout:
http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html

Now for the VelociRaptors!  Ever wonder what kind of speed is possible with
3-, 4-, 5-, 6-, 7-, 8-, 9- and 10-disk RAID5s?  I ran a loop to find out; each
configuration is run three times and the average of the three runs is taken
for each RAID5 disk set.

In short?  The 965 no longer does these faster drives justice; a new chipset
and motherboard are needed.  Reading from or writing to 4-5 VelociRaptors
saturates the bus/965 chipset.

Here is a picture of the 12 VelociRaptors I tested with:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/raptors.jpg

Here are the bonnie++ results:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.html

For those who want the results in text:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.txt

System used, same/similar as before:
Motherboard: Intel DG965WH
Memory: 8GiB
Kernel: 2.6.25.4
Distribution: Debian Testing x86_64
Filesystem: XFS with default mkfs.xfs parameters [auto-optimized for SW RAID]
Mount options: defaults,noatime,nodiratime,logbufs=8,logbsize=262144 0 1
Chunk size: 1024KiB
RAID5 Layout: Default (left-symmetric)
Mdadm Superblock used: 0.90

Optimizations used (the last one is for the CFQ scheduler); together they improve
performance by a modest 5-10 MiB/s:
http://home.comcast.net/~jpiszcz/raid/20080601/raid5.html

# Tell user what's going on.
echo "Optimizing RAID Arrays..."

# Define DISKS.
cd /sys/block
DISKS=$(/bin/ls -1d sd[a-z])

# Set read-ahead.
# 65536 sectors x 512 bytes = 32 MiB of read-ahead.
echo "Setting read-ahead to 32 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3

# Set stripe-cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

# Disable NCQ on all disks.
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
   echo "Disabling NCQ on $i"
   echo 1 > /sys/block/"$i"/device/queue_depth
done

# Fix slice_idle.
# See http://www.nextre.it/oracledocs/ioscheduler_03.html
echo "Fixing slice_idle to 0..."
for i in $DISKS
do
   echo "Changing slice_idle to 0 on $i"
   echo 0 > /sys/block/"$i"/queue/iosched/slice_idle
done
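
A quick way to confirm the settings took effect (a sketch, assuming the same
/dev/md3 array and $DISKS list as above):

# Read back the tuned values; blockdev reports read-ahead in 512-byte sectors.
blockdev --getra /dev/md3
cat /sys/block/md3/md/stripe_cache_size
for i in $DISKS
do
   echo "$i: queue_depth=$(cat /sys/block/$i/device/queue_depth)" \
        "slice_idle=$(cat /sys/block/$i/queue/iosched/slice_idle)"
done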

----

Order of tests:

1. Create RAID (mdadm)

Example:


   if [ $num_disks -eq 3 ]; then
     mdadm --create /dev/md3 --verbose --level=5 -n $num_disks -c 1024 -e 0.90 \
     /dev/sd[c-e]1 --assume-clean --run
   fi

2. Run optimize script (above)

See above.

3. mkfs.xfs -f /dev/md3

mkfs.xfs auto-optimized for the underlying devices in an mdadm SW RAID.

4. Run bonnie++ as shown below 3 times, averaged:

/usr/bin/time /usr/sbin/bonnie++ -u 1000 -d /x/test -s 16384 -m p34 -n 16:100000:16:64 > $HOME/test"$run"_$num_disks-disks.txt 2>&1
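
Putting steps 1-4 together, a sketch of the outer loop (the loop actually used
was not posted; the member-partition names, the /x mount point and the tuning
script's filename are assumptions):

for num_disks in 3 4 5 6 7 8 9 10
do
   # Step 1: take the first $num_disks partitions out of /dev/sdc1../dev/sdl1.
   DEVICES=$(echo /dev/sd[c-l]1 | tr ' ' '\n' | head -n "$num_disks" | tr '\n' ' ')
   mdadm --create /dev/md3 --verbose --level=5 -n "$num_disks" -c 1024 -e 0.90 \
         $DEVICES --assume-clean --run
   # Step 2: the tuning script shown above (assumed filename).
   sh ./optimize-raid.sh
   # Step 3: mkfs.xfs picks up sunit/swidth from the md geometry automatically.
   mkfs.xfs -f /dev/md3
   mount -o noatime,nodiratime,logbufs=8,logbsize=262144 /dev/md3 /x
   mkdir -p /x/test && chown 1000 /x/test   # bonnie++ drops to uid 1000 (-u 1000)
   # Step 4: three bonnie++ runs per disk count, averaged afterwards.
   for run in 1 2 3
   do
      /usr/bin/time /usr/sbin/bonnie++ -u 1000 -d /x/test -s 16384 -m p34 \
          -n 16:100000:16:64 > "$HOME/test${run}_${num_disks}-disks.txt" 2>&1
   done
   umount /x
   mdadm --stop /dev/md3
done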


----

A little more info, after 4-5 dd's, I have already maxed out the performance
of what the chipset can offer, see below:

knoppix@Knoppix:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  1  0      0 2755556   6176 203584    0    0   153     1   25  371  3  1 84 11
  0  0      0 2755556   6176 203588    0    0     0     0   66  257  0  0 100  0
  0  1      0 2605400 152204 203584    0    0     0 146028  257  396  0  5 77 18
  0  1      0 2478176 277520 203604    0    0     0 125316  345  794  1  4 75 20
  1  0      0 2349472 403984 203592    0    0     0 119136  297  256  0  5 75 20
  2  1      0 2117292 631172 203512    0    0     0 232336  498 1019  0  8 66 26
  0  2      0 2014400 731968 203556    0    0     0 241472  542 2078  1 11 63 25
  3  0      0 2013412 733756 203492    0    0     0 302104  672 2760  0 14 59 27
  0  3      0 2013576 735624 203520    0    0     0 362524  808 3356  0 15 56 29
  0  4      0 2039312 736728 174860    0    0   120 425484  956 4899  1 20 52 26
  0  4      0 2050236 738508 163712    0    0     0 482868 1008 5030  1 24 46 29
  5  3      0 2050192 737916 163756    0    0     0 531532 1175 6033  0 26 43 31
  3  4      0 2050220 738028 163744    0    0     0 606560 1312 6664  1 32 38 30
  1  5      0 2049432 739184 163628    0    0     0 592756 1291 7195  1 30 35 34
  8  3      0 2049488 738868 163580    0    0     0 675228 1721 10540 1 38 30 31
Here, at ~5 Raptor 300s, there is no more linear improvement after this point:
  4  4      0 2050048 737816 163744    0    0     0 677820 1771 10514 1 36 32 31
  6  4      0 2048764 738612 163684    0    0     0 697640 1842 13231 1 40 27 33



* RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-07 14:22 Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors Justin Piszcz
@ 2008-06-07 15:54 ` David Lethe
  2008-06-08  1:46 ` Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: David Lethe @ 2008-06-07 15:54 UTC (permalink / raw)
  To: Justin Piszcz, linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

This is all interesting, but it has no relevance to the real world,
where computers run application software.  You have a great foundation
here, but it won't help anybody who is running a database, mail, or
file/backup server, because the I/Os are too large and too homogeneous.  You
will get profoundly different sweet spots for RAID configurations once
you model your benchmark to match something that people actually run.  I am
not criticizing you for this; it is just that now I have a taste for
what you have accomplished, and I want more more more :)

David

 

* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-07 14:22 Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors Justin Piszcz
  2008-06-07 15:54 ` David Lethe
@ 2008-06-08  1:46 ` Dan Williams
  2008-06-09  7:51   ` thomas62186218
  2008-06-11 17:02 ` Nat Makarevitch
  2008-06-11 20:27 ` Bill Davidsen
  3 siblings, 1 reply; 14+ messages in thread
From: Dan Williams @ 2008-06-08  1:46 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Sat, Jun 7, 2008 at 7:22 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> # Set stripe-cache_size for RAID5.
> echo "Setting stripe_cache_size to 16 MiB for /dev/md3"

Sorry to sound like a broken record,  16MiB is not correct.

size=$((num_disks * 4 * 16384 / 1024))
echo "Setting stripe_cache_size to $size MiB for /dev/md3"

...and commit 8b3e6cdc should improve the performance / stripe_cache_size ratio.
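
To make the arithmetic explicit (a sketch; it assumes the usual 4 KiB page
size, which is what the formula above assumes as well):

# stripe_cache_size counts cache entries, and each entry holds one page per
# member device, so the memory pinned by the stripe cache is roughly
#   stripe_cache_size * page_size * num_disks
num_disks=10
echo "16384 entries on a ${num_disks}-disk array pin about $((16384 * 4 * num_disks / 1024)) MiB"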

> echo 16384 > /sys/block/md3/md/stripe_cache_size

Thanks for putting this data together.

Regards,
Dan


* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-08  1:46 ` Dan Williams
@ 2008-06-09  7:51   ` thomas62186218
  2008-06-09  8:43     ` Keld Jørn Simonsen
  2008-06-09 13:41     ` David Lethe
  0 siblings, 2 replies; 14+ messages in thread
From: thomas62186218 @ 2008-06-09  7:51 UTC (permalink / raw)
  To: dan.j.williams, jpiszcz; +Cc: linux-kernel, linux-raid, xfs, ap

Thank you for sharing these results. One issue that I consistently see 
with these results is miserable random IO performance. Looking at these 
numbers, even a low-end RAID controller with 128MB of cache will outrun 
md-based RAIDs in random IO benchmarks. In today's world of virtual 
machines, etc, random IO is far more common than sequential IO. What 
can be done with md (or something else) to alleviate this problem?

-Thomas



* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-09  7:51   ` thomas62186218
@ 2008-06-09  8:43     ` Keld Jørn Simonsen
  2008-06-09 13:41     ` David Lethe
  1 sibling, 0 replies; 14+ messages in thread
From: Keld Jørn Simonsen @ 2008-06-09  8:43 UTC (permalink / raw)
  To: thomas62186218; +Cc: dan.j.williams, jpiszcz, linux-kernel, linux-raid, xfs, ap

On Mon, Jun 09, 2008 at 03:51:07AM -0400, thomas62186218@aol.com wrote:
> Thank you for sharing these results. One issue that I consistently see 
> with these results is miserable random IO performance. Looking at these 
> numbers, even a low-end RAID controller with 128MB of cache will outrun 
> md-based RAIDs in random IO benchmarks. In today's world of virtual 
> machines, etc, random IO is far more common than sequential IO. What 
> can be done with md (or something else) to alleviate this problem?

Have you got any numbers to back this up?

What benchmark are you using for random IO?

Anyway, the numbers that Justin reported were with an outdated motherboard.

My take is that Linux MD raid can outperform most HW RAID by a factor of two
on random IO.

Best regards
keld


* RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-09  7:51   ` thomas62186218
  2008-06-09  8:43     ` Keld Jørn Simonsen
@ 2008-06-09 13:41     ` David Lethe
  2008-06-09 14:27       ` Keld Jørn Simonsen
  1 sibling, 1 reply; 14+ messages in thread
From: David Lethe @ 2008-06-09 13:41 UTC (permalink / raw)
  To: thomas62186218, dan.j.williams, jpiszcz; +Cc: linux-kernel, linux-raid, xfs, ap

For faster random I/O:
 * Decrease the chunk size.
 * Migrate files that see higher random I/O to a RAID1 set, using the disks
   with the lowest access time/latency.
 * If possible, use the /dev/shm file system.
 * Determine the I/O size of the apps that produce most of the random I/O, and
   make sure that md+filesystem matches (one way to check this is sketched
   below). If most random I/O is 32KB, then don't waste bandwidth by making md
   read 256KB at a time, or making it read 2x16KB I/Os. Also don't build md
   sets like a 4-drive RAID5 (do a 5-drive RAID5 set), because the non-parity
   data isn't a multiple of 2. A 10-drive RAID5 set with heavy random I/O is
   also profoundly wrong, because you are just removing the opportunity to
   have all of those heads processing random I/O.
 * If you only have one partition on an md set, then partition it into a
   few file systems. This may provide greater opportunity for caching I/Os.
 * Experiment with different file systems, and optimize accordingly.
 * Turn off journaling, or at least move the journals to RAID1 devices.
 * Add RAM and try to increase the buffer cache in an attempt to improve the
   cache hit percentage (this works up to a point).
 * Buy a small SSD and migrate the files that get pounded with random I/O to
   that device. (Make sure you don't get a flash SSD, but a DRAM-based SSD
   that satisfies random I/O in nanoseconds instead of milliseconds.) They are
   expensive, but they are the appropriate device.  This is how companies such
   as Google & eBay manage to get things done.
The biggest thing to remember about random I/Os is that they are expensive, so
step back and think about ways to minimize the I/O requests to disk in the
first place, and/or to spread the I/O across multiple RAID sets that can work
independently to satisfy your load.  Not all of the suggestions above will
work for everybody; you must understand the nature of the bottleneck.

David
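
For the "determine the I/O size" step above, one way to see what the
applications actually issue is iostat from the sysstat package (a sketch, not
something used in this thread):

# avgrq-sz is the average request size in 512-byte sectors; r/s and w/s give
# the read/write request rates per device.  Sample every 5 seconds.
iostat -x 5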


* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-09 13:41     ` David Lethe
@ 2008-06-09 14:27       ` Keld Jørn Simonsen
  2008-06-09 14:56         ` David Lethe
  0 siblings, 1 reply; 14+ messages in thread
From: Keld Jørn Simonsen @ 2008-06-09 14:27 UTC (permalink / raw)
  To: David Lethe
  Cc: thomas62186218, dan.j.williams, jpiszcz, linux-kernel, linux-raid,
	xfs, ap

On Mon, Jun 09, 2008 at 08:41:18AM -0500, David Lethe wrote:
> For faster random I/O:
>  * Decrease chunk size
>  * Migrate files that have higher random I/O to a RAID1 set, using disks
> with the lowest access time/latency
>  * If possible, use the /dev/shm file system 
>  * Determine I/O size of apps that produce most of the random I/O, and
> make sure that md+filesystem matches. If most random I/O is 32KB, then
> don't waste bandwidth by making md read 256KB at a time, or making it
> read 2x16KB I/Os. Also don't build md sets like 4-drive RAID5, (Do a
> 5-drive RAID5 set), because non-parity data isn't a multiple of 2. A
> 10-drive RAID5 set with heavy random I/O is also profoundly wrong
> because you are just removing the opportunity to have all of those heads
> processing random I/O. 
>  * If you only have one partition on a md set, then partition it into a
> few file systems. This may provide greater opportunity for caching I/Os.
>  * Experiment with different file systems, and optimize accordingly.  
>  * Turn of journaling, or at least move journals to RAID1 devices.
>  * Add RAM and try to increase buffer cache in attempt to improve cache
> hit percentage (this works up to a point)
>  * Buy a small SSD and migrate files that get pounded with random I/O to
> that device. (Make sure you don't get a flash SSD, but a DRAM based SSD
> that satisfies random I/O in nanoseconds instead of millisecs). They are
> expensive, but the appropriate device.  This is how companies such as
> Google & Ebay manage to get things done. 
> The biggest thing to remember about random I/O, is that they are
> expensive, so just step back and think about ways to minimize the I/O
> requests to disk in the first place, and/or to spread the I/O across
> multiple raidsets that can work independently to satisfy your load.  All
> suggestions above will not work for everybody. You must understand the
> nature of the bottleneck. 


For faster random IO I would suggest using raid10,f2: for random
reading it performs like raid0, something like more than double the
speed of a normal single-drive file system. For random writes raid10,f2
performs like most other mirrored raids, given that the data needs to be
written twice.

Try and see if you can get any HW raids to match that performance.

best regards
keld
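
For anyone who wants to try that, a minimal sketch (the array name and the
member partitions are placeholders):

# Create a 4-drive RAID10 with the "far 2" layout and put XFS on it.
mdadm --create /dev/md4 --verbose --level=10 --layout=f2 --raid-devices=4 \
      /dev/sd[c-f]1 --run
mkfs.xfs -f /dev/md4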


* RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-09 14:27       ` Keld Jørn Simonsen
@ 2008-06-09 14:56         ` David Lethe
  2008-06-09 23:15           ` Keld Jørn Simonsen
  0 siblings, 1 reply; 14+ messages in thread
From: David Lethe @ 2008-06-09 14:56 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: thomas62186218, dan.j.williams, jpiszcz, linux-kernel, linux-raid,
	xfs, ap



-----Original Message-----
From: Keld Jørn Simonsen [mailto:keld@dkuug.dk] 
Sent: Monday, June 09, 2008 9:27 AM
To: David Lethe
Cc: thomas62186218@aol.com; dan.j.williams@gmail.com; jpiszcz@lucidpixels.com; linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org; xfs@oss.sgi.com; ap@solarrain.com
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors



For faster random IO I would suggest using raid10,f2: for random
reading it performs like raid0, something like more than double the
speed of a normal single-drive file system. For random writes raid10,f2
performs like most other mirrored raids, given that the data needs to be
written twice.

Try and see if you can get any HW raids to match that performance.

best regards
keld

--------------------------------------------------------------------------------
Keld:
That is counter-intuitive. The issue is random IOPS, not throughput. I do not
understand how a RAID10 would provide more I/Os per second than RAID1. Or, since
you are using RAID10, how could RAID10 serve more random I/Os than a pair
of RAID1 filesystems?  RAID0 dictates that each disk will supply half
of the data you want per application I/O request. With RAID1, each
disk can get all the data you want with a single request, and dual-porting/load balancing
will allow both disks to work independently of each other on reads, so the disk with
the least amount of load at any time can work on the request. That is why RAID1 can be
faster than JBOD.

Granted, writes are handled differently, but with any RAID0 implementation you still have to
write half of the data to each disk, requiring 2 I/Os + journaling & housekeeping.


David



* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-09 14:56         ` David Lethe
@ 2008-06-09 23:15           ` Keld Jørn Simonsen
  0 siblings, 0 replies; 14+ messages in thread
From: Keld Jørn Simonsen @ 2008-06-09 23:15 UTC (permalink / raw)
  To: David Lethe
  Cc: thomas62186218, dan.j.williams, jpiszcz, linux-kernel, linux-raid,
	xfs, ap

On Mon, Jun 09, 2008 at 09:56:14AM -0500, David Lethe wrote:
> 
> 
> From: Keld Jørn Simonsen [mailto:keld@dkuug.dk] 
> Sent: Monday, June 09, 2008 9:27 AM
> To: David Lethe
> Cc: thomas62186218@aol.com; dan.j.williams@gmail.com; jpiszcz@lucidpixels.com; linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org; xfs@oss.sgi.com; ap@solarrain.com
> Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
> 
> For faster random IO I would suggest to use raid10,f2 for the random
> reading, it performs like raid0, something like more than double the
> speed of a normal single-drive file system. For random writes raid10,f2
> performs like most other mirrorred raids, given that data needs to be
> written twice.
> 
> Try and see if you can gat any HW raids to match that performance.
> 
> best regards
> keld
> 
> --------------------------------------------------------------------------------
> Keld:
> That is counter-intuitive. The issue is random IOPs, not throughput.

That probably depends on your use. I run Linux mirrors, and for that
purpose throughput of random IO, especially reading, is key.

For databases it is probably something else, probably IOPS. Here I also
think that Linux MD raid has good performance. Once again I think my pet
RAID type, raid10,f2, has something to offer, especially since the random
seeks are shorter, as the track span is confined to the outer,
faster tracks.

And other uses may have other bottlenecks. In general I think that
throughput is an important figure, as it shows how fast a system can
process a given amount of data. Areas where this may count include web servers,
file servers, print servers, and ordinary workstations.

I actually think those two measures for random IO, namely IO throughput and IO
transactions per second, for read and write, are the most important measures.

For the IO transactions per second I agree that your suggestions are good
advice.

I would like to have good benchmarking tools for this, and also I would
like figures on how Linux MD compares to different HW RAID.
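
One option for that kind of measurement is fio, which reports both random IOPS
and throughput for read and write (a sketch, not something used in this
thread; the file name and sizes are placeholders):

fio --name=randread --filename=/x/fio.test --size=4g --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based \
    --group_reporting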

> I do not 
> understand how a RAID10 would provide more IOs per sec than RAID1. Or, since
> you are using RAID10, then how could RAID10 serve more random I/Os then a pair
> of RAID1 filesystems? 

In theory you are right. The MD implementation of RAID1 does not seem to
handle random seeks so well, AFAIK. With raid10,f2 the seeks are confined
to less than half of the disk arm movement, and that does speed
things up a little.

> RAID0 dictates that each disk will supply half 
> of the data you want per application I/O request. At least with RAID1, then each
> disk can get all the data you want with a single request, and dual-porting/load balancing 
> will allow both disks to work independently of each other on reads so the disk with
> the least amount of load at any time can work on the request. That is why RAID1 can be
> faster than JBOD.
> 
> Granted writes are handled differently, but with any RAID0 implementation you still have to write
> Half of the data to each disk requiring 2 I/Os + journaling & housekeeping.

yes, indeed.

best regards
keld

* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-07 14:22 Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors Justin Piszcz
  2008-06-07 15:54 ` David Lethe
  2008-06-08  1:46 ` Dan Williams
@ 2008-06-11 17:02 ` Nat Makarevitch
  2008-06-11 20:27 ` Bill Davidsen
  3 siblings, 0 replies; 14+ messages in thread
From: Nat Makarevitch @ 2008-06-11 17:02 UTC (permalink / raw)
  To: linux-raid

Justin Piszcz <jpiszcz <at> lucidpixels.com> writes:

> Ever wonder what kind of speed is possible with 3 disk, 4,5,6,7,8,9,10-disk
> RAID5s?
>
> Here are the bonnie++ results:
> http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.html

Why does the number of spindles have nearly no effect on the number of seeks
per second?

3 disks: 713.9 seeks/s  (AFAIK the Raptor runs at 10000 rpm, so getting 230+
seeks/s per drive is astonishing)

10 disks: 705.5 seeks/s  (the same as 3 disks?!)

Did I miss something? Or did you use a very large stripe size (to the point of
preventing the 16 GB file from spanning all spindles)? Or is it some glitch
in the RAID code (I don't think so; on a RAID10 with 10 low-end disks I obtained
~1000 IOPS: http://www.makarevitch.org/rant/raid/#3wmd)?



* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-07 14:22 Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors Justin Piszcz
                   ` (2 preceding siblings ...)
  2008-06-11 17:02 ` Nat Makarevitch
@ 2008-06-11 20:27 ` Bill Davidsen
  2008-06-11 20:48   ` Justin Piszcz
  3 siblings, 1 reply; 14+ messages in thread
From: Bill Davidsen @ 2008-06-11 20:27 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

Justin Piszcz wrote:
> First, the original benchmarks with 6-SATA drives with fixed 
> formatting, using
> right justification and the same decimal point precision throughout:
> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html 
>
>
> Now for for veliciraptors! Ever wonder what kind of speed is possible 
> with
> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
> executed three times and the average is taken of all three runs per 
> each RAID5 disk set.
>
> In short? The 965 no longer does justice with faster drives, a new 
> chipset
> and motherboard are needed. After reading or writing to 4-5 veliciraptors
> it saturates the bus/965 chipset.

This is very interesting, but a 16GB chunk size bears no relationship to 
anything I would run in the real world, and I suspect most people are in 
the same category.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-11 20:27 ` Bill Davidsen
@ 2008-06-11 20:48   ` Justin Piszcz
  2008-06-11 20:53     ` Justin Piszcz
  0 siblings, 1 reply; 14+ messages in thread
From: Justin Piszcz @ 2008-06-11 20:48 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Wed, 11 Jun 2008, Bill Davidsen wrote:

> Justin Piszcz wrote:
>> First, the original benchmarks with 6-SATA drives with fixed formatting, 
>> using
>> right justification and the same decimal point precision throughout:
>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html 
>> 
>> Now for for veliciraptors! Ever wonder what kind of speed is possible with
>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
>> executed three times and the average is taken of all three runs per each 
>> RAID5 disk set.
>> 
>> In short? The 965 no longer does justice with faster drives, a new chipset
>> and motherboard are needed. After reading or writing to 4-5 veliciraptors
>> it saturates the bus/965 chipset.
>
> This is very interesting, but a 16GB chunk size bears no relationship to 
> anything I would run in the real world, and I suspect most people are in the 
> same category.

I based my bonnie++ test on:
http://everything2.org/?node_id=1479435

So I could compare to his results.

I use a 1024k (1 MiB) chunk size with a 16384 stripe_cache_size; this offered
the best overall read/write/rewrite performance AFAIK.

Justin.


* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-11 20:48   ` Justin Piszcz
@ 2008-06-11 20:53     ` Justin Piszcz
  2008-06-12 19:08       ` Bill Davidsen
  0 siblings, 1 reply; 14+ messages in thread
From: Justin Piszcz @ 2008-06-11 20:53 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Wed, 11 Jun 2008, Justin Piszcz wrote:

>
>
> On Wed, 11 Jun 2008, Bill Davidsen wrote:
>
>> Justin Piszcz wrote:
>>> First, the original benchmarks with 6-SATA drives with fixed formatting, 
>>> using
>>> right justification and the same decimal point precision throughout:
>>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html 
>>> Now for for veliciraptors! Ever wonder what kind of speed is possible with
>>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
>>> executed three times and the average is taken of all three runs per each 
>>> RAID5 disk set.
>>> 
>>> In short? The 965 no longer does justice with faster drives, a new chipset
>>> and motherboard are needed. After reading or writing to 4-5 veliciraptors
>>> it saturates the bus/965 chipset.
>> 
>> This is very interesting, but a 16GB chunk size bears no relationship to 
>> anything I would run in the real world, and I suspect most people are in 
>> the same category.
>
> I based my bonnie++ test on:
> http://everything2.org/?node_id=1479435
>
> So I could compare to his results.
>
> I use a 1024k (1MiB) with 16384 stripe, this offered the best overall 
> read/write/rewrite performance AFAIK.

1024k chunk size (raid5 chunk size)
echo 16384 > stripe_cache_size
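
To spell out the three different numbers that keep getting mixed up here (all
taken from the commands earlier in the thread):

# mdadm -c 1024                       -> RAID5 chunk size: 1024 KiB per member
# echo 16384 > .../stripe_cache_size  -> number of stripe-cache entries (each
#                                        holds one page per member disk), not bytes
# bonnie++ -s 16384                   -> test file size in MB, i.e. the "16G"
#                                        that shows up in the bonnie++ results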



* Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
  2008-06-11 20:53     ` Justin Piszcz
@ 2008-06-12 19:08       ` Bill Davidsen
  0 siblings, 0 replies; 14+ messages in thread
From: Bill Davidsen @ 2008-06-12 19:08 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

Justin Piszcz wrote:
>
>
> On Wed, 11 Jun 2008, Justin Piszcz wrote:
>
>>
>>
>> On Wed, 11 Jun 2008, Bill Davidsen wrote:
>>
>>> Justin Piszcz wrote:
>>>> First, the original benchmarks with 6-SATA drives with fixed 
>>>> formatting, using
>>>> right justification and the same decimal point precision throughout:
>>>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html 
>>>> Now for for veliciraptors! Ever wonder what kind of speed is 
>>>> possible with
>>>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each 
>>>> run is
>>>> executed three times and the average is taken of all three runs per 
>>>> each RAID5 disk set.
>>>>
>>>> In short? The 965 no longer does justice with faster drives, a new 
>>>> chipset
>>>> and motherboard are needed. After reading or writing to 4-5 
>>>> veliciraptors
>>>> it saturates the bus/965 chipset.
>>>
>>> This is very interesting, but a 16GB chunk size bears no 
>>> relationship to anything I would run in the real world, and I 
>>> suspect most people are in the same category.
>>
>> I based my bonnie++ test on:
>> http://everything2.org/?node_id=1479435
>>
>> So I could compare to his results.
>>
>> I use a 1024k (1MiB) with 16384 stripe, this offered the best overall 
>> read/write/rewrite performance AFAIK.
>
> 1024k chunk size (raid5 chunk size)
> echo 16384 > stripe_cache_size

Please don't explain any more, I'm confused enough already. I can't make
those numbers match 16G no matter how I add them: either the contents of
the column labeled "size:chunk size" isn't the size of the chunk, or you
have a multiplier floating around that I don't see.  And you eliminated
the degraded performance: since your stripe_cache_size is less than
(raid5 chunk size)*(#disks), I would expect the reads in degraded mode
to be dog slow because they don't fit in the cache, even if 1024k is what I
call the chunk size, and certainly not if the chunk size is 16G.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 





Thread overview: 14+ messages
2008-06-07 14:22 Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors Justin Piszcz
2008-06-07 15:54 ` David Lethe
2008-06-08  1:46 ` Dan Williams
2008-06-09  7:51   ` thomas62186218
2008-06-09  8:43     ` Keld Jørn Simonsen
2008-06-09 13:41     ` David Lethe
2008-06-09 14:27       ` Keld Jørn Simonsen
2008-06-09 14:56         ` David Lethe
2008-06-09 23:15           ` Keld Jørn Simonsen
2008-06-11 17:02 ` Nat Makarevitch
2008-06-11 20:27 ` Bill Davidsen
2008-06-11 20:48   ` Justin Piszcz
2008-06-11 20:53     ` Justin Piszcz
2008-06-12 19:08       ` Bill Davidsen
