Re: RAID 5 doesn't scale - Benjamin ESTRABAUD

From: Benjamin ESTRABAUD <be@mpstor.com>
To: linux-raid@vger.kernel.org
Subject: Re: RAID 5 doesn't scale
Date: Wed, 03 Apr 2013 12:21:35 +0100
Message-ID: <515C10BF.7060503@mpstor.com>
In-Reply-To: <loom.20130403T122905-373@post.gmane.org>

On 03/04/13 12:00, Peter Landmann wrote:
> Hi,
Hi,
>
> i wrote it there http://article.gmane.org/gmane.linux.raid/42365 but want to go
> in detail. Maybe there is another problem or
> problem in my thinking.
>
> Environment:
> HW: AMD Phenom II 1055T 2,8 GHz, 8GB ram
>      Intel X25-M G2 Postville 80 GB SATA2 SSD
> SW: kernel 3.4.0 but same performance with 3.8 from git and 3.9 from "next" tree
>      distribution: debian sid
> Raid Settings:
>      for each hdd a 10 GB partition is used, 70 GB spare capacity
>      noop-scheduler
>      raid creation:
>      mdadm --create /dev/md9 --force --raid-devices=4 --chunk=64 --assume-clean --level=5 /dev/sdb1 /dev/sdc1 ..
So your RAID 5 has a 64K chunk size across 4 drives, which gives 3 data
chunks per stripe, so your stripe holds 192KB of data (plus 64KB of
parity) if I'm correct.
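
For reference, this is the arithmetic I am using, as a small sketch. The
function name and the returned keys are made up for illustration; the
only inputs are the device count and chunk size from your mdadm command.

```python
def raid5_stripe_geometry(num_devices: int, chunk_kib: int) -> dict:
    """Return per-stripe sizes for a RAID 5 array (one parity chunk per stripe)."""
    if num_devices < 3:
        raise ValueError("RAID 5 needs at least 3 devices")
    data_chunks = num_devices - 1  # one chunk per stripe holds parity
    return {
        "data_chunks": data_chunks,
        "data_stripe_kib": data_chunks * chunk_kib,
        "raw_stripe_kib": num_devices * chunk_kib,
    }


# 4 devices, 64K chunks, as in the quoted mdadm command
geometry = raid5_stripe_geometry(4, 64)
print(geometry)
```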
> FIO settings:
> bs=4096
> iodepth=248
> direct=1
> continue_on_error=1
> rw=randwrite
> ioengine=libaio
> norandommap
> refill_buffers
> group_reporting
> [test1]
> numjobs=1
>
It seems that you are running random 4K writes on this array (unless you
are running the test on the SSD directly here?). If so, you are writing
lots of 4K blocks to independent 192KB stripes. As I understand it, this
means that for each write the whole 192KB stripe first has to be read
into memory and modified with the new 4K of data, then its parity has to
be recalculated and the new stripe rewritten to the underlying disks. On
top of that, depending on your SSD, there might be some
read-modify-write cycles happening in the background (since you might be
running more small random IOs than the underlying flash can handle
transparently). A performance hit is therefore plausible.

My guess is that to maximize performance, you would first want to run
IOs that minimize read-modify-write on the RAID itself (so writing full
192KB IOs, making sure they are also aligned correctly with the
underlying RAID), and maybe also tune your RAID chunk size to minimize
possible RMW cycles on the SSD. However, the SSD is unlikely to be the
cause of your performance issue if you get good performance writing 4K
blocks to the SSD itself.
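
To make "full and aligned" concrete, here is a minimal sketch of the
check I have in mind. The helper name is hypothetical, and it assumes
offsets are measured from the start of the md device and that 192KB is
the data stripe size worked out above.

```python
DATA_STRIPE_BYTES = 192 * 1024  # 3 data chunks of 64K, per the array above


def is_full_stripe_write(offset_bytes: int, length_bytes: int,
                         data_stripe_bytes: int = DATA_STRIPE_BYTES) -> bool:
    """True if the IO starts on a stripe boundary and covers whole stripes only."""
    return (
        length_bytes > 0
        and offset_bytes % data_stripe_bytes == 0
        and length_bytes % data_stripe_bytes == 0
    )


print(is_full_stripe_write(0, 192 * 1024))          # one aligned full stripe
print(is_full_stripe_write(4096, 4096))             # a 4K random write
print(is_full_stripe_write(64 * 1024, 192 * 1024))  # right size, wrong alignment
```

Only IOs of the first kind would let the parity be computed from the new
data alone, without reading anything back from the disks.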

So it would seem to me that what's killing your performance is the RMW
on the RAID itself: every time you want to write 4K, a whole stripe has
to be read and modified in memory, and 192K of data has to be rewritten
to the array, which is highly inefficient.
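
As an illustration of why a small write cannot just be written in place,
here is a sketch of the XOR parity math only. It is not md's code, and I
have not checked whether md really reads the whole stripe or only the
blocks it touches; the helper name and the 4K block size are
assumptions for the example.

```python
import os
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


BLOCK = 4096

# One 4K slice of a stripe with 3 data blocks and their parity
data = [os.urandom(BLOCK) for _ in range(3)]
parity = xor_blocks(*data)

# Small write: the new parity needs the old data and the old parity,
# so both have to be read before the two new blocks can be written.
new_block = os.urandom(BLOCK)
updated_parity = xor_blocks(parity, data[1], new_block)
data[1] = new_block

# Same result as recomputing parity from the whole slice
assert updated_parity == xor_blocks(*data)

# Under the whole-stripe model described above: bytes rewritten per 4K written
amplification = (192 * 1024) // BLOCK
print(amplification)
```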

A smaller chunk size might help with this kind of IO. The question you
have to ask yourself is whether 4K random writes are really what you are
going to run, or whether this was just for the sake of testing.

You could also test read performance (no RMW hit) to check that there is
no bottleneck there; if reads scale, that would partially confirm the
above.
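
A read test could reuse your job settings with only the rw mode changed.
The sketch below just generates such a job file; the [global] section,
the job name and filename=/dev/md9 are my assumptions, since your post
does not show which target you pointed fio at.

```python
# Settings copied from the quoted fio job
QUOTED_SETTINGS = [
    "bs=4096",
    "iodepth=248",
    "direct=1",
    "continue_on_error=1",
    "rw=randwrite",
    "ioengine=libaio",
    "norandommap",
    "refill_buffers",
    "group_reporting",
]


def make_randread_job(device: str = "/dev/md9", job_name: str = "readtest") -> str:
    """Build fio job-file text identical to the quoted job, but with random reads."""
    settings = ["rw=randread" if s.startswith("rw=") else s for s in QUOTED_SETTINGS]
    lines = ["[global]", *settings, f"filename={device}", f"[{job_name}]", "numjobs=1"]
    return "\n".join(lines) + "\n"


print(make_randread_job())
```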

Also, don't take my word for it just yet; maybe wait for confirmation
from other people on this ML. The above is what I *think* is happening,
but I could definitely be completely wrong.
> Theoretical performance: in single mode without raid each ssd writes 20k IOPS
> and reads 40k IOPS.
> With Raid 5 and with at least 4 SSDs there are as many write operations as read
> operations. So a single SSD should deliver 13333
> read and write operations per second.
>
> Without RAID, a maximum performance of 140000 random read and 120000 random
> write operations per second is achieved, so hw
> shouldn't be the limiting factor for raid 5.
>
>
> Evaluation: Random write in IOPS
> #SSD experimental    theoretical
> 3  14497.7           24000
> 4  14005             26666
> 5  17172.3           33333
> 6  19779             40000
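
Just to check that I follow your theoretical column, here is a sketch
that reproduces it from your per-SSD figures (20k write IOPS, 40k read
IOPS). It assumes each array write costs 2 device reads and 2 device
writes with 4 or more SSDs. The 3-SSD case (1 read, 2 writes) is my
guess, chosen because it matches your 24000; the function name is made
up.

```python
READ_IOPS = 40_000   # single SSD, random read
WRITE_IOPS = 20_000  # single SSD, random write


def theoretical_write_iops(num_ssds: int) -> float:
    """Array write IOPS if the SSDs are the only limit."""
    if num_ssds == 3:
        reads, writes = 1, 2  # assumption: read the other data block instead
    else:
        reads, writes = 2, 2  # read old data + old parity, write both back
    device_seconds_per_write = reads / READ_IOPS + writes / WRITE_IOPS
    return num_ssds / device_seconds_per_write


# Measured values from the table above
measured = {3: 14497.7, 4: 14005.0, 5: 17172.3, 6: 19779.0}

for n, iops in measured.items():
    theory = theoretical_write_iops(n)
    print(f"{n} SSDs: theoretical {theory:.0f}, measured {iops:.1f}, "
          f"ratio {iops / theory:.2f}")
```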
>
> Following stats and output for  raid 5 with 6 SSDs
>
> fio:
> ssd10gbraid5rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
> iodepth=248
> 2.0.8
> Starting 1 process
>
> ssd10gbraid5rw: (groupid=0, jobs=1): err= 0: pid=32400
>    Description  : [SSD 10GB raid5 (mdadm) random write test]
>    write: io=988.0KB, bw=79133KB/s, iops=19783 , runt=5300335msec
>      slat (usec): min=3 , max=282137 , avg= 7.46, stdev=36.26
>      clat (usec): min=250 , max=338796K, avg=12525.28, stdev=136706.65
>       lat (usec): min=259 , max=338796K, avg=12533.00, stdev=136706.66
>      clat percentiles (usec):
>       |  1.00th=[ 1048],  5.00th=[ 2096], 10.00th=[ 2672], 20.00th=[ 3504],
>       | 30.00th=[ 4576], 40.00th=[ 6496], 50.00th=[ 8512], 60.00th=[11456],
>       | 70.00th=[15168], 80.00th=[20352], 90.00th=[28544], 95.00th=[33536],
>       | 99.00th=[39168], 99.50th=[41216], 99.90th=[56064], 99.95th=[292864],
>       | 99.99th=[309248]
>      bw (KB/s)  : min= 6907, max=100088, per=100.00%, avg=79313.22, stdev=8802.19
>      lat (usec) : 500=0.05%, 750=0.27%, 1000=0.52%
>      lat (msec) : 2=3.52%, 4=20.98%, 10=30.25%, 20=23.99%, 50=20.29%
>      lat (msec) : 100=0.03%, 250=0.01%, 500=0.10%, 750=0.01%, 1000=0.01%
>      lat (msec) : 2000=0.01%, >=2000=0.01%
>    cpu          : usr=7.75%, sys=21.55%, ctx=47382311, majf=0, minf=0
>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>       issued    : total=r=0/w=0/d=104857847, short=r=0/w=0/d=0
>       errors    : total=0, first_error=0/<(null)>
>
> Run status group 0 (all jobs):
>    WRITE: io=409601MB, aggrb=79132KB/s, minb=79132KB/s, maxb=79132KB/s,
> mint=5300335msec, maxt=5300335msec
>
> Disk stats (read/write):
>      md9: ios=84/104857172, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=34949993/34951372, aggrmerge=401/512,
> aggrticks=130838494/122401043, aggrin_queue=253198596, aggrutil=96.05%
>    sdb: ios=34950097/34951445, merge=400/511, ticks=130214828/121603063,
> in_queue=251778978, util=95.86%
>    sdc: ios=34952941/34954281, merge=399/516, ticks=130736987/122271756,
> in_queue=252969493, util=95.91%
>    sdd: ios=34943892/34945256, merge=417/527, ticks=131734001/123258071,
> in_queue=254949447, util=95.89%
>    sde: ios=34954980/34956283, merge=367/473, ticks=125822046/117619660,
> in_queue=243399327, util=95.95%
>    sdf: ios=34952583/34954080, merge=415/532, ticks=137200055/128624635,
> in_queue=265784289, util=96.05%
>    sdg: ios=34945469/34946890, merge=408/517, ticks=129323047/121029077,
> in_queue=250310045, util=95.99%
>
> top:
>    PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>   4525 root      20   0     0    0    0 R  39,6  0,0  98:16.78 md9_raid5
> 32400 root      20   0 79716 1824  420 S  30,6  0,1   0:02.77 fio
> 29099 root      20   0     0    0    0 R   7,3  0,0   0:33.90 kworker/u:0
> 31740 root      20   0     0    0    0 S   6,7  0,0   4:59.61 kworker/u:3
> 18488 root      20   0     0    0    0 S   5,7  0,0   2:06.64 kworker/u:1
> 31197 root      20   0     0    0    0 S   4,7  0,0   0:13.77 kworker/u:4
> 23450 root      20   0     0    0    0 S   3,0  0,0   1:34.33 kworker/u:7
> 27068 root      20   0     0    0    0 S   1,7  0,0   0:51.94 kworker/u:2
>
> mpstat:
> CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
> all    1,17    0,00   12,67   12,71    3,27    3,05    0,00    0,00   67,13
> 0    1,41    0,00    7,88   15,42    0,07    0,15    0,00    0,00   75,07
> 1    0,00    0,00   38,04    3,14   19,20   18,08    0,00    0,00   21,54
> 2    1,50    0,00    7,55   14,78    0,07    0,02    0,00    0,00   76,08
> 3    1,09    0,00    7,31   12,15    0,05    0,02    0,00    0,00   79,38
> 4    1,35    0,00    7,41   12,94    0,07    0,00    0,00    0,00   78,23
> 5    1,65    0,00    7,78   17,84    0,12    0,03    0,00    0,00   72,57
>
> iostat -x 1:
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0,67    0,00   18,79    3,69    0,00   76,85
>
> Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb        0,00    0,00  6952,00  6935,00  27808,00  27740,00     8,00    24,97    1,80    2,00    1,59   0,06  77,90
> sda        2,00    0,00  6774,00  6789,00  27104,00  27156,00     8,00    21,26    1,57    1,78    1,36   0,06  77,60
> sdd        4,00    4,00  7059,00  7013,00  28252,00  28068,00     8,00   136,01    9,66   10,34    8,98   0,07  99,60
> sdc        0,00    0,00  6851,00  6851,00  27404,00  27404,00     8,00    22,80    1,66    1,86    1,46   0,06  77,70
> sdf        0,00    0,00  6931,00  6995,00  27724,00  27980,00     8,00    41,78    3,03    3,26    2,80   0,06  79,70
> sde        0,00    0,00  6842,00  6837,00  27368,00  27348,00     8,00    31,59    2,31    2,53    2,08   0,06  79,60
>
> another snapshot
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0,84    0,00   22,35    2,18    0,00   74,62
>
> Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb        1,00    2,00  8344,00  8400,00  33380,00  33608,00     8,00    67,39    4,06    4,30    3,82   0,06  97,80
> sda        1,00    0,00  8305,00  8290,00  33224,00  33160,00     8,00    28,74    1,73    1,94    1,52   0,05  88,40
> sdd        5,00    5,00  8393,00  8419,00  33592,00  33696,00     8,00    96,74    5,76    6,02    5,49   0,06  98,80
> sdc        0,00    1,00  8199,00  8201,00  32796,00  32808,00     8,00    27,64    1,68    1,92    1,45   0,05  87,80
> sdf        1,00    0,00  8332,00  8323,00  33328,00  33292,00     8,00    40,95    2,44    2,66    2,23   0,05  89,30
> sde        0,00    0,00  8256,00  8263,00  33024,00  33052,00     8,00    28,94    1,75    1,96    1,54   0,05  89,50
>
> mpstat for same test with 3.9 kernel from next-tree
> CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
> all    0,50    0,00   10,03    1,34    2,01    6,35    0,00    0,00   79,77
> 0    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00
> 1    0,00    0,00   25,00    0,00    5,00   18,00    0,00    0,00   52,00
> 2    0,00    0,00   20,83    0,00    5,21   18,75    0,00    0,00   55,21
> 3    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00
> 4    3,06    0,00   15,31    8,16    0,00    0,00    0,00    0,00   73,47
> 5    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00
>
>
> Do you have an idea why the real performance is only 50% of the theoretical
> performance? No cpu core is at its limit.
> As I said in my other post, I would be interested in solving the problem but I
> have trouble identifying it.
> Peter Landmann

Regards,

Ben.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
