From: Benjamin ESTRABAUD <be@mpstor.com>
To: linux-raid@vger.kernel.org
Subject: Re: RAID 5 doesn't scale
Date: Wed, 03 Apr 2013 12:21:35 +0100
Message-ID: <515C10BF.7060503@mpstor.com>
In-Reply-To: <loom.20130403T122905-373@post.gmane.org>
On 03/04/13 12:00, Peter Landmann wrote:
> Hi,
Hi,
>
> I wrote about it at http://article.gmane.org/gmane.linux.raid/42365 but want to go
> into more detail. Maybe there is another problem, or a
> problem in my thinking.
>
> Environment:
> HW: AMD Phenom II 1055T 2,8 GHz, 8GB ram
> Intel X25-M G2 Postville 80 GB SATA2 SSD
> SW: kernel 3.4.0 but same performance with 3.8 from git and 3.9 from "next" tree
> distribution: debian sid
> Raid Settings:
> for each hdd a 10 GB partition is used, 70 GB spare capacity
> noop-scheduler
> raid creation:
> mdadm --create /dev/md9 --force --raid-devices=4 --chunk=64 --assume-clean --level=5 /dev/sdb1 /dev/sdc1 ..
So here your RAID 5 has a chunk size of 64KB across 4 drives. Each stripe
therefore holds 3 data chunks plus 1 parity chunk, so the data portion of
your stripe is 192KB if I'm correct.
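To spell out the arithmetic, here is a minimal Python sketch. The function name and its parameters are made up for this mail and are not part of mdadm or any other tool:

```python
def stripe_data_kib(raid_devices: int, chunk_kib: int, parity_chunks: int = 1) -> int:
    """Data held by one stripe, in KiB.

    Every member device contributes one chunk per stripe, and
    parity_chunks of those hold parity instead of data (1 for RAID 5).
    """
    if raid_devices <= parity_chunks:
        raise ValueError("need more devices than parity chunks")
    return (raid_devices - parity_chunks) * chunk_kib


# Values taken from the mdadm command quoted above: 4 devices, 64K chunk.
print(stripe_data_kib(4, 64))
```

With the 6-SSD array used for the later numbers, the same formula gives 5 data chunks per stripe.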
> FIO settings:
> bs=4096
> iodepth=248
> direct=1
> continue_on_error=1
> rw=randwrite
> ioengine=libaio
> norandommap
> refill_buffers
> group_reporting
> [test1]
> numjobs=1
>
It seems that you are running random 4K writes on this array (unless you
are running the test on the SSD directly here?). If so, you are writing
lots of 4K blocks to independent 192KB stripes. This means that the
whole 192KB stripe first needs to be read, copied to memory, modified
with the new 4K of data, have its parity recalculated, and then be
rewritten to the underlying disks. On top of that, depending on your
SSD, there might be some read-modify-write cycles happening in the
background (since you might be issuing more small random IOs than the
underlying flash can handle transparently). A performance hit is
therefore plausible.
My guess is that to maximize performance, you would first want to
run IOs which minimize the read/modify/write on the RAID itself (so
writing full 192KB IOs, making sure they are also aligned correctly with
the underlying RAID), and also maybe tune your RAID chunk size to
minimize possible RMW cycles on the SSD. However, the SSD aspect is
unlikely to be the cause of your performance issue if you get good
performance writing 4K blocks on the SSD itself.
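As a rough illustration of what I mean by "full and aligned", here is a small Python sketch. The function name and defaults are made up for this example (4 drives and a 64K chunk, as in your mdadm command). It only looks at byte offsets as seen on the md device and assumes nothing about how md handles the request internally:

```python
def is_full_stripe_write(offset: int, length: int,
                         raid_devices: int = 4,
                         chunk_bytes: int = 64 * 1024) -> bool:
    """True if a write starts on a stripe boundary and covers whole stripes only."""
    # RAID 5: one chunk per stripe is parity, the rest carry data.
    stripe_bytes = (raid_devices - 1) * chunk_bytes
    return (length > 0
            and offset % stripe_bytes == 0
            and length % stripe_bytes == 0)


print(is_full_stripe_write(0, 192 * 1024))        # one whole stripe
print(is_full_stripe_write(4096, 4096))           # a 4K random write
print(is_full_stripe_write(64 * 1024, 192 * 1024))  # right size, wrong alignment
```

Only the first call describes a write that would avoid touching a partially updated stripe. The other two would fall into the small-write case discussed above.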
So it would seem to me that what's killing your performance is the RMW
on the RAID itself: every time you want to write 4K, a whole stripe has to
be read and modified in memory, and 192K of data has to be rewritten to the
array, which is highly inefficient.
A smaller chunk size might help with handling this kind of IO. The
question you have to ask yourself is whether 4K random writes are
really what you are going to run, or whether this was just for the sake of
testing.
You could also test read performance (no RMW hit) to check that there is no
bottleneck there (thus partially confirming the above).
Also, don't take my word for it just yet; maybe wait for confirmation
from some other people on this ML. The above is what I *think* is
happening but I could definitely be completely wrong.
> Theoretical performance: in single mode without raid each ssd writes 20k IOPS
> and reads 40k IOPS.
> With Raid 5 and with at least 4 SSDs there are as many write operations as read
> operations. So a single SSD should deliver 13333
> read and write operations per second.
>
> Without RAID, a maximum performance of 140000 random read and 120000 random
> write operations per second is achieved, so hw
> shouldn't be the limiting factor for raid 5.
>
>
> Evaluation: Random write in IOPS
> #SSD  experimental  theoretical
> 3     14497.7       24000
> 4     14005         26666
> 5     17172.3       33333
> 6     19779         40000
>
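If I read your numbers right, the theoretical column for 4 or more SSDs can be reproduced with the short Python sketch below. The model in it is my assumption about how you got there, not something you stated: each 4K array write costs 2 device reads plus 2 device writes, spread evenly over all SSDs. The function name is made up. The 3-SSD row is simply copied from your table, as is the experimental column.

```python
READ_IOPS, WRITE_IOPS = 40_000, 20_000  # per-SSD figures quoted above


def theoretical_write_iops(n_ssds: int) -> int:
    """Assumed model for n >= 4: 2 reads + 2 writes per 4K array write.

    Device time per array write is 2/READ_IOPS + 2/WRITE_IOPS seconds,
    so n devices give n / (2/R + 2/W) = n*R*W / (2*W + 2*R) writes/s.
    """
    return n_ssds * READ_IOPS * WRITE_IOPS // (2 * WRITE_IOPS + 2 * READ_IOPS)


# (experimental, theoretical) rows copied from the quoted table
table = {3: (14497.7, 24000), 4: (14005, 26666),
         5: (17172.3, 33333), 6: (19779, 40000)}

for n, (measured, quoted) in table.items():
    model = theoretical_write_iops(n) if n >= 4 else "n/a"
    print(f"{n} SSDs: model={model} quoted={quoted} "
          f"measured/quoted={measured / quoted:.0%}")
```

The last column is the ratio behind the "only 50%" figure: it stays close to one half for 4 to 6 SSDs.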
> Following stats and output for raid 5 with 6 SSDs
>
> fio:
> ssd10gbraid5rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
> iodepth=248
> 2.0.8
> Starting 1 process
>
> ssd10gbraid5rw: (groupid=0, jobs=1): err= 0: pid=32400
> Description : [SSD 10GB raid5 (mdadm) random write test]
> write: io=988.0KB, bw=79133KB/s, iops=19783 , runt=5300335msec
> slat (usec): min=3 , max=282137 , avg= 7.46, stdev=36.26
> clat (usec): min=250 , max=338796K, avg=12525.28, stdev=136706.65
> lat (usec): min=259 , max=338796K, avg=12533.00, stdev=136706.66
> clat percentiles (usec):
> | 1.00th=[ 1048], 5.00th=[ 2096], 10.00th=[ 2672], 20.00th=[ 3504],
> | 30.00th=[ 4576], 40.00th=[ 6496], 50.00th=[ 8512], 60.00th=[11456],
> | 70.00th=[15168], 80.00th=[20352], 90.00th=[28544], 95.00th=[33536],
> | 99.00th=[39168], 99.50th=[41216], 99.90th=[56064], 99.95th=[292864],
> | 99.99th=[309248]
> bw (KB/s) : min= 6907, max=100088, per=100.00%, avg=79313.22, stdev=8802.19
> lat (usec) : 500=0.05%, 750=0.27%, 1000=0.52%
> lat (msec) : 2=3.52%, 4=20.98%, 10=30.25%, 20=23.99%, 50=20.29%
> lat (msec) : 100=0.03%, 250=0.01%, 500=0.10%, 750=0.01%, 1000=0.01%
> lat (msec) : 2000=0.01%, >=2000=0.01%
> cpu : usr=7.75%, sys=21.55%, ctx=47382311, majf=0, minf=0
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> issued : total=r=0/w=0/d=104857847, short=r=0/w=0/d=0
> errors : total=0, first_error=0/<(null)>
>
> Run status group 0 (all jobs):
> WRITE: io=409601MB, aggrb=79132KB/s, minb=79132KB/s, maxb=79132KB/s,
> mint=5300335msec, maxt=5300335msec
>
> Disk stats (read/write):
> md9: ios=84/104857172, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=34949993/34951372, aggrmerge=401/512,
> aggrticks=130838494/122401043, aggrin_queue=253198596, aggrutil=96.05%
> sdb: ios=34950097/34951445, merge=400/511, ticks=130214828/121603063,
> in_queue=251778978, util=95.86%
> sdc: ios=34952941/34954281, merge=399/516, ticks=130736987/122271756,
> in_queue=252969493, util=95.91%
> sdd: ios=34943892/34945256, merge=417/527, ticks=131734001/123258071,
> in_queue=254949447, util=95.89%
> sde: ios=34954980/34956283, merge=367/473, ticks=125822046/117619660,
> in_queue=243399327, util=95.95%
> sdf: ios=34952583/34954080, merge=415/532, ticks=137200055/128624635,
> in_queue=265784289, util=96.05%
> sdg: ios=34945469/34946890, merge=408/517, ticks=129323047/121029077,
> in_queue=250310045, util=95.99%
>
> top:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4525 root 20 0 0 0 0 R 39,6 0,0 98:16.78 md9_raid5
> 32400 root 20 0 79716 1824 420 S 30,6 0,1 0:02.77 fio
> 29099 root 20 0 0 0 0 R 7,3 0,0 0:33.90 kworker/u:0
> 31740 root 20 0 0 0 0 S 6,7 0,0 4:59.61 kworker/u:3
> 18488 root 20 0 0 0 0 S 5,7 0,0 2:06.64 kworker/u:1
> 31197 root 20 0 0 0 0 S 4,7 0,0 0:13.77 kworker/u:4
> 23450 root 20 0 0 0 0 S 3,0 0,0 1:34.33 kworker/u:7
> 27068 root 20 0 0 0 0 S 1,7 0,0 0:51.94 kworker/u:2
>
> mpstat:
> CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
> all 1,17 0,00 12,67 12,71 3,27 3,05 0,00 0,00 67,13
> 0 1,41 0,00 7,88 15,42 0,07 0,15 0,00 0,00 75,07
> 1 0,00 0,00 38,04 3,14 19,20 18,08 0,00 0,00 21,54
> 2 1,50 0,00 7,55 14,78 0,07 0,02 0,00 0,00 76,08
> 3 1,09 0,00 7,31 12,15 0,05 0,02 0,00 0,00 79,38
> 4 1,35 0,00 7,41 12,94 0,07 0,00 0,00 0,00 78,23
> 5 1,65 0,00 7,78 17,84 0,12 0,03 0,00 0,00 72,57
>
> iostat -x 1:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0,67 0,00 18,79 3,69 0,00 76,85
>
> Device: rrqm/s wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb       0,00   0,00 6952,00 6935,00 27808,00 27740,00     8,00    24,97  1,80    2,00    1,59  0,06 77,90
> sda       2,00   0,00 6774,00 6789,00 27104,00 27156,00     8,00    21,26  1,57    1,78    1,36  0,06 77,60
> sdd       4,00   4,00 7059,00 7013,00 28252,00 28068,00     8,00   136,01  9,66   10,34    8,98  0,07 99,60
> sdc       0,00   0,00 6851,00 6851,00 27404,00 27404,00     8,00    22,80  1,66    1,86    1,46  0,06 77,70
> sdf       0,00   0,00 6931,00 6995,00 27724,00 27980,00     8,00    41,78  3,03    3,26    2,80  0,06 79,70
> sde       0,00   0,00 6842,00 6837,00 27368,00 27348,00     8,00    31,59  2,31    2,53    2,08  0,06 79,60
>
> another snapshot
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0,84 0,00 22,35 2,18 0,00 74,62
>
> Device: rrqm/s wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb       1,00   2,00 8344,00 8400,00 33380,00 33608,00     8,00    67,39  4,06    4,30    3,82  0,06 97,80
> sda       1,00   0,00 8305,00 8290,00 33224,00 33160,00     8,00    28,74  1,73    1,94    1,52  0,05 88,40
> sdd       5,00   5,00 8393,00 8419,00 33592,00 33696,00     8,00    96,74  5,76    6,02    5,49  0,06 98,80
> sdc       0,00   1,00 8199,00 8201,00 32796,00 32808,00     8,00    27,64  1,68    1,92    1,45  0,05 87,80
> sdf       1,00   0,00 8332,00 8323,00 33328,00 33292,00     8,00    40,95  2,44    2,66    2,23  0,05 89,30
> sde       0,00   0,00 8256,00 8263,00 33024,00 33052,00     8,00    28,94  1,75    1,96    1,54  0,05 89,50
>
> mpstat for same test with 3.9 kernel from next-tree
> CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
> all 0,50 0,00 10,03 1,34 2,01 6,35 0,00 0,00 79,77
> 0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
> 1 0,00 0,00 25,00 0,00 5,00 18,00 0,00 0,00 52,00
> 2 0,00 0,00 20,83 0,00 5,21 18,75 0,00 0,00 55,21
> 3 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
> 4 3,06 0,00 15,31 8,16 0,00 0,00 0,00 0,00 73,47
> 5 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
>
>
> So do you have an idea why the real performance is only 50% of the theoretical
> performance? No CPU core is at its limit.
> As I said in my other post, I would be interested in solving the problem, but I
> have trouble identifying it.
> Peter Landmann
Regards,
Ben.