* RAID 5 doesn't scale
@ 2013-04-03 11:00 Peter Landmann
2013-04-03 11:21 ` Benjamin ESTRABAUD
2013-04-03 13:18 ` Stan Hoeppner
0 siblings, 2 replies; 17+ messages in thread
From: Peter Landmann @ 2013-04-03 11:00 UTC (permalink / raw)
To: linux-raid
Hi,
I wrote about this at http://article.gmane.org/gmane.linux.raid/42365 but want
to go into detail. Maybe there is another problem, or a problem in my thinking.
Environment:
HW: AMD Phenom II 1055T 2,8 GHz, 8GB ram
Intel X25-M G2 Postville 80 GB SATA2 SSD
SW: kernel 3.4.0 but same performance with 3.8 from git and 3.9 from "next" tree
distribution: debian sid
Raid Settings:
on each SSD a 10 GB partition is used, leaving 70 GB spare capacity
noop-scheduler
raid creation:
mdadm --create /dev/md9 --force --raid-devices=4 --chunk=64 --assume-clean --level=5 /dev/sdb1 /dev/sdc1 ..
FIO settings:
bs=4096
iodepth=248
direct=1
continue_on_error=1
rw=randwrite
ioengine=libaio
norandommap
refill_buffers
group_reporting
[test1]
numjobs=1
Theoretical performance: in single mode without RAID, each SSD writes 20k IOPS
and reads 40k IOPS.
With RAID 5 and at least 4 SSDs there are as many write operations as read
operations, so a single SSD should deliver 13333 read plus 13333 write
operations per second.
Without RAID, a maximum of 140000 random read and 120000 random write
operations per second is achieved across the drives, so the hardware shouldn't
be the limiting factor for RAID 5.
Evaluation: Random write in IOPS
#SSD   experimental   theoretical
3      14497.7        24000
4      14005          26666
5      17172.3        33333
6      19779          40000
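The theoretical column can be reproduced with a simple device-time budget. What follows is a minimal sketch and not part of the original post. It assumes a 4 KB read costs 1/40000 s and a write 1/20000 s on one SSD (the single-drive figures above), that a small RAID 5 write on 4 or more drives costs 2 reads plus 2 writes, and that on 3 drives it costs 1 read plus 2 writes. The 3-drive case is an inference that happens to match the 24000 figure. The function name is hypothetical.

```python
from fractions import Fraction

READ_IOPS = 40_000   # single-SSD random read rate quoted in the post
WRITE_IOPS = 20_000  # single-SSD random write rate quoted in the post


def theoretical_raid5_write_iops(n_ssds: int) -> int:
    """Upper bound on 4 KB random-write IOPS for an n-drive RAID 5."""
    # Assumption: with 3 drives only the other data block is read (1 read);
    # with 4 or more, old data and old parity are read (2 reads).
    # In both cases the new data block and the new parity block are written.
    reads, writes = (1, 2) if n_ssds == 3 else (2, 2)
    # Device time (seconds) consumed by one array-level write.
    cost = Fraction(reads, READ_IOPS) + Fraction(writes, WRITE_IOPS)
    # n drives work in parallel: n device-seconds are available per second.
    return int(n_ssds / cost)


for n in (3, 4, 5, 6):
    print(n, theoretical_raid5_write_iops(n))
```

Exact fractions are used so that the truncated values line up with the table rather than depending on floating-point rounding.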
Following stats and output for raid 5 with 6 SSDs
fio:
ssd10gbraid5rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=248
2.0.8
Starting 1 process
ssd10gbraid5rw: (groupid=0, jobs=1): err= 0: pid=32400
Description : [SSD 10GB raid5 (mdadm) random write test]
write: io=988.0KB, bw=79133KB/s, iops=19783 , runt=5300335msec
slat (usec): min=3 , max=282137 , avg= 7.46, stdev=36.26
clat (usec): min=250 , max=338796K, avg=12525.28, stdev=136706.65
lat (usec): min=259 , max=338796K, avg=12533.00, stdev=136706.66
clat percentiles (usec):
| 1.00th=[ 1048], 5.00th=[ 2096], 10.00th=[ 2672], 20.00th=[ 3504],
| 30.00th=[ 4576], 40.00th=[ 6496], 50.00th=[ 8512], 60.00th=[11456],
| 70.00th=[15168], 80.00th=[20352], 90.00th=[28544], 95.00th=[33536],
| 99.00th=[39168], 99.50th=[41216], 99.90th=[56064], 99.95th=[292864],
| 99.99th=[309248]
bw (KB/s) : min= 6907, max=100088, per=100.00%, avg=79313.22, stdev=8802.19
lat (usec) : 500=0.05%, 750=0.27%, 1000=0.52%
lat (msec) : 2=3.52%, 4=20.98%, 10=30.25%, 20=23.99%, 50=20.29%
lat (msec) : 100=0.03%, 250=0.01%, 500=0.10%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=7.75%, sys=21.55%, ctx=47382311, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued : total=r=0/w=0/d=104857847, short=r=0/w=0/d=0
errors : total=0, first_error=0/<(null)>
Run status group 0 (all jobs):
WRITE: io=409601MB, aggrb=79132KB/s, minb=79132KB/s, maxb=79132KB/s,
mint=5300335msec, maxt=5300335msec
Disk stats (read/write):
md9: ios=84/104857172, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=34949993/34951372, aggrmerge=401/512,
aggrticks=130838494/122401043, aggrin_queue=253198596, aggrutil=96.05%
sdb: ios=34950097/34951445, merge=400/511, ticks=130214828/121603063,
in_queue=251778978, util=95.86%
sdc: ios=34952941/34954281, merge=399/516, ticks=130736987/122271756,
in_queue=252969493, util=95.91%
sdd: ios=34943892/34945256, merge=417/527, ticks=131734001/123258071,
in_queue=254949447, util=95.89%
sde: ios=34954980/34956283, merge=367/473, ticks=125822046/117619660,
in_queue=243399327, util=95.95%
sdf: ios=34952583/34954080, merge=415/532, ticks=137200055/128624635,
in_queue=265784289, util=96.05%
sdg: ios=34945469/34946890, merge=408/517, ticks=129323047/121029077,
in_queue=250310045, util=95.99%
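The disk stats above allow a quick cross-check of how many member I/Os each array write produced. This is a rough sketch using only numbers copied from the fio output. It assumes the aggrios values are per-member averages, which is consistent with the six per-drive lines. Variable names are illustrative.

```python
# Figures copied from the fio "Disk stats" block.
md_writes = 104_857_172          # md9: ios=84/104857172 (write side)
avg_member_reads = 34_949_993    # aggrios, read side (average per member)
avg_member_writes = 34_951_372   # aggrios, write side (average per member)
members = 6

# Member-level I/Os generated per array-level 4 KB write.
reads_per_write = avg_member_reads * members / md_writes
writes_per_write = avg_member_writes * members / md_writes

print(f"{reads_per_write:.3f} reads, {writes_per_write:.3f} writes per array write")
```

Both ratios come out very close to 2, i.e. about four member I/Os per 4 KB write to the array.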
top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4525 root 20 0 0 0 0 R 39,6 0,0 98:16.78 md9_raid5
32400 root 20 0 79716 1824 420 S 30,6 0,1 0:02.77 fio
29099 root 20 0 0 0 0 R 7,3 0,0 0:33.90 kworker/u:0
31740 root 20 0 0 0 0 S 6,7 0,0 4:59.61 kworker/u:3
18488 root 20 0 0 0 0 S 5,7 0,0 2:06.64 kworker/u:1
31197 root 20 0 0 0 0 S 4,7 0,0 0:13.77 kworker/u:4
23450 root 20 0 0 0 0 S 3,0 0,0 1:34.33 kworker/u:7
27068 root 20 0 0 0 0 S 1,7 0,0 0:51.94 kworker/u:2
mpstat:
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
all 1,17 0,00 12,67 12,71 3,27 3,05 0,00 0,00 67,13
0 1,41 0,00 7,88 15,42 0,07 0,15 0,00 0,00 75,07
1 0,00 0,00 38,04 3,14 19,20 18,08 0,00 0,00 21,54
2 1,50 0,00 7,55 14,78 0,07 0,02 0,00 0,00 76,08
3 1,09 0,00 7,31 12,15 0,05 0,02 0,00 0,00 79,38
4 1,35 0,00 7,41 12,94 0,07 0,00 0,00 0,00 78,23
5 1,65 0,00 7,78 17,84 0,12 0,03 0,00 0,00 72,57
iostat -x 1:
avg-cpu: %user %nice %system %iowait %steal %idle
0,67 0,00 18,79 3,69 0,00 76,85
Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0,00    0,00  6952,00  6935,00  27808,00  27740,00      8,00     24,97   1,80     2,00     1,59   0,06  77,90
sda        2,00    0,00  6774,00  6789,00  27104,00  27156,00      8,00     21,26   1,57     1,78     1,36   0,06  77,60
sdd        4,00    4,00  7059,00  7013,00  28252,00  28068,00      8,00    136,01   9,66    10,34     8,98   0,07  99,60
sdc        0,00    0,00  6851,00  6851,00  27404,00  27404,00      8,00     22,80   1,66     1,86     1,46   0,06  77,70
sdf        0,00    0,00  6931,00  6995,00  27724,00  27980,00      8,00     41,78   3,03     3,26     2,80   0,06  79,70
sde        0,00    0,00  6842,00  6837,00  27368,00  27348,00      8,00     31,59   2,31     2,53     2,08   0,06  79,60
another snapshot
avg-cpu: %user %nice %system %iowait %steal %idle
0,84 0,00 22,35 2,18 0,00 74,62
Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        1,00    2,00  8344,00  8400,00  33380,00  33608,00      8,00     67,39   4,06     4,30     3,82   0,06  97,80
sda        1,00    0,00  8305,00  8290,00  33224,00  33160,00      8,00     28,74   1,73     1,94     1,52   0,05  88,40
sdd        5,00    5,00  8393,00  8419,00  33592,00  33696,00      8,00     96,74   5,76     6,02     5,49   0,06  98,80
sdc        0,00    1,00  8199,00  8201,00  32796,00  32808,00      8,00     27,64   1,68     1,92     1,45   0,05  87,80
sdf        1,00    0,00  8332,00  8323,00  33328,00  33292,00      8,00     40,95   2,44     2,66     2,23   0,05  89,30
sde        0,00    0,00  8256,00  8263,00  33024,00  33052,00      8,00     28,94   1,75     1,96     1,54   0,05  89,50
mpstat for same test with 3.9 kernel from next-tree
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
all 0,50 0,00 10,03 1,34 2,01 6,35 0,00 0,00 79,77
0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
1 0,00 0,00 25,00 0,00 5,00 18,00 0,00 0,00 52,00
2 0,00 0,00 20,83 0,00 5,21 18,75 0,00 0,00 55,21
3 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
4 3,06 0,00 15,31 8,16 0,00 0,00 0,00 0,00 73,47
5 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
So, do you have an idea why the real performance is only 50% of the theoretical
performance? No CPU core is at its limit.
As I said in my other post, I would be interested in solving the problem, but I
have trouble identifying it.
Peter Landmann
* Re: RAID 5 doesn't scale
From: Benjamin ESTRABAUD @ 2013-04-03 11:21 UTC
To: linux-raid

On 03/04/13 12:00, Peter Landmann wrote:
> Hi,

Hi,

> [...]
> raid creation:
> mdadm --create /dev/md9 --force --raid-devices=4 --chunk=64 --assume-clean --level=5 /dev/sdb1 /dev/sdc1 ..

So here your RAID5 has a chunk size of 64K, and you have 4 drives in a
RAID 5, so your stripe size is 192KB if I'm correct.

> FIO settings:
> bs=4096
> iodepth=248
> direct=1
> continue_on_error=1
> rw=randwrite
> ioengine=libaio
> norandommap
> refill_buffers
> group_reporting
> [test1]
> numjobs=1

It seems that you are running random 4K writes on this array (unless you
are running the test on the SSD directly here?). If so, you are writing
lots of 4K sectors on independent 192KB stripes. This means that the
whole 192KB of stripe needs to be first read, copied to memory, modified
with the new 4K of data, have its parity calculated and the new stripe
rewritten to the underlying disks. Add to that that, depending on your
SSD, there might be some read-modify-write cycles happening in the
background (since you might be running more small random IOs than the
underlying flash can handle transparently). The performance hit is
therefore possible.

The guess here is that to maximize performance, you would want to first
run IOs which minimize the read/modify/write on the RAID itself (so
writing full 192KB IOs, making sure they are also aligned correctly with
the underlying RAID), and also maybe tune your RAID chunk size to
minimize possible RMW cycles on the SSD. However, the SSD aspect is
unlikely to be the cause of your performance issue if you get good
performance writing 4K blocks on the SSD itself.

So it would seem to me that what's killing your performance is the RMW
on the RAID itself: every time you want to write 4K, a whole stripe has
to be read, modified in memory, and 192K of data has to be rewritten to
the array, making it highly inefficient. A smaller chunk size might help
with handling this kind of IO.

The thing here is that you have to ask yourself if 4K random writes are
really what you are going to run, or if this was just for the sake of
testing? You could also test read performance (no RMW hit) to see if
there is no bottleneck there (thus partially confirming the above).

Also, don't take my word for it just yet, maybe wait for confirmation
from some other people on this ML. The above is what I *think* is
happening, but I could definitely be completely wrong.

> [...]
> So you have an idea why the real performance is only 50% of the theoretical
> performance? No cpu core is at its limits.
> As i said in my other post. I would be interested to solve the problem but i
> have problems to identify it.
> Peter Landmann

Regards,
Ben.
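Ben's 192KB figure follows from the geometry in the mdadm command: one chunk per drive, with one chunk's worth per stripe going to parity. A minimal sketch of that arithmetic, with a hypothetical helper name, is below. Note that the fio run shown in the original post used 6 drives, for which the same formula gives a different value.

```python
def raid5_stripe_data_kib(n_drives: int, chunk_kib: int) -> int:
    """Data capacity of one full RAID 5 stripe: (n - 1) data chunks."""
    return (n_drives - 1) * chunk_kib


# The mdadm example: 4 drives, 64K chunks.
stripe_kib = raid5_stripe_data_kib(4, 64)

# Number of 4K blocks that fit into the data part of one stripe.
blocks_per_stripe = stripe_kib // 4

print(stripe_kib, blocks_per_stripe)

# The 6-drive array used for the posted fio run.
print(raid5_stripe_data_kib(6, 64))
```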
* Re: RAID 5 doesn't scale
From: Martin Wilck @ 2013-04-03 18:34 UTC
To: Benjamin ESTRABAUD; +Cc: linux-raid

On 04/03/2013 01:21 PM, Benjamin ESTRABAUD wrote:
> It seems that you are running random 4K writes on this array (unless you
> are running the test on the SSD directly here?). If so, you are writing
> lots of 4K sectors on independant 192KB stripes. This means that the
> whole 192KB of stripe needs to be first read, copied to memory, modified
> with the new 4K of data, have its parity calculated and the new stripe
> rewritten to the underlying disks.

That's not strictly necessary. For each 4k block to be written, it's
sufficient to read the data block and the corresponding parity block
(2x4k), calculate the changes in the parity block from the difference
between the old and new data block, and write both data and parity
(2x4k). Thus for every write IOP, 4 RAID IOPS are needed (2x read, 2x
write).

Doesn't MD do it this way?

Martin
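The parity update Martin describes can be illustrated with plain XOR. This is a toy sketch on 8-byte blocks, not md's actual code: the helper names are made up for illustration, and it only shows that updating parity from the old data, new data and old parity gives the same result as recomputing parity over the whole stripe.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks: list) -> bytes:
    """Parity computed from every data block in the stripe."""
    return reduce(xor_blocks, blocks)


def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write: new parity = old parity XOR old data XOR new data."""
    return xor_blocks(old_parity, xor_blocks(old_data, new_data))


# Toy stripe with three data blocks (a 4-drive RAID 5 has 3 data + 1 parity).
data = [bytes([value] * 8) for value in (1, 2, 3)]
parity = full_parity(data)

# Overwrite one block: 2 reads (old data, old parity), 2 writes (data, parity).
new_block = bytes([0xAA] * 8)
updated_parity = rmw_parity(parity, data[1], new_block)
data[1] = new_block

print(updated_parity == full_parity(data))
```

Only the block being written and the parity block are touched, which is where the count of 4 I/Os per small write comes from.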
* Re: RAID 5 doesn't scale
From: Peter Landmann @ 2013-04-03 20:38 UTC
To: linux-raid

Martin Wilck <mwilck <at> arcor.de> writes:
> Doesn't MD do it this way?
>
> Martin

You are right. You can see it with mpstat.

Peter
* Re: RAID 5 doesn't scale
From: Benjamin ESTRABAUD @ 2013-04-04 13:40 UTC
To: linux-raid

On 03/04/13 21:38, Peter Landmann wrote:
> Martin Wilck <mwilck <at> arcor.de> writes:
>
>> Doesn't MD do it this way?
>>
>> Martin
> You are right. You can see it with mpstat.

Thanks both for that, I wasn't aware of this.
* Re: RAID 5 doesn't scale
From: Stan Hoeppner @ 2013-04-03 13:18 UTC
To: Peter Landmann; +Cc: linux-raid

On 4/3/2013 6:00 AM, Peter Landmann wrote:

You didn't mention your stripe_cache_size value. It'll make a lot of
difference. Make sure it's at least 4096. The default is 256.

~$ /bin/echo 4096 > /sys/block/md[X]/md/stripe_cache_size

> FIO settings:
> bs=4096
> iodepth=248
> direct=1
> continue_on_error=1
> rw=randwrite
> ioengine=libaio
> norandommap
> refill_buffers
> group_reporting

> numjobs=1

^^^^^^^^^^^ Even when using AIO you're still serialized when using a
single thread, regardless of queue depth. Thus there is non-trivial
latency between IO operations. Retest with only these global parameters
to get some concurrency. Along with a larger stripe cache your numbers
should go up substantially. This test runs 4 threads/core to ensure you
saturate md with IO.

[global]
zero_buffers
numjobs=24
thread
group_reporting
blocksize=4096
ioengine=libaio
iodepth=16
direct=1
size=8G

> So you have an idea why the real performance is only 50% of the theoretical
> performance?

Three reasons: IO latency, limited stripe_cache_size, parity RMW

> No cpu core is at its limits.

Because you're not cycle limited but latency limited. With this FIO
test your CPU burn should increase a bit.

> As i said in my other post. I would be interested to solve the problem but i
> have problems to identify it.

Note also that you're doing 4KB random writes against RAID5. This is
going to generate substantial RMW cycles. The Intel X25-M G2 is not a
speed demon. Its published max 4KB IOPS throughput is for purely
random writes, not the read+write pattern created by parity RMW. So
while your random read should get a nice jump with this test, your
random write may not improve as much. The limitation here is a function
of the SSD controller on the X25-M G2, not md/RAID5. If you test 5
drives in md/RAID0 you'll see a bump in random write IOPS.

-- 
Stan
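Stan's "latency limited" remark can be checked against the fio output from the original post using Little's law (throughput = requests in flight / average latency). This is a rough sketch, assuming the queue stayed full at iodepth=248 for the whole run. The variable names are illustrative.

```python
# Values copied from the original fio job and its output.
iodepth = 248             # requests kept in flight by the single job
avg_lat_usec = 12533.00   # "lat (usec): ... avg=12533.00"
reported_iops = 19783     # "iops=19783"

# Little's law: completions per second = in-flight requests / latency.
predicted_iops = iodepth / (avg_lat_usec / 1_000_000)

print(round(predicted_iops), reported_iops)
```

The prediction lands within a fraction of a percent of the reported value, which fits the picture that the single job was bounded by per-request latency at that queue depth rather than by CPU.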
* Re: RAID 5 doesn't scale
From: keld @ 2013-04-03 15:23 UTC
To: Stan Hoeppner; +Cc: Peter Landmann, linux-raid

Hi Peter

In general Linux RAID 5 scales well. See
https://raid.wiki.kernel.org/index.php/Performance

Best regards
keld

On Wed, Apr 03, 2013 at 08:18:52AM -0500, Stan Hoeppner wrote:
> On 4/3/2013 6:00 AM, Peter Landmann wrote:
>
> You didn't mention your stripe_cache_size value. It'll make a lot of
> difference. Make sure it's at least 4096. The default is 256.
>
> [...]
* Re: RAID 5 doesn't scale
From: Peter Landmann @ 2013-04-03 15:31 UTC
To: linux-raid

Stan Hoeppner <stan <at> hardwarefreak.com> writes:
> On 4/3/2013 6:00 AM, Peter Landmann wrote:
>
> You didn't mention your stripe_cache_size value. It'll make a lot of
> difference. Make sure it's at least 4096. The default is 256.

You are very right.
I increased it to values between 4096 and 32768 and the performance
increased a lot.
I also played a bit with the deadline parameters, and that helped to
increase performance as well.

With RAID 5 and 6 SSDs I got 33936 IOPS (fio settings as before), which is
not far from the theoretical 40000 (I know from former tests that the
performance could be increased with some more jobs).
For your info: with RAID 6 and 6 SSDs I got 32526 IOPS, which is also a very
good result.

So I conclude that there is no (big) problem with scalability at this hw
level, right?

> ^^^^^^^^^^^ Even when using AIO you're still serialized when using a
> single thread, regardless of queue depth. Thus there is non trivial
> latency between IO operations. Retest with only these global parameters
> to get some concurrency. Along with a larger stripe cache your numbers
> should go up substantially. This test runs 4 threads/core to ensure you
> saturate md with IO.
>
> [global]
> zero_buffers
> numjobs=24
> thread
> group_reporting
> blocksize=4096
> ioengine=libaio
> iodepth=16
> direct=1
> size=8G

Yeah, that brings me near 40k IOPS (RAID 5, 6 SSDs).

> [...]
> Note also that you're doing 4KB random writes against RAID5. This is
> going to generate substantial RMW cycles. The Intel X25-M G2 is not a
> speed daemon. Its published max 4KB IOPS throughput is for purely
> random writes, not the read+write pattern created by parity RMW. So
> while your random read should get a nice jump with this test, your
> random write may not improve as much. The limitation here is a function
> of the SSD controller on the X25-M G2, not md/RAID5. If you test 5
> drives in md/RAID0 you'll see a bump in random write IOPS.

FYI: The scheduler makes the difference. If you alternate writes and reads
in small steps (R W R R W R W W R ..) then the performance decreases
heavily. If you group read and write operations (20xW 20xR 20xW ..) then the
performance will be better. I tested this without RAID, with a patched fio
and the noop scheduler. But I learned that the deadline scheduler can
achieve the same.

Thanks for your information and hints
Peter
* Re: RAID 5 doesn't scale
From: Stan Hoeppner @ 2013-04-03 18:35 UTC
To: Peter Landmann; +Cc: linux-raid

On 4/3/2013 10:31 AM, Peter Landmann wrote:
> Stan Hoeppner <stan <at> hardwarefreak.com> writes:
>
>> On 4/3/2013 6:00 AM, Peter Landmann wrote:
>>
>> You didn't mention your stripe_cache_size value. It'll make a lot of
>> difference. Make sure it's at least 4096. The default is 256.
>
> You are very right.
> I increased it to 4096 - 32768 and the performance increased much.

Be careful here. Increasing stripe_cache_size increases memory
consumption of md dramatically. Formula:

stripe_cache_size * 4096 bytes * drive_count = RAM usage

For a 6 drive array that's

stripe_cache_size   RAM consumed
 4096                96MB
 8192               192MB
16384               384MB
32768               768MB

Thus you want to select a value that gives you the best combination of
performance and lowest memory usage, unless you're not concerned about
RAM.

> Also i played a bit with deadline parameters and it helped also to increase
> performance.
...
> With Raid 5 and 6 SSDs i got 33936 IOPS (fio settings as before) which is not
> far away from theoretical 40000 (i know from former tests that the performance
> could be increased for some more jobs).

Always test with parallel threads. If you don't you're not getting a
realistic picture of what md/RAID and the hardware are capable of.

> For your info: With Raid 6 and 6 SSDs i got 32526 IOPS which is also a very good
> result.
>
> So i conclude that there is no (big) problem with scalability at this hw level,
> right?

Yes. What this demonstrates is that one Thuban core at 2.8-3.3GHz can
apparently execute the md/RAID5/6 write threads faster than these 6
X25-M G2 SSDs can sink the writes. If your CPU was a 1.6GHz Atom and/or
these were newer SATAIII Sandforce based SSDs, you'd peak a CPU core
long before the SSDs run out of headroom.

> FYI: The scheduler makes the difference. If you alternate writes and reades in
> small steps (R W R R W R W W R ..) then the performce decreases heavily. If you
> group read and write operations (20xW 20xR 20xW ..)then the performance will be
> better. Tested it without raid and a patched fio (and noop scheduler). But
> deadline scheduler can reach the same i learned.

The scheduler can make a difference, but with SSDs noop usually gives the
best results. With some SATA/drive controller combos deadline may be
better. CFQ is rarely, if ever, good for performance.

> Thx for your informations and hints

You bet.

-- 
Stan
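Stan's memory formula is easy to tabulate for other values. The following is a small sketch of that formula only (stripe_cache_size * 4096 bytes * drive_count). The function name is hypothetical, and the result is expressed in MiB, which reproduces the figures Stan lists as MB.

```python
PAGE_BYTES = 4096  # per-entry, per-drive size used in Stan's formula


def stripe_cache_ram_mib(stripe_cache_size: int, drive_count: int) -> int:
    """RAM used by the md stripe cache according to the formula above."""
    return stripe_cache_size * PAGE_BYTES * drive_count // (1024 * 1024)


# The default (256) plus the values from the table, for a 6-drive array.
for size in (256, 4096, 8192, 16384, 32768):
    print(f"{size:>6}  {stripe_cache_ram_mib(size, 6)} MiB")
```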
* Re: RAID 5 doesn't scale
From: Martin Wilck @ 2013-04-03 18:23 UTC
To: stan; +Cc: Peter Landmann, linux-raid

On 04/03/2013 03:18 PM, Stan Hoeppner wrote:

> You didn't mention your stripe_cache_size value. It'll make a lot of
> difference. Make sure it's at least 4096. The default is 256.

I'm not getting it - why would stripe cache size matter in a random
read/write test? If the disks are large enough and the pattern is really
random, the cache should hardly ever be hit (s_c_z = 4096 =^ 16MB cache
per disk, that's 0.01% of disk size for a 160GB SSD).

I read that Peter confirmed the influence of stripe_cache_size, but I'd
like to understand why it matters in this case.

Martin
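Martin's estimate can be reproduced in a few lines. This is a sketch of his arithmetic only, with a hypothetical function name. The second call is an addition: it applies the same calculation to the 10 GB per-drive partitions mentioned in the original post, assuming decimal gigabytes.

```python
def cache_fraction_percent(stripe_cache_size: int, region_bytes: float) -> float:
    """Per-drive stripe cache as a percentage of the region being written."""
    cache_bytes = stripe_cache_size * 4096  # 4096 entries -> 16 MiB per drive
    return 100 * cache_bytes / region_bytes


martin_example = cache_fraction_percent(4096, 160e9)  # 160GB SSD
peter_setup = cache_fraction_percent(4096, 10e9)      # 10 GB partition

print(f"{martin_example:.3f}%  {peter_setup:.3f}%")
```

Either way the cache covers well under 1% of the written region, so under this simple model a uniformly random 4K write would rarely find its stripe already cached.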
* Re: RAID 5 doesn't scale
From: Peter Landmann @ 2013-04-03 20:36 UTC
To: linux-raid

> On 04/03/2013 03:18 PM, Stan Hoeppner wrote:
>
> > You didn't mention your stripe_cache_size value. It'll make a lot of
> > difference. Make sure it's at least 4096. The default is 256.
>
> I'm not getting it - why would stripe cache size matter in a random
> read/write test? If the disks are large enough and the pattern is really
> random, the cache should hardly ever be hit (s_c_z = 4096 =^ 16MB cache
> per disk, that's 0.01% of disk size for a 160GB SSD).
>
> I read that Peter confirmed the influence of stripe_cache_size, but I'd
> like to understand why it matters in this case.
>
> Martin

I'm very sorry, but now I can't confirm anymore that stripe_cache_size helps.

My test was too short. With every minute the IOPS decrease. So
stripe_cache_size only helps for very short tests.

I will provide details in another post.

Peter
* Re: RAID 5 doesn't scale
2013-04-03 20:36 ` Peter Landmann
@ 2013-04-03 21:19 ` Peter Landmann
2013-04-03 21:24 ` Stan Hoeppner
1 sibling, 0 replies; 17+ messages in thread
From: Peter Landmann @ 2013-04-03 21:19 UTC (permalink / raw)
To: linux-raid

> I'm very sorry, but I can no longer confirm that stripe_cache_size helps.
>
> My test was too short. The IOPS decrease with every minute, so
> stripe_cache_size only helps for very short tests.
>
> I will provide details in another post.
>

Now I wish I could delete that post. Because SSD performance decreases
with running time (within minutes), I'm not sure whether the better
performance with a higher stripe_cache_size would hold over time, or
whether it is more an effect of a fresh, empty cache that fades with time
(in my scenario of writing many small random blocks).

Short test results:

Raid 5, 3 SSD
stripe_cache_size    noop     deadline (tuned)
256                  18914    18730
16384                18161    19766

Raid 5, 4 SSD
stripe_cache_size    noop     deadline (tuned)
256                  11863    13716
16384                13186    14688

Peter
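To put the short-test table above in perspective, the relative change
when going from stripe_cache_size 256 to 16384 can be computed directly
from the posted figures. This is a small sketch; the dictionary layout
and the helper name pct_change are made up for illustration, and the
numbers are copied from the table as posted.

```python
# Short-test figures copied from the table above:
# {number of SSDs: {stripe_cache_size: {scheduler: result}}}
results = {
    3: {256: {"noop": 18914, "deadline": 18730},
        16384: {"noop": 18161, "deadline": 19766}},
    4: {256: {"noop": 11863, "deadline": 13716},
        16384: {"noop": 13186, "deadline": 14688}},
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for ssds, by_size in results.items():
    for sched in ("noop", "deadline"):
        delta = pct_change(by_size[256][sched], by_size[16384][sched])
        print(f"{ssds} SSDs, {sched}: {delta:+.1f}% (256 -> 16384)")
```

The changes are small in either direction, which fits the caution
expressed above that a short run may not say much about sustained
behaviour.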
* Re: RAID 5 doesn't scale
2013-04-03 20:36 ` Peter Landmann
2013-04-03 21:19 ` Peter Landmann
@ 2013-04-03 21:24 ` Stan Hoeppner
2013-04-03 21:29 ` Peter Landmann
1 sibling, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2013-04-03 21:24 UTC (permalink / raw)
To: Peter Landmann; +Cc: linux-raid

On 4/3/2013 3:36 PM, Peter Landmann wrote:

> I'm very sorry, but I can no longer confirm that stripe_cache_size helps.
>
> My test was too short. The IOPS decrease with every minute, so
> stripe_cache_size only helps for very short tests.

If you're running the tests for multiple minutes and many tens of GBs at
a time, then this slowdown is due to garbage collection, not stripe
cache sizing.

You are not following a proper testing methodology, and you're jumping
to conclusions way too quickly, and likely incorrectly.

If you're not familiar with SSD garbage collection then you must learn
about it. It will affect everything you do with SSDs, especially when
doing these kinds of tests where you're writing huge amounts of data to
the flash cells. Wear leveling, part of garbage collection,
dramatically slows down SSD throughput. And when you're pushing this
much data, TRIM won't help. It'll actually slow the SSDs down even more.

--
Stan
* Re: RAID 5 doesn't scale
2013-04-03 21:24 ` Stan Hoeppner
@ 2013-04-03 21:29 ` Peter Landmann
0 siblings, 0 replies; 17+ messages in thread
From: Peter Landmann @ 2013-04-03 21:29 UTC (permalink / raw)
To: linux-raid

> If you're running the tests for multiple minutes and many tens of GBs at
> a time, then this slowdown is due to garbage collection, not stripe
> cache sizing.

You are right. I wrote about that in another post.

>
> You are not following a proper testing methodology, and you're jumping
> to conclusions way too quickly, and likely incorrectly.

See above.

>
> If you're not familiar with SSD garbage collection then you must learn
> about it. It will affect everything you do with SSDs, especially when
> doing these kinds of tests where you're writing huge amounts of data to
> the flash cells. Wear leveling, part of garbage collection,
> dramatically slows down SSD throughput. And when you're pushing this
> much data, TRIM won't help. It'll actually slow the SSDs down even more.

I was short on time and went too fast. I'm sorry for the trouble, but at
least some people, myself included, could learn something about md :)

Peter
* Re: RAID 5 doesn't scale
2013-04-03 18:23 ` Martin Wilck
2013-04-03 20:36 ` Peter Landmann
@ 2013-04-03 21:15 ` Stan Hoeppner
1 sibling, 0 replies; 17+ messages in thread
From: Stan Hoeppner @ 2013-04-03 21:15 UTC (permalink / raw)
To: Martin Wilck; +Cc: Peter Landmann, linux-raid

On 4/3/2013 1:23 PM, Martin Wilck wrote:
> On 04/03/2013 03:18 PM, Stan Hoeppner wrote:
>
>> You didn't mention your stripe_cache_size value. It'll make a lot of
>> difference. Make sure it's at least 4096. The default is 256.

Actually, the default is 128, not 256, at least with 3.2.6. Not sure
about previous/later versions.

> I'm not getting it - why would stripe cache size matter in a random
> read/write test?

It's very similar to the effect of a greater quantity of write back
cache on a hardware RAID controller, which is why it dramatically
affects write throughput but not read. I believe the proper way to view
this is as a temporary workspace, where md can assemble the stripes to
be written out to the block layer, and store chunks which are read in
for RMW cycles. As with many things in computing, increasing the size
of this working space allows the md driver to work more efficiently.
See below for exactly how it works.

> If the disks are large enough and the pattern is really
> random, the cache should hardly ever be hit (stripe_cache_size = 4096
> corresponds to 16MB of cache per disk, which is 0.01% of the disk size
> for a 160GB SSD).

You seem to be assuming the md "stripe cache" functions like some kind
of generic dumb filesystem cache. It does not.

> I read that Peter confirmed the influence of stripe_cache_size, but I'd
> like to understand why it matters in this case.

If you think the throughput increase in this thread is impressive, see:

http://marc.info/?l=linux-raid&m=136241443706663&w=2

About half way down there is a table showing the effects of
stripe_cache_size from 2048 to 32768. Write throughput increased by
over 600MB/s, from 1018MB/s to 1628MB/s, simply by increasing
stripe_cache_size from 2048 to 4096, and decreased as the stripe cache
was made larger than that. Thus every system has a sweet spot. This was
with 5 Intel 500GB SSDs w/the SandForce 2281 controller, attached to an
LSI 9207-8i, using md/RAID5.

I'd love to explain exactly how the stripe cache works, but to do that I
must first understand it. And I've been unable to find documentation
describing the inner workings of the stripe cache. And since I'm
neither a C nor kernel programmer, I can't look at the code and
understand it, nor then write a document for others. So if you really
want that explanation you'll need to start another thread and bribe Neil
into explaining it.

--
Stan
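Since the message above says every system has its own sweet spot, one
way to look for it is to sweep stripe_cache_size and run the same
benchmark at each setting. The following is a minimal sketch, not a
tested tool: it assumes the usual sysfs location
/sys/block/<md device>/md/stripe_cache_size, needs root to write there,
and takes the benchmark as a caller-supplied function. The names
stripe_cache_path, set_stripe_cache_size, sweep and best are made up for
illustration.

```python
from pathlib import Path


def stripe_cache_path(md_name, sysfs_root="/sys/block"):
    """Path of the stripe_cache_size attribute for an md array (assumed layout)."""
    return Path(sysfs_root) / md_name / "md" / "stripe_cache_size"


def set_stripe_cache_size(md_name, entries, sysfs_root="/sys/block"):
    """Write a new stripe_cache_size and return the value read back."""
    path = stripe_cache_path(md_name, sysfs_root)
    path.write_text(f"{entries}\n")
    return int(path.read_text())


def sweep(md_name, sizes, run_benchmark, sysfs_root="/sys/block"):
    """Run run_benchmark() once per stripe_cache_size; return {size: result}."""
    results = {}
    for entries in sizes:
        set_stripe_cache_size(md_name, entries, sysfs_root)
        results[entries] = run_benchmark()
    return results


def best(results):
    """Setting with the highest benchmark result."""
    return max(results, key=results.get)


# Hypothetical usage (as root), where run_fio is your own function that
# starts the benchmark and returns the measured write throughput or IOPS:
#   results = sweep("md9", [256, 1024, 4096, 16384], run_fio)
#   print(best(results), results)
```

Given the garbage-collection effects discussed elsewhere in this thread,
each run would need to be long enough, and the SSDs in a comparable
state, for the comparison between settings to mean anything.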
* Re: RAID 5 doesn't scale
2013-04-03 13:18 ` Stan Hoeppner
` (2 preceding siblings ...)
2013-04-03 18:23 ` Martin Wilck
@ 2013-04-03 19:56 ` Roy Sigurd Karlsbakk
2013-04-03 21:12 ` Peter Landmann
4 siblings, 0 replies; 17+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-04-03 19:56 UTC (permalink / raw)
To: stan; +Cc: linux-raid, Peter Landmann

----- Original message -----
> On 4/3/2013 6:00 AM, Peter Landmann wrote:
>
> You didn't mention your stripe_cache_size value. It'll make a lot of
> difference. Make sure it's at least 4096. The default is 256.

Looks like Documentation/md.txt (on 3.8.5) says stripe_cache_size,
stripe_cache_active and preread_bypass_threshold are only available for
RAID-5. How can I tune RAID-6 like this?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of xenotypic etymology. In most cases adequate
and relevant synonyms exist in Norwegian.
* Re: RAID 5 doesn't scale
2013-04-03 13:18 ` Stan Hoeppner
` (3 preceding siblings ...)
2013-04-03 19:56 ` Roy Sigurd Karlsbakk
@ 2013-04-03 21:12 ` Peter Landmann
4 siblings, 0 replies; 17+ messages in thread
From: Peter Landmann @ 2013-04-03 21:12 UTC (permalink / raw)
To: linux-raid

> Note also that you're doing 4KB random writes against RAID5. This is
> going to generate substantial RMW cycles. The Intel X25-M G2 is not a
> speed demon. Its published max 4KB IOPS throughput is for purely
> random writes, not the read+write pattern created by parity RMW. So
> while your random read should get a nice jump with this test, your
> random write may not improve as much. The limitation here is a function
> of the SSD controller on the X25-M G2, not md/RAID5. If you test 5
> drives in md/RAID0 you'll see a bump in random write IOPS.
>

It seems so. I let fio run a bit longer, and with every setting the SSD
performance decreased after a few minutes, while mpstat still showed
~100% SSD utilization.
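The cost of the parity RMW cycles mentioned above can be put into
numbers with a simple device-time model. This is a sketch under stated
assumptions: the per-SSD rates (20k write IOPS, 40k read IOPS) and the
measured values are taken from the opening post of this thread, each
4KB array write is assumed to cost 2 reads plus 2 writes on arrays of 4
or more drives, and 1 read plus 2 writes on a 3-drive array. The 3-drive
case is an assumption chosen because it reproduces the 24000 figure from
the opening post; it is not confirmed anywhere in the thread.

```python
READ_IOPS = 40_000   # per SSD, from the opening post
WRITE_IOPS = 20_000  # per SSD, from the opening post


def ios_per_write(n_disks):
    """(reads, writes) issued to the SSDs per 4 KiB array write (assumed)."""
    if n_disks == 3:
        return 1, 2  # assumption: read the other data chunk, write data + parity
    return 2, 2      # read-modify-write: old data + old parity, new data + new parity


def theoretical_write_iops(n_disks):
    """Array write IOPS if the SSDs were the only limit."""
    reads, writes = ios_per_write(n_disks)
    device_seconds = reads / READ_IOPS + writes / WRITE_IOPS
    return n_disks / device_seconds


# Measured random-write IOPS from the opening post
measured = {3: 14497.7, 4: 14005, 5: 17172.3, 6: 19779}

for n, got in measured.items():
    model = theoretical_write_iops(n)
    print(f"{n} SSDs: model {model:.0f}, measured {got:.0f}, "
          f"{got / model:.0%} of model")
```

Under these assumptions the model reproduces the theoretical column of
the opening post, and the measured values land at roughly half to
three fifths of it.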
end of thread, other threads:[~2013-04-04 13:40 UTC | newest]

Thread overview: 17+ messages
2013-04-03 11:00 RAID 5 doesn't scale Peter Landmann
2013-04-03 11:21 ` Benjamin ESTRABAUD
2013-04-03 18:34 ` Martin Wilck
2013-04-03 20:38 ` Peter Landmann
2013-04-04 13:40 ` Benjamin ESTRABAUD
2013-04-03 13:18 ` Stan Hoeppner
2013-04-03 15:23 ` keld
2013-04-03 15:31 ` Peter Landmann
2013-04-03 18:35 ` Stan Hoeppner
2013-04-03 18:23 ` Martin Wilck
2013-04-03 20:36 ` Peter Landmann
2013-04-03 21:19 ` Peter Landmann
2013-04-03 21:24 ` Stan Hoeppner
2013-04-03 21:29 ` Peter Landmann
2013-04-03 21:15 ` Stan Hoeppner
2013-04-03 19:56 ` Roy Sigurd Karlsbakk
2013-04-03 21:12 ` Peter Landmann