* 4x lower IOPS: Linux MD vs indiv. devices - why?
@ 2017-01-23 16:26 Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
2017-01-23 18:18 ` Kudryavtsev, Andrey O
0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 16:26 UTC (permalink / raw)
To: fio
Hi,
I have a question regarding Linux software RAID (MD) as tested with FIO - so
this is slightly OT, but I am hoping for expert advice or redirection to
a more appropriate place (if this is unwelcome here).
I have a box with this HW:
- 88 cores Xeon E7 (176 HTs) + 3TB RAM
- 8 x Intel P3608 4TB NVMe (which is logically 16 NVMes)
With random 4kB read load, I am able to max it out at 7 million IOPS -
but only if I run FIO on the _individual_ NVMe devices.
[global]
group_reporting
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120
[randread]
stonewall
rw=randread
numjobs=2560
When I create a stripe set over all devices:
sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
/dev/nvme0n1 \
/dev/nvme1n1 \
/dev/nvme2n1 \
/dev/nvme3n1 \
/dev/nvme4n1 \
/dev/nvme5n1 \
/dev/nvme6n1 \
/dev/nvme7n1 \
/dev/nvme8n1 \
/dev/nvme9n1 \
/dev/nvme10n1 \
/dev/nvme11n1 \
/dev/nvme12n1 \
/dev/nvme13n1 \
/dev/nvme14n1 \
/dev/nvme15n1
I only get 1.6 million IOPS. Detailed results are below.
Note: the array is created with chunk size 8K because this is for a
database workload. Here I tested with 4k block size, but it's
similar (lower perf on MD) with 8k.
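For reference, here is a minimal sketch of how a RAID0 stripe with this geometry would map array offsets to member devices. It assumes that mdadm's `--chunk=8` means 8 KiB, that all 16 members are the same size (a single striping zone), and that chunks rotate over the members in the order given on the command line. The helper names `raid0_map` and `members_touched` are hypothetical and exist only for this illustration; this is not the kernel's code.

```python
# Illustration only: hypothetical helpers, not taken from the md driver.
CHUNK_BYTES = 8 * 1024   # assumption: mdadm --chunk=8 is specified in KiB
MEMBERS = 16             # --raid-devices=16

def raid0_map(offset, chunk=CHUNK_BYTES, members=MEMBERS):
    """Map a byte offset on the array to (member index, byte offset on that member)."""
    chunk_no, within = divmod(offset, chunk)
    stripe_no, member = divmod(chunk_no, members)
    return member, stripe_no * chunk + within

def members_touched(offset, length, chunk=CHUNK_BYTES, members=MEMBERS):
    """Return the sorted member indexes that a single request of `length` bytes touches."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return sorted({c % members for c in range(first, last + 1)})

if __name__ == "__main__":
    for off in (0, 4096, 8192, 16 * 8192):
        print(off, raid0_map(off), members_touched(off, 4096))
```

Under these assumptions a 4 KiB-aligned 4 KiB read always falls inside one 8 KiB chunk, so each read in the bs=4k test should reach exactly one member device and never needs to be split.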
Any help or hints would be greatly appreciated!
Cheers,
/Tobias
7 million IOPS on raw, individual NVMe devices
==============================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2367 (f=29896):
[_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23
15:47:17 2017
read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
clat percentiles (usec):
| 1.00th=[ 114], 5.00th=[ 135], 10.00th=[ 149], 20.00th=[ 171],
| 30.00th=[ 191], 40.00th=[ 213], 50.00th=[ 239], 60.00th=[ 270],
| 70.00th=[ 314], 80.00th=[ 378], 90.00th=[ 556], 95.00th=[ 980],
| 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
| 99.99th=[ 8096]
lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
lat (usec) : 1000=1.79%
lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
cpu : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s
(28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec
Disk stats (read/write):
nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0,
in_queue=14802400, util=100.00%
nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0,
in_queue=15101276, util=100.00%
nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0,
in_queue=12053112, util=100.00%
nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0,
in_queue=11135004, util=100.00%
nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0,
in_queue=21079576, util=100.00%
nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0,
in_queue=19393024, util=100.00%
nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0,
in_queue=20140104, util=100.00%
nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0,
in_queue=21090048, util=100.00%
nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0,
in_queue=14929172, util=100.00%
nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0,
in_queue=13919288, util=100.00%
nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0,
in_queue=11390392, util=100.00%
nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0,
in_queue=20110288, util=100.00%
nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0,
in_queue=11683568, util=100.00%
nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0,
in_queue=16314628, util=100.00%
nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0,
in_queue=27659920, util=100.00%
nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0,
in_queue=17910636, util=100.00%
1.6 million IOPS on Linux MD over 16 NVMe devices
=================================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23
17:21:15 2017
read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
clat percentiles (usec):
| 1.00th=[ 78], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 89],
| 30.00th=[ 95], 40.00th=[ 102], 50.00th=[ 105], 60.00th=[ 108],
| 70.00th=[ 118], 80.00th=[ 133], 90.00th=[ 173], 95.00th=[ 221],
| 99.00th=[ 358], 99.50th=[ 506], 99.90th=[ 2192], 99.95th=[ 2608],
| 99.99th=[ 2960]
lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
lat (usec) : 1000=0.07%
lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s
(6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec
Disk stats (read/write):
md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0,
aggrin_queue=1247601, aggrutil=100.00%
nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0,
in_queue=1225896, util=100.00%
nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0,
in_queue=1191452, util=100.00%
nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0,
in_queue=1296728, util=100.00%
nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0,
in_queue=1239808, util=100.00%
nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0,
in_queue=1272916, util=100.00%
nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0,
in_queue=1178360, util=100.00%
nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0,
in_queue=1207808, util=100.00%
nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0,
in_queue=1258956, util=100.00%
nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0,
in_queue=1304536, util=100.00%
nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0,
in_queue=1281952, util=100.00%
nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0,
in_queue=1271820, util=100.00%
nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0,
in_queue=1224192, util=100.00%
nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0,
in_queue=1214240, util=100.00%
nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0,
in_queue=1242372, util=100.00%
nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0,
in_queue=1277600, util=100.00%
nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0,
in_queue=1272988, util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
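A quick consistency check on the two reports above, as a sketch rather than a diagnosis: with ioengine=sync and iodepth=1, each thread has at most one I/O in flight, so if every thread were always inside an I/O, the IOPS would be roughly threads / average latency. The function name `implied_iops` is made up for this example; the inputs are the `lat` averages and IOPS figures copied from the reports.

```python
def implied_iops(threads, avg_lat_usec):
    """IOPS expected if every thread always has exactly one I/O in flight."""
    return threads / (avg_lat_usec * 1e-6)

# Figures copied from the two fio reports (avg "lat" in usec, reported IOPS).
RUNS = {
    "raw NVMe": {"threads": 2560, "lat_usec": 360.20, "iops": 6965e3},
    "md RAID0": {"threads": 2560, "lat_usec": 124.58, "iops": 1588e3},
}

if __name__ == "__main__":
    for name, r in RUNS.items():
        expected = implied_iops(r["threads"], r["lat_usec"])
        print(f"{name}: implied {expected / 1e6:.2f}M, "
              f"measured {r['iops'] / 1e6:.2f}M, "
              f"ratio {r['iops'] / expected:.2f}")
```

For the raw devices the measured IOPS is close to the implied value (ratio about 0.98). For md1 it is only about 8% of the implied value, which suggests the threads spend most of their wall-clock time outside the interval that fio measures as latency. These numbers alone do not say where that time goes.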
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 17:52 ` Tobias Oberstein
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 17:52 UTC (permalink / raw)
To: Andrey Kuzmin, fio

On 23.01.2017 at 18:03, Andrey Kuzmin wrote:
> Why don't you just 'perf' your md run and find out where it spends (an
> awful lot of extra) time?

Good idea!

I ran with threads=1024 (to account for perf overhead). At that
concurrency, Linux MD reaches 25% lower IOPS than the individual devices
and has higher system load. Please see here:

https://github.com/oberstet/scratchbox/tree/master/cruncher/sql19/linux-md-bottleneck

With higher concurrency, the discrepancy gets wider, up to 7 million vs
1.6 million IOPS.

I am not a kernel hacker. What is osq_lock?

FWIW, this is a NUMA machine with 4 x E7 (88 cores / 176 HT) and 8 x
Intel P3608 NVMe.

Any hints or anything I should try / measure?

Thanks a lot for your tips and assistance!

Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
@ 2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:33 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio

> You're just running a huge number of threads against the same md device and
> bottleneck on some internal lock. If you step back and set up, say, 256

Ah, alright. Shit.

> threads with ioengine=libaio, qd=128 (to match the in-flight I/O number),
> you'd likely see the locking impact reduced substantially.

The problem with using libaio and QD>1 is that it doesn't represent the
workload I am optimizing for.

The workload is PostgreSQL, which does all its I/O as regular
reads/writes, hence the use of ioengine=sync with large thread counts.

Note: we have an internal tool that is able to parallelize PostgreSQL
via database sessions.

--

I tried anyway. Here is what I get with ioengine=libaio (results are
below):

A)
QD=128 and jobs=8 (same effective I/O concurrency as previously = 1024)

iops=200184

The IOPS stay constant during the run (120s).

B)
QD=128 and jobs=16 (effective concurrency = 2048)

iops=1068.7K

But: the IOPS slowly go up to over 5 million, then collapse to about
20k, and then go up again. Very strange.

C)
QD=128 and jobs=32 (effective concurrency = 4096)

FIO claims: iops=2135.9K

Which is still 3.5x lower than what I get with the sync engine and 2800
threads!

Plus: that strange behavior over run time .. IOPS go up to 10M:

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-29-13-ZEyCVcKZ.1485196199.png

and then collapse to 0 IOPS

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-30-20-GEEEQR6f.1485196243.png

at which point the NVMes don't show any load (I am watching them in
another window).

===

libaio is nowhere near what I get with ioengine=sync and high job
counts. Mmh. Plus the strange behavior.

And as said, that doesn't represent my workload anyway.

I want to stay away from AIO ..

Cheers,
/Tobias


A)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 8 threads
Jobs: 1 (f=1): [_(2),r(1),_(5)] [38.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 03m:23s]
randread: (groupid=0, jobs=8): err= 0: pid=1994: Mon Jan 23 19:23:23 2017
  read : io=93837MB, bw=800739KB/s, iops=200184, runt=120001msec
    slat (usec): min=0, max=4291, avg=39.28, stdev=76.95
    clat (usec): min=2, max=22205, avg=5075.21, stdev=3646.18
     lat (usec): min=5, max=22333, avg=5114.55, stdev=3674.10
    clat percentiles (usec):
     |  1.00th=[  916],  5.00th=[ 1224], 10.00th=[ 1448], 20.00th=[ 1864],
     | 30.00th=[ 2320], 40.00th=[ 2960], 50.00th=[ 3920], 60.00th=[ 5024],
     | 70.00th=[ 6368], 80.00th=[ 8384], 90.00th=[10944], 95.00th=[12608],
     | 99.00th=[14272], 99.50th=[15168], 99.90th=[16768], 99.95th=[17536],
     | 99.99th=[18816]
    bw (KB /s): min=33088, max=400688, per=12.35%, avg=98898.47, stdev=76253.23
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.22%, 1000=1.48%
    lat (msec) : 2=21.67%, 4=27.51%, 10=35.37%, 20=13.74%, 50=0.01%
  cpu : usr=1.53%, sys=13.53%, ctx=7504182, majf=0, minf=1032
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued : total=r=24022368/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=93837MB, aggrb=800738KB/s, minb=800738KB/s, maxb=800738KB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
    md1: ios=7485313/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=468407/0, aggrmerge=0/0, aggrticks=51834/0, aggrin_queue=51770, aggrutil=35.00%
  nvme15n1: ios=468133/0, merge=0/0, ticks=52628/0, in_queue=52532, util=34.39%
  nvme6n1: ios=468355/0, merge=0/0, ticks=48944/0, in_queue=48840, util=32.34%
  nvme9n1: ios=468561/0, merge=0/0, ticks=53924/0, in_queue=53956, util=35.00%
  nvme11n1: ios=468354/0, merge=0/0, ticks=53424/0, in_queue=53396, util=34.70%
  nvme2n1: ios=468418/0, merge=0/0, ticks=51536/0, in_queue=51496, util=33.63%
  nvme14n1: ios=468669/0, merge=0/0, ticks=51696/0, in_queue=51576, util=33.84%
  nvme5n1: ios=468526/0, merge=0/0, ticks=50004/0, in_queue=49928, util=33.00%
  nvme8n1: ios=468233/0, merge=0/0, ticks=52232/0, in_queue=52140, util=33.82%
  nvme10n1: ios=468501/0, merge=0/0, ticks=52532/0, in_queue=52416, util=34.29%
  nvme1n1: ios=468434/0, merge=0/0, ticks=53492/0, in_queue=53404, util=34.58%
  nvme13n1: ios=468544/0, merge=0/0, ticks=51876/0, in_queue=51860, util=33.85%
  nvme4n1: ios=468513/0, merge=0/0, ticks=51172/0, in_queue=51176, util=33.30%
  nvme7n1: ios=468245/0, merge=0/0, ticks=50564/0, in_queue=50484, util=33.14%
  nvme0n1: ios=468318/0, merge=0/0, ticks=49812/0, in_queue=49760, util=32.67%
  nvme12n1: ios=468279/0, merge=0/0, ticks=52416/0, in_queue=52344, util=34.17%
  nvme3n1: ios=468442/0, merge=0/0, ticks=53092/0, in_queue=53016, util=34.37%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


B)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 16 threads
Jobs: 1 (f=1): [_(15),r(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
randread: (groupid=0, jobs=16): err= 0: pid=2141: Mon Jan 23 19:27:38 2017
  read : io=500942MB, bw=4174.5MB/s, iops=1068.7K, runt=120001msec
    slat (usec): min=0, max=3647, avg=11.07, stdev=37.60
    clat (usec): min=2, max=19872, avg=1475.65, stdev=2510.83
     lat (usec): min=4, max=19964, avg=1486.76, stdev=2530.31
    clat percentiles (usec):
     |  1.00th=[  334],  5.00th=[  346], 10.00th=[  358], 20.00th=[  362],
     | 30.00th=[  370], 40.00th=[  378], 50.00th=[  398], 60.00th=[  494],
     | 70.00th=[  780], 80.00th=[ 1480], 90.00th=[ 4256], 95.00th=[ 8032],
     | 99.00th=[12096], 99.50th=[12736], 99.90th=[14272], 99.95th=[14912],
     | 99.99th=[16512]
    bw (KB /s): min= 0, max=1512848, per=8.04%, avg=343481.50, stdev=460791.59
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
    lat (usec) : 250=0.01%, 500=60.27%, 750=8.95%, 1000=4.94%
    lat (msec) : 2=9.33%, 4=5.98%, 10=7.89%, 20=2.63%
  cpu : usr=3.19%, sys=44.95%, ctx=9452424, majf=0, minf=2064
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued : total=r=128241193/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=500942MB, aggrb=4174.5MB/s, minb=4174.5MB/s, maxb=4174.5MB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
    md1: ios=9392258/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=588533/0, aggrmerge=0/0, aggrticks=63464/0, aggrin_queue=63476, aggrutil=36.40%
  nvme15n1: ios=588661/0, merge=0/0, ticks=66932/0, in_queue=66824, util=36.40%
  nvme6n1: ios=589278/0, merge=0/0, ticks=60768/0, in_queue=60600, util=34.84%
  nvme9n1: ios=588744/0, merge=0/0, ticks=64344/0, in_queue=64480, util=35.85%
  nvme11n1: ios=588005/0, merge=0/0, ticks=65636/0, in_queue=65828, util=36.02%
  nvme2n1: ios=588097/0, merge=0/0, ticks=62296/0, in_queue=62440, util=35.00%
  nvme14n1: ios=588451/0, merge=0/0, ticks=64480/0, in_queue=64408, util=35.87%
  nvme5n1: ios=588654/0, merge=0/0, ticks=60736/0, in_queue=60704, util=34.66%
  nvme8n1: ios=588843/0, merge=0/0, ticks=63980/0, in_queue=63928, util=35.40%
  nvme10n1: ios=588315/0, merge=0/0, ticks=62436/0, in_queue=62432, util=35.15%
  nvme1n1: ios=588327/0, merge=0/0, ticks=64432/0, in_queue=64564, util=36.10%
  nvme13n1: ios=588342/0, merge=0/0, ticks=65856/0, in_queue=65892, util=36.06%
  nvme4n1: ios=588343/0, merge=0/0, ticks=64528/0, in_queue=64752, util=35.73%
  nvme7n1: ios=589243/0, merge=0/0, ticks=63740/0, in_queue=63696, util=35.34%
  nvme0n1: ios=588499/0, merge=0/0, ticks=61308/0, in_queue=61268, util=34.83%
  nvme12n1: ios=588221/0, merge=0/0, ticks=62076/0, in_queue=61976, util=35.19%
  nvme3n1: ios=588512/0, merge=0/0, ticks=61880/0, in_queue=61824, util=35.09%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


C)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 32 threads
Jobs: 1 (f=0): [_(24),r(1),_(7)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
randread: (groupid=0, jobs=32): err= 0: pid=2263: Mon Jan 23 19:30:49 2017
  read : io=977.76GB, bw=8343.4MB/s, iops=2135.9K, runt=120001msec
    slat (usec): min=0, max=3372, avg= 7.30, stdev=27.48
    clat (usec): min=1, max=21871, avg=997.26, stdev=1995.10
     lat (usec): min=4, max=21982, avg=1004.60, stdev=2010.61
    clat percentiles (usec):
     |  1.00th=[  374],  5.00th=[  378], 10.00th=[  378], 20.00th=[  386],
     | 30.00th=[  390], 40.00th=[  394], 50.00th=[  394], 60.00th=[  398],
     | 70.00th=[  406], 80.00th=[  540], 90.00th=[ 1496], 95.00th=[ 5408],
     | 99.00th=[10944], 99.50th=[12224], 99.90th=[14016], 99.95th=[14784],
     | 99.99th=[16512]
    bw (KB /s): min= 0, max=1353208, per=5.91%, avg=505187.96, stdev=549388.79
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.01%, 250=0.01%, 500=78.69%, 750=5.80%, 1000=2.94%
    lat (msec) : 2=3.84%, 4=2.66%, 10=4.52%, 20=1.56%, 50=0.01%
  cpu : usr=3.09%, sys=68.19%, ctx=10916103, majf=0, minf=4128
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued : total=r=256309234/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=977.76GB, aggrb=8343.4MB/s, minb=8343.4MB/s, maxb=8343.4MB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
    md1: ios=10762806/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=675866/0, aggrmerge=0/0, aggrticks=70332/0, aggrin_queue=70505, aggrutil=28.65%
  nvme15n1: ios=675832/0, merge=0/0, ticks=69604/0, in_queue=69648, util=27.82%
  nvme6n1: ios=676181/0, merge=0/0, ticks=75584/0, in_queue=75552, util=28.65%
  nvme9n1: ios=675762/0, merge=0/0, ticks=67916/0, in_queue=68236, util=27.79%
  nvme11n1: ios=675745/0, merge=0/0, ticks=68296/0, in_queue=68804, util=27.66%
  nvme2n1: ios=676036/0, merge=0/0, ticks=70904/0, in_queue=71240, util=28.14%
  nvme14n1: ios=675737/0, merge=0/0, ticks=71560/0, in_queue=71716, util=28.13%
  nvme5n1: ios=676592/0, merge=0/0, ticks=71832/0, in_queue=71976, util=28.02%
  nvme8n1: ios=675969/0, merge=0/0, ticks=69152/0, in_queue=69192, util=27.63%
  nvme10n1: ios=675607/0, merge=0/0, ticks=67600/0, in_queue=67668, util=27.74%
  nvme1n1: ios=675528/0, merge=0/0, ticks=72856/0, in_queue=73136, util=28.48%
  nvme13n1: ios=675189/0, merge=0/0, ticks=69736/0, in_queue=70084, util=28.04%
  nvme4n1: ios=676117/0, merge=0/0, ticks=68120/0, in_queue=68600, util=27.88%
  nvme7n1: ios=675726/0, merge=0/0, ticks=72004/0, in_queue=71960, util=28.25%
  nvme0n1: ios=676119/0, merge=0/0, ticks=71228/0, in_queue=71264, util=28.12%
  nvme12n1: ios=675837/0, merge=0/0, ticks=70320/0, in_queue=70368, util=27.99%
  nvme3n1: ios=675887/0, merge=0/0, ticks=68600/0, in_queue=68636, util=27.95%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
^ permalink raw reply [flat|nested] 27+ messages in thread
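A cross-check of the three libaio runs (A, B, C) in the message of 2017-01-23 18:33, offered as a sketch and not as a diagnosis. It recomputes the effective concurrency (numjobs x iodepth) and the IOPS from fio's `issued` counts, then compares `issued` with the number of reads that the `md1` disk stats report. The ioengine=sync run on md1 is included as a reference row. All figures are copied from the reports in this thread; the names `RUNS`, `summarize` and `SUMMARY` are made up for this example.

```python
# Columns: label, numjobs, iodepth, reads issued by fio, reads in md1 disk stats, runtime (s)
RUNS = [
    ("sync", 2560, 1, 190_730_463, 190_632_612, 120.098),
    ("A", 8, 128, 24_022_368, 7_485_313, 120.001),
    ("B", 16, 128, 128_241_193, 9_392_258, 120.001),
    ("C", 32, 128, 256_309_234, 10_762_806, 120.001),
]

def summarize(numjobs, iodepth, issued, md_ios, runtime_s):
    """Derive concurrency, IOPS and the md1/issued ratio for one run."""
    return {
        "in_flight": numjobs * iodepth,   # effective I/O concurrency
        "iops": issued / runtime_s,       # should match fio's reported iops
        "md_share": md_ios / issued,      # share of issued reads seen by md1
    }

SUMMARY = {label: summarize(*rest) for label, *rest in RUNS}

if __name__ == "__main__":
    for label, s in SUMMARY.items():
        print(f"{label:>4}: in-flight={s['in_flight']:5d} "
              f"iops={s['iops'] / 1e6:5.2f}M md1/issued={s['md_share']:.3f}")
```

The recomputed IOPS agree with what fio printed, but in runs A, B and C the md1 disk stats account for only about 31%, 7% and 4% of the reads fio counted as issued, while in the sync run the two counts nearly match. The device utilization of roughly 28-36% in the libaio runs points the same way. Note also that the libaio reports show fio-2.1.11 (invoked as `sudo fio`), whereas the sync reports show fio-2.17-17-g9cf1 (invoked as `/opt/fio/bin/fio`). Whether either observation explains the odd IOPS curve cannot be decided from these numbers alone.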
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:33 ` Tobias Oberstein
@ 2017-01-23 19:10 ` Kudryavtsev, Andrey O
2017-01-23 19:26 ` Tobias Oberstein
2017-01-23 19:13 ` Sitsofe Wheeler
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2 siblings, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:10 UTC (permalink / raw)
To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio@vger.kernel.org
Tobias,
I’d try 128 jobs, QD 32, and disabling the random map and latency measurements:
randrepeat=0
norandommap
disable_lat
--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281
On 1/23/17, 10:33 AM, "fio-owner@vger.kernel.org on behalf of Tobias Oberstein" <fio-owner@vger.kernel.org on behalf of tobias.oberstein@gmail.com> wrote:
> You're just running a huge number of threads against the same md device and
> bottleneck on some internal lock. If you step back and set up, say, 256
Ah, alright. Shit.
> threads with ioengine=libaio, qd=128 (to match the in-flight I/O number),
> you'd likely see the locking impact reduced substantially.
The problem with using libaio and QD>1 is that it doesn't represent the workload I am optimizing for.
The workload is PostgreSQL, which does all its IO as regular reads/writes, hence the use of ioengine=sync with large thread counts.
Note: we have an internal tool that is able to parallelize PostgreSQL via database sessions.
--
I tried anyway. Here is what I get with engine=libaio (results down below):
A) QD=128 and jobs=8 (same effective IO concurrency as previously = 1024)
iops=200184
The IOPS stay constant during the run (120s).
B) QD=128 and jobs=16 (effective concurrency = 2048)
iops=1068.7K
But: the IOPS slowly go up to over 5 million, then collapse to something like 20k, and then go up again. Very strange.
C) QD=128 and jobs=32 (effective concurrency = 4096)
FIO claims: iops=2135.9K
Which is still 3.5x lower than what I get with the sync engine and 2800 threads!
Plus: that strange behavior over run time .. IOPS go up to 10M:
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-29-13-ZEyCVcKZ.1485196199.png
and then collapse to 0 IOPS
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-30-20-GEEEQR6f.1485196243.png
at which point the NVMes don't show any load (I am watching them in another window).
===
libaio is nowhere near what I get with engine=sync and high job counts. Mmh. Plus the strange behavior.
And as said, that doesn't represent my workload anyway. I want to stay away from AIO ..
Cheers,
/Tobias
A)
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 8 threads
Jobs: 1 (f=1): [_(2),r(1),_(5)] [38.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 03m:23s]
randread: (groupid=0, jobs=8): err= 0: pid=1994: Mon Jan 23 19:23:23 2017
  read : io=93837MB, bw=800739KB/s, iops=200184, runt=120001msec
    slat (usec): min=0, max=4291, avg=39.28, stdev=76.95
    clat (usec): min=2, max=22205, avg=5075.21, stdev=3646.18
     lat (usec): min=5, max=22333, avg=5114.55, stdev=3674.10
    clat percentiles (usec):
     | 1.00th=[ 916], 5.00th=[ 1224], 10.00th=[ 1448], 20.00th=[ 1864],
     | 30.00th=[ 2320], 40.00th=[ 2960], 50.00th=[ 3920], 60.00th=[ 5024],
     | 70.00th=[ 6368], 80.00th=[ 8384], 90.00th=[10944], 95.00th=[12608],
     | 99.00th=[14272], 99.50th=[15168], 99.90th=[16768], 99.95th=[17536],
     | 99.99th=[18816]
    bw (KB /s): min=33088, max=400688, per=12.35%, avg=98898.47, stdev=76253.23
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.22%, 1000=1.48%
    lat (msec) : 2=21.67%, 4=27.51%, 10=35.37%, 20=13.74%, 50=0.01%
  cpu : usr=1.53%, sys=13.53%, ctx=7504182, majf=0, minf=1032
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%,
8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued : total=r=24022368/w=0/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=128 Run status group 0 (all jobs): READ: io=93837MB, aggrb=800738KB/s, minb=800738KB/s, maxb=800738KB/s, mint=120001msec, maxt=120001msec Disk stats (read/write): md1: ios=7485313/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=468407/0, aggrmerge=0/0, aggrticks=51834/0, aggrin_queue=51770, aggrutil=35.00% nvme15n1: ios=468133/0, merge=0/0, ticks=52628/0, in_queue=52532, util=34.39% nvme6n1: ios=468355/0, merge=0/0, ticks=48944/0, in_queue=48840, util=32.34% nvme9n1: ios=468561/0, merge=0/0, ticks=53924/0, in_queue=53956, util=35.00% nvme11n1: ios=468354/0, merge=0/0, ticks=53424/0, in_queue=53396, util=34.70% nvme2n1: ios=468418/0, merge=0/0, ticks=51536/0, in_queue=51496, util=33.63% nvme14n1: ios=468669/0, merge=0/0, ticks=51696/0, in_queue=51576, util=33.84% nvme5n1: ios=468526/0, merge=0/0, ticks=50004/0, in_queue=49928, util=33.00% nvme8n1: ios=468233/0, merge=0/0, ticks=52232/0, in_queue=52140, util=33.82% nvme10n1: ios=468501/0, merge=0/0, ticks=52532/0, in_queue=52416, util=34.29% nvme1n1: ios=468434/0, merge=0/0, ticks=53492/0, in_queue=53404, util=34.58% nvme13n1: ios=468544/0, merge=0/0, ticks=51876/0, in_queue=51860, util=33.85% nvme4n1: ios=468513/0, merge=0/0, ticks=51172/0, in_queue=51176, util=33.30% nvme7n1: ios=468245/0, merge=0/0, ticks=50564/0, in_queue=50484, util=33.14% nvme0n1: ios=468318/0, merge=0/0, ticks=49812/0, in_queue=49760, util=32.67% nvme12n1: ios=468279/0, merge=0/0, ticks=52416/0, in_queue=52344, util=34.17% nvme3n1: ios=468442/0, merge=0/0, ticks=53092/0, in_queue=53016, util=34.37% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ B) oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio 
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128 ... fio-2.1.11 Starting 16 threads Jobs: 1 (f=1): [_(15),r(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s] randread: (groupid=0, jobs=16): err= 0: pid=2141: Mon Jan 23 19:27:38 2017 read : io=500942MB, bw=4174.5MB/s, iops=1068.7K, runt=120001msec slat (usec): min=0, max=3647, avg=11.07, stdev=37.60 clat (usec): min=2, max=19872, avg=1475.65, stdev=2510.83 lat (usec): min=4, max=19964, avg=1486.76, stdev=2530.31 clat percentiles (usec): | 1.00th=[ 334], 5.00th=[ 346], 10.00th=[ 358], 20.00th=[ 362], | 30.00th=[ 370], 40.00th=[ 378], 50.00th=[ 398], 60.00th=[ 494], | 70.00th=[ 780], 80.00th=[ 1480], 90.00th=[ 4256], 95.00th=[ 8032], | 99.00th=[12096], 99.50th=[12736], 99.90th=[14272], 99.95th=[14912], | 99.99th=[16512] bw (KB /s): min= 0, max=1512848, per=8.04%, avg=343481.50, stdev=460791.59 lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01% lat (usec) : 250=0.01%, 500=60.27%, 750=8.95%, 1000=4.94% lat (msec) : 2=9.33%, 4=5.98%, 10=7.89%, 20=2.63% cpu : usr=3.19%, sys=44.95%, ctx=9452424, majf=0, minf=2064 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued : total=r=128241193/w=0/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=128 Run status group 0 (all jobs): READ: io=500942MB, aggrb=4174.5MB/s, minb=4174.5MB/s, maxb=4174.5MB/s, mint=120001msec, maxt=120001msec Disk stats (read/write): md1: ios=9392258/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=588533/0, aggrmerge=0/0, aggrticks=63464/0, aggrin_queue=63476, aggrutil=36.40% nvme15n1: ios=588661/0, merge=0/0, ticks=66932/0, in_queue=66824, util=36.40% nvme6n1: ios=589278/0, merge=0/0, ticks=60768/0, in_queue=60600, util=34.84% nvme9n1: ios=588744/0, merge=0/0, ticks=64344/0, in_queue=64480, 
util=35.85% nvme11n1: ios=588005/0, merge=0/0, ticks=65636/0, in_queue=65828, util=36.02% nvme2n1: ios=588097/0, merge=0/0, ticks=62296/0, in_queue=62440, util=35.00% nvme14n1: ios=588451/0, merge=0/0, ticks=64480/0, in_queue=64408, util=35.87% nvme5n1: ios=588654/0, merge=0/0, ticks=60736/0, in_queue=60704, util=34.66% nvme8n1: ios=588843/0, merge=0/0, ticks=63980/0, in_queue=63928, util=35.40% nvme10n1: ios=588315/0, merge=0/0, ticks=62436/0, in_queue=62432, util=35.15% nvme1n1: ios=588327/0, merge=0/0, ticks=64432/0, in_queue=64564, util=36.10% nvme13n1: ios=588342/0, merge=0/0, ticks=65856/0, in_queue=65892, util=36.06% nvme4n1: ios=588343/0, merge=0/0, ticks=64528/0, in_queue=64752, util=35.73% nvme7n1: ios=589243/0, merge=0/0, ticks=63740/0, in_queue=63696, util=35.34% nvme0n1: ios=588499/0, merge=0/0, ticks=61308/0, in_queue=61268, util=34.83% nvme12n1: ios=588221/0, merge=0/0, ticks=62076/0, in_queue=61976, util=35.19% nvme3n1: ios=588512/0, merge=0/0, ticks=61880/0, in_queue=61824, util=35.09% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ C) oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128 ... 
fio-2.1.11 Starting 32 threads Jobs: 1 (f=0): [_(24),r(1),_(7)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s] randread: (groupid=0, jobs=32): err= 0: pid=2263: Mon Jan 23 19:30:49 2017 read : io=977.76GB, bw=8343.4MB/s, iops=2135.9K, runt=120001msec slat (usec): min=0, max=3372, avg= 7.30, stdev=27.48 clat (usec): min=1, max=21871, avg=997.26, stdev=1995.10 lat (usec): min=4, max=21982, avg=1004.60, stdev=2010.61 clat percentiles (usec): | 1.00th=[ 374], 5.00th=[ 378], 10.00th=[ 378], 20.00th=[ 386], | 30.00th=[ 390], 40.00th=[ 394], 50.00th=[ 394], 60.00th=[ 398], | 70.00th=[ 406], 80.00th=[ 540], 90.00th=[ 1496], 95.00th=[ 5408], | 99.00th=[10944], 99.50th=[12224], 99.90th=[14016], 99.95th=[14784], | 99.99th=[16512] bw (KB /s): min= 0, max=1353208, per=5.91%, avg=505187.96, stdev=549388.79 lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.01%, 250=0.01%, 500=78.69%, 750=5.80%, 1000=2.94% lat (msec) : 2=3.84%, 4=2.66%, 10=4.52%, 20=1.56%, 50=0.01% cpu : usr=3.09%, sys=68.19%, ctx=10916103, majf=0, minf=4128 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued : total=r=256309234/w=0/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=128 Run status group 0 (all jobs): READ: io=977.76GB, aggrb=8343.4MB/s, minb=8343.4MB/s, maxb=8343.4MB/s, mint=120001msec, maxt=120001msec Disk stats (read/write): md1: ios=10762806/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=675866/0, aggrmerge=0/0, aggrticks=70332/0, aggrin_queue=70505, aggrutil=28.65% nvme15n1: ios=675832/0, merge=0/0, ticks=69604/0, in_queue=69648, util=27.82% nvme6n1: ios=676181/0, merge=0/0, ticks=75584/0, in_queue=75552, util=28.65% nvme9n1: ios=675762/0, merge=0/0, ticks=67916/0, in_queue=68236, util=27.79% nvme11n1: ios=675745/0, merge=0/0, ticks=68296/0, 
in_queue=68804, util=27.66% nvme2n1: ios=676036/0, merge=0/0, ticks=70904/0, in_queue=71240, util=28.14% nvme14n1: ios=675737/0, merge=0/0, ticks=71560/0, in_queue=71716, util=28.13% nvme5n1: ios=676592/0, merge=0/0, ticks=71832/0, in_queue=71976, util=28.02% nvme8n1: ios=675969/0, merge=0/0, ticks=69152/0, in_queue=69192, util=27.63% nvme10n1: ios=675607/0, merge=0/0, ticks=67600/0, in_queue=67668, util=27.74% nvme1n1: ios=675528/0, merge=0/0, ticks=72856/0, in_queue=73136, util=28.48% nvme13n1: ios=675189/0, merge=0/0, ticks=69736/0, in_queue=70084, util=28.04% nvme4n1: ios=676117/0, merge=0/0, ticks=68120/0, in_queue=68600, util=27.88% nvme7n1: ios=675726/0, merge=0/0, ticks=72004/0, in_queue=71960, util=28.25% nvme0n1: ios=676119/0, merge=0/0, ticks=71228/0, in_queue=71264, util=28.12% nvme12n1: ios=675837/0, merge=0/0, ticks=70320/0, in_queue=70368, util=27.99% nvme3n1: ios=675887/0, merge=0/0, ticks=68600/0, in_queue=68636, util=27.95% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ -- To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:10 ` Kudryavtsev, Andrey O
@ 2017-01-23 19:26 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:26 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, Andrey Kuzmin; +Cc: fio@vger.kernel.org
Hi Andrey,
On 23.01.2017 at 20:10, Kudryavtsev, Andrey O wrote:
> Tobias,
> I’d try 128 jobs, QD 32 and disable random map and latency measurements
> randrepeat=0
> norandommap
I had those already set ..
> disable_lat
>
This I hadn't set.
Using the settings you suggest on the MD over 16 NVMes, and after increasing the AIO request limit to
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat /proc/sys/fs/aio-max-nr
1048576
I get iops=4082.2K, which is much closer to the 7 million IOPS I get with engine=sync and jobs=2800.
Cheers,
/Tobias
PS: I am still working on your other hints .. so many tips. Thanks guys!
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.1.11
Starting 128 threads
Jobs: 127 (f=0): [r(51),E(1),r(76)] [3.5% done] [15018MB/0KB/0KB /s] [3845K/0/0 iops] [eta 14m:11s]
randread: (groupid=0, jobs=128): err= 0: pid=5878: Mon Jan 23 20:25:01 2017
  read : io=478427MB, bw=15946MB/s, iops=4082.2K, runt= 30003msec
    slat (usec): min=1, max=47954, avg=29.39, stdev=34.90
    clat (usec): min=37, max=49119, avg=972.35, stdev=673.40
    clat percentiles (usec):
     | 1.00th=[ 338], 5.00th=[ 446], 10.00th=[ 532], 20.00th=[ 660],
     | 30.00th=[ 756], 40.00th=[ 836], 50.00th=[ 892], 60.00th=[ 956],
     | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1224], 95.00th=[ 1368],
     | 99.00th=[ 4832], 99.50th=[ 5664], 99.90th=[ 6816], 99.95th=[ 7328],
     | 99.99th=[ 8896]
    bw (KB /s): min=14024, max=393664, per=0.78%, avg=127573.83, stdev=51679.15
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.07%, 500=8.15%, 750=21.53%
    lat (usec) : 1000=37.36%
    lat (msec) : 2=29.83%, 4=1.53%, 10=1.53%, 20=0.01%, 50=0.01%
  cpu : usr=5.34%, sys=94.48%, ctx=11411, majf=0, minf=4224
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued : total=r=122477269/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: io=478427MB, aggrb=15946MB/s, minb=15946MB/s, maxb=15946MB/s, mint=30003msec, maxt=30003msec
Disk stats (read/write):
  md1: ios=121675684/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=7654829/0, aggrmerge=0/0, aggrticks=985171/0, aggrin_queue=1037857, aggrutil=100.00%
  nvme15n1: ios=7650998/0, merge=0/0, ticks=938492/0, in_queue=968336, util=100.00%
  nvme6n1: ios=7655891/0, merge=0/0, ticks=1044320/0, in_queue=1074048, util=100.00%
  nvme9n1: ios=7654289/0, merge=0/0, ticks=954912/0, in_queue=1043060, util=100.00%
  nvme11n1: ios=7656494/0, merge=0/0, ticks=955896/0, in_queue=1050748, util=100.00%
  nvme2n1: ios=7656190/0, merge=0/0,
ticks=998112/0, in_queue=1090236, util=100.00%
  nvme14n1: ios=7655685/0, merge=0/0, ticks=956648/0, in_queue=982168, util=100.00%
  nvme5n1: ios=7652531/0, merge=0/0, ticks=1040592/0, in_queue=1068920, util=100.00%
  nvme8n1: ios=7652934/0, merge=0/0, ticks=969800/0, in_queue=994468, util=100.00%
  nvme10n1: ios=7655795/0, merge=0/0, ticks=949068/0, in_queue=975252, util=100.00%
  nvme1n1: ios=7652373/0, merge=0/0, ticks=955772/0, in_queue=1040828, util=100.00%
  nvme13n1: ios=7654611/0, merge=0/0, ticks=965664/0, in_queue=1053560, util=100.00%
  nvme4n1: ios=7655941/0, merge=0/0, ticks=1001460/0, in_queue=1113764, util=100.00%
  nvme7n1: ios=7652420/0, merge=0/0, ticks=991072/0, in_queue=1018248, util=100.00%
  nvme0n1: ios=7656124/0, merge=0/0, ticks=1051448/0, in_queue=1083992, util=100.00%
  nvme12n1: ios=7656450/0, merge=0/0, ticks=1031252/0, in_queue=1064052, util=100.00%
  nvme3n1: ios=7658543/0, merge=0/0, ticks=958228/0, in_queue=984040, util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat postgresql_storage_workload.fio
[global]
group_reporting
#filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
filename=/dev/md1
#filename=/data/test.dat
#filename=/dev/data/data
size=30G
#ioengine=sync
#iodepth=1
ioengine=libaio
iodepth=32
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
#bs=8k
bs=4k
#ramp_time=0
runtime=30
[randread]
stonewall
rw=randread
numjobs=128
#[randwrite]
#stonewall
#rw=randwrite
#numjobs=32
#[randreadwrite7030]
#stonewall
#rw=randrw
#rwmixread=70
#numjobs=256
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
@ 2017-01-23 19:13 ` Sitsofe Wheeler
2017-01-23 19:40 ` Tobias Oberstein
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 19:13 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio@vger.kernel.org
On 23 January 2017 at 18:33, Tobias Oberstein <tobias.oberstein@gmail.com> wrote:
>
> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
> Plus the strange behavior.
Have you tried batching the IOs and controlling how much you are reaping at any one time? See
http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
for some of the options for controlling this...
--
Sitsofe | http://sucs.org/~sits/
^ permalink raw reply [flat|nested] 27+ messages in thread
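The batching options referred to above can be sketched as a job-file fragment. This is only an illustration under assumptions: the batch sizes of 8 are placeholder values to experiment with, and `iodepth_batch_complete` is the older spelling of what newer fio releases document as `iodepth_batch_complete_min`.

```ini
# Sketch only: the batch sizes below are assumed starting points, not tuned values.
[global]
ioengine=libaio
direct=1
iodepth=32
# hand up to 8 I/Os to the kernel per submission instead of one at a time
iodepth_batch_submit=8
# ask for at least 8 completions each time finished I/Os are reaped
iodepth_batch_complete=8
```

Larger batches should mean fewer submit/reap system calls per I/O, at the price of individual requests possibly waiting longer before they are submitted or reaped.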
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:13 ` Sitsofe Wheeler
@ 2017-01-23 19:40 ` Tobias Oberstein
2017-01-23 20:24 ` Sitsofe Wheeler
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:40 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio@vger.kernel.org
On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:
> On 23 January 2017 at 18:33, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>>
>> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
>> Plus the strange behavior.
>
> Have you tried batching the IOs and controlling how much you are
> reaping at any one time? See
> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
> for some of the options for controlling this...
>
Thanks! Nice.
For libaio, and with all the hints applied (no 4k sectors yet), I get (4k randread)
Individual NVMes: iops=7350.4K
MD (RAID-0) over NVMes: iops=4112.8K
The going up and down of IOPS is gone.
It's becoming more apparent, I'd say, that there is an MD bottleneck though.
Cheers,
/Tobias
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat best_libaio.fio
# sudo sh -c 'echo "1048576" > /proc/sys/fs/aio-max-nr'
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=32
iodepth_batch_submit=8
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio best_libaio.fio
randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
randread-md-over-nvmes: (g=1): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.1.11
Starting 256 threads
Jobs: 128 (f=128): [_(128),r(128)] [7.9% done] [16173MB/0KB/0KB /s] [4140K/0/0 iops] [eta 11m:51s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=6988: Mon Jan 23 20:37:30 2017
  read : io=861513MB, bw=28712MB/s, iops=7350.4K, runt= 30005msec
    slat (usec): min=1, max=179194, avg= 9.61, stdev=166.67
    clat (usec): min=8, max=174722, avg=543.86, stdev=736.75
    clat percentiles (usec):
     | 1.00th=[ 117], 5.00th=[ 139], 10.00th=[ 153], 20.00th=[ 175],
     | 30.00th=[ 199], 40.00th=[ 223], 50.00th=[ 258], 60.00th=[ 302],
     | 70.00th=[ 394], 80.00th=[ 636], 90.00th=[ 1480], 95.00th=[ 2192],
     | 99.00th=[ 3408], 99.50th=[ 3856], 99.90th=[ 4960], 99.95th=[ 5536],
     | 99.99th=[10048]
    bw (KB /s): min=14992, max=432176, per=0.78%, avg=229721.98, stdev=44902.57
    lat (usec) : 10=0.01%, 50=0.01%, 100=0.10%, 250=48.21%, 500=27.38%
    lat (usec) : 750=6.48%, 1000=3.18%
    lat (msec) : 2=8.54%, 4=5.73%, 10=0.38%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu : usr=8.25%, sys=64.76%, ctx=57533651, majf=0, minf=4224
  IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued : total=r=220547266/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=32
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=7138: Mon Jan 23 20:37:30 2017
  read : io=482013MB, bw=16065MB/s, iops=4112.8K, runt= 30003msec
    slat (usec): min=1, max=48048, avg=29.39, stdev=36.10
    clat (usec): min=47, max=74459, avg=964.89, stdev=637.97
    clat percentiles (usec):
     | 1.00th=[ 454], 5.00th=[ 540], 10.00th=[ 604], 20.00th=[ 692],
     | 30.00th=[ 764], 40.00th=[ 828], 50.00th=[ 876], 60.00th=[ 924],
     | 70.00th=[ 980], 80.00th=[ 1064], 90.00th=[ 1176], 95.00th=[ 1320],
     | 99.00th=[ 4768], 99.50th=[ 5536], 99.90th=[ 6432], 99.95th=[ 6752],
     | 99.99th=[ 7968]
    bw (KB /s): min=14512, max=350248, per=0.78%, avg=128572.72, stdev=42938.35
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.03%, 500=2.69%, 750=24.84%
    lat (usec) : 1000=45.08%
    lat (msec) : 2=24.43%, 4=1.40%, 10=1.51%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%
  cpu : usr=4.98%, sys=94.81%, ctx=12736, majf=0, minf=3328
  IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued : total=r=123395206/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: io=861513MB, aggrb=28712MB/s, minb=28712MB/s, maxb=28712MB/s, mint=30005msec, maxt=30005msec
Run status group 1 (all jobs):
   READ: io=482013MB, aggrb=16065MB/s, minb=16065MB/s, maxb=16065MB/s, mint=30003msec, maxt=30003msec
Disk stats (read/write):
  nvme0n1: ios=13713322/0, merge=0/0, ticks=2809744/0, in_queue=2867236, util=98.51%
  nvme1n1: ios=13713230/0, merge=0/0, ticks=11534416/0, in_queue=12284600, util=99.60%
  nvme2n1: ios=13713491/0, merge=0/0, ticks=9773908/0, in_queue=10359404, util=99.80%
  nvme3n1: ios=13713296/0, merge=0/0, ticks=6619552/0, in_queue=6803384, util=99.49%
  nvme4n1: ios=13713658/0, merge=0/0, ticks=6055532/0, in_queue=6533236, util=100.00%
  nvme5n1: ios=13713740/0, merge=0/0, ticks=2863528/0, in_queue=2931544, util=99.89%
  nvme6n1: ios=13713827/0, merge=0/0, ticks=2796528/0, in_queue=2859208, util=99.72%
  nvme7n1: ios=13713905/0, merge=0/0, ticks=2846160/0, in_queue=2904800, util=99.74%
  nvme8n1: ios=13713529/0, merge=0/0, ticks=7422588/0, in_queue=7582496, util=100.00%
  nvme9n1: ios=13713414/0, merge=0/0, ticks=13762972/0, in_queue=14664088, util=100.00%
  nvme10n1: ios=13714158/0, merge=0/0, ticks=6570356/0, in_queue=6735324, util=100.00%
  nvme11n1: ios=13714217/0, merge=0/0, ticks=4189764/0, in_queue=4519824, util=100.00%
  nvme12n1: ios=13714299/0, merge=0/0, ticks=7225476/0, in_queue=7393668, util=100.00%
  nvme13n1: ios=13714375/0, merge=0/0, ticks=4988804/0, in_queue=5267536, util=100.00%
  nvme14n1: ios=13714461/0, merge=0/0, ticks=7336928/0, in_queue=7502260, util=100.00%
  nvme15n1: ios=13713918/0, merge=0/0, ticks=11861500/0, in_queue=12202492, util=100.00%
  md1: ios=123098498/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:40 ` Tobias Oberstein
@ 2017-01-23 20:24 ` Sitsofe Wheeler
2017-01-23 21:22 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 20:24 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio@vger.kernel.org
On 23 January 2017 at 19:40, Tobias Oberstein <tobias.oberstein@gmail.com> wrote:
> On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:
>>
>> On 23 January 2017 at 18:33, Tobias Oberstein
>> <tobias.oberstein@gmail.com> wrote:
>>>
>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>> Mmh.
>>> Plus the strange behavior.
>>
>> Have you tried batching the IOs and controlling how much you are
>> reaping at any one time? See
>>
>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>> for some of the options for controlling this...
>
> Thanks! Nice.
>
> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
> randread)
>
> Individual NVMes: iops=7350.4K
> MD (RAID-0) over NVMes: iops=4112.8K
>
> The going up and down of IOPS is gone.
>
> It's becoming more apparent, I'd say, that there is an MD bottleneck though.
If you're "just" trying for higher IOPS you can also try gtod_reduce (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce). This subsumes things like disable_lat, but you'll get fewer and less accurate measurement stats back.
With libaio, userspace_reap (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap) can sometimes nudge numbers up, but at the cost of CPU.
--
Sitsofe | http://sucs.org/~sits/
^ permalink raw reply [flat|nested] 27+ messages in thread
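The two options mentioned above might be combined roughly as follows. This is a sketch under assumptions rather than a recommended configuration: the queue depth is a placeholder, and setting `iodepth_batch_complete=0` alongside `userspace_reap` reflects the fio documentation's note that user-space reaping is only used when polling for a minimum of 0 events.

```ini
# Sketch only: gives up measurement detail (and some CPU) in exchange for IOPS.
[global]
ioengine=libaio
direct=1
# assumed queue depth, adjust to the workload
iodepth=32
# switch on the gettimeofday()-reducing options in one go
gtod_reduce=1
# read the AIO completion ring directly from user space instead of calling io_getevents()
userspace_reap
# user-space reaping is only used when polling for a minimum of 0 events
iodepth_batch_complete=0
```

With `gtod_reduce=1` the latency sections of the fio report become sparse or disappear, so runs made this way are best compared on IOPS and bandwidth only.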
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 20:24 ` Sitsofe Wheeler
@ 2017-01-23 21:22 ` Tobias Oberstein
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:22 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio@vger.kernel.org
On 23.01.2017 at 21:24, Sitsofe Wheeler wrote:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>> On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:
>>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>
>>> Have you tried batching the IOs and controlling how much you are
>>> reaping at any one time? See
>>>
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent, I'd say, that there is an MD bottleneck though.
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
>
Using that option plus bumping to QD=64 and batch submit 16, I get
plain NVMes: iops=7415.9K
MD over NVMes: iops=4112.4K
These are staggering numbers for sure!
In fact, the Intel P3608 4TB datasheet says: up to 850k IOPS for random 4kB reads.
Since we have 8 (physical) of these, the real-world measurement (7.4 million) is even above the datasheet figure (8 x 850k = 6.8 million).
I'd say: very good job Intel =)
The price of course is the CPU load to reach these numbers .. we have the 2nd largest Intel Xeon available
Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
and 4 of these .. and even that isn't enough to saturate these NVMe beasts while still having room to do useful work (PostgreSQL).
So we're gonna be CPU bound .. again - this is the 2nd iteration of such a box. The first one has 48 cores E7 v2 and 8 x P3700 2TB. Also CPU bound on PostgreSQL anyway .. with 3TB RAM.
Cheers,
/Tobias
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon Jan 23 22:12:30 2017
  read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
  cpu : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon Jan 23 22:12:30 2017
  read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
  cpu : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
@ 2017-01-23 21:42 ` Andrey Kuzmin
2017-01-23 23:51 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-23 21:42 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Jens Axboe, fio
[-- Attachment #1: Type: text/plain, Size: 4078 bytes --]
On Jan 24, 2017 00:22, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
On 23.01.2017 at 21:24, Sitsofe Wheeler wrote:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>
>> On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:
>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>
>>> Have you tried batching the IOs and controlling how many you are
>>> reaping at any one time? See
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent, I'd say, that there is an MD bottleneck
>> though.
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
Using that option plus bumping to QD=64 and batch submit 16, I get
plain NVMes: iops=7415.9K
MD over NVMes: iops=4112.4K
These are staggering numbers for sure!
In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB
Since we have 8 (physical) of these, the real world measurement (7.4
million) is even above the datasheet (6.8 million).
I'd say: very good job Intel =)
The price of course is the CPU load to reach these numbers .. we have
the 2nd largest Intel Xeon available
Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
and 4 of these .. and even that isn't enough to saturate these NVMe
beasts while still having room to do useful work (PostgreSQL).
The root cause behind the high cpu utilization is the IRQ load your eight
NVMe drives generate, although context switching your 2048 threads also adds
a lot.
To cope with the unsustainable interrupt rate, you might want to give a
shot to the psync engine with RWF_HIPRI option set, which turns on polling
mode in the block layer (Jens has been very much behind it, so he's the guy
in the know of the details).
Polling avoids interrupts at the price of the somewhat inflated latency,
but reduces the cpu load noticeably, so it may turn out a good option for
your box specifically. Notice you'll need preadv2/pwritev2 syscall support
in your kernel.
Regards,
Andrey
So we're gonna be CPU bound .. again - this is the 2nd iteration of such
a box. The first one has 48 cores E7 v2 and 8 x P3700 2TB. Also CPU
bound on PostgreSQL anyway .. with 3TB RAM.
Cheers,
/Tobias
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon Jan 23 22:12:30 2017
read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
cpu : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon Jan 23 22:12:30 2017
read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
cpu : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
[-- Attachment #2: Type: text/html, Size: 6568 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
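The datasheet comparison in the message above is easy to check with a few lines of arithmetic. The following is only a sketch in Python that reuses the figures quoted in the thread (850k IOPS per P3608, 8 physical cards, 7415.9K IOPS raw, 4112.4K IOPS on MD); the variable names are illustrative and not taken from any tool.

```python
# Figures quoted in the thread; the names are illustrative only.
DATASHEET_IOPS_PER_CARD = 850_000   # "up to 850k random 4kB" per P3608
PHYSICAL_CARDS = 8

measured_raw_iops = 7_415_900       # libaio, individual NVMe devices
measured_md_iops = 4_112_400        # libaio, MD RAID-0 over the same devices

# Aggregate datasheet figure and the two ratios discussed in the message.
datasheet_total = DATASHEET_IOPS_PER_CARD * PHYSICAL_CARDS
raw_vs_datasheet = measured_raw_iops / datasheet_total
md_vs_raw = measured_md_iops / measured_raw_iops

print(f"datasheet total : {datasheet_total / 1e6:.1f} M IOPS")
print(f"raw / datasheet : {raw_vs_datasheet:.2f}")
print(f"MD / raw        : {md_vs_raw:.1%}")
```

With these inputs the raw devices come out roughly 9% above the aggregate datasheet figure, and MD delivers about 55% of the raw rate.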
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 21:42 ` Andrey Kuzmin
@ 2017-01-23 23:51 ` Tobias Oberstein
2017-01-24 8:21 ` Andrey Kuzmin
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 23:51 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: Jens Axboe, fio
> The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching your 2048 threads also adds
> a lot.
Indeed, the ctx switches and interrupts are in the millions/sec.
With engine=sync and numjobs=2048, I have
ctx_sw: 8828446
inter: 5780374
It's astonishing that this is even possible.
> To cope with the unsustainable interrupt rate, you might want to give a
> shot to the psync engine with RWF_HIPRI option set, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of the somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out a good option for
> your box specifically. Notice you'll need preadv2/pwritev2 syscall support
> in your kernel.
I have run an exhaustive set of 30 tests using the different engines,
including pvsync2 + hipri.
Please find everything here
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md
and in the containing folder there.
Using pvsync2 + hipri indeed changes the picture .. but not for the better =(
The machine completely bogs down and the IOPS doesn't get higher.
Sidenote: it would be nice if FIO logged the total CPU and interrupt rates ..
Here is a screenshot while running pvsync2+hipri
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png
--
My current preliminary conclusions on this box / workload:
- running psync is much better than sync
- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU
Cheers,
/Tobias
PS: I have a small time window left (days) until this box goes into
further setup for production (which means I cannot scratch the storage
anymore) - if you have anything you want me to try, let me know. I'll do my
best to get it tested. The hardware is probably not mainstream ..
^ permalink raw reply [flat|nested] 27+ messages in thread
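The context-switch and interrupt rates mentioned in the message above can be sampled outside of fio. The sketch below assumes a Linux host where /proc/stat exposes the cumulative `ctxt` and `intr` counters; the function names are hypothetical helpers, not part of fio or any other tool in this thread.

```python
import time


def read_counters(stat_text):
    """Return (context_switches, interrupts) totals from /proc/stat text."""
    ctxt = intr = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            ctxt = int(fields[1])
        elif fields[0] == "intr":
            intr = int(fields[1])  # first number is the total since boot
    if ctxt is None or intr is None:
        raise ValueError("ctxt/intr lines not found")
    return ctxt, intr


def rates(before, after, seconds):
    """Convert two cumulative samples into per-second rates."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(interval=1.0, path="/proc/stat"):
    """Sample the counters twice, `interval` seconds apart."""
    with open(path) as f:
        first = read_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_counters(f.read())
    return rates(first, second, interval)
```

Calling `sample()` in a second terminal while a fio job is running returns a `(context switches/s, interrupts/s)` pair that can be logged next to fio's own output.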
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 23:51 ` Tobias Oberstein
@ 2017-01-24 8:21 ` Andrey Kuzmin
2017-01-24 9:28 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24 8:21 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: fio, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 2341 bytes --]
On Jan 24, 2017 02:51, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
> The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching your 2048 threads also adds
> a lot.
>
Indeed, the ctx switches and interrupts are in the millions/sec.
With engine=sync and numjobs=2048, I have
ctx_sw: 8828446
inter: 5780374
It's astonishing that this is even possible.
> To cope with the unsustainable interrupt rate, you might want to give a
> shot to the psync engine with RWF_HIPRI option set, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of the somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out a good option for
> your box specifically. Notice you'll need preadv2/pwritev2 syscall support
> in your kernel.
>
I have run an exhaustive set of 30 tests using the different engines,
including pvsync2 + hipri.
Please find everything here
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md
and in the containing folder there.
Using pvsync2 + hipri indeed changes the picture .. but not for the better =(
Surprising it didn't work for you since polling is very well suited for
your specific scenario.
The machine completely bogs down and the IOPS doesn't get higher.
Sidenote: it would be nice if FIO logged the total CPU and interrupt rates ..
Here is a screenshot while running pvsync2+hipri
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png
--
My current preliminary conclusions on this box / workload:
- running psync is much better than sync
So you likely have a convincing case for Postgres guys to switch over to
pread/pwrite.
Regards,
Andrey
- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU
Cheers,
/Tobias
PS: I have a small time window left (days) until this box goes into
further setup for production (which means I cannot scratch the storage
anymore) - if you have anything you want me to try, let me know. I'll do my
best to get it tested. The hardware is probably not mainstream ..
[-- Attachment #2: Type: text/html, Size: 4133 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 8:21 ` Andrey Kuzmin
@ 2017-01-24 9:28 ` Tobias Oberstein
2017-01-24 9:40 ` Andrey Kuzmin
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:28 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio, Jens Axboe
> My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.
I will approach them, but I want to make sure I did all my homework first.
One question that bugs me: the difference in performance between sync
and psync engines only surfaces with MD, _not_ when running over
individual devices.
---
I ran Linux perf with these results:
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md
---
md-nvmes-sync shows the "issue":
Overhead  Command  Shared Object      Symbol
  73.48%  fio      [kernel.kallsyms]  [k] osq_lock
So while I think it would be good in general if PostgreSQL used
pread/pwrite instead of lseek/read/write when available, I am afraid
there might be a bottleneck in MD.
What do you think?
And if so, where should I raise this regarding MD? I have no clue where the
hackers of MD hang out ..
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
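The lseek/read versus pread distinction discussed in the message above can be shown at the system-call level. This is a minimal sketch, not fio's or PostgreSQL's code: it reads one 4 KiB block from a temporary file both ways, using Python's `os.lseek`, `os.read` and `os.pread` wrappers (Unix only). The helper names and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

BLOCK = 4096


def read_sync_style(fd, offset, length=BLOCK):
    """Two syscalls: lseek() moves the shared file offset, read() uses it."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, length)


def read_psync_style(fd, offset, length=BLOCK):
    """One syscall: pread() takes the offset as an argument."""
    return os.pread(fd, length, offset)


with tempfile.TemporaryFile() as f:
    f.write(bytes(range(256)) * 64)   # 16 KiB of test data
    f.flush()
    fd = f.fileno()

    a = read_sync_style(fd, 2 * BLOCK)
    pos_after_sync = os.lseek(fd, 0, os.SEEK_CUR)    # offset was advanced

    os.lseek(fd, 0, os.SEEK_SET)
    b = read_psync_style(fd, 2 * BLOCK)
    pos_after_psync = os.lseek(fd, 0, os.SEEK_CUR)   # offset left untouched

print(a == b, pos_after_sync, pos_after_psync)
```

Both helpers return the same bytes. The difference is that the first one issues two system calls and changes the descriptor's file offset, while the second issues one and leaves the offset alone.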
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 9:28 ` Tobias Oberstein
@ 2017-01-24 9:40 ` Andrey Kuzmin
2017-01-24 22:51 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24 9:40 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: fio, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]
On Jan 24, 2017 12:28, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
> My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.
>
I will approach them, but I want to make sure I did all my homework first.
One question that bugs me: the difference in performance between sync and
psync engines only surfaces with MD, _not_ when running over individual
devices.
My guess is, with individual devices there's no cpu headroom for pread
savings to show up. Once the MD bottleneck gets in, you're not bound by cpu
anymore and the difference between doing a single syscall vs. two shows up.
---
I ran Linux perf with these results:
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md
---
md-nvmes-sync shows the "issue":
Overhead  Command  Shared Object      Symbol
  73.48%  fio      [kernel.kallsyms]  [k] osq_lock
So while I think it would be good in general if PostgreSQL used
pread/pwrite instead of lseek/read/write when available, I am afraid there
might be a bottleneck in MD.
What do you think?
And if so, where should I raise this regarding MD? I have no clue where the
hackers of MD hang out ..
Yup, I believe it makes sense to post to the md mail list.
Regards,
Andrey
Cheers,
/Tobias
[-- Attachment #2: Type: text/html, Size: 3616 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 9:40 ` Andrey Kuzmin
@ 2017-01-24 22:51 ` Tobias Oberstein
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 22:51 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio, Jens Axboe
> My current preliminary conclusions on this box / workload:
>>
>> - running psync is much better than sync
>>
>> So you likely have a convincing case for Postgres guys to switch over to
>> pread/pwrite.
I did raise it on the PG hackers mailing list, but I couldn't convince
them =(
Pity, since there even was a patch in the past (the change seems to be
easy, but was rejected).
They say I would need to come up with a real world PostgreSQL database
workload that shows this effect is above the noise level.
And since PostgreSQL is such a CPU hog anyway, and since I don't have
time for a full research project, I'll leave it.
---
But I did more FIO level benchmarking to compare the efficiency of
different IO methods.
Here are more numbers that quantify the differences between the IO
methods used (user, system and total are CPU load in %):
ioengine       sync  psync  vsync  pvsync  pvsync2  pvsync2+hipri
iodepth           1      1      1       1        1              1
numjobs        1024   1024   1024    1024     1024           1024
concurrency    1024   1024   1024    1024     1024           1024
iops (k)       9171   9390   9196    9473     9527           9516
user            7.7    9.3    8.6     9.0      9.3            2.6
system         86.8   77.0   85.8    76.3     77.3           97.4
total          94.5   86.3   94.4    85.3     86.6          100.0
iops/system   105.7  121.9  107.2   124.2    123.2           97.7
As can be seen, the kIOPS normalized to system CPU load (last line) for
psync (pread/pwrite) is significantly higher than for sync
(lseek/read/write).
Now here is AIO:
ioengine      libaio  libaio  libaio
iodepth           32      32      32
numjobs          128      64      32
concurrency     4096    2048    1024
iops (k)      9485.6  9479.4  8718.1
user             6.7     3.4     2.4
system          59.2    30.0    16.7
total           65.9    33.4    19.1
iops/system    160.2   316.0   522.0
The highest kIOPS/system is reached at a concurrency of 1024.
However, during my tests, I get this in kernel log:
[459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
[461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s! [swapper/26:0]
[461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [swapper/23:0]
My wild guess: these lockups are actually deadlocks. AIO seems to be
tricky for the kernel too.
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
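The "iops/system" row in the tables above is the kIOPS figure divided by the system CPU percentage. A short Python sketch that recomputes it from the numbers in the message follows; the tuple layout and the function name are illustrative choices, not something from the original post.

```python
# (ioengine, numjobs, kIOPS, system CPU %) as reported in the thread.
rows = [
    ("sync", 1024, 9171.0, 86.8),
    ("psync", 1024, 9390.0, 77.0),
    ("vsync", 1024, 9196.0, 85.8),
    ("pvsync", 1024, 9473.0, 76.3),
    ("pvsync2", 1024, 9527.0, 77.3),
    ("pvsync2+hipri", 1024, 9516.0, 97.4),
    ("libaio", 128, 9485.6, 59.2),
    ("libaio", 64, 9479.4, 30.0),
    ("libaio", 32, 8718.1, 16.7),
]


def kiops_per_system_pct(kiops, system_pct):
    """kIOPS delivered per percentage point of system CPU load."""
    return kiops / system_pct


efficiency = [
    (engine, jobs, round(kiops_per_system_pct(kiops, system), 1))
    for engine, jobs, kiops, system in rows
]
best = max(efficiency, key=lambda row: row[2])

for engine, jobs, value in efficiency:
    print(f"{engine:14s} numjobs={jobs:<5d} kIOPS/system%={value}")
print("most efficient:", best)
```

On these inputs the most efficient configuration is libaio with 32 jobs at iodepth 32, matching the observation in the message that the best ratio is reached at a concurrency of 1024.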
* RE: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 22:51 ` Tobias Oberstein
@ 2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
2017-01-26 17:52 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-01-25 16:23 UTC (permalink / raw)
To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio@vger.kernel.org, Jens Axboe
> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Tobias Oberstein
> Sent: Tuesday, January 24, 2017 4:52 PM
> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
>
> However, during my tests, I get this in kernel log:
>
> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s! [swapper/26:0]
> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [swapper/23:0]
>
> I wild guess: these lockups are actually deadlocks. AIO seems to be
> tricky for the kernel too.
>
Probably not deadlocks. One easy way to trigger those is to submit
IOs on one set of CPUs and expect a different set of CPUs to handle
the interrupts and completions. The latter CPUs can easily become
overwhelmed. The best remedy I've found is to require CPUs to handle
their own IOs, which self-throttles them from submitting more IOs
than they can handle.
The storage device driver needs to set up its hardware interrupts
that way. Then, rq_affinity=2 ensures the block layer completions
are handled on the submitting CPU.
You can add this to the kernel command line (e.g., in
/boot/grub/grub.conf) to squelch those checks:
nosoftlockup
Those prints themselves can induce more soft lockups if you have a
live serial port, because printing to the serial port is slow
and blocking.
^ permalink raw reply [flat|nested] 27+ messages in thread
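The rq_affinity setting mentioned in the message above is exposed per block device under /sys/block/<device>/queue/rq_affinity. The following Python sketch shows one way to apply it to every NVMe namespace; the helper names are hypothetical, writing to sysfs requires root, and the `sysfs_root` parameter exists only so the functions can be exercised against a scratch directory.

```python
from pathlib import Path


def set_rq_affinity(device, value=2, sysfs_root="/sys/block"):
    """Write rq_affinity for one block device and return the value read back."""
    if value not in (0, 1, 2):
        raise ValueError("rq_affinity accepts 0, 1 or 2")
    attr = Path(sysfs_root) / device / "queue" / "rq_affinity"
    attr.write_text(f"{value}\n")
    return attr.read_text().strip()


def nvme_namespaces(sysfs_root="/sys/block"):
    """List block devices whose names look like NVMe namespaces (nvme0n1, ...)."""
    return sorted(p.name for p in Path(sysfs_root).glob("nvme*n*"))


def apply_to_all_nvme(value=2, sysfs_root="/sys/block"):
    """Set rq_affinity on every NVMe namespace found; returns {device: value}."""
    return {
        dev: set_rq_affinity(dev, value, sysfs_root)
        for dev in nvme_namespaces(sysfs_root)
    }
```

Running `apply_to_all_nvme(2)` as root would switch all NVMe namespaces to rq_affinity=2. The setting does not persist across reboots, so a udev rule or boot script would still be needed for production use.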
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
@ 2017-01-26 17:52 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-26 17:52 UTC (permalink / raw)
To: Elliott, Robert (Persistent Memory)
Cc: fio@vger.kernel.org, Jens.Wilke@parcIT.de
Hi Robert,
On 25.01.2017 at 17:23, Elliott, Robert (Persistent Memory) wrote:
>
>
>> -----Original Message-----
>> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
>> Behalf Of Tobias Oberstein
>> Sent: Tuesday, January 24, 2017 4:52 PM
>> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
>> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
>> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
>>
>> However, during my tests, I get this in kernel log:
>>
>> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
>> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s! [swapper/26:0]
>> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [swapper/23:0]
>>
>> I wild guess: these lockups are actually deadlocks. AIO seems to be
>> tricky for the kernel too.
>>
>
> Probably not deadlocks. One easy way to trigger those is to submit
> IOs on one set of CPUs and expect a different set of CPUs to handle
> the interrupts and completions. The latter CPUs can easily become
> overwhelmed. The best remedy I've found is to require CPUs to handle
> their own IOs, which self-throttles them from submitting more IOs
> than they can handle.
>
> The storage device driver needs to set up its hardware interrupts
> that way. Then, rq_affinity=2 ensures the block layer completions
> are handled on the submitting CPU.
>
> You can add this to the kernel command line (e.g., in
> /boot/grub/grub.conf) to squelch those checks:
> nosoftlockup
>
> Those prints themselves can induce more soft lockups if you have a
> live serial port, because printing to the serial port is slow
> and blocking.
>
Thanks a lot for your tips! Indeed, we currently have rq_affinity=1.
Are there any risks involved?
I mean, this is a complex box .. please see below.
Also: sadly, not each of the NUMA sockets has exactly 2 NVMes (due to
mainboard / slot limitations). So wouldn't enforcing IO affinity be a
problem with this?
Cheers,
/Tobias
PS: The mainboard is
https://www.supermicro.nl/products/motherboard/Xeon/C600/X10QBI.cfm
Yeah, I know, no offense - this particular piece isn't HPE;)
The current settings / hardware:
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/rq_affinity
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/scheduler
none
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/optimal_io_size
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/iostats
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
128
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/hw_sector_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/physical_block_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/nomerges
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/io_poll
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/minimum_io_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/write_cache
write through
oberstet@svr-psql19:~$ cat /proc/cpuinfo | grep "Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz" | wc -l
176
oberstet@svr-psql19:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
node 0 size: 773944 MB
node 0 free: 770949 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
node 1 size: 774137 MB
node 1 free: 762335 MB
node 2 cpus: 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
node 2 size: 774126 MB
node 2 free: 763220 MB
node 3 cpus: 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 3 size: 774136 MB
node 3 free: 770518 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
oberstet@svr-psql19:~$ find /sys/devices | egrep 'nvme[0-9][0-9]?$'
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:02.0/0000:0a:00.0/nvme/nvme3
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:01.0/0000:09:00.0/nvme/nvme2
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:01.0/0000:05:00.0/nvme/nvme0
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:02.0/0000:06:00.0/nvme/nvme1
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:02.0/0000:86:00.0/nvme/nvme9
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:01.0/0000:85:00.0/nvme/nvme8
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:01.0/0000:48:00.0/nvme/nvme6
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:02.0/0000:49:00.0/nvme/nvme7
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:02.0/0000:44:00.0/nvme/nvme5
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:01.0/0000:43:00.0/nvme/nvme4
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:02.0/0000:c8:00.0/nvme/nvme13
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:01.0/0000:c7:00.0/nvme/nvme12
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:01.0/0000:c3:00.0/nvme/nvme10
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:02.0/0000:c4:00.0/nvme/nvme11
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:02.0/0000:cc:00.0/nvme/nvme15
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:01.0/0000:cb:00.0/nvme/nvme14
oberstet@svr-psql19:~$ egrep -H '.*' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0/address:0000:01:00
/sys/bus/pci/slots/10/address:0000:c5:00
/sys/bus/pci/slots/11/address:0000:c9:00
/sys/bus/pci/slots/1/address:0000:03:00
/sys/bus/pci/slots/2/address:0000:07:00
/sys/bus/pci/slots/3/address:0000:46:00
/sys/bus/pci/slots/4/address:0000:41:00
/sys/bus/pci/slots/5/address:0000:45:00
/sys/bus/pci/slots/6/address:0000:81:00
/sys/bus/pci/slots/7/address:0000:82:00
/sys/bus/pci/slots/8/address:0000:c1:00
/sys/bus/pci/slots/9/address:0000:83:00
^ permalink raw reply [flat|nested] 27+ messages in thread
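The uneven NVMe placement described in the message above can be read off the sysfs paths it lists. The sketch below groups the 16 controllers by PCI root complex (the `pci0000:XX` path component). It is an illustration only: that each root complex corresponds to one CPU socket or NUMA node on this board is an assumption, and on a live system the `numa_node` attribute of each PCI device would be the more direct check.

```python
import re
from collections import defaultdict

# sysfs paths copied from the `find /sys/devices` output in the thread
SYSFS_PATHS = """\
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:02.0/0000:0a:00.0/nvme/nvme3
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:01.0/0000:09:00.0/nvme/nvme2
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:01.0/0000:05:00.0/nvme/nvme0
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:02.0/0000:06:00.0/nvme/nvme1
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:02.0/0000:86:00.0/nvme/nvme9
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:01.0/0000:85:00.0/nvme/nvme8
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:01.0/0000:48:00.0/nvme/nvme6
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:02.0/0000:49:00.0/nvme/nvme7
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:02.0/0000:44:00.0/nvme/nvme5
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:01.0/0000:43:00.0/nvme/nvme4
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:02.0/0000:c8:00.0/nvme/nvme13
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:01.0/0000:c7:00.0/nvme/nvme12
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:01.0/0000:c3:00.0/nvme/nvme10
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:02.0/0000:c4:00.0/nvme/nvme11
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:02.0/0000:cc:00.0/nvme/nvme15
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:01.0/0000:cb:00.0/nvme/nvme14
""".split()

# Capture the root complex ("0000:00") and the controller name ("nvme3").
PATTERN = re.compile(
    r"^/sys/devices/pci([0-9a-f]{4}:[0-9a-f]{2})/.*/nvme/(nvme\d+)$"
)


def group_by_pci_root(paths):
    """Map each PCI root complex to the NVMe controllers found below it."""
    groups = defaultdict(list)
    for path in paths:
        match = PATTERN.match(path)
        if match:
            root, controller = match.groups()
            groups[root].append(controller)
    return {
        root: sorted(ctrls, key=lambda name: int(name[4:]))
        for root, ctrls in sorted(groups.items())
    }


groups = group_by_pci_root(SYSFS_PATHS)
for root, ctrls in groups.items():
    print(f"pci{root}: {len(ctrls)} controllers -> {' '.join(ctrls)}")
```

For the paths in the message this yields 4, 4, 2 and 6 controllers under the four root complexes, which is the imbalance the question about IO affinity refers to.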
[parent not found: <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
@ 2017-01-23 20:10 ` Tobias Oberstein
[not found] ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
[not found] ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:10 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
Hi Andrey,
Thanks again for your tips .. the psync thingy in particular. I need to
verify if that applies to PostgreSQL, because it brings huge gains
compared to sync!
Here is the summary of my latest numbers:
1) engine=libaio
Individual NVMes: iops=7350.4K
usr=8.25%, sys=64.76%, ctx=57533651
MD (RAID-0) over NVMes: iops=4112.8K
usr=4.98%, sys=94.81%, ctx=12736
=> MD reaches 55% of perf compared to non-MD.
2) engine=sync
Individual NVMes: IOPS=6657k
usr=0.56%, sys=4.43%, ctx=200588483
MD (RAID-0) over NVMes: IOPS=1467k
usr=0.07%, sys=4.13%, ctx=46545978
=> MD reaches 22% of perf compared to non-MD.
3) engine=psync
Individual NVMes: IOPS=7086k
usr=0.60%, sys=4.43%, ctx=214720330
MD (RAID-0) over NVMes: IOPS=4154k
usr=0.46%, sys=5.81%, ctx=124737165
=> MD reaches 58% of perf compared to non-MD.
==================
Are the CPU load numbers reported by FIO reliable? I mean, compare the
load between libaio and sync/psync!
Cheers,
/Tobias
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat best_sync_individual_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat best_sync_md_over_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio best_sync_individual_nvmes.fio
randread-individual-nvmes: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1 Starting 2800 threads Jobs: 2747 (f=28032): [f(9),_(1),f(27),_(3),f(20),_(1),f(2),_(1),f(57),_(1),f(250),_(1),f(108),_(1),f(48),_(1),f(26),_(1),f(14),_(2),f(444),_(1),f(36),_(1),f(193),_(1),f(100),_(1),f(26),_(1),f(40),_(1),f(1),_(1),f(19),_(2),f(36),_(1),f(77),_(1),f(20),_(1),f(37),_(1),f(6),_(1),f(8),_(1),f(45),_(1),f(3),_(1),f(10),_(1),f(38),_(1),f(7),_(1),f(16),_(1),f(10),_(1),f(3),_(1),f(3),_(2),f(11),_(1),f(26),_(1),f(39),_(1),f(5),_(1),f(15),_(1),f(90),_(1),f(80),_(1),f(87),_(1),f(67),_(1),f(91),_(1),f(9),_(1),f(35),E(1),f(166),_(1),f(78),_(1),f(152),_(1),f(57)][100.0%][r=18.7GiB/s,w=0KiB/s][r=4885k,w=0 IOPS][eta 00m:00s] randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=8021: Mon Jan 23 20:51:43 2017 read: IOPS=6657k, BW=25.5GiB/s (27.3GB/s)(762GiB/30012msec) clat (usec): min=31, max=35890, avg=403.07, stdev=587.78 clat percentiles (usec): | 1.00th=[ 112], 5.00th=[ 131], 10.00th=[ 145], 20.00th=[ 167], | 30.00th=[ 187], 40.00th=[ 211], 50.00th=[ 237], 60.00th=[ 270], | 70.00th=[ 318], 80.00th=[ 406], 90.00th=[ 676], 95.00th=[ 1336], | 99.00th=[ 3280], 99.50th=[ 4016], 99.90th=[ 5536], 99.95th=[ 6304], | 99.99th=[ 9536] lat (usec) : 50=0.01%, 100=0.18%, 250=54.00%, 500=31.18%, 750=5.73% lat (usec) : 1000=2.24% lat (msec) : 2=3.63%, 4=2.52%, 10=0.50%, 20=0.01%, 50=0.01% cpu : usr=0.56%, sys=4.43%, ctx=200588483, majf=0, minf=2797 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=199803621,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=25.5GiB/s (27.3GB/s), 25.5GiB/s-25.5GiB/s (27.3GB/s-27.3GB/s), io=762GiB (818GB), run=30012-30012msec Disk stats (read/write): nvme0n1: ios=12474932/0, merge=0/0, ticks=3440096/0, in_queue=3545768, util=97.54% nvme1n1: 
ios=12488816/0, merge=0/0, ticks=6811092/0, in_queue=7420304, util=97.96% nvme2n1: ios=12488737/0, merge=0/0, ticks=4947416/0, in_queue=5379024, util=97.12% nvme3n1: ios=12488626/0, merge=0/0, ticks=4578888/0, in_queue=4696164, util=96.85% nvme4n1: ios=12488514/0, merge=0/0, ticks=3848360/0, in_queue=4189952, util=97.85% nvme5n1: ios=12488384/0, merge=0/0, ticks=2872728/0, in_queue=2946696, util=96.89% nvme6n1: ios=12488271/0, merge=0/0, ticks=2480536/0, in_queue=2544704, util=96.92% nvme7n1: ios=12488165/0, merge=0/0, ticks=4038500/0, in_queue=4154768, util=96.91% nvme8n1: ios=12488052/0, merge=0/0, ticks=4553428/0, in_queue=4675568, util=97.22% nvme9n1: ios=12487937/0, merge=0/0, ticks=5487888/0, in_queue=5956252, util=97.72% nvme10n1: ios=12486833/0, merge=0/0, ticks=6234216/0, in_queue=6402356, util=97.54% nvme11n1: ios=12486699/0, merge=0/0, ticks=4646856/0, in_queue=5042628, util=97.76% nvme12n1: ios=12486586/0, merge=0/0, ticks=5331000/0, in_queue=5478728, util=97.59% nvme13n1: ios=12486467/0, merge=0/0, ticks=3464404/0, in_queue=3715416, util=98.27% nvme14n1: ios=12486358/0, merge=0/0, ticks=2576312/0, in_queue=2641952, util=97.49% nvme15n1: ios=12486251/0, merge=0/0, ticks=4135908/0, in_queue=4270008, util=97.69% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio best_sync_md_over_nvmes.fio randread-md-over-nvmes: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1 ... 
fio-2.17-17-g9cf1 Starting 2800 threads Jobs: 2800 (f=2800): [r(2800)][100.0%][r=5764MiB/s,w=0KiB/s][r=1476k,w=0 IOPS][eta 00m:00s] randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=11137: Mon Jan 23 20:52:30 2017 read: IOPS=1467k, BW=5732MiB/s (6011MB/s)(169GiB/30116msec) clat (usec): min=27, max=33113, avg=124.27, stdev=112.85 clat percentiles (usec): | 1.00th=[ 77], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 88], | 30.00th=[ 93], 40.00th=[ 101], 50.00th=[ 104], 60.00th=[ 107], | 70.00th=[ 115], 80.00th=[ 133], 90.00th=[ 177], 95.00th=[ 227], | 99.00th=[ 370], 99.50th=[ 506], 99.90th=[ 2096], 99.95th=[ 2544], | 99.99th=[ 2960] lat (usec) : 50=0.04%, 100=36.72%, 250=60.00%, 500=2.73%, 750=0.22% lat (usec) : 1000=0.07% lat (msec) : 2=0.12%, 4=0.11%, 10=0.01%, 20=0.01%, 50=0.01% cpu : usr=0.07%, sys=4.13%, ctx=46545978, majf=0, minf=2797 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=44193488,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=5732MiB/s (6011MB/s), 5732MiB/s-5732MiB/s (6011MB/s-6011MB/s), io=169GiB (181GB), run=30116-30116msec Disk stats (read/write): md1: ios=44010950/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=2762093/0, aggrmerge=0/0, aggrticks=280663/0, aggrin_queue=284837, aggrutil=99.12% nvme15n1: ios=2766734/0, merge=0/0, ticks=264808/0, in_queue=267732, util=98.68% nvme6n1: ios=2761142/0, merge=0/0, ticks=288704/0, in_queue=291288, util=98.76% nvme9n1: ios=2759118/0, merge=0/0, ticks=275752/0, in_queue=282288, util=98.95% nvme11n1: ios=2762423/0, merge=0/0, ticks=264996/0, in_queue=271464, util=98.91% nvme2n1: ios=2764361/0, merge=0/0, ticks=281520/0, in_queue=288924, util=99.12% nvme14n1: ios=2760515/0, merge=0/0, ticks=264796/0, in_queue=266752, util=98.61% 
nvme5n1: ios=2761756/0, merge=0/0, ticks=280020/0, in_queue=282840, util=98.92% nvme8n1: ios=2763138/0, merge=0/0, ticks=279332/0, in_queue=280624, util=98.53% nvme10n1: ios=2764117/0, merge=0/0, ticks=291264/0, in_queue=293444, util=98.67% nvme1n1: ios=2761579/0, merge=0/0, ticks=275872/0, in_queue=282080, util=98.90% nvme13n1: ios=2759948/0, merge=0/0, ticks=280080/0, in_queue=286324, util=99.05% nvme4n1: ios=2763271/0, merge=0/0, ticks=279592/0, in_queue=287944, util=98.96% nvme7n1: ios=2759669/0, merge=0/0, ticks=280708/0, in_queue=284056, util=98.88% nvme0n1: ios=2761263/0, merge=0/0, ticks=296868/0, in_queue=300408, util=98.78% nvme12n1: ios=2763077/0, merge=0/0, ticks=288264/0, in_queue=290264, util=98.71% nvme3n1: ios=2761377/0, merge=0/0, ticks=298040/0, in_queue=300960, util=98.74% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ ================= Changing engine to psync, leaving everything else unchanged: oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio best_sync_individual_nvmes.fio randread-individual-nvmes: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1 ... 
fio-2.17-17-g9cf1 Starting 2800 threads Jobs: 2771 (f=40464): [f(8),_(1),f(14),_(1),f(30),_(1),f(6),_(1),f(4),_(1),f(7),_(1),f(14),_(1),f(6),_(1),f(62),_(1),f(3),_(1),f(167),_(1),f(309),_(1),f(269),_(1),f(47),_(1),f(206),_(1),f(26),_(1),f(56),_(2),f(4),_(1),f(39),_(1),f(148),_(1),f(148),_(1),f(4),_(1),f(63),_(1),f(27),_(1),f(19),_(1),f(314),_(1),f(189),_(1),f(205),_(1),f(377)][100.0%][r=25.7GiB/s,w=0KiB/s][r=6726k,w=0 IOPS][eta 00m:00s] randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=14753: Mon Jan 23 20:58:45 2017 read: IOPS=7086k, BW=27.4GiB/s (29.3GB/s)(811GiB/30010msec) clat (usec): min=34, max=57916, avg=381.14, stdev=524.36 clat percentiles (usec): | 1.00th=[ 121], 5.00th=[ 145], 10.00th=[ 159], 20.00th=[ 185], | 30.00th=[ 207], 40.00th=[ 229], 50.00th=[ 255], 60.00th=[ 286], | 70.00th=[ 326], 80.00th=[ 394], 90.00th=[ 564], 95.00th=[ 988], | 99.00th=[ 2928], 99.50th=[ 3632], 99.90th=[ 5344], 99.95th=[ 6688], | 99.99th=[11200] lat (usec) : 50=0.01%, 100=0.08%, 250=48.03%, 500=39.59%, 750=5.69% lat (usec) : 1000=1.66% lat (msec) : 2=2.69%, 4=1.91%, 10=0.32%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01% cpu : usr=0.60%, sys=4.43%, ctx=214720330, majf=0, minf=2797 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=212658246,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=27.4GiB/s (29.3GB/s), 27.4GiB/s-27.4GiB/s (29.3GB/s-29.3GB/s), io=811GiB (871GB), run=30010-30010msec Disk stats (read/write): nvme0n1: ios=13204662/0, merge=0/0, ticks=5579056/0, in_queue=5713604, util=97.16% nvme1n1: ios=13292212/0, merge=0/0, ticks=3336164/0, in_queue=3661216, util=97.52% nvme2n1: ios=13292063/0, merge=0/0, ticks=3097888/0, in_queue=3359552, util=97.09% nvme3n1: ios=13291900/0, merge=0/0, 
ticks=2973176/0, in_queue=3072764, util=96.31% nvme4n1: ios=13291734/0, merge=0/0, ticks=4962684/0, in_queue=5434620, util=97.02% nvme5n1: ios=13291540/0, merge=0/0, ticks=7857284/0, in_queue=8108332, util=96.75% nvme6n1: ios=13291403/0, merge=0/0, ticks=3160292/0, in_queue=3249508, util=96.46% nvme7n1: ios=13291270/0, merge=0/0, ticks=5593256/0, in_queue=5748080, util=96.42% nvme8n1: ios=13291057/0, merge=0/0, ticks=3345216/0, in_queue=3450892, util=96.81% nvme9n1: ios=13290897/0, merge=0/0, ticks=3102344/0, in_queue=3394168, util=97.38% nvme10n1: ios=13290753/0, merge=0/0, ticks=3050116/0, in_queue=3129208, util=96.74% nvme11n1: ios=13290570/0, merge=0/0, ticks=6353996/0, in_queue=6956272, util=97.59% nvme12n1: ios=13290405/0, merge=0/0, ticks=3268144/0, in_queue=3372100, util=97.04% nvme13n1: ios=13290255/0, merge=0/0, ticks=3037220/0, in_queue=3297944, util=97.78% nvme14n1: ios=13290110/0, merge=0/0, ticks=8279264/0, in_queue=8503324, util=97.47% nvme15n1: ios=13289722/0, merge=0/0, ticks=3361284/0, in_queue=3467660, util=97.22% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio best_sync_md_over_nvmes.fio randread-md-over-nvmes: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1 ... 
fio-2.17-17-g9cf1 Starting 2800 threads Jobs: 2367 (f=2342): [_(1),r(2),_(1),r(38),_(10),r(1),_(1),r(2),_(2),r(2),_(11),r(1),_(1),r(5),_(1),E(1),r(2),f(2),E(1),r(1),f(3),r(19),f(1),r(87),_(1),r(234),_(1),r(13),_(1),r(29),f(1),_(1),r(17),E(1),r(9),E(1),r(9),E(1),r(3),E(1),r(6),_(1),r(16),E(1),r(2),_(1),r(8),E(1),r(30),_(1),r(15),E(1),r(11),f(1),r(27),f(1),r(11),E(1),r(13),_(1),r(27),E(1),r(31),E(1),r(32),E(1),r(6),_(1),r(26),E(1),r(18),E(1),r(5),_(1),E(1),r(16),f(1),r(1),f(1),r(3),f(3),r(3),f(2),r(1),f(3),r(1),f(1),r(1),f(1),r(1),f(4),r(3),f(5),r(1),f(12),E(1),r(2),f(3),r(2),f(1),_(1),f(8),r(1),f(9),r(1),f(1),r(1),f(2),r(1),f(4),r(1),f(7),r(2),f(5),r(1),f(2),r(1),f(2),r(1),f(2),_(1),f(1),r(1),f(2),r(1),f(2),r(1),f(5),r(1),f(1),r(2),f(1),r(4),f(1),r(1),f(5),r(1),f(1),r(2),f(1),r(1),E(1),r(1),f(3),r(2),f(5),r(1),f(1),r(2),f(1),r(1),f(1),r(2),_(1),f(9),E(1),f(3),_(2),f(11),_(1),f(3),_(1),f(4),_(2),f(1),_(1),f(7),_(1),f(3),_(2),f(7),_(1),f(4),_(1),f(4),_(1),f(5),_(1),f(3),_(1),f(12),_(1),f(12),_(1),f(4),_(1),f(2),_(1),f(7),_(1),f(1),_(1),f(15),_(2),f(1),_(1),f(2),_(1),f(10),_(1),f(2),_(1),f(12),_(1),f(10),_(1),f(5),_(1),f(6),_(2),f(6),_(1),f(2),_(1),f(13),_(1),f(6),_(1),f(21),_(1),f(2),_(1),f(1),_(2),f(1),_(1),f(26),_(1),f(1),_(1),f(1),E(1),f(6),_(1),f(3),_(1),f(2),_(1),f(2),_(1),f(3),_(1),f(10),_(1),f(8),_(1),f(11),_(1),f(7),_(1),f(2),_(1),f(4),_(1),f(5),_(1),f(4),_(1),f(8),_(1),f(6),_(1),f(5),_(1),f(9),_(2),f(3),_(1),f(1),_(1),f(13),_(1),f(3),_(1),f(2),_(1),f(1),_(1),f(5),_(1),f(14),_(1),f(4),_(1),f(5),_(1),f(12),_(1),f(1),_(2),f(1),_(1),f(3),_(1),f(2),_(3),f(2),_(1),f(3),_(1),f(5),_(1),f(7),_(3),f(19),_(1),f(4),_(1),f(6),_(1),f(9),_(1),f(9),_(2),f(2),_(2),f(22),_(1),f(69),_(1),f(17),_(1),f(26),_(1),f(1),_(1),f(5),_(1),f(3),_(1),f(9),_(1),f(19),_(1),f(11),_(2),f(7),_(1),f(21),_(1),f(3),_(1),f(6),_(1),f(10),_(1),f(2),_(1),f(26),_(1),f(7),_(1),f(1),_(2),f(2),_(1),f(8),_(1),f(20),_(1),f(15),_(2),f(2),_(1),f(11),_(1),f(8),_(1),f(14),_(1),f(10),_(1),f(6),_(1),f(2),_(1),f(25
),_(1),f(2),_(1),f(1),_(1),f(4),_(1),f(42),_(1),f(5),_(2),f(14),_(2),f(2),_(2),f(7),_(1),f(2),_(1),f(2),_(2),f(12),_(1),f(15),_(1),f(2),_(1),f(1),_(1),f(2),_(1),f(4),_(1),f(6),_(1),f(8),_(4),f(2),_(3),f(4),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(18),_(2),f(1),_(1),f(1),_(2),f(11),_(1),f(20),_(1),f(7),_(1),f(4),_(1),f(6),_(1),f(4),_(1),f(11),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(2),_(1),f(2),_(1),f(4),_(2),f(3),_(1),f(4),_(1),E(1),_(1),f(1),_(1),f(1),_(1),E(1),_(3),f(2),_(5),f(1),_(1),E(1),f(1),_(1),f(2),_(1),f(5),_(2),f(2),_(1),E(1),f(2),_(1),f(3),E(1),f(1),_(2),f(10),_(1),f(1),_(4),f(1),_(1),f(2),_(2),f(3),_(1),f(2),_(3),f(1),_(3),f(1),_(2),f(2),E(1),f(2),_(1),f(1),_(3),f(1),_(1),f(2),E(1),f(9),_(1),f(1),E(1),f(1),_(1),f(1),_(1),f(1),E(1),f(1),E(1),_(1),f(3),E(1),f(1),_(2),f(1),_(1),E(1),f(1),_(2),f(3),_(1),f(1),_(1),f(3),_(1),f(2),_(2),f(2),_(1),f(2),_(3),f(2),_(2),f(8),_(1),f(1),_(2),f(1),_(1),f(3),_(2),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(3),_(1),f(5),_(2),f(6),_(2),f(1),_(1),f(9),_(1),f(3),_(1),f(7),_(1),f(1),_(2),f(1),_(1),f(2),_(1),f(5),_(2),f(4),_(1),f(1),_(2),f(3),_(3),f(12),_(1),f(2),_(3),f(3),_(1),f(3),_(1),f(1),_(2),f(3),_(1),f(2),_(1),f(3),_(1),f(3),_(2),f(1),_(1),f(2),_(2),f(9),E(1),f(1),E(1),f(5),_(1),E(1),f(7),_(1),f(1),_(1),f(4),_(2),f(2),_(1),f(3),_(3),f(14),_(1),f(10),_(1),f(1),_(1),f(1),_(1),E(1),f(2),E(1),f(1),_(1),f(1),_(3),f(6),_(1),f(4),E(1),f(4),_(4),f(3),_(1),f(1),_(3),f(1),_(1),f(1),E(1),f(2),_(1),f(2),_(1),f(2),_(1),f(1),E(1),_(1),E(1),f(1),_(2),f(1),_(1),f(2),_(1),f(2),_(9),f(1),_(3),f(3),_(1),f(1),_(1),f(3),_(2),f(3),_(2),f(2),_(1),f(2),_(1),f(1),_(2),f(1),_(2),f(2)][0.5%][r=15.2GiB/s,w=0KiB/s][r=3960k,w=0 IOPS][eta 01h:38m:47s] randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=17756: Mon Jan 23 20:59:22 2017 read: IOPS=4154k, BW=15.9GiB/s (17.2GB/s)(476GiB/30015msec) clat (usec): min=38, max=264790, avg=669.08, stdev=954.35 clat percentiles (usec): | 1.00th=[ 149], 5.00th=[ 207], 10.00th=[ 262], 
20.00th=[ 342], | 30.00th=[ 410], 40.00th=[ 470], 50.00th=[ 532], 60.00th=[ 604], | 70.00th=[ 684], 80.00th=[ 788], 90.00th=[ 956], 95.00th=[ 1160], | 99.00th=[ 4512], 99.50th=[ 7392], 99.90th=[12480], 99.95th=[14400], | 99.99th=[19072] lat (usec) : 50=0.01%, 100=0.04%, 250=8.86%, 500=35.57%, 750=32.34% lat (usec) : 1000=14.64% lat (msec) : 2=6.53%, 4=0.91%, 10=0.89%, 20=0.22%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01% cpu : usr=0.46%, sys=5.81%, ctx=124737165, majf=0, minf=2797 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=124675330,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=15.9GiB/s (17.2GB/s), 15.9GiB/s-15.9GiB/s (17.2GB/s-17.2GB/s), io=476GiB (511GB), run=30015-30015msec Disk stats (read/write): md1: ios=124675330/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=7792208/0, aggrmerge=0/0, aggrticks=1051705/0, aggrin_queue=1120720, aggrutil=100.00% nvme15n1: ios=7790429/0, merge=0/0, ticks=1048276/0, in_queue=1090348, util=100.00% nvme6n1: ios=7792474/0, merge=0/0, ticks=999284/0, in_queue=1035092, util=100.00% nvme9n1: ios=7792704/0, merge=0/0, ticks=1033208/0, in_queue=1151824, util=100.00% nvme11n1: ios=7792344/0, merge=0/0, ticks=1103896/0, in_queue=1231748, util=100.00% nvme2n1: ios=7791972/0, merge=0/0, ticks=1001928/0, in_queue=1121472, util=100.00% nvme14n1: ios=7795323/0, merge=0/0, ticks=1154676/0, in_queue=1190940, util=100.00% nvme5n1: ios=7784969/0, merge=0/0, ticks=1048052/0, in_queue=1081964, util=100.00% nvme8n1: ios=7792042/0, merge=0/0, ticks=1080976/0, in_queue=1112776, util=100.00% nvme10n1: ios=7786642/0, merge=0/0, ticks=1018484/0, in_queue=1054712, util=100.00% nvme1n1: ios=7793892/0, merge=0/0, ticks=1072588/0, in_queue=1194612, util=100.00% 
nvme13n1: ios=7792651/0, merge=0/0, ticks=1040368/0, in_queue=1157356, util=100.00% nvme4n1: ios=7794567/0, merge=0/0, ticks=1065096/0, in_queue=1198308, util=100.00% nvme7n1: ios=7794169/0, merge=0/0, ticks=1061900/0, in_queue=1104168, util=100.00% nvme0n1: ios=7794534/0, merge=0/0, ticks=1039064/0, in_queue=1071864, util=100.00% nvme12n1: ios=7796809/0, merge=0/0, ticks=1044664/0, in_queue=1081852, util=100.00% nvme3n1: ios=7789809/0, merge=0/0, ticks=1014828/0, in_queue=1052484, util=100.00% oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ ^ permalink raw reply [flat|nested] 27+ messages in thread
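A rough sanity check on the two psync runs above, using Little's law (throughput = concurrency / latency). This is only a sketch: it assumes that each of the 2800 threads keeps exactly one I/O in flight for the whole run, which the job-status lines suggest is only approximately true. The function name `expected_iops` is made up for illustration; the thread counts and average completion latencies are the ones fio reported.

```python
def expected_iops(threads: int, avg_latency_us: float) -> float:
    """Little's law for QD1 workers: IOPS = in-flight I/Os / average latency."""
    return threads / (avg_latency_us * 1e-6)


# Average clat values taken from the two psync reports above.
raw = expected_iops(2800, 381.14)  # individual NVMe devices
md = expected_iops(2800, 669.08)   # /dev/md1 striped over the same devices

print(f"raw devices: ~{raw / 1e6:.2f}M IOPS predicted, 7086k reported")
print(f"md1:         ~{md / 1e6:.2f}M IOPS predicted, 4154k reported")
```

Both predictions land within a few percent of what fio measured, so at a fixed thread count the lower MD throughput shows up entirely as higher per-I/O latency.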
[parent not found: <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>]
[parent not found: <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
@ 2017-01-23 20:20 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:20 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
> Are the CPU load numbers reported by FIO reliable?
>
> Yes, they're quite solid, just keep in mind that cpu is being reported on a
> thread basis.

Ahhh =) That explains this:

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-15-59-MEHOP3ZW.1485202585.png

which is engine=psync on MD, and

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-19-56-9ieRvRZy.1485202817.png

which is engine=libaio on MD.

Ha. And I thought for a second the machine was now going into "full magic mode" ;)

Thanks,
Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>]
[parent not found: <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>]
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
@ 2017-01-23 21:49 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:49 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
Hi Andrey,

> Thanks again for your tips .. the psync thingy in particular. I need to
> verify if that applies to PostgreSQL, because it brings huge gains compared
> to sync!
>
> That's easy to explain, it just does one syscall less per IO. It should
> indeed bring home a measurable gain as, with synchronous I/O, I believe
> you're cpu-limited.

Sadly, it seems PostgreSQL currently does lseek/read/write. (I'll double check tomorrow by running perf against an active PostgreSQL instance.)

There was a patch discussed here that uses pread/pwrite when available

https://www.postgresql.org/message-id/flat/CABUevEzZ%3DCGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q%40mail.gmail.com#CABUevEzZ=CGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q@mail.gmail.com

which ends with a comment by Tom Lane (PostgreSQL core developer):

"Well, my point remains that I see little value in messing with long-established code if you can't demonstrate a benefit that's clearly above the noise level."

=(

I will post the findings from our discussion here to the PG hackers list. Maybe ...

Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
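To make the "one syscall less per IO" point concrete, here is a minimal sketch of the two access patterns being compared: lseek followed by read (what fio's sync engine does) versus a single pread (what psync does). It uses Python's `os` wrappers around the same system calls; the helper names and the scratch file are invented for illustration, and this says nothing about how PostgreSQL structures its own I/O code.

```python
import os
import tempfile

BLOCK = 8192  # 8 kB blocks, matching the workload described in the thread


def read_block_seek(fd: int, block_no: int) -> bytes:
    """Two syscalls per block: lseek() to position, then read()."""
    os.lseek(fd, block_no * BLOCK, os.SEEK_SET)
    return os.read(fd, BLOCK)


def read_block_pread(fd: int, block_no: int) -> bytes:
    """One syscall per block: pread() takes the offset as an argument."""
    return os.pread(fd, BLOCK, block_no * BLOCK)


def demo() -> bool:
    """Write a small scratch file and check both paths return the same data."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Four blocks, each filled with its own block number as a byte value.
        tmp.write(b"".join(bytes([i]) * BLOCK for i in range(4)))
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            return all(
                read_block_seek(fd, n) == read_block_pread(fd, n)
                for n in (3, 0, 2, 1)
            )
        finally:
            os.close(fd)


if __name__ == "__main__":
    print("both paths agree:", demo())
```

Both helpers return identical data; the difference is that the first enters the kernel twice per block, which matters when the benchmark is CPU-limited as suggested above.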
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 18:18 ` Kudryavtsev, Andrey O
2017-01-23 18:53 ` Tobias Oberstein
1 sibling, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 18:18 UTC (permalink / raw)
To: Tobias Oberstein, fio@vger.kernel.org
Hi Tobias,

MDRAID overhead is always there, but you can play with some tuning knobs. I recommend the following:

1. Use many threads/jobs with a fairly high queue depth (QD). Intel P3xxx drives reach their highest IOPS when you saturate them with 128 outstanding 4k IOs per drive. This can be done with 32 jobs at QD4, or 16 jobs at QD8, and so on. With MDRAID on top of that, multiply by the number of drives in the array. So I think the current problem is that you’re simply not submitting enough IOs.
2. Changing the hardware SSD sector size to 4k may also help, if you’re sure that your workload is always 4k granular.
3. Finally, use the “imsm” MDRAID extensions and the latest mdadm build.

See some other hints here:
http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives

Some config examples for NVMe are here:
https://github.com/01org/fiovisualizer/tree/master/Workloads

--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 8:26 AM, "fio-owner@vger.kernel.org on behalf of Tobias Oberstein" <fio-owner@vger.kernel.org on behalf of tobias.oberstein@gmail.com> wrote:

    Hi,

    I have a question rgd Linux software RAID (MD) as tested with FIO - so this is slightly OT, but I am hoping for expert advice or redirection to a more appropriate place (if this is unwelcome here).
^ permalink raw reply [flat|nested] 27+ messages in thread
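The sizing rule in Andrey's first point (128 outstanding 4k IOs per drive, multiplied by the number of drives under MDRAID) is simple arithmetic, sketched below. The function name `jobs_needed` is hypothetical; the figures of 128 per drive and 16 logical devices come from the thread.

```python
def jobs_needed(drives: int, per_drive_outstanding: int, iodepth: int) -> int:
    """Jobs required so that jobs * iodepth covers the target in-flight I/Os."""
    target = drives * per_drive_outstanding
    return -(-target // iodepth)  # ceiling division


# Single drive: reproduces the "32 jobs at QD4 or 16 jobs at QD8" examples.
print(jobs_needed(1, 128, 4), jobs_needed(1, 128, 8))

# 16 logical NVMe devices behind one MD array, at different queue depths.
for qd in (1, 4, 8):
    print(f"iodepth={qd}: {jobs_needed(16, 128, qd)} jobs")
```

By this rule of thumb the array wants about 2048 I/Os in flight. The original post ran 2560 threads at iodepth=1, which is already above that figure, so the rule by itself may not account for the gap seen on MD.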
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:18 ` Kudryavtsev, Andrey O
@ 2017-01-23 18:53 ` Tobias Oberstein
2017-01-23 19:06 ` Kudryavtsev, Andrey O
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:53 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio@vger.kernel.org
Hi Andrey,

thanks for your tips!

On 23.01.2017 at 19:18, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> MDRAID overhead is always there, but you can play with some tuning knobs.
> I recommend following:
> 1. You must use many thread/job with quite high QD configuration. Highest IOPS for Intel P3xxx drives achieved if you saturate them with 128 *4k IO per drive. This can be done in 32 jobs and QD4 or 16J/8QD and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So, I think currently the problem, that you’re simply not submitting enough IOs.

I get nearly 7 million random 4k IOPS with engine=sync and threads=2800 on the 16 logical NVMe block devices (from 8 physical P3608 4TB). The values I get with libaio are much lower (see my other reply).

My concrete problem is: I can't get these 7 million IOPS through MD (striped over all 16 NVMe logical devices) .. MD hits a wall at 1.6 million.

Note: I also tried LVM striped volumes. Sluggish performance, much higher system load.

> 2. changing a HW SSD sector size to 4k may also help if you’re sure that your workload is always 4k granular

Background: my workload is 100% 8kB and current results are here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/sql19/Performance%20Results%20-%20NVMe%20Scaling%20with%20IO%20Concurrency.pdf

The sector size on the NVMes currently is

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo isdct show -a -intelssd 0 | grep SectorSize
SectorSize : 512

Do you recommend changing that in my case?

> 3. and finally using “imsm” MDRAID extensions and latest MDADM build.

What is imsm? Is that "Intel Matrix Storage Array"?

Is it fully open source and in the mainline kernel? If not, I won't use it anyway, sorry, company policy.

We're running Debian 8 / Kernel 4.8 from backports (and soonish Debian 9).

> See some other hints there:
> http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
>
> some config examples for NVMe are here:
> https://github.com/01org/fiovisualizer/tree/master/Workloads

What's your platform?

E.g. on Windows, async IO is awesome. On *nix .. not. At least in my experience.

And then, my target workload (PostgreSQL) isn't doing AIO at all ..

Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:53 ` Tobias Oberstein
@ 2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:06 UTC (permalink / raw)
To: Tobias Oberstein, fio@vger.kernel.org
Hi Tobias,

Yes, “imsm” is in the generic release, so you don’t need the latest or a special build if you want to stay compliant. It’s mainly a different layout of the RAID metadata.

Your findings match my expectations: at QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?

Most of the time I’m running CentOS 7 with either the 3.10 or the latest kernel, depending on the scope of the testing.

Changing the sector size to 4k is easy, and it can really help. See the DCT manual; it’s in there.
This may be relevant for you: https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/

--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 10:53 AM, "Tobias Oberstein" <tobias.oberstein@gmail.com> wrote:

    Hi Andrey,

    thanks for your tips!

    On 23.01.2017 at 19:18, Kudryavtsev, Andrey O wrote:
    > Hi Tobias,
    > MDRAID overhead is always there, but you can play with some tuning knobs.
    > I recommend following:
    > 1. You must use many thread/job with quite high QD configuration. Highest IOPS for Intel P3xxx drives achieved if you saturate them with 128 *4k IO per drive. This can be done in 32 jobs and QD4 or 16J/8QD and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So, I think currently the problem, that you’re simply not submitting enough IOs.

    I get nearly 7 mio random 4k IOPS with engine=sync and threads=2800 on the 16 logical NVMe block devices (from 8 physical P3608 4TB). The values I get with libaio are much lower (see my other reply).
^ permalink raw reply [flat|nested] 27+ messages in thread
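The experiment Andrey asks for (libaio at QD4 with 2800/4 jobs) could be set up along these lines. This is only a sketch: the helper `render_fio_job` is hypothetical, and the job file it prints just reuses the options from the job file in the original post, with the engine, queue depth and job count changed and /dev/md1 assumed as the target.

```python
def render_fio_job(total_threads: int, iodepth: int, target: str = "/dev/md1") -> str:
    """Render a fio job file that keeps total_threads I/Os in flight overall."""
    numjobs = total_threads // iodepth  # 2800 / 4 = 700 jobs at QD4
    lines = [
        "[global]",
        "group_reporting",
        f"filename={target}",
        "ioengine=libaio",      # asynchronous engine instead of sync/psync
        f"iodepth={iodepth}",   # I/Os kept in flight per job
        "thread=1",
        "direct=1",
        "time_based=1",
        "randrepeat=0",
        "norandommap=1",
        "bs=4k",
        "runtime=120",
        "",
        "[randread]",
        "rw=randread",
        f"numjobs={numjobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_fio_job(2800, 4), end="")
```

Redirecting the output to a file and passing that file to fio keeps the total number of in-flight I/Os at 2800, the same as in the QD1 runs, so the results stay comparable.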
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
@ 2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:46 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio@vger.kernel.org
Hi Andrey,

On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of a raid metadata.
>
> Your findings follow my expectations, for QD1 sync engine does good results. Can you try libio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or latest kernel depends of the scope of the testing.
>
> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/

I have gone through the whole manual, but I cannot find any info about the meaning of the different LBAFormats.

The Oracle article above uses LBAFormat=3, which I presume means 4k sector size.

The P3608 seems to support a value up to 6:

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all -intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6

So is this the correct mapping for the value?

LBAFormat  Sector Size
0          512
1          1024
2          2048
3          4096
4          8192
5          16384
6          32768

In this case, I'd use LBAFormat=4 to get 8k sectors, since my workload is purely 8k.

Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
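One caveat on the guessed mapping above, as a hedged aside: in the NVMe specification the LBA format index is only a position in the namespace's LBA format table, not an exponent. Each table entry carries an LBADS field, and the data size is 2 to the power of LBADS bytes (plus an optional metadata size). The sketch below shows only that relation; which LBADS value each of the P3608's format indexes carries is not stated in this thread and would have to be read from the drive, for example with nvme-cli's `id-ns` command. The function name is made up.

```python
def lba_data_size(lbads: int) -> int:
    """Data size in bytes of an LBA format whose LBADS field is `lbads`."""
    if lbads < 9:
        # The NVMe spec does not allow data sizes below 512 bytes (LBADS < 9).
        raise ValueError("LBADS below 9 is not a valid NVMe LBA data size")
    return 1 << lbads


# LBADS 9 corresponds to 512-byte sectors and LBADS 12 to 4096-byte sectors.
for lbads in (9, 12):
    print(f"LBADS={lbads}: {lba_data_size(lbads)} bytes")
```

So whether a given LBAFormat index means 4k or some other size depends on the LBADS and metadata size the drive reports for that index, not on the index itself.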
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
@ 2017-01-24 9:55 ` Tobias Oberstein
2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:55 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio@vger.kernel.org
On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of a raid metadata.
>
> Your findings follow my expectations, for QD1 sync engine does good results. Can you try libio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or latest kernel depends of the scope of the testing.
>
> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/

It doesn't work =(

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start -nvmeformat -intelssd 0 \
> LBAFormat=4 \
> SecureEraseSetting=0 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : NVMe command reported a problem.

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all -intelssd 0
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
AggregationThreshold : 0
AggregationTime : 0
ArbitrationBurst : 0
Bootloader : 8B1B0133
CoalescingDisable : 1
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
EndToEndDataProtCapabilities : 17
EnduranceAnalyzer : Media Workload Indicators have reset values. Run 60+ minute workload prior to running the endurance analyzer.
ErrorString :
Firmware : 8DV101F0
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
HighPriorityWeightArbitration : 0
IOCompletionQueuesRequested : 30
IOSubmissionQueuesRequested : 30
Index : 0
Intel : True
IntelGen3SATA : False
IntelNVMe : True
InterruptVector : 0
LBAFormat : 0
LatencyTrackingEnabled : False
LowPriorityWeightArbitration : 0
MaximumLBA : 3907029167
MediumPriorityWeightArbitration : 0
MetadataSetting : 0
ModelNumber : INTEL SSDPECME040T4
NVMeControllerID : 0
NVMeMajorVersion : 1
NVMeMinorVersion : 0
NVMePowerState : 0
NVMeTertiaryVersion : 0
NamespaceId : 1
NativeMaxLBA : 3907029167
NumErrorLogPageEntries : 63
NumLBAFormats : 6
OEM : Generic
PCILinkGenSpeed : 3
PCILinkWidth : 4
PowerGovernorMode : 0 40W for 8 Lane Slot power
Product : Fultondale X8
ProductFamily : Intel SSD DC P3608 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : 106
SectorSize : 512
SerialNumber : CVF8551400324P0DGN-1
TCGSupported : False
TempThreshold : 85
TimeLimitedErrorRecovery : 0
TrimSupported : True
VolatileWriteCacheEnabled : False
WriteAtomicityDisableNormal : 0

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct --version
Syntax Error: Invalid command. Error at or around '--version'.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.2
Description: Interact and configure Intel SSDs.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
@ 2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 10:03 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio@vger.kernel.org
On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of a raid metadata.
>
> Your findings follow my expectations, for QD1 sync engine does good results. Can you try libio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or latest kernel depends of the scope of the testing.
>
> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>
It doesn't work with LBAFormat=3 either:
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start -nvmeformat -intelssd 0 \
> LBAFormat=3 \
> SecureEraseSetting=0 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : Interrupted system call
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all -intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6
-----
And using exactly the same parameters as the article above:
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ time sudo isdct start -nvmeformat -intelssd 0 \
> LBAFormat=3 \
> SecureEraseSetting=2 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : Interrupted system call
real 0m26.901s
user 0m0.048s
sys 0m0.032s
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
----
I see the following in the kernel log:
[417528.128501] nvme nvme0: I/O 0 QID 0 timeout, reset controller
[417786.440977] nvme nvme0: I/O 0 QID 0 timeout, reset controller
What should I do?
Thanks a lot,
/Tobias
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
` (2 preceding siblings ...)
2017-01-24 10:03 ` Tobias Oberstein
@ 2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 15:19 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio@vger.kernel.org
Hi Andrey,
> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
After overcoming my issues with isdct, and reformatting the NVMes to 4k
sector size, success!
9.5 million IOPS =)
This is another 34% faster than before.
So: thanks a bunch for your tip!
Cheers,
/Tobias
Next steps:
- approach MD developers about bottlenecks there
- approach PostgreSQL about using pread/pwrite (instead of lseek/read/write)
randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 128 threads
Jobs: 128 (f=2048): [r(128)] [100.0% done] [37244MB/0KB/0KB /s] [9534K/0/0 iops] [eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=25406: Tue Jan 24 15:57:19 2017
read : io=1083.9GB, bw=36964MB/s, iops=9462.8K, runt= 30026msec
cpu : usr=9.00%, sys=77.01%, ctx=49252920, majf=0, minf=16512
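As a sanity check on the fio summary above, the reported bandwidth should simply be the reported IOPS multiplied by the 4k block size. The sketch below redoes that arithmetic. It assumes that this fio version prints "K" in the IOPS figure as a factor of 1000 and "MB/s" as 2^20 bytes per second, and it takes the count of 16 logical NVMe devices from the first post in the thread; the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * 1024

# Figures copied from the fio "read :" summary line in this message.
REPORTED_IOPS = 9_462_800      # iops=9462.8K (assumed: K = 1000)
BLOCK_SIZE = 4 * KIB           # bs=4K
LOGICAL_DEVICES = 16           # from the first post: 8 x P3608 = 16 NVMes


def bandwidth_mib_per_s(iops, block_size_bytes):
    """Throughput implied by an IOPS figure at a fixed block size."""
    return iops * block_size_bytes / MIB


if __name__ == "__main__":
    bw = bandwidth_mib_per_s(REPORTED_IOPS, BLOCK_SIZE)
    print(f"implied bandwidth: {bw:.1f} MiB/s")
    print(f"average IOPS per logical device: "
          f"{REPORTED_IOPS / LOGICAL_DEVICES:.0f}")
```

With those assumptions the implied bandwidth agrees with the `bw=36964MB/s` that fio printed, so the IOPS and bandwidth figures are consistent with each other.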
Thread overview: 27+ messages
2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
2017-01-23 17:52 ` Tobias Oberstein
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
2017-01-23 19:26 ` Tobias Oberstein
2017-01-23 19:13 ` Sitsofe Wheeler
2017-01-23 19:40 ` Tobias Oberstein
2017-01-23 20:24 ` Sitsofe Wheeler
2017-01-23 21:22 ` Tobias Oberstein
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
2017-01-23 21:42 ` Andrey Kuzmin
2017-01-23 23:51 ` Tobias Oberstein
2017-01-24 8:21 ` Andrey Kuzmin
2017-01-24 9:28 ` Tobias Oberstein
2017-01-24 9:40 ` Andrey Kuzmin
2017-01-24 22:51 ` Tobias Oberstein
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
2017-01-26 17:52 ` Tobias Oberstein
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2017-01-23 20:10 ` Tobias Oberstein
[not found] ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
[not found] ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
2017-01-23 20:20 ` Tobias Oberstein
[not found] ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
[not found] ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
2017-01-23 21:49 ` Tobias Oberstein
2017-01-23 18:18 ` Kudryavtsev, Andrey O
2017-01-23 18:53 ` Tobias Oberstein
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein