From: Tobias Oberstein <tobias.oberstein@gmail.com>
To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>, fio@vger.kernel.org
Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
Date: Mon, 23 Jan 2017 18:52:01 +0100
Message-ID: <061acabe-dbd0-ae07-cd69-772eccd15b21@gmail.com>
In-Reply-To: <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
On 23.01.2017 at 18:03, Andrey Kuzmin wrote:
> Why don't you just 'perf' your md run and find out where it spends (an
> awful lot of extra) time?
Good idea!
I ran with threads=1024 (to account for perf overhead). At that
concurrency, Linux MD reaches 25% lower IOPS and has higher system load.
Please see here:
https://github.com/oberstet/scratchbox/tree/master/cruncher/sql19/linux-md-bottleneck
With higher concurrency, the discrepancy widens, up to 7 million vs 1.6
million IOPS.
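
As a rough consistency check on that gap, the figures from the two fio summaries quoted below can be run through Little's law (mean I/Os in flight = throughput x mean latency). This is only a sketch: the numbers are copied from the quoted output, the names `RAW`, `MD`, `in_flight` and `bandwidth_gib_s` are made up for illustration, and the result is a plausibility check, not a diagnosis.

```python
# Figures copied from the two fio summaries quoted below (2560 threads, bs=4k).
RAW = {"iops": 6_965_000, "clat_avg_us": 360.11}  # individual NVMe devices
MD = {"iops": 1_588_000, "clat_avg_us": 124.51}   # /dev/md1 stripe set
BLOCK_SIZE = 4096  # bytes, from bs=4k


def in_flight(run):
    """Little's law: mean I/Os in flight = throughput * mean latency."""
    return run["iops"] * run["clat_avg_us"] * 1e-6


def bandwidth_gib_s(run):
    """Bandwidth implied by the IOPS figure at a fixed block size, in GiB/s."""
    return run["iops"] * BLOCK_SIZE / 2**30


if __name__ == "__main__":
    print(f"IOPS ratio raw/MD: {RAW['iops'] / MD['iops']:.2f}x")
    for name, run in (("raw", RAW), ("md", MD)):
        print(f"{name}: ~{in_flight(run):.0f} I/Os in flight, "
              f"{bandwidth_gib_s(run):.2f} GiB/s")
```

On the raw devices the product comes out close to the 2560 configured threads (about 2508); on MD it is only about 198. One possible reading is that on MD most threads spend their time outside the window fio reports as completion latency, which may be worth keeping in mind when looking at the perf profile.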
I am not a kernel hacker.
What is osq_lock?
FWIW, this is a NUMA machine with 4 x E7 (88 cores / 176 HT) and 8 x
Intel P3608 NVMe.
Any hints or anything I should try / measure?
Thanks a lot for your tips and assistance!
Cheers,
/Tobias
>
> On Jan 23, 2017 19:28, "Tobias Oberstein" <tobias.oberstein@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a question regarding Linux software RAID (MD) as tested with FIO -
>> so this is slightly OT, but I am hoping for expert advice or redirection
>> to a more appropriate place (if this is unwelcome here).
>>
>> I have a box with this HW:
>>
>> - 88 cores Xeon E7 (176 HTs) + 3TB RAM
>> - 8 x Intel P3608 4TB NVMe (which is logically 16 NVMes)
>>
>> With random 4kB read load, I am able to max it out at 7 million IOPS - but
>> only if I run FIO on the _individual_ NVMe devices.
>>
>> [global]
>> group_reporting
>> filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
>> size=30G
>> ioengine=sync
>> iodepth=1
>> thread=1
>> direct=1
>> time_based=1
>> randrepeat=0
>> norandommap=1
>> bs=4k
>> runtime=120
>>
>> [randread]
>> stonewall
>> rw=randread
>> numjobs=2560
>>
>> When I create a stripe set over all devices:
>>
>> sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
>> /dev/nvme0n1 \
>> /dev/nvme1n1 \
>> /dev/nvme2n1 \
>> /dev/nvme3n1 \
>> /dev/nvme4n1 \
>> /dev/nvme5n1 \
>> /dev/nvme6n1 \
>> /dev/nvme7n1 \
>> /dev/nvme8n1 \
>> /dev/nvme9n1 \
>> /dev/nvme10n1 \
>> /dev/nvme11n1 \
>> /dev/nvme12n1 \
>> /dev/nvme13n1 \
>> /dev/nvme14n1 \
>> /dev/nvme15n1
>>
>> I only get 1.6 million IOPS. Detail results down below.
>>
>> Note: the array is created with chunk size 8K because this is for a
>> database workload. Here I tested with 4k block size, but it's similar
>> (lower perf on MD) with 8k.
>>
>> Any help or hints would be greatly appreciated!
>>
>> Cheers,
>> /Tobias
>>
>>
>>
>> 7 million IOPS on raw, individual NVMe devices
>> ==============================================
>>
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
>> randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
>> ioengine=sync, iodepth=1
>> ...
>> fio-2.17-17-g9cf1
>> Starting 2560 threads
>> Jobs: 2367 (f=29896): [_(2),f(3),_(2),f(11),_(2),f(2
>> ),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(
>> 1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(
>> 2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(
>> 1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_
>> (1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(
>> 11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(
>> 1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(
>> 134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30)
>> ,_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),
>> f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(
>> 1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(
>> 15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(
>> 1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_
>> (1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_
>> (1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(
>> 22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_
>> (1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(
>> 11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1)
>> ,f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),
>> _(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(
>> 45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1)
>> ,f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),
>> _(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(
>> 18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=
>> 21.1GiB/s,w=0KiB/s][r=5751k,w=0 IOPS][eta 00m:00s]
>> randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23 15:47:17
>> 2017
>> read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
>> clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
>> lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
>> clat percentiles (usec):
>> | 1.00th=[ 114], 5.00th=[ 135], 10.00th=[ 149], 20.00th=[ 171],
>> | 30.00th=[ 191], 40.00th=[ 213], 50.00th=[ 239], 60.00th=[ 270],
>> | 70.00th=[ 314], 80.00th=[ 378], 90.00th=[ 556], 95.00th=[ 980],
>> | 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
>> | 99.99th=[ 8096]
>> lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
>> lat (usec) : 1000=1.79%
>> lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
>> cpu : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
>> latency : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>> READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s (28.6GB/s-28.6GB/s),
>> io=3189GiB (3424GB), run=120007-120007msec
>>
>> Disk stats (read/write):
>> nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0, in_queue=14802400,
>> util=100.00%
>> nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0, in_queue=15101276,
>> util=100.00%
>> nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0, in_queue=12053112,
>> util=100.00%
>> nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0, in_queue=11135004,
>> util=100.00%
>> nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0, in_queue=21079576,
>> util=100.00%
>> nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0, in_queue=19393024,
>> util=100.00%
>> nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0, in_queue=20140104,
>> util=100.00%
>> nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0, in_queue=21090048,
>> util=100.00%
>> nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0, in_queue=14929172,
>> util=100.00%
>> nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0, in_queue=13919288,
>> util=100.00%
>> nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0,
>> in_queue=11390392, util=100.00%
>> nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0,
>> in_queue=20110288, util=100.00%
>> nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0,
>> in_queue=11683568, util=100.00%
>> nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0,
>> in_queue=16314628, util=100.00%
>> nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0,
>> in_queue=27659920, util=100.00%
>> nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0,
>> in_queue=17910636, util=100.00%
>>
>>
>> 1.6 million IOPS on Linux MD over 16 NVMe devices
>> =================================================
>>
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
>> randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
>> ioengine=sync, iodepth=1
>> ...
>> fio-2.17-17-g9cf1
>> Starting 2560 threads
>> Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0
>> IOPS][eta 00m:00s]
>> randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23 17:21:15
>> 2017
>> read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
>> clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
>> lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
>> clat percentiles (usec):
>> | 1.00th=[ 78], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 89],
>> | 30.00th=[ 95], 40.00th=[ 102], 50.00th=[ 105], 60.00th=[ 108],
>> | 70.00th=[ 118], 80.00th=[ 133], 90.00th=[ 173], 95.00th=[ 221],
>> | 99.00th=[ 358], 99.50th=[ 506], 99.90th=[ 2192], 99.95th=[ 2608],
>> | 99.99th=[ 2960]
>> lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
>> lat (usec) : 1000=0.07%
>> lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
>> cpu : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
>> latency : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>> READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s (6505MB/s-6505MB/s),
>> io=728GiB (781GB), run=120098-120098msec
>>
>> Disk stats (read/write):
>> md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
>> aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0,
>> aggrin_queue=1247601, aggrutil=100.00%
>> nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0, in_queue=1225896,
>> util=100.00%
>> nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0, in_queue=1191452,
>> util=100.00%
>> nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0, in_queue=1296728,
>> util=100.00%
>> nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0, in_queue=1239808,
>> util=100.00%
>> nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0, in_queue=1272916,
>> util=100.00%
>> nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0, in_queue=1178360,
>> util=100.00%
>> nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0, in_queue=1207808,
>> util=100.00%
>> nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0, in_queue=1258956,
>> util=100.00%
>> nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0, in_queue=1304536,
>> util=100.00%
>> nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0, in_queue=1281952,
>> util=100.00%
>> nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0, in_queue=1271820,
>> util=100.00%
>> nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0, in_queue=1224192,
>> util=100.00%
>> nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0, in_queue=1214240,
>> util=100.00%
>> nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0, in_queue=1242372,
>> util=100.00%
>> nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0, in_queue=1277600,
>> util=100.00%
>> nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0, in_queue=1272988,
>> util=100.00%
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
>>
>
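
Regarding the chunk-size note in the quoted message (array created with --chunk=8 over 16 devices, tested with 4k and 8k blocks): the following is a minimal sketch of how a byte offset on a stripe set maps to a member device. It assumes a simplified RAID0 model with equal-size members and plain round-robin chunk placement; it ignores the md data offset and is not taken from the kernel's raid0 code. The helper names `map_offset` and `members_touched` are hypothetical, and the member index is the position in the array, which need not match the nvmeN numbering.

```python
CHUNK_BYTES = 8 * 1024  # --chunk=8 (mdadm interprets the value in KiB)
N_DEVICES = 16          # --raid-devices=16


def map_offset(array_offset, chunk=CHUNK_BYTES, n=N_DEVICES):
    """Map a byte offset on the array to (member index, byte offset on member).

    Simplified model: equal-size members, chunks placed round-robin.
    """
    chunk_no, within = divmod(array_offset, chunk)
    stripe_no, member = divmod(chunk_no, n)
    return member, stripe_no * chunk + within


def members_touched(array_offset, length, chunk=CHUNK_BYTES, n=N_DEVICES):
    """Return the set of member indices that one request of `length` bytes touches."""
    first = array_offset // chunk
    last = (array_offset + length - 1) // chunk
    return {c % n for c in range(first, last + 1)}


if __name__ == "__main__":
    print(map_offset(4096))             # second half of the first chunk
    print(members_touched(4096, 4096))  # aligned 4k read
    print(members_touched(4096, 8192))  # 8k read that is only 4k-aligned
```

Under this model a 4k-aligned 4k read always stays within one chunk and so touches exactly one member, while an 8k read touches a single member only if it is itself 8k-aligned; otherwise it has to be split across two members.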