linux-raid.vger.kernel.org archive mirror
* performance collapse: 9 mio IOPS to 1.5 mio with MD RAID0
@ 2017-01-25 11:45 Tobias Oberstein
From: Tobias Oberstein @ 2017-01-25 11:45 UTC
  To: linux-raid

Hi,

I have a storage setup consisting of 8 NVMe drives (16 logical drives) 
that I have verified (with FIO) is able to do >9 million 4 kB random 
read IOPS when I run FIO against the set of individual NVMes.

However, when I create an MD (RAID-0) array over the 16 NVMes and run 
the same tests, performance collapses:

ioengine=sync, individual NVMes: IOPS=9191k
ioengine=sync, MD (RAID-0) over NVMes: IOPS=1562k

Using ioengine=psync, the performance collapse isn't as dramatic, but 
still very significant:

ioengine=psync, individual NVMes: IOPS=9395k
ioengine=psync, MD (RAID-0) over NVMes: IOPS=4117k

--

All detailed results (including runs under Linux perf) and the FIO 
control files are here:

https://github.com/oberstet/scratchbox/tree/master/cruncher/sync-engines-perf
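
For a quick impression without opening the repo: the runs are of 
roughly this shape (a sketch only; numjobs and runtime here are 
placeholders, the exact job files are in the repository above):

# 4 kB random reads against one raw NVMe namespace
# (repeated across all 16 namespaces for the "individual NVMes" number)
sudo fio --name=raw-nvme --filename=/dev/nvme0n1 --direct=1 \
   --rw=randread --bs=4k --ioengine=sync --numjobs=64 \
   --runtime=60 --time_based --group_reporting

# the same job against the striped MD device
sudo fio --name=md-raid0 --filename=/dev/md1 --direct=1 \
   --rw=randread --bs=4k --ioengine=sync --numjobs=64 \
   --runtime=60 --time_based --group_reporting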

--

With sync/MD, top in perf is

   82.77%  fio      [kernel.kallsyms]   [k] osq_lock
    3.12%  fio      [kernel.kallsyms]   [k] nohz_balance_exit_idle
    1.40%  fio      [kernel.kallsyms]   [k] trigger_load_balance
    1.01%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath


With psync/MD, top in perf is

   45.56%  fio      [kernel.kallsyms]   [k] md_make_request
    4.33%  fio      [kernel.kallsyms]   [k] osq_lock
    3.40%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
    3.23%  fio      [kernel.kallsyms]   [k] _raw_spin_lock
    2.21%  fio      [kernel.kallsyms]   [k] raid0_make_request
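
Profiles like the above can be captured along these lines (a sketch; 
"md-sync.fio" is a placeholder name, the exact invocations are in the 
repo):

# system-wide profile with call graphs while the fio job runs
sudo perf record -a -g -- fio md-sync.fio

# then inspect the hottest kernel symbols
sudo perf report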

--

Of course there is no free lunch, but a performance collapse of this 
magnitude for RAID-0, which is pure striping, seems excessive.

What's going on?

Cheers,
/Tobias


MD device was created like this:

sudo mdadm --create /dev/md1 \
   --chunk=8 \
   --level=0 \
   --raid-devices=16 \
   /dev/nvme0n1 \
   /dev/nvme1n1 \
   /dev/nvme2n1 \
   /dev/nvme3n1 \
   /dev/nvme4n1 \
   /dev/nvme5n1 \
   /dev/nvme6n1 \
   /dev/nvme7n1 \
   /dev/nvme8n1 \
   /dev/nvme9n1 \
   /dev/nvme10n1 \
   /dev/nvme11n1 \
   /dev/nvme12n1 \
   /dev/nvme13n1 \
   /dev/nvme14n1 \
   /dev/nvme15n1

The NVMes are low-level formatted with 4 kB sectors. Before, I had 
512-byte sectors (the default), and the performance collapse was even 
more dramatic.

An 8 kB chunk size is used because this array is supposed to carry 
database workloads later.
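
(mdadm interprets --chunk in KiB, so --chunk=8 above gives the intended 
8 kB stripe unit. A quick sanity check, as a sketch:)

sudo mdadm --detail /dev/md1 | grep -i chunk
cat /proc/mdstat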

My target workload is PostgreSQL, which does 100% 8 kB I/O using 
lseek/read/write (it does not use pread/pwrite, preadv/pwritev, etc.).
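
fio's sync engine uses the same lseek()/read()/write() pattern, so the 
ioengine=sync numbers above already reflect that syscall sequence. A 
rough stand-in for the target workload at 8 kB could look like this (a 
sketch; the read/write mix and job count are purely illustrative):

sudo fio --name=pg-like --filename=/dev/md1 --direct=1 \
   --rw=randrw --rwmixread=70 --bs=8k --ioengine=sync \
   --numjobs=32 --runtime=60 --time_based --group_reporting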


* Re: performance collapse: 9 mio IOPS to 1.5 mio with MD RAID0
From: Stan Hoeppner @ 2017-01-25 23:01 UTC
  To: Tobias Oberstein, linux-raid

On 01/25/2017 05:45 AM, Tobias Oberstein wrote:
> Hi,
>
> I have a storage setup consisting of 8 NVMe drives (16 logical drives) 
> that I have verified (with FIO) is able to do >9 million 4 kB random 
> read IOPS when I run FIO against the set of individual NVMes.
>
> However, when I create an MD (RAID-0) array over the 16 NVMes and run 
> the same tests, performance collapses:
>
> ioengine=sync, individual NVMes: IOPS=9191k
> ioengine=sync, MD (RAID-0) over NVMes: IOPS=1562k
>
> Using ioengine=psync, the performance collapse isn't as dramatic, but 
> still very significant:
>
> ioengine=psync, individual NVMes: IOPS=9395k
> ioengine=psync, MD (RAID-0) over NVMes: IOPS=4117k
>
> -- 
>
> All detail results (including runs under Linux perf) and FIO control 
> files are here
>
> https://github.com/oberstet/scratchbox/tree/master/cruncher/sync-engines-perf 
>
>

You don't need 1024 jobs to fill the request queues.  Just out of 
curiosity, what are the fio results when using fewer jobs and a greater 
queue depth, say one job per core, 88 total, with a queue depth of 32?

osq_lock appears to be a per-CPU optimistic spin queue lock.  It might 
be of benefit to try even fewer jobs, so that fewer cores are active.
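
Something along these lines, for example (a sketch; note that iodepth 
only takes effect with an asynchronous engine such as libaio):

sudo fio --name=md-aio --filename=/dev/md1 --direct=1 \
   --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
   --numjobs=88 --runtime=60 --time_based --group_reporting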


* Re: performance collapse: 9 mio IOPS to 1.5 mio with MD RAID0
From: Tobias Oberstein @ 2017-01-26  8:35 UTC
  To: Stan Hoeppner, linux-raid

On 26.01.2017 00:01, Stan Hoeppner wrote:
> On 01/25/2017 05:45 AM, Tobias Oberstein wrote:
>> [...]
>
> You don't need 1024 jobs to fill the request queues.  Just out of

Here is how the measured IOPS scale with concurrency for the sync engines:

https://github.com/oberstet/scratchbox/raw/master/cruncher/Performance%20Results%20-%20NVMe%20Scaling%20with%20IO%20Concurrency.pdf

Note: the top numbers are lower than what I posted above, because these 
measurements were still done with 512-byte sectors (not with 4 kB as 
above).

> curiosity, what are the fio results when using fewer jobs and a greater
> queue depth, say one job per core, 88 total, with a queue depth of 32?

iodepth doesn't apply to synchronous I/O engines (ioengine=sync/psync/..).

>
> osq_lock appears to be a per cpu opportunistic spinlock.  Might be of
> benefit to try even fewer jobs for fewer active cores.

With reduced concurrency, I am not able to saturate the storage anymore.

In essence, 1 Xeon core is needed per 120k IOPS with ioengine=sync, 
even on the set of raw NVMe devices (no MD RAID).

For comparison: libaio achieves 600k IOPS per core.

SPDK claims 1.8 million IOPS per core (I haven't measured that myself; 
it is Intel's claim).
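
Per-core numbers like these can be reproduced by pinning a single job 
to one core and varying only the engine, roughly like this (a sketch; 
device path and iodepth are illustrative):

# one core, synchronous lseek/read
sudo fio --name=sync-1core --filename=/dev/nvme0n1 --direct=1 \
   --rw=randread --bs=4k --ioengine=sync --numjobs=1 \
   --cpus_allowed=0 --runtime=30 --time_based

# one core, libaio with a deep queue
sudo fio --name=aio-1core --filename=/dev/nvme0n1 --direct=1 \
   --rw=randread --bs=4k --ioengine=libaio --iodepth=128 \
   --numjobs=1 --cpus_allowed=0 --runtime=30 --time_based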

Cheers,
/Tobias

