NVMe over RDMA latency

From: mlin@kernel.org (Ming Lin)
Subject: NVMe over RDMA latency
Date: Wed, 13 Jul 2016 11:25:32 -0700	[thread overview]
Message-ID: <1468434332.1869.8.camel@ssi> (raw)
In-Reply-To: <57860EBA.5010103@grimberg.me>

On Wed, 2016-07-13 at 12:49 +0300, Sagi Grimberg wrote:
> > Hi list,
> 
> Hey Ming,
> 
> > I'm trying to understand the NVMe over RDMA latency.
> >
> > Test hardware:
> > A real NVMe PCI drive on target
> > Host and target back-to-back connected by Mellanox ConnectX-3
> >
> > [global]
> > ioengine=libaio
> > direct=1
> > runtime=10
> > time_based
> > norandommap
> > group_reporting
> >
> > [job1]
> > filename=/dev/nvme0n1
> > rw=randread
> > bs=4k
> >
> >
> > fio latency data on host side(test nvmeof device)
> >      slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> >      clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
> >       lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
> >
> > fio latency data on target side(test NVMe pci device locally)
> >      slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> >      clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
> >       lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
> >
> > So I picked up this sample from blktrace, which seems to match the fio avg latency data.
> >
> > Host(/dev/nvme0n1)
> > 259,0    0       86     0.015768739  3241  Q   R 1272199648 + 8 [fio]
> > 259,0    0       87     0.015769674  3241  G   R 1272199648 + 8 [fio]
> > 259,0    0       88     0.015771628  3241  U   N [fio] 1
> > 259,0    0       89     0.015771901  3241  I  RS 1272199648 + 8 (    2227) [fio]
> > 259,0    0       90     0.015772863  3241  D  RS 1272199648 + 8 (     962) [fio]
> > 259,0    1       85     0.015819257     0  C  RS 1272199648 + 8 (   46394) [0]
> >
> > Target(/dev/nvme0n1)
> > 259,0    0      141     0.015675637  2197  Q   R 1272199648 + 8 [kworker/u17:0]
> > 259,0    0      142     0.015676033  2197  G   R 1272199648 + 8 [kworker/u17:0]
> > 259,0    0      143     0.015676915  2197  D  RS 1272199648 + 8 (15676915) [kworker/u17:0]
> > 259,0    0      144     0.015694992     0  C  RS 1272199648 + 8 (   18077) [0]
> >
> > So host completed IO in about 50usec and target completed IO in about 20usec.
> > Does that mean the 30usec delta comes from RDMA write(host read means target RDMA write)?
> 
> 
> Couple of things that come to mind:
> 
> 0. You are using iodepth=1, correct?

I hadn't set it, but it's 1 by default.
I have now set it explicitly.

root@host:~# cat t.job
[global]
ioengine=libaio
direct=1
runtime=20
time_based
norandommap
group_reporting

[job1]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=1
numjobs=1


> 
> 1. I imagine you are not polling in the host but are rather interrupt
>     driven, correct? That's a latency source.

It's polling.

root@host:~# cat /sys/block/nvme0n1/queue/io_poll
1

> 
> 2. The target code is polling if the block device supports it. Can you
>     confirm that is indeed the case?

Yes.

> 
> 3. mlx4 has a strong fencing policy for memory registration, which we
>     always do. That's a latency source. Can you try with
>     register_always=0?

root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
N


> 
> 4. IRQ affinity assignments. If the sqe is submitted on cpu core X and
>     the completion comes to cpu core Y, we will consume some latency
>     with the context-switch of waking up fio on cpu core X. Is this
>     a possible case?

Only 1 CPU is online on both the host and the target machines.

root@host:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7

root@target:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7


> 
> 5. What happens if you test against a null_blk (which has a latency of
>     < 1us)? Back when I ran some tryouts I saw ~10-11us added latency
>     from the fabric under similar conditions.

With null_blk on the target, the latency is about 12us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [305.1MB/0KB/0KB /s] [78.4K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3067: Wed Jul 13 11:20:19 2016
  read : io=6096.9MB, bw=312142KB/s, iops=78035, runt= 20001msec
    slat (usec): min=1, max=207, avg= 2.01, stdev= 0.34
    clat (usec): min=0, max=8020, avg= 9.99, stdev= 9.06
     lat (usec): min=10, max=8022, avg=12.10, stdev= 9.07


With the real NVMe device on the target, the host sees a latency of about 33us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [113.1MB/0KB/0KB /s] [28.1K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3139: Wed Jul 13 11:22:15 2016
  read : io=2259.5MB, bw=115680KB/s, iops=28920, runt= 20001msec
    slat (usec): min=1, max=195, avg= 2.62, stdev= 1.24
    clat (usec): min=0, max=7962, avg=30.97, stdev=14.50
     lat (usec): min=27, max=7968, avg=33.70, stdev=14.69

And testing the NVMe device locally on the target gives about 23us.
So nvmeof adds only about 10us.

That's nice!

root@target:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.8-26-g603e
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [161.2MB/0KB/0KB /s] [41.3K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=2725: Wed Jul 13 11:23:46 2016
  read : io=1605.3MB, bw=164380KB/s, iops=41095, runt= 10000msec
    slat (usec): min=1, max=60, avg= 1.88, stdev= 0.63
    clat (usec): min=1, max=144, avg=21.61, stdev= 8.96
     lat (usec): min=19, max=162, avg=23.59, stdev= 9.00
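
The arithmetic behind the "about 10us" figure can be rechecked with a small
sketch. This is only an illustration and not part of any tool used above:
the dictionary keys and function names are made up here, and the numbers are
the "lat" averages and iops values copied by hand from the three fio runs.
It also checks the assumption that, with iodepth=1 and a single job, iops
should be roughly 1e6 divided by the average latency in usec.

```python
# Average total latency ("lat" avg, usec) copied from the three fio runs above.
FIO_AVG_LAT_USEC = {
    "host_null_blk": 12.10,      # host -> fabric -> null_blk on the target
    "host_nvme": 33.70,          # host -> fabric -> real NVMe on the target
    "target_local_nvme": 23.59,  # fio run locally on the target
}

# IOPS reported by the same three runs.
FIO_IOPS = {
    "host_null_blk": 78035,
    "host_nvme": 28920,
    "target_local_nvme": 41095,
}


def fabric_overhead_usec(remote_lat, local_lat):
    """Latency added by the fabric path: remote average minus local average."""
    return round(remote_lat - local_lat, 2)


def expected_iops(avg_lat_usec):
    """Rough IOPS estimate for one job at iodepth=1: one IO per average latency."""
    return 1_000_000 / avg_lat_usec


if __name__ == "__main__":
    overhead = fabric_overhead_usec(
        FIO_AVG_LAT_USEC["host_nvme"], FIO_AVG_LAT_USEC["target_local_nvme"]
    )
    print(f"nvmeof overhead with real NVMe: {overhead} usec")
    for name, lat in FIO_AVG_LAT_USEC.items():
        est = expected_iops(lat)
        print(f"{name}: measured {FIO_IOPS[name]} iops, estimated {est:.0f} iops")
```

The estimated iops come out a few percent above the measured values in all
three runs, which is what one would expect if a small amount of time per IO
is spent outside the latency fio accounts for.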

