* NVMe over RDMA latency
@ 2016-07-07 19:55 Ming Lin
2016-07-13 9:49 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Ming Lin @ 2016-07-07 19:55 UTC (permalink / raw)
Hi list,
I'm trying to understand the NVMe over RDMA latency.
Test hardware:
A real NVMe PCI drive on target
Host and target back-to-back connected by Mellanox ConnectX-3
[global]
ioengine=libaio
direct=1
runtime=10
time_based
norandommap
group_reporting
[job1]
filename=/dev/nvme0n1
rw=randread
bs=4k
fio latency data on host side (testing the nvmeof device)
slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
fio latency data on target side (testing the NVMe PCI device locally)
slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
I picked this sample from blktrace, which seems to match the fio average latency data.
Host(/dev/nvme0n1)
259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
259,0 0 88 0.015771628 3241 U N [fio] 1
259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
Target(/dev/nvme0n1)
259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?
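For reference, the D->C deltas in the two traces above work out to (assuming this sample is representative):

host:   0.015819257 - 0.015772863 = ~46.4 usec
target: 0.015694992 - 0.015676915 = ~18.1 usec
delta:  46.4 - 18.1               = ~28 usec spent in the fabric and the target-side stack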
Thanks,
Ming
Below is just for my own reference, to map the blktrace flags to the kernel trace points
===================================================================
Q - queued:
generic_make_request_checks: trace_block_bio_queue(q, bio)
G - get request:
blk_mq_map_request: trace_block_getrq(q, bio, op)
U - unplug:
blk_mq_insert_requests: trace_block_unplug(q, depth, !from_schedule)
I - inserted:
__blk_mq_insert_req_list: trace_block_rq_insert(hctx->queue, rq)
D - issued:
blk_mq_start_request: trace_block_rq_issue(q, rq)
C - complete:
blk_update_request: trace_block_rq_complete(req->q, req, nr_bytes)
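
For reference, a trace like the ones above can be captured and decoded roughly like this (device name and runtime are just examples):

blktrace -d /dev/nvme0n1 -w 10 -o nvme_trace   # capture 10s of block-layer events
blkparse -i nvme_trace -o nvme_trace.txt       # decode into the per-event lines shown above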

* NVMe over RDMA latency
  2016-07-07 19:55 NVMe over RDMA latency Ming Lin
@ 2016-07-13  9:49 ` Sagi Grimberg
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
  2016-07-13 18:25 ` Ming Lin
  0 siblings, 2 replies; 7+ messages in thread
From: Sagi Grimberg @ 2016-07-13 9:49 UTC (permalink / raw)

> Hi list,

Hey Ming,

> I'm trying to understand the NVMe over RDMA latency.
>
> Test hardware:
> A real NVMe PCI drive on target
> Host and target back-to-back connected by Mellanox ConnectX-3
>
> [global]
> ioengine=libaio
> direct=1
> runtime=10
> time_based
> norandommap
> group_reporting
>
> [job1]
> filename=/dev/nvme0n1
> rw=randread
> bs=4k
>
> fio latency data on host side (testing the nvmeof device)
> slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
>  lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
>
> fio latency data on target side (testing the NVMe PCI device locally)
> slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
>  lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
>
> I picked this sample from blktrace, which seems to match the fio average latency data.
>
> Host(/dev/nvme0n1)
> 259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
> 259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
> 259,0 0 88 0.015771628 3241 U N [fio] 1
> 259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
> 259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
> 259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
>
> Target(/dev/nvme0n1)
> 259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
> 259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
>
> So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
> Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?

A couple of things come to mind:

0. You are using iodepth=1, correct?

1. I imagine you are not polling on the host but rather interrupt
   driven, correct? That's a latency source.

2. The target code polls if the block device supports it. Can you
   confirm that is indeed the case?

3. mlx4 has a strong fencing policy for memory registration, which we
   always do. That's a latency source. Can you try with
   register_always=0?

4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
   the completion comes to cpu core Y, we will consume some latency
   with the context switch of waking up fio on cpu core X. Is this
   a possible case?

5. What happens if you test against a null_blk (which has a latency of
   < 1us)? Back when I ran some tryouts I saw ~10-11us of added latency
   from the fabric under similar conditions.
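For point 5, one way to back the target with null_blk is to load it and export it through the nvmet configfs instead of the real namespace. A minimal sketch, assuming the standard nvmet configfs layout; the NQN, address and port below are placeholders:

modprobe null_blk queue_mode=2 irqmode=0     # multiqueue null device, completed inline (no interrupt)
mkdir /sys/kernel/config/nvmet/subsystems/testnqn
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
mkdir /sys/kernel/config/nvmet/ports/1
echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/testnqn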
[parent not found: <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>]
* NVMe over RDMA latency
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
@ 2016-07-13 17:25 ` Ming Lin
  0 siblings, 0 replies; 7+ messages in thread
From: Ming Lin @ 2016-07-13 17:25 UTC (permalink / raw)

On Wed, Jul 13, 2016 at 10:22 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> 2) Could anyone pass along a pointer on how to check whether a block device
>    supports polling mode (and how to turn it on)?

Example:

echo 0 > /sys/block/nvme0n1/queue/io_poll
echo 1 > /sys/block/nvme0n1/queue/io_poll
* NVMe over RDMA latency
  2016-07-13  9:49 ` Sagi Grimberg
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
@ 2016-07-13 18:25 ` Ming Lin
  2016-07-14  6:52 ` Sagi Grimberg
  2016-07-14 16:43 ` Wendy Cheng
  1 sibling, 2 replies; 7+ messages in thread
From: Ming Lin @ 2016-07-13 18:25 UTC (permalink / raw)

On Wed, 2016-07-13 at 12:49 +0300, Sagi Grimberg wrote:
> > Hi list,
>
> Hey Ming,
>
> > I'm trying to understand the NVMe over RDMA latency.
> >
> > Test hardware:
> > A real NVMe PCI drive on target
> > Host and target back-to-back connected by Mellanox ConnectX-3
> >
> > [global]
> > ioengine=libaio
> > direct=1
> > runtime=10
> > time_based
> > norandommap
> > group_reporting
> >
> > [job1]
> > filename=/dev/nvme0n1
> > rw=randread
> > bs=4k
> >
> > fio latency data on host side (testing the nvmeof device)
> > slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> > clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
> >  lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
> >
> > fio latency data on target side (testing the NVMe PCI device locally)
> > slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> > clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
> >  lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
> >
> > I picked this sample from blktrace, which seems to match the fio average latency data.
> >
> > Host(/dev/nvme0n1)
> > 259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
> > 259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
> > 259,0 0 88 0.015771628 3241 U N [fio] 1
> > 259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
> > 259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
> > 259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
> >
> > Target(/dev/nvme0n1)
> > 259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
> > 259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
> > 259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
> > 259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
> >
> > So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
> > Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?
>
> A couple of things come to mind:
>
> 0. You are using iodepth=1, correct?

I didn't set it; it's 1 by default. Now I set it explicitly.

root@host:~# cat t.job
[global]
ioengine=libaio
direct=1
runtime=20
time_based
norandommap
group_reporting

[job1]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=1
numjobs=1

> 1. I imagine you are not polling on the host but rather interrupt
>    driven, correct? That's a latency source.

It's polling.

root@host:~# cat /sys/block/nvme0n1/queue/io_poll
1

> 2. The target code polls if the block device supports it. Can you
>    confirm that is indeed the case?

Yes.

> 3. mlx4 has a strong fencing policy for memory registration, which we
>    always do. That's a latency source. Can you try with
>    register_always=0?

root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
N

> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>    the completion comes to cpu core Y, we will consume some latency
>    with the context switch of waking up fio on cpu core X. Is this
>    a possible case?

Only 1 CPU is online on both the host and the target machine.

root@host:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7

root@target:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7

> 5. What happens if you test against a null_blk (which has a latency of
>    < 1us)? Back when I ran some tryouts I saw ~10-11us of added latency
>    from the fabric under similar conditions.

With null_blk on the target, latency is about 12us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [305.1MB/0KB/0KB /s] [78.4K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3067: Wed Jul 13 11:20:19 2016
  read : io=6096.9MB, bw=312142KB/s, iops=78035, runt= 20001msec
    slat (usec): min=1, max=207, avg= 2.01, stdev= 0.34
    clat (usec): min=0, max=8020, avg= 9.99, stdev= 9.06
     lat (usec): min=10, max=8022, avg=12.10, stdev= 9.07

With the real NVMe device on the target, the host sees a latency of about 33us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [113.1MB/0KB/0KB /s] [28.1K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3139: Wed Jul 13 11:22:15 2016
  read : io=2259.5MB, bw=115680KB/s, iops=28920, runt= 20001msec
    slat (usec): min=1, max=195, avg= 2.62, stdev= 1.24
    clat (usec): min=0, max=7962, avg=30.97, stdev=14.50
     lat (usec): min=27, max=7968, avg=33.70, stdev=14.69

And testing the NVMe device locally on the target gives about 23us.
So nvmeof added only about 10us.

That's nice!

root@target:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.8-26-g603e
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [161.2MB/0KB/0KB /s] [41.3K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=2725: Wed Jul 13 11:23:46 2016
  read : io=1605.3MB, bw=164380KB/s, iops=41095, runt= 10000msec
    slat (usec): min=1, max=60, avg= 1.88, stdev= 0.63
    clat (usec): min=1, max=144, avg=21.61, stdev= 8.96
     lat (usec): min=19, max=162, avg=23.59, stdev= 9.00
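
To summarize the average latencies from the runs above (the <1us local null_blk number is Sagi's figure, not measured here):

null_blk, local on target:   < 1 us
null_blk over nvmeof:        ~12.1 us  (~11 us added by the fabric)
real NVMe, local on target:  ~23.6 us
real NVMe over nvmeof:       ~33.7 us  (~10 us added by the fabric)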
* NVMe over RDMA latency
  2016-07-13 18:25 ` Ming Lin
@ 2016-07-14  6:52 ` Sagi Grimberg
  2016-07-14 16:43 ` Wendy Cheng
  1 sibling, 0 replies; 7+ messages in thread
From: Sagi Grimberg @ 2016-07-14 6:52 UTC (permalink / raw)

> With the real NVMe device on the target, the host sees a latency of about 33us.
>
> root@host:~# fio t.job
> job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
> fio-2.9-3-g2078c
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [113.1MB/0KB/0KB /s] [28.1K/0/0 iops] [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=3139: Wed Jul 13 11:22:15 2016
>   read : io=2259.5MB, bw=115680KB/s, iops=28920, runt= 20001msec
>     slat (usec): min=1, max=195, avg= 2.62, stdev= 1.24
>     clat (usec): min=0, max=7962, avg=30.97, stdev=14.50
>      lat (usec): min=27, max=7968, avg=33.70, stdev=14.69
>
> And testing the NVMe device locally on the target gives about 23us.
> So nvmeof added only about 10us.
>
> That's nice!

I didn't understand; what was changed?
* NVMe over RDMA latency
  2016-07-13 18:25 ` Ming Lin
  2016-07-14  6:52 ` Sagi Grimberg
@ 2016-07-14 16:43 ` Wendy Cheng
  2016-07-14 17:45 ` Wendy Cheng
  1 sibling, 1 reply; 7+ messages in thread
From: Wendy Cheng @ 2016-07-14 16:43 UTC (permalink / raw)

On Wed, Jul 13, 2016 at 11:25 AM, Ming Lin <mlin@kernel.org> wrote:

>> 1. I imagine you are not polling on the host but rather interrupt
>>    driven, correct? That's a latency source.
>
> It's polling.
>
> root@host:~# cat /sys/block/nvme0n1/queue/io_poll
> 1
>
>> 2. The target code polls if the block device supports it. Can you
>>    confirm that is indeed the case?
>
> Yes.
>
>> 3. mlx4 has a strong fencing policy for memory registration, which we
>>    always do. That's a latency source. Can you try with
>>    register_always=0?
>
> root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
> N
>
>> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>>    the completion comes to cpu core Y, we will consume some latency
>>    with the context switch of waking up fio on cpu core X. Is this
>>    a possible case?
>
> Only 1 CPU is online on both the host and the target machine.

Since the above tunables can easily be toggled on and off, could you break
down their contributions to the overall latency, one tunable at a time?
E.g. toggle only io_poll on/off to see how much it improves the latency.

From your data, it seems the local performance on the target got worse.
Is that perception correct?

Before the tunables: target avg = 22.35 usec
After the tunables:  target avg = 23.59 usec

I'm particularly interested in the local target device latency with
io_poll on vs. off. Did you keep the p99.99 and p90.00 latency numbers
from this experiment that you can share?

Thanks,
Wendy
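(A one-variable-at-a-time sweep of io_poll could look roughly like this on whichever side is being measured; t.job is the fio job file quoted earlier, and the output file names are just examples:)

for p in 0 1; do
    echo $p > /sys/block/nvme0n1/queue/io_poll
    fio --output=lat_io_poll_$p.log t.job
done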
* NVMe over RDMA latency
  2016-07-14 16:43 ` Wendy Cheng
@ 2016-07-14 17:45 ` Wendy Cheng
  0 siblings, 0 replies; 7+ messages in thread
From: Wendy Cheng @ 2016-07-14 17:45 UTC (permalink / raw)

On Thu, Jul 14, 2016 at 9:43 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> On Wed, Jul 13, 2016 at 11:25 AM, Ming Lin <mlin@kernel.org> wrote:
>
>>> 1. I imagine you are not polling on the host but rather interrupt
>>>    driven, correct? That's a latency source.
>>
>> It's polling.
>>
>> root@host:~# cat /sys/block/nvme0n1/queue/io_poll
>> 1
>>
>>> 2. The target code polls if the block device supports it. Can you
>>>    confirm that is indeed the case?
>>
>> Yes.
>>
>>> 3. mlx4 has a strong fencing policy for memory registration, which we
>>>    always do. That's a latency source. Can you try with
>>>    register_always=0?
>>
>> root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
>> N
>>
>>> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>>>    the completion comes to cpu core Y, we will consume some latency
>>>    with the context switch of waking up fio on cpu core X. Is this
>>>    a possible case?
>>
>> Only 1 CPU is online on both the host and the target machine.
>
> Since the above tunables can easily be toggled on and off, could you break
> down their contributions to the overall latency, one tunable at a time?
> E.g. toggle only io_poll on/off to see how much it improves the latency.
>
> From your data, it seems the local performance on the target got worse.
> Is that perception correct?
>
> Before the tunables: target avg = 22.35 usec
> After the tunables:  target avg = 23.59 usec
>
> I'm particularly interested in the local target device latency with
> io_poll on vs. off. Did you keep the p99.99 and p90.00 latency numbers
> from this experiment that you can share?

BTW, using a single CPU on the target (storage server) does not make
sense. My guess is that the slowdown on the target comes from this,
as it is particularly relevant in polling mode.

Thanks,
Wendy
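(For reference, the offlined cores can be brought back and the completion-vector IRQ placement checked roughly like this; the IRQ number below is only an example:)

# bring cores 1-7 back online on the target
for c in /sys/devices/system/cpu/cpu[1-7]/online; do echo 1 > $c; done
# see which cores the mlx4 completion vectors interrupt on
grep mlx4 /proc/interrupts
# optionally pin a vector to the core running the workload, e.g. IRQ 42 -> cpu0
echo 1 > /proc/irq/42/smp_affinity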
Thread overview: 7+ messages
2016-07-07 19:55 NVMe over RDMA latency Ming Lin
2016-07-13 9:49 ` Sagi Grimberg
[not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
2016-07-13 17:25 ` Ming Lin
2016-07-13 18:25 ` Ming Lin
2016-07-14 6:52 ` Sagi Grimberg
2016-07-14 16:43 ` Wendy Cheng
2016-07-14 17:45 ` Wendy Cheng