* NVMe over RDMA latency
@ 2016-07-07 19:55 Ming Lin
2016-07-13 9:49 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Ming Lin @ 2016-07-07 19:55 UTC (permalink / raw)
Hi list,
I'm trying to understand the NVMe over RDMA latency.
Test hardware:
A real NVMe PCI drive on target
Host and target back-to-back connected by Mellanox ConnectX-3
[global]
ioengine=libaio
direct=1
runtime=10
time_based
norandommap
group_reporting
[job1]
filename=/dev/nvme0n1
rw=randread
bs=4k
fio latency data on host side (testing the nvmeof device)
slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
fio latency data on target side (testing the NVMe PCI device locally)
slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
I picked this sample from blktrace, which seems to match the fio average latency data.
Host(/dev/nvme0n1)
259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
259,0 0 88 0.015771628 3241 U N [fio] 1
259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
Target(/dev/nvme0n1)
259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?
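For reference, the D->C deltas in the two traces above work out to (assuming this sample is representative):

host:   0.015819257 - 0.015772863 = ~46.4 usec
target: 0.015694992 - 0.015676915 = ~18.1 usec
delta:  46.4 - 18.1               = ~28 usec spent in the fabric and the target-side stack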
Thanks,
Ming
Below is just for my own reference, to map the blktrace flags to the kernel trace points
===================================================================
Q - queued:
generic_make_request_checks: trace_block_bio_queue(q, bio)
G - get request:
blk_mq_map_request: trace_block_getrq(q, bio, op)
U - unplug:
blk_mq_insert_requests: trace_block_unplug(q, depth, !from_schedule)
I - inserted:
__blk_mq_insert_req_list: trace_block_rq_insert(hctx->queue, rq)
D - issued:
blk_mq_start_request: trace_block_rq_issue(q, rq)
C - complete:
blk_update_request: trace_block_rq_complete(req->q, req, nr_bytes)
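
For reference, a trace like the ones above can be captured and decoded roughly like this (device name and runtime are just examples):

blktrace -d /dev/nvme0n1 -w 10 -o nvme_trace   # capture 10s of block-layer events
blkparse -i nvme_trace -o nvme_trace.txt       # decode into the per-event lines shown above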

* NVMe over RDMA latency
  2016-07-07 19:55 NVMe over RDMA latency Ming Lin
@ 2016-07-13  9:49 ` Sagi Grimberg
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
  2016-07-13 18:25 ` Ming Lin
  0 siblings, 2 replies; 7+ messages in thread
From: Sagi Grimberg @ 2016-07-13 9:49 UTC (permalink / raw)

> Hi list,

Hey Ming,

> I'm trying to understand the NVMe over RDMA latency.
>
> Test hardware:
> A real NVMe PCI drive on target
> Host and target back-to-back connected by Mellanox ConnectX-3
>
> [global]
> ioengine=libaio
> direct=1
> runtime=10
> time_based
> norandommap
> group_reporting
>
> [job1]
> filename=/dev/nvme0n1
> rw=randread
> bs=4k
>
> fio latency data on host side (testing the nvmeof device)
> slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
>  lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
>
> fio latency data on target side (testing the NVMe PCI device locally)
> slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
>  lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
>
> I picked this sample from blktrace, which seems to match the fio average latency data.
>
> Host(/dev/nvme0n1)
> 259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
> 259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
> 259,0 0 88 0.015771628 3241 U N [fio] 1
> 259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
> 259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
> 259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
>
> Target(/dev/nvme0n1)
> 259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
> 259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
>
> So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
> Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?

A couple of things come to mind:

0. You are using iodepth=1, correct?

1. I imagine you are not polling on the host but rather interrupt
   driven, correct? That's a latency source.

2. The target code polls if the block device supports it. Can you
   confirm that is indeed the case?

3. mlx4 has a strong fencing policy for memory registration, which we
   always do. That's a latency source. Can you try with
   register_always=0?

4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
   the completion comes to cpu core Y, we will consume some latency
   with the context switch of waking up fio on cpu core X. Is this
   a possible case?

5. What happens if you test against a null_blk (which has a latency of
   < 1us)? Back when I ran some tryouts I saw ~10-11us of added latency
   from the fabric under similar conditions.
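For point 5, one way to back the target with null_blk is to load it and export it through the nvmet configfs instead of the real namespace. A minimal sketch, assuming the standard nvmet configfs layout; the NQN, address and port below are placeholders:

modprobe null_blk queue_mode=2 irqmode=0     # multiqueue null device, completed inline (no interrupt)
mkdir /sys/kernel/config/nvmet/subsystems/testnqn
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
mkdir /sys/kernel/config/nvmet/ports/1
echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/testnqn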
[parent not found: <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>]
* NVMe over RDMA latency
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
@ 2016-07-13 17:25 ` Ming Lin
  0 siblings, 0 replies; 7+ messages in thread
From: Ming Lin @ 2016-07-13 17:25 UTC (permalink / raw)

On Wed, Jul 13, 2016 at 10:22 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> 2) Could anyone pass along a pointer on how to check whether a block device
>    supports polling mode (and how to turn it on)?

Example:

echo 0 > /sys/block/nvme0n1/queue/io_poll
echo 1 > /sys/block/nvme0n1/queue/io_poll
* NVMe over RDMA latency
  2016-07-13  9:49 ` Sagi Grimberg
  [not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
@ 2016-07-13 18:25 ` Ming Lin
  2016-07-14  6:52 ` Sagi Grimberg
  2016-07-14 16:43 ` Wendy Cheng
  1 sibling, 2 replies; 7+ messages in thread
From: Ming Lin @ 2016-07-13 18:25 UTC (permalink / raw)

On Wed, 2016-07-13 at 12:49 +0300, Sagi Grimberg wrote:
> > Hi list,
>
> Hey Ming,
>
> > I'm trying to understand the NVMe over RDMA latency.
> >
> > Test hardware:
> > A real NVMe PCI drive on target
> > Host and target back-to-back connected by Mellanox ConnectX-3
> >
> > [global]
> > ioengine=libaio
> > direct=1
> > runtime=10
> > time_based
> > norandommap
> > group_reporting
> >
> > [job1]
> > filename=/dev/nvme0n1
> > rw=randread
> > bs=4k
> >
> > fio latency data on host side (testing the nvmeof device)
> > slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> > clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
> >  lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
> >
> > fio latency data on target side (testing the NVMe PCI device locally)
> > slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> > clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
> >  lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
> >
> > I picked this sample from blktrace, which seems to match the fio average latency data.
> >
> > Host(/dev/nvme0n1)
> > 259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
> > 259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
> > 259,0 0 88 0.015771628 3241 U N [fio] 1
> > 259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
> > 259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
> > 259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
> >
> > Target(/dev/nvme0n1)
> > 259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
> > 259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
> > 259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
> > 259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
> >
> > So the host completed the IO in about 50 usec and the target completed it in about 20 usec.
> > Does that mean the ~30 usec delta comes from the RDMA write (a host read means the target does an RDMA write)?
>
> A couple of things come to mind:
>
> 0. You are using iodepth=1, correct?

I didn't set it; it's 1 by default. Now I set it explicitly.

root@host:~# cat t.job
[global]
ioengine=libaio
direct=1
runtime=20
time_based
norandommap
group_reporting

[job1]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=1
numjobs=1

> 1. I imagine you are not polling on the host but rather interrupt
>    driven, correct? That's a latency source.

It's polling.

root@host:~# cat /sys/block/nvme0n1/queue/io_poll
1

> 2. The target code polls if the block device supports it. Can you
>    confirm that is indeed the case?

Yes.

> 3. mlx4 has a strong fencing policy for memory registration, which we
>    always do. That's a latency source. Can you try with
>    register_always=0?

root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
N

> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>    the completion comes to cpu core Y, we will consume some latency
>    with the context switch of waking up fio on cpu core X. Is this
>    a possible case?

Only 1 CPU is online on both the host and the target machine.

root@host:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7

root@target:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0
Off-line CPU(s) list:  1-7

> 5. What happens if you test against a null_blk (which has a latency of
>    < 1us)? Back when I ran some tryouts I saw ~10-11us of added latency
>    from the fabric under similar conditions.

With null_blk on the target, latency is about 12us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [305.1MB/0KB/0KB /s] [78.4K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3067: Wed Jul 13 11:20:19 2016
  read : io=6096.9MB, bw=312142KB/s, iops=78035, runt= 20001msec
    slat (usec): min=1, max=207, avg= 2.01, stdev= 0.34
    clat (usec): min=0, max=8020, avg= 9.99, stdev= 9.06
     lat (usec): min=10, max=8022, avg=12.10, stdev= 9.07

With the real NVMe device on the target, the host sees a latency of about 33us.

root@host:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.9-3-g2078c
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [113.1MB/0KB/0KB /s] [28.1K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=3139: Wed Jul 13 11:22:15 2016
  read : io=2259.5MB, bw=115680KB/s, iops=28920, runt= 20001msec
    slat (usec): min=1, max=195, avg= 2.62, stdev= 1.24
    clat (usec): min=0, max=7962, avg=30.97, stdev=14.50
     lat (usec): min=27, max=7968, avg=33.70, stdev=14.69

And testing the NVMe device locally on the target gives about 23us.
So nvmeof added only about 10us.

That's nice!

root@target:~# fio t.job
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.8-26-g603e
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [161.2MB/0KB/0KB /s] [41.3K/0/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=2725: Wed Jul 13 11:23:46 2016
  read : io=1605.3MB, bw=164380KB/s, iops=41095, runt= 10000msec
    slat (usec): min=1, max=60, avg= 1.88, stdev= 0.63
    clat (usec): min=1, max=144, avg=21.61, stdev= 8.96
     lat (usec): min=19, max=162, avg=23.59, stdev= 9.00
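
To summarize the average latencies from the runs above (the <1us local null_blk number is Sagi's figure, not measured here):

null_blk, local on target:   < 1 us
null_blk over nvmeof:        ~12.1 us  (~11 us added by the fabric)
real NVMe, local on target:  ~23.6 us
real NVMe over nvmeof:       ~33.7 us  (~10 us added by the fabric)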
* NVMe over RDMA latency
  2016-07-13 18:25 ` Ming Lin
@ 2016-07-14  6:52 ` Sagi Grimberg
  2016-07-14 16:43 ` Wendy Cheng
  1 sibling, 0 replies; 7+ messages in thread
From: Sagi Grimberg @ 2016-07-14 6:52 UTC (permalink / raw)

> With the real NVMe device on the target, the host sees a latency of about 33us.
>
> root@host:~# fio t.job
> job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
> fio-2.9-3-g2078c
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [113.1MB/0KB/0KB /s] [28.1K/0/0 iops] [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=3139: Wed Jul 13 11:22:15 2016
>   read : io=2259.5MB, bw=115680KB/s, iops=28920, runt= 20001msec
>     slat (usec): min=1, max=195, avg= 2.62, stdev= 1.24
>     clat (usec): min=0, max=7962, avg=30.97, stdev=14.50
>      lat (usec): min=27, max=7968, avg=33.70, stdev=14.69
>
> And testing the NVMe device locally on the target gives about 23us.
> So nvmeof added only about 10us.
>
> That's nice!

I didn't understand; what was changed?
* NVMe over RDMA latency
  2016-07-13 18:25 ` Ming Lin
  2016-07-14  6:52 ` Sagi Grimberg
@ 2016-07-14 16:43 ` Wendy Cheng
  2016-07-14 17:45 ` Wendy Cheng
  1 sibling, 1 reply; 7+ messages in thread
From: Wendy Cheng @ 2016-07-14 16:43 UTC (permalink / raw)

On Wed, Jul 13, 2016 at 11:25 AM, Ming Lin <mlin@kernel.org> wrote:

>> 1. I imagine you are not polling on the host but rather interrupt
>>    driven, correct? That's a latency source.
>
> It's polling.
>
> root@host:~# cat /sys/block/nvme0n1/queue/io_poll
> 1
>
>> 2. The target code polls if the block device supports it. Can you
>>    confirm that is indeed the case?
>
> Yes.
>
>> 3. mlx4 has a strong fencing policy for memory registration, which we
>>    always do. That's a latency source. Can you try with
>>    register_always=0?
>
> root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
> N
>
>> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>>    the completion comes to cpu core Y, we will consume some latency
>>    with the context switch of waking up fio on cpu core X. Is this
>>    a possible case?
>
> Only 1 CPU is online on both the host and the target machine.

Since the above tunables can easily be toggled on and off, could you break
down their contributions to the overall latency, one tunable at a time?
E.g. toggle only io_poll on/off to see how much it improves the latency.

From your data, it seems the local performance on the target got worse.
Is that perception correct?

Before the tunables: target avg = 22.35 usec
After the tunables:  target avg = 23.59 usec

I'm particularly interested in the local target device latency with
io_poll on vs. off. Did you keep the p99.99 and p90.00 latency numbers
from this experiment that you can share?

Thanks,
Wendy
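(A one-variable-at-a-time sweep of io_poll could look roughly like this on whichever side is being measured; t.job is the fio job file quoted earlier, and the output file names are just examples:)

for p in 0 1; do
    echo $p > /sys/block/nvme0n1/queue/io_poll
    fio --output=lat_io_poll_$p.log t.job
done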
* NVMe over RDMA latency
  2016-07-14 16:43 ` Wendy Cheng
@ 2016-07-14 17:45 ` Wendy Cheng
  0 siblings, 0 replies; 7+ messages in thread
From: Wendy Cheng @ 2016-07-14 17:45 UTC (permalink / raw)

On Thu, Jul 14, 2016 at 9:43 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> On Wed, Jul 13, 2016 at 11:25 AM, Ming Lin <mlin@kernel.org> wrote:
>
>>> 1. I imagine you are not polling on the host but rather interrupt
>>>    driven, correct? That's a latency source.
>>
>> It's polling.
>>
>> root@host:~# cat /sys/block/nvme0n1/queue/io_poll
>> 1
>>
>>> 2. The target code polls if the block device supports it. Can you
>>>    confirm that is indeed the case?
>>
>> Yes.
>>
>>> 3. mlx4 has a strong fencing policy for memory registration, which we
>>>    always do. That's a latency source. Can you try with
>>>    register_always=0?
>>
>> root@host:~# cat /sys/module/nvme_rdma/parameters/register_always
>> N
>>
>>> 4. IRQ affinity assignments: if the sqe is submitted on cpu core X and
>>>    the completion comes to cpu core Y, we will consume some latency
>>>    with the context switch of waking up fio on cpu core X. Is this
>>>    a possible case?
>>
>> Only 1 CPU is online on both the host and the target machine.
>
> Since the above tunables can easily be toggled on and off, could you break
> down their contributions to the overall latency, one tunable at a time?
> E.g. toggle only io_poll on/off to see how much it improves the latency.
>
> From your data, it seems the local performance on the target got worse.
> Is that perception correct?
>
> Before the tunables: target avg = 22.35 usec
> After the tunables:  target avg = 23.59 usec
>
> I'm particularly interested in the local target device latency with
> io_poll on vs. off. Did you keep the p99.99 and p90.00 latency numbers
> from this experiment that you can share?

BTW, using a single CPU on the target (storage server) does not make
sense. My guess is that the slowdown on the target comes from this,
as it is particularly relevant in polling mode.

Thanks,
Wendy
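(For reference, the offlined cores can be brought back and the completion-vector IRQ placement checked roughly like this; the IRQ number below is only an example:)

# bring cores 1-7 back online on the target
for c in /sys/devices/system/cpu/cpu[1-7]/online; do echo 1 > $c; done
# see which cores the mlx4 completion vectors interrupt on
grep mlx4 /proc/interrupts
# optionally pin a vector to the core running the workload, e.g. IRQ 42 -> cpu0
echo 1 > /proc/irq/42/smp_affinity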
Thread overview: 7+ messages
2016-07-07 19:55 NVMe over RDMA latency Ming Lin
2016-07-13 9:49 ` Sagi Grimberg
[not found] ` <CABgxfbEa077L6o-AxEqMr1WMuU-gC8_qc4VrrNs9nAkKLrysdw@mail.gmail.com>
2016-07-13 17:25 ` Ming Lin
2016-07-13 18:25 ` Ming Lin
2016-07-14 6:52 ` Sagi Grimberg
2016-07-14 16:43 ` Wendy Cheng
2016-07-14 17:45 ` Wendy Cheng