From: sagi@grimberg.me (Sagi Grimberg)
Subject: NVMe over RDMA latency
Date: Wed, 13 Jul 2016 12:49:46 +0300
Message-ID: <57860EBA.5010103@grimberg.me>
In-Reply-To: <1467921342.24395.12.camel@ssi>
> Hi list,
Hey Ming,
> I'm trying to understand the NVMe over RDMA latency.
>
> Test hardware:
> A real NVMe PCI drive on target
> Host and target back-to-back connected by Mellanox ConnectX-3
>
> [global]
> ioengine=libaio
> direct=1
> runtime=10
> time_based
> norandommap
> group_reporting
>
> [job1]
> filename=/dev/nvme0n1
> rw=randread
> bs=4k
>
>
> fio latency data on host side(test nvmeof device)
> slat (usec): min=2, max=213, avg= 6.34, stdev= 3.47
> clat (usec): min=1, max=2470, avg=39.56, stdev=13.04
> lat (usec): min=30, max=2476, avg=46.14, stdev=15.50
>
> fio latency data on target side(test NVMe pci device locally)
> slat (usec): min=1, max=36, avg= 1.92, stdev= 0.42
> clat (usec): min=1, max=68, avg=20.35, stdev= 1.11
> lat (usec): min=19, max=101, avg=22.35, stdev= 1.21
>
> So I picked this sample from blktrace, which seems to match the fio avg latency data.
>
> Host(/dev/nvme0n1)
> 259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
> 259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
> 259,0 0 88 0.015771628 3241 U N [fio] 1
> 259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
> 259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
> 259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
>
> Target(/dev/nvme0n1)
> 259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
> 259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
> 259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
>
> So host completed IO in about 50usec and target completed IO in about 20usec.
> Does that mean the 30usec delta comes from RDMA write(host read means target RDMA write)?
A couple of things come to mind:
0. You are using iodepth=1, correct?
1. I imagine you are not polling on the host but are rather interrupt
   driven, correct? That's a latency source.
2. The target code polls if the block device supports it. Can you
   confirm that is indeed the case?
3. mlx4 has a strong fencing policy for memory registration, which we
   always do. That's a latency source. Can you try with
   register_always=0?
4. IRQ affinity assignments. If the sqe is submitted on CPU core X and
   the completion comes to CPU core Y, we will consume some latency
   with the context switch of waking up fio on CPU core X. Is this
   a possible case?
5. What happens if you test against a null_blk device (which has a
   latency of < 1us)? Back when I ran some tryouts I saw ~10-11us of
   added latency from the fabric under similar conditions.
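
For reference, the arithmetic behind the "about 50usec vs. about 20usec" figures can be rechecked directly from the quoted blktrace samples. The following is only a sketch: the helper names (to_ns, parse_events, interval_ns) are made up for illustration, and it assumes the default blkparse column order (device, cpu, sequence, timestamp, pid, action, ...). Host and target timestamps come from separate traces with unsynchronized clocks, so only intervals measured on the same machine are compared.

```python
# Sketch: recompute Q->C and D->C latencies from the quoted blktrace lines.
# Helper names are hypothetical; timestamps are copied from the trace above.

HOST = """
259,0 0 86 0.015768739 3241 Q R 1272199648 + 8 [fio]
259,0 0 87 0.015769674 3241 G R 1272199648 + 8 [fio]
259,0 0 88 0.015771628 3241 U N [fio] 1
259,0 0 89 0.015771901 3241 I RS 1272199648 + 8 ( 2227) [fio]
259,0 0 90 0.015772863 3241 D RS 1272199648 + 8 ( 962) [fio]
259,0 1 85 0.015819257 0 C RS 1272199648 + 8 ( 46394) [0]
"""

TARGET = """
259,0 0 141 0.015675637 2197 Q R 1272199648 + 8 [kworker/u17:0]
259,0 0 142 0.015676033 2197 G R 1272199648 + 8 [kworker/u17:0]
259,0 0 143 0.015676915 2197 D RS 1272199648 + 8 (15676915) [kworker/u17:0]
259,0 0 144 0.015694992 0 C RS 1272199648 + 8 ( 18077) [0]
"""


def to_ns(stamp):
    """Convert a 'seconds.nanoseconds' string to integer nanoseconds."""
    sec, frac = stamp.split(".")
    return int(sec) * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def parse_events(trace):
    """Map each action letter (Q, G, I, D, C, ...) to its timestamp in ns."""
    events = {}
    for line in trace.strip().splitlines():
        fields = line.split()
        # Assumed layout: dev cpu seq timestamp pid action ...
        events[fields[5]] = to_ns(fields[3])
    return events


def interval_ns(trace, start, end):
    """Elapsed nanoseconds between two actions within one trace."""
    events = parse_events(trace)
    return events[end] - events[start]


if __name__ == "__main__":
    for name, trace in (("host", HOST), ("target", TARGET)):
        q2c = interval_ns(trace, "Q", "C") / 1000
        d2c = interval_ns(trace, "D", "C") / 1000
        print(f"{name}: Q->C {q2c:.3f} us, D->C {d2c:.3f} us")
    delta = interval_ns(HOST, "Q", "C") - interval_ns(TARGET, "Q", "C")
    print(f"host minus target (Q->C): {delta / 1000:.3f} us")
```

For this single sample the Q->C interval is 50.518 us on the host and 19.355 us on the target, a difference of about 31 us. That difference covers everything outside the target's block layer (fabric, memory registration, interrupt handling, wakeup), not the RDMA write alone.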