From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============6501846281846498764=="
MIME-Version: 1.0
From: Walker, Benjamin <benjamin.walker at intel.com>
Subject: Re: [SPDK] NVMe RDMA SGL Support
Date: Thu, 03 May 2018 19:43:06 +0000
Message-ID: <1525376584.22849.67.camel@intel.com>
In-Reply-To: CA++50Vcqd=efDuKvOhe9oG2jDB6k8f7GSbWsov4WgGG7uRDPjw@mail.gmail.com
List-ID: <spdk@lists.01.org>
To: spdk@lists.01.org

--===============6501846281846498764==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Thu, 2018-05-03 at 19:11 +0000, Mikhail altman wrote:
> Hello Everyone,
>
> On SPDK v18.01, we noticed there's a TODO in nvme_rdma_build_sgl_request() in
> nvme_rdma.c.
>
> Some code for context:
>
>     /* TODO: for now, we only support a single SGL entry */
>     rc = req->payload.u.sgl.next_sge_fn(req->payload.u.sgl.cb_arg, &virt_addr, &length);
>     if (rc) {
>             return -1;
>     }
>
>     if (length < req->payload_size) {
>             SPDK_ERRLOG("multi-element SGL currently not supported for RDMA\n");
>             return -1;
>     }
>
> Is there any ongoing discussion or work to implement support for multiple SGL
> entries? (I looked at the Trello board and GerritHub, but couldn't find
> anything related.) If not, we can look into making a patch for this on our
> end. Any thoughts about what this would entail are welcome!

Hi Mike,

John has been working in this area. It's great to see that he'll have patches to
take a look at shortly. I just wanted to clarify a few things.

This isn't much of a limitation for the use cases we support today. The
initiator buffers can already be scattered; it's only the target memory for a
single I/O that must be described by a single element. Since the RDMA NIC is
pulling the data over the network and placing it into the local target system's
memory, it is simple enough to have it simultaneously gather the data into a
single contiguous memory region.
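
For anyone following along, here is a minimal, self-contained sketch of what
walking a multi-element SGL could look like, in contrast to the single call in
the snippet quoted above. Every name in it (struct sge, next_sge_fn_t,
sge_iter, gather_sgl) is hypothetical and chosen for illustration; this is not
the SPDK API and not the patch John is working on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element type for illustration; not the SPDK definition. */
struct sge {
    void    *addr;
    uint32_t length;
};

/* Callback that yields the next element; returns 0 on success. */
typedef int (*next_sge_fn_t)(void *cb_arg, void **addr, uint32_t *length);

/* Example callback state: iterates over a plain array of elements. */
struct sge_iter {
    const struct sge *elems;
    int               count;
    int               pos;
};

int
sge_iter_next(void *cb_arg, void **addr, uint32_t *length)
{
    struct sge_iter *it = cb_arg;

    if (it->pos >= it->count) {
        return -1; /* no more elements */
    }
    *addr = it->elems[it->pos].addr;
    *length = it->elems[it->pos].length;
    it->pos++;
    return 0;
}

/*
 * Keep asking for elements until payload_size bytes are described.
 * Returns the number of elements written to sges[], or -1 if the callback
 * fails, returns a zero-length element, or more than max_sges are needed.
 */
int
gather_sgl(next_sge_fn_t next_sge, void *cb_arg, uint32_t payload_size,
           struct sge *sges, int max_sges)
{
    uint32_t remaining = payload_size;
    int count = 0;

    while (remaining > 0) {
        void *addr;
        uint32_t length;

        if (count == max_sges) {
            return -1;
        }
        if (next_sge(cb_arg, &addr, &length) != 0 || length == 0) {
            return -1;
        }
        /* Trim the final element so the total equals payload_size. */
        if (length > remaining) {
            length = remaining;
        }
        sges[count].addr = addr;
        sges[count].length = length;
        count++;
        remaining -= length;
    }
    return count;
}
```

The point of the loop is only that the payload is considered described once
the element lengths sum to payload_size, rather than requiring the first
element to cover all of it.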

That said, I can see at least a few use cases for this. One would be to change
the way the memory pool is allocated in the NVMe-oF target. Today, it allocates
4 full queue depths' worth of max I/O size buffers in a shared pool for all
connections to use. If we had full support for scatter-gather lists, we could
change this pool to contain an equivalent amount of 4k buffers. Then each I/O
could pull a list of buffers instead of a single big one, and we'd end up with
better memory utilization. We already have the required scatter-gather-aware
APIs through the rest of the stack to make this happen.
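
As a rough sketch of that pool change, the code below keeps a pool of fixed
4 KiB buffers and has each I/O pull just enough of them to cover its size,
describing them with a POSIX struct iovec array. All of the names here
(buf_pool, buf_pool_get_iov, buf_pool_put_iov, BUF_SIZE) are made up for
illustration. A real target would presumably build on SPDK's own pool and
RDMA-registered memory, which this sketch does not attempt to model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/uio.h> /* struct iovec (POSIX) */

#define BUF_SIZE 4096u /* small fixed buffer size, per the "4k buffers" idea */

/* Hypothetical pool: one backing allocation carved into BUF_SIZE chunks,
 * tracked as a stack of free pointers. Not the SPDK mempool API. */
struct buf_pool {
    void  *backing;
    void **free_bufs;
    size_t num_free;
    size_t capacity;
};

/* Returns 0 on success, -1 on invalid size or allocation failure. */
int
buf_pool_init(struct buf_pool *pool, size_t num_bufs)
{
    size_t i;

    assert(pool != NULL);
    if (num_bufs == 0) {
        return -1;
    }
    pool->backing = calloc(num_bufs, BUF_SIZE);
    pool->free_bufs = malloc(num_bufs * sizeof(*pool->free_bufs));
    if (pool->backing == NULL || pool->free_bufs == NULL) {
        free(pool->backing);
        free(pool->free_bufs);
        return -1;
    }
    for (i = 0; i < num_bufs; i++) {
        pool->free_bufs[i] = (char *)pool->backing + i * BUF_SIZE;
    }
    pool->num_free = num_bufs;
    pool->capacity = num_bufs;
    return 0;
}

void
buf_pool_free(struct buf_pool *pool)
{
    free(pool->free_bufs);
    free(pool->backing);
}

/* Pull enough buffers to cover io_size bytes and describe them in iov[].
 * Returns the number of iovec entries filled, or -1 if the pool or the
 * caller's iov array is too small (nothing is taken in that case). */
int
buf_pool_get_iov(struct buf_pool *pool, size_t io_size,
                 struct iovec *iov, size_t max_iov)
{
    size_t needed = io_size / BUF_SIZE + (io_size % BUF_SIZE != 0);
    size_t remaining = io_size;
    size_t i;

    if (needed > max_iov || needed > pool->num_free) {
        return -1;
    }
    for (i = 0; i < needed; i++) {
        iov[i].iov_base = pool->free_bufs[--pool->num_free];
        iov[i].iov_len = remaining < BUF_SIZE ? remaining : BUF_SIZE;
        remaining -= iov[i].iov_len;
    }
    return (int)needed;
}

/* Return every buffer described by iov[] to the pool. */
void
buf_pool_put_iov(struct buf_pool *pool, const struct iovec *iov, size_t iovcnt)
{
    size_t i;

    assert(pool->num_free + iovcnt <= pool->capacity);
    for (i = 0; i < iovcnt; i++) {
        pool->free_bufs[pool->num_free++] = iov[i].iov_base;
    }
}
```

With a pool like this, a 4 KiB I/O takes one buffer and a 128 KiB I/O (an
arbitrary example size) takes 32, rather than each one holding a max I/O size
buffer for its whole lifetime.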

The other use case is one where we switch our model to use memory provided by
the backing bdev for the RDMA transfer instead of using a separate dedicated
pool allocated by the NVMe-oF target. That backing bdev may need to provide the
memory as a scatter-gather list for various reasons (this is John's use case).
This is the long-term direction for the NVMe-oF target.

In addition to enabling custom bdevs to provide scatter-gather lists for
whatever reason, this would also enable things like zero-copy transfers directly
to persistent memory or to a local NVMe SSD's controller memory buffer. This
effectively eliminates the single bounce we do from RDMA NIC to host memory to
persistent storage device, and would probably shave an additional ~3
microseconds off the round-trip latency for these cases.

These are all cool projects that are worthy of time and effort. If you all are
willing to work in this area, please jump in!

>
> Thanks in advance,
> Mike

--===============6501846281846498764==--