From: Walker, Benjamin
Subject: Re: [SPDK] NVMe RDMA SGL Support
Date: Thu, 03 May 2018 19:43:06 +0000
Message-ID: <1525376584.22849.67.camel@intel.com>
In-Reply-To: CA++50Vcqd=efDuKvOhe9oG2jDB6k8f7GSbWsov4WgGG7uRDPjw@mail.gmail.com
To: spdk@lists.01.org

On Thu, 2018-05-03 at 19:11 +0000, Mikhail altman wrote:
> Hello Everyone,
>
> On SPDK v18.01, we noticed there's a TODO in nvme_rdma_build_sgl_request()
> in nvme_rdma.c.
>
> Some code for context:
>
>     /* TODO: for now, we only support a single SGL entry */
>     rc = req->payload.u.sgl.next_sge_fn(req->payload.u.sgl.cb_arg,
>                                         &virt_addr, &length);
>     if (rc) {
>         return -1;
>     }
>
>     if (length < req->payload_size) {
>         SPDK_ERRLOG("multi-element SGL currently not supported for RDMA\n");
>         return -1;
>     }
>
> Is there any ongoing discussion or work to implement support for multiple
> SGL entries? (I looked at the Trello board and GerritHub, but couldn't find
> anything related.) If not, we can look into making a patch for this on our
> end. Any thoughts about what this would entail are welcome!

Hi Mike,

John has been working in this area. It's great to see that he'll have patches
to take a look at shortly. I just wanted to clarify a few things.

This isn't much of a limitation for the use cases we support today. The
initiator buffers can be scattered already, it's just the target memory for a
single I/O that must be described by a single element. Since the RDMA NIC is
pulling the data over the network and placing it into the local target
system's memory, it is simple enough to have it simultaneously gather it into
a single contiguous memory region.

That said, I can see at least a few use cases for this.
One would be to change the way the memory pool is allocated in the NVMe-oF
target. Today, it allocates 4 full queue depths' worth of max I/O size buffers
in a shared pool for all connections to use. If we had full support for
scatter-gather lists, we could change this pool to contain an equivalent
amount of memory as 4k buffers. Then each I/O could pull a list of buffers
instead of a single big one, and we'd end up with better memory utilization.
We already have the required scatter-gather-aware APIs through the rest of the
stack to make this happen.

The other use case is one where we switch our model to use memory provided by
the backing bdev for the RDMA transfer instead of using a separate dedicated
pool allocated by the NVMe-oF target. That backing bdev may need to provide
the memory as a scatter-gather list for various reasons (this is John's use
case). This is the long-term direction for the NVMe-oF target.

In addition to enabling custom bdevs to provide scatter-gather lists for
whatever reason, this would also enable things like zero-copy transfers
directly to persistent memory or to a local NVMe SSD's controller memory
buffer. This effectively eliminates the single bounce we do from RDMA NIC to
host memory to persistent storage device, and would probably shave an
additional ~3 microseconds off of the round-trip latency for these cases.

These are all cool projects that are worthy of time and effort. If you all are
willing to work in this area, please jump in!

>
> Thanks in advance,
> Mike
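
For anyone picking this up, here is a rough sketch of what lifting the
single-element restriction could look like in isolation: instead of calling
the next-segment callback once and failing when it returns less than the
payload size, keep calling it until the whole payload is described. This is
not SPDK code. The struct sge type, MAX_SGES, build_sge_list(), and the toy
next_4k_chunk() iterator are all hypothetical names made up for illustration;
only the callback shape (cb_arg, address out, length out, nonzero on error) is
taken from the quoted snippet.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SGES 16 /* hypothetical cap on elements per request */

/* Hypothetical element type; not an SPDK or RDMA verbs structure. */
struct sge {
    void *addr;
    uint32_t length;
};

/* Same shape as the next_sge_fn call in the quoted snippet:
 * returns 0 and fills in the next segment, or nonzero on error. */
typedef int (*next_sge_fn)(void *cb_arg, void **addr, uint32_t *length);

/* Walk the callback until payload_size bytes are described.
 * Returns the number of elements written, or -1 on failure
 * (callback error, zero-length segment, or too many elements). */
static int
build_sge_list(next_sge_fn next, void *cb_arg, uint32_t payload_size,
               struct sge *out, int max_sges)
{
    uint32_t remaining = payload_size;
    int n = 0;

    while (remaining > 0) {
        void *addr;
        uint32_t length;

        if (n == max_sges || next(cb_arg, &addr, &length) != 0 || length == 0) {
            return -1;
        }
        if (length > remaining) {
            length = remaining; /* final segment may be larger than needed */
        }
        out[n].addr = addr;
        out[n].length = length;
        remaining -= length;
        n++;
    }
    return n;
}

/* Toy payload source: hands out consecutive 4 KiB chunks of one buffer,
 * standing in for a pool of 4k buffers. */
struct chunk_iter {
    uint8_t *base;
    uint32_t offset;
};

static int
next_4k_chunk(void *cb_arg, void **addr, uint32_t *length)
{
    struct chunk_iter *it = cb_arg;

    *addr = it->base + it->offset;
    *length = 4096;
    it->offset += 4096;
    return 0;
}
```

With a 10000-byte payload and the toy iterator, build_sge_list() produces
three elements of 4096, 4096, and 1808 bytes. A real patch would also have to
map each element to the right memory registration and respect whatever
per-request element limit the transport imposes, which this sketch ignores.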