Linux NFS development
 help / color / mirror / Atom feed
From: Chuck Lever <chuck.lever@oracle.com>
To: linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org
Subject: [PATCH v4 11/16] xprtrdma: Fix XDR tail buffer marshalling
Date: Mon, 03 Aug 2015 13:04:17 -0400	[thread overview]
Message-ID: <20150803170417.9115.23960.stgit@manet.1015granger.net> (raw)
In-Reply-To: <20150803165807.9115.23842.stgit@manet.1015granger.net>

Currently xprtrdma appends an extra chunk element to the RPC/RDMA
read chunk list of each NFSv4 WRITE compound. The extra element
contains the final GETATTR operation in the compound.

The result is an extra RDMA READ operation to transfer a very short
piece of each NFS WRITE compound (typically 16 bytes). This is
inefficient.

It is also incorrect.

The client is sending the trailing GETATTR at the same Position as
the preceding WRITE data payload. Whether or not RFC 5667 allows
the GETATTR to appear in a read chunk, RFC 5666 requires that these
two separate RPC arguments appear at two distinct Positions.

It can also be argued that the GETATTR operation is not bulk data,
and therefore RFC 5667 forbids its appearance in a read chunk at
all.

Although RFC 5667 is not precise about when using a read list with
NFSv4 COMPOUND is allowed, the intent is that only data arguments
not touched by NFS (ie, read and write payloads) are to be sent
using RDMA READ or WRITE.

The NFS client constructs GETATTR arguments itself, and therefore is
required to send the trailing GETATTR operation as additional inline
content, not as a data payload.

NB: This change is not backwards compatible. Some older servers do
not accept inline content following the read list. The Linux NFS
server should handle this content correctly as of commit
a97c331f9aa9 ("svcrdma: Handle additional inline content").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   44 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 62150ae..1dd48f2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -96,6 +96,42 @@ static bool rpcrdma_results_inline(struct rpc_rqst *rqst)
 	return repsize <= RPCRDMA_INLINE_READ_THRESHOLD(rqst);
 }
 
+static int
+rpcrdma_tail_pullup(struct xdr_buf *buf)
+{
+	size_t tlen = buf->tail[0].iov_len;
+	size_t skip = tlen & 3;
+
+	/* Do not include the tail if it is only an XDR pad */
+	if (tlen < 4)
+		return 0;
+
+	/* xdr_write_pages() adds a pad at the beginning of the tail
+	 * if the content in "buf->pages" is unaligned. Force the
+	 * tail's actual content to land at the next XDR position
+	 * after the head instead.
+	 */
+	if (skip) {
+		unsigned char *src, *dst;
+		unsigned int count;
+
+		src = buf->tail[0].iov_base;
+		dst = buf->head[0].iov_base;
+		dst += buf->head[0].iov_len;
+
+		src += skip;
+		tlen -= skip;
+
+		dprintk("RPC:       %s: skip=%zu, memmove(%p, %p, %zu)\n",
+			__func__, skip, dst, src, tlen);
+
+		for (count = tlen; count; count--)
+			*dst++ = *src++;
+	}
+
+	return tlen;
+}
+
 /*
  * Chunk assembly from upper layer xdr_buf.
  *
@@ -147,6 +183,10 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
 	if (len && n == nsegs)
 		return -EIO;
 
+	/* When encoding the read list, the tail is always sent inline */
+	if (type == rpcrdma_readch)
+		return n;
+
 	if (xdrbuf->tail[0].iov_len) {
 		/* the rpcrdma protocol allows us to omit any trailing
 		 * xdr pad bytes, saving the server an RDMA operation. */
@@ -476,8 +516,8 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 		headerp->rm_body.rm_nochunks.rm_empty[2] = xdr_zero;
 		/* new length after pullup */
 		rpclen = rqst->rq_svec[0].iov_len;
-	}
-
+	} else if (rtype == rpcrdma_readch)
+		rpclen += rpcrdma_tail_pullup(&rqst->rq_snd_buf);
 	if (rtype != rpcrdma_noch) {
 		hdrlen = rpcrdma_create_chunks(rqst, &rqst->rq_snd_buf,
 					       headerp, rtype);


  parent reply	other threads:[~2015-08-03 17:04 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-03 17:02 [PATCH v4 00/16] NFS/RDMA client side for Linux 4.3 Chuck Lever
2015-08-03 17:02 ` [PATCH v4 01/16] xprtrdma: Make xprt_setup_rdma() agnostic to family of server address Chuck Lever
2015-08-03 17:02 ` [PATCH v4 02/16] xprtrdma: Raise maximum payload size to one megabyte Chuck Lever
2015-08-03 17:02 ` [PATCH v4 03/16] xprtrdma: Increase default credit limit Chuck Lever
2015-08-03 17:03 ` [PATCH v4 04/16] xprtrdma: Don't fall back to PHYSICAL memory registration Chuck Lever
2015-08-03 17:03 ` [PATCH v4 05/16] xprtrdma: Remove last ib_reg_phys_mr() call site Chuck Lever
2015-08-03 17:03 ` [PATCH v4 06/16] xprtrdma: Clean up rpcrdma_ia_open() Chuck Lever
2015-08-03 17:03 ` [PATCH v4 07/16] xprtrdma: Remove logic that constructs RDMA_MSGP type calls Chuck Lever
2015-08-03 17:03 ` [PATCH v4 08/16] xprtrdma: Account for RPC/RDMA header size when deciding to inline Chuck Lever
2015-08-03 17:03 ` [PATCH v4 09/16] xprtrdma: Always provide a write list when sending NFS READ Chuck Lever
2015-08-03 17:04 ` [PATCH v4 10/16] xprtrdma: Don't provide a reply chunk when expecting a short reply Chuck Lever
2015-08-03 17:04 ` Chuck Lever [this message]
2015-08-03 17:04 ` [PATCH v4 12/16] xprtrdma: Fix large NFS SYMLINK calls Chuck Lever
2015-08-03 17:04 ` [PATCH v4 13/16] xprtrdma: Clean up xprt_rdma_print_stats() Chuck Lever
2015-08-03 17:04 ` [PATCH v4 14/16] xprtrdma: Count RDMA_NOMSG type calls Chuck Lever
2015-08-03 17:04 ` [PATCH v4 15/16] core: Remove the ib_reg_phys_mr() and ib_rereg_phys_mr() verbs Chuck Lever
2015-08-03 17:05 ` [PATCH v4 16/16] xprtrdma: take HCA driver refcount at client Chuck Lever
2015-08-03 17:07   ` Chuck Lever
2015-08-06 14:28 ` [PATCH v4 00/16] NFS/RDMA client side for Linux 4.3 Anna Schumaker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150803170417.9115.23960.stgit@manet.1015granger.net \
    --to=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox