* [PATCH v3 00/12] NFS/RDMA client-side patches for 4.11
@ 2017-02-08 21:59 Chuck Lever
[not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
0 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2017-02-08 21:59 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-nfs-u79uwXL29TY76Z2rM5mHXA
Hi Anna-
This series contains bug fixes and adds support for RPC-over-RDMA connection
keepalive. The keepalive patches are still waiting for internal
testing resources to confirm they trigger connection loss in the
right circumstances, but my own testing shows they are behaving as
expected and do not introduce instability.
Available in the "nfs-rdma-for-4.11" topic branch of this git repo:
git://git.linux-nfs.org/projects/cel/cel-2.6.git
Or for browsing:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.11
Changes since v2:
- Rebased on v4.10-rc7
- v4.10-rc bugfixes merged into this series
- Minor improvements to patch descriptions
- The field move formerly in 12/12 is now done in the correct patch
Changes since v1:
- Rebased on v4.10-rc6
- Tested-by and additional clean-up in 1/7
- Patch description clarifications
- Renamed some constants and variables
---
Chuck Lever (12):
xprtrdma: Fix Read chunk padding
xprtrdma: Per-connection pad optimization
xprtrdma: Disable pad optimization by default
xprtrdma: Reduce required number of send SGEs
xprtrdma: Shrink send SGEs array
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Handle stale connection rejection
xprtrdma: Refactor management of mw_list field
sunrpc: Allow xprt->ops->timer method to sleep
sunrpc: Enable calls to rpc_call_null_helper() from other modules
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
sunrpc: Allow keepalive ping on a credit-full transport
fs/nfs/nfs4proc.c | 3 -
fs/nfsd/nfs4callback.c | 2 -
include/linux/sunrpc/clnt.h | 5 ++
include/linux/sunrpc/sched.h | 4 +
net/sunrpc/clnt.c | 28 +++++-----
net/sunrpc/xprt.c | 6 +-
net/sunrpc/xprtrdma/fmr_ops.c | 5 --
net/sunrpc/xprtrdma/frwr_ops.c | 11 +---
net/sunrpc/xprtrdma/rpc_rdma.c | 82 ++++++++++++++++++-----------
net/sunrpc/xprtrdma/transport.c | 76 +++++++++++++++++++++++++--
net/sunrpc/xprtrdma/verbs.c | 109 +++++++++++++++------------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 37 ++++++++++++-
net/sunrpc/xprtsock.c | 2 +
13 files changed, 234 insertions(+), 136 deletions(-)
--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [PATCH v3 01/12] xprtrdma: Fix Read chunk padding
From: Chuck Lever @ 2017-02-08 21:59 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

When pad optimization is disabled, rpcrdma_convert_iovs still does
not add explicit XDR round-up padding to a Read chunk.

Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the test for whether round-up padding
is needed that appears later in rpcrdma_convert_iovs.

However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.

So, it's easy to just skip the tail when padding optimization is
enabled, and add the tail in a subsequent Read chunk segment, if
disabled.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c52e0f2..a524d3c 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -226,8 +226,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	if (len && n == RPCRDMA_MAX_SEGS)
 		goto out_overflow;
 
-	/* When encoding the read list, the tail is always sent inline */
-	if (type == rpcrdma_readch)
+	/* When encoding a Read chunk, the tail iovec contains an
+	 * XDR pad and may be omitted.
+	 */
+	if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
 		return n;
 
 	/* When encoding the Write list, some servers need to see an extra
@@ -238,10 +240,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		return n;
 
 	if (xdrbuf->tail[0].iov_len) {
-		/* the rpcrdma protocol allows us to omit any trailing
-		 * xdr pad bytes, saving the server an RDMA operation. */
-		if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
-			return n;
 		n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
 		if (n == RPCRDMA_MAX_SEGS)
 			goto out_overflow;
--
* [PATCH v3 02/12] xprtrdma: Per-connection pad optimization
From: Chuck Lever @ 2017-02-08 21:59 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.

The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.

So instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use the copy, which
can't change during the life of the connection, instead.

This also removes a hack: rpcrdma_convert_iovs was using the
remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected
flag. It's more sensible to use the pad optimization setting
instead.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  | 28 ++++++++++++++--------------
 net/sunrpc/xprtrdma/verbs.c     |  1 +
 net/sunrpc/xprtrdma/xprt_rdma.h |  1 +
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a524d3c..c634f0f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -186,9 +186,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
  */
 
 static int
-rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
-	enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg,
-	bool reminv_expected)
+rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
+		     unsigned int pos, enum rpcrdma_chunktype type,
+		     struct rpcrdma_mr_seg *seg)
 {
 	int len, n, p, page_base;
 	struct page **ppages;
@@ -229,14 +229,15 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	/* When encoding a Read chunk, the tail iovec contains an
 	 * XDR pad and may be omitted.
 	 */
-	if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
+	if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_roundup)
 		return n;
 
-	/* When encoding the Write list, some servers need to see an extra
-	 * segment for odd-length Write chunks. The upper layer provides
-	 * space in the tail iovec for this purpose.
+	/* When encoding a Write chunk, some servers need to see an
+	 * extra segment for non-XDR-aligned Write chunks. The upper
+	 * layer provides space in the tail iovec that may be used
+	 * for this purpose.
 	 */
-	if (type == rpcrdma_writech && reminv_expected)
+	if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_roundup)
 		return n;
 
 	if (xdrbuf->tail[0].iov_len) {
@@ -291,7 +292,8 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	if (rtype == rpcrdma_areadch)
 		pos = 0;
 	seg = req->rl_segments;
-	nsegs = rpcrdma_convert_iovs(&rqst->rq_snd_buf, pos, rtype, seg, false);
+	nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
+				     rtype, seg);
 	if (nsegs < 0)
 		return ERR_PTR(nsegs);
 
@@ -353,10 +355,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	}
 
 	seg = req->rl_segments;
-	nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf,
+	nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
 				     rqst->rq_rcv_buf.head[0].iov_len,
-				     wtype, seg,
-				     r_xprt->rx_ia.ri_reminv_expected);
+				     wtype, seg);
 	if (nsegs < 0)
 		return ERR_PTR(nsegs);
 
@@ -421,8 +422,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	}
 
 	seg = req->rl_segments;
-	nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf, 0, wtype, seg,
-				     r_xprt->rx_ia.ri_reminv_expected);
+	nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
 	if (nsegs < 0)
 		return ERR_PTR(nsegs);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 11d0774..2a6a367 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -208,6 +208,7 @@
 
 	/* Default settings for RPC-over-RDMA Version One */
 	r_xprt->rx_ia.ri_reminv_expected = false;
+	r_xprt->rx_ia.ri_implicit_roundup = xprt_rdma_pad_optimize;
 	rsize = RPCRDMA_V1_DEF_INLINE_SIZE;
 	wsize = RPCRDMA_V1_DEF_INLINE_SIZE;
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e35efd4..c137154 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -75,6 +75,7 @@ struct rpcrdma_ia {
 	unsigned int ri_max_inline_write;
 	unsigned int ri_max_inline_read;
 	bool ri_reminv_expected;
+	bool ri_implicit_roundup;
 	enum ib_mr_type ri_mrtype;
 	struct ib_qp_attr ri_qp_attr;
 	struct ib_qp_init_attr ri_qp_init_attr;
--
* [PATCH v3 03/12] xprtrdma: Disable pad optimization by default
From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

Commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.

Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3e5
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).

So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.

Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and
these Solaris bugs will prevent interoperation in more cases.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/transport.c | 2 +-
 net/sunrpc/xprtrdma/verbs.c     | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 534c178..6990581 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -67,7 +67,7 @@
 static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
 static unsigned int xprt_rdma_inline_write_padding;
 static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
-int xprt_rdma_pad_optimize = 1;
+int xprt_rdma_pad_optimize = 0;
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2a6a367..23f4da4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -216,6 +216,7 @@
 	    pmsg->cp_magic == rpcrdma_cmp_magic &&
 	    pmsg->cp_version == RPCRDMA_CMP_VERSION) {
 		r_xprt->rx_ia.ri_reminv_expected = true;
+		r_xprt->rx_ia.ri_implicit_roundup = true;
 		rsize = rpcrdma_decode_buffer_size(pmsg->cp_send_size);
 		wsize = rpcrdma_decode_buffer_size(pmsg->cp_recv_size);
 	}
--
* [PATCH v3 04/12] xprtrdma: Reduce required number of send SGEs
From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

The MAX_SEND_SGES check introduced in commit 655fec6987be
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing all together to mount the server.

This fix supports devices that have as few as three send SGEs
available.

Reported-by: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reported-by: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reported-by: Honggang Li <honli-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reported-by: Ram Amrani <Ram.Amrani-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...")
Tested-by: Honggang Li <honli-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by: Ram Amrani <Ram.Amrani-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
Tested-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Reviewed-by: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  | 26 +++++++++++++++++++++++---
 net/sunrpc/xprtrdma/verbs.c     | 13 +++++++------
 net/sunrpc/xprtrdma/xprt_rdma.h |  2 ++
 3 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c634f0f..d889883 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -125,14 +125,34 @@ void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
 /* The client can send a request inline as long as the RPCRDMA header
  * plus the RPC call fit under the transport's inline limit. If the
  * combined call message size exceeds that limit, the client must use
- * the read chunk list for this operation.
+ * a Read chunk for this operation.
+ *
+ * A Read chunk is also required if sending the RPC call inline would
+ * exceed this device's max_sge limit.
  */
 static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
 				struct rpc_rqst *rqst)
 {
-	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+	struct xdr_buf *xdr = &rqst->rq_snd_buf;
+	unsigned int count, remaining, offset;
+
+	if (xdr->len > r_xprt->rx_ia.ri_max_inline_write)
+		return false;
 
-	return rqst->rq_snd_buf.len <= ia->ri_max_inline_write;
+	if (xdr->page_len) {
+		remaining = xdr->page_len;
+		offset = xdr->page_base & ~PAGE_MASK;
+		count = 0;
+		while (remaining) {
+			remaining -= min_t(unsigned int,
+					   PAGE_SIZE - offset, remaining);
+			offset = 0;
+			if (++count > r_xprt->rx_ia.ri_max_send_sges)
+				return false;
+		}
+	}
+
+	return true;
 }
 
 /* The client can't know how large the actual reply will be. Thus it
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 23f4da4..61d16c3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -488,18 +488,19 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
  */
 int
 rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
-	struct rpcrdma_create_data_internal *cdata)
+		  struct rpcrdma_create_data_internal *cdata)
 {
 	struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
+	unsigned int max_qp_wr, max_sge;
 	struct ib_cq *sendcq, *recvcq;
-	unsigned int max_qp_wr;
 	int rc;
 
-	if (ia->ri_device->attrs.max_sge < RPCRDMA_MAX_SEND_SGES) {
-		dprintk("RPC: %s: insufficient sge's available\n",
-			__func__);
+	max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
+	if (max_sge < RPCRDMA_MIN_SEND_SGES) {
+		pr_warn("rpcrdma: HCA provides only %d send SGEs\n", max_sge);
 		return -ENOMEM;
 	}
+	ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
 
 	if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
 		dprintk("RPC: %s: insufficient wqe's available\n",
@@ -524,7 +525,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
 	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
 	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
 	ep->rep_attr.cap.max_recv_wr += 1; /* drain cqe */
-	ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_SEND_SGES;
+	ep->rep_attr.cap.max_send_sge = max_sge;
 	ep->rep_attr.cap.max_recv_sge = 1;
 	ep->rep_attr.cap.max_inline_data = 0;
 	ep->rep_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c137154..3d7e9c9 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -74,6 +74,7 @@ struct rpcrdma_ia {
 	unsigned int ri_max_frmr_depth;
 	unsigned int ri_max_inline_write;
 	unsigned int ri_max_inline_read;
+	unsigned int ri_max_send_sges;
 	bool ri_reminv_expected;
 	bool ri_implicit_roundup;
 	enum ib_mr_type ri_mrtype;
@@ -311,6 +312,7 @@ struct rpcrdma_mr_seg {		/* chunk descriptors */
  * - xdr_buf tail iovec
  */
 enum {
+	RPCRDMA_MIN_SEND_SGES = 3,
 	RPCRDMA_MAX_SEND_PAGES = PAGE_SIZE + RPCRDMA_MAX_INLINE - 1,
 	RPCRDMA_MAX_PAGE_SGES = (RPCRDMA_MAX_SEND_PAGES >> PAGE_SHIFT) + 1,
 	RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
--
* [PATCH v3 05/12] xprtrdma: Shrink send SGEs array
From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial
or whole pages to send than there are available SGEs, the
marshaling logic is now smart enough to use a Read chunk instead
of failing.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/xprt_rdma.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3d7e9c9..852dd0a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -305,16 +305,19 @@ struct rpcrdma_mr_seg {		/* chunk descriptors */
 	char *mr_offset;	/* kva if no page, else offset */
 };
 
-/* Reserve enough Send SGEs to send a maximum size inline request:
+/* The Send SGE array is provisioned to send a maximum size
+ * inline request:
  * - RPC-over-RDMA header
  * - xdr_buf head iovec
- * - RPCRDMA_MAX_INLINE bytes, possibly unaligned, in pages
+ * - RPCRDMA_MAX_INLINE bytes, in pages
  * - xdr_buf tail iovec
+ *
+ * The actual number of array elements consumed by each RPC
+ * depends on the device's max_sge limit.
  */
 enum {
 	RPCRDMA_MIN_SEND_SGES = 3,
-	RPCRDMA_MAX_SEND_PAGES = PAGE_SIZE + RPCRDMA_MAX_INLINE - 1,
-	RPCRDMA_MAX_PAGE_SGES = (RPCRDMA_MAX_SEND_PAGES >> PAGE_SHIFT) + 1,
+	RPCRDMA_MAX_PAGE_SGES = RPCRDMA_MAX_INLINE >> PAGE_SHIFT,
 	RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
 };
--
* [PATCH v3 06/12] xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

Sriharsha (sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org)
reports an occasional double DMA unmap of an FRWR MR when a
connection is lost. I see one way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is
doing an unreg / alloc_mr, a DMA unmap, and marking each of these
frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be
running.

- If the recovery thread runs last, then the frmr is marked
  FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
  but the recovery thread has already DMA unmapped that MR. When
  ->frwr_op_map later re-uses this frmr, it sees it is not marked
  FRMR_IS_INVALID, and tries to recover it before using it,
  resulting in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before
MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  | 14 ++++++++------
 net/sunrpc/xprtrdma/transport.c |  4 ----
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d889883..72b3ca0 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -759,13 +759,13 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	iptr = headerp->rm_body.rm_chunks;
 	iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
 	if (IS_ERR(iptr))
-		goto out_unmap;
+		goto out_err;
 	iptr = rpcrdma_encode_write_list(r_xprt, req, rqst, iptr, wtype);
 	if (IS_ERR(iptr))
-		goto out_unmap;
+		goto out_err;
 	iptr = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);
 	if (IS_ERR(iptr))
-		goto out_unmap;
+		goto out_err;
 	hdrlen = (unsigned char *)iptr - (unsigned char *)headerp;
 
 	dprintk("RPC: %5u %s: %s/%s: hdrlen %zd rpclen %zd\n",
@@ -776,12 +776,14 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, hdrlen,
 				       &rqst->rq_snd_buf, rtype)) {
 		iptr = ERR_PTR(-EIO);
-		goto out_unmap;
+		goto out_err;
 	}
 	return 0;
 
-out_unmap:
-	r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
+out_err:
+	pr_err("rpcrdma: rpcrdma_marshal_req failed, status %ld\n",
+	       PTR_ERR(iptr));
+	r_xprt->rx_stats.failed_marshal_count++;
 	return PTR_ERR(iptr);
 }
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6990581..c717f54 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -709,10 +709,6 @@
 	return 0;
 
 failed_marshal:
-	dprintk("RPC: %s: rpcrdma_marshal_req failed, status %i\n",
-		__func__, rc);
-	if (rc == -EIO)
-		r_xprt->rx_stats.failed_marshal_count++;
 	if (rc != -ENOTCONN)
 		return rc;
 drop_connection:
--
* [PATCH v3 07/12] xprtrdma: Handle stale connection rejection
From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw)
To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA

A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that corresponds to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/verbs.c | 66 ++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 45 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61d16c3..d1ee33f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -54,6 +54,7 @@
 #include <linux/sunrpc/svc_rdma.h>
 #include <asm/bitops.h>
 #include <linux/module.h> /* try_module_get()/module_put() */
+#include <rdma/ib_cm.h>
 
 #include "xprt_rdma.h"
 
@@ -279,7 +280,14 @@
 		connstate = -ENETDOWN;
 		goto connected;
 	case RDMA_CM_EVENT_REJECTED:
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+		pr_info("rpcrdma: connection to %pIS:%u on %s rejected: %s\n",
+			sap, rpc_get_port(sap), ia->ri_device->name,
+			rdma_reject_msg(id, event->status));
+#endif
 		connstate = -ECONNREFUSED;
+		if (event->status == IB_CM_REJ_STALE_CONN)
+			connstate = -EAGAIN;
 		goto connected;
 	case RDMA_CM_EVENT_DISCONNECTED:
 		connstate = -ECONNABORTED;
@@ -643,20 +651,21 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
 int
 rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
+	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
+						   rx_ia);
 	struct rdma_cm_id *id, *old;
+	struct sockaddr *sap;
+	unsigned int extras;
 	int rc = 0;
-	int retry_count = 0;
 
 	if (ep->rep_connected != 0) {
-		struct rpcrdma_xprt *xprt;
 retry:
 		dprintk("RPC: %s: reconnecting...\n", __func__);
 
 		rpcrdma_ep_disconnect(ep, ia);
 
-		xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
-		id = rpcrdma_create_id(xprt, ia,
-				(struct sockaddr *)&xprt->rx_data.addr);
+		sap = (struct sockaddr *)&r_xprt->rx_data.addr;
+		id = rpcrdma_create_id(r_xprt, ia, sap);
 		if (IS_ERR(id)) {
 			rc = -EHOSTUNREACH;
 			goto out;
@@ -711,51 +720,18 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
 	}
 
 	wait_event_interruptible(ep->rep_connect_wait, ep->rep_connected != 0);
-
-	/*
-	 * Check state. A non-peer reject indicates no listener
-	 * (ECONNREFUSED), which may be a transient state. All
-	 * others indicate a transport condition which has already
-	 * undergone a best-effort.
-	 */
-	if (ep->rep_connected == -ECONNREFUSED &&
-	    ++retry_count <= RDMA_CONNECT_RETRY_MAX) {
-		dprintk("RPC: %s: non-peer_reject, retry\n", __func__);
-		goto retry;
-	}
 	if (ep->rep_connected <= 0) {
-		/* Sometimes, the only way to reliably connect to remote
-		 * CMs is to use same nonzero values for ORD and IRD. */
-		if (retry_count++ <= RDMA_CONNECT_RETRY_MAX + 1 &&
-		    (ep->rep_remote_cma.responder_resources == 0 ||
-		     ep->rep_remote_cma.initiator_depth !=
-		     ep->rep_remote_cma.responder_resources)) {
-			if (ep->rep_remote_cma.responder_resources == 0)
-				ep->rep_remote_cma.responder_resources = 1;
-			ep->rep_remote_cma.initiator_depth =
-				ep->rep_remote_cma.responder_resources;
+		if (ep->rep_connected == -EAGAIN)
 			goto retry;
-		}
 		rc = ep->rep_connected;
-	} else {
-		struct rpcrdma_xprt *r_xprt;
-		unsigned int extras;
-
-		dprintk("RPC: %s: connected\n", __func__);
-
-		r_xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
-		extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
-
-		if (extras) {
-			rc = rpcrdma_ep_post_extra_recv(r_xprt, extras);
-			if (rc) {
-				pr_warn("%s: rpcrdma_ep_post_extra_recv: %i\n",
-					__func__, rc);
-				rc = 0;
-			}
-		}
+		goto out;
 	}
 
+	dprintk("RPC: %s: connected\n", __func__);
+	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
+	if (extras)
+		rpcrdma_ep_post_extra_recv(r_xprt, extras);
+
 out:
 	if (rc)
 		ep->rep_connected = rc;
--
* [PATCH v3 08/12] xprtrdma: Refactor management of mw_list field [not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> ` (6 preceding siblings ...) 2017-02-08 22:00 ` [PATCH v3 07/12] xprtrdma: Handle stale connection rejection Chuck Lever @ 2017-02-08 22:00 ` Chuck Lever 2017-02-08 22:00 ` [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep Chuck Lever ` (3 subsequent siblings) 11 siblings, 0 replies; 21+ messages in thread From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA Clean up some duplicate code. Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- net/sunrpc/xprtrdma/fmr_ops.c | 5 +---- net/sunrpc/xprtrdma/frwr_ops.c | 11 ++++------- net/sunrpc/xprtrdma/rpc_rdma.c | 6 +++--- net/sunrpc/xprtrdma/verbs.c | 15 +++++---------- net/sunrpc/xprtrdma/xprt_rdma.h | 16 ++++++++++++++++ 5 files changed, 29 insertions(+), 24 deletions(-) diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c index 1ebb09e..59e6402 100644 --- a/net/sunrpc/xprtrdma/fmr_ops.c +++ b/net/sunrpc/xprtrdma/fmr_ops.c @@ -310,10 +310,7 @@ enum { struct rpcrdma_mw *mw; while (!list_empty(&req->rl_registered)) { - mw = list_first_entry(&req->rl_registered, - struct rpcrdma_mw, mw_list); - list_del_init(&mw->mw_list); - + mw = rpcrdma_pop_mw(&req->rl_registered); if (sync) fmr_op_recover_mr(mw); else diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c index 47bed53..f81dd93 100644 --- a/net/sunrpc/xprtrdma/frwr_ops.c +++ b/net/sunrpc/xprtrdma/frwr_ops.c @@ -466,8 +466,8 @@ struct ib_send_wr *first, **prev, *last, *bad_wr; struct rpcrdma_rep *rep = req->rl_reply; struct rpcrdma_ia *ia = &r_xprt->rx_ia; - struct rpcrdma_mw *mw, *tmp; struct rpcrdma_frmr *f; + struct rpcrdma_mw *mw; int count, rc; dprintk("RPC: %s: req %p\n", __func__, req); @@ 
-534,10 +534,10 @@ * them to the free MW list. */ unmap: - list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) { + while (!list_empty(&req->rl_registered)) { + mw = rpcrdma_pop_mw(&req->rl_registered); dprintk("RPC: %s: DMA unmapping frmr %p\n", __func__, &mw->frmr); - list_del_init(&mw->mw_list); ib_dma_unmap_sg(ia->ri_device, mw->mw_sg, mw->mw_nents, mw->mw_dir); rpcrdma_put_mw(r_xprt, mw); @@ -571,10 +571,7 @@ struct rpcrdma_mw *mw; while (!list_empty(&req->rl_registered)) { - mw = list_first_entry(&req->rl_registered, - struct rpcrdma_mw, mw_list); - list_del_init(&mw->mw_list); - + mw = rpcrdma_pop_mw(&req->rl_registered); if (sync) frwr_op_recover_mr(mw); else diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index 72b3ca0..a044be2 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -322,7 +322,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt, false, &mw); if (n < 0) return ERR_PTR(n); - list_add(&mw->mw_list, &req->rl_registered); + rpcrdma_push_mw(mw, &req->rl_registered); *iptr++ = xdr_one; /* item present */ @@ -390,7 +390,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt, true, &mw); if (n < 0) return ERR_PTR(n); - list_add(&mw->mw_list, &req->rl_registered); + rpcrdma_push_mw(mw, &req->rl_registered); iptr = xdr_encode_rdma_segment(iptr, mw); @@ -455,7 +455,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt, true, &mw); if (n < 0) return ERR_PTR(n); - list_add(&mw->mw_list, &req->rl_registered); + rpcrdma_push_mw(mw, &req->rl_registered); iptr = xdr_encode_rdma_segment(iptr, mw); diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index d1ee33f..81cd31a 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -776,9 +776,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id) spin_lock(&buf->rb_recovery_lock); while (!list_empty(&buf->rb_stale_mrs)) { - mw = 
list_first_entry(&buf->rb_stale_mrs, - struct rpcrdma_mw, mw_list); - list_del_init(&mw->mw_list); + mw = rpcrdma_pop_mw(&buf->rb_stale_mrs); spin_unlock(&buf->rb_recovery_lock); dprintk("RPC: %s: recovering MR %p\n", __func__, mw); @@ -796,7 +794,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id) struct rpcrdma_buffer *buf = &r_xprt->rx_buf; spin_lock(&buf->rb_recovery_lock); - list_add(&mw->mw_list, &buf->rb_stale_mrs); + rpcrdma_push_mw(mw, &buf->rb_stale_mrs); spin_unlock(&buf->rb_recovery_lock); schedule_delayed_work(&buf->rb_recovery_worker, 0); @@ -1072,11 +1070,8 @@ struct rpcrdma_mw * struct rpcrdma_mw *mw = NULL; spin_lock(&buf->rb_mwlock); - if (!list_empty(&buf->rb_mws)) { - mw = list_first_entry(&buf->rb_mws, - struct rpcrdma_mw, mw_list); - list_del_init(&mw->mw_list); - } + if (!list_empty(&buf->rb_mws)) + mw = rpcrdma_pop_mw(&buf->rb_mws); spin_unlock(&buf->rb_mwlock); if (!mw) @@ -1099,7 +1094,7 @@ struct rpcrdma_mw * struct rpcrdma_buffer *buf = &r_xprt->rx_buf; spin_lock(&buf->rb_mwlock); - list_add_tail(&mw->mw_list, &buf->rb_mws); + rpcrdma_push_mw(mw, &buf->rb_mws); spin_unlock(&buf->rb_mwlock); } diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 852dd0a..171a351 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -354,6 +354,22 @@ struct rpcrdma_req { return rqst->rq_xprtdata; } +static inline void +rpcrdma_push_mw(struct rpcrdma_mw *mw, struct list_head *list) +{ + list_add_tail(&mw->mw_list, list); +} + +static inline struct rpcrdma_mw * +rpcrdma_pop_mw(struct list_head *list) +{ + struct rpcrdma_mw *mw; + + mw = list_first_entry(list, struct rpcrdma_mw, mw_list); + list_del(&mw->mw_list); + return mw; +} + /* * struct rpcrdma_buffer -- holds list/queue of pre-registered memory for * inline requests/replies, and client/server credits. 
* [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep [not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> ` (7 preceding siblings ...) 2017-02-08 22:00 ` [PATCH v3 08/12] xprtrdma: Refactor management of mw_list field Chuck Lever @ 2017-02-08 22:00 ` Chuck Lever [not found] ` <20170208220051.7152.67740.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> 2017-02-08 22:00 ` [PATCH v3 10/12] sunrpc: Enable calls to rpc_call_null_helper() from other modules Chuck Lever ` (2 subsequent siblings) 11 siblings, 1 reply; 21+ messages in thread From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA The transport lock is needed to protect the xprt_adjust_cwnd() call in xs_udp_timer, but it is not necessary for accessing the rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate the lock into UDP's xs_udp_timer method, where it is required. The ->timer method has to take the transport lock if needed, but it can now sleep safely, or even call back into the RPC scheduler. This is more a clean-up than a fix, but the "issue" was introduced by my transport switch patches back in 2005. 
Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations") Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- net/sunrpc/xprt.c | 2 -- net/sunrpc/xprtsock.c | 2 ++ 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index 9a6be03..b530a28 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -897,13 +897,11 @@ static void xprt_timer(struct rpc_task *task) return; dprintk("RPC: %5u xprt_timer\n", task->tk_pid); - spin_lock_bh(&xprt->transport_lock); if (!req->rq_reply_bytes_recvd) { if (xprt->ops->timer) xprt->ops->timer(xprt, task); } else task->tk_status = 0; - spin_unlock_bh(&xprt->transport_lock); } /** diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index af392d9..d9bb644 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -1734,7 +1734,9 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t */ static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task *task) { + spin_lock_bh(&xprt->transport_lock); xprt_adjust_cwnd(xprt, task, -ETIMEDOUT); + spin_unlock_bh(&xprt->transport_lock); } static unsigned short xs_get_random_port(void)
* Re: [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep [not found] ` <20170208220051.7152.67740.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> @ 2017-02-08 23:48 ` Trond Myklebust 0 siblings, 0 replies; 21+ messages in thread From: Trond Myklebust @ 2017-02-08 23:48 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org, chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2017-02-08 at 17:00 -0500, Chuck Lever wrote: > The transport lock is needed to protect the xprt_adjust_cwnd() call > in xs_udp_timer, but it is not necessary for accessing the > rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate > the lock into UDP's xs_udp_timer method, where it is required. > > The ->timer method has to take the transport lock if needed, but it > can now sleep safely, or even call back into the RPC scheduler. > > This is more a clean-up than a fix, but the "issue" was introduced > by my transport switch patches back in 2005.
> > Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations") > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- >  net/sunrpc/xprt.c     |    2 -- >  net/sunrpc/xprtsock.c |    2 ++ >  2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > index 9a6be03..b530a28 100644 > --- a/net/sunrpc/xprt.c > +++ b/net/sunrpc/xprt.c > @@ -897,13 +897,11 @@ static void xprt_timer(struct rpc_task *task) >  return; >  dprintk("RPC: %5u xprt_timer\n", task->tk_pid); >  > - spin_lock_bh(&xprt->transport_lock); >  if (!req->rq_reply_bytes_recvd) { >  if (xprt->ops->timer) >  xprt->ops->timer(xprt, task); >  } else >  task->tk_status = 0; > - spin_unlock_bh(&xprt->transport_lock); >  } >  >  /** > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > index af392d9..d9bb644 100644 > --- a/net/sunrpc/xprtsock.c > +++ b/net/sunrpc/xprtsock.c > @@ -1734,7 +1734,9 @@ static void xs_udp_set_buffer_size(struct > rpc_xprt *xprt, size_t sndsize, size_t >  */ >  static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task > *task) >  { > + spin_lock_bh(&xprt->transport_lock); >  xprt_adjust_cwnd(xprt, task, -ETIMEDOUT); > + spin_unlock_bh(&xprt->transport_lock); >  } >  >  static unsigned short xs_get_random_port(void) > Thanks! Good cleanup... Trond -- Trond Myklebust Principal System Architect 4300 El Camino Real | Suite 100 Los Altos, CA  94022 W: 650-422-3800 C: 801-921-4583 www.primarydata.com
* [PATCH v3 10/12] sunrpc: Enable calls to rpc_call_null_helper() from other modules [not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> ` (8 preceding siblings ...) 2017-02-08 22:00 ` [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep Chuck Lever @ 2017-02-08 22:00 ` Chuck Lever 2017-02-08 22:01 ` [PATCH v3 11/12] xprtrdma: Detect unreachable NFS/RDMA servers more reliably Chuck Lever 2017-02-08 22:01 ` [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport Chuck Lever 11 siblings, 0 replies; 21+ messages in thread From: Chuck Lever @ 2017-02-08 22:00 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA To help detect unreachable servers, I'd like to emit an RPC ping from rpcrdma.ko. authnull_ops is not visible outside the sunrpc.ko module, so fold the common case into rpc_call_null_helper, and export it, so that it can be invoked from other kernel modules. 
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- fs/nfs/nfs4proc.c | 3 +-- fs/nfsd/nfs4callback.c | 2 +- include/linux/sunrpc/clnt.h | 5 +++++ include/linux/sunrpc/sched.h | 2 ++ net/sunrpc/clnt.c | 28 ++++++++++++++-------------- 5 files changed, 23 insertions(+), 17 deletions(-) diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 0a0eaec..0091e5a 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -7640,8 +7640,7 @@ static int _nfs4_proc_exchange_id(struct nfs_client *clp, struct rpc_cred *cred, if (xprt) { calldata->xprt = xprt; task_setup_data.rpc_xprt = xprt; - task_setup_data.flags = - RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC; + task_setup_data.flags = RPC_TASK_SOFTPING | RPC_TASK_ASYNC; calldata->args.verifier = &clp->cl_confirm; } else { calldata->args.verifier = &verifier; diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c index eb78109..e1e2224 100644 --- a/fs/nfsd/nfs4callback.c +++ b/fs/nfsd/nfs4callback.c @@ -1182,7 +1182,7 @@ static void nfsd4_process_cb_update(struct nfsd4_callback *cb) } cb->cb_msg.rpc_cred = clp->cl_cb_cred; - rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN, + rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFTPING, cb->cb_ops ? 
&nfsd4_cb_ops : &nfsd4_cb_probe_ops, cb); } diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h index 333ad11..f576127 100644 --- a/include/linux/sunrpc/clnt.h +++ b/include/linux/sunrpc/clnt.h @@ -173,6 +173,11 @@ int rpc_call_async(struct rpc_clnt *clnt, void *calldata); int rpc_call_sync(struct rpc_clnt *clnt, const struct rpc_message *msg, int flags); +struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt, + struct rpc_xprt *xprt, + struct rpc_cred *cred, int flags, + const struct rpc_call_ops *ops, + void *data); struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred, int flags); int rpc_restart_call_prepare(struct rpc_task *); diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h index 7ba040c..13822e6 100644 --- a/include/linux/sunrpc/sched.h +++ b/include/linux/sunrpc/sched.h @@ -128,6 +128,8 @@ struct rpc_task_setup { #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */ #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */ +#define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN) + #define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC) #define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER) #define RPC_DO_ROOTOVERRIDE(t) ((t)->tk_flags & RPC_TASK_ROOTCREDS) diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index 1dc9f3b..642c93d 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -2520,12 +2520,11 @@ static int rpc_ping(struct rpc_clnt *clnt) }; int err; msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0); - err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN); + err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFTPING); put_rpccred(msg.rpc_cred); return err; } -static struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt, struct rpc_xprt *xprt, struct rpc_cred *cred, int flags, const struct rpc_call_ops *ops, void *data) @@ -2542,9 +2541,17 @@ struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt, .callback_data 
= data, .flags = flags, }; + struct rpc_task *task; - return rpc_run_task(&task_setup_data); + if (!cred) + msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0); + task = rpc_run_task(&task_setup_data); + if (!cred) + put_rpccred(msg.rpc_cred); + + return task; } +EXPORT_SYMBOL_GPL(rpc_call_null_helper); struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred, int flags) { @@ -2591,7 +2598,6 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt, void *dummy) { struct rpc_cb_add_xprt_calldata *data; - struct rpc_cred *cred; struct rpc_task *task; data = kmalloc(sizeof(*data), GFP_NOFS); @@ -2600,11 +2606,9 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt, data->xps = xprt_switch_get(xps); data->xprt = xprt_get(xprt); - cred = authnull_ops.lookup_cred(NULL, NULL, 0); - task = rpc_call_null_helper(clnt, xprt, cred, - RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC, - &rpc_cb_add_xprt_call_ops, data); - put_rpccred(cred); + task = rpc_call_null_helper(clnt, xprt, NULL, + RPC_TASK_SOFTPING | RPC_TASK_ASYNC, + &rpc_cb_add_xprt_call_ops, data); if (IS_ERR(task)) return PTR_ERR(task); rpc_put_task(task); @@ -2635,7 +2639,6 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt, struct rpc_xprt *xprt, void *data) { - struct rpc_cred *cred; struct rpc_task *task; struct rpc_add_xprt_test *xtest = (struct rpc_add_xprt_test *)data; int status = -EADDRINUSE; @@ -2647,11 +2650,8 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt, goto out_err; /* Test the connection */ - cred = authnull_ops.lookup_cred(NULL, NULL, 0); - task = rpc_call_null_helper(clnt, xprt, cred, - RPC_TASK_SOFT | RPC_TASK_SOFTCONN, + task = rpc_call_null_helper(clnt, xprt, NULL, RPC_TASK_SOFTPING, NULL, NULL); - put_rpccred(cred); if (IS_ERR(task)) { status = PTR_ERR(task); goto out_err; -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at 
* [PATCH v3 11/12] xprtrdma: Detect unreachable NFS/RDMA servers more reliably [not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> ` (9 preceding siblings ...) 2017-02-08 22:00 ` [PATCH v3 10/12] sunrpc: Enable calls to rpc_call_null_helper() from other modules Chuck Lever @ 2017-02-08 22:01 ` Chuck Lever 2017-02-08 22:01 ` [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport Chuck Lever 11 siblings, 0 replies; 21+ messages in thread From: Chuck Lever @ 2017-02-08 22:01 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA Current NFS clients rely on connection loss to determine when to retransmit. In particular, for protocols like NFSv4, clients no longer rely on RPC timeouts to drive retransmission: NFSv4 servers are required to terminate a connection when they need a client to retransmit pending RPCs. When a server is no longer reachable, either because it has crashed or because the network path has broken, the server cannot actively terminate a connection. Thus NFS clients depend on transport-level keepalive to determine when a connection must be replaced and pending RPCs retransmitted. However, RDMA RC connections do not have a native keepalive mechanism. If an NFS/RDMA server crashes after a client has sent RPCs successfully (an RC ACK has been received for all OTW RDMA requests), there is no way for the client to know the connection is moribund. In addition, new RDMA requests are subject to the RPC-over-RDMA credit limit. If the client has consumed all granted credits with NFS traffic, it is not allowed to send another RDMA request until the server replies. Thus it has no way to send a true keepalive when the workload has already consumed all credits with pending RPCs. To address this, emit an RPC NULL ping when an RPC retransmit timeout occurs. 
The purpose of this ping is to drive traffic on the connection to force the transport layer to disconnect it if it is no longer viable. Some RDMA operations are fully offloaded to the HCA, and can be successful even if the server O/S has crashed. Thus an operation that requires that the server is responsive is used for the ping. Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- net/sunrpc/xprtrdma/transport.c | 69 +++++++++++++++++++++++++++++++++++++++ net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++ 2 files changed, 76 insertions(+) diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index c717f54..3a5a805 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -484,6 +484,74 @@ dprintk("RPC: %s: %u\n", __func__, port); } +static void rpcrdma_keepalive_done(struct rpc_task *task, void *calldata) +{ + struct rpc_xprt *xprt = (struct rpc_xprt *)calldata; + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt); + + if (task->tk_status) { + struct sockaddr *sap = + (struct sockaddr *)&r_xprt->rx_ep.rep_remote_addr; + + pr_err("rpcrdma: keepalive to %pIS:%u failed (%d)\n", + sap, rpc_get_port(sap), task->tk_status); + xprt_disconnect_done(xprt); + } + clear_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags); +} + +static void rpcrdma_keepalive_release(void *calldata) +{ + struct rpc_xprt *xprt = (struct rpc_xprt *)calldata; + + xprt_put(xprt); +} + +static const struct rpc_call_ops rpcrdma_keepalive_call_ops = { + .rpc_call_done = rpcrdma_keepalive_done, + .rpc_release = rpcrdma_keepalive_release, +}; + +/** + * xprt_rdma_timer - invoked when an RPC times out + * @xprt: controlling RPC transport + * @task: RPC task that timed out + * + * Some RDMA transports do not have any form of connection + * keepalive. In some circumstances, unviable connections + * can continue to live for a long time. + * + * Send a NULL RPC to see if the server still responds. 
On + * a moribund connection, this should trigger either an RPC + * or transport layer timeout and kill the connection. + */ +static void +xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task) +{ + struct rpcrdma_xprt *r_xprt = + container_of(xprt, struct rpcrdma_xprt, rx_xprt); +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG) + struct rpcrdma_ep *ep = &r_xprt->rx_ep; + struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr; +#endif + struct rpc_task *null_task; + void *data; + + /* Ensure only one is sent at a time */ + if (!test_and_set_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags)) + return; + + dprintk("RPC: %s: sending keepalive ping to %pIS:%u\n", + __func__, sap, rpc_get_port(sap)); + + data = xprt_get(xprt); + null_task = rpc_call_null_helper(task->tk_client, xprt, NULL, + RPC_TASK_SOFTPING | RPC_TASK_ASYNC, + &rpcrdma_keepalive_call_ops, data); + if (!IS_ERR(null_task)) + rpc_put_task(null_task); +} + static void xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task) { @@ -776,6 +844,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq) .alloc_slot = xprt_alloc_slot, .release_request = xprt_release_rqst_cong, /* ditto */ .set_retrans_timeout = xprt_set_retrans_timeout_def, /* ditto */ + .timer = xprt_rdma_timer, .rpcbind = rpcb_getport_async, /* sunrpc/rpcb_clnt.c */ .set_port = xprt_rdma_set_port, .connect = xprt_rdma_connect, diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 171a351..dd1340f 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -78,10 +78,17 @@ struct rpcrdma_ia { bool ri_reminv_expected; bool ri_implicit_roundup; enum ib_mr_type ri_mrtype; + unsigned long ri_flags; struct ib_qp_attr ri_qp_attr; struct ib_qp_init_attr ri_qp_init_attr; }; +/* ri_flags bits + */ +enum { + RPCRDMA_IA_RSVD_CREDIT = 0, +}; + /* * RDMA Endpoint -- one per transport instance */ -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in 
* [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> ` (10 preceding siblings ...) 2017-02-08 22:01 ` [PATCH v3 11/12] xprtrdma: Detect unreachable NFS/RDMA servers more reliably Chuck Lever @ 2017-02-08 22:01 ` Chuck Lever [not found] ` <20170208220116.7152.87626.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> 11 siblings, 1 reply; 21+ messages in thread From: Chuck Lever @ 2017-02-08 22:01 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA Allow RPC-over-RDMA to send NULL pings even when the transport has hit its credit limit. One RPC-over-RDMA credit is reserved for operations like keepalive. For transports that convey NFSv4, it seems like lease renewal would also be a candidate for using a priority transport slot. I'd like to see a mechanism better than RPCRDMA_PRIORITY that can ensure only one priority operation is in use at a time. 
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- include/linux/sunrpc/sched.h | 2 ++ net/sunrpc/xprt.c | 4 ++++ net/sunrpc/xprtrdma/transport.c | 3 ++- net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- 4 files changed, 16 insertions(+), 6 deletions(-) diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h index 13822e6..fcea158 100644 --- a/include/linux/sunrpc/sched.h +++ b/include/linux/sunrpc/sched.h @@ -127,6 +127,7 @@ struct rpc_task_setup { #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */ #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */ #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */ +#define RPC_TASK_NO_CONG 0x8000 /* skip congestion control */ #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN) @@ -137,6 +138,7 @@ struct rpc_task_setup { #define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN) #define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT) +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG) #define RPC_TASK_RUNNING 0 #define RPC_TASK_QUEUED 1 diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index b530a28..a477ee6 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta { struct rpc_rqst *req = task->tk_rqstp; + if (RPC_SKIP_CONG(task)) { + req->rq_cong = 0; + return 1; + } if (req->rq_cong) return 1; dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n", diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index 3a5a805..073fecd 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void *calldata) data = xprt_get(xprt); null_task = rpc_call_null_helper(task->tk_client, xprt, NULL, - RPC_TASK_SOFTPING | 
RPC_TASK_ASYNC, + RPC_TASK_SOFTPING | RPC_TASK_ASYNC | + RPC_TASK_NO_CONG, &rpcrdma_keepalive_call_ops, data); if (!IS_ERR(null_task)) rpc_put_task(null_task); diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 81cd31a..d9b5fa7 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -136,19 +136,20 @@ static void rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) { - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf); struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; + __be32 *p = rep->rr_rdmabuf->rg_base; u32 credits; if (rep->rr_len < RPCRDMA_HDRLEN_ERR) return; - credits = be32_to_cpu(rmsgp->rm_credit); + credits = be32_to_cpup(p + 2); + if (credits > buffer->rb_max_requests) + credits = buffer->rb_max_requests; + /* Reserve one credit for keepalive ping */ + credits--; if (credits == 0) credits = 1; /* don't deadlock */ - else if (credits > buffer->rb_max_requests) - credits = buffer->rb_max_requests; - atomic_set(&buffer->rb_credits, credits); } @@ -915,6 +916,8 @@ struct rpcrdma_rep * struct rpcrdma_buffer *buf = &r_xprt->rx_buf; int i, rc; + if (r_xprt->rx_data.max_requests < 2) + return -EINVAL; buf->rb_max_requests = r_xprt->rx_data.max_requests; buf->rb_bc_srv_max_requests = 0; atomic_set(&buf->rb_credits, 1);
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <20170208220116.7152.87626.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org> @ 2017-02-09 0:05 ` Trond Myklebust [not found] ` <1486598713.11028.3.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2017-02-09 0:05 UTC (permalink / raw) To: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org, chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote: > Allow RPC-over-RDMA to send NULL pings even when the transport has > hit its credit limit. One RPC-over-RDMA credit is reserved for > operations like keepalive. > > For transports that convey NFSv4, it seems like lease renewal would > also be a candidate for using a priority transport slot. I'd like to > see a mechanism better than RPCRDMA_PRIORITY that can ensure only > one priority operation is in use at a time. 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > include/linux/sunrpc/sched.h | 2 ++ > net/sunrpc/xprt.c | 4 ++++ > net/sunrpc/xprtrdma/transport.c | 3 ++- > net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- > 4 files changed, 16 insertions(+), 6 deletions(-) > > diff --git a/include/linux/sunrpc/sched.h > b/include/linux/sunrpc/sched.h > index 13822e6..fcea158 100644 > --- a/include/linux/sunrpc/sched.h > +++ b/include/linux/sunrpc/sched.h > @@ -127,6 +127,7 @@ struct rpc_task_setup { > #define RPC_TASK_TIMEOUT 0x1000 /* fail with > ETIMEDOUT on timeout */ > #define RPC_TASK_NOCONNECT 0x2000 /* return > ENOTCONN if not connected */ > #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* > wait forever for a reply */ > +#define RPC_TASK_NO_CONG 0x8000 /* skip > congestion control */ > > #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN) > > @@ -137,6 +138,7 @@ struct rpc_task_setup { > #define RPC_IS_SOFT(t) ((t)->tk_flags & > (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) > #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & > RPC_TASK_SOFTCONN) > #define RPC_WAS_SENT(t) ((t)->tk_flags & > RPC_TASK_SENT) > +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG) > > #define RPC_TASK_RUNNING 0 > #define RPC_TASK_QUEUED 1 > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > index b530a28..a477ee6 100644 > --- a/net/sunrpc/xprt.c > +++ b/net/sunrpc/xprt.c > @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct > rpc_xprt *xprt, struct rpc_task *ta > { > struct rpc_rqst *req = task->tk_rqstp; > > + if (RPC_SKIP_CONG(task)) { > + req->rq_cong = 0; > + return 1; > + } Why not just have the RDMA layer call xprt_reserve_xprt() (and xprt_release_xprt()) if this flag is set? It seems to me that you will need some kind of extra congestion control in the RDMA layer anyway since you only have one reserved credit for these privileged tasks (or did I miss where that is being gated?). 
> if (req->rq_cong) > return 1; > dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = > %lu\n", > diff --git a/net/sunrpc/xprtrdma/transport.c > b/net/sunrpc/xprtrdma/transport.c > index 3a5a805..073fecd 100644 > --- a/net/sunrpc/xprtrdma/transport.c > +++ b/net/sunrpc/xprtrdma/transport.c > @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void > *calldata) > > data = xprt_get(xprt); > null_task = rpc_call_null_helper(task->tk_client, xprt, > NULL, > - RPC_TASK_SOFTPING | > RPC_TASK_ASYNC, > + RPC_TASK_SOFTPING | > RPC_TASK_ASYNC | > + RPC_TASK_NO_CONG, > &rpcrdma_keepalive_call_ops > , data); > if (!IS_ERR(null_task)) > rpc_put_task(null_task); > diff --git a/net/sunrpc/xprtrdma/verbs.c > b/net/sunrpc/xprtrdma/verbs.c > index 81cd31a..d9b5fa7 100644 > --- a/net/sunrpc/xprtrdma/verbs.c > +++ b/net/sunrpc/xprtrdma/verbs.c > @@ -136,19 +136,20 @@ > static void > rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) > { > - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf); > struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; > + __be32 *p = rep->rr_rdmabuf->rg_base; > u32 credits; > > if (rep->rr_len < RPCRDMA_HDRLEN_ERR) > return; > > - credits = be32_to_cpu(rmsgp->rm_credit); > + credits = be32_to_cpup(p + 2); > + if (credits > buffer->rb_max_requests) > + credits = buffer->rb_max_requests; > + /* Reserve one credit for keepalive ping */ > + credits--; > if (credits == 0) > credits = 1; /* don't deadlock */ > - else if (credits > buffer->rb_max_requests) > - credits = buffer->rb_max_requests; > - > atomic_set(&buffer->rb_credits, credits); > } > > @@ -915,6 +916,8 @@ struct rpcrdma_rep * > struct rpcrdma_buffer *buf = &r_xprt->rx_buf; > int i, rc; > > + if (r_xprt->rx_data.max_requests < 2) > + return -EINVAL; > buf->rb_max_requests = r_xprt->rx_data.max_requests; > buf->rb_bc_srv_max_requests = 0; > atomic_set(&buf->rb_credits, 1); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > in > the body of a message 
to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Trond Myklebust Principal System Architect 4300 El Camino Real | Suite 100 Los Altos, CA 94022 W: 650-422-3800 C: 801-921-4583 www.primarydata.com ^ permalink raw reply [flat|nested] 21+ messages in thread
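[The credit-update hunk quoted in this message changes the order of the clamp and the zero check. The following stand-alone userspace model (hypothetical names, not the kernel code) captures the new arithmetic: clamp the server-advertised credits to the local maximum, subtract the one credit reserved for keepalive, and floor at 1 so ordinary RPCs cannot deadlock.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-alone model of rpcrdma_update_granted_credits()
 * as modified by the patch. The real function reads the advertised
 * value out of the RPC-over-RDMA header and stores the result with
 * atomic_set(); only the arithmetic is modeled here. Callers are
 * assumed to pass a non-zero advertised value. */
static uint32_t update_granted_credits(uint32_t advertised,
				       uint32_t max_requests)
{
	uint32_t credits = advertised;

	if (credits > max_requests)
		credits = max_requests;
	credits--;		/* reserve one credit for keepalive ping */
	if (credits == 0)
		credits = 1;	/* don't deadlock */
	return credits;
}
```

[Note that, as the diff shows, the decrement now happens after the clamp, so a server granting exactly the local maximum still loses one credit to the keepalive reservation.]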
[parent not found: <1486598713.11028.3.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <1486598713.11028.3.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> @ 2017-02-09 0:19 ` Chuck Lever [not found] ` <9D6B8B44-9C23-427C-9E06-7C92302EB04D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Chuck Lever @ 2017-02-09 0:19 UTC (permalink / raw) To: Trond Myklebust Cc: Anna Schumaker, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux NFS Mailing List > On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> wrote: > > On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote: >> Allow RPC-over-RDMA to send NULL pings even when the transport has >> hit its credit limit. One RPC-over-RDMA credit is reserved for >> operations like keepalive. >> >> For transports that convey NFSv4, it seems like lease renewal would >> also be a candidate for using a priority transport slot. I'd like to >> see a mechanism better than RPCRDMA_PRIORITY that can ensure only >> one priority operation is in use at a time. 
>> >> Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> >> --- >> include/linux/sunrpc/sched.h | 2 ++ >> net/sunrpc/xprt.c | 4 ++++ >> net/sunrpc/xprtrdma/transport.c | 3 ++- >> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- >> 4 files changed, 16 insertions(+), 6 deletions(-) >> >> diff --git a/include/linux/sunrpc/sched.h >> b/include/linux/sunrpc/sched.h >> index 13822e6..fcea158 100644 >> --- a/include/linux/sunrpc/sched.h >> +++ b/include/linux/sunrpc/sched.h >> @@ -127,6 +127,7 @@ struct rpc_task_setup { >> #define RPC_TASK_TIMEOUT 0x1000 /* fail with >> ETIMEDOUT on timeout */ >> #define RPC_TASK_NOCONNECT 0x2000 /* return >> ENOTCONN if not connected */ >> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* >> wait forever for a reply */ >> +#define RPC_TASK_NO_CONG 0x8000 /* skip >> congestion control */ >> >> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN) >> >> @@ -137,6 +138,7 @@ struct rpc_task_setup { >> #define RPC_IS_SOFT(t) ((t)->tk_flags & >> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) >> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & >> RPC_TASK_SOFTCONN) >> #define RPC_WAS_SENT(t) ((t)->tk_flags & >> RPC_TASK_SENT) >> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG) >> >> #define RPC_TASK_RUNNING 0 >> #define RPC_TASK_QUEUED 1 >> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c >> index b530a28..a477ee6 100644 >> --- a/net/sunrpc/xprt.c >> +++ b/net/sunrpc/xprt.c >> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct >> rpc_xprt *xprt, struct rpc_task *ta >> { >> struct rpc_rqst *req = task->tk_rqstp; >> >> + if (RPC_SKIP_CONG(task)) { >> + req->rq_cong = 0; >> + return 1; >> + } > > Why not just have the RDMA layer call xprt_reserve_xprt() (and > xprt_release_xprt()) if this flag is set? 
It seems to me that you will > need some kind of extra congestion control in the RDMA layer anyway > since you only have one reserved credit for these privileged tasks (or > did I miss where that is being gated?). Thanks for the review. See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not terribly happy with. So, I think you are suggesting replacing xprtrdma's ->reserve_xprt with something like: int xprt_rdma_reserve_xprt(xprt, task) { if (RPC_SKIP_CONG(task)) return xprt_reserve_xprt(xprt, task); return xprt_reserve_xprt_cong(xprt, task); } and likewise for ->release_xprt ? What I'd really like to do is have the RPC layer prevent more than one RPC at a time from using the extra credit, and somehow ensure that those RPCs are going to be short-lived (SOFT | SOFTCONN, maybe). >> if (req->rq_cong) >> return 1; >> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = >> %lu\n", >> diff --git a/net/sunrpc/xprtrdma/transport.c >> b/net/sunrpc/xprtrdma/transport.c >> index 3a5a805..073fecd 100644 >> --- a/net/sunrpc/xprtrdma/transport.c >> +++ b/net/sunrpc/xprtrdma/transport.c >> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void >> *calldata) >> >> data = xprt_get(xprt); >> null_task = rpc_call_null_helper(task->tk_client, xprt, >> NULL, >> - RPC_TASK_SOFTPING | >> RPC_TASK_ASYNC, >> + RPC_TASK_SOFTPING | >> RPC_TASK_ASYNC | >> + RPC_TASK_NO_CONG, >> &rpcrdma_keepalive_call_ops >> , data); >> if (!IS_ERR(null_task)) >> rpc_put_task(null_task); >> diff --git a/net/sunrpc/xprtrdma/verbs.c >> b/net/sunrpc/xprtrdma/verbs.c >> index 81cd31a..d9b5fa7 100644 >> --- a/net/sunrpc/xprtrdma/verbs.c >> +++ b/net/sunrpc/xprtrdma/verbs.c >> @@ -136,19 +136,20 @@ >> static void >> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) >> { >> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf); >> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; >> + __be32 *p = rep->rr_rdmabuf->rg_base; >> u32 credits; >> >> if (rep->rr_len < RPCRDMA_HDRLEN_ERR) >> 
return; >> >> - credits = be32_to_cpu(rmsgp->rm_credit); >> + credits = be32_to_cpup(p + 2); >> + if (credits > buffer->rb_max_requests) >> + credits = buffer->rb_max_requests; >> + /* Reserve one credit for keepalive ping */ >> + credits--; >> if (credits == 0) >> credits = 1; /* don't deadlock */ >> - else if (credits > buffer->rb_max_requests) >> - credits = buffer->rb_max_requests; >> - >> atomic_set(&buffer->rb_credits, credits); >> } >> >> @@ -915,6 +916,8 @@ struct rpcrdma_rep * >> struct rpcrdma_buffer *buf = &r_xprt->rx_buf; >> int i, rc; >> >> + if (r_xprt->rx_data.max_requests < 2) >> + return -EINVAL; >> buf->rb_max_requests = r_xprt->rx_data.max_requests; >> buf->rb_bc_srv_max_requests = 0; >> atomic_set(&buf->rb_credits, 1); >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" >> in >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > > > > > > > Trond Myklebust > Principal System Architect > 4300 El Camino Real | Suite 100 > Los Altos, CA 94022 > W: 650-422-3800 > C: 801-921-4583 > www.primarydata.com -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <9D6B8B44-9C23-427C-9E06-7C92302EB04D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <9D6B8B44-9C23-427C-9E06-7C92302EB04D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> @ 2017-02-09 0:48 ` Trond Myklebust [not found] ` <1486601331.11028.5.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2017-02-09 0:48 UTC (permalink / raw) To: chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org Cc: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote: > > On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy@primarydata.co > > m> wrote: > > > > On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote: > > > Allow RPC-over-RDMA to send NULL pings even when the transport > > > has > > > hit its credit limit. One RPC-over-RDMA credit is reserved for > > > operations like keepalive. > > > > > > For transports that convey NFSv4, it seems like lease renewal > > > would > > > also be a candidate for using a priority transport slot. I'd like > > > to > > > see a mechanism better than RPCRDMA_PRIORITY that can ensure only > > > one priority operation is in use at a time. 
> > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > > > --- > > > include/linux/sunrpc/sched.h | 2 ++ > > > net/sunrpc/xprt.c | 4 ++++ > > > net/sunrpc/xprtrdma/transport.c | 3 ++- > > > net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- > > > 4 files changed, 16 insertions(+), 6 deletions(-) > > > > > > diff --git a/include/linux/sunrpc/sched.h > > > b/include/linux/sunrpc/sched.h > > > index 13822e6..fcea158 100644 > > > --- a/include/linux/sunrpc/sched.h > > > +++ b/include/linux/sunrpc/sched.h > > > @@ -127,6 +127,7 @@ struct rpc_task_setup { > > > #define RPC_TASK_TIMEOUT 0x1000 /* fail > > > with > > > ETIMEDOUT on timeout */ > > > #define RPC_TASK_NOCONNECT 0x2000 /* > > > return > > > ENOTCONN if not connected */ > > > #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 > > > /* > > > wait forever for a reply */ > > > +#define RPC_TASK_NO_CONG 0x8000 /* skip > > > congestion control */ > > > > > > #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | > > > RPC_TASK_SOFTCONN) > > > > > > @@ -137,6 +138,7 @@ struct rpc_task_setup { > > > #define RPC_IS_SOFT(t) ((t)->tk_flags & > > > (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) > > > #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & > > > RPC_TASK_SOFTCONN) > > > #define RPC_WAS_SENT(t) ((t)->tk_flags & > > > RPC_TASK_SENT) > > > +#define RPC_SKIP_CONG(t) ((t)->tk_flags & > > > RPC_TASK_NO_CONG) > > > > > > #define RPC_TASK_RUNNING 0 > > > #define RPC_TASK_QUEUED 1 > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > > > index b530a28..a477ee6 100644 > > > --- a/net/sunrpc/xprt.c > > > +++ b/net/sunrpc/xprt.c > > > @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct > > > rpc_xprt *xprt, struct rpc_task *ta > > > { > > > struct rpc_rqst *req = task->tk_rqstp; > > > > > > + if (RPC_SKIP_CONG(task)) { > > > + req->rq_cong = 0; > > > + return 1; > > > + } > > > > Why not just have the RDMA layer call xprt_reserve_xprt() (and > > xprt_release_xprt()) if this flag is set? 
It seems to me that you > > will > > need some kind of extra congestion control in the RDMA layer anyway > > since you only have one reserved credit for these privileged tasks > > (or > > did I miss where that is being gated?). > > Thanks for the review. > > See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not > terribly happy with. > > So, I think you are suggesting replacing xprtrdma's > ->reserve_xprt with something like: > > int xprt_rdma_reserve_xprt(xprt, task) > { > if (RPC_SKIP_CONG(task)) > return xprt_reserve_xprt(xprt, task); > return xprt_reserve_xprt_cong(xprt, task); > } > > and likewise for ->release_xprt ? Right. > What I'd really like to do is have the RPC layer > prevent more than one RPC at a time from using the > extra credit, and somehow ensure that those RPCs > are going to be short-lived (SOFT | SOFTCONN, > maybe). Credits are a transport layer thing, though. There is no equivalent in the non-RDMA world. TCP and UDP should normally both be fine with transmitting an extra RPC call. Even timeouts are a transport layer issue; see the patches I put out this morning in order to reduce the TCP connection timeouts and put them more in line with the lease period. Something like that makes no sense in the UDP world (no connections), or even in AF_LOCAL (no routing), which is why I added the set_connection_timeout() callback. 
> > > > if (req->rq_cong) > > > return 1; > > > dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = > > > %lu\n", > > > diff --git a/net/sunrpc/xprtrdma/transport.c > > > b/net/sunrpc/xprtrdma/transport.c > > > index 3a5a805..073fecd 100644 > > > --- a/net/sunrpc/xprtrdma/transport.c > > > +++ b/net/sunrpc/xprtrdma/transport.c > > > @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void > > > *calldata) > > > > > > data = xprt_get(xprt); > > > null_task = rpc_call_null_helper(task->tk_client, xprt, > > > NULL, > > > - RPC_TASK_SOFTPING | > > > RPC_TASK_ASYNC, > > > + RPC_TASK_SOFTPING | > > > RPC_TASK_ASYNC | > > > + RPC_TASK_NO_CONG, > > > &rpcrdma_keepalive_call > > > _ops > > > , data); > > > if (!IS_ERR(null_task)) > > > rpc_put_task(null_task); > > > diff --git a/net/sunrpc/xprtrdma/verbs.c > > > b/net/sunrpc/xprtrdma/verbs.c > > > index 81cd31a..d9b5fa7 100644 > > > --- a/net/sunrpc/xprtrdma/verbs.c > > > +++ b/net/sunrpc/xprtrdma/verbs.c > > > @@ -136,19 +136,20 @@ > > > static void > > > rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) > > > { > > > - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep- > > > >rr_rdmabuf); > > > struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; > > > + __be32 *p = rep->rr_rdmabuf->rg_base; > > > u32 credits; > > > > > > if (rep->rr_len < RPCRDMA_HDRLEN_ERR) > > > return; > > > > > > - credits = be32_to_cpu(rmsgp->rm_credit); > > > + credits = be32_to_cpup(p + 2); > > > + if (credits > buffer->rb_max_requests) > > > + credits = buffer->rb_max_requests; > > > + /* Reserve one credit for keepalive ping */ > > > + credits--; > > > if (credits == 0) > > > credits = 1; /* don't deadlock */ > > > - else if (credits > buffer->rb_max_requests) > > > - credits = buffer->rb_max_requests; > > > - > > > atomic_set(&buffer->rb_credits, credits); > > > } > > > > > > @@ -915,6 +916,8 @@ struct rpcrdma_rep * > > > struct rpcrdma_buffer *buf = &r_xprt->rx_buf; > > > int i, rc; > > > > > > + if (r_xprt->rx_data.max_requests < 
2) > > > + return -EINVAL; > > > buf->rb_max_requests = r_xprt->rx_data.max_requests; > > > buf->rb_bc_srv_max_requests = 0; > > > atomic_set(&buf->rb_credits, 1); > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux- > > > nfs" > > > in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.htm > > > l > > > > > > > -- > > > > > > > > > > > > > > Trond Myklebust > > Principal System Architect > > 4300 El Camino Real | Suite 100 > > Los Altos, CA 94022 > > W: 650-422-3800 > > C: 801-921-4583 > > www.primarydata.com > > -- > Chuck Lever > > > -- Trond Myklebust Principal System Architect 4300 El Camino Real | Suite 100 Los Altos, CA 94022 W: 650-422-3800 C: 801-921-4583 www.primarydata.com ^ permalink raw reply [flat|nested] 21+ messages in thread
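[The gating Trond asks about in this message (with only one reserved credit, at most one privileged NO_CONG task should hold it at a time) could be expressed as a try-acquire counter. This is a hypothetical sketch of that idea, not code from the series; the kernel equivalent would need atomics or locking.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a single reserved credit that priority (NO_CONG) tasks
 * must acquire before sending, so only one is in flight at a time. */
struct rsvd_credit { int available; };

static bool rsvd_credit_get(struct rsvd_credit *c)
{
	if (c->available == 0)
		return false;	/* another priority task holds the credit */
	c->available--;
	return true;
}

static void rsvd_credit_put(struct rsvd_credit *c)
{
	c->available++;
}
```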
[parent not found: <1486601331.11028.5.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <1486601331.11028.5.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> @ 2017-02-09 15:37 ` Chuck Lever [not found] ` <2AFD96A3-8D49-4E2E-B1F1-9F5C46D0C9C8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Chuck Lever @ 2017-02-09 15:37 UTC (permalink / raw) To: Trond Myklebust Cc: Anna Schumaker, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux NFS Mailing List > On Feb 8, 2017, at 7:48 PM, Trond Myklebust <trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> wrote: > > On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote: >>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy-7I+n7zu2hfthDmVcRT8U2w@public.gmane.org >>> m> wrote: >>> >>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote: >>>> Allow RPC-over-RDMA to send NULL pings even when the transport >>>> has >>>> hit its credit limit. One RPC-over-RDMA credit is reserved for >>>> operations like keepalive. >>>> >>>> For transports that convey NFSv4, it seems like lease renewal >>>> would >>>> also be a candidate for using a priority transport slot. I'd like >>>> to >>>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only >>>> one priority operation is in use at a time. 
>>>> >>>> Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> >>>> --- >>>> include/linux/sunrpc/sched.h | 2 ++ >>>> net/sunrpc/xprt.c | 4 ++++ >>>> net/sunrpc/xprtrdma/transport.c | 3 ++- >>>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- >>>> 4 files changed, 16 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/include/linux/sunrpc/sched.h >>>> b/include/linux/sunrpc/sched.h >>>> index 13822e6..fcea158 100644 >>>> --- a/include/linux/sunrpc/sched.h >>>> +++ b/include/linux/sunrpc/sched.h >>>> @@ -127,6 +127,7 @@ struct rpc_task_setup { >>>> #define RPC_TASK_TIMEOUT 0x1000 /* fail >>>> with >>>> ETIMEDOUT on timeout */ >>>> #define RPC_TASK_NOCONNECT 0x2000 /* >>>> return >>>> ENOTCONN if not connected */ >>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 >>>> /* >>>> wait forever for a reply */ >>>> +#define RPC_TASK_NO_CONG 0x8000 /* skip >>>> congestion control */ >>>> >>>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | >>>> RPC_TASK_SOFTCONN) >>>> >>>> @@ -137,6 +138,7 @@ struct rpc_task_setup { >>>> #define RPC_IS_SOFT(t) ((t)->tk_flags & >>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) >>>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & >>>> RPC_TASK_SOFTCONN) >>>> #define RPC_WAS_SENT(t) ((t)->tk_flags & >>>> RPC_TASK_SENT) >>>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & >>>> RPC_TASK_NO_CONG) >>>> >>>> #define RPC_TASK_RUNNING 0 >>>> #define RPC_TASK_QUEUED 1 >>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c >>>> index b530a28..a477ee6 100644 >>>> --- a/net/sunrpc/xprt.c >>>> +++ b/net/sunrpc/xprt.c >>>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct >>>> rpc_xprt *xprt, struct rpc_task *ta >>>> { >>>> struct rpc_rqst *req = task->tk_rqstp; >>>> >>>> + if (RPC_SKIP_CONG(task)) { >>>> + req->rq_cong = 0; >>>> + return 1; >>>> + } >>> >>> Why not just have the RDMA layer call xprt_reserve_xprt() (and >>> xprt_release_xprt()) if this flag is set? 
It seems to me that you >>> will >>> need some kind of extra congestion control in the RDMA layer anyway >>> since you only have one reserved credit for these privileged tasks >>> (or >>> did I miss where that is being gated?). >> >> Thanks for the review. >> >> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not >> terribly happy with. >> >> So, I think you are suggesting replacing xprtrdma's >> ->reserve_xprt with something like: >> >> int xprt_rdma_reserve_xprt(xprt, task) >> { >> if (RPC_SKIP_CONG(task)) >> return xprt_reserve_xprt(xprt, task); >> return xprt_reserve_xprt_cong(xprt, task); >> } >> >> and likewise for ->release_xprt ? > > Right. > >> What I'd really like to do is have the RPC layer >> prevent more than one RPC at a time from using the >> extra credit, and somehow ensure that those RPCs >> are going to be short-lived (SOFT | SOFTCONN, >> maybe). > > Credits are a transport layer thing, though. There is no equivalent in > the non-RDMA world. TCP and UDP should normally both be fine with > transmitting an extra RPC call. xprtrdma maps credits to the xprt->cwnd, which UDP also uses. Agree though, there probably isn't a need for temporarily superceding the UDP connection window. > Even timeouts are a transport layer issue; see the patches I put out > this morning in order to reduce the TCP connection timeouts and put > them more in line with the lease period. Something like that makes no > sense in the UDP world (no connections), or even in AF_LOCAL (no > routing), which is why I added the set_connection_timeout() callback. I browsed those a couple times, wondering if connection-oriented RPC-over-RDMA also needs a set_connection_timeout method. Still studying. 
>>>> if (req->rq_cong) >>>> return 1; >>>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = >>>> %lu\n", >>>> diff --git a/net/sunrpc/xprtrdma/transport.c >>>> b/net/sunrpc/xprtrdma/transport.c >>>> index 3a5a805..073fecd 100644 >>>> --- a/net/sunrpc/xprtrdma/transport.c >>>> +++ b/net/sunrpc/xprtrdma/transport.c >>>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void >>>> *calldata) >>>> >>>> data = xprt_get(xprt); >>>> null_task = rpc_call_null_helper(task->tk_client, xprt, >>>> NULL, >>>> - RPC_TASK_SOFTPING | >>>> RPC_TASK_ASYNC, >>>> + RPC_TASK_SOFTPING | >>>> RPC_TASK_ASYNC | >>>> + RPC_TASK_NO_CONG, >>>> &rpcrdma_keepalive_call >>>> _ops >>>> , data); >>>> if (!IS_ERR(null_task)) >>>> rpc_put_task(null_task); >>>> diff --git a/net/sunrpc/xprtrdma/verbs.c >>>> b/net/sunrpc/xprtrdma/verbs.c >>>> index 81cd31a..d9b5fa7 100644 >>>> --- a/net/sunrpc/xprtrdma/verbs.c >>>> +++ b/net/sunrpc/xprtrdma/verbs.c >>>> @@ -136,19 +136,20 @@ >>>> static void >>>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) >>>> { >>>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep- >>>>> rr_rdmabuf); >>>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; >>>> + __be32 *p = rep->rr_rdmabuf->rg_base; >>>> u32 credits; >>>> >>>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR) >>>> return; >>>> >>>> - credits = be32_to_cpu(rmsgp->rm_credit); >>>> + credits = be32_to_cpup(p + 2); >>>> + if (credits > buffer->rb_max_requests) >>>> + credits = buffer->rb_max_requests; >>>> + /* Reserve one credit for keepalive ping */ >>>> + credits--; >>>> if (credits == 0) >>>> credits = 1; /* don't deadlock */ >>>> - else if (credits > buffer->rb_max_requests) >>>> - credits = buffer->rb_max_requests; >>>> - >>>> atomic_set(&buffer->rb_credits, credits); >>>> } >>>> >>>> @@ -915,6 +916,8 @@ struct rpcrdma_rep * >>>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf; >>>> int i, rc; >>>> >>>> + if (r_xprt->rx_data.max_requests < 2) >>>> + return -EINVAL; >>>> buf->rb_max_requests = 
r_xprt->rx_data.max_requests; >>>> buf->rb_bc_srv_max_requests = 0; >>>> atomic_set(&buf->rb_credits, 1); >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux- >>>> nfs" >>>> in >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.htm >>>> l >>>> >>> >>> -- >>> >>> >>> >>> >>> >>> >>> Trond Myklebust >>> Principal System Architect >>> 4300 El Camino Real | Suite 100 >>> Los Altos, CA 94022 >>> W: 650-422-3800 >>> C: 801-921-4583 >>> www.primarydata.com >> >> -- >> Chuck Lever >> >> >> > -- > > > > > > > > Trond Myklebust > Principal System Architect > 4300 El Camino Real | Suite 100 > Los Altos, CA 94022 > W: 650-422-3800 > C: 801-921-4583 > www.primarydata.com > > -- > Chuck Lever > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <2AFD96A3-8D49-4E2E-B1F1-9F5C46D0C9C8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport [not found] ` <2AFD96A3-8D49-4E2E-B1F1-9F5C46D0C9C8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> @ 2017-02-09 19:42 ` Chuck Lever [not found] ` <4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Chuck Lever @ 2017-02-09 19:42 UTC (permalink / raw) To: Trond Myklebust Cc: Anna Schumaker, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux NFS Mailing List > On Feb 9, 2017, at 10:37 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > >> >> On Feb 8, 2017, at 7:48 PM, Trond Myklebust <trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> wrote: >> >> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote: >>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy-7I+n7zu2hfthDmVcRT8U2w@public.gmane.org >>>> m> wrote: >>>> >>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote: >>>>> Allow RPC-over-RDMA to send NULL pings even when the transport >>>>> has >>>>> hit its credit limit. One RPC-over-RDMA credit is reserved for >>>>> operations like keepalive. >>>>> >>>>> For transports that convey NFSv4, it seems like lease renewal >>>>> would >>>>> also be a candidate for using a priority transport slot. I'd like >>>>> to >>>>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only >>>>> one priority operation is in use at a time. 
>>>>> >>>>> Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> >>>>> --- >>>>> include/linux/sunrpc/sched.h | 2 ++ >>>>> net/sunrpc/xprt.c | 4 ++++ >>>>> net/sunrpc/xprtrdma/transport.c | 3 ++- >>>>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++----- >>>>> 4 files changed, 16 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/include/linux/sunrpc/sched.h >>>>> b/include/linux/sunrpc/sched.h >>>>> index 13822e6..fcea158 100644 >>>>> --- a/include/linux/sunrpc/sched.h >>>>> +++ b/include/linux/sunrpc/sched.h >>>>> @@ -127,6 +127,7 @@ struct rpc_task_setup { >>>>> #define RPC_TASK_TIMEOUT 0x1000 /* fail >>>>> with >>>>> ETIMEDOUT on timeout */ >>>>> #define RPC_TASK_NOCONNECT 0x2000 /* >>>>> return >>>>> ENOTCONN if not connected */ >>>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 >>>>> /* >>>>> wait forever for a reply */ >>>>> +#define RPC_TASK_NO_CONG 0x8000 /* skip >>>>> congestion control */ >>>>> >>>>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | >>>>> RPC_TASK_SOFTCONN) >>>>> >>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup { >>>>> #define RPC_IS_SOFT(t) ((t)->tk_flags & >>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT)) >>>>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & >>>>> RPC_TASK_SOFTCONN) >>>>> #define RPC_WAS_SENT(t) ((t)->tk_flags & >>>>> RPC_TASK_SENT) >>>>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & >>>>> RPC_TASK_NO_CONG) >>>>> >>>>> #define RPC_TASK_RUNNING 0 >>>>> #define RPC_TASK_QUEUED 1 >>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c >>>>> index b530a28..a477ee6 100644 >>>>> --- a/net/sunrpc/xprt.c >>>>> +++ b/net/sunrpc/xprt.c >>>>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct >>>>> rpc_xprt *xprt, struct rpc_task *ta >>>>> { >>>>> struct rpc_rqst *req = task->tk_rqstp; >>>>> >>>>> + if (RPC_SKIP_CONG(task)) { >>>>> + req->rq_cong = 0; >>>>> + return 1; >>>>> + } >>>> >>>> Why not just have the RDMA layer call xprt_reserve_xprt() (and >>>> xprt_release_xprt()) if this flag is set? 
It seems to me that you >>>> will >>>> need some kind of extra congestion control in the RDMA layer anyway >>>> since you only have one reserved credit for these privileged tasks >>>> (or >>>> did I miss where that is being gated?). >>> >>> Thanks for the review. >>> >>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not >>> terribly happy with. >>> >>> So, I think you are suggesting replacing xprtrdma's >>> ->reserve_xprt with something like: >>> >>> int xprt_rdma_reserve_xprt(xprt, task) >>> { >>> if (RPC_SKIP_CONG(task)) >>> return xprt_reserve_xprt(xprt, task); >>> return xprt_reserve_xprt_cong(xprt, task); >>> } >>> >>> and likewise for ->release_xprt ? >> >> Right. This seems to work fine for the normal cases. I'm confused about how to construct xprt_rdma_release_xprt() so it never releases a normal RPC task when a SKIP_CONG task completes and the credit limit is still full. If it should send a normal task using the reserved credit and that task hangs too, we're in exactly the position we wanted to avoid. My original solution might have had a similar problem, come to think of it. >>> What I'd really like to do is have the RPC layer >>> prevent more than one RPC at a time from using the >>> extra credit, and somehow ensure that those RPCs >>> are going to be short-lived (SOFT | SOFTCONN, >>> maybe). >> >> Credits are a transport layer thing, though. There is no equivalent in >> the non-RDMA world. TCP and UDP should normally both be fine with >> transmitting an extra RPC call. > > xprtrdma maps credits to the xprt->cwnd, which UDP also uses. > Agree though, there probably isn't a need for temporarily > superceding the UDP connection window. > > >> Even timeouts are a transport layer issue; see the patches I put out >> this morning in order to reduce the TCP connection timeouts and put >> them more in line with the lease period. 
Something like that makes no >> sense in the UDP world (no connections), or even in AF_LOCAL (no >> routing), which is why I added the set_connection_timeout() callback. > > I browsed those a couple times, wondering if connection-oriented > RPC-over-RDMA also needs a set_connection_timeout method. Still > studying. > > >>>>> if (req->rq_cong) >>>>> return 1; >>>>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = >>>>> %lu\n", >>>>> diff --git a/net/sunrpc/xprtrdma/transport.c >>>>> b/net/sunrpc/xprtrdma/transport.c >>>>> index 3a5a805..073fecd 100644 >>>>> --- a/net/sunrpc/xprtrdma/transport.c >>>>> +++ b/net/sunrpc/xprtrdma/transport.c >>>>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void >>>>> *calldata) >>>>> >>>>> data = xprt_get(xprt); >>>>> null_task = rpc_call_null_helper(task->tk_client, xprt, >>>>> NULL, >>>>> - RPC_TASK_SOFTPING | >>>>> RPC_TASK_ASYNC, >>>>> + RPC_TASK_SOFTPING | >>>>> RPC_TASK_ASYNC | >>>>> + RPC_TASK_NO_CONG, >>>>> &rpcrdma_keepalive_call >>>>> _ops >>>>> , data); >>>>> if (!IS_ERR(null_task)) >>>>> rpc_put_task(null_task); >>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c >>>>> b/net/sunrpc/xprtrdma/verbs.c >>>>> index 81cd31a..d9b5fa7 100644 >>>>> --- a/net/sunrpc/xprtrdma/verbs.c >>>>> +++ b/net/sunrpc/xprtrdma/verbs.c >>>>> @@ -136,19 +136,20 @@ >>>>> static void >>>>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep) >>>>> { >>>>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep- >>>>>> rr_rdmabuf); >>>>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf; >>>>> + __be32 *p = rep->rr_rdmabuf->rg_base; >>>>> u32 credits; >>>>> >>>>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR) >>>>> return; >>>>> >>>>> - credits = be32_to_cpu(rmsgp->rm_credit); >>>>> + credits = be32_to_cpup(p + 2); >>>>> + if (credits > buffer->rb_max_requests) >>>>> + credits = buffer->rb_max_requests; >>>>> + /* Reserve one credit for keepalive ping */ >>>>> + credits--; >>>>> if (credits == 0) >>>>> credits = 1; /* don't deadlock */ >>>>> - 
else if (credits > buffer->rb_max_requests) >>>>> - credits = buffer->rb_max_requests; >>>>> - >>>>> atomic_set(&buffer->rb_credits, credits); >>>>> } >>>>> >>>>> @@ -915,6 +916,8 @@ struct rpcrdma_rep * >>>>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf; >>>>> int i, rc; >>>>> >>>>> + if (r_xprt->rx_data.max_requests < 2) >>>>> + return -EINVAL; >>>>> buf->rb_max_requests = r_xprt->rx_data.max_requests; >>>>> buf->rb_bc_srv_max_requests = 0; >>>>> atomic_set(&buf->rb_credits, 1); >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux- >>>>> nfs" >>>>> in >>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.htm >>>>> l >>>>> >>>> >>>> -- >>>> >>>> >>>> >>>> >>>> >>>> >>>> Trond Myklebust >>>> Principal System Architect >>>> 4300 El Camino Real | Suite 100 >>>> Los Altos, CA 94022 >>>> W: 650-422-3800 >>>> C: 801-921-4583 >>>> www.primarydata.com >>> >>> -- >>> Chuck Lever >>> >>> >>> >> -- >> >> >> >> >> >> >> >> Trond Myklebust >> Principal System Architect >> 4300 El Camino Real | Suite 100 >> Los Altos, CA 94022 >> W: 650-422-3800 >> C: 801-921-4583 >> www.primarydata.com > > -- > Chuck Lever > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport
       [not found] ` <4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-02-09 20:13   ` Trond Myklebust
       [not found]     ` <1486671236.5570.4.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Trond Myklebust @ 2017-02-09 20:13 UTC (permalink / raw)
  To: chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org
  Cc: anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Thu, 2017-02-09 at 14:42 -0500, Chuck Lever wrote:
> > On Feb 9, 2017, at 10:37 AM, Chuck Lever <chuck.lever@oracle.com>
> > wrote:
> >
> > > On Feb 8, 2017, at 7:48 PM, Trond Myklebust
> > > <trondmy@primarydata.com> wrote:
> > >
> > > On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
> > > > > On Feb 8, 2017, at 7:05 PM, Trond Myklebust
> > > > > <trondmy@primarydata.com> wrote:
> > > > >
> > > > > On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
> > > > > > Allow RPC-over-RDMA to send NULL pings even when the
> > > > > > transport has hit its credit limit. One RPC-over-RDMA
> > > > > > credit is reserved for operations like keepalive.
> > > > > >
> > > > > > For transports that convey NFSv4, it seems like lease
> > > > > > renewal would also be a candidate for using a priority
> > > > > > transport slot. I'd like to see a mechanism better than
> > > > > > RPCRDMA_PRIORITY that can ensure only one priority
> > > > > > operation is in use at a time.
> > > > > >
> > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > ---
> > > > > >  include/linux/sunrpc/sched.h    |    2 ++
> > > > > >  net/sunrpc/xprt.c               |    4 ++++
> > > > > >  net/sunrpc/xprtrdma/transport.c |    3 ++-
> > > > > >  net/sunrpc/xprtrdma/verbs.c     |   13 ++++++++-----
> > > > > >  4 files changed, 16 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> > > > > > index 13822e6..fcea158 100644
> > > > > > --- a/include/linux/sunrpc/sched.h
> > > > > > +++ b/include/linux/sunrpc/sched.h
> > > > > > @@ -127,6 +127,7 @@ struct rpc_task_setup {
> > > > > >  #define RPC_TASK_TIMEOUT	0x1000		/* fail with ETIMEDOUT on timeout */
> > > > > >  #define RPC_TASK_NOCONNECT	0x2000		/* return ENOTCONN if not connected */
> > > > > >  #define RPC_TASK_NO_RETRANS_TIMEOUT	0x4000		/* wait forever for a reply */
> > > > > > +#define RPC_TASK_NO_CONG	0x8000		/* skip congestion control */
> > > > > >
> > > > > >  #define RPC_TASK_SOFTPING	(RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
> > > > > >
> > > > > > @@ -137,6 +138,7 @@ struct rpc_task_setup {
> > > > > >  #define RPC_IS_SOFT(t)		((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
> > > > > >  #define RPC_IS_SOFTCONN(t)	((t)->tk_flags & RPC_TASK_SOFTCONN)
> > > > > >  #define RPC_WAS_SENT(t)	((t)->tk_flags & RPC_TASK_SENT)
> > > > > > +#define RPC_SKIP_CONG(t)	((t)->tk_flags & RPC_TASK_NO_CONG)
> > > > > >
> > > > > >  #define RPC_TASK_RUNNING	0
> > > > > >  #define RPC_TASK_QUEUED	1
> > > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > > index b530a28..a477ee6 100644
> > > > > > --- a/net/sunrpc/xprt.c
> > > > > > +++ b/net/sunrpc/xprt.c
> > > > > > @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
> > > > > >  {
> > > > > >  	struct rpc_rqst *req = task->tk_rqstp;
> > > > > >
> > > > > > +	if (RPC_SKIP_CONG(task)) {
> > > > > > +		req->rq_cong = 0;
> > > > > > +		return 1;
> > > > > > +	}
> > > > >
> > > > > Why not just have the RDMA layer call xprt_reserve_xprt() (and
> > > > > xprt_release_xprt()) if this flag is set? It seems to me that
> > > > > you will need some kind of extra congestion control in the RDMA
> > > > > layer anyway, since you only have one reserved credit for these
> > > > > privileged tasks (or did I miss where that is being gated?).
> > > >
> > > > Thanks for the review.
> > > >
> > > > See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
> > > > terribly happy with.
> > > >
> > > > So, I think you are suggesting replacing xprtrdma's
> > > > ->reserve_xprt with something like:
> > > >
> > > > int xprt_rdma_reserve_xprt(xprt, task)
> > > > {
> > > > 	if (RPC_SKIP_CONG(task))
> > > > 		return xprt_reserve_xprt(xprt, task);
> > > > 	return xprt_reserve_xprt_cong(xprt, task);
> > > > }
> > > >
> > > > and likewise for ->release_xprt ?
> > >
> > > Right.
>
> This seems to work fine for the normal cases.
>
> I'm confused about how to construct xprt_rdma_release_xprt()
> so it never releases a normal RPC task when a SKIP_CONG
> task completes and the credit limit is still full.
>
> If it should send a normal task using the reserved credit
> and that task hangs too, we're in exactly the position
> we wanted to avoid.
>
> My original solution might have had a similar problem,
> come to think of it.

That's true... You may need to set up a separate waitqueue that is
reserved for SKIP_CONG tasks. Again, it makes sense to keep that in the
RDMA code.

--
Trond Myklebust
Principal System Architect
4300 El Camino Real | Suite 100
Los Altos, CA 94022
W: 650-422-3800
C: 801-921-4583
www.primarydata.com

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <1486671236.5570.4.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>]
* Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport
       [not found]     ` <1486671236.5570.4.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
@ 2017-02-09 20:39       ` Chuck Lever
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2017-02-09 20:39 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux NFS Mailing List

> On Feb 9, 2017, at 3:13 PM, Trond Myklebust
> <trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org> wrote:
>
> On Thu, 2017-02-09 at 14:42 -0500, Chuck Lever wrote:
>>> On Feb 9, 2017, at 10:37 AM, Chuck Lever
>>> <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>>> On Feb 8, 2017, at 7:48 PM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>
>>>> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
>>>>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust
>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>
>>>>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>>>>>>> Allow RPC-over-RDMA to send NULL pings even when the transport
>>>>>>> has hit its credit limit. One RPC-over-RDMA credit is reserved
>>>>>>> for operations like keepalive.
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> +	if (RPC_SKIP_CONG(task)) {
>>>>>>> +		req->rq_cong = 0;
>>>>>>> +		return 1;
>>>>>>> +	}
>>>>>>
>>>>>> Why not just have the RDMA layer call xprt_reserve_xprt() (and
>>>>>> xprt_release_xprt()) if this flag is set? It seems to me that you
>>>>>> will need some kind of extra congestion control in the RDMA layer
>>>>>> anyway, since you only have one reserved credit for these
>>>>>> privileged tasks (or did I miss where that is being gated?).
>>>>>
>>>>> Thanks for the review.
>>>>>
>>>>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
>>>>> terribly happy with.
>>>>>
>>>>> So, I think you are suggesting replacing xprtrdma's
>>>>> ->reserve_xprt with something like:
>>>>>
>>>>> int xprt_rdma_reserve_xprt(xprt, task)
>>>>> {
>>>>> 	if (RPC_SKIP_CONG(task))
>>>>> 		return xprt_reserve_xprt(xprt, task);
>>>>> 	return xprt_reserve_xprt_cong(xprt, task);
>>>>> }
>>>>>
>>>>> and likewise for ->release_xprt ?
>>>>
>>>> Right.
>>
>> This seems to work fine for the normal cases.
>>
>> I'm confused about how to construct xprt_rdma_release_xprt()
>> so it never releases a normal RPC task when a SKIP_CONG
>> task completes and the credit limit is still full.
>>
>> If it should send a normal task using the reserved credit
>> and that task hangs too, we're in exactly the position
>> we wanted to avoid.
>>
>> My original solution might have had a similar problem,
>> come to think of it.
>
> That's true... You may need to set up a separate waitqueue that is
> reserved for SKIP_CONG tasks. Again, it makes sense to keep that in the
> RDMA code.

Understood. At this late date, probably the best thing to do is
drop the keepalive patches for the v4.11 merge window. That would
be 10/12, 11/12, and 12/12.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2017-02-09 20:39 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-08 21:59 [PATCH v3 00/12] NFS/RDMA client-side patches for 4.11 Chuck Lever
[not found] ` <20170208214854.7152.83331.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2017-02-08 21:59 ` [PATCH v3 01/12] xprtrdma: Fix Read chunk padding Chuck Lever
2017-02-08 21:59 ` [PATCH v3 02/12] xprtrdma: Per-connection pad optimization Chuck Lever
2017-02-08 22:00 ` [PATCH v3 03/12] xprtrdma: Disable pad optimization by default Chuck Lever
2017-02-08 22:00 ` [PATCH v3 04/12] xprtrdma: Reduce required number of send SGEs Chuck Lever
2017-02-08 22:00 ` [PATCH v3 05/12] xprtrdma: Shrink send SGEs array Chuck Lever
2017-02-08 22:00 ` [PATCH v3 06/12] xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Chuck Lever
2017-02-08 22:00 ` [PATCH v3 07/12] xprtrdma: Handle stale connection rejection Chuck Lever
2017-02-08 22:00 ` [PATCH v3 08/12] xprtrdma: Refactor management of mw_list field Chuck Lever
2017-02-08 22:00 ` [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep Chuck Lever
[not found] ` <20170208220051.7152.67740.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2017-02-08 23:48 ` Trond Myklebust
2017-02-08 22:00 ` [PATCH v3 10/12] sunrpc: Enable calls to rpc_call_null_helper() from other modules Chuck Lever
2017-02-08 22:01 ` [PATCH v3 11/12] xprtrdma: Detect unreachable NFS/RDMA servers more reliably Chuck Lever
2017-02-08 22:01 ` [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport Chuck Lever
[not found] ` <20170208220116.7152.87626.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2017-02-09 0:05 ` Trond Myklebust
[not found] ` <1486598713.11028.3.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
2017-02-09 0:19 ` Chuck Lever
[not found] ` <9D6B8B44-9C23-427C-9E06-7C92302EB04D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-02-09 0:48 ` Trond Myklebust
[not found] ` <1486601331.11028.5.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
2017-02-09 15:37 ` Chuck Lever
[not found] ` <2AFD96A3-8D49-4E2E-B1F1-9F5C46D0C9C8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-02-09 19:42 ` Chuck Lever
[not found] ` <4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-02-09 20:13 ` Trond Myklebust
[not found] ` <1486671236.5570.4.camel-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
2017-02-09 20:39 ` Chuck Lever