linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport
Date: Thu, 9 Feb 2017 14:42:42 -0500	[thread overview]
Message-ID: <4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F@oracle.com> (raw)
In-Reply-To: <2AFD96A3-8D49-4E2E-B1F1-9F5C46D0C9C8@oracle.com>


> On Feb 9, 2017, at 10:37 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Feb 8, 2017, at 7:48 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>> 
>> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
>>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy@primarydata.co
>>>> m> wrote:
>>>> 
>>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>>>>> Allow RPC-over-RDMA to send NULL pings even when the transport
>>>>> has
>>>>> hit its credit limit. One RPC-over-RDMA credit is reserved for
>>>>> operations like keepalive.
>>>>> 
>>>>> For transports that convey NFSv4, it seems like lease renewal
>>>>> would
>>>>> also be a candidate for using a priority transport slot. I'd like
>>>>> to
>>>>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
>>>>> one priority operation is in use at a time.
>>>>> 
>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>> ---
>>>>> include/linux/sunrpc/sched.h    |    2 ++
>>>>> net/sunrpc/xprt.c               |    4 ++++
>>>>> net/sunrpc/xprtrdma/transport.c |    3 ++-
>>>>> net/sunrpc/xprtrdma/verbs.c     |   13 ++++++++-----
>>>>> 4 files changed, 16 insertions(+), 6 deletions(-)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/sched.h
>>>>> b/include/linux/sunrpc/sched.h
>>>>> index 13822e6..fcea158 100644
>>>>> --- a/include/linux/sunrpc/sched.h
>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>>>>> #define RPC_TASK_TIMEOUT	0x1000		/* fail
>>>>> with
>>>>> ETIMEDOUT on timeout */
>>>>> #define RPC_TASK_NOCONNECT	0x2000		/*
>>>>> return
>>>>> ENOTCONN if not connected */
>>>>> #define RPC_TASK_NO_RETRANS_TIMEOUT	0x4000		
>>>>> /*
>>>>> wait forever for a reply */
>>>>> +#define RPC_TASK_NO_CONG	0x8000		/* skip
>>>>> congestion control */
>>>>> 
>>>>> #define RPC_TASK_SOFTPING	(RPC_TASK_SOFT |
>>>>> RPC_TASK_SOFTCONN)
>>>>> 
>>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>>>>> #define RPC_IS_SOFT(t)		((t)->tk_flags &
>>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>>>>> #define RPC_IS_SOFTCONN(t)	((t)->tk_flags &
>>>>> RPC_TASK_SOFTCONN)
>>>>> #define RPC_WAS_SENT(t)		((t)->tk_flags &
>>>>> RPC_TASK_SENT)
>>>>> +#define RPC_SKIP_CONG(t)	((t)->tk_flags &
>>>>> RPC_TASK_NO_CONG)
>>>>> 
>>>>> #define RPC_TASK_RUNNING	0
>>>>> #define RPC_TASK_QUEUED		1
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index b530a28..a477ee6 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct
>>>>> rpc_xprt *xprt, struct rpc_task *ta
>>>>> {
>>>>> 	struct rpc_rqst *req = task->tk_rqstp;
>>>>> 
>>>>> +	if (RPC_SKIP_CONG(task)) {
>>>>> +		req->rq_cong = 0;
>>>>> +		return 1;
>>>>> +	}
>>>> 
>>>> Why not just have the RDMA layer call xprt_reserve_xprt() (and
>>>> xprt_release_xprt()) if this flag is set? It seems to me that you
>>>> will
>>>> need some kind of extra congestion control in the RDMA layer anyway
>>>> since you only have one reserved credit for these privileged tasks
>>>> (or
>>>> did I miss where that is being gated?).
>>> 
>>> Thanks for the review.
>>> 
>>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
>>> terribly happy with.
>>> 
>>> So, I think you are suggesting replacing xprtrdma's
>>> ->reserve_xprt with something like:
>>> 
>>> int xprt_rdma_reserve_xprt(xprt, task)
>>> {
>>>      if (RPC_SKIP_CONG(task))
>>>           return xprt_reserve_xprt(xprt, task);
>>>      return xprt_reserve_xprt_cong(xprt, task);
>>> }
>>> 
>>> and likewise for ->release_xprt ?
>> 
>> Right.

This seems to work fine for the normal cases.

I'm confused about how to construct xprt_rdma_release_xprt()
so it never releases a normal RPC task when a SKIP_CONG
task completes and the credit limit is still full.

If it should send a normal task using the reserved credit
and that task hangs too, we're in exactly the position
we wanted to avoid.

My original solution might have had a similar problem,
come to think of it.


>>> What I'd really like to do is have the RPC layer
>>> prevent more than one RPC at a time from using the
>>> extra credit, and somehow ensure that those RPCs
>>> are going to be short-lived (SOFT | SOFTCONN,
>>> maybe).
>> 
>> Credits are a transport layer thing, though. There is no equivalent in
>> the non-RDMA world. TCP and UDP should normally both be fine with
>> transmitting an extra RPC call.
> 
> xprtrdma maps credits to the xprt->cwnd, which UDP also uses.
> Agree though, there probably isn't a need for temporarily
> superceding the UDP connection window.
> 
> 
>> Even timeouts are a transport layer issue; see the patches I put out
>> this morning in order to reduce the TCP connection timeouts and put
>> them more in line with the lease period. Something like that makes no
>> sense in the UDP world (no connections), or even in AF_LOCAL (no
>> routing), which is why I added the set_connection_timeout() callback.
> 
> I browsed those a couple times, wondering if connection-oriented
> RPC-over-RDMA also needs a set_connection_timeout method. Still
> studying.
> 
> 
>>>>> 	if (req->rq_cong)
>>>>> 		return 1;
>>>>> 	dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd =
>>>>> %lu\n",
>>>>> diff --git a/net/sunrpc/xprtrdma/transport.c
>>>>> b/net/sunrpc/xprtrdma/transport.c
>>>>> index 3a5a805..073fecd 100644
>>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void
>>>>> *calldata)
>>>>> 
>>>>> 	data = xprt_get(xprt);
>>>>> 	null_task = rpc_call_null_helper(task->tk_client, xprt,
>>>>> NULL,
>>>>> -					 RPC_TASK_SOFTPING |
>>>>> RPC_TASK_ASYNC,
>>>>> +					 RPC_TASK_SOFTPING |
>>>>> RPC_TASK_ASYNC |
>>>>> +					 RPC_TASK_NO_CONG,
>>>>> 					 &rpcrdma_keepalive_call
>>>>> _ops
>>>>> , data);
>>>>> 	if (!IS_ERR(null_task))
>>>>> 		rpc_put_task(null_task);
>>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>>> b/net/sunrpc/xprtrdma/verbs.c
>>>>> index 81cd31a..d9b5fa7 100644
>>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>>> @@ -136,19 +136,20 @@
>>>>> static void
>>>>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
>>>>> {
>>>>> -	struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep-
>>>>>> rr_rdmabuf);
>>>>> 	struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
>>>>> +	__be32 *p = rep->rr_rdmabuf->rg_base;
>>>>> 	u32 credits;
>>>>> 
>>>>> 	if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
>>>>> 		return;
>>>>> 
>>>>> -	credits = be32_to_cpu(rmsgp->rm_credit);
>>>>> +	credits = be32_to_cpup(p + 2);
>>>>> +	if (credits > buffer->rb_max_requests)
>>>>> +		credits = buffer->rb_max_requests;
>>>>> +	/* Reserve one credit for keepalive ping */
>>>>> +	credits--;
>>>>> 	if (credits == 0)
>>>>> 		credits = 1;	/* don't deadlock */
>>>>> -	else if (credits > buffer->rb_max_requests)
>>>>> -		credits = buffer->rb_max_requests;
>>>>> -
>>>>> 	atomic_set(&buffer->rb_credits, credits);
>>>>> }
>>>>> 
>>>>> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
>>>>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>>> 	int i, rc;
>>>>> 
>>>>> +	if (r_xprt->rx_data.max_requests < 2)
>>>>> +		return -EINVAL;
>>>>> 	buf->rb_max_requests = r_xprt->rx_data.max_requests;
>>>>> 	buf->rb_bc_srv_max_requests = 0;
>>>>> 	atomic_set(&buf->rb_credits, 1);
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-
>>>>> nfs"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.htm
>>>>> l
>>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> 
>>>> 	
>>>> 	
>>>> 
>>>> 
>>>> Trond Myklebust
>>>> Principal System Architect
>>>> 4300 El Camino Real | Suite 100
>>>> Los Altos, CA  94022
>>>> W: 650-422-3800
>>>> C: 801-921-4583 
>>>> www.primarydata.com
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
>>> 
>> -- 
>> 
>> 
>> 
>> 	
>> 	
>> 
>> 
>> Trond Myklebust
>> Principal System Architect
>> 4300 El Camino Real | Suite 100
>> Los Altos, CA  94022
>> W: 650-422-3800
>> C: 801-921-4583 
>> www.primarydata.com
>> 
>> 
>> 
>> 
>> N���r�y���b�X�ǧv�^�)޺{.n�+�{#<^_NSEDR_^#<ٚ�{ay�\x1dʇڙ�,j\a�f�h��z�\x1e�w�\f�j:+v�w�j�m�\a���zZ+��ݢj"�!
> 
> --
> Chuck Lever
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




  reply	other threads:[~2017-02-09 19:47 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-08 21:59 [PATCH v3 00/12] NFS/RDMA client-side patches for 4.11 Chuck Lever
2017-02-08 21:59 ` [PATCH v3 01/12] xprtrdma: Fix Read chunk padding Chuck Lever
2017-02-08 21:59 ` [PATCH v3 02/12] xprtrdma: Per-connection pad optimization Chuck Lever
2017-02-08 22:00 ` [PATCH v3 03/12] xprtrdma: Disable pad optimization by default Chuck Lever
2017-02-08 22:00 ` [PATCH v3 04/12] xprtrdma: Reduce required number of send SGEs Chuck Lever
2017-02-08 22:00 ` [PATCH v3 05/12] xprtrdma: Shrink send SGEs array Chuck Lever
2017-02-08 22:00 ` [PATCH v3 06/12] xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Chuck Lever
2017-02-08 22:00 ` [PATCH v3 07/12] xprtrdma: Handle stale connection rejection Chuck Lever
2017-02-08 22:00 ` [PATCH v3 08/12] xprtrdma: Refactor management of mw_list field Chuck Lever
2017-02-08 22:00 ` [PATCH v3 09/12] sunrpc: Allow xprt->ops->timer method to sleep Chuck Lever
2017-02-08 23:48   ` Trond Myklebust
2017-02-08 22:00 ` [PATCH v3 10/12] sunrpc: Enable calls to rpc_call_null_helper() from other modules Chuck Lever
2017-02-08 22:01 ` [PATCH v3 11/12] xprtrdma: Detect unreachable NFS/RDMA servers more reliably Chuck Lever
2017-02-08 22:01 ` [PATCH v3 12/12] sunrpc: Allow keepalive ping on a credit-full transport Chuck Lever
2017-02-09  0:05   ` Trond Myklebust
2017-02-09  0:19     ` Chuck Lever
2017-02-09  0:48       ` Trond Myklebust
2017-02-09 15:37         ` Chuck Lever
2017-02-09 19:42           ` Chuck Lever [this message]
2017-02-09 20:13             ` Trond Myklebust
2017-02-09 20:39               ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E4245D4-8F9C-4CF3-8B2D-E4528B9E791F@oracle.com \
    --to=chuck.lever@oracle.com \
    --cc=anna.schumaker@netapp.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=trondmy@primarydata.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).