From: dai.ngo@oracle.com
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
Helen Chao <helen.chao@oracle.com>
Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
Date: Tue, 14 Mar 2023 09:19:30 -0700 [thread overview]
Message-ID: <ed870a33-0809-3cfd-2d5a-b97409568b97@oracle.com> (raw)
In-Reply-To: <182842b2-3de4-d64b-d729-f4f6c9c576d6@oracle.com>
On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:
> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>
>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>
>>> Currently call_bind_status places a hard limit of 3 to the number of
>>> retries on EACCES error. This limit was done to accommodate the
>>> behavior
>>> of a buggy server that keeps returning garbage when the NLM daemon is
>>> killed on the NFS server. However this change causes problem for other
>>> servers that take a little longer than 9 seconds for the port mapper to
>>> become ready when the NFS server is restarted.
>>>
>>> This patch removes this hard coded limit and let the RPC handles
>>> the retry according to whether the export is soft or hard mounted.
>>>
>>> To avoid the hang with buggy server, the client can use soft mount for
>>> the export.
>>>
>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>> Reported-by: Helen Chao <helen.chao@oracle.com>
>>> Tested-by: Helen Chao <helen.chao@oracle.com>
>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> Helen is the royal queen of ^C ;-)
>>
>> Did you try ^C on a mount while it waits for a rebind?
>
> She uses a test script that restarts the NFS server while NLM lock test
> is running. The failure is random, sometimes it fails and sometimes it
> passes depending on when the LOCK/UNLOCK requests come in so I think
> it's hard to time it to do the ^C, but I will ask.
We did the test with ^C and here is what we found.
For synchronous RPC task the signal was delivered to the RPC task and
the task exit with -ERESTARTSYS from __rpc_execute as expected.
For asynchronous RPC task the process that invokes the RPC task to send
the request detected the signal in rpc_wait_for_completion_task and exits
with -ERESTARTSYS. However the async RPC was allowed to continue to run
to completion. So if the async RPC task was retrying an operation and
the NFS server was down, it will retry forever if this is a hard mount
or until the NFS server comes back up.
The question for the list is should we propagate the signal to the async
task via rpc_signal_task to stop its execution or just leave it alone as is.
-Dai
>
> Thanks,
> -Dai
>
>>
>>
>>> ---
>>> include/linux/sunrpc/sched.h | 3 +--
>>> net/sunrpc/clnt.c | 3 ---
>>> net/sunrpc/sched.c | 1 -
>>> 3 files changed, 1 insertion(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/sunrpc/sched.h
>>> b/include/linux/sunrpc/sched.h
>>> index b8ca3ecaf8d7..8ada7dc802d3 100644
>>> --- a/include/linux/sunrpc/sched.h
>>> +++ b/include/linux/sunrpc/sched.h
>>> @@ -90,8 +90,7 @@ struct rpc_task {
>>> #endif
>>> unsigned char tk_priority : 2,/* Task priority */
>>> tk_garb_retry : 2,
>>> - tk_cred_retry : 2,
>>> - tk_rebind_retry : 2;
>>> + tk_cred_retry : 2;
>>> };
>>>
>>> typedef void (*rpc_action)(struct rpc_task *);
>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>>> index 0b0b9f1eed46..63b438d8564b 100644
>>> --- a/net/sunrpc/clnt.c
>>> +++ b/net/sunrpc/clnt.c
>>> @@ -2050,9 +2050,6 @@ call_bind_status(struct rpc_task *task)
>>> status = -EOPNOTSUPP;
>>> break;
>>> }
>>> - if (task->tk_rebind_retry == 0)
>>> - break;
>>> - task->tk_rebind_retry--;
>>> rpc_delay(task, 3*HZ);
>>> goto retry_timeout;
>>> case -ENOBUFS:
>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>> index be587a308e05..c8321de341ee 100644
>>> --- a/net/sunrpc/sched.c
>>> +++ b/net/sunrpc/sched.c
>>> @@ -817,7 +817,6 @@ rpc_init_task_statistics(struct rpc_task *task)
>>> /* Initialize retry counters */
>>> task->tk_garb_retry = 2;
>>> task->tk_cred_retry = 2;
>>> - task->tk_rebind_retry = 2;
>>>
>>> /* starting timestamp */
>>> task->tk_start = ktime_get();
>>> --
>>> 2.9.5
>>>
>> --
>> Chuck Lever
>>
>>
next prev parent reply other threads:[~2023-03-14 17:47 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-08 18:45 [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status Dai Ngo
2023-03-08 18:50 ` Chuck Lever III
2023-03-08 19:03 ` dai.ngo
2023-03-14 16:19 ` dai.ngo [this message]
2023-03-16 13:38 ` Fwd: " Chuck Lever III
2023-03-27 16:05 ` dai.ngo
2023-04-06 17:33 ` Jeff Layton
2023-04-06 18:10 ` Jeff Layton
2023-04-06 19:36 ` dai.ngo
2023-04-06 19:43 ` Chuck Lever III
2023-04-06 19:59 ` Jeff Layton
2023-04-06 20:58 ` dai.ngo
2023-04-06 21:51 ` Jeff Layton
2023-04-06 20:58 ` dai.ngo
-- strict thread matches above, loose matches on Subject: below --
2023-04-18 20:19 Dai Ngo
2023-04-19 10:06 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ed870a33-0809-3cfd-2d5a-b97409568b97@oracle.com \
--to=dai.ngo@oracle.com \
--cc=chuck.lever@oracle.com \
--cc=helen.chao@oracle.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox