From: dai.ngo@oracle.com
To: Jeff Layton <jlayton@kernel.org>, chuck.lever@oracle.com
Cc: linux-nfs@vger.kernel.org, linux-nfs@stwm.de
Subject: Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck
Date: Fri, 15 Dec 2023 12:18:59 -0800 [thread overview]
Message-ID: <eb949f89-6133-4922-9015-0dced4ee5c06@oracle.com> (raw)
In-Reply-To: <c9e8608d0cb2562c645d1f31043fcbc90cf53d52.camel@kernel.org>
On 12/15/23 11:54 AM, Jeff Layton wrote:
> On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote:
>> If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will
>> also stuck waiting for the callback request to be executed. This causes
>> the client to hang waiting for the reply of the GETATTR and also causes
>> the reboot of the NFS server to hang due to the pending NFS request.
>>
>> Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds
>> time out.
>>
>> Reported-by: Wolfgang Walter <linux-nfs@stwm.de>
>> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>> fs/nfsd/nfs4state.c | 6 +++++-
>> fs/nfsd/state.h | 2 ++
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 175f3e9f5822..0cc7d4953807 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
>> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
>> return;
>>
>> + /* set to proper status when nfsd4_cb_getattr_done runs */
>> + ncf->ncf_cb_status = NFS4ERR_IO;
>> +
>> refcount_inc(&dp->dl_stid.sc_count);
>> if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
>> refcount_dec(&dp->dl_stid.sc_count);
>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
>> nfs4_cb_getattr(&dp->dl_cb_fattr);
>> spin_unlock(&ctx->flc_lock);
>>
>> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
>> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
>> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
> The RPC won't necessarily have timed out at this point, and it looks
> like ncf_cb_status won't have been set to anything (and is probably
> still 0?).
The timeout was added to handle the case where the callback request
did not get queued to the workqueue; nfsd4_run_cb fails. In this case
RPC is not involved and we don't want to hang here. Note that this patch
sets ncf_cb_status to NFS4ERR_IO before calling nfsd4_run_cb so we can
detect this error condition.
>
> Don't you need to check whether the wait timed out or was successful?
ncf_cb_status is set to tk_status by nfsd4_cb_getattr_done. If the request
was successful then ncf_cb_status is 0.
> What happens now when this times out?
Then we go through the normal logic of nfsd_open_break_lease which will
also get timed out but eventually the lease, delegation state, will be
removed by __break_lease after 45 secs (lease_break_time).
-Dai
>
>
>> if (ncf->ncf_cb_status) {
>> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
>> if (status != nfserr_jukebox ||
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index f96eaa8e9413..94563a6813a6 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -135,6 +135,8 @@ struct nfs4_cb_fattr {
>> /* bits for ncf_cb_flags */
>> #define CB_GETATTR_BUSY 0
>>
>> +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */
>> +
> Why 20s?
RPC will time out after 9 secs if it does not receive a callback reply.
This time out value needs to be greater than 9 secs. I just be generous
here, we can reduce it to any value > 9 secs.
-Dai
>
>> /*
>> * Represents a delegation stateid. The nfs4_client holds references to these
>> * and they are put when it is being destroyed or when the delegation is
>
next prev parent reply other threads:[~2023-12-15 20:21 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo
2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo
2023-12-15 19:37 ` Jeff Layton
2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo
2023-12-15 19:42 ` Jeff Layton
2023-12-15 20:00 ` dai.ngo
2023-12-15 20:15 ` Jeff Layton
2023-12-15 20:22 ` dai.ngo
2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo
2023-12-15 19:54 ` Chuck Lever
2023-12-15 20:40 ` dai.ngo
2023-12-15 21:41 ` Chuck Lever
2023-12-15 21:55 ` dai.ngo
2023-12-16 1:21 ` Chuck Lever
2023-12-16 3:18 ` dai.ngo
2023-12-16 3:57 ` Chuck Lever
2023-12-16 22:44 ` dai.ngo
2023-12-18 16:02 ` Chuck Lever
2023-12-18 18:17 ` dai.ngo
2023-12-18 19:10 ` Chuck Lever
2023-12-18 20:27 ` dai.ngo
2023-12-15 19:54 ` Jeff Layton
2023-12-15 20:18 ` dai.ngo [this message]
2023-12-15 20:25 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=eb949f89-6133-4922-9015-0dced4ee5c06@oracle.com \
--to=dai.ngo@oracle.com \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@stwm.de \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox