* [PATCH 0/3] Bug fixes for NFSD callback
@ 2023-12-15 19:15 Dai Ngo
2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Dai Ngo @ 2023-12-15 19:15 UTC (permalink / raw)
To: chuck.lever, jlayton; +Cc: linux-nfs, linux-nfs
Here are some fixes for NFSD callback including one that addresses
the problem of server hang on reboot with 6.6.3 reported recently
by Wolfgang Walter <linux-nfs@stwm.de>:
0001: SUNRPC: remove printk when back channel request not found
0002: NFSD: restore delegation's sc_count if nfsd4_run_cb fails
0003: NFSD: Fix server reboot hang problem when callback workqueue
---
fs/nfsd/nfs4state.c | 23 ++++++++++++++++++-----
fs/nfsd/state.h | 2 ++
net/sunrpc/svcsock.c | 8 +-------
3 files changed, 21 insertions(+), 12 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread* [PATCH 1/3] SUNRPC: remove printk when back channel request not found 2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo @ 2023-12-15 19:15 ` Dai Ngo 2023-12-15 19:37 ` Jeff Layton 2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo 2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo 2 siblings, 1 reply; 24+ messages in thread From: Dai Ngo @ 2023-12-15 19:15 UTC (permalink / raw) To: chuck.lever, jlayton; +Cc: linux-nfs, linux-nfs If the client interface is down, or there is a network partition between the client and server, that prevents the callback request to reach the client TCP on the server will keep re-transmitting the callback for about ~9 minutes before giving up and closes the connection. If the connection between the client and the server is re-established before the connection is closed and after the callback timed out (9 secs) then the re-transmitted callback request will arrive at the client. When the server receives the reply of the callback, receive_cb_reply prints the "Got unrecognized reply..." message in the system log since the callback request was already removed from the server xprt's recv_queue. Even though this scenario has no effect on the server operation, a malicious client can take advantage of this behavior and send thousand of callback replies with random XIDs to fill up the server's system log. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> --- net/sunrpc/svcsock.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 998687421fa6..3e89dc0afbef 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -1060,7 +1060,7 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp) spin_lock(&bc_xprt->queue_lock); req = xprt_lookup_rqst(bc_xprt, xid); if (!req) - goto unlock_notfound; + goto unlock_eagain; memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf)); /* @@ -1077,12 +1077,6 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp) rqstp->rq_arg.len = 0; spin_unlock(&bc_xprt->queue_lock); return 0; -unlock_notfound: - printk(KERN_NOTICE - "%s: Got unrecognized reply: " - "calldir 0x%x xpt_bc_xprt %p xid %08x\n", - __func__, ntohl(calldir), - bc_xprt, ntohl(xid)); unlock_eagain: spin_unlock(&bc_xprt->queue_lock); return -EAGAIN; -- 2.39.3 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] SUNRPC: remove printk when back channel request not found 2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo @ 2023-12-15 19:37 ` Jeff Layton 0 siblings, 0 replies; 24+ messages in thread From: Jeff Layton @ 2023-12-15 19:37 UTC (permalink / raw) To: Dai Ngo, chuck.lever; +Cc: linux-nfs, linux-nfs On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: > If the client interface is down, or there is a network partition between > the client and server, that prevents the callback request to reach the > client TCP on the server will keep re-transmitting the callback for about > ~9 minutes before giving up and closes the connection. > > If the connection between the client and the server is re-established > before the connection is closed and after the callback timed out (9 secs) > then the re-transmitted callback request will arrive at the client. When > the server receives the reply of the callback, receive_cb_reply prints the > "Got unrecognized reply..." message in the system log since the callback > request was already removed from the server xprt's recv_queue. > > Even though this scenario has no effect on the server operation, a > malicious client can take advantage of this behavior and send thousand > of callback replies with random XIDs to fill up the server's system log. > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > --- > net/sunrpc/svcsock.c | 8 +------- > 1 file changed, 1 insertion(+), 7 deletions(-) > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > index 998687421fa6..3e89dc0afbef 100644 > --- a/net/sunrpc/svcsock.c > +++ b/net/sunrpc/svcsock.c > @@ -1060,7 +1060,7 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp) > spin_lock(&bc_xprt->queue_lock); > req = xprt_lookup_rqst(bc_xprt, xid); > if (!req) > - goto unlock_notfound; > + goto unlock_eagain; > > memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf)); > /* > @@ -1077,12 +1077,6 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp) > rqstp->rq_arg.len = 0; > spin_unlock(&bc_xprt->queue_lock); > return 0; > -unlock_notfound: > - printk(KERN_NOTICE > - "%s: Got unrecognized reply: " > - "calldir 0x%x xpt_bc_xprt %p xid %08x\n", > - __func__, ntohl(calldir), > - bc_xprt, ntohl(xid)); > unlock_eagain: > spin_unlock(&bc_xprt->queue_lock); > return -EAGAIN; Makes sense. It's a cryptic error message for most admins. Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails 2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo 2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo @ 2023-12-15 19:15 ` Dai Ngo 2023-12-15 19:42 ` Jeff Layton 2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo 2 siblings, 1 reply; 24+ messages in thread From: Dai Ngo @ 2023-12-15 19:15 UTC (permalink / raw) To: chuck.lever, jlayton; +Cc: linux-nfs, linux-nfs Under some load conditions the callback work request can not be queued and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count of the delegation state was left with an extra reference count preventing the state to be freed later. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> --- fs/nfsd/nfs4state.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 40415929e2ae..175f3e9f5822 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) return; + refcount_inc(&dp->dl_stid.sc_count); - nfsd4_run_cb(&ncf->ncf_getattr); + if (!nfsd4_run_cb(&ncf->ncf_getattr)) { + refcount_dec(&dp->dl_stid.sc_count); + clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags); + wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY); + WARN_ON_ONCE(1); + } } static struct nfs4_client *create_client(struct xdr_netobj name, @@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) * we know it's safe to take a reference. */ refcount_inc(&dp->dl_stid.sc_count); - WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall)); + if (!nfsd4_run_cb(&dp->dl_recall)) { + refcount_dec(&dp->dl_stid.sc_count); + WARN_ON_ONCE(1); + } } /* Called from break_lease() with flc_lock held. */ @@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, return 0; } break_lease: - spin_unlock(&ctx->flc_lock); nfsd_stats_wdeleg_getattr_inc(); - dp = fl->fl_owner; ncf = &dp->dl_cb_fattr; nfs4_cb_getattr(&dp->dl_cb_fattr); + spin_unlock(&ctx->flc_lock); + wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); if (ncf->ncf_cb_status) { status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); -- 2.39.3 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails 2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo @ 2023-12-15 19:42 ` Jeff Layton 2023-12-15 20:00 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Jeff Layton @ 2023-12-15 19:42 UTC (permalink / raw) To: Dai Ngo, chuck.lever; +Cc: linux-nfs, linux-nfs On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: > Under some load conditions the callback work request can not be queued > and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count > of the delegation state was left with an extra reference count preventing > the state to be freed later. > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > --- > fs/nfsd/nfs4state.c | 17 +++++++++++++---- > 1 file changed, 13 insertions(+), 4 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 40415929e2ae..175f3e9f5822 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > return; > + > refcount_inc(&dp->dl_stid.sc_count); > - nfsd4_run_cb(&ncf->ncf_getattr); > + if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > + refcount_dec(&dp->dl_stid.sc_count); > + clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags); > + wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY); > + WARN_ON_ONCE(1); > + } > } > > static struct nfs4_client *create_client(struct xdr_netobj name, > @@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) > * we know it's safe to take a reference. > */ > refcount_inc(&dp->dl_stid.sc_count); > - WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall)); > + if (!nfsd4_run_cb(&dp->dl_recall)) { > + refcount_dec(&dp->dl_stid.sc_count); > + WARN_ON_ONCE(1); > + } > } > > /* Called from break_lease() with flc_lock held. */ > @@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > return 0; > } > break_lease: > - spin_unlock(&ctx->flc_lock); > nfsd_stats_wdeleg_getattr_inc(); > - > dp = fl->fl_owner; > ncf = &dp->dl_cb_fattr; > nfs4_cb_getattr(&dp->dl_cb_fattr); > + spin_unlock(&ctx->flc_lock); > + The other hunks in this patch make sense, but what's going on here with moving the lock down? Do we really need to hold the spinlock there? If so, I would have expected to see an explanation in the changelog. > wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > if (ncf->ncf_cb_status) { > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails 2023-12-15 19:42 ` Jeff Layton @ 2023-12-15 20:00 ` dai.ngo 2023-12-15 20:15 ` Jeff Layton 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-15 20:00 UTC (permalink / raw) To: Jeff Layton, chuck.lever; +Cc: linux-nfs, linux-nfs On 12/15/23 11:42 AM, Jeff Layton wrote: > On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: >> Under some load conditions the callback work request can not be queued >> and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count >> of the delegation state was left with an extra reference count preventing >> the state to be freed later. >> >> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >> --- >> fs/nfsd/nfs4state.c | 17 +++++++++++++---- >> 1 file changed, 13 insertions(+), 4 deletions(-) >> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >> index 40415929e2ae..175f3e9f5822 100644 >> --- a/fs/nfsd/nfs4state.c >> +++ b/fs/nfsd/nfs4state.c >> @@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >> >> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >> return; >> + >> refcount_inc(&dp->dl_stid.sc_count); >> - nfsd4_run_cb(&ncf->ncf_getattr); >> + if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >> + refcount_dec(&dp->dl_stid.sc_count); >> + clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags); >> + wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY); >> + WARN_ON_ONCE(1); >> + } >> } >> >> static struct nfs4_client *create_client(struct xdr_netobj name, >> @@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) >> * we know it's safe to take a reference. >> */ >> refcount_inc(&dp->dl_stid.sc_count); >> - WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall)); >> + if (!nfsd4_run_cb(&dp->dl_recall)) { >> + refcount_dec(&dp->dl_stid.sc_count); >> + WARN_ON_ONCE(1); >> + } >> } >> >> /* Called from break_lease() with flc_lock held. */ >> @@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >> return 0; >> } >> break_lease: >> - spin_unlock(&ctx->flc_lock); >> nfsd_stats_wdeleg_getattr_inc(); >> - >> dp = fl->fl_owner; >> ncf = &dp->dl_cb_fattr; >> nfs4_cb_getattr(&dp->dl_cb_fattr); >> + spin_unlock(&ctx->flc_lock); >> + > The other hunks in this patch make sense, but what's going on here with > moving the lock down? Do we really need to hold the spinlock there? If > so, I would have expected to see an explanation in the changelog. We need to hold the flc_lock to prevent the lease to be removed which allows the delegation state to be released. We need to do this since we just do the refcount_dec if nfsd4_run_cb fails, instead of doing nfs4_put_stid to free the state if this is the last refcount. This is done to match the logic in nfsd_break_deleg_cb which has an useful comment in nfsd_break_one_deleg. -Dai > >> wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >> if (ncf->ncf_cb_status) { >> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails 2023-12-15 20:00 ` dai.ngo @ 2023-12-15 20:15 ` Jeff Layton 2023-12-15 20:22 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Jeff Layton @ 2023-12-15 20:15 UTC (permalink / raw) To: dai.ngo, chuck.lever; +Cc: linux-nfs, linux-nfs On Fri, 2023-12-15 at 12:00 -0800, dai.ngo@oracle.com wrote: > On 12/15/23 11:42 AM, Jeff Layton wrote: > > On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: > > > Under some load conditions the callback work request can not be queued > > > and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count > > > of the delegation state was left with an extra reference count preventing > > > the state to be freed later. > > > > > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > > > --- > > > fs/nfsd/nfs4state.c | 17 +++++++++++++---- > > > 1 file changed, 13 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 40415929e2ae..175f3e9f5822 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > > > > > > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > > > return; > > > + > > > refcount_inc(&dp->dl_stid.sc_count); > > > - nfsd4_run_cb(&ncf->ncf_getattr); > > > + if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > > > + refcount_dec(&dp->dl_stid.sc_count); > > > + clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags); > > > + wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY); > > > + WARN_ON_ONCE(1); > > > + } > > > } > > > > > > static struct nfs4_client *create_client(struct xdr_netobj name, > > > @@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) > > > * we know it's safe to take a reference. > > > */ > > > refcount_inc(&dp->dl_stid.sc_count); > > > - WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall)); > > > + if (!nfsd4_run_cb(&dp->dl_recall)) { > > > + refcount_dec(&dp->dl_stid.sc_count); > > > + WARN_ON_ONCE(1); > > > + } > > > } > > > > > > /* Called from break_lease() with flc_lock held. */ > > > @@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > return 0; > > > } > > > break_lease: > > > - spin_unlock(&ctx->flc_lock); > > > nfsd_stats_wdeleg_getattr_inc(); > > > - > > > dp = fl->fl_owner; > > > ncf = &dp->dl_cb_fattr; > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > + spin_unlock(&ctx->flc_lock); > > > + > > The other hunks in this patch make sense, but what's going on here with > > moving the lock down? Do we really need to hold the spinlock there? If > > so, I would have expected to see an explanation in the changelog. > > We need to hold the flc_lock to prevent the lease to be removed which > allows the delegation state to be released. We need to do this since > we just do the refcount_dec if nfsd4_run_cb fails, instead of doing > nfs4_put_stid to free the state if this is the last refcount. > > This is done to match the logic in nfsd_break_deleg_cb which has an useful > comment in nfsd_break_one_deleg. > > -Dai > So is this a race today? I think this deserves a mention in the changelog at least, and maybe a Fixes: tag? > > > > > wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > if (ncf->ncf_cb_status) { > > > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails 2023-12-15 20:15 ` Jeff Layton @ 2023-12-15 20:22 ` dai.ngo 0 siblings, 0 replies; 24+ messages in thread From: dai.ngo @ 2023-12-15 20:22 UTC (permalink / raw) To: Jeff Layton, chuck.lever; +Cc: linux-nfs, linux-nfs On 12/15/23 12:15 PM, Jeff Layton wrote: > On Fri, 2023-12-15 at 12:00 -0800, dai.ngo@oracle.com wrote: >> On 12/15/23 11:42 AM, Jeff Layton wrote: >>> On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: >>>> Under some load conditions the callback work request can not be queued >>>> and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count >>>> of the delegation state was left with an extra reference count preventing >>>> the state to be freed later. >>>> >>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >>>> --- >>>> fs/nfsd/nfs4state.c | 17 +++++++++++++---- >>>> 1 file changed, 13 insertions(+), 4 deletions(-) >>>> >>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>> index 40415929e2ae..175f3e9f5822 100644 >>>> --- a/fs/nfsd/nfs4state.c >>>> +++ b/fs/nfsd/nfs4state.c >>>> @@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >>>> >>>> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >>>> return; >>>> + >>>> refcount_inc(&dp->dl_stid.sc_count); >>>> - nfsd4_run_cb(&ncf->ncf_getattr); >>>> + if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >>>> + refcount_dec(&dp->dl_stid.sc_count); >>>> + clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags); >>>> + wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY); >>>> + WARN_ON_ONCE(1); >>>> + } >>>> } >>>> >>>> static struct nfs4_client *create_client(struct xdr_netobj name, >>>> @@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) >>>> * we know it's safe to take a reference. >>>> */ >>>> refcount_inc(&dp->dl_stid.sc_count); >>>> - WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall)); >>>> + if (!nfsd4_run_cb(&dp->dl_recall)) { >>>> + refcount_dec(&dp->dl_stid.sc_count); >>>> + WARN_ON_ONCE(1); >>>> + } >>>> } >>>> >>>> /* Called from break_lease() with flc_lock held. */ >>>> @@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >>>> return 0; >>>> } >>>> break_lease: >>>> - spin_unlock(&ctx->flc_lock); >>>> nfsd_stats_wdeleg_getattr_inc(); >>>> - >>>> dp = fl->fl_owner; >>>> ncf = &dp->dl_cb_fattr; >>>> nfs4_cb_getattr(&dp->dl_cb_fattr); >>>> + spin_unlock(&ctx->flc_lock); >>>> + >>> The other hunks in this patch make sense, but what's going on here with >>> moving the lock down? Do we really need to hold the spinlock there? If >>> so, I would have expected to see an explanation in the changelog. >> We need to hold the flc_lock to prevent the lease to be removed which >> allows the delegation state to be released. We need to do this since >> we just do the refcount_dec if nfsd4_run_cb fails, instead of doing >> nfs4_put_stid to free the state if this is the last refcount. >> >> This is done to match the logic in nfsd_break_deleg_cb which has an useful >> comment in nfsd_break_one_deleg. >> >> -Dai >> > So is this a race today? I think this deserves a mention in the > changelog at least, and maybe a Fixes: tag? I will add some comments in the changelog and add a Fixes tag in v2. Thanks, -Dai > >>>> wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >>>> if (ncf->ncf_cb_status) { >>>> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo 2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo 2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo @ 2023-12-15 19:15 ` Dai Ngo 2023-12-15 19:54 ` Chuck Lever 2023-12-15 19:54 ` Jeff Layton 2 siblings, 2 replies; 24+ messages in thread From: Dai Ngo @ 2023-12-15 19:15 UTC (permalink / raw) To: chuck.lever, jlayton; +Cc: linux-nfs, linux-nfs If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will also stuck waiting for the callback request to be executed. This causes the client to hang waiting for the reply of the GETATTR and also causes the reboot of the NFS server to hang due to the pending NFS request. Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds time out. Reported-by: Wolfgang Walter <linux-nfs@stwm.de> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> --- fs/nfsd/nfs4state.c | 6 +++++- fs/nfsd/state.h | 2 ++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 175f3e9f5822..0cc7d4953807 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) return; + /* set to proper status when nfsd4_cb_getattr_done runs */ + ncf->ncf_cb_status = NFS4ERR_IO; + refcount_inc(&dp->dl_stid.sc_count); if (!nfsd4_run_cb(&ncf->ncf_getattr)) { refcount_dec(&dp->dl_stid.sc_count); @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, nfs4_cb_getattr(&dp->dl_cb_fattr); spin_unlock(&ctx->flc_lock); - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); if (ncf->ncf_cb_status) { status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); if (status != nfserr_jukebox || diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index f96eaa8e9413..94563a6813a6 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { /* bits for ncf_cb_flags */ #define CB_GETATTR_BUSY 0 +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ + /* * Represents a delegation stateid. The nfs4_client holds references to these * and they are put when it is being destroyed or when the delegation is -- 2.39.3 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo @ 2023-12-15 19:54 ` Chuck Lever 2023-12-15 20:40 ` dai.ngo 2023-12-15 19:54 ` Jeff Layton 1 sibling, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-15 19:54 UTC (permalink / raw) To: Dai Ngo; +Cc: jlayton, linux-nfs, linux-nfs On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will > also stuck waiting for the callback request to be executed. This causes > the client to hang waiting for the reply of the GETATTR and also causes > the reboot of the NFS server to hang due to the pending NFS request. > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds > time out. > > Reported-by: Wolfgang Walter <linux-nfs@stwm.de> > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > --- > fs/nfsd/nfs4state.c | 6 +++++- > fs/nfsd/state.h | 2 ++ > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 175f3e9f5822..0cc7d4953807 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > return; > > + /* set to proper status when nfsd4_cb_getattr_done runs */ > + ncf->ncf_cb_status = NFS4ERR_IO; > + > refcount_inc(&dp->dl_stid.sc_count); > if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > refcount_dec(&dp->dl_stid.sc_count); > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > nfs4_cb_getattr(&dp->dl_cb_fattr); > spin_unlock(&ctx->flc_lock); > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); I'm still thinking the timeout here should be the same (or slightly longer than) the RPC retransmit timeout, rather than adding a new NFSD_CB_GETATTR_TIMEOUT macro. > if (ncf->ncf_cb_status) { > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); > if (status != nfserr_jukebox || > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > index f96eaa8e9413..94563a6813a6 100644 > --- a/fs/nfsd/state.h > +++ b/fs/nfsd/state.h > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { > /* bits for ncf_cb_flags */ > #define CB_GETATTR_BUSY 0 > > +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ > + > /* > * Represents a delegation stateid. The nfs4_client holds references to these > * and they are put when it is being destroyed or when the delegation is > -- > 2.39.3 > -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 19:54 ` Chuck Lever @ 2023-12-15 20:40 ` dai.ngo 2023-12-15 21:41 ` Chuck Lever 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-15 20:40 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs On 12/15/23 11:54 AM, Chuck Lever wrote: > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: >> If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will >> also stuck waiting for the callback request to be executed. This causes >> the client to hang waiting for the reply of the GETATTR and also causes >> the reboot of the NFS server to hang due to the pending NFS request. >> >> Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds >> time out. >> >> Reported-by: Wolfgang Walter <linux-nfs@stwm.de> >> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") >> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >> --- >> fs/nfsd/nfs4state.c | 6 +++++- >> fs/nfsd/state.h | 2 ++ >> 2 files changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >> index 175f3e9f5822..0cc7d4953807 100644 >> --- a/fs/nfsd/nfs4state.c >> +++ b/fs/nfsd/nfs4state.c >> @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >> return; >> >> + /* set to proper status when nfsd4_cb_getattr_done runs */ >> + ncf->ncf_cb_status = NFS4ERR_IO; >> + >> refcount_inc(&dp->dl_stid.sc_count); >> if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >> refcount_dec(&dp->dl_stid.sc_count); >> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >> nfs4_cb_getattr(&dp->dl_cb_fattr); >> spin_unlock(&ctx->flc_lock); >> >> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > I'm still thinking the timeout here should be the same (or slightly > longer than) the RPC retransmit timeout, rather than adding a new > NFSD_CB_GETATTR_TIMEOUT macro. The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a work item to the workqueue so RPC is not involved here. We need to time out here to prevent the client (that causes the conflict) to hang waiting for the reply of the GETATTR and to prevent the server reboot to hang due to a pending NFS request. -Dai > > >> if (ncf->ncf_cb_status) { >> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); >> if (status != nfserr_jukebox || >> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h >> index f96eaa8e9413..94563a6813a6 100644 >> --- a/fs/nfsd/state.h >> +++ b/fs/nfsd/state.h >> @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { >> /* bits for ncf_cb_flags */ >> #define CB_GETATTR_BUSY 0 >> >> +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ >> + >> /* >> * Represents a delegation stateid. The nfs4_client holds references to these >> * and they are put when it is being destroyed or when the delegation is >> -- >> 2.39.3 >> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 20:40 ` dai.ngo @ 2023-12-15 21:41 ` Chuck Lever 2023-12-15 21:55 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-15 21:41 UTC (permalink / raw) To: dai.ngo; +Cc: jlayton, linux-nfs, linux-nfs On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: > > On 12/15/23 11:54 AM, Chuck Lever wrote: > > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: > > > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will > > > also stuck waiting for the callback request to be executed. This causes > > > the client to hang waiting for the reply of the GETATTR and also causes > > > the reboot of the NFS server to hang due to the pending NFS request. > > > > > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds > > > time out. > > > > > > Reported-by: Wolfgang Walter <linux-nfs@stwm.de> > > > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") > > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > > > --- > > > fs/nfsd/nfs4state.c | 6 +++++- > > > fs/nfsd/state.h | 2 ++ > > > 2 files changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 175f3e9f5822..0cc7d4953807 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > > > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > > > return; > > > + /* set to proper status when nfsd4_cb_getattr_done runs */ > > > + ncf->ncf_cb_status = NFS4ERR_IO; > > > + > > > refcount_inc(&dp->dl_stid.sc_count); > > > if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > > > refcount_dec(&dp->dl_stid.sc_count); > > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > spin_unlock(&ctx->flc_lock); > > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > > > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > > I'm still thinking the timeout here should be the same (or slightly > > longer than) the RPC retransmit timeout, rather than adding a new > > NFSD_CB_GETATTR_TIMEOUT macro. > > The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a > work item to the workqueue so RPC is not involved here. In the "RPC was sent successfully" case, there is an implicit assumption here that wait_on_bit_timeout() won't time out before the actual RPC CB_GETATTR timeout. You've chosen timeout values that happen to work, but there's nothing in this patch that ties the two timeout values together or in any other way documents this implicit assumption. > We need to > time out here to prevent the client (that causes the conflict) to > hang waiting for the reply of the GETATTR and to prevent the server > reboot to hang due to a pending NFS request. Perhaps a better approach would be to not rely on a timeout, but instead have nfs4_cb_getattr() wake up the bit wait before returning, when it can't queue the work. That way, wait_on_bit() will return immediately in that case. > > > if (ncf->ncf_cb_status) { > > > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); > > > if (status != nfserr_jukebox || > > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > > > index f96eaa8e9413..94563a6813a6 100644 > > > --- a/fs/nfsd/state.h > > > +++ b/fs/nfsd/state.h > > > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { > > > /* bits for ncf_cb_flags */ > > > #define CB_GETATTR_BUSY 0 > > > +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ > > > + > > > /* > > > * Represents a delegation stateid. The nfs4_client holds references to these > > > * and they are put when it is being destroyed or when the delegation is > > > -- > > > 2.39.3 > > > -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 21:41 ` Chuck Lever @ 2023-12-15 21:55 ` dai.ngo 2023-12-16 1:21 ` Chuck Lever 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-15 21:55 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs Sorry Chuck, I didn't see this before sending v2. On 12/15/23 1:41 PM, Chuck Lever wrote: > On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: >> On 12/15/23 11:54 AM, Chuck Lever wrote: >>> On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: >>>> If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will >>>> also stuck waiting for the callback request to be executed. This causes >>>> the client to hang waiting for the reply of the GETATTR and also causes >>>> the reboot of the NFS server to hang due to the pending NFS request. >>>> >>>> Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds >>>> time out. >>>> >>>> Reported-by: Wolfgang Walter <linux-nfs@stwm.de> >>>> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") >>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >>>> --- >>>> fs/nfsd/nfs4state.c | 6 +++++- >>>> fs/nfsd/state.h | 2 ++ >>>> 2 files changed, 7 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>> index 175f3e9f5822..0cc7d4953807 100644 >>>> --- a/fs/nfsd/nfs4state.c >>>> +++ b/fs/nfsd/nfs4state.c >>>> @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >>>> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >>>> return; >>>> + /* set to proper status when nfsd4_cb_getattr_done runs */ >>>> + ncf->ncf_cb_status = NFS4ERR_IO; >>>> + >>>> refcount_inc(&dp->dl_stid.sc_count); >>>> if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >>>> refcount_dec(&dp->dl_stid.sc_count); >>>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >>>> nfs4_cb_getattr(&dp->dl_cb_fattr); >>>> spin_unlock(&ctx->flc_lock); >>>> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >>>> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >>>> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); >>> I'm still thinking the timeout here should be the same (or slightly >>> longer than) the RPC retransmit timeout, rather than adding a new >>> NFSD_CB_GETATTR_TIMEOUT macro. >> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a >> work item to the workqueue so RPC is not involved here. > In the "RPC was sent successfully" case, there is an implicit > assumption here that wait_on_bit_timeout() won't time out before the > actual RPC CB_GETATTR timeout. > > You've chosen timeout values that happen to work, but there's > nothing in this patch that ties the two timeout values together or > in any other way documents this implicit assumption. The timeout value was chosen to be greater then RPC callback receive timeout. I can add this to the commit message. > > >> We need to >> time out here to prevent the client (that causes the conflict) to >> hang waiting for the reply of the GETATTR and to prevent the server >> reboot to hang due to a pending NFS request. > Perhaps a better approach would be to not rely on a timeout, but > instead have nfs4_cb_getattr() wake up the bit wait before > returning, when it can't queue the work. That way, wait_on_bit() > will return immediately in that case. We can detect the condition where the work item can't be queue. But I think we still need to use wait_on_bit_timeout since there is no guarantee that the work will be executed even if it was queued. -Dai > > >>>> if (ncf->ncf_cb_status) { >>>> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); >>>> if (status != nfserr_jukebox || >>>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h >>>> index f96eaa8e9413..94563a6813a6 100644 >>>> --- a/fs/nfsd/state.h >>>> +++ b/fs/nfsd/state.h >>>> @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { >>>> /* bits for ncf_cb_flags */ >>>> #define CB_GETATTR_BUSY 0 >>>> +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ >>>> + >>>> /* >>>> * Represents a delegation stateid. The nfs4_client holds references to these >>>> * and they are put when it is being destroyed or when the delegation is >>>> -- >>>> 2.39.3 >>>> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 21:55 ` dai.ngo @ 2023-12-16 1:21 ` Chuck Lever 2023-12-16 3:18 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-16 1:21 UTC (permalink / raw) To: dai.ngo; +Cc: jlayton, linux-nfs, linux-nfs On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: > Sorry Chuck, I didn't see this before sending v2. > > On 12/15/23 1:41 PM, Chuck Lever wrote: > > On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: > > > On 12/15/23 11:54 AM, Chuck Lever wrote: > > > > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: > > > > > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will > > > > > also stuck waiting for the callback request to be executed. This causes > > > > > the client to hang waiting for the reply of the GETATTR and also causes > > > > > the reboot of the NFS server to hang due to the pending NFS request. > > > > > > > > > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds > > > > > time out. > > > > > > > > > > Reported-by: Wolfgang Walter <linux-nfs@stwm.de> > > > > > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") > > > > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > > > > > --- > > > > > fs/nfsd/nfs4state.c | 6 +++++- > > > > > fs/nfsd/state.h | 2 ++ > > > > > 2 files changed, 7 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > index 175f3e9f5822..0cc7d4953807 100644 > > > > > --- a/fs/nfsd/nfs4state.c > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > > > > > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > > > > > return; > > > > > + /* set to proper status when nfsd4_cb_getattr_done runs */ > > > > > + ncf->ncf_cb_status = NFS4ERR_IO; > > > > > + > > > > > refcount_inc(&dp->dl_stid.sc_count); > > > > > if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > > > > > refcount_dec(&dp->dl_stid.sc_count); > > > > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > > > spin_unlock(&ctx->flc_lock); > > > > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > > > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > > > > > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > > > > I'm still thinking the timeout here should be the same (or slightly > > > > longer than) the RPC retransmit timeout, rather than adding a new > > > > NFSD_CB_GETATTR_TIMEOUT macro. > > > The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a > > > work item to the workqueue so RPC is not involved here. > > In the "RPC was sent successfully" case, there is an implicit > > assumption here that wait_on_bit_timeout() won't time out before the > > actual RPC CB_GETATTR timeout. > > > > You've chosen timeout values that happen to work, but there's > > nothing in this patch that ties the two timeout values together or > > in any other way documents this implicit assumption. > > The timeout value was chosen to be greater then RPC callback receive > timeout. I can add this to the commit message. nfsd needs to handle this case properly. A commit message will not be sufficient. The rpc_timeout setting that is used for callbacks is not always 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a maximum of 360 seconds, if I'm reading the code correctly (see max_cb_time). It would be simple enough for a server admin to set a long lease expiry while individual CB_GETATTRs are still terminating with EIO after just 20 seconds because of this new timeout. Actually, a bit wait in an nfsd thread should be a no-no. Even a wait of tens of milliseconds is bad. Enough nfsd threads go into a wait like this and that's a denial-of-service. That's why NFSv4 callbacks are handled on a work queue and not in an nfsd thread. Is there some way the operation that triggered the CB_GETATTR can be deferred properly and picked up by another nfsd thread when the CB_GETATTR completes? > > > We need to > > > time out here to prevent the client (that causes the conflict) to > > > hang waiting for the reply of the GETATTR and to prevent the server > > > reboot to hang due to a pending NFS request. > > Perhaps a better approach would be to not rely on a timeout, but > > instead have nfs4_cb_getattr() wake up the bit wait before > > returning, when it can't queue the work. That way, wait_on_bit() > > will return immediately in that case. > > We can detect the condition where the work item can't be queue. > But I think we still need to use wait_on_bit_timeout since there > is no guarantee that the work will be executed even if it was > queued. This is a basic guarantee provided by the RPC layer. Can you enumerate what other ways this path will fail without waking the bit wait? Are those issues going to impact other callback operations? > > > > > if (ncf->ncf_cb_status) { > > > > > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); > > > > > if (status != nfserr_jukebox || > > > > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > > > > > index f96eaa8e9413..94563a6813a6 100644 > > > > > --- a/fs/nfsd/state.h > > > > > +++ b/fs/nfsd/state.h > > > > > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { > > > > > /* bits for ncf_cb_flags */ > > > > > #define CB_GETATTR_BUSY 0 > > > > > +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ > > > > > + > > > > > /* > > > > > * Represents a delegation stateid. The nfs4_client holds references to these > > > > > * and they are put when it is being destroyed or when the delegation is > > > > > -- > > > > > 2.39.3 > > > > > -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-16 1:21 ` Chuck Lever @ 2023-12-16 3:18 ` dai.ngo 2023-12-16 3:57 ` Chuck Lever 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-16 3:18 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs On 12/15/23 5:21 PM, Chuck Lever wrote: > On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: >> Sorry Chuck, I didn't see this before sending v2. >> >> On 12/15/23 1:41 PM, Chuck Lever wrote: >>> On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: >>>> On 12/15/23 11:54 AM, Chuck Lever wrote: >>>>> On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: >>>>>> If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will >>>>>> also stuck waiting for the callback request to be executed. This causes >>>>>> the client to hang waiting for the reply of the GETATTR and also causes >>>>>> the reboot of the NFS server to hang due to the pending NFS request. >>>>>> >>>>>> Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds >>>>>> time out. >>>>>> >>>>>> Reported-by: Wolfgang Walter <linux-nfs@stwm.de> >>>>>> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") >>>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >>>>>> --- >>>>>> fs/nfsd/nfs4state.c | 6 +++++- >>>>>> fs/nfsd/state.h | 2 ++ >>>>>> 2 files changed, 7 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>>> index 175f3e9f5822..0cc7d4953807 100644 >>>>>> --- a/fs/nfsd/nfs4state.c >>>>>> +++ b/fs/nfsd/nfs4state.c >>>>>> @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >>>>>> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >>>>>> return; >>>>>> + /* set to proper status when nfsd4_cb_getattr_done runs */ >>>>>> + ncf->ncf_cb_status = NFS4ERR_IO; >>>>>> + >>>>>> refcount_inc(&dp->dl_stid.sc_count); >>>>>> if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >>>>>> refcount_dec(&dp->dl_stid.sc_count); >>>>>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >>>>>> nfs4_cb_getattr(&dp->dl_cb_fattr); >>>>>> spin_unlock(&ctx->flc_lock); >>>>>> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >>>>>> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >>>>>> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); >>>>> I'm still thinking the timeout here should be the same (or slightly >>>>> longer than) the RPC retransmit timeout, rather than adding a new >>>>> NFSD_CB_GETATTR_TIMEOUT macro. >>>> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a >>>> work item to the workqueue so RPC is not involved here. >>> In the "RPC was sent successfully" case, there is an implicit >>> assumption here that wait_on_bit_timeout() won't time out before the >>> actual RPC CB_GETATTR timeout. >>> >>> You've chosen timeout values that happen to work, but there's >>> nothing in this patch that ties the two timeout values together or >>> in any other way documents this implicit assumption. >> The timeout value was chosen to be greater then RPC callback receive >> timeout. I can add this to the commit message. > nfsd needs to handle this case properly. A commit message will not > be sufficient. > > The rpc_timeout setting that is used for callbacks is not always > 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a > maximum of 360 seconds, if I'm reading the code correctly (see > max_cb_time). > > It would be simple enough for a server admin to set a long lease > expiry while individual CB_GETATTRs are still terminating with > EIO after just 20 seconds because of this new timeout. To handle case where server admin sets longer lease, we can set callback timeout to be (nfsd4_lease)/10 + 5) and add a comment in the code to indicate the dependency to max_cb_time. > > Actually, a bit wait in an nfsd thread should be a no-no. Even a > wait of tens of milliseconds is bad. Enough nfsd threads go into a > wait like this and that's a denial-of-service. That's why NFSv4 > callbacks are handled on a work queue and not in an nfsd thread. That sounds reasonable. However I see in the current code there are multiple places the nfsd thread sleeps/waits for certain events: nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init, wait_for_concurrent_writes, etc. > > Is there some way the operation that triggered the CB_GETATTR can be > deferred properly and picked up by another nfsd thread when the > CB_GETATTR completes? We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY to the conflict client. When the reply of the CB_GETATTR arrives, nfsd4_cb_getattr_done can mark the delegation to indicate the corresponding file's attributes were updated so when the conflict client retries the GETATTR we can return the updated attributes. We still have to have a way to detect that the client never, or take too long, to reply to the CB_GETATTR so that we can break the lease. Also, the intention of the wait_on_bit is to avoid sending the conflict client the NFS4ERR_DELAY if everything works properly which is the normal case. So I think this can be done but it would make the code a bit complicate and we loose the optimization of avoiding the NFS4ERR_DELAY. -Dai > > >>>> We need to >>>> time out here to prevent the client (that causes the conflict) to >>>> hang waiting for the reply of the GETATTR and to prevent the server >>>> reboot to hang due to a pending NFS request. >>> Perhaps a better approach would be to not rely on a timeout, but >>> instead have nfs4_cb_getattr() wake up the bit wait before >>> returning, when it can't queue the work. That way, wait_on_bit() >>> will return immediately in that case. >> We can detect the condition where the work item can't be queue. >> But I think we still need to use wait_on_bit_timeout since there >> is no guarantee that the work will be executed even if it was >> queued. > This is a basic guarantee provided by the RPC layer. Can you > enumerate what other ways this path will fail without waking the bit > wait? Are those issues going to impact other callback operations? > > >>>>>> if (ncf->ncf_cb_status) { >>>>>> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); >>>>>> if (status != nfserr_jukebox || >>>>>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h >>>>>> index f96eaa8e9413..94563a6813a6 100644 >>>>>> --- a/fs/nfsd/state.h >>>>>> +++ b/fs/nfsd/state.h >>>>>> @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { >>>>>> /* bits for ncf_cb_flags */ >>>>>> #define CB_GETATTR_BUSY 0 >>>>>> +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ >>>>>> + >>>>>> /* >>>>>> * Represents a delegation stateid. The nfs4_client holds references to these >>>>>> * and they are put when it is being destroyed or when the delegation is >>>>>> -- >>>>>> 2.39.3 >>>>>> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-16 3:18 ` dai.ngo @ 2023-12-16 3:57 ` Chuck Lever 2023-12-16 22:44 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-16 3:57 UTC (permalink / raw) To: dai.ngo; +Cc: jlayton, linux-nfs, linux-nfs On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@oracle.com wrote: > > On 12/15/23 5:21 PM, Chuck Lever wrote: > > On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: > > > Sorry Chuck, I didn't see this before sending v2. > > > > > > On 12/15/23 1:41 PM, Chuck Lever wrote: > > > > On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: > > > > > On 12/15/23 11:54 AM, Chuck Lever wrote: > > > > > > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: > > > > > > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > > > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > > > > > spin_unlock(&ctx->flc_lock); > > > > > > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > > > > > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > > > > > > > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > > > > > > I'm still thinking the timeout here should be the same (or slightly > > > > > > longer than) the RPC retransmit timeout, rather than adding a new > > > > > > NFSD_CB_GETATTR_TIMEOUT macro. > > > > > The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a > > > > > work item to the workqueue so RPC is not involved here. > > > > In the "RPC was sent successfully" case, there is an implicit > > > > assumption here that wait_on_bit_timeout() won't time out before the > > > > actual RPC CB_GETATTR timeout. > > > > > > > > You've chosen timeout values that happen to work, but there's > > > > nothing in this patch that ties the two timeout values together or > > > > in any other way documents this implicit assumption. > > > The timeout value was chosen to be greater then RPC callback receive > > > timeout. I can add this to the commit message. > > nfsd needs to handle this case properly. A commit message will not > > be sufficient. > > > > The rpc_timeout setting that is used for callbacks is not always > > 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a > > maximum of 360 seconds, if I'm reading the code correctly (see > > max_cb_time). > > > > It would be simple enough for a server admin to set a long lease > > expiry while individual CB_GETATTRs are still terminating with > > EIO after just 20 seconds because of this new timeout. > > To handle case where server admin sets longer lease, we can set > callback timeout to be (nfsd4_lease)/10 + 5) and add a comment > in the code to indicate the dependency to max_cb_time. Which was my initial suggestion, but now I think the only proper fix is to replace the wait_on_bit() entirely. > > Actually, a bit wait in an nfsd thread should be a no-no. Even a > > wait of tens of milliseconds is bad. Enough nfsd threads go into a > > wait like this and that's a denial-of-service. That's why NFSv4 > > callbacks are handled on a work queue and not in an nfsd thread. > > That sounds reasonable. However I see in the current code there > are multiple places the nfsd thread sleeps/waits for certain events: > nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init, > wait_for_concurrent_writes, etc. Yep, each of those needs to be replaced if there is a danger of the wait becoming indefinite. We found another one recently with the pNFS block layout type waiting for a lease breaker. So an nfsd thread does wait on occasion, but it's almost never the right thing to do. Let's not add another one, especially one that can be manipulated by (bad) client behavior. > > Is there some way the operation that triggered the CB_GETATTR can be > > deferred properly and picked up by another nfsd thread when the > > CB_GETATTR completes? > > We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY > to the conflict client. When the reply of the CB_GETATTR arrives, > nfsd4_cb_getattr_done can mark the delegation to indicate the > corresponding file's attributes were updated so when the conflict > client retries the GETATTR we can return the updated attributes. > > We still have to have a way to detect that the client never, or > take too long, to reply to the CB_GETATTR so that we can break > the lease. As long as the RPC is SOFT, the RPC client is supposed to guarantee that the upper layer gets a completion of some kind. If it's HARD then it should fully interruptible by a signal or shutdown. Either way, if NFSD can manage to reliably detect when the CB RPC has not even been scheduled, then that gives us a fully dependable guarantee. > Also, the intention of the wait_on_bit is to avoid sending the > conflict client the NFS4ERR_DELAY if everything works properly > which is the normal case. > > So I think this can be done but it would make the code a bit > complicate and we loose the optimization of avoiding the > NFS4ERR_DELAY. I'm not excited about handling this via DELAY either. There's a good chance there is another way to deal with this sanely. I'm inclined to revert CB_GETATTR support until it is demonstrated to be working reliably. The current mechanism has already shown to be prone to a hard hang that blocks server shutdown, and it's not even in the wild yet. I can add patches to nfsd-fixes to revert CB_GETATTR and let that sit for a few days while we decide how to move forward. -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-16 3:57 ` Chuck Lever @ 2023-12-16 22:44 ` dai.ngo 2023-12-18 16:02 ` Chuck Lever 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-16 22:44 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs On 12/15/23 7:57 PM, Chuck Lever wrote: > On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@oracle.com wrote: >> On 12/15/23 5:21 PM, Chuck Lever wrote: >>> On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: >>>> Sorry Chuck, I didn't see this before sending v2. >>>> >>>> On 12/15/23 1:41 PM, Chuck Lever wrote: >>>>> On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: >>>>>> On 12/15/23 11:54 AM, Chuck Lever wrote: >>>>>>> On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: >>>>>>>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >>>>>>>> nfs4_cb_getattr(&dp->dl_cb_fattr); >>>>>>>> spin_unlock(&ctx->flc_lock); >>>>>>>> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >>>>>>>> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >>>>>>>> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); >>>>>>> I'm still thinking the timeout here should be the same (or slightly >>>>>>> longer than) the RPC retransmit timeout, rather than adding a new >>>>>>> NFSD_CB_GETATTR_TIMEOUT macro. >>>>>> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a >>>>>> work item to the workqueue so RPC is not involved here. >>>>> In the "RPC was sent successfully" case, there is an implicit >>>>> assumption here that wait_on_bit_timeout() won't time out before the >>>>> actual RPC CB_GETATTR timeout. >>>>> >>>>> You've chosen timeout values that happen to work, but there's >>>>> nothing in this patch that ties the two timeout values together or >>>>> in any other way documents this implicit assumption. >>>> The timeout value was chosen to be greater then RPC callback receive >>>> timeout. I can add this to the commit message. >>> nfsd needs to handle this case properly. A commit message will not >>> be sufficient. >>> >>> The rpc_timeout setting that is used for callbacks is not always >>> 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a >>> maximum of 360 seconds, if I'm reading the code correctly (see >>> max_cb_time). >>> >>> It would be simple enough for a server admin to set a long lease >>> expiry while individual CB_GETATTRs are still terminating with >>> EIO after just 20 seconds because of this new timeout. With courteous server, what is the benefit of allowing the admin to extend the lease? I think this option should be removed, it's just an administrative overhead and can cause more confusion. >> To handle case where server admin sets longer lease, we can set >> callback timeout to be (nfsd4_lease)/10 + 5) and add a comment >> in the code to indicate the dependency to max_cb_time. > Which was my initial suggestion, but now I think the only proper fix > is to replace the wait_on_bit() entirely. > > >>> Actually, a bit wait in an nfsd thread should be a no-no. Even a >>> wait of tens of milliseconds is bad. Enough nfsd threads go into a >>> wait like this and that's a denial-of-service. That's why NFSv4 >>> callbacks are handled on a work queue and not in an nfsd thread. >> That sounds reasonable. However I see in the current code there >> are multiple places the nfsd thread sleeps/waits for certain events: >> nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init, >> wait_for_concurrent_writes, etc. > Yep, each of those needs to be replaced if there is a danger of the > wait becoming indefinite. We found another one recently with the > pNFS block layout type waiting for a lease breaker. So an nfsd > thread does wait on occasion, but it's almost never the right thing > to do. > > Let's not add another one, especially one that can be manipulated by > (bad) client behavior. I'm not clear how the wait for CB_GETATTR can be manipulated by a bad client, can you elaborate? > > >>> Is there some way the operation that triggered the CB_GETATTR can be >>> deferred properly and picked up by another nfsd thread when the >>> CB_GETATTR completes? >> We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY >> to the conflict client. When the reply of the CB_GETATTR arrives, >> nfsd4_cb_getattr_done can mark the delegation to indicate the >> corresponding file's attributes were updated so when the conflict >> client retries the GETATTR we can return the updated attributes. >> >> We still have to have a way to detect that the client never, or >> take too long, to reply to the CB_GETATTR so that we can break >> the lease. > As long as the RPC is SOFT, the RPC client is supposed to guarantee > that the upper layer gets a completion of some kind. If it's HARD > then it should fully interruptible by a signal or shutdown. This is only true if the workqueue worker runs and executes the work item successfully. If the work item is stuck at the workqueue then RPC is not involved. NFSD must handle the case where the work item is never executed for any reasons. > > Either way, if NFSD can manage to reliably detect when the CB RPC > has not even been scheduled, then that gives us a fully dependable > guarantee. Once the async CB RPC was passed to the RPC layer via rpc_run_task, I can't see any sure way to know when the RPC task will be picked up and run. Until the RPC task is run, the RPC time out mechanism is not in play. To handle this condition, I think a timeout mechanism is needed at the NFSD layer, any other options? > > >> Also, the intention of the wait_on_bit is to avoid sending the >> conflict client the NFS4ERR_DELAY if everything works properly >> which is the normal case. >> >> So I think this can be done but it would make the code a bit >> complicate and we loose the optimization of avoiding the >> NFS4ERR_DELAY. > I'm not excited about handling this via DELAY either. There's a > good chance there is another way to deal with this sanely. > > I'm inclined to revert CB_GETATTR support until it is demonstrated > to be working reliably. The current mechanism has already shown to > be prone to a hard hang that blocks server shutdown, and it's not > even in the wild yet. If I understand the problem correctly, this hard hang issue is due to the work item being stuck at the workqueue which the current code does not take into account. > > I can add patches to nfsd-fixes to revert CB_GETATTR and let that > sit for a few days while we decide how to move forward. The simplest solution for this particular problem is to use wait with timeout. But if that solution does not meet your expectation then yes reverting the CB_GETATTR, or remove the handling of GETATTR conflict with write delegation, will definitely resolve this problem. -Dai ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-16 22:44 ` dai.ngo @ 2023-12-18 16:02 ` Chuck Lever 2023-12-18 18:17 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-18 16:02 UTC (permalink / raw) To: dai.ngo; +Cc: jlayton, linux-nfs, linux-nfs On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote: > On 12/15/23 7:57 PM, Chuck Lever wrote: > > On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@oracle.com wrote: > > > On 12/15/23 5:21 PM, Chuck Lever wrote: > > > > On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: > > > > > Sorry Chuck, I didn't see this before sending v2. > > > > > > > > > > On 12/15/23 1:41 PM, Chuck Lever wrote: > > > > > > On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: > > > > > > > On 12/15/23 11:54 AM, Chuck Lever wrote: > > > > > > > > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: > > > > > > > > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > > > > > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > > > > > > > spin_unlock(&ctx->flc_lock); > > > > > > > > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > > > > > > > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > > > > > > > > > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > > > > > > > > I'm still thinking the timeout here should be the same (or slightly > > > > > > > > longer than) the RPC retransmit timeout, rather than adding a new > > > > > > > > NFSD_CB_GETATTR_TIMEOUT macro. > > > > > > > The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a > > > > > > > work item to the workqueue so RPC is not involved here. > > > > > > In the "RPC was sent successfully" case, there is an implicit > > > > > > assumption here that wait_on_bit_timeout() won't time out before the > > > > > > actual RPC CB_GETATTR timeout. > > > > > > > > > > > > You've chosen timeout values that happen to work, but there's > > > > > > nothing in this patch that ties the two timeout values together or > > > > > > in any other way documents this implicit assumption. > > > > > The timeout value was chosen to be greater then RPC callback receive > > > > > timeout. I can add this to the commit message. > > > > nfsd needs to handle this case properly. A commit message will not > > > > be sufficient. > > > > > > > > The rpc_timeout setting that is used for callbacks is not always > > > > 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a > > > > maximum of 360 seconds, if I'm reading the code correctly (see > > > > max_cb_time). > > > > > > > > It would be simple enough for a server admin to set a long lease > > > > expiry while individual CB_GETATTRs are still terminating with > > > > EIO after just 20 seconds because of this new timeout. > > With courteous server, what is the benefit of allowing the admin to > extend the lease? The lease time is the time period during which the server /guarantees/ it will preserve the client's locks, even if there are conflicting open or lock requests from other clients. Once the client transitions to courtesy, those locks can be lost due to conflicting open and lock requests. Clients need to know the server's lease expiry so they know how often to renew their lease. That's the only way lease-based locking service can provide a lock guarantee. > I think this option should be removed, it's just > an administrative overhead and can cause more confusion. A server administrator might extend the lock guarantee by several minutes if her network is subject to frequent partitions -- that will result in long workload pauses, but it will /guarantee/ that locks won't be lost during short partitions. The only reason not to extend lease expiry is that it also lengthens the server's grace period. If the server is accessed only by NFSv4.1 clients, they can use RECLAIM_COMPLETE to avoid long waits after a server reboot... but with a mixed client cohort, a shorter grace period is usually preferred. We have the tunable now, so I don't see a lot of value in making an effort to deprecate and remove it until NFSD no longer supports NFSv3 and NFSv4.0. > > > To handle case where server admin sets longer lease, we can set > > > callback timeout to be (nfsd4_lease)/10 + 5) and add a comment > > > in the code to indicate the dependency to max_cb_time. > > Which was my initial suggestion, but now I think the only proper fix > > is to replace the wait_on_bit() entirely. > > > > > > > > Actually, a bit wait in an nfsd thread should be a no-no. Even a > > > > wait of tens of milliseconds is bad. Enough nfsd threads go into a > > > > wait like this and that's a denial-of-service. That's why NFSv4 > > > > callbacks are handled on a work queue and not in an nfsd thread. > > > That sounds reasonable. However I see in the current code there > > > are multiple places the nfsd thread sleeps/waits for certain events: > > > nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init, > > > wait_for_concurrent_writes, etc. > > Yep, each of those needs to be replaced if there is a danger of the > > wait becoming indefinite. We found another one recently with the > > pNFS block layout type waiting for a lease breaker. So an nfsd > > thread does wait on occasion, but it's almost never the right thing > > to do. > > > > Let's not add another one, especially one that can be manipulated by > > (bad) client behavior. > > I'm not clear how the wait for CB_GETATTR can be manipulated by a bad > client, can you elaborate? Currently, a callback is handed off to a background worker and the nfsd thread is permitted to look for other work. If instead nfsd threads waited for callbacks, then a client with an unresponsive callback service would pin those nfsd threads for the length of the server's callback timeout. If the client's forechannel workload is actively acquiring delegations and then sending conflicting GETATTRs, it will keep such a server's nfsd threads stuck waiting for CB_GETATTR replies. And since those nfsd threads are shared by all clients, all of the server's clients would see long delays because there would be no available nfsd threads to pick up new work. > > > > Is there some way the operation that triggered the CB_GETATTR can be > > > > deferred properly and picked up by another nfsd thread when the > > > > CB_GETATTR completes? > > > We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY > > > to the conflict client. When the reply of the CB_GETATTR arrives, > > > nfsd4_cb_getattr_done can mark the delegation to indicate the > > > corresponding file's attributes were updated so when the conflict > > > client retries the GETATTR we can return the updated attributes. > > > > > > We still have to have a way to detect that the client never, or > > > take too long, to reply to the CB_GETATTR so that we can break > > > the lease. > > As long as the RPC is SOFT, the RPC client is supposed to guarantee > > that the upper layer gets a completion of some kind. If it's HARD > > then it should fully interruptible by a signal or shutdown. > > This is only true if the workqueue worker runs and executes the work > item successfully. If the work item is stuck at the workqueue then RPC > is not involved. NFSD must handle the case where the work item is > never executed for any reasons. The queue_work() API guarantees that the work item will be dispatched. A lot of kernel subsystems would be in a world of pain if that guarantee was broken somehow. So I don't agree that NFSD needs to do anything special here when queue_work() returns true. The only reason I can think of that queue_work() might return false is if NFSD hands it a work item that is already queued. That would be a bug in NFSD. > > Either way, if NFSD can manage to reliably detect when the CB RPC > > has not even been scheduled, then that gives us a fully dependable > > guarantee. > > Once the async CB RPC was passed to the RPC layer via rpc_run_task, > I can't see any sure way to know when the RPC task will be picked up > and run. Until the RPC task is run, the RPC time out mechanism is not > in play. To handle this condition, I think a timeout mechanism is > needed at the NFSD layer, any other options? The only reason you think a timeout is needed is because you want the nfsd thread to wait for the reply. That's just not how NFSD handles NFSv4 callbacks. The current architecture is that NFSD queues the callback and then replies NFS4ERR_DELAY. That nfsd thread is then free to pick up other work, including handling the client's retry. It doesn't matter how long it takes for the callback work item to go from the work queue down to the RPC layer, as long as a subsequent server shutdown does not hang. Either the client will reply to the callback, or the server will see that there was no response and revoke the delegation. (Note that we have nfsd_wait_for_delegreturn() now: it does wait uninterruptibly in an nfsd thread context, but only for 30 milliseconds. The point of this is to give a well-behaved client a chance to respond without returning NFS4ERR_DELAY -- if the client doesn't respond, then it falls back to the architecture outlined above.) > > > Also, the intention of the wait_on_bit is to avoid sending the > > > conflict client the NFS4ERR_DELAY if everything works properly > > > which is the normal case. > > > > > > So I think this can be done but it would make the code a bit > > > complicate and we loose the optimization of avoiding the > > > NFS4ERR_DELAY. > > I'm not excited about handling this via DELAY either. There's a > > good chance there is another way to deal with this sanely. > > > > I'm inclined to revert CB_GETATTR support until it is demonstrated > > to be working reliably. The current mechanism has already shown to > > be prone to a hard hang that blocks server shutdown, and it's not > > even in the wild yet. > > If I understand the problem correctly, this hard hang issue is due to > the work item being stuck at the workqueue which the current code does > not take into account. The hard hang was because of an uninterruptible wait_on_bit(). What we don't know is why the callback was lost. - It could be that queue_work() returned false because of a bug. Note that there is a WARN_ON_ONCE() that fires in this case: if it fired several days before the hang, then we won't see any log messages for more recent misqueued work items. - It could be that nfsd4_run_cb_work() marked the backchannel down but somehow did not wake up any in-flight callback requests. Let's get more details about what's going on. > > I can add patches to nfsd-fixes to revert CB_GETATTR and let that > > sit for a few days while we decide how to move forward. > > The simplest solution for this particular problem is to use wait with > timeout. The hard hang was due to an uninterruptible wait, which has now been reverted. Going forward, if there's no wait, there can be no timeout. The only approach is to handle errors properly when dispatching a callback. -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-18 16:02 ` Chuck Lever @ 2023-12-18 18:17 ` dai.ngo 2023-12-18 19:10 ` Chuck Lever 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-18 18:17 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs On 12/18/23 8:02 AM, Chuck Lever wrote: > On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote: >> On 12/15/23 7:57 PM, Chuck Lever wrote: >>> On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@oracle.com wrote: >>>> On 12/15/23 5:21 PM, Chuck Lever wrote: >>>>> On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote: >>>>>> Sorry Chuck, I didn't see this before sending v2. >>>>>> >>>>>> On 12/15/23 1:41 PM, Chuck Lever wrote: >>>>>>> On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote: >>>>>>>> On 12/15/23 11:54 AM, Chuck Lever wrote: >>>>>>>>> On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote: >>>>>>>>>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >>>>>>>>>> nfs4_cb_getattr(&dp->dl_cb_fattr); >>>>>>>>>> spin_unlock(&ctx->flc_lock); >>>>>>>>>> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >>>>>>>>>> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >>>>>>>>>> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); >>>>>>>>> I'm still thinking the timeout here should be the same (or slightly >>>>>>>>> longer than) the RPC retransmit timeout, rather than adding a new >>>>>>>>> NFSD_CB_GETATTR_TIMEOUT macro. >>>>>>>> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a >>>>>>>> work item to the workqueue so RPC is not involved here. >>>>>>> In the "RPC was sent successfully" case, there is an implicit >>>>>>> assumption here that wait_on_bit_timeout() won't time out before the >>>>>>> actual RPC CB_GETATTR timeout. >>>>>>> >>>>>>> You've chosen timeout values that happen to work, but there's >>>>>>> nothing in this patch that ties the two timeout values together or >>>>>>> in any other way documents this implicit assumption. >>>>>> The timeout value was chosen to be greater then RPC callback receive >>>>>> timeout. I can add this to the commit message. >>>>> nfsd needs to handle this case properly. A commit message will not >>>>> be sufficient. >>>>> >>>>> The rpc_timeout setting that is used for callbacks is not always >>>>> 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a >>>>> maximum of 360 seconds, if I'm reading the code correctly (see >>>>> max_cb_time). >>>>> >>>>> It would be simple enough for a server admin to set a long lease >>>>> expiry while individual CB_GETATTRs are still terminating with >>>>> EIO after just 20 seconds because of this new timeout. >> With courteous server, what is the benefit of allowing the admin to >> extend the lease? > The lease time is the time period during which the server > /guarantees/ it will preserve the client's locks, even if there are > conflicting open or lock requests from other clients. > > Once the client transitions to courtesy, those locks can be lost > due to conflicting open and lock requests. > > Clients need to know the server's lease expiry so they know how > often to renew their lease. That's the only way lease-based locking > service can provide a lock guarantee. > > >> I think this option should be removed, it's just >> an administrative overhead and can cause more confusion. > A server administrator might extend the lock guarantee by several > minutes if her network is subject to frequent partitions -- that > will result in long workload pauses, but it will /guarantee/ that > locks won't be lost during short partitions. > > The only reason not to extend lease expiry is that it also > lengthens the server's grace period. If the server is accessed > only by NFSv4.1 clients, they can use RECLAIM_COMPLETE to avoid > long waits after a server reboot... but with a mixed client > cohort, a shorter grace period is usually preferred. > > We have the tunable now, so I don't see a lot of value in making > an effort to deprecate and remove it until NFSD no longer supports > NFSv3 and NFSv4.0. Thank you for the explanation. > > >>>> To handle case where server admin sets longer lease, we can set >>>> callback timeout to be (nfsd4_lease)/10 + 5) and add a comment >>>> in the code to indicate the dependency to max_cb_time. >>> Which was my initial suggestion, but now I think the only proper fix >>> is to replace the wait_on_bit() entirely. >>> >>> >>>>> Actually, a bit wait in an nfsd thread should be a no-no. Even a >>>>> wait of tens of milliseconds is bad. Enough nfsd threads go into a >>>>> wait like this and that's a denial-of-service. That's why NFSv4 >>>>> callbacks are handled on a work queue and not in an nfsd thread. >>>> That sounds reasonable. However I see in the current code there >>>> are multiple places the nfsd thread sleeps/waits for certain events: >>>> nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init, >>>> wait_for_concurrent_writes, etc. >>> Yep, each of those needs to be replaced if there is a danger of the >>> wait becoming indefinite. We found another one recently with the >>> pNFS block layout type waiting for a lease breaker. So an nfsd >>> thread does wait on occasion, but it's almost never the right thing >>> to do. >>> >>> Let's not add another one, especially one that can be manipulated by >>> (bad) client behavior. >> I'm not clear how the wait for CB_GETATTR can be manipulated by a bad >> client, can you elaborate? > Currently, a callback is handed off to a background worker and the > nfsd thread is permitted to look for other work. > > If instead nfsd threads waited for callbacks, then a client with an > unresponsive callback service would pin those nfsd threads for the > length of the server's callback timeout. > > If the client's forechannel workload is actively acquiring > delegations and then sending conflicting GETATTRs, it will keep such > a server's nfsd threads stuck waiting for CB_GETATTR replies. If the GETATTR and the delegation owner come from the same client then we just reply to the GETATTR without sending CB_GETATTR. > > And since those nfsd threads are shared by all clients, all of the > server's clients would see long delays because there would be no > available nfsd threads to pick up new work. > > >>>>> Is there some way the operation that triggered the CB_GETATTR can be >>>>> deferred properly and picked up by another nfsd thread when the >>>>> CB_GETATTR completes? >>>> We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY >>>> to the conflict client. When the reply of the CB_GETATTR arrives, >>>> nfsd4_cb_getattr_done can mark the delegation to indicate the >>>> corresponding file's attributes were updated so when the conflict >>>> client retries the GETATTR we can return the updated attributes. >>>> >>>> We still have to have a way to detect that the client never, or >>>> take too long, to reply to the CB_GETATTR so that we can break >>>> the lease. >>> As long as the RPC is SOFT, the RPC client is supposed to guarantee >>> that the upper layer gets a completion of some kind. If it's HARD >>> then it should fully interruptible by a signal or shutdown. >> This is only true if the workqueue worker runs and executes the work >> item successfully. If the work item is stuck at the workqueue then RPC >> is not involved. NFSD must handle the case where the work item is >> never executed for any reasons. > The queue_work() API guarantees that the work item will be > dispatched. A lot of kernel subsystems would be in a world of pain > if that guarantee was broken somehow. > > So I don't agree that NFSD needs to do anything special here when > queue_work() returns true. > > The only reason I can think of that queue_work() might return false > is if NFSD hands it a work item that is already queued. That would > be a bug in NFSD. Jeff, when __break_lease is called with 'want_write' then FL_UNLOCK_PENDING is checked to prevent lm_break being called again. When __break_lease is called with '!want_write' then lease_breaking is called to prevent lm_break being called again. This would work for NFSD if it supports read delegation only. With write delegation, shouldn't __break_lease() always use lease_breaking to prevent __being called again? The scenario is read request conflicts with write delegation then another write request conflicts with the same write delegation. > >>> Either way, if NFSD can manage to reliably detect when the CB RPC >>> has not even been scheduled, then that gives us a fully dependable >>> guarantee. >> Once the async CB RPC was passed to the RPC layer via rpc_run_task, >> I can't see any sure way to know when the RPC task will be picked up >> and run. Until the RPC task is run, the RPC time out mechanism is not >> in play. To handle this condition, I think a timeout mechanism is >> needed at the NFSD layer, any other options? > The only reason you think a timeout is needed is because you want > the nfsd thread to wait for the reply. That's just not how NFSD > handles NFSv4 callbacks. > > The current architecture is that NFSD queues the callback and then > replies NFS4ERR_DELAY. That nfsd thread is then free to pick up > other work, including handling the client's retry. > > It doesn't matter how long it takes for the callback work item to go > from the work queue down to the RPC layer, as long as a subsequent > server shutdown does not hang. Either the client will reply to the > callback, or the server will see that there was no response and > revoke the delegation. > > (Note that we have nfsd_wait_for_delegreturn() now: it does wait > uninterruptibly in an nfsd thread context, but only for 30 > milliseconds. The point of this is to give a well-behaved client > a chance to respond without returning NFS4ERR_DELAY -- if the > client doesn't respond, then it falls back to the architecture > outlined above.) The wait in nfsd4_deleg_getattr_conflict can be modified to do the same as nfsd_wait_for_delegreturn which is wait for only 30ms to work with well-behaved clients. > > >>>> Also, the intention of the wait_on_bit is to avoid sending the >>>> conflict client the NFS4ERR_DELAY if everything works properly >>>> which is the normal case. >>>> >>>> So I think this can be done but it would make the code a bit >>>> complicate and we loose the optimization of avoiding the >>>> NFS4ERR_DELAY. >>> I'm not excited about handling this via DELAY either. There's a >>> good chance there is another way to deal with this sanely. >>> >>> I'm inclined to revert CB_GETATTR support until it is demonstrated >>> to be working reliably. The current mechanism has already shown to >>> be prone to a hard hang that blocks server shutdown, and it's not >>> even in the wild yet. >> If I understand the problem correctly, this hard hang issue is due to >> the work item being stuck at the workqueue which the current code does >> not take into account. > The hard hang was because of an uninterruptible wait_on_bit(). In my debug I simulate the condition where workqueue is stuck when nfs4_cb_getattr is called and this causes the reboot to hang. But I'm not sure this is the cause in the problem that was reported. Note that the WARN_ON_ONCE came from nfsd_break_one_deleg. I will verify that if the work item for nfsd_break_one_deleg is stuck will the server reboot hang. > What > we don't know is why the callback was lost. > > - It could be that queue_work() returned false because of a bug. > Note that there is a WARN_ON_ONCE() that fires in this case: if > it fired several days before the hang, then we won't see any > log messages for more recent misqueued work items. The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation recall and not from nfs4_cb_getattr. I suspect this is because of a possible bug in __break_lease as question for Jeff above. > > - It could be that nfsd4_run_cb_work() marked the backchannel down > but somehow did not wake up any in-flight callback requests. > > Let's get more details about what's going on. > > >>> I can add patches to nfsd-fixes to revert CB_GETATTR and let that >>> sit for a few days while we decide how to move forward. >> The simplest solution for this particular problem is to use wait with >> timeout. > The hard hang was due to an uninterruptible wait, which has now been > reverted. > > Going forward, if there's no wait, there can be no timeout. The > only approach is to handle errors properly when dispatching a > callback. not even wait for 30ms for well behave client, same as nfsd_wait_for_delegreturn? -Dai ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-18 18:17 ` dai.ngo @ 2023-12-18 19:10 ` Chuck Lever 2023-12-18 20:27 ` dai.ngo 0 siblings, 1 reply; 24+ messages in thread From: Chuck Lever @ 2023-12-18 19:10 UTC (permalink / raw) To: dai.ngo; +Cc: jlayton, linux-nfs, linux-nfs On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@oracle.com wrote: > > On 12/18/23 8:02 AM, Chuck Lever wrote: > > On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote: > > > On 12/15/23 7:57 PM, Chuck Lever wrote: > > What we don't know is why the callback was lost. > > > > - It could be that queue_work() returned false because of a bug. > > Note that there is a WARN_ON_ONCE() that fires in this case: if > > it fired several days before the hang, then we won't see any > > log messages for more recent misqueued work items. > > The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation > recall and not from nfs4_cb_getattr. I suspect this is because of a > possible bug in __break_lease as question for Jeff above. OK, so there's no indication at all if nfsd4_run_cb() fails when NFSD queues CB_GETATTR? No wonder it's a silent failure. > > - It could be that nfsd4_run_cb_work() marked the backchannel down > > but somehow did not wake up any in-flight callback requests. > > > > Let's get more details about what's going on. > > > > > > > > I can add patches to nfsd-fixes to revert CB_GETATTR and let that > > > > sit for a few days while we decide how to move forward. > > > The simplest solution for this particular problem is to use wait with > > > timeout. > > The hard hang was due to an uninterruptible wait, which has now been > > reverted. > > > > Going forward, if there's no wait, there can be no timeout. The > > only approach is to handle errors properly when dispatching a > > callback. > > not even wait for 30ms for well behave client, same as nfsd_wait_for_delegreturn? 30 milliseconds is acceptable. It's very brief and can never result in a shutdown hang. I just don't want a long timeout. -- Chuck Lever ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-18 19:10 ` Chuck Lever @ 2023-12-18 20:27 ` dai.ngo 0 siblings, 0 replies; 24+ messages in thread From: dai.ngo @ 2023-12-18 20:27 UTC (permalink / raw) To: Chuck Lever; +Cc: jlayton, linux-nfs, linux-nfs On 12/18/23 11:10 AM, Chuck Lever wrote: > On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@oracle.com wrote: >> On 12/18/23 8:02 AM, Chuck Lever wrote: >>> On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote: >>>> On 12/15/23 7:57 PM, Chuck Lever wrote: >>> What we don't know is why the callback was lost. >>> >>> - It could be that queue_work() returned false because of a bug. >>> Note that there is a WARN_ON_ONCE() that fires in this case: if >>> it fired several days before the hang, then we won't see any >>> log messages for more recent misqueued work items. >> The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation >> recall and not from nfs4_cb_getattr. I suspect this is because of a >> possible bug in __break_lease as question for Jeff above. > OK, so there's no indication at all if nfsd4_run_cb() fails when > NFSD queues CB_GETATTR? No wonder it's a silent failure. This patch adds a WARN_ON_ONCE just in case, but I don't this condition will ever happen since we already had the test_and_set_bit on CB_GETATTR_BUSY bit so the same CB_GETATTR will not be submitted to workqueue more than once. > > >>> - It could be that nfsd4_run_cb_work() marked the backchannel down >>> but somehow did not wake up any in-flight callback requests. >>> >>> Let's get more details about what's going on. >>> >>> >>>>> I can add patches to nfsd-fixes to revert CB_GETATTR and let that >>>>> sit for a few days while we decide how to move forward. >>>> The simplest solution for this particular problem is to use wait with >>>> timeout. >>> The hard hang was due to an uninterruptible wait, which has now been >>> reverted. >>> >>> Going forward, if there's no wait, there can be no timeout. The >>> only approach is to handle errors properly when dispatching a >>> callback. >> not even wait for 30ms for well behave client, same as nfsd_wait_for_delegreturn? > 30 milliseconds is acceptable. It's very brief and can never result > in a shutdown hang. I just don't want a long timeout. Thanks! I will submit v3 patch with timeout of 30 milliseconds. -Dai ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo 2023-12-15 19:54 ` Chuck Lever @ 2023-12-15 19:54 ` Jeff Layton 2023-12-15 20:18 ` dai.ngo 1 sibling, 1 reply; 24+ messages in thread From: Jeff Layton @ 2023-12-15 19:54 UTC (permalink / raw) To: Dai Ngo, chuck.lever; +Cc: linux-nfs, linux-nfs On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will > also stuck waiting for the callback request to be executed. This causes > the client to hang waiting for the reply of the GETATTR and also causes > the reboot of the NFS server to hang due to the pending NFS request. > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds > time out. > > Reported-by: Wolfgang Walter <linux-nfs@stwm.de> > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > --- > fs/nfsd/nfs4state.c | 6 +++++- > fs/nfsd/state.h | 2 ++ > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 175f3e9f5822..0cc7d4953807 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > return; > > + /* set to proper status when nfsd4_cb_getattr_done runs */ > + ncf->ncf_cb_status = NFS4ERR_IO; > + > refcount_inc(&dp->dl_stid.sc_count); > if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > refcount_dec(&dp->dl_stid.sc_count); > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > nfs4_cb_getattr(&dp->dl_cb_fattr); > spin_unlock(&ctx->flc_lock); > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); The RPC won't necessarily have timed out at this point, and it looks like ncf_cb_status won't have been set to anything (and is probably still 0?). Don't you need to check whether the wait timed out or was successful? What happens now when this times out? > if (ncf->ncf_cb_status) { > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); > if (status != nfserr_jukebox || > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > index f96eaa8e9413..94563a6813a6 100644 > --- a/fs/nfsd/state.h > +++ b/fs/nfsd/state.h > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { > /* bits for ncf_cb_flags */ > #define CB_GETATTR_BUSY 0 > > +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ > + Why 20s? > /* > * Represents a delegation stateid. The nfs4_client holds references to these > * and they are put when it is being destroyed or when the delegation is -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 19:54 ` Jeff Layton @ 2023-12-15 20:18 ` dai.ngo 2023-12-15 20:25 ` Jeff Layton 0 siblings, 1 reply; 24+ messages in thread From: dai.ngo @ 2023-12-15 20:18 UTC (permalink / raw) To: Jeff Layton, chuck.lever; +Cc: linux-nfs, linux-nfs On 12/15/23 11:54 AM, Jeff Layton wrote: > On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: >> If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will >> also stuck waiting for the callback request to be executed. This causes >> the client to hang waiting for the reply of the GETATTR and also causes >> the reboot of the NFS server to hang due to the pending NFS request. >> >> Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds >> time out. >> >> Reported-by: Wolfgang Walter <linux-nfs@stwm.de> >> Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") >> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> >> --- >> fs/nfsd/nfs4state.c | 6 +++++- >> fs/nfsd/state.h | 2 ++ >> 2 files changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >> index 175f3e9f5822..0cc7d4953807 100644 >> --- a/fs/nfsd/nfs4state.c >> +++ b/fs/nfsd/nfs4state.c >> @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) >> if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) >> return; >> >> + /* set to proper status when nfsd4_cb_getattr_done runs */ >> + ncf->ncf_cb_status = NFS4ERR_IO; >> + >> refcount_inc(&dp->dl_stid.sc_count); >> if (!nfsd4_run_cb(&ncf->ncf_getattr)) { >> refcount_dec(&dp->dl_stid.sc_count); >> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, >> nfs4_cb_getattr(&dp->dl_cb_fattr); >> spin_unlock(&ctx->flc_lock); >> >> - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); >> + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, >> + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > The RPC won't necessarily have timed out at this point, and it looks > like ncf_cb_status won't have been set to anything (and is probably > still 0?). The timeout was added to handle the case where the callback request did not get queued to the workqueue; nfsd4_run_cb fails. In this case RPC is not involved and we don't want to hang here. Note that this patch sets ncf_cb_status to NFS4ERR_IO before calling nfsd4_run_cb so we can detect this error condition. > > Don't you need to check whether the wait timed out or was successful? ncf_cb_status is set to tk_status by nfsd4_cb_getattr_done. If the request was successful then ncf_cb_status is 0. > What happens now when this times out? Then we go through the normal logic of nfsd_open_break_lease which will also get timed out but eventually the lease, delegation state, will be removed by __break_lease after 45 secs (lease_break_time). -Dai > > >> if (ncf->ncf_cb_status) { >> status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); >> if (status != nfserr_jukebox || >> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h >> index f96eaa8e9413..94563a6813a6 100644 >> --- a/fs/nfsd/state.h >> +++ b/fs/nfsd/state.h >> @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { >> /* bits for ncf_cb_flags */ >> #define CB_GETATTR_BUSY 0 >> >> +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ >> + > Why 20s? RPC will time out after 9 secs if it does not receive a callback reply. This time out value needs to be greater than 9 secs. I just be generous here, we can reduce it to any value > 9 secs. -Dai > >> /* >> * Represents a delegation stateid. The nfs4_client holds references to these >> * and they are put when it is being destroyed or when the delegation is > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck 2023-12-15 20:18 ` dai.ngo @ 2023-12-15 20:25 ` Jeff Layton 0 siblings, 0 replies; 24+ messages in thread From: Jeff Layton @ 2023-12-15 20:25 UTC (permalink / raw) To: dai.ngo, chuck.lever; +Cc: linux-nfs, linux-nfs On Fri, 2023-12-15 at 12:18 -0800, dai.ngo@oracle.com wrote: > On 12/15/23 11:54 AM, Jeff Layton wrote: > > On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote: > > > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will > > > also stuck waiting for the callback request to be executed. This causes > > > the client to hang waiting for the reply of the GETATTR and also causes > > > the reboot of the NFS server to hang due to the pending NFS request. > > > > > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds > > > time out. > > > > > > Reported-by: Wolfgang Walter <linux-nfs@stwm.de> > > > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation") > > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com> > > > --- > > > fs/nfsd/nfs4state.c | 6 +++++- > > > fs/nfsd/state.h | 2 ++ > > > 2 files changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 175f3e9f5822..0cc7d4953807 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf) > > > if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags)) > > > return; > > > > > > + /* set to proper status when nfsd4_cb_getattr_done runs */ > > > + ncf->ncf_cb_status = NFS4ERR_IO; > > > + > > > refcount_inc(&dp->dl_stid.sc_count); > > > if (!nfsd4_run_cb(&ncf->ncf_getattr)) { > > > refcount_dec(&dp->dl_stid.sc_count); > > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode, > > > nfs4_cb_getattr(&dp->dl_cb_fattr); > > > spin_unlock(&ctx->flc_lock); > > > > > > - wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE); > > > + wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, > > > + TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT); > > The RPC won't necessarily have timed out at this point, and it looks > > like ncf_cb_status won't have been set to anything (and is probably > > still 0?). > > The timeout was added to handle the case where the callback request > did not get queued to the workqueue; nfsd4_run_cb fails. In this case > RPC is not involved and we don't want to hang here. Note that this patch > sets ncf_cb_status to NFS4ERR_IO before calling nfsd4_run_cb so we can > detect this error condition. > Ok, I missed that bit, thanks. > > > > Don't you need to check whether the wait timed out or was successful? > > ncf_cb_status is set to tk_status by nfsd4_cb_getattr_done. If the request > was successful then ncf_cb_status is 0. > > > What happens now when this times out? > > Then we go through the normal logic of nfsd_open_break_lease which will > also get timed out but eventually the lease, delegation state, will be > removed by __break_lease after 45 secs (lease_break_time). > > -Dai > > > > > > > > if (ncf->ncf_cb_status) { > > > status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ)); > > > if (status != nfserr_jukebox || > > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > > > index f96eaa8e9413..94563a6813a6 100644 > > > --- a/fs/nfsd/state.h > > > +++ b/fs/nfsd/state.h > > > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr { > > > /* bits for ncf_cb_flags */ > > > #define CB_GETATTR_BUSY 0 > > > > > > +#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */ > > > + > > Why 20s? > > RPC will time out after 9 secs if it does not receive a callback reply. > This time out value needs to be greater than 9 secs. I just be generous > here, we can reduce it to any value > 9 secs. > > -Dai > > > > > > /* > > > * Represents a delegation stateid. The nfs4_client holds references to these > > > * and they are put when it is being destroyed or when the delegation is > > Given that: Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2023-12-18 20:29 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo 2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo 2023-12-15 19:37 ` Jeff Layton 2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo 2023-12-15 19:42 ` Jeff Layton 2023-12-15 20:00 ` dai.ngo 2023-12-15 20:15 ` Jeff Layton 2023-12-15 20:22 ` dai.ngo 2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo 2023-12-15 19:54 ` Chuck Lever 2023-12-15 20:40 ` dai.ngo 2023-12-15 21:41 ` Chuck Lever 2023-12-15 21:55 ` dai.ngo 2023-12-16 1:21 ` Chuck Lever 2023-12-16 3:18 ` dai.ngo 2023-12-16 3:57 ` Chuck Lever 2023-12-16 22:44 ` dai.ngo 2023-12-18 16:02 ` Chuck Lever 2023-12-18 18:17 ` dai.ngo 2023-12-18 19:10 ` Chuck Lever 2023-12-18 20:27 ` dai.ngo 2023-12-15 19:54 ` Jeff Layton 2023-12-15 20:18 ` dai.ngo 2023-12-15 20:25 ` Jeff Layton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox