* [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru()
@ 2024-04-08 2:09 NeilBrown
2024-04-08 2:09 ` [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the one place NeilBrown
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: NeilBrown @ 2024-04-08 2:09 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey
This series replaces
nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru()
which was recently dropped after a problem was found with it.
The first two patches rearrange code with no significant functional change.
The last two address the two related problems: two different mutexes
(rp_mutex and st_mutex) are each held while waiting for a reference count
to drop, and each can trigger a deadlock.
This is against v6.9-rc2.
Thanks,
NeilBrown
[PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the
[PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open
[PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in
[PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru()
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the one place. 2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown @ 2024-04-08 2:09 ` NeilBrown 2024-04-08 2:09 ` [PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown ` (3 subsequent siblings) 4 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-04-08 2:09 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey Currently find_openstateowner_str look ups are done both in nfsd4_process_open1() and alloc_init_open_stateowner() - the latter possibly being a surprise based on its name. It would be easier to follow, and more conformant to common patterns, if the lookup was all in the one place. So replace alloc_init_open_stateowner() with find_or_alloc_open_stateowner() and use the latter in nfsd4_process_open1() without any calls to find_openstateowner_str(). This means all finds are find_openstateowner_str_locked() and find_openstateowner_str() is no longer needed. So discard find_openstateowner_str() and rename find_openstateowner_str_locked() to find_openstateowner_str(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 93 +++++++++++++++++++-------------------------- 1 file changed, 40 insertions(+), 53 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 1824a30e7dd4..6ac1dff29d42 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -541,7 +541,7 @@ same_owner_str(struct nfs4_stateowner *sop, struct xdr_netobj *owner) } static struct nfs4_openowner * -find_openstateowner_str_locked(unsigned int hashval, struct nfsd4_open *open, +find_openstateowner_str(unsigned int hashval, struct nfsd4_open *open, struct nfs4_client *clp) { struct nfs4_stateowner *so; @@ -558,18 +558,6 @@ find_openstateowner_str_locked(unsigned int hashval, struct nfsd4_open *open, return NULL; } -static struct nfs4_openowner * -find_openstateowner_str(unsigned int hashval, struct nfsd4_open *open, - struct nfs4_client *clp) -{ - struct nfs4_openowner *oo; - - spin_lock(&clp->cl_lock); - oo = find_openstateowner_str_locked(hashval, open, clp); - spin_unlock(&clp->cl_lock); - return oo; -} - static inline u32 opaque_hashval(const void *ptr, int nbytes) { @@ -4880,34 +4868,46 @@ nfsd4_find_and_lock_existing_open(struct nfs4_file *fp, struct nfsd4_open *open) } static struct nfs4_openowner * -alloc_init_open_stateowner(unsigned int strhashval, struct nfsd4_open *open, - struct nfsd4_compound_state *cstate) +find_or_alloc_open_stateowner(unsigned int strhashval, struct nfsd4_open *open, + struct nfsd4_compound_state *cstate) { struct nfs4_client *clp = cstate->clp; - struct nfs4_openowner *oo, *ret; + struct nfs4_openowner *oo, *new = NULL; - oo = alloc_stateowner(openowner_slab, &open->op_owner, clp); - if (!oo) - return NULL; - oo->oo_owner.so_ops = &openowner_ops; - oo->oo_owner.so_is_open_owner = 1; - oo->oo_owner.so_seqid = open->op_seqid; - oo->oo_flags = 0; - if (nfsd4_has_session(cstate)) - oo->oo_flags |= NFS4_OO_CONFIRMED; - oo->oo_time = 0; - oo->oo_last_closed_stid = NULL; - INIT_LIST_HEAD(&oo->oo_close_lru); +retry: spin_lock(&clp->cl_lock); - ret = find_openstateowner_str_locked(strhashval, open, clp); - if (ret == NULL) { - hash_openowner(oo, clp, strhashval); - ret = oo; - } else - nfs4_free_stateowner(&oo->oo_owner); - + oo = find_openstateowner_str(strhashval, open, 
clp); + if (!oo && new) { + hash_openowner(new, clp, strhashval); + spin_unlock(&clp->cl_lock); + return new; + } spin_unlock(&clp->cl_lock); - return ret; + + if (oo && !(oo->oo_flags & NFS4_OO_CONFIRMED)) { + /* Replace unconfirmed owners without checking for replay. */ + release_openowner(oo); + oo = NULL; + } + if (oo) { + if (new) + nfs4_free_stateowner(&new->oo_owner); + return oo; + } + + new = alloc_stateowner(openowner_slab, &open->op_owner, clp); + if (!new) + return NULL; + new->oo_owner.so_ops = &openowner_ops; + new->oo_owner.so_is_open_owner = 1; + new->oo_owner.so_seqid = open->op_seqid; + new->oo_flags = 0; + if (nfsd4_has_session(cstate)) + new->oo_flags |= NFS4_OO_CONFIRMED; + new->oo_time = 0; + new->oo_last_closed_stid = NULL; + INIT_LIST_HEAD(&new->oo_close_lru); + goto retry; } static struct nfs4_ol_stateid * @@ -5356,27 +5356,14 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, clp = cstate->clp; strhashval = ownerstr_hashval(&open->op_owner); - oo = find_openstateowner_str(strhashval, open, clp); + oo = find_or_alloc_open_stateowner(strhashval, open, cstate); open->op_openowner = oo; - if (!oo) { - goto new_owner; - } - if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) { - /* Replace unconfirmed owners without checking for replay. */ - release_openowner(oo); - open->op_openowner = NULL; - goto new_owner; - } + if (!oo) + return nfserr_jukebox; status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); if (status) return status; - goto alloc_stateid; -new_owner: - oo = alloc_init_open_stateowner(strhashval, open, cstate); - if (oo == NULL) - return nfserr_jukebox; - open->op_openowner = oo; -alloc_stateid: + open->op_stp = nfs4_alloc_open_stateid(clp); if (!open->op_stp) return nfserr_jukebox; -- 2.44.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
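The control flow that find_or_alloc_open_stateowner() introduces above is the
common find-or-allocate shape: look up under the lock, allocate with no locks
held, then retry the lookup so that a racing inserter wins cleanly.  The
fragment below is only an illustrative, self-contained userspace sketch of
that shape, not the nfsd code; struct obj, table_lock and the helper names are
invented for the example, and the real function additionally releases
unconfirmed owners before deciding whether to reuse an existing one.

#include <pthread.h>
#include <stdlib.h>

struct obj {
	struct obj *next;
	int key;
};

static struct obj *table;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

static struct obj *lookup_locked(int key)
{
	struct obj *o;

	for (o = table; o; o = o->next)
		if (o->key == key)
			return o;
	return NULL;
}

/* Look up under the lock, allocate outside it, then retry the lookup. */
struct obj *find_or_alloc(int key)
{
	struct obj *o, *new = NULL;

retry:
	pthread_mutex_lock(&table_lock);
	o = lookup_locked(key);
	if (!o && new) {
		new->next = table;		/* our preallocated object wins */
		table = new;
		pthread_mutex_unlock(&table_lock);
		return new;
	}
	pthread_mutex_unlock(&table_lock);

	if (o) {
		free(new);			/* lost the race, or never allocated */
		return o;
	}

	new = calloc(1, sizeof(*new));		/* allocate with no locks held */
	if (!new)
		return NULL;
	new->key = key;
	goto retry;
}

The benefit of this shape, and of the patch, is that every lookup goes through
the one locked helper, so there is no separate unlocked lookup variant to keep
in sync.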
* [PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling. 2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown 2024-04-08 2:09 ` [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the one place NeilBrown @ 2024-04-08 2:09 ` NeilBrown 2024-04-08 2:09 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown ` (2 subsequent siblings) 4 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-04-08 2:09 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey Rather than taking the rp_mutex (via nfsd4_cstate_assign_replay) in nfsd4_cleanup_open_state() (which seems counter-intuitive), take it and assign rp_owner as soon as possible - in nfsd4_process_open1(). This will support a future change when nfsd4_cstate_assign_replay() might fail. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 6ac1dff29d42..2ff8cb50f2e3 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -5360,6 +5360,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, open->op_openowner = oo; if (!oo) return nfserr_jukebox; + nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); if (status) return status; @@ -6134,12 +6135,8 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf void nfsd4_cleanup_open_state(struct nfsd4_compound_state *cstate, struct nfsd4_open *open) { - if (open->op_openowner) { - struct nfs4_stateowner *so = &open->op_openowner->oo_owner; - - nfsd4_cstate_assign_replay(cstate, so); - nfs4_put_stateowner(so); - } + if (open->op_openowner) + nfs4_put_stateowner(&open->op_openowner->oo_owner); if (open->op_file) kmem_cache_free(file_slab, open->op_file); if (open->op_stp) -- 2.44.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown 2024-04-08 2:09 ` [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the one place NeilBrown 2024-04-08 2:09 ` [PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown @ 2024-04-08 2:09 ` NeilBrown 2024-04-08 2:09 ` [PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru() NeilBrown 2024-04-09 2:09 ` [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() Chuck Lever 4 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-04-08 2:09 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- fs/nfsd/state.h | 2 +- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 2ff8cb50f2e3..5dbd1a73bc28 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -4664,21 +4664,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) atomic_set(&nn->nfsd_courtesy_clients, 0); } +enum rp_lock { + RP_UNLOCKED, + RP_LOCKED, + RP_UNHASHED, +}; + static void init_nfs4_replay(struct nfs4_replay *rp) { rp->rp_status = nfserr_serverfault; rp->rp_buflen = 0; rp->rp_buf = rp->rp_ibuf; - mutex_init(&rp->rp_mutex); + atomic_set(&rp->rp_locked, RP_UNLOCKED); } -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, - struct nfs4_stateowner *so) +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, + struct nfs4_stateowner *so) { if (!nfsd4_has_session(cstate)) { - mutex_lock(&so->so_replay.rp_mutex); + wait_var_event(&so->so_replay.rp_locked, + atomic_cmpxchg(&so->so_replay.rp_locked, + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) + return -EAGAIN; cstate->replay_owner = nfs4_get_stateowner(so); } + return 0; } void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) @@ -4687,7 +4698,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) if (so != NULL) { cstate->replay_owner = NULL; - mutex_unlock(&so->so_replay.rp_mutex); + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); + wake_up_var(&so->so_replay.rp_locked); nfs4_put_stateowner(so); } } @@ -4983,7 +4995,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) * Wait for the refcount to drop to 2. Since it has been unhashed, * there should be no danger of the refcount going back up again at * this point. + * Some threads with a reference might be waiting for rp_locked, + * so tell them to stop waiting. 
*/ + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); + wake_up_var(&oo->oo_owner.so_replay.rp_locked); wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); release_all_access(s); @@ -5356,11 +5372,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, clp = cstate->clp; strhashval = ownerstr_hashval(&open->op_owner); +retry: oo = find_or_alloc_open_stateowner(strhashval, open, cstate); open->op_openowner = oo; if (!oo) return nfserr_jukebox; - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { + nfs4_put_stateowner(&oo->oo_owner); + goto retry; + } status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); if (status) return status; @@ -7200,12 +7220,16 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, trace_nfsd_preprocess(seqid, stateid); *stpp = NULL; +retry: status = nfsd4_lookup_stateid(cstate, stateid, typemask, statusmask, &s, nn); if (status) return status; stp = openlockstateid(s); - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { + nfs4_put_stateowner(stp->st_stateowner); + goto retry; + } status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); if (!status) diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index 7f33b89f3b2c..f42d8d782c84 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -488,7 +488,7 @@ struct nfs4_replay { unsigned int rp_buflen; char *rp_buf; struct knfsd_fh rp_openfh; - struct mutex rp_mutex; + atomic_t rp_locked; char rp_ibuf[NFSD4_REPLAY_ISIZE]; }; -- 2.44.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
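The heart of patch 3/4 is the tri-state value that replaces rp_mutex.  The
sketch below renders the same three states and transitions with an ordinary
mutex and condition variable so that it is self-contained and compilable; the
actual patch uses atomic_cmpxchg() together with wait_var_event() and
wake_up_var(), and the helper names here (replay_lock, replay_unlock,
replay_abort) are invented for the example.

#include <pthread.h>
#include <stdbool.h>

enum rp_lock { RP_UNLOCKED, RP_LOCKED, RP_UNHASHED };

struct replay {
	enum rp_lock state;
	pthread_mutex_t m;
	pthread_cond_t cv;
};

/* "Lock for reply".  Returns false if the owner was unhashed while we
 * waited; the caller then drops its reference and redoes the lookup,
 * which is the -EAGAIN path in nfsd4_cstate_assign_replay(). */
bool replay_lock(struct replay *rp)
{
	bool locked;

	pthread_mutex_lock(&rp->m);
	while (rp->state == RP_LOCKED)
		pthread_cond_wait(&rp->cv, &rp->m);
	locked = (rp->state == RP_UNLOCKED);
	if (locked)
		rp->state = RP_LOCKED;
	pthread_mutex_unlock(&rp->m);
	return locked;
}

/* Ordinary unlock, as in nfsd4_cstate_clear_replay(). */
void replay_unlock(struct replay *rp)
{
	pthread_mutex_lock(&rp->m);
	rp->state = RP_UNLOCKED;
	pthread_cond_broadcast(&rp->cv);
	pthread_mutex_unlock(&rp->m);
}

/* Abort, as in move_to_close_lru(): the owner is unhashed, so current
 * and future waiters stop waiting and retry their lookup instead. */
void replay_abort(struct replay *rp)
{
	pthread_mutex_lock(&rp->m);
	rp->state = RP_UNHASHED;
	pthread_cond_broadcast(&rp->cv);
	pthread_mutex_unlock(&rp->m);
}

Like the mutex it replaces, this lock is only held for the duration of one
request; the difference is that a waiter can now be told that the owner is
gone and should redo its lookup rather than block forever.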
* [PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru() 2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown ` (2 preceding siblings ...) 2024-04-08 2:09 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown @ 2024-04-08 2:09 ` NeilBrown 2024-04-09 2:09 ` [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() Chuck Lever 4 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-04-08 2:09 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey move_to_close_lru() is currently called with ->st_mutex held. This can lead to a deadlock as move_to_close_lru() waits for sc_count to drop to 2, and some threads holding a reference might be waiting for the mutex. These references will never be dropped so sc_count will never reach 2. There can be no harm in dropping ->st_mutex before move_to_close_lru() because the only place that takes the mutex is nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called. See also https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/ where this problem was raised but not successfully resolved. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 5dbd1a73bc28..4b7fc0644f89 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -7371,7 +7371,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp, return status; } -static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s) +static bool nfsd4_close_open_stateid(struct nfs4_ol_stateid *s) { struct nfs4_client *clp = s->st_stid.sc_client; bool unhashed; @@ -7388,11 +7388,11 @@ static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s) list_for_each_entry(stp, &reaplist, st_locks) nfs4_free_cpntf_statelist(clp->net, &stp->st_stid); free_ol_stateid_reaplist(&reaplist); + return false; } else { spin_unlock(&clp->cl_lock); free_ol_stateid_reaplist(&reaplist); - if (unhashed) - move_to_close_lru(s, clp->net); + return unhashed; } } @@ -7408,6 +7408,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, struct nfs4_ol_stateid *stp; struct net *net = SVC_NET(rqstp); struct nfsd_net *nn = net_generic(net, nfsd_net_id); + bool need_move_to_close_list; dprintk("NFSD: nfsd4_close on file %pd\n", cstate->current_fh.fh_dentry); @@ -7432,8 +7433,10 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, */ nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid); - nfsd4_close_open_stateid(stp); + need_move_to_close_list = nfsd4_close_open_stateid(stp); mutex_unlock(&stp->st_mutex); + if (need_move_to_close_list) + move_to_close_lru(stp, net); /* v4.1+ suggests that we send a special stateid in here, since the * clients should just ignore this anyway. Since this is not useful -- 2.44.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
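Patch 4/4 then reduces to a small restructuring of the close path: decide
whether the slow step is needed while st_mutex is held, but perform it only
after the mutex has been dropped.  The toy version below shows just that
shape; the struct fields and function bodies are placeholders, with the real
unhashing, reaping and waiting elided.

#include <pthread.h>
#include <stdbool.h>

struct stateid {
	pthread_mutex_t st_mutex;
	bool hashed;
};

/* Unhash the stateid; report whether the caller still owes the slow step. */
static bool close_stateid(struct stateid *s)
{
	bool was_hashed = s->hashed;

	s->hashed = false;
	return was_hashed;
}

/* In nfsd this sleeps until every other holder has put its reference. */
static void move_to_close_lru(struct stateid *s)
{
	(void)s;	/* the waiting is elided in this sketch */
}

void do_close(struct stateid *s)
{
	bool need_move;

	pthread_mutex_lock(&s->st_mutex);
	need_move = close_stateid(s);
	pthread_mutex_unlock(&s->st_mutex);
	if (need_move)
		move_to_close_lru(s);	/* no mutex held while we wait */
}

Because nothing sleeps while st_mutex is held, a thread blocked on the mutex
(nfsd4_lock_ol_stateid() is the only other taker) can acquire it, see
NFS4_CLOSED_STID, bail out and put its reference, which is what lets the
sc_count wait in move_to_close_lru() complete.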
* Re: [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() 2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown ` (3 preceding siblings ...) 2024-04-08 2:09 ` [PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru() NeilBrown @ 2024-04-09 2:09 ` Chuck Lever 4 siblings, 0 replies; 19+ messages in thread From: Chuck Lever @ 2024-04-09 2:09 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, linux-nfs@vger.kernel.org, Olga Kornievskaia, Dai Ngo, Tom Talpey On Sun, Apr 07, 2024 at 10:09:14PM -0400, NeilBrown wrote: > This series replaces > nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru() > which was recently dropped as a problem was found. > The first two patches rearrange code without important functional change. > The last two address the two relaced problems of two different mutexes which are > held while waiting and can each trigger a deadlock. > > This is against v6.9-rc2. > > Thanks, > NeilBrown > > [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the > [PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open > [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in > [PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru() Applied to nfsd-next. Thank you, Neil! -- Chuck Lever ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 0/4 v3] nfsd: fix deadlock in move_to_close_lru() @ 2024-03-04 22:45 NeilBrown 2024-03-04 22:45 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid " NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2024-03-04 22:45 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey This is very similar to v2 with: - first two patches swapped - while(1) loop changed to "goto retry" - add a comment to patch 3 - rebase on nfsd-next - rb from Jeff added. NeilBrown [PATCH 1/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open [PATCH 2/4] nfsd: perform all find_openstateowner_str calls in the [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in [PATCH 4/4] nfsd: drop st_mutex_mutex before calling ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 22:45 [PATCH 0/4 v3] " NeilBrown @ 2024-03-04 22:45 ` NeilBrown 0 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-03-04 22:45 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 43 ++++++++++++++++++++++++++++++++++++------- fs/nfsd/state.h | 2 +- 2 files changed, 37 insertions(+), 8 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 0c1058e8cc4b..52838db0c46e 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -4639,21 +4639,37 @@ nfsd4_init_leases_net(struct nfsd_net *nn) atomic_set(&nn->nfsd_courtesy_clients, 0); } +enum rp_lock { + RP_UNLOCKED, + RP_LOCKED, + RP_UNHASHED, +}; + static void init_nfs4_replay(struct nfs4_replay *rp) { rp->rp_status = nfserr_serverfault; rp->rp_buflen = 0; rp->rp_buf = rp->rp_ibuf; - mutex_init(&rp->rp_mutex); + atomic_set(&rp->rp_locked, RP_UNLOCKED); } -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, - struct nfs4_stateowner *so) +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, + struct nfs4_stateowner *so) { if (!nfsd4_has_session(cstate)) { - mutex_lock(&so->so_replay.rp_mutex); + /* + * rp_locked is an open-coded mutex which is unlocked in + * nfsd4_cstate_clear_replay() or aborted in + * move_to_close_lru(). + */ + wait_var_event(&so->so_replay.rp_locked, + atomic_cmpxchg(&so->so_replay.rp_locked, + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) + return -EAGAIN; cstate->replay_owner = nfs4_get_stateowner(so); } + return 0; } void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) @@ -4662,7 +4678,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) if (so != NULL) { cstate->replay_owner = NULL; - mutex_unlock(&so->so_replay.rp_mutex); + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); + wake_up_var(&so->so_replay.rp_locked); nfs4_put_stateowner(so); } } @@ -4958,7 +4975,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) * Wait for the refcount to drop to 2. Since it has been unhashed, * there should be no danger of the refcount going back up again at * this point. + * Some threads with a reference might be waiting for rp_locked, + * so tell them to stop waiting. 
*/ + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); + wake_up_var(&oo->oo_owner.so_replay.rp_locked); wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); release_all_access(s); @@ -5331,11 +5352,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, clp = cstate->clp; strhashval = ownerstr_hashval(&open->op_owner); +retry: oo = find_or_alloc_open_stateowner(strhashval, open, cstate); open->op_openowner = oo; if (!oo) return nfserr_jukebox; - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { + nfs4_put_stateowner(&oo->oo_owner); + goto retry; + } status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); if (status) return status; @@ -7175,12 +7200,16 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, trace_nfsd_preprocess(seqid, stateid); *stpp = NULL; +retry: status = nfsd4_lookup_stateid(cstate, stateid, typemask, statusmask, &s, nn); if (status) return status; stp = openlockstateid(s); - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { + nfs4_put_stateowner(stp->st_stateowner); + goto retry; + } status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); if (!status) diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index 01c6f3445646..0400441c87c1 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -486,7 +486,7 @@ struct nfs4_replay { unsigned int rp_buflen; char *rp_buf; struct knfsd_fh rp_openfh; - struct mutex rp_mutex; + atomic_t rp_locked; char rp_ibuf[NFSD4_REPLAY_ISIZE]; }; -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 0/4 v2] nfsd: fix dadlock in move_to_close_lru() @ 2024-03-04 4:40 NeilBrown 2024-03-04 4:40 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock " NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2024-03-04 4:40 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey This series replaces nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru() which was recently dropped as a problem was found. The first two patches rearrange code without important functional change. The last two address the two relaced problems of two different mutexes which are held while waiting and can each trigger a deadlock. Thanks, NeilBrown [PATCH 1/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open [PATCH 2/4] nfsd: perform all find_openstateowner_str calls in the [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in [PATCH 4/4] nfsd: drop st_mutex_mutex before calling ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 4:40 [PATCH 0/4 v2] nfsd: fix dadlock " NeilBrown @ 2024-03-04 4:40 ` NeilBrown 2024-03-04 12:50 ` Jeff Layton 2024-03-04 14:09 ` Chuck Lever 0 siblings, 2 replies; 19+ messages in thread From: NeilBrown @ 2024-03-04 4:40 UTC (permalink / raw) To: Chuck Lever, Jeff Layton Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- fs/nfsd/state.h | 2 +- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 690d0e697320..47e879d5d68a 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) atomic_set(&nn->nfsd_courtesy_clients, 0); } +enum rp_lock { + RP_UNLOCKED, + RP_LOCKED, + RP_UNHASHED, +}; + static void init_nfs4_replay(struct nfs4_replay *rp) { rp->rp_status = nfserr_serverfault; rp->rp_buflen = 0; rp->rp_buf = rp->rp_ibuf; - mutex_init(&rp->rp_mutex); + atomic_set(&rp->rp_locked, RP_UNLOCKED); } -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, - struct nfs4_stateowner *so) +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, + struct nfs4_stateowner *so) { if (!nfsd4_has_session(cstate)) { - mutex_lock(&so->so_replay.rp_mutex); + wait_var_event(&so->so_replay.rp_locked, + atomic_cmpxchg(&so->so_replay.rp_locked, + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) + return -EAGAIN; cstate->replay_owner = nfs4_get_stateowner(so); } + return 0; } void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) if (so != NULL) { cstate->replay_owner = NULL; - mutex_unlock(&so->so_replay.rp_mutex); + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); + wake_up_var(&so->so_replay.rp_locked); nfs4_put_stateowner(so); } } @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) * Wait for the refcount to drop to 2. Since it has been unhashed, * there should be no danger of the refcount going back up again at * this point. + * Some threads with a reference might be waiting for rp_locked, + * so tell them to stop waiting. 
*/ + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); + wake_up_var(&oo->oo_owner.so_replay.rp_locked); wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); release_all_access(s); @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, clp = cstate->clp; strhashval = ownerstr_hashval(&open->op_owner); +retry: oo = find_or_alloc_open_stateowner(strhashval, open, cstate); open->op_openowner = oo; if (!oo) return nfserr_jukebox; - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { + nfs4_put_stateowner(&oo->oo_owner); + goto retry; + } status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); if (status) return status; @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, trace_nfsd_preprocess(seqid, stateid); *stpp = NULL; +retry: status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); if (status) return status; stp = openlockstateid(s); - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { + nfs4_put_stateowner(stp->st_stateowner); + goto retry; + } status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); if (!status) diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index 41bdc913fa71..6a3becd86112 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -446,7 +446,7 @@ struct nfs4_replay { unsigned int rp_buflen; char *rp_buf; struct knfsd_fh rp_openfh; - struct mutex rp_mutex; + atomic_t rp_locked; char rp_ibuf[NFSD4_REPLAY_ISIZE]; }; -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 4:40 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock " NeilBrown @ 2024-03-04 12:50 ` Jeff Layton 2024-03-04 14:09 ` Chuck Lever 1 sibling, 0 replies; 19+ messages in thread From: Jeff Layton @ 2024-03-04 12:50 UTC (permalink / raw) To: NeilBrown, Chuck Lever; +Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Mon, 2024-03-04 at 15:40 +1100, NeilBrown wrote: > move_to_close_lru() waits for sc_count to become zero while holding > rp_mutex. This can deadlock if another thread holds a reference and is > waiting for rp_mutex. > > By the time we get to move_to_close_lru() the openowner is unhashed and > cannot be found any more. So code waiting for the mutex can safely > retry the lookup if move_to_close_lru() has started. > > So change rp_mutex to an atomic_t with three states: > > RP_UNLOCK - state is still hashed, not locked for reply > RP_LOCKED - state is still hashed, is locked for reply > RP_UNHASHED - state is not hashed, no code can get a lock. > > Use wait_var_event() to wait for either a lock, or for the owner to be > unhashed. In the latter case, retry the lookup. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > fs/nfsd/state.h | 2 +- > 2 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 690d0e697320..47e879d5d68a 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > atomic_set(&nn->nfsd_courtesy_clients, 0); > } > > +enum rp_lock { > + RP_UNLOCKED, > + RP_LOCKED, > + RP_UNHASHED, > +}; > + > static void init_nfs4_replay(struct nfs4_replay *rp) > { > rp->rp_status = nfserr_serverfault; > rp->rp_buflen = 0; > rp->rp_buf = rp->rp_ibuf; > - mutex_init(&rp->rp_mutex); > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > } > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > - struct nfs4_stateowner *so) > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > + struct nfs4_stateowner *so) > { > if (!nfsd4_has_session(cstate)) { > - mutex_lock(&so->so_replay.rp_mutex); > + wait_var_event(&so->so_replay.rp_locked, > + atomic_cmpxchg(&so->so_replay.rp_locked, > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) > + return -EAGAIN; > cstate->replay_owner = nfs4_get_stateowner(so); > } > + return 0; > } > > void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > if (so != NULL) { > cstate->replay_owner = NULL; > - mutex_unlock(&so->so_replay.rp_mutex); > + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); > + wake_up_var(&so->so_replay.rp_locked); > nfs4_put_stateowner(so); > } > } > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) > * Wait for the refcount to drop to 2. Since it has been unhashed, > * there should be no danger of the refcount going back up again at > * this point. > + * Some threads with a reference might be waiting for rp_locked, > + * so tell them to stop waiting. 
> */ > + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); > + wake_up_var(&oo->oo_owner.so_replay.rp_locked); > wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); > > release_all_access(s); > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, > clp = cstate->clp; > > strhashval = ownerstr_hashval(&open->op_owner); > +retry: > oo = find_or_alloc_open_stateowner(strhashval, open, cstate); > open->op_openowner = oo; > if (!oo) > return nfserr_jukebox; > - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); > + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { > + nfs4_put_stateowner(&oo->oo_owner); > + goto retry; > + } > status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); > if (status) > return status; > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, > trace_nfsd_preprocess(seqid, stateid); > > *stpp = NULL; > +retry: > status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); > if (status) > return status; > stp = openlockstateid(s); > - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); > + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { > + nfs4_put_stateowner(stp->st_stateowner); > + goto retry; > + } > > status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); > if (!status) > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > index 41bdc913fa71..6a3becd86112 100644 > --- a/fs/nfsd/state.h > +++ b/fs/nfsd/state.h > @@ -446,7 +446,7 @@ struct nfs4_replay { > unsigned int rp_buflen; > char *rp_buf; > struct knfsd_fh rp_openfh; > - struct mutex rp_mutex; > + atomic_t rp_locked; > char rp_ibuf[NFSD4_REPLAY_ISIZE]; > }; > Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 4:40 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock " NeilBrown 2024-03-04 12:50 ` Jeff Layton @ 2024-03-04 14:09 ` Chuck Lever 2024-03-04 21:45 ` NeilBrown 1 sibling, 1 reply; 19+ messages in thread From: Chuck Lever @ 2024-03-04 14:09 UTC (permalink / raw) To: NeilBrown; +Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > move_to_close_lru() waits for sc_count to become zero while holding > rp_mutex. This can deadlock if another thread holds a reference and is > waiting for rp_mutex. > > By the time we get to move_to_close_lru() the openowner is unhashed and > cannot be found any more. So code waiting for the mutex can safely > retry the lookup if move_to_close_lru() has started. > > So change rp_mutex to an atomic_t with three states: > > RP_UNLOCK - state is still hashed, not locked for reply > RP_LOCKED - state is still hashed, is locked for reply > RP_UNHASHED - state is not hashed, no code can get a lock. > > Use wait_var_event() to wait for either a lock, or for the owner to be > unhashed. In the latter case, retry the lookup. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > fs/nfsd/state.h | 2 +- > 2 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 690d0e697320..47e879d5d68a 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > atomic_set(&nn->nfsd_courtesy_clients, 0); > } > > +enum rp_lock { > + RP_UNLOCKED, > + RP_LOCKED, > + RP_UNHASHED, > +}; > + > static void init_nfs4_replay(struct nfs4_replay *rp) > { > rp->rp_status = nfserr_serverfault; > rp->rp_buflen = 0; > rp->rp_buf = rp->rp_ibuf; > - mutex_init(&rp->rp_mutex); > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > } > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > - struct nfs4_stateowner *so) > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > + struct nfs4_stateowner *so) > { > if (!nfsd4_has_session(cstate)) { > - mutex_lock(&so->so_replay.rp_mutex); > + wait_var_event(&so->so_replay.rp_locked, > + atomic_cmpxchg(&so->so_replay.rp_locked, > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); What reliably prevents this from being a "wait forever" ? > + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) > + return -EAGAIN; > cstate->replay_owner = nfs4_get_stateowner(so); > } > + return 0; > } > > void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > if (so != NULL) { > cstate->replay_owner = NULL; > - mutex_unlock(&so->so_replay.rp_mutex); > + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); > + wake_up_var(&so->so_replay.rp_locked); > nfs4_put_stateowner(so); > } > } > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) > * Wait for the refcount to drop to 2. Since it has been unhashed, > * there should be no danger of the refcount going back up again at > * this point. > + * Some threads with a reference might be waiting for rp_locked, > + * so tell them to stop waiting. 
> */ > + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); > + wake_up_var(&oo->oo_owner.so_replay.rp_locked); > wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); > > release_all_access(s); > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, > clp = cstate->clp; > > strhashval = ownerstr_hashval(&open->op_owner); > +retry: > oo = find_or_alloc_open_stateowner(strhashval, open, cstate); > open->op_openowner = oo; > if (!oo) > return nfserr_jukebox; > - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); > + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { > + nfs4_put_stateowner(&oo->oo_owner); > + goto retry; > + } > status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); > if (status) > return status; > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, > trace_nfsd_preprocess(seqid, stateid); > > *stpp = NULL; > +retry: > status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); > if (status) > return status; > stp = openlockstateid(s); > - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); > + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { > + nfs4_put_stateowner(stp->st_stateowner); > + goto retry; > + } > > status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); > if (!status) My tool chain reports that this hunk won't apply to nfsd-next. In my copy of fs/nfsd/nfs4state.c, nfs4_preprocess_seqid_op() starts at line 7180, so something is whack-a-doodle. > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > index 41bdc913fa71..6a3becd86112 100644 > --- a/fs/nfsd/state.h > +++ b/fs/nfsd/state.h > @@ -446,7 +446,7 @@ struct nfs4_replay { > unsigned int rp_buflen; > char *rp_buf; > struct knfsd_fh rp_openfh; > - struct mutex rp_mutex; > + atomic_t rp_locked; > char rp_ibuf[NFSD4_REPLAY_ISIZE]; > }; > > -- > 2.43.0 > -- Chuck Lever ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 14:09 ` Chuck Lever @ 2024-03-04 21:45 ` NeilBrown 2024-03-04 22:03 ` Chuck Lever 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2024-03-04 21:45 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, 05 Mar 2024, Chuck Lever wrote: > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > move_to_close_lru() waits for sc_count to become zero while holding > > rp_mutex. This can deadlock if another thread holds a reference and is > > waiting for rp_mutex. > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > cannot be found any more. So code waiting for the mutex can safely > > retry the lookup if move_to_close_lru() has started. > > > > So change rp_mutex to an atomic_t with three states: > > > > RP_UNLOCK - state is still hashed, not locked for reply > > RP_LOCKED - state is still hashed, is locked for reply > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > unhashed. In the latter case, retry the lookup. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > fs/nfsd/state.h | 2 +- > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > index 690d0e697320..47e879d5d68a 100644 > > --- a/fs/nfsd/nfs4state.c > > +++ b/fs/nfsd/nfs4state.c > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > } > > > > +enum rp_lock { > > + RP_UNLOCKED, > > + RP_LOCKED, > > + RP_UNHASHED, > > +}; > > + > > static void init_nfs4_replay(struct nfs4_replay *rp) > > { > > rp->rp_status = nfserr_serverfault; > > rp->rp_buflen = 0; > > rp->rp_buf = rp->rp_ibuf; > > - mutex_init(&rp->rp_mutex); > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > } > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > - struct nfs4_stateowner *so) > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > + struct nfs4_stateowner *so) > > { > > if (!nfsd4_has_session(cstate)) { > > - mutex_lock(&so->so_replay.rp_mutex); > > + wait_var_event(&so->so_replay.rp_locked, > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > What reliably prevents this from being a "wait forever" ? That same thing that reliably prevented the original mutex_lock from waiting forever. It waits until rp_locked is set to RP_UNLOCKED, which is precisely when we previously called mutex_unlock. But it *also* aborts the wait if rp_locked is set to RP_UNHASHED - so it is now more reliable. Does that answer the question? 
> > > > + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) > > + return -EAGAIN; > > cstate->replay_owner = nfs4_get_stateowner(so); > > } > > + return 0; > > } > > > > void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > > > if (so != NULL) { > > cstate->replay_owner = NULL; > > - mutex_unlock(&so->so_replay.rp_mutex); > > + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); > > + wake_up_var(&so->so_replay.rp_locked); > > nfs4_put_stateowner(so); > > } > > } > > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) > > * Wait for the refcount to drop to 2. Since it has been unhashed, > > * there should be no danger of the refcount going back up again at > > * this point. > > + * Some threads with a reference might be waiting for rp_locked, > > + * so tell them to stop waiting. > > */ > > + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); > > + wake_up_var(&oo->oo_owner.so_replay.rp_locked); > > wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); > > > > release_all_access(s); > > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, > > clp = cstate->clp; > > > > strhashval = ownerstr_hashval(&open->op_owner); > > +retry: > > oo = find_or_alloc_open_stateowner(strhashval, open, cstate); > > open->op_openowner = oo; > > if (!oo) > > return nfserr_jukebox; > > - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); > > + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { > > + nfs4_put_stateowner(&oo->oo_owner); > > + goto retry; > > + } > > status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); > > if (status) > > return status; > > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, > > trace_nfsd_preprocess(seqid, stateid); > > > > *stpp = NULL; > > +retry: > > status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); > > if (status) > > return status; > > stp = openlockstateid(s); > > - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); > > + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { > > + nfs4_put_stateowner(stp->st_stateowner); > > + goto retry; > > + } > > > > status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); > > if (!status) > > My tool chain reports that this hunk won't apply to nfsd-next. You need better tools :-) > > In my copy of fs/nfsd/nfs4state.c, nfs4_preprocess_seqid_op() starts > at line 7180, so something is whack-a-doodle. I have been working against master because the old version of the fix was in nfsd-next. I should have rebased when you removed it from nfsd-next. I have now and will repost once you confirm you are comfortable with the answer about waiting above. Would adding a comment there help? Thanks, NeilBrown > > > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > > index 41bdc913fa71..6a3becd86112 100644 > > --- a/fs/nfsd/state.h > > +++ b/fs/nfsd/state.h > > @@ -446,7 +446,7 @@ struct nfs4_replay { > > unsigned int rp_buflen; > > char *rp_buf; > > struct knfsd_fh rp_openfh; > > - struct mutex rp_mutex; > > + atomic_t rp_locked; > > char rp_ibuf[NFSD4_REPLAY_ISIZE]; > > }; > > > > -- > > 2.43.0 > > > > -- > Chuck Lever > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 21:45 ` NeilBrown @ 2024-03-04 22:03 ` Chuck Lever 2024-03-04 22:36 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Chuck Lever @ 2024-03-04 22:03 UTC (permalink / raw) To: NeilBrown; +Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > On Tue, 05 Mar 2024, Chuck Lever wrote: > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > move_to_close_lru() waits for sc_count to become zero while holding > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > waiting for rp_mutex. > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > cannot be found any more. So code waiting for the mutex can safely > > > retry the lookup if move_to_close_lru() has started. > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > RP_LOCKED - state is still hashed, is locked for reply > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > unhashed. In the latter case, retry the lookup. > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > --- > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > fs/nfsd/state.h | 2 +- > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 690d0e697320..47e879d5d68a 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > } > > > > > > +enum rp_lock { > > > + RP_UNLOCKED, > > > + RP_LOCKED, > > > + RP_UNHASHED, > > > +}; > > > + > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > { > > > rp->rp_status = nfserr_serverfault; > > > rp->rp_buflen = 0; > > > rp->rp_buf = rp->rp_ibuf; > > > - mutex_init(&rp->rp_mutex); > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > } > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > - struct nfs4_stateowner *so) > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > + struct nfs4_stateowner *so) > > > { > > > if (!nfsd4_has_session(cstate)) { > > > - mutex_lock(&so->so_replay.rp_mutex); > > > + wait_var_event(&so->so_replay.rp_locked, > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > What reliably prevents this from being a "wait forever" ? > > That same thing that reliably prevented the original mutex_lock from > waiting forever. > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > we previously called mutex_unlock. But it *also* aborts the wait if > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > Does that answer the question? Hm. I guess then we are no worse off with wait_var_event(). I'm not as familiar with this logic as perhaps I should be. How long does it take for the wake-up to occur, typically? 
> > > + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) > > > + return -EAGAIN; > > > cstate->replay_owner = nfs4_get_stateowner(so); > > > } > > > + return 0; > > > } > > > > > > void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > > > > > if (so != NULL) { > > > cstate->replay_owner = NULL; > > > - mutex_unlock(&so->so_replay.rp_mutex); > > > + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); > > > + wake_up_var(&so->so_replay.rp_locked); > > > nfs4_put_stateowner(so); > > > } > > > } > > > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) > > > * Wait for the refcount to drop to 2. Since it has been unhashed, > > > * there should be no danger of the refcount going back up again at > > > * this point. > > > + * Some threads with a reference might be waiting for rp_locked, > > > + * so tell them to stop waiting. > > > */ > > > + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); > > > + wake_up_var(&oo->oo_owner.so_replay.rp_locked); > > > wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); > > > > > > release_all_access(s); > > > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, > > > clp = cstate->clp; > > > > > > strhashval = ownerstr_hashval(&open->op_owner); > > > +retry: > > > oo = find_or_alloc_open_stateowner(strhashval, open, cstate); > > > open->op_openowner = oo; > > > if (!oo) > > > return nfserr_jukebox; > > > - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); > > > + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { > > > + nfs4_put_stateowner(&oo->oo_owner); > > > + goto retry; > > > + } > > > status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); > > > if (status) > > > return status; > > > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, > > > trace_nfsd_preprocess(seqid, stateid); > > > > > > *stpp = NULL; > > > +retry: > > > status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); > > > if (status) > > > return status; > > > stp = openlockstateid(s); > > > - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); > > > + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { > > > + nfs4_put_stateowner(stp->st_stateowner); > > > + goto retry; > > > + } > > > > > > status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); > > > if (!status) > > > > My tool chain reports that this hunk won't apply to nfsd-next. > > You need better tools :-) <shrug> Essentially it is git, importing an mbox. That kind of thing is generally the weakest aspect of all these tools. Do you know of something more robust? > > In my copy of fs/nfsd/nfs4state.c, nfs4_preprocess_seqid_op() starts > > at line 7180, so something is whack-a-doodle. > > I have been working against master because the old version of the fix > was in nfsd-next. I should have rebased when you removed it from > nfsd-next. I have now and will repost once you confirm you are > comfortable with the answer about waiting above. Would adding a comment > there help? The mechanism is clear from the code, so a new comment, if you add one, should spell out what condition(s) mark rp_locked as UNLOCKED. But I might be missing something that should be obvious, in which case no comment is necessary. 
> Thanks, > NeilBrown > > > > > > > > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > > > index 41bdc913fa71..6a3becd86112 100644 > > > --- a/fs/nfsd/state.h > > > +++ b/fs/nfsd/state.h > > > @@ -446,7 +446,7 @@ struct nfs4_replay { > > > unsigned int rp_buflen; > > > char *rp_buf; > > > struct knfsd_fh rp_openfh; > > > - struct mutex rp_mutex; > > > + atomic_t rp_locked; > > > char rp_ibuf[NFSD4_REPLAY_ISIZE]; > > > }; > > > > > > -- > > > 2.43.0 > > > > > > > -- > > Chuck Lever > > > -- Chuck Lever ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 22:03 ` Chuck Lever @ 2024-03-04 22:36 ` NeilBrown 2024-03-04 22:54 ` Chuck Lever 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2024-03-04 22:36 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, 05 Mar 2024, Chuck Lever wrote: > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > waiting for rp_mutex. > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > cannot be found any more. So code waiting for the mutex can safely > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > --- > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > fs/nfsd/state.h | 2 +- > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > index 690d0e697320..47e879d5d68a 100644 > > > > --- a/fs/nfsd/nfs4state.c > > > > +++ b/fs/nfsd/nfs4state.c > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > } > > > > > > > > +enum rp_lock { > > > > + RP_UNLOCKED, > > > > + RP_LOCKED, > > > > + RP_UNHASHED, > > > > +}; > > > > + > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > { > > > > rp->rp_status = nfserr_serverfault; > > > > rp->rp_buflen = 0; > > > > rp->rp_buf = rp->rp_ibuf; > > > > - mutex_init(&rp->rp_mutex); > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > } > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > - struct nfs4_stateowner *so) > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > + struct nfs4_stateowner *so) > > > > { > > > > if (!nfsd4_has_session(cstate)) { > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > What reliably prevents this from being a "wait forever" ? > > > > That same thing that reliably prevented the original mutex_lock from > > waiting forever. > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > we previously called mutex_unlock. But it *also* aborts the wait if > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > Does that answer the question? > > Hm. I guess then we are no worse off with wait_var_event(). > > I'm not as familiar with this logic as perhaps I should be. How long > does it take for the wake-up to occur, typically? wait_var_event() is paired with wake_up_var(). 
The wake up happens when wake_up_var() is called, which in this code is always immediately after atomic_set() updates the variable. > > > > > > + if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED) > > > > + return -EAGAIN; > > > > cstate->replay_owner = nfs4_get_stateowner(so); > > > > } > > > > + return 0; > > > > } > > > > > > > > void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > > > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate) > > > > > > > > if (so != NULL) { > > > > cstate->replay_owner = NULL; > > > > - mutex_unlock(&so->so_replay.rp_mutex); > > > > + atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED); > > > > + wake_up_var(&so->so_replay.rp_locked); > > > > nfs4_put_stateowner(so); > > > > } > > > > } > > > > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net) > > > > * Wait for the refcount to drop to 2. Since it has been unhashed, > > > > * there should be no danger of the refcount going back up again at > > > > * this point. > > > > + * Some threads with a reference might be waiting for rp_locked, > > > > + * so tell them to stop waiting. > > > > */ > > > > + atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED); > > > > + wake_up_var(&oo->oo_owner.so_replay.rp_locked); > > > > wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2); > > > > > > > > release_all_access(s); > > > > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate, > > > > clp = cstate->clp; > > > > > > > > strhashval = ownerstr_hashval(&open->op_owner); > > > > +retry: > > > > oo = find_or_alloc_open_stateowner(strhashval, open, cstate); > > > > open->op_openowner = oo; > > > > if (!oo) > > > > return nfserr_jukebox; > > > > - nfsd4_cstate_assign_replay(cstate, &oo->oo_owner); > > > > + if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) { > > > > + nfs4_put_stateowner(&oo->oo_owner); > > > > + goto retry; > > > > + } > > > > status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid); > > > > if (status) > > > > return status; > > > > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid, > > > > trace_nfsd_preprocess(seqid, stateid); > > > > > > > > *stpp = NULL; > > > > +retry: > > > > status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); > > > > if (status) > > > > return status; > > > > stp = openlockstateid(s); > > > > - nfsd4_cstate_assign_replay(cstate, stp->st_stateowner); > > > > + if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) { > > > > + nfs4_put_stateowner(stp->st_stateowner); > > > > + goto retry; > > > > + } > > > > > > > > status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp); > > > > if (!status) > > > > > > My tool chain reports that this hunk won't apply to nfsd-next. > > > > You need better tools :-) > > <shrug> Essentially it is git, importing an mbox. That kind of thing > is generally the weakest aspect of all these tools. Do you know of > something more robust? My tool of choice is "wiggle" - which I wrote. I just put those 4 patches into an mbox and use "git am < mbox" to apply to nfsd-next. It hit a problem at this patch - 3/4. So I ran wiggle -mrp .git/rebase-apply/patch which worked without complaint. (you might want --no-backup too). But you probably want to know what the conflict was that "git am" found and "wiggle" ignored. 
So: git diff | diff -u -I '^@@' -I '^index' .git/rebase-apply/patch - This shows +retry: - status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn); + status = nfsd4_lookup_stateid(cstate, stateid, + typemask, statusmask, &s, nn); once you have had a bit of practice reading diffs of diffs you can see that the patch inserts "retry:" before a line which has changed between the original and new bases. The difference clearly isn't important to this patch. So "git add -u; git am --continue" will accept the wiggle and continue with the "git am". Patch 4/4 has one conflict: struct nfsd_net *nn = net_generic(net, nfsd_net_id); + bool need_move_to_close_list; - dprintk("NFSD: nfsd4_close on file %pd\n", + dprintk("NFSD: nfsd4_close on file %pd\n", cstate->current_fh.fh_dentry); again - not interesting (white space at end of line changed). But maybe you would rather avoid that and have submitters base their patches properly to start with :-) > > > > > In my copy of fs/nfsd/nfs4state.c, nfs4_preprocess_seqid_op() starts > > > at line 7180, so something is whack-a-doodle. > > > > I have been working against master because the old version of the fix > > was in nfsd-next. I should have rebased when you removed it from > > nfsd-next. I have now and will repost once you confirm you are > > comfortable with the answer about waiting above. Would adding a comment > > there help? > > The mechanism is clear from the code, so a new comment, if you add > one, should spell out what condition(s) mark rp_locked as UNLOCKED. > > But I might be missing something that should be obvious, in which > case no comment is necessary. > I've added a comment emphasising that it is much like a mutex, and pointing to the matching unlocks. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 19+ messages in thread
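In shell terms, the recovery sequence described in the message above is roughly the following (a sketch only; "series.mbox" is an illustrative filename, and the wiggle flags are the ones quoted above rather than re-verified here):

    # Apply the series; "git am" stops at the patch that does not apply cleanly.
    git am < series.mbox

    # Force-apply the rejected patch with wiggle (--no-backup is optional).
    wiggle -mrp --no-backup .git/rebase-apply/patch

    # Compare what wiggle applied with what the patch intended,
    # ignoring hunk headers and index lines.
    git diff | diff -u -I '^@@' -I '^index' .git/rebase-apply/patch -

    # If the differences are only uninteresting context drift, accept them
    # and let "git am" continue with the remaining patches.
    git add -u
    git am --continue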
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 22:36 ` NeilBrown @ 2024-03-04 22:54 ` Chuck Lever 2024-03-04 23:04 ` NeilBrown 2024-03-04 23:11 ` Jeff Layton 0 siblings, 2 replies; 19+ messages in thread From: Chuck Lever @ 2024-03-04 22:54 UTC (permalink / raw) To: NeilBrown; +Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > On Tue, 05 Mar 2024, Chuck Lever wrote: > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > waiting for rp_mutex. > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > --- > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > fs/nfsd/state.h | 2 +- > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > --- a/fs/nfsd/nfs4state.c > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > } > > > > > > > > > > +enum rp_lock { > > > > > + RP_UNLOCKED, > > > > > + RP_LOCKED, > > > > > + RP_UNHASHED, > > > > > +}; > > > > > + > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > { > > > > > rp->rp_status = nfserr_serverfault; > > > > > rp->rp_buflen = 0; > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > - mutex_init(&rp->rp_mutex); > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > } > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > - struct nfs4_stateowner *so) > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > + struct nfs4_stateowner *so) > > > > > { > > > > > if (!nfsd4_has_session(cstate)) { > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > > > What reliably prevents this from being a "wait forever" ? > > > > > > That same thing that reliably prevented the original mutex_lock from > > > waiting forever. > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > we previously called mutex_unlock. But it *also* aborts the wait if > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > Does that answer the question? > > > > Hm. 
I guess then we are no worse off with wait_var_event(). > > > > I'm not as familiar with this logic as perhaps I should be. How long > > does it take for the wake-up to occur, typically? > > wait_var_event() is paired with wake_up_var(). > The wake up happens when wake_up_var() is called, which in this code is > always immediately after atomic_set() updates the variable. I'm trying to ascertain the actual wall-clock time that the nfsd thread is sleeping, at most. Is this going to be a possible DoS vector? Can it impact the ability for the server to shut down without hanging? -- Chuck Lever ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 22:54 ` Chuck Lever @ 2024-03-04 23:04 ` NeilBrown 2024-03-04 23:11 ` Jeff Layton 1 sibling, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-03-04 23:04 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, 05 Mar 2024, Chuck Lever wrote: > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > > waiting for rp_mutex. > > > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > --- > > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > > fs/nfsd/state.h | 2 +- > > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > > } > > > > > > > > > > > > +enum rp_lock { > > > > > > + RP_UNLOCKED, > > > > > > + RP_LOCKED, > > > > > > + RP_UNHASHED, > > > > > > +}; > > > > > > + > > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > > { > > > > > > rp->rp_status = nfserr_serverfault; > > > > > > rp->rp_buflen = 0; > > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > > - mutex_init(&rp->rp_mutex); > > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > > } > > > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > - struct nfs4_stateowner *so) > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > + struct nfs4_stateowner *so) > > > > > > { > > > > > > if (!nfsd4_has_session(cstate)) { > > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > > > > > What reliably prevents this from being a "wait forever" ? > > > > > > > > That same thing that reliably prevented the original mutex_lock from > > > > waiting forever. > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > > we previously called mutex_unlock. 
But it *also* aborts the wait if > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > > > Does that answer the question? > > > > > > Hm. I guess then we are no worse off with wait_var_event(). > > > > > > I'm not as familiar with this logic as perhaps I should be. How long > > > does it take for the wake-up to occur, typically? > > > > wait_var_event() is paired with wake_up_var(). > > The wake up happens when wake_up_var() is called, which in this code is > > always immediately after atomic_set() updates the variable. > > I'm trying to ascertain the actual wall-clock time that the nfsd thread > is sleeping, at most. Is this going to be a possible DoS vector? Can > it impact the ability for the server to shut down without hanging? Certainly the wall-clock sleep time is no more than it was before this patch. At most it is the time it takes for some other request running in some other nfsd thread to complete. Well.... technically if there were a large number of concurrent requests from a client that all required claiming this lock, one of the threads might be blocked while all the others threads get a turn. There is no fairness guarantee with this style of waiting. So one thread might be blocked indefinitely while other threads take turns taking the lock. It's not really a DoS vector as the client is only really denying service to itself by keeping the server excessively busy. Other clients will still get a reasonable turn. NeilBrown > > > -- > Chuck Lever > ^ permalink raw reply [flat|nested] 19+ messages in thread
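Condensed from the patch quoted above, the locking pattern under discussion looks roughly like this. The helper names rp_lock/rp_unlock/rp_unhash are invented for the sketch; in the patch the same statements live in nfsd4_cstate_assign_replay(), nfsd4_cstate_clear_replay() and move_to_close_lru(), and rp_locked is the atomic_t that replaces rp_mutex in struct nfs4_replay (wait_var_event()/wake_up_var() come from <linux/wait_bit.h>):

    enum rp_lock { RP_UNLOCKED, RP_LOCKED, RP_UNHASHED };

    /* "lock": sleep until UNLOCKED -> LOCKED succeeds, or bail out if UNHASHED */
    static int rp_lock(struct nfs4_replay *rp)
    {
            wait_var_event(&rp->rp_locked,
                           atomic_cmpxchg(&rp->rp_locked,
                                          RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
            if (atomic_read(&rp->rp_locked) == RP_UNHASHED)
                    return -EAGAIN; /* owner was unhashed; caller retries the lookup */
            return 0;
    }

    /* "unlock": the replacement for the old mutex_unlock() */
    static void rp_unlock(struct nfs4_replay *rp)
    {
            atomic_set(&rp->rp_locked, RP_UNLOCKED);
            wake_up_var(&rp->rp_locked);
    }

    /* "unhash": move_to_close_lru() tells any remaining waiters to stop waiting */
    static void rp_unhash(struct nfs4_replay *rp)
    {
            atomic_set(&rp->rp_locked, RP_UNHASHED);
            wake_up_var(&rp->rp_locked);
    }

Every transition a waiter can be sleeping on is immediately followed by wake_up_var(), so the wall-clock wait is bounded by whatever the current lock holder is doing, just as it was with the mutex.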
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 22:54 ` Chuck Lever 2024-03-04 23:04 ` NeilBrown @ 2024-03-04 23:11 ` Jeff Layton 2024-03-04 23:30 ` Chuck Lever 1 sibling, 1 reply; 19+ messages in thread From: Jeff Layton @ 2024-03-04 23:11 UTC (permalink / raw) To: Chuck Lever, NeilBrown; +Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Mon, 2024-03-04 at 17:54 -0500, Chuck Lever wrote: > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > > waiting for rp_mutex. > > > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > --- > > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > > fs/nfsd/state.h | 2 +- > > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > > } > > > > > > > > > > > > +enum rp_lock { > > > > > > + RP_UNLOCKED, > > > > > > + RP_LOCKED, > > > > > > + RP_UNHASHED, > > > > > > +}; > > > > > > + > > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > > { > > > > > > rp->rp_status = nfserr_serverfault; > > > > > > rp->rp_buflen = 0; > > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > > - mutex_init(&rp->rp_mutex); > > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > > } > > > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > - struct nfs4_stateowner *so) > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > + struct nfs4_stateowner *so) > > > > > > { > > > > > > if (!nfsd4_has_session(cstate)) { > > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > > > > > What reliably prevents this from being a "wait forever" ? > > > > > > > > That same thing that reliably prevented the original mutex_lock from > > > > waiting forever. > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > > we previously called mutex_unlock. 
But it *also* aborts the wait if > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > > > Does that answer the question? > > > > > > Hm. I guess then we are no worse off with wait_var_event(). > > > > > > I'm not as familiar with this logic as perhaps I should be. How long > > > does it take for the wake-up to occur, typically? > > > > wait_var_event() is paired with wake_up_var(). > > The wake up happens when wake_up_var() is called, which in this code is > > always immediately after atomic_set() updates the variable. > > I'm trying to ascertain the actual wall-clock time that the nfsd thread > is sleeping, at most. Is this going to be a possible DoS vector? Can > it impact the ability for the server to shut down without hanging? > > Prior to this patch, there was a mutex in play here and we just released it to wake up the waiters. This is more or less doing the same thing, it just indicates the resulting state better. I doubt this will materially change how long the tasks are waiting. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 23:11 ` Jeff Layton @ 2024-03-04 23:30 ` Chuck Lever 2024-03-05 0:05 ` NeilBrown 2024-03-05 0:11 ` Jeff Layton 0 siblings, 2 replies; 19+ messages in thread From: Chuck Lever @ 2024-03-04 23:30 UTC (permalink / raw) To: Jeff Layton; +Cc: NeilBrown, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Mon, Mar 04, 2024 at 06:11:24PM -0500, Jeff Layton wrote: > On Mon, 2024-03-04 at 17:54 -0500, Chuck Lever wrote: > > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > > > waiting for rp_mutex. > > > > > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > > --- > > > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > > > fs/nfsd/state.h | 2 +- > > > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > > > } > > > > > > > > > > > > > > +enum rp_lock { > > > > > > > + RP_UNLOCKED, > > > > > > > + RP_LOCKED, > > > > > > > + RP_UNHASHED, > > > > > > > +}; > > > > > > > + > > > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > > > { > > > > > > > rp->rp_status = nfserr_serverfault; > > > > > > > rp->rp_buflen = 0; > > > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > > > - mutex_init(&rp->rp_mutex); > > > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > > > } > > > > > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > - struct nfs4_stateowner *so) > > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > + struct nfs4_stateowner *so) > > > > > > > { > > > > > > > if (!nfsd4_has_session(cstate)) { > > > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > > > > > > > What reliably prevents this from being a "wait forever" ? 
> > > > > > > > > > That same thing that reliably prevented the original mutex_lock from > > > > > waiting forever. Note that this patch fixes a deadlock here. So clearly, there /were/ situations where "waiting forever" was possible with the mutex version of this code. > > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > > > we previously called mutex_unlock. But it *also* aborts the wait if > > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > > > > > Does that answer the question? > > > > > > > > Hm. I guess then we are no worse off with wait_var_event(). > > > > > > > > I'm not as familiar with this logic as perhaps I should be. How long > > > > does it take for the wake-up to occur, typically? > > > > > > wait_var_event() is paired with wake_up_var(). > > > The wake up happens when wake_up_var() is called, which in this code is > > > always immediately after atomic_set() updates the variable. > > > > I'm trying to ascertain the actual wall-clock time that the nfsd thread > > is sleeping, at most. Is this going to be a possible DoS vector? Can > > it impact the ability for the server to shut down without hanging? > > Prior to this patch, there was a mutex in play here and we just released > it to wake up the waiters. This is more or less doing the same thing, it > just indicates the resulting state better. Well, it adds a third state so that a recovery action can be taken on wake-up in some cases. That avoids a deadlock, so this does count as a bug fix. > I doubt this will materially change how long the tasks are waiting. It might not be a longer wait, but it still seems difficult to prove that the wait_var_event() will /always/ be awoken somehow. Applying for now. -- Chuck Lever ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 23:30 ` Chuck Lever @ 2024-03-05 0:05 ` NeilBrown 2024-03-05 0:11 ` Jeff Layton 1 sibling, 0 replies; 19+ messages in thread From: NeilBrown @ 2024-03-05 0:05 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Tue, 05 Mar 2024, Chuck Lever wrote: > On Mon, Mar 04, 2024 at 06:11:24PM -0500, Jeff Layton wrote: > > On Mon, 2024-03-04 at 17:54 -0500, Chuck Lever wrote: > > > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > > > > waiting for rp_mutex. > > > > > > > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > > > > unhashed. In the latter case, retry the lookup. > > > > > > > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > > > --- > > > > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > > > > fs/nfsd/state.h | 2 +- > > > > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > > > > } > > > > > > > > > > > > > > > > +enum rp_lock { > > > > > > > > + RP_UNLOCKED, > > > > > > > > + RP_LOCKED, > > > > > > > > + RP_UNHASHED, > > > > > > > > +}; > > > > > > > > + > > > > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > > > > { > > > > > > > > rp->rp_status = nfserr_serverfault; > > > > > > > > rp->rp_buflen = 0; > > > > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > > > > - mutex_init(&rp->rp_mutex); > > > > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > > > > } > > > > > > > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > > - struct nfs4_stateowner *so) > > > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > > + struct nfs4_stateowner *so) > > > > > > > > { > > > > > > > > if (!nfsd4_has_session(cstate)) { > > > > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > 
> > > > > > > > > > What reliably prevents this from being a "wait forever" ? > > > > > > > > > > > > That same thing that reliably prevented the original mutex_lock from > > > > > > waiting forever. > > Note that this patch fixes a deadlock here. So clearly, there /were/ > situations where "waiting forever" was possible with the mutex version > of this code. > > > > > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > > > > we previously called mutex_unlock. But it *also* aborts the wait if > > > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > > > > > > > Does that answer the question? > > > > > > > > > > Hm. I guess then we are no worse off with wait_var_event(). > > > > > > > > > > I'm not as familiar with this logic as perhaps I should be. How long > > > > > does it take for the wake-up to occur, typically? > > > > > > > > wait_var_event() is paired with wake_up_var(). > > > > The wake up happens when wake_up_var() is called, which in this code is > > > > always immediately after atomic_set() updates the variable. > > > > > > I'm trying to ascertain the actual wall-clock time that the nfsd thread > > > is sleeping, at most. Is this going to be a possible DoS vector? Can > > > it impact the ability for the server to shut down without hanging? > > > > Prior to this patch, there was a mutex in play here and we just released > > it to wake up the waiters. This is more or less doing the same thing, it > > just indicates the resulting state better. > > Well, it adds a third state so that a recovery action can be taken > on wake-up in some cases. That avoids a deadlock, so this does count > as a bug fix. > > > > I doubt this will materially change how long the tasks are waiting. > > It might not be a longer wait, but it still seems difficult to prove > that the wait_var_event() will /always/ be awoken somehow. Proving locks are free from possible deadlock is notoriously difficult. That is why we have lockdep. Maybe the first step in this series should have been to add some lockdep annotations to sc_count so that it could report the possible deadlock. That would look like a readlock when sc_count is incremented, readunlock when it is decremented, and a writelock when we wait in move_to_close_lru. Or something like that. If we could demonstrate that easily triggers a lockdep warning, then demonstrate the warning is gone after the patch series, that would be a good thing. I might try that, but not today. > > Applying for now. > Thanks, NeilBrown > > -- > Chuck Lever > ^ permalink raw reply [flat|nested] 19+ messages in thread
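For what it's worth, the lockdep idea sketched in the message above might look roughly like this. This is entirely untested; the annotation macros are the standard ones from <linux/lockdep.h>, the map name sc_count_dep_map is invented here, and where the map should live and exactly which call sites get annotated would need real thought:

    /* One map standing in for "holding a reference counted in sc_count". */
    static struct lockdep_map sc_count_dep_map =
            STATIC_LOCKDEP_MAP_INIT("sc_count", &sc_count_dep_map);

    /* Wherever a reference is taken (sc_count incremented): */
    rwsem_acquire_read(&sc_count_dep_map, 0, 0, _RET_IP_);

    /* Wherever a reference is dropped (sc_count decremented): */
    rwsem_release(&sc_count_dep_map, _RET_IP_);

    /* Around the wait in move_to_close_lru(): */
    rwsem_acquire(&sc_count_dep_map, 0, 0, _RET_IP_);
    wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
    rwsem_release(&sc_count_dep_map, _RET_IP_);

With that in place, "hold rp_mutex then wait for references to drain" in one thread and "hold a reference then wait for rp_mutex" in another should show up as an ordinary lock-inversion report.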
* Re: [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() 2024-03-04 23:30 ` Chuck Lever 2024-03-05 0:05 ` NeilBrown @ 2024-03-05 0:11 ` Jeff Layton 1 sibling, 0 replies; 19+ messages in thread From: Jeff Layton @ 2024-03-05 0:11 UTC (permalink / raw) To: Chuck Lever; +Cc: NeilBrown, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey On Mon, 2024-03-04 at 18:30 -0500, Chuck Lever wrote: > On Mon, Mar 04, 2024 at 06:11:24PM -0500, Jeff Layton wrote: > > On Mon, 2024-03-04 at 17:54 -0500, Chuck Lever wrote: > > > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote: > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote: > > > > > > On Tue, 05 Mar 2024, Chuck Lever wrote: > > > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote: > > > > > > > > move_to_close_lru() waits for sc_count to become zero while holding > > > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is > > > > > > > > waiting for rp_mutex. > > > > > > > > > > > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and > > > > > > > > cannot be found any more. So code waiting for the mutex can safely > > > > > > > > retry the lookup if move_to_close_lru() has started. > > > > > > > > > > > > > > > > So change rp_mutex to an atomic_t with three states: > > > > > > > > > > > > > > > > RP_UNLOCK - state is still hashed, not locked for reply > > > > > > > > RP_LOCKED - state is still hashed, is locked for reply > > > > > > > > RP_UNHASHED - state is not hashed, no code can get a lock. > > > > > > > > > > > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be > > > > > > > > unhashed. In the latter case, retry the lookup. 
> > > > > > > > > > > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > > > --- > > > > > > > > fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++------- > > > > > > > > fs/nfsd/state.h | 2 +- > > > > > > > > 2 files changed, 32 insertions(+), 8 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > > > index 690d0e697320..47e879d5d68a 100644 > > > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn) > > > > > > > > atomic_set(&nn->nfsd_courtesy_clients, 0); > > > > > > > > } > > > > > > > > > > > > > > > > +enum rp_lock { > > > > > > > > + RP_UNLOCKED, > > > > > > > > + RP_LOCKED, > > > > > > > > + RP_UNHASHED, > > > > > > > > +}; > > > > > > > > + > > > > > > > > static void init_nfs4_replay(struct nfs4_replay *rp) > > > > > > > > { > > > > > > > > rp->rp_status = nfserr_serverfault; > > > > > > > > rp->rp_buflen = 0; > > > > > > > > rp->rp_buf = rp->rp_ibuf; > > > > > > > > - mutex_init(&rp->rp_mutex); > > > > > > > > + atomic_set(&rp->rp_locked, RP_UNLOCKED); > > > > > > > > } > > > > > > > > > > > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > > - struct nfs4_stateowner *so) > > > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate, > > > > > > > > + struct nfs4_stateowner *so) > > > > > > > > { > > > > > > > > if (!nfsd4_has_session(cstate)) { > > > > > > > > - mutex_lock(&so->so_replay.rp_mutex); > > > > > > > > + wait_var_event(&so->so_replay.rp_locked, > > > > > > > > + atomic_cmpxchg(&so->so_replay.rp_locked, > > > > > > > > + RP_UNLOCKED, RP_LOCKED) != RP_LOCKED); > > > > > > > > > > > > > > What reliably prevents this from being a "wait forever" ? > > > > > > > > > > > > That same thing that reliably prevented the original mutex_lock from > > > > > > waiting forever. > > Note that this patch fixes a deadlock here. So clearly, there /were/ > situations where "waiting forever" was possible with the mutex version > of this code. > > > > > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when > > > > > > we previously called mutex_unlock. But it *also* aborts the wait if > > > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable. > > > > > > > > > > > > Does that answer the question? > > > > > > > > > > Hm. I guess then we are no worse off with wait_var_event(). > > > > > > > > > > I'm not as familiar with this logic as perhaps I should be. How long > > > > > does it take for the wake-up to occur, typically? > > > > > > > > wait_var_event() is paired with wake_up_var(). > > > > The wake up happens when wake_up_var() is called, which in this code is > > > > always immediately after atomic_set() updates the variable. > > > > > > I'm trying to ascertain the actual wall-clock time that the nfsd thread > > > is sleeping, at most. Is this going to be a possible DoS vector? Can > > > it impact the ability for the server to shut down without hanging? > > > > Prior to this patch, there was a mutex in play here and we just released > > it to wake up the waiters. This is more or less doing the same thing, it > > just indicates the resulting state better. > > Well, it adds a third state so that a recovery action can be taken > on wake-up in some cases. That avoids a deadlock, so this does count > as a bug fix. > > > > I doubt this will materially change how long the tasks are waiting. 
> > It might not be a longer wait, but it still seems difficult to prove > that the wait_var_event() will /always/ be awoken somehow. > I'm not sure what other guarantees we can reasonably offer. The v4.0 replay handling requires an exclusionary lock of some sort, so you can always get stuck if the wakeup never comes. We typically hold this mutex today over seqid morphing operations in v4.0, so if those get stuck doing something (a vfs_open or something?) you'll get stuck here too. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads: [~2024-04-09 2:09 UTC | newest]

Thread overview: 19+ messages
2024-04-08 2:09 [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() NeilBrown
2024-04-08 2:09 ` [PATCH 1/4] nfsd: perform all find_openstateowner_str calls in the one place NeilBrown
2024-04-08 2:09 ` [PATCH 2/4] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
2024-04-08 2:09 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown
2024-04-08 2:09 ` [PATCH 4/4] nfsd: drop st_mutex before calling move_to_close_lru() NeilBrown
2024-04-09 2:09 ` [PATCH 0/4 v4] nfsd: fix deadlock in move_to_close_lru() Chuck Lever
-- strict thread matches above, loose matches on Subject: below --
2024-03-04 22:45 [PATCH 0/4 v3] " NeilBrown
2024-03-04 22:45 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid " NeilBrown
2024-03-04 4:40 [PATCH 0/4 v2] nfsd: fix dadlock " NeilBrown
2024-03-04 4:40 ` [PATCH 3/4] nfsd: replace rp_mutex to avoid deadlock " NeilBrown
2024-03-04 12:50 ` Jeff Layton
2024-03-04 14:09 ` Chuck Lever
2024-03-04 21:45 ` NeilBrown
2024-03-04 22:03 ` Chuck Lever
2024-03-04 22:36 ` NeilBrown
2024-03-04 22:54 ` Chuck Lever
2024-03-04 23:04 ` NeilBrown
2024-03-04 23:11 ` Jeff Layton
2024-03-04 23:30 ` Chuck Lever
2024-03-05 0:05 ` NeilBrown
2024-03-05 0:11 ` Jeff Layton