[PATCH 0/3] nfsd: fix dadlock in move_to_close

linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru()
@ 2024-03-01  0:07 NeilBrown
  2024-03-01  0:07 ` [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: NeilBrown @ 2024-03-01  0:07 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

This series is intended to replace 
 98be4be88369 nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru()
in nfsd-next.
As Jeff Layton oberved recently there is a problem with that patch.
This series takes a different approach to the rp_mutex proble, replacing it with an
atomic_t which has three states.  Details in the patch.

Thnaks,
NeilBrown

 [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open
 [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in
 [PATCH 3/3] nfsd: drop st_mutex_mutex before calling

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling.
  2024-03-01  0:07 [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() NeilBrown
@ 2024-03-01  0:07 ` NeilBrown
  2024-03-01 12:54   ` Jeff Layton
  2024-03-01  0:07 ` [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2024-03-01  0:07 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

Rather than taking the rp_mutex in nfsd4_cleanup_open_state() (which
seems counter-intuitive), take it and assign rp_owner as soon as
possible.

This will support a future change when nfsd4_cstate_assign_replay() might
fail.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfs4state.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 7d6c657e0409..e625f738f7b0 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -5066,15 +5066,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
 	strhashval = ownerstr_hashval(&open->op_owner);
 	oo = find_openstateowner_str(strhashval, open, clp);
 	open->op_openowner = oo;
-	if (!oo) {
+	if (!oo)
 		goto new_owner;
-	}
 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
 		/* Replace unconfirmed owners without checking for replay. */
 		release_openowner(oo);
 		open->op_openowner = NULL;
 		goto new_owner;
 	}
+	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
 	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
 	if (status)
 		return status;
@@ -5084,6 +5084,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
 	if (oo == NULL)
 		return nfserr_jukebox;
 	open->op_openowner = oo;
+	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
 alloc_stateid:
 	open->op_stp = nfs4_alloc_open_stateid(clp);
 	if (!open->op_stp)
@@ -5835,12 +5836,8 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 void nfsd4_cleanup_open_state(struct nfsd4_compound_state *cstate,
 			      struct nfsd4_open *open)
 {
-	if (open->op_openowner) {
-		struct nfs4_stateowner *so = &open->op_openowner->oo_owner;
-
-		nfsd4_cstate_assign_replay(cstate, so);
-		nfs4_put_stateowner(so);
-	}
+	if (cstate->replay_owner)
+		nfs4_put_stateowner(cstate->replay_owner);
 	if (open->op_file)
 		kmem_cache_free(file_slab, open->op_file);
 	if (open->op_stp)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru()
  2024-03-01  0:07 [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() NeilBrown
  2024-03-01  0:07 ` [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
@ 2024-03-01  0:07 ` NeilBrown
  2024-03-01 13:09   ` Jeff Layton
  2024-03-01  0:07 ` [PATCH 3/3] nfsd: drop st_mutex_mutex before calling move_to_close_lru() NeilBrown
  2024-03-01 14:05 ` [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() Chuck Lever
  3 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2024-03-01  0:07 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

move_to_close_lru() waits for sc_count to become zero while holding
rp_mutex.  This can deadlock if another thread holds a reference and is
waiting for rp_mutex.

By the time we get to move_to_close_lru() the openowner is unhashed and
cannot be found any more.  So code waiting for the mutex can safely
retry the lookup if move_to_close_lru() has started.

So change rp_mutex to an atomic_t with three states:

 RP_UNLOCK   - state is still hashed, not locked for reply
 RP_LOCKED   - state is still hashed, is locked for reply
 RP_UNHASHED - state is now hashed, no code can get a lock.

Use wait_ver_event() to wait for either a lock, or for the owner to be
unhashed.  In the latter case, retry the lookup.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfs4state.c | 46 ++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/state.h     |  2 +-
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index e625f738f7b0..5dab316932d3 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4442,21 +4442,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn)
 	atomic_set(&nn->nfsd_courtesy_clients, 0);
 }
 
+enum rp_lock {
+	RP_UNLOCKED,
+	RP_LOCKED,
+	RP_UNHASHED,
+};
+
 static void init_nfs4_replay(struct nfs4_replay *rp)
 {
 	rp->rp_status = nfserr_serverfault;
 	rp->rp_buflen = 0;
 	rp->rp_buf = rp->rp_ibuf;
-	mutex_init(&rp->rp_mutex);
+	atomic_set(&rp->rp_locked, RP_UNLOCKED);
 }
 
-static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
-		struct nfs4_stateowner *so)
+static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
+				      struct nfs4_stateowner *so)
 {
 	if (!nfsd4_has_session(cstate)) {
-		mutex_lock(&so->so_replay.rp_mutex);
+		wait_var_event(&so->so_replay.rp_locked,
+			       atomic_cmpxchg(&so->so_replay.rp_locked,
+					      RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
+		if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED)
+			return -EAGAIN;
 		cstate->replay_owner = nfs4_get_stateowner(so);
 	}
+	return 0;
 }
 
 void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
@@ -4465,7 +4476,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
 
 	if (so != NULL) {
 		cstate->replay_owner = NULL;
-		mutex_unlock(&so->so_replay.rp_mutex);
+		atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED);
+		wake_up_var(&so->so_replay.rp_locked);
 		nfs4_put_stateowner(so);
 	}
 }
@@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
 	 * Wait for the refcount to drop to 2. Since it has been unhashed,
 	 * there should be no danger of the refcount going back up again at
 	 * this point.
+	 * Some threads with a reference might be waiting for rp_locked,
+	 * so tell them to stop waiting.
 	 */
+	atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED);
+	wake_up_var(&oo->oo_owner.so_replay.rp_locked);
 	wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
 
 	release_all_access(s);
@@ -5064,6 +5080,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
 	clp = cstate->clp;
 
 	strhashval = ownerstr_hashval(&open->op_owner);
+retry:
 	oo = find_openstateowner_str(strhashval, open, clp);
 	open->op_openowner = oo;
 	if (!oo)
@@ -5074,17 +5091,24 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
 		open->op_openowner = NULL;
 		goto new_owner;
 	}
-	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
+	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
+		nfs4_put_stateowner(&oo->oo_owner);
+		goto retry;
+	}
 	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
 	if (status)
 		return status;
 	goto alloc_stateid;
 new_owner:
 	oo = alloc_init_open_stateowner(strhashval, open, cstate);
+	open->op_openowner = oo;
 	if (oo == NULL)
 		return nfserr_jukebox;
-	open->op_openowner = oo;
-	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
+	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
+		WARN_ON(1);
+		nfs4_put_stateowner(&oo->oo_owner);
+		goto new_owner;
+	}
 alloc_stateid:
 	open->op_stp = nfs4_alloc_open_stateid(clp);
 	if (!open->op_stp)
@@ -6841,11 +6865,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid,
 	trace_nfsd_preprocess(seqid, stateid);
 
 	*stpp = NULL;
+retry:
 	status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn);
 	if (status)
 		return status;
 	stp = openlockstateid(s);
-	nfsd4_cstate_assign_replay(cstate, stp->st_stateowner);
+	if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) {
+		nfs4_put_stateowner(stp->st_stateowner);
+		goto retry;
+	}
 
 	status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp);
 	if (!status)
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 41bdc913fa71..6a3becd86112 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -446,7 +446,7 @@ struct nfs4_replay {
 	unsigned int		rp_buflen;
 	char			*rp_buf;
 	struct knfsd_fh		rp_openfh;
-	struct mutex		rp_mutex;
+	atomic_t		rp_locked;
 	char			rp_ibuf[NFSD4_REPLAY_ISIZE];
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] nfsd: drop st_mutex_mutex before calling move_to_close_lru()
  2024-03-01  0:07 [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() NeilBrown
  2024-03-01  0:07 ` [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
  2024-03-01  0:07 ` [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown
@ 2024-03-01  0:07 ` NeilBrown
  2024-03-01 14:05 ` [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() Chuck Lever
  3 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2024-03-01  0:07 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

move_to_close_lru() is currently called with ->st_mutex held.
This can lead to a deadlock as move_to_close_lru() waits for sc_count to
drop to 2, and some threads holding a reference might be waiting for the
mutex.  These references will never be dropped so sc_count will never
reach 2.

There can be no harm in dropping ->st_mutex before
move_to_close_lru() because the only place that takes the mutex is
nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is
NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called.

See also
 https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/
where this problem was raised but not successfully resolved.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4state.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5dab316932d3..bb0f563cb2d2 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7015,7 +7015,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
 	return status;
 }
 
-static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
+static bool nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
 {
 	struct nfs4_client *clp = s->st_stid.sc_client;
 	bool unhashed;
@@ -7032,11 +7032,11 @@ static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
 		list_for_each_entry(stp, &reaplist, st_locks)
 			nfs4_free_cpntf_statelist(clp->net, &stp->st_stid);
 		free_ol_stateid_reaplist(&reaplist);
+		return false;
 	} else {
 		spin_unlock(&clp->cl_lock);
 		free_ol_stateid_reaplist(&reaplist);
-		if (unhashed)
-			move_to_close_lru(s, clp->net);
+		return unhashed;
 	}
 }
 
@@ -7052,6 +7052,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	struct nfs4_ol_stateid *stp;
 	struct net *net = SVC_NET(rqstp);
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	bool need_move_to_close_list;
 
 	dprintk("NFSD: nfsd4_close on file %pd\n", 
 			cstate->current_fh.fh_dentry);
@@ -7074,8 +7075,10 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	 */
 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
 
-	nfsd4_close_open_stateid(stp);
+	need_move_to_close_list = nfsd4_close_open_stateid(stp);
 	mutex_unlock(&stp->st_mutex);
+	if (need_move_to_close_list)
+		move_to_close_lru(stp, net);
 
 	/* v4.1+ suggests that we send a special stateid in here, since the
 	 * clients should just ignore this anyway. Since this is not useful
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling.
  2024-03-01  0:07 ` [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
@ 2024-03-01 12:54   ` Jeff Layton
  2024-03-03 23:59     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2024-03-01 12:54 UTC (permalink / raw)
  To: NeilBrown, Chuck Lever; +Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

On Fri, 2024-03-01 at 11:07 +1100, NeilBrown wrote:
> Rather than taking the rp_mutex in nfsd4_cleanup_open_state() (which
> seems counter-intuitive), take it and assign rp_owner as soon as
> possible.
> 
> This will support a future change when nfsd4_cstate_assign_replay() might
> fail.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 7d6c657e0409..e625f738f7b0 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -5066,15 +5066,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
>  	strhashval = ownerstr_hashval(&open->op_owner);
>  	oo = find_openstateowner_str(strhashval, open, clp);
>  	open->op_openowner = oo;
> -	if (!oo) {
> +	if (!oo)
>  		goto new_owner;
> -	}
>  	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
>  		/* Replace unconfirmed owners without checking for replay. */
>  		release_openowner(oo);
>  		open->op_openowner = NULL;
>  		goto new_owner;
>  	}
> +	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
>  	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
>  	if (status)
>  		return status;
> @@ -5084,6 +5084,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
>  	if (oo == NULL)
>  		return nfserr_jukebox;
>  	open->op_openowner = oo;
> +	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
>  alloc_stateid:
>  	open->op_stp = nfs4_alloc_open_stateid(clp);
>  	if (!open->op_stp)
> @@ -5835,12 +5836,8 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  void nfsd4_cleanup_open_state(struct nfsd4_compound_state *cstate,
>  			      struct nfsd4_open *open)
>  {
> -	if (open->op_openowner) {
> -		struct nfs4_stateowner *so = &open->op_openowner->oo_owner;
> -
> -		nfsd4_cstate_assign_replay(cstate, so);
> -		nfs4_put_stateowner(so);
> -	}
> +	if (cstate->replay_owner)
> +		nfs4_put_stateowner(cstate->replay_owner);

The above delta doesn't look right. The replay_owner won't be set on
v4.1+ mounts, but op_openowner will still hold a valid reference that
will now leak.

>  	if (open->op_file)
>  		kmem_cache_free(file_slab, open->op_file);
>  	if (open->op_stp)

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru()
  2024-03-01  0:07 ` [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown
@ 2024-03-01 13:09   ` Jeff Layton
  2024-03-04  0:09     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2024-03-01 13:09 UTC (permalink / raw)
  To: NeilBrown, Chuck Lever; +Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

On Fri, 2024-03-01 at 11:07 +1100, NeilBrown wrote:
> move_to_close_lru() waits for sc_count to become zero while holding
> rp_mutex.  This can deadlock if another thread holds a reference and is
> waiting for rp_mutex.
> 
> By the time we get to move_to_close_lru() the openowner is unhashed and
> cannot be found any more.  So code waiting for the mutex can safely
> retry the lookup if move_to_close_lru() has started.
> 
> So change rp_mutex to an atomic_t with three states:
> 
>  RP_UNLOCK   - state is still hashed, not locked for reply
>  RP_LOCKED   - state is still hashed, is locked for reply
>  RP_UNHASHED - state is now hashed, no code can get a lock.
> 

"is now unhashed", I think...

> Use wait_ver_event() to wait for either a lock, or for the owner to be
> unhashed.  In the latter case, retry the lookup.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 46 ++++++++++++++++++++++++++++++++++++---------
>  fs/nfsd/state.h     |  2 +-
>  2 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index e625f738f7b0..5dab316932d3 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4442,21 +4442,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn)
>  	atomic_set(&nn->nfsd_courtesy_clients, 0);
>  }
>  
> +enum rp_lock {
> +	RP_UNLOCKED,
> +	RP_LOCKED,
> +	RP_UNHASHED,
> +};
> +
>  static void init_nfs4_replay(struct nfs4_replay *rp)
>  {
>  	rp->rp_status = nfserr_serverfault;
>  	rp->rp_buflen = 0;
>  	rp->rp_buf = rp->rp_ibuf;
> -	mutex_init(&rp->rp_mutex);
> +	atomic_set(&rp->rp_locked, RP_UNLOCKED);
>  }
>  
> -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> -		struct nfs4_stateowner *so)
> +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> +				      struct nfs4_stateowner *so)
>  {
>  	if (!nfsd4_has_session(cstate)) {
> -		mutex_lock(&so->so_replay.rp_mutex);
> +		wait_var_event(&so->so_replay.rp_locked,
> +			       atomic_cmpxchg(&so->so_replay.rp_locked,
> +					      RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
> +		if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED)
> +			return -EAGAIN;
>  		cstate->replay_owner = nfs4_get_stateowner(so);
>  	}
> +	return 0;
>  }
>  
>  void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
> @@ -4465,7 +4476,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
>  
>  	if (so != NULL) {
>  		cstate->replay_owner = NULL;
> -		mutex_unlock(&so->so_replay.rp_mutex);
> +		atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED);
> +		wake_up_var(&so->so_replay.rp_locked);
>  		nfs4_put_stateowner(so);
>  	}
>  }
> @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
>  	 * Wait for the refcount to drop to 2. Since it has been unhashed,
>  	 * there should be no danger of the refcount going back up again at
>  	 * this point.
> +	 * Some threads with a reference might be waiting for rp_locked,
> +	 * so tell them to stop waiting.
>  	 */
> +	atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED);
> +	wake_up_var(&oo->oo_owner.so_replay.rp_locked);
>  	wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
>  
>  	release_all_access(s);
> @@ -5064,6 +5080,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
>  	clp = cstate->clp;
>  
>  	strhashval = ownerstr_hashval(&open->op_owner);
> +retry:
>  	oo = find_openstateowner_str(strhashval, open, clp);
>  	open->op_openowner = oo;
>  	if (!oo)
> @@ -5074,17 +5091,24 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
>  		open->op_openowner = NULL;
>  		goto new_owner;
>  	}
> -	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> +	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
> +		nfs4_put_stateowner(&oo->oo_owner);
> +		goto retry;
> +	}
>  	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
>  	if (status)
>  		return status;
>  	goto alloc_stateid;
>  new_owner:
>  	oo = alloc_init_open_stateowner(strhashval, open, cstate);
> +	open->op_openowner = oo;
>  	if (oo == NULL)
>  		return nfserr_jukebox;
> -	open->op_openowner = oo;
> -	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> +	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
> +		WARN_ON(1);

I don't think you want to WARN here. It seems quite possible for the
client to send simultaneous opens for different files with the same
stateowner.



> +		nfs4_put_stateowner(&oo->oo_owner);
> +		goto new_owner;

Is "goto new_owner" correct here? We likely raced with another RPC that
was using the same owner, so ours probably got inserted and the other
nfsd thread raced in and got the lock before we could. Retrying the
lookup seems more correct in this situation?

That said, it might be best to just call nfsd4_cstate_assign_replay
before hashing the new owner. If you lose the insertion race at that
point, you can just clear it and try to assign the one that won.

> +	}
>  alloc_stateid:
>  	open->op_stp = nfs4_alloc_open_stateid(clp);
>  	if (!open->op_stp)
> @@ -6841,11 +6865,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid,
>  	trace_nfsd_preprocess(seqid, stateid);
>  
>  	*stpp = NULL;
> +retry:
>  	status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn);
>  	if (status)
>  		return status;
>  	stp = openlockstateid(s);
> -	nfsd4_cstate_assign_replay(cstate, stp->st_stateowner);
> +	if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) {
> +		nfs4_put_stateowner(stp->st_stateowner);
> +		goto retry;
> +	}
>  
>  	status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp);
>  	if (!status)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 41bdc913fa71..6a3becd86112 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -446,7 +446,7 @@ struct nfs4_replay {
>  	unsigned int		rp_buflen;
>  	char			*rp_buf;
>  	struct knfsd_fh		rp_openfh;
> -	struct mutex		rp_mutex;
> +	atomic_t		rp_locked;
>  	char			rp_ibuf[NFSD4_REPLAY_ISIZE];
>  };
>  

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru()
  2024-03-01  0:07 [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() NeilBrown
                   ` (2 preceding siblings ...)
  2024-03-01  0:07 ` [PATCH 3/3] nfsd: drop st_mutex_mutex before calling move_to_close_lru() NeilBrown
@ 2024-03-01 14:05 ` Chuck Lever
  3 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2024-03-01 14:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

On Fri, Mar 01, 2024 at 11:07:36AM +1100, NeilBrown wrote:
> This series is intended to replace 
>  98be4be88369 nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru()
> in nfsd-next.

Hi Neil, I will drop that patch from nfsd-next now.


> As Jeff Layton oberved recently there is a problem with that patch.
> This series takes a different approach to the rp_mutex proble, replacing it with an
> atomic_t which has three states.  Details in the patch.
> 
> Thnaks,
> NeilBrown
> 
> 
>  [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open
>  [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in
>  [PATCH 3/3] nfsd: drop st_mutex_mutex before calling

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling.
  2024-03-01 12:54   ` Jeff Layton
@ 2024-03-03 23:59     ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2024-03-03 23:59 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

On Fri, 01 Mar 2024, Jeff Layton wrote:
> On Fri, 2024-03-01 at 11:07 +1100, NeilBrown wrote:
> > Rather than taking the rp_mutex in nfsd4_cleanup_open_state() (which
> > seems counter-intuitive), take it and assign rp_owner as soon as
> > possible.
> > 
> > This will support a future change when nfsd4_cstate_assign_replay() might
> > fail.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 13 +++++--------
> >  1 file changed, 5 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 7d6c657e0409..e625f738f7b0 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -5066,15 +5066,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
> >  	strhashval = ownerstr_hashval(&open->op_owner);
> >  	oo = find_openstateowner_str(strhashval, open, clp);
> >  	open->op_openowner = oo;
> > -	if (!oo) {
> > +	if (!oo)
> >  		goto new_owner;
> > -	}
> >  	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
> >  		/* Replace unconfirmed owners without checking for replay. */
> >  		release_openowner(oo);
> >  		open->op_openowner = NULL;
> >  		goto new_owner;
> >  	}
> > +	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> >  	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
> >  	if (status)
> >  		return status;
> > @@ -5084,6 +5084,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
> >  	if (oo == NULL)
> >  		return nfserr_jukebox;
> >  	open->op_openowner = oo;
> > +	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> >  alloc_stateid:
> >  	open->op_stp = nfs4_alloc_open_stateid(clp);
> >  	if (!open->op_stp)
> > @@ -5835,12 +5836,8 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> >  void nfsd4_cleanup_open_state(struct nfsd4_compound_state *cstate,
> >  			      struct nfsd4_open *open)
> >  {
> > -	if (open->op_openowner) {
> > -		struct nfs4_stateowner *so = &open->op_openowner->oo_owner;
> > -
> > -		nfsd4_cstate_assign_replay(cstate, so);
> > -		nfs4_put_stateowner(so);
> > -	}
> > +	if (cstate->replay_owner)
> > +		nfs4_put_stateowner(cstate->replay_owner);
> 
> The above delta doesn't look right. The replay_owner won't be set on
> v4.1+ mounts, but op_openowner will still hold a valid reference that
> will now leak.

Yes, of course.  I was over-thinking and making a mess of it.

Fixed,
thanks.

NeilBrown

> 
> >  	if (open->op_file)
> >  		kmem_cache_free(file_slab, open->op_file);
> >  	if (open->op_stp)
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru()
  2024-03-01 13:09   ` Jeff Layton
@ 2024-03-04  0:09     ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2024-03-04  0:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey

On Sat, 02 Mar 2024, Jeff Layton wrote:
> On Fri, 2024-03-01 at 11:07 +1100, NeilBrown wrote:
> > move_to_close_lru() waits for sc_count to become zero while holding
> > rp_mutex.  This can deadlock if another thread holds a reference and is
> > waiting for rp_mutex.
> > 
> > By the time we get to move_to_close_lru() the openowner is unhashed and
> > cannot be found any more.  So code waiting for the mutex can safely
> > retry the lookup if move_to_close_lru() has started.
> > 
> > So change rp_mutex to an atomic_t with three states:
> > 
> >  RP_UNLOCK   - state is still hashed, not locked for reply
> >  RP_LOCKED   - state is still hashed, is locked for reply
> >  RP_UNHASHED - state is now hashed, no code can get a lock.
> > 
> 
> "is now unhashed", I think...

 or "state is not hashed".  s/now/not/.

> 
> > Use wait_ver_event() to wait for either a lock, or for the owner to be
> > unhashed.  In the latter case, retry the lookup.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 46 ++++++++++++++++++++++++++++++++++++---------
> >  fs/nfsd/state.h     |  2 +-
> >  2 files changed, 38 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index e625f738f7b0..5dab316932d3 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -4442,21 +4442,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn)
> >  	atomic_set(&nn->nfsd_courtesy_clients, 0);
> >  }
> >  
> > +enum rp_lock {
> > +	RP_UNLOCKED,
> > +	RP_LOCKED,
> > +	RP_UNHASHED,
> > +};
> > +
> >  static void init_nfs4_replay(struct nfs4_replay *rp)
> >  {
> >  	rp->rp_status = nfserr_serverfault;
> >  	rp->rp_buflen = 0;
> >  	rp->rp_buf = rp->rp_ibuf;
> > -	mutex_init(&rp->rp_mutex);
> > +	atomic_set(&rp->rp_locked, RP_UNLOCKED);
> >  }
> >  
> > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > -		struct nfs4_stateowner *so)
> > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > +				      struct nfs4_stateowner *so)
> >  {
> >  	if (!nfsd4_has_session(cstate)) {
> > -		mutex_lock(&so->so_replay.rp_mutex);
> > +		wait_var_event(&so->so_replay.rp_locked,
> > +			       atomic_cmpxchg(&so->so_replay.rp_locked,
> > +					      RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
> > +		if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED)
> > +			return -EAGAIN;
> >  		cstate->replay_owner = nfs4_get_stateowner(so);
> >  	}
> > +	return 0;
> >  }
> >  
> >  void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
> > @@ -4465,7 +4476,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
> >  
> >  	if (so != NULL) {
> >  		cstate->replay_owner = NULL;
> > -		mutex_unlock(&so->so_replay.rp_mutex);
> > +		atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED);
> > +		wake_up_var(&so->so_replay.rp_locked);
> >  		nfs4_put_stateowner(so);
> >  	}
> >  }
> > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
> >  	 * Wait for the refcount to drop to 2. Since it has been unhashed,
> >  	 * there should be no danger of the refcount going back up again at
> >  	 * this point.
> > +	 * Some threads with a reference might be waiting for rp_locked,
> > +	 * so tell them to stop waiting.
> >  	 */
> > +	atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED);
> > +	wake_up_var(&oo->oo_owner.so_replay.rp_locked);
> >  	wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
> >  
> >  	release_all_access(s);
> > @@ -5064,6 +5080,7 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
> >  	clp = cstate->clp;
> >  
> >  	strhashval = ownerstr_hashval(&open->op_owner);
> > +retry:
> >  	oo = find_openstateowner_str(strhashval, open, clp);
> >  	open->op_openowner = oo;
> >  	if (!oo)
> > @@ -5074,17 +5091,24 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
> >  		open->op_openowner = NULL;
> >  		goto new_owner;
> >  	}
> > -	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> > +	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
> > +		nfs4_put_stateowner(&oo->oo_owner);
> > +		goto retry;
> > +	}
> >  	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
> >  	if (status)
> >  		return status;
> >  	goto alloc_stateid;
> >  new_owner:
> >  	oo = alloc_init_open_stateowner(strhashval, open, cstate);
> > +	open->op_openowner = oo;
> >  	if (oo == NULL)
> >  		return nfserr_jukebox;
> > -	open->op_openowner = oo;
> > -	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> > +	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
> > +		WARN_ON(1);
> 
> I don't think you want to WARN here. It seems quite possible for the
> client to send simultaneous opens for different files with the same
> stateowner.

Thanks.  alloc_init_open_stateowner() might not return a newly allocated
state owner, as it include the test "does it already exist" and also
adds it to the hash.  Not what I would expect based on the name.  Maybe
I'll clean up the code.

> 
> 
> 
> > +		nfs4_put_stateowner(&oo->oo_owner);
> > +		goto new_owner;
> 
> Is "goto new_owner" correct here? We likely raced with another RPC that
> was using the same owner, so ours probably got inserted and the other
> nfsd thread raced in and got the lock before we could. Retrying the
> lookup seems more correct in this situation?
> 
> That said, it might be best to just call nfsd4_cstate_assign_replay
> before hashing the new owner. If you lose the insertion race at that
> point, you can just clear it and try to assign the one that won.

I did consider that approach but wasn't sure it was worth it.  I'll have
another look.

Thanks for the review.

NeilBrown


> 
> > +	}
> >  alloc_stateid:
> >  	open->op_stp = nfs4_alloc_open_stateid(clp);
> >  	if (!open->op_stp)
> > @@ -6841,11 +6865,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid,
> >  	trace_nfsd_preprocess(seqid, stateid);
> >  
> >  	*stpp = NULL;
> > +retry:
> >  	status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn);
> >  	if (status)
> >  		return status;
> >  	stp = openlockstateid(s);
> > -	nfsd4_cstate_assign_replay(cstate, stp->st_stateowner);
> > +	if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) {
> > +		nfs4_put_stateowner(stp->st_stateowner);
> > +		goto retry;
> > +	}
> >  
> >  	status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp);
> >  	if (!status)
> > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> > index 41bdc913fa71..6a3becd86112 100644
> > --- a/fs/nfsd/state.h
> > +++ b/fs/nfsd/state.h
> > @@ -446,7 +446,7 @@ struct nfs4_replay {
> >  	unsigned int		rp_buflen;
> >  	char			*rp_buf;
> >  	struct knfsd_fh		rp_openfh;
> > -	struct mutex		rp_mutex;
> > +	atomic_t		rp_locked;
> >  	char			rp_ibuf[NFSD4_REPLAY_ISIZE];
> >  };
> >  
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-03-04  0:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-01  0:07 [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() NeilBrown
2024-03-01  0:07 ` [PATCH 1/3] nfsd: move nfsd4_cstate_assign_replay() earlier in open handling NeilBrown
2024-03-01 12:54   ` Jeff Layton
2024-03-03 23:59     ` NeilBrown
2024-03-01  0:07 ` [PATCH 2/3] nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() NeilBrown
2024-03-01 13:09   ` Jeff Layton
2024-03-04  0:09     ` NeilBrown
2024-03-01  0:07 ` [PATCH 3/3] nfsd: drop st_mutex_mutex before calling move_to_close_lru() NeilBrown
2024-03-01 14:05 ` [PATCH 0/3] nfsd: fix dadlock in move_to_close_lru() Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).