From: Jeff Layton <jlayton@kernel.org>
To: NeilBrown <neilb@suse.de>, Chuck Lever <chuck.lever@oracle.com>,
	"J. Bruce Fields" <bfields@fieldses.org>
Cc: Olga Kornievskaia <kolga@netapp.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2] nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru()
Date: Thu, 29 Feb 2024 09:00:06 -0500
Message-ID: <5859ef402834c352209b29db73e20e2ab77e4bfc.camel@kernel.org>
In-Reply-To: <6926f1be34dfb66fc5395a7465c2f3970ac7652a.camel@kernel.org>

On Wed, 2024-02-28 at 12:40 -0500, Jeff Layton wrote:
> On Wed, 2024-01-17 at 14:48 +1100, NeilBrown wrote:
> > move_to_close_lru() is currently called with ->st_mutex and .rp_mutex held.
> > This can lead to a deadlock as move_to_close_lru() waits for sc_count to
> > drop to 2, and some threads holding a reference might be waiting for either
> > mutex.  These references will never be dropped so sc_count will never
> > reach 2.
> > 
> > There can be no harm in moving the drop of ->st_mutex to before
> > move_to_close_lru() because the only place that takes the mutex is
> > nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is
> > NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called.
> > 
> > Similarly, dropping .rp_mutex is safe after the state is closed and
> > so no longer usable.  Another way to look at this is that nothing
> > significant happens between the point where nfsd4_close() now calls
> > nfsd4_cstate_clear_replay() and the point where nfsd4_proc_compound()
> > calls it again a little later.
> > 
> > See also
> >  https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/
> > where this problem was raised but not successfully resolved.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 40415929e2ae..0850191f9920 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -7055,7 +7055,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
> >  	return status;
> >  }
> >  
> > -static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
> > +static bool nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
> >  {
> >  	struct nfs4_client *clp = s->st_stid.sc_client;
> >  	bool unhashed;
> > @@ -7072,11 +7072,11 @@ static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
> >  		list_for_each_entry(stp, &reaplist, st_locks)
> >  			nfs4_free_cpntf_statelist(clp->net, &stp->st_stid);
> >  		free_ol_stateid_reaplist(&reaplist);
> > +		return false;
> >  	} else {
> >  		spin_unlock(&clp->cl_lock);
> >  		free_ol_stateid_reaplist(&reaplist);
> > -		if (unhashed)
> > -			move_to_close_lru(s, clp->net);
> > +		return unhashed;
> >  	}
> >  }
> >  
> > @@ -7092,6 +7092,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	struct nfs4_ol_stateid *stp;
> >  	struct net *net = SVC_NET(rqstp);
> >  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> > +	bool need_move_to_close_list;
> >  
> >  	dprintk("NFSD: nfsd4_close on file %pd\n", 
> >  			cstate->current_fh.fh_dentry);
> > @@ -7114,8 +7115,17 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	 */
> >  	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
> >  
> > -	nfsd4_close_open_stateid(stp);
> > +	need_move_to_close_list = nfsd4_close_open_stateid(stp);
> >  	mutex_unlock(&stp->st_mutex);
> > +	if (need_move_to_close_list) {
> > +		/* Drop the replay mutex early as move_to_close_lru()
> > +		 * can wait for other threads which hold that mutex.
> > +		 * This call is idempotent, so the fact that it will
> > +		 * be called twice is harmless.
> > +		 */
> > +		nfsd4_cstate_clear_replay(cstate);

Ok, I think I figured out the regression. The problem is the above line.

That clears cstate->replay_owner, so nfsd4_encode_operation() no longer
updates so_replay.rp_buflen, leaving it set to whatever the _prior_
seqid-morphing operation cached. In this case that was an OPEN
reply, which was 40 bytes longer than the CLOSE reply.
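
To make that concrete, the tail of nfsd4_encode_operation() does
something like the sketch below (simplified from my reading of
fs/nfsd/nfs4xdr.c; the helper and its arguments are illustrative, not
the actual source):

/*
 * Illustrative only: reply bytes for a seqid-morphing op get copied
 * into the stateowner's replay cache only while cstate->replay_owner
 * is still set.  If it was cleared earlier, rp_buflen keeps whatever
 * length the previously cached reply had.
 */
static void cache_seqid_reply(struct nfs4_stateowner *so,
			      struct xdr_buf *buf, u32 post_err_offset,
			      __be32 status)
{
	if (!so)		/* replay_owner already cleared */
		return;

	so->so_replay.rp_status = status;
	so->so_replay.rp_buflen = buf->len - post_err_offset;
	read_bytes_from_xdr_buf(buf, post_err_offset,
				so->so_replay.rp_buf,
				so->so_replay.rp_buflen);
}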

I'm not sure of the best way to fix this, so it may be best to just
revert this patch for now.

Thinking about it more, the rp_mutex has a rather nasty code smell about
it. Maybe we ought to turn the mutex_lock into a trylock and just return
NFS4ERR_DELAY if you can't get it?

In principle, contention for that lock means that the stateowner is
spraying seqid-morphing operations at us. Returning DELAY would seem
like a reasonable thing to do there if we get confused.
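
Something along these lines, say (untested sketch; the real
nfsd4_cstate_assign_replay() is void and takes the mutex
unconditionally, so its callers would have to grow handling for a
status return):

/*
 * Untested sketch: a non-blocking variant of nfsd4_cstate_assign_replay()
 * that gives up on a contended rp_mutex and lets the caller return
 * NFS4ERR_DELAY instead of sleeping.
 */
static __be32
nfsd4_cstate_try_assign_replay(struct nfsd4_compound_state *cstate,
			       struct nfs4_stateowner *so)
{
	if (!cstate->replay_owner) {
		if (!mutex_trylock(&so->so_replay.rp_mutex))
			return nfserr_jukebox;	/* NFS4ERR_DELAY on the wire */
		cstate->replay_owner = nfs4_get_stateowner(so);
	}
	return nfs_ok;
}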

Chuck, Neil, any thoughts?

> > +		move_to_close_lru(stp, net);
> > +	}
> >  
> >  	/* v4.1+ suggests that we send a special stateid in here, since the
> >  	 * clients should just ignore this anyway. Since this is not useful
> 
> There is a recent regression in pynfs test CLOSE12 in Chuck's nfsd-next
> branch. In the attached capture, there is an extra 40 bytes on the end
> of the CLOSE response in frame 112.
> 
> A bisect landed on this patch, though I don't see the cause just yet.
> 
> Thoughts?

-- 
Jeff Layton <jlayton@kernel.org>

Thread overview: 10+ messages
2024-01-17  3:48 [PATCH v2] nfsd: drop st_mutex and rp_mutex before calling move_to_close_lru() NeilBrown
2024-01-17 18:19 ` Jeff Layton
2024-02-28 17:40 ` Jeff Layton
2024-02-28 19:30   ` Jeff Layton
2024-02-29 14:00   ` Jeff Layton [this message]
  -- strict thread matches above, loose matches on Subject: below --
2023-12-22  1:41 [PATCH] nfsd: drop st_mutex " NeilBrown
2023-12-22  2:12 ` [PATCH v2] nfsd: drop st_mutex and rp_mutex " NeilBrown
2023-12-22 14:39   ` Chuck Lever
2023-12-22 20:15   ` dai.ngo
2023-12-22 23:01     ` NeilBrown
2023-12-23 18:07       ` dai.ngo
