[PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts

From: Jeff Layton <jlayton@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>, NeilBrown <neil@brown.name>,
	 Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>,  Tom Talpey <tom@talpey.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	 Scott Mayhew <smayhew@redhat.com>,
	 Trond Myklebust <Trond.Myklebust@netapp.com>,
	 Andreas Gruenbacher <agruen@suse.de>,
	Mike Snitzer <snitzer@kernel.org>,
	 Rick Macklem <rmacklem@uoguelph.ca>
Cc: Chris Mason <clm@meta.com>,
	linux-nfs@vger.kernel.org,  linux-kernel@vger.kernel.org,
	Jeff Layton <jlayton@kernel.org>
Subject: [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts
Date: Thu, 28 May 2026 17:55:15 -0400
Message-ID: <20260528-nfsd-fixes-v1-4-e78708eff77d@kernel.org>
In-Reply-To: <20260528-nfsd-fixes-v1-0-e78708eff77d@kernel.org>

From: Chris Mason <clm@meta.com>

nfs4_client_to_reclaim() unconditionally allocates a new
nfs4_client_reclaim, prepends it to reclaim_str_hashtbl[], and bumps
reclaim_str_hashtbl_size with no check for an existing entry for the
same client name.  After a reboot with a populated recovery directory,
this inflates the counter by one for every client that reclaims:

    boot:    load_recdir()
               nfs4_client_to_reclaim(name)   /* entry #1, size++ */

    grace:   RECLAIM_COMPLETE
               __nfsd4_create_reclaim_record_grace()
                 nfs4_client_to_reclaim(name) /* entry #2, size++ */

inc_reclaim_complete() ends the grace period early only when

    atomic_inc_return(&nn->nr_reclaim_complete) ==
        nn->reclaim_str_hashtbl_size

With reclaim_str_hashtbl_size at 2N and nr_reclaim_complete capped at
N, the equality never holds and the fast end-of-grace path is dead.
The grace period always runs out the full 90-second laundromat timer,
and the shadow entry left in the hash table carries a dangling cr_clp
for any reader that walks it.
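
As a rough illustration of the arithmetic above, here is a userspace
model (not kernel code).  The struct and function names are
hypothetical, and the model assumes the in-grace insert happens before
the RECLAIM_COMPLETE tally is bumped:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are illustrative, not nfsd code. */
struct grace_model {
	size_t table_size;	/* models nn->reclaim_str_hashtbl_size */
	size_t nr_complete;	/* models nn->nr_reclaim_complete */
};

/* Model of an insert that never checks for an existing entry. */
static void model_insert_no_dedup(struct grace_model *m)
{
	m->table_size++;
}

/* Model of the fast-path test: true only on exact equality. */
static bool model_reclaim_complete(struct grace_model *m)
{
	return ++m->nr_complete == m->table_size;
}

/*
 * Run n clients through the boot-time load and then RECLAIM_COMPLETE,
 * and report whether the fast path ever fired.
 */
static bool model_fast_path_fires(size_t n)
{
	struct grace_model m = { 0, 0 };
	bool fired = false;
	size_t i;

	for (i = 0; i < n; i++)
		model_insert_no_dedup(&m);	/* boot: one entry per client */
	for (i = 0; i < n; i++) {
		model_insert_no_dedup(&m);	/* grace: duplicate entry */
		if (model_reclaim_complete(&m))
			fired = true;
	}
	return fired;
}
```

In this model the tally after client i is i while the size is n + i,
so model_fast_path_fires() returns false for every n >= 1.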

Fix nfs4_client_to_reclaim() to compute strhashval first, look the
name up with nfsd4_find_reclaim_client(), and on a hit fold the new
princhash into the existing record (if it lacks one) and return that
record without allocating or touching reclaim_str_hashtbl_size.  On
kmemdup() failure during the fold-in, return NULL so
__cld_pipe_inprogress_downcall() surfaces -EFAULT to nfsdcld, matching
the miss-path contract.
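
The lookup-before-insert shape described above can be sketched in
userspace roughly as follows.  This is a simplified model under stated
assumptions: a single list stands in for the hash table, malloc() for
kmemdup(), and every model_* name is hypothetical:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; not the nfsd data structures. */
struct model_rec {
	struct model_rec *next;
	char *name;
	unsigned char *princhash;	/* NULL until a hash is known */
	size_t princhash_len;
};

struct model_table {
	struct model_rec *head;
	size_t size;			/* number of distinct names */
};

static struct model_rec *model_find(struct model_table *t, const char *name)
{
	struct model_rec *r;

	for (r = t->head; r; r = r->next)
		if (strcmp(r->name, name) == 0)
			return r;
	return NULL;
}

/* Copy len bytes into fresh storage; NULL on allocation failure. */
static void *model_memdup(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}

/*
 * Look up before inserting.  On a hit, fold in the hash if the record
 * lacks one and leave size alone.  Returns NULL on allocation failure.
 */
static struct model_rec *model_insert(struct model_table *t, const char *name,
				      const unsigned char *hash, size_t len)
{
	struct model_rec *r = model_find(t, name);

	if (r) {
		if (len && r->princhash_len == 0) {
			unsigned char *copy = model_memdup(hash, len);

			if (!copy)
				return NULL;
			r->princhash = copy;
			r->princhash_len = len;
		}
		return r;
	}

	r = calloc(1, sizeof(*r));
	if (!r)
		return NULL;
	r->name = model_memdup(name, strlen(name) + 1);
	if (!r->name) {
		free(r);
		return NULL;
	}
	if (len) {
		r->princhash = model_memdup(hash, len);
		if (!r->princhash) {
			free(r->name);
			free(r);
			return NULL;
		}
		r->princhash_len = len;
	}
	r->next = t->head;
	t->head = r;
	t->size++;
	return r;
}
```

Inserting the same name twice in this model returns the same record
both times and leaves size at 1.  The model ignores concurrency
entirely.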

Because the fold-in writes cr_princhash.data and cr_princhash.len on
a record that is already linked into reclaim_str_hashtbl[], pair the
two stores with smp_store_release() on .len after WRITE_ONCE() on
.data, and have nfsd4_cld_check_v2() read .len with smp_load_acquire()
before READ_ONCE() on .data, so a concurrent principal-hash check
cannot observe a torn (data, len) pair.
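
For readers less familiar with the pairing, a rough userspace analogue
using C11 atomics is sketched below.  It is an approximation, not the
kernel primitives: a relaxed store/load stands in for
WRITE_ONCE()/READ_ONCE(), a release store and acquire load stand in
for smp_store_release()/smp_load_acquire(), and the model_* names are
hypothetical:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of a (data, len) pair published in place. */
struct model_hash {
	_Atomic(unsigned char *) data;
	atomic_size_t len;		/* 0 means "no hash yet" */
};

/* Writer: store data first, then release-store len. */
static void model_publish(struct model_hash *h, unsigned char *data,
			  size_t len)
{
	atomic_store_explicit(&h->data, data, memory_order_relaxed);
	atomic_store_explicit(&h->len, len, memory_order_release);
}

/*
 * Reader: acquire-load len first.  Observing a non-zero len guarantees
 * that the data store made before the release is visible too.
 * Returns 1 on match, 0 on mismatch, -1 if no hash has been published.
 */
static int model_check(struct model_hash *h, const unsigned char *digest)
{
	size_t len = atomic_load_explicit(&h->len, memory_order_acquire);
	unsigned char *data;

	if (!len)
		return -1;
	data = atomic_load_explicit(&h->data, memory_order_relaxed);
	return memcmp(data, digest, len) == 0;
}
```

The ordering matters in both directions: storing len before data, or
loading data before len, would reopen the window in which a reader
sees a non-zero length with a stale pointer.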

Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs4recover.c | 16 +++++++++++++---
 fs/nfsd/nfs4state.c   | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 6ea25a52d2f4..f7905aa9fdce 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1215,6 +1215,7 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 	struct cld_net *cn = nn->cld_net;
 #endif
 	struct nfs4_client_reclaim *crp;
+	unsigned int princhashlen;
 	char *principal = NULL;
 
 	/* did we already find that this client is stable? */
@@ -1249,8 +1250,17 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 #endif
 	return -ENOENT;
 found:
-	if (crp->cr_princhash.len) {
+	/*
+	 * nfs4_client_to_reclaim() may fold a princhash into an
+	 * already-listed reclaim record concurrently with this read.
+	 * Pair with the smp_store_release() on cr_princhash.len there:
+	 * if we observe a non-zero len we must also observe the
+	 * matching .data pointer.
+	 */
+	princhashlen = smp_load_acquire(&crp->cr_princhash.len);
+	if (princhashlen) {
 		u8 digest[SHA256_DIGEST_SIZE];
+		u8 *pdata;
 
 		if (clp->cl_cred.cr_raw_principal)
 			principal = clp->cl_cred.cr_raw_principal;
@@ -1259,8 +1269,8 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 		if (principal == NULL)
 			return -ENOENT;
 		sha256(principal, strlen(principal), digest);
-		if (memcmp(crp->cr_princhash.data, digest,
-				crp->cr_princhash.len))
+		pdata = READ_ONCE(crp->cr_princhash.data);
+		if (memcmp(pdata, digest, princhashlen))
 			return -ENOENT;
 	}
 	crp->cr_clp = clp;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index dc4ac541436f..3709d0ebcd99 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9289,6 +9289,41 @@ nfs4_client_to_reclaim(struct xdr_netobj name, struct xdr_netobj princhash,
 	unsigned int strhashval;
 	struct nfs4_client_reclaim *crp;
 
+	/*
+	 * A reclaim record for this client name may already exist (for
+	 * example, populated at boot from the recovery directory before
+	 * an in-grace RECLAIM_COMPLETE or an nfsdcld downcall delivers
+	 * the same name). Dedup here so reclaim_str_hashtbl_size stays
+	 * equal to the number of distinct client names; inc_reclaim_complete
+	 * relies on that equality to end the grace period via the fast path.
+	 */
+	crp = nfsd4_find_reclaim_client(name, nn);
+	if (crp) {
+		if (princhash.len && crp->cr_princhash.len == 0) {
+			void *pdata = kmemdup(princhash.data, princhash.len,
+					      GFP_KERNEL);
+			if (pdata) {
+				/*
+				 * crp is already linked into reclaim_str_hashtbl[]
+				 * and may be examined concurrently by
+				 * nfsd4_cld_check_v2(). Publish .data before .len
+				 * with release semantics so any reader that
+				 * observes a non-zero len via the paired
+				 * smp_load_acquire() also observes the new
+				 * data pointer.
+				 */
+				WRITE_ONCE(crp->cr_princhash.data, pdata);
+				smp_store_release(&crp->cr_princhash.len,
+						  princhash.len);
+			} else {
+				dprintk("%s: failed to allocate memory for princhash.data!\n",
+					__func__);
+				return NULL;
+			}
+		}
+		return crp;
+	}
+
 	name.data = kmemdup(name.data, name.len, GFP_KERNEL);
 	if (!name.data) {
 		dprintk("%s: failed to allocate memory for name.data!\n",

-- 
2.54.0
