[PATCH 00/10] nfsd: a pile of fixes for random bugs

Linux NFS development
 help / color / mirror / Atom feed

* [PATCH 00/10] nfsd: a pile of fixes for random bugs
@ 2026-05-28 21:55 Jeff Layton
  2026-05-28 21:55 ` [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

These bugs were categorized as remotely-triggerable panics, UAFs, DoS's,
etc., but they aren't reliable. There are also a few protocol fixes in
here too, etc. It's a grab bag.

A few of the patches were not authored by me. In particular, this patch
was submitted by Chuck a couple of years ago:

    NFSD: Enable return of an updated stable_how to NFS clients

...but Claude believes that this fixes a real bug and isn't optional.

The set passes basic pynfs smoke testing.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Chris Mason (6):
      nfsd: drain callbacks and clear cl_cb_session
      nfsd: serialize nfsd4_end_grace() with atomic test-and-set
      nfsd: dedup nfs4_client_to_reclaim inserts
      nfsd: gate nfs3 setacl by argp->mask
      nfsd: fix partial-write detection in nfsd_direct_write
      nfsd: cap decoded POSIX ACL count to bound sort cost

Chuck Lever (2):
      NFSD: Enable return of an updated stable_how to NFS clients
      NFSD: check truncate permission under inode lock

Jeff Layton (2):
      nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke
      nfsd: validate symlink target length in NFSv4 CREATE

 fs/nfsd/nfs3acl.c      | 17 +++++++++++------
 fs/nfsd/nfs3proc.c     |  2 +-
 fs/nfsd/nfs4callback.c | 21 ++++++++++++++++----
 fs/nfsd/nfs4layouts.c  | 12 +++++++++---
 fs/nfsd/nfs4proc.c     |  2 +-
 fs/nfsd/nfs4recover.c  | 16 +++++++++++++---
 fs/nfsd/nfs4state.c    | 52 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/nfsd/nfs4xdr.c      |  6 ++++++
 fs/nfsd/nfsproc.c      |  3 ++-
 fs/nfsd/vfs.c          | 46 +++++++++++++++++++++++++++-----------------
 fs/nfsd/vfs.h          |  6 ++++--
 fs/nfsd/xdr3.h         |  2 +-
 12 files changed, 142 insertions(+), 43 deletions(-)
---
base-commit: bbe29ec5b789b9e613170cf0d869260c9128e1e0
change-id: 20260528-nfsd-fixes-89a6e5e20c9d

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-28 23:40   ` NeilBrown
  2026-05-28 21:55 ` [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

nfsd4_alloc_layout_stateid reads fp->fi_deleg_file without holding
fi_lock when the parent stateid is a delegation. A concurrent delegation
revoke via the laundromat can clear fi_deleg_file under fi_lock, causing
nfsd_file_get() to return NULL and triggering the BUG_ON.

This race is client-reachable: two NFS clients can trigger it by having
one hold a delegation while another opens the same file to force a
recall. When the first client doesn't respond to the recall, the
laundromat revokes it. A concurrent LAYOUTGET from any client using the
delegation stateid hits the race window.

Fix this by taking fi_lock around the fi_deleg_file read in the
SC_TYPE_DELEG path, and replacing the BUG_ON with a graceful error
return that cleans up the partially-initialized layout stateid.

Fixes: c5c707f96fc9 ("nfsd: implement pNFS layout recalls")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4layouts.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 9ed2e3d65062..5d48c7673fa1 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -247,11 +247,17 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
 	nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
 			NFSPROC4_CLNT_CB_LAYOUT);

-	if (parent->sc_type == SC_TYPE_DELEG)
+	if (parent->sc_type == SC_TYPE_DELEG) {
+		spin_lock(&fp->fi_lock);
 		ls->ls_file = nfsd_file_get(rcu_dereference_protected(fp->fi_deleg_file, 1));
-	else
+		spin_unlock(&fp->fi_lock);
+	} else {
 		ls->ls_file = find_any_file(fp);
-	BUG_ON(!ls->ls_file);
+	}
+	if (!ls->ls_file) {
+		nfs4_put_stid(stp);
+		return NULL;
+	}

 	ls->ls_fenced = false;
 	ls->ls_fence_delay = 0;

-- 
2.54.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
  2026-05-28 21:55 ` [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 15:13   ` Chuck Lever
  2026-05-28 21:55 ` [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

After a DESTROY_SESSION the per-session teardown path can free a
session while rpciod still holds an inflight callback rpc_task that
dereferences clp->cl_cb_session.  nfsd4_probe_callback_sync() flushes
cl_callback_wq, but once nfsd4_run_cb_work() has called
rpc_call_async() the rpc_task lives on rpciod; flushing the workqueue
does not wait for it.  After the flush returns,
nfsd4_destroy_session() proceeds through nfsd4_put_session_locked()
and free_session() kfree()s the slab while rpciod's
nfsd4_cb_sequence_done(), grab_slot(), and nfsd41_cb_release_slot()
are still dereferencing cb->cb_clp->cl_cb_session.

    destroy path                       rpciod
    ------------                       ------
    unhash_session(ses)
    nfsd4_probe_callback_sync(clp)
      flush_workqueue(cl_callback_wq)
      /* returns; rpc_task still live */
    nfsd4_put_session_locked(ses)
    free_session(ses) -> kfree(ses)
                                       nfsd4_cb_sequence_done()
                                         reads cb_clp->cl_cb_session
                                         /* freed slab */

A second window exists in nfsd4_process_cb_update().  When
__nfsd4_find_backchannel() returns NULL because unhash_session() has
already removed the destroyed session from cl_sessions,
setup_callback_client() takes the v4.1 early return

    if (!conn->cb_xprt || !ses)
            return -EINVAL;

so clp->cl_cb_session = ses never fires and the field retains a
pointer to the about-to-be-freed session.  Symmetrically, if a later
probe finds a different session's backchannel conn and that
setup_callback_client() call fails, the error tail must still scrub
any previously published cl_cb_session.

Fix by mirroring the two-stage drain that nfsd4_shutdown_callback()
already performs: call nfsd41_cb_inflight_wait_complete() in
nfsd4_probe_callback_sync() after flush_workqueue() so rpciod-side
nfsd41_cb_inflight_end() decrements are observed before the caller
releases the final session reference.  The two direct callers,
nfsd4_destroy_session() and nfsd4_init_conn() (itself invoked from
nfsd4_create_session() and nfsd4_bind_conn_to_session()), run in
sleepable process context and tolerate the wait_var_event() sleep:

    nfsd4_destroy_session() (fs/nfsd/nfs4state.c):
      unhash_session(ses);
      spin_unlock(&nn->client_lock);   /* spinlock dropped */
      nfsd4_probe_callback_sync(ses->se_client);

    nfsd4_init_conn() (fs/nfsd/nfs4state.c):
      acquires no locks in its body; calls nfsd4_hash_conn(),
      nfsd4_register_conn(), then nfsd4_probe_callback_sync() --
      entirely in sleepable process context.

Also clear clp->cl_cb_session unconditionally on the
nfsd4_process_cb_update() error return so every
setup_callback_client() failure -- whether c is NULL or points at a
different session whose probe failed -- leaves the field NULL rather
than pointing at a session that may subsequently be freed.

Fixes: dcbeaa68dbbd ("nfsd4: allow backchannel recovery")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs4callback.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 1964a213f80e..1cf6b6100357 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1205,9 +1205,8 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c
 	} else {
 		if (!conn->cb_xprt || !ses)
 			return -EINVAL;
-		clp->cl_cb_session = ses;
 		args.bc_xprt = conn->cb_xprt;
-		args.prognumber = clp->cl_cb_session->se_cb_prog;
+		args.prognumber = ses->se_cb_prog;
 		args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
 				XPRT_TRANSPORT_BC;
 		args.authflavor = ses->se_cb_sec.flavor;
@@ -1225,8 +1224,10 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c
 		return -ENOMEM;
 	}
 
-	if (clp->cl_minorversion != 0)
+	if (clp->cl_minorversion != 0) {
 		clp->cl_cb_conn.cb_xprt = conn->cb_xprt;
+		clp->cl_cb_session = ses;
+	}
 	clp->cl_cb_client = client;
 	clp->cl_cb_cred = cred;
 	rcu_read_lock();
@@ -1299,6 +1300,7 @@ void nfsd4_probe_callback_sync(struct nfs4_client *clp)
 {
 	nfsd4_probe_callback(clp);
 	flush_workqueue(clp->cl_callback_wq);
+	nfsd41_cb_inflight_wait_complete(clp);
 }
 
 void nfsd4_change_callback(struct nfs4_client *clp, struct nfs4_cb_conn *conn)
@@ -1679,7 +1681,17 @@ static struct nfsd4_conn * __nfsd4_find_backchannel(struct nfs4_client *clp)
  * Note there isn't a lot of locking in this code; instead we depend on
  * the fact that it is run from clp->cl_callback_wq, which won't run two
  * work items at once.  So, for example, clp->cl_callback_wq handles all
- * access of cl_cb_client and all calls to rpc_create or rpc_shutdown_client.
+ * access of cl_cb_client and cl_cb_session, and all calls to rpc_create
+ * or rpc_shutdown_client.
+ *
+ * rpciod-side readers of cl_cb_session (encode_cb_sequence4args(),
+ * nfsd4_cb_sequence_done(), the cb-slot helpers, and the cb_sequence
+ * tracepoints) run outside cl_callback_wq.  The
+ * nfsd41_cb_inflight_wait_complete() drain in nfsd4_probe_callback_sync()
+ * waits until cl_cb_inflight reaches zero before the caller proceeds with
+ * session teardown; any rpc_task that reads cl_cb_session must hold an
+ * inflight pin (via nfsd41_cb_inflight_begin) for this fence to be
+ * effective.
  */
 static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
 {
@@ -1731,6 +1743,7 @@ static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
 		nfsd4_mark_cb_down(clp);
 		if (c)
 			svc_xprt_put(c->cn_xprt);
+		clp->cl_cb_session = NULL;
 		return;
 	}
 }

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
  2026-05-28 21:55 ` [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke Jeff Layton
  2026-05-28 21:55 ` [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 15:38   ` Chuck Lever
  2026-05-28 21:55 ` [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

nfsd4_end_grace() guards its drain path with a plain bool:

    if (nn->grace_ended)
            return;
    nn->grace_ended = true;

The read and the write are independent, and nothing in struct
nfsd_net serializes them.  At least two contexts can reach this
code with no lock held:

    laundromat path
      laundry_wq kworker
        nfs4_laundromat()
          nfsd4_end_grace()

    RECLAIM_COMPLETE path
      nfsd compound kthread
        nfsd4_reclaim_complete()
          inc_reclaim_complete()
            nfsd4_end_grace()

Both callers can observe grace_ended == false on different CPUs,
both store true, and both proceed into nfsd4_record_grace_done(),
which invokes the active client_tracking_ops->grace_done callback.
For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
via nfsd4_recdir_purge_old, and the cld v1+ ops via
nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
which walks every bucket of reclaim_str_hashtbl with no lock and
calls nfs4_remove_reclaim_record() (list_del + kfree) on each
entry.  Two concurrent walkers corrupt the list and double-free
every nfs4_client_reclaim.  A concurrent nfsd4_find_reclaim_client()
iterating the same bucket reads through freed memory.

A third call site exists in nfs4_state_start_net() on the
skip_grace startup path, but it runs under nfsd_mutex before any
client has connected and before the laundromat's first delayed
work fires, so it cannot race with the two callers above.

Fix by replacing the read/write pair with try_cmpxchg() so exactly
one caller transitions grace_ended from false to true and proceeds
into the drain; the loser returns immediately.  bool supports
1-byte cmpxchg on all supported architectures, and no lock
ordering changes are needed.

Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs4state.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index f4d12dbcf97b..dc4ac541436f 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 static void
 nfsd4_end_grace(struct nfsd_net *nn)
 {
-	/* do nothing if grace period already ended */
-	if (nn->grace_ended)
+	bool expected = false;
+
+	/*
+	 * nfsd4_end_grace() can be entered concurrently from the
+	 * laundromat workqueue and from an nfsd compound thread
+	 * handling RECLAIM_COMPLETE.  Without serialization, both
+	 * callers can observe grace_ended==false and proceed into
+	 * nfsd4_record_grace_done().  For tracking ops whose
+	 * grace_done drains reclaim_str_hashtbl, that results in
+	 * list corruption and a double free of every
+	 * nfs4_client_reclaim entry.  Use an atomic test-and-set so
+	 * exactly one caller proceeds.
+	 */
+	if (!try_cmpxchg(&nn->grace_ended, &expected, true))
 		return;
 
 	trace_nfsd_grace_complete(nn);
-	nn->grace_ended = true;
 	/*
 	 * If the server goes down again right now, an NFSv4
 	 * client will still be allowed to reclaim after it comes back up,

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (2 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 16:22   ` Chuck Lever
  2026-05-28 21:55 ` [PATCH 05/10] nfsd: gate nfs3 setacl by argp->mask Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

nfs4_client_to_reclaim() unconditionally allocates a new
nfs4_client_reclaim, prepends it to reclaim_str_hashtbl[], and bumps
reclaim_str_hashtbl_size with no check for an existing entry for the
same client name.  After a reboot with a populated recovery directory
that inflates the counter by one for every client that reclaims:

    boot:    load_recdir()
               nfs4_client_to_reclaim(name)   /* entry #1, size++ */

    grace:   RECLAIM_COMPLETE
               __nfsd4_create_reclaim_record_grace()
                 nfs4_client_to_reclaim(name) /* entry #2, size++ */

inc_reclaim_complete() ends the grace period early only when

    atomic_inc_return(&nn->nr_reclaim_complete) ==
        nn->reclaim_str_hashtbl_size

With reclaim_str_hashtbl_size at 2N and nr_reclaim_complete capped at
N, the equality never holds and the fast end-of-grace path is dead.
The grace period always runs out the full 90-second laundromat timer,
and the shadow entry left in the hash table carries a dangling cr_clp
for any reader that walks it.

Fix nfs4_client_to_reclaim() to compute strhashval first, look the
name up with nfsd4_find_reclaim_client(), and on a hit fold the new
princhash into the existing record (if it lacks one) and return that
record without allocating or touching reclaim_str_hashtbl_size.  On
kmemdup() failure during the fold-in, return NULL so
__cld_pipe_inprogress_downcall() surfaces -EFAULT to nfsdcld, matching
the miss-path contract.

Because the fold-in writes cr_princhash.data and cr_princhash.len on
a record that is already linked into reclaim_str_hashtbl[], pair the
two stores with smp_store_release() on .len after WRITE_ONCE() on
.data, and have nfsd4_cld_check_v2() read .len with smp_load_acquire()
before READ_ONCE() on .data, so a concurrent principal-hash check
cannot observe a torn (data, len) pair.

Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs4recover.c | 16 +++++++++++++---
 fs/nfsd/nfs4state.c   | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 6ea25a52d2f4..f7905aa9fdce 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1215,6 +1215,7 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 	struct cld_net *cn = nn->cld_net;
 #endif
 	struct nfs4_client_reclaim *crp;
+	unsigned int princhashlen;
 	char *principal = NULL;
 
 	/* did we already find that this client is stable? */
@@ -1249,8 +1250,17 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 #endif
 	return -ENOENT;
 found:
-	if (crp->cr_princhash.len) {
+	/*
+	 * nfs4_client_to_reclaim() may fold a princhash into an
+	 * already-listed reclaim record concurrently with this read.
+	 * Pair with the smp_store_release() on cr_princhash.len there:
+	 * if we observe a non-zero len we must also observe the
+	 * matching .data pointer.
+	 */
+	princhashlen = smp_load_acquire(&crp->cr_princhash.len);
+	if (princhashlen) {
 		u8 digest[SHA256_DIGEST_SIZE];
+		u8 *pdata;
 
 		if (clp->cl_cred.cr_raw_principal)
 			principal = clp->cl_cred.cr_raw_principal;
@@ -1259,8 +1269,8 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
 		if (principal == NULL)
 			return -ENOENT;
 		sha256(principal, strlen(principal), digest);
-		if (memcmp(crp->cr_princhash.data, digest,
-				crp->cr_princhash.len))
+		pdata = READ_ONCE(crp->cr_princhash.data);
+		if (memcmp(pdata, digest, princhashlen))
 			return -ENOENT;
 	}
 	crp->cr_clp = clp;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index dc4ac541436f..3709d0ebcd99 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9289,6 +9289,41 @@ nfs4_client_to_reclaim(struct xdr_netobj name, struct xdr_netobj princhash,
 	unsigned int strhashval;
 	struct nfs4_client_reclaim *crp;
 
+	/*
+	 * A reclaim record for this client name may already exist (for
+	 * example, populated at boot from the recovery directory before
+	 * an in-grace RECLAIM_COMPLETE or an nfsdcld downcall delivers
+	 * the same name). Dedup here so reclaim_str_hashtbl_size stays
+	 * equal to the number of distinct client names; inc_reclaim_complete
+	 * relies on that equality to end the grace period via the fast path.
+	 */
+	crp = nfsd4_find_reclaim_client(name, nn);
+	if (crp) {
+		if (princhash.len && crp->cr_princhash.len == 0) {
+			void *pdata = kmemdup(princhash.data, princhash.len,
+					      GFP_KERNEL);
+			if (pdata) {
+				/*
+				 * crp is already linked into reclaim_str_hashtbl[]
+				 * and may be examined concurrently by
+				 * nfsd4_cld_check_v2(). Publish .data before .len
+				 * with release semantics so any reader that
+				 * observes a non-zero len via the paired
+				 * smp_load_acquire() also observes the new
+				 * data pointer.
+				 */
+				WRITE_ONCE(crp->cr_princhash.data, pdata);
+				smp_store_release(&crp->cr_princhash.len,
+						  princhash.len);
+			} else {
+				dprintk("%s: failed to allocate memory for princhash.data!\n",
+					__func__);
+				return NULL;
+			}
+		}
+		return crp;
+	}
+
 	name.data = kmemdup(name.data, name.len, GFP_KERNEL);
 	if (!name.data) {
 		dprintk("%s: failed to allocate memory for name.data!\n",

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 05/10] nfsd: gate nfs3 setacl by argp->mask
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (3 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-28 21:55 ` [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

nfsd3_proc_setacl() calls set_posix_acl() unconditionally for both
ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT, passing argp->acl_access and
argp->acl_default verbatim. The NFSv3 ACL decoder only populates
those pointers when the corresponding mask bit is set:

    nfs3svc_decode_setaclargs()
      if (args->mask & NFS_ACL)    decode into acl_access
      if (args->mask & NFS_DFACL)  decode into acl_default
      /* otherwise the pointer stays NULL (pc_argzero) */

    nfsd3_proc_setacl()
      set_posix_acl(.., ACL_TYPE_ACCESS,  argp->acl_access)
      set_posix_acl(.., ACL_TYPE_DEFAULT, argp->acl_default)

set_posix_acl(idmap, dentry, type, NULL) is the VFS "remove this
ACL type" operation. A NULL pointer that means "the client did not
send this arm" is therefore indistinguishable from "the client
asked to remove this ACL". A SETACL with mask=NFS_ACL silently
drops the directory's default ACL; mask=0 drops both.

The sibling nfsd3_proc_getacl() already consults argp->mask before
touching each arm; mirror that in setacl.

Fix by wrapping each set_posix_acl() call in the matching mask bit
check and initializing error to 0 before inode_lock so that a
request with neither bit set leaves the on-disk ACLs untouched and
returns nfs_ok. The out_drop_lock path and the unconditional
posix_acl_release() at out: are preserved; both NULL-tolerate the
skipped arms.

Fixes: a257cdd0e217 ("[PATCH] NFSD: Add server support for NFSv3 ACLs.")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs3acl.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfs3acl.c b/fs/nfsd/nfs3acl.c
index e87731380be8..a87f9d7f32be 100644
--- a/fs/nfsd/nfs3acl.c
+++ b/fs/nfsd/nfs3acl.c
@@ -105,12 +105,17 @@ static __be32 nfsd3_proc_setacl(struct svc_rqst *rqstp)
 
 	inode_lock(inode);
 
-	error = set_posix_acl(&nop_mnt_idmap, fh->fh_dentry, ACL_TYPE_ACCESS,
-			      argp->acl_access);
-	if (error)
-		goto out_drop_lock;
-	error = set_posix_acl(&nop_mnt_idmap, fh->fh_dentry, ACL_TYPE_DEFAULT,
-			      argp->acl_default);
+	error = 0;
+	if (argp->mask & NFS_ACL) {
+		error = set_posix_acl(&nop_mnt_idmap, fh->fh_dentry,
+				      ACL_TYPE_ACCESS, argp->acl_access);
+		if (error)
+			goto out_drop_lock;
+	}
+	if (argp->mask & NFS_DFACL) {
+		error = set_posix_acl(&nop_mnt_idmap, fh->fh_dentry,
+				      ACL_TYPE_DEFAULT, argp->acl_default);
+	}
 
 out_drop_lock:
 	inode_unlock(inode);

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (4 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 05/10] nfsd: gate nfs3 setacl by argp->mask Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 10:56   ` Jeff Layton
  2026-05-30  7:58   ` NFSv4.1 COMMIT of all changed areas only on flush? " Cedric Blancher
  2026-05-28 21:55 ` [PATCH 07/10] NFSD: check truncate permission under inode lock Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chuck Lever <chuck.lever@oracle.com>

In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
WRITE to be a FILE_SYNC WRITE. This indicates that the client does
not need a subsequent COMMIT operation, saving a round trip and
allowing the client to dispense with cached dirty data as soon as
it receives the server's WRITE response.

This patch refactors nfsd_vfs_write() to return a possibly modified
stable_how value to its callers. No behavior change is expected.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c |  2 +-
 fs/nfsd/nfs4proc.c |  2 +-
 fs/nfsd/nfsproc.c  |  3 ++-
 fs/nfsd/vfs.c      | 11 ++++++-----
 fs/nfsd/vfs.h      |  6 ++++--
 fs/nfsd/xdr3.h     |  2 +-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index aeda7a802bdf..dd5ac59e87d6 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 	resp->committed = argp->stable;
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
 				  &argp->payload, &cnt,
-				  resp->committed, resp->verf);
+				  &resp->committed, resp->verf);
 	resp->count = cnt;
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 5f2b9bfc3a84..ac03f9d89288 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1355,7 +1355,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	write->wr_how_written = write->wr_stable_how;
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
 				write->wr_offset, &write->wr_payload,
-				&cnt, write->wr_how_written,
+				&cnt, &write->wr_how_written,
 				(__be32 *)write->wr_verifier.data);
 	nfsd_file_put(nf);
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8873033d1e82..d0a7316f00a5 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 	struct nfsd_writeargs *argp = rqstp->rq_argp;
 	struct nfsd_attrstat *resp = rqstp->rq_resp;
 	unsigned long cnt = argp->len;
+	u32 committed = NFS_DATA_SYNC;
 
 	dprintk("nfsd: WRITE    %s %u bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
-				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+				  &argp->payload, &cnt, &committed, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1e89c7ff9493..7f07292d1569 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1414,7 +1414,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1426,11 +1426,12 @@ __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	       struct nfsd_file *nf, loff_t offset,
 	       const struct xdr_buf *payload, unsigned long *cnt,
-	       int stable, __be32 *verf)
+	       u32 *stable_how, __be32 *verf)
 {
 	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 	struct file		*file = nf->nf_file;
 	struct super_block	*sb = file_inode(file)->i_sb;
+	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
 	struct iov_iter		iter;
@@ -1609,7 +1610,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1619,7 +1620,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  */
 __be32
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
+	   const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
 	   __be32 *verf)
 {
 	struct nfsd_file *nf;
@@ -1632,7 +1633,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 		goto out;
 
 	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
-			     stable, verf);
+			     stable_how, verf);
 	nfsd_file_put(nf);
 out:
 	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index e09ea04a51b9..36cd9f6edd8b 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -132,11 +132,13 @@ __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				u32 *eof);
 __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct nfsd_file *nf, loff_t offset,
 				const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
 				char *, int *);
 __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index a7c9714b0b0e..1ff7b11c397c 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
 	__be32			status;
 	struct svc_fh		fh;
 	unsigned long		count;
-	int			committed;
+	u32			committed;
 	__be32			verf[2];
 };
 

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 07/10] NFSD: check truncate permission under inode lock
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (5 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-28 21:55 ` [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chuck Lever <chuck.lever@oracle.com>

nfsd_setattr() checks whether a size update needs NFSD_MAY_TRUNC
before it takes inode_lock(). The comparison uses the file size sampled
by that unlocked read, but the actual ATTR_SIZE update is applied later
under inode_lock() by notify_change().

This leaves a TOCTOU window for append-only files. If a client sends a
SETATTR that does not shrink the file at the time of the unlocked
sample, a concurrent append can extend the file before nfsd_setattr()
takes inode_lock(). notify_change() then applies a real truncation
without the NFSD_MAY_TRUNC check that rejects IS_APPEND(inode). The VFS
truncate syscall paths perform their own append-only checks before
calling notify_change(), so NFSD must make this decision against the
locked size it is about to change.

Split the write-count acquisition from the truncation permission check.
Keep get_write_access() before the locked setattr work, then recheck
whether the requested size is below i_size_read(inode) after inode_lock()
has been acquired and before notify_change(ATTR_SIZE). This also avoids
the plain unlocked inode->i_size load.

Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/vfs.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 7f07292d1569..980217f755b7 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -419,21 +419,22 @@ nfsd_sanitize_attrs(struct inode *inode, struct iattr *iap)
 }
 
 static __be32
-nfsd_get_write_access(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		struct iattr *iap)
+nfsd_may_truncate(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct iattr *iap)
 {
 	struct inode *inode = d_inode(fhp->fh_dentry);
 
-	if (iap->ia_size < inode->i_size) {
-		__be32 err;
+	if (iap->ia_size >= i_size_read(inode))
+		return nfs_ok;
 
-		err = nfsd_permission(&rqstp->rq_cred,
-				      fhp->fh_export, fhp->fh_dentry,
-				      NFSD_MAY_TRUNC | NFSD_MAY_OWNER_OVERRIDE);
-		if (err)
-			return err;
-	}
-	return nfserrno(get_write_access(inode));
+	return nfsd_permission(&rqstp->rq_cred, fhp->fh_export, fhp->fh_dentry,
+			       NFSD_MAY_TRUNC | NFSD_MAY_OWNER_OVERRIDE);
+}
+
+static __be32
+nfsd_get_write_access(struct svc_fh *fhp)
+{
+	return nfserrno(get_write_access(d_inode(fhp->fh_dentry)));
 }
 
 static int __nfsd_setattr(struct dentry *dentry, struct iattr *iap)
@@ -560,12 +561,17 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	 * setattr call.
 	 */
 	if (size_change) {
-		err = nfsd_get_write_access(rqstp, fhp, iap);
+		err = nfsd_get_write_access(fhp);
 		if (err)
 			return err;
 	}
 
 	inode_lock(inode);
+	if (size_change) {
+		err = nfsd_may_truncate(rqstp, fhp, iap);
+		if (err)
+			goto out_unlock;
+	}
 	err = fh_fill_pre_attrs(fhp);
 	if (err)
 		goto out_unlock;

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (6 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 07/10] NFSD: check truncate permission under inode lock Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 16:57   ` Chuck Lever
  2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
  2026-05-28 21:55 ` [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE Jeff Layton
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

nfsd_direct_write() walks a list of write segments and, after each
vfs_iocb_iter_write(), tries to detect a short write so the loop can
stop before placing the next segment at a wrong file offset:

    host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
    if (host_err < 0)
            return host_err;
    *cnt += host_err;
    if (host_err < segments[i].iter.count)
            break;	/* partial write */

vfs_iocb_iter_write() runs the iter through ->write_iter(), which
advances the iter by the number of bytes written. By the time the
check runs, segments[i].iter.count is the residual, not the original
request length:

    before write_iter: iter.count == original_len
    after  write_iter: iter.count == original_len - host_err

The condition then reduces to host_err < original_len - host_err, so
the break fires only when less than half of the segment was written.
Any short write completing between 50% and 99% of the segment slips
through; the loop advances to the next segment with kiocb->ki_pos
only bumped by the short amount, writing the next segment's payload
at the wrong offset and over-reporting *cnt to the NFS client.

Snapshot the segment's byte count before the write and compare
host_err against that snapshot so any short write breaks the loop.

Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/vfs.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 980217f755b7..619f252af4d1 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct file *file = nf->nf_file;
 	unsigned int nsegs, i;
 	ssize_t host_err;
+	size_t expected;
 
 	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
 					  kiocb, *cnt, segments);
@@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				kiocb->ki_flags |= IOCB_DONTCACHE;
 		}
 
+		expected = iov_iter_count(&segments[i].iter);
+
 		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
 		if (host_err < 0)
 			return host_err;
 		*cnt += host_err;
-		if (host_err < segments[i].iter.count)
+		if (host_err < (ssize_t)expected)
 			break;	/* partial write */
 	}
 

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (7 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-28 22:11   ` Rick Macklem
                     ` (2 more replies)
  2026-05-28 21:55 ` [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE Jeff Layton
  9 siblings, 3 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

From: Chris Mason <clm@meta.com>

nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
the server's compound processing path.

    nfsd4_decode_posixacl()
      xdr_stream_decode_u32(&count)       /* uncapped u32 */
      posix_acl_alloc(count, GFP_KERNEL)
      sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */

The encoder side in the same file already rejects ACLs whose a_count
exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
omitted the symmetric check.

Fix by rejecting a wire count greater than NFS_ACL_MAX_ENTRIES with
nfserr_resource, before any allocation, so the sort is bounded by
NFS_ACL_MAX_ENTRIES^2 comparisons.

Fixes: 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@meta.com>
---
 fs/nfsd/nfs4xdr.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index c6c50c376b23..5469c6c207ba 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
 
 	if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
 		return nfserr_bad_xdr;
+	if (count > NFS_ACL_MAX_ENTRIES)
+		return nfserr_resource;
 
 	*acl = posix_acl_alloc(count, GFP_KERNEL);
 	if (*acl == NULL)

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE
  2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
                   ` (8 preceding siblings ...)
  2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
@ 2026-05-28 21:55 ` Jeff Layton
  2026-05-29 18:55   ` Chuck Lever
  9 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-28 21:55 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel, Jeff Layton

nfsd4_decode_create() accepts an unbounded cr_datalen from the wire for
NF4LNK symlink targets, allowing a client to force a kmalloc of up to
the RPC-max size (~1 MiB) per COMPOUND op that persists until compound
teardown.  The VFS rejects oversized targets with ENAMETOOLONG, but the
allocation has already occurred.

Reject cr_datalen == 0 or cr_datalen >= PATH_MAX early with
nfserr_nametoolong to bound the allocation.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4xdr.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 5469c6c207ba..1f5e49f50f3a 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -957,6 +957,10 @@ nfsd4_decode_create(struct nfsd4_compoundargs *argp, union nfsd4_op_u *u)
 	case NF4LNK:
 		if (xdr_stream_decode_u32(argp->xdr, &create->cr_datalen) < 0)
 			return nfserr_bad_xdr;
+		if (create->cr_datalen == 0)
+			return nfserr_inval;
+		if (create->cr_datalen >= PATH_MAX)
+			return nfserr_nametoolong;
 		p = xdr_inline_decode(argp->xdr, create->cr_datalen);
 		if (!p)
 			return nfserr_bad_xdr;

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
@ 2026-05-28 22:11   ` Rick Macklem
  2026-05-28 23:11     ` Chuck Lever
  2026-05-29  7:34   ` Cedric Blancher
  2026-05-29 18:34   ` Chuck Lever
  2 siblings, 1 reply; 40+ messages in thread
From: Rick Macklem @ 2026-05-28 22:11 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On Thu, May 28, 2026 at 2:56 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you are unsure, forward the message to ITHelp@uoguelph.ca for review.
>
>
> From: Chris Mason <clm@meta.com>
>
> nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
> it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
> an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
> the server's compound processing path.
>
>     nfsd4_decode_posixacl()
>       xdr_stream_decode_u32(&count)       /* uncapped u32 */
>       posix_acl_alloc(count, GFP_KERNEL)
>       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
>
> The encoder side in the same file already rejects ACLs whose a_count
> exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
> 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> omitted the symmetric check.
My recollection is that Chuck didn't like this limit. He argued that it was
specific to the NFSv3 ACL protocol and that the limit on the size of a NFSv4
RPC message was sufficient.  I, personally, think that 1024 is a reasonable
limit for # of ACEs, but Chuck can jump in here if he doesn't agree.
(Note that, if user/groups are configured as "#s in the string", a lot of ACEs
fits in a 4Mbyte RPC message and the server file system will also set a limit
although, as you note, that happens after the sort.)

The good news w.r.t. the sort is that they are normally already sorted,
so the bubble sort is normally just a single pass, I think?

rick

>
> Fix by rejecting a wire count greater than NFS_ACL_MAX_ENTRIES with
> nfserr_resource, before any allocation, so the sort is bounded by
> NFS_ACL_MAX_ENTRIES^2 comparisons.
>
> Fixes: 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>
> ---
>  fs/nfsd/nfs4xdr.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index c6c50c376b23..5469c6c207ba 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
>
>         if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
>                 return nfserr_bad_xdr;
> +       if (count > NFS_ACL_MAX_ENTRIES)
> +               return nfserr_resource;
>
>         *acl = posix_acl_alloc(count, GFP_KERNEL);
>         if (*acl == NULL)
>
> --
> 2.54.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 22:11   ` Rick Macklem
@ 2026-05-28 23:11     ` Chuck Lever
  2026-05-29  0:07       ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-28 23:11 UTC (permalink / raw)
  To: Rick Macklem, Jeff Layton
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On Thu, May 28, 2026, at 6:11 PM, Rick Macklem wrote:
> On Thu, May 28, 2026 at 2:56 PM Jeff Layton <jlayton@kernel.org> wrote:
>>
>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you are unsure, forward the message to ITHelp@uoguelph.ca for review.
>>
>>
>> From: Chris Mason <clm@meta.com>
>>
>> nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
>> it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
>> an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
>> the server's compound processing path.
>>
>>     nfsd4_decode_posixacl()
>>       xdr_stream_decode_u32(&count)       /* uncapped u32 */
>>       posix_acl_alloc(count, GFP_KERNEL)
>>       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
>>
>> The encoder side in the same file already rejects ACLs whose a_count
>> exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
>> 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
>> omitted the symmetric check.
> My recollection is that Chuck didn't like this limit. He argued that it was
> specific to the NFSv3 ACL protocol and that the limit on the size of a NFSv4
> RPC message was sufficient.  I, personally, think that 1024 is a reasonable
> limit for # of ACEs, but Chuck can jump in here if he doesn't agree.

The NFSACL protocol limits the number of ACEs in an ACL to NFS_ACL_MAX_ENTRIES
(1024). It’s a limit that is defined in the protocol itself.

The NFSv4 protocol sets no similar limit. In particular, NFS_ACL_MAX_ENTRIES
is not a constant defined or used by the NFSv4.x family of protocols IIRC.

Using NFS_ACL_MAX_ENTRIES to cap the number of ACEs in NFSv4 ACLs is a
convenience, but it adds technical debt (slight though it may be).

So… We can define an implementation limit for NFSv4 ACL support in NFSD.
But it shouldn’t be called NFS_ACL_MAX_ENTRIES, IMHO. For clarity of
documentation, pick a number (could be 1024) and state in a comment that
it is an NFSD implementation constraint, not a protocol limit. Name the
constant something like NFSD4_MAX_yada to make it clear it is an
implementation limit, not a protocol limit.

(Check if the Linux POSIX ACL implementation already defines and imposes
a similar limit before defining a new one for NFSv4).

Then prepare for an onslaught of complaints that 1024 is just too small :-)

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke
  2026-05-28 21:55 ` [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke Jeff Layton
@ 2026-05-28 23:40   ` NeilBrown
  2026-05-29 14:44     ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2026-05-28 23:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel, Jeff Layton

On Fri, 29 May 2026, Jeff Layton wrote:
> nfsd4_alloc_layout_stateid reads fp->fi_deleg_file without holding
> fi_lock when the parent stateid is a delegation. A concurrent delegation
> revoke via the laundromat can clear fi_deleg_file under fi_lock, causing
> nfsd_file_get() to return NULL and triggering the BUG_ON.
> 
> This race is client-reachable: two NFS clients can trigger it by having
> one hold a delegation while another opens the same file to force a
> recall. When the first client doesn't respond to the recall, the
> laundromat revokes it. A concurrent LAYOUTGET from any client using the
> delegation stateid hits the race window.
> 
> Fix this by taking fi_lock around the fi_deleg_file read in the
> SC_TYPE_DELEG path, and replacing the BUG_ON with a graceful error
> return that cleans up the partially-initialized layout stateid.

Replacing the BUG_ON with a graceful error is certainly sensible and
probably all that is needed to fix the problem.

I cannot see how the spinlock achieves anything.  If ->fi_deleg_file
could become NULL at this point, it can become NULL just before we take
the spinlock.

We do need to be sure the file (if there is one) doesn't get freed while
nfsd_file_get() is incrementing the refcount, but rcu_read_lock() is the
normal tool for that.
In this case we have

  		ls->ls_file = nfsd_file_get(rcu_dereference_protected(fp->fi_deleg_file, 1));

rcu_dereference_protected(...., 1)
which for me is a warning sign.  What does the '1' mean here?
Presumably something that we cannot easily assert with a C condition, in
which case a comment is called for.

Based on the recent commit which added this I'm guessing 
   fi_delegees > 0 guarantees stability

If that is right, then why do we also need a spinlock to guarantee
stability?

Confused.

NeilBrown

> 
> Fixes: c5c707f96fc9 ("nfsd: implement pNFS layout recalls")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/nfs4layouts.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 9ed2e3d65062..5d48c7673fa1 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -247,11 +247,17 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
>  	nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
>  			NFSPROC4_CLNT_CB_LAYOUT);
>  
> -	if (parent->sc_type == SC_TYPE_DELEG)
> +	if (parent->sc_type == SC_TYPE_DELEG) {
> +		spin_lock(&fp->fi_lock);
>  		ls->ls_file = nfsd_file_get(rcu_dereference_protected(fp->fi_deleg_file, 1));
> -	else
> +		spin_unlock(&fp->fi_lock);
> +	} else {
>  		ls->ls_file = find_any_file(fp);
> -	BUG_ON(!ls->ls_file);
> +	}
> +	if (!ls->ls_file) {
> +		nfs4_put_stid(stp);
> +		return NULL;
> +	}
>  
>  	ls->ls_fenced = false;
>  	ls->ls_fence_delay = 0;
> 
> -- 
> 2.54.0
> 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 23:11     ` Chuck Lever
@ 2026-05-29  0:07       ` Chuck Lever
  2026-05-29 10:48         ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29  0:07 UTC (permalink / raw)
  To: Rick Macklem, Jeff Layton
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel



On Thu, May 28, 2026, at 7:11 PM, Chuck Lever wrote:
> On Thu, May 28, 2026, at 6:11 PM, Rick Macklem wrote:
>> On Thu, May 28, 2026 at 2:56 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>
>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you are unsure, forward the message to ITHelp@uoguelph.ca for review.
>>>
>>>
>>> From: Chris Mason <clm@meta.com>
>>>
>>> nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
>>> it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
>>> an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
>>> the server's compound processing path.
>>>
>>>     nfsd4_decode_posixacl()
>>>       xdr_stream_decode_u32(&count)       /* uncapped u32 */
>>>       posix_acl_alloc(count, GFP_KERNEL)
>>>       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
>>>
>>> The encoder side in the same file already rejects ACLs whose a_count
>>> exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
>>> 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
>>> omitted the symmetric check.
>> My recollection is that Chuck didn't like this limit. He argued that it was
>> specific to the NFSv3 ACL protocol and that the limit on the size of a NFSv4
>> RPC message was sufficient.  I, personally, think that 1024 is a reasonable
>> limit for # of ACEs, but Chuck can jump in here if he doesn't agree.
>
> The NFSACL protocol limits the number of ACEs in an ACL to NFS_ACL_MAX_ENTRIES
> (1024). It’s a limit that is defined in the protocol itself.
>
> The NFSv4 protocol sets no similar limit. In particular, NFS_ACL_MAX_ENTRIES
> is not a constant defined or used by the NFSv4.x family of protocols IIRC.
>
> Using NFS_ACL_MAX_ENTRIES to cap the number of ACEs in NFSv4 ACLs is a
> convenience, but it adds technical debt (slight though it may be).
>
> So… We can define an implementation limit for NFSv4 ACL support in NFSD.
> But it shouldn’t be called NFS_ACL_MAX_ENTRIES, IMHO. For clarity of
> documentation, pick a number (could be 1024) and state in a comment that
> it is an NFSD implementation constraint, not a protocol limit. Name the
> constant something like NFSD4_MAX_yada to make it clear it is an
> implementation limit, not a protocol limit.

A different take on this might be that we want to ensure that ACLs set
via the NFSv4 POSIX ACL extension can always be retrieved via NFSACL.
In that case, capping the ACE count at the same number makes sense.
As long as the reason for this is clearly mentioned.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
  2026-05-28 22:11   ` Rick Macklem
@ 2026-05-29  7:34   ` Cedric Blancher
  2026-05-29 10:50     ` Jeff Layton
  2026-05-29 18:34   ` Chuck Lever
  2 siblings, 1 reply; 40+ messages in thread
From: Cedric Blancher @ 2026-05-29  7:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

What about (finally) getting rid of the bubble sort?

Ced

On Fri, 29 May 2026 at 00:02, Jeff Layton <jlayton@kernel.org> wrote:
>
> From: Chris Mason <clm@meta.com>
>
> nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
> it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
> an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
> the server's compound processing path.
>
>     nfsd4_decode_posixacl()
>       xdr_stream_decode_u32(&count)       /* uncapped u32 */
>       posix_acl_alloc(count, GFP_KERNEL)
>       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
>
> The encoder side in the same file already rejects ACLs whose a_count
> exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
> 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> omitted the symmetric check.
>
> Fix by rejecting a wire count greater than NFS_ACL_MAX_ENTRIES with
> nfserr_resource, before any allocation, so the sort is bounded by
> NFS_ACL_MAX_ENTRIES^2 comparisons.
>
> Fixes: 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>
> ---
>  fs/nfsd/nfs4xdr.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index c6c50c376b23..5469c6c207ba 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
>
>         if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
>                 return nfserr_bad_xdr;
> +       if (count > NFS_ACL_MAX_ENTRIES)
> +               return nfserr_resource;
>
>         *acl = posix_acl_alloc(count, GFP_KERNEL);
>         if (*acl == NULL)
>
> --
> 2.54.0
>
>


-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29  0:07       ` Chuck Lever
@ 2026-05-29 10:48         ` Jeff Layton
  2026-05-29 13:20           ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 10:48 UTC (permalink / raw)
  To: Chuck Lever, Rick Macklem
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On Thu, 2026-05-28 at 20:07 -0400, Chuck Lever wrote:
> 
> On Thu, May 28, 2026, at 7:11 PM, Chuck Lever wrote:
> > On Thu, May 28, 2026, at 6:11 PM, Rick Macklem wrote:
> > > On Thu, May 28, 2026 at 2:56 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you are unsure, forward the message to ITHelp@uoguelph.ca for review.
> > > > 
> > > > 
> > > > From: Chris Mason <clm@meta.com>
> > > > 
> > > > nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
> > > > it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
> > > > an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
> > > > the server's compound processing path.
> > > > 
> > > >     nfsd4_decode_posixacl()
> > > >       xdr_stream_decode_u32(&count)       /* uncapped u32 */
> > > >       posix_acl_alloc(count, GFP_KERNEL)
> > > >       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
> > > > 
> > > > The encoder side in the same file already rejects ACLs whose a_count
> > > > exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
> > > > 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> > > > omitted the symmetric check.
> > > My recollection is that Chuck didn't like this limit. He argued that it was
> > > specific to the NFSv3 ACL protocol and that the limit on the size of a NFSv4
> > > RPC message was sufficient.  I, personally, think that 1024 is a reasonable
> > > limit for # of ACEs, but Chuck can jump in here if he doesn't agree.
> > 
> > The NFSACL protocol limits the number of ACEs in an ACL to NFS_ACL_MAX_ENTRIES
> > (1024). It’s a limit that is defined in the protocol itself.
> > 
> > The NFSv4 protocol sets no similar limit. In particular, NFS_ACL_MAX_ENTRIES
> > is not a constant defined or used by the NFSv4.x family of protocols IIRC.
> > 
> > Using NFS_ACL_MAX_ENTRIES to cap the number of ACEs in NFSv4 ACLs is a
> > convenience, but it adds technical debt (slight though it may be).
> > 
> > So… We can define an implementation limit for NFSv4 ACL support in NFSD.
> > But it shouldn’t be called NFS_ACL_MAX_ENTRIES, IMHO. For clarity of
> > documentation, pick a number (could be 1024) and state in a comment that
> > it is an NFSD implementation constraint, not a protocol limit. Name the
> > constant something like NFSD4_MAX_yada to make it clear it is an
> > implementation limit, not a protocol limit.
> 
> A different take on this might be that we want to ensure that ACLs set
> via the NFSv4 POSIX ACL extension can always be retrieved via NFSACL.
> In that case, capping the ACE count at the same number makes sense.
> As long as the reason for this is clearly mentioned.
> 

First, I'll note that the encoder already has this limit in place:

static __be32
nfsd4_encode_posixacl(struct xdr_stream *xdr, struct svc_rqst *rqstp,
                      struct posix_acl *acl)
{
[...]
        if (acl->a_count > NFS_ACL_MAX_ENTRIES)
                return nfserr_resource;
[...]
}

...so if you set an ACL that is longer than NFS_ACL_MAX_ENTRIES you
already can't retrieve it with NFSv4. Given that, I went with the above
suggestion, and added a comment to the patch. Seem ok?

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index c6c50c376b23..eaf71c65d665 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -448,6 +448,14 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
 
        if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
                return nfserr_bad_xdr;
+       /*
+        * RFC8881 doesn't define a max number of ACE's, but the NFSACL spec
+        * does. For NFSv4, cap the number of entries to the v3 limit, as we
+        * want to ensure that ACLs set via NFSv4 POSIX ACL extensions are
+        * retrievable via NFSv3.
+        */
+       if (count > NFS_ACL_MAX_ENTRIES)
+               return nfserr_resource;
 
        *acl = posix_acl_alloc(count, GFP_KERNEL);
        if (*acl == NULL)

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29  7:34   ` Cedric Blancher
@ 2026-05-29 10:50     ` Jeff Layton
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 10:50 UTC (permalink / raw)
  To: Cedric Blancher
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On Fri, 2026-05-29 at 09:34 +0200, Cedric Blancher wrote:
> What about (finally) getting rid of the bubble sort?
> 
> Ced
> 

It's certainly not clear to me that it's necessary (is it?). I'm not
opposed to removing it, but I'd rather not make a behavioral change
like this in the context of this patchset.

Maybe you could propose a patch to remove it?


> On Fri, 29 May 2026 at 00:02, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > From: Chris Mason <clm@meta.com>
> > 
> > nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
> > it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
> > an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
> > the server's compound processing path.
> > 
> >     nfsd4_decode_posixacl()
> >       xdr_stream_decode_u32(&count)       /* uncapped u32 */
> >       posix_acl_alloc(count, GFP_KERNEL)
> >       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
> > 
> > The encoder side in the same file already rejects ACLs whose a_count
> > exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
> > 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> > omitted the symmetric check.
> > 
> > Fix by rejecting a wire count greater than NFS_ACL_MAX_ENTRIES with
> > nfserr_resource, before any allocation, so the sort is bounded by
> > NFS_ACL_MAX_ENTRIES^2 comparisons.
> > 
> > Fixes: 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
> > Assisted-by: kres:claude-opus-4-7
> > Signed-off-by: Chris Mason <clm@meta.com>
> > ---
> >  fs/nfsd/nfs4xdr.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index c6c50c376b23..5469c6c207ba 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
> > 
> >         if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
> >                 return nfserr_bad_xdr;
> > +       if (count > NFS_ACL_MAX_ENTRIES)
> > +               return nfserr_resource;
> > 
> >         *acl = posix_acl_alloc(count, GFP_KERNEL);
> >         if (*acl == NULL)
> > 
> > --
> > 2.54.0
> > 
> > 
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients
  2026-05-28 21:55 ` [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients Jeff Layton
@ 2026-05-29 10:56   ` Jeff Layton
  2026-05-30  7:58   ` NFSv4.1 COMMIT of all changed areas only on flush? " Cedric Blancher
  1 sibling, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 10:56 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Thu, 2026-05-28 at 17:55 -0400, Jeff Layton wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
> WRITE to be a FILE_SYNC WRITE. This indicates that the client does
> not need a subsequent COMMIT operation, saving a round trip and
> allowing the client to dispense with cached dirty data as soon as
> it receives the server's WRITE response.
> 
> This patch refactors nfsd_vfs_write() to return a possibly modified
> stable_how value to its callers. No behavior change is expected.
> 
> Reviewed-by: NeilBrown <neil@brown.name>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs3proc.c |  2 +-
>  fs/nfsd/nfs4proc.c |  2 +-
>  fs/nfsd/nfsproc.c  |  3 ++-
>  fs/nfsd/vfs.c      | 11 ++++++-----
>  fs/nfsd/vfs.h      |  6 ++++--
>  fs/nfsd/xdr3.h     |  2 +-
>  6 files changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index aeda7a802bdf..dd5ac59e87d6 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
>  	resp->committed = argp->stable;
>  	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
>  				  &argp->payload, &cnt,
> -				  resp->committed, resp->verf);
> +				  &resp->committed, resp->verf);
>  	resp->count = cnt;
>  	resp->status = nfsd3_map_status(resp->status);
>  	return rpc_success;
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 5f2b9bfc3a84..ac03f9d89288 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1355,7 +1355,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	write->wr_how_written = write->wr_stable_how;
>  	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
>  				write->wr_offset, &write->wr_payload,
> -				&cnt, write->wr_how_written,
> +				&cnt, &write->wr_how_written,
>  				(__be32 *)write->wr_verifier.data);
>  	nfsd_file_put(nf);
>  
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index 8873033d1e82..d0a7316f00a5 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
>  	struct nfsd_writeargs *argp = rqstp->rq_argp;
>  	struct nfsd_attrstat *resp = rqstp->rq_resp;
>  	unsigned long cnt = argp->len;
> +	u32 committed = NFS_DATA_SYNC;
>  
>  	dprintk("nfsd: WRITE    %s %u bytes at %d\n",
>  		SVCFH_fmt(&argp->fh),
> @@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
> -				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
> +				  &argp->payload, &cnt, &committed, NULL);
>  	if (resp->status == nfs_ok)
>  		resp->status = fh_getattr(&resp->fh, &resp->stat);
>  	else if (resp->status == nfserr_jukebox)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1e89c7ff9493..7f07292d1569 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1414,7 +1414,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * @offset: Byte offset of start
>   * @payload: xdr_buf containing the write payload
>   * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
> - * @stable: An NFS stable_how value
> + * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
>   * @verf: NFS WRITE verifier
>   *
>   * Upon return, caller must invoke fh_put on @fhp.
> @@ -1426,11 +1426,12 @@ __be32
>  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	       struct nfsd_file *nf, loff_t offset,
>  	       const struct xdr_buf *payload, unsigned long *cnt,
> -	       int stable, __be32 *verf)
> +	       u32 *stable_how, __be32 *verf)
>  {
>  	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
>  	struct file		*file = nf->nf_file;
>  	struct super_block	*sb = file_inode(file)->i_sb;
> +	u32			stable = *stable_how;
>  	struct kiocb		kiocb;
>  	struct svc_export	*exp;
>  	struct iov_iter		iter;
> @@ -1609,7 +1610,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * @offset: Byte offset of start
>   * @payload: xdr_buf containing the write payload
>   * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
> - * @stable: An NFS stable_how value
> + * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
>   * @verf: NFS WRITE verifier
>   *
>   * Upon return, caller must invoke fh_put on @fhp.
> @@ -1619,7 +1620,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   */
>  __be32
>  nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
> -	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
> +	   const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
>  	   __be32 *verf)
>  {
>  	struct nfsd_file *nf;
> @@ -1632,7 +1633,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>  		goto out;
>  
>  	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
> -			     stable, verf);
> +			     stable_how, verf);
>  	nfsd_file_put(nf);
>  out:
>  	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index e09ea04a51b9..36cd9f6edd8b 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -132,11 +132,13 @@ __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				u32 *eof);
>  __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				loff_t offset, const struct xdr_buf *payload,
> -				unsigned long *cnt, int stable, __be32 *verf);
> +				unsigned long *cnt, u32 *stable_how,
> +				__be32 *verf);
>  __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				struct nfsd_file *nf, loff_t offset,
>  				const struct xdr_buf *payload,
> -				unsigned long *cnt, int stable, __be32 *verf);
> +				unsigned long *cnt, u32 *stable_how,
> +				__be32 *verf);
>  __be32		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
>  				char *, int *);
>  __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index a7c9714b0b0e..1ff7b11c397c 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -152,7 +152,7 @@ struct nfsd3_writeres {
>  	__be32			status;
>  	struct svc_fh		fh;
>  	unsigned long		count;
> -	int			committed;
> +	u32			committed;
>  	__be32			verf[2];
>  };
>  

After looking at this more closely and talking it over with Chuck,
self-NAK on this patch. It's not clear to me that it's safe to
downgrade a SYNC write to an UNSTABLE one.

FWIW, the spec does seem to require it in this case: RFC 1813 section
3.3.7 says the server MUST set committed to UNSTABLE if it didn't
commit to stable storage, which it clearly didn't do in this case.

The problem is that it's not clear to me that the clients would be
prepared to issue a COMMIT in those cases. Since this only affects
(insane) people using the "async" export option, I think we can safely
just drop this one.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29 10:48         ` Jeff Layton
@ 2026-05-29 13:20           ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 13:20 UTC (permalink / raw)
  To: Jeff Layton, Rick Macklem
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On 5/29/26 6:48 AM, Jeff Layton wrote:
> On Thu, 2026-05-28 at 20:07 -0400, Chuck Lever wrote:
>>
>> On Thu, May 28, 2026, at 7:11 PM, Chuck Lever wrote:
>>> On Thu, May 28, 2026, at 6:11 PM, Rick Macklem wrote:
>>>> On Thu, May 28, 2026 at 2:56 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you are unsure, forward the message to ITHelp@uoguelph.ca for review.
>>>>>
>>>>>
>>>>> From: Chris Mason <clm@meta.com>
>>>>>
>>>>> nfsd4_decode_posixacl() reads a u32 entry count off the wire and passes
>>>>> it straight to posix_acl_alloc() and sort_pacl_range(). The latter is
>>>>> an O(n^2) bubble sort, so a client-chosen count drives unbounded CPU in
>>>>> the server's compound processing path.
>>>>>
>>>>>     nfsd4_decode_posixacl()
>>>>>       xdr_stream_decode_u32(&count)       /* uncapped u32 */
>>>>>       posix_acl_alloc(count, GFP_KERNEL)
>>>>>       sort_pacl_range(*acl, 0, count - 1) /* O(n^2) bubble sort */
>>>>>
>>>>> The encoder side in the same file already rejects ACLs whose a_count
>>>>> exceeds NFS_ACL_MAX_ENTRIES, but the decoder introduced in commit
>>>>> 5fc51dfc2eb1 ("NFSD: Add support for XDR decoding POSIX draft ACLs")
>>>>> omitted the symmetric check.
>>>> My recollection is that Chuck didn't like this limit. He argued that it was
>>>> specific to the NFSv3 ACL protocol and that the limit on the size of a NFSv4
>>>> RPC message was sufficient.  I, personally, think that 1024 is a reasonable
>>>> limit for # of ACEs, but Chuck can jump in here if he doesn't agree.
>>>
>>> The NFSACL protocol limits the number of ACEs in an ACL to NFS_ACL_MAX_ENTRIES
>>> (1024). It’s a limit that is defined in the protocol itself.
>>>
>>> The NFSv4 protocol sets no similar limit. In particular, NFS_ACL_MAX_ENTRIES
>>> is not a constant defined or used by the NFSv4.x family of protocols IIRC.
>>>
>>> Using NFS_ACL_MAX_ENTRIES to cap the number of ACEs in NFSv4 ACLs is a
>>> convenience, but it adds technical debt (slight though it may be).
>>>
>>> So… We can define an implementation limit for NFSv4 ACL support in NFSD.
>>> But it shouldn’t be called NFS_ACL_MAX_ENTRIES, IMHO. For clarity of
>>> documentation, pick a number (could be 1024) and state in a comment that
>>> it is an NFSD implementation constraint, not a protocol limit. Name the
>>> constant something like NFSD4_MAX_yada to make it clear it is an
>>> implementation limit, not a protocol limit.
>>
>> A different take on this might be that we want to ensure that ACLs set
>> via the NFSv4 POSIX ACL extension can always be retrieved via NFSACL.
>> In that case, capping the ACE count at the same number makes sense.
>> As long as the reason for this is clearly mentioned.
>>
> 
> First, I'll note that the encoder already has this limit in place:
> 
> static __be32
> nfsd4_encode_posixacl(struct xdr_stream *xdr, struct svc_rqst *rqstp,
>                       struct posix_acl *acl)
> {
> [...]
>         if (acl->a_count > NFS_ACL_MAX_ENTRIES)
>                 return nfserr_resource;
> [...]
> }
> 
> ...so if you set an ACL that is longer than NFS_ACL_MAX_ENTRIES you
> already can't retrieve it with NFSv4. Given that, I went with the above
> suggestion, and added a comment to the patch. Seem ok?
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index c6c50c376b23..eaf71c65d665 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -448,6 +448,14 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs *argp, struct posix_acl **acl)
>  
>         if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
>                 return nfserr_bad_xdr;
> +       /*
> +        * RFC8881 doesn't define a max number of ACE's, but the NFSACL spec
> +        * does. For NFSv4, cap the number of entries to the v3 limit, as we
> +        * want to ensure that ACLs set via NFSv4 POSIX ACL extensions are
> +        * retrievable via NFSv3.

Or, more precisely, "retrievable via NFSACL."

Otherwise, LGTM.


> +        */
> +       if (count > NFS_ACL_MAX_ENTRIES)
> +               return nfserr_resource;
>  
>         *acl = posix_acl_alloc(count, GFP_KERNEL);
>         if (*acl == NULL)


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke
  2026-05-28 23:40   ` NeilBrown
@ 2026-05-29 14:44     ` Jeff Layton
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 14:44 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	linux-nfs, linux-kernel

On Fri, 2026-05-29 at 09:40 +1000, NeilBrown wrote:
> On Fri, 29 May 2026, Jeff Layton wrote:
> > nfsd4_alloc_layout_stateid reads fp->fi_deleg_file without holding
> > fi_lock when the parent stateid is a delegation. A concurrent delegation
> > revoke via the laundromat can clear fi_deleg_file under fi_lock, causing
> > nfsd_file_get() to return NULL and triggering the BUG_ON.
> > 
> > This race is client-reachable: two NFS clients can trigger it by having
> > one hold a delegation while another opens the same file to force a
> > recall. When the first client doesn't respond to the recall, the
> > laundromat revokes it. A concurrent LAYOUTGET from any client using the
> > delegation stateid hits the race window.
> > 
> > Fix this by taking fi_lock around the fi_deleg_file read in the
> > SC_TYPE_DELEG path, and replacing the BUG_ON with a graceful error
> > return that cleans up the partially-initialized layout stateid.
> 
> Replacing the BUG_ON with a graceful error is certainly sensible and
> probably all that is needed to fix the problem.
> 
> I cannot see how the spinlock achieves anything.  If ->fi_deleg_file
> could become NULL at this point, it can become NULL just before we take
> the spinlock.
> 
> We do need to be sure the file (if there is one) doesn't get freed while
> nfsd_file_get() is incrementing the refcount, but rcu_read_lock() is the
> normal tool for that.
> In this case we have
> 
>   		ls->ls_file = nfsd_file_get(rcu_dereference_protected(fp->fi_deleg_file, 1));
> 
> rcu_dereference_protected(...., 1)
> which for me is a warning sign.  What does the '1' mean here?
> Presumably something that we cannot easily assert with a C condition, in
> which case a comment is called for.
> 
> Based on the recent commit which added this I'm guessing 
>    fi_delegees > 0 guarantees stability
> 
> If that is right, then why do we also need a spinlock to guarantee
> stability?
> 
> Confused.
> 
> NeilBrown
> 

Good point. I think the right thing to do here is to use RCU to access
this pointer. Don't think it's protected by anything now. I'll do some
more analysis and put together a v2.

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session
  2026-05-28 21:55 ` [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session Jeff Layton
@ 2026-05-29 15:13   ` Chuck Lever
  2026-05-29 17:31     ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 15:13 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel



On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> From: Chris Mason <clm@meta.com>
>
> After a DESTROY_SESSION the per-session teardown path can free a
> session while rpciod still holds an inflight callback rpc_task that
> dereferences clp->cl_cb_session.  nfsd4_probe_callback_sync() flushes
> cl_callback_wq, but once nfsd4_run_cb_work() has called
> rpc_call_async() the rpc_task lives on rpciod; flushing the workqueue
> does not wait for it.  After the flush returns,
> nfsd4_destroy_session() proceeds through nfsd4_put_session_locked()
> and free_session() kfree()s the slab while rpciod's
> nfsd4_cb_sequence_done(), grab_slot(), and nfsd41_cb_release_slot()
> are still dereferencing cb->cb_clp->cl_cb_session.
>
>     destroy path                       rpciod
>     ------------                       ------
>     unhash_session(ses)
>     nfsd4_probe_callback_sync(clp)
>       flush_workqueue(cl_callback_wq)
>       /* returns; rpc_task still live */
>     nfsd4_put_session_locked(ses)
>     free_session(ses) -> kfree(ses)
>                                        nfsd4_cb_sequence_done()
>                                          reads cb_clp->cl_cb_session
>                                          /* freed slab */
>
> A second window exists in nfsd4_process_cb_update().  When
> __nfsd4_find_backchannel() returns NULL because unhash_session() has
> already removed the destroyed session from cl_sessions,
> setup_callback_client() takes the v4.1 early return
>
>     if (!conn->cb_xprt || !ses)
>             return -EINVAL;
>
> so clp->cl_cb_session = ses never fires and the field retains a
> pointer to the about-to-be-freed session.  Symmetrically, if a later
> probe finds a different session's backchannel conn and that
> setup_callback_client() call fails, the error tail must still scrub
> any previously published cl_cb_session.
>
> Fix by mirroring the two-stage drain that nfsd4_shutdown_callback()
> already performs: call nfsd41_cb_inflight_wait_complete() in
> nfsd4_probe_callback_sync() after flush_workqueue() so rpciod-side
> nfsd41_cb_inflight_end() decrements are observed before the caller
> releases the final session reference.  The two direct callers,
> nfsd4_destroy_session() and nfsd4_init_conn() (itself invoked from
> nfsd4_create_session() and nfsd4_bind_conn_to_session()), run in
> sleepable process context and tolerate the wait_var_event() sleep:
>
>     nfsd4_destroy_session() (fs/nfsd/nfs4state.c):
>       unhash_session(ses);
>       spin_unlock(&nn->client_lock);   /* spinlock dropped */
>       nfsd4_probe_callback_sync(ses->se_client);
>
>     nfsd4_init_conn() (fs/nfsd/nfs4state.c):
>       acquires no locks in its body; calls nfsd4_hash_conn(),
>       nfsd4_register_conn(), then nfsd4_probe_callback_sync() --
>       entirely in sleepable process context.
>
> Also clear clp->cl_cb_session unconditionally on the
> nfsd4_process_cb_update() error return so every
> setup_callback_client() failure -- whether c is NULL or points at a
> different session whose probe failed -- leaves the field NULL rather
> than pointing at a session that may subsequently be freed.
>
> Fixes: dcbeaa68dbbd ("nfsd4: allow backchannel recovery")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>
> ---
>  fs/nfsd/nfs4callback.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 1964a213f80e..1cf6b6100357 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1205,9 +1205,8 @@ static int setup_callback_client(struct 
> nfs4_client *clp, struct nfs4_cb_conn *c
>  	} else {
>  		if (!conn->cb_xprt || !ses)
>  			return -EINVAL;
> -		clp->cl_cb_session = ses;
>  		args.bc_xprt = conn->cb_xprt;
> -		args.prognumber = clp->cl_cb_session->se_cb_prog;
> +		args.prognumber = ses->se_cb_prog;
>  		args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
>  				XPRT_TRANSPORT_BC;
>  		args.authflavor = ses->se_cb_sec.flavor;
> @@ -1225,8 +1224,10 @@ static int setup_callback_client(struct 
> nfs4_client *clp, struct nfs4_cb_conn *c
>  		return -ENOMEM;
>  	}
> 
> -	if (clp->cl_minorversion != 0)
> +	if (clp->cl_minorversion != 0) {
>  		clp->cl_cb_conn.cb_xprt = conn->cb_xprt;
> +		clp->cl_cb_session = ses;
> +	}
>  	clp->cl_cb_client = client;
>  	clp->cl_cb_cred = cred;
>  	rcu_read_lock();
> @@ -1299,6 +1300,7 @@ void nfsd4_probe_callback_sync(struct nfs4_client *clp)
>  {
>  	nfsd4_probe_callback(clp);
>  	flush_workqueue(clp->cl_callback_wq);
> +	nfsd41_cb_inflight_wait_complete(clp);
>  }
> 
>  void nfsd4_change_callback(struct nfs4_client *clp, struct 
> nfs4_cb_conn *conn)
> @@ -1679,7 +1681,17 @@ static struct nfsd4_conn * 
> __nfsd4_find_backchannel(struct nfs4_client *clp)
>   * Note there isn't a lot of locking in this code; instead we depend on
>   * the fact that it is run from clp->cl_callback_wq, which won't run 
> two
>   * work items at once.  So, for example, clp->cl_callback_wq handles 
> all
> - * access of cl_cb_client and all calls to rpc_create or 
> rpc_shutdown_client.
> + * access of cl_cb_client and cl_cb_session, and all calls to 
> rpc_create
> + * or rpc_shutdown_client.
> + *
> + * rpciod-side readers of cl_cb_session (encode_cb_sequence4args(),
> + * nfsd4_cb_sequence_done(), the cb-slot helpers, and the cb_sequence
> + * tracepoints) run outside cl_callback_wq.  The
> + * nfsd41_cb_inflight_wait_complete() drain in 
> nfsd4_probe_callback_sync()
> + * waits until cl_cb_inflight reaches zero before the caller proceeds 
> with
> + * session teardown; any rpc_task that reads cl_cb_session must hold an
> + * inflight pin (via nfsd41_cb_inflight_begin) for this fence to be
> + * effective.
>   */
>  static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
>  {
> @@ -1731,6 +1743,7 @@ static void nfsd4_process_cb_update(struct 
> nfsd4_callback *cb)
>  		nfsd4_mark_cb_down(clp);
>  		if (c)
>  			svc_xprt_put(c->cn_xprt);
> +		clp->cl_cb_session = NULL;
>  		return;
>  	}
>  }
>
> -- 
> 2.54.0

Several NFSD callback done handlers retry indefinitely on
NFS4ERR_DELAY via rpc_delay(), so a client that keeps
replying DELAY leaves this per-client counter nonzero and
blocks the foreground CREATE/BIND/DESTROY_SESSION request
even though the callback no longer references the session
being torn down.

Although partly due to the way callbacks are structured
currently, this patch potentially introduces a client-
controlled DoS vector.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set
  2026-05-28 21:55 ` [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set Jeff Layton
@ 2026-05-29 15:38   ` Chuck Lever
  2026-05-29 15:57     ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 15:38 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel



On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> From: Chris Mason <clm@meta.com>
>
> nfsd4_end_grace() guards its drain path with a plain bool:
>
>     if (nn->grace_ended)
>             return;
>     nn->grace_ended = true;
>
> The read and the write are independent, and nothing in struct
> nfsd_net serializes them.  At least two contexts can reach this
> code with no lock held:
>
>     laundromat path
>       laundry_wq kworker
>         nfs4_laundromat()
>           nfsd4_end_grace()
>
>     RECLAIM_COMPLETE path
>       nfsd compound kthread
>         nfsd4_reclaim_complete()
>           inc_reclaim_complete()
>             nfsd4_end_grace()
>
> Both callers can observe grace_ended == false on different CPUs,
> both store true, and both proceed into nfsd4_record_grace_done(),
> which invokes the active client_tracking_ops->grace_done callback.
> For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
> via nfsd4_recdir_purge_old, and the cld v1+ ops via
> nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
> which walks every bucket of reclaim_str_hashtbl with no lock and
> calls nfs4_remove_reclaim_record() (list_del + kfree) on each
> entry.  Two concurrent walkers corrupt the list and double-free
> every nfs4_client_reclaim.  A concurrent nfsd4_find_reclaim_client()
> iterating the same bucket reads through freed memory.
>
> A third call site exists in nfs4_state_start_net() on the
> skip_grace startup path, but it runs under nfsd_mutex before any
> client has connected and before the laundromat's first delayed
> work fires, so it cannot race with the two callers above.
>
> Fix by replacing the read/write pair with try_cmpxchg() so exactly
> one caller transitions grace_ended from false to true and proceeds
> into the drain; the loser returns immediately.  bool supports
> 1-byte cmpxchg on all supported architectures, and no lock
> ordering changes are needed.
>
> Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations 
> when using nfsdcld")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>
> ---
>  fs/nfsd/nfs4state.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f4d12dbcf97b..dc4ac541436f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>  static void
>  nfsd4_end_grace(struct nfsd_net *nn)
>  {
> -	/* do nothing if grace period already ended */
> -	if (nn->grace_ended)
> +	bool expected = false;
> +
> +	/*
> +	 * nfsd4_end_grace() can be entered concurrently from the
> +	 * laundromat workqueue and from an nfsd compound thread
> +	 * handling RECLAIM_COMPLETE.  Without serialization, both
> +	 * callers can observe grace_ended==false and proceed into
> +	 * nfsd4_record_grace_done().  For tracking ops whose
> +	 * grace_done drains reclaim_str_hashtbl, that results in
> +	 * list corruption and a double free of every
> +	 * nfs4_client_reclaim entry.  Use an atomic test-and-set so
> +	 * exactly one caller proceeds.
> +	 */
> +	if (!try_cmpxchg(&nn->grace_ended, &expected, true))
>  		return;
> 
>  	trace_nfsd_grace_complete(nn);
> -	nn->grace_ended = true;
>  	/*
>  	 * If the server goes down again right now, an NFSv4
>  	 * client will still be allowed to reclaim after it comes back up,
>
> -- 
> 2.54.0

Seems like the usual idiom for something like this is an atomic
bit op. Perhaps try_cmpxchg on a boolean variable is not going
to behave as you expect on every hardware platform.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set
  2026-05-29 15:38   ` Chuck Lever
@ 2026-05-29 15:57     ` Jeff Layton
  2026-05-29 16:05       ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 15:57 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 11:38 -0400, Chuck Lever wrote:
> 
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > From: Chris Mason <clm@meta.com>
> > 
> > nfsd4_end_grace() guards its drain path with a plain bool:
> > 
> >     if (nn->grace_ended)
> >             return;
> >     nn->grace_ended = true;
> > 
> > The read and the write are independent, and nothing in struct
> > nfsd_net serializes them.  At least two contexts can reach this
> > code with no lock held:
> > 
> >     laundromat path
> >       laundry_wq kworker
> >         nfs4_laundromat()
> >           nfsd4_end_grace()
> > 
> >     RECLAIM_COMPLETE path
> >       nfsd compound kthread
> >         nfsd4_reclaim_complete()
> >           inc_reclaim_complete()
> >             nfsd4_end_grace()
> > 
> > Both callers can observe grace_ended == false on different CPUs,
> > both store true, and both proceed into nfsd4_record_grace_done(),
> > which invokes the active client_tracking_ops->grace_done callback.
> > For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
> > via nfsd4_recdir_purge_old, and the cld v1+ ops via
> > nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
> > which walks every bucket of reclaim_str_hashtbl with no lock and
> > calls nfs4_remove_reclaim_record() (list_del + kfree) on each
> > entry.  Two concurrent walkers corrupt the list and double-free
> > every nfs4_client_reclaim.  A concurrent nfsd4_find_reclaim_client()
> > iterating the same bucket reads through freed memory.
> > 
> > A third call site exists in nfs4_state_start_net() on the
> > skip_grace startup path, but it runs under nfsd_mutex before any
> > client has connected and before the laundromat's first delayed
> > work fires, so it cannot race with the two callers above.
> > 
> > Fix by replacing the read/write pair with try_cmpxchg() so exactly
> > one caller transitions grace_ended from false to true and proceeds
> > into the drain; the loser returns immediately.  bool supports
> > 1-byte cmpxchg on all supported architectures, and no lock
> > ordering changes are needed.
> > 
> > Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations 
> > when using nfsdcld")
> > Assisted-by: kres:claude-opus-4-7
> > Signed-off-by: Chris Mason <clm@meta.com>
> > ---
> >  fs/nfsd/nfs4state.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index f4d12dbcf97b..dc4ac541436f 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct 
> > nfsd4_compound_state *cstate,
> >  static void
> >  nfsd4_end_grace(struct nfsd_net *nn)
> >  {
> > -	/* do nothing if grace period already ended */
> > -	if (nn->grace_ended)
> > +	bool expected = false;
> > +
> > +	/*
> > +	 * nfsd4_end_grace() can be entered concurrently from the
> > +	 * laundromat workqueue and from an nfsd compound thread
> > +	 * handling RECLAIM_COMPLETE.  Without serialization, both
> > +	 * callers can observe grace_ended==false and proceed into
> > +	 * nfsd4_record_grace_done().  For tracking ops whose
> > +	 * grace_done drains reclaim_str_hashtbl, that results in
> > +	 * list corruption and a double free of every
> > +	 * nfs4_client_reclaim entry.  Use an atomic test-and-set so
> > +	 * exactly one caller proceeds.
> > +	 */
> > +	if (!try_cmpxchg(&nn->grace_ended, &expected, true))
> >  		return;
> > 
> >  	trace_nfsd_grace_complete(nn);
> > -	nn->grace_ended = true;
> >  	/*
> >  	 * If the server goes down again right now, an NFSv4
> >  	 * client will still be allowed to reclaim after it comes back up,
> > 
> > -- 
> > 2.54.0
> 
> Seems like the usual idiom for something like this is an atomic
> bit op. Perhaps try_cmpxchg on a boolean variable is not going
> to behave as you expect on every hardware platform.

We just need a single flag here though. try_cmpxchg() had better work
the same way on every platform or a lot of stuff is FUBAR. Where
wouldn't it?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set
  2026-05-29 15:57     ` Jeff Layton
@ 2026-05-29 16:05       ` Chuck Lever
  2026-05-29 17:02         ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 16:05 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On 5/29/26 11:57 AM, Jeff Layton wrote:
> On Fri, 2026-05-29 at 11:38 -0400, Chuck Lever wrote:
>>
>> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>>> From: Chris Mason <clm@meta.com>
>>>
>>> nfsd4_end_grace() guards its drain path with a plain bool:
>>>
>>>     if (nn->grace_ended)
>>>             return;
>>>     nn->grace_ended = true;
>>>
>>> The read and the write are independent, and nothing in struct
>>> nfsd_net serializes them.  At least two contexts can reach this
>>> code with no lock held:
>>>
>>>     laundromat path
>>>       laundry_wq kworker
>>>         nfs4_laundromat()
>>>           nfsd4_end_grace()
>>>
>>>     RECLAIM_COMPLETE path
>>>       nfsd compound kthread
>>>         nfsd4_reclaim_complete()
>>>           inc_reclaim_complete()
>>>             nfsd4_end_grace()
>>>
>>> Both callers can observe grace_ended == false on different CPUs,
>>> both store true, and both proceed into nfsd4_record_grace_done(),
>>> which invokes the active client_tracking_ops->grace_done callback.
>>> For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
>>> via nfsd4_recdir_purge_old, and the cld v1+ ops via
>>> nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
>>> which walks every bucket of reclaim_str_hashtbl with no lock and
>>> calls nfs4_remove_reclaim_record() (list_del + kfree) on each
>>> entry.  Two concurrent walkers corrupt the list and double-free
>>> every nfs4_client_reclaim.  A concurrent nfsd4_find_reclaim_client()
>>> iterating the same bucket reads through freed memory.
>>>
>>> A third call site exists in nfs4_state_start_net() on the
>>> skip_grace startup path, but it runs under nfsd_mutex before any
>>> client has connected and before the laundromat's first delayed
>>> work fires, so it cannot race with the two callers above.
>>>
>>> Fix by replacing the read/write pair with try_cmpxchg() so exactly
>>> one caller transitions grace_ended from false to true and proceeds
>>> into the drain; the loser returns immediately.  bool supports
>>> 1-byte cmpxchg on all supported architectures, and no lock
>>> ordering changes are needed.
>>>
>>> Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations 
>>> when using nfsdcld")
>>> Assisted-by: kres:claude-opus-4-7
>>> Signed-off-by: Chris Mason <clm@meta.com>
>>> ---
>>>  fs/nfsd/nfs4state.c | 17 ++++++++++++++---
>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>> index f4d12dbcf97b..dc4ac541436f 100644
>>> --- a/fs/nfsd/nfs4state.c
>>> +++ b/fs/nfsd/nfs4state.c
>>> @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct 
>>> nfsd4_compound_state *cstate,
>>>  static void
>>>  nfsd4_end_grace(struct nfsd_net *nn)
>>>  {
>>> -	/* do nothing if grace period already ended */
>>> -	if (nn->grace_ended)
>>> +	bool expected = false;
>>> +
>>> +	/*
>>> +	 * nfsd4_end_grace() can be entered concurrently from the
>>> +	 * laundromat workqueue and from an nfsd compound thread
>>> +	 * handling RECLAIM_COMPLETE.  Without serialization, both
>>> +	 * callers can observe grace_ended==false and proceed into
>>> +	 * nfsd4_record_grace_done().  For tracking ops whose
>>> +	 * grace_done drains reclaim_str_hashtbl, that results in
>>> +	 * list corruption and a double free of every
>>> +	 * nfs4_client_reclaim entry.  Use an atomic test-and-set so
>>> +	 * exactly one caller proceeds.
>>> +	 */
>>> +	if (!try_cmpxchg(&nn->grace_ended, &expected, true))
>>>  		return;
>>>
>>>  	trace_nfsd_grace_complete(nn);
>>> -	nn->grace_ended = true;
>>>  	/*
>>>  	 * If the server goes down again right now, an NFSv4
>>>  	 * client will still be allowed to reclaim after it comes back up,
>>>
>>> -- 
>>> 2.54.0
>>
>> Seems like the usual idiom for something like this is an atomic
>> bit op. Perhaps try_cmpxchg on a boolean variable is not going
>> to behave as you expect on every hardware platform.
> 
> We just need a single flag here though. try_cmpxchg() had better work
> the same way on every platform or a lot of stuff is FUBAR. Where
> wouldn't it?

Codex suggests on Hexagon, cmpxchg grabs more than just that boolean.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts
  2026-05-28 21:55 ` [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts Jeff Layton
@ 2026-05-29 16:22   ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 16:22 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel



On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> From: Chris Mason <clm@meta.com>
>
> nfs4_client_to_reclaim() unconditionally allocates a new
> nfs4_client_reclaim, prepends it to reclaim_str_hashtbl[], and bumps
> reclaim_str_hashtbl_size with no check for an existing entry for the
> same client name.  After a reboot with a populated recovery directory
> that inflates the counter by one for every client that reclaims:
>
>     boot:    load_recdir()
>                nfs4_client_to_reclaim(name)   /* entry #1, size++ */
>
>     grace:   RECLAIM_COMPLETE
>                __nfsd4_create_reclaim_record_grace()
>                  nfs4_client_to_reclaim(name) /* entry #2, size++ */
>
> inc_reclaim_complete() ends the grace period early only when
>
>     atomic_inc_return(&nn->nr_reclaim_complete) ==
>         nn->reclaim_str_hashtbl_size
>
> With reclaim_str_hashtbl_size at 2N and nr_reclaim_complete capped at
> N, the equality never holds and the fast end-of-grace path is dead.
> The grace period always runs out the full 90-second laundromat timer,
> and the shadow entry left in the hash table carries a dangling cr_clp
> for any reader that walks it.
>
> Fix nfs4_client_to_reclaim() to compute strhashval first, look the
> name up with nfsd4_find_reclaim_client(), and on a hit fold the new
> princhash into the existing record (if it lacks one) and return that
> record without allocating or touching reclaim_str_hashtbl_size.  On
> kmemdup() failure during the fold-in, return NULL so
> __cld_pipe_inprogress_downcall() surfaces -EFAULT to nfsdcld, matching
> the miss-path contract.
>
> Because the fold-in writes cr_princhash.data and cr_princhash.len on
> a record that is already linked into reclaim_str_hashtbl[], pair the
> two stores with smp_store_release() on .len after WRITE_ONCE() on
> .data, and have nfsd4_cld_check_v2() read .len with smp_load_acquire()
> before READ_ONCE() on .data, so a concurrent principal-hash check
> cannot observe a torn (data, len) pair.
>
> Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations 
> when using nfsdcld")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>

The motivating example in the commit message is built on the legacy
recovery-directory path, but the bug this patch fixes is nfsdcld-only,
which the Fixes: tag already reflects.

Could you recast the example around the path the patch actually
repairs -- a boot-time cld enumeration record followed by an in-grace
RECLAIM_COMPLETE downcall for the same cl_name -- where the records
are keyed consistently by cl_name and the dedup works?  The code
change itself looks correct.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-28 21:55 ` [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write Jeff Layton
@ 2026-05-29 16:57   ` Chuck Lever
  2026-05-29 17:01     ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 16:57 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel



On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> From: Chris Mason <clm@meta.com>
>
> nfsd_direct_write() walks a list of write segments and, after each
> vfs_iocb_iter_write(), tries to detect a short write so the loop can
> stop before placing the next segment at a wrong file offset:
>
>     host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>     if (host_err < 0)
>             return host_err;
>     *cnt += host_err;
>     if (host_err < segments[i].iter.count)
>             break;	/* partial write */
>
> vfs_iocb_iter_write() runs the iter through ->write_iter(), which
> advances the iter by the number of bytes written. By the time the
> check runs, segments[i].iter.count is the residual, not the original
> request length:
>
>     before write_iter: iter.count == original_len
>     after  write_iter: iter.count == original_len - host_err
>
> The condition then reduces to host_err < original_len - host_err, so
> the break fires only when less than half of the segment was written.
> Any short write completing between 50% and 99% of the segment slips
> through; the loop advances to the next segment with kiocb->ki_pos
> only bumped by the short amount, writing the next segment's payload
> at the wrong offset and over-reporting *cnt to the NFS client.
>
> Snapshot the segment's byte count before the write and compare
> host_err against that snapshot so any short write breaks the loop.
>
> Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@meta.com>
> ---
>  fs/nfsd/vfs.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 980217f755b7..619f252af4d1 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct 
> svc_fh *fhp,
>  	struct file *file = nf->nf_file;
>  	unsigned int nsegs, i;
>  	ssize_t host_err;
> +	size_t expected;
> 
>  	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
>  					  kiocb, *cnt, segments);
> @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, 
> struct svc_fh *fhp,
>  				kiocb->ki_flags |= IOCB_DONTCACHE;
>  		}
> 
> +		expected = iov_iter_count(&segments[i].iter);
> +
>  		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>  		if (host_err < 0)
>  			return host_err;
>  		*cnt += host_err;
> -		if (host_err < segments[i].iter.count)
> +		if (host_err < (ssize_t)expected)
>  			break;	/* partial write */
>  	}
> 
>
> -- 
> 2.54.0

How many filesystems can return a short write in this case?
My impression was that only the NFS client can do that.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-29 16:57   ` Chuck Lever
@ 2026-05-29 17:01     ` Jeff Layton
  2026-05-29 17:03       ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 17:01 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
> 
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > From: Chris Mason <clm@meta.com>
> > 
> > nfsd_direct_write() walks a list of write segments and, after each
> > vfs_iocb_iter_write(), tries to detect a short write so the loop can
> > stop before placing the next segment at a wrong file offset:
> > 
> >     host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> >     if (host_err < 0)
> >             return host_err;
> >     *cnt += host_err;
> >     if (host_err < segments[i].iter.count)
> >             break;	/* partial write */
> > 
> > vfs_iocb_iter_write() runs the iter through ->write_iter(), which
> > advances the iter by the number of bytes written. By the time the
> > check runs, segments[i].iter.count is the residual, not the original
> > request length:
> > 
> >     before write_iter: iter.count == original_len
> >     after  write_iter: iter.count == original_len - host_err
> > 
> > The condition then reduces to host_err < original_len - host_err, so
> > the break fires only when less than half of the segment was written.
> > Any short write completing between 50% and 99% of the segment slips
> > through; the loop advances to the next segment with kiocb->ki_pos
> > only bumped by the short amount, writing the next segment's payload
> > at the wrong offset and over-reporting *cnt to the NFS client.
> > 
> > Snapshot the segment's byte count before the write and compare
> > host_err against that snapshot so any short write breaks the loop.
> > 
> > Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
> > Assisted-by: kres:claude-opus-4-7
> > Signed-off-by: Chris Mason <clm@meta.com>
> > ---
> >  fs/nfsd/vfs.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 980217f755b7..619f252af4d1 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct 
> > svc_fh *fhp,
> >  	struct file *file = nf->nf_file;
> >  	unsigned int nsegs, i;
> >  	ssize_t host_err;
> > +	size_t expected;
> > 
> >  	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
> >  					  kiocb, *cnt, segments);
> > @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, 
> > struct svc_fh *fhp,
> >  				kiocb->ki_flags |= IOCB_DONTCACHE;
> >  		}
> > 
> > +		expected = iov_iter_count(&segments[i].iter);
> > +
> >  		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> >  		if (host_err < 0)
> >  			return host_err;
> >  		*cnt += host_err;
> > -		if (host_err < segments[i].iter.count)
> > +		if (host_err < (ssize_t)expected)
> >  			break;	/* partial write */
> >  	}
> > 
> > 
> > -- 
> > 2.54.0
> 
> How many filesystems can return a short write in this case?
> My impression was that only the NFS client can do that.
> 

No idea right offhand, but NFS is exportable. Since
vfs_iocb_iter_write() is allowed to return a short write, I think we
have to deal with that properly here.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set
  2026-05-29 16:05       ` Chuck Lever
@ 2026-05-29 17:02         ` Jeff Layton
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 17:02 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 12:05 -0400, Chuck Lever wrote:
> On 5/29/26 11:57 AM, Jeff Layton wrote:
> > On Fri, 2026-05-29 at 11:38 -0400, Chuck Lever wrote:
> > > 
> > > On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > > > From: Chris Mason <clm@meta.com>
> > > > 
> > > > nfsd4_end_grace() guards its drain path with a plain bool:
> > > > 
> > > >     if (nn->grace_ended)
> > > >             return;
> > > >     nn->grace_ended = true;
> > > > 
> > > > The read and the write are independent, and nothing in struct
> > > > nfsd_net serializes them.  At least two contexts can reach this
> > > > code with no lock held:
> > > > 
> > > >     laundromat path
> > > >       laundry_wq kworker
> > > >         nfs4_laundromat()
> > > >           nfsd4_end_grace()
> > > > 
> > > >     RECLAIM_COMPLETE path
> > > >       nfsd compound kthread
> > > >         nfsd4_reclaim_complete()
> > > >           inc_reclaim_complete()
> > > >             nfsd4_end_grace()
> > > > 
> > > > Both callers can observe grace_ended == false on different CPUs,
> > > > both store true, and both proceed into nfsd4_record_grace_done(),
> > > > which invokes the active client_tracking_ops->grace_done callback.
> > > > For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
> > > > via nfsd4_recdir_purge_old, and the cld v1+ ops via
> > > > nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
> > > > which walks every bucket of reclaim_str_hashtbl with no lock and
> > > > calls nfs4_remove_reclaim_record() (list_del + kfree) on each
> > > > entry.  Two concurrent walkers corrupt the list and double-free
> > > > every nfs4_client_reclaim.  A concurrent nfsd4_find_reclaim_client()
> > > > iterating the same bucket reads through freed memory.
> > > > 
> > > > A third call site exists in nfs4_state_start_net() on the
> > > > skip_grace startup path, but it runs under nfsd_mutex before any
> > > > client has connected and before the laundromat's first delayed
> > > > work fires, so it cannot race with the two callers above.
> > > > 
> > > > Fix by replacing the read/write pair with try_cmpxchg() so exactly
> > > > one caller transitions grace_ended from false to true and proceeds
> > > > into the drain; the loser returns immediately.  bool supports
> > > > 1-byte cmpxchg on all supported architectures, and no lock
> > > > ordering changes are needed.
> > > > 
> > > > Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations 
> > > > when using nfsdcld")
> > > > Assisted-by: kres:claude-opus-4-7
> > > > Signed-off-by: Chris Mason <clm@meta.com>
> > > > ---
> > > >  fs/nfsd/nfs4state.c | 17 ++++++++++++++---
> > > >  1 file changed, 14 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > index f4d12dbcf97b..dc4ac541436f 100644
> > > > --- a/fs/nfsd/nfs4state.c
> > > > +++ b/fs/nfsd/nfs4state.c
> > > > @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct 
> > > > nfsd4_compound_state *cstate,
> > > >  static void
> > > >  nfsd4_end_grace(struct nfsd_net *nn)
> > > >  {
> > > > -	/* do nothing if grace period already ended */
> > > > -	if (nn->grace_ended)
> > > > +	bool expected = false;
> > > > +
> > > > +	/*
> > > > +	 * nfsd4_end_grace() can be entered concurrently from the
> > > > +	 * laundromat workqueue and from an nfsd compound thread
> > > > +	 * handling RECLAIM_COMPLETE.  Without serialization, both
> > > > +	 * callers can observe grace_ended==false and proceed into
> > > > +	 * nfsd4_record_grace_done().  For tracking ops whose
> > > > +	 * grace_done drains reclaim_str_hashtbl, that results in
> > > > +	 * list corruption and a double free of every
> > > > +	 * nfs4_client_reclaim entry.  Use an atomic test-and-set so
> > > > +	 * exactly one caller proceeds.
> > > > +	 */
> > > > +	if (!try_cmpxchg(&nn->grace_ended, &expected, true))
> > > >  		return;
> > > > 
> > > >  	trace_nfsd_grace_complete(nn);
> > > > -	nn->grace_ended = true;
> > > >  	/*
> > > >  	 * If the server goes down again right now, an NFSv4
> > > >  	 * client will still be allowed to reclaim after it comes back up,
> > > > 
> > > > -- 
> > > > 2.54.0
> > > 
> > > Seems like the usual idiom for something like this is an atomic
> > > bit op. Perhaps try_cmpxchg on a boolean variable is not going
> > > to behave as you expect on every hardware platform.
> > 
> > We just need a single flag here though. try_cmpxchg() had better work
> > the same way on every platform or a lot of stuff is FUBAR. Where
> > wouldn't it?
> 
> Codex suggests on Hexagon, cmpxchg grabs more than just that boolean.
> 

Ok, I'm convinced. What I think I'll do is add a new flags unsigned
long to struct nfsd_net and we can convert most of the bools in that
struct to use it instead.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-29 17:01     ` Jeff Layton
@ 2026-05-29 17:03       ` Chuck Lever
  2026-05-29 17:06         ` Jeff Layton
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 17:03 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On 5/29/26 1:01 PM, Jeff Layton wrote:
> On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
>>
>> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>>> From: Chris Mason <clm@meta.com>
>>>
>>> nfsd_direct_write() walks a list of write segments and, after each
>>> vfs_iocb_iter_write(), tries to detect a short write so the loop can
>>> stop before placing the next segment at a wrong file offset:
>>>
>>>     host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>     if (host_err < 0)
>>>             return host_err;
>>>     *cnt += host_err;
>>>     if (host_err < segments[i].iter.count)
>>>             break;	/* partial write */
>>>
>>> vfs_iocb_iter_write() runs the iter through ->write_iter(), which
>>> advances the iter by the number of bytes written. By the time the
>>> check runs, segments[i].iter.count is the residual, not the original
>>> request length:
>>>
>>>     before write_iter: iter.count == original_len
>>>     after  write_iter: iter.count == original_len - host_err
>>>
>>> The condition then reduces to host_err < original_len - host_err, so
>>> the break fires only when less than half of the segment was written.
>>> Any short write completing between 50% and 99% of the segment slips
>>> through; the loop advances to the next segment with kiocb->ki_pos
>>> only bumped by the short amount, writing the next segment's payload
>>> at the wrong offset and over-reporting *cnt to the NFS client.
>>>
>>> Snapshot the segment's byte count before the write and compare
>>> host_err against that snapshot so any short write breaks the loop.
>>>
>>> Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
>>> Assisted-by: kres:claude-opus-4-7
>>> Signed-off-by: Chris Mason <clm@meta.com>
>>> ---
>>>  fs/nfsd/vfs.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>> index 980217f755b7..619f252af4d1 100644
>>> --- a/fs/nfsd/vfs.c
>>> +++ b/fs/nfsd/vfs.c
>>> @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct 
>>> svc_fh *fhp,
>>>  	struct file *file = nf->nf_file;
>>>  	unsigned int nsegs, i;
>>>  	ssize_t host_err;
>>> +	size_t expected;
>>>
>>>  	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
>>>  					  kiocb, *cnt, segments);
>>> @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, 
>>> struct svc_fh *fhp,
>>>  				kiocb->ki_flags |= IOCB_DONTCACHE;
>>>  		}
>>>
>>> +		expected = iov_iter_count(&segments[i].iter);
>>> +
>>>  		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>  		if (host_err < 0)
>>>  			return host_err;
>>>  		*cnt += host_err;
>>> -		if (host_err < segments[i].iter.count)
>>> +		if (host_err < (ssize_t)expected)
>>>  			break;	/* partial write */
>>>  	}
>>>
>>>
>>> -- 
>>> 2.54.0
>>
>> How many filesystems can return a short write in this case?
>> My impression was that only the NFS client can do that.
>>
> 
> No idea right offhand, but NFS is exportable. Since
> vfs_iocb_iter_write() is allowed to return a short write, I think we
> have to deal with that properly here.

NFSD_IO_DIRECT is experimental, and doesn't make sense (to me)
to use with an NFS re-export.

If we can find another filesystem that might return a short write
with NFSD_IO_DIRECT, I might consider this a higher priority.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-29 17:03       ` Chuck Lever
@ 2026-05-29 17:06         ` Jeff Layton
  2026-05-29 17:09           ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 17:06 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 13:03 -0400, Chuck Lever wrote:
> On 5/29/26 1:01 PM, Jeff Layton wrote:
> > On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
> > > 
> > > On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > > > From: Chris Mason <clm@meta.com>
> > > > 
> > > > nfsd_direct_write() walks a list of write segments and, after each
> > > > vfs_iocb_iter_write(), tries to detect a short write so the loop can
> > > > stop before placing the next segment at a wrong file offset:
> > > > 
> > > >     host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> > > >     if (host_err < 0)
> > > >             return host_err;
> > > >     *cnt += host_err;
> > > >     if (host_err < segments[i].iter.count)
> > > >             break;	/* partial write */
> > > > 
> > > > vfs_iocb_iter_write() runs the iter through ->write_iter(), which
> > > > advances the iter by the number of bytes written. By the time the
> > > > check runs, segments[i].iter.count is the residual, not the original
> > > > request length:
> > > > 
> > > >     before write_iter: iter.count == original_len
> > > >     after  write_iter: iter.count == original_len - host_err
> > > > 
> > > > The condition then reduces to host_err < original_len - host_err, so
> > > > the break fires only when less than half of the segment was written.
> > > > Any short write completing between 50% and 99% of the segment slips
> > > > through; the loop advances to the next segment with kiocb->ki_pos
> > > > only bumped by the short amount, writing the next segment's payload
> > > > at the wrong offset and over-reporting *cnt to the NFS client.
> > > > 
> > > > Snapshot the segment's byte count before the write and compare
> > > > host_err against that snapshot so any short write breaks the loop.
> > > > 
> > > > Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
> > > > Assisted-by: kres:claude-opus-4-7
> > > > Signed-off-by: Chris Mason <clm@meta.com>
> > > > ---
> > > >  fs/nfsd/vfs.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > index 980217f755b7..619f252af4d1 100644
> > > > --- a/fs/nfsd/vfs.c
> > > > +++ b/fs/nfsd/vfs.c
> > > > @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct 
> > > > svc_fh *fhp,
> > > >  	struct file *file = nf->nf_file;
> > > >  	unsigned int nsegs, i;
> > > >  	ssize_t host_err;
> > > > +	size_t expected;
> > > > 
> > > >  	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
> > > >  					  kiocb, *cnt, segments);
> > > > @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, 
> > > > struct svc_fh *fhp,
> > > >  				kiocb->ki_flags |= IOCB_DONTCACHE;
> > > >  		}
> > > > 
> > > > +		expected = iov_iter_count(&segments[i].iter);
> > > > +
> > > >  		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> > > >  		if (host_err < 0)
> > > >  			return host_err;
> > > >  		*cnt += host_err;
> > > > -		if (host_err < segments[i].iter.count)
> > > > +		if (host_err < (ssize_t)expected)
> > > >  			break;	/* partial write */
> > > >  	}
> > > > 
> > > > 
> > > > -- 
> > > > 2.54.0
> > > 
> > > How many filesystems can return a short write in this case?
> > > My impression was that only the NFS client can do that.
> > > 
> > 
> > No idea right offhand, but NFS is exportable. Since
> > vfs_iocb_iter_write() is allowed to return a short write, I think we
> > have to deal with that properly here.
> 
> NFSD_IO_DIRECT is experimental, and doesn't make sense (to me)
> to use with an NFS re-export.
> 
> If we can find another filesystem that might return a short write
> with NFSD_IO_DIRECT, I might consider this a higher priority.
> 

I'm fairly sure Ceph and CIFS can and they're exportable. For local
filesystems, I'm not even sure how to audit that.

Given that the potential effect is data corruption, omitting this patch
based on guesswork about what filesystems are being exported seems
unwise.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
  2026-05-29 17:06         ` Jeff Layton
@ 2026-05-29 17:09           ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 17:09 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On 5/29/26 1:06 PM, Jeff Layton wrote:
> On Fri, 2026-05-29 at 13:03 -0400, Chuck Lever wrote:
>> On 5/29/26 1:01 PM, Jeff Layton wrote:
>>> On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
>>>>
>>>> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>>>>> From: Chris Mason <clm@meta.com>
>>>>>
>>>>> nfsd_direct_write() walks a list of write segments and, after each
>>>>> vfs_iocb_iter_write(), tries to detect a short write so the loop can
>>>>> stop before placing the next segment at a wrong file offset:
>>>>>
>>>>>     host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>>>     if (host_err < 0)
>>>>>             return host_err;
>>>>>     *cnt += host_err;
>>>>>     if (host_err < segments[i].iter.count)
>>>>>             break;	/* partial write */
>>>>>
>>>>> vfs_iocb_iter_write() runs the iter through ->write_iter(), which
>>>>> advances the iter by the number of bytes written. By the time the
>>>>> check runs, segments[i].iter.count is the residual, not the original
>>>>> request length:
>>>>>
>>>>>     before write_iter: iter.count == original_len
>>>>>     after  write_iter: iter.count == original_len - host_err
>>>>>
>>>>> The condition then reduces to host_err < original_len - host_err, so
>>>>> the break fires only when less than half of the segment was written.
>>>>> Any short write completing between 50% and 99% of the segment slips
>>>>> through; the loop advances to the next segment with kiocb->ki_pos
>>>>> only bumped by the short amount, writing the next segment's payload
>>>>> at the wrong offset and over-reporting *cnt to the NFS client.
>>>>>
>>>>> Snapshot the segment's byte count before the write and compare
>>>>> host_err against that snapshot so any short write breaks the loop.
>>>>>
>>>>> Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
>>>>> Assisted-by: kres:claude-opus-4-7
>>>>> Signed-off-by: Chris Mason <clm@meta.com>
>>>>> ---
>>>>>  fs/nfsd/vfs.c | 5 ++++-
>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>>> index 980217f755b7..619f252af4d1 100644
>>>>> --- a/fs/nfsd/vfs.c
>>>>> +++ b/fs/nfsd/vfs.c
>>>>> @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct 
>>>>> svc_fh *fhp,
>>>>>  	struct file *file = nf->nf_file;
>>>>>  	unsigned int nsegs, i;
>>>>>  	ssize_t host_err;
>>>>> +	size_t expected;
>>>>>
>>>>>  	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
>>>>>  					  kiocb, *cnt, segments);
>>>>> @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, 
>>>>> struct svc_fh *fhp,
>>>>>  				kiocb->ki_flags |= IOCB_DONTCACHE;
>>>>>  		}
>>>>>
>>>>> +		expected = iov_iter_count(&segments[i].iter);
>>>>> +
>>>>>  		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>>>  		if (host_err < 0)
>>>>>  			return host_err;
>>>>>  		*cnt += host_err;
>>>>> -		if (host_err < segments[i].iter.count)
>>>>> +		if (host_err < (ssize_t)expected)
>>>>>  			break;	/* partial write */
>>>>>  	}
>>>>>
>>>>>
>>>>> -- 
>>>>> 2.54.0
>>>>
>>>> How many filesystems can return a short write in this case?
>>>> My impression was that only the NFS client can do that.
>>>>
>>>
>>> No idea right offhand, but NFS is exportable. Since
>>> vfs_iocb_iter_write() is allowed to return a short write, I think we
>>> have to deal with that properly here.
>>
>> NFSD_IO_DIRECT is experimental, and doesn't make sense (to me)
>> to use with an NFS re-export.
>>
>> If we can find another filesystem that might return a short write
>> with NFSD_IO_DIRECT, I might consider this a higher priority.
>>
> 
> I'm fairly sure Ceph and CIFS can and they're exportable. For local
> filesystems, I'm not even sure how to audit that.
> 
> Given that the potential effect is data corruption, omitting this patch
> based on guesswork about what filesystems are being exported seems
> unwise.

The guesswork for me is that it's very unclear how likely a short write
result might be and under what circumstances.

In any event, I think this one needs testing to confirm that the logic
change doesn't introduce a regression in known working use cases.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session
  2026-05-29 15:13   ` Chuck Lever
@ 2026-05-29 17:31     ` Jeff Layton
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 17:31 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 11:13 -0400, Chuck Lever wrote:
> 
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > From: Chris Mason <clm@meta.com>
> > 
> > After a DESTROY_SESSION the per-session teardown path can free a
> > session while rpciod still holds an inflight callback rpc_task that
> > dereferences clp->cl_cb_session.  nfsd4_probe_callback_sync() flushes
> > cl_callback_wq, but once nfsd4_run_cb_work() has called
> > rpc_call_async() the rpc_task lives on rpciod; flushing the workqueue
> > does not wait for it.  After the flush returns,
> > nfsd4_destroy_session() proceeds through nfsd4_put_session_locked()
> > and free_session() kfree()s the slab while rpciod's
> > nfsd4_cb_sequence_done(), grab_slot(), and nfsd41_cb_release_slot()
> > are still dereferencing cb->cb_clp->cl_cb_session.
> > 
> >     destroy path                       rpciod
> >     ------------                       ------
> >     unhash_session(ses)
> >     nfsd4_probe_callback_sync(clp)
> >       flush_workqueue(cl_callback_wq)
> >       /* returns; rpc_task still live */
> >     nfsd4_put_session_locked(ses)
> >     free_session(ses) -> kfree(ses)
> >                                        nfsd4_cb_sequence_done()
> >                                          reads cb_clp->cl_cb_session
> >                                          /* freed slab */
> > 
> > A second window exists in nfsd4_process_cb_update().  When
> > __nfsd4_find_backchannel() returns NULL because unhash_session() has
> > already removed the destroyed session from cl_sessions,
> > setup_callback_client() takes the v4.1 early return
> > 
> >     if (!conn->cb_xprt || !ses)
> >             return -EINVAL;
> > 
> > so clp->cl_cb_session = ses never fires and the field retains a
> > pointer to the about-to-be-freed session.  Symmetrically, if a later
> > probe finds a different session's backchannel conn and that
> > setup_callback_client() call fails, the error tail must still scrub
> > any previously published cl_cb_session.
> > 
> > Fix by mirroring the two-stage drain that nfsd4_shutdown_callback()
> > already performs: call nfsd41_cb_inflight_wait_complete() in
> > nfsd4_probe_callback_sync() after flush_workqueue() so rpciod-side
> > nfsd41_cb_inflight_end() decrements are observed before the caller
> > releases the final session reference.  The two direct callers,
> > nfsd4_destroy_session() and nfsd4_init_conn() (itself invoked from
> > nfsd4_create_session() and nfsd4_bind_conn_to_session()), run in
> > sleepable process context and tolerate the wait_var_event() sleep:
> > 
> >     nfsd4_destroy_session() (fs/nfsd/nfs4state.c):
> >       unhash_session(ses);
> >       spin_unlock(&nn->client_lock);   /* spinlock dropped */
> >       nfsd4_probe_callback_sync(ses->se_client);
> > 
> >     nfsd4_init_conn() (fs/nfsd/nfs4state.c):
> >       acquires no locks in its body; calls nfsd4_hash_conn(),
> >       nfsd4_register_conn(), then nfsd4_probe_callback_sync() --
> >       entirely in sleepable process context.
> > 
> > Also clear clp->cl_cb_session unconditionally on the
> > nfsd4_process_cb_update() error return so every
> > setup_callback_client() failure -- whether c is NULL or points at a
> > different session whose probe failed -- leaves the field NULL rather
> > than pointing at a session that may subsequently be freed.
> > 
> > Fixes: dcbeaa68dbbd ("nfsd4: allow backchannel recovery")
> > Assisted-by: kres:claude-opus-4-7
> > Signed-off-by: Chris Mason <clm@meta.com>
> > ---
> >  fs/nfsd/nfs4callback.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> > index 1964a213f80e..1cf6b6100357 100644
> > --- a/fs/nfsd/nfs4callback.c
> > +++ b/fs/nfsd/nfs4callback.c
> > @@ -1205,9 +1205,8 @@ static int setup_callback_client(struct 
> > nfs4_client *clp, struct nfs4_cb_conn *c
> >  	} else {
> >  		if (!conn->cb_xprt || !ses)
> >  			return -EINVAL;
> > -		clp->cl_cb_session = ses;
> >  		args.bc_xprt = conn->cb_xprt;
> > -		args.prognumber = clp->cl_cb_session->se_cb_prog;
> > +		args.prognumber = ses->se_cb_prog;
> >  		args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
> >  				XPRT_TRANSPORT_BC;
> >  		args.authflavor = ses->se_cb_sec.flavor;
> > @@ -1225,8 +1224,10 @@ static int setup_callback_client(struct 
> > nfs4_client *clp, struct nfs4_cb_conn *c
> >  		return -ENOMEM;
> >  	}
> > 
> > -	if (clp->cl_minorversion != 0)
> > +	if (clp->cl_minorversion != 0) {
> >  		clp->cl_cb_conn.cb_xprt = conn->cb_xprt;
> > +		clp->cl_cb_session = ses;
> > +	}
> >  	clp->cl_cb_client = client;
> >  	clp->cl_cb_cred = cred;
> >  	rcu_read_lock();
> > @@ -1299,6 +1300,7 @@ void nfsd4_probe_callback_sync(struct nfs4_client *clp)
> >  {
> >  	nfsd4_probe_callback(clp);
> >  	flush_workqueue(clp->cl_callback_wq);
> > +	nfsd41_cb_inflight_wait_complete(clp);
> >  }
> > 
> >  void nfsd4_change_callback(struct nfs4_client *clp, struct 
> > nfs4_cb_conn *conn)
> > @@ -1679,7 +1681,17 @@ static struct nfsd4_conn * 
> > __nfsd4_find_backchannel(struct nfs4_client *clp)
> >   * Note there isn't a lot of locking in this code; instead we depend on
> >   * the fact that it is run from clp->cl_callback_wq, which won't run 
> > two
> >   * work items at once.  So, for example, clp->cl_callback_wq handles 
> > all
> > - * access of cl_cb_client and all calls to rpc_create or 
> > rpc_shutdown_client.
> > + * access of cl_cb_client and cl_cb_session, and all calls to 
> > rpc_create
> > + * or rpc_shutdown_client.
> > + *
> > + * rpciod-side readers of cl_cb_session (encode_cb_sequence4args(),
> > + * nfsd4_cb_sequence_done(), the cb-slot helpers, and the cb_sequence
> > + * tracepoints) run outside cl_callback_wq.  The
> > + * nfsd41_cb_inflight_wait_complete() drain in 
> > nfsd4_probe_callback_sync()
> > + * waits until cl_cb_inflight reaches zero before the caller proceeds 
> > with
> > + * session teardown; any rpc_task that reads cl_cb_session must hold an
> > + * inflight pin (via nfsd41_cb_inflight_begin) for this fence to be
> > + * effective.
> >   */
> >  static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
> >  {
> > @@ -1731,6 +1743,7 @@ static void nfsd4_process_cb_update(struct 
> > nfsd4_callback *cb)
> >  		nfsd4_mark_cb_down(clp);
> >  		if (c)
> >  			svc_xprt_put(c->cn_xprt);
> > +		clp->cl_cb_session = NULL;
> >  		return;
> >  	}
> >  }
> > 
> > -- 
> > 2.54.0
> 
> Several NFSD callback done handlers retry indefinitely on
> NFS4ERR_DELAY via rpc_delay(), so a client that keeps
> replying DELAY leaves this per-client counter nonzero and
> blocks the foreground CREATE/BIND/DESTROY_SESSION request
> even though the callback no longer references the session
> being torn down.
> 
> Although partly due to the way callbacks are structured
> currently, this patch potentially introduces a client-
> controlled DoS vector.
> 

Good point. I'm currently looking at reworking this so that each stage
of the callback state machine can tolerate a NULL cl_cb_session
pointer. If I can make that work, then we can fix the UAF without
blocking.

I'll definitely be sending a v2 of the series.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
  2026-05-28 22:11   ` Rick Macklem
  2026-05-29  7:34   ` Cedric Blancher
@ 2026-05-29 18:34   ` Chuck Lever
  2026-05-29 18:41     ` Jeff Layton
  2026-05-29 23:04     ` Rick Macklem
  2 siblings, 2 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 18:34 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Andreas Gruenbacher,
	Mike Snitzer, Rick Macklem, Trond Myklebust
  Cc: Chris Mason, linux-nfs, linux-kernel

[ replaced broken email address for Trond ]

On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:

> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index c6c50c376b23..5469c6c207ba 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs 
> *argp, struct posix_acl **acl)
> 
>  	if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
>  		return nfserr_bad_xdr;
> +	if (count > NFS_ACL_MAX_ENTRIES)
> +		return nfserr_resource;

nfserr_resource is consistent with other fattr4 decoders, but
does not make sense here, IMO. A better choice is nfserr_inval.

Rick, any opinion?


>  	*acl = posix_acl_alloc(count, GFP_KERNEL);
>  	if (*acl == NULL)
>
> -- 
> 2.54.0

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29 18:34   ` Chuck Lever
@ 2026-05-29 18:41     ` Jeff Layton
  2026-05-29 18:48       ` Chuck Lever
  2026-05-29 23:04     ` Rick Macklem
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff Layton @ 2026-05-29 18:41 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Andreas Gruenbacher,
	Mike Snitzer, Rick Macklem, Trond Myklebust
  Cc: Chris Mason, linux-nfs, linux-kernel

On Fri, 2026-05-29 at 14:34 -0400, Chuck Lever wrote:
> [ replaced broken email address for Trond ]
> 
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> 
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index c6c50c376b23..5469c6c207ba 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs 
> > *argp, struct posix_acl **acl)
> > 
> >  	if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
> >  		return nfserr_bad_xdr;
> > +	if (count > NFS_ACL_MAX_ENTRIES)
> > +		return nfserr_resource;
> 
> nfserr_resource is consistent with other fattr4 decoders, but
> does not make sense here, IMO. A better choice is nfserr_inval.
> 

Why not? An ACL that long doesn't violate the spec (as you pointed
out), the implementation just can't handle it. I do agree that
nfserr_resource is not the ideal error code, but it's the closest error
I can see that says "you hit an internal limitation of the server".

> Rick, any opinion?
> 
> 
> >  	*acl = posix_acl_alloc(count, GFP_KERNEL);
> >  	if (*acl == NULL)
> > 
> > -- 
> > 2.54.0

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29 18:41     ` Jeff Layton
@ 2026-05-29 18:48       ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 18:48 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Andreas Gruenbacher,
	Mike Snitzer, Rick Macklem, Trond Myklebust
  Cc: Chris Mason, linux-nfs, linux-kernel

On 5/29/26 2:41 PM, Jeff Layton wrote:
> On Fri, 2026-05-29 at 14:34 -0400, Chuck Lever wrote:
>> [ replaced broken email address for Trond ]
>>
>> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>>
>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>> index c6c50c376b23..5469c6c207ba 100644
>>> --- a/fs/nfsd/nfs4xdr.c
>>> +++ b/fs/nfsd/nfs4xdr.c
>>> @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs 
>>> *argp, struct posix_acl **acl)
>>>
>>>  	if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
>>>  		return nfserr_bad_xdr;
>>> +	if (count > NFS_ACL_MAX_ENTRIES)
>>> +		return nfserr_resource;
>>
>> nfserr_resource is consistent with other fattr4 decoders, but
>> does not make sense here, IMO. A better choice is nfserr_inval.
>>
> 
> Why not? An ACL that long doesn't violate the spec (as you pointed
> out), the implementation just can't handle it. I do agree that
> nfserr_resource is not the ideal error code, but it's the closest error
> I can see that says "you hit an internal limitation of the server".

Among other reasons, NFS4ERR_RESOURCE does not exist outside of NFSv4.0.
The server is not out of resources in this case.

For NFSv4.1+ NFSD's compound encoder converts it to either
nfserr_rep_too_big or nfserr_rep_too_big_to_cache, neither of which are
close to appropriate.

Rick's NFSv4 POSIX ACL specification uses NFS4ERR_INVAL for just about
every error condition. The specific meaning of INVAL is "invalid input"
which is exactly what we have here.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE
  2026-05-28 21:55 ` [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE Jeff Layton
@ 2026-05-29 18:55   ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2026-05-29 18:55 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem
  Cc: Chris Mason, linux-nfs, linux-kernel



On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> nfsd4_decode_create() accepts an unbounded cr_datalen from the wire for
> NF4LNK symlink targets, allowing a client to force a kmalloc of up to
> the RPC-max size (~1 MiB) per COMPOUND op that persists until compound
> teardown.  The VFS rejects oversized targets with ENAMETOOLONG, but the
> allocation has already occurred.
>
> Reject cr_datalen == 0 or cr_datalen >= PATH_MAX early with
> nfserr_nametoolong to bound the allocation.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/nfs4xdr.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 5469c6c207ba..1f5e49f50f3a 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -957,6 +957,10 @@ nfsd4_decode_create(struct nfsd4_compoundargs 
> *argp, union nfsd4_op_u *u)
>  	case NF4LNK:
>  		if (xdr_stream_decode_u32(argp->xdr, &create->cr_datalen) < 0)
>  			return nfserr_bad_xdr;
> +		if (create->cr_datalen == 0)
> +			return nfserr_inval;
> +		if (create->cr_datalen >= PATH_MAX)
> +			return nfserr_nametoolong;

Nit: The protocol already has NFS4_MAXPATHLEN defined to PATH_MAX.
The v3 decoder uses its analog NFS3_MAXPATHLEN. Using
NFS4_MAXPATHLEN here expresses the protocol-layer intent and
matches the cross-version idiom.

The new boundary condition differs from v2/v3. v3 uses
"tlen > NFS3_MAXPATHLEN", accepting a target of exactly PATH_MAX
bytes; this uses ">= PATH_MAX", rejecting it. Switching to 
"> NFS4_MAXPATHLEN" both adopts the named constant and realigns
the maximum accepted length with v2/v3.


>  		p = xdr_inline_decode(argp->xdr, create->cr_datalen);
>  		if (!p)
>  			return nfserr_bad_xdr;
>
> -- 
> 2.54.0

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost
  2026-05-29 18:34   ` Chuck Lever
  2026-05-29 18:41     ` Jeff Layton
@ 2026-05-29 23:04     ` Rick Macklem
  1 sibling, 0 replies; 40+ messages in thread
From: Rick Macklem @ 2026-05-29 23:04 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, J. Bruce Fields, Scott Mayhew, Andreas Gruenbacher,
	Mike Snitzer, Rick Macklem, Trond Myklebust, Chris Mason,
	linux-nfs, linux-kernel

On Fri, May 29, 2026 at 11:34 AM Chuck Lever <cel@kernel.org> wrote:
>
> [ replaced broken email address for Trond ]
>
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index c6c50c376b23..5469c6c207ba 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -448,6 +448,8 @@ nfsd4_decode_posixacl(struct nfsd4_compoundargs
> > *argp, struct posix_acl **acl)
> >
> >       if (xdr_stream_decode_u32(argp->xdr, &count) < 0)
> >               return nfserr_bad_xdr;
> > +     if (count > NFS_ACL_MAX_ENTRIES)
> > +             return nfserr_resource;
>
> nfserr_resource is consistent with other fattr4 decoders, but
> does not make sense here, IMO. A better choice is nfserr_inval.
>
> Rick, any opinion?
My understanding is the NFS4ERR_RESOURCE is a NFSv4.0 only
error.

Looking at Table 12 in RFC8881, NFS4ERR_INVAL seems the
best fit for SETATTR, although I didn't specify that in my draft.
(It's a bit unfortunate
that there is no other error values, since NFS4ERR_INVAL gets
used for everything else, but??)

Maybe I should add that to the draft?

rick


>
>
> >       *acl = posix_acl_alloc(count, GFP_KERNEL);
> >       if (*acl == NULL)
> >
> > --
> > 2.54.0
>
> --
> Chuck Lever
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* NFSv4.1 COMMIT of all changed areas only on flush? Re: [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients
  2026-05-28 21:55 ` [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients Jeff Layton
  2026-05-29 10:56   ` Jeff Layton
@ 2026-05-30  7:58   ` Cedric Blancher
  2026-05-30 10:24     ` Jeff Layton
  1 sibling, 1 reply; 40+ messages in thread
From: Cedric Blancher @ 2026-05-30  7:58 UTC (permalink / raw)
  To: Jeff Layton, NFSv4
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	Linux NFS Mailing List, Linux Kernel Mailing List

On Fri, 29 May 2026 at 00:03, Jeff Layton <jlayton@kernel.org> wrote:
>
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
> WRITE to be a FILE_SYNC WRITE. This indicates that the client does
> not need a subsequent COMMIT operation, saving a round trip and
> allowing the client to dispense with cached dirty data as soon as
> it receives the server's WRITE response.
>
> This patch refactors nfsd_vfs_write() to return a possibly modified
> stable_how value to its callers. No behavior change is expected.

Question: Could the NFS client just record the write position and
length, and only ask for a COMMIT when a flush is explicitly issued?

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: NFSv4.1 COMMIT of all changed areas only on flush? Re: [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients
  2026-05-30  7:58   ` NFSv4.1 COMMIT of all changed areas only on flush? " Cedric Blancher
@ 2026-05-30 10:24     ` Jeff Layton
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Layton @ 2026-05-30 10:24 UTC (permalink / raw)
  To: Cedric Blancher, NFSv4
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	J. Bruce Fields, Scott Mayhew, Trond Myklebust,
	Andreas Gruenbacher, Mike Snitzer, Rick Macklem, Chris Mason,
	Linux NFS Mailing List, Linux Kernel Mailing List

On Sat, 2026-05-30 at 09:58 +0200, Cedric Blancher wrote:
> On Fri, 29 May 2026 at 00:03, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
> > WRITE to be a FILE_SYNC WRITE. This indicates that the client does
> > not need a subsequent COMMIT operation, saving a round trip and
> > allowing the client to dispense with cached dirty data as soon as
> > it receives the server's WRITE response.
> > 
> > This patch refactors nfsd_vfs_write() to return a possibly modified
> > stable_how value to its callers. No behavior change is expected.
> 
> Question: Could the NFS client just record the write position and
> length, and only ask for a COMMIT when a flush is explicitly issued?
> 

I'm planning to drop this patch, fwiw...

If I understand your question though: the client is more or less
expected to keep track of what UNSTABLE writes are still outstanding
and eventually issue one or more COMMIT calls to sync them out to disk.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2026-05-30 10:24 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-28 21:55 [PATCH 00/10] nfsd: a pile of fixes for random bugs Jeff Layton
2026-05-28 21:55 ` [PATCH 01/10] nfsd: fix BUG_ON in nfsd4_alloc_layout_stateid on racing delegation revoke Jeff Layton
2026-05-28 23:40   ` NeilBrown
2026-05-29 14:44     ` Jeff Layton
2026-05-28 21:55 ` [PATCH 02/10] nfsd: drain callbacks and clear cl_cb_session Jeff Layton
2026-05-29 15:13   ` Chuck Lever
2026-05-29 17:31     ` Jeff Layton
2026-05-28 21:55 ` [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set Jeff Layton
2026-05-29 15:38   ` Chuck Lever
2026-05-29 15:57     ` Jeff Layton
2026-05-29 16:05       ` Chuck Lever
2026-05-29 17:02         ` Jeff Layton
2026-05-28 21:55 ` [PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts Jeff Layton
2026-05-29 16:22   ` Chuck Lever
2026-05-28 21:55 ` [PATCH 05/10] nfsd: gate nfs3 setacl by argp->mask Jeff Layton
2026-05-28 21:55 ` [PATCH 06/10] NFSD: Enable return of an updated stable_how to NFS clients Jeff Layton
2026-05-29 10:56   ` Jeff Layton
2026-05-30  7:58   ` NFSv4.1 COMMIT of all changed areas only on flush? " Cedric Blancher
2026-05-30 10:24     ` Jeff Layton
2026-05-28 21:55 ` [PATCH 07/10] NFSD: check truncate permission under inode lock Jeff Layton
2026-05-28 21:55 ` [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write Jeff Layton
2026-05-29 16:57   ` Chuck Lever
2026-05-29 17:01     ` Jeff Layton
2026-05-29 17:03       ` Chuck Lever
2026-05-29 17:06         ` Jeff Layton
2026-05-29 17:09           ` Chuck Lever
2026-05-28 21:55 ` [PATCH 09/10] nfsd: cap decoded POSIX ACL count to bound sort cost Jeff Layton
2026-05-28 22:11   ` Rick Macklem
2026-05-28 23:11     ` Chuck Lever
2026-05-29  0:07       ` Chuck Lever
2026-05-29 10:48         ` Jeff Layton
2026-05-29 13:20           ` Chuck Lever
2026-05-29  7:34   ` Cedric Blancher
2026-05-29 10:50     ` Jeff Layton
2026-05-29 18:34   ` Chuck Lever
2026-05-29 18:41     ` Jeff Layton
2026-05-29 18:48       ` Chuck Lever
2026-05-29 23:04     ` Rick Macklem
2026-05-28 21:55 ` [PATCH 10/10] nfsd: validate symlink target length in NFSv4 CREATE Jeff Layton
2026-05-29 18:55   ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox