linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch v4 0/3] NFSD: Fix server hang when there are multiple layout conflicts
@ 2025-11-15 19:16 Dai Ngo
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-15 19:16 UTC (permalink / raw)
  To: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

NFSD: Fix server hang when there are multiple layout conflicts

When a layout conflict triggers a call to __break_lease, the function
nfsd4_layout_lm_break clears the fl_break_time timeout before sending
the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
its loop, waiting indefinitely for the conflicting file lease to be
released.

If the number of lease conflicts matches the number of NFSD threads (which
defaults to 8), all available NFSD threads become occupied. onsequently,
there are no threads left to handle incoming requests or callback replies,
leading to a total hang of the NFSD server.

This issue is reliably reproducible by running the Git test suite on a
configuration using the SCSI layout.

This patchset fixes this problem by introducing the new lm_breaker_timedout
operation to lease_manager_operations and enforcing timeout for layout
lease break.

V2:
. replace int with u32 for ls_layout_type in nfsd_layout_breaker_timedout.

. add mechanism to ensure threads wait in __break_lease for layout conflict
  must wait until one of the waiting threads done with the fencing operation
  before these threads can continue.

V3:
. break the patchset into 3 patches:
  (1/3) locks: Introduce lm_breaker_timedout operation to lease_manager_operations.
        No change from V1
  (2/3) locks: Threads with layout conflict must wait until the client was fenced.
        New patch: add synchronization in __break_lease to ensure threads with
        layout conflict must wait until the client was fenced.
  (3/3) NFSD: Fix NFS server hang when there are multiple layout conflicts
        Replace int with u32 for ls_layout_type in nfsd_layout_breaker_timedout
V4:
. (2/3) add wait_queue_head_t in file_lock_context for threads to wait
        until fencing is done.

 Documentation/filesystems/locking.rst |  2 ++
 fs/locks.c                            | 38 +++++++++++++++++++++++++++---
 fs/nfsd/nfs4layouts.c                 | 26 ++++++++++++++++----
 include/linux/filelock.h              |  7 ++++++
 4 files changed, 66 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-15 19:16 [Patch v4 0/3] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
@ 2025-11-15 19:16 ` Dai Ngo
  2025-11-17 15:39   ` Chuck Lever
                     ` (2 more replies)
  2025-11-15 19:16 ` [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced Dai Ngo
  2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
  2 siblings, 3 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-15 19:16 UTC (permalink / raw)
  To: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

Some consumers of the lease_manager_operations structure need
to perform additional actions when a lease break, triggered by
a conflict, times out.

The NFS server is the first consumer of this operation.

When a pNFS layout conflict occurs and the lease break times
out — resulting in the layout being revoked and its file lease
removed from the flc_lease list — the NFS server must issue a
fence operation. This operation ensures that the client is
prevented from accessing the data server after the layout
revocation.

Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 Documentation/filesystems/locking.rst |  2 ++
 fs/locks.c                            | 14 +++++++++++---
 include/linux/filelock.h              |  2 ++
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 77704fde9845..cd600db6c4b9 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -403,6 +403,7 @@ prototypes::
 	bool (*lm_breaker_owns_lease)(struct file_lock *);
         bool (*lm_lock_expirable)(struct file_lock *);
         void (*lm_expire_lock)(void);
+        void (*lm_breaker_timedout)(struct file_lease *);
 
 locking rules:
 
@@ -416,6 +417,7 @@ lm_change		yes		no			no
 lm_breaker_owns_lease:	yes     	no			no
 lm_lock_expirable	yes		no			no
 lm_expire_lock		no		no			yes
+lm_breaker_timedout     no              no                      yes
 ======================	=============	=================	=========
 
 buffer_head
diff --git a/fs/locks.c b/fs/locks.c
index 04a3f0e20724..1f254e0cd398 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
 	while (!list_empty(dispose)) {
 		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
 		list_del_init(&flc->flc_list);
-		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
+		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
+			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
+				struct file_lease *fl = file_lease(flc);
+
+				if (fl->fl_lmops->lm_breaker_timedout)
+					fl->fl_lmops->lm_breaker_timedout(fl);
+			}
 			locks_free_lease(file_lease(flc));
-		else
+		} else
 			locks_free_lock(file_lock(flc));
 	}
 }
@@ -1482,8 +1488,10 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
 		trace_time_out_leases(inode, fl);
 		if (past_time(fl->fl_downgrade_time))
 			lease_modify(fl, F_RDLCK, dispose);
-		if (past_time(fl->fl_break_time))
+		if (past_time(fl->fl_break_time)) {
 			lease_modify(fl, F_UNLCK, dispose);
+			fl->c.flc_flags |= FL_BREAKER_TIMEDOUT;
+		}
 	}
 }
 
diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index c2ce8ba05d06..06ccd6b66012 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -17,6 +17,7 @@
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
 #define FL_LAYOUT	2048	/* outstanding pNFS layout */
 #define FL_RECLAIM	4096	/* reclaiming from a reboot server */
+#define	FL_BREAKER_TIMEDOUT	8192	/* lease breaker timed out */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
@@ -49,6 +50,7 @@ struct lease_manager_operations {
 	int (*lm_change)(struct file_lease *, int, struct list_head *);
 	void (*lm_setup)(struct file_lease *, void **);
 	bool (*lm_breaker_owns_lease)(struct file_lease *);
+	void (*lm_breaker_timedout)(struct file_lease *fl);
 };
 
 struct lock_manager {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-15 19:16 [Patch v4 0/3] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
@ 2025-11-15 19:16 ` Dai Ngo
  2025-11-17 15:47   ` Chuck Lever
  2025-11-17 18:21   ` Jeff Layton
  2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
  2 siblings, 2 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-15 19:16 UTC (permalink / raw)
  To: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

If multiple threads are waiting for a layout conflict on the same
file in __break_lease, these threads must wait until one of the
waiting threads completes the fencing operation before proceeding.
This ensures that I/O operations from these threads can only occurs
after the client was fenced.

Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/locks.c               | 24 ++++++++++++++++++++++++
 include/linux/filelock.h |  5 +++++
 2 files changed, 29 insertions(+)

diff --git a/fs/locks.c b/fs/locks.c
index 1f254e0cd398..b6fd6aa2498c 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
 	INIT_LIST_HEAD(&ctx->flc_flock);
 	INIT_LIST_HEAD(&ctx->flc_posix);
 	INIT_LIST_HEAD(&ctx->flc_lease);
+	init_waitqueue_head(&ctx->flc_dispose_wait);
 
 	/*
 	 * Assign the pointer if it's not already assigned. If it is, then
@@ -1609,6 +1610,10 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 		error = -EWOULDBLOCK;
 		goto out;
 	}
+	if (type == FL_LAYOUT && !ctx->flc_conflict) {
+		ctx->flc_conflict = true;
+		ctx->flc_wait_for_dispose = false;
+	}
 
 restart:
 	fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
@@ -1640,12 +1645,31 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 			time_out_leases(inode, &dispose);
 		if (any_leases_conflict(inode, new_fl))
 			goto restart;
+		if (type == FL_LAYOUT && ctx->flc_wait_for_dispose) {
+			/*
+			 * wait for flc_wait_for_dispose to ensure
+			 * the offending client has been fenced.
+			 */
+			spin_unlock(&ctx->flc_lock);
+			wait_event_interruptible(ctx->flc_dispose_wait,
+				ctx->flc_wait_for_dispose == false);
+			spin_lock(&ctx->flc_lock);
+		}
 		error = 0;
+		if (type == FL_LAYOUT)
+			ctx->flc_wait_for_dispose = true;
 	}
 out:
 	spin_unlock(&ctx->flc_lock);
 	percpu_up_read(&file_rwsem);
 	locks_dispose_list(&dispose);
+	if (type == FL_LAYOUT) {
+		spin_lock(&ctx->flc_lock);
+		ctx->flc_wait_for_dispose = false;
+		ctx->flc_conflict = false;
+		wake_up(&ctx->flc_dispose_wait);
+		spin_unlock(&ctx->flc_lock);
+	}
 free_lock:
 	locks_free_lease(new_fl);
 	return error;
diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index 06ccd6b66012..5c5353aabbc8 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -146,6 +146,11 @@ struct file_lock_context {
 	struct list_head	flc_flock;
 	struct list_head	flc_posix;
 	struct list_head	flc_lease;
+
+	/* for FL_LAYOUT */
+	bool			flc_conflict;
+	bool			flc_wait_for_dispose;
+	wait_queue_head_t	flc_dispose_wait;
 };
 
 #ifdef CONFIG_FILE_LOCKING
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 19:16 [Patch v4 0/3] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
  2025-11-15 19:16 ` [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced Dai Ngo
@ 2025-11-15 19:16 ` Dai Ngo
  2025-11-15 19:44   ` Chuck Lever
                     ` (2 more replies)
  2 siblings, 3 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-15 19:16 UTC (permalink / raw)
  To: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

When a layout conflict triggers a call to __break_lease, the function
nfsd4_layout_lm_break clears the fl_break_time timeout before sending
the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
its loop, waiting indefinitely for the conflicting file lease to be
released.

If the number of lease conflicts matches the number of NFSD threads
(which defaults to 8), all available NFSD threads become occupied.
Consequently, there are no threads left to handle incoming requests
or callback replies, leading to a total hang of the NFSD server.

This issue is reliably reproducible by running the Git test suite
on a configuration using the SCSI layout.

This patch addresses the problem by using the break lease timeout
and ensures that the unresponsive client is fenced, preventing it
from accessing the data server directly.

Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/nfsd/nfs4layouts.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 683bd1130afe..6321fc187825 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -747,11 +747,10 @@ static bool
 nfsd4_layout_lm_break(struct file_lease *fl)
 {
 	/*
-	 * We don't want the locks code to timeout the lease for us;
-	 * we'll remove it ourself if a layout isn't returned
-	 * in time:
+	 * Enforce break lease timeout to prevent starvation of
+	 * NFSD threads in __break_lease that causes server to
+	 * hang.
 	 */
-	fl->fl_break_time = 0;
 	nfsd4_recall_file_layout(fl->c.flc_owner);
 	return false;
 }
@@ -764,9 +763,28 @@ nfsd4_layout_lm_change(struct file_lease *onlist, int arg,
 	return lease_modify(onlist, arg, dispose);
 }
 
+static void
+nfsd_layout_breaker_timedout(struct file_lease *fl)
+{
+	struct nfs4_layout_stateid *ls = fl->c.flc_owner;
+	struct nfsd_file *nf;
+
+	rcu_read_lock();
+	nf = nfsd_file_get(ls->ls_file);
+	rcu_read_unlock();
+	if (nf) {
+		u32 type = ls->ls_layout_type;
+
+		if (nfsd4_layout_ops[type]->fence_client)
+			nfsd4_layout_ops[type]->fence_client(ls, nf);
+		nfsd_file_put(nf);
+	}
+}
+
 static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
 	.lm_break	= nfsd4_layout_lm_break,
 	.lm_change	= nfsd4_layout_lm_change,
+	.lm_breaker_timedout	= nfsd_layout_breaker_timedout,
 };
 
 int
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
@ 2025-11-15 19:44   ` Chuck Lever
  2025-11-15 20:20     ` Dai Ngo
  2025-11-17 15:55   ` Chuck Lever
  2025-11-19 10:08   ` Christoph Hellwig
  2 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-15 19:44 UTC (permalink / raw)
  To: Dai Ngo, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On 11/15/25 2:16 PM, Dai Ngo wrote:
> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
> 
> If the number of lease conflicts matches the number of NFSD threads
> (which defaults to 8),

It's 16 now on newer distributions.


> all available NFSD threads become occupied.
> Consequently, there are no threads left to handle incoming requests
> or callback replies, leading to a total hang of the NFSD server.

This is more of a muse than an actionable review comment, but what if
NFSD recognized that there was already a waiter for the conflicted
layout and, instead of waiting again, returned NFS4ERR_DELAY for
additional waiters?

That doesn't eliminate the deadlock completely, but gives us a little
breathing room, at least.


> This issue is reliably reproducible by running the Git test suite
> on a configuration using the SCSI layout.

The git regression test is a single client test. I have to wonder why
the layout needs to be recalled and why the client is not responsive.
Can you elaborate on what resources are deadlocking?


> This patch addresses the problem by using the break lease timeout
> and ensures that the unresponsive client is fenced, preventing it
> from accessing the data server directly.
> 
> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4layouts.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 683bd1130afe..6321fc187825 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -747,11 +747,10 @@ static bool
>  nfsd4_layout_lm_break(struct file_lease *fl)
>  {
>  	/*
> -	 * We don't want the locks code to timeout the lease for us;
> -	 * we'll remove it ourself if a layout isn't returned
> -	 * in time:
> +	 * Enforce break lease timeout to prevent starvation of
> +	 * NFSD threads in __break_lease that causes server to
> +	 * hang.
>  	 */
> -	fl->fl_break_time = 0;
>  	nfsd4_recall_file_layout(fl->c.flc_owner);
>  	return false;
>  }
> @@ -764,9 +763,28 @@ nfsd4_layout_lm_change(struct file_lease *onlist, int arg,
>  	return lease_modify(onlist, arg, dispose);
>  }
>  
> +static void
> +nfsd_layout_breaker_timedout(struct file_lease *fl)
> +{
> +	struct nfs4_layout_stateid *ls = fl->c.flc_owner;
> +	struct nfsd_file *nf;
> +
> +	rcu_read_lock();
> +	nf = nfsd_file_get(ls->ls_file);
> +	rcu_read_unlock();
> +	if (nf) {
> +		u32 type = ls->ls_layout_type;
> +
> +		if (nfsd4_layout_ops[type]->fence_client)
> +			nfsd4_layout_ops[type]->fence_client(ls, nf);
> +		nfsd_file_put(nf);
> +	}
> +}
> +
>  static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
>  	.lm_break	= nfsd4_layout_lm_break,
>  	.lm_change	= nfsd4_layout_lm_change,
> +	.lm_breaker_timedout	= nfsd_layout_breaker_timedout,
>  };
>  
>  int


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 19:44   ` Chuck Lever
@ 2025-11-15 20:20     ` Dai Ngo
  2025-11-19  9:56       ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Dai Ngo @ 2025-11-15 20:20 UTC (permalink / raw)
  To: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On 11/15/25 11:44 AM, Chuck Lever wrote:
> On 11/15/25 2:16 PM, Dai Ngo wrote:
>> When a layout conflict triggers a call to __break_lease, the function
>> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
>> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
>> its loop, waiting indefinitely for the conflicting file lease to be
>> released.
>>
>> If the number of lease conflicts matches the number of NFSD threads
>> (which defaults to 8),
> It's 16 now on newer distributions.

ok, I'll fix the text. Even with 16, it's still a problem when there
is heavier load. Prefer solution here is dynamic NFSD threads, but
even that it is not the absolute solution for this problem.

>
>
>> all available NFSD threads become occupied.
>> Consequently, there are no threads left to handle incoming requests
>> or callback replies, leading to a total hang of the NFSD server.
> This is more of a muse than an actionable review comment, but what if
> NFSD recognized that there was already a waiter for the conflicted
> layout and, instead of waiting again, returned NFS4ERR_DELAY for
> additional waiters?
>
> That doesn't eliminate the deadlock completely, but gives us a little
> breathing room, at least.

break_lease is called from many places - the NFS client and server,
CIFS and VFS. Many of these callers do not handle error returned
from break_lease, some don't even check return value from break_lease.

Until we fix all callers of break_lease to handle error return, which
I think it's much more involved, returning error from break_lease is
not possible.

>
>
>> This issue is reliably reproducible by running the Git test suite
>> on a configuration using the SCSI layout.
> The git regression test is a single client test. I have to wonder why
> the layout needs to be recalled and why the client is not responsive.
> Can you elaborate on what resources are deadlocking?

The conflict occurs from these call stacks:

__break_lease+0x333/0xb90
xfs_break_leased_layouts+0xb6/0x2f0 [xfs]
xfs_break_layouts+0x196/0x200 [xfs]
xfs_vn_setattr+0xe6/0x250 [xfs]
notify_change+0x84e/0xf10
__nfsd_setattr+0xef/0x1d0 [nfsd]
nfsd_setattr+0x504/0x10b0 [nfsd]
nfsd4_setattr+0x660/0xc70 [nfsd]
nfsd4_proc_compound+0xc58/0x2360 [nfsd]
nfsd_dispatch+0x24f/0x6b0 [nfsd]
svc_process_common+0x101c/0x1aa0 [sunrpc]
svc_process+0x54a/0x990 [sunrpc]
svc_handle_xprt+0xd99/0x1430 [sunrpc]
svc_recv+0x1ff/0x4f0 [sunrpc]
nfsd+0x24c/0x370 [nfsd]
kthread+0x391/0x750

__break_lease+0x333/0xb90
xfs_break_leased_layouts+0xb6/0x2f0 [xfs]
xfs_break_layouts+0x196/0x200 [xfs]
xfs_file_write_checks+0x2fe/0x850 [xfs]
xfs_file_buffered_write+0x12f/0x8a0 [xfs]
vfs_iocb_iter_write+0x259/0x6e0
nfsd_vfs_write+0x6c1/0x1780 [nfsd]
nfsd4_write+0x2c1/0x610 [nfsd]
nfsd4_proc_compound+0xc58/0x2360 [nfsd]
nfsd_dispatch+0x24f/0x6b0 [nfsd]
svc_process_common+0x101c/0x1aa0 [sunrpc]
svc_process+0x54a/0x990 [sunrpc]
svc_handle_xprt+0xd99/0x1430 [sunrpc]
svc_recv+0x1ff/0x4f0 [sunrpc]
nfsd+0x24c/0x370 [nfsd]
kthread+0x391/0x750
ret_from_fork+0x25d/0x340
ret_from_fork_asm+0x1a/0x30

I have plan to post a separate patch to check for layout conflict
caused from the same client and skip the recall - same as what's
currently done for delegation conflict.

-Dai


>
>
>> This patch addresses the problem by using the break lease timeout
>> and ensures that the unresponsive client is fenced, preventing it
>> from accessing the data server directly.
>>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfsd/nfs4layouts.c | 26 ++++++++++++++++++++++----
>>   1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
>> index 683bd1130afe..6321fc187825 100644
>> --- a/fs/nfsd/nfs4layouts.c
>> +++ b/fs/nfsd/nfs4layouts.c
>> @@ -747,11 +747,10 @@ static bool
>>   nfsd4_layout_lm_break(struct file_lease *fl)
>>   {
>>   	/*
>> -	 * We don't want the locks code to timeout the lease for us;
>> -	 * we'll remove it ourself if a layout isn't returned
>> -	 * in time:
>> +	 * Enforce break lease timeout to prevent starvation of
>> +	 * NFSD threads in __break_lease that causes server to
>> +	 * hang.
>>   	 */
>> -	fl->fl_break_time = 0;
>>   	nfsd4_recall_file_layout(fl->c.flc_owner);
>>   	return false;
>>   }
>> @@ -764,9 +763,28 @@ nfsd4_layout_lm_change(struct file_lease *onlist, int arg,
>>   	return lease_modify(onlist, arg, dispose);
>>   }
>>   
>> +static void
>> +nfsd_layout_breaker_timedout(struct file_lease *fl)
>> +{
>> +	struct nfs4_layout_stateid *ls = fl->c.flc_owner;
>> +	struct nfsd_file *nf;
>> +
>> +	rcu_read_lock();
>> +	nf = nfsd_file_get(ls->ls_file);
>> +	rcu_read_unlock();
>> +	if (nf) {
>> +		u32 type = ls->ls_layout_type;
>> +
>> +		if (nfsd4_layout_ops[type]->fence_client)
>> +			nfsd4_layout_ops[type]->fence_client(ls, nf);
>> +		nfsd_file_put(nf);
>> +	}
>> +}
>> +
>>   static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
>>   	.lm_break	= nfsd4_layout_lm_break,
>>   	.lm_change	= nfsd4_layout_lm_change,
>> +	.lm_breaker_timedout	= nfsd_layout_breaker_timedout,
>>   };
>>   
>>   int
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
@ 2025-11-17 15:39   ` Chuck Lever
  2025-11-17 19:17     ` Dai Ngo
  2025-11-17 18:02   ` Jeff Layton
  2025-11-19  9:54   ` Christoph Hellwig
  2 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-17 15:39 UTC (permalink / raw)
  To: Dai Ngo, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On 11/15/25 2:16 PM, Dai Ngo wrote:
> diff --git a/fs/locks.c b/fs/locks.c
> index 04a3f0e20724..1f254e0cd398 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
>  	while (!list_empty(dispose)) {
>  		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
>  		list_del_init(&flc->flc_list);
> -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
> +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {

Nit: scripts/checkpatch.pl wants to see spaces around the pipe
characters.


> +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
> +				struct file_lease *fl = file_lease(flc);
> +
> +				if (fl->fl_lmops->lm_breaker_timedout)


Let's make this:

			if (fl->fl_lmops && fl->fl_lmops->lm_breaker_timedout)


> +					fl->fl_lmops->lm_breaker_timedout(fl);
> +			}
>  			locks_free_lease(file_lease(flc));
> -		else
> +		} else
>  			locks_free_lock(file_lock(flc));
>  	}
>  }


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-15 19:16 ` [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced Dai Ngo
@ 2025-11-17 15:47   ` Chuck Lever
  2025-11-17 19:21     ` Dai Ngo
  2025-11-17 18:21   ` Jeff Layton
  1 sibling, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-17 15:47 UTC (permalink / raw)
  To: Dai Ngo, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On 11/15/25 2:16 PM, Dai Ngo wrote:
> If multiple threads are waiting for a layout conflict on the same
> file in __break_lease, these threads must wait until one of the
> waiting threads completes the fencing operation before proceeding.
> This ensures that I/O operations from these threads can only occurs
> after the client was fenced.
> 
> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/locks.c               | 24 ++++++++++++++++++++++++
>  include/linux/filelock.h |  5 +++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 1f254e0cd398..b6fd6aa2498c 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
>  	INIT_LIST_HEAD(&ctx->flc_flock);
>  	INIT_LIST_HEAD(&ctx->flc_posix);
>  	INIT_LIST_HEAD(&ctx->flc_lease);
> +	init_waitqueue_head(&ctx->flc_dispose_wait);
>  
>  	/*
>  	 * Assign the pointer if it's not already assigned. If it is, then
> @@ -1609,6 +1610,10 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  		error = -EWOULDBLOCK;
>  		goto out;
>  	}
> +	if (type == FL_LAYOUT && !ctx->flc_conflict) {

The file_lock_context structure is allocated via kmem_cache_alloc() from
flctx_cache, and it has a NULL constructor. This means newly allocated
structures contain garbage/stale data.

How are you certain that ctx->flc_conflict has been initialized (ie,
does not contain a garbage/unknown value) ?


> +		ctx->flc_conflict = true;
> +		ctx->flc_wait_for_dispose = false;
> +	}
>  
>  restart:
>  	fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
> @@ -1640,12 +1645,31 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  			time_out_leases(inode, &dispose);
>  		if (any_leases_conflict(inode, new_fl))
>  			goto restart;
> +		if (type == FL_LAYOUT && ctx->flc_wait_for_dispose) {
> +			/*
> +			 * wait for flc_wait_for_dispose to ensure
> +			 * the offending client has been fenced.
> +			 */
> +			spin_unlock(&ctx->flc_lock);
> +			wait_event_interruptible(ctx->flc_dispose_wait,
> +				ctx->flc_wait_for_dispose == false);

Nit: scripts/checkpatch.pl might prefer "!ctx->flc_wait_for_dispose"
instead. Also, it wants the continuation line to align with the open
parenthesis. Ie:

	wait_event_interruptible(ctx->flc_dispose_wait,
				 !ctx->flc_wait_for_dispose);

Notice that if ctx->flc_conflict has a garbage true value above, that
leaves flc_wait_for_dispose unintialized as well, and this wait could
become indefinite.


> +			spin_lock(&ctx->flc_lock);
> +		}
>  		error = 0;
> +		if (type == FL_LAYOUT)
> +			ctx->flc_wait_for_dispose = true;
>  	}
>  out:
>  	spin_unlock(&ctx->flc_lock);
>  	percpu_up_read(&file_rwsem);
>  	locks_dispose_list(&dispose);
> +	if (type == FL_LAYOUT) {
> +		spin_lock(&ctx->flc_lock);
> +		ctx->flc_wait_for_dispose = false;
> +		ctx->flc_conflict = false;
> +		wake_up(&ctx->flc_dispose_wait);
> +		spin_unlock(&ctx->flc_lock);
> +	}
>  free_lock:
>  	locks_free_lease(new_fl);
>  	return error;
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 06ccd6b66012..5c5353aabbc8 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -146,6 +146,11 @@ struct file_lock_context {
>  	struct list_head	flc_flock;
>  	struct list_head	flc_posix;
>  	struct list_head	flc_lease;
> +
> +	/* for FL_LAYOUT */
> +	bool			flc_conflict;
> +	bool			flc_wait_for_dispose;
> +	wait_queue_head_t	flc_dispose_wait;
>  };
>  
>  #ifdef CONFIG_FILE_LOCKING


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
  2025-11-15 19:44   ` Chuck Lever
@ 2025-11-17 15:55   ` Chuck Lever
  2025-11-17 19:40     ` Dai Ngo
  2025-11-19  9:57     ` Christoph Hellwig
  2025-11-19 10:08   ` Christoph Hellwig
  2 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-17 15:55 UTC (permalink / raw)
  To: Dai Ngo, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On 11/15/25 2:16 PM, Dai Ngo wrote:
> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
> 
> If the number of lease conflicts matches the number of NFSD threads
> (which defaults to 8), all available NFSD threads become occupied.
> Consequently, there are no threads left to handle incoming requests
> or callback replies, leading to a total hang of the NFSD server.
> 
> This issue is reliably reproducible by running the Git test suite
> on a configuration using the SCSI layout.
> 
> This patch addresses the problem by using the break lease timeout
> and ensures that the unresponsive client is fenced, preventing it
> from accessing the data server directly.

This text is a bit misleading. The client is responsive; it's the server
that has no threads to handle the client's LAYOUTRETURN.


> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4layouts.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 683bd1130afe..6321fc187825 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -747,11 +747,10 @@ static bool
>  nfsd4_layout_lm_break(struct file_lease *fl)
>  {
>  	/*
> -	 * We don't want the locks code to timeout the lease for us;
> -	 * we'll remove it ourself if a layout isn't returned
> -	 * in time:
> +	 * Enforce break lease timeout to prevent starvation of
> +	 * NFSD threads in __break_lease that causes server to
> +	 * hang.
>  	 */
> -	fl->fl_break_time = 0;
>  	nfsd4_recall_file_layout(fl->c.flc_owner);
>  	return false;
>  }
> @@ -764,9 +763,28 @@ nfsd4_layout_lm_change(struct file_lease *onlist, int arg,
>  	return lease_modify(onlist, arg, dispose);
>  }
>  
> +static void
> +nfsd_layout_breaker_timedout(struct file_lease *fl)
> +{
> +	struct nfs4_layout_stateid *ls = fl->c.flc_owner;
> +	struct nfsd_file *nf;
> +
> +	rcu_read_lock();
> +	nf = nfsd_file_get(ls->ls_file);
> +	rcu_read_unlock();
> +	if (nf) {
> +		u32 type = ls->ls_layout_type;
> +
> +		if (nfsd4_layout_ops[type]->fence_client)
> +			nfsd4_layout_ops[type]->fence_client(ls, nf);

If a .fence_client callback is optional for a layout to provide,
timeouts for such layout types won't trigger any fencing action. I'm not
certain yet that's good behavior.


> +		nfsd_file_put(nf);
> +	}
> +}
> +
>  static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
>  	.lm_break	= nfsd4_layout_lm_break,
>  	.lm_change	= nfsd4_layout_lm_change,
> +	.lm_breaker_timedout	= nfsd_layout_breaker_timedout,
>  };
>  
>  int

Since this patch appears to be the first (and only) caller for
.lm_breaker_timedout, consider squashing patches 1/3 and 3/3
together.

I'm still not entirely convinced of this approach. It's subtle, and
therefore will tend to be brittle and hard to maintain. Perhaps it
just needs better internal documentation.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
  2025-11-17 15:39   ` Chuck Lever
@ 2025-11-17 18:02   ` Jeff Layton
  2025-11-17 19:41     ` Dai Ngo
  2025-11-19  9:54   ` Christoph Hellwig
  2 siblings, 1 reply; 37+ messages in thread
From: Jeff Layton @ 2025-11-17 18:02 UTC (permalink / raw)
  To: Dai Ngo, chuck.lever, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
> Some consumers of the lease_manager_operations structure need
> to perform additional actions when a lease break, triggered by
> a conflict, times out.
> 
> The NFS server is the first consumer of this operation.
> 
> When a pNFS layout conflict occurs and the lease break times
> out — resulting in the layout being revoked and its file lease
> removed from the flc_lease list — the NFS server must issue a
> fence operation. This operation ensures that the client is
> prevented from accessing the data server after the layout
> revocation.
> 
> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  Documentation/filesystems/locking.rst |  2 ++
>  fs/locks.c                            | 14 +++++++++++---
>  include/linux/filelock.h              |  2 ++
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 77704fde9845..cd600db6c4b9 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -403,6 +403,7 @@ prototypes::
>  	bool (*lm_breaker_owns_lease)(struct file_lock *);
>          bool (*lm_lock_expirable)(struct file_lock *);
>          void (*lm_expire_lock)(void);
> +        void (*lm_breaker_timedout)(struct file_lease *);
>  
>  locking rules:
>  
> @@ -416,6 +417,7 @@ lm_change		yes		no			no
>  lm_breaker_owns_lease:	yes     	no			no
>  lm_lock_expirable	yes		no			no
>  lm_expire_lock		no		no			yes
> +lm_breaker_timedout     no              no                      yes
>  ======================	=============	=================	=========
>  
>  buffer_head
> diff --git a/fs/locks.c b/fs/locks.c
> index 04a3f0e20724..1f254e0cd398 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
>  	while (!list_empty(dispose)) {
>  		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
>  		list_del_init(&flc->flc_list);
> -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
> +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
> +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
> +				struct file_lease *fl = file_lease(flc);
> +
> +				if (fl->fl_lmops->lm_breaker_timedout)
> +					fl->fl_lmops->lm_breaker_timedout(fl);
> +			}

locks_dispose_list() is a common function for locks and leases, and
this is only going to be relevant from __break_lease().

Can you move this handling into a separate function that is called
before the relevant locks_dispose_list() call in __break_lease()?

>  			locks_free_lease(file_lease(flc));
> -		else
> +		} else
>  			locks_free_lock(file_lock(flc));
>  	}
>  }
> @@ -1482,8 +1488,10 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>  		trace_time_out_leases(inode, fl);
>  		if (past_time(fl->fl_downgrade_time))
>  			lease_modify(fl, F_RDLCK, dispose);
> -		if (past_time(fl->fl_break_time))
> +		if (past_time(fl->fl_break_time)) {
>  			lease_modify(fl, F_UNLCK, dispose);
> +			fl->c.flc_flags |= FL_BREAKER_TIMEDOUT;
> +		}
>  	}
>  }
>  
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index c2ce8ba05d06..06ccd6b66012 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -17,6 +17,7 @@
>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>  #define FL_LAYOUT	2048	/* outstanding pNFS layout */
>  #define FL_RECLAIM	4096	/* reclaiming from a reboot server */
> +#define	FL_BREAKER_TIMEDOUT	8192	/* lease breaker timed out */
>  
>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>  
> @@ -49,6 +50,7 @@ struct lease_manager_operations {
>  	int (*lm_change)(struct file_lease *, int, struct list_head *);
>  	void (*lm_setup)(struct file_lease *, void **);
>  	bool (*lm_breaker_owns_lease)(struct file_lease *);
> +	void (*lm_breaker_timedout)(struct file_lease *fl);
>  };
>  
>  struct lock_manager {

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-15 19:16 ` [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced Dai Ngo
  2025-11-17 15:47   ` Chuck Lever
@ 2025-11-17 18:21   ` Jeff Layton
  2025-11-17 19:49     ` Dai Ngo
  2025-11-19  9:53     ` Christoph Hellwig
  1 sibling, 2 replies; 37+ messages in thread
From: Jeff Layton @ 2025-11-17 18:21 UTC (permalink / raw)
  To: Dai Ngo, chuck.lever, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
> If multiple threads are waiting for a layout conflict on the same
> file in __break_lease, these threads must wait until one of the
> waiting threads completes the fencing operation before proceeding.
> This ensures that I/O operations from these threads can only occurs
> after the client was fenced.
> 
> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/locks.c               | 24 ++++++++++++++++++++++++
>  include/linux/filelock.h |  5 +++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 1f254e0cd398..b6fd6aa2498c 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
>  	INIT_LIST_HEAD(&ctx->flc_flock);
>  	INIT_LIST_HEAD(&ctx->flc_posix);
>  	INIT_LIST_HEAD(&ctx->flc_lease);
> +	init_waitqueue_head(&ctx->flc_dispose_wait);
>  
>  	/*
>  	 * Assign the pointer if it's not already assigned. If it is, then
> @@ -1609,6 +1610,10 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  		error = -EWOULDBLOCK;
>  		goto out;
>  	}
> +	if (type == FL_LAYOUT && !ctx->flc_conflict) {
> +		ctx->flc_conflict = true;
> +		ctx->flc_wait_for_dispose = false;
> +	}

I don't like special casing this for FL_LAYOUT leases. It seems like we
ought to be able to set up a lm_breaker_timedout operation on any sort
of lease.

>  
>  restart:
>  	fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
> @@ -1640,12 +1645,31 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  			time_out_leases(inode, &dispose);
>  		if (any_leases_conflict(inode, new_fl))
>  			goto restart;
> +		if (type == FL_LAYOUT && ctx->flc_wait_for_dispose) {
> +			/*
> +			 * wait for flc_wait_for_dispose to ensure
> +			 * the offending client has been fenced.
> +			 */
> +			spin_unlock(&ctx->flc_lock);
> +			wait_event_interruptible(ctx->flc_dispose_wait,
> +				ctx->flc_wait_for_dispose == false);
> +			spin_lock(&ctx->flc_lock);
> +		}
>  		error = 0;
> +		if (type == FL_LAYOUT)
> +			ctx->flc_wait_for_dispose = true;
>  	}
>  out:
>  	spin_unlock(&ctx->flc_lock);
>  	percpu_up_read(&file_rwsem);
>  	locks_dispose_list(&dispose);
> +	if (type == FL_LAYOUT) {
> +		spin_lock(&ctx->flc_lock);
> +		ctx->flc_wait_for_dispose = false;
> +		ctx->flc_conflict = false;
> +		wake_up(&ctx->flc_dispose_wait);
> +		spin_unlock(&ctx->flc_lock);
> +	}
>  free_lock:
>  	locks_free_lease(new_fl);
>  	return error;
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 06ccd6b66012..5c5353aabbc8 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -146,6 +146,11 @@ struct file_lock_context {
>  	struct list_head	flc_flock;
>  	struct list_head	flc_posix;
>  	struct list_head	flc_lease;
> +
> +	/* for FL_LAYOUT */
> +	bool			flc_conflict;
> +	bool			flc_wait_for_dispose;

I'm also not a fan of this particular bool. Waiting for any
lm_breaker_timeout operations to complete seems like something we ought
to just always do. In the trivial case where we have no special fencing
to do, that should just return quickly anyway.

> +	wait_queue_head_t	flc_dispose_wait;
>  };
>  
>  #ifdef CONFIG_FILE_LOCKING

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-17 15:39   ` Chuck Lever
@ 2025-11-17 19:17     ` Dai Ngo
  0 siblings, 0 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 19:17 UTC (permalink / raw)
  To: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/17/25 7:39 AM, Chuck Lever wrote:
> On 11/15/25 2:16 PM, Dai Ngo wrote:
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 04a3f0e20724..1f254e0cd398 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
>>   	while (!list_empty(dispose)) {
>>   		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
>>   		list_del_init(&flc->flc_list);
>> -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
>> +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
> Nit: scripts/checkpatch.pl wants to see spaces around the pipe
> characters.

will fix in v5.

>
>
>> +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
>> +				struct file_lease *fl = file_lease(flc);
>> +
>> +				if (fl->fl_lmops->lm_breaker_timedout)
>
> Let's make this:
>
> 			if (fl->fl_lmops && fl->fl_lmops->lm_breaker_timedout)

will fix in v5.

>
>
>> +					fl->fl_lmops->lm_breaker_timedout(fl);
>> +			}
>>   			locks_free_lease(file_lease(flc));
>> -		else
>> +		} else
>>   			locks_free_lock(file_lock(flc));
>>   	}
>>   }
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-17 15:47   ` Chuck Lever
@ 2025-11-17 19:21     ` Dai Ngo
  0 siblings, 0 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 19:21 UTC (permalink / raw)
  To: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/17/25 7:47 AM, Chuck Lever wrote:
> On 11/15/25 2:16 PM, Dai Ngo wrote:
>> If multiple threads are waiting for a layout conflict on the same
>> file in __break_lease, these threads must wait until one of the
>> waiting threads completes the fencing operation before proceeding.
>> This ensures that I/O operations from these threads can only occurs
>> after the client was fenced.
>>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/locks.c               | 24 ++++++++++++++++++++++++
>>   include/linux/filelock.h |  5 +++++
>>   2 files changed, 29 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 1f254e0cd398..b6fd6aa2498c 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
>>   	INIT_LIST_HEAD(&ctx->flc_flock);
>>   	INIT_LIST_HEAD(&ctx->flc_posix);
>>   	INIT_LIST_HEAD(&ctx->flc_lease);
>> +	init_waitqueue_head(&ctx->flc_dispose_wait);
>>   
>>   	/*
>>   	 * Assign the pointer if it's not already assigned. If it is, then
>> @@ -1609,6 +1610,10 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>   		error = -EWOULDBLOCK;
>>   		goto out;
>>   	}
>> +	if (type == FL_LAYOUT && !ctx->flc_conflict) {
> The file_lock_context structure is allocated via kmem_cache_alloc() from
> flctx_cache, and it has a NULL constructor. This means newly allocated
> structures contain garbage/stale data.
>
> How are you certain that ctx->flc_conflict has been initialized (ie,
> does not contain a garbage/unknown value) ?

will fix in v5.

>
>
>> +		ctx->flc_conflict = true;
>> +		ctx->flc_wait_for_dispose = false;
>> +	}
>>   
>>   restart:
>>   	fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
>> @@ -1640,12 +1645,31 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>   			time_out_leases(inode, &dispose);
>>   		if (any_leases_conflict(inode, new_fl))
>>   			goto restart;
>> +		if (type == FL_LAYOUT && ctx->flc_wait_for_dispose) {
>> +			/*
>> +			 * wait for flc_wait_for_dispose to ensure
>> +			 * the offending client has been fenced.
>> +			 */
>> +			spin_unlock(&ctx->flc_lock);
>> +			wait_event_interruptible(ctx->flc_dispose_wait,
>> +				ctx->flc_wait_for_dispose == false);
> Nit: scripts/checkpatch.pl might prefer "!ctx->flc_wait_for_dispose"
> instead. Also, it wants the continuation line to align with the open
> parenthesis. Ie:
>
> 	wait_event_interruptible(ctx->flc_dispose_wait,
> 				 !ctx->flc_wait_for_dispose);

will fix in v5.

>
> Notice that if ctx->flc_conflict has a garbage true value above, that
> leaves flc_wait_for_dispose unintialized as well, and this wait could
> become indefinite.

will fix in v5.

-Dai

>
>
>> +			spin_lock(&ctx->flc_lock);
>> +		}
>>   		error = 0;
>> +		if (type == FL_LAYOUT)
>> +			ctx->flc_wait_for_dispose = true;
>>   	}
>>   out:
>>   	spin_unlock(&ctx->flc_lock);
>>   	percpu_up_read(&file_rwsem);
>>   	locks_dispose_list(&dispose);
>> +	if (type == FL_LAYOUT) {
>> +		spin_lock(&ctx->flc_lock);
>> +		ctx->flc_wait_for_dispose = false;
>> +		ctx->flc_conflict = false;
>> +		wake_up(&ctx->flc_dispose_wait);
>> +		spin_unlock(&ctx->flc_lock);
>> +	}
>>   free_lock:
>>   	locks_free_lease(new_fl);
>>   	return error;
>> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
>> index 06ccd6b66012..5c5353aabbc8 100644
>> --- a/include/linux/filelock.h
>> +++ b/include/linux/filelock.h
>> @@ -146,6 +146,11 @@ struct file_lock_context {
>>   	struct list_head	flc_flock;
>>   	struct list_head	flc_posix;
>>   	struct list_head	flc_lease;
>> +
>> +	/* for FL_LAYOUT */
>> +	bool			flc_conflict;
>> +	bool			flc_wait_for_dispose;
>> +	wait_queue_head_t	flc_dispose_wait;
>>   };
>>   
>>   #ifdef CONFIG_FILE_LOCKING
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 15:55   ` Chuck Lever
@ 2025-11-17 19:40     ` Dai Ngo
  2025-11-17 21:13       ` Benjamin Coddington
  2025-11-19 10:05       ` Christoph Hellwig
  2025-11-19  9:57     ` Christoph Hellwig
  1 sibling, 2 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 19:40 UTC (permalink / raw)
  To: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/17/25 7:55 AM, Chuck Lever wrote:
> On 11/15/25 2:16 PM, Dai Ngo wrote:
>> When a layout conflict triggers a call to __break_lease, the function
>> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
>> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
>> its loop, waiting indefinitely for the conflicting file lease to be
>> released.
>>
>> If the number of lease conflicts matches the number of NFSD threads
>> (which defaults to 8), all available NFSD threads become occupied.
>> Consequently, there are no threads left to handle incoming requests
>> or callback replies, leading to a total hang of the NFSD server.
>>
>> This issue is reliably reproducible by running the Git test suite
>> on a configuration using the SCSI layout.
>>
>> This patch addresses the problem by using the break lease timeout
>> and ensures that the unresponsive client is fenced, preventing it
>> from accessing the data server directly.
> This text is a bit misleading. The client is responsive; it's the server
> that has no threads to handle the client's LAYOUTRETURN.

I will reword to point out the problem is there is no server thread
to service the reply from client in v5.

>
>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfsd/nfs4layouts.c | 26 ++++++++++++++++++++++----
>>   1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
>> index 683bd1130afe..6321fc187825 100644
>> --- a/fs/nfsd/nfs4layouts.c
>> +++ b/fs/nfsd/nfs4layouts.c
>> @@ -747,11 +747,10 @@ static bool
>>   nfsd4_layout_lm_break(struct file_lease *fl)
>>   {
>>   	/*
>> -	 * We don't want the locks code to timeout the lease for us;
>> -	 * we'll remove it ourself if a layout isn't returned
>> -	 * in time:
>> +	 * Enforce break lease timeout to prevent starvation of
>> +	 * NFSD threads in __break_lease that causes server to
>> +	 * hang.
>>   	 */
>> -	fl->fl_break_time = 0;
>>   	nfsd4_recall_file_layout(fl->c.flc_owner);
>>   	return false;
>>   }
>> @@ -764,9 +763,28 @@ nfsd4_layout_lm_change(struct file_lease *onlist, int arg,
>>   	return lease_modify(onlist, arg, dispose);
>>   }
>>   
>> +static void
>> +nfsd_layout_breaker_timedout(struct file_lease *fl)
>> +{
>> +	struct nfs4_layout_stateid *ls = fl->c.flc_owner;
>> +	struct nfsd_file *nf;
>> +
>> +	rcu_read_lock();
>> +	nf = nfsd_file_get(ls->ls_file);
>> +	rcu_read_unlock();
>> +	if (nf) {
>> +		u32 type = ls->ls_layout_type;
>> +
>> +		if (nfsd4_layout_ops[type]->fence_client)
>> +			nfsd4_layout_ops[type]->fence_client(ls, nf);
> If a .fence_client callback is optional for a layout to provide,
> timeouts for such layout types won't trigger any fencing action. I'm not
> certain yet that's good behavior.

Some layout implementation is in experimental state such as block
layout and should not be used in production environment. I don't
know what should we do for that case. Does adding a trace point to
warn the user sufficient?

>
>
>> +		nfsd_file_put(nf);
>> +	}
>> +}
>> +
>>   static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
>>   	.lm_break	= nfsd4_layout_lm_break,
>>   	.lm_change	= nfsd4_layout_lm_change,
>> +	.lm_breaker_timedout	= nfsd_layout_breaker_timedout,
>>   };
>>   
>>   int
> Since this patch appears to be the first (and only) caller for
> .lm_breaker_timedout, consider squashing patches 1/3 and 3/3
> together.

will fix in v5.

>
> I'm still not entirely convinced of this approach. It's subtle, and
> therefore will tend to be brittle and hard to maintain. Perhaps it
> just needs better internal documentation.

I will add comments to explain the changes better. My main goal for these
patches is to plug the hole where a malicious client can cause the server
to hang in a configuration where SCSI layout is used. Over time, perhaps
we can come up with a better solution for this problem.

-Dai

>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-17 18:02   ` Jeff Layton
@ 2025-11-17 19:41     ` Dai Ngo
  2025-11-19 13:52       ` Jeff Layton
  0 siblings, 1 reply; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 19:41 UTC (permalink / raw)
  To: Jeff Layton, chuck.lever, neilb, okorniev, tom, hch, alex.aring,
	viro, brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/17/25 10:02 AM, Jeff Layton wrote:
> On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
>> Some consumers of the lease_manager_operations structure need
>> to perform additional actions when a lease break, triggered by
>> a conflict, times out.
>>
>> The NFS server is the first consumer of this operation.
>>
>> When a pNFS layout conflict occurs and the lease break times
>> out — resulting in the layout being revoked and its file lease
>> removed from the flc_lease list — the NFS server must issue a
>> fence operation. This operation ensures that the client is
>> prevented from accessing the data server after the layout
>> revocation.
>>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   Documentation/filesystems/locking.rst |  2 ++
>>   fs/locks.c                            | 14 +++++++++++---
>>   include/linux/filelock.h              |  2 ++
>>   3 files changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>> index 77704fde9845..cd600db6c4b9 100644
>> --- a/Documentation/filesystems/locking.rst
>> +++ b/Documentation/filesystems/locking.rst
>> @@ -403,6 +403,7 @@ prototypes::
>>   	bool (*lm_breaker_owns_lease)(struct file_lock *);
>>           bool (*lm_lock_expirable)(struct file_lock *);
>>           void (*lm_expire_lock)(void);
>> +        void (*lm_breaker_timedout)(struct file_lease *);
>>   
>>   locking rules:
>>   
>> @@ -416,6 +417,7 @@ lm_change		yes		no			no
>>   lm_breaker_owns_lease:	yes     	no			no
>>   lm_lock_expirable	yes		no			no
>>   lm_expire_lock		no		no			yes
>> +lm_breaker_timedout     no              no                      yes
>>   ======================	=============	=================	=========
>>   
>>   buffer_head
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 04a3f0e20724..1f254e0cd398 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
>>   	while (!list_empty(dispose)) {
>>   		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
>>   		list_del_init(&flc->flc_list);
>> -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
>> +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
>> +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
>> +				struct file_lease *fl = file_lease(flc);
>> +
>> +				if (fl->fl_lmops->lm_breaker_timedout)
>> +					fl->fl_lmops->lm_breaker_timedout(fl);
>> +			}
> locks_dispose_list() is a common function for locks and leases, and
> this is only going to be relevant from __break_lease().
>
> Can you move this handling into a separate function that is called
> before the relevant locks_dispose_list() call in __break_lease()?

will fix in v5.

-Dai

>
>>   			locks_free_lease(file_lease(flc));
>> -		else
>> +		} else
>>   			locks_free_lock(file_lock(flc));
>>   	}
>>   }
>> @@ -1482,8 +1488,10 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>>   		trace_time_out_leases(inode, fl);
>>   		if (past_time(fl->fl_downgrade_time))
>>   			lease_modify(fl, F_RDLCK, dispose);
>> -		if (past_time(fl->fl_break_time))
>> +		if (past_time(fl->fl_break_time)) {
>>   			lease_modify(fl, F_UNLCK, dispose);
>> +			fl->c.flc_flags |= FL_BREAKER_TIMEDOUT;
>> +		}
>>   	}
>>   }
>>   
>> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
>> index c2ce8ba05d06..06ccd6b66012 100644
>> --- a/include/linux/filelock.h
>> +++ b/include/linux/filelock.h
>> @@ -17,6 +17,7 @@
>>   #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>>   #define FL_LAYOUT	2048	/* outstanding pNFS layout */
>>   #define FL_RECLAIM	4096	/* reclaiming from a reboot server */
>> +#define	FL_BREAKER_TIMEDOUT	8192	/* lease breaker timed out */
>>   
>>   #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>>   
>> @@ -49,6 +50,7 @@ struct lease_manager_operations {
>>   	int (*lm_change)(struct file_lease *, int, struct list_head *);
>>   	void (*lm_setup)(struct file_lease *, void **);
>>   	bool (*lm_breaker_owns_lease)(struct file_lease *);
>> +	void (*lm_breaker_timedout)(struct file_lease *fl);
>>   };
>>   
>>   struct lock_manager {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-17 18:21   ` Jeff Layton
@ 2025-11-17 19:49     ` Dai Ngo
  2025-11-19  9:53     ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 19:49 UTC (permalink / raw)
  To: Jeff Layton, chuck.lever, neilb, okorniev, tom, hch, alex.aring,
	viro, brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/17/25 10:21 AM, Jeff Layton wrote:
> On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
>> If multiple threads are waiting for a layout conflict on the same
>> file in __break_lease, these threads must wait until one of the
>> waiting threads completes the fencing operation before proceeding.
>> This ensures that I/O operations from these threads can only occurs
>> after the client was fenced.
>>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/locks.c               | 24 ++++++++++++++++++++++++
>>   include/linux/filelock.h |  5 +++++
>>   2 files changed, 29 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 1f254e0cd398..b6fd6aa2498c 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
>>   	INIT_LIST_HEAD(&ctx->flc_flock);
>>   	INIT_LIST_HEAD(&ctx->flc_posix);
>>   	INIT_LIST_HEAD(&ctx->flc_lease);
>> +	init_waitqueue_head(&ctx->flc_dispose_wait);
>>   
>>   	/*
>>   	 * Assign the pointer if it's not already assigned. If it is, then
>> @@ -1609,6 +1610,10 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>   		error = -EWOULDBLOCK;
>>   		goto out;
>>   	}
>> +	if (type == FL_LAYOUT && !ctx->flc_conflict) {
>> +		ctx->flc_conflict = true;
>> +		ctx->flc_wait_for_dispose = false;
>> +	}
> I don't like special casing this for FL_LAYOUT leases. It seems like we
> ought to be able to set up a lm_breaker_timedout operation on any sort
> of lease.

I just try to minimize the effect of the change to FL_LAYOUT, but if
you think that is not necessary then I will remove the case in v5.

>
>>   
>>   restart:
>>   	fl = list_first_entry(&ctx->flc_lease, struct file_lease, c.flc_list);
>> @@ -1640,12 +1645,31 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>   			time_out_leases(inode, &dispose);
>>   		if (any_leases_conflict(inode, new_fl))
>>   			goto restart;
>> +		if (type == FL_LAYOUT && ctx->flc_wait_for_dispose) {
>> +			/*
>> +			 * wait for flc_wait_for_dispose to ensure
>> +			 * the offending client has been fenced.
>> +			 */
>> +			spin_unlock(&ctx->flc_lock);
>> +			wait_event_interruptible(ctx->flc_dispose_wait,
>> +				ctx->flc_wait_for_dispose == false);
>> +			spin_lock(&ctx->flc_lock);
>> +		}
>>   		error = 0;
>> +		if (type == FL_LAYOUT)
>> +			ctx->flc_wait_for_dispose = true;
>>   	}
>>   out:
>>   	spin_unlock(&ctx->flc_lock);
>>   	percpu_up_read(&file_rwsem);
>>   	locks_dispose_list(&dispose);
>> +	if (type == FL_LAYOUT) {
>> +		spin_lock(&ctx->flc_lock);
>> +		ctx->flc_wait_for_dispose = false;
>> +		ctx->flc_conflict = false;
>> +		wake_up(&ctx->flc_dispose_wait);
>> +		spin_unlock(&ctx->flc_lock);
>> +	}
>>   free_lock:
>>   	locks_free_lease(new_fl);
>>   	return error;
>> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
>> index 06ccd6b66012..5c5353aabbc8 100644
>> --- a/include/linux/filelock.h
>> +++ b/include/linux/filelock.h
>> @@ -146,6 +146,11 @@ struct file_lock_context {
>>   	struct list_head	flc_flock;
>>   	struct list_head	flc_posix;
>>   	struct list_head	flc_lease;
>> +
>> +	/* for FL_LAYOUT */
>> +	bool			flc_conflict;
>> +	bool			flc_wait_for_dispose;
> I'm also not a fan of this particular bool. Waiting for any
> lm_breaker_timeout operations to complete seems like something we ought
> to just always do. In the trivial case where we have no special fencing
> to do, that should just return quickly anyway.

I have to think more about this. Without the flc_wait_for_dispose flag,
I don't have a way to allow one thread to proceed to do the fencing
while the rest have to wait until the fencing is done. Do you have any
suggestion?

-Dai

>
>> +	wait_queue_head_t	flc_dispose_wait;
>>   };
>>   
>>   #ifdef CONFIG_FILE_LOCKING

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 19:40     ` Dai Ngo
@ 2025-11-17 21:13       ` Benjamin Coddington
  2025-11-17 22:00         ` Dai Ngo
  2025-11-19 10:05       ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Benjamin Coddington @ 2025-11-17 21:13 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On 17 Nov 2025, at 14:40, Dai Ngo wrote:
>
> I will add comments to explain the changes better. My main goal for these
> patches is to plug the hole where a malicious client can cause the server
> to hang in a configuration where SCSI layout is used.

This would be a mixed SCSI layout vs non-SCSI layout environment, I assume,
since a malicious client with a SCSI layout could do far worse.

The mixed environment is interesting, but not well-optimized.  I think if a
non-SCSI layout client can break leases, then it would be impossible to it
from just walking the filesystem constantly to open and redirect the IO back
through the MDS -- another DOS risk.

I'd go so far as to say that in a SCSI-layout architecture, we must really
fully trust the clients.

Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 21:13       ` Benjamin Coddington
@ 2025-11-17 22:00         ` Dai Ngo
  2025-11-19 10:08           ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Dai Ngo @ 2025-11-17 22:00 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs


On 11/17/25 1:13 PM, Benjamin Coddington wrote:
> On 17 Nov 2025, at 14:40, Dai Ngo wrote:
>> I will add comments to explain the changes better. My main goal for these
>> patches is to plug the hole where a malicious client can cause the server
>> to hang in a configuration where SCSI layout is used.
> This would be a mixed SCSI layout vs non-SCSI layout environment, I assume,
> since a malicious client with a SCSI layout could do far worse.
>
> The mixed environment is interesting, but not well-optimized.  I think if a
> non-SCSI layout client can break leases, then it would be impossible to it
> from just walking the filesystem constantly to open and redirect the IO back
> through the MDS -- another DOS risk.
>
> I'd go so far as to say that in a SCSI-layout architecture, we must really
> fully trust the clients.

Perhaps I overstated the severity of the risk. The real issue is, in the
current state, SCSI layout recall has no timeout and if there are enough
activities on the server that results in lots of layout conflicts then the
server can hang.

I think by requiring iSCSI authentication between the client and the DS, it
helps to establish trust in the client.

-Dai


>
> Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced.
  2025-11-17 18:21   ` Jeff Layton
  2025-11-17 19:49     ` Dai Ngo
@ 2025-11-19  9:53     ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19  9:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dai Ngo, chuck.lever, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On Mon, Nov 17, 2025 at 01:21:32PM -0500, Jeff Layton wrote:
> > +	if (type == FL_LAYOUT && !ctx->flc_conflict) {
> > +		ctx->flc_conflict = true;
> > +		ctx->flc_wait_for_dispose = false;
> > +	}
> 
> I don't like special casing this for FL_LAYOUT leases. It seems like we
> ought to be able to set up a lm_breaker_timedout operation on any sort
> of lease.

Yes, this should check for a lm_breaker_timedout operation instead.

> > +
> > +	/* for FL_LAYOUT */
> > +	bool			flc_conflict;
> > +	bool			flc_wait_for_dispose;
> 
> I'm also not a fan of this particular bool. Waiting for any
> lm_breaker_timeout operations to complete seems like something we ought
> to just always do.

Yes.

> In the trivial case where we have no special fencing
> to do, that should just return quickly anyway.

Exactly.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
  2025-11-17 15:39   ` Chuck Lever
  2025-11-17 18:02   ` Jeff Layton
@ 2025-11-19  9:54   ` Christoph Hellwig
  2025-11-19 14:04     ` Chuck Lever
  2 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19  9:54 UTC (permalink / raw)
  To: Dai Ngo
  Cc: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On Sat, Nov 15, 2025 at 11:16:37AM -0800, Dai Ngo wrote:
> Some consumers of the lease_manager_operations structure need
> to perform additional actions when a lease break, triggered by
> a conflict, times out.
> 
> The NFS server is the first consumer of this operation.
> 
> When a pNFS layout conflict occurs and the lease break times
> out — resulting in the layout being revoked and its file lease
> removed from the flc_lease list — the NFS server must issue a
> fence operation. This operation ensures that the client is
> prevented from accessing the data server after the layout
> revocation.
> 
> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")

This again does not fix anything.  It is infrastructure for your fix
in patch 3.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 20:20     ` Dai Ngo
@ 2025-11-19  9:56       ` Christoph Hellwig
  2025-11-19 16:35         ` Dai Ngo
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19  9:56 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On Sat, Nov 15, 2025 at 12:20:43PM -0800, Dai Ngo wrote:
> break_lease is called from many places - the NFS client and server,
> CIFS and VFS. Many of these callers do not handle error returned
> from break_lease, some don't even check return value from break_lease.
>
> Until we fix all callers of break_lease to handle error return, which
> I think it's much more involved, returning error from break_lease is
> not possible.

There is about a dozen callers.  Although some might not want to handle
this error, having a flags argument to opt into for the callers that
can and want is entirely reasonable.

> I have plan to post a separate patch to check for layout conflict
> caused from the same client and skip the recall - same as what's
> currently done for delegation conflict.

That would be very useful.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 15:55   ` Chuck Lever
  2025-11-17 19:40     ` Dai Ngo
@ 2025-11-19  9:57     ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19  9:57 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Dai Ngo, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On Mon, Nov 17, 2025 at 10:55:27AM -0500, Chuck Lever wrote:
> > +		if (nfsd4_layout_ops[type]->fence_client)
> > +			nfsd4_layout_ops[type]->fence_client(ls, nf);
> 
> If a .fence_client callback is optional for a layout to provide,
> timeouts for such layout types won't trigger any fencing action. I'm not
> certain yet that's good behavior.

It is a for type layouts for which we have explicit least breaks.
Maybe we need to find a way to formalize that?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 19:40     ` Dai Ngo
  2025-11-17 21:13       ` Benjamin Coddington
@ 2025-11-19 10:05       ` Christoph Hellwig
  2025-11-19 14:04         ` Benjamin Coddington
  2025-11-19 14:09         ` Chuck Lever
  1 sibling, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19 10:05 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Chuck Lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
>> If a .fence_client callback is optional for a layout to provide,
>> timeouts for such layout types won't trigger any fencing action. I'm not
>> certain yet that's good behavior.
>
> Some layout implementation is in experimental state such as block
> layout and should not be used in production environment. I don't
> know what should we do for that case. Does adding a trace point to
> warn the user sufficient?

The block layout isn't really experimental, but really a broken protocol
because there is no way to even fence a client except when there is
a side channel mapping between the client identities for NFS and the
storage protocol.

I'd be all in favour of deprecating the support ASAP and then removing
it aggressively.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-17 22:00         ` Dai Ngo
@ 2025-11-19 10:08           ` Christoph Hellwig
  2025-11-19 16:52             ` Dai Ngo
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19 10:08 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Benjamin Coddington, Chuck Lever, jlayton, neilb, okorniev, tom,
	hch, alex.aring, viro, brauner, jack, linux-fsdevel, linux-nfs

On Mon, Nov 17, 2025 at 02:00:07PM -0800, Dai Ngo wrote:
> Perhaps I overstated the severity of the risk. The real issue is, in the
> current state, SCSI layout recall has no timeout and if there are enough
> activities on the server that results in lots of layout conflicts then the
> server can hang.

All this is really caused by the synchronous waiting.  I'm not against
the workaround here, but I think we need to address that.  There's
really no reason to consumer threads for this waiting activity and
we'll need to stop doing it.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
  2025-11-15 19:44   ` Chuck Lever
  2025-11-17 15:55   ` Chuck Lever
@ 2025-11-19 10:08   ` Christoph Hellwig
  2 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-19 10:08 UTC (permalink / raw)
  To: Dai Ngo
  Cc: chuck.lever, jlayton, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs

s/FSD/NFSD/ in the subject.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-17 19:41     ` Dai Ngo
@ 2025-11-19 13:52       ` Jeff Layton
  2025-11-19 16:32         ` Dai Ngo
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Layton @ 2025-11-19 13:52 UTC (permalink / raw)
  To: Dai Ngo, chuck.lever, neilb, okorniev, tom, hch, alex.aring, viro,
	brauner, jack
  Cc: linux-fsdevel, linux-nfs

On Mon, 2025-11-17 at 11:41 -0800, Dai Ngo wrote:
> On 11/17/25 10:02 AM, Jeff Layton wrote:
> > On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
> > > Some consumers of the lease_manager_operations structure need
> > > to perform additional actions when a lease break, triggered by
> > > a conflict, times out.
> > > 
> > > The NFS server is the first consumer of this operation.
> > > 
> > > When a pNFS layout conflict occurs and the lease break times
> > > out — resulting in the layout being revoked and its file lease
> > > removed from the flc_lease list — the NFS server must issue a
> > > fence operation. This operation ensures that the client is
> > > prevented from accessing the data server after the layout
> > > revocation.
> > > 
> > > Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> > > ---
> > >   Documentation/filesystems/locking.rst |  2 ++
> > >   fs/locks.c                            | 14 +++++++++++---
> > >   include/linux/filelock.h              |  2 ++
> > >   3 files changed, 15 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > > index 77704fde9845..cd600db6c4b9 100644
> > > --- a/Documentation/filesystems/locking.rst
> > > +++ b/Documentation/filesystems/locking.rst
> > > @@ -403,6 +403,7 @@ prototypes::
> > >   	bool (*lm_breaker_owns_lease)(struct file_lock *);
> > >           bool (*lm_lock_expirable)(struct file_lock *);
> > >           void (*lm_expire_lock)(void);
> > > +        void (*lm_breaker_timedout)(struct file_lease *);
> > >   
> > >   locking rules:
> > >   
> > > @@ -416,6 +417,7 @@ lm_change		yes		no			no
> > >   lm_breaker_owns_lease:	yes     	no			no
> > >   lm_lock_expirable	yes		no			no
> > >   lm_expire_lock		no		no			yes
> > > +lm_breaker_timedout     no              no                      yes
> > >   ======================	=============	=================	=========
> > >   
> > >   buffer_head
> > > diff --git a/fs/locks.c b/fs/locks.c
> > > index 04a3f0e20724..1f254e0cd398 100644
> > > --- a/fs/locks.c
> > > +++ b/fs/locks.c
> > > @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
> > >   	while (!list_empty(dispose)) {
> > >   		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
> > >   		list_del_init(&flc->flc_list);
> > > -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
> > > +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
> > > +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
> > > +				struct file_lease *fl = file_lease(flc);
> > > +
> > > +				if (fl->fl_lmops->lm_breaker_timedout)
> > > +					fl->fl_lmops->lm_breaker_timedout(fl);
> > > +			}
> > locks_dispose_list() is a common function for locks and leases, and
> > this is only going to be relevant from __break_lease().
> > 
> > Can you move this handling into a separate function that is called
> > before the relevant locks_dispose_list() call in __break_lease()?
> 
> will fix in v5.
> 
> -Dai
> 

That may not work actually, since we may end up with a timed out lease
on a dispose list in a different codepath.

I just sent this patch:

     https://lore.kernel.org/linux-nfs/20251119-dir-deleg-ro-v8-2-81b6cf5485c6@kernel.org/T/#u

Assuming that goes in, then I think calling ->lm_breaker_timedout()
from lease_dispose_list() should be OK.

> > 
> > >   			locks_free_lease(file_lease(flc));
> > > -		else
> > > +		} else
> > >   			locks_free_lock(file_lock(flc));
> > >   	}
> > >   }
> > > @@ -1482,8 +1488,10 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
> > >   		trace_time_out_leases(inode, fl);
> > >   		if (past_time(fl->fl_downgrade_time))
> > >   			lease_modify(fl, F_RDLCK, dispose);
> > > -		if (past_time(fl->fl_break_time))
> > > +		if (past_time(fl->fl_break_time)) {
> > >   			lease_modify(fl, F_UNLCK, dispose);
> > > +			fl->c.flc_flags |= FL_BREAKER_TIMEDOUT;
> > > +		}
> > >   	}
> > >   }
> > >   
> > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > index c2ce8ba05d06..06ccd6b66012 100644
> > > --- a/include/linux/filelock.h
> > > +++ b/include/linux/filelock.h
> > > @@ -17,6 +17,7 @@
> > >   #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
> > >   #define FL_LAYOUT	2048	/* outstanding pNFS layout */
> > >   #define FL_RECLAIM	4096	/* reclaiming from a reboot server */
> > > +#define	FL_BREAKER_TIMEDOUT	8192	/* lease breaker timed out */
> > >   
> > >   #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
> > >   
> > > @@ -49,6 +50,7 @@ struct lease_manager_operations {
> > >   	int (*lm_change)(struct file_lease *, int, struct list_head *);
> > >   	void (*lm_setup)(struct file_lease *, void **);
> > >   	bool (*lm_breaker_owns_lease)(struct file_lease *);
> > > +	void (*lm_breaker_timedout)(struct file_lease *fl);
> > >   };
> > >   
> > >   struct lock_manager {

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-19  9:54   ` Christoph Hellwig
@ 2025-11-19 14:04     ` Chuck Lever
  0 siblings, 0 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-19 14:04 UTC (permalink / raw)
  To: Christoph Hellwig, Dai Ngo
  Cc: jlayton, neilb, okorniev, tom, alex.aring, viro, brauner, jack,
	linux-fsdevel, linux-nfs

On 11/19/25 4:54 AM, Christoph Hellwig wrote:
> On Sat, Nov 15, 2025 at 11:16:37AM -0800, Dai Ngo wrote:
>> Some consumers of the lease_manager_operations structure need
>> to perform additional actions when a lease break, triggered by
>> a conflict, times out.
>>
>> The NFS server is the first consumer of this operation.
>>
>> When a pNFS layout conflict occurs and the lease break times
>> out — resulting in the layout being revoked and its file lease
>> removed from the flc_lease list — the NFS server must issue a
>> fence operation. This operation ensures that the client is
>> prevented from accessing the data server after the layout
>> revocation.
>>
>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
> 
> This again does not fix anything.  It is infrastructure for your fix
> in patch 3.
> 

Dai -

I don't think the Fixes: is necessary. Please just fold this patch
into 3/3.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 10:05       ` Christoph Hellwig
@ 2025-11-19 14:04         ` Benjamin Coddington
  2025-11-19 14:09         ` Chuck Lever
  1 sibling, 0 replies; 37+ messages in thread
From: Benjamin Coddington @ 2025-11-19 14:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dai Ngo, Chuck Lever, jlayton, neilb, okorniev, tom, alex.aring,
	viro, brauner, jack, linux-fsdevel, linux-nfs

On 19 Nov 2025, at 5:05, Christoph Hellwig wrote:

> On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
>>> If a .fence_client callback is optional for a layout to provide,
>>> timeouts for such layout types won't trigger any fencing action. I'm not
>>> certain yet that's good behavior.
>>
>> Some layout implementation is in experimental state such as block
>> layout and should not be used in production environment. I don't
>> know what should we do for that case. Does adding a trace point to
>> warn the user sufficient?
>
> The block layout isn't really experimental, but really a broken protocol
> because there is no way to even fence a client except when there is
> a side channel mapping between the client identities for NFS and the
> storage protocol.
>
> I'd be all in favour of deprecating the support ASAP and then removing
> it aggressively.

I'm in favor as well.  This is something that confuses users when exploring
the space, and I've never seen the setup being used in practice.  I have
seen plenty of attempts to use it before the problems are realized.

Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 10:05       ` Christoph Hellwig
  2025-11-19 14:04         ` Benjamin Coddington
@ 2025-11-19 14:09         ` Chuck Lever
  2025-11-19 14:12           ` Jeff Layton
                             ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-19 14:09 UTC (permalink / raw)
  To: Christoph Hellwig, Dai Ngo
  Cc: jlayton, neilb, okorniev, tom, alex.aring, viro, brauner, jack,
	linux-fsdevel, linux-nfs

On 11/19/25 5:05 AM, Christoph Hellwig wrote:
> On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
>>> If a .fence_client callback is optional for a layout to provide,
>>> timeouts for such layout types won't trigger any fencing action. I'm not
>>> certain yet that's good behavior.
>>
>> Some layout implementation is in experimental state such as block
>> layout and should not be used in production environment. I don't
>> know what should we do for that case. Does adding a trace point to
>> warn the user sufficient?
> 
> The block layout isn't really experimental, but really a broken protocol
> because there is no way to even fence a client except when there is
> a side channel mapping between the client identities for NFS and the
> storage protocol.

Is the protocol broken, or just incomplete, assuming that other
(unspecified) protocols are necessary to be provided?


> I'd be all in favour of deprecating the support ASAP and then removing
> it aggressively.
If we can say with some certainty that there are no users of the pNFS
block layout type, and there is no way of addressing the fencing issue,
then I'm willing to consider removing it.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 14:09         ` Chuck Lever
@ 2025-11-19 14:12           ` Jeff Layton
  2025-11-19 17:06           ` Dai Ngo
  2025-11-20  6:59           ` Christoph Hellwig
  2 siblings, 0 replies; 37+ messages in thread
From: Jeff Layton @ 2025-11-19 14:12 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig, Dai Ngo
  Cc: neilb, okorniev, tom, alex.aring, viro, brauner, jack,
	linux-fsdevel, linux-nfs

On Wed, 2025-11-19 at 09:09 -0500, Chuck Lever wrote:
> On 11/19/25 5:05 AM, Christoph Hellwig wrote:
> > On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
> > > > If a .fence_client callback is optional for a layout to provide,
> > > > timeouts for such layout types won't trigger any fencing action. I'm not
> > > > certain yet that's good behavior.
> > > 
> > > Some layout implementation is in experimental state such as block
> > > layout and should not be used in production environment. I don't
> > > know what should we do for that case. Does adding a trace point to
> > > warn the user sufficient?
> > 
> > The block layout isn't really experimental, but really a broken protocol
> > because there is no way to even fence a client except when there is
> > a side channel mapping between the client identities for NFS and the
> > storage protocol.
> 
> Is the protocol broken, or just incomplete, assuming that other
> (unspecified) protocols are necessary to be provided?
> 
> 
> > I'd be all in favour of deprecating the support ASAP and then removing
> > it aggressively.
> If we can say with some certainty that there are no users of the pNFS
> block layout type, and there is no way of addressing the fencing issue,
> then I'm willing to consider removing it.
> 

ACK on dropping it from my standpoint. I've also never seen a block
layout setup in the wild.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations
  2025-11-19 13:52       ` Jeff Layton
@ 2025-11-19 16:32         ` Dai Ngo
  0 siblings, 0 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-19 16:32 UTC (permalink / raw)
  To: Jeff Layton, chuck.lever, neilb, okorniev, tom, hch, alex.aring,
	viro, brauner, jack
  Cc: linux-fsdevel, linux-nfs


On 11/19/25 5:52 AM, Jeff Layton wrote:
> On Mon, 2025-11-17 at 11:41 -0800, Dai Ngo wrote:
>> On 11/17/25 10:02 AM, Jeff Layton wrote:
>>> On Sat, 2025-11-15 at 11:16 -0800, Dai Ngo wrote:
>>>> Some consumers of the lease_manager_operations structure need
>>>> to perform additional actions when a lease break, triggered by
>>>> a conflict, times out.
>>>>
>>>> The NFS server is the first consumer of this operation.
>>>>
>>>> When a pNFS layout conflict occurs and the lease break times
>>>> out — resulting in the layout being revoked and its file lease
>>>> removed from the flc_lease list — the NFS server must issue a
>>>> fence operation. This operation ensures that the client is
>>>> prevented from accessing the data server after the layout
>>>> revocation.
>>>>
>>>> Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>> ---
>>>>    Documentation/filesystems/locking.rst |  2 ++
>>>>    fs/locks.c                            | 14 +++++++++++---
>>>>    include/linux/filelock.h              |  2 ++
>>>>    3 files changed, 15 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>>>> index 77704fde9845..cd600db6c4b9 100644
>>>> --- a/Documentation/filesystems/locking.rst
>>>> +++ b/Documentation/filesystems/locking.rst
>>>> @@ -403,6 +403,7 @@ prototypes::
>>>>    	bool (*lm_breaker_owns_lease)(struct file_lock *);
>>>>            bool (*lm_lock_expirable)(struct file_lock *);
>>>>            void (*lm_expire_lock)(void);
>>>> +        void (*lm_breaker_timedout)(struct file_lease *);
>>>>    
>>>>    locking rules:
>>>>    
>>>> @@ -416,6 +417,7 @@ lm_change		yes		no			no
>>>>    lm_breaker_owns_lease:	yes     	no			no
>>>>    lm_lock_expirable	yes		no			no
>>>>    lm_expire_lock		no		no			yes
>>>> +lm_breaker_timedout     no              no                      yes
>>>>    ======================	=============	=================	=========
>>>>    
>>>>    buffer_head
>>>> diff --git a/fs/locks.c b/fs/locks.c
>>>> index 04a3f0e20724..1f254e0cd398 100644
>>>> --- a/fs/locks.c
>>>> +++ b/fs/locks.c
>>>> @@ -369,9 +369,15 @@ locks_dispose_list(struct list_head *dispose)
>>>>    	while (!list_empty(dispose)) {
>>>>    		flc = list_first_entry(dispose, struct file_lock_core, flc_list);
>>>>    		list_del_init(&flc->flc_list);
>>>> -		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
>>>> +		if (flc->flc_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT)) {
>>>> +			if (flc->flc_flags & FL_BREAKER_TIMEDOUT) {
>>>> +				struct file_lease *fl = file_lease(flc);
>>>> +
>>>> +				if (fl->fl_lmops->lm_breaker_timedout)
>>>> +					fl->fl_lmops->lm_breaker_timedout(fl);
>>>> +			}
>>> locks_dispose_list() is a common function for locks and leases, and
>>> this is only going to be relevant from __break_lease().
>>>
>>> Can you move this handling into a separate function that is called
>>> before the relevant locks_dispose_list() call in __break_lease()?
>> will fix in v5.
>>
>> -Dai
>>
> That may not work actually, since we may end up with a timed out lease
> on a dispose list in a different codepath.
>
> I just sent this patch:
>
>       https://lore.kernel.org/linux-nfs/20251119-dir-deleg-ro-v8-2-81b6cf5485c6@kernel.org/T/#u
>
> Assuming that goes in, then I think calling ->lm_breaker_timedout()
> from lease_dispose_list() should be OK.

Thanks Jeff, I will wait for your patch to go in then rework on this patch.

-Dai

>
>>>>    			locks_free_lease(file_lease(flc));
>>>> -		else
>>>> +		} else
>>>>    			locks_free_lock(file_lock(flc));
>>>>    	}
>>>>    }
>>>> @@ -1482,8 +1488,10 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>>>>    		trace_time_out_leases(inode, fl);
>>>>    		if (past_time(fl->fl_downgrade_time))
>>>>    			lease_modify(fl, F_RDLCK, dispose);
>>>> -		if (past_time(fl->fl_break_time))
>>>> +		if (past_time(fl->fl_break_time)) {
>>>>    			lease_modify(fl, F_UNLCK, dispose);
>>>> +			fl->c.flc_flags |= FL_BREAKER_TIMEDOUT;
>>>> +		}
>>>>    	}
>>>>    }
>>>>    
>>>> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
>>>> index c2ce8ba05d06..06ccd6b66012 100644
>>>> --- a/include/linux/filelock.h
>>>> +++ b/include/linux/filelock.h
>>>> @@ -17,6 +17,7 @@
>>>>    #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>>>>    #define FL_LAYOUT	2048	/* outstanding pNFS layout */
>>>>    #define FL_RECLAIM	4096	/* reclaiming from a reboot server */
>>>> +#define	FL_BREAKER_TIMEDOUT	8192	/* lease breaker timed out */
>>>>    
>>>>    #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>>>>    
>>>> @@ -49,6 +50,7 @@ struct lease_manager_operations {
>>>>    	int (*lm_change)(struct file_lease *, int, struct list_head *);
>>>>    	void (*lm_setup)(struct file_lease *, void **);
>>>>    	bool (*lm_breaker_owns_lease)(struct file_lease *);
>>>> +	void (*lm_breaker_timedout)(struct file_lease *fl);
>>>>    };
>>>>    
>>>>    struct lock_manager {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19  9:56       ` Christoph Hellwig
@ 2025-11-19 16:35         ` Dai Ngo
  0 siblings, 0 replies; 37+ messages in thread
From: Dai Ngo @ 2025-11-19 16:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, jlayton, neilb, okorniev, tom, alex.aring, viro,
	brauner, jack, linux-fsdevel, linux-nfs


On 11/19/25 1:56 AM, Christoph Hellwig wrote:
> On Sat, Nov 15, 2025 at 12:20:43PM -0800, Dai Ngo wrote:
>> break_lease is called from many places - the NFS client and server,
>> CIFS and VFS. Many of these callers do not handle error returned
>> from break_lease, some don't even check return value from break_lease.
>>
>> Until we fix all callers of break_lease to handle error return, which
>> I think it's much more involved, returning error from break_lease is
>> not possible.
> There is about a dozen callers.  Although some might not want to handle
> this error, having a flags argument to opt into for the callers that
> can and want is entirely reasonable.
>
>> I have plan to post a separate patch to check for layout conflict
>> caused from the same client and skip the recall - same as what's
>> currently done for delegation conflict.
> That would be very useful.

I will work on this first then wait for Jeff's patch to go in then
come back for this server hang problem.

-Dai


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 10:08           ` Christoph Hellwig
@ 2025-11-19 16:52             ` Dai Ngo
  2025-11-20  6:50               ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Dai Ngo @ 2025-11-19 16:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Benjamin Coddington, Chuck Lever, jlayton, neilb, okorniev, tom,
	alex.aring, viro, brauner, jack, linux-fsdevel, linux-nfs


On 11/19/25 2:08 AM, Christoph Hellwig wrote:
> On Mon, Nov 17, 2025 at 02:00:07PM -0800, Dai Ngo wrote:
>> Perhaps I overstated the severity of the risk. The real issue is, in the
>> current state, SCSI layout recall has no timeout and if there are enough
>> activities on the server that results in lots of layout conflicts then the
>> server can hang.
> All this is really caused by the synchronous waiting.  I'm not against
> the workaround here, but I think we need to address that.  There's
> really no reason to consumer threads for this waiting activity and
> we'll need to stop doing it.

Yes, I think __break_lease needs to work asynchronously. But I'm concerned
about touching a bunch of other callers. And we need to fix this server
hang asap that was why I posted this patch.

Also, I have questions about how __break_lease currently works such as
the way it picks the break_time and the blocker. It just picks the first
lease off the flc_lease list which could be a FL_FLOCK or FL_DELEG which
has no conflict with the current layout lease. I think it just happens
to work in this case is because of the 1 second restart loop.

-Dai

>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 14:09         ` Chuck Lever
  2025-11-19 14:12           ` Jeff Layton
@ 2025-11-19 17:06           ` Dai Ngo
  2025-11-20  6:52             ` Christoph Hellwig
  2025-11-20  6:59           ` Christoph Hellwig
  2 siblings, 1 reply; 37+ messages in thread
From: Dai Ngo @ 2025-11-19 17:06 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig
  Cc: jlayton, neilb, okorniev, tom, alex.aring, viro, brauner, jack,
	linux-fsdevel, linux-nfs


On 11/19/25 6:09 AM, Chuck Lever wrote:
> On 11/19/25 5:05 AM, Christoph Hellwig wrote:
>> On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
>>>> If a .fence_client callback is optional for a layout to provide,
>>>> timeouts for such layout types won't trigger any fencing action. I'm not
>>>> certain yet that's good behavior.
>>> Some layout implementation is in experimental state such as block
>>> layout and should not be used in production environment. I don't
>>> know what should we do for that case. Does adding a trace point to
>>> warn the user sufficient?
>> The block layout isn't really experimental, but really a broken protocol
>> because there is no way to even fence a client except when there is
>> a side channel mapping between the client identities for NFS and the
>> storage protocol.

Up until at least 6.15-rc6, this is in xfs_fs_get_uuid:

xfs_warn_experimental(mp, XFS_EXPERIMENTAL_PNFS);

> Is the protocol broken, or just incomplete, assuming that other
> (unspecified) protocols are necessary to be provided?
>
>
>> I'd be all in favour of deprecating the support ASAP and then removing
>> it aggressively.
> If we can say with some certainty that there are no users of the pNFS
> block layout type, and there is no way of addressing the fencing issue,
> then I'm willing to consider removing it.

I'm all for removing this too. With the latest upstream code, I can't
even configure the systems to use block layout anymore. I think this is
due to a recent change regarding device id but I don't have the exact
detail and the commit hash.

-Dai

>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 16:52             ` Dai Ngo
@ 2025-11-20  6:50               ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-20  6:50 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Christoph Hellwig, Benjamin Coddington, Chuck Lever, jlayton,
	neilb, okorniev, tom, alex.aring, viro, brauner, jack,
	linux-fsdevel, linux-nfs

On Wed, Nov 19, 2025 at 08:52:25AM -0800, Dai Ngo wrote:
>
> On 11/19/25 2:08 AM, Christoph Hellwig wrote:
>> On Mon, Nov 17, 2025 at 02:00:07PM -0800, Dai Ngo wrote:
>>> Perhaps I overstated the severity of the risk. The real issue is, in the
>>> current state, SCSI layout recall has no timeout and if there are enough
>>> activities on the server that results in lots of layout conflicts then the
>>> server can hang.
>> All this is really caused by the synchronous waiting.  I'm not against
>> the workaround here, but I think we need to address that.  There's
>> really no reason to consumer threads for this waiting activity and
>> we'll need to stop doing it.
>
> Yes, I think __break_lease needs to work asynchronously. But I'm concerned
> about touching a bunch of other callers.

You really don't need to touch other callers, keep the existing semantics
for them.  Just like we don't force all file I/O to be asynchronous,
just because we also offer that.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 17:06           ` Dai Ngo
@ 2025-11-20  6:52             ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-20  6:52 UTC (permalink / raw)
  To: Dai Ngo
  Cc: Chuck Lever, Christoph Hellwig, jlayton, neilb, okorniev, tom,
	alex.aring, viro, brauner, jack, linux-fsdevel, linux-nfs

On Wed, Nov 19, 2025 at 09:06:27AM -0800, Dai Ngo wrote:
> Up until at least 6.15-rc6, this is in xfs_fs_get_uuid:
>
> xfs_warn_experimental(mp, XFS_EXPERIMENTAL_PNFS);

That's really just for the XFS layout exporting.  It just happened
to be placed there and thus was more or less accidentally skipped
for the SCSI layout.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts
  2025-11-19 14:09         ` Chuck Lever
  2025-11-19 14:12           ` Jeff Layton
  2025-11-19 17:06           ` Dai Ngo
@ 2025-11-20  6:59           ` Christoph Hellwig
  2 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-20  6:59 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Dai Ngo, jlayton, neilb, okorniev, tom,
	alex.aring, viro, brauner, jack, linux-fsdevel, linux-nfs

On Wed, Nov 19, 2025 at 09:09:04AM -0500, Chuck Lever wrote:
> On 11/19/25 5:05 AM, Christoph Hellwig wrote:
> > On Mon, Nov 17, 2025 at 11:40:22AM -0800, Dai Ngo wrote:
> >>> If a .fence_client callback is optional for a layout to provide,
> >>> timeouts for such layout types won't trigger any fencing action. I'm not
> >>> certain yet that's good behavior.
> >>
> >> Some layout implementation is in experimental state such as block
> >> layout and should not be used in production environment. I don't
> >> know what should we do for that case. Does adding a trace point to
> >> warn the user sufficient?
> > 
> > The block layout isn't really experimental, but really a broken protocol
> > because there is no way to even fence a client except when there is
> > a side channel mapping between the client identities for NFS and the
> > storage protocol.
> 
> Is the protocol broken, or just incomplete, assuming that other
> (unspecified) protocols are necessary to be provided?

I guess it depends on your definition.  There is no shared client
identify between NFS and the storage protocol.  So you need a side
channel communication protocol such as sneakernet to register the
client identities on the server.  And that even assumes you have
a fencing method at that level, which often might not be the case.


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-11-20  6:59 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-15 19:16 [Patch v4 0/3] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
2025-11-15 19:16 ` [PATCH v4 1/3] locks: Introduce lm_breaker_timedout operation to lease_manager_operations Dai Ngo
2025-11-17 15:39   ` Chuck Lever
2025-11-17 19:17     ` Dai Ngo
2025-11-17 18:02   ` Jeff Layton
2025-11-17 19:41     ` Dai Ngo
2025-11-19 13:52       ` Jeff Layton
2025-11-19 16:32         ` Dai Ngo
2025-11-19  9:54   ` Christoph Hellwig
2025-11-19 14:04     ` Chuck Lever
2025-11-15 19:16 ` [PATCH v4 2/3] locks: Threads with layout conflict must wait until client was fenced Dai Ngo
2025-11-17 15:47   ` Chuck Lever
2025-11-17 19:21     ` Dai Ngo
2025-11-17 18:21   ` Jeff Layton
2025-11-17 19:49     ` Dai Ngo
2025-11-19  9:53     ` Christoph Hellwig
2025-11-15 19:16 ` [PATCH v4 3/3] FSD: Fix NFS server hang when there are multiple layout conflicts Dai Ngo
2025-11-15 19:44   ` Chuck Lever
2025-11-15 20:20     ` Dai Ngo
2025-11-19  9:56       ` Christoph Hellwig
2025-11-19 16:35         ` Dai Ngo
2025-11-17 15:55   ` Chuck Lever
2025-11-17 19:40     ` Dai Ngo
2025-11-17 21:13       ` Benjamin Coddington
2025-11-17 22:00         ` Dai Ngo
2025-11-19 10:08           ` Christoph Hellwig
2025-11-19 16:52             ` Dai Ngo
2025-11-20  6:50               ` Christoph Hellwig
2025-11-19 10:05       ` Christoph Hellwig
2025-11-19 14:04         ` Benjamin Coddington
2025-11-19 14:09         ` Chuck Lever
2025-11-19 14:12           ` Jeff Layton
2025-11-19 17:06           ` Dai Ngo
2025-11-20  6:52             ` Christoph Hellwig
2025-11-20  6:59           ` Christoph Hellwig
2025-11-19  9:57     ` Christoph Hellwig
2025-11-19 10:08   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).