From: Jeff Layton <jlayton@kernel.org>
To: Dai Ngo <dai.ngo@oracle.com>,
chuck.lever@oracle.com, neil@brown.name, okorniev@redhat.com,
tom@talpey.com, hch@lst.de, alex.aring@gmail.com,
viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz
Cc: linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v4 1/1] NFSD: Enforce Timeout on Layout Recall and Integrate Lease Manager Fencing
Date: Wed, 28 Jan 2026 13:01:03 -0500 [thread overview]
Message-ID: <ac6601d20f1f48db9eca7ba204be8030fc16a1d0.camel@kernel.org> (raw)
In-Reply-To: <c566182b-2256-45a0-844e-a7c8614ec895@oracle.com>
On Wed, 2026-01-28 at 09:25 -0800, Dai Ngo wrote:
> On 1/27/26 9:54 AM, Jeff Layton wrote:
> > On Mon, 2026-01-26 at 16:50 -0800, Dai Ngo wrote:
> > > When a layout conflict triggers a recall, enforcing a timeout is
> > > necessary to prevent an excessive number of nfsd threads from being
> > > blocked in __break_lease, ensuring the server continues servicing
> > > incoming requests efficiently.
> > >
> > > This patch introduces a new function to lease_manager_operations:
> > >
> > > lm_breaker_timedout: Invoked when a lease recall times out and is
> > > about to be disposed of. This function enables the lease manager
> > > to inform the caller whether the file_lease should remain on the
> > > flc_list or be disposed of.
> > >
> > > For the NFSD lease manager, this function now handles layout recall
> > > timeouts. If the layout type supports fencing and the client has not
> > > been fenced, a fence operation is triggered to prevent the client
> > > from accessing the block device.
> > >
> > > While the fencing operation is in progress, the conflicting file_lease
> > > remains on the flc_list until fencing is complete. This guarantees
> > > that no other clients can access the file, and the client with
> > > exclusive access is properly blocked before disposal.
> > >
> > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> > > ---
> > > Documentation/filesystems/locking.rst | 2 +
> > > fs/locks.c | 10 +++-
> > > fs/nfsd/blocklayout.c | 38 ++++++-------
> > > fs/nfsd/nfs4layouts.c | 79 +++++++++++++++++++++++++--
> > > fs/nfsd/nfs4state.c | 1 +
> > > fs/nfsd/state.h | 6 ++
> > > include/linux/filelock.h | 1 +
> > > 7 files changed, 110 insertions(+), 27 deletions(-)
> > >
> > > v2:
> > > . Update Subject line to include fencing operation.
> > > . Allow conflicting lease to remain on flc_list until fencing
> > > is complete.
> > > . Use system worker to perform fencing operation asynchronously.
> > > . Use nfs4_stid.sc_count to ensure layout stateid remains
> > > valid before starting the fencing operation, nfs4_stid.sc_count
> > > is released after fencing operation is complete.
> > > . Rework nfsd4_scsi_fence_client to:
> > > . wait until fencing completes before exiting.
> > > . wait for any fencing already in progress to complete
> > > before checking the NFSD_MDS_PR_FENCED flag.
> > > . Remove lm_need_to_retry from lease_manager_operations.
> > > v3:
> > > . correct locking requirement in locking.rst.
> > > . add max retry count to fencing operation.
> > > . add missing nfs4_put_stid in nfsd4_layout_fence_worker.
> > > . remove special-casing of FL_LAYOUT in lease_modify.
> > > . remove lease_want_dispose.
> > > . move lm_breaker_timedout call to time_out_leases.
> > > v4:
> > > . only increment ls_fence_retry_cnt after successfully
> > > scheduling new work in nfsd4_layout_lm_breaker_timedout.
> > >
> > > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > > index 04c7691e50e0..a339491f02e4 100644
> > > --- a/Documentation/filesystems/locking.rst
> > > +++ b/Documentation/filesystems/locking.rst
> > > @@ -403,6 +403,7 @@ prototypes::
> > > bool (*lm_breaker_owns_lease)(struct file_lock *);
> > > bool (*lm_lock_expirable)(struct file_lock *);
> > > void (*lm_expire_lock)(void);
> > > + void (*lm_breaker_timedout)(struct file_lease *);
> > >
> > > locking rules:
> > >
> > > @@ -417,6 +418,7 @@ lm_breaker_owns_lease: yes no no
> > > lm_lock_expirable yes no no
> > > lm_expire_lock no no yes
> > > lm_open_conflict yes no no
> > > +lm_breaker_timedout yes no no
> > > ====================== ============= ================= =========
> > >
> > > buffer_head
> > > diff --git a/fs/locks.c b/fs/locks.c
> > > index 46f229f740c8..1b63aa704598 100644
> > > --- a/fs/locks.c
> > > +++ b/fs/locks.c
> > > @@ -1524,6 +1524,7 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
> > > {
> > > struct file_lock_context *ctx = inode->i_flctx;
> > > struct file_lease *fl, *tmp;
> > > + bool remove = true;
> > >
> > > lockdep_assert_held(&ctx->flc_lock);
> > >
> > > @@ -1531,8 +1532,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
> > > trace_time_out_leases(inode, fl);
> > > if (past_time(fl->fl_downgrade_time))
> > > lease_modify(fl, F_RDLCK, dispose);
> > > - if (past_time(fl->fl_break_time))
> > > - lease_modify(fl, F_UNLCK, dispose);
> > > +
> > > + if (past_time(fl->fl_break_time)) {
> > > + if (fl->fl_lmops && fl->fl_lmops->lm_breaker_timedout)
> > > + remove = fl->fl_lmops->lm_breaker_timedout(fl);
> > > + if (remove)
> > > + lease_modify(fl, F_UNLCK, dispose);
> > I'd not bother with the return code to lm_breaker_timedout.
> >
> > Make it void return and have it call lease_modify itself before
> > returning in the cases where you have it returning true. If the
> > operation isn't defined then just do the lease_modify here like we
> > always have.
>
> The first call to lm_breaker_timedout schedules the fence worker, so
> fencing is not done yet. If lm_breaker_timedout returns to
> time_out_leases without any return value, then lease_modify is called
> to remove the lease.
>
> Am I missing something?
>
> -Dai
>
	if (fl->fl_lmops && fl->fl_lmops->lm_breaker_timedout)
		fl->fl_lmops->lm_breaker_timedout(fl);
	else
		lease_modify(fl, F_UNLCK, dispose);
... the lm_breaker_timedout() function (or the workqueue job it spawns)
is then responsible for calling lease_modify() at the appropriate time.
In most cases you're going to need to queue a workqueue job to fence,
and you don't want to clear out the lease until after it's done.
> >
> > > + }
> > > }
> > > }
> > >
> > > diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c
> > > index 7ba9e2dd0875..69d3889df302 100644
> > > --- a/fs/nfsd/blocklayout.c
> > > +++ b/fs/nfsd/blocklayout.c
> > > @@ -443,6 +443,14 @@ nfsd4_scsi_proc_layoutcommit(struct inode *inode, struct svc_rqst *rqstp,
> > > return nfsd4_block_commit_blocks(inode, lcp, iomaps, nr_iomaps);
> > > }
> > >
> > > +/*
> > > + * Perform the fence operation to prevent the client from accessing the
> > > + * block device. If a fence operation is already in progress, wait for
> > > + * it to complete before checking the NFSD_MDS_PR_FENCED flag. Once the
> > > + * operation is complete, check the flag. If NFSD_MDS_PR_FENCED is set,
> > > + * update the layout stateid by setting the ls_fenced flag to indicate
> > > + * that the client has been fenced.
> > > + */
> > > static void
> > > nfsd4_scsi_fence_client(struct nfs4_layout_stateid *ls, struct nfsd_file *file)
> > > {
> > > @@ -450,31 +458,23 @@ nfsd4_scsi_fence_client(struct nfs4_layout_stateid *ls, struct nfsd_file *file)
> > > struct block_device *bdev = file->nf_file->f_path.mnt->mnt_sb->s_bdev;
> > > int status;
> > >
> > > - if (nfsd4_scsi_fence_set(clp, bdev->bd_dev))
> > > + mutex_lock(&clp->cl_fence_mutex);
> > > + if (nfsd4_scsi_fence_set(clp, bdev->bd_dev)) {
> > > + ls->ls_fenced = true;
> > > + mutex_unlock(&clp->cl_fence_mutex);
> > > + nfs4_put_stid(&ls->ls_stid);
> > > return;
> > > + }
> > >
> > I don't understand what this new mutex is protecting, and this all
> > seems overly complex. It feels kind of like you want nfsd to be driving
> > the fencing retries, but I don't think we really do. Here's what I'd
> > do.
> >
> > I'd just make ->fence_client a bool or int return, and have it indicate
> > whether the client was successfully fenced or not. If it was
> > successfully fenced, then have the caller call lease_modify() to remove
> > the lease. If it wasn't successfully fenced, have the caller (the
> > workqueue job) requeue itself if you want to retry. If the caller is
> > ready to give up, then call lease_modify() on it and remove it (and
> > probably throw a pr_warn()).
> >
> > > status = bdev->bd_disk->fops->pr_ops->pr_preempt(bdev, NFSD_MDS_PR_KEY,
> > > nfsd4_scsi_pr_key(clp),
> > > PR_EXCLUSIVE_ACCESS_REG_ONLY, true);
> > > - /*
> > > - * Reset to allow retry only when the command could not have
> > > - * reached the device. Negative status means a local error
> > > - * (e.g., -ENOMEM) prevented the command from being sent.
> > > - * PR_STS_PATH_FAILED, PR_STS_PATH_FAST_FAILED, and
> > > - * PR_STS_RETRY_PATH_FAILURE indicate transport path failures
> > > - * before device delivery.
> > > - *
> > > - * For all other errors, the command may have reached the device
> > > - * and the preempt may have succeeded. Avoid resetting, since
> > > - * retrying a successful preempt returns PR_STS_IOERR or
> > > - * PR_STS_RESERVATION_CONFLICT, which would cause an infinite
> > > - * retry loop.
> > > - */
> > > - if (status < 0 ||
> > > - status == PR_STS_PATH_FAILED ||
> > > - status == PR_STS_PATH_FAST_FAILED ||
> > > - status == PR_STS_RETRY_PATH_FAILURE)
> > > + if (status)
> > > nfsd4_scsi_fence_clear(clp, bdev->bd_dev);
> > > + else
> > > + ls->ls_fenced = true;
> > > + mutex_unlock(&clp->cl_fence_mutex);
> > > + nfs4_put_stid(&ls->ls_stid);
> > >
> > > trace_nfsd_pnfs_fence(clp, bdev->bd_disk->disk_name, status);
> > > }
> > > diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> > > index ad7af8cfcf1f..1c498f3cd059 100644
> > > --- a/fs/nfsd/nfs4layouts.c
> > > +++ b/fs/nfsd/nfs4layouts.c
> > > @@ -222,6 +222,29 @@ nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
> > > return 0;
> > > }
> > >
> > > +static void
> > > +nfsd4_layout_fence_worker(struct work_struct *work)
> > > +{
> > > + struct nfsd_file *nf;
> > > + struct delayed_work *dwork = to_delayed_work(work);
> > > + struct nfs4_layout_stateid *ls = container_of(dwork,
> > > + struct nfs4_layout_stateid, ls_fence_work);
> > > + u32 type;
> > > +
> > > + rcu_read_lock();
> > > + nf = nfsd_file_get(ls->ls_file);
> > > + rcu_read_unlock();
> > > + if (!nf) {
> > > + nfs4_put_stid(&ls->ls_stid);
> > > + return;
> > > + }
> > > +
> > > + type = ls->ls_layout_type;
> > > + if (nfsd4_layout_ops[type]->fence_client)
> > > + nfsd4_layout_ops[type]->fence_client(ls, nf);
> > If you make fence_client an int/bool return, then you could just
> > requeue this job to try it again.
> >
> > > + nfsd_file_put(nf);
> > > +}
> > > +
> > > static struct nfs4_layout_stateid *
> > > nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
> > > struct nfs4_stid *parent, u32 layout_type)
> > > @@ -271,6 +294,10 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
> > > list_add(&ls->ls_perfile, &fp->fi_lo_states);
> > > spin_unlock(&fp->fi_lock);
> > >
> > > + INIT_DELAYED_WORK(&ls->ls_fence_work, nfsd4_layout_fence_worker);
> > > + ls->ls_fenced = false;
> > > + ls->ls_fence_retry_cnt = 0;
> > > +
> > > trace_nfsd_layoutstate_alloc(&ls->ls_stid.sc_stateid);
> > > return ls;
> > > }
> > > @@ -708,9 +735,10 @@ nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
> > > rcu_read_unlock();
> > > if (fl) {
> > > ops = nfsd4_layout_ops[ls->ls_layout_type];
> > > - if (ops->fence_client)
> > > + if (ops->fence_client) {
> > > + refcount_inc(&ls->ls_stid.sc_count);
> > > ops->fence_client(ls, fl);
> > > - else
> > > + } else
> > > nfsd4_cb_layout_fail(ls, fl);
> > > nfsd_file_put(fl);
> > > }
> > > @@ -747,11 +775,9 @@ static bool
> > > nfsd4_layout_lm_break(struct file_lease *fl)
> > > {
> > > /*
> > > - * We don't want the locks code to timeout the lease for us;
> > > - * we'll remove it ourself if a layout isn't returned
> > > - * in time:
> > > + * Enforce break lease timeout to prevent NFSD
> > > + * thread from hanging in __break_lease.
> > > */
> > > - fl->fl_break_time = 0;
> > > nfsd4_recall_file_layout(fl->c.flc_owner);
> > > return false;
> > > }
> > > @@ -782,10 +808,51 @@ nfsd4_layout_lm_open_conflict(struct file *filp, int arg)
> > > return 0;
> > > }
> > >
> > > +/**
> > > + * nfsd4_layout_lm_breaker_timedout - The layout recall has timed out.
> > > + * If the layout type supports a fence operation, schedule a worker to
> > > + * fence the client from accessing the block device.
> > > + *
> > > + * @fl: file to check
> > > + *
> > > + * Return true if the file lease should be disposed of by the caller;
> > > + * otherwise, return false.
> > > + */
> > > +static bool
> > > +nfsd4_layout_lm_breaker_timedout(struct file_lease *fl)
> > > +{
> > > + struct nfs4_layout_stateid *ls = fl->c.flc_owner;
> > > + bool ret;
> > > +
> > > + if (!nfsd4_layout_ops[ls->ls_layout_type]->fence_client)
> > > + return true;
> > > + if (ls->ls_fenced || ls->ls_fence_retry_cnt >= LO_MAX_FENCE_RETRY)
> > > + return true;
> > > +
> > > + if (work_busy(&ls->ls_fence_work.work))
> > > + return false;
> > > + /* Schedule work to do the fence operation */
> > > + ret = mod_delayed_work(system_dfl_wq, &ls->ls_fence_work, 0);
> > > + if (!ret) {
> > > + /*
> > > + * If there is no pending work, mod_delayed_work queues
> > > + * new task. While fencing is in progress, a reference
> > > + * count is added to the layout stateid to ensure its
> > > + * validity. This reference count is released once fencing
> > > + * has been completed.
> > > + */
> > > + refcount_inc(&ls->ls_stid.sc_count);
> > > + ++ls->ls_fence_retry_cnt;
> > > + return true;
> > The cases where the fencing didn't work after too many retries, or the
> > job couldn't be queued should probably get a pr_warn or something. The
> > admin needs to know that data corruption is possible and that they
> > might need to nuke the client manually.
> >
> > > + }
> > > + return false;
> > > +}
> > > +
> > > static const struct lease_manager_operations nfsd4_layouts_lm_ops = {
> > > .lm_break = nfsd4_layout_lm_break,
> > > .lm_change = nfsd4_layout_lm_change,
> > > .lm_open_conflict = nfsd4_layout_lm_open_conflict,
> > > + .lm_breaker_timedout = nfsd4_layout_lm_breaker_timedout,
> > > };
> > >
> > > int
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index 583c13b5aaf3..a57fa3318362 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -2385,6 +2385,7 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
> > > #endif
> > > #ifdef CONFIG_NFSD_SCSILAYOUT
> > > xa_init(&clp->cl_dev_fences);
> > > + mutex_init(&clp->cl_fence_mutex);
> > > #endif
> > > INIT_LIST_HEAD(&clp->async_copies);
> > > spin_lock_init(&clp->async_lock);
> > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> > > index 713f55ef6554..57e54dfb406c 100644
> > > --- a/fs/nfsd/state.h
> > > +++ b/fs/nfsd/state.h
> > > @@ -529,6 +529,7 @@ struct nfs4_client {
> > > time64_t cl_ra_time;
> > > #ifdef CONFIG_NFSD_SCSILAYOUT
> > > struct xarray cl_dev_fences;
> > > + struct mutex cl_fence_mutex;
> > > #endif
> > > };
> > >
> > > @@ -738,8 +739,13 @@ struct nfs4_layout_stateid {
> > > stateid_t ls_recall_sid;
> > > bool ls_recalled;
> > > struct mutex ls_mutex;
> > > + struct delayed_work ls_fence_work;
> > > + bool ls_fenced;
> > > + int ls_fence_retry_cnt;
> > > };
> > >
> > > +#define LO_MAX_FENCE_RETRY 5
> > > +
> > > static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
> > > {
> > > return container_of(s, struct nfs4_layout_stateid, ls_stid);
> > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > index 2f5e5588ee07..13b9c9f04589 100644
> > > --- a/include/linux/filelock.h
> > > +++ b/include/linux/filelock.h
> > > @@ -50,6 +50,7 @@ struct lease_manager_operations {
> > > void (*lm_setup)(struct file_lease *, void **);
> > > bool (*lm_breaker_owns_lease)(struct file_lease *);
> > > int (*lm_open_conflict)(struct file *, int);
> > > + bool (*lm_breaker_timedout)(struct file_lease *fl);
> > > };
> > >
> > > struct lock_manager {
--
Jeff Layton <jlayton@kernel.org>