From: Dai Ngo <dai.ngo@oracle.com>
To: Benjamin Coddington <bcodding@hammerspace.com>
Cc: chuck.lever@oracle.com, jlayton@kernel.org, neilb@ownmail.net,
okorniev@redhat.com, tom@talpey.com, hch@lst.de,
alex.aring@gmail.com, viro@zeniv.linux.org.uk,
brauner@kernel.org, jack@suse.cz, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org
Subject: Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts
Date: Tue, 11 Nov 2025 07:24:27 -0800 [thread overview]
Message-ID: <e38104d7-1c2e-4791-aa78-d4458233dcb6@oracle.com> (raw)
In-Reply-To: <ADB5A1E8-F1BF-4B42-BD77-96C57B135305@hammerspace.com>
Hi Ben,
On 11/9/25 10:34 AM, Benjamin Coddington wrote:
> On 6 Nov 2025, at 12:05, Dai Ngo wrote:
>
>> When a layout conflict triggers a call to __break_lease, the function
>> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
>> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
>> its loop, waiting indefinitely for the conflicting file lease to be
>> released.
>>
>> If the number of lease conflicts matches the number of NFSD threads (which
>> defaults to 8), all available NFSD threads become occupied. Consequently,
>> there are no threads left to handle incoming requests or callback replies,
>> leading to a total hang of the NFS server.
>>
>> This issue is reliably reproducible by running the Git test suite on a
>> configuration using SCSI layout.
>>
>> This patchset fixes this problem by introducing the new lm_breaker_timedout
>> operation to lease_manager_operations and using timeout for layout
>> lease break.
> Hey Dai,
>
> I like your solution here, but I worry it can cause unexpected or
> unnecessary client fencing when the problem is server-side (not enough
> threads). Clients might be dutifully sending LAYOUTRETURN, but the server
> can't service them
I agreed. This is a server problem and we penalize the client. We need
a long term solution for dealing resource shortage (server threads)
problem.
Fortunately, the client can detect reservation conflict errors and appears
to retry the I/O. Also, the client will ask for new layout and in the
process it re-registers its reservation key so I/O will continue.
> - and this change will cause some potentially unexpected
> fencing in environments where things could be fixed (by adding more knfsd
> threads).
> Also, I think we significantly bumped default thread counts
> recently in nfs-utils:
> eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16
This helps a bit but if there is always a chance that there is a load
that requires more than the number of server threads.
>
> You probably have already seen previous discussions about this:
> https://urldefense.com/v3/__https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/__;!!ACWV5N9M2RV99hQ!Pq4vHQs-qk71XjZ0vOkONTD7nxkuyUUEKTBsJJ0L_OrFWudokphCyc2V0q0_OrNoGD3KnsgoHKp7rb_lDcs$
>
> This also changes the behavior for all layouts, I haven't thought through
> the implications of that - but I wish we could have knob for this behavior,
> or perhaps a knfsd-specific fl_break_time tuneable.
There is already a knob to tune the fl_break_time:
# cat /proc/sys/fs/lease-break-time
but currently lease-break-time is in seconds so the minimum we can set
is 1 which I think is still too long to tight up a server thread.
>
> Last thought (for now): I think Neil has some work for dynamic knfsd thread
> count.. or Jeff? (I am having trouble finding it) Would that work around
> this problem?
This would help, and I prefer this route rather than rework __break_lease
to return EAGAIN/jukebox while the server recalling the layout.
Thank you for your feedback,
-Dai
>
> Regards,
> Ben
>
next prev parent reply other threads:[~2025-11-11 15:45 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-06 17:05 [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
2025-11-06 17:05 ` [PATCH 1/2] locks: Introduce lm_breaker_timedout op to lease_manager_operations Dai Ngo
2025-11-07 13:26 ` Christoph Hellwig
2025-11-07 16:58 ` Dai Ngo
2025-11-06 17:05 ` [PATCH 2/2] NFSD: Fix server hang when there are multiple layout conflicts Dai Ngo
2025-11-07 13:29 ` Christoph Hellwig
2025-11-07 17:01 ` Dai Ngo
2025-11-07 13:30 ` [Patch 0/2] " Christoph Hellwig
2025-11-09 18:34 ` Benjamin Coddington
2025-11-11 15:24 ` Dai Ngo [this message]
2025-11-11 15:34 ` Chuck Lever
2025-11-11 15:36 ` Christoph Hellwig
2025-11-11 15:43 ` Dai Ngo
2025-11-11 15:53 ` Chuck Lever
2025-11-11 15:45 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e38104d7-1c2e-4791-aa78-d4458233dcb6@oracle.com \
--to=dai.ngo@oracle.com \
--cc=alex.aring@gmail.com \
--cc=bcodding@hammerspace.com \
--cc=brauner@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jlayton@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@ownmail.net \
--cc=okorniev@redhat.com \
--cc=tom@talpey.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).