From: Chuck Lever <chuck.lever@oracle.com>
To: Tom Talpey <tom@talpey.com>,
    Jeff Layton via Bugspray Bot <bugbot@kernel.org>,
    cel@kernel.org, trondmy@kernel.org, carnil@debian.org,
    anna@kernel.org, linux-nfs@vger.kernel.org, herzog@phys.ethz.ch,
    harald.dunkel@aixigo.com, jlayton@kernel.org,
    baptiste.pellegrin@ac-grenoble.fr,
    benoit.gschwind@minesparis.psl.eu
Subject: Re: NFSD threads hang when destroying a session or client ID
Date: Tue, 21 Jan 2025 14:43:02 -0500
Message-ID: <5243289d-12d3-403b-847d-491d9fe66af4@oracle.com>
In-Reply-To: <cf8650cb-1d2e-4771-981a-d66d2c455637@talpey.com>
On 1/21/25 2:38 PM, Tom Talpey wrote:
> On 1/21/2025 12:35 PM, Jeff Layton via Bugspray Bot wrote:
>> Jeff Layton writes via Kernel.org Bugzilla:
>>
>> (In reply to Chuck Lever from comment #7)
>>> The trace captures I've reviewed suggest that a callback session is
>>> in use,
>>> so I would say the NFS minor version is 1 or higher. Perhaps it's not
>>> the
>>> RPC_SIGNALLED test above that is the problem, but the one later in
>>> nfsd4_cb_sequence_done().
>>
>>
>> Ok, good. Knowing that it's not v4.0 allows us to rule out some
>> codepaths.
>> There are a couple of other cases where we goto need_restart:
>>
>> The NFS4ERR_BADSESSION case does this, and also if it doesn't get a
>> reply at all (case 1).
>
> Note that one thread in Benoît's recent logs is stuck in
> nfsd4_bind_conn_to_session(), and three threads also in
> nfsd4_destroy_session(), so there is certainly some
> session/connection dance going on. Combining an invalid
> replay cache entry could easily make things worse.
Yes, the client returns RETRY_UNCACHED_REP for some of the backchannel
operations. NFSD never asserts cachethis in CB_SEQUENCE. I'm trying to
understand why NFSD would skip incrementing its slot sequence number.
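For reference, a minimal sketch of the per-slot sequence check that the replying side applies under RFC 8881's slot and reply cache rules. The enum, struct, and function names are hypothetical and exist only for this illustration; this is not the Linux client's or NFSD's code, and the result codes are not the on-the-wire values. It shows why a sender that never asserts cachethis and reuses a slot's last sequence ID ends up with RETRY_UNCACHED_REP:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not the on-the-wire values. */
enum slot_result {
	SLOT_NEW_REQUEST,
	SLOT_REPLAY_CACHED,
	SLOT_RETRY_UNCACHED_REP,
	SLOT_SEQ_MISORDERED,
};

/* Hypothetical per-slot state kept by the side that replies. */
struct slot_state {
	uint32_t last_seqid;	/* sequence ID of the last request executed */
	bool reply_cached;	/* true only if that request asked for caching */
};

/*
 * Classify an incoming request on one slot:
 *   last_seqid + 1  -> new request, executed and remembered
 *   last_seqid      -> retry; answerable only if a reply was cached
 *   anything else   -> misordered
 */
static enum slot_result check_slot(struct slot_state *slot, uint32_t seqid,
				   bool cachethis)
{
	if (seqid == slot->last_seqid + 1) {
		slot->last_seqid = seqid;
		slot->reply_cached = cachethis;
		return SLOT_NEW_REQUEST;
	}
	if (seqid == slot->last_seqid)
		return slot->reply_cached ? SLOT_REPLAY_CACHED
					  : SLOT_RETRY_UNCACHED_REP;
	return SLOT_SEQ_MISORDERED;
}
```

In this model, sending the same sequence ID twice on a slot with cachethis false yields SLOT_RETRY_UNCACHED_REP on the second attempt, which is the symptom one would expect if the sender skipped incrementing its slot sequence number.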
> There's also one thread in nfsd4_destroy_clientid(), which
> seems important, but odd. And finally, the laundromat is
> running. No shortage of races!
The hangs are all related here: they are waiting for flush_workqueue()
on the callback workqueue. In v6.1, there is only one callback_wq and
its max_active is 1. If the current work item hangs, then that work
queue stalls.
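A toy model of that stall, in plain userspace C rather than the kernel workqueue API (every name below is hypothetical). It assumes only what is stated above: with max_active of 1, items run strictly one at a time, so a flush can return only once everything queued has finished. One item that never completes is enough to block it:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued callback work item. */
struct toy_work {
	bool completes;	/* false models a work item that hangs forever */
};

/*
 * Model a queue limited to one active item: walk the items in order and
 * stop at the first one that never completes. Returns how many items
 * finished.
 */
static size_t toy_run_queue(const struct toy_work *items, size_t n)
{
	size_t done = 0;

	while (done < n && items[done].completes)
		done++;
	return done;
}

/* A flush waits for every queued item, so one hung item blocks it. */
static bool toy_flush_would_return(const struct toy_work *items, size_t n)
{
	return toy_run_queue(items, n) == n;
}
```

With items { ok, hung, ok }, only the first item finishes and the flush never returns, which matches threads piling up behind the callback workqueue.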
> Tom.
>
>
>> There is also this that looks a little sketchy:
>>
>> ------------8<-------------------
>>         trace_nfsd_cb_free_slot(task, cb);
>>         nfsd41_cb_release_slot(cb);
>>
>>         if (RPC_SIGNALLED(task))
>>                 goto need_restart;
>> out:
>>         return ret;
>> retry_nowait:
>>         if (rpc_restart_call_prepare(task))
>>                 ret = false;
>>         goto out;
>> need_restart:
>>         if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>>                 trace_nfsd_cb_restart(clp, cb);
>>                 task->tk_status = 0;
>>                 cb->cb_need_restart = true;
>>         }
>>         return false;
>> ------------8<-------------------
>>
>> Probably not the same bug, but it looks like if RPC_SIGNALLED returns
>> true, then it'll restart the RPC after releasing the slot. It seems
>> like that could break the reply cache handling, as the restarted call
>> could be on a different slot. I'll look at patching that, at least,
>> though I'm not sure it's related to the hang.
>>
>> More notes. The only way RPC_TASK_SIGNALLED gets set is:
>>
>>     nfsd4_process_cb_update()
>>         rpc_shutdown_client()
>>             rpc_killall_tasks()
>>
>> That gets called if:
>>
>>         if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
>>                 nfsd4_process_cb_update(cb);
>>
>> Which means that NFSD4_CLIENT_CB_UPDATE was probably set?
>> NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the
>> cb_need_restart handling.
>>
>> View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
>> You can reply to this message to join the discussion.
>
--
Chuck Lever