From: Chuck Lever <chuck.lever@oracle.com>
To: Tom Talpey <tom@talpey.com>,
	Jeff Layton via Bugspray Bot <bugbot@kernel.org>,
	cel@kernel.org, trondmy@kernel.org, carnil@debian.org,
	anna@kernel.org, linux-nfs@vger.kernel.org, herzog@phys.ethz.ch,
	harald.dunkel@aixigo.com, jlayton@kernel.org,
	baptiste.pellegrin@ac-grenoble.fr,
	benoit.gschwind@minesparis.psl.eu
Subject: Re: NFSD threads hang when destroying a session or client ID
Date: Tue, 21 Jan 2025 14:43:02 -0500
Message-ID: <5243289d-12d3-403b-847d-491d9fe66af4@oracle.com>
In-Reply-To: <cf8650cb-1d2e-4771-981a-d66d2c455637@talpey.com>

On 1/21/25 2:38 PM, Tom Talpey wrote:
> On 1/21/2025 12:35 PM, Jeff Layton via Bugspray Bot wrote:
>> Jeff Layton writes via Kernel.org Bugzilla:
>>
>> (In reply to Chuck Lever from comment #7)
>>> The trace captures I've reviewed suggest that a callback session is
>>> in use, so I would say the NFS minor version is 1 or higher. Perhaps
>>> it's not the RPC_SIGNALLED test above that is the problem, but the
>>> one later in nfsd4_cb_sequence_done().
>>
>>
>> Ok, good. Knowing that it's not v4.0 allows us to rule out some
>> codepaths. There are a couple of other cases where we goto
>> need_restart:
>>
>> The NFS4ERR_BADSESSION case does this, and also if it doesn't get a
>> reply at all (case 1).
> 
> Note that one thread in Benoît's recent logs is stuck in
> nfsd4_bind_conn_to_session(), and three threads also in
> nfsd4_destroy_session(), so there is certainly some
> session/connection dance going on. Combining that with an
> invalid replay cache entry could easily make things worse.

Yes, the client returns RETRY_UNCACHED_REP for some of the backchannel
operations. NFSD never asserts cachethis in CB_SEQUENCE. I'm trying to
understand why NFSD would skip incrementing its slot sequence number.
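
For readers following along, here is a minimal userspace sketch (not
kernel code) of how a replier is generally expected to check a slot's
sequence number under the NFSv4.1 session rules. It is meant to show
why a retransmission with an unchanged sequence number and no cached
reply ends in RETRY_UNCACHED_REP. The struct, enum, and function names
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch (not protocol values). */
enum slot_result {
	SLOT_NEW_REQUEST,        /* seqid is last + 1: execute the request */
	SLOT_REPLAY_CACHED,      /* seqid is unchanged, reply was cached */
	SLOT_RETRY_UNCACHED_REP, /* seqid is unchanged, nothing was cached */
	SLOT_SEQ_MISORDERED,     /* anything else */
};

/* Hypothetical replier-side state for one session slot. */
struct slot {
	uint32_t seqid;        /* sequence ID of the last executed request */
	bool     reply_cached; /* did that request ask for its reply to be cached? */
};

/*
 * Classify an incoming request on a slot. Only a request that carries
 * the next sequence ID advances the slot.
 */
enum slot_result slot_check(struct slot *slot, uint32_t seqid, bool cachethis)
{
	assert(slot != NULL);

	if (seqid == slot->seqid)
		return slot->reply_cached ? SLOT_REPLAY_CACHED
					  : SLOT_RETRY_UNCACHED_REP;

	if (seqid == (uint32_t)(slot->seqid + 1)) {
		slot->seqid = seqid;
		slot->reply_cached = cachethis;
		return SLOT_NEW_REQUEST;
	}

	return SLOT_SEQ_MISORDERED;
}
```

In this model, a sender that never sets cachethis and then reuses a
slot without bumping its sequence number always lands in the
SLOT_RETRY_UNCACHED_REP branch.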


> There's also one thread in nfsd4_destroy_clientid(), which
> seems important, but odd. And finally, the laundromat is
> running. No shortage of races!

The hangs are all related here: they are waiting for flush_workqueue()
on the callback workqueue. In v6.1, there is only one callback_wq and
its max_active is 1. If the current work item hangs, then that work
queue stalls.
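
To make the stall concrete, here is a small userspace model (not the
kernel workqueue API) of a queue that runs one item at a time, in
order. It assumes a flush returns only after every queued item has
completed. The types and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a work item: does its handler ever return? */
struct work_item {
	const char *name;
	bool hangs;
};

/*
 * Model a queue with max_active == 1: items run strictly one at a
 * time, in queue order. Returns how many items complete before the
 * queue stalls; this equals nr_items when no item hangs.
 */
size_t run_ordered_queue(const struct work_item *items, size_t nr_items)
{
	size_t done = 0;

	assert(items != NULL || nr_items == 0);

	while (done < nr_items && !items[done].hangs)
		done++;

	return done;
}

/* In this model, a flush waiter returns only if every item completed. */
bool flush_would_return(const struct work_item *items, size_t nr_items)
{
	return run_ordered_queue(items, nr_items) == nr_items;
}
```

With a queue of three items where the second one hangs, only the first
completes and flush_would_return() is false. Every thread waiting on
the flush is blocked behind that one item, whichever operation it was
handling when it started waiting.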


> Tom.
> 
> 
>> There is also this that looks a little sketchy:
>>
>> ------------8<-------------------
>>          trace_nfsd_cb_free_slot(task, cb);
>>          nfsd41_cb_release_slot(cb);
>>
>>          if (RPC_SIGNALLED(task))
>>                  goto need_restart;
>> out:
>>          return ret;
>> retry_nowait:
>>          if (rpc_restart_call_prepare(task))
>>                  ret = false;
>>          goto out;
>> need_restart:
>>          if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>>                  trace_nfsd_cb_restart(clp, cb);
>>                  task->tk_status = 0;
>>                  cb->cb_need_restart = true;
>>          }
>>          return false;
>> ------------8<-------------------
>>
>> Probably not the same bug, but it looks like if RPC_SIGNALLED returns
>> true, then it'll restart the RPC after releasing the slot. It seems
>> like that could break the reply cache handling, as the restarted call
>> could be on a different slot. I'll look at patching that, at least,
>> though I'm not sure it's related to the hang.
>>
>> More notes. The only way RPC_TASK_SIGNALLED gets set is:
>>
>>     nfsd4_process_cb_update()
>>        rpc_shutdown_client()
>>            rpc_killall_tasks()
>>
>> That gets called if:
>>
>>          if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
>>                  nfsd4_process_cb_update(cb);
>>
>> Which means that NFSD4_CLIENT_CB_UPDATE was probably set? 
>> NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the 
>> cb_need_restart handling.
>>
>> View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
>> You can reply to this message to join the discussion.
> 


-- 
Chuck Lever
