* NFSD threads hang when destroying a session or client ID
@ 2025-01-20 15:00 Chuck Lever via Bugspray Bot
2025-01-20 15:14 ` Chuck Lever
` (19 more replies)
0 siblings, 20 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-20 15:00 UTC (permalink / raw)
To: trondmy, cel, linux-nfs, jlayton, anna
Chuck Lever writes via Kernel.org Bugzilla:
On recent v6.1.y, intermittently, NFSD threads wait forever for NFSv4 callback to shutdown. The wait is in __flush_workqueue(). A server system reboot is necessary to recover.
On new kernels, similar symptoms but the indefinite wait is in the "destroy client" path, waiting for NFSv4 callback shutdown. The wait is on the wait_var_event() in nfsd41_cb_inflight_wait_complete().
In some cases, clients suspend (inactivity). The server converts them to courteous clients. The NFSv4 callback shutdown workqueue item for that client appears to be stuck waiting in rpc_shutdown_client().
Let's collect data under this bug report.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c0
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
@ 2025-01-20 15:14 ` Chuck Lever
2025-01-20 15:25 ` Chuck Lever via Bugspray Bot
` (18 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2025-01-20 15:14 UTC (permalink / raw)
To: Salvatore Bonaccorso, Harald Dunkel, Christian Herzog,
Pellegrin Baptiste, Benoît Gschwind
Cc: trondmy, cel, jlayton, anna, linux-nfs
On 1/20/25 10:00 AM, Chuck Lever via Bugspray Bot wrote:
> Chuck Lever writes via Kernel.org Bugzilla:
>
> On recent v6.1.y, intermittently, NFSD threads wait forever for NFSv4 callback to shutdown. The wait is in __flush_workqueue(). A server system reboot is necessary to recover.
>
> On new kernels, similar symptoms but the indefinite wait is in the "destroy client" path, waiting for NFSv4 callback shutdown. The wait is on the wait_var_event() in nfsd41_cb_inflight_wait_complete().
>
> In some cases, clients suspend (inactivity). The server converts them to courteous clients. The NFSv4 callback shutdown workqueue item for that client appears to be stuck waiting in rpc_shutdown_client().
>
> Let's collect data under this bug report.
>
> View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c0
> You can reply to this message to join the discussion.
I've created a bugzilla.kernel.org report (see link above) so we can
collect data in a common place. New comments on that bug are reflected
to linux-nfs@vger.kernel.org.
Instead of sending attachments to me or the list, please attach your
data to this bug.
--
Chuck Lever
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
2025-01-20 15:14 ` Chuck Lever
@ 2025-01-20 15:25 ` Chuck Lever via Bugspray Bot
2025-01-20 15:40 ` Chuck Lever via Bugspray Bot
` (17 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-20 15:25 UTC (permalink / raw)
To: cel, chuck.lever, harald.dunkel, jlayton, baptiste.pellegrin,
benoit.gschwind, carnil, linux-nfs, anna, trondmy, herzog
Chuck Lever writes via Kernel.org Bugzilla:
During server operation, you can capture trace events that might lead up to the hang with this command:
# trace-cmd record -e nfsd:nfsd_cb_\* -e sunrpc:svc_xprt_detach -p function -l nfsd4_destroy_session
These events are infrequent and tracing is lightweight, so this capture won't be a burden.
When a server hang is observed, an "INFO: task ... blocked for more than 120 seconds" message typically appears in the server's system journal. Basic collection instructions are:
# rpcdebug -m rpc -c && echo t > /proc/sysrq-trigger
Then stop the trace and package trace.dat and the system journal and attach them to this bug.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c2
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
2025-01-20 15:14 ` Chuck Lever
2025-01-20 15:25 ` Chuck Lever via Bugspray Bot
@ 2025-01-20 15:40 ` Chuck Lever via Bugspray Bot
2025-01-20 19:00 ` Chuck Lever via Bugspray Bot
` (16 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-20 15:40 UTC (permalink / raw)
To: chuck.lever, harald.dunkel, anna, carnil, herzog, jlayton,
linux-nfs, trondmy, baptiste.pellegrin, cel, benoit.gschwind
Chuck Lever writes via Kernel.org Bugzilla:
v6.1 symptomology:
The following stack trace shows that the active callback workqueue item is stuck in rpc_shutdown_client() waiting for RPC operations to complete. Because the callback workqueue is single-threaded (max_active = 1) and this work item never completes, it prevents subsequent flush_workqueue() calls on the callback work queue from completing.
> [7257352.347503] task:kworker/u32:6 state:D stack:0 pid:173983 ppid:2 flags:0x00004000
> [7257352.347511] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
> [7257352.347568] Call Trace:
> [7257352.347571] <TASK>
> [7257352.347577] __schedule+0x34d/0x9e0
> [7257352.347589] schedule+0x5a/0xd0
> [7257352.347597] schedule_timeout+0x94/0x150
> [7257352.347606] ? __bpf_trace_tick_stop+0x10/0x10
> [7257352.347616] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> [7257352.347683] ? cpuusage_read+0x10/0x10
> [7257352.347694] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> [7257352.347763] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> [7257352.347812] process_one_work+0x1c7/0x380
> [7257352.347824] worker_thread+0x4d/0x380
> [7257352.347835] ? rescuer_thread+0x3a0/0x3a0
> [7257352.347843] kthread+0xda/0x100
> [7257352.347849] ? kthread_complete_and_exit+0x20/0x20
> [7257352.347859] ret_from_fork+0x22/0x30
> [7257352.347875] </TASK>
Based on the full stack trace output, there do not appear to be any outstanding RPC operations.
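The single-threaded-workqueue stall described above can be modeled outside the kernel. Below is a minimal sketch, plain Python rather than kernel code; the names are only an analogy to callback_wq and nfsd4_run_cb_work, and a ThreadPoolExecutor with one worker stands in for a workqueue with max_active = 1:

```python
import concurrent.futures
import threading

# Model of a single-threaded workqueue (max_active = 1): one stuck work
# item blocks every item queued behind it, so a "flush" cannot complete.
wq = concurrent.futures.ThreadPoolExecutor(max_workers=1)

release = threading.Event()   # stands in for the hung RPC finally completing
drained = threading.Event()   # set when the item queued behind the stuck one runs

def stuck_item():
    # Analogous to nfsd4_run_cb_work blocked in rpc_shutdown_client().
    release.wait()

wq.submit(stuck_item)
wq.submit(drained.set)

# While the first item is stuck, the queue does not drain:
assert not drained.wait(timeout=0.5)

# Once the stuck item is released, the queue drains normally:
release.set()
assert drained.wait(timeout=5)
wq.shutdown()
```

In the kernel the "release" never happens, which is why flush_workqueue() callers wait forever and only a reboot recovers.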
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c3
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (2 preceding siblings ...)
2025-01-20 15:40 ` Chuck Lever via Bugspray Bot
@ 2025-01-20 19:00 ` Chuck Lever via Bugspray Bot
2025-01-20 20:35 ` Baptiste PELLEGRIN via Bugspray Bot
` (15 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-20 19:00 UTC (permalink / raw)
To: chuck.lever, carnil, baptiste.pellegrin, anna, jlayton, herzog,
trondmy, cel, harald.dunkel, linux-nfs, benoit.gschwind
Chuck Lever writes via Kernel.org Bugzilla:
origin/linux-6.1.y is missing this commit:
> 961b4b5e86bf ("NFSD: Reset cb_seq_status after NFS4ERR_DELAY")
which is related to the problem reported in this bug, though it might not be a fix. This branch is based on 6.1.126 and has 961b4b5 applied to it for easy testing.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=nfsd-6.1.y
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c4
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (3 preceding siblings ...)
2025-01-20 19:00 ` Chuck Lever via Bugspray Bot
@ 2025-01-20 20:35 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 14:40 ` Jeff Layton via Bugspray Bot
` (14 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-01-20 20:35 UTC (permalink / raw)
To: linux-nfs, benoit.gschwind, jlayton, carnil, baptiste.pellegrin,
herzog, anna, cel, harald.dunkel, trondmy, chuck.lever
Baptiste PELLEGRIN added an attachment on Kernel.org Bugzilla:
Created attachment 307511
Recorded trace 20250120 (syslog + trace-cmd)
Sorry, no rpcdebug logs. I will try to record them at the next crash.
File: record_nfsdc_20250120.tar.gz (application/gzip)
Size: 2.93 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307511
---
Recorded trace 20250120 (syslog + trace-cmd)
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (4 preceding siblings ...)
2025-01-20 20:35 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-21 14:40 ` Jeff Layton via Bugspray Bot
2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
` (13 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton via Bugspray Bot @ 2025-01-21 14:40 UTC (permalink / raw)
To: baptiste.pellegrin, harald.dunkel, linux-nfs, carnil,
benoit.gschwind, herzog, anna, chuck.lever, jlayton, trondmy, cel
Jeff Layton writes via Kernel.org Bugzilla:
I looked at this with Chuck the other day. As far as wait_var_event() getting stuck, I think all that would take is for nfsd4_cb_sequence_done() to continually set cb_need_restart on every call. That would cause the callback to not be destroyed and to never call nfsd41_cb_inflight_end().
That happens at the need_restart: label in nfsd4_cb_sequence_done(). These two cases goto that label:
        if (!clp->cl_minorversion) {
                /*
                 * If the backchannel connection was shut down while this
                 * task was queued, we need to resubmit it after setting up
                 * a new backchannel connection.
                 *
                 * Note that if we lost our callback connection permanently
                 * the submission code will error out, so we don't need to
                 * handle that case here.
                 */
                if (RPC_SIGNALLED(task))
                        goto need_restart;
                return true;
        }

        if (cb->cb_held_slot < 0)
                goto need_restart;
It doesn't seem likely that it somehow lost the slot, so my guess is that the RPC task is continually returning with RPC_SIGNALLED() set.
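As a hedged illustration of that failure mode, the inflight accounting can be sketched in a few lines of plain Python. This is a simplified model, not kernel code; the names merely mirror nfsd4_cb_sequence_done() and nfsd41_cb_inflight_end():

```python
# If every completion is signalled, the callback keeps taking the
# need_restart path instead of being destroyed, so the inflight count
# never reaches zero and wait_var_event() in
# nfsd41_cb_inflight_wait_complete() would block forever.

class Client:
    def __init__(self):
        self.cb_inflight = 1  # one callback currently in flight

def sequence_done(clp, rpc_signalled):
    if rpc_signalled:
        # need_restart: cb_need_restart = true; the work is requeued and
        # nfsd41_cb_inflight_end() is never reached.
        return "need_restart"
    clp.cb_inflight -= 1      # nfsd41_cb_inflight_end()
    return "done"

clp = Client()
for _ in range(3):            # task repeatedly killed via rpc_killall_tasks()
    assert sequence_done(clp, rpc_signalled=True) == "need_restart"
assert clp.cb_inflight == 1   # a shutdown waiting for 0 hangs here

assert sequence_done(clp, rpc_signalled=False) == "done"
assert clp.cb_inflight == 0   # only an unsignalled pass lets shutdown finish
```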
Question for Baptiste -- what NFS versions are your clients using?
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c6
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (5 preceding siblings ...)
2025-01-21 14:40 ` Jeff Layton via Bugspray Bot
@ 2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
2025-01-21 17:35 ` Jeff Layton via Bugspray Bot
2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
` (12 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-21 16:10 UTC (permalink / raw)
To: harald.dunkel, anna, jlayton, benoit.gschwind, chuck.lever,
trondmy, cel, herzog, baptiste.pellegrin, carnil, linux-nfs
Chuck Lever writes via Kernel.org Bugzilla:
The trace captures I've reviewed suggest that a callback session is in use, so I would say the NFS minor version is 1 or higher. Perhaps it's not the RPC_SIGNALLED test above that is the problem, but the one later in nfsd4_cb_sequence_done().
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c7
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (6 preceding siblings ...)
2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
@ 2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 16:35 ` Chuck Lever via Bugspray Bot
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
` (11 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-01-21 16:25 UTC (permalink / raw)
To: anna, chuck.lever, carnil, linux-nfs, cel, herzog, trondmy,
harald.dunkel, jlayton, benoit.gschwind, baptiste.pellegrin
Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
Hello.
My clients are all v4.2, using the Debian stable kernel 6.1.0-29-amd64 (= 6.1.123-1).
If, after checking my server trace, you suspect one of my clients of doing "wrong things", give me the IP and I can send you back that client's syslog.
In my case, one point seems important: the hang occurs almost exactly once per week, but not always on the same day of the week (it depends on when I reboot the server). The greater the server uptime, the greater the probability of seeing "unrecognized reply" messages in syslog.
I always see one or two "unrecognized reply" messages around 120 seconds before the hang message.
So could it be something that happens in client or server weekly jobs?
Or maybe a memory leak or cache corruption?
Or something related to an expired Kerberos cache file?
Or an expired NFS session?
...
It also seems that the number of nfsd_cb_recall_any callback messages increases with server uptime, which seems to favor the memory-leak hypothesis.
Regards,
Baptiste.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c8
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-21 16:35 ` Chuck Lever via Bugspray Bot
2025-01-22 11:40 ` Baptiste PELLEGRIN via Bugspray Bot
0 siblings, 1 reply; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-21 16:35 UTC (permalink / raw)
To: herzog, linux-nfs, baptiste.pellegrin, anna, benoit.gschwind,
carnil, chuck.lever, jlayton, trondmy, cel, harald.dunkel
Chuck Lever writes via Kernel.org Bugzilla:
(In reply to Baptiste PELLEGRIN from comment #8)
> I always see one or two "unrecognized reply" message around 120 seconds
> before the hang message.
>
> So it may something that happen on client or server weekly jobs ?
> Or maybe some memory leak or cache corruption ?
> Or something related to expired Kerberos cache file ?
> Or expired NFS session ?
> ...
>
> It seems also that the number of nfsd_cb_recall_any callback message
> increase with the server uptime. This seems in favor of the memory leak
> hypothesis.
The server generates a CB_RECALL_ANY message for each active client. If the number of active clients increases from zero at server boot time to a few dozen, that would also explain why you see more of these over time.
If your NFS server does not also have NFS mount points, a few client-side trace points can be enabled to capture more details about NFSv4 callback activity.
"-e sunrpc:xprt_reserve" for example would help us match the XIDs in the callback operations to the messages you see in the server's system journal.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c9
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
@ 2025-01-21 17:35 ` Jeff Layton via Bugspray Bot
2025-01-21 19:38 ` Tom Talpey
0 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton via Bugspray Bot @ 2025-01-21 17:35 UTC (permalink / raw)
To: cel, trondmy, carnil, anna, linux-nfs, herzog, harald.dunkel,
chuck.lever, jlayton, baptiste.pellegrin, benoit.gschwind
Jeff Layton writes via Kernel.org Bugzilla:
(In reply to Chuck Lever from comment #7)
> The trace captures I've reviewed suggest that a callback session is in use,
> so I would say the NFS minor version is 1 or higher. Perhaps it's not the
> RPC_SIGNALLED test above that is the problem, but the one later in
> nfsd4_cb_sequence_done().
Ok, good. Knowing that it's not v4.0 allows us to rule out some codepaths.
There are a couple of other cases where we goto need_restart:
The NFS4ERR_BADSESSION case does this, as does the case where we don't get a reply at all (case 1).
There is also this that looks a little sketchy:
------------8<-------------------
        trace_nfsd_cb_free_slot(task, cb);
        nfsd41_cb_release_slot(cb);

        if (RPC_SIGNALLED(task))
                goto need_restart;
out:
        return ret;
retry_nowait:
        if (rpc_restart_call_prepare(task))
                ret = false;
        goto out;
need_restart:
        if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
                trace_nfsd_cb_restart(clp, cb);
                task->tk_status = 0;
                cb->cb_need_restart = true;
        }
        return false;
------------8<-------------------
Probably not the same bug, but it looks like if RPC_SIGNALLED() returns true, it'll restart the RPC after releasing the slot. It seems like that could break the reply cache handling, as the restarted call could end up on a different slot. I'll look at patching that, at least, though I'm not sure it's related to the hang.
More notes. The only way RPC_TASK_SIGNALLED gets set is:
nfsd4_process_cb_update()
  rpc_shutdown_client()
    rpc_killall_tasks()
That gets called if:
        if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
                nfsd4_process_cb_update(cb);
Which means that NFSD4_CLIENT_CB_UPDATE was probably set? NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the cb_need_restart handling.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-21 17:35 ` Jeff Layton via Bugspray Bot
@ 2025-01-21 19:38 ` Tom Talpey
2025-01-21 19:43 ` Chuck Lever
0 siblings, 1 reply; 35+ messages in thread
From: Tom Talpey @ 2025-01-21 19:38 UTC (permalink / raw)
To: Jeff Layton via Bugspray Bot, cel, trondmy, carnil, anna,
linux-nfs, herzog, harald.dunkel, chuck.lever, jlayton,
baptiste.pellegrin, benoit.gschwind
On 1/21/2025 12:35 PM, Jeff Layton via Bugspray Bot wrote:
> Jeff Layton writes via Kernel.org Bugzilla:
>
> (In reply to Chuck Lever from comment #7)
>> The trace captures I've reviewed suggest that a callback session is in use,
>> so I would say the NFS minor version is 1 or higher. Perhaps it's not the
>> RPC_SIGNALLED test above that is the problem, but the one later in
>> nfsd4_cb_sequence_done().
>
>
> Ok, good. Knowing that it's not v4.0 allows us to rule out some codepaths.
> There are a couple of other cases where we goto need_restart:
>
> The NFS4ERR_BADSESSION case does this, and also if it doesn't get a reply at all (case 1).
Note that one thread in Benoît's recent logs is stuck in
nfsd4_bind_conn_to_session(), and three more threads in
nfsd4_destroy_session(), so there is certainly some
session/connection dance going on. Combined with an invalid
replay cache entry, that could easily make things worse.
There's also one thread in nfsd4_destroy_clientid(), which
seems important, but odd. And finally, the laundromat is
running. No shortage of races!
Tom.
> There is also this that looks a little sketchy:
>
> ------------8<-------------------
> trace_nfsd_cb_free_slot(task, cb);
> nfsd41_cb_release_slot(cb);
>
> if (RPC_SIGNALLED(task))
> goto need_restart;
> out:
> return ret;
> retry_nowait:
> if (rpc_restart_call_prepare(task))
> ret = false;
> goto out;
> need_restart:
> if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
> trace_nfsd_cb_restart(clp, cb);
> task->tk_status = 0;
> cb->cb_need_restart = true;
> }
> return false;
> ------------8<-------------------
>
> Probably not the same bug, but it looks like if RPC_SIGNALLED returns true, then it'll restart the RPC after releasing the slot. It seems like that could break the reply cache handling, as the restarted call could be on a different slot. I'll look at patching that, at least, though I'm not sure it's related to the hang.
>
> More notes. The only way RPC_TASK_SIGNALLED gets set is:
>
> nfsd4_process_cb_update()
> rpc_shutdown_client()
> rpc_killall_tasks()
>
> That gets called if:
>
> if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
> nfsd4_process_cb_update(cb);
>
> Which means that NFSD4_CLIENT_CB_UPDATE was probably set? NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the cb_need_restart handling.
>
> View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
> You can reply to this message to join the discussion.
* Re: NFSD threads hang when destroying a session or client ID
2025-01-21 19:38 ` Tom Talpey
@ 2025-01-21 19:43 ` Chuck Lever
0 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2025-01-21 19:43 UTC (permalink / raw)
To: Tom Talpey, Jeff Layton via Bugspray Bot, cel, trondmy, carnil,
anna, linux-nfs, herzog, harald.dunkel, jlayton,
baptiste.pellegrin, benoit.gschwind
On 1/21/25 2:38 PM, Tom Talpey wrote:
> On 1/21/2025 12:35 PM, Jeff Layton via Bugspray Bot wrote:
>> Jeff Layton writes via Kernel.org Bugzilla:
>>
>> (In reply to Chuck Lever from comment #7)
>>> The trace captures I've reviewed suggest that a callback session is
>>> in use,
>>> so I would say the NFS minor version is 1 or higher. Perhaps it's not
>>> the
>>> RPC_SIGNALLED test above that is the problem, but the one later in
>>> nfsd4_cb_sequence_done().
>>
>>
>> Ok, good. Knowing that it's not v4.0 allows us to rule out some
>> codepaths.
>> There are a couple of other cases where we goto need_restart:
>>
>> The NFS4ERR_BADSESSION case does this, and also if it doesn't get a
>> reply at all (case 1).
>
> Note that one thread in Benoît's recent logs is stuck in
> nfsd4_bind_conn_to_session(), and three threads also in
> nfsd4_destroy_session(), so there is certainly some
> session/connection dance going on. Combining an invalid
> replay cache entry could easily make things worse.
Yes, the client returns RETRY_UNCACHED_REP for some of the backchannel
operations. NFSD never asserts cachethis in CB_SEQUENCE. I'm trying to
understand why NFSD would skip incrementing its slot sequence number.
> There's also one thread in nfsd4_destroy_clientid(), which
> seems important, but odd. And finally, the laundromat is
> running. No shortage of races!
The hangs are all related here: they are waiting for flush_workqueue()
on the callback workqueue. In v6.1, there is only one callback_wq and
its max_active is 1. If the current work item hangs, then that work
queue stalls.
> Tom.
>
>
>> There is also this that looks a little sketchy:
>>
>> ------------8<-------------------
>> trace_nfsd_cb_free_slot(task, cb);
>> nfsd41_cb_release_slot(cb);
>>
>> if (RPC_SIGNALLED(task))
>> goto need_restart;
>> out:
>> return ret;
>> retry_nowait:
>> if (rpc_restart_call_prepare(task))
>> ret = false;
>> goto out;
>> need_restart:
>> if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>> trace_nfsd_cb_restart(clp, cb);
>> task->tk_status = 0;
>> cb->cb_need_restart = true;
>> }
>> return false;
>> ------------8<-------------------
>>
>> Probably not the same bug, but it looks like if RPC_SIGNALLED returns
>> true, then it'll restart the RPC after releasing the slot. It seems
>> like that could break the reply cache handling, as the restarted call
>> could be on a different slot. I'll look at patching that, at least,
>> though I'm not sure it's related to the hang.
>>
>> More notes. The only way RPC_TASK_SIGNALLED gets set is:
>>
>> nfsd4_process_cb_update()
>> rpc_shutdown_client()
>> rpc_killall_tasks()
>>
>> That gets called if:
>>
>> if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
>> nfsd4_process_cb_update(cb);
>>
>> Which means that NFSD4_CLIENT_CB_UPDATE was probably set?
>> NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the
>> cb_need_restart handling.
>>
>> View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
>> You can reply to this message to join the discussion.
>
--
Chuck Lever
* Re: NFSD threads hang when destroying a session or client ID
2025-01-21 16:35 ` Chuck Lever via Bugspray Bot
@ 2025-01-22 11:40 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-22 14:19 ` Chuck Lever
0 siblings, 1 reply; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-01-22 11:40 UTC (permalink / raw)
To: benoit.gschwind, linux-nfs, harald.dunkel, trondmy, carnil,
baptiste.pellegrin, anna, tom, jlayton, cel, chuck.lever, herzog
Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
(In reply to Chuck Lever from comment #9)
>
> The server generates a CB_RECALL_ANY message for each active client. If the
> number of active clients increases from zero at server boot time to a few
> dozen, that would also explain why you see more of these over time.
Does this mean that something can stay active overnight? In my case, no clients are running outside opening hours; they are all suspended or powered off.
>
> "-e sunrpc:xprt_reserve" for example would help us match the XIDs in the
> callback operations to the messages you see in the server's system journal.
I will try to help you as much as I can. It is not really a problem for me to run trace-cmd on all clients, as they are all managed with Puppet. I will try as soon as possible.
Can you confirm that I need to run "trace-cmd record -e sunrpc:xprt_reserve" on all my clients? No other flags?
Did you see that I added the "-e sunrpc:svc_xprt_detach -p function -l nfsd4_destroy_session" flags in my last recorded trace? Do I need to add new ones for the next crash?
I will also send you a dump of the current kernel tasks next time.
Regards,
Baptiste.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c13
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-22 11:40 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-22 14:19 ` Chuck Lever
0 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2025-01-22 14:19 UTC (permalink / raw)
To: Baptiste PELLEGRIN via Bugspray Bot, baptiste.pellegrin
Cc: benoit.gschwind, linux-nfs, harald.dunkel, trondmy, carnil, anna,
tom, herzog, jlayton, cel
On 1/22/25 6:40 AM, Baptiste PELLEGRIN via Bugspray Bot wrote:
> Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
>
> (In reply to Chuck Lever from comment #9)
>>
>> The server generates a CB_RECALL_ANY message for each active client. If the
>> number of active clients increases from zero at server boot time to a few
>> dozen, that would also explain why you see more of these over time.
>
> This mean that something can stay active overnight ? In my case, no client are running outside opening hour. They are all suspended or power off.
>
>>
>> "-e sunrpc:xprt_reserve" for example would help us match the XIDs in the
>> callback operations to the messages you see in the server's system journal.
>
>
> I will try to help you I much as I can. Is not really a problem for me to run trace-cmd on all clients as they are all managed with Puppet. I will try as soon as possible.
>
> You confirm to me that I need to run "trace-cmd record -e sunrpc:xprt_reserve" on all my clients ? No more flags ?
Don't change the clients. The additional command line option above goes
on the existing server trace-cmd, but only if the NFS server system does
not have any NFS mounts of its own. I'm trying to keep the amount of
generated trace data to a minimum.
The point here is to record the XIDs of each of the backchannel
operations. These XIDs will show up in the "unrecognized XID" messages
on your server.
> Did you see that I have added the "-e sunrpc:svc_xprt_detach -p function -l nfsd4_destroy_session" flags in my last recorded trace ? Do I need to add new one for the next crash ?
>
> I will also send to you the dump of current kernel task next time.
--
Chuck Lever
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (7 preceding siblings ...)
2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
` (10 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
To: trondmy, tom, anna, harald.dunkel, herzog, jlayton, carnil, cel,
linux-nfs, chuck.lever, benoit.gschwind, baptiste.pellegrin
JJ Jordan added an attachment on Kernel.org Bugzilla:
Created attachment 307527
Logs and traces from Jan-18 pt1
Logs and traces from Jan-18 pt1
Here are the traces from two NFS crashes that occurred this past weekend.
Both occurred in the AM (US time) on Jan 18, a few hours apart from one
another.
I followed the instructions I found on the various threads.
There was no output from `rpcdebug -m rpc -c`; I'm not sure what I did wrong
there. The syslog ought to contain the output of sysrq-trigger, however.
The output from trace-cmd captures several days' worth of logs in either
case, but not from system boot.
I have cut the syslogs from ~one hour before the incident until the server finished
shutting down prior to reboot. I have removed the output of other services.
Both servers are VMs on GCE running the 6.1.119 kernel from Debian bookworm (6.1.0-28),
with ~60Gi memory and 16 CPUs.
File: nfs-traces-250118-pt1.tar.bz2 (application/octet-stream)
Size: 4.61 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307527
---
Logs and traces from Jan-18 pt1
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (8 preceding siblings ...)
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
@ 2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-23 2:10 ` Li Lingfeng via Bugspray Bot
` (9 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
To: trondmy, tom, anna, harald.dunkel, herzog, jlayton, carnil, cel,
linux-nfs, chuck.lever, benoit.gschwind, baptiste.pellegrin
JJ Jordan added an attachment on Kernel.org Bugzilla:
Created attachment 307528
Logs and traces from Jan-18 pt2
Part 2 of previous
File: nfs-traces-250118-pt2.tar.bz2 (application/octet-stream)
Size: 601.99 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307528
---
Logs and traces from Jan-18 pt2
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (9 preceding siblings ...)
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
@ 2025-01-23 2:10 ` Li Lingfeng via Bugspray Bot
2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
` (8 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Li Lingfeng via Bugspray Bot @ 2025-01-23 2:10 UTC (permalink / raw)
To: benoit.gschwind, harald.dunkel, herzog, tom, chuck.lever, jlayton,
cel, trondmy, baptiste.pellegrin, carnil, linux-nfs, anna
Li Lingfeng writes via Kernel.org Bugzilla:
(In reply to Chuck Lever from comment #0)
> On recent v6.1.y, intermittently, NFSD threads wait forever for NFSv4
> callback to shutdown. The wait is in __flush_workqueue(). A server system
> reboot is necessary to recover.
>
> On new kernels, similar symptoms but the indefinite wait is in the "destroy
> client" path, waiting for NFSv4 callback shutdown. The wait is on the
> wait_var_event() in nfsd41_cb_inflight_wait_complete().
Hi, I've run into a similar problem recently and have done some analysis.
[ 6526.031343] INFO: task bash:846259 blocked for more than 606 seconds.
[ 6526.032060] Not tainted 6.6.0-gfbf24d352c28-dirty #22
[ 6526.032635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6526.033404] task:bash state:D stack:0 pid:846259 ppid:838395 flags:0x0000020d
[ 6526.034226] Call trace:
[ 6526.034527] __switch_to+0x218/0x3e0
[ 6526.034925] __schedule+0x734/0x11a8
[ 6526.035323] schedule+0xa8/0x200
[ 6526.035731] nfsd4_shutdown_callback+0x24c/0x2f0
[ 6526.036228] __destroy_client+0x414/0x680
[ 6526.036663] nfs4_state_destroy_net+0x144/0x448
[ 6526.037152] nfs4_state_shutdown_net+0x2c8/0x450
[ 6526.037640] nfsd_shutdown_net+0x100/0x2e0
[ 6526.038078] nfsd_last_thread+0x190/0x330
[ 6526.038518] nfsd_svc+0x3cc/0x4a0
[ 6526.038892] write_threads+0x15c/0x2f0
[ 6526.039301] nfsctl_transaction_write+0x90/0xd0
[ 6526.039836] vfs_write+0x110/0x688
[ 6526.040221] ksys_write+0xd0/0x188
[ 6526.040607] __arm64_sys_write+0x4c/0x68
[ 6526.041035] invoke_syscall+0x68/0x198
[ 6526.041455] el0_svc_common.constprop.0+0x11c/0x150
[ 6526.041967] do_el0_svc+0x38/0x50
[ 6526.042353] el0_svc+0x5c/0x240
[ 6526.042723] el0t_64_sync_handler+0x100/0x130
[ 6526.043186] el0t_64_sync+0x188/0x190
[ 6526.051007] INFO: task cat:846265 blocked for more than 606 seconds.
1) Check cl_cb_inflight
crash> nfs4_client.cl_cb_inflight ffff000012338f08
cl_cb_inflight = {
counter = 1
},
crash>
2) No work items are associated with NFSD
Only two work items are on the worklist, neither related to NFSD.
crash> p callback_wq
callback_wq = $1 = (struct workqueue_struct *) 0xffff0000c30a1400
crash>
crash> workqueue_struct.cpu_pwq 0xffff0000c30a1400
cpu_pwq = 0xccfe9cb5d8d0
crash> kmem -o
PER-CPU OFFSET VALUES:
CPU 0: ffff2f015341c000
CPU 1: ffff2f0153442000
CPU 2: ffff2f0153468000
CPU 3: ffff2f015348e000
crash>
// ffff2f015341c000 + ccfe9cb5d8d0 = FFFFFBFFEFF798D0
crash> rd FFFFFBFFEFF798D0
fffffbffeff798d0: ffff0000d3488d00 ..H.....
crash>
// ffff2f0153442000 + ccfe9cb5d8d0 = FFFFFBFFEFF9F8D0
crash> rd FFFFFBFFEFF9F8D0
fffffbffeff9f8d0: ffff0000d3488d00 ..H.....
crash>
// ffff2f0153468000 + ccfe9cb5d8d0 = FFFFFBFFEFFC58D0
crash> rd FFFFFBFFEFFC58D0
fffffbffeffc58d0: ffff0000d3488d00 ..H.....
crash>
// ffff2f015348e000 + ccfe9cb5d8d0 = FFFFFBFFEFFEB8D0
crash> rd FFFFFBFFEFFEB8D0
fffffbffeffeb8d0: ffff0000d3488d00 ..H.....
crash>
crash> pool_workqueue.pool ffff0000d3488d00
pool = 0xffff0000c01b6800,
crash>
crash> worker_pool.worklist 0xffff0000c01b6800
worklist = {
next = 0xffff0000c906c4a8,
prev = 0xffffd0ff8944fc68 <stats_flush_dwork+8>
},
crash>
crash> list 0xffff0000c906c4a8
ffff0000c906c4a8
ffffd0ff8944fc68
ffff0000c01b6860
crash>
crash> work_struct.func ffff0000c906c4a0
func = 0xffffd0ff84fae128 <wb_update_bandwidth_workfn>,
crash> work_struct.func 0xffffd0ff8944fc60
func = 0xffffd0ff8510b258 <flush_memcg_stats_dwork>,
crash>
3) No running kworker
I checked the vmcore with "foreach bt" and found that all kworkers are as follows.
PID: 62 TASK: ffff0000c31d0040 CPU: 1 COMMAND: "kworker/R-nfsio"
#0 [ffff800080c27b80] __switch_to at ffffd0ff866297dc
#1 [ffff800080c27bd0] __schedule at ffffd0ff8662a180
#2 [ffff800080c27d00] schedule at ffffd0ff8662ac9c
#3 [ffff800080c27d40] rescuer_thread at ffffd0ff84b418e4
#4 [ffff800080c27e60] kthread at ffffd0ff84b52e14
PID: 94 TASK: ffff0000c74ba080 CPU: 0 COMMAND: "kworker/0:1H"
#0 [ffff800080e07c00] __switch_to at ffffd0ff866297dc
#1 [ffff800080e07c50] __schedule at ffffd0ff8662a180
#2 [ffff800080e07d80] schedule at ffffd0ff8662ac9c
#3 [ffff800080e07dc0] worker_thread at ffffd0ff84b40f94
#4 [ffff800080e07e60] kthread at ffffd0ff84b52e14
4) Check works related to nfsd4_run_cb_work
crash> p nfsd4_run_cb_work
nfsd4_run_cb_work = $5 =
{void (struct work_struct *)} 0xffffd0ff855691e0 <nfsd4_run_cb_work>
crash> search ffffd0ff855691e0
ffff000010474138: ffffd0ff855691e0
ffff0000104750f8: ffffd0ff855691e0
ffff0000104752f0: ffffd0ff855691e0
ffff0000104756e0: ffffd0ff855691e0
ffff000012338388: ffffd0ff855691e0
ffff000012339288: ffffd0ff855691e0
ffff00001233a908: ffffd0ff855691e0
ffff00001233b808: ffffd0ff855691e0
ffff0000c745d038: ffffd0ff855691e0
ffff0000c86499f8: ffffd0ff855691e0
ffff0000c8649b30: ffffd0ff855691e0
ffff0000c9ff8dc8: ffffd0ff855691e0
crash>
ffff000010474138 --> (work) ffff000010474120
ffff0000104750f8 --> (work) ffff0000104750e0
ffff0000104752f0 --> (work) ffff0000104752d8
ffff0000104756e0 --> (work) ffff0000104756c8
ffff000012338388 --> (work) ffff000012338370
ffff000012339288 --> (work) ffff000012339270
ffff00001233a908 --> (work) ffff00001233a8f0
ffff00001233b808 --> (work) ffff00001233b7f0
ffff0000c745d038 --> (work) ffff0000c745d020
ffff0000c86499f8 --> (work) ffff0000c86499e0
ffff0000c8649b30 --> (work) ffff0000c8649b18
ffff0000c9ff8dc8 --> (work) ffff0000c9ff8db0
crash> work_struct.data ffff0000104750e0
data = {
counter = 68719476704 // 0xFFFFFFFE0: WORK_STRUCT_PENDING (bit 0) is clear
},
crash>
crash> work_struct.data ffff0000c9ff8db0
data = {
counter = 256 // 0x100
},
crash>
I have added some debug information and am trying to reproduce it.
Could you please share any additional information you have, or any suggestions?
Thanks.
>
> In some cases, clients suspend (inactivity). The server converts them to
> courteous clients. The NFSv4 callback shutdown workqueue item for that
> client appears to be stuck waiting in rpc_shutdown_client().
>
> Let's collect data under this bug report.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c17
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (10 preceding siblings ...)
2025-01-23 2:10 ` Li Lingfeng via Bugspray Bot
@ 2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
2025-01-23 14:22 ` Chuck Lever
2025-01-23 20:25 ` Baptiste PELLEGRIN via Bugspray Bot
` (7 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton via Bugspray Bot @ 2025-01-23 13:50 UTC (permalink / raw)
To: linux-nfs, jlayton, chuck.lever, herzog, tom, carnil, anna,
benoit.gschwind, trondmy, harald.dunkel, cel, baptiste.pellegrin
Jeff Layton writes via Kernel.org Bugzilla:
There is another scenario that could explain a hang here. From nfsd4_cb_sequence_done():
------------------8<---------------------
case -NFS4ERR_BADSLOT:
goto retry_nowait;
case -NFS4ERR_SEQ_MISORDERED:
if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
session->se_cb_seq_nr[cb->cb_held_slot] = 1;
goto retry_nowait;
}
break;
default:
nfsd4_mark_cb_fault(cb->cb_clp);
}
trace_nfsd_cb_free_slot(task, cb);
nfsd41_cb_release_slot(cb);
if (RPC_SIGNALLED(task))
goto need_restart;
out:
return ret;
retry_nowait:
if (rpc_restart_call_prepare(task))
ret = false;
goto out;
------------------8<---------------------
Since it doesn't check RPC_SIGNALLED in the v4.1+ case until very late in the function, it's possible to get a BADSLOT or SEQ_MISORDERED error that causes the callback client to immediately resubmit the rpc_task to the RPC engine without resubmitting to the callback workqueue.
I think that we should assume that when RPC_SIGNALLED returns true, the result is suspect; we should halt further processing of the CB_SEQUENCE response and restart the callback.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c18
* Re: NFSD threads hang when destroying a session or client ID
2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
@ 2025-01-23 14:22 ` Chuck Lever
0 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2025-01-23 14:22 UTC (permalink / raw)
To: Jeff Layton via Bugspray Bot, linux-nfs, jlayton, herzog, tom,
carnil, anna, benoit.gschwind, trondmy, harald.dunkel, cel,
baptiste.pellegrin
On 1/23/25 8:50 AM, Jeff Layton via Bugspray Bot wrote:
> Jeff Layton writes via Kernel.org Bugzilla:
>
> There is another scenario that could explain a hang here. From nfsd4_cb_sequence_done():
>
> ------------------8<---------------------
> case -NFS4ERR_BADSLOT:
> goto retry_nowait;
> case -NFS4ERR_SEQ_MISORDERED:
> if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
> session->se_cb_seq_nr[cb->cb_held_slot] = 1;
> goto retry_nowait;
> }
> break;
> default:
> nfsd4_mark_cb_fault(cb->cb_clp);
> }
> trace_nfsd_cb_free_slot(task, cb);
> nfsd41_cb_release_slot(cb);
>
> if (RPC_SIGNALLED(task))
> goto need_restart;
> out:
> return ret;
> retry_nowait:
> if (rpc_restart_call_prepare(task))
> ret = false;
> goto out;
> ------------------8<---------------------
>
> Since it doesn't check RPC_SIGNALLED in the v4.1+ case until very late in the function, it's possible to get a BADSLOT or SEQ_MISORDERED error that causes the callback client to immediately resubmit the rpc_task to the RPC engine without resubmitting to the callback workqueue.
>
> I think that we should assume that when RPC_SIGNALLED returns true that the result is suspect, and that we should halt further processing into the CB_SEQUENCE response and restart the callback.
When cb->cb_seq_status is set to any value other than 1, that means the
client replied successfully. RPC_SIGNALLED has nothing to do with
whether a reply is suspect, it means only that the rpc_clnt has been
asked to terminate.
The potential loop you noticed is concerning, but I haven't seen
evidence in the "echo t > sysrq-trigger" output that there is a running
RPC such as you described here.
--
Chuck Lever
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (11 preceding siblings ...)
2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
@ 2025-01-23 20:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-23 21:45 ` Chuck Lever via Bugspray Bot
` (6 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-01-23 20:25 UTC (permalink / raw)
To: trondmy, baptiste.pellegrin, anna, cel, chuck.lever, herzog,
harald.dunkel, jlayton, carnil, tom, linux-nfs, benoit.gschwind
Baptiste PELLEGRIN added an attachment on Kernel.org Bugzilla:
Created attachment 307533
Recorded trace 20250123
This time I got a complete trace.
- trace-cmd with all flags enabled:
trace-cmd record -e nfsd:nfsd_cb_\* -e sunrpc:svc_xprt_detach -e sunrpc:xprt_reserve -p function -l nfsd4_destroy_session
- Syslog with all kernel messages and a task dump just after the crash.
Hope this helps.
File: record_nfsdc_20250123.tar.gz (application/gzip)
Size: 2.02 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307533
---
Recorded trace 20250123
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (12 preceding siblings ...)
2025-01-23 20:25 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-23 21:45 ` Chuck Lever via Bugspray Bot
2025-01-26 9:25 ` Baptiste PELLEGRIN via Bugspray Bot
` (5 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-23 21:45 UTC (permalink / raw)
To: carnil, benoit.gschwind, chuck.lever, jlayton, herzog,
baptiste.pellegrin, anna, trondmy, harald.dunkel, linux-nfs, cel,
tom
Chuck Lever writes via Kernel.org Bugzilla:
An RPC task has shown up in the "rpcdebug" output this time:
kernel: [262696.214194] -pid- flgs status -client- --rqstp- -timeout ---ops--
kernel: [262696.214199] 112 2281 0 a6ceb575 0 357 83bde67e nfs4_cbv1 CB_RECALL_ANY a:call_start [sunrpc] q:delayq
Here is a CB_RECALL_ANY operation that is waiting to start. It's in the RPC client's delay_queue, so this looks like it came through the NFS4ERR_DELAY arm of nfsd4_cb_sequence_done(): the RPC was reset to the beginning of the FSM, and rpc_delay was called.
There is a commit missing from origin/linux-6.1.y:
961b4b5e86bf ("NFSD: Reset cb_seq_status after NFS4ERR_DELAY")
Without this commit, the server will continue rescheduling this RPC forever. That could be what is blocking rpc_shutdown_client().
This commit is applied to the nfsd-6.1.y branch of:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
Can anyone on a v6.1-based kernel pull this branch and try it?
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c21
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (13 preceding siblings ...)
2025-01-23 21:45 ` Chuck Lever via Bugspray Bot
@ 2025-01-26 9:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-26 17:05 ` Chuck Lever via Bugspray Bot
2025-01-29 13:15 ` rik.theys via Bugspray Bot
` (4 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-01-26 9:25 UTC (permalink / raw)
To: chuck.lever, linux-nfs, harald.dunkel, herzog, carnil, tom,
trondmy, jlayton, cel, anna, baptiste.pellegrin, benoit.gschwind
Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
I patched the current Debian stable kernels:
Debian Linux 6.1.0-29-amd64 (= 6.1.123-1) Based on Linux 6.1.120
Debian Linux 6.1.0-30-amd64 (= 6.1.124-1) Based on Linux 6.1.124
with these two commits:
8626664c87ee NFSD: Replace dprintks in nfsd4_cb_sequence_done()
961b4b5e86bf NFSD: Reset cb_seq_status after NFS4ERR_DELAY
Now my two NFS servers are running these patched kernels. I need to wait a week to see if the crash still occurs.
Chuck, might this patch remove the "unrecognized reply" messages from my syslog?
I see you have improved the backchannel session logging. Do I need to change my trace-cmd record command?
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c22
* Re: NFSD threads hang when destroying a session or client ID
2025-01-26 9:25 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-26 17:05 ` Chuck Lever via Bugspray Bot
0 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-26 17:05 UTC (permalink / raw)
To: tom, linux-nfs, herzog, carnil, benoit.gschwind,
baptiste.pellegrin, trondmy, harald.dunkel, cel, jlayton, anna,
chuck.lever
Chuck Lever writes via Kernel.org Bugzilla:
(In reply to Baptiste PELLEGRIN from comment #22)
> I patched the current Debian stable kernels:
>
> Debian Linux 6.1.0-29-amd64 (= 6.1.123-1) Based on Linux 6.1.120
> Debian Linux 6.1.0-30-amd64 (= 6.1.124-1) Based on Linux 6.1.124
>
> with these two commits:
>
> 8626664c87ee NFSD: Replace dprintks in nfsd4_cb_sequence_done()
> 961b4b5e86bf NFSD: Reset cb_seq_status after NFS4ERR_DELAY
>
> Now my two NFS servers are running these patched kernels. I need to wait a
> week to see if the crash still occurs.
It's probable that a hang might still occur, but the symptoms might be closer to the hangs that happen on v6.12. If so, I count that as a step forward: one less bug.
> Chuck, might this patch remove the "unrecognized reply" messages from my
> syslog?
It doesn't.
> I see you have improved the backchannel session logging. Do I need to
> change my trace-cmd record command?
The trace-cmd command line I provided above uses a glob expression that should also enable the trace points added in 8626664c87ee ("NFSD: Replace dprintks in nfsd4_cb_sequence_done()").
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c23
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (14 preceding siblings ...)
2025-01-26 9:25 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-01-29 13:15 ` rik.theys via Bugspray Bot
2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
` (3 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: rik.theys via Bugspray Bot @ 2025-01-29 13:15 UTC (permalink / raw)
To: tom, trondmy, chuck.lever, cel, anna, baptiste.pellegrin,
harald.dunkel, linux-nfs, benoit.gschwind, carnil, jlayton,
herzog
rik.theys writes via Kernel.org Bugzilla:
Hi,
We had similar tracing running as we observed a similar issue on our 6.11 kernel. Recently, we upgraded to 6.12.11 and this has resulted in the warning below.
I'm not 100% sure this is the same issue. If not, let me know and I'll open a different thread/bug.
Unfortunately the attachment is too large to upload, as the trace.dat file is ~900MiB.
I've uploaded it here:
https://homes.esat.kuleuven.be/~rtheys/logs-6.12.11.tar.zst
This file contains the warning, the trace.dat file, and the output of "echo t > /proc/sysrq-trigger".
The warning that was issued was:
[Wed Jan 29 10:11:17 2025] cb_status=-521 tk_status=-10036
[Wed Jan 29 10:11:17 2025] WARNING: CPU: 16 PID: 1670626 at fs/nfsd/nfs4callback.c:1339 nfsd4_cb_done+0x171/0x180 [nfsd]
[Wed Jan 29 10:11:17 2025] Modules linked in: dm_snapshot(E) nfsv4(E) dns_resolver(E) nfs(E) netfs(E) binfmt_misc(E) rpcsec_gss_krb5(E) rpcrdma(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) bonding(E) tls(E) rfkill(E) nft_ct(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nf_tables(E) nfnetlink(E) vfat(E) fat(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) libcrc32c(E) dm_service_time(E) dm_multipath(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) mgag200(E) kvm(E) i2c_algo_bit(E) dell_pc(E) rapl(E) drm_shmem_helper(E) ipmi_ssif(E) intel_cstate(E) platform_profile(E) dell_smbios(E) dcdbas(E) dell_wmi_descriptor(E) intel_uncore(E) wmi_bmof(E) pcspkr(E) drm_kms_helper(E) i2c_i801(E) mei_me(E) mei(E) intel_pch_thermal(E) i
2c_mux(E) lpc_ich(E) i2c_smbus(E) joydev(E)
[Wed Jan 29 10:11:17 2025] acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) ipmi_msghandler(E) nfsd(E) nfs_acl(E) lockd(E) auth_rpcgss(E) grace(E) fuse(E) drm(E) sunrpc(E) ext4(E) mbcache(E) jbd2(E) lpfc(E) sd_mod(E) sg(E) nvmet_fc(E) nvmet(E) nvme_fc(E) ahci(E) crct10dif_pclmul(E) nvme_fabrics(E) crc32_pclmul(E) libahci(E) crc32c_intel(E) polyval_clmulni(E) nvme_core(E) megaraid_sas(E) ixgbe(E) polyval_generic(E) ghash_clmulni_intel(E) mdio(E) libata(E) wdat_wdt(E) scsi_transport_fc(E) dca(E) wmi(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
[Wed Jan 29 10:11:17 2025] CPU: 16 UID: 0 PID: 1670626 Comm: kworker/u193:0 Kdump: loaded Tainted: G E 6.12.11-1.el9.esat.x86_64 #1
[Wed Jan 29 10:11:17 2025] Tainted: [E]=UNSIGNED_MODULE
[Wed Jan 29 10:11:17 2025] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.20.1 09/13/2023
[Wed Jan 29 10:11:17 2025] Workqueue: rpciod rpc_async_schedule [sunrpc]
[Wed Jan 29 10:11:17 2025] RIP: 0010:nfsd4_cb_done+0x171/0x180 [nfsd]
[Wed Jan 29 10:11:17 2025] Code: 0f 1f 44 00 00 e9 1d ff ff ff 80 3d 1c a7 01 00 00 0f 85 d9 fe ff ff 48 c7 c7 e5 b2 06 c1 c6 05 08 a7 01 00 01 e8 1f 4f ef d4 <0f> 0b 8b 75 54 e9 bc fe ff ff 0f 0b 0f 1f 00 90 90 90 90 90 90 90
[Wed Jan 29 10:11:17 2025] RSP: 0018:ffffa469b58c7e08 EFLAGS: 00010282
[Wed Jan 29 10:11:17 2025] RAX: 0000000000000000 RBX: ffff8a8f13ef6400 RCX: 0000000000000000
[Wed Jan 29 10:11:17 2025] RDX: 0000000000000002 RSI: ffffffff97398443 RDI: 00000000ffffffff
[Wed Jan 29 10:11:17 2025] RBP: ffff8a8c574515b8 R08: 0000000000000000 R09: ffffa469b58c7cb0
[Wed Jan 29 10:11:17 2025] R10: ffffa469b58c7ca8 R11: ffffffff97fdf688 R12: ffff8a7548f73f60
[Wed Jan 29 10:11:17 2025] R13: ffff8a8f13ef6400 R14: 0000000004248060 R15: ffffffffc0d66a40
[Wed Jan 29 10:11:17 2025] FS: 0000000000000000(0000) GS:ffff8a8ae0a00000(0000) knlGS:0000000000000000
[Wed Jan 29 10:11:17 2025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 29 10:11:17 2025] CR2: 00007f8a14576160 CR3: 000000274fa20004 CR4: 00000000007726f0
[Wed Jan 29 10:11:17 2025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Jan 29 10:11:17 2025] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Wed Jan 29 10:11:17 2025] PKRU: 55555554
[Wed Jan 29 10:11:17 2025] Call Trace:
[Wed Jan 29 10:11:17 2025] <TASK>
[Wed Jan 29 10:11:17 2025] ? __warn+0x84/0x130
[Wed Jan 29 10:11:17 2025] ? nfsd4_cb_done+0x171/0x180 [nfsd]
[Wed Jan 29 10:11:17 2025] ? report_bug+0x1c3/0x1d0
[Wed Jan 29 10:11:17 2025] ? handle_bug+0x5b/0xa0
[Wed Jan 29 10:11:17 2025] ? exc_invalid_op+0x14/0x70
[Wed Jan 29 10:11:17 2025] ? asm_exc_invalid_op+0x16/0x20
[Wed Jan 29 10:11:17 2025] ? __pfx_rpc_exit_task+0x10/0x10 [sunrpc]
[Wed Jan 29 10:11:17 2025] ? nfsd4_cb_done+0x171/0x180 [nfsd]
[Wed Jan 29 10:11:17 2025] ? nfsd4_cb_done+0x171/0x180 [nfsd]
[Wed Jan 29 10:11:17 2025] rpc_exit_task+0x5b/0x170 [sunrpc]
[Wed Jan 29 10:11:17 2025] __rpc_execute+0x9f/0x370 [sunrpc]
[Wed Jan 29 10:11:17 2025] rpc_async_schedule+0x2b/0x40 [sunrpc]
[Wed Jan 29 10:11:17 2025] process_one_work+0x179/0x390
[Wed Jan 29 10:11:17 2025] worker_thread+0x239/0x340
[Wed Jan 29 10:11:17 2025] ? __pfx_worker_thread+0x10/0x10
[Wed Jan 29 10:11:17 2025] kthread+0xdb/0x110
[Wed Jan 29 10:11:17 2025] ? __pfx_kthread+0x10/0x10
[Wed Jan 29 10:11:17 2025] ret_from_fork+0x2d/0x50
[Wed Jan 29 10:11:17 2025] ? __pfx_kthread+0x10/0x10
[Wed Jan 29 10:11:17 2025] ret_from_fork_asm+0x1a/0x30
[Wed Jan 29 10:11:17 2025] </TASK>
[Wed Jan 29 10:11:17 2025] ---[ end trace 0000000000000000 ]---
It also seems to indicate an issue with nfsd4_cb_done in a workqueue.
The "echo t > /proc/sysrq-trigger" was taken after stopping the trace-cmd.
I hope the trace.dat file contains enough information to help find the root cause. It was started before the first client mounted something from this server.
Regards,
Rik
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c24
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (15 preceding siblings ...)
2025-01-29 13:15 ` rik.theys via Bugspray Bot
@ 2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
2025-01-30 14:05 ` rik.theys via Bugspray Bot
2025-01-29 19:50 ` Chuck Lever via Bugspray Bot
` (2 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-29 19:40 UTC (permalink / raw)
To: herzog, trondmy, linux-nfs, jlayton, tom, baptiste.pellegrin,
chuck.lever, anna, benoit.gschwind, carnil, harald.dunkel, cel
Chuck Lever writes via Kernel.org Bugzilla:
> [Wed Jan 29 10:11:17 2025] cb_status=-521 tk_status=-10036
-521 = -EBADHANDLE (NFS4ERR_BADHANDLE)
-10036 = -NFS4ERR_BAD_XDR
I see several similar events in the trace.dat report. My guess is the client is generating BADHANDLE (nfs4_callback_getattr) but the server is having some trouble decoding that result.
My first impression is that XDR result decoders are supposed to return -EIO in this case. NFS4ERR_BAD_XDR is supposed to mean the /remote side/ was not able to decode a Call, not the local side couldn't decode the Reply.
RFC 8881 Section 20.1.3 states:
"If the filehandle specified is not one for which the client holds an OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is returned."
This appears to be a bug in the new CB_GETATTR implementation. It might or might not cause the callback workqueue to stall, but it should probably be filed as a separate bug.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c25
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (16 preceding siblings ...)
2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
@ 2025-01-29 19:50 ` Chuck Lever via Bugspray Bot
2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
2025-02-21 14:45 ` Jeff Layton via Bugspray Bot
19 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-01-29 19:50 UTC (permalink / raw)
To: benoit.gschwind, cel, herzog, baptiste.pellegrin, jlayton, anna,
harald.dunkel, tom, trondmy, carnil, chuck.lever, linux-nfs
Chuck Lever writes via Kernel.org Bugzilla:
Looking further back in trace.dat:
kworker/u194:9-1629754 [033] 309447.521022: nfsd_cb_free_slot: task:00001fa3@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 new slot seqno=8100
kworker/u193:6-1630755 [004] 309468.053064: nfsd_cb_seq_status: task:00001fa4@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=0 seq_status=0
kworker/u193:6-1630755 [004] 309468.053066: nfsd_cb_free_slot: task:00001fa4@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 new slot seqno=8101
kworker/u194:13-1634037 [013] 309549.519294: nfsd_cb_seq_status: task:00001fa5@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-5 seq_status=1
kworker/u194:13-1634037 [037] 309558.734930: nfsd_cb_seq_status: task:00001fa6@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-5 seq_status=1
kworker/u194:0-1662968 [047] 309567.950612: nfsd_cb_seq_status: task:00001fa7@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-5 seq_status=1
kworker/u194:13-1634037 [043] 309577.167133: nfsd_cb_seq_status: task:00001fa8@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-5 seq_status=1
kworker/u193:14-1658519 [008] 309586.381715: nfsd_cb_seq_status: task:00001fa9@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-5 seq_status=1
kworker/u194:6-1659115 [045] 309588.599171: nfsd_cb_seq_status: task:00001faa@0000009a sessionid=67952a7d:4dd6ab1d:00000fb0:00000000 tk_status=-512 seq_status=1
(Not shown here) this is a CB_RECALL_ANY operation that was sent several times but timed out. The retry mechanism could have dropped this operation after the first timeout, or it could have noticed that the callback transport was marked FAULT.
This situation appears to resolve itself.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c26
* Re: NFSD threads hang when destroying a session or client ID
2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
@ 2025-01-30 14:05 ` rik.theys via Bugspray Bot
0 siblings, 0 replies; 35+ messages in thread
From: rik.theys via Bugspray Bot @ 2025-01-30 14:05 UTC (permalink / raw)
To: jlayton, anna, cel, chuck.lever, baptiste.pellegrin,
benoit.gschwind, tom, carnil, trondmy, harald.dunkel, herzog,
linux-nfs
rik.theys writes via Kernel.org Bugzilla:
(In reply to Chuck Lever from comment #25)
> > [Wed Jan 29 10:11:17 2025] cb_status=-521 tk_status=-10036
>
> -521 = -EBADHANDLE (NFS4ERR_BADHANDLE)
>
> -10036 = -NFS4ERR_BAD_XDR
>
> I see several similar events in the trace.dat report. My guess is the client
> is generating BADHANDLE (nfs4_callback_getattr) but the server is having
> some trouble decoding that result.
>
> My first impression is that XDR result decoders are supposed to return -EIO
> in this case. NFS4ERR_BAD_XDR is supposed to mean the /remote side/ was not
> able to decode a Call, not the local side couldn't decode the Reply.
>
> RFC 8881 Section 20.1.3 states:
>
> "If the filehandle specified is not one for which the client holds an
> OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is returned."
>
> This appears to be a bug in the new CB_GETATTR implementation. It might or
> might not cause the callback workqueue to stall, but it should probably be
> filed as a separate bug.
I've opened Bug #219737 for this.
Regards,
Rik
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c27
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (17 preceding siblings ...)
2025-01-29 19:50 ` Chuck Lever via Bugspray Bot
@ 2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
2025-02-21 13:42 ` Salvatore Bonaccorso
2025-02-21 14:45 ` Jeff Layton via Bugspray Bot
19 siblings, 1 reply; 35+ messages in thread
From: Baptiste PELLEGRIN via Bugspray Bot @ 2025-02-10 12:05 UTC (permalink / raw)
To: anna, jlayton, cel, herzog, tom, trondmy, benoit.gschwind, carnil,
baptiste.pellegrin, linux-nfs, harald.dunkel, chuck.lever
Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
Hello.
Good news for 6.1 kernels.
With these patches applied:
8626664c87ee NFSD: Replace dprintks in nfsd4_cb_sequence_done()
961b4b5e86bf NFSD: Reset cb_seq_status after NFS4ERR_DELAY
No hangs for me in more than two weeks of server uptime, whereas previously the hangs occurred every week.
I just see some suspicious server load, possibly caused by sending CB_RECALL_ANY to shutdown/suspended/courtesy clients.
I see a lot of work on the list that tries to address these problems:
nfsd: eliminate special handling of NFS4ERR_SEQ_MISORDERED
nfsd: handle NFS4ERR_BADSLOT on CB_SEQUENCE better
nfsd: when CB_SEQUENCE gets ESERVERFAULT don't increment seq_nr
nfsd: only check RPC_SIGNALLED() when restarting rpc_task
nfsd: always release slot when requeueing callback
nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()
nfsd: prepare nfsd4_cb_sequence_done() for error handling rework
NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't up
NFSD: fix hang in nfsd4_shutdown_callback
Thanks a lot !
Baptiste.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c28
* Re: NFSD threads hang when destroying a session or client ID
2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-02-21 13:42 ` Salvatore Bonaccorso
2025-02-21 13:57 ` Harald Dunkel
2025-02-21 16:00 ` Chuck Lever via Bugspray Bot
0 siblings, 2 replies; 35+ messages in thread
From: Salvatore Bonaccorso @ 2025-02-21 13:42 UTC (permalink / raw)
To: Baptiste PELLEGRIN via Bugspray Bot
Cc: anna, jlayton, cel, herzog, tom, trondmy, benoit.gschwind,
baptiste.pellegrin, linux-nfs, harald.dunkel, chuck.lever
Hi,
On Mon, Feb 10, 2025 at 12:05:32PM +0000, Baptiste PELLEGRIN via Bugspray Bot wrote:
> Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
>
> Hello.
>
> Good news for 6.1 kernels.
>
> With these patches applied:
>
> 8626664c87ee NFSD: Replace dprintks in nfsd4_cb_sequence_done()
> 961b4b5e86bf NFSD: Reset cb_seq_status after NFS4ERR_DELAY
>
> No hangs for me in more than two weeks of server uptime, whereas previously the hangs occurred every week.
>
> I just see some suspicious server load, maybe caused by sending RPC_RECALL_ANY to shutdown/suspended/courtesy clients.
>
> I see a lot of work on the list that tries to address these problems:
>
> nfsd: eliminate special handling of NFS4ERR_SEQ_MISORDERED
> nfsd: handle NFS4ERR_BADSLOT on CB_SEQUENCE better
> nfsd: when CB_SEQUENCE gets ESERVERFAULT don't increment seq_nr
> nfsd: only check RPC_SIGNALLED() when restarting rpc_task
> nfsd: always release slot when requeueing callback
> nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()
> nfsd: prepare nfsd4_cb_sequence_done() for error handling rework
>
> NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't up
>
> NFSD: fix hang in nfsd4_shutdown_callback
So I see the backport of 961b4b5e86bf NFSD: Reset cb_seq_status after
NFS4ERR_DELAY landed in the just released 6.1.129 stable version.
Do we consider this to be sufficient to have stabilized the situation
with this issue? (I do realize much other work has been done as well, which
has partially flowed down to the stable series already.)
This reply is mainly in focus of https://bugs.debian.org/1071562
Regards,
Salvatore
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-02-21 13:42 ` Salvatore Bonaccorso
@ 2025-02-21 13:57 ` Harald Dunkel
2025-02-21 14:31 ` Salvatore Bonaccorso
2025-02-21 14:50 ` Jeff Layton via Bugspray Bot
2025-02-21 16:00 ` Chuck Lever via Bugspray Bot
1 sibling, 2 replies; 35+ messages in thread
From: Harald Dunkel @ 2025-02-21 13:57 UTC (permalink / raw)
To: Salvatore Bonaccorso, Baptiste PELLEGRIN via Bugspray Bot
Cc: anna@kernel.org, jlayton@kernel.org, cel@kernel.org,
herzog@phys.ethz.ch, tom@talpey.com, trondmy@kernel.org,
benoit.gschwind@minesparis.psl.eu,
baptiste.pellegrin@ac-grenoble.fr, linux-nfs@vger.kernel.org,
chuck.lever@oracle.com
[-- Attachment #1: Type: text/plain, Size: 2636 bytes --]
Hi folks,
I don't like to bring bad news, but yesterday I hit a problem with kernel 6.12.9 (twice); see the attachment. AFAIK 6.12.9 is supposed to include 961b4b5e86bf56a2e4b567f81682defa5cba957e and 8626664c87eebb21a40d4924b2f244a1f56d8806.
Regards
Harri
________________________________________
From: Salvatore Bonaccorso <salvatore.bonaccorso@gmail.com> on behalf of Salvatore Bonaccorso <carnil@debian.org>
Sent: Friday, February 21, 2025 14:42
To: Baptiste PELLEGRIN via Bugspray Bot
Cc: anna@kernel.org; jlayton@kernel.org; cel@kernel.org; herzog@phys.ethz.ch; tom@talpey.com; trondmy@kernel.org; benoit.gschwind@minesparis.psl.eu; baptiste.pellegrin@ac-grenoble.fr; linux-nfs@vger.kernel.org; Harald Dunkel; chuck.lever@oracle.com
Subject: Re: NFSD threads hang when destroying a session or client ID
Hi,
On Mon, Feb 10, 2025 at 12:05:32PM +0000, Baptiste PELLEGRIN via Bugspray Bot wrote:
> Baptiste PELLEGRIN writes via Kernel.org Bugzilla:
>
> Hello.
>
> Good news for 6.1 kernels.
>
> With these patches applied:
>
> 8626664c87ee NFSD: Replace dprintks in nfsd4_cb_sequence_done()
> 961b4b5e86bf NFSD: Reset cb_seq_status after NFS4ERR_DELAY
>
> No hangs for me in more than two weeks of server uptime, whereas previously the hangs occurred every week.
>
> I just see some suspicious server load, maybe caused by sending RPC_RECALL_ANY to shutdown/suspended/courtesy clients.
>
> I see a lot of work on the list that tries to address these problems:
>
> nfsd: eliminate special handling of NFS4ERR_SEQ_MISORDERED
> nfsd: handle NFS4ERR_BADSLOT on CB_SEQUENCE better
> nfsd: when CB_SEQUENCE gets ESERVERFAULT don't increment seq_nr
> nfsd: only check RPC_SIGNALLED() when restarting rpc_task
> nfsd: always release slot when requeueing callback
> nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()
> nfsd: prepare nfsd4_cb_sequence_done() for error handling rework
>
> NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't up
>
> NFSD: fix hang in nfsd4_shutdown_callback
So I see the backport of 961b4b5e86bf NFSD: Reset cb_seq_status after
NFS4ERR_DELAY landed in the just released 6.1.129 stable version.
Do we consider this to be sufficient to have stabilized the situation
with this issue? (I do realize much other work has been done as well, which
has partially flowed down to the stable series already.)
This reply is mainly in focus of https://bugs.debian.org/1071562
Regards,
Salvatore
[-- Attachment #2: x2 --]
[-- Type: application/octet-stream, Size: 6348 bytes --]
2025-02-20T15:45:09.658208+01:00 srvl056 kernel: [ 851.525553] ------------[ cut here ]------------
2025-02-20T15:45:09.658229+01:00 srvl056 kernel: [ 851.530874] cb_status=-521 tk_status=-10036
2025-02-20T15:45:09.658232+01:00 srvl056 kernel: [ 851.530922] WARNING: CPU: 2 PID: 1364 at fs/nfsd/nfs4callback.c:1339 nfsd4_cb_done+0x4e5/0x550 [nfsd]
2025-02-20T15:45:09.769527+01:00 srvl056 kernel: [ 851.546343] Modules linked in: rpcsec_gss_krb5 target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod nvme_fabrics nvme_keyring nvme_core nvme_auth binfmt_misc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp ppdev jc42 rapl intel_cstate iTCO_wdt intel_pmc_bxt intel_uncore mei_me pcspkr ee1004 iTCO_vendor_support parport_pc watchdog intel_wmi_thunderbolt intel_pch_thermal mei parport intel_pmc_core ie31200_edac intel_vsec intel_vbtn joydev evdev sg sparse_keymap pmt_telemetry pmt_class ipmi_ssif acpi_power_meter acpi_pad button acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler loop drm efi_pstore nfsd configfs nfs_acl lockd grace auth_rpcgss sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod
2025-02-20T15:45:09.803549+01:00 srvl056 kernel: [ 851.546530] dm_mod hid_generic usbhid hid sd_mod ahci libahci xhci_pci crct10dif_pclmul crc32_pclmul xhci_hcd crc32c_intel libata megaraid_sas ghash_clmulni_intel usbcore scsi_mod igb sha512_ssse3 sha256_ssse3 i2c_i801 i2c_algo_bit i2c_smbus sha1_ssse3 usb_common dca scsi_common fan video battery wmi aesni_intel gf128mul crypto_simd cryptd
2025-02-20T15:45:09.815159+01:00 srvl056 kernel: [ 851.681376] CPU: 2 UID: 0 PID: 1364 Comm: kworker/u32:1 Not tainted 6.12.9+bpo-amd64 #1 Debian 6.12.9-1~bpo12+1
2025-02-20T15:45:09.826795+01:00 srvl056 kernel: [ 851.692895] Hardware name: Intel Corporation S1200SP/S1200SP, BIOS S1200SP.86B.03.01.0049.060120200516 06/01/2020
2025-02-20T15:45:09.833054+01:00 srvl056 kernel: [ 851.704552] Workqueue: rpciod rpc_async_schedule [sunrpc]
2025-02-20T15:45:09.839019+01:00 srvl056 kernel: [ 851.710783] RIP: 0010:nfsd4_cb_done+0x4e5/0x550 [nfsd]
2025-02-20T15:45:09.860121+01:00 srvl056 kernel: [ 851.716738] Code: 8b 33 45 89 fe e9 d1 fb ff ff 80 3d 77 d7 01 00 00 0f 85 f3 fb ff ff 48 c7 c7 06 6b e0 c0 c6 05 63 d7 01 00 01 e8 9b 01 53 f6 <0f> 0b 8b 73 54 e9 d6 fb ff ff 0f 1f 44 00 00 e9 59 fd ff ff 41 89
2025-02-20T15:45:09.866189+01:00 srvl056 kernel: [ 851.737918] RSP: 0018:ffffa50f81287dc8 EFLAGS: 00010282
2025-02-20T15:45:09.874304+01:00 srvl056 kernel: [ 851.743915] RAX: 0000000000000000 RBX: ffff8b26707190f0 RCX: 0000000000000027
2025-02-20T15:45:09.882419+01:00 srvl056 kernel: [ 851.752031] RDX: ffff8b2da1321788 RSI: 0000000000000001 RDI: ffff8b2da1321780
2025-02-20T15:45:09.890522+01:00 srvl056 kernel: [ 851.760132] RBP: ffff8b264af1ba18 R08: 0000000000000000 R09: 0000000000000003
2025-02-20T15:45:09.898624+01:00 srvl056 kernel: [ 851.768234] R10: ffffa50f81287c58 R11: ffff8b2da4ffffe8 R12: ffff8b264af1ba18
2025-02-20T15:45:09.906742+01:00 srvl056 kernel: [ 851.776355] R13: ffff8b264b5a0d00 R14: 0000000000000001 R15: ffff8b26812f6e00
2025-02-20T15:45:09.915943+01:00 srvl056 kernel: [ 851.784473] FS: 0000000000000000(0000) GS:ffff8b2da1300000(0000) knlGS:0000000000000000
2025-02-20T15:45:09.922529+01:00 srvl056 kernel: [ 851.793698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2025-02-20T15:45:09.930692+01:00 srvl056 kernel: [ 851.800294] CR2: 000000c000bed000 CR3: 0000000852622005 CR4: 00000000003706f0
2025-02-20T15:45:09.938810+01:00 srvl056 kernel: [ 851.808423] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2025-02-20T15:45:09.946926+01:00 srvl056 kernel: [ 851.816524] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2025-02-20T15:45:09.949822+01:00 srvl056 kernel: [ 851.824655] Call Trace:
2025-02-20T15:45:09.956035+01:00 srvl056 kernel: [ 851.827545] <TASK>
2025-02-20T15:45:09.956040+01:00 srvl056 kernel: [ 851.830008] ? __warn+0x89/0x130
2025-02-20T15:45:09.961265+01:00 srvl056 kernel: [ 851.833742] ? nfsd4_cb_done+0x4e5/0x550 [nfsd]
2025-02-20T15:45:09.965545+01:00 srvl056 kernel: [ 851.839009] ? report_bug+0x164/0x190
2025-02-20T15:45:09.969967+01:00 srvl056 kernel: [ 851.843262] ? prb_read_valid+0x1b/0x30
2025-02-20T15:45:09.973989+01:00 srvl056 kernel: [ 851.847674] ? handle_bug+0x58/0x90
2025-02-20T15:45:09.978406+01:00 srvl056 kernel: [ 851.851688] ? exc_invalid_op+0x17/0x70
2025-02-20T15:45:09.983227+01:00 srvl056 kernel: [ 851.856136] ? asm_exc_invalid_op+0x1a/0x20
2025-02-20T15:45:09.988458+01:00 srvl056 kernel: [ 851.860937] ? nfsd4_cb_done+0x4e5/0x550 [nfsd]
2025-02-20T15:45:09.993695+01:00 srvl056 kernel: [ 851.866186] ? nfsd4_cb_done+0x4e5/0x550 [nfsd]
2025-02-20T15:45:09.999502+01:00 srvl056 kernel: [ 851.871405] ? __pfx_rpc_exit_task+0x10/0x10 [sunrpc]
2025-02-20T15:45:10.004636+01:00 srvl056 kernel: [ 851.877231] rpc_exit_task+0x5f/0x180 [sunrpc]
2025-02-20T15:45:10.009771+01:00 srvl056 kernel: [ 851.882381] __rpc_execute+0xb5/0x490 [sunrpc]
2025-02-20T15:45:10.015330+01:00 srvl056 kernel: [ 851.887528] rpc_async_schedule+0x2f/0x40 [sunrpc]
2025-02-20T15:45:10.019940+01:00 srvl056 kernel: [ 851.893041] process_one_work+0x179/0x390
2025-02-20T15:45:10.024317+01:00 srvl056 kernel: [ 851.897737] worker_thread+0x251/0x360
2025-02-20T15:45:10.029242+01:00 srvl056 kernel: [ 851.902054] ? __pfx_worker_thread+0x10/0x10
2025-02-20T15:45:10.032907+01:00 srvl056 kernel: [ 851.907009] kthread+0xcf/0x100
2025-02-20T15:45:10.037243+01:00 srvl056 kernel: [ 851.910643] ? __pfx_kthread+0x10/0x10
2025-02-20T15:45:10.041362+01:00 srvl056 kernel: [ 851.914955] ret_from_fork+0x31/0x50
2025-02-20T15:45:10.045701+01:00 srvl056 kernel: [ 851.919102] ? __pfx_kthread+0x10/0x10
2025-02-20T15:45:10.050219+01:00 srvl056 kernel: [ 851.923449] ret_from_fork_asm+0x1a/0x30
2025-02-20T15:45:10.052862+01:00 srvl056 kernel: [ 851.927992] </TASK>
2025-02-20T15:45:10.058171+01:00 srvl056 kernel: [ 851.930576] ---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-02-21 13:57 ` Harald Dunkel
@ 2025-02-21 14:31 ` Salvatore Bonaccorso
2025-02-21 14:50 ` Jeff Layton via Bugspray Bot
1 sibling, 0 replies; 35+ messages in thread
From: Salvatore Bonaccorso @ 2025-02-21 14:31 UTC (permalink / raw)
To: Harald Dunkel
Cc: Baptiste PELLEGRIN via Bugspray Bot, anna@kernel.org,
jlayton@kernel.org, cel@kernel.org, herzog@phys.ethz.ch,
tom@talpey.com, trondmy@kernel.org,
benoit.gschwind@minesparis.psl.eu,
baptiste.pellegrin@ac-grenoble.fr, linux-nfs@vger.kernel.org,
chuck.lever@oracle.com
Hi Harry,
On Fri, Feb 21, 2025 at 01:57:47PM +0000, Harald Dunkel wrote:
> Hi folks,
>
> I don't like to bring bad news, but yesterday I had a problem with
> kernel 6.12.9 (twice), see attachment. AFAIK 6.12.9 is supposed to
> include 961b4b5e86bf56a2e4b567f81682defa5cba957e and
> 8626664c87eebb21a40d4924b2f244a1f56d8806.
At first glance this looks like a different issue, as the backtrace
does not mention nfsd4_destroy_session at all, but I will leave it to
Chuck et al. as the experts to correct me if I'm asserting this
wrongly.
Regards,
Salvatore
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
` (18 preceding siblings ...)
2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
@ 2025-02-21 14:45 ` Jeff Layton via Bugspray Bot
19 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton via Bugspray Bot @ 2025-02-21 14:45 UTC (permalink / raw)
To: benoit.gschwind, anna, jlayton, baptiste.pellegrin, chuck.lever,
linux-nfs, herzog, cel, tom, trondmy, carnil, harald.dunkel
Jeff Layton writes via Kernel.org Bugzilla:
(In reply to Bugspray Bot from comment #31)
> Created attachment 307692 [details]
> x2
That warning is probably the bug that 4990d098433d fixes. That patch just went into -rc3 and hasn't trickled out to stable kernels yet. It is related to callbacks, but is unrelated to this BZ.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c33
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-02-21 13:57 ` Harald Dunkel
2025-02-21 14:31 ` Salvatore Bonaccorso
@ 2025-02-21 14:50 ` Jeff Layton via Bugspray Bot
1 sibling, 0 replies; 35+ messages in thread
From: Jeff Layton via Bugspray Bot @ 2025-02-21 14:50 UTC (permalink / raw)
To: anna, chuck.lever, linux-nfs, benoit.gschwind, carnil, tom,
harald.dunkel, cel, baptiste.pellegrin, jlayton, herzog, trondmy
Jeff Layton writes via Kernel.org Bugzilla:
(In reply to Bugspray Bot from comment #30)
>
> So I see the backport of 961b4b5e86bf NFSD: Reset cb_seq_status after
> NFS4ERR_DELAY landed in the just released 6.1.129 stable version.
>
> Do we consider this to be sufficient to have stabilized the situation
> with this issue? (I do realize much other work has been done as well, which
> has partially flowed down to the stable series already.)
>
That seems to have the two patches that I would consider most likely to fix the hangs reported here:
036ac2778f7b28885814c6fbc07e156ad1624d03
961b4b5e86bf56a2e4b567f81682defa5cba957e
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c34
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: NFSD threads hang when destroying a session or client ID
2025-02-21 13:42 ` Salvatore Bonaccorso
2025-02-21 13:57 ` Harald Dunkel
@ 2025-02-21 16:00 ` Chuck Lever via Bugspray Bot
1 sibling, 0 replies; 35+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2025-02-21 16:00 UTC (permalink / raw)
To: tom, benoit.gschwind, herzog, trondmy, carnil, jlayton,
chuck.lever, baptiste.pellegrin, harald.dunkel, cel, anna,
linux-nfs
Chuck Lever writes via Kernel.org Bugzilla:
(In reply to Bugspray Bot from comment #29)
> So I see the backport of 961b4b5e86bf NFSD: Reset cb_seq_status after
> NFS4ERR_DELAY landed in the just released 6.1.129 stable version.
>
> Do we consider this to be sufficient to have stabilized the situation
> about this issue? (I do realize much other work has dne as well which
> partially has flown down to stable series already).
>
> This reply is mainly in focus of https://bugs.debian.org/1071562
Since we do not have a reproducer, it's difficult to say with certainty that the issue has been addressed. Thus I feel it is up to the various original reporters to decide if they are happy with the solution above.
I'm comfortable leaving this bug open for a while.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c35
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2025-02-21 15:59 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
2025-01-20 15:14 ` Chuck Lever
2025-01-20 15:25 ` Chuck Lever via Bugspray Bot
2025-01-20 15:40 ` Chuck Lever via Bugspray Bot
2025-01-20 19:00 ` Chuck Lever via Bugspray Bot
2025-01-20 20:35 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 14:40 ` Jeff Layton via Bugspray Bot
2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
2025-01-21 17:35 ` Jeff Layton via Bugspray Bot
2025-01-21 19:38 ` Tom Talpey
2025-01-21 19:43 ` Chuck Lever
2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 16:35 ` Chuck Lever via Bugspray Bot
2025-01-22 11:40 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-22 14:19 ` Chuck Lever
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-23 2:10 ` Li Lingfeng via Bugspray Bot
2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
2025-01-23 14:22 ` Chuck Lever
2025-01-23 20:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-23 21:45 ` Chuck Lever via Bugspray Bot
2025-01-26 9:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-26 17:05 ` Chuck Lever via Bugspray Bot
2025-01-29 13:15 ` rik.theys via Bugspray Bot
2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
2025-01-30 14:05 ` rik.theys via Bugspray Bot
2025-01-29 19:50 ` Chuck Lever via Bugspray Bot
2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
2025-02-21 13:42 ` Salvatore Bonaccorso
2025-02-21 13:57 ` Harald Dunkel
2025-02-21 14:31 ` Salvatore Bonaccorso
2025-02-21 14:50 ` Jeff Layton via Bugspray Bot
2025-02-21 16:00 ` Chuck Lever via Bugspray Bot
2025-02-21 14:45 ` Jeff Layton via Bugspray Bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox