* Re: nfsd stuck in D (disk sleep) state
2024-10-28 12:46 ` Tom Talpey
@ 2024-10-28 13:32 ` Chuck Lever III
2024-10-28 17:01 ` Benoît Gschwind
2024-10-28 17:27 ` Jeff Layton
2 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever III @ 2024-10-28 13:32 UTC (permalink / raw)
To: Tom Talpey; +Cc: Benoît Gschwind, Linux NFS Mailing List
> On Oct 28, 2024, at 8:46 AM, Tom Talpey <tom@talpey.com> wrote:
>
> On 10/28/2024 5:18 AM, Benoît Gschwind wrote:
>> Hello,
>> The issue triggered again; I attached the result of:
>> # dmesg -W | tee dmesg.txt
>> using:
>> # echo t > /proc/sysrq-trigger
>> I have the following PIDs stuck:
>> 1474 D (disk sleep) 0:54:58.602 [nfsd]
>> 1475 D (disk sleep) 0:54:58.602 [nfsd]
>> 1484 D (disk sleep) 0:54:58.602 [nfsd]
>> 1495 D (disk sleep) 0:54:58.602 [nfsd]
>
> Hmm, 1495 is stuck in nfsd4_create_session
>
> > [427468.304955] task:nfsd state:D stack:0 pid:1495 ppid:2 flags:0x00004000
> > [427468.304962] Call Trace:
> > [427468.304965] <TASK>
> > [427468.304971] __schedule+0x34d/0x9e0
> > [427468.304983] schedule+0x5a/0xd0
> > [427468.304991] schedule_timeout+0x118/0x150
> > [427468.305003] wait_for_completion+0x86/0x160
> > [427468.305015] __flush_workqueue+0x152/0x420
> > [427468.305031] nfsd4_create_session+0x79f/0xba0 [nfsd]
> > [427468.305092] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > [427468.305147] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > [427468.305199] svc_process_common+0x295/0x610 [sunrpc]
> > [427468.305269] ? svc_recv+0x491/0x810 [sunrpc]
> > [427468.305337] ? nfsd_svc+0x370/0x370 [nfsd]
> > [427468.305389] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > [427468.305437] svc_process+0xad/0x100 [sunrpc]
> > [427468.305505] nfsd+0x99/0x140 [nfsd]
> > [427468.305555] kthread+0xda/0x100
> > [427468.305562] ? kthread_complete_and_exit+0x20/0x20
> > [427468.305572] ret_from_fork+0x22/0x30
>
> and the other three are stuck in nfsd4_destroy_session
>
> > [427468.298315] task:nfsd state:D stack:0 pid:1474 ppid:2 flags:0x00004000
> > [427468.298322] Call Trace:
> > [427468.298326] <TASK>
> > [427468.298332] __schedule+0x34d/0x9e0
> > [427468.298343] schedule+0x5a/0xd0
> > [427468.298350] schedule_timeout+0x118/0x150
> > [427468.298362] wait_for_completion+0x86/0x160
> > [427468.298375] __flush_workqueue+0x152/0x420
> > [427468.298392] nfsd4_destroy_session+0x1b6/0x250 [nfsd]
> > [427468.298456] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > [427468.298515] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > [427468.298568] svc_process_common+0x295/0x610 [sunrpc]
> > [427468.298643] ? svc_recv+0x491/0x810 [sunrpc]
> > [427468.298711] ? nfsd_svc+0x370/0x370 [nfsd]
> > [427468.298776] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > [427468.298825] svc_process+0xad/0x100 [sunrpc]
> > [427468.298896] nfsd+0x99/0x140 [nfsd]
> > [427468.298946] kthread+0xda/0x100
> > [427468.298954] ? kthread_complete_and_exit+0x20/0x20
> > [427468.298963] ret_from_fork+0x22/0x30
>
> There aren't a lot of 6.1-era changes in either of these, but there
> are some interesting behavior updates around session create replay
> from early this year. I wonder if the 6.1 server is mishandling an
> nfserr_jukebox situation in nfsd4_create_session.
For reference:
10396f4df8b7 ("nfsd: hold a lighter-weight client reference over CB_RECALL_ANY")
99dc2ef0397d ("NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies")
e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
A network capture, once these processes are stuck, might
confirm CREATE_SESSION slot replay as a culprit.
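For example, such a capture could be taken with tcpdump (the interface and output file names below are illustrative). NFSv4.1 runs the forechannel and backchannel over the same TCP connection on port 2049, so a single port filter sees both directions:

```shell
# Illustrative sketch: build the capture command. Adjust the interface
# and output file for the server in question.
iface=eth0
pcap=nfsd-stuck.pcap
# -s 0: capture full frames so the CREATE_SESSION slot ID and sequence
# number fields can be decoded later in wireshark.
cmd="tcpdump -i $iface -s 0 -w $pcap tcp port 2049"
echo "$cmd"
```

Opening the resulting capture in wireshark while the nfsd threads are stuck should show whether the client keeps retransmitting CREATE_SESSION with the same slot and sequence number.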
> Was the client actually attempting to mount or unmount?
>
> Tom.
>
>> Thanks in advance,
>> Best regards
>> On Wednesday, 23 October 2024 at 19:38 +0000, Chuck Lever III wrote:
>>>
>>>
>>>> On Oct 23, 2024, at 3:27 PM, Benoît Gschwind
>>>> <benoit.gschwind@minesparis.psl.eu> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an NFS server running Debian 11 (Linux hostname 6.1.0-25-amd64
>>>> #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux).
>>>>
>>>> Under some heavy workloads, some nfsd threads go into the D state and
>>>> seem never to leave it. I wrote a Python script to monitor how long a
>>>> process stays in a particular state, and I use it to monitor the nfsd
>>>> threads. I get the following result:
>>>>
>>>> [...]
>>>> 178056 I (idle) 0:25:24.475 [nfsd]
>>>> 178057 I (idle) 0:25:24.475 [nfsd]
>>>> 178058 I (idle) 0:25:24.475 [nfsd]
>>>> 178059 I (idle) 0:25:24.475 [nfsd]
>>>> 178060 I (idle) 0:25:24.475 [nfsd]
>>>> 178061 I (idle) 0:25:24.475 [nfsd]
>>>> 178062 I (idle) 0:24:15.638 [nfsd]
>>>> 178063 I (idle) 0:24:13.488 [nfsd]
>>>> 178064 I (idle) 0:24:13.488 [nfsd]
>>>> 178065 I (idle) 0:00:00.000 [nfsd]
>>>> 178066 I (idle) 0:00:00.000 [nfsd]
>>>> 178067 I (idle) 0:00:00.000 [nfsd]
>>>> 178068 I (idle) 0:00:00.000 [nfsd]
>>>> 178069 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178070 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178071 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178072 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178073 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178074 D (disk sleep) 1:29:25.809 [nfsd]
>>>> 178075 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178076 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178077 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178078 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178079 S (sleeping) 0:00:02.147 [nfsd]
>>>> 178080 D (disk sleep) 1:29:25.809 [nfsd]
>>>> 178081 D (disk sleep) 1:29:25.809 [nfsd]
>>>> 178082 D (disk sleep) 0:28:04.444 [nfsd]
>>>>
>>>> All processes not shown are in the idle state. The columns are:
>>>> PID, state code, state name, the amount of time the state has not
>>>> changed (and the process was not interrupted), and the
>>>> /proc/PID/status Name entry.
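A minimal sketch of such a state monitor (the names and structure here are illustrative, not Benoît's actual script) could read the state letter from /proc/<pid>/stat and track how long it has stayed unchanged:

```python
# Illustrative sketch of a per-PID state monitor: sample the state
# letter from /proc/<pid>/stat and report how long it has stayed put.
import time

def task_state(pid):
    """Return the one-letter state code (R, S, D, I, ...) for a PID."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field follows the parenthesised comm; rsplit guards
    # against comm values that themselves contain ')' or spaces.
    return data.rsplit(")", 1)[1].split()[0]

def watch(pids, samples=3, interval=1.0):
    """Print each PID's state and how long it has been unchanged."""
    last, since = {}, {}
    for _ in range(samples):
        now = time.monotonic()
        for pid in pids:
            st = task_state(pid)
            if last.get(pid) != st:
                last[pid], since[pid] = st, now
            print(pid, st, f"{now - since[pid]:.3f}s")
        time.sleep(interval)
```

A thread stuck in an uninterruptible sleep would show a steadily growing duration next to the "D" state, which is exactly the pattern in the listing above.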
>>>>
>>>> As you can see, some nfsd processes have been in the disk-sleep
>>>> state for more than an hour, but looking at the disk activity,
>>>> there is almost no I/O.
>>>>
>>>> I tried to restart nfs-server, but I got the following errors from
>>>> the kernel:
>>>>
>>>> oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
>>>> when sending 20 bytes - shutting down socket
>>>> oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
>>>> when sending 20 bytes - shutting down socket
>>>>
>>>> The only way to recover seems to be rebooting. I guess that works
>>>> because the kernel forces the reboot after a given timeout.
>>>>
>>>> My setup involves, in order:
>>>> - the SCSI driver
>>>> - mdraid (RAID 6) on top of SCSI
>>>> - btrfs on top of mdraid
>>>> - nfsd on top of btrfs
>>>>
>>>>
>>>> As expected, the setup is not very fast, but it seems that in some
>>>> situations nfsd never leaves the disk-sleep state. The export options
>>>> are: gss/krb5i(rw,sync,no_wdelay,no_subtree_check,fsid=XXXXX). The
>>>> situation is not common, but it always happens at some point. For
>>>> instance, in the case I report here, the server booted on 2024-10-01
>>>> and was stuck around 2024-10-23. Using no_wdelay greatly reduced the
>>>> frequency of the issue (I initially thought that option had solved
>>>> it).
>>>>
>>>> My guess: a hardware bug, a SCSI bug, a btrfs bug, or an nfsd bug?
>>>>
>>>> Any clue or advice on this topic is welcome.
>>>
>>> Generate stack traces for each process on the system
>>> using "echo t | sudo tee /proc/sysrq-trigger" and then
>>> examine the output in the system journal. Note the
>>> stack contents for the processes that look stuck.
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>
--
Chuck Lever
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: nfsd stuck in D (disk sleep) state
2024-10-28 12:46 ` Tom Talpey
2024-10-28 13:32 ` Chuck Lever III
@ 2024-10-28 17:01 ` Benoît Gschwind
2024-10-28 17:38 ` Tom Talpey
2024-10-28 17:27 ` Jeff Layton
2 siblings, 1 reply; 12+ messages in thread
From: Benoît Gschwind @ 2024-10-28 17:01 UTC (permalink / raw)
To: Tom Talpey, Chuck Lever III; +Cc: Linux NFS Mailing List
On Monday, 28 October 2024 at 08:46 -0400, Tom Talpey wrote:
>
> Was the client actually attempting to mount or unmount?
The servers had been running for a while, so, as far as I know my setup,
there was no ongoing mount or unmount. Moreover, none of the clients were
rebooting, starting up, or shutting down.
>
> Tom.
>
>
--
Benoît Gschwind
* Re: nfsd stuck in D (disk sleep) state
2024-10-28 17:01 ` Benoît Gschwind
@ 2024-10-28 17:38 ` Tom Talpey
0 siblings, 0 replies; 12+ messages in thread
From: Tom Talpey @ 2024-10-28 17:38 UTC (permalink / raw)
To: Benoît Gschwind, Chuck Lever III; +Cc: Linux NFS Mailing List
On 10/28/2024 1:01 PM, Benoît Gschwind wrote:
> On Monday, 28 October 2024 at 08:46 -0400, Tom Talpey wrote:
>>
>> Was the client actually attempting to mount or unmount?
>
> The servers had been running for a while, so, as far as I know my setup,
> there was no ongoing mount or unmount. Moreover, none of the clients were
> rebooting, starting up, or shutting down.
Interesting. I would therefore suspect the create_session was simply a
reconnect after some kind of transport connection error/loss, and the
destroy_session callers are hung for the same reason. And I agree with
Chuck that a network trace will help identify the patch which is needed
here.
You may have to let the trace run for a minute or two to detect all the
retry traffic between the client and server threads.
Tom.
* Re: nfsd stuck in D (disk sleep) state
2024-10-28 12:46 ` Tom Talpey
2024-10-28 13:32 ` Chuck Lever III
2024-10-28 17:01 ` Benoît Gschwind
@ 2024-10-28 17:27 ` Jeff Layton
2024-10-29 13:49 ` Jeff Layton
2 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2024-10-28 17:27 UTC (permalink / raw)
To: Tom Talpey, Benoît Gschwind, Chuck Lever III; +Cc: Linux NFS Mailing List
On Mon, 2024-10-28 at 08:46 -0400, Tom Talpey wrote:
> On 10/28/2024 5:18 AM, Benoît Gschwind wrote:
> > Hello,
> >
> > The issue triggered again; I attached the result of:
> >
> > # dmesg -W | tee dmesg.txt
> >
> > using:
> >
> > # echo t > /proc/sysrq-trigger
> >
> > I have the following PIDs stuck:
> >
> > 1474 D (disk sleep) 0:54:58.602 [nfsd]
> > 1475 D (disk sleep) 0:54:58.602 [nfsd]
> > 1484 D (disk sleep) 0:54:58.602 [nfsd]
> > 1495 D (disk sleep) 0:54:58.602 [nfsd]
>
> Hmm, 1495 is stuck in nfsd4_create_session
>
> > [427468.304955] task:nfsd state:D stack:0 pid:1495 ppid:2 flags:0x00004000
> > [427468.304962] Call Trace:
> > [427468.304965] <TASK>
> > [427468.304971] __schedule+0x34d/0x9e0
> > [427468.304983] schedule+0x5a/0xd0
> > [427468.304991] schedule_timeout+0x118/0x150
> > [427468.305003] wait_for_completion+0x86/0x160
> > [427468.305015] __flush_workqueue+0x152/0x420
> > [427468.305031] nfsd4_create_session+0x79f/0xba0 [nfsd]
> > [427468.305092] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > [427468.305147] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > [427468.305199] svc_process_common+0x295/0x610 [sunrpc]
> > [427468.305269] ? svc_recv+0x491/0x810 [sunrpc]
> > [427468.305337] ? nfsd_svc+0x370/0x370 [nfsd]
> > [427468.305389] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > [427468.305437] svc_process+0xad/0x100 [sunrpc]
> > [427468.305505] nfsd+0x99/0x140 [nfsd]
> > [427468.305555] kthread+0xda/0x100
> > [427468.305562] ? kthread_complete_and_exit+0x20/0x20
> > [427468.305572] ret_from_fork+0x22/0x30
>
> and the other three are stuck in nfsd4_destroy_session
>
All 4 processes are stuck waiting on a flush_workqueue call. It's
probably one of these:
flush_workqueue(clp->cl_callback_wq);
So the question here is really -- why are the callback workqueue jobs
stuck? I do see this:
[427468.316839] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
[427468.316899] Call Trace:
[427468.316902] <TASK>
[427468.316908] __schedule+0x34d/0x9e0
[427468.316919] schedule+0x5a/0xd0
[427468.316927] schedule_timeout+0x94/0x150
[427468.316937] ? __bpf_trace_tick_stop+0x10/0x10
[427468.316947] rpc_shutdown_client+0xf2/0x150 [sunrpc]
[427468.317015] ? cpuusage_read+0x10/0x10
[427468.317026] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[427468.317097] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[427468.317146] process_one_work+0x1c7/0x380
[427468.317158] worker_thread+0x4d/0x380
[427468.317170] ? rescuer_thread+0x3a0/0x3a0
[427468.317177] kthread+0xda/0x100
[427468.317185] ? kthread_complete_and_exit+0x20/0x20
[427468.317195] ret_from_fork+0x22/0x30
[427468.317213] </TASK>
Maybe the RPC client is having trouble clearing clnt->cl_tasks?
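The shape of that hang can be shown in miniature with a userspace analogue (plain Python threads, not the kernel code): flushing a workqueue cannot complete while a queued work item is itself blocked on something that never finishes — here, the callback RPC tasks exiting:

```python
import threading

# Stands in for "all callback RPC tasks have exited" (cl_tasks empty).
tasks_dead = threading.Event()

def cb_work():
    # Analogue of nfsd4_run_cb_work -> nfsd4_process_cb_update ->
    # rpc_shutdown_client: block until the RPC tasks are gone.
    tasks_dead.wait()

# Analogue of queueing work on clp->cl_callback_wq.
worker = threading.Thread(target=cb_work)
worker.start()

# Analogue of nfsd4_create_session -> flush_workqueue(): it cannot
# return until cb_work does, so while tasks_dead is never set this
# "flush" hangs, just like the nfsd threads in the traces above.
worker.join(timeout=0.1)
print("flush completed:", not worker.is_alive())

tasks_dead.set()   # once the RPC tasks die, everything unwedges
worker.join()
print("flush completed:", not worker.is_alive())
```

The first join times out with the worker still alive; only after the event fires does the "flush" complete, which is why every nfsd thread that calls flush_workqueue() piles up behind the stuck callback work item.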
> > [427468.298315] task:nfsd state:D stack:0 pid:1474 ppid:2 flags:0x00004000
> > [427468.298322] Call Trace:
> > [427468.298326] <TASK>
> > [427468.298332] __schedule+0x34d/0x9e0
> > [427468.298343] schedule+0x5a/0xd0
> > [427468.298350] schedule_timeout+0x118/0x150
> > [427468.298362] wait_for_completion+0x86/0x160
> > [427468.298375] __flush_workqueue+0x152/0x420
> > [427468.298392] nfsd4_destroy_session+0x1b6/0x250 [nfsd]
> > [427468.298456] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > [427468.298515] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > [427468.298568] svc_process_common+0x295/0x610 [sunrpc]
> > [427468.298643] ? svc_recv+0x491/0x810 [sunrpc]
> > [427468.298711] ? nfsd_svc+0x370/0x370 [nfsd]
> > [427468.298776] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > [427468.298825] svc_process+0xad/0x100 [sunrpc]
> > [427468.298896] nfsd+0x99/0x140 [nfsd]
> > [427468.298946] kthread+0xda/0x100
> > [427468.298954] ? kthread_complete_and_exit+0x20/0x20
> > [427468.298963] ret_from_fork+0x22/0x30
>
> There aren't a lot of 6.1-era changes in either of these, but there
> are some interesting behavior updates around session create replay
> from early this year. I wonder if the 6.1 server is mishandling an
> nfserr_jukebox situation in nfsd4_create_session.
>
> Was the client actually attempting to mount or unmount?
>
> Tom.
>
> >
> > Thanks in advance,
> > Best regards
> >
> > On Wednesday, 23 October 2024 at 19:38 +0000, Chuck Lever III wrote:
> > >
> > >
> > > > On Oct 23, 2024, at 3:27 PM, Benoît Gschwind
> > > > <benoit.gschwind@minesparis.psl.eu> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have an NFS server running Debian 11 (Linux hostname 6.1.0-25-amd64
> > > > #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux).
> > > >
> > > > Under some heavy workloads, some nfsd threads go into the D state and
> > > > seem never to leave it. I wrote a Python script to monitor how long a
> > > > process stays in a particular state, and I use it to monitor the nfsd
> > > > threads. I get the following result:
> > > >
> > > > [...]
> > > > 178056 I (idle) 0:25:24.475 [nfsd]
> > > > 178057 I (idle) 0:25:24.475 [nfsd]
> > > > 178058 I (idle) 0:25:24.475 [nfsd]
> > > > 178059 I (idle) 0:25:24.475 [nfsd]
> > > > 178060 I (idle) 0:25:24.475 [nfsd]
> > > > 178061 I (idle) 0:25:24.475 [nfsd]
> > > > 178062 I (idle) 0:24:15.638 [nfsd]
> > > > 178063 I (idle) 0:24:13.488 [nfsd]
> > > > 178064 I (idle) 0:24:13.488 [nfsd]
> > > > 178065 I (idle) 0:00:00.000 [nfsd]
> > > > 178066 I (idle) 0:00:00.000 [nfsd]
> > > > 178067 I (idle) 0:00:00.000 [nfsd]
> > > > 178068 I (idle) 0:00:00.000 [nfsd]
> > > > 178069 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178070 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178071 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178072 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178073 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178074 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178075 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178076 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178077 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178078 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178079 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178080 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178081 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178082 D (disk sleep) 0:28:04.444 [nfsd]
> > > >
> > > > All processes not shown are in the idle state. The columns are:
> > > > PID, state code, state name, the amount of time the state has not
> > > > changed (and the process was not interrupted), and the
> > > > /proc/PID/status Name entry.
> > > >
> > > > As you can see, some nfsd processes have been in the disk-sleep
> > > > state for more than an hour, but looking at the disk activity,
> > > > there is almost no I/O.
> > > >
> > > > I tried to restart nfs-server, but I got the following errors from
> > > > the kernel:
> > > >
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
> > > > when sending 20 bytes - shutting down socket
> > > >
> > > > The only way to recover seems to be rebooting. I guess that works
> > > > because the kernel forces the reboot after a given timeout.
> > > >
> > > > My setup involves, in order:
> > > > - the SCSI driver
> > > > - mdraid (RAID 6) on top of SCSI
> > > > - btrfs on top of mdraid
> > > > - nfsd on top of btrfs
> > > >
> > > >
> > > > As expected, the setup is not very fast, but it seems that in some
> > > > situations nfsd never leaves the disk-sleep state. The export options
> > > > are: gss/krb5i(rw,sync,no_wdelay,no_subtree_check,fsid=XXXXX). The
> > > > situation is not common, but it always happens at some point. For
> > > > instance, in the case I report here, the server booted on 2024-10-01
> > > > and was stuck around 2024-10-23. Using no_wdelay greatly reduced the
> > > > frequency of the issue (I initially thought that option had solved
> > > > it).
> > > >
> > > > My guess: a hardware bug, a SCSI bug, a btrfs bug, or an nfsd bug?
> > > >
> > > > Any clue or advice on this topic is welcome.
> > >
> > > Generate stack traces for each process on the system
> > > using "echo t | sudo tee /proc/sysrq-trigger" and then
> > > examine the output in the system journal. Note the
> > > stack contents for the processes that look stuck.
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> >
>
>
--
Jeff Layton <jlayton@kernel.org>
* Re: nfsd stuck in D (disk sleep) state
2024-10-28 17:27 ` Jeff Layton
@ 2024-10-29 13:49 ` Jeff Layton
2024-10-29 20:04 ` Jeff Layton
0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2024-10-29 13:49 UTC (permalink / raw)
To: Tom Talpey, Benoît Gschwind, Chuck Lever III; +Cc: Linux NFS Mailing List
On Mon, 2024-10-28 at 13:27 -0400, Jeff Layton wrote:
> On Mon, 2024-10-28 at 08:46 -0400, Tom Talpey wrote:
> > On 10/28/2024 5:18 AM, Benoît Gschwind wrote:
> > > Hello,
> > >
> > > The issue triggered again; I attached the result of:
> > >
> > > # dmesg -W | tee dmesg.txt
> > >
> > > using:
> > >
> > > # echo t > /proc/sysrq-trigger
> > >
> > > I have the following PIDs stuck:
> > >
> > > 1474 D (disk sleep) 0:54:58.602 [nfsd]
> > > 1475 D (disk sleep) 0:54:58.602 [nfsd]
> > > 1484 D (disk sleep) 0:54:58.602 [nfsd]
> > > 1495 D (disk sleep) 0:54:58.602 [nfsd]
> >
> > Hmm, 1495 is stuck in nfsd4_create_session
> >
> > > [427468.304955] task:nfsd state:D stack:0 pid:1495 ppid:2 flags:0x00004000
> > > [427468.304962] Call Trace:
> > > [427468.304965] <TASK>
> > > [427468.304971] __schedule+0x34d/0x9e0
> > > [427468.304983] schedule+0x5a/0xd0
> > > [427468.304991] schedule_timeout+0x118/0x150
> > > [427468.305003] wait_for_completion+0x86/0x160
> > > [427468.305015] __flush_workqueue+0x152/0x420
> > > [427468.305031] nfsd4_create_session+0x79f/0xba0 [nfsd]
> > > [427468.305092] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > > [427468.305147] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > > [427468.305199] svc_process_common+0x295/0x610 [sunrpc]
> > > [427468.305269] ? svc_recv+0x491/0x810 [sunrpc]
> > > [427468.305337] ? nfsd_svc+0x370/0x370 [nfsd]
> > > [427468.305389] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > > [427468.305437] svc_process+0xad/0x100 [sunrpc]
> > > [427468.305505] nfsd+0x99/0x140 [nfsd]
> > > [427468.305555] kthread+0xda/0x100
> > > [427468.305562] ? kthread_complete_and_exit+0x20/0x20
> > > [427468.305572] ret_from_fork+0x22/0x30
> >
> > and the other three are stuck in nfsd4_destroy_session
> >
>
> All 4 processes are stuck waiting on a flush_workqueue call. It's
> probably one of these:
>
> flush_workqueue(clp->cl_callback_wq);
>
> So the question here is really -- why are the callback workqueue jobs
> stuck? I do see this:
>
> [427468.316839] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
> [427468.316899] Call Trace:
> [427468.316902] <TASK>
> [427468.316908] __schedule+0x34d/0x9e0
> [427468.316919] schedule+0x5a/0xd0
> [427468.316927] schedule_timeout+0x94/0x150
> [427468.316937] ? __bpf_trace_tick_stop+0x10/0x10
> [427468.316947] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> [427468.317015] ? cpuusage_read+0x10/0x10
> [427468.317026] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> [427468.317097] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> [427468.317146] process_one_work+0x1c7/0x380
> [427468.317158] worker_thread+0x4d/0x380
> [427468.317170] ? rescuer_thread+0x3a0/0x3a0
> [427468.317177] kthread+0xda/0x100
> [427468.317185] ? kthread_complete_and_exit+0x20/0x20
> [427468.317195] ret_from_fork+0x22/0x30
> [427468.317213] </TASK>
>
> Maybe the RPC client is having trouble clearing clnt->cl_tasks?
>
Interestingly, I hit this today while testing the patches that widen
the backchannel:
[ 484.550350] INFO: task nfsd:1168 blocked for more than 120 seconds.
[ 484.551746] Not tainted 6.12.0-rc5+ #188
[ 484.552854] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.586316] task:nfsd state:D stack:0 pid:1168 tgid:1168 ppid:2 flags:0x00004000
[ 484.588417] Call Trace:
[ 484.589070] <TASK>
[ 484.589624] __schedule+0x4e4/0xb90
[ 484.622306] schedule+0x26/0xf0
[ 484.623162] schedule_timeout+0x121/0x150
[ 484.624344] ? __smp_call_single_queue+0xa7/0x110
[ 484.625375] ? x2apic_send_IPI+0x45/0x50
[ 484.654299] ? ttwu_queue_wakelist+0xbf/0x110
[ 484.655454] __wait_for_common+0x91/0x1c0
[ 484.656372] ? __pfx_schedule_timeout+0x10/0x10
[ 484.657429] __flush_workqueue+0x13a/0x400
[ 484.686304] nfsd4_create_session+0x9cd/0xd90 [nfsd]
[ 484.687562] nfsd4_proc_compound+0x396/0x750 [nfsd]
[ 484.688746] nfsd_dispatch+0xc6/0x210 [nfsd]
[ 484.689835] svc_process_common+0x4e6/0x6d0 [sunrpc]
[ 484.722305] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[ 484.723532] svc_process+0x12d/0x1c0 [sunrpc]
[ 484.724619] svc_recv+0x7cf/0x980 [sunrpc]
[ 484.725618] ? __pfx_nfsd+0x10/0x10 [nfsd]
[ 484.758308] nfsd+0x9f/0x100 [nfsd]
[ 484.759184] kthread+0xdd/0x110
[ 484.759905] ? __pfx_kthread+0x10/0x10
[ 484.760870] ret_from_fork+0x30/0x50
[ 484.761603] ? __pfx_kthread+0x10/0x10
[ 484.790302] ret_from_fork_asm+0x1a/0x30
[ 484.791151] </TASK>
The host was hung after this, so I can't fully confirm the scenario
below, but I think it's probably correct:
The client might want to create a new session. When that happens, the
create_session call tries to flush_workqueue() on the client-wide
backchannel workqueue.
That workqueue is stuck, as it's waiting for all of the callback RPCs
to die off (as evidenced by the stack from Benoît's info that shows it
hung in rpc_shutdown_client). So the question I guess at this point is
why these RPC tasks aren't dying like they're supposed to when
rpc_killall_tasks() is run.
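A userspace sketch (illustrative Python, not the sunrpc implementation) of how a "killall" can fail to take effect: marking a task killed only helps if the task is also woken so it can observe the mark. A task parked on a queue whose wakeup never arrives stays put, and anything waiting for it, such as rpc_shutdown_client() draining cl_tasks, waits forever:

```python
import threading

class FakeRpcTask:
    """Userspace analogue of an async RPC task parked on a waitqueue."""
    def __init__(self):
        self.killed = threading.Event()   # analogue of RPC_TASK_KILLED
        self.wake = threading.Event()     # analogue of a queued-task wakeup

    def run(self):
        while True:
            self.wake.wait()              # only makes progress when woken
            self.wake.clear()
            if self.killed.is_set():      # honours the kill once it runs
                return

task = FakeRpcTask()
t = threading.Thread(target=task.run)
t.start()

# Setting the kill flag alone does nothing: the task never runs again
# to observe it, and a shutdown waiting for it would hang forever.
task.killed.set()
t.join(timeout=0.1)
print("exited after flag only:", not t.is_alive())

# Flag *plus* wakeup, which is what rpc_killall_tasks is meant to
# deliver; a lost or missing wakeup reproduces the hang above.
task.wake.set()
t.join()
print("exited after wakeup:", not t.is_alive())
```

In this sketch the flag alone leaves the task parked; only the wakeup lets it observe the kill and exit, which is the behavior to check for in the real backchannel tasks.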
I'm guessing that the backchannel-widening patches may make this a bit
easier to reproduce? In any case, this problem doesn't seem to be
confined to v6.1.
I'm still looking at this, so stay tuned...
--
Jeff Layton <jlayton@kernel.org>
* Re: nfsd stuck in D (disk sleep) state
2024-10-29 13:49 ` Jeff Layton
@ 2024-10-29 20:04 ` Jeff Layton
2025-01-20 9:22 ` Benoît Gschwind
0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2024-10-29 20:04 UTC (permalink / raw)
To: Tom Talpey, Benoît Gschwind, Chuck Lever III; +Cc: Linux NFS Mailing List
On Tue, 2024-10-29 at 09:49 -0400, Jeff Layton wrote:
> On Mon, 2024-10-28 at 13:27 -0400, Jeff Layton wrote:
> > On Mon, 2024-10-28 at 08:46 -0400, Tom Talpey wrote:
> > > On 10/28/2024 5:18 AM, Benoît Gschwind wrote:
> > > > Hello,
> > > >
> > > > The issue triggered again; I attached the result of:
> > > >
> > > > # dmesg -W | tee dmesg.txt
> > > >
> > > > using:
> > > >
> > > > # echo t > /proc/sysrq-trigger
> > > >
> > > > I have the following PIDs stuck:
> > > >
> > > > 1474 D (disk sleep) 0:54:58.602 [nfsd]
> > > > 1475 D (disk sleep) 0:54:58.602 [nfsd]
> > > > 1484 D (disk sleep) 0:54:58.602 [nfsd]
> > > > 1495 D (disk sleep) 0:54:58.602 [nfsd]
> > >
> > > Hmm, 1495 is stuck in nfsd4_create_session
> > >
> > > > [427468.304955] task:nfsd state:D stack:0 pid:1495 ppid:2 flags:0x00004000
> > > > [427468.304962] Call Trace:
> > > > [427468.304965] <TASK>
> > > > [427468.304971] __schedule+0x34d/0x9e0
> > > > [427468.304983] schedule+0x5a/0xd0
> > > > [427468.304991] schedule_timeout+0x118/0x150
> > > > [427468.305003] wait_for_completion+0x86/0x160
> > > > [427468.305015] __flush_workqueue+0x152/0x420
> > > > [427468.305031] nfsd4_create_session+0x79f/0xba0 [nfsd]
> > > > [427468.305092] nfsd4_proc_compound+0x34c/0x660 [nfsd]
> > > > [427468.305147] nfsd_dispatch+0x1a1/0x2b0 [nfsd]
> > > > [427468.305199] svc_process_common+0x295/0x610 [sunrpc]
> > > > [427468.305269] ? svc_recv+0x491/0x810 [sunrpc]
> > > > [427468.305337] ? nfsd_svc+0x370/0x370 [nfsd]
> > > > [427468.305389] ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
> > > > [427468.305437] svc_process+0xad/0x100 [sunrpc]
> > > > [427468.305505] nfsd+0x99/0x140 [nfsd]
> > > > [427468.305555] kthread+0xda/0x100
> > > > [427468.305562] ? kthread_complete_and_exit+0x20/0x20
> > > > [427468.305572] ret_from_fork+0x22/0x30
> > >
> > > and the other three are stuck in nfsd4_destroy_session
> > >
> >
> > All 4 processes are stuck waiting on a flush_workqueue call. It's
> > probably one of these:
> >
> > flush_workqueue(clp->cl_callback_wq);
> >
> > So the question here is really -- why are the callback workqueue jobs
> > stuck? I do see this:
> >
> > [427468.316839] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
> > [427468.316899] Call Trace:
> > [427468.316902] <TASK>
> > [427468.316908] __schedule+0x34d/0x9e0
> > [427468.316919] schedule+0x5a/0xd0
> > [427468.316927] schedule_timeout+0x94/0x150
> > [427468.316937] ? __bpf_trace_tick_stop+0x10/0x10
> > [427468.316947] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> > [427468.317015] ? cpuusage_read+0x10/0x10
> > [427468.317026] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> > [427468.317097] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> > [427468.317146] process_one_work+0x1c7/0x380
> > [427468.317158] worker_thread+0x4d/0x380
> > [427468.317170] ? rescuer_thread+0x3a0/0x3a0
> > [427468.317177] kthread+0xda/0x100
> > [427468.317185] ? kthread_complete_and_exit+0x20/0x20
> > [427468.317195] ret_from_fork+0x22/0x30
> > [427468.317213] </TASK>
> >
> > Maybe the RPC client is having trouble clearing clnt->cl_tasks?
> >
>
> Interestingly, I hit this today while testing the patches that widen
> the backchannel:
>
> [ 484.550350] INFO: task nfsd:1168 blocked for more than 120 seconds.
> [ 484.551746] Not tainted 6.12.0-rc5+ #188
> [ 484.552854] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 484.586316] task:nfsd state:D stack:0 pid:1168 tgid:1168 ppid:2 flags:0x00004000
> [ 484.588417] Call Trace:
> [ 484.589070] <TASK>
> [ 484.589624] __schedule+0x4e4/0xb90
> [ 484.622306] schedule+0x26/0xf0
> [ 484.623162] schedule_timeout+0x121/0x150
> [ 484.624344] ? __smp_call_single_queue+0xa7/0x110
> [ 484.625375] ? x2apic_send_IPI+0x45/0x50
> [ 484.654299] ? ttwu_queue_wakelist+0xbf/0x110
> [ 484.655454] __wait_for_common+0x91/0x1c0
> [ 484.656372] ? __pfx_schedule_timeout+0x10/0x10
> [ 484.657429] __flush_workqueue+0x13a/0x400
> [ 484.686304] nfsd4_create_session+0x9cd/0xd90 [nfsd]
> [ 484.687562] nfsd4_proc_compound+0x396/0x750 [nfsd]
> [ 484.688746] nfsd_dispatch+0xc6/0x210 [nfsd]
> [ 484.689835] svc_process_common+0x4e6/0x6d0 [sunrpc]
> [ 484.722305] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
> [ 484.723532] svc_process+0x12d/0x1c0 [sunrpc]
> [ 484.724619] svc_recv+0x7cf/0x980 [sunrpc]
> [ 484.725618] ? __pfx_nfsd+0x10/0x10 [nfsd]
> [ 484.758308] nfsd+0x9f/0x100 [nfsd]
> [ 484.759184] kthread+0xdd/0x110
> [ 484.759905] ? __pfx_kthread+0x10/0x10
> [ 484.760870] ret_from_fork+0x30/0x50
> [ 484.761603] ? __pfx_kthread+0x10/0x10
> [ 484.790302] ret_from_fork_asm+0x1a/0x30
> [ 484.791151] </TASK>
>
> The host was hung after this, so I can't fully confirm the scenario
> below, but I think it's probably correct:
>
> The client might want to create a new session. When that happens, the
> create_session call tries to flush_workqueue() on the client-wide
> backchannel workqueue.
>
> That workqueue is stuck, as it's waiting for all of the callback RPCs
> to die off (as evidenced by the stack from Benoît's info that shows it
> hung in rpc_shutdown_client). So the question I guess at this point is
> why these RPC tasks aren't dying like they're supposed to when
> rpc_killall_tasks() is run.
>
> I'm guessing that the backchannel-widening patches may make this a bit
> easier to reproduce? In any case, this problem doesn't seem to be
> confined to v6.1.
>
> I'm still looking at this, so stay tuned...
The root cause in my case was a bug in the backchannel-widening patch,
which was causing the client to recreate the session occasionally.
I can't reproduce it now with the v2 version of that patch, but I still
suspect that this is related to problems killing backchannel RPC tasks.
--
Jeff Layton <jlayton@kernel.org>