* nfsd4 laundromat_main hung tasks
@ 2025-01-10 19:49 Rik Theys
2025-01-10 20:30 ` Chuck Lever
From: Rik Theys @ 2025-01-10 19:49 UTC
To: linux-nfs
Hi,
Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
to log the following hung task messages:
INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
Tainted: G W E 6.11.11-1.el9.esat.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u194:11 state:D stack:0 pid:1677933 tgid:1677933
ppid:2 flags:0x00004000
Workqueue: nfsd4 laundromat_main [nfsd]
Call Trace:
<TASK>
__schedule+0x21c/0x5d0
? preempt_count_add+0x47/0xa0
schedule+0x26/0xa0
nfsd4_shutdown_callback+0xea/0x120 [nfsd]
? __pfx_var_wake_function+0x10/0x10
__destroy_client+0x1f0/0x290 [nfsd]
nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
nfs4_laundromat+0x126/0x7a0 [nfsd]
? _raw_spin_unlock_irqrestore+0x23/0x40
laundromat_main+0x16/0x40 [nfsd]
process_one_work+0x179/0x390
worker_thread+0x239/0x340
? __pfx_worker_thread+0x10/0x10
kthread+0xdb/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
If I read this correctly, it seems to be blocked on a callback
operation during shutdown of a client connection?
Is this a known issue that may be fixed in the 6.12.x kernel? Could
the following commit be relevant?
8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a nfsd: fix race between
laundromat and free_stateid
If I increase the hung_task_warnings sysctl and wait a few minutes,
the hung task message appears again, so the issue is still present on
the system. How can I debug which client is causing this issue?
Is there any other information I can provide?
Could this be related to the following thread:
https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?
I don't know if this is relevant but I've noticed that some clients
have multiple entries in the /proc/fs/nfsd/clients directory, so I
assume these clients are not cleaned up correctly?
For example:
clientid: 0x6d077c99675df2b3
address: "10.87.29.32:864"
status: confirmed
seconds from last renew: 0
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.87.29.32:0
admin-revoked states: 0
***
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2288446
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.87.29.32:0
admin-revoked states: 0
The first one has status "confirmed" and the second "courtesy" with
a high "seconds from last renew". The address and port match for
both client entries, and the callback state is UP for both.
For another client, there's a different output:
clientid: 0x6d078a79675df2b3
address: "10.33.130.34:864"
status: unconfirmed
seconds from last renew: 158910
name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UNKNOWN
callback address: (einval)
admin-revoked states: 0
***
clientid: 0x6d078a7a675df2b3
address: "10.33.130.34:864"
status: confirmed
seconds from last renew: 2
name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.33.130.34:0
admin-revoked states: 0
Here the first status is unconfirmed and the callback state is UNKNOWN.
The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
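Something like this quick script makes the duplicates easier to spot (an
untested sketch; it just parses the "key: value" lines shown above out of
each /proc/fs/nfsd/clients/<id>/info file and groups entries by client name):

```python
#!/usr/bin/env python3
"""Group /proc/fs/nfsd/clients entries by client name to spot leftovers.

Untested sketch: field names are taken from the 'info' output quoted in
this thread; adjust if your kernel formats them differently.
"""
import pathlib
import re
from collections import defaultdict

CLIENTS_DIR = pathlib.Path("/proc/fs/nfsd/clients")


def parse_info(text):
    """Parse one clients/<id>/info file into a dict of its 'key: value' lines."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"([\w -]+):\s*(.*)", line)
        if m:
            # Values like name/address are quoted in the file; strip the quotes.
            info[m.group(1).strip()] = m.group(2).strip('"')
    return info


def find_duplicates():
    """Return {client name: [info dicts]} for names with more than one entry."""
    by_name = defaultdict(list)
    for info_file in CLIENTS_DIR.glob("*/info"):
        info = parse_info(info_file.read_text())
        by_name[info.get("name", "?")].append(info)
    return {name: entries for name, entries in by_name.items() if len(entries) > 1}


if __name__ == "__main__" and CLIENTS_DIR.is_dir():
    for name, entries in find_duplicates().items():
        print(name)
        for e in entries:
            print("  clientid=%s status=%s last-renew=%ss" % (
                e.get("clientid"), e.get("status"),
                e.get("seconds from last renew")))
```

Run as root on the server; any name printed has more than one clientid.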
Regards,
Rik
* Re: nfsd4 laundromat_main hung tasks
2025-01-10 19:49 nfsd4 laundromat_main hung tasks Rik Theys
@ 2025-01-10 20:30 ` Chuck Lever
[not found] ` <CAPwv0J=oKBnCia_mmhm-tYLPqw03jO=LxfUbShSyXFp-mKET5A@mail.gmail.com>
From: Chuck Lever @ 2025-01-10 20:30 UTC
To: Rik Theys, linux-nfs
On 1/10/25 2:49 PM, Rik Theys wrote:
> Hi,
>
> Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
> to log the following hung task messages:
>
> INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
> Tainted: G W E 6.11.11-1.el9.esat.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u194:11 state:D stack:0 pid:1677933 tgid:1677933
> ppid:2 flags:0x00004000
> Workqueue: nfsd4 laundromat_main [nfsd]
> Call Trace:
> <TASK>
> __schedule+0x21c/0x5d0
> ? preempt_count_add+0x47/0xa0
> schedule+0x26/0xa0
> nfsd4_shutdown_callback+0xea/0x120 [nfsd]
> ? __pfx_var_wake_function+0x10/0x10
> __destroy_client+0x1f0/0x290 [nfsd]
> nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
> nfs4_laundromat+0x126/0x7a0 [nfsd]
> ? _raw_spin_unlock_irqrestore+0x23/0x40
> laundromat_main+0x16/0x40 [nfsd]
> process_one_work+0x179/0x390
> worker_thread+0x239/0x340
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xdb/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2d/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> If I read this correctly, it seems to be blocked on a callback
> operation during shutdown of a client connection?
>
> Is this a known issue that may be fixed in the 6.12.x kernel? Could
> the following commit be relevant?
It is a known issue that we're just beginning to work on. It's not
addressed in any kernel at the moment.
> 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a nfsd: fix race between
> laundromat and free_stateid
>
> If I increase the hung_task_warnings sysctl and wait a few minutes,
> the hung task message appears again, so the issue is still present on
> the system. How can I debug which client is causing this issue?
>
> Is there any other information I can provide?
Yes. We badly need a simple reproducer for this issue so that we
can test and confirm that it is fixed before requesting that any
fix is merged.
An environment where we can test patches against the upstream
kernel would also be welcome.
> Could this be related to the following thread:
> https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?
Yes.
> I don't know if this is relevant but I've noticed that some clients
> have multiple entries in the /proc/fs/nfsd/clients directory, so I
> assume these clients are not cleaned up correctly?
>
> For example:
>
> clientid: 0x6d077c99675df2b3
> address: "10.87.29.32:864"
> status: confirmed
> seconds from last renew: 0
> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> Dec 11 16:33:48 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.87.29.32:0
> admin-revoked states: 0
> ***
> clientid: 0x6d0596d0675df2b3
> address: "10.87.29.32:864"
> status: courtesy
> seconds from last renew: 2288446
> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> Dec 11 16:33:48 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.87.29.32:0
> admin-revoked states: 0
>
> The first one has status confirmed and the second one "courtesy" with
> a high "seconds from last renew". The address and port matches for
> both client entries and the callback state is both UP.
>
> For another client, there's a different output:
>
> clientid: 0x6d078a79675df2b3
> address: "10.33.130.34:864"
> status: unconfirmed
> seconds from last renew: 158910
> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UNKNOWN
> callback address: (einval)
> admin-revoked states: 0
> ***
> clientid: 0x6d078a7a675df2b3
> address: "10.33.130.34:864"
> status: confirmed
> seconds from last renew: 2
> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.33.130.34:0
> admin-revoked states: 0
>
>
> Here the first status is unconfirmed and the callback state is UNKNOWN.
>
> The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
>
> Regards,
>
> Rik
>
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
[not found] ` <49654519-9166-4593-ac62-77400cebebb4@oracle.com>
@ 2025-01-12 12:42 ` Rik Theys
2025-01-12 18:57 ` Chuck Lever
2025-01-13 22:12 ` Chuck Lever
From: Rik Theys @ 2025-01-12 12:42 UTC
To: Chuck Lever, linux-nfs
Hi,
On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/10/25 3:51 PM, Rik Theys wrote:
> > Hi,
> >
> > Thanks for your prompt reply.
> >
> > On Fri, Jan 10, 2025 at 9:30 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/10/25 2:49 PM, Rik Theys wrote:
> >>> Hi,
> >>>
> >>> Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
> >>> to log the following hung task messages:
> >>>
> >>> INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
> >>> Tainted: G W E 6.11.11-1.el9.esat.x86_64 #1
> >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> task:kworker/u194:11 state:D stack:0 pid:1677933 tgid:1677933
> >>> ppid:2 flags:0x00004000
> >>> Workqueue: nfsd4 laundromat_main [nfsd]
> >>> Call Trace:
> >>> <TASK>
> >>> __schedule+0x21c/0x5d0
> >>> ? preempt_count_add+0x47/0xa0
> >>> schedule+0x26/0xa0
> >>> nfsd4_shutdown_callback+0xea/0x120 [nfsd]
> >>> ? __pfx_var_wake_function+0x10/0x10
> >>> __destroy_client+0x1f0/0x290 [nfsd]
> >>> nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
> >>> nfs4_laundromat+0x126/0x7a0 [nfsd]
> >>> ? _raw_spin_unlock_irqrestore+0x23/0x40
> >>> laundromat_main+0x16/0x40 [nfsd]
> >>> process_one_work+0x179/0x390
> >>> worker_thread+0x239/0x340
> >>> ? __pfx_worker_thread+0x10/0x10
> >>> kthread+0xdb/0x110
> >>> ? __pfx_kthread+0x10/0x10
> >>> ret_from_fork+0x2d/0x50
> >>> ? __pfx_kthread+0x10/0x10
> >>> ret_from_fork_asm+0x1a/0x30
> >>> </TASK>
> >>>
> >>> If I read this correctly, it seems to be blocked on a callback
> >>> operation during shutdown of a client connection?
> >>>
> >>> Is this a known issue that may be fixed in the 6.12.x kernel? Could
> >>> the following commit be relevant?
> >>
> >> It is a known issue that we're just beginning to work. It's not
> >> addressed in any kernel at the moment.
> >>
> >>
> >>> 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a nfsd: fix race between
> >>> laundromat and free_stateid
> >>>
> >>> If I increase the hung_task_warnings sysctl and wait a few minutes,
> >>> the hung task message appears again, so the issue is still present on
> >>> the system. How can I debug which client is causing this issue?
> >>>
> >>> Is there any other information I can provide?
> >>
> >> Yes. We badly need a simple reproducer for this issue so that we
> >> can test and confirm that it is fixed before requesting that any
> >> fix is merged.
> >
> > Unfortunately, we've been unable to reliably reproduce the issue on
> > our test systems. Sometimes the server works fine for weeks, and
> > sometimes these (or other) issues pop up within hours. Similar to the
> > users from the mentioned thread, we let a few hundred engineers and
> > students loose. Our clients are both EL8/9 based, and also Fedora 41,
> > and they (auto)mount home directories from the NFS server. So clients
> > frequently mount and unmount file systems, students uncleanly shut
> > down systems,...
> >
> > We switched to the mainline kernel in the hope this would include a
> > fix for the issue.
> >
> > Are there any debugging commands we can run once the issue happens
> > that can help to determine the cause of this issue?
>
> Once the issue happens, the precipitating bug has already done its
> damage, so at that point it is too late.
>
> If you can start a trace command on the server before clients mount
> it, try this:
>
> # trace-cmd record -e nfsd:nfsd_cb_\*
>
> After the issue has occurred, wait a few minutes then ^C this command
> and send me the trace.dat.
>
I can create a systemd unit to start this command when the system
boots and stop it when the issue happens.
What is the expected performance impact of keeping this tracing
running? Is there an easy way to rotate the trace.dat file, as I
assume it will grow quite large?
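Concretely, I'm thinking of a unit along these lines (an untested sketch;
the output path, the size cap, and the exact behaviour of trace-cmd's -m
option are my assumptions):

```ini
# /etc/systemd/system/nfsd-cb-trace.service -- untested sketch
[Unit]
Description=Trace nfsd callback events for the laundromat hang
# Start tracing before nfsd is up, i.e. before clients can mount.
Before=nfs-server.service

[Service]
Type=simple
# No shell is involved here, so the * needs no escaping.
# -m caps the recorded data (in kB, older data discarded), in case rotation matters.
ExecStart=/usr/bin/trace-cmd record -m 1048576 -e nfsd:nfsd_cb_* -o /var/tmp/nfsd-cb-trace.dat
# SIGINT makes trace-cmd flush and close the .dat file cleanly on stop.
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target
```

Stopping the unit after the hang appears would then leave a complete
trace.dat to send along.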
> The current theory is that deleg_reaper() is running concurrently with
> the client's DESTROY_SESSION request, and that leaves callback RPCs
> outstanding when the callback RPC client is destroyed. Session shutdown
> then hangs waiting for a completion that will never fire.
Would it be possible to capture this using a bpftrace script? If so,
which events would be interesting to capture to confirm this theory?
Is there an easy way to forcefully trigger the deleg_reaper to run so
we can try running it in a loop and then reboot/unmount the client in
an attempt to trigger the issue?
>
> If your server runs chronically short on physical memory, that might
> be a trigger.
Before the server was upgraded to EL9, it ran CentOS 7 for 5 years
without any issue and we never experienced any physical memory
shortage. Why does the system think it's running low on memory and
needs to send the RECALL_ANY callbacks? It never did in the past and
the system seemed to do fine. Is there a way to turn off the
RECALL_ANY callbacks (at runtime)?
>
>
> >> An environment where we can test patches against the upstream
> >> kernel would also be welcome.
Our current plan is to run the 6.12 kernel, as this is an LTS kernel.
If there are patches for this kernel version we can try, we may be
able to apply them. But we can't reboot the system every few days, as
hundreds of people depend on it. It can also take quite some time to
actually trigger the issue (or to conclude that a patch fixed it).
Regards,
Rik
> >>
> >>
> >>> Could this be related to the following thread:
> >>> https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?
> >>
> >> Yes.
> >>
> >>
> >>> I don't know if this is relevant but I've noticed that some clients
> >>> have multiple entries in the /proc/fs/nfsd/clients directory, so I
> >>> assume these clients are not cleaned up correctly?
> >
> > You don't think this is relevant for this issue? Is this normal?
>
> It might be a bug, but I can't say whether it's related.
>
>
> >>> For example:
> >>>
> >>> clientid: 0x6d077c99675df2b3
> >>> address: "10.87.29.32:864"
> >>> status: confirmed
> >>> seconds from last renew: 0
> >>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> >>> minor version: 2
> >>> Implementation domain: "kernel.org"
> >>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> >>> Dec 11 16:33:48 UTC 2024 x86_64"
> >>> Implementation time: [0, 0]
> >>> callback state: UP
> >>> callback address: 10.87.29.32:0
> >>> admin-revoked states: 0
> >>> ***
> >>> clientid: 0x6d0596d0675df2b3
> >>> address: "10.87.29.32:864"
> >>> status: courtesy
> >>> seconds from last renew: 2288446
> >>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> >>> minor version: 2
> >>> Implementation domain: "kernel.org"
> >>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> >>> Dec 11 16:33:48 UTC 2024 x86_64"
> >>> Implementation time: [0, 0]
> >>> callback state: UP
> >>> callback address: 10.87.29.32:0
> >>> admin-revoked states: 0
> >>>
> >>> The first one has status confirmed and the second one "courtesy" with
> >>> a high "seconds from last renew". The address and port matches for
> >>> both client entries and the callback state is both UP.
> >>>
> >>> For another client, there's a different output:
> >>>
> >>> clientid: 0x6d078a79675df2b3
> >>> address: "10.33.130.34:864"
> >>> status: unconfirmed
> >>> seconds from last renew: 158910
> >>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> >>> minor version: 2
> >>> Implementation domain: "kernel.org"
> >>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> >>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> >>> Implementation time: [0, 0]
> >>> callback state: UNKNOWN
> >>> callback address: (einval)
> >>> admin-revoked states: 0
> >>> ***
> >>> clientid: 0x6d078a7a675df2b3
> >>> address: "10.33.130.34:864"
> >>> status: confirmed
> >>> seconds from last renew: 2
> >>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> >>> minor version: 2
> >>> Implementation domain: "kernel.org"
> >>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> >>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> >>> Implementation time: [0, 0]
> >>> callback state: UP
> >>> callback address: 10.33.130.34:0
> >>> admin-revoked states: 0
> >>>
> >>>
> >>> Here the first status is unconfirmed and the callback state is UNKNOWN.
> >>>
> >>> The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
> >>>
> >>> Regards,
> >>>
> >>> Rik
> >>>
> >>
> >>
> >> --
> >> Chuck Lever
> >
>
>
> --
> Chuck Lever
--
Rik
* Re: nfsd4 laundromat_main hung tasks
2025-01-12 12:42 ` Rik Theys
@ 2025-01-12 18:57 ` Chuck Lever
2025-01-13 12:30 ` Rik Theys
2025-01-13 22:12 ` Chuck Lever
From: Chuck Lever @ 2025-01-12 18:57 UTC
To: Rik Theys, linux-nfs
On 1/12/25 7:42 AM, Rik Theys wrote:
> Hi,
>
> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>> Hi,
>>>
>>> Thanks for your prompt reply.
>>>
>>> On Fri, Jan 10, 2025 at 9:30 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/10/25 2:49 PM, Rik Theys wrote:
>>>>> Hi,
>>>>>
>>>>> Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
>>>>> to log the following hung task messages:
>>>>>
>>>>> INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
>>>>> Tainted: G W E 6.11.11-1.el9.esat.x86_64 #1
>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> task:kworker/u194:11 state:D stack:0 pid:1677933 tgid:1677933
>>>>> ppid:2 flags:0x00004000
>>>>> Workqueue: nfsd4 laundromat_main [nfsd]
>>>>> Call Trace:
>>>>> <TASK>
>>>>> __schedule+0x21c/0x5d0
>>>>> ? preempt_count_add+0x47/0xa0
>>>>> schedule+0x26/0xa0
>>>>> nfsd4_shutdown_callback+0xea/0x120 [nfsd]
>>>>> ? __pfx_var_wake_function+0x10/0x10
>>>>> __destroy_client+0x1f0/0x290 [nfsd]
>>>>> nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
>>>>> nfs4_laundromat+0x126/0x7a0 [nfsd]
>>>>> ? _raw_spin_unlock_irqrestore+0x23/0x40
>>>>> laundromat_main+0x16/0x40 [nfsd]
>>>>> process_one_work+0x179/0x390
>>>>> worker_thread+0x239/0x340
>>>>> ? __pfx_worker_thread+0x10/0x10
>>>>> kthread+0xdb/0x110
>>>>> ? __pfx_kthread+0x10/0x10
>>>>> ret_from_fork+0x2d/0x50
>>>>> ? __pfx_kthread+0x10/0x10
>>>>> ret_from_fork_asm+0x1a/0x30
>>>>> </TASK>
>>>>>
>>>>> If I read this correctly, it seems to be blocked on a callback
>>>>> operation during shutdown of a client connection?
>>>>>
>>>>> Is this a known issue that may be fixed in the 6.12.x kernel? Could
>>>>> the following commit be relevant?
>>>>
>>>> It is a known issue that we're just beginning to work. It's not
>>>> addressed in any kernel at the moment.
>>>>
>>>>
>>>>> 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a nfsd: fix race between
>>>>> laundromat and free_stateid
>>>>>
>>>>> If I increase the hung_task_warnings sysctl and wait a few minutes,
>>>>> the hung task message appears again, so the issue is still present on
>>>>> the system. How can I debug which client is causing this issue?
>>>>>
>>>>> Is there any other information I can provide?
>>>>
>>>> Yes. We badly need a simple reproducer for this issue so that we
>>>> can test and confirm that it is fixed before requesting that any
>>>> fix is merged.
>>>
>>> Unfortunately, we've been unable to reliably reproduce the issue on
>>> our test systems. Sometimes the server works fine for weeks, and
>>> sometimes these (or other) issues pop up within hours. Similar to the
>>> users from the mentioned thread, we let a few hundred engineers and
>>> students loose. Our clients are both EL8/9 based, and also Fedora 41,
>>> and they (auto)mount home directories from the NFS server. So clients
>>> frequently mount and unmount file systems, students uncleanly shut
>>> down systems,...
>>>
>>> We switched to the mainline kernel in the hope this would include a
>>> fix for the issue.
>>>
>>> Are there any debugging commands we can run once the issue happens
>>> that can help to determine the cause of this issue?
>>
>> Once the issue happens, the precipitating bug has already done its
>> damage, so at that point it is too late.
>>
>> If you can start a trace command on the server before clients mount
>> it, try this:
>>
>> # trace-cmd record -e nfsd:nfsd_cb_\*
>>
>> After the issue has occurred, wait a few minutes then ^C this command
>> and send me the trace.dat.
>>
> I can create a systemd unit to start this command when the system
> boots and stop it when the issue happens.
It would help to also include "-p function -l nfsd4_destroy_session"
on the trace-cmd command line so that DESTROY_SESSION operations
are annotated in the log.
> What is the expected performance impact of keeping this tracing
> running? Is there an easy way to rotate the trace.dat file as I assume
> it will grow quite large?
Callback traffic should be light. I don't expect a huge amount of data
will be generated unless the trace runs for weeks without incident, and
therefore I expect any performance impact will be unnoticeable.
>> The current theory is that deleg_reaper() is running concurrently with
>> the client's DESTROY_SESSION request, and that leaves callback RPCs
>> outstanding when the callback RPC client is destroyed. Session shutdown
>> then hangs waiting for a completion that will never fire.
>
> Would it be possible to capture this using a bpftrace script? If so,
> which events would be interesting to capture to confirm this theory?
You can capture this information however you like. I'm familiar with
trace points, so that's the command I can give you.
> Is there an easy way to forcefully trigger the deleg_reaper to run so
> we can try running it in a loop and then reboot/unmount the client in
> an attempt to trigger the issue?
Reduce the server's lease time. deleg_reaper() is run during every pass
of the server's NFSv4 state laundromat.
It is also run via a shrinker. Forcing memory exhaustion might also
result in more calls to deleg_reaper(), but that can have some less
desirable side effects.
>> If your server runs chronically short on physical memory, that might
>> be a trigger.
>
> Before the server was upgraded to EL9, it ran CentOS 7 for 5 years
> without any issue and we never experienced any physical memory
> shortage. Why does the system think it's running low on memory
Different kernels have different memory requirements, different memory
watermarks, and different bugs. Sometimes what barely fits within a
server's RSS and throughput envelope with one kernel does not fit at
all with another kernel.
I'm trying not to make assumptions. Some folks like running NFS servers
with less than 4GB of RAM in virtual environments.
So what I'm asking is how much physical RAM is available on your server,
and do you see other symptoms of memory shortages?
> and
> needs to send the RECALL_ANY callbacks? It never did in the past and
> the system seemed to do fine.
CB_RECALL_ANY is a new feature in recent kernels.
> Is there a way to turn off the
> RECALL_ANY callbacks (at runtime)?
No.
>>>> An environment where we can test patches against the upstream
>>>> kernel would also be welcome.
>
> Our current plan is to run the 6.12 kernel as this is an LTS kernel.
> If there are patches for this kernel version we can try, we may be
> able to apply them. But we can't reboot the system every few days as
> hundreds of people depend on it. It can also take quite some time to
> actually trigger it (or assume it was fixed by a patch).
Any code change has to go into the upstream kernel first before it is
backported to LTS kernels.
What I'm hearing is that you are not able to provide any testing for
an upstream patch. Fair enough.
> Regards,
> Rik
>
>>>>
>>>>
>>>>> Could this be related to the following thread:
>>>>> https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?
>>>>
>>>> Yes.
>>>>
>>>>
>>>>> I don't know if this is relevant but I've noticed that some clients
>>>>> have multiple entries in the /proc/fs/nfsd/clients directory, so I
>>>>> assume these clients are not cleaned up correctly?
>>>
>>> You don't think this is relevant for this issue? Is this normal?
>>
>> It might be a bug, but I can't say whether it's related.
>>
>>
>>>>> For example:
>>>>>
>>>>> clientid: 0x6d077c99675df2b3
>>>>> address: "10.87.29.32:864"
>>>>> status: confirmed
>>>>> seconds from last renew: 0
>>>>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
>>>>> minor version: 2
>>>>> Implementation domain: "kernel.org"
>>>>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
>>>>> Dec 11 16:33:48 UTC 2024 x86_64"
>>>>> Implementation time: [0, 0]
>>>>> callback state: UP
>>>>> callback address: 10.87.29.32:0
>>>>> admin-revoked states: 0
>>>>> ***
>>>>> clientid: 0x6d0596d0675df2b3
>>>>> address: "10.87.29.32:864"
>>>>> status: courtesy
>>>>> seconds from last renew: 2288446
>>>>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
>>>>> minor version: 2
>>>>> Implementation domain: "kernel.org"
>>>>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
>>>>> Dec 11 16:33:48 UTC 2024 x86_64"
>>>>> Implementation time: [0, 0]
>>>>> callback state: UP
>>>>> callback address: 10.87.29.32:0
>>>>> admin-revoked states: 0
>>>>>
>>>>> The first one has status confirmed and the second one "courtesy" with
>>>>> a high "seconds from last renew". The address and port matches for
>>>>> both client entries and the callback state is both UP.
>>>>>
>>>>> For another client, there's a different output:
>>>>>
>>>>> clientid: 0x6d078a79675df2b3
>>>>> address: "10.33.130.34:864"
>>>>> status: unconfirmed
>>>>> seconds from last renew: 158910
>>>>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
>>>>> minor version: 2
>>>>> Implementation domain: "kernel.org"
>>>>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
>>>>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
>>>>> Implementation time: [0, 0]
>>>>> callback state: UNKNOWN
>>>>> callback address: (einval)
>>>>> admin-revoked states: 0
>>>>> ***
>>>>> clientid: 0x6d078a7a675df2b3
>>>>> address: "10.33.130.34:864"
>>>>> status: confirmed
>>>>> seconds from last renew: 2
>>>>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
>>>>> minor version: 2
>>>>> Implementation domain: "kernel.org"
>>>>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
>>>>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
>>>>> Implementation time: [0, 0]
>>>>> callback state: UP
>>>>> callback address: 10.33.130.34:0
>>>>> admin-revoked states: 0
>>>>>
>>>>>
>>>>> Here the first status is unconfirmed and the callback state is UNKNOWN.
>>>>>
>>>>> The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Rik
>>>>>
>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>
>>
>>
>> --
>> Chuck Lever
>
>
>
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-12 18:57 ` Chuck Lever
@ 2025-01-13 12:30 ` Rik Theys
2025-01-13 13:39 ` Chuck Lever
From: Rik Theys @ 2025-01-13 12:30 UTC
To: Chuck Lever; +Cc: linux-nfs
Hi,
On Sun, Jan 12, 2025 at 7:57 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/12/25 7:42 AM, Rik Theys wrote:
> > Hi,
> >
> > On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>> Hi,
> >>>
> >>> Thanks for your prompt reply.
> >>>
> >>> On Fri, Jan 10, 2025 at 9:30 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>
> >>>> On 1/10/25 2:49 PM, Rik Theys wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
> >>>>> to log the following hung task messages:
> >>>>>
> >>>>> INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
> >>>>> Tainted: G W E 6.11.11-1.el9.esat.x86_64 #1
> >>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>> task:kworker/u194:11 state:D stack:0 pid:1677933 tgid:1677933
> >>>>> ppid:2 flags:0x00004000
> >>>>> Workqueue: nfsd4 laundromat_main [nfsd]
> >>>>> Call Trace:
> >>>>> <TASK>
> >>>>> __schedule+0x21c/0x5d0
> >>>>> ? preempt_count_add+0x47/0xa0
> >>>>> schedule+0x26/0xa0
> >>>>> nfsd4_shutdown_callback+0xea/0x120 [nfsd]
> >>>>> ? __pfx_var_wake_function+0x10/0x10
> >>>>> __destroy_client+0x1f0/0x290 [nfsd]
> >>>>> nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
> >>>>> nfs4_laundromat+0x126/0x7a0 [nfsd]
> >>>>> ? _raw_spin_unlock_irqrestore+0x23/0x40
> >>>>> laundromat_main+0x16/0x40 [nfsd]
> >>>>> process_one_work+0x179/0x390
> >>>>> worker_thread+0x239/0x340
> >>>>> ? __pfx_worker_thread+0x10/0x10
> >>>>> kthread+0xdb/0x110
> >>>>> ? __pfx_kthread+0x10/0x10
> >>>>> ret_from_fork+0x2d/0x50
> >>>>> ? __pfx_kthread+0x10/0x10
> >>>>> ret_from_fork_asm+0x1a/0x30
> >>>>> </TASK>
> >>>>>
> >>>>> If I read this correctly, it seems to be blocked on a callback
> >>>>> operation during shutdown of a client connection?
> >>>>>
> >>>>> Is this a known issue that may be fixed in the 6.12.x kernel? Could
> >>>>> the following commit be relevant?
> >>>>
> >>>> It is a known issue that we're just beginning to work. It's not
> >>>> addressed in any kernel at the moment.
> >>>>
> >>>>
> >>>>> 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a nfsd: fix race between
> >>>>> laundromat and free_stateid
> >>>>>
> >>>>> If I increase the hung_task_warnings sysctl and wait a few minutes,
> >>>>> the hung task message appears again, so the issue is still present on
> >>>>> the system. How can I debug which client is causing this issue?
> >>>>>
> >>>>> Is there any other information I can provide?
> >>>>
> >>>> Yes. We badly need a simple reproducer for this issue so that we
> >>>> can test and confirm that it is fixed before requesting that any
> >>>> fix is merged.
> >>>
> >>> Unfortunately, we've been unable to reliably reproduce the issue on
> >>> our test systems. Sometimes the server works fine for weeks, and
> >>> sometimes these (or other) issues pop up within hours. Similar to the
> >>> users from the mentioned thread, we let a few hundred engineers and
> >>> students loose. Our clients are both EL8/9 based, and also Fedora 41,
> >>> and they (auto)mount home directories from the NFS server. So clients
> >>> frequently mount and unmount file systems, students uncleanly shut
> >>> down systems,...
> >>>
> >>> We switched to the mainline kernel in the hope this would include a
> >>> fix for the issue.
> >>>
> >>> Are there any debugging commands we can run once the issue happens
> >>> that can help to determine the cause of this issue?
> >>
> >> Once the issue happens, the precipitating bug has already done its
> >> damage, so at that point it is too late.
> >>
> >> If you can start a trace command on the server before clients mount
> >> it, try this:
> >>
> >> # trace-cmd record -e nfsd:nfsd_cb_\*
> >>
> >> After the issue has occurred, wait a few minutes then ^C this command
> >> and send me the trace.dat.
> >>
> > I can create a systemd unit to start this command when the system
> > boots and stop it when the issue happens.
>
> It would help to include "-p function -l nfsd4_destroy_session" as
> well on the trace-cmd command line so that DESTROY_SESSION operations
> are annotated in the log as well.
>
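Putting the two quoted suggestions together, the full recording command would be something along these lines (a sketch; it is echoed here for review, since actually recording requires root and ftrace support):

```shell
#!/bin/sh
# Combined trace-cmd invocation from the two suggestions in this thread:
# -e records all nfsd callback tracepoints, and -p/-l annotates
# nfsd4_destroy_session calls via the function tracer. Stored in a
# variable and echoed so the command can be inspected before running it.
TRACE_CMD='trace-cmd record -e nfsd:nfsd_cb_\* -p function -l nfsd4_destroy_session'
echo "$TRACE_CMD"
```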
I've created a systemd unit to run trace-cmd on boot. I've started it
manually (before rebooting the system) to see how much disk space it
would use while running.
When I stopped it and ran "trace-cmd report", it showed a lot of
[FAILED TO PARSE] lines, such as:
nfsd-6279 [035] 2560643.942059: nfsd_cb_queue:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
kworker/u192:4-4169949 [032] 2560643.942079: nfsd_cb_start:
[FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245597
addr=ARRAY[]
kworker/u192:4-4169949 [032] 2560643.942081: nfsd_cb_bc_update:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
kworker/u192:4-4169949 [032] 2560643.942082: nfsd_cb_destroy:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
nfsd-6328 [028] 2560643.942503: nfsd_cb_probe:
[FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245598
addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
00]
nfsd-6328 [028] 2560643.942504: nfsd_cb_queue:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
kworker/u192:4-4169949 [032] 2560643.942515: nfsd_cb_start:
[FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245598
addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
00]
kworker/u192:4-4169949 [032] 2560643.942515: nfsd_cb_bc_update:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
kworker/u192:4-4169949 [032] 2560643.942769: nfsd_cb_setup:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598 authflavor=0x1
addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
00] netid=tcp
kworker/u192:4-4169949 [032] 2560643.942770: nfsd_cb_new_state:
[FAILED TO PARSE] state=0x0 cl_boot=1734210227 cl_id=1829245598
addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
00]
kworker/u192:4-4169949 [032] 2560643.942770: nfsd_cb_destroy:
[FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
Is there any additional option I need to specify, or can these items be ignored?
>
> > What is the expected performance impact of keeping this tracing
> > running? Is there an easy way to rotate the trace.dat file as I assume
> > it will grow quite large?
>
> Callback traffic should be light. I don't expect a huge amount of data
> will be generated unless the trace runs for weeks without incident, and
> therefore I expect any performance impact will be unnoticeable.
>
>
> >> The current theory is that deleg_reaper() is running concurrently with
> >> the client's DESTROY_SESSION request, and that leaves callback RPCs
> >> outstanding when the callback RPC client is destroyed. Session shutdown
> >> then hangs waiting for a completion that will never fire.
> >
> > Would it be possible to capture this using a bpftrace script? If so,
> > which events would be interesting to capture to confirm this theory?
>
> You can capture this information however you like. I'm familiar with
> trace points, so that's the command I can give you.
>
>
> > Is there an easy way to forcefully trigger the deleg_reaper to run so
> > we can try running it in a loop and then reboot/unmount the client in
> > an attempt to trigger the issue?
>
> Reduce the server's lease time. deleg_reaper() is run during every pass
> of the server's NFSv4 state laundromat.
>
> It is also run via a shrinker. Forcing memory exhaustion might also
> result in more calls to deleg_reaper(), but that can have some less
> desirable side effects.
>
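For the lease-time approach, the knob is the standard nfsd procfs file. A minimal sketch follows; note that this file can usually only be written while nfsd is not running, and the NFSD_PROC override exists purely so the helper can be exercised against a scratch directory rather than a live server:

```shell
#!/bin/sh
# Lower the NFSv4 lease time so the laundromat (and with it
# deleg_reaper()) runs more frequently. The upstream default is 90s.
# NFSD_PROC defaults to the real procfs location but is overridable
# for testing against a temporary directory.
NFSD_PROC=${NFSD_PROC:-/proc/fs/nfsd}

set_lease_time() {
    # $1: lease time in seconds; write it, then read it back
    echo "$1" > "$NFSD_PROC/nfsv4leasetime" &&
    cat "$NFSD_PROC/nfsv4leasetime"
}
```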
>
> >> If your server runs chronically short on physical memory, that might
> >> be a trigger.
> >
> > Before the server was upgraded to EL9, it ran CentOS 7 for 5 years
> > without any issue and we never experienced any physical memory
> > shortage. Why does the system think it's running low on memory
>
> Different kernels have different memory requirements, different memory
> watermarks, and different bugs. Sometimes what barely fits within a
> server's RSS and throughput envelope with one kernel does not fit at
> all with another kernel.
>
> I'm trying not to make assumptions. Some folks like running NFS servers
> with less than 4GB of RAM in virtual environments.
>
> So what I'm asking is how much physical RAM is available on your server,
> and do you see other symptoms of memory shortages?
The system has 192 GB of RAM and runs nfsd and Samba. Most memory is
used for cache and buffers. There are no symptoms of memory shortage.
>
>
> > and
> > needs to send the RECALL_ANY callbacks? It never did in the past and
> > the system seemed to do fine.
>
> CB_RECALL_ANY is a new feature in recent kernels.
>
>
> > Is there a way to turn off the
> > RECALL_ANY callbacks (at runtime)?
>
> No.
>
>
> >>>> An environment where we can test patches against the upstream
> >>>> kernel would also be welcome.
> >
> > Our current plan is to run the 6.12 kernel as this is an LTS kernel.
> > If there are patches for this kernel version we can try, we may be
> > able to apply them. But we can't reboot the system every few days as
> > hundreds of people depend on it. It can also take quite some time to
> > actually trigger it (or assume it was fixed by a patch).
>
> Any code change has to go into the upstream kernel first before it is
> backported to LTS kernels.
>
> What I'm hearing is that you are not able to provide any testing for
> an upstream patch. Fair enough.
If there's a patch you can certainly let me know and we can see then
if we can try it on our system.
Regards,
Rik
>
>
> > Regards,
> > Rik
> >
> >>>>
> >>>>
> >>>>> Could this be related to the following thread:
> >>>>> https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?
> >>>>
> >>>> Yes.
> >>>>
> >>>>
> >>>>> I don't know if this is relevant but I've noticed that some clients
> >>>>> have multiple entries in the /proc/fs/nfsd/clients directory, so I
> >>>>> assume these clients are not cleaned up correctly?
> >>>
> >>> You don't think this is relevant for this issue? Is this normal?
> >>
> >> It might be a bug, but I can't say whether it's related.
> >>
> >>
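One quick way to spot such duplicate entries on the server is to compare the "address:" lines across all client info files. A sketch, assuming the standard /proc/fs/nfsd/clients/<id>/info layout shown in the excerpts below (the CLIENTS_DIR override exists only so the helper can be tested against a mock directory):

```shell
#!/bin/sh
# Print any client address that owns more than one clientid, i.e.
# appears in more than one info file under the clients directory.
CLIENTS_DIR=${CLIENTS_DIR:-/proc/fs/nfsd/clients}

find_duplicate_clients() {
    # -h suppresses filenames; uniq -d prints each duplicated line once
    grep -h '^address:' "$CLIENTS_DIR"/*/info | sort | uniq -d
}
```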
> >>>>> For example:
> >>>>>
> >>>>> clientid: 0x6d077c99675df2b3
> >>>>> address: "10.87.29.32:864"
> >>>>> status: confirmed
> >>>>> seconds from last renew: 0
> >>>>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> >>>>> minor version: 2
> >>>>> Implementation domain: "kernel.org"
> >>>>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> >>>>> Dec 11 16:33:48 UTC 2024 x86_64"
> >>>>> Implementation time: [0, 0]
> >>>>> callback state: UP
> >>>>> callback address: 10.87.29.32:0
> >>>>> admin-revoked states: 0
> >>>>> ***
> >>>>> clientid: 0x6d0596d0675df2b3
> >>>>> address: "10.87.29.32:864"
> >>>>> status: courtesy
> >>>>> seconds from last renew: 2288446
> >>>>> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> >>>>> minor version: 2
> >>>>> Implementation domain: "kernel.org"
> >>>>> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> >>>>> Dec 11 16:33:48 UTC 2024 x86_64"
> >>>>> Implementation time: [0, 0]
> >>>>> callback state: UP
> >>>>> callback address: 10.87.29.32:0
> >>>>> admin-revoked states: 0
> >>>>>
> >>>>> The first one has status confirmed and the second one "courtesy" with
> >>>>> a high "seconds from last renew". The address and port matches for
> >>>>> both client entries and the callback state is both UP.
> >>>>>
> >>>>> For another client, there's a different output:
> >>>>>
> >>>>> clientid: 0x6d078a79675df2b3
> >>>>> address: "10.33.130.34:864"
> >>>>> status: unconfirmed
> >>>>> seconds from last renew: 158910
> >>>>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> >>>>> minor version: 2
> >>>>> Implementation domain: "kernel.org"
> >>>>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> >>>>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> >>>>> Implementation time: [0, 0]
> >>>>> callback state: UNKNOWN
> >>>>> callback address: (einval)
> >>>>> admin-revoked states: 0
> >>>>> ***
> >>>>> clientid: 0x6d078a7a675df2b3
> >>>>> address: "10.33.130.34:864"
> >>>>> status: confirmed
> >>>>> seconds from last renew: 2
> >>>>> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> >>>>> minor version: 2
> >>>>> Implementation domain: "kernel.org"
> >>>>> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> >>>>> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> >>>>> Implementation time: [0, 0]
> >>>>> callback state: UP
> >>>>> callback address: 10.33.130.34:0
> >>>>> admin-revoked states: 0
> >>>>>
> >>>>>
> >>>>> Here the first status is unconfirmed and the callback state is UNKNOWN.
> >>>>>
> >>>>> The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Rik
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Chuck Lever
> >>>
> >>
> >>
> >> --
> >> Chuck Lever
> >
> >
> >
>
>
> --
> Chuck Lever
--
Rik
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: nfsd4 laundromat_main hung tasks
2025-01-13 12:30 ` Rik Theys
@ 2025-01-13 13:39 ` Chuck Lever
0 siblings, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2025-01-13 13:39 UTC (permalink / raw)
To: Rik Theys; +Cc: linux-nfs
On 1/13/25 7:30 AM, Rik Theys wrote:
> I've created a systemd unit to run trace-cmd on boot. I've started it
> (before rebooting the system) to see how much disk space would be in
> use by having it running.
> When I stopped it and ran "trace-cmd report", it showed a lot of
> [FAILED TO PARSE] lines, such as:
>
> nfsd-6279 [035] 2560643.942059: nfsd_cb_queue:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
> cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
> kworker/u192:4-4169949 [032] 2560643.942079: nfsd_cb_start:
> [FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245597
> addr=ARRAY[]
> kworker/u192:4-4169949 [032] 2560643.942081: nfsd_cb_bc_update:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
> cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
> kworker/u192:4-4169949 [032] 2560643.942082: nfsd_cb_destroy:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245597
> cb=0xffff94488e9c8d90 need_restart=0 addr=ARRAY[]
> nfsd-6328 [028] 2560643.942503: nfsd_cb_probe:
> [FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245598
> addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
> 00]
> nfsd-6328 [028] 2560643.942504: nfsd_cb_queue:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
> cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
> 57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
> kworker/u192:4-4169949 [032] 2560643.942515: nfsd_cb_start:
> [FAILED TO PARSE] state=0x1 cl_boot=1734210227 cl_id=1829245598
> addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
> 00]
> kworker/u192:4-4169949 [032] 2560643.942515: nfsd_cb_bc_update:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
> cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
> 57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
> kworker/u192:4-4169949 [032] 2560643.942769: nfsd_cb_setup:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598 authflavor=0x1
> addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
> 00] netid=tcp
> kworker/u192:4-4169949 [032] 2560643.942770: nfsd_cb_new_state:
> [FAILED TO PARSE] state=0x0 cl_boot=1734210227 cl_id=1829245598
> addr=ARRAY[02, 00, 00, 00, 0a, 57, 18, a4, 00, 00, 00, 00, 00, 00, 00,
> 00]
> kworker/u192:4-4169949 [032] 2560643.942770: nfsd_cb_destroy:
> [FAILED TO PARSE] cl_boot=1734210227 cl_id=1829245598
> cb=0xffff94488e9c8300 need_restart=0 addr=ARRAY[02, 00, 00, 00, 0a,
> 57, 18, a4, 00, 00, 00, 00, 00, 00, 00, 00]
>
> Is there any additional option I need to specify, or can these items be ignored?
Nope. This means your distribution's user-space tracing libraries
are old. The part of these records that failed to parse is the IP
address, which isn't important for the current exercise.
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-12 12:42 ` Rik Theys
2025-01-12 18:57 ` Chuck Lever
@ 2025-01-13 22:12 ` Chuck Lever
2025-01-14 8:23 ` Rik Theys
1 sibling, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2025-01-13 22:12 UTC (permalink / raw)
To: Rik Theys, Christian Herzog, Salvatore Bonaccorso; +Cc: linux-nfs
On 1/12/25 7:42 AM, Rik Theys wrote:
> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>> Are there any debugging commands we can run once the issue happens
>>> that can help to determine the cause of this issue?
>>
>> Once the issue happens, the precipitating bug has already done its
>> damage, so at that point it is too late.
I've studied the code and bug reports a bit. I see one intriguing
mention in comment #5:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
/proc/130/stack:
[<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
[<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[<0>] process_one_work+0x1c7/0x380
[<0>] worker_thread+0x4d/0x380
[<0>] kthread+0xda/0x100
[<0>] ret_from_fork+0x22/0x30
This tells me that the active item on the callback_wq is waiting for the
backchannel RPC client to shut down. This is probably the proximal cause
of the callback workqueue stall.
rpc_shutdown_client() is waiting for the client's cl_tasks to become
empty. Typically this is a short wait. But here, there's one or more RPC
requests that are not completing.
Please issue these two commands on your server once it gets into the
hung state:
# rpcdebug -m rpc -c
# echo t > /proc/sysrq-trigger
Then gift-wrap the server's system journal and send it to me. I need to
see only the output from these two commands, so if you want to
anonymize the journal and truncate it to just the day of the failure,
I think that should be fine.
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-13 22:12 ` Chuck Lever
@ 2025-01-14 8:23 ` Rik Theys
2025-01-14 14:51 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Rik Theys @ 2025-01-14 8:23 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
[-- Attachment #1: Type: text/plain, Size: 1984 bytes --]
Hi,
On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/12/25 7:42 AM, Rik Theys wrote:
> > On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>> Are there any debugging commands we can run once the issue happens
> >>> that can help to determine the cause of this issue?
> >>
> >> Once the issue happens, the precipitating bug has already done its
> >> damage, so at that point it is too late.
>
> I've studied the code and bug reports a bit. I see one intriguing
> mention in comment #5:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>
> /proc/130/stack:
> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> [<0>] process_one_work+0x1c7/0x380
> [<0>] worker_thread+0x4d/0x380
> [<0>] kthread+0xda/0x100
> [<0>] ret_from_fork+0x22/0x30
>
> This tells me that the active item on the callback_wq is waiting for the
> backchannel RPC client to shut down. This is probably the proximal cause
> of the callback workqueue stall.
>
> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> empty. Typically this is a short wait. But here, there's one or more RPC
> requests that are not completing.
>
> Please issue these two commands on your server once it gets into the
> hung state:
>
> # rpcdebug -m rpc -c
> # echo t > /proc/sysrq-trigger
There were no rpcdebug entries configured, so I don't think the first
command did much.
You can find the output from the second command attached.
Regards,
Rik
>
> Then gift-wrap the server's system journal and send it to me. I need to
> see only the output from these two commands, so if you want to
> anonymize the journal and truncate it to just the day of the failure,
> I think that should be fine.
>
>
> --
> Chuck Lever
--
Rik
[-- Attachment #2: journal.txt.gz --]
[-- Type: application/gzip, Size: 117787 bytes --]
* Re: nfsd4 laundromat_main hung tasks
2025-01-14 8:23 ` Rik Theys
@ 2025-01-14 14:51 ` Chuck Lever
2025-01-14 15:30 ` Rik Theys
0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2025-01-14 14:51 UTC (permalink / raw)
To: Rik Theys; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
On 1/14/25 3:23 AM, Rik Theys wrote:
> Hi,
>
> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/12/25 7:42 AM, Rik Theys wrote:
>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>>>> Are there any debugging commands we can run once the issue happens
>>>>> that can help to determine the cause of this issue?
>>>>
>>>> Once the issue happens, the precipitating bug has already done its
>>>> damage, so at that point it is too late.
>>
>> I've studied the code and bug reports a bit. I see one intriguing
>> mention in comment #5:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>>
>> /proc/130/stack:
>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
>> [<0>] process_one_work+0x1c7/0x380
>> [<0>] worker_thread+0x4d/0x380
>> [<0>] kthread+0xda/0x100
>> [<0>] ret_from_fork+0x22/0x30
>>
>> This tells me that the active item on the callback_wq is waiting for the
>> backchannel RPC client to shut down. This is probably the proximal cause
>> of the callback workqueue stall.
>>
>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
>> empty. Typically this is a short wait. But here, there's one or more RPC
>> requests that are not completing.
>>
>> Please issue these two commands on your server once it gets into the
>> hung state:
>>
>> # rpcdebug -m rpc -c
>> # echo t > /proc/sysrq-trigger
>
> There were no rpcdebug entries configured, so I don't think the first
> command did much.
>
> You can find the output from the second command in attach.
I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
do see is a large number of OOM-killer notices. So, your server is out
of memory. But I think this is due to a memory leak bug, probably this
one:
https://bugzilla.kernel.org/show_bug.cgi?id=219535
So that explains the source of the frequent deleg_reaper() calls on your
system; it's the shrinker. (Note these calls are not the actual problem.
The real bug is somewhere in the callback code, but frequent callbacks
are making it easy to hit the callback bug).
Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
for more than 120 seconds." in the journal before issuing the rpcdebug
and "echo t" commands.
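The waiting step above can be automated. A hypothetical watcher sketch follows: it follows the kernel log and fires the capture commands once the hung-task warning appears. The capture action is passed in as arguments, so the matching logic itself can be tested without root; the journalctl/rpcdebug/sysrq usage in the comment is the intended real-world invocation:

```shell
#!/bin/sh
# watch_for_hang reads kernel log lines on stdin and runs "$@" exactly
# once, as soon as a hung-task warning ("blocked for more than") is
# seen. Returns 0 if the capture ran, 1 if input ended without a match.
watch_for_hang() {
    while read -r line; do
        case $line in
        *"blocked for more than"*)
            "$@"
            return 0
            ;;
        esac
    done
    return 1
}

# Intended use (as root):
#   journalctl -kf | watch_for_hang sh -c \
#       'rpcdebug -m rpc -c; echo t > /proc/sysrq-trigger'
```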
> Regards,
> Rik
>
>>
>> Then gift-wrap the server's system journal and send it to me. I need to
>> see only the output from these two commands, so if you want to
>> anonymize the journal and truncate it to just the day of the failure,
>> I think that should be fine.
>>
>>
>> --
>> Chuck Lever
>
>
>
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-14 14:51 ` Chuck Lever
@ 2025-01-14 15:30 ` Rik Theys
2025-01-14 16:10 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Rik Theys @ 2025-01-14 15:30 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
Hi,
On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/14/25 3:23 AM, Rik Theys wrote:
> > Hi,
> >
> > On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/12/25 7:42 AM, Rik Theys wrote:
> >>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>
> >>>> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>>>> Are there any debugging commands we can run once the issue happens
> >>>>> that can help to determine the cause of this issue?
> >>>>
> >>>> Once the issue happens, the precipitating bug has already done its
> >>>> damage, so at that point it is too late.
> >>
> >> I've studied the code and bug reports a bit. I see one intriguing
> >> mention in comment #5:
> >>
> >> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
> >>
> >> /proc/130/stack:
> >> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> >> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> >> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> >> [<0>] process_one_work+0x1c7/0x380
> >> [<0>] worker_thread+0x4d/0x380
> >> [<0>] kthread+0xda/0x100
> >> [<0>] ret_from_fork+0x22/0x30
> >>
> >> This tells me that the active item on the callback_wq is waiting for the
> >> backchannel RPC client to shut down. This is probably the proximal cause
> >> of the callback workqueue stall.
> >>
> >> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> >> empty. Typically this is a short wait. But here, there's one or more RPC
> >> requests that are not completing.
> >>
> >> Please issue these two commands on your server once it gets into the
> >> hung state:
> >>
> >> # rpcdebug -m rpc -c
> >> # echo t > /proc/sysrq-trigger
> >
> > There were no rpcdebug entries configured, so I don't think the first
> > command did much.
> >
> > You can find the output from the second command in attach.
>
> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
> do see is a large number of OOM-killer notices. So, your server is out
> of memory. But I think this is due to a memory leak bug, probably this
> one:
I'm confused. Where do you see the OOM-killer notices in the log I provided?
The first lines of the file are from after increasing hung_task_warnings
and waiting a few minutes. This triggered the hung-task warning for the
nfsd4 laundromat_main workqueue:
Workqueue: nfsd4 laundromat_main [nfsd]
Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
Then I executed the commands you provided. Is the output after the
Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
line not what you're looking for?
Regards,
Rik
>
> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>
> So that explains the source of the frequent deleg_reaper() calls on your
> system; it's the shrinker. (Note these calls are not the actual problem.
> The real bug is somewhere in the callback code, but frequent callbacks
> are making it easy to hit the callback bug).
>
> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
> for more than 120 seconds." in the journal before issuing the rpcdebug
> and "echo t" commands.
>
>
> > Regards,
> > Rik
> >
> >>
> >> Then gift-wrap the server's system journal and send it to me. I need to
> >> see only the output from these two commands, so if you want to
> >> anonymize the journal and truncate it to just the day of the failure,
> >> I think that should be fine.
> >>
> >>
> >> --
> >> Chuck Lever
> >
> >
> >
>
>
> --
> Chuck Lever
--
Rik
* Re: nfsd4 laundromat_main hung tasks
2025-01-14 15:30 ` Rik Theys
@ 2025-01-14 16:10 ` Chuck Lever
2025-01-14 19:02 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2025-01-14 16:10 UTC (permalink / raw)
To: Rik Theys; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
On 1/14/25 10:30 AM, Rik Theys wrote:
> Hi,
>
> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/14/25 3:23 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>
>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>>>>>> Are there any debugging commands we can run once the issue happens
>>>>>>> that can help to determine the cause of this issue?
>>>>>>
>>>>>> Once the issue happens, the precipitating bug has already done its
>>>>>> damage, so at that point it is too late.
>>>>
>>>> I've studied the code and bug reports a bit. I see one intriguing
>>>> mention in comment #5:
>>>>
>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>>>>
>>>> /proc/130/stack:
>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
>>>> [<0>] process_one_work+0x1c7/0x380
>>>> [<0>] worker_thread+0x4d/0x380
>>>> [<0>] kthread+0xda/0x100
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> This tells me that the active item on the callback_wq is waiting for the
>>>> backchannel RPC client to shut down. This is probably the proximal cause
>>>> of the callback workqueue stall.
>>>>
>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
>>>> empty. Typically this is a short wait. But here, there's one or more RPC
>>>> requests that are not completing.
>>>>
>>>> Please issue these two commands on your server once it gets into the
>>>> hung state:
>>>>
>>>> # rpcdebug -m rpc -c
>>>> # echo t > /proc/sysrq-trigger
>>>
>>> There were no rpcdebug entries configured, so I don't think the first
>>> command did much.
>>>
>>> You can find the output from the second command in attach.
>>
>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
>> do see is a large number of OOM-killer notices. So, your server is out
>> of memory. But I think this is due to a memory leak bug, probably this
>> one:
>
> I'm confused. Where do you see the OOM-killer notices in the log I provided?
Never mind: my editor opened an old file at the same time. The window
with your dump was on another screen.
Looking at your journal now.
> The first lines of the file is after increasing the hung_task_warnings
> and waiting a few minutes. This triggered the hun task on the nfsd4
> laundromat_main workqueue:
>
> Workqueue: nfsd4 laundromat_main [nfsd]
> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
>
> Then I executed the commands you provided. Are the lines after the
>
> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
>
> Line not the output you're looking for?
>
> Regards,
> Rik
>
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>>
>> So that explains the source of the frequent deleg_reaper() calls on your
>> system; it's the shrinker. (Note these calls are not the actual problem.
>> The real bug is somewhere in the callback code, but frequent callbacks
>> are making it easy to hit the callback bug).
>>
>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
>> for more than 120 seconds." in the journal before issuing the rpcdebug
>> and "echo t" commands.
>>
>>
>>> Regards,
>>> Rik
>>>
>>>>
>>>> Then gift-wrap the server's system journal and send it to me. I need to
>>>> see only the output from these two commands, so if you want to
>>>> anonymize the journal and truncate it to just the day of the failure,
>>>> I think that should be fine.
>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>
>>>
>>>
>>
>>
>> --
>> Chuck Lever
>
>
>
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-14 16:10 ` Chuck Lever
@ 2025-01-14 19:02 ` Chuck Lever
2025-01-16 9:03 ` Rik Theys
0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2025-01-14 19:02 UTC (permalink / raw)
To: Rik Theys; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
On 1/14/25 11:10 AM, Chuck Lever wrote:
> On 1/14/25 10:30 AM, Rik Theys wrote:
>> Hi,
>>
>> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com>
>> wrote:
>>>
>>> On 1/14/25 3:23 AM, Rik Theys wrote:
>>>> Hi,
>>>>
>>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever
>>>> <chuck.lever@oracle.com> wrote:
>>>>>
>>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
>>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever
>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>
>>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>>>>>>> Are there any debugging commands we can run once the issue happens
>>>>>>>> that can help to determine the cause of this issue?
>>>>>>>
>>>>>>> Once the issue happens, the precipitating bug has already done its
>>>>>>> damage, so at that point it is too late.
>>>>>
>>>>> I've studied the code and bug reports a bit. I see one intriguing
>>>>> mention in comment #5:
>>>>>
>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>>>>>
>>>>> /proc/130/stack:
>>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
>>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
>>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
>>>>> [<0>] process_one_work+0x1c7/0x380
>>>>> [<0>] worker_thread+0x4d/0x380
>>>>> [<0>] kthread+0xda/0x100
>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>
>>>>> This tells me that the active item on the callback_wq is waiting
>>>>> for the
>>>>> backchannel RPC client to shut down. This is probably the proximal
>>>>> cause
>>>>> of the callback workqueue stall.
>>>>>
>>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
>>>>> empty. Typically this is a short wait. But here, there's one or
>>>>> more RPC
>>>>> requests that are not completing.
>>>>>
>>>>> Please issue these two commands on your server once it gets into the
>>>>> hung state:
>>>>>
>>>>> # rpcdebug -m rpc -c
>>>>> # echo t > /proc/sysrq-trigger
>>>>
>>>> There were no rpcdebug entries configured, so I don't think the first
>>>> command did much.
The laundromat failure mode is not blocked in rpc_shutdown_client, so
there aren't any outstanding callback RPCs to observe.
The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
in nfsd4_shutdown_callback(), while this failure mode appears to have
passed that call and blocked on the wait for in-flight RPCs to go to
zero (as Jeff analyzed a few days ago).
Next time it happens, please collect the rpcdebug and "echo t" output as
well as the trace-cmd output.
>>>> You can find the output from the second command in attach.
>>>
>>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
>>> do see is a large number of OOM-killer notices. So, your server is out
>>> of memory. But I think this is due to a memory leak bug, probably this
>>> one:
>>
>> I'm confused. Where do you see the OOM-killer notices in the log I
>> provided?
>
> Never mind: my editor opened an old file at the same time. The window
> with your dump was on another screen.
>
> Looking at your journal now.
>
>
>> The first lines of the file is after increasing the hung_task_warnings
>> and waiting a few minutes. This triggered the hun task on the nfsd4
>> laundromat_main workqueue:
>>
>> Workqueue: nfsd4 laundromat_main [nfsd]
>> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
>>
>> Then I executed the commands you provided. Are the lines after the
>>
>> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
>>
>> line not the output you're looking for?
>>
>> Regards,
>> Rik
>>
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>>>
>>> So that explains the source of the frequent deleg_reaper() calls on your
>>> system; it's the shrinker. (Note these calls are not the actual problem.
>>> The real bug is somewhere in the callback code, but frequent callbacks
>>> are making it easy to hit the callback bug).
>>>
>>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
>>> for more than 120 seconds." in the journal before issuing the rpcdebug
>>> and "echo t" commands.
>>>
>>>
>>>> Regards,
>>>> Rik
>>>>
>>>>>
>>>>> Then gift-wrap the server's system journal and send it to me. I
>>>>> need to
>>>>> see only the output from these two commands, so if you want to
>>>>> anonymize the journal and truncate it to just the day of the failure,
>>>>> I think that should be fine.
>>>>>
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Chuck Lever
>>
>>
>>
>
>
--
Chuck Lever
* Re: nfsd4 laundromat_main hung tasks
2025-01-14 19:02 ` Chuck Lever
@ 2025-01-16 9:03 ` Rik Theys
2025-01-16 14:12 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Rik Theys @ 2025-01-16 9:03 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
Hi Chuck,
On Tue, Jan 14, 2025 at 8:02 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/14/25 11:10 AM, Chuck Lever wrote:
> > On 1/14/25 10:30 AM, Rik Theys wrote:
> >> Hi,
> >>
> >> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com>
> >> wrote:
> >>>
> >>> On 1/14/25 3:23 AM, Rik Theys wrote:
> >>>> Hi,
> >>>>
> >>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever
> >>>> <chuck.lever@oracle.com> wrote:
> >>>>>
> >>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
> >>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever
> >>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>
> >>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>>>>>>> Are there any debugging commands we can run once the issue happens
> >>>>>>>> that can help to determine the cause of this issue?
> >>>>>>>
> >>>>>>> Once the issue happens, the precipitating bug has already done its
> >>>>>>> damage, so at that point it is too late.
> >>>>>
> >>>>> I've studied the code and bug reports a bit. I see one intriguing
> >>>>> mention in comment #5:
> >>>>>
> >>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
> >>>>>
> >>>>> /proc/130/stack:
> >>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> >>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> >>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> >>>>> [<0>] process_one_work+0x1c7/0x380
> >>>>> [<0>] worker_thread+0x4d/0x380
> >>>>> [<0>] kthread+0xda/0x100
> >>>>> [<0>] ret_from_fork+0x22/0x30
> >>>>>
> >>>>> This tells me that the active item on the callback_wq is waiting
> >>>>> for the
> >>>>> backchannel RPC client to shut down. This is probably the proximal
> >>>>> cause
> >>>>> of the callback workqueue stall.
> >>>>>
> >>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> >>>>> empty. Typically this is a short wait. But here, there's one or
> >>>>> more RPC
> >>>>> requests that are not completing.
> >>>>>
> >>>>> Please issue these two commands on your server once it gets into the
> >>>>> hung state:
> >>>>>
> >>>>> # rpcdebug -m rpc -c
> >>>>> # echo t > /proc/sysrq-trigger
> >>>>
> >>>> There were no rpcdebug entries configured, so I don't think the first
> >>>> command did much.
>
> The laundromat failure mode is not blocked in rpc_shutdown_client, so
> there aren't any outstanding callback RPCs to observe.
>
> The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
> in nfsd4_shutdown_callback(), while this failure mode appears to have
> passed that call and blocked on the wait for in-flight RPCs to go to
> zero (as Jeff analyzed a few days ago).
If I look at the trace, nfs4_laundromat calls
nfs4_process_client_reaplist, which calls __destroy_client at some
point.
When I look at the __destroy_client function in nfs4state.c, I see it
does a spin_lock(&state_lock) and spin_unlock(&state_lock) to perform
certain actions, but it seems the lock is not (again) acquired when
the nfsd4_shutdown_callback() function is called? According to the
comment above the nfsd4_shutdown_callback function in nfs4callback.c,
the function must be called under the state lock? Is it possible that
the function is called without this state lock? Or is the comment no
longer relevant?
Another thing I've noticed (but I'm not sure it's relevant here) is
that there's a client in /proc/fs/nfsd/clients whose states file
crashes nfsdclnts, as the entry does not have a "superblock"
field:
# cat 8536/{info,states}
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2807740
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UNKNOWN
callback address: 10.87.29.32:0
admin-revoked states: 0
- 0x00000001b3f25d67d096056d19facf00: { type: deleg, access: w }
This is one of the clients that has multiple entries in the
/proc/fs/nfsd/clients directory, but of all the clients that have
duplicate entries, this is the only one where the "broken" client is
in the "courtesy" state for a long time now. It's also the only
"broken" client that still has an entry in the states file. The others
are all in the "unconfirmed" state and the states file is empty.
Regards,
Rik
>
> Next time it happens, please collect the rpcdebug and "echo t" output as
> well as the trace-cmd output.
>
>
> >>>> You can find the output from the second command attached.
> >>>
> >>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
> >>> do see is a large number of OOM-killer notices. So, your server is out
> >>> of memory. But I think this is due to a memory leak bug, probably this
> >>> one:
> >>
> >> I'm confused. Where do you see the OOM-killer notices in the log I
> >> provided?
> >
> > Never mind: my editor opened an old file at the same time. The window
> > with your dump was on another screen.
> >
> > Looking at your journal now.
> >
> >
> >> The first lines of the file are from after increasing the hung_task_warnings
> >> sysctl and waiting a few minutes. This triggered the hung task on the nfsd4
> >> laundromat_main workqueue:
> >>
> >> Workqueue: nfsd4 laundromat_main [nfsd]
> >> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
> >>
> >> Then I executed the commands you provided. Are the lines after the
> >>
> >> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
> >>
> >> line not the output you're looking for?
> >>
> >> Regards,
> >> Rik
> >>
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=219535
> >>>
> >>> So that explains the source of the frequent deleg_reaper() calls on your
> >>> system; it's the shrinker. (Note these calls are not the actual problem.
> >>> The real bug is somewhere in the callback code, but frequent callbacks
> >>> are making it easy to hit the callback bug).
> >>>
> >>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
> >>> for more than 120 seconds." in the journal before issuing the rpcdebug
> >>> and "echo t" commands.
> >>>
> >>>
> >>>> Regards,
> >>>> Rik
> >>>>
> >>>>>
> >>>>> Then gift-wrap the server's system journal and send it to me. I
> >>>>> need to
> >>>>> see only the output from these two commands, so if you want to
> >>>>> anonymize the journal and truncate it to just the day of the failure,
> >>>>> I think that should be fine.
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Chuck Lever
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Chuck Lever
> >>
> >>
> >>
> >
> >
>
>
> --
> Chuck Lever
--
Rik
* Re: nfsd4 laundromat_main hung tasks
2025-01-16 9:03 ` Rik Theys
@ 2025-01-16 14:12 ` Chuck Lever
0 siblings, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2025-01-16 14:12 UTC (permalink / raw)
To: Rik Theys; +Cc: Christian Herzog, Salvatore Bonaccorso, linux-nfs
On 1/16/25 4:03 AM, Rik Theys wrote:
>> The laundromat failure mode is not blocked in rpc_shutdown_client, so
>> there aren't any outstanding callback RPCs to observe.
>>
>> The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
>> in nfsd4_shutdown_callback(), while this failure mode appears to have
>> passed that call and blocked on the wait for in-flight RPCs to go to
>> zero (as Jeff analyzed a few days ago).
>
> If I look at the trace, nfs4_laundromat calls
> nfs4_process_client_reaplist, which calls __destroy_client at some
> point.
>
> When I look at the __destroy_client function in nfs4state.c, I see it
> does a spin_lock(&state_lock) and spin_unlock(&state_lock) to perform
> certain actions, but it seems the lock is not (again) acquired when
> the nfsd4_shutdown_callback() function is called? According to the
> comment above the nfsd4_shutdown_callback function in nfs4callback.c,
> the function must be called under the state lock? Is it possible that
> the function is called without this state lock? Or is the comment no
> longer relevant?
The comment is stale.
Commit b687f6863eed ("nfsd: remove the client_mutex and the nfs4_lock/
unlock_state wrappers") removed the mutex that used to wrap calls to
this function.
> Another thing I've noticed (but I'm not sure it's relevant here) is
> that there's a client in /proc/fs/nfsd/clients whose states file
> crashes nfsdclnts, as the entry does not have a "superblock"
> field:
>
> # cat 8536/{info,states}
> clientid: 0x6d0596d0675df2b3
> address: "10.87.29.32:864"
> status: courtesy
> seconds from last renew: 2807740
> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> Dec 11 16:33:48 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UNKNOWN
> callback address: 10.87.29.32:0
> admin-revoked states: 0
> - 0x00000001b3f25d67d096056d19facf00: { type: deleg, access: w }
>
> This is one of the clients that has multiple entries in the
> /proc/fs/nfsd/clients directory, but of all the clients that have
> duplicate entries, this is the only one where the "broken" client is
> in the "courtesy" state for a long time now. It's also the only
> "broken" client that still has an entry in the states file. The others
> are all in the "unconfirmed" state and the states file is empty.
Likely that client entry is pinned somehow by this bug.
--
Chuck Lever
Thread overview: 14+ messages
2025-01-10 19:49 nfsd4 laundromat_main hung tasks Rik Theys
2025-01-10 20:30 ` Chuck Lever
[not found] ` <CAPwv0J=oKBnCia_mmhm-tYLPqw03jO=LxfUbShSyXFp-mKET5A@mail.gmail.com>
[not found] ` <49654519-9166-4593-ac62-77400cebebb4@oracle.com>
2025-01-12 12:42 ` Rik Theys
2025-01-12 18:57 ` Chuck Lever
2025-01-13 12:30 ` Rik Theys
2025-01-13 13:39 ` Chuck Lever
2025-01-13 22:12 ` Chuck Lever
2025-01-14 8:23 ` Rik Theys
2025-01-14 14:51 ` Chuck Lever
2025-01-14 15:30 ` Rik Theys
2025-01-14 16:10 ` Chuck Lever
2025-01-14 19:02 ` Chuck Lever
2025-01-16 9:03 ` Rik Theys
2025-01-16 14:12 ` Chuck Lever