public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <chuck.lever@oracle.com>
To: Rik Theys <rik.theys@gmail.com>, linux-nfs@vger.kernel.org
Subject: Re: nfsd4 laundromat_main hung tasks
Date: Fri, 10 Jan 2025 15:30:45 -0500
Message-ID: <d54d71f7-9bdb-49a4-8687-563558eca95e@oracle.com>
In-Reply-To: <CAPwv0JnSQ=hsmUMy0VY-8k+dANBLNkJdFJ75q9EEE+Hj0XXB8A@mail.gmail.com>

On 1/10/25 2:49 PM, Rik Theys wrote:
> Hi,
> 
> Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
> to log the following hung task messages:
> 
> INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
>        Tainted: G        W   E      6.11.11-1.el9.esat.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u194:11 state:D stack:0     pid:1677933 tgid:1677933
> ppid:2      flags:0x00004000
> Workqueue: nfsd4 laundromat_main [nfsd]
> Call Trace:
>   <TASK>
>   __schedule+0x21c/0x5d0
>   ? preempt_count_add+0x47/0xa0
>   schedule+0x26/0xa0
>   nfsd4_shutdown_callback+0xea/0x120 [nfsd]
>   ? __pfx_var_wake_function+0x10/0x10
>   __destroy_client+0x1f0/0x290 [nfsd]
>   nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
>   nfs4_laundromat+0x126/0x7a0 [nfsd]
>   ? _raw_spin_unlock_irqrestore+0x23/0x40
>   laundromat_main+0x16/0x40 [nfsd]
>   process_one_work+0x179/0x390
>   worker_thread+0x239/0x340
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0xdb/0x110
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x2d/0x50
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
> 
> If I read this correctly, it seems to be blocked on a callback
> operation during shutdown of a client connection?
> 
> Is this a known issue that may be fixed in the 6.12.x kernel? Could
> the following commit be relevant?

It is a known issue that we're just beginning to work. It's not
addressed in any kernel at the moment.


> 8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a    nfsd: fix race between
> laundromat and free_stateid
> 
> If I increase the hung_task_warnings sysctl and wait a few minutes,
> the hung task message appears again, so the issue is still present on
> the system. How can I debug which client is causing this issue?
> 
> Is there any other information I can provide?

Yes. We badly need a simple reproducer for this issue so that we
can test and confirm that it is fixed before requesting that any
fix is merged.

An environment where we can test patches against the upstream
kernel would also be welcome.


> Could this be related to the following thread:
> https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@eldamar.lan/T/#u ?

Yes.


> I don't know if this is relevant but I've noticed that some clients
> have multiple entries in the /proc/fs/nfsd/clients directory, so I
> assume these clients are not cleaned up correctly?
> 
> For example:
> 
> clientid: 0x6d077c99675df2b3
> address: "10.87.29.32:864"
> status: confirmed
> seconds from last renew: 0
> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> Dec 11 16:33:48 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.87.29.32:0
> admin-revoked states: 0
> ***
> clientid: 0x6d0596d0675df2b3
> address: "10.87.29.32:864"
> status: courtesy
> seconds from last renew: 2288446
> name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
> Dec 11 16:33:48 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.87.29.32:0
> admin-revoked states: 0
> 
> The first one has status confirmed and the second one "courtesy" with
> a high "seconds from last renew". The address and port match for
> both client entries, and the callback state is UP for both.
> 
> For another client, there's a different output:
> 
> clientid: 0x6d078a79675df2b3
> address: "10.33.130.34:864"
> status: unconfirmed
> seconds from last renew: 158910
> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UNKNOWN
> callback address: (einval)
> admin-revoked states: 0
> ***
> clientid: 0x6d078a7a675df2b3
> address: "10.33.130.34:864"
> status: confirmed
> seconds from last renew: 2
> name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
> minor version: 2
> Implementation domain: "kernel.org"
> Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
> PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
> Implementation time: [0, 0]
> callback state: UP
> callback address: 10.33.130.34:0
> admin-revoked states: 0
> 
> 
> Here the first status is unconfirmed and the callback state is UNKNOWN.
> 
> The clients are Rocky 8, Rocky 9 and Fedora 41 clients.
> 
> Regards,
> 
> Rik
> 

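One way to spot duplicate entries like the ones quoted above is to group
the per-client records by their "name" field and flag any that are not
"confirmed". The following is a minimal sketch, assuming the records use
the "key: value" layout shown in the quoted output with each value on a
single line. The function names are hypothetical, and the sample text is
abbreviated from that output. It only lists leftover entries; it does not
show which client the laundromat is blocked on.

```python
from collections import defaultdict

# Abbreviated from the quoted output; "***" separates records there.
SAMPLE = """\
clientid: 0x6d077c99675df2b3
address: "10.87.29.32:864"
status: confirmed
seconds from last renew: 0
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
callback state: UP
***
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2288446
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
callback state: UP
"""


def parse_client_info(text):
    """Parse one block of 'key: value' lines into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so "10.87.29.32:864" stays intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and separators carry no field
        record[key.strip()] = value.strip().strip('"')
    return record


def group_by_name(records):
    """Return {name: [records]} for names that appear more than once."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get("name", "")].append(rec)
    return {name: recs for name, recs in groups.items() if len(recs) > 1}


def stale_entries(records):
    """Return the records whose status is anything other than 'confirmed'."""
    return [rec for rec in records if rec.get("status") != "confirmed"]


records = [parse_client_info(chunk) for chunk in SAMPLE.split("***")]
for name, recs in group_by_name(records).items():
    for rec in stale_entries(recs):
        print(name, rec["clientid"], rec["status"],
              rec["seconds from last renew"])
```

On a live server the same functions could be fed the text of each client
entry under /proc/fs/nfsd/clients instead of SAMPLE; the exact file
layout under that directory is an assumption here.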

-- 
Chuck Lever
