From: Chuck Lever <chuck.lever@oracle.com>
To: Rik Theys <rik.theys@gmail.com>
Cc: Christian Herzog <herzog@phys.ethz.ch>,
Salvatore Bonaccorso <carnil@debian.org>,
linux-nfs@vger.kernel.org
Subject: Re: nfsd4 laundromat_main hung tasks
Date: Tue, 14 Jan 2025 11:10:04 -0500 [thread overview]
Message-ID: <42da212b-071b-4c20-b7da-97ca02413c5a@oracle.com> (raw)
In-Reply-To: <CAPwv0Jk1UaHqNX27AtR+sPrCdVbckpR5ayQ-u+kZ=w3C+sOsgQ@mail.gmail.com>
On 1/14/25 10:30 AM, Rik Theys wrote:
> Hi,
>
> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/14/25 3:23 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>
>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>>>>>> Are there any debugging commands we can run once the issue happens
>>>>>>> that can help to determine the cause of this issue?
>>>>>>
>>>>>> Once the issue happens, the precipitating bug has already done its
>>>>>> damage, so at that point it is too late.
>>>>
>>>> I've studied the code and bug reports a bit. I see one intriguing
>>>> mention in comment #5:
>>>>
>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>>>>
>>>> /proc/130/stack:
>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
>>>> [<0>] process_one_work+0x1c7/0x380
>>>> [<0>] worker_thread+0x4d/0x380
>>>> [<0>] kthread+0xda/0x100
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> This tells me that the active item on the callback_wq is waiting for the
>>>> backchannel RPC client to shut down. This is probably the proximal cause
>>>> of the callback workqueue stall.
>>>>
>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
>>>> empty. Typically this is a short wait. But here, there's one or more RPC
>>>> requests that are not completing.
>>>>
>>>> Please issue these two commands on your server once it gets into the
>>>> hung state:
>>>>
>>>> # rpcdebug -m rpc -c
>>>> # echo t > /proc/sysrq-trigger
>>>
>>> There were no rpcdebug entries configured, so I don't think the first
>>> command did much.
>>>
>>> You can find the output from the second command in attach.
>>
>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
>> do see is a large number of OOM-killer notices. So, your server is out
>> of memory. But I think this is due to a memory leak bug, probably this
>> one:
>
> I'm confused. Where do you see the OOM-killer notices in the log I provided?
Never mind: my editor opened an old file at the same time. The window
with your dump was on another screen.
Looking at your journal now.
> The first lines of the file is after increasing the hung_task_warnings
> and waiting a few minutes. This triggered the hun task on the nfsd4
> laundromat_main workqueue:
>
> Workqueue: nfsd4 laundromat_main [nfsd]
> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
>
> Then I executed the commands you provided. Are the lines after the
>
> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
>
> Line not the output you're looking for?
>
> Regards,
> Rik
>
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>>
>> So that explains the source of the frequent deleg_reaper() calls on your
>> system; it's the shrinker. (Note these calls are not the actual problem.
>> The real bug is somewhere in the callback code, but frequent callbacks
>> are making it easy to hit the callback bug).
>>
>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
>> for more than 120 seconds." in the journal before issuing the rpcdebug
>> and "echo t" commands.
>>
>>
>>> Regards,
>>> Rik
>>>
>>>>
>>>> Then gift-wrap the server's system journal and send it to me. I need to
>>>> see only the output from these two commands, so if you want to
>>>> anonymize the journal and truncate it to just the day of the failure,
>>>> I think that should be fine.
>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>
>>>
>>>
>>
>>
>> --
>> Chuck Lever
>
>
>
--
Chuck Lever
next prev parent reply other threads:[~2025-01-14 16:10 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-10 19:49 nfsd4 laundromat_main hung tasks Rik Theys
2025-01-10 20:30 ` Chuck Lever
[not found] ` <CAPwv0J=oKBnCia_mmhm-tYLPqw03jO=LxfUbShSyXFp-mKET5A@mail.gmail.com>
[not found] ` <49654519-9166-4593-ac62-77400cebebb4@oracle.com>
2025-01-12 12:42 ` Rik Theys
2025-01-12 18:57 ` Chuck Lever
2025-01-13 12:30 ` Rik Theys
2025-01-13 13:39 ` Chuck Lever
2025-01-13 22:12 ` Chuck Lever
2025-01-14 8:23 ` Rik Theys
2025-01-14 14:51 ` Chuck Lever
2025-01-14 15:30 ` Rik Theys
2025-01-14 16:10 ` Chuck Lever [this message]
2025-01-14 19:02 ` Chuck Lever
2025-01-16 9:03 ` Rik Theys
2025-01-16 14:12 ` Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=42da212b-071b-4c20-b7da-97ca02413c5a@oracle.com \
--to=chuck.lever@oracle.com \
--cc=carnil@debian.org \
--cc=herzog@phys.ethz.ch \
--cc=linux-nfs@vger.kernel.org \
--cc=rik.theys@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox