public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <chuck.lever@oracle.com>
To: Rik Theys <rik.theys@gmail.com>
Cc: Christian Herzog <herzog@phys.ethz.ch>,
	Salvatore Bonaccorso <carnil@debian.org>,
	linux-nfs@vger.kernel.org
Subject: Re: nfsd4 laundromat_main hung tasks
Date: Tue, 14 Jan 2025 11:10:04 -0500
Message-ID: <42da212b-071b-4c20-b7da-97ca02413c5a@oracle.com>
In-Reply-To: <CAPwv0Jk1UaHqNX27AtR+sPrCdVbckpR5ayQ-u+kZ=w3C+sOsgQ@mail.gmail.com>

On 1/14/25 10:30 AM, Rik Theys wrote:
> Hi,
> 
> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/14/25 3:23 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>
>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
>>>>>>> Are there any debugging commands we can run once the issue happens
>>>>>>> that can help to determine the cause of this issue?
>>>>>>
>>>>>> Once the issue happens, the precipitating bug has already done its
>>>>>> damage, so at that point it is too late.
>>>>
>>>> I've studied the code and bug reports a bit. I see one intriguing
>>>> mention in comment #5:
>>>>
>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
>>>>
>>>> /proc/130/stack:
>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
>>>> [<0>] process_one_work+0x1c7/0x380
>>>> [<0>] worker_thread+0x4d/0x380
>>>> [<0>] kthread+0xda/0x100
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> This tells me that the active item on the callback_wq is waiting for the
>>>> backchannel RPC client to shut down. This is probably the proximal cause
>>>> of the callback workqueue stall.
>>>>
>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
>>>> empty. Typically this is a short wait. But here, there's one or more RPC
>>>> requests that are not completing.
>>>>
>>>> Please issue these two commands on your server once it gets into the
>>>> hung state:
>>>>
>>>> # rpcdebug -m rpc -c
>>>> # echo t > /proc/sysrq-trigger
>>>
>>> There were no rpcdebug entries configured, so I don't think the first
>>> command did much.
>>>
>>> You can find the output from the second command in the attachment.
>>
>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
>> do see is a large number of OOM-killer notices. So, your server is out
>> of memory. But I think this is due to a memory leak bug, probably this
>> one:
> 
> I'm confused. Where do you see the OOM-killer notices in the log I provided?

Never mind: my editor opened an old file at the same time. The window
with your dump was on another screen.

Looking at your journal now.


> The first lines of the file are from after increasing hung_task_warnings
> and waiting a few minutes. This triggered the hung task on the nfsd4
> laundromat_main workqueue:
> 
> Workqueue: nfsd4 laundromat_main [nfsd]
> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
> 
> Then I executed the commands you provided. Are the lines after the
> 
> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
> 
> line not the output you're looking for?
> 
> Regards,
> Rik
> 
>>
>>      https://bugzilla.kernel.org/show_bug.cgi?id=219535
>>
>> So that explains the source of the frequent deleg_reaper() calls on your
>> system; it's the shrinker. (Note these calls are not the actual problem.
>> The real bug is somewhere in the callback code, but frequent callbacks
>> are making it easy to hit the callback bug).
>>
>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
>> for more than 120 seconds." in the journal before issuing the rpcdebug
>> and "echo t" commands.
>>
>>
>>> Regards,
>>> Rik
>>>
>>>>
>>>> Then gift-wrap the server's system journal and send it to me. I need to
>>>> see only the output from these two commands, so if you want to
>>>> anonymize the journal and truncate it to just the day of the failure,
>>>> I think that should be fine.
>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>
>>>
>>>
>>
>>
>> --
>> Chuck Lever
> 
> 
> 


-- 
Chuck Lever
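
The capture procedure discussed in this thread (wait for the hung-task
report for an nfsd thread, then run the rpcdebug and sysrq commands, then
send only the resulting output) could be scripted roughly as below. This is
a sketch under assumptions, not something posted by anyone in the thread:
the function names and the 30-second polling interval are hypothetical, and
it assumes kernel messages are readable via "journalctl -k". The two
commands inside capture_when_hung are the ones quoted above and need root.

```shell
#!/bin/sh

# Succeeds if stdin contains a hung-task report for an nfsd thread, e.g.
# "INFO: task nfsd:1234 blocked for more than 120 seconds."
hung_nfsd_task_seen() {
    grep -Eq 'INFO: task nfsd:[0-9]+ blocked for more than [0-9]+ seconds'
}

# Prints stdin from the first "sysrq: Show State" line to the end, so that
# only the task dump (and whatever follows it) is kept.
sysrq_dump_only() {
    sed -n '/sysrq: Show State/,$p'
}

# Hypothetical helper: poll the kernel log of the current boot until the
# hung-task report appears, then issue the two commands from the thread.
# The 30-second interval is an arbitrary choice. Must run as root.
capture_when_hung() {
    until journalctl -k | hung_nfsd_task_seen; do
        sleep 30
    done
    rpcdebug -m rpc -c
    echo t > /proc/sysrq-trigger
}
```

After capture_when_hung returns, something like
"journalctl -k | sysrq_dump_only > dump.txt" would keep just the task dump
for sending; the output file name here is only an example.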
