From: dai.ngo@oracle.com
To: Chuck Lever <chuck.lever@oracle.com>
Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org, linux-nfs@stwm.de
Subject: Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck
Date: Mon, 18 Dec 2023 12:27:30 -0800
Message-ID: <f68e2a10-4bb0-4b46-966d-d7806a10faea@oracle.com>
In-Reply-To: <ZYCZQYK9jM2dhiag@tissot.1015granger.net>


On 12/18/23 11:10 AM, Chuck Lever wrote:
> On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@oracle.com wrote:
>> On 12/18/23 8:02 AM, Chuck Lever wrote:
>>> On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote:
>>>> On 12/15/23 7:57 PM, Chuck Lever wrote:
>>> What we don't know is why the callback was lost.
>>>
>>> - It could be that queue_work() returned false because of a bug.
>>>     Note that there is a WARN_ON_ONCE() that fires in this case: if
>>>     it fired several days before the hang, then we won't see any
>>>     log messages for more recent misqueued work items.
>> The WARN_ON_ONCE came from nfsd_break_one_deleg, which is a delegation
>> recall, and not from nfs4_cb_getattr. I suspect this is because of a
>> possible bug in __break_lease (see my question for Jeff above).
> OK, so there's no indication at all if nfsd4_run_cb() fails when
> NFSD queues CB_GETATTR? No wonder it's a silent failure.

This patch adds a WARN_ON_ONCE just in case, but I don't think this
condition will ever happen, since we already do a test_and_set_bit on the
CB_GETATTR_BUSY bit, so the same CB_GETATTR will not be submitted to the
workqueue more than once.
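
For illustration only, the guard described above can be sketched in
userspace C with C11 atomics. This is not the NFSD code: test_and_set_flag()
is a stand-in for the kernel's test_and_set_bit(), and the bit number,
try_queue_getattr(), and the "queued" counter are hypothetical names used
only to show why a second submission is refused while the busy bit is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number standing in for the CB_GETATTR_BUSY bit. */
#define CB_GETATTR_BUSY_BIT 0u

/*
 * Userspace analogue of test_and_set_bit(): atomically set the bit and
 * return whether it was already set before this call.
 */
static bool test_and_set_flag(atomic_ulong *flags, unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Clear the bit, e.g. once the callback has completed. */
static void clear_flag(atomic_ulong *flags, unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	atomic_fetch_and(flags, ~(1UL << bit));
}

/*
 * Hypothetical submit path: only the caller that flips the busy bit from
 * 0 to 1 may queue the work item. "queued" counts successful submissions
 * and stands in for handing the item to a workqueue.
 */
static bool try_queue_getattr(atomic_ulong *flags, int *queued)
{
	if (test_and_set_flag(flags, CB_GETATTR_BUSY_BIT))
		return false;	/* already in flight, do not queue again */
	(*queued)++;
	return true;
}
```

With this shape, a second call to try_queue_getattr() before clear_flag()
returns false, so the same item is never queued twice.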

>
>
>>> - It could be that nfsd4_run_cb_work() marked the backchannel down
>>>     but somehow did not wake up any in-flight callback requests.
>>>
>>> Let's get more details about what's going on.
>>>
>>>
>>>>> I can add patches to nfsd-fixes to revert CB_GETATTR and let that
>>>>> sit for a few days while we decide how to move forward.
>>>> The simplest solution for this particular problem is to use wait with
>>>> timeout.
>>> The hard hang was due to an uninterruptible wait, which has now been
>>> reverted.
>>>
>>> Going forward, if there's no wait, there can be no timeout. The
>>> only approach is to handle errors properly when dispatching a
>>> callback.
>> Not even a 30ms wait for a well-behaved client, same as nfsd_wait_for_delegreturn?
> 30 milliseconds is acceptable. It's very brief and can never result
> in a shutdown hang. I just don't want a long timeout.

Thanks! I will submit a v3 patch with a timeout of 30 milliseconds.

-Dai

