From: dai.ngo@oracle.com
To: Chuck Lever <chuck.lever@oracle.com>
Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org, linux-nfs@stwm.de
Subject: Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck
Date: Mon, 18 Dec 2023 12:27:30 -0800
Message-ID: <f68e2a10-4bb0-4b46-966d-d7806a10faea@oracle.com>
In-Reply-To: <ZYCZQYK9jM2dhiag@tissot.1015granger.net>
On 12/18/23 11:10 AM, Chuck Lever wrote:
> On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@oracle.com wrote:
>> On 12/18/23 8:02 AM, Chuck Lever wrote:
>>> On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote:
>>>> On 12/15/23 7:57 PM, Chuck Lever wrote:
>>> What we don't know is why the callback was lost.
>>>
>>> - It could be that queue_work() returned false because of a bug.
>>> Note that there is a WARN_ON_ONCE() that fires in this case: if
>>> it fired several days before the hang, then we won't see any
>>> log messages for more recent misqueued work items.
>> The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation
>> recall and not from nfs4_cb_getattr. I suspect this is because of a
>> possible bug in __break_lease, as in my question for Jeff above.
> OK, so there's no indication at all if nfsd4_run_cb() fails when
> NFSD queues CB_GETATTR? No wonder it's a silent failure.
This patch adds a WARN_ON_ONCE just in case, but I don't think this
condition will ever happen, since we already do a test_and_set_bit on the
CB_GETATTR_BUSY bit, so the same CB_GETATTR will not be submitted to the
workqueue more than once.
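
As a rough illustration of that guard (a userspace sketch only, not the
NFSD code): the helpers below imitate the kernel's test_and_set_bit() with
C11 atomics. DEMO_CB_BUSY, demo_test_and_set_bit(), demo_clear_bit() and
demo_submit_callback() are hypothetical names made up for this example;
only the idea of a busy bit that rejects a second submission comes from
the discussion above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number; stands in for a CB_GETATTR_BUSY-style flag. */
#define DEMO_CB_BUSY 0

/*
 * Userspace stand-in for test_and_set_bit(): atomically set the bit and
 * return whether it was already set.
 */
static bool demo_test_and_set_bit(int nr, atomic_ulong *flags)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Clear the bit once the (simulated) callback has completed. */
static void demo_clear_bit(int nr, atomic_ulong *flags)
{
	atomic_fetch_and(flags, ~(1UL << nr));
}

/*
 * Submit a callback only if none is in flight. Returns true if the
 * callback was "queued" (counted in *queued), false if the busy bit
 * was already set and the submission was skipped.
 */
static bool demo_submit_callback(atomic_ulong *flags, int *queued)
{
	if (demo_test_and_set_bit(DEMO_CB_BUSY, flags))
		return false;	/* already in flight, do not queue again */

	(*queued)++;
	return true;
}
```

With the bit set by the first submission, a second call to
demo_submit_callback() returns false until demo_clear_bit() runs, which is
why a double submit of the same work item should not occur.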
>
>
>>> - It could be that nfsd4_run_cb_work() marked the backchannel down
>>> but somehow did not wake up any in-flight callback requests.
>>>
>>> Let's get more details about what's going on.
>>>
>>>
>>>>> I can add patches to nfsd-fixes to revert CB_GETATTR and let that
>>>>> sit for a few days while we decide how to move forward.
>>>> The simplest solution for this particular problem is to use wait with
>>>> timeout.
>>> The hard hang was due to an uninterruptible wait, which has now been
>>> reverted.
>>>
>>> Going forward, if there's no wait, there can be no timeout. The
>>> only approach is to handle errors properly when dispatching a
>>> callback.
>> Not even a 30ms wait for a well-behaved client, same as nfsd_wait_for_delegreturn?
> 30 milliseconds is acceptable. It's very brief and can never result
> in a shutdown hang. I just don't want a long timeout.
Thanks! I will submit a v3 patch with a timeout of 30 milliseconds.
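
To make the idea of a short, bounded wait concrete, here is a userspace
sketch using POSIX threads. It is not the v3 patch and does not reproduce
nfsd_wait_for_delegreturn; struct demo_cb, demo_cb_init(),
demo_cb_complete(), demo_cb_wait() and DEMO_WAIT_MS are hypothetical names
for illustration. The point is only that the waiter gives up after the
deadline instead of blocking indefinitely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical timeout, matching the 30 milliseconds discussed above. */
#define DEMO_WAIT_MS 30

/* Hypothetical callback state: "done" is set when the reply arrives. */
struct demo_cb {
	pthread_mutex_t lock;
	pthread_cond_t done_cv;
	bool done;
};

static void demo_cb_init(struct demo_cb *cb)
{
	pthread_mutex_init(&cb->lock, NULL);
	pthread_cond_init(&cb->done_cv, NULL);
	cb->done = false;
}

/* Mark the callback complete and wake any waiter. */
static void demo_cb_complete(struct demo_cb *cb)
{
	pthread_mutex_lock(&cb->lock);
	cb->done = true;
	pthread_cond_broadcast(&cb->done_cv);
	pthread_mutex_unlock(&cb->lock);
}

/*
 * Wait at most timeout_ms for completion. Returns true if the callback
 * completed, false if the deadline passed first.
 */
static bool demo_cb_wait(struct demo_cb *cb, long timeout_ms)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	/* Compute an absolute deadline on the condvar's default clock. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&cb->lock);
	/* Loop to absorb spurious wakeups; stop on completion or timeout. */
	while (!cb->done && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&cb->done_cv, &cb->lock, &deadline);
	done = cb->done;
	pthread_mutex_unlock(&cb->lock);

	return done;
}
```

A caller that gets false back from demo_cb_wait(&cb, DEMO_WAIT_MS) simply
proceeds without the reply, so an unresponsive peer costs at most the
timeout and cannot hold up shutdown.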
-Dai