Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed

From: dai.ngo@oracle.com
To: Chuck Lever <chuck.lever@oracle.com>
Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org, linux-nfs@stwm.de
Subject: Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck
Date: Mon, 18 Dec 2023 10:17:49 -0800	[thread overview]
Message-ID: <11b384fc-413b-45e8-8592-1f736de6a3de@oracle.com> (raw)
In-Reply-To: <ZYBtEs7JfMCi0TCl@tissot.1015granger.net>


On 12/18/23 8:02 AM, Chuck Lever wrote:
> On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@oracle.com wrote:
>> On 12/15/23 7:57 PM, Chuck Lever wrote:
>>> On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@oracle.com wrote:
>>>> On 12/15/23 5:21 PM, Chuck Lever wrote:
>>>>> On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@oracle.com wrote:
>>>>>> Sorry Chuck, I didn't see this before sending v2.
>>>>>>
>>>>>> On 12/15/23 1:41 PM, Chuck Lever wrote:
>>>>>>> On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@oracle.com wrote:
>>>>>>>> On 12/15/23 11:54 AM, Chuck Lever wrote:
>>>>>>>>> On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote:
>>>>>>>>>> @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
>>>>>>>>>>       			nfs4_cb_getattr(&dp->dl_cb_fattr);
>>>>>>>>>>       			spin_unlock(&ctx->flc_lock);
>>>>>>>>>> -			wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
>>>>>>>>>> +			wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
>>>>>>>>>> +				TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
>>>>>>>>> I'm still thinking the timeout here should be the same (or slightly
>>>>>>>>> longer than) the RPC retransmit timeout, rather than adding a new
>>>>>>>>> NFSD_CB_GETATTR_TIMEOUT macro.
>>>>>>>> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a
>>>>>>>> work item to the workqueue so RPC is not involved here.
>>>>>>> In the "RPC was sent successfully" case, there is an implicit
>>>>>>> assumption here that wait_on_bit_timeout() won't time out before the
>>>>>>> actual RPC CB_GETATTR timeout.
>>>>>>>
>>>>>>> You've chosen timeout values that happen to work, but there's
>>>>>>> nothing in this patch that ties the two timeout values together or
>>>>>>> in any other way documents this implicit assumption.
>>>>>> The timeout value was chosen to be greater then RPC callback receive
>>>>>> timeout. I can add this to the commit message.
>>>>> nfsd needs to handle this case properly. A commit message will not
>>>>> be sufficient.
>>>>>
>>>>> The rpc_timeout setting that is used for callbacks is not always
>>>>> 9 seconds. It is adjusted based on the NFSv4 lease expiry up to a
>>>>> maximum of 360 seconds, if I'm reading the code correctly (see
>>>>> max_cb_time).
>>>>>
>>>>> It would be simple enough for a server admin to set a long lease
>>>>> expiry while individual CB_GETATTRs are still terminating with
>>>>> EIO after just 20 seconds because of this new timeout.
>> With courteous server, what is the benefit of allowing the admin to
>> extend the lease?
> The lease time is the time period during which the server
> /guarantees/ it will preserve the client's locks, even if there are
> conflicting open or lock requests from other clients.
>
> Once the client transitions to courtesy, those locks can be lost
> due to conflicting open and lock requests.
>
> Clients need to know the server's lease expiry so they know how
> often to renew their lease. That's the only way lease-based locking
> service can provide a lock guarantee.
>
>
>> I think this option should be removed, it's just
>> an administrative overhead and can cause more confusion.
> A server administrator might extend the lock guarantee by several
> minutes if her network is subject to frequent partitions -- that
> will result in long workload pauses, but it will /guarantee/ that
> locks won't be lost during short partitions.
>
> The only reason not to extend lease expiry is that it also
> lengthens the server's grace period. If the server is accessed
> only by NFSv4.1 clients, they can use RECLAIM_COMPLETE to avoid
> long waits after a server reboot... but with a mixed client
> cohort, a shorter grace period is usually preferred.
>
> We have the tunable now, so I don't see a lot of value in making
> an effort to deprecate and remove it until NFSD no longer supports
> NFSv3 and NFSv4.0.

Thank you for the explanation.

>
>
>>>> To handle case where server admin sets longer lease, we can set
>>>> callback timeout to be (nfsd4_lease)/10 + 5) and add a comment
>>>> in the code to indicate the dependency to max_cb_time.
>>> Which was my initial suggestion, but now I think the only proper fix
>>> is to replace the wait_on_bit() entirely.
>>>
>>>
>>>>> Actually, a bit wait in an nfsd thread should be a no-no. Even a
>>>>> wait of tens of milliseconds is bad. Enough nfsd threads go into a
>>>>> wait like this and that's a denial-of-service. That's why NFSv4
>>>>> callbacks are handled on a work queue and not in an nfsd thread.
>>>> That sounds reasonable. However I see in the current code there
>>>> are multiple places the nfsd thread sleeps/waits for certain events:
>>>> nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init,
>>>> wait_for_concurrent_writes, etc.
>>> Yep, each of those needs to be replaced if there is a danger of the
>>> wait becoming indefinite. We found another one recently with the
>>> pNFS block layout type waiting for a lease breaker. So an nfsd
>>> thread does wait on occasion, but it's almost never the right thing
>>> to do.
>>>
>>> Let's not add another one, especially one that can be manipulated by
>>> (bad) client behavior.
>> I'm not clear how the wait for CB_GETATTR can be manipulated by a bad
>> client, can you elaborate?
> Currently, a callback is handed off to a background worker and the
> nfsd thread is permitted to look for other work.
>
> If instead nfsd threads waited for callbacks, then a client with an
> unresponsive callback service would pin those nfsd threads for the
> length of the server's callback timeout.
>
> If the client's forechannel workload is actively acquiring
> delegations and then sending conflicting GETATTRs, it will keep such
> a server's nfsd threads stuck waiting for CB_GETATTR replies.

If the GETATTR and the delegation owner come from the same client then
we just reply to the GETATTR without sending CB_GETATTR.

>
> And since those nfsd threads are shared by all clients, all of the
> server's clients would see long delays because there would be no
> available nfsd threads to pick up new work.
>
>
>>>>> Is there some way the operation that triggered the CB_GETATTR can be
>>>>> deferred properly and picked up by another nfsd thread when the
>>>>> CB_GETATTR completes?
>>>> We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY
>>>> to the conflict client. When the reply of the CB_GETATTR arrives,
>>>> nfsd4_cb_getattr_done can mark the delegation to indicate the
>>>> corresponding file's attributes were updated so when the conflict
>>>> client retries the GETATTR we can return the updated attributes.
>>>>
>>>> We still have to have a way to detect that the client never, or
>>>> take too long, to reply to the CB_GETATTR so that we can break
>>>> the lease.
>>> As long as the RPC is SOFT, the RPC client is supposed to guarantee
>>> that the upper layer gets a completion of some kind. If it's HARD
>>> then it should fully interruptible by a signal or shutdown.
>> This is only true if the workqueue worker runs and executes the work
>> item successfully. If the work item is stuck at the workqueue then RPC
>> is not involved. NFSD must handle the case where the work item is
>> never executed for any reasons.
> The queue_work() API guarantees that the work item will be
> dispatched. A lot of kernel subsystems would be in a world of pain
> if that guarantee was broken somehow.
>
> So I don't agree that NFSD needs to do anything special here when
> queue_work() returns true.
>
> The only reason I can think of that queue_work() might return false
> is if NFSD hands it a work item that is already queued. That would
> be a bug in NFSD.

Jeff, when __break_lease is called with 'want_write' then FL_UNLOCK_PENDING
is checked to prevent lm_break being called again. When __break_lease is
called with '!want_write' then lease_breaking is called to prevent lm_break
being called again. This would work for NFSD if it supports read delegation
only. With write delegation, shouldn't __break_lease() always use
lease_breaking to prevent __being called again? The scenario is read request
conflicts with write delegation then another write request conflicts with
the same write delegation.

>
>>> Either way, if NFSD can manage to reliably detect when the CB RPC
>>> has not even been scheduled, then that gives us a fully dependable
>>> guarantee.
>> Once the async CB RPC was passed to the RPC layer via rpc_run_task,
>> I can't see any sure way to know when the RPC task will be picked up
>> and run. Until the RPC task is run, the RPC time out mechanism is not
>> in play. To handle this condition, I think a timeout mechanism is
>> needed at the NFSD layer, any other options?
> The only reason you think a timeout is needed is because you want
> the nfsd thread to wait for the reply. That's just not how NFSD
> handles NFSv4 callbacks.
>
> The current architecture is that NFSD queues the callback and then
> replies NFS4ERR_DELAY. That nfsd thread is then free to pick up
> other work, including handling the client's retry.
>
> It doesn't matter how long it takes for the callback work item to go
> from the work queue down to the RPC layer, as long as a subsequent
> server shutdown does not hang. Either the client will reply to the
> callback, or the server will see that there was no response and
> revoke the delegation.
>
> (Note that we have nfsd_wait_for_delegreturn() now: it does wait
> uninterruptibly in an nfsd thread context, but only for 30
> milliseconds. The point of this is to give a well-behaved client
> a chance to respond without returning NFS4ERR_DELAY -- if the
> client doesn't respond, then it falls back to the architecture
> outlined above.)

The wait in nfsd4_deleg_getattr_conflict can be modified to do
the same as nfsd_wait_for_delegreturn which is wait for only 30ms
to work with well-behaved clients.

>
>
>>>> Also, the intention of the wait_on_bit is to avoid sending the
>>>> conflict client the NFS4ERR_DELAY if everything works properly
>>>> which is the normal case.
>>>>
>>>> So I think this can be done but it would make the code a bit
>>>> complicate and we loose the optimization of avoiding the
>>>> NFS4ERR_DELAY.
>>> I'm not excited about handling this via DELAY either. There's a
>>> good chance there is another way to deal with this sanely.
>>>
>>> I'm inclined to revert CB_GETATTR support until it is demonstrated
>>> to be working reliably. The current mechanism has already shown to
>>> be prone to a hard hang that blocks server shutdown, and it's not
>>> even in the wild yet.
>> If I understand the problem correctly, this hard hang issue is due to
>> the work item being stuck at the workqueue which the current code does
>> not take into account.
> The hard hang was because of an uninterruptible wait_on_bit().

In my debug I simulate the condition where workqueue is stuck when
nfs4_cb_getattr is called and this causes the reboot to hang. But
I'm not sure this is the cause in the problem that was reported.
Note that the WARN_ON_ONCE came from nfsd_break_one_deleg. I will
verify that if the work item for nfsd_break_one_deleg is stuck will
the server reboot hang.


>   What
> we don't know is why the callback was lost.
>
> - It could be that queue_work() returned false because of a bug.
>    Note that there is a WARN_ON_ONCE() that fires in this case: if
>    it fired several days before the hang, then we won't see any
>    log messages for more recent misqueued work items.

The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation
recall and not from nfs4_cb_getattr. I suspect this is because of a
possible bug in __break_lease as question for Jeff above.

>
> - It could be that nfsd4_run_cb_work() marked the backchannel down
>    but somehow did not wake up any in-flight callback requests.
>
> Let's get more details about what's going on.
>
>
>>> I can add patches to nfsd-fixes to revert CB_GETATTR and let that
>>> sit for a few days while we decide how to move forward.
>> The simplest solution for this particular problem is to use wait with
>> timeout.
> The hard hang was due to an uninterruptible wait, which has now been
> reverted.
>
> Going forward, if there's no wait, there can be no timeout. The
> only approach is to handle errors properly when dispatching a
> callback.

not even wait for 30ms for well behave client, same as nfsd_wait_for_delegreturn?

-Dai

next prev parent reply	other threads:[~2023-12-18 18:20 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-15 19:15 [PATCH 0/3] Bug fixes for NFSD callback Dai Ngo
2023-12-15 19:15 ` [PATCH 1/3] SUNRPC: remove printk when back channel request not found Dai Ngo
2023-12-15 19:37   ` Jeff Layton
2023-12-15 19:15 ` [PATCH 2/3] NFSD: restore delegation's sc_count if nfsd4_run_cb fails Dai Ngo
2023-12-15 19:42   ` Jeff Layton
2023-12-15 20:00     ` dai.ngo
2023-12-15 20:15       ` Jeff Layton
2023-12-15 20:22         ` dai.ngo
2023-12-15 19:15 ` [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck Dai Ngo
2023-12-15 19:54   ` Chuck Lever
2023-12-15 20:40     ` dai.ngo
2023-12-15 21:41       ` Chuck Lever
2023-12-15 21:55         ` dai.ngo
2023-12-16  1:21           ` Chuck Lever
2023-12-16  3:18             ` dai.ngo
2023-12-16  3:57               ` Chuck Lever
2023-12-16 22:44                 ` dai.ngo
2023-12-18 16:02                   ` Chuck Lever
2023-12-18 18:17                     ` dai.ngo [this message]
2023-12-18 19:10                       ` Chuck Lever
2023-12-18 20:27                         ` dai.ngo
2023-12-15 19:54   ` Jeff Layton
2023-12-15 20:18     ` dai.ngo
2023-12-15 20:25       ` Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=11b384fc-413b-45e8-8592-1f736de6a3de@oracle.com \
    --to=dai.ngo@oracle.com \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@stwm.de \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox