From: Dai Ngo <dai.ngo@oracle.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed
Date: Fri, 29 Mar 2024 10:57:22 -0700 [thread overview]
Message-ID: <97387ec5-bcb4-4c5e-81af-a0038f7fc311@oracle.com> (raw)
In-Reply-To: <ZgbWevtNp8Vust4A@tissot.1015granger.net>
On 3/29/24 7:55 AM, Chuck Lever wrote:
> On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote:
>> On 3/28/24 11:14 AM, Dai Ngo wrote:
>>> On 3/28/24 7:08 AM, Chuck Lever wrote:
>>>> On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote:
>>>>> On 3/26/24 11:27 AM, Chuck Lever wrote:
>>>>>> On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wrote:
>>>>>>> Currently when a nfs4_client is destroyed we wait for
>>>>>>> the cb_recall_any
>>>>>>> callback to complete before proceed. This adds
>>>>>>> unnecessary delay to the
>>>>>>> __destroy_client call if there is problem communicating
>>>>>>> with the client.
>>>>>> By "unnecessary delay" do you mean only the seven-second RPC
>>>>>> retransmit timeout, or is there something else?
>>>>> when the client network interface is down, the RPC task takes ~9s to
>>>>> send the callback, waits for the reply and gets ETIMEDOUT. This process
>>>>> repeats in a loop with the same RPC task before being stopped by
>>>>> rpc_shutdown_client after client lease expires.
>>>> I'll have to review this code again, but rpc_shutdown_client
>>>> should cause these RPCs to terminate immediately and safely. Can't
>>>> we use that?
>>> rpc_shutdown_client works, it terminated the RPC call to stop the loop.
>>>
>>>>
>>>>> It takes a total of about 1m20s before the CB_RECALL is terminated.
>>>>> For CB_RECALL_ANY and CB_OFFLOAD, this process gets in to a infinite
>>>>> loop since there is no delegation conflict and the client is allowed
>>>>> to stay in courtesy state.
>>>>>
>>>>> The loop happens because in nfsd4_cb_sequence_done if cb_seq_status
>>>>> is 1 (an RPC Reply was never received) it calls nfsd4_mark_cb_fault
>>>>> to set the NFSD4_CB_FAULT bit. It then sets cb_need_restart to true.
>>>>> When nfsd4_cb_release is called, it checks cb_need_restart bit and
>>>>> re-queues the work again.
>>>> Something in the sequence_done path should check if the server is
>>>> tearing down this callback connection. If it doesn't, that is a bug
>>>> IMO.
>> TCP terminated the connection after retrying for 16 minutes and
>> notified the RPC layer which deleted the nfsd4_conn.
> The server should have closed this connection already.
Since there is no delegation conflict, the client remains in courtesy
state.
> Is it stuck
> waiting for the client to respond to a FIN or something?
No, in this case since the client network interface was down server
TCP did not even receive ACK for the transmit so the server TCP
kept retrying.
>
>
>> But when nfsd4_run_cb_work ran again, it got into the infinite
>> loop caused by:
>> /*
>> * XXX: Ideally, we could wait for the client to
>> * reconnect, but I haven't figured out how
>> * to do that yet.
>> */
>> nfsd4_queue_cb_delayed(cb, 25);
>>
>> which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9-rc1.
> The whole paragraph is:
>
> 1503 clnt = clp->cl_cb_client;
> 1504 if (!clnt) {
> 1505 if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
> 1506 nfsd41_destroy_cb(cb);
> 1507 else {
> 1508 /*
> 1509 * XXX: Ideally, we could wait for the client to
> 1510 * reconnect, but I haven't figured out how
> 1511 * to do that yet.
> 1512 */
> 1513 nfsd4_queue_cb_delayed(cb, 25);
> 1514 }
> 1515 return;
> 1516 }
>
> When there's no rpc_clnt and CB_KILL is set, the callback
> operation should be destroyed immediately. CB_KILL is set by
> nfsd4_shutdown_callback. It's only caller is
> __destroy_client().
>
> Why isn't NFSD4_CLIENT_CB_KILL set?
Since __destroy_client was not called, for the reason mentioned above,
nfsd4_shutdown_callback was not called so NFSD4_CLIENT_CB_KILL was not
set.
Since the nfsd callback_wq was created with alloc_ordered_workqueue,
once this loop happens the nfsd callback_wq is stuck since this workqueue
executes at most one work item at any given time.
This means that if a nfs client is shutdown and the server is in this
state then all other clients are effected; all callbacks to other
clients are stuck in the workqueue. And if a callback for a client is
stuck in the workqueue then that client can not unmount its shares.
This is the symptom that was reported recently on this list. I think
the nfsd callback_wq should be created as a normal workqueue that allows
multiple work items to be executed at the same time so a problem with
one client does not effect other clients.
-Dai
>
>
next prev parent reply other threads:[~2024-03-29 17:57 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-26 18:13 [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed Dai Ngo
2024-03-26 18:27 ` Chuck Lever
2024-03-28 1:09 ` Dai Ngo
2024-03-28 14:08 ` Chuck Lever
2024-03-28 18:14 ` Dai Ngo
2024-03-29 0:31 ` Dai Ngo
2024-03-29 14:55 ` Chuck Lever
2024-03-29 17:57 ` Dai Ngo [this message]
2024-03-29 23:42 ` Chuck Lever
2024-03-30 17:46 ` Dai Ngo
2024-03-30 18:28 ` Chuck Lever
2024-03-30 23:30 ` Dai Ngo
2024-04-01 12:49 ` Jeff Layton
2024-04-01 13:34 ` Chuck Lever
2024-04-01 16:00 ` Dai Ngo
2024-04-01 16:46 ` Dai Ngo
2024-04-01 17:49 ` Chuck Lever
2024-04-01 19:55 ` Dai Ngo
2024-04-01 20:17 ` Dai Ngo
2024-04-02 13:58 ` Chuck Lever
2024-04-02 14:29 ` Dai Ngo
2024-04-01 16:11 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=97387ec5-bcb4-4c5e-81af-a0038f7fc311@oracle.com \
--to=dai.ngo@oracle.com \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox