From: Dai Ngo <dai.ngo@oracle.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed
Date: Sat, 30 Mar 2024 16:30:05 -0700 [thread overview]
Message-ID: <84d6311e-a0a3-4fc6-a87e-e09857c765b3@oracle.com> (raw)
In-Reply-To: <ZghZzfIi5RkWDh9K@tissot.1015granger.net>
On 3/30/24 11:28 AM, Chuck Lever wrote:
> On Sat, Mar 30, 2024 at 10:46:08AM -0700, Dai Ngo wrote:
>> On 3/29/24 4:42 PM, Chuck Lever wrote:
>>> On Fri, Mar 29, 2024 at 10:57:22AM -0700, Dai Ngo wrote:
>>>> On 3/29/24 7:55 AM, Chuck Lever wrote:
>>>>> On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote:
>>>>>> On 3/28/24 11:14 AM, Dai Ngo wrote:
>>>>>>> On 3/28/24 7:08 AM, Chuck Lever wrote:
>>>>>>>> On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote:
>>>>>>>>> On 3/26/24 11:27 AM, Chuck Lever wrote:
>>>>>>>>>> On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wrote:
>>>>>>>>>>> Currently when an nfs4_client is destroyed we wait for the
>>>>>>>>>>> cb_recall_any callback to complete before proceeding. This adds
>>>>>>>>>>> unnecessary delay to the __destroy_client call if there is a
>>>>>>>>>>> problem communicating with the client.
>>>>>>>>>> By "unnecessary delay" do you mean only the seven-second RPC
>>>>>>>>>> retransmit timeout, or is there something else?
>>>>>>>>> when the client network interface is down, the RPC task takes ~9s to
>>>>>>>>> send the callback, waits for the reply and gets ETIMEDOUT. This process
>>>>>>>>> repeats in a loop with the same RPC task before being stopped by
>>>>>>>>> rpc_shutdown_client after client lease expires.
>>>>>>>> I'll have to review this code again, but rpc_shutdown_client
>>>>>>>> should cause these RPCs to terminate immediately and safely. Can't
>>>>>>>> we use that?
>>>>>>> rpc_shutdown_client works, it terminated the RPC call to stop the loop.
>>>>>>>
>>>>>>>>> It takes a total of about 1m20s before the CB_RECALL is terminated.
>>>>>>>>> For CB_RECALL_ANY and CB_OFFLOAD, this process gets into an infinite
>>>>>>>>> loop since there is no delegation conflict and the client is allowed
>>>>>>>>> to stay in courtesy state.
>>>>>>>>>
>>>>>>>>> The loop happens because in nfsd4_cb_sequence_done if cb_seq_status
>>>>>>>>> is 1 (an RPC Reply was never received) it calls nfsd4_mark_cb_fault
>>>>>>>>> to set the NFSD4_CB_FAULT bit. It then sets cb_need_restart to true.
>>>>>>>>> When nfsd4_cb_release is called, it checks cb_need_restart bit and
>>>>>>>>> re-queues the work again.
>>>>>>>> Something in the sequence_done path should check if the server is
>>>>>>>> tearing down this callback connection. If it doesn't, that is a bug
>>>>>>>> IMO.
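The cycle in question can be sketched as follows (a paraphrase of the
nfsd4_cb_sequence_done / nfsd4_cb_release interaction, not the literal
fs/nfsd/nfs4callback.c code):

```
/* sketch only, paraphrasing the code paths described above */
nfsd4_cb_sequence_done(cb):
        if (cb->cb_seq_status == 1) {        /* no RPC Reply received */
                nfsd4_mark_cb_fault(clp);    /* sets NFSD4_CB_FAULT */
                cb->cb_need_restart = true;
        }

nfsd4_cb_release(cb):
        if (cb->cb_need_restart)
                requeue the same work item;  /* loop starts over */
```

With no delegation conflict to force the client's expiry, nothing ever
breaks this cycle for CB_RECALL_ANY.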
>>>>>> TCP terminated the connection after retrying for 16 minutes and
>>>>>> notified the RPC layer which deleted the nfsd4_conn.
>>>>> The server should have closed this connection already.
>>>> Since there is no delegation conflict, the client remains in courtesy
>>>> state.
>>>>
>>>>> Is it stuck
>>>>> waiting for the client to respond to a FIN or something?
>>>> No, in this case the client network interface was down, so the
>>>> server's TCP never received an ACK for its transmits and kept
>>>> retrying.
>>> It sounds like the server attempts to maintain the client's
>>> transport while the client is in courtesy state?
>> Yes, TCP keeps retrying while the client is in courtesy state.
> So the client hasn't sent a lease-renewing operation recently, but
> the connection is still there. It then makes sense for the server to
> forcibly close the connection when a client transitions to the
> courtesy state, since that connection instance is suspect.
Yes, this makes sense. This would make the RPC stop within a
lease period.

I'll submit a patch to kill the backchannel connection if there
is an RPC pending when the client is about to enter courtesy state.
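In outline, something like this (a sketch only: the helper name below is
hypothetical, nfsd4_shutdown_callback is the existing shutdown path
discussed later in this message, and the right hook point in the
courtesy-state transition still needs to be confirmed):

```
/* sketch: when a client is about to transition to courtesy state */
if (nfsd4_client_has_pending_cb(clp))   /* hypothetical helper */
        nfsd4_shutdown_callback(clp);   /* terminate backchannel RPCs */
```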
>
> The server would then recognize immediately that it shouldn't post
> any new backchannel operations to that client until it gets a fresh
> connection attempt from it.
Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients.
>
>
>>> I thought the
>>> purpose of courteous server was to more gracefully handle network
>>> partitions, in which case, the transport is not reliable.
>>>
>>> If a courtesy client hasn't re-established a connection with a
>>> backchannel by the time a conflicting open/lock request arrives,
>>> the server has no choice but to expire that client's courtesy
>>> state immediately. Right?
>> Yes, that is the case for CB_RECALL but not for CB_RECALL_ANY.
> CB_RECALL_ANY is done by a shrinker, yes?
CB_RECALL_ANY is done by the memory shrinker or when
num_delegations >= max_delegations.
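In other words, the two triggers look roughly like this (sketch only;
`nn` stands for the per-net nfsd_net, and the shrinker plumbing is
elided):

```
/* sketch of when the server kicks off CB_RECALL_ANY */
if (shrinker_requested_reclaim ||      /* memory-pressure path */
    atomic_long_read(&num_delegations) >= max_delegations)
        deleg_reaper(nn);              /* queues CB_RECALL_ANY work */
```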
>
> Instead of attempting to send a CB_RECALL_ANY to a courtesy client
> which is likely to be unresponsive, why not just expire the oldest
> courtesy client the server has? Or... if NFSD /already/ expires the
> oldest courtesy client as a result of memory pressure, then just
> don't ever send CB_RECALL_ANY to courtesy clients.
Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients.
>
> As soon as a courtesy client reconnects, it will send a lease-
> renewing operation, transition back to an active state, and at that
> point it re-qualifies for getting CB_RECALL_ANY.
>
>
>>> But maybe that's a side-bar.
>>>
>>>
>>>>>> But when nfsd4_run_cb_work ran again, it got into the infinite
>>>>>> loop caused by:
>>>>>>         /*
>>>>>>          * XXX: Ideally, we could wait for the client to
>>>>>>          * reconnect, but I haven't figured out how
>>>>>>          * to do that yet.
>>>>>>          */
>>>>>>         nfsd4_queue_cb_delayed(cb, 25);
>>>>>>
>>>>>> which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9-rc1.
>>>>> The whole paragraph is:
>>>>>
>>>>> 1503         clnt = clp->cl_cb_client;
>>>>> 1504         if (!clnt) {
>>>>> 1505                 if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
>>>>> 1506                         nfsd41_destroy_cb(cb);
>>>>> 1507                 else {
>>>>> 1508                         /*
>>>>> 1509                          * XXX: Ideally, we could wait for the client to
>>>>> 1510                          * reconnect, but I haven't figured out how
>>>>> 1511                          * to do that yet.
>>>>> 1512                          */
>>>>> 1513                         nfsd4_queue_cb_delayed(cb, 25);
>>>>> 1514                 }
>>>>> 1515                 return;
>>>>> 1516         }
>>>>>
>>>>> When there's no rpc_clnt and CB_KILL is set, the callback
>>>>> operation should be destroyed immediately. CB_KILL is set by
>>>>> nfsd4_shutdown_callback. Its only caller is
>>>>> __destroy_client().
>>>>>
>>>>> Why isn't NFSD4_CLIENT_CB_KILL set?
>>>> Since __destroy_client was not called, for the reason mentioned above,
>>>> nfsd4_shutdown_callback was not called so NFSD4_CLIENT_CB_KILL was not
>>>> set.
>>> Well, then I'm missing something. You said, above:
>>>
>>>> Currently when a nfs4_client is destroyed we wait for
>>> And __destroy_client() invokes nfsd4_shutdown_callback(). Can you
>>> explain the usage scenario(s) to me again?
>> __destroy_client is not called in the case of CB_RECALL_ANY since
>> there is no open/lock conflict associated with the client.
>>>> Since the nfsd callback_wq was created with alloc_ordered_workqueue,
>>>> once this loop happens the nfsd callback_wq is stuck since this workqueue
>>>> executes at most one work item at any given time.
>>>>
>>>> This means that if an NFS client is shut down and the server is in this
>>>> state then all other clients are affected; all callbacks to other
>>>> clients are stuck in the workqueue. And if a callback for a client is
>>>> stuck in the workqueue then that client cannot unmount its shares.
>>>>
>>>> This is the symptom that was reported recently on this list. I think
>>>> the nfsd callback_wq should be created as a normal workqueue that allows
>>>> multiple work items to be executed at the same time so a problem with
>>>> one client does not affect other clients.
>>> My impression is that the callback_wq is single-threaded to avoid
>>> locking complexity. I'm not yet convinced it would be safe to simply
>>> change that workqueue to permit multiple concurrent work items.
>> Do you have any specific concerns about locking complexity related to
>> callback operations?
> If there needs to be a fix, I'd like something for v6.9-rc, so it
> needs to be a narrow change. Larger infrastructural changes have to
> go via a full merge window.
>
>
>>> It could be straightforward, however, to move the callback_wq into
>>> struct nfs4_client so that each client can have its own workqueue.
>>> Then we can take some time and design something less brittle and
>>> more scalable (and maybe come up with some test infrastructure so
>>> this stuff doesn't break as often).
>> IMHO I don't see why the callback workqueue has to be different
>> from the laundry_wq or nfsd_filecache_wq used by nfsd.
> You mean, you don't see why callback_wq has to be ordered, while
> the others are not so constrained? Quite possibly it does not have
> to be ordered.
Yes, I looked at all the nfsd4_callback_ops in nfsd and they
seem to take concurrency into account and use locks appropriately.
For each type of work I don't see why one work item has to wait for
the previous one to complete before proceeding.
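If the ordering constraint really is unnecessary, the allocation change
itself would be small (a sketch: the first line is roughly how
callback_wq is created today in nfs4callback.c, the second is one
possible relaxation):

```
/* current: at most one callback work item runs at a time */
callback_wq = alloc_ordered_workqueue("nfsd4_callbacks", 0);

/* possible alternative: allow concurrent callback work items */
callback_wq = alloc_workqueue("nfsd4_callbacks", WQ_UNBOUND, 0);
```

Whether every nfsd4_callback_ops handler is actually safe under
concurrency is exactly what would need auditing first.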
>
> But we might have lost the bit of history that explains why, so
> let's be cautious about making broad changes here until we have a
> good operational understanding of the code and some robust test
> cases to check any changes we make.
Understood, you make the call.
-Dai
>
>