NFSD callback operations block everything when clients are unresponsive

Linux NFS development

* NFSD callback operations block everything when clients are unresponsive
@ 2024-09-13 20:05 Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Bugspray Bot @ 2024-09-13 20:05 UTC (permalink / raw)
  To: trondmy, cel, jlayton, linux-nfs, anna

cel writes via Kernel.org Bugzilla:

Several reporters note that after commit c1ccfcf1a9bf ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"), NFSD's callback work queue is blocked when one of the clients is unresponsive.

We know that NFSD's callback_wq is single-threaded (ordered), and that there is only one WQ for all of the NFS server's clients.

What blocks callback operations is the retry loop in nfsd4_run_cb_work(). It was added to ensure that CB_OFFLOAD operations are delivered reliably, but it causes head-of-queue blocking when any NFS client becomes unresponsive while a callback operation is pending.

We've partially addressed this by giving each lease its own callback_wq.
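
To make the head-of-queue effect concrete, here is a minimal userspace model (not kernel code). The struct and function names are hypothetical, and the model assumes that a retrying job simply occupies the worker for its whole retry time and that per-client queues run independently of one another.

```c
#include <stddef.h>

/* Hypothetical model of one queued callback; not a kernel structure. */
struct cb_job {
	int client;        /* which client the callback targets */
	unsigned cost_ms;  /* time the worker spends on it, retries included */
};

/* Finish time of job idx when every client shares one ordered queue. */
unsigned finish_shared(const struct cb_job *jobs, size_t idx)
{
	unsigned t = 0;

	for (size_t i = 0; i <= idx; i++)
		t += jobs[i].cost_ms;
	return t;
}

/* Finish time of job idx when each client has its own ordered queue. */
unsigned finish_per_client(const struct cb_job *jobs, size_t idx)
{
	unsigned t = 0;

	for (size_t i = 0; i <= idx; i++)
		if (jobs[i].client == jobs[idx].client)
			t += jobs[i].cost_ms;
	return t;
}
```

With a 30000 ms job for an unresponsive client 1 queued ahead of a 1 ms job for client 2, the shared queue finishes client 2's job at 30001 ms, while per-client queues finish it at 1 ms. Client 1's own later callbacks still wait behind the stuck one in either layout.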

However, it's clear that retrying callback operations from within the callback WQ is going to be problematic to some extent. The solution is to hoist the responsibility for retrying higher up into the individual implementations of the callback operations (CB_RECALL, CB_NOTIFY_LOCK, CB_OFFLOAD, and so on), since each of these operations has its own needs in terms of recourse when a callback operation cannot be sent.

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c0
You can reply to this message to join the discussion.
-- 
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)


* Re: NFSD callback operations block everything when clients are unresponsive
  2024-09-13 20:05 NFSD callback operations block everything when clients are unresponsive Bugspray Bot
@ 2024-09-13 20:05 ` Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Bugspray Bot @ 2024-09-13 20:05 UTC (permalink / raw)
  To: trondmy, cel, jlayton, linux-nfs, anna

cel writes via Kernel.org Bugzilla:

A little code audit:

fs/nfsd/nfs4layouts.c:739  static const struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
fs/nfsd/nfs4proc.c:1622    static const struct nfsd4_callback_ops nfsd4_cb_offload_ops = {
fs/nfsd/nfs4state.c:399    static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops = {
fs/nfsd/nfs4state.c:3079   static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
fs/nfsd/nfs4state.c:3084   static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
fs/nfsd/nfs4state.c:5182   static const struct nfsd4_callback_ops nfsd4_cb_recall_ops = {

We have these six callback operations to deal with.

I think the ->release nfsd4_callback_ops method might be used to schedule a retry -- it's invoked by nfsd41_destroy_cb(), which should be able to tell whether a reply has been received.

Now I just need to figure out how to keep a record of needing to resend a callback.
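
One way such a record could look, purely as a userspace sketch: a per-callback flag set when the RPC completes without a reply, and consumed in the release path. Everything here (struct cb_sketch, its fields, both functions, and the max_resends bound) is hypothetical and only illustrates the idea, not the eventual kernel change.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct nfsd4_callback; fields are illustrative. */
struct cb_sketch {
	bool need_resend;   /* set when the RPC ended without a reply */
	int  resend_count;  /* how many resends release has asked for */
	int  released;      /* nonzero once the callback is torn down */
};

/* Completion path: got_reply is false on timeout or transport shutdown. */
void cb_sketch_done(struct cb_sketch *cb, bool got_reply)
{
	cb->need_resend = !got_reply;
}

/*
 * Release path: returns true when the caller should queue the callback
 * again instead of tearing it down. max_resends bounds the retries.
 */
bool cb_sketch_release(struct cb_sketch *cb, int max_resends)
{
	if (cb->need_resend && cb->resend_count < max_resends) {
		cb->need_resend = false;
		cb->resend_count++;
		return true;
	}
	cb->released = 1;
	return false;
}
```

The point of keeping the flag on the callback itself is that the decision to resend is made after the work item has finished, so the queue worker is never held while waiting for the client.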

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c1

* Re: NFSD callback operations block everything when clients are unresponsive
  2024-09-13 20:05 NFSD callback operations block everything when clients are unresponsive Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
@ 2024-09-13 20:05 ` Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Bugspray Bot @ 2024-09-13 20:05 UTC (permalink / raw)
  To: trondmy, cel, jlayton, linux-nfs, anna

jlayton writes via Kernel.org Bugzilla:

The retry loop in this case is requeueing the work using queue_delayed_work(). That shouldn't be blocking jobs with shorter delays that are sitting on the same queue. Are you certain that's the case? That sounds like a bug in the workqueue implementation if so.
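
The expectation above can be modelled in userspace as follows. The sketch assumes the usual delayed-work behaviour, in which an item waits on its own timer and only joins the queue when that timer fires. The struct and function names are hypothetical, and this is not the kernel's struct delayed_work.

```c
#include <stddef.h>

/* Hypothetical model of a delayed work item. */
struct dwork_sketch {
	int id;
	unsigned submit_ms;  /* when the item was submitted */
	unsigned delay_ms;   /* requested delay before it may run */
};

/* Time at which the item's timer fires and it joins the queue. */
static unsigned expiry(const struct dwork_sketch *w)
{
	return w->submit_ms + w->delay_ms;
}

/*
 * Fill order[] (n entries) with item ids in the order they would enter
 * the queue: earliest expiry first, ties kept in submission order.
 */
void enqueue_order(const struct dwork_sketch *items, size_t n, int *order)
{
	/* Stable insertion sort; order[] holds indices into items[] for now. */
	for (size_t i = 0; i < n; i++) {
		size_t j = i;

		while (j > 0 && expiry(&items[order[j - 1]]) > expiry(&items[i])) {
			order[j] = order[j - 1];
			j--;
		}
		order[j] = (int)i;
	}
	/* Replace the indices with the caller-visible ids. */
	for (size_t i = 0; i < n; i++)
		order[i] = items[order[i]].id;
}
```

Under this model, a retry submitted at 0 ms with a 5000 ms delay enters the queue after an item submitted at 100 ms with no delay, so the long delay alone would not hold up other work.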

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c2

* Re: NFSD callback operations block everything when clients are unresponsive
  2024-09-13 20:05 NFSD callback operations block everything when clients are unresponsive Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
@ 2024-09-13 20:05 ` Bugspray Bot
  2024-09-13 20:05 ` Bugspray Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Bugspray Bot @ 2024-09-13 20:05 UTC (permalink / raw)
  To: trondmy, cel, jlayton, linux-nfs, anna

cel writes via Kernel.org Bugzilla:

Agreed, queue_delayed_work() isn't working the way I expected, but it may behave differently with an ordered work queue than it does with a bog-standard work queue instance.

In any event, some CB operations can be "fire and forget" while others will want some recourse on failure-to-send, and at least CB_OFFLOAD needs to be as reliable as we can make it. Thus having specific retry handlers for each CB operation seems like the best long-term approach.
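
A per-operation retry handler could be sketched like this in userspace C. The ops struct, the should_retry hook, and the bound of five attempts are all hypothetical; which real CB operations would fall into which category is not decided here, apart from CB_OFFLOAD wanting the more persistent policy.

```c
#include <stdbool.h>

/* Hypothetical per-operation hooks; not the kernel's nfsd4_callback_ops. */
struct cb_ops_sketch {
	const char *name;
	/* Decide whether to try again after attempts_made failed sends. */
	bool (*should_retry)(int attempts_made);
};

static bool retry_never(int attempts_made)
{
	(void)attempts_made;
	return false;
}

static bool retry_bounded(int attempts_made)
{
	return attempts_made < 5;  /* illustrative bound only */
}

const struct cb_ops_sketch fire_and_forget_ops = { "fire-and-forget", retry_never };
const struct cb_ops_sketch offload_like_ops = { "offload-like", retry_bounded };

/* Count send attempts, assuming every send to the client fails. */
int attempts_when_client_is_dead(const struct cb_ops_sketch *ops)
{
	int attempts = 0;

	do {
		attempts++;  /* one send attempt, assumed to fail */
	} while (ops->should_retry(attempts));
	return attempts;
}
```

Here a fire-and-forget operation gives up after one attempt and the offload-like one after five. A real handler would reschedule itself rather than loop, so that other callbacks are not held up.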

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c3

* Re: NFSD callback operations block everything when clients are unresponsive
  2024-09-13 20:05 NFSD callback operations block everything when clients are unresponsive Bugspray Bot
                   ` (2 preceding siblings ...)
  2024-09-13 20:05 ` Bugspray Bot
@ 2024-09-13 20:05 ` Bugspray Bot
  3 siblings, 0 replies; 5+ messages in thread
From: Bugspray Bot @ 2024-09-13 20:05 UTC (permalink / raw)
  To: trondmy, cel, jlayton, linux-nfs, anna

cel writes via Kernel.org Bugzilla:

Commit c1ccfcf1a9bf ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down") was reverted from v6.9-rc to prevent an unresponsive client from backing up callbacks from all clients.

In addition, I'm planning to prototype an implementation of OFFLOAD_STATUS for the Linux NFS client so the COPY operations don't hang if the CB_OFFLOAD gets lost.
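
The client-side fallback could take roughly this shape, shown as a userspace sketch rather than the planned implementation: if CB_OFFLOAD has not been seen, poll with OFFLOAD_STATUS a bounded number of times. The struct, both functions, and the max_polls bound are hypothetical, and the server's reply is simulated by a counter.

```c
#include <stdbool.h>

/* Hypothetical client-side state for one asynchronous COPY. */
struct copy_sketch {
	bool cb_offload_seen;   /* CB_OFFLOAD arrived from the server */
	int  polls_until_done;  /* simulated: server reports done on this poll */
	int  polls_sent;        /* OFFLOAD_STATUS requests issued so far */
};

/* Stand-in for sending OFFLOAD_STATUS; returns true once the copy is done. */
static bool poll_offload_status(struct copy_sketch *c)
{
	c->polls_sent++;
	return c->polls_sent >= c->polls_until_done;
}

/* Returns true if the copy is known to be complete within max_polls polls. */
bool wait_for_copy(struct copy_sketch *c, int max_polls)
{
	if (c->cb_offload_seen)
		return true;  /* normal path: the callback was delivered */
	for (int i = 0; i < max_polls; i++)
		if (poll_offload_status(c))
			return true;
	return false;
}
```

A real client would sleep between polls and handle errors from the server. The sketch only shows that a lost CB_OFFLOAD no longer leaves the COPY waiting forever.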

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c4
