From: cel@kernel.org
To: Neil Brown <neilb@suse.de>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, Chuck Lever <chuck.lever@oracle.com>,
	Olga Kornievskaia <kolga@netapp.com>
Subject: [RFC PATCH 8/9] NFS: Use NFSv4.2's OFFLOAD_STATUS operation
Date: Tue,  8 Oct 2024 09:47:27 -0400
Message-ID: <20241008134719.116825-19-cel@kernel.org>
In-Reply-To: <20241008134719.116825-11-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

We've found that there are cases where a transport disconnection
results in the loss of callback RPCs. NFS servers typically do not
retransmit callback operations after a disconnect.

This can be a problem for the Linux NFS client's current
implementation of asynchronous COPY, which waits indefinitely for a
CB_OFFLOAD callback. If a transport disconnect occurs while an async
COPY is running, there's a good chance the client will never get the
completing CB_OFFLOAD.

Fix this by implementing the OFFLOAD_STATUS operation so that the
Linux NFS client can probe the NFS server if it doesn't see a
CB_OFFLOAD in a reasonable amount of time.

This patch implements a simplistic check. As future work, the client
might also be able to detect a lack of forward progress on the
requested asynchronous COPY operation, and CANCEL it.

Suggested-by: Olga Kornievskaia <kolga@netapp.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
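Note for reviewers (not part of the commit message): the control flow
described above can be modelled in user space as a bounded wait
followed by a status probe. This is only a minimal sketch under stated
assumptions, not the kernel implementation. Every name in it
(wait_for_async_copy, struct copy_ops, struct script and the script_*
stubs) is hypothetical. The stubs stand in for the completion wait and
the OFFLOAD_STATUS RPC. Mapping every unexpected probe result to
-EREMOTEIO is a simplification of the switch in the patch, and the
sketch omits the copy-list bookkeeping entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of one bounded wait for the CB_OFFLOAD callback. */
enum wait_result { WAIT_TIMED_OUT, WAIT_COMPLETED, WAIT_INTERRUPTED };

/* Hypothetical hooks: a bounded wait and an OFFLOAD_STATUS-style probe. */
struct copy_ops {
	enum wait_result (*wait_for_callback)(void *ctx, uint64_t *copied);
	int (*probe_status)(void *ctx, uint64_t *copied);
};

/*
 * Wait for the callback. Each time the wait times out, probe the
 * server: 0 means the copy finished, -EINPROGRESS means wait again,
 * and anything else is treated as a lost copy (an assumption made
 * for this sketch).
 */
static int wait_for_async_copy(const struct copy_ops *ops, void *ctx,
			       uint64_t *count)
{
	uint64_t copied = 0;

	assert(ops && ops->wait_for_callback && ops->probe_status && count);
	*count = 0;
	for (;;) {
		switch (ops->wait_for_callback(ctx, &copied)) {
		case WAIT_COMPLETED:
			*count = copied;
			return 0;
		case WAIT_INTERRUPTED:
			return -EINTR;
		case WAIT_TIMED_OUT:
			break;	/* leave the switch and probe the server */
		}

		switch (ops->probe_status(ctx, &copied)) {
		case 0:
			*count = copied;
			return 0;
		case -EINPROGRESS:
			continue;	/* copy still running: wait again */
		default:
			return -EREMOTEIO;
		}
	}
}

/* Scripted stubs that replay canned wait and probe results in order. */
struct script {
	const enum wait_result *waits;
	const int *probes;
	size_t wait_idx, probe_idx;
	uint64_t copied;
};

static enum wait_result script_wait(void *ctx, uint64_t *copied)
{
	struct script *s = ctx;
	enum wait_result r = s->waits[s->wait_idx++];

	if (r == WAIT_COMPLETED)
		*copied = s->copied;
	return r;
}

static int script_probe(void *ctx, uint64_t *copied)
{
	struct script *s = ctx;
	int status = s->probes[s->probe_idx++];

	if (status == 0)
		*copied = s->copied;
	return status;
}

static const struct copy_ops script_ops = { script_wait, script_probe };
```

Driving the loop with a script of two timeouts, where the first probe
reports -EINPROGRESS and the second reports 0, returns 0 with the
scripted byte count after exactly two probes. A wait that completes on
the first attempt never issues a probe at all.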
 fs/nfs/nfs42proc.c | 56 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 175330843558..fc4f64750dc5 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -175,6 +175,11 @@ int nfs42_proc_deallocate(struct file *filep, loff_t offset, loff_t len)
 	return err;
 }
 
+/* Wait this long before checking progress on a COPY operation */
+enum {
+	NFS42_COPY_TIMEOUT	= 3 * HZ,
+};
+
 static int handle_async_copy(struct nfs42_copy_res *res,
 			     struct nfs_server *dst_server,
 			     struct nfs_server *src_server,
@@ -184,9 +189,10 @@ static int handle_async_copy(struct nfs42_copy_res *res,
 			     bool *restart)
 {
 	struct nfs4_copy_state *copy, *tmp_copy = NULL, *iter;
-	int status = NFS4_OK;
 	struct nfs_open_context *dst_ctx = nfs_file_open_context(dst);
 	struct nfs_open_context *src_ctx = nfs_file_open_context(src);
+	int status = NFS4_OK;
+	u64 copied;
 
 	copy = kzalloc(sizeof(struct nfs4_copy_state), GFP_KERNEL);
 	if (!copy)
@@ -224,7 +230,9 @@ static int handle_async_copy(struct nfs42_copy_res *res,
 		spin_unlock(&src_server->nfs_client->cl_lock);
 	}
 
-	status = wait_for_completion_interruptible(&copy->completion);
+wait:
+	status = wait_for_completion_interruptible_timeout(&copy->completion,
+							   NFS42_COPY_TIMEOUT);
 	spin_lock(&dst_server->nfs_client->cl_lock);
 	list_del_init(&copy->copies);
 	spin_unlock(&dst_server->nfs_client->cl_lock);
@@ -233,15 +241,21 @@ static int handle_async_copy(struct nfs42_copy_res *res,
 		list_del_init(&copy->src_copies);
 		spin_unlock(&src_server->nfs_client->cl_lock);
 	}
-	if (status == -ERESTARTSYS) {
-		goto out_cancel;
-	} else if (copy->flags || copy->error == NFS4ERR_PARTNER_NO_AUTH) {
-		status = -EAGAIN;
-		*restart = true;
+	switch (status) {
+	case 0:
+		goto timeout;
+	case -ERESTARTSYS:
 		goto out_cancel;
+	default:
+		if (copy->flags || copy->error == NFS4ERR_PARTNER_NO_AUTH) {
+			status = -EAGAIN;
+			*restart = true;
+			goto out_cancel;
+		}
 	}
 out:
 	res->write_res.count = copy->count;
+	/* Copy out the updated write verifier provided by CB_OFFLOAD. */
 	memcpy(&res->write_res.verifier, &copy->verf, sizeof(copy->verf));
 	status = -copy->error;
 
@@ -253,6 +267,34 @@ static int handle_async_copy(struct nfs42_copy_res *res,
 	if (!nfs42_files_from_same_server(src, dst))
 		nfs42_do_offload_cancel_async(src, src_stateid);
 	goto out_free;
+timeout:
+	status = nfs42_proc_offload_status(src, &copy->stateid, &copied);
+	switch (status) {
+	case 0:
+	case -EREMOTEIO:
+		/* The server recognized the copy stateid, so it hasn't
+		 * rebooted. Don't overwrite the verifier returned in the
+		 * COPY result. */
+		res->write_res.count = copied;
+		goto out_free;
+	case -EINPROGRESS:
+		goto wait;
+	case -EBADF:
+		/* Server did not recognize the copy stateid. It has
+		 * probably restarted and lost the plot. State recovery
+		 * might redrive the COPY from the beginning, in this
+		 * case? */
+		res->write_res.count = 0;
+		status = -EREMOTEIO;
+		break;
+	case -EOPNOTSUPP:
+		/* RFC 7862 REQUIREs server to support OFFLOAD_STATUS when
+		 * it has signed up for an async COPY, so server is not
+		 * spec-compliant. */
+		res->write_res.count = 0;
+		status = -EREMOTEIO;
+	}
+	goto out;
 }
 
 static int process_copy_commit(struct file *dst, loff_t pos_dst,
-- 
2.46.2

