public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Robert Milkowski <rmilkowski@gmail.com>
To: Trond Myklebust <trondmy@kernel.org>
Cc: anna@kernel.org, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] pnfs/filelayout: handle data server op_status errors
Date: Mon, 23 Mar 2026 13:03:59 +0000	[thread overview]
Message-ID: <48521545-709C-4ECB-A9C2-9C1DA75DA058@gmail.com> (raw)
In-Reply-To: <9799c4ac60c9732941fe4a67493a45eb9b2686ed.camel@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2809 bytes --]

Hi Trond,

Apologies for the very long delay in responding.

You're right that v1 overstated the problem. filelayout already gets
NFS-level errors in `task->tk_status` when there is no RPC transport
failure. I've respun this as a cleanup of the existing filelayout NFSv4.1
error handling rather than as a parallel op_status-only path.

The new version:
- when `tk_rpc_status == 0`, keeps the existing negative `task->tk_status`
  handling and only uses `op_status` once `task->tk_status` is non-negative;
- keeps transport errors on `task->tk_rpc_status`;
- keeps malformed-reply/decode failures on the existing DS-unavailable path;
- preserves the existing DS connection error behaviour (reset to MDS rather
  than retrying the same DS RPC);
- handles `NFS4ERR_NOMATCHING_LAYOUT` through the same invalidate/reset path.

For evidence, the original investigation had two pieces:
- the blocked task was in the filelayout read path
  (`pnfs_update_layout+0x5e2/0xf50`, `fl_pnfs_update_layout.constprop.0`,
  `filelayout_pg_init_read`), and the client was using
  `nfs_layout_nfsv41_files` (`lsmod | grep -i layout` showed
  `nfs_layout_nfsv41_files`);
- in the packet capture, there are 96 DS GETATTR replies of
  `getattr ERROR: unk 60` between 2025-10-27 11:33:18.333378 and
  2025-10-27 11:34:38.718255, from 21 different DS IPs. In
  `include/linux/nfs4.h`, `NFS4ERR_NOMATCHING_LAYOUT` is 10060, so those
  replies are consistent with the DS returning
  `NFS4ERR_NOMATCHING_LAYOUT`, i.e. that the current layout no longer
  matches.

For example, the capture includes:
- `2025-10-27 11:33:18.333378 ... NFS reply xid 1334105934 reply ok 148 getattr ERROR: unk 60`
- `2025-10-27 11:33:18.333392 ... NFS reply xid 2507704000 reply ok 148 getattr ERROR: unk 60`
- `2025-10-27 11:34:38.718255 ... NFS reply xid 860829781 reply ok 148 getattr ERROR: unk 60`

The original incident notes also recorded the same symptom on multiple jobs
across multiple clients in the same time window, so this did not look like
an isolated single-host event.

That is what led me to respin this along the same NFS-status vs
RPC-status split as 38074de35b01, while keeping filelayout's existing
behaviour for local/decode and DS connection failures.

Testing:
- applies cleanly on current upstream `master`
  (`0e4f8f1a3d081e834be5fd0a62bdb2554fadd307`);
- build-tested there with `x86_64_defconfig`,
  `CONFIG_NFS_FS=m`, `CONFIG_NFS_V4=m`, `CONFIG_PNFS_FILE_LAYOUT=m`,
  followed by `make -j4`;
- operationally, we've been running a local version with the same
  `tk_status`/`tk_rpc_status`/`op_status` split and
  `NFS4ERR_NOMATCHING_LAYOUT` handling on multiple servers for several
  months, and haven't seen the original hang recur.

See the attached respun v2 patch.


[-- Attachment #2: 0001-pnfs-filelayout-handle-data-server-op_status-errors.patch --]
[-- Type: application/octet-stream, Size: 7795 bytes --]

From 1f9849a8291cee861d659eb4ea1b109d696dd3ed Mon Sep 17 00:00:00 2001
From: Robert Milkowski <rmilkowski@gmail.com>
Date: Mon, 23 Mar 2026 11:26:17 +0000
Subject: [PATCH] pnfs/filelayout: handle data server op_status errors

Data servers can return per-operation NFS status in op_status even when
the RPC transport succeeds. filelayout_async_handle_error() currently
mixes NFS-level and transport-level handling through task->tk_status.

Model this on the NFS-status vs RPC-status split in 38074de35b01
("NFSv4/flexfiles: Fix handling of NFS level errors in I/O"), while
keeping filelayout's existing behaviour for local/decode and DS
connection failures. Keep the existing negative task->tk_status handling
for successful transports, use op_status only once task->tk_status is
non-negative, and use tk_rpc_status strictly for transport errors.

Wire read/write/commit done callbacks to pass op_status into
filelayout_async_handle_error() and handle NFS4ERR_NOMATCHING_LAYOUT via
the existing layout invalidate / retry-through-MDS path.

Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
---
 fs/nfs/filelayout/filelayout.c | 141 ++++++++++++++++++++++-----------
 1 file changed, 96 insertions(+), 45 deletions(-)

diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index 90a11afa5d05..f941f6fa685e 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -121,6 +121,7 @@ static void filelayout_reset_read(struct nfs_pgio_header *hdr)
 }
 
 static int filelayout_async_handle_error(struct rpc_task *task,
+					 u32 op_status,
 					 struct nfs4_state *state,
 					 struct nfs_client *clp,
 					 struct pnfs_layout_segment *lseg)
@@ -129,50 +130,94 @@ static int filelayout_async_handle_error(struct rpc_task *task,
 	struct inode *inode = lo->plh_inode;
 	struct nfs4_deviceid_node *devid = FILELAYOUT_DEVID_NODE(lseg);
 	struct nfs4_slot_table *tbl = &clp->cl_session->fc_slot_table;
+	int nfs_err = 0;
+
+	/* Only trust NFS status when the RPC reply was received. */
+	if (task->tk_rpc_status == 0) {
+		if (task->tk_status < 0)
+			nfs_err = task->tk_status;
+		else if (op_status)
+			nfs_err = -(int)op_status;
+	}
 
-	if (task->tk_status >= 0)
-		return 0;
-
-	switch (task->tk_status) {
-	/* DS session errors */
-	case -NFS4ERR_BADSESSION:
-	case -NFS4ERR_BADSLOT:
-	case -NFS4ERR_BAD_HIGH_SLOT:
-	case -NFS4ERR_DEADSESSION:
-	case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
-	case -NFS4ERR_SEQ_FALSE_RETRY:
-	case -NFS4ERR_SEQ_MISORDERED:
-		dprintk("%s ERROR %d, Reset session. Exchangeid "
-			"flags 0x%x\n", __func__, task->tk_status,
-			clp->cl_exchange_flags);
-		nfs4_schedule_session_recovery(clp->cl_session, task->tk_status);
+	switch (nfs_err) {
+	case -NFS4ERR_STALE:
+		nfs_err = -ESTALE;
 		break;
-	case -NFS4ERR_DELAY:
-	case -NFS4ERR_GRACE:
-		rpc_delay(task, FILELAYOUT_POLL_RETRY_MAX);
+	case -NFS4ERR_BADHANDLE:
+		nfs_err = -EBADHANDLE;
 		break;
-	case -NFS4ERR_RETRY_UNCACHED_REP:
+	case -NFS4ERR_ISDIR:
+		nfs_err = -EISDIR;
 		break;
-	/* Invalidate Layout errors */
-	case -NFS4ERR_ACCESS:
-	case -NFS4ERR_PNFS_NO_LAYOUT:
-	case -ESTALE:           /* mapped NFS4ERR_STALE */
-	case -EBADHANDLE:       /* mapped NFS4ERR_BADHANDLE */
-	case -EISDIR:           /* mapped NFS4ERR_ISDIR */
-	case -NFS4ERR_FHEXPIRED:
-	case -NFS4ERR_WRONG_TYPE:
-		dprintk("%s Invalid layout error %d\n", __func__,
-			task->tk_status);
-		/*
-		 * Destroy layout so new i/o will get a new layout.
-		 * Layout will not be destroyed until all current lseg
-		 * references are put. Mark layout as invalid to resend failed
-		 * i/o and all i/o waiting on the slot table to the MDS until
-		 * layout is destroyed and a new valid layout is obtained.
-		 */
-		pnfs_destroy_layout(NFS_I(inode));
-		rpc_wake_up(&tbl->slot_tbl_waitq);
-		goto reset;
+	}
+
+	/*
+	 * Preserve the old task->tk_status handling for non-NFS errors after a
+	 * successful transport. This avoids stale or partially decoded op_status
+	 * values overriding a newer local/decode failure.
+	 */
+	if (task->tk_rpc_status == 0 && task->tk_status < 0) {
+		switch (task->tk_status) {
+		case -ECONNREFUSED:
+		case -EHOSTDOWN:
+		case -EHOSTUNREACH:
+		case -ENETUNREACH:
+		case -EIO:
+		case -ETIMEDOUT:
+		case -EPIPE:
+		case -EPROTO:
+		case -ENODEV:
+			dprintk("%s DS reply processing error %d\n", __func__,
+				task->tk_status);
+			goto ds_unavailable;
+		}
+	}
+
+	if (nfs_err < 0) {
+		switch (nfs_err) {
+		/* DS session errors */
+		case -NFS4ERR_BADSESSION:
+		case -NFS4ERR_BADSLOT:
+		case -NFS4ERR_BAD_HIGH_SLOT:
+		case -NFS4ERR_DEADSESSION:
+		case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
+		case -NFS4ERR_SEQ_FALSE_RETRY:
+		case -NFS4ERR_SEQ_MISORDERED:
+			dprintk("%s ERROR %d, Reset session. Exchangeid "
+				"flags 0x%x\n", __func__, nfs_err,
+				clp->cl_exchange_flags);
+			nfs4_schedule_session_recovery(clp->cl_session, nfs_err);
+			goto out_retry;
+		case -NFS4ERR_DELAY:
+		case -NFS4ERR_GRACE:
+			rpc_delay(task, FILELAYOUT_POLL_RETRY_MAX);
+			goto out_retry;
+		case -NFS4ERR_RETRY_UNCACHED_REP:
+			goto out_retry;
+		/* Invalidate Layout errors */
+		case -NFS4ERR_ACCESS:
+		case -NFS4ERR_PNFS_NO_LAYOUT:
+		case -ESTALE:           /* mapped NFS4ERR_STALE */
+		case -EBADHANDLE:       /* mapped NFS4ERR_BADHANDLE */
+		case -EISDIR:           /* mapped NFS4ERR_ISDIR */
+		case -NFS4ERR_FHEXPIRED:
+		case -NFS4ERR_WRONG_TYPE:
+		case -NFS4ERR_NOMATCHING_LAYOUT:
+			dprintk("%s Invalid layout error %d\n", __func__,
+				nfs_err);
+			pnfs_destroy_layout(NFS_I(inode));
+			rpc_wake_up(&tbl->slot_tbl_waitq);
+			goto reset;
+		default:
+			goto reset;
+		}
+	}
+
+	if (task->tk_rpc_status == 0)
+		return 0;
+
+	switch (task->tk_rpc_status) {
 	/* RPC connection errors */
 	case -ECONNREFUSED:
 	case -EHOSTDOWN:
@@ -184,7 +229,8 @@ static int filelayout_async_handle_error(struct rpc_task *task,
 	case -EPROTO:
 	case -ENODEV:
 		dprintk("%s DS connection error %d\n", __func__,
-			task->tk_status);
+			task->tk_rpc_status);
+ds_unavailable:
 		nfs4_mark_deviceid_unavailable(devid);
 		pnfs_error_mark_layout_for_return(inode, lseg);
 		pnfs_set_lo_fail(lseg);
@@ -193,9 +239,11 @@ static int filelayout_async_handle_error(struct rpc_task *task,
 	default:
 reset:
 		dprintk("%s Retry through MDS. Error %d\n", __func__,
-			task->tk_status);
+			task->tk_rpc_status ? task->tk_rpc_status : nfs_err);
 		return -NFS4ERR_RESET_TO_MDS;
 	}
+
+out_retry:
 	task->tk_status = 0;
 	return -EAGAIN;
 }
@@ -208,7 +256,8 @@ static int filelayout_read_done_cb(struct rpc_task *task,
 	int err;
 
 	trace_nfs4_pnfs_read(hdr, task->tk_status);
-	err = filelayout_async_handle_error(task, hdr->args.context->state,
+	err = filelayout_async_handle_error(task, hdr->res.op_status,
+					    hdr->args.context->state,
 					    hdr->ds_clp, hdr->lseg);
 
 	switch (err) {
@@ -318,7 +367,8 @@ static int filelayout_write_done_cb(struct rpc_task *task,
 	int err;
 
 	trace_nfs4_pnfs_write(hdr, task->tk_status);
-	err = filelayout_async_handle_error(task, hdr->args.context->state,
+	err = filelayout_async_handle_error(task, hdr->res.op_status,
+					    hdr->args.context->state,
 					    hdr->ds_clp, hdr->lseg);
 
 	switch (err) {
@@ -346,7 +396,8 @@ static int filelayout_commit_done_cb(struct rpc_task *task,
 	int err;
 
 	trace_nfs4_pnfs_commit_ds(data, task->tk_status);
-	err = filelayout_async_handle_error(task, NULL, data->ds_clp,
+	err = filelayout_async_handle_error(task, data->res.op_status,
+					    NULL, data->ds_clp,
 					    data->lseg);
 
 	switch (err) {
-- 
2.47.1


[-- Attachment #3: Type: text/plain, Size: 5300 bytes --]




Best regards,
 Robert Milkowski



> On 9 Dec 2025, at 15:33, Trond Myklebust <trondmy@kernel.org> wrote:
> 
> On Tue, 2025-12-09 at 13:56 +0000, Robert Milkowski wrote:
>> Data servers can return NFS-level status in op_status, but the file
>> layout
>> driver only looked at RPC transport errors. That means session
>> errors,
>> layout-invalidating statuses, and retry hints from the DS can be
>> ignored,
>> leading to missed session recovery, stale layouts, or failed retries.
> 
> The task->tk_status can carry NFS level status if there was no RPC
> error. That's why we distinguish between task->tk_status and task-
>> tk_rpc_status (the latter being guaranteed to only carry RPC errors).
> 
> IOW: is there any evidence of what you call out above?
> 
>> 
>> Pass the DS op_status into the async error handler and handle the
>> same set
>> of NFS status codes as flexfiles (see commit 38074de35b01,
>> "NFSv4/flexfiles: Fix handling of NFS level errors in I/O"). Wire the
>> read/write/commit callbacks to propagate op_status so the file layout
>> driver
>> can invalidate layouts, trigger session recovery, or retry as
>> appropriate.
>> 
>> Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
>> ---
>>  fs/nfs/filelayout/filelayout.c | 54
>> ++++++++++++++++++++++++++++++++--
>>  1 file changed, 51 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/nfs/filelayout/filelayout.c
>> b/fs/nfs/filelayout/filelayout.c
>> index 5c4551117c58..2808baa19f83 100644
>> --- a/fs/nfs/filelayout/filelayout.c
>> +++ b/fs/nfs/filelayout/filelayout.c
>> @@ -121,6 +121,7 @@ static void filelayout_reset_read(struct
>> nfs_pgio_header *hdr)
>>  }
>>  
>>  static int filelayout_async_handle_error(struct rpc_task *task,
>> + u32 op_status,
>>   struct nfs4_state *state,
>>   struct nfs_client *clp,
>>   struct pnfs_layout_segment
>> *lseg)
>> @@ -130,6 +131,48 @@ static int filelayout_async_handle_error(struct
>> rpc_task *task,
>>   struct nfs4_deviceid_node *devid =
>> FILELAYOUT_DEVID_NODE(lseg);
>>   struct nfs4_slot_table *tbl = &clp->cl_session-
>>> fc_slot_table;
>>  
>> + if (op_status) {
>> + switch (op_status) {
>> + case NFS4_OK:
>> + case NFS4ERR_NXIO:
>> + break;
>> + case NFS4ERR_BADSESSION:
>> + case NFS4ERR_BADSLOT:
>> + case NFS4ERR_BAD_HIGH_SLOT:
>> + case NFS4ERR_DEADSESSION:
>> + case NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
>> + case NFS4ERR_SEQ_FALSE_RETRY:
>> + case NFS4ERR_SEQ_MISORDERED:
>> + dprintk("%s op_status %u, Reset session.
>> Exchangeid "
>> + "flags 0x%x\n", __func__, op_status,
>> + clp->cl_exchange_flags);
>> + nfs4_schedule_session_recovery(clp-
>>> cl_session,
>> +       op_status);
>> + goto out_retry;
>> + case NFS4ERR_DELAY:
>> + case NFS4ERR_GRACE:
>> + case NFS4ERR_RETRY_UNCACHED_REP:
>> + rpc_delay(task, FILELAYOUT_POLL_RETRY_MAX);
>> + goto out_retry;
>> + case NFS4ERR_ACCESS:
>> + case NFS4ERR_PNFS_NO_LAYOUT:
>> + case NFS4ERR_STALE:
>> + case NFS4ERR_BADHANDLE:
>> + case NFS4ERR_ISDIR:
>> + case NFS4ERR_FHEXPIRED:
>> + case NFS4ERR_WRONG_TYPE:
>> + case NFS4ERR_NOMATCHING_LAYOUT:
>> + case NFSERR_PERM:
>> + dprintk("%s Invalid layout op_status %u\n",
>> __func__,
>> + op_status);
>> + pnfs_destroy_layout(NFS_I(inode));
>> + rpc_wake_up(&tbl->slot_tbl_waitq);
>> + goto reset;
>> + default:
>> + goto reset;
>> + }
>> + }
>> +
>>   if (task->tk_status >= 0)
>>   return 0;
>>  
>> @@ -196,6 +239,8 @@ static int filelayout_async_handle_error(struct
>> rpc_task *task,
>>   task->tk_status);
>>   return -NFS4ERR_RESET_TO_MDS;
>>   }
>> +
>> +out_retry:
>>   task->tk_status = 0;
>>   return -EAGAIN;
>>  }
>> @@ -208,7 +253,8 @@ static int filelayout_read_done_cb(struct
>> rpc_task *task,
>>   int err;
>>  
>>   trace_nfs4_pnfs_read(hdr, task->tk_status);
>> - err = filelayout_async_handle_error(task, hdr->args.context-
>>> state,
>> + err = filelayout_async_handle_error(task, hdr-
>>> res.op_status,
>> +     hdr->args.context-
>>> state,
>>       hdr->ds_clp, hdr->lseg);
>>  
>>   switch (err) {
>> @@ -318,7 +364,8 @@ static int filelayout_write_done_cb(struct
>> rpc_task *task,
>>   int err;
>>  
>>   trace_nfs4_pnfs_write(hdr, task->tk_status);
>> - err = filelayout_async_handle_error(task, hdr->args.context-
>>> state,
>> + err = filelayout_async_handle_error(task, hdr-
>>> res.op_status,
>> +     hdr->args.context-
>>> state,
>>       hdr->ds_clp, hdr->lseg);
>>  
>>   switch (err) {
>> @@ -346,7 +393,8 @@ static int filelayout_commit_done_cb(struct
>> rpc_task *task,
>>   int err;
>>  
>>   trace_nfs4_pnfs_commit_ds(data, task->tk_status);
>> - err = filelayout_async_handle_error(task, NULL, data-
>>> ds_clp,
>> + err = filelayout_async_handle_error(task, data-
>>> res.op_status,
>> +     NULL, data->ds_clp,
>>       data->lseg);
>>  
>>   switch (err) {
>> 
>> base-commit: cb015814f8b6eebcbb8e46e111d108892c5e6821
> 
> I'd be willing to take something like the above as a cleanup, assuming
> that it replaces the existing handling of NFSv4.1 errors using task-
>> tk_status instead of just adding new cases.
> 
> However I'd want to see evidence that the resulting patch has been
> thoroughly tested before submission, because I currently have no way to
> test it myself.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trondmy@kernel.org, trond.myklebust@hammerspace.com


      reply	other threads:[~2026-03-23 13:04 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-09 13:56 [PATCH] pnfs/filelayout: handle data server op_status errors Robert Milkowski
2025-12-09 15:33 ` Trond Myklebust
2026-03-23 13:03   ` Robert Milkowski [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48521545-709C-4ECB-A9C2-9C1DA75DA058@gmail.com \
    --to=rmilkowski@gmail.com \
    --cc=anna@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox