[PATCH] xfs: shut down zoned file systems on writeback errors

* [PATCH] xfs: shut down zoned file systems on writeback errors
@ 2026-06-11  1:53 Yao Sang
  2026-06-11  2:13 ` Darrick J. Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Yao Sang @ 2026-06-11  1:53 UTC (permalink / raw)
  To: linux-xfs; +Cc: cem, Yao Sang

Zoned writeback allocates space from an open zone and advances the
in-memory allocation state before submitting the bio.  The completion
path only records the written blocks and updates the mapping on success.
If the write fails, XFS cannot tell how far the device write pointer
advanced and cannot safely roll the open zone accounting back.

This was observed while investigating xfs/643 and xfs/646 on an external
ZNS realtime device. A writeback error after consuming space from an
open zone left later writers waiting for open-zone or GC progress that
could not happen. xfs/643 exposed this through the GC defragmentation
path, while xfs/646 exposed the same failure mode through the
truncate/EOF-zeroing space wait path.

There is no local recovery path in ioend completion that can restore a
consistent zoned allocation state after the device has rejected the
write. Treat writeback errors for zoned inodes as fatal and force a
file system shutdown from the ioend completion path. The existing
shutdown path wakes zoned allocation waiters and makes future space
waits return -EIO instead of leaving tasks stuck waiting for progress.

Signed-off-by: Yao Sang <sangyao@kylinos.cn>
---
Zoned writeback allocates space from an open zone before submitting the
bio.  If the device later rejects the write, XFS cannot reliably recover
the in-core open-zone allocation state from ioend completion, because it
cannot know whether or how far the device write pointer advanced.

The issue was investigated with xfs/643 and xfs/646 on an external ZNS
realtime device.  Both tests can expose the same failure mode once a
writeback error happens after consuming open-zone space:

- xfs/643 exposes it through the GC defragmentation path.
- xfs/646 exposes it through the truncate/EOF-zeroing space wait path.

Without forcing shutdown, later writers can wait for open-zone or GC
progress that will never arrive. Forcing shutdown wakes the existing
zoned allocation waiters and turns later waits into -EIO.
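
As a rough illustration of the behaviour described above, here is a
userspace toy model, not XFS code.  The names toy_mount,
toy_force_shutdown() and toy_space_wait_step() are hypothetical, and a
real waiter would sleep on a wait queue rather than poll.  It only shows
why a waiter with no space and no shutdown flag never makes progress,
and how setting the flag turns the wait into -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mount-wide zoned allocation state. */
struct toy_mount {
	bool		shutdown;	/* set once by toy_force_shutdown() */
	unsigned int	free_blocks;	/* blocks still available to writers */
};

/* Mark the file system dead; the kernel would also wake sleeping waiters. */
static void toy_force_shutdown(struct toy_mount *mp)
{
	mp->shutdown = true;
}

/*
 * One step of a space wait.  Returns 0 when space was reserved, -EIO once
 * the file system is shut down, and -EAGAIN when the caller would have to
 * keep waiting for open-zone or GC progress.
 */
static int toy_space_wait_step(struct toy_mount *mp, unsigned int count)
{
	assert(count > 0);

	if (mp->shutdown)
		return -EIO;
	if (mp->free_blocks < count)
		return -EAGAIN;
	mp->free_blocks -= count;
	return 0;
}
```

In this model a caller that keeps seeing -EAGAIN while nothing ever
refills free_blocks corresponds to the stuck writers; after
toy_force_shutdown() the same call returns -EIO and the caller can bail
out.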

Tested with:
- xfstests: xfs/642, xfs/643, xfs/644, xfs/646, xfs/647

 fs/xfs/xfs_aops.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1a82cf625a08..4bcb47da5989 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -139,6 +139,16 @@ xfs_end_ioend_write(
 	 */
 	error = blk_status_to_errno(ioend->io_bio.bi_status);
 	if (unlikely(error)) {
+		/*
+		 * Zoned writes update the in-core open zone accounting before
+		 * I/O submission.  A failed write leaves that state inconsistent,
+		 * so shut down the filesystem instead of letting later writers
+		 * wait forever for open zone space to become available.
+		 */
+		if (is_zoned) {
+			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
+			goto done;
+		}
 		if (ioend->io_flags & IOMAP_IOEND_SHARED) {
 			ASSERT(!is_zoned);
 			xfs_reflink_cancel_cow_range(ip, offset, size, true);
-- 
2.25.1

* Re: [PATCH] xfs: shut down zoned file systems on writeback errors
  2026-06-11  1:53 [PATCH] xfs: shut down zoned file systems on writeback errors Yao Sang
@ 2026-06-11  2:13 ` Darrick J. Wong
  2026-06-11  5:52   ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Darrick J. Wong @ 2026-06-11  2:13 UTC (permalink / raw)
  To: Yao Sang; +Cc: linux-xfs, cem, Christoph Hellwig

On Thu, Jun 11, 2026 at 09:53:05AM +0800, Yao Sang wrote:
> Zoned writeback allocates space from an open zone and advances the
> in-memory allocation state before submitting the bio.  The completion
> path only records the written blocks and updates the mapping on success.
> If the write fails, XFS cannot tell how far the device write pointer
> advanced and cannot safely roll the open zone accounting back.
> 
> This was observed while investigating xfs/643 and xfs/646 on an external
> ZNS realtime device. A writeback error after consuming space from an
> open zone left later writers waiting for open-zone or GC progress that
> could not happen. xfs/643 exposed this through the GC defragmentation
> path, while xfs/646 exposed the same failure mode through the
> truncate/EOF-zeroing space wait path.
> 
> There is no local recovery path in ioend completion that can restore a
> consistent zoned allocation state after the device has rejected the
> write. Treat writeback errors for zoned inodes as fatal and force a
> file system shutdown from the ioend completion path. The existing
> shutdown path wakes zoned allocation waiters and makes future space
> waits return -EIO instead of leaving tasks stuck waiting for progress.

File writeback errors taking down the entire filesystem?  That's pretty
drastic. :(

If writes to a zone fail, do subsequent writes to that zone also fail?

Is it possible either to requeue the failed writes to another zone?  Or
at least offline the zone and wake up the writers to convey the EIO?

(hch might have better ideas...)

--D

> Signed-off-by: Yao Sang <sangyao@kylinos.cn>
> ---
> Zoned writeback allocates space from an open zone before submitting the
> bio.  If the device later rejects the write, XFS cannot reliably recover
> the in-core open-zone allocation state from ioend completion, because it
> cannot know whether or how far the device write pointer advanced.
> 
> The issue was investigated with xfs/643 and xfs/646 on an external ZNS
> realtime device.  Both tests can expose the same failure mode once a
> writeback error happens after consuming open-zone space:
> 
> - xfs/643 exposes it through the GC defragmentation path.
> - xfs/646 exposes it through the truncate/EOF-zeroing space wait path.
> 
> Without forcing shutdown, later writers can wait for open-zone or GC
> progress that will never arrive. Forcing shutdown wakes the existing
> zoned allocation waiters and turns later waits into -EIO.
> 
> Tested with:
> - xfstests: xfs/642, xfs/643, xfs/644,
>   xfs/646, xfs/647 
> 
>  fs/xfs/xfs_aops.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 1a82cf625a08..4bcb47da5989 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -139,6 +139,16 @@ xfs_end_ioend_write(
>  	 */
>  	error = blk_status_to_errno(ioend->io_bio.bi_status);
>  	if (unlikely(error)) {
> +		/*
> +		 * Zoned writes update the in-core open zone accounting before
> +		 * I/O submission.  A failed write leaves that state inconsistent,
> +		 * so shut down the filesystem instead of letting later writers
> +		 * wait forever for open zone space to become available.
> +		 */
> +		if (is_zoned) {
> +			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> +			goto done;
> +		}
>  		if (ioend->io_flags & IOMAP_IOEND_SHARED) {
>  			ASSERT(!is_zoned);
>  			xfs_reflink_cancel_cow_range(ip, offset, size, true);
> -- 
> 2.25.1
> 
> 

* Re: [PATCH] xfs: shut down zoned file systems on writeback errors
  2026-06-11  2:13 ` Darrick J. Wong
@ 2026-06-11  5:52   ` Christoph Hellwig
  2026-06-11 10:37     ` Yao Sang
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2026-06-11  5:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Yao Sang, linux-xfs, cem, Christoph Hellwig, Damien Le Moal

On Wed, Jun 10, 2026 at 07:13:03PM -0700, Darrick J. Wong wrote:
> > file system shutdown from the ioend completion path. The existing
> > shutdown path wakes zoned allocation waiters and makes future space
> > waits return -EIO instead of leaving tasks stuck waiting for progress.
> 
> File writeback errors taking down the entire filesystem?  That's pretty
> drastic. :(

Right now that is the only sane thing we can do, because...
(we should probably have a different shutdown code for it, and add
similar checks to the GC code).

> If writes to a zone fail, do subsequent writes to that zone also fail?

Unless it is a transient, retryable error, which should not bubble up
to the file system: yes.

> 
> Is it possible either to requeue the failed writes to another zone?  Or
> at least offline the zone and wake up the writers to convey the EIO?

... what would your model for errors be?  Right now the existing
devices we've dealt with will not return errors until they are really
dead, which has been normal for devices for a while.  There can be
transient errors from the device or transport, but the drivers / block
layer are supposed to deal with these.

Yao: can you explain what errors you are seeing?  I.e. full nvme
dmesg output that tells us the status code?  Or are you just doing
error injection for testing?
Either way we should probably document our error handling model and
back it up with tests injecting errors and verifying we conform to
this model.

* Re: [PATCH] xfs: shut down zoned file systems on writeback errors
  2026-06-11  5:52   ` Christoph Hellwig
@ 2026-06-11 10:37     ` Yao Sang
  0 siblings, 0 replies; 4+ messages in thread
From: Yao Sang @ 2026-06-11 10:37 UTC (permalink / raw)
  To: hch; +Cc: cem, djwong, dlemoal, linux-xfs, sangyao

On Wed, Jun 10, 2026 at 10:52:34PM -0700, Christoph Hellwig wrote:
> Yao: can you explain what errors you are seeing?  I.e. full nvme
> dmesg output that tells us the status code?  Or are you just doing
> error injection for testing?
> Either way we should probably document our error handling model and
> back it up with tests injecting errors and verifying we conform to
> this model.

The original failure was not found by fault injection.

I first hit it while investigating xfs/643 and xfs/646 on an external
ZNS realtime device using an NVMe ZNS multipath namespace.  Tracing
showed regular REQ_OP_WRITE I/O being submitted to sequential zones
through the multipath head.

That turned out to be caused by the NVMe multipath head missing the
block layer per-zone state.  The head had zoned limits stacked from the
path, but the head disk had not gone through blk_revalidate_disk_zones(),
so bdev_zone_is_seq() could treat sequential zones as non-sequential.
The NVMe-side fix is here:

https://lore.kernel.org/all/20260610032846.54044-1-sangyao@kylinos.cn/

With that bug present, the lower NVMe device reported:

nvme0c0n1: I/O Cmd(0x7d) @ LBA 32768, 128 blocks, I/O Error (sct 0x1 / sc 0xb9) DNR

After fixing the NVMe multipath bug, I used EIO injection to reproduce
the same XFS-side failure mode in a controlled way.  So the injection was
used to verify the behavior after a non-retryable lower-layer error had
already reached XFS, not to discover the original problem.

For the XFS side, ioend completion does not seem to have enough
information to recover locally.  The in-core open-zone allocation state
has already been advanced before I/O submission, but after the device
rejects the write, XFS cannot know whether or how far the device write
pointer advanced.  Therefore this patch shuts down the filesystem so the
existing shutdown path wakes zoned allocation waiters and makes later
waits return -EIO.
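
To make the argument above concrete, here is a small userspace sketch.
It is not the XFS implementation: the struct toy_zone fields and the
toy_* helpers are hypothetical names chosen for illustration.  The model
lets the test code see the device write pointer, which the file system
cannot do after a failed write, purely to show how the in-core counters
and the device state can end up disagreeing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one open zone. */
struct toy_zone {
	uint32_t capacity;	/* usable blocks in the zone */
	uint32_t allocated;	/* in-core: blocks handed out to writers */
	uint32_t written;	/* in-core: blocks confirmed by completions */
	uint32_t device_wp;	/* device write pointer, unknown to the FS */
};

/* Allocation advances the in-core state before any I/O is submitted. */
static bool toy_zone_alloc(struct toy_zone *z, uint32_t count, uint32_t *start)
{
	if (z->capacity - z->allocated < count)
		return false;
	*start = z->allocated;
	z->allocated += count;
	return true;
}

/*
 * Model the device: "landed" of the "count" blocks reach the media before
 * the command ends.  Returns true only if the whole write succeeded.
 */
static bool toy_device_write(struct toy_zone *z, uint32_t count,
			     uint32_t landed)
{
	assert(landed <= count);
	z->device_wp += landed;
	return landed == count;
}

/* Completion only records written blocks on success. */
static void toy_zone_complete(struct toy_zone *z, uint32_t count, bool ok)
{
	if (ok)
		z->written += count;
}

/* Blocks that were allocated but never confirmed as written. */
static uint32_t toy_zone_unaccounted(const struct toy_zone *z)
{
	return z->allocated - z->written;
}
```

With a 16-block zone, one successful 4-block write followed by a 4-block
write of which only 2 blocks land leaves allocated = 8, written = 4 and
device_wp = 6.  Rolling allocated back to 4 or leaving it at 8 would
both disagree with the device, and the completion handler has no way to
learn the value 6.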

I agree this should be treated as part of a broader zoned writeback error
handling model, not just as a local ioend fix.  I would like to help work
on that model, including the GC path, whether we need a more specific
shutdown reason, and fault-injection coverage for the lower-layer EIO
case.

Thanks,
Yao
