* [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
@ 2025-05-29 2:50 Yafang Shao
2025-05-29 4:25 ` Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Yafang Shao @ 2025-05-29 2:50 UTC (permalink / raw)
To: Christian Brauner, djwong, cem; +Cc: linux-xfs, Linux-Fsdevel
Hello,
Recently, we encountered data loss when using XFS on an HDD with bad
blocks. After investigation, we determined that the issue was related
to writeback errors. The details are as follows:
1. Process-A writes data to a file using buffered I/O and completes
without errors.
2. However, during the writeback of the dirtied pagecache pages, an
I/O error occurs, causing the data to fail to reach the disk.
3. Later, the pagecache pages may be reclaimed due to memory pressure,
since they are already clean pages.
4. When Process-B reads the same file, it retrieves zeroed data from
the bad blocks, as the original data was never successfully written
(IOMAP_UNWRITTEN).
We reviewed the related discussion [0] and confirmed that this is a
known writeback error issue. While using fsync() after buffered
write() could mitigate the problem, this approach is impractical for
our services.
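For reference, a minimal sketch of the fsync()-based mitigation (illustrative only; write_checked() below is a hypothetical helper, not our production code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Buffered write followed by an explicit flush.  write() alone can succeed
 * even though writeback later fails; fsync() is where the writeback error
 * gets reported back to the application.
 */
static int write_checked(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len)
                goto fail;
        if (fsync(fd) < 0) {
                fprintf(stderr, "fsync failed: %s\n", strerror(errno));
                goto fail;
        }
        return close(fd);
fail:
        close(fd);
        return -1;
}

Doing this (and handling the failure) in every writer across our services is the part we consider impractical.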
Instead, we propose introducing configurable options to notify users
of writeback errors immediately and prevent further operations on
affected files or disks. Possible solutions include:
- Option A: Immediately shut down the filesystem upon writeback errors.
- Option B: Mark the affected file as inaccessible if a writeback error occurs.
These options could be controlled via mount options or sysfs
configurations. Both solutions would be preferable to silently
returning corrupted data, as they ensure users are aware of disk
issues and can take corrective action.
Any suggestions ?
[0] https://lwn.net/Articles/724307/
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 2:50 [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Yafang Shao
@ 2025-05-29 4:25 ` Darrick J. Wong
2025-05-29 5:55 ` Yafang Shao
` (2 more replies)
2025-05-29 4:36 ` Dave Chinner
2025-06-02 5:31 ` Christoph Hellwig
2 siblings, 3 replies; 36+ messages in thread
From: Darrick J. Wong @ 2025-05-29 4:25 UTC (permalink / raw)
To: Yafang Shao; +Cc: Christian Brauner, cem, linux-xfs, Linux-Fsdevel
On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
> without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
> I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
> the bad blocks, as the original data was never successfully written
> (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.
>
> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> These options could be controlled via mount options or sysfs
> configurations. Both solutions would be preferable to silently
> returning corrupted data, as they ensure users are aware of disk
> issues and can take corrective action.
>
> Any suggestions ?
Option C: report all those write errors (direct and buffered) to a
daemon and let it figure out what it wants to do:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
Yes this is a long term option since it involves adding upcalls from the
pagecache/vfs into the filesystem and out through even more XFS code,
which has to go through its usual rigorous reviews.
But if there's interest then I could move up the timeline on submitting
those since I wasn't going to do much with any of that until 2026.
--D
> [0] https://lwn.net/Articles/724307/
>
> --
> Regards
> Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 2:50 [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Yafang Shao
2025-05-29 4:25 ` Darrick J. Wong
@ 2025-05-29 4:36 ` Dave Chinner
2025-05-29 6:04 ` Yafang Shao
2025-06-02 5:38 ` Christoph Hellwig
2025-06-02 5:31 ` Christoph Hellwig
2 siblings, 2 replies; 36+ messages in thread
From: Dave Chinner @ 2025-05-29 4:36 UTC (permalink / raw)
To: Yafang Shao; +Cc: Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel
On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
> without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
> I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
> the bad blocks, as the original data was never successfully written
> (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.
Really, that's terrible application design. If you aren't checking
that data has been written successfully, then you get to keep all
the broken and/or missing data bits to yourself.
However, with that said, some history.
XFS used to keep pages that had IO errors on writeback dirty so they
would be retried at a later time and couldn't be reclaimed from
memory until they were written. This was historical behaviour from
Irix and designed to handle SAN environments where multipath
fail-over could take several minutes.
In these situations writeback could fail for several attempts before
the storage timed out and came back online. Then the next write
retry would succeed, and everything would be good. Linux never gave
us a specific IO error for this case, so we just had to retry on EIO
and hope that the storage came back eventually.
This is different to traditional Linux writeback behaviour, which is
what is implemented now via iomap. There are good reasons for this
model:
- a filesystem with a dirty page that can't be written and cleaned
  cannot be unmounted.
- having large chunks of memory that cannot be cleaned and
  reclaimed has adverse impact on system performance
- the system can potentially hang if the page cache is dirtied
  beyond write throttling thresholds and then the device is yanked.
  Now none of the dirty memory can be cleaned, and all new writes
  are throttled....
> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.
Go look at /sys/fs/xfs/<dev>/error/metadata/... and configurable
error handling behaviour implemented through this interface.
Essentially, XFS metadata behaves as "retry writes forever and hang on
unmount until write succeeds" by default. i.e. similar to the old
data IO error behaviour. The "hang on unmount" behaviour can be
turned off by /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can
configure different failure handling policies for different types
of IO error. e.g. fail-fast on -ENODEV (e.g. device was unplugged
and is never coming back so shut the filesystem down),
retry-for-a-while on -ENOSPC (e.g. dm-thinp pool has run out of space,
so give some time for the pool to be expanded before shutting down)
and retry-once on -EIO (to avoid random spurious hardware failures
from shutting down the fs) and everything else uses the configured
default behaviour....
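For concreteness, here is a rough sketch of driving those knobs from a small C helper (knob names as documented in Documentation/admin-guide/xfs.rst; the device name and the values below are assumptions, not recommendations):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a single value into a sysfs attribute. */
static int write_sysfs(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        const char *base = "/sys/fs/xfs/sda1/error";    /* assumed device */
        char path[PATH_MAX];

        /* Don't hang unmount behind failed metadata writes. */
        snprintf(path, sizeof(path), "%s/fail_at_unmount", base);
        write_sysfs(path, "1");

        /* Retry -EIO metadata writes once before giving up. */
        snprintf(path, sizeof(path), "%s/metadata/EIO/max_retries", base);
        write_sysfs(path, "1");

        /* Give a dm-thinp pool five minutes to be expanded on -ENOSPC. */
        snprintf(path, sizeof(path),
                 "%s/metadata/ENOSPC/retry_timeout_seconds", base);
        write_sysfs(path, "300");

        return 0;
}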
There's also good reason the sysfs error hierarchy is structured the
way it is - it leaves open the option for expanding the error
handling policies to different IO types (i.e. data and metadata). It
even allows different policies for different types of data devices
(e.g. RT vs data device policies).
So, go look at how the error configuration code in XFS is handled,
consider extending that to /sys/fs/xfs/<dev>/error/data/.... to
allow different error handling policies for different types of
data writeback IO errors.
Then you'll need to implement those policies through the XFS and
iomap IO paths...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 4:25 ` Darrick J. Wong
@ 2025-05-29 5:55 ` Yafang Shao
2025-05-30 5:17 ` Christian Brauner
2025-06-02 5:32 ` Christoph Hellwig
2 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2025-05-29 5:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christian Brauner, cem, linux-xfs, Linux-Fsdevel
On Thu, May 29, 2025 at 12:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Hello,
> >
> > Recently, we encountered data loss when using XFS on an HDD with bad
> > blocks. After investigation, we determined that the issue was related
> > to writeback errors. The details are as follows:
> >
> > 1. Process-A writes data to a file using buffered I/O and completes
> > without errors.
> > 2. However, during the writeback of the dirtied pagecache pages, an
> > I/O error occurs, causing the data to fail to reach the disk.
> > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > since they are already clean pages.
> > 4. When Process-B reads the same file, it retrieves zeroed data from
> > the bad blocks, as the original data was never successfully written
> > (IOMAP_UNWRITTEN).
> >
> > We reviewed the related discussion [0] and confirmed that this is a
> > known writeback error issue. While using fsync() after buffered
> > write() could mitigate the problem, this approach is impractical for
> > our services.
> >
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> >
> > These options could be controlled via mount options or sysfs
> > configurations. Both solutions would be preferable to silently
> > returning corrupted data, as they ensure users are aware of disk
> > issues and can take corrective action.
> >
> > Any suggestions ?
>
> Option C: report all those write errors (direct and buffered) to a
> daemon and let it figure out what it wants to do:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
>
> Yes this is a long term option since it involves adding upcalls from the
> pagecache/vfs into the filesystem and out through even more XFS code,
> which has to go through its usual rigorous reviews.
>
> But if there's interest then I could move up the timeline on submitting
> those since I wasn't going to do much with any of that until 2026.
This would be very helpful. While it might take some time, it's better
to address it now than never. Please proceed with this when you have
availability.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 4:36 ` Dave Chinner
@ 2025-05-29 6:04 ` Yafang Shao
2025-06-02 5:38 ` Christoph Hellwig
1 sibling, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2025-05-29 6:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel
On Thu, May 29, 2025 at 12:36 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Hello,
> >
> > Recently, we encountered data loss when using XFS on an HDD with bad
> > blocks. After investigation, we determined that the issue was related
> > to writeback errors. The details are as follows:
> >
> > 1. Process-A writes data to a file using buffered I/O and completes
> > without errors.
> > 2. However, during the writeback of the dirtied pagecache pages, an
> > I/O error occurs, causing the data to fail to reach the disk.
> > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > since they are already clean pages.
> > 4. When Process-B reads the same file, it retrieves zeroed data from
> > the bad blocks, as the original data was never successfully written
> > (IOMAP_UNWRITTEN).
> >
> > We reviewed the related discussion [0] and confirmed that this is a
> > known writeback error issue. While using fsync() after buffered
> > write() could mitigate the problem, this approach is impractical for
> > our services.
>
> Really, that's terrible application design. If you aren't checking
> that data has been written successfully, then you get to keep all
> the broken and/or missing data bits to yourself.
It’s difficult to justify this.
>
> However, with that said, some history.
>
> XFS used to keep pages that had IO errors on writeback dirty so they
> would be retried at a later time and couldn't be reclaimed from
> memory until they were written. This was historical behaviour from
> Irix and designed to handle SAN environments where multipath
> fail-over could take several minutes.
>
> In these situations writeback could fail for several attempts before
> the storage timed out and came back online. Then the next write
> retry would succeed, and everything would be good. Linux never gave
> us a specific IO error for this case, so we just had to retry on EIO
> and hope that the storage came back eventually.
>
> This is different to traditional Linux writeback behaviour, which is
> what is implemented now via iomap. There are good reasons for this
> model:
>
> - a filesystem with a dirty page that can't be written and cleaned
> cannot be unmounted.
>
> - having large chunks of memory that cannot be cleaned and
> reclaimed has adverse impact on system performance
>
> - the system can potentially hang if the page cache is dirtied
> beyond write throttling thresholds and then the device is yanked.
> Now none of the dirty memory can be cleaned, and all new writes
> are throttled....
I previously considered whether we could avoid clearing PG_writeback
for these pages. To handle unwritten pagecache pages more safely, we
could maintain their PG_writeback status and introduce a new
PG_write_error flag. This would explicitly mark pages that failed disk
writes, allowing the reclaim mechanism to skip them and avoid
potential deadlocks.
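Something like the following purely hypothetical sketch; neither PG_write_error nor folio_test_write_error() exists today, only folio_test_writeback() is real:

#include <linux/mm.h>

/*
 * Hypothetical: a folio that failed writeback stays marked (still under
 * writeback, or flagged with the proposed PG_write_error) so that reclaim
 * treats it as unreclaimable instead of silently dropping the only good
 * copy of the data.
 */
static inline bool folio_stuck_on_write_error(struct folio *folio)
{
        return folio_test_writeback(folio) ||
               folio_test_write_error(folio);  /* proposed new flag */
}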
>
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> Go look at /sys/fs/xfs/<dev>/error/metadata/... and configurable
> error handling behaviour implemented through this interface.
>
> Essentially, XFS metadata behaves as "retry writes forever and hang on
> unmount until write succeeds" by default. i.e. similar to the old
> data IO error behaviour. The "hang on unmount" behaviour can be
> turned off by /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can
> configure different failure handling policies for different types
> of IO error. e.g. fail-fast on -ENODEV (e.g. device was unplugged
> and is never coming back so shut the filesystem down),
> retry-for-a-while on -ENOSPC (e.g. dm-thinp pool has run out of space,
> so give some time for the pool to be expanded before shutting down)
> and retry-once on -EIO (to avoid random spurious hardware failures
> from shutting down the fs) and everything else uses the configured
> default behaviour....
Thank you for your clear guidance and detailed explanation.
>
> There's also good reason the sysfs error hierarchy is structured the
> way it is - it leaves open the option for expanding the error
> handling policies to different IO types (i.e. data and metadata). It
> even allows different policies for different types of data devices
> (e.g. RT vs data device policies).
>
> So, go look at how the error configuration code in XFS is handled,
> consider extending that to /sys/fs/xfs/<dev>/error/data/.... to
> allow different error handling policies for different types of
> data writeback IO errors.
That aligns perfectly with our expectations.
>
> Then you'll need to implement those policies through the XFS and
> iomap IO paths...
I will analyze how to implement this effectively.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 4:25 ` Darrick J. Wong
2025-05-29 5:55 ` Yafang Shao
@ 2025-05-30 5:17 ` Christian Brauner
2025-05-30 15:38 ` Darrick J. Wong
2025-06-02 5:32 ` Christoph Hellwig
2 siblings, 1 reply; 36+ messages in thread
From: Christian Brauner @ 2025-05-30 5:17 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Hello,
> >
> > Recently, we encountered data loss when using XFS on an HDD with bad
> > blocks. After investigation, we determined that the issue was related
> > to writeback errors. The details are as follows:
> >
> > 1. Process-A writes data to a file using buffered I/O and completes
> > without errors.
> > 2. However, during the writeback of the dirtied pagecache pages, an
> > I/O error occurs, causing the data to fail to reach the disk.
> > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > since they are already clean pages.
> > 4. When Process-B reads the same file, it retrieves zeroed data from
> > the bad blocks, as the original data was never successfully written
> > (IOMAP_UNWRITTEN).
> >
> > We reviewed the related discussion [0] and confirmed that this is a
> > known writeback error issue. While using fsync() after buffered
> > write() could mitigate the problem, this approach is impractical for
> > our services.
> >
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> >
> > These options could be controlled via mount options or sysfs
> > configurations. Both solutions would be preferable to silently
> > returning corrupted data, as they ensure users are aware of disk
> > issues and can take corrective action.
> >
> > Any suggestions ?
>
> Option C: report all those write errors (direct and buffered) to a
> daemon and let it figure out what it wants to do:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
>
> Yes this is a long term option since it involves adding upcalls from the
I hope you don't mean actual usermodehelper upcalls here because we
should not add any new ones. If you just mean a way to call up from a
lower layer than that's obviously fine.
Fwiw, have you considered building this on top of a fanotify extension
instead of inventing your own mechanism for this?
> pagecache/vfs into the filesystem and out through even more XFS code,
> which has to go through its usual rigorous reviews.
>
> But if there's interest then I could move up the timeline on submitting
> those since I wasn't going to do much with any of that until 2026.
>
> --D
>
> > [0] https://lwn.net/Articles/724307/
> >
> > --
> > Regards
> > Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-30 5:17 ` Christian Brauner
@ 2025-05-30 15:38 ` Darrick J. Wong
2025-05-31 23:02 ` Dave Chinner
0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2025-05-30 15:38 UTC (permalink / raw)
To: Christian Brauner; +Cc: Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > Hello,
> > >
> > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > blocks. After investigation, we determined that the issue was related
> > > to writeback errors. The details are as follows:
> > >
> > > 1. Process-A writes data to a file using buffered I/O and completes
> > > without errors.
> > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > I/O error occurs, causing the data to fail to reach the disk.
> > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > since they are already clean pages.
> > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > the bad blocks, as the original data was never successfully written
> > > (IOMAP_UNWRITTEN).
> > >
> > > We reviewed the related discussion [0] and confirmed that this is a
> > > known writeback error issue. While using fsync() after buffered
> > > write() could mitigate the problem, this approach is impractical for
> > > our services.
> > >
> > > Instead, we propose introducing configurable options to notify users
> > > of writeback errors immediately and prevent further operations on
> > > affected files or disks. Possible solutions include:
> > >
> > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > >
> > > These options could be controlled via mount options or sysfs
> > > configurations. Both solutions would be preferable to silently
> > > returning corrupted data, as they ensure users are aware of disk
> > > issues and can take corrective action.
> > >
> > > Any suggestions ?
> >
> > Option C: report all those write errors (direct and buffered) to a
> > daemon and let it figure out what it wants to do:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> >
> > Yes this is a long term option since it involves adding upcalls from the
>
> I hope you don't mean actual usermodehelper upcalls here because we
> should not add any new ones. If you just mean a way to call up from a
> lower layer than that's obviously fine.
Correct. The VFS upcalls to XFS on some event, then XFS queues the
event data (or drops it) and waits for userspace to read the queued
events. We're not directly invoking a helper program from deep in the
guts, that's too wild even for me. ;)
> Fwiw, have you considered building this on top of a fanotify extension
> instead of inventing your own mechanism for this?
I have, at various stages of this experiment.
Originally, I was only going to export xfs-specific metadata events
(e.g. this AG's inode btree index is bad) so that the userspace program
(xfs_healer) could initiate a repair against the broken pieces.
At the time I thought it would be fun to experiment with an anonfd file
that emitted jsonp objects so that I could avoid the usual C struct ABI
mess because json is easily parsed into key-value mapping objects in a
lot of languages (that aren't C). It later turned out that formatting
the json is rather more costly than I thought even with seq_bufs, so I
added an alternate format that emits boring C structures.
Having gone back to C structs, it would be possible (and possibly quite
nice) to migrate to fanotify so that I don't have to maintain a bunch of
queuing code. But that can have its own drawbacks, as Ted and I
discovered when we discussed his patches that pushed ext4 error events
through fanotify:
For filesystem metadata events, the fine details of representing that
metadata in a generic interface gets really messy because each
filesystem has a different design. To initiate a repair you need to
know a lot of specifics: which AG has a bad structure, and what
structure within that AG; or which file and what structure under that
file, etc. Ted and Jan Kara and I tried to come up with a reasonably
generic format for all that and didn't succeed; the best I could think
of is:
struct fanotify_event_info_fsmeta_error {
        struct fanotify_event_info_header hdr;
        __u32 mask;                     /* bitmask of objects */
        __u32 what;                     /* union decoder */
        union {
                struct {
                        __u32 gno;      /* shard number if applicable */
                        __u32 pad0[5];
                };
                struct {
                        __u64 ino;      /* affected file */
                        __u32 gen;
                        __u32 pad1[3];
                };
                struct {
                        __u64 diskaddr; /* device media error */
                        __u64 length;
                        __u32 device;
                        __u32 pad2;
                };
        };
        __u64 pad[2];                   /* future expansion */
};
But now we have this gross struct with a union in the ABI, and what
happens when someone wants to add support for a filesystem with even
stranger stuff e.g. btrfs/bcachefs? We could punt in the generic header
and do this instead:
struct fanotify_event_info_fsmeta_error {
        struct fanotify_event_info_header hdr;
        __u32 fstype;                   /* same as statfs::f_type */
        unsigned char data[];           /* good luck with this */
};
and now you just open-cast a pointer to the char array to whatever
fs-specific format you want, but eeeuugh.
The other reason for sticking with an anonfd (so far) is that the kernel
side of xfs_healer is designed to maintain a soft reference to the
xfs_mount object so that the userspace program need not maintain an open
fd on the filesystem, because that prevents unmount. I aim to find a
means for the magic healer fd to gain the ability to reopen the root
directory of the filesystem so that the sysadmin running mount --move
doesn't break the healer.
I think fanotify fixes the "healer pins the mount" problems but I don't
think there's a way to do the reopening thing.
Getting back to the question that opened this thread -- I think regular
file IO errors can be represented with a sequence of
fanotify_event_metadata -> fanotify_event_info_fid ->
fanotify_event_info_range -> fanotify_event_info_error objects in the
fanotify stream. This format I think is easily standardized across
filesystems and can be wired up from iomap without a lot of fuss. But I
don't know how fsnotify event blob chaining works well enough to say for
sure. :/
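For reference, a rough sketch of the consumer side using today's FAN_FS_ERROR interface (Linux >= 5.16; so far ext4 is the main in-tree producer, and the per-range data writeback events described above are still hypothetical):

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[4096] __attribute__((aligned(8)));
        int fd;

        if (argc < 2)
                errx(1, "usage: %s <mountpoint>", argv[0]);

        fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
        if (fd < 0)
                err(1, "fanotify_init");
        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_FS_ERROR, AT_FDCWD, argv[1]) < 0)
                err(1, "fanotify_mark");

        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));
                struct fanotify_event_metadata *md = (void *)buf;

                if (len <= 0)
                        break;
                while (FAN_EVENT_OK(md, len)) {
                        /* Info records are chained after the fixed header. */
                        struct fanotify_event_info_header *ih =
                                (void *)((char *)md + md->metadata_len);
                        char *end = (char *)md + md->event_len;

                        while ((char *)ih < end) {
                                if (ih->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
                                        struct fanotify_event_info_error *e =
                                                (void *)ih;

                                        printf("fs error %d, count %u\n",
                                               e->error, e->error_count);
                                }
                                ih = (void *)((char *)ih + ih->len);
                        }
                        md = FAN_EVENT_NEXT(md, len);
                }
        }
        return 0;
}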
--D
> > pagecache/vfs into the filesystem and out through even more XFS code,
> > which has to go through its usual rigorous reviews.
> >
> > But if there's interest then I could move up the timeline on submitting
> > those since I wasn't going to do much with any of that until 2026.
> >
> > --D
> >
> > > [0] https://lwn.net/Articles/724307/
> > >
> > > --
> > > Regards
> > > Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-30 15:38 ` Darrick J. Wong
@ 2025-05-31 23:02 ` Dave Chinner
2025-06-03 0:03 ` Darrick J. Wong
0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2025-05-31 23:02 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christian Brauner, Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > Hello,
> > > >
> > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > blocks. After investigation, we determined that the issue was related
> > > > to writeback errors. The details are as follows:
> > > >
> > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > without errors.
> > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > since they are already clean pages.
> > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > the bad blocks, as the original data was never successfully written
> > > > (IOMAP_UNWRITTEN).
> > > >
> > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > known writeback error issue. While using fsync() after buffered
> > > > write() could mitigate the problem, this approach is impractical for
> > > > our services.
> > > >
> > > > Instead, we propose introducing configurable options to notify users
> > > > of writeback errors immediately and prevent further operations on
> > > > affected files or disks. Possible solutions include:
> > > >
> > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > >
> > > > These options could be controlled via mount options or sysfs
> > > > configurations. Both solutions would be preferable to silently
> > > > returning corrupted data, as they ensure users are aware of disk
> > > > issues and can take corrective action.
> > > >
> > > > Any suggestions ?
> > >
> > > Option C: report all those write errors (direct and buffered) to a
> > > daemon and let it figure out what it wants to do:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > >
> > > Yes this is a long term option since it involves adding upcalls from the
> >
> > I hope you don't mean actual usermodehelper upcalls here because we
> > should not add any new ones. If you just mean a way to call up from a
> > lower layer than that's obviously fine.
>
> Correct. The VFS upcalls to XFS on some event, then XFS queues the
> event data (or drops it) and waits for userspace to read the queued
> events. We're not directly invoking a helper program from deep in the
> guts, that's too wild even for me. ;)
>
> > Fwiw, have you considered building this on top of a fanotify extension
> > instead of inventing your own mechanism for this?
>
> I have, at various stages of this experiment.
>
> Originally, I was only going to export xfs-specific metadata events
> (e.g. this AG's inode btree index is bad) so that the userspace program
> (xfs_healer) could initiate a repair against the broken pieces.
>
> At the time I thought it would be fun to experiment with an anonfd file
> that emitted jsonp objects so that I could avoid the usual C struct ABI
> mess because json is easily parsed into key-value mapping objects in a
> lot of languages (that aren't C). It later turned out that formatting
> the json is rather more costly than I thought even with seq_bufs, so I
> added an alternate format that emits boring C structures.
>
> Having gone back to C structs, it would be possible (and possibly quite
> nice) to migrate to fanotify so that I don't have to maintain a bunch of
> queuing code. But that can have its own drawbacks, as Ted and I
> discovered when we discussed his patches that pushed ext4 error events
> through fanotify:
>
> For filesystem metadata events, the fine details of representing that
> metadata in a generic interface gets really messy because each
> filesystem has a different design.
Perhaps that is the wrong approach. The event just needs to tell
userspace that there is a metadata error, and the fs specific agent
that receives the event can then pull the failure information from
the filesystem through a fs specific ioctl interface.
i.e. the fanotify event could simply be a unique error, and that
gets passed back into the ioctl to retrieve the fs specific details
of the failure. We might not even need fanotify for this - I suspect
that we could use udev events to punch error ID notifications out to
userspace to trigger a fs specific helper to go find out what went
wrong.
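Purely as an illustration of that split (nothing below exists in the kernel; the struct, the ioctl name and its number are all made up), the daemon side might look like:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/*
 * Hypothetical: an fs-specific error report pulled out via ioctl.  The
 * notification (fanotify or udev) only carries the opaque err_id.
 */
struct xfs_error_report {
        uint64_t err_id;        /* ID delivered in the notification */
        uint32_t err_type;      /* e.g. metadata vs. data writeback */
        uint32_t err_code;      /* errno-style code from the failed IO */
        uint64_t err_agno;      /* AG / shard, if applicable */
        uint64_t err_ino;       /* affected inode, if applicable */
};

/* Hypothetical ioctl number - not a real XFS ioctl. */
#define XFS_IOC_GET_ERROR_REPORT _IOWR('X', 0x7f, struct xfs_error_report)

static int fetch_error_details(int fs_fd, uint64_t err_id,
                               struct xfs_error_report *out)
{
        out->err_id = err_id;   /* the kernel would fill in the rest */
        return ioctl(fs_fd, XFS_IOC_GET_ERROR_REPORT, out);
}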
Keeping unprocessed failures in an internal fs queue isn't a big
deal; it's not a lot of memory, and it can be discarded on unmount.
At that point we know that userspace did not care about the
failure and is not going to be able to query about the failure in
future, so we can just throw it away.
This also allows filesystems to develop such functionality in
parallel, allowing us to find commonality and potential areas for
abstraction as the functionality is developed, rather than trying to
come up with some generic interface that needs to support all
possible things we can think of right now....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 2:50 [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Yafang Shao
2025-05-29 4:25 ` Darrick J. Wong
2025-05-29 4:36 ` Dave Chinner
@ 2025-06-02 5:31 ` Christoph Hellwig
2025-06-03 3:03 ` Yafang Shao
2 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-02 5:31 UTC (permalink / raw)
To: Yafang Shao; +Cc: Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel
On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> These options could be controlled via mount options or sysfs
> configurations. Both solutions would be preferable to silently
> returning corrupted data, as they ensure users are aware of disk
> issues and can take corrective action.
I think option A is the only sane one as there is no way to
actually get this data to disk. Do you have a use case for option B?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 4:25 ` Darrick J. Wong
2025-05-29 5:55 ` Yafang Shao
2025-05-30 5:17 ` Christian Brauner
@ 2025-06-02 5:32 ` Christoph Hellwig
2025-06-03 14:35 ` Darrick J. Wong
2 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-02 5:32 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Yafang Shao, Christian Brauner, cem, linux-xfs, Linux-Fsdevel
On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> Option C: report all those write errors (direct and buffered) to a
> daemon and let it figure out what it wants to do:
What value does the daemon add to the decision chain?
Some form of out of band error reporting is good and extremely useful,
but having it in the critical error handling path is not.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-29 4:36 ` Dave Chinner
2025-05-29 6:04 ` Yafang Shao
@ 2025-06-02 5:38 ` Christoph Hellwig
2025-06-02 23:19 ` Dave Chinner
1 sibling, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-02 5:38 UTC (permalink / raw)
To: Dave Chinner
Cc: Yafang Shao, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Thu, May 29, 2025 at 02:36:30PM +1000, Dave Chinner wrote:
> In these situations writeback could fail for several attempts before
> the storage timed out and came back online. Then the next write
> retry would succeed, and everything would be good. Linux never gave
> us a specific IO error for this case, so we just had to retry on EIO
> and hope that the storage came back eventually.
Linux has had differentiated I/O error codes for quite a while. But
more importantly dm-multipath doesn't just return errors to the upper
layer during failover, but is instead expected to queue the I/O up
until it either has a working path or an internal timeout passed.
In other words, write errors in Linux are in general expected to be
persistent, modulo explicit failfast requests like REQ_NOWAIT.
Which also leaves me a bit puzzled what the XFS metadata retries are
actually trying to solve, especially without even having a corresponding
data I/O version.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-02 5:38 ` Christoph Hellwig
@ 2025-06-02 23:19 ` Dave Chinner
2025-06-03 4:50 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2025-06-02 23:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yafang Shao, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Sun, Jun 01, 2025 at 10:38:07PM -0700, Christoph Hellwig wrote:
> On Thu, May 29, 2025 at 02:36:30PM +1000, Dave Chinner wrote:
> > In these situations writeback could fail for several attempts before
> > the storage timed out and came back online. Then the next write
> > retry would succeed, and everything would be good. Linux never gave
> > us a specific IO error for this case, so we just had to retry on EIO
> > and hope that the storage came back eventually.
>
> Linux has had differentiated I/O error codes for quite a while. But
> more importantly dm-multipath doesn't just return errors to the upper
> layer during failover, but is instead expected to queue the I/O up
> until it either has a working path or an internal timeout passed.
>
> In other words, write errors in Linux are in general expected to be
> persistent, modulo explicit failfast requests like REQ_NOWAIT.
Say what? the blk_errors array defines multiple block layer errors
that are transient in nature - stuff like ENOSPC, ETIMEDOUT, EILSEQ,
ENOLINK, EBUSY - all indicate a transient, retryable error occurred
somewhere in the block/storage layers.
What is permanent about dm-thinp returning ENOSPC to a write
request? Once the pool has been GC'd to free up space or expanded,
the ENOSPC error goes away.
What is permanent about an IO failing with EILSEQ because a t10
checksum failed due to a random bit error detected between the HBA
and the storage device? Retry the IO, and it goes through just fine
without any failures.
These transient error types typically only need a write retry after
some time period to resolve, and that's what XFS does by default.
What makes these sorts of errors persistent in the linux block layer
and hence requiring an immediate filesystem shutdown and complete
denial of service to the storage?
I ask this seriously, because you are effectively saying the linux
storage stack now doesn't behave the same as the model we've been
using for decades. What has changed, and when did it change?
> Which also leaves me a bit puzzled what the XFS metadata retries are
> actually trying to solve, especially without even having a corresponding
> data I/O version.
It's always been for preventing immediate filesystem shutdown when
spurious transient IO errors occur below XFS. Data IO errors don't
cause filesystem shutdowns - errors get propagated to the
application - so there isn't a full system DOS potential for
incorrect classification of data IO errors...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-05-31 23:02 ` Dave Chinner
@ 2025-06-03 0:03 ` Darrick J. Wong
2025-06-06 10:43 ` Christian Brauner
0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2025-06-03 0:03 UTC (permalink / raw)
To: Dave Chinner
Cc: Christian Brauner, Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > Hello,
> > > > >
> > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > blocks. After investigation, we determined that the issue was related
> > > > > to writeback errors. The details are as follows:
> > > > >
> > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > without errors.
> > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > since they are already clean pages.
> > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > the bad blocks, as the original data was never successfully written
> > > > > (IOMAP_UNWRITTEN).
> > > > >
> > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > known writeback error issue. While using fsync() after buffered
> > > > > write() could mitigate the problem, this approach is impractical for
> > > > > our services.
> > > > >
> > > > > Instead, we propose introducing configurable options to notify users
> > > > > of writeback errors immediately and prevent further operations on
> > > > > affected files or disks. Possible solutions include:
> > > > >
> > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > >
> > > > > These options could be controlled via mount options or sysfs
> > > > > configurations. Both solutions would be preferable to silently
> > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > issues and can take corrective action.
> > > > >
> > > > > Any suggestions ?
> > > >
> > > > Option C: report all those write errors (direct and buffered) to a
> > > > daemon and let it figure out what it wants to do:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > >
> > > > Yes this is a long term option since it involves adding upcalls from the
> > >
> > > I hope you don't mean actual usermodehelper upcalls here because we
> > > should not add any new ones. If you just mean a way to call up from a
> > > lower layer than that's obviously fine.
> >
> > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > event data (or drops it) and waits for userspace to read the queued
> > events. We're not directly invoking a helper program from deep in the
> > guts, that's too wild even for me. ;)
> >
> > > Fwiw, have you considered building this on top of a fanotify extension
> > > instead of inventing your own mechanism for this?
> >
> > I have, at various stages of this experiment.
> >
> > Originally, I was only going to export xfs-specific metadata events
> > (e.g. this AG's inode btree index is bad) so that the userspace program
> > (xfs_healer) could initiate a repair against the broken pieces.
> >
> > At the time I thought it would be fun to experiment with an anonfd file
> > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > mess because json is easily parsed into key-value mapping objects in a
> > lot of languages (that aren't C). It later turned out that formatting
> > the json is rather more costly than I thought even with seq_bufs, so I
> > added an alternate format that emits boring C structures.
> >
> > Having gone back to C structs, it would be possible (and possibly quite
> > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > queuing code. But that can have its own drawbacks, as Ted and I
> > discovered when we discussed his patches that pushed ext4 error events
> > through fanotify:
> >
> > For filesystem metadata events, the fine details of representing that
> > metadata in a generic interface gets really messy because each
> > filesystem has a different design.
>
> Perhaps that is the wrong approach. The event just needs to tell
> userspace that there is a metadata error, and the fs specific agent
> that receives the event can then pull the failure information from
> the filesystem through a fs specific ioctl interface.
>
> i.e. the fanotify event could simply be a unique error, and that
> gets passed back into the ioctl to retrieve the fs specific details
> of the failure. We might not even need fanotify for this - I suspect
> that we could use udev events to punch error ID notifications out to
> userspace to trigger a fs specific helper to go find out what went
> wrong.
I'm not sure if you're addressing me or brauner, but I think it would be
even simpler to retain the current design where events are queued to our
special xfs anonfd and read out by userspace. Using fanotify as a "door
bell" to go look at another fd is ... basically poll() but far more
complicated than it ought to be. Pounding udev with events can result
in userspace burning a lot of energy walking the entire rule chain.
> Keeping unprocessed failures in an internal fs queue isn't a big
> deal; it's not a lot of memory, and it can be discarded on unmount.
> At that point we know that userspace did not care about the
> failure and is not going to be able to query about the failure in
> future, so we can just throw it away.
>
> This also allows filesystems to develop such functionality in
> parallel, allowing us to find commonality and potential areas for
> abstraction as the functionality is developed, rather than trying to
> come up with some generic interface that needs to support all
> possible things we can think of right now....
Agreed. I don't think Ted or Jan were enthusiastic about trying to make
a generic fs metadata event descriptor either.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-02 5:31 ` Christoph Hellwig
@ 2025-06-03 3:03 ` Yafang Shao
2025-06-03 3:13 ` Matthew Wilcox
0 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2025-06-03 3:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel
On Mon, Jun 2, 2025 at 1:31 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> >
> > These options could be controlled via mount options or sysfs
> > configurations. Both solutions would be preferable to silently
> > returning corrupted data, as they ensure users are aware of disk
> > issues and can take corrective action.
>
> I think option A is the only sane one as there is no way to
> actually get this data to disk. Do you have a use case for option B?
We want to preserve disk functionality despite a few bad sectors.
Option A fails us by declaring the entire disk unusable upon
encountering bad blocks, an overly restrictive policy that wastes
healthy storage capacity.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 3:03 ` Yafang Shao
@ 2025-06-03 3:13 ` Matthew Wilcox
2025-06-03 3:21 ` Yafang Shao
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-06-03 3:13 UTC (permalink / raw)
To: Yafang Shao
Cc: Christoph Hellwig, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 03, 2025 at 11:03:40AM +0800, Yafang Shao wrote:
> We want to preserve disk functionality despite a few bad sectors. The
> option A fails by declaring the entire disk unusable upon
> encountering bad blocks—an overly restrictive policy that wastes
> healthy storage capacity.
What kind of awful 1980s quality storage are you using that doesn't
remap bad sectors on write?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 3:13 ` Matthew Wilcox
@ 2025-06-03 3:21 ` Yafang Shao
2025-06-03 3:26 ` Matthew Wilcox
0 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2025-06-03 3:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 3, 2025 at 11:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jun 03, 2025 at 11:03:40AM +0800, Yafang Shao wrote:
> > We want to preserve disk functionality despite a few bad sectors. The
> > option A fails by declaring the entire disk unusable upon
> > encountering bad blocks—an overly restrictive policy that wastes
> > healthy storage capacity.
>
> What kind of awful 1980s quality storage are you using that doesn't
> remap bad sectors on write?
Could you please explain why a writeback error still occurred if the
bad sector remapping function is working properly?
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 3:21 ` Yafang Shao
@ 2025-06-03 3:26 ` Matthew Wilcox
2025-06-03 3:50 ` Yafang Shao
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-06-03 3:26 UTC (permalink / raw)
To: Yafang Shao
Cc: Christoph Hellwig, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 03, 2025 at 11:21:46AM +0800, Yafang Shao wrote:
> On Tue, Jun 3, 2025 at 11:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jun 03, 2025 at 11:03:40AM +0800, Yafang Shao wrote:
> > > We want to preserve disk functionality despite a few bad sectors. The
> > > option A fails by declaring the entire disk unusable upon
> > > encountering bad blocks—an overly restrictive policy that wastes
> > > healthy storage capacity.
> >
> > What kind of awful 1980s quality storage are you using that doesn't
> > remap bad sectors on write?
>
> Could you please explain why a writeback error still occurred if the
> bad sector remapping function is working properly?
It wouldn't. Unless you're using something ancient or really really
cheap, getting a writeback error means that the bad block remapping
area is full. You should be able to use SMART (or similar) to retire
hardware before it gets to that state.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 3:26 ` Matthew Wilcox
@ 2025-06-03 3:50 ` Yafang Shao
2025-06-03 4:40 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2025-06-03 3:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 3, 2025 at 11:26 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jun 03, 2025 at 11:21:46AM +0800, Yafang Shao wrote:
> > On Tue, Jun 3, 2025 at 11:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jun 03, 2025 at 11:03:40AM +0800, Yafang Shao wrote:
> > > > We want to preserve disk functionality despite a few bad sectors. The
> > > > option A fails by declaring the entire disk unusable upon
> > > > encountering bad blocks—an overly restrictive policy that wastes
> > > > healthy storage capacity.
> > >
> > > What kind of awful 1980s quality storage are you using that doesn't
> > > remap bad sectors on write?
> >
> > Could you please explain why a writeback error still occurred if the
> > bad sector remapping function is working properly?
>
> It wouldn't. Unless you're using something ancient or really really
> cheap,
The drive in question is a Western Digital HGST Ultrastar
HUH721212ALE600 12TB HDD.
The price information is unavailable to me;-)
> getting a writeback error means that the bad block remapping
> area is full.
We have confirmed there are still available remapping sectors, but the
reallocation operation still failed.
> You should be able to use SMART (or similar) to retire
> hardware before it gets to that state.
>
We are always using SMART to do this kind of check.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 3:50 ` Yafang Shao
@ 2025-06-03 4:40 ` Christoph Hellwig
2025-06-03 5:17 ` Damien Le Moal
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-03 4:40 UTC (permalink / raw)
To: Yafang Shao
Cc: Matthew Wilcox, Christoph Hellwig, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel, Damien Le Moal
On Tue, Jun 03, 2025 at 11:50:58AM +0800, Yafang Shao wrote:
>
> The drive in question is a Western Digital HGST Ultrastar
> HUH721212ALE600 12TB HDD.
> The price information is unavailable to me;-)
Unless you are doing something funky like setting a crazy CDL policy
it should not randomly fail writes. Can you post the dmesg including
the sense data that the SCSI code should print in this case?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-02 23:19 ` Dave Chinner
@ 2025-06-03 4:50 ` Christoph Hellwig
2025-06-03 22:05 ` Dave Chinner
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-03 4:50 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Yafang Shao, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel
On Tue, Jun 03, 2025 at 09:19:10AM +1000, Dave Chinner wrote:
> > In other words, write errors in Linux are in general expected to be
> > persistent, modulo explicit failfast requests like REQ_NOWAIT.
>
> Say what? the blk_errors array defines multiple block layer errors
> that are transient in nature - stuff like ENOSPC, ETIMEDOUT, EILSEQ,
> ENOLINK, EBUSY - all indicate a transient, retryable error occurred
> somewhere in the block/storage layers.
Let's use the block layer codes reported all the way up to the file
systems and their descriptions instead of the errnos they are
mapped to for compatibility. The above would be in order:
[BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
[BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
[BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" },
[BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
> What is permanent about dm-thinp returning ENOSPC to a write
> request? Once the pool has been GC'd to free up space or expanded,
> the ENOSPC error goes away.
Everything. ENOSPC means there is no space. There might be space at
some indeterminate point in the future, but if the layer just needs to
GC, it must not report the error.
> What is permanent about an IO failing with EILSEQ because a t10
> checksum failed due to a random bit error detected between the HBA
> and the storage device? Retry the IO, and it goes through just fine
> without any failures.
Normally it means your checksum was wrong. If you have bit errors
in the cable they will show up again, maybe not on the next I/O
but soon.
> These transient error types typically only need a write retry after
> some time period to resolve, and that's what XFS does by default.
> What makes these sorts of errors persistent in the linux block layer
> and hence requiring an immediate filesystem shutdown and complete
> denial of service to the storage?
>
> I ask this seriously, because you are effectively saying the linux
> storage stack now doesn't behave the same as the model we've been
> using for decades. What has changed, and when did it change?
Hey, you can retry. You're unlikely to improve the situation, though;
you'll just keep deferring the inevitable shutdown.
> > Which also leaves me a bit puzzled what the XFS metadata retries are
> > actually trying to solve, especially without even having a corresponding
> > data I/O version.
>
> It's always been for preventing immediate filesystem shutdown when
> spurious transient IO errors occur below XFS. Data IO errors don't
> cause filesystem shutdowns - errors get propagated to the
> application - so there isn't a full system DOS potential for
> incorrect classification of data IO errors...
Except, as we see in this thread, for a fairly common use case (buffered
I/O without fsync) they don't. And I agree with you that this is not
how you write applications that care about data integrity - but the
entire rest of the system, and just about every common utility, is
written that way.
And even applications that fsync won't see your fancy error code. The
only things stored in the address_space for fsync to catch are EIO and
ENOSPC.
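(To make that concrete, a minimal userspace sketch - the file path below is
hypothetical: whatever the block layer actually reported, a buffered writer
that does fsync() only ever sees a single errno, in practice EIO or ENOSPC
for writeback failures.)

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[4096] = { 0 };
                int fd = open("/mnt/data/file", O_WRONLY | O_CREAT, 0644);

                if (fd < 0)
                        return 1;
                if (write(fd, buf, sizeof(buf)) < 0)    /* buffered write rarely fails */
                        perror("write");
                if (fsync(fd) < 0)                      /* writeback errors surface here */
                        fprintf(stderr, "fsync: %s\n", strerror(errno));
                return close(fd) ? 1 : 0;
        }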
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 4:40 ` Christoph Hellwig
@ 2025-06-03 5:17 ` Damien Le Moal
2025-06-03 5:54 ` Yafang Shao
0 siblings, 1 reply; 36+ messages in thread
From: Damien Le Moal @ 2025-06-03 5:17 UTC (permalink / raw)
To: Christoph Hellwig, Yafang Shao
Cc: Matthew Wilcox, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel, Damien Le Moal
On 2025/06/03 13:40, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 11:50:58AM +0800, Yafang Shao wrote:
>>
>> The drive in question is a Western Digital HGST Ultrastar
>> HUH721212ALE600 12TB HDD.
>> The price information is unavailable to me;-)
>
> Unless you are doing something funky like setting a crazy CDL policy
> it should not randomly fail writes. Can you post the dmesg including
> the sense data that the SCSI code should print in this case?
This drive does not support CDL, so it is not that for sure.
Please also describe the drive connection: AHCI SATA port ? SAS HBA ?
Enclosure/SAS expander ?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 5:17 ` Damien Le Moal
@ 2025-06-03 5:54 ` Yafang Shao
2025-06-03 6:36 ` Damien Le Moal
0 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2025-06-03 5:54 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christoph Hellwig, Matthew Wilcox, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel, Damien Le Moal
On Tue, Jun 3, 2025 at 1:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 2025/06/03 13:40, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 11:50:58AM +0800, Yafang Shao wrote:
> >>
> >> The drive in question is a Western Digital HGST Ultrastar
> >> HUH721212ALE600 12TB HDD.
> >> The price information is unavailable to me;-)
> >
> > Unless you are doing something funky like setting a crazy CDL policy
> > it should not randomly fail writes. Can you post the dmesg including
> > the sense data that the SCSI code should print in this case?
Below is an error that occurred today:
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] scsi_io_completion_action: 25 callbacks suppressed
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1669 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1669 CDB: Read(16)
88 00 00 00 00 02 0c dc bc c0 00 00 00 58 00 00
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] blk_print_req_error: 25 callbacks suppressed
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 8805727424 op
0x0:(READ) flags 0x80700 phys_seg 11 prio class 2
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1693 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1709 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1693 CDB: Read(16)
88 00 00 00 00 01 02 1e 48 50 00 00 00 08 00 00
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 4330506320 op
0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1709 CDB: Read(16)
88 00 00 00 00 01 80 01 8c 78 00 00 00 08 00 00
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1704 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 6442552440 op
0x0:(READ) flags 0x81700 phys_seg 1 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1704 CDB: Read(16)
88 00 00 00 00 04 80 18 43 f8 00 00 00 80 00 00
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 19328943096 op
0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1705 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1705 CDB: Read(16)
88 00 00 00 00 04 80 18 85 c8 00 00 03 80 00 00
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 19328959944 op
0x0:(READ) flags 0x80700 phys_seg 112 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1712 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1712 CDB: Read(16)
88 00 00 00 00 01 cd 06 86 d8 00 00 03 30 00 00
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 7734724312 op
0x0:(READ) flags 0x80700 phys_seg 102 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1720 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1720 CDB: Read(16)
88 00 00 00 00 02 49 ed 20 c0 00 00 01 60 00 00
[Tue Jun 3 10:02:44 2025] I/O error, dev sdd, sector 9830211776 op
0x0:(READ) flags 0x80700 phys_seg 44 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 Sense Key :
Medium Error [current] [descriptor]
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 Add. Sense:
Unrecovered read error
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 CDB: Read(16)
88 00 00 00 00 05 6b 21 0b e8 00 00 00 08 00 00
[Tue Jun 3 10:02:44 2025] critical medium error, dev sdd, sector
23272164328 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1688 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1688 CDB: Read(16)
88 00 00 00 00 01 02 0a a6 b8 00 00 00 28 00 00
[Tue Jun 3 10:02:45 2025] I/O error, dev sdd, sector 4329219768 op
0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] sd 14:0:4:0: [sdd] tag#1669 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:02:47 2025] sd 14:0:4:0: [sdd] tag#1669 CDB: Read(16)
88 00 00 00 00 01 80 01 7b b0 00 00 00 08 00 00
[Tue Jun 3 10:02:47 2025] I/O error, dev sdd, sector 6442548144 op
0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:47 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:02:49 2025] sdd: writeback error on inode 10741741427,
offset 54525952, sector 11086521712
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] scsi_io_completion_action: 16 callbacks suppressed
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1761 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1761 CDB: Read(16)
88 00 00 00 00 02 49 ed 1b 80 00 00 00 88 00 00
[Tue Jun 3 10:03:27 2025] blk_print_req_error: 16 callbacks suppressed
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 9830210432 op
0x0:(READ) flags 0x80700 phys_seg 17 prio class 2
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1880 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1880 CDB: Read(16)
88 00 00 00 00 05 50 79 b5 58 00 00 04 00 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 22824990040 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1891 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1891 CDB: Read(16)
88 00 00 00 00 02 49 ed cb 08 00 00 01 58 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 9830255368 op
0x0:(READ) flags 0x80700 phys_seg 43 prio class 2
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1894 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1894 CDB: Read(16)
88 00 00 00 00 05 6b 21 19 98 00 00 03 f8 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 23272167832 op
0x0:(READ) flags 0x80700 phys_seg 127 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1886 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1886 CDB: Read(16)
88 00 00 00 00 02 49 ed 1c 08 00 00 00 d8 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 9830210568 op
0x0:(READ) flags 0x80700 phys_seg 27 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1740 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1740 CDB: Read(16)
88 00 00 00 00 03 39 3a 96 90 00 00 04 00 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 13845042832 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1741 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1741 CDB: Read(16)
88 00 00 00 00 03 39 3a 9a 90 00 00 04 08 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 13845043856 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1873 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1873 CDB: Read(16)
88 00 00 00 00 03 39 3a 9e 98 00 00 04 00 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 13845044888 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1875 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1875 CDB: Read(16)
88 00 00 00 00 03 39 3a a2 98 00 00 04 00 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 13845045912 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1856 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Jun 3 10:03:27 2025] sd 14:0:4:0: [sdd] tag#1856 CDB: Read(16)
88 00 00 00 00 03 39 3a 92 88 00 00 04 08 00 00
[Tue Jun 3 10:03:27 2025] I/O error, dev sdd, sector 13845041800 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:27 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:31 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] scsi_io_completion_action: 48 callbacks suppressed
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1773 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1773 CDB: Read(16)
88 00 00 00 00 01 b4 78 c3 c8 00 00 02 40 00 00
[Tue Jun 3 10:03:35 2025] blk_print_req_error: 48 callbacks suppressed
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 7322780616 op
0x0:(READ) flags 0x80700 phys_seg 72 prio class 2
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1734 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1734 CDB: Read(16)
88 00 00 00 00 02 49 ee 2f 58 00 00 00 88 00 00
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830281048 op
0x0:(READ) flags 0x80700 phys_seg 17 prio class 2
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1867 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1867 CDB: Read(16)
88 00 00 00 00 02 49 ee 2f e0 00 00 00 d8 00 00
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830281184 op
0x0:(READ) flags 0x80700 phys_seg 15 prio class 2
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1768 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1768 CDB: Read(16)
88 00 00 00 00 02 49 ec a6 a0 00 00 00 88 00 00
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1769 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830180512 op
0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1769 CDB: Read(16)
88 00 00 00 00 02 49 ec a7 28 00 00 00 c0 00 00
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1934 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830180648 op
0x0:(READ) flags 0x80700 phys_seg 24 prio class 2
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1934 CDB: Read(16)
88 00 00 00 00 00 0a d5 b7 20 00 00 03 f8 00 00
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 181778208 op
0x0:(READ) flags 0x80700 phys_seg 127 prio class 2
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1894 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1913 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1907 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1757 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[Tue Jun 3 10:03:35 2025] mpt3sas_cm0: log_info(0x31080000):
originator(PL), code(0x08), sub_code(0x0000)
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1907 CDB: Read(16)
88 00 00 00 00 02 49 ec 6c 40 00 00 00 f0 00 00
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1757 CDB: Read(16)
88 00 00 00 00 03 e8 cc 56 a0 00 00 02 d8 00 00
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1913 CDB: Read(16)
88 00 00 00 00 03 e8 cc 56 38 00 00 00 68 00 00
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830165432 op
0x0:(READ) flags 0x80700 phys_seg 17 prio class 2
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 21656330648 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 16790607520 op
0x0:(READ) flags 0x80700 phys_seg 91 prio class 2
[Tue Jun 3 10:03:35 2025] I/O error, dev sdd, sector 9830165568 op
0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
[Tue Jun 3 10:03:35 2025] sd 14:0:4:0: [sdd] tag#1894 CDB: Read(16)
88 00 00 00 00 02 49 ec db 18 00 00 00 88 00 00
>
> This drive does not support CDL, so it is not that for sure.
>
> Please also describe the drive connection: AHCI SATA port ? SAS HBA ?
> Enclosure/SAS expander ?
It is a SAS HBA.
It is worth noting that this disk has recorded 46560 power-on hours
(approximately 5.3 years) of operational lifetime.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 5:54 ` Yafang Shao
@ 2025-06-03 6:36 ` Damien Le Moal
2025-06-03 14:41 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Damien Le Moal @ 2025-06-03 6:36 UTC (permalink / raw)
To: Yafang Shao
Cc: Christoph Hellwig, Matthew Wilcox, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel, Damien Le Moal
On 6/3/25 2:54 PM, Yafang Shao wrote:
> On Tue, Jun 3, 2025 at 1:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 2025/06/03 13:40, Christoph Hellwig wrote:
>>> On Tue, Jun 03, 2025 at 11:50:58AM +0800, Yafang Shao wrote:
>>>>
>>>> The drive in question is a Western Digital HGST Ultrastar
>>>> HUH721212ALE600 12TB HDD.
>>>> The price information is unavailable to me;-)
>>>
>>> Unless you are doing something funky like setting a crazy CDL policy
>>> it should not randomly fail writes. Can you post the dmesg including
>>> the sense data that the SCSI code should print in this case?
>
> Below is an error that occurred today:
>
> [Tue Jun 3 10:02:44 2025] mpt3sas_cm0: log_info(0x31080000):
> originator(PL), code(0x08), sub_code(0x0000)
This is PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR, so you got an NCQ
error which blows up the device queue, as usual with SATA.
> [Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1669 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as this normally
is used to indicate that a command is a collateral abort due to an NCQ error,
and per ATA spec, that command should be retried. However, the *BAD* thing
about Broadcom HBAs using this is that it increments the command retry counter,
so if a command ends up being retried more than 5 times due to other commands
failing, the command runs out of retries and is failed like this. The command
retry counter should *not* be incremented for NCQ collateral aborts. I tried to
fix this, but it is impossible as we actually do not know if this is a
collateral abort or something else. The HBA events used to handle completion do
not allow differentiation. Waiting on Broadcom to do something about this (the
mpi3mr HBA driver has the same nasty issue).
> [Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 Sense Key :
> Medium Error [current] [descriptor]
> [Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 Add. Sense:
> Unrecovered read error
> [Tue Jun 3 10:02:44 2025] sd 14:0:4:0: [sdd] tag#1707 CDB: Read(16)
> 88 00 00 00 00 05 6b 21 0b e8 00 00 00 08 00 00
> [Tue Jun 3 10:02:44 2025] critical medium error, dev sdd, sector
> 23272164328 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Here is the culprit causing the collateral aborts. This is a read command
trying to read a dead sector. It fails and causes all the other
"DID_SOFT_ERROR" failures above due to its retries and repeated failures (it
blows up the command queue every time, causing the same commands to be subject
to the collateral abort retries and running out of retries).
> [Tue Jun 3 10:02:49 2025] sdd: writeback error on inode 10741741427,
> offset 54525952, sector 11086521712
So you get also writes being failed, likely due to the same reason (running out
of retries).
> It is a SAS HBA.
Yes, a Broadcom HBA. As explained, these have an issue with handling retries of
collateral aborts in the presence of NCQ errors. In your case, the NCQ errors
are due to attempts to read bad sectors, which normally should only result in
EIO errors sent back to the user if the reads are for file data. If the reads
are for FS metadata, the FS likely would go read-only.
But the driver using DID_SOFT_ERROR for retrying commands that were not in
error but aborted due to an NCQ error causes failures because of the invalid
handling of the command retry count. libata does things correctly:
/**
 * ata_eh_qc_retry - Tell midlayer to retry an ATA command after EH
 * @qc: Command to retry
 *
 * Indicate to the mid and upper layers that an ATA command
 * should be retried. To be used from EH.
 *
 * SCSI midlayer limits the number of retries to scmd->allowed.
 * scmd->allowed is incremented for commands which get retried
 * due to unrelated failures (qc->err_mask is zero).
 */
void ata_eh_qc_retry(struct ata_queued_cmd *qc)
{
        struct scsi_cmnd *scmd = qc->scsicmd;

        if (!qc->err_mask)
                scmd->allowed++;
        __ata_eh_qc_complete(qc);
}
However, for a SAS-connected ATA drive, this is not the code path used. Either
the HBA FW handles the retries (transparently to the host), or the HBA uses
the host to resend commands to the drive (which is what mpt3sas does). We
really need to fix that mess, as it causes IO failures that are very hard to
debug unless you know what to look for.
I have yet to come up with a good solution though.
> It is worth noting that this disk has recorded 46560 power-on hours
> (approximately 5.3 years) of operational lifetime.
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-02 5:32 ` Christoph Hellwig
@ 2025-06-03 14:35 ` Darrick J. Wong
2025-06-03 14:38 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2025-06-03 14:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yafang Shao, Christian Brauner, cem, linux-xfs, Linux-Fsdevel
On Sun, Jun 01, 2025 at 10:32:33PM -0700, Christoph Hellwig wrote:
> On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > Option C: report all those write errors (direct and buffered) to a
> > daemon and let it figure out what it wants to do:
>
> What value does the daemon add to the decision chain?
The decision chain itself is unchanged -- the events are added to a
queue (if kmalloc doesn't fail) for later distribution to userspace...
> Some form of out of band error reporting is good and extremely useful,
> but having it in the critical error handling path is not.
...and the error handling path moves on without waiting to see what
happens to the queued events. Once the daemon picks up the event it
can decide what to do with it, but that's totally asynchronous from the
IO path.
--D
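(A minimal sketch of that queue-and-move-on pattern, with every name here -
wb_err_event, wb_err_list, wb_err_report() - hypothetical: the completion
path allocates with GFP_ATOMIC, appends, and returns; if the allocation
fails the event is simply dropped, and a daemon-facing reader drains the
list later.)

        #include <linux/list.h>
        #include <linux/slab.h>
        #include <linux/spinlock.h>
        #include <linux/types.h>

        struct wb_err_event {
                struct list_head list;
                u64 ino;
                loff_t pos;
                int error;
        };

        static LIST_HEAD(wb_err_list);
        static DEFINE_SPINLOCK(wb_err_lock);

        /* Called from the writeback error path; never blocks, never waits. */
        static void wb_err_report(u64 ino, loff_t pos, int error)
        {
                struct wb_err_event *ev;
                unsigned long flags;

                ev = kmalloc(sizeof(*ev), GFP_ATOMIC | __GFP_NOWARN);
                if (!ev)
                        return; /* best effort: drop the event and move on */

                ev->ino = ino;
                ev->pos = pos;
                ev->error = error;

                spin_lock_irqsave(&wb_err_lock, flags);
                list_add_tail(&ev->list, &wb_err_list);
                spin_unlock_irqrestore(&wb_err_lock, flags);
                /* a separate reader hands the queued events to userspace */
        }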
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 14:35 ` Darrick J. Wong
@ 2025-06-03 14:38 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:38 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Yafang Shao, Christian Brauner, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 03, 2025 at 07:35:23AM -0700, Darrick J. Wong wrote:
> On Sun, Jun 01, 2025 at 10:32:33PM -0700, Christoph Hellwig wrote:
> > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > Option C: report all those write errors (direct and buffered) to a
> > > daemon and let it figure out what it wants to do:
> >
> > What value does the daemon add to the decision chain?
>
> The decision chain itself is unchanged -- the events are added to a
> queue (if kmalloc doesn't fail) for later distribution to userspace...
>
> > Some form of out of band error reporting is good and extremely useful,
> > but having it in the critical error handling path is not.
>
> ...and the error handling path moves on without waiting to see what
> happens to the queued events. Once the daemon picks up the event it
> can decide what to do with it, but that's totally asynchronous from the
> IO path.
Yes, I'm fully on board with that. Maybe I just misinterpreted earlier
mails.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 6:36 ` Damien Le Moal
@ 2025-06-03 14:41 ` Christoph Hellwig
2025-06-03 14:57 ` James Bottomley
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-03 14:41 UTC (permalink / raw)
To: Damien Le Moal
Cc: Yafang Shao, Christoph Hellwig, Matthew Wilcox, Christian Brauner,
djwong, cem, linux-xfs, Linux-Fsdevel, Damien Le Moal,
Sathya Prakash, Sreekanth Reddy, Suganath Prabu Subramani,
Martin K. Petersen, MPT-FusionLinux.pdl, linux-scsi
[taking this private to discuss the mpt drivers]
> Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as this normally
> is used to indicate that a command is a collateral abort due to an NCQ error,
> and per ATA spec, that command should be retried. However, the *BAD* thing
> about Broadcom HBAs using this is that it increments the command retry counter,
> so if a command ends up being retried more than 5 times due to other commands
> failing, the command runs out of retries and is failed like this. The command
> retry counter should *not* be incremented for NCQ collateral aborts. I tried to
> fix this, but it is impossible as we actually do not know if this is a
> collateral abort or something else. The HBA events used to handle completion do
> not allow differentiation. Waiting on Broadcom to do something about this (the
> mpi3mr HBA driver has the same nasty issue).
Maybe we should just change the mpt3 sas/mr drivers to use
DID_SOFT_ERROR less? In fact there's not really a whole lot of
DID_SOFT_ERROR users otherwise, and there are probably better status
codes that whatever they are doing could be translated to, ones that
do not increment the retry counter.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 14:41 ` Christoph Hellwig
@ 2025-06-03 14:57 ` James Bottomley
2025-06-04 7:29 ` Damien Le Moal
0 siblings, 1 reply; 36+ messages in thread
From: James Bottomley @ 2025-06-03 14:57 UTC (permalink / raw)
To: Christoph Hellwig, Damien Le Moal
Cc: Yafang Shao, Matthew Wilcox, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel, Damien Le Moal, Sathya Prakash,
Sreekanth Reddy, Suganath Prabu Subramani, Martin K. Petersen,
MPT-FusionLinux.pdl, linux-scsi
On Tue, 2025-06-03 at 07:41 -0700, Christoph Hellwig wrote:
> [taking this private to discuss the mpt drivers]
>
> > Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as
> > this normally is used to indicate that a command is a collateral
> > abort due to an NCQ error, and per ATA spec, that command should be
> > retried. However, the *BAD* thing about Broadcom HBAs using this is
> > that it increments the command retry counter, so if a command ends
> > up being retried more than 5 times due to other commands failing,
> > the command runs out of retries and is failed like this. The
> > command retry counter should *not* be incremented for NCQ
> > collateral aborts. I tried to fix this, but it is impossible as we
> > actually do not know if this is a collateral abort or something
> > else. The HBA events used to handle completion do not allow
> > differentiation. Waiting on Broadcom to do something about this
> > (the mpi3mr HBA driver has the same nasty issue).
>
> Maybe we should just change the mpt3 sas/mr drivers to use
> DID_SOFT_ERROR less? In fact there's not really a whole lot of
> DID_SOFT_ERROR users otherwise, and there are probably better status
> codes that whatever they are doing could be translated to, ones that
> do not increment the retry counter.
The status code that does that (retry without incrementing the counter)
is DID_IMM_RETRY. The driver has to be a bit careful about using this
because we can get into infinite retry loops.
Regards,
James
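(A minimal sketch of what that might look like in a driver's completion path.
The collateral_abort flag is an assumption - as Damien notes, the mpt3sas
events don't actually let the driver tell a collateral abort apart from a
real failure - and the driver would still need its own bound on these
retries to avoid looping forever.)

        #include <linux/types.h>
        #include <scsi/scsi.h>
        #include <scsi/scsi_cmnd.h>

        static void example_complete_cmd(struct scsi_cmnd *scmd, bool collateral_abort)
        {
                if (collateral_abort) {
                        /* retry without consuming one of scmd->allowed retries */
                        set_host_byte(scmd, DID_IMM_RETRY);
                } else {
                        /* current behaviour: counted retry that can exhaust scmd->allowed */
                        set_host_byte(scmd, DID_SOFT_ERROR);
                }
                scsi_done(scmd);
        }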
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 4:50 ` Christoph Hellwig
@ 2025-06-03 22:05 ` Dave Chinner
2025-06-04 6:33 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2025-06-03 22:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yafang Shao, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Mon, Jun 02, 2025 at 09:50:04PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 09:19:10AM +1000, Dave Chinner wrote:
> > > In other words, write errors in Linux are in general expected to be
> > > persistent, modulo explicit failfast requests like REQ_NOWAIT.
> >
> > Say what? the blk_errors array defines multiple block layer errors
> > that are transient in nature - stuff like ENOSPC, ETIMEDOUT, EILSEQ,
> > ENOLINK, EBUSY - all indicate a transient, retryable error occurred
> > somewhere in the block/storage layers.
>
> Let's use the block layer codes reported all the way up to the file
> systems and their descriptions instead of the errnos they are
> mapped to for compatibility. The above would be in order:
>
> [BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
> [BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
> [BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
> [BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" },
> [BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
>
> > What is permanent about dm-thinp returning ENOSPC to a write
> > request? Once the pool has been GC'd to free up space or expanded,
> > the ENOSPC error goes away.
>
> Everything. ENOSPC means there is no space. There might be space in
> the non-determinant future, but if the layer just needs to GC it must
> not report the error.
GC of thin pools requires the filesystem to be mounted so fstrim can
be run to tell the thinp device where all the free LBA regions it
can reclaim are located. If we shut down the filesystem instantly
when the pool goes ENOSPC on a metadata write, then *we can't run
fstrim* to free up unused space and hence allow that metadata write
to succeed in the future.
It should be obvious at this point that a filesystem shutdown on an
ENOSPC error from the block device on anything other than journal IO
is exactly the wrong thing to be doing.
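(For reference, a minimal userspace sketch of what fstrim boils down to: the
FITRIM ioctl issued against a mounted filesystem; the mount point below is
hypothetical. This is exactly why the filesystem has to stay mounted for the
thin pool to ever learn which LBAs it can reclaim.)

        #include <fcntl.h>
        #include <limits.h>
        #include <linux/fs.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        int main(void)
        {
                struct fstrim_range range = {
                        .start = 0,
                        .len = ULLONG_MAX,      /* whole filesystem */
                        .minlen = 0,
                };
                int fd = open("/mnt/thin", O_RDONLY);

                if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
                        perror("FITRIM");
                        return 1;
                }
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);
                return close(fd) ? 1 : 0;
        }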
> > What is permanent about an IO failing with EILSEQ because a t10
> > checksum failed due to a random bit error detected between the HBA
> > and the storage device? Retry the IO, and it goes through just fine
> > without any failures.
>
> Normally it means your checksum was wrong. If you have bit errors
> in the cable they will show up again, maybe not on the next I/O
> but soon.
But it's unlikely to be hit by another cosmic ray anytime soon, and
so bit errors caused by completely random environmental events
should -absolutely- be retried as the subsequent write retry will
succeed.
If there is a dodgy cable causing the problems, the error will
re-occur on random IOs and we'll emit write errors to the log that
monitoring software will pick up. If we are repeatedly issuing write
errors due to EILSEQ errors, then that's a sign the hardware needs
replacing.
There is no risk to filesystem integrity if write retries
succeed, and that gives the admin time to schedule downtime to
replace the dodgy hardware. That's much better behaviour than
unexpected production system failure in the middle of the night...
It is because we have robust and resilient error handling in the
filesystem that the system is able to operate correctly in these
marginal situations. Operating in marginal conditions or as hardware
is beginning to fail is necessary to keep production systems
running until corrective action can be taken by the administrators.
> > These transient error types typically only need a write retry after
> > some time period to resolve, and that's what XFS does by default.
> > What makes these sorts of errors persistent in the linux block layer
> > and hence requiring an immediate filesystem shutdown and complete
> > denial of service to the storage?
> >
> > I ask this seriously, because you are effectively saying the linux
> > storage stack now doesn't behave the same as the model we've been
> > using for decades. What has changed, and when did it change?
>
> Hey, you can retry. You're unlikely to improve the situation, though;
> you'll just keep deferring the inevitable shutdown.
Absolutely. That's the whole point - random failures won't repeat,
and hence when they do occur we avoid a shutdown by retrying them on
failure. This is -exactly- how robust error handling should work.
However, for IO errors that persist or where other IO errors start
to creep in, all the default behaviour is trying to do is hold the
system up in a working state until downtime can be scheduled and the
broken hardware is replaced. If integrity ends up being compromised
by a subsequent IO failure, then we will shut the filesystem down at
that point.
This is about resilience in the face of errors. Not every error is
fatal, nor does every error re-occur. There are classes of errors
known to be transient (ENOSPC), others that are permanent (ENODEV),
and others that we just don't know (EIO). If we value resiliency
and robustness, then the filesystem should be able to withstand
transient and "maybe-transient" IO failures without compromising
integrity.
Failing to recognise that transient and "maybe-transient" errors can
generally be handled cleanly and successfully with future write
retries leads to brittle, fragile systems that fall over at the
first sign of anything going wrong. Filesystems that are targeted
at high-value production systems and/or mission-critical
applications need to have resilient and robust error handling.
> > > Which also leaves me a bit puzzled what the XFS metadata retries are
> > > actually trying to solve, especially without even having a corresponding
> > > data I/O version.
> >
> > It's always been for preventing immediate filesystem shutdown when
> > spurious transient IO errors occur below XFS. Data IO errors don't
> > cause filesystem shutdowns - errors get propagated to the
> > application - so there isn't a full system DOS potential for
> > incorrect classification of data IO errors...
>
> Except, as we see in this thread, for a fairly common use case (buffered
> I/O without fsync) they don't. And I agree with you that this is not
> how you write applications that care about data integrity - but the
> entire rest of the system, and just about every common utility, is
> written that way.
Yes, I know that. But there are still valid reasons for retrying
failed async data writeback IO when it triggers a spurious or
retriable IO error....
> And even applications that fsync won't see your fancy error code. The
> only things stored in the address_space for fsync to catch are EIO and
> ENOSPC.
The filesystem knows exactly what the IO error reported by the block
layer is before we run folio completions, so we control exactly what
we want to report as IO completion status.
Hence the bogosities of error propagation to userspace via the
mapping are completely irrelevant to this discussion/feature because
it would be implemented below the layer that squashes the eventual
IO errno into the address space...
-Dave.
--
Dave Chinner
david@fromorbit.com
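(A minimal sketch - not the actual iomap/XFS code - of the point being made
above: at writeback completion the full blk_status_t is still in hand, and
the filesystem decides what, if anything, gets squashed into the
address_space for fsync() to see.)

        #include <linux/bio.h>
        #include <linux/blkdev.h>
        #include <linux/pagemap.h>

        static void example_writeback_done(struct address_space *mapping,
                                           struct bio *bio)
        {
                int error = blk_status_to_errno(bio->bi_status);

                if (!error)
                        return;
                /*
                 * A retry policy could inspect bio->bi_status here, before
                 * the error is flattened for fsync() reporting.
                 */
                mapping_set_error(mapping, error);
        }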
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 22:05 ` Dave Chinner
@ 2025-06-04 6:33 ` Christoph Hellwig
2025-06-05 2:18 ` Dave Chinner
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-04 6:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Yafang Shao, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel
On Wed, Jun 04, 2025 at 08:05:03AM +1000, Dave Chinner wrote:
> >
> > Everything. ENOSPC means there is no space. There might be space at
> > some indeterminate point in the future, but if the layer just needs to
> > GC, it must not report the error.
>
> GC of thin pools requires the filesystem to be mounted so fstrim can
> be run to tell the thinp device where all the free LBA regions it
> can reclaim are located. If we shut down the filesystem instantly
> when the pool goes ENOSPC on a metadata write, then *we can't run
> fstrim* to free up unused space and hence allow that metadata write
> to succeed in the future.
>
> It should be obvious at this point that a filesystem shutdown on an
> ENOSPC error from the block device on anything other than journal IO
> is exactly the wrong thing to be doing.
How high are the chances that you hit exactly the rare metadata
writeback I/O, and not journal or data I/O, for this odd condition
that requires user interaction? Where is this weird model, in which a
storage device returns an out-of-space error and manual user intervention
(a manual rather than an online trim) is supposed to fix it, even documented?
> > Normally it means your checksum was wrong. If you have bit errors
> > in the cable they will show up again, maybe not on the next I/O
> > but soon.
>
> But it's unlikely to be hit by another cosmic ray anytime soon, and
> so bit errors caused by completely random environmental events
> should -absolutely- be retried as the subsequent write retry will
> succeed.
>
> If there is a dodgy cable causing the problems, the error will
> re-occur on random IOs and we'll emit write errors to the log that
> monitoring software will pick up. If we are repeatedly issuing write
> errors due to EILSEQ errors, then that's a sign the hardware needs
> replacing.
Umm, all the storage protocols do have pretty good checksums. A cosmic
ray isn't going to fail them; it is something more fundamental, like
broken hardware or connections. In other words, you are going to see
this again and again, pretty frequently.
> There is no risk to filesystem integrity if write retries
> succeed, and that gives the admin time to schedule downtime to
> replace the dodgy hardware. That's much better behaviour than
> unexpected production system failure in the middle of the night...
>
> It is because we have robust and resilient error handling in the
> filesystem that the system is able to operate correctly in these
> marginal situations. Operating in marginal conditions or as hardware
> is beginning to fail is necessary to keep production systems
> running until corrective action can be taken by the administrators.
I'd really like to see a formal writeup of your theory of robust error
handling where that robustness is centered around the fairly rare
case of metadata writeback and applications dealing with I/O errors,
while journal write errors and read errors lead to shutdown. Maybe
I'm missing something important, but the theory does not sound valid,
and we don't have any testing framework that actually verifies it.
> Failing to recognise that transient and "maybe-transient" errors can
> generally be handled cleanly and successfully with future write
> retries leads to brittle, fragile systems that fall over at the
> first sign of anything going wrong. Filesystems that are targeted
> at high-value production systems and/or mission-critical
> applications need to have resilient and robust error handling.
What known transient errors do you think XFS (or any other file system)
actually handles properly? Where is the contract that these errors
actually are transient.
> > And even applications that fsync won't see you fancy error code. The
> > only thing stored in the address_space for fsync to catch is EIO and
> > ENOSPC.
>
> The filesystem knows exactly what the IO error reported by the block
> layer is before we run folio completions, so we control exactly what
> we want to report as IO completion status.
Sure, you could invent a scheme to propagate the exact error. For
direct I/O we even return the exact error to userspace. But that
requires us to actually have a definition of what each error means, and how
it could be handled. None of that exists right now. We could do
all this, but that assumes you actually have:
a) a clear definition of a problem
b) a good way to fix that problem
c) good testing infrastructure to actually test it, because without
that all good intentions will probably cause more problems than
they solve
> Hence the bogosities of error propagation to userspace via the
> mapping are completely irrelevant to this discussion/feature because
> it would be implemented below the layer that squashes the eventual
> IO errno into the address space...
How would implement and test all this? And for what use case?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 14:57 ` James Bottomley
@ 2025-06-04 7:29 ` Damien Le Moal
0 siblings, 0 replies; 36+ messages in thread
From: Damien Le Moal @ 2025-06-04 7:29 UTC (permalink / raw)
To: James Bottomley, Christoph Hellwig
Cc: Yafang Shao, Matthew Wilcox, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel, Damien Le Moal, Sathya Prakash,
Sreekanth Reddy, Suganath Prabu Subramani, Martin K. Petersen,
MPT-FusionLinux.pdl, linux-scsi
On 6/3/25 11:57 PM, James Bottomley wrote:
> On Tue, 2025-06-03 at 07:41 -0700, Christoph Hellwig wrote:
>> [taking this private to discuss the mpt drivers]
>>
>>> Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as
>>> this normally is used to indicate that a command is a collateral
>>> abort due to an NCQ error, and per ATA spec, that command should be
>>> retried. However, the *BAD* thing about Broadcom HBAs using this is
>>> that it increments the command retry counter, so if a command ends
>>> up being retried more than 5 times due to other commands failing,
>>> the command runs out of retries and is failed like this. The
>>> command retry counter should *not* be incremented for NCQ
>>> collateral aborts. I tried to fix this, but it is impossible as we
>>> actually do not know if this is a collateral abort or something
>>> else. The HBA events used to handle completion do not allow
>>> differentiation. Waiting on Broadcom to do something about this
>>> (the mpi3mr HBA driver has the same nasty issue).
>>
>> Maybe we should just change the mpt3 sas/mr drivers to use
>> DID_SOFT_ERROR less? In fact there's not really a whole lot of
>> DID_SOFT_ERROR users otherwise, and there are probably better status
>> codes that whatever they are doing could be translated to, ones that
>> do not increment the retry counter.
>
> The status code that does that (retry without incrementing the counter)
> is DID_IMM_RETRY. The driver has to be a bit careful about using this
> because we can get into infinite retry loops.
James,
Thank you for the information. Will have a try again at changing the driver to
use this.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-04 6:33 ` Christoph Hellwig
@ 2025-06-05 2:18 ` Dave Chinner
2025-06-05 4:51 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2025-06-05 2:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yafang Shao, Christian Brauner, djwong, cem, linux-xfs,
Linux-Fsdevel
On Tue, Jun 03, 2025 at 11:33:05PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 04, 2025 at 08:05:03AM +1000, Dave Chinner wrote:
> > >
> > > Everything. ENOSPC means there is no space. There might be space in
> > > the indeterminate future, but if the layer just needs to GC it must
> > > not report the error.
> >
> > GC of thin pools requires the filesystem to be mounted so fstrim can
> > be run to tell the thinp device where all the free LBA regions it
> > can reclaim are located. If we shut down the filesystem instantly
> > when the pool goes ENOSPC on a metadata write, then *we can't run
> > fstrim* to free up unused space and hence allow that metadata write
> > to succeed in the future.
> >
> > It should be obvious at this point that a filesystem shutdown on an
> > ENOSPC error from the block device on anything other than journal IO
> > is exactly the wrong thing to be doing.
>
> How high are the chances that you hit exactly the rare metadata
> writeback I/O and not journal or data I/O for this odd condition
> that requires user interaction?
100%.
We'll hit it with both data IO and metadata IO at the same time,
but in the vast majority of cases we won't hit ENOSPC on journal IO.
Why? Because mkfs.xfs zeros the entire log via either
FALLOC_FL_ZERO_RANGE or writing physical zeros. Hence a thin device
always has a fully allocated log before the filesystem is first
mounted and so ENOSPC to journal IO should never happen unless a
device level snapshot is taken.
i.e. the only time the journal is not fully allocated in the block device
is immediately after a block device snapshot is taken. The log needs
to be written entirely once before it is fully allocated again, and
this is the only point in time we will see ENOSPC on a thinp device
for journal IO.
Because the log IO is sequential, and the log is circular, there is
no write or allocation amplification here and once the log has been
written once further writes are simply overwriting allocated LBA
space. Hence after a short period of time of activity after a
snapshot, ENOSPC from journal IO is no longer a possibility. This
case is the exception rather than common behaviour.
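
(For anyone unfamiliar with the mkfs-time zeroing mentioned above, a
minimal userspace sketch of that FALLOC_FL_ZERO_RANGE step is below.
The device path, log offset and 128MB log size are illustrative
assumptions only - mkfs.xfs derives the real values from the
filesystem geometry, and falls back to writing physical zeros if the
device does not support this fallocate mode.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Illustrative values only; mkfs.xfs computes these itself. */
	off_t log_start = 0;
	off_t log_len = 128 * 1024 * 1024;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * Force the (possibly thin) device to allocate backing space
	 * for the whole log region up front.
	 */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, log_start, log_len) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}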
Metadata writeback is a different story altogether.
When we allocate and write back metadata for the first time (either
after mkfs, fstrim or a device snapshot) or overwrite existing
metadata after a snapshot, the metadata writeback IO will
always require device side space allocation.
Unlike the neat sequential journal IO, metadata writeback is
effectively random small write IO. This triggers worst-case
allocation amplification on thinp devices, as well as worst-case
write amplification in the case of COW after a snapshot. Metadata
writeback - especially overwrite after snapshot + modification - is
the worst possible write pattern for thinp devices.
It is not unusual to see dm-thin devices with a 64kB block size have
allocation and write amplification factors of 15-16 on 4kB block
size filesystems after a snapshot as every single random metadata
overwrite will now trigger a 64kB COW in the dm-thin device to break
blocks shared between snapshots.
So, yes, metadata writeback is extremely prone to triggering ENOSPC
from thin devices, whilst journal IO almost never triggers it.
> Where is this weird model where a
> storage device returns an out of space error and manual user interaction
> using manual and not online trim is going to fix even documented?
I explicitly said that the filesystem needs to remain online when
the thin pool goes ENOSPC so that fstrim (the online filesystem trim
utility) can be run to inform the thin pool exactly where all the
free LBA address space is so it can efficiently free up pool space.
This is a standard procedure that people automate through things
like udev scripts that capture the dm-thin pool low/no space
events.
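
(For reference, fstrim(8) is essentially a thin wrapper around the
FITRIM ioctl. A minimal sketch of that call - with an assumed mount
point and most error handling trimmed - is below; dm-thin can then
reclaim the discarded LBAs back into the pool.)

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(void)
{
	struct fstrim_range range;
	int fd = open("/mnt/thinvol", O_RDONLY);	/* assumed mount point */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&range, 0, sizeof(range));
	range.len = ULLONG_MAX;		/* trim all free space in the fs */
	if (ioctl(fd, FITRIM, &range) < 0)
		perror("FITRIM");
	else	/* the kernel reports back how many bytes were trimmed */
		printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}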
You seem to be trying to create a strawman here....
> > > Normally it means your checksum was wrong. If you have bit errors
> > > in the cable they will show up again, maybe not on the next I/O
> > > but soon.
> >
> > But it's unlikely to be hit by another cosmic ray anytime soon, and
> > so bit errors caused by completely random environmental events
> > should -absolutely- be retried as the subsequent write retry will
> > succeed.
> >
> > If there is a dodgy cable causing the problems, the error will
> > re-occur on random IOs and we'll emit write errors to the log that
> monitoring software will pick up. If we are repeatedly issuing write
> > errors due to EILSEQ errors, then that's a sign the hardware needs
> > replacing.
>
> Umm, all the storage protocols do have pretty good checksums.
The strength of the checksum is irrelevant. It's what we do when
it detects a bit error that is being discussed.
> A cosmic
> ray isn't going to fail them; it is something more fundamental like
> broken hardware or connections. In other words you are going to see
> this again and again pretty frequently.
I've seen plenty of one-off, unexplainable, unreproducible IO
errors because of random bit errors over the past 20+ years.
But what causes them is irrelevant - the fact is that they do occur,
and we cannot know if it is transient or persistent from a single IO
context. Hence the only decision that can be made from IO completion
context is "retry or fail this IO". We default to "retry" for
metadata writeback because that automatically handles transient
errors correctly.
IOWs, if it is actually broken hardware, then the fact we may retry
individual failed IOs in a non-critical path is irrelevant. If the
errors are persistent and/or widespread, then we will get an error
in a critical path and shut down at that point.
This means the architecture is naturally resilient against transient
write errors, regardless of their cause. We want XFS to be resilient;
we do not want it to be brittle or fragile in environments that are
slightly less than perfect, unless that is the way the admin wants
it to behave. We just give the admin the option to choose how their
filesystems respond to such errors, but we default to the most
resilient settings for everyone else.
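
(Concretely, those knobs live under /sys/fs/xfs/<dev>/error/ - the
per-error max_retries and retry_timeout_seconds files plus
fail_at_unmount. A minimal sketch of setting them from C is below;
the "sda1" device name and the particular values are illustrative
assumptions, not recommendations.)

#include <stdio.h>

/* Write a single value to an XFS error-configuration sysfs file. */
static int set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Fail metadata writeback on the first EIO instead of retrying. */
	set_knob("/sys/fs/xfs/sda1/error/metadata/EIO/max_retries", "0");
	/* Keep retrying ENOSPC forever (-1, the default) so a thin pool
	 * can be grown or trimmed while the filesystem stays online. */
	set_knob("/sys/fs/xfs/sda1/error/metadata/ENOSPC/max_retries", "-1");
	/* Cancel any pending retries at unmount time. */
	set_knob("/sys/fs/xfs/sda1/error/fail_at_unmount", "1");
	return 0;
}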
> > There is no risk to filesystem integrity if write retries
> > succeed, and that gives the admin time to schedule downtime to
> > replace the dodgy hardware. That's much better behaviour than
> > unexpected production system failure in the middle of the night...
> >
> > It is because we have robust and resilient error handling in the
> > filesystem that the system is able to operate correctly in these
> > marginal situations. Operating in marginal conditions or as hardware
> > is beginning to fail is necessary to keep production systems
> > running until corrective action can be taken by the administrators.
>
> I'd really like to see a formal writeup of your theory of robust error
> handling where that robustness is centered around the fairly rare
> case of metadata writeback and applications dealing with I/O errors,
> while journal write errors and read errors lead to shutdown.
.... and there's the strawman argument, and a demand for formal
proofs as the only way to defend against your argument.
> Maybe
> I'm missing something important, but the theory does not sound valid,
> and we don't have any testing framework that actually verifies it.
I think you are being intentionally obtuse, Christoph. I wrote this
for XFS back in *2008*:
https://web.archive.org/web/20140907100223/http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption
The "exception handling" section is probably appropriate here,
but whilst the contents are not directly about this particular
discussion, the point is that we've always considered there to be
types of IO errors that are transient in nature. I will quote part
of that section:
"Furthermore, the storage subsystem plays a part in deciding how to
handle errors. The reason is that in many storage configurations I/O
errors can be transient. For example, in a SAN a broken fibre can
cause a failover to a redundant path, however the inflight I/O on
the failed path is usually timed out and an error returned. We don't want
to shut down the filesystem on such an error - we want to wait for
failover to a redundant path and then retry the I/O. If the failover
succeeds, then the I/O will succeed. Hence any robust method of
exception handling needs to consider that I/O exceptions may be
transient. "
The point I am making is that the entire architecture of the
current V5 on-disk format, the verification architecture and the
scrub/online repair infrastructure was very much based on the
storage device model that *IO errors may be transient*.
>
> > Failing to recognise that transient and "maybe-transient" errors can
> > generally be handled cleanly and successfully with future write
> > retries leads to brittle, fragile systems that fall over at the
> > first sign of anything going wrong. Filesystems that are targeted
> > at high value production systems and/or running mission critical
> > applications need to have resilient and robust error handling.
>
> What known transient errors do you think XFS (or any other file system)
> actually handles properly? Where is the contract that these errors
> actually are transient.
Nope, I'm not going to play the "I demand that you prove the
behaviour that has existed in XFS for over 30 years is correct",
Christoph.
If you want to change the underlying IO error handling model that
XFS has been based on since it was first designed back in the 1990s,
then it's on you to prove to every filesystem developer that IO
errors reported from the block layer can *never be transient*.
Indeed, please provide us with the "contract" that says block
devices and storage devices are not allowed to expose transient IO
errors to higher layers.
Then you need to show that ENOSPC from a dm-thin device is *forever*,
and never goes away, and justify that behaviour as being in the best
interests of users despite the ease of pool expansion to make ENOSPC
go away.....
It is on you to prove that the existing model is wrong and needs
fixing, not for us to prove to you that the existing model is
correct.
> > > And even applications that fsync won't see your fancy error code. The
> > > only thing stored in the address_space for fsync to catch is EIO and
> > > ENOSPC.
> >
> > The filesystem knows exactly what the IO error reported by the block
> > layer is before we run folio completions, so we control exactly what
> > we want to report as IO completion status.
>
> Sure, you could invent a scheme to propagate the exact error. For
> direct I/O we even return the exact error to userspace. But that
> means we actually have a definition of what each error means, and how
> it could be handled. None of that exists right now. We could do
> all this, but that assumes you actually have:
>
> a) a clear definition of a problem
> b) a good way to fix that problem
> c) good testing infrastructure to actually test it, because without
> that all good intentions will probably cause more problems than
> they solve
>
> > Hence the bogosities of error propagation to userspace via the
> > mapping are completely irrelevant to this discussion/feature because
> > it would be implemented below the layer that squashes the eventual
> > IO errno into the address space...
>
> How would you implement and test all this? And for what use case?
I don't care, it's not my problem to solve, and I don't care if
nothing comes of it.
A fellow developer asked for advice; I simply suggested following an
existing model we already have infrastructure for. Now you are
demanding that I prove the existing decades-old model is valid, and
then tell you how to solve the OG's problem and make it all work.
None of this is my problem, regardless of how much you try to make
it so.
Really, though, I don't know why you think that transient errors
don't exist anymore, nor why you are demanding that I prove that
they do when it is abundantly clear that ENOSPC from dm-thin can
definitely be a transient error.
Perhaps you can provide some background on why you are asserting
that there is no such thing as a transient IO error so we can all
start from a common understanding?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-05 2:18 ` Dave Chinner
@ 2025-06-05 4:51 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2025-06-05 4:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Yafang Shao, Christian Brauner, djwong, cem,
linux-xfs, Linux-Fsdevel
On Thu, Jun 05, 2025 at 12:18:24PM +1000, Dave Chinner wrote:
> > How high are the chances that you hit exactly the rare metadata
> > writeback I/O and not journal or data I/O for this odd condition
> > that requires user interaction?
>
> 100%.
>
> We'll hit it with both data IO and metadata IO at the same time,
> but in the vast majority of cases we won't hit ENOSPC on journal IO.
>
> Why? Because mkfs.xfs zeros the entire log via either
> FALLOC_FL_ZERO_RANGE or writing physical zeros. Hence a thin device
> always has a fully allocated log before the filesystem is first
> mounted and so ENOSPC to journal IO should never happen unless a
> device level snapshot is taken.
>
> i.e. the only time the journal is not fully allocated in the block device
> is immediately after a block device snapshot is taken. The log needs
> to be written entirely once before it is fully allocated again, and
> this is the only point in time we will see ENOSPC on a thinp device
> for journal IO.
I guess that works for the very specific dm-thin case. Not for anything
else that does actual out of place writes, though.
> > Where is this weird model where a
> > storage device returns an out of space error and manual user interaction
> > using manual and not online trim is going to fix even documented?
>
> I explicitly said that the filesystem needs to remain online when
> the thin pool goes ENOSPC so that fstrim (the online filesystem trim
> utility) can be run to inform the thin pool exactly where all the
> free LBA address space is so it can efficiently free up pool space.
>
> This is a standard procedure that people automate through things
> like udev scripts that capture the dm-thin pool low/no space
> events.
>
> You seem to be trying to create a strawman here....
I'm not. But you seem to be very focussed on the undocumented and
in general a bit unusual dm-thin semantics. If that's all you care
about, fine, but state that.
> But what causes them is irrelevant - the fact is that they do occur,
> and we cannot know if it is transient or persistent from a single IO
> context. Hence the only decision that can be made from IO completion
> context is "retry or fail this IO". We default to "retry" for
> metadata writeback because that automatically handles transient
> errors correctly.
>
> IOWs, if it is actually broken hardware, then the fact we may retry
> individual failed IOs in a non-critical path is irrelevant. If the
> errors are persistent and/or widespread, then we will get an error
> in a critical path and shut down at that point.
In general continuing when you have known errors is a bad idea
unless you specifically know retrying makes them better. When you
are on PI-enabled hardware, retrying that PI error (and that's what
we are talking about here) is very unlikely to just make things
better.
> > > It is because we have robust and resilient error handling in the
> > > filesystem that the system is able to operate correctly in these
> > > marginal situations. Operating in marginal conditions or as hardware
> > > is beginning to fail is necessary to keep production systems
> > > running until corrective action can be taken by the administrators.
> >
> > I'd really like to see a formal writeup of your theory of robust error
> > handling where that robustness is centered around the fairly rare
> > case of metadata writeback and applications dealing with I/O errors,
> > while journal write errors and read errors lead to shutdown.
>
> .... and there's the strawman argument, and a demand for formal
> proofs as the only way to defend against your argument.
No. You claim that "we have robust and resilient error handling in the
filesystem". It's pretty clear from the code and the discussion that
we do not. If you insist that we do I'd rather see a good proof of
that.
> I think you are being intentionally obtuse, Christoph. I wrote this
> for XFS back in *2008*:
Which as you later state yourself is irrelevant to this discussion.
> The point I am making is that the entire architecture of the
> current V5 on-disk format, the verification architecture and the
> scrub/online repair infrastructure was very much based on the
> storage device model that *IO errors may be transient*.
Except that, as we've clearly seen in this thread, in practice it
does not. We have a way to retry the asynchronous metadata writeback,
apparently designed to deal with an undocumented dm-thin use case,
but everything else is handwaving.
> > What known transient errors do you think XFS (or any other file system)
> > actually handles properly? Where is the contract that these errors
> > actually are transient.
>
> Nope, I'm not going to play the "I demand that you prove the
> behaviour that has existed in XFS for over 30 years is correct",
> Christoph.
>
> If you want to change the underlying IO error handling model that
> XFS has been based on since it was first designed back in the 1990s,
> then it's on you to prove to every filesystem developer that IO
> errors reported from the block layer can *never be transient*.
I'm not changing anything. I'm just challenging your opinion that
all this has been handled forever. And it's pretty clear that it
is not. So I really object to you spreading these untrue claims
without anything to back them up.
Maybe you want to handle transient errors, and that's fine. But
that is aspirational.
> Really, though, I don't know why you think that transient errors
> don't exist anymore, nor why you are demanding that I prove that
> they do when it is abundantly clear that ENOSPC from dm-thin can
> definitely be a transient error.
>
> Perhaps you can provide some background on why you are asserting
> that there is no such thing as a transient IO error so we can all
> start from a common understanding?
Oh, there absolutely are transient I/O errors. But in the Linux I/O
stack they are handled in general below the file system. Look at SCSI
error handling, the NVMe retry mechanisms, or the multipath drivers. All
of them do handle transient errors in a usually more or less well
understood and well tested fashion. But except for the retries of
asynchronous metadata buffer writeback in XFS basically nothing in the
commonly used file systems handles transient errors, exactly because that
is not how the layering works. If we want to change that we'd better
understand what the use case for that is and how we properly test it.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-03 0:03 ` Darrick J. Wong
@ 2025-06-06 10:43 ` Christian Brauner
2025-06-12 3:43 ` Darrick J. Wong
0 siblings, 1 reply; 36+ messages in thread
From: Christian Brauner @ 2025-06-06 10:43 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Mon, Jun 02, 2025 at 05:03:27PM -0700, Darrick J. Wong wrote:
> On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > > blocks. After investigation, we determined that the issue was related
> > > > > > to writeback errors. The details are as follows:
> > > > > >
> > > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > > without errors.
> > > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > > since they are already clean pages.
> > > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > > the bad blocks, as the original data was never successfully written
> > > > > > (IOMAP_UNWRITTEN).
> > > > > >
> > > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > > known writeback error issue. While using fsync() after buffered
> > > > > > write() could mitigate the problem, this approach is impractical for
> > > > > > our services.
> > > > > >
> > > > > > Instead, we propose introducing configurable options to notify users
> > > > > > of writeback errors immediately and prevent further operations on
> > > > > > affected files or disks. Possible solutions include:
> > > > > >
> > > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > > >
> > > > > > These options could be controlled via mount options or sysfs
> > > > > > configurations. Both solutions would be preferable to silently
> > > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > > issues and can take corrective action.
> > > > > >
> > > > > > Any suggestions ?
> > > > >
> > > > > Option C: report all those write errors (direct and buffered) to a
> > > > > daemon and let it figure out what it wants to do:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > > >
> > > > > Yes this is a long term option since it involves adding upcalls from the
> > > >
> > > > I hope you don't mean actual usermodehelper upcalls here because we
> > > > should not add any new ones. If you just mean a way to call up from a
> > > > lower layer then that's obviously fine.
> > >
> > > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > > event data (or drops it) and waits for userspace to read the queued
> > > events. We're not directly invoking a helper program from deep in the
> > > guts, that's too wild even for me. ;)
> > >
> > > > Fwiw, have you considered building this on top of a fanotify extension
> > > > instead of inventing your own mechanism for this?
> > >
> > > I have, at various stages of this experiment.
> > >
> > > Originally, I was only going to export xfs-specific metadata events
> > > (e.g. this AG's inode btree index is bad) so that the userspace program
> > > (xfs_healer) could initiate a repair against the broken pieces.
> > >
> > > At the time I thought it would be fun to experiment with an anonfd file
> > > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > > mess because json is easily parsed into key-value mapping objects in a
> > > lot of languages (that aren't C). It later turned out that formatting
> > > the json is rather more costly than I thought even with seq_bufs, so I
> > > added an alternate format that emits boring C structures.
> > >
> > > Having gone back to C structs, it would be possible (and possibly quite
> > > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > > queuing code. But that can have its own drawbacks, as Ted and I
> > > discovered when we discussed his patches that pushed ext4 error events
> > > through fanotify:
> > >
> > > For filesystem metadata events, the fine details of representing that
> > > metadata in a generic interface gets really messy because each
> > > filesystem has a different design.
> >
> > Perhaps that is the wrong approach. The event just needs to tell
> > userspace that there is a metadata error, and the fs specific agent
> > that receives the event can then pull the failure information from
> > the filesystem through a fs specific ioctl interface.
> >
> > i.e. the fanotify event could simply be a unique error, and that
> > gets passed back into the ioctl to retreive the fs specific details
> > of the failure. We might not even need fanotify for this - I suspect
> > that we could use udev events to punch error ID notifications out to
> > userspace to trigger a fs specific helper to go find out what went
> > wrong.
>
> I'm not sure if you're addressing me or brauner, but I think it would be
> even simpler to retain the current design where events are queued to our
> special xfs anonfd and read out by userspace. Using fanotify as a "door
> bell" to go look at another fd is ... basically poll() but far more
> complicated than it ought to be. Pounding udev with events can result
> in userspace burning a lot of energy walking the entire rule chain.
I don't think we need to rush any of this. My main concern is that if we
come up with something then I want it to be usable by other
filesystems, as this seems like something that is generally very useful. By
using fanotify we implicitly enable this, which is why I'm asking.
I don't want the outcome to be that there's a filesystem with a very
elaborate and detailed scheme that cannot be used by another one and
then we end up with slightly different implementations of the same
underlying concept. And so it will be impossible for userspace to
consume correctly even if abstracted in multiple libraries.
I think udev is the wrong medium for this and I'm pretty sure that the
udev maintainers agree with me on this.
I think this specific type of API would really benefit from gathering
feedback from userspace. There's All Systems Go in Berlin in September
and that might not be the worst time to present what you did and give a
little demo. I'm not sure how fond you are of traveling though rn:
https://all-systems-go.io/
>
> > Keeping unprocessed failures in an internal fs queue isn't a big
> > deal; it's not a lot of memory, and it can be discarded on unmount.
> > At that point we know that userspace did not care about the
> > failure and is not going to be able to query about the failure in
> > future, so we can just throw it away.
> >
> > This also allows filesystems to develop such functionality in
> > parallel, allowing us to find commonality and potential areas for
> > abstraction as the functionality is developed, rather than trying to
> > come up with some generic interface that needs to support all
> > possible things we can think of right now....
>
> Agreed. I don't think Ted or Jan were enthusiastic about trying to make
> a generic fs metadata event descriptor either.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-06 10:43 ` Christian Brauner
@ 2025-06-12 3:43 ` Darrick J. Wong
2025-06-12 6:29 ` Amir Goldstein
0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2025-06-12 3:43 UTC (permalink / raw)
To: Christian Brauner
Cc: Dave Chinner, Yafang Shao, cem, linux-xfs, Linux-Fsdevel
On Fri, Jun 06, 2025 at 12:43:20PM +0200, Christian Brauner wrote:
> On Mon, Jun 02, 2025 at 05:03:27PM -0700, Darrick J. Wong wrote:
> > On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> > > On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > > > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > > > blocks. After investigation, we determined that the issue was related
> > > > > > > to writeback errors. The details are as follows:
> > > > > > >
> > > > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > > > without errors.
> > > > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > > > since they are already clean pages.
> > > > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > > > the bad blocks, as the original data was never successfully written
> > > > > > > (IOMAP_UNWRITTEN).
> > > > > > >
> > > > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > > > known writeback error issue. While using fsync() after buffered
> > > > > > > write() could mitigate the problem, this approach is impractical for
> > > > > > > our services.
> > > > > > >
> > > > > > > Instead, we propose introducing configurable options to notify users
> > > > > > > of writeback errors immediately and prevent further operations on
> > > > > > > affected files or disks. Possible solutions include:
> > > > > > >
> > > > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > > > >
> > > > > > > These options could be controlled via mount options or sysfs
> > > > > > > configurations. Both solutions would be preferable to silently
> > > > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > > > issues and can take corrective action.
> > > > > > >
> > > > > > > Any suggestions ?
> > > > > >
> > > > > > Option C: report all those write errors (direct and buffered) to a
> > > > > > daemon and let it figure out what it wants to do:
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > > > >
> > > > > > Yes this is a long term option since it involves adding upcalls from the
> > > > >
> > > > > I hope you don't mean actual usermodehelper upcalls here because we
> > > > > should not add any new ones. If you just mean a way to call up from a
> > > > > lower layer then that's obviously fine.
> > > >
> > > > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > > > event data (or drops it) and waits for userspace to read the queued
> > > > events. We're not directly invoking a helper program from deep in the
> > > > guts, that's too wild even for me. ;)
> > > >
> > > > > Fwiw, have you considered building this on top of a fanotify extension
> > > > > instead of inventing your own mechanism for this?
> > > >
> > > > I have, at various stages of this experiment.
> > > >
> > > > Originally, I was only going to export xfs-specific metadata events
> > > > (e.g. this AG's inode btree index is bad) so that the userspace program
> > > > (xfs_healer) could initiate a repair against the broken pieces.
> > > >
> > > > At the time I thought it would be fun to experiment with an anonfd file
> > > > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > > > mess because json is easily parsed into key-value mapping objects in a
> > > > lot of languages (that aren't C). It later turned out that formatting
> > > > the json is rather more costly than I thought even with seq_bufs, so I
> > > > added an alternate format that emits boring C structures.
> > > >
> > > > Having gone back to C structs, it would be possible (and possibly quite
> > > > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > > > queuing code. But that can have its own drawbacks, as Ted and I
> > > > discovered when we discussed his patches that pushed ext4 error events
> > > > through fanotify:
> > > >
> > > > For filesystem metadata events, the fine details of representing that
> > > > metadata in a generic interface gets really messy because each
> > > > filesystem has a different design.
> > >
> > > Perhaps that is the wrong approach. The event just needs to tell
> > > userspace that there is a metadata error, and the fs specific agent
> > > that receives the event can then pull the failure information from
> > > the filesystem through a fs specific ioctl interface.
> > >
> > > i.e. the fanotify event could simply be a unique error, and that
> > > gets passed back into the ioctl to retrieve the fs specific details
> > > of the failure. We might not even need fanotify for this - I suspect
> > > that we could use udev events to punch error ID notifications out to
> > > userspace to trigger a fs specific helper to go find out what went
> > > wrong.
> >
> > I'm not sure if you're addressing me or brauner, but I think it would be
> > even simpler to retain the current design where events are queued to our
> > special xfs anonfd and read out by userspace. Using fanotify as a "door
> > bell" to go look at another fd is ... basically poll() but far more
> > complicated than it ought to be. Pounding udev with events can result
> > in userspace burning a lot of energy walking the entire rule chain.
>
> I don't think we need to rush any of this. My main concern is that if we
> come up with something then I want it to be able to be used by other
> filesystems as this seems something that is generally very useful. By
> using fanotify we implicitly enable this which is why I'm asking.
>
> I don't want the outcome to be that there's a filesystem with a very
> elaborate and detailed scheme that cannot be used by another one and
> then we end up with slightly different implementations of the same
> underlying concept. And so it will be impossible for userspace to
> consume correctly even if abstracted in multiple libraries.
Hrm. I 60% agree and 60% disagree with you. :D
60% disagree: for describing problems with internal filesystem metadata,
I don't think there's a generic way to expose that outside of ugly
stringly-parsing things like json. Frankly I don't think any fs project
is going to want a piece of that cake. Maybe we can share the mechanism
for returning fs-specific metadata error information to a daemon, but
the structure of the data is going to be per-filesystem. And I think
the only clients are going to be written by the same fs folks for
internal purposes like starting online fsck.
60% agree: for telling most programs that "hey, something went wrong
with this file range", I think it's completely appropriate to fling that
out via the existing generic fsnotify mechanisms that ext4 wired up.
I think the same applies to sending a "your fs is broken" event via
fsnotify too, in case regular user programs decide they want to nope
out. IIRC there's already a generic notification for that too.
Fortunately the vfs hooks I wrote for xfs_healer are general enough that
I don't think it'd be difficult to wire them up to fsnotify.
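
(A sketch of what that wiring might look like is below, using the
fsnotify_sb_error() helper that ext4's error path already calls. This
is not the actual xfs_healer code, and xfs_report_wb_error() is a
hypothetical name used purely for illustration.)

#include <linux/fs.h>
#include <linux/fsnotify.h>

/*
 * Hypothetical wrapper, not actual XFS code: report a writeback
 * failure on @inode to any fanotify FAN_FS_ERROR listeners watching
 * this filesystem.
 */
static void xfs_report_wb_error(struct inode *inode, int error)
{
	fsnotify_sb_error(inode->i_sb, inode, error);
}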
> I think udev is the wrong medium for this and I'm pretty sure that the
> udev maintainers agree with me on this.
>
> I think this specific type of API would really benefit from gathering
> feedback from userspace. There's All Systems Go in Berlin in September
> and that might not be the worst time to present what you did and give a
> little demo. I'm not sure how fond you are of traveling though rn:
> https://all-systems-go.io/
I like travelling! But happily, I'll be travelling for most of
September already.
But yeah, I've wondered if it would be useful to write a generic service
that would hang around on dbus, listen for the fsnotify events, and
broadcast them to clients. I suspect that sifting through all the
containerization and idmapping stuff so that app A can't hear about
errors in app B's container might be a lot of work though.
--D
> > > Keeping unprocessed failures in an internal fs queue isn't a big
> > > deal; it's not a lot of memory, and it can be discarded on unmount.
> > > At that point we know that userspace did not care about the
> > > failure and is not going to be able to query about the failure in
> > > future, so we can just throw it away.
> > >
> > > This also allows filesystems to develop such functionality in
> > > parallel, allowing us to find commonality and potential areas for
> > > abstraction as the functionality is developed, rather than trying to
> > > come up with some generic interface that needs to support all
> > > possible things we can think of right now....
> >
> > Agreed. I don't think Ted or Jan were enthusiastic about trying to make
> > a generic fs metadata event descriptor either.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-12 3:43 ` Darrick J. Wong
@ 2025-06-12 6:29 ` Amir Goldstein
2025-07-02 18:41 ` Darrick J. Wong
0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2025-06-12 6:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christian Brauner, Dave Chinner, Yafang Shao, cem, linux-xfs,
Linux-Fsdevel, Jan Kara
On Thu, Jun 12, 2025 at 5:43 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jun 06, 2025 at 12:43:20PM +0200, Christian Brauner wrote:
> > On Mon, Jun 02, 2025 at 05:03:27PM -0700, Darrick J. Wong wrote:
> > > On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> > > > On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > > > > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > > > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > > > > blocks. After investigation, we determined that the issue was related
> > > > > > > > to writeback errors. The details are as follows:
> > > > > > > >
> > > > > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > > > > without errors.
> > > > > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > > > > since they are already clean pages.
> > > > > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > > > > the bad blocks, as the original data was never successfully written
> > > > > > > > (IOMAP_UNWRITTEN).
> > > > > > > >
> > > > > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > > > > known writeback error issue. While using fsync() after buffered
> > > > > > > > write() could mitigate the problem, this approach is impractical for
> > > > > > > > our services.
> > > > > > > >
> > > > > > > > Instead, we propose introducing configurable options to notify users
> > > > > > > > of writeback errors immediately and prevent further operations on
> > > > > > > > affected files or disks. Possible solutions include:
> > > > > > > >
> > > > > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > > > > >
> > > > > > > > These options could be controlled via mount options or sysfs
> > > > > > > > configurations. Both solutions would be preferable to silently
> > > > > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > > > > issues and can take corrective action.
> > > > > > > >
> > > > > > > > Any suggestions ?
> > > > > > >
> > > > > > > Option C: report all those write errors (direct and buffered) to a
> > > > > > > daemon and let it figure out what it wants to do:
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > > > > >
> > > > > > > Yes this is a long term option since it involves adding upcalls from the
> > > > > >
> > > > > > I hope you don't mean actual usermodehelper upcalls here because we
> > > > > > should not add any new ones. If you just mean a way to call up from a
> > > > > > lower layer then that's obviously fine.
> > > > >
> > > > > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > > > > event data (or drops it) and waits for userspace to read the queued
> > > > > events. We're not directly invoking a helper program from deep in the
> > > > > guts, that's too wild even for me. ;)
> > > > >
> > > > > > Fwiw, have you considered building this on top of a fanotify extension
> > > > > > instead of inventing your own mechanism for this?
> > > > >
> > > > > I have, at various stages of this experiment.
> > > > >
> > > > > Originally, I was only going to export xfs-specific metadata events
> > > > > (e.g. this AG's inode btree index is bad) so that the userspace program
> > > > > (xfs_healer) could initiate a repair against the broken pieces.
> > > > >
> > > > > At the time I thought it would be fun to experiment with an anonfd file
> > > > > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > > > > mess because json is easily parsed into key-value mapping objects in a
> > > > > lot of languages (that aren't C). It later turned out that formatting
> > > > > the json is rather more costly than I thought even with seq_bufs, so I
> > > > > added an alternate format that emits boring C structures.
> > > > >
> > > > > Having gone back to C structs, it would be possible (and possibly quite
> > > > > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > > > > queuing code. But that can have its own drawbacks, as Ted and I
> > > > > discovered when we discussed his patches that pushed ext4 error events
> > > > > through fanotify:
> > > > >
> > > > > For filesystem metadata events, the fine details of representing that
> > > > > metadata in a generic interface gets really messy because each
> > > > > filesystem has a different design.
> > > >
> > > > Perhaps that is the wrong approach. The event just needs to tell
> > > > userspace that there is a metadata error, and the fs specific agent
> > > > that receives the event can then pull the failure information from
> > > > the filesystem through a fs specific ioctl interface.
> > > >
> > > > i.e. the fanotify event could simply be a unique error, and that
> > > > gets passed back into the ioctl to retrieve the fs specific details
> > > > of the failure. We might not even need fanotify for this - I suspect
> > > > that we could use udev events to punch error ID notifications out to
> > > > userspace to trigger a fs specific helper to go find out what went
> > > > wrong.
> > >
> > > I'm not sure if you're addressing me or brauner, but I think it would be
> > > even simpler to retain the current design where events are queued to our
> > > special xfs anonfd and read out by userspace. Using fanotify as a "door
> > > bell" to go look at another fd is ... basically poll() but far more
> > > complicated than it ought to be. Pounding udev with events can result
> > > in userspace burning a lot of energy walking the entire rule chain.
> >
> > I don't think we need to rush any of this. My main concern is that if we
> > come up with something then I want it to be able to be used by other
> > filesystems as this seems something that is generally very useful. By
> > using fanotify we implicitly enable this which is why I'm asking.
> >
> > I don't want the outcome to be that there's a filesystem with a very
> > elaborate and detailed scheme that cannot be used by another one and
> > then we end up with slightly different implementations of the same
> > underlying concept. And so it will be impossible for userspace to
> > consume correctly even if abstracted in multiple libraries.
>
> Hrm. I 60% agree and 60% disagree with you. :D
>
> 60% disagree: for describing problems with internal filesystem metadata,
> I don't think there's a generic way to expose that outside of ugly
> stringly-parsing things like json. Frankly I don't think any fs project
> is going to want a piece of that cake. Maybe we can share the mechanism
> for returning fs-specific metadata error information to a daemon, but
> the structure of the data is going to be per-filesystem. And I think
> the only clients are going to be written by the same fs folks for
> internal purposes like starting online fsck.
>
> 60% agree: for telling most programs that "hey, something went wrong
> with this file range", I think it's completely appropriate to fling that
> out via the existing generic fsnotify mechanisms that ext4 wired up.
> I think the same applies to sending a "your fs is broken" event via
> fsnotify too, in case regular user programs decide they want to nope
> out. IIRC there's already a generic notification for that too.
>
> Fortunately the vfs hooks I wrote for xfs_healer are general enough that
> I don't think it'd be difficult to wire them up to fsnotify.
>
> > I think udev is the wrong medium for this and I'm pretty sure that the
> > udev maintainers agree with me on this.
> >
> > I think this specific type of API would really benefit from gathering
> > feedback from userspace. There's All Systems Go in Berlin in September
> > and that might not be the worst time to present what you did and give a
> > little demo. I'm not sure how fond you are of traveling though rn:
> > https://all-systems-go.io/
>
> I like travelling! But happily, I'll be travelling for most of
> September already.
>
> But yeah, I've wondered if it would be useful to write a generic service
> that would hang around on dbus, listen for the fsnotify events, and
> broadcast them to clients. I suspect that sifting through all the
> containerization and idmapping stuff so that app A can't hear about
> errors in app B's container might be a lot of work though.
>
FWIW, I would like to endorse the creation of systemd-fsnotifyd
regardless of whether it is being used to report fs errors.
If https://man.archlinux.org/man/core/systemd/systemd-mountfsd.8.en
can mount a filesystem for an unpriv container, then this container
also needs a way to request a watch on this filesystem, to be
notified on either changes, access or errors.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
2025-06-12 6:29 ` Amir Goldstein
@ 2025-07-02 18:41 ` Darrick J. Wong
0 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:41 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Dave Chinner, Yafang Shao, cem, linux-xfs,
Linux-Fsdevel, Jan Kara
On Thu, Jun 12, 2025 at 08:29:28AM +0200, Amir Goldstein wrote:
> On Thu, Jun 12, 2025 at 5:43 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jun 06, 2025 at 12:43:20PM +0200, Christian Brauner wrote:
> > > On Mon, Jun 02, 2025 at 05:03:27PM -0700, Darrick J. Wong wrote:
> > > > On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> > > > > On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > > > > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > > > > > blocks. After investigation, we determined that the issue was related
> > > > > > > > > to writeback errors. The details are as follows:
> > > > > > > > >
> > > > > > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > > > > > without errors.
> > > > > > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > > > > > since they are already clean pages.
> > > > > > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > > > > > the bad blocks, as the original data was never successfully written
> > > > > > > > > (IOMAP_UNWRITTEN).
> > > > > > > > >
> > > > > > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > > > > > known writeback error issue. While using fsync() after buffered
> > > > > > > > > write() could mitigate the problem, this approach is impractical for
> > > > > > > > > our services.
> > > > > > > > >
> > > > > > > > > Instead, we propose introducing configurable options to notify users
> > > > > > > > > of writeback errors immediately and prevent further operations on
> > > > > > > > > affected files or disks. Possible solutions include:
> > > > > > > > >
> > > > > > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > > > > > >
> > > > > > > > > These options could be controlled via mount options or sysfs
> > > > > > > > > configurations. Both solutions would be preferable to silently
> > > > > > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > > > > > issues and can take corrective action.
> > > > > > > > >
> > > > > > > > > Any suggestions ?
> > > > > > > >
> > > > > > > > Option C: report all those write errors (direct and buffered) to a
> > > > > > > > daemon and let it figure out what it wants to do:
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > > > > > >
> > > > > > > > Yes this is a long term option since it involves adding upcalls from the
> > > > > > >
> > > > > > > I hope you don't mean actual usermodehelper upcalls here because we
> > > > > > > should not add any new ones. If you just mean a way to call up from a
> > > > > > > lower layer then that's obviously fine.
> > > > > >
> > > > > > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > > > > > event data (or drops it) and waits for userspace to read the queued
> > > > > > events. We're not directly invoking a helper program from deep in the
> > > > > > guts, that's too wild even for me. ;)
> > > > > >
> > > > > > > Fwiw, have you considered building this on top of a fanotify extension
> > > > > > > instead of inventing your own mechanism for this?
> > > > > >
> > > > > > I have, at various stages of this experiment.
> > > > > >
> > > > > > Originally, I was only going to export xfs-specific metadata events
> > > > > > (e.g. this AG's inode btree index is bad) so that the userspace program
> > > > > > (xfs_healer) could initiate a repair against the broken pieces.
> > > > > >
> > > > > > At the time I thought it would be fun to experiment with an anonfd file
> > > > > > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > > > > > mess because json is easily parsed into key-value mapping objects in a
> > > > > > lot of languages (that aren't C). It later turned out that formatting
> > > > > > the json is rather more costly than I thought even with seq_bufs, so I
> > > > > > added an alternate format that emits boring C structures.
> > > > > >
> > > > > > Having gone back to C structs, it would be possible (and possibly quite
> > > > > > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > > > > > queuing code. But that can have its own drawbacks, as Ted and I
> > > > > > discovered when we discussed his patches that pushed ext4 error events
> > > > > > through fanotify:
> > > > > >
> > > > > > For filesystem metadata events, the fine details of representing that
> > > > > > metadata in a generic interface gets really messy because each
> > > > > > filesystem has a different design.
> > > > >
> > > > > Perhaps that is the wrong approach. The event just needs to tell
> > > > > userspace that there is a metadata error, and the fs specific agent
> > > > > that receives the event can then pull the failure information from
> > > > > the filesystem through a fs specific ioctl interface.
> > > > >
> > > > > i.e. the fanotify event could simply be a unique error, and that
> > > > > gets passed back into the ioctl to retrieve the fs specific details
> > > > > of the failure. We might not even need fanotify for this - I suspect
> > > > > that we could use udev events to punch error ID notifications out to
> > > > > userspace to trigger a fs specific helper to go find out what went
> > > > > wrong.
> > > >
> > > > I'm not sure if you're addressing me or brauner, but I think it would be
> > > > even simpler to retain the current design where events are queued to our
> > > > special xfs anonfd and read out by userspace. Using fanotify as a "door
> > > > bell" to go look at another fd is ... basically poll() but far more
> > > > complicated than it ought to be. Pounding udev with events can result
> > > > in userspace burning a lot of energy walking the entire rule chain.
> > >
> > > I don't think we need to rush any of this. My main concern is that if we
> > > come up with something then I want it to be able to be used by other
> > > filesystems as this seems something that is generally very useful. By
> > > using fanotify we implicitly enable this which is why I'm asking.
> > >
> > > I don't want the outcome to be that there's a filesystem with a very
> > > elaborate and detailed scheme that cannot be used by another one and
> > > then we end up with slightly different implementations of the same
> > > underlying concept. And so it will be impossible for userspace to
> > > consume correctly even if abstracted in multiple libraries.
> >
> > Hrm. I 60% agree and 60% disagree with you. :D
> >
> > 60% disagree: for describing problems with internal filesystem metadata,
> > I don't think there's a generic way to expose that outside of ugly
> > stringly-parsing things like json. Frankly I don't think any fs project
> > is going to want a piece of that cake. Maybe we can share the mechanism
> > for returning fs-specific metadata error information to a daemon, but
> > the structure of the data is going to be per-filesystem. And I think
> > the only clients are going to be written by the same fs folks for
> > internal purposes like starting online fsck.
> >
> > 60% agree: for telling most programs that "hey, something went wrong
> > with this file range", I think it's completely appropriate to fling that
> > out via the existing generic fsnotify mechanisms that ext4 wired up.
> > I think the same applies to sending a "your fs is broken" event via
> > fsnotify too, in case regular user programs decide they want to nope
> > out. IIRC there's already a generic notification for that too.
> >
> > Fortunately the vfs hooks I wrote for xfs_healer are general enough that
> > I don't think it'd be difficult to wire them up to fsnotify.
> >
> > > I think udev is the wrong medium for this and I'm pretty sure that the
> > > udev maintainers agree with me on this.
> > >
> > > I think this specific type of API would really benefit from gathering
> > > feedback from userspace. There's All Systems Go in Berlin in September
> > > and that might not be the worst time to present what you did and give a
> > > little demo. I'm not sure how fond you are of traveling though rn:
> > > https://all-systems-go.io/
> >
> > I like travelling! But happily, I'll be travelling for most of
> > September already.
> >
> > But yeah, I've wondered if it would be useful to write a generic service
> > that would hang around on dbus, listen for the fsnotify events, and
> > broadcast them to clients. I suspect that sifting through all the
> > containerization and idmapping stuff so that app A can't hear about
> > errors in app B's container might be a lot of work though.
> >
>
> FWIW, I would like to endorse the creation of systemd-fsnotifyd
> regardless of whether it is being used to report fs errors.
>
> If https://man.archlinux.org/man/core/systemd/systemd-mountfsd.8.en
> can mount a filesystem for an unpriv container, then this container
> also needs a way to request a watch on this filesystem, to be
> notified on either changes, access or errors.
I don't think it's that hard to write a userspace daemon that can listen
for filesystem errors via fanotify, but a messy part is going to be
adding hooks for all the filesystems that roll their own
pagecache/directio operations (e.g. nearly all of them).
Also, should userspace programs directly hook up to fanotify? Or should
the system daemon advertise events over dbus/varlink/whatever? It's
probably better not to have potentially large subscriber lists in the
kernel itself.
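
(For illustration, a minimal sketch of such a listener using the
FAN_FS_ERROR fanotify event (5.16+) is below. The mount point is an
assumed example, it needs CAP_SYS_ADMIN, and error handling is mostly
omitted.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(void)
{
	char buf[4096];
	/* FAN_FS_ERROR requires a group created with FAN_REPORT_FID. */
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

	if (fd < 0 || fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
				    FAN_FS_ERROR, AT_FDCWD, "/mnt/test") < 0) {
		perror("fanotify");
		return 1;
	}

	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));
		struct fanotify_event_metadata *md = (void *)buf;

		for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
			/* Walk the info records attached to this event. */
			char *p = (char *)md + md->metadata_len;
			char *end = (char *)md + md->event_len;

			while (p < end) {
				struct fanotify_event_info_header *hdr =
					(void *)p;

				if (hdr->info_type ==
				    FAN_EVENT_INFO_TYPE_ERROR) {
					struct fanotify_event_info_error *err =
						(void *)hdr;

					printf("fs error %d, seen %u times\n",
					       err->error, err->error_count);
				}
				p += hdr->len;
			}
		}
	}
}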
--D
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2025-07-02 18:41 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-29 2:50 [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Yafang Shao
2025-05-29 4:25 ` Darrick J. Wong
2025-05-29 5:55 ` Yafang Shao
2025-05-30 5:17 ` Christian Brauner
2025-05-30 15:38 ` Darrick J. Wong
2025-05-31 23:02 ` Dave Chinner
2025-06-03 0:03 ` Darrick J. Wong
2025-06-06 10:43 ` Christian Brauner
2025-06-12 3:43 ` Darrick J. Wong
2025-06-12 6:29 ` Amir Goldstein
2025-07-02 18:41 ` Darrick J. Wong
2025-06-02 5:32 ` Christoph Hellwig
2025-06-03 14:35 ` Darrick J. Wong
2025-06-03 14:38 ` Christoph Hellwig
2025-05-29 4:36 ` Dave Chinner
2025-05-29 6:04 ` Yafang Shao
2025-06-02 5:38 ` Christoph Hellwig
2025-06-02 23:19 ` Dave Chinner
2025-06-03 4:50 ` Christoph Hellwig
2025-06-03 22:05 ` Dave Chinner
2025-06-04 6:33 ` Christoph Hellwig
2025-06-05 2:18 ` Dave Chinner
2025-06-05 4:51 ` Christoph Hellwig
2025-06-02 5:31 ` Christoph Hellwig
2025-06-03 3:03 ` Yafang Shao
2025-06-03 3:13 ` Matthew Wilcox
2025-06-03 3:21 ` Yafang Shao
2025-06-03 3:26 ` Matthew Wilcox
2025-06-03 3:50 ` Yafang Shao
2025-06-03 4:40 ` Christoph Hellwig
2025-06-03 5:17 ` Damien Le Moal
2025-06-03 5:54 ` Yafang Shao
2025-06-03 6:36 ` Damien Le Moal
2025-06-03 14:41 ` Christoph Hellwig
2025-06-03 14:57 ` James Bottomley
2025-06-04 7:29 ` Damien Le Moal