Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Christian Brauner <brauner@kernel.org>,
	Yafang Shao <laoar.shao@gmail.com>,
	cem@kernel.org, linux-xfs@vger.kernel.org,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
Date: Mon, 2 Jun 2025 17:03:27 -0700
Message-ID: <20250603000327.GM8328@frogsfrogsfrogs>
In-Reply-To: <aDuKgfi-CCykPuhD@dread.disaster.area>

On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > Hello,
> > > > > 
> > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > blocks. After investigation, we determined that the issue was related
> > > > > to writeback errors. The details are as follows:
> > > > > 
> > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > without errors.
> > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > since they are already clean pages.
> > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > the bad blocks, as the original data was never successfully written
> > > > > (IOMAP_UNWRITTEN).
> > > > > 
> > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > known writeback error issue. While using fsync() after buffered
> > > > > write() could mitigate the problem, this approach is impractical for
> > > > > our services.
> > > > > 
> > > > > Instead, we propose introducing configurable options to notify users
> > > > > of writeback errors immediately and prevent further operations on
> > > > > affected files or disks. Possible solutions include:
> > > > > 
> > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > > 
> > > > > These options could be controlled via mount options or sysfs
> > > > > configurations. Both solutions would be preferable to silently
> > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > issues and can take corrective action.
> > > > > 
> > > > > Any suggestions ?
> > > > 
> > > > Option C: report all those write errors (direct and buffered) to a
> > > > daemon and let it figure out what it wants to do:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > > 
> > > > Yes this is a long term option since it involves adding upcalls from the
> > > 
> > > I hope you don't mean actual usermodehelper upcalls here because we
> > > should not add any new ones. If you just mean a way to call up from a
> > > lower layer then that's obviously fine.
> > 
> > Correct.  The VFS upcalls to XFS on some event, then XFS queues the
> > event data (or drops it) and waits for userspace to read the queued
> > events.  We're not directly invoking a helper program from deep in the
> > guts, that's too wild even for me. ;)
> > 
> > > Fwiw, have you considered building this on top of a fanotify extension
> > > instead of inventing your own mechanism for this?
> > 
> > I have, at various stages of this experiment.
> > 
> > Originally, I was only going to export xfs-specific metadata events
> > (e.g. this AG's inode btree index is bad) so that the userspace program
> > (xfs_healer) could initiate a repair against the broken pieces.
> > 
> > At the time I thought it would be fun to experiment with an anonfd file
> > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > mess because json is easily parsed into key-value mapping objects in a
> > lot of languages (that aren't C).  It later turned out that formatting
> > the json is rather more costly than I thought even with seq_bufs, so I
> > added an alternate format that emits boring C structures.
> > 
> > Having gone back to C structs, it would be possible (and possibly quite
> > nice) to migrate to fanotify so that I don't have to maintain a bunch of
> > queuing code.  But that can have its own drawbacks, as Ted and I
> > discovered when we discussed his patches that pushed ext4 error events
> > through fanotify:
> > 
> > For filesystem metadata events, the fine details of representing that
> > metadata in a generic interface get really messy because each
> > filesystem has a different design.
> 
> Perhaps that is the wrong approach. The event just needs to tell
> userspace that there is a metadata error, and the fs specific agent
> that receives the event can then pull the failure information from
> the filesystem through a fs specific ioctl interface.
> 
> i.e. the fanotify event could simply be a unique error, and that
> gets passed back into the ioctl to retrieve the fs specific details
> of the failure. We might not even need fanotify for this - I suspect
> that we could use udev events to punch error ID notifications out to
> userspace to trigger a fs specific helper to go find out what went
> wrong.

I'm not sure if you're addressing me or brauner, but I think it would be
even simpler to retain the current design where events are queued to our
special xfs anonfd and read out by userspace.  Using fanotify as a "door
bell" to go look at another fd is ... basically poll() but far more
complicated than it ought to be.  Pounding udev with events can result
in userspace burning a lot of energy walking the entire rule chain.
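
To illustrate the "queue events to an fd, let userspace poll() and read
them" pattern described above, here is a minimal userspace sketch.  It is
an assumption-laden illustration only: the record layout (EVENT_FMT), the
drain_events() helper, and the use of a pipe as a stand-in for the anonfd
are all hypothetical and are not the actual XFS health-monitoring ABI.

```python
import os
import select
import struct

# Hypothetical fixed-size event record; NOT the real XFS event format.
# Layout: u32 event type, u32 flags, u64 inode, u64 file offset, u64 length.
EVENT_FMT = "=IIQQQ"
EVENT_SIZE = struct.calcsize(EVENT_FMT)


def drain_events(fd, timeout_ms=1000):
    """Wait until fd is readable, then decode every whole record available."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = []
    if not poller.poll(timeout_ms):
        return events  # timed out, nothing queued
    data = os.read(fd, EVENT_SIZE * 64)
    # Walk the buffer one fixed-size record at a time, ignoring a short tail.
    for off in range(0, len(data) - EVENT_SIZE + 1, EVENT_SIZE):
        events.append(struct.unpack_from(EVENT_FMT, data, off))
    return events


# A pipe stands in for the kernel-provided event fd in this sketch.
rfd, wfd = os.pipe()
os.write(wfd, struct.pack(EVENT_FMT, 1, 0, 133, 4096, 8192))
os.write(wfd, struct.pack(EVENT_FMT, 2, 0, 134, 0, 4096))
result = drain_events(rfd)
os.close(rfd)
os.close(wfd)
```

The point of the sketch is that the consumer needs only one fd and one
poll() loop: readiness and payload arrive on the same descriptor, with no
second lookup step after a separate notification.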

> Keeping unprocessed failures in an internal fs queue isn't a big
> deal; it's not a lot of memory, and it can be discarded on unmount.
> At that point we know that userspace did not care about the
> failure and is not going to be able to query about the failure in
> future, so we can just throw it away.
> 
> This also allows filesystems to develop such functionality in
> parallel, allowing us to find commonality and potential areas for
> abstraction as the functionality is developed, rather than trying to
> come up with some generic interface that needs to support all
> possible things we can think of right now....

Agreed.  I don't think Ted or Jan were enthusiastic about trying to make
a generic fs metadata event descriptor either.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

