From: Yafang Shao <laoar.shao@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christian Brauner <brauner@kernel.org>,
	djwong@kernel.org, cem@kernel.org,  linux-xfs@vger.kernel.org,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
Date: Thu, 29 May 2025 14:04:50 +0800
Message-ID: <CALOAHbBNMM-pZD+8+7SQ7EyWZCbYSFHpvBzjewDYh_ZWEmz46w@mail.gmail.com>
In-Reply-To: <aDfkTiTNH1UPKvC7@dread.disaster.area>

On Thu, May 29, 2025 at 12:36 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Hello,
> >
> > Recently, we encountered data loss when using XFS on an HDD with bad
> > blocks. After investigation, we determined that the issue was related
> > to writeback errors. The details are as follows:
> >
> > 1. Process-A writes data to a file using buffered I/O and completes
> > without errors.
> > 2. However, during the writeback of the dirtied pagecache pages, an
> > I/O error occurs, causing the data to fail to reach the disk.
> > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > since they are already clean pages.
> > 4. When Process-B reads the same file, it retrieves zeroed data from
> > the bad blocks, as the original data was never successfully written
> > (IOMAP_UNWRITTEN).
> >
> > We reviewed the related discussion [0] and confirmed that this is a
> > known writeback error issue. While using fsync() after buffered
> > write() could mitigate the problem, this approach is impractical for
> > our services.
>
> Really, that's terrible application design.  If you aren't checking
> that data has been written successfully, then you get to keep all
> the broken and/or missing data bits to yourself.

It’s difficult to justify this.
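
For reference, the check being asked for above can be sketched from
userspace as follows. This is only a minimal illustration, assuming a
service that can afford an fsync() per file; the helper name
write_durably is hypothetical and not taken from any existing code.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and raise OSError if it did not reach storage.

    Hypothetical helper for illustration. A buffered write() can succeed
    while the later writeback fails; fsync() is the point where that
    failure is reported back to the caller.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Raises OSError (e.g. errno EIO) if writeback of this file failed.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would wrap write_durably() in try/except OSError and treat a
failure as "the data is not on disk", for example by rewriting the file
elsewhere rather than assuming a later read returns what was written.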

>
> However, with that said, some history.
>
> XFS used to keep pages that had IO errors on writeback dirty so they
> would be retried at a later time and couldn't be reclaimed from
> memory until they were written. This was historical behaviour from
> Irix and designed to handle SAN environments where multipath
> fail-over could take several minutes.
>
> In these situations writeback could fail for several attempts before
> the storage timed out and came back online. Then the next write
> retry would succeed, and everything would be good. Linux never gave
> us a specific IO error for this case, so we just had to retry on EIO
> and hope that the storage came back eventually.
>
> This is different to traditional Linux writeback behaviour, which is
> what is implemented now via iomap. There are good reasons for this
> model:
>
> - a filesystem with a dirty page that can't be written and cleaned
>   cannot be unmounted.
>
> - having large chunks of memory that cannot be cleaned and
>   reclaimed has adverse impact on system performance.
>
> - the system can potentially hang if the page cache is dirtied
>   beyond write throttling thresholds and then the device is yanked.
>   Now none of the dirty memory can be cleaned, and all new writes
>   are throttled....

I previously considered whether we could avoid clearing PG_writeback
for these pages. To handle unwritten pagecache pages more safely, we
could maintain their PG_writeback status and introduce a new
PG_write_error flag. This would explicitly mark pages that failed disk
writes, allowing the reclaim mechanism to skip them and avoid
potential deadlocks.

>
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> Go look at /sys/fs/xfs/<dev>/error/metadata/... and the configurable
> error handling behaviour implemented through this interface.
>
> Essentially, XFS metadata behaves as "retry writes forever and hang on
> unmount until write succeeds" by default. i.e. similar to the old
> data IO error behaviour. The "hang on unmount" behaviour can be
> turned off by /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can
> configure different failure handling policies for different types
> of IO error. e.g. fail-fast on -ENODEV (e.g. device was unplugged
> and is never coming back so shut the filesystem down),
> retry-for-a-while on -ENOSPC (e.g. dm-thinp pool has run out of space,
> so give some time for the pool to be expanded before shutting down)
> and retry-once on -EIO (to avoid random spurious hardware failures
> from shutting down the fs) and everything else uses the configured
> default behaviour....

Thank you for your clear guidance and detailed explanation.
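
To make the example policy above concrete, here is a rough sketch of
how it might be applied through the existing metadata interface. It
assumes the max_retries and retry_timeout_seconds knobs under
/sys/fs/xfs/<dev>/error/metadata/<errno>/ as described in the kernel's
XFS admin guide; the helper names and the 300 second timeout are
illustrative assumptions, not recommendations.

```python
from pathlib import Path

XFS_SYSFS_ROOT = Path("/sys/fs/xfs")


def error_knob(dev, errno_name, knob, io_class="metadata", root=XFS_SYSFS_ROOT):
    """Return the sysfs path of one per-error knob (hypothetical helper)."""
    return Path(root) / dev / "error" / io_class / errno_name / knob


def set_knob(path, value):
    """Write an integer to a knob file; on a real system this needs root."""
    Path(path).write_text(f"{value}\n")


def apply_example_policy(dev, root=XFS_SYSFS_ROOT):
    """Apply the fail-fast / retry-for-a-while / retry-once example."""
    settings = {
        # ENODEV: the device is gone, so fail immediately (0 retries).
        error_knob(dev, "ENODEV", "max_retries", root=root): 0,
        # ENOSPC: keep retrying for a while; 300 s is an arbitrary value.
        error_knob(dev, "ENOSPC", "retry_timeout_seconds", root=root): 300,
        # EIO: retry once before propagating the error.
        error_knob(dev, "EIO", "max_retries", root=root): 1,
        # Do not hang at unmount waiting for retries to succeed.
        Path(root) / dev / "error" / "fail_at_unmount": 1,
    }
    for path, value in settings.items():
        set_knob(path, value)
    return settings
```

The io_class parameter defaults to "metadata" because that is the only
class the quoted text describes as implemented; it is left as a
parameter purely for illustration.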

>
> There's also good reason the sysfs error hierarchy is structured the
> way it is - it leaves open the option for expanding the error
> handling policies to different IO types (i.e. data and metadata). It
> even allows different policies for different types of data devices
> (e.g. RT vs data device policies).
>
> So, go look at how the error configuration code in XFS is handled,
> consider extending that to /sys/fs/xfs/<dev>/error/data/.... to
> allow different error handling policies for different types of
> data writeback IO errors.

That aligns perfectly with our expectations.

>
> Then you'll need to implement those policies through the XFS and
> iomap IO paths...

I will analyze how to implement this effectively.

-- 
Regards
Yafang

