From: Christoph Hellwig <hch@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Yafang Shao <laoar.shao@gmail.com>,
Christian Brauner <brauner@kernel.org>,
djwong@kernel.org, cem@kernel.org, linux-xfs@vger.kernel.org,
Linux-Fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption
Date: Tue, 3 Jun 2025 23:33:05 -0700
Message-ID: <aD_oobAbOs7m8PFN@infradead.org>
In-Reply-To: <aD9xj8cwfY9ZmQ2B@dread.disaster.area>
On Wed, Jun 04, 2025 at 08:05:03AM +1000, Dave Chinner wrote:
> >
> > Everything. ENOSPC means there is no space. There might be space in
> > the indeterminate future, but if the layer just needs to GC it must
> > not report the error.
>
> GC of thin pools requires the filesystem to be mounted so fstrim can
> be run to tell the thinp device where all the free LBA regions it
> can reclaim are located. If we shut down the filesystem instantly
> when the pool goes ENOSPC on a metadata write, then *we can't run
> fstrim* to free up unused space and hence allow that metadata write
> to succeed in the future.
>
> It should be obvious at this point that a filesystem shutdown on an
> ENOSPC error from the block device on anything other than journal IO
> is exactly the wrong thing to be doing.
How high are the chances that you hit exactly the rare metadata
writeback I/O, and not journal or data I/O, for this odd condition
that requires user interaction?  And where is this weird model, in
which a storage device returns an out of space error that only manual
user interaction (manual rather than online trim) will fix, even
documented?
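(For reference, fstrim(8) is nothing magic; it boils down to a FITRIM
ioctl against the mounted filesystem, roughly like the sketch below.
The mount point and range values are illustrative only.)

/*
 * Rough sketch of what fstrim(8) does: open any file or directory on
 * the mounted filesystem and issue FITRIM, so the fs can tell the
 * underlying (thinly provisioned) device which LBA ranges are free.
 * Mount point and range values are illustrative only.
 */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(void)
{
	struct fstrim_range range = {
		.start	= 0,
		.len	= ULLONG_MAX,	/* whole filesystem */
		.minlen	= 0,
	};
	int fd = open("/mnt/thinvol", O_RDONLY);

	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}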
> > Normally it means your checksum was wrong. If you have bit errors
> > in the cable they will show up again, maybe not on the next I/O
> > but soon.
>
> But it's unlikely to be hit by another cosmic ray anytime soon, and
> so bit errors caused by completely random environmental events
> should -absolutely- be retried as the subsequent write retry will
> succeed.
>
> If there is a dodgy cable causing the problems, the error will
> re-occur on random IOs and we'll emit write errors to the log that
> monitoring software will pick up. If we are repeatedly issuing write
> errors due to EILSEQ errors, then that's a sign the hardware needs
> replacing.
Umm, all the storage protocols do have pretty good checksums.  A cosmic
ray isn't going to fail them; what does is something more fundamental,
like broken hardware or connections.  In other words, you are going to
see this error again and again, pretty frequently.
> There is no risk to filesystem integrity if write retries
> succeed, and that gives the admin time to schedule downtime to
> replace the dodgy hardware. That's much better behaviour than
> unexpected production system failure in the middle of the night...
>
> It is because we have robust and resilient error handling in the
> filesystem that the system is able to operate correctly in these
> marginal situations. Operating in marginal conditions or as hardware
> is beginning to fail is necessary to keep production systems
> running until corrective action can be taken by the administrators.
I'd really like to see a formal writeup of your theory of robust error
handling, where that robustness is centered around the fairly rare
case of metadata writeback and applications dealing with I/O errors,
while journal write errors and read errors lead to shutdown.  Maybe
I'm missing something important, but the theory does not sound valid,
and we don't have any testing framework that actually verifies it.
> Failing to recognise that transient and "maybe-transient" errors can
> generally be handled cleanly and successfully with future write
> retries leads to brittle, fragile systems that fall over at the
> first sign of anything going wrong. Filesystems that are targeted
> at high value production systems and/or running mission critical
> applications need to have resilient and robust error handling.
What known transient errors do you think XFS (or any other file system)
actually handles properly?  And where is the contract that these errors
actually are transient?
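(For reference, about the only policy XFS has here today is the
per-error-class metadata writeback configuration in sysfs; knob names
as in Documentation/admin-guide/xfs.rst, values illustrative:

  /sys/fs/xfs/<dev>/error/fail_at_unmount
  /sys/fs/xfs/<dev>/error/metadata/default/max_retries
  /sys/fs/xfs/<dev>/error/metadata/default/retry_timeout_seconds
  /sys/fs/xfs/<dev>/error/metadata/EIO/max_retries
  /sys/fs/xfs/<dev>/error/metadata/ENOSPC/max_retries
  /sys/fs/xfs/<dev>/error/metadata/ENODEV/max_retries

where, IIRC, a max_retries of -1 means retry forever and 0 means fail
immediately.  That's a tuning knob, not a contract about which errors
are transient.)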
> > And even applications that fsync won't see your fancy error code. The
> > only thing stored in the address_space for fsync to catch is EIO and
> > ENOSPC.
>
> The filesystem knows exactly what the IO error reported by the block
> layer is before we run folio completions, so we control exactly what
> we want to report as IO completion status.
Sure, you could invent a scheme to propagate the exact error.  For
direct I/O we even return the exact error to userspace.  But that
requires that we actually have a definition of what each error means,
and how it could be handled.  None of that exists right now.  We could
do all this, but that assumes you actually have:
a) a clear definition of a problem
b) a good way to fix that problem
c) good testing infrastructure to actually test it, because without
that all good intentions will probably cause more problems than
they solve
> Hence the bogosities of error propagation to userspace via the
> mapping are completely irrelevant to this discussion/feature because
> it would be implemented below the layer that squashes the eventual
> IO errno into the address space...
How would you implement and test all this?  And for what use case?
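To be concrete about the squashing being referred to: buffered
writeback records errors on the mapping roughly as below (simplified
from mapping_set_error() in include/linux/pagemap.h; details vary
between kernel versions).  On the flag-based path that fsync ends up
checking, everything except -ENOSPC is remembered as plain AS_EIO, so
the exact block layer errno is gone by then; direct I/O is different
only because the completion errno is handed straight back to the
caller.

/*
 * Simplified from mapping_set_error() in include/linux/pagemap.h
 * (recent kernels; exact details vary by version).
 */
static inline void mapping_set_error(struct address_space *mapping, int error)
{
	if (likely(!error))
		return;

	/* errseq_t based tracking for file_check_and_advance_wb_err() users */
	__filemap_set_wb_err(mapping, error);

	/* also record it on the superblock for syncfs() reporting */
	if (mapping->host)
		errseq_set(&mapping->host->i_sb->s_wb_err, error);

	/* legacy flag-based path: everything but ENOSPC is squashed to EIO */
	if (error == -ENOSPC)
		set_bit(AS_ENOSPC, &mapping->flags);
	else
		set_bit(AS_EIO, &mapping->flags);
}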