Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Lukas Straub <lukasstraub2@web.de>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Qu Wenruo <wqu@suse.com>, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
Date: Mon, 8 Jul 2024 10:26:32 +0200	[thread overview]
Message-ID: <20240708100927.652b2bc7@penguin> (raw)
In-Reply-To: <ZouGYZWkKM_W4hby@hungrycats.org>

[-- Attachment #1: Type: text/plain, Size: 3711 bytes --]

On Mon, 8 Jul 2024 02:25:37 -0400
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:

> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> > 在 2024/6/8 11:25, Zygo Blaxell 写道:  
> > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > > After this change, we now end up in an infinite loop:
> > > 
> > > 1.  Allocator picks a stripe with some unrecoverable csum blocks
> > > and some free blocks
> > > 
> > > 2.  Writeback tries to put data in the stripe
> > > 
> > > 3.  rmw_rbio aborts after it can't repair the existing blocks
> > > 
> > > 4.  Writeback deletes the extent, often silently (the application
> > > has to use fsync to detect it)
> > > 
> > > 5.  Go to step 1, where the allocator picks the same blocks again
> > > 
> > > The effect is pretty dramatic--even a single unrecoverable sector in
> > > one stripe will bring an application server to its knees, constantly
> > > discarding an application's data whenever it tries to write.  Once the
> > > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > > it keeps allocating that same block over and over again.  
> > 
> > I'm afraid the error path (no way to inform the caller) is an existing
> > problem. Buffered write can always success (as long as no ENOMEM/ENOSPC
> > etc), but the real writeback is not ensured to success.
> > It doesn't even need RAID56 to trigger.
> > 
> > But "discarding" the dirty pages doesn't sound correct.
> > If a writeback failed, the dirty pages should still stay dirty, not
> > discarded.
> > 
> > It may be a new bug in the error handling path.  
> 
> I found the code that does this.  It's more than 11 years old:
> 
> commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> Author: Josef Bacik <jbacik@fusionio.com>
> Date:   Thu Jan 31 14:58:00 2013 -0500
> 
>     Btrfs: unreserve space if our ordered extent fails to work
> 
>     When a transaction aborts or there's an EIO on an ordered extent or any
>     error really we will not free up the space we reserved for this ordered
>     extent.  This results in warnings from the block group cache cleanup in the
>     case of a transaction abort, or leaking space in the case of EIO on an
>     ordered extent.  Fix this up by free'ing the reserved space if we have an
>     error at all trying to complete an ordered extent.  Thanks,
> 
> [...]

Before this escalates further in IMHO the wrong direction:

I think the current btrfs behavior correct. See also this paper[1] that
examines write failure of buffered io in different filesystems.
Especially Table 2. Ext4 and xfs for example do not discard the page
cache on write failure, but this is worse since now you have a mismatch
of what is in the cache and what is on disk. They do not retry to write
back the page cache.

The source of confusion here is rather that write errors do not happen
in the real world: Disks do not verify if they wrote data correctly and
neither does any layer (raid, etc.) above it.

Thus handling of write failure is completely untested in all
applications (See the paper again) and it seems the problems you see
are due to wrongly handling of write errors. The data loss is not
silent it's just that many applications and scripts do not use fsync()
at all.

I think the proper way of resolving this is for btrfs to retry writing
the extent, but to another (possibly clean) stripe. Or perhaps a fresh
raid5 block group altogether.

I very much approve btrfs' current design of handling (and reporting)
write errors in the most correct way possible.

Regards,
Lukas Straub


[1] https://www.usenix.org/system/files/atc20-rebello.pdf

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

next prev parent reply	other threads:[~2024-07-08  8:26 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-01  3:24 raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Zygo Blaxell
2024-06-01  7:52 ` Qu Wenruo
2024-06-08  1:55   ` Zygo Blaxell
2024-06-08  3:20     ` Qu Wenruo
2024-07-08  6:25       ` Zygo Blaxell
2024-07-08  8:26         ` Lukas Straub [this message]
2024-07-08 20:22           ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240708100927.652b2bc7@penguin \
    --to=lukasstraub2@web.de \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox