On Mon, 8 Jul 2024 02:25:37 -0400 Zygo Blaxell wrote:
> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> > On 2024/6/8 11:25, Zygo Blaxell wrote:
> > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > > After this change, we now end up in an infinite loop:
> > >
> > > 1. Allocator picks a stripe with some unrecoverable csum blocks
> > >    and some free blocks
> > >
> > > 2. Writeback tries to put data in the stripe
> > >
> > > 3. rmw_rbio aborts after it can't repair the existing blocks
> > >
> > > 4. Writeback deletes the extent, often silently (the application
> > >    has to use fsync to detect it)
> > >
> > > 5. Go to step 1, where the allocator picks the same blocks again
> > >
> > > The effect is pretty dramatic--even a single unrecoverable sector in
> > > one stripe will bring an application server to its knees, constantly
> > > discarding an application's data whenever it tries to write. Once the
> > > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > > it keeps allocating that same block over and over again.
> >
> > I'm afraid the error path (no way to inform the caller) is an existing
> > problem. A buffered write can always succeed (as long as there is no
> > ENOMEM/ENOSPC etc.), but the real writeback is not ensured to succeed.
> > It doesn't even need RAID56 to trigger.
> >
> > But "discarding" the dirty pages doesn't sound correct.
> > If a writeback failed, the dirty pages should still stay dirty, not
> > be discarded.
> >
> > It may be a new bug in the error handling path.
>
> I found the code that does this. It's more than 11 years old:
>
> commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> Author: Josef Bacik
> Date: Thu Jan 31 14:58:00 2013 -0500
>
>     Btrfs: unreserve space if our ordered extent fails to work
>
>     When a transaction aborts or there's an EIO on an ordered extent or any
>     error really we will not free up the space we reserved for this ordered
>     extent.
>     This results in warnings from the block group cache cleanup in the
>     case of a transaction abort, or leaking space in the case of EIO on an
>     ordered extent. Fix this up by free'ing the reserved space if we have an
>     error at all trying to complete an ordered extent. Thanks,
>
> [...]

Before this escalates further in what is IMHO the wrong direction: I think
the current btrfs behavior is correct.

See also this paper[1], which examines write failure of buffered I/O in
different filesystems, especially Table 2. Ext4 and XFS, for example, do
not discard the page cache on write failure, but this is arguably worse,
since you then have a mismatch between what is in the cache and what is on
disk. They do not retry writing back the page cache either.

The source of confusion here is rather that write errors practically never
happen in the real world: disks do not verify that they wrote data
correctly, and neither does any layer (RAID, etc.) above them. Thus the
handling of write failure is completely untested in all applications (see
the paper again), and it seems the problems you see are due to incorrect
handling of write errors. The data loss is not silent; it's just that many
applications and scripts do not use fsync() at all.

I think the proper way of resolving this is for btrfs to retry writing the
extent, but to another (possibly clean) stripe, or perhaps to a fresh
raid5 block group altogether.

I very much approve of btrfs' current design of handling (and reporting)
write errors in the most correct way possible.

Regards,
Lukas Straub

[1] https://www.usenix.org/system/files/atc20-rebello.pdf