From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from drax.kayaks.hungrycats.org (drax.kayaks.hungrycats.org [174.142.148.226]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 7BE5A1946F for ; Mon, 8 Jul 2024 06:31:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=174.142.148.226 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720420314; cv=none; b=vDQNGdkuFeteBv4xpGhB18NaEbtjR2yCHl1nIfyIx7bZkg953dLo37gNzHrr7ElBRGb0pTpNQUAonXOIs/buI76x919pVXFRvAbsBsXqNbXJiLQzN5eIVmCOp7gikE7Fig7t5ksNYctq/uwid3/EOfnuJNLkqIVLBenB5y7MD+o= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720420314; c=relaxed/simple; bh=4dpKF/F4iUSKC+qmxGuR0eOJ7mR60AETIbwAzcx8ss0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=P6+XyQQHQkKSRZtIeBnDmsfV3P3RjlAyA+gxxSmKEfjk9zHQvpoJY3RhDVL2jlOFw0Az0QjR9t6jnowHIKILLLQXgJY5bP7mogJ6MCaNceKWl9f6IvqN8C1pHt8yqMI4NJu+Zj7FJcmujZqKPPXcZGtx/GK8dmhDAxl2ytlxjak= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=umail.furryterror.org; spf=pass smtp.mailfrom=drax.hungrycats.org; arc=none smtp.client-ip=174.142.148.226 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=umail.furryterror.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=drax.hungrycats.org Received: by drax.kayaks.hungrycats.org (Postfix, from userid 1002) id 7FA1AEB05CE; Mon, 8 Jul 2024 02:25:37 -0400 (EDT) Date: Mon, 8 Jul 2024 02:25:37 -0400 From: Zygo Blaxell To: Qu Wenruo Cc: Qu Wenruo , linux-btrfs@vger.kernel.org Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Message-ID: References: <5a8c1fbf-3065-4cea-9cf9-48e49806707d@suse.com> Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <5a8c1fbf-3065-4cea-9cf9-48e49806707d@suse.com> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote: > 在 2024/6/8 11:25, Zygo Blaxell 写道: > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote: > > After this change, we now end up in an infinite loop: > > > > 1. Allocator picks a stripe with some unrecoverable csum blocks > > and some free blocks > > > > 2. Writeback tries to put data in the stripe > > > > 3. rmw_rbio aborts after it can't repair the existing blocks > > > > 4. Writeback deletes the extent, often silently (the application > > has to use fsync to detect it) > > > > 5. Go to step 1, where the allocator picks the same blocks again > > > > The effect is pretty dramatic--even a single unrecoverable sector in > > one stripe will bring an application server to its knees, constantly > > discarding an application's data whenever it tries to write. Once the > > allocator reaches the point where the "next" block is in a bad rmw stripe, > > it keeps allocating that same block over and over again. > > I'm afraid the error path (no way to inform the caller) is an existing > problem. Buffered write can always success (as long as no ENOMEM/ENOSPC > etc), but the real writeback is not ensured to success. > It doesn't even need RAID56 to trigger. > > But "discarding" the dirty pages doesn't sound correct. > If a writeback failed, the dirty pages should still stay dirty, not > discarded. > > It may be a new bug in the error handling path. I found the code that does this. It's more than 11 years old: commit 0bec9ef525e33233d7739b71be83bb78746f6e94 Author: Josef Bacik Date: Thu Jan 31 14:58:00 2013 -0500 Btrfs: unreserve space if our ordered extent fails to work When a transaction aborts or there's an EIO on an ordered extent or any error really we will not free up the space we reserved for this ordered extent. This results in warnings from the block group cache cleanup in the case of a transaction abort, or leaking space in the case of EIO on an ordered extent. Fix this up by free'ing the reserved space if we have an error at all trying to complete an ordered extent. Thanks, > > This case > > is broken in the patch. The data in the unrecoverable blocks is lost > > already, but we can still write the new data in the stripe and update > > the parity to recover the new blocks if we go into degraded mode in > > the future. > > We should only continue if our write range covers the bad sector, but that's > impossible as that breaks DATACOW. > > I can do extra handling, like if the write range covers a bad sector, we do > not try to recover that one. > > But I strongly doubt if that would have any meaning, since that means we're > writing into a sector which already has csum. The point here is to: - keep the P/Q in sync with the a consistent view of data in the stripe, even when some of that data is known to be uncorrectable - not overwrite any block in the stripe unless we are sure its contents are correct or don't matter - always write out the new data and matching P/Q in rmw_rbio i.e. if we fail to read a block, or if its csum fails, and we can't reconstruct the block, we must not write the block out to disk because that will prevent a read retry from succeeding in the future. We must still update P/Q because we modify other blocks in the stripe. The only thing P/Q needs to work is some repeatable data that we can read from the stripe in the future. P/Q doesn't have to be correct wrt the csums, only with the blocks. As long as P/Q matches the other blocks, then any block in the stripe can be reconstructed to its current value. Note we only need to do this for rmw, because we _have_ to finish rmw to avoid data loss from not making a best effort to do the write. Scrub and read can report the problem and move on to the next stripe, because there's no write to be lost in the rbio. > > 6. We can't read enough of the blocks to attempt repair: we can't > > compute parity any more, because we don't know what data is in the > > missing blocks. We can't assume it's some constant value (e.g. zero) > > because then the new parity would be wrong if some but not all of the > > failing drives came back. > > > > For case 1-5, we can ignore the return value of verify_one_sector() > > (or at least not pass it up to the caller of recover_vertical()). > > Nope, that falls back to the old behavior, a corrupted section would only > spread to P/Q. Yeah, I tested this and it's the wrong point in the logic. Another problem with doing it there is that code is shared with scrub, and we don't want scrub to be doing best-effort, we want scrub to be perfect or do nothing. A better place to do it is when recover_blocks() returns. This is specific to the rmw_rbio code, so it doesn't affect other parts of btrfs that can afford to stop when there's an uncorrectable sector. > We can not do the data loss decision for the end users. There is no decision that doesn't have at least potential data loss. RMW is writing data. If our RMW bbio chooses not to write data when it's possible to do so, then we've actively made the decision to lose data. In the current architecture, RMW must make a best effort to correct any correctable errors before modifying the stripe, but if that effort fails, RMW still has to update the stripe anyway, so that one bad stripe doesn't break the entire filesystem. The alternative is to not do RMW at all (or at least remove the dependency on it), which is a much larger problem to solve. There's a few ways to do that: 1. In btrfs_finish_one_ordered, instead of dropping the failed extent, try to allocate it somewhere else (hopefully in a stripe without an uncorrectable sector, or just switch to a full-stripe allocation for that specific case), and restart the write operation. That trades extra disk space for availability. We'd need some way to remember where writes have failed so the allocator isn't constantly trying to put new data on broken stripes. Also it means that we can end up silently dropping data due to ENOSPC instead of EIO (i.e. we try to write to the last available sectors and they're in a broken stripe) (similar to another bug in 6.6, now fixed). 2. Avoid RMW writes in the allocator. This requires some aggressive merging of writes so that we fill up stripes (that would be nice to have anyway as a pure speed optimization for raid5, as the current code will happily write 4 64K files in 4 RMW operations instead of packing them all into one full-stripe write). We'd need bg reclaim to free up space in partially filled stripes (i.e. raid5 would effectively behave like a zoned device). We'd still need RMW for nodatacow files, but nodatasum files can't ever have corrupt sectors so we can always update P/Q when we write a stripe. Prealloc wouldn't work. We might return ENOSPC, but we'll do that at write() time, not in writeback, so userspace will be informed and we won't silently discard any data. 3. Use raid-stripe-tree. AFAICT raid-stripe-tree is basically option 2, but implemented between the extent layer and the device layer so it's a less invasive change wrt the rest of btrfs. raid-stripe-tree doesn't support prealloc or nodatacow either. > Any shorter reproducer? 1. Create a -draid5 -mraid1 filesystem from 5 devices: 2x 750G, 2x 700G. The larger devices will have all the metadata on them, so we can corrupt everything on a small device and not worry about clobbering metadata. 2. Put some data on the filesystem, delete every second file to make a lot of small holes in the raid5 stripes. 3. Corrupt the two smaller devices (the ones without metadata) with your choice of garbage data (leave the first 1M intact for the superblock): yes | dd of=dev-3 iflag=fullblock bs=1024k yes | dd of=dev-4 iflag=fullblock bs=1024k 4. On the filesystem, create a new directory and use it normally. I like 'git clone' for this, because it does a lot of fsyncs and sha1 hash integrity checking, and every kernel developer already has it installed. git clone http://github.com/kdave/btrfs-progs.git git-test The git clone, git gc, commits, etc. should all run without error, and the number of errors in files in the rest of the filesystem should be stable, i.e. writes work for new data, and old data isn't (more) corrupted by the writes. > If I understand you correct, the problem only happens when we have all the > following conditions met: > > - the sector has csum > Or we won't very the content anyway > > - both the original and repaired contents mismatch csum > > This already means there are errors beyond the RAID5 tolerance. Yes, all the problems arise when one stripe goes beyond RAID5 tolerance. It doesn't look like it matters whether it's read or csum failure, but csum failure is much more common due to earlier bugs in btrfs raid5 leaving broken stripes on disk for later kernels to trip over, or a replaced drive combined with an uncorrected write hole error on a second drive. If the errors are within RAID5 tolerance, it all works perfectly: the "best effort to repair" is a completely successful repair, so recover_blocks returns 0, we don't prematurely exit from rmw_rbio, and writes work properly. > So that's something else in v6.2 causing beyond-raid5-tolerance. > And the data drop problem is definitely a big problem but I do not think > it's really RAID56 specific. It's raid56 specific because only raid56 has data block writes which depend on the outcome of data block reads, i.e. a write fails because of something that happens during a read. No other profile has that possibility. If I run the test case with raid1 or single data instead of raid5, there will never be a data loss on a write because of a csum failure, because the write bbios simply never try to read anything. Even raid5 full-stripe writes can't fail that way, because they don't mix read operations into their write bbios. (I'm excluding metadata here--there are plenty of dependencies between data writes and metadata reads, but they aren't very interesting. If metadata is broken then the entire filesystem is broken.) > > On 6.6, all the automation was broken by the regression: when we replaced > > a broken file or added a new one, the new file was almost always silently > > corrupted, > > A quick search through the data writeback path indeed shows we do nothing > extra when hitting a writeback error. That's because it's not in the writeback path--it's in the bio completions for ordered data. This is the sequence of events: 1. btrfs allocator places an extent within an existing raid5 data stripe that contains some unreadable or corrupt data blocks. 2. rmw_rbio is called to perform a rmw operation on the stripe. 3. rmw_read_wait_recover attempts to find all csums in the stripe. 4. rmw_read_wait_recover issues read requests to submit_read_wait_bio_list to fill the stripe in memory. 5. raid_wait_read_end_io sets error bits in the rbio error_bitmap for read failures, and passes successful reads to verify_bio_data_sectors. 6. verify_bio_data_sectors verifies those sectors which have csums (i.e. not the free sectors, nodatasum files, or the P/Q sector(s)). Sectors that have mismatched csums are marked in the rbio error_bitmap. 7. back to rmw_read_wait_recover, which calls recover_sectors. 8. recover_sectors attempts to repair any sectors with a bit set in error_bitmap. If there are too many bad sectors, recover_sectors exits early without doing anything. After reconstructing the bad sector contents from remaining data and P/Q blocks, the csums of the repaired sectors are verified with verify_one_sector(). 9. if recover_sectors is unable to repair all sectors in the stripe, recover_sectors returns an error to rmw_read_wait_recover, which returns the error to rmw_rbio. 10. when rmw_read_wait_recover returns an error to rmw_rbio, rmw_rbio sets bi_status to IOERR on all the bios in the stripe, and returns without performing the parity update and write operations originally requested at step 2. 11. the rbio work function finishes 12. the btrfs_finish_ordered_io work function starts 13. btrfs_finish_one_ordered notices the error in bi_status on the bios 14. btrfs_finish_one_ordered calls btrfs_mark_ordered_extent_error, so syncfs() and fsync() can return EIO. 15. btrfs_finish_one_ordered calls btrfs_drop_extent_map_range to delete the extent from the filesystem. 16. the blocks in the uncorrectable stripe are now free to be allocated again, so return to step 1 and do it all again in the next transaction. Note that this is a loop! So we don't lose one extent of data and then move on--we lose one extent of data _in each transaction_, and then set up the next transaction to fail in the same way. This continues as long as an uncorrectable sector exists in one of the first available free spaces. Also note that after a while, the allocator fills up all the good stripes between the bad stripes, but it can't fill the bad stripes because they keep getting returned to free space at the end of each transaction. If there are a lot of bad stripes on the filesystem, bad stripes move to the the head of the free space list where they accumulate over time. We might end up dropping hundreds or thousands of extents before we finally allocate something in a good stripe. During my original failure event I did notice that these dropped writes seemed to come in "waves", but I couldn't reproduce that effect. Now I know how to trigger it. > I think we should change the behavior of clearing dirty flags, so that the > dirty page flag only got cleared if the writeback is successful. The entire page is deleted, so it stops having any flags. > This change is going to be huge Yeah, all the real solutions are far more development work, and have more ways to fail, than handing everything with a better best-effort in RMW. In the short term, it's more important to fix the RMW regression as soon as possible, so that users with raid5 filesystems can upgrade from 6.1 to later kernels without having worse data loss problems than 6.1 had in the supported cases (drive failures <= tolerance). The most straightforward way to do that is to handle everything we can within rmw_rbio. Once the larger fixes are in place, so that btrfs can handle a dropped data write or so that btrfs doesn't need to do RMW any more, we can then go back to strictly failing RMW on uncorrectable raid56 stripes. It's too early to do that now. Best-effort RMW has low marginal risk of data loss, whereas aborting RMW absolutely guarantees data loss. As you point out, none of the data loss cases arise when the stripe is within correctable tolerances, i.e. you need to start with too many failures so that you have uncorrectable sectors, and we don't claim that btrfs can recover data in that case even without RMW updates. If errors are within tolerance, they're correctable, so rmw_rbio will simply correct them. I propose 3 phases of fix: 1. One-line patch to ignore recover_blocks return value, to get raid56 writes working in the presence of uncorrectable sectors again. 2. Fix up the rbio handling so that we can compute P/Q for RMW stripes even with uncorrectable sectors, so that we don't make a mess on uncorrectable sectors (discussed in more detail below). 3. Fix the upper layers of btrfs so that we can have write bbios fail and not lose data in general. > and I have to check other fses to be extra > safe. No other filesystem tries to do parity RMW writes with data csums, because that's a bad system design choice that they were able to avoid making. Only btrfs has btrfs problems. ;) Other filesystems use strictly immutable extents with parity embedded within them (ZFS), or they use allocators that aggressively coalesce allocations and writes so that all parity raid writes are full-stripe writes with garbage collection (bcachefs). Parity raid block devices don't have csums. They don't have to handle corrupted block because they don't know if any block is corrupted or not. They simply update the P/Q blocks whenever enough devices are readable, and go completely read-only if too many are not readable. > But even with that fixed, the do-not-touch-bad-full-stripe policy is not > doing anything helpful, unless the corrupted sector get full deleted, which > can be very hard detect. > > Maybe you can try a different recovery behavior? > > E.g. always read the file, and the read failed, delete and then copy a new > one from the backup server. > > Then it may solve the problem as the bad sectors would be deleted (along > with its csum) There's some problems with this: - The falures occur when rmw_rbio reads a file that happens to share a raid5 stripe with the extent we are trying to create. The files are not related in any way except for the position of some of their blocks on disk. Finding these files takes time. Scrub takes a long time (weeks) and 'find -type f...' may not find the bad sectors at all, so there's a high risk that a writer will discover an uncorrectable sector before anything else does, and in that case the first signal that we have a problem is silent data loss. - While researching all this, I discovered that in 6.4, 1009254bf22a ("btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub") scrub is broken: it aborts on the first stripe with an uncorrectable sector. So users can't discover more than one uncorrectable sector at a time. - Even if we know of a bad file, we can't replace the bad file because it requires us to have reliable writes to the filesystem. All files referencing bad stripes have to be removed before any replacement files can be loaded onto the filesystem. That's a weeks-long interruption in service, and poses some logistical issues (e.g. temporary additional space to store the files we can't write to the original filesystem). - Once the allocator has found an uncorrectable sector, write failures happen continually, at increasing rates. It only stops when _every_ uncorrectable sector is removed from the filesystem. This is not a proportional response to a single bad sector, especially when it's so easy to fix. In kernel 6.1 and earlier, a damaged btrfs raid5 filesystem has no availability for writes because bugs in RMW would corrupt good data if we write to the filesystem. We can write to our filesystems only because we have a full replica of all the data somewhere else, so the btrfs bugs only cause corruption in files we can replace. Ours is a specialist application. 6.1's raid5 isn't usable as a rootfs or unreplicated primary data store. In kernel 6.2 and later, a damaged btrfs raid5 filesystem has no availability for writes because we can't be sure that the filesystem doesn't have any uncorrectable stripes. If at least one uncorrectable stripe exists, we can't expect any data we write to stick. This makes 6.2 and later unusable as a rootfs or unreplicated primary data store, even though the corruption bugs in 6.1 were fixed. This patch makes RMW work again, mostly: diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 39bec672df0c..2964ff380671 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2277,7 +2277,13 @@ static int rmw_read_wait_recover(struct btrfs_raid_bio *rbio) * and csum mismatch), just let recover_sectors() to handle them all. */ submit_read_wait_bio_list(rbio, &bio_list); - return recover_sectors(rbio); + recover_sectors(rbio); + /* + * Whether we recover corrupted sectors or not, we must proceed + * with the rmw operation, or we will drop the extent we are + * trying to write. + */ + return 0; } static void raid_wait_write_end_io(struct bio *bio) With that patch, my raid5 test cases work for the first time since I wrote them in 2016: writes are fully available, and there's no new corruption of existing files (any uncorrectable sectors remain uncorrectable and no new ones appear). It's good enough to unblock users who are currently stuck on 6.1 due to the RMW regression. For the "mostly": there are some unresolved issues. With the above patch, the set of uncorrectable sectors is stable (nothing that wasn't corrupted before gets corrupted by RMW), but the _contents_ of the uncorrectable sectors get replaced with garbage, because the code isn't leaving the stripe data in a usable state for P/Q update, and it always writes out the full stripe including known garbage data. P/Q does match the garbage, so the stripe will work in degraded mode for the new data. The rmw code has to do something like this: 1. do the R operation: for unreadable blocks, zero-fill the block 2. for readable blocks, verify csums (no change) 3. reconstruct missing or corrupt blocks in separate memory from the block read from disk (i.e. don't overwrite in place) 4. verify again. If verification fails, keep the original version of the data (or zeros if the block could not be read). 5. keep track of which blocks are bad, i.e. don't clear error_bitmap in rmw_rbio unless the block verifies OK 6. do the M operation: modify the blocks to be written, and compute new P/Q blocks for the stripe. 7. clear the bad bits for the modified blocks and the P/Q blocks. 8. do the W operation: write out only the blocks that are not marked as bad, i.e. the corrected blocks, the modified blocks, and the P/Q blocks. If a block can't be corrected, don't write it out, to give the device another chance to get the data on the next read. 9. if too many writes fail, then rmw_rbio can return an error. (Note: this last line opens up a can of worms for the read error handling cases) For scrub, the problem is that scrub_raid56_parity_stripe exits the entire scrub loop when it sees an unrepaired sector, instead of continuing to the next stripe. Everything else there is OK, it's a one-line fix: diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index c7fcb6408eb6..12c3ce0efafa 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1972,7 +1972,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx, "unrepaired sectors detected, full stripe %llu data stripe %u errors %*pbl", full_stripe_start, i, stripe->nr_sectors, &error); - ret = -EIO; + ret = 0; goto out; } bitmap_or(&extent_bitmap, &extent_bitmap, I'll post these patches separately. > Thanks, > Qu