From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from drax.kayaks.hungrycats.org (drax.kayaks.hungrycats.org [174.142.148.226]) by smtp.subspace.kernel.org (Postfix) with ESMTP id A64B763C8 for ; Sat, 8 Jun 2024 01:55:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=174.142.148.226 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717811754; cv=none; b=jbRyjY+Fzhll42lWKEu3xio3WxvcCwMzsDRa5AWNafdstKbCQMW7vZMlu4ayay2ZaLeGwkMFYBYlrOpHKrPT+KdjcJ3XQEGw60foxPRCgUcJFaDNIe3UTrMvRFX4CLwuIM+C/3syLLfUYHAeU2MxbG9neytRZUy4y66YNyz3tiQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717811754; c=relaxed/simple; bh=HN2mNSAl88FK2mJZzNO/2tXM+WWuKJgWxizgUVR7QI8=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=A948AnQjtpL0UUFGUlhTeCMW0Nc6HriTcUzl0eqmvnttssNd8bhyUutfHZC0AiM26h1U0t3Mbg+6Zj0l5NFHig1sIsAnepvt42U1IB7PA0bViQB/Gg2kb2jJv7HbnXBqRmpqqCOryMInn8PmgdwoLqbsWAeL7GqHXKGKfgTZxiw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=umail.furryterror.org; spf=pass smtp.mailfrom=drax.hungrycats.org; arc=none smtp.client-ip=174.142.148.226 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=umail.furryterror.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=drax.hungrycats.org Received: by drax.kayaks.hungrycats.org (Postfix, from userid 1002) id 00071E3D6F2; Fri, 7 Jun 2024 21:55:44 -0400 (EDT) Date: Fri, 7 Jun 2024 21:55:44 -0400 From: Zygo Blaxell To: Qu Wenruo Cc: linux-btrfs@vger.kernel.org Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Message-ID: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote: > > > 在 2024/6/1 12:54, Zygo Blaxell 写道: > > There is a new silent data loss bug in kernel 6.2 and later. > > The requirements for the bug are: > > > > 1. 6.2 or later kernel > > 2. raid5 data in the filesystem > > 3. one device severely corrupted > > 4. some free space fragmentation to trigger a lot of rmw cycles > > I'm still not convinced this can be the condition to trigger the bug. > > As RAID56 now does csum verification before RMW, even if some range is > fully corrupted, as long as the recovered data matches csum, it would > use the recovered data instead. > > And if any vertical stripe is not good, the whole RMW cycle would error out. > [...] > > -------- > > > > In the commit, I notice that when reading the rmw stripe, any blocks with > > csum errors are flagged in rbio->error_bitmap, but nothing ever clears > > those error bits once they are set. > > Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before > doing any RMW. This is true, but the code looks like this: ret = rmw_read_wait_recover(rbio); if (ret < 0) goto out; } [some other stuff] bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors); The problem happens before the bitmap_clear, at `goto out`. Actually it happens in rmw_read_wait_recover() (formerly rmw_read_and_wait()), because that function now changes 'ret' if the repair fails: @@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio) submit_read_bios(rbio, &bio_list); wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0); + + /* + * We may or may not have any corrupted sectors (including missing dev + * and csum mismatch), just let recover_sectors() to handle them all. + */ + ret = recover_sectors(rbio); return ret; out: while ((bio = bio_list_pop(&bio_list))) Before this change, if all the blocks in the stripe were readable, we would recompute parity based on the data that was read, and proceed to finish the rbio. As long as the writes were also successful, the new blocks that were written would be updated and recoverable. (The old unrecoverable blocks would still be unrecoverable, but there's no fixing those.) After this change, we now end up in an infinite loop: 1. Allocator picks a stripe with some unrecoverable csum blocks and some free blocks 2. Writeback tries to put data in the stripe 3. rmw_rbio aborts after it can't repair the existing blocks 4. Writeback deletes the extent, often silently (the application has to use fsync to detect it) 5. Go to step 1, where the allocator picks the same blocks again The effect is pretty dramatic--even a single unrecoverable sector in one stripe will bring an application server to its knees, constantly discarding an application's data whenever it tries to write. Once the allocator reaches the point where the "next" block is in a bad rmw stripe, it keeps allocating that same block over and over again. So there are some different cases to handle at this point: 1. We can read all the blocks and they have good csums: proceed to write data and update parity. This is the normal good case. 2. We can read all the blocks, there are csums failures, but they can be repaired: proceed to write data and update parity. This is the case fixed in the patch. 3. We can read all the blocks, there are csums failures, but they can't be repaired: proceed to write data and update parity. This case is broken in the patch. The data in the unrecoverable blocks is lost already, but we can still write the new data in the stripe and update the parity to recover the new blocks if we go into degraded mode in the future. 4. We can read enough of the blocks to attempt repair, and the repair succeeds: proceed to write data and update parity. This is the degraded good case. 5. We can read enough of the blocks to attempt repair, and the repair fails: proceed to write data and update parity. The lost data block is gone, but we can still write the new data blocks and compute a usable parity block. This is another case broken by the patch, the degraded-mode version of case 3. 6. We can't read enough of the blocks to attempt repair: we can't compute parity any more, because we don't know what data is in the missing blocks. We can't assume it's some constant value (e.g. zero) because then the new parity would be wrong if some but not all of the failing drives came back. For case 1-5, we can ignore the return value of verify_one_sector() (or at least not pass it up to the caller of recover_vertical()). Either we restored the sector we didn't, but either outcome doesn't affect whether a new parity block can be computed to enable recovery of the new data blocks and the remaining recoverable data blocks. The other error cases in recover_vertical() are either fatal problems like ENOMEM, or there's too many read failures and we're actually in case 6. For case 6, the options are: A. abort the rbio and drop the write B. force the filesystem read-only I like B for this case, but AIUI the other btrfs raid profiles currently do A if all the data block mirror writes fail. > The same for finish_parity_scrub(), scrub_rbio(). > > Yes, this means we can have the cache rbio with error bitmap, but it > doesn't make any difference, as rmw_rbio() is always the entrance for a > RMW cycle. > > Maybe I can enhance that by clearing the error bitmap after everything > is done, but I prefer to get a proper cause analyse before doing any > random fix. > > [...] > > > > My third experiment breaks the error recovery code, but it does prevent > > the sync failures and missing extent holes, so it shows that the error > > recovery code itself is not what is causing the dropped writes--it's > > the bits left set in error_bitmap after recovery is done. > > Yep, that's expected. > So I'm more interested in a proper (better minimal) reproducer other > than any fix attempt (since there is no patch sent, it already shows the > attempt failed). > > > > > > Test Case > > --------- > > > > My test case uses three loops running in parallel on a 500 GiB test filesystem: > > > > Data Metadata System > > Id Path RAID5 RAID1 RAID1 Unallocated Total Slack > > -- -------- --------- -------- -------- ----------- --------- -------- > > 1 /dev/vdb 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 19.59GiB > > 2 /dev/vdc 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 3.71GiB > > 3 /dev/vdd 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 3.71GiB > > 4 /dev/vde 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB > > 5 /dev/vdf 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB > > -- -------- --------- -------- -------- ----------- --------- -------- > > Total 284.00GiB 4.00GiB 8.00MiB 3.16TiB 3.52TiB 49.02GiB > > Used 262.97GiB 2.61GiB 64.00KiB > > > > The data is a random collection of small files, half of which have been deleted > > to make lots of small free space holes for rmw. > > > > Loop 1 alternates between corrupting device 3 and repairing it with scrub: > > The reproducer is not good enough, in fact it's pretty bad... > > Using anything not normalized is never a good way to reproduce, but I > guess it's already the best scenario you have. > > Can you try to do it with newly created fs instead? Already did (I needed to do that many times to run bisect...). > > while true; do > > # Any big file will do, usually faster than /dev/random > > # Skipping the first 1M leaves the superblock intact > > while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1 > > # This should fix all the corruption as long as there are no > > # reads or writes anywhere on the filesystem > > btrfs scrub start -Bd /dev/vdd > > done > > [IMPROVE THE TEST] > If you want to cause interleaved free space, just create a ton of 4K > files, and delete them interleavely. > > And instead of vmlinux or whatever file, you can always go with > randomly/pattern filled file, and saves its md5sum to do verification. > [MY CURRENT GUESS] > My current guess is some race with dd corruption and RMW. > AFAIK the last time I am working on RAID56, I always do a offline > corruption (aka, with fs unmounted) and it always works like a charm. > > So the running corruption may be a point of concern. Generally, we expect the hardware to introduce corruption at any time. Ideally, a btrfs raid[156] should be able to work when one of its devices returns 4K blocks from /dev/random on every read, and if it can't, that's always a bug. > Another thing is, if a full stripe is determined to have unrepairable > data, no RMW can be done on that full stripe forever (unless one > manually fixed the problem). Yeah, that's the serious new problem introduced in this patch for 6.2. 6.1 didn't do that in two cases where 6.2 and later do. > So if by somehow you corrupted the full stripe by just corrupting one > device (maybe some existing csum mismatch etc?), then the full stripe > would never be written back, thus causing the data not to be written back. My reproducer wasn't sufficient to trigger the bug--it would only generate csum errors on one device, but there need to be csum errors on two devices (for raid5) to get into the data-loss loop. Apparently 6.2 still has a bug somewhere that introduces extra errors (it crashes a lot, that might be related), because when I tested on 6.2 I was observing the two-csum-failure case not the intended one-csum-failure case. 6.6 and 6.9 don't have any of those problems--no crashes, no cross-device corruption. I've been running the reproducer on 6.6 and 6.9 for a week now, and there are no csum failures on any device other than the one being intentionally corrupted, and there are no data losses because the code never hits the "goto out" line in rmw_rbio (confirmed with both a kgdb breakpoint on the test kernel and some printk's). So on 6.6 and 6.9 btrfs fixes everything my reproducer throws at it. It was a bad reproducer. Sorry about that! > Finally for the lack of any dmesg, it's indeed a problem, that there is > *NO* error message at all if we failed to recover a full stripe. > Just check recover_sectors() call and its callers. The trouble with dmesg errors is that there are so many of them when there's a failure of a wide RAID5 stripe, and they can end up ratelimited, so admins can't see all of them. But they're also better than nothing for case 6, so I approve adding a new one here. For case 6, it might be reasonable to throw the entire filesystem read-only; otherwise, there's silent data loss, and not having silent data loss is a critical feature of btrfs. It would still be possible to scrub, find the affected files, and remove them, without triggering a rmw cycle. Maybe some future btrfs enhancement could handle it better, but for now a dead stop is better than any amount of silent loss. One possible such enhancement is to keep the blocks from being allocated again, e.g. put them on an in-memory list to break the allocate-write-fail-free-allocate loop above. Having said that, I think the other raid1* profiles actually do drop writes silently when all mirror writes fail for a data block. Maybe we can have a sysfs flag to decide what to do in that case: continue with EIO on fsync as we do now, or go fully RO. > And I believe that may contribute to the confusion, that btrfs consider > the fs is fine, meanwhile it catches tons of error and abort all writes > to that full stripes. > I appreciate the effort you put into this case, but I really hope to get > a more reproducible procedure, or it's really hard to say what is going > wrong. Thanks for the tip on bitmap_clear() - it led me straight to the problem, even before you explained it a few pages further on... > If needed I can craft some debug patches for you to test, but I believe > you won't really want to run testing kernels on your large RAID5 array > anyway. I have plenty of VMs for testing kernel patches if you have some. We run every kernel through a gauntlet of tests on staging VMs. Those tests cases didn't have any coverage for raid5 corruption recovery on read because it doesn't work on any prior kernel version (only recovery on scrub was tested). That gap let 6.6 kernels get through acceptance and deployment, and led to a server getting rebooted from 6.1 to 6.6 in the middle of a pvmove to replace a drive, which combined massive corruption of one disk, 6.1's bug, and 6.2's bug. That didn't end well. Fortunately we keep only redundant data on raid5. To fix the original failing RAID5 array, we reverted to 6.1, which allowed the server to operate more or less normally for a few weeks during recovery. We did have 6.1's cross-device error-propagation bug (ironically the one fixed in this patch), but that bug was far more manageable because we already had automation in place to detect any data that was lost and replace it from another server. Scrub combined with rewriting broken files eventually eliminated all the errors. On 6.6, all the automation was broken by the regression: when we replaced a broken file or added a new one, the new file was almost always silently corrupted, and we had no IO error signal to trigger replacement of the data from another server, so corrupted files became visible to users. git-based tools for deployment were fully unusable on 6.6: any commit would break the local repo, while deleting and recloning the repo simply hit the same broken stripe rmw bug and broke the replacement repo. At the end of recovery we did a full verification of SHA hashes and found only a few errors, all in files that were created or modified while running 6.6. > So a more normalized test would help us both. > > Thanks, > Qu > > > > > Loop 2 runs `sync -f` to detect sync errors and drops caches: > > > > while true; do > > # Sometimes throws EIO > > sync -f /testfs > > sysctl vm.drop_caches=3 > > sleep 9 > > done > > > > Loop 3 does some random git activity on a clone of the 'btrfs-progs' > > repo to detect lost writes at the application level: > > > > while true; do > > cd /testfs/btrfs-progs > > # Sometimes fails complaining about various files being corrupted > > find * -type f -print | unsort -r | while read -r x; do > > date >> "$x" > > git commit -am"Modifying $x" > > done > > git repack -a > > done > > > > The errors occur on the sync -f and various git commands, e.g.: > > > > sync: error syncing '/media/testfs/': Input/output error > > vm.drop_caches = 3 > > > > error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty > > fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt > >