* raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

From: Zygo Blaxell @ 2024-06-01  3:24 UTC
To: linux-btrfs

There is a new silent data loss bug in kernel 6.2 and later. The
requirements for the bug are:

1. 6.2 or later kernel
2. raid5 data in the filesystem
3. one device severely corrupted
4. some free space fragmentation to trigger a lot of rmw cycles

I first discovered this bug after a `pvmove` went badly wrong and wiped
out 70% of one data LV of a btrfs raid5 filesystem. I figured out a
recipe for reproducing the bug on a test VM (below), and a possible
approach for a fix.

Bug insertion
-------------

The bug is introduced in this commit:

    7a3150723061 btrfs: raid56: do data csum verification during RMW cycle

This commit fixes a long-standing btrfs raid5 bug which mishandled
cases where a raid5 stripe contains corrupted data in some blocks while
new data is written to other blocks in the same stripe. This older bug
resulted in existing correct data becoming corrupted when new data was
written to the filesystem. EIO and csum failures would occur later
when reading the affected blocks, i.e. the errors were detectable.

Bug effect
----------

The change fixes the old bug, but also introduces two new behaviors
that are potentially worse:

1. New writes are now silently dropped. After `sync` and `sysctl
vm.drop_caches=3`, recently written extents are missing from the files,
as if they were never written by the application. Instead, there is a
hole in the file, or a truncated file. There is no indication that
this has occurred other than the missing data: no dmesg messages, no
errors reported to applications at write time. The data can be read
back from the filesystem while it is still in page cache, which can
delay detection of the failure until after the data is evicted from
cache.

2. `fsync` (system call) and `sync -f` (shell command) sometimes return
EIO. In this situation, devices are online, metadata is intact, and
all corruption is located on a single device and is recoverable, so no
writes should ever be failing, but they apparently do. This also
affects internal btrfs maintenance operations like balance, which
aborts with an IO failure when triggering the bug.

Reverting commit 7a3150723061 on 6.2.16 restores the old behavior: no
silent data loss on new writes, only the old detectable corruption of
existing data. It seems we can't have nice things.

Analysis
--------

In the commit, I notice that when reading the rmw stripe, any blocks
with csum errors are flagged in rbio->error_bitmap, but nothing ever
clears those error bits once they are set. Contrast with the scrub
repair code, which clears out the entire rbio->error_bitmap vector
after repairing the stripe. The old code didn't set any error_bitmap
bits during rmw, and not much else has changed in this commit, which
makes the new set_bit calls a good candidate for root cause.
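To make the bitmap-lifecycle hypothesis concrete, here is a minimal
userspace toy model (not the kernel code; the kernel uses
rbio->error_bitmap and the bitmap_* helpers, and this hypothesis is
refined further downthread) of how a stale error bit could turn a
successfully recovered stripe into a permanently "failed" one:

	/*
	 * Toy model: one error bit per sector in a full stripe. All
	 * names here are hypothetical, for illustration only.
	 */
	#include <stdio.h>

	#define NR_SECTORS 16

	static unsigned long error_bitmap; /* bit i set => sector i failed */

	static void set_error(int i)   { error_bitmap |= 1UL << i; }
	static void clear_error(int i) { error_bitmap &= ~(1UL << i); }
	static int  count_errors(void)
	{
		return __builtin_popcountl(error_bitmap);
	}

	int main(void)
	{
		/* RMW read phase: two sectors fail csum verification. */
		set_error(3);
		set_error(7);

		/* Repair phase: both sectors reconstruct from parity and
		 * re-verify OK -- but suppose only one bit gets cleared,
		 * which is what "nothing ever clears those error bits"
		 * would look like for the other sector. */
		clear_error(3);
		/* clear_error(7) is never called */

		/* Write phase: any remaining error bit is treated as fatal. */
		if (count_errors() > 0)
			printf("write aborted: %d stale error bit(s)\n",
			       count_errors());
		else
			printf("write proceeds\n");
		return 0;
	}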
I tried some experiments:

- clear the error bits individually after setting uptodate = 1
  (still has problems: after verifying the recovered csum, a failed
  reconstruction would leave error bits set, so writes to the stripe
  that should not fail will fail)

- clear the entire error_bitmap vector after a successful
  repair_sectors (might cause corruption if some reconstruction fails
  and later reads use data from the stripe without verifying the csum
  again)

- remove the new set_bit() call added in the above commit (breaks
  repair but fixes EIO and dropped writes)

For the first two I didn't have much success. I'm pretty sure I don't
know how to calculate the error bit index correctly, so my first
attempt didn't have any effect on the test. I get a lot of UAF crashes
if I clear the entire vector in the second test--I suspect it needs
some protection from concurrent access that I'm missing.

My third experiment breaks the error recovery code, but it does prevent
the sync failures and missing extent holes, so it shows that the error
recovery code itself is not what is causing the dropped writes--it's
the bits left set in error_bitmap after recovery is done.

Test Case
---------

My test case uses three loops running in parallel on a 500 GiB test
filesystem:

                Data      Metadata System
    Id Path     RAID5     RAID1    RAID1    Unallocated Total     Slack
    -- -------- --------- -------- -------- ----------- --------- --------
     1 /dev/vdb  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB 19.59GiB
     2 /dev/vdc  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB  3.71GiB
     3 /dev/vdd  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB  3.71GiB
     4 /dev/vde  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
     5 /dev/vdf  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
    -- -------- --------- -------- -------- ----------- --------- --------
       Total    284.00GiB  4.00GiB  8.00MiB     3.16TiB   3.52TiB 49.02GiB
       Used     262.97GiB  2.61GiB 64.00KiB

The data is a random collection of small files, half of which have been
deleted to make lots of small free space holes for rmw.

Loop 1 alternates between corrupting device 3 and repairing it with
scrub:

    while true; do
        # Any big file will do, usually faster than /dev/random
        # Skipping the first 1M leaves the superblock intact
        while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
        # This should fix all the corruption as long as there are no
        # reads or writes anywhere on the filesystem
        btrfs scrub start -Bd /dev/vdd
    done

Loop 2 runs `sync -f` to detect sync errors and drops caches:

    while true; do
        # Sometimes throws EIO
        sync -f /testfs
        sysctl vm.drop_caches=3
        sleep 9
    done

Loop 3 does some random git activity on a clone of the 'btrfs-progs'
repo to detect lost writes at the application level:

    while true; do
        cd /testfs/btrfs-progs
        # Sometimes fails complaining about various files being corrupted
        find * -type f -print | unsort -r | while read -r x; do
            date >> "$x"
            git commit -am"Modifying $x"
        done
        git repack -a
    done

The errors occur on the sync -f and various git commands, e.g.:

    sync: error syncing '/media/testfs/': Input/output error
    vm.drop_caches = 3

    error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
    fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
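Loop 2 only detects the failure at syncfs time. At the level of a
single application write, the same check looks like this -- a minimal
sketch, assuming a file on the affected filesystem (the path is an
example, not part of the test case):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/testfs/probe", O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) { perror("open"); return 1; }

		char buf[4096];
		memset(buf, 'x', sizeof(buf));

		/* The buffered write succeeds even if the later RMW drops it. */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			perror("write");

		/* fsync is the only point where the EIO can surface. */
		if (fsync(fd) != 0)
			fprintf(stderr, "fsync failed: %s (write dropped?)\n",
				strerror(errno));
		else
			printf("fsync ok -- but re-verify after drop_caches too\n");

		close(fd);
		return 0;
	}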
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

From: Qu Wenruo @ 2024-06-01  7:52 UTC
To: Zygo Blaxell, linux-btrfs

On 2024/6/1 12:54, Zygo Blaxell wrote:
> There is a new silent data loss bug in kernel 6.2 and later.
> The requirements for the bug are:
>
> 1. 6.2 or later kernel
> 2. raid5 data in the filesystem
> 3. one device severely corrupted
> 4. some free space fragmentation to trigger a lot of rmw cycles

I'm still not convinced this can be the condition to trigger the bug.

As RAID56 now does csum verification before RMW, even if some range is
fully corrupted, as long as the recovered data matches the csum, it
would use the recovered data instead.

And if any vertical stripe is not good, the whole RMW cycle would error
out.

[...]

> --------
>
> In the commit, I notice that when reading the rmw stripe, any blocks
> with csum errors are flagged in rbio->error_bitmap, but nothing ever
> clears those error bits once they are set.

Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
doing any RMW.

The same goes for finish_parity_scrub() and scrub_rbio().

Yes, this means we can have a cached rbio with error bits set, but it
doesn't make any difference, as rmw_rbio() is always the entrance to a
RMW cycle.

Maybe I can enhance that by clearing the error bitmap after everything
is done, but I prefer to get a proper cause analysis before doing any
random fix.

[...]

> My third experiment breaks the error recovery code, but it does prevent
> the sync failures and missing extent holes, so it shows that the error
> recovery code itself is not what is causing the dropped writes--it's
> the bits left set in error_bitmap after recovery is done.

Yep, that's expected.

So I'm more interested in a proper (better: minimal) reproducer rather
than any fix attempt (since no patch was sent, it already shows the
attempts failed).

> Test Case
> ---------
>
> My test case uses three loops running in parallel on a 500 GiB test
> filesystem:
>
>                 Data      Metadata System
>     Id Path     RAID5     RAID1    RAID1    Unallocated Total     Slack
>     -- -------- --------- -------- -------- ----------- --------- --------
>      1 /dev/vdb  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB 19.59GiB
>      2 /dev/vdc  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB  3.71GiB
>      3 /dev/vdd  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB  3.71GiB
>      4 /dev/vde  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
>      5 /dev/vdf  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
>     -- -------- --------- -------- -------- ----------- --------- --------
>        Total    284.00GiB  4.00GiB  8.00MiB     3.16TiB   3.52TiB 49.02GiB
>        Used     262.97GiB  2.61GiB 64.00KiB
>
> The data is a random collection of small files, half of which have been
> deleted to make lots of small free space holes for rmw.
>
> Loop 1 alternates between corrupting device 3 and repairing it with scrub:

The reproducer is not good enough; in fact it's pretty bad...

Using anything not normalized is never a good way to reproduce, but I
guess it's already the best scenario you have.

Can you try to do it with a newly created fs instead?
>     while true; do
>         # Any big file will do, usually faster than /dev/random
>         # Skipping the first 1M leaves the superblock intact
>         while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
>         # This should fix all the corruption as long as there are no
>         # reads or writes anywhere on the filesystem
>         btrfs scrub start -Bd /dev/vdd
>     done

[IMPROVE THE TEST]

If you want to cause interleaved free space, just create a ton of 4K
files, and delete every other one.

And instead of vmlinux or whatever file, you can always go with
randomly/pattern-filled files, and save their md5sums to do
verification. (A sketch of this setup follows after this message.)

[MY CURRENT GUESS]

My current guess is some race between the dd corruption and RMW.

AFAIK the last time I was working on RAID56, I always did offline
corruption (i.e. with the fs unmounted) and it always worked like a
charm.

So the online corruption may be a point of concern.

Another thing is, if a full stripe is determined to have unrepairable
data, no RMW can be done on that full stripe forever (unless one
manually fixes the problem).

So if you somehow corrupted the full stripe by corrupting just one
device (maybe some existing csum mismatch etc.?), then the full stripe
would never be written back, thus causing the data not to be written
back.

Finally, the lack of any dmesg output is indeed a problem: there is
*NO* error message at all if we fail to recover a full stripe.
Just check the recover_sectors() call and its callers.

And I believe that may contribute to the confusion: btrfs considers
the fs fine, meanwhile it catches tons of errors and aborts all writes
to those full stripes.

I appreciate the effort you put into this case, but I really hope to
get a more reproducible procedure, or it's really hard to say what is
going wrong.

If needed I can craft some debug patches for you to test, but I believe
you won't really want to run testing kernels on your large RAID5 array
anyway. So a more normalized test would help us both.

Thanks,
Qu

> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>
>     while true; do
>         # Sometimes throws EIO
>         sync -f /testfs
>         sysctl vm.drop_caches=3
>         sleep 9
>     done
>
> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
> repo to detect lost writes at the application level:
>
>     while true; do
>         cd /testfs/btrfs-progs
>         # Sometimes fails complaining about various files being corrupted
>         find * -type f -print | unsort -r | while read -r x; do
>             date >> "$x"
>             git commit -am"Modifying $x"
>         done
>         git repack -a
>     done
>
> The errors occur on the sync -f and various git commands, e.g.:
>
>     sync: error syncing '/media/testfs/': Input/output error
>     vm.drop_caches = 3
>
>     error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
>     fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
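A sketch of the [IMPROVE THE TEST] setup above: create many 4 KiB
pattern-filled files, then delete every other one to leave interleaved
holes for RMW. This is an illustration only; /testfs/frag is a
hypothetical directory that must already exist, and a deterministic
per-file pattern stands in for saved md5sums so the sketch stays
dependency-free:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#define NR_FILES 4096
	#define BLOCK    4096

	static void fill_pattern(char *buf, int seed)
	{
		/* Deterministic pattern so contents can be re-derived later. */
		for (int i = 0; i < BLOCK; i++)
			buf[i] = (char)(seed * 31 + i);
	}

	int main(void)
	{
		char path[64], buf[BLOCK];

		for (int i = 0; i < NR_FILES; i++) {
			snprintf(path, sizeof(path), "/testfs/frag/f%05d", i);
			int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
			if (fd < 0) { perror(path); return 1; }
			fill_pattern(buf, i);
			if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
			close(fd);
		}
		sync(); /* make sure the extents really land on disk */

		/* Delete every other file to leave interleaved 4K holes. */
		for (int i = 0; i < NR_FILES; i += 2) {
			snprintf(path, sizeof(path), "/testfs/frag/f%05d", i);
			if (unlink(path) != 0)
				perror(path);
		}
		sync();
		return 0;
	}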
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

From: Zygo Blaxell @ 2024-06-08  1:55 UTC
To: Qu Wenruo
Cc: linux-btrfs

On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> On 2024/6/1 12:54, Zygo Blaxell wrote:
> > There is a new silent data loss bug in kernel 6.2 and later.
> > The requirements for the bug are:
> >
> > 1. 6.2 or later kernel
> > 2. raid5 data in the filesystem
> > 3. one device severely corrupted
> > 4. some free space fragmentation to trigger a lot of rmw cycles
>
> I'm still not convinced this can be the condition to trigger the bug.
>
> As RAID56 now does csum verification before RMW, even if some range is
> fully corrupted, as long as the recovered data matches the csum, it
> would use the recovered data instead.
>
> And if any vertical stripe is not good, the whole RMW cycle would
> error out.

[...]

> > --------
> >
> > In the commit, I notice that when reading the rmw stripe, any blocks
> > with csum errors are flagged in rbio->error_bitmap, but nothing ever
> > clears those error bits once they are set.
>
> Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
> doing any RMW.

This is true, but the code looks like this:

	ret = rmw_read_wait_recover(rbio);
	if (ret < 0)
		goto out;
	}

	[some other stuff]

	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);

The problem happens before the bitmap_clear, at `goto out`. Actually it
happens in rmw_read_wait_recover() (formerly rmw_read_and_wait()),
because that function now changes 'ret' if the repair fails:

	@@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)

		submit_read_bios(rbio, &bio_list);
		wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
	+
	+	/*
	+	 * We may or may not have any corrupted sectors (including missing dev
	+	 * and csum mismatch), just let recover_sectors() to handle them all.
	+	 */
	+	ret = recover_sectors(rbio);
		return ret;
	out:
		while ((bio = bio_list_pop(&bio_list)))

Before this change, if all the blocks in the stripe were readable, we
would recompute parity based on the data that was read, and proceed to
finish the rbio. As long as the writes were also successful, the new
blocks that were written would be updated and recoverable. (The old
unrecoverable blocks would still be unrecoverable, but there's no
fixing those.)

After this change, we now end up in an infinite loop:

1. Allocator picks a stripe with some unrecoverable csum blocks
and some free blocks

2. Writeback tries to put data in the stripe

3. rmw_rbio aborts after it can't repair the existing blocks

4. Writeback deletes the extent, often silently (the application
has to use fsync to detect it)

5. Go to step 1, where the allocator picks the same blocks again

The effect is pretty dramatic--even a single unrecoverable sector in
one stripe will bring an application server to its knees, constantly
discarding an application's data whenever it tries to write. Once the
allocator reaches the point where the "next" block is in a bad rmw
stripe, it keeps allocating that same block over and over again.

So there are some different cases to handle at this point:

1. We can read all the blocks and they have good csums: proceed to
write data and update parity. This is the normal good case.

2. We can read all the blocks, there are csum failures, but they can
be repaired: proceed to write data and update parity. This is the
case fixed in the patch.

3. We can read all the blocks, there are csum failures, and they can't
be repaired: proceed to write data and update parity. This case is
broken in the patch. The data in the unrecoverable blocks is lost
already, but we can still write the new data in the stripe and update
the parity to recover the new blocks if we go into degraded mode in
the future. (A parity demonstration follows after this message.)

4. We can read enough of the blocks to attempt repair, and the repair
succeeds: proceed to write data and update parity. This is the
degraded good case.

5. We can read enough of the blocks to attempt repair, and the repair
fails: proceed to write data and update parity. The lost data block
is gone, but we can still write the new data blocks and compute a
usable parity block. This is another case broken by the patch, the
degraded-mode version of case 3.

6. We can't read enough of the blocks to attempt repair: we can't
compute parity any more, because we don't know what data is in the
missing blocks. We can't assume it's some constant value (e.g. zero)
because then the new parity would be wrong if some but not all of the
failing drives came back.

For cases 1-5, we can ignore the return value of verify_one_sector()
(or at least not pass it up to the caller of recover_vertical()).
Whether we restored the sector or not, neither outcome affects whether
a new parity block can be computed to enable recovery of the new data
blocks and the remaining recoverable data blocks. The other error
cases in recover_vertical() are either fatal problems like ENOMEM, or
there are too many read failures and we're actually in case 6.

For case 6, the options are:

A. abort the rbio and drop the write

B. force the filesystem read-only

I like B for this case, but AIUI the other btrfs raid profiles
currently do A if all the data block mirror writes fail.

> The same goes for finish_parity_scrub() and scrub_rbio().
>
> Yes, this means we can have a cached rbio with error bits set, but it
> doesn't make any difference, as rmw_rbio() is always the entrance to a
> RMW cycle.
>
> Maybe I can enhance that by clearing the error bitmap after everything
> is done, but I prefer to get a proper cause analysis before doing any
> random fix.

[...]

> > My third experiment breaks the error recovery code, but it does prevent
> > the sync failures and missing extent holes, so it shows that the error
> > recovery code itself is not what is causing the dropped writes--it's
> > the bits left set in error_bitmap after recovery is done.
>
> Yep, that's expected.
> So I'm more interested in a proper (better: minimal) reproducer rather
> than any fix attempt (since no patch was sent, it already shows the
> attempts failed).
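The argument in cases 3 and 5 rests on a property of parity: P only
has to match whatever bytes are actually in the data blocks, not what
the csums say they should be. A minimal single-parity (XOR-only,
RAID5-style) toy demonstration, not the kernel implementation:

	#include <stdio.h>
	#include <string.h>

	#define NDATA 3
	#define BLK   8

	static void xor_parity(unsigned char p[BLK], unsigned char d[NDATA][BLK])
	{
		memset(p, 0, BLK);
		for (int i = 0; i < NDATA; i++)
			for (int j = 0; j < BLK; j++)
				p[j] ^= d[i][j];
	}

	/* Rebuild data block `lost` from parity plus the surviving blocks. */
	static void rebuild(unsigned char out[BLK], unsigned char p[BLK],
			    unsigned char d[NDATA][BLK], int lost)
	{
		memcpy(out, p, BLK);
		for (int i = 0; i < NDATA; i++)
			if (i != lost)
				for (int j = 0; j < BLK; j++)
					out[j] ^= d[i][j];
	}

	int main(void)
	{
		unsigned char d[NDATA][BLK], p[BLK], r[BLK];

		memset(d[0], 0xAA, BLK);   /* good old data */
		memset(d[1], 0x00, BLK);   /* free space    */
		memset(d[2], 0x55, BLK);   /* good old data */
		xor_parity(p, d);

		memset(d[2], 0xFF, BLK);   /* d2 is now uncorrectable garbage */
		memset(d[1], 0x42, BLK);   /* RMW writes new data into d1     */
		xor_parity(p, d);          /* update P from actual contents   */

		rebuild(r, p, d, 1);       /* lose the device holding d1 */
		printf("new data block %s\n",
		       memcmp(r, d[1], BLK) ? "LOST" : "recovered intact");
		rebuild(r, p, d, 2);       /* the garbage block is stable too */
		printf("garbage block %s\n",
		       memcmp(r, d[2], BLK) ? "changed"
					    : "reconstructs to the same garbage");
		return 0;
	}

The garbage block stays garbage either way; the choice is only whether
the new data in d1 survives a later device loss.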
> >
> > Test Case
> > ---------
> >
> > My test case uses three loops running in parallel on a 500 GiB test
> > filesystem:
> >
> >                 Data      Metadata System
> >     Id Path     RAID5     RAID1    RAID1    Unallocated Total     Slack
> >     -- -------- --------- -------- -------- ----------- --------- --------
> >      1 /dev/vdb  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB 19.59GiB
> >      2 /dev/vdc  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB  3.71GiB
> >      3 /dev/vdd  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB  3.71GiB
> >      4 /dev/vde  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
> >      5 /dev/vdf  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
> >     -- -------- --------- -------- -------- ----------- --------- --------
> >        Total    284.00GiB  4.00GiB  8.00MiB     3.16TiB   3.52TiB 49.02GiB
> >        Used     262.97GiB  2.61GiB 64.00KiB
> >
> > The data is a random collection of small files, half of which have been
> > deleted to make lots of small free space holes for rmw.
> >
> > Loop 1 alternates between corrupting device 3 and repairing it with scrub:
>
> The reproducer is not good enough; in fact it's pretty bad...
>
> Using anything not normalized is never a good way to reproduce, but I
> guess it's already the best scenario you have.
>
> Can you try to do it with a newly created fs instead?

Already did (I needed to do that many times to run bisect...).

> >     while true; do
> >         # Any big file will do, usually faster than /dev/random
> >         # Skipping the first 1M leaves the superblock intact
> >         while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
> >         # This should fix all the corruption as long as there are no
> >         # reads or writes anywhere on the filesystem
> >         btrfs scrub start -Bd /dev/vdd
> >     done
>
> [IMPROVE THE TEST]
> If you want to cause interleaved free space, just create a ton of 4K
> files, and delete every other one.
>
> And instead of vmlinux or whatever file, you can always go with
> randomly/pattern-filled files, and save their md5sums to do
> verification.

> [MY CURRENT GUESS]
> My current guess is some race between the dd corruption and RMW.
> AFAIK the last time I was working on RAID56, I always did offline
> corruption (i.e. with the fs unmounted) and it always worked like a
> charm.
>
> So the online corruption may be a point of concern.

Generally, we expect the hardware to introduce corruption at any time.
Ideally, a btrfs raid[156] should be able to work when one of its
devices returns 4K blocks from /dev/random on every read, and if it
can't, that's always a bug.

> Another thing is, if a full stripe is determined to have unrepairable
> data, no RMW can be done on that full stripe forever (unless one
> manually fixes the problem).

Yeah, that's the serious new problem introduced in this patch for 6.2.
6.1 didn't do that in two cases where 6.2 and later do.

> So if you somehow corrupted the full stripe by corrupting just one
> device (maybe some existing csum mismatch etc.?), then the full stripe
> would never be written back, thus causing the data not to be written
> back.

My reproducer wasn't sufficient to trigger the bug--it would only
generate csum errors on one device, but there need to be csum errors
on two devices (for raid5) to get into the data-loss loop. Apparently
6.2 still has a bug somewhere that introduces extra errors (it crashes
a lot, which might be related), because when I tested on 6.2 I was
observing the two-csum-failure case, not the intended one-csum-failure
case.

6.6 and 6.9 don't have any of those problems--no crashes, no
cross-device corruption.
I've been running the reproducer on 6.6 and 6.9 for a week now, and
there are no csum failures on any device other than the one being
intentionally corrupted, and there are no data losses, because the
code never hits the "goto out" line in rmw_rbio (confirmed with both a
kgdb breakpoint on the test kernel and some printk's). So on 6.6 and
6.9 btrfs fixes everything my reproducer throws at it.

It was a bad reproducer. Sorry about that!

> Finally, the lack of any dmesg output is indeed a problem: there is
> *NO* error message at all if we fail to recover a full stripe.
> Just check the recover_sectors() call and its callers.

The trouble with dmesg errors is that there are so many of them when
there's a failure of a wide RAID5 stripe, and they can end up
ratelimited, so admins can't see all of them. But they're also better
than nothing for case 6, so I approve of adding a new one here.

For case 6, it might be reasonable to throw the entire filesystem
read-only; otherwise, there's silent data loss, and not having silent
data loss is a critical feature of btrfs. It would still be possible
to scrub, find the affected files, and remove them, without triggering
a rmw cycle. Maybe some future btrfs enhancement could handle it
better, but for now a dead stop is better than any amount of silent
loss.

One possible such enhancement is to keep the blocks from being
allocated again, e.g. put them on an in-memory list to break the
allocate-write-fail-free-allocate loop above (a sketch of this idea
follows below).

Having said that, I think the other raid1* profiles actually do drop
writes silently when all mirror writes fail for a data block. Maybe we
can have a sysfs flag to decide what to do in that case: continue with
EIO on fsync as we do now, or go fully RO.

> And I believe that may contribute to the confusion: btrfs considers
> the fs fine, meanwhile it catches tons of errors and aborts all writes
> to those full stripes.
>
> I appreciate the effort you put into this case, but I really hope to
> get a more reproducible procedure, or it's really hard to say what is
> going wrong.

Thanks for the tip on bitmap_clear() - it led me straight to the
problem, even before you explained it a few pages further on...

> If needed I can craft some debug patches for you to test, but I believe
> you won't really want to run testing kernels on your large RAID5 array
> anyway.

I have plenty of VMs for testing kernel patches if you have some.

We run every kernel through a gauntlet of tests on staging VMs. Those
test cases didn't have any coverage for raid5 corruption recovery on
read, because it doesn't work on any prior kernel version (only
recovery on scrub was tested). That gap let 6.6 kernels get through
acceptance and deployment, and led to a server getting rebooted from
6.1 to 6.6 in the middle of a pvmove to replace a drive, which
combined massive corruption of one disk, 6.1's bug, and 6.2's bug.
That didn't end well. Fortunately we keep only redundant data on
raid5.

To fix the original failing RAID5 array, we reverted to 6.1, which
allowed the server to operate more or less normally for a few weeks
during recovery. We did have 6.1's cross-device error-propagation bug
(ironically the one fixed in this patch), but that bug was far more
manageable because we already had automation in place to detect any
data that was lost and replace it from another server. Scrub combined
with rewriting broken files eventually eliminated all the errors.
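A toy model of that blocklist enhancement (entirely hypothetical;
nothing like this exists in the kernel today): remember stripes whose
RMW failed, and have the allocator skip them, which breaks the loop:

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_STRIPES 8

	static bool bad[NR_STRIPES];    /* in-memory only, empty at mount */
	static bool in_use[NR_STRIPES];

	static int alloc_stripe(void)
	{
		for (int i = 0; i < NR_STRIPES; i++)
			if (!in_use[i] && !bad[i]) {
				in_use[i] = true;
				return i;
			}
		return -1;
	}

	/* Called when rmw fails: free the space but remember the stripe. */
	static void rmw_failed(int i)
	{
		in_use[i] = false;
		bad[i] = true;
	}

	int main(void)
	{
		bad[0] = true; /* pretend stripe 0 has an uncorrectable sector */
		for (int w = 0; w < 3; w++) {
			int s = alloc_stripe();
			printf("write %d -> stripe %d\n", w, s);
			if (s == 1)
				rmw_failed(s); /* pretend stripe 1 fails too */
		}
		return 0;
	}

Once a stripe fails, it is never handed out again, so writes make
forward progress instead of repeatedly landing on the same bad stripe.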
On 6.6, all the automation was broken by the regression: when we
replaced a broken file or added a new one, the new file was almost
always silently corrupted, and we had no IO error signal to trigger
replacement of the data from another server, so corrupted files became
visible to users. git-based tools for deployment were fully unusable
on 6.6: any commit would break the local repo, while deleting and
recloning the repo simply hit the same broken stripe rmw bug and broke
the replacement repo.

At the end of recovery we did a full verification of SHA hashes and
found only a few errors, all in files that were created or modified
while running 6.6.

> So a more normalized test would help us both.
>
> Thanks,
> Qu

> >
> > Loop 2 runs `sync -f` to detect sync errors and drops caches:
> >
> >     while true; do
> >         # Sometimes throws EIO
> >         sync -f /testfs
> >         sysctl vm.drop_caches=3
> >         sleep 9
> >     done
> >
> > Loop 3 does some random git activity on a clone of the 'btrfs-progs'
> > repo to detect lost writes at the application level:
> >
> >     while true; do
> >         cd /testfs/btrfs-progs
> >         # Sometimes fails complaining about various files being corrupted
> >         find * -type f -print | unsort -r | while read -r x; do
> >             date >> "$x"
> >             git commit -am"Modifying $x"
> >         done
> >         git repack -a
> >     done
> >
> > The errors occur on the sync -f and various git commands, e.g.:
> >
> >     sync: error syncing '/media/testfs/': Input/output error
> >     vm.drop_caches = 3
> >
> >     error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
> >     fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

From: Qu Wenruo @ 2024-06-08  3:20 UTC
To: Zygo Blaxell, Qu Wenruo
Cc: linux-btrfs

On 2024/6/8 11:25, Zygo Blaxell wrote:
> On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
[...]
> After this change, we now end up in an infinite loop:
>
> 1. Allocator picks a stripe with some unrecoverable csum blocks
> and some free blocks
>
> 2. Writeback tries to put data in the stripe
>
> 3. rmw_rbio aborts after it can't repair the existing blocks
>
> 4. Writeback deletes the extent, often silently (the application
> has to use fsync to detect it)
>
> 5. Go to step 1, where the allocator picks the same blocks again
>
> The effect is pretty dramatic--even a single unrecoverable sector in
> one stripe will bring an application server to its knees, constantly
> discarding an application's data whenever it tries to write. Once the
> allocator reaches the point where the "next" block is in a bad rmw
> stripe, it keeps allocating that same block over and over again.

I'm afraid the error path (no way to inform the caller) is an existing
problem. A buffered write can always succeed (as long as there is no
ENOMEM/ENOSPC etc.), but the real writeback is not guaranteed to
succeed. It doesn't even need RAID56 to trigger.

But "discarding" the dirty pages doesn't sound correct.
If a writeback failed, the dirty pages should still stay dirty, not
discarded.

It may be a new bug in the error handling path.

> So there are some different cases to handle at this point:
>
> 1. We can read all the blocks and they have good csums: proceed to
> write data and update parity. This is the normal good case.
>
> 2. We can read all the blocks, there are csum failures, but they can
> be repaired: proceed to write data and update parity. This is the
> case fixed in the patch.
>
> 3. We can read all the blocks, there are csum failures, and they
> can't be repaired: proceed to write data and update parity.

Nope, we should not continue; that's only "spreading" the corruption.

> This case
> is broken in the patch. The data in the unrecoverable blocks is lost
> already, but we can still write the new data in the stripe and update
> the parity to recover the new blocks if we go into degraded mode in
> the future.

We should only continue if our write range covers the bad sector, but
that's impossible, as that breaks DATACOW.

I can do extra handling, like if the write range covers a bad sector,
we do not try to recover that one. But I strongly doubt that would
have any meaning, since it means we're writing into a sector which
already has a csum.

> 4. We can read enough of the blocks to attempt repair, and the repair
> succeeds: proceed to write data and update parity. This is the
> degraded good case.

Degraded is not treated any differently; it's just the same as a
failed read.

> 5. We can read enough of the blocks to attempt repair, and the repair
> fails: proceed to write data and update parity. The lost data block
> is gone, but we can still write the new data blocks and compute a
> usable parity block. This is another case broken by the patch, the
> degraded-mode version of case 3.

The degraded case only makes things more complex. But I get your
point.

> 6. We can't read enough of the blocks to attempt repair: we can't
> compute parity any more, because we don't know what data is in the
> missing blocks. We can't assume it's some constant value (e.g. zero)
> because then the new parity would be wrong if some but not all of the
> failing drives came back.
>
> For cases 1-5, we can ignore the return value of verify_one_sector()
> (or at least not pass it up to the caller of recover_vertical()).

Nope, that falls back to the old behavior: a corrupted section would
only spread to P/Q. We should not touch such a corrupted full stripe.
IIRC we already have such test cases in fstests.

I believe we should let the users know there is something wrong,
telling them which sector is problematic (and which file), hoping they
can delete the file or whatever. We cannot make the data loss decision
for the end users.

> Whether we restored the sector or not, neither outcome affects whether
> a new parity block can be computed to enable recovery of the new data
> blocks and the remaining recoverable data blocks. The other error
> cases in recover_vertical() are either fatal problems like ENOMEM, or
> there are too many read failures and we're actually in case 6.
>
> For case 6, the options are:
>
> A. abort the rbio and drop the write
>
> B. force the filesystem read-only
>
> I like B for this case, but AIUI the other btrfs raid profiles
> currently do A if all the data block mirror writes fail.

The current solution for RAID56 is to just not touch it. We neither
continue nor ignore.

I believe B is the much safer solution, but I'm not sure the other
profiles really go with A. IIRC for RAID1* we just continue as long as
at least one good mirror was written.

[...]

>> Can you try to do it with a newly created fs instead?
>
> Already did (I needed to do that many times to run bisect...).

Any shorter reproducer?
If I understand you correctly, the problem only happens when all of the
following conditions are met:

- the sector has a csum
  (or we won't verify the content anyway)

- both the original and the repaired contents mismatch the csum

This already means there are errors beyond the RAID5 tolerance.

>>>     while true; do
>>>         # Any big file will do, usually faster than /dev/random
>>>         # Skipping the first 1M leaves the superblock intact
>>>         while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
>>>         # This should fix all the corruption as long as there are no
>>>         # reads or writes anywhere on the filesystem
>>>         btrfs scrub start -Bd /dev/vdd
>>>     done
>>
>> [IMPROVE THE TEST]
>> If you want to cause interleaved free space, just create a ton of 4K
>> files, and delete every other one.
>>
>> And instead of vmlinux or whatever file, you can always go with
>> randomly/pattern-filled files, and save their md5sums to do
>> verification.

>> [MY CURRENT GUESS]
>> My current guess is some race between the dd corruption and RMW.
>> AFAIK the last time I was working on RAID56, I always did offline
>> corruption (i.e. with the fs unmounted) and it always worked like a
>> charm.
>>
>> So the online corruption may be a point of concern.
>
> Generally, we expect the hardware to introduce corruption at any time.
> Ideally, a btrfs raid[156] should be able to work when one of its
> devices returns 4K blocks from /dev/random on every read, and if it
> can't, that's always a bug.

But so far your description is of a two-sector corruption, beyond
RAID5's tolerance.

>> Another thing is, if a full stripe is determined to have unrepairable
>> data, no RMW can be done on that full stripe forever (unless one
>> manually fixes the problem).
>
> Yeah, that's the serious new problem introduced in this patch for 6.2.
> 6.1 didn't do that in two cases where 6.2 and later do.

Unfortunately, I do not think this is the root problem. The change is
in the correct direction, from ignoring the problem to at least not
making it worse.

>> So if you somehow corrupted the full stripe by corrupting just one
>> device (maybe some existing csum mismatch etc.?), then the full stripe
>> would never be written back, thus causing the data not to be written
>> back.
>
> My reproducer wasn't sufficient to trigger the bug--it would only
> generate csum errors on one device, but there need to be csum errors
> on two devices (for raid5) to get into the data-loss loop. Apparently
> 6.2 still has a bug somewhere that introduces extra errors (it crashes
> a lot, which might be related), because when I tested on 6.2 I was
> observing the two-csum-failure case, not the intended one-csum-failure
> case.
>
> 6.6 and 6.9 don't have any of those problems--no crashes, no
> cross-device corruption. I've been running the reproducer on 6.6 and
> 6.9 for a week now, and there are no csum failures on any device other
> than the one being intentionally corrupted, and there are no data
> losses, because the code never hits the "goto out" line in rmw_rbio
> (confirmed with both a kgdb breakpoint on the test kernel and some
> printk's). So on 6.6 and 6.9 btrfs fixes everything my reproducer
> throws at it.

That's my expectation. Thankfully it finally matches reality.

So something else in v6.2 is causing the beyond-raid5-tolerance errors.
And the data drop problem is definitely a big problem, but I do not
think it's really RAID56 specific.

My guess is some later RAID56 fixes are not yet backported to v6.2?
One quick possible fix is:

    486c737f7fdc ("btrfs: raid56: always verify the P/Q contents for scrub")

> It was a bad reproducer. Sorry about that!
>
>> Finally, the lack of any dmesg output is indeed a problem: there is
>> *NO* error message at all if we fail to recover a full stripe.
>> Just check the recover_sectors() call and its callers.
>
> The trouble with dmesg errors is that there are so many of them when
> there's a failure of a wide RAID5 stripe, and they can end up
> ratelimited, so admins can't see all of them. But they're also better
> than nothing for case 6, so I approve of adding a new one here.
>
> For case 6, it might be reasonable to throw the entire filesystem
> read-only; otherwise, there's silent data loss, and not having silent
> data loss is a critical feature of btrfs. It would still be possible
> to scrub, find the affected files, and remove them, without triggering
> a rmw cycle. Maybe some future btrfs enhancement could handle it
> better, but for now a dead stop is better than any amount of silent
> loss.
>
> One possible such enhancement is to keep the blocks from being
> allocated again, e.g. put them on an in-memory list to break the
> allocate-write-fail-free-allocate loop above.

The retry of a failed full stripe is intentional, for the following two
reasons:

- There is a chance the error may be temporary, so the next retry may
  get good contents and do the repair correctly

- There is no need to cache something we know is unreliable

[...]

> On 6.6, all the automation was broken by the regression: when we
> replaced a broken file or added a new one, the new file was almost
> always silently corrupted,

A quick search through the data writeback path indeed shows we do
nothing extra when hitting a writeback error. This means the dirty
pages would be considered uptodate, and thus can later be dropped.

I think we should change the behavior of clearing dirty flags, so that
the dirty page flag only gets cleared if the writeback is successful.

So I believe this is the root cause. This change is going to be huge,
and I have to check other fses to be extra safe.

But even with that fixed, the do-not-touch-bad-full-stripe policy is
not doing anything helpful, unless the corrupted sector gets fully
deleted, which can be very hard to detect.

Maybe you can try a different recovery behavior? E.g. always read the
file, and if the read fails, delete it and then copy a new one from the
backup server. Then it may solve the problem, as the bad sectors would
be deleted (along with their csums).

Thanks,
Qu

> and we had no IO error signal to trigger replacement of the
> data from another server, so corrupted files became visible to users.
> git-based tools for deployment were fully unusable on 6.6: any commit
> would break the local repo, while deleting and recloning the repo
> simply hit the same broken stripe rmw bug and broke the replacement
> repo.
>
> At the end of recovery we did a full verification of SHA hashes and
> found only a few errors, all in files that were created or modified
> while running 6.6.
>
>> So a more normalized test would help us both.
>>
>> Thanks,
>> Qu
>>
>>> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>>>
>>>     while true; do
>>>         # Sometimes throws EIO
>>>         sync -f /testfs
>>>         sysctl vm.drop_caches=3
>>>         sleep 9
>>>     done
>>>
>>> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
>>> repo to detect lost writes at the application level:
>>>
>>>     while true; do
>>>         cd /testfs/btrfs-progs
>>>         # Sometimes fails complaining about various files being corrupted
>>>         find * -type f -print | unsort -r | while read -r x; do
>>>             date >> "$x"
>>>             git commit -am"Modifying $x"
>>>         done
>>>         git repack -a
>>>     done
>>>
>>> The errors occur on the sync -f and various git commands, e.g.:
>>>
>>>     sync: error syncing '/media/testfs/': Input/output error
>>>     vm.drop_caches = 3
>>>
>>>     error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
>>>     fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
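Qu's dirty-flag proposal above ("the dirty page flag only gets cleared
if the writeback is successful"), as a toy model. Real page writeback
is vastly more complex; this only contrasts dropping the page on
failure with keeping it dirty for retry:

	#include <stdbool.h>
	#include <stdio.h>

	struct page {
		bool dirty;
		bool uptodate;
	};

	/* Returns true if the device accepted the write. */
	static bool device_write(bool fail)
	{
		return !fail;
	}

	static void writeback_clear_on_success(struct page *pg, bool io_fails)
	{
		if (!pg->dirty)
			return;
		if (device_write(io_fails))
			pg->dirty = false; /* durable: safe to reclaim later */
		/* else: stays dirty, so writeback will be retried */
	}

	int main(void)
	{
		struct page pg = { .dirty = true, .uptodate = true };

		writeback_clear_on_success(&pg, true);  /* RMW aborts with EIO */
		printf("after failed writeback: dirty=%d (retryable)\n",
		       pg.dirty);

		writeback_clear_on_success(&pg, false); /* stripe repaired */
		printf("after good writeback:   dirty=%d\n", pg.dirty);
		return 0;
	}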
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"

From: Zygo Blaxell @ 2024-07-08  6:25 UTC
To: Qu Wenruo
Cc: Qu Wenruo, linux-btrfs

On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> On 2024/6/8 11:25, Zygo Blaxell wrote:
> > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > After this change, we now end up in an infinite loop:
> >
> > 1. Allocator picks a stripe with some unrecoverable csum blocks
> > and some free blocks
> >
> > 2. Writeback tries to put data in the stripe
> >
> > 3. rmw_rbio aborts after it can't repair the existing blocks
> >
> > 4. Writeback deletes the extent, often silently (the application
> > has to use fsync to detect it)
> >
> > 5. Go to step 1, where the allocator picks the same blocks again
> >
> > The effect is pretty dramatic--even a single unrecoverable sector in
> > one stripe will bring an application server to its knees, constantly
> > discarding an application's data whenever it tries to write. Once the
> > allocator reaches the point where the "next" block is in a bad rmw
> > stripe, it keeps allocating that same block over and over again.
>
> I'm afraid the error path (no way to inform the caller) is an existing
> problem. A buffered write can always succeed (as long as there is no
> ENOMEM/ENOSPC etc.), but the real writeback is not guaranteed to
> succeed. It doesn't even need RAID56 to trigger.
>
> But "discarding" the dirty pages doesn't sound correct.
> If a writeback failed, the dirty pages should still stay dirty, not
> discarded.
>
> It may be a new bug in the error handling path.

I found the code that does this. It's more than 11 years old:

    commit 0bec9ef525e33233d7739b71be83bb78746f6e94
    Author: Josef Bacik <jbacik@fusionio.com>
    Date:   Thu Jan 31 14:58:00 2013 -0500

        Btrfs: unreserve space if our ordered extent fails to work

        When a transaction aborts or there's an EIO on an ordered
        extent or any error really we will not free up the space we
        reserved for this ordered extent. This results in warnings
        from the block group cache cleanup in the case of a
        transaction abort, or leaking space in the case of EIO on an
        ordered extent. Fix this up by free'ing the reserved space if
        we have an error at all trying to complete an ordered extent.
        Thanks,

> > This case
> > is broken in the patch. The data in the unrecoverable blocks is lost
> > already, but we can still write the new data in the stripe and update
> > the parity to recover the new blocks if we go into degraded mode in
> > the future.
>
> We should only continue if our write range covers the bad sector, but
> that's impossible, as that breaks DATACOW.
>
> I can do extra handling, like if the write range covers a bad sector,
> we do not try to recover that one.
>
> But I strongly doubt that would have any meaning, since it means we're
> writing into a sector which already has a csum.

The point here is to:

- keep the P/Q in sync with a consistent view of the data in the
  stripe, even when some of that data is known to be uncorrectable

- not overwrite any block in the stripe unless we are sure its
  contents are correct or don't matter

- always write out the new data and matching P/Q in rmw_rbio

i.e. if we fail to read a block, or if its csum fails, and we can't
reconstruct the block, we must not write the block out to disk,
because that would prevent a read retry from succeeding in the future.
We must still update P/Q because we modify other blocks in the stripe.

The only thing P/Q needs to work is some repeatable data that we can
read from the stripe in the future. P/Q doesn't have to be correct wrt
the csums, only wrt the blocks. As long as P/Q matches the other
blocks, any block in the stripe can be reconstructed to its current
value.

Note we only need to do this for rmw, because we _have_ to finish rmw
to avoid data loss from not making a best effort to do the write.
Scrub and read can report the problem and move on to the next stripe,
because there's no write to be lost in the rbio.

> > 6. We can't read enough of the blocks to attempt repair: we can't
> > compute parity any more, because we don't know what data is in the
> > missing blocks. We can't assume it's some constant value (e.g. zero)
> > because then the new parity would be wrong if some but not all of the
> > failing drives came back.
> >
> > For cases 1-5, we can ignore the return value of verify_one_sector()
> > (or at least not pass it up to the caller of recover_vertical()).
>
> Nope, that falls back to the old behavior: a corrupted section would
> only spread to P/Q.

Yeah, I tested this and it's the wrong point in the logic. Another
problem with doing it there is that that code is shared with scrub,
and we don't want scrub to be doing best-effort; we want scrub to be
perfect or do nothing.

A better place to do it is when recover_blocks() returns. That is
specific to the rmw_rbio code, so it doesn't affect other parts of
btrfs that can afford to stop when there's an uncorrectable sector.

> We cannot make the data loss decision for the end users.

There is no decision that doesn't have at least potential data loss.
RMW is writing data. If our RMW bbio chooses not to write data when
it's possible to do so, then we've actively made the decision to lose
data.

In the current architecture, RMW must make a best effort to correct
any correctable errors before modifying the stripe, but if that effort
fails, RMW still has to update the stripe anyway, so that one bad
stripe doesn't break the entire filesystem.

The alternative is to not do RMW at all (or at least remove the
dependency on it), which is a much larger problem to solve. There are
a few ways to do that:

1. In btrfs_finish_one_ordered, instead of dropping the failed extent,
try to allocate it somewhere else (hopefully in a stripe without an
uncorrectable sector, or just switch to a full-stripe allocation for
that specific case), and restart the write operation. That trades
extra disk space for availability. We'd need some way to remember
where writes have failed so the allocator isn't constantly trying to
put new data on broken stripes. Also it means that we can end up
silently dropping data due to ENOSPC instead of EIO (i.e. we try to
write to the last available sectors and they're in a broken stripe)
(similar to another bug in 6.6, now fixed).

2. Avoid RMW writes in the allocator. This requires some aggressive
merging of writes so that we fill up stripes (that would be nice to
have anyway as a pure speed optimization for raid5, as the current
code will happily write 4 64K files in 4 RMW operations instead of
packing them all into one full-stripe write). We'd need bg reclaim to
free up space in partially filled stripes (i.e. raid5 would
effectively behave like a zoned device). We'd still need RMW for
nodatacow files, but nodatasum files can't ever have corrupt sectors,
so we can always update P/Q when we write a stripe. Prealloc wouldn't
work. We might return ENOSPC, but we'd do that at write() time, not
in writeback, so userspace will be informed and we won't silently
discard any data.

3. Use raid-stripe-tree. AFAICT raid-stripe-tree is basically option
2, but implemented between the extent layer and the device layer, so
it's a less invasive change wrt the rest of btrfs. raid-stripe-tree
doesn't support prealloc or nodatacow either.

> Any shorter reproducer?

1. Create a -draid5 -mraid1 filesystem from 5 devices: 2x 750G, 2x
700G. The larger devices will have all the metadata on them, so we
can corrupt everything on a small device and not worry about
clobbering metadata.

2. Put some data on the filesystem, then delete every second file to
make a lot of small holes in the raid5 stripes.

3. Corrupt the two smaller devices (the ones without metadata) with
your choice of garbage data (leave the first 1M intact for the
superblock):

    yes | dd of=dev-3 iflag=fullblock bs=1024k seek=1
    yes | dd of=dev-4 iflag=fullblock bs=1024k seek=1

4. On the filesystem, create a new directory and use it normally. I
like 'git clone' for this, because it does a lot of fsyncs and sha1
hash integrity checking, and every kernel developer already has it
installed:

    git clone http://github.com/kdave/btrfs-progs.git git-test

The git clone, git gc, commits, etc. should all run without error, and
the number of errors in files in the rest of the filesystem should be
stable, i.e. writes work for new data, and old data isn't (more)
corrupted by the writes.

> If I understand you correctly, the problem only happens when all of
> the following conditions are met:
>
> - the sector has a csum
>   (or we won't verify the content anyway)
>
> - both the original and the repaired contents mismatch the csum
>
> This already means there are errors beyond the RAID5 tolerance.

Yes, all the problems arise when one stripe goes beyond RAID5
tolerance. It doesn't look like it matters whether it's a read or a
csum failure, but csum failure is much more common, due to earlier
bugs in btrfs raid5 leaving broken stripes on disk for later kernels
to trip over, or a replaced drive combined with an uncorrected write
hole error on a second drive.

If the errors are within RAID5 tolerance, it all works perfectly: the
"best effort to repair" is a completely successful repair, so
recover_blocks returns 0, we don't prematurely exit from rmw_rbio, and
writes work properly.

> So something else in v6.2 is causing the beyond-raid5-tolerance
> errors. And the data drop problem is definitely a big problem, but I
> do not think it's really RAID56 specific.

It's raid56 specific because only raid56 has data block writes which
depend on the outcome of data block reads, i.e. a write fails because
of something that happens during a read. No other profile has that
possibility.

If I run the test case with raid1 or single data instead of raid5,
there will never be a data loss on a write because of a csum failure,
because the write bbios simply never try to read anything. Even raid5
full-stripe writes can't fail that way, because they don't mix read
operations into their write bbios.

(I'm excluding metadata here--there are plenty of dependencies between
data writes and metadata reads, but they aren't very interesting. If
metadata is broken then the entire filesystem is broken.)
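The read dependency can be shown in three lines of arithmetic: to
update one data block in a raid5 stripe, you must either read the old
data and old parity (P' = P ^ Dold ^ Dnew) or read all the other data
blocks and recompute. A one-byte-per-block toy, so a failed *read*
becomes a failed *write*:

	#include <stdio.h>

	int main(void)
	{
		/* one-byte "blocks" for brevity */
		unsigned char d0 = 0xAA, d1 = 0x55, d2 = 0x0F;
		unsigned char p = d0 ^ d1 ^ d2;

		unsigned char d1_new = 0x42;

		/* RMW shortcut: must READ old d1 and old p to write new d1. */
		unsigned char p_new = p ^ d1 ^ d1_new;

		/* Equivalent full recompute: must READ d0 and d2 instead. */
		unsigned char p_check = d0 ^ d1_new ^ d2;

		printf("p_new=%#x p_check=%#x (%s)\n", p_new, p_check,
		       p_new == p_check ? "consistent" : "BUG");
		return 0;
	}

Either path requires reads that single-device or mirrored profiles
never need to issue on the write path.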
> > On 6.6, all the automation was broken by the regression: when we
> > replaced a broken file or added a new one, the new file was almost
> > always silently corrupted,
>
> A quick search through the data writeback path indeed shows we do
> nothing extra when hitting a writeback error.

That's because it's not in the writeback path--it's in the bio
completions for ordered data. This is the sequence of events:

1. btrfs allocator places an extent within an existing raid5 data
stripe that contains some unreadable or corrupt data blocks.

2. rmw_rbio is called to perform a rmw operation on the stripe.

3. rmw_read_wait_recover attempts to find all csums in the stripe.

4. rmw_read_wait_recover issues read requests to
submit_read_wait_bio_list to fill the stripe in memory.

5. raid_wait_read_end_io sets error bits in the rbio error_bitmap for
read failures, and passes successful reads to verify_bio_data_sectors.

6. verify_bio_data_sectors verifies those sectors which have csums
(i.e. not the free sectors, nodatasum files, or the P/Q sector(s)).
Sectors with mismatched csums are marked in the rbio error_bitmap.

7. back to rmw_read_wait_recover, which calls recover_sectors.

8. recover_sectors attempts to repair any sectors with a bit set in
error_bitmap. If there are too many bad sectors, recover_sectors
exits early without doing anything. After reconstructing the bad
sector contents from the remaining data and P/Q blocks, the csums of
the repaired sectors are verified with verify_one_sector().

9. if recover_sectors is unable to repair all sectors in the stripe,
recover_sectors returns an error to rmw_read_wait_recover, which
returns the error to rmw_rbio.

10. when rmw_read_wait_recover returns an error to rmw_rbio, rmw_rbio
sets bi_status to IOERR on all the bios in the stripe, and returns
without performing the parity update and write operations originally
requested at step 2.

11. the rbio work function finishes.

12. the btrfs_finish_ordered_io work function starts.

13. btrfs_finish_one_ordered notices the error in bi_status on the
bios.

14. btrfs_finish_one_ordered calls btrfs_mark_ordered_extent_error, so
syncfs() and fsync() can return EIO.

15. btrfs_finish_one_ordered calls btrfs_drop_extent_map_range to
delete the extent from the filesystem.

16. the blocks in the uncorrectable stripe are now free to be
allocated again, so return to step 1 and do it all again in the next
transaction.

Note that this is a loop! So we don't lose one extent of data and
then move on--we lose one extent of data _in each transaction_, and
then set up the next transaction to fail in the same way. This
continues as long as an uncorrectable sector exists in one of the
first available free spaces.

Also note that after a while, the allocator fills up all the good
stripes between the bad stripes, but it can't fill the bad stripes
because they keep getting returned to free space at the end of each
transaction. If there are a lot of bad stripes on the filesystem, bad
stripes move to the head of the free space list, where they accumulate
over time. We might end up dropping hundreds or thousands of extents
before we finally allocate something in a good stripe. (A toy
simulation of this dynamic follows below.)

During my original failure event I did notice that these dropped
writes seemed to come in "waves", but I couldn't reproduce that
effect. Now I know how to trigger it.

> I think we should change the behavior of clearing dirty flags, so
> that the dirty page flag only gets cleared if the writeback is
> successful.

The entire page is deleted, so it stops having any flags.
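The toy simulation promised above: a free-space list whose first
region sits in a bad stripe. Each "transaction" allocates it, fails
the RMW, and frees it again, dropping one extent per transaction while
the good regions slowly fill up behind it (all names hypothetical):

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_REGIONS 4

	static bool free_region[NR_REGIONS] = { true, true, true, true };
	static bool bad_stripe[NR_REGIONS]  = { true, false, false, false };

	static int alloc_first_free(void)
	{
		for (int i = 0; i < NR_REGIONS; i++)
			if (free_region[i]) {
				free_region[i] = false;
				return i;
			}
		return -1;
	}

	int main(void)
	{
		int dropped = 0;

		for (int txn = 0; txn < 5; txn++) {
			int failed = -1;

			/* two extent allocations per transaction */
			for (int n = 0; n < 2; n++) {
				int r = alloc_first_free();
				if (r < 0)
					break;
				if (bad_stripe[r]) {
					failed = r; /* rmw aborts, extent dropped */
					dropped++;
					printf("txn %d: extent in region %d dropped\n",
					       txn, r);
				} else {
					printf("txn %d: extent in region %d written\n",
					       txn, r);
				}
			}
			/* the dropped extent's space returns to the free list */
			if (failed >= 0)
				free_region[failed] = true;
		}
		printf("%d extents silently dropped\n", dropped);
		return 0;
	}

Once the good regions are full, every transaction drops its extent,
which matches the observed "waves" of dropped writes.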
> This change is going to be huge

Yeah, all the real solutions are far more development work, and have
more ways to fail, than handling everything with a better best-effort
in RMW.

In the short term, it's more important to fix the RMW regression as
soon as possible, so that users with raid5 filesystems can upgrade
from 6.1 to later kernels without having worse data loss problems than
6.1 had in the supported cases (drive failures <= tolerance). The
most straightforward way to do that is to handle everything we can
within rmw_rbio.

Once the larger fixes are in place, so that btrfs can handle a dropped
data write or so that btrfs doesn't need to do RMW any more, we can
then go back to strictly failing RMW on uncorrectable raid56 stripes.
It's too early to do that now. Best-effort RMW has a low marginal
risk of data loss, whereas aborting RMW absolutely guarantees data
loss. As you point out, none of the data loss cases arise when the
stripe is within correctable tolerances, i.e. you need to start with
too many failures so that you have uncorrectable sectors, and we don't
claim that btrfs can recover data in that case even without RMW
updates. If errors are within tolerance, they're correctable, so
rmw_rbio will simply correct them.

I propose 3 phases of fix:

1. One-line patch to ignore the recover_blocks return value, to get
raid56 writes working in the presence of uncorrectable sectors again.

2. Fix up the rbio handling so that we can compute P/Q for RMW stripes
even with uncorrectable sectors, so that we don't make a mess of
uncorrectable sectors (discussed in more detail below).

3. Fix the upper layers of btrfs so that we can have write bbios fail
and not lose data in general.

> and I have to check other fses to be extra safe.

No other filesystem tries to do parity RMW writes with data csums,
because that's a bad system design choice that they were able to avoid
making. Only btrfs has btrfs problems. ;)

Other filesystems use strictly immutable extents with parity embedded
within them (ZFS), or they use allocators that aggressively coalesce
allocations and writes so that all parity raid writes are full-stripe
writes with garbage collection (bcachefs).

Parity raid block devices don't have csums. They don't have to handle
corrupted blocks because they don't know whether any block is
corrupted or not. They simply update the P/Q blocks whenever enough
devices are readable, and go completely read-only if too many are not
readable.

> But even with that fixed, the do-not-touch-bad-full-stripe policy is
> not doing anything helpful, unless the corrupted sector gets fully
> deleted, which can be very hard to detect.
>
> Maybe you can try a different recovery behavior?
>
> E.g. always read the file, and if the read fails, delete it and then
> copy a new one from the backup server.
>
> Then it may solve the problem, as the bad sectors would be deleted
> (along with their csums).

There are some problems with this:

- The failures occur when rmw_rbio reads a file that happens to share
a raid5 stripe with the extent we are trying to create. The files are
not related in any way except for the position of some of their blocks
on disk. Finding these files takes time. Scrub takes a long time
(weeks), and 'find -type f...' may not find the bad sectors at all, so
there's a high risk that a writer will discover an uncorrectable
sector before anything else does, and in that case the first signal
that we have a problem is silent data loss.
- While researching all this, I discovered that in 6.4, 1009254bf22a
("btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub") broke
scrub: it aborts on the first stripe with an uncorrectable sector, so
users can't discover more than one uncorrectable sector at a time.

- Even if we know of a bad file, we can't replace it, because the
replacement requires reliable writes to the filesystem. All files
referencing bad stripes have to be removed before any replacement files
can be loaded onto the filesystem. That's a weeks-long interruption in
service, and poses some logistical issues (e.g. temporary additional
space to store the files we can't write to the original filesystem).

- Once the allocator has found an uncorrectable sector, write failures
happen continually, at increasing rates. They only stop when _every_
uncorrectable sector is removed from the filesystem. This is not a
proportional response to a single bad sector, especially when it's so
easy to fix.

In kernel 6.1 and earlier, a damaged btrfs raid5 filesystem has no
availability for writes because bugs in RMW would corrupt good data if
we write to the filesystem. We can write to our filesystems only
because we have a full replica of all the data somewhere else, so the
btrfs bugs only cause corruption in files we can replace. Ours is a
specialist application. 6.1's raid5 isn't usable as a rootfs or
unreplicated primary data store.

In kernel 6.2 and later, a damaged btrfs raid5 filesystem has no
availability for writes because we can't be sure that the filesystem
doesn't have any uncorrectable stripes. If at least one uncorrectable
stripe exists, we can't expect any data we write to stick. This makes
6.2 and later unusable as a rootfs or unreplicated primary data store,
even though the corruption bugs in 6.1 were fixed.

This patch makes RMW work again, mostly:

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 39bec672df0c..2964ff380671 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2277,7 +2277,13 @@ static int rmw_read_wait_recover(struct btrfs_raid_bio *rbio)
 	 * and csum mismatch), just let recover_sectors() to handle them all.
 	 */
 	submit_read_wait_bio_list(rbio, &bio_list);
-	return recover_sectors(rbio);
+	recover_sectors(rbio);
+
+	/*
+	 * Whether we recover corrupted sectors or not, we must proceed
+	 * with the rmw operation, or we will drop the extent we are
+	 * trying to write.
+	 */
+	return 0;
 }
 
 static void raid_wait_write_end_io(struct bio *bio)

With that patch, my raid5 test cases work for the first time since I
wrote them in 2016: writes are fully available, and there's no new
corruption of existing files (any uncorrectable sectors remain
uncorrectable, and no new ones appear). It's good enough to unblock
users who are currently stuck on 6.1 due to the RMW regression.

For the "mostly": there are some unresolved issues. With the above
patch, the set of uncorrectable sectors is stable (nothing that wasn't
corrupted before gets corrupted by RMW), but the _contents_ of the
uncorrectable sectors get replaced with garbage, because the code isn't
leaving the stripe data in a usable state for the P/Q update, and it
always writes out the full stripe, including known garbage data. P/Q
does match the garbage, so the stripe will work in degraded mode for
the new data.

The rmw code has to do something like this:

1. do the R operation: for unreadable blocks, zero-fill the block.

2. for readable blocks, verify csums (no change).

3. reconstruct missing or corrupt blocks in memory separate from the
blocks read from disk (i.e. don't overwrite in place).

4. verify again. If verification fails, keep the original version of
the data (or zeros if the block could not be read).

5. keep track of which blocks are bad, i.e. don't clear error_bitmap in
rmw_rbio unless the block verifies OK.

6. do the M operation: modify the blocks to be written, and compute new
P/Q blocks for the stripe.

7. clear the bad bits for the modified blocks and the P/Q blocks.

8. do the W operation: write out only the blocks that are not marked as
bad, i.e. the corrected blocks, the modified blocks, and the P/Q
blocks. If a block can't be corrected, don't write it out, to give the
device another chance to get the data on the next read.

9. if too many writes fail, then rmw_rbio can return an error. (Note:
this last step opens up a can of worms for the read error handling
cases.)

For scrub, the problem is that scrub_raid56_parity_stripe exits the
entire scrub loop when it sees an unrepaired sector, instead of
continuing to the next stripe. Everything else there is OK, so it's a
one-line fix:

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c7fcb6408eb6..12c3ce0efafa 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1972,7 +1972,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 "unrepaired sectors detected, full stripe %llu data stripe %u errors %*pbl",
 				      full_stripe_start, i, stripe->nr_sectors,
 				      &error);
-			ret = -EIO;
+			ret = 0;
 			goto out;
 		}
 		bitmap_or(&extent_bitmap, &extent_bitmap,

I'll post these patches separately.

> Thanks,
> Qu

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
  2024-07-08  6:25 ` Zygo Blaxell
@ 2024-07-08  8:26 ` Lukas Straub
  2024-07-08 20:22 ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell
  0 siblings, 1 reply; 7+ messages in thread
From: Lukas Straub @ 2024-07-08  8:26 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3711 bytes --]

On Mon, 8 Jul 2024 02:25:37 -0400
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:

> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> > On 2024/6/8 11:25, Zygo Blaxell wrote:
> > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > > After this change, we now end up in an infinite loop:
> > >
> > > 1. Allocator picks a stripe with some unrecoverable csum blocks
> > > and some free blocks
> > >
> > > 2. Writeback tries to put data in the stripe
> > >
> > > 3. rmw_rbio aborts after it can't repair the existing blocks
> > >
> > > 4. Writeback deletes the extent, often silently (the application
> > > has to use fsync to detect it)
> > >
> > > 5. Go to step 1, where the allocator picks the same blocks again
> > >
> > > The effect is pretty dramatic--even a single unrecoverable sector in
> > > one stripe will bring an application server to its knees, constantly
> > > discarding an application's data whenever it tries to write. Once the
> > > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > > it keeps allocating that same block over and over again.
> >
> > I'm afraid the error path (no way to inform the caller) is an existing
> > problem. Buffered write can always success (as long as no ENOMEM/ENOSPC
> > etc), but the real writeback is not ensured to success.
> > It doesn't even need RAID56 to trigger.
> >
> > But "discarding" the dirty pages doesn't sound correct.
> > If a writeback failed, the dirty pages should still stay dirty, not
> > discarded.
> >
> > It may be a new bug in the error handling path.
>
> I found the code that does this. It's more than 11 years old:
>
> commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> Author: Josef Bacik <jbacik@fusionio.com>
> Date:   Thu Jan 31 14:58:00 2013 -0500
>
>     Btrfs: unreserve space if our ordered extent fails to work
>
>     When a transaction aborts or there's an EIO on an ordered extent or any
>     error really we will not free up the space we reserved for this ordered
>     extent. This results in warnings from the block group cache cleanup in the
>     case of a transaction abort, or leaking space in the case of EIO on an
>     ordered extent. Fix this up by free'ing the reserved space if we have an
>     error at all trying to complete an ordered extent. Thanks,
>
> [...]

Before this escalates further in IMHO the wrong direction:

I think the current btrfs behavior is correct. See also this paper[1],
which examines write failures of buffered I/O in different filesystems;
see especially Table 2. ext4 and xfs, for example, do not discard the
page cache on write failure, but this is worse, since now there is a
mismatch between what is in the cache and what is on disk. They do not
retry writing back the page cache.

The source of confusion here is rather that write errors do not happen
in the real world: Disks do not verify if they wrote data correctly and
neither does any layer (raid, etc.) above it.
Thus handling of write failures is completely untested in all
applications (see the paper again), and it seems the problems you see
are due to wrong handling of write errors. The data loss is not silent;
it's just that many applications and scripts do not use fsync() at all.

I think the proper way of resolving this is for btrfs to retry writing
the extent, but to another (possibly clean) stripe. Or perhaps a fresh
raid5 block group altogether.

I very much approve of btrfs' current design of handling (and
reporting) write errors in the most correct way possible.

Regards,
Lukas Straub

[1] https://www.usenix.org/system/files/atc20-rebello.pdf

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle")
  2024-07-08  8:26 ` Lukas Straub
@ 2024-07-08 20:22 ` Zygo Blaxell
  0 siblings, 0 replies; 7+ messages in thread
From: Zygo Blaxell @ 2024-07-08 20:22 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs

On Mon, Jul 08, 2024 at 10:26:32AM +0200, Lukas Straub wrote:
> > I found the code that does this. It's more than 11 years old:
> >
> > commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> > Author: Josef Bacik <jbacik@fusionio.com>
> > Date:   Thu Jan 31 14:58:00 2013 -0500
> >
> >     Btrfs: unreserve space if our ordered extent fails to work
> >
> >     When a transaction aborts or there's an EIO on an ordered extent or any
> >     error really we will not free up the space we reserved for this ordered
> >     extent. This results in warnings from the block group cache cleanup in the
> >     case of a transaction abort, or leaking space in the case of EIO on an
> >     ordered extent. Fix this up by free'ing the reserved space if we have an
> >     error at all trying to complete an ordered extent. Thanks,
> >
> > [...]
>
> Before this escalates further in IMHO the wrong direction:

Now we're drifting away from raid56-specific issues, and into general
btrfs write error handling theory. OK, I'm up for that...

> I think the current btrfs behavior is correct. See also this paper[1],
> which examines write failures of buffered I/O in different filesystems;
> see especially Table 2. ext4 and xfs, for example, do not discard the
> page cache on write failure, but this is worse, since now there is a
> mismatch between what is in the cache and what is on disk. They do not
> retry writing back the page cache.

ext4 and xfs don't retry the write, but they also don't deallocate the
block from the filesystem, then try to use the same block immediately
in a future allocation. Of the filesystems in the paper, only btrfs
does write-failure data loss in a rapidly cycling _loop_.

xfs and ext4 (and btrfs on a nodatacow file) leave the failing extent
allocated, so future reads return garbage, but newly created or
extended files pick different, hopefully error-free sectors on the
disk. ext4 and xfs thus avoid having a data loss loop, more or less
accidentally.

We could replicate that to break the loop on btrfs, by introducing an
"IOERROR" extent type. This would be like a PREALLOC extent, in the
sense that it reserves space on disk without writing data to the space,
but it would be always datacow for writes, and always return EIO on
read instead of zero-fill. Scrub would not attempt reading such an
extent, and would report it as a distinct failure type from the
existing read/csum/write/etc failures. Balance would make IOERROR
extent references point to a zero bytenr when they are relocated (so
they become IOERROR holes in files, with no physical extent behind
them, and so that balance can dispose of the block group without
tripping over an IO error itself). btrfs send would transmit them,
btrfs receive would replicate them, btrfs qgroup would count their
space if they have a physical extent (i.e. they're not a synthetic
extent created by balance or receive). IOERROR extents would persist
until they are deleted by a user action, like removing the file, or
overwriting the entire extent.
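To make the shape of the proposal concrete, here's a hypothetical
sketch of the read-path check. Nothing in it exists in btrfs today--the
EXTENT_FLAG_IOERROR flag and the helper are invented names, purely for
illustration:

	/* Hypothetical sketch of the proposed IOERROR extent type.
	 * EXTENT_FLAG_IOERROR and this helper do not exist in btrfs. */
	static int btrfs_check_ioerror_extent(const struct extent_map *em)
	{
		/* Like PREALLOC, the space stays reserved (until
		 * balance rewrites the reference with a zero bytenr),
		 * but reads must return EIO instead of zero-filling,
		 * and scrub would skip the extent and count it as a
		 * distinct failure type. */
		if (em->flags & EXTENT_FLAG_IOERROR)	/* invented flag */
			return -EIO;
		return 0;
	}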
An IOERROR extent would be created by flipping a few bits in the extent
item when the write failure is detected, then proceeding to write out
the extent metadata for the lost extent data (instead of deleting it).
This breaks the loop after btrfs writes to a broken sector (or at least
pauses the loop until the user deletes the broken file and releases the
broken sectors for another allocation). IOERROR doesn't need additional
space like retries do, since we're writing out the same number of
extents to disk that we already have reserved, so an IO error can't
cascade to an ENOSPC error.

IOERROR extents break the rapid data loss loop and replicate the better
behaviors of ext4 and xfs, without introducing their bad behaviors to
btrfs. Users can reliably detect lost data after the fact by reading
the files or running scrub, so they don't need to rely on unreliable
and invasive methods like fsync() or syncfs() for detection.

It's an interesting idea for future enhancement, but it's no solution
to the rmw regression. It just breaks the loop and makes
write-error-extent-drop detection easier, but the problem for rmw is
that the write errors shouldn't happen in the first place.

> The source of confusion here is rather that write errors do not happen
> in the real world: Disks do not verify if they wrote data correctly and
> neither does any layer (raid, etc.) above it.

Disks do go offline in the real world, and that presents as write
errors. I've had systems running raid1 where 2 disks go offline, but
neither disk contains any metadata, so btrfs runs the
write-delete-repeat data loss loop for many hours. I didn't care much
about that, because raid1 isn't expected to work with 2 disks offline,
but I do care about raid5 not working with 0 disks offline.

There are some small changes we can make to error handling that might
quickly improve things for users. While debugging the RMW issues, I
have been using a patch that makes the extent drops visible in dmesg:

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8f38eefc8acd..be3cda596bb8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3214,6 +3214,22 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 	    clear_reserved_extent &&
 	    !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
+		/*
+		 * Notify the user that some data has been lost due
+		 * to an error, unless we were already expected to
+		 * remove the data because of truncate.
+		 *
+		 * Put both logical and physical addresses here
+		 * because we're about to remove the mapping that
+		 * would have allowed the user to translate from
+		 * one to the other.
+		 */
+		if (ret && !truncated)
+			btrfs_err_rl(fs_info,
+"dropping unwritten extent at root %lld ino %llu offset [%llu,%llu] physical %llu length %llu",
+				btrfs_root_id(inode->root), btrfs_ino(inode),
+				unwritten_start, end, ordered_extent->disk_bytenr,
+				ordered_extent->disk_num_bytes);
 		/*
 		 * Discard the range before returning it back to the
 		 * free space pool

This has been useful when debugging the RMW issue, and valuable on
production filesystems too. It also lights up when some of the other
recently fixed data loss bugs in btrfs are reproduced during bisection
searches. I'll submit that as a patch separately.

That new 'if' statement could easily check a mount flag, and throw in a
courtesy call to btrfs_abort_transaction or btrfs_panic if set. That
would be welcome for users who value data consistency over
availability.
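As a sketch of what that could look like--the mount option and its name
are invented here, while btrfs_handle_fs_error() is real and forces the
filesystem read-only:

	/* Hypothetical extension of the patch above: escalate a
	 * dropped data write when an (invented) DATA_ERRORS_RO mount
	 * option is set, for users who prefer going read-only over
	 * running with silently inconsistent data. */
	if (ret && !truncated) {
		btrfs_err_rl(fs_info,
	"dropping unwritten extent at root %lld ino %llu offset [%llu,%llu] physical %llu length %llu",
			btrfs_root_id(inode->root), btrfs_ino(inode),
			unwritten_start, end, ordered_extent->disk_bytenr,
			ordered_extent->disk_num_bytes);
		if (btrfs_test_opt(fs_info, DATA_ERRORS_RO))	/* invented */
			btrfs_handle_fs_error(fs_info, ret,
					"unrecoverable data writeback error");
	}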
I've seen several prospective btrfs users go back to ext4/xfs because
btrfs doesn't support 'errors=remount-ro' or 'errors=panic' for data
writes, and they absolutely prefer to crash instead of running with
known inconsistent data anywhere in the filesystem.

btrfs could also do stats counting for dropped extents somewhere, but
it can't use dev stats exactly, as the event happens above the
individual device level. We'd need a different stats item for that.

> Thus handling of write failures is completely untested in all
> applications (see the paper again), and it seems the problems you see
> are due to wrong handling of write errors.

Not quite. It's a failure of expectations between two components of
btrfs, and fixes can be implemented in either or both components.

The new RMW code assumed that the ordered-data error-handling code
would recover from RMW returning a synthetic write error and preserve
the data, but that assumption is wrong: the ordered-data error-handling
code has never done that. The data is lost in kernel 6.2 and later,
because both of these subsystems now assume the other will work around
the dependent read failures and get the data written somewhere. In
kernel 6.1 and earlier, the raid5 component was responsible for working
around the read failures during writes, so there was no need for
ordered data to handle them.

> The data loss is not silent; it's just that many applications and
> scripts do not use fsync() at all.

fsync()'s return value is a terrible non-solution. Diagnostic coverage
is less than 100%, and not all of the gaps can be eliminated (e.g. if
you're running syncfs() in a continuous loop, you don't get notified if
writebacks win races against you). It's invasive, relying on the entire
workload to be redesigned so that processes can stick around until
writes finish without blocking everything on IO performance (try that
with a parallel Makefile tree near you). The engineering hours and
runtime cost of application-level synchronous monitoring of every
individual write aren't reasonable for an event that is extremely rare
on a properly provisioned and monitored redundant RAID system, and
generally considered catastrophic if it should occur.

I guess I'm agreeing with you there. Many applications have good
reasons to never use fsync.

> I think the proper way of resolving this is for btrfs to retry writing
> the extent, but to another (possibly clean) stripe. Or perhaps a fresh
> raid5 block group altogether.

That's what I proposed as a long-term solution, but I also propose
shorter-term changes to fix the specific rmw regression immediately.

> I very much approve of btrfs' current design of handling (and
> reporting) write errors in the most correct way possible.

I think btrfs's current handling of write errors is some distance away
from the most correct way possible, or even from the most correct way
currently implemented in a mainstream filesystem; however, it's good
enough for now, and there are more important issues that need our
attention.

> Regards,
> Lukas Straub
>
> [1] https://www.usenix.org/system/files/atc20-rebello.pdf

^ permalink raw reply related	[flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-07-08 20:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-01  3:24 raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Zygo Blaxell
2024-06-01  7:52 ` Qu Wenruo
2024-06-08  1:55   ` Zygo Blaxell
2024-06-08  3:20     ` Qu Wenruo
2024-07-08  6:25       ` Zygo Blaxell
2024-07-08  8:26         ` Lukas Straub
2024-07-08 20:22           ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox