* raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
@ 2024-06-01 3:24 Zygo Blaxell
2024-06-01 7:52 ` Qu Wenruo
0 siblings, 1 reply; 7+ messages in thread
From: Zygo Blaxell @ 2024-06-01 3:24 UTC (permalink / raw)
To: linux-btrfs
There is a new silent data loss bug in kernel 6.2 and later.
The requirements for the bug are:
1. 6.2 or later kernel
2. raid5 data in the filesystem
3. one device severely corrupted
4. some free space fragmentation to trigger a lot of rmw cycles
I first discovered this bug after a `pvmove` went badly wrong and wiped
out 70% of one data LV of a btrfs raid5 filesystem. I figured out a
recipe for reproducing the bug on a test VM (below), and a possible
approach for a fix.
Bug insertion
-------------
The bug is introduced in this commit:
7a3150723061 btrfs: raid56: do data csum verification during RMW cycle
This commit fixes a long-standing btrfs raid5 bug which mishandled cases
where a raid5 stripe contains corrupted data in some blocks while new data
is written to other blocks in the same stripe. This older bug resulted
in existing correct data becoming corrupted when new data is written to
the filesystem. EIO and csum failures would occur later when reading
affected blocks, i.e. the errors were detectable.
Bug effect
----------
The change fixes the old bug, but also introduces two new behaviors that
are potentially worse:
1. New writes are now silently dropped. After `sync` and `sysctl
vm.drop_caches=3`, recently written extents are missing from the files,
as if they were never written by the application. Instead, there is a
hole in the file, or a truncated file. There is no indication that this
has occurred other than the missing data: no dmesg messages, no errors
reported to applications at write time. The data can be read back from
the filesystem while it is still in page cache, which can delay detection
of the failure until after the data is evicted from cache.
2. `fsync` (system call) and `sync -f` (shell command) sometimes
return EIO. In this situation, devices are online, metadata is intact,
all corruption is located on a single device and is recoverable, so no
writes should ever be failing, but they apparently do. This also affects
internal btrfs maintenance operations like balance, which aborts with
an IO failure when triggering the bug.
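The first symptom can be probed at the application level with a small script along these lines (a sketch; the mountpoint defaults to a temp dir and `/testfs` is a placeholder): write data, sync it, evict the page cache, then re-read from disk and compare checksums.

```shell
#!/bin/sh
# Probe for silently dropped writes: checksum a freshly written file,
# force writeback, evict the page cache, then re-read from disk and
# compare. On an affected kernel the second read can hit a hole or a
# truncated file, so the checksums differ.
check_lost_writes() {
    dir="$1"
    dd if=/dev/urandom of="$dir/probe" bs=64k count=16 2>/dev/null
    before=$(sha256sum "$dir/probe" | cut -d' ' -f1)
    sync -f "$dir"
    # Drop clean pages so the next read comes from disk, not cache.
    # Needs root; harmless to skip when unprivileged.
    sysctl -w vm.drop_caches=3 >/dev/null 2>&1 || true
    after=$(sha256sum "$dir/probe" | cut -d' ' -f1)
    if [ "$before" = "$after" ]; then
        echo "OK: data survived cache eviction"
    else
        echo "LOST: on-disk data differs from what was written"
    fi
}

# Pass the test mountpoint (e.g. /testfs) as $1; defaults to a temp dir.
check_lost_writes "${1:-$(mktemp -d)}"
```

On a healthy filesystem this always prints the OK line; the interesting case is running it against the raid5 test mount while the corruption loop is active.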
Reverting the commit 7a3150723061 on 6.2.16 restores the old behavior:
no silent data loss on new writes, only the old detectable data corruption
on existing data occurs. It seems we can't have nice things.
Analysis
--------
In the commit, I notice that when reading the rmw stripe, any blocks with
csum errors are flagged in rbio->error_bitmap, but nothing ever clears
those error bits once they are set. Contrast with the scrub repair code,
which clears out the entire rbio->error_bitmap vector after repairing
the stripe. The old code didn't set any error_bitmap bits during rmw,
and not much else has changed in this commit, which makes the new set_bit
calls a good candidate for root cause.
I tried some experiments:
- clear the error_bits individually after setting uptodate = 1
(still has problems: after verifying the recovered csum,
failed reconstruction would leave error bits set, so writes
to the stripe that should not fail will fail)
- clear the entire error_bits vector after successful repair_sectors
(might cause corruption if some reconstruction fails and later
reads use data from the stripe without verifying the csum again)
- remove the new set_bit() call added in the above commit
(breaks repair but fixes EIO and dropped writes)
For the first two I didn't have much success. I'm pretty sure I don't
know how to calculate the error_bit index correctly, so my first attempt
didn't have any effect on the test. I get a lot of UAF crashes if I clear
the entire vector in the second test--I suspect it needs some protection
from concurrent access that I'm missing.
My third experiment breaks the error recovery code, but it does prevent
the sync failures and missing extent holes, so it shows that the error
recovery code itself is not what is causing the dropped writes--it's
the bits left set in error_bitmap after recovery is done.
Test Case
---------
My test case uses three loops running in parallel on a 500 GiB test filesystem:
Data Metadata System
Id Path RAID5 RAID1 RAID1 Unallocated Total Slack
-- -------- --------- -------- -------- ----------- --------- --------
1 /dev/vdb 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 19.59GiB
2 /dev/vdc 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 3.71GiB
3 /dev/vdd 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 3.71GiB
4 /dev/vde 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
5 /dev/vdf 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
-- -------- --------- -------- -------- ----------- --------- --------
Total 284.00GiB 4.00GiB 8.00MiB 3.16TiB 3.52TiB 49.02GiB
Used 262.97GiB 2.61GiB 64.00KiB
The data is a random collection of small files, half of which have been deleted
to make lots of small free space holes for rmw.
Loop 1 alternates between corrupting device 3 and repairing it with scrub:
while true; do
# Any big file will do, usually faster than /dev/random
# Skipping the first 1M leaves the superblock intact
while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
# This should fix all the corruption as long as there are no
# reads or writes anywhere on the filesystem
btrfs scrub start -Bd /dev/vdd
done
Loop 2 runs `sync -f` to detect sync errors and drops caches:
while true; do
# Sometimes throws EIO
sync -f /testfs
sysctl vm.drop_caches=3
sleep 9
done
Loop 3 does some random git activity on a clone of the 'btrfs-progs'
repo to detect lost writes at the application level:
while true; do
cd /testfs/btrfs-progs
# Sometimes fails complaining about various files being corrupted
find * -type f -print | unsort -r | while read -r x; do
date >> "$x"
git commit -am"Modifying $x"
done
git repack -a
done
The errors occur on the sync -f and various git commands, e.g.:
sync: error syncing '/media/testfs/': Input/output error
vm.drop_caches = 3
error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
2024-06-01 3:24 raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Zygo Blaxell
@ 2024-06-01 7:52 ` Qu Wenruo
2024-06-08 1:55 ` Zygo Blaxell

From: Qu Wenruo @ 2024-06-01 7:52 UTC (permalink / raw)
To: Zygo Blaxell, linux-btrfs
On 2024/6/1 12:54, Zygo Blaxell wrote:
> There is a new silent data loss bug in kernel 6.2 and later.
> The requirements for the bug are:
>
> 1. 6.2 or later kernel
> 2. raid5 data in the filesystem
> 3. one device severely corrupted
> 4. some free space fragmentation to trigger a lot of rmw cycles
I'm still not convinced these conditions are enough to trigger the bug.
Since RAID56 now does csum verification before RMW, even if some range is
fully corrupted, as long as the recovered data matches the csum, it will
use the recovered data instead.
And if any vertical stripe cannot be recovered, the whole RMW cycle
errors out.
[...]
> --------
>
> In the commit, I notice that when reading the rmw stripe, any blocks with
> csum errors are flagged in rbio->error_bitmap, but nothing ever clears
> those error bits once they are set.
Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
doing any RMW.
The same for finish_parity_scrub(), scrub_rbio().
Yes, this means we can have a cached rbio with a stale error bitmap, but
it doesn't make any difference, as rmw_rbio() is always the entry point
for an RMW cycle.
Maybe I can improve that by clearing the error bitmap after everything
is done, but I'd prefer to get a proper root cause analysis before doing
any random fix.
[...]
>
> My third experiment breaks the error recovery code, but it does prevent
> the sync failures and missing extent holes, so it shows that the error
> recovery code itself is not what is causing the dropped writes--it's
> the bits left set in error_bitmap after recovery is done.
Yep, that's expected.
So I'm more interested in a proper (ideally minimal) reproducer than in
any fix attempt (since no patch has been sent, the attempts evidently
failed).
>
>
> Test Case
> ---------
>
> My test case uses three loops running in parallel on a 500 GiB test filesystem:
>
> Data Metadata System
> Id Path RAID5 RAID1 RAID1 Unallocated Total Slack
> -- -------- --------- -------- -------- ----------- --------- --------
> 1 /dev/vdb 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 19.59GiB
> 2 /dev/vdc 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 3.71GiB
> 3 /dev/vdd 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 3.71GiB
> 4 /dev/vde 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> 5 /dev/vdf 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> -- -------- --------- -------- -------- ----------- --------- --------
> Total 284.00GiB 4.00GiB 8.00MiB 3.16TiB 3.52TiB 49.02GiB
> Used 262.97GiB 2.61GiB 64.00KiB
>
> The data is a random collection of small files, half of which have been deleted
> to make lots of small free space holes for rmw.
>
> Loop 1 alternates between corrupting device 3 and repairing it with scrub:
The reproducer is not good enough; in fact it's pretty bad...
Using anything that isn't normalized is never a good way to reproduce,
but I guess it's already the best scenario you have.
Can you try it with a newly created fs instead?
>
> while true; do
> # Any big file will do, usually faster than /dev/random
> # Skipping the first 1M leaves the superblock intact
> while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
> # This should fix all the corruption as long as there are no
> # reads or writes anywhere on the filesystem
> btrfs scrub start -Bd /dev/vdd
> done
[IMPROVE THE TEST]
If you want to cause interleaved free space, just create a ton of 4K
files and delete every other one.
And instead of vmlinux or whatever file, you can always go with a
randomly/pattern-filled file and save its md5sum for verification.
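Such a setup might look like this sketch (the mountpoint, directory name, and file count are placeholders): fill a directory with 4 KiB files, delete the even-numbered ones to comb the free space with 4 KiB holes, and keep md5sums of the survivors for later verification.

```shell
#!/bin/sh
# Build interleaved free space: many 4 KiB files, then delete every
# other one so sub-stripe writes are forced down the RMW path.
mnt="${1:-$(mktemp -d)}"   # pass the test mountpoint (e.g. /testfs) as $1
mkdir -p "$mnt/frag"
i=0
while [ "$i" -lt 1000 ]; do
    dd if=/dev/urandom of="$mnt/frag/f$i" bs=4k count=1 2>/dev/null
    i=$((i + 1))
done
sync -f "$mnt"
# Free the even-numbered 4 KiB blocks to create the holes
i=0
while [ "$i" -lt 1000 ]; do
    rm -f "$mnt/frag/f$i"
    i=$((i + 2))
done
# Record checksums of the survivors so later corruption is detectable
( cd "$mnt/frag" && md5sum f* > ../frag.md5 )
echo "survivors: $(ls "$mnt/frag" | wc -l)"
```

After a corruption/scrub cycle, `md5sum -c frag.md5` run from the frag directory verifies whether any surviving file was damaged.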
[MY CURRENT GUESS]
My current guess is some race between the dd corruption and RMW.
The last time I worked on RAID56, I always did the corruption offline
(i.e., with the fs unmounted), and it always worked like a charm.
So the online corruption may be a point of concern.
Another thing is, if a full stripe is determined to have unrepairable
data, no RMW can ever be done on that full stripe (unless someone
manually fixes the problem).
So if you somehow corrupted the full stripe by corrupting just one
device (maybe combined with some existing csum mismatch, etc.?), then the
full stripe would never be written back, and neither would the new data.
Finally, the lack of any dmesg output is indeed a problem: there is
*NO* error message at all if we fail to recover a full stripe.
Just check the recover_sectors() call and its callers.
And I believe that may contribute to the confusion: btrfs considers
the fs fine, while it catches tons of errors and aborts all writes
to those full stripes.
I appreciate the effort you put into this case, but I really hope to get
a more reproducible procedure; otherwise it's really hard to say what is
going wrong.
If needed I can craft some debug patches for you to test, but I believe
you won't really want to run testing kernels on your large RAID5 array
anyway.
So a more normalized test would help us both.
Thanks,
Qu
>
> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>
> while true; do
> # Sometimes throws EIO
> sync -f /testfs
> sysctl vm.drop_caches=3
> sleep 9
> done
>
> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
> repo to detect lost writes at the application level:
>
> while true; do
> cd /testfs/btrfs-progs
> # Sometimes fails complaining about various files being corrupted
> find * -type f -print | unsort -r | while read -r x; do
> date >> "$x"
> git commit -am"Modifying $x"
> done
> git repack -a
> done
>
> The errors occur on the sync -f and various git commands, e.g.:
>
> sync: error syncing '/media/testfs/': Input/output error
> vm.drop_caches = 3
>
> error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
> fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
>
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
2024-06-01 7:52 ` Qu Wenruo
@ 2024-06-08 1:55 ` Zygo Blaxell
2024-06-08 3:20 ` Qu Wenruo
From: Zygo Blaxell @ 2024-06-08 1:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
>
>
> On 2024/6/1 12:54, Zygo Blaxell wrote:
> > There is a new silent data loss bug in kernel 6.2 and later.
> > The requirements for the bug are:
> >
> > 1. 6.2 or later kernel
> > 2. raid5 data in the filesystem
> > 3. one device severely corrupted
> > 4. some free space fragmentation to trigger a lot of rmw cycles
>
> I'm still not convinced this can be the condition to trigger the bug.
>
> As RAID56 now does csum verification before RMW, even if some range is
> fully corrupted, as long as the recovered data matches csum, it would
> use the recovered data instead.
>
> And if any vertical stripe is not good, the whole RMW cycle would error out.
> [...]
> > --------
> >
> > In the commit, I notice that when reading the rmw stripe, any blocks with
> > csum errors are flagged in rbio->error_bitmap, but nothing ever clears
> > those error bits once they are set.
>
> Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
> doing any RMW.
This is true, but the code looks like this:
ret = rmw_read_wait_recover(rbio);
if (ret < 0)
goto out;
}
[some other stuff]
bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
The problem happens before the bitmap_clear, at `goto out`. Actually it
happens in rmw_read_wait_recover() (formerly rmw_read_and_wait()),
because that function now changes 'ret' if the repair fails:
@@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
submit_read_bios(rbio, &bio_list);
wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+
+ /*
+ * We may or may not have any corrupted sectors (including missing dev
+ * and csum mismatch), just let recover_sectors() to handle them all.
+ */
+ ret = recover_sectors(rbio);
return ret;
out:
while ((bio = bio_list_pop(&bio_list)))
Before this change, if all the blocks in the stripe were readable, we
would recompute parity based on the data that was read, and proceed
to finish the rbio. As long as the writes were also successful,
the new blocks that were written would be updated and recoverable.
(The old unrecoverable blocks would still be unrecoverable, but there's
no fixing those.)
After this change, we now end up in an infinite loop:
1. Allocator picks a stripe with some unrecoverable csum blocks
and some free blocks
2. Writeback tries to put data in the stripe
3. rmw_rbio aborts after it can't repair the existing blocks
4. Writeback deletes the extent, often silently (the application
has to use fsync to detect it)
5. Go to step 1, where the allocator picks the same blocks again
The effect is pretty dramatic--even a single unrecoverable sector in
one stripe will bring an application server to its knees, constantly
discarding an application's data whenever it tries to write. Once the
allocator reaches the point where the "next" block is in a bad rmw stripe,
it keeps allocating that same block over and over again.
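Because the failure only surfaces at fsync time, the loop can be made visible from user space with a sketch like this (the mountpoint defaults to a temp dir; `/testfs` is a placeholder): buffered writes appear to succeed, and only the fsync, done here via dd's `conv=fsync`, reports the EIO when the stripe's RMW aborts.

```shell
#!/bin/sh
# Count how many small synchronous writes fail. On a healthy fs this
# prints 0 failures; on a fs stuck in the allocate-write-fail loop,
# most or all of the writes report EIO at fsync time.
mnt="${1:-$(mktemp -d)}"   # pass the test mountpoint (e.g. /testfs) as $1
fails=0
n=0
while [ "$n" -lt 20 ]; do
    # conv=fsync makes dd fsync the output file and exit nonzero on EIO
    if ! dd if=/dev/zero of="$mnt/probe.$n" bs=4k count=1 \
            conv=fsync 2>/dev/null; then
        fails=$((fails + 1))
    fi
    n=$((n + 1))
done
echo "fsync failures: $fails/20"
```

A nonzero failure count while all devices are online is the symptom described above.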
So there are some different cases to handle at this point:
1. We can read all the blocks and they have good csums: proceed
to write data and update parity. This is the normal good case.
2. We can read all the blocks, there are csum failures, but they
can be repaired: proceed to write data and update parity. This is
the case fixed in the patch.
3. We can read all the blocks, there are csum failures, but they
can't be repaired: proceed to write data and update parity. This case
is broken in the patch. The data in the unrecoverable blocks is lost
already, but we can still write the new data in the stripe and update
the parity to recover the new blocks if we go into degraded mode in
the future.
4. We can read enough of the blocks to attempt repair, and the
repair succeeds: proceed to write data and update parity. This is
the degraded good case.
5. We can read enough of the blocks to attempt repair, and the
repair fails: proceed to write data and update parity. The lost
data block is gone, but we can still write the new data blocks and
compute a usable parity block. This is another case broken by the
patch, the degraded-mode version of case 3.
6. We can't read enough of the blocks to attempt repair: we can't
compute parity any more, because we don't know what data is in the
missing blocks. We can't assume it's some constant value (e.g. zero)
because then the new parity would be wrong if some but not all of the
failing drives came back.
For cases 1-5, we can ignore the return value of verify_one_sector()
(or at least not pass it up to the caller of recover_vertical()).
Either we restored the sector or we didn't, but neither outcome affects
whether a new parity block can be computed to enable recovery of
the new data blocks and the remaining recoverable data blocks. The other
error cases in recover_vertical() are either fatal problems like ENOMEM,
or there are too many read failures and we're actually in case 6.
For case 6, the options are:
A. abort the rbio and drop the write
B. force the filesystem read-only
I like B for this case, but AIUI the other btrfs raid profiles currently
do A if all the data block mirror writes fail.
> The same for finish_parity_scrub(), scrub_rbio().
>
> Yes, this means we can have the cache rbio with error bitmap, but it
> doesn't make any difference, as rmw_rbio() is always the entrance for a
> RMW cycle.
>
> Maybe I can enhance that by clearing the error bitmap after everything
> is done, but I prefer to get a proper cause analyse before doing any
> random fix.
>
> [...]
> >
> > My third experiment breaks the error recovery code, but it does prevent
> > the sync failures and missing extent holes, so it shows that the error
> > recovery code itself is not what is causing the dropped writes--it's
> > the bits left set in error_bitmap after recovery is done.
>
> Yep, that's expected.
> So I'm more interested in a proper (better minimal) reproducer other
> than any fix attempt (since there is no patch sent, it already shows the
> attempt failed).
> >
> >
> > Test Case
> > ---------
> >
> > My test case uses three loops running in parallel on a 500 GiB test filesystem:
> >
> > Data Metadata System
> > Id Path RAID5 RAID1 RAID1 Unallocated Total Slack
> > -- -------- --------- -------- -------- ----------- --------- --------
> > 1 /dev/vdb 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 19.59GiB
> > 2 /dev/vdc 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 3.71GiB
> > 3 /dev/vdd 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 3.71GiB
> > 4 /dev/vde 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> > 5 /dev/vdf 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> > -- -------- --------- -------- -------- ----------- --------- --------
> > Total 284.00GiB 4.00GiB 8.00MiB 3.16TiB 3.52TiB 49.02GiB
> > Used 262.97GiB 2.61GiB 64.00KiB
> >
> > The data is a random collection of small files, half of which have been deleted
> > to make lots of small free space holes for rmw.
> >
> > Loop 1 alternates between corrupting device 3 and repairing it with scrub:
>
> The reproducer is not good enough, in fact it's pretty bad...
>
> Using anything not normalized is never a good way to reproduce, but I
> guess it's already the best scenario you have.
>
> Can you try to do it with newly created fs instead?
Already did (I needed to do that many times to run bisect...).
> > while true; do
> > # Any big file will do, usually faster than /dev/random
> > # Skipping the first 1M leaves the superblock intact
> > while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
> > # This should fix all the corruption as long as there are no
> > # reads or writes anywhere on the filesystem
> > btrfs scrub start -Bd /dev/vdd
> > done
>
> [IMPROVE THE TEST]
> If you want to cause interleaved free space, just create a ton of 4K
> files, and delete them interleavely.
>
> And instead of vmlinux or whatever file, you can always go with
> randomly/pattern filled file, and saves its md5sum to do verification.
> [MY CURRENT GUESS]
> My current guess is some race with dd corruption and RMW.
> AFAIK the last time I am working on RAID56, I always do a offline
> corruption (aka, with fs unmounted) and it always works like a charm.
>
> So the running corruption may be a point of concern.
Generally, we expect the hardware to introduce corruption at any time.
Ideally, a btrfs raid[156] should be able to work when one of its devices
returns 4K blocks from /dev/random on every read, and if it can't,
that's always a bug.
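One way to inject that kind of hostile-device behavior online, rather than dd'ing over the member, might be the dm-flakey target, which can corrupt bytes of read bios while the fs stays mounted. The following is only a sketch: the device name, intervals, and in particular the feature-argument count are assumptions to be verified against Documentation/admin-guide/device-mapper/dm-flakey.rst before use.

```shell
#!/bin/sh
# Build a dm-flakey table that wraps one raid5 member so reads are
# corrupted in flight (up 5s clean, down 5s corrupting) instead of
# overwriting the device while the fs is mounted. The privileged
# command is shown commented out; this script only builds the table.
dev="${1:-/dev/vdd}"      # placeholder member device
size="${2:-1048576}"      # device size in 512-byte sectors (placeholder)
# corrupt_bio_byte <Nth_byte> <direction> <value> <flags>: during the
# down interval, set byte 32 of read bios to 0xff. The leading "5" is
# the feature-argument count assumed from the dm-flakey docs.
table="0 $size flakey $dev 0 5 5 5 corrupt_bio_byte 32 r 255 0"
echo "$table"
# dmsetup create badvdd --table "$table"
# ...then mount the fs with /dev/mapper/badvdd in place of $dev.
```

The real size would come from `blockdev --getsz "$dev"` on the wrapped device.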
> Another thing is, if a full stripe is determined to have unrepairable
> data, no RMW can be done on that full stripe forever (unless one
> manually fixed the problem).
Yeah, that's the serious new problem introduced in this patch for 6.2.
6.1 didn't do that in two cases where 6.2 and later do.
> So if by somehow you corrupted the full stripe by just corrupting one
> device (maybe some existing csum mismatch etc?), then the full stripe
> would never be written back, thus causing the data not to be written back.
My reproducer wasn't sufficient to trigger the bug--it would only generate
csum errors on one device, but there need to be csum errors on two devices
(for raid5) to get into the data-loss loop. Apparently 6.2 still has
a bug somewhere that introduces extra errors (it crashes a lot, that
might be related), because when I tested on 6.2 I was observing the
two-csum-failure case, not the intended one-csum-failure case.
6.6 and 6.9 don't have any of those problems--no crashes, no cross-device
corruption. I've been running the reproducer on 6.6 and 6.9 for a week
now, and there are no csum failures on any device other than the one
being intentionally corrupted, and there are no data losses because the
code never hits the "goto out" line in rmw_rbio (confirmed with both
a kgdb breakpoint on the test kernel and some printk's). So on 6.6
and 6.9 btrfs fixes everything my reproducer throws at it.
It was a bad reproducer. Sorry about that!
> Finally for the lack of any dmesg, it's indeed a problem, that there is
> *NO* error message at all if we failed to recover a full stripe.
> Just check recover_sectors() call and its callers.
The trouble with dmesg errors is that there are so many of them when
there's a failure of a wide RAID5 stripe, and they can end up ratelimited,
so admins can't see all of them. But they're also better than nothing
for case 6, so I'd approve of adding a new one here.
For case 6, it might be reasonable to throw the entire filesystem
read-only; otherwise, there's silent data loss, and not having silent
data loss is a critical feature of btrfs. It would still be possible
to scrub, find the affected files, and remove them, without triggering a
rmw cycle. Maybe some future btrfs enhancement could handle it better,
but for now a dead stop is better than any amount of silent loss.
One possible such enhancement is to keep the blocks from being
allocated again, e.g. put them on an in-memory list to break the
allocate-write-fail-free-allocate loop above.
Having said that, I think the other raid1* profiles actually do drop
writes silently when all mirror writes fail for a data block. Maybe we
can have a sysfs flag to decide what to do in that case: continue with
EIO on fsync as we do now, or go fully RO.
> And I believe that may contribute to the confusion, that btrfs consider
> the fs is fine, meanwhile it catches tons of error and abort all writes
> to that full stripes.
> I appreciate the effort you put into this case, but I really hope to get
> a more reproducible procedure, or it's really hard to say what is going
> wrong.
Thanks for the tip on bitmap_clear() - it led me straight to the problem,
even before you explained it a few pages further on...
> If needed I can craft some debug patches for you to test, but I believe
> you won't really want to run testing kernels on your large RAID5 array
> anyway.
I have plenty of VMs for testing kernel patches if you have some.
We run every kernel through a gauntlet of tests on staging VMs. Those
test cases didn't have any coverage for raid5 corruption recovery on
read because it doesn't work on any prior kernel version (only recovery
on scrub was tested). That gap let 6.6 kernels get through acceptance
and deployment, and led to a server getting rebooted from 6.1 to 6.6
in the middle of a pvmove to replace a drive, which combined massive
corruption of one disk, 6.1's bug, and 6.2's bug. That didn't end well.
Fortunately we keep only redundant data on raid5.
To fix the original failing RAID5 array, we reverted to 6.1, which
allowed the server to operate more or less normally for a few weeks
during recovery. We did have 6.1's cross-device error-propagation bug
(ironically the one fixed in this patch), but that bug was far more
manageable because we already had automation in place to detect any data
that was lost and replace it from another server. Scrub combined with
rewriting broken files eventually eliminated all the errors.
On 6.6, all the automation was broken by the regression: when we replaced
a broken file or added a new one, the new file was almost always silently
corrupted, and we had no IO error signal to trigger replacement of the
data from another server, so corrupted files became visible to users.
git-based tools for deployment were fully unusable on 6.6: any commit
would break the local repo, while deleting and recloning the repo simply
hit the same broken stripe rmw bug and broke the replacement repo.
At the end of recovery we did a full verification of SHA hashes and
found only a few errors, all in files that were created or modified
while running 6.6.
> So a more normalized test would help us both.
>
> Thanks,
> Qu
>
> >
> > Loop 2 runs `sync -f` to detect sync errors and drops caches:
> >
> > while true; do
> > # Sometimes throws EIO
> > sync -f /testfs
> > sysctl vm.drop_caches=3
> > sleep 9
> > done
> >
> > Loop 3 does some random git activity on a clone of the 'btrfs-progs'
> > repo to detect lost writes at the application level:
> >
> > while true; do
> > cd /testfs/btrfs-progs
> > # Sometimes fails complaining about various files being corrupted
> > find * -type f -print | unsort -r | while read -r x; do
> > date >> "$x"
> > git commit -am"Modifying $x"
> > done
> > git repack -a
> > done
> >
> > The errors occur on the sync -f and various git commands, e.g.:
> >
> > sync: error syncing '/media/testfs/': Input/output error
> > vm.drop_caches = 3
> >
> > error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
> > fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
> >
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
2024-06-08 1:55 ` Zygo Blaxell
@ 2024-06-08 3:20 ` Qu Wenruo
2024-07-08 6:25 ` Zygo Blaxell
From: Qu Wenruo @ 2024-06-08 3:20 UTC (permalink / raw)
To: Zygo Blaxell, Qu Wenruo; +Cc: linux-btrfs
On 2024/6/8 11:25, Zygo Blaxell wrote:
> On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
[...]
>>
>> Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
>> doing any RMW.
>
> This is true, but the code looks like this:
>
> ret = rmw_read_wait_recover(rbio);
> if (ret < 0)
> goto out;
> }
>
> [some other stuff]
>
> bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
>
> The problem happens before the bitmap_clear, at `goto out`. Actually it
> happens in rmw_read_wait_recover() (formerly rmw_read_and_wait()),
> because that function now changes 'ret' if the repair fails:
>
> @@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
>
> submit_read_bios(rbio, &bio_list);
> wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
> +
> + /*
> + * We may or may not have any corrupted sectors (including missing dev
> + * and csum mismatch), just let recover_sectors() to handle them all.
> + */
> + ret = recover_sectors(rbio);
> return ret;
> out:
> while ((bio = bio_list_pop(&bio_list)))
>
> Before this change, if all the blocks in the stripe were readable, we
> would recompute parity based on the data that was read, and proceed
> to finish the rbio. As long as the writes were also successful,
> the new blocks that were written would be updated and recoverable.
> (The old unrecoverable blocks would still be unrecoverable, but there's
> no fixing those.)
I still do not believe the old behavior is correct.
If the data is corrupted, regenerating P/Q from the bad data only kills
whatever remaining chance of recovery there is (if any).
>
> After this change, we now end up in an infinite loop:
>
> 1. Allocator picks a stripe with some unrecoverable csum blocks
> and some free blocks
>
> 2. Writeback tries to put data in the stripe
>
> 3. rmw_rbio aborts after it can't repair the existing blocks
>
> 4. Writeback deletes the extent, often silently (the application
> has to use fsync to detect it)
>
> 5. Go to step 1, where the allocator picks the same blocks again
>
> The effect is pretty dramatic--even a single unrecoverable sector in
> one stripe will bring an application server to its knees, constantly
> discarding an application's data whenever it tries to write. Once the
> allocator reaches the point where the "next" block is in a bad rmw stripe,
> it keeps allocating that same block over and over again.
I'm afraid the error path (no way to inform the caller) is an existing
problem. Buffered writes can always succeed (as long as there is no
ENOMEM/ENOSPC etc.), but the real writeback is not guaranteed to succeed.
It doesn't even need RAID56 to trigger.
But "discarding" the dirty pages doesn't sound correct.
If a writeback fails, the dirty pages should stay dirty, not be
discarded.
It may be a new bug in the error handling path.
>
> So there are some different cases to handle at this point:
>
> 1. We can read all the blocks and they have good csums: proceed
> to write data and update parity. This is the normal good case.
>
> 2. We can read all the blocks, there are csums failures, but they
> can be repaired: proceed to write data and update parity. This is
> the case fixed in the patch.
>
> 3. We can read all the blocks, there are csums failures, but they
> can't be repaired: proceed to write data and update parity.
Nope, we should not continue; that only "spreads" the corruption.
> This case
> is broken in the patch. The data in the unrecoverable blocks is lost
> already, but we can still write the new data in the stripe and update
> the parity to recover the new blocks if we go into degraded mode in
> the future.
We should only continue if our write range covers the bad sector, but
that's impossible, as it would break DATACOW.
I can add extra handling, e.g. if the write range covers a bad sector,
we do not try to recover that one.
But I strongly doubt that would have any meaning, since it means
we're writing into a sector which already has a csum.
>
> 4. We can read enough of the blocks to attempt repair, and the
> repair succeeds: proceed to write data and update parity. This is
> the degraded good case.
Degraded mode is not treated any differently; it's just the same as
failing the read.
>
> 5. We can read enough of the blocks to attempt repair, and the
> repair fails: proceed to write data and update parity. The lost
> data block is gone, but we can still write the new data blocks and
> compute a usable parity block. This is another case broken by the
> patch, the degraded-mode version of case 3.
The degraded case is only making things more complex.
But I get your point.
>
> 6. We can't read enough of the blocks to attempt repair: we can't
> compute parity any more, because we don't know what data is in the
> missing blocks. We can't assume it's some constant value (e.g. zero)
> because then the new parity would be wrong if some but not all of the
> failing drives came back.
>
> For case 1-5, we can ignore the return value of verify_one_sector()
> (or at least not pass it up to the caller of recover_vertical()).
Nope, that falls back to the old behavior, a corrupted section would
only spread to P/Q.
We should not touch such corrupted full stripe. IIRC we already have
such test cases in fstests.
I believe we should let the users know there is something wrong, telling
them which sector is problematic (and which file), hoping they can
delete the file or whatever.
We cannot make the data loss decision for the end users.
> Either we restored the sector or we didn't, but either outcome doesn't
> affect whether a new parity block can be computed to enable recovery of
> the new data blocks and the remaining recoverable data blocks. The other
> error cases in recover_vertical() are either fatal problems like ENOMEM,
> or there's too many read failures and we're actually in case 6.
>
> For case 6, the options are:
>
> A. abort the rbio and drop the write
>
> B. force the filesystem read-only
>
> I like B for this case, but AIUI the other btrfs raid profiles currently
> do A if all the data block mirror writes fail.
The current solution for RAID56 is just to not touch it.
We neither continue nor ignore.
I believe B is the much safer solution, but I'm not sure if other
profiles really go A.
IIRC for RAID1* we just continue as long as we have at least one good
mirror written.
[...]
>>
>> Can you try to do it with newly created fs instead?
>
> Already did (I needed to do that many times to run bisect...).
Any shorter reproducer?
If I understand you correctly, the problem only happens when we have all
the following conditions met:
- the sector has a csum
Or we won't verify the content anyway
- both the original and repaired contents mismatch csum
This already means there are errors beyond the RAID5 tolerance.
>
>>> while true; do
>>> # Any big file will do, usually faster than /dev/random
>>> # Skipping the first 1M leaves the superblock intact
>>> while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
>>> # This should fix all the corruption as long as there are no
>>> # reads or writes anywhere on the filesystem
>>> btrfs scrub start -Bd /dev/vdd
>>> done
>>
>> [IMPROVE THE TEST]
>> If you want to cause interleaved free space, just create a ton of 4K
>> files, and delete every other one.
>>
>> And instead of vmlinux or whatever file, you can always go with a
>> randomly/pattern-filled file, and save its md5sum to do verification.
>
>> [MY CURRENT GUESS]
>> My current guess is some race with dd corruption and RMW.
>> AFAIK the last time I was working on RAID56, I always did offline
>> corruption (aka, with the fs unmounted) and it always worked like a charm.
>>
>> So the running corruption may be a point of concern.
>
> Generally, we expect the hardware to introduce corruption at any time.
> Ideally, a btrfs raid[156] should be able to work when one of its devices
> returns 4K blocks from /dev/random on every read, and if it can't,
> that's always a bug.
But so far your description is of a two-sector corruption, beyond
RAID5's tolerance.
>
>> Another thing is, if a full stripe is determined to have unrepairable
>> data, no RMW can be done on that full stripe forever (unless one
>> manually fixed the problem).
>
> Yeah, that's the serious new problem introduced in this patch for 6.2.
> 6.1 didn't do that in two cases where 6.2 and later do.
Unfortunately, I do not think this is the root problem.
The change is towards the correct direction, from ignoring the problem
to at least not making it worse.
>
>> So if by somehow you corrupted the full stripe by just corrupting one
>> device (maybe some existing csum mismatch etc?), then the full stripe
>> would never be written back, thus causing the data not to be written back.
>
> My reproducer wasn't sufficient to trigger the bug--it would only generate
> csum errors on one device, but there need to be csum errors on two devices
> (for raid5) to get into the data-loss loop. Apparently 6.2 still has
> a bug somewhere that introduces extra errors (it crashes a lot, that
> might be related), because when I tested on 6.2 I was observing the
> two-csum-failure case not the intended one-csum-failure case.
>
> 6.6 and 6.9 don't have any of those problems--no crashes, no cross-device
> corruption. I've been running the reproducer on 6.6 and 6.9 for a week
> now, and there are no csum failures on any device other than the one
> being intentionally corrupted, and there are no data losses because the
> code never hits the "goto out" line in rmw_rbio (confirmed with both
> a kgdb breakpoint on the test kernel and some printk's). So on 6.6
> and 6.9 btrfs fixes everything my reproducer throws at it.
That's my expectation. Thankfully it finally matches reality.
So that's something else in v6.2 causing errors beyond RAID5 tolerance.
And the data drop problem is definitely a big problem but I do not think
it's really RAID56 specific.
My guess is some later RAID56 fixes are not yet backported to v6.2?
One quick possible fix is:
486c737f7fdc ("btrfs: raid56: always verify the P/Q contents for scrub")
>
> It was a bad reproducer. Sorry about that!
>
>> Finally, for the lack of any dmesg: it's indeed a problem that there is
>> *NO* error message at all if we fail to recover a full stripe.
>> Just check recover_sectors() call and its callers.
>
> The trouble with dmesg errors is that there are so many of them when
> there's a failure of a wide RAID5 stripe, and they can end up ratelimited,
> so admins can't see all of them. But they're also better than nothing
> for case 6, so I approve adding a new one here.
>
> For case 6, it might be reasonable to throw the entire filesystem
> read-only; otherwise, there's silent data loss, and not having silent
> data loss is a critical feature of btrfs. It would still be possible
> to scrub, find the affected files, and remove them, without triggering a
> rmw cycle. Maybe some future btrfs enhancement could handle it better,
> but for now a dead stop is better than any amount of silent loss.
>
> One possible such enhancement is to keep the blocks from being
> allocated again, e.g. put them on an in-memory list to break the
> allocate-write-fail-free-allocate loop above.
The retry for failed full stripe is intentional, for the following two
points:
- There may be a chance the error is temporary
So the next retry may get good contents and do the repair correctly
- There is no need to cache something we know is unreliable
[...]
>
> On 6.6, all the automation was broken by the regression: when we replaced
> a broken file or added a new one, the new file was almost always silently
> corrupted,
A quick search through the data writeback path indeed shows we do
nothing extra when hitting a writeback error.
This means the dirty pages would be considered uptodate, and thus can
later be dropped.
I think we should change the behavior of clearing dirty flags, so that
the dirty page flag only gets cleared if the writeback is successful.
So I believe this is the root cause.
This change is going to be huge, and I have to check other fses to be
extra safe.
But even with that fixed, the do-not-touch-bad-full-stripe policy is not
doing anything helpful, unless the corrupted sector gets fully deleted,
which can be very hard to detect.
Maybe you can try a different recovery behavior?
E.g. always read the file, and if the read fails, delete it and then
copy a new one from the backup server.
Then it may solve the problem, as the bad sectors would be deleted (along
with their csums)
Thanks,
Qu
> and we had no IO error signal to trigger replacement of the
> data from another server, so corrupted files became visible to users.
> git-based tools for deployment were fully unusable on 6.6: any commit
> would break the local repo, while deleting and recloning the repo simply
> hit the same broken stripe rmw bug and broke the replacement repo.
>
> At the end of recovery we did a full verification of SHA hashes and
> found only a few errors, all in files that were created or modified
> while running 6.6.
>
>> So a more normalized test would help us both.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>>>
>>> while true; do
>>> # Sometimes throws EIO
>>> sync -f /testfs
>>> sysctl vm.drop_caches=3
>>> sleep 9
>>> done
>>>
>>> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
>>> repo to detect lost writes at the application level:
>>>
>>> while true; do
>>> cd /testfs/btrfs-progs
>>> # Sometimes fails complaining about various files being corrupted
>>> find * -type f -print | unsort -r | while read -r x; do
>>> date >> "$x"
>>> git commit -am"Modifying $x"
>>> done
>>> git repack -a
>>> done
>>>
>>> The errors occur on the sync -f and various git commands, e.g.:
>>>
>>> sync: error syncing '/media/testfs/': Input/output error
>>> vm.drop_caches = 3
>>>
>>> error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
>>> fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
>>>
>
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
2024-06-08 3:20 ` Qu Wenruo
@ 2024-07-08 6:25 ` Zygo Blaxell
2024-07-08 8:26 ` Lukas Straub
0 siblings, 1 reply; 7+ messages in thread
From: Zygo Blaxell @ 2024-07-08 6:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs
On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> 在 2024/6/8 11:25, Zygo Blaxell 写道:
> > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > After this change, we now end up in an infinite loop:
> >
> > 1. Allocator picks a stripe with some unrecoverable csum blocks
> > and some free blocks
> >
> > 2. Writeback tries to put data in the stripe
> >
> > 3. rmw_rbio aborts after it can't repair the existing blocks
> >
> > 4. Writeback deletes the extent, often silently (the application
> > has to use fsync to detect it)
> >
> > 5. Go to step 1, where the allocator picks the same blocks again
> >
> > The effect is pretty dramatic--even a single unrecoverable sector in
> > one stripe will bring an application server to its knees, constantly
> > discarding an application's data whenever it tries to write. Once the
> > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > it keeps allocating that same block over and over again.
>
> I'm afraid the error path (no way to inform the caller) is an existing
> problem. Buffered writes can always succeed (as long as there's no
> ENOMEM/ENOSPC etc.), but the real writeback is not guaranteed to succeed.
> It doesn't even need RAID56 to trigger.
>
> But "discarding" the dirty pages doesn't sound correct.
> If a writeback fails, the dirty pages should stay dirty, not be
> discarded.
>
> It may be a new bug in the error handling path.
I found the code that does this. It's more than 11 years old:
commit 0bec9ef525e33233d7739b71be83bb78746f6e94
Author: Josef Bacik <jbacik@fusionio.com>
Date: Thu Jan 31 14:58:00 2013 -0500
Btrfs: unreserve space if our ordered extent fails to work
When a transaction aborts or there's an EIO on an ordered extent or any
error really we will not free up the space we reserved for this ordered
extent. This results in warnings from the block group cache cleanup in the
case of a transaction abort, or leaking space in the case of EIO on an
ordered extent. Fix this up by free'ing the reserved space if we have an
error at all trying to complete an ordered extent. Thanks,
> > This case
> > is broken in the patch. The data in the unrecoverable blocks is lost
> > already, but we can still write the new data in the stripe and update
> > the parity to recover the new blocks if we go into degraded mode in
> > the future.
>
> We should only continue if our write range covers the bad sector, but that's
> impossible as that breaks DATACOW.
>
> I can do extra handling, like if the write range covers a bad sector, we do
> not try to recover that one.
>
> But I strongly doubt that would have any meaning, since that means we're
> writing into a sector which already has a csum.
The point here is to:
- keep P/Q in sync with a consistent view of the data in the
stripe, even when some of that data is known to be uncorrectable
- not overwrite any block in the stripe unless we are sure its
contents are correct or don't matter
- always write out the new data and matching P/Q in rmw_rbio
i.e. if we fail to read a block, or if its csum fails, and we can't
reconstruct the block, we must not write the block out to disk because
that will prevent a read retry from succeeding in the future. We must
still update P/Q because we modify other blocks in the stripe. The only
thing P/Q needs to work is some repeatable data that we can read from
the stripe in the future. P/Q doesn't have to be correct wrt the csums,
only wrt the blocks. As long as P/Q matches the other blocks, then
any block in the stripe can be reconstructed to its current value.
Note we only need to do this for rmw, because we _have_ to finish
rmw to avoid data loss from not making a best effort to do the write.
Scrub and read can report the problem and move on to the next stripe,
because there's no write to be lost in the rbio.
> > 6. We can't read enough of the blocks to attempt repair: we can't
> > compute parity any more, because we don't know what data is in the
> > missing blocks. We can't assume it's some constant value (e.g. zero)
> > because then the new parity would be wrong if some but not all of the
> > failing drives came back.
> >
> > For case 1-5, we can ignore the return value of verify_one_sector()
> > (or at least not pass it up to the caller of recover_vertical()).
>
> Nope, that falls back to the old behavior, a corrupted section would only
> spread to P/Q.
Yeah, I tested this and it's the wrong point in the logic. Another
problem with doing it there is that code is shared with scrub, and we
don't want scrub to be doing best-effort, we want scrub to be perfect
or do nothing.
A better place to do it is when recover_blocks() returns. This is specific
to the rmw_rbio code, so it doesn't affect other parts of btrfs that
can afford to stop when there's an uncorrectable sector.
> We cannot make the data loss decision for the end users.
There is no decision that doesn't have at least potential data loss.
RMW is writing data. If our RMW bbio chooses not to write data when it's
possible to do so, then we've actively made the decision to lose data.
In the current architecture, RMW must make a best effort to correct any
correctable errors before modifying the stripe, but if that effort fails,
RMW still has to update the stripe anyway, so that one bad stripe doesn't
break the entire filesystem.
The alternative is to not do RMW at all (or at least remove the dependency
on it), which is a much larger problem to solve. There's a few ways to
do that:
1. In btrfs_finish_one_ordered, instead of dropping the failed extent,
try to allocate it somewhere else (hopefully in a stripe without an
uncorrectable sector, or just switch to a full-stripe allocation for
that specific case), and restart the write operation. That trades extra
disk space for availability. We'd need some way to remember where writes
have failed so the allocator isn't constantly trying to put new data on
broken stripes. Also it means that we can end up silently dropping data
due to ENOSPC instead of EIO (i.e. we try to write to the last available
sectors and they're in a broken stripe) (similar to another bug in 6.6,
now fixed).
2. Avoid RMW writes in the allocator. This requires some aggressive
merging of writes so that we fill up stripes (that would be nice to have
anyway as a pure speed optimization for raid5, as the current code will
happily write 4 64K files in 4 RMW operations instead of packing them
all into one full-stripe write). We'd need bg reclaim to free up space
in partially filled stripes (i.e. raid5 would effectively behave like a
zoned device). We'd still need RMW for nodatacow files, but nodatasum
files can't ever have corrupt sectors so we can always update P/Q when
we write a stripe. Prealloc wouldn't work. We might return ENOSPC,
but we'll do that at write() time, not in writeback, so userspace will
be informed and we won't silently discard any data.
3. Use raid-stripe-tree. AFAICT raid-stripe-tree is basically option 2,
but implemented between the extent layer and the device layer so it's
a less invasive change wrt the rest of btrfs. raid-stripe-tree doesn't
support prealloc or nodatacow either.
> Any shorter reproducer?
1. Create a -draid5 -mraid1 filesystem from 5 devices: 2x 750G, 2x 700G.
The larger devices will have all the metadata on them, so we can corrupt
everything on a small device and not worry about clobbering metadata.
2. Put some data on the filesystem, delete every second file to make a lot
of small holes in the raid5 stripes.
3. Corrupt the two smaller devices (the ones without metadata) with your
choice of garbage data (seek over the first 1M to leave the superblock intact):
yes | dd of=dev-3 iflag=fullblock bs=1024k seek=1
yes | dd of=dev-4 iflag=fullblock bs=1024k seek=1
4. On the filesystem, create a new directory and use it normally.
I like 'git clone' for this, because it does a lot of fsyncs and sha1 hash
integrity checking, and every kernel developer already has it installed.
git clone http://github.com/kdave/btrfs-progs.git git-test
The git clone, git gc, commits, etc. should all run without error,
and the number of errors in files in the rest of the filesystem should
be stable, i.e. writes work for new data, and old data isn't (more)
corrupted by the writes.
> If I understand you correctly, the problem only happens when we have all the
> following conditions met:
>
> - the sector has a csum
> Or we won't verify the content anyway
>
> - both the original and repaired contents mismatch csum
>
> This already means there are errors beyond the RAID5 tolerance.
Yes, all the problems arise when one stripe goes beyond RAID5 tolerance.
It doesn't look like it matters whether it's read or csum failure,
but csum failure is much more common due to earlier bugs in btrfs
raid5 leaving broken stripes on disk for later kernels to trip over,
or a replaced drive combined with an uncorrected write hole error on a
second drive.
If the errors are within RAID5 tolerance, it all works perfectly:
the "best effort to repair" is a completely successful repair, so
recover_blocks returns 0, we don't prematurely exit from rmw_rbio,
and writes work properly.
> So that's something else in v6.2 causing errors beyond RAID5 tolerance.
> And the data drop problem is definitely a big problem but I do not think
> it's really RAID56 specific.
It's raid56 specific because only raid56 has data block writes which
depend on the outcome of data block reads, i.e. a write fails because
of something that happens during a read. No other profile has that
possibility.
If I run the test case with raid1 or single data instead of raid5,
there will never be a data loss on a write because of a csum failure,
because the write bbios simply never try to read anything. Even raid5
full-stripe writes can't fail that way, because they don't mix read
operations into their write bbios.
(I'm excluding metadata here--there are plenty of dependencies between
data writes and metadata reads, but they aren't very interesting.
If metadata is broken then the entire filesystem is broken.)
> > On 6.6, all the automation was broken by the regression: when we replaced
> > a broken file or added a new one, the new file was almost always silently
> > corrupted,
>
> A quick search through the data writeback path indeed shows we do nothing
> extra when hitting a writeback error.
That's because it's not in the writeback path--it's in the bio completions
for ordered data. This is the sequence of events:
1. btrfs allocator places an extent within an existing raid5 data stripe
that contains some unreadable or corrupt data blocks.
2. rmw_rbio is called to perform a rmw operation on the stripe.
3. rmw_read_wait_recover attempts to find all csums in the stripe.
4. rmw_read_wait_recover issues read requests to
submit_read_wait_bio_list to fill the stripe in memory.
5. raid_wait_read_end_io sets error bits in the rbio error_bitmap for
read failures, and passes successful reads to verify_bio_data_sectors.
6. verify_bio_data_sectors verifies those sectors which have csums
(i.e. not the free sectors, nodatasum files, or the P/Q sector(s)).
Sectors that have mismatched csums are marked in the rbio error_bitmap.
7. back to rmw_read_wait_recover, which calls recover_sectors.
8. recover_sectors attempts to repair any sectors with a bit set in
error_bitmap. If there are too many bad sectors, recover_sectors exits
early without doing anything. After reconstructing the bad sector
contents from remaining data and P/Q blocks, the csums of the repaired
sectors are verified with verify_one_sector().
9. if recover_sectors is unable to repair all sectors in the stripe,
recover_sectors returns an error to rmw_read_wait_recover, which
returns the error to rmw_rbio.
10. when rmw_read_wait_recover returns an error to rmw_rbio, rmw_rbio
sets bi_status to IOERR on all the bios in the stripe, and returns
without performing the parity update and write operations originally
requested at step 2.
11. the rbio work function finishes
12. the btrfs_finish_ordered_io work function starts
13. btrfs_finish_one_ordered notices the error in bi_status on the bios
14. btrfs_finish_one_ordered calls btrfs_mark_ordered_extent_error,
so syncfs() and fsync() can return EIO.
15. btrfs_finish_one_ordered calls btrfs_drop_extent_map_range to
delete the extent from the filesystem.
16. the blocks in the uncorrectable stripe are now free to be allocated
again, so return to step 1 and do it all again in the next transaction.
Note that this is a loop! So we don't lose one extent of data and then
move on--we lose one extent of data _in each transaction_, and then set up
the next transaction to fail in the same way. This continues as long as
an uncorrectable sector exists in one of the first available free spaces.
Also note that after a while, the allocator fills up all the good stripes
between the bad stripes, but it can't fill the bad stripes because they
keep getting returned to free space at the end of each transaction.
If there are a lot of bad stripes on the filesystem, bad stripes move
to the head of the free space list where they accumulate over time.
We might end up dropping hundreds or thousands of extents before we
finally allocate something in a good stripe.
During my original failure event I did notice that these dropped writes
seemed to come in "waves", but I couldn't reproduce that effect. Now I
know how to trigger it.
> I think we should change the behavior of clearing dirty flags, so that the
> dirty page flag only gets cleared if the writeback is successful.
The entire page is deleted, so it stops having any flags.
> This change is going to be huge
Yeah, all the real solutions are far more development work, and have more
ways to fail, than handling everything with a better best-effort in RMW.
In the short term, it's more important to fix the RMW regression as soon
as possible, so that users with raid5 filesystems can upgrade from 6.1 to
later kernels without having worse data loss problems than 6.1 had in the
supported cases (drive failures <= tolerance). The most straightforward
way to do that is to handle everything we can within rmw_rbio.
Once the larger fixes are in place, so that btrfs can handle a dropped
data write or so that btrfs doesn't need to do RMW any more, we can then
go back to strictly failing RMW on uncorrectable raid56 stripes. It's
too early to do that now.
Best-effort RMW has low marginal risk of data loss, whereas aborting RMW
absolutely guarantees data loss. As you point out, none of the data loss
cases arise when the stripe is within correctable tolerances, i.e. you
need to start with too many failures so that you have uncorrectable
sectors, and we don't claim that btrfs can recover data in that case even
without RMW updates. If errors are within tolerance, they're correctable,
so rmw_rbio will simply correct them.
I propose 3 phases of fix:
1. One-line patch to ignore recover_blocks return value, to get raid56
writes working in the presence of uncorrectable sectors again.
2. Fix up the rbio handling so that we can compute P/Q for RMW stripes
even with uncorrectable sectors, so that we don't make a mess on
uncorrectable sectors (discussed in more detail below).
3. Fix the upper layers of btrfs so that we can have write bbios fail
and not lose data in general.
> and I have to check other fses to be extra
> safe.
No other filesystem tries to do parity RMW writes with data csums, because
that's a bad system design choice that they were able to avoid making.
Only btrfs has btrfs problems. ;)
Other filesystems use strictly immutable extents with parity embedded
within them (ZFS), or they use allocators that aggressively coalesce
allocations and writes so that all parity raid writes are full-stripe
writes with garbage collection (bcachefs).
Parity raid block devices don't have csums. They don't have to handle
corrupted blocks because they don't know whether any block is corrupted or not.
They simply update the P/Q blocks whenever enough devices are readable,
and go completely read-only if too many are not readable.
> But even with that fixed, the do-not-touch-bad-full-stripe policy is not
> doing anything helpful, unless the corrupted sector gets fully deleted, which
> can be very hard to detect.
>
> Maybe you can try a different recovery behavior?
>
> E.g. always read the file, and if the read fails, delete it and then copy a new
> one from the backup server.
>
> Then it may solve the problem, as the bad sectors would be deleted (along
> with their csums)
There are some problems with this:
- The failures occur when rmw_rbio reads a file that happens to share
a raid5 stripe with the extent we are trying to create. The files are
not related in any way except for the position of some of their blocks
on disk. Finding these files takes time. Scrub takes a long time
(weeks) and 'find -type f...' may not find the bad sectors at all, so
there's a high risk that a writer will discover an uncorrectable sector
before anything else does, and in that case the first signal that we
have a problem is silent data loss.
- While researching all this, I discovered that in 6.4, commit 1009254bf22a
("btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub") broke
scrub: it aborts on the first stripe with an uncorrectable sector.
So users can't discover more than one uncorrectable sector at a time.
- Even if we know of a bad file, we can't replace the bad file because
it requires us to have reliable writes to the filesystem. All files
referencing bad stripes have to be removed before any replacement files
can be loaded onto the filesystem. That's a weeks-long interruption
in service, and poses some logistical issues (e.g. temporary additional
space to store the files we can't write to the original filesystem).
- Once the allocator has found an uncorrectable sector, write failures
happen continually, at increasing rates. It only stops when _every_
uncorrectable sector is removed from the filesystem. This is not a
proportional response to a single bad sector, especially when it's so
easy to fix.
In kernel 6.1 and earlier, a damaged btrfs raid5 filesystem has no
availability for writes because bugs in RMW would corrupt good data
if we write to the filesystem. We can write to our filesystems only
because we have a full replica of all the data somewhere else, so the
btrfs bugs only cause corruption in files we can replace. Ours is
a specialist application. 6.1's raid5 isn't usable as a rootfs or
unreplicated primary data store.
In kernel 6.2 and later, a damaged btrfs raid5 filesystem has no
availability for writes because we can't be sure that the filesystem
doesn't have any uncorrectable stripes. If at least one uncorrectable
stripe exists, we can't expect any data we write to stick. This
makes 6.2 and later unusable as a rootfs or unreplicated primary data
store, even though the corruption bugs in 6.1 were fixed.
This patch makes RMW work again, mostly:
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 39bec672df0c..2964ff380671 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2277,7 +2277,13 @@ static int rmw_read_wait_recover(struct btrfs_raid_bio *rbio)
* and csum mismatch), just let recover_sectors() to handle them all.
*/
submit_read_wait_bio_list(rbio, &bio_list);
- return recover_sectors(rbio);
+ recover_sectors(rbio);
+ /*
+ * Whether we recover corrupted sectors or not, we must proceed
+ * with the rmw operation, or we will drop the extent we are
+ * trying to write.
+ */
+ return 0;
}
static void raid_wait_write_end_io(struct bio *bio)
With that patch, my raid5 test cases work for the first time since I wrote
them in 2016: writes are fully available, and there's no new corruption
of existing files (any uncorrectable sectors remain uncorrectable and
no new ones appear). It's good enough to unblock users who are currently
stuck on 6.1 due to the RMW regression.
For the "mostly": there are some unresolved issues. With the above
patch, the set of uncorrectable sectors is stable (nothing that wasn't
corrupted before gets corrupted by RMW), but the _contents_ of the
uncorrectable sectors get replaced with garbage, because the code isn't
leaving the stripe data in a usable state for P/Q update, and it always
writes out the full stripe including known garbage data. P/Q does match
the garbage, so the stripe will work in degraded mode for the new data.
The rmw code has to do something like this:
1. do the R operation: for unreadable blocks, zero-fill the block
2. for readable blocks, verify csums (no change)
3. reconstruct missing or corrupt blocks in separate memory from the
block read from disk (i.e. don't overwrite in place)
4. verify again. If verification fails, keep the original version
of the data (or zeros if the block could not be read).
5. keep track of which blocks are bad, i.e. don't clear error_bitmap
in rmw_rbio unless the block verifies OK
6. do the M operation: modify the blocks to be written, and compute
new P/Q blocks for the stripe.
7. clear the bad bits for the modified blocks and the P/Q blocks.
8. do the W operation: write out only the blocks that are not marked
as bad, i.e. the corrected blocks, the modified blocks, and the P/Q
blocks. If a block can't be corrected, don't write it out, to give
the device another chance to get the data on the next read.
9. if too many writes fail, then rmw_rbio can return an error.
(Note: this last line opens up a can of worms for the read error
handling cases)
For scrub, the problem is that scrub_raid56_parity_stripe exits the entire
scrub loop when it sees an unrepaired sector, instead of continuing to
the next stripe. Everything else there is OK; it's a one-line fix:
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c7fcb6408eb6..12c3ce0efafa 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1972,7 +1972,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
"unrepaired sectors detected, full stripe %llu data stripe %u errors %*pbl",
full_stripe_start, i, stripe->nr_sectors,
&error);
- ret = -EIO;
+ ret = 0;
goto out;
}
bitmap_or(&extent_bitmap, &extent_bitmap,
I'll post these patches separately.
> Thanks,
> Qu
* Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
2024-07-08 6:25 ` Zygo Blaxell
@ 2024-07-08 8:26 ` Lukas Straub
2024-07-08 20:22 ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell
0 siblings, 1 reply; 7+ messages in thread
From: Lukas Straub @ 2024-07-08 8:26 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs
On Mon, 8 Jul 2024 02:25:37 -0400
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> > 在 2024/6/8 11:25, Zygo Blaxell 写道:
> > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > > After this change, we now end up in an infinite loop:
> > >
> > > 1. Allocator picks a stripe with some unrecoverable csum blocks
> > > and some free blocks
> > >
> > > 2. Writeback tries to put data in the stripe
> > >
> > > 3. rmw_rbio aborts after it can't repair the existing blocks
> > >
> > > 4. Writeback deletes the extent, often silently (the application
> > > has to use fsync to detect it)
> > >
> > > 5. Go to step 1, where the allocator picks the same blocks again
> > >
> > > The effect is pretty dramatic--even a single unrecoverable sector in
> > > one stripe will bring an application server to its knees, constantly
> > > discarding an application's data whenever it tries to write. Once the
> > > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > > it keeps allocating that same block over and over again.
> >
> > I'm afraid the error path (no way to inform the caller) is an existing
> > problem. Buffered writes can always succeed (as long as there's no
> > ENOMEM/ENOSPC, etc.), but the real writeback is not guaranteed to
> > succeed. It doesn't even need RAID56 to trigger.
> >
> > But "discarding" the dirty pages doesn't sound correct.
> > If a writeback failed, the dirty pages should still stay dirty, not
> > discarded.
> >
> > It may be a new bug in the error handling path.
>
> I found the code that does this. It's more than 11 years old:
>
> commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> Author: Josef Bacik <jbacik@fusionio.com>
> Date: Thu Jan 31 14:58:00 2013 -0500
>
> Btrfs: unreserve space if our ordered extent fails to work
>
> When a transaction aborts or there's an EIO on an ordered extent or any
> error really we will not free up the space we reserved for this ordered
> extent. This results in warnings from the block group cache cleanup in the
> case of a transaction abort, or leaking space in the case of EIO on an
> ordered extent. Fix this up by free'ing the reserved space if we have an
> error at all trying to complete an ordered extent. Thanks,
>
> [...]
Before this escalates further in IMHO the wrong direction:
I think the current btrfs behavior is correct. See also this paper[1],
which examines write failures of buffered io in different filesystems.
See especially Table 2: ext4 and xfs, for example, do not discard the
page cache on write failure, but that is worse, since you then have a
mismatch between what is in the cache and what is on disk. They do not
retry writing back the page cache either.
The real source of confusion here is that write errors effectively do
not happen in the real world: disks do not verify whether they wrote
data correctly, and neither does any layer (raid, etc.) above them.
Thus write-failure handling is completely untested in applications
(see the paper again), and it seems the problems you see are due to
incorrect handling of write errors. The data loss is not silent; it's
just that many applications and scripts do not use fsync() at all.
I think the proper way of resolving this is for btrfs to retry writing
the extent, but to another (possibly clean) stripe. Or perhaps a fresh
raid5 block group altogether.
I very much approve of btrfs's current design of handling (and
reporting) write errors in the most correct way possible.
Regards,
Lukas Straub
[1] https://www.usenix.org/system/files/atc20-rebello.pdf
* Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle")
2024-07-08 8:26 ` Lukas Straub
@ 2024-07-08 20:22 ` Zygo Blaxell
0 siblings, 0 replies; 7+ messages in thread
From: Zygo Blaxell @ 2024-07-08 20:22 UTC (permalink / raw)
To: Lukas Straub; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs
On Mon, Jul 08, 2024 at 10:26:32AM +0200, Lukas Straub wrote:
> > I found the code that does this. It's more than 11 years old:
> >
> > commit 0bec9ef525e33233d7739b71be83bb78746f6e94
> > Author: Josef Bacik <jbacik@fusionio.com>
> > Date: Thu Jan 31 14:58:00 2013 -0500
> >
> > Btrfs: unreserve space if our ordered extent fails to work
> >
> > When a transaction aborts or there's an EIO on an ordered extent or any
> > error really we will not free up the space we reserved for this ordered
> > extent. This results in warnings from the block group cache cleanup in the
> > case of a transaction abort, or leaking space in the case of EIO on an
> > ordered extent. Fix this up by free'ing the reserved space if we have an
> > error at all trying to complete an ordered extent. Thanks,
> >
> > [...]
>
> Before this escalates further in IMHO the wrong direction:
Now we're drifting away from raid56-specific issues, and into general
btrfs write error handling theory. OK, I'm up for that...
> I think the current btrfs behavior is correct. See also this paper[1],
> which examines write failures of buffered io in different filesystems.
> See especially Table 2: ext4 and xfs, for example, do not discard the
> page cache on write failure, but that is worse, since you then have a
> mismatch between what is in the cache and what is on disk. They do not
> retry writing back the page cache either.
ext4 and xfs don't retry the write, but they also don't deallocate the
block from the filesystem, then try to use the same block immediately
in a future allocation. Of the filesystems in the paper, only btrfs
does write-failure data loss in a rapidly cycling _loop_.
xfs and ext4 (and btrfs on a nodatacow file) leave the failing extent
allocated, so future reads return garbage, but future created or extended
files pick different, hopefully error-free sectors on the disk. ext4 and
xfs thus avoid having a data loss loop, more or less accidentally.
We could replicate that to break the loop on btrfs, by introducing an
"IOERROR" extent type. This would be like a PREALLOC extent, in the sense
that it reserves space on disk without writing data to the space, but it
would always be datacow for writes, and would always return EIO on reads
instead of zero-fill. Scrub would not attempt reading such an extent, and would
report it as a distinct failure type from the existing read/csum/write/etc
failures. Balance would make IOERROR extent references point to a
zero bytenr when they are relocated (so they become IOERROR holes in
files, with no physical extent behind them, and so that balance can
dispose of the block group without tripping over an IO error itself).
btrfs send would transmit them, btrfs receive would replicate them,
btrfs qgroup would count their space if they have a physical extent
(i.e. they're not a synthetic extent created by balance or receive).
IOERROR extents would persist until they are deleted by a user action,
like removing the file, or overwriting the entire extent.
An IOERROR extent would be created by flipping a few bits in the extent
item when the write failure is detected, then proceeding to write out
the extent metadata for the lost extent data (instead of deleting it).
This breaks the loop after btrfs writes to a broken sector (or at least
pauses the loop until the user deletes the broken file and releases the
broken sectors for another allocation). IOERROR doesn't need additional
space like retries do, since we're writing out the same number of extents
to disk that we already have reserved, so an IO error can't cascade to
an ENOSPC error.
IOERROR extents break the rapid data loss loop and replicate the better
behaviors of ext4 and xfs, without introducing their bad behaviors to
btrfs. Users can reliably detect lost data after the fact by reading
the files or running scrub, so they don't need to rely on unreliable
and invasive methods like fsync() or syncfs() for detection.
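To make the proposed semantics concrete, here is a small user-space model of an IOERROR extent. The names (EXTENT_IOERROR, extent_mark_ioerror, etc.) are invented for illustration; btrfs has no such extent type today, and the real implementation would live in the extent item flags and read/balance paths:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the proposed IOERROR extent type. */

enum extent_type { EXTENT_REGULAR, EXTENT_PREALLOC, EXTENT_IOERROR };

struct extent {
	enum extent_type type;
	uint64_t bytenr;  /* 0 == no physical extent (synthetic hole) */
	uint64_t len;
};

/* Reads of an IOERROR extent fail with EIO instead of zero-filling,
 * so lost data is reliably detectable after the fact. */
static int extent_read(const struct extent *e)
{
	if (e->type == EXTENT_IOERROR)
		return -EIO;
	return 0;
}

/* On write failure, flip the extent to IOERROR but keep it allocated,
 * so the allocator cannot hand the same bad sectors out again and the
 * write-delete-repeat loop is broken. */
static void extent_mark_ioerror(struct extent *e)
{
	e->type = EXTENT_IOERROR;
}

/* Balance relocates IOERROR extent references to a zero bytenr, turning
 * them into IOERROR holes with no physical extent behind them. */
static void extent_relocate(struct extent *e)
{
	if (e->type == EXTENT_IOERROR)
		e->bytenr = 0;
}

/* Qgroup counts the space only while a physical extent backs it. */
static int extent_counts_space(const struct extent *e)
{
	return e->bytenr != 0;
}
```

The model shows why no extra space is needed: marking the extent IOERROR reuses the reservation the failed write already held, so the error cannot cascade into ENOSPC.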
It's an interesting idea for future enhancement, but it's no
solution to the rmw regression. It just breaks the loop and makes
write-error-extent-drop detection easier, but the problem for rmw is
that the write errors shouldn't happen in the first place.
> The real source of confusion here is that write errors effectively do
> not happen in the real world: disks do not verify whether they wrote
> data correctly, and neither does any layer (raid, etc.) above them.
Disks do go offline in the real world, and that presents as write errors.
I've had systems running raid1 where 2 disks go offline, but neither
disk contains any metadata, so btrfs runs the write-delete-repeat data
loss loop for many hours. I didn't care much about that, because raid1
isn't expected to work with 2 disks offline, but I do care about raid5
not working with 0 disks offline.
There are some small changes we can make to error handling that might
quickly improve things for users. While debugging the RMW issues,
I have been using a patch that makes the extent drops visible in dmesg:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8f38eefc8acd..be3cda596bb8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3214,6 +3214,22 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
clear_reserved_extent &&
!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
+ /*
+ * Notify the user that some data has been lost due
+ * to an error, unless we were already expected to
+ * remove the data because of truncate.
+ *
+ * Put both logical and physical addresses here
+ * because we're about to remove the mapping that
+ * would have allowed the user to translate from
+ * one to the other.
+ */
+ if (ret && !truncated)
+ btrfs_err_rl(fs_info,
+"dropping unwritten extent at root %lld ino %llu offset [%llu,%llu] physical %llu length %llu",
+ btrfs_root_id(inode->root), btrfs_ino(inode),
+ unwritten_start, end, ordered_extent->disk_bytenr,
+ ordered_extent->disk_num_bytes);
/*
* Discard the range before returning it back to the
* free space pool
This has been useful when debugging the RMW issue, and valuable on
production filesystems too. It also lights up when some of the other
recently fixed data loss bugs in btrfs are reproduced during bisection
searches. I'll submit that as a patch separately.
That new 'if' statement could easily check a mount flag, and throw in
a courtesy call to btrfs_abort_transaction or btrfs_panic if set. That
would be welcome for users who value data consistency over availability.
I've seen several prospective btrfs users go back to ext4/xfs because
btrfs doesn't support 'errors=remount-ro' or 'errors=panic' for data
writes, and they absolutely prefer to crash instead of running with known
inconsistent data anywhere in the filesystem.
btrfs could also do stats counting for dropped extents somewhere,
but it can't use dev stats exactly as the event is happening above the
individual device level. We'd need a different stats item for that.
> Thus write-failure handling is completely untested in applications
> (see the paper again), and it seems the problems you see are due to
> incorrect handling of write errors.
Not quite. It's a failure of expectations between two components of
btrfs, and fixes can be implemented in either or both components.
The new RMW code assumed that the ordered-data error-handling code would
recover from RMW returning a synthetic write error and preserve the data,
but that assumption is wrong: the ordered-data error-handling code has
never done that. The data is lost in kernel 6.2 and later, because both
of these subsystems now assume the other will work around the dependent
read failures and get the data written somewhere. In kernel 6.1 and
earlier, the raid5 component was responsible for working around the
read failures during writes, so there was no need for ordered data to
handle them.
> The data loss is not
> silent; it's just that many applications and scripts do not use fsync()
> at all.
fsync()'s return value is a terrible non-solution. Diagnostic coverage
is less than 100% and not all of the gaps can be eliminated (e.g. if
you're running syncfs() in a continuous loop, you don't get notified if
writebacks win races against you). It's invasive, relying on the entire
workload to be redesigned so that processes can stick around until writes
finish without blocking everything on IO performance (try that with a
parallel Makefile tree near you). The engineering hours and runtime cost
of application-level synchronous monitoring of every individual write
isn't reasonable for an event that is extremely rare for a properly
provisioned and monitored redundant RAID system, and generally considered
catastrophic if it should occur.
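For reference, this is roughly what the "detect write errors via fsync" approach demands of every individual write path in every application. The helper name is invented; the calls are standard POSIX. Note the comment marking the coverage gap discussed above:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Illustrative helper: write a buffer and block until writeback errors
 * (if any) are reported.  Every application write path would need the
 * equivalent of this to detect dropped extents via return values. */
static int write_durably(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	ssize_t n = write(fd, buf, len);

	/* fsync() reports deferred writeback errors, but only those seen
	 * on this fd before this call; errors after close(), or writebacks
	 * that win races against a syncfs() loop, are missed.  That is
	 * the less-than-100% diagnostic coverage described above. */
	int rc = (n == (ssize_t)len) ? fsync(fd) : -1;

	if (close(fd) < 0)
		rc = -1;
	return rc;
}
```

Even this minimal version forces the process to block on IO completion for every write, which is exactly the runtime cost that makes the approach unreasonable for workloads like parallel builds.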
I guess I'm agreeing with you there. Many applications have good reasons
to never use fsync.
> I think the proper way of resolving this is for btrfs to retry writing
> the extent, but to another (possibly clean) stripe. Or perhaps a fresh
> raid5 block group altogether.
That's what I proposed as a long-term solution, but I'm also proposing
shorter-term fixes to address the specific rmw regression immediately.
> I very much approve btrfs' current design of handling (and reporting)
> write errors in the most correct way possible.
I think btrfs's current handling of write errors is some distance away
from the most correct way possible, or even from the most correct
way currently implemented in a mainstream filesystem; however, it's
good enough for now, and there are more important issues that need
our attention.
> Regards,
> Lukas Straub
>
>
> [1] https://www.usenix.org/system/files/atc20-rebello.pdf
Thread overview: 7+ messages
2024-06-01 3:24 raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Zygo Blaxell
2024-06-01 7:52 ` Qu Wenruo
2024-06-08 1:55 ` Zygo Blaxell
2024-06-08 3:20 ` Qu Wenruo
2024-07-08 6:25 ` Zygo Blaxell
2024-07-08 8:26 ` Lukas Straub
2024-07-08 20:22 ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell