From: Qu Wenruo <wqu@suse.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
Date: Sat, 8 Jun 2024 12:50:35 +0930
Message-ID: <5a8c1fbf-3065-4cea-9cf9-48e49806707d@suse.com>
In-Reply-To: <ZmO6IPV0aEirG5Vk@hungrycats.org>
On 2024/6/8 11:25, Zygo Blaxell wrote:
> On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
[...]
>>
>> Nope, rmw_rbio() would call bitmap_clear() on the error_bitmap before
>> doing any RMW.
>
> This is true, but the code looks like this:
>
> ret = rmw_read_wait_recover(rbio);
> if (ret < 0)
> goto out;
> }
>
> [some other stuff]
>
> bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
>
> The problem happens before the bitmap_clear, at `goto out`. Actually it
> happens in rmw_read_wait_recover() (formerly rmw_read_and_wait()),
> because that function now changes 'ret' if the repair fails:
>
> @@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
>
> submit_read_bios(rbio, &bio_list);
> wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
> +
> + /*
> + * We may or may not have any corrupted sectors (including missing dev
> + * and csum mismatch), just let recover_sectors() to handle them all.
> + */
> + ret = recover_sectors(rbio);
> return ret;
> out:
> while ((bio = bio_list_pop(&bio_list)))
>
> Before this change, if all the blocks in the stripe were readable, we
> would recompute parity based on the data that was read, and proceed
> to finish the rbio. As long as the writes were also successful,
> the new blocks that were written would be updated and recoverable.
> (The old unrecoverable blocks would still be unrecoverable, but there's
> no fixing those.)
I still do not believe the old behavior is correct.
If the data is corrupted, regenerating the P/Q from the bad copy only
kills whatever remaining chance of recovery there is.
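To illustrate with a minimal userspace sketch (plain XOR, not btrfs
code): once parity is regenerated from a bad sector, even a later
rebuild can only hand back the bad contents:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint8_t d1 = 0xAA, d2 = 0x55; /* two data sectors, one byte each */
		uint8_t p  = d1 ^ d2;         /* parity written while d1 was good */

		d1 = 0xFF;                    /* media corruption hits d1 */

		/* Before RMW: the old parity can still rebuild the good d1. */
		printf("rebuilt d1 = 0x%02x\n", p ^ d2); /* 0xaa, good */

		/* Old RMW behavior: regenerate parity from the bad d1. */
		p = d1 ^ d2;

		/* Now a rebuild can only reproduce the corruption. */
		printf("rebuilt d1 = 0x%02x\n", p ^ d2); /* 0xff, bad */
		return 0;
	}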
>
> After this change, we now end up in an infinite loop:
>
> 1. Allocator picks a stripe with some unrecoverable csum blocks
> and some free blocks
>
> 2. Writeback tries to put data in the stripe
>
> 3. rmw_rbio aborts after it can't repair the existing blocks
>
> 4. Writeback deletes the extent, often silently (the application
> has to use fsync to detect it)
>
> 5. Go to step 1, where the allocator picks the same blocks again
>
> The effect is pretty dramatic--even a single unrecoverable sector in
> one stripe will bring an application server to its knees, constantly
> discarding an application's data whenever it tries to write. Once the
> allocator reaches the point where the "next" block is in a bad rmw stripe,
> it keeps allocating that same block over and over again.
I'm afraid the error path (no way to inform the caller) is an existing
problem. A buffered write can always succeed (as long as there is no
ENOMEM/ENOSPC etc.), but the real writeback is not guaranteed to
succeed. It doesn't even need RAID56 to trigger.
But "discarding" the dirty pages doesn't sound correct.
If a writeback fails, the pages should stay dirty, not be discarded.
It may be a new bug in the error handling path.
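For reference, a hedged sketch of what generic writeback error handling
typically does (not the actual btrfs code path):

	/* Sketch only: typical handling when writing a folio back fails. */
	static void on_writeback_error(struct address_space *mapping,
				       struct folio *folio,
				       struct writeback_control *wbc, int err)
	{
		/* Record the error so a later fsync()/syncfs() returns it. */
		mapping_set_error(mapping, err);

		/* Keep the pages dirty so the data is not silently dropped. */
		folio_redirty_for_writepage(wbc, folio);
	}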
>
> So there are some different cases to handle at this point:
>
> 1. We can read all the blocks and they have good csums: proceed
> to write data and update parity. This is the normal good case.
>
> 2. We can read all the blocks, there are csum failures, but they
> can be repaired: proceed to write data and update parity. This is
> the case fixed in the patch.
>
> 3. We can read all the blocks, there are csum failures, but they
> can't be repaired: proceed to write data and update parity.
Nope, we should not continue; that would only "spread" the corruption.
> This case
> is broken in the patch. The data in the unrecoverable blocks is lost
> already, but we can still write the new data in the stripe and update
> the parity to recover the new blocks if we go into degraded mode in
> the future.
We should only continue if our write range covers the bad sector, but
that's impossible, as it would break data COW.
I could add extra handling, e.g. if the write range covers a bad
sector, do not try to recover that one.
But I strongly doubt that would be meaningful, since it implies we're
writing into a sector which already has a csum.
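Conceptually that extra handling would be a check like this (a sketch;
error_bitmap and nr_sectors are real rbio members, but the per-write
"write_bitmap" here is hypothetical):

	/*
	 * Sketch: "write_bitmap" (hypothetical) marks the sectors this RMW
	 * will overwrite with new data.
	 */
	static bool write_covers_bad_sectors(struct btrfs_raid_bio *rbio,
					     const unsigned long *write_bitmap)
	{
		/* Only safe to skip recovery if every bad sector is rewritten. */
		return bitmap_subset(rbio->error_bitmap, write_bitmap,
				     rbio->nr_sectors);
	}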
>
> 4. We can read enough of the blocks to attempt repair, and the
> repair succeeds: proceed to write data and update parity. This is
> the degraded good case.
The degraded case is not treated any differently; it's just the same as
a failed read.
>
> 5. We can read enough of the blocks to attempt repair, and the
> repair fails: proceed to write data and update parity. The lost
> data block is gone, but we can still write the new data blocks and
> compute a usable parity block. This is another case broken by the
> patch, the degraded-mode version of case 3.
The degraded case only makes things more complex, but I get your point.
>
> 6. We can't read enough of the blocks to attempt repair: we can't
> compute parity any more, because we don't know what data is in the
> missing blocks. We can't assume it's some constant value (e.g. zero)
> because then the new parity would be wrong if some but not all of the
> failing drives came back.
>
> For cases 1-5, we can ignore the return value of verify_one_sector()
> (or at least not pass it up to the caller of recover_vertical()).
Nope, that falls back to the old behavior: a corrupted sector would
just spread into P/Q.
We should not touch such a corrupted full stripe. IIRC we already have
test cases for this in fstests.
I believe we should let the users know there is something wrong,
telling them which sector (and which file) is problematic, hoping they
can delete the file or whatever.
We cannot make the data-loss decision for the end users.
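For the reporting part, something along these lines would do (a sketch;
the offset math is illustrative, not an existing helper):

	/* Sketch: report an unrepairable sector instead of failing silently. */
	static void report_unrepairable(struct btrfs_raid_bio *rbio,
					int sector_nr)
	{
		struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
		u64 logical = rbio->bioc->full_stripe_logical +
			      ((u64)sector_nr << fs_info->sectorsize_bits);

		btrfs_warn_rl(fs_info,
			"raid56: unrepairable sector at logical %llu", logical);
	}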
> Either we restored the sector or we didn't, but either outcome doesn't
> affect whether a new parity block can be computed to enable recovery of
> the new data blocks and the remaining recoverable data blocks. The other
> error cases in recover_vertical() are either fatal problems like ENOMEM,
> or there's too many read failures and we're actually in case 6.
>
> For case 6, the options are:
>
> A. abort the rbio and drop the write
>
> B. force the filesystem read-only
>
> I like B for this case, but AIUI the other btrfs raid profiles currently
> do A if all the data block mirror writes fail.
The current solution for RAID56 is to just not touch it.
We neither continue nor ignore.
I believe B is the much safer solution, but I'm not sure the other
profiles really go with A.
IIRC for RAID1* we just continue as long as at least one good mirror
was written.
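If we do go with B, the mechanism already exists; a sketch using
btrfs_handle_fs_error(), which flips the fs into the error (read-only)
state. The caller and condition here are placeholders for the real
case-6 detection:

	/* Sketch: case 6, too many failures to even compute parity. */
	static void raid56_fail_stripe(struct btrfs_fs_info *fs_info)
	{
		btrfs_handle_fs_error(fs_info, -EIO,
			"raid56: full stripe unrecoverable, aborting writeback");
	}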
[...]
>>
>> Can you try to do it with a newly created fs instead?
>
> Already did (I needed to do that many times to run bisect...).
Any shorter reproducer?
If I understand you correctly, the problem only happens when all of the
following conditions are met:
- The sector has a csum
  Otherwise we won't verify the content anyway.
- Both the original and the repaired contents mismatch the csum
  This already means there are errors beyond RAID5's tolerance.
>
>>> while true; do
>>> # Any big file will do, usually faster than /dev/random
>>> # Skipping the first 1M leaves the superblock intact
>>> while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
>>> # This should fix all the corruption as long as there are no
>>> # reads or writes anywhere on the filesystem
>>> btrfs scrub start -Bd /dev/vdd
>>> done
>>
>> [IMPROVE THE TEST]
>> If you want to cause interleaved free space, just create a ton of 4K
>> files, then delete every other one.
>>
>> And instead of vmlinux or whatever file, you can always go with a
>> randomly/pattern-filled file, and save its md5sum for verification.
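To make the free-space setup concrete, a userspace sketch (paths and
counts are arbitrary):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char name[64], buf[4096];

		memset(buf, 0x5a, sizeof(buf)); /* recognizable pattern */

		/* Fill the fs with many 4K files... */
		for (int i = 0; i < 100000; i++) {
			snprintf(name, sizeof(name), "/testfs/f%06d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);

			if (fd < 0)
				break;
			write(fd, buf, sizeof(buf));
			close(fd);
		}
		/* ...then delete every other one to leave 4K holes. */
		for (int i = 0; i < 100000; i += 2) {
			snprintf(name, sizeof(name), "/testfs/f%06d", i);
			unlink(name);
		}
		return 0;
	}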
>
>> [MY CURRENT GUESS]
>> My current guess is some race between the dd corruption and RMW.
>> AFAIK the last time I was working on RAID56, I always did offline
>> corruption (i.e. with the fs unmounted) and it always worked like a
>> charm.
>>
>> So the online corruption may be a point of concern.
>
> Generally, we expect the hardware to introduce corruption at any time.
> Ideally, a btrfs raid[156] should be able to work when one of its devices
> returns 4K blocks from /dev/random on every read, and if it can't,
> that's always a bug.
But so far your description involves corruption on two sectors, which
is beyond RAID5's tolerance.
>
>> Another thing is, if a full stripe is determined to have unrepairable
>> data, no RMW can be done on that full stripe forever (unless one
>> manually fixes the problem).
>
> Yeah, that's the serious new problem introduced in this patch for 6.2.
> 6.1 didn't do that in two cases where 6.2 and later do.
Unfortunately, I do not think this is the root problem.
The change is in the correct direction: from ignoring the problem to at
least not making it worse.
>
>> So if you somehow corrupted the full stripe by corrupting just one
>> device (maybe some pre-existing csum mismatch etc.?), then the full
>> stripe would never be written back, and thus the data would never
>> reach disk.
>
> My reproducer wasn't sufficient to trigger the bug--it would only generate
> csum errors on one device, but there need to be csum errors on two devices
> (for raid5) to get into the data-loss loop. Apparently 6.2 still has
> a bug somewhere that introduces extra errors (it crashes a lot, that
> might be related), because when I tested on 6.2 I was observing the
> two-csum-failure case not the intended one-csum-failure case.
>
> 6.6 and 6.9 don't have any of those problems--no crashes, no cross-device
> corruption. I've been running the reproducer on 6.6 and 6.9 for a week
> now, and there are no csum failures on any device other than the one
> being intentionally corrupted, and there are no data losses because the
> code never hits the "goto out" line in rmw_rbio (confirmed with both
> a kgdb breakpoint on the test kernel and some printk's). So on 6.6
> and 6.9 btrfs fixes everything my reproducer throws at it.
That's my expectation; thankfully it finally matches reality.
So something else in v6.2 is causing errors beyond RAID5's tolerance.
And the data-drop problem is definitely a big problem, but I do not
think it's really RAID56 specific.
My guess is that some later RAID56 fixes are not yet backported to
v6.2?
One possible quick fix is:
486c737f7fdc ("btrfs: raid56: always verify the P/Q contents for scrub")
>
> It was a bad reproducer. Sorry about that!
>
>> Finally, the lack of any dmesg output is indeed a problem: there is
>> *NO* error message at all if we fail to recover a full stripe.
>> Just check the recover_sectors() call and its callers.
>
> The trouble with dmesg errors is that there are so many of them when
> there's a failure of a wide RAID5 stripe, and they can end up ratelimited,
> so admins can't see all of them. But they're also better than nothing
> for case 6, so I approve adding a new one here.
>
> For case 6, it might be reasonable to throw the entire filesystem
> read-only; otherwise, there's silent data loss, and not having silent
> data loss is a critical feature of btrfs. It would still be possible
> to scrub, find the affected files, and remove them, without triggering a
> rmw cycle. Maybe some future btrfs enhancement could handle it better,
> but for now a dead stop is better than any amount of silent loss.
>
> One possible such enhancement is to keep the blocks from being
> allocated again, e.g. put them on an in-memory list to break the
> allocate-write-fail-free-allocate loop above.
The retry for a failed full stripe is intentional, for the following
two reasons:
- There is a chance the error is temporary
  So the next retry may get good contents and do the repair correctly.
- There is no need to cache something we know is unreliable
[...]
>
> On 6.6, all the automation was broken by the regression: when we replaced
> a broken file or added a new one, the new file was almost always silently
> corrupted,
A quick search through the data writeback path indeed shows we do
nothing extra when hitting a writeback error.
This means the dirty pages would be considered clean and uptodate, and
thus can later be dropped.
I think we should change the behavior of clearing dirty flags, so that
the dirty flag only gets cleared if the writeback succeeded.
So I believe this is the root cause.
This change is going to be huge, and I have to check other filesystems
to be extra safe.
But even with that fixed, the do-not-touch-bad-full-stripe policy does
not help unless the corrupted sector gets fully deleted, which can be
very hard to detect.
Maybe you can try a different recovery behavior?
E.g. always read the file, and if the read fails, delete it and then
copy a new one from the backup server.
That may solve the problem, as the bad sectors would be deleted (along
with their csums).
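A rough userspace sketch of that recovery pass (restore_from_backup()
is a stand-in for whatever your deployment uses; drop caches first so
the reads actually hit the disk and trigger csum verification):

	#include <errno.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Stand-in: fetch a fresh copy from the backup server here. */
	static int restore_from_backup(const char *path)
	{
		(void)path;
		return 0;
	}

	static int verify_or_replace(const char *path)
	{
		char buf[65536];
		ssize_t n;
		int err, fd = open(path, O_RDONLY);

		if (fd < 0)
			return -errno;
		/* Reading every extent forces csum verification. */
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			;
		err = (n < 0) ? errno : 0;
		close(fd);
		if (err == EIO) {
			/* Deleting drops the bad sectors and their csums. */
			unlink(path);
			return restore_from_backup(path);
		}
		return -err;
	}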
Thanks,
Qu
> and we had no IO error signal to trigger replacement of the
> data from another server, so corrupted files became visible to users.
> git-based tools for deployment were fully unusable on 6.6: any commit
> would break the local repo, while deleting and recloning the repo simply
> hit the same broken stripe rmw bug and broke the replacement repo.
>
> At the end of recovery we did a full verification of SHA hashes and
> found only a few errors, all in files that were created or modified
> while running 6.6.
>
>> So a more normalized test would help us both.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>>>
>>> while true; do
>>> # Sometimes throws EIO
>>> sync -f /testfs
>>> sysctl vm.drop_caches=3
>>> sleep 9
>>> done
>>>
>>> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
>>> repo to detect lost writes at the application level:
>>>
>>> while true; do
>>> cd /testfs/btrfs-progs
>>> # Sometimes fails complaining about various files being corrupted
>>> find * -type f -print | unsort -r | while read -r x; do
>>> date >> "$x"
>>> git commit -am"Modifying $x"
>>> done
>>> git repack -a
>>> done
>>>
>>> The errors occur on the sync -f and various git commands, e.g.:
>>>
>>> sync: error syncing '/media/testfs/': Input/output error
>>> vm.drop_caches = 3
>>>
>>> error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
>>> fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
>>>
>