From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
linux-btrfs@vger.kernel.org
Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
Date: Sat, 1 Jun 2024 17:22:46 +0930 [thread overview]
Message-ID: <a9d16fd4-2fec-40c7-94cc-c53aa208c9b9@gmx.com> (raw)
In-Reply-To: <ZlqUe+U9hJ87jJiq@hungrycats.org>
On 2024/6/1 12:54, Zygo Blaxell wrote:
> There is a new silent data loss bug in kernel 6.2 and later.
> The requirements for the bug are:
>
> 1. 6.2 or later kernel
> 2. raid5 data in the filesystem
> 3. one device severely corrupted
> 4. some free space fragmentation to trigger a lot of rmw cycles
I'm still not convinced these conditions are enough to trigger the bug.
Since RAID56 now does csum verification before the RMW, even if some
range is fully corrupted, the recovered data would be used instead, as
long as it matches the csum.
And if any vertical stripe cannot be recovered, the whole RMW cycle
would error out.
[...]
> --------
>
> In the commit, I notice that when reading the rmw stripe, any blocks with
> csum errors are flagged in rbio->error_bitmap, but nothing ever clears
> those error bits once they are set.
Nope, rmw_rbio() calls bitmap_clear() on the error_bitmap before doing
any RMW.
The same goes for finish_parity_scrub() and scrub_rbio().
Yes, this means we can have a cached rbio with a stale error bitmap,
but it doesn't make any difference, as rmw_rbio() is always the entry
point of an RMW cycle.
Maybe I can improve that by also clearing the error bitmap after
everything is done, but I'd prefer a proper root cause analysis before
doing any random fix.
[...]
>
> My third experiment breaks the error recovery code, but it does prevent
> the sync failures and missing extent holes, so it shows that the error
> recovery code itself is not what is causing the dropped writes--it's
> the bits left set in error_bitmap after recovery is done.
Yep, that's expected.
So I'm more interested in a proper (ideally minimal) reproducer than in
any fix attempt (the fact that no patch has been sent already shows the
attempt did not pan out).
>
>
> Test Case
> ---------
>
> My test case uses three loops running in parallel on a 500 GiB test filesystem:
>
> Data Metadata System
> Id Path RAID5 RAID1 RAID1 Unallocated Total Slack
> -- -------- --------- -------- -------- ----------- --------- --------
> 1 /dev/vdb 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 19.59GiB
> 2 /dev/vdc 71.00GiB 1.00GiB 8.00MiB 647.99GiB 720.00GiB 3.71GiB
> 3 /dev/vdd 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 3.71GiB
> 4 /dev/vde 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> 5 /dev/vdf 71.00GiB 2.00GiB - 647.00GiB 720.00GiB 11.00GiB
> -- -------- --------- -------- -------- ----------- --------- --------
> Total 284.00GiB 4.00GiB 8.00MiB 3.16TiB 3.52TiB 49.02GiB
> Used 262.97GiB 2.61GiB 64.00KiB
>
> The data is a random collection of small files, half of which have been deleted
> to make lots of small free space holes for rmw.
>
> Loop 1 alternates between corrupting device 3 and repairing it with scrub:
The reproducer is not good enough; in fact it's pretty bad...
Starting from anything that is not in a known, normalized state is
never a good way to reproduce a bug, but I guess it's the best scenario
you have for now.
Can you try to do it on a freshly created fs instead?
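For example, something like this rough sketch (the device list and the
mount point are just placeholders matching your report above):

    # Recreate the raid5 data / raid1 metadata layout from scratch
    mkfs.btrfs -f -d raid5 -m raid1 /dev/vdb /dev/vdc /dev/vdd /dev/vde /dev/vdf
    mount /dev/vdb /testfs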
>
> while true; do
> # Any big file will do, usually faster than /dev/random
> # Skipping the first 1M leaves the superblock intact
> while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
> # This should fix all the corruption as long as there are no
> # reads or writes anywhere on the filesystem
> btrfs scrub start -Bd /dev/vdd
> done
[IMPROVE THE TEST]
If you want interleaved free space, just create a ton of 4K files and
then delete every other one.
And instead of vmlinux or whatever file, you can always go with
randomly/pattern-filled files, and save their md5sums for verification.
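Something like the following sketch; the file counts, paths and sizes
are arbitrary:

    # Interleaved free space: create a ton of 4K files, then delete
    # every other one to leave small holes for RMW to hit.
    mkdir -p /testfs/frag
    for i in $(seq 1 100000); do
            dd if=/dev/urandom of=/testfs/frag/f$i bs=4k count=1 status=none
    done
    sync
    for i in $(seq 1 2 100000); do
            rm /testfs/frag/f$i
    done

    # Verifiable test data: random content with a recorded checksum,
    # so lost writes show up directly at verification time.
    dd if=/dev/urandom of=/testfs/victim bs=1M count=1024 status=none
    md5sum /testfs/victim > /root/victim.md5
    sync
    # ... run the corruption + scrub workload, then:
    md5sum -c /root/victim.md5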
[MY CURRENT GUESS]
My current guess is some race between the dd corruption and RMW.
Back when I was last working on RAID56, I always did the corruption
offline (i.e. with the fs unmounted), and that always worked like a
charm.
So corrupting the device while the fs is mounted and active may be a
point of concern.
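For reference, the offline procedure I mean is roughly this (device
names, mount point and sizes are placeholders):

    umount /testfs
    # Corrupt one member only; skipping the first 1MiB keeps the
    # primary superblock intact, matching your loop 1 above.
    dd if=/dev/urandom of=/dev/vdd bs=1M seek=1 count=4096
    mount /dev/vdb /testfs
    # Scrub with nothing else touching the fs, then verify the data.
    btrfs scrub start -Bd /dev/vdd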
Another thing: if a full stripe is determined to have unrepairable
data, no RMW can ever be done on that full stripe again (unless the
problem is fixed manually).
So if you somehow made a full stripe unrepairable by corrupting just
one device (maybe combined with some pre-existing csum mismatch?), then
that full stripe would never be written back, and the new data for it
would be lost.
Finally, the lack of any dmesg output is indeed a problem: there is
*NO* error message at all when we fail to recover a full stripe.
Just check recover_sectors() and its callers.
I believe that contributes to the confusion: btrfs considers the fs to
be fine, while it is in fact hitting tons of errors and aborting all
writes to those full stripes.
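Until the error reporting is improved, it may still be worth capturing
the per-device error counters and dmesg around each corruption/scrub
cycle in the reproducer. This won't show the silent RMW abort itself,
it just gives us extra data points on what errors are being counted:

    btrfs device stats /testfs
    dmesg | grep -iE 'btrfs|raid56' | tail -n 50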
I appreciate the effort you have put into this case, but I really hope
to get a more reproducible procedure, otherwise it's really hard to say
what is going wrong.
If needed I can craft some debug patches for you to test, but I believe
you won't really want to run test kernels on your large RAID5 array
anyway.
So a more normalized test would help us both.
Thanks,
Qu
>
> Loop 2 runs `sync -f` to detect sync errors and drops caches:
>
> while true; do
> # Sometimes throws EIO
> sync -f /testfs
> sysctl vm.drop_caches=3
> sleep 9
> done
>
> Loop 3 does some random git activity on a clone of the 'btrfs-progs'
> repo to detect lost writes at the application level:
>
> while true; do
> cd /testfs/btrfs-progs
> # Sometimes fails complaining about various files being corrupted
> find * -type f -print | unsort -r | while read -r x; do
> date >> "$x"
> git commit -am"Modifying $x"
> done
> git repack -a
> done
>
> The errors occur on the sync -f and various git commands, e.g.:
>
> sync: error syncing '/media/testfs/': Input/output error
> vm.drop_caches = 3
>
> error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
> fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt
>
Thread overview: 7+ messages
2024-06-01 3:24 raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Zygo Blaxell
2024-06-01 7:52 ` Qu Wenruo [this message]
2024-06-08 1:55 ` Zygo Blaxell
2024-06-08 3:20 ` Qu Wenruo
2024-07-08 6:25 ` Zygo Blaxell
2024-07-08 8:26 ` Lukas Straub
2024-07-08 20:22 ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell