Linux Btrfs filesystem development
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: linux-btrfs@vger.kernel.org
Subject: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
Date: Fri, 31 May 2024 23:24:43 -0400	[thread overview]
Message-ID: <ZlqUe+U9hJ87jJiq@hungrycats.org> (raw)

There is a new silent data loss bug in kernel 6.2 and later.
The requirements for the bug are:

	1.  6.2 or later kernel
	2.  raid5 data in the filesystem
	3.  one device severely corrupted
	4.  some free space fragmentation to trigger a lot of rmw cycles

I first discovered this bug after a `pvmove` went badly wrong and wiped
out 70% of one data LV of a btrfs raid5 filesystem.  I figured out a
recipe for reproducing the bug on a test VM (below), and a possible
approach for a fix.

Bug insertion
-------------

The bug is introduced in this commit:

	7a3150723061 btrfs: raid56: do data csum verification during RMW cycle

This commit fixes a long-standing btrfs raid5 bug which mishandled cases
where a raid5 stripe contains corrupted data in some blocks while new data
is written to other blocks in the same stripe.  This older bug resulted
in existing correct data becoming corrupted when new data is written to
the filesystem.  EIO and csum failures would occur later when reading
affected blocks, i.e. the errors were detectable.

Bug effect
----------

The change fixes the old bug, but also introduces two new behaviors that
are potentially worse:

1.  New writes are now silently dropped.  After `sync` and `sysctl
vm.drop_caches=3`, recently written extents are missing from the files,
as if they were never written by the application.  Instead, there is a
hole in the file, or a truncated file.  There is no indication that this
has occurred other than the missing data:  no dmesg messages, no errors
reported to applications at write time.  The data can be read back from
the filesystem while it is still in page cache, which can delay detection
of the failure until after the data is evicted from cache.
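One way to catch this class of failure at the application level is a canary file that is checksummed before and after a cache drop.  This is a sketch of my own, not part of the reproducer below; the function name and sizes are arbitrary:

```shell
# Hypothetical canary check: write a file, force it to disk, evict the
# page cache, then re-read and compare.  Catches writes that are
# silently dropped once the cached copy is gone.
canary_check() {
	dir=$1
	f="$dir/canary.$$"
	head -c 1048576 /dev/urandom > "$f" || return 1
	before=$(sha256sum < "$f")
	sync -f "$dir"
	# Evict clean pages so the re-read must hit the disk (needs root;
	# harmless to skip when unprivileged)
	sysctl vm.drop_caches=3 >/dev/null 2>&1 || true
	after=$(sha256sum < "$f")
	rm -f "$f"
	if [ "$before" = "$after" ]; then
		echo "canary intact"
	else
		echo "SILENT DATA LOSS: $f"
	fi
}
# Example: canary_check /testfs
```

On an affected kernel the canary can read back correctly until the cache is dropped, which is why the drop_caches step matters.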

2.  `fsync` (system call) and `sync -f` (shell command) sometimes
return EIO.  In this situation, devices are online, metadata is intact,
all corruption is located on a single device and is recoverable, so no
writes should ever be failing, but they apparently do.  This also affects
internal btrfs maintenance operations like balance, which aborts with
an IO failure when triggering the bug.

Reverting commit 7a3150723061 on 6.2.16 restores the old behavior:
no silent data loss on new writes, only the older, detectable corruption
of existing data.  It seems we can't have nice things.


Analysis
--------

In the commit, I notice that when reading the rmw stripe, any blocks with
csum errors are flagged in rbio->error_bitmap, but nothing ever clears
those error bits once they are set.  Contrast with the scrub repair code,
which clears out the entire rbio->error_bitmap vector after repairing
the stripe.  The old code didn't set any error_bitmap bits during rmw,
and not much else has changed in this commit, which makes the new set_bit
calls a good candidate for root cause.

I tried some experiments:

	- clear the error_bits individually after setting uptodate = 1
	  (still has problems: after verifying the recovered csum,
	   failed reconstruction would leave error bits set, so writes
	   to the stripe that should not fail will fail)

	- clear the entire error_bits vector after successful repair_sectors
	  (might cause corruption if some reconstruction fails and later
	   reads use data from the stripe without verifying the csum again)

	- remove the new set_bit() call added in the above commit
	  (breaks repair but fixes EIO and dropped writes)

For the first two I didn't have much success.  I'm pretty sure I don't
know how to calculate the error_bit index correctly, so my first attempt
didn't have any effect on the test.  I get a lot of UAF crashes if I clear
the entire vector in the second test--I suspect it needs some protection
from concurrent access that I'm missing.

My third experiment breaks the error recovery code, but it does prevent
the sync failures and missing extent holes, so it shows that the error
recovery code itself is not what is causing the dropped writes--it's
the bits left set in error_bitmap after recovery is done.
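To illustrate the mechanism (a toy model only; the function bodies and variable names here are mine, not the kernel's): csum failures on the read half of rmw set bits, repair fixes the data but leaves the bits set, and writeback then treats the stripe as failed:

```shell
# Toy model of the sticky error bitmap -- illustration, not kernel code
errmap=""

rmw_read() {
	# csum verification flags bad sectors during the read half of rmw
	errmap="$1"
}

repair_sectors() {
	# parity reconstruction fixes the data...
	# ...but the corresponding error bits are never cleared (the bug)
	:
}

writeback() {
	if [ -n "$errmap" ]; then
		echo "EIO: stale error bits ($errmap)"
	else
		echo "write ok"
	fi
}
```

Clearing the bits for sectors whose csums verify after repair, without clearing bits for sectors that failed reconstruction, is the shape of fix the experiments above point toward.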


Test Case
---------

My test case uses three loops running in parallel on a five-device test filesystem with this layout:

		    Data      Metadata System
	Id Path     RAID5     RAID1    RAID1    Unallocated Total     Slack
	-- -------- --------- -------- -------- ----------- --------- --------
	 1 /dev/vdb  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB 19.59GiB
	 2 /dev/vdc  71.00GiB  1.00GiB  8.00MiB   647.99GiB 720.00GiB  3.71GiB
	 3 /dev/vdd  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB  3.71GiB
	 4 /dev/vde  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
	 5 /dev/vdf  71.00GiB  2.00GiB        -   647.00GiB 720.00GiB 11.00GiB
	-- -------- --------- -------- -------- ----------- --------- --------
	   Total    284.00GiB  4.00GiB  8.00MiB     3.16TiB   3.52TiB 49.02GiB
	   Used     262.97GiB  2.61GiB 64.00KiB

The data is a random collection of small files, half of which have been deleted
to make lots of small free space holes for rmw.
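Something like the following generates that half-deleted small-file pattern.  This is a sketch; the file count and sizes are arbitrary choices, not taken from my actual setup:

```shell
# Fill a directory with small files, then delete every other one to
# leave many small free-space holes that force rmw on later writes.
make_holes() {
	dir=$1
	count=${2:-1000}
	mkdir -p "$dir"
	i=1
	while [ "$i" -le "$count" ]; do
		head -c 32768 /dev/urandom > "$dir/f$i"
		i=$((i + 1))
	done
	# Delete the odd-numbered files, keeping the even-numbered ones
	i=1
	while [ "$i" -le "$count" ]; do
		rm -f "$dir/f$i"
		i=$((i + 2))
	done
}
# Example: make_holes /testfs/frag 100000
```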

Loop 1 alternates between corrupting device 3 and repairing it with scrub:

	while true; do
		# Any big file will do, usually faster than /dev/random
		# Skipping the first 1M leaves the superblock intact
		while cat vmlinux; do :; done | dd of=/dev/vdd bs=1024k seek=1
		# This should fix all the corruption as long as there are no
		# reads or writes anywhere on the filesystem
		btrfs scrub start -Bd /dev/vdd
	done

Loop 2 runs `sync -f` to detect sync errors and drops caches:

	while true; do
		# Sometimes throws EIO
		sync -f /testfs
		sysctl vm.drop_caches=3
		sleep 9
	done

Loop 3 does some random git activity on a clone of the 'btrfs-progs'
repo to detect lost writes at the application level:

	while true; do
		cd /testfs/btrfs-progs
		# Sometimes fails complaining about various files being corrupted
		# unsort(1) shuffles the list so files are touched in random order
		find * -type f -print | unsort -r | while read -r x; do
			date >> "$x"
			git commit -am"Modifying $x"
		done
		git repack -a
	done
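A small driver can run the three loops in parallel and stop them after a fixed soak time.  This wrapper and the loop script names are my own convenience, not part of the original recipe:

```shell
# Run each argument as a background shell command, then kill them all
# after the given number of seconds.
soak() {
	secs=$1; shift
	pids=""
	for cmd in "$@"; do
		sh -c "$cmd" &
		pids="$pids $!"
	done
	sleep "$secs"
	kill $pids 2>/dev/null
	wait 2>/dev/null
	echo "soak finished after ${secs}s"
}
# Example: soak 3600 ./loop1.sh ./loop2.sh ./loop3.sh
```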

The errors appear in the `sync -f` loop and in various git commands, e.g.:

	sync: error syncing '/media/testfs/': Input/output error
	vm.drop_caches = 3

	error: object file .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5 is empty
	fatal: loose object 39c876ad9b9af9f5410246d9a3d6bbc331677ee5 (stored in .git/objects/39/c876ad9b9af9f5410246d9a3d6bbc331677ee5) is corrupt

Thread overview: 7+ messages
2024-06-01  3:24 Zygo Blaxell [this message]
2024-06-01  7:52 ` raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle" Qu Wenruo
2024-06-08  1:55   ` Zygo Blaxell
2024-06-08  3:20     ` Qu Wenruo
2024-07-08  6:25       ` Zygo Blaxell
2024-07-08  8:26         ` Lukas Straub
2024-07-08 20:22           ` Write error handling in btrfs (was: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle") Zygo Blaxell
