All of lore.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
Date: Wed, 8 Jun 2022 10:47:51 +0100	[thread overview]
Message-ID: <20220608094751.GA3603651@falcondesktop> (raw)
In-Reply-To: <53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com>

On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a small workload which is a shorter version extracted from
> btrfs/125:
> 
>   mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>   mount $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>   sync
>   umount $mnt
>   btrfs dev scan -u $dev3
>   mount -o degraded $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>   umount $mnt
>   btrfs dev scan
>   mount $dev1 $mnt
>   btrfs balance start --full-balance $mnt
>   umount $mnt
> 
> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
> steal_rbio() subpage compatible") above script will always pass, just
> like btrfs/125.

For me it still fails, very often.
Both before and after that commit.

> 
> But after a bug fix/optimization named "btrfs: update
> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
> will always fail (just like the old behavior before upstream d4e28d9b5f04).

And I'm running a branch without that patch, which is just misc-next from
about 2 weeks ago:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch

This is still all due to the fundamental flaw we have with partial stripe
writes I pointed out long ago:

https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Running it again:

root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
MKFS_OPTIONS  -- /dev/sdb
MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1

btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
    --- tests/btrfs/125.out	2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad	2022-06-08 10:18:13.521948910 +0100
    @@ -3,5 +3,15 @@
     Write data with degraded mount
     
     Mount normal and balance
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
    +There may be more info in syslog - try dmesg | tail
    +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
     
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
Ran: btrfs/125
Failures: btrfs/125
Failed 1 of 1 tests

root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
[777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
[777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
[777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
[777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
[777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.360356] BTRFS info (device sdb): using free space tree
[777881.360357] BTRFS info (device sdb): has skinny extents
[777881.365900] BTRFS info (device sdb): checking UUID tree
[777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
[777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
[777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.460204] BTRFS info (device sdb): allowing degraded mounts
[777881.460206] BTRFS info (device sdb): using free space tree
[777881.460206] BTRFS info (device sdb): has skinny extents
[777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
[777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.939425] BTRFS info (device sdb): using free space tree
[777881.939426] BTRFS info (device sdb): has skinny extents
[777881.959199] BTRFS info (device sdb): balance: start -d -m -s
[777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
[777882.352088] verify_parent_transid: 787 callbacks suppressed
[777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.455036] BTRFS info (device sdb): balance: ended with status: -5
[777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure


> 
> In my case, it fails due to tree block read failure, mostly on bytenr
> 38993920:
> 
>  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  ...
> 
> [CAUSE]
> With the recently added debug output, we can see all RAID56 operations
> related to full stripe 38928384:
> 
>  23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>  23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>  23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>  23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>  23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>  23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>  23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>  23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>  23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>  23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> 
> Before we enter raid56_parity_recover(), we have triggered some metadata
> write for the full stripe 38928384, this leads to us to read all the
> sectors from disk.
> 
> However the test case itself intentionally uses degraded mount to create
> stale metadata.
> Thus we read out the stale data and cached them.
> 
> When we really need to recover certain range, aka in
> raid56_parity_recover(), we will use the cached rbio, along with its
> cached sectors.
> 
> And since those sectors are already cached, we won't even bother to
> re-read them.
> This explains why we have no event raid56_scrub_read_recover()
> triggered.
> 
> Since we use the staled sectors to recover, and obviously this
> will lead to incorrect data recovery.
> 
> In our particular test case, it will always return the same incorrect
> metadata, thus causing the same error message "parent transid verify
> failed on 39010304 wanted 9 found 7" again and again.
> 
> [FIX]
> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
> compatible") has a bug that makes RAID56 to skip any cached sector, thus
> it incidentally fixed the failure of btrfs/125.

As mentioned before, I still have the test failing very often even after
that commit, and I don't see how that could ever fix the fundamental
problem raid56 has with partial stripe writes.

Thanks.

> 
> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
> reverted to the old trust-cache-unconditionally behavior, and
> re-introduced the bug.
> 
> In fact, we should still trust the cache for cases where everything is
> fine.
> 
> What we really need is, trust nothing if we're recovery the full stripe.
> 
> So this patch will fix the behavior by not trust any cache in
> __raid56_parity_recover(), to solve the problem while still keep the
> cache useful.
> 
> Now btrfs/125 and above test case can always pass, instead of the old
> random failure behavior.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how to push this patch.
> 
> It's a bug fix for the very old trust-cache-unconditionally bug, but
> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
> is really re-introducing the bad old behavior.
> 
> Thus I guess it may be a good idea to fold this small fix into "btrfs:
> update stripe_sectors::uptodate in steal_rbio" ?
> ---
>  fs/btrfs/raid56.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index c1f61d1708ee..be2f0ea81116 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  	atomic_set(&rbio->error, 0);
>  
>  	/*
> -	 * read everything that hasn't failed.  Thanks to the
> -	 * stripe cache, it is possible that some or all of these
> -	 * pages are going to be uptodate.
> +	 * Read everything that hasn't failed. However this time we will
> +	 * not trust any cached sector.
> +	 * As we may read out some stale data but higher layer is not reading
> +	 * that stale part.
> +	 *
> +	 * So here we always re-read everything in recovery path.
>  	 */
>  	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>  	     total_sector_nr++) {
> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  			total_sector_nr += rbio->stripe_nsectors - 1;
>  			continue;
>  		}
> -		/* The rmw code may have already read this page in. */
>  		sector = rbio_stripe_sector(rbio, stripe, sectornr);
> -		if (sector->uptodate)
> -			continue;
> -
>  		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>  					 sectornr, rbio->stripe_len,
>  					 REQ_OP_READ);
> -- 
> 2.36.1
> 

  reply	other threads:[~2022-06-08 10:05 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-08  7:09 [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
2022-06-08  9:47 ` Filipe Manana [this message]
2022-06-08 10:06   ` Qu Wenruo
2022-06-08 13:31     ` Filipe Manana
2022-06-08 21:54       ` Qu Wenruo
2022-06-14 14:00         ` David Sterba
2022-06-15  5:47           ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220608094751.GA3603651@falcondesktop \
    --to=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.