Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11, still works on 5.5.1

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: linux-btrfs@vger.kernel.org
Subject: Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes).  With reproducer for kernel 5.3.11, still works on 5.5.1
Date: Thu, 6 Feb 2020 18:40:37 -0500	[thread overview]
Message-ID: <20200206234037.GD13306@hungrycats.org> (raw)
In-Reply-To: <20191119040827.GC22121@hungrycats.org>

[-- Attachment #1: Type: text/plain, Size: 7591 bytes --]

On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data.  This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
> 
> This doesn't happen very often.  The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate.  It behaves more like a race condition than
> a boundary condition.  It can take a few tries to get a failure with a
> 16GB disk array.  It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
> 
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.

Still reproducible on 5.5.1.  Some more details below:

> [...snip...]
> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
> 
> 	#!/bin/bash
> 	set -x
> 
> 	# Reset state
> 	umount /try
> 	mkdir -p /try
> 
> 	# Create FS and mount.	Use raid1 metadata so the filesystem
> 	# has a fair chance of survival.
> 	mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
> 	btrfs dev scan
> 	mount -onoatime /dev/vdb /try || exit 1
> 
> 	# Must be on btrfs
> 	cd /try || exit 1
> 	btrfs sub list . || exit 1
> 
> 	# Fill disk with files.  Increase seq for more test data
> 	# to increase the chance of finding corruption.
> 	for x in $(seq 0 3); do
> 		sync &
> 		rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> 		sleep 2
> 	done
> 	wait
> 
> 	# Remove half the files.  If you increased seq above, increase the
> 	# '-2' here as well.
> 	find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> 		sync &
> 		rm -fr "$x" &
> 		sleep 2
> 	done
> 	wait
> 
> 	# Fill in some of the holes.  This is to get a good mix of
> 	# partially filled RAID stripes of various sizes.
> 	for x in $(seq 0 1); do
> 		sync &
> 		rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> 		sleep 2
> 	done
> 	wait
> 
> 	# Calculate hash we will use to verify data later
> 	find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
> 
> 	# Make sure it's all on the disk
> 	sync
> 	sysctl vm.drop_caches=3
> 
> 	# See distribution of data across drives
> 	btrfs dev usage /try
> 	btrfs fi usage /try
> 
> 	# Corrupt one byte of each of the first 4G on /dev/vdb,
> 	# so that the crc32c algorithm will always detect the corruption.
> 	# If you need a bigger test disk then increase the '4'.
> 	# Leave the first 16MB of the disk alone so we don't kill the superblock.
> 	perl -MFcntl -e '
> 		for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
> 			my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> 			sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> 			sysread(STDIN, $dat, 1) or die "read: $!";
> 			sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> 			syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> 		}
> 	' </dev/vdb >/dev/vdb
> 
> 	# Make sure all that's on disk and our caches are empty
> 	sync
> 	sysctl vm.drop_caches=3

I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2).  Part 1 creates
a RAID5 array with corruption on one disk.  Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.

I saved a copy of the VM's disk images after part 1, so I could repeatedly
reset and run part 2 on identical filesystem images.

I also split up

	btrfs scrub start -rBd /try

and replaced it with

	btrfs scrub start -rBd /dev/vdb
	btrfs scrub start -rBd /dev/vdc

(and same for the non-read-only scrubs) so the scrub runs sequentially
on each disk, instead of running on both in parallel.

Original part 2:

> 	# Before and after dev stat and read-only scrub to see what the damage looks like.
> 	# This will produce some ratelimited kernel output.
> 	btrfs dev stat /try | grep -v ' 0$'
> 	btrfs scrub start -rBd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# Verify all the files are correctly restored transparently by btrfs.
> 	# btrfs repairs correctable blocks as a side-effect.
> 	sha1sum --quiet -c /tmp/sha1sums.txt
> 
> 	# Do a scrub to clean up stray corrupted blocks (including superblocks)
> 	btrfs dev stat /try | grep -v ' 0$'
> 	btrfs scrub start -Bd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# This scrub should be clean, but sometimes is not.
> 	btrfs scrub start -Bd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# Verify that the scrub didn't corrupt anything.
> 	sha1sum --quiet -c /tmp/sha1sums.txt

Multiple runs of part 2 produce different results in scrub:

	result-1581019560.txt:scrub device /dev/vdb (id 1) done
	result-1581019560.txt:Error summary:    super=1 csum=273977
	result-1581019560.txt:scrub device /dev/vdc (id 2) done
	result-1581019560.txt:Error summary:    read=1600744 csum=230813
	result-1581019560.txt:[/dev/vdb].corruption_errs  504791

	result-1581029949.txt:scrub device /dev/vdb (id 1) done
	result-1581029949.txt:Error summary:    super=1 csum=273799
	result-1581029949.txt:scrub device /dev/vdc (id 2) done
	result-1581029949.txt:Error summary:    read=1600744 csum=230813
	result-1581029949.txt:[/dev/vdb].corruption_errs  504613

With scrub on the filesystem it is no better:

	result-1.txt:scrub device /dev/vdb (id 1) done
	result-1.txt:Error summary:    super=1 csum=272757
	result-1.txt:scrub device /dev/vdc (id 2) done
	result-1.txt:Error summary:    read=1600744 csum=230813
	result-1.txt:[/dev/vdb].corruption_errs  503571
	result-2.txt:scrub device /dev/vdb (id 1) done
	result-2.txt:Error summary:    super=1 csum=273430
	result-2.txt:scrub device /dev/vdc (id 2) done
	result-2.txt:Error summary:    read=1600744 csum=230813
	result-2.txt:[/dev/vdb].corruption_errs  504244
	result-3.txt:scrub device /dev/vdb (id 1) done
	result-3.txt:Error summary:    super=1 csum=273456
	result-3.txt:scrub device /dev/vdc (id 2) done
	result-3.txt:Error summary:    read=1600744 csum=230813
	result-3.txt:[/dev/vdb].corruption_errs  504270

The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):

	result-1.txt:scrub device /dev/vdb (id 1) done
	result-1.txt:Error summary:    csum=29911
	result-1.txt:scrub device /dev/vdc (id 2) done
	result-1.txt:Error summary:    csum=11
	result-2.txt:scrub device /dev/vdb (id 1) done
	result-2.txt:Error summary:    csum=29919
	result-2.txt:scrub device /dev/vdc (id 2) done
	result-2.txt:Error summary:    csum=14
	result-3.txt:scrub device /dev/vdb (id 1) done
	result-3.txt:Error summary:    csum=29713
	result-3.txt:scrub device /dev/vdc (id 2) done
	result-3.txt:Error summary:    csum=9

The error counts on /dev/vdb are different after the sha1sum -c,
indicating that file reads are nondeterministically correcting or not
correcting csum errors on btrfs raid5.  This could be due to readahead
or maybe something else.

The error counts on /dev/vdc are interesting, as that drive is not
corrupted, nor does it have read errors, but it is very consistently
reporting read=1600744 csum=230813 in scrub output.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

next prev parent reply	other threads:[~2020-02-06 23:40 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-19  4:08 RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11 Zygo Blaxell
2019-11-19  6:58 ` Qu Wenruo
2019-11-19 23:57   ` Zygo Blaxell
2020-02-06 23:40 ` Zygo Blaxell [this message]
2020-04-02 11:10 ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200206234037.GD13306@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.