From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 2/5] btrfs: scrub: Fix RAID56 recovery race condition
Date: Thu, 30 Mar 2017 10:05:59 -0700 [thread overview]
Message-ID: <20170330170558.GB8963@lim.localdomain> (raw)
In-Reply-To: <20170330063251.16872-3-quwenruo@cn.fujitsu.com>
On Thu, Mar 30, 2017 at 02:32:48PM +0800, Qu Wenruo wrote:
> When scrubbing a RAID5 filesystem with recoverable data corruption
> (only one data stripe corrupted), scrub sometimes reports more csum
> errors than expected, and sometimes even reports unrecoverable errors.
>
> The problem can be easily reproduced by the following steps:
> 1) Create a btrfs with RAID5 data profile with 3 devs
> 2) Mount it with nospace_cache or space_cache=v2
> to avoid extra data space usage.
> 3) Create a 128K file and sync the fs, unmount it
> Now the 128K file lies at the beginning of the data chunk
> 4) Locate the physical bytenr of data chunk on dev3
> Dev3 is the 1st data stripe.
> 5) Corrupt the first 64K of the data chunk stripe on dev3
> 6) Mount the fs and scrub it
>
> The correct csum error count should be 16 (assuming a 4K page size,
> as on x86_64). A larger csum error count is reported with roughly a
> 1/3 chance, and an unrecoverable error with roughly a 1/10 chance.
>
> The root cause is a race condition in the RAID5/6 recovery code,
> stemming from the fact that a full scrub is initiated per device, so
> several device scrub threads can handle the same full stripe at once.
>
> For other mirror-based profiles, each mirror is independent of the
> others, so the race causes no serious problem there.
>
> For example:
> Corrupted | Correct | Correct |
> | Scrub dev3 (D1) | Scrub dev2 (D2) | Scrub dev1(P) |
> ------------------------------------------------------------------------
> Read out D1 |Read out D2 |Read full stripe |
> Check csum |Check csum |Check parity |
> Csum mismatch |Csum match, continue |Parity mismatch |
> handle_errored_block | |handle_errored_block |
> Read out full stripe | | Read out full stripe|
> D1 csum error(err++) | | D1 csum error(err++)|
> Recover D1 | | Recover D1 |
>
> So D1's csum error is accounted twice, simply because
> handle_errored_block() lacks sufficient protection against the race.
>
> In an even worse case, D1's recovery code may be rewriting D1/D2/P
> while P's recovery code is still reading out the full stripe, which
> can cause an unrecoverable error.
>
> This patch uses the previously introduced lock_full_stripe() and
> unlock_full_stripe() to protect the whole scrub_handle_errored_block()
> function for RAID56 recovery, so that no extra csum errors or
> unrecoverable errors are reported.
>
> Reported-by: Goffredo Baroncelli <kreijack@libero.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> fs/btrfs/scrub.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5fc99a92b4ff..4bbefc96485d 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1109,6 +1109,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
> int mirror_index;
> int page_num;
> int success;
> + bool full_stripe_locked;
> static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
>
> @@ -1134,6 +1135,25 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
> have_csum = sblock_to_check->pagev[0]->have_csum;
> dev = sblock_to_check->pagev[0]->dev;
>
> + /*
> + * For RAID5/6, a race can happen between the scrub threads of
> + * different devices: on data corruption, the parity thread and
> + * the data thread will both try to recover the same data.
> + * The race can lead to doubly accounted csum errors, or even to
> + * unrecoverable errors.
> + */
> + ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
> + if (ret < 0) {
> + spin_lock(&sctx->stat_lock);
> + /* Either malloc failure or bg_cache not found */
> + if (ret == -ENOMEM)
> + sctx->stat.malloc_errors++;
> + else
> + sctx->stat.uncorrectable_errors++;
Other places in scrub_handle_errored_block() also set read_errors and
uncorrectable_errors; why is the above path an exception?
I'm fine with putting a bool into lock_full_stripe(), but it's easier
to tell whether locking succeeded if the flag is set right after the
lock is actually taken.
Thanks,
-liubo
> + spin_unlock(&sctx->stat_lock);
> + return ret;
> + }
> +
> if (sctx->is_dev_replace && !is_metadata && !have_csum) {
> sblocks_for_recheck = NULL;
> goto nodatasum_case;
> @@ -1468,6 +1488,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
> kfree(sblocks_for_recheck);
> }
>
> + ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
> + if (ret < 0)
> + return ret;
> return 0;
> }
>
> --
> 2.12.1
>
>
>
next prev parent reply other threads:[~2017-03-30 17:06 UTC|newest]
Thread overview: 13+ messages
2017-03-30 6:32 [PATCH v4 0/5] raid56: scrub related fixes Qu Wenruo
2017-03-30 6:32 ` [PATCH v4 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
2017-03-30 16:49 ` Liu Bo
2017-03-31 1:29 ` Qu Wenruo
2017-03-31 17:34 ` Liu Bo
2017-04-03 0:48 ` Qu Wenruo
2017-03-30 6:32 ` [PATCH v4 2/5] btrfs: scrub: Fix RAID56 recovery race condition Qu Wenruo
2017-03-30 17:05 ` Liu Bo [this message]
2017-03-31 0:25 ` Qu Wenruo
2017-03-31 1:40 ` Qu Wenruo
2017-03-30 6:32 ` [PATCH v4 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub Qu Wenruo
2017-03-30 6:32 ` [PATCH v4 4/5] btrfs: Wait flighting bio before freeing target device for raid56 Qu Wenruo
2017-03-30 6:32 ` [PATCH v4 5/5] btrfs: Prevent scrub recheck from racing with dev replace Qu Wenruo