From: Qu WenRuo <wqu@suse.com>
To: "fdmanana@gmail.com" <fdmanana@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
Filipe Manana <FdManana@suse.com>
Subject: Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
Date: Thu, 23 Jan 2020 12:28:54 +0000 [thread overview]
Message-ID: <9e202c2d-b715-ef02-4895-0529ad127449@suse.com> (raw)
In-Reply-To: <CAL3q7H4ed9PtALC_xjPeaiKDDhAN1oNzgM0yd=buF_C5r+x7wA@mail.gmail.com>
On 2020/1/23 下午8:06, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
>> looped runs can lead to random failure, where scrub finds csum error.
>>
>> The possibility is not high, around 1/20 to 1/100, but it's causing data
>> corruption.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Dev-replace has two source of writes:
>> - Write duplication
>> All writes to source device will also be duplicated to target device.
>>
>> Content: Latest data/meta
>
> I find the term "latest" a bit confusing, perhaps "not yet persisted
> data and metadata" is more clear.
>
>>
>> - Scrub copy
>> Dev-replace reused scrub code to iterate through existing extents, and
>> copy the verified data to target device.
>>
>> Content: Data/meta in commit root
>
> And so here "previously persisted data and metadata".
>
>>
>> The difference in contents makes the following race possible:
>> Regular Writer | Dev-replace
>> -----------------------------------------------------------------
>> ^ |
>> | Preallocate one data extent |
>> | at bytenr X, len 1M |
>> v |
>> ^ Commit transaction |
>> | Now extent [X, X+1M) is in |
>> v commit root |
>> ================== Dev replace starts =========================
>> | ^
>> | | Scrub extent [X, X+1M)
>> | | Read [X, X+1M)
>> | | (The content are mostly garbage
>> | | since it's preallocated)
>> ^ | v
>> | Write back happens for |
>> | extent [X, X+512K) |
>> | New data writes to both |
>> | source and target dev. |
>> v |
>> | ^
>> | | Scrub writes back extent [X, X+1M)
>> | | to target device.
>> | | This will over write the new data in
>> | | [X, X+512K)
>> | v
>>
>> This race can only happen for nocow writes. Thus metadata and data cow
>> writes are safe, as COW will never overwrite extents of previous trans
>> (in commit root).
>>
>> This behavior can be confirmed by disabling all fallocate related calls
>> in fsstress (*), then all related tests can pass a 2000 run loop.
>>
>> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>> -f collapse=0 -f punch=0 -f resvsp=0"
>> I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>>
>> [FIX]
>> Make dev-replace to require mandatory block group RO, and wait for current
>> nocow writes before calling scrub_chunk().
>>
>> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed") for dev-replace path.
>>
>> ENOSPC for dev-replace is still much better than data corruption.
>
> Technically if we flag the block group RO without being able to
> persist that due to ENOSPC, it's still ok.
Right, I will change the words, since it only slightly increase the
possibility of ENOSPC, not ensured to cause ENOSPC and abort replace.
> We just want to prevent nocow writes racing with scrub copying
> procedure. But that's something for some other time, and this is fine
> to me.
>
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> RFC->v1:
>> - Remove the RFC tag
>> Since the cause is pinned and verified, no need for RFC.
>>
>> - Only wait for nocow writes for dev-replace
>> CoW writes are safe as they will never overwrite extents in commit
>> root.
>>
[...]
>> + /*
>> + * Now the target block is marked RO, wait for nocow writes to
>> + * finish before dev-replace.
>> + * COW is fine, as COW never overwrites extents in commit tree.
>> + */
>> + if (sctx->is_dev_replace)
>> + btrfs_wait_nocow_writers(cache);
>
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.
> After that you still need to call:
>
> btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
Forgot that, although 1000 runs doesn't expose any problem you are
completely right.
I'll update the patch to address all mentioned comments.
Thanks,
Qu
>
> Other than that, looks good to me.
>
> Thanks.
>
>> +
>> + scrub_pause_off(fs_info);
>> down_write(&dev_replace->rwsem);
>> dev_replace->cursor_right = found_key.offset + length;
>> dev_replace->cursor_left = found_key.offset;
>> --
>> 2.25.0
>>
>
>
next prev parent reply other threads:[~2020-01-23 12:31 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-23 7:37 [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace Qu Wenruo
2020-01-23 12:06 ` Filipe Manana
2020-01-23 12:28 ` Qu WenRuo [this message]
2020-01-23 13:39 ` Qu Wenruo
2020-01-23 13:49 ` Filipe Manana
2020-01-23 13:57 ` Qu Wenruo
2020-01-23 16:40 ` David Sterba
2020-01-23 23:58 ` Qu Wenruo
-- strict thread matches above, loose matches on Subject: below --
2020-01-23 23:58 Qu Wenruo
2020-01-24 9:24 ` Filipe Manana
2020-01-24 13:40 ` David Sterba
2020-01-24 14:44 ` David Sterba
2020-01-24 16:28 ` Filipe Manana
2020-01-25 0:36 ` Qu Wenruo
2020-01-25 0:48 ` Qu Wenruo
2020-01-25 11:35 ` David Sterba
2020-01-25 12:09 ` Qu WenRuo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9e202c2d-b715-ef02-4895-0529ad127449@suse.com \
--to=wqu@suse.com \
--cc=FdManana@suse.com \
--cc=fdmanana@gmail.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox