From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
Paul Jones <paul@pauljones.id.au>,
Wang Yugui <wangyugui@e16-tech.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 6 Jun 2022 19:21:42 +0800
Message-ID: <c4c298bf-ca54-1915-c22f-a6d87fc5a78f@gmx.com>
In-Reply-To: <252577ba-1659-62f8-fc44-fea506eb97b7@gmx.com>
On 2022/6/6 16:16, Qu Wenruo wrote:
>
>
[...]
>>>
>>> Hello Qu,
>>>
>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>> like mdraid 5/6 does. There, one bit in the bitmap represents, for
>>> example, one gigabyte of the disk that _may_ be dirty, and the bit is
>>> left dirty for a while and doesn't need to be set for each write. Or
>>> you could do a per-block-group dirty bit.
>>
>> That would be a pretty good way to do an automatic scrub after an
>> unclean close.
>>
>> Currently we have quite a few different ideas, but some are pretty
>> similar, just at different points on a spectrum:
>>
>> Easier to implement               ..               Harder to implement
>> |<- More on mount time scrub      ..               More on journal ->|
>> |                                       |       |       \- Full journal
>> |                                       |       \--- Per bg dirty bitmap
>> |                                       \----------- Per bg dirty flag
>> \--------------------------------------------------- Per sb dirty flag
>
> In fact, I have recently been reading the MD code (including MD-raid5).
>
> It turns out it has a write-intent bitmap, which is almost the per-bg
> dirty bitmap in the above spectrum.
>
> In fact, since btrfs has CoW and checksums for metadata (and part of
> its data), btrfs scrub can do a much better job than MD at resilvering
> the range.
>
> Furthermore, we have a pretty good amount of reserved space (1MiB) and
> a pretty reasonable stripe length (1GiB).
> This means we only need 32KiB for the bitmap of each RAID56 stripe,
> much smaller than the 1MiB we reserved.
>
> I think this can be a pretty reasonable middle ground: faster than a
> full journal, while the amount to scrub should be reasonable enough to be
> done at mount time.
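
As a quick sanity check of the sizes quoted above, here is a small Python
sketch. It is not btrfs code and every name in it is made up for
illustration; it only assumes the 1GiB stripe length, 32KiB bitmap and 1MiB
reserved space from the mail, and works out what granularity per bit those
numbers imply.

```python
# Back-of-the-envelope check of the bitmap sizing quoted above.
# All names here are illustrative, not btrfs identifiers.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

STRIPE_LEN = 1 * GIB     # stripe length from the mail
BITMAP_SIZE = 32 * KIB   # bitmap size from the mail
RESERVED = 1 * MIB       # reserved space from the mail


def bytes_per_bit(stripe_len: int, bitmap_size: int) -> int:
    """Range of the stripe covered by a single bitmap bit."""
    return stripe_len // (bitmap_size * 8)


def bitmap_size_for(stripe_len: int, granularity: int) -> int:
    """Bitmap size in bytes when one bit covers `granularity` bytes."""
    bits = -(-stripe_len // granularity)  # ceiling division
    return -(-bits // 8)


if __name__ == "__main__":
    # Bytes of stripe covered by each bit of a 32KiB bitmap.
    print(bytes_per_bit(STRIPE_LEN, BITMAP_SIZE))
    # Bitmap bytes needed when one bit covers 4KiB.
    print(bitmap_size_for(STRIPE_LEN, 4 * KIB))
    # How many such bitmaps fit into the reserved space.
    print(RESERVED // BITMAP_SIZE)
```

The 32KiB figure works out to one bit per 4KiB of stripe. That granularity
is derived here, not stated in the mail; a coarser granularity would shrink
the bitmap further at the cost of scrubbing more after a crash.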

Furthermore, this even allows us to go with something like a bitmap tree
for such a write-intent bitmap.

And as long as the user is not using RAID56 for metadata (maybe it's even
OK to use RAID56 for metadata), it should be pretty safe against most
write-hole cases (for metadata and CoW data only, though; nocow data is
still affected).

Thus I believe this can be a valid path to explore, and it may even
deserve a higher priority than a full journal.
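
To make the write-intent idea concrete, below is a toy, in-memory Python
sketch. It is only an illustration under assumptions: the class and method
names are hypothetical, nothing here is the actual MD or btrfs
implementation, and a real version would have to persist the bitmap to disk
before submitting the data and parity writes.

```python
from typing import List, Tuple


class WriteIntentBitmap:
    """Toy write-intent bitmap: one bit per `granularity` bytes of a stripe."""

    def __init__(self, stripe_len: int, granularity: int) -> None:
        self.granularity = granularity
        self.nbits = -(-stripe_len // granularity)  # ceiling division
        self.bits = bytearray(-(-self.nbits // 8))

    def _bit_range(self, offset: int, length: int) -> range:
        first = offset // self.granularity
        last = (offset + length - 1) // self.granularity
        return range(first, last + 1)

    def mark_dirty(self, offset: int, length: int) -> None:
        """Record the intent to write; must happen before the write itself."""
        for bit in self._bit_range(offset, length):
            self.bits[bit // 8] |= 1 << (bit % 8)

    def clear(self, offset: int, length: int) -> None:
        """Drop the intent once data and parity are both on disk.

        This toy ignores other writes still in flight in the same region.
        """
        for bit in self._bit_range(offset, length):
            self.bits[bit // 8] &= ~(1 << (bit % 8)) & 0xFF

    def dirty_ranges(self) -> List[Tuple[int, int]]:
        """Merged (offset, length) ranges to scrub after an unclean close."""
        ranges = []
        start = None
        for bit in range(self.nbits):
            dirty = (self.bits[bit // 8] >> (bit % 8)) & 1
            if dirty and start is None:
                start = bit
            elif not dirty and start is not None:
                ranges.append((start * self.granularity,
                               (bit - start) * self.granularity))
                start = None
        if start is not None:
            ranges.append((start * self.granularity,
                           (self.nbits - start) * self.granularity))
        return ranges


if __name__ == "__main__":
    bm = WriteIntentBitmap(stripe_len=64 * 1024, granularity=4096)
    bm.mark_dirty(4096, 8192)   # a write that is about to be submitted
    bm.mark_dirty(40960, 1)     # a tiny write still dirties a whole region
    bm.clear(4096, 8192)        # the first write completed cleanly
    print(bm.dirty_ranges())    # what a mount-time scrub would need to cover
```

Only the regions still marked dirty at mount time need scrubbing, which is
why the amount of work stays bounded by what was in flight at the crash.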

Thanks,
Qu
>
> Thanks,
> Qu
>>
>> In fact, the dirty bitmap is just a simplified version of a journal (it
>> only records the metadata, without the data).
>> Unlike dm/dm-raid56, with btrfs scrub, we should be able to fully
>> recover the data without problem.
>>
>> Even with a per-bg dirty bitmap, we still need some extra location to
>> record the bitmap. Thus it needs an on-disk format change anyway.
>>
>> Currently only the sb dirty flag may be backward compatible.
>>
>> And whether we should wait for the scrub to finish before allowing
>> users to do anything to the fs is another concern.
>>
>> Even using a bitmap, we may have several GiB of data that need to be
>> scrubbed.
>> If we wait for the scrub to finish, it's the best and safest way, but
>> users won't be happy at all.
>>
>> If we go the scrub-resume way, it's faster but still leaves a large
>> window in which a write-hole can reduce our tolerance.
>>
>> Thanks,
>> Qu
>>>
>>> And while you're at it, add the same mechanism to all the other raid
>>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>>
>>> Regards,
>>> Lukas Straub
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> Paul.
>>>>>
>>>>>
>>>
>>>
>>>