Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: kreijack@inwind.it, Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 13 Jun 2022 10:27:17 +0800
Message-ID: <f8d90bb5-f538-e492-34d7-9e006b2e2e60@gmx.com>
In-Reply-To: <d7de634b-d1a2-7d2d-1b4f-91400b19e9a7@inwind.it>



On 2022/6/9 01:26, Goffredo Baroncelli wrote:
> On 08/06/2022 00.14, Qu Wenruo wrote:
>>
>>
>> On 2022/6/8 01:36, Goffredo Baroncelli wrote:
>>> On 07/06/2022 03.27, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
>>> [...]
>>>
>>>>>
>>>>> But with a battery backup (i.e. no power failure), the likelihood
>>>>> of b)
>>>>> became
>>>>> negligible.
>>>>>
>>>>> This is to say that a write-intent bitmap will provide a huge
>>>>> improvement in the resilience of a btrfs raid5, and in turn raid6.
>>>>>
>>>>> My only suggestions, is to find a way to store the bitmap intent not
>>>>> in the
>>>>> raid5/6 block group, but in a separate block group, with the
>>>>> appropriate
>>>>> level
>>>>> of redundancy.
>>>>
>>>> That's why I want to reject RAID56 as metadata, and just store the
>>>> write-intent tree into the metadata, like what we did for fsync (log
>>>> tree).
>>>>
>>>
>>> My suggestion was not to use the btrfs metadata to store the
>>> "write-intent", but
>>> to track the space used by the write-intent storage area with a bg. Then
>>> the
>>> write intent can be handled not with a btrfs btree, but (e.g.) simply
>>> writing a bitmap of the used blocks, or the pairs [starts, length]....
>>
>> That solution requires a lot of extra change to chunk allocation, and
>> out-of-btree tracking.
>>
>> Furthermore, btrfs Btree itself has CoW to defend against the power loss.
>
> [...]
>>
>> But such write intent bitmap must survive powerloss by itself.
>>
>
> What is the reason that the write intent "must survive" ?

Simple: it lets us know whether a range needs to be scrubbed to close the
write hole.

Consider the following case of RAID56, a partial write:

Disk1:		|WWWWW|     |
Disk2:		|     |WWWWW|
Parity:		|WWWWW|WWWWW|

Suppose we record the range in the write-intent bitmap and then start the
real write, but a power loss happens partway through.

Only the untouched data on Disk1 and Disk2 is safe.
The rest may or may not have been updated.

At the next mount, we should scrub this full stripe to close the write hole.

If that record is somehow lost, for example because the metadata is on
another disk (not involved in the RAID5 array) and that disk is lost,
we are unable to know which full stripe we need to scrub, and thus the
write hole is still there.
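
To make the failure mode concrete, here is a minimal sketch of the case above in Python. It assumes plain XOR parity over two data blocks (RAID5 style); the block contents and the names xor_blocks and on_disk are made up for illustration and are not btrfs code.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A consistent full stripe: two data blocks plus their parity.
d1_old, d2 = b"AAAA", b"BBBB"
parity_old = xor_blocks(d1_old, d2)

# Partial write to Disk1 interrupted by a power loss:
# the new data reached the disk, the matching parity update did not.
d1_new = b"CCCC"
on_disk = {"disk1": d1_new, "disk2": d2, "parity": parity_old}

# If Disk2 is lost later, rebuilding it from the stale parity gives wrong
# data, even though Disk2 itself was never written. This is the write hole.
rebuilt_d2 = xor_blocks(on_disk["disk1"], on_disk["parity"])
write_hole = rebuilt_d2 != d2

# Scrubbing the full stripe while all disks are still present recomputes
# the parity and closes the hole.
on_disk["parity"] = xor_blocks(on_disk["disk1"], on_disk["disk2"])
rebuilt_after_scrub = xor_blocks(on_disk["disk1"], on_disk["parity"])
```

The scrub only helps if we know which full stripe to recompute, which is exactly what the write-intent record has to tell us after the crash.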

> My understanding is that if the write intent is not fully written,
> then no data is written either. And to check whether a write intent
> is fully written, it is enough to verify a checksum.
>
> I imagine (but maybe I am wrong) that the sequence should be:
>
> 1) write the intent (w/ checksum)
> 2) sync
> 3) write the raid5/6 data
> 4) sync
> 5) invalidate the intent
> 6) sync
>
> a) If the powerloss happens before 3, we don't *need* to scrub anything.
> But if the checksum matches we will.
> But hopefully it doesn't harm (only a delay at mount time after a poweroff)
>
>
> b) If the powerloss happens after 2 (and until 6), we should scrub all
> the potentially impacted disk blocks. But the consistency of the write
> intent is guaranteed by "2) sync" (and this doesn't depend on whether
> the intent is stored in a btree or in another kind of storage)
>
>
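
The six-step sequence quoted above can be sketched as a checksummed intent record that is validated at mount time. This is only an illustration under stated assumptions: the record layout (64-bit start, 32-bit length, trailing CRC32) and the function names are hypothetical, and an invalidated intent is modelled as an empty record.

```python
import struct
import zlib

def encode_intent(ranges):
    """Step 1: serialize (start, length) pairs and append a CRC32."""
    payload = b"".join(struct.pack("<QI", s, l) for s, l in ranges)
    return payload + struct.pack("<I", zlib.crc32(payload))

def decode_intent(blob):
    """Return the recorded ranges, or None if the record is torn or empty."""
    if len(blob) < 4 or (len(blob) - 4) % 12:
        return None
    payload, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    if zlib.crc32(payload) != crc:
        return None
    return [struct.unpack_from("<QI", payload, off)
            for off in range(0, len(payload), 12)]

def ranges_to_scrub(intent_blob):
    """At mount: a valid intent means its ranges may be half-written."""
    return decode_intent(intent_blob) or []
```

A torn record (crash during step 1) fails the checksum and nothing is scrubbed, matching case a). A valid record found at mount means the crash happened somewhere between steps 2 and 6, so every range it lists is scrubbed, matching case b).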
>> And in fact, that bitmap is not small as you think.
>>
>> In fact, for users who need write-intent tree/bitmap, we're talking
>> about at least TiB level usage.
>
> The worst case is that the data to write is as big as the memory. Usually
> sizeof(memory) << sizeof(disk), so I think an "extent based" intent map
> would be more efficient than a bitmap.

Nope, a write-intent bitmap/tree doesn't work like that.

In fact, for mdraid the bitmap is normally only 16KiB in size.

It has a proper reclaim mechanism, like flushing a disk to reclaim all
those bits.

Thus I don't really think we'd need to keep the whole bitmap;
sorry for the confusion.
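
As a rough illustration of how a small, fixed-size bitmap with reclaim can cover a large device, here is a minimal Python sketch. It is not the md-bitmap.c implementation: the class name, the way the region size is derived from the device size, and the "clear everything after a flush" policy are all simplifying assumptions.

```python
class WriteIntentBitmap:
    """Fixed-size bitmap; each bit covers one large region of the device."""

    def __init__(self, device_bytes, bitmap_bytes=16 * 1024):
        self.nbits = bitmap_bytes * 8
        # The region size grows with the device so the bitmap stays small.
        self.region = -(-device_bytes // self.nbits)  # ceiling division
        self.bits = bytearray(bitmap_bytes)

    def _regions(self, offset, length):
        first = offset // self.region
        last = (offset + length - 1) // self.region
        return range(first, last + 1)

    def mark(self, offset, length):
        """Set bits before the data write (persisting them is omitted)."""
        for r in self._regions(offset, length):
            self.bits[r // 8] |= 1 << (r % 8)

    def dirty_regions(self):
        """Regions that would need a scrub after an unclean shutdown."""
        return [r for r in range(self.nbits)
                if self.bits[r // 8] >> (r % 8) & 1]

    def reclaim_after_flush(self):
        """Once the device is flushed, the covered writes are durable."""
        self.bits = bytearray(len(self.bits))
```

The trade-off is the usual one: the bitmap stays tiny regardless of device size, at the cost of scrubbing a whole region for each dirty bit, and the reclaim step keeps the number of dirty bits low between flushes.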



Furthermore, some of the previous ideas, especially using a btrfs btree
for the write-intent bitmap, are not feasible.

The problem here is the lifespan.
Previously I thought the write-intent tree could have the same lifespan as
the log tree.

But that's not true. The problem is that data writeback can cross
transaction boundaries.

This means the proposed write-intent tree needs to survive multiple
transactions, which is different from the log tree, and thus it needs
metadata extent items.

The requirement for metadata extent items can get pretty messy.

Thus I'd say the write-intent bitmap should really go the mdraid way:
reserve a small space (in the 1MiB~2MiB range for each device), then
update it with the same timing/behavior as md-bitmap.c.

Thanks,
Qu
>
> Suppose we have a 16GB server and track the blocks using extents
> (64bit pointer + 16bit length in 4k units). The worst case is that
> all the pages are not contiguous, so we need 16GB/4k*10 = 41MB to
> track all the pages.
>
> The point is that for each 16GB of data, we need to write a further 41MB,
> which is a negligible quantity, 0.2%; and this factor is constant.
> And in case of a powerloss, you have to scrub only the
> changed blocks (plus the intrinsic amplification factor of raid5/6).
>
>
> On the other side, suppose we have a 16TB disk set and track the
> blocks using a bitmap, where each bit represents 1MB.
> To track all the disks we need 2MB. However:
> 1) if you write 4k, you still need to write 2MB
> 2) the worst case is that you need to update 16GB of 4k pages, where each
> page is 1MB away from the nearest. This means that you need to scrub
> 16GB/4k*1MB = 4TB of disks (plus the intrinsic amplification factor
> of raid5/6).
>
> If we reduce the unit of the page, we reduce the amplification factor
> for the scrub, but we increase the size of the bitmap.
> For example if each bit tracks a 4k page, we have a bitmap of
> 4Gbit (512MB) for a 16TB filesystem. And
> 1.bis) if you write 4k, you will still need to write 512MB of intent.
> 2.bis) on the other side in case of powerloss, you have to scrub only the
> impacted disk pages (plus the intrinsic amplification factor
> of raid5/6).
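
The arithmetic in the quoted comparison can be reproduced with a few lines of Python. This is just a calculator sketch using binary units; the function names are made up for illustration, and the 10-byte entry is the quoted 64-bit pointer plus 16-bit length. Note that with one bit per 4KiB page, a 16TiB bitmap works out to 4Gibit, i.e. 512MiB.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40

def extent_log_bytes(dirty_bytes, page=4 * KIB, entry=10):
    """Worst case: every dirty page needs its own (pointer, length) entry."""
    return dirty_bytes // page * entry

def bitmap_bytes(tracked_bytes, bytes_per_bit):
    """One bit per region of bytes_per_bit bytes."""
    return tracked_bytes // bytes_per_bit // 8

def worst_case_scrub_bytes(dirty_bytes, bytes_per_bit, page=4 * KIB):
    """Worst case: each dirty page lands in a different bitmap region."""
    return dirty_bytes // page * bytes_per_bit

extent_log = extent_log_bytes(16 * GIB)                    # 40 MiB (~41 MB)
coarse_map = bitmap_bytes(16 * TIB, 1 * MIB)               # 2 MiB
coarse_scrub = worst_case_scrub_bytes(16 * GIB, 1 * MIB)   # 4 TiB
fine_map = bitmap_bytes(16 * TIB, 4 * KIB)                 # 512 MiB
```

The numbers show the trade-off being argued: a coarse bitmap is small but can force a very large scrub, while a fine bitmap scrubs little but is itself large.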
>
>>
>> 4TiB used space needs already 128MiB if we really go straight bitmap for
>> them.
>> Embedding them all on a per-device basis is completely possible, but
>> implementing it is much more complex.
>>
>> 128MiB is not that large, so in theory we're fine to keep an in-memory
>> bitmap.
>> But what would happen if we go 32TiB? Then 1GiB in-memory bitmap is
>> needed, which is not really acceptable anymore.
>>
>> When we start to choose what part is really needed in the large bitmap
>> pool, then Btree starts to make sense. We can store a super large bitmap
>> using bitmap and extent based entries pretty easily, just like free
>> space cache tree.
>>
>>>
>>> Moreover, the handling of raid5/6 is a layer below the btree.
>>
>> While CSUM is also a layer below, but we still put it into CSUM tree.
>>
>> The handling of write-intent bitmap/tree is indeed a layer lower.
>> But traditional DM lacks the awareness of the upper layer fs, thus has a
>> lot of problems like unable to detect bit rot in RAID1 for example.
>>
>> Yes, we care about layer separation, but more at the code level.
>> For functionality, layer separation is not that big a deal anymore.
>>
>>> I think that
>>> updating the write-intent btree would be a performance bottleneck. I am
>>> quite sure
>>> that the write intent likely requires less than one metadata page (16K
>>> today);
>>> however to store this page you need to update the metadata page
>>> tracking...
>>
>> We already have the existing log tree code doing similar (but still
>> quite different purpose) things, and it's used to speed up fsync.
>>
>> Furthermore, DM layer bitmap is not a straight bitmap of all sectors
>> either, and for performance it's almost negligible for sequential RW.
>>
>> I don't think Btree handling would be a performance bottleneck, as
>> NODATACOW for data doesn't improve much performance other than the
>> implied NODATASUM.
>>
>>>
>>>>>
>>>>> This for two main reasons:
>>>>> 1) in future BTRFS may get the ability of allocating this block group
>>>>> in a
>>>>> dedicate disks set. I see two main cases:
>>>>> a) in case of raid6, we can store the intent bitmap (or the journal)
>>>>> in a
>>>>> raid1C3 BG allocated in the faster disks. The cons is that each block
>>>>> has to be
>>>>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>>>>> some hdd,
>>>>> you got a noticeable gain of performance)
>>>>
>>>> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
>>>> missing disks.
>>>>
>>>> In fact, the chance to tolerate two missing devices for 4 disks RAID10
>>>> is:
>>>>
>>>> 4 / 6 = 66.7%
>>>>
>>>> 4 is the total valid combinations, no order involved, including:
>>>> (1, 3), (1, 4), (2, 3) (2, 4).
>>>> (Or 4C2 - 2)
>>>>
>>>> 6 is the 4C2.
>>>>
>>>> So there's really no need to go RAID1C3 unless you really want
>>>> guaranteed 2-disk tolerance.
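
The 4 / 6 figure can be checked by brute force. The sketch below assumes a fixed layout where disks 1+2 mirror each other and disks 3+4 mirror each other; that pairing and the names MIRROR_PAIRS and survives are illustrative assumptions, not how btrfs actually places RAID10 chunks.

```python
from itertools import combinations

# Assumed layout: disks 1+2 form one mirror pair, disks 3+4 the other.
MIRROR_PAIRS = [{1, 2}, {3, 4}]
DISKS = [1, 2, 3, 4]

def survives(lost):
    """The array survives if no mirror pair has lost both of its members."""
    return all(not pair <= set(lost) for pair in MIRROR_PAIRS)

two_disk_losses = list(combinations(DISKS, 2))      # all 4C2 = 6 cases
survivable = [lost for lost in two_disk_losses if survives(lost)]
ratio = len(survivable) / len(two_disk_losses)
```

Under that assumption the survivable cases are exactly (1, 3), (1, 4), (2, 3) and (2, 4), and the two fatal cases are the ones that take out a whole mirror pair.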
>>>
>>> I don't get the point: I started talking about raid6. The raid6 is two
>>> failures proof (you need three failure to see the problem... in theory).
>>>
>>> If P is the probability of a disk failure (with P << 1), the
>>> likelihood of
>>> a RAID6 failure is O(P^3). The same is RAID1C3.
>>>
>>> Instead RAID10 failure likelihood is only a bit lesser than two disk
>>> failure:
>>> RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).
>>>
>>> Because P is << 1 then  P^3 << 0.66 * P^2.
>>
>> My point here is, although RAID10 is not guaranteed to survive losing
>> 2 disks, it still has a high enough chance to survive.
>>
>> While RAID10 only has two copies of data, instead of 3 as in RAID1C3,
>> such cost saving can be attractive for a lot of users.
>>
>> Thanks,
>> Qu
>>
>>>>
>>>>> b) another option is to spread the intent bitmap (or the journal) in
>>>>> *all* disks,
>>>>> where each disk contains only the related data (if we update only
>>>>> disk #1
>>>>> and disk #2, we have to update only the intent bitmap (or the
>>>>> journal) in
>>>>> disk #1 and  disk #2)
>>>>
>>>> That's my initial per-device reservation method.
>>>>
>>>> But for write-intent tree, I tend to not go that way, but with a
>>>> RO-compatible flag instead, as it's much simpler and more back
>>>> compatible.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>>
>>>>> 2) having a dedicate bg for the intent bitmap (or the journal), has
>>>>> another big
>>>>> advantage: you don't need to change the meaning of the raid5/6 bg.
>>>>> This
>>>>> means
>>>>> that an older kernel can read/write a raid5/6 filesystem: it
>>>>> sufficient
>>>>> to ignore
>>>>> the intent bitmap (or the journal)
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Furthermore, this even allows us to go something like bitmap tree,
>>>>>> for
>>>>>> such write-intent bitmap.
>>>>>> And as long as the user is not using RAID56 for metadata (maybe even
>>>>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>>>>> most write-hole (for metadata and CoW data only though, nocow data is
>>>>>> still affected).
>>>>>>
>>>>>> Thus I believe this can be a valid path to explore, and even have a
>>>>>> higher priority than full journal.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
