From: Goffredo Baroncelli <kreijack@inwind.it>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
Paul Jones <paul@pauljones.id.au>,
Wang Yugui <wangyugui@e16-tech.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Tue, 7 Jun 2022 19:36:30 +0200
Message-ID: <f5bf7ecb-8cb1-4da1-6052-a2968d4dc6b1@inwind.it>
In-Reply-To: <2575376b-fbd9-8406-3684-7fbc3899ddf3@gmx.com>
On 07/06/2022 03.27, Qu Wenruo wrote:
>
>
> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
[...]
>>
>> But with a battery backup (i.e. no power failure), the likelihood of b)
>> became
>> negligible.
>>
>> This to say that a write intent bitmap will provide an huge
>> improvement of the resilience of a btrfs raid5, and in turn raid6.
>>
>> My only suggestions, is to find a way to store the bitmap intent not in the
>> raid5/6 block group, but in a separate block group, with the appropriate
>> level
>> of redundancy.
>
> That's why I want to reject RAID56 as metadata, and just store the
> write-intent tree into the metadata, like what we did for fsync (log tree).
>
My suggestion was not to use the btrfs metadata to store the "write-intent", but
to track the space used by the write-intent storage area with a dedicated bg. The
write intent can then be handled without a btrfs btree, e.g. by simply writing
a bitmap of the in-flight blocks, or a list of [start, length] pairs....

I really like the idea of storing the write intent in a btree; I find it very
elegant. However, I don't think it is convenient.

The write-intent on-disk format is not performance critical: you never need to
seek inside it, and it is small. You only need to read it (entirely) after a
power failure, and even then the biggest cost is scrubbing the last updated
blocks. So a btree is not needed.

Moreover, the handling of raid5/6 is a layer below the btrees, and I think that
updating a write-intent btree would become a performance bottleneck. The write
intent itself likely requires less than one metadata page (16K today);
however, to store that page you also need to update the metadata tracking it...
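To make the idea concrete, here is a minimal sketch of what such a non-btree
write-intent area could look like. Everything here (the magic, the struct and
field names, the extent count) is invented for this sketch and is not an actual
btrfs on-disk structure:

```c
/* Hypothetical on-disk layout for a write-intent area: a small fixed
 * header plus a flat list of [start, length] extents, sized to fit in
 * a single 16K page. All names are illustrative, not real btrfs code. */
#include <stdint.h>

#define WI_MAGIC       0x57494249u  /* "WIBI", made up for this sketch */
#define WI_MAX_EXTENTS 1022         /* 16 + 1022 * 16 = 16368 <= 16K   */

struct wi_extent {
	uint64_t start;   /* logical byte offset of the in-flight range */
	uint64_t length;  /* length of the range in bytes */
};

struct wi_super {
	uint32_t magic;
	uint32_t nr_extents;
	uint64_t generation;  /* bumped on every update; highest wins on replay */
	struct wi_extent extents[WI_MAX_EXTENTS];
};

/* Record a range as dirty before issuing the RAID56 writes for it. */
static int wi_mark_dirty(struct wi_super *ws, uint64_t start, uint64_t len)
{
	if (ws->nr_extents >= WI_MAX_EXTENTS)
		return -1;  /* area full: caller must flush or merge first */
	ws->extents[ws->nr_extents].start = start;
	ws->extents[ws->nr_extents].length = len;
	ws->nr_extents++;
	ws->generation++;
	return 0;
}
```

On power-loss recovery one would read this single page, scrub the listed
ranges, and clear it; no tree walk is involved.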
>>
>> This for two main reasons:
>> 1) in future BTRFS may get the ability of allocating this block group in a
>> dedicate disks set. I see two main cases:
>> a) in case of raid6, we can store the intent bitmap (or the journal) in a
>> raid1C3 BG allocated in the faster disks. The cons is that each block
>> has to be
>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>> some hdd,
>> you got a noticeable gain of performance)
>
> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
> missing disks.
>
> In fact, the chance to tolerate two missing devices for 4 disks RAID10 is:
>
> 4 / 6 = 66.7%
>
> 4 is the total valid combinations, no order involved, including:
> (1, 3), (1, 4), (2, 3) (2, 4).
> (Or 4C2 - 2)
>
> 6 is the 4C2.
>
> So really no need to go RAID1C3 unless you're really want to ensured 2
> disks tolerance.
I don't get the point: I started talking about raid6. RAID6 tolerates two
failures (you need three failures to see a problem... in theory).

If P is the probability of a disk failure (with P << 1), the likelihood of
a RAID6 failure is O(P^3). The same holds for RAID1C3.

Instead, a RAID10 failure needs only two disk failures: with 4 disks, 2 of
the 6 two-disk combinations kill the array, so the failure likelihood is
~0.33 * P^2 = O(P^2). Because P << 1, P^3 << 0.33 * P^2.
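As a sanity check on the 4-out-of-6 figure quoted above, a small standalone C
snippet (userspace, not kernel code) can enumerate the 4C2 = 6 two-disk failure
combinations of a 4-disk RAID10, assuming mirror pairs {0,1} and {2,3}:

```c
/* Count how many two-disk failure combinations a 4-disk RAID10 survives.
 * Assumed layout: disks 0/1 form one mirror pair, disks 2/3 the other. */

/* The array dies only if both members of the same mirror pair fail. */
static int raid10_survives(int a, int b)
{
	if ((a == 0 && b == 1) || (a == 2 && b == 3))
		return 0;
	return 1;
}

/* Enumerate all unordered pairs of failed disks; return survivable count. */
static int count_survivable(int *total)
{
	int a, b, ok = 0;

	*total = 0;
	for (a = 0; a < 4; a++)
		for (b = a + 1; b < 4; b++) {
			(*total)++;
			ok += raid10_survives(a, b);
		}
	return ok;
}
```

This yields 4 survivable combinations out of 6, i.e. the array dies in 2 of
the 6 cases, hence the ~0.33 * P^2 failure term.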
>
>> b) another option is to spread the intent bitmap (or the journal) in
>> *all* disks,
>> where each disks contains only the the related data (if we update only
>> disk #1
>> and disk #2, we have to update only the intent bitmap (or the journal) in
>> disk #1 and disk #2)
>
> That's my initial per-device reservation method.
>
> But for write-intent tree, I tend to not go that way, but with a
> RO-compatible flag instead, as it's much simpler and more back compatible.
>
> Thanks,
> Qu
>>
>>
>> 2) having a dedicate bg for the intent bitmap (or the journal), has
>> another big
>> advantage: you don't need to change the meaning of the raid5/6 bg. This
>> means
>> that an older kernel can read/write a raid5/6 filesystem: it sufficient
>> to ignore
>> the intent bitmap (or the journal)
>>
>>
>>
>>>
>>> Furthermore, this even allows us to go something like bitmap tree, for
>>> such write-intent bitmap.
>>> And as long as the user is not using RAID56 for metadata (maybe even
>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>> most write-hole (for metadata and CoW data only though, nocow data is
>>> still affected).
>>>
>>> Thus I believe this can be a valid path to explore, and even have a
>>> higher priority than full journal.
>>>
>>> Thanks,
>>> Qu
>>>
>>
>>
>>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5